JP2019079227A

JP2019079227A - State transition rule acquisition device, action selection learning device, action selection device, state transition rule acquisition method, action selection method, and program

Info

Publication number: JP2019079227A
Application number: JP2017205050A
Authority: JP
Inventors: 鈴木 潤 (Jun Suzuki); 鶴岡 慶雅 (Yoshimasa Tsuruoka)
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2017-10-24
Filing date: 2017-10-24
Publication date: 2019-05-23

Abstract

To enable acquisition of states and state transition rules for selecting an action, even in an environment in which the states and state transition rules are not known. SOLUTION: A state acquisition part 210 acquires the state of the environment after a selected action is performed. A reward calculation part 220 calculates, on the basis of the acquired state and the selected action, the reward for performing the action in the state. A parameter update part 240 updates, on the basis of the selected action and the reward, a parameter of a model that takes the state as input and selects an action. An action selection part 270 selects an action by using the model with the post-action state as input. The acquisition, calculation, update, and selection are repeated until a repetition end condition is satisfied. The state acquisition part 210 compares the acquired state with the set of states already acquired, adds the acquired state to the set if it is a new state, and acquires a state transition rule on the basis of the set of states. SELECTED DRAWING: Figure 2

Description

The present invention relates to a state transition rule acquisition device, an action selection learning device, an action selection device, a state transition rule acquisition method, an action selection method, and a program, and in particular to such devices, methods, and a program for selecting an action in a state.

Decision support systems, which assist human decision making, include a wide range of practical systems in fields such as medical diagnosis, bank loan decisions, and corporate management decisions. Strengthening such decision support systems is considered one of the important challenges of artificial intelligence research, which has advanced rapidly in recent years.

There is no standard architecture for a decision support system. As one methodology, this description treats each situation as a state and considers a state transition model in which taking an action moves the current state to the next state. The system is assumed to follow the strategy of presenting the action to take in the current state so that the final state with the largest expected gain can be reached. In other words, the decision support system considered here automatically selects and presents the actions needed to reach the best final state from the current state.

In general, the difficulty of the problem faced by a decision support system depends on the number of selectable actions and on the total number of possible states. If all the information needed to select an action can be obtained as information about the current state, if the state transition in response to an action is deterministic (there are no uncertain elements), and if unlimited computation time is allowed, then the system is likely to be able to compute the best action in the current state automatically.

In reality, however, a situation in which all information about the current state can be obtained almost never arises. Moreover, the state transition in response to an action is not necessarily unique; in most cases the next state is determined probabilistically. The system is therefore usually forced to select the best action while taking various uncertain elements into account.

With the recent advance of artificial intelligence technology, methodologies have been devised that can build agents stronger than human professionals in a closed environment such as poker, even though the information is incomplete and uncertain: the opponent's hand is unknown when a player decides on an action (move), and the opponent's response to that action is uncertain (see, for example, Non-Patent Document 1). Here, a "closed environment" means that the kinds of states that can occur in the environment neither increase nor decrease over time; they are fixed and enumerable. Concretely, the method searches under uncertain, incomplete information while hypothesizing transitions from the current state to future states, and uses a strategy of taking the action that reaches the state with the fewest failures (the least regret). In such a closed environment with a relatively small number of states, it is becoming possible to automatically select actions that yield the best or a near-best result, even when the information needed to decide on an action is incomplete and uncertain.

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1729-1736. 2008.

In Non-Patent Document 1, the system is built on the assumption that the environment (the rules of the game), the states, and the state transition rules are given and known in advance. When the states and state transition rules are known, they can be used to simulate win rates and gains, and a very large number of simulation runs yields guidance on which action to take in every state.

In real-world problems, however, it is difficult to obtain the definition of the states and the transition rules that describe how the state changes when a given action is taken, and writing them all out is close to impossible. This is because, in practical problems, the state transition rules are generally unknown and the number of kinds of states is enormous. Since the state transition rules and state definitions cannot be supplied to the system in advance, methodologies such as that of Non-Patent Document 1, which are currently promising in closed environments, cannot be applied to practical systems.

The present invention has been made in view of this problem, and an object thereof is to provide a state transition rule acquisition device, an action selection learning device, a state transition rule acquisition method, and a program capable of acquiring states and state transition rules for selecting an action, even in an environment in which the states and state transition rules are unknown.

Another object of the present invention is to provide an action selection device, an action selection method, and a program capable of selecting an appropriate action even in an environment in which the states and state transition rules are unknown.

A state transition rule acquisition device according to the present invention comprises: a state acquisition unit that acquires the state after an action when a selected action is performed; a reward calculation unit that calculates, based on the acquired state and the selected action, the reward for performing the action in the state; a parameter update unit that updates, based on the selected action and the reward calculated by the reward calculation unit, the parameters of a model that takes the state as input and selects an action; an action selection unit that selects an action using the model with the post-action state as input; and an end determination unit that causes the acquisition by the state acquisition unit, the calculation by the reward calculation unit, the update by the parameter update unit, and the selection by the action selection unit to be repeated until a predetermined iteration end condition is satisfied. The state acquisition unit compares the acquired state with the set of already acquired states, adds the acquired state to the set if it is a new state, and acquires state transition rules based on the set of states.

A state transition rule acquisition method according to the present invention comprises: a step in which a state acquisition unit acquires the state of the environment after an action when a selected action is performed; a step in which a reward calculation unit calculates, based on the acquired state and the selected action, the reward for performing the action in the state; a step in which a parameter update unit updates, based on the selected action and the reward calculated by the reward calculation unit, the parameters of a model that takes the state as input and selects an action; a step in which an action selection unit selects an action using the model with the post-action state as input; and a step in which an end determination unit causes the acquisition by the state acquisition unit, the calculation by the reward calculation unit, the update by the parameter update unit, and the selection by the action selection unit to be repeated until a predetermined iteration end condition is satisfied. In the step performed by the state acquisition unit, the acquired state is compared with the set of already acquired states and added to the set if it is a new state, and state transition rules are acquired based on the set of states.

According to the state transition rule acquisition device and the state transition rule acquisition method of the present invention, the state acquisition unit acquires the state of the environment after the action when the input action is performed; the reward calculation unit calculates, based on the acquired state and the selected action, the reward for performing the action in the state; and the parameter update unit updates, based on the selected action and the reward calculated by the reward calculation unit, the parameters of the model that takes the state as input and selects an action.

The action selection unit then selects an action using the model with the post-action state as input, and the end determination unit causes the acquisition by the state acquisition unit, the calculation by the reward calculation unit, the update by the parameter update unit, and the selection by the action selection unit to be repeated until a predetermined iteration end condition is satisfied. The state acquisition unit compares the acquired state with the set of already acquired states, adds the acquired state to the set if it is a new state, and acquires state transition rules based on the set of states.

In this way, the following are repeated until a predetermined iteration end condition is satisfied: the state of the environment after the input action is performed is acquired; the reward for performing the action in the state is calculated based on the acquired state and the selected action; the parameters of the model that takes the state as input and selects an action are updated based on the selected action and the calculated reward; and an action is selected using the model with the post-action state as input. The acquired state is compared with the set of already acquired states and added to the set if it is new, and state transition rules are acquired based on the set of states. As a result, states and state transition rules for selecting an action can be acquired even in an environment in which the states and state transition rules are unknown.

An action selection learning device according to the present invention comprises: the state transition rule acquisition device described above; an action selection policy acquisition unit that learns, for each pair of a state and an action, the expected gain of performing the action in the state, based on the set of states and the state transition rules obtained by the state transition rule acquisition device; and a convergence determination unit that causes the processing by the state transition rule acquisition device and the learning by the action selection policy acquisition unit to be repeated until the learning in the action selection policy acquisition unit converges.

In this way, the action selection policy acquisition unit learns, for each pair of a state and an action, the expected gain of performing the action in the state, based on the set of states and the state transition rules obtained by the state transition rule acquisition device, and the processing by the state transition rule acquisition device and the learning by the action selection policy acquisition unit are repeated until the learning converges. As a result, states and state transition rules for selecting an action can be acquired even in an environment in which the states and state transition rules are unknown.

In the state transition rule acquisition device according to the present invention, the model may be a multilayer neural network that takes the post-action state as input and selects an action.

In the action selection learning device according to the present invention, the reward calculation unit may calculate the reward for performing the action in the state based on the number of visits to the state, the number of times the action has been selected in the state, and the expected gain of performing the action in the state.

An action selection device according to the present invention comprises an action selection unit that selects, for an input state, the action that maximizes the expected gain, based on a previously learned expected gain of performing an action in a state of the environment. The expected gain is learned in advance by alternately repeating (i) a procedure, repeated until a predetermined iteration end condition is satisfied, in which the state of the environment after a selected action is performed is acquired, the reward for performing the action in the state is calculated based on the acquired state and the selected action, the parameters of a model that takes the state as input and selects an action are updated based on the selected action and the calculated reward, and an action is selected using the model with the post-action state as input, and (ii) learning, for each pair of a state and an action, the expected gain of performing the action in the state, based on the set of states and the state transition rules.

An action selection method according to the present invention comprises a step in which an action selection unit selects, for an input state, the action that maximizes the expected gain, based on a previously learned expected gain of performing an action in a state of the environment. The expected gain is learned in advance by alternately repeating (i) a procedure, repeated until a predetermined iteration end condition is satisfied, in which the state of the environment after a selected action is performed is acquired, the reward for performing the action in the state is calculated based on the acquired state and the selected action, the parameters of a model that takes the state as input and selects an action are updated based on the selected action and the calculated reward, and an action is selected using the model with the post-action state as input, and (ii) learning, for each pair of a state and an action, the expected gain of performing the action in the state, based on the set of states and the state transition rules.

According to the action selection device and the action selection method of the present invention, the action that maximizes the expected gain is selected for an input state, based on a previously learned expected gain of performing an action in a state of the environment.

The expected gain is learned in advance by alternately repeating two procedures. In the first, which is repeated until a predetermined iteration end condition is satisfied, the state of the environment after a selected action is performed is acquired, the reward for performing the action in the state is calculated based on the acquired state and the selected action, the parameters of a model that takes the state as input and selects an action are updated based on the selected action and the calculated reward, and an action is selected using the model with the post-action state as input. In the second, the expected gain of performing an action in a state is learned for each pair of a state and an action, based on the set of states and the state transition rules.

In this way, the action that maximizes the expected gain is selected for the input state, based on an expected gain learned in advance by alternating the exploration procedure (acquiring the post-action state of the environment, calculating the reward, updating the model parameters, and selecting the next action until the iteration end condition is satisfied) with the learning of the expected gain of each state-action pair from the set of states and the state transition rules. As a result, an appropriate action can be selected even in an environment in which the states and state transition rules are unknown.

A program according to the present invention is a program for causing a computer to function as each unit of the state transition rule acquisition device, the action selection learning device, or the action selection device described above.

According to the action selection learning device, the action selection learning method, and the program of the present invention, states and state transition rules for selecting an action can be acquired even in an environment in which the states and state transition rules are unknown.

Further, according to the action selection device, the action selection method, and the program of the present invention, an appropriate action can be selected even in an environment in which the states and state transition rules are unknown.

A block diagram showing the configuration of the action selection learning device according to the embodiment of the present invention.
A block diagram showing the configuration of the state transition rule acquisition unit according to the embodiment of the present invention.
A flowchart showing the action selection learning processing routine of the action selection learning device according to the embodiment of the present invention.
A flowchart showing the processing routine for acquiring states and state transition rules according to the embodiment of the present invention.
A block diagram showing the configuration of the action selection device according to the embodiment of the present invention.
A flowchart showing the action selection processing routine of the action selection device according to the embodiment of the present invention.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

<Principle of the embodiment of the present invention>
First, the principle of the embodiment of the present invention will be described.

The embodiment of the present invention assumes a simulator that virtually realizes the real environment. When the decision support system selects an action, the simulator presents what happens in that environment as a result. The simulator presents this information based on past cases and the like, and at the same time presents the gain of the resulting state.

First, the decision support system performs a process of acquiring states and state transition rules (Process 1). The aim here is to acquire the states that can occur in the environment and the state transition rules. To this end, the system explores within the simulation to discover new states and state transitions. For this exploration it uses, for example, a deep Q-network (DQN), which applies deep learning to Q-learning, a kind of reinforcement learning.

The agent obtains from the environment a state and the gain for that state, and learns actions that preferentially move it to states expected to yield more gain.

Next, the decision support system performs a process of acquiring an action selection policy (Process 2). Here, it uses the acquired states and the equivalent of state transition rules to compute the gain of the actions in each state.

Specifically, for each pair of a state and an action in that state, the decision support system advances learning so that the largest loss incurred on transitioning to a final state becomes smaller. That is, letting x_i denote a loss incurred on transitioning to a final state, it learns so that the largest of the losses, max(x_i), becomes as small as possible, i.e. min(max(x_i)). Repeating this learning gradually reveals how good each state-action pair is. After a sufficient number of simulation runs, the action to take in each state is obtained.
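The min(max(x_i)) criterion can be illustrated with a small sketch. It only shows the selection rule over losses that have already been sampled, not the learning procedure itself; the function name `minimax_action` and the example actions and loss values are hypothetical.

```python
def minimax_action(final_losses):
    """Pick the action whose worst observed final-state loss is smallest.

    final_losses maps each action to the losses x_i observed when play
    continued to a final state after taking that action (hypothetical data).
    """
    # Worst case per action: max(x_i). Actions without samples are skipped.
    worst = {action: max(xs) for action, xs in final_losses.items() if xs}
    # min over actions of max(x_i)
    return min(worst, key=worst.get)


losses = {"approve": [0.1, 0.9], "reject": [0.4, 0.5]}
choice = minimax_action(losses)
```

Here "approve" has the lower best-case loss, but its worst case (0.9) exceeds that of "reject" (0.5), so the worst-case criterion prefers "reject".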

The decision support system then performs convergence determination for the action selection policy (Process 3). If the policy has converged, processing ends; otherwise it returns to Process 1. In other words, the final action selection strategy is obtained by repeating Process 1 and Process 2.
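The alternation of Processes 1 to 3 can be sketched as a driver loop. This is a minimal outline under assumptions: the three callables and the `max_rounds` safety cap are hypothetical placeholders, since the text does not specify interfaces or an upper bound on the number of rounds.

```python
def learn_action_selection(explore, learn_policy, has_converged, max_rounds=100):
    """Alternate exploration and policy learning until the policy converges.

    explore(states, transitions): Process 1, grows the two sets in place.
    learn_policy(states, transitions): Process 2, returns a policy object.
    has_converged(old, new): Process 3, returns True when learning is done.
    All three callables and max_rounds are illustrative assumptions.
    """
    states, transitions = set(), set()
    policy = {}
    for round_no in range(1, max_rounds + 1):
        explore(states, transitions)                    # Process 1
        new_policy = learn_policy(states, transitions)  # Process 2
        done = has_converged(policy, new_policy)        # Process 3
        policy = new_policy
        if done:
            return policy, round_no
    return policy, max_rounds
```

The sets of states and transitions are carried over between rounds, so each exploration pass starts from what earlier passes discovered.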

<<Process 1. Acquisition of states and state transition rules>>
Let S be the set of states, and let the set of state transitions be T = S × S. S is basically assumed to be the empty set initially. However, when this process is run repeatedly, the final state of the previous run becomes the initial state of the current run. In that case S is no longer empty, but the initial content of S does not affect this process, so any set may be given.
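One way to hold S and the observed transitions is a pair of sets. The sketch below is an assumption about representation only: the class name `StateStore` is hypothetical, states are assumed to be hashable, and only the transitions actually observed (a subset of T = S × S) are recorded.

```python
class StateStore:
    """Holds the state set S and the observed part of T = S x S."""

    def __init__(self, known_states=()):
        # S may start empty or be carried over from a previous run.
        self.states = set(known_states)
        self.transitions = set()

    def observe(self, prev_state, next_state):
        """Record one observed transition; return True if next_state is new."""
        is_new = next_state not in self.states
        self.states.add(next_state)
        if prev_state is not None:
            self.transitions.add((prev_state, next_state))
        return is_new


store = StateStore()
first = store.observe(None, "s0")    # initial state, no predecessor
second = store.observe("s0", "s1")
third = store.observe("s1", "s0")    # s0 is already known
```

Passing a non-empty `known_states` corresponds to the repeated-run case, where the set from the previous run is reused.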

Next, let A be the set of actions, and let a and s be symbols denoting a (time-independent) action and state, respectively; that is, a ∈ A and s ∈ S. The present invention searches for states and state transition rules using the framework of reinforcement learning with a multilayer neural network. The whole multilayer neural network is denoted M_θ, and its parameters are denoted θ.

Let t denote time, and let s_t denote the state at time t. For the set A(s_t) of actions available in the state at time t, each action in A(s_t) is modeled by a probability value using the multilayer neural network. The end-state indicator f_t at time t is a function that returns 0 or 1: it returns 1 if the state is an end state and 0 otherwise.

Then, states and state transitions are acquired according to the following algorithm.
Step 1. (Initialization) Set t = 1 and read the configuration of the multilayer neural network M_θ.
Step 2. Input the action a_{t-1} of the previous time step into the environment; at the first step, a_0 is assumed to be "no action".
Step 3. Acquire from the environment the state s_t resulting from the action, the reward r(s_{t-1}, a_{t-1}) for the action, and the end state f_t.
Step 4. If a new state is found, add that state s_t to S.
Step 5. Update the network parameters θ using the action a_{t-1} and the reward r(s_{t-1}, a_{t-1}) for the action.
Step 6. End determination: if f_t is 1 (end state), terminate; if f_t is 0 (not an end state), continue with the following steps.
Step 7. With the state acquired in Step 3 as input, compute the value of each element of the network according to the network definition constructed in Step 1, and select, as the action at time t, the action with the highest value among the actions a_t ∈ A(s_t).
Step 8. Update the reward r(s_t, a_t) for taking the action a_t in s_t, for example using equation (1) described later.
Step 9. Set t = t + 1 and return to Step 2.
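The steps above can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the specification: `ToyEnv` is a hypothetical five-position environment, `acquire_states` is a hypothetical name, a lookup table `value` stands in for the multilayer neural network M_θ, untried actions receive an optimistic default value, the reward is a pure count-based exploration bonus, and T is recorded as the set of transitions actually observed.

```python
import math


class ToyEnv:
    """Hypothetical environment: positions 0..4 on a line; position 4 is the end state."""

    def __init__(self):
        self.pos = 0

    def step(self, action):
        # action is None ("no action"), -1 (move left) or +1 (move right)
        if action is not None:
            self.pos = max(0, min(4, self.pos + action))
        return self.pos, int(self.pos == 4)  # (s_t, f_t)


def acquire_states(env, actions, max_steps=1000, optimistic=10.0):
    S, T = set(), set()      # discovered states and observed transitions
    n_s, n_sa = {}, {}       # visit counts n(s) and n(s, a)
    value = {}               # tabular stand-in for M_theta (assumption)
    prev_s = prev_a = None   # a_0 is "no action"
    prev_r = 0.0
    for t in range(1, max_steps + 1):           # Step 1 and Step 9
        s, f = env.step(prev_a)                 # Steps 2 and 3
        S.add(s)                                # Step 4 (a set ignores known states)
        if prev_s is not None:
            T.add((prev_s, s))
            value[(prev_s, prev_a)] = prev_r    # Step 5: stand-in for the update of theta
        if f == 1:                              # Step 6
            break
        # Step 7: pick the action with the highest value in the current state
        a = max(actions, key=lambda x: value.get((s, x), optimistic))
        # Step 8: update the counts and the reward for (s_t, a_t)
        n_s[s] = n_s.get(s, 0) + 1
        n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
        prev_r = math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])
        prev_s, prev_a = s, a
    return S, T
```

Calling `acquire_states(ToyEnv(), [-1, 1])` walks the toy environment until the end state is reached and returns the states and transitions seen along the way. A real implementation would replace the table with a trained network and the toy environment with the simulator.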

The reward r(s_t, a_t) for taking the action a_t in a state s_t is given by a function that assigns a high evaluation value when the state s_{t+1} resulting from a_t is unknown and, for a known state, decays as the number of visits increases. This function is used to update the reward r(s_t, a_t) in Step 8 above.

The reward can be defined in any number of ways; as one example, consider computing it with a formula called the upper confidence bound (UCB).

Here, g(s_t, a_t) denotes the expected gain obtained when the selected action a_t is performed in the state s_t at time t, n(s_t) denotes the number of times the state s_t at time t has been visited, and n(s_t, a_t) denotes the number of times the action a_t has been selected in the state s_t at time t. Further, α is a weighting coefficient between the first and second terms.

If UCB-based action selection is performed infinitely many times, the second term of equation (1) asymptotically approaches 0, so the evaluation value follows the value of r(s_t, a_t). Conversely, when the state s_t is visited for the first time, or has hardly been visited, the evaluation value is close to random.

In the intermediate case, the evaluation value is such that, for an action a_t taken in a state s_t, actions that have been taken fewer times in the past are selected preferentially.
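As a rough illustration of a UCB-style reward, the sketch below adds an exploration bonus to the expected gain. The exact form of equation (1) is not reproduced in this text, so the formula used here (expected gain plus α times the square root of ln n(s) / n(s, a)) is an assumed, commonly used UCB form, and the function name `ucb_reward` and the handling of untried actions are hypothetical.

```python
import math


def ucb_reward(g, n_s, n_sa, alpha=1.0):
    """UCB-style reward: expected gain plus an exploration bonus (assumed form).

    g     -- expected gain g(s, a)
    n_s   -- number of visits to state s
    n_sa  -- number of times action a was selected in state s
    alpha -- weight between the gain term and the bonus term
    """
    if n_sa == 0:
        # Untried action: treated as maximally attractive (assumption of this sketch)
        return math.inf
    return g + alpha * math.sqrt(math.log(n_s) / n_sa)
```

With this form, the bonus shrinks as n(s, a) grows relative to ln n(s), so frequently tried actions are eventually ranked by their expected gain alone, while rarely tried actions keep a larger bonus.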

<< Process 2. Acquisition of action selection policy >>
Using the obtained set of states S and the state transition rule T, an action selection policy is acquired with a conventional method such as CFR (Non-Patent Document 1). More specifically, the expected gain g(s_t, a_t) for the action a_t selected in the state s_t at a certain time t is calculated and stored.

<< Process 3. Convergence determination of action selection policy >>
If, for all pairs of s_t and a_t, the difference between the obtained expected gain g(s_t, a_t) and the previous processing result is sufficiently small, the learning process ends. If the difference in expected gain is not sufficiently small, the process returns to Process 1 above.
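The convergence test of Process 3 can be sketched as a comparison of two expected-gain tables. The function name `has_converged`, the dictionary representation keyed by (state, action), the tolerance `tol`, and the choice to treat a missing pair as 0.0 are all assumptions made for illustration; the specification only requires that the difference be "sufficiently small".

```python
def has_converged(g_new, g_old, tol=1e-3):
    """Return True if every (state, action) pair changed by at most tol.

    g_new, g_old -- dicts mapping (state, action) to expected gain g(s, a).
    A pair missing from one table is compared against 0.0 (assumption).
    """
    pairs = set(g_new) | set(g_old)
    return all(abs(g_new.get(p, 0.0) - g_old.get(p, 0.0)) <= tol for p in pairs)
```

When the function returns False, Process 1 would be run again to discover further states before the policy is re-learned.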

<< Action selection >>
Finally, the decision support system selects an action for the state s_t based on g(s, a) obtained by the above learning. More specifically, for a given state s, it presents the action that yields the best g(s, a), that is, argmax_a g(s, a).

Hereinafter, an action selection learning device that automatically acquires an action selection strategy for a given state is described first, followed by an action selection device that actually presents actions using the acquired strategy.

<Configuration of the action selection learning device according to the embodiment of the present invention>
The configuration of the action selection learning device according to the embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the action selection learning device according to the embodiment of the present invention.

The action selection learning device 10 is configured as a computer including a CPU, a RAM, and a ROM storing a program for executing an action selection learning processing routine described later, and is functionally configured as follows.

As shown in FIG. 1, the action selection learning device 10 according to the present embodiment includes a state transition rule acquisition unit 100, a simulation unit 110, a state storage unit 120, an expected gain storage unit 130, an action selection policy acquisition unit 140, a convergence determination unit 150, and an output unit 160.

The state transition rule acquisition unit 100 acquires the state of the environment after an input action has been performed. If the acquired state is new compared with the set of states already acquired, the unit adds it to the set of states and acquires a state transition rule based on the set of states.

Specifically, as shown in FIG. 2, the state transition rule acquisition unit 100 includes an initialization unit 200, a state acquisition unit 210, a reward calculation unit 220, a calculation data storage unit 230, a parameter update unit 240, a model storage unit 250, an end determination unit 260, an action selection unit 270, and an action storage unit 280.

When the processing of the state transition rule acquisition unit 100 is started, or when an instruction to perform the processing of the state transition rule acquisition unit 100 is received from the convergence determination unit 150, the initialization unit 200 initializes the time t to 1 and inputs the action a_{t-1} (a_0: no action) to the state acquisition unit 210.

The state acquisition unit 210 acquires the state of the environment after the selected action has been performed.

Specifically, the state acquisition unit 210 passes the action a_{t-1}, which was input by the initialization unit 200 or selected by the action selection unit 270, to the simulation unit 110. It then acquires the post-action state s_t of the environment and the end state f_t, as calculated by the simulation unit 110 for the case where the action a_{t-1} is performed.

Here, if the environment is a card game such as one played with playing cards, the state s is, for example, the state of the hand or of the cards on the table after an action such as drawing or discarding a card; if the environment is a robot or the like, the state s is, for example, an image captured by a camera mounted on the robot after an action such as movement.

The end state f_t is the termination status of the environment: for a card game, whether the game has been decided (won or lost); for a robot or the like, whether the robot should stop, whether the goal has been achieved, whether it can be judged whether the goal is achievable, and so on.

Then, if the acquired state s_t is new compared with the set S of states already acquired, the state acquisition unit 210 adds the acquired state s_t to the set S of states stored in the state storage unit 120 and acquires the state transition rule T based on the set S of states.

At this time, whether the acquired state is new is determined as follows. If the states are discrete, such as combinations of cards in a hand, the determination is based on whether the set contains a matching state. If the states are continuous, such as images, the state is determined to be new if the similarity between states (for example, image similarity) is equal to or greater than a threshold. Beyond these, because a wide variety of states can exist in an open environment, various criteria based on comparing states with each other can be established.
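For the discrete case, the matching test can be sketched as a set-membership check. The function name `add_if_new` and the canonical form used here (a sorted tuple of card labels, so that card order does not matter) are illustrative assumptions; the continuous, similarity-based case is not covered by this sketch.

```python
def add_if_new(known_states, hand):
    """Add a discrete state (a hand of cards) to the set if no matching state exists.

    known_states -- set of states already acquired (modified in place)
    hand         -- iterable of card labels describing the current state
    Returns True when the state was new and has been added.
    """
    key = tuple(sorted(hand))  # canonical form: order-independent (assumption)
    if key in known_states:
        return False
    known_states.add(key)
    return True
```

A hand seen in a different card order is then recognized as an already acquired state rather than being added a second time.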

The state acquisition unit 210 then passes the acquired state s_t to the reward calculation unit 220.

The reward calculation unit 220 calculates the reward r(s_t, a_t) for performing the action a_t in the acquired state s_t, based on the number of visits n(s_t) to the state s_t at time t and the number of times n(s_t, a_t) the action a_t has been selected in the state s_t, both stored in the calculation data storage unit 230, and on the expected gain g(s_t, a_t) for performing the action a_t in the state s_t, stored in the expected gain storage unit 130.

Specifically, for an action a_t that can be selected in the state s_t, the reward calculation unit 220 calculates the reward r(s_t, a_t) from n(s_t), n(s_t, a_t), and the expected gain g(s_t, a_t) according to equation (1) above. The reward calculation unit 220 performs this calculation for every action a_t that can be selected in the state s_t.

The reward calculation unit 220 then passes the calculated reward r(s_t, a_t) to the parameter update unit 240.

The calculation data storage unit 230 stores the number of visits n(s_t) to s_t at time t and the number of times n(s_t, a_t) the action a_t has been selected in the state s_t at time t.

The parameter update unit 240 updates the parameters of a model that takes the state s_t as input and selects the action a_t, based on the input action a_{t-1} and the reward r(s_{t-1}, a_{t-1}), calculated by the reward calculation unit 220, for having taken the action a_{t-1} in the previous state s_{t-1}. In the present embodiment, the model is the multilayer neural network M_θ, which takes the post-action state as input and selects an action.

Specifically, based on the action a_{t-1} and the reward r(s_{t-1}, a_{t-1}) for that action, the parameter update unit 240 updates the parameters θ of the multilayer neural network M_θ so that actions with high rewards are selected, and stores the updated parameters θ in the model storage unit 250.

The model storage unit 250 stores the parameters θ of the multilayer neural network M_θ. When it receives an update of the parameters θ from the parameter update unit 240, it updates the stored parameters θ.

The end determination unit 260 causes the acquisition by the state acquisition unit 210, the calculation by the reward calculation unit 220, the update by the parameter update unit 240, and the selection by the action selection unit 270 to be repeated until a predetermined iteration end condition is satisfied.

Specifically, the end determination unit 260 determines whether the end state f_t acquired by the state acquisition unit 210 represents an end state. If the end state f_t is 1 (end state), it causes the action selection policy acquisition unit 140 to start processing.

If the end state f_t is 0 (not an end state), it causes the action selection unit 270 to perform processing.

The action selection unit 270 takes as input the state s_t after the action a_{t-1} has been performed and selects the action a_t using the multilayer neural network M_θ.

Specifically, using the set A of actions stored in the action storage unit 280 and the multilayer neural network M_θ stored in the model storage unit 250, the action selection unit 270 calculates the probability value of each action a_t ∈ A(s_t) that can be taken in the state s_t. For example, with s_t as input, it calculates the value of each element of M_θ according to the definition of the multilayer neural network M_θ and selects the action a_t with the highest probability as the action at time t.

The action selection unit 270 then adds 1 to the number of visits n(s_t) to s_t at time t and to the number of times n(s_t, a_t) the action a_t has been selected in the state s_t at time t, both stored in the calculation data storage unit 230, updates the reward r(s_t, a_t), and passes the selected action a_t to the state acquisition unit 210 as the action a_{t-1}.

The action storage unit 280 stores in advance the set A of actions a in the environment.

The simulation unit 110 calculates the environment after the input action a has been performed and returns the post-action environment. The post-action environment includes the state of the environment, the end state of the environment, and the like. The environment used here is one whose states and state transition rules are unknown (an open environment), but it may also be a closed environment.

The state storage unit 120 stores the set S of states s and the state transition rule T obtained by the state transition rule acquisition unit 100. A state determined to be new by the state acquisition unit 210 is added to the set S of states. Until a state is added by the state acquisition unit 210, the set S of states may be empty.

The expected gain storage unit 130 stores the expected gain g learned by the action selection policy acquisition unit 140.

Based on the set S of states and the state transition rule T obtained by the state transition rule acquisition unit 100, the action selection policy acquisition unit 140 learns, for each pair of a state s and an action a, the expected gain g(s, a) for performing the action a in the state s.

Specifically, the action selection policy acquisition unit 140 performs calculations with the simulation unit 110 using the set S of states and the state transition rule T stored in the state storage unit 120, and acquires an action selection policy using a conventional method such as CFR (Non-Patent Document 1). More specifically, it acquires the policy by calculating the expected gain g(s_t, a_t) for every pair of a state s and an action a, that is, for the action a_t selected in the state s_t at a certain time t.

The action selection policy acquisition unit 140 then stores the acquired expected gain g in the expected gain storage unit 130.

The convergence determination unit 150 causes the processing by the state transition rule acquisition unit 100 and the learning by the action selection policy acquisition unit 140 to be repeated until the learning in the action selection policy acquisition unit 140 converges.

Specifically, if the difference between the expected gain g(s_t, a_t) obtained for all pairs of s_t and a_t and the expected gain g(s_{t-1}, a_{t-1}) from the previous processing result is sufficiently small, the convergence determination unit 150 determines that the learning has converged and passes the convergence determination result to the output unit 160.

If the difference is not sufficiently small, it causes the state transition rule acquisition unit 100 to perform its processing again.

The output unit 160 outputs the convergence determination result acquired from the convergence determination unit 150.

<Operation of the action selection learning device according to the embodiment of the present invention>
FIG. 3 is a flowchart showing the action selection learning processing routine of the action selection learning device according to the embodiment of the present invention.

When the action selection learning device is started, the action selection learning processing routine shown in FIG. 3 is executed.

First, in step S100, the state transition rule acquisition unit 100 executes the state and state transition rule acquisition processing routine described later, whereby the set S of states and the state transition rule T are stored in the state storage unit 120.

Next, in step S110, based on the set S of states and the state transition rule T obtained by the state transition rule acquisition unit 100, the action selection policy acquisition unit 140 learns, for each pair of a state s and an action a, the expected gain g(s, a) for performing the action a in the state s.

In step S120, the convergence determination unit 150 determines whether the learning in the action selection policy acquisition unit 140 has converged.

If it is determined that the learning has not converged (NO in step S120), the convergence determination unit 150 returns to step S100 and causes the processing by the state transition rule acquisition unit 100 (step S100) and the learning by the action selection policy acquisition unit 140 (step S110) to be repeated.

On the other hand, if it is determined that the learning has converged (YES in step S120), the output unit 160 outputs the convergence determination result in step S130.

Next, the state and state transition rule acquisition processing routine of step S100 will be described with reference to FIG. 4.

When the processing of the state transition rule acquisition unit 100 is started, or when an instruction to perform the processing of the state transition rule acquisition unit 100 is received from the convergence determination unit 150, the state and state transition rule acquisition processing routine shown in FIG. 4 is executed.

In step S200, the initialization unit 200 initializes the time t to 1 and inputs the action a_{t-1} (a_0: no action) to the state acquisition unit 210.

In step S210, the state acquisition unit 210 passes the action a_{t-1} to the simulation unit 110.

In step S220, the state acquisition unit 210 acquires the state s_t of the environment and the end state f_t, as calculated by the simulation unit 110 for the case where the action a_{t-1} is performed.

In step S230, the state acquisition unit 210 determines whether the acquired state s_t is new compared with the set S of states already acquired.

If the acquired state s_t is new (YES in step S230), then in step S240 the state acquisition unit 210 adds the acquired state s_t to the set S of states stored in the state storage unit 120 and acquires the state transition rule T based on the set S of states.

If the acquired state s_t is not new (NO in step S230), the processing of step S240 is skipped and the routine proceeds to step S250.

In step S250, the parameter update unit 240 acquires the reward r(s_{t-1}, a_{t-1}), previously calculated by the reward calculation unit 220, for having performed the action a_{t-1} in the previous state s_{t-1}.

In step S260, the parameter update unit 240 updates the parameters of the multilayer neural network M_θ based on the input action a_{t-1} and the acquired reward r(s_{t-1}, a_{t-1}).

In step S270, the end determination unit 260 determines whether the end state f_t acquired by the state acquisition unit 210 is 1, which represents an end state.

If the end state f_t is not 1 (NO in step S270), it is determined that the environment is not in an end state, and the routine proceeds to step S280.

In step S280, the action selection unit 270 takes as input the state s_t after the action a_{t-1} has been performed and selects the action a_t using the multilayer neural network M_θ.

In step S290, the action selection unit 270 adds 1 to the number of visits n(s_t) to s_t at time t and to the number of times n(s_t, a_t) the action a_t has been selected in the state s_t at time t, both stored in the calculation data storage unit 230, and updates the reward r(s_t, a_t). The reward r(s_t, a_t) is updated by having the reward calculation unit 220 calculate the reward for performing the action a_t in the acquired state s_t, based on n(s_t) and n(s_t, a_t) stored in the calculation data storage unit 230 and on the expected gain g(s_t, a_t) for performing the action a_t in the state s_t, stored in the expected gain storage unit 130.

In step S300, the action selection unit 270 passes the selected action a_t to the state acquisition unit 210 as the action a_{t-1}, and the routine returns to step S210.

On the other hand, if the end state f_t is 1 (YES in step S270), the end determination unit 260

As described above, the action selection learning device according to the present embodiment repeats the following until a predetermined iteration end condition is satisfied: acquiring the state of the environment after an input action has been performed; calculating, based on the acquired state and the selected action, the reward for performing the action in the state; updating, based on the selected action and the reward calculated by the reward calculation unit, the parameters of a model that takes a state as input and selects an action; and selecting an action using the model with the post-action state as input. If the acquired state is new compared with the set of states already acquired, it is added to the set of states, and a state transition rule is acquired based on the set of states. As a result, even in an environment whose states and state transition rules are unknown, the states and state transition rules for selecting actions can be acquired.

Further, according to the action selection learning device of the present embodiment, the action selection policy acquisition unit learns, for each pair of a state and an action, the expected gain for performing the action in the state, based on the set of states and the state transition rule obtained by the state transition rule acquisition device, and the processing by the state transition rule acquisition device and the learning by the action selection policy acquisition unit are repeated until the learning converges. As a result, even in an environment whose states and state transition rules are unknown, the states and state transition rules for selecting actions can be acquired.

<Configuration of the action selection device according to the embodiment of the present invention>
Next, the action selection device according to the present embodiment will be described. In the present embodiment, the action selection device is described as being mounted on a control target (for example, a robot) that is controlled so as to perform the selected action. In a real environment in which the control target actually moves to a destination, a state obtained by a sensor of the control target (for example, an image captured by a camera mounted on the control target) is input to the action selection device, and the device controls the control target so that it performs the action (for example, turning right or going straight) that maximizes the expected gain for that state (for example, the action needed to reach the destination).

The action selection device according to the present embodiment will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the action selection device according to the embodiment of the present invention.

The action selection device 20 is configured as a computer including a CPU, a RAM, and a ROM storing a program for executing an action selection learning processing routine described later, and is functionally configured as follows.

As shown in FIG. 5, the action selection device 20 according to the present embodiment includes an input unit 300, an action selection unit 310, an expected gain storage unit 320, an output unit 330, and a control unit 340.

入力部３００は、制御対象のセンサによって得られた状態ｓの入力を受け付ける。そして、入力部３００は、受け付けた状態ｓを、行動選択部３１０に渡す。 The input unit 300 receives an input of the state s obtained by the sensor to be controlled. Then, the input unit 300 passes the received state s to the action selection unit 310.

行動選択部３１０は、予め学習された、環境の状態において行動を行った時の期待利得に基づいて、入力された状態ｓに対して、期待利得が最大となる行動ａを選択する。 The action selection unit 310 selects, for the input state s, an action a with which the expected gain is maximum, based on the expected gain obtained when performing an action in the state of the environment, which has been learned in advance.

具体的には、行動選択部３１０は、期待利得記憶部３２０が記憶している学習済みの期待利得ｇ（ｓ，ａ）に基づいて、入力された状態ｓに対して、期待利得ｇ（ｓ，ａ）が最大となる行動ａを選択する。すなわち、ある状態ｓに対して、最良のｇ（ｓ，ａ）となる行動 $\operatorname{argmax}_{a} g(s, a)$ を選択する。そして、行動選択部３１０は、選択した行動ａを出力部３３０に渡す。 Specifically, based on the learned expected gain g(s, a) stored in the expected gain storage unit 320, the action selection unit 310 selects, for the input state s, the action a that maximizes the expected gain g(s, a). That is, for a given state s, it selects the action that yields the best g(s, a), namely $\operatorname{argmax}_{a} g(s, a)$. Then, the action selection unit 310 passes the selected action a to the output unit 330.
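
The selection rule above can be illustrated with a minimal Python sketch. It assumes the learned expected gain is held as a dictionary keyed by (state, action) pairs; that storage layout and the names `select_action` and `expected_gain` are illustrative assumptions, not details given in this specification.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable


def select_action(expected_gain: Dict[Tuple[State, Action], float], s: State) -> Action:
    """Return the action a that maximizes g(s, a) for the given state s.

    `expected_gain` is a hypothetical table layout: {(state, action): gain}.
    """
    # Collect the gains of every action recorded for state s.
    candidates = {a: g for (state, a), g in expected_gain.items() if state == s}
    if not candidates:
        raise KeyError(f"no expected gain stored for state {s!r}")
    # argmax over actions of g(s, a)
    return max(candidates, key=candidates.get)
```

With a table such as `{("s0", "right"): 0.2, ("s0", "straight"): 0.7}`, calling `select_action(table, "s0")` picks the action with the larger stored gain.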

期待利得記憶部３２０は、上述の行動選択学習装置１０によって学習された期待利得を記憶している。 The expected gain storage unit 320 stores the expected gain learned by the action selection learning device 10 described above.

出力部３３０は、ディスプレイ、プリンタ、磁気ディスクなどで実装され、行動選択部３１０が選択した行動ａを出力する。 The output unit 330 is implemented by a display, a printer, a magnetic disk, etc., and outputs the action a selected by the action selection unit 310.

また、出力部３３０は、行動選択部３１０が選択した行動ａを制御部３４０に出力する。 Also, the output unit 330 outputs the action a selected by the action selection unit 310 to the control unit 340.

制御部３４０は、出力部３３０により入力された行動を行うように制御対象の行動を制御する。例えば、右に曲がる、直進する等の行動をロボットに対して命令する。 The control unit 340 controls the controlled object so that it performs the action received from the output unit 330. For example, it commands a robot to perform actions such as turning right or going straight.

＜本発明の実施の形態に係る行動選択装置の作用＞ <Operation of the Action Selection Device According to the Embodiment of the Present Invention>
図６は、本発明の実施の形態に係る行動選択装置の行動選択処理ルーチンを示すフローチャートである。 FIG. 6 is a flowchart showing an action selection processing routine of the action selection device according to the embodiment of the present invention.

入力部３００に、制御対象のセンサによって得られた状態が入力されると、図６に示す行動選択処理ルーチンが実行される。 When a state obtained by a sensor of the controlled object is input to the input unit 300, the action selection processing routine shown in FIG. 6 is executed.

まず、ステップＳ４００において、入力部３００は、制御対象のセンサによって得られた状態ｓの入力を受け付ける。 First, in step S400, the input unit 300 receives an input of the state s obtained by a sensor of the controlled object.

次に、ステップＳ４１０において、入力された状態ｓに基づいて、終了状態であるか否かを判定する。終了状態である場合（ステップＳ４１０のＹＥＳ）には、行動選択処理ルーチンを終了する。一方、終了状態で無い場合（ステップＳ４１０のＮＯ）には、ステップＳ４２０へ進む。 Next, in step S410, based on the input state s, it is determined whether it is an end state. If it is in the end state (YES in step S410), the action selection processing routine is ended. On the other hand, if it is not in the end state (NO in step S410), the process proceeds to step S420.

ステップＳ４２０において、行動選択部３１０は、期待利得記憶部３２０に記憶されている期待利得を読み込む。 In step S420, the action selection unit 310 reads the expected gain stored in the expected gain storage unit 320.

ステップＳ４３０において、上記ステップＳ４２０で読み込んだ期待利得に基づいて、上記ステップＳ４００で取得した状態ｓに対して、期待利得が最大となる行動ａを選択する。 In step S430, based on the expected gain read in step S420, the action a with the largest expected gain is selected with respect to the state s acquired in step S400.

ステップＳ４４０において、出力部３３０は、行動選択部３１０が選択した行動ａを出力する。また、出力部３３０は、行動選択部３１０が選択した行動ａを制御部３４０に出力し、上記ステップＳ４００へ戻る。 In step S440, the output unit 330 outputs the action a selected by the action selection unit 310. Further, the output unit 330 outputs the action a selected by the action selection unit 310 to the control unit 340, and returns to step S400.

この後、制御部３４０により制御対象に対して行動の制御が行われ、制御対象のセンサによって状態が取得されることにより、行動選択処理ルーチンが繰り返される。 Thereafter, the control unit 340 controls the action of the controlled object, and a new state is acquired by the sensor of the controlled object, whereby the action selection processing routine is repeated.
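
The routine of steps S400 to S440 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the callables `read_state`, `is_end_state`, and `control`, and the dictionary layout of `expected_gain`, are hypothetical stand-ins for the sensor input, the end-state check, the control unit 340, and the expected gain storage unit 320. The specification does not prescribe these interfaces.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable


def run_action_selection(
    read_state: Callable[[], State],          # hypothetical sensor or simulator read
    is_end_state: Callable[[State], bool],    # hypothetical end-state test
    expected_gain: Dict[Tuple[State, Action], float],
    actions: Sequence[Action],
    control: Callable[[Action], None],        # hypothetical command to the controlled object
) -> List[Tuple[State, Action]]:
    """Repeat S400 to S440 until an end state is observed; return the (state, action) log."""
    log: List[Tuple[State, Action]] = []
    while True:
        s = read_state()                      # S400: receive state s
        if is_end_state(s):                   # S410: stop on an end state
            return log
        g = expected_gain                     # S420: read the stored expected gain
        # S430: choose the action with the largest g(s, a); unseen pairs rank lowest
        a = max(actions, key=lambda act: g.get((s, act), float("-inf")))
        log.append((s, a))                    # S440: output the selected action
        control(a)                            # the controlled object acts, changing the next state
```

Passing the state source as a callable means the same loop works whether the state comes from a real sensor or from a simulated environment.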

以上説明したように、行動選択装置によれば、選択された行動を行った時の行動後の環境の状態を獲得し、獲得した状態と選択された行動とに基づいて、状態において行動を行った際の報酬を計算し、選択された行動と計算された報酬とに基づいて、状態を入力とし、行動を選択するためのモデルのパラメタを更新し、行動後の状態を入力とし、モデルを用いて、行動を選択することを予め定められた反復終了条件を満たすまで繰り返すことと、状態の集合及び状態遷移規則に基づいて、状態と行動との各ペアに対して、状態において行動を行った時の期待利得を学習することと、を交互に繰り返すことにより予め学習された、環境の状態において行動を行った時の期待利得に基づいて、入力された状態に対して、期待利得が最大となる行動を選択することにより、状態や状態遷移規則が不明な環境であっても、適切な行動を選択することができる。 As described above, the action selection device selects, for an input state, the action with the maximum expected gain, based on expected gains for performing actions in states of the environment that have been learned in advance. These expected gains are learned by alternately repeating two processes. The first process repeats the following until a predetermined iteration end condition is satisfied: acquiring the state of the environment after a selected action is performed; calculating, based on the acquired state and the selected action, the reward for performing the action in the state; updating, based on the selected action and the calculated reward, the parameters of a model that takes a state as input and selects an action; and selecting an action using the model with the post-action state as input. The second process learns, for each pair of a state and an action, the expected gain of performing the action in the state, based on the set of states and the state transition rule. As a result, an appropriate action can be selected even in an environment in which the states and the state transition rule are unknown.
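
As a rough illustration of the second process (learning an expected gain for each state-action pair from the set of states and the state transition rule), the following Python sketch uses a standard discounted value-iteration update. This passage does not state the update rule, so the update itself, the deterministic transition table, the reward table, and the parameters `gamma` and `sweeps` are all assumptions made for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

State = Hashable
Action = Hashable


def learn_expected_gain(
    states: Sequence[State],
    actions: Sequence[Action],
    transition: Dict[Tuple[State, Action], State],  # assumed deterministic rule: (s, a) -> s'
    reward: Dict[Tuple[State, Action], float],      # assumed reward table; missing pairs count as 0
    gamma: float = 0.9,                             # assumed discount factor
    sweeps: int = 100,                              # assumed fixed number of sweeps
) -> Dict[Tuple[State, Action], float]:
    """Estimate g(s, a) for every state-action pair that has a known transition."""
    g = {(s, a): 0.0 for s in states for a in actions if (s, a) in transition}
    for _ in range(sweeps):
        for (s, a) in g:
            s_next = transition[(s, a)]
            # Best gain obtainable from the successor state (0 if it has no known actions).
            future = max((g[(s_next, b)] for b in actions if (s_next, b) in g), default=0.0)
            g[(s, a)] = reward.get((s, a), 0.0) + gamma * future
    return g
```

A table produced this way has the same (state, action) keyed shape that a greedy selection step can consume directly.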

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the scope of the present invention.

本実施形態において、行動選択装置は、制御対象のセンサによって取得された実環境の状態を入力としたが、シミュレータにより計算した環境の状態を入力としてもよい。例えば、環境がトランプ等のカードゲームであれば、シミュレータにより計算された札を取得・捨てる等の行動後における手札の状態や場に出された札の状態等を入力とするように構成してもよい。 In the present embodiment, the action selection device takes as input the state of the real environment acquired by a sensor of the controlled object, but it may instead take as input a state of the environment calculated by a simulator. For example, if the environment is a card game such as one played with playing cards, the device may be configured to take as input the state of the hand, the state of the cards played to the table, and the like, as calculated by the simulator after an action such as drawing or discarding a card.

また、本願明細書中において、プログラムが予めインストールされている実施形態として説明したが、当該プログラムを、コンピュータ読み取り可能な記録媒体に格納して提供することも可能である。 Furthermore, although the present specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

１０行動選択学習装置 10 action selection learning device
２０行動選択装置 20 action selection device
１００状態遷移規則獲得部 100 state transition rule acquisition unit
１１０シミュレーション部 110 simulation unit
１２０状態記憶部 120 state storage unit
１３０期待利得記憶部 130 expected gain storage unit
１４０行動選択方策獲得部 140 action selection strategy acquisition unit
１５０収束判定部 150 convergence determination unit
１６０出力部 160 output unit
２００初期化部 200 initialization unit
２１０状態獲得部 210 state acquisition unit
２２０報酬計算部 220 reward calculation unit
２３０計算データ記憶部 230 calculation data storage unit
２４０パラメタ更新部 240 parameter update unit
２５０モデル記憶部 250 model storage unit
２６０終了判定部 260 end determination unit
２７０行動選択部 270 action selection unit
２８０行動記憶部 280 action storage unit
３００入力部 300 input unit
３１０行動選択部 310 action selection unit
３２０期待利得記憶部 320 expected gain storage unit
３３０出力部 330 output unit
３４０制御部 340 control unit

Claims

A state transition rule acquisition device comprising:
a state acquisition unit that acquires the state of the environment after the action when a selected action is performed;
a reward calculation unit that calculates, based on the acquired state and the selected action, a reward for performing the action in the state;
a parameter update unit that updates, based on the selected action and the reward calculated by the reward calculation unit, the parameters of a model that takes the state as an input and selects an action;
an action selection unit that selects an action using the model, with the state after the action as an input; and
an end determination unit that causes the acquisition by the state acquisition unit, the calculation by the reward calculation unit, the update by the parameter update unit, and the selection by the action selection unit to be repeated until a predetermined iteration end condition is satisfied,
wherein the state acquisition unit compares the acquired state with the set of already acquired states, adds the acquired state to the set of states if it is a new state, and acquires a state transition rule based on the set of states.

An action selection learning device comprising:
the state transition rule acquisition device according to claim 1;
an action selection policy acquisition unit that learns, for each pair of a state and an action, an expected gain when the action is performed in the state, based on the set of states and the state transition rule obtained by the state transition rule acquisition device; and
a convergence determination unit that causes the processing by the state transition rule acquisition device and the learning by the action selection policy acquisition unit to be repeated until the learning by the action selection policy acquisition unit converges.

The state transition rule acquisition device according to claim 1, wherein the model is a multilayer neural network for selecting an action, using the state after the action as an input.

The action selection learning device according to claim 2, wherein the reward calculation unit calculates the reward for performing the action in the state based on the number of visits to the state, the number of times the action has been selected in the state, and the expected gain when the action is performed in the state.

An action selection device comprising an action selection unit that selects, for an input state, the action having the maximum expected gain, based on an expected gain, learned in advance, for performing an action in a state of the environment,
wherein the expected gain is learned in advance by alternately repeating:
repeating, until a predetermined iteration end condition is satisfied, acquiring the state of the environment after the action when a selected action is performed, calculating a reward for performing the action in the state based on the acquired state and the selected action, updating, based on the selected action and the calculated reward, the parameters of a model that takes the state as an input and selects an action, and selecting an action using the model with the state after the action as an input; and
learning, for each pair of a state and an action, an expected gain when the action is performed in the state, based on the set of states and a state transition rule.

A state transition rule acquisition method comprising:
a step in which a state acquisition unit acquires the state of the environment after the action when a selected action is performed;
a step in which a reward calculation unit calculates, based on the acquired state and the selected action, a reward for performing the action in the state;
a step in which a parameter update unit updates, based on the selected action and the reward calculated by the reward calculation unit, the parameters of a model that takes the state as an input and selects an action;
a step in which an action selection unit selects an action using the model, with the state after the action as an input; and
a step in which an end determination unit causes the acquisition by the state acquisition unit, the calculation by the reward calculation unit, the update by the parameter update unit, and the selection by the action selection unit to be repeated until a predetermined iteration end condition is satisfied,
wherein, in the step of acquiring by the state acquisition unit, the acquired state is compared with the set of already acquired states, the acquired state is added to the set of states if it is a new state, and a state transition rule is acquired based on the set of states.

An action selection method comprising a step in which an action selection unit selects, for an input state, the action having the maximum expected gain, based on an expected gain, learned in advance, for performing an action in a state of the environment,
wherein the expected gain is learned in advance by alternately repeating:
repeating, until a predetermined iteration end condition is satisfied, acquiring the state of the environment after the action when a selected action is performed, calculating a reward for performing the action in the state based on the acquired state and the selected action, updating, based on the selected action and the calculated reward, the parameters of a model that takes the state as an input and selects an action, and selecting an action using the model with the state after the action as an input; and
learning, for each pair of a state and an action, an expected gain when the action is performed in the state, based on the set of states and a state transition rule.

A program for causing a computer to function as the state transition rule acquisition device according to claim 1 or claim 3, the action selection learning device according to claim 2 or claim 4, or the action selection device according to claim 5.