JP6764143B2

JP6764143B2 - Reinforcement learning equipment, reinforcement learning methods, and reinforcement learning programs

Info

Publication number: JP6764143B2
Application number: JP2019532275A
Authority: JP
Inventors: 貴士大西; 正明土田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2017-07-26
Filing date: 2017-07-26
Publication date: 2020-09-30
Anticipated expiration: 2037-07-26
Also published as: WO2019021401A1; JPWO2019021401A1

Description

The present invention relates to a reinforcement learning device, a reinforcement learning method, and a reinforcement learning program.

Reinforcement learning is a type of machine learning that deals with the problem of an agent in an environment observing the current state and deciding what action to take. The agent obtains a reward from the environment by selecting an action. Reinforcement learning learns a policy that maximizes the reward obtained through a series of actions.

As one such form of reinforcement learning, Non-Patent Document 1 proposes "hierarchical reinforcement learning" composed of two reinforcement learning agents, a Meta-Controller and a Controller. Suppose that there are multiple states between a start point and a goal, and that the goal is to be reached from the start point by the shortest route. Here, each state is also called a subgoal. In Non-Patent Document 1, the Meta-Controller presents to the Controller the subgoal to be achieved next, chosen from a plurality of subgoals given in advance (Non-Patent Document 1 refers to these as "goals").

The Meta-Controller is also called a high-level planner, and the Controller is also called a low-level planner. Thus, in Non-Patent Document 1, the high-level planner determines a specific subgoal from a plurality of subgoals, and the low-level planner determines the actual action based on that specific subgoal. The high-level planner includes a subgoal determination unit. Let ε be a variable between 0 and 1 (0 ≤ ε ≤ 1). The initial value of ε is 1. While the number of trials is small, the value of ε is close to 1. As the number of trials increases and experience accumulates, the value of ε gradually decreases toward 0. In this situation, the subgoal determination unit selects a specific subgoal at random from the plurality of subgoals with probability ε, and selects a specific subgoal based on experience with probability (1-ε).
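The selection scheme described above can be sketched as follows. This is a minimal illustration only, not the implementation of Non-Patent Document 1: the names `select_subgoal_epsilon_greedy`, `decayed_epsilon`, and `q_values` are hypothetical, "selecting based on experience" is assumed to mean picking the subgoal with the highest learned value estimate, and the linear decay schedule is an assumption (the text only says that ε starts at 1 and gradually decreases toward 0).

```python
import random


def select_subgoal_epsilon_greedy(subgoals, q_values, epsilon, rng=random):
    """Pick a subgoal at random with probability epsilon, else from experience.

    q_values is a hypothetical dict mapping subgoal -> learned value estimate.
    """
    if rng.random() < epsilon:
        # Exploration: uniform random choice among all subgoals.
        return rng.choice(subgoals)
    # Exploitation (assumed): the subgoal with the highest learned value.
    return max(subgoals, key=lambda g: q_values.get(g, 0.0))


def decayed_epsilon(trial, total_trials, floor=0.0):
    """Assumed linear schedule: epsilon is 1 at trial 0 and decays to floor."""
    frac = min(trial / total_trials, 1.0)
    return max(floor, 1.0 - frac)
```

With ε = 1 every choice is random, and with ε = 0 every choice follows the learned values, which matches the two extremes described in the text.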

Patent Document 1 discloses a "learning control device" capable of realizing learning for an autonomous agent that selects its own target tasks and extends its abilities one after another. The learning control device disclosed in Patent Document 1 includes a prediction unit, an evaluation unit, a control unit, and a planning unit. The prediction unit performs predictive learning with the environment as the teacher. The evaluation unit observes prediction errors of the prediction unit, calculation errors of the planning unit, and action control errors of the control unit, and based on these sets an achievement state in the sensor state space that the autonomous agent should achieve, and gives the targeted achievement state (target state) to the planning unit. The planning unit plans the action sequence from the current state to the target state given by the evaluation unit. The control unit performs learning with the plan produced by the planning unit and the environment as teachers, and controls the actions of the autonomous agent. Once the learning of the prediction unit and the control unit has progressed sufficiently, the target state can be hierarchized as a single action.

The prediction unit constantly learns the relationship between the actions it has taken and changes in the environment (changes in sensor input), and its prediction accuracy improves even when an erroneous plan is executed. By utilizing the ability of a function approximator to withstand learning from large-scale samples and high-dimensional inputs, the prediction unit can perform predictive learning without suffering from the curse of dimensionality. Further, even when an erroneous plan generated by an immature prediction unit is executed, the prediction unit experiences regions of the state space in which it is weak and can thereby improve its prediction performance. Because the planning unit uses a heuristic search method, a combinatorial explosion of the search can be suppressed, compared with Q-learning or dynamic programming, even when the input dimension increases and the state space becomes large. In addition, the control unit can be generalized by repeatedly learning successful sequences.

Patent Document 2 provides a method for improving the operation of a robot that is actuated based on a predefined set of actions. Patent Document 2 describes the following. A composite action is generated by combining at least two actions from the set of original actions stored in an action library. After a policy has been learned with the composite actions included, many of those composite actions cannot be used. One reason is that they may violate robot constraints such as joint limits and collisions; another is that a composite action may provide no benefit in a particular scenario. Therefore, for these reasons, such meaningless composite actions are removed from the action library in order to keep the action library small.

Japanese Unexamined Patent Publication No. 2006-268812; Japanese Unexamined Patent Publication No. 2016-196079

Tejas D. Kulkarni, et al. "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation." 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Suppose that the operation of a complex system is to be learned by hierarchical reinforcement learning as disclosed in Non-Patent Document 1. In this case, the number of subgoals becomes large. In other words, the search space for finding subgoals becomes enormous. To learn, the subgoal determination unit needs to try various subgoals by trial and error. As a result, the hierarchical reinforcement learning method disclosed in Non-Patent Document 1 has the problem that the learning time becomes very long.

Patent Document 1 likewise merely discloses hierarchical reinforcement learning. Patent Document 1 neither discloses nor suggests a start point. Further, in Patent Document 1 the goal is not set in advance; the evaluation unit sets the target state based on the error observations described above, and the planning unit plans the action sequence from the current state to the target state. Patent Document 1 therefore neither discloses nor suggests the concept of a plurality of subgoals between a start point and a goal.

Patent Document 2 merely describes removing meaningless composite actions from the action library.

An object of the present invention is to provide a reinforcement learning device, a reinforcement learning method, and a reinforcement learning program capable of solving the above-mentioned problem.

One aspect of the present invention is a reinforcement learning device comprising a high-level planner that determines some subgoals from among the subgoals between a start point and a goal, and a low-level planner that determines actions in accordance with those subgoals, the device having a storage device that stores in advance a rule for selecting the subgoals, wherein the high-level planner has a subgoal determination unit that determines those subgoals in accordance with the rule.

One aspect of the present invention is a reinforcement learning method in which a data processing device determines some subgoals from among the subgoals between a start point and a goal in accordance with a rule for selecting the subgoals, and determines actions in accordance with those subgoals, wherein the rule for selecting the subgoals is stored in a storage device in advance, and the data processing device determines those subgoals in accordance with the rule.

One aspect of the present invention is a reinforcement learning program that causes a data processing device to realize a high-level planner function of determining some subgoals from among the subgoals between a start point and a goal, and a low-level planner function of determining actions in accordance with those subgoals, wherein a rule for selecting the subgoals is stored in a storage device in advance, and the high-level planner function determines those subgoals in accordance with the rule.

According to the present invention, the number of trials can be reduced and the learning time can be shortened.

FIG. 1 is a schematic configuration diagram of a target system to which the reinforcement learning device according to an embodiment of the present invention is applied.
FIG. 2 is a block diagram showing the hardware configuration of the reinforcement learning device according to an embodiment of the present invention.
FIG. 3 is a block diagram showing a detailed configuration example of the task knowledge and the subgoal determination unit shown in FIG. 2.
FIG. 4 is a flowchart showing the subgoal determination flow in the high-level planner shown in FIG. 2.
FIG. 5 is a flowchart showing the subgoal determination flow in the high-level planner when the task knowledge consists only of the priority rule.
FIG. 6 is a flowchart showing the subgoal determination flow in the high-level planner when the task knowledge consists only of the suppression rule.
FIG. 7 is a block diagram showing a configuration example for creating the task knowledge from the task rule.
FIG. 8 is a diagram showing the 13 × 13 grid field in which items are placed.
FIG. 9 is a diagram showing an example of item placement in the field shown in FIG. 8.
FIG. 10 is a diagram showing the Craft rule, which is the task rule in the first example.
FIG. 11 is a diagram showing an example of the priority rule.
FIG. 12 is a diagram showing an example of the suppression rule.
FIG. 13 is a diagram showing the comparison results (experimental results) between the reinforcement learning device of the present embodiment and the hierarchical reinforcement learning disclosed in Non-Patent Document 1 (prior art).
FIG. 14 is a diagram showing the "background knowledge" and the "goal state" required to derive the priority rule using an inference engine.
FIG. 15 is a diagram showing an example of the priority rule derived by the inference engine.
FIG. 16 is a diagram showing an example of the "non-goal state" defined in the inference engine.
FIG. 17 is a diagram showing an example of the suppression rule derived by the inference engine.

FIG. 1 is a schematic configuration diagram of a target system to which the reinforcement learning device according to an embodiment of the present invention is applied.

The target system has a start point S and a goal G. In the target system, there are N subgoals (N is an integer of 3 or more) between the start point S and the goal G. In the example shown in FIG. 1, three subgoals denoted A, B, and C are shown as representatives of the N subgoals. Here, subgoal A is referred to as the first subgoal, subgoal B as the second subgoal, and subgoal C as the third subgoal.

The target system defines a task rule that must be satisfied between the start point S and the goal G. FIG. 1 shows an example in which, in accordance with that task rule, the goal G can be reached in the shortest way from the start point S via the first subgoal A, the second subgoal B, and the third subgoal C.

In general, however, the target system has a large number of subgoals, and as a result the search space for finding subgoals becomes enormous. Therefore, as described later, the reinforcement learning device according to the present embodiment uses task knowledge to narrow the search range and make learning more efficient.

[Embodiment]
FIG. 2 is a block diagram showing the hardware configuration of the reinforcement learning device 100 according to an embodiment of the present invention. The illustrated reinforcement learning device 100 can be realized by a computer that operates under program control.

The illustrated reinforcement learning device 100 is a device that searches for subgoals in a target system such as that shown in FIG. 1.

The reinforcement learning device 100 includes an input device 101 for inputting data, an output device 102 for outputting data, a storage device 104 for storing the programs and data described later, and a data processing device 105 for processing data.

The output device 102 consists of a display device such as an LCD (Liquid Crystal Display) or a PDP (Plasma Display Panel), or a printer. The output device 102 has a function of displaying various information such as an operation menu and of printing the final result, in response to instructions from the data processing device 105.

The storage device 104 consists of a hard disk and memory such as read-only memory (ROM) and random access memory (RAM). The storage device 104 has a function of storing the processing information (described later) and the program 201 required for the various processes in the data processing device 105.

The data processing device 105 consists of a microprocessor such as an MPU (micro processing unit) or a central processing unit (CPU). The data processing device 105 has a function of reading the program 201 from the storage device 104 and realizing various processing units that process data in accordance with the program 201.

The main processing units realized by the data processing device 105 are a high-level planner 301 and a low-level planner 302.

As described later, the high-level planner 301 determines a specific subgoal from among the N subgoals. The low-level planner 302 determines the actual action in accordance with that specific subgoal.

That is, the high-level planner 301 sequentially indicates to the low-level planner 302 the subgoals leading to the goal G, as shown in FIG. 1. The low-level planner 302 operates a simulator (not shown) so as to achieve the indicated subgoal. The low-level planner 302 feeds the result of the attempt to achieve the subgoal back to the high-level planner 301.

More specifically, the storage device 104 stores task knowledge 202 in advance. The task knowledge 202 is knowledge determined, as described later, based on the above task rule.

The high-level planner 301 includes a subgoal determination unit 303. Using the task knowledge 202, the subgoal determination unit 303 narrows the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N), and preferentially determines the specific subgoal from among the M subgoal candidates.

FIG. 3 is a block diagram showing a detailed configuration example of the task knowledge 202 and the subgoal determination unit 303.

The illustrated task knowledge 202 includes a priority rule 204 and a suppression rule 206. The priority rule 204 is a rule, derived from the above task rule, that gives priority to subgoals that contribute to reaching the goal G. The suppression rule 206, on the other hand, is a rule, derived from the above task rule, that suppresses subgoals that do not contribute to reaching the goal G.

The subgoal determination unit 303 includes a priority selection unit 305 and a subgoal check unit 307. The priority selection unit 305 preferentially extracts and selects M subgoal candidates from the N subgoals in accordance with the priority rule 204.

More specifically, the priority selection unit 305 consists of a subgoal candidate extraction unit 311 and a subgoal selection unit 313. The subgoal candidate extraction unit 311 extracts M subgoal candidates from the N subgoals in accordance with the priority rule 204. The subgoal selection unit 313 preferentially selects one subgoal from the M subgoal candidates and outputs the selected subgoal.

Based on the suppression rule 206, the subgoal check unit 307 determines whether the selected subgoal is OK or NG as the specific subgoal. If OK, the subgoal check unit 307 outputs the selected subgoal as the specific subgoal. If the subgoal check unit 307 determines NG, then with a predetermined probability p the subgoal selection unit 313 redoes the subgoal selection, and with probability (1-p) the subgoal check unit 307 outputs the subgoal judged NG as the specific subgoal as it is.

[Description of operation]
Next, the operation of determining a subgoal in the high-level planner 301 (that is, the operation of the subgoal determination unit 303) will be described in detail with reference to the flowchart of FIG. 4.

Here, as above, let ε be a variable between 0 and 1 (0 ≤ ε ≤ 1). While the number of trials is small, the value of ε is close to 1. As the number of trials increases and experience accumulates, the value of ε gradually decreases toward 0. In this situation, the subgoal determination unit 303 according to the present embodiment selects and determines a specific subgoal using the task knowledge 202 with probability ε, as described later. On the other hand, as in the prior art, with probability (1-ε) the subgoal determination unit 303 selects a specific subgoal based on experience (step S101) and determines the specific subgoal (step S102).

Next, the operation of selecting and determining a specific subgoal using the task knowledge 202, which occurs with probability ε, will be described.

First, the subgoal candidate extraction unit 311 extracts M subgoal candidates from the N subgoals in accordance with the priority rule 204 (step S103). Next, the subgoal selection unit 313 selects one subgoal from the M extracted subgoal candidates and outputs the selected subgoal (step S104).

Next, based on the suppression rule 206, the subgoal check unit 307 determines whether the selected subgoal is OK or NG as the specific subgoal (step S105). If OK, the subgoal check unit 307 determines the selected subgoal as the specific subgoal (step S102). If the subgoal check unit 307 determines NG, then with a predetermined probability p the process returns to step S104 and the subgoal selection unit 313 reselects one subgoal from the M extracted subgoal candidates. With probability (1-p), the subgoal check unit 307 outputs the subgoal judged NG as the specific subgoal as it is.
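The flow of steps S101 to S105 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: `determine_subgoal`, `q_values`, `priority_rule`, `suppression_rule`, and `max_retries` are hypothetical names. The rules are assumed to be predicates over subgoals, selection in step S104 is assumed to be uniform at random, "selecting based on experience" is assumed to mean taking the highest learned value, the fallback to all N subgoals when no candidate matches is an assumption, and the retry cap is an added safeguard that the text does not describe.

```python
import random


def determine_subgoal(subgoals, q_values, epsilon, priority_rule,
                      suppression_rule, p, max_retries=1000, rng=random):
    """Sketch of the subgoal determination flow (steps S101 to S105).

    priority_rule(g) is True for subgoals favored by the priority rule;
    suppression_rule(g) is True for subgoals judged NG.
    """
    # S101: with probability (1 - epsilon), select from experience.
    if rng.random() >= epsilon:
        return max(subgoals, key=lambda g: q_values.get(g, 0.0))

    # S103: narrow the N subgoals to M candidates with the priority rule.
    candidates = [g for g in subgoals if priority_rule(g)]
    if not candidates:
        candidates = list(subgoals)  # assumed fallback, not in the text

    # S104: select one candidate.
    chosen = rng.choice(candidates)
    for _ in range(max_retries):
        # S105: an OK subgoal is determined as the specific subgoal (S102).
        if not suppression_rule(chosen):
            return chosen
        # NG: with probability (1 - p), output the NG subgoal as it is.
        if rng.random() >= p:
            return chosen
        # NG: with probability p, return to S104 and reselect.
        chosen = rng.choice(candidates)
    return chosen  # retry cap reached (safeguard added for this sketch)
```

Passing a `priority_rule` that always returns True, or a `suppression_rule` that always returns False, gives a flow that uses only one of the two rules.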

In the above embodiment, the task knowledge 202 includes both the priority rule 204 and the suppression rule 206, but it is not limited to this. For example, the task knowledge 202 may consist only of the priority rule 204, or only of the suppression rule 206.

FIG. 5 is a flowchart showing the subgoal determination flow in the high-level planner 301 when the task knowledge 202 consists only of the priority rule 204. As is clear from FIG. 5, step S105 of FIG. 4 is omitted.

FIG. 6 is a flowchart showing the subgoal determination flow in the high-level planner 301 when the task knowledge 202 consists only of the suppression rule 206. As is clear from FIG. 6, step S103 of FIG. 4 is omitted. In this case, the subgoal selection unit 313 selects one subgoal at random from the N subgoals (step S104).

The priority rule 204 and the suppression rule 206 may be created manually. Alternatively, as shown in FIG. 7, an inference engine 320 may be used to create the priority rule 204 and the suppression rule 206 dynamically from the task rule 210.

[Description of effect]
Next, the effect of the present embodiment will be described.

According to the embodiment of the present invention, the number of trials can be reduced and the learning time can be shortened. This is because the task knowledge is used to narrow the search range (the subgoal candidates to be selected), which speeds up learning.

Each part of the reinforcement learning device 100 may be realized using a combination of hardware and software. In a form combining hardware and software, the reinforcement learning program is loaded into RAM (random access memory), and hardware such as a control unit (CPU (central processing unit)) is operated based on the reinforcement learning program, whereby each part is realized as the respective means. The reinforcement learning program may also be recorded on a recording medium and distributed. The reinforcement learning program recorded on the recording medium is read into memory via a wired connection, a wireless connection, or the recording medium itself, and operates the control unit and the like. Examples of recording media include optical disks, magnetic disks, semiconductor memory devices, and hard disks.

Put another way, the above embodiment can be realized by causing a computer that operates as the reinforcement learning device 100 to operate, based on the reinforcement learning program loaded into RAM, as the priority selection unit 305 (the subgoal candidate extraction unit 311 and the subgoal selection unit 313) and the subgoal check unit 307.

Next, a first example in which the reinforcement learning device 100 according to the embodiment of the present invention is applied to a concrete target system will be described. The target system of the first example is a crafting game modeled on Minecraft. That is, the task is to collect and craft materials found in the field and to craft a target item.

The mission definition in the first example is as follows. The objective (goal) is to collect materials and make rabbit_stew. However, if the materials are not collected in the proper order, a different item (for example, stick or mushroom_stew) is produced and the mission fails.

No step-by-step reward is given; a reward is obtained only according to success or failure.

As shown in FIG. 8, various items are placed in a field of 13 × 13 squares. FIG. 9 shows an example of the item placement. Accordingly, the materials are at eight fixed locations (subgoals). The mission always starts from the same initial state (start).

The only actions are moves in four directions. Collection and crafting are performed automatically. FIG. 10 is a diagram showing the Craft rule, which is the task rule 210 of the toy task in this example. The toy task in this example can be completed in a minimum of 39 moves.
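The movement and automatic collection described above can be sketched as follows. This is only a minimal illustration: `FIELD_SIZE`, `MOVES`, and `step` are hypothetical names, the item positions and item names in any usage are placeholders (the actual placement is shown in FIG. 9), clamping at the field boundary is an assumption, and automatic crafting is omitted because the Craft rule of FIG. 10 is not reproduced in the text.

```python
FIELD_SIZE = 13  # 13 x 13 field, as described in the text

# The four movement actions, as (dx, dy) offsets.
MOVES = {
    "up": (0, -1),
    "down": (0, 1),
    "left": (-1, 0),
    "right": (1, 0),
}


def step(position, action, items, inventory):
    """Move one square and automatically collect any item found there.

    items maps (x, y) -> item name and is mutated when an item is taken;
    inventory is a list that collected items are appended to.
    """
    dx, dy = MOVES[action]
    # Assumed behavior: a move off the field leaves the agent at the edge.
    x = min(max(position[0] + dx, 0), FIELD_SIZE - 1)
    y = min(max(position[1] + dy, 0), FIELD_SIZE - 1)
    new_position = (x, y)
    if new_position in items:
        # Collection happens automatically on arrival.
        inventory.append(items.pop(new_position))
    return new_position
```

For example, with a placeholder item at (1, 0), calling `step((0, 0), "right", items, inventory)` moves the agent to (1, 0) and adds that item to the inventory.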

In the first example, the task knowledge 202 is created manually. The priority rule 204 in this example encodes as rules the positions of the materials that are prerequisites for the target item. The suppression rule 206 encodes as rules the positions of the materials that are prerequisites for the failure items.

FIG. 11 is a diagram showing an example of the priority rule 204. FIG. 12 is a diagram showing an example of the suppression rule 206.

FIG. 13 is a diagram showing the comparison results (experimental results) between the reinforcement learning device 100 of the present embodiment and the hierarchical reinforcement learning disclosed in Non-Patent Document 1 (prior art). In FIG. 13, the horizontal axis shows the number of trials and the vertical axis shows the task success rate. The dash-dot line shows the experimental result for the prior art, the dash-double-dot line shows the result when only the suppression rule 206 is used as the task knowledge 202, and the dashed line shows the result when only the priority rule 204 is used as the task knowledge 202. The solid line shows the result when the priority rule 204 and the suppression rule 206 are used together as the task knowledge 202.

As is clear from FIG. 13, when the priority rule 204 and the suppression rule 206 are used together as the task knowledge 202, the reinforcement learning device 100 according to the present embodiment learns about five times faster than the prior art. Even when only the priority rule 204 is used as the task knowledge 202, the reinforcement learning device 100 according to the present embodiment still learns faster than the prior art.

In the first example described above, the priority rule 204 and the suppression rule 206 are created manually. In the second example described below, by contrast, the inference device 320 is used to create the priority rule 204 and the suppression rule 206 dynamically.

First, an example of deriving the priority rule 204 using the inference device 320 will be described. To simplify the explanation, the task rules 210 used here differ from those shown in FIG. 10.

FIG. 14 shows the "background knowledge" and the "objective state" that the inference device 320 needs in order to derive the priority rule 204. Two kinds of predicates are defined: an action predicate (goto) and a state predicate (have). In FIG. 14, the Pickup rules in the "background knowledge" express the item arrangement shown in FIG. 9.

Given the "background knowledge" and the "objective state" shown in FIG. 14, the inference device 320 repeatedly applies backward inference and takes the action predicates derived in this way as the priority rule 204. FIG. 15 shows an example of the priority rule 204 derived in this way.
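As a rough illustration of the backward inference described above, the following Python sketch starts from a have(...) target and works backward through crafting rules until it reaches items that can be picked up, collecting the corresponding goto(...) predicates. The rule contents, the crafting chain, and the compass positions are hypothetical placeholders chosen for illustration; the actual background knowledge of FIG. 14 and the rules of FIG. 15 are not reproduced here.

```python
from typing import Dict, List

# Hypothetical background knowledge (NOT the contents of FIG. 14).
# Craft rules: have(item) holds if have(material) holds for every material.
CRAFT: Dict[str, List[str]] = {
    "mushroom_stew": ["red_mushroom", "brown_mushroom", "bowl"],
    "bowl": ["planks"],
}
# Pickup rules: have(item) can be achieved by goto(position).
PICKUP: Dict[str, str] = {
    "red_mushroom": "NE",
    "brown_mushroom": "NW",
    "planks": "SE",
}


def derive_priority_rules(target: str) -> List[str]:
    """Backward-chain from have(target) and collect the goto(...) predicates reached."""
    rules: List[str] = []
    seen = set()
    stack = [target]
    while stack:
        item = stack.pop()
        if item in seen:
            continue
        seen.add(item)
        # An item that can be picked up yields an action predicate.
        if item in PICKUP:
            rule = f"goto({PICKUP[item]})"
            if rule not in rules:
                rules.append(rule)
        # An item that must be crafted pushes its materials as new sub-targets.
        stack.extend(CRAFT.get(item, []))
    return rules


if __name__ == "__main__":
    print(derive_priority_rules("mushroom_stew"))
```

In this sketch the derived goto(...) predicates are unconditional. Rules that also depend on the items currently held would need the have(...) conditions to be carried along with each derived predicate.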

Next, an example of deriving the suppression rule 206 using the inference device 320 will be described.

The inference device 320 defines the "non-objective state" shown in FIG. 16 and takes the action predicates that lead to it as the suppression rule 206. In FIG. 16, the conditions at each branch are combined by AND, so the non-objective state is reached only when all of them are satisfied.

FIG. 17 shows an example of the suppression rule 206 derived in this way. FIG. 17 contains three suppression rules. The first indicates that going to SW is suppressed when the agent has red_mushroom and brown_mushroom but does not have bowl. The remaining two suppression rules are read in the same way.
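The way such a suppression rule could be evaluated against the agent's current inventory can be sketched as follows. Only the first rule described above is encoded; the other two rules of FIG. 17 are not given in the text and are therefore omitted. The `SuppressionRule` class and the `is_suppressed` function are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class SuppressionRule:
    """goto(position) is suppressed when every `have` item is held
    and none of the `not_have` items is held (AND of all conditions)."""
    have: FrozenSet[str]
    not_have: FrozenSet[str]
    goto: str


# First suppression rule as described in the text; the data structure is an assumption.
RULES: List[SuppressionRule] = [
    SuppressionRule(
        have=frozenset({"red_mushroom", "brown_mushroom"}),
        not_have=frozenset({"bowl"}),
        goto="SW",
    ),
]


def is_suppressed(inventory: Set[str], subgoal: str,
                  rules: List[SuppressionRule] = RULES) -> bool:
    """Return True if any rule forbids moving to `subgoal` with this inventory."""
    return any(
        rule.goto == subgoal
        and rule.have <= inventory          # all required items are held
        and not (rule.not_have & inventory)  # no excluded item is held
        for rule in rules
    )


if __name__ == "__main__":
    print(is_suppressed({"red_mushroom", "brown_mushroom"}, "SW"))
```

Because the conditions are combined by AND, holding only one of the two mushrooms, or additionally holding bowl, does not trigger the rule in this sketch.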

In this way, the inference device 320 can dynamically create the priority rule 204 and the suppression rule 206 from the task rules 210.

The specific configuration of the present invention is not limited to the embodiment described above; modifications that do not depart from the gist of the invention are also included in the invention.

The invention of the present application has been described above with reference to the embodiment (examples), but it is not limited to that embodiment (examples). Various changes that those skilled in the art can understand may be made to the configuration and details of the invention within its scope.

Some or all of the above embodiment may also be described as in the following appendices, but is not limited to them.

(Appendix 1) A reinforcement learning device comprising: a high-level planner that determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal; and a low-level planner that decides actual actions according to the specific subgoal, wherein the high-level planner includes a subgoal determination unit that uses task knowledge to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates, and the task knowledge is knowledge determined based on task rules to be satisfied between the start point and the goal.

(Appendix 2) The reinforcement learning device according to Appendix 1, wherein the task knowledge includes a priority rule, obtained based on the task rules, that prioritizes subgoals contributing to reaching the goal, and the subgoal determination unit includes a priority selection unit that preferentially extracts and selects the M subgoal candidates from among the N subgoals according to the priority rule.

(Appendix 3) The reinforcement learning device according to Appendix 2, wherein the priority selection unit includes: a subgoal candidate extraction unit that extracts the M subgoal candidates from the N subgoals according to the priority rule; and a subgoal selection unit that preferentially selects one subgoal from among the M subgoal candidates and outputs the selected subgoal.

(Appendix 4) The reinforcement learning device according to Appendix 3, wherein the task knowledge further includes a suppression rule, obtained based on the task rules, that suppresses subgoals not contributing to reaching the goal, and the subgoal determination unit further includes a subgoal check unit that determines, based on the suppression rule, whether the selected subgoal is OK or NG as the specific subgoal.

(Appendix 5) The reinforcement learning device according to Appendix 4, wherein, when the subgoal check unit determines NG, the subgoal selection unit reselects, with a predetermined probability, the one subgoal from among the M subgoal candidates.

(Appendix 6) A reinforcement learning method in which a high-level planner determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal, and a low-level planner decides actual actions according to the specific subgoal, wherein a subgoal determination unit of the high-level planner uses task knowledge to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates, and the task knowledge is knowledge determined based on task rules that define the rules to be satisfied between the start point and the goal.

(Appendix 7) The reinforcement learning method according to Appendix 6, wherein the task knowledge includes a priority rule, obtained based on the task rules, that prioritizes subgoals contributing to reaching the goal, and a priority selection unit of the subgoal determination unit preferentially extracts and selects the M subgoal candidates from among the N subgoals according to the priority rule.

(Appendix 8) The reinforcement learning method according to Appendix 7, wherein a subgoal candidate extraction unit of the priority selection unit extracts the M subgoal candidates from the N subgoals according to the priority rule, and a subgoal selection unit of the priority selection unit preferentially selects one subgoal from among the M subgoal candidates and outputs the selected subgoal.

(Appendix 9) The reinforcement learning method according to Appendix 8, wherein the task knowledge further includes a suppression rule, obtained based on the task rules, that suppresses subgoals not contributing to reaching the goal, and a subgoal check unit of the subgoal determination unit determines, based on the suppression rule, whether the selected subgoal is OK or NG as the specific subgoal.

(Appendix 10) The reinforcement learning method according to Appendix 9, wherein, when the subgoal check unit determines NG, the subgoal selection unit reselects, with a predetermined probability, the one subgoal from among the M subgoal candidates.

(Appendix 11) A reinforcement learning program recording medium on which is recorded a reinforcement learning program that causes a computer to execute: a high-level planner procedure that determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal; and a low-level planner procedure that decides actual actions according to the specific subgoal, wherein the high-level planner procedure includes a subgoal determination procedure that uses task knowledge to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates, and the task knowledge is knowledge determined based on task rules to be satisfied between the start point and the goal.

(Appendix 12) The reinforcement learning program recording medium according to Appendix 11, wherein the task knowledge includes a priority rule, obtained based on the task rules, that prioritizes subgoals contributing to reaching the goal, and the subgoal determination procedure includes a priority selection procedure that preferentially extracts and selects the M subgoal candidates from among the N subgoals according to the priority rule.

(Appendix 13) The reinforcement learning program recording medium according to Appendix 12, wherein the priority selection procedure includes: a subgoal candidate extraction procedure that extracts the M subgoal candidates from the N subgoals according to the priority rule; and a subgoal selection procedure that preferentially selects one subgoal from among the M subgoal candidates and outputs the selected subgoal.

(Appendix 14) The reinforcement learning program recording medium according to Appendix 13, wherein the task knowledge further includes a suppression rule, obtained based on the task rules, that suppresses subgoals not contributing to reaching the goal, and the subgoal determination procedure further includes a subgoal check procedure that determines, based on the suppression rule, whether the selected subgoal is OK or NG as the specific subgoal.

(Appendix 15) The reinforcement learning program recording medium according to Appendix 14, wherein, when the subgoal check procedure determines NG, the subgoal selection procedure reselects, with a predetermined probability, the one subgoal from among the M subgoal candidates.
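The flow set out in Appendices 1 to 5 (extract candidates with the priority rule, select one, check it against the suppression rule, and reselect with a predetermined probability on NG) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: uniform random choice stands in for whatever learned selection the high-level planner actually uses, an NG subgoal is kept when no reselection is drawn, and `max_tries` is an added safeguard. The function and parameter names are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def determine_subgoal(
    subgoals: Sequence[str],
    is_prioritized: Callable[[str], bool],  # stands in for the priority rule
    is_ng: Callable[[str], bool],           # stands in for the suppression rule check
    reselect_prob: float,                   # the "predetermined probability"
    rng: Optional[random.Random] = None,
    max_tries: int = 100,                   # assumption: bound on reselection attempts
) -> str:
    rng = rng or random.Random()
    # Subgoal candidate extraction: narrow the N subgoals down to M candidates.
    candidates = [g for g in subgoals if is_prioritized(g)]
    if not candidates:
        raise ValueError("the priority rule left no subgoal candidates (M must be >= 1)")
    # Subgoal selection: pick one candidate (uniform choice is an assumption).
    choice = rng.choice(candidates)
    for _ in range(max_tries):
        # Subgoal check: accept the choice if the suppression rule says OK.
        if not is_ng(choice):
            break
        # On NG, reselect only with the predetermined probability.
        if rng.random() >= reselect_prob:
            break
        choice = rng.choice(candidates)
    return choice


if __name__ == "__main__":
    picked = determine_subgoal(
        ["NE", "NW", "SE", "SW"],
        is_prioritized=lambda g: g in {"NE", "SW"},
        is_ng=lambda g: g == "SW",
        reselect_prob=0.9,
        rng=random.Random(0),
    )
    print(picked)
```

Keeping the reselection probabilistic rather than mandatory means that a subgoal flagged NG can still occasionally be tried, which leaves room for exploration if the task knowledge is imperfect.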

The reinforcement learning device according to the present invention is applicable to uses such as plant operation support systems and infrastructure operation support systems.

100 Reinforcement learning device
101 Input device
102 Output device
104 Storage device
105 Data processing device
201 Program
202 Task knowledge
204 Priority rule
206 Suppression rule
210 Task rule
301 High-level planner
302 Low-level planner
303 Subgoal determination unit
305 Priority selection unit
307 Subgoal check unit
311 Subgoal candidate extraction unit
313 Subgoal selection unit
320 Inference device

Claims

A reinforcement learning device comprising:
a high-level planner that determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal;
a low-level planner that decides actions according to the specific subgoal; and
a storage device that stores in advance task knowledge representing rules for selecting the subgoal,
wherein the high-level planner has a subgoal determination unit that uses the task knowledge to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates.

The task knowledge includes a priority rule that prioritizes a subgoal that contributes to the achievement of the goal, which is obtained based on the rule.
The subgoal determination unit includes a priority selection unit that preferentially extracts and selects the M subgoal candidates from the N subgoals according to the priority rule.
The reinforcement learning device according to claim 1.

The priority selection unit includes:
a subgoal candidate extraction unit that extracts the M subgoal candidates from the N subgoals according to the priority rule; and
a subgoal selection unit that preferentially selects one subgoal from the M subgoal candidates and outputs the selected subgoal.
The reinforcement learning device according to claim 2.

The task knowledge further includes a suppression rule that suppresses a subgoal that does not contribute to the achievement of the goal, which is obtained based on the rule.
The subgoal determination unit further includes a subgoal check unit that determines whether the selected subgoal is OK or NG as the specific subgoal based on the suppression rule.
The reinforcement learning device according to claim 3.

When the subgoal check unit determines that the subgoal is NG, the subgoal selection unit reselects the one subgoal from the M subgoal candidates with a predetermined probability.
The reinforcement learning device according to claim 4.

A reinforcement learning method in which a data processing device determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal, and determines actions according to the specific subgoal, wherein
task knowledge representing rules for selecting the subgoal is stored in a storage device in advance, and
the data processing device uses the task knowledge to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates.

A reinforcement learning program that causes a computer to execute:
a high-level planner procedure that determines a specific subgoal from among N subgoals (N is an integer of 3 or more) between a start point and a goal; and
a low-level planner procedure that determines actions according to the specific subgoal,
wherein the high-level planner procedure includes a subgoal determination procedure that uses task knowledge representing rules for selecting the subgoals to narrow the N subgoals down to M subgoal candidates (M is an integer of 1 or more and smaller than N) and preferentially determines the specific subgoal from among the M subgoal candidates.

A reinforcement learning device comprising:
a high-level planner that determines some subgoals from among the subgoals from the start point to the goal;
a low-level planner that decides actions according to said some subgoals; and
a storage device that stores in advance the rules for selecting the subgoals,
wherein the high-level planner has a subgoal determination unit that determines said some subgoals in accordance with the rules.

A reinforcement learning method in which a data processing device determines some subgoals from among the subgoals from the start point to the goal according to rules for selecting the subgoals, and determines actions according to said some subgoals, wherein
the rules for selecting the subgoals are stored in a storage device in advance, and
the data processing device determines said some subgoals according to the rules.

A reinforcement learning program that causes a data processing device to realize:
a high-level planner function that determines some subgoals from among the subgoals from the start point to the goal; and
a low-level planner function that decides actions according to said some subgoals,
wherein the rules for selecting the subgoals are stored in a storage device in advance, and the high-level planner function determines said some subgoals according to the rules.