JP2006309519A

JP2006309519A - Reinforcement learning system and reinforcement learning program

Info

Publication number: JP2006309519A
Application number: JP2005131570A
Authority: JP
Inventors: Tetsuya Fukunaga; 哲也福永
Original assignee: Institute of National Colleges of Technologies Japan
Current assignee: Institute of National Colleges of Technologies Japan
Priority date: 2005-04-28
Filing date: 2005-04-28
Publication date: 2006-11-09

Abstract

PROBLEM TO BE SOLVED: To provide a reinforcement learning system capable of significantly improving learning speed of an agent in an early stage of learning.

SOLUTION: The reinforcement learning computer 2 used for a reinforcement learning system 1 is provided with: an undefined state setting means 12 for setting an initial value of Q value in an undefined state, a state observation means 14 for observing the state of the agent 13, a behavior output means 15 for outputting behavior, a next state observation means 16 for observing the next state of the agent 13, a reward providing means 17 for providing the agent 13 with a reward r, a determination means 20 for determining the reward r and the next Q value according to a criterion 19, an unlearning means 21 for setting the Q value in an unlearned state according to the criterion 19, a second determination means 23 for performing determination according to a second criterion 23a, a learning means 24 for updating the Q value, an initialization means 25 for initializing the Q value, and a state update means 26 for updating the next state.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a reinforcement learning system and a reinforcement learning program. In particular, it relates to a reinforcement learning system and a reinforcement learning program for carrying out reinforcement learning in which a robot or similar device, such as a biped walking robot or an exploration device used for planetary exploration, experiences changes in its surrounding environment and the various events that occur, and learns to determine its actions autonomously.

A learning method called "reinforcement learning" has conventionally been used to enable autonomous control of robots and the like (see, for example, Non-Patent Document 1). In reinforcement learning, the learning subject (agent) generally acts at random in some environment and obtains a reward only when it reaches the goal as a result. In a subsequent episode, when the agent encounters a situation it experienced in an earlier episode, it becomes more likely to select an action that, based on that experience, is likely to yield a reward. By repeating episodes in which it obtains rewards, the agent eventually learns to select, in every state (scene), the optimal action for obtaining a reward.

The most representative example of reinforcement learning is the method known as "Q-learning". Q-learning directly approximates the optimal action-value function by means of Q values that are given initial values in advance. The number of Q values is the product of the total number of states and the total number of possible actions in the environment in which reinforcement learning is carried out.

The agent's learning (in other words, the update of the Q value) is performed according to equation (5). Here Q(s, a) (the Q value) denotes the value of action a in state s, α is the learning step size, γ is the discount rate, r is the reward, and T is the target value. Specifically, as shown in FIG. 7, the Q-learning system initializes all Q values representing the optimal action-value function to a predetermined value (for example, Q = 0 for every table entry) (step T1) and then observes the state s of the agent to be trained (step T2). It outputs an action a from the state s (step T3), observes the next state s' that results from the action a (step T4), and obtains a reward r (step T5). Learning (the update of the Q value) is then performed using equation (5) (step T6), and the state s is replaced with the next state s' (step T7). Processing then returns to step T3 and the agent continues learning. As a result, the Q value is updated for each action a, and by repeating many times the process that ends when the agent reaches the goal and obtains the reward r (one episode), the Q values gradually approach the optimal Q values at a rate set by the step size α. Through these updates, the time and the number of actions required to reach the goal become smaller than in the initial stage of learning.
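Equation (5) is not reproduced in this text. Judging from the symbols defined above and from the later references to a first expression with two right-hand terms and a second expression for the target value T, it presumably has the standard tabular Q-learning form. The block below is a reconstruction under that assumption, not the patent's own typesetting:

```latex
\begin{align}
Q(s,a) &\leftarrow (1-\alpha)\,Q(s,a) + \alpha\,T, \\
T &= r + \gamma \max_{a'} Q(s',a'), \qquad 0 < \alpha < 1 .
\end{align}
```

In this form the first right-hand term of the first expression carries the old Q value and the second carries the target value, which matches how the text refers to the two terms.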

Non-Patent Document 1: Richard S. Sutton and Andrew G. Barto, translated by Sadayoshi Mikami and Masaaki Minagawa, "Reinforcement Learning" (Japanese edition), 1st edition, Morikita Publishing, December 20, 2000.

However, reinforcement learning based on the Q-learning described above suffers from slow learning, and in particular from poor learning efficiency in the initial stage of learning. In the basic Q-learning formula of equation (5), the step size α (0 < α < 1) is the parameter that determines the learning speed and hence the efficiency with which the Q value is updated. Because α lies in the range 0 < α < 1, however, the second term on the right-hand side of the first expression contributes far less to the overall Q value than the first term. The first term therefore dominates, and the obtained reward is reflected in the Q value only to a very small degree. Consequently, the agent must repeat a very large number of experiences (episodes) before well-approximated Q values are obtained. In particular, when the initial values are far from the Q values to be approximated, learning takes a long time, and as the number of agent states increases, the required learning time grows exponentially.
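The effect of the step size can be illustrated with a few lines of Python. This is a minimal sketch that assumes the standard form of the update, Q ← (1 − α)Q + αT with T = r + γ max Q(s', a'); the function name `q_update` and the values α = 0.1 and γ = 0.9 are illustrative choices, not values taken from the patent.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.9):
    """Conventional Q-learning update (assumed form of equation (5))."""
    target = reward + gamma * max_next_q           # target value T
    return (1 - alpha) * q_sa + alpha * target     # old value dominates for small alpha


# With every entry initialized to 0, a step with zero reward changes nothing.
no_change = q_update(0.0, reward=0.0, max_next_q=0.0)

# The first rewarded step moves Q only a fraction alpha of the way to the target.
first_hit = q_update(0.0, reward=1.0, max_next_q=0.0)
```

Under these assumptions a reward of 1 raises the Q value by only α on the first rewarded step, so many episodes are needed before the reward propagates back through a long chain of states.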

Moreover, in the second expression of equation (5), which computes the target value T, the second term on the right-hand side (excluding the step size α) is itself given an arbitrary initial value at the start of learning. There is therefore no guarantee that this initial value is accurate, and in some cases making the agent act repeatedly in order to approach the target value T is meaningless.

In view of the above circumstances, an object of the present invention is to provide a reinforcement learning system and a reinforcement learning program capable of dramatically improving the learning speed of an agent in the initial stage of learning.

To solve the above problem, the reinforcement learning system according to the present invention mainly comprises: "undefined setting means for setting to undefined the initial value of a value V representing a value function, the value function including an action-value function or a state-value function; state observation means for observing the state of an agent that is the subject of reinforcement learning; action output means for outputting an action of the agent in the state; next state observation means for observing the next state to which the agent transitions as a result of the output action; reward providing means for providing a reward r to the agent that has transitioned to the next state; determination means for evaluating the reward and a next value V' representing the value function in the next state according to a predefined criterion; unlearning means for cancelling the learning process or the initialization process when, according to the criterion of the determination means, the reward is determined to be zero and the next value V' is determined to be undefined; second determination means for evaluating the value V in the state according to a second criterion; learning means for updating the value V, and thereby learning, when the value V is determined to be defined according to the second criterion of the second determination means, based on the following equation:

(α: step size, γ: discount rate)

initialization means for initializing the value V, when the value V is determined to be undefined according to the second criterion of the second determination means, based on the following equation:

and state update means for updating the state to the next state after the processing of any one of the determination means, the initialization means, and the learning means has been performed".
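The two equations referred to in this passage (cited later as equations (6) and (7)) are not reproduced in this text. A plausible reconstruction, written in the symbols V and V' used in the passage and assuming the same two-term form as conventional value updates, is sketched below; the original figures may use Q notation instead.

```latex
\begin{align}
V &\leftarrow (1-\alpha)\,V + \alpha\,(r + \gamma V') \tag{6} \\
V &\leftarrow r + \gamma V' \tag{7}
\end{align}
```

Under this reading, equation (7) is equation (6) with the old-value term dropped, so an undefined V is set directly to the target value.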

When the reinforcement learning system of the present invention is applied to "TD learning", equation (6) can be replaced with equation (8) and equation (7) with equation (9). When it is applied to "Sarsa", equation (6) can be replaced with equation (10) and equation (7) with equation (11). When it is applied to "Q-learning", equation (6) can be replaced with equation (12) and equation (7) with equation (13). In other words, the present invention is applicable to any reinforcement learning that employs a learning scheme in which a value function (including action-value functions and state-value functions) asymptotically approaches the optimal value function. In TD learning, the state-value function V(s) corresponds to the value function and the next value corresponds to the "state value". In Sarsa, the next value corresponds to the "value of the action actually taken", and in Q-learning, the next value corresponds to the "maximum action value". For convenience, the claims and equations (6) and (7) are expressed in terms of the Q value representing the action-value function, but equations (6) and (7) may equally be expressed in terms of the state-value function V(s) (see equations (8) and (9)). To simplify the explanation, the following description deals with the case of application to Q-learning.
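Equations (8) to (13) are likewise missing from this text. Based on the description of the next value for each method (state value for TD learning, value of the action actually taken for Sarsa, maximum action value for Q-learning), they presumably take the following forms. This is a hedged reconstruction from the standard definitions of these algorithms, with each odd-numbered pair member assumed to be the initialization counterpart of the update before it.

```latex
\begin{align}
\text{TD learning:}\quad
V(s) &\leftarrow (1-\alpha)\,V(s) + \alpha\,\bigl(r + \gamma V(s')\bigr) \tag{8} \\
V(s) &\leftarrow r + \gamma V(s') \tag{9} \\
\text{Sarsa:}\quad
Q(s,a) &\leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(r + \gamma Q(s',a')\bigr) \tag{10} \\
Q(s,a) &\leftarrow r + \gamma Q(s',a') \tag{11} \\
\text{Q-learning:}\quad
Q(s,a) &\leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl(r + \gamma \max_{a'} Q(s',a')\bigr) \tag{12} \\
Q(s,a) &\leftarrow r + \gamma \max_{a'} Q(s',a') \tag{13}
\end{align}
```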

Thus, in the reinforcement learning system of the present invention, the initial Q values are not set to an arbitrary predefined value (for example, Q = 0) but are set to an undefined state. Under this setting, the system observes the agent's current state and outputs an action. It then observes the next state to which the agent has transitioned as a result of the action and provides the agent with the reward for that action. The series of processes from observing the current state to providing the reward is the same as in conventional Q-learning. The processing so far is then evaluated against the criterion of the determination means.

The determination means determines whether the obtained reward has a nonzero value and whether the next Q value is defined or undefined. In conventional Q-learning, the reward is often set so that it is given only when the agent reaches the goal. A reward of zero therefore indicates that the agent has not yet reached the goal, and if in addition the next Q value of the next state is undefined, the second term on the right-hand side of equation (6) means that "the target value contains no information about the solution of the problem". When the determination means identifies this situation, the no-learning (unlearned) process is performed. Conversely, if either one of the two criteria is satisfied, either the learning process or the initialization process is subsequently performed.

When, according to the criterion of the determination means, at least one of the conditions "the reward has a nonzero value" and "the next Q value is defined" is satisfied, the second determination means evaluates the Q value of the current state. If the value (Q value) of the current state is determined to be undefined, the Q value is initialized based on equation (7). That is, the first term on the right-hand side of equation (6) is meaningless because the Q value is undefined. The Q value is therefore initialized, as shown in equation (7), using the obtained reward and the maximum Q value. This puts the Q value in a defined state. If, on the other hand, the Q value is already defined, both the first and the second terms on the right-hand side of equation (6) are meaningful; in other words, "the target value contains information about the solution of the problem".

The Q value is therefore updated (that is, learned) using equation (6). The next state reached is then made the new current state, and the sequence of state observation, action output, next-state observation, reward acquisition, determination (by the determination means and/or the second determination means), and unlearned, initialization, or learning processing is repeated in every episode. The agent's learning proceeds in this way. When updating the Q value would be meaningless (reward = 0 and next Q value undefined), learning is cancelled. As a result, no learning iterations are wasted on arbitrarily chosen initial values while the Q values gradually approach their optimal values. Furthermore, because an initial value is given only when the reward has a nonzero value and the Q value is undefined, the initial value given is more meaningful than in the conventional method. The Q values are therefore more likely to approach and converge to the optimal Q values, and learning efficiency improves. In particular, because setting an initial value is skipped in the early stage of learning whenever it would be meaningless, no large discrepancy arises between the initial values and the Q values being approximated.
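The three-way processing described above (unlearned, initialization, learning) can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: undefined Q values are modeled as missing dictionary keys, the next Q value counts as defined when at least one action in the next state has a defined entry, and an undefined next Q value contributes 0 to the target when the reward is nonzero. All three modeling choices, the function names, and the values α = 0.1 and γ = 0.9 are assumptions.

```python
from typing import Dict, Hashable, Optional, Sequence, Tuple

# A missing key means the Q value is still undefined (assumption of this sketch).
QTable = Dict[Tuple[Hashable, Hashable], float]


def max_next_q(q: QTable, next_state: Hashable, actions: Sequence[Hashable]) -> Optional[float]:
    """Largest defined Q value in the next state, or None if all are undefined."""
    defined = [q[(next_state, a)] for a in actions if (next_state, a) in q]
    return max(defined) if defined else None


def update(q: QTable, state, action, reward: float, next_state,
           actions: Sequence[Hashable], alpha: float = 0.1, gamma: float = 0.9) -> str:
    """Apply one step and report which of the three processes was performed."""
    next_q = max_next_q(q, next_state, actions)
    # Determination means: zero reward and undefined next Q value carry no information.
    if reward == 0 and next_q is None:
        return "unlearned"
    # Assumption: an undefined next Q value contributes nothing to the target.
    target = reward + gamma * (next_q if next_q is not None else 0.0)
    key = (state, action)
    # Second determination means: is the current Q value defined?
    if key not in q:
        q[key] = target                                   # initialization, cf. equation (7)
        return "initialized"
    q[key] = (1 - alpha) * q[key] + alpha * target        # learning, cf. equation (6)
    return "learned"
```

A short usage example: starting from an empty table, a step with zero reward into an unexplored state leaves the table untouched, while the first rewarded step defines a Q value equal to the target, which later steps can propagate backwards.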

Furthermore, in addition to the above configuration, the reinforcement learning system according to the present invention may be one in which "the determination means comprises reward determination means for determining, according to a reward criterion, whether the reward has a nonzero value, and next value determination means for determining, according to a next value criterion, whether the next value V' is defined".

In this configuration of the reinforcement learning system, the reward determination means and the next Q value determination means are provided separately. Determination based on both criteria makes it possible to sort the reinforcement learning algorithm accurately into the unlearned, initialization, and learning processes, and in particular to improve learning efficiency dramatically in the initial stage of learning.

The reinforcement learning program according to the present invention, on the other hand, mainly consists of a program that "causes a reinforcement learning computer to function as: undefined setting means for setting to undefined the initial value of a value V representing a value function, the value function including an action-value function or a state-value function; state observation means for observing the state of an agent that is the subject of reinforcement learning; action output means for outputting an action of the agent in the state; next state observation means for observing the next state to which the agent transitions as a result of the output action; reward providing means for providing a reward r to the agent that has transitioned to the next state; determination means for evaluating the reward and a next value V' representing the value function in the next state according to a predefined criterion; unlearning means for cancelling the learning process or the initialization process when, according to the criterion of the determination means, the reward is determined to be zero and the next value V' is determined to be undefined; second determination means for evaluating the value V in the state according to a second criterion; learning means for updating the value V, and thereby learning, when the value V is determined to be defined according to the second criterion of the second determination means, based on the following equation:

(α: step size, γ: discount rate)

initialization means for initializing the value V, when the value V is determined to be undefined according to the second criterion of the second determination means, based on the following equation:

and state update means for updating the state to the next state after the processing of any one of the determination means, the initialization means, and the learning means has been performed".

Furthermore, in addition to the above configuration, the reinforcement learning program according to the present invention may be one that "further causes the reinforcement learning computer to function as the determination means having reward determination means for determining, according to a reward criterion, whether the reward has a nonzero value, and next value determination means for determining, according to a next value criterion, whether the next value V' is defined".

Thus, by executing the reinforcement learning program of the present invention, the reinforcement learning computer can achieve the advantageous operation of the reinforcement learning system described above.

As an effect of the present invention, initially setting the Q values representing the value function to undefined prevents the initial values in the early stage of learning from deviating greatly from the values being approximated, as happened conventionally. As a result, the learning time in the early stage is shortened and the agent's learning efficiency is greatly increased. Furthermore, the determination means and the second determination means allow the processing to be carried out in one of three modes (unlearned, initialization, or learning) according to each state (situation), so that values are updated more efficiently than before. Learning efficiency thus improves, and the time needed for the Q values to approach and converge to their optimal values is greatly reduced compared with conventional learning schemes, such as Q-learning, in which the value function asymptotically approaches the optimal value function.

A reinforcement learning system 1 according to one embodiment of the present invention is described below with reference to FIGS. 1 to 7. FIG. 1 is a block diagram showing the functional configuration of the reinforcement learning computer 2 used in the reinforcement learning system 1 of this embodiment. FIG. 2 is an explanatory diagram expressing the learning procedure 3 (learning algorithm) of the reinforcement learning system 1. FIG. 3 is an explanatory diagram that classifies, in table form, the processing performed based on the determinations of the determination means 20 and the second determination means 23. FIG. 4 is a flowchart showing the processing flow of the reinforcement learning computer 2. FIG. 5 is an explanatory diagram showing (a) a 100 × 100 grid world 4 and (b) an example of Q value data 29. FIG. 6 is a graph comparing the simulation results of the reinforcement learning system 1 and a Q-learning system 5.

The reinforcement learning system 1 of this embodiment is illustrated as an application built on conventional Q-learning. The reinforcement learning computer 2 is constructed so that it can execute a reinforcement learning program 6 stored in advance on a storage medium such as a hard disk (storage means 32 or the like). The reinforcement learning program 6 is programmed to make the reinforcement learning system 1 function according to the learning procedure 3 shown in FIG. 2 and the criteria 19 and 23a shown in FIG. 3. In addition, in parts of FIGS. 1 to 6, the Q value 11 is written as Q(s, a) and the next Q value 18 as Q(s', a) for convenience.

In more detail, as shown in FIG. 1, the reinforcement learning system 1 of this embodiment consists of a reinforcement learning computer 2 capable of executing various arithmetic and storage processes. The reinforcement learning computer 2 is connected to an agent 13 that can be controlled so as to observe the state s of the surrounding environment E and output a predetermined action a according to the observation. In this embodiment, the agent 13 exists as a virtual body that can move within a grid world 4 constructed virtually on the computer. The agent 13 may instead be a physical entity, for example an autonomous mobile robot that is equipped with a plurality of sensors (for example, visual sensors), can move autonomously by means of a drive and travel mechanism, and outputs actions toward the surrounding environment E as appropriate.

As further functional components, shown mainly in FIG. 1, the reinforcement learning computer 2 mainly comprises: undefined setting means 12 that sets to undefined the initial value of the Q value 11, which represents the optimal action-value function; state observation means 14 that observes (recognizes) the state s of the agent 13 undergoing reinforcement learning with respect to the surrounding environment E (here, its position in the grid world 4 described later); action output means 15 that selects the action of the agent 13 in the state s from a plurality of predefined action criteria and outputs that action a; next state observation means 16 that observes the next state s' to which the agent 13 transitions from the state s as a result of the action a; reward providing means 17 that provides a reward r to the agent 13 that has transitioned to the next state s'; determination means 20 that makes a determination according to the reward r and the next state s'; unlearning means 21 that, when the determination means 20 determines according to the criterion 19 that the value of the reward r is zero and the next Q value 18 is undefined, cancels both the learning process of updating the Q value 11 and the initialization process of the Q value 11 and performs the unlearned process, that is, "does not learn"; second determination means 23 that evaluates the Q value 11 in the state s according to the second criterion 23a; learning means 24 that, when the Q value 11 is determined to be defined according to the second criterion 23a of the second determination means 23, updates the Q value based on formula A (see FIG. 4, equation (1), etc.) and thereby carries out learning; initialization means 25 that, when the Q value is determined to be undefined, initializes the Q value 11 based on formula B (see FIG. 4, equation (2), etc.); and state update means 26 that replaces the state s with the aforementioned next state s' after any one of the unlearned, learning, and initialization processes has been executed.

The determination means 20 comprises reward determination means 27, which determines according to a reward criterion 27a whether the reward r has a nonzero value, and next Q value determination means 28, which determines according to a next Q value criterion 28a whether the next Q value 18 is defined. In addition, as further functional components, the reinforcement learning computer 2 has storage means 32 that collectively stores: Q value data 29 (see FIG. 5(b)), which stores the defined, undefined, initialized, and updated Q values 11 and next Q values 18 and holds them in table form; state data 30, which stores the observed states s and next states s' and accumulates and holds the history of the states s and actions of the agent 13; and reward data 31, which stores and holds the rewards r. The storage means 32 also stores the reinforcement learning program 6 for making the reinforcement learning computer 2 function as the reinforcement learning system 1, and the program can be executed by program execution means 33. The next Q value criterion 28a corresponds to the next value criterion of the present invention, and the next Q value determination means 28 corresponds to the next value determination means of the present invention.
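The split of the determination into a reward check and a next Q value check, followed by the second determination on the current Q value, can be written as a small decision table. The sketch below is inferred from the prose only (FIG. 3 is not reproduced here); the names `Processing`, `reward_is_nonzero`, and `classify` are hypothetical.

```python
from enum import Enum


class Processing(Enum):
    UNLEARNED = "unlearned"
    INITIALIZE = "initialize"
    LEARN = "learn"


def reward_is_nonzero(reward: float) -> bool:
    """Reward check: does the reward have a nonzero value?"""
    return reward != 0


def classify(reward: float, next_q_defined: bool, q_defined: bool) -> Processing:
    """Select the process from the two determinations described in the text."""
    # First determination: neither criterion satisfied means nothing can be learned.
    if not (reward_is_nonzero(reward) or next_q_defined):
        return Processing.UNLEARNED
    # Second determination: a defined Q value is updated, an undefined one is initialized.
    return Processing.LEARN if q_defined else Processing.INITIALIZE
```

Note that under this reading a defined current Q value is still left untouched when the reward is zero and the next Q value is undefined, because the first determination is made before the second.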

In the present embodiment, a commercially available general-purpose computer is used as the reinforcement learning computer 2 of the reinforcement learning system 1, and each of the means described above is implemented by an arithmetic processing circuit built mainly around a CPU. The storage means 32 may be a fixed storage medium such as a hard disk or a non-volatile storage medium such as a semiconductor memory, and can sequentially store various information such as the action a of the agent 13. In the case of the autonomous mobile robot described above, the configuration of the reinforcement learning computer 2 may instead be built into a control circuit inside the robot.

Next, an example of the reinforcement learning system 1 simulated by the reinforcement learning computer 2 will be described, mainly with reference to FIGS. 4 and 5. As shown in FIG. 5(a), a virtual space partitioned vertically and horizontally into a 100 × 100 grid (corresponding to the grid world 4) is assumed for the reinforcement learning system 1 of the present embodiment. That is, the grid world 4 contains 10,000 grid cells, from grid position M1 (the start point S) to grid position M10000 (the goal point G). The minimum number of steps for the agent 13 to travel from the start point S in the lower-left corner to the goal point G in the upper-right corner is 198: 99 steps upward and 99 steps to the right. In the present embodiment, the agent 13 starts from the start point S and obtains a nonzero real-valued reward r only upon reaching the goal point G; in all other cases it obtains a reward r of 0. By repeatedly updating and initializing the Q values based on the actions a taken by the agent 13 between the start point S and the goal point G, the agent can learn so that the number of steps converges to the minimum of 198. In the present embodiment, one episode runs from the start point S to the goal point G. The processing from step S1 to step S10 in FIG. 4 corresponds to the reinforcement learning program of the present invention.

First, the reinforcement learning program 6 stored in the storage means 32 is executed by the program execution means 33, causing the reinforcement learning computer 2 to operate and construct the reinforcement learning system 1 on the grid world 4. The initial values of the Q values 11, which are stored in tabulated form in the Q value data 29 of the storage means 32, are set to the undefined state (step S1). As a result, the value of the optimal action-value function, which indicates the value in the state s in which the agent 13 is placed, is not defined. Next, the state s of the agent 13 is observed (step S2). In the reinforcement learning system 1 of the present embodiment, the position of the agent 13 on the grid world 4 is observed as the state s. The agent 13 then outputs an action a that causes a transition from the state s (step S3). As shown in FIG. 5(a), within the virtually constructed grid world 4 the agent 13 is defined as being able to move in any one of four directions (up, down, left, or right) from the grid cell that represents the position of the current state s.
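As an illustration of step S1, the following Python sketch represents an undefined Q value as `None` rather than as a numeric zero, so that "undefined" and "defined with value 0" stay distinguishable. The names `ACTIONS`, `make_undefined_q_table`, and `is_defined` are hypothetical and do not appear in the specification; keeping one entry per (grid position, action) pair is likewise an assumption about how the Q value data 29 might be tabulated.

```python
from typing import Dict, Optional, Tuple

# Hypothetical names; the specification only says that four directions exist.
ACTIONS = ("up", "down", "left", "right")
N_CELLS = 100 * 100  # grid positions M1 .. M10000

QTable = Dict[Tuple[int, str], Optional[float]]


def make_undefined_q_table(n_cells: int = N_CELLS) -> QTable:
    """Step S1 (sketch): every Q(s, a) starts as undefined (None), not 0.0."""
    return {(m, a): None for m in range(1, n_cells + 1) for a in ACTIONS}


def is_defined(q: QTable, state: int, action: str) -> bool:
    """True once a value has been written for this (state, action) pair."""
    return q[(state, action)] is not None


q_table = make_undefined_q_table()
```

Using `None` instead of a zero initial value is what lets the later determinations (steps S6 and S7) ask whether a value exists at all.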

That is, in the state shown in FIG. 5(a), the agent 13 is located at grid M303 (corresponding to the state s) and can move (action a) upward (grid M403), downward (grid M203), to the left (grid M302), or to the right (grid M304). Because the Q values 11 are set to undefined in the initial state, the agent has no Q value 11 indicating which direction leads most quickly to the goal point G. In this case, the agent may move by action a in an arbitrarily chosen one of the four directions (here, upward, toward grid M403). The next state s' at the new position reached by the action a (grid M403) is then observed (step S4). The agent 13 then obtains a reward r for the action a that caused the transition to the next state s' (step S5), and the reward is stored in the reward data 31 of the storage means 32.
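The grid numbering implied by the M303 example (up is +100, right is +1, M1 in the lower-left corner, M10000 in the upper-right corner) can be sketched in Python as follows. The function names `neighbors` and `min_steps` are hypothetical, the row-major numbering is inferred from the example rather than stated outright, and excluding moves that would leave the grid is an assumption, since the text does not say how walls are handled.

```python
SIZE = 100  # the grid world is 100 x 100


def neighbors(m: int, size: int = SIZE) -> dict:
    """Map each available direction to the grid number it leads to.

    Assumes row-major numbering from the lower-left corner, inferred from
    the example in which M303 has M403 above it and M304 to its right.
    """
    row, col = divmod(m - 1, size)
    moves = {}
    if row < size - 1:
        moves["up"] = m + size
    if row > 0:
        moves["down"] = m - size
    if col > 0:
        moves["left"] = m - 1
    if col < size - 1:
        moves["right"] = m + 1
    return moves


def min_steps(start: int, goal: int, size: int = SIZE) -> int:
    """Minimum number of up/down/left/right moves between two grid cells."""
    r0, c0 = divmod(start - 1, size)
    r1, c1 = divmod(goal - 1, size)
    return abs(r1 - r0) + abs(c1 - c0)


print(neighbors(303))       # {'up': 403, 'down': 203, 'left': 302, 'right': 304}
print(min_steps(1, 10000))  # 198
```

The second call reproduces the 198-step minimum: 99 moves up plus 99 moves to the right.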

The reinforcement learning computer 2 then performs determination processing as appropriate, using the observed state s, the next state s', and the value of the reward r (step S6 or step S7). If the determination means 20 finds that either the reward r has a nonzero value or the next Q value 18 in the next state s' is already defined (YES in step S6), a determination by the second determination means 23 is performed (step S7). If, on the other hand, both conditions hold, namely that the reward r is zero and the next Q value 18 is undefined (NO in step S6), the processing of steps S7 to S8 is cancelled and the process proceeds to step S10 without performing the learning process or the initialization process described later. That is, the processing for NO in step S6 corresponds to the unlearning means 21 of the present invention.
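The step S6 determination reduces to a two-input condition, which the following Python sketch spells out. The function name `needs_second_determination` is hypothetical, and representing an undefined next Q value as `None` is an assumption made for illustration.

```python
from typing import Optional


def needs_second_determination(reward: float, next_q: Optional[float]) -> bool:
    """Step S6 (sketch): YES if the reward is nonzero or the next Q value is defined.

    A result of False corresponds to the unlearning case: the reward is zero
    and the next Q value is undefined, so nothing is learned or initialized.
    """
    return reward != 0 or next_q is not None


# All four combinations of (reward, next Q value):
for reward, next_q in [(0.0, None), (0.0, 0.5), (1.0, None), (1.0, 0.5)]:
    print(reward, next_q, needs_second_determination(reward, next_q))
```

Only the first combination (zero reward, undefined next Q value) yields `False` and skips straight to step S10.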

When the reward r is nonzero or the next Q value 18 in the next state s' is already defined (YES in step S6), the reinforcement learning computer 2 evaluates the Q value 11 in the state s according to the second determination criterion 23a. If the Q value 11 is already defined (YES in step S7), the Q value 11 is updated according to formula A in FIG. 4 (step S8). In this case, both the first term on the right-hand side, Q(s, a), which is the Q value 11 in the current state s, and at least one of the reward r and the next Q value 18, Q(s', a), in the second term on the right-hand side are meaningful, so the formula yields a solution to the problem. The Q value is therefore updated and learning takes place. If the Q value 11 is undefined (NO in step S7), the first term on the right-hand side of formula A is not meaningful, so the Q value 11 is instead initialized according to formula B (step S9). A meaningful value is thereby set as the initial Q value. After the learning process (step S8), the initialization process (step S9), or the unlearning process (NO in step S6), the state s is updated to the next state s' (step S10). The process then returns to step S3, and the output of an action a (step S3), the observation of the next state s' (step S4), and the acquisition of the reward r (step S5) are repeated.
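The branching among the learning, initialization, and unlearning processes can be sketched in Python as below. Formulas A and B are not reproduced in this text, so the sketch substitutes assumptions that are not taken from the specification: formula A is taken to be the standard Q-learning update Q(s, a) ← Q(s, a) + α[r + γ max Q(s', a') − Q(s, a)], formula B is taken to be Q(s, a) ← r + γ max Q(s', a'), and an undefined next Q value is treated as 0 when the reward is nonzero. The names `process_transition` and `best_next_q`, and the values of `alpha` and `gamma`, are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

QTable = Dict[Tuple[int, str], Optional[float]]


def best_next_q(q: QTable, next_state: int, actions: Sequence[str]) -> Optional[float]:
    """Largest defined Q value in the next state, or None if all are undefined."""
    defined = [v for a in actions if (v := q.get((next_state, a))) is not None]
    return max(defined) if defined else None


def process_transition(q: QTable, s: int, a: str, r: float, s_next: int,
                       actions: Sequence[str],
                       alpha: float = 0.1, gamma: float = 0.9) -> str:
    """Steps S6 to S9 (sketch). Returns which of the three processes ran."""
    next_q = best_next_q(q, s_next, actions)
    # Step S6, NO branch: zero reward and undefined next Q value -> unlearning.
    if r == 0 and next_q is None:
        return "skipped"
    # Assumed target; an undefined next Q value contributes 0 (assumption).
    target = r + gamma * (next_q if next_q is not None else 0.0)
    current = q.get((s, a))
    if current is not None:
        # Step S8: learning, using the assumed form of formula A.
        q[(s, a)] = current + alpha * (target - current)
        return "updated"
    # Step S9: initialization, using the assumed form of formula B.
    q[(s, a)] = target
    return "initialized"


actions = ("up", "down", "left", "right")
q: QTable = {}
print(process_transition(q, 1, "up", 0.0, 101, actions))         # skipped
print(process_transition(q, 9999, "right", 1.0, 10000, actions))  # initialized
print(process_transition(q, 9998, "right", 0.0, 9999, actions))   # initialized
print(process_transition(q, 9998, "right", 0.0, 9999, actions))   # updated
```

After each call the caller would perform step S10 (setting the state s to s') regardless of which branch ran. Under these assumptions, defined values spread backward from the goal one transition at a time, while transitions in still-unexplored regions leave the table untouched.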

By experiencing multiple episodes, the agent 13 gradually updates the Q values 11 corresponding to each grid cell (grid M1 to grid M10000), which were initially set to undefined, and stores them successively in the tabulated Q value data 29 (see FIG. 5(b)). The agent 13 can thus determine the optimal action a based on the Q values 11 and transition to a next state s' suited to reaching the goal point G from the start point S.

The effect of the reinforcement learning system 1 of the present embodiment is now compared with that of a conventional Q-learning system 5. FIG. 6 is a graph comparing the results of simulations from the start point S to the goal point G using the 100 × 100 grid world 4 described above. The vertical axis shows the number of steps required to reach the goal point G from the start point S in each episode, and the horizontal axis shows the number of episodes. The graph shows that with the reinforcement learning system 1 of the present embodiment, the step count converges to approximately the minimum of 198 steps once about 500 episodes have been exceeded. With the conventional Q-learning system 5, the step count does tend to decrease gradually toward 198, but even after more than 1000 episodes it has not converged to 198 steps as it does with the reinforcement learning system 1 of the present invention. The difference in learning speed is especially marked in the early stage of learning: at about 100 episodes, the present system 1 requires about 3000 steps or fewer, whereas the Q-learning system 5 requires about 15000 steps. These results demonstrate the usefulness of the reinforcement learning system 1 of the present invention.

The present invention has been described above with reference to a preferred embodiment, but it is not limited to this embodiment. As shown below, various improvements and design changes are possible without departing from the gist of the present invention.

That is, although the present embodiment uses a virtually constructed grid world 4 to confirm the effect of the reinforcement learning system 1, the invention is not limited to this, and the reinforcement learning system 1 may be applied to the autonomous mobile robot described above. An autonomous mobile robot that outputs each action a according to the conditions of the surrounding environment E would then learn quickly in the early stage and could be controlled to take the optimal action a in fewer episodes than with the conventional Q-learning system 5.

FIG. 1 is a block diagram showing the functional configuration of the reinforcement learning computer used in the reinforcement learning system.
FIG. 2 is an explanatory diagram illustrating the learning procedure in the reinforcement learning system.
FIG. 3 is an explanatory diagram that classifies, in table form, the processes performed based on the determinations of the determination means and the second determination means.
FIG. 4 is a flowchart showing the flow of processing in the reinforcement learning computer.
FIG. 5 is an explanatory diagram showing (a) the 100 × 100 grid world and (b) an example of the Q value data.
FIG. 6 is a graph comparing the simulation results of the reinforcement learning system of the present embodiment and the Q-learning system.
FIG. 7 is a flowchart showing the flow of processing in the conventional Q-learning system.

Explanation of symbols

1 Reinforcement learning system
2 Reinforcement learning computer
6 Reinforcement learning program
11 Q value (Q(s, a), value)
12 Undefined setting means
13 Agent
14 State observation means
15 Action output means
16 Next state observation means
17 Reward providing means
18 Next Q value (Q(s', a), next value)
19 Determination criterion
20 Determination means
21 Unlearning means
23 Second determination means
23a Second determination criterion
24 Learning means
25 Initialization means
26 State update means
27 Reward determination means
27a Reward determination criterion
28 Next Q value determination means (next value determination means)
28a Next Q value determination criterion (next value determination criterion)
a Action
E Environment
r Reward
s State
s' Next state

Claims

A reinforcement learning system comprising:
undefined setting means for setting an initial value of a value V, which represents a value function including an action-value function or a state-value function, to undefined;
state observation means for observing a state of an agent that is the subject of reinforcement learning;
action output means for outputting an action of the agent in the state;
next state observation means for observing a next state to which the agent transitions as a result of the output action;
reward providing means for providing a reward r to the agent that has transitioned to the next state;
determination means for evaluating, according to a predetermined determination criterion, the reward and a next value V' representing the value function in the next state;
unlearning means for cancelling the learning process or the initialization process when it is determined, according to the determination criterion of the determination means, that the reward is zero and the next value V' is undefined;
second determination means for evaluating the value V in the state according to a second determination criterion;
learning means for updating the value V, and thereby learning, when the value V is determined to be defined according to the second determination criterion of the second determination means, based on the following formula:

(α: step size, γ: discount rate);
initialization means for initializing the value V, when the value V is determined to be undefined according to the second determination criterion of the second determination means, based on the following formula:

and state update means for updating the state to the next state after the processing of any one of the determination means, the initialization means, and the learning means has been performed.

The reinforcement learning system according to claim 1, wherein the determination means includes:
reward determination means for making a determination according to a reward determination criterion that determines whether the reward has a nonzero value; and
next value determination means for making a determination according to a next value determination criterion that determines whether the next value V' is already defined.

A reinforcement learning program that causes a reinforcement learning computer to function as: undefined setting means for setting an initial value of a value V, which represents a value function including an action-value function or a state-value function, to undefined; state observation means for observing a state of an agent that is the subject of reinforcement learning; action output means for outputting an action of the agent; next state observation means for observing a next state to which the agent transitions as a result of the output action; reward providing means for providing a reward r to the agent that has transitioned to the next state; determination means for evaluating, according to a predetermined determination criterion, the reward and a next value V' representing the value function in the next state; unlearning means for cancelling the learning process or the initialization process when it is determined, according to the determination criterion of the determination means, that the reward is zero and the next value V' is undefined; second determination means for evaluating the value V in the state according to a second determination criterion; learning means for updating the value V, and thereby learning, when the value V is determined to be defined according to the second determination criterion of the second determination means, based on the following formula:

initialization means for initializing the value V; and state update means for updating the state to the next state after the processing of any one of the determination means, the initialization means, and the learning means has been performed.

The reinforcement learning program according to claim 3, further causing the reinforcement learning computer to function as the determination means comprising: reward determination means for making a determination according to a reward determination criterion that determines whether the reward has a nonzero value; and next value determination means for making a determination according to a next value determination criterion that determines whether the next value V' is already defined.