JP2019219741A

JP2019219741A - Learning control method and computer system

Info

Publication number: JP2019219741A
Application number: JP2018114702A
Authority: JP
Inventors: ウシンリョウ; Yuxin Liang; 正啓間瀬; Tadakei Mase; 恵木　正史; Masashi Egi; 隆雄櫻井; Takao Sakurai; 弘充中川; Hiromitsu Nakagawa
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2019-12-26
Anticipated expiration: 2038-06-15
Also published as: JP6975685B2

Abstract

To achieve machine learning with a short computing time while avoiding overlearning. SOLUTION: In a learning control method for learning a policy, a computer system includes: a learning unit which executes, a plurality of times, a simulation based on a transition model parameter obtained by modifying a part of a target model parameter for executing the simulation, and executes learning processing for updating the policy; and a history database which manages, as a learning history, information related to the policy updated by execution of the simulation. The learning unit stores the learning history in the history database and, when it is necessary to execute the simulation using the learning history, executes the simulation a plurality of times based on a transition model parameter calculated from the learning history. SELECTED DRAWING: Figure 1

Description

本発明は、機械学習、特に、強化学習の演算性能を向上させる技術に関する。 The present invention relates to a technique for improving the computational performance of machine learning, particularly, reinforcement learning.

近年、様々な場面での機械学習の活用されている。例えば、製造工場における製品の生産性を向上させる運用計画等を自動的に提示するシステムが注目されている。 In recent years, machine learning has been used in various situations. For example, a system that automatically presents an operation plan or the like for improving the productivity of a product in a manufacturing factory has been receiving attention.

機械学習の一つとして強化学習が知られている。強化学習を利用したシステムは、製造工場の業務環境等を模倣した環境と、製品の製造作業等の行動を行うエージェントとを用いて行動の試行錯誤を行って、行動を選択する指針となるポリシ又は行動の計画等を出力する。 Reinforcement learning is known as one form of machine learning. A system using reinforcement learning performs trial and error of actions using an environment that imitates, for example, the business environment of a manufacturing factory and an agent that performs actions such as product manufacturing work, and outputs a policy that serves as a guideline for selecting actions, an action plan, or the like.

強化学習の演算手法としては様々な手法が提案されている。例えば、特許文献１に記載の技術が知られている。特許文献１には、ポリシの最適化のために、ポリシの初期のパラメータを予め定めて、行動及び環境の状態遷移の試行錯誤を行って、ポリシを反復的に更新することが記載されている。 Various methods have been proposed as calculation methods for reinforcement learning. For example, the technique described in Patent Document 1 is known. Patent Document 1 describes that, in order to optimize a policy, initial parameters of the policy are determined in advance, trial and error of actions and environmental state transitions is performed, and the policy is updated iteratively.

複雑な環境（問題）の場合、探索空間が大きいため、反復的なポリシの更新によるポリシの最適化には時間を要する。そこで、非特許文献１に記載のような演算時間の削減手法が知られている。 In the case of a complex environment (problem), the search space is large, so that it takes time to optimize the policy by iteratively updating the policy. Therefore, a calculation time reduction method as described in Non-Patent Document 1 is known.

非特許文献１には、複雑な環境を簡易な環境に置き換え、簡易な環境に対して機械学習を行い、得られた結果を利用して本来の環境に対する機械学習を実行することが記載されている。非特許文献１に記載の技術と特許文献１に記載の技術とを組み合わせることによって、機械学習の様々な手法に適用できる。 Non-Patent Document 1 describes replacing a complicated environment with a simple environment, performing machine learning on the simple environment, and then performing machine learning on the original environment using the obtained result. Combining the technique described in Non-Patent Document 1 with the technique described in Patent Document 1 makes the approach applicable to various machine learning methods.

米国特許出願公開第２０１７／０２７８０１８号明細書 US Patent Application Publication No. 2017/0278018

Sermanet, Pierre, et al., "Pedestrian detection with unsupervised multi-stage feature learning," Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, 2013.

簡易な環境に対する機械学習から得られた結果を利用した場合、簡易な環境に特化したポリシが出力される可能がある。すなわち、本来の環境の局所解に収束する可能性がある。前述のような現象を過学習と呼ぶ。特許文献１に記載の技術は、過学習が発生する傾向が高いことが知られている。したがって、過学習の発生を抑止する工夫が必要となる。 When a result obtained from machine learning on a simple environment is used, a policy specialized for the simple environment may be output. That is, learning may converge to a local solution of the original environment. This phenomenon is called over-learning. The technique described in Patent Document 1 is known to be prone to over-learning. Therefore, a measure for suppressing the occurrence of over-learning is required.

過学習を回避する手法としては、ポリシ等のパラメータをランダムに設定する手法が知られている。しかし、この手法では、機械学習の演算時間が長くなる問題がある。 As a method of avoiding over-learning, a method of randomly setting parameters such as a policy is known. However, this method has a problem that the calculation time of machine learning becomes long.

本発明は、上記の課題を解説することを目的とする。すなわち、過学習を回避し、かつ、演算時間が短い機械学習を実現する方法及びシステムを実現する。 An object of the present invention is to solve the above problems. That is, the invention provides a method and a system that realize machine learning that avoids over-learning and has a short computation time.

本願において開示される発明の代表的な一例を示せば以下の通りである。すなわち、対象を制御するための処理の制御内容を決定するためのポリシを学習する計算機システムにおける学習制御方法であって、前記計算機システムは、任意のポリシに基づいて前記処理の制御内容を選択するシミュレーションを実現する目標モデルパラメータの一部を変更した遷移モデルパラメータを算出する学習制御部と、前記学習制御部から入力された前記遷移モデルパラメータ又は前回のシミュレーションの結果に基づいて算出された前記遷移モデルパラメータに基づく前記シミュレーションを複数回実行し、前記シミュレーションの結果に基づいて前記ポリシを更新する学習処理を実行する学習器と、前記遷移モデルパラメータ及び前記遷移シミュレーションの実行によって更新された前記ポリシに関連する情報を学習履歴として管理する履歴データベースと、を備え、前記学習制御方法は、前記学習器が、任意のタイミングで、前記履歴データベースに前記学習履歴を格納する第１のステップと、前記学習器が、任意の回数だけ実行された前記シミュレーションによって更新された前記ポリシの評価値に基づいて、前記学習履歴を利用した前記シミュレーションを実行する必要があるか否かを判定する第２のステップと、前記学習履歴を利用した前記シミュレーションを実行する必要があると判定された場合、前記学習器が、前記履歴データベースから選択された使用学習履歴に基づいて算出された前記遷移モデルパラメータに基づく前記シミュレーションを複数回実行し、前記シミュレーションの結果に基づいて前記ポリシを更新する第３のステップと、を含む。 A typical example of the invention disclosed in the present application is as follows. That is, a learning control method in a computer system that learns a policy for determining the control content of a process for controlling a target, wherein the computer system includes: a learning control unit that calculates a transition model parameter obtained by changing a part of a target model parameter for realizing a simulation in which the control content of the process is selected based on an arbitrary policy; a learning device that executes, a plurality of times, the simulation based on the transition model parameter input from the learning control unit or the transition model parameter calculated based on a result of a previous simulation, and executes a learning process of updating the policy based on a result of the simulation; and a history database that manages, as a learning history, the transition model parameter and information related to the policy updated by execution of the transition simulation. The learning control method includes: a first step in which the learning device stores the learning history in the history database at an arbitrary timing; a second step in which the learning device determines, based on an evaluation value of the policy updated by the simulation executed an arbitrary number of times, whether it is necessary to execute the simulation using the learning history; and a third step in which, when it is determined that it is necessary to execute the simulation using the learning history, the learning device executes, a plurality of times, the simulation based on the transition model parameter calculated based on a use learning history selected from the history database, and updates the policy based on a result of the simulation.

本発明の一形態によれば、過学習を回避し、かつ、演算時間が短い機械学習を実現できる。上記した以外の課題、構成及び効果は、以下の実施例の説明により明らかにされる。 According to one embodiment of the present invention, machine learning that avoids over-learning and has a short operation time can be realized. Problems, configurations, and effects other than those described above will be apparent from the following description of the embodiments.

実施例１のシステムの構成例を示す図である。 A diagram illustrating a configuration example of the system according to the first embodiment.

実施例１のサブプロセスコントローラの構成例を示す図である。 A diagram illustrating a configuration example of the sub-process controller according to the first embodiment.

実施例１の学習条件パラメータ情報のデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the learning condition parameter information according to the first embodiment.

実施例１の環境パラメータのデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the environment parameter according to the first embodiment.

実施例１のエージェントパラメータのデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the agent parameter according to the first embodiment.

実施例１の履歴関係管理情報のデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the history relationship management information according to the first embodiment.

実施例１の学習結果ＤＢのデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the learning result DB according to the first embodiment.

実施例１の履歴ＤＢのデータ構造の一例を示す図である。 A diagram illustrating an example of the data structure of the history DB according to the first embodiment.

実施例１の計算機が実行する処理の概要を説明するフローチャートである。 A flowchart illustrating an outline of the process executed by the computer according to the first embodiment.

実施例１の学習コントローラが実行する処理を説明するフローチャートである。 A flowchart illustrating the process executed by the learning controller according to the first embodiment.

実施例１の学習コントローラが実行する処理を説明するフローチャートである。 A flowchart illustrating the process executed by the learning controller according to the first embodiment.

実施例１のサブプロセスコントローラが実行する処理を説明するフローチャートである。 A flowchart illustrating the process executed by the sub-process controller according to the first embodiment.

実施例１のスコア判定モジュールが実行する処理を説明するフローチャートである。 A flowchart illustrating the process executed by the score determination module according to the first embodiment.

実施例１の計算機によって表示されるＧＵＩの一例を示す図である。 A diagram illustrating an example of a GUI displayed by the computer according to the first embodiment.

実施例１の計算機によって表示されるＧＵＩの一例を示す図である。 A diagram illustrating an example of a GUI displayed by the computer according to the first embodiment.

実施例１のシステムの構成の変形例を示す図である。 A diagram illustrating a modification of the configuration of the system according to the first embodiment.

以下、本発明の実施例を、図面を用いて説明する。ただし、本発明は以下に示す実施の形態の記載内容に限定して解釈されるものではない。本発明の思想ないし趣旨から逸脱しない範囲で、その具体的構成を変更し得ることは当業者であれば容易に理解される。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. Note that the present invention is not to be construed as being limited to the description of the embodiments below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the idea or spirit of the present invention.

以下に説明する発明の構成において、同一又は類似する構成又は機能には同一の符号を付し、重複する説明は省略する。 In the structures of the invention described below, the same or similar structures or functions are denoted by the same reference numerals, and redundant description will be omitted.

本明細書等における「第１」、「第２」、「第３」等の表記は、構成要素を識別するために付するものであり、必ずしも、数又は順序を限定するものではない。 Notations such as “first”, “second”, and “third” in this specification and the like are used to identify components, and do not necessarily limit the number or order.

図面等において示す各構成の位置、大きさ、形状、及び範囲等は、発明の理解を容易にするため、実際の位置、大きさ、形状、及び範囲等を表していない場合がある。したがって、本発明では、図面等に開示された位置、大きさ、形状、及び範囲等に限定されない。 The position, size, shape, range, or the like of each component illustrated in the drawings and the like is not accurately represented in some cases in order to facilitate understanding of the invention. Therefore, the present invention is not limited to the position, size, shape, range, and the like disclosed in the drawings and the like.

本明細書では、機械学習の一つである強化学習を一例として発明を説明する。強化学習では、環境及びエージェントを用いたシミュレーションを実行することによって、目的とする結果が取得される。 In the present specification, the invention will be described by taking reinforcement learning which is one of machine learning as an example. In reinforcement learning, a desired result is obtained by executing a simulation using an environment and an agent.

図１は、実施例１のシステムの構成例を示す図である。 FIG. 1 is a diagram illustrating a configuration example of the system according to the first embodiment.

システムは、計算機１００及び端末１０１から構成される。計算機１００及び端末１０１は、ネットワークを介して互いに接続される。ネットワークは、例えば、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）及びＷＡＮ（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）等が考えられる。ネットワークの接続方式は無線及び有線のいずれでもよい。なお、計算機１００及び端末１０１は直接接続されてもよい。 The system includes a computer 100 and a terminal 101. The computer 100 and the terminal 101 are connected to each other via a network. As the network, for example, a LAN (Local Area Network) and a WAN (Wide Area Network) can be considered. The network connection method may be either wireless or wired. Note that the computer 100 and the terminal 101 may be directly connected.

端末１０１は、ユーザが操作する端末である。端末１０１は、プロセッサ、メモリ、及びネットワークインタフェースを有する汎用計算機又は携帯端末等である。 The terminal 101 is a terminal operated by a user. The terminal 101 is a general-purpose computer or a portable terminal having a processor, a memory, and a network interface.

ユーザは、端末１０１を用いて、強化学習を実行に必要なパラメータ、すなわち、学習条件パラメータを設定し、当該パラメータを格納する学習条件パラメータ情報１７０を計算機１００に入力する。また、ユーザは、端末１０１を用いて、計算機１００から出力される情報を確認する。学習条件パラメータ情報１７０のデータ構造については図３を用いて説明する。 Using the terminal 101, the user sets the parameters necessary for executing reinforcement learning, that is, the learning condition parameters, and inputs learning condition parameter information 170 storing those parameters to the computer 100. The user also uses the terminal 101 to check information output from the computer 100. The data structure of the learning condition parameter information 170 will be described with reference to FIG. 3.

計算機１００は、学習条件パラメータ情報１７０に基づいて、任意の対象を制御するための処理に関する強化学習を実行する。例えば、クレーンを用いた荷物の搬入作業の最適な処理手順又は処理内容を選択するためのポリシを探索するための強化学習が実行される。なお、本発明は、学習の対象及び学習内容等に限定されない。 The computer 100 executes reinforcement learning related to processing for controlling an arbitrary target based on the learning condition parameter information 170. For example, reinforcement learning is performed to search for a policy for selecting an optimal processing procedure or processing content for loading work using a crane. Note that the present invention is not limited to learning targets and learning contents.

計算機１００は、ハードウェアとして、プロセッサ１１０、メモリ１１１、及びネットワークインタフェース１１２を有する。なお、計算機１００は、入力装置及び出力装置と接続するＩＯインタフェース、並びに、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）及びＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）等の記憶媒体を有してもよい。 The computer 100 has a processor 110, a memory 111, and a network interface 112 as hardware. Note that the computer 100 may include an IO interface connected to an input device and an output device, and a storage medium such as a hard disk drive (HDD) and a solid state drive (SSD).

プロセッサ１１０は、メモリ１１１に格納されるプログラムを実行する。プロセッサ１１０がプログラムにしたがって処理を実行することによって、特定の機能を実現するモジュールとして動作する。以下の説明では、モジュールを主語に処理を説明する場合、プロセッサ１１０が当該モジュールを実現するプログラムを実行していることを示す。 The processor 110 executes a program stored in the memory 111. The processor 110 operates as a module that implements a specific function by executing processing according to a program. In the following description, when processing is described using a module as a subject, it indicates that the processor 110 is executing a program that implements the module.

メモリ１１１は、プロセッサ１１０が実行するプログラム及びプログラムが使用する情報を格納する。また、メモリ１１１は、プログラムが一時的に使用するワークエリアを含む。 The memory 111 stores a program executed by the processor 110 and information used by the program. Further, the memory 111 includes a work area temporarily used by the program.

ネットワークインタフェース１１２は、ネットワークを介して他の装置と接続するためのインタフェースである。 The network interface 112 is an interface for connecting to another device via a network.

ここで、メモリ１１１に格納されるプログラム及び情報について説明する。メモリ１１１は、学習コントローラ１２０、サブプロセスコントローラ１３０、及びスコア判定モジュール１４０を実現するプログラムを格納する。また、メモリ１１１は、履歴ＤＢ１５０及び学習結果ＤＢ１６０を格納する。 Here, programs and information stored in the memory 111 will be described. The memory 111 stores a program that implements the learning controller 120, the sub-process controller 130, and the score determination module 140. The memory 111 stores a history DB 150 and a learning result DB 160.

サブプロセスコントローラ１３０は、強化学習を実行する。具体的には、サブプロセスコントローラ１３０は、環境パラメータ１７１に基づいて構築される環境モジュール１３１、及び、エージェントパラメータ１７２に基づいて構築されるエージェントモジュール１３２を用いたシミュレーションを繰り返し実行する。 The sub-process controller 130 performs reinforcement learning. Specifically, the sub-process controller 130 repeatedly executes a simulation using the environment module 131 constructed based on the environment parameters 171 and the agent module 132 constructed based on the agent parameters 172.

実施例１では、サブプロセスコントローラ１３０は、目的のシミュレーションの難易度より難易度が低いシミュレーションを実現する環境に対応する環境モジュール１３１を用いて、シミュレーションを実行する。所望の学習結果が得られた場合、サブプロセスコントローラ１３０は、現在のシミュレーションの難易度を変更した環境に対応する環境モジュール１３１及び学習結果を用いたシミュレーションを実行する。前述のような難易度に応じた強化学習の遷移によって、目的の難易度の環境モジュール１３１を用いたシミュレーションの演算を高速化する。 In the first embodiment, the sub-process controller 130 executes the simulation using the environment module 131 corresponding to the environment that realizes the simulation whose difficulty is lower than the difficulty of the target simulation. When a desired learning result is obtained, the sub-process controller 130 executes a simulation using the environment module 131 corresponding to the environment in which the difficulty level of the current simulation has been changed and the learning result. The transition of the reinforcement learning according to the difficulty level as described above speeds up the calculation of the simulation using the environment module 131 having the target difficulty level.
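The difficulty transition described above can be pictured with a minimal sketch. It assumes a toy setting in which the "policy" is a single number and each difficulty level is a target value; the names `run_curriculum`, `toy_train`, and `toy_good_enough` are hypothetical and are not taken from the patent.

```python
from typing import Callable, List, Tuple


def run_curriculum(
    levels: List[float],
    train_once: Callable[[float, float], float],
    is_good_enough: Callable[[float, float], bool],
    max_runs_per_level: int,
    policy: float = 0.0,
) -> Tuple[float, List[Tuple[float, int]]]:
    """Train level by level, carrying the learned policy forward."""
    log: List[Tuple[float, int]] = []
    for level in levels:  # ordered from easiest to the target difficulty
        runs = 0
        while runs < max_runs_per_level:
            policy = train_once(level, policy)  # one simulation plus policy update
            runs += 1
            if is_good_enough(level, policy):
                break  # desired result reached: move on to the next difficulty
        log.append((level, runs))
    return policy, log


# Toy stand-ins (assumptions): the policy is pulled halfway toward the level value.
def toy_train(level: float, policy: float) -> float:
    return policy + 0.5 * (level - policy)


def toy_good_enough(level: float, policy: float) -> bool:
    return abs(level - policy) < 0.1


final_policy, history = run_curriculum([1.0, 2.0, 4.0], toy_train, toy_good_enough, 10)
```

Because each level starts from the policy learned at the previous level, the hardest level begins close to a good solution rather than from scratch, which is the source of the speed-up the paragraph describes.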

学習コントローラ１２０は強化学習を制御する。学習コントローラ１２０には、対象に関するシミュレーションを行うための学習モデルが設定される。学習モデルは、環境のモデル及びエージェントのモデルを含む。学習コントローラ１２０は、学習条件パラメータ情報１７０及び学習モデルに基づいて、環境パラメータ１７１及びエージェントパラメータ１７２を生成し、サブプロセスコントローラに各パラメータを出力する。以下の説明では、環境パラメータ１７１及びエージェントパラメータ１７２を区別しない場合、モデルパラメータと記載する。 The learning controller 120 controls the reinforcement learning. A learning model for performing a simulation of the target is set in the learning controller 120. The learning model includes a model of the environment and a model of the agent. The learning controller 120 generates an environment parameter 171 and an agent parameter 172 based on the learning condition parameter information 170 and the learning model, and outputs each parameter to the sub-process controller 130. In the following description, when the environment parameter 171 and the agent parameter 172 are not distinguished, they are referred to as model parameters.

実施例１のサブプロセスコントローラ１３０は、任意のタイミングで、現在実行しているシミュレーションのモデルパラメータ及び学習結果（学習済のポリシ）を学習履歴として履歴ＤＢ１５０に格納する。また、サブプロセスコントローラ１３０は、任意のタイミングで学習結果の評価をスコア判定モジュール１４０に依頼する。なお、学習履歴に含めるデータは任意に設定できる。例えば、ポリシ内部データ２４１のみを学習履歴として保存してもよい。 The sub-process controller 130 according to the first embodiment stores the model parameters and the learning result (learned policy) of the currently executed simulation in the history DB 150 at an arbitrary timing as a learning history. Further, the sub-process controller 130 requests the score determination module 140 to evaluate the learning result at an arbitrary timing. The data included in the learning history can be set arbitrarily. For example, only the policy internal data 241 may be stored as the learning history.

なお、環境パラメータ１７１のデータ構造は図４を用いて説明し、エージェントパラメータ１７２のデータ構造は図５を用いて説明する。 The data structure of the environment parameter 171 will be described with reference to FIG. 4, and the data structure of the agent parameter 172 will be described with reference to FIG.

学習コントローラ１２０は、学習履歴を反映した環境モジュール１３１及びエージェントモジュール１３２を復元する環境ロールバックコントローラ１２１及びポリシルールバックコントローラ１２２を有する。学習コントローラ１２０は、使用する学習履歴を選択するための履歴関係管理情報１２３を管理する。また、学習コントローラ１２０は、任意のシミュレーション難易度の強化学習の実行回数の合計値を管理する。 The learning controller 120 includes an environment rollback controller 121 and a policy rollback controller 122, which restore the environment module 131 and the agent module 132 so that they reflect a learning history. The learning controller 120 manages history relationship management information 123 for selecting the learning history to be used. The learning controller 120 also manages the total number of executions of reinforcement learning at any given simulation difficulty level.

学習コントローラ１２０は、学習条件パラメータ情報１７０を受信した場合、最もシミュレーションの難易度が低い環境を実現する環境パラメータ１７１を算出する。学習コントローラ１２０は、所望の学習結果が得られた場合、現在のシミュレーションの難易度より難易度が高いシミュレーションを実現するための環境パラメータ１７１を算出する。 When the learning controller 120 receives the learning condition parameter information 170, the learning controller 120 calculates an environment parameter 171 for realizing an environment with the lowest difficulty of the simulation. When a desired learning result is obtained, the learning controller 120 calculates an environment parameter 171 for realizing a simulation whose difficulty is higher than the current difficulty of the simulation.

なお、学習コントローラ１２０は、目的とする環境を実現するためのパラメータに含まれる一部のパラメータの値を変更した環境パラメータ１７１を算出し、又は、目的とする環境を実現するためのパラメータに含まれる一部のパラメータを含まない環境パラメータ１７１を算出することによって、シミュレーションの難易度を変更できる。 Note that the learning controller 120 can change the difficulty of the simulation either by calculating an environment parameter 171 in which the values of some of the parameters for realizing the target environment are changed, or by calculating an environment parameter 171 that omits some of the parameters for realizing the target environment.
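The two ways of changing the difficulty (changing some parameter values, or leaving some parameters out) can be sketched as a small helper. This is only an illustration: the function name `derive_environment_params` and the parameter keys `time_step`, `wind_speed`, and `rope_sway` are hypothetical and do not appear in the patent.

```python
from typing import Any, Dict, Iterable, Optional


def derive_environment_params(
    target: Dict[str, Any],
    overrides: Optional[Dict[str, Any]] = None,
    omit: Iterable[str] = (),
) -> Dict[str, Any]:
    """Return an eased copy of `target`; the target itself is not modified."""
    overrides = overrides or {}
    omitted = set(omit)
    unknown = (set(overrides) | omitted) - set(target)
    if unknown:
        raise KeyError(f"not in target parameters: {sorted(unknown)}")
    if set(overrides) & omitted:
        raise ValueError("a parameter cannot be both overridden and omitted")
    # Drop the omitted parameters, then replace the values of the overridden ones.
    eased = {key: value for key, value in target.items() if key not in omitted}
    eased.update(overrides)
    return eased


# Hypothetical target environment for a crane task.
target_params = {"time_step": 0.1, "wind_speed": 5.0, "rope_sway": True}
easy_params = derive_environment_params(
    target_params, overrides={"wind_speed": 0.0}, omit=["rope_sway"]
)
```

Returning a new dictionary keeps the target parameters intact, so later, harder levels can always be derived from the same target.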

スコア判定モジュール１４０は、学習結果を評価する。スコア判定モジュール１４０は、最適なポリシが算出されたと判定した場合、学習結果ＤＢ１６０に学習結果を格納する。また、スコア判定モジュール１４０は、学習結果の評価に基づいてく、強化学習の実行計画を決定し、実行計画に基づく指示を学習コントローラ１２０に出力する。 The score determination module 140 evaluates the learning result. When determining that the optimal policy has been calculated, the score determination module 140 stores the learning result in the learning result DB 160. The score determination module 140 determines an execution plan of the reinforcement learning based on the evaluation of the learning result, and outputs an instruction based on the execution plan to the learning controller 120.

図２は、実施例１のサブプロセスコントローラ１３０の構成例を示す図である。 FIG. 2 is a diagram illustrating a configuration example of the sub-process controller 130 according to the first embodiment.

環境パラメータ１７１に基づいて構築される環境モジュール１３１は、環境制御モジュール２１０及び報酬算出モジュール２２０を含む。 The environment module 131 constructed based on the environment parameters 171 includes an environment control module 210 and a reward calculation module 220.

環境制御モジュール２１０は、強化学習における環境の状態を管理し、また、状態の遷移をシミュレーションする。環境制御モジュール２１０は、シミュレーション管理モジュール２１１を有し、また、内部パラメータとして環境状態２１２を保持する。 The environment control module 210 manages the state of the environment in reinforcement learning, and simulates a state transition. The environment control module 210 has a simulation management module 211, and holds an environment state 212 as an internal parameter.

環境状態２１２は、現在の環境の状態を示すパラメータである。シミュレーション管理モジュール２１１は、エージェントモジュール１３２から出力される行動２５２に基づいて、状態の遷移をシミュレーションする。 The environment state 212 is a parameter indicating the current state of the environment. The simulation management module 211 simulates a state transition based on the action 252 output from the agent module 132.

報酬算出モジュール２２０は、状態２５０に基づいて報酬２５１を算出し、エージェントモジュール１３２に出力する。 The reward calculation module 220 calculates the reward 251 based on the state 250 and outputs the calculated reward 251 to the agent module 132.

エージェントパラメータ１７２に基づいて構築されるエージェントモジュール１３２は、オプティマイザ２３０及びポリシコントローラ２４０を含む。 The agent module 132 constructed based on the agent parameters 172 includes an optimizer 230 and a policy controller 240.

ポリシコントローラ２４０は、ポリシを対応するポリシ内部データ２４１を保持する。オプティマイザ２３０は、ポリシを更新するための更新用データ２３１及びオプティマイザ内部データ２３２を保持する。更新用データ２３１は、状態パラメータ、報酬パラメータ、及び行動パラメータから構成されるデータを格納する。 The policy controller 240 holds policy internal data 241 corresponding to the policy. The optimizer 230 holds update data 231 for updating a policy and optimizer internal data 232. The update data 231 stores data including a state parameter, a reward parameter, and an action parameter.

ここで、サブプロセスコントローラ１３０の内部の動作について説明する。 Here, the internal operation of the sub-process controller 130 will be described.

環境モジュール１３１は、状態確認フラグ２５３又は行動２５２を受信するまで待ち状態となる。 The environment module 131 is in a waiting state until the state confirmation flag 253 or the action 252 is received.

状態確認フラグ２５３を受信した場合、環境モジュール１３１の環境制御モジュール２１０は、環境状態２１２に設定された状態２５０をエージェントモジュール１３２に出力する。このとき、報酬２５１を示すデータは出力されない。 When receiving the state confirmation flag 253, the environment control module 210 of the environment module 131 outputs the state 250 set in the environment state 212 to the agent module 132. At this time, data indicating the reward 251 is not output.

なお、環境モジュール１３１は、状態２５０とともに「０」に対応する報酬２５１をエージェントモジュール１３２に出力してもよい。この場合、報酬算出モジュール２２０が環境制御モジュール２１０から出力された状態２５０に基づいて「０」を算出する。 The environment module 131 may output the reward 251 corresponding to “0” to the agent module 132 together with the state 250. In this case, the reward calculation module 220 calculates “0” based on the state 250 output from the environment control module 210.

行動２５２を受信した場合、環境モジュール１３１の環境制御モジュール２１０は、行動２５２及び環境状態２１２をシミュレーション管理モジュール２１１に入力してシミュレーションを実行し、状態２５０を算出する。環境制御モジュール２１０は、環境状態２１２に算出された状態２５０を設定し、また、報酬算出モジュール２２０に算出された状態２５０を出力する。 When the action 252 is received, the environment control module 210 of the environment module 131 inputs the action 252 and the environment state 212 to the simulation management module 211, executes a simulation, and calculates the state 250. The environment control module 210 sets the calculated state 250 as the environment state 212 and outputs the calculated state 250 to the reward calculation module 220.

環境モジュール１３１の報酬算出モジュール２２０は、状態２５０を入力とする所定の演算方法に基づいて報酬２５１を算出する。演算方法は、例えば、学習条件パラメータに含まれる。 The reward calculation module 220 of the environment module 131 calculates the reward 251 based on a predetermined calculation method using the state 250 as an input. The calculation method is included in, for example, the learning condition parameter.

環境モジュール１３１は、状態２５０及び報酬２５１をエージェントモジュール１３２に出力する。 The environment module 131 outputs the state 250 and the reward 251 to the agent module 132.
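The environment-side behavior described above (answer a state confirmation flag with the current state only, answer an action with a simulated next state and a reward) can be sketched as follows. The class and method names, the integer state, and the toy transition and reward functions are assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple


class EnvironmentModule:
    """Toy environment that answers either a state confirmation or an action."""

    def __init__(
        self,
        initial_state: int,
        simulate: Callable[[int, int], int],
        reward_fn: Callable[[int], float],
    ) -> None:
        self.environment_state = initial_state  # role of the environment state 212
        self._simulate = simulate  # role of the simulation management module 211
        self._reward_fn = reward_fn  # role of the reward calculation module 220

    def confirm_state(self) -> Tuple[int, Optional[float]]:
        """Response to a state confirmation flag: current state, no reward."""
        return self.environment_state, None

    def step(self, action: int) -> Tuple[int, float]:
        """Response to an action: simulate, store the new state, compute the reward."""
        new_state = self._simulate(self.environment_state, action)
        self.environment_state = new_state
        return new_state, self._reward_fn(new_state)


# Toy dynamics (assumption): the action is added to the state; the goal state is 5.
env = EnvironmentModule(
    0,
    simulate=lambda state, action: state + action,
    reward_fn=lambda state: -float(abs(5 - state)),
)
first = env.confirm_state()
second = env.step(3)
third = env.step(2)
```

Here the reward depends only on the newly computed state, matching the description that the reward calculation module takes the state 250 as its input.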

エージェントモジュール１３２は、まず、環境モジュール１３１に状態確認フラグ２５３を出力する。エージェントモジュール１３２は、環境モジュール１３１から状態確認フラグ２５３に対する応答として状態２５０を受信する。このとき、エージェントモジュール１３２のオプティマイザ２３０は、更新用データ２３１に初期値を設定する。具体的には、オプティマイザ２３０は、状態パラメータが受信した状態２５０、行動パラメータが「なし」、及び報酬パラメータが「報酬なし」であるデータを更新用データ２３１に追加する。 The agent module 132 first outputs a state confirmation flag 253 to the environment module 131. The agent module 132 receives the state 250 from the environment module 131 as a response to the state confirmation flag 253. At this time, the optimizer 230 of the agent module 132 sets an initial value in the update data 231. Specifically, the optimizer 230 adds, to the update data 231, data in which the state parameter is the received state 250, the action parameter is "none", and the reward parameter is "no reward".

エージェントモジュール１３２のポリシコントローラ２４０は、状態２５０及びポリシ内部データ２４１に基づいて行動２５２を選択し、環境モジュール１３１に行動２５２を出力する。エージェントモジュール１３２のオプティマイザ２３０は、環境モジュール１３１から行動２５２に対する応答として状態２５０及び報酬２５１を受信した場合、更新用データ２３１を更新する。具体的には、オプティマイザ２３０は、状態パラメータが受信した状態２５０、行動パラメータが出力した行動２５２、報酬パラメータが受信した報酬２５１であるデータを更新用データ２３１に追加する。 The policy controller 240 of the agent module 132 selects the action 252 based on the state 250 and the policy internal data 241, and outputs the action 252 to the environment module 131. When the optimizer 230 of the agent module 132 receives the state 250 and the reward 251 from the environment module 131 as a response to the action 252, it updates the update data 231. Specifically, the optimizer 230 adds, to the update data 231, data in which the state parameter is the received state 250, the action parameter is the output action 252, and the reward parameter is the received reward 251.

オプティマイザ２３０は、ポリシ内部データ２４１を更新する必要があるか否かを判定する。例えば、オプティマイザ２３０は、状態２５０を受信した場合、ポリシ内部データ２４１を更新する必要があると判定する。なお、オプティマイザ２３０は、状態２５０を受信する度に、ポリシ内部データ２４１を更新する必要があると判定してもよいし、一定の回数、状態２５０を受信した場合に、ポリシ内部データ２４１を更新する必要があると判定してもよい。 The optimizer 230 determines whether the policy internal data 241 needs to be updated. For example, when the optimizer 230 receives the state 250, it determines that the policy internal data 241 needs to be updated. The optimizer 230 may determine that the policy internal data 241 needs to be updated each time the state 250 is received, or only after the state 250 has been received a certain number of times.

ポリシ内部データ２４１を更新する必要があると判定された場合、オプティマイザ２３０は、更新用データ２３１に基づいて、オプティマイザ内部データ２３２を更新する。また、オプティマイザ２３０は、更新されたオプティマイザ内部データ２３２に基づいてポリシ内部データ２４１を更新する。 When it is determined that the policy internal data 241 needs to be updated, the optimizer 230 updates the optimizer internal data 232 based on the update data 231. Further, the optimizer 230 updates the policy internal data 241 based on the updated optimizer internal data 232.
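The bookkeeping on the agent side can be sketched as below: each received state is recorded together with the action and reward, and the policy internal data is updated once a set number of states has been received. The class layout, the `update_every` parameter, and in particular the update rule (storing the mean reward as a `baseline`) are placeholders chosen for illustration; the patent does not specify the update rule at this point.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[int, Optional[int], Optional[float]]  # (state, action, reward)


class Optimizer:
    """Toy optimizer that records transitions and updates at a fixed interval."""

    def __init__(self, update_every: int) -> None:
        self.update_data: List[Record] = []  # role of the update data 231
        self.internal_data: Dict[str, float] = {}  # role of the optimizer internal data 232
        self.update_every = update_every
        self._states_received = 0

    def record(
        self, state: int, action: Optional[int] = None, reward: Optional[float] = None
    ) -> None:
        """Append one entry; the initial entry has no action and no reward."""
        self.update_data.append((state, action, reward))
        self._states_received += 1

    def needs_update(self) -> bool:
        """True after every `update_every`-th received state."""
        return self._states_received % self.update_every == 0

    def update(self, policy_internal_data: Dict[str, float]) -> None:
        """Placeholder rule: refresh internal data first, then the policy data."""
        rewards = [r for _, _, r in self.update_data if r is not None]
        mean_reward = sum(rewards) / len(rewards) if rewards else 0.0
        self.internal_data["mean_reward"] = mean_reward
        policy_internal_data["baseline"] = self.internal_data["mean_reward"]


policy_internal_data: Dict[str, float] = {}
optimizer = Optimizer(update_every=2)
optimizer.record(state=0)  # response to the state confirmation flag
updated_after_first = optimizer.needs_update()
optimizer.record(state=3, action=3, reward=-2.0)  # response to an action
if optimizer.needs_update():
    optimizer.update(policy_internal_data)
```

Setting `update_every=1` corresponds to updating every time a state is received, which is the other option the text mentions.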

実施例１のサブプロセスコントローラ１３０は、強化学習におけるシミュレーションの実行中に、学習履歴を保存するか否かを判定する。履歴ＤＢ１５０に学習履歴を保存すると判定された場合、サブプロセスコントローラ１３０は、履歴ＤＢ１５０に環境モジュール１３１及びエージェントモジュール１３２が保持する内部パラメータ等を学習履歴として履歴ＤＢ１５０に格納する。 The sub-process controller 130 according to the first embodiment determines whether or not to store the learning history during the execution of the simulation in the reinforcement learning. When it is determined that the learning history is stored in the history DB 150, the sub-process controller 130 stores the internal parameters and the like held by the environment module 131 and the agent module 132 in the history DB 150 as the learning history.
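Saving a learning history amounts to taking a snapshot that later training cannot alter. A minimal sketch, assuming an in-memory store and a save-every-N-runs condition (the names `HistoryDB`, `should_save`, and the dictionary keys are hypothetical):

```python
import copy
from typing import Any, Dict, List


class HistoryDB:
    """In-memory stand-in for the history DB; entries are deep-copied snapshots."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, Any]] = []

    def store(self, model_params: Dict[str, Any], policy_internal_data: Dict[str, Any]) -> int:
        """Save a snapshot and return its history ID."""
        self._entries.append(
            {
                "model_params": copy.deepcopy(model_params),
                "policy_internal_data": copy.deepcopy(policy_internal_data),
            }
        )
        return len(self._entries) - 1

    def load(self, history_id: int) -> Dict[str, Any]:
        """Return a copy so callers cannot modify the stored snapshot."""
        return copy.deepcopy(self._entries[history_id])


def should_save(run_count: int, save_every: int) -> bool:
    """Assumed save condition: every `save_every` runs."""
    return run_count > 0 and run_count % save_every == 0


db = HistoryDB()
params = {"wind_speed": 0.0}
policy = {"weights": [1, 2]}
saved_id = None
for run in range(1, 4):
    policy["weights"][0] += 1  # stand-in for a policy update
    if should_save(run, save_every=2):
        saved_id = db.store(params, policy)
restored = db.load(saved_id)
```

The deep copy matters: without it, the stored entry would share the weight list with the live policy and would silently change as training continues.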

なお、計算機が有する各モジュールについては、複数のモジュールを一つのモジュールにまとめてもよいし、一つのモジュールを機能毎に複数のモジュールに分けてもよい。例えば、サブプロセスコントローラ１３０にスコア判定モジュール１４０を含めるようにしてもよい。 As for each module of the computer, a plurality of modules may be combined into one module, or one module may be divided into a plurality of modules for each function. For example, the sub process controller 130 may include the score determination module 140.

図３は、実施例１の学習条件パラメータ情報１７０のデータ構造の一例を示す図である。 FIG. 3 is a diagram illustrating an example of a data structure of the learning condition parameter information 170 according to the first embodiment.

学習条件パラメータ情報１７０は、学習形態３０１、学習回数３０２、上限回数３０３、遷移条件３０４、提示情報３０５、保存条件３０６、及び選択方式３０７から構成される。 The learning condition parameter information 170 includes a learning mode 301, a learning frequency 302, an upper limit frequency 303, a transition condition 304, presentation information 305, a storage condition 306, and a selection method 307.

学習形態３０１は、強化学習の学習方式を示す値を格納するフィールドである。学習回数３０２は、ポリシを保存するタイミングを示す強化学習の実行回数を格納するフィールドである。上限回数３０３は、任意のシミュレーション難易度の強化学習の実行回数の上限値を格納するフィールドである。遷移条件３０４は、シミュレーションの難易度を調整するための情報を格納するフィールドである。提示情報３０５は、強化学習の処理結果として出力する情報を指定する値を格納するフィールドである。保存条件３０６は、履歴ＤＢ１５０に格納するデータを指定する値を格納するフィールドである。選択方式３０７は、利用する学習履歴の選択方式を格納するフィールドである。 The learning mode 301 is a field for storing a value indicating a learning method of reinforcement learning. The number of times of learning 302 is a field for storing the number of times of execution of the reinforcement learning indicating the timing of saving the policy. The upper limit number 303 is a field for storing an upper limit value of the number of executions of the reinforcement learning of an arbitrary simulation difficulty level. The transition condition 304 is a field for storing information for adjusting the difficulty of the simulation. The presentation information 305 is a field that stores a value that specifies information to be output as a processing result of reinforcement learning. The storage condition 306 is a field for storing a value designating data to be stored in the history DB 150. The selection method 307 is a field that stores a selection method of a learning history to be used.
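The fields listed above could be carried in a simple record type. The following is a sketch only: the field types, the example values, and the positivity check are assumptions, since the text defines what each field means but not its format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LearningConditionParameterInfo:
    learning_form: str  # 301: learning method of the reinforcement learning
    learning_count: int  # 302: run count that marks when the policy is saved
    upper_limit_count: int  # 303: upper limit of runs at one difficulty level
    transition_condition: Dict[str, float]  # 304: difficulty adjustment information
    presentation_info: List[str]  # 305: which results to output
    storage_condition: List[str]  # 306: which data to store in the history DB
    selection_method: str  # 307: how a learning history is selected for reuse

    def validate(self) -> None:
        """Assumed sanity check: both counts must be positive."""
        if self.learning_count <= 0 or self.upper_limit_count <= 0:
            raise ValueError("learning_count and upper_limit_count must be positive")


# All values below are hypothetical examples.
info = LearningConditionParameterInfo(
    learning_form="policy_gradient",
    learning_count=100,
    upper_limit_count=1000,
    transition_condition={"score_threshold": 0.8},
    presentation_info=["policy", "score"],
    storage_condition=["policy_internal_data"],
    selection_method="best_score",
)
info.validate()
```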

なお、学習条件パラメータ情報１７０には、評価値の定義を設定するフィールドが含まれてもよい。 The learning condition parameter information 170 may include a field for setting the definition of the evaluation value.

図４は、実施例１の環境パラメータ１７１のデータ構造の一例を示す図である。 FIG. 4 is a diagram illustrating an example of a data structure of the environment parameter 171 according to the first embodiment.

環境パラメータ１７１は、タイムステップ４０１、関係式４０２、方程式４０３、状態４０４、及び報酬４０５から構成される。 The environment parameter 171 includes a time step 401, a relational expression 402, an equation 403, a state 404, and a reward 405.

The time step 401 is a field that stores a value specifying the state transition interval. The relational expression 402 and the equation 403 are fields that store information defining the environment, such as mathematical expressions. The state 404 is a field that stores information defining the state of the environment. The reward 405 is a field that stores information defining how the reward is calculated, such as a mathematical expression.

図５は、実施例１のエージェントパラメータ１７２のデータ構造の一例を示す図である。 FIG. 5 is a diagram illustrating an example of a data structure of the agent parameter 172 according to the first embodiment.

エージェントパラメータ１７２は、ポリシ内部変数５０１及びオプティマイザ内部変数５０２から構成される。 The agent parameter 172 includes a policy internal variable 501 and an optimizer internal variable 502.

ポリシ内部変数５０１は、ポリシ内部データ２４１に設定する変数の値を格納するフィールドである。オプティマイザ内部変数５０２は、オプティマイザ内部データ２３２に設定する変数の値を格納するフィールドである。 The policy internal variable 501 is a field for storing a value of a variable set in the policy internal data 241. The optimizer internal variable 502 is a field for storing a value of a variable set in the optimizer internal data 232.

図５に示すポリシ内部変数５０１には、ポリシに対応するニューラルネットワークの重みの係数が格納される。オプティマイザ内部変数５０２には、勾配法のパラメータα及びβと、更新頻度を制御するパラメータηが格納される。 The weight coefficient of the neural network corresponding to the policy is stored in the policy internal variable 501 shown in FIG. The optimizer internal variable 502 stores parameters α and β of the gradient method and a parameter η that controls the update frequency.

図６は、実施例１の履歴関係管理情報１２３のデータ構造の一例を示す図である。 FIG. 6 is a diagram illustrating an example of a data structure of the history relationship management information 123 according to the first embodiment.

The history relationship management information 123 is data for managing the relationships among learning histories as a tree structure, and includes entries each consisting of a node ID 601, a parent node ID 602, a child node ID 603, a difficulty coefficient 604, and a search flag 605. One entry corresponds to one learning history.

The node ID 601 is a field that stores the identification information of the node corresponding to a learning history. The parent node ID 602 is a field that stores the identification information of the parent node. The child node ID 603 is a field that stores the identification information of the child node. The difficulty coefficient 604 is a field that stores the difficulty of the simulation executed to obtain the learning history. The search flag 605 is a field that stores a flag indicating whether the learning history can be used: “ON” indicates that the learning history can be used, and “OFF” indicates that it cannot. A blank field indicates a node for which this determination has not yet been made.
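As an illustration of the entry layout described above, the following is a minimal sketch in Python. The class name `HistoryNode`, the helper `usable_nodes`, the use of a list for child IDs, and the encoding of the search flag as `True`/`False`/`None` are all assumptions made for this example and are not taken from the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HistoryNode:
    node_id: int                                         # node ID 601
    parent_id: Optional[int]                             # parent node ID 602 (None for the root)
    child_ids: List[int] = field(default_factory=list)   # child node ID 603 (assumed to be a list)
    difficulty: float = 0.0                              # difficulty coefficient 604
    search_flag: Optional[bool] = None                   # search flag 605: True=ON, False=OFF, None=blank


def usable_nodes(table: Dict[int, HistoryNode]) -> List[int]:
    """Return the IDs of nodes whose learning history may be used (flag ON)."""
    return [n.node_id for n in table.values() if n.search_flag is True]


# Hypothetical three-node tree: node 1 is the root with children 2 and 3.
table = {
    1: HistoryNode(1, None, [2, 3], 0.1, True),
    2: HistoryNode(2, 1, [], 0.1, False),
    3: HistoryNode(3, 1, [], 0.2, None),   # not yet judged
}
print(usable_nodes(table))
```

Keeping the blank state distinct from “OFF” matters because the description treats unjudged nodes separately from nodes that were judged unusable.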

図７は、実施例１の学習結果ＤＢ１６０のデータ構造の一例を示す図である。 FIG. 7 is a diagram illustrating an example of a data structure of the learning result DB 160 according to the first embodiment.

学習結果ＤＢ１６０は、結果ＩＤ７０１、ポリシ内部変数７０２、及び累積報酬７０３から構成されるエントリを含む。一つのエントリは、任意の難易度の強化学習によって算出された最適な学習結果に対応する。 The learning result DB 160 includes an entry including a result ID 701, a policy internal variable 702, and a cumulative reward 703. One entry corresponds to an optimal learning result calculated by reinforcement learning of any difficulty.

The result ID 701 is a field that stores identification information for identifying an entry in the learning result DB 160. The policy internal variable 702 is a field that stores the policy internal data 241 output as a learning result. The cumulative reward 703 is a field that stores the cumulative reward, which is the evaluation value used to evaluate a learning result. The cumulative reward is also used as an index for determining whether overfitting has occurred. Instead of or in addition to the cumulative reward, a key performance indicator (KPI) can be used as an evaluation value. There may be a plurality of KPIs.

図８は、実施例１の履歴ＤＢ１５０のデータ構造の一例を示す図である。 FIG. 8 is a diagram illustrating an example of a data structure of the history DB 150 according to the first embodiment.

履歴ＤＢ１５０は、履歴ＩＤ８０１、モデルパラメータ８０２、及び出力パラメータ８０３から構成されるエントリを含む。一つのエントリは、任意の難易度の強化学習の学習結果に対応する。 The history DB 150 includes an entry including a history ID 801, a model parameter 802, and an output parameter 803. One entry corresponds to a learning result of reinforcement learning of any difficulty.

履歴ＩＤ８０１は、履歴ＤＢ１５０のエントリを識別するための識別情報を格納するフィールドである。モデルパラメータ８０２は、任意の難易度の強化学習を実行するために入力されたパラメータを格納するフィールド群である。モデルパラメータ８０２は、環境パラメータ８１１及びエージェントパラメータ８１２を含む。出力パラメータ８０３は、任意の難易度の強化学習を実行することによって算出された学習結果を格納するフィールド群である。出力パラメータ８０３は、ポリシ内部変数８２１及び累積報酬８２２を含む。 The history ID 801 is a field for storing identification information for identifying an entry in the history DB 150. The model parameters 802 are a group of fields for storing parameters input for executing reinforcement learning of an arbitrary difficulty level. The model parameters 802 include environment parameters 811 and agent parameters 812. The output parameter 803 is a group of fields that stores a learning result calculated by executing reinforcement learning of an arbitrary difficulty level. The output parameters 803 include a policy internal variable 821 and a cumulative reward 822.

なお、エントリは学習条件等を格納するフィールドを含んでもよい。また、モデルパラメータ８０２は、環境パラメータのみを含んでもよい。 Note that the entry may include a field for storing a learning condition or the like. Further, the model parameters 802 may include only environmental parameters.
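A history DB entry as described above could be represented as follows. This is only a sketch: the class name `HistoryEntry`, the helper `store`, the dictionary-based parameter containers, and the sample values are hypothetical, and keying the store by history ID is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class HistoryEntry:
    history_id: int                           # history ID 801
    env_params: Dict[str, Any]                # environment parameter 811
    agent_params: Optional[Dict[str, Any]]    # agent parameter 812 (may be omitted)
    policy_vars: List[float]                  # policy internal variable 821
    cumulative_reward: float                  # cumulative reward 822


def store(db: Dict[int, HistoryEntry], entry: HistoryEntry) -> None:
    """Save one learning history, keyed by its history ID."""
    db[entry.history_id] = entry


history_db: Dict[int, HistoryEntry] = {}
store(history_db, HistoryEntry(1, {"time_step": 1.0}, None, [0.2, -0.5], 12.5))
store(history_db, HistoryEntry(2, {"time_step": 0.5}, {"alpha": 0.01}, [0.3, -0.4], 15.0))
```

The optional `agent_params` field reflects the note that the model parameters 802 may contain only environment parameters.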

図９は、実施例１の計算機１００が実行する処理の概要を説明するフローチャートである。 FIG. 9 is a flowchart illustrating an outline of processing executed by the computer 100 according to the first embodiment.

When the computer 100 receives the learning condition parameter information 170 from the terminal 101 (step S101), it sets model parameters (transition model parameters) based on that information (step S102) and executes reinforcement learning (step S103). At this point, the computer 100 sets the environment parameter 171 that realizes the environment with the lowest simulation difficulty. During reinforcement learning, when a trigger for outputting a learning history is detected, the learning history is stored in the history DB 150.

計算機１００は、任意のタイミングで、スコア判定処理を実行する（ステップＳ１０４）。 The computer 100 performs a score determination process at an arbitrary timing (step S104).

Based on the result of the score determination process, the computer 100 determines whether the optimal policy for the given simulation difficulty has been calculated (step S105). Here, the optimal policy means a policy that was calculated without overfitting or stagnation of learning efficiency, and that maximizes the reward while satisfying the constraint conditions.

If it is determined that the optimal policy has not been calculated, the computer 100 determines whether to use a learning history (step S106). That is, it determines whether the failure to obtain the optimal policy is caused by overfitting or by stagnation of learning efficiency.

If it is determined that the learning history is not to be used, that is, that learning is to continue with the current parameters, the computer 100 sets new model parameters based on the learning condition parameter information 170 and the learning result (step S102), and executes reinforcement learning (step S103).

具体的には、モデルパラメータに含まれるエージェントパラメータ１７２には、前回の強化学習の実行時のポリシ内部データ２４１が設定される。 Specifically, the policy internal data 241 at the time of the previous execution of the reinforcement learning is set in the agent parameter 172 included in the model parameter.

If it is determined that the learning history is to be used, the computer 100 selects the learning history to use (step S107), sets new model parameters based on the learning condition parameter information 170 and the selected learning history (step S102), and executes reinforcement learning (step S103).

例えば、計算機１００は、学習履歴に含まれるポリシ内部変数を反映したエージェントパラメータ１７２を算出し、学習履歴に含まれる環境パラメータ１７１を反映した環境パラメータ１７１を算出する。例えば、ポリシ内部変数５０１に学習履歴に含まれるポリシ内部変数が設定されたエージェントパラメータ１７２が算出される。 For example, the computer 100 calculates an agent parameter 172 reflecting a policy internal variable included in the learning history, and calculates an environment parameter 171 reflecting the environment parameter 171 included in the learning history. For example, an agent parameter 172 in which a policy internal variable included in the learning history is set as the policy internal variable 501 is calculated.

なお、環境パラメータ１７１及びエージェントパラメータ１７２のいずれか一方にのみ学習履歴を反映してもよい。 The learning history may be reflected on only one of the environment parameter 171 and the agent parameter 172.

ステップＳ１０５において最適ポリシが算出されたと判定された場合、計算機１００は、シミュレーション難易度を変更するか否かを判定する（ステップＳ１０８）。 If it is determined in step S105 that the optimal policy has been calculated, the computer 100 determines whether to change the simulation difficulty level (step S108).

シミュレーション難易度を変更しないと判定された場合、計算機１００は処理を終了する。これは、目標のシミュレーション難易度における最適ポリシが得られたことを示す。 If it is determined that the simulation difficulty level is not changed, the computer 100 ends the processing. This indicates that the optimal policy at the target simulation difficulty level has been obtained.

シミュレーション難易度を変更すると判定された場合、計算機１００は、シミュレーションの難易度を変更する（ステップＳ１０９）。 If it is determined that the simulation difficulty level is changed, the computer 100 changes the simulation difficulty level (step S109).

具体的には、計算機１００は、前回の強化学習によって算出されたポリシに基づいてエージェントパラメータ１７２を算出し、さらに、難易度が高いシミュレーションを実現するための環境の環境パラメータ１７１を算出する。 Specifically, the computer 100 calculates the agent parameter 172 based on the policy calculated by the previous reinforcement learning, and further calculates the environment parameter 171 of the environment for realizing the simulation with high difficulty.

その後、計算機１００は、変更されたモデルパラメータを設定し（ステップＳ１０２）、強化学習を実行する（ステップＳ１０３）。 Thereafter, the computer 100 sets the changed model parameters (Step S102), and executes the reinforcement learning (Step S103).
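The flow of steps S102 to S109 can be sketched as a simple loop. This is an illustrative simplification, not the embodiment's implementation: the policy is reduced to a single float, the difficulty to an integer level, and the callbacks `train`, `is_optimal`, and `should_roll_back` are hypothetical stand-ins for reinforcement learning, the score determination process, and the history-use decision.

```python
from typing import Callable, List, Tuple

Policy = float  # stand-in for the policy internal data


def run_curriculum(
    train: Callable[[Policy, int], Tuple[Policy, float]],
    is_optimal: Callable[[float, int], bool],
    should_roll_back: Callable[[List[float]], bool],
    target_difficulty: int,
    max_rounds: int = 100,
) -> Tuple[Policy, int]:
    difficulty = 0   # start from the easiest environment
    policy = 0.0
    history: List[Tuple[Policy, float]] = []   # snapshots at the current difficulty
    for _ in range(max_rounds):
        new_policy, reward = train(policy, difficulty)       # S102, S103
        history.append((new_policy, reward))                 # store a learning history
        if is_optimal(reward, difficulty):                   # S104, S105
            if difficulty == target_difficulty:              # S108: target reached
                return new_policy, difficulty
            difficulty += 1                                   # S109: raise difficulty
            policy = new_policy                               # carry the policy over
            history = []
        elif should_roll_back([r for _, r in history]):      # S106
            policy, _ = max(history, key=lambda h: h[1])     # S107: best past snapshot
        else:
            policy = new_policy                               # continue with current parameters
    return policy, difficulty
```

In this sketch the rollback picks the snapshot with the largest reward at the current difficulty; the embodiment allows other selection methods and may also restore environment parameters from the history.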

実施例１の強化学習アルゴリズムは、以下のような特徴を有する。 The reinforcement learning algorithm according to the first embodiment has the following features.

(Feature 1) The computer 100 first executes a simulation with a low difficulty. When it then executes a simulation with a changed difficulty, it sets model parameters calculated from the learning result of the reinforcement learning performed before the change. This enables efficient reinforcement learning and reduces the time required for learning.

(Feature 2) In reinforcement learning at a given difficulty, the computer 100 re-executes reinforcement learning using the learning result of an earlier reinforcement learning run rather than that of the immediately preceding run. This prevents the execution of reinforcement learning in which no increase in the cumulative reward (evaluation value) can be expected, and prevents reinforcement learning from continuing after overfitting has occurred.

（特徴２）の処理を実現するために、計算機１００は、任意のタイミングで、履歴ＤＢ１５０に学習履歴を保存する。 In order to realize the processing of (Feature 2), the computer 100 stores the learning history in the history DB 150 at an arbitrary timing.

図１０Ａ及び図１０Ｂは、実施例１の学習コントローラ１２０が実行する処理を説明するフローチャートである。学習コントローラ１２０は、外部入力を受け付けた場合、以下で説明する処理を実行する。なお、学習コントローラ１２０は、学習条件パラメータ情報１７０、最適ポリシ通知、継続指示、履歴使用指示、及び履歴更新通知のいずれかを外部入力として受け付ける。 FIGS. 10A and 10B are flowcharts illustrating processing executed by the learning controller 120 according to the first embodiment. When the learning controller 120 receives an external input, the learning controller 120 executes a process described below. The learning controller 120 receives any of the learning condition parameter information 170, the optimal policy notification, the continuation instruction, the history use instruction, and the history update notification as an external input.

学習コントローラ１２０は、学習条件パラメータ情報１７０を受信したか否かを判定する（ステップＳ２０１）。 The learning controller 120 determines whether or not the learning condition parameter information 170 has been received (Step S201).

学習条件パラメータ情報１７０を受信したと判定された場合、学習コントローラ１２０は、総学習回数及び履歴関係管理情報１２３を初期化する（ステップＳ２０２）。 When it is determined that the learning condition parameter information 170 has been received, the learning controller 120 initializes the total number of learning times and the history relation management information 123 (Step S202).

具体的には、学習コントローラ１２０は、総学習回数を「０」に設定する。また、学習コントローラ１２０は、履歴関係管理情報１２３の全てのエントリを削除した後、一つのエントリを追加し、追加されたエントリのノードＩＤ６０１に「１」を設定する。 Specifically, the learning controller 120 sets the total number of times of learning to “0”. Further, the learning controller 120 deletes all entries of the history relationship management information 123, adds one entry, and sets “1” to the node ID 601 of the added entry.

次に、学習コントローラ１２０は、学習条件パラメータ情報１７０に基づいて、初期モデルパラメータを算出する（ステップＳ２０３）。 Next, the learning controller 120 calculates an initial model parameter based on the learning condition parameter information 170 (Step S203).

ステップＳ２０３では、学習コントローラ１２０は、モデルパラメータの算出時に、シミュレーション難易度を示す難易度係数を算出する。学習コントローラ１２０は、履歴関係管理情報１２３を参照し、追加されたエントリの難易度係数６０４に算出された難易度係数を設定する。また、学習コントローラ１２０は、ルートノードの識別情報をポインタとして保持する。 In step S203, the learning controller 120 calculates a difficulty coefficient indicating the simulation difficulty when calculating the model parameters. The learning controller 120 refers to the history relationship management information 123 and sets the calculated difficulty coefficient as the difficulty coefficient 604 of the added entry. Further, the learning controller 120 holds the identification information of the root node as a pointer.

次に、学習コントローラ１２０は、初期モデルパラメータをサブプロセスコントローラ１３０に出力する（ステップＳ２０４）。その後、学習コントローラ１２０は、待ち状態に移行し、処理を終了する。このとき、学習コントローラ１２０は、初期モデルパラメータとともに、追加されたエントリのノードＩＤ６０１に設定された識別情報を出力する。 Next, the learning controller 120 outputs the initial model parameters to the sub-process controller 130 (Step S204). After that, the learning controller 120 shifts to a waiting state and ends the processing. At this time, the learning controller 120 outputs the identification information set in the node ID 601 of the added entry together with the initial model parameters.

ステップＳ２０１において、学習条件パラメータ情報１７０を受信していないと判定された場合、学習コントローラ１２０は、最適ポリシ通知を受信したか否かを判定する（ステップＳ２０５）。 When it is determined in step S201 that the learning condition parameter information 170 has not been received, the learning controller 120 determines whether an optimal policy notification has been received (step S205).

If it is determined that an optimal policy notification has been received, the learning controller 120 updates the history relationship management information 123 based on the update determination list included in the optimal policy notification (step S206). The update determination list is a list of node identification information and is described with reference to FIG. 12.

具体的には、学習コントローラ１２０は、更新判定リストを参照し、選択対象として除外されることを示す除外フラグが付与されていないノードに対応するエントリの探索フラグ６０５に「ＯＮ」を設定する。また、学習コントローラ１２０は、除外フラグが付与されたノードに対応するエントリの探索フラグ６０５に「ＯＦＦ」を設定する。以下の説明では、更新判定リストに登録され、かつ、除外フラグが付与されていないノードを候補ノードと記載する。 Specifically, the learning controller 120 refers to the update determination list, and sets “ON” to the search flag 605 of the entry corresponding to the node to which the exclusion flag indicating that the node is to be excluded from selection is not added. Further, the learning controller 120 sets “OFF” to the search flag 605 of the entry corresponding to the node to which the exclusion flag has been added. In the following description, a node that is registered in the update determination list and has not been given an exclusion flag is referred to as a candidate node.
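The flag update in step S206 can be sketched as follows. The representation of the update determination list as `(node_id, excluded)` pairs and the function name `apply_update_list` are assumptions made for this example; the actual list format is the one described with FIG. 12.

```python
from typing import Dict, List, Tuple


def apply_update_list(
    search_flags: Dict[int, str],
    update_list: List[Tuple[int, bool]],
) -> List[int]:
    """Set search flag 605 per node and return the candidate node IDs.

    Each item is (node_id, excluded); excluded=True means the node
    carries the exclusion flag.
    """
    candidates = []
    for node_id, excluded in update_list:
        search_flags[node_id] = "OFF" if excluded else "ON"
        if not excluded:
            candidates.append(node_id)
    return candidates


flags = {1: "", 2: "", 3: ""}   # blank: not yet judged
candidates = apply_update_list(flags, [(1, False), (2, True)])
```

Nodes that do not appear in the list (node 3 here) keep their blank flag.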

次に、学習コントローラ１２０は、シミュレーション難易度を変更するか否かを判定する（ステップＳ２０７）。 Next, the learning controller 120 determines whether to change the simulation difficulty level (step S207).

例えば、学習コントローラ１２０は、前回出力した環境パラメータ１７１に含まれる一部の値が目標値に一致するか否かを判定する。 For example, the learning controller 120 determines whether or not some of the values included in the previously output environment parameter 171 match the target value.

前回出力した環境パラメータ１７１に含まれる一部の値が目標値に一致しない場合、学習コントローラ１２０は、シミュレーション難易度を変更すると判定する。 When some values included in the previously output environment parameter 171 do not match the target values, the learning controller 120 determines that the simulation difficulty level is to be changed.

シミュレーション難易度を変更すると判定された場合、学習コントローラ１２０は、シミュレーション難易度を変更した環境を実現するための新規モデルパラメータを算出する（ステップＳ２０８）。具体的には、以下のような処理が実行される。 When it is determined that the simulation difficulty level is changed, the learning controller 120 calculates a new model parameter for realizing the environment in which the simulation difficulty level has been changed (Step S208). Specifically, the following processing is executed.

学習コントローラ１２０は、候補ノードの中から一つのノードを選択する。ここでは、累積報酬が最も大きいノードが選択されるものとする。学習コントローラ１２０は、選択されたノードの識別情報をポインタとして保持する。 The learning controller 120 selects one node from the candidate nodes. Here, it is assumed that the node with the largest accumulated reward is selected. The learning controller 120 holds the identification information of the selected node as a pointer.

学習コントローラ１２０は、学習条件パラメータ情報１７０及び選択されたノードに対応する学習履歴に含まれる環境パラメータ１７１に基づいて、新たな環境パラメータ１７１を算出する。また、学習コントローラ１２０は、選択されたノードに対応する学習履歴に含まれるポリシ内部データに基づいて新たなエージェントパラメータ１７２を算出する。学習コントローラ１２０は、環境パラメータ１７１に基づいてシミュレーション難易度を示す難易度係数を算出する。 The learning controller 120 calculates a new environment parameter 171 based on the learning condition parameter information 170 and the environment parameter 171 included in the learning history corresponding to the selected node. Further, the learning controller 120 calculates a new agent parameter 172 based on the policy internal data included in the learning history corresponding to the selected node. The learning controller 120 calculates a difficulty coefficient indicating the simulation difficulty based on the environment parameter 171.

The learning controller 120 adds an entry to the history relationship management information 123, sets identification information in the node ID 601 of the added entry, sets the identification information of the node held in the pointer in the parent node ID 602, and sets the calculated difficulty coefficient in the difficulty coefficient 604.

学習コントローラ１２０は、ポインタに設定されたノードの識別情報に対応するエントリの子ノードＩＤ６０３に、追加されたエントリのノードＩＤ６０１に設定された識別情報を設定する。以上がステップＳ２０８の処理の説明である。 The learning controller 120 sets the identification information set in the node ID 601 of the added entry to the child node ID 603 of the entry corresponding to the node identification information set in the pointer. The above is the description of the process in step S208.
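The bookkeeping at the end of step S208 (add an entry, link it to the node held in the pointer) can be sketched as below. The dictionary layout, the sequential ID assignment, and the function name `add_child_entry` are assumptions for illustration only.

```python
from typing import Any, Dict

Entry = Dict[str, Any]


def add_child_entry(table: Dict[int, Entry], pointer: int, difficulty: float) -> int:
    """Add a new node under the node held in `pointer` and return its ID."""
    new_id = max(table) + 1   # assumption: node IDs are sequential integers
    table[new_id] = {
        "parent": pointer,           # parent node ID 602
        "children": [],              # child node ID 603
        "difficulty": difficulty,    # difficulty coefficient 604
        "search_flag": None,         # search flag 605 left blank
    }
    table[pointer]["children"].append(new_id)   # register the child on the parent
    return new_id


table = {1: {"parent": None, "children": [], "difficulty": 0.1, "search_flag": "ON"}}
new_id = add_child_entry(table, pointer=1, difficulty=0.2)
```

The same add-and-link pattern appears again in steps S214, S218, and S220, differing only in which difficulty coefficient is written to the new entry.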

Next, the learning controller 120 outputs the new model parameters to the sub-process controller 130 (step S209). At this time, the learning controller 120 outputs, together with the new model parameters, the identification information set in the node ID 601 of the added entry. After that, the learning controller 120 shifts to a waiting state and ends the processing.

シミュレーション難易度を変更しないと判定された場合、学習コントローラ１２０は、待ち状態に移行し、処理を終了する。 If it is determined that the simulation difficulty level is not changed, the learning controller 120 shifts to the waiting state and ends the processing.

ステップＳ２０５において、最適ポリシ通知を受信していないと判定された場合、学習コントローラ１２０は、継続指示を受信したか否かを判定する（ステップＳ２１１）。 If it is determined in step S205 that the optimal policy notification has not been received, the learning controller 120 determines whether a continuation instruction has been received (step S211).

継続指示を受信したと判定された場合、学習コントローラ１２０は、履歴関係管理情報１２３を更新する（ステップＳ２１２）。 When it is determined that the continuation instruction has been received, the learning controller 120 updates the history relationship management information 123 (Step S212).

具体的には、学習コントローラ１２０は、更新判定リストに登録されたノードに対応するエントリを特定し、特定されたエントリの探索フラグ６０５に「ＯＦＦ」を設定する。 Specifically, the learning controller 120 specifies the entry corresponding to the node registered in the update determination list, and sets “OFF” to the search flag 605 of the specified entry.

次に、学習コントローラ１２０は、総学習回数が上限回数以下であるか否かを判定する（ステップＳ２１３）。すなわち、現在のモデルパラメータに基づいて強化学習を継続するか否かが判定される。 Next, the learning controller 120 determines whether or not the total number of times of learning is equal to or less than the upper limit number (Step S213). That is, it is determined whether to continue the reinforcement learning based on the current model parameters.

総学習回数が上限回数より大きいと判定された場合、学習コントローラ１２０は、待ち状態に移行し、処理を終了する。 When it is determined that the total number of times of learning is larger than the upper limit number of times, the learning controller 120 shifts to a waiting state and ends the processing.

総学習回数が上限回数以下であると判定された場合、学習コントローラ１２０は、前回の強化学習の学習結果を反映した更新モデルパラメータを算出する（ステップＳ２１４）。具体的には、以下のような処理が実行される。 When it is determined that the total number of learning times is equal to or less than the upper limit number, the learning controller 120 calculates an updated model parameter reflecting the learning result of the previous reinforcement learning (step S214). Specifically, the following processing is executed.

The learning controller 120 calculates an agent parameter 172 that sets the policy internal data 241 from the previous reinforcement learning run as the initial value. For the environment parameter 171, the learning controller 120 uses the same values as in the previous reinforcement learning.

The learning controller 120 adds an entry to the history relationship management information 123, sets identification information in the node ID 601 of the added entry, sets the identification information of the node held in the pointer in the parent node ID 602, and sets the difficulty coefficient of the previous reinforcement learning in the difficulty coefficient 604.

学習コントローラ１２０は、ポインタに設定されたノードの識別情報に対応するエントリの子ノードＩＤ６０３に、追加されたエントリのノードＩＤ６０１に設定された識別情報を設定する。 The learning controller 120 sets the identification information set in the node ID 601 of the added entry to the child node ID 603 of the entry corresponding to the node identification information set in the pointer.

また、学習コントローラ１２０は、追加されたエントリのノードＩＤ６０１に設定された識別情報をポインタとして保持する。以上がステップＳ２１４の処理の説明である。 Further, the learning controller 120 holds the identification information set in the node ID 601 of the added entry as a pointer. The above is the description of the process in step S214.

Next, the learning controller 120 outputs the updated model parameters to the sub-process controller 130 (step S215). At this time, the learning controller 120 outputs, together with the updated model parameters, the identification information set in the node ID 601 of the added entry. After that, the learning controller 120 shifts to a waiting state and ends the processing.

ステップＳ２１１において、継続指示を受信していないと判定された場合、学習コントローラ１２０は、履歴使用指示を受信したか否かを判定する（ステップＳ２１６）。 If it is determined in step S211 that the continuation instruction has not been received, the learning controller 120 determines whether a history use instruction has been received (step S216).

履歴使用指示を受信したと判定された場合、学習コントローラ１２０は、履歴関係管理情報１２３を更新する（ステップＳ２１７）。 When it is determined that the history use instruction has been received, the learning controller 120 updates the history relationship management information 123 (Step S217).

具体的には、学習コントローラ１２０は、更新判定リストを参照して、除外フラグが付与されていないノードに対応するエントリの探索フラグ６０５に「ＯＮ」を設定し、除外フラグが付与されたノードに対応するエントリの探索フラグ６０５に「ＯＦＦ」を設定する。 Specifically, the learning controller 120 refers to the update determination list, sets “ON” to the search flag 605 of the entry corresponding to the node to which the exclusion flag is not assigned, and sets the search flag 605 to the node to which the exclusion flag is assigned. The search flag 605 of the corresponding entry is set to “OFF”.

次に、学習コントローラ１２０は、使用する学習履歴を選択するためのノード選択処理を実行する（ステップＳ２１８）。具体的には、以下のような処理が実行される。 Next, the learning controller 120 executes a node selection process for selecting a learning history to be used (Step S218). Specifically, the following processing is executed.

学習コントローラ１２０は、履歴関係管理情報１２３を参照して、ポインタに設定されたノードの識別情報に対応するエントリを特定する。学習コントローラ１２０は、特定されたエントリを基準として設定し、選択方式３０７に設定された探索方式にしたがってノードを選択する。学習コントローラ１２０は、選択されたノードの識別情報をポインタとして保持する。 The learning controller 120 refers to the history relationship management information 123 and specifies an entry corresponding to the identification information of the node set in the pointer. The learning controller 120 sets the specified entry as a reference, and selects a node according to the search method set in the selection method 307. The learning controller 120 holds the identification information of the selected node as a pointer.

For example, when the selection method 307 is “depth-first”, the learning controller 120 selects a node whose difficulty coefficient 604 matches the difficulty coefficient of the identified node. Nodes whose search flag 605 is “OFF” or blank are excluded from the search. When multiple nodes qualify, the learning controller 120 refers to the history DB 150, retrieves the entries whose history ID 801 matches the identification information of those nodes, and selects the node corresponding to the entry with the largest cumulative reward.

他の選択方法としては、学習コントローラ１２０は、親ノードが、ポインタに対応するノードの親ノードに一致するノード、又は、累積報酬が最も大きいノードを選択する。実施例１では、学習履歴に環境パラメータ１７１が含まれているため、難易度が異なるシミュレーションを実行することができる。 As another selection method, the learning controller 120 selects a node whose parent node matches the parent node of the node corresponding to the pointer, or a node with the largest accumulated reward. In the first embodiment, since the learning history includes the environment parameter 171, it is possible to execute simulations having different degrees of difficulty.

The learning controller 120 adds an entry to the history relationship management information 123, sets identification information in the node ID 601 of the added entry, sets the identification information of the node held in the pointer in the parent node ID 602, and sets a difficulty coefficient in the difficulty coefficient 604. The value set in the difficulty coefficient 604 is the same as the difficulty coefficient of the node that the pointer held before it was updated.

学習コントローラ１２０は、履歴関係管理情報１２３を参照し、ポインタに設定されたノードの識別情報に対応するエントリの子ノードＩＤ６０３に、追加されたエントリのノードＩＤ６０１に設定された識別情報を設定する。以上がステップＳ２１８の処理の説明である。 The learning controller 120 refers to the history relation management information 123, and sets the identification information set in the node ID 601 of the added entry in the child node ID 603 of the entry corresponding to the identification information of the node set in the pointer. The above is the description of the process in step S218.
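The “depth-first” selection described for step S218 can be sketched as follows. The dictionary layout, the separate `rewards` mapping standing in for a lookup in the history DB 150, and the function name `select_node` are assumptions for this example. Whether the node held in the pointer should itself be excluded is not stated in the description, so the sketch relies only on the search flag.

```python
from typing import Any, Dict, Optional


def select_node(
    nodes: Dict[int, Dict[str, Any]],
    rewards: Dict[int, float],
    pointer: int,
) -> Optional[int]:
    """Pick the usable node at the same difficulty with the largest cumulative reward."""
    reference = nodes[pointer]["difficulty"]
    candidates = [
        node_id
        for node_id, node in nodes.items()
        if node["search_flag"] == "ON" and node["difficulty"] == reference
    ]
    if not candidates:
        return None   # no usable learning history at this difficulty
    return max(candidates, key=lambda node_id: rewards[node_id])


nodes = {
    1: {"difficulty": 0.1, "search_flag": "ON"},
    2: {"difficulty": 0.1, "search_flag": "ON"},
    3: {"difficulty": 0.1, "search_flag": "OFF"},   # excluded
    4: {"difficulty": 0.2, "search_flag": "ON"},    # different difficulty
    5: {"difficulty": 0.1, "search_flag": ""},      # blank: current node
}
rewards = {1: 5.0, 2: 8.0, 3: 100.0, 4: 50.0, 5: 0.0}
selected = select_node(nodes, rewards, pointer=5)
```

Node 3 has the largest reward but is skipped because its flag is “OFF”, and node 4 is skipped because its difficulty differs from that of the reference node.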

次に、学習コントローラ１２０は、履歴ＤＢ１５０を参照して、選択されたノードに対応するエントリを検索し、検索されたエントリに基づいてモデルパラメータを算出し、復元モデルパラメータとしてサブプロセスコントローラ１３０に出力する（ステップＳ２１９）。その後、学習コントローラ１２０は、待ち状態に移行し、処理を終了する。このとき、学習コントローラ１２０は、復元モデルパラメータとともに、追加されたエントリのノードＩＤ６０１に設定された識別情報を出力する。具体的には、以下のような処理が実行される。 Next, the learning controller 120 refers to the history DB 150, searches for an entry corresponding to the selected node, calculates a model parameter based on the searched entry, and outputs the model parameter to the sub-process controller 130 as a restored model parameter. (Step S219). After that, the learning controller 120 shifts to a waiting state and ends the processing. At this time, the learning controller 120 outputs the identification information set in the node ID 601 of the added entry together with the restoration model parameters. Specifically, the following processing is executed.

学習コントローラ１２０は、学習履歴に含まれるポリシ内部変数を反映したエージェントパラメータ１７２を算出する。例えば、ポリシ内部変数５０１に学習履歴に含まれるポリシ内部変数が設定されたエージェントパラメータ１７２が算出される。 The learning controller 120 calculates an agent parameter 172 reflecting a policy internal variable included in the learning history. For example, an agent parameter 172 in which a policy internal variable included in the learning history is set as the policy internal variable 501 is calculated.

学習コントローラ１２０は、現在のシミュレーションと学習履歴に対応するシミュレーションの難易度が同一である場合、現在の環境パラメータ１７１をそのまま用いる。現在のシミュレーションと学習履歴に対応するシミュレーションの難易度が異なる場合、学習コントローラ１２０は、学習履歴に含まれる環境パラメータ１７１を反映した環境パラメータ１７１を算出する。 When the difficulty of the current simulation is the same as the difficulty of the simulation corresponding to the learning history, the learning controller 120 uses the current environment parameter 171 as it is. When the difficulty level of the current simulation is different from the difficulty level of the simulation corresponding to the learning history, the learning controller 120 calculates an environment parameter 171 reflecting the environment parameter 171 included in the learning history.

すなわち、現在のシミュレーションと学習履歴に対応するシミュレーションの難易度が同一である場合、エージェントパラメータ１７２が異なるモデルパラメータが算出される。現在のシミュレーションと学習履歴に対応するシミュレーションの難易度が異なる場合、環境パラメータ１７１及びエージェントパラメータ１７２が異なるモデルパラメータが算出される。 That is, when the difficulty of the current simulation is the same as that of the simulation corresponding to the learning history, model parameters having different agent parameters 172 are calculated. When the difficulty of the current simulation differs from the difficulty of the simulation corresponding to the learning history, model parameters having different environment parameters 171 and agent parameters 172 are calculated.

If the learning history does not include the environment parameter 171, the current environment parameter 171 is used. This concludes the description of the processing in step S219.
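The parameter restoration in step S219 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the class names `ModelParams` and `HistoryEntry`, the function `restore_model_params`, the dictionary layout, and the choice to "reflect" a saved environment parameter by overlaying it on the current one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelParams:
    environment: dict  # stands in for the environment parameter 171
    agent: dict        # stands in for the agent parameter 172


@dataclass(frozen=True)
class HistoryEntry:
    policy_internal: dict               # policy internal variable saved in the learning history
    difficulty: float                   # difficulty of the simulation that produced the entry
    environment: Optional[dict] = None  # saved environment parameter, if any


def restore_model_params(current: ModelParams, current_difficulty: float,
                         entry: HistoryEntry) -> ModelParams:
    # The agent parameter always receives the policy internal variable from the history.
    agent = {**current.agent, "policy_internal": dict(entry.policy_internal)}
    # Same difficulty, or no saved environment parameter: keep the current environment.
    if entry.environment is None or entry.difficulty == current_difficulty:
        return ModelParams(environment=dict(current.environment), agent=agent)
    # Different difficulty: overlay the saved environment parameter (assumed meaning of "reflect").
    return ModelParams(environment={**current.environment, **entry.environment}, agent=agent)
```

With equal difficulty only the agent parameter differs from the current model parameters; with different difficulty both the environment parameter and the agent parameter differ, matching the two cases described above.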

ステップＳ２１６において、継続指示を受信していないと判定された場合、すなわち、履歴更新通知を受信したと判定された場合、学習コントローラ１２０は、履歴関係管理情報１２３を更新する（ステップＳ２２０）。具体的には、以下のような処理が実行される。 If it is determined in step S216 that the continuation instruction has not been received, that is, if it is determined that the history update notification has been received, the learning controller 120 updates the history relationship management information 123 (step S220). Specifically, the following processing is executed.

The learning controller 120 adds an entry to the history relationship management information 123, sets identification information in the node ID 601 of the added entry, sets the identification information of the node held in the pointer in the parent node ID 602, and sets the difficulty coefficient of the reinforcement learning being executed in the difficulty coefficient 604.

The learning controller 120 refers to the history relationship management information 123 and, in the child node ID 603 of the entry corresponding to the identification information of the node held in the pointer, sets the identification information set in the node ID 601 of the added entry.

学習コントローラ１２０は、追加されたエントリのノードＩＤ６０１に設定された識別情報をサブプロセスコントローラ１３０に出力する。以上がステップＳ２２０の処理の説明である。 The learning controller 120 outputs the identification information set in the node ID 601 of the added entry to the sub process controller 130. The above is the description of the process in step S220.

次に、学習コントローラ１２０は、総学習回数に、履歴更新通知に含まれる学習回数を加算する（ステップＳ２２１）。その後、学習コントローラ１２０は、待ち状態に移行し、処理を終了する。 Next, the learning controller 120 adds the number of times of learning included in the history update notification to the total number of times of learning (step S221). After that, the learning controller 120 shifts to a waiting state and ends the processing.
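Steps S220 and S221 amount to appending a node to a tree and accumulating a counter. The sketch below illustrates this under assumptions: the names `RelationEntry`, `LearningController`, and `on_history_update` are hypothetical, identifiers are simple sequential integers, and how the pointer is advanced is left to the caller because it is not specified in this passage.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List, Optional


@dataclass
class RelationEntry:
    node_id: int                                         # node ID 601
    parent_id: Optional[int]                             # parent node ID 602
    child_ids: List[int] = field(default_factory=list)   # child node ID 603
    difficulty: float = 1.0                              # difficulty coefficient 604
    search_flag: bool = True                             # search flag 605 (True means "ON")


class LearningController:
    def __init__(self) -> None:
        self.relations: Dict[int, RelationEntry] = {}   # history relationship management information 123
        self.pointer: Optional[int] = None              # node the current learning run started from
        self.total_learning_count = 0
        self._ids = count(1)                            # assumed ID scheme: 1, 2, 3, ...

    def on_history_update(self, learning_count: int, difficulty: float) -> int:
        # S220: add an entry whose parent is the node held in the pointer.
        node_id = next(self._ids)
        self.relations[node_id] = RelationEntry(node_id, self.pointer, difficulty=difficulty)
        if self.pointer is not None:
            self.relations[self.pointer].child_ids.append(node_id)
        # S221: add the learning count carried by the history update notification.
        self.total_learning_count += learning_count
        # The node ID is returned so it can be passed to the sub-process controller.
        return node_id
```

Keeping both the parent and the child links makes it possible to walk the history in either direction when a learning history is later selected for reuse.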

Note that the learning controller 120 may periodically refer to the history relationship management information 123, delete the entries whose search flag 605 is set to "OFF", and delete the corresponding entries from the history DB 150.
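The optional periodic cleanup can be illustrated with a short sketch. The function name `prune_history`, the use of plain dictionaries keyed by node ID, and the boolean encoding of the search flag are assumptions; the text does not say whether parent and child links of surviving entries are also adjusted, so the sketch leaves them untouched.

```python
from typing import Dict, List


def prune_history(search_flags: Dict[int, bool], history_db: Dict[int, dict]) -> List[int]:
    """Delete every node whose search flag is OFF (False) from both stores."""
    removed = [node_id for node_id, flag in search_flags.items() if not flag]
    for node_id in removed:
        del search_flags[node_id]        # entry in the history relationship management information
        history_db.pop(node_id, None)    # corresponding entry in the history DB, if present
    return removed
```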

図１１は、実施例１のサブプロセスコントローラ１３０が実行する処理を説明するフローチャートである。サブプロセスコントローラ１３０は、学習コントローラ１２０からモデルパラメータを受信した場合、以下で説明する処理を実行する。 FIG. 11 is a flowchart illustrating a process executed by the sub-process controller 130 according to the first embodiment. When receiving the model parameters from the learning controller 120, the sub-process controller 130 executes processing described below.

サブプロセスコントローラ１３０は、受信したモデルパラメータに基づいて、環境モジュール１３１及びエージェントモジュール１３２を構築する（ステップＳ３０１）。 The sub-process controller 130 constructs an environment module 131 and an agent module 132 based on the received model parameters (Step S301).

サブプロセスコントローラ１３０は、環境モジュール１３１及びエージェントモジュール１３２を用いてシミュレーションを実行する（ステップＳ３０２）。シミュレーションでは、現在の状態の取得、行動の選択、及び状態の更新が行われる。実施例１では、一回のシミュレーション毎にポリシが更新される。なお、サブプロセスコントローラ１３０（オプティマイザ２３０）は、学習終了条件が満たされた場合に、ポリシを更新してもよい。 The sub-process controller 130 executes a simulation using the environment module 131 and the agent module 132 (Step S302). In the simulation, acquisition of a current state, selection of an action, and update of a state are performed. In the first embodiment, the policy is updated for each simulation. Note that the sub-process controller 130 (optimizer 230) may update the policy when the learning end condition is satisfied.

サブプロセスコントローラ１３０は、保存条件を満たすか否かを判定する（ステップＳ３０３）。 The sub process controller 130 determines whether the storage condition is satisfied (step S303).

例えば、サブプロセスコントローラ１３０は、学習終了条件を満たす場合、又は、シミュレーションの実行回数が学習回数３０２の値と一致する場合、保存条件を満たすと判定する。また、ポリシ内部データ２４１が更新された場合、保存条件を満たすと判定されてもよい。 For example, the sub-process controller 130 determines that the storage condition is satisfied when the learning end condition is satisfied, or when the number of times of executing the simulation matches the value of the learning number 302. When the policy internal data 241 is updated, it may be determined that the storage condition is satisfied.

保存条件を満たさないと判定された場合、サブプロセスコントローラ１３０は、ステップＳ３０６に進む。 If it is determined that the storage condition is not satisfied, the sub-process controller 130 proceeds to step S306.

保存条件を満たすと判定された場合、サブプロセスコントローラ１３０は、履歴ＤＢ１５０にモデルパラメータ及び学習結果を格納する（ステップＳ３０４）。 If it is determined that the storage condition is satisfied, the sub-process controller 130 stores the model parameters and the learning result in the history DB 150 (Step S304).

具体的には、サブプロセスコントローラ１３０は、履歴ＤＢ１５０にエントリを追加する。サブプロセスコントローラ１３０は、追加されたエントリの履歴ＩＤ８０１に、学習コントローラ１２０から通知されたノードの識別情報を設定する。これによって、履歴ＤＢ１５０のエントリ及び履歴関係管理情報１２３のエントリが関連づけられる。また、サブプロセスコントローラ１３０は、ノードの識別情報を更新判定リストに登録する。 Specifically, the sub-process controller 130 adds an entry to the history DB 150. The sub-process controller 130 sets the identification information of the node notified from the learning controller 120 to the history ID 801 of the added entry. Thereby, the entry of the history DB 150 and the entry of the history relationship management information 123 are associated with each other. Further, the sub-process controller 130 registers the identification information of the node in the update determination list.

次に、サブプロセスコントローラ１３０は、学習コントローラ１２０に履歴更新通知を出力する（ステップＳ３０５）。サブプロセスコントローラ１３０は、学習コントローラ１２０からノードの識別情報が入力されるまで待ち状態に移行する。ノードの識別情報が入力された場合、サブプロセスコントローラ１３０はステップＳ３０６に進む。 Next, the sub-process controller 130 outputs a history update notification to the learning controller 120 (Step S305). The sub-process controller 130 shifts to a waiting state until the identification information of the node is input from the learning controller 120. If the node identification information has been input, the sub-process controller 130 proceeds to step S306.

次に、サブプロセスコントローラ１３０は、学習終了条件を満たすか否かを判定する（ステップＳ３０６）。 Next, the sub-process controller 130 determines whether or not a learning end condition is satisfied (step S306).

例えば、サブプロセスコントローラ１３０は、シミュレーションの実行回数が学習回数３０２の値と一致する場合、又は、更新後の状態が終了状態に一致する場合、学習終了条件を満たすと判定する。 For example, the sub-process controller 130 determines that the learning end condition is satisfied when the number of times of execution of the simulation matches the value of the number of times of learning 302 or when the updated state matches the end state.

学習終了条件を満たさないと判定された場合、サブプロセスコントローラ１３０は、ステップＳ３０２に戻る。 If it is determined that the learning end condition is not satisfied, the sub-process controller 130 returns to step S302.

学習終了条件を満たすと判定された場合、サブプロセスコントローラ１３０は、スコア判定モジュール１４０にスコア判定要求を出力する（ステップＳ３０７）。その後、サブプロセスコントローラ１３０は、処理を終了する。なお、スコア判定要求には更新判定リストが含まれる。 When it is determined that the learning end condition is satisfied, the sub-process controller 130 outputs a score determination request to the score determination module 140 (Step S307). Thereafter, the sub-process controller 130 ends the processing. Note that the score determination request includes an update determination list.
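The loop of steps S302 to S307 can be sketched as below. Everything named here is hypothetical: `run_learning`, the callbacks `simulate`, `snapshot`, and `notify_update`, and the optional `should_save` predicate. The sketch also assumes that the learning count sent with each history update notification is the number of simulations since the previous notification, which the text does not state.

```python
from typing import Callable, Dict, List, Optional


def run_learning(initial_node_id: int,
                 learning_count: int,
                 simulate: Callable[[], bool],
                 snapshot: Callable[[], dict],
                 history_db: Dict[int, dict],
                 notify_update: Callable[[int], int],
                 should_save: Optional[Callable[[int], bool]] = None) -> List[int]:
    """Run simulations until the end condition holds; return the update determination list."""
    update_list: List[int] = []
    node_id = initial_node_id   # node ID received from the learning controller
    runs = 0
    since_last_save = 0
    while True:
        reached_end_state = simulate()   # S302: one simulation, policy updated inside
        runs += 1
        since_last_save += 1
        done = reached_end_state or runs == learning_count           # S306 criterion
        save = done or (should_save is not None and should_save(runs))  # S303
        if save:
            history_db[node_id] = snapshot()          # S304: store parameters and result
            update_list.append(node_id)
            node_id = notify_update(since_last_save)  # S305: wait for the next node ID
            since_last_save = 0
        if done:
            return update_list   # S307: sent with the score determination request
```

For example, with `learning_count=4` and `should_save=lambda r: r % 2 == 0`, two history entries are stored and both node IDs appear in the returned list.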

図１２は、実施例１のスコア判定モジュール１４０が実行する処理を説明するフローチャートである。スコア判定モジュール１４０は、サブプロセスコントローラ１３０からスコア判定要求を受信した場合、以下で説明する処理を実行する。 FIG. 12 is a flowchart illustrating a process performed by the score determination module 140 according to the first embodiment. When receiving a score determination request from the sub-process controller 130, the score determination module 140 executes processing described below.

スコア判定モジュール１４０は、更新判定リストのループ処理を開始する（ステップＳ４０１）。 The score determination module 140 starts a loop processing of the update determination list (step S401).

具体的には、スコア判定モジュール１４０は、更新判定リストに登録されたノードの中からターゲットノードを選択する。 Specifically, the score determination module 140 selects a target node from the nodes registered in the update determination list.

次に、スコア判定モジュール１４０は、スコア判定モジュール１４０内に環境モジュール１３１及びエージェントモジュール１３２を構築する（ステップＳ４０２）。 Next, the score determination module 140 constructs an environment module 131 and an agent module 132 in the score determination module 140 (Step S402).

具体的には、スコア判定モジュール１４０は、ターゲットノードの識別情報に基づいて履歴ＤＢ１５０を参照して、環境モジュール１３１のパラメータ及びエージェントモジュール１３２のパラメータを取得する。スコア判定モジュール１４０は、取得した各パラメータを用いて環境モジュール１３１及びエージェントモジュール１３２を構築する。 Specifically, the score determination module 140 acquires the parameters of the environment module 131 and the parameters of the agent module 132 with reference to the history DB 150 based on the identification information of the target node. The score determination module 140 constructs an environment module 131 and an agent module 132 using the acquired parameters.

次に、スコア判定モジュール１４０は、環境モジュール１３１及びエージェントモジュール１３２を用いたシミュレーションを実行することによって評価値を算出する（ステップＳ４０３）。 Next, the score determination module 140 calculates an evaluation value by executing a simulation using the environment module 131 and the agent module 132 (step S403).

具体的には、スコア判定モジュール１４０は、終了条件が満たされるまでシミュレーションを繰り返し実行して、累積報酬を算出する。なお、当該シミュレーションでは、ポリシの更新が行われないように制御される。 Specifically, the score determination module 140 repeatedly executes the simulation until the termination condition is satisfied, and calculates the cumulative reward. In the simulation, control is performed so that the policy is not updated.

次に、スコア判定モジュール１４０は、累積報酬が閾値より大きいか否かを判定する（ステップＳ４０４）。 Next, the score determination module 140 determines whether the accumulated reward is larger than a threshold (Step S404).

なお、複数の種類の評価値が設定されている場合、スコア判定モジュール１４０は、評価値の組合せから定義される判定基準を満たすか否かを判定する。 When a plurality of types of evaluation values are set, the score determination module 140 determines whether or not a criterion defined from a combination of evaluation values is satisfied.

累積報酬が閾値より大きいと判定された場合、スコア判定モジュール１４０は、学習結果ＤＢ１６０を更新し（ステップＳ４０５）、その後、ステップＳ４０７に進む。 When it is determined that the cumulative reward is larger than the threshold, the score determination module 140 updates the learning result DB 160 (Step S405), and then proceeds to Step S407.

具体的には、スコア判定モジュール１４０は、学習結果ＤＢ１６０にエントリを追加し、追加されたエントリの結果ＩＤ７０１に識別情報を設定する。スコア判定モジュール１４０は、追加されたエントリのポリシ内部変数７０２にエージェントモジュール１３２のパラメータとして取得したポリシ内部データ２４１を設定し、当該エントリの累積報酬７０３に算出された累積報酬を設定する。 Specifically, the score determination module 140 adds an entry to the learning result DB 160, and sets identification information in the result ID 701 of the added entry. The score determination module 140 sets the policy internal data 241 acquired as a parameter of the agent module 132 in the policy internal variable 702 of the added entry, and sets the calculated cumulative reward in the cumulative reward 703 of the entry.

累積報酬が閾値以下であると判定された場合、スコア判定モジュール１４０は、ターゲットノードに除外フラグを付与し（ステップＳ４０６）、その後、ステップＳ４０７に進む。 When it is determined that the accumulated reward is equal to or less than the threshold, the score determination module 140 assigns an exclusion flag to the target node (Step S406), and then proceeds to Step S407.

ステップＳ４０７では、スコア判定モジュール１４０は、更新判定リストに登録された全てのノードについて処理が完了したか否かを判定する（ステップＳ４０７）。 In step S407, the score determination module 140 determines whether the processing has been completed for all nodes registered in the update determination list (step S407).

更新判定リストに登録された全てのノードについて処理が完了していないと判定された場合、スコア判定モジュール１４０は、ステップＳ４０１に戻り、新たなターゲットノードを選択する。 When it is determined that the processing has not been completed for all the nodes registered in the update determination list, the score determination module 140 returns to step S401, and selects a new target node.

更新判定リストに登録された全てのノードについて処理が完了したと判定された場合、スコア判定モジュール１４０は、最適ポリシが存在するか否かを判定する（ステップＳ４０８）。 When it is determined that the processing has been completed for all nodes registered in the update determination list, the score determination module 140 determines whether or not an optimal policy exists (step S408).

具体的には、スコア判定モジュール１４０は、更新判定リストに登録されたノードの中に除外フラグが付与されていないノードが存在するか否かを判定する。更新判定リストに登録されたノードの中に除外フラグが付与されていないノードが存在する場合、スコア判定モジュール１４０は、最適ポリシが存在すると判定する。 Specifically, the score determination module 140 determines whether there is a node to which the exclusion flag has not been added among the nodes registered in the update determination list. When there is a node to which the exclusion flag has not been added among the nodes registered in the update determination list, the score determination module 140 determines that the optimal policy exists.

最適ポリシが存在すると判定された場合、スコア判定モジュール１４０は、学習コントローラ１２０に最適ポリシ通知を出力する（ステップＳ４０９）。その後、スコア判定モジュール１４０は処理を終了する。最適ポリシ通知には更新判定リストが含まれる。 If it is determined that the optimal policy exists, the score determination module 140 outputs an optimal policy notification to the learning controller 120 (Step S409). Thereafter, the score determination module 140 ends the processing. The optimal policy notification includes an update determination list.

最適ポリシが存在しないと判定された場合、スコア判定モジュール１４０は、使用条件を満たすか否かを判定する（ステップＳ４１０）。 When it is determined that the optimal policy does not exist, the score determination module 140 determines whether the use condition is satisfied (step S410).

例えば、累積報酬の上昇率が閾値より小さい場合、又は、各学習結果（ノード）の累積報酬が閾値より小さい場合、スコア判定モジュール１４０は、使用条件を満たすと判定する。 For example, when the rate of increase of the cumulative reward is smaller than the threshold, or when the cumulative reward of each learning result (node) is smaller than the threshold, the score determination module 140 determines that the use condition is satisfied.

In the first embodiment, when continuing the reinforcement learning is unlikely to yield an optimal policy, or when overfitting has occurred, the reinforcement learning based on the current model parameters is stopped and reinforcement learning based on new model parameters is started.

使用条件を満たすと判定された場合、スコア判定モジュール１４０は、学習コントローラ１２０に履歴使用指示を出力する（ステップＳ４１１）。その後、スコア判定モジュール１４０は処理を終了する。履歴使用指示には更新判定リストが含まれる。 When it is determined that the use condition is satisfied, the score determination module 140 outputs a history use instruction to the learning controller 120 (Step S411). Thereafter, the score determination module 140 ends the processing. The history use instruction includes an update determination list.

使用条件を満たさないと判定された場合、スコア判定モジュール１４０は、学習コントローラ１２０に継続指示を出力する（ステップＳ４１２）。その後、スコア判定モジュール１４０は処理を終了する。継続指示には更新判定リストが含まれる。 When it is determined that the use condition is not satisfied, the score determination module 140 outputs a continuation instruction to the learning controller 120 (Step S412). Thereafter, the score determination module 140 ends the processing. The continuation instruction includes an update determination list.
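The decision flow of the score determination module (steps S401 to S412) can be summarized in a short sketch. The names `Decision` and `judge_scores` are hypothetical, the evaluation simulation is abstracted into an `evaluate` callback, and the use condition is reduced to one assumed criterion: the improvement of the best cumulative reward over a previously recorded best is smaller than `min_improvement`.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Decision(Enum):
    OPTIMAL_POLICY = "optimal policy notification"  # S409
    USE_HISTORY = "history use instruction"         # S411
    CONTINUE = "continuation instruction"           # S412


def judge_scores(update_list: List[int],
                 evaluate: Callable[[int], float],
                 reward_threshold: float,
                 previous_best: float,
                 min_improvement: float,
                 result_db: Dict[int, float]) -> Tuple[Decision, List[int]]:
    excluded: List[int] = []
    rewards: Dict[int, float] = {}
    for node_id in update_list:              # S401 to S407: loop over the list
        reward = evaluate(node_id)           # S402, S403: simulation without policy updates
        rewards[node_id] = reward
        if reward > reward_threshold:        # S404
            result_db[node_id] = reward      # S405: record in the learning result DB
        else:
            excluded.append(node_id)         # S406: exclusion flag
    if len(excluded) < len(update_list):     # S408: at least one node is not excluded
        return Decision.OPTIMAL_POLICY, excluded
    # S410: assumed use condition based on how much the best reward improved.
    best = max(rewards.values(), default=previous_best)
    if best - previous_best < min_improvement:
        return Decision.USE_HISTORY, excluded
    return Decision.CONTINUE, excluded
```

The three return values correspond to the three messages sent to the learning controller 120; in each case the update determination list travels with the message.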

図１３及び図１４は、実施例１の計算機１００によって表示されるＧＵＩの一例を示す図である。図１３は、ユーザが強化学習の各種設定を行うために提示されるＧＵＩ１３００を示す。図１４は、ユーザが学習の推移を確認するために提示されるＧＵＩ１４００を示す。 FIG. 13 and FIG. 14 are diagrams illustrating an example of a GUI displayed by the computer 100 according to the first embodiment. FIG. 13 shows a GUI 1300 presented for the user to make various settings for reinforcement learning. FIG. 14 shows a GUI 1400 presented for the user to confirm the transition of learning.

ＧＵＩ１３００は、学習形態欄１３０１、学習回数欄１３０２、上限回数欄１３０３、遷移条件欄１３０４、提示情報欄１３０５、保存対象欄１３０６、選択方式欄１３０７、及び設定ボタン１３０８を含む。 The GUI 1300 includes a learning mode column 1301, a learning frequency column 1302, an upper limit frequency column 1303, a transition condition column 1304, a presentation information column 1305, a storage target column 1306, a selection method column 1307, and a setting button 1308.

学習形態欄１３０１は、学習形態を選択する欄である。実施例１では、「Ｏｎ−Ｐｏｌｉｃｙ」及び「Ｏｆｆ−Ｐｏｌｉｃｙ」等を選択するためのドロップダウンリストが提示される。 The learning mode column 1301 is a column for selecting a learning mode. In the first embodiment, a drop-down list for selecting “On-Policy”, “Off-Policy”, or the like is presented.

学習回数欄１３０２は、学習回数を設定する欄である。上限回数欄１３０３は、上限回数を設定する欄である。 The learning frequency column 1302 is a column for setting the learning frequency. The upper limit number column 1303 is a column for setting the upper limit number.

遷移条件欄１３０４は、シミュレーションの難易度の調整方法を設定するための欄である。 The transition condition column 1304 is a column for setting a method of adjusting the difficulty of the simulation.

提示情報欄１３０５は、強化学習の結果として出力する情報を設定する欄である。実施例１では、ポリシ及び行動等を選択するためのドロップダウンリストが提示される。 The presentation information column 1305 is a column for setting information to be output as a result of reinforcement learning. In the first embodiment, a drop-down list for selecting a policy, an action, and the like is presented.

保存対象欄１３０６は、保存対象を設定する欄である。保存対象欄１３０６のボックスは必要に応じて追加又は削除できる。 The storage target column 1306 is a column for setting a storage target. The box in the storage target column 1306 can be added or deleted as needed.

選択方式欄１３０７は、使用する学習履歴の選択方式を設定する欄である。 The selection method column 1307 is a column for setting a selection method of a learning history to be used.

設定ボタン１３０８は、各欄の値を計算機１００に設定するためのボタンである。ユーザが当該ボタンを操作した場合、各欄の値を含む学習条件パラメータ情報１７０が計算機１００に入力される。 The setting button 1308 is a button for setting the value of each column in the computer 100. When the user operates the button, the learning condition parameter information 170 including the value of each column is input to the computer 100.

ＧＵＩ１４００は、表示ボタン１４０１、設定ボタン１４０２、履歴関係表示欄１４０３、詳細表示欄１４０４を含む。 The GUI 1400 includes a display button 1401, a setting button 1402, a history-related display field 1403, and a detail display field 1404.

表示ボタン１４０１は、履歴関係表示欄１４０３を表示するためのボタンである。設定ボタン１４０２は、履歴関係表示欄１４０３に対する操作結果を計算機１００に反映させるためのボタンである。 The display button 1401 is a button for displaying a history-related display column 1403. The setting button 1402 is a button for reflecting the operation result of the history-related display column 1403 on the computer 100.

The history relation display column 1403 consists of a difficulty coefficient 1411 and a history structure 1412. The difficulty coefficient 1411 displays the difficulty coefficients. The history structure 1412 displays a graph showing the connections between the nodes of the history relationship management information 123. As shown in FIG. 14, the graph is displayed with one layer per difficulty coefficient. A black circle indicates a node whose search flag 605 is "OFF". A dotted circle indicates a node for which no learning result is stored.

詳細表示欄１４０４は、ノードに対応する学習結果を表示する欄である。ユーザが履歴構造１４１２のノードにカーソルを合わせた場合、詳細表示欄１４０４に当該ノードに対応する学習結果が表示される。 The detail display column 1404 is a column for displaying a learning result corresponding to a node. When the user positions the cursor on a node in the history structure 1412, a learning result corresponding to the node is displayed in a detail display column 1404.

Using the detail display column 1404, the user can choose which nodes are to be treated as selection targets. When the user operates the setting button 1402 after making this choice, the value of the search flag 605 in the history relationship management information 123 is updated.

(Modification)
FIG. 15 is a diagram illustrating a modification of the configuration of the system according to the first embodiment.

システムは、計算機１５００、複数の計算機１５１０、計算機１５２０、及び端末１５３０から構成される。計算機１５００、計算機１５１０、計算機１５２０はネットワーク１５５０を介して互いに接続される。また、計算機１５００及び端末１５３０は、直接、又は、ネットワークを介して接続される。 The system includes a computer 1500, a plurality of computers 1510, a computer 1520, and a terminal 1530. The computer 1500, the computer 1510, and the computer 1520 are connected to each other via a network 1550. The computer 1500 and the terminal 1530 are connected directly or via a network.

計算機１５００は学習コントローラ１２０を有し、計算機１５１０はサブプロセスコントローラ１３０及び履歴ＤＢ１５０を有し、計算機１５２０はスコア判定モジュール１４０及び学習結果ＤＢ１６０を有する。本システムでは、複数の計算機１５１０が、並列に強化学習を並列実行する。 The computer 1500 has a learning controller 120, the computer 1510 has a sub-process controller 130 and a history DB 150, and the computer 1520 has a score determination module 140 and a learning result DB 160. In this system, a plurality of computers 1510 execute reinforcement learning in parallel.

学習コントローラ１２０は、計算機１５１０毎に履歴関係管理情報１２３を保持する。また、学習結果ＤＢ１６０には、計算機１５１０の識別情報を格納するフィールドが追加される。 The learning controller 120 holds the history relation management information 123 for each computer 1510. Further, a field for storing identification information of the computer 1510 is added to the learning result DB 160.

なお、本発明は上記した実施例に限定されるものではなく、様々な変形例が含まれる。また、例えば、上記した実施例は本発明を分かりやすく説明するために構成を詳細に説明したものであり、必ずしも説明した全ての構成を備えるものに限定されるものではない。また、各実施例の構成の一部について、他の構成に追加、削除、置換することが可能である。 Note that the present invention is not limited to the above-described embodiment, and includes various modifications. Further, for example, in the above-described embodiment, the configuration has been described in detail in order to explain the present invention in an easily understandable manner, and the present invention is not necessarily limited to those having all the configurations described above. Further, a part of the configuration of each embodiment can be added, deleted, or replaced with another configuration.

また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、本発明は、実施例の機能を実現するソフトウェアのプログラムコードによっても実現できる。この場合、プログラムコードを記録した記憶媒体をコンピュータに提供し、そのコンピュータが備えるプロセッサが記憶媒体に格納されたプログラムコードを読み出す。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施例の機能を実現することになり、そのプログラムコード自体、及びそれを記憶した記憶媒体は本発明を構成することになる。このようなプログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク、ＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、光ディスク、光磁気ディスク、ＣＤ−Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどが用いられる。 Further, each of the above-described configurations, functions, processing units, processing means, and the like may be partially or entirely realized by hardware, for example, by designing an integrated circuit. The present invention can also be realized by software program codes for realizing the functions of the embodiments. In this case, a storage medium storing the program code is provided to a computer, and a processor included in the computer reads the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the function of the above-described embodiment, and the program code itself and the storage medium storing the program code constitute the present invention. As a storage medium for supplying such a program code, for example, a flexible disk, a CD-ROM, a DVD-ROM, a hard disk, an SSD (Solid State Drive), an optical disk, a magneto-optical disk, a CD-R, a magnetic tape, A non-volatile memory card, ROM, or the like is used.

The program code that implements the functions described in this embodiment can be written in a wide range of programming or scripting languages, such as assembler, C/C++, Perl, shell, PHP, and Java (registered trademark).

さらに、実施例の機能を実現するソフトウェアのプログラムコードを、ネットワークを介して配信することによって、それをコンピュータのハードディスクやメモリ等の記憶手段又はＣＤ−ＲＷ、ＣＤ−Ｒ等の記憶媒体に格納し、コンピュータが備えるプロセッサが当該記憶手段や当該記憶媒体に格納されたプログラムコードを読み出して実行するようにしてもよい。 Furthermore, by distributing the program code of the software for realizing the functions of the embodiment via a network, the program code is stored in a storage unit such as a hard disk or a memory of a computer or a storage medium such as a CD-RW or a CD-R. Alternatively, a processor included in a computer may read out and execute the program code stored in the storage unit or the storage medium.

上述の実施例において、制御線や情報線は、説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。全ての構成が相互に接続されていてもよい。 In the above-described embodiment, the control lines and the information lines are considered to be necessary for the explanation, and do not necessarily indicate all the control lines and the information lines on the product. All components may be interconnected.

Reference Signs List
100 computer
101 terminal
110 processor
111 memory
112 network interface
120 learning controller
121 environment rollback controller
122 policy rollback controller
123 history relationship management information
130 sub-process controller
131 environment module
132 agent module
140 score determination module
150 history DB
160 learning result DB
170 learning condition parameter information
171 environment parameter
172 agent parameter
200 environment module
201 agent module
210 environment control module
211 simulation management module
212 environment state
220 reward calculation module
230 optimizer
231 update data
232 optimizer internal data
234 optimizer internal data
240 policy controller
241 policy internal data
250 state
251 reward
252 action
253 state confirmation flag
1300, 1400 GUI

Claims

A learning control method in a computer system for learning a policy that determines the control content of a process for controlling an object,
wherein the computer system includes:
a learning control unit that calculates a transition model parameter obtained by partially changing a target model parameter for realizing a simulation that selects the control content of the process based on an arbitrary policy;
a learning device that executes, a plurality of times, the simulation based on the transition model parameter input from the learning control unit or on a transition model parameter calculated based on the result of a previous simulation, and that performs a learning process for updating the policy based on the result of the simulation; and
a history database that manages, as a learning history, information related to the transition model parameter and to the policy updated by executing the simulation,
the learning control method comprising:
a first step in which the learning device stores the learning history in the history database at an arbitrary timing;
a second step in which the learning device determines, based on an evaluation value of the policy updated by the simulation executed an arbitrary number of times, whether it is necessary to execute the simulation using the learning history; and
a third step in which, when it is determined that it is necessary to execute the simulation using the learning history, the learning device executes, a plurality of times, the simulation based on the transition model parameter calculated based on a use learning history selected from the history database, and updates the policy based on the result of the simulation.

The learning control method according to claim 1, further comprising:
a fourth step in which the learning device determines, based on the evaluation value, whether to update the transition model parameter;
a fifth step in which, when it is determined that the transition model parameter is to be updated, the learning device outputs an instruction to update the transition model parameter to the learning control unit; and
a step in which, when the learning control unit receives the instruction to update the transition model parameter, the learning control unit updates the current transition model parameter and outputs the updated transition model parameter to the learning device.

The learning control method according to claim 2, wherein
the learning control unit manages history relationship management information indicating a relationship between the learning histories, and
the first step includes:
a step in which the learning device, when storing a new learning history in the history database, sends a storage notification of the new learning history to the learning control unit; and
a step in which the learning control unit updates the history relationship management information so that the new learning history is associated with the learning history used to calculate the transition model parameter used in the simulation from which the new learning history was generated.

The learning control method according to claim 3, wherein
the third step includes:
a step in which the learning control unit selects the use learning history based on the history relationship management information; and
a step in which the learning control unit calculates the transition model parameter based on the use learning history and outputs the transition model parameter to the learning device.

The learning control method according to claim 4, wherein
the third step includes:
a step in which the learning control unit calculates a new transition model parameter by changing, based on the use learning history, a part of the transition model parameter used in the previous simulation, and outputs the calculated new transition model parameter to the learning device; and
a step in which the learning device executes, a plurality of times, the simulation based on the new transition model parameter.

A computer system for learning a policy that determines the control content of a process for controlling an object, the computer system comprising:
a learning control unit that calculates a transition model parameter obtained by partially changing a target model parameter for realizing a simulation that selects the control content of the process based on an arbitrary policy;
a learning device that executes, a plurality of times, the simulation based on the transition model parameter input from the learning control unit or on a transition model parameter calculated based on the result of a previous simulation, and that performs a learning process for updating the policy based on the result of the simulation; and
a history database that manages, as a learning history, information related to the transition model parameter and to the policy updated by executing the simulation,
wherein the learning device:
stores the learning history in the history database at an arbitrary timing;
determines, based on an evaluation value of the policy updated by the simulation executed an arbitrary number of times, whether it is necessary to execute the simulation using the learning history; and
when it is determined that it is necessary to execute the simulation using the learning history, executes, a plurality of times, the simulation based on the transition model parameter calculated based on a use learning history selected from the history database, and updates the policy based on the result of the simulation.

The computer system according to claim 6, wherein:
The learning device includes:
Based on the evaluation value, determine whether to update the transition model parameters,
If it is determined to update the transition model parameters, output an instruction to update the transition model parameters to the learning control unit,
The computer system, wherein the learning control unit updates a current transition model parameter when receiving an instruction to update the transition model parameter, and outputs the updated transition model parameter to the learning device.

The computer system according to claim 7, wherein
the learning control unit manages history relationship management information indicating a relationship between the learning histories,
the learning device, when storing a new learning history in the history database, sends a storage notification of the new learning history to the learning control unit, and
the learning control unit, when receiving the storage notification of the new learning history, updates the history relationship management information so that the new learning history is associated with the learning history used to calculate the transition model parameter used in the simulation from which the new learning history was generated.

The computer system according to claim 8, wherein
the learning control unit:
when the learning device determines that it is necessary to execute the simulation using the learning history, selects the use learning history based on the history relationship management information; and
calculates the transition model parameter based on the use learning history and outputs the transition model parameter to the learning device.

The computer system according to claim 9, wherein:
The learning control unit calculates a new transition model parameter by changing a part of the transition model parameter used in the previous simulation based on the use learning history, and outputs the new transition model parameter to the learning device.
The computer system, wherein the learning device executes the simulation based on the new transition model parameter a plurality of times.