JP7421391B2 - Learning methods and programs

Info

Publication number: JP7421391B2
Application number: JP2020053613A
Authority: JP
Inventors: 雅司岡田
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2019-07-05
Filing date: 2020-03-25
Publication date: 2024-01-24
Anticipated expiration: 2040-03-25
Also published as: JP2021012678A

Description

The present invention relates to a learning method and a program.

As a learning method for controlling an agent, there is a method that employs a dynamics model that takes uncertainty into account (see Non-Patent Document 1). Here, an agent is an acting entity that takes actions on an environment.

K. Chua, R. Calandra, R. McAllister, and S. Levine. "Deep reinforcement learning in a handful of trials using probabilistic dynamics models." In NeurIPS, 2018.

However, there is room for improvement in learning methods for controlling the behavior of an agent.

Therefore, the present invention provides a learning method and the like that improve the behavior of an agent.

A learning method according to one aspect of the present invention is a method of learning the behavior of an agent using model-based reinforcement learning. The method acquires time-series data indicating the state and actions of the agent when the agent acts; constructs a dynamics model by performing supervised learning using the acquired time-series data; derives, based on the dynamics model, a plurality of candidates for an action sequence of the agent by variational inference using a mixture model as the variational distribution; and outputs one candidate selected from the derived candidates as the action sequence of the agent.

Note that these general or specific aspects may be realized by a system, a device, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or by any combination of systems, devices, integrated circuits, computer programs, and recording media.

The learning method of the present invention can improve learning for controlling the behavior of an agent.

FIG. 1 is a block diagram showing the functional configuration of a learning device in an embodiment.
FIG. 2 is an explanatory diagram showing a time series of agent states and actions in the embodiment.
FIG. 3 is an explanatory diagram showing a variational distribution in the embodiment.
FIG. 4 is an explanatory diagram showing the concept of variational inference performed by the inference unit in the embodiment.
FIG. 5 is an explanatory diagram showing the concept of variational inference performed by the inference unit in the related technology.
FIG. 6 is an explanatory diagram showing a plurality of candidates determined by the inference unit in the embodiment in comparison with the related technology.
FIG. 7 is a flow diagram showing the learning method in the embodiment.
FIG. 8 is an explanatory diagram showing the change over time of the cumulative reward in the learning method of the embodiment in comparison with the related technology.
FIG. 9 is an explanatory diagram showing the convergence value of the cumulative reward in the learning method of the embodiment in comparison with the related technology.

According to the above aspect, because variational inference is performed using a mixture model as the variational distribution, a plurality of candidates for the action sequence of the agent can be derived. Compared with variational inference that uses a single model as the variational distribution, each candidate derived in this way differs less from the optimal action sequence, and the derivation of the candidates converges faster. One of the derived candidates is then expected to be used as the action sequence of the agent. In this way, a more appropriate action sequence can be output in a shorter time. Thus, according to the above aspect, the learning method for controlling the agent can be improved.

For example, new time-series data indicating the state and actions of the agent when the agent acts according to the output action sequence may further be acquired as the time-series data.

According to the above aspect, the state and actions of the agent controlled using the action sequence output as a result of learning are used for further learning, so the action sequence constructed as a result of learning can be brought progressively closer to a more appropriate one. In this way, the learning method for controlling the agent can be improved further.

For example, when the plurality of candidates are derived, the mixing ratio, within the mixture model as a whole, of the probability distribution corresponding to each candidate may be derived together with the candidates; and when the one candidate is selected, the candidate corresponding to the probability distribution with the largest mixing ratio among the derived candidates may be selected as the one candidate.

According to the above aspect, the candidate corresponding to the probability distribution with the largest mixing ratio in the mixture model is used as the action sequence of the agent, so the difference from the optimal action sequence can be made smaller. Therefore, the learning method for controlling the agent can be improved so as to output an action sequence closer to the optimal action sequence.

For example, the mixture model may be a Gaussian mixture distribution.

According to the above aspect, because a Gaussian mixture distribution is specifically used as the mixture model, the plurality of candidates can be derived more easily by variational inference. Therefore, the learning method for controlling the agent can be improved more easily.

For example, the dynamics model may be an ensemble of a plurality of neural networks.

According to the above aspect, because an ensemble of a plurality of neural networks is used as the dynamics model, a dynamics model with relatively high estimation accuracy can be constructed more easily. Therefore, the learning method for controlling the agent can be improved more easily.

Further, a program according to one aspect of the present invention is a program for causing a computer to execute the above learning method.

The above aspect provides the same effects as the learning method described above.

Note that these general or specific aspects may be realized by a system, a device, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or by any combination of a system, a device, an integrated circuit, a computer program, or a recording medium.

Hereinafter, embodiments will be specifically described with reference to the drawings.

Note that each of the embodiments described below shows a general or specific example. The numerical values, shapes, materials, components, arrangement positions and connection forms of the components, steps, order of steps, and the like shown in the following embodiments are merely examples and are not intended to limit the present invention. Among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components.

(Embodiment)
In this embodiment, a learning device and a learning method that improve learning for controlling the behavior of an agent will be described.

FIG. 1 is a block diagram showing the functional configuration of a learning device 10 in this embodiment.

The learning device 10 shown in FIG. 1 is a device that learns the behavior of an agent 20 using model-based reinforcement learning. The learning device 10 acquires information from when the agent 20 acts, performs model-based reinforcement learning using the acquired information, and can control the behavior of the agent 20.

The agent 20 is a device that selectively and sequentially takes one of a plurality of states and selectively and sequentially performs one of a plurality of actions. The agent 20 is, for example, an industrial machine that has a robot arm and processes an object. In this case, the coordinate values of the robot arm, information about the object, information indicating the processing status, and the like correspond to the "state". Information for operating the robot arm (the coordinate values of the target position of the robot arm), information for changing the state of the industrial machine, and the like correspond to the "action".

The agent 20 provides the learning device 10 with time-series data indicating the state and actions of the agent 20 when it performs a series of actions. The agent 20 can also act according to the action sequence output by the learning device 10.

As shown in FIG. 1, the learning device 10 includes an acquisition unit 11, a learning unit 12, a storage unit 13, an inference unit 14, and an output unit 15. Each functional unit of the learning device 10 can be realized by a CPU (Central Processing Unit) (not shown) of the learning device 10 executing a predetermined program using a memory (not shown).

The acquisition unit 11 is a functional unit that acquires time-series data indicating information about the agent 20 when the agent 20 acts. The information about the agent 20 acquired by the acquisition unit 11 includes at least the state and actions of the agent 20. The sequence of actions of the agent 20 acquired by the acquisition unit 11 may be of any kind: it may be a randomly set sequence of actions, or it may be an action sequence output by the output unit 15 (described later).

The acquisition unit 11 may also acquire, as the above time-series data, new time-series data indicating the state and actions of the agent 20 obtained by controlling the agent 20 with the action sequence output by the output unit 15 (described later). The acquisition unit 11 acquires one or more sets of the time-series data.

The learning unit 12 is a functional unit that constructs the dynamics model 17 by performing supervised learning using the time-series data acquired by the acquisition unit 11. The learning unit 12 stores the constructed dynamics model 17 in the storage unit 13.

When performing supervised learning, the learning unit 12 uses a reward value determined according to the action sequence of the agent 20. For example, when the agent 20 is an industrial machine with a robot arm, the more appropriate the series of actions of the robot arm, the higher the reward value. Likewise, the better the quality of the object processed by the industrial machine, the higher the reward value.

One example of the dynamics model 17 is an ensemble of a plurality of neural networks. That is, the learning unit 12 determines the filter coefficients (weights) of each layer in the plurality of neural networks using either the same action sequence as the one acquired by the acquisition unit 11, or mutually different partial action sequences contained in the action sequence acquired by the acquisition unit 11. In this case, the output of the ensemble is the result of aggregating, with an aggregation function, the output data that each of the neural networks produces for the input data. The aggregation function can include various operations, for example taking the average of the output data of the neural networks or taking a majority vote over the output data.
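
The ensemble construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: each ensemble member is a linear least-squares model standing in for a neural network, the members are trained on bootstrap resamples (one way of giving them mutually different subsets of the acquired data), and the aggregation function is the mean. The names `fit_linear_member`, `fit_ensemble`, and `predict_ensemble` are hypothetical.

```python
import numpy as np


def fit_linear_member(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action, 1] @ W for one member.

    Assumption: a linear model stands in for one neural network of the ensemble.
    """
    x = np.hstack([states, actions, np.ones((len(states), 1))])
    w, *_ = np.linalg.lstsq(x, next_states, rcond=None)
    return w


def fit_ensemble(states, actions, next_states, n_members=5, seed=0):
    """Train each member on a different bootstrap resample of the transitions."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(states), size=len(states))
        members.append(
            fit_linear_member(states[idx], actions[idx], next_states[idx])
        )
    return members


def predict_ensemble(members, state, action):
    """Aggregate member predictions with the mean; also return their spread."""
    x = np.concatenate([state, action, [1.0]])
    preds = np.stack([x @ w for w in members])
    return preds.mean(axis=0), preds.std(axis=0)
```

The standard deviation across members is returned only as a rough indicator of model uncertainty; a majority vote would be the analogous aggregation for discrete outputs.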

The storage unit 13 is a storage device that stores the dynamics model 17 constructed by the learning unit 12. The dynamics model 17 is stored by the learning unit 12 and read out by the inference unit 14. The storage unit 13 is realized by a memory or a storage device.

The inference unit 14 is a functional unit that derives a plurality of candidates for the action sequence of the agent 20 based on the dynamics model 17. When deriving the candidates, the inference unit 14 uses variational inference with a mixture model as the variational distribution. The inference unit 14 thereby derives a plurality of action sequences as candidates for the action sequence of the agent 20.

The mixture model that the inference unit 14 uses for variational inference is, for example, a Gaussian mixture distribution. This case is described here as an example, but the mixture model is not limited to it. For example, when the action sequence is a vector that takes discrete values, a mixture of categorical distributions can be used as the mixture model.

The output unit 15 is a functional unit that outputs the action sequence of the agent 20. The output unit 15 outputs one candidate selected from the plurality of candidates derived by the inference unit 14 as the action sequence of the agent 20. For example, the action sequence output by the output unit 15 is expected to be input to the agent 20, which then acts according to it.

Note that the action sequence output by the output unit 15 may be managed as numerical data by another device (not shown).

Note that, when deriving the plurality of candidates, the inference unit 14 may also derive the mixing ratio, within the mixture model as a whole, of the probability distribution corresponding to each candidate. When selecting the one candidate, the output unit 15 may then select, from the derived candidates, the candidate corresponding to the probability distribution with the largest mixing ratio in the mixture model as a whole.
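
The selection rule just described amounts to an argmax over the mixing ratios. The following is a small sketch, assuming the candidates are stored as rows of an array (one action sequence per mixture component); the function name `select_candidate` is hypothetical.

```python
import numpy as np


def select_candidate(candidates, mixing_ratios):
    """Return the candidate whose mixture component has the largest mixing ratio.

    candidates:    array of shape (K, ...) with one action sequence per component.
    mixing_ratios: array of shape (K,) with the mixing ratio of each component.
    """
    candidates = np.asarray(candidates)
    mixing_ratios = np.asarray(mixing_ratios)
    best = int(np.argmax(mixing_ratios))
    return candidates[best], best
```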

FIG. 2 is an explanatory diagram showing an example of a time series of states of the agent 20 in this embodiment.

In FIG. 2, states S_t and S_{t+1} are shown as an example of a series of states of the agent 20. Here, S_t denotes the state of the agent 20 at time t, and S_{t+1} denotes the state of the agent 20 at time t+1.

The action a_t shown in FIG. 2 is the action taken by the agent 20 in state S_t. The reward r_t shown in FIG. 2 is the reward that the agent 20 obtains by transitioning from state S_t to state S_{t+1}.

In this way, the states and actions of the agent 20, and the reward values that the agent 20 obtains, can be represented as a diagram. The inference unit 14 derives an action sequence that can maximize the expected value of the cumulative value of the rewards r_t (also called the cumulative reward) when the agent 20 takes a series of actions a_t.
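
The cumulative reward of one action sequence can be computed by rolling the sequence out through a dynamics model, as in the sketch below. The callables `dynamics` and `reward` are hypothetical stand-ins for the learned dynamics model and the reward definition; the rollout here is deterministic, so an expected value would have to be approximated separately, for example by averaging over several rollouts.

```python
def cumulative_reward(initial_state, actions, dynamics, reward):
    """Sum the rewards r_t obtained along S_t -> S_{t+1} for a given action sequence.

    dynamics(state, action) -> next_state   (assumed interface)
    reward(state, action, next_state) -> float   (assumed interface)
    """
    state = initial_state
    total = 0.0
    for action in actions:
        next_state = dynamics(state, action)
        total += reward(state, action, next_state)
        state = next_state
    return total
```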

The method by which the inference unit 14 derives action sequences is described below.

FIG. 3 is an explanatory diagram showing the cumulative reward for each action sequence in this embodiment. The case where the action sequence is a vector with continuous-valued elements is described here as an example, but the same description holds for discrete values.

FIG. 3 is a graph with the action sequence on the horizontal axis and the cumulative reward for each action sequence on the vertical axis. As shown in FIG. 3, the cumulative reward generally has a plurality of local maxima with respect to the action sequence.

FIG. 4 is an explanatory diagram showing the concept of variational inference performed by the inference unit 14 in this embodiment.

FIG. 4 is a graph with the action sequence on the horizontal axis and the probability distribution corresponding to each action sequence on the vertical axis. The probability distribution on the vertical axis of FIG. 4 can be obtained from the cumulative reward on the vertical axis of FIG. 3 by a predetermined calculation.

The inference unit 14 derives candidates for the optimal action sequence of the agent 20 by performing variational inference using a Gaussian mixture distribution (Gaussian Mixture Model (GMM)) as the variational distribution.

The solid line in FIG. 4 is the cumulative reward for each action sequence shown in FIG. 3 converted into a probability distribution. One way to convert the cumulative reward into a probability distribution is to map the cumulative reward through an exponential function.
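
As a small sketch of the exponential mapping mentioned above, the cumulative rewards of a finite set of action sequences can be exponentiated and normalized so that they sum to one. The `temperature` parameter and the function name are assumptions introduced for illustration and do not appear in the description.

```python
import numpy as np


def rewards_to_probabilities(rewards, temperature=1.0):
    """Map cumulative rewards to a discrete probability distribution via exp().

    Subtracting the maximum before exponentiating does not change the result
    after normalization but avoids overflow for large rewards.
    """
    r = np.asarray(rewards, dtype=float) / temperature
    w = np.exp(r - r.max())
    return w / w.sum()
```

Action sequences with a higher cumulative reward receive a higher probability, so the local maxima of FIG. 3 become the modes of the distribution in FIG. 4.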

The broken line in FIG. 4 shows a Gaussian mixture distribution, that is, a probability distribution obtained by mixing (combining) a plurality of Gaussian distributions. A Gaussian mixture distribution is specified by parameters that include, for each of its constituent Gaussian distributions, a set consisting of the mean, the variance, and the peak value.

The inference unit 14 determines these parameters of the Gaussian mixture distribution by variational inference.

Specifically, the inference unit 14 repeatedly performs a calculation that determines the parameters of the Gaussian mixture distribution so as to reduce the difference between the probability distribution shown by the broken line in FIG. 4 (the Gaussian mixture distribution) and the probability distribution shown by the solid line. The number of repetitions may be set arbitrarily, and the repetition may be ended when the rate of improvement of the cumulative reward per repetition falls within a fixed range (for example, ±1%).
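
The stopping rule described above can be sketched as a generic loop. Here `step` is a hypothetical callable that performs one parameter update and returns the resulting cumulative reward; the 1% tolerance follows the example value in the text, and the cap `max_iters` is an added assumption.

```python
def run_until_converged(step, max_iters=100, tol=0.01):
    """Repeat `step` until the relative change in cumulative reward is within tol.

    Returns the last cumulative reward and the number of times `step` was called.
    """
    prev = step()
    for i in range(1, max_iters):
        cur = step()
        # Relative improvement between consecutive repetitions.
        if prev != 0 and abs(cur - prev) / abs(prev) <= tol:
            return cur, i + 1
        prev = cur
    return prev, max_iters
```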

By performing variational inference as described above, the inference unit 14 determines the parameters of each of the Gaussian distributions included in the Gaussian mixture distribution. Because each of the determined Gaussian distributions corresponds to an action sequence, determining these parameters is equivalent to the inference unit 14 deriving a plurality of candidates for the action sequence of the agent 20.

In the example of FIG. 4, the inference unit 14 obtains, as the parameters of the two Gaussian distributions, the means μ1 and μ2, the variances σ1 and σ2, and the mixing ratios π1 and π2. The parameters determined in this way give the plurality of candidates for the action sequence of the agent 20.
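
To make the fitting step concrete, the sketch below fits a one-dimensional Gaussian mixture to a target distribution given on a grid of action values. It uses a weighted expectation-maximization update, which is only one possible way to reduce the difference between the mixture and the target; the description does not specify the update rule, so this is a swapped-in technique chosen for its simplicity. The function name, the grid representation, and the initial variance are assumptions.

```python
import numpy as np


def fit_gmm_to_density(x, p, init_means, init_var=1.0, n_iter=100):
    """Fit means, variances, and mixing ratios of a 1-D Gaussian mixture.

    x: grid of action values, shape (N,)
    p: target probabilities on the grid, shape (N,), summing to 1
    """
    mu = np.array(init_means, dtype=float)
    k = len(mu)
    var = np.full(k, float(init_var))
    mix = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each grid point.
        diff = x[:, None] - mu
        dens = mix * np.exp(-0.5 * diff**2 / var) / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weight the responsibilities by the target probability.
        w = resp * p[:, None]
        nk = w.sum(axis=0)
        mu = (w * x[:, None]).sum(axis=0) / nk
        var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        mix = nk / nk.sum()
    return mu, var, mix
```

With two components, the returned arrays play the roles of (μ1, μ2), (σ1, σ2), and (π1, π2) in FIG. 4; each mean is one candidate, and the mixing ratios indicate how much of the target distribution each candidate accounts for.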

The variational inference performed by the inference unit 14 is described below in comparison with variational inference in the related technology. In the related technology, variational inference uses a single model (for example, a single Gaussian distribution) instead of a mixture model (a Gaussian mixture distribution).

FIG. 5 is an explanatory diagram showing the concept of variational inference in the related technology. The vertical and horizontal axes of FIG. 5 are the same as those of FIG. 4.

The inference unit in the related technology derives a candidate for the optimal action sequence of the agent 20 by performing variational inference using a Gaussian distribution (that is, a single Gaussian distribution rather than a Gaussian mixture distribution) as the variational distribution.

The broken line in FIG. 5 shows a Gaussian distribution. A Gaussian distribution is specified by a set consisting of the mean, the variance, and the peak value of the probability distribution.

The inference unit determines these parameters of the Gaussian distribution by variational inference.

Specifically, the inference unit repeatedly performs a calculation that determines the parameters of the Gaussian distribution so as to reduce the difference between the probability distribution shown by the broken line in FIG. 5 (the Gaussian distribution) and the probability distribution shown by the solid line. The number of repetitions is handled in the same way as in the inference unit 14.

As a result, the inference unit obtains only one action sequence, μ0.

As FIGS. 4 and 5 show, the difference between the variational distribution shown by the broken line (the Gaussian mixture distribution in FIG. 4 and the Gaussian distribution in FIG. 5) and the probability distribution shown by the solid line (the probability distribution corresponding to the action sequences) tends to be smaller when a Gaussian mixture distribution is used (see FIG. 4). This is because, with a Gaussian mixture distribution, the parameters of each Gaussian distribution can be adjusted to reduce the difference from the probability distribution shown by the solid line.

FIG. 6 is an explanatory diagram conceptually showing the plurality of candidates determined by the inference unit 14 in this embodiment.

A path-finding problem is used here as an example. The problem is to derive a path that moves within a rectangular area while avoiding the obstacles shown as black circles (●) in FIG. 6, from the start position ("S" in the figure) to the goal position ("G" in the figure). The direction of travel at each position on the path corresponds to an "action".

With reference to FIG. 6, the plurality of paths derived by the inference unit 14 are described in comparison with the single path derived by the inference unit in the related technology.

FIG. 6(a) shows the single path 30 derived by the inference unit in the related technology. The path 30 shown in FIG. 6(a) corresponds to the single action sequence μ0 shown in FIG. 5.

FIG. 6(b) shows the plurality of paths 31, 32, 33, 34, and 35 derived by the inference unit 14 in this embodiment. The example shown is one in which the inference unit 14 uses, as the variational distribution, a Gaussian mixture distribution that mixes five Gaussian distributions. In this case, five paths are derived.

In FIG. 6(b), the width of the line showing each of the paths 31 and so on indicates the mixing ratio, within the Gaussian mixture distribution, of the Gaussian distribution corresponding to that path. In this example, the line for path 31 is drawn thicker than the others, indicating that the mixing ratio of the Gaussian distribution corresponding to path 31 is higher than those of the other paths 32 to 35.

The output unit 15 can output one path selected from the paths shown in FIG. 6(b), for example path 31.

FIG. 7 is a flow diagram showing the learning method executed by the learning device 10 in this embodiment. This learning method is a method of learning the behavior of the agent 20 using model-based reinforcement learning.

As shown in FIG. 7, in step S101, time-series data indicating the state and actions of the agent 20 when the agent 20 acts is acquired.

In step S102, the dynamics model 17 is constructed by performing supervised learning using the time-series data acquired in step S101.

In step S103, a plurality of candidates for the action sequence of the agent 20 are derived, based on the dynamics model 17, by variational inference using a mixture model as the variational distribution.

In step S104, one candidate selected from the derived candidates is output as the action sequence of the agent 20.
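
Steps S101 to S104 can be arranged as a loop in which the output action sequence is used to collect new time-series data, as in the following sketch. The three callables `collect`, `fit_dynamics`, and `infer_candidates` are hypothetical placeholders for the acquisition unit, the learning unit, and the inference unit, and the choice of the candidate with the largest mixing ratio is one of the selection options mentioned in the description.

```python
def learning_loop(collect, fit_dynamics, infer_candidates, n_rounds=3):
    """Run steps S101-S104 repeatedly, feeding each output back into S101.

    collect(plan) -> list of transitions (plan is None on the first round)
    fit_dynamics(data) -> dynamics model
    infer_candidates(model) -> (candidates, mixing_ratios)
    """
    data = []
    plan = None
    for _ in range(n_rounds):
        data.extend(collect(plan))                      # S101: acquire data
        model = fit_dynamics(data)                      # S102: build dynamics model
        candidates, ratios = infer_candidates(model)    # S103: derive candidates
        best = max(range(len(candidates)), key=lambda k: ratios[k])
        plan = candidates[best]                         # S104: output one candidate
    return plan, data
```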

上記一連の処理により、学習装置１０は、エージェント２０の行動の制御をするための学習方法を改善する。 Through the series of processes described above, the learning device 10 improves the learning method for controlling the behavior of the agent 20.

以降において、本実施の形態における学習方法の効果を説明する。 Hereinafter, the effects of the learning method in this embodiment will be explained.

FIG. 8 is an explanatory diagram showing the change over time of the cumulative reward under the learning method of this embodiment, in comparison with the related art.

The following shows an example of simulation results, obtained with an existing simulator, for the cumulative reward when learning is performed by the learning device 10 according to this embodiment. The existing simulator is the physics engine MuJoCo. FIGS. 8(a) to 8(d) show the results for four different tasks (namely HalfCheetah, Ant, Hopper, and Walker2d) provided by OpenAI Gym, an evaluation platform for reinforcement learning that runs on this simulator.

The thick solid lines in FIG. 8 show the change in the cumulative reward for the learning device 10 according to this embodiment. The thin solid lines in FIG. 8 show the change in the cumulative reward for the learning device according to the related art. The convergence value of each of these cumulative-reward curves is indicated by a dashed line.

As shown in FIG. 8, in all of the tasks shown in FIGS. 8(a) to 8(d), the cumulative reward of the learning device 10 according to this embodiment begins to increase earlier with respect to elapsed time than that of the learning device according to the related art, and converges to its convergence value in a shorter time (that is, convergence is faster). In addition, the learning device 10 according to this embodiment reaches a larger convergence value of the cumulative reward than the learning device according to the related art.

This indicates that the learning device 10 according to this embodiment has improved learning performance compared with the learning device according to the related art.

FIG. 9 is an explanatory diagram showing the convergence value of the cumulative reward under the learning method of this embodiment, in comparison with the related art. FIG. 9 shows, for the HalfCheetah task, the convergence value of the cumulative reward for each number M of Gaussian distributions included in a single Gaussian distribution or a Gaussian mixture distribution.

Note that the case M = 1 corresponds to a single Gaussian distribution, that is, to the learning method of the related art. The cases M = 3, 5, and 7 correspond to a Gaussian mixture distribution, that is, to the learning method of this embodiment.

As shown in FIG. 9, the convergence value of the cumulative reward is larger when variational inference is performed with a Gaussian mixture distribution (M = 3, 5, or 7) than when it is performed with a single Gaussian distribution (M = 1). The convergence value also varies with the value of M. Note that the convergence value of the cumulative reward depends on how the reward is defined, the type of task, and so on.

As described above, the learning method according to this embodiment performs variational inference using a mixture model as the variational distribution, and can therefore derive a plurality of candidates for the agent's action sequence. Compared with variational inference that uses a single model as the variational distribution, each of the candidates derived in this way differs less from the optimal action sequence, and the candidates converge faster when they are derived. One of the derived candidates is then expected to be used as the agent's action sequence. In this way, a more appropriate action sequence can be output in a shorter time. Thus, according to the above aspect, the learning method for controlling the agent can be improved.

Furthermore, because the state and behavior of the agent controlled with the action sequence output as a result of learning are used for further learning, the action sequence constructed as a result of learning can be brought progressively closer to a more appropriate one. In this way, the learning method for controlling the agent can be improved still further.
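The reuse of newly observed data described above can be sketched as a simple loop. The names `iterate_learning`, `toy_plan`, and `toy_execute`, the scalar toy environment (next state = state + action), and the fixed plan are all assumptions made for illustration; in the embodiment, planning would rebuild the dynamics model from the data and derive candidates from it.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state), scalar toy types


def iterate_learning(
    dataset: List[Transition],
    plan: Callable[[List[Transition]], List[float]],
    execute: Callable[[List[float]], List[Transition]],
    n_iterations: int,
) -> List[Transition]:
    """Alternate planning and execution, feeding each new trajectory back into the data."""
    for _ in range(n_iterations):
        actions = plan(dataset)               # learn from the current data, output a sequence
        dataset = dataset + execute(actions)  # act, then append the observed transitions
    return dataset


def toy_plan(dataset: List[Transition]) -> List[float]:
    # Placeholder: a fixed plan that ignores the data.
    return [1.0, -1.0]


def toy_execute(actions: List[float]) -> List[Transition]:
    # Hypothetical environment: next state = state + action, starting from 0.
    state, trajectory = 0.0, []
    for action in actions:
        next_state = state + action
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


grown = iterate_learning([(0.0, 0.5, 0.5)], toy_plan, toy_execute, n_iterations=3)
```

Each pass adds one executed trajectory to the data, so later passes learn from states the agent actually visited under its own outputs.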

Furthermore, because the candidate corresponding to the probability distribution with the largest mixing ratio in the mixture model is used as the agent's action sequence, the difference from the optimal action sequence can be made even smaller. Therefore, according to the above aspect, the learning method for controlling the agent can be improved so as to output an action sequence closer to the optimal one.
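Selecting the candidate whose mixture component has the largest mixing ratio amounts to an argmax over the mixing ratios. The sketch below assumes each candidate is stored as a list of scalar actions alongside its ratio; the function name `select_by_mixing_ratio` and the five example candidates and ratios are hypothetical and are not taken from the figures.

```python
from typing import List, Sequence, Tuple


def select_by_mixing_ratio(
    candidates: Sequence[Sequence[float]],
    mixing_ratios: Sequence[float],
) -> Tuple[int, List[float]]:
    """Return the index and action sequence of the component with the largest mixing ratio."""
    if not candidates or len(candidates) != len(mixing_ratios):
        raise ValueError("need at least one candidate and one mixing ratio per candidate")
    best = max(range(len(candidates)), key=lambda m: mixing_ratios[m])
    return best, list(candidates[best])


# Illustrative values only: five candidates with made-up mixing ratios.
example_candidates = [[0.1, 0.2], [0.5, 0.4], [-0.3, 0.0], [0.9, 0.9], [0.0, -0.2]]
example_ratios = [0.40, 0.20, 0.15, 0.15, 0.10]
best_index, best_sequence = select_by_mixing_ratio(example_candidates, example_ratios)
```

If two components tie, `max` returns the first one encountered; a different tie-breaking rule could be substituted if needed.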

Furthermore, because a Gaussian mixture distribution is specifically used as the mixture model, a plurality of candidates can be derived more easily by variational inference. The learning method for controlling the agent can therefore be improved more easily.
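To illustrate why a Gaussian mixture can hold several candidates at once, the following sketch applies a generic reward-weighted update to a one-dimensional Gaussian mixture over a single scalar action. This section does not spell out the update rule of the embodiment, so the rule below (weight each sample by the exponential of its reward, then refit each component and rescale its mixing ratio) is a swapped-in illustration. The names `update_mixture` and `two_mode_reward`, the reward shape, and all numeric settings are assumptions.

```python
import math
import random
from typing import Callable, List, Tuple


def update_mixture(
    pi: List[float], mu: List[float], sigma: List[float],
    reward: Callable[[float], float],
    rng: random.Random,
    n_per_component: int = 200,
    min_sigma: float = 0.05,
) -> Tuple[List[float], List[float], List[float]]:
    """One reward-weighted update of mixing ratios, means, and standard deviations."""
    new_pi, new_mu, new_sigma = [], [], []
    for m in range(len(pi)):
        samples = [rng.gauss(mu[m], sigma[m]) for _ in range(n_per_component)]
        # Assumes rewards are non-positive or moderate, so exp() does not overflow.
        weights = [math.exp(reward(a)) for a in samples]
        total = sum(weights)
        if total == 0.0:
            # Every weight underflowed: keep the component but give it zero ratio.
            new_pi.append(0.0)
            new_mu.append(mu[m])
            new_sigma.append(sigma[m])
            continue
        mean = sum(w * a for w, a in zip(weights, samples)) / total
        var = sum(w * (a - mean) ** 2 for w, a in zip(weights, samples)) / total
        new_pi.append(pi[m] * total / n_per_component)  # ratio times average weight
        new_mu.append(mean)
        new_sigma.append(max(math.sqrt(var), min_sigma))
    norm = sum(new_pi)
    if norm == 0.0:
        return list(pi), list(mu), list(sigma)
    return [p / norm for p in new_pi], new_mu, new_sigma


def two_mode_reward(a: float) -> float:
    # Hypothetical reward with a better mode at a = +2 and a slightly worse one at a = -2.
    return max(-(a - 2.0) ** 2, -0.2 - (a + 2.0) ** 2)


rng = random.Random(0)
pi, mu, sigma = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(15):
    pi, mu, sigma = update_mixture(pi, mu, sigma, two_mode_reward, rng)
```

In this toy setting each component settles near one mode of the reward, and the mixing ratio shifts toward the component at the better mode. A single Gaussian could represent only one of the two modes at a time.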

Furthermore, because an ensemble of a plurality of neural networks is used as the dynamics model, a dynamics model with relatively high estimation accuracy can be constructed more easily. The learning method for controlling the agent can therefore be improved more easily.
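The idea of an ensemble dynamics model can be sketched with much simpler members than neural networks. In the example below each member is a two-parameter linear model fitted by least squares on a bootstrap resample of the data. The linear members, the bootstrap resampling, the small ridge term, the toy dynamics (next state = 0.9 × state + 0.5 × action plus noise), and the names `fit_linear`, `fit_ensemble`, and `predict` are all assumptions made to keep the example short. They are stand-ins for the neural networks of the embodiment.

```python
import random
import statistics
from typing import List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state), scalar toy types


def fit_linear(data: List[Transition], ridge: float = 1e-6) -> Tuple[float, float]:
    """Least-squares fit of next_state ~ w_s * state + w_a * action via 2x2 normal equations."""
    sss = sum(s * s for s, a, n in data) + ridge
    saa = sum(a * a for s, a, n in data) + ridge
    ssa = sum(s * a for s, a, n in data)
    ssn = sum(s * n for s, a, n in data)
    san = sum(a * n for s, a, n in data)
    det = sss * saa - ssa * ssa  # strictly positive because of the ridge term
    w_s = (ssn * saa - san * ssa) / det
    w_a = (san * sss - ssn * ssa) / det
    return w_s, w_a


def fit_ensemble(data: List[Transition], n_members: int,
                 rng: random.Random) -> List[Tuple[float, float]]:
    """Fit each member on its own bootstrap resample of the data."""
    members = []
    for _ in range(n_members):
        resample = [rng.choice(data) for _ in data]
        members.append(fit_linear(resample))
    return members


def predict(members: List[Tuple[float, float]], state: float,
            action: float) -> Tuple[float, float]:
    """Return the ensemble mean prediction and the spread across members."""
    preds = [w_s * state + w_a * action for w_s, w_a in members]
    return statistics.mean(preds), statistics.pstdev(preds)


# Synthetic transitions from a hypothetical system with small observation noise.
rng = random.Random(0)
data = []
for _ in range(50):
    s, a = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data.append((s, a, 0.9 * s + 0.5 * a + rng.gauss(0.0, 0.01)))

members = fit_ensemble(data, n_members=5, rng=rng)
mean_pred, spread = predict(members, 1.0, 1.0)
```

The spread across members gives a rough indication of how uncertain the model is about a given state and action, which is the kind of information an uncertainty-aware dynamics model is meant to provide.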

Note that in the above embodiment, each component may be implemented as dedicated hardware or may be realized by executing a software program suited to that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory. Here, the software that implements the learning device and the like of the above embodiment is the following program.

That is, this program causes a computer to execute a learning method for learning the behavior of an agent using model-based reinforcement learning, the learning method including: acquiring time-series data indicating the state and behavior of the agent when the agent acts; constructing a dynamics model by performing supervised learning using the acquired time-series data; deriving, based on the dynamics model, a plurality of candidates for the action sequence of the agent by variational inference using a mixture model as the variational distribution; and outputting one candidate selected from the plurality of derived candidates as the action sequence of the agent.

The learning method and the like according to one or more aspects have been described above based on the embodiment, but the present invention is not limited to this embodiment. Provided they do not depart from the spirit of the present invention, forms obtained by applying various modifications conceivable to those skilled in the art to this embodiment, and forms constructed by combining components of different embodiments, may also be included within the scope of the one or more aspects.

The present invention is applicable to a learning device that controls the behavior of an agent.

10 learning device
11 acquisition unit
12 learning unit
13 storage unit
14 inference unit
15 output unit
17 dynamics model
20 agent
30, 31, 32, 33, 34, 35 route

Claims

A learning method for learning the behavior of an agent using model-based reinforcement learning, the learning method comprising:
acquiring time-series data indicating the state and behavior of the agent when the agent acts;
constructing a dynamics model by performing supervised learning using the acquired time-series data;
deriving, based on the dynamics model, a plurality of candidates for an action sequence of the agent by variational inference using a mixture model as a variational distribution; and
outputting one candidate selected from the plurality of derived candidates as the action sequence of the agent.

The learning method according to claim 1, further comprising acquiring, as the time-series data, new time-series data indicating the state and behavior of the agent when the agent acts according to the output action sequence.

The learning method according to claim 1 or 2, wherein,
in deriving the plurality of candidates, the mixing ratio, within the mixture model as a whole, of the probability distribution corresponding to each of the plurality of candidates is derived together with the plurality of candidates, and
in selecting the one candidate, the candidate corresponding to the probability distribution having the largest mixing ratio among the plurality of derived candidates is selected as the one candidate.

The learning method according to claim 1, wherein the mixture model is a Gaussian mixture distribution.

The learning method according to claim 1, wherein the dynamics model is an ensemble of a plurality of neural networks.

A program for causing a computer to execute the learning method according to any one of claims 1 to 5.