JP2018005739A

JP2018005739A - Reinforcement learning method and reinforcement learning device for neural network

Info

Publication number: JP2018005739A
Application number: JP2016134486A
Authority: JP
Inventors: Teerapat Rojanaarpa
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2016-07-06
Filing date: 2016-07-06
Publication date: 2018-01-11

Abstract

PROBLEM TO BE SOLVED: To provide a reinforcement learning device for a neural network that can effectively prevent the neural network from overfitting during reinforcement learning.
SOLUTION: A data deletion unit 15 calculates, for each item of experience data stored in an experience data storage unit 13, a uniqueness parameter indicating how much that item differs from the other experience data. Based on the calculated uniqueness parameter, the data deletion unit 15 deletes experience data that is similar to other experience data. This prevents the experience data stored in the experience data storage unit 13 from becoming skewed toward highly similar experience data. Overfitting of a DNN 12 can therefore be effectively prevented by performing reinforcement learning of the DNN 12 with the varied experience data stored in the experience data storage unit 13.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a reinforcement learning method and a reinforcement learning device for training a neural network by reinforcement learning when the neural network is used as a learner that learns an optimal policy for determining an action according to the state of a controlled object.

For example, Non-Patent Document 1 describes a reinforcement learning technique in which a multilayer neural network is used as a function approximator for the optimal action-value function.

Reinforcement learning is a machine learning technique in which an agent placed in an environment obtains an optimal policy through interaction with that environment. Specifically, the agent observes the current state of the environment and decides on an action to take based on its policy. The action decided by the agent changes the state of the environment, and a reward is determined according to how the state has changed. In reinforcement learning, the agent's policy is learned so that it chooses the actions that yield the greatest reward over a series of actions. TD learning and Q-learning are known as representative reinforcement learning methods.

In Non-Patent Document 1, reinforcement learning is performed by a method based on Q-learning. In such a method, the optimal policy is learned by approximating a function called the action-value function. In other words, an approximation of the action-value function that maximizes the cumulative sum of future rewards is learned as the optimal policy. Non-Patent Document 1 uses a multilayer neural network as the learner for this optimal policy.

The reinforcement learning technique for the multilayer neural network in Non-Patent Document 1 can be summarized as follows. First, the state of the environment described above, the action taken in that state, the reward obtained by the action, and the state of the environment after the transition caused by the action are collected and stored as experience data in a predetermined memory. During reinforcement learning, experience data is sampled from that memory and a teacher signal is created according to Equation 1.
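Equation 1 itself is not reproduced in this text. A form consistent with the symbol definitions that follow, and with the Q-learning target used in Non-Patent Document 1, would be the following; it is a reconstruction under that assumption, not the equation as filed.

```latex
\mathrm{target} = r + \gamma \max_{a'} Q_{\theta}(s', a')
```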

In Equation 1, r denotes the reward; γ denotes a reinforcement learning parameter called the discount rate (0 < γ < 1); Q_θ(s, a) denotes the approximation of the action-value function expressed with the neural network parameters θ; s' denotes the next state reached when action a is taken in state s; and a' denotes the next action to be taken in the next state s'.

Using this teacher signal, target, the error function can be defined as in Equation 2.
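Equation 2 is likewise not reproduced in this text. Assuming the usual squared-error form between the teacher signal and the current estimate (the factor 1/2 is a common convention and an assumption here), it could be written as:

```latex
L_{\theta}(s, a) = \frac{1}{2}\bigl(\mathrm{target} - Q_{\theta}(s, a)\bigr)^{2}
```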

The error backpropagation method is then applied to the neural network to update the weight of each neuron. Learning ends when the error function L_θ(s, a) is judged to have become sufficiently small.
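The update described above can be illustrated numerically. The following is a minimal sketch, not the method of Non-Patent Document 1: it assumes a squared-error loss and replaces the multilayer network with a linear stand-in, Q_θ(s, a) = θ[a] · s, so that one gradient step can be written by hand. All function names are hypothetical.

```python
def q_value(theta, state, action):
    """Linear stand-in for Q_theta(s, a): dot product of the action's weights and the state."""
    return sum(w * x for w, x in zip(theta[action], state))


def td_target(theta, reward, next_state, gamma):
    """Teacher signal (assumed form): r + gamma * max over a' of Q_theta(s', a')."""
    return reward + gamma * max(
        q_value(theta, next_state, a) for a in range(len(theta))
    )


def squared_error(theta, state, action, target):
    """Error function (assumed form): 0.5 * (target - Q_theta(s, a)) ** 2."""
    return 0.5 * (target - q_value(theta, state, action)) ** 2


def gradient_step(theta, state, action, target, lr):
    """One gradient-descent update of the weights for `action`, with the target held fixed."""
    diff = target - q_value(theta, state, action)
    theta[action] = [w + lr * diff * x for w, x in zip(theta[action], state)]


theta = [[0.0, 0.0], [0.0, 0.0]]  # two actions, two state features
s, a, r, s_next = [1.0, 0.0], 0, 1.0, [0.0, 1.0]

target = td_target(theta, r, s_next, gamma=0.9)
loss_before = squared_error(theta, s, a, target)
gradient_step(theta, s, a, target, lr=0.5)
loss_after = squared_error(theta, s, a, target)
```

In a real multilayer network the hand-written gradient would be replaced by backpropagation, and learning would stop once the error is judged sufficiently small.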

A problem arises, however, if reinforcement learning trains the neural network to fit a certain kind of experience data too closely: the neural network can no longer adapt well to new data. That is, for data whose tendency differs from the experience data used for learning, the neural network cannot determine the action that obtains the maximum reward. One cause of such overfitting is the correlation among consecutively observed experience data. Consecutively observed experience data usually does not change greatly and is therefore correlated. Consequently, when reinforcement learning is performed with such correlated experience data, the learning result is biased by that correlation.

To deal with this problem, Non-Patent Document 1 uses a technique called "Experience Replay". In Experience Replay, experience data from the various situations the agent has experienced is stored in a memory, and during reinforcement learning experience data is sampled at random from that memory. This reduces the correlation in the experience data and suppresses bias in the learning result.
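The random sampling step of Experience Replay can be sketched as follows. This is an illustration only, assuming the memory is a plain Python list and that sampling is uniform and without replacement; `sample_minibatch` and its parameters are hypothetical names.

```python
import random


def sample_minibatch(memory, batch_size, rng=random):
    """Draw experiences uniformly at random, without replacement."""
    return rng.sample(memory, min(batch_size, len(memory)))


# Consecutive transitions (state, action, reward, next_state) are strongly correlated.
memory = [(t, "a", 0.0, t + 1) for t in range(100)]

# A random minibatch mixes transitions from different points in time.
batch = sample_minibatch(memory, 8, random.Random(0))
```

Because the minibatch is drawn from the whole memory rather than from the most recent steps, neighbouring transitions rarely appear together in one update.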

Non-Patent Document 1: "Human-level control through deep reinforcement learning", Volodymyr Mnih, et al., Nature, vol. 518, no. 7540, pp. 529-533, 2015

As described above, executing Experience Replay requires storing the experience data gathered by the agent in a memory. However, a memory cannot store experience data without limit; once the amount of stored experience data reaches the upper limit of the memory's storage capacity, some experience data must be deleted. In general, the oldest experience data is deleted on a first-in, first-out (FIFO) basis.
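FIFO deletion can be illustrated with the standard-library `collections.deque`, which discards the oldest entry automatically once `maxlen` is reached. The capacity of 3 and the tuple contents are arbitrary values chosen for the illustration.

```python
from collections import deque

# A bounded memory: appending to a full deque silently drops the oldest item.
memory = deque(maxlen=3)

for t in range(5):
    memory.append((f"s{t}", f"a{t}", 0.0, f"s{t + 1}"))

# Only the three most recent transitions (t = 2, 3, 4) remain.
oldest_remaining = memory[0]
```

Under this policy, which item is deleted depends only on its age, not on how similar it is to the rest of the memory.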

However, when experience data is deleted by the FIFO method, the variety of situations the agent encounters decreases as learning progresses, so experience data with low similarity is deleted while highly similar experience data is increasingly likely to be newly stored. As a result, the proportion of highly similar experience data in the memory as a whole rises. Even when Experience Replay is executed, highly similar experience data is then more likely to be sampled for learning, and the neural network tends to overfit to that highly similar experience data.

The present invention has been made in view of the above points, and its object is to provide a reinforcement learning method and a reinforcement learning device for a neural network that can effectively prevent overfitting of the neural network during reinforcement learning.

To achieve the above object, the reinforcement learning method for a neural network (12) according to the present invention performs reinforcement learning of the neural network (12) when the neural network (12) is used as a learner that learns an optimal policy for determining an action according to the state of a controlled object, wherein:
a computer (10) collects experience data including the state of the controlled object, an action on the controlled object, the reward obtained by that action, and the state of the controlled object after the transition caused by that action, and stores the experience data in an experience data storage unit (13) having a finite storage capacity;
the computer (10) calculates, for each item of experience data stored in the experience data storage unit (13), a uniqueness parameter indicating how much it differs from the other experience data;
the computer (10) deletes, based on the calculated uniqueness parameter, experience data that is similar to other experience data from the experience data storage unit (13); and
the computer (10) performs reinforcement learning of the neural network (12) using the experience data stored in the experience data storage unit (13).

The reinforcement learning device for a neural network (12) according to the present invention comprises:
an experience data storage unit (13) that has a finite storage capacity and stores experience data each time it is collected, the experience data including the state of a controlled object, an action on the controlled object, the reward obtained by that action, and the state of the controlled object after the transition caused by that action;
a calculation unit (S200) that calculates, for each item of experience data stored in the experience data storage unit, a uniqueness parameter indicating how much it differs from the other experience data;
a deletion unit (S210) that deletes, based on the uniqueness parameter calculated by the calculation unit, experience data that is similar to other experience data from the experience data storage unit; and
a reinforcement learning unit (11) that performs reinforcement learning of the neural network using the experience data stored in the experience data storage unit.

As described above, in the reinforcement learning method and reinforcement learning device for a neural network according to the present invention, a uniqueness parameter indicating how much each item of experience data stored in the experience data storage unit differs from the other experience data is calculated. Based on the calculated uniqueness parameter, experience data that is similar to other experience data is deleted from the experience data storage unit. This prevents the stored experience data from becoming skewed toward highly similar experience data. In other words, experience data that has low similarity to other experience data, that is, highly unique experience data, is retained in the experience data storage unit rather than deleted. Consequently, when the stored experience data is plotted in a space whose axes are the elements of the experience data, the data is distributed over a wide range and extreme differences in distribution density are suppressed. Performing reinforcement learning of the neural network with such widely distributed experience data therefore effectively prevents overfitting of the neural network.

The reference numerals in parentheses above merely indicate one example of the correspondence with specific configurations in the embodiment described later, in order to facilitate understanding of the present invention, and are not intended to limit the scope of the present invention in any way.

Technical features recited in the claims other than those described above will become apparent from the description of the embodiment and the accompanying drawings below.

FIG. 1 is a diagram conceptually showing the configuration of the reinforcement learning device for a neural network according to the embodiment.
FIG. 2 is a flowchart showing the process by which the agent collects experience data and stores it in the experience data storage unit.
FIG. 3 is a flowchart showing the details of the data deletion process executed by the data deletion unit when the experience data storage unit is full.
FIG. 4 is an explanatory diagram illustrating an example of a method for calculating the uniqueness parameter.
FIG. 5(a) shows the distribution of experience data under the FIFO method when the experience data storage unit has become almost full but no data has yet been deleted; FIG. 5(b) shows the distribution of the experience data held in the experience data storage unit after 7000 episodes during which experience data was deleted by the FIFO method.
FIG. 6(a) shows the distribution of experience data at the initial stage, when the experience data storage unit has become almost full, in the case where data is deleted by the data deletion process of the embodiment; FIG. 6(b) shows the distribution of the experience data held in the experience data storage unit after 7000 episodes during which data was deleted by the data deletion process of the embodiment.
FIG. 7 shows the cumulative reward obtained over each set of five completed episodes when reinforcement learning of the DNN is repeated using the experience data held in the experience data storage unit while data is deleted by the FIFO method.
FIG. 8 shows the cumulative reward obtained over each set of five completed episodes when reinforcement learning of the DNN is repeated using the experience data held in the experience data storage unit while data is deleted by the data deletion process of the embodiment.

A reinforcement learning method and a reinforcement learning device for a neural network according to an embodiment of the present invention are described in detail below with reference to the drawings. FIG. 1 conceptually shows the configuration of the reinforcement learning device for a neural network according to this embodiment. In this embodiment, the device is realized by running, on a computer 10, an application that provides a neural network learning function. FIG. 1 shows as blocks the various functions realized by the computer through execution of the application.

As shown in FIG. 1, the reinforcement learning device for a neural network includes an agent 11 and an environment 14.

The agent 11 includes a multilayer neural network (deep neural network, DNN) 12 serving as the learner, and an experience data storage unit 13. The agent 11 collects experience data for reinforcement learning and stores it in the experience data storage unit 13.

When collecting experience data, the agent 11 observes the state s_t of the environment 14, which is the controlled object, at a time t and selects an action a_t based on that state s_t. So that diverse pairs of state s and action a are selected, the action a for state s is chosen, at a certain ratio, by an action selection method such as the ε-greedy method or the Boltzmann selection method. For example, immediately after learning starts, action a is determined by such an action selection method 95% of the time, and according to the policy that the DNN 12 holds at that point the remaining 5% of the time. When a state s is input, the DNN 12 is configured to output, based on the policy it holds, the highest evaluation value for the action a best suited to that state s.
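The mixing of exploratory and policy-driven action selection can be sketched as below. This is a simplified illustration under stated assumptions: uniform random choice stands in for the ε-greedy or Boltzmann selection methods named above, `q_values` stands in for the evaluation values output by the DNN 12, and `select_action` and `explore_ratio` are hypothetical names.

```python
import random


def select_action(q_values, explore_ratio, rng=random):
    """Pick a random action with probability explore_ratio, else the highest-valued action."""
    if rng.random() < explore_ratio:
        # Exploratory branch (assumption: uniform random choice).
        return rng.randrange(len(q_values))
    # Policy branch: the action with the highest evaluation value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]  # evaluation values for three candidate actions
greedy_action = select_action(q_values, explore_ratio=0.0)
early_action = select_action(q_values, explore_ratio=0.95, rng=random.Random(0))
```

With `explore_ratio=0.95`, most selections come from the exploratory branch, which corresponds to the situation immediately after learning starts.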

As reinforcement learning progresses, the agent 11 gradually lowers the ratio at which action a is determined by the action selection method and raises the ratio at which action a is selected by the policy of the DNN 12. For example, each time learning is performed, the ratio of action selection by the action selection method is multiplied by 0.99 (that is, reduced to 99% of its previous value). Once that ratio reaches a predetermined minimum (for example, 5%), it is not lowered any further.
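The schedule can be sketched as a multiplicative decay with a floor. The sketch assumes that the 99% figure means scaling the exploratory ratio by 0.99 at each learning step, starting from 95% and stopping at the 5% minimum; `decay_explore_ratio` is a hypothetical name.

```python
def decay_explore_ratio(ratio, factor=0.99, minimum=0.05):
    """Scale the exploratory ratio by `factor`, never going below `minimum`."""
    return max(minimum, ratio * factor)


ratio = 0.95   # ratio immediately after learning starts
steps = 0
while ratio > 0.05:
    ratio = decay_explore_ratio(ratio)
    steps += 1
# After the loop, `ratio` sits at the floor and further calls leave it unchanged.
```

Under these assumed values the floor is reached after a few hundred learning steps, after which the agent relies on the DNN policy for 95% of its choices.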

The agent 11 gives the environment 14 the action a_t selected according to the state s_t at time t. The environment 14 determines the next state s_{t+1} from the current state s_t and the selected action a_t. For example, from the current state s_t and the action a_t, the environment 14 calculates the probability of transitioning to each next state s_{t+1} and the (expected) reward r for the transition from s_t to s_{t+1}. The environment 14 then gives the agent 11 the next state s_{t+1} with the highest transition probability and the reward r expected for it.

On receiving the next state s_{t+1} and the reward r from the environment 14, the agent 11 combines them with the current state s_t and action a_t to create experience data, and stores it in the experience data storage unit 13. That is, each item of experience data e_i contains the current state s_t, the action a_t, the reward r_t, and the next state s_{t+1}: e_i = (s_t, a_t, r_t, s_{t+1}).
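One way to represent the tuple e_i = (s_t, a_t, r_t, s_{t+1}) in code is a named tuple. The field types below (a tuple of floats for states, an integer for a discrete action) are assumptions made for the illustration; the source does not fix them.

```python
from typing import NamedTuple, Tuple


class Experience(NamedTuple):
    """One item of experience data: (s_t, a_t, r_t, s_{t+1})."""
    state: Tuple[float, ...]
    action: int
    reward: float
    next_state: Tuple[float, ...]


e = Experience(state=(0.0, 1.0), action=1, reward=0.5, next_state=(0.1, 0.9))

# Fields can be read by name or unpacked positionally.
s_t, a_t, r_t, s_next = e
```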

Once experience data e_i has accumulated in the experience data storage unit 13, the agent 11 randomly samples a predetermined number of items from it and executes reinforcement learning. As in the conventional approach, a known method such as TD learning or Q-learning is applied. Using such a method, an error function is defined from the sampled experience data, and the error backpropagation method is applied to the DNN 12 to update the weight of each neuron, thereby performing reinforcement learning of the DNN 12.

In this embodiment as well, experience data is sampled at random from the experience data storage unit 13 for reinforcement learning of the DNN 12; that is, the conventional Experience Replay is adopted. This reduces the correlation in the experience data to some extent and suppresses bias in the learning result.

However, the storage capacity of the experience data storage unit 13 is finite. Therefore, once the amount of stored experience data reaches the upper limit of that capacity, already stored experience data must be deleted before new experience data can be stored. If experience data is deleted by the so-called FIFO method in this situation, the variety of situations the agent 11 encounters decreases as learning progresses, so experience data with low similarity is deleted while only highly similar experience data is likely to be newly stored. As a result, the proportion of highly similar experience data in the experience data storage unit 13 as a whole rises. Even when Experience Replay is executed, the neural network then tends to overfit to that highly similar experience data.

For this reason, as shown in FIG. 1, the reinforcement learning device for a neural network according to this embodiment is provided with a data deletion unit 15 that, instead of using the simple FIFO method, evaluates the dissimilarity of each item of experience data from the other experience data and selects experience data for deletion based on that dissimilarity. More specifically, the data deletion unit 15 deletes experience data with low dissimilarity from other experience data (that is, data that closely resembles other experience data) while retaining experience data with high dissimilarity (that is, highly unique data).

This prevents the experience data stored in the experience data storage unit 13 from becoming skewed toward highly similar experience data. In other words, experience data that has low similarity to other experience data, that is, highly unique experience data, is retained in the experience data storage unit 13 rather than deleted. Consequently, when the experience data stored in the experience data storage unit 13 is plotted in a multidimensional space whose axes are the elements of the experience data, the data is distributed over a wide range and extreme differences in distribution density are suppressed. Performing reinforcement learning of the DNN 12 with such widely distributed experience data therefore effectively prevents overfitting of the DNN 12.

The method of storing and deleting experience data in the reinforcement learning device for a neural network according to this embodiment is described in detail below with reference to the flowcharts of FIGS. 2 and 3. The flowchart of FIG. 2 shows the process by which the agent 11 collects experience data and stores it in the experience data storage unit 13. The flowchart of FIG. 3 shows the data deletion process executed by the data deletion unit 15 when the experience data storage unit 13 is full.

First, the experience data storage process is described with reference to the flowchart of FIG. 2. In step S100, the agent 11 selects an action a for the observed state s and gives it to the environment 14. In the following step S110, the agent 11 receives the next state s_{t+1} and the reward r from the environment 14 and combines them with the current state s_t and action a_t. Experience data is thereby collected.

Next, in step S120, it is determined whether the amount of stored experience data has reached the upper limit of the storage capacity of the experience data storage unit 13, that is, whether the experience data storage unit 13 is full. If it is determined in step S120 that the unit is full, the process proceeds to step S130, where the data deletion unit 15 executes the experience data deletion process. This deletion process is described later.

When experience data has been deleted by the data deletion process of step S130 and free space for storing new experience data has been secured in the experience data storage unit 13, step S140 is executed. In step S140, the collected experience data is stored in the experience data storage unit 13. If, on the other hand, it is determined in step S120 that the experience data storage unit 13 is not full, the process proceeds directly to step S140, where the agent 11 stores the collected experience data in the experience data storage unit 13.
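The control flow of steps S120 to S140 can be sketched as follows. The deletion policy is passed in as a callback so that the flow itself stays independent of how an item is chosen for deletion; `store_experience`, `delete_one`, and `delete_oldest` are hypothetical names, and `delete_oldest` is only a placeholder used to make the sketch runnable, not the deletion process of the embodiment.

```python
def store_experience(memory, capacity, experience, delete_one):
    """Store one item, freeing space first if the memory is full."""
    if len(memory) >= capacity:   # S120: is the storage unit full?
        delete_one(memory)        # S130: delete one stored item
    memory.append(experience)     # S140: store the newly collected item


def delete_oldest(memory):
    """Placeholder deletion policy (FIFO), used here only for demonstration."""
    memory.pop(0)


memory = []
for item in range(5):
    store_experience(memory, capacity=3, experience=item, delete_one=delete_oldest)
```

Swapping `delete_oldest` for a similarity-based policy changes which item is removed in S130 without touching the surrounding flow.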

The agent 11 then performs reinforcement learning of the DNN 12 using the experience data stored in the experience data storage unit 13. This reinforcement learning is performed repeatedly at predetermined timings, for example: when a predetermined number of items of experience data have been stored in the experience data storage unit 13; when a predetermined number of episodes have been completed, in the case where a termination condition is defined for control of the controlled object (environment 14) and the span from the start of control to its end is treated as one episode; or when a predetermined time has elapsed since the previous reinforcement learning.

Next, the experience data deletion process is described with reference to the flowchart of FIG. 3. In step S200, the data deletion unit 15 calculates, for each item of experience data, a uniqueness parameter for evaluating its dissimilarity from the other experience data.

For example, as shown in FIG. 4, when each item of experience data is plotted in a multidimensional space whose axes are the elements of the experience data, the uniqueness parameter u can be defined as the reciprocal of the number of items of experience data lying within Euclidean distance k of a given item x. This is because the more items of experience data lie within Euclidean distance k of x, the more similar x can be regarded as being to other experience data.

With the uniqueness parameter u defined in this way, a lower value of u is calculated as the number of surrounding experience data items increases, and conversely a higher value of u is calculated as that number decreases. When the number of surrounding experience data items is zero, u may be set to a predetermined maximum value.
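As an illustration only, the radius-count definition above can be sketched in Python as follows. The function name `uniqueness`, the default maximum `u_max=2.0`, and the choice not to count x itself among its neighbours are assumptions made for this sketch; the embodiment only states that a predetermined maximum is used when no neighbours exist.

```python
import math

def uniqueness(points, k, u_max=2.0):
    """Return u for every point: 1 / (number of other points within distance k).

    u_max is the assumed "predetermined maximum value" used when a point
    has no neighbours inside the radius k.
    """
    values = []
    for i, x in enumerate(points):
        neighbours = sum(
            1
            for j, y in enumerate(points)
            if j != i and math.dist(x, y) <= k
        )
        values.append(u_max if neighbours == 0 else 1.0 / neighbours)
    return values

# Three clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
u = uniqueness(points, k=0.5)
```

In this example each clustered point has two neighbours and receives u = 0.5, while the isolated point receives the assumed maximum. The pairwise loop costs O(n²) distance evaluations, which is one reason the description later discusses reducing how often u is recalculated.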

The distance used to measure the similarity between experience data items is not limited to the Euclidean distance described above; other known distances (for example, the Mahalanobis distance) may be used. Furthermore, each experience data item may be treated as a vector and its similarity to other vectors evaluated using cosine similarity or the like. For example, the uniqueness parameter may be defined as the reciprocal of the number of experience data items whose vectors have at least a predetermined similarity to the vector of a given experience data item.
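A minimal sketch of the cosine-similarity variant is given below, under the same assumptions as before (the item itself is not counted, and an assumed maximum `u_max` is returned when no sufficiently similar vector exists). The names `cosine_similarity`, `uniqueness_cosine`, and `min_similarity` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0.0 else 0.0

def uniqueness_cosine(vectors, min_similarity, u_max=2.0):
    """u = 1 / (number of other vectors with similarity >= min_similarity)."""
    values = []
    for i, x in enumerate(vectors):
        similar = sum(
            1
            for j, y in enumerate(vectors)
            if j != i and cosine_similarity(x, y) >= min_similarity
        )
        values.append(u_max if similar == 0 else 1.0 / similar)
    return values

# The first two vectors point in nearly the same direction; the third does not.
vectors = [(1.0, 0.0), (2.0, 0.1), (0.0, 1.0)]
u_cos = uniqueness_cosine(vectors, min_similarity=0.99)
```

Because cosine similarity ignores vector length, the first two vectors count as similar even though their magnitudes differ, which is a behavioural difference from the Euclidean variant.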

When calculating the uniqueness parameter of each experience data item, it is not always necessary to use all of the elements included in the experience data. Specifically, of the elements included in the experience data (the current state s_t, the action a_t, the reward r_t, and the next state s_{t+1}), the uniqueness parameter may be calculated from the three elements other than the next state s_{t+1}. This is because the next state s_{t+1} is highly correlated with the current state s_t, so using both elements would only make the information redundant.
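The selection of elements described above might look like the following sketch. The `Experience` tuple layout, the integer encoding of the action, and the helper name `feature_vector` are assumptions; the embodiment does not specify how the elements are encoded or scaled before distances are computed.

```python
from typing import NamedTuple, Tuple

class Experience(NamedTuple):
    state: Tuple[float, ...]       # s_t
    action: int                    # a_t (assumed to be an integer index)
    reward: float                  # r_t
    next_state: Tuple[float, ...]  # s_{t+1}

def feature_vector(e: Experience) -> Tuple[float, ...]:
    """Build the vector used for the uniqueness parameter from s_t, a_t, r_t.

    s_{t+1} is deliberately left out because it is highly correlated with s_t.
    """
    return (*e.state, float(e.action), e.reward)

sample = Experience(state=(0.2, -0.1), action=1, reward=-0.5,
                    next_state=(0.25, -0.1))
vec = feature_vector(sample)
```

The full tuple is still stored for learning; only the vector used for the similarity computation is shortened.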

Furthermore, when at least one of the state s and the action a included in the experience data consists of high-dimensional data, the uniqueness parameter of each experience data item may be calculated after the dimensionality of that data has been reduced. For example, when the state s is stored as an image, a dimensionality reduction algorithm such as an autoencoder may be used to extract low-dimensional features, and the uniqueness parameter may be calculated from the extracted features. This reduces the computational load of calculating the uniqueness parameter. Principal component analysis or the like may also be used as the dimensionality reduction algorithm.

In step S210 of the flowchart of FIG. 3, experience data having a low uniqueness parameter u is deleted from the experience data storage unit 13. At this point, only the single experience data item with the lowest uniqueness parameter u could be deleted, but the uniqueness parameter u would then have to be recalculated for every experience data item stored in the experience data storage unit 13 each time a new experience data item is collected, which would increase the computational load. In the present embodiment, therefore, a plurality of experience data items whose uniqueness parameter u is equal to or less than a predetermined deletion reference value are deleted together. This reduces how often the uniqueness parameter u must be calculated and suppresses the increase in computational load caused by calculating it.
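The batch deletion of step S210 can be sketched as below, assuming the radius-count uniqueness from FIG. 4 and assuming that pruning is triggered when the buffer reaches its capacity. The names `radius_uniqueness`, `store`, `capacity`, and `threshold` (standing in for the deletion reference value) are hypothetical.

```python
import math

def radius_uniqueness(points, k, u_max=2.0):
    """u = 1 / (number of other points within distance k); u_max if none."""
    values = []
    for i, x in enumerate(points):
        n = sum(1 for j, y in enumerate(points)
                if j != i and math.dist(x, y) <= k)
        values.append(u_max if n == 0 else 1.0 / n)
    return values

def store(buffer, item, capacity, k, threshold):
    """Append item; once the buffer is full, delete every item with u <= threshold."""
    buffer.append(item)
    if len(buffer) >= capacity:
        u = radius_uniqueness(buffer, k)
        # Keep only items whose uniqueness exceeds the deletion reference value.
        buffer[:] = [e for e, u_i in zip(buffer, u) if u_i > threshold]

buffer = []
for item in [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (0.05, 0.05)]:
    store(buffer, item, capacity=5, k=0.5, threshold=0.4)
```

Uniqueness is computed once, when the fifth item fills the buffer, rather than on every insertion. In this example all four clustered items have u = 1/3 and are deleted together, leaving only the isolated item.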

Regarding the deletion of experience data, the uniqueness parameter u may be compared directly with the deletion reference value so that experience data whose uniqueness parameter u is equal to or less than the deletion reference value is deleted deterministically. Alternatively, the experience data to be deleted may be selected stochastically by calculating a deletion probability from the uniqueness parameter u. For example, as shown in Equation 3, a deletion probability P may be defined by normalizing the uniqueness parameter u of each experience data item using the uniqueness parameters of all the experience data, and the experience data to be deleted may be determined according to that deletion probability P. As a concrete implementation, a random number generator that produces uniform random values from 0 to 1 may be prepared, the generated random value compared with the deletion probability P, and the experience data item deleted when the random value is smaller than the deletion probability P.
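The stochastic selection can be illustrated with the sketch below. Equation 3 is not reproduced in this text, so the normalization used here, P_i = 1 − u_i / max_j u_j, is purely an assumed stand-in and should not be read as the formula of the embodiment. Only the comparison rule (delete when the uniform random value is smaller than P) is taken from the description. The function names are hypothetical.

```python
import random

def deletion_probabilities(u_values):
    """ASSUMED normalization (not Equation 3): P_i = 1 - u_i / max(u)."""
    if not u_values:
        return []
    u_top = max(u_values)
    return [1.0 - u / u_top for u in u_values]

def stochastic_prune(buffer, u_values, rng):
    """Delete an item when a uniform random value in [0, 1) is smaller than its P."""
    probs = deletion_probabilities(u_values)
    return [e for e, p in zip(buffer, probs) if not (rng.random() < p)]

items = ["a", "b", "c"]
u_values = [0.5, 0.5, 2.0]
probs = deletion_probabilities(u_values)
kept = stochastic_prune(items, u_values, random.Random(0))
```

Under this assumed normalization the most unique item has P = 0 and is never deleted, while each of the two similar items is deleted with probability 0.75, so some of them may survive a given pass.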

Determining the experience data to be deleted stochastically is expected to have the effect of thinning out dense regions, for example when the experience data held in the experience data storage unit 13 is densely clustered. When the uniqueness parameter u itself is compared with the deletion reference value and all experience data items whose uniqueness parameter u is equal to or less than that value are deleted together, most of the densely clustered experience data may be deleted, depending on how dense it is. With the stochastic method described above, in contrast, experience data can be deleted sparsely even when it is densely clustered.

Next, the effect of the data deletion process described above on the experience data stored in the experience data storage unit 13, and the extent to which it contributes to the stability of learning as a result, are described by comparing a case in which experience data was deleted by the data deletion process of the present embodiment with a case in which experience data was deleted by the FIFO method.

In the cases compared, the agent 11 took an automobile as the control target and trained the DNN 12 to drive the automobile along the center line of a straight road. As the state s, the agent 11 used the lateral distance l_C between the center position of the vehicle and the center line of the road, and the heading O_C of the vehicle relative to the direction of the center line of the road. As the action a, the agent 11 selected one of three actions: going straight, steering right, or steering left.

One episode was defined as the period from the start of vehicle control at the travel start point until a predetermined end condition was satisfied and the vehicle stopped traveling. There were three end conditions: the vehicle reaches a goal located a predetermined distance away, the center position of the vehicle moves a predetermined distance or more away from the center line of the road (the vehicle leaves the road), or a predetermined time elapses. The reward function r_i was defined as in Equation 4.

In Equation 4, w_l, w_o, and w_{offroad} are weighting factors for giving negative rewards (penalties) for the lateral distance l_C of the vehicle, the heading O_C of the vehicle, and departure from the road, respectively.

The DNN 12 has a four-layer structure: the first layer (the input layer) has 2 neurons, the second layer has 50 neurons, the third layer has 20 neurons, and the fourth layer (the output layer) has 3 neurons. All weights of the DNN 12 were randomly initialized so as to be uniformly distributed in the range of −0.05 to 0.05. The learning rate had an initial value of 0.001 and was gradually reduced to 0.00003 by a decay factor of 99% per learning iteration.
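The initialization and learning-rate schedule described above can be sketched as follows. This reads "99% per learning iteration" as multiplying the learning rate by 0.99 after each learning run and clamping it at 0.00003, which is an interpretation rather than something the text states explicitly. Bias terms are omitted, and the names `init_weights` and `learning_rate` are hypothetical.

```python
import random

LAYER_SIZES = [2, 50, 20, 3]  # input, two hidden layers, output

def init_weights(layer_sizes, rng, limit=0.05):
    """One weight matrix per layer pair, each entry drawn from U(-limit, limit)."""
    return [
        [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

def learning_rate(step, initial=0.001, decay=0.99, floor=0.00003):
    """Assumed schedule: exponential decay per learning run, clamped at floor."""
    return max(floor, initial * decay ** step)

weights = init_weights(LAYER_SIZES, random.Random(0))
rates = [learning_rate(n) for n in (0, 100, 348, 349)]
```

Under this reading the learning rate reaches the floor of 0.00003 after about 350 learning runs, since 0.99 raised to the 349th power is just below 0.03.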

FIG. 5 and FIG. 6 show how the experience data is distributed under the conditions described above when data is deleted by the FIFO method and when data is deleted by the data deletion process of the present embodiment, respectively. In FIG. 5 and FIG. 6, the experience data has been reduced to two dimensions by principal component analysis.

FIG. 5(a) shows the distribution of the experience data under the FIFO method when the experience data storage unit 13 has become almost full, that is, before any data has been deleted. FIG. 6(a) likewise shows the distribution of the experience data in the present embodiment when the experience data storage unit 13 has become almost full. In both FIG. 5(a) and FIG. 6(a), various pairs of state s and action a are selected by the action selection method described above, so the experience data is widely distributed in much the same way at this initial stage.

FIG. 5(b), on the other hand, shows the distribution of the experience data stored in the experience data storage unit 13 after 7000 episodes with experience data deleted by the FIFO method. FIG. 6(b) shows the distribution of the experience data stored in the experience data storage unit 13 after 7000 episodes with data deleted by the data deletion process of the present embodiment.

In FIG. 5(b), the experience data is concentrated near the center and is sparse in the surrounding region. This is presumably because, when old experience data is deleted each time new experience data is stored, the tendency to select similar actions a for similar states s grows stronger as learning progresses. In contrast, when data is deleted by the data deletion process of the present embodiment, the experience data remains distributed over a wide range without changing greatly from the initial stage shown in FIG. 6(a). This is because, as described above, the experience data deletion process of the present embodiment deletes similar experience data and retains highly unique experience data.

Next, FIG. 7 shows the result for the FIFO method: reinforcement learning of the DNN 12 was repeated using the experience data stored in the experience data storage unit 13 whenever a predetermined learning execution condition was satisfied, with data deleted by the FIFO method, and the cumulative value of the reward r obtained during each set of five episodes was recorded every time five episodes were completed. FIG. 8 shows the corresponding result when data was deleted by the data deletion process of the present embodiment under otherwise identical conditions.
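The quantity plotted in FIG. 7 and FIG. 8 can be computed along the lines of the sketch below, assuming that the total reward of each episode has already been recorded in a list. The function name `grouped_reward_totals` and the handling of a trailing incomplete group (it is dropped) are assumptions.

```python
def grouped_reward_totals(episode_rewards, group_size=5):
    """Sum per-episode rewards over consecutive groups of group_size episodes.

    A trailing group with fewer than group_size episodes is ignored.
    """
    totals = []
    for start in range(0, len(episode_rewards) - group_size + 1, group_size):
        totals.append(sum(episode_rewards[start:start + group_size]))
    return totals

# Eleven episodes yield two complete groups; the eleventh episode is left out.
totals = grouped_reward_totals([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
```

The spread of successive totals is what the comparison of the two deletion methods examines.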

FIG. 7 shows that when experience data is deleted by the FIFO method, the cumulative reward continues to fluctuate even as the number of learning iterations increases, and learning does not reach a point where a stable reward is always obtained. In contrast, when experience data is deleted by the data deletion process of the present embodiment, the fluctuation of the cumulative value of the reward r clearly decreases as learning progresses. This is because, as described above, the experience data stored in the experience data storage unit 13 remains distributed over a wide range; in other words, diverse experience data is stored in the experience data storage unit 13. As a result, learning from the experience data avoids overfitting to highly similar data, and an action a that yields a good reward r can be selected even when the input state s is a rare one.

A preferred embodiment of the present invention has been described above. However, the present invention is in no way limited to the embodiment described above and can be implemented with various modifications without departing from the gist of the present invention.

For example, although the embodiment described above uses an agent 11 that has a single DNN 12, the agent 11 may instead have separate neural networks, one used as an Actor and one used as a Critic.

In the embodiment described above, the agent 11 learns to drive an automobile, as the control target, along the center line of a road. However, more complex control may be learned, such as driving an automobile autonomously while avoiding obstacles. The control target is also not limited to an automobile; it may be any target that can be controlled or processed by a neural network, such as images (image recognition), speech (speech recognition), or robots.

Furthermore, in the embodiment described above, a single computer implements the functions of the agent 11, the environment 14, the data deletion unit 15, and so on, but these functions may instead be implemented by a plurality of computers.

10 Computer
11 Agent
12 Deep neural network
13 Experience data storage unit
14 Environment
15 Data deletion unit

Claims

A reinforcement learning method for performing reinforcement learning of a neural network (12) when the neural network is used as a learner for learning an optimal policy for determining an action according to a state of a control target, wherein:
a computer (10) collects experience data including the state of the control target, the action taken on the control target, the reward obtained by the action, and the state of the control target after the transition caused by the action, and stores the experience data in an experience data storage unit (13) having a finite storage capacity;
the computer (10) calculates, for each experience data item stored in the experience data storage unit, a uniqueness parameter indicating how different the item is from the other experience data;
the computer (10) deletes, based on the calculated uniqueness parameter, experience data similar to other experience data from the experience data storage unit; and
the computer (10) performs reinforcement learning of the neural network using the experience data stored in the experience data storage unit.

The reinforcement learning method according to claim 1, wherein the computer calculates the uniqueness parameter of each experience data item based on the state, the action, and the reward, out of the state, the action, the reward, and the post-transition state included in the experience data.

The reinforcement learning method according to claim 1 or 2, wherein, when at least one of the state and the action included in the experience data consists of high-dimensional data, the computer reduces the dimensionality of the high-dimensional data and then calculates the uniqueness parameter of each experience data item.

The reinforcement learning method, wherein the computer determines, as the experience data to be deleted, the experience data item having the lowest uniqueness parameter or the experience data having a uniqueness parameter equal to or less than a predetermined deletion reference value.

The reinforcement learning method according to claim 1, wherein the computer evaluates the uniqueness parameter by a probabilistic method to determine experience data to be deleted.

The reinforcement learning method according to claim 5, wherein the computer normalizes the uniqueness parameter of each experience data item using the uniqueness parameters of all the experience data, and determines the experience data to be deleted based on the normalized result.

The reinforcement learning method according to any one of claims 1 to 6, wherein, when the amount of stored experience data reaches the upper limit of the storage capacity of the experience data storage unit, the computer calculates the uniqueness parameters of all the experience data stored in the experience data storage unit and deletes a plurality of experience data items together based on the calculated uniqueness parameters.

A reinforcement learning device for performing reinforcement learning of a neural network (12) when the neural network is used as a learner for learning an optimal policy for determining an action according to a state of a control target, the device comprising:
an experience data storage unit (13) that has a finite storage capacity and stores experience data each time experience data is collected, the experience data including the state of the control target, the action taken on the control target, the reward obtained by the action, and the state of the control target after the transition caused by the action;
a calculation unit (S200) that calculates, for each experience data item stored in the experience data storage unit, a uniqueness parameter indicating how different the item is from the other experience data;
a deletion unit (S210) that deletes, based on the uniqueness parameter calculated by the calculation unit, experience data similar to other experience data from the experience data storage unit; and
a reinforcement learning unit that performs reinforcement learning of the neural network using the experience data stored in the experience data storage unit.

The reinforcement learning device, wherein the calculation unit calculates the uniqueness parameter of each experience data item based on the state, the action, and the reward, out of the state, the action, the reward, and the post-transition state included in the experience data.

The reinforcement learning device according to claim 8 or 9, wherein, when at least one of the state and the action included in the experience data consists of high-dimensional data, the calculation unit reduces the dimensionality of the high-dimensional data and then calculates the uniqueness parameter of each experience data item.

The reinforcement learning device, wherein the deletion unit determines, as the experience data to be deleted, the experience data item having the lowest uniqueness parameter or the experience data having a uniqueness parameter equal to or less than a predetermined deletion reference value.

The reinforcement learning device according to claim 8, wherein the deletion unit evaluates the uniqueness parameter by a probabilistic method and determines experience data to be deleted.

The reinforcement learning device according to claim 12, wherein the deletion unit normalizes the uniqueness parameter of each experience data item using the uniqueness parameters of all the experience data, and determines the experience data to be deleted based on the normalized result.

The reinforcement learning device according to claim 8, wherein, when the amount of stored experience data reaches the upper limit of the storage capacity of the experience data storage unit, the calculation unit calculates the uniqueness parameters of all the experience data stored in the experience data storage unit, and the deletion unit deletes a plurality of experience data items together based on the calculated uniqueness parameters.