JP2019087096A

JP2019087096A - Action determination system and automatic driving control device

Info

Publication number: JP2019087096A
Application number: JP2017215645A
Authority: JP
Inventors: Kosuke Nakanishi (中西 康輔); Yuji Yasui (安井 裕司); Yuki Kizumi (喜住 祐紀); Shota Onishi (大西 翔太); Makoto Ishii (石井 信)
Original assignee: Honda Motor Co Ltd; Kyoto University
Current assignee: Honda Motor Co Ltd; Kyoto University
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2019-06-06
Anticipated expiration: 2037-11-08
Also published as: JP6845529B2

Abstract

To provide an action determination system and an automatic driving control device capable of improving the learning speed while securing learning stability when a reinforcement learning method is used.

SOLUTION: In an action determination system 10, an action value function Q is calculated using a state s, and an optimal action a is determined using the action value function Q. A parameter θ of a neural network for calculating the action value function Q is updated so as to minimize an error function L, which is defined to include a squared term of the TD error of the action value function Q and a squared term of the difference between the action value function Q and a target value T.

SELECTED DRAWING: Figure 2

Description

The present invention relates to an action determination system that determines an action by an agent using a reinforcement learning method, and to an automatic driving control device including the same.

Conventionally, the system described in Patent Document 1 is known as an action determination system using a reinforcement learning method. In this action determination system, with the utterances of a plurality of users as the state s, the response to the utterances as the action a, and the reward as r, an action value function Q is defined so that the reward r is maximized (equation 4 of that document), and reinforcement learning is performed using this action value function Q. The action a is then calculated based on the learning result and read out by a robot as the response.

When reinforcement learning is performed using the action value function Q in this way, a known technique is to approximate the action value function Q with a neural network, define an error function L based on the TD error, and update the neural network so that this error function is minimized. In this case, the general Q-learning method uses the error function L shown in the following equation (1).
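The image of equation (1) is not reproduced in this text. As a sketch only, the TD-error-based error function of general Q-learning is commonly written as below; the reward symbol r, the next action a', and the omission of any expectation operator or scaling factor are assumptions, not the patent's exact notation.

```latex
L(\theta) = \left\{ r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right\}^{2}
```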

In equation (1), θ represents the parameters (weights etc.) of the neural network, and s' represents the next value of the state. γ is a discount rate set such that 0 < γ ≤ 1 holds.

However, when the error function L of equation (1) is used, the action value function that serves as the update target also changes with each update step, so the update of the neural network becomes unstable and learning becomes unstable. To avoid this problem, the Fixed Target Q-Network method uses, as shown in the following equation (2), an error function L defined so that the expected reward in the TD error contains the output value T of a Target Q-Network (hereinafter "target value") instead of the action value function Q (Non-Patent Documents 1 and 2).
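The image of equation (2) is likewise not reproduced here. A commonly used form of the Fixed Target Q-Network error function, consistent with the description above, is sketched below; the symbols r, a' and the separate target-network parameter θ̄ are assumptions introduced for illustration.

```latex
L(\theta) = \left\{ r + \gamma \max_{a'} T(s', a'; \bar{\theta}) - Q(s, a; \theta) \right\}^{2}
```

In this form the bootstrap term comes from T rather than Q, which is why the update target stays fixed while θ̄ is held.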

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2017-173874
Non-Patent Document 1: "Human-level control through deep reinforcement learning", [online], [retrieved November 2, 2017], Internet <URL: http://www.teach.cs.toronto.edu/~csc2542h/fall/material/csc2542f16_dqn.pdf>
Non-Patent Document 2: "Deep Reinforcement Learning with Double Q-learning", [online], [retrieved November 2, 2017], Internet <URL: https://arxiv.org/pdf/1509.06461.pdf>

When the neural network is updated using the error function L of equation (2), the target value T is held without being updated until learning has been executed a predetermined number of times. Because the value that serves as the update target of the action value function is fixed, learning stability can be secured. However, the update speed of the neural network is suppressed, so the learning speed decreases.

The present invention has been made to solve the above problem, and its object is to provide an action determination system and an automatic driving control device capable of improving the learning speed while securing learning stability when a reinforcement learning method is used.

To achieve the above object, the present invention provides an action determination system 10, 10A to 10C that determines an action a by an agent (automatic driving control devices 1, 1A to 1C) using a reinforcement learning method, the system comprising: first value function calculation means (ECU 2, action value calculation units 11, 11B, 11C) for calculating a first value function (action value function Q) using information (state s, situation data data_s) input to the agent from an environment 9; action determination means (ECU 2, policy calculation units 12, 12C, action calculation unit 20) for determining an optimal action by the agent using the first value function; and first value function update means (ECU 2, action value calculation units 11, 11B, 11C) for updating the first value function so as to minimize an error function L defined to include a TD error of the first value function (the value in {} in the first term on the right-hand side of equations (3), (5), (9), (11)) and a difference between the first value function and a second value function (target value T) different from the first value function (the value in {} in the first term on the right-hand side of equations (4), (6), (10), (12)).

According to this action determination system, the first value function is calculated using information input to the agent from the environment, and the optimal action by the agent is determined using the first value function. Furthermore, the first value function is updated so as to minimize an error function defined to include the TD error of the first value function and the difference between the first value function and a second value function different from it. Therefore, compared with the case of using the error function of equation (1), even when the TD error becomes large and the update of the first value function becomes unstable, such as in the early stage of learning, the first value function can be updated while that influence is mitigated by the difference between the first and second value functions, and learning stability can be secured. In addition, unlike the error function of equation (2), the target value T is not included in the TD error of the error function, so the update speed of the first value function, that is, the learning speed, can be improved. (In this specification, "calculating the first value function" means calculating/setting the value of the first value function as a dependent variable by substituting the values of the independent variables into the first value function. "Updating the first value function" means updating the parameter components of the first value function other than the independent variables.)

In the present invention, it is preferable that the first value function update means uses, as the error function, an error function defined to include the TD error and the difference when the difference exceeds a predetermined value ε1, and an error function defined to include only the TD error when the difference is equal to or less than the predetermined value ε1.

According to this control device, when the difference is equal to or less than the predetermined value, the first value function is updated using an error function defined to include only the TD error. The first value function can therefore be updated so as to reduce only the TD error, and its update speed can be improved.

In the present invention, it is preferable to further include second value function calculation means (ECU 2, target value calculation units 14, 14B, 14C) for calculating the second value function (target value T) using the information (state s, situation data data_s), and second value function update means (ECU 2, target value calculation units 14, 14B, 14C) for updating the second value function (target value T) at an update rate slower than that of the first value function (action value function Q).

According to this control device, the second value function is calculated using the information and is updated at a slower rate than the first value function. Therefore, even when the behavior of the TD error becomes unstable, the first value function can be updated stably while that influence is mitigated by the difference between the first and second value functions, and learning stability can be secured. Furthermore, since the second value function, which is updated more slowly than the first value function, is not included in the TD error, the update speed of the first value function, that is, the learning speed, can be improved compared with the case of using the error function of equation (2).

In the present invention, it is preferable to use a fixed function (target value Tref) as the second value function.

According to this control device, a fixed function is used as the second value function. By setting this fixed function appropriately (for example, to a second value function already learned by another system), even when the behavior of the TD error becomes unstable, the first value function can be updated stably while that influence is mitigated by the difference between the first and second value functions, and learning stability can be secured. Furthermore, since the second value function set to a constant value is not included in the TD error, the update speed of the first value function, that is, the learning speed, can be improved compared with the case of using the error function of equation (2). (In this specification, a "fixed function" means a function in which the values other than the independent variables are fixed.)

In the present invention, it is preferable that the information is the state s of the environment 9, the first value function is an action value function Q for evaluating the state s of the environment 9 and the action a, and the action determination means determines the optimal action a based on the action value function using a predetermined method (ε-greedy method).

According to this control device, the optimal action can be determined using the calculation result of a single function, the action value function, so the computational load can be reduced compared with the case of using a plurality of functions. Furthermore, as described above, because the action value function can be updated stably, learning can be executed efficiently.

In the present invention, it is preferable that the information is the state of the environment 9; the first value function includes a state value function for evaluating the state of the environment 9 and a policy function for evaluating the action; the action determination means determines the optimal action a using the policy function; the first value function update means updates the state value function so that the error function L is minimized; and the system further includes policy function update means (ECU 2, action calculation unit 20) for updating the policy function so that the state value function is maximized.

According to this control device, the first value function includes a state value function for evaluating the state of the environment and a policy function for evaluating the action. This increases the freedom available when learning the policy function, allows continuous and high-dimensional spaces to be handled, and makes it easy to control the exploratory behavior of the agent. Furthermore, the state value function is updated so that the error function is minimized, and the policy function is updated so that the state value function is maximized, so the policy function can be updated stably while its behavior is kept from becoming unstable.

In the present invention, it is preferable that the information is a plurality of time-series discrete data s_{t+i} of information input from the environment 9 at a predetermined period (control period ΔT) when the agent executes the optimal action a a plurality of times at that period, and that the TD error of the first value function is configured to include a plurality of time-series discrete data r(s_{t+i}) of rewards calculated using the plurality of time-series discrete data s_{t+i} of the information.

According to this control device, a plurality of time-series discrete data of the first value function are calculated using the plurality of time-series discrete data of the information, and the TD error of the first value function is configured to include a plurality of time-series discrete data of rewards calculated using the plurality of time-series discrete data of the information. Because the first value function is updated so as to minimize an error function defined to include such a TD error, the evaluation of past actions by the first value function can be reflected in its update more quickly than when a single time-series discrete datum of the information is used. This accelerates the update, and the learning speed can be improved further.
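A minimal sketch of a TD error that contains several consecutive rewards is shown below. The n-step-return form used here is a common construction and is an assumption; the patent's own multi-step equations are not reproduced in this text, and the function name and arguments are hypothetical.

```python
def multi_step_td_error(q_sa, rewards, q_bootstrap, gamma=0.99):
    """TD error built from several consecutive rewards (assumed n-step form).

    q_sa        : Q(s_t, a_t), the value being updated
    rewards     : [r(s_{t+1}), ..., r(s_{t+n})], rewards from n control cycles
    q_bootstrap : value estimate at the last state, e.g. max_a Q(s_{t+n}, a)
    gamma       : discount rate, 0 < gamma <= 1
    """
    # Discounted sum of the observed rewards.
    n_step_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Bootstrap from the value estimate after the last observed reward.
    return n_step_return + gamma ** len(rewards) * q_bootstrap - q_sa
```

With a single reward in `rewards`, this reduces to the ordinary one-step TD error.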

The present invention is preferably an automatic driving control device 1, 1A to 1C that includes the above action determination system 10, 10A to 10C and controls an autonomous driving vehicle 3, wherein the information is situation data data_s representing the operating condition and operating environment of the autonomous driving vehicle 3, and the action is a target value or command value for controlling the autonomous driving vehicle 3.

According to this control device, the first value function is calculated using the situation data representing the operating condition and operating environment of the autonomous driving vehicle, and the target value or command value for controlling the autonomous driving vehicle is determined to be an optimal value using the first value function, so the control accuracy of the autonomous driving vehicle can be improved.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram schematically showing the configuration of an automatic driving control device and an action determination system according to a first embodiment of the present invention, and of an autonomous driving vehicle to which they are applied.
FIG. 2 is a block diagram showing the functional configuration of the action determination system of the first embodiment.
FIG. 3 is a diagram for explaining the learning speed of the action value function when the error function of the first embodiment is used and when a conventional error function is used.
FIG. 4 is a flowchart showing learning control.
FIG. 5 is a flowchart showing automatic driving control.
FIG. 6 is a diagram showing a state in which the autonomous driving vehicle performs overtaking.
FIG. 7 is a block diagram showing the functional configuration of the action determination system of a second embodiment.
FIG. 8 is a block diagram showing the functional configuration of the action determination system of a third embodiment.
FIG. 9 is a block diagram showing the functional configuration of the action determination system of a fourth embodiment.

An automatic driving control device and an action determination system according to a first embodiment of the present invention will be described below with reference to the drawings. The automatic driving control device of this embodiment includes the action determination system described later, so the automatic driving control device is described first. In this embodiment, the automatic driving control device corresponds to the agent.

As shown in FIG. 1, the automatic driving control device 1 is applied to a four-wheel autonomous driving vehicle 3 and includes an ECU 2. In the following description, this autonomous driving vehicle 3 is referred to as the "host vehicle 3".

A situation detection device 4, a prime mover 5, and an actuator 6 are electrically connected to the ECU 2. The situation detection device 4 is composed of a camera, a millimeter-wave radar, a laser radar, a sonar, a GPS, various sensors, and the like, and outputs situation data data_s representing the operating condition and operating environment of the host vehicle 3 to the ECU 2. In this embodiment, the situation data data_s corresponds to the information and to the state of the environment.

In this case, the situation data data_s consists of several tens of types of data, including vehicle speed, steering angle, yaw rate, acceleration, jerk, road edge coordinates, and the relative position and relative speed with respect to other vehicles.

The prime mover 5 is composed of, for example, an electric motor, and its operating state is controlled when the ECU 2 executes automatic driving control, as described later.

The actuator 6 is composed of a braking actuator, a steering actuator, and the like, and its operation is controlled when automatic driving control is executed, as described later.

The ECU 2 is a microcomputer composed of a CPU, RAM, ROM, E2PROM, an I/O interface, and various electric circuits (none of which are shown), and executes automatic driving control and the like, as described later, based on the situation data data_s from the situation detection device 4 and other inputs.

In this embodiment, the ECU 2 corresponds to the first value function calculation means, the action determination means, the first value function update means, the second value function calculation means, and the second value function update means.

Next, the action determination system 10 in the automatic driving control device 1 of this embodiment will be described with reference to FIG. 2. In the figure, the environment 9 is a system that outputs the state s_{t+1} when the action a_t is input as information. In the action determination system 10, the action a_t is calculated from the states s_t and s_{t+1} input from the environment 9 by the calculation algorithm described below.

Here, the state s_t and the action a_t are discrete data sampled or calculated in synchronization with a predetermined control period ΔT (for example, 10 msec) described later, and the subscript t (t is a positive integer) of the state s_t and the action a_t represents the control time (that is, the sampling/calculation timing) of the discrete data.

Specifically, the subscript t of the state s_t indicates a value sampled/calculated at the current control timing (hereinafter "current value"), and the subscript t+1 of the state s_{t+1} indicates a value estimated to be sampled/calculated at the next control timing (hereinafter "next value"). The same applies to the discrete data described below.

In actual control, the next value s_{t+1} of the state cannot be sampled/calculated at the current control timing. Therefore, the value of the state s sampled/calculated at the current control timing is used as the next value s_{t+1} of the state, and the next value s_{t+1} sampled/calculated at the previous control timing is used as the current value s_t of the state. In the following description, the subscripts of the discrete data are omitted where appropriate.
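The one-cycle shift described above can be sketched as a small buffer. This is an illustrative sketch only; the class and method names are hypothetical and do not appear in the patent.

```python
class StateShiftBuffer:
    """Holds the pair (s_t, s_{t+1}) across control cycles (hypothetical helper)."""

    def __init__(self):
        self.next_value = None  # s_{t+1}: most recently sampled state

    def step(self, sampled_state):
        # The value stored as s_{t+1} in the previous cycle becomes s_t.
        current_value = self.next_value
        # The state sampled in this cycle is treated as s_{t+1}.
        self.next_value = sampled_state
        return current_value, self.next_value
```

On the very first cycle no previous sample exists, so the sketch returns `None` for s_t; how the patent handles that cycle is not stated in this section.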

As shown in FIG. 2, the action determination system 10 includes an action value calculation unit 11, a policy calculation unit 12, a maximum value selection unit 13, a target value calculation unit 14, a reward calculation unit 15, and an error function calculation unit 16. In this action determination system 10, these elements 11 to 16 are implemented by the ECU 2; the same applies to the action determination systems 10A to 10C described later.

The action value calculation unit 11 calculates the action value function Q and includes a neural network for Q calculation (not shown) that takes the state s as input and outputs the action value function Q. In this neural network, with the value j defined as j = 1 to n (n is plural), n action value functions Q(s_t, a_j) are calculated using the current value s_t of the state and are output to the policy calculation unit 12.

Furthermore, in this neural network for Q calculation, n action value functions Q(s_{t+1}, a_{j+1}) are calculated using the next value s_{t+1} of the state and are output to the maximum value selection unit 13.

In addition, the action value calculation unit 11 computes an error gradient from the error function L input from the error function calculation unit 16 by a gradient method such as backpropagation, and updates the parameter θ (weights etc.) of the neural network for Q calculation at the control period ΔT described above so that the error function L is minimized.

Furthermore, each time the number of updates of the parameter θ reaches a predetermined value (for example, 10000), the parameter θ at that time is output to the target value calculation unit 14 as an update parameter θ̄. In this embodiment, the action value calculation unit 11 corresponds to the first value function calculation means and the first value function update means, and the action value function Q corresponds to the first value function.

The policy calculation unit 12 determines the optimal action a_t by the ε-greedy method (predetermined method) based on the n values Q(s_t, a_j) of the action value function input from the action value calculation unit 11. That is, the action a_j that maximizes the action value function Q(s_t, a_j) is selected as the optimal action a_t with probability 1−ε, and the action a_t is selected at random from the n actions a_j with probability ε.

In this case, the value ε is set such that 0 < ε < 1 holds. The policy calculation unit 12 outputs the selected optimal action a_t to the environment 9 and outputs the action value function Q(s_t, a_t) corresponding to the selected action a_t to the error function calculation unit 16. In this embodiment, the policy calculation unit 12 corresponds to the action determination means.
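The ε-greedy selection described above can be sketched as follows. The function name, the list representation of the n values Q(s_t, a_j), and the injectable random source `rng` are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Select an action index j from the n values Q(s_t, a_j).

    Returns (j, Q(s_t, a_j)): the chosen index and its action value, which
    correspond to the outputs sent to the environment and to the error
    function calculation.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must satisfy 0 < epsilon < 1")
    if rng.random() < epsilon:
        # Exploration: pick uniformly at random among the n actions.
        j = rng.randrange(len(q_values))
    else:
        # Exploitation: pick the action with the largest action value.
        j = max(range(len(q_values)), key=q_values.__getitem__)
    return j, q_values[j]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible when testing.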

Furthermore, the maximum value selection unit 13 compares the n values Q(s_{t+1}, a_{j+1}) of the action value function input from the action value calculation unit 11, selects the maximum value max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) from among them, and outputs the selected maximum value to the error function calculation unit 16. In addition, the next value a_{t+1} of the action corresponding to the selected maximum value max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) is output to the target value calculation unit 14.

Meanwhile, the target value calculation unit 14 uses a neural network for target value calculation (not shown) to calculate the target value T(s_{t+1}, a_{t+1}) that serves as the target of the action value function Q(s_{t+1}, a_{t+1}). This neural network is configured to output the target value T(s_{t+1}, a_{t+1}) when the next value s_{t+1} of the state and the next value a_{t+1} of the action are input, and its parameters are set to the update parameter θ̄ input from the action value calculation unit 11, as described above.

As a result, the parameter θ̄ of the neural network for target value calculation is held at a constant value until the number of updates of the parameter θ reaches a predetermined value, as described above. In other words, it is held constant until the number of calculations of the action value function Q reaches a predetermined value. The target value T(s_{t+1}, a_{t+1}) calculated in this way is output to the error function calculation unit 16. In the present embodiment, the target value calculation unit 14 corresponds to the second value function calculation means and the second value function update means, and the target value T corresponds to the second value function.
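
The way θ̄ is held constant and only periodically replaced by θ can be sketched as below. The class name `TargetSync`, the argument `sync_every`, and the choice to reset the counter after each copy are assumptions made for illustration; the text only states that θ̄ is set to θ when the update count reaches a predetermined value.

```python
class TargetSync:
    """Keeps a frozen copy (theta_bar) of the online parameters (theta)."""

    def __init__(self, theta, sync_every):
        self.theta = list(theta)
        self.theta_bar = list(theta)   # parameters of the target-value network
        self.sync_every = sync_every   # the "predetermined value" (hypothetical name)
        self.updates = 0

    def update(self, new_theta):
        """Record one update of theta; copy it into theta_bar when due."""
        self.theta = list(new_theta)
        self.updates += 1
        if self.updates >= self.sync_every:
            self.theta_bar = list(self.theta)
            self.updates = 0           # assumption: counting restarts after a copy
```

Between copies, every target value T is computed with the same θ̄, which is what keeps the constraint term anchored while θ changes.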

The reward calculation unit 15 calculates the reward r(s_{t+1}) from the next state value s_{t+1} using a predetermined reward calculation algorithm, and outputs it to the error function calculation unit 16.

Meanwhile, the error function calculation unit 16 calculates the error function L from the various values calculated as described above, using equations (3) and (4) below.
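
Equations (3) and (4) are not reproduced in this text. Based on the verbal description of their terms (a squared TD error plus λ times a squared difference between Q and T), they plausibly take the following form. This is a reconstruction under that assumption; the published equations may differ in scaling factors or in averaging over samples.

```latex
L = \bigl( r(s_{t+1}) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr)^2
    + \lambda \, E(s_{t+1}, a_{t+1}) \tag{3}

E(s_{t+1}, a_{t+1}) = \bigl( Q(s_{t+1}, a_{t+1}) - T(s_{t+1}, a_{t+1}) \bigr)^2 \tag{4}
```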

In equation (3) above, γ is a discount rate set such that 0 < γ ≤ 1 holds, and the first term on the right-hand side is the squared TD error of the action value function Q. E(s_{t+1}, a_{t+1}) in the second term on the right-hand side is a constraint term defined as shown in equation (4) above, and λ is an adjustment parameter. With ε1 defined as a positive predetermined value close to 0 (for example, 0.0001), the adjustment parameter λ is set such that 0 < λ ≤ 1 holds when E(s_{t+1}, a_{t+1}) > ε1, and is set to λ = 0 when E(s_{t+1}, a_{t+1}) ≤ ε1.

In the present embodiment, as is clear from equation (3) above, the error function L is calculated as the sum of the squared TD error of the action value function Q and the product λ·E(s_{t+1}, a_{t+1}) of the adjustment parameter and the constraint term.

This constraint term E(s_{t+1}, a_{t+1}) is the square of the difference {Q(s_{t+1}, a_{t+1}) − T(s_{t+1}, a_{t+1})} between the action value function and the target value. Therefore, even when the TD error becomes large and the update of the action value function Q becomes unstable, such as in the early stage of learning, the unstable fluctuation can be suppressed by the constraint term E, which contains the difference Q − T between the action value function Q and the target value calculated using a neural network that is not updated for a predetermined number of times. That is, learning can be executed stably even under conditions with a large TD error, in which learning of the action value function Q generally becomes unstable. In other words, under conditions with a large TD error, the difference Q − T suppresses the instability through the distance from the target value T, while under conditions with a small TD error, the constraint term E becomes small, so the degree to which learning is restrained decreases and learning can proceed efficiently.
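
The error function described above, including the switching of λ at the threshold ε1, can be sketched numerically as follows. The function name, argument names, and the default λ = 0.5 are hypothetical (the text only requires 0 < λ ≤ 1); ε1 = 0.0001 is the example value given in the description. The sketch also assumes no extra scaling factor on either term.

```python
def error_function(r, gamma, q_sa, q_next_max, t_next, lam=0.5, eps1=1e-4):
    """Squared TD error plus lambda times the squared (Q - T) constraint term.

    r          : reward r(s_{t+1})
    gamma      : discount rate, 0 < gamma <= 1
    q_sa       : Q(s_t, a_t)
    q_next_max : max over a_{t+1} of Q(s_{t+1}, a_{t+1})
    t_next     : target value T(s_{t+1}, a_{t+1})
    """
    td_error = r + gamma * q_next_max - q_sa
    constraint = (q_next_max - t_next) ** 2        # E(s_{t+1}, a_{t+1})
    # lambda is positive only while the constraint term exceeds eps1.
    lam_eff = lam if constraint > eps1 else 0.0
    return td_error ** 2 + lam_eff * constraint
```

Note that the target value T appears only in the constraint term and not inside the TD error, which is the structural point the description emphasizes.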

As a result, compared with using an error function L whose only component is the squared TD error, as in equation (1) described above, even when the TD error becomes large and the update of the first value function becomes unstable, such as in the early stage of learning, the parameter θ of the neural network for Q calculation in the action value calculation unit 11 can be updated stably while that influence is mitigated by the effect of the constraint term E(s_{t+1}, a_{t+1}), and learning stability can be ensured.

Next, with reference to FIG. 3, the learning speed of the action value function Q is described for the case where the error function L of equations (3) and (4) of the present embodiment is used and the case where the error function L of equation (2) described above is used. In the figure, the solid curve shows an example of a learning result obtained by automatically learning a commercially available score-based computer task using the error function L of equations (3) and (4) of the present embodiment.

The dashed curve shows, for comparison, the learning result when the error function L of equation (2) described above is used. As is clear from comparing the two, the score rises more steeply with the error function L of the present embodiment than with that of equation (2), which shows that the learning speed of the action value function Q has increased. As described above, this is because the target value T is included in the TD error in the error function L of equation (2), whereas it is not included in the TD error in the error function L of equations (3) and (4) of the present embodiment.

Next, learning control is described with reference to FIG. 4. This learning control calculates the action a by the calculation method of FIG. 2 described above and updates the parameter θ of the neural network for Q calculation. It is executed by the ECU 2 at the predetermined control period ΔT described above.

The various values calculated in the following description are stored in the E2PROM of the ECU 2. The following description covers an example of learning control when overtaking the preceding vehicle 7a under the condition, shown in FIG. 6, that the host vehicle 3 is traveling in the driving lane and the preceding vehicles 7a and 7b are present in the driving lane and the overtaking lane.

First, the situation data data_s from the situation detection device 4 is read as the state s (FIG. 4 / STEP 1). In this learning control, the value of data_s read at the current control timing is used as the next state value s_{t+1}, and the value of data_s read at the previous control timing is used as the current state value s_t.
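
The one-step shift between readings described in STEP 1 can be sketched as follows. The class name `StateBuffer` and the use of `None` before the first reading are assumptions for illustration.

```python
class StateBuffer:
    """Holds the previous reading as s_t and the newest reading as s_{t+1}."""

    def __init__(self):
        self.s_t = None      # reading from the previous control timing
        self.s_next = None   # reading from the current control timing

    def read(self, data_s):
        """Shift the stored reading back one step and store the new one."""
        self.s_t = self.s_next
        self.s_next = data_s
        return self.s_t, self.s_next
```

With this arrangement, the pair (s_t, s_{t+1}) is available at every control timing after the first.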

Next, as described above, the neural network for Q calculation is used to calculate n action value functions Q(s_{t+1}, a_{j+1}) from the next state value s_{t+1} and n action value functions Q(s_t, a_j) from the current state value s_t (FIG. 4 / STEP 2).

Next, as described above, the optimal action a is determined by the ε-greedy method from the n action value functions Q(s_t, a_j) (FIG. 4 / STEP 3). The action a in this case is determined as command values for the steering amount and the acceleration/deceleration of the host vehicle 3.
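
A minimal sketch of ε-greedy selection over the n values Q(s_t, a_j) is shown below. It uses the commonly used form of the method (a uniformly random action with probability ε, otherwise the action with the largest Q), which is assumed here to match the definition given earlier in the description. The function name and the `rng` argument are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the index of the chosen action among len(q_values) candidates."""
    if rng.random() < epsilon:
        # Explore: pick any action with equal probability.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest action value.
    return max(range(len(q_values)), key=lambda j: q_values[j])
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which can be convenient when comparing learning runs.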

Thereafter, as described above, the target value T(s_{t+1}, a_{t+1}) is calculated using the neural network for target value calculation (FIG. 4 / STEP 5).

Next, as described above, the reward r(s_{t+1}) is calculated using the predetermined reward calculation algorithm (FIG. 4 / STEP 6).

Next, the error function L is calculated using equations (3) and (4) described above (FIG. 4 / STEP 7).

Then, based on this error function L, the parameter θ of the neural network for Q calculation is updated by backpropagation, as described above (FIG. 4 / STEP 8). At that time, when the number of updates of the parameter θ has reached the predetermined value, the parameter θ at that point is set as the update parameter θ̄. After the parameter θ has been updated in this way, the process ends.

Next, automatic driving control is described with reference to FIG. 5. This control governs the driving state of the host vehicle 3 and is executed by the ECU 2 at the predetermined control period ΔT (predetermined period) described above. The following description covers an example of automatic driving control when overtaking the preceding vehicle 7a, as shown in FIG. 6 described above.

First, the action a stored in the E2PROM, that is, the steering amount command value and the acceleration/deceleration command value for the host vehicle 3, is read (FIG. 5 / STEP 20). In the present embodiment, the steering amount command value and the acceleration/deceleration command value correspond to the action a.

Next, the prime mover 5 is driven so that the acceleration/deceleration of the host vehicle 3 matches the read command value (FIG. 5 / STEP 21).

Next, the actuator 6 is driven so that the steering amount of the host vehicle 3 matches the read command value (FIG. 5 / STEP 22). The process then ends.

As described above, according to the action determination system 10 of the present embodiment, the action value function Q is calculated using the state s from the environment 9, and the optimal action a by the agent is determined using the action value function Q. Furthermore, as shown in equations (3) and (4), the error function L is defined to include the TD error of the action value function Q and the constraint term E, which is the square of the difference between the action value function Q and the target value T, and the parameter θ of the neural network used to calculate the action value function Q is updated so that this error function L is minimized.

The parameter θ̄ of the neural network used to calculate the target value T is held at a constant value, without being updated, until the number of updates of the parameter θ reaches the predetermined value. Therefore, compared with using the error function L of equation (1) described above, even when the TD error is large and the update of the action value function Q is unstable, the neural network parameter θ, that is, the action value function Q, can be updated while that influence is mitigated by the effect of the constraint term E, and learning stability can be ensured. In addition, because the target value T is not included in the TD error of the error function L, the update speed of the action value function Q, that is, the learning speed, can be improved compared with using the error function L of equation (2) described above.

Also, because the optimal action a can be determined from the calculation result of a single function, the action value function Q, the computational load can be reduced compared with using multiple functions. Furthermore, because the action value function Q can be updated stably, learning can be executed efficiently.

Furthermore, according to the automatic driving control device 1 of the present embodiment, the learning control of FIG. 4 can determine optimal command values for the steering amount and the acceleration/deceleration of the host vehicle 3 using the method of the action determination system 10 described above, so the control accuracy of the host vehicle 3 can be improved.

The learning control of FIG. 4 is an example in which the command values for the steering amount and the acceleration/deceleration of the host vehicle 3 are determined as the action a. Instead, the travel trajectory of the host vehicle 3 may be determined as the action a. In that case, the automatic driving control of FIG. 5 may control the prime mover 5 and the actuator 6 so that the host vehicle 3 travels along the determined trajectory.

The first embodiment is an example in which the action value function calculation unit 11 approximates the action value function Q with a neural network to calculate its value, but the function approximating the action value function Q is not limited to this. For example, a function expressed as a linear combination of a feature vector representing the state s and basis functions may be used. In that case, the weight values may be updated so that the value of the error function L defined by equations (3) and (4) described above is minimized.

Furthermore, the first embodiment is an example in which the action determination system of the present invention is applied to an automatic driving control device that controls an automated driving vehicle, but the action determination system of the present invention is not limited to this and is applicable to systems that control various kinds of industrial equipment. For example, it may be applied to a system that controls a robot, or to a system that controls industrial equipment such as an automatically operated ship. It may also be applied to the control of two- or three-wheeled automated driving vehicles or automated driving vehicles with five or more wheels.

Meanwhile, the first embodiment is an example in which the ε-greedy method is used as the predetermined method, but the predetermined method of the present invention is not limited to this and may be any method capable of selecting the action that maximizes the action value function as the optimal action. For example, a softmax method based on a specific distribution or a method combined with annealing may be used as the predetermined method.
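
As an illustration of the softmax alternative mentioned above, the sketch below draws an action with probability proportional to exp(Q / temperature). The Boltzmann form, the function names, and the `temperature` parameter are assumptions; the description does not specify which distribution is meant. Lowering the temperature over time would be one way to combine this with annealing.

```python
import math
import random


def softmax_probs(q_values, temperature=1.0):
    """Convert action values into selection probabilities."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_select(q_values, temperature=1.0, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

At a low temperature the probabilities concentrate on the action with the largest Q, so the selection approaches the greedy choice.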

Next, an automatic driving control device 1A (agent) according to a second embodiment is described with reference to FIG. 7. Compared with the automatic driving control device 1 of the first embodiment, the automatic driving control device 1A differs only in the configuration of the action determination system 10A shown in FIG. 7, so the following description focuses on the differences. Components identical to those of the first embodiment are given the same reference numerals, and their description is omitted as appropriate.

As is clear from a comparison with the action determination system 10 of FIG. 2 described above, the action determination system 10A differs in that it includes a target value calculation unit 14A in place of the target value calculation unit 14 of the action determination system 10.

The target value calculation unit 14A calculates a target value Tref(s_{t+1}, a_{t+1}) using a neural network with fixed parameters as an approximation function of the action value function Q, and outputs this target value to the error function calculation unit 16A.

In this case, the fixed parameter values are the parameter values obtained in another automatic driving control device when learning of the parameters of its neural network for Q calculation has progressed sufficiently. In the present embodiment, the target value Tref corresponds to the fixed function.

The error function calculation unit 16A calculates the error function L using equations (5) and (6) below.

As described above, according to the action determination system 10A of the present embodiment, the target value Tref is used in calculating the constraint term E of the error function L. This target value Tref is calculated using a neural network with fixed parameters, and these fixed parameters are the parameter values obtained in another automatic driving control device when learning of the parameters of its neural network for Q calculation has progressed sufficiently. Therefore, even when the TD error is large and the update of the action value function Q becomes unstable, the action value function Q can be updated stably while that influence is mitigated by the effect of the constraint term E, and learning stability can be ensured. Furthermore, because the target value Tref is not included in the TD error, the update speed of the action value function Q, that is, the learning speed, can be improved compared with using the error function of equation (2) described above.

The second embodiment is an example in which the target value Tref is used as the fixed function, but the fixed function of the present invention is not limited to this and may be any function whose parameters other than the independent variables are fixed. For example, the fixed function may be a value calculated using a neural network whose parameters are the average of multiple values of the parameter θ obtained in multiple other automatic driving control devices when learning of their neural networks for Q calculation has progressed sufficiently.
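
The averaging of parameters from several other devices can be sketched as an element-wise mean. The function name and the flat-list representation of θ are assumptions for illustration; in practice each θ would hold the weights of a whole network with identical structure.

```python
def average_parameters(param_sets):
    """Element-wise mean of several equally shaped parameter lists."""
    n = len(param_sets)
    # zip(*param_sets) groups the k-th parameter of every device together.
    return [sum(values) / n for values in zip(*param_sets)]


# Hypothetical parameters learned on two other devices.
theta_fixed = average_parameters([[1.0, 2.0], [3.0, 4.0]])
```

The resulting `theta_fixed` would then be loaded into the fixed network that produces Tref and left unchanged during learning.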

Next, an automatic driving control device 1B (agent) according to a third embodiment is described with reference to FIG. 8. Compared with the automatic driving control device 1 of the first embodiment, the automatic driving control device 1B differs only in the configuration of the action determination system 10B shown in FIG. 8, so the following description focuses on the differences. Components identical to those of the first embodiment are given the same reference numerals, and their description is omitted as appropriate.

The action determination system 10B includes an action calculation unit 20, an action value calculation unit 11B, a target action calculation unit 21, a target value calculation unit 14B, a reward calculation unit 15, and an error function calculation unit 16B.

The action calculation unit 20 calculates the action a using a policy function. The policy function calculates the optimal action output, and its certainty, from environment information. In the action calculation unit 20, a neural network for action calculation (not shown) is used as an approximation function of the policy function. This neural network takes the state s as input and outputs the action a. Specifically, the current action value a_t is calculated from the current state value s_t and is output to the environment 9 and the action value calculation unit 11B.

Furthermore, the neural network for action calculation calculates the next action value a_{t+1} from the next state value s_{t+1}, and this is output to the action value calculation unit 11B.

In addition, in the action calculation unit 20, the parameter φ (weights and the like) of the neural network for action calculation is updated by backpropagation at the control period ΔT described above so that the action value function Q(s_t, a_t) input from the action value calculation unit 11B is maximized, and the updated parameter φ is output to the target action calculation unit 21 in synchronization with that update timing. In the present embodiment, the ECU 2 corresponds to the policy function update means, and the action calculation unit 20 corresponds to the action determination means and the policy function update means.

The action value calculation unit 11B calculates the action value function Q, which is an evaluation of a given state s and the action a taken at that time, and includes a neural network for Q calculation (not shown) that approximates the action value function Q as a state value function. In the action determination system 10B, the action calculation unit 20 and the action value calculation unit 11B are used in combination to calculate the action value function Q(s_t, a_t) from the current state value s_t, and this is output to the error function calculation unit 16B and the action calculation unit 20.

Furthermore, the neural network for Q calculation calculates the action value function Q(s_{t+1}, a_{t+1}) using the next state value s_{t+1}, and this is output to the error function calculation unit 16B.

In addition, in the action value calculation unit 11B, as in the action value calculation unit 11 described above, the parameter θ of the neural network for Q calculation is updated by backpropagation at the control period ΔT described above so that the error function L input from the error function calculation unit 16B is minimized, and the updated parameter θ is output to the target action calculation unit 21 in synchronization with that update timing. In the present embodiment, the action value calculation unit 11B corresponds to the first value function calculation means and the first value function update means.

Meanwhile, the target action calculation unit 21 described above calculates a target action a^T and includes a neural network for target action calculation (not shown) that takes the state s as input and outputs the target action a^T. This neural network calculates the target action a^T_{t+1} using the next state value s_{t+1}, and this is output to the target value calculation unit 14B.

Furthermore, in the target action calculation unit 21, the parameter φ̄ of the neural network for target action calculation is updated at the control period ΔT described above by the weighted average calculation shown in equation (7) below, using the parameter φ input from the action calculation unit 20.

β in equation (7) above is a weighting coefficient and is set to a positive predetermined value close to 0 (for example, 0.001).

The target value calculation unit 14B calculates the target value T(s_{t+1}, a^T_{t+1}) using a neural network for target value calculation. This neural network is configured to output the target value T(s_{t+1}, a^T_{t+1}) when the next state value s_{t+1} and the target action a^T_{t+1} are input.

The parameter θ̄ of this neural network for target value calculation is updated at the control period ΔT described above by the weighted average calculation shown in equation (8) below, using the parameter θ input from the action value calculation unit 11B.
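
Equations (7) and (8) are not reproduced in this text. A weighted average with a small weight β on the newly learned parameters is commonly written as θ̄ ← βθ + (1 − β)θ̄, and the sketch below assumes that form for both φ̄ and θ̄. The function name and the list representation of the parameters are hypothetical; β = 0.001 is the example value given in the description.

```python
def soft_update(target, online, beta=0.001):
    """Move each target parameter a small step toward its online counterpart.

    Assumed form: target <- beta * online + (1 - beta) * target.
    """
    return [beta * w + (1.0 - beta) * w_bar
            for w_bar, w in zip(target, online)]


# Hypothetical one-parameter example: the target moves 0.1 % of the way.
phi_bar = soft_update([0.0], [1.0])
```

Because β is close to 0, the target parameters change slowly at every control period rather than being replaced outright after a fixed number of updates, as in the first embodiment.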

In the present embodiment, the target value calculation unit 14B corresponds to the second value function calculation means and the second value function update means, and the target value T corresponds to the second value function.

Furthermore, the error function calculation unit 16B calculates the error function L from the various values calculated as described above, using equations (9) and (10) below.

As the maximum value max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) in equation (9) above, the value of the action value function Q(s_{t+1}, a_{t+1}) is used. The maximum value is set in this way on the assumption that a_{t+1} calculated using the policy function is the optimal output.

As described above, according to the action determination system 10B of the present embodiment, the action calculation unit 20 calculates the action a using the policy function approximated by a neural network, and the action value calculation unit 11B calculates the action value function Q using the state value function approximated by a neural network. Because the policy function and the state value function can be used separately in this way, the freedom in learning the policy function can be increased, continuous and high-dimensional spaces can be handled, and the agent's exploratory behavior can be controlled easily. Furthermore, the state value function is updated so that the error function L is minimized, and the policy function is updated so that the state value function is maximized, so the policy function can be updated stably while its behavior is kept from becoming unstable.

Although the third embodiment is an example in which the policy function is updated so that the state value function is maximized, the system may instead be configured to update the policy function so that both the state value function and the advantage function are maximized.

Next, an automatic driving control device 1C (agent) according to a fourth embodiment will be described with reference to FIG. 9. The automatic driving control device 1C differs from the automatic driving control device 1 of the first embodiment only in the configuration of the action determination system 10C shown in FIG. 9, so the following description focuses on the differences.

The action determination system 10C includes an action value calculation unit 11C, a policy calculation unit 12C, a maximum value selection unit 13C, a target value calculation unit 14C, a reward calculation unit 15C, and an error function calculation unit 16C.

The action value calculation unit 11C includes a neural network for Q calculation and a storage unit. The storage unit is of the experience memory type and, where the value i is defined as i = 1 to m (m is a plural number), stores the time-series discrete data s_t to s_{t+i} of m+1 states input from the environment 9 at a total of m+1 control timings. Furthermore, the action value calculation unit 11C outputs the latest value s_{t+m} in the storage unit to the target value calculation unit 14C.

In the neural network for Q calculation, m × n action value functions Q(s_{t+i-1}, a_j) are calculated using the time-series discrete data s_{t+i-1} of m states in the storage unit, and these values are output to the policy calculation unit 12C.

Furthermore, in this neural network for Q calculation, n action value functions Q(s_{t+m}, a_j) are calculated using the latest value s_{t+m} in the storage unit, and these values are output to the maximum value selection unit 13C.
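
The experience-memory storage described above can be sketched as a sliding window over the m+1 most recent states. This is a minimal illustration, not the implementation in the source: the class name `ExperienceMemory` and its method names are hypothetical, and states are represented by plain strings.

```python
from collections import deque


class ExperienceMemory:
    """Sliding window over the m + 1 most recent states (hypothetical helper)."""

    def __init__(self, m):
        self.m = m
        # A bounded deque drops the oldest state automatically on overflow.
        self._states = deque(maxlen=m + 1)

    def push(self, state):
        self._states.append(state)

    def is_full(self):
        return len(self._states) == self.m + 1

    def first_m_states(self):
        # s_t ... s_{t+m-1}: inputs for the m x n action values Q(s_{t+i-1}, a_j)
        return list(self._states)[: self.m]

    def latest_state(self):
        # s_{t+m}: input for the n action values Q(s_{t+m}, a_j)
        return self._states[-1]


memory = ExperienceMemory(m=3)
for step in range(6):
    memory.push(f"s{step}")
```

After six pushes with m = 3, the window holds only the four most recent states; the first three feed the m × n action values and the last one feeds the maximum value selection.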

In addition, in the action value calculation unit 11C, the parameter θ of the neural network for Q calculation is updated by backpropagation at the aforementioned control period ΔT so that the error function L input from the error function calculation unit 16C is minimized.

Furthermore, each time the number of updates of the parameter θ reaches the aforementioned predetermined value, the parameter θ at that time is output to the target value calculation unit 14C as the update parameter θ̄. In the present embodiment, the action value calculation unit 11C corresponds to the first value function calculation means and the first value function updating means.

In the policy calculation unit 12C (action determining means), an action a_t is selected by the aforementioned ε-greedy method based on the m × n action value functions Q(s_{t+i-1}, a_j) input from the action value calculation unit 11C, and the selected action a_t is output to the environment 9. Furthermore, the action value function Q(s_t, a_t) corresponding to the selected action a_t is output to the error function calculation unit 16C.
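
The ε-greedy selection mentioned above can be sketched as follows. This is a generic illustration that assumes the choice is made over the n action values of a single state; the function name `epsilon_greedy` and its parameters are hypothetical and are not taken from the source.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return (j, Q_j): the index of the selected action a_j and its action value.

    With probability epsilon a random action is explored; otherwise the
    action with the largest action value is exploited.
    """
    if rng.random() < epsilon:
        j = rng.randrange(len(q_values))  # explore
    else:
        j = max(range(len(q_values)), key=lambda k: q_values[k])  # exploit
    return j, q_values[j]


# With epsilon = 0 the choice is purely greedy.
greedy_index, greedy_value = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

Returning the action value together with the index mirrors the description, in which the selected action goes to the environment and its action value goes to the error function calculation unit.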

Furthermore, the maximum value selection unit 13C compares the n action value functions Q(s_{t+m}, a_j) input from the action value calculation unit 11C, selects the maximum value max_{a_{t+m}} Q(s_{t+m}, a_{t+m}) from among them, and outputs the selected maximum value to the error function calculation unit 16C. In addition, the action a_{t+m} corresponding to the selected maximum value is output to the target value calculation unit 14C.

Meanwhile, the target value calculation unit 14C includes a neural network for target value calculation. This neural network calculates the target value T(s_{t+m}, a_{t+m}) using the latest state value s_{t+m} and the action a_{t+m} corresponding to the maximum value max_{a_{t+m}} Q(s_{t+m}, a_{t+m}), and outputs it to the error function calculation unit 16C.

As described above, the parameter (weight) θ̄ of this neural network for target value calculation is set to the update parameter θ̄ input from the action value calculation unit 11C. In the present embodiment, the target value calculation unit 14C corresponds to the second value function calculation means and the second value function updating means.
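
The periodic hand-over of θ to the target value network can be sketched as below. This is a minimal illustration under stated assumptions: the class name `TargetSync`, the list representation of the parameters, and the initialization of θ̄ as a copy of the initial θ are all hypothetical choices, and `sync_every` stands in for the predetermined update count.

```python
import copy


class TargetSync:
    """Copies theta into theta_bar every `sync_every` updates (hypothetical helper)."""

    def __init__(self, theta, sync_every):
        self.theta = theta
        self.theta_bar = copy.deepcopy(theta)  # assumed initial value of theta_bar
        self.sync_every = sync_every
        self.updates = 0

    def apply_update(self, new_theta):
        # new_theta represents theta after one error-minimizing update step.
        self.theta = new_theta
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.theta_bar = copy.deepcopy(self.theta)


sync = TargetSync(theta=[0.0], sync_every=3)
history = []
for k in range(1, 5):
    sync.apply_update([float(k)])
    history.append(sync.theta_bar[0])
```

Because θ̄ changes only at every third update in this example, the target value network is refreshed less often than the Q network, which is the slower update rate the description relies on.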

Furthermore, the reward calculation unit 15C includes an experience-memory-type storage unit similar to that of the action value calculation unit 11C. The reward calculation unit 15C calculates the reward r(s_{t+i}) using a predetermined reward calculation algorithm based on the time-series discrete data s_{t+i} of m states stored in the storage unit, and outputs it to the error function calculation unit 16C.

Furthermore, the error function calculation unit 16C calculates the error function L by the following equations (11) and (12), based on the various values calculated as described above.

As described above, according to the action determination system 10C of the present embodiment, the TD error of the error function L is calculated so as to include the time-series discrete data r(s_{t+i}) of m rewards resulting from executing the actions a_t to a_{t+m} m+1 times, and the neural network for calculating the action value function Q is updated so that this error function L is minimized. Compared with the case where the time-series discrete data s_t of a single state is used, the evaluation (by the action value function Q) of actions a performed in the past can therefore be reflected more quickly in the update of the action value function Q, further improving the learning speed.
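
A TD error that spans m rewards can be sketched as follows. Equations (11) and (12) are not reproduced in this text, so this is a conventional multi-step form offered as an assumption, not the exact formula of the source: the discount factor `gamma`, the discounting of each reward, the equal weighting of the two squared terms, and the choice of which action value enters the second term are all hypothetical.

```python
def multi_step_td_error(rewards, q_current, q_max_final, gamma):
    """Assumed multi-step TD error.

    rewards     : the m rewards r(s_{t+1}) ... r(s_{t+m})
    q_current   : Q(s_t, a_t) for the selected action
    q_max_final : max over a of Q(s_{t+m}, a)
    gamma       : discount factor (assumption; not named in this section)
    """
    m = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** m * q_max_final - q_current


def error_function(td_error, q_value, target_value):
    """Squared TD error plus squared gap between an action value and target T."""
    return td_error ** 2 + (q_value - target_value) ** 2


td = multi_step_td_error(rewards=[1.0, 1.0], q_current=2.0,
                         q_max_final=4.0, gamma=0.5)
loss = error_function(td, q_value=2.0, target_value=1.0)
```

With m = 1 the first function reduces to the familiar one-step TD error, which is the single-state case the paragraph above compares against.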

1 Automatic driving control device (agent)
2 ECU (first value function calculation means, action determining means, first value function updating means, second value function calculation means, second value function updating means, policy function updating means)
3 Autonomous driving vehicle
9 Environment
10 Action determination system
11 Action value calculation unit (first value function calculation means, first value function updating means)
12 Policy calculation unit (action determining means)
14 Target value calculation unit (second value function calculation means, second value function updating means)
1A Automatic driving control device (agent)
10A Action determination system
1B Automatic driving control device (agent)
10B Action determination system
11B Action value calculation unit (first value function calculation means, first value function updating means)
14B Target value calculation unit (second value function calculation means, second value function updating means)
20 Action calculation unit (action determining means, policy function updating means)
1C Automatic driving control device (agent)
10C Action determination system
11C Action value calculation unit (first value function calculation means, first value function updating means)
12C Policy calculation unit (action determining means)
14C Target value calculation unit (second value function calculation means, second value function updating means)
Q Action value function (first value function)
a Action
s State (information)
data_s Status data (information, state)
L Error function
T Target value (second value function)
ε1 Predetermined value
Tref Target value (second value function, fixed function)
ΔT Control period (predetermined period)

Claims

An action determination system that determines an action by an agent using a reinforcement learning method, comprising:
first value function calculation means for calculating a first value function using information input from an environment to the agent;
action determining means for determining an optimal action by the agent using the first value function; and
first value function updating means for updating the first value function so as to minimize an error function defined to include a TD error of the first value function and a difference between the first value function and a second value function different from the first value function.

The action determination system according to claim 1, wherein the first value function updating means uses, as the error function, an error function defined to include the TD error and the difference when the difference exceeds a predetermined value, and uses an error function defined to include only the TD error when the difference is less than or equal to the predetermined value.

The action determination system according to claim 1 or 2, further comprising:
second value function calculation means for calculating the second value function using the information; and
second value function updating means for updating the second value function at a slower update rate than the first value function.

The action determination system according to claim 1 or 2, wherein a fixed function is used as the second value function.

The action determination system according to any one of claims 1 to 4, wherein
the information is a state of the environment,
the first value function is an action value function for evaluating the state of the environment and the action, and
the action determining means determines the optimal action based on the action value function using a predetermined method.

The action determination system according to any one of claims 1 to 4, wherein
the information is a state of the environment,
the first value function includes a state value function for evaluating the state of the environment and a policy function for evaluating the action,
the action determining means determines the optimal action using the policy function,
the first value function updating means updates the state value function so as to minimize the error function, and
the system further comprises policy function updating means for updating the policy function so that the state value function is maximized.

The action determination system according to any one of claims 1 to 6, wherein
the information is a plurality of time-series discrete data of the information input from the environment at a predetermined period when the agent executes the optimal action a plurality of times at the predetermined period, and
the TD error of the first value function is configured to include a plurality of time-series discrete data of a reward calculated using the plurality of time-series discrete data of the information.

An automatic driving control device that comprises the action determination system according to any one of claims 1 to 7 and controls an autonomous driving vehicle, wherein
the information is status data representing an operating state and an operating environment of the autonomous driving vehicle, and
the action is a target value or a command value for controlling the autonomous driving vehicle.