JP7110891B2 - Autopilot robot control device and control method

Info

Publication number: JP7110891B2
Application number: JP2018188766A
Authority: JP
Inventors: 健人吉田; 寛修深井
Original assignee: Meidensha Corp
Current assignee: Meidensha Corp
Priority date: 2018-10-04
Filing date: 2018-10-04
Publication date: 2022-08-02
Anticipated expiration: 2038-10-04
Also published as: JP2020056737A

Description

The present invention relates to a control device and a control method for an autopilot robot that drives a vehicle.

In general, when a vehicle such as a passenger car is manufactured and sold, its fuel consumption and exhaust emissions must be measured while the vehicle is driven in a specific driving pattern (mode) stipulated by the country or region, and the results must be displayed.
A mode can be represented as a graph of, for example, the relationship between the time elapsed since the start of driving and the vehicle speed to be reached at that time. Because it is a command given to the vehicle regarding the speed to be achieved, this target speed is sometimes called the command vehicle speed.
Such fuel consumption and exhaust emission tests are performed by placing the vehicle on a chassis dynamometer and having an autopilot robot mounted on the vehicle, a so-called Drive Robot (registered trademark), drive the vehicle according to the mode.

A permissible error range is defined for the command vehicle speed. If the vehicle speed leaves this range, the test becomes invalid, so the automatic driving device is required to follow the command vehicle speed closely.
To address this, Patent Literature 1 discloses a vehicle speed control device intended to improve vehicle speed followability and to simplify the settings made in advance.
The vehicle speed control device of Patent Literature 1 is based on a known feedback control law such as a PID control law.

Patent Literature 1: JP 2016-156687 A

As described above, one purpose of vehicle testing is to measure fuel consumption and exhaust emissions.
In control that follows the command vehicle speed by feedback control, as in Patent Literature 1, it is not easy to calculate vehicle operations, for example the operation amounts of the accelerator pedal and the brake pedal, that track the command vehicle speed while also taking fuel consumption and exhaust gas performance into account. Because fuel consumption and exhaust gas performance cannot be properly considered, the controller, in trying to follow the command vehicle speed, may operate the accelerator pedal or the brake pedal by large amounts or repeat small, rapid operations. In that case, the measured performance may be worse than the fuel consumption and exhaust gas performance the vehicle is actually capable of.

The problem to be solved by the present invention is to provide a control device and a control method for an autopilot robot (drive robot) that can operate a vehicle with fuel consumption and exhaust gas performance taken into account while following the command vehicle speed with high accuracy.

To solve the above problem, the present invention adopts the following means. That is, the present invention provides a control device for an autopilot robot that controls an autopilot robot, mounted on a vehicle to drive the vehicle, so that the vehicle travels in accordance with a prescribed command vehicle speed, the control device comprising: a running state acquisition unit that acquires a running state of the vehicle; an operation content inference unit that, based on the running state at a first time, infers by a first learning model the content of operation of the vehicle after the first time; and a vehicle operation control unit that controls the autopilot robot based on the operation content, wherein the running state includes a vehicle speed detected in the vehicle and the command vehicle speed at the time the running state is acquired, and the first learning model has undergone reinforcement learning based on a reward that is calculated from the running state at a second time, which is later than the first time and follows the operation of the autopilot robot based on the operation content, such that the reward takes a larger value the better the fuel consumption, the exhaust gas performance, or both achieved by the operation content.

The present invention also provides a control method for an autopilot robot that controls an autopilot robot, mounted on a vehicle to drive the vehicle, so that the vehicle travels in accordance with a prescribed command vehicle speed, the method comprising: acquiring a running state of the vehicle, the running state including a vehicle speed detected in the vehicle and the command vehicle speed at the time the running state is acquired; inferring the content of operation of the vehicle, based on the running state at a first time, by a first learning model that infers the content of operation of the vehicle after the first time and that has undergone reinforcement learning based on a reward, the reward being calculated from the running state at a second time, which is later than the first time and follows the operation of the autopilot robot based on the operation content, such that the reward takes a larger value the better the fuel consumption, the exhaust gas performance, or both achieved by the operation content; and controlling the autopilot robot based on the operation content.

According to the present invention, it is possible to provide a control device and a control method for an autopilot robot (drive robot) that can operate a vehicle with fuel consumption and exhaust gas performance taken into account while following the command vehicle speed with high accuracy.

FIG. 1 is an explanatory diagram of a test environment using an autopilot robot (drive robot) in an embodiment of the present invention.
FIG. 2 is a block diagram of the control device for the autopilot robot in the embodiment.
FIG. 3 is a block diagram of the first learning model provided in the control device.
FIG. 4 is a block diagram of the second learning model used for reinforcement learning of the first learning model.
FIG. 5 is a flowchart of the learning phase of the control method for controlling the autopilot robot.
FIG. 6 is a detailed flowchart of the running data collection step in the learning phase of the control method.
FIG. 7 is a flowchart of the control method when the vehicle is driven under control for performance measurement.
FIG. 8 is a block diagram of a control device for an autopilot robot in a modification of the embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
The control device for an autopilot robot in this embodiment controls an autopilot robot, mounted on a vehicle to drive the vehicle, so that the vehicle travels in accordance with a prescribed command vehicle speed. It includes a running state acquisition unit that acquires the running state of the vehicle, an operation content inference unit that, based on the running state at a first time, infers by a first learning model the content of operation of the vehicle after the first time, and a vehicle operation control unit that controls the autopilot robot based on the operation content. The running state includes the vehicle speed detected in the vehicle and the command vehicle speed at the time the running state is acquired. The first learning model has undergone reinforcement learning based on a reward that is calculated from the running state at a second time, which is later than the first time and follows the operation of the autopilot robot based on the operation content, such that the reward takes a larger value the better the fuel consumption, the exhaust gas performance, or both achieved by the operation content.
In this embodiment, a Drive Robot (registered trademark) is used as the autopilot robot, so the autopilot robot is hereinafter referred to as the drive robot.

FIG. 1 is an explanatory diagram of a test environment using the drive robot in the embodiment. The test apparatus 1 includes a vehicle 2, a chassis dynamometer 3, and a drive robot 4.
The vehicle 2 is placed on a floor surface. The chassis dynamometer 3 is installed below the floor surface. The vehicle 2 is positioned so that its drive wheels 2a rest on the chassis dynamometer 3. When the vehicle 2 runs and the drive wheels 2a rotate, the chassis dynamometer 3 rotates in the opposite direction.
The drive robot 4 is mounted on the driver's seat 2b of the vehicle 2 and drives the vehicle 2. The drive robot 4 includes a first actuator 4c and a second actuator 4d, which are arranged to contact the accelerator pedal 2c and the brake pedal 2d of the vehicle 2, respectively.

The drive robot 4 is controlled by the control device 10. More specifically, the control device 10 changes and adjusts the opening degrees of the accelerator pedal 2c and the brake pedal 2d of the vehicle 2 by controlling the first actuator 4c and the second actuator 4d of the drive robot 4.
The control device 10 controls the drive robot 4 so that the vehicle 2 travels according to the prescribed command vehicle speed. That is, by changing the opening degrees of the accelerator pedal 2c and the brake pedal 2d, the control device 10 controls the running of the vehicle 2 so that it follows the prescribed driving pattern (mode). More specifically, as time elapses from the start of driving, the control device 10 controls the running of the vehicle 2 so that it follows the command vehicle speed, which is the vehicle speed to be reached at each time.

The control device 10 includes a drive robot control unit 20 and a learning unit 30, which are able to communicate with each other.
The drive robot control unit 20 controls the drive robot 4 by generating a control signal for the drive robot 4 and transmitting it to the drive robot 4. The learning unit 30 performs reinforcement learning on a machine learning device, described later, to generate a learning model. The control signal for the drive robot 4 is generated based on the output of this learning model.
The drive robot control unit 20 is, for example, an information processing device such as a controller provided outside the housing of the drive robot 4. The learning unit 30 is, for example, an information processing device such as a personal computer.

FIG. 2 is a block diagram of the control device 10. The drive robot control unit 20 includes a command vehicle speed storage unit 21, a running state acquisition unit 22, and a vehicle operation control unit 23. The learning unit 30 includes an operation content inference unit 31, a reward calculation unit 32, a reinforcement learning unit 33, and a learning data storage unit 34.
Among these components of the control device 10, the running state acquisition unit 22, the vehicle operation control unit 23, the operation content inference unit 31, the reward calculation unit 32, and the reinforcement learning unit 33 may be, for example, software or programs executed by the CPUs of the information processing devices described above. The command vehicle speed storage unit 21 and the learning data storage unit 34 may be implemented by storage devices, such as semiconductor memories or magnetic disks, provided inside or outside those information processing devices.

As described later, the operation content inference unit 31 infers, based on the running state at a certain time, the content of operation of the vehicle 2 after that time. To infer the operation content effectively, the operation content inference unit 31 includes a machine learning device, described later. It generates a learning model (first learning model) 40 by applying reinforcement learning to the machine learning device, using a reward calculated from the running state at a time after the drive robot 4 has been operated according to the inferred operation content. When the vehicle 2 is actually driven under control for performance measurement, the operation content inference unit 31 uses the fully trained first learning model 40 to infer the operation content of the vehicle 2.
In other words, the control device 10 performs two broad kinds of operation: learning the operation content during reinforcement learning, and inferring the operation content when the vehicle is driven under control for performance measurement. For simplicity, the following first describes each component of the control device 10 during learning of the operation content, and then describes the behavior of each component when the operation content is inferred for vehicle performance measurement.
In FIG. 2, the components are connected by two types of arrows, thick and thin, which show the flow of data and processing. The flow when the operation content is inferred for vehicle performance measurement is shown by the thick arrows. The flow during learning of the operation content is shown by both the thick and the thin arrows.

First, the behavior of the components of the drive robot control unit 20 during learning of the operation content will be described.
The command vehicle speed storage unit 21 stores the command vehicle speed generated from the information about the mode. The mode is, for example, the relationship between the time elapsed since the start of driving and the vehicle speed to be reached at that time. The command vehicle speed storage unit 21 therefore actually stores, for example, a table, graph, or function that expresses the relationship between elapsed time and command vehicle speed.
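As one way to picture such storage, the following minimal Python sketch holds the mode as a table of (elapsed time, command vehicle speed) pairs and looks up the command vehicle speed by linear interpolation. The names `MODE_TABLE` and `command_speed`, the breakpoint values, and the choice of linear interpolation are all assumptions made for illustration; the source only states that a table, graph, or function is stored.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical mode table: (elapsed time [s], command vehicle speed [km/h]).
# The values are illustrative only and do not come from any regulated mode.
MODE_TABLE: List[Tuple[float, float]] = [
    (0.0, 0.0), (10.0, 0.0), (20.0, 30.0), (40.0, 30.0), (50.0, 0.0),
]


def command_speed(t: float, table: List[Tuple[float, float]] = MODE_TABLE) -> float:
    """Return the command vehicle speed at elapsed time t.

    Linear interpolation between breakpoints is an assumption; outside the
    table the first or last speed is held.
    """
    times = [point[0] for point in table]
    if t <= times[0]:
        return table[0][1]
    if t >= times[-1]:
        return table[-1][1]
    i = bisect_right(times, t)  # index of the first breakpoint strictly after t
    (t0, v0), (t1, v1) = table[i - 1], table[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

Because the whole table is known in advance, a lookup like this can return the command vehicle speed for future times as well as for the current time.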

The running state acquisition unit 22 acquires the running state of the vehicle 2 at the current time. The running state of the vehicle 2 can be obtained from various measuring instruments (not shown) provided on the vehicle 2 and from the operation record kept in the drive robot 4 that operates the vehicle 2. In other words, the running state is a numerical representation of the operating condition of the vehicle 2 at the current time, and the means of acquiring these values is not limited to measurements by the instruments of the vehicle 2 but also includes values obtainable through the drive robot 4.
The running state includes the accelerator pedal operation amount since the previous running state acquisition time, taken from the operation record of the drive robot 4 (hereinafter, the accelerator pedal detection amount); the brake pedal operation amount since the previous running state acquisition time, taken from the operation record of the drive robot 4 (hereinafter, the brake pedal detection amount); the engine speed detected in the vehicle 2 (hereinafter, the engine speed detection amount); and the vehicle speed detected in the vehicle 2 (hereinafter, the detected vehicle speed).
The running state further includes the command vehicle speed that the vehicle 2 should achieve at the time the running state is acquired.

Each of the above running state items may be a scalar value, or it may consist of a plurality of values.
The running state items are mainly used as inputs when the machine learning device described later is trained to generate the learning model (first learning model 40). For this reason, acquiring each item not only at the time the running state is acquired but also at multiple times before and after it, and feeding those values to the machine learning device, may allow it to learn more effectively by drawing on the past course and the expected future.
For example, running state items obtained by actually observing and measuring the state of the vehicle 2, such as the accelerator pedal detection amount, the brake pedal detection amount, the engine speed detection amount, and the detected vehicle speed, may each have a plurality of values as a series covering the observation data reference time T_obs, where T_obs is the length of past observation data used by the learning algorithm of the machine learning device.
Unlike such observation data, the command vehicle speed is stored in the command vehicle speed storage unit 21, so its value at any time can be referenced whenever needed. It may have a plurality of values as a series covering the command vehicle speed reference time T_ref, where T_ref is the length of future command vehicle speed used by the learning algorithm of the machine learning device.
In this embodiment, each running state item consists of a plurality of values.
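The following minimal Python sketch shows how a running state made of several series could be flattened into one input vector: the last T_obs seconds of each observed item, followed by the command vehicle speed from the current time up to T_ref seconds ahead. The function name `build_state`, the item keys in `OBSERVED`, the uniform sampling interval `dt`, and the front-padding rule are assumptions for illustration and are not taken from the source.

```python
from typing import Callable, Dict, List

# Hypothetical keys for the observed running state items named in the text.
OBSERVED = ("accel_pedal", "brake_pedal", "engine_rpm", "vehicle_speed")


def build_state(history: Dict[str, List[float]],
                command_speed: Callable[[float], float],
                t_now: float, t_obs: float, t_ref: float,
                dt: float) -> List[float]:
    """Flatten past observations and future command speeds into one vector.

    history maps each observed item to its samples at spacing dt, most recent
    last, and must hold at least one sample per item. t_obs must be >= dt.
    """
    n_obs = round(t_obs / dt)   # number of past samples per observed item
    n_ref = round(t_ref / dt)   # number of future command speed samples
    state: List[float] = []
    for name in OBSERVED:
        samples = history[name][-n_obs:]
        # Assumption: repeat the oldest sample when the history is still short.
        pad = [samples[0]] * (n_obs - len(samples))
        state.extend(pad + samples)
    # Command vehicle speed at the current time and at n_ref future steps.
    state.extend(command_speed(t_now + k * dt) for k in range(n_ref + 1))
    return state
```

With four observed items, the vector has 4 * n_obs + n_ref + 1 entries, which would fix the number of input nodes of the model that consumes it.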

The running state acquisition unit 22 acquires the accelerator pedal detection amount, the brake pedal detection amount, the engine speed detection amount, and the detected vehicle speed from the various measuring instruments (not shown) provided on the vehicle 2, the operation record kept in the drive robot 4, and the like.
The running state acquisition unit 22 also acquires the command vehicle speed from the command vehicle speed storage unit 21.
The running state acquisition unit 22 transmits the acquired running state to the learning unit 30.

The vehicle operation control unit 23 receives the operation content that the operation content inference unit 31, described next, has inferred from the running state transmitted by the running state acquisition unit 22. Based on this operation content, it generates a control signal for controlling the drive robot 4 and transmits the signal to the drive robot 4.

Next, the behavior of the components of the learning unit 30 during learning of the operation content will be described.
The operation content inference unit 31 of the learning unit 30 includes a machine learning device. Reinforcement learning of this machine learning device generates the first learning model 40, which is used to infer the operation content of the vehicle 2. That is, the machine learning device produces the trained model 40, in which appropriate learning parameters have been learned and which is used as a program module forming part of artificial intelligence software.
To perform reinforcement learning on the machine learning device, the learning unit 30 accumulates running data, which is the input required for the reinforcement learning. Running data is accumulated as the control device 10 controls the running of the vehicle 2, episode by episode, using the operation content inferred by the machine learning device while its learning is still in progress; an episode is the unit of time over which one series of data is collected. After the machine learning device has been trained by reinforcement learning on this running data, running data is accumulated again using the operation content it now outputs, and the machine learning device is trained again. By updating the machine learning device repeatedly in this way, the final, trained first learning model 40 is generated.
Hereinafter, for simplicity, both the machine learning device included in the operation content inference unit 31 and the learning model generated by training it are referred to as the first learning model 40.

When the operation content inference unit 31 receives the running state from the running state acquisition unit 22 at a certain time (the first time), it uses the first learning model 40, which is still being trained, to infer from that running state the operation content of the vehicle 2 after the first time.

The first learning model 40 infers the operation content of the vehicle 2 at a predetermined first time interval. This inference interval of the first learning model 40 is hereinafter called the step period T_step.
As described later, the drive robot control unit 20 transmits control signals for controlling the drive robot 4 to the drive robot 4 at a predetermined second time interval. If this transmission interval is called the control period T_s, the step period T_step may be equal to the control period T_s or larger than it. When T_step is larger than T_s, a single inference by the first learning model 40 outputs a plurality of operation contents for the vehicle 2, one for each of the control periods T_s contained in the step period T_step.
In this embodiment, the operation content inference unit 31 uses the first learning model 40 to infer the operation content at a plurality of times, corresponding to a plurality of control periods T_s, within the time range from the first time to the step period T_step later.

As described above, the first learning model 40 infers the operation content of the vehicle 2 for the future after the first time, at least up to the step period T_step later, but the time range covered by the inference may actually be longer than T_step. That is, the first learning model 40 may infer the operation content of the vehicle 2 within the time range up to T_step later and, at the same time, infer the operation content at times further in the future than T_step. The time range inferred by the first learning model 40 is called the action output time T_pred. In this case, the operation content inference unit 31 uses the first learning model 40 to infer the operation content at a plurality of times, corresponding to a plurality of control periods T_s, within the time range from the first time to the action output time T_pred later.
In this configuration, when the first learning model 40 infers the operation content up to T_step later, during which the vehicle 2 is actually operated, it also infers the operation content at times later than T_step, so it may come to make inferences that anticipate future conditions.
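The relationship between the control period T_s, the step period T_step, and the action output time T_pred can be sketched as follows. One inference yields T_pred / T_s operation contents, of which the first T_step / T_s cover the interval in which the vehicle is actually operated before the next inference. The function name `split_actions`, the requirement that the periods be integer multiples of T_s, and the treatment of the remaining entries as unexecuted lookahead are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def split_actions(predicted: Sequence[float], t_s: float, t_step: float,
                  t_pred: float) -> Tuple[List[float], List[float]]:
    """Split one inference output into the executed part and the lookahead part.

    predicted holds one operation content per control period t_s, covering the
    action output time t_pred. Requires t_s <= t_step <= t_pred.
    """
    if not (t_s <= t_step <= t_pred):
        raise ValueError("expected t_s <= t_step <= t_pred")
    n_pred = round(t_pred / t_s)   # operation contents per inference
    n_exec = round(t_step / t_s)   # operation contents used before the next inference
    if len(predicted) != n_pred:
        raise ValueError(f"expected {n_pred} operation contents, got {len(predicted)}")
    return list(predicted[:n_exec]), list(predicted[n_exec:])
```

For example, with t_s = 0.1, t_step = 0.5, and t_pred = 1.0, each inference would produce ten operation contents, the first five of which fall within the step period.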

FIG. 3 is a block diagram of the first learning model 40.
In this embodiment, the first learning model 40 is implemented as a fully connected neural network with five layers in total, three of which are intermediate layers. The first learning model 40 includes an input layer 41, an intermediate layer 42, and an output layer 43.
In FIG. 3, each layer is drawn as a rectangle, and the nodes contained in each layer are omitted.

The input layer 41 has a plurality of input nodes. The input nodes are provided so as to correspond to the items of the running state s, for example from the accelerator pedal detection amount s1 and the brake pedal detection amount s2 through to the command vehicle speed sN.
As already explained, each running state item s consists of a plurality of values. For example, the input corresponding to the accelerator pedal detection amount s1, shown as a single rectangle in FIG. 3, actually has one input node for each of the plurality of values of the accelerator pedal detection amount s1.
Each input node stores the value of the corresponding running state item s received from the running state acquisition unit 22.

中間層４２は、第１中間層４２ａ、第２中間層４２ｂ、及び第３中間層４２ｃを備えている。
中間層４２の各ノードにおいては、前段の層（例えば、第１中間層４２ａの場合は入力層４１、第２中間層４２ｂの場合は第１中間層４２ａ）の各ノードから、この前段の層の各ノードに格納された値と、前段の層の各ノードから当該中間層４２のノードへの重みを基にした演算がなされて、当該中間層４２のノード内に演算結果が格納される。
本実施形態においては、この演算において使用される活性化関数は、例えばＲｅＬＵ（ＲｅｃｔｉｆｉｅｄＬｉｎｅａｒＵｎｉｔ）である。 The intermediate layer 42 includes a first intermediate layer 42a, a second intermediate layer 42b, and a third intermediate layer 42c.
At each node of the intermediate layer 42, a computation is performed based on the values stored in the nodes of the preceding layer (for example, the input layer 41 in the case of the first intermediate layer 42a, and the first intermediate layer 42a in the case of the second intermediate layer 42b) and the weights from those nodes to the node in question, and the result is stored in that node.
In this embodiment, the activation function used in this operation is, for example, ReLU (Rectified Linear Unit).

出力層４３においても、中間層４２の各々と同様な演算が行われ、出力層４３に備えられた各出力ノードに演算結果が格納される。複数の出力ノードの各々は、操作の内容ａの各々に対応するように設けられている。本実施形態においては、車両２の操作の対象は、アクセルペダル２ｃとブレーキペダル２ｄであり、これに対応して、操作の内容ａは、例えばアクセルペダル操作ａ１とブレーキペダル操作ａ２となっている。
既に説明したように、各操作の内容ａは、複数の値により実現されている。例えば、図３においては、一つの矩形として示されている、アクセルペダル操作ａ１に対応する出力は、実際には、アクセルペダル操作ａ１の複数の値の各々に対応するように、出力ノードが設けられている。 In the output layer 43 as well, operations similar to those in each of the intermediate layers 42 are performed, and the operation results are stored in each output node provided in the output layer 43 . Each of the plurality of output nodes is provided so as to correspond to each of the contents a of the operation. In this embodiment, the objects of operation of the vehicle 2 are the accelerator pedal 2c and the brake pedal 2d, and correspondingly, the operation contents a are, for example, the accelerator pedal operation a1 and the brake pedal operation a2. .
As already explained, each operation content a is realized by a plurality of values. For example, the output corresponding to the accelerator pedal operation a1, shown as a single rectangle in FIG. 3, is actually provided with output nodes corresponding to each of the plurality of values of the accelerator pedal operation a1.

第１学習モデル４０においては、上記のように走行状態ｓが入力されて、適切な操作の内容ａを演算することができるように学習がなされる。この学習においては、重みやバイアスの値等、ニューラルネットワークを構成する各パラメータの値が調整される。
第１学習モデル４０の具体的な学習については、後に説明する。 In the first learning model 40, the driving state s is input as described above, and learning is performed so that the appropriate operation content a can be calculated. In this learning, the values of the parameters that make up the neural network, such as weights and bias values, are adjusted.
Specific learning of the first learning model 40 will be described later.
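The structure described above (an input layer, three intermediate layers with ReLU, and an output layer, all fully connected) can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the layer widths, the random initialization, the linear output layer, and the names `make_actor` and `actor_forward` are assumptions introduced here.

```python
import random


def relu(x):
    """ReLU activation, as named in the description of the intermediate layers."""
    return x if x > 0.0 else 0.0


def make_layer(n_in, n_out, rng):
    """One fully connected layer: a weight matrix (n_out x n_in) and a bias vector."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, inputs, activation=None):
    """For each node: weighted sum of the preceding layer's values plus a bias."""
    weights, biases = layer
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def make_actor(n_state, n_hidden, n_action, seed=0):
    """Five layers of nodes in total, hence four sets of weights between them."""
    rng = random.Random(seed)
    sizes = [n_state, n_hidden, n_hidden, n_hidden, n_action]
    return [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def actor_forward(actor, state):
    """Map the running state values s to the operation content values a."""
    h = list(state)
    for layer in actor[:-1]:
        h = dense(layer, h, relu)      # intermediate layers use ReLU
    return dense(actor[-1], h)         # assumption: linear output layer


# Example: 6 state values in, 4 operation values out (sizes are arbitrary).
actor = make_actor(n_state=6, n_hidden=8, n_action=4)
operation = actor_forward(actor, [0.1] * 6)
```

Training would then adjust the weights and biases held in `actor`; the sketch covers only the forward computation.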

操作内容推論部３１は、上記のように、第１の時刻における走行状態ｓを基に、第１の時刻より後の行動出力時間Ｔ_ｐｒｅｄまでにおける車両２の操作の内容ａを推論し、ドライブロボット制御部２０の車両操作制御部２３へ送信する。
この操作の内容ａに基づき、車両操作制御部２３はステップ周期Ｔ_ｓｔｅｐの間、ドライブロボット４を操作する。
そして、走行状態取得部２２は、操作後の車両２の、第１の時刻よりも後の第２の時刻における走行状態を再度取得する。 As described above, the operation content inference unit 31 infers, based on the running state s at a first time, the operation content a of the vehicle 2 up to the action output time T_pred after the first time, and transmits it to the vehicle operation control unit 23 of the drive robot control unit 20.
Based on this operation content a, the vehicle operation control unit 23 operates the drive robot 4 during the step period T_step.
The running state acquisition unit 22 then acquires again the running state of the vehicle 2 after the operation, at a second time later than the first time.

以降においては、第１の時刻で取得された走行状態と第２の時刻で取得された走行状態を区別して記載するため、第１の時刻で取得された走行状態を走行状態ｓ_ｔ、第２の時刻で取得された走行状態を走行状態ｓ_ｔ＋１と記載する。また、第１の時刻で取得された走行状態ｓ_ｔに対して推論され、実行された操作の内容を操作の内容ａ_ｔと記載する。
操作内容推論部３１は、第１の時刻における走行状態ｓ_ｔ、これに対して推論され実際に実行された操作の内容ａ_ｔ、及び第２の時刻における走行状態ｓ_ｔ＋１を、次に説明する報酬計算部３２に送信する。
報酬計算部３２は、強化学習に際し必要となる値である報酬を計算する。後述する強化学習部３３は、この報酬を基に、操作の内容ａ_ｔがどの程度適切であったかを示す行動価値を計算し、第１学習モデル４０は、この行動価値が高くなるような操作の内容ａを出力するように、強化学習が行われる。
報酬計算部３２によって計算された報酬は、操作内容推論部３１に送信されて操作内容推論部３１により受信され、これを受けて操作内容推論部３１は、第１の時刻における走行状態ｓ_ｔ、操作の内容ａ_ｔ、第２の時刻における走行状態ｓ_ｔ＋１と、及び受信した報酬の組み合わせを、学習用データ記憶部３４へ送信し、記憶する。 Hereinafter, to distinguish the running state acquired at the first time from that acquired at the second time, the former is written as running state s_t and the latter as running state s_t+1. The operation content inferred for the running state s_t acquired at the first time, and then executed, is written as operation content a_t.
The operation content inference unit 31 transmits the running state s_t at the first time, the operation content a_t that was inferred for it and actually executed, and the running state s_t+1 at the second time to the reward calculation unit 32 described next.
The reward calculation unit 32 calculates the reward, a value required for reinforcement learning. Based on this reward, the reinforcement learning unit 33 described later calculates an action value indicating how appropriate the operation content a_t was, and reinforcement learning is performed so that the first learning model 40 outputs operation contents a that increase this action value.
The reward calculated by the reward calculation unit 32 is transmitted to and received by the operation content inference unit 31. The operation content inference unit 31 then transmits the combination of the running state s_t at the first time, the operation content a_t, the running state s_t+1 at the second time, and the received reward to the learning data storage unit 34, where it is stored.

報酬計算部３２は、操作内容推論部３１から、第１の時刻における走行状態ｓ_ｔ、操作の内容ａ_ｔ、及び第２の時刻における走行状態ｓ_ｔ＋１を受信する。報酬は、操作の内容ａ_ｔ、及びこれに伴う第２の時刻における走行状態ｓ_ｔ＋１が望ましくないほど小さい値を、望ましいほど大きい値を、有するように設計されている。強化学習部３３は、後述の数式２により、報酬が大きいほど行動価値（評価値）を高くするように計算し、第１学習モデル４０はこの行動価値が高くなるような操作の内容ａ_ｔを出力するように、強化学習が行われる。 The reward calculation unit 32 receives the running state s_t at the first time, the operation content a_t, and the running state s_t+1 at the second time from the operation content inference unit 31. The reward is designed to take a smaller value the less desirable the operation content a_t and the resulting running state s_t+1 at the second time are, and a larger value the more desirable they are. The reinforcement learning unit 33 calculates the action value (evaluation value) according to Equation 2 described later, such that a larger reward yields a higher action value, and reinforcement learning is performed so that the first learning model 40 outputs operation contents a_t that increase this action value.

本実施形態においては、制御装置１０は、燃費や排ガス性能を考慮してドライブロボット４を制御するものであるため、報酬には、燃費と排ガス性能が反映されている。
燃費は、例えばガソリンや軽油などの燃料の、単位容量当たりの走行距離、または、一定の距離をどれだけの燃料で走行できるかを示す指標である。
排ガス性能は、排気ガスに含まれる、一酸化炭素、窒素酸化物、炭化水素類、黒煙等の大気汚染物質の濃度が、一定の基準以下であるか否かを示す指標である。
これら燃費や排ガス性能は、車両２の操作という観点では、アクセルペダル２ｃとブレーキペダル２ｄの操作が関連する。すなわち、報酬は、アクセルペダル２ｃとブレーキペダル２ｄの検出量に基づいて計算されるのが適切である。 In this embodiment, the control device 10 controls the drive robot 4 in consideration of fuel consumption and exhaust gas performance, so fuel consumption and exhaust gas performance are reflected in the reward.
Fuel consumption is an index indicating, for a fuel such as gasoline or light oil, the distance traveled per unit volume of fuel, or how much fuel is needed to travel a given distance.
Exhaust gas performance is an index that indicates whether the concentration of air pollutants such as carbon monoxide, nitrogen oxides, hydrocarbons, and black smoke contained in exhaust gas is below a certain standard.
From the viewpoint of operation of the vehicle 2, these fuel consumption and exhaust gas performance are related to the operation of the accelerator pedal 2c and the brake pedal 2d. That is, it is appropriate that the reward is calculated based on the detected amounts of the accelerator pedal 2c and the brake pedal 2d.

ただし、燃費や排ガス性能を向上させることに注目するあまり、制御装置１０が本来達成すべき、指令車速への追従性能が損なわれることがあってはならない。このため、報酬は、アクセルペダル２ｃとブレーキペダル２ｄの検出量に加えて、指令車速への追従性能に基づいて計算されるのが望ましい。 However, focusing too heavily on improving fuel consumption and exhaust gas performance must not impair the ability to follow the commanded vehicle speed, which the control device 10 is fundamentally meant to achieve. For this reason, the reward is desirably calculated based on the performance of following the commanded vehicle speed in addition to the detected amounts of the accelerator pedal 2c and the brake pedal 2d.

本実施形態においては、ｒ_ｓを指令車速への追従性に基づいて計算される指令車速報酬要素（第２要素）、ｒ_ＡＰをアクセルペダル２ｃの検出量に基づいて計算されるアクセルペダル報酬要素（第１要素）、ｒ_ＢＰをブレーキペダル２ｄの検出量に基づいて計算されるブレーキペダル報酬要素（第１要素）としたときに、報酬ｒは、次の数式１によって表わされる。
ここで、ｗ_ｓ、ｗ_ＡＰ、ｗ_ＢＰは、それぞれ、指令車速報酬要素ｒ_ｓ、アクセルペダル報酬要素ｒ_ＡＰ、ブレーキペダル報酬要素ｒ_ＢＰに対応した重みである。 In the present embodiment, the reward r is expressed by Equation 1, where r_s is a commanded vehicle speed reward element (second element) calculated based on how well the commanded vehicle speed is followed, r_AP is an accelerator pedal reward element (first element) calculated based on the detected amount of the accelerator pedal 2c, and r_BP is a brake pedal reward element (first element) calculated based on the detected amount of the brake pedal 2d.
Here, w_s, w_AP, and w_BP are the weights corresponding to the commanded vehicle speed reward element r_s, the accelerator pedal reward element r_AP, and the brake pedal reward element r_BP, respectively.

このように、報酬ｒは、指令車速への追従性や、アクセルペダル２ｃ、ブレーキペダル２ｄの検出量等の、各要素に対応する報酬要素を計算したうえで、これらの重みづけ和を計算することで、一つのスカラー値として計算されている。 In this way, the reward r is calculated by calculating the weighted sum of the reward elements corresponding to each element, such as the followability to the commanded vehicle speed and the detected amount of the accelerator pedal 2c and the brake pedal 2d. Therefore, it is calculated as one scalar value.
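Equation 1 itself is not reproduced in this text. From the description above (a single scalar obtained as the weighted sum of the three reward elements, with weights w_s, w_AP, and w_BP), it can be assumed to take the following form. This is a reconstruction from the surrounding definitions, not the published formula:

```latex
r = w_{s}\, r_{s} + w_{AP}\, r_{AP} + w_{BP}\, r_{BP}
```

For example, with w_s = 1.0, w_AP = 0.5, w_BP = 0.5 and reward elements r_s = 0.8, r_AP = 0.6, r_BP = -0.2 (all values chosen here only for illustration), the reward would be r = 0.8 + 0.3 - 0.1 = 1.0.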

指令車速報酬要素ｒ_ｓは、例えば、操作内容推論部３１から受信した第２の時刻における走行状態ｓ_ｔ＋１において、検出車速と指令車速の差分の絶対値を計算し、これが所定の第１閾値以下であれば、差分値が小さいほど大きな値となる、正の値とし、第１閾値よりも大きければ、差分値が大きいほど小さな値となる、負の値とすることで、計算され得る。
この場合においては、操作の内容ａ_ｔによって検出車速が指令車速に十分に追従できている場合においては、検出車速と指令車速の差分の絶対値は第１閾値以下の値となり、指令車速報酬要素ｒ_ｓの値が大きくなる。逆に、検出車速が指令車速に十分に追従できていない場合においては、検出車速と指令車速の差分の絶対値は第１閾値よりも大きな値となり、指令車速報酬要素ｒ_ｓの値が小さくなる。
このように、操作の内容ａ_ｔに基づいたドライブロボット４の操作の後の、第２の時刻における検出車速と指令車速との差が小さいほど値が大きくなるように設定された指令車速報酬要素ｒ_ｓ（第２要素）が計算され、指令車速報酬要素ｒ_ｓを基に報酬ｒが計算されている。 The commanded vehicle speed reward element r_s can be calculated, for example, as follows. For the running state s_t+1 at the second time received from the operation content inference unit 31, the absolute value of the difference between the detected vehicle speed and the commanded vehicle speed is calculated. If this is equal to or less than a predetermined first threshold, r_s is set to a positive value that becomes larger as the difference becomes smaller; if it is greater than the first threshold, r_s is set to a negative value that becomes smaller as the difference becomes larger.
In this case, when the operation content a_t allows the detected vehicle speed to follow the commanded vehicle speed sufficiently, the absolute value of the difference between the detected vehicle speed and the commanded vehicle speed is equal to or less than the first threshold, and the value of the commanded vehicle speed reward element r_s becomes large. Conversely, when the detected vehicle speed does not follow the commanded vehicle speed sufficiently, the absolute value of the difference is greater than the first threshold, and the value of the commanded vehicle speed reward element r_s becomes small.
In this way, the commanded vehicle speed reward element r_s (second element) is calculated so that its value becomes larger as the difference between the detected vehicle speed and the commanded vehicle speed at the second time, after the drive robot 4 has been operated based on the operation content a_t, becomes smaller, and the reward r is calculated based on the commanded vehicle speed reward element r_s.
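The threshold rule for the commanded vehicle speed reward element can be sketched as below. The text fixes only the sign and the monotonic behaviour on each side of the first threshold; the linear shape, the numeric ranges, and the name `speed_reward` are assumptions made for this illustration.

```python
def speed_reward(v_detected, v_command, threshold):
    """Commanded vehicle speed reward element r_s (illustrative shape only).

    Positive and larger for smaller errors when |error| <= threshold,
    negative and smaller for larger errors when |error| > threshold.
    """
    error = abs(v_detected - v_command)
    if error <= threshold:
        # Assumption: linear from 1.0 (no error) down to 0.5 (at the threshold).
        return 1.0 - 0.5 * error / threshold
    # Assumption: linear and below -1.0 once the threshold is exceeded.
    return -error / threshold


# Example with a hypothetical first threshold of 2 km/h.
print(speed_reward(50.0, 50.0, 2.0))  # perfect tracking
print(speed_reward(51.0, 50.0, 2.0))  # within the threshold
print(speed_reward(54.0, 50.0, 2.0))  # outside the threshold
```

Any other shape with the same sign and ordering properties would equally match the description.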

アクセルペダル報酬要素ｒ_ＡＰに関しては、例えば、操作内容推論部３１から受信した第２の時刻における走行状態ｓ_ｔ＋１において、第１の時刻からのアクセルペダル検出量の推移を取得し、時間軸と、アクセルペダル２ｃの検出量を軸とする座標系上で、検出量を関数として表現する。アクセルペダル報酬要素ｒ_ＡＰは、この関数の二階微分または一階微分の値を基に計算され得る。 Regarding the accelerator pedal reward element r_AP, for example, the transition of the accelerator pedal detection amount since the first time is obtained from the running state s_t+1 at the second time received from the operation content inference unit 31, and the detection amount is expressed as a function on a coordinate system whose axes are time and the detection amount of the accelerator pedal 2c. The accelerator pedal reward element r_AP can be calculated based on the value of the second derivative or the first derivative of this function.

二階微分の場合においては、例えば、上記関数の二階微分の最大値の絶対値を計算し、これが所定の第２閾値（所定の閾値）以下であれば、最大値の絶対値が小さいほど大きな値となる、正の値とし、第２閾値よりも大きければ、最大値の絶対値が大きいほど小さな値となる、負の値とすることで、計算され得る。
この場合においては、操作の内容ａ_ｔにおいてアクセルペダル２ｃの開度が急激に変わらず、燃費や排ガス性能が良好であると考えられる場合においては、上記曲線の接線の傾きは時間と共に大きく変化せず、したがって関数の二階微分の最大値の絶対値は第２閾値以下の値となり、アクセルペダル報酬要素ｒ_ＡＰの値が大きくなる。逆に、操作の内容ａ_ｔにおいてアクセルペダル２ｃの開度が急激に変化し、燃費や排ガス性能が良好ではないと考えられる場合においては、上記曲線の接線の傾きは時間と共に大きく変化し、したがって関数の二階微分の最大値の絶対値は第２閾値よりも大きな値となり、アクセルペダル報酬要素ｒ_ＡＰの値が小さくなる。 In the case of the second derivative, r_AP can be calculated, for example, as follows. The absolute value of the maximum value of the second derivative of the above function is calculated. If this is equal to or less than a predetermined second threshold (predetermined threshold), r_AP is set to a positive value that becomes larger as the absolute value of the maximum value becomes smaller; if it is greater than the second threshold, r_AP is set to a negative value that becomes smaller as the absolute value of the maximum value becomes larger.
In this case, when the opening of the accelerator pedal 2c does not change abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are therefore considered good, the slope of the tangent to the above curve does not change greatly over time. The absolute value of the maximum value of the second derivative of the function is thus equal to or less than the second threshold, and the value of the accelerator pedal reward element r_AP becomes large. Conversely, when the opening of the accelerator pedal 2c changes abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are considered poor, the slope of the tangent to the above curve changes greatly over time. The absolute value of the maximum value of the second derivative is thus greater than the second threshold, and the value of the accelerator pedal reward element r_AP becomes small.

一階微分の場合においても同様に、例えば、上記関数の一階微分の最大値の絶対値を計算し、これが所定の第３閾値（所定の閾値）以下であれば、最大値の絶対値が小さいほど大きな値となる、正の値とし、第３閾値よりも大きければ、最大値の絶対値が大きいほど小さな値となる、負の値とすることで、計算され得る。
この場合においては、操作の内容ａ_ｔにおいてアクセルペダル２ｃの開度が急激に変わらず、燃費や排ガス性能が良好であると考えられる場合においては、上記曲線の接線の傾きは大きくはなく、したがって関数の一階微分の最大値の絶対値は第３閾値以下の値となり、アクセルペダル報酬要素ｒ_ＡＰの値が大きくなる。逆に、操作の内容ａ_ｔにおいてアクセルペダル２ｃの開度が急激に変化し、燃費や排ガス性能が良好ではないと考えられる場合においては、上記曲線の接線の傾きは大きくなり、したがって関数の一階微分の最大値の絶対値は第３閾値よりも大きな値となり、アクセルペダル報酬要素ｒ_ＡＰの値が小さくなる。 Similarly, in the case of the first derivative, r_AP can be calculated, for example, as follows. The absolute value of the maximum value of the first derivative of the above function is calculated. If this is equal to or less than a predetermined third threshold (predetermined threshold), r_AP is set to a positive value that becomes larger as the absolute value of the maximum value becomes smaller; if it is greater than the third threshold, r_AP is set to a negative value that becomes smaller as the absolute value of the maximum value becomes larger.
In this case, when the opening of the accelerator pedal 2c does not change abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are therefore considered good, the slope of the tangent to the above curve is not large. The absolute value of the maximum value of the first derivative of the function is thus equal to or less than the third threshold, and the value of the accelerator pedal reward element r_AP becomes large. Conversely, when the opening of the accelerator pedal 2c changes abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are considered poor, the slope of the tangent to the above curve becomes large. The absolute value of the maximum value of the first derivative is thus greater than the third threshold, and the value of the accelerator pedal reward element r_AP becomes small.

このように、アクセルペダル報酬要素ｒ_ＡＰは、第１の時刻から第２の時刻までのアクセルペダル検出量の推移を関数として表わしたときに、関数の一階微分または二階微分の最大値の絶対値が所定の第２、第３閾値以下であれば、最大値の絶対値に応じた正の値となるように、かつ、最大値の絶対値が所定の第２、第３閾値よりも大きければ、最大値の絶対値に応じた負の値となるように、計算されている。 In this way, when the transition of the accelerator pedal detection amount from the first time to the second time is expressed as a function, the accelerator pedal reward element r_AP is calculated so as to be a positive value corresponding to the absolute value of the maximum value of the first or second derivative of the function if that absolute value is equal to or less than the predetermined second or third threshold, and a negative value corresponding to that absolute value if it is greater than the predetermined second or third threshold.
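The derivative-based calculation of a pedal reward element can be sketched as follows. The text does not say how the derivative is obtained from the detected amounts; this sketch assumes equally spaced samples and finite differences, and it reuses a linear reward shape chosen only for illustration. The names `finite_differences` and `pedal_reward` are hypothetical.

```python
def finite_differences(samples, dt, order):
    """Approximate the first (order=1) or second (order=2) derivative
    of equally spaced samples by repeated forward differences."""
    values = list(samples)
    for _ in range(order):
        values = [(b - a) / dt for a, b in zip(values, values[1:])]
    return values


def pedal_reward(samples, dt, threshold, order=2):
    """Pedal reward element from the pedal detection amounts between
    the first and second times (illustrative shape only)."""
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(samples) <= order:
        raise ValueError("not enough samples for the requested derivative")
    derivative = finite_differences(samples, dt, order)
    # Absolute value of the maximum value of the derivative, as worded in the text.
    peak = abs(max(derivative))
    if peak <= threshold:
        return 1.0 - 0.5 * peak / threshold   # positive, larger for smaller peak
    return -peak / threshold                  # negative, smaller for larger peak


# A steady ramp versus an abrupt step, with a hypothetical threshold of 2.0.
smooth = pedal_reward([0.0, 1.0, 2.0, 3.0], dt=1.0, threshold=2.0)
abrupt = pedal_reward([0.0, 0.0, 4.0, 4.0], dt=1.0, threshold=2.0)
```

Here the steady ramp has a zero second derivative and receives the largest reward, while the step exceeds the threshold and receives a negative one. The same function could be applied to the brake pedal detection amounts with their own thresholds.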

ブレーキペダル報酬要素ｒ_ＢＰに関しても同様で、例えば、操作内容推論部３１から受信した第２の時刻における走行状態ｓ_ｔ＋１において、第１の時刻からのブレーキペダル検出量の推移を取得し、時間軸と、ブレーキペダル２ｄの検出量を軸とする座標系上で、検出量を関数として表現する。ブレーキペダル報酬要素ｒ_ＢＰは、この関数の二階微分または一階微分の値を基に計算され得る。 The same applies to the brake pedal reward element r_BP. For example, the transition of the brake pedal detection amount since the first time is obtained from the running state s_t+1 at the second time received from the operation content inference unit 31, and the detection amount is expressed as a function on a coordinate system whose axes are time and the detection amount of the brake pedal 2d. The brake pedal reward element r_BP can be calculated based on the value of the second derivative or the first derivative of this function.

二階微分の場合においては、例えば、上記関数の二階微分の最大値の絶対値を計算し、これが所定の第４閾値（所定の閾値）以下であれば、最大値の絶対値が小さいほど大きな値となる、正の値とし、第４閾値よりも大きければ、最大値の絶対値が大きいほど小さな値となる、負の値とすることで、計算され得る。
この場合においては、操作の内容ａ_ｔにおいてブレーキペダル２ｄの開度が急激に変わらず、燃費や排ガス性能が良好であると考えられる場合においては、上記曲線の接線の傾きは時間と共に大きく変化せず、したがって関数の二階微分の最大値の絶対値は第４閾値以下の値となり、ブレーキペダル報酬要素ｒ_ＢＰの値が大きくなる。逆に、操作の内容ａ_ｔにおいてブレーキペダル２ｄの開度が急激に変化し、燃費や排ガス性能が良好ではないと考えられる場合においては、上記曲線の接線の傾きは時間と共に大きく変化し、したがって関数の二階微分の最大値の絶対値は第４閾値よりも大きな値となり、ブレーキペダル報酬要素ｒ_ＢＰの値が小さくなる。 In the case of the second derivative, r_BP can be calculated, for example, as follows. The absolute value of the maximum value of the second derivative of the above function is calculated. If this is equal to or less than a predetermined fourth threshold (predetermined threshold), r_BP is set to a positive value that becomes larger as the absolute value of the maximum value becomes smaller; if it is greater than the fourth threshold, r_BP is set to a negative value that becomes smaller as the absolute value of the maximum value becomes larger.
In this case, when the opening of the brake pedal 2d does not change abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are therefore considered good, the slope of the tangent to the above curve does not change greatly over time. The absolute value of the maximum value of the second derivative of the function is thus equal to or less than the fourth threshold, and the value of the brake pedal reward element r_BP becomes large. Conversely, when the opening of the brake pedal 2d changes abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are considered poor, the slope of the tangent to the above curve changes greatly over time. The absolute value of the maximum value of the second derivative is thus greater than the fourth threshold, and the value of the brake pedal reward element r_BP becomes small.

一階微分の場合においても同様に、例えば、上記関数の一階微分の最大値の絶対値を計算し、これが所定の第５閾値（所定の閾値）以下であれば、最大値の絶対値が小さいほど大きな値となる、正の値とし、第５閾値よりも大きければ、最大値の絶対値が大きいほど小さな値となる、負の値とすることで、計算され得る。
この場合においては、操作の内容ａ_ｔにおいてブレーキペダル２ｄの開度が急激に変わらず、燃費や排ガス性能が良好であると考えられる場合においては、上記曲線の接線の傾きは大きくはなく、したがって関数の一階微分の最大値の絶対値は第５閾値以下の値となり、ブレーキペダル報酬要素ｒ_ＢＰの値が大きくなる。逆に、操作の内容ａ_ｔにおいてブレーキペダル２ｄの開度が急激に変化し、燃費や排ガス性能が良好ではないと考えられる場合においては、上記曲線の接線の傾きは大きくなり、したがって関数の一階微分の最大値の絶対値は第５閾値よりも大きな値となり、ブレーキペダル報酬要素ｒ_ＢＰの値が小さくなる。 Similarly, in the case of the first derivative, r_BP can be calculated, for example, as follows. The absolute value of the maximum value of the first derivative of the above function is calculated. If this is equal to or less than a predetermined fifth threshold (predetermined threshold), r_BP is set to a positive value that becomes larger as the absolute value of the maximum value becomes smaller; if it is greater than the fifth threshold, r_BP is set to a negative value that becomes smaller as the absolute value of the maximum value becomes larger.
In this case, when the opening of the brake pedal 2d does not change abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are therefore considered good, the slope of the tangent to the above curve is not large. The absolute value of the maximum value of the first derivative of the function is thus equal to or less than the fifth threshold, and the value of the brake pedal reward element r_BP becomes large. Conversely, when the opening of the brake pedal 2d changes abruptly in the operation content a_t and the fuel consumption and exhaust gas performance are considered poor, the slope of the tangent to the above curve becomes large. The absolute value of the maximum value of the first derivative is thus greater than the fifth threshold, and the value of the brake pedal reward element r_BP becomes small.

このように、ブレーキペダル報酬要素ｒ_ＢＰは、第１の時刻から第２の時刻までのブレーキペダル検出量の推移を関数として表わしたときに、関数の一階微分または二階微分の最大値の絶対値が所定の第４、第５閾値以下であれば、最大値の絶対値に応じた正の値となるように、かつ、最大値の絶対値が所定の第４、第５閾値よりも大きければ、最大値の絶対値に応じた負の値となるように、計算されている。 In this way, when the transition of the brake pedal detection amount from the first time to the second time is expressed as a function, the brake pedal reward element r_BP is calculated so as to be a positive value corresponding to the absolute value of the maximum value of the first or second derivative of the function if that absolute value is equal to or less than the predetermined fourth or fifth threshold, and a negative value corresponding to that absolute value if it is greater than the predetermined fourth or fifth threshold.

上記のように、アクセルペダル２ｃ及びブレーキペダル２ｄの検出量の変化が小さいほど値が大きくなるように設定されたアクセルペダル報酬要素ｒ_ＡＰ、ブレーキペダル報酬要素ｒ_ＢＰが計算され、アクセルペダル報酬要素ｒ_ＡＰ、ブレーキペダル報酬要素ｒ_ＢＰを基に報酬ｒが計算されている。このように、報酬ｒは、入力された操作の内容ａ_ｔが、対応する第２の時刻における走行状態ｓ_ｔ＋１での燃費と排ガス性能が高くなると考えられるものであるほど、大きな値となるように計算されている。 As described above, the accelerator pedal reward element r_AP and the brake pedal reward element r_BP are calculated so that their values become larger as the changes in the detected amounts of the accelerator pedal 2c and the brake pedal 2d become smaller, and the reward r is calculated based on the accelerator pedal reward element r_AP and the brake pedal reward element r_BP. The reward r is thus calculated to take a larger value the more the input operation content a_t is considered to improve fuel consumption and exhaust gas performance in the corresponding running state s_t+1 at the second time.

既に説明したように、上記の数式１によって計算された報酬ｒは、操作内容推論部３１へ送信されて、第１の時刻における走行状態ｓ_ｔ、操作の内容ａ_ｔ、第２の時刻における走行状態ｓ_ｔ＋１と共に組み合わされて、学習用データ記憶部３４へ送信される。
ここで、報酬ｒは、第２の時刻における走行状態ｓ_ｔ＋１に対して計算されたものであるから、以降、報酬ｒ_ｔ＋１と記載する。
学習用データ記憶部３４は、操作内容推論部３１から送信された、第１の時刻における走行状態ｓ_ｔ、操作の内容ａ_ｔ、第２の時刻における走行状態ｓ_ｔ＋１、及び報酬ｒ_ｔ＋１の組み合わせを受信して、記憶する。
この組み合わせは、走行データとして、第１学習モデル４０の強化学習に使用される。 As already explained, the reward r calculated by Equation 1 above is transmitted to the operation content inference unit 31, combined with the running state s_t at the first time, the operation content a_t, and the running state s_t+1 at the second time, and transmitted to the learning data storage unit 34.
Since the reward r is calculated for the running state s_t+1 at the second time, it is hereinafter written as reward r_t+1.
The learning data storage unit 34 receives and stores the combination of the running state s_t at the first time, the operation content a_t, the running state s_t+1 at the second time, and the reward r_t+1 transmitted from the operation content inference unit 31.
This combination is used for reinforcement learning of the first learning model 40 as travel data.

学習部３０は、強化学習に十分なデータが学習用データ記憶部３４に記憶されるまで、操作内容推論部３１による操作の内容ａ_ｔの推論と、操作の内容ａ_ｔがドライブロボット４によって実行された後の状態ｓ_ｔ＋１の取得、及び報酬計算部３２によるこれを基にした報酬ｒ_ｔ＋１の計算を繰り返し、走行データを学習用データ記憶部３４に蓄積する。
学習用データ記憶部３４に、強化学習に十分な量の走行データが蓄積されると、次に説明する強化学習部３３により強化学習が実行される。 The learning unit 30 repeats the inference of the operation content a_t by the operation content inference unit 31, the acquisition of the state s_t+1 after the operation content a_t has been executed by the drive robot 4, and the calculation of the reward r_t+1 based on these by the reward calculation unit 32, accumulating travel data in the learning data storage unit 34 until enough data for reinforcement learning has been stored.
When a sufficient amount of travel data for reinforcement learning is accumulated in the learning data storage unit 34, reinforcement learning is executed by the reinforcement learning unit 33 described below.
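The accumulation of travel data described above can be sketched as a simple store of (s_t, a_t, s_t+1, r_t+1) combinations filled by a collection loop. The class and function names, the fixed capacity, the random sampling, and the four callables standing in for the running state acquisition unit 22, the operation content inference unit 31, the vehicle operation control unit 23, and the reward calculation unit 32 are all assumptions made for this illustration.

```python
import random
from collections import deque, namedtuple

# One piece of travel data: (s_t, a_t, s_t+1, r_t+1).
Transition = namedtuple("Transition", ["s_t", "a_t", "s_next", "r_next"])


class TravelDataStore:
    """Hypothetical stand-in for the learning data storage unit 34."""

    def __init__(self, capacity=10000, seed=0):
        self._data = deque(maxlen=capacity)   # assumption: oldest data is dropped
        self._rng = random.Random(seed)

    def add(self, s_t, a_t, s_next, r_next):
        self._data.append(Transition(s_t, a_t, s_next, r_next))

    def __len__(self):
        return len(self._data)

    def is_ready(self, min_size):
        """True once enough travel data has been accumulated."""
        return len(self._data) >= min_size

    def sample(self, batch_size):
        """Pick several pieces of travel data for one learning step."""
        return self._rng.sample(list(self._data), batch_size)


def collect(store, observe, infer, operate, reward_fn, min_size):
    """Repeat infer -> operate -> observe -> reward until the store is ready."""
    s_t = observe()
    while not store.is_ready(min_size):
        a_t = infer(s_t)                 # operation content for the current state
        operate(a_t)                     # drive robot acts for one step period
        s_next = observe()               # running state at the second time
        store.add(s_t, a_t, s_next, reward_fn(s_t, a_t, s_next))
        s_t = s_next
```

A usage example with toy callables: a counter serves as the "vehicle", every operation adds 1 to it, and the reward is the change in state.

```python
state = {"v": 0}
store = TravelDataStore(capacity=100)
collect(
    store,
    observe=lambda: state["v"],
    infer=lambda s: 1,
    operate=lambda a: state.update(v=state["v"] + a),
    reward_fn=lambda s, a, s2: float(s2 - s),
    min_size=5,
)
batch = store.sample(3)
```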

強化学習部３３は、学習用データ記憶部３４から、複数の走行データを取得し、これを使用して、第１学習モデル４０を強化学習する。以下に説明するように、強化学習部３３は、本実施形態においては、深層強化学習アルゴリズムＤＤＰＧ（ＤｅｅｐＤｅｔｅｒｍｉｎｉｓｔｉｃＰｏｌｉｃｙＧｒａｄｉｅｎｔ）によって、第１学習モデル４０と、後に説明する、強化学習部３３に設けられた第２学習モデル５０を並行して学習させているが、強化学習に用いられるアルゴリズムは、ＤＤＰＧ以外の他のアルゴリズムであってもよい。
まず、第１学習モデル４０の学習について説明する。 The reinforcement learning unit 33 acquires a plurality of pieces of travel data from the learning data storage unit 34 and uses them to train the first learning model 40 by reinforcement learning. As described below, in the present embodiment the reinforcement learning unit 33 uses the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient) to train, in parallel, the first learning model 40 and a second learning model 50 that is provided in the reinforcement learning unit 33 and described later. An algorithm other than DDPG may also be used for reinforcement learning.
First, learning of the first learning model 40 will be described.

既に説明したように、強化学習部３３は、操作の内容ａ_ｔがどの程度適切であったかを示す行動価値を計算し、第１学習モデル４０が、この行動価値が高くなるような操作の内容ａ_ｔを出力するように、強化学習を行う。この行動価値（評価値）は、第１の時刻における走行状態ｓ_ｔと、これに対する操作の内容ａ_ｔを引数とした関数Ｑ（ｓ_ｔ、ａ_ｔ）として、次の式で表わされる。 As already explained, the reinforcement learning unit 33 calculates an action value indicating how appropriate the operation content a_t was, and performs reinforcement learning so that the first learning model 40 outputs operation contents a_t that increase this action value. This action value (evaluation value) is expressed by Equation 2 as a function Q(s_t, a_t) whose arguments are the running state s_t at the first time and the operation content a_t for it.

上式において、γは割引率であり、αは学習率である。
行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）は、第１の時刻における走行状態ｓ_ｔにおいて操作の内容ａ_ｔを実行した際に、以降の時刻において最終的に得られると考えられる収益、すなわち時間割引報酬の和の期待値を表す。ｍａｘＱ（ｓ_ｔ＋１、ａ）は、第２の時刻においてとり得る操作の内容ａに対する行動価値関数Ｑの最大値であり、これに割引率γを乗算して報酬ｒ_ｔ＋１を加算した値は、第１の時刻において操作の内容ａ_ｔを実行し、報酬ｒ_ｔ＋１を受け取った後の、すなわち第２の時刻における行動価値である。この、第２の時刻における行動価値と、第１の時刻における行動価値Ｑ（ｓ_ｔ、ａ_ｔ）の差分であるＴＤ（ＴｅｍｐｏｒａｌＤｉｆｆｅｒｅｎｃｅ）誤差に対し、学習率αを乗算して、元々の行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）に加算することにより、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）を更新する。
すなわち、上記の数式２は、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）の更新式であり、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）は随時、更新される。 In Equation 2, γ is the discount rate and α is the learning rate.
The action-value function Q(s_t, a_t) represents the return that is expected to be obtained eventually at subsequent times when the operation content a_t is executed in the running state s_t at the first time, that is, the expected value of the sum of time-discounted rewards. max Q(s_t+1, a) is the maximum value of the action-value function Q over the operation contents a that can be taken at the second time. The value obtained by multiplying this by the discount rate γ and adding the reward r_t+1 is the action value after executing the operation content a_t at the first time and receiving the reward r_t+1, that is, the action value at the second time. The TD (Temporal Difference) error, which is the difference between this action value at the second time and the action value Q(s_t, a_t) at the first time, is multiplied by the learning rate α and added to the original action-value function Q(s_t, a_t), thereby updating Q(s_t, a_t).
That is, Equation 2 above is an update formula for the action-value function Q(s_t, a_t), and the action-value function Q(s_t, a_t) is updated as needed.
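Equation 2 itself is not reproduced in this text. The verbal description above (the TD error, multiplied by the learning rate α and added to the original value) matches the standard temporal-difference update of an action-value function, so it can be assumed to take the following form. This is a reconstruction from the surrounding definitions, not the published formula:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)
```

The term in parentheses is the TD error: the action value at the second time, r_{t+1} + γ max_a Q(s_{t+1}, a), minus the action value Q(s_t, a_t) at the first time.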

既に説明したように、強化学習部３３は、上記の数式２により、報酬ｒ_ｔ＋１が大きいほど行動価値Ｑ（ｓ_ｔ、ａ_ｔ）を高くするように計算する。この行動価値Ｑ（ｓ_ｔ、ａ_ｔ）が高くなるような操作の内容ａを第１学習モデル４０が出力するように、第１学習モデル４０の強化学習は実行される。ここで、上記のように数式２は行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）の更新式であるため、第１学習モデル４０が学習されて走行状態ｓ_ｔと操作の内容ａ_ｔの出力が変化すると、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）自体も更新される。
このように、強化学習部３３は、第１学習モデル４０の学習と、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）の更新を、並行して、例えば交互に繰り返すことにより、実行する。 As already explained, the reinforcement learning unit 33 uses Equation 2 above to calculate the action value Q(s_t, a_t) so that it becomes higher as the reward r_t+1 becomes larger. Reinforcement learning of the first learning model 40 is performed so that the first learning model 40 outputs operation contents a that increase this action value Q(s_t, a_t). Because Equation 2 is an update formula for the action-value function Q(s_t, a_t), as described above, when the first learning model 40 is trained and the operation content a_t it outputs for the running state s_t changes, the action-value function Q(s_t, a_t) itself is also updated.
In this way, the reinforcement learning unit 33 carries out the training of the first learning model 40 and the updating of the action-value function Q(s_t, a_t) in parallel, for example by repeating them alternately.
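The update of the action-value function described in the text (TD error times the learning rate, added to the current value) can be illustrated with a small table-based sketch. The embodiment approximates Q with a neural network rather than a table, and handles continuous operation contents; the dictionary, the finite list of candidate actions, and the name `td_update` are simplifications assumed here for illustration only.

```python
def td_update(q, s_t, a_t, r_next, s_next, actions, alpha=0.1, gamma=0.99):
    """Apply one temporal-difference update to the table q in place.

    q maps (state, action) pairs to action values; missing entries count as 0.
    Returns the TD error used for the update.
    """
    current = q.get((s_t, a_t), 0.0)
    # Maximum action value over the operation contents available at the second time.
    best_next = max(q.get((s_next, a), 0.0) for a in actions)
    # Action value at the second time minus action value at the first time.
    td_error = r_next + gamma * best_next - current
    q[(s_t, a_t)] = current + alpha * td_error
    return td_error


# Toy example with two states and two candidate operations.
q_table = {}
actions = ["accelerate", "brake"]
first_error = td_update(q_table, "s0", "accelerate", 1.0, "s1", actions,
                        alpha=0.5, gamma=0.9)
```

After this single update, the value stored for ("s0", "accelerate") has moved halfway from 0 toward the observed reward, because the learning rate is 0.5 and the table held no value yet for the next state.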

第１学習モデル４０は、上記のように、行動価値の高い操作の内容ａを出力することを目的としている。すなわち、行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）の値ができるだけ大きい操作の内容ａを出力するように、第１学習モデル４０の学習は実行される。
本実施形態においては、μ（ｓ_ｔ）を、第１学習モデル４０に走行状態ｓ_ｔを入力としたときの出力関数（すなわち操作の内容ａ_ｔ）としたときに、「－Ｑ（ｓ_ｔ、μ（ｓ_ｔ））」の値を損失関数とし、これをできるだけ小さくする操作の内容ａ_ｔを出力するように、第１学習モデル４０を学習させる。すなわち、誤差逆伝搬法、確率的勾配降下法等により、この損失関数が減る方向に重みやバイアスの値等の、ニューラルネットワークを構成する各パラメータの値を調整することによって、強化学習部３３は第１学習モデル４０を学習させる。 The purpose of the first learning model 40 is to output the operation content a with high action value, as described above. That is, learning of the first learning model 40 is executed so as to output the operation content a with the largest possible value of the action-value function Q(s _t , a _t ).
In the present embodiment, letting μ(s_t) be the output function of the first learning model 40 when the running state s_t is input (that is, the operation content a_t), the value "-Q(s_t, μ(s_t))" is used as the loss function, and the first learning model 40 is trained to output operation contents a_t that make this value as small as possible. Specifically, the reinforcement learning unit 33 trains the first learning model 40 by adjusting the values of the parameters that make up the neural network, such as weights and biases, in the direction that decreases this loss function, using error backpropagation, stochastic gradient descent, or the like.
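The idea of minimizing the loss -Q(s_t, μ(s_t)) by gradient descent can be shown with a deliberately tiny example. Here both networks are replaced by one-parameter stand-ins so that the gradient can be written by hand: the toy critic, the linear actor, the constant `k`, and all function names are assumptions for illustration and do not come from the embodiment.

```python
def critic(s, a, k=0.5):
    """Toy stand-in for Q(s, a): largest (zero) when a equals k * s."""
    return -(a - k * s) ** 2


def actor(theta, s):
    """Toy stand-in for mu(s): a single adjustable parameter theta."""
    return theta * s


def actor_loss(theta, s, k=0.5):
    """Loss -Q(s, mu(s)) for one running state s."""
    return -critic(s, actor(theta, s), k)


def actor_loss_grad(theta, s, k=0.5):
    """Derivative of the loss with respect to theta.

    -Q(s, theta*s) = (theta*s - k*s)**2, so d/dtheta = 2*(theta - k)*s**2.
    """
    return 2.0 * (theta - k) * s * s


def train_actor(states, theta=0.0, lr=0.1, epochs=200):
    """Stochastic gradient descent: step against the gradient per state."""
    for _ in range(epochs):
        for s in states:
            theta -= lr * actor_loss_grad(theta, s)
    return theta


theta_trained = train_actor([1.0, 0.5])
```

In this toy setting the parameter moves toward `k`, the value at which the critic rates the actor's output most highly. With real networks the gradient would be obtained by backpropagation through the second learning model into the first rather than by a hand-written formula.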

ここで、既に説明したように、本実施形態においては強化学習としてＤＤＰＧを用いている。すなわち、強化学習部３３は、ニューラルネットワークにより実現された第２学習モデル５０を備えており、数式２における行動価値関数Ｑ（ｓ_ｔ、ａ_ｔ）を、関数近似器としての第２学習モデル５０により計算している。 Here, as already explained, DDPG is used for reinforcement learning in the present embodiment. That is, the reinforcement learning unit 33 includes a second learning model 50 implemented by a neural network, and calculates the action-value function Q(s_t, a_t) in Equation 2 using the second learning model 50 as a function approximator.

図４は、第２学習モデル５０のブロック図である。
本実施形態においては、第２学習モデル５０は、第１学習モデル４０と同様に、中間層を３層とした全５層の全結合型のニューラルネットワークにより実現されている。第２学習モデル５０は、入力層５１、中間層５２、及び出力層５３を備えている。
図４においては、各層が矩形として描かれており、各層に含まれるノードは省略されている。 FIG. 4 is a block diagram of the second learning model 50.
In the present embodiment, like the first learning model 40, the second learning model 50 is implemented as a fully connected neural network with five layers in total, three of which are intermediate layers. The second learning model 50 has an input layer 51, an intermediate layer 52, and an output layer 53.
In FIG. 4, each layer is drawn as a rectangle, and the nodes included in each layer are omitted.

The input layer 51 includes a plurality of input nodes. The input nodes are provided so as to correspond to each element of the running state s, for example from the accelerator pedal detection amount s1 and the brake pedal detection amount s2 through to the command vehicle speed sN, and to each element of the operation content a, for example the accelerator pedal operation a1 and the brake pedal operation a2. In this way, input nodes are provided so as to correspond to the arguments of the action-value function Q(s_t, a_t) in Equation 2 above.
As in the first learning model 40, each running state s is represented by a plurality of values. For example, the input corresponding to the accelerator pedal detection amount s1, shown as a single rectangle in FIG. 4, is in fact provided with input nodes corresponding to each of the plurality of values of the accelerator pedal detection amount s1.
Each operation content a is likewise represented by a plurality of values, as in the first learning model 40. For example, the accelerator pedal operation a1, shown as a single rectangle in FIG. 4, is in fact provided with input nodes corresponding to each of its plurality of values.
Each input node stores the values of the running state s_t and the operation content a_t at the first time, received from the learning data storage unit 34.

The intermediate layer 52 includes a first intermediate layer 52a, a second intermediate layer 52b, and a third intermediate layer 52c.
At each node of the intermediate layer 52, a computation is performed based on the values stored in the nodes of the preceding layer (for example, the input layer 51 for the first intermediate layer 52a, and the first intermediate layer 52a for the second intermediate layer 52b) and the weights from those nodes to the node in question, and the result is stored in that node.
In this embodiment, the activation function used in this computation is, for example, ReLU (Rectified Linear Unit).

In the output layer 53 as well, a computation similar to that in each intermediate layer 52 is performed, and the result is stored in the output node of the output layer 53. In this embodiment there is, for example, one output node, and its value corresponds to the computed value of the action-value function Q(s_t, a_t).
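The structure of the second learning model 50 described above can be sketched as a small fully connected network in plain Python. This is only an illustration under stated assumptions: the function names, the hidden-layer width of 16, the random initialization, and the linear (non-ReLU) output node are all hypothetical choices that the description does not specify.

```python
import random

def relu(x):
    """Rectified Linear Unit, the activation named for the intermediate layers."""
    return x if x > 0.0 else 0.0

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: out[j] = act(sum_i weights[j][i] * inputs[i] + biases[j])."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_critic(state_dim, action_dim, hidden=(16, 16, 16), seed=0):
    """Input layer, three intermediate layers, one output node: five layers in total."""
    rng = random.Random(seed)
    sizes = [state_dim + action_dim, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def critic_q(params, state, action):
    """Compute Q(s_t, a_t) from the running state values and operation values."""
    x = list(state) + list(action)          # input nodes hold both s_t and a_t
    for weights, biases in params[:-1]:     # three intermediate layers with ReLU
        x = dense(x, weights, biases, relu)
    weights, biases = params[-1]            # single output node (assumed linear)
    return dense(x, weights, biases)[0]
```

For instance, `critic_q(make_critic(5, 2), [0.1] * 5, [0.2, 0.0])` returns a single float standing in for the evaluation value of that state and operation pair.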

The second learning model 50 is likewise trained so that, given the running state s and the operation content a as input, it can compute an appropriate action-value function Q. In this training, the values of the parameters that make up the neural network, such as weights and biases, are adjusted.
The second learning model 50 is trained to make the following expression, used as the loss function, as small as possible.

The above expression corresponds to the TD error described for the first learning model 40. The TD error is the difference between the action value at the second time, that is, the reward r_{t+1} plus the discount rate γ multiplied by the action-value function Q for the operation content μ(s_{t+1}) to be executed at the second time, and the action value Q(s_t, a_t) at the first time. Minimizing (the square of) the TD error therefore trains the first learning model 40 so that an appropriate value is output as the action value Q(s_t, a_t).
As with the first learning model 40, the second learning model 50 is trained by adjusting the values of the parameters that make up the neural network, such as weights and biases, in the direction that reduces the loss function shown as Equation 3, using error backpropagation, stochastic gradient descent, or the like.
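Read literally, the description of the TD error gives a loss of the form (r_{t+1} + γ·Q(s_{t+1}, μ(s_{t+1})) − Q(s_t, a_t))². The sketch below computes that quantity; it follows the verbal description only, so any averaging or scaling used in the actual Equation 3 is an assumption, as are the function names and the default discount rate.

```python
def td_target(reward_next, q_next, gamma):
    """Action value at the second time: r_{t+1} + gamma * Q(s_{t+1}, mu(s_{t+1}))."""
    return reward_next + gamma * q_next

def td_loss(q_current, reward_next, q_next, gamma=0.99):
    """Squared TD error for one stored transition."""
    td_error = td_target(reward_next, q_next, gamma) - q_current
    return td_error ** 2

def mean_td_loss(batch, gamma=0.99):
    """Average squared TD error over (q_current, reward_next, q_next) tuples.

    Averaging over a batch is an assumed convention, not stated in the text.
    """
    losses = [td_loss(q, r, qn, gamma) for q, r, qn in batch]
    return sum(losses) / len(losses)
```

Here `q_current` stands for Q(s_t, a_t) and `q_next` for Q(s_{t+1}, μ(s_{t+1})), both of which would come from the second learning model 50.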

Thus, in this embodiment, the first learning model 40 is trained by reinforcement learning on the basis of a reward r_{t+1}. This reward is calculated from the running state s_{t+1} at a second time, later than the first time and after the drive robot 4 has been operated according to the operation content a_t, such that it takes a larger value the better the operation content a_t is in fuel efficiency, exhaust gas performance, or both.
The first learning model 40 is also trained, on the basis of the evaluation value Q(s_t, a_t) computed by the second learning model 50, which evaluates the operation content a_t using the reward r_{t+1}, to infer operation content with a higher evaluation value Q(s_t, a_t).
Furthermore, the second learning model 50 takes as input the operation content a_t output by the first learning model 40 and is trained to output a higher evaluation value Q(s_t, a_t) the larger the reward r_{t+1} is. By repeating the training of the first learning model 40 and the second learning model 50, both models are trained by reinforcement learning.

As described above, when learning the operation content, the control device 10 uses the partially trained first learning model 40 to infer the operation content a_t to be executed in the current running state s_t (at the first time). At the time after this operation content a_t has been executed (the second time), the control device 10 obtains the reward r_{t+1} based on the running state s_{t+1} that resulted from executing a_t. In this way, the control device 10 first accumulates travel data.
Based on the accumulated travel data, the second learning model 50 is trained, taking as input the running state s_t at the first time and the operation content a_t inferred by the partially trained first learning model 40, so that it can, on the basis of the reward r_{t+1}, appropriately evaluate the operation content a_t output by the current first learning model 40.
Using the second learning model 50 after this training, the first learning model 40 is trained to output operation content a_t for which the evaluation value Q(s_t, a_t) output by the second learning model 50 is large.
Because this changes the operation content a_t output by the first learning model 40, travel data is accumulated again.
By repeating the accumulation of travel data and the training of the first learning model 40 and the second learning model 50 in this way, the training of both models is completed.

In this embodiment, when training the first learning model 40 and the second learning model 50, the control device 10 ends the training when, for example, the difference in each model's loss function before and after training falls to or below a certain value, judging that further repetition is unlikely to yield a sufficient benefit.
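The stopping rule can be illustrated with a short check. This is a sketch under assumptions: the threshold value, the use of an absolute difference, and the requirement that both models satisfy the condition at the same time are illustrative readings, and the function names are hypothetical.

```python
def learning_converged(loss_before, loss_after, threshold=1e-4):
    """True when the loss changed by no more than the threshold across one update."""
    return abs(loss_before - loss_after) <= threshold

def both_converged(first_model_losses, second_model_losses, threshold=1e-4):
    """Apply the check to each model's (loss_before, loss_after) pair.

    Requiring both models to converge together is an assumption.
    """
    return (learning_converged(*first_model_losses, threshold)
            and learning_converged(*second_model_losses, threshold))
```

A relative rather than absolute threshold would be an equally plausible reading of "a certain value"; the text does not say which is intended.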

Next, the behavior of each component of the control device 10 is described for the case where the operation content is inferred during actual performance measurement of the vehicle 2, that is, after the reinforcement learning of the first learning model 40 has finished.

The running state acquisition unit 22 acquires the running state of the vehicle 2 at the current time.
The running state acquisition unit 22 acquires the accelerator pedal detection amount, the brake pedal detection amount, the engine speed detection amount, and the detected vehicle speed from various measuring instruments (not shown) provided in the vehicle 2, from the operation history recorded in the drive robot 4, and the like.
The running state acquisition unit 22 also acquires the command vehicle speed from the command vehicle speed storage unit 21.
The running state acquisition unit 22 transmits the acquired running state to the learning unit 30.

When the operation content inference unit 31 of the learning unit 30 acquires the running state from the running state acquisition unit 22 at a certain time (the first time), it uses the trained first learning model 40 to infer, from that running state, the operation content a of the vehicle 2 after the first time.
This first learning model 40 is a trained model that has undergone reinforcement learning on the basis of a reward r. The reward is calculated from the running state s at a second time, later than the first time and after the drive robot 4 has been operated according to the operation content a, such that it takes a larger value the better the operation content a is in fuel efficiency, exhaust gas performance, or both.

The operation content inference unit 31 infers the operation content a of the vehicle 2 by executing, for example as a program on a CPU, the trained first learning model 40, whose neural network parameters such as weights and biases have been adjusted and fixed through prior reinforcement learning by the learning unit 30.
More specifically, when the operation content inference unit 31 inputs each element of the received running state s to the corresponding input node of the input layer 41 of the trained first learning model 40, the first learning model 40 computes weighted sums using the weights, biases, and so on, proceeding in order from the input layer 41 through the intermediate layer 42 to the output layer 43. Finally, the operation content a to be executed from the first time onward is stored in the output nodes of the output layer 43.
The operation content inference unit 31 transmits the inferred operation content a to the vehicle operation control unit 23.

The vehicle operation control unit 23 receives the operation content a from the operation content inference unit 31 and, based on it, operates the drive robot 4 for the duration of the step period T_step.

Next, a method of controlling the drive robot 4 with the control device 10 described above is explained with reference to FIGS. 1 to 4 and FIGS. 5 to 7. FIG. 5 is a flowchart of the learning phase of the control method of the drive robot 4. FIG. 6 is a detailed flowchart of the travel data collection step during learning. FIG. 7 is a flowchart of running control of the vehicle 2 for performance measurement.
This control method controls the drive robot 4, which is mounted on the vehicle 2 and drives the vehicle 2, so that the vehicle 2 runs in accordance with a prescribed command vehicle speed. In the method, the running state s of the vehicle 2 is acquired, the running state s including the vehicle speed detected in the vehicle 2 and the command vehicle speed at the time the running state s is acquired. A first learning model 40 infers, as of a first time, the operation content a of the vehicle 2 after the first time. A reward r is calculated from the running state s at a second time, later than the first time and after the drive robot 4 has been operated according to the operation content a, such that it takes a larger value the better the operation content a is in fuel efficiency, exhaust gas performance, or both, and the first learning model 40 is trained by reinforcement learning on the basis of this reward r. The trained first learning model 40 infers the operation content a of the vehicle from the running state s at the first time, and the drive robot 4 is controlled based on the operation content a.
First, the operation during learning of the operation content is described with reference to FIGS. 5 and 6.

When learning starts (step S1), the parameters of the running environment, the first learning model 40, the second learning model 50, and so on are initialized (step S3).
Travel data of the vehicle 2 is then collected according to the procedure shown in FIG. 6 (step S5).

As already explained, travel data is accumulated by controlling the running of the vehicle 2 episode by episode, an episode being the unit of time over which one series of data collection is performed.
When an episode starts (step S21), the initial state of the vehicle 2 at that moment is observed (step S23). Because an episode may start while the vehicle 2 is already under running control, the initial state of the vehicle 2 at the start of an episode may be a state in which the vehicle 2 is running as well as one in which it is stopped.

The initial state is observed as follows.
The running state acquisition unit 22 acquires the running state s of the vehicle 2 at the current time.
The running state acquisition unit 22 acquires the accelerator pedal detection amount, the brake pedal detection amount, the engine speed detection amount, and the detected vehicle speed from various measuring instruments (not shown) provided in the vehicle 2, from the operation history recorded in the drive robot 4, and the like.
The running state acquisition unit 22 also acquires the command vehicle speed from the command vehicle speed storage unit 21.
The running state acquisition unit 22 transmits the acquired running state s to the learning unit 30.

The operation content inference unit 31 receives the running state s from the running state acquisition unit 22. Taking the time at which it received the running state s as the first time, the operation content inference unit 31 uses the first learning model 40, which is still being trained, to infer from the received running state s the operation content of the vehicle 2 after the first time (step S25).
More specifically, the operation content inference unit 31 inputs the running state s to the input nodes of the input layer 41 of the first learning model 40 that correspond to each element of the running state s.
At each node of the intermediate layer 42, a computation is performed based on the values stored in the nodes of the preceding layer (for example, the input layer 41 for the first intermediate layer 42a, and the first intermediate layer 42a for the second intermediate layer 42b) and the weights from those nodes to the node in question, and the result is stored in that node.
In the output layer 43 as well, a computation similar to that in each intermediate layer 42 is performed, and the result, that is, the operation content a, is stored in the output nodes of the output layer 43.

The operation content inference unit 31 transmits the operation content a inferred by the first learning model 40 in its current, partially trained state to the vehicle operation control unit 23 of the drive robot control unit 20.
Based on this operation content a, the vehicle operation control unit 23 operates the drive robot 4 for the duration of the step period T_step.
The running state acquisition unit 22 then acquires the running state s of the vehicle 2 after the operation again, in the same manner as in step S23.
The running state acquisition unit 22 transmits the running state s of the vehicle 2 after the operation to the learning unit 30.

The operation content inference unit 31 receives the running state s from the running state acquisition unit 22. Taking the time at which it received this running state as the second time, later than the first time, the operation content inference unit 31 transmits to the reward calculation unit 32 the running state s_t at the first time, the operation content a_t that was inferred for it and actually executed, and the running state s_{t+1} at the second time.
The reward calculation unit 32 calculates the reward r_{t+1}, a value required for reinforcement learning, and transmits it to the operation content inference unit 31.
The operation content inference unit 31 receives the reward r_{t+1} (step S27).
The operation content inference unit 31 transmits the combination of the running state s_t at the first time, the operation content a_t, the running state s_{t+1} at the second time, and the received reward r_{t+1} to the learning data storage unit 34, where it is stored (step S29).

The learning unit 30 determines whether the episode has ended (step S31). If it determines that the episode has ended (Yes in step S31), it ends the episode (step S33) and proceeds to step S7 shown in FIG. 5.
If it determines that the episode has not ended (No in step S31), the second time is taken as the new first time, the running state s_{t+1} at the second time becomes the running state s_t at the first time, and the process returns to step S25 to infer the operation content a at this new first time. By repeating, at each time, the inference of the operation content a, the acquisition of the state after the inferred operation content a has been executed, and the calculation of the reward based on it, the control device 10 accumulates travel data in the learning data storage unit 34.
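The data collection procedure of steps S21 to S33 can be sketched as a loop that stores one (s_t, a_t, s_{t+1}, r_{t+1}) tuple per step. Everything concrete here is a hypothetical stand-in: `ToyVehicle` replaces the real vehicle 2 and drive robot 4, the two-element state and operation tuples are simplified, and the reward passed in by the caller is a placeholder, since the actual reward is computed from fuel efficiency and exhaust gas performance.

```python
from collections import deque

class ToyVehicle:
    """Hypothetical stand-in for the vehicle 2 driven by the drive robot 4."""

    def __init__(self, command_speed=10.0, episode_length=5):
        self.speed = 0.0
        self.command_speed = command_speed
        self.episode_length = episode_length
        self.steps = 0

    def observe(self):
        """Return a simplified running state: (detected speed, command speed)."""
        return (self.speed, self.command_speed)

    def apply(self, operation):
        """Apply (accelerator, brake) for one step period T_step."""
        accel, brake = operation
        self.speed = max(0.0, self.speed + accel - brake)
        self.steps += 1

    def episode_done(self):
        return self.steps >= self.episode_length

def collect_episode(env, policy, reward_fn, buffer, max_steps=10000):
    """Run one episode and append transitions to the learning data buffer."""
    state = env.observe()                                   # step S23
    for _ in range(max_steps):
        operation = policy(state)                           # step S25
        env.apply(operation)                                # operate for T_step
        next_state = env.observe()                          # state at the second time
        reward = reward_fn(state, operation, next_state)    # step S27
        buffer.append((state, operation, next_state, reward))  # step S29
        if env.episode_done():                              # step S31
            break                                           # step S33
        state = next_state          # the second time becomes the new first time
    return buffer
```

A bounded `deque(maxlen=...)` is one simple way to model the learning data storage unit 34; whether old data is discarded in the actual device is not stated.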

Once sufficient travel data has been accumulated in the learning data storage unit 34, it is used to train the first learning model 40 and the second learning model 50 by reinforcement learning and to update the learning models 40 and 50 (step S7).
First, based on the accumulated travel data, the second learning model 50 is trained, taking as input the running state s_t at the first time and the operation content a_t inferred by the partially trained first learning model 40, so that it can appropriately evaluate the operation content a_t output by the current first learning model 40.

The reinforcement learning unit 33 inputs the running state s_t and the operation content a_t at the first time to the input nodes of the input layer 51 of the second learning model 50 that correspond to each element of the running state s and the operation content a.
At each node of the intermediate layer 52, a computation is performed based on the values stored in the nodes of the preceding layer (for example, the input layer 51 for the first intermediate layer 52a, and the first intermediate layer 52a for the second intermediate layer 52b) and the weights from those nodes to the node in question, and the result is stored in that node.
In the output layer 53 as well, a computation similar to that in each intermediate layer 52 is performed, and the result, that is, the value of the action-value function Q(s_t, a_t), is stored in the output node of the output layer 53.

The reinforcement learning unit 33 trains the second learning model 50 to make Equation 3, described above and used as the loss function, as small as possible. That is, the second learning model 50 is trained by adjusting the values of the parameters that make up the neural network, such as weights and biases, in the direction that reduces the loss function shown as Equation 3, using error backpropagation, stochastic gradient descent, or the like.

When the update of the second learning model 50 using the data accumulated in the learning data storage unit 34 at this point is finished, the first learning model 40 is trained.
The reinforcement learning unit 33 takes the value of −Q(s_t, μ(s_t)) as the loss function and trains the first learning model 40 to output operation content a_t that makes this value as small as possible. That is, the first learning model 40 is trained by adjusting the values of the parameters that make up the neural network, such as weights and biases, in the direction that reduces this loss function, using error backpropagation, stochastic gradient descent, or the like.
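To make the loss −Q(s_t, μ(s_t)) concrete, the toy example below trains a one-parameter policy against a fixed, hand-written evaluation function. It is a deliberately simplified illustration, not the embodiment: the quadratic `toy_critic`, the linear policy `weight * state`, the scalar states, and the learning rate are all hypothetical, and the gradient is written out analytically instead of using backpropagation through a neural network.

```python
def toy_critic(state, operation):
    """Hypothetical fixed Q: largest (zero) when operation == 2 * state."""
    return -(operation - 2.0 * state) ** 2

def policy(weight, state):
    """Hypothetical one-parameter policy mu(s) = weight * s."""
    return weight * state

def policy_loss(weight, states):
    """Mean of -Q(s, mu(s)) over the given states."""
    return sum(-toy_critic(s, policy(weight, s)) for s in states) / len(states)

def policy_loss_grad(weight, states):
    """Analytic gradient: d/dw (w*s - 2*s)^2 = 2 * (w*s - 2*s) * s."""
    return sum(2.0 * (weight * s - 2.0 * s) * s for s in states) / len(states)

def train_policy(weight, states, learning_rate=0.1, steps=200):
    """Plain gradient descent on -Q(s, mu(s)) with the critic held fixed."""
    for _ in range(steps):
        weight -= learning_rate * policy_loss_grad(weight, states)
    return weight
```

Starting from `weight = 0.0` with states such as `[0.5, 1.0, 1.5]`, the weight moves toward 2.0, the value at which the toy evaluation function is maximized. In the embodiment the same principle applies with the second learning model 50 in place of `toy_critic`.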

When the updates of the first learning model 40 and the second learning model 50 are finished, it is determined whether the training of these models has been completed (step S9).
If it is determined that training has not been completed (No in step S9), the process returns to step S5. That is, the control device 10 collects further travel data and repeats the update of the first learning model 40 and the second learning model 50 using it.
If it is determined that training has been completed (Yes in step S9), the learning process ends (step S11).
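The overall learning flow of FIG. 5 (steps S5 to S11) amounts to alternating data collection with model updates until a completion check passes. The sketch below expresses only that control flow; the four callables are hypothetical placeholders supplied by the caller, and the `max_iterations` guard is an added safety assumption that does not appear in the description.

```python
def run_learning(collect_data, update_second_model, update_first_model,
                 learning_finished, max_iterations=1000):
    """Alternate travel data collection and model updates; return the iteration count."""
    iterations = 0
    while iterations < max_iterations:
        data = collect_data()            # step S5: collect travel data
        update_second_model(data)        # step S7: train the second learning model first
        update_first_model(data)         # step S7: then train the first learning model
        iterations += 1
        if learning_finished():          # step S9: completion check
            break                        # step S11: end of learning
    return iterations
```

Updating the second learning model before the first mirrors the order given in the text, where the evaluation model is refreshed on the stored data before it is used to improve the operation-inferring model.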

Next, with reference to FIG. 7, the operation of controlling the running of the vehicle 2 is described for the case where the operation content is inferred during actual performance measurement of the vehicle 2, that is, after the reinforcement learning of the first learning model 40 has finished.

When the vehicle 2 starts running (step S51), the running environment is initialized and the running state s at this point is observed as the initial state (step S53).
The running state s is observed as follows.
The running state acquisition unit 22 acquires the running state s of the vehicle 2 at the current time.
The running state acquisition unit 22 acquires the accelerator pedal detection amount, the brake pedal detection amount, the engine speed detection amount, and the detected vehicle speed from various measuring instruments (not shown) provided in the vehicle 2, from the operation history recorded in the drive robot 4, and the like.
The running state acquisition unit 22 also acquires the command vehicle speed from the command vehicle speed storage unit 21.
The running state acquisition unit 22 transmits the acquired running state s to the learning unit 30.

The operation content inference unit 31 receives the running state s from the running state acquisition unit 22. Taking the time at which it received the running state s as the first time, the operation content inference unit 31 uses the trained first learning model 40 to infer from the received running state s the operation content of the vehicle 2 after the first time (step S55).
More specifically, when the operation content inference unit 31 inputs each element of the received running state s to the corresponding input node of the input layer 41 of the trained first learning model 40, weighted sums are computed using the weights, biases, and so on, proceeding in order from the input layer 41 through the intermediate layer 42 to the output layer 43. Finally, the operation content a to be executed from the first time onward is stored in the output nodes of the output layer 43.
The operation content inference unit 31 transmits the inferred operation content a to the vehicle operation control unit 23.

The operation content inference unit 31 transmits the operation content a inferred by the trained first learning model 40 to the vehicle operation control unit 23 of the drive robot control unit 20.
Based on this operation content a, the vehicle operation control unit 23 operates the drive robot 4 for the duration of the step period T_step.
The running state acquisition unit 22 then acquires the running state s of the vehicle 2 after the operation again, in the same manner as in step S53 (step S57).
The running state acquisition unit 22 transmits the running state s of the vehicle 2 after the operation to the learning unit 30.

The control device 10 determines whether the running of the vehicle 2 has finished (step S59).
If it is determined that running has not finished (No in step S59), the process returns to step S55. That is, the control device 10 repeats the inference of the operation content a based on the running state s acquired in step S57 and the observation of the subsequent running state s.
If it is determined that running has finished (Yes in step S59), the running process ends (step S61).
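The measurement-time flow of FIG. 7 (steps S53 to S61) reduces to an observe, infer, operate loop with no reward calculation or model update. The sketch below shows that loop with caller-supplied callables; all names are hypothetical, the returned log is an illustrative addition, and `max_steps` is an assumed safety bound.

```python
def run_measurement(observe, infer_operation, operate, run_finished,
                    max_steps=100000):
    """Drive with the trained model until the run ends; return (operation, state) pairs."""
    state = observe()                          # step S53: initial running state
    log = []
    for _ in range(max_steps):
        operation = infer_operation(state)     # step S55: trained first learning model
        operate(operation)                     # drive robot acts for one step period T_step
        state = observe()                      # step S57: running state after the operation
        log.append((operation, state))
        if run_finished():                     # step S59
            break                              # step S61
    return log
```

In contrast to the learning-time loop, nothing is written to the learning data storage and the model parameters stay fixed throughout the run.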

Next, the effects of the control device and control method for the drive robot described above are explained.

本実施形態におけるドライブロボット（自動操縦ロボット）の制御装置１０は、車両２に搭載されて車両２を走行させるドライブロボット４を、車両２が規定された指令車速に従って走行するように制御する、ドライブロボット４の制御装置１０であって、車両２の走行状態ｓを取得する走行状態取得部２２と、第１の時刻における走行状態ｓ_ｔを基に、第１学習モデル４０により、第１の時刻より後の車両２の操作の内容ａ_ｔを推論する操作内容推論部３１と、操作の内容ａ_ｔに基づきドライブロボット４を制御する車両操作制御部２３と、を備え、走行状態ｓは、車両２において検出された車速と、走行状態ｓが取得された時刻における指令車速を含み、第１学習モデル４０は、操作の内容ａ_ｔに基づいたドライブロボット４の操作の後の、第１の時刻より後の第２の時刻における走行状態ｓ_ｔ＋１に基づいて、燃費と排ガス性能のいずれか一方または双方がより高い操作の内容ａ_ｔであるほど大きな値となるように計算された報酬ｒ_ｔ＋１を基に、強化学習されている。
上記のような構成によれば、車両２の操作の内容ａ_ｔを推論する操作内容推論部３１において、第１学習モデル４０は、燃費と排ガス性能がより高い操作の内容ａ_ｔであるほど大きな値となるように計算された報酬ｒ_ｔ＋１を基に、強化学習されている。したがって、操作内容推論部３１は、燃費や排ガス性能が考慮された操作の内容ａ_ｔを推論することができるため、ドライブロボット４に、燃費や排ガス性能を考慮して車両２を操作させることができる。
また、第１学習モデル４０が操作の内容を推論するに際し基づく、車両２の走行状態ｓ_ｔは、走行状態ｓ_ｔが取得された時刻における指令車速を含むため、指令車速に高精度で追従するような操作の内容ａ_ｔを推論可能である。
したがって、指令車速に高い精度で追従させつつ、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置１０を提供可能である。 The control device 10 for a drive robot (autopilot robot) according to the present embodiment controls the drive robot 4, which is mounted on the vehicle 2 and drives the vehicle 2, so that the vehicle 2 travels according to a prescribed command vehicle speed. The control device 10 includes a running state acquisition unit 22 that acquires the running state s of the vehicle 2; an operation content inference unit 31 that, based on the running state s_t at a first time, uses the first learning model 40 to infer the operation content a_t of the vehicle 2 after the first time; and a vehicle operation control unit 23 that controls the drive robot 4 based on the operation content a_t. The running state s includes the vehicle speed detected in the vehicle 2 and the command vehicle speed at the time when the running state s was acquired. The first learning model 40 has been trained by reinforcement learning based on a reward r_t+1 that is calculated, from the running state s_t+1 at a second time after the first time and after the drive robot 4 has been operated based on the operation content a_t, so as to take a larger value the higher the fuel efficiency, the exhaust gas performance, or both achieved by the operation content a_t.
According to the configuration described above, in the operation content inference unit 31 that infers the operation content a_t of the vehicle 2, the first learning model 40 has been trained by reinforcement learning based on the reward r_t+1, which is calculated so as to take a larger value the higher the fuel efficiency and exhaust gas performance of the operation content a_t. The operation content inference unit 31 can therefore infer an operation content a_t that takes fuel efficiency and exhaust gas performance into account, so the drive robot 4 can be made to operate the vehicle 2 with fuel efficiency and exhaust gas performance in mind.
In addition, the running state s_t of the vehicle 2, on which the first learning model 40 bases its inference of the operation content, includes the commanded vehicle speed at the time when the running state s_t was acquired, so an operation content a_t that follows the commanded vehicle speed with high accuracy can be inferred.
Therefore, it is possible to provide the control device 10 for the drive robot 4 that allows the vehicle 2 to be operated in consideration of fuel consumption and exhaust gas performance while following the commanded vehicle speed with high accuracy.

強化学習以外の機械学習、例えば教師あり学習等において、燃費や排ガス性能を考慮してドライブロボット４が車両２を制御するような学習モデルを生成する際には、実際に車両２を、燃費や排ガス性能が良好となるように走行させて、燃費や排ガス性能が良好な走行データを取得し、これを教師データとして学習させることが考えられる。このように、例えば教師あり学習においては、学習する対象はあくまで与えられた教師データであり、燃費や排ガス性能は、この走行データの学習に付随して、間接的に改善される。すなわち、例えば教師あり学習においては、燃費や排ガス性能の向上を直接的な目標として学習することができない。このため、与えられた教師データ以上に燃費や排ガス性能が良好に改善されるような操作の内容が実際にはあったとしても、これを推論することが容易ではない。
これに対し、本実施形態においては、操作の内容ａに関して燃費や排ガス性能が良好か否かの程度を明確な値として有する報酬ｒを基に、燃費や排ガス性能が良好な操作の内容ａを推論するように、第１学習モデル４０が強化学習されている。すなわち、本実施形態においては、第１学習モデル４０は強化学習により学習されているため、燃費や排ガス性能を向上させることを明示的な目標として、第１学習モデル４０が操作の内容ａを推論することができる。このため、教師あり学習等の他の機械学習を適用した形態と比較しても、より良好な燃費や排ガス性能となるような操作の内容ａを推論し得る。 In machine learning other than reinforcement learning, such as supervised learning, one conceivable way to generate a learning model with which the drive robot 4 controls the vehicle 2 in consideration of fuel efficiency and exhaust gas performance is to actually drive the vehicle 2 so that fuel efficiency and exhaust gas performance are good, acquire the resulting travel data, and train on it as teacher data. In supervised learning, for example, what is learned is strictly the given teacher data, and fuel efficiency and exhaust gas performance improve only indirectly, as a by-product of learning that travel data. In other words, supervised learning cannot take the improvement of fuel efficiency or exhaust gas performance as a direct learning objective. Consequently, even if an operation content exists that would improve fuel efficiency and exhaust gas performance beyond the given teacher data, inferring it is not easy.
In the present embodiment, by contrast, the first learning model 40 is trained by reinforcement learning so as to infer an operation content a with good fuel efficiency and exhaust gas performance, based on the reward r, which expresses as an explicit value how good the fuel efficiency and exhaust gas performance are for the operation content a. That is, because the first learning model 40 is trained by reinforcement learning, it can infer the operation content a with the explicit goal of improving fuel efficiency and exhaust gas performance. It can therefore infer an operation content a that yields better fuel efficiency and exhaust gas performance than a configuration using other machine learning such as supervised learning.

また、操作の対象は、アクセルペダル２ｃとブレーキペダル２ｄを含み、走行状態ｓは、アクセルペダル２ｃとブレーキペダル２ｄの検出量を含む。
上記のような構成によれば、車両２の操作において、燃費や排ガス性能と密接に関連するアクセルペダル２ｃとブレーキペダル２ｄの検出量を走行状態ｓに含めているため、適切に報酬ｒを計算し、結果として、第１学習モデル４０によって適切に操作の内容ａを推論することができる。したがって、より効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 Further, the objects to be operated include the accelerator pedal 2c and the brake pedal 2d, and the running state s includes the detected amounts of the accelerator pedal 2c and the brake pedal 2d.
According to the configuration as described above, in the operation of the vehicle 2, since the detection amounts of the accelerator pedal 2c and the brake pedal 2d, which are closely related to the fuel efficiency and exhaust gas performance, are included in the driving state s, the reward r is appropriately calculated. As a result, the first learning model 40 can appropriately infer the content a of the operation. Therefore, it is possible to provide a control device for the drive robot 4 that can operate the vehicle 2 more effectively in consideration of fuel consumption and exhaust gas performance.

また、アクセルペダル２ｃ及びブレーキペダル２ｄの検出量の変化が小さいほど値が大きくなるように設定された第１要素ｒ_ＡＰ、ｒ_ＢＰが計算され、第１要素ｒ_ＡＰ、ｒ_ＢＰを基に報酬ｒ_ｔ＋１が計算されている。
上記のような構成によれば、燃費や排ガス性能が良好であると考えられる、アクセルペダル２ｃ及びブレーキペダル２ｄの検出量の変化が小さい場合に、第１要素ｒ_ＡＰ、ｒ_ＢＰの値が小さくなるように計算され、これを基に報酬ｒ_ｔ＋１が計算されるため、適切に報酬ｒ_ｔ＋１の値を設定することができる。したがって、より効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 Also, the first elements r _AP and r _BP are calculated so that the smaller the change in the detected amount of the accelerator pedal 2c and the brake pedal 2d, the larger the value, and the reward is calculated based on the first elements r _AP and r _BP . r _t+1 has been calculated.
According to the configuration described above, when the changes in the detected amounts of the accelerator pedal 2c and the brake pedal 2d are small, which is considered to indicate good fuel efficiency and exhaust gas performance, the first elements r_AP and r_BP are calculated to take large values, and the reward r_t+1 is calculated based on them, so the value of the reward r_t+1 can be set appropriately. Therefore, it is possible to provide a control device for the drive robot 4 that can operate the vehicle 2 more effectively in consideration of fuel consumption and exhaust gas performance.

また、時間軸と、アクセルペダル２ｃまたはブレーキペダル２ｄの検出量を軸とする座標系上で、検出量を関数として表現した際に、関数の一階微分または二階微分の値を基に、第１要素ｒ_ＡＰ、ｒ_ＢＰが計算されている。
上記のような構成によれば、アクセルペダル２ｃまたはブレーキペダル２ｄの検出量の変化量は、これら検出量を表現した関数の一階微分または二階微分の値に密接に関連する。すなわち、検出量を表現した関数の一階微分または二階微分の値を基に第１要素ｒ_ＡＰ、ｒ_ＢＰを計算することにより、適切に報酬ｒ_ｔ＋１の値を設定することができる。したがって、より効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 In addition, when the detected amount is expressed as a function on a coordinate system whose axes are time and the detected amount of the accelerator pedal 2c or the brake pedal 2d, the first elements r_AP and r_BP are calculated based on the value of the first or second derivative of that function.
According to the configuration described above, the amount of change in the detected amount of the accelerator pedal 2c or the brake pedal 2d is closely related to the value of the first-order differential or the second-order differential of the function expressing these detected amounts. That is, the value of the reward r _t+1 can be set appropriately by calculating the first elements r _AP and r _BP based on the value of the first-order differential or second-order differential of the function expressing the detected quantity. Therefore, it is possible to provide a control device for the drive robot 4 that can operate the vehicle 2 more effectively in consideration of fuel consumption and exhaust gas performance.

また、関数の一階微分または二階微分の最大値の絶対値が所定の閾値以下であれば、第１要素ｒ_ＡＰ、ｒ_ＢＰが正の値となるように、かつ、最大値の絶対値が所定の閾値よりも大きければ、第１要素ｒ_ＡＰ、ｒ_ＢＰが負の値となるように、第１要素ｒ_ＡＰ、ｒ_ＢＰが計算されている。
上記のような構成によれば、関数の一階微分または二階微分の最大値の絶対値が所定の閾値以下であれば、アクセルペダル２ｃまたはブレーキペダル２ｄの検出量の変化量が小さく燃費や排ガス性能が良好であると考えられる。この場合には、第１要素ｒ_ＡＰ、ｒ_ＢＰが正の値となるように計算される。また、関数の一階微分または二階微分の最大値の絶対値が所定の閾値以上であれば、アクセルペダル２ｃまたはブレーキペダル２ｄの検出量の変化量が大きく燃費や排ガス性能が良好ではないと考えられる。この場合には、第１要素ｒ_ＡＰ、ｒ_ＢＰが負の値となるように計算される。
このように、燃費や排ガス性能が良好である場合に値が大きくなるように第１要素ｒ_ＡＰ、ｒ_ＢＰが計算され、これを基に報酬ｒ_ｔ＋１が計算されるため、適切に報酬ｒ_ｔ＋１の値を設定することができる。したがって、より効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 Further, if the absolute value of the maximum value of the first-order derivative or the second-order derivative of the function is equal to or less than a predetermined threshold, the first elements r _AP and r _BP are positive values, and the absolute value of the maximum value is The first elements r _AP and r _BP are calculated such that the first elements r _AP and r _BP are negative values if they are larger than a predetermined threshold.
According to the above configuration, if the absolute value of the maximum value of the first or second derivative of the function is equal to or less than the predetermined threshold, the change in the detected amount of the accelerator pedal 2c or the brake pedal 2d is small, and fuel efficiency and exhaust gas performance are considered good. In this case, the first elements r_AP and r_BP are calculated to be positive values. If the absolute value of the maximum value of the first or second derivative of the function is greater than the predetermined threshold, the change in the detected amount of the accelerator pedal 2c or the brake pedal 2d is large, and fuel efficiency and exhaust gas performance are considered poor. In this case, the first elements r_AP and r_BP are calculated to be negative values.
In this way, the first elements r_AP and r_BP are calculated to take larger values when fuel efficiency and exhaust gas performance are good, and the reward r_t+1 is calculated based on them, so the value of the reward r_t+1 can be set appropriately. Therefore, it is possible to provide a control device for the drive robot 4 that can operate the vehicle 2 more effectively in consideration of fuel consumption and exhaust gas performance.
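
The threshold rule for the first elements can be sketched as below. This is an illustration under stated assumptions: the pedal trace is taken to be uniformly sampled at interval `dt`, the derivative is approximated by finite differences, "the absolute value of the maximum value" is read as the largest absolute derivative, and the magnitudes `r_pos` and `r_neg` are hypothetical, since the source only fixes their signs.

```python
import numpy as np


def first_element(pedal, dt, threshold, order=1, r_pos=1.0, r_neg=-1.0):
    """Positive value if the peak |derivative| is within the threshold, else negative.

    pedal: detected pedal amounts between the first and second time
           (needs more than `order` samples).
    order: 1 for the first derivative, 2 for the second derivative.
    """
    d = np.asarray(pedal, dtype=float)
    for _ in range(order):
        d = np.diff(d) / dt  # finite-difference approximation of the derivative
    peak = float(np.max(np.abs(d)))
    return r_pos if peak <= threshold else r_neg
```

The same function would be applied separately to the accelerator pedal trace and the brake pedal trace to obtain r_AP and r_BP.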

また、操作の内容ａに基づいたドライブロボット４の操作の後の、第２の時刻における検出車速と指令車速との差が小さいほど値が大きくなるように設定された第２要素ｒ_ｓが計算され、第２要素ｒ_ｓを基に報酬ｒ_ｔ＋１が計算されている。
上記のような構成によれば、検出車速と指令車速との差が小さいほど値が大きくなるように第２要素ｒ_ｓが計算されるため、指令車速への追従性が高いほど、第２要素ｒ_ｓが大きな値を有し得る。報酬ｒ_ｔ＋１は、このような第２要素ｒ_ｓを基に計算されているため、指令車速に高い精度で追従させるように車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 In addition, a second element r_s is calculated, which is set to take a larger value the smaller the difference between the detected vehicle speed and the commanded vehicle speed at the second time, after the drive robot 4 has been operated based on the operation content a, and the reward r_t+1 is calculated based on the second element r_s.
According to the above configuration, the second element r_s is calculated to take a larger value the smaller the difference between the detected vehicle speed and the commanded vehicle speed, so the better the commanded vehicle speed is followed, the larger the second element r_s can be. Since the reward r_t+1 is calculated based on such a second element r_s, it is possible to provide a control device for the drive robot 4 that can operate the vehicle 2 so as to follow the commanded vehicle speed with high accuracy.
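
One way to realize a second element that grows as the speed error shrinks, and to combine it with the first elements, is sketched below. The negative-absolute-error form, the `scale` parameter, and the weighted sum with `w_pedal` and `w_speed` are assumptions made for illustration; this passage of the source states only the monotonic relationship and that the reward is based on these elements.

```python
def second_element(v_detected, v_command, scale=1.0):
    """Larger (closer to zero) the smaller the speed error; always <= 0."""
    return -abs(v_detected - v_command) / scale


def reward(r_ap, r_bp, r_s, w_pedal=1.0, w_speed=1.0):
    """Hypothetical weighted sum of the pedal elements and the speed element."""
    return w_pedal * (r_ap + r_bp) + w_speed * r_s
```

The weights let the trade-off between smooth pedal operation and speed following be tuned; a larger `w_speed` favors staying within the permissible error range of the commanded vehicle speed.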

また、操作内容推論部３１は、第１の時刻以降の時間範囲内の、複数の時刻における操作の内容ａを推論する。
上記のような構成によれば、一度の推論で複数の操作の内容を推論するため、操作間隔を推論に要する時間よりも短くすることができる。このため、緻密な操作が可能となる。
また、推論により、実際には使用されないほど将来の操作の内容をも推論することができる。この場合においては、将来を見越した操作の内容ａを推論することができるため、操作の内容ａの精度が向上し、より効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 Further, the operation content inference unit 31 infers the operation content a at a plurality of times within the time range after the first time.
According to the configuration as described above, since the contents of a plurality of operations are inferred in one inference, the operation interval can be made shorter than the time required for the inference. For this reason, precise operation becomes possible.
The inference can also cover operation contents so far in the future that they are never actually used. In this case, the operation content a can be inferred with the future taken into account, which improves the accuracy of the operation content a and makes it possible to provide a control device for the drive robot 4 that can operate the vehicle 2 more effectively in consideration of fuel efficiency and exhaust gas performance.
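
The idea of inferring several operations at once and applying only the earliest ones can be sketched as follows. The function name and the parameters `op_interval` and `n_apply` are hypothetical; the source does not specify how many inferred operations are actually used.

```python
def schedule_operations(action_sequence, t_start, op_interval, n_apply):
    """Pair the first n_apply inferred operations with their execution times.

    action_sequence: operations inferred in a single inference, in time order.
    op_interval: spacing between operations, which may be shorter than the
                 time one inference takes.
    Operations beyond n_apply are inferred but never executed.
    """
    used = action_sequence[:n_apply]
    return [(t_start + i * op_interval, a) for i, a in enumerate(used)]
```

For instance, with four inferred pedal values and `n_apply=2`, only the first two are scheduled, and the next inference supplies the operations that follow.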

また、第１学習モデル４０は、報酬ｒ_ｔ＋１を基に操作の内容ａ_ｔを評価して操作の内容ａ_ｔの評価値Ｑ（ｓ_ｔ、ａ_ｔ）を計算する第２学習モデル５０によって計算された、評価値Ｑ（ｓ_ｔ、ａ_ｔ）を基に、評価値Ｑ（ｓ_ｔ、ａ_ｔ）がより高い操作の内容ａを推論するように学習されている。
上記のような構成によれば、報酬ｒ_ｔ＋１を基にした評価値Ｑ（ｓ_ｔ、ａ_ｔ）の計算を、関数近似器としての第２学習モデル５０によって計算している。このため、評価値Ｑ（ｓ_ｔ、ａ_ｔ）の計算が容易となる。 In addition, the first learning model 40 is trained to infer an operation content a with a higher evaluation value Q(s_t, a_t), based on the evaluation value Q(s_t, a_t) calculated by the second learning model 50, which evaluates the operation content a_t based on the reward r_t+1 and calculates the evaluation value Q(s_t, a_t) of the operation content a_t.
According to the above configuration, the evaluation value Q(s_t, a_t) based on the reward r_t+1 is calculated by the second learning model 50 acting as a function approximator. This facilitates calculation of the evaluation value Q(s_t, a_t).

また、第２学習モデル５０は、第１学習モデル４０により出力された操作の内容ａ_ｔを入力とし、報酬ｒ_ｔ＋１が大きいほど高い評価値Ｑ（ｓ_ｔ、ａ_ｔ）を出力するように学習され、これら第１学習モデル４０と第２学習モデル５０の学習が繰り返されることにより、第１学習モデル４０と第２学習モデル５０は強化学習されている。
上記のような構成によれば、第１の時刻の走行状態ｓ_ｔと、学習が中途の状態における第１学習モデル４０によって推論された操作の内容ａ_ｔを入力として、現状の第１学習モデル４０の出力となる操作の内容ａ_ｔを適切に評価できるように、第２学習モデル５０を学習し、この学習後の第２学習モデル５０を用いて、これが出力する評価値Ｑ（ｓ_ｔ、ａ_ｔ）が大きくなるように、第１学習モデル４０を学習することを繰り返して、第１学習モデル４０及び第２学習モデル５０を学習させることができる。したがって、第１学習モデル４０及び第２学習モデル５０を効果的に学習させることができる。 In addition, the second learning model 50 takes as input the operation content a_t output by the first learning model 40 and is trained to output a higher evaluation value Q(s_t, a_t) the larger the reward r_t+1; the first learning model 40 and the second learning model 50 are trained by reinforcement learning through repetition of this training of the two models.
According to the configuration described above, the second learning model 50 is trained, taking as input the running state s_t at the first time and the operation content a_t inferred by the partially trained first learning model 40, so that it can appropriately evaluate the operation content a_t output by the current first learning model 40. The first learning model 40 is then trained, using the second learning model 50 after this training, so that the evaluation value Q(s_t, a_t) it outputs becomes larger. Repeating these two steps trains both the first learning model 40 and the second learning model 50. Therefore, the first learning model 40 and the second learning model 50 can be trained effectively.

［実施形態の変形例］
次に、図８を用いて、上記実施形態として示したドライブロボットの制御装置及び制御方法の変形例を説明する。図８は、本変形例におけるドライブロボットの制御装置のブロック図である。本変形例におけるドライブロボット４の制御装置６０は、上記実施形態のドライブロボット４の制御装置１０とは、ドライブロボット制御部６１が、学習部３０の、学習が終了した時点における、操作内容推論部３１及び第１学習モデル４０と同じ構成の、操作内容推論部３１Ａ及び第１学習モデル４０Ａを備えている点が異なっている。 [Modification of Embodiment]
Next, a modification of the drive robot control device and control method shown as the above embodiment will be described with reference to FIG. 8. FIG. 8 is a block diagram of the control device for the drive robot in this modification. The control device 60 of the drive robot 4 in this modification differs from the control device 10 of the drive robot 4 in the above embodiment in that the drive robot control unit 61 includes an operation content inference unit 31A and a first learning model 40A having the same configurations as the operation content inference unit 31 and the first learning model 40 of the learning unit 30 at the time when learning has finished.

本変形例においては、第１学習モデル４０及び第２学習モデル５０の学習時においては、上記実施形態と同様な構成となっている。これら第１学習モデル４０及び第２学習モデル５０の学習が終了した後に、操作内容推論部３１及び第１学習モデル４０が、操作内容推論部３１Ａ及び第１学習モデル４０Ａとしてドライブロボット制御部６１の中に複製されている。
実際に車両２の性能測定に際して操作の内容ａを推論する場合においては、ドライブロボット制御部６１内の操作内容推論部３１Ａが、第１学習モデル４０Ａを使用して操作の内容ａを推論する。 In this modification, the configuration during training of the first learning model 40 and the second learning model 50 is the same as in the above embodiment. After training of the first learning model 40 and the second learning model 50 has finished, the operation content inference unit 31 and the first learning model 40 are copied into the drive robot control unit 61 as the operation content inference unit 31A and the first learning model 40A.
When actually inferring the operation content a when measuring the performance of the vehicle 2, the operation content inference unit 31A in the drive robot control unit 61 infers the operation content a using the first learning model 40A.

本変形例が、既に説明した実施形態と同様な効果を奏することは言うまでもない。
特に、本変形例の構成においては、実際に車両２の性能測定に際して操作の内容ａを推論する場合における処理が、ドライブロボット制御部６１の内部だけで完結されており、ドライブロボット制御部６１が学習部３０と通信する必要がない。 It goes without saying that this modification has the same effect as the embodiment already described.
In particular, in the configuration of this modification, the processing for inferring the operation content a when actually measuring the performance of the vehicle 2 is completed entirely within the drive robot control unit 61, so the drive robot control unit 61 does not need to communicate with the learning unit 30.

なお、本発明のドライブロボットの制御装置及び制御方法は、図面を参照して説明した上述の実施形態及び変形例に限定されるものではなく、その技術的範囲において他の様々な変形例が考えられる。 Note that the drive robot control device and control method of the present invention are not limited to the embodiment and modification described above with reference to the drawings, and various other modifications are conceivable within their technical scope.

例えば、上記実施形態においては、報酬には、燃費と排ガス性能の双方がより高い操作の内容であるほど大きな値となるように計算されていたが、これに限られず、燃費と排ガス性能のいずれか一方がより高い操作の内容であるほど大きな値となるように計算されていてもよい。
例えば、燃費の値を測定または計算し、燃費の性能が高いほど値が大きくなるように設定された第３要素を計算し、これを基に、燃費が良い操作の内容であるほど大きな値となるように、報酬を計算するようにしてもよい。これにより、燃費のみが報酬に反映され得る。
あるいは、例えば、排ガス性能の値を測定または計算し、排ガス性能が高いほど値が大きくなるように設定された第４要素を計算し、これを基に、排ガス性能が良い操作の内容であるほど大きな値となるように、報酬を計算するようにしてもよい。これにより、排ガス性能のみが報酬に反映され得る。排ガス性能を燃費とは独立して報酬に反映することにより、例えば、自動車の排気経路に設けられる三元触媒コンバータ等において、排ガス中の有害物質の除去性能を評価する場合等に適用可能である。
これら第３及び第４の要素を共に報酬に反映させることによって、燃費と排ガス性能を共に、強化学習に影響し得るようにしてもよいのは、言うまでもない。 For example, in the above embodiment, the reward is calculated so as to take a larger value the higher both the fuel efficiency and the exhaust gas performance of the operation content, but this is not a limitation; the reward may instead be calculated so as to take a larger value the higher either one of the fuel efficiency and the exhaust gas performance of the operation content.
For example, a fuel efficiency value may be measured or calculated, a third element set to take a larger value the higher the fuel efficiency may be calculated, and the reward may be calculated based on it so as to take a larger value the more fuel-efficient the operation content. In this way, only fuel efficiency is reflected in the reward.
Alternatively, for example, an exhaust gas performance value may be measured or calculated, a fourth element set to take a larger value the higher the exhaust gas performance may be calculated, and the reward may be calculated based on it so as to take a larger value the better the exhaust gas performance of the operation content. In this way, only exhaust gas performance is reflected in the reward. Reflecting exhaust gas performance in the reward independently of fuel efficiency is applicable, for example, to evaluating the performance of removing harmful substances from exhaust gas in a three-way catalytic converter or the like provided in the exhaust path of an automobile.
It goes without saying that both the third and fourth elements may be reflected in the reward so that fuel efficiency and exhaust gas performance both influence the reinforcement learning.

また、上記実施形態においては、第１学習モデル４０及び第２学習モデル５０を学習させる際には、実際に車両２を走行させて走行データを観測、取得するように説明したが、これに限られない。例えば、学習時においては、車両２の代わりにシミュレータを使用してもよい。 In the above embodiment, the first learning model 40 and the second learning model 50 are described as being trained by actually driving the vehicle 2 and observing and acquiring the travel data, but this is not a limitation. For example, a simulator may be used instead of the vehicle 2 during training.

また、上記実施形態においては、操作の対象はアクセルペダル２ｃとブレーキペダル２ｄであり、走行状態は、アクセルペダル２ｃとブレーキペダル２ｄの検出量を含むように構成されていたが、これに限られない。 In the above embodiment, the objects to be operated are the accelerator pedal 2c and the brake pedal 2d, and the running state is configured to include the detected amounts of the accelerator pedal 2c and the brake pedal 2d, but this is not a limitation.

また、上記実施形態において、報酬の計算に使用される第１要素ｒ_ＡＰ、ｒ_ＢＰは、第２の時刻における走行状態ｓ_ｔ＋１において、第１の時刻からのアクセルペダル２ｃやブレーキペダル２ｄの検出量の推移を取得し、時間軸と、アクセルペダル２ｃやブレーキペダル２ｄの検出量を軸とする座標系上で、検出量を関数として表現したうえで、この関数の二階微分または一階微分の値を基に、二階微分または一階微分の最大値の絶対値を計算し、これが所定の閾値以下であれば正の値とし、閾値よりも大きければ負の値とすることで計算したが、これに限られない。
第１要素は、例えば、第１の時刻と第２の時刻の時間間隔が十分に短い場合には、第１の時刻におけるアクセルペダルの開度と、第２の時刻におけるアクセルペダルの開度の２値を比較し、その差分が所定の値以上であれば、アクセルペダル２ｃやブレーキペダル２ｄの操作量が大きいと考え、小さい値となるように計算してもよい。
また、上記のように計算した二階微分に関し、第１の時刻と第２の時刻の間において、その値が、正負が所定の回数以上入れ替わるように変動した場合においては、小刻みなペダル操作が行われたとして第１要素の値を小さくするように計算してもよい。
また、アクセルペダル２ｃやブレーキペダル２ｄの変化量、すなわち二階微分または一階微分の最大値の絶対値に－１を乗算して負の値とし、これを第１要素としてもよい。この第１要素をできるだけ大きくするように計算することで、アクセルペダル２ｃやブレーキペダル２ｄの変化量が小さくなるような結果を取得し得る。
あるいは、上記全てを、報酬を計算する上での異なる要素として個別に計算し、報酬に反映することで、上記全ての要因が個別に報酬に影響するように、報酬を計算しても構わない。 Further, in the above embodiment, the first elements r_AP and r_BP used to calculate the reward are obtained as follows: from the running state s_t+1 at the second time, the transition of the detected amount of the accelerator pedal 2c or the brake pedal 2d since the first time is acquired; the detected amount is expressed as a function on a coordinate system whose axes are time and the detected amount; the absolute value of the maximum value of the second or first derivative of this function is calculated; and the element is set to a positive value if this is equal to or less than a predetermined threshold and to a negative value if it is greater than the threshold. However, the calculation is not limited to this.
For example, when the time interval between the first time and the second time is sufficiently short, the first element may be calculated by comparing the accelerator pedal opening at the first time with that at the second time and, if the difference is equal to or greater than a predetermined value, regarding the operation amount of the accelerator pedal 2c or the brake pedal 2d as large and assigning a small value.
Further, if the second derivative calculated as described above changes sign a predetermined number of times or more between the first time and the second time, the pedal may be regarded as having been operated in small, rapid movements, and the first element may be calculated to take a smaller value.
Alternatively, the amount of change of the accelerator pedal 2c or the brake pedal 2d, that is, the absolute value of the maximum value of the second or first derivative, may be multiplied by -1 to obtain a negative value, which is then used as the first element. Making this first element as large as possible yields results in which the change in the accelerator pedal 2c or the brake pedal 2d is small.
Alternatively, all of the above may be calculated separately as different elements of the reward and reflected in it, so that each factor influences the reward individually.
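
The three alternative calculations of the first element described above can be sketched as follows. Function names, the concrete return values `r_small` and `r_large`, and the choice to ignore exact zeros when counting sign changes are assumptions of this sketch, not details given in the source.

```python
import numpy as np


def diff_element(p_first, p_second, limit, r_small=-1.0, r_large=1.0):
    """Two-value comparison: a small value if the pedal moved by `limit` or more."""
    return r_small if abs(p_second - p_first) >= limit else r_large


def sign_flip_count(second_derivative):
    """Number of sign changes in the second derivative (zeros are skipped)."""
    s = np.sign(np.asarray(second_derivative, dtype=float))
    s = s[s != 0]
    return int(np.sum(s[1:] != s[:-1]))


def negative_peak_element(derivative):
    """Peak absolute derivative multiplied by -1, so that larger is smoother."""
    return -float(np.max(np.abs(derivative)))
```

A count from `sign_flip_count` at or above a chosen limit would indicate small, rapid pedal movements and could be used to lower the first element.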

更には、時間軸と、アクセルペダル２ｃまたはブレーキペダル２ｄの検出量を軸とする座標系上で、検出量を関数として表現した際に、関数の積分量が小さいほど値が大きくなるように設定された積分要素が計算され、当該積分要素を基に報酬が計算されるようにしてもよい。積分量が小さい操作においては、アクセルペダル２ｃやブレーキペダル２ｄの全体的な操作量が少なく、燃費や排ガス性能が良好であると考えられる。すなわち、上記関数の積分量が小さいほど値が大きくなるように設定した積分要素を基に報酬を計算することで、効果的に、燃費や排ガス性能を考慮して車両２を操作可能な、ドライブロボット４の制御装置を提供可能である。 Furthermore, when the detected amount is expressed as a function on a coordinate system whose axes are time and the detected amount of the accelerator pedal 2c or the brake pedal 2d, an integral element set to take a larger value the smaller the integral of the function may be calculated, and the reward may be calculated based on that integral element. An operation with a small integral involves little overall operation of the accelerator pedal 2c and the brake pedal 2d, and is considered to give good fuel efficiency and exhaust gas performance. That is, by calculating the reward based on an integral element set to take a larger value the smaller the integral of the function, it is possible to provide a control device for the drive robot 4 that can effectively operate the vehicle 2 in consideration of fuel efficiency and exhaust gas performance.
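
A minimal sketch of such an integral element follows, assuming a uniformly sampled pedal trace, trapezoidal integration, and the negated integral as the element; the source requires only that the element grow as the integral shrinks.

```python
import numpy as np


def integral_element(pedal, dt):
    """Negated trapezoidal integral of the pedal trace: larger when less pedal is used."""
    p = np.asarray(pedal, dtype=float)
    area = float(np.sum((p[1:] + p[:-1]) * 0.5 * dt))  # trapezoid rule
    return -area
```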

また、学習モデル４０、５０の構成は、上記実施形態において説明したものに限られないのは、言うまでもない。例えば、学習モデル４０、５０において、中間層４２、５２の数を、３より小さい、または３より多い数とする等、多くの変形例が想定されるが、本発明の主旨を損なわない限りにおいて、どのような構成を備えていてもよい。
これは、学習モデル４０、５０の損失関数に関しても同様である。上記実施形態において、例えば第２学習モデル５０の損失関数は、既に説明した数式３に示される構造としたが、これに代えて、次の数式４を第２学習モデル５０の損失関数としてもよい。数式４は、損失関数の計算において、学習対象として現存する、学習中の現段階のネットワークを用いるのではなく、少し前の時点で固定された第１学習モデル４０及び第２学習モデル５０に対し、これをターゲットネットワークとして使用する場合の損失関数である。Ｑ_{ｔａｒｇｅｔ}は、ターゲットネットワークとしての第２学習モデル５０における行動価値関数であり、μ_{ｔａｒｇｅｔ}は、ターゲットネットワークとしての第１学習モデル４０における出力関数である。 Further, it goes without saying that the configurations of the learning models 40 and 50 are not limited to those described in the above embodiment. Many modifications are conceivable, for example making the number of intermediate layers 42 and 52 in the learning models 40 and 50 smaller or larger than three, and any configuration may be used as long as it does not depart from the gist of the present invention.
The same applies to the loss functions of the learning models 40 and 50. In the above embodiment, for example, the loss function of the second learning model 50 has the structure shown in Formula 3 described earlier, but the following Formula 4 may be used instead as the loss function of the second learning model 50. Formula 4 is the loss function for the case where, instead of using the current network being trained, copies of the first learning model 40 and the second learning model 50 fixed at a slightly earlier point in time are used as target networks. Q_target is the action value function of the second learning model 50 as a target network, and μ_target is the output function of the first learning model 40 as a target network.
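
Formula 4 itself is not reproduced in this text. As a hedged sketch only, assuming that Formula 3 is a squared temporal-difference error with a discount factor γ (both the squared form and γ are assumptions, not taken from the source), a target-network loss consistent with the description of Q_target and μ_target would take the following form:

```latex
L = \Bigl( r_{t+1}
      + \gamma \, Q_{\mathrm{target}}\bigl(s_{t+1},\, \mu_{\mathrm{target}}(s_{t+1})\bigr)
      - Q(s_t, a_t) \Bigr)^{2}
```

Here the bootstrapped term is evaluated with the fixed target copies rather than with the networks currently being updated, which is the distinction the paragraph above draws between Formula 3 and Formula 4.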

また、上記実施形態においては、操作内容推論部３１により推論された操作の内容ａは、そのまま車両操作制御部２３に送信されて、ドライブロボット４の制御に使用されたが、これに限られない。例えば、操作の内容ａを過去または将来の一定の期間にわたって、横軸を時間軸として関数表現したうえで、ローパスフィルタを適用することで、近接する時刻における操作の内容ａの変化をなだらかにすることにより、車両２を滑らかに操作することができる。
操作の内容ａに対して、過去または将来の一定の期間にわたって移動平均を計算することによっても、同様な効果が期待できる。
あるいは、ドライブロボット制御部２０は、フィードバック系の制御を行うフィードバック制御部を備え、操作の内容ａは、フィードフォワード値としてフィードバック制御部で使用されてもよい。すなわち、フィードバック制御部により、例えばＰＩＤ制御などのフィードバック系の制御系を実現し、操作内容推論部３１により推論された操作の内容ａを、当該制御系におけるフィードフォワード値として使用するように構成してもよい。この場合においては、車速追従性が向上する。 Further, in the above embodiment, the operation content a inferred by the operation content inference unit 31 is sent unchanged to the vehicle operation control unit 23 and used to control the drive robot 4, but this is not a limitation. For example, the operation content a may be expressed as a function of time over a certain past or future period and passed through a low-pass filter, which smooths changes in the operation content a between neighboring times and allows the vehicle 2 to be operated smoothly.
A similar effect can be expected from calculating a moving average of the operation content a over a certain past or future period.
Alternatively, the drive robot control unit 20 may include a feedback control unit that performs feedback control, and the operation content a may be used by the feedback control unit as a feedforward value. That is, the feedback control unit may implement a feedback control system such as PID control, and the operation content a inferred by the operation content inference unit 31 may be used as the feedforward value in that control system. In this case, vehicle speed followability improves.
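
The three post-processing options above can be sketched as follows. The first-order exponential filter standing in for "a low-pass filter", the trailing window of the moving average, and the class and parameter names are all assumptions made for illustration.

```python
def moving_average(actions, window):
    """Trailing moving average over up to `window` most recent operations."""
    out = []
    for i in range(len(actions)):
        chunk = actions[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def low_pass(actions, alpha):
    """First-order low-pass filter; alpha in (0, 1], smaller means smoother."""
    out, y = [], actions[0]
    for a in actions:
        y = alpha * a + (1 - alpha) * y
        out.append(y)
    return out


class PidWithFeedforward:
    """PID on the speed error, with the inferred operation as feedforward."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, v_command, v_detected, feedforward):
        error = v_command - v_detected
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (feedforward + self.kp * error
                + self.ki * self.integral + self.kd * derivative)
```

With zero speed error the controller output equals the feedforward value, so the inferred operation passes through unchanged and the PID terms act only as a correction.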

また、上記実施形態においては、学習前後における損失関数の差が一定の値以下となった場合に学習を終了したが、これに限られない。例えば、所定の回数だけ学習モデル４０、５０を更新したら学習を終了するようにしてもよい。あるいは、学習途中のモデル４０を用いて車両２を実際に走行させ、その結果として燃費や排ガス性能等を実際に測定して、これを基に走行スコアを計算し、走行スコアが学習前後で一定以上増加しなくなった場合に学習を終了するようにしてもよい。 Further, in the above embodiment, learning is terminated when the difference between the loss functions before and after learning is equal to or less than a certain value, but the present invention is not limited to this. For example, learning may be terminated after the learning models 40 and 50 are updated a predetermined number of times. Alternatively, the vehicle 2 is actually driven using the model 40 that is in the middle of learning, and as a result, the fuel efficiency, exhaust gas performance, etc. are actually measured, and the driving score is calculated based on this, and the driving score is constant before and after learning. Learning may be terminated when the number does not increase any more.
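
The three stopping conditions mentioned above can be written as simple predicates. The function and parameter names are hypothetical, and how the running score is computed from measured fuel efficiency and exhaust gas performance is left open, as in the source.

```python
def loss_converged(loss_before, loss_after, tol):
    """Stop when the loss changes by no more than tol across one update."""
    return abs(loss_before - loss_after) <= tol


def update_budget_spent(n_updates, max_updates):
    """Stop after a predetermined number of model updates."""
    return n_updates >= max_updates


def score_plateaued(score_before, score_after, min_gain):
    """Stop when the running score no longer increases by at least min_gain."""
    return (score_after - score_before) < min_gain
```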

また、上記実施形態においては、第１学習モデル４０と第２学習モデル５０は強化学習のみにより学習されていたが、部分的に教師あり学習を組み合わせてもよい。例えば、アクセルペダル２ｃとブレーキペダル２ｄを操作して何らかの走行を行った際に、その走行データを取得して、検出車速を指令車速に置き換えると、指令車速に完全に追従されたアクセルペダル２ｃとブレーキペダル２ｄの操作データを得ることができる。このように作成されたデータを教師データとして教師あり学習を併用することにより、強化学習における学習の方向性を定めて学習の進捗を促進するとともに、追従性がより高い操作の内容を学習、推論することができる。 Also, in the above embodiment, the first learning model 40 and the second learning model 50 are trained by reinforcement learning alone, but supervised learning may be partially combined. For example, when the vehicle is driven in some manner by operating the accelerator pedal 2c and the brake pedal 2d, the travel data can be acquired and the detected vehicle speed treated as the commanded vehicle speed, which yields operation data for the accelerator pedal 2c and the brake pedal 2d that follows the commanded vehicle speed perfectly. Using data created in this way as teacher data for supervised learning alongside reinforcement learning sets the direction of the reinforcement learning and speeds its progress, and allows operation contents with higher followability to be learned and inferred.
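
The construction of teacher data by treating the detected vehicle speed as the commanded vehicle speed can be sketched as follows. The record layout with the keys `accel`, `brake`, `v_detected`, and `v_command` is a hypothetical format chosen for this example.

```python
def relabel_as_teacher_data(records):
    """Return copies of the records whose command speed equals the detected speed.

    Each record is a dict with keys 'accel', 'brake', 'v_detected', 'v_command'.
    After relabeling, the recorded pedal operations follow the (new) command
    speed exactly, so they can serve as teacher data.
    """
    return [{**r, "v_command": r["v_detected"]} for r in records]
```

The input records are left unmodified, so the original command speeds remain available for the reinforcement learning side.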

これ以外にも、本発明の主旨を逸脱しない限り、上記実施形態及び変形例で挙げた構成を取捨選択したり、他の構成に適宜変更したりすることが可能である。 In addition to this, it is possible to select the configurations mentioned in the above-described embodiment and modifications, or to change them to other configurations as appropriate without departing from the gist of the present invention.

１試験装置
２車両
２ｃアクセルペダル
２ｄブレーキペダル
３シャシーダイナモメータ
４ドライブロボット（自動操縦ロボット）
１０、６０制御装置
２０、６１ドライブロボット制御部
２１指令車速記憶部
２２走行状態取得部
２３車両操作制御部
３０学習部
３１、３１Ａ操作内容推論部
３２報酬計算部
３３強化学習部
３４学習用データ記憶部
４０、４０Ａ第１学習モデル
５０第２学習モデル
Ｑ行動価値関数（評価値）
ｓ走行状態
ｓ１アクセルペダル検出量
ｓ２ブレーキペダル検出量
ｓＮ指令車速
ａ操作の内容
ａ１アクセルペダル操作
ａ２ブレーキペダル操作
1 test device
2 vehicle
2c accelerator pedal
2d brake pedal
3 chassis dynamometer
4 drive robot (autopilot robot)
10, 60 control device
20, 61 drive robot control unit
21 command vehicle speed storage unit
22 running state acquisition unit
23 vehicle operation control unit
30 learning unit
31, 31A operation content inference unit
32 reward calculation unit
33 reinforcement learning unit
34 learning data storage unit
40, 40A first learning model
50 second learning model
Q action value function (evaluation value)
s running state
s1 accelerator pedal detection amount
s2 brake pedal detection amount
sN command vehicle speed
a operation content
a1 accelerator pedal operation
a2 brake pedal operation

Claims

A control device for an autopilot robot, the control device controlling an autopilot robot that is mounted on a vehicle and drives the vehicle so that the vehicle travels according to a prescribed command vehicle speed, the control device comprising:
a running state acquisition unit that acquires a running state of the vehicle;
an operation content inference unit that, based on the running state at a first time, infers content of an operation of the vehicle after the first time using a first learning model; and
a vehicle operation control unit that controls the autopilot robot based on the content of the operation,
wherein the running state includes the vehicle speed detected in the vehicle and the command vehicle speed at the time when the running state was acquired,
the first learning model is trained by reinforcement learning based on a reward that is calculated, based on the running state at a second time after the first time following the operation of the autopilot robot based on the content of the operation, so as to take a larger value the more the content of the operation improves either one or both of fuel efficiency and exhaust gas performance, and
the first learning model is trained, based on an evaluation value calculated by a second learning model that evaluates the content of the operation based on the reward and calculates the evaluation value of the content of the operation, to infer the content of the operation for which the evaluation value is higher.

2. The control device for an autopilot robot according to claim 1, wherein the objects to be operated include an accelerator pedal and a brake pedal, and the running state includes detection amounts of the accelerator pedal and the brake pedal.

3. A first element is calculated which is set such that the smaller the change in the detected amount of the accelerator pedal and the brake pedal is, the larger the value is, and the reward is calculated based on the first element. 3. The autopilot robot control device according to .

4. The control device for an autopilot robot according to claim 3, wherein, when the detected amount is expressed as a function on a coordinate system whose axes are time and the detected amount of the accelerator pedal or the brake pedal, the first element is calculated based on the value of the first derivative or the second derivative of the function.

5. The control device for an autopilot robot according to claim 4, wherein the first element is calculated so as to be a positive value if the absolute value of the maximum value of the first derivative or the second derivative of the function is less than or equal to a predetermined threshold value, and a negative value if the absolute value of the maximum value is greater than the predetermined threshold value.

When the detected amount is expressed as a function on a coordinate system whose axes are the time axis and the detected amount of the accelerator pedal or the brake pedal, the smaller the integral of the function, the larger the value. 3. The control device for an autopilot robot according to claim 2, wherein the calculated integral element is calculated, and the reward is calculated based on the integral element.

A second element is calculated that is set such that the smaller the difference between the detected vehicle speed and the commanded vehicle speed at the second time after the operation of the autopilot robot based on the content of the operation, the larger its value, and the reward is calculated based on the second element.

The control device for an autopilot robot according to any one of claims 1 to 7, wherein the operation content inference unit infers the content of the operation at a plurality of times after the first time.

The control device for an autopilot robot according to any one of claims 1 to 8, wherein the second learning model receives as input the content of the operation output by the first learning model and is trained to output a higher evaluation value as the reward increases, and
the first learning model and the second learning model are trained by reinforcement learning by repeating the training of the first learning model and the training of the second learning model.

A control method for an autopilot robot, the method controlling an autopilot robot that is mounted on a vehicle and drives the vehicle so that the vehicle travels according to a prescribed command vehicle speed, the method comprising:
acquiring a running state of the vehicle, the running state including the vehicle speed detected in the vehicle and the command vehicle speed at the time when the running state was acquired;
inferring, based on the running state at a first time and using a first learning model, content of an operation of the vehicle after the first time, wherein the first learning model is trained by reinforcement learning based on a reward that is calculated, based on the running state at a second time after the first time following the operation of the autopilot robot based on the content of the operation, so as to take a larger value the more the content of the operation improves either one or both of fuel efficiency and exhaust gas performance, and is trained, based on an evaluation value calculated by a second learning model that evaluates the content of the operation based on the reward and calculates the evaluation value of the content of the operation, to infer the content of the operation for which the evaluation value is higher; and
controlling the autopilot robot based on the content of the operation.
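
For illustration only, the reward elements recited in the claims above (a first element based on the derivative of the pedal detection amount, an integral element, and a second element based on the vehicle speed error) might be sketched as follows. The function names, the ±1 magnitudes, the finite-difference and trapezoidal approximations, the use of the largest absolute derivative, and the weighted sum are all assumptions made for this sketch and are not specified in the claims.

```python
from typing import Sequence, Tuple


def first_element(samples: Sequence[float], dt: float,
                  threshold: float, order: int = 1) -> float:
    """Positive if the peak |derivative| of the pedal amount is within
    the threshold, negative otherwise (order 1 or 2)."""
    if len(samples) <= order:
        raise ValueError("need more samples than the derivative order")
    d = list(samples)
    for _ in range(order):
        # Finite-difference approximation of the derivative.
        d = [(b - a) / dt for a, b in zip(d, d[1:])]
    peak = max(abs(x) for x in d)
    return 1.0 if peak <= threshold else -1.0


def integral_element(samples: Sequence[float], dt: float) -> float:
    """Larger (less negative) the smaller the integral of the pedal amount."""
    # Trapezoidal approximation of the integral over time.
    area = sum((a + b) / 2.0 * dt for a, b in zip(samples, samples[1:]))
    return -area


def second_element(detected_speed: float, commanded_speed: float) -> float:
    """Larger (less negative) the smaller the speed-tracking error."""
    return -abs(detected_speed - commanded_speed)


def reward(pedal_samples: Sequence[float], dt: float, threshold: float,
           detected_speed: float, commanded_speed: float,
           weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Hypothetical combination: weighted sum of the three elements."""
    return (weights[0] * first_element(pedal_samples, dt, threshold)
            + weights[1] * integral_element(pedal_samples, dt)
            + weights[2] * second_element(detected_speed, commanded_speed))
```

A smooth, small pedal trace with no speed error therefore scores highest in this sketch, which matches the direction of each element as recited, though the actual scaling and combination are left open by the claims.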