JP2021124403A

JP2021124403A - Control device and control method for automatically manipulating robot

Info

Publication number: JP2021124403A
Application number: JP2020018391A
Authority: JP
Inventors: 泰宏金刺; Yasuhiro Kanesashi; 健人吉田; Taketo Yoshida; 寛修深井; Hironaga Fukai
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 2020-02-06
Filing date: 2020-02-06
Publication date: 2021-08-30
Anticipated expiration: 2040-02-06
Also published as: JP6908144B1; WO2021157212A1

Abstract

To provide a control device and control method for an automatically manipulating robot (drive robot). SOLUTION: The control device for an automatically manipulating robot comprises: a manipulation content inference unit 41 for inferring a manipulation in a first cycle by a manipulation inference learning model 50 generated by reinforcement learning of a machine learner; an adjustment coefficient inference unit 45 for inferring an adjustment coefficient by an adjustment coefficient inference learning model 70, which is generated by reinforcement learning of a machine learner so as to infer, on the basis of a traveling state, an adjustment coefficient for adjusting the manipulation inferred by the manipulation content inference unit 41 during the first cycle; and a vehicle manipulation control unit 22 for adjusting the manipulation by the adjustment coefficient to generate an adjusted manipulation during the first cycle and controlling an automatically manipulating robot 4 on the basis of the adjusted manipulation. SELECTED DRAWING: Figure 2

Description

The present invention relates to a control device and a control method for an autopilot robot.

Generally, when vehicles such as ordinary automobiles are manufactured and sold, the fuel consumption and exhaust gas must be measured while the vehicle is driven according to a specific driving pattern (mode) prescribed by each country or region, and the results must be displayed.
A mode can be represented by a graph, for example as the relationship between the time elapsed from the start of traveling and the vehicle speed to be reached at that time. Because it is a command given to the vehicle regarding the speed to be achieved, this target vehicle speed is sometimes called the command vehicle speed.
Such fuel consumption and exhaust gas tests are carried out by placing the vehicle on a chassis dynamometer and having an autopilot robot mounted on the vehicle, a so-called Drive Robot (registered trademark), drive the vehicle according to the mode.
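The description of a mode as a time-to-speed relationship can be illustrated with a minimal Python sketch. The function name `command_speed`, the linear interpolation between points, and the four-point `MODE` table are assumptions made for illustration only; they are not taken from the patent or from any regulated test cycle.

```python
from bisect import bisect_right


def command_speed(mode, t):
    """Linearly interpolate the command vehicle speed v_ref at time t [s].

    `mode` is a list of (time_s, speed_kmh) points sorted by time.
    The points used below are hypothetical, not a regulated cycle.
    """
    times = [p[0] for p in mode]
    if t <= times[0]:
        return mode[0][1]
    if t >= times[-1]:
        return mode[-1][1]
    i = bisect_right(times, t)  # index of the first point strictly after t
    (t0, v0), (t1, v1) = mode[i - 1], mode[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical mode: accelerate, cruise, decelerate to a stop.
MODE = [(0.0, 0.0), (10.0, 40.0), (20.0, 40.0), (30.0, 0.0)]
```

Calling `command_speed(MODE, t)` at each control time would give the speed the vehicle is expected to have reached at that time.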

A tolerance range is prescribed for the command vehicle speed. If the vehicle speed leaves the tolerance range, the test becomes invalid, so the control of the autopilot robot must follow the command vehicle speed closely. For this reason, reinforcement learning, for example, is sometimes applied to the control of the autopilot robot.
For example, Patent Document 1 discloses a vehicle driving simulation device, a driver model construction method, and a driver model construction program capable of constructing, by reinforcement learning, a driver model that performs human-like pedal operation.
More specifically, the vehicle driving simulation device runs the vehicle model a plurality of times while changing the gain values of the driver model, and evaluates the changed gain values on the basis of a reward value, thereby setting the gains of the driver model automatically.
In the configuration of Patent Document 1, when the vehicle is actually driven, the driver model performs PID control of the vehicle using the gain values determined in advance by reinforcement learning.
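The tolerance range on the command vehicle speed mentioned above can be sketched as a simple band check. This is only an illustration: the function name `within_tolerance` and the default band width `tol_kmh=2.0` are hypothetical, and the patent does not state how the tolerance range is defined.

```python
def within_tolerance(v_ref, v_det, tol_kmh=2.0):
    """Return True if every detected speed lies inside v_ref +/- tol_kmh.

    v_ref and v_det are equal-length sequences sampled at the same times.
    The band width is a placeholder, not a value from the patent.
    """
    return all(abs(r - d) <= tol_kmh for r, d in zip(v_ref, v_det))
```

A run for which this check returns False would correspond to the invalid test described in the text.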

Japanese Unexamined Patent Publication No. 2014-115168

Driving modes used to run a vehicle and measure its characteristics, such as the WLTC (Worldwide harmonized Light vehicles Test Cycle) mode, include a wide variety of driving patterns. In a device that controls the vehicle with predetermined gain values, as in Patent Document 1, it is not easy to respond flexibly to each of these patterns and make the vehicle follow the command vehicle speed with high accuracy.

One conceivable alternative is to take a machine learner such as a neural network, constructed to receive the vehicle state (for example, the detected vehicle speed and the command vehicle speed) and to output a vehicle operation suited to that state, and train it by reinforcement learning to generate a learning model that infers the operation. When the vehicle is actually driven, the vehicle state is input to this operation inference learning model, and the drive robot is controlled so that the operation inferred by the model is applied to the vehicle.
In general, inference by a learning model generated by training a machine learner such as a neural network tends to be computationally expensive. Consequently, the inference cycle, which is the time interval between the inference times at which the operation inference learning model infers the operation, may be longer than the control cycle, which is the time interval between the control times at which the drive robot is actually controlled, so that one inference cycle contains a plurality of control times.
In such a case, the operation most recently inferred by the operation inference learning model could be applied unchanged at all of the control times within an inference cycle, but this cannot be called fine-grained control, and high followability to the command vehicle speed cannot be expected.
Alternatively, the operations for all of the control times in the next inference cycle could be inferred together at once. In that case, however, the number of operations to be inferred increases, so the structure of the operation inference learning model becomes complicated. Training the operation inference learning model also becomes difficult.

The problem to be solved by the present invention is to provide a control device and a control method for an autopilot robot (drive robot) in which the learning model that infers the vehicle operation has a simple structure and is easy to train, and which can make the vehicle follow the command vehicle speed with high accuracy.

To solve the above problem, the present invention employs the following means. That is, the present invention provides a control device for an autopilot robot that controls an autopilot robot, which is mounted on a vehicle and drives the vehicle, so that the vehicle travels according to a prescribed command vehicle speed, the control device comprising: an operation content inference unit that infers an operation of the vehicle in a first cycle by means of an operation inference learning model generated by reinforcement learning of a machine learner so as to infer, on the basis of a traveling state of the vehicle including the vehicle speed and the command vehicle speed, an operation of the vehicle that causes the vehicle to travel according to the command vehicle speed; an adjustment coefficient inference unit that infers an adjustment coefficient by means of an adjustment coefficient inference learning model generated by reinforcement learning of a machine learner so as to infer, on the basis of the traveling state, an adjustment coefficient for adjusting, during the first cycle, the operation inferred by the operation content inference unit; and a vehicle operation control unit that, during the first cycle, adjusts the operation with the adjustment coefficient to generate an adjusted operation and controls the autopilot robot on the basis of the adjusted operation.

The present invention also provides a control method for an autopilot robot that controls an autopilot robot, which is mounted on a vehicle and drives the vehicle, so that the vehicle travels according to a prescribed command vehicle speed, the control method comprising: inferring an operation of the vehicle in a first cycle by means of an operation inference learning model generated by reinforcement learning of a machine learner so as to infer, on the basis of a traveling state of the vehicle including the vehicle speed and the command vehicle speed, an operation of the vehicle that causes the vehicle to travel according to the command vehicle speed; inferring an adjustment coefficient by means of an adjustment coefficient inference learning model generated by reinforcement learning of a machine learner so as to infer, on the basis of the traveling state, an adjustment coefficient for adjusting the inferred operation during the first cycle; and, during the first cycle, adjusting the operation with the adjustment coefficient to generate an adjusted operation and controlling the autopilot robot on the basis of the adjusted operation.

According to the present invention, it is possible to provide a control device and a control method for an autopilot robot (drive robot) in which the learning model that infers the vehicle operation has a simple structure and is easy to train, and which can make the vehicle follow the command vehicle speed with high accuracy.

FIG. 1 is an explanatory diagram of a test environment using an autopilot robot (drive robot) in an embodiment of the present invention.
FIG. 2 is a block diagram of the control device for the autopilot robot in the embodiment.
FIG. 3 is a processing block diagram showing the data flow of the control device.
FIG. 4 is a flowchart of learning in the control method for controlling the autopilot robot.
FIG. 5 is a flowchart of running the vehicle under control for performance measurement in the control method for the autopilot robot.
FIG. 6 is a processing block diagram showing the data flow of the control device for the autopilot robot in a first modification of the embodiment.
FIG. 7 is a processing block diagram showing the data flow of the control device for the autopilot robot in a second modification of the embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
In the present embodiment, a Drive Robot (registered trademark) is used as the autopilot robot, so the autopilot robot is referred to below as the drive robot.

FIG. 1 is an explanatory diagram of a test environment using the drive robot in the embodiment. The test device 1 includes a vehicle 2, a chassis dynamometer 3, and a drive robot 4.
The vehicle 2 is placed on the floor surface. The chassis dynamometer 3 is installed below the floor surface. The vehicle 2 is positioned so that its drive wheels 2a rest on the chassis dynamometer 3. When the vehicle 2 travels and the drive wheels 2a rotate, the chassis dynamometer 3 rotates in the opposite direction.
The drive robot 4 is mounted on the driver's seat 2b of the vehicle 2 and drives the vehicle 2. The drive robot 4 includes an actuator 4c. The actuator 4c is arranged so as to be in contact with the accelerator pedal 2c of the vehicle 2.

The drive robot 4 is controlled by a control device 11, which is described in detail later. The control device 11 changes and adjusts the opening of the accelerator pedal 2c of the vehicle 2 by controlling the actuator 4c of the drive robot 4.
The control device 11 controls the drive robot 4 so that the vehicle 2 travels according to the prescribed command vehicle speed. That is, by changing the opening of the accelerator pedal 2c of the vehicle 2, the control device 11 controls the traveling of the vehicle 2 so that it follows the prescribed driving pattern (mode). More specifically, the control device 11 controls the traveling of the vehicle 2 so that, as time elapses from the start of traveling, the vehicle follows the command vehicle speed, which is the vehicle speed to be reached at each time.

The control device 11 includes a drive robot control unit 20 and a learning unit 30.
The drive robot control unit 20 controls the drive robot 4 by generating a control signal for the drive robot 4 and transmitting it to the drive robot 4. The learning unit 30 performs machine learning, as described later, and generates an operation inference learning model, a first action value inference learning model, an adjustment coefficient inference learning model, and a second action value inference learning model. The control signal for the drive robot 4 is generated on the basis of the inference results of the operation inference learning model and the adjustment coefficient inference learning model.
The drive robot control unit 20 is, for example, an information processing device such as a controller provided outside the housing of the drive robot 4. The learning unit 30 is, for example, an information processing device such as a personal computer.

FIG. 2 is a block diagram of the test device 1 and the control device 11. FIG. 3 is a processing block diagram showing the data flow of the test device 1 and the control device 11.
In addition to the vehicle 2, the chassis dynamometer 3, and the drive robot 4 already described, the test device 1 includes a vehicle state measurement unit 5. The vehicle state measurement unit 5 consists of various measuring devices that measure the states of the vehicle 2 and the chassis dynamometer 3. In the present embodiment, the vehicle state measurement unit 5 detects the engine speed n_det, the engine temperature d_det, and the vehicle speed v_det of the vehicle 2. Each of these detected values is transmitted to the drive robot control unit 20 of the control device 11, which is described next.

The drive robot control unit 20 includes a vehicle operation control unit 22 and a drive state acquisition unit 23. The vehicle operation control unit 22 includes an operation complement unit 24. The operation complement unit 24 includes a running resistance calculation unit 25, a feedback operation amount calculation unit 26, and a vehicle driving force calculation unit 27. The learning unit 30 includes a command vehicle speed generation unit 31, an inference data forming unit 32, a learning data forming unit 33, an operation learning data generation unit 34, a learning data storage unit 35, an adjustment coefficient learning data generation unit 36, and a reinforcement learning unit 40. The reinforcement learning unit 40 includes an operation content inference unit 41, a first action value inference unit 42, a reward calculation unit 43, an adjustment coefficient inference unit 45, and a second action value inference unit 46. The reward calculation unit 43 includes an operation reward calculation unit 44 and an adjustment coefficient reward calculation unit 47.
Each component of the control device 11 other than the learning data storage unit 35 may be, for example, software or a program executed by a CPU in the information processing devices described above. The learning data storage unit 35 may be realized by a storage device, such as a semiconductor memory or a magnetic disk, provided inside or outside those information processing devices.
The operation content inference unit 41, the first action value inference unit 42, the adjustment coefficient inference unit 45, and the second action value inference unit 46 include an operation inference learning model 50, a first action value inference learning model 60, an adjustment coefficient inference learning model 70, and a second action value inference learning model 80, respectively.

As described later, the operation inference learning model 50 of the operation content inference unit 41 infers the operation of the vehicle 2, and the adjustment coefficient inference learning model 70 of the adjustment coefficient inference unit 45 infers the adjustment coefficients for the vehicle 2. The drive robot control unit 20 controls the drive robot 4 on the basis of the inferred operation and adjustment coefficients.
In the present embodiment in particular, the operation complement unit 24 uses its running resistance calculation unit 25, feedback operation amount calculation unit 26, and vehicle driving force calculation unit 27 to apply feedback control, in accordance with the inferred adjustment coefficients, to the inferred operation of the vehicle 2, calculates the operation that is actually applied to the drive robot 4, and controls the drive robot 4.
The drive robot control unit 20 is described in detail first. Hereinafter, the time interval between the inference times at which the operation content inference unit 41 and the adjustment coefficient inference unit 45 infer the operation and the adjustment coefficients is called the inference cycle (first cycle) Tnn. The time interval between the control times at which the drive robot 4 is actually controlled is called the control cycle (second cycle) Tdr. In the present embodiment, the inference cycle Tnn is set longer than the control cycle Tdr. That is, the values inferred for the operation and the adjustment coefficients at a given time are applied at every control time of the drive robot 4 within the interval up to the time one inference cycle Tnn later. Each of the operations of the drive robot control unit 20 described below is executed every control cycle Tdr.
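The relationship between the inference cycle Tnn and the control cycle Tdr can be sketched as a two-rate loop in which the latest inference result is held and reused at every control time. The sketch below is illustrative only: the function name `run_two_rate_loop`, the use of integer milliseconds, and the assumption that Tnn is an integer multiple of Tdr are not stated in the patent.

```python
def run_two_rate_loop(infer, control, t_nn, t_dr, n_cycles):
    """Call `infer` once per inference cycle and `control` once per control cycle.

    t_nn and t_dr are integer milliseconds (to avoid float drift).
    Assumption: t_nn is an integer multiple of t_dr and longer than t_dr.
    Returns the list of values returned by `control`.
    """
    if t_nn <= t_dr or t_nn % t_dr != 0:
        raise ValueError("expected t_nn to be a larger integer multiple of t_dr")
    ticks_per_inference = t_nn // t_dr
    log = []
    latest = None
    for k in range(n_cycles * ticks_per_inference):
        now = k * t_dr
        if k % ticks_per_inference == 0:
            latest = infer(now)           # new inference result at this time
        log.append(control(now, latest))  # held result reused at every tick
    return log
```

With, say, `t_nn=100` and `t_dr=20`, each inference result would be reused at five consecutive control times before the next inference replaces it.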

The drive state acquisition unit 23 receives the detected engine speed n_det, the detected engine temperature d_det, and the detected vehicle speed v_det of the vehicle 2 from the vehicle state measurement unit 5. These values can be referenced by each component in the vehicle operation control unit 22.
The vehicle operation control unit 22 receives the command vehicle speed v_ref to be followed from the command vehicle speed generation unit 31 of the learning unit 30, which is described later. The vehicle driving force calculation unit 27 of the vehicle operation control unit 22 calculates the vehicle driving force F_x by a predetermined approximate formula based on the derivative of the received command vehicle speed v_ref and the weight of the vehicle 2.
The running resistance calculation unit 25 calculates, on the basis of the detected vehicle speed v_det, a running resistance F_RL that imitates actual driving on a real road surface. The running resistance calculation unit 25 transmits the running resistance F_RL to the chassis dynamometer 3, which applies the corresponding resistance force to the traveling vehicle 2.
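The two force calculations above can be sketched as follows. The patent only refers to "a predetermined approximate formula" for F_x and gives no formula for F_RL, so both formulas here are assumptions: F_x is approximated by mass times a finite-difference derivative of v_ref, and F_RL by a quadratic road-load polynomial whose coefficients `a`, `b`, `c` are placeholder values.

```python
def vehicle_driving_force(v_ref_prev, v_ref_next, dt, mass_kg):
    """Approximate F_x [N] as mass times the finite-difference derivative of v_ref.

    v_ref_prev and v_ref_next are command speeds [m/s] separated by dt [s].
    Assumed formula; the patent does not disclose its approximation.
    """
    return mass_kg * (v_ref_next - v_ref_prev) / dt


def running_resistance(v_det, a=100.0, b=1.0, c=0.4):
    """Assumed road-load polynomial F_RL = a + b*v + c*v**2 [N], v_det in m/s.

    The coefficients are hypothetical placeholders.
    """
    return a + b * v_det + c * v_det ** 2
```

In this sketch, `running_resistance` plays the role of the value sent to the chassis dynamometer, and `vehicle_driving_force` the role of the feedforward force demand derived from the mode.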

The drive state acquisition unit 23 transmits the required driving force F_ref, which is the sum of the vehicle driving force F_x and the running resistance F_RL, together with the detected engine speed n_det, the detected engine temperature d_det, and the detected vehicle speed (vehicle speed) v_det, to the inference data forming unit 32, which is described later.
The inference data forming unit 32 combines the values received from the drive state acquisition unit 23 with the command vehicle speed v_ref received separately from the command vehicle speed generation unit 31, and transmits them to the operation content inference unit 41 as the traveling state of the vehicle 2.
The operation content inference unit 41 has been trained by reinforcement learning to infer, on the basis of this traveling state, an operation of the vehicle 2 that causes the vehicle 2 to travel according to the command vehicle speed v_ref. The operation content inference unit 41 infers the operation of the vehicle 2 every inference cycle Tnn on the basis of the received traveling state. In the present embodiment, the operation target includes the accelerator pedal 2c, so the operation content inference unit 41 calculates a change amount of the accelerator opening. Strictly speaking, this change amount is calculated by feedforward inference based on the required driving force F_ref, which is calculated from the command vehicle speed v_ref. That is, the change amount of the accelerator opening calculated by the operation content inference unit 41 is a feedforward change amount (hereinafter, FF change amount) θ_FF.
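The assembly of the traveling state passed to the inference units can be sketched as below. The names `TravelingState`, `build_traveling_state`, and `as_model_input` are hypothetical, and the field order, the absence of normalization, and the use of a single time sample are assumptions; the patent states only which quantities make up the traveling state.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class TravelingState:
    f_ref: float  # required driving force F_ref = F_x + F_RL
    n_det: float  # detected engine speed
    d_det: float  # detected engine temperature
    v_det: float  # detected vehicle speed
    v_ref: float  # command vehicle speed


def build_traveling_state(f_x, f_rl, n_det, d_det, v_det, v_ref):
    """Combine the drive-state values with v_ref, as the text describes."""
    return TravelingState(f_x + f_rl, n_det, d_det, v_det, v_ref)


def as_model_input(state):
    """Flatten the state into a list (assumed input layout for a model)."""
    return list(astuple(state))
```

A learned model would then map such an input to the FF change amount θ_FF; the model itself is outside the scope of this sketch.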

The inference data forming unit 32 also transmits the traveling state of the vehicle 2 to the adjustment coefficient inference unit 45. The adjustment coefficient inference unit 45 has been trained by reinforcement learning to infer, on the basis of the traveling state, adjustment coefficients for adjusting the FF change amount inferred by the operation content inference unit 41, that is, the operation θ_FF, during the next inference cycle Tnn. The adjustment coefficient inference unit 45 infers the adjustment coefficients for the vehicle 2 every inference cycle Tnn on the basis of the received traveling state. In the present embodiment, the adjustment coefficients include a proportional gain Kp, an integral gain Ki, and a differential gain Kd.

The feedback operation amount calculation unit 26 receives the vehicle speed error dv, which is the difference between the command vehicle speed v_ref and the detected vehicle speed v_det. Every inference cycle Tnn, the feedback operation amount calculation unit 26 also receives from the adjustment coefficient inference unit 45 the inferred adjustment coefficients Kp, Ki, and Kd, that is, the proportional gain Kp, the integral gain Ki, and the differential gain Kd.
On the basis of the latest inferred adjustment coefficients Kp, Ki, and Kd, which it receives every inference cycle Tnn, the feedback operation amount calculation unit 26 calculates by feedback control the adjustment amount θ_FB for the operation θ_FF, that is, a feedback change amount (hereinafter, FB change amount) θ_FB of the accelerator opening. In the present embodiment in particular, the feedback control is PID (Proportional-Integral-Differential) control. As described above, the feedback operation amount calculation unit 26 calculates the adjustment amount θ_FB at the control cycle Tdr, which is shorter than the inference cycle Tnn.
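The feedback calculation described above can be sketched as a discrete PID controller whose gains are refreshed every inference cycle while its output is computed every control cycle. This is a textbook discretization offered as an assumption: the class name `FeedbackOperationAmount`, the sign convention dv = v_ref - v_det, the rectangular integration, and the zero derivative on the first step are not specified in the patent.

```python
class FeedbackOperationAmount:
    """Discrete PID on the speed error dv, stepped every t_dr seconds."""

    def __init__(self, t_dr):
        self.t_dr = t_dr
        self.kp = self.ki = self.kd = 0.0
        self.integral = 0.0
        self.prev_dv = None

    def set_gains(self, kp, ki, kd):
        """Called once per inference cycle Tnn with the newly inferred gains."""
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, v_ref, v_det):
        """Called once per control cycle Tdr; returns the FB change amount."""
        dv = v_ref - v_det  # assumed sign convention
        self.integral += dv * self.t_dr
        if self.prev_dv is None:
            derivative = 0.0  # no previous sample on the first step
        else:
            derivative = (dv - self.prev_dv) / self.t_dr
        self.prev_dv = dv
        return self.kp * dv + self.ki * self.integral + self.kd * derivative
```

Between two calls to `set_gains`, every call to `step` uses the same gains, which mirrors the way the latest inferred Kp, Ki, and Kd are held over one inference cycle.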

The operation complement unit 24 receives the inferred operation θ_FF from the operation content inference unit 41 every inference cycle Tnn.
The operation complement unit 24 adds the adjustment amount θ_FB calculated by the feedback operation amount calculation unit 26 to the latest inferred operation θ_FF, which it receives every inference cycle Tnn, to calculate the adjusted operation θ_ref, that is, the change amount θ_ref that is actually used. As described above, the operation complement unit 24 calculates the adjusted operation θ_ref at the control cycle Tdr, which is shorter than the inference cycle Tnn.
The operation complement unit 24 transmits this adjusted operation θ_ref to the drive robot 4. The drive robot 4 changes the accelerator opening by driving the actuator 4c on the basis of the adjusted operation θ_ref to operate the accelerator pedal 2c.

In this way, the vehicle operation control unit 22 adjusts the operation θ_FF using the adjustment coefficients Kp, Ki, and Kd to generate the adjusted operation θ_ref, and controls the drive robot 4 on the basis of the adjusted operation θ_ref. The operation θ_FF and the adjustment coefficients Kp, Ki, and Kd are inferred and updated at the inference cycle Tnn, which is longer than the control cycle Tdr.
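The control law described in the preceding paragraphs can be summarized in equation form. The relations for F_ref and θ_ref follow directly from the text; the continuous-time PID expression, the sign of dv, and the integration from the start of the run are textbook assumptions, since the patent states only that PID control with the gains Kp, Ki, and Kd is applied to the vehicle speed error. Here t_k denotes the most recent inference time not later than t.

```latex
\begin{aligned}
F_{\mathrm{ref}}(t) &= F_x(t) + F_{\mathrm{RL}}(t), \\
dv(t) &= v_{\mathrm{ref}}(t) - v_{\mathrm{det}}(t), \\
\theta_{\mathrm{FB}}(t) &= K_p(t_k)\, dv(t)
  + K_i(t_k) \int_{0}^{t} dv(\tau)\, d\tau
  + K_d(t_k)\, \frac{d}{dt}\, dv(t), \\
\theta_{\mathrm{ref}}(t) &= \theta_{\mathrm{FF}}(t_k) + \theta_{\mathrm{FB}}(t),
\qquad t_k \le t < t_k + T_{nn}.
\end{aligned}
```

The first and last lines are evaluated at every control time (interval Tdr), while the quantities indexed by t_k change only once per inference cycle Tnn.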

Next, the learning unit 30 will be described.
As described above, the operation content inference unit 41 infers, on the basis of the traveling state at a given time, the operation θ_FF of the vehicle 2 after that time. To infer the operation θ_FF effectively, the operation content inference unit 41 includes a machine learner, as described later, and generates the operation inference learning model 50 by reinforcement learning of the machine learner based on a reward that is calculated from the traveling state at a time after the drive robot 4 has been operated according to the inferred operation θ_FF. When the vehicle 2 is actually run under control for performance measurement, the operation content inference unit 41 infers the operation θ_FF of the vehicle 2 using the operation inference learning model 50 for which this learning has been completed.
Likewise, as described above, the adjustment coefficient inference unit 45 infers, on the basis of the traveling state at a given time, the adjustment coefficients Kp, Ki, and Kd for the vehicle 2 after that time. To infer the adjustment coefficients Kp, Ki, and Kd effectively, the adjustment coefficient inference unit 45 includes a machine learner, as described later, and generates the adjustment coefficient inference learning model 70 by reinforcement learning of the machine learner based on a reward that is calculated from the traveling state at a time after the drive robot 4 has been operated according to the inferred adjustment coefficients Kp, Ki, and Kd. When the vehicle 2 is actually run under control for performance measurement, the adjustment coefficient inference unit 45 infers the adjustment coefficients Kp, Ki, and Kd for the vehicle 2 using the adjustment coefficient inference learning model 70 for which this learning has been completed.
That is, the control device 11 performs two broad kinds of operation: learning of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd during reinforcement learning, and inference of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd when the vehicle 2 is run under control for performance measurement. For simplicity, the following first describes each component of the control device 11 during learning of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd, and then describes the behavior of each component when the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd are inferred for performance measurement of the vehicle 2.
In FIG. 2, the transmission and reception of data related to the learning models 50 and 70 during their learning is shown by broken lines.

まず、操作θ_ＦＦ及び調整係数Ｋｐ、Ｋｉ、Ｋｄの学習時における、学習部３０の構成要素の挙動を説明する。
指令車速生成部３１は、モードに関する情報に基づいて生成された、指令車速ｖ_ｒｅｆを保持している。指令車速生成部３１は指令車速ｖ_ｒｅｆを、車両操作制御部２２と推論データ成形部３２に送信する。
既に説明したように、車両操作制御部２２は、指令車速生成部３１から受信した指令車速ｖ_ｒｅｆを基にドライブロボット４を制御して車両２を走行させる。駆動状態取得部２３は、要求駆動力Ｆ_ｒｅｆ、検出エンジン回転数ｎ_ｄｅｔ、検出エンジン温度ｄ_ｄｅｔ、及び検出車速（車速）ｖ_ｄｅｔを収集し、推論データ成形部３２へ送信する。
推論データ成形部３２は、駆動状態取得部２３から要求駆動力Ｆ_ｒｅｆ、検出エンジン回転数ｎ_ｄｅｔ、検出エンジン温度ｄ_ｄｅｔ、及び検出車速ｖ_ｄｅｔを受信する。また、推論データ成形部３２は、指令車速生成部３１から指令車速ｖ_ｒｅｆを受信する。推論データ成形部３２は、これらを併せて走行状態とし、適切に成形した後に、強化学習部４０の操作内容推論部４１と調整係数推論部４５に送信する。 First, the behavior of the components of the learning unit 30 during learning of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd will be described.
The command vehicle speed generation unit 31 holds the command vehicle speed v_ref generated based on information about the mode. The command vehicle speed generation unit 31 transmits the command vehicle speed v_ref to the vehicle operation control unit 22 and the inference data forming unit 32.
As already described, the vehicle operation control unit 22 controls the drive robot 4 based on the command vehicle speed v_ref received from the command vehicle speed generation unit 31 to drive the vehicle 2. The driving state acquisition unit 23 collects the required driving force F_ref, the detected engine speed n_det, the detected engine temperature d_det, and the detected vehicle speed (vehicle speed) v_det, and transmits them to the inference data forming unit 32.
The inference data forming unit 32 receives the required driving force F_ref, the detected engine speed n_det, the detected engine temperature d_det, and the detected vehicle speed v_det from the driving state acquisition unit 23. It also receives the command vehicle speed v_ref from the command vehicle speed generation unit 31. The inference data forming unit 32 combines these into the traveling state, forms them appropriately, and then transmits them to the operation content inference unit 41 and the adjustment coefficient inference unit 45 of the reinforcement learning unit 40.

操作内容推論部４１は、走行状態を受信すると、これを基に、学習中の操作推論学習モデル５０により、車両２を指令車速ｖ_ｒｅｆに従って走行させるための、車両２の操作θ_ＦＦを推論する。この操作θ_ＦＦは、操作内容推論部４１が次の推論を実行している推論周期Ｔｎｎの間は更新されないため、次の推論周期Ｔｎｎの間のドライブロボット４の制御に継続して使用される。
本実施形態においては、操作推論学習モデル５０は、走行状態の各々に対応する入力ノードを備えた入力層と、複数の中間層、及び車両２の操作θ_ＦＦに対応する出力ノードを備えた、ニューラルネットワークである。
入力ノードの各々に、対応する走行状態の値が入力されると、重みを基にした演算がなされて、入力ノードの次の段として設けられた中間層の、中間ノードの各々に、演算結果が格納される。このような演算と、次の段の中間ノードへの演算結果の格納が、各中間層に対して順次実行される。最終的には、最終段の中間層内の中間ノードに格納された演算結果を基に、同様な演算がなされ、その結果が車両２の操作θ_ＦＦとして出力ノードに格納される。
操作内容推論部４１は、このようにして生成された車両２の操作θ_ＦＦを、車両操作制御部２２に送信する。 Upon receiving the traveling state, the operation content inference unit 41 uses the operation inference learning model 50, which is still being trained, to infer the operation θ_FF of the vehicle 2 for driving the vehicle 2 in accordance with the command vehicle speed v_ref. Because this operation θ_FF is not updated during the inference cycle Tnn in which the operation content inference unit 41 is executing the next inference, it continues to be used for controlling the drive robot 4 throughout the next inference cycle Tnn.
In the present embodiment, the operation inference learning model 50 is a neural network that includes an input layer having input nodes corresponding to the respective elements of the traveling state, a plurality of intermediate layers, and an output node corresponding to the operation θ_FF of the vehicle 2.
When the corresponding traveling state value is input to each input node, a weight-based calculation is performed, and the result is stored in each intermediate node of the intermediate layer that follows the input layer. This calculation, and the storage of its result in the intermediate nodes of the next stage, are executed sequentially for each intermediate layer. Finally, a similar calculation is performed on the results stored in the intermediate nodes of the last intermediate layer, and the result is stored in the output node as the operation θ_FF of the vehicle 2.
The operation content inference unit 41 transmits the operation θ_FF of the vehicle 2 generated in this way to the vehicle operation control unit 22.
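The layer-by-layer calculation described above can be sketched in Python as follows. This is only an illustration: the names `dense` and `forward`, the tanh activation, and the linear output node are assumptions, since the description specifies only a weight-based calculation per layer.

```python
import math
from typing import List, Sequence, Tuple

# One layer: (weights[out][in], biases[out]). This layout is an assumption.
Layer = Tuple[List[List[float]], List[float]]


def dense(inputs: Sequence[float], layer: Layer) -> List[float]:
    """Weighted sum plus bias for every node of one layer."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(state: Sequence[float],
            hidden: Sequence[Layer],
            output: Layer) -> List[float]:
    """Propagate the traveling state through each intermediate layer in turn."""
    values = list(state)
    for layer in hidden:
        # tanh is an assumed activation; the description does not name one.
        values = [math.tanh(v) for v in dense(values, layer)]
    # Linear output node(s), also an assumption.
    return dense(values, output)
```

With one output row the result plays the role of θ_FF; a network with three output rows would have the shape described for Kp, Ki, and Kd.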

同様に、調整係数推論部４５は、走行状態を受信すると、これを基に、学習中の調整係数推論学習モデル７０により、車両２を指令車速ｖ_ｒｅｆに従って走行させるための、操作内容推論部４１により推論された車両２の操作θ_ＦＦに適用される調整係数Ｋｐ、Ｋｉ、Ｋｄを推論する。この調整係数Ｋｐ、Ｋｉ、Ｋｄは、調整係数推論部４５が次の推論を実行している推論周期Ｔｎｎの間は更新されないため、次の推論周期Ｔｎｎの間のドライブロボット４の制御に継続して使用される。
本実施形態においては、調整係数推論学習モデル７０は、走行状態の各々に対応する入力ノードを備えた入力層と、複数の中間層、及び調整係数Ｋｐ、Ｋｉ、Ｋｄの各々に対応する出力ノードを備えた、ニューラルネットワークである。
入力ノードの各々に、対応する走行状態の値が入力されると、重みを基にした演算がなされて、入力ノードの次の段として設けられた中間層の、中間ノードの各々に、演算結果が格納される。このような演算と、次の段の中間ノードへの演算結果の格納が、各中間層に対して順次実行される。最終的には、最終段の中間層内の中間ノードに格納された演算結果を基に、同様な演算がなされ、その結果が調整係数Ｋｐ、Ｋｉ、Ｋｄとして出力ノードに格納される。
調整係数推論部４５は、このようにして生成された調整係数Ｋｐ、Ｋｉ、Ｋｄを、車両操作制御部２２に送信する。 Similarly, upon receiving the traveling state, the adjustment coefficient inference unit 45 uses the adjustment coefficient inference learning model 70, which is still being trained, to infer the adjustment coefficients Kp, Ki, and Kd that are applied to the operation θ_FF of the vehicle 2 inferred by the operation content inference unit 41, for driving the vehicle 2 in accordance with the command vehicle speed v_ref. Because these adjustment coefficients Kp, Ki, and Kd are not updated during the inference cycle Tnn in which the adjustment coefficient inference unit 45 is executing the next inference, they continue to be used for controlling the drive robot 4 throughout the next inference cycle Tnn.
In the present embodiment, the adjustment coefficient inference learning model 70 is a neural network that includes an input layer having input nodes corresponding to the respective elements of the traveling state, a plurality of intermediate layers, and output nodes corresponding to the respective adjustment coefficients Kp, Ki, and Kd.
When the corresponding traveling state value is input to each input node, a weight-based calculation is performed, and the result is stored in each intermediate node of the intermediate layer that follows the input layer. This calculation, and the storage of its result in the intermediate nodes of the next stage, are executed sequentially for each intermediate layer. Finally, a similar calculation is performed on the results stored in the intermediate nodes of the last intermediate layer, and the results are stored in the output nodes as the adjustment coefficients Kp, Ki, and Kd.
The adjustment coefficient inference unit 45 transmits the adjustment coefficients Kp, Ki, and Kd generated in this way to the vehicle operation control unit 22.

上記のような、操作内容推論部４１と調整係数推論部４５における、車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄの推論は、推論周期Ｔｎｎごとに行われる。操作推論学習モデル５０と調整係数推論学習モデル７０の各々は、一度の推論で、次の推論周期Ｔｎｎの間に使用される車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄのみを推論し、より将来の推論は行わない。更に次の推論周期Ｔｎｎに使用される車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄは、次の推論において導出される。
車両操作制御部２２は、これらの推論された車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを、推論周期Ｔｎｎごとに受信して更新する。車両操作制御部２２は、次の推論周期Ｔｎｎ後の時刻までの間、更新された最新の車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを基に、刻々と変化する走行状態を随時入力して調整後操作θ_ｒｅｆを生成し、調整後操作θ_ｒｅｆに基づきドライブロボット４を制御する。
操作推論学習モデル５０と調整係数推論学習モデル７０の学習、すなわち誤差逆伝搬法、確率的勾配降下法によるニューラルネットワークを構成する各パラメータの値の調整は、現段階においては行われず、操作推論学習モデル５０と調整係数推論学習モデル７０は車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを推論するのみである。操作推論学習モデル５０と調整係数推論学習モデル７０の学習は、後に、第１及び第２行動価値推論学習モデル６０、８０の学習に伴って行われる。 The inference of the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd by the operation content inference unit 41 and the adjustment coefficient inference unit 45 described above is performed every inference cycle Tnn. In a single inference, each of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 infers only the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd to be used during the next inference cycle Tnn, and does not infer further into the future. The operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd to be used in the inference cycle Tnn after that are derived in the next inference.
The vehicle operation control unit 22 receives the inferred operation θ_FF of the vehicle 2 and the inferred adjustment coefficients Kp, Ki, and Kd every inference cycle Tnn and updates its stored values. Until the time one inference cycle Tnn later, the vehicle operation control unit 22 takes in the continually changing traveling state as needed, generates the adjusted operation θ_ref based on the latest updated operation θ_FF and adjustment coefficients Kp, Ki, and Kd, and controls the drive robot 4 based on the adjusted operation θ_ref.
The learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70, that is, the adjustment of the parameter values of the neural networks by error backpropagation and stochastic gradient descent, is not performed at this stage; the operation inference learning model 50 and the adjustment coefficient inference learning model 70 only infer the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd. Their learning is performed later, in conjunction with the learning of the first and second action value inference learning models 60 and 80.
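The relationship between the two cycles can be sketched as a nested loop. This is a simplified illustration under stated assumptions: the function and parameter names are hypothetical, Tnn is assumed to be an integer multiple of Tdr (`steps_per_inference`), and inference is treated as instantaneous, whereas in the description the next inference runs while the previous result is still in use.

```python
from typing import Callable, Tuple

Gains = Tuple[float, float, float]  # (Kp, Ki, Kd)


def run_two_rate(
    n_inference_cycles: int,
    steps_per_inference: int,
    infer: Callable[[float], Tuple[float, Gains]],
    read_state: Callable[[], float],
    control: Callable[[float, Gains, float], None],
) -> None:
    """Hold theta_FF and the gains for one Tnn while controlling every Tdr."""
    for _ in range(n_inference_cycles):
        # Updated once per inference cycle Tnn.
        theta_ff, gains = infer(read_state())
        for _ in range(steps_per_inference):
            # Executed once per control cycle Tdr with the latest state.
            control(theta_ff, gains, read_state())
```

The point of the structure is that the slow neural network inference does not limit the rate at which the drive robot 4 is commanded.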

操作推論学習モデル５０と調整係数推論学習モデル７０の推論結果を基にドライブロボット４が制御された結果、車両２の走行状態が変更される。駆動状態取得部２３は、この変更後の走行状態を、車両２の操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄが適用された後の、次の走行状態として取得する。
報酬計算部４３は、操作推論学習モデル５０と調整係数推論学習モデル７０の強化学習に使用される報酬を計算する。
より詳細には、操作報酬計算部４４は、走行状態と、これに対応して操作推論学習モデル５０により推論された操作θ_ＦＦ、及び当該操作θ_ＦＦを基に新たに生成された次の走行状態を基に、適切に設計された式により報酬を計算する。また、調整係数報酬計算部４７は、走行状態と、これに対応して調整係数推論学習モデル７０により推論された調整係数Ｋｐ、Ｋｉ、Ｋｄ、及び当該調整係数Ｋｐ、Ｋｉ、Ｋｄを基に新たに生成された次の走行状態を基に、適切に設計された式により報酬を計算する。
本実施形態においては、推論周期Ｔｎｎよりも制御周期Ｔｄｒが短いため、推論周期Ｔｎｎの間に制御周期Ｔｄｒ間隔で複数回、ドライブロボット４が制御される。これに伴い、本実施形態における報酬は、この複数回の各制御の後における指令車速ｖ_ｒｅｆと検出車速ｖ_ｄｅｔの誤差を平均した値の、絶対値として設定されている。すなわち、本実施形態においては、上記のような絶対値を計算し、これが０に近いほど、高い報酬となるように設計されている。
後述する第１及び第２行動価値推論学習モデル６０、８０は、行動価値を、報酬が小さいほどこれが高くするように計算し、操作推論学習モデル５０と調整係数推論学習モデル７０はこれらの行動価値が高くなるような操作θ_ＦＦや調整係数Ｋｐ、Ｋｉ、Ｋｄを出力するように、強化学習が行われる。 As a result of the drive robot 4 being controlled based on the inference results of the operation inference learning model 50 and the adjustment coefficient inference learning model 70, the traveling state of the vehicle 2 changes. The driving state acquisition unit 23 acquires this changed traveling state as the next traveling state, after the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd have been applied.
The reward calculation unit 43 calculates the rewards used for reinforcement learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70.
More specifically, the operation reward calculation unit 44 calculates a reward by an appropriately designed formula, based on the traveling state, the operation θ_FF inferred for it by the operation inference learning model 50, and the next traveling state newly produced by that operation θ_FF. Likewise, the adjustment coefficient reward calculation unit 47 calculates a reward by an appropriately designed formula, based on the traveling state, the adjustment coefficients Kp, Ki, and Kd inferred for it by the adjustment coefficient inference learning model 70, and the next traveling state newly produced by those adjustment coefficients Kp, Ki, and Kd.
In the present embodiment, because the control cycle Tdr is shorter than the inference cycle Tnn, the drive robot 4 is controlled a plurality of times, at intervals of the control cycle Tdr, within one inference cycle Tnn. Accordingly, the reward in the present embodiment is set as the absolute value of the average of the errors between the command vehicle speed v_ref and the detected vehicle speed v_det after each of these controls. That is, in the present embodiment, this absolute value is calculated, and the design is such that the closer it is to 0, the higher the reward.
The first and second action value inference learning models 60 and 80, described later, calculate the action value so that it is higher the smaller the reward (that is, the absolute value described above) is, and reinforcement learning is performed so that the operation inference learning model 50 and the adjustment coefficient inference learning model 70 output an operation θ_FF and adjustment coefficients Kp, Ki, and Kd that increase these action values.
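The reward value described above, the absolute value of the averaged speed error over the control steps of one inference cycle, can be written out as a short sketch. The function name and the list-based inputs are assumptions made for illustration.

```python
from typing import Sequence


def tracking_reward_value(v_ref: Sequence[float],
                          v_det: Sequence[float]) -> float:
    """Absolute value of the mean error between command and detected speed.

    Each pair (v_ref[i], v_det[i]) is taken after one control step within a
    single inference cycle Tnn. A result closer to 0 means better tracking.
    """
    if len(v_ref) != len(v_det) or not v_ref:
        raise ValueError("need equally long, non-empty speed sequences")
    errors = [r - d for r, d in zip(v_ref, v_det)]
    # The errors are averaged first and the absolute value is taken afterwards.
    return abs(sum(errors) / len(errors))
```

Because averaging precedes the absolute value, errors of opposite sign within one inference cycle cancel: `tracking_reward_value([10.0, 10.0], [9.0, 11.0])` evaluates to 0.0.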

操作報酬計算部４４は、走行状態、これに対応して推論された操作θ_ＦＦ、当該操作θ_ＦＦを基に新たに生成された次の走行状態、及び計算した報酬を、学習データ成形部３３に送信する。学習データ成形部３３は、これらを適切に成形して学習データ記憶部３５に保存する。これらのデータは、後述する第１行動価値推論学習モデル６０の学習に使用される。
また、調整係数推論部４５は、走行状態、これに対応して推論された調整係数Ｋｐ、Ｋｉ、Ｋｄ、当該調整係数Ｋｐ、Ｋｉ、Ｋｄを基に新たに生成された次の走行状態、及び計算した報酬を、学習データ成形部３３に送信する。学習データ成形部３３は、これらを適切に成形して学習データ記憶部３５に保存する。これらのデータは、後述する第２行動価値推論学習モデル８０の学習に使用される。
このようにして、操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄの推論と、この推論結果に対応した、次の走行状態の取得、及び報酬の計算が、第１及び第２行動価値推論学習モデル６０、８０の学習に十分なデータが蓄積されるまで、繰り返し行われる。 The operation reward calculation unit 44 transmits the traveling state, the operation θ_FF inferred for it, the next traveling state newly produced by that operation θ_FF, and the calculated reward to the learning data forming unit 33. The learning data forming unit 33 forms these appropriately and stores them in the learning data storage unit 35. These data are used for the learning of the first action value inference learning model 60, described later.
In addition, the adjustment coefficient inference unit 45 transmits the traveling state, the adjustment coefficients Kp, Ki, and Kd inferred for it, the next traveling state newly produced by those adjustment coefficients Kp, Ki, and Kd, and the calculated reward to the learning data forming unit 33. The learning data forming unit 33 forms these appropriately and stores them in the learning data storage unit 35. These data are used for the learning of the second action value inference learning model 80, described later.
In this way, the inference of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd, the acquisition of the next traveling state corresponding to the inference result, and the calculation of the reward are repeated until sufficient data has accumulated for the learning of the first and second action value inference learning models 60 and 80.
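One way to picture the stored learning data is as a list of (traveling state, action, next traveling state, reward) records, kept separately for the operation and for the adjustment coefficients. The class and method names below are hypothetical, and the fixed sample-count threshold is an assumed stand-in for "sufficient data".

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    state: Tuple[float, ...]
    action: Tuple[float, ...]      # (theta_FF,) or (Kp, Ki, Kd)
    next_state: Tuple[float, ...]
    reward: float


@dataclass
class LearningDataStore:
    min_samples: int  # assumed criterion for "sufficient data"
    operation_data: List[Transition] = field(default_factory=list)
    coefficient_data: List[Transition] = field(default_factory=list)

    def add_operation(self, state: Sequence[float], theta_ff: float,
                      next_state: Sequence[float], reward: float) -> None:
        self.operation_data.append(
            Transition(tuple(state), (theta_ff,), tuple(next_state), reward))

    def add_coefficients(self, state: Sequence[float], gains: Sequence[float],
                         next_state: Sequence[float], reward: float) -> None:
        self.coefficient_data.append(
            Transition(tuple(state), tuple(gains), tuple(next_state), reward))

    def ready(self) -> bool:
        """True once both data sets hold at least min_samples records."""
        return (len(self.operation_data) >= self.min_samples
                and len(self.coefficient_data) >= self.min_samples)
```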

学習データ記憶部３５に、第１行動価値推論学習モデル６０の学習に十分な量の走行データが蓄積されると、第１行動価値推論部４２は第１行動価値推論学習モデル６０を学習する。第１行動価値推論学習モデル６０は、機械学習器が学習されることにより、人工知能ソフトウェアの一部であるプログラムモジュールとして利用される、適切な学習パラメータが学習された学習済みモデルとなる。
強化学習部４０は全体として、操作推論学習モデル５０が推論した操作θ_ＦＦがどの程度適切であったかを示す行動価値を計算し、操作推論学習モデル５０が、この行動価値が高くなるような操作θ_ＦＦを出力するように、強化学習を行う。行動価値は、走行状態と、これに対する操作θ_ＦＦを引数として、報酬が大きいほど行動価値を高くするように設計された関数として表わされる。本実施形態においては、この関数の計算を、走行状態と操作θ_ＦＦを入力として、行動価値を出力するように設計された、関数近似器としての第１行動価値推論学習モデル６０により行う。 When a sufficient amount of traveling data for the learning of the first action value inference learning model 60 has accumulated in the learning data storage unit 35, the first action value inference unit 42 trains the first action value inference learning model 60. Through the training of the machine learner, the first action value inference learning model 60 becomes a trained model with appropriately learned parameters, used as a program module that is part of artificial intelligence software.
As a whole, the reinforcement learning unit 40 calculates an action value indicating how appropriate the operation θ_FF inferred by the operation inference learning model 50 was, and performs reinforcement learning so that the operation inference learning model 50 outputs an operation θ_FF that increases this action value. The action value is expressed as a function that takes the traveling state and the operation θ_FF for it as arguments and is designed so that the larger the reward, the higher the action value. In the present embodiment, this function is calculated by the first action value inference learning model 60, a function approximator designed to take the traveling state and the operation θ_FF as inputs and output the action value.

操作学習データ生成部３４は、学習データ記憶部３５内の学習データを成形して、第１行動価値推論部４２へ送信する。
第１行動価値推論部４２は、成形された学習データを受信し、第１行動価値推論学習モデル６０を機械学習させる。
本実施形態においては、第１行動価値推論学習モデル６０は、走行状態と操作θ_ＦＦの各々に対応する入力ノードを備えた入力層と、複数の中間層、及び操作θ_ＦＦに関する行動価値に対応する出力ノードを備えた、ニューラルネットワークである。第１行動価値推論学習モデル６０は、操作推論学習モデル５０と同様な構造のニューラルネットワークにより実現されているため、構造上の詳細な説明を割愛する。 The operation learning data generation unit 34 forms the learning data in the learning data storage unit 35 and transmits it to the first action value inference unit 42.
The first action value inference unit 42 receives the formed learning data and performs machine learning of the first action value inference learning model 60.
In the present embodiment, the first action value inference learning model 60 is a neural network that includes an input layer having input nodes corresponding to the respective elements of the traveling state and to the operation θ_FF, a plurality of intermediate layers, and an output node corresponding to the action value for the operation θ_FF. Because the first action value inference learning model 60 is implemented as a neural network with a structure similar to that of the operation inference learning model 50, a detailed description of its structure is omitted.

操作報酬計算部４４は、ＴＤ（ＴｅｍｐｏｒａｌＤｉｆｆｅｒｅｎｃｅ）誤差、すなわち、操作θ_ＦＦを基にした制御を行う前の行動価値と、制御後の行動価値の誤差を小さくして、行動価値として適切な値が出力されるように、重みやバイアスの値等、ニューラルネットワークを構成する各パラメータの値を、誤差逆伝搬法、確率的勾配降下法により調整する。このように、現状の操作推論学習モデル５０によって推論された操作θ_ＦＦを適切に評価できるように、第１行動価値推論学習モデル６０を学習させる。 The operation reward calculation unit 44 adjusts the parameter values of the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent so as to reduce the TD (Temporal Difference) error, that is, the error between the action value before the control based on the operation θ_FF is performed and the action value after that control, so that an appropriate value is output as the action value. In this way, the first action value inference learning model 60 is trained so that it can appropriately evaluate the operation θ_FF inferred by the current operation inference learning model 50.
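As a rough illustration of reducing a TD error by gradient steps, the sketch below uses the textbook one-step TD form with a linear critic in place of the neural network. Several things here are assumptions rather than details from the description: the discount factor `gamma`, the inclusion of the reward term in the error, a reward convention in which larger is better (for example, the negative of the absolute speed error), the linear feature representation, and all function names.

```python
from typing import List, Sequence, Tuple


def td_error(q_before: float, reward: float, q_after: float,
             gamma: float = 0.99) -> float:
    """One-step TD error: reward + gamma * Q(after) - Q(before)."""
    return reward + gamma * q_after - q_before


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def critic_step(weights: Sequence[float],
                features: Sequence[float],
                reward: float,
                next_features: Sequence[float],
                gamma: float,
                lr: float) -> Tuple[List[float], float]:
    """Semi-gradient update of a linear critic Q = weights . features."""
    q_before = dot(weights, features)
    q_after = dot(weights, next_features)
    delta = td_error(q_before, reward, q_after, gamma)
    # Move Q(before) toward the TD target; the target is treated as fixed.
    new_weights = [w + lr * delta * x for w, x in zip(weights, features)]
    return new_weights, delta
```

In the described device the same idea would be applied to the weights and biases of the neural network through error backpropagation; the linear critic only keeps the example short.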

また、同様に、学習データ記憶部３５に、調整係数推論学習モデル７０の学習に十分な量の走行データが蓄積されると、調整係数推論部４５は第２行動価値推論学習モデル８０を学習する。第２行動価値推論学習モデル８０は、機械学習器が学習されることにより、人工知能ソフトウェアの一部であるプログラムモジュールとして利用される、適切な学習パラメータが学習された学習済みモデルとなる。
強化学習部４０は全体として、調整係数推論学習モデル７０が推論した調整係数Ｋｐ、Ｋｉ、Ｋｄがどの程度適切であったかを示す行動価値を計算し、調整係数推論学習モデル７０が、この行動価値が高くなるような調整係数Ｋｐ、Ｋｉ、Ｋｄを出力するように、強化学習を行う。行動価値は、走行状態と、これに対する調整係数Ｋｐ、Ｋｉ、Ｋｄを引数として、報酬が大きいほど行動価値を高くするように設計された関数として表わされる。本実施形態においては、この関数の計算を、走行状態と調整係数Ｋｐ、Ｋｉ、Ｋｄを入力として、行動価値を出力するように設計された、関数近似器としての第２行動価値推論学習モデル８０により行う。 Similarly, when a sufficient amount of traveling data for the learning of the adjustment coefficient inference learning model 70 has accumulated in the learning data storage unit 35, the adjustment coefficient inference unit 45 trains the second action value inference learning model 80. Through the training of the machine learner, the second action value inference learning model 80 becomes a trained model with appropriately learned parameters, used as a program module that is part of artificial intelligence software.
As a whole, the reinforcement learning unit 40 calculates an action value indicating how appropriate the adjustment coefficients Kp, Ki, and Kd inferred by the adjustment coefficient inference learning model 70 were, and performs reinforcement learning so that the adjustment coefficient inference learning model 70 outputs adjustment coefficients Kp, Ki, and Kd that increase this action value. The action value is expressed as a function that takes the traveling state and the adjustment coefficients Kp, Ki, and Kd for it as arguments and is designed so that the larger the reward, the higher the action value. In the present embodiment, this function is calculated by the second action value inference learning model 80, a function approximator designed to take the traveling state and the adjustment coefficients Kp, Ki, and Kd as inputs and output the action value.

調整係数学習データ生成部３６は、学習データ記憶部３５内の学習データを成形して、第２行動価値推論部４６へ送信する。
第２行動価値推論部４６は、成形された学習データを受信し、第２行動価値推論学習モデル８０を機械学習させる。
本実施形態においては、第２行動価値推論学習モデル８０は、走行状態と調整係数Ｋｐ、Ｋｉ、Ｋｄの各々に対応する入力ノードを備えた入力層と、複数の中間層、及び調整係数Ｋｐ、Ｋｉ、Ｋｄに関する行動価値に対応する出力ノードを備えた、ニューラルネットワークである。第２行動価値推論学習モデル８０は、調整係数推論学習モデル７０と同様な構造のニューラルネットワークにより実現されているため、構造上の詳細な説明を割愛する。 The adjustment coefficient learning data generation unit 36 forms the learning data in the learning data storage unit 35 and transmits it to the second action value inference unit 46.
The second action value inference unit 46 receives the formed learning data and performs machine learning of the second action value inference learning model 80.
In the present embodiment, the second action value inference learning model 80 is a neural network that includes an input layer having input nodes corresponding to the respective elements of the traveling state and to each of the adjustment coefficients Kp, Ki, and Kd, a plurality of intermediate layers, and an output node corresponding to the action value for the adjustment coefficients Kp, Ki, and Kd. Because the second action value inference learning model 80 is implemented as a neural network with a structure similar to that of the adjustment coefficient inference learning model 70, a detailed description of its structure is omitted.

調整係数推論部４５は、ＴＤ誤差、すなわち、調整係数Ｋｐ、Ｋｉ、Ｋｄを基にした制御を行う前の行動価値と、制御後の行動価値の誤差を小さくして、行動価値として適切な値が出力されるように、重みやバイアスの値等、ニューラルネットワークを構成する各パラメータの値を、誤差逆伝搬法、確率的勾配降下法により調整する。このように、現状の調整係数推論学習モデル７０によって推論された調整係数Ｋｐ、Ｋｉ、Ｋｄを適切に評価できるように、第２行動価値推論学習モデル８０を学習させる。 The adjustment coefficient inference unit 45 adjusts the parameter values of the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent so as to reduce the TD error, that is, the error between the action value before the control based on the adjustment coefficients Kp, Ki, and Kd is performed and the action value after that control, so that an appropriate value is output as the action value. In this way, the second action value inference learning model 80 is trained so that it can appropriately evaluate the adjustment coefficients Kp, Ki, and Kd inferred by the current adjustment coefficient inference learning model 70.

第１及び第２行動価値推論学習モデル６０、８０の学習が進むと、第１及び第２行動価値推論学習モデル６０、８０の各々は、より適切な行動価値の値を出力するようになる。すなわち、第１及び第２行動価値推論学習モデル６０、８０の各々が出力する行動価値の値が学習前とは変わるため、これに伴い、行動価値が高くなるような操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを出力するように設計された操作推論学習モデル５０と調整係数推論学習モデル７０の各々を更新する必要がある。このため、操作内容推論部４１と調整係数推論部４５は、操作推論学習モデル５０と調整係数推論学習モデル７０を学習する。
具体的には、操作内容推論部４１と調整係数推論部４５の各々は、例えば行動価値の負値を損失関数とし、これをできるだけ小さくするような、すなわち行動価値が大きくなるような操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを出力するように、重みやバイアスの値等、ニューラルネットワークを構成する各パラメータの値を、誤差逆伝搬法、確率的勾配降下法により調整して、操作推論学習モデル５０と調整係数推論学習モデル７０の各々を学習させる。
操作推論学習モデル５０と調整係数推論学習モデル７０の各々が学習され更新されると、出力される操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄが変化するため、再度走行データを蓄積し、これを基に第１及び第２行動価値推論学習モデル６０、８０を学習する。
このように、学習部３０は、操作推論学習モデル５０及び調整係数推論学習モデル７０と、第１及び第２行動価値推論学習モデル６０、８０との学習を互いに繰り返すことにより、これら学習モデル５０、６０、７０、８０を強化学習する。
学習部３０は、この強化学習を、所定の学習終了基準を満たすまで実行する。 As the learning of the first and second action value inference learning models 60 and 80 progresses, each of them comes to output more appropriate action values. That is, because the action values output by each of the first and second action value inference learning models 60 and 80 differ from those before learning, the operation inference learning model 50 and the adjustment coefficient inference learning model 70, which are designed to output an operation θ_FF and adjustment coefficients Kp, Ki, and Kd that increase the action value, must be updated accordingly. The operation content inference unit 41 and the adjustment coefficient inference unit 45 therefore train the operation inference learning model 50 and the adjustment coefficient inference learning model 70.
Specifically, each of the operation content inference unit 41 and the adjustment coefficient inference unit 45 uses, for example, the negative of the action value as a loss function, and adjusts the parameter values of the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent so as to output an operation θ_FF and adjustment coefficients Kp, Ki, and Kd that make this loss as small as possible, that is, that make the action value large. In this way, each of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 is trained.
When the operation inference learning model 50 and the adjustment coefficient inference learning model 70 have been trained and updated, the output operation θ_FF and adjustment coefficients Kp, Ki, and Kd change, so traveling data is accumulated again and the first and second action value inference learning models 60 and 80 are trained on it.
In this way, the learning unit 30 performs reinforcement learning of the learning models 50, 60, 70, and 80 by alternately repeating the learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 and the learning of the first and second action value inference learning models 60 and 80.
The learning unit 30 executes this reinforcement learning until a predetermined learning end criterion is satisfied.
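Using the negative of the action value as the loss can be illustrated with a deliberately tiny example. Everything in it is hypothetical: the critic is a fixed quadratic stand-in rather than a trained model 60 or 80, the "actor" has a single parameter `w` that it outputs directly as its action, and the gradient is written out by hand instead of being obtained by backpropagation.

```python
def critic_q(action: float) -> float:
    """Hypothetical stand-in for a trained action value model.

    The action value peaks at action = 2.0.
    """
    return -(action - 2.0) ** 2


def actor_loss_grad(w: float) -> float:
    """Gradient of the loss -Q with respect to w, where the action is w."""
    return 2.0 * (w - 2.0)


def train_actor(w: float, lr: float, steps: int) -> float:
    """Plain gradient descent on the loss -Q(action)."""
    for _ in range(steps):
        w -= lr * actor_loss_grad(w)
    return w
```

Starting from `w = 0.0`, repeated steps move the output toward 2.0, the action with the highest action value under this stand-in critic. In the alternating scheme described above, the critic would then be retrained on data collected with the updated actor.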

次に、車両２の性能測定に際して操作θ_ＦＦと調整係数Ｋｐ、Ｋｉ、Ｋｄを推論する場合での、すなわち、操作推論学習モデル５０と調整係数推論学習モデル７０の強化学習が終了した後における、学習部３０の各構成要素の挙動について説明する。 Next, the behavior of each component of the learning unit 30 will be described for the case where the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd are inferred during performance measurement of the vehicle 2, that is, after the reinforcement learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 has finished.

指令車速生成部３１は指令車速ｖ_ｒｅｆを、ドライブロボット制御部２０と推論データ成形部３２に送信する。
ドライブロボット制御部２０は、要求駆動力Ｆ_ｒｅｆ、検出エンジン回転数ｎ_ｄｅｔ、検出エンジン温度ｄ_ｄｅｔ、及び検出車速（車速）ｖ_ｄｅｔを、推論データ成形部３２へ送信する。
推論データ成形部３２は、要求駆動力Ｆ_ｒｅｆ、検出エンジン回転数ｎ_ｄｅｔ、検出エンジン温度ｄ_ｄｅｔ、及び検出車速（車速）ｖ_ｄｅｔ、及び指令車速ｖ_ｒｅｆを走行状態として受信し、適切に成形した後に、強化学習部４０の操作内容推論部４１と調整係数推論部４５に送信する。
操作内容推論部４１は、走行状態を受信すると、これを基に、学習が完了した操作推論学習モデル５０により、次の推論周期Ｔｎｎの間、車両を指令車速ｖ_ｒｅｆに従って走行させるための、車両２の操作θ_ＦＦを推論する。
同様に、調整係数推論部４５は、走行状態を受信すると、これを基に、学習中の調整係数推論学習モデル７０により、次の推論周期Ｔｎｎの間、車両を指令車速ｖ_ｒｅｆに従って走行させるための、操作内容推論部４１により推論された車両２の操作θ_ＦＦに適用される調整係数Ｋｐ、Ｋｉ、Ｋｄを推論する。 The command vehicle speed generation unit 31 transmits the command vehicle speed v_ref to the drive robot control unit 20 and the inference data forming unit 32.
The drive robot control unit 20 transmits the required driving force F_ref, the detected engine speed n_det, the detected engine temperature d_det, and the detected vehicle speed (vehicle speed) v_det to the inference data forming unit 32.
The inference data forming unit 32 receives the required driving force F_ref, the detected engine speed n_det, the detected engine temperature d_det, the detected vehicle speed (vehicle speed) v_det, and the command vehicle speed v_ref as the traveling state, forms them appropriately, and then transmits them to the operation content inference unit 41 and the adjustment coefficient inference unit 45 of the reinforcement learning unit 40.
Upon receiving the traveling state, the operation content inference unit 41 uses the fully trained operation inference learning model 50 to infer the operation θ_FF of the vehicle 2 for driving the vehicle in accordance with the command vehicle speed v_ref during the next inference cycle Tnn.
Similarly, upon receiving the traveling state, the adjustment coefficient inference unit 45 uses the adjustment coefficient inference learning model 70 under training to infer the adjustment coefficients Kp, Ki, and Kd that are applied to the operation θ_FF of the vehicle 2 inferred by the operation content inference unit 41, for driving the vehicle in accordance with the command vehicle speed v_ref during the next inference cycle Tnn.

フィードバック操作量演算部２６は、推論された調整係数Ｋｐ、Ｋｉ、Ｋｄ、すなわち比例ゲインＫｐ、積分ゲインＫｉ、及び微分ゲインＫｄを基に、推論周期Ｔｎｎより短い制御周期Ｔｄｒ間隔で、ＰＩＤ制御により、調整量θ_ＦＢを演算する。この演算において使用される調整係数Ｋｐ、Ｋｉ、Ｋｄは、制御周期Ｔｄｒよりも長い推論周期Ｔｎｎ間隔で、調整係数推論部４５によって推論され、更新される。
操作補完部２４は、フィードバック操作量演算部２６から調整量θ_ＦＢを受信し、推論された車両２の操作θ_ＦＦを基に、推論周期Ｔｎｎより短い制御周期Ｔｄｒ間隔で、調整後操作θ_ｒｅｆを計算する。この演算において使用される車両２の操作θ_ＦＦは、制御周期Ｔｄｒよりも長い推論周期Ｔｎｎ間隔で、操作内容推論部４１によって推論され、更新される。
操作補完部２４は、この調整後操作θ_ｒｅｆを、ドライブロボット４に送信する。ドライブロボット４は、調整後操作θ_ｒｅｆを基にアクチュエータ４ｃを駆動させてアクセルペダル２ｃを操作することにより、アクセル開度を変更する。 Based on the inferred adjustment coefficients Kp, Ki, and Kd, that is, the proportional gain Kp, the integral gain Ki, and the derivative gain Kd, the feedback manipulated variable calculation unit 26 calculates the adjustment amount θ_FB by PID control at intervals of the control cycle Tdr, which is shorter than the inference cycle Tnn. The adjustment coefficients Kp, Ki, and Kd used in this calculation are inferred and updated by the adjustment coefficient inference unit 45 at intervals of the inference cycle Tnn, which is longer than the control cycle Tdr.
The operation complement unit 24 receives the adjustment amount θ_FB from the feedback manipulated variable calculation unit 26 and, based on the inferred operation θ_FF of the vehicle 2, calculates the adjusted operation θ_ref at intervals of the control cycle Tdr, which is shorter than the inference cycle Tnn. The operation θ_FF of the vehicle 2 used in this calculation is inferred and updated by the operation content inference unit 41 at intervals of the inference cycle Tnn, which is longer than the control cycle Tdr.
The operation complement unit 24 transmits this adjusted operation θ_ref to the drive robot 4. The drive robot 4 changes the accelerator opening by driving the actuator 4c based on the adjusted operation θ_ref to operate the accelerator pedal 2c.
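A minimal sketch of this feedback path is given below. It rests on assumptions not stated in this passage: the PID error is taken as v_ref minus v_det, a simple discrete PID form is used, and θ_ref is formed as the sum of θ_FF and θ_FB. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PidFeedback:
    kp: float = 0.0
    ki: float = 0.0
    kd: float = 0.0
    integral: float = 0.0
    prev_error: float = 0.0

    def set_gains(self, kp: float, ki: float, kd: float) -> None:
        """Called once per inference cycle Tnn with the inferred gains."""
        self.kp, self.ki, self.kd = kp, ki, kd

    def step(self, v_ref: float, v_det: float, t_dr: float) -> float:
        """Called once per control cycle Tdr; returns theta_FB."""
        error = v_ref - v_det  # assumed error signal
        self.integral += error * t_dr
        derivative = (error - self.prev_error) / t_dr
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


def adjusted_operation(theta_ff: float, theta_fb: float) -> float:
    """theta_ref as feedforward plus feedback (the sum is an assumption)."""
    return theta_ff + theta_fb
```

For example, with gains (2.0, 0.0, 0.0), a command speed of 10.0, a detected speed of 9.0, and a control cycle of 0.5, `step` returns 2.0, and `adjusted_operation(0.25, 2.0)` gives 2.25. The integral and previous error persist across gain updates, so the controller state is not reset when new gains arrive.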

Next, a method of controlling the drive robot 4 with the control device 11 described above will be explained with reference to FIGS. 1 to 3 and FIGS. 4 and 5. FIG. 4 is a flowchart of the control method of the drive robot 4 during learning. FIG. 5 is a flowchart of the control method of the drive robot 4 when the travel of the vehicle 2 is controlled for performance measurement.
First, the operation during learning will be described with reference to FIG. 4.

When learning is started (step S1), the parameters of the learning models 50, 60, 70, 80, and so on are initialized (step S3).
After that, running data of the vehicle 2 is collected (step S5). More specifically, the control device 11 controls the travel of the vehicle 2 using the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd inferred by the operation inference learning model 50 and the adjustment coefficient inference learning model 70, whose learning is still in progress, and running data is thereby accumulated.

When sufficient running data has accumulated in the learning data storage unit 35, it is used for reinforcement learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70, and the learning models 50 and 70 are updated (step S7).
When the update of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 is finished, it is determined whether the learning of these models is complete (step S9).
If it is determined that the learning is not complete (No in step S9), the process returns to step S5. That is, the control device 11 collects further running data and repeats the update of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 using that data.
If it is determined that the learning is complete (Yes in step S9), the learning process ends (step S11).
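The learning flow of steps S1 to S11 is a collect-update-check loop, which can be sketched as below. This is only a structural sketch: `train` and its callable parameters (`init_params`, `collect`, `update`, `is_done`) are hypothetical placeholders for the processing the text describes, and the `max_rounds` safety cap is an addition that does not appear in the flowchart.

```python
def train(init_params, collect, update, is_done, max_rounds=1000):
    """Skeleton of the learning flow; all callables are placeholders."""
    params = init_params()                    # step S3: initialize models
    buffer = []                               # stands in for storage unit 35
    for round_no in range(1, max_rounds + 1):
        buffer.extend(collect(params))        # step S5: drive and record data
        params = update(params, buffer)       # step S7: reinforcement update
        if is_done(params):                   # step S9: learning complete?
            return params, round_no           # step S11: end of learning
    # Assumption: give up after max_rounds instead of looping forever.
    raise RuntimeError("learning did not finish within max_rounds")
```

Because data collection in step S5 uses the models that are still being trained, `collect` receives the current `params` on every round.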

Next, with reference to FIG. 5, the operation will be described for the case where the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd are inferred during actual performance measurement of the vehicle 2, that is, where the travel of the vehicle 2 is controlled after reinforcement learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 has been completed.

When the vehicle 2 starts traveling (step S51), the running environment is initialized, and the control device 11 observes the running state at this point as the initial state (step S53).
The inference data forming unit 32 shapes the running state appropriately and then transmits it to the operation content inference unit 41 and the adjustment coefficient inference unit 45 of the reinforcement learning unit 40.
On receiving the running state, the operation content inference unit 41 infers from it the operation θ_FF of the vehicle 2 that makes the vehicle travel according to the command vehicle speed v_ref during the next inference cycle Tnn.
Similarly, on receiving the running state, the adjustment coefficient inference unit 45 infers from it the adjustment coefficients Kp, Ki, and Kd that are applied, during the next inference cycle Tnn, to the operation θ_FF of the vehicle 2 inferred by the operation content inference unit 41 so that the vehicle travels according to the command vehicle speed v_ref (step S55).

The feedback operation amount calculation unit 26 calculates the adjustment amount θ_FB by PID control, based on the inferred adjustment coefficients Kp, Ki, and Kd, at intervals of the control cycle Tdr, which is shorter than the inference cycle Tnn.
The operation complement unit 24 receives the adjustment amount θ_FB from the feedback operation amount calculation unit 26 at intervals of the control cycle Tdr and, based on the inferred operation θ_FF of the vehicle 2, calculates the adjusted operation θ_ref.
The operation complement unit 24 transmits this adjusted operation θ_ref to the drive robot 4. The drive robot 4 drives the actuator 4c based on the adjusted operation θ_ref to operate the accelerator pedal 2c, thereby changing the accelerator opening.
Then, the drive state acquisition unit 23 acquires the running state of the vehicle 2 after the operation again, in the same manner as in step S53 (step S57).
The drive state acquisition unit 23 transmits the running state of the vehicle 2 after the operation to the learning unit 30.

The control device 11 determines whether the travel of the vehicle 2 has finished (step S59).
If it is determined that the travel has not finished (No in step S59), the process returns to step S55. That is, the control device 11 repeats the inference of the operation θ_FF and the adjustment coefficients Kp, Ki, and Kd based on the running state acquired in step S57, followed by further observation of the running state.
If it is determined that the travel has finished (Yes in step S59), the travel process ends (step S61).
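The run-time flow of steps S51 to S61 can be summarized as an observe-infer-act loop. The sketch below is a hedged outline only: `run_drive` and its callables (`observe`, `infer`, `apply_controls`, `finished`) are hypothetical stand-ins for the units named in the text, and `apply_controls` is assumed to carry out all control-cycle (Tdr) steps of one inference cycle.

```python
def run_drive(observe, infer, apply_controls, finished):
    """Outline of the travel-control flow after learning is complete."""
    state = observe()                      # step S53: initial running state
    log = []
    while True:
        theta_ff, gains = infer(state)     # step S55: infer theta_FF, Kp/Ki/Kd
        apply_controls(theta_ff, gains)    # Tdr-rate PID steps for one cycle
        log.append((theta_ff, gains))
        state = observe()                  # step S57: re-acquire running state
        if finished(state):                # step S59: travel finished?
            return log                     # step S61: end of travel process
```

The loop is written as a do-while because the flowchart always performs at least one inference before the end-of-travel check.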

Next, the effects of the control device and control method of the drive robot 4 described above will be explained.

The control device 11 of the drive robot 4 of the present embodiment controls the drive robot (autopilot robot) 4, which is mounted on the vehicle 2 and drives the vehicle 2, so that the vehicle 2 travels according to the specified command vehicle speed v_ref. The control device 11 includes: the operation content inference unit 41, which infers the operation θ_FF in the inference cycle (first cycle) Tnn using the operation inference learning model 50, generated by reinforcement learning of a machine learner so as to infer, based on the running state of the vehicle 2 including the vehicle speed v_det and the command vehicle speed v_ref, the operation θ_FF of the vehicle 2 that causes the vehicle 2 to travel according to the command vehicle speed v_ref; the adjustment coefficient inference unit 45, which infers the adjustment coefficients Kp, Ki, and Kd using the adjustment coefficient inference learning model 70, generated by reinforcement learning of a machine learner so as to infer, based on the running state, the adjustment coefficients Kp, Ki, and Kd that adjust, during the inference cycle Tnn, the operation θ_FF inferred by the operation content inference unit 41; and the vehicle operation control unit 22, which, during the inference cycle Tnn, adjusts the operation θ_FF with the adjustment coefficients Kp, Ki, and Kd to generate the adjusted operation θ_ref and controls the drive robot 4 based on the adjusted operation θ_ref.
Likewise, the control method of the drive robot 4 of the present embodiment controls the drive robot (autopilot robot) 4, which is mounted on the vehicle 2 and drives the vehicle 2, so that the vehicle 2 travels according to the specified command vehicle speed v_ref. The method infers the operation θ_FF in the inference cycle (first cycle) Tnn using the operation inference learning model 50, generated by reinforcement learning of a machine learner so as to infer, based on the running state of the vehicle 2 including the vehicle speed v_det and the command vehicle speed v_ref, the operation θ_FF of the vehicle 2 that causes the vehicle 2 to travel according to the command vehicle speed v_ref; infers the adjustment coefficients Kp, Ki, and Kd using the adjustment coefficient inference learning model 70, generated by reinforcement learning of a machine learner so as to infer, based on the running state, the adjustment coefficients Kp, Ki, and Kd that adjust the inferred operation θ_FF during the inference cycle Tnn; and, during the inference cycle Tnn, adjusts the operation θ_FF with the adjustment coefficients Kp, Ki, and Kd to generate the adjusted operation θ_ref and controls the drive robot 4 based on the adjusted operation θ_ref.
With this configuration, the operation inference learning model 50 has been trained by reinforcement learning to infer, based on the running state of the vehicle 2 including the vehicle speed v_det and the command vehicle speed v_ref, the operation θ_FF of the vehicle 2 that causes the vehicle 2 to travel according to the command vehicle speed v_ref. Therefore, at least at every inference cycle Tnn, which is the cycle at which the operation inference learning model 50 infers the operation θ_FF of the vehicle 2, an operation θ_FF that makes the vehicle 2 follow the command vehicle speed v_ref accurately is output.
Here, an operation inference learning model 50 of this kind tends to require a large amount of computation. The inference cycle Tnn is therefore longer than the control cycle Tdr of the drive robot 4, and one inference cycle Tnn contains a plurality of control times. Consequently, the operation θ_FF of the vehicle 2 is not output individually for each control time. If, in such a case, the same operation θ_FF of the vehicle 2 were applied unchanged at each of the control times, fine-grained control would not be possible and followability to the command vehicle speed would not improve.
In contrast, in the present embodiment, the adjustment coefficients Kp, Ki, and Kd are inferred by the adjustment coefficient inference learning model 70, which has been trained by reinforcement learning to infer, based on the running state, the adjustment coefficients Kp, Ki, and Kd that adjust the inferred operation θ_FF during the inference cycle Tnn. That is, at each control time within the inference cycle Tnn, the operation θ_FF is adjusted as needed by the adjustment coefficients Kp, Ki, and Kd, and the drive robot 4 is controlled accordingly. This compensates for the difference in sampling rate between the inference cycle Tnn and the control cycle Tdr: even though no new operation θ_FF is inferred for a certain period, the existing operation θ_FF can be used, with adjustment, during that period. Followability to the command vehicle speed therefore improves.
In addition, because the same operation θ_FF is used, with adjustment, at the plurality of control times of the drive robot 4 within the inference cycle Tnn, the operation inference learning model 50 does not need to infer a plurality of operations θ_FF in a single inference. This simplifies the structure of the operation inference learning model 50 and makes the model easier to train.

Further, the inference cycle Tnn is set longer than the control cycle (second cycle) Tdr in which the drive robot 4 is controlled; the adjustment coefficient inference learning model 70 also infers the adjustment coefficients Kp, Ki, and Kd at every inference cycle Tnn; each of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 infers, in a single inference, only the operation θ_FF of the vehicle 2 and the adjustment coefficients Kp, Ki, and Kd used during the next inference cycle Tnn; and the vehicle operation control unit 22 generates the adjusted operation θ_ref using the latest operation θ_FF of the vehicle 2 and the latest adjustment coefficients Kp, Ki, and Kd until the next inference is performed.
Further, the adjustment coefficients Kp, Ki, and Kd include the proportional gain Kp, the integral gain Ki, and the derivative gain Kd, and the vehicle operation control unit 22 calculates, by feedback control based on the adjustment coefficients Kp, Ki, and Kd, the adjustment amount θ_FB for the operation θ_FF, and adjusts the operation θ_FF based on the adjustment amount θ_FB to generate the adjusted operation θ_ref.
Furthermore, the target of the operation θ_FF includes the accelerator pedal 2c.
With this configuration, the control device 11 of the drive robot 4 can be appropriately realized.

[First modification of the embodiment]
Next, a modification of the control device 11 and control method of the drive robot 4 shown in the above embodiment will be described with reference to FIG. 6. FIG. 6 is a processing block diagram showing the data flow of the control device of the drive robot in this modification.
The control device of the drive robot 4 in this modification differs from the control device 11 of the above embodiment in that the feedback operation amount calculation unit 26A of the vehicle operation control unit calculates the integral buffer i_buff, which is accumulated by the integral term of the PID control, and transmits it to the adjustment coefficient inference unit 45A.
Accordingly, the adjustment coefficient inference learning model provided in the adjustment coefficient inference unit 45A has, in its input layer, an input node corresponding to the integral buffer i_buff in addition to the input nodes corresponding to the running state. The adjustment coefficient inference learning model thus infers the adjustment coefficients based on the running state and the integral buffer i_buff.

Needless to say, this first modification provides the same effects as the embodiment already described.
In the configuration of this modification, the integral buffer i_buff used in the feedback operation amount calculation unit 26A, which is located downstream of the adjustment coefficient inference learning model and uses the adjustment coefficients that the model outputs, is itself an input to the adjustment coefficient inference learning model. The accuracy of the adjustment coefficients is therefore higher than in the above embodiment.
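The feedback path of the first modification, in which the PID integrator's state is appended to the input of the adjustment coefficient model, can be illustrated as follows. This is a sketch under assumptions: `IntegralBuffer` and `gain_model_input` are hypothetical names, i_buff is assumed to be the running sum of error times the control period, and the running state is represented as a plain list of numbers.

```python
class IntegralBuffer:
    """Holds i_buff, assumed here to be the accumulated error * dt."""

    def __init__(self):
        self.i_buff = 0.0

    def accumulate(self, error, dt):
        """Add one control step's contribution and return the new i_buff."""
        self.i_buff += error * dt
        return self.i_buff


def gain_model_input(running_state, i_buff):
    """Build the model input: running-state features followed by i_buff."""
    return [*running_state, i_buff]
```

For example, after two control steps the latest `i_buff` is appended to the running-state features before the next gain inference.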

[Second modification of the embodiment]
Next, a further modification of the control device and control method of the drive robot 4 shown in the first modification will be described with reference to FIG. 7. FIG. 7 is a processing block diagram showing the data flow of the control device of the drive robot in this modification.
The control device of the drive robot 4 in this modification differs from the control device of the first modification in that the adjustment coefficient inference learning model is integrated into the operation inference learning model in the operation content inference unit 41B, so that the operation inference learning model and the adjustment coefficient inference learning model are realized as a single learning model.

That is, the learning model provided in the operation content inference unit 41B in this modification is a neural network comprising an input layer with input nodes corresponding to the running state and the integral buffer i_buff, a plurality of intermediate layers, and output nodes corresponding to the operation θ_FF of the vehicle 2 and each of the adjustment coefficients Kp, Ki, and Kd.
Accordingly, the action value inference learning model used for reinforcement learning of this learning model is a learning model serving as a function approximator that takes as inputs the running state, the operation θ_FF of the vehicle 2 for that state, and the adjustment coefficients Kp, Ki, and Kd, and is designed to output a higher action value the larger the reward.

In this configuration, the operation content inference unit 41B outputs the adjustment coefficients Kp, Ki, and Kd, which are transmitted to the feedback operation amount calculation unit 26A.
The integral buffer i_buff output by the feedback operation amount calculation unit 26A is transmitted to the operation content inference unit 41B and input to the learning model.
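The single-model arrangement of the second modification, with one network whose four outputs are θ_FF, Kp, Ki, and Kd, can be sketched with a tiny hand-written forward pass. This is an illustration only: `dense`, `unified_policy`, and the `params` layout are hypothetical, one hidden layer is used for brevity although the text specifies a plurality of intermediate layers, and the tanh activation is an assumption.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ x + b), row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def unified_policy(running_state, i_buff, params):
    """Return (theta_FF, (Kp, Ki, Kd)) from one shared network."""
    x = [*running_state, i_buff]                 # inputs: state plus i_buff
    hidden = dense(x, params["w1"], params["b1"], math.tanh)
    outputs = dense(hidden, params["w2"], params["b2"], lambda v: v)
    theta_ff, kp, ki, kd = outputs               # four output nodes
    return theta_ff, (kp, ki, kd)
```

The gains returned here would go to the feedback calculation, while the integrator state computed there is passed back in as `i_buff` on the next inference.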

Needless to say, this second modification provides the same effects as the embodiment already described.
In the configuration of this modification, the number of learning models is reduced, so the device can be implemented even in an environment with fewer resources.

The control device and control method of the drive robot of the present invention are not limited to the above embodiment and modifications described with reference to the drawings, and various other modifications are conceivable within its technical scope.
For example, in the above embodiment, the operation amount of the accelerator pedal is output as the operation of the vehicle, but other operations, such as that of the brake pedal, may be output in addition.
Further, the above embodiment was described such that the learning of the operation inference learning model 50 and the adjustment coefficient inference learning model 70 and the learning of the first action value inference learning model 60 and the second action value inference learning model 80 are repeated. However, the order in which the learning models 50, 60, 70, and 80 are trained is not limited to this, as long as they are trained with sufficient accuracy. For example, the learning of the operation inference learning model 50 and the first action value inference learning model 60 may be repeated until it is complete, after which the learning of the adjustment coefficient inference learning model 70 and the second action value inference learning model 80 is repeated until it is complete.
Beyond this, the configurations given in the above embodiment and modifications may be selected or changed to other configurations as appropriate, as long as the gist of the present invention is not departed from.

1 Test device
2 Vehicle
2c Accelerator pedal
3 Chassis dynamometer
4 Drive robot (autopilot robot)
11 Control device
20 Drive robot control unit
22 Vehicle operation control unit
23 Drive state acquisition unit
24 Operation complement unit
25 Running resistance calculation unit
26, 26A Feedback operation amount calculation unit
27 Vehicle driving force calculation unit
30 Learning unit
31 Command vehicle speed generation unit
35 Learning data storage unit
40 Reinforcement learning unit
41, 41B Operation content inference unit
42 First action value inference unit
43 Reward calculation unit
45, 45A Adjustment coefficient inference unit
46 Second action value inference unit
50 Operation inference learning model
60 First action value inference learning model
70 Adjustment coefficient inference learning model
80 Second action value inference learning model
θ_FF Feedforward change amount (operation)
θ_FB Feedback change amount (adjustment amount)
θ_ref Adjusted operation
Kp Proportional gain (adjustment coefficient)
Ki Integral gain (adjustment coefficient)
Kd Derivative gain (adjustment coefficient)
i_buff Integral buffer
v_det Detected vehicle speed (vehicle speed)
v_ref Command vehicle speed

Claims

A control device for an autopilot robot, which controls an autopilot robot that is mounted on a vehicle and drives the vehicle so that the vehicle travels according to a specified command vehicle speed, the control device comprising:
an operation content inference unit that infers an operation of the vehicle in a first cycle using an operation inference learning model generated by reinforcement learning of a machine learner so as to infer, based on a running state of the vehicle including a vehicle speed and the command vehicle speed, the operation that causes the vehicle to travel according to the command vehicle speed;
an adjustment coefficient inference unit that infers an adjustment coefficient using an adjustment coefficient inference learning model generated by reinforcement learning of a machine learner so as to infer, based on the running state, the adjustment coefficient that adjusts, during the first cycle, the operation inferred by the operation content inference unit; and
a vehicle operation control unit that, during the first cycle, adjusts the operation with the adjustment coefficient to generate an adjusted operation, and controls the autopilot robot based on the adjusted operation.

The control device for an autopilot robot according to claim 1, wherein
the first cycle is set longer than a second cycle in which the autopilot robot is controlled,
the adjustment coefficient inference learning model also infers the adjustment coefficient in every first cycle,
each of the operation inference learning model and the adjustment coefficient inference learning model infers, in a single inference, only the operation of the vehicle and the adjustment coefficient used during the next first cycle, and
the vehicle operation control unit generates the adjusted operation using the latest operation of the vehicle and the latest adjustment coefficient until the next inference is made.

The control device for an autopilot robot according to claim 1 or 2, wherein
the adjustment coefficient includes a proportional gain, an integral gain, and a derivative gain, and
the vehicle operation control unit calculates an adjustment amount for the operation by feedback control based on the adjustment coefficient, and adjusts the operation based on the adjustment amount to generate the adjusted operation.

The control device for an autopilot robot according to claim 3, wherein
the vehicle operation control unit calculates an integral buffer, and
the adjustment coefficient inference learning model infers the adjustment coefficient based on the running state and the integral buffer.

The control device for an autopilot robot according to any one of claims 1 to 4, wherein the operation inference learning model and the adjustment coefficient inference learning model are realized as one learning model.

The control device for an autopilot robot according to any one of claims 1 to 5, wherein a target of the operation includes an accelerator pedal.

A control method for an autopilot robot, which controls an autopilot robot that is mounted on a vehicle and drives the vehicle so that the vehicle travels according to a specified command vehicle speed, the control method comprising:
inferring an operation of the vehicle in a first cycle using an operation inference learning model generated by reinforcement learning of a machine learner so as to infer, based on a running state of the vehicle including a vehicle speed and the command vehicle speed, the operation that causes the vehicle to travel according to the command vehicle speed;
inferring an adjustment coefficient using an adjustment coefficient inference learning model generated by reinforcement learning of a machine learner so as to infer, based on the running state, the adjustment coefficient that adjusts the inferred operation during the first cycle; and
during the first cycle, adjusting the operation with the adjustment coefficient to generate an adjusted operation, and controlling the autopilot robot based on the adjusted operation.