JP2021143882A

JP2021143882A - Learning system and learning method for operation inference learning model that controls automatically manipulated robot

Info

Publication number: JP2021143882A
Application number: JP2020041429A
Authority: JP
Inventors: 健人吉田; Taketo Yoshida; 寛修深井; Hironaga Fukai; 泰宏金刺; Yasuhiro Kanesashi
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2021-09-24

Abstract

PROBLEM TO BE SOLVED: To provide a learning system and a learning method that can suppress a decrease in the learning accuracy of an operation inference learning model attributable to a difference in pedal play amount between a vehicle model and an actual vehicle. SOLUTION: A learning system 10 comprises an operation inference learning model 70 that infers an operation of a vehicle 2 on the basis of the traveling state of the vehicle 2, and an automatically manipulated robot 4. The operation includes a pedal operation amount of one or both of an accelerator pedal and a brake pedal. The system includes a vehicle model 52 that outputs a simulated traveling state, a first pedal play amount adjustment unit 54 that adjusts the pedal operation amount inferred by the operation inference learning model on the basis of a difference in pedal play amount between the vehicle and the vehicle model 52 and inputs it to the vehicle model, and a second pedal play amount adjustment unit 55 that adjusts the simulated traveling state. The adjusted simulated traveling state is applied to the operation inference learning model, whereby the operation inference learning model that controls the automatically manipulated robot is machine-learned. SELECTED DRAWING: Figure 2

Description

The present invention relates to a learning system and a learning method for an operation inference learning model that controls an autopilot robot.

In general, when vehicles such as ordinary passenger cars are manufactured and sold, their fuel consumption and exhaust emissions must be measured while the vehicle is driven according to a specific driving pattern (mode) prescribed by the country or region, and the results must be displayed.
A mode can be represented as a graph of, for example, the relationship between the time elapsed since the start of travel and the vehicle speed to be reached at that time. Because this target speed is a command given to the vehicle regarding the speed it should achieve, it is sometimes called the command vehicle speed.
Such fuel consumption and exhaust emission tests are carried out by placing the vehicle on a chassis dynamometer and having an autopilot robot mounted in the vehicle, a so-called Drive Robot (registered trademark), drive the vehicle according to the mode.
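As a non-limiting illustration of the relationship between elapsed time and command vehicle speed described above, a mode can be held as a list of breakpoints and interpolated. The breakpoint values, the name `MODE_POINTS`, and the use of linear interpolation are assumptions made for this sketch; they are not taken from any regulatory mode.

```python
from bisect import bisect_right

# Hypothetical mode: (elapsed time [s], command vehicle speed [km/h]) breakpoints.
MODE_POINTS = [(0.0, 0.0), (10.0, 0.0), (20.0, 30.0), (40.0, 30.0), (50.0, 0.0)]


def command_vehicle_speed(t, points=MODE_POINTS):
    """Return the command vehicle speed at elapsed time t by linear interpolation."""
    times = [p[0] for p in points]
    if t <= times[0]:
        return points[0][1]
    if t >= times[-1]:
        return points[-1][1]
    i = bisect_right(times, t)  # index of the first breakpoint later than t
    (t0, v0), (t1, v1) = points[i - 1], points[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
```

Calling `command_vehicle_speed(15.0)` with the hypothetical breakpoints above falls halfway along the ramp from 0 to 30 km/h.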

A tolerance range is prescribed for the command vehicle speed. If the vehicle speed leaves the tolerance range, the test becomes invalid, so control of the drive robot must track the command vehicle speed closely. For this reason, particularly in recent years, drive robots are sometimes controlled with a learning model that has been machine-learned so that, given the current state of the vehicle, it infers an operation that makes the vehicle travel according to the command vehicle speed.
For example, Patent Document 1 discloses a vehicle driving simulation device, a driver model construction method, and a driver model construction program that can construct, by reinforcement learning, a driver model that performs human-like pedal operation.
More specifically, the vehicle driving simulation device runs a vehicle model multiple times while changing the gain values of the driver model and evaluates the changed gain values on the basis of a reward value, thereby setting the gains of the driver model automatically. The gain values are evaluated not only by a vehicle speed reward function that evaluates how well the vehicle speed is tracked, but also by an accelerator reward function that evaluates the smoothness of accelerator pedal operation and a brake reward function that evaluates the smoothness of brake pedal operation.
The vehicle model used in Patent Document 1 and similar work is usually built as a physical model that combines individual physical models, each imitating the behavior of one component of the vehicle.
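The tolerance requirement on the command vehicle speed mentioned at the start of this passage can be sketched as a simple band check. The band width `tol_kmh` and the function names are hypothetical, and real test procedures may define the tolerance differently (for example with a time window as well as a speed band), so this is only a simplified illustration.

```python
def within_tolerance(measured, commanded, tol_kmh=2.0):
    """Return True if every measured speed is within +/- tol_kmh of the command."""
    if len(measured) != len(commanded):
        raise ValueError("traces must have the same length")
    return all(abs(m - c) <= tol_kmh for m, c in zip(measured, commanded))


def first_violation(measured, commanded, tol_kmh=2.0):
    """Return the index of the first sample outside the band, or None."""
    for i, (m, c) in enumerate(zip(measured, commanded)):
        if abs(m - c) > tol_kmh:
            return i
    return None
```

A test run would be treated as invalid in this sketch as soon as `first_violation` returns an index.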

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-115168

In a device such as that disclosed in Patent Document 1, the operation inference learning model that infers vehicle operations is trained on the basis of a vehicle model. Consequently, if the vehicle model reproduces the vehicle poorly, the operations inferred by the operation inference learning model may not suit the actual vehicle, no matter how precisely that model is trained.
One vehicle characteristic that strongly affects the reproduction accuracy of the vehicle model is the pedal play of the accelerator pedal or the brake pedal. As shown in FIG. 11, consider a case in which an actuator of a drive robot installed on the vehicle seat depresses a pedal in direction PD from the depression start position P1 to the depression limit position P2. The depression start position P1 corresponds to a pedal opening of 0%, and the depression limit position P2 corresponds to a pedal opening of 100%. In practice, driving by the accelerator pedal or deceleration by the brake pedal does not begin as soon as the pedal is depressed from P1; the pedal operation is not reflected in the behavior of the vehicle until the pedal reaches a predetermined drive start position P3. Pedal play refers to this free travel. The magnitude H of the pedal opening from the depression start position P1 to the drive start position P3 is hereinafter called the pedal play amount.
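Pedal play as described above behaves like a dead zone on the pedal opening. The following minimal sketch assumes, purely for illustration, that the effect on the vehicle is the opening in excess of the pedal play amount H; the source does not state how the effect scales beyond the drive start position P3, and the function name is hypothetical.

```python
def effective_opening(opening_pct, play_pct):
    """Return the part of the pedal opening (in %) that lies beyond the play amount H.

    Openings between P1 (0 %) and P3 (play_pct) have no effect on the vehicle.
    The linear behavior beyond P3 is an assumption of this sketch.
    """
    if not 0.0 <= opening_pct <= 100.0:
        raise ValueError("opening must be between 0 and 100 %")
    if not 0.0 <= play_pct < 100.0:
        raise ValueError("play amount must be in [0, 100) %")
    return max(0.0, opening_pct - play_pct)
```

With a play amount of 10%, an opening of 5% yields no effect, while an opening of 25% yields an effective opening of 15%.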

The pedal play amount strongly affects the absolute magnitude of the pedal operation values when the drive robot is controlled to drive the vehicle.
For example, suppose that an operation inference learning model trained with a vehicle model whose pedal play amount is set smaller than that of the actual vehicle infers an operation that depresses the brake pedal slightly beyond the drive start position P3 of the vehicle model. In an actual vehicle whose pedal play amount is larger than that of the vehicle model, the inferred operation may fail to reach the drive start position P3. In other words, even though the operation inference learning model intends a light press on the brake pedal, the operation may not push the brake pedal past the drive start position P3 of the vehicle, and the brake does not actually engage.
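The mismatch described above can be reproduced with a few hypothetical numbers. The play amounts and the inferred opening below are invented for illustration only, and `brake_engages` is a hypothetical helper that treats the brake as engaged once the opening exceeds the play amount.

```python
def brake_engages(opening_pct, play_pct):
    """Return True if the pedal opening goes beyond the drive start position P3."""
    return opening_pct > play_pct


MODEL_PLAY_PCT = 5.0     # hypothetical pedal play amount of the vehicle model
VEHICLE_PLAY_PCT = 8.0   # hypothetical, larger pedal play amount of the actual vehicle
inferred_opening = 6.0   # operation slightly beyond P3 of the vehicle model

engages_on_model = brake_engages(inferred_opening, MODEL_PLAY_PCT)
engages_on_vehicle = brake_engages(inferred_opening, VEHICLE_PLAY_PCT)
```

Here the same inferred opening brakes the vehicle model but stays inside the play region of the actual vehicle.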

Once such a difference in pedal play amount between the actual vehicle and the vehicle model has been found, the vehicle model could be retrained so that its pedal play amount becomes appropriate, but this takes a great deal of computation time and is not easy to carry out.

The problem to be solved by the present invention is to provide a learning system and a learning method for an operation inference learning model that controls an autopilot robot (drive robot), with which, when the operation inference learning model is machine-learned with a vehicle model as the target on which operations are executed, a decrease in the learning accuracy of the operation inference learning model caused by a difference in pedal play amount between the vehicle model and the actual vehicle can easily be suppressed.

To solve the above problem, the present invention adopts the following means. That is, the present invention provides a learning system for an operation inference learning model that controls an autopilot robot. The learning system comprises an operation inference learning model that infers, on the basis of a traveling state of a vehicle including the vehicle speed, an operation of the vehicle that makes the vehicle travel according to a prescribed command vehicle speed, and an autopilot robot that is mounted in the vehicle and drives the vehicle on the basis of the operation, and the learning system machine-learns the operation inference learning model. The operation includes a pedal operation amount of one or both of an accelerator pedal and a brake pedal. The learning system further comprises: a vehicle model that is set to simulate the vehicle and outputs a simulated traveling state imitating the vehicle, the simulated traveling state including a pedal detection amount of the pedal; a first pedal play amount adjustment unit that, on the basis of a difference value between the pedal play amount of the vehicle and the pedal play amount of the vehicle model, adjusts to the vehicle model the pedal operation amount included in an input operation based on the operation inferred by the operation inference learning model, and inputs the adjusted input operation to the vehicle model; and a second pedal play amount adjustment unit that, on the basis of an inverted difference value obtained by inverting the sign of the difference value, adjusts to the vehicle the pedal detection amount included in the simulated traveling state and generates an adjusted simulated traveling state. The operation inference learning model is machine-learned by applying the adjusted simulated traveling state to the operation inference learning model.
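The roles of the first and second pedal play amount adjustment units can be sketched as a pair of opposite shifts. This passage states only that the first unit uses the difference value and the second unit uses its sign-inverted value; the additive shift, the sign convention (vehicle play minus vehicle model play), the clamping to 0 to 100%, and all function names below are assumptions made for this sketch.

```python
def clamp(value, lo=0.0, hi=100.0):
    """Limit a pedal amount to the valid opening range."""
    return max(lo, min(hi, value))


def adjust_by_difference(pedal_pct, difference_pct):
    """Shift a pedal amount by a play difference (assumed additive) and clamp it."""
    return clamp(pedal_pct - difference_pct)


def first_adjuster(operation_pct, vehicle_play_pct, model_play_pct):
    """Adjust an inferred pedal operation amount to the vehicle model."""
    difference = vehicle_play_pct - model_play_pct  # assumed sign convention
    return adjust_by_difference(operation_pct, difference)


def second_adjuster(detected_pct, vehicle_play_pct, model_play_pct):
    """Adjust a pedal detection amount from the vehicle model back to the vehicle."""
    difference = vehicle_play_pct - model_play_pct
    return adjust_by_difference(detected_pct, -difference)  # inverted difference value
```

Under these assumptions, an inferred operation of 10% for a vehicle with 8% play becomes 7% for a vehicle model with 5% play, so both see the same 2% beyond their respective drive start positions, and the second adjuster maps 7% back to 10%. Away from the clamping limits the two shifts cancel exactly.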

The present invention also provides a learning method for an operation inference learning model that controls an autopilot robot. The method relates to an operation inference learning model that infers, on the basis of a traveling state of a vehicle including the vehicle speed, an operation of the vehicle that makes the vehicle travel according to a prescribed command vehicle speed, and to an autopilot robot that is mounted in the vehicle and drives the vehicle on the basis of the operation, and the method machine-learns the operation inference learning model. The operation includes a pedal operation amount of one or both of an accelerator pedal and a brake pedal. In the method, the pedal operation amount included in an input operation based on the operation inferred by the operation inference learning model is adjusted, on the basis of a difference value between the pedal play amount of the vehicle and the pedal play amount of a vehicle model, to the vehicle model, which is set to simulate the vehicle and outputs a simulated traveling state imitating the vehicle, the simulated traveling state including a pedal detection amount of the pedal; the adjusted input operation is input to the vehicle model; the pedal detection amount included in the simulated traveling state is adjusted to the vehicle on the basis of an inverted difference value obtained by inverting the sign of the difference value, and an adjusted simulated traveling state is generated; and the operation inference learning model is machine-learned by applying the adjusted simulated traveling state to the operation inference learning model.

According to the present invention, it is possible to provide a learning system and a learning method for an operation inference learning model that controls an autopilot robot (drive robot), with which, when the operation inference learning model is machine-learned with a vehicle model as the target on which operations are executed, a decrease in the learning accuracy of the operation inference learning model caused by a difference in pedal play amount between the vehicle model and the actual vehicle can easily be suppressed.

FIG. 1 is an explanatory diagram of a test environment using an autopilot robot (drive robot) in an embodiment of the present invention.
FIG. 2 is a block diagram showing the processing flow of the learning system for the operation inference learning model that controls the autopilot robot in the embodiment, during learning of the vehicle learning model.
FIG. 3 is a block diagram of the vehicle learning model.
FIG. 4 is a block diagram showing the processing flow of the learning system during pre-learning of the operation inference learning model.
FIG. 5 is an explanatory diagram of the first and second pedal play amount adjustment units of the learning system.
FIG. 6 is a block diagram showing the processing flow of the learning system during reinforcement learning after pre-learning of the operation inference learning model has finished.
FIG. 7 is a flowchart of the learning method for the operation inference learning model that controls the autopilot robot in the embodiment.
FIG. 8 is an explanatory diagram of the second pedal play amount adjustment unit in a first modification of the embodiment.
FIG. 9 is an explanatory diagram of a case in which inference by the vehicle model takes time during adjustment of the pedal play amount in the first modification.
FIG. 10 is an explanatory diagram of the second pedal play amount adjustment unit in a second modification of the embodiment.
FIG. 11 is an explanatory diagram of pedal play.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings.
In this embodiment, a Drive Robot (registered trademark) is used as the autopilot robot, so the autopilot robot is hereinafter referred to as the drive robot.

FIG. 1 is an explanatory diagram of a test environment using the drive robot in the embodiment. The test device 1 includes a vehicle 2, a chassis dynamometer 3, and a drive robot 4.
The vehicle 2 is placed on a floor surface. The chassis dynamometer 3 is installed below the floor surface. The vehicle 2 is positioned so that its drive wheels 2a rest on the chassis dynamometer 3. When the vehicle 2 travels and the drive wheels 2a rotate, the chassis dynamometer 3 rotates in the opposite direction.
The drive robot 4 is mounted on the driver's seat 2b of the vehicle 2 and drives the vehicle 2. The drive robot 4 includes a first actuator 4c and a second actuator 4d, which are arranged to contact the accelerator pedal 2c and the brake pedal 2d of the vehicle 2, respectively.

The drive robot 4 is controlled by a learning control device 11, which is described in detail later. The learning control device 11 changes and adjusts the openings of the accelerator pedal 2c and the brake pedal 2d of the vehicle 2 by controlling the first actuator 4c and the second actuator 4d of the drive robot 4.
The learning control device 11 controls the drive robot 4 so that the vehicle 2 travels according to the prescribed command vehicle speed. That is, by changing the openings of the accelerator pedal 2c and the brake pedal 2d, the learning control device 11 controls the travel of the vehicle 2 so that it follows the prescribed driving pattern (mode). More specifically, the learning control device 11 controls the travel of the vehicle 2 so that, as time elapses from the start of travel, the vehicle follows the command vehicle speed, which is the vehicle speed to be reached at each time.

The learning system 10 includes the test device 1 described above and the learning control device 11.
The learning control device 11 includes a drive robot control unit 20 and a learning unit 30.
The drive robot control unit 20 controls the drive robot 4 by generating a control signal for the drive robot 4 and transmitting it to the drive robot 4. The learning unit 30 performs machine learning as described later and generates a vehicle learning model, an operation inference learning model, and a value inference learning model. The control signal for the drive robot 4 is generated by the operation inference learning model.
The drive robot control unit 20 is, for example, an information processing device such as a controller provided outside the housing of the drive robot 4. The learning unit 30 is, for example, an information processing device such as a personal computer.

FIG. 2 is a block diagram of the learning system 10. In FIG. 2, the lines connecting the components show only the data exchanged when the vehicle learning model is machine-learned; they do not show all data exchanged between the components.
In addition to the vehicle 2, the chassis dynamometer 3, and the drive robot 4 already described, the test device 1 includes a vehicle state measurement unit 5. The vehicle state measurement unit 5 comprises various measuring devices that measure the state of the vehicle 2. It may be, for example, a camera or an infrared sensor for measuring the operation amounts of the accelerator pedal 2c and the brake pedal 2d.

The drive robot control unit 20 includes a pedal operation pattern generation unit 21, a vehicle operation control unit 22, and a drive state acquisition unit 23. The learning unit 30 includes a command vehicle speed generation unit 31, an inference data forming unit 32, a learning data forming unit 33, a learning data generation unit 34, a learning data storage unit 35, a reinforcement learning unit 40, and a test device model 50. The reinforcement learning unit 40 includes an operation content inference unit 41, a state-action value inference unit 42, and a reward calculation unit 43. The test device model 50 includes a drive robot model 51, a vehicle model 52, a chassis dynamometer model 53, a first pedal play amount adjustment unit 54, and a second pedal play amount adjustment unit 55.
Each component of the learning control device 11 other than the learning data storage unit 35 may be, for example, software or a program executed by a CPU in one of the information processing devices described above. The learning data storage unit 35 may be implemented by a storage device, such as a semiconductor memory or a magnetic disk, provided inside or outside those information processing devices.
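For orientation, the containment relationships listed above can be written down as a nested mapping. The reference numerals come from the text; the dictionary layout and the `walk` helper are hypothetical and serve only to make the hierarchy easy to traverse.

```python
# Component hierarchy of the learning control device 11, keyed by name and numeral.
COMPONENTS = {
    "learning control device 11": {
        "drive robot control unit 20": {
            "pedal operation pattern generation unit 21": {},
            "vehicle operation control unit 22": {},
            "drive state acquisition unit 23": {},
        },
        "learning unit 30": {
            "command vehicle speed generation unit 31": {},
            "inference data forming unit 32": {},
            "learning data forming unit 33": {},
            "learning data generation unit 34": {},
            "learning data storage unit 35": {},
            "reinforcement learning unit 40": {
                "operation content inference unit 41": {},
                "state-action value inference unit 42": {},
                "reward calculation unit 43": {},
            },
            "test device model 50": {
                "drive robot model 51": {},
                "vehicle model 52": {},
                "chassis dynamometer model 53": {},
                "first pedal play amount adjustment unit 54": {},
                "second pedal play amount adjustment unit 55": {},
            },
        },
    }
}


def walk(tree, depth=0):
    """Yield (depth, name) for every component, parents before children."""
    for name, children in tree.items():
        yield depth, name
        yield from walk(children, depth + 1)
```

Iterating over `walk(COMPONENTS)` lists each unit with its nesting depth, for example to print an indented outline.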

As described later, the operation content inference unit 41 infers, on the basis of the traveling state at a given time, an operation of the vehicle 2 after that time that follows the command vehicle speed. To infer operations of the vehicle 2 effectively, the operation content inference unit 41 includes a machine learning device, as described later. The machine learning device is machine-learned on the basis of a reward calculated from the traveling state at a time after the drive robot 4 has been operated according to the inferred operation, and a learning model (operation inference learning model) 70 is thereby generated. When the vehicle 2 is actually driven under control for performance measurement, the operation content inference unit 41 infers operations of the vehicle 2 using the operation inference learning model 70 for which this learning has been completed.
That is, the learning system 10 performs two broad kinds of operation: learning operations during reinforcement learning, and inferring operations when the vehicle is driven under control for performance measurement. For simplicity, the components of the learning system 10 are first described for the case of learning operations, and the behavior of each component when operations are inferred for vehicle performance measurement is described afterward.

First, the behavior of the components of the learning control device 11 when operations are learned is described.
Before learning operations, the learning control device 11 collects the travel record data (travel record) used during learning. Specifically, the drive robot control unit 20 generates operation patterns of the accelerator pedal 2c and the brake pedal 2d for measuring vehicle characteristics, controls the travel of the vehicle 2 with these patterns, and collects the travel record data.
The pedal operation pattern generation unit 21 generates operation patterns of the pedals 2c and 2d for measuring vehicle characteristics. As the pedal operation pattern, it is possible to use, for example, recorded pedal operation values obtained when another vehicle similar to the vehicle 2 was driven in a mode such as the WLTC (Worldwide harmonized Light vehicles Test Cycle).
The pedal operation pattern generation unit 21 transmits the generated pedal operation pattern to the vehicle operation control unit 22.

The vehicle operation control unit 22 receives the pedal operation pattern from the pedal operation pattern generation unit 21, converts it into commands for the first and second actuators 4c and 4d of the drive robot 4, and transmits the commands to the drive robot 4.
On receiving the commands for the actuators 4c and 4d, the drive robot 4 drives the vehicle 2 on the chassis dynamometer 3 accordingly.
The drive state acquisition unit 23 acquires the actual drive state of the drive robot 4, such as the positions of the actuators 4c and 4d. As the vehicle 2 travels, its traveling state changes continually. The traveling state of the vehicle 2 is measured by the drive state acquisition unit 23, the vehicle state measurement unit 5, and various measuring instruments provided on the chassis dynamometer 3. For example, the drive state acquisition unit 23 measures the detection amount of the accelerator pedal 2c and the detection amount of the brake pedal 2d as part of the traveling state, as described above. A measuring instrument provided on the chassis dynamometer 3 measures the vehicle speed as part of the traveling state.
The measured traveling state of the vehicle 2 is transmitted to the learning data forming unit 33 of the learning unit 30.
The learning data forming unit 33 receives the traveling state of the vehicle 2, converts the received data into the format used in the various later learning steps, and stores it in the learning data storage unit 35 as travel record data.
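The conversion of measured traveling states into stored travel record data can be pictured with a simple record type. The source does not specify the storage format; the field names, units, and the `to_travel_records` helper below are hypothetical and are intended only as a sketch of one possible layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TravelSample:
    """One time step of travel record data (hypothetical layout)."""
    time_s: float
    vehicle_speed_kmh: float      # measured on the chassis dynamometer
    accel_detected_pct: float     # detection amount of the accelerator pedal
    brake_detected_pct: float     # detection amount of the brake pedal


def to_travel_records(times, speeds, accel, brake):
    """Combine per-signal measurement lists into a list of TravelSample records."""
    if not (len(times) == len(speeds) == len(accel) == len(brake)):
        raise ValueError("all signals must have the same number of samples")
    return [TravelSample(*row) for row in zip(times, speeds, accel, brake)]
```

A list of such records could then be written to whatever storage device realizes the learning data storage unit.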

When collection of the traveling state of the vehicle 2, that is, the travel record data, is complete, the learning data generation unit 34 acquires the travel record data from the learning data storage unit 35, forms it into an appropriate format, and transmits it to the test device model 50.
The vehicle model 52 of the test device model 50 in the learning unit 30 acquires the formed travel record data from the learning data generation unit 34 and uses it to machine-learn a machine learning device 60, thereby generating a vehicle learning model 60. The vehicle learning model 60 is set (in this embodiment, machine-learned) to simulate the vehicle 2 on the basis of the travel record data, which is the actual travel record of the vehicle 2. When it receives an operation for the vehicle 2, it outputs a simulated traveling state imitating the vehicle 2 on the basis of that operation. That is, the machine learning device 60 of the vehicle model 52 generates a trained model 60 with appropriately learned parameters, which is used as a program module forming part of artificial intelligence software.
In this embodiment, the vehicle learning model 60 is implemented as a neural network.
Hereinafter, for simplicity, both the machine learning device included in the vehicle model 52 and the learning model generated by training it are referred to as the vehicle learning model 60.

FIG. 3 is a block diagram of the vehicle learning model 60. In this embodiment, the vehicle learning model 60 is implemented as a fully connected neural network with five layers in total, three of which are intermediate layers. The vehicle learning model 60 includes an input layer 61, intermediate layers 62, and an output layer 63. In FIG. 3, each layer is drawn as a rectangle, and the nodes in each layer are omitted.
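The five-layer fully connected structure (one input layer, three intermediate layers, one output layer) can be sketched with the standard library alone. The layer widths in `SIZES`, the ReLU activation, and the random initialization are assumptions of this sketch; this passage does not specify them, and no training is shown.

```python
import random


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer as (weights, biases)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activate):
    """Apply one fully connected layer, with ReLU if activate is True."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if activate else out


def build_network(sizes, seed=0):
    """Build the weight layers connecting consecutive node layers."""
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def forward(network, x):
    """Propagate an input vector through every layer; the last layer is linear."""
    for k, layer in enumerate(network):
        x = dense(x, layer, activate=(k < len(network) - 1))
    return x


# Hypothetical widths: input layer, three intermediate layers, output layer.
SIZES = [30, 16, 16, 16, 3]
network = build_network(SIZES)
outputs = forward(network, [0.5] * SIZES[0])
```

Five layers of nodes are connected by four sets of weights, which is why `build_network` returns four entries for `SIZES`.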

In the present embodiment, the input of the vehicle learning model 60 includes a series of vehicle speeds in the travel record data, taken from a time a predetermined first time before an arbitrary reference time up to the reference time. The input also includes a series of operation amounts of the accelerator pedal 2c and a series of operation amounts of the brake pedal 2d from the reference time to a time a predetermined second time in the future. These two series are, in practice, the detection amounts of the accelerator pedal 2c and of the brake pedal 2d at and after the reference time in the travel record data stored in the learning data storage unit 35, and they are input to the vehicle learning model 60 as the operation applied to the vehicle 2 at the reference time.
The input layer 61 includes input nodes corresponding to each of the vehicle speed series i1, which is the series of vehicle speeds described above, the model input accelerator pedal operation amount series im2, which is the series of operation amounts of the accelerator pedal 2c, and the model input brake pedal operation amount series im3, which is the series of operation amounts of the brake pedal 2d.
As described above, each of the inputs i1, im2, and im3 is a series made up of a plurality of values. For example, the input corresponding to the vehicle speed series i1, shown as a single rectangle in FIG. 3, actually has one input node for each of the values of the vehicle speed series i1.
The vehicle model 52 stores the corresponding value of the travel record data in each input node.

The intermediate layer 62 includes a first intermediate layer 62a, a second intermediate layer 62b, and a third intermediate layer 62c.
At each node of the intermediate layer 62, a computation is performed on the basis of the values stored in the nodes of the preceding layer (for example, the input layer 61 for the first intermediate layer 62a, and the first intermediate layer 62a for the second intermediate layer 62b) and the weights from those nodes to the node in question, and the result is stored in that node.
In the output layer 63, the same kind of computation as in each intermediate layer 62 is performed, and the results are stored in the output nodes of the output layer 63.
In the present embodiment, the output of the vehicle learning model 60 is a simulated running state om that includes an estimated vehicle speed series o1, which is a series of estimated vehicle speeds from the reference time to a later time a predetermined third time in the future, a model output accelerator pedal detection amount series om2, which is a series of detection amounts of the accelerator pedal 2c, and a model output brake pedal detection amount series om3, which is a series of detection amounts of the brake pedal 2d. Each component of the simulated running state om, shown as a single rectangle in FIG. 3, actually has one output node for each of its values.
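The layer-by-layer computation described above can be sketched as follows. This is only an illustrative sketch and not the implementation of the embodiment: the hidden layer width, the tanh activation, the random placeholder weights, and the names `build_vehicle_model` and `simulate` are all assumptions, since the description specifies only that the network is fully connected with five layers, three of them intermediate.

```python
import math
import random


def dense(inputs, weights, biases, activate):
    """One fully connected layer: every output node combines every input node."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        # tanh is an assumed activation; the description does not name one.
        outputs.append(math.tanh(total) if activate else total)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random placeholder parameters; a trained model would hold learned values."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def build_vehicle_model(n_speed, n_pedal_in, n_out_steps, hidden=16, seed=0):
    """Input layer 61, three intermediate layers 62a to 62c, output layer 63."""
    rng = random.Random(seed)
    n_in = n_speed + 2 * n_pedal_in   # one node per value of i1, im2, im3
    n_out = 3 * n_out_steps           # one node per value of o1, om2, om3
    sizes = [n_in, hidden, hidden, hidden, n_out]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def simulate(model, i1, im2, im3):
    """Forward pass from the input series to the simulated running state om."""
    values = list(i1) + list(im2) + list(im3)
    for index, (weights, biases) in enumerate(model):
        is_output_layer = index == len(model) - 1
        values = dense(values, weights, biases, activate=not is_output_layer)
    steps = len(values) // 3
    return {
        "o1": values[:steps],
        "om2": values[steps:2 * steps],
        "om3": values[2 * steps:],
    }


model = build_vehicle_model(n_speed=4, n_pedal_in=3, n_out_steps=2)
om = simulate(model, [10.0, 11.0, 12.0, 13.0], [5.0, 5.0, 5.0], [0.0, 0.0, 0.0])
```

Because the weights here are random placeholders, the numbers in `om` carry no meaning; the sketch shows only how the three input series map to the three output series.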

The vehicle learning model 60 is trained so that, given the travel record at the reference time as described above, it can output a simulated running state om imitating the running of the vehicle 2 at later times.
More specifically, the vehicle model 52 receives, as teacher data, the travel record from the reference time to the time the third time in the future, transmitted separately from the learning data storage unit 35 via the learning data generation unit 34. The vehicle model 52 adjusts the values of the parameters constituting the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent so that the mean squared error between the teacher data and the simulated running state om output by the vehicle learning model 60 becomes small.
While repeating the learning of the vehicle learning model 60, the vehicle model 52 calculates the least squares error between the teacher data and the simulated running state om each time, and ends the learning of the vehicle learning model 60 when this error is smaller than a predetermined value.
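The repeat-until-threshold training procedure described above can be sketched as follows. To keep the sketch short, the five-layer network is replaced by a one-weight linear model, so the gradient is written out directly rather than obtained by backpropagation. The threshold, learning rate, data, and the name `train_until_threshold` are assumptions made for illustration.

```python
import random


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def train_until_threshold(samples, threshold=1e-4, learning_rate=0.05,
                          max_epochs=10_000, seed=0):
    """Stochastic gradient descent that stops once the error is below a threshold."""
    rng = random.Random(seed)
    samples = list(samples)
    weight, bias = 0.0, 0.0
    error = float("inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(samples)                      # stochastic: random sample order
        for x, target in samples:
            residual = (weight * x + bias) - target
            # Gradient of the squared residual with respect to each parameter.
            weight -= learning_rate * 2.0 * residual * x
            bias -= learning_rate * 2.0 * residual
        predictions = [weight * x + bias for x, _ in samples]
        error = mean_squared_error(predictions, [t for _, t in samples])
        if error < threshold:                     # end criterion from the text
            break
    return weight, bias, error, epoch


# Hypothetical teacher data: target = 0.8 * x + 0.1 for x in [0, 1].
samples = [(x / 10.0, 0.8 * (x / 10.0) + 0.1) for x in range(11)]
weight, bias, error, epochs = train_until_threshold(samples)
```

The `max_epochs` guard is an addition of this sketch so that the loop terminates even if the threshold is never reached.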

When the learning of the vehicle learning model 60 is completed, the reinforcement learning unit 40 of the learning system 10 pre-learns the operation inference learning model 70, which is provided in the operation content inference unit 41 and infers the operation of the vehicle 2. FIG. 4 is a block diagram of the learning system 10 showing the data transmission and reception relationships during pre-learning. In the present embodiment, the operation inference learning model 70 is machine-learned by reinforcement learning. That is, by training its machine learning device, the operation inference learning model 70 becomes a learned model in which appropriate learning parameters have been learned and which is used as a program module forming part of artificial intelligence software.
The learning system 10 performs reinforcement learning of the operation inference learning model 70 in advance by applying to it the simulated running state output by the already trained vehicle learning model 60. As described later, once this preliminary reinforcement learning has progressed and been completed, the operation inference learning model 70 is further trained by reinforcement learning, by applying to it the running state acquired by actually driving the vehicle 2 on the basis of the operations it outputs. In this way, the learning system 10 switches the target on which the inferred operation is executed, and from which the running state is acquired, from the vehicle learning model 60 to the actual vehicle 2 according to the learning stage of the operation inference learning model 70.

As described later, the operation content inference unit 41 uses the operation inference learning model 70, which is still in the middle of learning, to output the operation of the vehicle 2 from the present time to a time the third time in the future, and transmits it to the drive robot model 51. In the present embodiment, the operation content inference unit 41 outputs, in particular, a series of operations of the accelerator pedal 2c and the brake pedal 2d, that is, pedal operation amounts.
Through the learning of the vehicle learning model 60, the test apparatus model 50 as a whole is configured to simulate each part of the test apparatus 1. The test apparatus model 50 receives the series of operations.

The drive robot model 51 is configured to simulate the drive robot 4. On the basis of the operation inferred by the operation inference learning model 70 and received from the operation content inference unit 41, the drive robot model 51 converts the representation of the operation system into values of the actual pedal operation amounts for the vehicle 2 and thereby generates an input operation. More specifically, the drive robot model 51 generates, as the input operation, an accelerator pedal operation amount series i2 and a brake pedal operation amount series i3, which are series of pedal operation amounts, and transmits them to the first pedal play amount adjusting unit 54.
The chassis dynamometer model 53 is configured to simulate the chassis dynamometer 3. The chassis dynamometer 3 detects the vehicle speed of the vehicle learning model 60 during simulated running and records it internally as needed. The chassis dynamometer model 53 generates the vehicle speed series i1 from this record of past vehicle speeds and transmits it to the first pedal play amount adjusting unit 54.

As described in detail later, the first pedal play amount adjusting unit 54 adjusts the pedal operation amounts included in the input operation received from the drive robot model 51, that is, in the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3, and generates the model input accelerator pedal operation amount series (adjusted input operation) im2 and the model input brake pedal operation amount series (adjusted input operation) im3 to be input to the vehicle model 52. It transmits these to the vehicle model 52 together with the vehicle speed series i1 received from the chassis dynamometer model 53.
The vehicle model 52 receives the vehicle speed series i1, the model input accelerator pedal operation amount series im2, and the model input brake pedal operation amount series im3, and inputs them to the vehicle learning model 60. When the vehicle learning model 60 outputs the simulated running state om, the vehicle model 52 transmits it to the chassis dynamometer model 53 and the second pedal play amount adjusting unit 55.
As described in detail later, the second pedal play amount adjusting unit 55 adjusts the pedal detection amounts included in the model output accelerator pedal detection amount series om2 and the model output brake pedal detection amount series om3 of the simulated running state om, and generates the accelerator pedal detection amount series o2 and the brake pedal detection amount series o3.
The chassis dynamometer model 53 detects the vehicle speed included in the simulated running state om and updates its internal state.
The accelerator pedal detection amount series o2 and the brake pedal detection amount series o3 adjusted by the second pedal play amount adjusting unit 55, together with the vehicle speed series held in the chassis dynamometer model 53, are transmitted to the inference data forming unit 32 and the reinforcement learning unit 40 as the adjusted simulated running state o.
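The data flow through the test apparatus model 50 described above can be sketched as a single function. This is a wiring sketch only: each component is passed in as a callable, and the function name, the argument names, and the way the speed record is extended are assumptions. The adjustment rules themselves are not defined here, so `adjust_input` and `adjust_output` are placeholders.

```python
def simulation_step(operation, speed_history, drive_robot_model,
                    adjust_input, vehicle_learning_model, adjust_output):
    """One pass: inferred operation in, adjusted simulated running state o out."""
    # Drive robot model 51: inferred operation -> pedal operation amount series i2, i3.
    i2, i3 = drive_robot_model(operation)
    # Chassis dynamometer model 53: recorded speeds -> vehicle speed series i1.
    i1 = list(speed_history)
    # First pedal play amount adjusting unit 54: i2, i3 -> im2, im3.
    im2 = adjust_input("accelerator", i2)
    im3 = adjust_input("brake", i3)
    # Vehicle learning model 60: (i1, im2, im3) -> simulated running state om.
    o1, om2, om3 = vehicle_learning_model(i1, im2, im3)
    # Chassis dynamometer model 53 updates its internal speed record.
    updated_history = i1 + list(o1)
    # Second pedal play amount adjusting unit 55: om2, om3 -> o2, o3.
    o2 = adjust_output("accelerator", om2)
    o3 = adjust_output("brake", om3)
    return {"speed": updated_history, "o2": o2, "o3": o3}


# Usage with trivial stand-in components (all hypothetical).
result = simulation_step(
    operation={"accel": [20.0, 30.0], "brake": [0.0, 0.0]},
    speed_history=[10.0, 11.0],
    drive_robot_model=lambda op: (op["accel"], op["brake"]),
    adjust_input=lambda pedal, series: [v - 1.0 for v in series],
    vehicle_learning_model=lambda i1, im2, im3: ([i1[-1] + 1.0], im2, im3),
    adjust_output=lambda pedal, series: [v + 1.0 for v in series],
)
```

Passing the components as callables keeps the sketch independent of how each model is actually realized.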

The command vehicle speed generation unit 31 holds the command vehicle speed generated on the basis of information about the mode. The command vehicle speed generation unit 31 generates a series of command vehicle speeds that the vehicle learning model 60 should follow from the present time to a time a predetermined fourth time in the future, and transmits it to the inference data forming unit 32.
The inference data forming unit 32 receives the adjusted simulated running state o and the command vehicle speed series, forms them appropriately, and transmits them to the reinforcement learning unit 40.
The reinforcement learning unit 40 transmits the adjusted simulated running state o and the command vehicle speed series to the operation content inference unit 41 as the running state.
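Generating the command vehicle speed series for a look-ahead window can be sketched as follows. The sampled representation of the mode, the function name `command_speed_series`, and the choice to hold the final speed once the pattern runs out are assumptions of this sketch, not details given in the description.

```python
def command_speed_series(mode_profile, current_step, horizon_steps):
    """Return the command speeds for the next horizon_steps samples of the mode."""
    if not mode_profile:
        raise ValueError("mode_profile must not be empty")
    window = list(mode_profile[current_step:current_step + horizon_steps])
    # Assumption: near the end of the pattern, hold the final command speed.
    while len(window) < horizon_steps:
        window.append(mode_profile[-1])
    return window


# Hypothetical mode sampled at fixed intervals (speeds in km/h).
profile = [0.0, 10.0, 20.0, 30.0, 40.0]
mid_window = command_speed_series(profile, current_step=1, horizon_steps=3)
end_window = command_speed_series(profile, current_step=3, horizon_steps=4)
```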

When the operation content inference unit 41 receives the running state at a certain time, it uses the operation inference learning model 70, which is being trained, to infer a series of operations after that time on the basis of the running state.
In the present embodiment, the operation inference learning model 70 is a neural network including an input layer with input nodes corresponding to each element of the running state, a plurality of intermediate layers, and an output layer with a plurality of output nodes.
When the corresponding running state values are input to the input nodes, a weight-based computation is performed and the results are stored in the intermediate nodes of the intermediate layer that follows the input layer. This computation, and the storing of its results in the intermediate nodes of the next stage, are executed in sequence for each intermediate layer. Finally, the same kind of computation is performed on the results stored in the intermediate nodes of the last intermediate layer, and the results are stored in the output nodes.
Each output node of the operation inference learning model 70 corresponds to one of the operations. In the present embodiment, the targets of operation are the accelerator pedal 2c and the brake pedal 2d, and accordingly the operation inference learning model 70 infers, as operations, for example a series of accelerator pedal operations and a series of brake pedal operations.

The operation content inference unit 41 transmits the accelerator pedal operations and brake pedal operations generated in this way to the drive robot model 51. On this basis, the drive robot model 51 generates the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3 as the input operation and transmits them to the first pedal play amount adjusting unit 54. The first pedal play amount adjusting unit 54 adjusts the input operation to generate the model input accelerator pedal operation amount series im2 and the model input brake pedal operation amount series im3, and transmits them to the vehicle learning model 60. The vehicle learning model 60 receives them and infers the next simulated running state om. The second pedal play amount adjusting unit 55 generates the adjusted simulated running state o from the simulated running state om. In this way, the next running state is generated.
At this stage, the learning of the operation inference learning model 70, that is, the adjustment of the values of the parameters constituting the neural network by error backpropagation and stochastic gradient descent, is not performed; the operation inference learning model 70 only infers operations. The learning of the operation inference learning model 70 is performed later, together with the learning of the value inference learning model 80.

The reward calculation unit 43 calculates a reward by an appropriately designed formula on the basis of the running state, the operation inferred for it by the operation inference learning model 70, and the running state newly generated on the basis of that operation. The reward is designed to take a smaller value the less desirable the operation and the resulting new running state are, and a larger value the more desirable they are. The state action value inference unit 42, described later, calculates the action value so that it becomes higher as the reward becomes larger, and reinforcement learning is performed so that the operation inference learning model 70 outputs operations that make this action value higher.
The reward calculation unit 43 transmits the running state, the operation inferred for it, the running state newly generated on the basis of that operation, and the calculated reward to the learning data forming unit 33. The learning data forming unit 33 forms them appropriately and stores them in the learning data storage unit 35. These data are used for the learning of the value inference learning model 80 described later.
In this way, the inference of the operation by the operation content inference unit 41, the corresponding inference of the simulated running state om by the vehicle model 52, and the calculation of the reward are repeated until enough data has been accumulated for the learning of the value inference learning model 80.
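The accumulation of learning data described above can be sketched as follows. The description says only that the reward is computed by an appropriately designed formula, so the speed-tracking reward below is a hypothetical example chosen for illustration, and the names `Transition`, `speed_tracking_reward`, and `collect` are likewise assumptions.

```python
from collections import namedtuple

# One stored sample: running state, inferred operation, new running state, reward.
Transition = namedtuple("Transition", "state operation next_state reward")


def speed_tracking_reward(command_speeds, achieved_speeds):
    """Hypothetical reward: 0 is best, more negative means a larger speed deviation."""
    errors = [abs(c - a) for c, a in zip(command_speeds, achieved_speeds)]
    return -sum(errors) / len(errors)


def collect(buffer, state, operation, next_state, command_speeds, achieved_speeds):
    """Compute the reward for one step and append the sample to the buffer."""
    reward = speed_tracking_reward(command_speeds, achieved_speeds)
    buffer.append(Transition(state, operation, next_state, reward))
    return reward


buffer = []
perfect = speed_tracking_reward([10.0, 20.0], [10.0, 20.0])
reward = collect(buffer, "s0", "a0", "s1", [10.0, 20.0], [12.0, 19.0])
```

With this example formula, following the command speed exactly gives the largest possible reward, which matches the stated design goal that more desirable results receive larger values.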

When a sufficient amount of running data for the learning of the value inference learning model 80 has been accumulated in the learning data storage unit 35, the state action value inference unit 42 trains the value inference learning model 80. By training its machine learning device, the value inference learning model 80 becomes a learned model in which appropriate learning parameters have been learned and which is used as a program module forming part of artificial intelligence software.
The reinforcement learning unit 40 as a whole calculates an action value indicating how appropriate the operation inferred by the operation inference learning model 70 was, and performs reinforcement learning so that the operation inference learning model 70 outputs operations that make this action value higher. The action value is expressed as a function that takes the running state and the operation for it as arguments and is designed so that the action value becomes higher as the reward becomes larger. In the present embodiment, this function is computed by the learning model 80, a function approximator designed to take the running state and the operation as input and to output the action value.

The learning data generation unit 34 forms the learning data in the learning data storage unit 35 and transmits it to the state action value inference unit 42.
The state action value inference unit 42 receives the formed learning data and causes the value inference learning model 80 to be machine-learned.
In the present embodiment, the value inference learning model 80 is a neural network including an input layer with input nodes corresponding to each element of the running state and of the operation, a plurality of intermediate layers, and an output node corresponding to the action value. Since the value inference learning model 80 is realized by a neural network with the same structure as the operation inference learning model 70, a detailed description of its structure is omitted.

The state action value inference unit 42 adjusts the values of the parameters constituting the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent so as to reduce the TD (temporal difference) error, that is, the error between the action value before the operation is executed and the action value after it is executed, and so that an appropriate value is output as the action value. In this way, the value inference learning model 80 is trained so that it can appropriately evaluate the operations inferred by the current operation inference learning model 70.
As the learning of the value inference learning model 80 progresses, it comes to output more appropriate action values. That is, the action values output by the value inference learning model 80 change from those before learning, so the operation inference learning model 70, which is designed to output operations that make the action value higher, must be updated accordingly. For this reason, the operation content inference unit 41 trains the operation inference learning model 70.
Specifically, the operation content inference unit 41 uses, for example, the negative of the action value as a loss function and trains the operation inference learning model 70 by adjusting the values of the parameters constituting the neural network, such as weights and biases, by error backpropagation and stochastic gradient descent, so that it outputs operations that make this loss as small as possible, that is, operations that make the action value large.
When the operation inference learning model 70 has been trained and updated, the operations it outputs change, so running data is accumulated again and the value inference learning model 80 is trained on that data.
In this way, the learning unit 30 performs reinforcement learning of the learning models 70 and 80 by repeating the learning of the operation inference learning model 70 and of the value inference learning model 80.
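The two training objectives described above can be sketched as follows. The description does not give the exact form of the TD error, so this sketch assumes the common one-step form with a discount factor, in which the value after the operation is the reward plus the discounted action value of the next state. The discount value, the toy value function, and all names are assumptions.

```python
def td_error(q_function, policy, transition, discount=0.99):
    """One-step TD error: (reward + discounted next value) minus current value."""
    state, operation, next_state, reward = transition
    next_operation = policy(next_state)
    target = reward + discount * q_function(next_state, next_operation)
    return target - q_function(state, operation)


def critic_loss(q_function, policy, batch, discount=0.99):
    """Mean squared TD error, minimized when training the value model 80."""
    errors = [td_error(q_function, policy, t, discount) for t in batch]
    return sum(e * e for e in errors) / len(errors)


def actor_loss(q_function, policy, states):
    """Negative action value, minimized when training the operation model 70."""
    return -sum(q_function(s, policy(s)) for s in states) / len(states)


def toy_q(state, operation):
    """Toy value function: highest when the operation equals the state."""
    return -(operation - state) ** 2


def good_policy(state):
    return state


def poor_policy(state):
    return state + 1.0


batch = [(1.0, 1.0, 2.0, 0.5)]   # (state, operation, next_state, reward)
error = td_error(toy_q, good_policy, batch[0])
loss_good = actor_loss(toy_q, good_policy, [1.0, 2.0])
loss_poor = actor_loss(toy_q, poor_policy, [1.0, 2.0])
```

In a full implementation both losses would be minimized by gradient descent on the network parameters; here they are only evaluated to show that a policy choosing higher-valued operations yields the smaller actor loss.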

The learning unit 30 executes this pre-learning, that is, reinforcement learning using the vehicle learning model 60 as the target on which operations are executed, until a pre-learning end criterion is satisfied.

Next, the behavior of the first pedal play amount adjusting unit 54 and the second pedal play amount adjusting unit 55 in the pre-learning of the operation inference learning model 70 using the vehicle learning model 60 as described above will be described.
For the operation inference learning model 70 to be trained accurately, the vehicle learning model 60 in the vehicle model 52 must reproduce the vehicle 2 with high accuracy. What is particularly important here is that the pedal play amounts of the accelerator pedal 2c and the brake pedal 2d of the vehicle 2 are also reproduced accurately in the vehicle model 52.
However, when the travel record data used as learning data for the vehicle learning model 60 is collected, there may be, for example, a problem with the installation of the instruments that detect the pedal operation amounts of the accelerator pedal 2c and the brake pedal 2d. If the travel record data is collected under such conditions, in which the pedal play amount differs from the actual one, the pedal play amount of the vehicle model 52 takes a value different from the true pedal play amount of the vehicle 2.
Alternatively, the pedal play amount can vary even for the same vehicle 2 when the drive robot 4 is reinstalled in the vehicle 2 at a position different from before, or when the pedals are calibrated.
In such cases, the first pedal play amount adjusting unit 54 and the second pedal play amount adjusting unit 55 adjust for the difference in pedal play amount between the vehicle 2 and the vehicle model 52.

FIG. 5 is an explanatory diagram of the first and second pedal play amount adjusting units 54 and 55.
As described later, the operation inferred by the operation inference learning model 70 is transmitted to the drive robot 4 via the vehicle operation control unit 22 and used to operate the vehicle 2. The operation output by the operation inference learning model 70 therefore reflects the pedal play amount of the vehicle 2.
For this reason, the first pedal play amount adjusting unit 54 adjusts the pedal operation amounts included in the input operation, which is based on the operation inferred by the operation inference learning model 70, to the vehicle model 52 to generate an adjusted input operation, and inputs the adjusted input operation to the vehicle model 52.

More specifically, the first pedal play amount adjusting unit 54 receives the input operation generated by the drive robot model 51 on the basis of the operation inferred by the operation inference learning model 70, that is, the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3.
Where the pedal play amount of the vehicle 2 is $d_r$ and the pedal play amount of the vehicle model 52 is $d_v$, the pedal play amount difference (difference value) is defined as $d_{diff} = d_r - d_v$. The first pedal play amount adjusting unit 54 adjusts the pedal operation amount $a_{r\text{-}i}$ included in the input operation, that is, in the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3, to the vehicle model 52 by the following Equation 1, and generates the adjusted pedal operation amount $a_{v\text{-}i}$.

$$a_{v\text{-}i} = f\left(a_{r\text{-}i} - d_{diff}\right) \qquad \text{(Equation 1)}$$

Here, the function f keeps the pedal operation amount within the range from the lower limit of 0% to the upper limit of 100%, as shown in the following Equation 2.

$$f(x) = \begin{cases} 0 & (x < 0) \\ x & (0 \le x \le 100) \\ 100 & (x > 100) \end{cases} \qquad \text{(Equation 2)}$$

In this way, the first pedal play amount adjusting unit 54 uses the pedal play amount difference d_diff, which is the difference between the pedal play amount d_r of the vehicle 2 and the pedal play amount d_v of the vehicle model 52, to adjust the pedal operation amount a_r-i included in the input operation (the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3) to the vehicle model 52 by subtracting the difference value from a_r-i. The first pedal play amount adjusting unit 54 thereby generates the adjusted input operation, that is, the model input accelerator pedal operation amount series im2 and the model input brake pedal operation amount series im3.
The value a_r-i − d_diff passed to the function f in Equation 1 can be rewritten as (a_r-i − d_r) + d_v. The term (a_r-i − d_r) is the amount by which the pedal of the vehicle 2 is depressed beyond its drive start position. The value a_v-i of Equation 1 is this amount plus the pedal play amount d_v of the vehicle model 52. In other words, the depression beyond the drive start position is re-expressed so that it reflects the pedal play amount d_v of the vehicle model 52.
Therefore, as shown for example in FIG. 5, a range R1 of the accelerator pedal operation amount series i2 that lies above the pedal play amount d_r of the vehicle 2 also lies above the pedal play amount d_v of the vehicle model 52 in the model input accelerator pedal operation amount series im2 obtained by converting it with Equation 1.
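
The adjustment performed by the first pedal play amount adjusting unit 54 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names and the numerical play amounts (10% for the vehicle, 4% for the model) are hypothetical.

```python
def clamp_percent(x: float) -> float:
    """Function f: keep a pedal amount within 0% to 100%."""
    return max(0.0, min(100.0, x))


def adjust_input_for_model(a_r_i: float, d_r: float, d_v: float) -> float:
    """Equation 1 as described in the text: a_v-i = f(a_r-i - d_diff)."""
    d_diff = d_r - d_v  # pedal play amount difference
    return clamp_percent(a_r_i - d_diff)


# Hypothetical play amounts in percent of pedal stroke (illustration only).
D_R, D_V = 10.0, 4.0

# A short stand-in for the accelerator pedal operation amount series i2.
series_i2 = [0.0, 5.0, 10.0, 30.0, 100.0]
series_im2 = [adjust_input_for_model(a, D_R, D_V) for a in series_i2]
print(series_im2)  # [0.0, 0.0, 4.0, 24.0, 94.0]
```

With these numbers, an operation amount exactly at the vehicle's play amount (10%) maps to exactly the model's play amount (4%), which is the property described for range R1.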

Note that the pedal play amounts d_r and d_v can take different values for the accelerator pedal 2c and the brake pedal 2d, as indicated in FIG. 5, for example, by the different values shown for the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3, and for the model input accelerator pedal operation amount series im2 and the model input brake pedal operation amount series im3.
Accordingly, the pedal play amount difference d_diff can also differ between the accelerator pedal 2c and the brake pedal 2d.
Equation 1 is therefore, in practice, a different equation for the accelerator pedal 2c and for the brake pedal 2d. The first pedal play amount adjusting unit 54 applies, to the operation amount of each of the accelerator pedal 2c and the brake pedal 2d, Equation 1 with the values corresponding to that pedal.
The same applies to Equation 3, described below for the second pedal play amount adjusting unit 55, and to Equations 4 and 5, described later as modifications of the present embodiment.
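
One way to keep the per-pedal values separate is to store a pair of play amounts for each pedal and look them up when Equation 1 is applied. The sketch below assumes this layout; the `PedalPlay` class, the dictionary keys, and all numerical values are hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PedalPlay:
    d_r: float  # pedal play amount of the vehicle 2 (percent)
    d_v: float  # pedal play amount of the vehicle model 52 (percent)

    @property
    def d_diff(self) -> float:
        """Pedal play amount difference d_diff = d_r - d_v."""
        return self.d_r - self.d_v


# Hypothetical values: each pedal has its own play amounts.
PLAY = {
    "accelerator": PedalPlay(d_r=10.0, d_v=4.0),
    "brake": PedalPlay(d_r=8.0, d_v=5.0),
}


def adjust_input(pedal: str, a_r_i: float) -> float:
    """Apply Equation 1 with the difference value of the given pedal."""
    return max(0.0, min(100.0, a_r_i - PLAY[pedal].d_diff))


print(adjust_input("accelerator", 30.0))  # 24.0
print(adjust_input("brake", 30.0))        # 27.0
```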

The second pedal play amount adjusting unit 55 inputs the model input accelerator pedal operation amount series im2 and the model input brake pedal operation amount series im3 to the vehicle model 52.
Using the vehicle learning model 60, the vehicle model 52 infers the simulated running state om, that is, the model output accelerator pedal detection amount series om2 and the model output brake pedal detection amount series om3.

The second pedal play amount adjusting unit 55 receives the simulated running state om, adjusts the pedal detection amount a_v-o included in it to the vehicle 2 using the following Equation 3, and generates the adjusted pedal detection amount a_r-o.
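
Equation 3 is not reproduced in this text. Since the value passed to the function f is later stated to be a_v-o + d_diff, it can plausibly be written as follows (the subscript typography is an assumption):

```latex
a_{r\text{-}o} = f\left(a_{v\text{-}o} + d_{\mathrm{diff}}\right)
```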

In this way, the second pedal play amount adjusting unit 55 adjusts the pedal detection amount a_v-o included in the simulated running state om to the vehicle 2 and generates the adjusted simulated running state o, that is, the accelerator pedal detection amount series o2 and the brake pedal detection amount series o3.
The value a_v-o + d_diff passed to the function f in Equation 3 can be rewritten as (a_v-o − d_v) + d_r. The term (a_v-o − d_v) is the amount by which the pedal is depressed beyond its drive start position in the vehicle model 52. The value a_r-o of Equation 3 is this amount plus the pedal play amount d_r of the vehicle 2. In other words, the depression beyond the drive start position is re-expressed so that it reflects the pedal play amount d_r of the vehicle 2.
Therefore, as shown for example in FIG. 5, a range R2 of the model output accelerator pedal detection amount series om2 that lies above the pedal play amount d_v of the vehicle model 52 also lies above the pedal play amount d_r of the vehicle 2 in the accelerator pedal detection amount series o2 obtained by converting it with Equation 3.
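
The output-side adjustment can be sketched in the same way as the input side. The function names and the play amounts below are hypothetical, and the short list merely stands in for the model output accelerator pedal detection amount series om2.

```python
def clamp_percent(x: float) -> float:
    """Function f: keep a pedal amount within 0% to 100%."""
    return max(0.0, min(100.0, x))


def adjust_output_for_vehicle(a_v_o: float, d_r: float, d_v: float) -> float:
    """Equation 3 as described in the text: a_r-o = f(a_v-o + d_diff)."""
    d_diff = d_r - d_v  # same difference value as on the input side
    return clamp_percent(a_v_o + d_diff)


# Hypothetical play amounts in percent of pedal stroke (illustration only).
D_R, D_V = 10.0, 4.0

series_om2 = [0.0, 4.0, 24.0, 94.0, 98.0]
series_o2 = [adjust_output_for_vehicle(a, D_R, D_V) for a in series_om2]
print(series_o2)  # [6.0, 10.0, 30.0, 100.0, 100.0]
```

A detection amount exactly at the model's play amount (4%) maps to the vehicle's play amount (10%), matching the behavior described for range R2, and values that would exceed 100% are limited by f.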

Equation 3 is an operation symmetric to Equation 1.
That is, Equation 1 subtracts the pedal play amount difference (difference value) d_diff from the pedal operation amount a_r-i included in the input operation.
Equation 3, in contrast, subtracts −d_diff, the value obtained by inverting the sign of the pedal play amount difference d_diff (the inverted difference value), from the pedal detection amount a_v-o included in the simulated running state om.
Through this operation, the second pedal play amount adjusting unit 55 adjusts the pedal detection amount a_v-o included in the simulated running state om to the vehicle 2 based on the inverted difference value, more specifically by adding the pedal play amount difference (difference value) d_diff to a_v-o, and generates the adjusted simulated running state o.
The adjusted simulated running state o generated by the second pedal play amount adjusting unit 55 is applied to the operation inference learning model 70, and the operation inference learning model 70 is thereby machine-learned.

When the pre-learning of the operation inference learning model 70 and the value inference learning model 80 using the vehicle learning model 60 as the operation execution target is completed, the learning unit 30 further trains the operation inference learning model 70 and the value inference learning model 80 by reinforcement learning, using the actual vehicle 2 as the operation execution target in place of the vehicle learning model 60. FIG. 6 is a block diagram of the learning system 10 showing the data transmission and reception relationships during reinforcement learning after the pre-learning is completed.

The operation content inference unit 41 outputs the operation of the vehicle 2 for the period from the present time until a time that is a third time period in the future, and transmits it to the vehicle operation control unit 22.
The vehicle operation control unit 22 converts the received operation into commands for the first and second actuators 4c and 4d of the drive robot 4 and transmits them to the drive robot 4.
On receiving the commands for the actuators 4c and 4d, the drive robot 4 causes the vehicle 2 to travel on the chassis dynamometer 3 based on them.
The chassis dynamometer 3 and the vehicle state measurement unit 5 detect the vehicle speed of the vehicle 2 and the operation amounts of the accelerator pedal 2c and the brake pedal 2d, generate a series for each, and transmit the series to the inference data forming unit 32.
The command vehicle speed generation unit 31 generates a command vehicle speed series and transmits it to the inference data forming unit 32.
The inference data forming unit 32 receives each series, forms it appropriately, and transmits the result to the reinforcement learning unit 40 as the running state.

Using each of the above series in place of the adjusted simulated running state o generated by the test device model 50, the reinforcement learning unit 40 accumulates learning data in the learning data storage unit 35 with the actual vehicle 2 as the operation execution target, in the same manner as during the pre-learning described with reference to FIG. 4. Once a sufficient amount of running data has been accumulated, the reinforcement learning unit 40 trains the value inference learning model 80 and then trains the operation inference learning model 70.
The learning unit 30 trains the learning models 70 and 80 by reinforcement learning by repeating the accumulation of learning data and the training of the operation inference learning model 70 and the value inference learning model 80.

The learning unit 30 executes the reinforcement learning using the vehicle 2 as the operation execution target until the learning end criterion is satisfied.
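
The alternation described above (accumulate data, train the value model, then train the operation model, and repeat until an end criterion holds) can be outlined as a simple loop. This is only a structural sketch under stated assumptions: every callable is a hypothetical placeholder supplied by the caller, the list `data` stands in for the learning data storage unit 35, the threshold `min_new_samples` is an assumed way of expressing "a sufficient amount", and `collect_episode` is assumed to return a non-empty list each time.

```python
from typing import Callable, List


def reinforcement_learning(
    collect_episode: Callable[[], List],
    train_value_model: Callable[[List], None],
    train_operation_model: Callable[[List], None],
    end_criterion_met: Callable[[int], bool],
    min_new_samples: int,
) -> int:
    """Repeat data accumulation and model training until the end criterion holds."""
    data: List = []  # stand-in for the learning data storage unit 35
    rounds = 0
    while not end_criterion_met(rounds):
        new_samples = 0
        # Accumulate learning data until enough new samples are available.
        while new_samples < min_new_samples:
            episode = collect_episode()
            data.extend(episode)
            new_samples += len(episode)
        # Train the value inference model first, then the operation inference model.
        train_value_model(data)
        train_operation_model(data)
        rounds += 1
    return rounds


# Usage with trivial stand-ins that only record the call order.
calls = []
rounds = reinforcement_learning(
    collect_episode=lambda: [1, 2],
    train_value_model=lambda d: calls.append(("value", len(d))),
    train_operation_model=lambda d: calls.append(("operation", len(d))),
    end_criterion_met=lambda r: r >= 2,
    min_new_samples=3,
)
print(rounds, calls)
```

The same loop shape would apply to the pre-learning phase, with the vehicle model as the source of episodes and the pre-learning end criterion in place of the learning end criterion.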

Next, the behavior of each component of the learning system 10 when inferring operations for a performance measurement of the vehicle 2, that is, after the reinforcement learning of the operation inference learning model 70 has been completed, will be described.

The vehicle speed of the vehicle 2, the detection amount of the accelerator pedal 2c, the detection amount of the brake pedal 2d, and the like are measured by the drive state acquisition unit 23, the vehicle state measurement unit 5, and various measuring instruments provided in the chassis dynamometer 3. These values are transmitted to the inference data forming unit 32.
The command vehicle speed generation unit 31 generates a command vehicle speed series and transmits it to the inference data forming unit 32.
The inference data forming unit 32 receives the vehicle speed, the detection amount of the accelerator pedal 2c, the detection amount of the brake pedal 2d, and the like, together with the command vehicle speed series, forms them appropriately, and transmits the result to the reinforcement learning unit 40 as the running state.
On receiving the running state, the operation content inference unit 41 infers the operation of the vehicle 2 from it using the trained operation inference learning model 70.
The operation content inference unit 41 transmits the inferred operation to the vehicle operation control unit 22.
The vehicle operation control unit 22 receives the operation from the operation content inference unit 41 and operates the drive robot 4 based on it.

Next, a method of training the operation inference learning model 70 that controls the drive robot 4, using the learning system 10 described above, will be described with reference to FIGS. 1 to 6 and FIG. 7. FIG. 7 is a flowchart of the learning method.
Prior to learning the operation, the learning control device 11 collects the travel record data (travel record) to be used during learning. Specifically, the drive robot control unit 20 generates operation patterns of the accelerator pedal 2c and the brake pedal 2d for measuring vehicle characteristics, controls the travel of the vehicle 2 with them, and collects the travel record data (step S1).
The vehicle model 52 acquires the formed travel record data from the learning data generation unit 34 and uses it to train the machine learner 60, thereby generating the vehicle learning model 60 (step S3).

When the training of the vehicle learning model 60 is completed, the reinforcement learning unit 40 of the learning system 10 pre-trains the operation inference learning model 70, which infers the operation of the vehicle 2 (step S5). More specifically, the learning system 10 trains the operation inference learning model 70 in advance by reinforcement learning, by applying to it the simulated running state output by the already trained vehicle learning model 60. During this process, the first pedal play amount adjusting unit 54 and the second pedal play amount adjusting unit 55 use Equations 1 to 3 to compensate for the pedal play amount between the vehicle 2 and the vehicle model 52.
The learning unit 30 executes this pre-learning, that is, the reinforcement learning using the vehicle learning model 60 as the operation execution target, until the pre-learning end criterion is satisfied. If the pre-learning end criterion is not satisfied (No in step S7), the pre-learning continues. When the pre-learning end criterion is satisfied (Yes in step S7), the pre-learning ends.

When the pre-learning of the operation inference learning model 70 and the value inference learning model 80 using the vehicle learning model 60 as the operation execution target is completed, the learning unit 30 further trains the operation inference learning model 70 and the value inference learning model 80 by reinforcement learning, using the actual vehicle 2 as the operation execution target in place of the vehicle learning model 60 (step S9).
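
The order of steps S1 to S9 can be summarized as a small driver function. This is only an outline of the control flow described for FIG. 7: every argument is a hypothetical callable provided by the caller, and nothing here represents the actual interfaces of the units in the embodiment.

```python
from typing import Any, Callable


def run_learning_method(
    collect_travel_records: Callable[[], Any],
    train_vehicle_model: Callable[[Any], Any],
    pretrain_step: Callable[[Any], None],
    pretrain_done: Callable[[], bool],
    real_vehicle_step: Callable[[], None],
    learning_done: Callable[[], bool],
) -> None:
    """Run the steps in the order S1, S3, S5/S7 (loop), S9."""
    records = collect_travel_records()            # S1: collect travel record data
    vehicle_model = train_vehicle_model(records)  # S3: generate the vehicle learning model
    while True:
        pretrain_step(vehicle_model)              # S5: pre-learning against the model
        if pretrain_done():                       # S7: pre-learning end criterion
            break
    while not learning_done():                    # S9: reinforcement learning on the
        real_vehicle_step()                       #     actual vehicle until the end criterion


# Usage with stand-ins that only log which step ran.
log = []
counts = {"pre": 0, "real": 0}


def _pre(vm):
    log.append("S5")
    counts["pre"] += 1


def _real():
    log.append("S9")
    counts["real"] += 1


run_learning_method(
    collect_travel_records=lambda: log.append("S1") or ["record"],
    train_vehicle_model=lambda records: log.append("S3") or "vehicle_model",
    pretrain_step=_pre,
    pretrain_done=lambda: counts["pre"] >= 2,
    real_vehicle_step=_real,
    learning_done=lambda: counts["real"] >= 1,
)
print(log)  # ['S1', 'S3', 'S5', 'S5', 'S9']
```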

Next, the effects of the above learning system and learning method for the operation inference learning model that controls the drive robot will be described.

The learning system 10 of the present embodiment is a learning system 10 for an operation inference learning model 70 that controls a drive robot 4. It includes the operation inference learning model 70, which infers, based on a running state of the vehicle 2 including the vehicle speed, an operation of the vehicle 2 that causes the vehicle 2 to travel in accordance with a specified command vehicle speed, and the drive robot (autopilot robot) 4, which is mounted on the vehicle 2 and causes the vehicle 2 to travel based on the operation, and it machine-learns the operation inference learning model 70. The operation includes the pedal operation amounts of both the accelerator pedal 2c and the brake pedal 2d. The learning system 10 includes: a vehicle model 52 that is set to simulate the operation of the vehicle 2 and outputs a simulated running state om imitating the vehicle 2 and including the pedal detection amount a_v-o of the pedals 2c and 2d; a first pedal play amount adjusting unit 54 that, based on the difference value d_diff between the pedal play amount d_r of the vehicle 2 and the pedal play amount d_v of the vehicle model 52, adjusts the pedal operation amount a_r-i included in the input operations i2 and i3, which are based on the operation inferred by the operation inference learning model 70, to the vehicle model 52 and inputs the adjusted input operations im2 and im3 to the vehicle model 52; and a second pedal play amount adjusting unit 55 that, based on the inverted difference value −d_diff obtained by inverting the sign of the difference value d_diff, adjusts the pedal detection amount a_v-o included in the simulated running state om to the vehicle 2 and generates an adjusted simulated running state o. The operation inference learning model 70 is machine-learned by applying the adjusted simulated running state o to it.
The learning method of the present embodiment is a learning method for an operation inference learning model 70 that controls a drive robot 4. It relates to the operation inference learning model 70, which infers, based on a running state of the vehicle 2 including the vehicle speed, an operation of the vehicle 2 that causes the vehicle 2 to travel in accordance with a specified command vehicle speed, and to the drive robot (autopilot robot) 4, which is mounted on the vehicle 2 and causes the vehicle 2 to travel based on the operation, and it machine-learns the operation inference learning model 70. The operation includes the pedal operation amounts of both the accelerator pedal 2c and the brake pedal 2d. In the method, the pedal operation amount a_r-i included in the input operations i2 and i3, which are based on the operation inferred by the operation inference learning model 70, is adjusted, based on the difference value d_diff between the pedal play amount d_r of the vehicle 2 and the pedal play amount d_v of the vehicle model 52, to the vehicle model 52, which is set to simulate the operation of the vehicle 2 and outputs a simulated running state om imitating the vehicle 2 and including the pedal detection amount a_v-o of the pedals 2c and 2d; the adjusted input operations im2 and im3 are input to the vehicle model 52; the pedal detection amount a_v-o included in the simulated running state om is adjusted to the vehicle 2 based on the inverted difference value −d_diff obtained by inverting the sign of the difference value d_diff, and an adjusted simulated running state o is generated; and the operation inference learning model 70 is machine-learned by applying the adjusted simulated running state o to it.
With the above configuration, the first pedal play amount adjusting unit 54 adjusts the pedal operation amount a_r-i included in the input operations i2 and i3, which are based on the inference result of the operation inference learning model 70, to the vehicle model 52 based on the difference value d_diff between the pedal play amount d_r of the vehicle 2 and the pedal play amount d_v of the vehicle model 52. The difference between the pedal play amounts d_r and d_v of the vehicle 2 and the vehicle model 52 is therefore properly reflected, and the pedal operation amount is properly adjusted.
In addition, the second pedal play amount adjusting unit 55 adjusts the pedal detection amount a_v-o included in the simulated running state om output by the vehicle model 52 to the vehicle 2 based on the inverted difference value −d_diff obtained by inverting the sign of the difference value d_diff. An adjustment in the direction opposite to that of the first pedal play amount adjusting unit 54 is thus performed, the difference between the pedal play amounts d_r and d_v of the vehicle 2 and the vehicle model 52 is properly reflected, and the pedal detection amount is properly adjusted.
In this way, the difference in pedal play amount between the vehicle 2 and the vehicle model 52 is properly absorbed. As a result, even when the operation inference learning model 70, which infers the operation of the vehicle 2, is machine-learned with the vehicle model 52 as the operation execution target, the decrease in its learning accuracy caused by the difference in pedal play amount is suppressed.
Furthermore, because the difference in pedal play amount is absorbed by converting the values at the input and output of the vehicle model 52, the vehicle model 52 does not need to be retrained even when a difference in pedal play amount arises, and the computational cost required for the learning system 10 is reduced. It is therefore easy to suppress the decrease in learning accuracy of the operation inference learning model 70.

In addition, the first pedal play amount adjusting unit 54 adjusts the pedal operation amount a_r-i included in the input operations i2 and i3 by subtracting the difference value d_diff from a_r-i, thereby generating the adjusted pedal operation amount a_v-i, and the second pedal play amount adjusting unit 55 adjusts the pedal detection amount a_v-o included in the simulated running state om by adding the difference value d_diff to a_v-o, thereby generating the adjusted pedal detection amount a_r-o.
With the above configuration, as already described using Equations 1 and 3, the difference in pedal play amount between the vehicle 2 and the vehicle model 52 can be effectively absorbed.

In addition, the vehicle model 52 includes a vehicle learning model 60 that is machine-learned, based on the actual travel record of the vehicle 2, so as to simulate the operation of the vehicle 2, and that outputs the simulated running state om based on the adjusted input operations im2 and im3.
In the present embodiment in particular, the vehicle learning model 60 is implemented as a neural network.
With the above configuration, the learning system 10 can be suitably implemented.

In addition, the operation inference learning model 70 is trained by reinforcement learning.
In the initial stage of reinforcement learning, an operation inference learning model 70 trained by reinforcement learning may output undesirable operations that are impossible for a human and that place a burden on the actual vehicle, such as operating the pedals 2c and 2d at an extremely high frequency.
With the above configuration, in this initial stage of reinforcement learning, the vehicle learning model 60 outputs, based on the operation inferred by the operation inference learning model 70, the simulated running state om, which is a running state s imitating the vehicle 2, and the operation inference learning model 70 is trained in advance by reinforcement learning by applying this state to it. That is, in the initial stage of reinforcement learning, the operation inference learning model 70 can be trained without using the actual vehicle 2. The burden on the actual vehicle 2 can therefore be reduced.
In addition, once the pre-learning is completed, the operation inference learning model 70 is further trained by reinforcement learning using the actual vehicle 2, so the learning accuracy of the operations output by the operation inference learning model 70 can be improved compared with the case where it is trained using only the vehicle learning model 60.
In particular, because the pre-learning in the above configuration is performed with the vehicle learning model 60 as the operation execution target, the learning time can be reduced compared with the case where the vehicle 2 is the operation execution target throughout the entire pre-learning process.

[First Modification of the Embodiment]
Next, a first modification of the learning system and learning method for the operation inference learning model that controls the drive robot, described above as the embodiment, will be described with reference to FIG. 8. FIG. 8 is an explanatory diagram of the operation of the second pedal play amount adjusting unit in the first modification. The learning system of the first modification differs from the learning system 10 of the above embodiment in the processing performed by the second pedal play amount adjusting unit 55A.

In the example of FIG. 5, the pedal play amount d_v of the vehicle model 52 is smaller than the pedal play amount d_r of the vehicle 2. In such a case, when, for example, the model output brake pedal detection amount series om3 output by the vehicle model 52 is converted by the second pedal play amount adjusting unit 55 of the above embodiment, the resulting brake pedal detection amount series o3 can contain a portion, shown as R3, in which the adjusted pedal detection amount a_r-o takes a constant value.
This is for the following reason. When the pedal operation amount a_r-i included in the input operation is smaller than the pedal play amount difference d_diff, the calculation using Equation 1 in the first pedal play amount adjusting unit 54 sets the adjusted pedal operation amount a_v-i to 0. The pedal detection amount a_v-o included in the simulated running state om, obtained after applying this input to the vehicle model 52, is therefore also close to 0. The second pedal play amount adjusting unit 55 then adds the pedal play amount difference d_diff to this near-zero value by Equation 3. As a result, the portions of the accelerator pedal detection amount series o2 and the brake pedal detection amount series o3 that correspond to where a_r-i is smaller than d_diff in the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3 are uniformly converted to the value of the pedal play amount difference d_diff.
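
The constant portion R3 can be reproduced with a short numerical example. The sketch below assumes a difference value of 6% and, purely for illustration, replaces the vehicle model 52 with a stand-in that returns its input unchanged; a real vehicle learning model would not behave this way.

```python
def f(x: float) -> float:
    """Limit a pedal amount to the range 0% to 100%."""
    return max(0.0, min(100.0, x))


D_DIFF = 6.0  # hypothetical d_diff = d_r - d_v > 0


def round_trip(a_r_i: float) -> float:
    a_v_i = f(a_r_i - D_DIFF)  # Equation 1 (first adjusting unit 54)
    a_v_o = a_v_i              # stand-in for the vehicle model 52 (assumed to echo its input)
    return f(a_v_o + D_DIFF)   # Equation 3 (second adjusting unit 55)


inputs = [0.0, 2.0, 4.0, 6.0, 8.0, 20.0]
print([round_trip(a) for a in inputs])
# [6.0, 6.0, 6.0, 6.0, 8.0, 20.0]
```

Every input below the difference value comes back as 6.0, which is the flat segment described for R3, while inputs at or above it are recovered unchanged.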

In contrast, the second pedal play amount adjusting unit 55A in this modification adjusts the pedal detection amount a_v-o included in the simulated running state om using the following Equation 4.
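
Equation 4 is not reproduced in this text. From the case distinction described for this modification, it can plausibly be written as follows; carrying the limiting function f over from Equation 3 into the first case is an assumption:

```latex
a_{r\text{-}o} =
\begin{cases}
f\left(a_{v\text{-}o} + d_{\mathrm{diff}}\right) & \left(a_{r\text{-}i} - d_{\mathrm{diff}} \ge 0\right) \\
a_{r\text{-}i} & (\text{otherwise})
\end{cases}
```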

In the above equation, for the portions of the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3 in which the pedal operation amount a_r-i is smaller than the pedal play amount difference d_diff, the pedal operation amount a_r-i included in those series is used as the value of the adjusted pedal detection amount a_r-o. As a result, as shown as R4 in FIG. 8, the portion R3, in which a constant value was output in the above embodiment, becomes a gentle curve close to the accelerator pedal operation amount series i2 and the brake pedal operation amount series i3.

In the learning system of this modification, when the value obtained by subtracting the difference value d_diff from the pedal operation amount a_r-i included in the input operations i2 and i3 is 0 or more, the second pedal play amount adjusting unit 55A adjusts the pedal detection amount a_v-o included in the simulated running state om by adding the difference value d_diff to a_v-o, thereby generating the adjusted pedal detection amount a_r-o. Otherwise, it uses the value of the pedal operation amount a_r-i included in the input operations i2 and i3 as the adjusted pedal detection amount a_r-o.
With the above configuration, the adjusted pedal detection amount a_r-o can be expressed as a value close to the pedal operation amount a_r-i included in the input operations i2 and i3. The adjusted simulated running state o that is input to the operation inference learning model 70 can therefore be brought close to actual pedal operation, and the decrease in learning accuracy of the operation inference learning model 70 can be effectively suppressed.
Needless to say, this modification also provides the same other effects as the embodiment already described.

[Second Modification of the Embodiment]
Next, a second modification of the learning system and learning method of the operation inference learning model for controlling the drive robot shown in the above embodiment will be described with reference to FIGS. 9 and 10. FIG. 9 is an explanatory diagram of the case where inference by the vehicle model takes time during the pedal play amount adjustment of the first modification. FIG. 10 is an explanatory diagram of the operation of the second pedal play amount adjusting unit in the second modification. The learning system of the second modification is a further modification of the learning system of the first modification and differs in the processing performed by the second pedal play amount adjusting unit 55B.

When an accelerator pedal operation amount series i2 such as that shown in the upper left of FIG. 9 is input to the learning system described as the first modification, the first pedal play amount adjusting unit 54 converts it into a model input accelerator pedal operation amount series im2 such as that shown in the upper right of FIG. 9, which is then input to the vehicle model.
If the vehicle model takes time to infer the model output accelerator pedal detection amount series om2, a delay, shown as the delay time D, arises between the model input accelerator pedal operation amount series im2 and the model output accelerator pedal detection amount series om2, as shown in the lower right of FIG. 9.
When the model output accelerator pedal detection amount series om2 is input to the second pedal play amount adjusting unit 55A of the first modification in this situation, an accelerator pedal detection amount series o2 such as that shown in the lower left of FIG. 9 is output. More specifically, during the delay time D starting at time T1, at which the pedal operation amount a_r-i in the accelerator pedal operation amount series i2 becomes equal to or less than the pedal play amount difference d_diff, Equation 4 adopts the value of the accelerator pedal operation amount series i2 as the accelerator pedal detection amount series o2. As a result, as shown by the range R5, in the falling portion of the accelerator pedal detection amount series o2, a portion that should preferably take a value larger than the pedal play amount difference d_diff instead takes a value smaller than d_diff.

The second pedal play amount adjusting unit 55B of this modification adjusts the pedal detection amount a_v-o included in the simulated running state om using Equation 5 below.
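The image of Equation 5 is not reproduced in this text. A reconstruction based solely on the description of the second pedal play amount adjusting unit 55B given in this section (an assumption about the form, not the original notation) would be:

```latex
a_{r\text{-}o} =
\begin{cases}
a_{v\text{-}o} + d_{\mathrm{diff}} & \text{if } a_{r\text{-}i} - d_{\mathrm{diff}} \ge 0 \ \text{or}\ a_{v\text{-}o} > 0 \\
a_{r\text{-}i} & \text{otherwise}
\end{cases}
```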

That is, in the above equation, when the pedal detection amount a_v-o included in the simulated running state om has a value greater than 0, the adjusted pedal detection amount a_r-o reflects the value of the pedal detection amount a_v-o included in the simulated running state om rather than the pedal operation amount a_r-i included in the input operations i2 and i3. As a result, as shown by the range R6 in FIG. 10, at least in the portion larger than the pedal play amount difference d_diff, the value of the model output accelerator pedal detection amount series om2 is maintained rather than being replaced by the value of the accelerator pedal operation amount series i2.

In the learning system of this modification, when the value obtained by subtracting the difference value d_diff from the pedal operation amount a_r-i included in the input operations i2 and i3 is 0 or more, or when the pedal detection amount a_v-o included in the simulated running state om is greater than 0, the second pedal play amount adjusting unit 55B adjusts the pedal detection amount a_v-o included in the simulated running state om by adding the difference value d_diff to it, thereby generating the adjusted pedal detection amount a_r-o. Otherwise, the value of the pedal operation amount a_r-i included in the input operations i2 and i3 is used as the adjusted pedal detection amount a_r-o.
With this configuration, a decrease in the learning accuracy of the operation inference learning model 70 can be effectively suppressed even when inference by the vehicle model takes time.
Needless to say, this modification also provides the other effects of the embodiment already described.
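The difference between the two modifications can be illustrated with a minimal Python sketch. The selection rules follow the descriptions of units 55A and 55B above; everything else is an assumption made for illustration only: the function names (adjust_55a, adjust_55b, toy_vehicle_model, demo), the sample values, and in particular the toy vehicle model, which simply echoes the clamped input after a fixed delay and is not the vehicle model 52 of the specification.

```python
from typing import List, Tuple


def adjust_55a(a_r_i: float, a_v_o: float, d_diff: float) -> float:
    """First modification (unit 55A): add d_diff only while a_r_i - d_diff >= 0."""
    if a_r_i - d_diff >= 0:
        return a_v_o + d_diff
    return a_r_i


def adjust_55b(a_r_i: float, a_v_o: float, d_diff: float) -> float:
    """Second modification (unit 55B): also add d_diff while the model output is > 0."""
    if a_r_i - d_diff >= 0 or a_v_o > 0:
        return a_v_o + d_diff
    return a_r_i


def toy_vehicle_model(im: List[float], delay: int) -> List[float]:
    """Hypothetical stand-in for the vehicle model: clamps the input at 0
    and returns it shifted by `delay` steps (models the delay time D)."""
    clamped = [max(0.0, x) for x in im]
    return [0.0] * delay + clamped[: len(clamped) - delay]


def demo() -> Tuple[List[float], List[float]]:
    d_diff = 10.0  # assumed pedal play amount difference
    i2 = [0.0, 20.0, 40.0, 40.0, 20.0, 5.0, 0.0, 0.0]  # assumed input series
    # First adjusting unit 54: subtract the difference value (as in claim 2).
    im2 = [a - d_diff for a in i2]
    om2 = toy_vehicle_model(im2, delay=1)
    o2_a = [adjust_55a(a, v, d_diff) for a, v in zip(i2, om2)]
    o2_b = [adjust_55b(a, v, d_diff) for a, v in zip(i2, om2)]
    return o2_a, o2_b


if __name__ == "__main__":
    o2_a, o2_b = demo()
    print("55A:", o2_a)
    print("55B:", o2_b)
```

In this toy run, the sixth sample is where the two rules differ: the input has already dropped below d_diff while the delayed model output is still positive, so the 55A rule falls back to the small input value (the situation shown as range R5), whereas the 55B rule keeps the model output plus d_diff (range R6).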

The learning system and learning method of the operation inference learning model for controlling the drive robot of the present invention are not limited to the above embodiment and modifications described with reference to the drawings, and various other modifications are conceivable within its technical scope.

For example, in the above embodiment and modifications, the vehicle model 52 outputs a series of pedal detection amounts, that is, a plurality of pedal detection amount values. However, the present invention is not limited to this, and the vehicle model may output only a single pedal detection amount value.
In the above embodiment and modifications, the operation includes both the accelerator pedal 2c and the brake pedal 2d. However, the present invention is not limited to this, and the operation may include only the accelerator pedal 2c or only the brake pedal 2d.
In the above embodiment and modifications, the operation inferred by the operation inference learning model 70 is converted by the drive robot model 51 into actual pedal operation amount values to generate the input operation, but the present invention is not limited to this. For example, the operation inference learning model 70 may infer the input operation itself, and this input operation may be input to the vehicle model 52 without passing through the drive robot model 51.

In the above embodiment and modifications, the vehicle model 52 includes the vehicle learning model 60 realized as a neural network, and the vehicle 2 is simulated by this vehicle learning model 60, but the present invention is not limited to this. That is, the vehicle model need not be a machine-learned model, as long as it is set to simulate the vehicle 2 and, when an operation is input, outputs a simulated running state imitating the vehicle 2 that includes the pedal detection amount of the pedal.
For example, where the dynamic characteristics need not match those of the vehicle 2, a physical model that does not depend on the vehicle 2 to be learned may be used as the vehicle model. By using the first and second pedal play amount adjusting units 54 and 55 described in the embodiment and modifications, the difference in pedal play amount between the vehicle to be learned by the operation inference learning model 70 and the vehicle model 52 used for pre-learning is adjusted and absorbed. Therefore, a physical model of a vehicle different from the learning target of the operation inference learning model 70 can also be used as the vehicle model 52.
As already explained, one purpose of pre-learning the operation inference learning model 70 is to avoid using the actual vehicle in the early stage of learning, when the model may infer undesirable operations that would be impossible for a human and would place a burden on the actual vehicle. Therefore, when pre-learning is performed mainly for this purpose, and provided that, for example, the dynamic characteristics of the vehicle 2 will be learned using the actual vehicle 2 after pre-learning, even a physical model that does not depend on the vehicle 2 to be learned is fully usable for pre-learning.
In this case, as long as some vehicle model can be prepared, the operation inference learning model 70 can be pre-learned without machine-learning the vehicle learning model 60. Learning of the operation inference learning model 70 is therefore easier.

In addition, the configurations described in the above embodiment and modifications may be selected or omitted, or changed to other configurations as appropriate, without departing from the gist of the present invention.

1 Test device
2 Vehicle
2c Accelerator pedal
2d Brake pedal
3 Chassis dynamometer
4 Drive robot (autopilot robot)
10 Learning system
11 Learning control device
20 Drive robot control unit
30 Learning unit
40 Reinforcement learning unit
41 Operation content inference unit
50 Test device model
51 Drive robot model
52 Vehicle model
53 Chassis dynamometer model
54 First pedal play amount adjusting unit
55, 55A, 55B Second pedal play amount adjusting unit
60 Vehicle learning model
70 Operation inference learning model
i1 Vehicle speed series
i2 Accelerator pedal operation amount series (input operation)
i3 Brake pedal operation amount series (input operation)
im2 Model input accelerator pedal operation amount series (adjusted input operation)
im3 Model input brake pedal operation amount series (adjusted input operation)
om Simulated running state
o Adjusted simulated running state
a_r-i Pedal operation amount included in input operation
a_v-i Adjusted pedal operation amount
a_v-o Pedal detection amount included in simulated running state
a_r-o Adjusted pedal detection amount
d_r Pedal play amount of vehicle
d_v Pedal play amount of vehicle model
d_diff Pedal play amount difference (difference value)

Claims

A learning system for an operation inference learning model that controls an autopilot robot, the learning system comprising: an operation inference learning model that infers, based on a traveling state of a vehicle including a vehicle speed, an operation of the vehicle that causes the vehicle to travel in accordance with a specified command vehicle speed; and an autopilot robot that is mounted on the vehicle and causes the vehicle to travel based on the operation, the learning system machine-learning the operation inference learning model, wherein
the operation includes a pedal operation amount of one or both of an accelerator pedal and a brake pedal,
the learning system further comprising:
a vehicle model that is set to simulate the vehicle and outputs a simulated running state imitating the vehicle, the simulated running state including a pedal detection amount of the pedal;
a first pedal play amount adjusting unit that, based on a difference value between the pedal play amount of the vehicle and the pedal play amount of the vehicle model, adjusts the pedal operation amount included in an input operation based on the operation inferred by the operation inference learning model so as to match the vehicle model, and inputs the adjusted input operation to the vehicle model; and
a second pedal play amount adjusting unit that, based on an inverted difference value obtained by inverting the sign of the difference value, adjusts the pedal detection amount included in the simulated running state so as to match the vehicle, and generates an adjusted simulated running state,
wherein the operation inference learning model is machine-learned by applying the adjusted simulated running state to the operation inference learning model.

The learning system for an operation inference learning model that controls an autopilot robot according to claim 1, wherein
the first pedal play amount adjusting unit adjusts the pedal operation amount included in the input operation by subtracting the difference value from the pedal operation amount to generate an adjusted pedal operation amount, and
the second pedal play amount adjusting unit adjusts the pedal detection amount included in the simulated running state by adding the difference value to the pedal detection amount to generate an adjusted pedal detection amount.

The learning system for an operation inference learning model that controls an autopilot robot according to claim 2, wherein
when the value obtained by subtracting the difference value from the pedal operation amount included in the input operation is 0 or more, the second pedal play amount adjusting unit adjusts the pedal detection amount included in the simulated running state by adding the difference value to the pedal detection amount to generate the adjusted pedal detection amount, and
otherwise, the second pedal play amount adjusting unit uses the value of the pedal operation amount included in the input operation as the adjusted pedal detection amount.

The learning system for an operation inference learning model that controls an autopilot robot according to claim 2, wherein
when the value obtained by subtracting the difference value from the pedal operation amount included in the input operation is 0 or more, or when the pedal detection amount included in the simulated running state is greater than 0, the second pedal play amount adjusting unit adjusts the pedal detection amount included in the simulated running state by adding the difference value to the pedal detection amount to generate the adjusted pedal detection amount, and
otherwise, the second pedal play amount adjusting unit uses the value of the pedal operation amount included in the input operation as the adjusted pedal detection amount.

The learning system for an operation inference learning model that controls an autopilot robot according to claim 1, wherein the vehicle model includes a vehicle learning model that is machine-learned to simulate the vehicle based on actual running results of the vehicle and that outputs the simulated running state based on the adjusted input operation.

The learning system for an operation inference learning model that controls an autopilot robot according to claim 5, wherein the vehicle learning model is realized by a neural network.

The learning system for an operation inference learning model that controls an autopilot robot according to any one of claims 1 to 3, wherein the operation inference learning model is trained by reinforcement learning.

A learning method for an operation inference learning model that controls an autopilot robot, the method machine-learning the operation inference learning model in relation to: an operation inference learning model that infers, based on a traveling state of a vehicle including a vehicle speed, an operation of the vehicle that causes the vehicle to travel in accordance with a specified command vehicle speed; and an autopilot robot that is mounted on the vehicle and causes the vehicle to travel based on the operation, wherein
the operation includes a pedal operation amount of one or both of an accelerator pedal and a brake pedal,
the method comprising: adjusting, based on a difference value between the pedal play amount of the vehicle and the pedal play amount of a vehicle model that is set to simulate the vehicle and outputs a simulated running state imitating the vehicle, the simulated running state including a pedal detection amount of the pedal, the pedal operation amount included in an input operation based on the operation inferred by the operation inference learning model so as to match the vehicle model, and inputting the adjusted input operation to the vehicle model;
adjusting, based on an inverted difference value obtained by inverting the sign of the difference value, the pedal detection amount included in the simulated running state so as to match the vehicle, and generating an adjusted simulated running state; and
machine-learning the operation inference learning model by applying the adjusted simulated running state to the operation inference learning model.