JP2022135788A

JP2022135788A - vehicle controller

Info

Publication number: JP2022135788A
Application number: JP2021035841A
Authority: JP
Inventors: Atsushi Kawamata (川俣 淳)
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2022-09-15

Abstract

PROBLEM TO BE SOLVED: To suppress the input of an excessive load to a driveline while securing running performance by individually controlling the torque of each driving wheel.
SOLUTION: A vehicle has a mechanism that can individually control the torque of each driving wheel. A control device 100 serving as a vehicle control device has a storage device 110 and a processing circuit 120. The storage device 110 stores a learned model that, when a state variable including road surface information data extracted from an image of the road surface ahead of the vehicle captured by a camera 70 is input, determines and outputs an action for manipulating a control amount of the mechanism. The processing circuit 120 manipulates the control amount on the basis of the action output when the state variable is input to the learned model, and thereby controls the mechanism. The learned model is trained by reinforcement learning in which a larger reward is given the longer the distance the vehicle travels without any driving wheel slipping.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a vehicle control device.

When a vehicle travels on a rough road such as rocky terrain, the driving wheels may slip. When a slipping driving wheel regains grip, its rotation is abruptly braked. As a result, a large load is applied to the driveline that transmits the driving force to that driving wheel.

While the driving wheels are slipping, the control device disclosed in Patent Document 1 determines whether the vehicle is in a state in which an excessive load is estimated to act on the driveline once the slip converges. When the control device determines that the vehicle is in such a state, it changes the gear stage to the high gear ratio side.

By changing the gear stage to the high gear ratio side in this way, the control device lowers the rotational speed of the driving wheels and reduces the load that acts on the driveline when a slipping driving wheel regains grip.

Patent Document 1: Japanese Patent Application Laid-Open No. 2009-210000

However, if the gear stage is changed to the high gear ratio side as described above, the rotational speed of all the driving wheels decreases, not only that of the slipping driving wheels. As a result, the speed of the vehicle decreases and the running performance deteriorates.

Means for solving the above problem and their effects are described below.
A vehicle control device for solving the above problem controls a vehicle having a mechanism capable of individually controlling the torque of each driving wheel. The vehicle control device includes a storage device that stores a learned model which, when state variables including road surface information data extracted from an image of the road surface ahead of the vehicle are input, determines and outputs an action for manipulating a control amount of the mechanism, and a processing circuit that manipulates the control amount based on the action output when the state variables are input to the learned model and thereby controls the mechanism. The learned model is a model trained by reinforcement learning that gives a larger reward the longer the distance the vehicle travels without any driving wheel slipping.

In the above configuration, the action is determined using state variables that include road surface information data extracted from an image of the road surface ahead of the vehicle. Information about the road surface that the driving wheels are about to contact can therefore be reflected in the choice of action. This makes it possible to control the torque of each driving wheel so as to prevent slip before it occurs, taking into account the condition of the road surface that each driving wheel contacts. In addition, the learned model is trained by reinforcement learning that gives a larger reward the longer the distance traveled without any driving wheel slipping. Therefore, rather than uniformly suppressing the torque of all the driving wheels, this control device can control the torque of each driving wheel individually so that the vehicle keeps moving forward while avoiding slip.

That is, the above configuration makes it possible to suppress the input of an excessive load to the driveline while securing running performance.

FIG. 1 is a schematic diagram showing the relationship between a vehicle control device according to an embodiment and the drive train of the vehicle controlled by the vehicle control device.
FIG. 2 is a schematic diagram showing the structure of a multilayer neural network that calculates an action-value function.
FIG. 3 is a flowchart showing the flow of a series of processes in the learning process.
FIG. 4 is a flowchart showing the flow of a series of processes in the rough road running process.

An embodiment of the vehicle control device is described below with reference to FIGS. 1 to 4.
FIG. 1 shows a control device 100, which is the vehicle control device, and the drive train of the vehicle in which the control device 100 is mounted.

<Configuration of drive train>
As shown in FIG. 1, this vehicle has four wheels: a right front wheel 50RF, a left front wheel 50LF, a right rear wheel 50RR, and a left rear wheel 50LR.

This vehicle is equipped with an engine 10 as a power source. The engine 10 is housed in an engine compartment at the front of the vehicle. A transmission 20 that changes the speed of the rotation of the output shaft of the engine 10 is also housed in the engine compartment. A front differential 21 is housed in the case of the transmission 20. The front differential 21 transmits the rotation whose speed has been changed by the transmission 20 to a right front drive shaft 51RF and a left front drive shaft 51LF. The right front drive shaft 51RF is connected to the right front wheel 50RF. The right front wheel 50RF is driven by the rotation of the right front drive shaft 51RF. The left front drive shaft 51LF is connected to the left front wheel 50LF. The left front wheel 50LF is driven by the rotation of the left front drive shaft 51LF. A brake 60RF is provided on the right front wheel 50RF. A brake 60LF is provided on the left front wheel 50LF.

A transfer 30 is provided between the front differential 21 and the right front wheel 50RF, in the central portion of the vehicle in its width direction. The transfer 30 is a device for transmitting the rotation transmitted from the front differential 21 to the right front drive shaft 51RF to the rear wheels via a propeller shaft 52. The transfer 30 has a front disconnect mechanism 31. The front disconnect mechanism 31 is a mechanism for interrupting the transmission of rotation between the propeller shaft 52 on one side and the front differential 21 and the right front drive shaft 51RF on the other.

The propeller shaft 52 is connected to a rear differential 40. The rear differential 40 is arranged between the right rear wheel 50RR and the left rear wheel 50LR. The rear differential 40 transmits the rotation transmitted through the propeller shaft 52 to a right rear drive shaft 51RR and a left rear drive shaft 51LR. The right rear drive shaft 51RR is connected to the right rear wheel 50RR. The right rear wheel 50RR is driven by the rotation of the right rear drive shaft 51RR. The left rear drive shaft 51LR is connected to the left rear wheel 50LR. The left rear wheel 50LR is driven by the rotation of the left rear drive shaft 51LR. A brake 60RR is provided on the right rear wheel 50RR. A brake 60LR is provided on the left rear wheel 50LR.

The rear differential 40 is equipped with a rear disconnect mechanism 41, an electronically controlled coupling 42R, and an electronically controlled coupling 42L.
The electronically controlled couplings 42R and 42L are mechanisms that change the proportion of the driving force transmitted to the rear drive shafts by operating their built-in electromagnetic clutches. That is, the electronically controlled coupling 42R can change the proportion of the driving force transmitted to the right rear wheel 50RR by operating its built-in electromagnetic clutch. Likewise, the electronically controlled coupling 42L can change the proportion of the driving force transmitted to the left rear wheel 50LR by operating its built-in electromagnetic clutch.

The rear disconnect mechanism 41 is a mechanism for interrupting the transmission of rotation between the propeller shaft 52 and the electronically controlled couplings 42R and 42L.
The drive train of this vehicle is thus configured to be able to drive all four wheels. That is, in this vehicle, any of the right front wheel 50RF, the left front wheel 50LF, the right rear wheel 50RR, and the left rear wheel 50LR can be a driving wheel.

<Configuration of control device 100>
The control device 100 includes a storage device 110 that stores programs, and a processing circuit 120 that executes the programs stored in the storage device 110 to perform various processes. The control device 100 controls each part of the engine 10 and thereby controls the output of the engine 10. The control device 100 also controls the transmission 20, the front disconnect mechanism 31, the rear disconnect mechanism 41, the electronically controlled coupling 42R, and the electronically controlled coupling 42L. Further, the control device 100 controls the brakes 60RF, 60LF, 60RR, and 60LR provided on the four wheels. In this vehicle, the braking forces of the brakes 60RF, 60LF, 60RR, and 60LR can be controlled individually.

The control device 100 controls the driving force of the vehicle by controlling the output of the engine 10 and controlling the transmission 20. In this vehicle, the distribution of torque to each wheel can also be controlled by controlling the front disconnect mechanism 31, the rear disconnect mechanism 41, the electronically controlled coupling 42R, and the electronically controlled coupling 42L.

For example, by interrupting the transmission of rotation with the front disconnect mechanism 31 and the rear disconnect mechanism 41, the control device 100 can distribute torque only to the right front wheel 50RF and the left front wheel 50LF. That is, this vehicle can also run in front-wheel drive without driving the rear wheels. As another example, by manipulating the torque distributed to the rear wheels with the electronically controlled couplings 42R and 42L, without interrupting the transmission of rotation with the disconnect mechanisms, the control device 100 can change the torque distribution to the right rear wheel 50RR and the left rear wheel 50LR.

Furthermore, in this vehicle, the torque of each driving wheel can be controlled individually by also controlling the braking force on each driving wheel with the brakes 60RF, 60LF, 60RR, and 60LR.

In short, the vehicle controlled by the control device 100 has a mechanism capable of individually controlling the torque of each driving wheel. Specifically, the front disconnect mechanism 31, the rear disconnect mechanism 41, the electronically controlled coupling 42R, the electronically controlled coupling 42L, and the brakes 60RF, 60LF, 60RR, and 60LR correspond to this mechanism.

Various sensors and devices that collect information for grasping the driver's operations, the state of the vehicle, and the condition of the road surface are connected to the control device 100.
For example, an accelerator position sensor 71 is connected to the control device 100. The accelerator position sensor 71 detects the accelerator opening, which is the amount by which the driver operates the accelerator. A crank position sensor 72 is connected to the control device 100. The crank position sensor 72 detects a crank angle signal corresponding to the rotation angle of the crankshaft, which is the output shaft of the engine 10. The control device 100 calculates the engine rotational speed, which is the rotational speed of the crankshaft, based on the crank angle signal. A brake sensor 73 is connected to the control device 100. The brake sensor 73 detects the amount of operation of the brake pedal. A vehicle speed sensor 74 is connected to the control device 100. The vehicle speed sensor 74 detects the rotational speed of the output shaft of the transmission 20. The control device 100 calculates the speed of the vehicle based on the rotational speed of the output shaft detected by the vehicle speed sensor 74. A steering sensor 75 is connected to the control device 100. The steering sensor 75 detects the steering angle of the steering wheel. Speed sensors that detect the rotational speed of each driving wheel are connected to the control device 100. Specifically, a right front speed sensor 76RF that detects the rotational speed of the right front wheel 50RF and a left front speed sensor 76LF that detects the rotational speed of the left front wheel 50LF are connected to the control device 100. A right rear speed sensor 76RR that detects the rotational speed of the right rear wheel 50RR and a left rear speed sensor 76LR that detects the rotational speed of the left rear wheel 50LR are also connected to the control device 100. A yaw rate sensor 80 and a linear G sensor 81 are also connected to the control device 100. The yaw rate sensor 80 detects the yaw rate, which is the angular velocity of the vehicle when it turns. The linear G sensor 81 detects the acceleration of the vehicle.

Further, a camera 70 that captures images of the road surface ahead of the vehicle is connected to the control device 100. A raindrop sensor 78 is also connected to the control device 100. The raindrop sensor 78 detects the amount of raindrops adhering to the front window. This vehicle is also equipped with a GPS device 77. The control device 100 acquires vehicle position information from the GPS device 77. The control device 100 can also calculate the speed of the vehicle based on the acquired position information.

This vehicle is also provided with a running mode changeover switch 79. The running mode changeover switch 79 is also connected to the control device 100. The control device 100 switches the running mode based on a signal from the running mode changeover switch 79. Specifically, in this vehicle, a rough road running mode with high rough-road running performance can be selected with the running mode changeover switch 79. When the rough road running mode is selected, the control device 100 fixes the front disconnect mechanism 31 and the rear disconnect mechanism 41 in the state in which the torque generated by the engine 10 is transmitted to the rear differential 40 via the propeller shaft 52. The control device 100 then fixes the transmission 20 in a state with a high gear ratio and runs the vehicle while individually controlling the torque of each driving wheel to suppress slipping of the driving wheels.

<Control of torque of each driving wheel in rough road running mode>
The storage device 110 of the control device 100 stores the data of the model used for torque control of each driving wheel when the rough road running mode is selected. This model is a learned model trained by reinforcement learning.

When the rough road running mode is selected, the control device 100 controls the output of the engine 10 based on the accelerator opening. The control device 100 then inputs state variables to the learned model, including the road surface information data extracted from the image of the road surface ahead of the vehicle captured by the camera 70, the rotational speed of each driving wheel detected by each speed sensor, the steering angle, and the vehicle speed, and determines an action for manipulating the control amount of each mechanism of the drive train. By manipulating the control amount of each mechanism according to the determined action, it controls the torque of each driving wheel so that none of the driving wheels slips. In other words, the learned model is a model that, when state variables are input, outputs an action for manipulating the control amount of each mechanism of the drive train.

Specifically, for the torque of each driving wheel, the learned model selects and outputs an action from three options: increase the torque, maintain it, or reduce it. The processing circuit 120 of the control device 100 controls each mechanism of the drive train so as to manipulate the torque of each driving wheel according to the action output by the learned model. For example, to increase the torque of the right rear wheel 50RR, the processing circuit 120 increases the torque distributed to the right rear wheel 50RR with the electronically controlled coupling 42R. To reduce the torque of the left front wheel 50LF, for example, the processing circuit 120 increases the braking force of the brake 60LF. In this way, in the rough road running mode, the processing circuit 120 of the control device 100 controls the electronically controlled couplings 42R and 42L and the brakes 60RF, 60LF, 60RR, and 60LR to control the torque of each driving wheel individually.
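The per-wheel choice described above can be sketched as a small discrete action space. This is only an illustration: the names `TorqueAction`, `WHEELS`, `JOINT_ACTIONS`, and `actuator_for`, and the command strings, are hypothetical. The description gives only two actuator examples (coupling 42R to increase the right rear torque, brake 60LF to reduce the left front torque); the other mappings below are assumptions.

```python
from enum import Enum
from itertools import product


class TorqueAction(Enum):
    """The three per-wheel options named in the description."""
    DECREASE = -1
    MAINTAIN = 0
    INCREASE = 1


# Wheel suffixes follow the reference signs 50RF, 50LF, 50RR, 50LR.
WHEELS = ("RF", "LF", "RR", "LR")

# A joint action assigns one option to each of the four driving wheels.
JOINT_ACTIONS = [
    dict(zip(WHEELS, combo))
    for combo in product(TorqueAction, repeat=len(WHEELS))
]


def actuator_for(wheel, action):
    """Return a hypothetical (actuator, command) pair for one wheel.

    Only the RR-increase and LF-decrease cases come from the text;
    the remaining cases are assumptions made for this sketch.
    """
    if action is TorqueAction.MAINTAIN:
        return None
    if action is TorqueAction.DECREASE:
        # Assumption: torque is always reduced by braking that wheel.
        return (f"brake_60{wheel}", "increase_braking_force")
    if wheel in ("RR", "LR"):
        coupling = "coupling_42R" if wheel == "RR" else "coupling_42L"
        return (coupling, "increase_torque_share")
    # Assumption: front-wheel torque is raised by easing its brake.
    return (f"brake_60{wheel}", "decrease_braking_force")
```

With three options for each of four wheels, the joint action space has 3^4 = 81 entries, which is small enough to enumerate directly.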

<Training of the learned model>
Next, the training of the learned model used in the rough road running process, which is the torque control in the rough road running mode, is described. The learned model stored in the storage device 110 has been trained by reinforcement learning.

In the learning system that performs the training, the control device 100 is made to determine an action based on the state variables and to execute the determined action. Evaluating the reward according to the state after the action has been executed reveals the action value of the selected action. The control device 100 of the learning system therefore learns by repeatedly acquiring the state variables, determining an action according to the acquired state variables, and evaluating the reward obtained by the determined action.

The agent in reinforcement learning corresponds to the function of selecting an action a according to a predetermined policy. The environment in reinforcement learning corresponds to the function of determining the next state s' based on the action a selected by the agent and the current state s, and determining the immediate reward r based on the action a, the state s, and the state s'.

The learning in this embodiment employs Q-learning, in which the control device 100 of the learning system repeatedly selects an action a according to a predetermined policy and updates the state s, thereby calculating the action-value function Q(s, a) of a given action a in a given state s.

Here, the action-value function Q(s, a) is updated by the following equation (1).
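The body of equation (1) is not reproduced in this text. Assuming it is the standard Q-learning update, which is consistent with the symbols the description explains for it (learning rate α, discount rate γ, reward r_t+1, and the maximum of Q over the actions a' available in state s_t+1), it would take the following form:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
\tag{1}
```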

When the action-value function Q(s, a) has converged properly, the action a that maximizes the action-value function Q(s, a) is regarded as the optimal action, and the operation amount determined by this action a is regarded as the optimal operation amount for the mechanisms of the drive train.

In equation (1) above, the action-value function Q(s, a) is the expected value of the return obtained in the future when action a is taken in state s. The reward is r. The subscript t on the state s, the action a, and the reward r is a trial number indicating one step in the trial process repeated in time series. When the state s changes after an action is determined, the trial number t is incremented by one. In the following, subscripts are written after "_". Accordingly, the reward r_t+1 in equation (1) is the reward r obtained when action a_t is selected in state s_t and the state s becomes s_t+1. α is the learning rate and γ is the discount rate. a' is the action a that maximizes the action-value function Q(s_t+1, a_t+1) among the actions a_t+1 that can be taken in state s_t+1. max_(a')Q(s_t+1, a') is the value of the action-value function Q maximized by selecting the action a'.
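As a rough illustration of how the quantities defined above fit together, the following sketch applies one standard Q-learning update to a lookup table. It is a simplification under stated assumptions: the embodiment computes the action-value function with a multilayer neural network (FIG. 2) rather than a table, and the function name `q_update`, the state labels, and the numeric values are all hypothetical.

```python
from collections import defaultdict


def q_update(q, s, a, r_next, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(s, a).

    q        -- mapping from (state, action) to the current estimate
    r_next   -- reward r_t+1 observed after taking a in s
    s_next   -- state s_t+1 reached after the action
    actions  -- actions that can be taken in s_next
    alpha    -- learning rate, gamma -- discount rate (illustrative values)
    """
    # max over a' of Q(s_t+1, a')
    best_next = max(q[(s_next, a2)] for a2 in actions)
    td_target = r_next + gamma * best_next
    q[(s, a)] += alpha * (td_target - q[(s, a)])
    return q[(s, a)]


# Toy usage with made-up states; unseen entries default to 0.0.
q = defaultdict(float)
actions = ("increase", "maintain", "decrease")
q[("s1", "maintain")] = 2.0
new_value = q_update(q, "s0", "increase", 1.0, "s1", actions,
                     alpha=0.5, gamma=0.9)
```

In the toy call, the best next-state value is 2.0, so the target is 1.0 + 0.9 × 2.0 and the estimate for ("s0", "increase") moves halfway from 0 toward it.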

In the reinforcement learning of this embodiment, controlling the torque of each driving wheel corresponds to determining the action a, and information indicating the possible actions is recorded in advance in the storage device 110 of the control device 100.

As described above, for each driving wheel the action a can be selected from three options: increase the torque, maintain it, or reduce it. This is of course only an example. The content of the action a need not be limited to these options, and there may be more or fewer options for the action a.

In the reinforcement learning of this embodiment, the reward r is set so that it becomes larger the longer the distance the vehicle travels without any driving wheel slipping. Specifically, the state variables include the rotational speed of each driving wheel detected by each speed sensor. If, among the rotational speeds detected by the four speed sensors, there is one that is higher than and deviates from the rotational speeds detected by the other three, the driving wheel fitted with the speed sensor that detected it is slipping. The control device 100 determines in this way whether slip has occurred at any driving wheel and, if slip has occurred, ends the training trial episode at that point. It then gives a negative reward r, for example "-10", as the reward r for the occurrence of slip. If no slip has occurred at any driving wheel, the trial continues. The positive reward r is increased according to the distance the vehicle travels while the trial continues, and the reward r is fixed when the trial ends. As a result, the reward r becomes larger the longer the distance the vehicle travels without any driving wheel slipping. The distance traveled by the vehicle can be calculated, for example, based on the position information acquired from the GPS device 77.
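The slip check and reward scheme described above could be sketched as follows. This is one possible reading, not the method of the embodiment: the description does not say how large the deviation must be, so the ratio threshold is an assumption, as are the per-meter reward scale and the names `slipping_wheels` and `step_reward`. Only the example penalty of -10 comes from the text.

```python
SLIP_PENALTY = -10.0  # example negative reward given in the description


def slipping_wheels(speeds, ratio_threshold=1.3):
    """Return the wheels whose speed deviates high from the other three.

    speeds          -- dict mapping wheel name to rotational speed
    ratio_threshold -- assumed margin; the description gives no value
    """
    slipping = []
    for wheel, speed in speeds.items():
        others = [v for w, v in speeds.items() if w != wheel]
        reference = sum(others) / len(others)
        if reference > 0 and speed > ratio_threshold * reference:
            slipping.append(wheel)
    return slipping


def step_reward(distance_delta_m, slipped, reward_per_meter=1.0):
    """Return (reward, episode_done) for one step of a training trial.

    A slip ends the episode with the penalty; otherwise the reward grows
    with the distance covered in this step (scale is an assumption).
    """
    if slipped:
        return SLIP_PENALTY, True
    return reward_per_meter * distance_delta_m, False
```

Summing the per-step rewards over a trial gives a total that increases with the slip-free distance, which matches the stated intent of the reward design.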

The next state s' when the action a is adopted in the current state s can be identified by changing the operation amount as the action a, running the vehicle, and acquiring the state variables.
For this reason, in this embodiment, the rotational speed of each driving wheel and the vehicle speed calculated from the position information acquired from the GPS device 77 are input as state variables. Specifically, the state variables include the rotational speed of the right front wheel 50RF detected by the right front speed sensor 76RF and the rotational speed of the left front wheel 50LF detected by the left front speed sensor 76LF. The state variables also include the rotational speed of the right rear wheel 50RR detected by the right rear speed sensor 76RR and the rotational speed of the left rear wheel 50LR detected by the left rear speed sensor 76LR. Because the vehicle speed is included in the state variables, the distance traveled can be calculated based on the vehicle speed.

The state variables also include the accelerator opening and the amount of operation of the brake pedal. In addition, so that the action a can be determined with reference to information indicating the condition of the road surface, the state variables include the following information.

The state variables include road surface information data extracted from an image of the road surface in front of the vehicle captured by the camera 70. For example, the road surface information data is obtained by cropping, from an image captured several meters earlier, the regions corresponding to the locations where each drive wheel is now in contact with the ground, and extracting information about the road surface under each drive wheel. Specifically, this data may be a vector of features extracted from the cropped image using a convolutional neural network. Alternatively, this data may be obtained by analyzing the cropped image and calculating the size and height of the unevenness present on the road surface in contact with the wheel.

The state variables also include the steering angle detected by the steering sensor 75, which is used to determine the traveling direction of the vehicle and the orientation of the front wheels. The state variables further include the acceleration detected by the linear G sensor 81, which is used to determine the inclination of the vehicle. Because a road surface wetted by rain makes slipping more likely, the detection value of the raindrop sensor 78 is also included in the state variables. The greater the amount of raindrops detected by the raindrop sensor 78, the more likely a slip becomes.

Information indicating the variables and functions referred to during learning is stored in the storage device 110 of the learning system. The control device 100 of the learning system is configured to make the action-value function Q(s, a) converge by observing the state variables, determining an action a according to the observed state variables, and evaluating the reward r obtained by that action a. During learning, the control device 100 of the learning system sequentially records the time-series values of the state variables, the action a, and the reward r in the storage device 110.

This embodiment employs DQN (Deep Q-Network), a technique for approximately calculating the action-value function Q(s, a). In DQN, a multilayer neural network is used to estimate the action-value function Q(s, a). This embodiment employs a multilayer neural network that takes the state s as input and outputs as many values of the action-value function Q(s, a) as there are selectable actions a.

FIG. 2 is a diagram schematically showing the multilayer neural network that outputs the action-value function Q. In FIG. 2, the multilayer neural network takes the M states s, which are the state variables, as inputs and outputs N values of the action-value function Q. In FIG. 2, the M states s at trial number t are denoted s_1t to s_Mt.

Note that N is the number of selectable actions a, and each output of the multilayer neural network is the value of the action-value function Q when a specific action a is selected in the input state s. In FIG. 2, the action-value functions Q for the actions a_1t to a_Nt that can be selected at trial number t are denoted Q(s_t, a_1t) to Q(s_t, a_Nt). The symbol "s_t" in this notation stands for the state s input at trial number t, that is, the states s_1t to s_Mt collectively. In the example of this embodiment, there are three selectable actions a for each of the four drive wheels, for a total of 12. Therefore N = 12.

The multilayer neural network shown in FIG. 2 is a fully connected feedforward neural network: at each node of each layer, it multiplies the inputs from the immediately preceding layer by weights w, adds a bias b, and, where required, passes the result through an activation function to obtain the output. In FIG. 2, the connections between the nodes of adjacent layers are not drawn.
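The forward pass of such a fully connected network can be sketched in plain Python as follows. The layer sizes, the ReLU activation, the initialization range, and all function names are assumptions chosen for illustration; the description fixes only the output count N = 12.

```python
import random


def init_params(layer_sizes, seed=0):
    """Create one (weights, biases) pair per layer; sizes such as [M, 16, 12] are assumed."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def relu(x):
    return x if x > 0 else 0.0


def q_values(state, params):
    """Return the N action values Q(s, a_1) ... Q(s, a_N) for one state vector."""
    h = list(state)
    for k, (weights, biases) in enumerate(params):
        # Weighted sum of the previous layer's outputs plus bias, per node.
        z = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, biases)]
        is_output_layer = k == len(params) - 1
        h = z if is_output_layer else [relu(v) for v in z]  # no activation on the output
    return h
```

The set of all weights and biases in `params` plays the role of the learnable parameter θ.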

The structure of the multilayer neural network is specified by information such as the weights w and biases b of each layer, the activation functions, and the order of the layers. In the learning system, the parameters that specify this multilayer neural network are therefore recorded in the storage device 110. During learning, the weights w and biases b, which are the variable values in the multilayer neural network, are updated. In the following, the parameters of the multilayer neural network that can change during learning are denoted θ. Using θ, the action-value functions Q(s_t, a_1t) to Q(s_t, a_Nt) can also be written as Q(s_t, a_1t: θ_t) to Q(s_t, a_Nt: θ_t).

Next, the procedure of the learning process is described with reference to the flowchart shown in FIG. 3. As shown in FIG. 3, when the learning process starts, the control device 100 of the learning system acquires the state variables in step S100. The control device 100 then calculates the action values in step S110. That is, the control device 100 refers to the learning information stored in the storage device 110 to acquire θ. It then inputs the latest state variables to the multilayer neural network indicated by the learning information stored in the storage device 110 and calculates the N action-value functions Q(s_t, a_1t: θ_t) to Q(s_t, a_Nt: θ_t). In the initial state immediately after learning starts, a θ set as an initial value is stored in the storage device 110 as the learning information.

The trial number t is 0 at the first execution. If the learning process has not progressed sufficiently, the θ indicated by the learning information stored in the storage device 110 is not sufficiently optimized, so the values of the action-value function Q may be inappropriate. However, as trials are repeated, the action-value function Q is gradually optimized. During the repeated trials, the state s, the action a, and the reward r are stored in the storage device 110 in association with each trial number t, so that they can be referred to at any time.

Next, in step S120, the control device 100 of the learning system selects and executes an action a. In this embodiment, the action a that maximizes the action-value function Q(s, a) is regarded as the optimal action a. The control device 100 therefore identifies the maximum among the N values of the action-value functions Q(s_t, a_1t: θ_t) to Q(s_t, a_Nt: θ_t) calculated in step S110.

The control device 100 of the learning system then selects the action a that gives the maximum value. For example, if Q(s_t, a_3t: θ_t) is the maximum among the N action-value functions Q(s_t, a_1t: θ_t) to Q(s_t, a_Nt: θ_t), the action a_3t is selected.
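The greedy selection in step S120 amounts to an argmax over the N outputs. The sketch below also decodes the selected index into a wheel and an operation. The wheel order and the three operation names ("decrease", "hold", "increase") are assumptions made for illustration; this passage states only that there are three actions for each of the four drive wheels.

```python
WHEELS = ("RF", "LF", "RR", "LR")              # assumed ordering of the four drive wheels
OPERATIONS = ("decrease", "hold", "increase")  # assumed meaning of the three actions per wheel


def greedy_action(q):
    """Return the 0-based index of the action with the largest action value."""
    return max(range(len(q)), key=q.__getitem__)


def decode_action(index):
    """Map an action index in 0..11 to a (wheel, operation) pair."""
    wheel, operation = divmod(index, len(OPERATIONS))
    return WHEELS[wheel], OPERATIONS[operation]
```

In this 0-based layout, the action written a_3t in the text corresponds to index 2.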

When an action a has been selected, the control device 100 of the learning system controls the operation amount of the powertrain mechanism so that the torque is manipulated according to that action a. For example, when an action a that reduces the torque of the right front wheel 50RF is selected, the processing circuit 120 of the control device 100 increases the braking force of the brake 60RF.

Next, in step S130, the control device 100 of the learning system acquires the state variables. That is, the control device 100 acquires the state variables by performing the same processing as in step S100. For example, if the current trial number is t and the selected action a is the action a_t, the state s acquired in step S130 is the state s_{t+1}.

Next, in step S140, the control device 100 of the learning system evaluates the reward r. Specifically, as described above, the control device 100 determines whether a slip has occurred at any of the drive wheels based on the rotational speeds detected by the speed sensors. If it determines that a slip has occurred, it obtains a negative reward r and ends the current learning episode. If, on the other hand, it determines that no slip has occurred, it calculates the distance traveled in the current trial from the vehicle speed calculated from the position information acquired from the GPS device 77, as described above, and obtains a positive reward r corresponding to that distance. If the current trial number is t, the reward r obtained in step S140 is the reward r_{t+1}.

Next, in step S150, the control device 100 of the learning system determines whether the learning episode ends here. If it determines in step S150 that the episode has ended (step S150: YES), the control device 100 advances the process to step S200. If it determines in step S150 that the episode has not ended (step S150: NO), the control device 100 returns the process to step S100 and continues the trials.

In this way, the control device 100 repeats steps S100 to S150 and obtains rewards r until it determines that one of the drive wheels has slipped. When it determines that one of the drive wheels has slipped, the control device 100 obtains a negative reward r, ends the episode at that point, and advances the process to step S200.

In this embodiment, the action-value function Q is updated as shown in Equation (1). To update the action-value function Q appropriately, θ must be optimized so that the multilayer neural network representing the action-value function Q is optimized. For the multilayer neural network shown in FIG. 2 to output the action-value function Q properly, teacher data that serves as the target of the output is required. That is, the multilayer neural network can be optimized by improving θ so as to minimize the error between the output of the multilayer neural network and the target.

However, before learning is complete, not enough is known about the action-value function Q, so it is difficult to specify the target. In this embodiment, therefore, θ, which represents the multilayer neural network, is improved using an objective function that minimizes the second term of Equation (1), the so-called TD (temporal difference) error. That is, r_{t+1} + γ max_{a′} Q(s_{t+1}, a′: θ_t) is used as the target, and θ is learned so as to minimize the error between the target and Q(s_t, a_t: θ_t). However, the target r_{t+1} + γ max_{a′} Q(s_{t+1}, a′: θ_t) itself contains the θ being learned. For this reason, in this embodiment the target is held fixed over a predetermined number of episodes.

To perform learning on this premise, the control device 100 of the learning system calculates an objective function in step S200. That is, the control device 100 calculates an objective function for evaluating the TD error in each episode. The objective function is, for example, a function proportional to the expected value of the square of the TD error, or the sum of the squares of the TD errors. The TD error is calculated with the target fixed. The fixed target is written r_{t+1} + γ max_{a′} Q(s_{t+1}, a′: θ_hold). The TD error is then r_{t+1} + γ max_{a′} Q(s_{t+1}, a′: θ_hold) − Q(s_t, a_t: θ_t). In this expression, the reward r_{t+1} is the reward obtained in step S140 as a result of the action a_t.

Further, max_{a′} Q(s_{t+1}, a′: θ_hold) is the maximum of the outputs obtained when the state s_{t+1}, acquired in step S130 as a result of the action a_t, is input to the multilayer neural network specified by the fixed θ_hold.

Finally, Q(s_t, a_t: θ_t) is the output value corresponding to the action a_t among the outputs obtained when the state s_t, that is, the state before the action a_t was selected, is input to the multilayer neural network specified by θ_t at trial number t.
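The three quantities defined above combine into the TD error and the objective function as in the following sketch. The function names and argument layout are hypothetical; the network outputs are passed in as plain lists so that the example does not depend on any particular network implementation.

```python
def td_error(reward, q_next_hold, q_current, action, gamma):
    """TD error with a fixed target.

    reward       -- r_{t+1}, obtained in step S140
    q_next_hold  -- outputs of the network with fixed θ_hold for state s_{t+1}
    q_current    -- outputs of the network with current θ_t for state s_t
    action       -- index of the action a_t that was taken
    gamma        -- discount factor γ
    """
    target = reward + gamma * max(q_next_hold)
    return target - q_current[action]


def objective(td_errors):
    """Sum of squared TD errors, one of the objective functions named in the text."""
    return sum(e * e for e in td_errors)
```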

After the objective function has been calculated, the control device 100 of the learning system determines in the next step S210 whether learning is complete. A threshold for determining whether the TD error has become sufficiently small is set in advance. If the objective function is equal to or less than the threshold, the control device 100 determines that learning is complete.

If it is not determined in step S210 that learning is complete (step S210: NO), the control device 100 of the learning system advances the process to step S220. In step S220, the control device 100 updates the action values. That is, based on the partial derivative of the TD error with respect to θ, the control device 100 identifies a change in θ that reduces the objective function, and changes θ accordingly. Various methods can be used to change θ here; for example, gradient descent can be employed. Adjustment by a learning rate or the like may also be applied as appropriate. With this processing, θ can be changed so that the action-value function Q approaches the target.
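As a rough illustration of the update in step S220, the sketch below performs one gradient descent step on half the squared TD error. To keep the gradient explicit, it swaps in a linear approximator Q(s, a: θ) = θ[a]·s in place of the multilayer network of the embodiment; this simplification, the function names, and the default values of `gamma` and `lr` are all assumptions.

```python
def q_linear(theta, state):
    """Linear stand-in for the network: one weight row per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]


def sgd_step(theta, theta_hold, s, a, r, s_next, gamma=0.9, lr=0.1):
    """Move theta[a] so that Q(s, a: theta) approaches the fixed target; return the TD error."""
    target = r + gamma * max(q_linear(theta_hold, s_next))  # computed with fixed θ_hold
    delta = target - q_linear(theta, s)[a]                  # TD error
    # d(0.5 * delta^2) / d(theta[a]) = -delta * s, so descend along +delta * s.
    theta[a] = [w + lr * delta * x for w, x in zip(theta[a], s)]
    return delta
```

Repeating the step on the same transition shrinks the TD error, which is the behavior the text describes as Q approaching the target.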

However, because the target is fixed as described above, the control device 100 of the learning system also determines whether to update the target. Specifically, in the next step S230, the control device 100 of the learning system determines whether the number of episodes is equal to or greater than a predetermined number. If it determines in step S230 that the number of episodes is equal to or greater than the predetermined number (step S230: YES), the control device 100 advances the process to step S240.

In step S240, the control device 100 of the learning system updates the target. That is, the control device 100 updates the θ referred to when calculating the target to the latest θ. The control device 100 then returns the process to step S100 and repeats the processing from step S100 onward.

If, on the other hand, it is determined in step S230 that the number of episodes is less than the predetermined number (step S230: NO), the control device 100 skips step S240 and returns the process to step S100. It then repeats the processing from step S100 onward.
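The target refresh logic of steps S230 and S240 can be sketched as a small helper that keeps a frozen copy of θ. The class name, the method name, and the choice to reset the episode counter after each refresh are assumptions; the description says only that the target is updated once the number of episodes reaches a predetermined number.

```python
import copy


class TargetHolder:
    """Keeps θ_hold fixed and refreshes it every `refresh_every` episodes."""

    def __init__(self, theta, refresh_every):
        self.theta_hold = copy.deepcopy(theta)
        self.refresh_every = refresh_every
        self.episodes_since_refresh = 0

    def end_episode(self, theta):
        """Call once per finished episode; return True if θ_hold was refreshed (S240)."""
        self.episodes_since_refresh += 1
        if self.episodes_since_refresh >= self.refresh_every:   # S230: YES
            self.theta_hold = copy.deepcopy(theta)              # S240: copy the latest θ
            self.episodes_since_refresh = 0                     # assumed: counter restarts
            return True
        return False                                            # S230: NO, skip S240
```

The deep copy matters: θ keeps changing in step S220, and the held parameters must not change with it.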

If it is determined in step S210 that learning is complete (step S210: YES), the control device 100 of the learning system advances the process to step S250. In step S250, the control device 100 updates the learning information stored in the storage device 110. That is, the control device 100 stores the θ obtained by learning in the storage device 110 as the learned model. Once the learned model including this θ is stored in the storage device 110 of the control device 100 serving as the vehicle control device, the processing circuit 120 of the control device 100 mounted on the vehicle can use the learned model to control the torque of each drive wheel.

<Rough Road Travel Processing>
FIG. 4 is a flowchart showing the rough road travel processing that the vehicle control device 100 executes repeatedly while the rough road travel mode is selected.

When the rough road travel processing starts, the processing circuit 120 of the control device 100, which is the vehicle control device, executes step S300. In step S300, the processing circuit 120 acquires the state variables in the same manner as in step S100 described with reference to FIG. 3.

Next, in step S310, the processing circuit 120 of the control device 100 determines the control amount of the drive train mechanism. Specifically, the processing circuit 120 inputs the state variables acquired in step S300 to the learned model stored in the storage device 110. The processing circuit 120 then selects the action a that gives the maximum value among the action-value functions Q(s, a) output by the learned model, and determines the control amount of the drive train mechanism based on the selected action a.

Next, the processing circuit 120 advances the process to step S320. In step S320, the processing circuit 120 controls the driving force and braking force of each drive wheel and thereby controls the torque of each drive wheel. That is, the processing circuit 120 controls the drive train mechanism based on the control amount determined in step S310. Specifically, the processing circuit 120 controls the electronically controlled coupling 42R and the electronically controlled coupling 42L based on the determined control amount, which manipulates the driving force distributed to the rear wheels. The processing circuit 120 also controls the brake 60RF, the brake 60LF, the brake 60RR, and the brake 60LR based on the determined control amount, which controls the braking force of each drive wheel. In this way, the processing circuit 120 controls the driving force and braking force of each drive wheel and controls the torque of each drive wheel individually. After executing step S320, the processing circuit 120 ends this routine for the time being.
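One pass of the routine in FIG. 4 (steps S300 to S320) can be sketched as below. The three callables are hypothetical stand-ins for sensor acquisition, the learned model, and the actuator interface to the couplings and brakes; none of these names come from the description.

```python
def rough_road_step(read_state, q_model, apply_action):
    """Run one pass of the rough road travel routine and return the chosen action index."""
    state = read_state()                             # S300: acquire state variables
    q = q_model(state)                               # S310: evaluate the learned model
    action = max(range(len(q)), key=q.__getitem__)   # S310: action with the largest Q
    apply_action(action)                             # S320: drive couplings and brakes
    return action
```

Passing the hardware-facing parts in as callables keeps the decision logic testable without a vehicle; the routine would be called repeatedly while the rough road travel mode is selected.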

<Operation>
With the above configuration, the torque of each drive wheel can be controlled by selecting the action a that maximizes the action-value function Q. The action-value function Q has been optimized as a result of the many trials repeated in the learning process described above. The learned model has been trained by reinforcement learning that gives a larger reward r the longer the distance traveled without any drive wheel slipping. Therefore, rather than uniformly suppressing the torque of all drive wheels, the control device 100 can control the torque of each drive wheel individually so that the vehicle can keep moving forward while avoiding slip.

<Effects>
(1) In the above configuration, the action a is determined using state variables that include road surface information data extracted from an image of the road surface in front of the vehicle. Information about the road surface that the drive wheels are about to contact can therefore be reflected in the determination of the action a. As a result, the torque of each drive wheel can be controlled so as to prevent slip before it occurs, taking into account the condition of the road surface that each drive wheel contacts.

(2) The learned model has been trained by reinforcement learning that gives a larger reward r the longer the distance traveled without any drive wheel slipping. Therefore, rather than uniformly suppressing the torque of all drive wheels, the control device 100 can control the torque of each drive wheel individually so that the vehicle can keep moving forward while avoiding slip.

(3) With the above configuration, achieving (1) and (2) together makes it possible to suppress the input of an excessive load to the driveline while ensuring off-road capability.
This embodiment can be modified as follows. This embodiment and the following modifications can be combined with one another to the extent that no technical contradiction arises.

- In the above embodiment, the state variables include the accelerator opening. However, if, for example, the output of the engine 10 is controlled automatically in the rough road travel mode so as to maintain a preset constant speed, the accelerator opening need not be included in the state variables. In other words, control using a reinforcement-learned model as in the above embodiment can also be applied to a configuration that executes a rough road travel mode in which the speed is controlled automatically in this way.

- In the above embodiment, the control device 100 is illustrated as a single device, but it may be divided into an engine control device that controls the engine 10, a transmission control device that controls the transmission 20, a 4WD control device that controls the distribution of driving force to each drive wheel, a brake control device that controls each brake, and so on. In this case, the 4WD control device and the brake control device control the driving force distribution and the braking force based on the operation amounts calculated using the learned model, and together they control the torque of each drive wheel.

- Weather information can be used to estimate the road surface condition. The vehicle may therefore be provided with a communication device, acquire weather information for its current location, and input that weather information as one of the state variables.

- The camera 70 may be a stereo camera. A stereo camera makes it possible to determine the size of unevenness on the road surface, and the distance to it, more accurately.
- In addition to the camera 70, a lidar, which measures the distance to an object using light, or a sonar, which detects objects using sound waves, may be used to collect road surface information data as one of the state variables.

- The above embodiment shows an example in which electronically controlled couplings capable of changing the distribution of driving force between the left and right rear wheels are provided as a mechanism for controlling the torque of each drive wheel. However, the torque of each drive wheel can be controlled as long as the brake of each drive wheel can be controlled individually. Torque control using a reinforcement-learned model as in the above embodiment can therefore also be applied to a vehicle that has no such mechanism for changing the distribution of driving force.

- Slip of the drive wheels can be suppressed by controlling the torque of the drive wheels individually. Torque control using a reinforcement-learned model as in the above embodiment can therefore also be applied to front-wheel-drive and rear-wheel-drive vehicles.

- In the above embodiment, the action-value function Q is optimized while actions a are selected by a greedy policy based on the action-value function Q and tried, and the greedy policy with respect to the optimized action-value function Q is regarded as the optimal policy. This process is a so-called value iteration method, but learning may be performed by another method, for example a policy iteration method. Furthermore, various kinds of normalization may be applied to variables such as the state s, the action a, and the reward r.

- Various machine learning methods can be adopted; for example, trials may be performed using an ε-greedy policy based on the action-value function Q. The reinforcement learning method is also not limited to Q-learning as described above, and a method such as SARSA may be used. A method that models the policy and the action-value function Q separately, for example an Actor-Critic algorithm, may also be used.

- The vehicle control device is not limited to one that includes the processing circuit 120 and the storage device 110 and executes software processing. For example, it may include a dedicated hardware circuit (for example, an ASIC) that performs in hardware at least part of the processing performed in software in the above embodiment. That is, the vehicle control device may have any of the following configurations (a) to (c). (a) It includes a processing circuit that executes all of the processing according to a program, and a storage device such as a ROM that stores the program. (b) It includes a processing circuit and a storage device that execute part of the processing according to a program, and a dedicated hardware circuit that executes the rest of the processing. (c) It includes a dedicated hardware circuit that executes all of the processing. There may be a plurality of software execution devices each including a processing circuit and a storage device, and a plurality of dedicated hardware circuits.

DESCRIPTION OF SYMBOLS
100 ... Control device
110 ... Storage device
120 ... Processing circuit
10 ... Engine
20 ... Transmission
21 ... Front differential
30 ... Transfer
31 ... Front disconnect mechanism
40 ... Rear differential
41 ... Rear disconnect mechanism
42R ... Electronically controlled coupling
42L ... Electronically controlled coupling
60RF ... Brake
60LF ... Brake
60RR ... Brake
60LR ... Brake
70 ... Camera
76RF ... Right front speed sensor
76LF ... Left front speed sensor
76RR ... Right rear speed sensor
76LR ... Left rear speed sensor
77 ... GPS device

Claims

A vehicle control device for controlling a vehicle equipped with a mechanism capable of individually controlling the torque of each drive wheel, the vehicle control device comprising:
a storage device storing a learned model that, when state variables including road surface information data extracted from an image of the road surface in front of the vehicle are input, determines and outputs an action for manipulating the control amount of the mechanism; and
a processing circuit that manipulates the control amount based on the action output by inputting the state variables to the learned model, and thereby controls the mechanism,
wherein the learned model is a model learned by reinforcement learning that gives a greater reward the longer the distance the vehicle travels without any drive wheel slipping.