JP7207289B2

JP7207289B2 - Vehicle control device, vehicle control system, vehicle learning device, and vehicle learning method

Info

Publication number: JP7207289B2
Application number: JP2019231144A
Authority: JP
Inventors: 洋介橋本; 章弘片山; 裕太大城; 和紀杉江; 尚哉岡
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2019-12-23
Filing date: 2019-12-23
Publication date: 2023-01-18
Anticipated expiration: 2039-12-23
Also published as: JP2021099059A

Description

The present invention relates to a vehicle control device, a vehicle control system, a vehicle learning device, and a vehicle learning method.

For example, Patent Literature 1 listed below describes a control device that operates a throttle valve, which is an operation unit of an internal combustion engine mounted on a vehicle, based on a value obtained by filtering the operation amount of an accelerator pedal.

Patent Literature 1: JP 2016-6327 A

The above filter must set the operation amount of the throttle valve of the internal combustion engine mounted on the vehicle to an appropriate value according to the operation amount of the accelerator pedal, so adapting it requires an expert to spend many man-hours. Thus, conventionally, experts have spent many man-hours adapting the operation amounts and the like of electronic devices in a vehicle according to the state of the vehicle.

Means for solving the above problem and their effects will be described below.
1. A vehicle control device comprising an execution device and a storage device, wherein the storage device stores relationship defining data that defines a relationship between a state of a vehicle and an action variable, which is a variable relating to the operation of an electronic device mounted on the vehicle. The execution device executes: a state acquisition process of acquiring the state of the vehicle at each point in time based on the sensor detection values at that time; an operation process of operating the electronic device based on the value of the action variable determined by the relationship defining data and the state of the vehicle acquired by the state acquisition process; a reward calculation process of giving, based on the state of the vehicle acquired by the state acquisition process, a larger reward when a characteristic of the vehicle meets a criterion than when it does not; an update process of updating the relationship defining data by inputting the state of the vehicle acquired by the state acquisition process, the value of the action variable used to operate the electronic device, and the reward corresponding to that operation into a predetermined update map; a deterioration variable acquisition process of acquiring a deterioration variable, which is a variable indicating the degree of deterioration of the vehicle; and a change process that, when the degree of deterioration of the vehicle is equal to or greater than a predetermined level, expands, compared with the case where it is less than the predetermined level, the range over which the operation process adopts values of the action variable other than the value that maximizes the expected return for the reward. The update map outputs the relationship defining data updated so as to increase the expected return obtained when the electronic device is operated according to the relationship defining data.

In the above configuration, calculating the reward associated with an operation of the electronic device makes it possible to grasp what reward that operation yields. Updating the relationship defining data based on the reward, using an update map that follows reinforcement learning, then makes it possible to set the relationship between the state of the vehicle and the action variable to one that is appropriate for running the vehicle. This reduces the man-hours required of an expert in setting that relationship.

When the action that maximizes the expected return determined by reinforcement learning has converged, always selecting that action is better suited to controlling the characteristics of the vehicle to the targeted characteristics than continuing to explore needlessly. However, when the vehicle deteriorates, the action that maximizes the expected return may change. In the above configuration, therefore, when the degree of deterioration is equal to or greater than the predetermined level, the range over which values other than the value that maximizes the expected return are adopted is expanded compared with the case where it is less than the predetermined level, so that reinforcement learning can find values of the action variable that are more appropriate for the deteriorated vehicle.

It is desirable that the state of the vehicle acquired by the state acquisition process include at least a quantity that changes over a shorter time than the value of the deterioration variable.
2. The vehicle control device according to 1 above, wherein the change process includes a process of expanding the range over which values other than the value that maximizes the expected return are adopted from zero to a range greater than zero.

In the above configuration, when the degree of deterioration is less than the predetermined level, setting to zero the range over which values other than the value that maximizes the expected return are adopted suppresses wasteful exploration.
3. The vehicle control device according to 2 above, wherein the deterioration variable is also a variable that subdivides the case where the degree of deterioration is less than the predetermined level by a quantity positively correlated with the passage of time; the change process changes, with the passage of time, the range over which values other than the value that maximizes the expected return are adopted from a first range through a second range to a third range; the first range is larger than the second range and the third range; the third range is larger than the second range; and the process of changing the range toward expansion is a process of changing, when the degree of deterioration of the vehicle is equal to or greater than the predetermined level, the range over which values other than the value that maximizes the expected return are adopted from the second range to the third range.

In the above configuration, the third range is made smaller than the first range, in view of the assumption that even if the vehicle deteriorates, the value of the action variable that maximizes the expected return does not change greatly from its value before deterioration. This raises the likelihood that exploration is confined to values of the action variable that may maximize the expected return for the deteriorated vehicle, so exploration can be performed efficiently.
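
The ordering of the three ranges in items 2 and 3 can be illustrated with a minimal Python sketch. It assumes that each "range" is represented by a single exploration probability; the phase names and the numeric values are hypothetical, and the source fixes only the ordering (first range > third range > second range, with the second range equal to zero).

```python
# Hypothetical exploration probabilities; only their ordering comes from the text
# (first range > third range > second range, with the second range equal to zero).
EXPLORATION_RANGE = {
    "initial_learning": 0.10,  # first range: broad exploration until learning converges
    "converged": 0.00,         # second range: the greedy action is always adopted
    "deteriorated": 0.03,      # third range: limited re-exploration after deterioration
}


def exploration_probability(phase: str) -> float:
    """Return the probability of adopting a non-greedy action in the given phase."""
    return EXPLORATION_RANGE[phase]
```

Keeping the third value below the first reflects the stated expectation that the best action for a deteriorated vehicle lies close to the one learned before deterioration.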

4. A vehicle control system comprising the execution device and the storage device according to any one of 1 to 3 above, wherein the execution device includes a first execution device mounted on the vehicle and a second execution device separate from in-vehicle devices, the first execution device executes at least the state acquisition process and the operation process, and the second execution device executes at least the update process.

In the above configuration, having the second execution device execute the update process reduces the computational load on the first execution device compared with the case where the first execution device executes the update process.
Note that the second execution device being separate from in-vehicle devices means that the second execution device is not an in-vehicle device.

5. A vehicle control device comprising the first execution device according to 4 above.
6. A vehicle learning device comprising the second execution device according to 4 above.
7. A vehicle learning method that causes a computer to execute the state acquisition process, the operation process, the reward calculation process, the update process, the deterioration variable acquisition process, and the change process according to any one of 1 to 3 above.

The above method provides the same effects as 1 above.

FIG. 1 is a diagram showing a control device and its drive system according to a first embodiment.
FIG. 2 is a flowchart showing the procedure of processing executed by the control device according to the embodiment.
FIG. 3 is a flowchart showing the procedure of processing executed by the control device according to the embodiment.
FIG. 4 is a flowchart showing a detailed procedure of part of the processing executed by the control device according to the embodiment.
FIG. 5 is a flowchart showing the procedure of processing executed by the control device according to the embodiment.
FIG. 6 is a diagram showing the configuration of a vehicle control system according to a second embodiment.
FIG. 7(a) and FIG. 7(b) are flowcharts showing the procedure of processing executed by the vehicle control system.

<First Embodiment>
A first embodiment of the vehicle control device will be described below with reference to the drawings.
FIG. 1 shows the configuration of the drive system and the control device of a vehicle VC1 according to this embodiment.

As shown in FIG. 1, an intake passage 12 of an internal combustion engine 10 is provided with a throttle valve 14 and a fuel injection valve 16 in this order from the upstream side. Air drawn into the intake passage 12 and fuel injected from the fuel injection valve 16 flow into a combustion chamber 24, which is defined by a cylinder 20 and a piston 22, as an intake valve 18 opens. In the combustion chamber 24, the mixture of fuel and air is burned by the spark discharge of an ignition device 26, and the energy generated by combustion is converted into rotational energy of a crankshaft 28 via the piston 22. The burned mixture is discharged as exhaust into an exhaust passage 32 as an exhaust valve 30 opens. The exhaust passage 32 is provided with a catalyst 34 as an aftertreatment device that purifies the exhaust.

An input shaft 52 of a transmission 50 can be mechanically coupled to the crankshaft 28 via a torque converter 40 equipped with a lockup clutch 42. The transmission 50 is a device that varies the gear ratio, which is the ratio of the rotation speed of the input shaft 52 to the rotation speed of an output shaft 54. Drive wheels 60 are mechanically coupled to the output shaft 54.

A control device 70 controls the internal combustion engine 10 and operates its operation units, such as the throttle valve 14, the fuel injection valve 16, and the ignition device 26, in order to control its controlled variables, such as torque and exhaust component ratio. The control device 70 also controls the torque converter 40 and operates the lockup clutch 42 in order to control the engagement state of the lockup clutch 42. The control device 70 further controls the transmission 50 and operates the transmission 50 in order to control the gear ratio as its controlled variable. FIG. 1 shows the operation signals MS1 to MS5 for the throttle valve 14, the fuel injection valve 16, the ignition device 26, the lockup clutch 42, and the transmission 50, respectively.

To control the controlled variables, the control device 70 refers to the intake air amount Ga detected by an air flow meter 80, the opening degree of the throttle valve 14 (throttle opening degree TA) detected by a throttle sensor 82, and the output signal Scr of a crank angle sensor 84. The control device 70 also refers to the depression amount of an accelerator pedal 86 (accelerator operation amount PA) detected by an accelerator sensor 88 and the longitudinal acceleration Gx of the vehicle VC1 detected by an acceleration sensor 90. The control device 70 further refers to the output signal Sv of a wheel rotation sensor 92 that detects the rotation angle of the drive wheels 60.

The control device 70 includes a CPU 72, a ROM 74, an electrically rewritable nonvolatile memory (storage device 76), and a peripheral circuit 78, which can communicate with one another via a local network 79. The peripheral circuit 78 includes a circuit that generates a clock signal defining internal operations, a power supply circuit, a reset circuit, and the like.

The ROM 74 stores a control program 74a and a learning program 74b. The storage device 76 stores relationship defining data DR that defines the relationship between the accelerator operation amount PA and both the command value of the throttle opening degree TA (throttle opening degree command value TA*) and the retardation amount aop of the ignition device 26. Here, the retardation amount aop is the amount of retardation relative to a predetermined reference ignition timing, and the reference ignition timing is the more retarded of the MBT ignition timing and the knock limit point. The MBT ignition timing is the ignition timing at which maximum torque is obtained (maximum torque ignition timing). The knock limit point is the advance limit of the ignition timing at which knocking can be kept within an allowable level under the best conditions assumed when a high-octane fuel with a high knock limit is used. The storage device 76 also stores torque output map data DT. The torque output map defined by the torque output map data DT takes the rotation speed NE of the crankshaft 28, the charging efficiency η, and the ignition timing as inputs and outputs the torque Trq.

FIG. 2 shows the procedure of processing executed by the control device 70 according to this embodiment. The processing shown in FIG. 2 is realized by the CPU 72 repeatedly executing the learning program 74b stored in the ROM 74, for example at a predetermined cycle. In the following, the step number of each process is indicated by a number prefixed with "S".

In the series of processes shown in FIG. 2, the CPU 72 first acquires the travel distance RL (S10). The travel distance RL is calculated by the CPU 72 based on the output signal Sv of the wheel rotation sensor 92.

Next, the CPU 72 determines whether the travel distance RL is equal to or less than a convergence determination value RLthL (S12). When the CPU 72 determines that it is equal to or less than the convergence determination value RLthL (S12: YES), it assigns "1" to a deterioration flag Fd (S14). When the CPU 72 makes a negative determination in the process of S12, it determines whether the travel distance RL is greater than the convergence determination value RLthL and less than a deterioration threshold RLthH (S16). When the CPU 72 determines that it is greater than the convergence determination value RLthL and less than the deterioration threshold RLthH (S16: YES), it assigns "2" to the deterioration flag Fd (S18). When the CPU 72 makes a negative determination in the process of S16, it assigns "3" to the deterioration flag Fd (S20).
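
The classification in S12 to S20 can be sketched in Python as follows. The function and parameter names are hypothetical; the source gives no numeric values for RLthL or RLthH, so both thresholds are passed in as arguments.

```python
def deterioration_flag(travel_distance: float, rl_th_low: float, rl_th_high: float) -> int:
    """Classify the travel distance RL into the deterioration flag Fd (S12 to S20)."""
    if travel_distance <= rl_th_low:   # S12: at or below the convergence determination value
        return 1                       # S14
    if travel_distance < rl_th_high:   # S16: above RLthL and below the deterioration threshold
        return 2                       # S18
    return 3                           # S20: at or above the deterioration threshold
```

For example, with assumed thresholds of 10,000 and 100,000 distance units, a distance of 50,000 yields flag 2.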

When the CPU 72 completes the process of S14, S18, or S20, it temporarily ends the series of processes shown in FIG. 2.
FIG. 3 shows the procedure of processing executed by the control device 70 according to this embodiment. The processing shown in FIG. 3 is realized by the CPU 72 repeatedly executing the control program 74a and the learning program 74b stored in the ROM 74, for example at a predetermined cycle.

In the series of processes shown in FIG. 3, the CPU 72 first acquires, as a state s, time-series data consisting of six sampling values "PA(1), PA(2), ... PA(6)" of the accelerator operation amount PA (S30). The sampling values that make up the time-series data are sampled at mutually different timings. In this embodiment, the time-series data consists of six sampling values that are adjacent to one another in time when sampling is performed at a constant sampling cycle.
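
One way to hold the six most recent samples described in S30 is a fixed-length buffer, sketched below in Python. The class and method names are hypothetical, and returning None until six samples are available is an assumption; the source does not say how the start-up period is handled.

```python
from collections import deque
from typing import Optional, Tuple


class AcceleratorHistory:
    """Keeps the most recent samples of the accelerator operation amount PA."""

    def __init__(self, length: int = 6) -> None:
        self._samples = deque(maxlen=length)  # the oldest sample is dropped automatically

    def push(self, pa: float) -> None:
        """Record one sample taken at the constant sampling cycle."""
        self._samples.append(pa)

    def state(self) -> Optional[Tuple[float, ...]]:
        """Return (PA(1), ..., PA(6)), oldest first, or None while the buffer is filling."""
        if len(self._samples) < self._samples.maxlen:
            return None
        return tuple(self._samples)
```

A tuple is returned so that the state can be used directly as a key into a table-type action value function.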

Next, in accordance with the policy π defined by the relationship defining data DR, the CPU 72 sets an action a consisting of the throttle opening degree command value TA* and the retardation amount aop corresponding to the state s acquired in the process of S30 (S32).

In this embodiment, the relationship defining data DR is data that defines an action value function Q and the policy π. The action value function Q is a table-type function that gives the value of the expected return as a function of eight-dimensional independent variables consisting of the state s and the action a. When a state s is given, the policy π defines a rule that preferentially selects the action a that maximizes the action value function Q whose independent variable is the given state s (the greedy action), while selecting other actions a with a predetermined probability.
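
The selection rule of the policy π can be sketched in Python as follows. This is an illustration under assumptions, not the patent's implementation: the table is modeled as a dict keyed by (state, action), unseen entries default to 0.0, and a non-greedy action is drawn uniformly from the remaining candidates. All names are hypothetical.

```python
import random


def select_action(q_table, state, actions, epsilon, rng=random):
    """Pick the greedy action for `state`, or another action with probability `epsilon`."""
    # Greedy action: the candidate with the largest tabulated expected return.
    greedy = max(actions, key=lambda a: q_table.get((state, a), 0.0))
    if rng.random() < epsilon:
        others = [a for a in actions if a != greedy]
        if others:
            return rng.choice(others)  # exploratory (non-greedy) action
    return greedy
```

With `epsilon` set to zero the function always returns the greedy action, which corresponds to the case in which no exploration takes place.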

Specifically, the set of values that the independent variables of the action value function Q in this embodiment can take is reduced, based on human knowledge and the like, to a subset of all combinations of possible values of the state s and the action a. For example, a case in which one of two adjacent sampling values in the time-series data of the accelerator operation amount PA is the minimum value of the accelerator operation amount PA and the other is the maximum value cannot arise from a person's operation of the accelerator pedal 86, so the action value function Q is not defined for it. In this embodiment, dimension reduction based on human knowledge and the like limits the number of possible values of the state s for which the action value function Q is defined to 10^4 or fewer, and more desirably to 10^3 or fewer.
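
A minimal Python sketch of this kind of pruning is given below. The source states only that physically implausible sequences are excluded based on human knowledge; the specific rule used here, a cap on the jump between adjacent samples, is an assumption chosen for illustration, as are the function and parameter names.

```python
from itertools import product


def feasible_states(levels, length, max_step):
    """Enumerate PA time series whose adjacent samples differ by at most `max_step`."""
    return [
        series
        for series in product(levels, repeat=length)
        if all(abs(b - a) <= max_step for a, b in zip(series, series[1:]))
    ]
```

For example, with three discretized levels (0, 50, 100), two samples, and `max_step=50`, the two sequences that jump directly between the minimum and the maximum are dropped, leaving 7 of the 9 combinations.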

Next, based on the set throttle opening degree command value TA* and retardation amount aop, the CPU 72 outputs the operation signal MS1 to the throttle valve 14 to operate the throttle opening degree TA, and outputs the operation signal MS3 to the ignition device 26 to operate the ignition timing (S34). Because this embodiment exemplifies feedback control of the throttle opening degree TA to the throttle opening degree command value TA*, the operation signal MS1 can differ even when the throttle opening degree command value TA* is the same. In addition, when, for example, well-known knock control (KCS) is performed, the ignition timing is the value obtained by retarding the reference ignition timing by the retardation amount aop and then applying feedback correction by KCS. The reference ignition timing is variably set by the CPU 72 according to the rotation speed NE of the crankshaft 28 and the charging efficiency η. The rotation speed NE is calculated by the CPU 72 based on the output signal Scr of the crank angle sensor 84. The charging efficiency η is calculated by the CPU 72 based on the rotation speed NE and the intake air amount Ga.

Next, the CPU 72 acquires the torque Trq of the internal combustion engine 10, the torque command value Trq* for the internal combustion engine 10, and the acceleration Gx (S36). Here, the CPU 72 calculates the torque Trq by inputting the rotation speed NE, the charging efficiency η, and the ignition timing into the torque output map. The CPU 72 sets the torque command value Trq* according to the accelerator operation amount PA.

Next, the CPU 72 determines whether a transient flag F is "1" (S38). The transient flag F indicates transient operation when it is "1" and non-transient operation when it is "0". When the CPU 72 determines that the transient flag F is "0" (S38: NO), it determines whether the absolute value of the amount of change ΔPA in the accelerator operation amount PA per unit time is equal to or greater than a predetermined amount ΔPAth (S40). The amount of change ΔPA may be, for example, the difference between the latest accelerator operation amount PA at the execution timing of the process of S40 and the accelerator operation amount PA one unit time before that timing.

When the CPU 72 determines that the absolute value is equal to or greater than the predetermined amount ΔPAth (S40: YES), it assigns "1" to the transient flag F (S42).
In contrast, when the CPU 72 determines that the transient flag F is "1" (S38: YES), it determines whether a predetermined period has elapsed since the execution timing of the process of S42 (S44). Here, the predetermined period is the period until a state in which the absolute value of the amount of change ΔPA in the accelerator operation amount PA per unit time is equal to or less than a specified amount, which is smaller than the predetermined amount ΔPAth, has continued for a predetermined time. When the CPU 72 determines that the predetermined period has elapsed (S44: YES), it assigns "0" to the transient flag F (S46).
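
The flag handling in S38 to S46 can be sketched as a small state machine in Python. The names are hypothetical, and measuring the "predetermined time" as a count of consecutive control cycles is an assumption; the source does not say how that time is measured.

```python
from dataclasses import dataclass


@dataclass
class TransientDetector:
    dpa_th: float            # predetermined amount ΔPAth
    dpa_quiet: float         # specified amount, smaller than ΔPAth
    quiet_steps_needed: int  # cycles the quiet condition must persist (assumed unit)
    flag: int = 0            # transient flag F
    quiet_steps: int = 0

    def step(self, dpa: float) -> bool:
        """Process one control cycle; return True when an episode ends (S42 or S46)."""
        if self.flag == 0:
            if abs(dpa) >= self.dpa_th:   # S40: YES
                self.flag = 1             # S42
                self.quiet_steps = 0
                return True
            return False
        # Flag is 1: count consecutive cycles with a small change in PA (S44).
        if abs(dpa) <= self.dpa_quiet:
            self.quiet_steps += 1
        else:
            self.quiet_steps = 0
        if self.quiet_steps >= self.quiet_steps_needed:
            self.flag = 0                 # S46
            return True
        return False
```

Each True return marks the boundary at which the preceding steady or transient period is treated as one finished episode.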

When the process of S42 or S46 is completed, the CPU 72 regards one episode as having ended and determines whether the deterioration flag Fd is "1" or "3" (S48). When the CPU 72 determines that the deterioration flag Fd is "1" or "3" (S48: YES), it updates the action value function Q by reinforcement learning (S50).

FIG. 4 shows the details of the process of S50.
In the series of processes shown in FIG. 4, the CPU 72 acquires time-series data consisting of sets of three sampling values, namely the torque command value Trq*, the torque Trq, and the acceleration Gx, during the most recently ended episode, together with the time-series data of the state s and the action a (S60). Here, when the process of S60 follows the process of S42, the most recent episode is the period during which the transient flag F was continuously "0"; when the process of S60 follows the process of S46, it is the period during which the transient flag F was continuously "1".

In FIG. 4, different numbers in parentheses indicate values of a variable at different sampling timings. For example, the torque command value Trq*(1) and the torque command value Trq*(2) have mutually different sampling timings. The time-series data of the action a belonging to the most recent episode is defined as an action set Aj, and the time-series data of the state s belonging to the same episode is defined as a state set Sj.

Next, the CPU 72 determines whether the logical product of condition (a), that the absolute value of the difference between any torque Trq belonging to the most recent episode and the torque command value Trq* is equal to or less than a specified amount ΔTrq, and condition (b), that the acceleration Gx is equal to or greater than a lower limit value GxL and equal to or less than an upper limit value GxH, is true (S62).

Here, the CPU 72 variably sets the specified amount ΔTrq according to the amount of change ΔPA in the accelerator operation amount PA per unit time at the start of the episode. That is, when the absolute value of the amount of change ΔPA is large, the CPU 72 treats the episode as a transient episode and sets the specified amount ΔTrq to a larger value than in the steady case.

The CPU 72 also variably sets the lower limit value GxL according to the amount of change ΔPA in the accelerator operation amount PA at the start of the episode. That is, for a transient episode in which the amount of change ΔPA is positive, the CPU 72 sets the lower limit value GxL to a larger value than for a steady episode. For a transient episode in which the amount of change ΔPA is negative, the CPU 72 sets the lower limit value GxL to a smaller value than for a steady episode.

The CPU 72 also variably sets the upper limit value GxH according to the amount of change ΔPA in the accelerator operation amount PA per unit time at the start of the episode. That is, for a transient episode in which the amount of change ΔPA is positive, the CPU 72 sets the upper limit value GxH to a larger value than for a steady episode. For a transient episode in which the amount of change ΔPA is negative, the CPU 72 sets the upper limit value GxH to a smaller value than for a steady episode.
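
The variable setting of ΔTrq, GxL, and GxH can be sketched in Python as follows. Every numeric value, the `STEADY` baseline, and the use of ΔPAth to separate steady from transient episodes are assumptions made for illustration; the source specifies only the direction in which each quantity moves.

```python
from typing import NamedTuple


class Criteria(NamedTuple):
    d_trq: float    # specified amount ΔTrq
    gx_low: float   # lower limit GxL
    gx_high: float  # upper limit GxH


# Hypothetical baseline for steady episodes; the source gives no numbers.
STEADY = Criteria(d_trq=5.0, gx_low=-0.5, gx_high=0.5)


def criteria_for_episode(dpa_at_start: float, dpa_th: float) -> Criteria:
    """Choose the drivability criteria from ΔPA at the start of the episode."""
    if abs(dpa_at_start) < dpa_th:
        return STEADY
    d_trq = STEADY.d_trq * 4.0  # a larger torque error is tolerated in transients
    if dpa_at_start > 0:
        # Pedal pressed: both acceleration limits are raised.
        return Criteria(d_trq, STEADY.gx_low + 1.0, STEADY.gx_high + 3.0)
    # Pedal released: both acceleration limits are lowered.
    return Criteria(d_trq, STEADY.gx_low - 3.0, STEADY.gx_high - 1.0)
```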

When the CPU 72 determines that the logical product is true (S62: YES), it assigns "10" to the reward r (S64); when it determines that the logical product is false (S62: NO), it assigns "-10" to the reward r (S66). The processing of S62 to S66 gives a larger reward when the drivability criterion is met than when it is not. When the processing of S64 or S66 is complete, the CPU 72 updates the relationship defining data DR stored in the storage device 76 shown in FIG. 1. This embodiment uses the ε-soft on-policy Monte Carlo method.
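The threshold selection and the reward of S62 to S66 can be sketched in Python as follows. This is only an illustration: it assumes that the logical product combines a torque-tracking condition (|Trq - Trq*| <= ΔTrq) and an acceleration condition (GxL <= Gx <= GxH) over the samples of an episode, since the conditions themselves are defined outside this passage. The function names, the transient cutoff, and every numeric threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    d_trq: float    # prescribed amount ΔTrq
    gx_low: float   # lower limit GxL
    gx_high: float  # upper limit GxH


def thresholds_for_episode(delta_pa, transient_limit=5.0):
    """Choose ΔTrq, GxL, GxH from ΔPA at the start of the episode.

    All numbers are made up; only their ordering follows the text.
    """
    if abs(delta_pa) < transient_limit:      # steady-state episode
        return Thresholds(10.0, -0.5, 0.5)
    if delta_pa > 0:                         # transient, pedal being pressed
        return Thresholds(30.0, 0.2, 3.0)    # GxL and GxH both raised
    return Thresholds(30.0, -3.0, -0.2)      # transient, pedal released: both lowered


def reward(samples, th):
    """samples: iterable of (trq, trq_cmd, gx) tuples for one episode."""
    ok = all(
        abs(trq - trq_cmd) <= th.d_trq and th.gx_low <= gx <= th.gx_high
        for trq, trq_cmd, gx in samples
    )
    return 10 if ok else -10                 # S64 / S66
```

The reward is binary, so the learning signal depends entirely on how ΔTrq, GxL, and GxH are chosen for each kind of episode.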

Specifically, the CPU 72 adds the reward r to each return R(Sj, Aj) determined by a pair of a state read in the processing of S60 and its corresponding action (S68). Here, "R(Sj, Aj)" collectively denotes the returns R whose state is an element of the state set Sj and whose action is an element of the action set Aj. Next, the CPU 72 averages each return R(Sj, Aj) determined by a pair of a state read in S60 and its corresponding action, and assigns the result to the corresponding action value function Q(Sj, Aj) (S70). The averaging may be performed by dividing the return R calculated in S68 by the number of times S68 has been performed plus a predetermined number. The initial value of the return R may be set to the initial value of the corresponding action value function Q.

Next, for each state read in the processing of S60, the CPU 72 assigns to the action Aj* the action, that is, the pair of the throttle opening degree command value TA* and the retardation amount aop, at which the corresponding action value function Q(Sj, A) takes its maximum value (S72). Here, "A" denotes any action that can be taken. The action Aj* takes a different value for each kind of state read in S60, but the same symbol is used here to simplify the notation.

Next, for each state read in the processing of S60, the CPU 72 updates the corresponding policy π(Aj|Sj) (S74). Specifically, letting |A| be the total number of actions, the selection probability of the action Aj* selected in S72 is set to "(1-ε)+ε/|A|", and the selection probability of each of the "|A|-1" actions other than Aj* is set to "ε/|A|". Because the processing of S74 is based on the action value function Q updated in S70, the relationship defining data DR, which defines the relationship between the state s and the action a, is updated so as to increase the return R.
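The update of S68 to S74 can be illustrated with a small tabular sketch. It is written under stated assumptions rather than taken from the embodiment: the class name, the dictionary-based tables, and the default values of ε, the initial Q value, and the "predetermined number" (`prior_count`) are all hypothetical.

```python
from collections import defaultdict


class EpsilonSoftMC:
    """Tabular epsilon-soft on-policy Monte Carlo update (illustrative only)."""

    def __init__(self, actions, epsilon=0.1, q_init=0.0, prior_count=1):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.prior_count = prior_count            # "predetermined number" added to the count
        self.ret = defaultdict(lambda: q_init)    # return R(s, a), starts at the initial Q
        self.count = defaultdict(int)             # how often S68 was applied to (s, a)
        self.q = defaultdict(lambda: q_init)      # action value function Q(s, a)
        self.policy = {}                          # state -> {action: probability}

    def update(self, episode, r):
        """episode: list of (state, action) pairs; r: reward of the episode."""
        for s, a in episode:
            self.ret[(s, a)] += r                 # S68: add the reward to the return
            self.count[(s, a)] += 1
            denom = self.count[(s, a)] + self.prior_count
            self.q[(s, a)] = self.ret[(s, a)] / denom          # S70: averaging
        n = len(self.actions)
        for s in {s for s, _ in episode}:
            best = max(self.actions, key=lambda a: self.q[(s, a)])  # S72: greedy action
            probs = {a: self.epsilon / n for a in self.actions}     # S74: eps/|A| each
            probs[best] += 1.0 - self.epsilon                       # greedy: (1-eps)+eps/|A|
            self.policy[s] = probs
```

With `prior_count=1` and the return initialised to the initial Q value, the first averaged value is the mean of the initial Q value and the first reward, which is one way to read the "number of times plus a predetermined number" divisor.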

When the processing of S74 is complete, the CPU 72 temporarily ends the series of processes shown in FIG. 4.
Returning to FIG. 3, the CPU 72 temporarily ends the series of processes shown in FIG. 3 when the processing of S50 is complete or when it makes a negative determination in S40, S44, or S48. The processing of S30 to S46 is implemented by the CPU 72 executing the control program 74a, and the processing of S48 and S50 is implemented by the CPU 72 executing the learning program 74b. The relationship defining data DR at the time the vehicle VC1 is shipped is data that has been trained in advance by executing processing similar to that shown in FIG. 3 while, for example, simulating vehicle travel on a test bench.

FIG. 5 shows the procedure of processing executed by the control device 70. The processing shown in FIG. 5 is implemented by the CPU 72 repeatedly executing the learning program 74b stored in the ROM 74, for example at a predetermined cycle.

In the series of processes shown in FIG. 5, the CPU 72 first determines whether the deterioration flag Fd has just switched from "1" to "2" (S80). When it determines that the flag has just switched from "1" to "2" (S80: YES), the CPU 72 assigns zero to "ε" so that the probability of selecting an action other than the greedy action becomes zero (S82).

On the other hand, when the CPU 72 makes a negative determination in S80, it determines whether the deterioration flag Fd has just switched from "2" to "3" (S83). When it determines that this is the case (S83: YES), the CPU 72 selects one of the states s defined as independent variables of the action value function Q (S84). Next, the CPU 72 assigns to the greedy action ag the action a that maximizes the action value function Q for the state s selected in S84 (S86). The CPU 72 then restricts the set As of actions a that can be taken in the state s selected in S84 to actions a whose difference from the greedy action ag is no greater than a predetermined value δ (S88). Here, an action whose absolute difference from the greedy action ag is no greater than δ means an action for which the absolute differences from the throttle opening degree command value TA* and from the retardation amount aop of the greedy action ag are each no greater than δ multiplied by the size of the range of values that the respective variable can take. That is, letting TAmax be the size of the range of values that the throttle opening degree command value TA* can take, and with "0<δ<1", the CPU 72 restricts the actions so that the absolute difference between TA* and the value indicated by the greedy action is no greater than "δ·TAmax". Likewise, letting aopmax be the size of the range of values that the retardation amount aop can take, the CPU 72 restricts the actions so that the absolute difference between aop and the value indicated by the greedy action is no greater than "δ·aopmax".

When the processing of S88 is complete, the CPU 72 determines whether all of the states s defined as independent variables of the action value function Q have been selected in S84 (S90). When it determines that some state s has not yet been selected (S90: NO), the CPU 72 returns to the processing of S84.

In contrast, when the CPU 72 determines that all states have been selected (S90: YES), when the processing of S82 is complete, or when it makes a negative determination in S83, it temporarily ends the series of processes shown in FIG. 5.
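The loop of S84 to S90 can be sketched as below. The sketch assumes a Q table stored as nested dictionaries and actions represented as (TA*, aop) tuples; the function name and the data layout are hypothetical.

```python
def restrict_action_sets(q_table, actions, delta, ta_span, aop_span):
    """Limit each state's action set to a box around its greedy action.

    q_table : dict mapping state -> dict mapping action -> Q value
    actions : list of (ta_cmd, aop) tuples
    delta   : predetermined value with 0 < delta < 1
    ta_span : size of the range TA* can take (TAmax)
    aop_span: size of the range aop can take (aopmax)
    """
    restricted = {}
    for s, q_row in q_table.items():                    # S84, repeated until S90 is YES
        ag = max(actions, key=lambda a: q_row[a])       # S86: greedy action
        restricted[s] = [
            a for a in actions
            if abs(a[0] - ag[0]) <= delta * ta_span     # |TA* - TA*(greedy)| <= delta*TAmax
            and abs(a[1] - ag[1]) <= delta * aop_span   # |aop - aop(greedy)| <= delta*aopmax
        ]                                               # S88
    return restricted
```

The greedy action always satisfies both conditions, so the restricted set is never empty.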

The operation and effects of this embodiment are described here.
The CPU 72 acquires time-series data of the accelerator operation amount PA as the user operates the accelerator pedal 86, and sets the action a, consisting of the throttle opening degree command value TA* and the retardation amount aop, according to the policy π. Basically, the CPU 72 selects the action a that maximizes the expected return based on the action value function Q defined in the relationship defining data DR. With the predetermined probability "ε-ε/|A|", however, the CPU 72 selects an action other than the one that maximizes the expected return, and thereby explores for the action a that maximizes the expected return. This allows the relationship defining data DR to be updated by reinforcement learning as the user drives the vehicle VC1. Consequently, the throttle opening degree command value TA* and the retardation amount aop corresponding to the accelerator operation amount PA can be set to values appropriate for the travel of the vehicle VC1 without requiring excessive man-hours from skilled workers.
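The exploration probability follows from the selection probabilities set in S74. A short worked check, using the same symbols (the numeric example is illustrative only):

```latex
\begin{align*}
\pi(A_j^{*}\mid S_j) &= (1-\varepsilon) + \frac{\varepsilon}{|A|},\\
\sum_{a \neq A_j^{*}} \pi(a \mid S_j) &= (|A|-1)\,\frac{\varepsilon}{|A|}
  = \varepsilon - \frac{\varepsilon}{|A|},\\
\pi(A_j^{*}\mid S_j) + \sum_{a \neq A_j^{*}} \pi(a \mid S_j) &= 1.
\end{align*}
```

For instance, with ε = 0.1 and |A| = 10, the greedy action is chosen with probability 0.91 and some other action with probability 0.09.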

In this way, after the vehicle VC1 is shipped, the relationship defining data DR is updated as the vehicle VC1 travels until the travel distance RL exceeds the convergence determination value RLthL. Once the travel distance RL is equal to or greater than the convergence determination value RLthL, the relationship defining data DR is regarded as having converged to values optimal for the travel of the vehicle VC1, and "ε" is set to zero so that the policy π takes only greedy actions.

Even at the same throttle opening degree TA, if deposits accumulate on the throttle valve 14 or in the intake passage 12 as the vehicle VC1 deteriorates, the flow passage cross-sectional area of the intake passage 12 decreases, and so the intake air amount Ga decreases. Therefore, when the travel distance RL greatly exceeds the convergence determination value RLthL and the deterioration of the vehicle VC1 progresses, the throttle opening degree command value TA* that maximizes the expected return for given time-series data of the accelerator operation amount PA may deviate from the value defined by the relationship defining data DR as of the time the travel distance RL reached the convergence determination value RLthL.

Therefore, when the travel distance RL of the vehicle VC1 is equal to or greater than the deterioration threshold RLthH, the CPU 72 according to this embodiment makes the probability "ε-ε/|A|" that an action other than the greedy action is adopted greater than zero. However, the CPU 72 restricts the actions selectable as non-greedy actions to fewer than when the travel distance RL is equal to or less than the convergence determination value RLthL. Specifically, the actions are restricted so that the absolute difference from the throttle opening degree command value TA* indicated by the greedy action is no greater than "δ·TAmax" and the absolute difference from the retardation amount aop indicated by the greedy action is no greater than "δ·aopmax". This reflects the expectation that, even if deterioration changes the greedy action, the change from the pre-deterioration greedy action will not be very large. Restricting the search range in this way suppresses unnecessary exploration using actions that cannot become the greedy action.
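The three mileage phases described here (full exploration, no exploration, restricted exploration) can be summarised in a short sketch. The function name, the argument layout, and the handling of the exact boundary RL = RLthL are assumptions; `near_greedy` stands for the restricted action set produced by S88 and must contain the greedy action.

```python
import random


def choose_action(q_row, actions, rl, rl_th_low, rl_th_high,
                  epsilon, near_greedy, rng=random):
    """Pick an action for one state according to the travel distance rl.

    q_row      : dict mapping action -> Q value for the current state
    rl_th_low  : convergence determination value RLthL
    rl_th_high : deterioration threshold RLthH
    """
    greedy = max(actions, key=lambda a: q_row[a])
    if rl_th_low <= rl < rl_th_high:
        return greedy                        # exploration prohibited (epsilon = 0)
    # Before convergence every action is a candidate; after deterioration
    # only the actions near the greedy action are.
    candidates = actions if rl < rl_th_low else near_greedy
    if rng.random() < epsilon:
        # Uniform draw that may also return the greedy action, so a
        # non-greedy action is taken with probability eps - eps/len(candidates).
        return rng.choice(candidates)
    return greedy
```

Passing a seeded `random.Random` instance as `rng` makes the behaviour reproducible, which is convenient when checking the phase boundaries.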

According to the embodiment described above, the following effects are also obtained.
(1) When the travel distance RL exceeds the convergence determination value RLthL and is less than the deterioration threshold RLthH, exploration is prohibited. This prevents unnecessary exploration from continuing and, in turn, prevents actions other than the optimal action from being taken.

(2) The travel distance RL is used as the deterioration variable, that is, the variable indicating the degree of deterioration of the vehicle VC1, and the search range is changed according to it. This makes it possible to quantify the presence or absence of deterioration of the vehicle VC1 in a simple manner.

<Second embodiment>
The second embodiment is described below with reference to the drawings, focusing on differences from the first embodiment.

In this embodiment, the relationship defining data DR is updated outside the vehicle VC1.
FIG. 6 shows the configuration of a control system that executes reinforcement learning in this embodiment. In FIG. 6, members corresponding to those shown in FIG. 1 are given the same reference numerals for convenience.

The ROM 74 of the control device 70 in the vehicle VC1 shown in FIG. 6 stores the control program 74a but does not store the learning program 74b. The control device 70 also includes a communication device 77. The communication device 77 is a device for communicating with the data analysis center 110 via a network 100 outside the vehicle VC1.

The data analysis center 110 analyzes data transmitted from a plurality of vehicles VC1, VC2, and so on. The data analysis center 110 includes a CPU 112, a ROM 114, an electrically rewritable nonvolatile memory (storage device 116), a peripheral circuit 118, and a communication device 117, which can communicate with one another via a local network 119. The ROM 114 stores a learning program 114a, and the storage device 116 stores the relationship defining data DR.

FIG. 7 shows the reinforcement learning procedure according to this embodiment. The processing shown in FIG. 7(a) is implemented by the CPU 72 executing the control program 74a stored in the ROM 74 shown in FIG. 6. The processing shown in FIG. 7(b) is implemented by the CPU 112 executing the learning program 114a stored in the ROM 114. In FIG. 7, processes corresponding to those shown in FIG. 2 are given the same step numbers for convenience. The processing shown in FIG. 7 is described below in the chronological order of the reinforcement learning.

In the series of processes shown in FIG. 7(a), the CPU 72 executes the processing of S30 to S48 and, when it makes an affirmative determination in S48, operates the communication device 77 to transmit the data needed to update the relationship defining data DR (S100). The transmitted data includes the states s set in the processing of S30 within a predetermined period, the actions a set in the processing of S32 within the predetermined period, and the torque command value Trq*, the torque Trq, and the acceleration Gx acquired in the processing of S36 within the predetermined period.

Meanwhile, as shown in FIG. 7(b), the CPU 112 receives the transmitted data (S110) and updates the relationship defining data DR based on the received data (S50). The CPU 112 then determines whether the relationship defining data DR has been updated a predetermined number of times or more (S112). When it determines that this is the case (S112: YES), it operates the communication device 117 to transmit the relationship defining data DR to the vehicle VC1 that sent the data received in S110 (S114). The CPU 112 temporarily ends the series of processes shown in FIG. 7(b) when the processing of S114 is complete or when it makes a negative determination in S112.

Meanwhile, as shown in FIG. 7(a), the CPU 72 determines whether there is update data (S102). When it determines that there is (S102: YES), it receives the updated relationship defining data DR (S104). The CPU 72 then overwrites the relationship defining data DR used in the processing of S32 with the received relationship defining data DR (S106). The CPU 72 temporarily ends the series of processes shown in FIG. 7(a) when the processing of S106 is complete or when it makes a negative determination in S40, S44, S48, or S102.

As described above, this embodiment updates the relationship defining data DR outside the vehicle VC1, which reduces the computational load on the control device 70. Furthermore, if, for example, data from a plurality of vehicles VC1 and VC2 is received in the processing of S110 and used in the processing of S50, the amount of data used for learning can easily be increased.
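The division of work between the vehicle and the data analysis center can be sketched as follows. The class and field names are hypothetical, `update_fn` merely stands in for the update of S50, and the text does not say whether the update counter is reset after S114, so this sketch leaves it running.

```python
from dataclasses import dataclass


@dataclass
class EpisodeData:
    """Payload sent by the vehicle in S100 (field names are illustrative)."""
    vehicle_id: str
    states: list    # states s set in S30
    actions: list   # actions a set in S32
    trq_cmd: list   # torque command values Trq*
    trq: list       # torques Trq
    gx: list        # accelerations Gx


class AnalysisCenter:
    def __init__(self, update_fn, send_every):
        self.update_fn = update_fn      # stand-in for the update of S50
        self.send_every = send_every    # "predetermined number of times" in S112
        self.relation_data = {}         # relationship defining data DR
        self.update_count = 0

    def receive(self, payload):
        """S110 to S114: update DR, return a copy once enough updates were made."""
        self.update_fn(self.relation_data, payload)   # S50
        self.update_count += 1
        if self.update_count >= self.send_every:      # S112
            return dict(self.relation_data)           # S114: DR sent to the vehicle
        return None                                   # nothing to send yet
```

On the vehicle side, a non-`None` return value would correspond to the update data detected in S102 and written over the local DR in S106.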

<Correspondence>
The correspondence between the items in the above embodiments and the items described in the "Means for Solving the Problems" section is as follows. The correspondence is given below for each numbered solution in that section. [1] The execution device corresponds to the CPU 72 and the ROM 74, and the storage device corresponds to the storage device 76. The state acquisition process corresponds to the processing of S30 and S36, the operation process to the processing of S34, the reward calculation process to the processing of S62 to S66, and the update process to the processing of S68 to S74. The deterioration variable acquisition process corresponds to the processing of S10, and the change process corresponds to the processing of FIG. 5. The update mapping corresponds to the mapping defined by the instructions in the learning program 74b that execute the processing of S68 to S74. The case where the degree of deterioration is equal to or greater than the predetermined degree corresponds to the case where the travel distance RL is equal to or greater than the deterioration threshold RLthH. [2] Corresponds to the processing of S88. [3] The first range corresponds to all of the actions considered in the processing of S86. The second range corresponds to zero. The third range corresponds to the range set by the processing of S88. [4-6] The first execution device corresponds to the CPU 72 and the ROM 74, and the second execution device corresponds to the CPU 112 and the ROM 114. [7] The computer corresponds to the CPU 72 in FIG. 1 and to the CPUs 72 and 112 in FIG. 6.

<Other embodiments>
The embodiments can be modified as follows. The embodiments and the following modifications can be combined with one another as long as no technical contradiction arises.

"Regarding the deterioration variable"
- The deterioration variable is not limited to the travel distance RL. For example, when an air-fuel ratio sensor is provided, it may be the amount of fluctuation in the sensor's detected value. As another example, when the air-fuel ratio is controlled by both open-loop control and feedback control, it may be the magnitude of the feedback correction amount applied to the fuel injection amount of the fuel injection valve 16.

- When the deterioration variable is a variable that subdivides the case where the degree of deterioration is less than the predetermined degree by a quantity positively correlated with the passage of time, the deterioration variable may be composed of, for example, a pair of the travel distance RL, used to determine whether the degree of deterioration is equal to or greater than the predetermined degree, and a variable indicating whether the value of the action value function Q has converged.

"Regarding the change process"
- In the above embodiment, when the travel distance RL is greater than the convergence determination value RLthL and less than the deterioration threshold RLthH, only the greedy action is used and exploration is prohibited, but this is not a limitation. For example, exploration within a range narrower than that defined by the processing of S88 may be allowed.

- For example, as described in the "Regarding the computer" section, the change process may be configured as a process that prohibits exploration at the time of product shipment and starts exploration once the degree of deterioration is equal to or greater than the predetermined degree.

"Regarding the action variable"
- In the above embodiment, the throttle opening degree command value TA* was given as an example of a variable relating to the opening degree of the throttle valve serving as an action variable, but this is not a limitation. For example, the responsiveness of the throttle opening degree command value TA* to the accelerator operation amount PA may be expressed by a dead time and a second-order lag filter, and the three variables in total, namely the dead time and the two variables that define the second-order lag filter, may be used as the variables relating to the opening degree of the throttle valve. In that case, however, it is desirable that the state variable be the change per unit time in the accelerator operation amount PA instead of the time-series data of the accelerator operation amount PA.

- In the above embodiment, the retardation amount aop was given as an example of a variable relating to the ignition timing serving as an action variable, but this is not a limitation. For example, the ignition timing itself, which is subject to correction by the KCS, may be used.

- In the above embodiment, a variable relating to the opening degree of the throttle valve and a variable relating to the ignition timing were given as examples of action variables, but this is not a limitation. For example, the fuel injection amount may be used in addition to these two variables. Of these three, only the variable relating to the opening degree of the throttle valve and the fuel injection amount may be adopted as action variables, or only the variable relating to the ignition timing and the fuel injection amount. Alternatively, only one of the three may be adopted as the action variable.

- As described in the "Regarding the internal combustion engine" section, for a compression ignition internal combustion engine, a variable relating to the injection amount may be used in place of the variable relating to the opening degree of the throttle valve, and a variable relating to the injection timing in place of the variable relating to the ignition timing. In addition to the variable relating to the injection timing, it is desirable to add a variable relating to the number of injections in one combustion cycle, and a variable relating to the time interval between the end timing of one and the start timing of the other of two chronologically adjacent fuel injections for one cylinder in one combustion cycle.

- For example, when the transmission 50 is a stepped transmission, the current value of a solenoid valve that hydraulically adjusts the engagement state of a clutch, or the like, may be used as an action variable.
- For example, when a hybrid vehicle, an electric vehicle, or a fuel cell vehicle is adopted as the vehicle, as described in the "Regarding the vehicle" section below, the torque or output of the rotating electric machine may be used as an action variable. As another example, when the vehicle has an on-board air conditioner with a compressor driven by the rotational power of the crankshaft of the internal combustion engine, the load torque of the compressor may be included in the action variables. When the vehicle has an electric on-board air conditioner, the power consumption of the air conditioner may be included in the action variables.
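As an illustration of the modification above in which the response of TA* to PA is expressed by a dead time and a second-order lag filter, the sketch below implements such a filter in discrete time. It assumes that the two variables defining the second-order lag are a natural frequency and a damping ratio, which the text does not specify; the class name, the discretisation, and the sample values are likewise hypothetical.

```python
from collections import deque


class DeadTimeSecondOrderLag:
    """Dead time followed by a second-order lag, stepped with a fixed period dt."""

    def __init__(self, dead_steps, omega_n, zeta, dt):
        self.buf = deque([0.0] * dead_steps)   # delay line: dead time = dead_steps * dt
        self.omega_n = omega_n                 # natural frequency [rad/s] (assumed parameter)
        self.zeta = zeta                       # damping ratio (assumed parameter)
        self.dt = dt
        self.y = 0.0                           # filter output (e.g. TA*)
        self.v = 0.0                           # time derivative of the output

    def step(self, u):
        """Feed one input sample u (e.g. a value derived from PA), return the output."""
        self.buf.append(u)
        delayed = self.buf.popleft()           # input from dead_steps samples ago
        acc = (self.omega_n ** 2) * (delayed - self.y) \
            - 2.0 * self.zeta * self.omega_n * self.v
        self.v += acc * self.dt                # semi-implicit Euler integration
        self.y += self.v * self.dt
        return self.y
```

Under this reading, the three action variables would be `dead_steps`, `omega_n`, and `zeta`, and the learning would tune the shape of the response rather than each individual TA* value.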

"Regarding the state"
- In the above embodiment, the time-series data of the accelerator operation amount PA consists of six values sampled at equal intervals, but this is not a limitation. Any data consisting of two or more values sampled at mutually different timings will do; data consisting of three or more sampled values, or data sampled at equal intervals, is more desirable.

- The state variable relating to the accelerator operation amount is not limited to the time-series data of the accelerator operation amount PA. For example, as described in the "Regarding the action variable" section, it may be the change per unit time in the accelerator operation amount PA.

- For example, when the current value of a solenoid valve is used as an action variable, as described in the "Regarding the action variable" section, the state may include the rotation speed of the input shaft 52 of the transmission, the rotation speed of the output shaft 54, and the hydraulic pressure adjusted by the solenoid valve. When the torque or output of the rotating electric machine is used as an action variable, as described in the same section, the state may include the charging rate and temperature of the battery. When the load torque of the compressor or the power consumption of the air conditioner is included in the action, as described in the same section, the state may include the temperature in the vehicle cabin.

"Regarding dimension reduction of table-format data"
- The method of reducing the dimensions of table-format data is not limited to the one given in the above embodiment. For example, because the accelerator operation amount PA rarely reaches its maximum value, the action value function Q may be left undefined for states in which the accelerator operation amount PA is equal to or greater than a specified amount, and the throttle opening degree command value TA* and the like for such cases may be adapted separately. As another example, dimensions may be reduced by excluding, from the values the action can take, those for which the throttle opening degree command value TA* is equal to or greater than a specified value.

"Regarding Relationship-Defining Data"
・In the above embodiment, the action value function Q is a table-format function, but this is not limiting. For example, a function approximator may be used.

・For example, instead of using the action value function Q, the policy π may be expressed by a function approximator that takes the state s and the action a as independent variables and the probability of taking the action a as the dependent variable, and the parameters that define the function approximator may be updated according to the reward r.

"Regarding the Operation Process"
・For example, when the action value function is a function approximator, as described in the section "Regarding Relationship-Defining Data", the action a that maximizes the action value function Q may be selected by inputting into Q, together with the state s, every set of discrete action values that served as independent variables of the table-format function in the above embodiment.
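The selection rule above amounts to an exhaustive argmax over the discrete action grid. The following is a minimal sketch under stated assumptions: `toy_q` is a hypothetical stand-in for a trained function approximator, and the two value lists stand for whatever discrete action components the embodiment's table used.

```python
from itertools import product


def greedy_action(q, state, first_values, second_values):
    """Return the discrete action tuple that maximises q(state, action)."""
    candidates = product(first_values, second_values)
    return max(candidates, key=lambda action: q(state, action))


def toy_q(state, action):
    """Hypothetical approximator: prefers the first component near the state."""
    first, second = action
    return -(first - state) ** 2 - 0.1 * second ** 2


best = greedy_action(toy_q, 40.0, [0.0, 20.0, 40.0, 60.0], [0.0, 2.0, 4.0])
```

Evaluating every combination keeps the action set identical to that of the table-format function, at the cost of one approximator call per combination.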

・For example, when the policy π is a function approximator that takes the state s and the action a as independent variables and the probability of taking the action a as the dependent variable, as described in the section "Regarding Relationship-Defining Data", the action a may be selected based on the probabilities given by the policy π.
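Selecting an action from a probabilistic policy can be sketched as follows. The softmax parameterisation and the preference function `toy_prefs` are assumptions made for illustration; the source only states that the approximator outputs the probability of taking the action a in the state s.

```python
import math
import random


def softmax(prefs):
    """Convert real-valued preferences into probabilities."""
    m = max(prefs)                      # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(policy_prefs, state, actions, rng):
    """Draw one action according to the probabilities given by the policy."""
    probs = softmax([policy_prefs(state, a) for a in actions])
    action = rng.choices(actions, weights=probs, k=1)[0]
    return action, probs


def toy_prefs(state, action):
    """Hypothetical preference: favours actions close to the state value."""
    return -abs(action - state)


action, probs = sample_action(toy_prefs, 40.0, [0.0, 20.0, 40.0, 60.0],
                              random.Random(0))
```

Because the action is sampled rather than maximised, exploration is built into the policy itself.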

"Regarding the Update Map"
・The processing of S68 to S74 was exemplified using an ε-soft on-policy Monte Carlo method, but this is not limiting. For example, an off-policy Monte Carlo method may be used. Nor is the update limited to Monte Carlo methods: for example, an off-policy TD method may be used, an on-policy TD method such as SARSA may be used, or an eligibility trace method may be used as on-policy learning.
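The difference between the on-policy and off-policy TD alternatives mentioned above can be sketched with textbook one-step updates on a table-format Q. This is not the embodiment's S68 to S74 processing; the step size `alpha`, discount `gamma`, and the state and action labels are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD: bootstrap from the action actually taken next."""
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD: bootstrap from the greedy action in the next state."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])


actions = ["lo", "hi"]

q_on = defaultdict(float)
q_on[("s1", "hi")] = 1.0
sarsa_update(q_on, "s0", "lo", 1.0, "s1", "lo")          # next action was "lo"

q_off = defaultdict(float)
q_off[("s1", "hi")] = 1.0
q_learning_update(q_off, "s0", "lo", 1.0, "s1", actions)  # uses best next action
```

With identical experience the two updates differ only in the bootstrap term, which is why the off-policy variant can learn the greedy policy while the vehicle follows an exploratory one.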

・For example, when the policy π is expressed using a function approximator and updated directly based on the reward r, as described in the section "Regarding Relationship-Defining Data", the update map may be constructed using a policy gradient method or the like.
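One way such an update map might look is a REINFORCE-style Monte Carlo policy gradient. The sketch below assumes a tabular softmax policy with preferences `theta[state][action_index]`; that parameterisation, the step size, and the episode format are assumptions for illustration and are not taken from the source.

```python
import math


def action_probs(theta, s):
    """Softmax probabilities over the action preferences of state s."""
    prefs = theta[s]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """Update theta in place from one episode of (state, action_index, reward).

    For a softmax policy, d/d theta[s][b] of log pi(a|s) is 1[a == b] - pi(b|s).
    The gamma**t weighting of each step is omitted (exact when gamma == 1).
    """
    g = 0.0
    for s, a, r in reversed(episode):
        g = r + gamma * g                       # return following this step
        probs = action_probs(theta, s)          # evaluated before the update
        for b in range(len(theta[s])):
            grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * g * grad_log_pi


theta = {0: [0.0, 0.0]}
reinforce_update(theta, [(0, 1, 1.0)])          # action 1 earned reward 1
```

After this single update the policy assigns a higher probability to the rewarded action, which is the sense in which the update increases the expected return.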

・The target of direct updating by the reward r is not limited to only one of the action value function Q and the policy π. For example, both the action value function Q and the policy π may be updated, as in an actor-critic method. Furthermore, the actor-critic method is not limited to this; for example, the value function V may be updated instead of the action value function Q.
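The variant in which the value function V is the critic can be sketched as a one-step actor-critic update. The tabular softmax actor, the step sizes, and the state labels are assumptions for illustration only.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next,
                      alpha_pi=0.1, alpha_v=0.1, gamma=0.9):
    """Update critic V and actor preferences theta from one transition."""
    td_error = r + gamma * v[s_next] - v[s]     # critic's one-step TD error
    v[s] += alpha_v * td_error                  # critic update
    probs = softmax(theta[s])
    for b in range(len(theta[s])):              # actor update
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_pi * td_error * grad_log_pi
    return td_error


theta = {0: [0.0, 0.0]}
v = {0: 0.0, 1: 0.0}
delta = actor_critic_step(theta, v, s=0, a=0, r=1.0, s_next=1)
```

The TD error plays the role that the full return plays in a Monte Carlo policy gradient, so both V and π are updated from every transition.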

"Regarding the Reward Calculation Process"
・In the above embodiment, the reward is given according to whether the logical product of condition (a) and condition (b) is true, but this is not limiting. For example, a process of giving a reward according to whether condition (a) is satisfied and a process of giving a reward according to whether condition (b) is satisfied may both be executed.

・For example, instead of uniformly giving the same reward whenever condition (a) is satisfied, a larger reward may be given when the absolute value of the difference between the torque Trq and the torque command value Trq* is small than when it is large. Likewise, instead of uniformly giving the same reward whenever condition (a) is not satisfied, a smaller reward may be given when the absolute value of the difference between the torque Trq and the torque command value Trq* is large than when it is small.
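A graded reward of this kind can be sketched as a function that decreases monotonically with the torque error. The linear form, the `tolerance` value, and the assumption that condition (a) is a tolerance on the torque error are illustrative choices, not taken from the source.

```python
def torque_reward(trq, trq_cmd, tolerance=10.0, scale=1.0):
    """Reward that shrinks as |Trq - Trq*| grows.

    Within the assumed tolerance the reward is positive and larger for smaller
    errors; outside it the reward is negative and smaller for larger errors.
    """
    error = abs(trq - trq_cmd)
    return scale * (1.0 - error / tolerance)


r_exact = torque_reward(100.0, 100.0)   # no error
r_close = torque_reward(100.0, 105.0)   # within tolerance
r_far = torque_reward(100.0, 120.0)     # outside tolerance
```

Compared with a single fixed reward per outcome, the graded form tells the learner how far an action was from satisfying the condition, not only whether it did.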

・For example, instead of uniformly giving the same reward whenever condition (b) is satisfied, the magnitude of the reward may be varied according to the magnitude of the acceleration Gx. Likewise, instead of uniformly giving the same reward whenever condition (b) is not satisfied, the magnitude of the reward may be varied according to the magnitude of the acceleration Gx.

・The reward calculation process is not limited to giving the reward r according to whether a criterion regarding drivability is satisfied. For example, it may be a process that gives a larger reward when the energy use efficiency satisfies a criterion than when it does not, or a process that gives a larger reward when the exhaust characteristics satisfy a criterion than when they do not. The reward calculation process may also include two or all three of the following: the process that gives a larger reward when the drivability criterion is satisfied than when it is not, the process that gives a larger reward when the energy use efficiency satisfies its criterion than when it does not, and the process that gives a larger reward when the exhaust characteristics satisfy their criterion than when they do not.

・For example, when the current value of the solenoid valve of the transmission 50 is used as an action variable, as described in the section "Regarding Action Variables", the reward calculation process may include at least one of the following three processes (a) to (c).

(a) A process of giving a larger reward when the time required for the transmission to switch the gear ratio is within a predetermined time than when it exceeds the predetermined time.
(b) A process of giving a larger reward when the absolute value of the rate of change of the rotation speed of the input shaft 52 of the transmission is equal to or less than an input-side predetermined value than when it exceeds the input-side predetermined value.
(c) A process of giving a larger reward when the absolute value of the rate of change of the rotation speed of the output shaft 54 of the transmission is equal to or less than an output-side predetermined value than when it exceeds the output-side predetermined value.

・For example, when the torque or output of the rotating electric machine is used as an action variable, as described in the section "Regarding Action Variables", the reward calculation process may include a process of giving a larger reward when the charging rate of the battery is within a predetermined range than when it is not, or a process of giving a larger reward when the temperature of the battery is within a predetermined range than when it is not. Likewise, when the load torque of the compressor or the power consumption of the air conditioner is included in the action variables, as described in the same section, a process of giving a larger reward when the temperature in the passenger compartment is within a predetermined range than when it is not may be added.
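The three shift-related reward terms (a) to (c) listed above can be combined as in the following minimal sketch. All thresholds, units, and the bonus and penalty values are hypothetical; the source only requires that each satisfied criterion earn a larger reward than an unsatisfied one.

```python
def shift_reward(shift_time, input_accel, output_accel,
                 max_time=0.5, input_limit=500.0, output_limit=200.0,
                 bonus=1.0, penalty=-1.0):
    """Sum of three criterion-based terms for one gear shift.

    shift_time   : time needed to switch the gear ratio (assumed seconds)
    input_accel  : rate of change of input shaft 52 speed (assumed rpm/s)
    output_accel : rate of change of output shaft 54 speed (assumed rpm/s)
    """
    reward = 0.0
    reward += bonus if shift_time <= max_time else penalty               # (a)
    reward += bonus if abs(input_accel) <= input_limit else penalty      # (b)
    reward += bonus if abs(output_accel) <= output_limit else penalty    # (c)
    return reward


r_good = shift_reward(0.3, 200.0, -100.0)    # all three criteria met
r_slow = shift_reward(0.8, 200.0, -100.0)    # only (a) violated
```

Terms for the battery charging rate, the battery temperature, or the passenger compartment temperature could be added in the same range-check style.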

"Regarding the Vehicle Control System"
・In the processing of FIG. 7, the entire processing of S50 is executed at the data analysis center 110, but this is not limiting. For example, of the processing of S50, the processing of S62 to S66 may be executed on the vehicle VC1 side, and the processing of S100 may be partially modified so that the calculated reward r is transmitted.

・The vehicle control system is not limited to one configured by the control device 70 and the data analysis center 110. For example, a mobile terminal carried by the user may be used in place of the data analysis center 110, so that the vehicle control system is configured by the control device 70 and the mobile terminal. As another example, the system may be configured by the control device 70, the mobile terminal, and the data analysis center 110. This can be realized, for example, by having the mobile terminal execute the processing of S32 in FIG. 7.

"Regarding the Execution Device"
・The execution device is not limited to one that includes the CPU 72 (112) and the ROM 74 (114) and executes software processing. For example, a dedicated hardware circuit such as an ASIC may be provided to perform, in hardware, at least part of what is processed by software in the above embodiment. That is, the execution device may have any one of the following configurations (a) to (c). (a) A processing device that executes all of the above processing according to a program, and a program storage device such as a ROM that stores the program. (b) A processing device and a program storage device that execute part of the above processing according to a program, and a dedicated hardware circuit that executes the remaining processing. (c) A dedicated hardware circuit that executes all of the above processing. Here, there may be a plurality of software execution devices each including a processing device and a program storage device, and a plurality of dedicated hardware circuits.

"Regarding the Computer"
・The computer is not limited to the CPU 72 of FIG. 1 or the CPUs 72 and 112 of FIG. 6. For example, it may consist of a computer for generating the relationship-defining data DR before shipment of the vehicle VC1 and the CPU 72 mounted on the vehicle VC1. In that case, exploration may be prohibited at the time of shipment and permitted once the travel distance RL becomes equal to or greater than the deterioration threshold RLthH. The exploration performed here preferably covers a smaller range of possible action-variable values than the exploration in the reinforcement learning executed by the computer that generates the relationship-defining data DR. Incidentally, in the process of generating the relationship-defining data DR before shipment, no actual vehicle need exist: the state of the vehicle may be generated in a simulated manner by operating the internal combustion engine 10 and the like on a test bench to simulate travel of the vehicle, and the simulated state, grasped from sensor detection values and the like, may be used for the reinforcement learning. In that case, the simulated vehicle state is regarded as the vehicle state based on the sensor detection values.
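The gating of exploration by the travel distance RL can be sketched as follows. Restricting exploration to the immediate neighbours of the greedy action is one assumed way of keeping the explored range small; the exploration rate, the neighbourhood width `max_step`, and the function name are hypothetical.

```python
import random


def select_action(greedy, actions, travel_distance, rl_th_h,
                  epsilon_after=0.05, max_step=1, rng=random):
    """Pick an action, exploring only once RL has reached the threshold RLthH.

    Before the threshold the greedy action is always used (exploration
    prohibited). Afterwards, with probability epsilon_after, a value within
    max_step grid positions of the greedy action is tried instead.
    """
    if travel_distance < rl_th_h:
        return greedy
    if rng.random() >= epsilon_after:
        return greedy
    i = actions.index(greedy)
    lo, hi = max(0, i - max_step), min(len(actions) - 1, i + max_step)
    neighbours = [actions[j] for j in range(lo, hi + 1) if j != i]
    return rng.choice(neighbours) if neighbours else greedy


ta_grid = [0, 20, 40, 60, 80]                 # illustrative action values
new_vehicle = select_action(40, ta_grid, travel_distance=5_000, rl_th_h=100_000)
aged_vehicle = select_action(40, ta_grid, travel_distance=150_000,
                             rl_th_h=100_000, epsilon_after=1.0,
                             rng=random.Random(0))
```

In the second call the exploration rate is set to 1.0 only to make the exploratory branch visible; a deployed value would presumably be much smaller.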

"Regarding the Storage Device"
・In the above embodiment, the storage device that stores the relationship-defining data DR and the storage device (ROM 74) that stores the learning program 74b and the control program 74a are separate storage devices, but this is not limiting.

"Regarding the Internal Combustion Engine"
・The internal combustion engine is not limited to one that includes, as a fuel injection valve, a port injection valve that injects fuel into the intake passage 12. It may include an in-cylinder injection valve that injects fuel directly into the combustion chamber 24, or it may include both a port injection valve and an in-cylinder injection valve.

・The internal combustion engine is not limited to a spark-ignition internal combustion engine and may be, for example, a compression-ignition internal combustion engine that uses light oil or the like as fuel.

"Regarding the Vehicle"
・The vehicle is not limited to one whose only thrust generating device is an internal combustion engine and may be, for example, a so-called hybrid vehicle that includes an internal combustion engine and a rotating electric machine. As another example, it may be a so-called electric vehicle or fuel cell vehicle that includes a rotating electric machine as the thrust generating device without including an internal combustion engine.

Description of Symbols
10…Internal combustion engine
12…Intake passage
14…Throttle valve
16…Fuel injection valve
18…Intake valve
20…Cylinder
22…Piston
24…Combustion chamber
26…Ignition device
28…Crankshaft
70…Control device
110…Data analysis center

Claims

1. A vehicle control device comprising an execution device and a storage device, wherein
the storage device stores relationship-defining data that defines a relationship between a state of a vehicle and an action variable, which is a variable related to operation of an electronic device mounted on the vehicle,
the execution device executes:
a state acquisition process of acquiring the state of the vehicle at each time based on the detection value of a sensor at that time;
an operation process of operating the electronic device based on the value of the action variable determined by the state of the vehicle acquired by the state acquisition process and the relationship-defining data;
a reward calculation process of giving, based on the state of the vehicle acquired by the state acquisition process, a larger reward when a characteristic of the vehicle satisfies a criterion than when it does not;
an update process of updating the relationship-defining data by inputting, to a predetermined update map, the state of the vehicle acquired by the state acquisition process, the value of the action variable used for operating the electronic device, and the reward corresponding to that operation;
a deterioration variable acquisition process of acquiring a deterioration variable, which is a variable indicating the degree of deterioration of the vehicle; and
a change process of changing, when the degree of deterioration of the vehicle is equal to or greater than a predetermined degree, the range in which the operation process adopts a value of the action variable other than the value that maximizes the expected return for the reward, toward the expanding side compared with when the degree of deterioration is less than the predetermined degree, and
the update map outputs the relationship-defining data updated so as to increase the expected return when the electronic device is operated in accordance with the relationship-defining data.

2. The vehicle control device according to claim 1, wherein the change process includes a process of expanding the range in which a value other than the value that maximizes the expected return is adopted from zero to a range greater than zero.

3. The vehicle control device according to claim 2, wherein
the deterioration variable is also a variable that subdivides the case where the degree of deterioration is less than the predetermined degree by an amount having a positive correlation with the passage of time,
the change process includes a process of changing, as time elapses, the range in which a value other than the value that maximizes the expected return is adopted from a first range to a second range and then to a third range,
the first range is larger than the second range and the third range,
the third range is larger than the second range, and
the process of expanding the range is a process of changing, when the degree of deterioration of the vehicle is equal to or greater than the predetermined degree, the range in which a value other than the value that maximizes the expected return is adopted from the second range to the third range, which is toward the expanding side.

4. A vehicle control system comprising the execution device and the storage device according to any one of claims 1 to 3, wherein
the execution device includes a first execution device mounted on the vehicle and a second execution device separate from in-vehicle devices,
the first execution device executes at least the state acquisition process and the operation process, and
the second execution device executes at least the update process.

5. A vehicle control device comprising the first execution device according to claim 4.

6. A vehicle learning device comprising the second execution device according to claim 4.

7. A vehicle learning method comprising causing a computer to execute the state acquisition process, the operation process, the reward calculation process, the update process, the deterioration variable acquisition process, and the change process according to any one of claims 1 to 3.