JP7248053B2

JP7248053B2 - Control device and control method

Info

Publication number: JP7248053B2
Application number: JP2021098394A
Authority: JP
Inventors: 健人吉田
Original assignee: Meidensha Corp
Current assignee: Meidensha Corp
Priority date: 2021-06-14
Filing date: 2021-06-14
Publication date: 2023-03-29
Anticipated expiration: 2041-06-14
Also published as: JP2022190200A; WO2022264929A1

Description

The present invention relates to a control device and a control method.

In general, when vehicles such as ordinary automobiles are manufactured and sold, a test must be conducted to measure fuel consumption and exhaust gas while the vehicle is driven in a specific driving pattern (hereinafter referred to as a "mode") stipulated by the country or region, and the test results must be displayed.
A mode can be represented, for example, by a graph of the relationship between the time elapsed from the start of driving and the vehicle speed to be reached.
The vehicle speed to be reached is sometimes called the command vehicle speed, in the sense that it is a command given to the vehicle regarding the speed to be achieved.
The test for measuring fuel consumption and exhaust gas is performed by placing the vehicle on a chassis dynamometer and having an autopilot robot (Drive Robot (registered trademark)) installed in the vehicle drive the vehicle according to the mode.
A permissible error range is defined for the command vehicle speed, and the test is invalidated if the vehicle speed falls outside that range.
The control of the autopilot robot therefore requires high followability to the command vehicle speed, and the autopilot robot is controlled by a learning model trained by reinforcement learning.

Patent Literature 1, an example of the conventional technology, discloses a control device and a control method for an autopilot robot controlled by a learning model trained by reinforcement learning.

Patent Literature 2, another example of the conventional technology, discloses a learning system and a learning method in which a vehicle model is created and an operation inference learning model that controls an autopilot robot is trained by reinforcement learning.

In reinforcement learning, the controlled object is controlled by trial and error while a control method, that is, a policy, is learned so as to increase an evaluation value called a reward.
In learning continuous-valued control, trial and error is generally expressed as the control value given by the current learning state with a random perturbation added.
For reinforcement learning to progress, it is important to have experiences that increase the value of the reward obtained (hereinafter referred to as "good experiences" in this specification), and in the initial stage of learning, good experiences are expected to occur by chance through the exploration provided by the random perturbations.

Patent Literature 1: JP 2020-56737 A
Patent Literature 2: JP 2020-148593 A

However, with the conventional technology described above, good experiences in the initial stage of learning occur only by chance, and it is conceivable that no good experience is obtained.
If no good experience is obtained, learning may fail to progress, or erroneous learning, such as using only the accelerator pedal or only the brake pedal, may progress.
Erroneous learning may even damage the actual vehicle being controlled.

The present invention has been made in view of the above, and its object is to make good experiences easier to obtain during learning and to make learning progress in the initial stage of learning more efficient.

One aspect of the present invention that solves the above-described problem and achieves the object is a control device that controls a controlled object so that the state of the controlled object matches a command, the control device comprising: an operation content inference unit that outputs a control command for operating the controlled object based on a reinforcement learning algorithm; a secondary operation content inference unit that outputs a control command based on control theory; a control method determination unit that determines to adopt the control command output by the secondary operation content inference unit in the initial stage of learning and to adopt the control command output by the operation content inference unit after the initial stage of learning has passed, and that outputs the adopted control command; and an operation control unit that controls operation of the controlled object according to the adopted control command.

Alternatively, one aspect of the present invention that solves the above-described problem and achieves the object is a control device for an autopilot robot that is mounted on a vehicle and drives the vehicle, the control device controlling the autopilot robot so that the vehicle runs according to a prescribed command vehicle speed, the control device comprising: an operation content inference unit that outputs a control command for operating the vehicle based on a reinforcement learning algorithm; a secondary operation content inference unit that outputs a control command based on control theory; a control method determination unit that determines to adopt the control command output by the secondary operation content inference unit in the initial stage of learning and to adopt the control command output by the operation content inference unit after the initial stage of learning has passed, and that outputs the adopted control command; and a vehicle operation control unit that controls operation of the vehicle according to the adopted control command.

In the control device for the autopilot robot configured as described above, whether learning is in the initial stage or has left the initial stage is preferably determined based on the correlation between operation content sequences, and when that correlation becomes large, it is preferably determined that learning has progressed so that the behavior of the operation content inference unit and the behavior of the secondary operation content inference unit are similar.

In the control device for the autopilot robot configured as described above, pre-learning using simulation with a test apparatus model is preferably performed before learning in the actual test apparatus environment.

Alternatively, one aspect of the present invention that solves the above-described problem and achieves the object is a control method for an autopilot robot that is mounted on a vehicle and drives the vehicle, the control method controlling the autopilot robot so that the vehicle runs according to a prescribed command vehicle speed, the control method comprising: outputting a control command for operating the vehicle based on a reinforcement learning algorithm; outputting a control command based on control theory; determining to adopt the control command based on control theory in the initial stage of learning and to adopt the control command based on the reinforcement learning algorithm after the initial stage of learning has passed, and outputting the adopted control command; and controlling operation of the vehicle according to the adopted control command.

According to the present invention, good experiences become easier to obtain during learning, and learning progress in the initial stage of learning can be made more efficient.

FIG. 1 is a diagram showing an outline of a test environment using a drive robot, which is an autopilot robot, according to Embodiment 1.
FIG. 2 is a functional block diagram showing the test apparatus according to Embodiment 1 and the control device for the autopilot robot according to Embodiment 1.
FIG. 3 is a functional block diagram showing the test apparatus according to Embodiment 3 and the control device for the autopilot robot according to Embodiment 3.

Embodiments for carrying out the present invention are described below with reference to the accompanying drawings.
However, the present invention is not to be construed as limited by the description of the following embodiments.

(Embodiment 1)
FIG. 1 is a diagram showing an outline of a test environment using a drive robot 11, which is the autopilot robot in this embodiment.
FIG. 2 is a functional block diagram showing the test apparatus 1 in this embodiment and the control device 2 for the autopilot robot according to this embodiment.

The test apparatus 1 includes a drive robot 11, a vehicle 12, and a chassis dynamometer 13.
The vehicle 12 is the vehicle under test, whose performance is measured, placed on the floor of the test environment, and includes drive wheels 121, a driver's seat 122, and vehicle operation pedals 123a and 123b.
The chassis dynamometer 13 is installed below the floor of the test environment and is configured to measure the characteristics of the vehicle 12 by running the vehicle 12 on chassis rollers instead of on a road.
The vehicle 12 is arranged so that the drive wheels 121, which are the front wheels of the vehicle 12, are positioned on the chassis dynamometer 13.
When the drive wheels 121 rotate, the chassis dynamometer 13 rotates in the direction opposite to the rotation of the drive wheels 121.

The drive robot 11 is a machine that includes actuators 110a and 110b, is installed in the driver's seat 122 of the vehicle 12 in place of a human driver, and performs the operations that make the vehicle 12 run.
The actuators 110a and 110b abut the vehicle operation pedals 123a and 123b, respectively.
One of the vehicle operation pedals 123a and 123b is the accelerator pedal, and the other is the brake pedal.

The drive robot 11 is controlled by the control device 2.
The control device 2 includes a learning unit 20, a drive robot control unit 21, a secondary operation content inference unit 22, and a control method determination unit 23.
The control device 2 controls the actuators 110a and 110b of the drive robot 11 so that the vehicle 12 runs according to the prescribed command vehicle speed, and thereby adjusts the opening degrees of the vehicle operation pedals 123a and 123b.
That is, by adjusting the opening degrees of the vehicle operation pedals 123a and 123b of the vehicle 12, the control device 2 controls the running of the vehicle 12 so that it follows the mode, which is the prescribed driving pattern.
Specifically, the control device 2 controls the running of the vehicle 12 so that, as time elapses from the start of running, it follows the command vehicle speed, which is the vehicle speed to be reached at each time.

The vehicle state measurement unit 14 is a measurement unit that measures the state of the vehicle 12, or an externally installed measurement unit.
An example of the state of the vehicle 12 is the operation values of the vehicle operation pedals 123a and 123b.
Examples of an externally installed measurement unit include a camera or an infrared sensor that measures the operation values of the vehicle operation pedals 123a and 123b.

The learning unit 20 includes a command vehicle speed generation unit 200, a reinforcement learning unit 201, a learning data shaping unit 202, a learning data storage unit 203, a learning data generation unit 204, and an inference data shaping unit 205, and performs learning of a vehicle model for drive robot control.

The command vehicle speed generation unit 200 generates the command vehicle speed used as input data when inference for drive robot control is performed.
The reinforcement learning unit 201 includes a reward calculation unit 2010, an operation content inference unit 2011, and a state-action-value inference unit 2012, and performs reinforcement learning for drive robot control.
The reinforcement learning unit 201 is implemented by a computing unit.
The reward calculation unit 2010 calculates the reinforcement learning reward for the state after an action.
The operation content inference unit 2011 has a first learning model and, in response to a state input, outputs a control command, which is the operation content of the drive robot 11.
The state-action-value inference unit 2012 has a second learning model and, in response to state and action inputs, calculates a Q value, which is the time-discounted expected return.

The learning data shaping unit 202 shapes the learning data used by the reinforcement learning unit 201 by converting it into an appropriate data format.
The learning data storage unit 203 stores the learning data used for reinforcement learning in the reinforcement learning unit 201.
The learning data generation unit 204 generates, from the data in the learning data storage unit 203, the learning data used for reinforcement learning in the reinforcement learning unit 201.
The inference data shaping unit 205 shapes the initial observed state and the command vehicle speed by converting them into a data format for input to the neural network of the first learning model.

The drive robot control unit 21 includes a drive state acquisition unit 211 and a vehicle operation control unit 212, and gives control commands to the drive robot 11 while observing the state of the drive robot 11.
The drive robot control unit 21 is implemented by a computing unit.
The drive state acquisition unit 211 acquires the state of each component included in the test apparatus 1, for example the state of the pedal operation detection values of the drive robot 11.
The vehicle operation control unit 212 generates a pedal operation command from the input data from the drive state acquisition unit 211 and converts it into commands for the actuators 110a and 110b of the drive robot 11, thereby controlling the operation of the vehicle 12.

The secondary operation content inference unit 22 outputs a control command, which is the operation content of the drive robot 11, based on control theory.

The control method determination unit 23 determines which of the control command from the operation content inference unit 2011 and the control command from the secondary operation content inference unit 22 to adopt, and outputs the adopted control command to the drive robot control unit 21.

In FIG. 2, the components joined by arrows are connected to each other by wire or wirelessly.

This embodiment aims to make it easier to obtain good experiences, with which learning progresses appropriately, and to make initial learning progress more efficient.
Specifically, a control method based on control theory, such as PI (Proportional-Integral) control, is used in the initial stage of learning.
The control method based on control theory outputs an accelerator pedal operation value when the vehicle speed is insufficient, and a brake pedal operation value when the vehicle speed is excessive.
In controlling the drive robot 11, followability to the command vehicle speed is a basic requirement, so driving that approaches the command vehicle speed is treated as a good experience.
However, the control method based on control theory need not be designed and tuned for highly accurate tracking; it only needs to be roughly adjusted so that some of the reward for tracking performance can be earned.
It only has to serve as a primer that makes learning progress in the initial stage of learning more efficient.
To apply this control method, an off-policy algorithm should be selected as the deep reinforcement learning algorithm.
Whether an algorithm is on-policy or off-policy is determined by the form of its derivation; an on-policy algorithm can use only experiences generated under the current learning state for learning, and using other experiences may cause learning to collapse.
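As an illustration of the control-theory-based method described above, the following is a minimal sketch of a roughly tuned PI controller that outputs an accelerator value when the vehicle is too slow and a brake value when it is too fast. The class name, the gains, the control period, and the normalized pedal range of 0 to 1 are all assumptions made for this example; the specification gives no concrete values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PIPedalController:
    """Rough PI speed tracker; all numeric defaults are hypothetical."""
    kp: float = 0.05       # proportional gain (assumed)
    ki: float = 0.01       # integral gain (assumed)
    dt: float = 0.1        # control period in seconds (assumed)
    integral: float = 0.0  # accumulated speed error

    def step(self, command_speed: float, observed_speed: float) -> Tuple[float, float]:
        """Return (accelerator, brake) operation values, each in [0, 1]."""
        error = command_speed - observed_speed
        self.integral += error * self.dt
        u = self.kp * error + self.ki * self.integral
        u = max(-1.0, min(1.0, u))  # clamp to the assumed pedal range
        # A positive output presses only the accelerator, a negative output
        # presses only the brake, so the two pedals are never used together.
        if u >= 0.0:
            return u, 0.0
        return 0.0, -u
```

Because the controller only has to earn some tracking reward, the sketch omits refinements such as anti-windup that a production controller would normally include.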

In this embodiment, the learning of the first learning model and the second learning model in the reinforcement learning unit 201 is made to progress efficiently.
In controlling the drive robot 11, the first learning model and the second learning model acquire a control law for running while following a command vehicle speed pattern such as the WLTC (Worldwide-harmonized Light vehicles Test Cycle) mode.
As described below, the operation of the control device 2 is broadly divided into an experience phase and a learning phase.

<Experience Phase>
The experience phase is the stage in which data used for learning is accumulated.
The solid lines in FIG. 2 represent the flow of the experience phase.

Here, a series of experience cycles, such as one run of the WLTC mode, is called an episode.
The first learning model and the second learning model are represented by neural networks and are randomly initialized at the start of learning.
The inference data shaping unit 205 converts the initial observed state, obtained from the drive state acquisition unit 211 of the drive robot control unit 21 and from the chassis dynamometer 13 and the vehicle state measurement unit 14 of the test apparatus 1, and the command vehicle speed, obtained from the command vehicle speed generation unit 200, into a data format for input to the neural network of the first learning model.
Using the converted initial observed state and the command vehicle speed obtained from the command vehicle speed generation unit 200, the operation content inference unit 2011 infers, with the first learning model, the operation content with which the drive robot 11 operates the accelerator pedal and the brake pedal.
Here, when exploration is to be added to the experience, a random perturbation is added to the inferred operation content in accordance with the algorithm used in the learning phase.
Meanwhile, the secondary operation content inference unit 22 also infers operation content using the initial observed state and the command vehicle speed.
Control based on control theory, typified by PI control, is implemented in the secondary operation content inference unit 22.
Accordingly, the secondary operation content inference unit 22 outputs an accelerator pedal operation value when the observed vehicle speed is below the command vehicle speed, and outputs a brake pedal operation value when the observed vehicle speed exceeds the command vehicle speed.
Such a control method does not need to be designed and tuned for highly accurate tracking; it only needs to be roughly adjusted so that some of the reward for tracking performance can be earned.
The operation content output from the operation content inference unit 2011 and the operation content output from the secondary operation content inference unit 22 are sent to the control method determination unit 23.
The control method determination unit 23 determines which of the operation content output from the operation content inference unit 2011 and the operation content output from the secondary operation content inference unit 22 to adopt, and sends the adopted operation content to the vehicle operation control unit 212 as a control command.
The secondary operation content inference unit 22 is used to obtain experiences that prime learning progress in the initial stage of learning; if it contributed strongly throughout learning, its control would be heavily reflected in the learning result.
Therefore, after learning has progressed and left the initial state, the vehicle operation control unit 212 preferably performs control according to the operation content output from the operation content inference unit 2011.
Accordingly, the control method determination unit 23 is preferably set to adopt the operation content output from the secondary operation content inference unit 22 in the initial state, from the start of learning up to a predetermined episode (for example, 10 episodes), and to adopt the operation content output from the operation content inference unit 2011 after the initial state has been left.
The control command output from the control method determination unit 23 is sent to the vehicle operation control unit 212 of the drive robot control unit 21.
There, the control command from the control method determination unit 23 is converted into commands for the actuators 110a and 110b of the drive robot 11 and transmitted, and, as necessary, exclusive processing of the accelerator pedal and the brake pedal, matching of control periods, and upsampling and downsampling of the control command values are performed.
The control command output from the vehicle operation control unit 212 of the drive robot control unit 21 is sent to the drive robot 11.
The drive robot 11 operates the accelerator pedal and the brake pedal of the vehicle 12 based on the received control command.
The operation of the drive robot 11 is observed by the drive state acquisition unit 211 of the drive robot control unit 21.
The state of the vehicle 12 is observed by the chassis dynamometer 13 and the vehicle state measurement unit 14.
The observed state of the vehicle 12 is sent to the inference data shaping unit 205 and the secondary operation content inference unit 22, and is used to generate the control command in the next episode.
The observed state of the vehicle 12 is also sent, together with the reward value obtained by the reward calculation unit 2010 of the reinforcement learning unit 201, to the learning data shaping unit 202 and accumulated in the learning data storage unit 203.
In controlling the drive robot 11, followability to the command vehicle speed is a basic requirement, so the reward is preferably designed so that driving that approaches the command vehicle speed is treated as a good experience.
The learning data storage unit 203 is preferably set so that, when it receives data exceeding its storage capacity, it discards data starting from the oldest.
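The episode-count switching rule and the oldest-first discarding of learning data described above can be sketched as follows. This is only an illustration under stated assumptions: the function and class names are hypothetical, episodes are counted from zero, and each stored record is assumed to be a (state, action, reward, next_state) tuple, which the specification does not prescribe. The value 10 is the example given in the text.

```python
from collections import deque

WARMUP_EPISODES = 10  # example value from the text; a tunable assumption


def select_command(episode_index, rl_command, control_theory_command,
                   warmup_episodes=WARMUP_EPISODES):
    """Adopt the control-theory command in the initial episodes, then the RL command."""
    if episode_index < warmup_episodes:
        return control_theory_command
    return rl_command


class LearningDataStore:
    """Bounded store that discards the oldest record once capacity is exceeded."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the opposite end when full.
        self._records = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self._records.append((state, action, reward, next_state))

    def oldest(self):
        return self._records[0]

    def __len__(self):
        return len(self._records)
```

Keeping the selector separate from the store means the records gathered under either controller end up in the same data set, which is why the text calls for an off-policy learning algorithm.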

This flow of processing is repeated for a predetermined period, for example one episode, and once the series of experiences has been accumulated as learning data, operation moves to the learning phase.
When the control-theory-based operation content of the secondary operation content inference unit 22 is used in the initial stage of learning, good experiences that earn favorable rewards are possible even in the initial stage of learning, unlike the case where that operation content is not used.

<Learning Phase>
The learning phase is the stage in which the first learning model and the second learning model are trained by reinforcement learning using the learning data accumulated in the experience phase.
The dashed lines in FIG. 2 represent the flow of the learning phase.

First, because the use of the control-theory-based operation content of the secondary operation content inference unit 22 is assumed in the initial stage of learning, an off-policy algorithm is selected as the deep reinforcement learning algorithm.
Whether an algorithm is on-policy or off-policy is determined by the form of its derivation; an on-policy algorithm can use only experiences generated under the current learning state for learning, and using other experiences may cause learning to collapse.
In this embodiment, DDPG (Deep Deterministic Policy Gradient), which is widely known as an off-policy algorithm, is adopted, and the flow of processing is briefly described.

The learning data generation unit 204 selects the data to be used for learning from the learning data storage unit 203 and outputs it to the operation content inference unit 2011 and the state-action-value inference unit 2012.

In training the first learning model of the operation content inference unit 2011, the output obtained when learning data is input to the first learning model is input, together with the learning data, to the second learning model, and the resulting state-action value output is used for error backpropagation.
This yields a learning gradient toward outputs that increase the state-action value, so the neural network can be trained with a gradient-based optimization method.
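For reference, the update of the first learning model described above corresponds to the standard deterministic policy gradient used in DDPG. The notation below is an assumption introduced for illustration and does not appear in the specification: $\mu_\theta$ denotes the first learning model, $Q_\phi$ the second learning model, and $s_i$ the $i$-th of $N$ states sampled from the learning data.

```latex
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \left. \nabla_a Q_\phi(s_i, a) \right|_{a = \mu_\theta(s_i)}
  \, \nabla_\theta \mu_\theta(s_i)
```

Ascending this gradient changes $\theta$ so that the operation content output for each sampled state receives a higher state-action value.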

In training the second learning model of the state-action-value inference unit 2012, learning is performed so as to reduce what is called the TD (Temporal Difference) error, that is, the difference between the value inferred from the current input data and the sum of the current reward and the value inferred from the next input data; as learning progresses, the value can then be inferred appropriately.
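The TD error described above can be written in the form commonly used with DDPG. The symbols are assumptions for illustration: $Q_\phi$ is the second learning model, $\mu_\theta$ the first learning model, $(s_t, a_t, r_t, s_{t+1})$ one stored experience, and $\gamma \in [0, 1)$ the discount factor implied by the "time-discounted expected return".

```latex
\delta_t = r_t + \gamma \, Q_\phi\bigl(s_{t+1}, \mu_\theta(s_{t+1})\bigr) - Q_\phi(s_t, a_t),
\qquad
L(\phi) = \frac{1}{N} \sum_{i=1}^{N} \delta_i^{2}
```

Minimizing $L(\phi)$ over a batch of $N$ stored experiences is one usual way to make the TD error small; the specification does not state which loss is actually used.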

In the experience phase, if experience is gathered with a randomly initialized first learning model, low-reward experience predominates, training of the second learning model does not progress, and as a result training of the first learning model may not progress properly either.
If learning does not progress properly, it may wrongly proceed toward using only the accelerator pedal or only the brake pedal.
Therefore, as described above, using control commands based on control theory makes it easier to obtain good experience, and training of the first learning model and the second learning model progresses efficiently.
When the learning phase ends, the process returns to the experience phase.
Note that a target network may be used to stabilize learning when training the first learning model and the second learning model.
Also, the roles and configurations of the first learning model and the second learning model may change depending on the deep reinforcement learning algorithm used.
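The description does not say how a target network would be updated. As one possibility, the following sketch shows the soft (Polyak) update commonly used with DDPG; the function name `soft_update`, the flat parameter lists, and the mixing rate `tau` are assumptions.

```python
# Sketch of a soft target-network update, a common choice with DDPG.
# Parameters are plain lists of floats here purely for illustration.

def soft_update(target_params, online_params, tau=0.01):
    """Move each target parameter a fraction tau toward the online one."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.5)
print(target)
```

Because the target parameters change slowly, the values used to form the TD target drift less between updates, which is the stabilizing effect referred to above.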

According to the present embodiment, inappropriate progress of learning can be suppressed, and learning in its initial stage can be made more efficient.

<Embodiment 2>
In the first embodiment, the control method determination unit 23 switches the operation content after a predetermined number of episodes have elapsed, but the present invention is not limited to this.
This embodiment differs from the first embodiment only in that the control method determination unit 23 accumulates the operation contents input from the operation content inference unit 2011 and from the sub-operation content inference unit 22, and at the end of each episode decides which operation content to select based on the correlation between the two operation content sequences. The other points are the same as in the first embodiment.
When the correlation between the operation content sequences becomes large, it can be determined that learning has progressed to the point where the behavior of the operation content inference unit 2011, which consists of a neural network, resembles the behavior of the sub-operation content inference unit 22, which is based on control theory.
Here, the correlation does not need to be excessively large. For example, the unit may be configured to switch to adopting the operation content of the operation content inference unit 2011 once the correlation coefficient reaches +0.3.
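The end-of-episode decision can be sketched as follows. The sketch assumes each operation content is a single scalar per time step (for example a pedal opening) and uses the Pearson correlation coefficient; the function names are hypothetical, and only the +0.3 threshold comes from the description.

```python
import math

# Sketch of the end-of-episode switching decision of Embodiment 2.
# Scalar operation contents and the Pearson coefficient are assumptions.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant sequence is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def use_learned_policy(rl_ops, sub_ops, threshold=0.3):
    """True when the learned operation content should be adopted."""
    return pearson(rl_ops, sub_ops) >= threshold


rl_ops = [0.10, 0.25, 0.40, 0.30]   # from the operation content inference unit
sub_ops = [0.12, 0.20, 0.45, 0.35]  # from the sub-operation content inference unit
print(use_learned_policy(rl_ops, sub_ops))
```

Treating a constant sequence as uncorrelated keeps the control-theory command in use when, for instance, the learned policy outputs the same value for an entire episode.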

According to this embodiment, in addition to the effects of the first embodiment, the operation content can be switched at an appropriate time.

<Embodiment 3>
This embodiment describes a form in which pre-learning is performed using simulation with the test device model 1A.
This embodiment differs from the first embodiment only in that pre-learning is performed using simulation with the test device model 1A; the rest is the same as in the first embodiment.

FIG. 3 is a functional block diagram showing the test device 1 in this embodiment and the control device 2A for the autopilot robot according to this embodiment.
The control device 2A has a test device model 1A.
The test device model 1A is an internal model that simulates the test device 1, and has a drive robot model 11A, a vehicle model 12A, and a chassis dynamometer model 13A.
The drive robot model 11A simulates the mechanical motion of the drive robot 11 of the test device 1.
The vehicle model 12A simulates the vehicle 12 of the test device 1 and outputs the behavior of the vehicle 12 in response to the behavior of the drive robot model 11A.
The chassis dynamometer model 13A simulates the operation and measurement of the chassis dynamometer of the test device 1.

The experience phase and the learning phase are repeated in the simulation environment provided by the test device model 1A, and learning in the actual test device environment using the test device 1 is performed only after pre-learning of the first learning model and the second learning model has progressed sufficiently. This makes it possible to reduce the time spent driving in real time.
In addition, sufficient pre-learning of the first learning model and the second learning model reduces the risk of the test device 1 malfunctioning due to control commands from an immature learning model.
Pre-learning with the test device model 1A is performed until sufficient driving performance is obtained in controlling the vehicle model 12A.
For example, when mode driving is assumed, pre-learning with the test device model 1A is performed until the vehicle model 12A achieves a sufficiently small vehicle speed error when driving the target mode.
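One way to express the stopping condition for pre-learning is sketched below. The description only says the vehicle speed error must be "sufficiently small", so the 2.0 km/h tolerance, the use of the maximum absolute error, and the function names are assumptions for illustration.

```python
# Sketch of a stopping criterion for pre-learning in simulation.
# The tolerance value and the error metric are illustrative assumptions.

def max_speed_error(commanded, simulated):
    """Largest absolute gap between commanded and simulated vehicle speed."""
    return max(abs(c - s) for c, s in zip(commanded, simulated))


def pretraining_done(commanded, simulated, tolerance_kmh=2.0):
    """True when the simulated run tracks the mode within the tolerance."""
    return max_speed_error(commanded, simulated) <= tolerance_kmh


commanded = [0.0, 10.0, 20.0, 30.0]  # command vehicle speed of the mode [km/h]
simulated = [0.0, 11.0, 18.5, 30.5]  # speed output by the vehicle model [km/h]
print(pretraining_done(commanded, simulated))
```

Such a check would be evaluated after each simulated run of the target mode, with learning in the actual test device environment starting once it returns true.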

According to this embodiment, pre-learning becomes possible, and in addition to the effects of the first embodiment, the time spent driving in real time can be reduced and the risk of the test device 1 malfunctioning can be lowered.

The embodiments above describe the case where the controlled object is a vehicle, the state of the controlled object is the vehicle speed, and the command is the command vehicle speed, that is, a control device for an autopilot robot that is mounted on a vehicle to drive it and that is controlled so that the vehicle travels according to a prescribed command vehicle speed. However, the present invention is not limited to this.
That is, the present invention also includes a control device that controls a controlled object so that the state of the controlled object matches a command, the control device comprising: an operation content inference unit that outputs a control command for operating the controlled object based on a reinforcement learning algorithm; a sub-operation content inference unit that outputs a control command based on control theory; a control method determination unit that decides to adopt the control command output by the sub-operation content inference unit in the initial stage of learning and the control command output by the operation content inference unit after the initial stage of learning, and that outputs the adopted control command; and an operation control unit that controls operation of the controlled object according to the adopted control command.
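The selection performed by the control method determination unit can be sketched as a small class. The sketch follows the episode-count criterion of the first embodiment; the class name, method names, and the value of `initial_episodes` are hypothetical.

```python
# Sketch of a control method determination unit that switches from the
# control-theory command to the learned command after a fixed number of
# episodes. All names and the episode count are illustrative assumptions.

class ControlMethodSelector:
    def __init__(self, initial_episodes):
        self.initial_episodes = initial_episodes  # length of the initial stage
        self.episodes_done = 0

    def end_episode(self):
        """Call once when an episode finishes."""
        self.episodes_done += 1

    def select(self, rl_command, sub_command):
        """Return the command to pass on to the operation control unit."""
        if self.episodes_done < self.initial_episodes:
            return sub_command  # initial stage: control-theory command
        return rl_command       # afterwards: reinforcement learning command


selector = ControlMethodSelector(initial_episodes=2)
chosen = []
for _ in range(3):
    chosen.append(selector.select(rl_command="rl", sub_command="sub"))
    selector.end_episode()
print(chosen)
```

A different switching rule can be substituted by changing only the condition inside `select`.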

The present invention is not limited to the embodiments described above, and also includes various modifications in which components are added to, removed from, or replaced in the configurations described above.

Reference Signs List
1 test device
11 drive robot
110a, 110b actuator
12 vehicle
121 drive wheel
122 driver's seat
123a, 123b vehicle operation pedal
13 chassis dynamometer
14 vehicle state measurement unit
1A test device model
11A drive robot model
12A vehicle model
13A chassis dynamometer model
2, 2A control device
20 learning unit
200 command vehicle speed generation unit
201 reinforcement learning unit
2010 reward calculation unit
2011 operation content inference unit
2012 state-action value inference unit
202 learning data molding unit
203 learning data storage unit
204 learning data generation unit
205 inference data molding unit
21 drive robot control unit
211 drive state acquisition unit
212 vehicle operation control unit
22 sub-operation content inference unit
23 control method determination unit

Claims

1. A control device that controls a controlled object so that a state of the controlled object matches a command, the control device comprising:
an operation content inference unit that outputs a control command for operating the controlled object based on a reinforcement learning algorithm;
a sub-operation content inference unit that outputs a control command based on control theory;
a control method determination unit that decides to adopt the control command output by the sub-operation content inference unit in an initial stage of learning, which is a predetermined period from the start of learning in an actual test device environment, and to adopt the control command output by the operation content inference unit after the initial stage of learning has ended, and that outputs the adopted control command; and
an operation control unit that controls operation of the controlled object according to the adopted control command.

2. A control device for an autopilot robot that is mounted on a vehicle to drive the vehicle, the control device controlling the autopilot robot so that the vehicle travels according to a prescribed command vehicle speed, the control device comprising:
an operation content inference unit that outputs a control command for operating the vehicle based on a reinforcement learning algorithm;
a sub-operation content inference unit that outputs a control command based on control theory;
a control method determination unit that decides to adopt the control command output by the sub-operation content inference unit in an initial stage of learning, which is a predetermined period from the start of learning in an actual test device environment, and to adopt the control command output by the operation content inference unit after the initial stage of learning has ended, and that outputs the adopted control command; and
a vehicle operation control unit that controls operation of the vehicle according to the adopted control command.

3. The control device for an autopilot robot according to claim 2, wherein whether learning is in the initial stage or has passed the initial stage is determined based on a correlation between operation content sequences, and
when the correlation between the operation content sequences increases, it is determined that learning has progressed such that the behavior of the operation content inference unit and the behavior of the sub-operation content inference unit have become similar.

4. The control device for an autopilot robot according to claim 2, wherein pre-learning using simulation with a test device model is performed before learning in the actual test device environment.

5. A control method for an autopilot robot that is mounted on a vehicle to drive the vehicle, the method controlling the autopilot robot so that the vehicle travels according to a prescribed command vehicle speed, the method comprising:
outputting a control command for operating the vehicle based on a reinforcement learning algorithm;
outputting a control command based on control theory;
deciding to adopt the control command based on control theory in an initial stage of learning, which is a predetermined period from the start of learning in an actual test device environment, and to adopt the control command based on the reinforcement learning algorithm after the initial stage of learning has ended, and outputting the adopted control command; and
controlling operation of the vehicle according to the adopted control command.