JP2022532972A

JP2022532972A - Unmanned vehicle lane change decision method and system based on adversarial imitation learning

Info

Publication number: JP2022532972A
Application number: JP2021541153A
Authority: JP
Inventors: 科 ▲チー▼; 立生范
Original assignee: ▲広▼州大学
Priority date: 2020-04-24
Filing date: 2020-09-17
Publication date: 2022-07-21
Anticipated expiration: 2040-09-17
Also published as: CN111483468B; JP7287707B2; WO2021212728A1; CN111483468A

Abstract

The present invention discloses an unmanned vehicle lane change decision method and system based on adversarial imitation learning. The unmanned vehicle lane change decision task is first described as a partially observable Markov decision process. An adversarial imitation learning method is then used to learn from examples provided by expert driving demonstrations, yielding an unmanned vehicle lane change decision model. While the vehicle is driving autonomously, the currently acquired environmental vehicle information is supplied to the model as input parameters, and the model outputs the lane change decision result. Because the lane change policy is learned by adversarial imitation learning from expert demonstrations, a direct mapping from vehicle state to lane change decision can be established without a manually designed task reward function, which effectively improves the accuracy, robustness, and adaptability of unmanned vehicle lane change decisions under dynamic driving conditions. [Selected drawing] Fig. 2

Description

The present invention belongs to the technical field of autonomous driving of unmanned vehicles, and in particular relates to an unmanned vehicle lane change decision method and system based on adversarial imitation learning.

The development of unmanned driving helps raise the level of intelligence of road traffic and drives the transformation and upgrading of the transportation industry. An unmanned vehicle combines hardware, including various types of sensors and controllers, with software, an integrated system in which environment perception, behavior decision, and motion planning are combined with an autonomous control module.

The lane change decision is an important module of unmanned vehicle decision technology and the basis on which the subsequent motion planning module operates. In the prior art, including published patents, the unmanned vehicle lane change decision methods mainly adopted are conventional ones such as rule-based decision, dynamic-programming-based decision, and fuzzy-control-based decision. However, the driving environment of a vehicle is a complex, diverse, and highly dynamic traffic environment, so it is difficult to establish an accurate mathematical model for designing the decision method, and the robustness and adaptability of conventional lane change decision methods cannot fully meet the requirements of unmanned lane change decisions.

In recent years, the application of artificial intelligence in the field of unmanned driving has progressed rapidly, making it possible to use artificial intelligence to solve the unmanned vehicle lane change decision problem. End-to-end supervised learning and deep reinforcement learning are two relatively common approaches. Both can train a neural network model that maps perception data directly to a lane change decision output. However, end-to-end supervised learning often requires a large amount of training data, and deep reinforcement learning, whose modeling capability is weak, requires a manually designed reward function that meets the task requirements.

Taking into account both the bottlenecks of current unmanned driving technology and the shortcomings of existing lane change decision techniques, a new unmanned vehicle lane change decision method needs to be designed.

A first object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide an unmanned vehicle lane change decision method based on adversarial imitation learning. The method learns from examples provided by expert driving demonstrations and can establish a direct mapping from vehicle state to lane change decision without a manually designed task reward function, effectively improving the accuracy, robustness, and adaptability of unmanned vehicle lane change decisions under dynamic driving conditions.

A second object of the present invention is to provide an unmanned vehicle lane change decision system.

A third object of the present invention is to provide a storage medium.

A fourth object of the present invention is to provide a computing device.

The first object of the present invention is achieved by the following technical solution. An unmanned vehicle lane change decision method based on adversarial imitation learning comprises: step S1, describing the unmanned vehicle lane change decision task as a partially observable Markov decision process; step S2, learning from examples provided by expert driving demonstrations using an adversarial imitation learning method, which during training imitates expert driving performance with a learning strategy based on a variance-reduced policy gradient, to obtain an unmanned vehicle lane change decision model; and step S3, while the vehicle is driving autonomously, supplying the currently acquired environmental vehicle information to the unmanned vehicle lane change decision model as input parameters and obtaining the lane change decision result from the model.

Preferably, in step S1, describing the unmanned vehicle lane change decision task as a partially observable Markov decision process specifically comprises the following.
In step S11, the space of the state O_t is determined as [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], which covers the driving states of the own vehicle, of the vehicles ahead of and behind it in its path, and of the vehicles nearest to it in the left and right lanes. Here:
l is the lane in which the own vehicle travels, and v_0 is the traveling speed of the own vehicle;
s_f and v_f are, respectively, the distance from the nearest vehicle ahead in the own vehicle's path to the own vehicle and its speed relative to the own vehicle;
s_b and v_b are, respectively, the distance from the nearest vehicle behind in the own vehicle's path to the own vehicle and its speed relative to the own vehicle;
s_lf and v_lf are, respectively, the distance from the nearest vehicle ahead in the left lane to the own vehicle and its speed relative to the own vehicle;
s_lb and v_lb are, respectively, the distance from the nearest vehicle behind in the left lane to the own vehicle and its speed relative to the own vehicle;
s_rf and v_rf are, respectively, the distance from the nearest vehicle ahead in the right lane to the own vehicle and its speed relative to the own vehicle;
s_rb and v_rb are, respectively, the distance from the nearest vehicle behind in the right lane to the own vehicle and its speed relative to the own vehicle.
In step S12, the space of the action A_t is determined, comprising: lane change to the left; lane change to the right; lane keeping with speed maintained; lane keeping with acceleration; and lane keeping with deceleration.
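The state and action spaces of steps S11 and S12 can be written down as plain data types. The following is a minimal sketch only: the class names, the integer action codes, and the `as_vector` helper are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass, astuple
from enum import IntEnum
from typing import List


class Action(IntEnum):
    """The five discrete actions of step S12 (the integer codes are an assumption)."""
    CHANGE_LEFT = 0
    CHANGE_RIGHT = 1
    KEEP_LANE_KEEP_SPEED = 2
    KEEP_LANE_ACCELERATE = 3
    KEEP_LANE_DECELERATE = 4


@dataclass(frozen=True)
class Observation:
    """The 14 components of the state O_t of step S11, in the order listed in the text."""
    l: int       # lane in which the own vehicle travels
    v_0: float   # traveling speed of the own vehicle
    s_f: float   # distance to the nearest vehicle ahead in the own path
    v_f: float   # relative speed of that vehicle
    s_b: float   # nearest vehicle behind in the own path
    v_b: float
    s_lf: float  # nearest vehicle ahead in the left lane
    v_lf: float
    s_lb: float  # nearest vehicle behind in the left lane
    v_lb: float
    s_rf: float  # nearest vehicle ahead in the right lane
    v_rf: float
    s_rb: float  # nearest vehicle behind in the right lane
    v_rb: float

    def as_vector(self) -> List[float]:
        """Flatten the observation into a numeric input vector (hypothetical helper)."""
        return [float(x) for x in astuple(self)]
```

Keeping the field order identical to the bracketed list in step S11 means the flattened vector can be fed to a decision model without a separate index table.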

More preferably, for the own vehicle:
if no vehicle is detected ahead in its path, s_f and v_f are each set to a fixed value;
if no vehicle is detected behind in its path, s_b and v_b are each set to a fixed value;
if no vehicle is detected ahead in the left lane, s_lf and v_lf are each set to a fixed value;
if no vehicle is detected behind in the left lane, s_lb and v_lb are each set to a fixed value;
if no vehicle is detected ahead in the right lane, s_rf and v_rf are each set to a fixed value;
if no vehicle is detected behind in the right lane, s_rb and v_rb are each set to a fixed value.

Further, in step S2, learning from the examples provided by expert driving demonstrations using the adversarial imitation learning method specifically comprises the following.
In step S21, data are collected on the driving behavior of an expert driver, including the state data and the action data of the expert driver's driving.
In step S22, pairs of collected vehicle state data and action data are extracted to form the data set τ = {τ_1, τ_2, τ_3, ..., τ_N} = {(O_1, A_1), (O_2, A_2), (O_3, A_3), ..., (O_N, A_N)}. τ is defined as the expert trajectory for adversarial imitation learning; τ_1 to τ_N denote the 1st to Nth data pairs; O_1 to O_N denote the 1st to Nth collected state data; A_1 to A_N denote the 1st to Nth collected action data; and N is the total number of data pairs in the training data set, corresponding to the number of samples.
In step S23, with the data set τ as input, training is performed using the adversarial imitation learning method to imitate the driving behavior of the expert driver, and the unmanned vehicle lane change decision model is obtained.
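As an illustration of step S22, the pairing of recorded states and actions into the expert trajectory τ might look like the sketch below. The function name and the representation of a state as a tuple of floats and of an action as an integer code are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# One expert pair tau_i = (O_i, A_i): a state vector and a discrete action code.
ExpertPair = Tuple[Tuple[float, ...], int]


def build_expert_trajectory(
    states: Sequence[Sequence[float]],
    actions: Sequence[int],
) -> List[ExpertPair]:
    """Pair the i-th recorded state with the i-th recorded action (hypothetical helper)."""
    if len(states) != len(actions):
        raise ValueError("each state O_i needs exactly one action A_i")
    return [(tuple(float(x) for x in o), int(a)) for o, a in zip(states, actions)]
```

The length of the returned list plays the role of N, the total number of data pairs in the training data set.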

Further, in step S23, the adversarial imitation learning, which during training imitates expert driving performance with a learning strategy based on a variance-reduced policy gradient, specifically proceeds as follows.
In step S231, initialization is performed:
the maximum number of training rounds T, the learning rate α, and the number of samples N are set;
the unmanned vehicle surrogate policy π_θ is initialized, and its weight parameter is initialized to θ_0;
the weight parameter of the adversarial network discriminator D_φ is initialized, where φ_0 is the initial weight parameter of D_φ;
the current state vector O and the current action vector A of the unmanned vehicle are acquired.
In step S232, steps S233 to S239 are executed for each training round t (0 ≤ t ≤ T).
In step S233, N Gaussian vectors with mean 0 and variance v are randomly sampled, δ_t = {δ_1, δ_2, ..., δ_N}, where δ_1 to δ_N are the 1st to Nth Gaussian vectors and δ_t is the vector formed by combining the N Gaussian vectors.
In step S234, for the current training round t, the mean variance of the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ is calculated [equation not reproduced].
In step S235, the mean value μ of the current state vector O of the unmanned vehicle is calculated.
In step S236, for each k (k ∈ {1, 2, ..., N}), the random surrogate policy π_{t,(k)} is calculated using the variance reduction method [equation not reproduced], where δ_k is the kth Gaussian vector obtained in step S233.
In step S237, with the current state vector O of the unmanned vehicle as input, the random surrogate policies π_{t,(k)} (k = 1, 2, ..., N) are applied to generate sample trajectories [symbols not reproduced]. The 1st to Nth sample trajectories are those generated from O by π_{t,(k)} as k runs from 1 to N, and each contains the corresponding action data.
In step S238, the weight parameter φ_t of the adversarial network discriminator D_φ is updated. φ_t is learned and updated using a least-squares loss function; that is, sample trajectories that lie far from the expert trajectory on either side of the decision boundary are penalized by the least-squares loss. The loss function [equation not reproduced] involves π_E and π_θ, which are the expert policy and the unmanned vehicle surrogate policy respectively, together with an entropy regularization term for the expert policy and an entropy regularization term for the unmanned vehicle surrogate policy.
In step S239, the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ is updated. Until the current training round t reaches the maximum number of training rounds T, θ_t is updated using the variance-reduced policy gradient method to obtain the updated weight parameter θ_{t+1}.
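The equations for steps S234, S236 and S238 are not available in this text, so the sketch below covers only two ingredients that the prose names explicitly: drawing N zero-mean Gaussian vectors with variance v (step S233) and a least-squares discriminator objective (step S238). The target values 1 for expert pairs and 0 for generated pairs, the omission of the entropy regularization terms, and all function names are assumptions in the style of a generic least-squares adversarial loss; this is not the patent's own formula.

```python
import random
from typing import List, Sequence


def sample_gaussian_vectors(
    n: int, dim: int, variance: float, rng: random.Random
) -> List[List[float]]:
    """Draw n vectors of length dim with i.i.d. N(0, variance) entries (step S233)."""
    std = variance ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]


def least_squares_discriminator_loss(
    expert_scores: Sequence[float],
    sample_scores: Sequence[float],
    expert_target: float = 1.0,   # assumed label for expert pairs
    sample_target: float = 0.0,   # assumed label for generated pairs
) -> float:
    """Mean squared distance of discriminator outputs from their targets.

    Scores far from their target on either side are penalized quadratically.
    Entropy regularization terms mentioned in the text are deliberately left out.
    """
    expert_term = sum((d - expert_target) ** 2 for d in expert_scores) / len(expert_scores)
    sample_term = sum((d - sample_target) ** 2 for d in sample_scores) / len(sample_scores)
    return 0.5 * (expert_term + sample_term)
```

A quadratic penalty grows with the distance from the target, which matches the prose description of penalizing sample trajectories that lie far from the expert trajectory.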

Further, in step S239, updating the weight parameter θ_t of the surrogate policy π_θ using the variance-reduced policy gradient method specifically comprises:
step S2391, calculating, for each random surrogate policy π_{t,(k)} (k ∈ {1, 2, ..., N}), the reward function [equation not reproduced], which includes an entropy regularization term; and
step S2392, updating the parameter θ_t of the unmanned vehicle surrogate policy π_θ according to the update rule [equation not reproduced].
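Because the update rule of step S2392 is not available in this text, the following sketch shows one common way a variance-reduced update from perturbed policies can be formed: a perturbation-based gradient estimate in which the mean reward is subtracted as a baseline. The estimator, the scaling by N·v, and every name here are assumptions for illustration and may differ from the patent's actual formula.

```python
from typing import List, Sequence


def update_policy_parameters(
    theta: Sequence[float],
    deltas: Sequence[Sequence[float]],   # delta_k, one perturbation per sampled policy
    rewards: Sequence[float],            # reward of each perturbed policy pi_{t,(k)}
    learning_rate: float,                # alpha
    variance: float,                     # v, the variance of the Gaussian perturbations
) -> List[float]:
    """Return theta_{t+1} from theta_t using a baseline-subtracted estimate (assumed form)."""
    n = len(deltas)
    if n == 0 or n != len(rewards):
        raise ValueError("need one reward per perturbation")
    baseline = sum(rewards) / n          # subtracting the mean reward lowers estimator variance
    new_theta = [float(x) for x in theta]
    for delta, reward in zip(deltas, rewards):
        advantage = reward - baseline
        for i, d in enumerate(delta):
            new_theta[i] += learning_rate * advantage * d / (n * variance)
    return new_theta
```

With this form, perturbations that earned more than the average reward pull θ toward themselves and the others push it away; if all rewards are equal, θ is left unchanged.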

Further, in step S3, obtaining the unmanned vehicle lane change decision result from the unmanned vehicle lane change decision model specifically comprises the following.
In step S31, the current environmental vehicle information of the unmanned vehicle, including the unmanned vehicle state data, is acquired.
In step S32, the input state of the unmanned vehicle lane change decision model is assigned values based on the state data of the unmanned vehicle.
In step S33, the lane change decision result is obtained from the unmanned vehicle lane change decision model.
In step S34, it is determined whether the n most recent consecutive decision results (n being a constant) are all lane changes in the same direction. If not, the process proceeds to step S35; if so, it proceeds to step S36.
In step S35, it is determined whether the current decision result is a lane change.
If not, the current driving action of the unmanned vehicle is controlled according to the current decision result, that is, the vehicle is controlled to keep its current lane while accelerating, decelerating, or maintaining speed, and the process returns to step S31.
If so, the unmanned vehicle maintains the driving state it had before the current decision result, and the process returns to step S31.
In step S36, the lane change is performed according to the decision result, and at the same time the presence of an emergency is monitored during the lane change. If an emergency is detected, the vehicle exits the unmanned driving state and manual intervention takes over; otherwise the lane change is completed according to the decision result, and the process returns to step S31.
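The confirmation logic of steps S34 to S36 can be sketched as a small filter that only releases a lane change after n identical consecutive lane change decisions. This is an illustrative sketch: the string labels for decisions, the assumed initial driving state, and the choice to clear the history after an executed lane change are assumptions, and the emergency monitoring of step S36 is left out.

```python
from collections import deque
from typing import Deque

LANE_CHANGES = {"left", "right"}  # assumed labels for the two lane change decisions


class LaneChangeFilter:
    """Release a lane change only after n consecutive identical lane change decisions."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be a positive constant")
        self._n = n
        self._recent: Deque[str] = deque(maxlen=n)
        self._last_executed = "keep_speed"  # assumed initial driving state

    def step(self, decision: str) -> str:
        """Take the model's decision for this cycle and return the action to execute."""
        self._recent.append(decision)
        confirmed = (
            len(self._recent) == self._n
            and decision in LANE_CHANGES
            and all(d == decision for d in self._recent)
        )
        if confirmed:                     # step S36: perform the lane change
            self._recent.clear()          # assumption: start counting afresh afterwards
            return decision
        if decision in LANE_CHANGES:      # step S35, YES: keep the previous driving state
            return self._last_executed
        self._last_executed = decision    # step S35, NO: keep lane, apply speed action
        return decision
```

For example, with n = 3 the decision sequence accelerate, left, left, left yields accelerate, accelerate, accelerate, left: the first two "left" decisions are held back until a third consecutive one confirms them.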

The second object of the present invention is achieved by the following technical solution. An unmanned vehicle lane change decision system comprises: a task description module that describes the unmanned vehicle lane change decision task as a partially observable Markov decision process; a lane change decision model construction module that learns from examples provided by expert driving demonstrations using an adversarial imitation learning method, which during training imitates expert driving performance with a learning strategy based on a variance-reduced policy gradient, to obtain the unmanned vehicle lane change decision model; an environmental vehicle information acquisition module that acquires the current environmental vehicle information while the vehicle is driving autonomously; and a lane change decision module that supplies the currently acquired environmental vehicle information to the unmanned vehicle lane change decision model as input parameters and obtains the vehicle's lane change decision result from the model.

The third object of the present invention is achieved by the following technical solution. A storage medium stores a program which, when executed by a processor, implements the unmanned vehicle lane change decision method based on adversarial imitation learning described in Example 1.

The fourth object of the present invention is achieved by the following technical solution. A computing device comprises a processor and a memory for storing a program executable by the processor; when the processor executes the program stored in the memory, it implements the unmanned vehicle lane change decision method based on adversarial imitation learning described in Example 1.

The present invention has the following advantages and effects over the prior art.
(1) The unmanned vehicle lane change decision method of the present invention first describes the unmanned vehicle lane change decision task as a partially observable Markov decision process, then uses an adversarial imitation learning method to learn from examples provided by expert driving demonstrations and obtain an unmanned vehicle lane change decision model. While the vehicle is driving autonomously, the currently acquired environmental vehicle information is supplied to the model as input parameters, and the model outputs the lane change decision result. Because the lane change policy is learned by adversarial imitation learning from expert demonstrations, a direct mapping from vehicle state to lane change decision can be established without a manually designed task reward function, which effectively improves the accuracy, robustness, and adaptability of unmanned vehicle lane change decisions under dynamic driving conditions.

(2) In the unmanned vehicle lane change decision method of the present invention, the adversarial imitation learning method imitates expert driving performance with a learning strategy based on a variance-reduced policy gradient, which further improves the accuracy of the lane change decision. In addition, during the decision process, a lane change is performed only when the model returns a lane change decision several times in succession. This further safeguards the correctness of the decision result and the safety of the lane change.

(3) In the unmanned vehicle lane change decision method of the present invention, the presence of an emergency is monitored in real time while the unmanned vehicle changes lanes according to the decision result. If an emergency occurs, the vehicle exits the unmanned driving state and manual intervention takes over, which safeguards driving safety, protects the lives of the vehicle occupants, and avoids traffic accidents as far as possible.

FIG. 1 is a flowchart of the offline training based on adversarial imitation learning in the method of the present invention.
FIG. 2 is a flowchart of the unmanned vehicle lane change decision in the method of the present invention.

Hereinafter, the present invention is described in more detail with reference to examples and drawings, but the embodiments of the present invention are not limited to these.

(Example 1)
This example discloses an unmanned vehicle lane change decision method based on adversarial imitation learning, by which an unmanned vehicle can change lanes accurately and safely. The method comprises the following steps.

In step S1, the unmanned vehicle lane change decision task is described as a partially observable Markov decision process.

In this example, describing the unmanned vehicle lane change decision task as a partially observable Markov decision process specifically comprises the following.
In step S11, the space of the state O_t is determined as [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], which covers the driving states of the own vehicle, of the vehicles ahead of and behind it in its path, and of the vehicles nearest to it in the left and right lanes.
Here, l is the lane in which the own vehicle travels, and v_0 is the traveling speed of the own vehicle. In this example, v_0 is measured by the vehicle speed sensor of the own vehicle. s_f and v_f are, respectively, the distance from the nearest vehicle ahead in the own vehicle's path to the own vehicle and its speed relative to the own vehicle; s_b and v_b are the corresponding distance and relative speed for the nearest vehicle behind in the own vehicle's path; s_lf and v_lf, for the nearest vehicle ahead in the left lane; s_lb and v_lb, for the nearest vehicle behind in the left lane; s_rf and v_rf, for the nearest vehicle ahead in the right lane; and s_rb and v_rb, for the nearest vehicle behind in the right lane.

In this example, the distances s_f, s_b, s_lf, s_lb, s_rf, and s_rb from the other vehicles to the own vehicle are measured by an image sensor or a radar sensor of the own vehicle. The relative speeds v_f, v_b, v_lf, v_lb, v_rf, and v_rb of the other vehicles with respect to the own vehicle are measured by the radar sensor of the own vehicle.

For the own vehicle, if no vehicle is detected ahead in its path, s_f and v_f are each set to a fixed value; if no vehicle is detected behind in its path, s_b and v_b are each set to a fixed value; if no vehicle is detected ahead in the left lane, s_lf and v_lf are each set to a fixed value; if no vehicle is detected behind in the left lane, s_lb and v_lb are each set to a fixed value; if no vehicle is detected ahead in the right lane, s_rf and v_rf are each set to a fixed value; and if no vehicle is detected behind in the right lane, s_rb and v_rb are each set to a fixed value.

The fixed value used for s_f, s_b, s_lf, s_lb, s_rf, and s_rb is the maximum sensing distance of the radar, for example 300 meters. The fixed value used for v_f, v_b, v_lf, v_lb, v_rf, and v_rb is the expected traveling speed of the intelligent vehicle, for example 100 km/h.
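The substitution of fixed values for undetected vehicles can be sketched as follows. The example values of 300 m and 100 km/h are taken from the text; the slot names, the dictionary layout, and the function name are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

MAX_RADAR_RANGE_M = 300.0    # example maximum radar sensing distance from the text
EXPECTED_SPEED_KMH = 100.0   # example expected traveling speed from the text

# Assumed slot names: front, back, left-front, left-back, right-front, right-back.
SLOTS = ("f", "b", "lf", "lb", "rf", "rb")


def fill_missing(
    detections: Dict[str, Optional[Tuple[float, float]]]
) -> Dict[str, Tuple[float, float]]:
    """Return (distance, relative speed) for every slot, using fixed values when undetected."""
    filled: Dict[str, Tuple[float, float]] = {}
    for slot in SLOTS:
        measured = detections.get(slot)
        if measured is None:
            # No vehicle detected in this slot: use the fixed values.
            filled[slot] = (MAX_RADAR_RANGE_M, EXPECTED_SPEED_KMH)
        else:
            filled[slot] = measured
    return filled
```

Filling every slot keeps the state vector at a constant length of 14 regardless of how many surrounding vehicles are actually present.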

The own vehicle referred to above is the unmanned vehicle itself.

In step S12, the action space A_t is determined, comprising: a first type, lane change to the left; a second type, lane change to the right; a third type, lane keeping with speed maintained; a fourth type, lane keeping with acceleration; and a fifth type, lane keeping with deceleration.

In step S2, offline training is performed on the examples provided by expert driving demonstrations using the adversarial imitation learning method, and the unmanned vehicle lane change decision model is obtained. During training, the adversarial imitation learning method imitates expert driving performance with a learning strategy based on a variance-reduced policy gradient. As shown in FIG. 1, the specific process is as follows.

In step S21, data is collected on the vehicle driving behaviour of expert drivers, including the state data and the action data of the expert driver's driving. Each item of state data contains the data of the state space O_t, [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], that is, the driving state of the own vehicle driven by the expert driver, of the vehicles ahead of and behind it in its own lane, and of the vehicles nearest to it in the left and right lanes. The action data corresponds to the data of the action space A_t; each item of collected action data is one of: lane change to the left, lane change to the right, keep lane and keep speed, keep lane and accelerate, and keep lane and decelerate.

In step S22, the pairs of collected vehicle state data and action data are extracted to form the data set τ = {τ_1, τ_2, τ_3, ..., τ_N} = {(O_1, A_1), (O_2, A_2), (O_3, A_3), ..., (O_N, A_N)}. τ is defined as the expert trajectory for adversarial imitation learning; τ_1 to τ_N denote the 1st to N-th data pairs, O_1 to O_N denote the 1st to N-th collected state data, and A_1 to A_N denote the 1st to N-th collected action data. N is the total number of data pairs in the training data set and corresponds to the number of samplings. In this example, the number of samplings is set to N = 10^5.
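The pairing in step S22 can be illustrated with a short sketch. It assumes, for illustration only, that each state is a 14-element sequence in the order of O_t and that each action is encoded as an integer 1 to 5 matching the five action types; the name `build_expert_trajectory` is hypothetical.

```python
from typing import List, Sequence, Tuple

# One expert pair (O_i, A_i): a 14-element state and an action code 1..5.
ExpertPair = Tuple[Tuple[float, ...], int]

STATE_DIM = 14      # length of [l, v_0, s_f, v_f, ..., s_rb, v_rb]
NUM_ACTIONS = 5     # five action types (integer encoding is an assumption)


def build_expert_trajectory(states: Sequence[Sequence[float]],
                            actions: Sequence[int]) -> List[ExpertPair]:
    """Zip recorded states and actions into tau = [(O_1, A_1), ..., (O_N, A_N)]."""
    if len(states) != len(actions):
        raise ValueError("each state needs exactly one action")
    pairs: List[ExpertPair] = []
    for obs, act in zip(states, actions):
        if len(obs) != STATE_DIM:
            raise ValueError("state must have 14 components")
        if act not in range(1, NUM_ACTIONS + 1):
            raise ValueError("action code must be 1..5")
        pairs.append((tuple(float(x) for x in obs), int(act)))
    return pairs
```

The length of the returned list plays the role of N in the text.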

In step S23, with the data set τ as input, learning is performed using the adversarial imitation learning method to imitate the driving behaviour of the expert driver, yielding the unmanned vehicle lane change decision model. The specific process is as follows.

In step S231, initialization is performed, comprising the following.
1) Set the maximum number of learning rounds T, the learning step size α, and the number of samplings N.
In this example, T = 2000, α = 0.3, and, as stated in step S22, N = 10^5.
2) Initialize the unmanned vehicle surrogate policy π_θ using behavioural cloning, with the weight parameters of π_θ initialized to θ_0.
3) Initialize the weight parameters of the adversarial network discriminator D_φ using the Xavier method, where φ_0 denotes the initial weight parameters of D_φ.

4) While the unmanned vehicle is travelling, acquire vehicle environment information comprising the current state vector O and the current action vector A of the unmanned vehicle.
Here, the current state vector O contains the data of the state space O_t, [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], that is, the driving state of the unmanned vehicle itself, of the vehicles ahead of and behind it in its own lane, and of the vehicles nearest to it in the left and right lanes. The current action vector A corresponds to the data of the action space A_t; the currently acquired action data is one of: lane change to the left, lane change to the right, keep lane and keep speed, keep lane and accelerate, and keep lane and decelerate.

This unmanned vehicle is the unmanned vehicle that makes the lane change decision in step S3.
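The Xavier initialization named in step S231 can be sketched as below. The text only names the method; this sketch assumes the common uniform variant, which draws each weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). The layer sizes in the usage note are hypothetical, since the source does not specify the discriminator architecture.

```python
import math
import random
from typing import List


def xavier_uniform(fan_in: int, fan_out: int, rng: random.Random) -> List[List[float]]:
    """Return a fan_in x fan_out weight matrix drawn from U(-a, a),
    where a = sqrt(6 / (fan_in + fan_out)) (Xavier/Glorot uniform)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

As a usage example, a first discriminator layer taking a 14-element state concatenated with a 5-element one-hot action (19 inputs, an assumed encoding) and producing 64 hidden units would be initialized with `xavier_uniform(19, 64, random.Random(0))`.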

In step S232, steps S233 to S239 are executed for each learning round t (0 ≤ t ≤ T).

In step S233, random sampling is performed to generate N Gaussian vectors δ_t = {δ_1, δ_2, ..., δ_N} with mean 0 and variance v, where δ_1 to δ_N are the 1st to N-th Gaussian vectors and δ_t is the vector formed by combining the N Gaussian vectors. In this example, v is a constant, given as 0.3 to 0 [sic].
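The sampling in step S233 can be sketched as follows, taking the variance to be the constant v (as Appendix 5 states). The function name and the choice of perturbation dimension are assumptions; in practice the dimension would match the number of policy weights. Note that the standard deviation passed to the sampler is sqrt(v), not v.

```python
import math
import random
from typing import List


def sample_gaussian_vectors(n: int, dim: int, variance: float,
                            rng: random.Random) -> List[List[float]]:
    """Draw n independent Gaussian vectors of length dim,
    each component with mean 0 and the given variance."""
    sigma = math.sqrt(variance)  # random.gauss expects a standard deviation
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n)]
```

For example, `sample_gaussian_vectors(100, 8, 0.3, random.Random(0))` returns 100 vectors δ_1 to δ_100 of length 8.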

In step S234, at the current learning round t, the mean variance of the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ is calculated: [formula not reproduced].

In step S235, the mean value μ of the current state vector O of the unmanned vehicle is calculated.

In step S236, for each k (k ∈ {1, 2, ..., N}), a random surrogate policy π_{t,(k)} is calculated using the variance reduction method: [formula not reproduced], where δ_k is the k-th Gaussian vector obtained in step S233.
In this step, from δ_k = δ_1, δ_2, ..., δ_N, the N random surrogate policies π_{t,(1)}, π_{t,(2)}, π_{t,(3)}, ..., π_{t,(N)} are obtained.

In step S237, with the current state vector O of the unmanned vehicle as input, the random surrogate policies π_{t,(k)} (k = 1, 2, ..., N) are applied to generate sample trajectories: [formula not reproduced].
In this step, with the current state vector O as input, the N random surrogate policies π_{t,(1)}, π_{t,(2)}, π_{t,(3)}, ..., π_{t,(N)} are each applied, generating the corresponding sample trajectories: [formula not reproduced].
Here, [symbols not reproduced] are the 1st to N-th sample trajectories generated by the random surrogate policy π_{t,(k)} with O as input and k taking the values 1 to N, and [symbols not reproduced] denote the action data in the 1st to N-th sample trajectories, respectively.

In step S238, the weight parameter φ_t of the adversarial network discriminator D_φ is updated.
The weight parameter φ_t of D_φ is learned and updated using a least-squares loss function; that is, sample trajectories that lie far from the expert trajectory on either side of the decision boundary are penalized with the least-squares loss. The loss function is: [formula not reproduced]. Here, π_E and π_θ denote the expert policy and the unmanned vehicle surrogate policy, respectively; [symbol not reproduced] is the entropy regularization of the expert policy; [symbol not reproduced] is the entropy regularization of the unmanned vehicle surrogate policy; and [symbol not reproduced] is the result calculated with the weight parameter φ_t, taking [symbol not reproduced] as input.
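Since the loss formula of step S238 is not reproduced in the text, the following is only a generic least-squares discriminator objective of the kind used in least-squares GANs, offered to illustrate why samples far from the decision boundary are penalized quadratically. The target labels (1 for expert pairs, 0 for policy samples) and the omission of the entropy regularization terms are assumptions; this should not be read as the patent's exact loss.

```python
from typing import Sequence


def least_squares_discriminator_loss(d_expert: Sequence[float],
                                     d_policy: Sequence[float],
                                     expert_target: float = 1.0,
                                     policy_target: float = 0.0) -> float:
    """Generic least-squares discriminator loss (assumed form):
    0.5 * mean((D(expert) - expert_target)^2)
    + 0.5 * mean((D(policy) - policy_target)^2).

    d_expert / d_policy are discriminator outputs on expert pairs and on
    pairs sampled from the surrogate policy; both must be non-empty.
    """
    expert_term = sum((d - expert_target) ** 2 for d in d_expert) / len(d_expert)
    policy_term = sum((d - policy_target) ** 2 for d in d_policy) / len(d_policy)
    return 0.5 * (expert_term + policy_term)
```

Because the penalty grows with the squared distance from the target, a sample that the discriminator scores far from its target contributes a larger gradient than one near the boundary, which is the behaviour the text describes.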

In step S239, the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ is updated.
Until the current learning round t reaches the maximum learning round T, the weight parameter θ_t of the surrogate policy π_θ is updated using the variance-reduced policy gradient method to obtain the updated weight parameter θ_{t+1}.

The specific process in step S239 of updating the weight parameter θ_t of the surrogate policy π_θ using the variance-reduced policy gradient method comprises: step S2391, in which, for each random surrogate policy π_{t,(k)} (k ∈ {1, 2, ..., N}), the reward function [formula not reproduced] is calculated (in the formula, [symbol not reproduced] is the entropy regularization and [symbol not reproduced] denotes the result of the discriminator's evaluation of (O, A)); and step S2392, in which the parameter θ_t of the unmanned vehicle surrogate policy π_θ is updated according to [formula not reproduced].

Through these steps, the weight parameters of the adversarial network discriminator D_φ and the parameters of the unmanned vehicle surrogate policy π_θ are updated over the learning rounds, which realizes the training of the adversarial imitation learning method and yields the unmanned vehicle lane change decision model.

In step S3, while the vehicle is driving unmanned, the currently acquired environmental vehicle information is used as the input parameters of the unmanned vehicle lane change decision model, and the lane change decision result is obtained from the model. The details, shown in FIG. 2, are as follows.

In step S31, the current environmental vehicle information of the unmanned vehicle, including its state data, is acquired. This information contains the data of the state space O_t, [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], that is, the driving state of the unmanned vehicle itself, of the vehicles ahead of and behind it in its own lane, and of the vehicles nearest to it in the left and right lanes.

In step S32, the input state of the unmanned vehicle lane change decision model is assigned from the state data of the unmanned vehicle. That is, the state data [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb] acquired in step S31 is input to the unmanned vehicle lane change decision model.

In step S33, the lane change decision result is obtained from the unmanned vehicle lane change decision model. In this example, the decision result corresponds to the content of the action space A_t and is one of: type 1, lane change to the left; type 2, lane change to the right; type 3, keep lane and keep speed; type 4, keep lane and accelerate; and type 5, keep lane and decelerate.

In step S34, it is determined whether the last n consecutive decision results are all lane changes in the same direction, that is, n consecutive lane changes to the left or n consecutive lane changes to the right. n is a constant and is set to 3 to 5. If NO, the process proceeds to step S35; if YES, it proceeds to step S36.

In step S35, it is determined whether the current decision result is a lane change.
If NO, the current driving action of the unmanned vehicle is controlled according to the current decision result; that is, the unmanned vehicle is controlled to keep travelling in its current lane while accelerating, decelerating, or keeping its speed, and the process returns to step S31. For example, if the current decision result is keep lane and accelerate, the unmanned vehicle is controlled to stay in its current lane and accelerate.

If YES, the unmanned vehicle maintains the driving state that preceded the current decision result. In this case, although the decision result is a lane change, a lane change decision has not yet been produced n times in succession, so no lane change is made at this point; the vehicle keeps the driving state from before the current decision result, including the lane and speed it had before that decision.

In step S36, the lane change is carried out according to the decision result, and at the same time it is detected whether an emergency arises during the lane change of the unmanned vehicle. If one does, the vehicle leaves the unmanned driving state and manual intervention takes place; if not, the lane change is completed according to the lane change decision result and the process returns to step S31.
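The confirmation logic of steps S34 to S36 can be sketched as a small filter over the model's successive decisions. The class and member names are hypothetical. Two choices are assumptions not stated in the text: holding the previous driving state is represented here as the keep-lane-keep-speed action, and the decision history is cleared once a lane change is confirmed. Emergency detection and manual takeover in step S36 are outside the sketch.

```python
from collections import deque
from enum import Enum


class Action(Enum):
    CHANGE_LEFT = 1
    CHANGE_RIGHT = 2
    KEEP_LANE_KEEP_SPEED = 3
    KEEP_LANE_ACCELERATE = 4
    KEEP_LANE_DECELERATE = 5


LANE_CHANGES = {Action.CHANGE_LEFT, Action.CHANGE_RIGHT}


class LaneChangeFilter:
    """Execute a lane change only after n consecutive identical
    lane change decisions (n is 3 to 5 in the text)."""

    def __init__(self, n: int = 3) -> None:
        self.n = n
        self.history = deque(maxlen=n)  # the last n decision results

    def step(self, decision: Action) -> Action:
        """Return the action to execute for the newest decision."""
        self.history.append(decision)
        # S34: are the last n decisions all the same lane change?
        confirmed = (len(self.history) == self.n
                     and decision in LANE_CHANGES
                     and all(d == decision for d in self.history))
        if confirmed:
            self.history.clear()  # assumption: start counting afresh
            return decision       # S36: carry out the lane change
        if decision in LANE_CHANGES:
            # S35, YES branch: not yet confirmed, keep lane and speed.
            return Action.KEEP_LANE_KEEP_SPEED
        # S35, NO branch: follow the keep-lane decision as given.
        return decision
```

With n = 3, three successive `CHANGE_LEFT` decisions produce hold, hold, and then the lane change; a single `CHANGE_RIGHT` in between restarts the count for either direction.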

(Example 2)
This example discloses an unmanned vehicle lane change decision system for implementing the adversarial-imitation-learning-based unmanned vehicle lane change decision method of Example 1. The system comprises: a task description module that describes the unmanned vehicle lane change decision task as a partially observable Markov decision process; a lane change decision model construction module that learns from the examples provided by expert driving demonstrations, using an adversarial imitation learning method that during learning imitates expert driving performance with a learning strategy based on the variance-reduced policy gradient, and thereby obtains the unmanned vehicle lane change decision model; an environmental vehicle information acquisition module that acquires the current environmental vehicle information while the vehicle is driving unmanned; and a lane change decision module that uses the currently acquired environmental vehicle information as the input parameters of the unmanned vehicle lane change decision model and obtains the lane change decision result from the model.

Further, in this example, the task description module comprises: a state space determination module that determines the state space O_t, [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], containing the driving state of the own vehicle, of the vehicles ahead of and behind it in its own lane, and of the vehicles nearest to it in the left and right lanes; and an action space determination module that determines the action space A_t, comprising type 1, lane change to the left; type 2, lane change to the right; type 3, keep lane and keep speed; type 4, keep lane and accelerate; and type 5, keep lane and decelerate.

Further, in this example, the lane change decision model construction module comprises: a first data collection module that collects data on the vehicle driving behaviour of expert drivers, including the state data and action data of the expert driver's driving; an expert trajectory generation module that extracts the pairs of collected vehicle state data and action data to form the data set τ = {τ_1, τ_2, τ_3, ..., τ_N} = {(O_1, A_1), (O_2, A_2), (O_3, A_3), ..., (O_N, A_N)} (τ is defined as the expert trajectory for adversarial imitation learning; τ_1 to τ_N denote the 1st to N-th data pairs, O_1 to O_N denote the 1st to N-th collected state data, and A_1 to A_N denote the 1st to N-th collected action data); and a learning module that, with the data set τ as input, learns using the adversarial imitation learning method, imitates the driving behaviour of the expert driver, and obtains the unmanned vehicle lane change decision model. The specific learning process is as described in steps S231 to S239 of Example 1.

The unmanned vehicle lane change decision system of this example corresponds to the unmanned vehicle lane change decision method of Example 1, so the specific implementation of each module can be found in Example 1 and is not repeated here. The division into functional blocks given for the apparatus of this example is only illustrative; in practical applications the functions may be allocated to different functional blocks as needed, that is, the internal structure may be divided into different functional blocks to accomplish all or part of the functions described above. Those skilled in the art will recognize that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be regarded as departing from the scope of the present invention.

(Example 3)
This example discloses a storage medium storing a program which, when executed by a processor, implements the adversarial-imitation-learning-based unmanned vehicle lane change decision method of Example 1. That is: the unmanned vehicle lane change decision task is described as a partially observable Markov decision process; using an adversarial imitation learning method that during learning imitates expert driving performance with a learning strategy based on the variance-reduced policy gradient, the method learns from the examples provided by expert driving demonstrations and obtains the unmanned vehicle lane change decision model; and, while the vehicle is driving unmanned, the currently acquired environmental vehicle information is used as the input parameters of the model, from which the lane change decision result is obtained.

The storage medium in this example is a medium such as a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), a USB flash drive, or a removable hard disk.

(Example 4)
This example discloses a computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the adversarial-imitation-learning-based unmanned vehicle lane change decision method of Example 1. That is: the unmanned vehicle lane change decision task is described as a partially observable Markov decision process; using an adversarial imitation learning method that during learning imitates expert driving performance with a learning strategy based on the variance-reduced policy gradient, the method learns from the examples provided by expert driving demonstrations and obtains the unmanned vehicle lane change decision model; and, while the vehicle is driving unmanned, the currently acquired environmental vehicle information is used as the input parameters of the model, from which the lane change decision result is obtained.

The computing device in this example is a desktop computer, a laptop, a smartphone, a PDA handheld terminal, a tablet, or another terminal device with processor functionality.

The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to them. Any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and falls within the scope of protection of the present invention.

(Appendices)
(Appendix 1)
An unmanned vehicle lane change decision method based on adversarial imitation learning, characterized by comprising:
step S1 of describing the unmanned vehicle lane change decision task as a partially observable Markov decision process;
step S2 of learning from the examples provided by expert driving demonstrations, using an adversarial imitation learning method that during learning imitates expert driving performance with a learning strategy based on the variance-reduced policy gradient, to obtain an unmanned vehicle lane change decision model; and
step S3 of, while the vehicle is driving unmanned, using the currently acquired environmental vehicle information as the input parameters of the unmanned vehicle lane change decision model and obtaining the lane change decision result from the model.

(Appendix 2)
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 1, characterized in that describing the unmanned vehicle lane change decision task as a partially observable Markov decision process in step S1 specifically comprises:
in step S11, determining the state space O_t, [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb], containing the driving state of the own vehicle, of the vehicles ahead of and behind it in its own lane, and of the vehicles nearest to it in the left and right lanes
(where
l is the lane in which the own vehicle travels, and v_0 is the travelling speed of the own vehicle;
s_f and v_f are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle ahead in the own lane;
s_b and v_b are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle behind in the own lane;
s_lf and v_lf are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle ahead in the left lane;
s_lb and v_lb are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle behind in the left lane;
s_rf and v_rf are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle ahead in the right lane;
s_rb and v_rb are, respectively, the distance to the own vehicle and the speed relative to the own vehicle of the nearest vehicle behind in the right lane); and
in step S12, determining the action space A_t, comprising lane change to the left, lane change to the right, keep lane and keep speed, keep lane and accelerate, and keep lane and decelerate.

(Appendix 3)
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 2, characterized in that, for the own vehicle:
if no vehicle is detected ahead in its own lane, s_f and v_f are each set to fixed values;
if no vehicle is detected behind in its own lane, s_b and v_b are each set to fixed values;
if no vehicle is detected ahead in the left lane, s_lf and v_lf are each set to fixed values;
if no vehicle is detected behind in the left lane, s_lb and v_lb are each set to fixed values;
if no vehicle is detected ahead in the right lane, s_rf and v_rf are each set to fixed values; and
if no vehicle is detected behind in the right lane, s_rb and v_rb are each set to fixed values.

(Appendix 4)
In step S2, the specific process of learning from the examples provided by expert driving instruction using the adversarial imitation learning method comprises:
in step S21, collecting data on the driving behavior of an expert driver's vehicle, including the state data and action data of the expert driver's driving;
in step S22, extracting the collected pairs of vehicle state data and action data to construct a dataset τ = {τ_1, τ_2, τ_3, ..., τ_N} = {(O_1, A_1), (O_2, A_2), (O_3, A_3), ..., (O_N, A_N)}, where τ is defined as the expert trajectory for adversarial imitation learning, τ_1 to τ_N denote the 1st to Nth data pairs, O_1 to O_N denote the 1st to Nth collected state data, A_1 to A_N denote the 1st to Nth collected action data, and N is the total number of data pairs in the training dataset, corresponding to the number of samples; and
in step S23, taking the dataset τ as input, learning with the adversarial imitation learning method to imitate the driving behavior of the expert driver, and obtaining the unmanned vehicle lane change decision model.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 2.
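As an illustration of step S22, the sketch below pairs recorded states with recorded actions to form the expert trajectory τ. The type aliases and the function name `build_expert_trajectory` are assumptions made for this example; the text does not prescribe a data layout.

```python
from typing import List, Sequence, Tuple

Observation = Tuple[float, ...]   # one recorded state O_i
Action = int                      # one recorded action A_i (index into the action space)
Pair = Tuple[Observation, Action]


def build_expert_trajectory(
    states: Sequence[Sequence[float]], actions: Sequence[Action]
) -> List[Pair]:
    """Build tau = [(O_1, A_1), ..., (O_N, A_N)] from aligned recordings."""
    if len(states) != len(actions):
        raise ValueError("each state sample needs exactly one action sample")
    return [(tuple(state), action) for state, action in zip(states, actions)]
```

The length check reflects the statement that N is both the number of data pairs and the number of samples.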

(Appendix 5)
In step S23, the specific process of simulating expert driving performance based on the variance-reduced policy gradient learning strategy during adversarial imitation learning comprises:
in step S231, initialization:
setting the maximum number of learning rounds T, the learning rate α, and the number of samples N;
initializing the unmanned vehicle surrogate policy π_θ, with its weight parameter initialized to θ_0;
initializing the weight parameter of the adversarial network discriminator D_φ, where φ_0 is the initial weight parameter of D_φ; and
acquiring the current state vector O and the current action vector A of the unmanned vehicle;
in step S232, executing steps S233 to S239 for each learning round t (0 ≤ t ≤ T);
in step S233, randomly sampling N Gaussian vectors δ_t = {δ_1, δ_2, ..., δ_N} with mean 0 and variance v, where δ_1 to δ_N are the 1st to Nth Gaussian vectors and δ_t is the vector combining the N Gaussian vectors;
in step S234, in the current learning round t, generating the mean variance of the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ
[formula not reproduced]
where
[formula not reproduced]
denote the action data in the 1st to Nth sample trajectories, respectively;
in step S238, updating the weight parameter φ_t of the adversarial network discriminator D_φ:
learning and updating φ_t with a least-squares loss function, that is, using the least-squares loss to penalize sample trajectories that lie far from the expert trajectories on either side of the decision boundary, the loss function being
[formula not reproduced]
is the entropy regularization of the expert policy, and
[formula not reproduced]
is the entropy regularization of the unmanned vehicle surrogate policy; and
in step S239, updating the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ:
updating θ_t with the variance-reduced policy gradient method to obtain the updated weight parameter θ_{t+1}, until the current learning round t reaches the maximum learning round T.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 4.
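Two pieces of this procedure can be sketched in isolation: the Gaussian sampling of step S233 and a least-squares discriminator loss of the kind named in step S238. Because the formulas are not reproduced in this text, the loss below is a generic least-squares form and not the patent's expression. The targets (1 for expert samples, 0 for policy samples), the omission of the entropy regularization terms, and both function names are assumptions.

```python
import math
import random
from typing import List, Sequence


def sample_gaussian_vectors(
    n: int, dim: int, variance: float, rng: random.Random
) -> List[List[float]]:
    """Draw n vectors of length dim with i.i.d. N(0, variance) entries (step S233)."""
    std = math.sqrt(variance)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n)]


def least_squares_discriminator_loss(
    expert_scores: Sequence[float],
    policy_scores: Sequence[float],
    expert_target: float = 1.0,   # assumed label for expert samples
    policy_target: float = 0.0,   # assumed label for policy samples
) -> float:
    """Generic least-squares loss: squared distance of each score from its target.

    Scores far from their target on either side are penalized quadratically.
    Entropy regularization terms are intentionally left out of this sketch.
    """
    if not expert_scores or not policy_scores:
        raise ValueError("both score lists must be non-empty")
    expert_term = sum((s - expert_target) ** 2 for s in expert_scores) / len(expert_scores)
    policy_term = sum((s - policy_target) ** 2 for s in policy_scores) / len(policy_scores)
    return 0.5 * (expert_term + policy_term)
```

The quadratic penalty grows with distance from the target on both sides, which matches the description of penalizing samples far from the expert trajectories on either side of the decision boundary.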

(Appendix 6)
In step S239, the specific process of updating the weight parameter θ_t of the surrogate policy π_θ using the variance-reduced policy gradient method comprises:
step S2391 of computing, for each random surrogate policy π_{t,(k)} (k ∈ {1, 2, ..., N}), the incentive function
[formula not reproduced]
(where
[formula not reproduced]
is the entropy regularization); and
step S2392 of updating the parameter θ_t of the unmanned vehicle surrogate policy π_θ as
[formula not reproduced]
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 5.
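Since the update formula is not reproduced in this text, the following is only a generic perturbation-based gradient estimate with a mean baseline, a common variance reduction device. It should not be read as the patent's formula. The function name `perturbation_update`, the baseline choice, and the 1/(N·v) scaling are all assumptions.

```python
from typing import List, Sequence


def perturbation_update(
    theta: Sequence[float],
    deltas: Sequence[Sequence[float]],
    rewards: Sequence[float],
    learning_rate: float,
    variance: float,
) -> List[float]:
    """Return theta_{t+1} from N perturbations delta_k and their rewards R_k.

    Assumed estimator: g = 1/(N*v) * sum_k (R_k - mean(R)) * delta_k,
    followed by theta_{t+1} = theta_t + learning_rate * g.
    """
    n = len(deltas)
    if n == 0 or n != len(rewards):
        raise ValueError("need one reward per perturbation")
    baseline = sum(rewards) / n  # subtracting the mean lowers estimator variance
    grad = [0.0] * len(theta)
    for delta, reward in zip(deltas, rewards):
        advantage = reward - baseline
        for i, d in enumerate(delta):
            grad[i] += advantage * d
    scale = learning_rate / (n * variance)
    return [t + scale * g for t, g in zip(theta, grad)]
```

With equal rewards for every perturbed policy the baseline cancels all terms and θ is left unchanged, which is the behavior one would expect from a baseline-subtracted estimator.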

(Appendix 7)
In step S3, the specific process of obtaining the unmanned vehicle lane change decision result from the unmanned vehicle lane change decision model comprises:
in step S31, acquiring the current environmental vehicle information of the unmanned vehicle, including the state data of the unmanned vehicle;
in step S32, assigning values to the input state of the unmanned vehicle lane change decision model based on the state data of the unmanned vehicle;
in step S33, obtaining the lane change decision result from the unmanned vehicle lane change decision model;
in step S34, determining whether the decision results of n consecutive rounds (n being a constant) are all lane changes in the same direction; if NO, proceeding to step S35, and if YES, proceeding to step S36;
in step S35, determining whether the current decision result is a lane change;
if NO, controlling the current driving action of the unmanned vehicle according to the current decision result, that is, controlling the unmanned vehicle to keep its current lane while accelerating, decelerating, or maintaining speed, and returning to step S31;
if YES, having the unmanned vehicle maintain the driving state that preceded the current decision result, and returning to step S31; and
in step S36, changing lanes according to the decision result while detecting whether an emergency occurs during the lane change; if an emergency occurs, leaving the unmanned driving state for manual intervention, and otherwise completing the lane change based on the lane change decision result and returning to step S31.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to Appendix 5.
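The branching in steps S34 and S35 amounts to a small filter over the model's raw decisions. The sketch below is one possible reading. The class and member names are hypothetical, the initial driving state is assumed to be "keep lane, hold speed", the decision window is assumed to restart after a lane change, and the emergency check of step S36 is not modeled.

```python
from collections import deque
from enum import Enum


class Decision(Enum):
    CHANGE_LEFT = "change_left"
    CHANGE_RIGHT = "change_right"
    KEEP_HOLD_SPEED = "keep_hold_speed"
    KEEP_ACCELERATE = "keep_accelerate"
    KEEP_DECELERATE = "keep_decelerate"


LANE_CHANGES = {Decision.CHANGE_LEFT, Decision.CHANGE_RIGHT}


class ConsecutiveDecisionFilter:
    """Hypothetical helper mirroring steps S34 and S35 for a window of n decisions."""

    def __init__(self, n: int, initial_command: Decision = Decision.KEEP_HOLD_SPEED):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self.recent = deque(maxlen=n)             # last n raw model decisions
        self.last_keep_command = initial_command  # assumed starting driving state

    def step(self, decision: Decision) -> Decision:
        """Map one raw model decision to the command that is actually executed."""
        self.recent.append(decision)
        window_full = len(self.recent) == self.n
        same_lane_change = decision in LANE_CHANGES and all(
            d is decision for d in self.recent
        )
        if window_full and same_lane_change:
            # S34 YES -> S36: carry out the lane change (emergency handling omitted).
            self.recent.clear()  # assumption: start a fresh window afterwards
            return decision
        if decision in LANE_CHANGES:
            # S35 YES: unconfirmed lane change, so hold the previous driving state.
            return self.last_keep_command
        # S35 NO: keep the lane and accelerate, decelerate, or hold speed as decided.
        self.last_keep_command = decision
        return decision
```

Requiring n identical lane change decisions in a row suppresses isolated lane change outputs, so a single noisy decision does not trigger a maneuver.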

(Appendix 8)
An unmanned vehicle lane change decision system for implementing the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of Appendices 1 to 7, comprising:
a task description module that describes the unmanned vehicle lane change decision task as a partially observable Markov decision process;
a lane change decision model construction module that learns from the examples provided by expert driving instruction, using an adversarial imitation learning method that simulates expert driving performance during learning based on a variance-reduced policy gradient learning strategy, to obtain the unmanned vehicle lane change decision model;
an environmental vehicle information acquisition module that acquires the current environmental vehicle information while the vehicle is driving unmanned; and
a lane change decision module that takes the currently acquired environmental vehicle information as the input parameter of the unmanned vehicle lane change decision model and obtains the vehicle lane change decision result from that model.

(Appendix 9)
A storage medium storing a program, wherein the program, when executed by a processor, implements the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of Appendices 1 to 7.

(Appendix 10)
A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of Appendices 1 to 7.

Claims

An unmanned vehicle lane change decision method based on adversarial imitation learning, comprising:
step S1 of describing the unmanned vehicle lane change decision task as a partially observable Markov decision process;
step S2 of learning from the examples provided by expert driving instruction, using an adversarial imitation learning method that simulates expert driving performance during learning based on a variance-reduced policy gradient learning strategy, to obtain an unmanned vehicle lane change decision model; and
step S3 of, while the vehicle is driving unmanned, taking the currently acquired environmental vehicle information as the input parameter of the unmanned vehicle lane change decision model and obtaining the vehicle lane change decision result from that model.

In step S1, describing the unmanned vehicle lane change decision task as a partially observable Markov decision process specifically comprises:
in step S11, determining the state space [l, v_0, s_f, v_f, s_b, v_b, s_lf, v_lf, s_lb, v_lb, s_rf, v_rf, s_rb, v_rb]
(where
l is the lane in which the own vehicle travels and v_0 is the traveling speed of the own vehicle;
s_f and v_f are, respectively, the distance from the own vehicle to the nearest vehicle ahead on its path and that vehicle's speed relative to the own vehicle;
s_b and v_b are, respectively, the distance from the own vehicle to the nearest vehicle behind on its path and that vehicle's speed relative to the own vehicle;
s_lf and v_lf are, respectively, the distance from the own vehicle to the nearest vehicle ahead in the left lane and that vehicle's speed relative to the own vehicle;
s_lb and v_lb are, respectively, the distance from the own vehicle to the nearest vehicle behind in the left lane and that vehicle's speed relative to the own vehicle;
s_rf and v_rf are, respectively, the distance from the own vehicle to the nearest vehicle ahead in the right lane and that vehicle's speed relative to the own vehicle; and
s_rb and v_rb are, respectively, the distance from the own vehicle to the nearest vehicle behind in the right lane and that vehicle's speed relative to the own vehicle); and
in step S12, determining the space of the action A_t, which comprises: changing to the left lane, changing to the right lane, keeping the lane while maintaining speed, keeping the lane while accelerating, and keeping the lane while decelerating.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 1.
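For orientation, the 14-element state and the five-element action space described above can be written down as plain data types. This is a sketch only: the class names, field names, integer action codes, and the choice to encode the lane as an integer index are assumptions, not part of the claim.

```python
from dataclasses import astuple, dataclass
from enum import IntEnum
from typing import List


class LaneAction(IntEnum):
    """The five actions of step S12; the integer codes are arbitrary."""
    CHANGE_LEFT = 0
    CHANGE_RIGHT = 1
    KEEP_LANE_HOLD_SPEED = 2
    KEEP_LANE_ACCELERATE = 3
    KEEP_LANE_DECELERATE = 4


@dataclass(frozen=True)
class Observation:
    """State of step S11: own lane and speed, then (distance, relative speed) pairs."""
    lane: int      # l, assumed to be a lane index
    v0: float      # own traveling speed
    s_f: float     # nearest vehicle ahead on the own path
    v_f: float
    s_b: float     # nearest vehicle behind on the own path
    v_b: float
    s_lf: float    # nearest vehicle ahead in the left lane
    v_lf: float
    s_lb: float    # nearest vehicle behind in the left lane
    v_lb: float
    s_rf: float    # nearest vehicle ahead in the right lane
    v_rf: float
    s_rb: float    # nearest vehicle behind in the right lane
    v_rb: float

    def as_vector(self) -> List[float]:
        """Flatten to [l, v_0, s_f, v_f, ..., s_rb, v_rb] in the claimed order."""
        return [float(x) for x in astuple(self)]
```

A flat vector in a fixed order is what a policy network or discriminator would typically consume, while the named fields keep the sensor-side code readable.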

For the own vehicle:
if no vehicle is detected ahead on its path, s_f and v_f are each set to fixed values;
if no vehicle is detected behind on its path, s_b and v_b are each set to fixed values;
if no vehicle is detected ahead in the left lane, s_lf and v_lf are each set to fixed values;
if no vehicle is detected behind in the left lane, s_lb and v_lb are each set to fixed values;
if no vehicle is detected ahead in the right lane, s_rf and v_rf are each set to fixed values; and
if no vehicle is detected behind in the right lane, s_rb and v_rb are each set to fixed values.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 2.

In step S2, the specific process of learning from the examples provided by expert driving instruction using the adversarial imitation learning method comprises:
in step S21, collecting data on the driving behavior of an expert driver's vehicle, including the state data and action data of the expert driver's driving;
in step S22, extracting the collected pairs of vehicle state data and action data to construct a dataset τ = {τ_1, τ_2, τ_3, ..., τ_N} = {(O_1, A_1), (O_2, A_2), (O_3, A_3), ..., (O_N, A_N)}, where τ is defined as the expert trajectory for adversarial imitation learning, τ_1 to τ_N denote the 1st to Nth data pairs, O_1 to O_N denote the 1st to Nth collected state data, A_1 to A_N denote the 1st to Nth collected action data, and N is the total number of data pairs in the training dataset, corresponding to the number of samples; and
in step S23, taking the dataset τ as input, learning with the adversarial imitation learning method to imitate the driving behavior of the expert driver, and obtaining the unmanned vehicle lane change decision model.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 2.

In step S23, the specific process of simulating expert driving performance based on the variance-reduced policy gradient learning strategy during adversarial imitation learning comprises:
in step S231, initialization:
setting the maximum number of learning rounds T, the learning rate α, and the number of samples N;
initializing the unmanned vehicle surrogate policy π_θ, with its weight parameter initialized to θ_0;
initializing the weight parameter of the adversarial network discriminator D_φ, where φ_0 is the initial weight parameter of D_φ; and
acquiring the current state vector O and the current action vector A of the unmanned vehicle;
in step S232, executing steps S233 to S239 for each learning round t (0 ≤ t ≤ T);
in step S233, randomly sampling N Gaussian vectors δ_t = {δ_1, δ_2, ..., δ_N} with mean 0 and variance v, where δ_1 to δ_N are the 1st to Nth Gaussian vectors and δ_t is the vector combining the N Gaussian vectors;
in step S234, in the current learning round t, generating the mean variance of the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ
[formula not reproduced]
where
[formula not reproduced]
is the entropy regularization of the expert policy, and
[formula not reproduced]
is the entropy regularization of the unmanned vehicle surrogate policy; and
in step S239, updating the weight parameter θ_t of the unmanned vehicle surrogate policy π_θ:
updating θ_t with the variance-reduced policy gradient method to obtain the updated weight parameter θ_{t+1}, until the current learning round t reaches the maximum learning round T.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 4.

In step S239, the specific process of updating the weight parameter θ_t of the surrogate policy π_θ using the variance-reduced policy gradient method comprises:
step S2391 of computing, for each random surrogate policy π_{t,(k)} (k ∈ {1, 2, ..., N}), the incentive function
[formula not reproduced]
(where
[formula not reproduced]
is the entropy regularization); and
step S2392 of updating the parameter θ_t of the unmanned vehicle surrogate policy π_θ as
[formula not reproduced]
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 5.

In step S3, the specific process of obtaining the unmanned vehicle lane change decision result from the unmanned vehicle lane change decision model comprises:
in step S31, acquiring the current environmental vehicle information of the unmanned vehicle, including the state data of the unmanned vehicle;
in step S32, assigning values to the input state of the unmanned vehicle lane change decision model based on the state data of the unmanned vehicle;
in step S33, obtaining the lane change decision result from the unmanned vehicle lane change decision model;
in step S34, determining whether the decision results of n consecutive rounds (n being a constant) are all lane changes in the same direction; if NO, proceeding to step S35, and if YES, proceeding to step S36;
in step S35, determining whether the current decision result is a lane change;
if NO, controlling the current driving action of the unmanned vehicle according to the current decision result, that is, controlling the unmanned vehicle to keep its current lane while accelerating, decelerating, or maintaining speed, and returning to step S31;
if YES, having the unmanned vehicle maintain the driving state that preceded the current decision result, and returning to step S31; and
in step S36, changing lanes according to the decision result while detecting whether an emergency occurs during the lane change; if an emergency occurs, leaving the unmanned driving state for manual intervention, and otherwise completing the lane change based on the lane change decision result and returning to step S31.
The unmanned vehicle lane change decision method based on adversarial imitation learning according to claim 5.

An unmanned vehicle lane change decision system for implementing the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of claims 1 to 7, comprising:
a task description module that describes the unmanned vehicle lane change decision task as a partially observable Markov decision process;
a lane change decision model construction module that learns from the examples provided by expert driving instruction, using an adversarial imitation learning method that simulates expert driving performance during learning based on a variance-reduced policy gradient learning strategy, to obtain the unmanned vehicle lane change decision model;
an environmental vehicle information acquisition module that acquires the current environmental vehicle information while the vehicle is driving unmanned; and
a lane change decision module that takes the currently acquired environmental vehicle information as the input parameter of the unmanned vehicle lane change decision model and obtains the vehicle lane change decision result from that model.

A storage medium storing a program, wherein the program, when executed by a processor, implements the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of claims 1 to 7.

A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the unmanned vehicle lane change decision method based on adversarial imitation learning according to any one of claims 1 to 7.