WO2019071909A1 - Automatic driving system and method based on relative-entropy deep inverse reinforcement learning - Google Patents

Automatic driving system and method based on relative-entropy deep inverse reinforcement learning (基于相对熵深度逆强化学习的自动驾驶系统及方法)

Info

Publication number
WO2019071909A1
Authority
WO
WIPO (PCT)
Prior art keywords
driving
trajectory
strategy
road information
reinforcement learning
Prior art date
Application number
PCT/CN2018/078740
Other languages
English (en)
French (fr)
Inventor
林嘉豪
章宗长
Original Assignee
苏州大学张家港工业技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学张家港工业技术研究院 filed Critical 苏州大学张家港工业技术研究院
Publication of WO2019071909A1 publication Critical patent/WO2019071909A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02: Control of position or course in two dimensions

Definitions

  • The invention relates to an automatic driving system and method based on relative-entropy deep inverse reinforcement learning, and belongs to the technical field of automatic driving.
  • In an existing automobile automatic driving system, the driving environment is recognised by a camera and an image-recognition system installed in the cab; an on-board main control computer, a GPS positioning system and path-planning software then navigate the vehicle according to pre-stored road maps and similar data, planning a reasonable driving path between the vehicle's current location and its destination and guiding the vehicle to the destination.
  • The object of the present invention is to provide an automatic driving system and method based on relative-entropy deep inverse reinforcement learning, which uses a deep neural network structure and takes the historical driving trajectory information of user drivers as input to obtain multiple driving strategies representing individual driving habits, so that automatic driving becomes personalised and intelligent.
  • An automatic driving system based on relative-entropy deep inverse reinforcement learning, comprising:
  • a client, which displays driving strategies;
  • a driving basic data collection subsystem, which collects road information;
  • a storage module, which is connected to the client and to the driving basic data collection subsystem and stores the road information collected by the driving basic data collection subsystem;
  • wherein the driving basic data collection subsystem collects road information and transmits it to the client and the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses, computes and simulates driving strategies from the historical trajectories; the storage module transmits the driving strategies to the client for the user to choose from, and the client implements automatic driving according to the road information and the driving strategy selected to match the user's individual preference.
  • The storage module includes a driving trajectory library that stores historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from the driving trajectories and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
  • The trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
  • The multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
  • The driving basic data collection subsystem includes sensors for collecting road information.
  • The invention also provides an automatic driving method based on relative-entropy deep inverse reinforcement learning, comprising the following steps:
  • S1: collecting road information and transmitting the road information to the client and the storage module;
  • S2: the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, computes and simulates multiple driving strategies from the historical trajectories, and transmits the driving strategies to the client;
  • S3: the client receives the road information and the driving strategies, and implements automatic driving according to the road information and the personalised driving strategy selected by the user.
  • The storage module includes a driving trajectory library that stores historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from driving planning and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
  • The trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
  • The multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
  • The beneficial effects of the invention are as follows: the driving basic data collection subsystem of the system collects road information in real time and transmits it to the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and simulates driving strategies from the historical driving trajectories, thereby achieving personalised and intelligent automatic driving.
  • FIG. 1 is a flow chart of the automatic driving system and method based on relative-entropy deep inverse reinforcement learning according to the present invention.
  • FIG. 2 is a schematic diagram of the Markov decision process (MDP).
  • An automatic driving system based on relative-entropy deep inverse reinforcement learning includes:
  • Client 1: displays driving strategies.
  • Driving basic data collection subsystem 2: collects road information.
  • Storage module 3: connected to the client 1 and the driving basic data collection subsystem 2, and stores the road information collected by the driving basic data collection subsystem 2.
  • The driving basic data collection subsystem 2 collects road information and transmits it to the client 1 and the storage module 3.
  • The storage module 3 receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses and computes driving strategies from the historical trajectories; the storage module 3 transmits the driving strategies to the client 1 for the user to choose from, and the client 1 receives the road information and implements automatic driving according to the individual driving strategy selected by the user.
  • The storage module 3 is a cloud.
  • The client 1 selects a driving strategy according to the user's individual preference, downloads the corresponding driving strategy from the driving strategy library 33 of the cloud 3, and then makes real-time driving decisions according to the driving strategy and the basic data, thereby achieving real-time driverless control.
  • The driving basic data collection subsystem 2 collects road information through sensors (not shown).
  • The collected information serves two purposes: it is passed to the client 1 to provide basic data for the current driving decision, and it is passed to the driving trajectory library 31 of the cloud 3, where it is stored as the user driver's historical driving trajectory data.
  • The cloud 3 includes a driving trajectory library 31 that stores historical driving trajectories, a trajectory information processing subsystem 32 that computes and simulates driving strategies from driving planning and driving habits, and a driving strategy library 33 that stores the driving strategies; the driving trajectory library 31 transmits driving trajectory data to the trajectory information processing subsystem 32, the trajectory information processing subsystem 32 computes and simulates driving strategies from the driving trajectory data and transmits them to the driving strategy library 33, and the driving strategy library 33 receives and stores the driving strategies.
  • The trajectory information processing subsystem 32 computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
  • The multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
  • The historical driving trajectories include expert historical driving trajectories and the user's historical trajectories.
  • Inverse reinforcement learning (IRL) refers to the problem in which the reward function R is unknown in a Markov decision process (MDP) whose environment is known.
  • In a standard reinforcement learning (RL) problem, the value Q(s, a) of a state-action pair (also called the cumulative action reward) is usually estimated using the known environment, a given reward function R, and the Markov property; the converged values Q(s, a) of the state-action pairs are then used to derive the policy π, and the agent can use the policy π to make decisions.
  • In reality, the reward function R is often extremely difficult to obtain, whereas some good trajectories T_N are relatively easy to collect.
  • The problem of recovering the reward function R from the good trajectories T_N is called the inverse reinforcement learning problem (IRL).
  • Relative-entropy deep inverse reinforcement learning is performed on the user's historical driving trajectory data stored in the driving trajectory library 31, recovering reward functions R for different user preferences and thereby simulating the corresponding driving policies π.
  • The relative-entropy deep inverse reinforcement learning algorithm is model-free: it does not require the state transition function T(s, a, s′) of the environment model to be known.
  • The relative-entropy inverse reinforcement learning algorithm can use importance sampling to avoid the state transition function T(s, a, s′) in its computations.
  • The automatic driving decision process of the car is a Markov decision process without a reward function (MDP/R), which can be expressed as the tuple {state space S, action space A, state transition probability T defined by the environment}, with the requirement of knowing the environment transition probability T removed.
  • The value function (cumulative reward) of the car agent can be expressed as
  • Multiple reward functions R (objectives) exist simultaneously, representing the user driver's different driving habits.
  • Let the prior probability distributions of the G reward functions be ρ_1, …, ρ_G,
  • let the reward weights be θ_1, …, θ_G,
  • and let Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) denote the parameter set of the G reward functions.
  • The MellowMax generator is defined below. MellowMax is a better-behaved operator that guarantees the estimate of the V value converges to a unique point, and it provides a principled probability-assignment mechanism and expectation-estimation method. In this embodiment, the reinforcement learning algorithm combined with MellowMax balances exploration and exploitation of the environment more reasonably during automatic driving, which ensures that by the time the reinforcement learning process converges, the automatic driving system has learned enough about the various scenarios and can evaluate the current state more soundly.
  • Reinforcement learning combined with the soft-maximisation algorithm MellowMax yields a sounder evaluation of the expected feature values of states.
  • MellowMax can also be used to obtain the probability distribution over action selection.
  • Under this soft-maximising action-selection rule, the iterative reinforcement learning process yields the feature expectation μ obtainable under the reward function formed by the current deep neural network parameters θ.
  • μ can be understood as the cumulative expectation of the features.
  • The EM algorithm is used to solve the above multi-objective inverse reinforcement learning problem with latent variables.
  • The EM algorithm consists of an E step and an M step; by iterating the E step and the M step, the likelihood estimate is driven towards its maximum.
  • E step: first compute the responsibilities, where Z is a normalisation term.
  • z_ij denotes the probability that the i-th driving trajectory belongs to driving habit (reward function) j.
  • Then compute the likelihood estimate. (The Q function Q(Θ, Θ^t) referred to here is the update objective of the EM algorithm; it should not be confused with the state-action value function Q in reinforcement learning.) The likelihood estimate is obtained after this calculation.
  • M step: choose a multi-driving-habit parameter set Θ (ρ_l and θ_l) that maximises the likelihood estimate Q(Θ, Θ^t) from the E step. Because ρ_l and θ_l are mutually independent, they can be maximised separately.
  • Completion of the gradient update marks the completion of one iterative update of relative-entropy deep inverse reinforcement learning.
  • The new deep-network reward function, with its parameters updated, generates a new policy π for the next iteration.
  • The E step and M step are iterated until the likelihood estimate Q(Θ, Θ^t) converges to its maximum.
  • The parameter set Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) obtained at this point is the set of prior distributions and weights of the multi-driving-habit reward functions that we want to solve for.
  • The driving policy π for each driving habit (reward function R) is then obtained through reinforcement learning (RL). The multiple driving strategies are output and saved in the cloud's driving strategy library, and the user can choose a personalised, intelligent driving strategy on the client.
  • The invention also provides an automatic driving method based on relative-entropy deep inverse reinforcement learning, comprising the following steps:
  • S1: collecting road information and transmitting the road information to the client and the storage module;
  • S2: the storage module receives the road information, computes and simulates multiple driving strategies from it, and transmits the driving strategies to the client;
  • S3: the client receives the road information and the driving strategies, and implements automatic driving according to the road information and the personalised driving strategy selected by the user.
  • Road information is collected in real time and transmitted to the storage module 3 and the client 1.
  • The storage module 3 receives the road information and simulates driving strategies from the historical driving trajectories, achieving personalised, intelligent automatic driving.
  • The driving strategies are computed in the cloud 3 rather than running the computation in the client 1.
  • All driving strategies have already been computed in the cloud 3.
  • The user only needs to download the driving strategy he or she needs, and the vehicle can then drive automatically in real time according to the driving strategy selected by the user and the real-time road information.
  • A large amount of road information is uploaded to the cloud 3 and stored as historical driving trajectories.
  • The stored big data of historical driving trajectories is used to update the driving strategy library; using this trajectory big data, the system achieves automatic driving that is closer to the user's needs.

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

An automatic driving system based on relative-entropy deep inverse reinforcement learning, comprising: (1) a client, which displays driving strategies; (2) a driving basic data collection subsystem, which collects road information; (3) a storage module, which is connected to the client and the driving basic data collection subsystem and stores the road information collected by the driving basic data collection subsystem. The driving basic data collection subsystem collects road information and transmits it to the client and the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses, computes and simulates driving strategies from the historical trajectories; the storage module transmits the driving strategies to the client for the user to choose from, and the client receives the road information and implements automatic driving according to the user's selection. The system uses a relative-entropy deep inverse reinforcement learning algorithm to achieve model-free automatic driving.

Description

Automatic driving system and method based on relative-entropy deep inverse reinforcement learning
This application claims priority to Chinese patent application No. 201710940590.X, filed on October 11, 2017 and entitled "Automatic driving system and method based on relative-entropy deep inverse reinforcement learning", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to an automatic driving system and method based on relative-entropy deep inverse reinforcement learning, and belongs to the technical field of automatic driving.
Background Art
With the increase in car ownership in China, road traffic congestion has become increasingly serious and the number of traffic accidents keeps rising every year. To better address this problem, it is necessary to research and develop automatic driving systems for cars. Moreover, as people's pursuit of quality of life rises, they hope to be freed from tiring driving activities, and automatic driving technology has emerged accordingly.
In an existing automobile automatic driving system, the driving environment is recognised by a camera and an image-recognition system installed in the cab; an on-board main control computer, a GPS positioning system and path-planning software then navigate the vehicle according to pre-stored road maps and similar information, planning a reasonable driving path between the vehicle's current position and its destination and guiding the vehicle to the destination.
In the above automatic driving system, the road map is pre-stored in the vehicle, so updating its data depends on the driver's manual operation and an adequate update frequency cannot be guaranteed. Even if the driver updates the data in time, the available resources may contain no up-to-date information about the roads, so the resulting data cannot reflect the current road conditions; this ultimately leads to unreasonable routes and low navigation accuracy, causing inconvenience for driving. Furthermore, most current automatic driving systems in this field still require human intervention and cannot yet achieve fully automatic driving.
Summary of the Invention
The object of the present invention is to provide an automatic driving system and method based on relative-entropy deep inverse reinforcement learning, which uses a deep neural network structure and takes the historical driving trajectory information of user drivers as input to obtain multiple driving strategies representing individual driving habits, and uses these driving strategies to carry out personalised and intelligent automatic driving.
To achieve the above object, the present invention provides the following technical solution: an automatic driving system based on relative-entropy deep inverse reinforcement learning, the system comprising:
Client: displays driving strategies;
Driving basic data collection subsystem: collects road information;
Storage module: connected to the client and the driving basic data collection subsystem, and stores the road information collected by the driving basic data collection subsystem;
wherein the driving basic data collection subsystem collects road information and transmits the road information to the client and the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses, computes and simulates driving strategies from the historical trajectories; the storage module transmits the driving strategies to the client for the user to choose from, and the client receives them and implements automatic driving according to the road information and the driving strategy selected to match the user's individual preference.
Further, the storage module includes a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from the driving trajectories and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
Further, the trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
Further, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
Further, the driving basic data collection subsystem includes sensors for collecting road information.
The present invention also provides an automatic driving method based on relative-entropy deep inverse reinforcement learning, the method comprising the following steps:
S1: collecting road information and transmitting the road information to the client and the storage module;
S2: the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, analyses, computes and simulates multiple driving strategies from the historical trajectories, and transmits the driving strategies to the client;
S3: the client receives the road information and the driving strategies, and implements automatic driving according to the road information and the personalised driving strategy selected by the user.
Further, the storage module includes a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from driving planning and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
Further, the trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
Further, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
The beneficial effects of the present invention are as follows: the driving basic data collection subsystem in the system collects road information in real time and transmits it to the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and simulates driving strategies from the historical driving trajectories, thereby achieving personalised and intelligent automatic driving.
The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly and to implement them in accordance with the contents of the description, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
FIG. 1 is a flow chart of the automatic driving system and method based on relative-entropy deep inverse reinforcement learning according to the present invention.
FIG. 2 is a schematic diagram of the Markov decision process (MDP).
Detailed Description of the Embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples are intended to illustrate the present invention but not to limit its scope.
Referring to FIG. 1, an automatic driving system based on relative-entropy deep inverse reinforcement learning according to a preferred embodiment of the present invention includes:
Client 1: displays driving strategies;
Driving basic data collection subsystem 2: collects road information;
Storage module 3: connected to the client 1 and the driving basic data collection subsystem 2, and stores the road information collected by the driving basic data collection subsystem 2;
wherein the driving basic data collection subsystem 2 collects road information and transmits the road information to the client 1 and the storage module 3; the storage module 3 receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses, computes and simulates driving strategies from the historical trajectories; the storage module 3 transmits the driving strategies to the client 1 for the user to choose from, and the client 1 receives the road information and implements automatic driving according to the individual driving strategy selected by the user. In this embodiment, the storage module 3 is a cloud.
The main function of the client 1 is to complete the human-machine interaction with the user and to provide personalised, intelligent choices of driving strategies and related services. According to the user's individual choice of driving strategy, the client 1 downloads the corresponding driving strategy from the driving strategy library 33 of the cloud 3, and then makes real-time driving decisions according to the driving strategy and the basic data, thereby realising real-time driverless control.
The driving basic data collection subsystem 2 collects road information through sensors (not shown). The collected information serves two purposes: it is passed to the client 1 to provide basic data for the current driving decision, and it is passed to the driving trajectory library 31 of the cloud 3, where it is stored as the user driver's historical driving trajectory data.
The cloud 3 includes a driving trajectory library 31 for historical driving trajectories, a trajectory information processing subsystem 32 that computes and simulates driving strategies from driving planning and driving habits, and a driving strategy library 33 that stores the driving strategies; the driving trajectory library 31 transmits driving trajectory data to the trajectory information processing subsystem 32, the trajectory information processing subsystem 32 analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library 33, and the driving strategy library 33 receives and stores the driving strategies. The trajectory information processing subsystem 32 computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm. In this embodiment, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions. The historical driving trajectories include expert historical driving trajectories and the user's historical trajectories.
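To make the division of labour between the three modules concrete, the following is a minimal, purely illustrative Python sketch of the data flow described above. All class and function names (Cloud, Client, collect_road_info, the example strategy labels) are hypothetical assumptions made for illustration; the patent does not specify any API.

from dataclasses import dataclass, field

@dataclass
class Cloud:                                                  # storage module 3 (the cloud)
    trajectory_library: list = field(default_factory=list)   # driving trajectory library 31
    strategy_library: dict = field(default_factory=dict)     # driving strategy library 33

    def process_trajectories(self):
        # Trajectory information processing subsystem 32: this is where the multi-objective
        # relative-entropy deep IRL would run (see the EM sketch later in this section).
        self.strategy_library = {"cautious": "policy_A", "sporty": "policy_B"}

@dataclass
class Client:                                                 # client 1
    def drive(self, strategy, road_info):
        print(f"executing {strategy} given {road_info}")      # real-time driving decision

def collect_road_info():                                      # data collection subsystem 2 (sensors)
    return {"lane": "keep", "speed_limit_kmh": 60}

cloud, client = Cloud(), Client()
road_info = collect_road_info()                               # collected in real time
cloud.trajectory_library.append(road_info)                    # stored towards a historical trajectory
cloud.process_trajectories()                                  # offline strategy computation in the cloud
client.drive(cloud.strategy_library["sporty"], road_info)     # user-selected personalised strategy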
Inverse reinforcement learning (IRL) refers to the problem in which the reward function R is unknown in a Markov decision process (MDP) whose environment is known. In an ordinary reinforcement learning (RL) problem, the known environment, a given reward function R and the Markov property are usually used to estimate the value Q(s, a) of a state-action pair (also called the cumulative action reward); the converged values Q(s, a) of the state-action pairs are then used to derive the policy π, and the agent can use the policy π to make decisions. In reality, the reward function R is often extremely difficult to obtain, whereas some good trajectories T_N are relatively easy to collect. In a Markov decision process without a reward function (MDP/R), the problem of recovering the reward function R from the good trajectories T_N is called the inverse reinforcement learning problem (IRL).
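For reference, the textbook definitions behind these quantities can be written as follows; this is a sketch in standard MDP notation (discount factor γ, transition kernel T), not a reproduction of the patent's formula images.

V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
Q^{\pi}(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim T(s, a, \cdot)}\!\left[ V^{\pi}(s') \right],
\qquad
\pi(s) = \arg\max_{a \in A} Q(s, a).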
In this embodiment, relative-entropy deep inverse reinforcement learning is performed on the user's historical driving trajectory data already stored in the driving trajectory library 31, recovering reward functions R for different user preferences and thereby simulating the corresponding driving policies π. The relative-entropy deep inverse reinforcement learning algorithm is a model-free algorithm: it does not require the state transition function T(s, a, s′) of the environment model to be known, because the relative-entropy inverse reinforcement learning algorithm can use importance sampling to avoid the state transition function T(s, a, s′) in its computations.
In this embodiment, the automatic driving decision process of the car is a Markov decision process without a reward function (MDP/R), which can be expressed as the tuple {state space S, action space A, state transition probability T defined by the environment} (the requirement of knowing the environment transition probability T is dropped). The value function (cumulative reward) of the car agent can be expressed as
Figure PCTCN2018078740-appb-000001
The state-action value function of the car agent can be expressed as Q(s, a) = R_θ(s, a) + γE_{T(s,a,s′)}[V(s′)]. To solve more complex real driving problems, the reward function is no longer assumed to be a simple linear combination but is instead assumed to be a deep neural network R(s, a, θ) = g_1(g_2(…(g_n(f(s, a), θ_n), …), θ_2), θ_1), where f(s, a) denotes the road feature information of driving at (s, a) and θ_i denotes the parameters of the i-th layer of the deep neural network.
At the same time, to accommodate more personalised and intelligent real driving scenarios, it is assumed that multiple reward functions R (objectives) exist simultaneously, representing the user driver's different driving habits. Suppose there are G reward functions; let their prior probability distributions be ρ_1, …, ρ_G and their reward weights be θ_1, …, θ_G, and let Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) denote the parameter set of the G reward functions.
Referring to FIG. 2, once hypothesised reward functions are available (from initialisation or from previous iterations), the problem can be described as a complete Markov decision process (MDP). Under this complete MDP, using reinforcement learning with the reward function R(s, a, θ) = g_1(g_2(…(g_n(f, θ_n), …), θ_2), θ_1), the V values and Q values can be evaluated. For the evaluation step of reinforcement learning, a new soft-maximisation method (MellowMax) is used to estimate the expected V value. The MellowMax generator is defined as:
Figure PCTCN2018078740-appb-000002
MellowMax is a better-behaved operator that guarantees the estimate of the V value converges to a unique point. MellowMax also has the following properties: a principled probability-assignment mechanism and expectation-estimation method. In this embodiment, the reinforcement learning algorithm combined with MellowMax balances exploration and exploitation of the environment more reasonably during automatic driving, which ensures that by the time the reinforcement learning process converges, the automatic driving system has learned enough about the various scenarios and can produce a sounder evaluation of the current state.
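For reference, the MellowMax operator as commonly defined in the reinforcement learning literature, together with one common Boltzmann-style way of turning it into an action distribution, is sketched below; the exact expressions in the patent's formula images may differ, and the temperature ω is a design parameter.

\mathrm{mm}_{\omega}(x_1, \dots, x_n) = \frac{1}{\omega} \log\!\left( \frac{1}{n} \sum_{i=1}^{n} e^{\omega x_i} \right),
\qquad
\pi(a \mid s) \propto e^{\omega\, Q(s, a)} \quad \text{(one common simplification).}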
In this embodiment, reinforcement learning combined with the soft-maximisation algorithm MellowMax yields a sounder evaluation of the expected feature values of states. Using MellowMax, the probability distribution over action selection can be obtained as
Figure PCTCN2018078740-appb-000003
Under this soft-maximising action-selection rule, the iterative reinforcement learning process yields the feature expectation μ obtainable under the reward function formed by the current deep neural network parameters θ. μ can be understood as the cumulative expectation of the features.
In this embodiment, the EM algorithm is used to solve the above multi-objective inverse reinforcement learning problem with latent variables. The EM algorithm consists of an E step and an M step; by iterating the E step and the M step, the maximum of the likelihood estimate is approached.
E step: first compute
Figure PCTCN2018078740-appb-000004
where Z is a normalisation term, and z_ij denotes the probability that the i-th driving trajectory belongs to driving habit (reward function) j.
Let y_i = j denote that the i-th driving trajectory belongs to driving habit j, and let the set y = (y_1, …, y_N) denote the habit memberships of the N driving trajectories.
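In standard mixture-model EM, the responsibility z_ij and the prior update typically take the following form; this is a sketch assuming the trajectory likelihood P(τ_i | θ_j) can be evaluated under reward j, and the patent's own formula images may differ in detail.

z_{ij} = \frac{\rho_j\, P(\tau_i \mid \theta_j)}{\sum_{g=1}^{G} \rho_g\, P(\tau_i \mid \theta_g)},
\qquad
\rho_l \leftarrow \frac{1}{N} \sum_{i=1}^{N} z_{il}.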
Compute the likelihood estimate
Figure PCTCN2018078740-appb-000005
(The Q function Q(Θ, Θ^t) referred to here is the update objective of the EM algorithm; note the distinction from the state-action value function Q in reinforcement learning.) After the derivation, the likelihood estimate is obtained as
Figure PCTCN2018078740-appb-000006
M step: choose a multi-driving-habit parameter set Θ (ρ_l and θ_l) that maximises the likelihood estimate Q(Θ, Θ^t) from the E step. Because ρ_l and θ_l are mutually independent, they can be maximised separately. This gives
Figure PCTCN2018078740-appb-000007
The second part is
Figure PCTCN2018078740-appb-000008
For the update objective of maximising the second part of Q(Θ, Θ^t):
Figure PCTCN2018078740-appb-000009
which can be understood as follows:
Figure PCTCN2018078740-appb-000010
is the maximum-likelihood equation obtained for the observed trajectory set
Figure PCTCN2018078740-appb-000011
conditioned on the parameters of the l-th cluster (objective) being θ_l. This maximum-likelihood equation can be solved using relative-entropy deep inverse reinforcement learning. The relative-entropy solution, while satisfying the maximum-likelihood update objective, can be applied quite naturally to back-propagation updates of the deep neural network parameters. Let the maximisation objective of the deep neural network be L(θ) = logP(D, θ | r). According to the decomposition of the joint likelihood function, L(θ) = logP(D, θ | r) = logP(D | r) + logP(θ). Taking the partial derivative of this joint likelihood objective gives
Figure PCTCN2018078740-appb-000012
The first part of this partial derivative can be further decomposed as
Figure PCTCN2018078740-appb-000013
where
Figure PCTCN2018078740-appb-000014
According to relative-entropy inverse reinforcement learning, the solution can be obtained as the difference between the feature expectation under the current reward function and the expert feature value:
Figure PCTCN2018078740-appb-000015
where, using importance sampling,
Figure PCTCN2018078740-appb-000016
where π is a given policy; sampling according to this policy π yields
Figure PCTCN2018078740-appb-000017
trajectories, in which
Figure PCTCN2018078740-appb-000018
where τ = s_1a_1, …, s_Ha_H. Further,
Figure PCTCN2018078740-appb-000019
where
Figure PCTCN2018078740-appb-000020
denotes the gradient computed by the back-propagation algorithm when updating the hidden-layer parameters of the deep neural network.
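A sketch of how this kind of gradient is commonly written in model-free relative-entropy IRL: in the linear special case (reward θᵀf(τ)) it is the importance-weighted feature-matching difference below, and for a deep reward network the same difference is propagated through the layers by back-propagation (chain rule). The exact expressions in the patent's formula images may differ.

\nabla_{\theta} L \approx \hat{\mu}_E - \frac{\sum_{\tau \in \mathcal{D}_{\pi}} w(\tau)\, f(\tau)}{\sum_{\tau \in \mathcal{D}_{\pi}} w(\tau)},
\qquad
w(\tau) = \frac{\exp\!\big(\theta^{\top} f(\tau)\big)}{\prod_{t} \pi(a_t \mid s_t)},

where \hat{\mu}_E is the empirical feature expectation of the user trajectories and \mathcal{D}_{\pi} is a set of trajectories sampled under the fixed policy π.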
Completion of the gradient update marks the completion of one iterative update of relative-entropy deep inverse reinforcement learning. The new deep-network reward function, with its parameters updated, generates a new policy π for the next iteration.
The E step and M step are computed iteratively until the likelihood estimate Q(Θ, Θ^t) converges to its maximum. The parameter set Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) obtained at this point is the set of prior distributions and weights of the multi-driving-habit reward functions that we want to solve for.
In this embodiment, based on this parameter set Θ, the driving policy π for each driving habit R is obtained through reinforcement learning (RL). The multiple driving strategies are output and saved in the cloud's driving strategy library, and the user can then choose a personalised, intelligent driving strategy on the client.
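As a rough end-to-end illustration of how the EM framework and the relative-entropy IRL update fit together, the following self-contained Python sketch clusters trajectory feature vectors into G driving habits and performs one importance-sampling gradient step per habit in each M step. It is a simplified, assumption-laden model (linear rewards over trajectory features instead of a deep network, a uniform base policy, synthetic data) and not the patented implementation.

import numpy as np

rng = np.random.default_rng(0)

def em_relative_entropy_irl(user_trajs, base_trajs, G=2, iters=50, lr=0.1):
    F = np.asarray(user_trajs, dtype=float)      # (N, d) user/expert trajectory features f(tau)
    B = np.asarray(base_trajs, dtype=float)      # (M, d) features of trajectories from a fixed base policy pi
    N, d = F.shape
    rho = np.full(G, 1.0 / G)                    # priors rho_1..rho_G over driving habits
    theta = rng.normal(scale=0.1, size=(G, d))   # reward parameters theta_1..theta_G (linear for brevity)

    for _ in range(iters):
        # E step: responsibilities z_ij proportional to rho_j * P(tau_i | theta_j),
        # with P(tau | theta) proportional to exp(theta.f(tau)); partition estimated on the base samples.
        log_Z = np.log(np.exp(B @ theta.T).mean(axis=0))    # (G,)
        log_lik = F @ theta.T - log_Z                        # (N, G)
        log_post = np.log(rho) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        z = np.exp(log_post)
        z /= z.sum(axis=1, keepdims=True)                    # (N, G)

        # M step, part 1: closed-form update of the priors.
        rho = z.mean(axis=0)

        # M step, part 2: one relative-entropy IRL gradient step per habit,
        # grad_j = sum_i z_ij * (f(tau_i) - mu_j), with mu_j importance-sampled from the base set.
        for j in range(G):
            w = np.exp(B @ theta[j])                         # importance weights (uniform base policy)
            mu_j = (w[:, None] * B).sum(axis=0) / w.sum()    # model feature expectation under theta_j
            grad = (z[:, j, None] * (F - mu_j)).sum(axis=0)
            theta[j] += lr * grad / max(z[:, j].sum(), 1e-8)
    return rho, theta

# Toy usage: two synthetic "driving habits" in a 2-D feature space (e.g. speed vs. caution).
fast = rng.normal([1.0, -0.5], 0.1, size=(30, 2))
slow = rng.normal([-0.8, 0.7], 0.1, size=(30, 2))
base = rng.normal(0.0, 1.0, size=(200, 2))
priors, rewards = em_relative_entropy_irl(np.vstack([fast, slow]), base)
print("learned habit priors:", priors)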
The present invention also provides an automatic driving method based on relative-entropy deep inverse reinforcement learning, the method comprising the following steps:
S1: collecting road information and transmitting the road information to the client and the storage module;
S2: the storage module receives the road information, analyses, computes and simulates multiple driving strategies from the road information, and transmits the driving strategies to the client;
S3: the client receives the road information and the driving strategies, and implements automatic driving according to the road information and the personalised driving strategy selected by the user.
In summary: the driving basic data collection subsystem 2 in the system collects road information in real time and transmits it to the storage module 3 and the client 1; after receiving the road information, the storage module 3 simulates driving strategies from the historical driving trajectories, achieving personalised, intelligent automatic driving.
In automatic driving based on this method, all driving strategies are computed in the cloud 3 rather than running the computation in the client 1. When the user needs automatic driving, all driving strategies have already been completed in the cloud 3. The user only needs to download the driving strategy he or she needs, and the vehicle can then drive automatically in real time according to the driving strategy selected by the user and the real-time road information. Moreover, after every drive, a large amount of road information is uploaded to the cloud 3 and stored as historical driving trajectories. The stored big data of historical driving trajectories is then used to update the driving strategy library; using this trajectory big data, the system achieves automatic driving that is closer to the user's needs.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this description.
The above embodiments express only several implementations of the present invention; their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention patent shall be defined by the appended claims.

Claims (9)

  1. An automatic driving system based on relative-entropy deep inverse reinforcement learning, characterised in that the system comprises:
    a client, which displays driving strategies;
    a driving basic data collection subsystem, which collects road information;
    a storage module, which is connected to the client and the driving basic data collection subsystem and stores the road information collected by the driving basic data collection subsystem;
    wherein the driving basic data collection subsystem collects road information and transmits the road information to the client and the storage module; the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, and analyses, computes and simulates driving strategies from the historical trajectories; the storage module transmits the driving strategies to the client for the user to choose from, and the client receives them and implements automatic driving according to the road information and the driving strategy selected to match the user's individual preference.
  2. The automatic driving system based on relative-entropy deep inverse reinforcement learning according to claim 1, characterised in that the storage module comprises a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from the driving trajectories and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
  3. The automatic driving system based on relative-entropy deep inverse reinforcement learning according to claim 2, characterised in that the trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
  4. The automatic driving system based on relative-entropy deep inverse reinforcement learning according to claim 3, characterised in that the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
  5. The personalised automatic driving system based on relative-entropy deep inverse reinforcement learning according to claim 1, characterised in that the driving basic data collection subsystem comprises sensors for collecting road information.
  6. An automatic driving method based on relative-entropy deep inverse reinforcement learning, characterised in that the method comprises the following steps:
    S1: collecting road information and transmitting the road information to a client and a storage module;
    S2: the storage module receives the road information, stores a continuous segment of road information as a historical trajectory, analyses, computes and simulates multiple driving strategies from the historical trajectories, and transmits the driving strategies to the client;
    S3: the client receives the road information and the driving strategies, and implements automatic driving according to the road information and the personalised driving strategy selected by the user.
  7. The automatic driving method based on relative-entropy deep inverse reinforcement learning according to claim 6, characterised in that the storage module comprises a driving trajectory library for storing historical driving trajectories, a trajectory information processing subsystem that computes and simulates driving strategies from driving planning and driving habits, and a driving strategy library that stores the driving strategies; the driving trajectory library transmits driving trajectory data to the trajectory information processing subsystem, the trajectory information processing subsystem analyses the driving trajectory data to compute and simulate driving strategies and transmits them to the driving strategy library, and the driving strategy library receives and stores the driving strategies.
  8. The automatic driving method based on relative-entropy deep inverse reinforcement learning according to claim 7, characterised in that the trajectory information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
  9. The automatic driving method based on relative-entropy deep inverse reinforcement learning according to claim 8, characterised in that the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning inside an EM algorithm framework to compute the parameters of the multiple reward functions.
PCT/CN2018/078740 2017-10-11 2018-03-12 基于相对熵深度逆强化学习的自动驾驶系统及方法 WO2019071909A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710940590.XA CN107544516A (zh) 2017-10-11 2017-10-11 基于相对熵深度逆强化学习的自动驾驶系统及方法
CN201710940590.X 2017-10-11

Publications (1)

Publication Number Publication Date
WO2019071909A1 true WO2019071909A1 (zh) 2019-04-18

Family

ID=60967749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/078740 WO2019071909A1 (zh) 2017-10-11 2018-03-12 基于相对熵深度逆强化学习的自动驾驶系统及方法

Country Status (2)

Country Link
CN (1) CN107544516A (zh)
WO (1) WO2019071909A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673602A (zh) * 2019-10-24 2020-01-10 驭势科技(北京)有限公司 一种强化学习模型、车辆自动驾驶决策的方法和车载设备
TWI737437B (zh) * 2020-08-07 2021-08-21 財團法人車輛研究測試中心 軌跡決定方法
WO2023083113A1 (en) * 2021-11-10 2023-05-19 International Business Machines Corporation Reinforcement learning with inductive logic programming

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678241B2 (en) * 2017-09-06 2020-06-09 GM Global Technology Operations LLC Unsupervised learning agents for autonomous driving applications
CN107544516A (zh) * 2017-10-11 2018-01-05 苏州大学 基于相对熵深度逆强化学习的自动驾驶系统及方法
CN108803609B (zh) * 2018-06-11 2020-05-01 苏州大学 基于约束在线规划的部分可观察自动驾驶决策方法
WO2020000192A1 (en) * 2018-06-26 2020-01-02 Psa Automobiles Sa Method for providing vehicle trajectory prediction
CN110654372B (zh) * 2018-06-29 2021-09-03 比亚迪股份有限公司 车辆驾驶控制方法、装置、车辆和存储介质
US10845815B2 (en) * 2018-07-27 2020-11-24 GM Global Technology Operations LLC Systems, methods and controllers for an autonomous vehicle that implement autonomous driver agents and driving policy learners for generating and improving policies based on collective driving experiences of the autonomous driver agents
CN109636432B (zh) * 2018-09-28 2023-05-30 创新先进技术有限公司 计算机执行的项目选择方法和装置
CN111159832B (zh) * 2018-10-19 2024-04-02 百度在线网络技术(北京)有限公司 交通信息流的构建方法和装置
CN110321811B (zh) * 2019-06-17 2023-05-02 中国工程物理研究院电子工程研究所 深度逆强化学习的无人机航拍视频中的目标检测方法
CN110238855B (zh) * 2019-06-24 2020-10-16 浙江大学 一种基于深度逆向强化学习的机器人乱序工件抓取方法
CN110955239B (zh) * 2019-11-12 2021-03-02 中国地质大学(武汉) 一种基于逆强化学习的无人船多目标轨迹规划方法及系统
CN110837258B (zh) * 2019-11-29 2024-03-08 商汤集团有限公司 自动驾驶控制方法及装置、系统、电子设备和存储介质
CN111026127B (zh) * 2019-12-27 2021-09-28 南京大学 基于部分可观测迁移强化学习的自动驾驶决策方法及系统
CN114194211B (zh) * 2021-11-30 2023-04-25 浪潮(北京)电子信息产业有限公司 一种自动驾驶方法、装置及电子设备和存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699717A (zh) * 2013-12-03 2014-04-02 重庆交通大学 基于前视断面选点的复杂道路汽车行驶轨迹预测方法
CN105718750A (zh) * 2016-01-29 2016-06-29 长沙理工大学 一种车辆行驶轨迹的预测方法及系统
CN107544516A (zh) * 2017-10-11 2018-01-05 苏州大学 基于相对熵深度逆强化学习的自动驾驶系统及方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014152554A1 (en) * 2013-03-15 2014-09-25 Caliper Corporation Lane-level vehicle navigation for vehicle routing and traffic management
DE112015004218B4 (de) * 2014-09-16 2019-05-23 Honda Motor Co., Ltd. Fahrassistenzvorrichtung
CN106842925B (zh) * 2017-01-20 2019-10-11 清华大学 一种基于深度强化学习的机车智能操纵方法与系统
CN107169567B (zh) * 2017-03-30 2020-04-07 深圳先进技术研究院 一种用于车辆自动驾驶的决策网络模型的生成方法及装置
CN107084735A (zh) * 2017-04-26 2017-08-22 电子科技大学 适用于减少冗余导航的导航路径框架
CN107229973B (zh) * 2017-05-12 2021-11-19 中国科学院深圳先进技术研究院 一种用于车辆自动驾驶的策略网络模型的生成方法及装置
CN107200017A (zh) * 2017-05-22 2017-09-26 北京联合大学 一种基于深度学习的无人驾驶车辆控制系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699717A (zh) * 2013-12-03 2014-04-02 重庆交通大学 基于前视断面选点的复杂道路汽车行驶轨迹预测方法
CN105718750A (zh) * 2016-01-29 2016-06-29 长沙理工大学 一种车辆行驶轨迹的预测方法及系统
CN107544516A (zh) * 2017-10-11 2018-01-05 苏州大学 基于相对熵深度逆强化学习的自动驾驶系统及方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU CHENJIE: "The Research of Apprenticeship Learning Algorithm Applied in the Unmanned Car High-Speed Driving in the Simulated Environment", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER'S THESES FULL-TEXT DATABASE, 15 June 2013 (2013-06-15), pages 19-21, 32-45, ISSN: 1674-0246 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673602A (zh) * 2019-10-24 2020-01-10 驭势科技(北京)有限公司 一种强化学习模型、车辆自动驾驶决策的方法和车载设备
CN110673602B (zh) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 一种强化学习模型、车辆自动驾驶决策的方法和车载设备
TWI737437B (zh) * 2020-08-07 2021-08-21 財團法人車輛研究測試中心 軌跡決定方法
WO2023083113A1 (en) * 2021-11-10 2023-05-19 International Business Machines Corporation Reinforcement learning with inductive logic programming

Also Published As

Publication number Publication date
CN107544516A (zh) 2018-01-05

Similar Documents

Publication Publication Date Title
WO2019071909A1 (zh) 基于相对熵深度逆强化学习的自动驾驶系统及方法
CN110745136B (zh) 一种驾驶自适应控制方法
KR102335389B1 (ko) 자율 주행 차량의 lidar 위치 추정을 위한 심층 학습 기반 특징 추출
Ohnishi et al. Barrier-certified adaptive reinforcement learning with applications to brushbot navigation
CN107169567B (zh) 一种用于车辆自动驾驶的决策网络模型的生成方法及装置
KR102292277B1 (ko) 자율 주행 차량에서 3d cnn 네트워크를 사용하여 솔루션을 추론하는 lidar 위치 추정
US20210284184A1 (en) Learning point cloud augmentation policies
EP3035314B1 (en) A traffic data fusion system and the related method for providing a traffic state for a network of roads
WO2020119363A1 (zh) 自动驾驶方法、训练方法及相关装置
EP3719603B1 (en) Action control method and apparatus
US20240160901A1 (en) Controlling agents using amortized q learning
US20220187088A1 (en) Systems and methods for providing feedback to improve fuel consumption efficiency
CN112148008B (zh) 一种基于深度强化学习的实时无人机路径预测方法
CN110488842B (zh) 一种基于双向内核岭回归的车辆轨迹预测方法
US11567495B2 (en) Methods and systems for selecting machine learning models to predict distributed computing resources
CN113299085A (zh) 一种交通信号灯控制方法、设备及存储介质
CN109858137B (zh) 一种基于可学习扩展卡尔曼滤波的复杂机动飞行器航迹估计方法
CN114261400B (zh) 一种自动驾驶决策方法、装置、设备和存储介质
CN114199248B (zh) 一种基于混合元启发算法优化anfis的auv协同定位方法
CN115311860B (zh) 一种交通流量预测模型的在线联邦学习方法
CN112036598A (zh) 一种基于多信息耦合的充电桩使用信息预测方法
Xu et al. Trajectory prediction for autonomous driving with topometric map
CN115691140B (zh) 一种汽车充电需求时空分布的分析与预测方法
CN114187759B (zh) 一种基于数据驱动模型的路侧单元驾驶辅助方法及装置
CN114495036A (zh) 一种基于三阶段注意力机制的车辆轨迹预测方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18867035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18867035

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02/10/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 18867035

Country of ref document: EP

Kind code of ref document: A1