WO2019237474A1 - Partially observable autonomous driving decision-making method and system based on constraint online planning (基于约束在线规划的部分可观察自动驾驶决策方法及系统) - Google Patents

Partially observable autonomous driving decision-making method and system based on constraint online planning (基于约束在线规划的部分可观察自动驾驶决策方法及系统)

Info

Publication number
WO2019237474A1
Authority
WO
WIPO (PCT)
Prior art keywords
state
simulation
action
history
driving
Prior art date
Application number
PCT/CN2018/098899
Other languages
English (en)
French (fr)
Inventor
姜冲 (Jiang Chong)
章宗长 (Zhang Zongzhang)
Original Assignee
苏州大学 (Soochow University)
苏州大学张家港工业技术研究院 (Zhangjiagang Institute of Industrial Technology, Soochow University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州大学 (Soochow University) and 苏州大学张家港工业技术研究院 (Zhangjiagang Institute of Industrial Technology, Soochow University)
Publication of WO2019237474A1 publication Critical patent/WO2019237474A1/zh

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02: Control of position or course in two dimensions
    • G05D1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0276: Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Definitions

  • the invention relates to the technical field of automatic driving, in particular to a partially observable automatic driving decision method based on constraint online planning.
  • the first problem refers to the positioning problem. In reality, driving conditions are usually more complicated, so we need centimeter-level positioning.
  • the second problem is the path planning problem, which is the problem to be solved by this patent.
  • the third problem is the vehicle's actuation structure, that is, the drive-by-wire system; the main operations performed include brake-by-wire, steering, and throttle, that is, the vehicle is controlled according to the plan obtained by the planning module.
  • POMDP: Partially Observable Markov Decision Process.
  • POMDP regards the driving process as a decision-making process, and considers the situations that may occur during driving as a state, that is, the driving environment state unit.
  • the driving environment state unit cannot be fully obtained, that is, the state is partially observable.
  • in each state, we need to set rewards for the states that may appear next, select an action through a specific action-selection strategy, and then use value updating or policy updating to find a driving policy that can obtain the maximum cumulative reward.
  • solving a POMDP directly is rather difficult, so the usual approach is to use the belief state to convert the POMDP problem into an MDP problem.
  • the belief state b is a probability distribution over the state s. After each decision, once the action a is performed, the system obtains an observation o and then uses the Bayesian update method to update the belief state: b′(s′) = p(o | s′, a) · Σ_s p(s′ | s, a) · b(s) / p(o | b, a), where p(o | b, a) is a normalization constant determined by the model parameters T and O.
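The Bayesian belief update above can be sketched for a small discrete state space. The two-state driving scenario and the transition/observation tables below are illustrative assumptions, not values from the patent:

```python
# Minimal sketch of the Bayesian belief update b'(s') ∝ O(o|s',a) * Σ_s T(s'|s,a) b(s).
# The two-state scenario, T, and O are illustrative assumptions.

STATES = ["clear", "blocked"]

# T[(s, a)][s'] : transition probabilities; O[(s', a)][o] : observation probabilities.
T = {
    ("clear", "go"):   {"clear": 0.9, "blocked": 0.1},
    ("blocked", "go"): {"clear": 0.3, "blocked": 0.7},
}
O = {
    ("clear", "go"):   {"see_clear": 0.8, "see_blocked": 0.2},
    ("blocked", "go"): {"see_clear": 0.1, "see_blocked": 0.9},
}

def belief_update(b, a, o):
    """Return the posterior belief after taking action a and observing o."""
    unnormalized = {}
    for s2 in STATES:
        pred = sum(T[(s, a)][s2] * b[s] for s in STATES)   # prediction step
        unnormalized[s2] = O[(s2, a)][o] * pred            # correction step
    norm = sum(unnormalized.values())                      # p(o | b, a)
    return {s2: p / norm for s2, p in unnormalized.items()}

b = {"clear": 0.5, "blocked": 0.5}
b = belief_update(b, "go", "see_blocked")                  # belief shifts toward "blocked"
```

The normalization by p(o | b, a) is exactly the constant discussed in the background section; its cost grows with the state space, which is the motivation for the particle-based update used later.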
  • the strategy can be planned by considering both the primary objective and the secondary-objective constraints.
  • the planning method solves the problem that the driving mode is single and the user can only passively accept the driving scheme, which improves the user experience.
  • this method has the advantages of flexible driving schemes, high reliability, and comprehensive consideration, and has a wide range of application scenarios in the field of autonomous driving.
  • a partially observable autonomous driving decision-making method based on constraint online planning including:
  • the planning decision-making process is specifically:
  • the action a is input into the drive-by-wire system to drive the vehicle; at the same time, the perception module of the system obtains a new environment-state observation o, and the new record hao is added to the history h, where the record hao means taking the action a under the history h and obtaining the observation o;
  • the driving mode includes at least one of the following: fast mode, gentle mode, and energy-saving mode.
  • the simulation planning includes:
  • the input of the simulation process is the particle start state s, the history h, and the depth depth; when γ^depth < ε is satisfied, the simulation ends, otherwise the simulation continues;
  • the Monte Carlo method is used to update the belief state B (h) of history h and other related parameters in the model, and then return the reward R and cost C of this round of simulation planning.
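The simulation-planning steps above can be sketched as a recursive Monte Carlo tree search. The black-box simulator, the reward and cost numbers, the exploration constant, and the random stand-in for the history-based rollout policy are all illustrative assumptions, not the patent's actual model:

```python
import math
import random

GAMMA, EPS = 0.95, 0.05          # discount factor and cutoff: stop when gamma^depth < eps
ACTIONS = ["accelerate", "keep", "brake"]
K_EXPLORE = 0.5                  # assumed UCB1 exploration coefficient

def black_box(state, action):
    """Stand-in for the driving-environment black-box simulator G(s, a) -> (s', o, r, c)."""
    rng = random.Random((state * 31 + ACTIONS.index(action)) & 0xFFFF)
    s2 = (state + 1) % 10
    o = "near" if rng.random() < 0.5 else "far"
    r = 1.0 if action == "keep" else 0.5          # illustrative reward
    c = 0.2 if action == "accelerate" else 0.0    # illustrative cost
    return s2, o, r, c

def new_node():
    return {"N": 0, "VR": 0.0, "VC": 0.0, "B": [],
            "Na": {a: 0 for a in ACTIONS}, "QR": {a: 0.0 for a in ACTIONS}}

def ucb1_action(node):
    return max(ACTIONS, key=lambda a: node["QR"][a]
               + K_EXPLORE * math.sqrt(math.log(node["N"] + 1) / (node["Na"][a] + 1)))

def rollout(s, depth):
    """Beyond the search tree: a random stand-in for the history-based rollout policy."""
    if GAMMA ** depth < EPS:
        return 0.0, 0.0
    a = random.choice(ACTIONS)
    s2, _, r, c = black_box(s, a)
    R2, C2 = rollout(s2, depth + 1)
    return r + GAMMA * R2, c + GAMMA * C2

def simulate(s, h, depth, tree):
    """One round of simulation planning; returns the discounted reward-cost pair (R, C)."""
    if GAMMA ** depth < EPS:
        return 0.0, 0.0
    if h not in tree:                        # new history: add node T(h), then roll out
        tree[h] = new_node()
        return rollout(s, depth)
    node = tree[h]
    a = ucb1_action(node)                    # in-tree: UCB1 action selection
    s2, o, r, c = black_box(s, a)
    R2, C2 = simulate(s2, h + (a, o), depth + 1, tree)
    R, C = r + GAMMA * R2, c + GAMMA * C2
    node["N"] += 1
    node["Na"][a] += 1
    node["B"].append(s)                      # the particle joins the belief state B(h)
    node["QR"][a] += (R - node["QR"][a]) / node["Na"][a]   # Monte Carlo value updates
    node["VR"] += (R - node["VR"]) / node["N"]
    node["VC"] += (C - node["VC"]) / node["N"]
    return R, C

tree = {}
for _ in range(200):                         # repeated simulations grow the tree
    simulate(0, (), 0, tree)
```

Each simulation adds at most one new history node and backs up both the reward value V_R and the cost value V_C, mirroring the two-layer in-tree/rollout structure described in the text.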
  • the cost-constrained greedy action selection strategy includes:
  • the system evaluates the action value Q (s, a) according to UCB1, and selects the action with the highest Q value. At the same time, it also solves the convex optimization problem of cost to ensure that the cost constraint is met.
  • the obtained optimal strategy ⁇ is the output of GREEDYPOLICY.
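The cost-constrained greedy selection can be sketched as follows: take the highest-Q action, and when its expected cost exceeds the threshold, mix it with a cheaper action so that the expected cost meets the constraint, which yields a stochastic policy. The Q values, costs, and threshold are illustrative assumptions; a general solver would treat this as a small linear program rather than the two-action mixing used here:

```python
# Sketch of cost-constrained greedy action selection. All numbers are illustrative.

def greedy_policy(q, cost, c_hat):
    """Return {action: probability} maximizing Q subject to expected cost <= c_hat."""
    best = max(q, key=q.get)                       # action with the highest Q(s, a)
    if cost[best] <= c_hat:
        return {best: 1.0}                         # constraint already satisfied
    cheap = min(cost, key=cost.get)                # otherwise mix with the cheapest action
    if cost[cheap] > c_hat:
        return {cheap: 1.0}                        # infeasible: fall back to least cost
    # Choose the mixing weight p so the expected cost lands exactly on c_hat:
    p = (c_hat - cost[cheap]) / (cost[best] - cost[cheap])
    return {best: p, cheap: 1.0 - p}

q = {"accelerate": 2.0, "keep": 1.2, "brake": 0.4}      # assumed action values
cost = {"accelerate": 0.8, "keep": 0.3, "brake": 0.1}   # assumed action costs
pi = greedy_policy(q, cost, c_hat=0.45)                 # stochastic policy over two actions
```

The mixing step illustrates why the resulting policy is stochastic: a deterministic choice of the best action would violate the cost bound, while a deterministic cheap action would sacrifice reward unnecessarily.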
  • each particle corresponds to a sampled state, and the belief state B (h) is the set of all particles.
  • after an action a has been executed and an observation o obtained, Monte Carlo simulation is used to update the particles.
  • a state is drawn from the current belief state by randomly selecting a particle; this particle is passed as input to a black-box simulator to obtain a successor state s′, an observation o, and the corresponding reward and cost. If the simulated observation o matches the real observation, the particle s′ is added to the belief state B (h); the above steps are repeated until all particles have been added to the belief state B (h).
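The unweighted (rejection-based) particle filter described above can be sketched as follows; the stand-in simulator and the particle values are illustrative assumptions:

```python
import random

# Sketch of the rejection particle filter: draw a particle from B(h), push it
# through the black-box simulator, and keep the successor only when the
# simulated observation matches the real one.

def step(state, action):
    """Stand-in black-box simulator returning (s', o); illustrative dynamics."""
    s2 = state + (1 if action == "go" else 0)
    o = "ahead_clear" if s2 < 5 else "ahead_blocked"
    return s2, o

def particle_filter_update(particles, action, real_obs, k, rng):
    """Rebuild the belief with k particles whose simulated observation matches real_obs."""
    new_belief = []
    attempts = 0
    while len(new_belief) < k and attempts < 10000:   # cap attempts to avoid dead loops
        s = rng.choice(particles)                     # draw a particle from B(h)
        s2, o = step(s, action)
        if o == real_obs:                             # rejection test against reality
            new_belief.append(s2)
        attempts += 1
    return new_belief

rng = random.Random(0)
belief = [0, 1, 2, 3, 6, 7]          # illustrative particles for some history h
new_b = particle_filter_update(belief, "go", "ahead_clear", k=100, rng=rng)
```

Because matching the real observation replaces the exact Bayesian normalization, this update avoids the heavy summations of the belief-update formula and scales to large state spaces.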
  • the observation o of the environment state is collected, filtered, and preprocessed by the perception module, and then input into the decision model, which performs simulation and decision-making based on the input environment state.
  • MC: Monte Carlo.
  • in the step "the Monte Carlo method is used to update the belief state B (h) of history h and other related parameters in the model, and then the reward R and the cost C of this round of simulation planning are returned":
  • the state simulation is divided into two layers.
  • the first layer refers to using the UCB1 method to select actions while within the search tree, and the second layer refers to using a history-based rollout strategy to select actions once beyond the scope of the search tree.
  • a partially observable autonomous driving decision system based on constraint online planning including:
  • the driving environment state unit is used to receive the real-time driving environment obtained by the perception module, which performs filtering and preprocessing to output the state required by the decision module;
  • the simulation unit is used to simulate the driving trajectory based on the current history and construct a history-based Monte Carlo search tree.
  • the simulation is divided into two stages: in the first stage, while the simulated history is still within the search tree, the UCB1 algorithm is used to select the action; in the second stage, beyond the search tree, a history-based rollout algorithm is used to select the action.
  • the search unit is used to select the initial state for the simulation unit from the historical belief state, and then to select, from the simulation results, the optimal action that also meets the constraints selected by the user as the actual driving action, and to output the selected result to the drive-by-wire control system in the autonomous driving system;
  • a cost constraint unit is used to constrain the various costs incurred in the driving process; different constraint conditions correspond to different driving modes, and the decision model needs to satisfy the constraint conditions proposed by the user while giving the optimal decision;
  • the input of the simulation unit is the current history h, the simulation start state s, and the current search-tree depth, from which a search tree T (h) based on the history h is constructed in the simulation; the simulation unit selects an optimal action a according to the model's action-selection strategy, and then uses (s, a) as the input of the driving-environment black-box simulator to output a new state s′, an observation o, a reward r, and a cost c after taking the action a; the input of the next round of simulation is the new state s′, the new history hao, and the depth depth + 1; the output is the reward R and the cost C of this round of simulation;
  • the input of the search unit is the current history h, and the output is the optimal policy π that, under the current situation, can obtain the maximum cumulative reward while meeting the secondary-objective constraint conditions; the input of the cost constraint unit is a state s and an action a, from which a convex optimization problem over the constraint conditions is solved.
  • a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of any one of the methods.
  • the above partially observable automatic driving decision method based on constraint online planning has the following beneficial effects:
  • (1) the strategy-planning method based on partially observable Monte Carlo planning disclosed in the present invention is applicable to large-scale, conditionally constrained, partially observable Markov processes; it can well avoid the computational difficulty of the traditional Bayesian update method and greatly reduce the system's excessive dependence on equipment.
  • the decision-making model can achieve real-time planning, and the driving strategy can be adjusted in time through Monte Carlo planning, which is extremely flexible.
  • the cost-constrained convex optimization problem solved in the model ensures that the resulting strategy is a stochastic strategy; most problems that may be encountered can be considered, and the occurrence of various faults will not be ignored, making the system more secure and reliable.
  • the decision model does not require an explicit model of the driving environment, but only a black box simulator, which increases the feasibility of driving strategies.
  • the state perceived by the system has the Markov property.
  • a state with the Markov property means that the future state depends only on the current state and is independent of earlier states; therefore, past information need not be stored, which saves cost.
  • FIG. 1 is a system architecture diagram of a partially observable autonomous driving decision-making method based on constraint online planning provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a partially observable automatic driving decision method based on constraint online planning provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a simulation planning process in a partially observable automatic driving decision method based on constraint online planning provided by an embodiment of the present application.
  • after the user selects the driving mode, the initial value of the history h is empty, and the model selects the initial state according to an initial state distribution; after a series of decisions has been executed, the history is no longer empty, and a start state s needs to be drawn from the belief state B (h) of the history h. After the start state is selected, a series of decision trajectories is simulated forward from this start state s. Each simulation uses the GREEDYPOLICY action-selection strategy to select an action a, then feeds the state s and the action a into a black-box simulator to get the next simulated state s′ and an observation o. hao is used to build the search tree, s is added to the belief state B (h) of the history h, and finally the Monte Carlo method is used to update the reward value V_R and the cost value V_C of the state and the state-action pair.
  • the GREEDYPOLICY action-selection strategy includes UCB1 action selection and convex optimization for the secondary-objective constraints; a state s and an exploration coefficient k are input.
  • the system evaluates the action value Q (s, a) based on UCB1 and selects the action with the highest Q value; at the same time, the cost convex-optimization problem must be solved to ensure that the cost constraints are met; the optimal policy π finally obtained is the output of GREEDYPOLICY.
  • the state simulation is divided into two layers.
  • the first layer refers to using the UCB1 method to select actions when it is in the search tree.
  • the second layer refers to using a history-based rollout strategy to make the action selection when beyond the scope of the search tree.
  • the input of the simulation process is the particle initial state s, history h, and depth depth.
  • the final output of the simulation process at the end of the simulation is a reward-cost pair [R, C].
  • the decision model finally selects, from all simulated actions, the action b that meets the constraint conditions and has the largest V_R (hb) as the action actually executed.
  • UCB1 action selection and an unweighted particle filter are used.
  • the system adjusts the driving strategy in real time based on simulation planning over the history h: first the start state s of the simulation planning is initialized from the current historical belief state B (h), and then the simulation is started from s. The initial simulation depth is 0; as the simulation progresses, the depth gradually increases, and when γ^depth < ε is satisfied, the simulation planning terminates, otherwise it continues. If the current history h is not in the search tree, a new node T (h) is added to the search tree, which includes the visit count N (h) of the history, the reward value V_R (h), the cost value V_C (h), and its belief distribution B (h); initially, B (h) is empty.
  • beyond the search tree, the rollout strategy is used to select actions; if the current history is in the search tree, UCB1 together with solving the constrained convex-optimization problem is used to select the action. The obtained action a and the current state s are put into the black-box simulator to get the next successor state s′ and an observation o; s′ is then used for the next simulation, and the depth of the next simulation is increased by 1. Finally, the state s is merged into the belief state B (h), and the belief state is updated using the Monte Carlo method.
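The top-level search described above can be sketched as follows: sample a start state from the belief B(h), or from the initial distribution I when the history is empty, and after the simulations finish, execute the action b whose V_R(hb) is largest among the actions whose cost value satisfies the selected mode's constraint. The node statistics and numbers below are illustrative assumptions standing in for a real search tree:

```python
import random

# Sketch of the top-level search: start-state sampling plus constrained action choice.
# The belief contents, child statistics, and threshold are illustrative assumptions.

def sample_start_state(history, belief, initial_dist, rng):
    if not history:                      # empty history: sample from the distribution I
        return rng.choice(initial_dist)
    return rng.choice(belief[history])   # otherwise: sample a particle from B(h)

def best_action(children, c_hat):
    """children: {action b: (V_R(hb), V_C(hb))}; pick max reward among feasible actions."""
    feasible = {b: vr for b, (vr, vc) in children.items() if vc <= c_hat}
    if not feasible:                     # no action meets the constraint:
        return min(children, key=lambda b: children[b][1])   # fall back to least cost
    return max(feasible, key=feasible.get)

rng = random.Random(1)
initial_dist = [0, 1, 2]                               # assumed initial states
belief = {("go", "ahead_clear"): [3, 4, 4, 5]}         # assumed particles for one history
s0 = sample_start_state((), belief, initial_dist, rng)

children = {"accelerate": (9.0, 0.9),   # high reward, but violates the cost bound
            "keep":       (7.5, 0.4),
            "brake":      (3.0, 0.1)}
b = best_action(children, c_hat=0.5)    # "keep": best reward among feasible actions
```

This split mirrors the division of labor in the system: the simulation unit fills in the (V_R, V_C) statistics, while the search unit performs the constrained argmax and hands the chosen action to the drive-by-wire system.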
  • the user only needs to select the driving mode.
  • the decision-making model can automatically select the corresponding driving strategy without excessive operation, and has a good user experience.
  • the decision-making model will adjust the driving strategy according to the driving environment in real time.
  • this model can be adjusted and optimized through continuous driving training; for a newly encountered driving environment, it only needs to be added to the history search tree and the history updated to complete the model upgrade, so it is sustainable.
  • the partially observable automatic driving decision model and system based on constraint online planning of the present invention includes a search unit, a simulation unit, a cost constraint unit, and a driving environment state unit.
  • the driving environment state is used to receive the real-time driving environment obtained by the perception module, and the filtering and preprocessing are performed by the perception module to output the state required by the decision module.
  • the simulation unit is used to simulate the driving trajectory according to the current history and construct a history-based Monte Carlo search tree.
  • the simulation is divided into two stages: in the first stage, while the simulated history is still within the search tree, the action is selected using the UCB1 algorithm; in the second stage, beyond the search tree, the action is selected using a history-based rollout algorithm.
  • the search unit is used to select the initial state for the simulation unit from the historical belief state, and then to select, from the simulation results, the optimal action that also meets the constraints selected by the user as the actual driving action, and the selected result is output to the drive-by-wire control system in the automatic driving system.
  • a cost constraint unit is used to constrain various costs incurred during the driving process, and different constraint conditions correspond to different driving modes.
  • the fuel consumption, time, and vehicle stability proposed in this patent correspond to energy-saving mode, fast mode, and gentle mode.
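The mode-to-constraint mapping above can be sketched as a small configuration table: each driving mode bounds a different secondary cost (time, bumpiness, fuel). The cost names and threshold numbers are illustrative assumptions, not values from the patent:

```python
# Illustrative mapping from driving mode to the secondary cost it constrains.
# Names and thresholds are assumptions for the sketch.

MODE_CONSTRAINTS = {
    "fast":          {"cost": "time_s",      "threshold": 300.0},  # bound travel time
    "gentle":        {"cost": "bumpiness",   "threshold": 0.2},    # bound vehicle bumps
    "energy_saving": {"cost": "fuel_liters", "threshold": 1.5},    # bound fuel use
}

def constraint_for(mode):
    """Return the (cost name, threshold) the planner must respect for a mode."""
    c = MODE_CONSTRAINTS[mode]
    return c["cost"], c["threshold"]

cost_name, c_hat = constraint_for("energy_saving")
```

The threshold returned here plays the role of the bound ĉ in the cost-constrained convex optimization: the user's single mode choice fixes which cost is constrained and how tightly.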
  • the decision model needs to satisfy the constraint conditions proposed by the user while giving the optimal decision.
  • the input of the simulation unit is the current history h, the simulation start state s, and the depth of the current search tree, so as to construct a search tree T (h) based on the history h in the simulation.
  • the simulation unit selects an optimal action a according to the model's action-selection strategy, and then uses (s, a) as the input of the driving-environment black-box simulator to output a new state s′, an observation o, a reward r, and a cost c after taking the action a.
  • the input of the next round of simulation is new state s ′, new history hao, and depth + 1.
  • the reward R and cost C for this round of simulation are output.
  • the input of the search unit is the current history h, and the output is the optimal policy ⁇ that can obtain the maximum cumulative reward and meet the secondary target constraint conditions in the current situation.
  • the input of the cost constraint unit is state s, action a, and then a convex optimization problem with constraint conditions is solved.
  • the initial state of the simulation is sampled from the belief state of the history h. If the history h is empty, then the sample is sampled from an initial state distribution I.
  • the advantages of the partially observable automatic driving decision model and system based on constraint online planning of the present invention are: (1) the strategy-planning method based on partially observable Monte Carlo planning disclosed by the present invention is suitable for large-scale, conditionally constrained, partially observable Markov processes; it can well avoid the computational difficulty of the traditional Bayesian update method, which greatly reduces the system's excessive dependence on equipment. (2) The decision-making model can achieve real-time planning, and the driving strategy can be adjusted in time through Monte Carlo planning, which is extremely flexible. (3) The cost-constrained convex optimization problem solved in the model ensures that the resulting strategy is a stochastic strategy.
  • the decision model does not require an explicit model of the driving environment, but only a black box simulator, which increases the feasibility of driving strategies.
  • the state perceived by the system has the Markov property: the future state depends only on the current state and is independent of earlier states; therefore, past information need not be stored, which saves cost.
  • the invention discloses a partially observable automatic driving decision model and system based on constraint online planning.
  • This decision model is mainly used in automatic driving.
  • before the vehicle is started, the user can choose from three modes: fast, gentle, and energy-saving; according to the selected mode, the model chooses the corresponding optimal driving strategy. If the fast mode is selected, the system makes the car reach the destination as quickly as possible, and the main limiting factor is time; if the gentle mode is selected, the system selects the most comfortable and gentle driving strategy for the user, and the main limiting factor is the bumpiness of the vehicle; if the energy-saving mode is selected, the model chooses the most fuel-efficient scheme, minimizing frequent starts and stalls.
  • the decision-making model can not only generate a driving scheme for the current driving environment, but also adjust the scheme in real time according to the real-time road conditions and vehicle conditions to enhance its flexibility.
  • This model builds a history-based Monte Carlo search tree, so that the simulated solution is grounded in the real situation, which enhances reliability.
  • the model satisfies certain optimal selection conditions to ensure that the obtained strategy is a stochastic strategy, which makes up for the deficiencies of a deterministic strategy.
  • This decision model fully meets the current driving needs of ordinary users, and in particular provides a variety of modes to choose from, which greatly improves the user experience.

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A partially observable autonomous driving decision-making method and system based on constraint online planning. The decision-making method is mainly used in autonomous driving, giving the vehicle multiple driving modes. The system comprises a driving environment state unit, a search unit, a simulation unit, and a cost constraint unit. The decision-making method can not only generate a driving scheme for the current driving environment, but can also adjust the scheme in real time according to real-time road and vehicle conditions, enhancing its flexibility. The method constructs a history-based Monte Carlo search tree, so that the simulated solution is grounded in the real situation, which enhances reliability. At the same time, the method satisfies certain optimality conditions to ensure that the resulting policy is a stochastic policy, making up for the deficiencies of deterministic policies. The decision-making method fully meets the current driving needs of ordinary users, and in particular provides a variety of selectable modes, greatly improving the user experience.

Description

Partially observable autonomous driving decision-making method and system based on constraint online planning

Technical Field
The present invention relates to the technical field of autonomous driving, and in particular to a partially observable autonomous driving decision-making method based on constraint online planning.
Background Art
At present, autonomous driving involves three questions: first, where am I? Second, where am I going? Third, how do I get there? True autonomous driving requires solving these three questions perfectly. The first question is the positioning problem; in reality, road conditions are usually rather complicated, so we need centimeter-level positioning. The second question is the path-planning problem, which is the problem this patent aims to solve. The third question concerns the vehicle's actuation structure, that is, the drive-by-wire system; the main operations it performs include brake-by-wire, steering, and throttle, i.e., controlling the vehicle according to the plan produced by the planning module.
传统技术存在以下技术问题:
In the current field of driverless vehicles, a commonly used decision model is the POMDP (Partially Observable Markov Decision Process). A POMDP treats driving as a decision process and regards each situation that may arise while driving as a state, i.e. a driving-environment state. During driving we assume the driving-environment state cannot be fully acquired, i.e. the state is partially observable. In each state, rewards are assigned to the states that may follow, an action is selected via a specific action-selection policy, and value updates or policy updates are used to find a driving policy that maximizes the cumulative reward. Solving a POMDP directly is difficult, so the usual approach is to use belief states to convert the POMDP into an MDP. The belief state b is a probability distribution over states s; after each decision executes action a, the system receives an observation o and updates the belief state with the Bayesian rule:

b′(s′) = p(o|s′,a) · Σ_s p(s′|s,a) b(s) / p(o|b,a)

where p(o|s′,a) is determined by the observation model O and p(o|b,a) by the model parameters T and O jointly, with:

p(o|b,a) = Σ_{s′} p(o|s′,a) Σ_s p(s′|s,a) b(s)

This is a normalization constant. The drawback of this method is its heavy computation, which makes it unsuitable for large-scale autonomous-driving scenarios: the cost of solving a POMDP grows with the state dimensionality and the amount of history, producing the curse of dimensionality and the curse of history.
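As a concrete illustration, the sketch below performs one exact Bayesian belief update on a hypothetical two-state POMDP; the transition table `T` and observation table `O` are invented for illustration and are not values from the patent. Each update touches every (s, s′) pair, which is what makes exact updating intractable as the state space grows.

```python
# Hypothetical toy POMDP with 2 states, 1 action, 2 observations; the numbers
# in T and O are illustrative assumptions, not values from the patent.
T = {0: [[0.9, 0.1], [0.2, 0.8]]}          # T[a][s][s'] = p(s' | s, a)
O = {0: [[0.8, 0.2], [0.3, 0.7]]}          # O[a][s'][o] = p(o | s', a)

def belief_update(b, a, o):
    """Exact Bayesian update: b'(s') ∝ p(o|s',a) · Σ_s p(s'|s,a) b(s)."""
    n = len(b)
    predicted = [sum(T[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    unnormalized = [O[a][s2][o] * predicted[s2] for s2 in range(n)]
    p_o_ba = sum(unnormalized)             # normalization constant p(o|b,a)
    return [x / p_o_ba for x in unnormalized]

b1 = belief_update([0.5, 0.5], a=0, o=1)   # belief after one (a, o) step
assert abs(sum(b1) - 1.0) < 1e-12          # a belief is a distribution
```

The per-step cost is O(|S|²) per action-observation pair, which motivates the sampling-based (POMCP-style) approach used by the invention.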
Summary of the Invention
In view of the above technical problems, it is necessary to provide a partially observable autonomous-driving decision method based on constrained online planning. The method is based on partially observable Monte Carlo planning and can plan a policy that considers the primary objective and secondary-objective constraints simultaneously. It solves the problem that driving modes are single and users can only passively accept a driving scheme, improving the user experience. The method offers flexible driving schemes, high reliability, and comprehensive consideration, and has wide application scenarios in the field of autonomous driving.
A partially observable autonomous-driving decision method based on constrained online planning comprises:
receiving a driving mode selected by the user;
planning a driving scheme according to the selected driving mode, the planning and decision process being as follows:
selecting a starting state s for simulated planning from a given initial state distribution I or from the belief state B(h) of the history h, to build a Monte Carlo search tree and perform simulated planning;
according to a cost-constrained greedy action-selection policy, selecting from the constructed Monte Carlo search tree an action a that satisfies the constraints while maximizing V_R(ha);
inputting the action a into the drive-by-wire system to drive the vehicle, while the system's perception module obtains a new environment-state observation o, and appending the new record hao to the history h, where the record hao means that observation o was obtained by taking action a under the history h;
repeating the above steps until the destination is reached.
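The outer decision loop above (plan, act, observe, append hao) can be sketched as follows; `search`, `actuate_and_observe`, and `at_destination` are hypothetical interfaces standing in for the patent's planning, drive-by-wire, and perception modules.

```python
def drive(search, actuate_and_observe, at_destination):
    """Sketch of the outer decision loop, under the assumed interfaces:
    search(h) plans an action, actuate_and_observe(a) executes it and returns
    the perceived observation, at_destination(h) ends the trip."""
    history = ()                            # the history h starts empty
    while not at_destination(history):
        a = search(history)                 # constrained online planning
        o = actuate_and_observe(a)          # drive-by-wire + perception
        history = history + ((a, o),)       # append the record hao: h <- hao
    return history

# Minimal stub run: plan "cruise" until three steps have been recorded.
h = drive(search=lambda h: "cruise",
          actuate_and_observe=lambda a: "clear_road",
          at_destination=lambda h: len(h) >= 3)
assert len(h) == 3 and h[0] == ("cruise", "clear_road")
```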
In another embodiment, the driving mode includes at least one of: a fast mode, a gentle mode, and an energy-saving mode.
In another embodiment, in the step "selecting a starting state for simulated planning from a given initial state distribution or the belief state of the history, to build a Monte Carlo search tree and perform simulated planning:", the simulated planning comprises:
selecting a simulation starting state s: if the history is empty, selecting from the given initial state distribution I; if the history is not empty, selecting from the belief-state distribution of the history h;
the inputs of the simulation process are the particle starting state s, the history h, and the depth depth; when γ^depth < ε holds, the simulation ends, otherwise it continues;
starting from the state s, selecting a simulation action a according to the constrained greedy action-selection policy;
inputting the state s and the action a into the environment-state black-box simulator to obtain the next simulated state s′ and the corresponding environment-state observation o, and adding hao as a new node of the Monte Carlo search tree;
updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.
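A minimal sketch of one round of the simulated planning just described, in the style of POMCP. The callables `step` (the black-box simulator), `rollout_action`, and `greedy_action` are assumed interfaces, and the discount and cut-off values are illustrative, not values from the patent.

```python
GAMMA, EPS = 0.95, 0.05   # discount γ and cut-off ε; illustrative values

def simulate(state, history, depth, tree, step, rollout_action, greedy_action):
    """One simulation pass: expand the history tree, recurse through the
    black-box simulator, then update V_R / V_C by Monte Carlo averaging."""
    if GAMMA ** depth < EPS:                      # γ^depth < ε: stop simulating
        return 0.0, 0.0
    if history not in tree:                       # expand: new node T(h)
        tree[history] = {"N": 0, "V_R": 0.0, "V_C": 0.0, "B": []}
        a = rollout_action(history)               # outside the tree: rollout
    else:
        a = greedy_action(history)                # inside the tree: constrained UCB1
    s2, o, r, c = step(state, a)                  # black-box simulator (s,a) -> (s',o,r,c)
    R, C = simulate(s2, history + (a, o), depth + 1,
                    tree, step, rollout_action, greedy_action)
    R, C = r + GAMMA * R, c + GAMMA * C           # discounted reward-cost pair
    node = tree[history]
    node["B"].append(state)                       # merge s into the belief B(h)
    node["N"] += 1                                # Monte Carlo running averages
    node["V_R"] += (R - node["V_R"]) / node["N"]
    node["V_C"] += (C - node["V_C"]) / node["N"]
    return R, C

# Stub run on a one-state environment with constant reward 1 and cost 0.5.
tree = {}
R, C = simulate(0, (), 0, tree,
                step=lambda s, a: (s, "obs", 1.0, 0.5),
                rollout_action=lambda h: "hold",
                greedy_action=lambda h: "hold")
assert R > 0 and () in tree
```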
In another embodiment, in the step "according to a cost-constrained greedy action-selection policy, selecting from the constructed Monte Carlo search tree an action a that satisfies the constraints while maximizing V_R(ha);", the cost-constrained greedy action-selection policy comprises:
(1) the UCB1 action-selection method with the constraints added:

Q⊕(ha) = Q(ha) + k·√(log N(h) / N(ha))

a* = argmax_a Q⊕(ha)

(2) convex optimization for the secondary-objective constraints:

max_π V_R^π(h)

s.t. V_C^π(h) ≤ ĉ

Given a state s and an exploration coefficient k as input, the system evaluates the action values Q(s,a) according to UCB1 and selects the action with the highest Q value; at the same time, it solves the convex optimization problem on the costs to ensure the cost constraints are satisfied; the resulting optimal policy π is the output of GREEDYPOLICY.
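The greedy selection can be sketched as below. For illustration, the convex optimization on costs is replaced by a simple feasibility filter Q_C(ha) ≤ ĉ — an assumption, not the patent's exact formulation; the exploration term follows the standard UCB1 form. The action names are hypothetical.

```python
import math

def greedy_policy(node, k, c_hat):
    """Sketch of cost-constrained greedy selection under stated assumptions:
    `node` maps each action a to statistics {"N", "Q_R", "Q_C"}; the cost
    constraint is approximated by filtering out actions with Q_C(ha) > c_hat
    (a stand-in for the patent's convex optimization on costs)."""
    N_h = sum(st["N"] for st in node.values()) or 1
    feasible = {a: st for a, st in node.items() if st["Q_C"] <= c_hat}
    candidates = feasible or node       # if nothing is feasible, fall back
    def ucb1(st):                       # Q_R(ha) + k * sqrt(log N(h) / N(ha))
        if st["N"] == 0:
            return float("inf")         # try unvisited actions first
        return st["Q_R"] + k * math.sqrt(math.log(N_h) / st["N"])
    return max(candidates, key=lambda a: ucb1(candidates[a]))

stats = {"accelerate": {"N": 10, "Q_R": 1.0, "Q_C": 0.9},
         "keep_speed": {"N": 10, "Q_R": 0.8, "Q_C": 0.2}}
# With a tight cost budget the high-reward but costly action is filtered out.
assert greedy_policy(stats, k=0.5, c_hat=0.5) == "keep_speed"
```

With a loose budget (ĉ large enough that both actions are feasible), the pure UCB1 score decides and the higher-reward action wins instead.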
In another embodiment, in the step "updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.",
when the belief state is updated, K particles s¹, …, s^K are first drawn from the initial state distribution. Each particle corresponds to a sampled state, and the belief state B(h) is the set of all particles. After an action a has been executed and an observation o obtained, the particles are updated by Monte Carlo simulation: a state is drawn from the current belief state B(s,h) by picking a particle at random; the particle is fed as input into a black-box simulator, yielding a successor state s′, an observation o, and the corresponding reward and cost; if the simulated observation o matches the real observation, the particle s′ is added to the belief state B(h); the above steps are repeated until all particles have been added to the belief state B(h).
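The unweighted particle-filter update described above can be sketched as follows; `step` is the assumed black-box simulator returning (s′, o, reward, cost), and a production implementation would bound the number of rejection trials rather than loop until K matches are found.

```python
import random

def update_belief(particles, action, real_obs, step, K):
    """Unweighted particle-filter belief update: resample particles, push
    each through the black-box simulator, and keep successors whose simulated
    observation matches the real observation, until K particles are kept."""
    new_belief = []
    while len(new_belief) < K:
        s = random.choice(particles)          # draw a particle at random
        s2, o, _, _ = step(s, action)
        if o == real_obs:                     # keep only matching successors
            new_belief.append(s2)
    return new_belief

random.seed(0)
step = lambda s, a: (s + 1, s % 2, 0.0, 0.0)  # deterministic toy simulator
belief = update_belief([0, 1, 2, 3], action=0, real_obs=0, step=step, K=5)
assert all(s in (1, 3) for s in belief)       # only matching successors kept
```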
In another embodiment, in the step "inputting the state s and the action a into the environment-state black-box simulator to obtain the next simulated state s′ and the corresponding environment-state observation o, and adding hao as a new node of the Monte Carlo search tree;",
the environment-state observation o is collected, filtered, and preprocessed by the perception module and then input into the decision model, which simulates and decides according to the input environment state.
In another embodiment, in the step "inputting the state and the action into the environment-state black-box simulator to obtain the next simulated state and the corresponding environment-state observation, adding it as a new node of the Monte Carlo search tree;", each node of the history-based search tree uses Monte Carlo (MC) simulation to estimate the reward and cost values of the history; the model contains a black-box simulator of the driving environment and requires no explicit probability distribution over states.
In another embodiment, in the step "updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.", the state simulation has two layers: in the first layer, while inside the search tree, actions are selected with the UCB1 method; in the second layer, beyond the search tree, actions are selected with a history-based rollout policy.
A partially observable autonomous-driving decision system based on constrained online planning comprises:
a driving-environment state unit for receiving the real-time driving environment acquired by the perception module, which filters and preprocesses it, and for outputting the state required by the decision module;
a simulation unit for simulating driving trajectories according to the current history and building a history-based Monte Carlo search tree; the simulation has two phases: in the first phase, when none of the children has child nodes, actions are selected with the UCB1 algorithm; in the second phase, actions are selected with a history-based rollout algorithm;
a search unit for selecting the starting state of the simulation unit from the belief state of the history, then, according to the results produced by the simulation unit, selecting the optimal action that also satisfies the user-selected constraints as the actual driving action, and outputting the selection to the drive-by-wire system of the autonomous-driving system; and
a cost-constraint unit for constraining the various costs incurred during driving, different constraints corresponding to different driving modes, the decision model being required to satisfy the user's constraints while giving the optimal decision;
wherein the inputs of the simulation unit are the current history h, the simulation starting state s, and the current search-tree depth depth, with which a history-based search tree T(h) is built during simulation; the simulation unit selects an optimal action a according to the model's action-selection policy, then uses (s,a) as the input of the driving-environment black-box simulator, which outputs a new state s′, the observation o after taking action a, a reward r, and a cost c; the inputs of the next round of simulation are the new state s′, the new history hao, and the depth depth+1, and the output is the reward R and cost C of this round of simulation; the input of the search unit is the current history h, and its output is the optimal policy π that obtains the maximum cumulative reward in the current situation while satisfying the secondary-objective constraints; the inputs of the cost-constraint unit are a state s and an action a, from which it solves a convex optimization problem on the constraints.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods.
The above partially observable autonomous-driving decision method based on constrained online planning has the following beneficial effects: (1) The disclosed policy-planning method based on partially observable Monte Carlo planning applies to large-scale constrained partially observable Markov processes, effectively solves the computational difficulty of traditional Bayesian updating, and greatly reduces the system's over-reliance on hardware. (2) The decision model plans in real time; Monte Carlo planning allows the driving policy to be adjusted promptly, giving great flexibility. (3) The cost-constrained convex optimization solved in the model ensures that the resulting policy is stochastic, so most potential problems are taken into account and failures are not overlooked, making the system safer and more reliable. (4) The decision model needs no explicit model of the driving environment, only a black-box simulator, which increases the feasibility of the driving policy. (5) The states perceived by the system have the Markov property: the future of a Markovian state depends only on the current state and not on earlier states, so past information need not be stored, saving cost.
Brief Description of the Drawings
FIG. 1 is a system architecture diagram of a partially observable autonomous-driving decision method based on constrained online planning provided by an embodiment of the present application.
FIG. 2 is a flowchart of a partially observable autonomous-driving decision method based on constrained online planning provided by an embodiment of the present application.
FIG. 3 is a flowchart of the simulated-planning process in a partially observable autonomous-driving decision method based on constrained online planning provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Referring to FIG. 1 to FIG. 3, the embodiments of the partially observable autonomous-driving decision method based on constrained online planning — the planning and decision steps, the simulated planning, the cost-constrained greedy action selection, the particle-based belief-state update, the two-layer simulation, the decision system comprising the driving-environment state unit, simulation unit, search unit, and cost-constraint unit, the computer-readable storage medium, and the beneficial effects — are as set forth in the Summary of the Invention above.
A specific application scenario of the present invention is described below:
S1. The model builds a history-based Monte Carlo search tree, rather than a state-based search tree.
S2. After the user selects a driving mode, the history h is initially empty, and the model selects a starting state from an initial state distribution. After a series of decisions has been executed, the history is no longer empty, and the simulation draws a starting state s from the belief state B(h) of the history h. Once the starting state is chosen, a series of decision trajectories is simulated forward from s. Each simulation uses the GREEDYPOLICY action-selection policy to choose an action a, then feeds the state s and the action a into a black-box simulator to obtain the next simulated state s′ and an observation o; hao is used to build the next node of the search tree, s is added to the belief state B(h) of the history h, and finally the Monte Carlo method updates the reward values V_R and cost values V_C of the states and state-action pairs.
S3. The GREEDYPOLICY action-selection policy comprises UCB1 action selection and convex optimization for the secondary-objective constraints. Given a state s and an exploration coefficient k as input, the system evaluates the action values Q(s,a) according to UCB1 and selects the action with the highest Q value; at the same time, it solves the convex optimization problem on the costs to ensure the cost constraints are satisfied; the resulting optimal policy π is the output of GREEDYPOLICY.
S4. The state simulation has two layers: in the first layer, while inside the search tree, actions are selected with the UCB1 method; in the second layer, beyond the search tree, actions are selected with a history-based rollout policy.
S5. When the belief state is updated, K particles s¹, …, s^K are first drawn from the initial state distribution. Each particle corresponds to a sampled state, and the belief state B(h) is the set of all particles. After an action a has been executed and an observation o obtained, the particles are updated by Monte Carlo simulation: a state is drawn from the current belief state B(s,h) by picking a particle at random; the particle is fed as input into a black-box simulator, yielding a successor state s′, an observation o, and the corresponding reward and cost; if the simulated observation o matches the real observation, the particle s′ is added to the belief state B(h); the above steps are repeated until all particles have been added to the belief state B(h).
S6. The inputs of the simulation process are the particle starting state s, the history h, and the depth depth; when γ^depth < ε holds, the simulation ends, and the final output of the simulation process is a reward-cost pair [R, C].
S7. From all simulations, the decision model finally picks the action b that satisfies the constraints while maximizing V_R(hb) as the action actually executed.
The present invention uses UCB1 action selection and an unweighted particle filter. While driving, the system performs simulated planning in real time according to the history h to adjust the driving policy. It first initializes the starting state s of simulated planning from the belief state B(h) of the current history and then simulates from s; the initial simulation depth is 0 and increases gradually as simulation proceeds; when γ^depth < ε holds, simulated planning terminates, otherwise it continues. If the current history h is not in the search tree, a new node T(h) is added to the tree, containing the visit count N(h) of the history, the reward value V_R(h), the cost value V_C(h), and its belief distribution B(h); B(h) is initially empty, and after the new node is added, a rollout policy is used to select actions. If the current history is in the search tree, UCB1 is used together with solving the constrained convex optimization problem to select an action; the obtained action a and the current state s are fed into the black-box simulator to obtain the successor state s′ and an observation o, s′ is used for the next round of simulation, and the depth of the next round is increased by 1. Finally, the state s is merged into the belief state B(h) and the belief state is updated with the Monte Carlo method. While driving, the user only needs to select the driving mode; the decision model automatically chooses the corresponding driving policy without further operations, giving a good user experience. The decision model also adjusts the driving policy in real time according to the driving environment, so it is highly flexible and reliable. The model can be tuned and optimized through continued driving training; for a newly encountered driving environment, it only needs to be added to the history search tree and the history updated to upgrade the model, so it remains usable over time.
To achieve the above objective, the partially observable autonomous-driving decision model and system based on constrained online planning of the present invention comprise: a search unit, a simulation unit, a cost-constraint unit, and a driving-environment state unit.
The driving-environment state unit receives the real-time driving environment acquired by the perception module, which filters and preprocesses it, and outputs the state required by the decision module.
The simulation unit simulates driving trajectories according to the current history and builds a history-based Monte Carlo search tree. The simulation has two phases: in the first phase, when none of the children has child nodes, actions are selected with the UCB1 algorithm; in the second phase, actions are selected with a history-based rollout algorithm.
The search unit selects the starting state of the simulation unit from the belief state of the history, then, according to the results produced by the simulation unit, selects the optimal action that also satisfies the user-selected constraints as the actual driving action, and outputs the selection to the drive-by-wire system of the autonomous-driving system.
The cost-constraint unit constrains the various costs incurred during driving, different constraints corresponding to different driving modes. For example, the fuel consumption, time, and vehicle smoothness proposed in this patent correspond to the energy-saving mode, fast mode, and gentle mode respectively. The decision model must satisfy the user's constraints while giving the optimal decision.
The inputs of the simulation unit are the current history h, the simulation starting state s, and the current search-tree depth depth, with which a history-based search tree T(h) is built during simulation. The simulation unit selects an optimal action a according to the model's action-selection policy, then uses (s,a) as the input of the driving-environment black-box simulator, which outputs a new state s′, the observation o after taking action a, a reward r, and a cost c. The inputs of the next round of simulation are the new state s′, the new history hao, and the depth depth+1. The output is the reward R and cost C of this round of simulation.
The input of the search unit is the current history h, and its output is the optimal policy π that obtains the maximum cumulative reward in the current situation while satisfying the secondary-objective constraints.
The inputs of the cost-constraint unit are a state s and an action a, from which it solves a convex optimization problem on the constraints.
The simulation starting state is sampled from the belief state of the history h; if the history h is empty, it is sampled from an initial state distribution I.
After the driving action a is executed in the current driving environment, an observation o is obtained; the history at the next moment is hao, and the above simulation process is repeated.
One innovation of this patent is that the traditional POMDP model cannot be applied well to autonomous driving because the curse of dimensionality and the curse of history make its computation too heavy, whereas this patent applies the POMCP method, breaking the curse of dimensionality and solving the problem of excessive computation. A second innovation is that the notion of cost constraints is added on top of POMCP, so that the driving system can choose among multiple driving modes when deciding; this solves the single-driving-mode problem of ordinary driving systems, is more flexible, and can greatly improve the user experience.
The present invention discloses a partially observable autonomous-driving decision model and system based on constrained online planning. The decision model is mainly used in autonomous driving: before the vehicle starts, the user can choose among three modes — fast, gentle, and energy-saving. According to the selected mode, the model chooses the corresponding optimal driving strategy. In fast mode, the system makes the car reach the destination as quickly as possible, the main limiting factor being time; in gentle mode, the system chooses the driving strategy that feels most comfortable and smooth to the user, the main limiting factor being the bumpiness of the vehicle; in energy-saving mode, the model chooses the most fuel-efficient way, minimizing frequent starting, stalling, braking, and similar operations, the main limiting factor being fuel consumption. The decision model not only generates a driving scheme for the current driving environment, but also adjusts the scheme in real time according to live road and vehicle conditions, enhancing flexibility. The model builds a history-based Monte Carlo search tree, so that simulated solutions are grounded in the real situation, enhancing reliability. The model also satisfies certain optimal-selection conditions to ensure that the resulting policy is stochastic, making up for the deficiencies of deterministic policies. The decision model fully meets the driving needs of ordinary users and in particular offers multiple selectable modes, greatly improving the user experience.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is specific and detailed, but they should not therefore be understood as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the invention. Therefore, the protection scope of the patent of the present invention shall be subject to the appended claims.

Claims (10)

  1. A partially observable autonomous-driving decision method based on constrained online planning, characterized by comprising:
    receiving a driving mode selected by the user;
    planning a driving scheme according to the selected driving mode, the planning and decision process being as follows:
    selecting a starting state s for simulated planning from a given initial state distribution I or from the belief state B(h) of the history h, to build a Monte Carlo search tree and perform simulated planning;
    according to a cost-constrained greedy action-selection policy, selecting from the constructed Monte Carlo search tree an action a that satisfies the constraints while maximizing V_R(ha);
    inputting the action a into the drive-by-wire system to drive the vehicle, while the system's perception module obtains a new environment-state observation o, and appending the new record hao to the history h, where the record hao means that observation o was obtained by taking action a under the history h;
    repeating the above steps until the destination is reached.
  2. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that the driving mode includes at least one of: a fast mode, a gentle mode, and an energy-saving mode.
  3. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "selecting a starting state for simulated planning from a given initial state distribution or the belief state of the history, to build a Monte Carlo search tree and perform simulated planning:", the simulated planning comprises:
    selecting a simulation starting state s: if the history is empty, selecting from the given initial state distribution I; if the history is not empty, selecting from the belief-state distribution of the history h;
    the inputs of the simulation process being the particle starting state s, the history h, and the depth depth, the simulation ending when γ^depth < ε holds and continuing otherwise;
    starting from the state s, selecting a simulation action a according to the constrained greedy action-selection policy;
    inputting the state s and the action a into the environment-state black-box simulator to obtain the next simulated state s′ and the corresponding environment-state observation o, and adding hao as a new node of the Monte Carlo search tree;
    updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.
  4. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "according to a cost-constrained greedy action-selection policy, selecting from the constructed Monte Carlo search tree an action a that satisfies the constraints while maximizing V_R(ha);", the cost-constrained greedy action-selection policy comprises:
    (1) the UCB1 action-selection method with the constraints added:

    Q⊕(ha) = Q(ha) + k·√(log N(h) / N(ha))

    a* = argmax_a Q⊕(ha)

    (2) convex optimization for the secondary-objective constraints:

    max_π V_R^π(h)

    s.t. V_C^π(h) ≤ ĉ

    given a state s and an exploration coefficient k as input, the system evaluates the action values Q(s,a) according to UCB1 and selects the action with the highest Q value; at the same time, it solves the convex optimization problem on the costs to ensure the cost constraints are satisfied; the resulting optimal policy π is the output of GREEDYPOLICY.
  5. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.",
    when the belief state is updated, K particles s¹, …, s^K are first drawn from the initial state distribution, each particle corresponding to a sampled state, the belief state B(h) being the set of all particles; after an action a has been executed and an observation o obtained, the particles are updated by Monte Carlo simulation: a state is drawn from the current belief state B(s,h) by picking a particle at random, the particle is fed as input into a black-box simulator to obtain a successor state s′, an observation o, and the corresponding reward and cost; if the simulated observation o matches the real observation, the particle s′ is added to the belief state B(h), and the above steps are repeated until all particles have been added to the belief state B(h).
  6. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "inputting the state s and the action a into the environment-state black-box simulator to obtain the next simulated state s′ and the corresponding environment-state observation o, and adding hao as a new node of the Monte Carlo search tree;",
    the environment-state observation o is collected, filtered, and preprocessed by the perception module and then input into the decision model, which simulates and decides according to the input environment state.
  7. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "inputting the state and the action into the environment-state black-box simulator to obtain the next simulated state and the corresponding environment-state observation, adding it as a new node of the Monte Carlo search tree;", each node of the history-based search tree uses Monte Carlo (MC) simulation to estimate the reward and cost values of the history; the model contains a black-box simulator of the driving environment and requires no explicit probability distribution over states.
  8. The partially observable autonomous-driving decision method based on constrained online planning according to claim 1, characterized in that in the step "updating the belief state B(h) of the history h and other related parameters in the model using the Monte Carlo method, and then returning the reward R and cost C of this round of simulated planning.", the state simulation has two layers: in the first layer, while inside the search tree, actions are selected with the UCB1 method; in the second layer, beyond the search tree, actions are selected with a history-based rollout policy.
  9. A partially observable autonomous-driving decision system based on constrained online planning, characterized by comprising:
    a driving-environment state unit for receiving the real-time driving environment acquired by the perception module, which filters and preprocesses it, and for outputting the state required by the decision module;
    a simulation unit for simulating driving trajectories according to the current history and building a history-based Monte Carlo search tree, the simulation having two phases: in the first phase, when none of the children has child nodes, actions are selected with the UCB1 algorithm; in the second phase, actions are selected with a history-based rollout algorithm;
    a search unit for selecting the starting state of the simulation unit from the belief state of the history, then, according to the results produced by the simulation unit, selecting the optimal action that also satisfies the user-selected constraints as the actual driving action, and outputting the selection to the drive-by-wire system of the autonomous-driving system; and
    a cost-constraint unit for constraining the various costs incurred during driving, different constraints corresponding to different driving modes, the decision model being required to satisfy the user's constraints while giving the optimal decision;
    wherein the inputs of the simulation unit are the current history h, the simulation starting state s, and the current search-tree depth depth, with which a history-based search tree T(h) is built during simulation; the simulation unit selects an optimal action a according to the model's action-selection policy, then uses (s,a) as the input of the driving-environment black-box simulator, which outputs a new state s′, the observation o after taking action a, a reward r, and a cost c; the inputs of the next round of simulation are the new state s′, the new history hao, and the depth depth+1, and the output is the reward R and cost C of this round of simulation; the input of the search unit is the current history h, and its output is the optimal policy π that obtains the maximum cumulative reward in the current situation while satisfying the secondary-objective constraints; the inputs of the cost-constraint unit are a state s and an action a, from which it solves a convex optimization problem on the constraints.
  10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8.
PCT/CN2018/098899 2018-06-11 2018-08-06 基于约束在线规划的部分可观察自动驾驶决策方法及系统 WO2019237474A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810595164.1A CN108803609B (zh) 2018-06-11 2018-06-11 基于约束在线规划的部分可观察自动驾驶决策方法
CN201810595164.1 2018-06-11

Publications (1)

Publication Number Publication Date
WO2019237474A1 true WO2019237474A1 (zh) 2019-12-19

Family

ID=64089043

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/098899 WO2019237474A1 (zh) 2018-06-11 2018-08-06 基于约束在线规划的部分可观察自动驾驶决策方法及系统

Country Status (2)

Country Link
CN (1) CN108803609B (zh)
WO (1) WO2019237474A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109778939B (zh) * 2019-03-04 2021-07-09 江苏徐工工程机械研究院有限公司 一种可自主规划轨迹的挖掘机臂智能控制系统及方法
US20200363800A1 (en) * 2019-05-13 2020-11-19 Great Wall Motor Company Limited Decision Making Methods and Systems for Automated Vehicle
CN111026110B (zh) * 2019-11-20 2021-04-30 北京理工大学 面向含软、硬约束线性时序逻辑的不确定动作规划方法
CN110837258B (zh) * 2019-11-29 2024-03-08 商汤集团有限公司 自动驾驶控制方法及装置、系统、电子设备和存储介质
CN111240318A (zh) * 2019-12-24 2020-06-05 华中农业大学 一种机器人的人员发现算法
CN111026127B (zh) * 2019-12-27 2021-09-28 南京大学 基于部分可观测迁移强化学习的自动驾驶决策方法及系统
CN113189986B (zh) * 2021-04-16 2023-03-14 中国人民解放军国防科技大学 一种自主机器人的二阶段自适应行为规划方法及系统

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107292344A (zh) * 2017-06-26 2017-10-24 苏州大学 一种基于环境交互的机器人实时控制方法
CN107544516A (zh) * 2017-10-11 2018-01-05 苏州大学 基于相对熵深度逆强化学习的自动驾驶系统及方法

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN103528587B (zh) * 2013-10-15 2016-09-28 西北工业大学 自主组合导航系统
CN106169188B (zh) * 2016-07-11 2019-01-15 西南交通大学 一种基于蒙特卡洛树搜索的对象跟踪方法
CN107038477A (zh) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 一种非完备信息下的神经网络与q学习结合的估值方法
CN115343947A (zh) * 2016-09-23 2022-11-15 苹果公司 自主车辆的运动控制决策
CN107063280B (zh) * 2017-03-24 2019-12-31 重庆邮电大学 一种基于控制采样的智能车辆路径规划系统及方法

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN107292344A (zh) * 2017-06-26 2017-10-24 苏州大学 一种基于环境交互的机器人实时控制方法
CN107544516A (zh) * 2017-10-11 2018-01-05 苏州大学 基于相对熵深度逆强化学习的自动驾驶系统及方法

Non-Patent Citations (1)

Title
BAI, AIJUN: "Markov Theory Based Planning and Sensing under Uncertainty", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, 15 June 2015 (2015-06-15) *

Also Published As

Publication number Publication date
CN108803609B (zh) 2020-05-01
CN108803609A (zh) 2018-11-13

Similar Documents

Publication Publication Date Title
WO2019237474A1 (zh) 基于约束在线规划的部分可观察自动驾驶决策方法及系统
EP3485432B1 (en) Training machine learning models on multiple machine learning tasks
CN111766782B (zh) 基于深度强化学习中Actor-Critic框架的策略选择方法
Xu et al. Learning to explore via meta-policy gradient
CN106096729B (zh) 一种面向大规模环境中复杂任务的深度策略学习方法
CN112862281A (zh) 综合能源系统调度模型构建方法、装置、介质及电子设备
CN108921298B (zh) 强化学习多智能体沟通与决策方法
CN112364984A (zh) 一种协作多智能体强化学习方法
CN109284812B (zh) 一种基于改进dqn的视频游戏模拟方法
WO2021057329A1 (zh) 一种作战体系架构建模与最优搜索方法
CN111159489B (zh) 一种搜索方法
CN114261400B (zh) 一种自动驾驶决策方法、装置、设备和存储介质
CN111752304B (zh) 无人机数据采集方法及相关设备
CN108830376A (zh) 针对时间敏感的环境的多价值网络深度强化学习方法
CN111856925A (zh) 基于状态轨迹的对抗式模仿学习方法及装置
CN106897744A (zh) 一种自适应设置深度置信网络参数的方法及系统
CN112131206A (zh) 一种多模型数据库OrientDB参数配置自动调优方法
CN114881228A (zh) 一种基于q学习的平均sac深度强化学习方法和系统
CN115526317A (zh) 基于深度强化学习的多智能体知识推理方法及系统
Moskovitz et al. Reload: Reinforcement learning with optimistic ascent-descent for last-iterate convergence in constrained mdps
CN114371634B (zh) 一种基于多级事后经验回放的无人机作战模拟仿真方法
CN116245009A (zh) 人机策略生成方法
CN116306947A (zh) 一种基于蒙特卡洛树探索的多智能体决策方法
CN113052712B (zh) 社交数据的分析方法、系统及存储介质
Gros et al. Tracking the Race Between Deep Reinforcement Learning and Imitation Learning--Extended Version

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18922118

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18922118

Country of ref document: EP

Kind code of ref document: A1