WO2023103692A1 - Decision planning method for autonomous driving, electronic device, and computer storage medium - Google Patents

Decision planning method for autonomous driving, electronic device, and computer storage medium

Info

Publication number
WO2023103692A1
WO2023103692A1 (PCT/CN2022/130733; CN2022130733W)
Authority
WO
WIPO (PCT)
Prior art keywords
planning
strategy
information
driving
decision
Prior art date
Application number
PCT/CN2022/130733
Other languages
French (fr)
Chinese (zh)
Inventor
陈俊波
雷岚馨
敬巍
王刚
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2023103692A1

Links

Images

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00: Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001: Planning or execution of driving tasks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00: Input parameters relating to data
    • B60W2556/10: Historical data
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00: Input parameters relating to data
    • B60W2556/40: High definition maps

Definitions

  • Fig. 3B is a schematic diagram of a scenario example in the embodiment shown in Fig. 3A;
  • a denotes a candidate action node (for example, an action node in the MCT); μ and σ denote the GP posterior mean and standard deviation of the candidate; and the remaining term is the expected distance between a and the actions in the other branches, where a ∈ Ch(b).
  • The memory 606 is used to store the program 610.
  • The memory 606 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A decision planning method for autonomous driving, an electronic device, and a computer storage medium. The decision planning method comprises: acquiring driving perception information of an object to be decided in a continuous behavior space (S302, S404), wherein the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided; obtaining, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy (S304, S408); and performing decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy (S306, S410). With this decision planning method, decision planning can be performed effectively in strongly interactive autonomous driving scenarios, thereby improving the decision-making effect.

Description

Decision planning method for autonomous driving, electronic device, and computer storage medium
This application claims priority to the Chinese patent application No. 202111481018.4, entitled "Decision planning method for autonomous driving, electronic device, and computer storage medium", filed with the China Patent Office on December 7, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the technical field of autonomous driving, and in particular to a decision planning method for autonomous driving, an electronic device, and a computer storage medium.
Background
Autonomous driving technology uses communication, computer, network, and control technologies to perform real-time, continuous control of corresponding devices (such as self-driving vehicles, drones, and robots).
With the development of autonomous driving technology, driving decision planning is applied in it more and more. Taking self-driving vehicles as an example, current autonomous driving decision planning can give specific driving suggestions according to changes in road conditions, such as encountering pedestrians or other vehicles or running into congestion, and control the self-driving vehicle to perform reasonable driving operations. In some scenarios, however, such as strongly interactive autonomous driving scenarios, effective decision plans cannot be given because of problems such as insufficiently fine data granularity.
A strongly interactive scenario in autonomous driving is one in which the vehicle needs to frequently adjust its own decision plan based on the other party's decisions. Such scenarios typically arise in low-speed, congested situations, for example meeting or passing an oncoming vehicle on a narrow road, or negotiating a roundabout. In these scenarios, traditional decision planning rarely works well. Existing technical solutions therefore suffer from poor decision planning performance.
Summary
In view of this, the embodiments of this application provide a decision planning solution for autonomous driving to at least partially solve the above problems.
According to a first aspect of the embodiments of this application, a decision planning method for autonomous driving is provided, including: acquiring driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided; obtaining, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and performing decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
According to a second aspect of the embodiments of this application, a decision planning apparatus for autonomous driving is provided, including:
a first acquisition module, configured to acquire driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided;
a second acquisition module, configured to obtain, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
a planning module, configured to perform decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
According to a third aspect of the embodiments of this application, an electronic device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the method described in the first aspect.
According to a fourth aspect of the embodiments of this application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method described in the first aspect is implemented.
According to a fifth aspect of the embodiments of this application, a computer program product is provided, including computer instructions that instruct a computing device to perform the operations corresponding to the method described in the first aspect.
According to the autonomous driving decision planning solution provided by the embodiments of this application, for strongly interactive autonomous driving scenarios, on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information has a finer data granularity precisely because of its continuity, so that the planning strategies determined from it also have a finer granularity and are better suited to decision processing in strongly interactive autonomous driving scenarios. On the other hand, multiple planning strategies are obtained for the strongly interactive scenario, and these strategies conform to a Gaussian mixture distribution, so that each of them is highly executable and reasonable and the object can respond effectively to different operations of the other party, which better matches the needs of strong interaction. It can thus be seen that the solutions of the embodiments of this application can effectively perform decision planning in strongly interactive autonomous driving scenarios and improve the decision-making effect.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this application, and those of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a schematic structural diagram of a reinforcement learning system;
Fig. 2 is a schematic structural diagram of a reinforcement learning network model according to an embodiment of this application;
Fig. 3A is a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 1 of this application;
Fig. 3B is a schematic diagram of an example scenario in the embodiment shown in Fig. 3A;
Fig. 4A is a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 2 of this application;
Fig. 4B is a schematic diagram of an MCTS-based reinforcement learning network model in the embodiment shown in Fig. 4A;
Fig. 5 is a structural block diagram of a decision planning apparatus for autonomous driving according to Embodiment 3 of this application;
Fig. 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of this application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application shall fall within the protection scope of the embodiments of this application.
To facilitate understanding of the solutions of the embodiments of this application, a brief schematic description of a reinforcement learning system is first given below, as shown in Fig. 1.
Reinforcement learning is a process in which an agent continuously interacts with an environment, thereby continuously strengthening the agent's decision-making ability. The reinforcement learning system shown in Fig. 1 includes an environment (Env) and an agent (Agent). First, the environment gives the agent an observation (also called a state); after receiving the observation from the environment, the agent performs an action; after receiving the action from the agent, the environment produces a series of responses, for example giving a reward for this action and providing a new observation; the agent updates its own policy according to the reward given by the environment, so that, by continuously interacting with the environment, it eventually obtains the optimal policy.
In practical applications, a reinforcement learning system can be implemented through a policy-value model, which includes a policy branch and a value branch. The policy branch is used by the agent to select the next action based on the state, and can be implemented in various ways, for example through the agent's behavior function, or, in MCTS (Monte Carlo Tree Search)-based reinforcement learning, through MCTS. The value branch is used to obtain the expected cumulative reward when the state follows the policy selected by the policy branch. The reward is a feedback signal, usually a numerical value, indicating how well the agent performed the action it selected based on the state at a given step.
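To make this interaction concrete, the agent-environment cycle of Fig. 1 can be sketched in a few lines of Python. This is only a generic illustration; the `env` and `agent` objects and their method names are placeholders assumed for this sketch, not interfaces defined by this application.

```python
# Generic reinforcement-learning interaction loop (schematic only).
# `env` and `agent` are assumed to expose reset/step and act/update methods.
def run_episode(env, agent, max_steps=1000):
    observation = env.reset()              # environment gives the initial observation (state)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)    # agent selects an action based on the state
        observation, reward, done = env.step(action)   # environment reacts: reward + new observation
        agent.update(observation, action, reward)      # agent updates its policy from the feedback
        total_reward += reward
        if done:
            break
    return total_reward
```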
Specifically, the embodiments of this application adopt an MCTS-based reinforcement learning system, which is collectively referred to herein as the reinforcement learning network model. As shown in Fig. 2, the reinforcement learning network model includes a GNN (Graph Neural Network) part and a policy-value model part. The policy-value model part includes a policy branch and a value branch: the policy branch generates the corresponding planning strategies based on MCTS, and the value branch evaluates the planning strategies generated based on MCTS and the results they produce, yielding the corresponding strategy evaluation values.
It should be noted that the GNN part of the reinforcement learning network model shown in Fig. 2 is optional; it is used to extract features from the input information. In practical applications, the corresponding input information can also be fed directly into the policy-value model part together with the target information shown in Fig. 2. Alternatively, other network models, such as a CNN (Convolutional Neural Network), can be used instead of the GNN to extract features from the input information. Using a GNN, however, the features of the input information can be extracted better and more efficiently, especially image features in strongly interactive autonomous driving scenarios.
Based on the above structure, the autonomous driving decision planning solution provided by the embodiments of this application is described below through several embodiments with reference to the drawings.
Embodiment 1
Referring to Fig. 3A, a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 1 of this application is shown.
The decision planning method for autonomous driving in this embodiment includes the following steps:
Step S302: Acquire driving perception information of the object to be decided in a continuous behavior space.
The object to be decided may be a device that carries an agent apparatus (such as a processor or a chip) and can perform corresponding operations according to instructions from the agent apparatus, for example executing the instructions corresponding to a decision plan; it may also be a device that can upload the corresponding information to a remote agent apparatus (such as a server) and accept instructions from that agent apparatus to perform corresponding operations. In the embodiments of this application, the forms in which the object to be decided can be realized include but are not limited to vehicles capable of autonomous driving, drones, robots, and the like; the embodiments of this application place no limitation on the specific form of the object to be decided.
The driving behavior of the object to be decided is usually continuous, but the processing of the corresponding data can be divided into data processing based on a discrete behavior space and data processing based on a continuous behavior space. Data in the continuous behavior space is likewise continuous, and can therefore more accurately reflect a series of information such as the state and operations of the object to be decided at every moment in the space. The embodiments of this application mainly acquire the driving perception information of the object to be decided in the continuous behavior space. The driving perception information includes at least geometric information, historical driving trajectory information, and map information related to the object to be decided.
The geometric information related to the object to be decided includes the geometric information of the object itself (such as its outline and shape) and the geometric information of the physical objects in its environment (such as the outlines and shapes of surrounding objects like vehicles, obstacles, and road facilities). The historical driving trajectory information related to the object to be decided includes information on the object's driving trajectory within a preset time period before the current moment, where the preset time period can be set appropriately by those skilled in the art according to actual needs, for example the 3-5 seconds before the current moment. The map information related to the object to be decided is usually the map information of the geographical area where the object is currently located, such as map information for a certain range around the current position or information on the surrounding geographical region, and usually includes data on the topology of the road the object is on and of the surrounding roads. The driving perception information can be collected and processed by information collection devices on the object to be decided, such as cameras, radars, and various sensors; with this information, the current driving state of the object to be decided can be described fairly comprehensively and accurately.
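Purely for illustration, the three kinds of driving perception information described above can be grouped into a simple container. The field names and types below are hypothetical and are not prescribed by this application.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) coordinate in a local frame

@dataclass
class DrivingPerception:
    ego_geometry: List[Point]                       # outline of the object to be decided (e.g. corner points)
    nearby_geometry: List[List[Point]]              # outlines of surrounding vehicles, obstacles, road facilities
    history_trajectory: List[Tuple[Point, float]]   # (position, heading) over the preset past window, e.g. 3-5 s
    road_topology: List[List[Point]]                # polylines describing the current road and surrounding roads
```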
Step S304: Obtain, according to the driving perception information and the driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy.
The driving target information characterizes information related to the driving target. In a strongly interactive scenario it can usually include, but is not limited to, information on the target point (or target area), the position of the target point (or target area), the distance of the target point (or target area) from the current position of the object to be decided, and the object's own current speed, state, and heading angle. The target point (or target area) in a strongly interactive scenario can usually be one that is fairly close to the current position of the object to be decided and can be reached by the object with a small number of operations (for example, 1-3). For example, in a scenario of meeting an oncoming vehicle, the target point (or target area) may be a nearby target position ahead of the current position that keeps a certain angle to the oncoming vehicle so as to avoid a collision, and so on. With the driving target information, the driving target of the object to be decided can be effectively determined, providing an effective basis for subsequently determining the planning strategies.
Given the driving perception information and the driving target information, planning strategies for controlling the operation of the object to be decided (including but not limited to navigation, braking, acceleration, car following, lane changing, etc.) can be obtained through an appropriate algorithm or model, such as a reinforcement learning network model. In the embodiments of this application there are multiple planning strategies, and the multiple planning strategies conform to a Gaussian mixture distribution. In the embodiments of this application, the Gaussian mixture distribution refers to the probability distribution output by a Gaussian Mixture Model (GMM), which is a linear combination of several Gaussian distribution functions and therefore has several peaks; these peaks correspond to the multiple planning strategies. In addition, how well each planning strategy would work also needs to be assessed; therefore, a strategy evaluation can be obtained for the multiple planning strategies, which can be concretely realized as a value estimate, a score, or a quality rating, so as to judge how good the effect would be if the object to be decided executed that planning strategy.
In one feasible manner, the multiple planning strategies and the evaluation of each planning strategy can be obtained by means of a policy-value model. For example, a policy-value model generally includes a policy network part and a value network part. The policy network part takes the form of a Mixture Density Network (MDN) and outputs indications of multiple planning strategies conforming to a Gaussian mixture distribution (such as probability distribution information), while the value network part can evaluate the multiple planning strategies generated from the planning strategy indications output by the policy network part and output the strategy evaluation corresponding to each planning strategy. In this way, multiple planning strategies and their corresponding strategy evaluations can be generated efficiently and quickly, providing an effective basis for the subsequent decision planning of the object to be decided. The multiple planning strategies generated from the planning strategy indications output by the policy network part can be produced with an appropriate algorithm chosen by those skilled in the art according to the actual situation; in one feasible manner, they can be generated from the probability distribution of the Gaussian mixture combined with MCTS.
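A minimal sketch of such a policy-value head is given below: the policy part is a mixture density network that outputs GMM parameters (mixture weights, means, and standard deviations) over a continuous action, and the value part outputs a scalar evaluation. The layer sizes and the use of PyTorch are assumptions made for illustration and are not the concrete architecture of this application.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    """Policy branch as a mixture density network (GMM parameters) plus a scalar value branch."""
    def __init__(self, feat_dim=128, action_dim=2, n_components=3):
        super().__init__()
        self.n, self.d = n_components, action_dim
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.pi = nn.Linear(128, n_components)                       # mixture weights (logits)
        self.mu = nn.Linear(128, n_components * action_dim)          # component means
        self.log_sigma = nn.Linear(128, n_components * action_dim)   # component std-devs (log space)
        self.value = nn.Linear(128, 1)                               # strategy evaluation

    def forward(self, x):
        h = self.backbone(x)
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, self.d)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.n, self.d)
        v = self.value(h)
        return pi, mu, sigma, v   # GMM over actions plus a value estimate
```

Each mixture component corresponds to one peak of the Gaussian mixture distribution, i.e. one candidate planning strategy; sampling from (pi, mu, sigma) yields concrete action proposals that can then be evaluated or fed into MCTS.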
In addition, in one feasible manner of the embodiments of this application, the multiple planning strategies conforming to the Gaussian mixture distribution can be obtained directly from the driving perception information and the driving target information. This is not limiting, however: in another feasible manner, they can also be obtained from feature data of the driving perception information together with the driving target information. The feature data can be obtained by extracting and fusing features from the various kinds of driving perception information, and can both comprehensively characterize the driving perception situation of the object to be decided and highlight its salient characteristics. Optionally, the extraction and generation of the feature data of the driving perception information can be realized by a Graph Neural Network (GNN): the driving perception information is input into the GNN, which performs feature extraction and feature fusion based on a multi-head self-attention mechanism to obtain the fused feature vector corresponding to the driving perception information.
A traditional GNN processes the input data as a whole, which involves a large amount of data processing; and when its output is combined with a reinforcement learning network model, especially an MCTS-based one, the processing burden on the GNN is further aggravated because MCTS needs to repeatedly execute simulation and inference. To reduce the data processing burden on the GNN and improve processing efficiency, in one embodiment of this application the GNN is set up to include a geometry sublayer, a driving trajectory sublayer, a map sublayer, a pooling layer, and a global layer.
The geometry sublayer is used to extract features from the geometric information, the driving trajectory sublayer is used to extract features from the historical driving trajectory information, and the map sublayer is used to extract features from the map information; the pooling layer is used to aggregate the features extracted by the geometry sublayer, the driving trajectory sublayer, and the map sublayer respectively; and the global layer is used to perform multi-head self-attention processing on the aggregated features obtained by the geometry sublayer, the driving trajectory sublayer, and the map sublayer respectively, to obtain the fused feature vector.
Further, the geometric information is represented by a vector of the coordinates of the four corners and the center of the object to be decided; the historical driving trajectory information is represented by a time-series encoding vector of the positions and headings of the object to be decided in the five time steps closest to the current time; and the map information uses road topology data: the road boundary is first discretized and divided into a subgraph every 5 m, forming different subgraphs, and the road boundary points are then connected point to point to construct the road information vector. This further facilitates the GNN's processing and improves its efficiency.
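The encodings just described can be illustrated roughly as follows. The helper names and the use of NumPy arrays are assumptions for this sketch, not part of this application.

```python
import numpy as np

def encode_geometry(corners, center):
    """Vector of the four corner coordinates plus the center of the object to be decided."""
    return np.concatenate([np.asarray(corners, dtype=float).ravel(),
                           np.asarray(center, dtype=float)])          # length 10 for 4 corners + 1 center

def encode_trajectory(poses):
    """Time-series encoding of (x, y, heading) for the five most recent time steps."""
    assert len(poses) == 5
    return np.asarray(poses, dtype=float).ravel()                     # length 15

def encode_road_boundary(boundary_points, segment_len=5.0):
    """Discretize a road-boundary polyline into subgraphs of roughly 5 m and connect the
    points inside each subgraph point to point (represented here as edge vectors)."""
    pts = np.asarray(boundary_points, dtype=float)
    subgraphs, current, acc = [], [pts[0]], 0.0
    for p, q in zip(pts[:-1], pts[1:]):
        acc += np.linalg.norm(q - p)
        current.append(q)
        if acc >= segment_len:
            subgraphs.append(np.diff(np.asarray(current), axis=0))    # point-to-point road vectors
            current, acc = [q], 0.0
    if len(current) > 1:
        subgraphs.append(np.diff(np.asarray(current), axis=0))
    return subgraphs
```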
As described above, the different kinds of driving perception information are processed by different sublayers, but in each sublayer the corresponding input vector first passes through a fully connected layer for feature extraction and then through a max-pooling layer that aggregates all feature data from the different nodes processed this time, yielding the aggregated features. For example, the geometry sublayer extracts features from the pieces of geometric information it processes this time and then aggregates them to obtain the aggregated features corresponding to that geometric information; the driving trajectory sublayer extracts features from the pieces of historical driving trajectory information it processes this time and then aggregates them to obtain the corresponding aggregated features; and the map sublayer extracts features from the pieces of road topology information it processes this time and then aggregates them to obtain the corresponding aggregated features. The aggregated features output by these sublayers can have a fixed vector length, and they are then input into the global layer. The global layer can be implemented based on a multi-head self-attention mechanism; after the global layer performs multi-head self-attention processing on the aggregated features input by the three sublayers, the fused feature vector is obtained.
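The per-sublayer processing (a fully connected layer for feature extraction followed by max-pooling, and multi-head self-attention fusion across the three aggregated features) can be sketched as follows. This is a schematic rendering of the pipeline described here, with arbitrary dimensions; it is not the exact network of this application.

```python
import torch
import torch.nn as nn

class SubgraphLayer(nn.Module):
    """One sublayer: fully connected feature extraction, then max-pooling over its nodes."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, node_feats):             # node_feats: (num_nodes, in_dim)
        h = self.fc(node_feats)
        return h.max(dim=0).values             # aggregated feature with a fixed length (out_dim,)

class GlobalFusion(nn.Module):
    """Global layer: multi-head self-attention over the three aggregated sublayer features."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, geo_feat, traj_feat, map_feat):
        tokens = torch.stack([geo_feat, traj_feat, map_feat], dim=0).unsqueeze(0)  # (1, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1).squeeze(0)     # fused feature vector of length dim
```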
As can be seen from the above description, in practical applications the above improvements to the GNN and to the policy-value model can be used individually, but they can also be used together to combine the advantages of both and achieve a better decision planning effect. Based on this, in one feasible manner, this step can be implemented as: inputting the driving perception information into the GNN so that the GNN performs feature extraction and feature fusion based on the multi-head self-attention mechanism to obtain the fused feature vector corresponding to the driving perception information; and inputting the fused feature vector and the vector corresponding to the driving target information of the object to be decided into the policy-value model, and obtaining, through the policy-value model, indications of multiple planning strategies conforming to the Gaussian mixture distribution and the strategy evaluation corresponding to each planning strategy generated from those planning strategy indications.
Through the above process, multiple planning strategies and their corresponding strategy evaluations are obtained, and on this basis further decision planning can be carried out.
Step S306: Perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
Since each planning strategy has a corresponding strategy evaluation, a more preferable planning strategy can be selected according to the value, score, or quality level of the strategy evaluations, and decision planning can then be performed based on the selected planning strategy, for example by issuing operation instructions to the object to be decided, such as whether to use the accelerator and to what degree, whether to brake, the rotation angle of the steering wheel, and so on, thereby instructing the object to be decided how to operate in the strongly interactive scenario and making effective decisions.
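Selecting the strategy then reduces to taking the candidate with the best evaluation, as in the following illustrative snippet; the strategy labels and scores are placeholders that anticipate the example described below.

```python
def choose_strategy(strategies, scores):
    """Select the planning strategy with the highest evaluation score."""
    best = max(range(len(strategies)), key=lambda i: scores[i])
    return strategies[best]

# Three hypothetical candidate strategies with evaluations 0.6, 0.8, and 0.5
plan = choose_strategy(["straight_then_left", "left_then_straight", "left_only"], [0.6, 0.8, 0.5])
# `plan` is then translated into low-level commands (steering angle, accelerator, brake).
```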
It should be noted that in the embodiments of this application, unless otherwise specified, terms such as "multiple" and "a plurality of" mean two or more.
The above process is illustrated below with a specific example, as shown in Fig. 3B.
In this example, the object to be decided is an autonomous vehicle X, and it is assumed that vehicle X meets a manually driven vehicle Y on a narrow road. As shown in Fig. 3B, the driving perception information of vehicle X in the continuous behavior space is first acquired, including the contour information of vehicle X, the contour information of vehicle Y, the contour information of the road edges, and so on. Then the driving target information of vehicle X is obtained: since vehicle X needs to pass vehicle Y, it can place the target point 2 meters ahead of the front of the vehicle, on the side of vehicle Y closer to vehicle X, at 30 degrees to the vehicle body direction, as indicated by the solid dot in Fig. 3B. The driving perception information and the driving target information are both input into the agent carried by vehicle X, such as vehicle X's controller. The controller is provided with a policy-value model and outputs multiple planning strategies according to the input driving perception information and driving target information; in this example, three planning strategies are assumed, namely planning strategies 1, 2, and 3. Suppose planning strategy 1 is to go straight for 1 meter and then drive 1 meter toward the front left, planning strategy 2 is to drive 1 meter toward the front left and then go straight for 1 meter, and planning strategy 3 is to drive 2 meters toward the front left. Suppose further that, through the policy-value model, the controller predicts the strategy evaluation of planning strategy 1 as a strategy score of 0.6, that of planning strategy 2 as 0.8, and that of planning strategy 3 as 0.5. Based on these assumptions, the controller decides to use planning strategy 2 and, taking planning strategy 2 as the decision rule, generates the instructions corresponding to this decision plan for vehicle X, for example turning the steering wheel 30 degrees to the left and easing the accelerator by 30% of its travel; after driving 1 meter in this state, the steering wheel is returned to 0 degrees and the accelerator is held while driving another 1 meter. At this point, vehicle X will have passed vehicle Y, and the new position of vehicle X is as shown for vehicle X in Fig. 3B.
It can be seen that, with this embodiment, for strongly interactive autonomous driving scenarios, on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information has a finer data granularity because of its continuity, so that the planning strategies determined from it also have a finer granularity and are better suited to decision processing in strongly interactive scenarios. On the other hand, multiple planning strategies are obtained for the strongly interactive scenario, and these strategies conform to a Gaussian mixture distribution, so that each of them is highly executable and reasonable and the object can respond effectively to different operations of the other party, which better matches the needs of strong interaction. It can thus be seen that the solution of this embodiment can effectively perform decision planning in strongly interactive autonomous driving scenarios and improve the decision-making effect.
Embodiment 2
Referring to Fig. 4A, a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 2 of this application is shown.
This embodiment focuses on training the reinforcement learning network model in combination with MCTS; once trained, the reinforcement learning network model can be applied to the autonomous driving decision planning solution of the preceding embodiment to perform effective decision planning for the object to be decided.
The decision planning method for autonomous driving in this embodiment includes the following steps:
Step S402: Train the policy-value model in the reinforcement learning network model.
As shown in Fig. 2, the reinforcement learning network model in the embodiments of this application includes a GNN part and a policy-value model part, where the GNN part can be a pre-trained model; this embodiment therefore focuses on the training of the policy-value model.
Specifically, the policy-value model can be trained based on decision planning supervision information generated by MCTS. An MCTS-based reinforcement learning network model is shown in Fig. 4B, from which it can be seen that both the policy branch P and the value branch V of the policy-value model are realized based on MCTS. In the embodiments of this application, the input of the policy-value model is the driving perception information and the driving target information, and the output is the probability and evaluation of every feasible action (a planning strategy in this embodiment) under the input information. The training objective is to make the action probabilities output by the policy-value model closer to the probabilities output by MCTS; the output of MCTS can therefore be regarded as the supervision information for the policy-value model.
Traditional MCTS-based policy-value models are mostly used in discrete behavior spaces; apart from insufficient action fineness, they also tend to get stuck in certain scenarios, and therefore cannot be applied to the solution of the embodiments of this application. For this reason, when the policy-value model is trained in the embodiments of this application, in each training iteration, the information (such as probabilities) of multiple planning strategy samples output by MCTS is obtained based on driving perception sample data in the continuous behavior space, driving target sample information, and KR-AUCB (a kernel regression-based asymptotical upper confidence bound); among the multiple planning strategy samples, the information (such as the probability) of the planning strategy sample with the highest strategy evaluation is then used as the supervision information to train the policy-value model.
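The training target described here (pulling the policy output toward the MCTS-derived distribution while regressing the value branch) can be written, under the usual AlphaZero-style assumptions, roughly as the loss below. The exact loss of this application is not spelled out in this passage, so this is only an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def policy_value_loss(log_probs_model, target_probs_mcts, value_pred, value_target, value_weight=1.0):
    """Cross-entropy between the model's action distribution and the MCTS-derived target
    distribution, plus a regression term for the value branch (illustrative only)."""
    policy_loss = -(target_probs_mcts * log_probs_model).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred.squeeze(-1), value_target)
    return policy_loss + value_weight * value_loss
```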
MCTS is a planning algorithm that builds a tree structure based on the Monte Carlo method for reasoning and exploration, and it can usually be combined with neural network models and reinforcement learning. MCTS generally includes four phases: select, expand, evaluate (simulate), and backup (backpropagate). The select phase starts from the root node R of the Monte Carlo tree and recursively selects a child node until a leaf node L is reached. This involves how to choose the next node based on the current node; the commonly used approach is UCB (Upper Confidence Bound), but UCB is inefficient, and for this reason the embodiments of this application provide an efficient node selection method, KR-AUCB, which can be used in both the select and expand phases and is described in detail below. The expand phase creates a new child node C under the leaf node L if the operation at L has not ended (for example, driving needs to continue). Traditionally, only one new child node C is created, which limits the newly created node data and the node paths (strategies) generated from them; the embodiments of this application therefore also improve this phase so that multiple child nodes can be created based on the probability distribution of the GMM output by the policy branch, expanding the generated strategies and improving strategy generation efficiency; this part is also detailed below. The evaluate phase simulates, according to the generated strategy, the corresponding actions from the position of the newly expanded child node to the final result, and from this computes the quality of the newly created node. The backup phase then propagates the quality of the newly created node backward along the path and updates the quality of its ancestor nodes.
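The four phases can be arranged into the usual search loop, sketched below in schematic Python. `select_child`, `expand_children`, `simulate`, and the node attributes are hypothetical helpers standing in for the KR-AUCB selection, GMM-based expansion, and simulation described in this embodiment.

```python
def mcts_search(root, n_iterations, select_child, expand_children, simulate):
    """Generic MCTS loop: select -> expand -> evaluate (simulate) -> backup."""
    for _ in range(n_iterations):
        # 1) Select: walk down from the root until a leaf node is reached.
        node, path = root, [root]
        while node.children:
            node = select_child(node)          # e.g. a KR-AUCB-based choice
            path.append(node)
        # 2) Expand: create child nodes if the episode has not ended at this leaf.
        if not node.is_terminal:
            children = expand_children(node)   # e.g. several children sampled from the GMM policy
            node.children.extend(children)
            if children:
                node = children[0]
                path.append(node)
        # 3) Evaluate: simulate from the new node to estimate its quality.
        quality = simulate(node)
        # 4) Backup: propagate the quality back along the selected path.
        for n in path:
            n.visits += 1
            n.value_sum += quality
```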
The improvements of the solution of the embodiments of this application to the select and expand phases are described in detail below.
In one feasible manner provided by the embodiments of this application, obtaining the information of the multiple planning strategy samples output by MCTS based on the continuous-behavior-space driving perception sample data, the driving target sample information, and KR-AUCB (Kernel Regression-based Asymptotical PUCB) can include: based on the continuous-behavior-space driving perception sample data and the driving target sample information, selecting nodes from the corresponding MCT (Monte Carlo tree) using KR-AUCB to form an initial planning strategy path; creating multiple child nodes for the leaf node of the initial planning strategy path according to multiple action samples conforming to the Gaussian mixture distribution output by the reinforcement learning network model; obtaining multiple extended planning strategy paths based on the created child nodes and the initial planning strategy path; performing planning strategy simulation on the multiple extended planning strategy paths to obtain the strategy evaluation corresponding to each extended planning strategy path; and outputting the information of the multiple planning strategy samples according to the extended planning strategy paths and their corresponding strategy evaluations.
In each training iteration, selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning strategy path can include: first selecting a node from the MCT, usually the node with the largest KR-AUCB value (initially this is an unvisited node; after the policy-value model has undergone multiple training iterations, the node with the largest KR-AUCB value may be either an unvisited node or a visited one); for each level of the at least one level of non-leaf nodes corresponding to this node, selecting the non-leaf node whose KR-AUCB value is higher than that of the other sibling nodes at that level or whose visit count is lower than that of the other siblings; from the leaf nodes corresponding to the last level of non-leaf nodes, selecting a leaf node (which may be the maximum-value leaf node or a randomly selected one); and forming the initial planning strategy path from the selected nodes at all levels. The KR-AUCB value can be computed according to Formula 1 below.
As shown in Fig. 4B, in the select phase of MCTS, nodes are selected through KR-AUCB based on the continuous-behavior-space driving perception sample data and the driving target sample information.
In the KR-AUCB approach, an unvisited node is first selected; for each level of the at least one level of non-leaf nodes corresponding to this unvisited node, the non-leaf node whose prior probability is higher than that of the other sibling nodes at that level or whose visit count is lower than that of the other siblings is selected; and from the leaf nodes corresponding to the last level of non-leaf nodes, the maximum-value leaf node is selected.
Optionally, KR-AUCB can be expressed in the form of the following formulas. Formulas 1 to 4, which define the KR-AUCB value and its kernel-regression terms, are rendered as images in the original filing (PCTCN2022130733-appb-000001 to appb-000004) and are not reproduced here; Formula 5 is:
p_asym = λ·P_prior + (1 - λ)·P_uniform    (Formula 5)
In the above formulas, a denotes the selected action (for example, an action node in the MCT); b denotes an already existing sibling action; the kernel between them is a Gaussian probability density; the value term is the expectation, taken under this kernel density over the existing sibling actions, of the output of the value branch of the policy-value model for those actions; c denotes the expansion parameter for expanded nodes; P_asym denotes the prior policy that asymptotically controls the decay of expansion; W(·) denotes the node visit count; n_a denotes the number of action nodes; P_uniform denotes the uniform distribution over the action space A; and P_prior = p_θ denotes the probability distribution of the prior actions output by the policy branch of the policy-value model, with p_θ being the output of the policy branch.
在进行初始节点选择时,可基于上述公式一,对MCT中的节点进行选择,以形成初始规划策略路径,如图4B中左侧实线标示出的、由浅灰色节点形成的路径,其中的叶子节点为左下线实线实心圆形所示节点,标示为L。When selecting the initial node, the nodes in the MCT can be selected based on the above formula 1 to form the initial planning strategy path, as shown by the solid line on the left in Figure 4B, the path formed by the light gray nodes, in which the leaves The node is the node shown in the solid circle with the solid line on the lower left, and is marked as L.
It should be noted that Formula 1 above can also be applied in the expand step of MCTS, where it likewise allows nodes to be selected more efficiently from newly created nodes.
After nodes have been selected from the MCT through the above process to form the initial planning-strategy path, the select step of MCTS can be regarded as complete, and the expand step can then be performed.
In the expand step of the embodiments of this application, nodes also need to be expanded, i.e., lower-level child nodes are created from a leaf node. Unlike traditional MCTS, which creates one child node at a time, in the embodiments of this application multiple child nodes can be created at once based on the Gaussian mixture distribution output by the policy branch of the policy-value model.
To facilitate the description of this process, the policy branch in the embodiments of this application is described first. The policy branch is implemented by a Mixture Density Network (MDN), which models the probability distribution of the action output, i.e., a Gaussian mixture distribution, by outputting the parameters of a Gaussian Mixture Model (GMM). As can be seen from the middle MCT in Fig. 4B, when this output distribution of the policy branch is applied in the expand step of MCTS, multiple child nodes are created for node L at the same time; two are shown in the example of Fig. 4B.
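The following Python sketch illustrates how an MDN policy head might output GMM parameters and how several child actions could be sampled from it at once. The layer sizes, the number of mixture components, and the sampling helper are assumptions for illustration only, not the claimed network.

```python
import torch
import torch.nn as nn

class MDNPolicyHead(nn.Module):
    """Policy-branch sketch: maps a fused feature vector to GMM parameters."""
    def __init__(self, feat_dim=128, n_components=3, action_dim=2):
        super().__init__()
        self.n, self.d = n_components, action_dim
        self.pi = nn.Linear(feat_dim, n_components)                    # mixture weights
        self.mu = nn.Linear(feat_dim, n_components * action_dim)       # component means
        self.log_sigma = nn.Linear(feat_dim, n_components * action_dim)  # component std devs

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, self.d)
        sigma = self.log_sigma(h).view(-1, self.n, self.d).exp()
        return pi, mu, sigma

def sample_child_actions(pi, mu, sigma, k=2):
    """Draw k candidate actions from the GMM to create k child nodes at once."""
    comp = torch.multinomial(pi, k, replacement=True)                  # (batch, k)
    idx = comp.unsqueeze(-1).expand(-1, -1, mu.size(-1))
    means = torch.gather(mu, 1, idx)
    stds = torch.gather(sigma, 1, idx)
    return means + stds * torch.randn_like(means)                      # (batch, k, action_dim)
```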
In the embodiments of this application, when selecting nodes from the newly created child nodes, the traditional Bayesian-inference approach is improved so that effective nodes can be selected more quickly. Based on this, in one feasible approach, obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path may include: for each of the created child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining the candidacy of that child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes; selecting candidate child nodes from the multiple child nodes according to their candidacies; and obtaining multiple extended planning-strategy paths from the selected candidate child nodes and the initial planning-strategy path. Obtaining the candidacy of a child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes can also be regarded as constructing the potential-energy (acquisition) function of the Bayesian-inference approach, with its result used as the candidacy of the child node.
A specific example of this process is as follows:
Taking node b as an example, node expansion is performed on it and new child nodes are created for it. First, an action node a* is sampled as a candidate via the branch Ch(b) of existing actions (an existing action is denoted a in Formula 6):
[Formula 6 appears as an image (PCTCN2022130733-appb-000015) in the published application and is not reproduced here.]
Here, A(·) denotes the acquisition function, which is used to steer sampling toward regions where the probability of finding the best node increases.
Based on this, the acquisition function is defined as:
[Formula 7 appears as an image (PCTCN2022130733-appb-000016) in the published application and is not reproduced here.]
In Formula 7, the candidate action node (for example, an action node in the MCT) is scored using μ and σ, the Gaussian-process (GP) posterior mean and standard deviation of the candidate, together with the expected distance between the candidate and the actions a in the other branches, where a ∈ Ch(b). [A further expression appears as an image (PCTCN2022130733-appb-000021) in the published application and is not reproduced here.]
The first two terms in Formula 7 can be regarded as an upper bound on the expected value of selecting the candidate, and the last term is a penalty term that penalizes candidate actions for being too close to existing actions. ω1 and ω2 are adjustable coefficients used to balance node expansion.
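A hedged sketch of such an acquisition function is given below: a GP-UCB-style term (posterior mean plus weighted standard deviation) minus a penalty for candidates that sit too close to the existing actions. The GP fitting helper, the distance measure, and the way candidates are proposed are assumptions for illustration; the published Formula 7 is available only as an image.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def acquisition(candidates, existing_actions, existing_values,
                w1=1.0, w2=1.0, penalty_weight=1.0):
    """Score candidate actions: GP upper bound minus a closeness penalty.

    `existing_actions` / `existing_values` come from the already expanded
    branch Ch(b); `candidates` are newly proposed actions for node b.
    """
    gp = GaussianProcessRegressor().fit(existing_actions, existing_values)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Mean distance between each candidate and the existing sibling actions.
    dists = np.linalg.norm(
        candidates[:, None, :] - existing_actions[None, :, :], axis=-1).mean(axis=1)
    # Penalize candidates whose actions are too close to existing actions.
    return w1 * mu + w2 * sigma - penalty_weight / (dists + 1e-6)

def pick_candidate(candidates, existing_actions, existing_values):
    """Select a* as the candidate with the highest acquisition value."""
    scores = acquisition(np.asarray(candidates),
                         np.asarray(existing_actions),
                         np.asarray(existing_values))
    return candidates[int(np.argmax(scores))]
```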
After the candidate action has been sampled, it is added to the set of visited action nodes. The action node to take is then selected based on KR-AUCB. If the selected node is not the candidate a*, a* is removed from the set. Next, the action corresponding to the selected action node is executed, and a new state is generated at the next node level. After one iteration, the expected value of the leaf node is given by the value branch of the policy-value model. Finally, the value of every node along the traversal is updated by back-propagation.
It can be seen that, based on the above process and within the select, expand, evaluate and backup framework of MCTS, a Gaussian-process function is used to fit the information of the child nodes, and an improved Bayesian-inference scheme is used to infer the most promising new child nodes, which effectively improves the information utilization and efficiency of the expand step. In addition, if a newly created node is not chosen during the select step, it is deleted; this avoids dependence on preset hyperparameters in the expand step. Overall, the above MCTS process improves the accuracy of the evaluate step and the precision of the select step.
Based on the above process, the MCTS rollouts are relied upon to generate relatively good continuous decision trajectories, i.e., planning strategies, which serve as supervision information for the reinforcement learning of the policy-value model.
During model training, each step of MCTS rolls out multiple planning strategies (for example, 100), and the one with the most visits is used to train the policy-value model.
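The training loop described in this and the preceding paragraphs can be sketched as follows. `run_mcts`, `gmm_log_prob`, and the `model` interface are hypothetical names standing in for the components described above; the sketch only illustrates the idea of using the most-visited MCTS rollout as the supervision target.

```python
def train_step(model, optimizer, perception_sample, goal_sample, n_rollouts=100):
    """One training iteration: roll out MCTS, supervise with the most-visited plan."""
    # run_mcts is assumed to return a list of (plan, visit_count, value) tuples
    # produced with KR-AUCB selection and GP-guided expansion, as described above.
    rollouts = run_mcts(model, perception_sample, goal_sample, n_rollouts)
    best_plan, _, best_value = max(rollouts, key=lambda r: r[1])

    pi, mu, sigma, value = model(perception_sample, goal_sample)
    # Policy loss: negative log-likelihood of the supervising plan under the GMM.
    policy_loss = -gmm_log_prob(best_plan, pi, mu, sigma).mean()
    # Value loss: regress the value head toward the rollout's evaluation.
    value_loss = ((value - best_value) ** 2).mean()

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```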
After the model has been trained, it can be applied to actual decision planning, as described in the following steps.
Step S404: acquiring the driving perception information of the object to be decided in the continuous behavior space.
The driving perception information includes: geometric information related to the object to be decided, historical driving-trajectory information, and map information.
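For illustration, the perception input might be organized as a simple container like the one below; the field names and array shapes are assumptions, not part of the claimed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingPerception:
    """Continuous-behavior-space perception input for one planning cycle."""
    geometry: np.ndarray        # geometric info of the ego and nearby agents, e.g. (N, 4)
    history_tracks: np.ndarray  # historical trajectories, e.g. (N, T, 2) positions over time
    map_polylines: np.ndarray   # vectorized map elements (lane centerlines, boundaries)

@dataclass
class DrivingGoal:
    """Driving target information, e.g. a goal position and a desired speed."""
    target_xy: np.ndarray
    target_speed: float
```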
Step S406: inputting the driving perception information into the GNN, performing feature extraction and feature fusion based on a multi-head self-attention mechanism through the GNN, and obtaining a fused feature vector corresponding to the driving perception information.
In this embodiment, the driving perception information is processed by a GNN so that the relevant features can be extracted better and more efficiently.
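A hedged sketch of the subgraph-plus-attention fusion is shown below: each input type (geometry, trajectories, map) is encoded separately, pooled, and then fused with multi-head self-attention. The encoder sizes, the max-pooling, and the final mean over attended tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceptionFusion(nn.Module):
    """Encode geometry / trajectory / map subgraphs and fuse them with
    multi-head self-attention into a single feature vector (illustrative)."""
    def __init__(self, in_dims=(4, 6, 8), hidden=128, heads=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for d in in_dims)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, geometry, tracks, map_feats):
        # Per-subgraph encoding followed by max-pooling over the nodes of each subgraph.
        tokens = torch.stack([
            enc(x).max(dim=1).values
            for enc, x in zip(self.encoders, (geometry, tracks, map_feats))
        ], dim=1)                                  # (batch, 3, hidden)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)                   # (batch, hidden) fused feature vector
```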
Step S408: inputting the fused feature vector and the vector corresponding to the driving target information of the object to be decided into the policy-value model, and obtaining, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and the strategy evaluations corresponding to the planning strategies generated from those indications.
In this step, the policy branch of the policy-value model outputs GMM parameters to model the probability distribution of the action output, i.e., a Gaussian mixture distribution; based on this distribution and the aforementioned MCTS process, multiple planning strategies are generated. The value branch of the policy-value model then evaluates the multiple planning strategies to obtain the strategy evaluation of each planning strategy. The specific implementation of the value branch may follow existing implementations of value branches in policy-value models, which is not limited in the embodiments of this application.
Step S410: performing decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
For example, the planning strategy with the highest strategy evaluation may be selected from the multiple planning strategies to generate the decision plan, and the object to be decided is then given operating instructions according to that decision plan.
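Steps S404 to S410 can be strung together as in the sketch below. `PerceptionFusion` and `sample_child_actions` refer to the illustrative components above, and the `policy_head` / `value_head` interface is an assumption rather than the claimed implementation.

```python
import torch

def plan_once(fusion, policy_value_model, perception, goal_vec, k=5):
    """Illustrative inference cycle: fuse perception, propose k plans, pick the best."""
    feats = fusion(perception.geometry, perception.history_tracks, perception.map_polylines)
    h = torch.cat([feats, goal_vec], dim=-1)
    pi, mu, sigma = policy_value_model.policy_head(h)      # GMM planning-strategy indications
    plans = sample_child_actions(pi, mu, sigma, k=k)        # k candidate strategies
    scores = policy_value_model.value_head(h, plans)        # strategy evaluation per plan
    best = scores.argmax(dim=1)                              # choose the highest-evaluated plan
    return plans[torch.arange(plans.size(0)), best]
```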
The description of the above steps S404-S410 is relatively brief; for their specific implementation, reference may be made to the description of the corresponding steps in Embodiment One and to the description of step S402, which is not repeated here.
Through this embodiment, for strong-interaction scenarios in autonomous driving: on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information also has finer data granularity because of that continuity, so that the planning strategies determined from it likewise have finer granularity and are better suited to decision-making in strong-interaction scenarios. On the other hand, multiple planning strategies are obtained for the strong-interaction scenario, and these strategies conform to a Gaussian mixture distribution, so that all of them are highly executable and reasonable and can respond effectively to the other party's different maneuvers, which better meets the needs of strong interaction. It can therefore be seen that the solution of this embodiment enables effective decision planning in strong-interaction scenarios of autonomous driving and improves the decision-making effect.
Embodiment Three
Referring to Fig. 5, a structural block diagram of a decision planning apparatus for autonomous driving according to Embodiment Three of this application is shown.
The decision planning apparatus for autonomous driving of this embodiment includes: a first acquisition module 502, configured to acquire driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information related to the object to be decided, historical driving-trajectory information, and map information; a second acquisition module 504, configured to obtain, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and a planning module 506, configured to perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
Optionally, the second acquisition module 504 is configured to: input the driving perception information into a graph neural network model, perform feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model, and obtain a fused feature vector corresponding to the driving perception information; and input the fused feature vector and the vector corresponding to the driving target information of the object to be decided into a policy-value model, and obtain, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and the strategy evaluations corresponding to the planning strategies generated from those indications.
Optionally, the policy-value model includes a policy network part and a value network part; the policy network part is a mixture density network configured to output multiple planning-strategy indications conforming to a Gaussian mixture distribution, and the value network part is configured to evaluate the multiple planning strategies generated from the planning-strategy indications output by the policy network part and to output the strategy evaluation corresponding to each planning strategy.
Optionally, the graph neural network model includes a geometry subgraph layer, a driving-trajectory subgraph layer, a map subgraph layer, a pooling layer, and a global layer, where: the geometry subgraph layer is used for feature extraction of the geometric information, the driving-trajectory subgraph layer is used for feature extraction of the historical driving-trajectory information, and the map subgraph layer is used for feature extraction of the map information; the pooling layer is used to aggregate the features respectively extracted by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer; and the global layer is used to perform multi-head self-attention processing on the aggregated features respectively obtained by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer, to obtain the fused feature vector.
Optionally, the decision planning apparatus for autonomous driving of this embodiment further includes a training module 508, configured to train the policy-value model based on decision-planning supervision information generated by MCTS.
Optionally, the training module 508 is configured to: in each training iteration, obtain the information of multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB; and train the policy-value model using, as supervision information, the information of the planning-strategy sample with the highest strategy evaluation among the multiple planning-strategy samples.
Optionally, the training module 508 obtaining the information of the multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB includes: selecting nodes from the corresponding MCT using KR-AUCB, based on the driving perception sample data in the continuous behavior space and the driving target sample information, to form an initial planning-strategy path; creating multiple child nodes for a leaf node of the initial planning-strategy path according to multiple action samples conforming to a Gaussian mixture distribution output by the reinforcement network model; obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path; performing planning-strategy simulation on the multiple extended planning-strategy paths to obtain the strategy evaluation corresponding to each extended planning-strategy path; and outputting multiple planning-strategy samples according to the extended planning-strategy paths and their corresponding strategy evaluations.
Optionally, the training module 508 obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path includes: for each of the created child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining the candidacy of that child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes; selecting candidate child nodes from the multiple child nodes according to their candidacies; and obtaining multiple extended planning-strategy paths from the selected candidate child nodes and the initial planning-strategy path.
Optionally, the training module 508 selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning-strategy path includes: first selecting an unvisited node from the MCT; for each level of the at least one level of non-leaf nodes corresponding to that unvisited node, selecting the non-leaf node whose prior probability is higher than, or whose visit count is lower than, that of its sibling nodes; selecting the maximum-value leaf node from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and forming the initial planning-strategy path from the selected nodes at each level.
The decision planning apparatus for autonomous driving of this embodiment is used to implement the corresponding decision planning methods for autonomous driving in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here. In addition, for the functional implementation of each module in the apparatus of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which is likewise not repeated here.
Embodiment Four
Referring to Fig. 6, a schematic structural diagram of an electronic device according to Embodiment Four of this application is shown; the specific embodiments of this application do not limit the specific implementation of the electronic device.
As shown in Fig. 6, the electronic device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.
Where:
The processor 602, the communications interface 604, and the memory 606 communicate with one another through the communication bus 608.
The communications interface 604 is used for communicating with other electronic devices or servers.
The processor 602 is configured to execute the program 610, and specifically may execute the relevant steps in the above embodiments of the decision planning method for autonomous driving.
Specifically, the program 610 may include program code, and the program code includes computer operation instructions.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, for example at least one disk memory.
The program 610 may specifically be used to cause the processor 602 to execute the decision planning method for autonomous driving described in either of the foregoing Embodiments One and Two.
For the specific implementation of each step in the program 610, reference may be made to the corresponding descriptions of the corresponding steps and units in the above embodiments of the decision planning method for autonomous driving, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, with corresponding effects, which are likewise not repeated here.
An embodiment of this application further provides a computer program product, including computer instructions that instruct a computing device to perform operations corresponding to any of the decision planning methods for autonomous driving in the foregoing method embodiments.
It should be pointed out that, as required by implementation, each component/step described in the embodiments of this application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps, so as to achieve the purposes of the embodiments of this application.
The above methods according to the embodiments of this application may be implemented in hardware or firmware, or implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, processor, or hardware, the decision planning method for autonomous driving described herein is implemented. Furthermore, when a general-purpose computer accesses code for implementing the decision planning method for autonomous driving shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the decision planning method for autonomous driving shown herein.
Those of ordinary skill in the art can appreciate that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods for each specific application to implement the described functions, but such implementation should not be regarded as going beyond the scope of the embodiments of this application.
The above implementations are only used to illustrate the embodiments of this application and are not intended to limit them. Those of ordinary skill in the relevant technical fields can also make various changes and variations without departing from the spirit and scope of the embodiments of this application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of this application, and the scope of patent protection of the embodiments of this application should be defined by the claims.

Claims (13)

  1. A decision planning method for autonomous driving, comprising:
    acquiring driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information related to the object to be decided, historical driving-trajectory information, and map information;
    obtaining, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
    performing decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
  2. The method according to claim 1, wherein obtaining, according to the driving perception information and driving target information, the multiple planning strategies conforming to a Gaussian mixture distribution and the strategy evaluation corresponding to each planning strategy comprises:
    inputting the driving perception information into a graph neural network model, performing feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model, and obtaining a fused feature vector corresponding to the driving perception information; and
    inputting the fused feature vector and a vector corresponding to the driving target information of the object to be decided into a policy-value model, and obtaining, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and strategy evaluations corresponding to the planning strategies generated according to the planning-strategy indications.
  3. The method according to claim 2, wherein the policy-value model comprises a policy network part and a value network part; the policy network part is a mixture density network configured to output multiple planning-strategy indications conforming to a Gaussian mixture distribution; and the value network part is configured to evaluate the multiple planning strategies generated according to the planning-strategy indications output by the policy network part and to output the strategy evaluation corresponding to each planning strategy.
  4. The method according to claim 2 or 3, wherein the graph neural network model comprises a geometry subgraph layer, a driving-trajectory subgraph layer, a map subgraph layer, a pooling layer, and a global layer;
    wherein:
    the geometry subgraph layer is used for feature extraction of the geometric information, the driving-trajectory subgraph layer is used for feature extraction of the historical driving-trajectory information, and the map subgraph layer is used for feature extraction of the map information; and
    the pooling layer is used to perform feature aggregation on the features respectively extracted by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer; and the global layer is used to perform multi-head self-attention processing on the aggregated features respectively obtained by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer, to obtain the fused feature vector.
  5. The method according to claim 2 or 3, wherein the method further comprises:
    training the policy-value model based on decision-planning supervision information generated by Monte Carlo tree search (MCTS).
  6. The method according to claim 5, wherein training the policy-value model based on the decision-planning supervision information generated by the MCTS comprises:
    in each training iteration, obtaining information of multiple planning-strategy samples output by the MCTS based on driving perception sample data in the continuous behavior space, driving target sample information, and a kernel-regression-based asymptotic upper confidence bound (KR-AUCB); and
    training the policy-value model using, as supervision information, the information of the planning-strategy sample with the highest strategy-evaluation value among the multiple planning-strategy samples.
  7. The method according to claim 6, wherein obtaining the information of the multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB comprises:
    selecting, based on the driving perception sample data in the continuous behavior space and the driving target sample information, nodes from a corresponding Monte Carlo tree (MCT) using KR-AUCB to form an initial planning strategy;
    creating multiple child nodes for a leaf node of the initial planning strategy according to multiple action samples conforming to a Gaussian mixture distribution output by a reinforcement network model;
    obtaining multiple extended planning strategies based on the created multiple child nodes and the initial planning strategy;
    performing strategy simulation on the multiple extended planning strategies to obtain a strategy evaluation corresponding to each extended planning strategy; and
    outputting multiple planning-strategy samples according to the extended planning strategies and their corresponding strategy evaluations.
  8. The method according to claim 7, wherein obtaining the multiple extended planning strategies based on the created multiple child nodes and the initial planning strategy comprises:
    for each of the created multiple child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining a candidacy of that child node according to the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes;
    selecting candidate child nodes from the multiple child nodes according to the candidacies of the child nodes; and
    obtaining the multiple extended planning strategies according to the selected candidate child nodes and the initial planning strategy.
  9. The method according to claim 7 or 8, wherein selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning strategy comprises:
    first selecting, from the MCT, the node with the largest KR-AUCB value;
    for each level of at least one level of non-leaf nodes corresponding to that node, selecting a non-leaf node whose KR-AUCB value is higher than, or whose visit count is lower than, that of the other sibling child nodes;
    selecting a leaf node from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and
    forming the initial planning strategy according to the selected nodes at each level.
  10. A decision planning apparatus for autonomous driving, comprising:
    a first acquisition module, configured to acquire driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information related to the object to be decided, historical driving-trajectory information, and map information;
    a second acquisition module, configured to obtain, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
    a planning module, configured to perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
  11. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus; and
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the decision planning method for autonomous driving according to any one of claims 1-9.
  12. A computer storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the decision planning method for autonomous driving according to any one of claims 1-9 is implemented.
  13. A computer program product, comprising computer instructions, wherein the computer instructions instruct a computing device to perform operations corresponding to the decision planning method for autonomous driving according to any one of claims 1-9.
PCT/CN2022/130733 2021-12-07 2022-11-08 Decision planning method for autonomous driving, electronic device, and computer storage medium WO2023103692A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111481018.4A CN113879339A (en) 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium
CN202111481018.4 2021-12-07

Publications (1)

Publication Number Publication Date
WO2023103692A1 true WO2023103692A1 (en) 2023-06-15

Family

ID=79015785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130733 WO2023103692A1 (en) 2021-12-07 2022-11-08 Decision planning method for autonomous driving, electronic device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN113879339A (en)
WO (1) WO2023103692A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114524A (en) * 2023-10-23 2023-11-24 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium
CN114694123B (en) * 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium
CN115731690B (en) * 2022-11-18 2023-11-28 北京理工大学 Unmanned public transportation cluster decision-making method based on graphic neural network reinforcement learning
CN115762169B (en) * 2023-01-06 2023-04-25 中通新能源汽车有限公司 Unmanned intelligent control system and method for sanitation vehicle

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109557912A (en) * 2018-10-11 2019-04-02 同济大学 A kind of decision rule method of automatic Pilot job that requires special skills vehicle
CN110471411A (en) * 2019-07-26 2019-11-19 华为技术有限公司 Automatic Pilot method and servomechanism
JP2020060901A (en) * 2018-10-09 2020-04-16 アルパイン株式会社 Operation information management device
US20210080955A1 (en) * 2019-09-12 2021-03-18 Uatc, Llc Systems and Methods for Vehicle Motion Planning Based on Uncertainty
CN113741412A (en) * 2020-05-29 2021-12-03 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074480B2 (en) * 2019-01-31 2021-07-27 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
US20220139222A1 (en) * 2019-02-13 2022-05-05 Beijing Baidu Netcom Science And Technology Co., Ltd. Driving control method and apparatus, device, medium, and system
CN109703568B (en) * 2019-02-19 2020-08-18 百度在线网络技术(北京)有限公司 Method, device and server for learning driving strategy of automatic driving vehicle in real time
CN111123957B (en) * 2020-03-31 2020-09-04 北京三快在线科技有限公司 Method and device for planning track
DE102020204351A1 (en) * 2020-04-03 2021-10-07 Robert Bosch Gesellschaft mit beschränkter Haftung DEVICE AND METHOD FOR PLANNING A MULTIPLE ORDERS FOR A VARIETY OF MACHINERY
CN112257872B (en) * 2020-10-30 2022-09-13 周世海 Target planning method for reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020060901A (en) * 2018-10-09 2020-04-16 アルパイン株式会社 Operation information management device
CN109557912A (en) * 2018-10-11 2019-04-02 同济大学 A kind of decision rule method of automatic Pilot job that requires special skills vehicle
CN110471411A (en) * 2019-07-26 2019-11-19 华为技术有限公司 Automatic Pilot method and servomechanism
US20210080955A1 (en) * 2019-09-12 2021-03-18 Uatc, Llc Systems and Methods for Vehicle Motion Planning Based on Uncertainty
CN113741412A (en) * 2020-05-29 2021-12-03 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEI LANXIN; LUO RUIMING; ZHENG RENJIE; WANG JINGKE; ZHANG JIANWEI; QIU CONG; MA LIULONG; JIN LIYANG; ZHANG PING; CHEN JUNBO: "KB-Tree: Learnable and Continuous Monte-Carlo Tree Search for Autonomous Driving Planning", 2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), IEEE, 27 September 2021 (2021-09-27), pages 4493 - 4500, XP034050545, DOI: 10.1109/IROS51168.2021.9636442 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114524A (en) * 2023-10-23 2023-11-24 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin
CN117114524B (en) * 2023-10-23 2024-01-26 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin

Also Published As

Publication number Publication date
CN113879339A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
WO2023103692A1 (en) Decision planning method for autonomous driving, electronic device, and computer storage medium
CN109145939B (en) Semantic segmentation method for small-target sensitive dual-channel convolutional neural network
US11537134B1 (en) Generating environmental input encoding for training neural networks
JP2023175055A (en) Autonomous vehicle planning
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
Ding et al. Epsilon: An efficient planning system for automated vehicles in highly interactive environments
Feng et al. Trafficgen: Learning to generate diverse and realistic traffic scenarios
CN114341950A (en) Occupancy-prediction neural network
CN114964261A (en) Mobile robot path planning method based on improved ant colony algorithm
CN114323051B (en) Intersection driving track planning method and device and electronic equipment
CN116050672A (en) Urban management method and system based on artificial intelligence
Alsaleh et al. Do road users play Nash Equilibrium? A comparison between Nash and Logistic stochastic Equilibriums for multiagent modeling of road user interactions in shared spaces
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN116448134B (en) Vehicle path planning method and device based on risk field and uncertain analysis
WO2018210303A1 (en) Road model construction
CN111310919B (en) Driving control strategy training method based on scene segmentation and local path planning
US11436504B1 (en) Unified scene graphs
CN117007066A (en) Unmanned trajectory planning method integrated by multiple planning algorithms and related device
CN116764225A (en) Efficient path-finding processing method, device, equipment and medium
WO2021258847A1 (en) Driving decision-making method, device, and chip
JP7356961B2 (en) Pedestrian road crossing simulation device, pedestrian road crossing simulation method, and pedestrian road crossing simulation program
CN110887503B (en) Moving track simulation method, device, equipment and medium
Xu et al. TrafficEKF: A learning based traffic aware extended Kalman filter
CN114613159A (en) Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN116674562B (en) Vehicle control method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22903115

Country of ref document: EP

Kind code of ref document: A1