WO2023103692A1 - Decision planning method for autonomous driving, electronic device, and computer storage medium - Google Patents

Decision planning method for autonomous driving, electronic device, and computer storage medium

Info

Publication number
WO2023103692A1
WO2023103692A1 (PCT/CN2022/130733; CN2022130733W)
Authority
WO
WIPO (PCT)
Prior art keywords
planning
strategy
information
driving
decision
Prior art date
Application number
PCT/CN2022/130733
Other languages
French (fr)
Chinese (zh)
Inventor
陈俊波
雷岚馨
敬巍
王刚
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴达摩院(杭州)科技有限公司
Publication of WO2023103692A1

Links

Images

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00: Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001: Planning or execution of driving tasks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00: Input parameters relating to data
    • B60W2556/10: Historical data
    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B60: VEHICLES IN GENERAL
    • B60W: CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00: Input parameters relating to data
    • B60W2556/40: High definition maps

Definitions

  • Fig. 3B is a schematic diagram of a scenario example in the embodiment shown in Fig. 3A;
  • a denotes a candidate action node (for example, an action node in the MCT); μ and σ denote the GP posterior mean and standard deviation of the candidate; and the remaining term is the expected distance between a and the actions in the other branches, where a ∈ Ch(b).
  • The memory 606 is used to store the program 610.
  • The memory 606 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk memory.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A decision planning method for autonomous driving, an electronic device, and a computer storage medium. The decision planning method comprises: acquiring driving perception information of an object to be decided in a continuous behavior space (S302, S404), wherein the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided; obtaining, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy (S304, S408); and performing decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy (S306, S410). With this decision planning method, decision planning can be performed effectively in strongly interactive autonomous driving scenarios, thereby improving the decision-making effect.

Description

Decision planning method for autonomous driving, electronic device, and computer storage medium
This application claims priority to the Chinese patent application No. 202111481018.4, entitled "Decision planning method for autonomous driving, electronic device, and computer storage medium", filed with the China Patent Office on December 7, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of this application relate to the technical field of autonomous driving, and in particular to a decision planning method for autonomous driving, an electronic device, and a computer storage medium.
Background
Autonomous driving technology uses communication, computer, network, and control technologies to perform real-time, continuous control of corresponding devices (such as self-driving vehicles, drones, and robots).
With the development of autonomous driving technology, driving decision planning is applied in it more and more. Taking self-driving vehicles as an example, current autonomous driving decision planning can give specific driving suggestions according to changes in road conditions, such as encountering pedestrians or other vehicles or running into congestion, and control the self-driving vehicle to perform reasonable driving operations. In some scenarios, however, such as strongly interactive autonomous driving scenarios, effective decision plans cannot be given because of problems such as insufficiently fine data granularity.
A strongly interactive scenario in autonomous driving is one in which the vehicle needs to frequently adjust its own decision plan based on the other party's decisions. Such scenarios typically arise in low-speed, congested situations, for example meeting or passing an oncoming vehicle on a narrow road, or negotiating a roundabout. In these scenarios, traditional decision planning rarely works well. Existing technical solutions therefore suffer from poor decision planning performance.
Summary
In view of this, the embodiments of this application provide a decision planning solution for autonomous driving to at least partially solve the above problems.
According to a first aspect of the embodiments of this application, a decision planning method for autonomous driving is provided, including: acquiring driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided; obtaining, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and performing decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
According to a second aspect of the embodiments of this application, a decision planning apparatus for autonomous driving is provided, including:
a first acquisition module, configured to acquire driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information, historical driving trajectory information, and map information related to the object to be decided;
a second acquisition module, configured to obtain, according to the driving perception information and driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
a planning module, configured to perform decision planning for the object to be decided according to the plurality of planning strategies and the strategy evaluation corresponding to each planning strategy.
According to a third aspect of the embodiments of this application, an electronic device is provided, including a processor, a memory, a communication interface, and a communication bus, where the processor, the memory, and the communication interface communicate with one another through the communication bus; the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform the operations corresponding to the method described in the first aspect.
According to a fourth aspect of the embodiments of this application, a computer storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the method described in the first aspect is implemented.
According to a fifth aspect of the embodiments of this application, a computer program product is provided, including computer instructions that instruct a computing device to perform the operations corresponding to the method described in the first aspect.
According to the autonomous driving decision planning solution provided by the embodiments of this application, for strongly interactive autonomous driving scenarios, on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information has a finer data granularity precisely because of its continuity, so that the planning strategies determined from it also have a finer granularity and are better suited to decision processing in strongly interactive autonomous driving scenarios. On the other hand, multiple planning strategies are obtained for the strongly interactive scenario, and these strategies conform to a Gaussian mixture distribution, so that each of them is highly executable and reasonable and the object can respond effectively to different operations of the other party, which better matches the needs of strong interaction. It can thus be seen that the solutions of the embodiments of this application can effectively perform decision planning in strongly interactive autonomous driving scenarios and improve the decision-making effect.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in this application, and those of ordinary skill in the art can obtain other drawings from them.
Fig. 1 is a schematic structural diagram of a reinforcement learning system;
Fig. 2 is a schematic structural diagram of a reinforcement learning network model according to an embodiment of this application;
Fig. 3A is a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 1 of this application;
Fig. 3B is a schematic diagram of an example scenario in the embodiment shown in Fig. 3A;
Fig. 4A is a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 2 of this application;
Fig. 4B is a schematic diagram of an MCTS-based reinforcement learning network model in the embodiment shown in Fig. 4A;
Fig. 5 is a structural block diagram of a decision planning apparatus for autonomous driving according to Embodiment 3 of this application;
Fig. 6 is a schematic structural diagram of an electronic device according to Embodiment 4 of this application.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application shall fall within the protection scope of the embodiments of this application.
To facilitate understanding of the solutions of the embodiments of this application, a brief schematic description of a reinforcement learning system is first given below, as shown in Fig. 1.
Reinforcement learning is a process in which an agent continuously interacts with an environment, thereby continuously strengthening the agent's decision-making ability. The reinforcement learning system shown in Fig. 1 includes an environment (Env) and an agent (Agent). First, the environment gives the agent an observation (also called a state); after receiving the observation from the environment, the agent performs an action; after receiving the action from the agent, the environment produces a series of responses, for example giving a reward for this action and providing a new observation; the agent updates its own policy according to the reward given by the environment, so that, by continuously interacting with the environment, it eventually obtains the optimal policy.
In practical applications, a reinforcement learning system can be implemented through a policy-value model, which includes a policy branch and a value branch. The policy branch is used by the agent to select the next action based on the state, and can be implemented in various ways, for example through the agent's behavior function, or, in MCTS (Monte Carlo Tree Search)-based reinforcement learning, through MCTS. The value branch is used to obtain the expected cumulative reward when the state follows the policy selected by the policy branch. The reward is a feedback signal, usually a numerical value, indicating how well the agent performed the action it selected based on the state at a given step.
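To make this interaction concrete, the agent-environment cycle of Fig. 1 can be sketched in a few lines of Python. This is only a generic illustration; the `env` and `agent` objects and their method names are placeholders assumed for this sketch, not interfaces defined by this application.

```python
# Generic reinforcement-learning interaction loop (schematic only).
# `env` and `agent` are assumed to expose reset/step and act/update methods.
def run_episode(env, agent, max_steps=1000):
    observation = env.reset()              # environment gives the initial observation (state)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)    # agent selects an action based on the state
        observation, reward, done = env.step(action)   # environment reacts: reward + new observation
        agent.update(observation, action, reward)      # agent updates its policy from the feedback
        total_reward += reward
        if done:
            break
    return total_reward
```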
Specifically, the embodiments of this application adopt an MCTS-based reinforcement learning system, which is collectively referred to herein as the reinforcement learning network model. As shown in Fig. 2, the reinforcement learning network model includes a GNN (Graph Neural Network) part and a policy-value model part. The policy-value model part includes a policy branch and a value branch: the policy branch generates the corresponding planning strategies based on MCTS, and the value branch evaluates the planning strategies generated based on MCTS and the results they produce, yielding the corresponding strategy evaluation values.
It should be noted that the GNN part of the reinforcement learning network model shown in Fig. 2 is optional; it is used to extract features from the input information. In practical applications, the corresponding input information can also be fed directly into the policy-value model part together with the target information shown in Fig. 2. Alternatively, other network models, such as a CNN (Convolutional Neural Network), can be used instead of the GNN to extract features from the input information. Using a GNN, however, the features of the input information can be extracted better and more efficiently, especially image features in strongly interactive autonomous driving scenarios.
Based on the above structure, the autonomous driving decision planning solution provided by the embodiments of this application is described below through several embodiments with reference to the drawings.
Embodiment 1
Referring to Fig. 3A, a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 1 of this application is shown.
The decision planning method for autonomous driving in this embodiment includes the following steps:
Step S302: Acquire driving perception information of the object to be decided in a continuous behavior space.
The object to be decided may be a device that carries an agent apparatus (such as a processor or a chip) and can perform corresponding operations according to instructions from the agent apparatus, for example executing the instructions corresponding to a decision plan; it may also be a device that can upload the corresponding information to a remote agent apparatus (such as a server) and accept instructions from that agent apparatus to perform corresponding operations. In the embodiments of this application, the forms in which the object to be decided can be realized include but are not limited to vehicles capable of autonomous driving, drones, robots, and the like; the embodiments of this application place no limitation on the specific form of the object to be decided.
The driving behavior of the object to be decided is usually continuous, but the processing of the corresponding data can be divided into data processing based on a discrete behavior space and data processing based on a continuous behavior space. Data in the continuous behavior space is likewise continuous, and can therefore more accurately reflect a series of information such as the state and operations of the object to be decided at every moment in the space. The embodiments of this application mainly acquire the driving perception information of the object to be decided in the continuous behavior space. The driving perception information includes at least geometric information, historical driving trajectory information, and map information related to the object to be decided.
The geometric information related to the object to be decided includes the geometric information of the object itself (such as its outline and shape) and the geometric information of the physical objects in its environment (such as the outlines and shapes of surrounding objects like vehicles, obstacles, and road facilities). The historical driving trajectory information related to the object to be decided includes information on the object's driving trajectory within a preset time period before the current moment, where the preset time period can be set appropriately by those skilled in the art according to actual needs, for example the 3-5 seconds before the current moment. The map information related to the object to be decided is usually the map information of the geographical area where the object is currently located, such as map information for a certain range around the current position or information on the surrounding geographical region, and usually includes data on the topology of the road the object is on and of the surrounding roads. The driving perception information can be collected and processed by information collection devices on the object to be decided, such as cameras, radars, and various sensors; with this information, the current driving state of the object to be decided can be described fairly comprehensively and accurately.
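Purely for illustration, the three kinds of driving perception information described above can be grouped into a simple container. The field names and types below are hypothetical and are not prescribed by this application.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) coordinate in a local frame

@dataclass
class DrivingPerception:
    ego_geometry: List[Point]                       # outline of the object to be decided (e.g. corner points)
    nearby_geometry: List[List[Point]]              # outlines of surrounding vehicles, obstacles, road facilities
    history_trajectory: List[Tuple[Point, float]]   # (position, heading) over the preset past window, e.g. 3-5 s
    road_topology: List[List[Point]]                # polylines describing the current road and surrounding roads
```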
Step S304: Obtain, according to the driving perception information and the driving target information, a plurality of planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy.
The driving target information characterizes information related to the driving target. In a strongly interactive scenario it can usually include, but is not limited to, information on the target point (or target area), the position of the target point (or target area), the distance of the target point (or target area) from the current position of the object to be decided, and the object's own current speed, state, and heading angle. The target point (or target area) in a strongly interactive scenario can usually be one that is fairly close to the current position of the object to be decided and can be reached by the object with a small number of operations (for example, 1-3). For example, in a scenario of meeting an oncoming vehicle, the target point (or target area) may be a nearby target position ahead of the current position that keeps a certain angle to the oncoming vehicle so as to avoid a collision, and so on. With the driving target information, the driving target of the object to be decided can be effectively determined, providing an effective basis for subsequently determining the planning strategies.
Given the driving perception information and the driving target information, planning strategies for controlling the operation of the object to be decided (including but not limited to navigation, braking, acceleration, car following, lane changing, etc.) can be obtained through an appropriate algorithm or model, such as a reinforcement learning network model. In the embodiments of this application there are multiple planning strategies, and the multiple planning strategies conform to a Gaussian mixture distribution. In the embodiments of this application, the Gaussian mixture distribution refers to the probability distribution output by a Gaussian Mixture Model (GMM), which is a linear combination of several Gaussian distribution functions and therefore has several peaks; these peaks correspond to the multiple planning strategies. In addition, how well each planning strategy would work also needs to be assessed; therefore, a strategy evaluation can be obtained for the multiple planning strategies, which can be concretely realized as a value estimate, a score, or a quality rating, so as to judge how good the effect would be if the object to be decided executed that planning strategy.
In one feasible manner, the multiple planning strategies and the evaluation of each planning strategy can be obtained by means of a policy-value model. For example, a policy-value model generally includes a policy network part and a value network part. The policy network part takes the form of a Mixture Density Network (MDN) and outputs indications of multiple planning strategies conforming to a Gaussian mixture distribution (such as probability distribution information), while the value network part can evaluate the multiple planning strategies generated from the planning strategy indications output by the policy network part and output the strategy evaluation corresponding to each planning strategy. In this way, multiple planning strategies and their corresponding strategy evaluations can be generated efficiently and quickly, providing an effective basis for the subsequent decision planning of the object to be decided. The multiple planning strategies generated from the planning strategy indications output by the policy network part can be produced with an appropriate algorithm chosen by those skilled in the art according to the actual situation; in one feasible manner, they can be generated from the probability distribution of the Gaussian mixture combined with MCTS.
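A minimal sketch of such a policy-value head is given below: the policy part is a mixture density network that outputs GMM parameters (mixture weights, means, and standard deviations) over a continuous action, and the value part outputs a scalar evaluation. The layer sizes and the use of PyTorch are assumptions made for illustration and are not the concrete architecture of this application.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    """Policy branch as a mixture density network (GMM parameters) plus a scalar value branch."""
    def __init__(self, feat_dim=128, action_dim=2, n_components=3):
        super().__init__()
        self.n, self.d = n_components, action_dim
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.pi = nn.Linear(128, n_components)                       # mixture weights (logits)
        self.mu = nn.Linear(128, n_components * action_dim)          # component means
        self.log_sigma = nn.Linear(128, n_components * action_dim)   # component std-devs (log space)
        self.value = nn.Linear(128, 1)                               # strategy evaluation

    def forward(self, x):
        h = self.backbone(x)
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, self.d)
        sigma = torch.exp(self.log_sigma(h)).view(-1, self.n, self.d)
        v = self.value(h)
        return pi, mu, sigma, v   # GMM over actions plus a value estimate
```

Each mixture component corresponds to one peak of the Gaussian mixture distribution, i.e. one candidate planning strategy; sampling from (pi, mu, sigma) yields concrete action proposals that can then be evaluated or fed into MCTS.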
In addition, in one feasible manner of the embodiments of this application, the multiple planning strategies conforming to the Gaussian mixture distribution can be obtained directly from the driving perception information and the driving target information. This is not limiting, however: in another feasible manner, they can also be obtained from feature data of the driving perception information together with the driving target information. The feature data can be obtained by extracting and fusing features from the various kinds of driving perception information, and can both comprehensively characterize the driving perception situation of the object to be decided and highlight its salient characteristics. Optionally, the extraction and generation of the feature data of the driving perception information can be realized by a Graph Neural Network (GNN): the driving perception information is input into the GNN, which performs feature extraction and feature fusion based on a multi-head self-attention mechanism to obtain the fused feature vector corresponding to the driving perception information.
A traditional GNN processes the input data as a whole, which involves a large amount of data processing; and when its output is combined with a reinforcement learning network model, especially an MCTS-based one, the processing burden on the GNN is further aggravated because MCTS needs to repeatedly execute simulation and inference. To reduce the data processing burden on the GNN and improve processing efficiency, in one embodiment of this application the GNN is set up to include a geometry sublayer, a driving trajectory sublayer, a map sublayer, a pooling layer, and a global layer.
The geometry sublayer is used to extract features from the geometric information, the driving trajectory sublayer is used to extract features from the historical driving trajectory information, and the map sublayer is used to extract features from the map information; the pooling layer is used to aggregate the features extracted by the geometry sublayer, the driving trajectory sublayer, and the map sublayer respectively; and the global layer is used to perform multi-head self-attention processing on the aggregated features obtained by the geometry sublayer, the driving trajectory sublayer, and the map sublayer respectively, to obtain the fused feature vector.
Further, the geometric information is represented by a vector of the coordinates of the four corners and the center of the object to be decided; the historical driving trajectory information is represented by a time-series encoding vector of the positions and headings of the object to be decided in the five time steps closest to the current time; and the map information uses road topology data: the road boundary is first discretized and divided into a subgraph every 5 m, forming different subgraphs, and the road boundary points are then connected point to point to construct the road information vector. This further facilitates the GNN's processing and improves its efficiency.
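The encodings just described can be illustrated roughly as follows. The helper names and the use of NumPy arrays are assumptions for this sketch, not part of this application.

```python
import numpy as np

def encode_geometry(corners, center):
    """Vector of the four corner coordinates plus the center of the object to be decided."""
    return np.concatenate([np.asarray(corners, dtype=float).ravel(),
                           np.asarray(center, dtype=float)])          # length 10 for 4 corners + 1 center

def encode_trajectory(poses):
    """Time-series encoding of (x, y, heading) for the five most recent time steps."""
    assert len(poses) == 5
    return np.asarray(poses, dtype=float).ravel()                     # length 15

def encode_road_boundary(boundary_points, segment_len=5.0):
    """Discretize a road-boundary polyline into subgraphs of roughly 5 m and connect the
    points inside each subgraph point to point (represented here as edge vectors)."""
    pts = np.asarray(boundary_points, dtype=float)
    subgraphs, current, acc = [], [pts[0]], 0.0
    for p, q in zip(pts[:-1], pts[1:]):
        acc += np.linalg.norm(q - p)
        current.append(q)
        if acc >= segment_len:
            subgraphs.append(np.diff(np.asarray(current), axis=0))    # point-to-point road vectors
            current, acc = [q], 0.0
    if len(current) > 1:
        subgraphs.append(np.diff(np.asarray(current), axis=0))
    return subgraphs
```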
As described above, the different kinds of driving perception information are processed by different sublayers, but in each sublayer the corresponding input vector first passes through a fully connected layer for feature extraction and then through a max-pooling layer that aggregates all feature data from the different nodes processed this time, yielding the aggregated features. For example, the geometry sublayer extracts features from the pieces of geometric information it processes this time and then aggregates them to obtain the aggregated features corresponding to that geometric information; the driving trajectory sublayer extracts features from the pieces of historical driving trajectory information it processes this time and then aggregates them to obtain the corresponding aggregated features; and the map sublayer extracts features from the pieces of road topology information it processes this time and then aggregates them to obtain the corresponding aggregated features. The aggregated features output by these sublayers can have a fixed vector length, and they are then input into the global layer. The global layer can be implemented based on a multi-head self-attention mechanism; after the global layer performs multi-head self-attention processing on the aggregated features input by the three sublayers, the fused feature vector is obtained.
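The per-sublayer processing (a fully connected layer for feature extraction followed by max-pooling, and multi-head self-attention fusion across the three aggregated features) can be sketched as follows. This is a schematic rendering of the pipeline described here, with arbitrary dimensions; it is not the exact network of this application.

```python
import torch
import torch.nn as nn

class SubgraphLayer(nn.Module):
    """One sublayer: fully connected feature extraction, then max-pooling over its nodes."""
    def __init__(self, in_dim, out_dim=64):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, node_feats):             # node_feats: (num_nodes, in_dim)
        h = self.fc(node_feats)
        return h.max(dim=0).values             # aggregated feature with a fixed length (out_dim,)

class GlobalFusion(nn.Module):
    """Global layer: multi-head self-attention over the three aggregated sublayer features."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, geo_feat, traj_feat, map_feat):
        tokens = torch.stack([geo_feat, traj_feat, map_feat], dim=0).unsqueeze(0)  # (1, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1).squeeze(0)     # fused feature vector of length dim
```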
As can be seen from the above description, in practical applications the above improvements to the GNN and to the policy-value model can be used individually, but they can also be used together to combine the advantages of both and achieve a better decision planning effect. Based on this, in one feasible manner, this step can be implemented as: inputting the driving perception information into the GNN so that the GNN performs feature extraction and feature fusion based on the multi-head self-attention mechanism to obtain the fused feature vector corresponding to the driving perception information; and inputting the fused feature vector and the vector corresponding to the driving target information of the object to be decided into the policy-value model, and obtaining, through the policy-value model, indications of multiple planning strategies conforming to the Gaussian mixture distribution and the strategy evaluation corresponding to each planning strategy generated from those planning strategy indications.
Through the above process, multiple planning strategies and their corresponding strategy evaluations are obtained, and on this basis further decision planning can be carried out.
Step S306: Perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
Since each planning strategy has a corresponding strategy evaluation, a more preferable planning strategy can be selected according to the value, score, or quality level of the strategy evaluations, and decision planning can then be performed based on the selected planning strategy, for example by issuing operation instructions to the object to be decided, such as whether to use the accelerator and to what degree, whether to brake, the rotation angle of the steering wheel, and so on, thereby instructing the object to be decided how to operate in the strongly interactive scenario and making effective decisions.
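Selecting the strategy then reduces to taking the candidate with the best evaluation, as in the following illustrative snippet; the strategy labels and scores are placeholders that anticipate the example described below.

```python
def choose_strategy(strategies, scores):
    """Select the planning strategy with the highest evaluation score."""
    best = max(range(len(strategies)), key=lambda i: scores[i])
    return strategies[best]

# Three hypothetical candidate strategies with evaluations 0.6, 0.8, and 0.5
plan = choose_strategy(["straight_then_left", "left_then_straight", "left_only"], [0.6, 0.8, 0.5])
# `plan` is then translated into low-level commands (steering angle, accelerator, brake).
```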
It should be noted that in the embodiments of this application, unless otherwise specified, terms such as "multiple" and "a plurality of" mean two or more.
The above process is illustrated below with a specific example, as shown in Fig. 3B.
In this example, the object to be decided is an autonomous vehicle X, and it is assumed that vehicle X meets a manually driven vehicle Y on a narrow road. As shown in Fig. 3B, the driving perception information of vehicle X in the continuous behavior space is first acquired, including the contour information of vehicle X, the contour information of vehicle Y, the contour information of the road edges, and so on. Then the driving target information of vehicle X is obtained: since vehicle X needs to pass vehicle Y, it can place the target point 2 meters ahead of the front of the vehicle, on the side of vehicle Y closer to vehicle X, at 30 degrees to the vehicle body direction, as indicated by the solid dot in Fig. 3B. The driving perception information and the driving target information are both input into the agent carried by vehicle X, such as vehicle X's controller. The controller is provided with a policy-value model and outputs multiple planning strategies according to the input driving perception information and driving target information; in this example, three planning strategies are assumed, namely planning strategies 1, 2, and 3. Suppose planning strategy 1 is to go straight for 1 meter and then drive 1 meter toward the front left, planning strategy 2 is to drive 1 meter toward the front left and then go straight for 1 meter, and planning strategy 3 is to drive 2 meters toward the front left. Suppose further that, through the policy-value model, the controller predicts the strategy evaluation of planning strategy 1 as a strategy score of 0.6, that of planning strategy 2 as 0.8, and that of planning strategy 3 as 0.5. Based on these assumptions, the controller decides to use planning strategy 2 and, taking planning strategy 2 as the decision rule, generates the instructions corresponding to this decision plan for vehicle X, for example turning the steering wheel 30 degrees to the left and easing the accelerator by 30% of its travel; after driving 1 meter in this state, the steering wheel is returned to 0 degrees and the accelerator is held while driving another 1 meter. At this point, vehicle X will have passed vehicle Y, and the new position of vehicle X is as shown for vehicle X in Fig. 3B.
It can be seen that, with this embodiment, for strongly interactive autonomous driving scenarios, on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information has a finer data granularity because of its continuity, so that the planning strategies determined from it also have a finer granularity and are better suited to decision processing in strongly interactive scenarios. On the other hand, multiple planning strategies are obtained for the strongly interactive scenario, and these strategies conform to a Gaussian mixture distribution, so that each of them is highly executable and reasonable and the object can respond effectively to different operations of the other party, which better matches the needs of strong interaction. It can thus be seen that the solution of this embodiment can effectively perform decision planning in strongly interactive autonomous driving scenarios and improve the decision-making effect.
Embodiment 2
Referring to Fig. 4A, a flowchart of the steps of a decision planning method for autonomous driving according to Embodiment 2 of this application is shown.
This embodiment focuses on training the reinforcement learning network model in combination with MCTS; once trained, the reinforcement learning network model can be applied to the autonomous driving decision planning solution of the preceding embodiment to perform effective decision planning for the object to be decided.
The decision planning method for autonomous driving in this embodiment includes the following steps:
Step S402: Train the policy-value model in the reinforcement learning network model.
As shown in Fig. 2, the reinforcement learning network model in the embodiments of this application includes a GNN part and a policy-value model part, where the GNN part can be a pre-trained model; this embodiment therefore focuses on the training of the policy-value model.
Specifically, the policy-value model can be trained based on decision planning supervision information generated by MCTS. An MCTS-based reinforcement learning network model is shown in Fig. 4B, from which it can be seen that both the policy branch P and the value branch V of the policy-value model are realized based on MCTS. In the embodiments of this application, the input of the policy-value model is the driving perception information and the driving target information, and the output is the probability and evaluation of every feasible action (a planning strategy in this embodiment) under the input information. The training objective is to make the action probabilities output by the policy-value model closer to the probabilities output by MCTS; the output of MCTS can therefore be regarded as the supervision information for the policy-value model.
Traditional MCTS-based policy-value models are mostly used in discrete behavior spaces; apart from insufficient action fineness, they also tend to get stuck in certain scenarios, and therefore cannot be applied to the solution of the embodiments of this application. For this reason, when the policy-value model is trained in the embodiments of this application, in each training iteration, the information (such as probabilities) of multiple planning strategy samples output by MCTS is obtained based on driving perception sample data in the continuous behavior space, driving target sample information, and KR-AUCB (a kernel regression-based asymptotical upper confidence bound); among the multiple planning strategy samples, the information (such as the probability) of the planning strategy sample with the highest strategy evaluation is then used as the supervision information to train the policy-value model.
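The training target described here (pulling the policy output toward the MCTS-derived distribution while regressing the value branch) can be written, under the usual AlphaZero-style assumptions, roughly as the loss below. The exact loss of this application is not spelled out in this passage, so this is only an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def policy_value_loss(log_probs_model, target_probs_mcts, value_pred, value_target, value_weight=1.0):
    """Cross-entropy between the model's action distribution and the MCTS-derived target
    distribution, plus a regression term for the value branch (illustrative only)."""
    policy_loss = -(target_probs_mcts * log_probs_model).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred.squeeze(-1), value_target)
    return policy_loss + value_weight * value_loss
```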
MCTS is a planning algorithm that builds a tree structure based on the Monte Carlo method for reasoning and exploration, and it can usually be combined with neural network models and reinforcement learning. MCTS generally includes four phases: select, expand, evaluate (simulate), and backup (backpropagate). The select phase starts from the root node R of the Monte Carlo tree and recursively selects a child node until a leaf node L is reached. This involves how to choose the next node based on the current node; the commonly used approach is UCB (Upper Confidence Bound), but UCB is inefficient, and for this reason the embodiments of this application provide an efficient node selection method, KR-AUCB, which can be used in both the select and expand phases and is described in detail below. The expand phase creates a new child node C under the leaf node L if the operation at L has not ended (for example, driving needs to continue). Traditionally, only one new child node C is created, which limits the newly created node data and the node paths (strategies) generated from them; the embodiments of this application therefore also improve this phase so that multiple child nodes can be created based on the probability distribution of the GMM output by the policy branch, expanding the generated strategies and improving strategy generation efficiency; this part is also detailed below. The evaluate phase simulates, according to the generated strategy, the corresponding actions from the position of the newly expanded child node to the final result, and from this computes the quality of the newly created node. The backup phase then propagates the quality of the newly created node backward along the path and updates the quality of its ancestor nodes.
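The four phases can be arranged into the usual search loop, sketched below in schematic Python. `select_child`, `expand_children`, `simulate`, and the node attributes are hypothetical helpers standing in for the KR-AUCB selection, GMM-based expansion, and simulation described in this embodiment.

```python
def mcts_search(root, n_iterations, select_child, expand_children, simulate):
    """Generic MCTS loop: select -> expand -> evaluate (simulate) -> backup."""
    for _ in range(n_iterations):
        # 1) Select: walk down from the root until a leaf node is reached.
        node, path = root, [root]
        while node.children:
            node = select_child(node)          # e.g. a KR-AUCB-based choice
            path.append(node)
        # 2) Expand: create child nodes if the episode has not ended at this leaf.
        if not node.is_terminal:
            children = expand_children(node)   # e.g. several children sampled from the GMM policy
            node.children.extend(children)
            if children:
                node = children[0]
                path.append(node)
        # 3) Evaluate: simulate from the new node to estimate its quality.
        quality = simulate(node)
        # 4) Backup: propagate the quality back along the selected path.
        for n in path:
            n.visits += 1
            n.value_sum += quality
```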
The improvements of the solution of the embodiments of this application to the select and expand phases are described in detail below.
In one feasible manner provided by the embodiments of this application, obtaining the information of the multiple planning strategy samples output by MCTS based on the continuous-behavior-space driving perception sample data, the driving target sample information, and KR-AUCB (Kernel Regression-based Asymptotical PUCB) can include: based on the continuous-behavior-space driving perception sample data and the driving target sample information, selecting nodes from the corresponding MCT (Monte Carlo tree) using KR-AUCB to form an initial planning strategy path; creating multiple child nodes for the leaf node of the initial planning strategy path according to multiple action samples conforming to the Gaussian mixture distribution output by the reinforcement learning network model; obtaining multiple extended planning strategy paths based on the created child nodes and the initial planning strategy path; performing planning strategy simulation on the multiple extended planning strategy paths to obtain the strategy evaluation corresponding to each extended planning strategy path; and outputting the information of the multiple planning strategy samples according to the extended planning strategy paths and their corresponding strategy evaluations.
In each training iteration, selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning strategy path can include: first selecting a node from the MCT, usually the node with the largest KR-AUCB value (initially this is an unvisited node; after the policy-value model has undergone multiple training iterations, the node with the largest KR-AUCB value may be either an unvisited node or a visited one); for each level of the at least one level of non-leaf nodes corresponding to this node, selecting the non-leaf node whose KR-AUCB value is higher than that of the other sibling nodes at that level or whose visit count is lower than that of the other siblings; from the leaf nodes corresponding to the last level of non-leaf nodes, selecting a leaf node (which may be the maximum-value leaf node or a randomly selected one); and forming the initial planning strategy path from the selected nodes at all levels. The KR-AUCB value can be computed according to Formula 1 below.
As shown in Fig. 4B, in the select phase of MCTS, nodes are selected through KR-AUCB based on the continuous-behavior-space driving perception sample data and the driving target sample information.
In the KR-AUCB approach, an unvisited node is first selected; for each level of the at least one level of non-leaf nodes corresponding to this unvisited node, the non-leaf node whose prior probability is higher than that of the other sibling nodes at that level or whose visit count is lower than that of the other siblings is selected; and from the leaf nodes corresponding to the last level of non-leaf nodes, the maximum-value leaf node is selected.
Optionally, KR-AUCB can be expressed in the form of the following formulas. Formulas 1 to 4, which define the KR-AUCB value and its kernel-regression terms, are rendered as images in the original filing (PCTCN2022130733-appb-000001 to appb-000004) and are not reproduced here; Formula 5 is:
p_asym = λ·P_prior + (1 - λ)·P_uniform    (Formula 5)
In the above formulas, a denotes the selected action (for example, an action node in the MCT); b denotes an already existing sibling action; the kernel between them is a Gaussian probability density; the value term is the expectation, taken under this kernel density over the existing sibling actions, of the output of the value branch of the policy-value model for those actions; c denotes the expansion parameter for expanded nodes; P_asym denotes the prior policy that asymptotically controls the decay of expansion; W(·) denotes the node visit count; n_a denotes the number of action nodes; P_uniform denotes the uniform distribution over the action space A; and P_prior = p_θ denotes the probability distribution of the prior actions output by the policy branch of the policy-value model, with p_θ being the output of the policy branch.
在进行初始节点选择时,可基于上述公式一,对MCT中的节点进行选择,以形成初始规划策略路径,如图4B中左侧实线标示出的、由浅灰色节点形成的路径,其中的叶子节点为左下线实线实心圆形所示节点,标示为L。When selecting the initial node, the nodes in the MCT can be selected based on the above formula 1 to form the initial planning strategy path, as shown by the solid line on the left in Figure 4B, the path formed by the light gray nodes, in which the leaves The node is the node shown in the solid circle with the solid line on the lower left, and is marked as L.
It should be noted that Formula 1 above can also be applied in the expand step of MCTS, where it likewise allows nodes to be selected more efficiently from newly created nodes.
After nodes have been selected from the MCT through the above process to form the initial planning-strategy path, the select step of MCTS can be regarded as complete, and the expand step can then be performed.
In the expand step of the embodiments of this application, nodes also need to be expanded, i.e., lower-level child nodes are created from a leaf node. Unlike traditional MCTS, which creates one child node at a time, in the embodiments of this application multiple child nodes can be created at once based on the Gaussian mixture distribution output by the policy branch of the policy-value model.
To facilitate the description of this process, the policy branch in the embodiments of this application is described first. The policy branch is implemented by a Mixture Density Network (MDN), which models the probability distribution of the action output, i.e., a Gaussian mixture distribution, by outputting the parameters of a Gaussian Mixture Model (GMM). As can be seen from the middle MCT in Fig. 4B, when this output distribution of the policy branch is applied in the expand step of MCTS, multiple child nodes are created for node L at the same time; two are shown in the example of Fig. 4B.
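The following Python sketch illustrates how an MDN policy head might output GMM parameters and how several child actions could be sampled from it at once. The layer sizes, the number of mixture components, and the sampling helper are assumptions for illustration only, not the claimed network.

```python
import torch
import torch.nn as nn

class MDNPolicyHead(nn.Module):
    """Policy-branch sketch: maps a fused feature vector to GMM parameters."""
    def __init__(self, feat_dim=128, n_components=3, action_dim=2):
        super().__init__()
        self.n, self.d = n_components, action_dim
        self.pi = nn.Linear(feat_dim, n_components)                    # mixture weights
        self.mu = nn.Linear(feat_dim, n_components * action_dim)       # component means
        self.log_sigma = nn.Linear(feat_dim, n_components * action_dim)  # component std devs

    def forward(self, h):
        pi = torch.softmax(self.pi(h), dim=-1)
        mu = self.mu(h).view(-1, self.n, self.d)
        sigma = self.log_sigma(h).view(-1, self.n, self.d).exp()
        return pi, mu, sigma

def sample_child_actions(pi, mu, sigma, k=2):
    """Draw k candidate actions from the GMM to create k child nodes at once."""
    comp = torch.multinomial(pi, k, replacement=True)                  # (batch, k)
    idx = comp.unsqueeze(-1).expand(-1, -1, mu.size(-1))
    means = torch.gather(mu, 1, idx)
    stds = torch.gather(sigma, 1, idx)
    return means + stds * torch.randn_like(means)                      # (batch, k, action_dim)
```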
In the embodiments of this application, when selecting nodes from the newly created child nodes, the traditional Bayesian-inference approach is improved so that effective nodes can be selected more quickly. Based on this, in one feasible approach, obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path may include: for each of the created child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining the candidacy of that child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes; selecting candidate child nodes from the multiple child nodes according to their candidacies; and obtaining multiple extended planning-strategy paths from the selected candidate child nodes and the initial planning-strategy path. Obtaining the candidacy of a child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes can also be regarded as constructing the potential-energy (acquisition) function of the Bayesian-inference approach, with its result used as the candidacy of the child node.
A specific example of this process is as follows:
Taking node b as an example, node expansion is performed on it and new child nodes are created for it. First, an action node a* is sampled as a candidate via the branch Ch(b) of existing actions (an existing action is denoted a in Formula 6):
[Formula 6 appears as an image (PCTCN2022130733-appb-000015) in the published application and is not reproduced here.]
Here, A(·) denotes the acquisition function, which is used to steer sampling toward regions where the probability of finding the best node increases.
Based on this, the acquisition function is defined as:
[Formula 7 appears as an image (PCTCN2022130733-appb-000016) in the published application and is not reproduced here.]
In Formula 7, the candidate action node (for example, an action node in the MCT) is scored using μ and σ, the Gaussian-process (GP) posterior mean and standard deviation of the candidate, together with the expected distance between the candidate and the actions a in the other branches, where a ∈ Ch(b). [A further expression appears as an image (PCTCN2022130733-appb-000021) in the published application and is not reproduced here.]
The first two terms in Formula 7 can be regarded as an upper bound on the expected value of selecting the candidate, and the last term is a penalty term that penalizes candidate actions for being too close to existing actions. ω1 and ω2 are adjustable coefficients used to balance node expansion.
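A hedged sketch of such an acquisition function is given below: a GP-UCB-style term (posterior mean plus weighted standard deviation) minus a penalty for candidates that sit too close to the existing actions. The GP fitting helper, the distance measure, and the way candidates are proposed are assumptions for illustration; the published Formula 7 is available only as an image.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def acquisition(candidates, existing_actions, existing_values,
                w1=1.0, w2=1.0, penalty_weight=1.0):
    """Score candidate actions: GP upper bound minus a closeness penalty.

    `existing_actions` / `existing_values` come from the already expanded
    branch Ch(b); `candidates` are newly proposed actions for node b.
    """
    gp = GaussianProcessRegressor().fit(existing_actions, existing_values)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Mean distance between each candidate and the existing sibling actions.
    dists = np.linalg.norm(
        candidates[:, None, :] - existing_actions[None, :, :], axis=-1).mean(axis=1)
    # Penalize candidates whose actions are too close to existing actions.
    return w1 * mu + w2 * sigma - penalty_weight / (dists + 1e-6)

def pick_candidate(candidates, existing_actions, existing_values):
    """Select a* as the candidate with the highest acquisition value."""
    scores = acquisition(np.asarray(candidates),
                         np.asarray(existing_actions),
                         np.asarray(existing_values))
    return candidates[int(np.argmax(scores))]
```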
After the candidate action has been sampled, it is added to the set of visited action nodes. The action node to take is then selected based on KR-AUCB. If the selected node is not the candidate a*, a* is removed from the set. Next, the action corresponding to the selected action node is executed, and a new state is generated at the next node level. After one iteration, the expected value of the leaf node is given by the value branch of the policy-value model. Finally, the value of every node along the traversal is updated by back-propagation.
It can be seen that, based on the above process and within the select, expand, evaluate and backup framework of MCTS, a Gaussian-process function is used to fit the information of the child nodes, and an improved Bayesian-inference scheme is used to infer the most promising new child nodes, which effectively improves the information utilization and efficiency of the expand step. In addition, if a newly created node is not chosen during the select step, it is deleted; this avoids dependence on preset hyperparameters in the expand step. Overall, the above MCTS process improves the accuracy of the evaluate step and the precision of the select step.
Based on the above process, the MCTS rollouts are relied upon to generate relatively good continuous decision trajectories, i.e., planning strategies, which serve as supervision information for the reinforcement learning of the policy-value model.
During model training, each step of MCTS rolls out multiple planning strategies (for example, 100), and the one with the most visits is used to train the policy-value model.
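The training loop described in this and the preceding paragraphs can be sketched as follows. `run_mcts`, `gmm_log_prob`, and the `model` interface are hypothetical names standing in for the components described above; the sketch only illustrates the idea of using the most-visited MCTS rollout as the supervision target.

```python
def train_step(model, optimizer, perception_sample, goal_sample, n_rollouts=100):
    """One training iteration: roll out MCTS, supervise with the most-visited plan."""
    # run_mcts is assumed to return a list of (plan, visit_count, value) tuples
    # produced with KR-AUCB selection and GP-guided expansion, as described above.
    rollouts = run_mcts(model, perception_sample, goal_sample, n_rollouts)
    best_plan, _, best_value = max(rollouts, key=lambda r: r[1])

    pi, mu, sigma, value = model(perception_sample, goal_sample)
    # Policy loss: negative log-likelihood of the supervising plan under the GMM.
    policy_loss = -gmm_log_prob(best_plan, pi, mu, sigma).mean()
    # Value loss: regress the value head toward the rollout's evaluation.
    value_loss = ((value - best_value) ** 2).mean()

    loss = policy_loss + value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```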
After the model has been trained, it can be applied to actual decision planning, as described in the following steps.
Step S404: acquiring the driving perception information of the object to be decided in the continuous behavior space.
The driving perception information includes: geometric information related to the object to be decided, historical driving-trajectory information, and map information.
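For illustration, the perception input might be organized as a simple container like the one below; the field names and array shapes are assumptions, not part of the claimed method.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DrivingPerception:
    """Continuous-behavior-space perception input for one planning cycle."""
    geometry: np.ndarray        # geometric info of the ego and nearby agents, e.g. (N, 4)
    history_tracks: np.ndarray  # historical trajectories, e.g. (N, T, 2) positions over time
    map_polylines: np.ndarray   # vectorized map elements (lane centerlines, boundaries)

@dataclass
class DrivingGoal:
    """Driving target information, e.g. a goal position and a desired speed."""
    target_xy: np.ndarray
    target_speed: float
```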
Step S406: inputting the driving perception information into the GNN, performing feature extraction and feature fusion based on a multi-head self-attention mechanism through the GNN, and obtaining a fused feature vector corresponding to the driving perception information.
In this embodiment, the driving perception information is processed by a GNN so that the relevant features can be extracted better and more efficiently.
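A hedged sketch of the subgraph-plus-attention fusion is shown below: each input type (geometry, trajectories, map) is encoded separately, pooled, and then fused with multi-head self-attention. The encoder sizes, the max-pooling, and the final mean over attended tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PerceptionFusion(nn.Module):
    """Encode geometry / trajectory / map subgraphs and fuse them with
    multi-head self-attention into a single feature vector (illustrative)."""
    def __init__(self, in_dims=(4, 6, 8), hidden=128, heads=4):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for d in in_dims)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, geometry, tracks, map_feats):
        # Per-subgraph encoding followed by max-pooling over the nodes of each subgraph.
        tokens = torch.stack([
            enc(x).max(dim=1).values
            for enc, x in zip(self.encoders, (geometry, tracks, map_feats))
        ], dim=1)                                  # (batch, 3, hidden)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)                   # (batch, hidden) fused feature vector
```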
Step S408: inputting the fused feature vector and the vector corresponding to the driving target information of the object to be decided into the policy-value model, and obtaining, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and the strategy evaluations corresponding to the planning strategies generated from those indications.
In this step, the policy branch of the policy-value model outputs GMM parameters to model the probability distribution of the action output, i.e., a Gaussian mixture distribution; based on this distribution and the aforementioned MCTS process, multiple planning strategies are generated. The value branch of the policy-value model then evaluates the multiple planning strategies to obtain the strategy evaluation of each planning strategy. The specific implementation of the value branch may follow existing implementations of value branches in policy-value models, which is not limited in the embodiments of this application.
Step S410: performing decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
For example, the planning strategy with the highest strategy evaluation may be selected from the multiple planning strategies to generate the decision plan, and the object to be decided is then given operating instructions according to that decision plan.
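Steps S404 to S410 can be strung together as in the sketch below. `PerceptionFusion` and `sample_child_actions` refer to the illustrative components above, and the `policy_head` / `value_head` interface is an assumption rather than the claimed implementation.

```python
import torch

def plan_once(fusion, policy_value_model, perception, goal_vec, k=5):
    """Illustrative inference cycle: fuse perception, propose k plans, pick the best."""
    feats = fusion(perception.geometry, perception.history_tracks, perception.map_polylines)
    h = torch.cat([feats, goal_vec], dim=-1)
    pi, mu, sigma = policy_value_model.policy_head(h)      # GMM planning-strategy indications
    plans = sample_child_actions(pi, mu, sigma, k=k)        # k candidate strategies
    scores = policy_value_model.value_head(h, plans)        # strategy evaluation per plan
    best = scores.argmax(dim=1)                              # choose the highest-evaluated plan
    return plans[torch.arange(plans.size(0)), best]
```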
The description of the above steps S404-S410 is relatively brief; for their specific implementation, reference may be made to the description of the corresponding steps in Embodiment One and to the description of step S402, which is not repeated here.
Through this embodiment, for strong-interaction scenarios in autonomous driving: on the one hand, driving perception information in a continuous behavior space is used; besides being continuous, this information also has finer data granularity because of that continuity, so that the planning strategies determined from it likewise have finer granularity and are better suited to decision-making in strong-interaction scenarios. On the other hand, multiple planning strategies are obtained for the strong-interaction scenario, and these strategies conform to a Gaussian mixture distribution, so that all of them are highly executable and reasonable and can respond effectively to the other party's different maneuvers, which better meets the needs of strong interaction. It can therefore be seen that the solution of this embodiment enables effective decision planning in strong-interaction scenarios of autonomous driving and improves the decision-making effect.
Embodiment Three
Referring to Fig. 5, a structural block diagram of a decision planning apparatus for autonomous driving according to Embodiment Three of this application is shown.
The decision planning apparatus for autonomous driving of this embodiment includes: a first acquisition module 502, configured to acquire driving perception information of an object to be decided in a continuous behavior space, where the driving perception information includes geometric information related to the object to be decided, historical driving-trajectory information, and map information; a second acquisition module 504, configured to obtain, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and a planning module 506, configured to perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
Optionally, the second acquisition module 504 is configured to: input the driving perception information into a graph neural network model, perform feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model, and obtain a fused feature vector corresponding to the driving perception information; and input the fused feature vector and the vector corresponding to the driving target information of the object to be decided into a policy-value model, and obtain, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and the strategy evaluations corresponding to the planning strategies generated from those indications.
Optionally, the policy-value model includes a policy network part and a value network part; the policy network part is a mixture density network configured to output multiple planning-strategy indications conforming to a Gaussian mixture distribution, and the value network part is configured to evaluate the multiple planning strategies generated from the planning-strategy indications output by the policy network part and to output the strategy evaluation corresponding to each planning strategy.
Optionally, the graph neural network model includes a geometry subgraph layer, a driving-trajectory subgraph layer, a map subgraph layer, a pooling layer, and a global layer, where: the geometry subgraph layer is used for feature extraction of the geometric information, the driving-trajectory subgraph layer is used for feature extraction of the historical driving-trajectory information, and the map subgraph layer is used for feature extraction of the map information; the pooling layer is used to aggregate the features respectively extracted by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer; and the global layer is used to perform multi-head self-attention processing on the aggregated features respectively obtained by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer, to obtain the fused feature vector.
Optionally, the decision planning apparatus for autonomous driving of this embodiment further includes a training module 508, configured to train the policy-value model based on decision-planning supervision information generated by MCTS.
Optionally, the training module 508 is configured to: in each training iteration, obtain the information of multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB; and train the policy-value model using, as supervision information, the information of the planning-strategy sample with the highest strategy evaluation among the multiple planning-strategy samples.
Optionally, the training module 508 obtaining the information of the multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB includes: selecting nodes from the corresponding MCT using KR-AUCB, based on the driving perception sample data in the continuous behavior space and the driving target sample information, to form an initial planning-strategy path; creating multiple child nodes for a leaf node of the initial planning-strategy path according to multiple action samples conforming to a Gaussian mixture distribution output by the reinforcement network model; obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path; performing planning-strategy simulation on the multiple extended planning-strategy paths to obtain the strategy evaluation corresponding to each extended planning-strategy path; and outputting multiple planning-strategy samples according to the extended planning-strategy paths and their corresponding strategy evaluations.
Optionally, the training module 508 obtaining multiple extended planning-strategy paths based on the created child nodes and the initial planning-strategy path includes: for each of the created child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining the candidacy of that child node from the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes; selecting candidate child nodes from the multiple child nodes according to their candidacies; and obtaining multiple extended planning-strategy paths from the selected candidate child nodes and the initial planning-strategy path.
Optionally, the training module 508 selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning-strategy path includes: first selecting an unvisited node from the MCT; for each level of the at least one level of non-leaf nodes corresponding to that unvisited node, selecting the non-leaf node whose prior probability is higher than, or whose visit count is lower than, that of its sibling nodes; selecting the maximum-value leaf node from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and forming the initial planning-strategy path from the selected nodes at each level.
The decision planning apparatus for autonomous driving of this embodiment is used to implement the corresponding decision planning methods for autonomous driving in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here. In addition, for the functional implementation of each module in the apparatus of this embodiment, reference may be made to the description of the corresponding parts in the foregoing method embodiments, which is likewise not repeated here.
Embodiment Four
Referring to Fig. 6, a schematic structural diagram of an electronic device according to Embodiment Four of this application is shown; the specific embodiments of this application do not limit the specific implementation of the electronic device.
As shown in Fig. 6, the electronic device may include: a processor 602, a communications interface 604, a memory 606, and a communication bus 608.
Where:
The processor 602, the communications interface 604, and the memory 606 communicate with one another through the communication bus 608.
The communications interface 604 is used for communicating with other electronic devices or servers.
The processor 602 is configured to execute the program 610, and specifically may execute the relevant steps in the above embodiments of the decision planning method for autonomous driving.
Specifically, the program 610 may include program code, and the program code includes computer operation instructions.
The processor 602 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application. The one or more processors included in the smart device may be processors of the same type, such as one or more CPUs, or processors of different types, such as one or more CPUs and one or more ASICs.
The memory 606 is used to store the program 610. The memory 606 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, for example at least one disk memory.
The program 610 may specifically be used to cause the processor 602 to execute the decision planning method for autonomous driving described in either of the foregoing Embodiments One and Two.
For the specific implementation of each step in the program 610, reference may be made to the corresponding descriptions of the corresponding steps and units in the above embodiments of the decision planning method for autonomous driving, which are not repeated here. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the devices and modules described above, reference may be made to the corresponding process descriptions in the foregoing method embodiments, with corresponding effects, which are likewise not repeated here.
An embodiment of this application further provides a computer program product, including computer instructions that instruct a computing device to perform operations corresponding to any of the decision planning methods for autonomous driving in the foregoing method embodiments.
It should be pointed out that, as required by implementation, each component/step described in the embodiments of this application may be split into more components/steps, and two or more components/steps or partial operations of components/steps may also be combined into new components/steps, so as to achieve the purposes of the embodiments of this application.
The above methods according to the embodiments of this application may be implemented in hardware or firmware, or implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes a storage component (for example, RAM, ROM, flash memory, etc.) that can store or receive software or computer code, and when the software or computer code is accessed and executed by the computer, processor, or hardware, the decision planning method for autonomous driving described herein is implemented. Furthermore, when a general-purpose computer accesses code for implementing the decision planning method for autonomous driving shown herein, execution of the code converts the general-purpose computer into a special-purpose computer for executing the decision planning method for autonomous driving shown herein.
Those of ordinary skill in the art can appreciate that the units and method steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods for each specific application to implement the described functions, but such implementation should not be regarded as going beyond the scope of the embodiments of this application.
The above implementations are only used to illustrate the embodiments of this application and are not intended to limit them. Those of ordinary skill in the relevant technical fields can also make various changes and variations without departing from the spirit and scope of the embodiments of this application; therefore, all equivalent technical solutions also fall within the scope of the embodiments of this application, and the scope of patent protection of the embodiments of this application should be defined by the claims.

Claims (13)

  1. A decision planning method for autonomous driving, comprising:
    acquiring driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information related to the object to be decided, historical driving-trajectory information, and map information;
    obtaining, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
    performing decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
  2. The method according to claim 1, wherein obtaining, according to the driving perception information and driving target information, the multiple planning strategies conforming to a Gaussian mixture distribution and the strategy evaluation corresponding to each planning strategy comprises:
    inputting the driving perception information into a graph neural network model, performing feature extraction and feature fusion based on a multi-head self-attention mechanism through the graph neural network model, and obtaining a fused feature vector corresponding to the driving perception information; and
    inputting the fused feature vector and a vector corresponding to the driving target information of the object to be decided into a policy-value model, and obtaining, through the policy-value model, multiple planning-strategy indications conforming to a Gaussian mixture distribution and strategy evaluations corresponding to the planning strategies generated according to the planning-strategy indications.
  3. The method according to claim 2, wherein the policy-value model comprises a policy network part and a value network part; the policy network part is a mixture density network configured to output multiple planning-strategy indications conforming to a Gaussian mixture distribution; and the value network part is configured to evaluate the multiple planning strategies generated according to the planning-strategy indications output by the policy network part and to output the strategy evaluation corresponding to each planning strategy.
  4. The method according to claim 2 or 3, wherein the graph neural network model comprises a geometry subgraph layer, a driving-trajectory subgraph layer, a map subgraph layer, a pooling layer, and a global layer;
    wherein:
    the geometry subgraph layer is used for feature extraction of the geometric information, the driving-trajectory subgraph layer is used for feature extraction of the historical driving-trajectory information, and the map subgraph layer is used for feature extraction of the map information; and
    the pooling layer is used to perform feature aggregation on the features respectively extracted by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer; and the global layer is used to perform multi-head self-attention processing on the aggregated features respectively obtained by the geometry subgraph layer, the driving-trajectory subgraph layer, and the map subgraph layer, to obtain the fused feature vector.
  5. The method according to claim 2 or 3, wherein the method further comprises:
    training the policy-value model based on decision-planning supervision information generated by Monte Carlo tree search (MCTS).
  6. The method according to claim 5, wherein training the policy-value model based on the decision-planning supervision information generated by the MCTS comprises:
    in each training iteration, obtaining information of multiple planning-strategy samples output by the MCTS based on driving perception sample data in the continuous behavior space, driving target sample information, and a kernel-regression-based asymptotic upper confidence bound (KR-AUCB); and
    training the policy-value model using, as supervision information, the information of the planning-strategy sample with the highest strategy-evaluation value among the multiple planning-strategy samples.
  7. The method according to claim 6, wherein obtaining the information of the multiple planning-strategy samples output by the MCTS based on the driving perception sample data in the continuous behavior space, the driving target sample information, and KR-AUCB comprises:
    selecting, based on the driving perception sample data in the continuous behavior space and the driving target sample information, nodes from a corresponding Monte Carlo tree (MCT) using KR-AUCB to form an initial planning strategy;
    creating multiple child nodes for a leaf node of the initial planning strategy according to multiple action samples conforming to a Gaussian mixture distribution output by a reinforcement network model;
    obtaining multiple extended planning strategies based on the created multiple child nodes and the initial planning strategy;
    performing strategy simulation on the multiple extended planning strategies to obtain a strategy evaluation corresponding to each extended planning strategy; and
    outputting multiple planning-strategy samples according to the extended planning strategies and their corresponding strategy evaluations.
  8. The method according to claim 7, wherein obtaining the multiple extended planning strategies based on the created multiple child nodes and the initial planning strategy comprises:
    for each of the created multiple child nodes, fitting the information of that child node with a Gaussian-process function, and obtaining a candidacy of that child node according to the fitted Gaussian-process mean, the standard deviation, and the distance between that child node and the other child nodes;
    selecting candidate child nodes from the multiple child nodes according to the candidacies of the child nodes; and
    obtaining the multiple extended planning strategies according to the selected candidate child nodes and the initial planning strategy.
  9. The method according to claim 7 or 8, wherein selecting nodes from the corresponding MCT using KR-AUCB to form the initial planning strategy comprises:
    first selecting, from the MCT, the node with the largest KR-AUCB value;
    for each level of at least one level of non-leaf nodes corresponding to that node, selecting a non-leaf node whose KR-AUCB value is higher than, or whose visit count is lower than, that of the other sibling child nodes;
    selecting a leaf node from the leaf nodes corresponding to the last level of the at least one level of non-leaf nodes; and
    forming the initial planning strategy according to the selected nodes at each level.
  10. A decision planning apparatus for autonomous driving, comprising:
    a first acquisition module, configured to acquire driving perception information of an object to be decided in a continuous behavior space, wherein the driving perception information comprises: geometric information related to the object to be decided, historical driving-trajectory information, and map information;
    a second acquisition module, configured to obtain, according to the driving perception information and driving target information, multiple planning strategies conforming to a Gaussian mixture distribution and a strategy evaluation corresponding to each planning strategy; and
    a planning module, configured to perform decision planning for the object to be decided according to the multiple planning strategies and the strategy evaluation corresponding to each planning strategy.
  11. An electronic device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with one another through the communication bus; and
    the memory is configured to store at least one executable instruction, and the executable instruction causes the processor to perform operations corresponding to the decision planning method for autonomous driving according to any one of claims 1-9.
  12. A computer storage medium having a computer program stored thereon, wherein, when the program is executed by a processor, the decision planning method for autonomous driving according to any one of claims 1-9 is implemented.
  13. A computer program product, comprising computer instructions, wherein the computer instructions instruct a computing device to perform operations corresponding to the decision planning method for autonomous driving according to any one of claims 1-9.
PCT/CN2022/130733 2021-12-07 2022-11-08 Decision planning method for autonomous driving, electronic device, and computer storage medium WO2023103692A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111481018.4A CN113879339A (en) 2021-12-07 2021-12-07 Decision planning method for automatic driving, electronic device and computer storage medium
CN202111481018.4 2021-12-07

Publications (1)

Publication Number Publication Date
WO2023103692A1 true WO2023103692A1 (en) 2023-06-15

Family

ID=79015785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/130733 WO2023103692A1 (en) 2021-12-07 2022-11-08 Decision planning method for autonomous driving, electronic device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN113879339A (en)
WO (1) WO2023103692A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114524A (en) * 2023-10-23 2023-11-24 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium
CN114694123B (en) * 2022-05-30 2022-09-27 阿里巴巴达摩院(杭州)科技有限公司 Traffic signal lamp sensing method, device, equipment and storage medium
CN115731690B (en) * 2022-11-18 2023-11-28 北京理工大学 Unmanned public transportation cluster decision-making method based on graphic neural network reinforcement learning
CN115762169B (en) * 2023-01-06 2023-04-25 中通新能源汽车有限公司 Unmanned intelligent control system and method for sanitation vehicle

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109557912A (en) * 2018-10-11 2019-04-02 同济大学 A kind of decision rule method of automatic Pilot job that requires special skills vehicle
CN110471411A (en) * 2019-07-26 2019-11-19 华为技术有限公司 Automatic Pilot method and servomechanism
JP2020060901A (en) * 2018-10-09 2020-04-16 アルパイン株式会社 Operation information management device
US20210080955A1 (en) * 2019-09-12 2021-03-18 Uatc, Llc Systems and Methods for Vehicle Motion Planning Based on Uncertainty
CN113741412A (en) * 2020-05-29 2021-12-03 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11074480B2 (en) * 2019-01-31 2021-07-27 StradVision, Inc. Learning method and learning device for supporting reinforcement learning by using human driving data as training data to thereby perform personalized path planning
US20220139222A1 (en) * 2019-02-13 2022-05-05 Beijing Baidu Netcom Science And Technology Co., Ltd. Driving control method and apparatus, device, medium, and system
CN109703568B (en) * 2019-02-19 2020-08-18 百度在线网络技术(北京)有限公司 Method, device and server for learning driving strategy of automatic driving vehicle in real time
CN111123957B (en) * 2020-03-31 2020-09-04 北京三快在线科技有限公司 Method and device for planning track
DE102020204351A1 (en) * 2020-04-03 2021-10-07 Robert Bosch Gesellschaft mit beschränkter Haftung DEVICE AND METHOD FOR PLANNING A MULTIPLE ORDERS FOR A VARIETY OF MACHINERY
CN112257872B (en) * 2020-10-30 2022-09-13 周世海 Target planning method for reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020060901A (en) * 2018-10-09 2020-04-16 アルパイン株式会社 Operation information management device
CN109557912A (en) * 2018-10-11 2019-04-02 同济大学 A kind of decision rule method of automatic Pilot job that requires special skills vehicle
CN110471411A (en) * 2019-07-26 2019-11-19 华为技术有限公司 Automatic Pilot method and servomechanism
US20210080955A1 (en) * 2019-09-12 2021-03-18 Uatc, Llc Systems and Methods for Vehicle Motion Planning Based on Uncertainty
CN113741412A (en) * 2020-05-29 2021-12-03 杭州海康威视数字技术股份有限公司 Control method and device for automatic driving equipment and storage medium
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEI LANXIN; LUO RUIMING; ZHENG RENJIE; WANG JINGKE; ZHANG JIANWEI; QIU CONG; MA LIULONG; JIN LIYANG; ZHANG PING; CHEN JUNBO: "KB-Tree: Learnable and Continuous Monte-Carlo Tree Search for Autonomous Driving Planning", 2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), IEEE, 27 September 2021 (2021-09-27), pages 4493 - 4500, XP034050545, DOI: 10.1109/IROS51168.2021.9636442 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114524A (en) * 2023-10-23 2023-11-24 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin
CN117114524B (en) * 2023-10-23 2024-01-26 香港中文大学(深圳) Logistics sorting method based on reinforcement learning and digital twin

Also Published As

Publication number Publication date
CN113879339A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
WO2023103692A1 (en) Decision planning method for autonomous driving, electronic device, and computer storage medium
CN109145939B (en) Semantic segmentation method for small-target sensitive dual-channel convolutional neural network
US11537134B1 (en) Generating environmental input encoding for training neural networks
JP2023175055A (en) Autonomous vehicle planning
Chen et al. Driving maneuvers prediction based autonomous driving control by deep Monte Carlo tree search
Ding et al. Epsilon: An efficient planning system for automated vehicles in highly interactive environments
Feng et al. Trafficgen: Learning to generate diverse and realistic traffic scenarios
CN114341950A (en) Occupancy-prediction neural network
CN114964261A (en) Mobile robot path planning method based on improved ant colony algorithm
CN114323051B (en) Intersection driving track planning method and device and electronic equipment
CN116050672A (en) Urban management method and system based on artificial intelligence
Alsaleh et al. Do road users play Nash Equilibrium? A comparison between Nash and Logistic stochastic Equilibriums for multiagent modeling of road user interactions in shared spaces
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN116448134B (en) Vehicle path planning method and device based on risk field and uncertain analysis
WO2018210303A1 (en) Road model construction
CN111310919B (en) Driving control strategy training method based on scene segmentation and local path planning
US11436504B1 (en) Unified scene graphs
CN117007066A (en) Unmanned trajectory planning method integrated by multiple planning algorithms and related device
CN116764225A (en) Efficient path-finding processing method, device, equipment and medium
WO2021258847A1 (en) Driving decision-making method, device, and chip
JP7356961B2 (en) Pedestrian road crossing simulation device, pedestrian road crossing simulation method, and pedestrian road crossing simulation program
CN110887503B (en) Moving track simulation method, device, equipment and medium
Xu et al. TrafficEKF: A learning based traffic aware extended Kalman filter
CN114613159A (en) Traffic signal lamp control method, device and equipment based on deep reinforcement learning
CN116674562B (en) Vehicle control method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22903115

Country of ref document: EP

Kind code of ref document: A1