WO2021238303A1 - Method and apparatus for motion planning - Google Patents
Method and apparatus for motion planning (运动规划的方法与装置)
- Publication number
- WO2021238303A1 (PCT/CN2021/075925)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- reinforcement learning
- time domain
- network model
- driving
- learning network
- Prior art date
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0217—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with energy consumption, time reduction or distance reduction criteria
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Definitions
- This application relates to the field of artificial intelligence, in particular to a method and device for motion planning.
- The key technologies for autonomous driving include perception and positioning, planning and decision-making, and execution control.
- Planning and decision-making includes motion planning, which is a method for navigating an autonomous vehicle from its current position to a destination while obeying road traffic rules.
- The scenes that autonomous driving must handle are very complex, especially dynamic traffic scenes, that is, traffic scenes containing dynamic obstacles (pedestrians or vehicles, also known as other traffic participants).
- The autonomous vehicle exhibits game-like (negotiation) behavior while interacting with dynamic obstacles.
- Autonomous vehicles are therefore required to respond flexibly to dynamic obstacles.
- Existing motion planning schemes lack the ability to respond flexibly to dynamic obstacles during such interactions.
- The present application provides a motion planning method and apparatus that enable an autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
- In a first aspect, a motion planning method is provided. The method includes: acquiring driving environment information, where the driving environment information includes position information of dynamic obstacles; inputting a state representation of the driving environment information into a trained reinforcement learning network model to obtain the prediction time domain output by the reinforcement learning network model, where the prediction time domain represents the time length or the number of steps over which the motion trajectory of the dynamic obstacle is predicted; and performing motion planning using the prediction time domain.
- The input of the reinforcement learning network model is the driving environment information, and its output is the prediction time domain.
- In other words, the state in the reinforcement learning algorithm is the driving environment information, and the action is the prediction time domain.
- The reinforcement learning network model in the embodiments of the present application may also be referred to as a prediction time domain policy network.
- The prediction time domain is determined in real time from the driving environment information, so it is not fixed but changes dynamically as the driving environment changes; motion planning based on this prediction time domain therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
- When the autonomous vehicle drives along a trajectory planned using the prediction time domain obtained by reinforcement learning, its driving style can be adjusted dynamically while interacting with dynamic obstacles.
- Driving style indicates whether the driving behavior is aggressive or conservative.
- If the prediction time domain is fixed, the driving style of the autonomous vehicle can be regarded as fixed; because traffic scenes are complex and changeable, a fixed driving style makes it difficult to balance traffic efficiency and driving safety.
- In this application, the prediction time domain is obtained through reinforcement learning; its size is not fixed but changes dynamically with the driving environment, that is, it can differ for different motion states of the dynamic obstacles.
- Therefore, as the driving environment changes, the prediction time domain can become larger or smaller, and the corresponding driving style of the autonomous vehicle can become more conservative or more aggressive, so that the driving style is adjusted dynamically while interacting with dynamic obstacles.
- Using the prediction time domain to perform motion planning includes: using the prediction time domain as a hyperparameter to predict the motion trajectory of the dynamic obstacle; and planning the motion trajectory of the autonomous vehicle according to the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
- The method further includes: controlling the autonomous vehicle to drive according to the motion trajectory obtained by the motion planning.
- In a second aspect, a data processing method is provided. The method includes: obtaining training data for a reinforcement learning network model based on data obtained through interaction between the reinforcement learning network model and an autonomous driving environment; and performing reinforcement learning training on the reinforcement learning network model using the training data to obtain a trained reinforcement learning network model, where the input of the reinforcement learning network model is driving environment information and the output of the reinforcement learning network model is the prediction time domain.
- The prediction time domain indicates the time length or the number of steps over which the motion trajectory of a dynamic obstacle is predicted.
- the input of the reinforcement learning network model is driving environment information
- the output of the reinforcement learning network model is the prediction time domain
- Applying the reinforcement learning network model trained by the data processing method provided in this application to automatic driving makes it possible, during motion planning, to determine a more appropriate prediction time domain according to the driving environment and to perform motion planning based on that prediction time domain, so that the autonomous vehicle can respond flexibly to dynamic obstacles while interacting with them.
- Obtaining the training data of the reinforcement learning network model based on the data obtained through interaction between the reinforcement learning network model and the autonomous driving environment includes obtaining a set of samples <state s, action a, reward r> in the training data through the following steps.
- Acquire driving environment information and use it as the state s, where the driving environment information includes position information of dynamic obstacles; input the state s into the reinforcement learning network model to be trained, obtain the prediction time domain output by the model, and use the prediction time domain as the action a, where the prediction time domain represents the time length or the number of steps over which the motion trajectory of the dynamic obstacle is predicted; perform motion planning using the prediction time domain to obtain the motion trajectory of the autonomous vehicle; and obtain the reward r by controlling the autonomous vehicle to drive according to that motion trajectory.
- Obtaining the reward r includes calculating the reward r according to a reward function, where the reward function takes any one or more of the following factors into consideration: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
- In a third aspect, a data processing apparatus is provided, including an acquisition unit, a prediction unit, and a planning unit.
- The acquisition unit is configured to acquire driving environment information, where the driving environment information includes location information of dynamic obstacles.
- The prediction unit is configured to input the state representation of the driving environment information into the trained reinforcement learning network model and obtain the prediction time domain output by the reinforcement learning network model, where the prediction time domain represents the time length or the number of steps over which the motion trajectory of the dynamic obstacle is predicted.
- The planning unit is configured to use the prediction time domain to perform motion planning.
- The planning unit is configured to: use the prediction time domain as a hyperparameter to predict the motion trajectory of the dynamic obstacle; and plan the motion trajectory of the autonomous vehicle according to the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
- The apparatus further includes a control unit for controlling the autonomous vehicle to drive according to the motion trajectory obtained by the motion planning.
- In a fourth aspect, a data processing apparatus is provided, including an acquisition unit and a training unit.
- the acquisition unit is configured to obtain training data of the reinforcement learning network model based on the data obtained through the interaction between the reinforcement learning network model and the driving environment of automatic driving.
- the training unit is configured to use the training data to perform reinforcement learning training on the reinforcement learning network model to obtain the trained reinforcement learning network model.
- the input of the reinforcement learning network model is driving environment information
- the output of the reinforcement learning network model is the prediction time domain
- The prediction time domain represents the time length or the number of steps over which the motion trajectory of a dynamic obstacle encountered in automatic driving is predicted.
- The acquisition unit is configured to obtain a set of samples <state s, action a, reward r> in the training data through the following steps.
- Acquire driving environment information and use it as the state s, where the driving environment information includes location information of dynamic obstacles.
- Input the state s into the reinforcement learning network model to be trained, obtain the prediction time domain output by the model, and use the prediction time domain as the action a.
- The prediction time domain is used for motion planning to obtain the motion trajectory of the autonomous vehicle.
- The reward r is obtained by controlling the autonomous vehicle to drive according to the motion trajectory of the autonomous vehicle.
- The acquisition unit is configured to calculate the reward r according to a reward function, where the reward function takes any one or more of the following factors into consideration: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
- In a fifth aspect, an autonomous vehicle is provided, including the data processing apparatus provided in the third aspect.
- The autonomous vehicle further includes the data processing apparatus provided in the fourth aspect.
- In a sixth aspect, a data processing apparatus is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method in the first aspect or the second aspect.
- In a seventh aspect, a computer-readable medium is provided, which stores program code for execution by a device, where the program code includes instructions for performing the method in the first aspect or the second aspect.
- In an eighth aspect, a computer program product containing instructions is provided; when the computer program product runs on a computer, the computer is caused to perform the method in the first aspect or the second aspect.
- In a ninth aspect, a chip is provided, including a processor and a data interface, where the processor reads instructions stored in a memory through the data interface and performs the method in the first aspect or the second aspect.
- the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
- the processor is used to execute the method in the first aspect or the second aspect described above.
- In this application, the prediction time domain is obtained through reinforcement learning; its size is not fixed but changes dynamically with the driving environment, that is, it can differ for different motion states of the dynamic obstacles. Therefore, as the driving environment changes, the prediction time domain can become larger or smaller, and the corresponding driving style of the autonomous vehicle can become more conservative or more aggressive, so that the driving style is adjusted dynamically while interacting with dynamic obstacles.
- Fig. 1 is a schematic block diagram of an automatic driving system.
- Figure 2 is a schematic diagram of an autonomous driving scene.
- Figure 3 is a schematic diagram of the principle of reinforcement learning.
- Fig. 4 is a schematic flowchart of a motion planning method provided by an embodiment of the present application.
- FIG. 5 is another schematic flowchart of a motion planning method provided by an embodiment of the present application.
- Fig. 6 is a schematic flowchart of a method for training a reinforcement learning network model provided by an embodiment of the present application.
- FIG. 7 is a schematic flowchart of step S610 in FIG. 6.
- Fig. 8 is a schematic diagram of another scene of automatic driving.
- Fig. 9 is a schematic block diagram of a data processing device provided by an embodiment of the present application.
- FIG. 10 is another schematic block diagram of a data processing device provided by an embodiment of the present application.
- FIG. 11 is another schematic block diagram of a data processing device provided by an embodiment of the present application.
- FIG. 12 is still another schematic block diagram of the data processing device provided by an embodiment of the present application.
- FIG. 13 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
- the automatic driving system may include a perception module 110, a decision planning module 120 and an execution control module 130.
- the environment perception module 110, the decision planning module 120, and the execution control module 130 in the automatic driving system are exemplarily described below.
- the environmental perception module 110 is responsible for collecting environmental information, for example, information on obstacles such as other vehicles and pedestrians, and traffic rules information such as traffic signs and traffic lights on the road.
- The decision planning performed by the decision planning module 120 can be divided into the following three levels.
- Global route planning refers to, after receiving destination information, combining the map information with the current position and attitude information of the vehicle to generate an optimal global route as a reference and guide for subsequent local route planning.
- The "optimal" here can refer to conditions such as the shortest path, the fastest time, or the need to pass through a designated point.
- The behavioral decision layer means that, after the global route is received, specific behavior decisions (for example, changing lanes and overtaking, car following, yielding, stopping, entering and exiting stations) are made based on the environment information obtained from the environment perception module 110 and the current driving path of the vehicle.
- Common behavior decision-making layer algorithms include: finite state machine, decision tree, rule-based reasoning model, etc.
- Motion planning refers to generating a motion trajectory that satisfies various constraint conditions (for example, safety, dynamic constraints of the vehicle itself, etc.) according to the specific behavior decisions made by the behavior decision-making layer.
- the motion trajectory is used as the input of the execution control module 130 to determine the travel path of the vehicle.
- the execution control module 130 is responsible for controlling the travel path of the vehicle according to the motion trajectory output by the decision planning module 120.
- The scenes that automatic driving must handle are very complicated, including: empty road scenes, scenes sharing the road with pedestrians and obstacles, empty intersection scenes, busy intersection scenes, scenes with pedestrians or vehicles violating traffic rules, scenes with normally driving vehicles and pedestrians, and so on.
- In the dynamic traffic scene shown in Figure 2, there are other traffic participants: pedestrians and other moving vehicles.
- pedestrians and other moving vehicles are dynamic obstacles.
- Autonomous vehicles have game behaviors in the process of interacting with dynamic obstacles. Therefore, in dynamic traffic scenarios, autonomous vehicles are required to be able to flexibly respond to dynamic obstacles.
- Currently, the main implementation methods of motion planning are based on search (for example, the A* algorithm), sampling (for example, the RRT algorithm), parameterized trajectories (for example, Reeds-Shepp curves), and optimization (for example, planning in the Frenet coordinate system).
- the present application provides a method for motion planning, which can enable an autonomous vehicle to flexibly respond to dynamic obstacles in the process of interacting with dynamic obstacles.
- Reinforcement learning is used to describe and solve the problem of an agent learning a policy through interaction with the environment so as to maximize return or achieve a specific goal.
- A common model of reinforcement learning is the Markov decision process (MDP).
- An MDP is a mathematical model for analyzing decision-making problems.
- In reinforcement learning, the agent learns in a "trial and error" manner: it interacts with the environment through actions and obtains rewards that guide its behavior, with the goal of maximizing the reward the agent receives.
- The reinforcement signal (i.e., the reward) provided by the environment evaluates the quality of the generated action rather than telling the reinforcement learning system how to generate the correct action. Since the information provided by the external environment is scarce, the agent must learn from its own experience; in this way, the agent acquires knowledge in an action-evaluation (i.e., reward) loop and improves its action policy to adapt to the environment.
- Common reinforcement learning algorithms include Q-learning, policy gradient, actor-critic, etc.
- reinforcement learning mainly includes five elements: agent, environment, state, action, and reward.
- The input of the agent is the state, and its output is the action.
- The training process of reinforcement learning is as follows: through multiple interactions between the agent and the environment, the action, state, and reward of each interaction are obtained; these multiple (state, action, reward) sets are used as training data to train the agent once; the above process is then repeated for the next round of training until a convergence condition is met.
- The process of obtaining the action, state, and reward of one interaction is shown in Figure 3.
- The current state s0 of the environment is input to the agent, and the action a0 output by the agent is obtained; the reward r0 of this interaction is then computed according to the relevant performance indicators of the environment under the effect of action a0. In this way, the state s0, action a0, and reward r0 of this interaction are obtained.
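The following is a minimal, generic sketch of the interaction loop described above. The policy, environment step, and reward used here are simple placeholders (none of them are the actual networks, simulator, or reward of this application); the sketch only shows how <state, action, reward> samples are collected into training data.

```python
import random

def collect_transition(state, agent_policy, environment_step):
    """Run one agent-environment interaction: s0 -> a0 -> r0, next state."""
    a0 = agent_policy(state)                       # agent maps state s0 to action a0
    next_state, r0 = environment_step(state, a0)   # environment reacts and scores a0
    return (state, a0, r0), next_state

# Placeholder policy and environment, only to make the sketch runnable.
def random_policy(state):
    return random.choice([5, 10, 20])              # e.g. candidate prediction time domains

def toy_environment(state, action):
    next_state = state + 1                         # dummy state update
    reward = -0.1 * action                         # dummy reward
    return next_state, reward

buffer, state = [], 0
for _ in range(100):                               # multiple interactions -> training data
    sample, state = collect_transition(state, random_policy, toy_environment)
    buffer.append(sample)
# `buffer` now holds (state, action, reward) samples used to train the agent.
```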
- FIG. 4 is a schematic flowchart of a method 400 for motion planning according to an embodiment of the application. Taking the automatic driving system shown in FIG. 1 as an example, the method 400 may be executed by the decision planning module 120. As shown in FIG. 4, the method 400 includes steps S410, S420, and S430.
- the driving environment information includes position information of dynamic obstacles.
- Dynamic obstacles represent various moving obstacles such as pedestrians and vehicles in the driving environment. Dynamic obstacles can also be called dynamic traffic participants. For example, dynamic obstacles include other moving vehicles or pedestrians.
- the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, and so on.
- the road structure information includes traffic rules information such as traffic signs and traffic lights on the road.
- the method for obtaining the driving environment information may be to obtain the driving environment information according to the information collected by various sensors on the autonomous driving vehicle. This application does not limit the method of obtaining driving environment information.
- S420 Input the state representation of the driving environment information into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model.
- the prediction time domain represents the time length or the number of steps for predicting the motion trajectory of the dynamic obstacle.
- the reinforcement learning network model in the embodiment of the present application represents the agent in the reinforcement learning method (as shown in FIG. 3).
- the input of the reinforcement learning network model is driving environment information
- the output of the reinforcement learning network model is the prediction time domain.
- the state in the reinforcement learning algorithm is the driving environment information
- the action is the prediction time domain.
- The reinforcement learning network model in the embodiments of the present application may also be referred to as a prediction time domain policy network.
- the state representation of the driving environment information represents data after processing the driving environment information.
- the way to process the driving environment information can be determined according to the definition of the state in the reinforcement learning algorithm.
- the definition of the state in the reinforcement learning algorithm can be designed according to the application requirements. This application does not limit this.
- the prediction time domain mentioned in the embodiment of the present application indicates the time length or the number of steps for predicting the motion trajectory of the dynamic obstacle.
- When the prediction time domain is defined as a duration, a prediction time domain of 5, for example, means that the motion trajectory of the dynamic obstacle is predicted over 5 time units.
- The time unit can be preset.
- When the prediction time domain is defined as a number of steps, a prediction time domain of 5 means that the motion trajectory of the dynamic obstacle is predicted over 5 unit steps.
- The unit step length can be preset.
- The prediction time domain in the embodiments of the present application can also be expressed as the prediction time domain used by the planner when predicting the motion trajectory of the dynamic obstacle.
- The reinforcement learning network model used in the motion planning method 400 (and the method 500 described below) provided by the embodiments of the present application is a trained model; specifically, it is trained with the goal of outputting a prediction time domain suited to the driving environment.
- The training method of the reinforcement learning network model is described below in conjunction with FIG. 6 and is not detailed here.
- In step S430, the process of using the prediction time domain for motion planning includes the following steps:
- 1) Use the prediction time domain obtained in step S420 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;
- 2) Use a planning algorithm to plan the motion trajectory of the autonomous vehicle based on the predicted motion trajectory of the dynamic obstacle.
- For motion planning of an autonomous vehicle based on the motion trajectory of the dynamic obstacle predicted over the given time length or number of steps, reference may be made to the prior art; details are not repeated here.
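As an illustration of the two steps above, the sketch below uses the prediction time domain as the number of steps for a constant-velocity rollout of a dynamic obstacle and then plans an ego trajectory. The constant-velocity model matches the later example in this description, but the toy planner and all speeds, distances, and thresholds are illustrative assumptions rather than the planning algorithm of this application.

```python
import numpy as np

def predict_obstacle(position, velocity, horizon, dt=0.1):
    """Roll a constant-velocity model forward `horizon` steps (the prediction time domain)."""
    steps = np.arange(1, horizon + 1)[:, None]          # shape (horizon, 1)
    return position + steps * dt * velocity              # predicted (x, y) per step

def plan_ego_trajectory(ego_pos, goal, obstacle_track, static_obstacles, horizon, dt=0.1):
    """Toy planner: drive straight toward the goal and slow down if the predicted
    obstacle track comes close. `static_obstacles` is accepted but ignored here."""
    direction = (goal - ego_pos) / np.linalg.norm(goal - ego_pos)
    speed = 5.0
    min_gap = np.min(np.linalg.norm(obstacle_track - ego_pos, axis=1))
    if min_gap < 5.0:                                    # predicted conflict -> be conservative
        speed = 1.0
    steps = np.arange(1, horizon + 1)[:, None]
    return ego_pos + steps * dt * speed * direction

# Example usage with a horizon assumed to come from the reinforcement learning network model.
horizon = 20
obstacle_track = predict_obstacle(np.array([10.0, 0.0]), np.array([-1.0, 0.0]), horizon)
ego_traj = plan_ego_trajectory(np.array([0.0, 0.0]), np.array([50.0, 0.0]),
                               obstacle_track, static_obstacles=[], horizon=horizon)
```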
- the self-driving vehicle can drive according to the motion trajectory of the self-driving vehicle obtained in step S430 until the driving task is completed.
- For example, the autonomous vehicle first drives C1 steps according to the trajectory obtained in step S430. If the driving task is not completed, a new state is re-acquired from the updated driving environment, steps S420 and S430 are executed again, and the vehicle drives C2 steps according to the newly obtained trajectory. If the driving task is still not completed, the above operations continue in a loop; once the driving task is completed, the automatic driving ends.
- the values of C1 and C2 involved can be preset or determined in real time according to the driving environment. C1 and C2 can be the same or different.
- the autonomous vehicle can travel 10 unit steps according to the motion trajectory of the autonomous vehicle obtained in step S430.
- the unit step length can be preset.
- In the embodiments of the present application, the prediction time domain is determined in real time according to the driving environment information, so it is not fixed but changes dynamically as the driving environment changes; motion planning based on this prediction time domain therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
- an autonomous vehicle drives according to a motion trajectory obtained by motion planning based on the predicted time domain obtained by the reinforcement learning method, which can dynamically adjust the driving style in the process of interacting with dynamic obstacles.
- Driving style indicates whether the driving behavior is aggressive or conservative.
- When the prediction time domain is large, the corresponding driving style can be regarded as conservative; when the prediction time domain is small, the corresponding driving style can be regarded as aggressive.
- If the prediction time domain is fixed, the driving style of the autonomous vehicle can be regarded as fixed; because traffic scenes are complex and changeable, a fixed driving style makes it difficult to balance traffic efficiency and driving safety.
- In this application, the prediction time domain is obtained through reinforcement learning; its size is not fixed but changes dynamically with the driving environment, that is, it can differ for different motion states of the dynamic obstacles.
- Therefore, as the driving environment changes, the prediction time domain can become larger or smaller, and the corresponding driving style of the autonomous vehicle can become more conservative or more aggressive, so that the driving style is adjusted dynamically while interacting with dynamic obstacles.
- FIG. 5 is a schematic flowchart of a method 500 for motion planning according to an embodiment of the application.
- S510: Acquire driving environment information, where the driving environment information includes position information of dynamic obstacles.
- the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, and so on.
- S520: Input the state representation of the driving environment information obtained in step S510 into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model.
- S530: Perform motion planning for the autonomous vehicle according to the prediction time domain obtained in step S520 to obtain a planned trajectory of the autonomous vehicle.
- Step S530 may include the following two steps:
- 1) Use the prediction time domain obtained in step S520 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;
- 2) Use a planning algorithm to plan the motion trajectory of the autonomous vehicle based on the predicted motion trajectory of the dynamic obstacle.
- S540: Control the autonomous vehicle to drive C steps according to the motion trajectory obtained in step S530, or in other words, execute the first C steps of the motion trajectory obtained in step S530, where C is a positive integer.
- S550: Determine whether the driving task is completed; if so, the autonomous driving operation ends, and if not, return to step S510.
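The receding-horizon loop of steps S510 to S550 can be summarized in the following sketch. All of the callables (perception, policy network, planner, controller, and task check) are assumed placeholders standing in for the modules described in the text.

```python
def motion_planning_loop(get_environment, policy_network, plan, drive, task_done, C=10):
    """Receding-horizon execution loop corresponding to steps S510-S550."""
    while True:
        state = get_environment()              # S510: acquire driving environment information
        horizon = policy_network(state)        # S520: RL model outputs the prediction time domain
        trajectory = plan(state, horizon)      # S530: plan the ego trajectory using the horizon
        drive(trajectory, num_steps=C)         # S540: execute the first C steps of the trajectory
        if task_done():                        # S550: stop when the driving task is completed
            break
```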
- The motion planning method provided in the embodiments of the present application uses a reinforcement learning method to determine the prediction time domain in real time according to the driving environment information, so that the prediction time domain is not fixed but changes dynamically as the driving environment changes; motion planning based on this prediction time domain enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
- applying the motion planning method provided by the embodiment of the present application to automatic driving can realize the dynamic adjustment of the driving style in the process of interacting with dynamic obstacles.
- FIG. 6 is a schematic flowchart of a data processing method 600 provided by an embodiment of this application.
- The method 600 can be used to train the reinforcement learning network model used in the method 400 and the method 500.
- the method 600 includes the following steps.
- S610 Obtain training data of the reinforcement learning network model according to the data obtained by the interaction between the reinforcement learning network model and the driving environment of the autonomous driving.
- the input of the reinforcement learning network model is driving environment information
- the output of the reinforcement learning network model is the prediction time domain
- the prediction time domain represents the time length or the number of steps for predicting the motion trajectory of the dynamic obstacle of automatic driving.
- S620 Use the training data to perform reinforcement learning training on the reinforcement learning network model to obtain a trained reinforcement learning network model.
- the reinforcement learning network model in the embodiment of the present application represents the agent in the reinforcement learning method (as shown in FIG. 3).
- the training data of the reinforcement learning network model includes multiple sets of samples, and each set of samples can be expressed as ⁇ state s, action a, reward r>.
- For the meanings of the state s, the action a, and the reward r, refer to the previous description in conjunction with FIG. 3; details are not repeated here.
- step S610 includes: obtaining a set of samples ⁇ state s, action a, reward r> in the training data of the reinforcement learning network model through the following steps S611 to S614.
- S611 Acquire driving environment information, and use the driving environment information as the state s.
- the driving environment information includes position information of dynamic obstacles.
- the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, and the like.
- the method for obtaining the driving environment information may be to obtain the driving environment information according to the information collected by various sensors on the autonomous driving vehicle. This application does not limit the method of obtaining driving environment information.
- S612 Input the state s into the reinforcement learning network model to be trained, obtain the prediction time domain output by the reinforcement learning network model, and use the prediction time domain as the action a, where the prediction time domain represents the time length or the number of steps over which the motion trajectory of the dynamic obstacle is predicted.
- S613 Perform motion planning using the predicted time domain to obtain the motion trajectory of the autonomous vehicle.
- Step S613 may include the following two steps:
- 1) Use the prediction time domain obtained in step S612 as a hyperparameter to predict the motion trajectory of the dynamic obstacle;
- 2) Use a planning algorithm to plan the motion trajectory of the autonomous vehicle based on the predicted motion trajectory of the dynamic obstacle.
- S614 Obtain the reward r by controlling the autonomous vehicle to drive according to the motion trajectory of the autonomous vehicle.
- the updated driving environment information is obtained, and the reward r is calculated based on the updated driving environment information.
- the strategy for obtaining the reward r based on the updated driving environment information can be determined according to application requirements, which is not limited in this application.
- the reward r may be calculated through a cost function.
- the cost function can be designed according to application requirements.
- the cost function may be determined based on the game behavior between the autonomous driving vehicle and other vehicles.
- The factors considered in designing the cost function include any one or more of the following: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
- For example, the reward r may be obtained according to the following piecewise function, which may also be called a cost function; its four segments are described below.
- The first segment, "-0.5 × time_step", in the piecewise function is used to encourage the autonomous vehicle to complete the driving task as soon as possible, out of consideration for the traffic efficiency of the autonomous vehicle.
- time_step represents the timing information of the driving task.
- The second segment, "-10", in the piecewise function is used to penalize collision behavior, out of consideration for safety.
- The third segment, "10", in the piecewise function is used to reward completion of the driving task.
- The fourth segment, "5", in the piecewise function is used to reward the other vehicle passing through the narrow lane, so that the reinforcement learning algorithm considers not only the traffic efficiency of the autonomous vehicle but also that of other vehicles; this encourages taking the traffic efficiency of other traffic participants into account.
- Applying the reinforcement learning network model trained by the method 600 provided in the embodiments of the present application to automatic driving makes it possible, during motion planning, to determine a more appropriate prediction time domain according to the driving environment and to perform motion planning based on that prediction time domain.
- In this way, the autonomous vehicle can respond flexibly to dynamic obstacles while interacting with them.
- In the narrow-lane meeting scene shown in Figure 8, the driving task is that both the autonomous vehicle and the other (moving) vehicle expect to pass through the narrow lane.
- The two vehicles drive without considering right of way.
- The autonomous vehicle adjusts its own driving behavior according to the driving behavior of the other vehicle.
- Step 1): Obtain the state in the reinforcement learning algorithm.
- Two-dimensional feasible area information and infeasible area information can be obtained.
- The area information (including the two-dimensional feasible area information and the infeasible area information) is represented as an 84 × 84 projection matrix.
- The last 4 frames, taken at an interval of 5 from the historical projection matrices, can be coordinate-transformed into the current vehicle coordinate system, and the resulting projection matrix sequence is used as the input of the reinforcement learning network model.
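The state construction described above might look like the following sketch. The rasterization of feasible/infeasible areas and the coordinate transform into the current vehicle frame are only indicated by placeholders; the 84 × 84 size, the 4-frame stack, and the interval of 5 follow the text.

```python
import numpy as np
from collections import deque

FRAME_SIZE = 84          # 84 x 84 feasibility projection matrix
STACK, INTERVAL = 4, 5   # last 4 frames, taken at an interval of 5

history = deque(maxlen=STACK * INTERVAL)   # rolling buffer of past projection matrices

def rasterize(feasible_mask):
    """Placeholder: in practice the feasible / infeasible areas around the vehicle
    are projected into an 84 x 84 grid in the current vehicle coordinate frame."""
    return feasible_mask.astype(np.float32)

def build_state(new_feasible_mask):
    """Push the newest frame and return a (4, 84, 84) stack: the last 4 frames
    taken at an interval of 5, expressed in the current vehicle frame."""
    history.append(rasterize(new_feasible_mask))
    frames = list(history)[::-INTERVAL][:STACK]   # newest frame, then every 5th older one
    while len(frames) < STACK:                    # pad with the oldest frame at start-up
        frames.append(frames[-1])
    return np.stack(frames)                       # input to the reinforcement learning network model

state = build_state(np.random.rand(FRAME_SIZE, FRAME_SIZE) > 0.5)
print(state.shape)  # (4, 84, 84)
```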
- Step 2): Input the state obtained in step 1), that is, the projection matrix sequence, into the reinforcement learning network model, and obtain the prediction time domain used by the planning algorithm for the dynamic obstacle.
- The reinforcement learning network model can be built and trained using the ACKTR algorithm.
- ACKTR is a policy gradient algorithm under the actor-critic framework.
- The ACKTR algorithm includes a policy network and a value network.
- The projection matrix sequence obtained in step 1) is used as the input of the reinforcement learning network model.
- The output of the policy network is designed to be the prediction time domain used by the planning algorithm for the dynamic obstacle; for the description of the prediction time domain, refer to the foregoing description, which is not repeated here.
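A sketch of such an actor-critic network over the stacked projection matrices is shown below. The convolutional layer sizes and the discrete set of candidate prediction time domains are assumptions; ACKTR's Kronecker-factored natural-gradient update is not shown, only the policy and value heads.

```python
import torch
import torch.nn as nn

HORIZON_CHOICES = [5, 10, 15, 20, 25]   # assumed discrete set of candidate prediction time domains

class HorizonPolicyNet(nn.Module):
    """Actor-critic network over the stacked 84x84 projection matrices (layer sizes assumed)."""
    def __init__(self, in_frames=4, n_actions=len(HORIZON_CHOICES)):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.policy_head = nn.Linear(512, n_actions)  # logits over candidate horizons (actor)
        self.value_head = nn.Linear(512, 1)           # state-value estimate (critic)

    def forward(self, x):
        h = self.torso(x)
        return self.policy_head(h), self.value_head(h)

net = HorizonPolicyNet()
state = torch.zeros(1, 4, 84, 84)                      # stacked projection matrices
logits, value = net(state)
horizon = HORIZON_CHOICES[int(torch.argmax(logits, dim=1))]
```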
- Step 3): The prediction time domain obtained in step 2) is used as a hyperparameter, and trajectory prediction over that number of steps is performed for the other moving vehicles using a constant-velocity (uniform speed) prediction model.
- A polynomial planning algorithm is then used for motion planning.
- The polynomial algorithm is a sampling-based planning algorithm.
- The algorithm plans in the Frenet coordinate system of the structured road: first, lateral offsets from the lane center line and longitudinal target speeds are sampled; then fifth-order (quintic) polynomial fitting is used to generate a set of candidate trajectories; finally, the candidates are evaluated against the planner's cost function and the optimal trajectory is output, completing the motion planning.
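The quintic-polynomial planning step might be sketched as follows. The boundary conditions of the quintic fit follow standard Frenet-frame lateral planning; the sampling ranges, the simple longitudinal profile, and the cost weights are illustrative assumptions, and collision checking against the predicted obstacle trajectory is omitted for brevity.

```python
import numpy as np

def quintic_coeffs(x0, v0, a0, x1, v1, a1, T):
    """Solve for fifth-order polynomial coefficients matching position, velocity,
    and acceleration boundary conditions at t = 0 and t = T."""
    A = np.array([
        [1, 0, 0,    0,      0,       0],
        [0, 1, 0,    0,      0,       0],
        [0, 0, 2,    0,      0,       0],
        [1, T, T**2, T**3,   T**4,    T**5],
        [0, 1, 2*T,  3*T**2, 4*T**3,  5*T**4],
        [0, 0, 2,    6*T,    12*T**2, 20*T**3],
    ])
    b = np.array([x0, v0, a0, x1, v1, a1])
    return np.linalg.solve(A, b)

def frenet_candidates(d0, d_dot0, s_speed0, T=4.0, dt=0.1):
    """Sample lateral offsets and longitudinal target speeds, fit quintic lateral
    profiles, and return the lowest-cost candidate (cost weights are illustrative)."""
    t = np.arange(0.0, T + dt, dt)
    candidates = []
    for d_target in np.linspace(-1.5, 1.5, 7):          # lateral offset from lane center line
        for v_target in np.linspace(2.0, 8.0, 4):       # longitudinal target speed
            cd = quintic_coeffs(d0, d_dot0, 0.0, d_target, 0.0, 0.0, T)
            d = np.polyval(cd[::-1], t)                  # lateral profile d(t)
            s = s_speed0 * t + 0.5 * (v_target - s_speed0) / T * t**2  # simple s(t)
            cost = 1.0 * d_target**2 + 0.5 * (8.0 - v_target)          # comfort vs. progress
            candidates.append((cost, s, d))
    return min(candidates, key=lambda c: c[0])           # optimal trajectory by cost

best_cost, s_best, d_best = frenet_candidates(d0=0.5, d_dot0=0.0, s_speed0=5.0)
```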
- the self-driving vehicle can drive according to the motion trajectory of the self-driving vehicle obtained in step 3) until the driving task is completed.
- The autonomous vehicle travels several steps according to the trajectory obtained in step 3). If the driving task is not completed, steps 1) to 3) are performed again and the vehicle travels several more steps according to the newly obtained trajectory; this loop continues until the driving task is completed, at which point the automatic driving task ends.
- The reinforcement learning network model involved in the example described in conjunction with FIG. 8 can be obtained by training with the method 600 of the above embodiment; the specific description is given above and is not repeated here.
- The embodiments of the present application adopt a reinforcement learning method to determine the prediction time domain in real time according to the driving environment information, so that the prediction time domain is not fixed but changes dynamically as the driving environment changes.
- Motion planning based on this prediction time domain enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
- an embodiment of the present application further provides a data processing device 900.
- the device 900 includes an environment perception module 910, a motion planning module 920, and a vehicle control module 930.
- the environment perception module 910 is configured to obtain driving environment information, and transmit the driving environment information to the motion planning module 920.
- the environment perception module 910 is configured to obtain driving environment information according to information collected by various sensors on the vehicle.
- the driving environment information includes position information of dynamic obstacles.
- the driving environment information may also include road structure information, location information of static obstacles, location information of autonomous vehicles, and so on.
- The motion planning module 920 is configured to receive the driving environment information from the environment perception module 910, use the reinforcement learning network model to obtain the prediction time domain for dynamic obstacles, perform motion planning based on the prediction time domain to obtain the motion trajectory of the autonomous vehicle, and transfer the planning control information corresponding to the motion trajectory to the vehicle control module 930.
- the motion planning module 920 is configured to execute step S420 and step S430 in the method 400 provided in the above method embodiment.
- The vehicle control module 930 is configured to receive the planning control information from the motion planning module 920 and control the vehicle to complete the driving task according to the action instruction information corresponding to the planning control information.
- the device 900 provided in the embodiment of the present application may be installed on an autonomous driving vehicle.
- an embodiment of the present application further provides an apparatus 1000 for motion planning, and the apparatus 1000 is configured to execute the method 400 or the method 500 in the above method embodiment.
- the device 1000 includes an acquisition unit 1010, a prediction unit 1020, and a planning unit 1030.
- the acquiring unit 1010 is used to acquire driving environment information, and the driving environment information includes location information of dynamic obstacles.
- the prediction unit 1020 is configured to input the state representation of the driving environment information into the trained reinforcement learning network model, and obtain the prediction time domain output by the reinforcement learning network model.
- the prediction time domain represents the time length or the number of steps for predicting the motion trajectory of the dynamic obstacle.
- the planning unit 1030 is used for motion planning using the predicted time domain.
- The planning unit 1030 performing motion planning using the prediction time domain includes the following steps: using the prediction time domain as a hyperparameter to predict the motion trajectory of the dynamic obstacle; and planning the motion trajectory of the autonomous vehicle according to the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
- the device 1000 may further include a control unit 1040, which is used to control the autonomous driving vehicle to drive according to the motion trajectory obtained by the motion plan.
- the prediction unit 1020, the planning unit 1030, and the control unit 1040 may be implemented by a processor.
- the acquiring unit 1010 can be implemented through a communication interface.
- an embodiment of the present application further provides a data processing apparatus 1100, and the apparatus 1100 is configured to execute the method 600 in the above method embodiment.
- the device 1100 includes an acquisition unit 1110 and a training unit 1120.
- the obtaining unit 1110 is configured to obtain training data of the reinforcement learning network model according to the data obtained by the interaction between the reinforcement learning network model and the driving environment of automatic driving.
- the training unit 1120 is used to use training data to perform reinforcement learning training on the reinforcement learning network model to obtain a trained reinforcement learning network model.
- the input of the reinforcement learning network model is the driving environment information
- the output of the reinforcement learning network model is the prediction time domain
- the prediction time domain represents the time length or the number of steps to predict the motion trajectory of the dynamic obstacle of automatic driving.
- the obtaining unit 1110 is used to obtain a set of samples ⁇ state s, action a, reward r> in the training data through step S611 to step S614 as shown in FIG. 7. Please refer to the above description, which will not be repeated here.
- an embodiment of the present application also provides a data processing apparatus 1200.
- The apparatus 1200 includes a processor 1210 coupled with a memory 1220; the memory 1220 is used to store computer programs or instructions, and the processor 1210 is used to execute the computer programs or instructions stored in the memory 1220, so that the methods in the above method embodiments are performed.
- the apparatus 1200 may further include a memory 1220.
- the device 1200 may further include a data interface 1230, and the data interface 1230 is used for data transmission with the outside world.
- the apparatus 1200 is used to implement the method 400 in the foregoing embodiment.
- the apparatus 1200 is used to implement the method 500 in the foregoing embodiment.
- the apparatus 1200 is used to implement the method 600 in the foregoing embodiment.
- An embodiment of the present application also provides an autonomous driving vehicle, which includes a data processing device 900 as shown in FIG. 9 or a data processing device 1000 as shown in FIG. 10.
- the self-driving vehicle further includes a data processing device 1100 as shown in FIG. 11.
- An embodiment of the present application also provides an autonomous driving vehicle, which includes a data processing device 1200 as shown in FIG. 12.
- The embodiments of the present application also provide a computer-readable medium that stores program code for execution by a device, where the program code includes instructions for performing the methods of the above embodiments.
- the embodiments of the present application also provide a computer program product containing instructions, which when the computer program product runs on a computer, cause the computer to execute the method of the foregoing embodiment.
- An embodiment of the present application also provides a chip, which includes a processor and a data interface, and the processor reads instructions stored on the memory through the data interface, and executes the method of the foregoing embodiment.
- the chip may further include a memory in which instructions are stored, and the processor is used to execute the instructions stored in the memory, and when the instructions are executed, the processor is used to execute the method in the foregoing embodiment.
- FIG. 13 is a chip hardware structure provided by an embodiment of the application, and the chip includes a neural network processor 1300.
- the chip can be installed in any one or more of the following devices:
- the methods 400, 500, or 600 in the above method embodiments can all be implemented in the chip as shown in FIG. 13.
- the neural network processor 1300 is mounted on a host CPU (Host CPU) as a coprocessor, and the host CPU distributes tasks.
- the core part of the neural network processor 1300 is the arithmetic circuit 1303, and the controller 1304 controls the arithmetic circuit 1303 to obtain data in the memory (weight memory 1302 or input memory 1301) and perform calculations.
- the arithmetic circuit 1303 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 1303 is a two-dimensional systolic array. The arithmetic circuit 1303 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1303 is a general-purpose matrix processor.
- the arithmetic circuit 1303 fetches the data corresponding to matrix B from the weight memory 1302 and caches it on each PE in the arithmetic circuit 1303.
- the arithmetic circuit 1303 fetches the matrix A data and matrix B from the input memory 1301 to perform matrix operations, and the partial result or final result of the obtained matrix is stored in an accumulator 1308.
- the vector calculation unit 1307 can perform further processing on the output of the arithmetic circuit 1303, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
- the vector calculation unit 1307 can be used for network calculations in the non-convolution/non-FC layer of the neural network, such as pooling, batch normalization, local response normalization, etc. .
- the vector calculation unit 1307 can store the processed output vector in a unified memory (also referred to as a unified buffer) 1306.
- the vector calculation unit 1307 may apply a nonlinear function to the output of the arithmetic circuit 1303, such as a vector of accumulated values, to generate the activation value.
- the vector calculation unit 1307 generates a normalized value, a combined value, or both.
- the processed output vector can be used as an activation input to the arithmetic circuit 1303, for example for use in a subsequent layer in a neural network.
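Conceptually, the dataflow described above (matrix multiplication in the arithmetic circuit, accumulation of partial results, and post-processing in the vector calculation unit) corresponds to the following software sketch; it is an illustration of the computation, not a model of the hardware.

```python
import numpy as np

def npu_layer(inputs, weights, bias):
    """Software illustration of one layer of the dataflow described above:
    the arithmetic circuit performs the matrix multiplication, the accumulator
    gathers partial results, and the vector calculation unit applies
    post-processing such as bias addition and a nonlinear activation."""
    acc = inputs @ weights            # arithmetic circuit 1303: matrix A x matrix B
    acc = acc + bias                  # accumulator 1308: accumulate partial results / bias
    return np.maximum(acc, 0.0)       # vector calculation unit 1307: e.g. ReLU activation

activations = npu_layer(np.random.rand(1, 84 * 84 * 4),
                        np.random.rand(84 * 84 * 4, 512),
                        np.zeros(512))
```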
- The methods 400, 500, and 600 in the above method embodiments may be executed by the arithmetic circuit 1303 or the vector calculation unit 1307.
- the unified memory 1306 is used to store input data and output data.
- Input data in the external memory can be transferred to the input memory 1301 and/or the unified memory 1306 through the storage unit access controller (direct memory access controller, DMAC) 1305, weight data in the external memory can be stored into the weight memory 1302, and data in the unified memory 1306 can be stored back into the external memory.
- the bus interface unit (BIU) 1310 is used to implement interaction between the main CPU, the DMAC, and the instruction fetch memory 1309 through the bus.
- An instruction fetch buffer 1309 connected to the controller 1304 is used to store instructions used by the controller 1304;
- the controller 1304 is used to call the instructions cached in the memory 1309 to control the working process of the computing accelerator.
- the unified memory 1306, the input memory 1301, the weight memory 1302 and the instruction fetch memory 1309 are all on-chip (On-Chip) memories.
- the external memory is the memory external to the NPU.
- The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
- the disclosed system, device, and method can be implemented in other ways.
- The device embodiments described above are merely illustrative. For example, the division into units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- if the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a computer-readable storage medium.
- the technical solution of this application, in essence, or the part that contributes to the prior art, or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage media include various media that can store program code, such as a Universal Serial Bus flash disk (UFD, also referred to as a USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Traffic Control Systems (AREA)
- Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
Abstract
This application relates to the field of artificial intelligence, and in particular to the field of autonomous driving, and provides a motion planning method and apparatus. The method includes: obtaining driving environment information, where the driving environment information includes position information of dynamic obstacles; inputting a state representation of the driving environment information into a trained reinforcement learning network model and obtaining the prediction horizon output by the reinforcement learning network model, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle is predicted; and performing motion planning using the prediction horizon. Because the prediction horizon is obtained through reinforcement learning, it can change dynamically as the driving environment changes, so that the autonomous vehicle can respond flexibly to dynamic obstacles while interacting with them.
Description
This application claims priority to Chinese Patent Application No. 202010471732.4, filed with the Chinese Patent Office on May 29, 2020 and entitled "Motion Planning Method and Apparatus", which is incorporated herein by reference in its entirety.
This application relates to the field of artificial intelligence, and in particular to a motion planning method and apparatus.
Key technologies for realizing autonomous driving include perception and localization, planning and decision-making, and execution control. Planning and decision-making includes motion planning, which is a method of navigating an autonomous vehicle from its current position to a destination while obeying the rules of the road.
In real open-road scenarios, the situations that autonomous driving has to handle are extremely varied. This is especially true in dynamic traffic scenes, that is, scenes containing dynamic obstacles (pedestrians or vehicles, also called other traffic participants), where game-like interaction arises between the autonomous vehicle and the dynamic obstacles. In such scenes, the autonomous vehicle is required to respond to dynamic obstacles flexibly.
Current motion planning solutions lack the ability to respond flexibly to dynamic obstacles while interacting with them.
Summary of the Invention
This application provides a motion planning method and apparatus, which enable an autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
According to a first aspect, a motion planning method is provided. The method includes: obtaining driving environment information, where the driving environment information includes position information of a dynamic obstacle; inputting a state representation of the driving environment information into a trained reinforcement learning network model and obtaining the prediction horizon output by the reinforcement learning network model, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; and performing motion planning using the prediction horizon.
The input of the reinforcement learning network model is the driving environment information, and its output is the prediction horizon. In other words, the state in the reinforcement learning algorithm is the driving environment information and the action is the prediction horizon. The reinforcement learning network model in the embodiments of this application may also be called a prediction-horizon policy network.
By using reinforcement learning to determine the prediction horizon in real time from the driving environment information, the prediction horizon is no longer fixed but changes dynamically as the driving environment changes. Performing motion planning on the basis of this prediction horizon therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
When the autonomous vehicle drives along the trajectory obtained by motion planning based on the prediction horizon obtained through reinforcement learning, its driving style can be adjusted dynamically during the interaction with dynamic obstacles. The driving style indicates whether the driving behavior is aggressive or conservative.
In the prior art, the prediction horizon is fixed, which can be regarded as a fixed driving style of the autonomous vehicle. Traffic scenes, however, are complex and changeable; with a fixed driving style it is difficult to balance traffic efficiency and driving safety.
In this application, the prediction horizon is obtained through reinforcement learning, so its size is not fixed but changes dynamically as the driving environment changes; that is, the prediction horizon can differ for different motion states of the dynamic obstacles. Therefore, as the driving environment of the autonomous vehicle changes, the prediction horizon can become larger or smaller, and the corresponding driving style of the autonomous vehicle can be conservative or aggressive, so that the driving style can be adjusted dynamically during interaction with dynamic obstacles.
With reference to the first aspect, in a possible implementation, performing motion planning using the prediction horizon includes: predicting the motion trajectory of the dynamic obstacle with the prediction horizon as a hyperparameter; and planning the motion trajectory of the autonomous vehicle based on the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
With reference to the first aspect, in a possible implementation, the method further includes: controlling the autonomous vehicle to drive along the motion trajectory obtained by the motion planning.
According to a second aspect, a data processing method is provided. The method includes: obtaining training data for a reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and an autonomous-driving environment; and training the reinforcement learning network model with the training data to obtain a trained reinforcement learning network model, where the input of the reinforcement learning network model is driving environment information, the output of the model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
The input of the reinforcement learning network model is the driving environment information, and its output is the prediction horizon.
When the reinforcement learning network model trained with the data processing method provided in this application is applied to autonomous driving, a suitable prediction horizon can be determined from the driving environment during motion planning, and motion planning based on this prediction horizon enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
With reference to the second aspect, in a possible implementation, obtaining the training data for the reinforcement learning network model from the data obtained through interaction between the model and the autonomous-driving environment includes: obtaining a sample <state s, action a, reward r> of the training data through the following steps.
Obtain driving environment information and use it as the state s, where the driving environment information includes position information of a dynamic obstacle; input the state s into the reinforcement learning network model to be trained, obtain the prediction horizon output by the model, and use the prediction horizon as the action a, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; perform motion planning using the prediction horizon to obtain a motion trajectory of the autonomous vehicle; and obtain the reward r by controlling the autonomous vehicle to drive along its motion trajectory.
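For reference, one way such a <state s, action a, reward r> sample could be held in code is sketched below; the field types (an array-valued state, an integer horizon as the action, a floating-point reward) are illustrative assumptions, not part of this application:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Sample:
    state: np.ndarray    # state s: processed representation of the driving environment
    action: int          # action a: prediction horizon (duration or number of steps)
    reward: float        # reward r: computed after driving along the planned trajectory
```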
With reference to the second aspect, in a possible implementation, obtaining the reward r includes: computing the reward r with a reward function, where the reward function takes into account any one or more of the following factors: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
According to a third aspect, a data processing apparatus is provided. The apparatus includes an obtaining unit, a prediction unit, and a planning unit.
The obtaining unit is configured to obtain driving environment information, where the driving environment information includes position information of a dynamic obstacle. The prediction unit is configured to input a state representation of the driving environment information into a trained reinforcement learning network model and obtain the prediction horizon output by the model, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted. The planning unit is configured to perform motion planning using the prediction horizon.
With reference to the third aspect, in a possible implementation, the planning unit is configured to: predict the motion trajectory of the dynamic obstacle with the prediction horizon as a hyperparameter; and plan the motion trajectory of the autonomous vehicle based on the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
With reference to the third aspect, in a possible implementation, the apparatus further includes a control unit configured to control the autonomous vehicle to drive along the motion trajectory obtained by the motion planning.
According to a fourth aspect, a data processing apparatus is provided. The apparatus includes an obtaining unit and a training unit.
The obtaining unit is configured to obtain training data for a reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and an autonomous-driving environment. The training unit is configured to train the reinforcement learning network model with the training data to obtain a trained reinforcement learning network model. The input of the reinforcement learning network model is driving environment information, the output of the model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
With reference to the fourth aspect, in a possible implementation, the obtaining unit is configured to obtain a sample <state s, action a, reward r> of the training data through the following steps.
Obtain driving environment information and use it as the state s, where the driving environment information includes position information of a dynamic obstacle. Input the state s into the reinforcement learning network model to be trained, obtain the prediction horizon output by the model, and use the prediction horizon as the action a, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted. Perform motion planning using the prediction horizon to obtain a motion trajectory of the autonomous vehicle. Obtain the reward r by controlling the autonomous vehicle to drive along its motion trajectory.
With reference to the fourth aspect, in a possible implementation, the obtaining unit is configured to compute the reward r with a reward function, where the reward function takes into account any one or more of the following factors: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
According to a fifth aspect, an autonomous vehicle is provided, including the data processing apparatus provided in the third aspect.
With reference to the fifth aspect, in a possible implementation, the autonomous vehicle further includes the data processing apparatus provided in the fourth aspect.
According to a sixth aspect, a data processing apparatus is provided. The apparatus includes: a memory configured to store a program; and a processor configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in the first or second aspect.
According to a seventh aspect, a computer-readable medium is provided. The computer-readable medium stores program code to be executed by a device, and the program code includes instructions for performing the method in the first or second aspect.
According to an eighth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, the computer is caused to perform the method in the first or second aspect.
According to a ninth aspect, a chip is provided. The chip includes a processor and a data interface, and the processor reads, through the data interface, instructions stored in a memory to perform the method in the first or second aspect.
Optionally, in an implementation, the chip may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the method in the first or second aspect.
Based on the above description, in this application the prediction horizon is obtained through reinforcement learning, so its size is not fixed but changes dynamically as the driving environment changes; that is, the prediction horizon can differ for different motion states of the dynamic obstacles. Therefore, as the driving environment of the autonomous vehicle changes, the prediction horizon can become larger or smaller, and the corresponding driving style of the autonomous vehicle can be conservative or aggressive, so that the driving style can be adjusted dynamically during interaction with dynamic obstacles.
FIG. 1 is a schematic block diagram of an autonomous driving system.
FIG. 2 is a schematic diagram of an autonomous driving scene.
FIG. 3 is a schematic diagram of the principle of reinforcement learning.
FIG. 4 is a schematic flowchart of a motion planning method according to an embodiment of this application.
FIG. 5 is another schematic flowchart of a motion planning method according to an embodiment of this application.
FIG. 6 is a schematic flowchart of a method for training a reinforcement learning network model according to an embodiment of this application.
FIG. 7 is a schematic flowchart of step S610 in FIG. 6.
FIG. 8 is a schematic diagram of another autonomous driving scene.
FIG. 9 is a schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 10 is another schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 11 is yet another schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 12 is still another schematic block diagram of a data processing apparatus according to an embodiment of this application.
FIG. 13 is a schematic diagram of a chip hardware structure according to an embodiment of this application.
With the advent of intelligent driving, intelligent vehicles have become a key research target for major manufacturers. An intelligent vehicle generates a desired path according to the various parameters provided by its sensors and supplies the corresponding control quantities to downstream controllers. Intelligent driving is also called autonomous driving. Key technologies of autonomous driving include perception and localization, decision-making and planning, and execution control. As an example, as shown in FIG. 1, an autonomous driving system may include a perception module 110, a decision-making and planning module 120, and an execution control module 130.
The environment perception module 110, the decision-making and planning module 120, and the execution control module 130 of the autonomous driving system are described below by way of example.
The environment perception module 110 is responsible for collecting environment information, for example, information about obstacles such as other vehicles and pedestrians, and traffic-rule information such as traffic signs and traffic lights on the road.
The decision-making and planning performed by the decision-making and planning module 120 can be divided into the following three levels.
1) Global route planning: after a destination is received, an optimal global route is generated by combining map information with the current position and attitude of the ego vehicle, and this route serves as a reference and guide for subsequent local path planning. "Optimal" here may mean the shortest path, the shortest time, mandatory passage through specified waypoints, or similar conditions.
Common global route planning algorithms include Dijkstra, A*, and many improvements built on these two algorithms.
2) Behavioral layer: after the global route is received, a concrete behavioral decision (for example, lane change and overtaking, car following, yielding, stopping, or entering/leaving a station) is made according to the environment information obtained from the environment perception module 110 and information such as the current driving path of the ego vehicle.
Common algorithms for the behavioral layer include finite state machines, decision trees, and rule-based reasoning models.
3) Motion planning: according to the concrete behavioral decision made by the behavioral layer, a motion trajectory that satisfies various constraints (for example, safety and the dynamics constraints of the vehicle itself) is generated. This motion trajectory is the input of the execution control module 130 and determines the driving path of the vehicle.
The execution control module 130 is responsible for controlling the driving path of the vehicle according to the motion trajectory output by the decision-making and planning module 120.
In real open-road scenarios, the situations autonomous driving has to handle are extremely varied, including empty roads, roads shared with pedestrians and obstacles, empty intersections, busy intersections, pedestrians/vehicles violating traffic rules, normally driving vehicles/pedestrians, and so on. For example, in the dynamic traffic scene shown in FIG. 2 there are other traffic participants: pedestrians and other moving vehicles. For the autonomous vehicle, these pedestrians and moving vehicles are dynamic obstacles, and game-like interaction arises between the autonomous vehicle and the dynamic obstacles. Therefore, in dynamic traffic scenes the autonomous vehicle is required to respond to dynamic obstacles flexibly.
At present, the main motion planning approaches are based on search (for example, A*-type algorithms), sampling (for example, RRT-type algorithms), parameterized trajectories (for example, Reeds-Shepp curves), and optimization (for example, planning in the Frenet coordinate frame). These solutions lack the ability to respond flexibly to dynamic obstacles while interacting with them.
To address this problem, this application provides a motion planning method that enables an autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
For a better understanding of the embodiments of this application, the reinforcement learning involved in the embodiments is first described below.
Reinforcement learning (RL) is used to describe and solve the problem of an agent learning a policy while interacting with an environment so as to maximize return or achieve a specific goal. A common model for reinforcement learning is the Markov decision process (MDP), a mathematical model for analyzing decision problems. In reinforcement learning, the agent learns by trial and error, and the reward obtained by interacting with the environment through actions guides its behavior; the goal is for the agent to obtain the maximum reward. The reinforcement signal (i.e., the reward) provided by the environment evaluates how good an action is, rather than telling the reinforcement learning system how to produce the correct action. Since the external environment provides little information, the agent must learn from its own experience. In this way, the agent gains knowledge in an action-evaluation (i.e., reward) loop and improves its action policy to adapt to the environment. Common reinforcement learning algorithms include Q-learning, policy gradient, actor-critic, and so on.
As shown in FIG. 3, reinforcement learning mainly involves five elements: the agent, the environment, the state, the action, and the reward, where the input of the agent is a state and its output is an action. The training process of reinforcement learning is as follows: the agent interacts with the environment multiple times to obtain the action, state, and reward of each interaction; these (action, state, reward) tuples are used as training data to train the agent once. The above process is then repeated for the next round of training until the convergence condition is met.
As an example, the process of obtaining the action, state, and reward of one interaction is shown in FIG. 3: the current state s0 of the environment is input into the agent, and the action a0 output by the agent is obtained; the reward r0 of this interaction is computed from the relevant performance indicators of the environment under the effect of action a0. At this point, the state s0, action a0, and reward r0 of this interaction have been obtained and are recorded for later use in training the agent. The next state s1 of the environment under the effect of action a0 is also recorded, so that the next interaction between the agent and the environment can take place.
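A minimal sketch of this single interaction, assuming placeholder `env` and `agent` objects whose `state()`, `step()` and `act()` methods are illustrative interfaces rather than anything defined in this application:

```python
def collect_transition(env, agent):
    """One agent-environment interaction as described above. `env.step()`
    applies the action and returns the reward plus the next environment state."""
    s0 = env.state()           # current state of the environment
    a0 = agent.act(s0)         # action output by the agent
    r0, s1 = env.step(a0)      # reward computed from the environment under action a0
    return (s0, a0, r0), s1    # record the transition; s1 starts the next interaction
```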
The technical solutions of this application are described below with reference to the accompanying drawings.
FIG. 4 is a schematic flowchart of a motion planning method 400 according to an embodiment of this application. Taking the autonomous driving system in FIG. 1 as an example, the method 400 may be executed by the decision-making and planning module 120. As shown in FIG. 4, the method 400 includes steps S410, S420, and S430.
S410: Obtain driving environment information.
The driving environment information includes position information of dynamic obstacles. Dynamic obstacles are moving obstacles in the driving environment such as pedestrians and vehicles, and may also be called dynamic traffic participants. For example, dynamic obstacles include other moving vehicles or pedestrians.
For example, the driving environment information may further include road structure information, position information of static obstacles, position information of the autonomous vehicle, and so on. The road structure information includes traffic-rule information such as traffic signs and traffic lights on the road.
The driving environment information may be obtained, for example, from the information collected by the sensors on the autonomous vehicle. This application does not limit the manner of obtaining the driving environment information.
S420: Input the state representation of the driving environment information into the trained reinforcement learning network model and obtain the prediction horizon output by the model, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle is predicted.
The reinforcement learning network model in the embodiments of this application corresponds to the agent in the reinforcement learning method (as shown in FIG. 3).
The input of the reinforcement learning network model is the driving environment information, and its output is the prediction horizon. In other words, the state in the reinforcement learning algorithm is the driving environment information and the action is the prediction horizon. The reinforcement learning network model in the embodiments of this application may also be called a prediction-horizon policy network.
It should be noted that the state representation of the driving environment information is the data obtained after the driving environment information has been processed. In practice, the way the driving environment information is processed can be determined by how the state is defined in the reinforcement learning algorithm.
In practice, the definition of the state in the reinforcement learning algorithm can be designed according to the application requirements; this application does not limit it.
The prediction horizon mentioned in the embodiments of this application indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle is predicted.
As an example, suppose the prediction horizon is defined as a prediction duration; a prediction horizon of 5 then means that the motion trajectory of the dynamic obstacle is predicted for 5 time units, where the time unit can be preset.
As another example, suppose the prediction horizon is defined as a number of prediction steps; a prediction horizon of 5 then means that the motion trajectory of the dynamic obstacle is predicted for 5 unit steps, where the unit step can be preset.
The prediction horizon in the embodiments of this application can also be described as the prediction horizon of the planner used to plan the motion trajectory of the dynamic obstacle.
It should be noted that the reinforcement learning network model used in the motion planning method 400 (and the method 500 described below) provided in the embodiments of this application is an already trained model; specifically, it is trained with the goal of predicting the prediction horizon from the driving environment. The method for training the reinforcement learning network model is described later with reference to FIG. 6 and is not detailed here.
S430: Perform motion planning using the prediction horizon.
For example, the procedure for performing motion planning using the prediction horizon includes the following steps (a minimal code sketch is given after the list):
1) Predict the motion trajectory of the dynamic obstacle with the prediction horizon obtained in step S420 as a hyperparameter;
2) Plan the motion trajectory of the autonomous vehicle with a planning algorithm, based on the position information of the static obstacles in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
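A rough sketch of this two-step flow follows, under simplifying assumptions: a constant-velocity model stands in for the obstacle predictor, and the planner is a trivial placeholder rather than any specific planning algorithm of this application:

```python
import numpy as np

def predict_obstacles(dynamic_obstacles, horizon, dt=0.1):
    """Step 1): roll each dynamic obstacle forward for `horizon` steps with a
    constant-velocity model (a simple stand-in for the trajectory predictor)."""
    steps = np.arange(1, horizon + 1)[:, None] * dt               # (horizon, 1) time offsets
    return [pos + steps * vel for pos, vel in dynamic_obstacles]  # one (horizon, 2) array per obstacle

def plan_ego_trajectory(ego_pos, goal_pos, static_obstacles, predicted, horizon, dt=0.1):
    """Step 2): placeholder planner that simply heads straight for the goal; a
    real planner would also use `static_obstacles` and `predicted` to stay
    collision-free."""
    direction = (goal_pos - ego_pos) / np.linalg.norm(goal_pos - ego_pos)
    steps = np.arange(1, horizon + 1)[:, None] * dt
    return ego_pos + steps * direction                            # (horizon, 2) ego waypoints
```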
It should be noted that, given the duration or number of steps of the obstacle trajectory prediction (i.e., the prediction horizon in the embodiments of this application), the method of performing motion planning for the autonomous vehicle can follow the prior art and is not detailed herein.
It should be understood that the autonomous vehicle can drive along the motion trajectory obtained in step S430 until the driving task is completed.
For example, the autonomous vehicle drives C1 steps along the motion trajectory obtained in step S430. If the driving task is not completed, a new state is obtained from the updated driving environment, steps S420 and S430 are executed again, and the vehicle drives C2 steps along the newly obtained trajectory. If the driving task is still not completed, the above operations continue in a loop; once the driving task is completed, autonomous driving ends. The values of C1 and C2 can be preset or determined in real time from the driving environment, and C1 and C2 may be equal or different.
Taking C1 = C2 = 10 as an example, the autonomous vehicle drives 10 unit steps along the motion trajectory obtained in step S430, where the unit step can be preset.
By using reinforcement learning to determine the prediction horizon in real time from the driving environment information, the prediction horizon is no longer fixed but changes dynamically as the driving environment changes. Performing motion planning on the basis of this prediction horizon therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
For example, when the autonomous vehicle drives along the trajectory obtained by motion planning based on the prediction horizon obtained through reinforcement learning, the driving style can be adjusted dynamically during the interaction with dynamic obstacles.
The driving style indicates whether the driving behavior is aggressive or conservative.
For example, when the prediction horizon is large, the corresponding driving style can be regarded as conservative; when the prediction horizon is small, the corresponding driving style can be regarded as aggressive.
In the prior art, the prediction horizon is fixed, which can be regarded as a fixed driving style of the autonomous vehicle. Traffic scenes, however, are complex and changeable; with a fixed driving style it is difficult to balance traffic efficiency and driving safety.
In this application, the prediction horizon is obtained through reinforcement learning, so its size is not fixed but changes dynamically as the driving environment changes; that is, the prediction horizon can differ for different motion states of the dynamic obstacles. Therefore, as the driving environment of the autonomous vehicle changes, the prediction horizon can become larger or smaller, and the corresponding driving style of the autonomous vehicle can be conservative or aggressive, so that the driving style can be adjusted dynamically during interaction with dynamic obstacles.
An example of the motion planning method provided in the embodiments of this application is described below with reference to FIG. 5.
FIG. 5 is a schematic flowchart of a motion planning method 500 according to an embodiment of this application.
S510: Obtain driving environment information.
The driving environment information includes position information of dynamic obstacles.
The driving environment information may further include road structure information, position information of static obstacles, position information of the autonomous vehicle, and so on.
S520: Input the state representation of the driving environment information obtained in step S510 into the trained reinforcement learning network model and obtain the prediction horizon output by the model.
S530: Perform motion planning for the autonomous vehicle based on the prediction horizon obtained in step S520, to obtain a planned trajectory of the autonomous vehicle.
Step S530 may include the following two steps:
1) Predict the motion trajectory of the dynamic obstacle with the prediction horizon obtained in step S520 as a hyperparameter;
2) Plan the motion trajectory of the autonomous vehicle with a planning algorithm, based on the position information of the static obstacles in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
S540: Control the autonomous vehicle to drive C steps along the motion trajectory obtained in step S530; in other words, execute the first C steps of that motion trajectory, where C is a positive integer.
S550: Determine whether the driving task is completed. If yes, the autonomous driving operation ends; if no, go to step S510.
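Putting steps S510 to S550 together, a compact sketch of the loop follows; `env`, `policy` and `planner` are illustrative placeholders for perception, the trained reinforcement learning network model and the trajectory planner described above:

```python
def drive(env, policy, planner, execute_steps=10):
    """Sketch of the S510-S550 loop under the assumed placeholder interfaces."""
    while not env.task_done():                      # S550: check whether the driving task is done
        state = env.driving_environment()           # S510: state representation of the environment
        horizon = policy.predict(state)             # S520: prediction horizon output by the model
        trajectory = planner.plan(state, horizon)   # S530: predict obstacles, then plan the ego trajectory
        env.follow(trajectory[:execute_steps])      # S540: execute only the first C steps, then replan
```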
With the motion planning method provided in the embodiments of this application, reinforcement learning is used to determine the prediction horizon in real time from the driving environment information, so the prediction horizon is not fixed but changes dynamically as the driving environment changes. Motion planning based on this prediction horizon therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
For example, applying the motion planning method provided in the embodiments of this application to autonomous driving makes it possible to adjust the driving style dynamically during interaction with dynamic obstacles.
FIG. 6 is a schematic flowchart of a data processing method 600 according to an embodiment of this application. For example, the method 600 can be used to train the reinforcement learning network model employed in the methods 400 and 500. The method 600 includes the following steps.
S610: Obtain training data for the reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and the autonomous-driving environment. The input of the reinforcement learning network model is driving environment information, the output of the model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
S620: Train the reinforcement learning network model with the training data to obtain the trained reinforcement learning network model.
The reinforcement learning network model in the embodiments of this application corresponds to the agent in the reinforcement learning method (as shown in FIG. 3). The training data of the model includes multiple samples, each of which can be expressed as <state s, action a, reward r>. For the meanings of the state s, action a, and reward r, see the description given above with reference to FIG. 3; details are not repeated here.
As shown in FIG. 7, in this embodiment of the application, step S610 includes obtaining one sample <state s, action a, reward r> of the training data through the following steps S611 to S614.
S611: Obtain driving environment information and use it as the state s.
The driving environment information includes position information of a dynamic obstacle.
For example, the driving environment information may further include road structure information, position information of static obstacles, position information of the autonomous vehicle, and so on.
The driving environment information may be obtained, for example, from the information collected by the sensors on the autonomous vehicle. This application does not limit the manner of obtaining the driving environment information.
S612: Input the state s into the reinforcement learning network model to be trained, obtain the prediction horizon output by the model, and use the prediction horizon as the action a, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted.
S613: Perform motion planning using the prediction horizon to obtain a motion trajectory of the autonomous vehicle.
Step S613 may include the following two steps:
1) Predict the motion trajectory of the dynamic obstacle with the prediction horizon obtained in step S612 as a hyperparameter;
2) Plan the motion trajectory of the autonomous vehicle with a planning algorithm, based on the position information of the static obstacles in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
S614: Obtain the reward r by controlling the autonomous vehicle to drive along its motion trajectory.
For example, by controlling the autonomous vehicle to drive along its motion trajectory, updated driving environment information is obtained, and the reward r is computed from the updated driving environment information. The strategy for obtaining the reward r from the updated driving environment information can be determined according to the application requirements; this application does not limit it.
It should be understood that multiple samples <state s, action a, reward r> can be obtained by executing steps S611 to S614 in multiple rounds. Before each new round of steps S611 to S614, the reinforcement learning network model updates the mapping between the state s and the action a according to the reward obtained in step S614 of the previous round.
These samples are used as training data to train the reinforcement learning network model once. The above process is then repeated for the next round of training until the model convergence condition is met, at which point the trained reinforcement learning network model is obtained.
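A compact sketch of this collect-then-train loop follows; `env`, `policy`, `planner` and `rl_update` are illustrative placeholders, with `rl_update` standing in for whichever reinforcement learning algorithm performs the model update:

```python
def train(env, policy, planner, rl_update, batch_size=64, max_rounds=1000):
    """Outer training loop sketched from S610/S620 under the assumed interfaces."""
    for _ in range(max_rounds):
        batch = []
        while len(batch) < batch_size:              # repeat S611-S614 to collect samples
            s = env.driving_environment()           # S611: state s
            a = policy.predict(s)                   # S612: prediction horizon as action a
            trajectory = planner.plan(s, a)         # S613: motion planning with horizon a
            r = env.follow_and_reward(trajectory)   # S614: drive along the trajectory, get reward r
            batch.append((s, a, r))
        rl_update(policy, batch)                    # S620: one training update on the collected samples
        if policy.converged():                      # stop once the convergence condition is met
            return policy
    return policy
```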
Optionally, in step S614 of this embodiment, the reward r can be computed with a cost function.
The cost function can be designed according to the application requirements.
Optionally, the cost function can be determined according to the game behavior between the autonomous vehicle and other vehicles.
As an example, the factors considered when designing the cost function include any one or more of the following:
driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants (for example, other vehicles).
As an example, the reward r is obtained from the following piecewise function, which may also be called the cost function: r = -0.5 × time_step as a per-step time cost, r = -10 on a collision, r = +10 when the driving task is completed, and r = +5 when the other vehicle passes through the narrow road. The four segments are explained below.
The first segment of the piecewise function, "-0.5 × time_step", encourages the autonomous vehicle to finish the driving task as quickly as possible and reflects the traffic efficiency of the autonomous vehicle, where time_step is the timing information of the driving task.
The second segment, "-10", penalizes collisions and reflects safety.
The third segment, "10", rewards completion of the driving task.
The fourth segment, "5", rewards the other vehicle passing through the narrow road, so that the reinforcement learning algorithm considers not only the driving efficiency of the autonomous vehicle but also that of other vehicles; it encourages taking the traffic efficiency of other vehicles into account.
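Read as code, one plausible form of this piecewise cost function is sketched below; whether the four segments are mutually exclusive or summed is not spelled out here, so the mutually exclusive reading and the boolean inputs are assumptions:

```python
def reward(time_step, collided, task_done, other_vehicle_passed):
    """Piecewise reward following the four segments described above; the
    boolean flags are assumed to be reported by the driving environment."""
    if collided:
        return -10.0                 # safety: penalize collisions
    if task_done:
        return 10.0                  # reward completing the driving task
    if other_vehicle_passed:
        return 5.0                   # reward the other vehicle clearing the narrow road
    return -0.5 * time_step          # efficiency: time penalty that grows with elapsed steps
```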
When the reinforcement learning network model trained with the method 600 provided in the embodiments of this application is applied to autonomous driving, a suitable prediction horizon can be determined from the driving environment during motion planning, and motion planning based on this prediction horizon enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
An example of applying the method provided in the embodiments of this application to the narrow-road meeting scene shown in FIG. 8 is described below.
The driving task in the narrow-road meeting scene shown in FIG. 8 is that the autonomous vehicle and another (moving) vehicle both want to pass through a narrow road. The two vehicles drive without considering right of way, and the autonomous vehicle adjusts its own driving behavior according to the driving behavior of the other vehicle.
Step 1): Obtain the state in the reinforcement learning algorithm.
For example, two-dimensional drivable-area and non-drivable-area information is obtained with a lidar. For example, this area information (including the two-dimensional drivable-area and non-drivable-area information) is represented as an 84×84 projection matrix.
For example, so that the reinforcement learning network model can describe the motion of the autonomous vehicle and of other vehicles, the 4 most recent projection matrices taken at an interval of 5 frames from the history can be transformed into the current vehicle coordinate frame, and the resulting sequence of projection matrices is used as the input of the reinforcement learning network model.
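A small sketch of this state construction, assuming a history buffer of (projection matrix, pose) pairs and a helper `to_current_frame` for the coordinate transform (both are illustrative, not defined by this application):

```python
import numpy as np

FRAME_INTERVAL = 5   # take every 5th historical frame
NUM_FRAMES = 4       # stack the 4 most recent such frames

def build_state(history, current_pose, to_current_frame):
    """`history` is assumed to be a sequence of (projection_matrix, pose) pairs,
    oldest first, with at least FRAME_INTERVAL * NUM_FRAMES entries;
    `to_current_frame` re-projects an 84x84 matrix recorded at `pose` into the
    frame defined by `current_pose`."""
    picked = list(history)[::-FRAME_INTERVAL][:NUM_FRAMES]        # newest first, every 5th frame
    aligned = [to_current_frame(m, pose, current_pose) for m, pose in picked]
    return np.stack(aligned, axis=0)                              # shape (NUM_FRAMES, 84, 84)
```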
Step 2): Input the state obtained in step 1), i.e., the matrix sequence, into the reinforcement learning network model to obtain the prediction horizon that the planning algorithm uses for the dynamic obstacles.
For example, the network structure of the reinforcement learning network model can use the ACKTR algorithm, a policy-gradient algorithm under the actor-critic framework that includes a policy network and a value network.
For example, to handle the matrix input, value-network and policy-network models containing convolutional layers and fully connected layers can be designed. The matrix sequence obtained in step 1) is used as the input of the reinforcement learning network model, and the output of the policy network is designed to be the prediction horizon that the planning algorithm uses for the dynamic obstacles. For the explanation of the prediction horizon, see the description above; details are not repeated here.
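As a concrete but purely illustrative shape for such a network, the following PyTorch sketch stacks convolutional layers and fully connected heads on the 4×84×84 matrix sequence; the layer sizes and the choice of 8 discrete candidate horizons are assumptions, and the ACKTR update itself is not shown:

```python
import torch
import torch.nn as nn

class HorizonPolicyNet(nn.Module):
    """Sketch of a convolutional actor-critic network for the stacked 84x84
    projection matrices; outputs policy logits over candidate horizons and a
    state-value estimate."""
    def __init__(self, num_horizons=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                       # infer the flattened feature size
            n = self.features(torch.zeros(1, 4, 84, 84)).shape[1]
        self.policy = nn.Linear(n, num_horizons)    # logits over candidate prediction horizons
        self.value = nn.Linear(n, 1)                # state-value estimate for the critic

    def forward(self, x):
        h = self.features(x)
        return self.policy(h), self.value(h)
```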
Step 3): With the prediction horizon obtained in step 2) as a hyperparameter, use a constant-velocity prediction model to predict the trajectories of the other, moving vehicles over that number of steps.
Based on the static obstacles and the trajectory prediction of the dynamic obstacles, motion planning is performed, for example, with a polynomial planning algorithm. The polynomial algorithm is a sampling-based planning algorithm that plans in the Frenet coordinate frame of a structured road: it first samples lateral offsets from the lane centerline and desired longitudinal speeds, then fits quintic polynomials to generate a set of candidate trajectories, and finally selects the best trajectory according to the planner's cost function and outputs it, completing the motion planning.
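The quintic fitting at the heart of this candidate generation is standard; a minimal sketch is given below, with the lateral-offset sampling shown and the longitudinal sampling and cost-based selection omitted:

```python
import numpy as np

def quintic(x0, v0, a0, x1, v1, a1, T):
    """Coefficients c0..c5 of the quintic connecting (position, velocity,
    acceleration) boundary conditions at t=0 and t=T."""
    A = np.array([[T**3,    T**4,    T**5],
                  [3*T**2,  4*T**3,  5*T**4],
                  [6*T,     12*T**2, 20*T**3]])
    b = np.array([x1 - (x0 + v0*T + 0.5*a0*T**2),
                  v1 - (v0 + a0*T),
                  a1 - a0])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([x0, v0, 0.5*a0, c3, c4, c5])

def sample_candidates(d0, dv0, da0, lateral_offsets, T=4.0):
    """One candidate quintic per sampled lateral offset in the Frenet frame,
    each ending at rest on the target offset."""
    return [quintic(d0, dv0, da0, d_target, 0.0, 0.0, T) for d_target in lateral_offsets]
```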
It should be understood that the autonomous vehicle can drive along the motion trajectory obtained in step 3) until the driving task is completed.
For example, the autonomous vehicle drives several steps along the motion trajectory obtained in step 3). If the driving task is not completed, steps 1) to 3) are executed again and the vehicle drives several steps along the newly obtained trajectory; if the driving task is still not completed, the above operations continue in a loop, and once the driving task is completed the autonomous driving task ends.
The reinforcement learning network model involved in the example described with reference to FIG. 8 can be trained with the method 600 of the above embodiment. See the description above for details, which are not repeated here.
As can be seen from the above, the embodiments of this application use reinforcement learning to determine the prediction horizon in real time from the driving environment information, so that the prediction horizon is not fixed but changes dynamically as the driving environment changes. Motion planning based on this prediction horizon therefore enables the autonomous vehicle to respond flexibly to dynamic obstacles while interacting with them.
The embodiments described herein may be independent solutions or may be combined according to their internal logic; all such solutions fall within the protection scope of this application.
The method embodiments provided by this application have been described above; the apparatus embodiments provided by this application are described below. It should be understood that the descriptions of the apparatus embodiments correspond to those of the method embodiments; therefore, for content not described in detail, reference can be made to the method embodiments above, and details are not repeated here for brevity.
As shown in FIG. 9, an embodiment of this application further provides a data processing apparatus 900. The apparatus 900 includes an environment perception module 910, a motion planning module 920, and a vehicle control module 930.
The environment perception module 910 is configured to obtain driving environment information and pass it to the motion planning module 920.
For example, the environment perception module 910 is configured to obtain the driving environment information from the information collected by the sensors on the vehicle.
The driving environment information includes position information of dynamic obstacles.
The driving environment information may further include road structure information, position information of static obstacles, position information of the autonomous vehicle, and so on.
The motion planning module 920 is configured to receive the driving environment information from the environment perception module 910, obtain the prediction horizon for the dynamic obstacles with the reinforcement learning network model, perform motion planning based on the prediction horizon to obtain the motion trajectory of the autonomous vehicle, and pass the planning control information corresponding to this motion trajectory to the vehicle control module 930.
For example, the motion planning module 920 is configured to perform steps S420 and S430 of the method 400 provided in the method embodiments above.
The vehicle control module 930 is configured to receive the planning control information from the motion planning module 920 and control the vehicle to complete the driving task according to the action instruction information corresponding to the planning control information.
The apparatus 900 provided in the embodiments of this application can be installed on an autonomous vehicle.
As shown in FIG. 10, an embodiment of this application further provides a motion planning apparatus 1000. The apparatus 1000 is configured to perform the method 400 or the method 500 of the method embodiments above and includes an obtaining unit 1010, a prediction unit 1020, and a planning unit 1030.
The obtaining unit 1010 is configured to obtain driving environment information, where the driving environment information includes position information of a dynamic obstacle.
The prediction unit 1020 is configured to input a state representation of the driving environment information into the trained reinforcement learning network model and obtain the prediction horizon output by the model, where the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted.
The planning unit 1030 is configured to perform motion planning using the prediction horizon.
For example, the operation of performing motion planning using the prediction horizon by the planning unit 1030 includes the following steps:
predicting the motion trajectory of the dynamic obstacle with the prediction horizon as a hyperparameter; and planning the motion trajectory of the autonomous vehicle based on the position information of static obstacles included in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
As shown in FIG. 10, the apparatus 1000 may further include a control unit 1040 configured to control the autonomous vehicle to drive along the motion trajectory obtained by the motion planning.
For example, the prediction unit 1020, the planning unit 1030, and the control unit 1040 can be implemented by a processor, and the obtaining unit 1010 can be implemented by a communication interface.
As shown in FIG. 11, an embodiment of this application further provides a data processing apparatus 1100. The apparatus 1100 is configured to perform the method 600 of the method embodiments above and includes an obtaining unit 1110 and a training unit 1120.
The obtaining unit 1110 is configured to obtain training data for the reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and the autonomous-driving environment.
The training unit 1120 is configured to train the reinforcement learning network model with the training data to obtain the trained reinforcement learning network model, where the input of the reinforcement learning network model is driving environment information, the output of the model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
For example, the obtaining unit 1110 is configured to obtain a sample <state s, action a, reward r> of the training data through steps S611 to S614 shown in FIG. 7. See the description above; details are not repeated here.
As shown in FIG. 12, an embodiment of this application further provides a data processing apparatus 1200. The apparatus 1200 includes a processor 1210 coupled to a memory 1220; the memory 1220 is configured to store a computer program or instructions, and the processor 1210 is configured to execute the computer program or instructions stored in the memory 1220 so that the methods of the method embodiments above are performed.
Optionally, as shown in FIG. 12, the apparatus 1200 may further include the memory 1220.
Optionally, as shown in FIG. 12, the apparatus 1200 may further include a data interface 1230 configured to transmit data to and from the outside.
Optionally, in one solution, the apparatus 1200 is configured to implement the method 400 of the above embodiment.
Optionally, in another solution, the apparatus 1200 is configured to implement the method 500 of the above embodiment.
Optionally, in yet another solution, the apparatus 1200 is configured to implement the method 600 of the above embodiment.
An embodiment of this application further provides an autonomous vehicle including the data processing apparatus 900 shown in FIG. 9 or the data processing apparatus 1000 shown in FIG. 10.
Optionally, the autonomous vehicle further includes the data processing apparatus 1100 shown in FIG. 11.
An embodiment of this application further provides an autonomous vehicle including the data processing apparatus 1200 shown in FIG. 12.
An embodiment of this application further provides a computer-readable medium that stores program code to be executed by a device, where the program code includes instructions for performing the methods of the above embodiments.
An embodiment of this application further provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the methods of the above embodiments.
An embodiment of this application further provides a chip including a processor and a data interface, where the processor reads, through the data interface, instructions stored in a memory to perform the methods of the above embodiments.
Optionally, in an implementation, the chip may further include a memory storing instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to perform the methods of the above embodiments.
FIG. 13 shows a chip hardware structure according to an embodiment of this application. The chip includes a neural network processor 1300 and can be installed in any one or more of the following apparatuses:
the apparatus 900 shown in FIG. 9, the apparatus 1000 shown in FIG. 10, the apparatus 1100 shown in FIG. 11, or the apparatus 1200 shown in FIG. 12.
The methods 400, 500, and 600 of the method embodiments above can all be implemented in the chip shown in FIG. 13.
The neural network processor 1300 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks. The core part of the neural network processor 1300 is the arithmetic circuit 1303; the controller 1304 controls the arithmetic circuit 1303 to fetch data from memory (the weight memory 1302 or the input memory 1301) and perform operations.
In some implementations, the arithmetic circuit 1303 internally includes multiple processing elements (process engines, PEs). In some implementations, the arithmetic circuit 1303 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1303 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit 1303 fetches the data corresponding to matrix B from the weight memory 1302 and caches it on each PE in the arithmetic circuit 1303. It then fetches the matrix A data from the input memory 1301 and performs matrix operations with matrix B; the partial or final results of the matrix are stored in the accumulator 1308.
The vector calculation unit 1307 can further process the output of the arithmetic circuit 1303, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and size comparison. For example, the vector calculation unit 1307 can be used for the network calculations of the non-convolution/non-FC layers of a neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector calculation unit 1307 stores the processed output vector in the unified memory (also called the unified buffer) 1306. For example, the vector calculation unit 1307 may apply a nonlinear function to the output of the arithmetic circuit 1303, for example to a vector of accumulated values, to generate activation values. In some implementations, the vector calculation unit 1307 generates normalized values, combined values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1303, for example for use in a subsequent layer of the neural network.
The methods 400, 500, and 600 of the method embodiments above may be executed by the arithmetic circuit 1303 or the vector calculation unit 1307.
The unified memory 1306 is used to store input data and output data.
The direct memory access controller (DMAC) 1305 can transfer input data in the external memory to the input memory 1301 and/or the unified memory 1306, store the weight data in the external memory into the weight memory 1302, and store the data in the unified memory 1306 into the external memory.
The bus interface unit (BIU) 1310 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch buffer 1309 through the bus.
The instruction fetch buffer 1309 connected to the controller 1304 is used to store the instructions used by the controller 1304.
The controller 1304 is used to invoke the instructions cached in the instruction fetch buffer 1309 to control the working process of the computation accelerator.
Generally, the unified memory 1306, the input memory 1301, the weight memory 1302, and the instruction fetch buffer 1309 are all on-chip memories, and the external memory is memory external to the NPU; the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.
It should be noted that the various numerical labels such as first, second, third, or fourth in this document are merely distinctions made for convenience of description and are not intended to limit the scope of the embodiments of this application.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of a software functional unit and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a Universal Serial Bus flash disk (UFD, also referred to as a USB flash drive), a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed in this application, and these shall all be covered by the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (17)
- A motion planning method, comprising: obtaining driving environment information, wherein the driving environment information comprises position information of a dynamic obstacle; inputting a state representation of the driving environment information into a trained reinforcement learning network model and obtaining a prediction horizon output by the reinforcement learning network model, wherein the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; and performing motion planning using the prediction horizon.
- The method according to claim 1, wherein performing motion planning using the prediction horizon comprises: predicting the motion trajectory of the dynamic obstacle with the prediction horizon as a hyperparameter; and planning the motion trajectory of an autonomous vehicle based on position information of static obstacles comprised in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
- The method according to claim 1 or 2, further comprising: controlling an autonomous vehicle to drive along the motion trajectory obtained by the motion planning.
- A data processing method, comprising: obtaining training data for a reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and an autonomous-driving environment; and training the reinforcement learning network model with the training data to obtain a trained reinforcement learning network model, wherein the input of the reinforcement learning network model is driving environment information, the output of the reinforcement learning network model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
- The method according to claim 4, wherein obtaining the training data for the reinforcement learning network model from the data obtained through interaction between the reinforcement learning network model and the autonomous-driving environment comprises obtaining a sample <state s, action a, reward r> of the training data through the following steps: obtaining driving environment information and using it as the state s, wherein the driving environment information comprises position information of a dynamic obstacle; inputting the state s into the reinforcement learning network model to be trained, obtaining the prediction horizon output by the reinforcement learning network model, and using the prediction horizon as the action a, wherein the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; performing motion planning using the prediction horizon to obtain a motion trajectory of an autonomous vehicle; and obtaining the reward r by controlling the autonomous vehicle to drive along the motion trajectory of the autonomous vehicle.
- The method according to claim 5, wherein obtaining the reward r comprises: computing the reward r with a reward function, wherein the reward function takes into account any one or more of the following factors: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
- A motion planning apparatus, comprising: an obtaining unit configured to obtain driving environment information, wherein the driving environment information comprises position information of a dynamic obstacle; a prediction unit configured to input a state representation of the driving environment information into a trained reinforcement learning network model and obtain a prediction horizon output by the reinforcement learning network model, wherein the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; and a planning unit configured to perform motion planning using the prediction horizon.
- The apparatus according to claim 7, wherein the planning unit is configured to: predict the motion trajectory of the dynamic obstacle with the prediction horizon as a hyperparameter; and plan the motion trajectory of an autonomous vehicle based on position information of static obstacles comprised in the driving environment information and the predicted motion trajectory of the dynamic obstacle.
- The apparatus according to claim 7 or 8, further comprising: a control unit configured to control an autonomous vehicle to drive along the motion trajectory obtained by the motion planning.
- A data processing apparatus, comprising: an obtaining unit configured to obtain training data for a reinforcement learning network model from data obtained through interaction between the reinforcement learning network model and an autonomous-driving environment; and a training unit configured to train the reinforcement learning network model with the training data to obtain a trained reinforcement learning network model, wherein the input of the reinforcement learning network model is driving environment information, the output of the reinforcement learning network model is a prediction horizon, and the prediction horizon indicates the duration or number of steps over which the motion trajectory of a dynamic obstacle in autonomous driving is predicted.
- The apparatus according to claim 10, wherein the obtaining unit is configured to obtain a sample <state s, action a, reward r> of the training data through the following steps: obtaining driving environment information and using it as the state s, wherein the driving environment information comprises position information of a dynamic obstacle; inputting the state s into the reinforcement learning network model to be trained, obtaining the prediction horizon output by the reinforcement learning network model, and using the prediction horizon as the action a, wherein the prediction horizon indicates the duration or number of steps over which the motion trajectory of the dynamic obstacle is predicted; performing motion planning using the prediction horizon to obtain a motion trajectory of an autonomous vehicle; and obtaining the reward r by controlling the autonomous vehicle to drive along the motion trajectory of the autonomous vehicle.
- The apparatus according to claim 11, wherein the obtaining unit is configured to compute the reward r with a reward function, wherein the reward function takes into account any one or more of the following factors: driving safety, the traffic efficiency of the autonomous vehicle, and the traffic efficiency of other traffic participants.
- An autonomous vehicle, comprising: the motion planning apparatus according to any one of claims 7 to 9.
- The autonomous vehicle according to claim 13, further comprising: the data processing apparatus according to any one of claims 10 to 12.
- A data processing apparatus, comprising: a memory configured to store executable instructions; and a processor configured to invoke and run the executable instructions in the memory to perform the method according to any one of claims 1 to 6.
- A computer-readable storage medium, wherein the computer-readable storage medium stores program instructions that, when run by a processor, implement the method according to any one of claims 1 to 6.
- A computer program product, wherein the computer program product comprises computer program code that, when run on a computer, implements the method according to any one of claims 1 to 6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010471732.4A CN113805572B (zh) | 2020-05-29 | 2020-05-29 | 运动规划的方法与装置 |
CN202010471732.4 | 2020-05-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021238303A1 true WO2021238303A1 (zh) | 2021-12-02 |
Family
ID=78745524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/075925 WO2021238303A1 (zh) | 2020-05-29 | 2021-02-08 | 运动规划的方法与装置 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113805572B (zh) |
WO (1) | WO2021238303A1 (zh) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114179835A (zh) * | 2021-12-30 | 2022-03-15 | 清华大学苏州汽车研究院(吴江) | 基于真实场景下强化学习的自动驾驶车辆决策训练方法 |
CN114274980A (zh) * | 2022-01-27 | 2022-04-05 | 中国第一汽车股份有限公司 | 轨迹控制方法、装置、车辆及存储介质 |
CN114312831A (zh) * | 2021-12-16 | 2022-04-12 | 浙江零跑科技股份有限公司 | 一种基于空间注意力机制的车辆轨迹预测方法 |
CN114355793A (zh) * | 2021-12-24 | 2022-04-15 | 阿波罗智能技术(北京)有限公司 | 用于车辆仿真评测的自动驾驶规划模型的训练方法及装置 |
CN114396949A (zh) * | 2022-01-18 | 2022-04-26 | 重庆邮电大学 | 一种基于ddpg的移动机器人无先验地图导航决策方法 |
CN114506344A (zh) * | 2022-03-10 | 2022-05-17 | 福瑞泰克智能系统有限公司 | 一种车辆轨迹的确定方法及装置 |
CN114548497A (zh) * | 2022-01-13 | 2022-05-27 | 山东师范大学 | 一种实现场景自适应的人群运动路径规划方法及系统 |
CN114647936A (zh) * | 2022-03-16 | 2022-06-21 | 重庆长安汽车股份有限公司 | 基于场景的车辆行驶轨迹生成方法及可读存储介质 |
CN114715193A (zh) * | 2022-04-15 | 2022-07-08 | 重庆大学 | 一种实时轨迹规划方法及系统 |
CN114779780A (zh) * | 2022-04-26 | 2022-07-22 | 四川大学 | 一种随机环境下路径规划方法及系统 |
CN114771526A (zh) * | 2022-04-14 | 2022-07-22 | 重庆长安汽车股份有限公司 | 一种自动换道的纵向车速控制方法及系统 |
CN114815829A (zh) * | 2022-04-26 | 2022-07-29 | 澳克诺(上海)汽车科技有限公司 | 交叉路口的运动轨迹预测方法 |
CN114859921A (zh) * | 2022-05-12 | 2022-08-05 | 鹏城实验室 | 一种基于强化学习的自动驾驶优化方法及相关设备 |
CN114995421A (zh) * | 2022-05-31 | 2022-09-02 | 重庆长安汽车股份有限公司 | 自动驾驶避障方法、装置、电子设备、存储介质及程序产品 |
CN115303297A (zh) * | 2022-07-25 | 2022-11-08 | 武汉理工大学 | 基于注意力机制与图模型强化学习的城市场景下端到端自动驾驶控制方法及装置 |
CN115489572A (zh) * | 2022-09-21 | 2022-12-20 | 交控科技股份有限公司 | 基于强化学习的列车ato控制方法、设备及存储介质 |
CN115494772A (zh) * | 2022-09-26 | 2022-12-20 | 北京易航远智科技有限公司 | 基于高精地图的自动驾驶控制方法及自动驾驶控制装置 |
CN115617036A (zh) * | 2022-09-13 | 2023-01-17 | 中国电子科技集团公司电子科学研究院 | 一种多模态信息融合的机器人运动规划方法及设备 |
CN116304595A (zh) * | 2023-05-11 | 2023-06-23 | 中南大学湘雅医院 | 基于共享云平台的智能运动分析系统及方法 |
CN116501086A (zh) * | 2023-04-27 | 2023-07-28 | 天津大学 | 一种基于强化学习的飞行器自主规避决策方法 |
CN117141520A (zh) * | 2023-10-31 | 2023-12-01 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | 一种实时轨迹规划方法、装置和设备 |
CN117302204A (zh) * | 2023-11-30 | 2023-12-29 | 北京科技大学 | 依托强化学习的多风格车辆轨迹跟踪避撞控制方法及装置 |
CN117698762A (zh) * | 2023-12-12 | 2024-03-15 | 海识(烟台)信息科技有限公司 | 基于环境感知和行为预测的智能驾驶辅助系统及方法 |
WO2024067115A1 (zh) * | 2022-09-28 | 2024-04-04 | 华为技术有限公司 | 一种生成流模型的训练方法及相关装置 |
CN118182538A (zh) * | 2024-05-17 | 2024-06-14 | 北京理工大学前沿技术研究院 | 基于课程强化学习的无保护左转场景决策规划方法及系统 |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114386599B (zh) * | 2022-01-11 | 2023-01-31 | 北京百度网讯科技有限公司 | 训练轨迹预测模型和轨迹规划的方法和装置 |
CN114644016A (zh) * | 2022-04-14 | 2022-06-21 | 中汽创智科技有限公司 | 车辆自动驾驶决策方法、装置、车载终端和存储介质 |
CN118171684A (zh) * | 2023-03-27 | 2024-06-11 | 华为技术有限公司 | 神经网络、自动驾驶方法和装置 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002162242A (ja) * | 2000-11-27 | 2002-06-07 | Denso Corp | タクシー用情報表示装置 |
CN109855639A (zh) * | 2019-01-15 | 2019-06-07 | 天津大学 | 基于障碍物预测与mpc算法的无人驾驶轨迹规划方法 |
US20190283742A1 (en) * | 2018-03-14 | 2019-09-19 | Honda Motor Co., Ltd. | Vehicle control device, vehicle control method, and storage medium |
CN110293968A (zh) * | 2019-06-18 | 2019-10-01 | 百度在线网络技术(北京)有限公司 | 自动驾驶车辆的控制方法、装置、设备及可读存储介质 |
CN110456634A (zh) * | 2019-07-01 | 2019-11-15 | 江苏大学 | 一种基于人工神经网络的无人车控制参数选取方法 |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103823466B (zh) * | 2013-05-23 | 2016-08-10 | 电子科技大学 | 一种动态环境下移动机器人路径规划方法 |
WO2018162521A1 (en) * | 2017-03-07 | 2018-09-13 | Robert Bosch Gmbh | Action planning system and method for autonomous vehicles |
CN108875998A (zh) * | 2018-04-20 | 2018-11-23 | 北京智行者科技有限公司 | 一种自动驾驶车辆规划方法和系统 |
US11794757B2 (en) * | 2018-06-11 | 2023-10-24 | Colorado State University Research Foundation | Systems and methods for prediction windows for optimal powertrain control |
CN109829386B (zh) * | 2019-01-04 | 2020-12-11 | 清华大学 | 基于多源信息融合的智能车辆可通行区域检测方法 |
CN110471408B (zh) * | 2019-07-03 | 2022-07-29 | 天津大学 | 基于决策过程的无人驾驶车辆路径规划方法 |
CN110398969B (zh) * | 2019-08-01 | 2022-09-27 | 北京主线科技有限公司 | 自动驾驶车辆自适应预测时域转向控制方法及装置 |
CN110989576B (zh) * | 2019-11-14 | 2022-07-12 | 北京理工大学 | 速差滑移转向车辆的目标跟随及动态障碍物避障控制方法 |
CN110780674A (zh) * | 2019-12-04 | 2020-02-11 | 哈尔滨理工大学 | 一种提高自动驾驶轨迹跟踪控制的方法 |
CN111123927A (zh) * | 2019-12-20 | 2020-05-08 | 北京三快在线科技有限公司 | 轨迹规划方法、装置、自动驾驶设备和存储介质 |
-
2020
- 2020-05-29 CN CN202010471732.4A patent/CN113805572B/zh active Active
-
2021
- 2021-02-08 WO PCT/CN2021/075925 patent/WO2021238303A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002162242A (ja) * | 2000-11-27 | 2002-06-07 | Denso Corp | タクシー用情報表示装置 |
US20190283742A1 (en) * | 2018-03-14 | 2019-09-19 | Honda Motor Co., Ltd. | Vehicle control device, vehicle control method, and storage medium |
CN109855639A (zh) * | 2019-01-15 | 2019-06-07 | 天津大学 | 基于障碍物预测与mpc算法的无人驾驶轨迹规划方法 |
CN110293968A (zh) * | 2019-06-18 | 2019-10-01 | 百度在线网络技术(北京)有限公司 | 自动驾驶车辆的控制方法、装置、设备及可读存储介质 |
CN110456634A (zh) * | 2019-07-01 | 2019-11-15 | 江苏大学 | 一种基于人工神经网络的无人车控制参数选取方法 |
Non-Patent Citations (1)
Title |
---|
CHEN SENGPENG, WU JIA;CHEN XIU-YUN: "Hyperparameter Optimization Method Based on Reinforcement Learning", JOURNAL OF CHINESE COMPUTER SYSTEMS, GAI-KAN BIANJIBU , SHENYANG, CN, vol. 41, no. 4, 30 April 2020 (2020-04-30), CN , pages 679 - 684, XP055872007, ISSN: 1000-1220 * |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114312831A (zh) * | 2021-12-16 | 2022-04-12 | 浙江零跑科技股份有限公司 | 一种基于空间注意力机制的车辆轨迹预测方法 |
CN114312831B (zh) * | 2021-12-16 | 2023-10-03 | 浙江零跑科技股份有限公司 | 一种基于空间注意力机制的车辆轨迹预测方法 |
CN114355793A (zh) * | 2021-12-24 | 2022-04-15 | 阿波罗智能技术(北京)有限公司 | 用于车辆仿真评测的自动驾驶规划模型的训练方法及装置 |
CN114355793B (zh) * | 2021-12-24 | 2023-12-29 | 阿波罗智能技术(北京)有限公司 | 用于车辆仿真评测的自动驾驶规划模型的训练方法及装置 |
CN114179835A (zh) * | 2021-12-30 | 2022-03-15 | 清华大学苏州汽车研究院(吴江) | 基于真实场景下强化学习的自动驾驶车辆决策训练方法 |
CN114179835B (zh) * | 2021-12-30 | 2024-01-05 | 清华大学苏州汽车研究院(吴江) | 基于真实场景下强化学习的自动驾驶车辆决策训练方法 |
CN114548497A (zh) * | 2022-01-13 | 2022-05-27 | 山东师范大学 | 一种实现场景自适应的人群运动路径规划方法及系统 |
CN114396949A (zh) * | 2022-01-18 | 2022-04-26 | 重庆邮电大学 | 一种基于ddpg的移动机器人无先验地图导航决策方法 |
CN114396949B (zh) * | 2022-01-18 | 2023-11-10 | 重庆邮电大学 | 一种基于ddpg的移动机器人无先验地图导航决策方法 |
CN114274980A (zh) * | 2022-01-27 | 2022-04-05 | 中国第一汽车股份有限公司 | 轨迹控制方法、装置、车辆及存储介质 |
CN114506344A (zh) * | 2022-03-10 | 2022-05-17 | 福瑞泰克智能系统有限公司 | 一种车辆轨迹的确定方法及装置 |
CN114506344B (zh) * | 2022-03-10 | 2024-03-08 | 福瑞泰克智能系统有限公司 | 一种车辆轨迹的确定方法及装置 |
CN114647936A (zh) * | 2022-03-16 | 2022-06-21 | 重庆长安汽车股份有限公司 | 基于场景的车辆行驶轨迹生成方法及可读存储介质 |
CN114771526A (zh) * | 2022-04-14 | 2022-07-22 | 重庆长安汽车股份有限公司 | 一种自动换道的纵向车速控制方法及系统 |
CN114715193A (zh) * | 2022-04-15 | 2022-07-08 | 重庆大学 | 一种实时轨迹规划方法及系统 |
CN114779780A (zh) * | 2022-04-26 | 2022-07-22 | 四川大学 | 一种随机环境下路径规划方法及系统 |
CN114815829A (zh) * | 2022-04-26 | 2022-07-29 | 澳克诺(上海)汽车科技有限公司 | 交叉路口的运动轨迹预测方法 |
CN114859921A (zh) * | 2022-05-12 | 2022-08-05 | 鹏城实验室 | 一种基于强化学习的自动驾驶优化方法及相关设备 |
CN114995421A (zh) * | 2022-05-31 | 2022-09-02 | 重庆长安汽车股份有限公司 | 自动驾驶避障方法、装置、电子设备、存储介质及程序产品 |
CN115303297A (zh) * | 2022-07-25 | 2022-11-08 | 武汉理工大学 | 基于注意力机制与图模型强化学习的城市场景下端到端自动驾驶控制方法及装置 |
CN115617036A (zh) * | 2022-09-13 | 2023-01-17 | 中国电子科技集团公司电子科学研究院 | 一种多模态信息融合的机器人运动规划方法及设备 |
CN115617036B (zh) * | 2022-09-13 | 2024-05-28 | 中国电子科技集团公司电子科学研究院 | 一种多模态信息融合的机器人运动规划方法及设备 |
CN115489572A (zh) * | 2022-09-21 | 2022-12-20 | 交控科技股份有限公司 | 基于强化学习的列车ato控制方法、设备及存储介质 |
CN115489572B (zh) * | 2022-09-21 | 2024-05-14 | 交控科技股份有限公司 | 基于强化学习的列车ato控制方法、设备及存储介质 |
CN115494772A (zh) * | 2022-09-26 | 2022-12-20 | 北京易航远智科技有限公司 | 基于高精地图的自动驾驶控制方法及自动驾驶控制装置 |
WO2024067115A1 (zh) * | 2022-09-28 | 2024-04-04 | 华为技术有限公司 | 一种生成流模型的训练方法及相关装置 |
CN116501086A (zh) * | 2023-04-27 | 2023-07-28 | 天津大学 | 一种基于强化学习的飞行器自主规避决策方法 |
CN116501086B (zh) * | 2023-04-27 | 2024-03-26 | 天津大学 | 一种基于强化学习的飞行器自主规避决策方法 |
CN116304595A (zh) * | 2023-05-11 | 2023-06-23 | 中南大学湘雅医院 | 基于共享云平台的智能运动分析系统及方法 |
CN116304595B (zh) * | 2023-05-11 | 2023-08-04 | 中南大学湘雅医院 | 基于共享云平台的智能运动分析系统及方法 |
CN117141520B (zh) * | 2023-10-31 | 2024-01-12 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | 一种实时轨迹规划方法、装置和设备 |
CN117141520A (zh) * | 2023-10-31 | 2023-12-01 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | 一种实时轨迹规划方法、装置和设备 |
CN117302204B (zh) * | 2023-11-30 | 2024-02-20 | 北京科技大学 | 依托强化学习的多风格车辆轨迹跟踪避撞控制方法及装置 |
CN117302204A (zh) * | 2023-11-30 | 2023-12-29 | 北京科技大学 | 依托强化学习的多风格车辆轨迹跟踪避撞控制方法及装置 |
CN117698762A (zh) * | 2023-12-12 | 2024-03-15 | 海识(烟台)信息科技有限公司 | 基于环境感知和行为预测的智能驾驶辅助系统及方法 |
CN118182538A (zh) * | 2024-05-17 | 2024-06-14 | 北京理工大学前沿技术研究院 | 基于课程强化学习的无保护左转场景决策规划方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN113805572A (zh) | 2021-12-17 |
CN113805572B (zh) | 2023-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021238303A1 (zh) | 运动规划的方法与装置 | |
JP7086911B2 (ja) | 自動運転車両のためのリアルタイム意思決定 | |
US11713006B2 (en) | Systems and methods for streaming processing for autonomous vehicles | |
US20220363259A1 (en) | Method for generating lane changing decision-making model, method for lane changing decision-making of unmanned vehicle and electronic device | |
WO2022052406A1 (zh) | 一种自动驾驶训练方法、装置、设备及介质 | |
CN110955242B (zh) | 机器人导航方法、系统、机器人及存储介质 | |
CN110646009B (zh) | 一种基于dqn的车辆自动驾驶路径规划的方法及装置 | |
KR102335389B1 (ko) | 자율 주행 차량의 lidar 위치 추정을 위한 심층 학습 기반 특징 추출 | |
KR102292277B1 (ko) | 자율 주행 차량에서 3d cnn 네트워크를 사용하여 솔루션을 추론하는 lidar 위치 추정 | |
US11702105B2 (en) | Technology to generalize safe driving experiences for automated vehicle behavior prediction | |
KR102350181B1 (ko) | 자율 주행 차량에서 rnn 및 lstm을 사용하여 시간적 평활화를 수행하는 lidar 위치 추정 | |
JP2023546810A (ja) | 車両軌跡計画方法、車両軌跡計画装置、電子機器、及びコンピュータプログラム | |
WO2018057978A1 (en) | Decision making for autonomous vehicle motion control | |
CN114162146B (zh) | 行驶策略模型训练方法以及自动驾驶的控制方法 | |
CN114261400B (zh) | 一种自动驾驶决策方法、装置、设备和存储介质 | |
CN114519433A (zh) | 多智能体强化学习、策略执行方法及计算机设备 | |
US20230162539A1 (en) | Driving decision-making method and apparatus and chip | |
CN116476863A (zh) | 基于深度强化学习的自动驾驶横纵向一体化决策方法 | |
Ahmed et al. | A deep q-network reinforcement learning-based model for autonomous driving | |
CN115311860A (zh) | 一种交通流量预测模型的在线联邦学习方法 | |
CN115047864A (zh) | 模型训练的方法、无人设备的控制方法以及装置 | |
Wang et al. | An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle | |
Chi et al. | Deep Reinforcement Learning with Intervention Module for Autonomous Driving | |
WO2022252013A1 (en) | Method and apparatus for training neural network for imitating demonstrator's behavior | |
Yandrapu | Reinforcement Learning based Motion Planning of Autonomous Ground Vehicles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21813771 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21813771 Country of ref document: EP Kind code of ref document: A1 |