WO2021103834A1 - Lane-changing decision model generation method, and unmanned vehicle lane-changing decision method and device - Google Patents

Lane-changing decision model generation method, and unmanned vehicle lane-changing decision method and device Download PDF

Info

Publication number
WO2021103834A1
WO2021103834A1 PCT/CN2020/121339 CN2020121339W WO2021103834A1 WO 2021103834 A1 WO2021103834 A1 WO 2021103834A1 CN 2020121339 W CN2020121339 W CN 2020121339W WO 2021103834 A1 WO2021103834 A1 WO 2021103834A1
Authority
WO
WIPO (PCT)
Prior art keywords
lane
vehicle
network
target
training sample
Prior art date
Application number
PCT/CN2020/121339
Other languages
English (en)
French (fr)
Inventor
时天宇
冉旭
Original Assignee
初速度(苏州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 初速度(苏州)科技有限公司 filed Critical 初速度(苏州)科技有限公司
Priority to DE112020003136.5T priority Critical patent/DE112020003136T5/de
Priority to US17/773,378 priority patent/US20220363259A1/en
Publication of WO2021103834A1 publication Critical patent/WO2021103834A1/zh

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/08Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/095Predicting travel path or likelihood of collision
    • B60W30/0953Predicting travel path or likelihood of collision the prediction being responsive to vehicle dynamic parameters
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • G08G1/167Driving aids for lane monitoring, lane changing, e.g. blind spot detection
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/18Propelling the vehicle
    • B60W30/18009Propelling the vehicle related to particular drive situations
    • B60W30/18163Lane change; Overtaking manoeuvres
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W10/00Conjoint control of vehicle sub-units of different type or different function
    • B60W10/20Conjoint control of vehicle sub-units of different type or different function including control of steering systems
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
    • B60W30/08Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/095Predicting travel path or likelihood of collision
    • B60W30/0956Predicting travel path or likelihood of collision the prediction being responsive to traffic or environmental parameters
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/0097Predicting future conditions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/01Detecting movement of traffic to be counted or controlled
    • G08G1/0104Measuring and analyzing of parameters relative to traffic conditions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2520/00Input parameters relating to overall vehicle dynamics
    • B60W2520/10Longitudinal speed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2520/00Input parameters relating to overall vehicle dynamics
    • B60W2520/10Longitudinal speed
    • B60W2520/105Longitudinal acceleration
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2552/00Input parameters relating to infrastructure
    • B60W2552/10Number of lanes
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/40Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404Characteristics
    • B60W2554/4041Position
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/40Dynamic objects, e.g. animals, windblown objects
    • B60W2554/404Characteristics
    • B60W2554/4042Longitudinal speed

Definitions

  • the present invention relates to the technical field of unmanned driving, in particular to a method for generating a lane-changing decision model and an unmanned vehicle lane-changing decision method and device.
  • the architecture of autonomous systems of unmanned vehicles can usually be divided into perception systems and decision-making control systems.
  • Traditional decision-making control systems use optimization-based algorithms.
  • however, most classic optimization-based methods are computationally complex, and as a result they cannot solve complex decision-making tasks.
  • in practice, driving situations are complex, and unmanned vehicles in unstructured environments use complex sensors such as cameras and laser rangefinders.
  • because the sensor data obtained by these sensors usually depends on a complex and unknown environment, it is difficult for an algorithm to output the optimal control quantity when such sensor data is fed directly into the algorithm framework.
  • in traditional methods, a SLAM algorithm is usually used to map the environment, and the trajectory is then obtained from the resulting map.
  • however, such model-based algorithms introduce instability while the vehicle is driving, because of the high degree of uncertainty (such as road bumps).
  • This specification provides a method for generating a lane-changing decision model and an unmanned vehicle lane-changing decision method and device to overcome at least one technical problem in the prior art.
  • a method for generating a lane-changing decision model, including:
  • acquiring a training sample set for vehicle lane changing, where the training sample set includes multiple training sample groups, and each training sample group includes a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory;
  • each training sample includes a set of state quantities and the corresponding control quantities, where the state quantities include the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
  • training a decision model based on a deep reinforcement learning network with the training sample set to obtain the lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
  • the training sample set is obtained in at least one of the following ways:
  • first acquisition method: in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane changes;
  • second acquisition method: vehicle data from lane-change processes is sampled from a database that stores vehicle lane-change information; the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane at each time step.
  • optionally, the decision model based on the deep reinforcement learning network includes a learning-based prediction network and a pre-trained rule-based target network, and the step of training the decision model with the training sample set to obtain the lane-changing decision model includes:
  • for the training sample set previously added to an experience pool, any state quantity in each group of training samples is used as the input of the prediction network, to obtain the prediction network's predicted control quantity for the next time step of that state quantity;
  • the state quantity at the next time step in the training sample and the corresponding control quantity are used as the input of the target network, to obtain the value-evaluation Q value output by the target network;
  • each time the number of groups of experience data reaches a first preset number, a loss function is calculated from the multiple groups of experience data and the Q value output by the target network for each group, the loss function is optimized to obtain the gradient of the prediction network's parameter change, and the prediction network's parameters are updated until the loss function converges.
  • optionally, after the step of calculating a loss function from the experience data, optimizing and iterating the loss function, and updating the parameters of the prediction network, the method further includes updating the parameters of the target network with experience data selected from the experience pool.
  • optionally, the loss function is the mean square error between a first preset number of value-evaluation Q values of the prediction network and the corresponding value-evaluation Q values of the target network; the Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity and the policy parameters of the prediction network, and the Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity and the policy parameters of the target network.
  • an unmanned vehicle lane-changing decision method, including:
  • at the determined lane-change moment, acquiring sensor data from the body sensors of the target vehicle, the sensor data including the pose, speed and acceleration of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane;
  • invoking the lane-changing decision model to obtain the control quantity of the target vehicle at each moment of the lane-change process, and sending the control quantity at each moment to the actuator so that the target vehicle completes the lane change.
  • a lane-changing decision model generating device including:
  • a sample acquisition module, configured to acquire a training sample set for vehicle lane changing, where the training sample set includes a plurality of training sample groups, and each training sample group includes a training sample at each time step in the process of the vehicle completing a lane change according to a planned lane-change trajectory;
  • each training sample includes a set of state quantities and the corresponding control quantities, where the state quantities include the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
  • a model training module, configured to train the decision model based on the deep reinforcement learning network with the training sample set to obtain the lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
  • the decision model based on the deep reinforcement learning network includes a prediction network based on learning and a pre-trained rule-based target network
  • the model training module includes:
  • the sample input unit is configured to, for the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity, and to use the state quantity at the next time step in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network;
  • the reward generating unit is configured to use the predictive control amount as an input of a pre-built environmental simulator to obtain the environmental reward output by the environmental simulator and the state quantity of the next time step;
  • the experience saving unit is configured to store the state quantity, the corresponding predictive control quantity, the environmental reward, and the state quantity of the next time step as a set of experience data in the experience pool;
  • the parameter update unit is configured to, each time the number of groups of experience data reaches a first preset number, calculate a loss function from multiple groups of the experience data and the Q value output by the target network for each group, optimize the loss function to obtain the gradient of the prediction network's parameter change, and update the prediction network's parameters until the loss function converges.
  • optionally, the parameter update unit is further configured to train and update the parameters of the target network using experience data selected from the experience pool.
  • an unmanned vehicle lane changing decision device including:
  • the data acquisition module is configured to acquire, at the determined lane-change moment, sensor data from the body sensors of the target vehicle, where the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane;
  • the control quantity generation module is configured to invoke the lane-changing decision model and obtain, through the model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities;
  • the control quantity output module is configured to send the control quantity at each moment in the lane change process to the actuator, so that the target vehicle completes the lane change.
  • the embodiment of this specification provides a method for generating a lane-changing decision model and a method and device for unmanned vehicle lane-changing decision-making.
  • a decision model based on a deep reinforcement learning network is trained through a set of obtained training samples.
  • the decision model includes a learning-based prediction network and a pre-trained rule-based target network. Each group of state quantities in the training sample set is input into the prediction network, and the state quantity and control quantity at the next time step in the training sample set are input into the target network.
  • a loss function is calculated from the value estimate, obtained via the prediction network, of the execution result of the corresponding predicted control quantity and from the target network's value estimate of the input training sample, and the loss function is solved to update the policy parameters of the prediction network, so that the prediction network's policy continuously approximates the policy underlying the training sample data.
  • the rule-based policy is thus used to guide the learning-based neural network's search of the space from state quantities to control quantities, incorporating the planning-based optimization algorithm into the reinforcement learning framework and improving the planning efficiency of the prediction network; adding the rule-based policy also resolves the possible failure of the loss function to converge and increases the stability of the model.
  • the decision model can associate the state quantities of the target vehicle with the corresponding control quantities. Compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, resolving the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art; compared with a pure deep neural network, it has better planning efficiency and better adaptability to specific application scenarios.
  • the decision model based on the deep reinforcement learning network is trained with the obtained training sample set; the decision model includes a learning-based prediction network and a pre-trained rule-based target network.
  • each group of state quantities in the training sample set is input into the prediction network, and the state quantity and control quantity at the next time step are input into the target network; a loss function is calculated from the prediction network's value estimate of the execution result of the corresponding predicted control quantity and the target network's value estimate of the input training sample, and the loss function is solved to update the policy parameters of the prediction network so that its policy continuously approximates the policy of the training sample data.
  • the rule-based policy guides the learning-based neural network's search of the space from state quantities to control quantities, incorporating the planning-based optimization algorithm into the reinforcement learning framework, improving the planning efficiency of the prediction network, resolving the possible non-convergence of the loss function, and increasing the stability of the model.
  • the decision model associates the state quantities of the target vehicle with the corresponding control quantities; compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, resolving the decision-making difficulties caused by environmental uncertainty; compared with a pure deep neural network, it has better planning efficiency and better adaptability to specific application scenarios, which is one of the innovations of the embodiments of this specification.
  • the lane-changing decision model obtained according to the method can directly learn the sensor data input by the sensor and output the corresponding control amount, which solves the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art.
  • the fusion of the optimization method and the deep learning network achieves good planning efficiency, which is one of the innovations of the embodiments of this specification.
  • by calculating the loss function, the prediction network's policy is linked to the optimized policy, so that the parameters of the prediction network are iteratively updated and the predicted control quantities output by the prediction network gradually approach more human-like decisions, giving the decision model better decision-making ability, which is one of the innovations of the embodiments of this specification.
  • Fig. 1 is a schematic flowchart showing a method for generating a lane-changing decision model according to an embodiment of the present specification
  • FIG. 2 is a schematic flowchart showing the training process of a lane-changing decision model according to an embodiment of the present specification
  • Fig. 3 is a schematic diagram showing the principle of the training process of the lane-changing decision model provided according to an embodiment of the present specification
  • FIG. 4 is a schematic flowchart showing a method for decision-making for lane changing of an unmanned vehicle according to an embodiment of the present specification
  • Fig. 5 is a schematic diagram showing the principle of an unmanned vehicle lane changing decision method provided according to an embodiment of the present specification
  • FIG. 6 is a schematic diagram showing the structure of an apparatus for generating a lane-changing decision model according to an embodiment of the present specification
  • FIG. 7 is a schematic diagram showing the structure of a lane-changing decision model training module provided according to an embodiment of the present specification.
  • FIG. 8 is a schematic diagram showing the structure of an unmanned vehicle lane changing decision device provided according to an embodiment of the present specification.
  • the embodiments of this specification disclose a method for generating a lane-changing decision model and an unmanned vehicle lane-changing decision method and device, which are described in detail in the following embodiments.
  • referring to FIG. 1, a schematic flowchart of a method for generating a lane-changing decision model provided by an embodiment of the present specification is shown.
  • the method for generating the lane change decision model specifically includes the following steps:
  • S110: acquire a training sample set for vehicle lane changing. Each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle.
  • the decision-making system needs to understand the external environment based on the information input by the perception system, and get the next action of the unmanned vehicle based on the state of the input.
  • the reinforcement-learning-based deep neural network needs to learn the relationship between state quantities and control quantities; the corresponding training sample set is therefore acquired so that the deep neural network can obtain the corresponding control quantity from a state quantity, and the training sample set is obtained in at least one of the following ways:
  • first acquisition method: in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane changes.
  • the first acquisition method is based on a rule-based optimization algorithm.
  • in the simulator, the simulated vehicle performs smooth lane changes multiple times according to the optimization algorithm, so that the state quantity and corresponding control quantity at each time step of the lane-change process are obtained and the neural network can learn the correspondence between them; the optimization algorithm may be a mixed integer quadratic programming (MIQP) algorithm.
  • second acquisition method: vehicle data from lane-change processes is sampled from a database that stores vehicle lane-change information; the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane at each time step.
  • in the second acquisition method, the data required for the training sample set is obtained from a database, so that, through training on this training sample set, the deep neural network acquires a certain degree of human-like decision-making ability.
  • S120 Train a decision model based on the deep reinforcement learning network through the training sample set to obtain a lane-changing decision model.
  • the lane-changing decision model associates the state quantity of the target vehicle with the corresponding control quantity.
  • the decision model based on the deep reinforcement learning network includes a prediction network based on learning and a pre-trained rule-based target network;
  • FIG. 2 is a schematic flowchart of the training process of the lane changing decision model provided by this embodiment.
  • the training steps of the lane change decision model specifically include:
  • S210 For the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity; use the state quantity at the next time step in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network.
  • the prediction network can predict, from the state quantity at the current time step, the control quantity that the unmanned vehicle should adopt at the next time step, while the target network obtains the corresponding value-evaluation Q value from the input state quantity and control quantity.
  • the value evaluation Q value is used to characterize the pros and cons of the strategy corresponding to the state quantity and the control quantity.
  • the state quantity under the current time step in the training sample set is input into the prediction network to obtain the predictive control quantity under the next time step output by the prediction network, and the state quantity of the next time step of the state quantity in the training sample And the corresponding control quantity is input into the target network to obtain the value evaluation of the corresponding strategy, so that the difference of the control quantity obtained according to different strategies under the next time step can be compared.
  • S220 Use the predicted control amount as an input of a pre-built environmental simulator, and obtain the environmental reward output by the environmental simulator and the state amount of the next time step.
  • to compute the value-evaluation Q value of the predicted control quantity output by the prediction network, the predicted control quantity needs to be executed and the feedback reward obtained from the environment; the pre-built environment simulator is used to simulate the execution of the predicted control quantity, so as to obtain its execution result and the environmental reward, evaluate the predicted control quantity, and then construct a loss function to update the prediction network.
  • S230 The state quantity, the corresponding predictive control quantity, the environmental reward, and the state quantity of the next time step are stored in the experience pool as a set of experience data.
  • S240 Each time the number of groups of experience data reaches a first preset number, a loss function is calculated from the multiple groups of experience data and the Q value output by the target network for each group; the loss function is optimized by stochastic gradient descent to obtain the gradient of the prediction network's parameter change, and the parameters are updated continuously until the loss function converges, so that the difference between the prediction network's policy and the target policy gradually decreases and the decision model can output more reasonable, more human-like decision control quantities.
  • the method further includes: when the number of updates of the prediction network's parameters reaches a second preset number, obtaining from the experience pool the predicted control quantities and corresponding state quantities whose environmental reward is higher than a preset value, or obtaining the predicted control quantities and corresponding state quantities whose environmental reward ranks within the top third preset number, and adding these predicted control quantities and corresponding state quantities to the target network's training sample set to train and update the parameters of the target network.
  • the decision-making model can be optimized online, so that the decision-making model has better planning efficiency and achieves more robust effects.
  • the loss function is the mean square error between a first preset number of value-evaluation Q values of the prediction network and the corresponding value-evaluation Q values of the target network; the Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity and the policy parameters of the prediction network, while the Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity and the policy parameters of the target network.
  • the training method optimizes the prediction network's parameters by constructing a loss function so that the prediction network finds a better policy for solving the complex problems in vehicle lane changing, and uses a rule-based policy to guide the learning-based neural network's search of the space from state quantities to control quantities, incorporating the planning-based optimization algorithm into the reinforcement learning framework, improving the planning efficiency of the prediction network and increasing the stability of the model.
  • FIG. 3 is a schematic diagram showing the principle of the training process of the lane changing decision model provided according to an embodiment of the present specification.
  • as shown in Fig. 3, for the training sample set previously added to the experience pool, any state quantity s in each group of training samples is used as the input of the prediction network to obtain the prediction network's predicted control quantity a for the next time step of that state quantity; the state quantity s' at the next time step in the training sample and the corresponding control quantity a' are used as the input of the target network, and the value-evaluation Q_T value output by the target network is obtained.
  • the predicted control quantity a is used as the input of the pre-built environment simulator, and the environmental reward r output by the environment simulator and the state quantity s1 of the next time step are obtained.
  • the state quantity s, the corresponding predicted control quantity a, the environmental reward r, and the state quantity s1 of the next time step are stored as a group of experience data in the experience pool; each time the number of groups of experience data reaches the first preset number, a loss function is calculated from the multiple groups of experience data and the corresponding Q_T values, and the loss function is optimized and iterated to update the parameters of the prediction network until convergence.
  • the rule-based policy in the target network is used to guide the policy optimization of the learning-based neural network, and the planning-based optimization algorithm is incorporated into the reinforcement learning framework, which not only retains the neural network's advantage of directly receiving sensor data input, but also improves the planning efficiency of the prediction network; the addition of the planning policy increases the stability of the model.
  • Fig. 4 is a schematic flowchart showing a method for decision-making for lane changing of an unmanned vehicle according to an embodiment of the present specification.
  • the steps of the unmanned vehicle lane changing decision method include:
  • S310 Acquire sensor data from the body sensor of the target vehicle at the determined lane change time.
  • the sensor data includes the pose, speed, and acceleration of the target vehicle and the vehicle in front of the target vehicle's own lane and the following vehicle on the target lane.
  • S320 Invoke the lane-changing decision model, and obtain the control value of the target vehicle at each moment in the lane-changing process through the lane-changing decision model.
  • the lane-changing decision model associates the state quantity of the target vehicle with the corresponding control value.
  • S330 Send the control quantity at each moment of the lane-change process to the actuator, so that the target vehicle completes the lane change.
  • the sensor data obtained from the body sensors of the target vehicle is input directly into the lane-changing decision model trained according to the lane-changing decision model generation method, and the corresponding control quantity output by the decision model at each moment is obtained, so that the target vehicle changes lanes smoothly; the decision model thus directly receives the sensor input and has good planning efficiency.
  • Fig. 5 is a schematic diagram showing the principle of an unmanned vehicle lane changing decision method provided according to an embodiment of the present specification.
  • as shown in Fig. 5, at the determined lane-change moment, the sensor data in the body sensors of the target vehicle is acquired; the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane. The lane-changing decision model is invoked to obtain the control quantity of the target vehicle at each moment of the lane-change process, and the control quantity at each moment is executed so that the target vehicle completes the lane change.
  • the lane-changing decision model trained according to the generation method can directly receive the sensor data obtained from the target vehicle's body sensors and output the corresponding control quantity at each moment, so that the target vehicle changes lanes smoothly.
  • this lane-changing decision method uses the sensor data as the direct input of the decision model and enables the unmanned vehicle to complete the lane change smoothly according to human-like decisions.
  • this specification also provides embodiments of a lane-changing decision model generating device and an unmanned vehicle lane-changing decision device, which may be implemented by software, by hardware, or by a combination of software and hardware.
  • taking software implementation as an example, as a device in the logical sense, it is formed by the processor of the device where it resides reading the corresponding computer program instructions from the non-volatile memory into memory and running them.
  • in terms of hardware, the hardware structure of the device where the lane-changing decision model generating device and the unmanned vehicle lane-changing decision device are located may include a processor, a network interface, memory, and non-volatile memory, and may also include other hardware, which is not repeated here.
  • FIG. 6 is a schematic diagram showing the structure of an apparatus 400 for generating a lane-changing decision model according to an embodiment of the present specification.
  • the lane change decision model generating device 400 includes:
  • the sample acquisition module 410 is configured to acquire a training sample set for vehicle lane changing, where the training sample set includes a plurality of training sample groups, each training sample group includes a training sample at each time step in the process of the vehicle completing a lane change according to a planned lane-change trajectory, and each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
  • the model training module 420 is configured to train a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
  • the sample obtaining module 410 obtains the training sample set in at least one of the following ways:
  • first acquisition method: in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane changes;
  • second acquisition method: vehicle data from lane-change processes is sampled from a database that stores vehicle lane-change information; the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane at each time step.
  • Fig. 7 is a schematic diagram showing the structure of a lane-changing decision model training module provided according to an embodiment of the present specification.
  • the decision model based on the deep reinforcement learning network includes a prediction network based on learning and a pre-trained rule-based target network.
  • the model training module 420 includes:
  • the sample input unit 402 is configured to, for the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity, and to use the state quantity at the next time step in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network;
  • the reward generating unit 404 is configured to use the predicted control amount as an input of a pre-built environmental simulator to obtain the environmental reward output by the environmental simulator and the state amount of the next time step;
  • the experience saving unit 406 is configured to store the state quantity, the corresponding predictive control quantity, the environmental reward, and the state quantity of the next time step as a set of experience data in the experience pool;
  • the parameter update unit 408 is configured to, each time the number of groups of experience data reaches a first preset number, calculate a loss function from multiple groups of the experience data and the Q value output by the target network for each group, optimize the loss function to obtain the gradient of the prediction network's parameter change, and update the prediction network's parameters until the loss function converges.
  • the parameter update unit 408 is further configured to, when the number of updates of the prediction network's parameters reaches a second preset number, obtain from the experience pool the predicted control quantities and corresponding state quantities whose environmental reward is higher than a preset value, or whose environmental reward ranks within the top third preset number, and add them to the target network's training sample set to train and update the parameters of the target network.
  • the loss function of the parameter update unit is the mean square error between a first preset number of value-evaluation Q values of the prediction network and the corresponding value-evaluation Q values of the target network; the Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity and the parameters of the prediction network, while the Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity and the parameters of the target network.
  • FIG. 8 is a schematic diagram showing the structure of an unmanned vehicle lane changing decision device 500 according to an embodiment of the present specification.
  • the driverless vehicle lane change decision device 500 specifically includes the following modules:
  • the data acquisition module 510 is configured to acquire, at the determined lane-change moment, sensor data from the body sensors of the target vehicle, where the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane;
  • the control quantity generation module 520 is configured to invoke the lane-changing decision model and obtain, through the model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities;
  • the control quantity output module 530 is configured to send the control quantity at each moment in the lane change process to the actuator, so that the target vehicle completes the lane change.
  • in summary, the decision model based on the deep reinforcement learning network is trained with the obtained training sample set, and the prediction network's parameters are optimized by constructing a loss function so that the prediction network finds a better policy for solving the complex problems in vehicle lane changing, making the prediction network's policy continuously approximate the policy of the training sample data.
  • the decision model can associate the state quantities of the target vehicle with the corresponding control quantities; compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, resolving the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art; compared with a pure deep neural network, it has better learning efficiency and better adaptability to specific application scenarios.
  • modules in the device in the embodiment may be distributed in the device in the embodiment according to the description of the embodiment, or may be located in one or more devices different from this embodiment with corresponding changes.
  • the modules of the above-mentioned embodiments can be combined into one module, or further divided into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Mechanical Engineering (AREA)
  • Transportation (AREA)
  • Automation & Control Theory (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Chemical & Material Sciences (AREA)
  • Feedback Control In General (AREA)
  • Analytical Chemistry (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Combustion & Propulsion (AREA)

Abstract

A lane-changing decision model generation method, and an unmanned vehicle lane-changing decision method and device. The lane-changing decision model generation method includes: acquiring a training sample set for vehicle lane changing, the training sample set including multiple training sample groups, each training sample group including a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory, the training sample including a set of state quantities and the corresponding control quantities, the state quantities including the pose, speed and acceleration of the target vehicle, the pose, speed and acceleration of the vehicle ahead in the target vehicle's own lane, and the pose, speed and acceleration of the following vehicle in the target lane, and the control quantities including the speed and angular velocity of the target vehicle; and training a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, the lane-changing decision model associating the state quantities of the target vehicle with the corresponding control quantities.

Description

Lane-changing decision model generation method, and unmanned vehicle lane-changing decision method and device
Technical Field
The present invention relates to the technical field of unmanned driving, and in particular to a lane-changing decision model generation method and an unmanned vehicle lane-changing decision method and device.
Background
In the field of unmanned driving, the architecture of an unmanned vehicle's autonomous system can usually be divided into a perception system and a decision and control system. Traditional decision and control systems use optimization-based algorithms, but most classic optimization-based methods are computationally complex and therefore cannot solve complex decision-making tasks. In practice, driving situations are complex, and unmanned vehicles in unstructured environments use complex sensors such as cameras and laser rangefinders. Since the sensor data obtained by these sensors usually depends on a complex and unknown environment, it is difficult for an algorithm to output the optimal control quantity when such data is fed directly into the algorithm framework. Traditional methods usually use a SLAM algorithm to map the environment and then obtain a trajectory from the resulting map, but such model-based algorithms introduce instability while the vehicle is driving because of the high degree of uncertainty (for example, road bumps).
Summary of the Invention
This specification provides a lane-changing decision model generation method and an unmanned vehicle lane-changing decision method and device, to overcome at least one technical problem in the prior art.
According to a first aspect of the embodiments of this specification, a lane-changing decision model generation method is provided, including:
acquiring a training sample set for vehicle lane changing, where the training sample set includes multiple training sample groups, each training sample group includes a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory, and each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, the pose, speed and acceleration of the vehicle ahead in the target vehicle's own lane, and the pose, speed and acceleration of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
training a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
Optionally, the training sample set is obtained in at least one of the following ways:
First acquisition method:
in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane-change processes;
Second acquisition method:
vehicle data from lane-change processes is sampled from a database storing vehicle lane-change information, where the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane at each time step.
Optionally, the decision model based on the deep reinforcement learning network includes a learning-based prediction network and a pre-trained rule-based target network, and the step of training the decision model based on the deep reinforcement learning network with the training sample set to obtain the lane-changing decision model includes:
for the training sample set previously added to an experience pool, using any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity; using the state quantity at the next time step of that state quantity in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network;
using the predicted control quantity as the input of a pre-built environment simulator to obtain the environmental reward output by the environment simulator and the state quantity of the next time step;
storing the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as a group of experience data;
each time the number of groups of experience data reaches a first preset number, calculating a loss function from the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimizing the loss function to obtain the gradient of the prediction network's parameter change, and updating the prediction network's parameters until the loss function converges.
Optionally, after the step of calculating the loss function from the experience data each time the number of groups of experience data reaches the first preset number and optimizing and iterating the loss function to update the parameters of the prediction network, the method further includes:
when the number of updates of the prediction network's parameters reaches a second preset number, obtaining from the experience pool the predicted control quantities and corresponding state quantities whose environmental reward is higher than a preset value, or obtaining the predicted control quantities and corresponding state quantities whose environmental reward ranks within the top third preset number, and adding these predicted control quantities and corresponding state quantities to the target network's training sample set to train and update the parameters of the target network.
Optionally, the loss function is the mean square error between a first preset number of value-evaluation Q values of the prediction network and the value-evaluation Q values of the target network; the value-evaluation Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity and the policy parameters of the prediction network, and the value-evaluation Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity and the policy parameters of the target network.
According to a second aspect of the embodiments of this specification, an unmanned vehicle lane-changing decision method is provided, including:
at the determined lane-change moment, acquiring sensor data from the body sensors of the target vehicle, where the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane;
invoking a lane-changing decision model, and obtaining, through the lane-changing decision model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities;
sending the control quantity at each moment of the lane-change process to the actuator, so that the target vehicle completes the lane change.
According to a third aspect of the embodiments of this specification, a lane-changing decision model generation device is provided, including:
a sample acquisition module, configured to acquire a training sample set for vehicle lane changing, where the training sample set includes multiple training sample groups, each training sample group includes a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory, and each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
a model training module, configured to train a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
Optionally, the decision model based on the deep reinforcement learning network includes a learning-based prediction network and a pre-trained rule-based target network, and the model training module includes:
a sample input unit, configured to, for the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity, and to use the state quantity at the next time step of that state quantity in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network;
a reward generation unit, configured to use the predicted control quantity as the input of a pre-built environment simulator to obtain the environmental reward output by the environment simulator and the state quantity of the next time step;
an experience saving unit, configured to store the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as a group of experience data;
a parameter update unit, configured to, each time the number of groups of experience data reaches a first preset number, calculate a loss function from the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimize the loss function to obtain the gradient of the prediction network's parameter change, and update the prediction network's parameters until the loss function converges.
Optionally, the parameter update unit is further configured to:
when the number of updates of the prediction network's parameters reaches a second preset number, obtain from the experience pool the predicted control quantities and corresponding state quantities whose environmental reward is higher than a preset value, or obtain the predicted control quantities and corresponding state quantities whose environmental reward ranks within the top third preset number, and add these predicted control quantities and corresponding state quantities to the target network's training sample set to train and update the parameters of the target network.
According to a fourth aspect of the embodiments of this specification, an unmanned vehicle lane-changing decision device is provided, including:
a data acquisition module, configured to acquire, at the determined lane-change moment, sensor data from the body sensors of the target vehicle, where the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane;
a control quantity generation module, configured to invoke a lane-changing decision model and obtain, through the lane-changing decision model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities;
a control quantity output module, configured to send the control quantity at each moment of the lane-change process to the actuator, so that the target vehicle completes the lane change.
The beneficial effects of the embodiments of this specification are as follows:
The embodiments of this specification provide a lane-changing decision model generation method and an unmanned vehicle lane-changing decision method and device. A decision model based on a deep reinforcement learning network is trained with the obtained training sample set; the decision model includes a learning-based prediction network and a pre-trained rule-based target network. Each group of state quantities in the training sample set is input into the prediction network, and the state quantity and control quantity at the next time step of that state quantity in the training sample set are input into the target network. A loss function is calculated from the prediction network's value estimate of the execution result of the corresponding predicted control quantity and the target network's value estimate of the input training sample, and the loss function is solved to update the policy parameters of the prediction network, so that the prediction network's policy continuously approximates the policy of the training sample data. The rule-based policy guides the learning-based neural network's search of the space from state quantities to control quantities, incorporating the planning-based optimization algorithm into the reinforcement learning framework, which improves the planning efficiency of the prediction network; the addition of the rule-based policy also resolves the problem that the loss function may fail to converge and increases the stability of the model. The decision model can associate the state quantities of the target vehicle with the corresponding control quantities. Compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, resolving the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art; compared with a pure deep neural network, it has better planning efficiency and better adaptability to specific application scenarios.
The innovations of the embodiments of this specification include:
1. A decision model based on a deep reinforcement learning network is trained with the obtained training sample set; the decision model includes a learning-based prediction network and a pre-trained rule-based target network. Each group of state quantities in the training sample set is input into the prediction network, and the state quantity and control quantity at the next time step are input into the target network; a loss function is calculated from the prediction network's value estimate of the execution result of the corresponding predicted control quantity and the target network's value estimate of the input training sample, and the loss function is solved to update the policy parameters of the prediction network so that its policy continuously approximates the policy of the training sample data. The rule-based policy guides the learning-based neural network's search of the space from state quantities to control quantities, incorporating the planning-based optimization algorithm into the reinforcement learning framework, improving the planning efficiency of the prediction network, resolving the possible non-convergence of the loss function, and increasing the stability of the model. The decision model can associate the state quantities of the target vehicle with the corresponding control quantities; compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, resolving the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art; compared with a pure deep neural network, it has better planning efficiency and better adaptability to specific application scenarios. This is one of the innovations of the embodiments of this specification.
2. The rule-based target network computes value evaluations of the policies in the training samples to guide the learning-based prediction network's search of the space from state quantities to control quantities, and the optimized policy guides the update of the prediction network's policy, so that the deep reinforcement learning network can solve complex lane-changing decision problems. This is one of the innovations of the embodiments of this specification.
3. The lane-changing decision model obtained by the method can directly learn from the sensor data input by the sensors and output the corresponding control quantities, resolving the decision-making difficulties caused by complex sensors and environmental uncertainty in the prior art; fusing the optimization approach with the deep learning network achieves good planning efficiency. This is one of the innovations of the embodiments of this specification.
4. By calculating the loss function, the prediction network's policy is linked to the optimized policy, so that the parameters of the prediction network are iteratively updated and the predicted control quantities output by the prediction network gradually approach more human-like decisions, giving the decision model better decision-making ability. This is one of the innovations of the embodiments of this specification.
5. In the process of training the prediction network, experience data satisfying preset conditions is selected from the experience pool at a preset frequency and added to the target network's training sample set to update the target network's parameters, so that the decision model has better planning efficiency. This is one of the innovations of the embodiments of this specification.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a lane-changing decision model generation method provided according to an embodiment of this specification;
Fig. 2 is a schematic flowchart of the training process of a lane-changing decision model provided according to an embodiment of this specification;
Fig. 3 is a schematic diagram of the principle of the training process of a lane-changing decision model provided according to an embodiment of this specification;
Fig. 4 is a schematic flowchart of an unmanned vehicle lane-changing decision method provided according to an embodiment of this specification;
Fig. 5 is a schematic diagram of the principle of an unmanned vehicle lane-changing decision method provided according to an embodiment of this specification;
Fig. 6 is a schematic structural diagram of a lane-changing decision model generation device provided according to an embodiment of this specification;
Fig. 7 is a schematic structural diagram of a lane-changing decision model training module provided according to an embodiment of this specification;
Fig. 8 is a schematic structural diagram of an unmanned vehicle lane-changing decision device provided according to an embodiment of this specification.
Detailed Description of the Embodiments
The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "include" and "have" and any variations thereof in the embodiments and drawings of this specification are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
The embodiments of this specification disclose a lane-changing decision model generation method and an unmanned vehicle lane-changing decision method and device, which are described in detail in the following embodiments.
Referring to Fig. 1, a schematic flowchart of the lane-changing decision model generation method provided by an embodiment of this specification is shown. The lane-changing decision model generation method specifically includes the following steps:
S110: Acquire a training sample set for vehicle lane changing, where the training sample set includes multiple training sample groups, each training sample group includes a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory, and each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, the pose, speed and acceleration of the vehicle ahead in the target vehicle's own lane, and the pose, speed and acceleration of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle.
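The patent does not fix a concrete data layout for these quantities. As a minimal illustrative sketch, assuming each pose is represented as (x, y, heading) and each control quantity as (speed, angular velocity), one training sample at a single time step could be encoded as follows (all field names are ours, not from the source):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VehicleState:
    x: float         # longitudinal position
    y: float         # lateral position
    heading: float   # orientation (rad)
    speed: float
    accel: float

@dataclass
class TrainingSample:
    ego: VehicleState                    # target vehicle
    lead_same_lane: VehicleState         # vehicle ahead in the ego lane
    follower_target_lane: VehicleState   # following vehicle in the target lane
    control_speed: float                 # commanded speed of the target vehicle
    control_yaw_rate: float              # commanded angular velocity of the target vehicle

    def state_vector(self) -> np.ndarray:
        """Flatten the three vehicles' pose/speed/acceleration into one state vector s."""
        vs = [self.ego, self.lead_same_lane, self.follower_target_lane]
        return np.array([f for v in vs for f in (v.x, v.y, v.heading, v.speed, v.accel)],
                        dtype=np.float32)

    def control_vector(self) -> np.ndarray:
        """Control quantity a = (speed, angular velocity)."""
        return np.array([self.control_speed, self.control_yaw_rate], dtype=np.float32)
```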
During an unmanned vehicle's lane change, the decision-making system needs to understand the external environment from the information input by the perception system and derive the next action of the unmanned vehicle from the input state. The reinforcement-learning-based deep neural network needs to learn the relationship between state quantities and control quantities; the corresponding training sample set is therefore acquired so that the deep neural network can obtain the corresponding control quantity from a state quantity. The training sample set is obtained in at least one of the following ways:
First acquisition method:
in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane-change processes.
In the first acquisition method, which is based on a rule-based optimization algorithm, the simulated vehicle performs smooth lane changes multiple times in the simulator according to the optimization algorithm, so that the state quantity and corresponding control quantity at each time step of the lane-change process are obtained and the neural network can learn the correspondence between them; the optimization algorithm may be a mixed integer quadratic programming (MIQP) algorithm.
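A sketch of this collection loop is given below. The MIQP planner and the simulator themselves are not specified in the text, so `planner` and `env` are hypothetical placeholders standing in for whatever trajectory optimizer and simulation environment are actually used:

```python
from typing import Callable, List, Tuple
import numpy as np

def collect_lane_change_samples(env, planner: Callable[[np.ndarray], np.ndarray],
                                n_episodes: int = 100) -> List[Tuple[np.ndarray, np.ndarray]]:
    """Roll out a rule-based planner (e.g. an MIQP trajectory optimizer) in a simulator
    and record (state, control) pairs at every time step of every completed lane change.
    `env.reset()`/`env.step()` are assumed interfaces of the (unspecified) simulator."""
    samples = []
    for _ in range(n_episodes):
        state = env.reset()              # state vector of ego, lead vehicle and follower
        done = False
        while not done:
            control = planner(state)     # rule-based optimizer returns (speed, yaw rate)
            samples.append((state.copy(), control.copy()))
            state, done = env.step(control)
    return samples
```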
Second acquisition method:
vehicle data from lane-change processes is sampled from a database storing vehicle lane-change information, where the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane at each time step.
In the second acquisition method, the data required for the training sample set is obtained from a database, so that, through training on this training sample set, the deep neural network acquires a certain degree of human-like decision-making ability.
S120: Train a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
In one embodiment, the decision model based on the deep reinforcement learning network includes a learning-based prediction network and a pre-trained rule-based target network.
Fig. 2 is a schematic flowchart of the training process of the lane-changing decision model provided by this embodiment. The training steps of the lane-changing decision model specifically include:
S210: For the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity; use the state quantity at the next time step of that state quantity in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network.
The prediction network can predict, from the state quantity at the current time step, the control quantity that the unmanned vehicle should adopt at the next time step, while the target network obtains the corresponding value-evaluation Q value from the input state quantity and control quantity; the value-evaluation Q value characterizes how good or bad the policy corresponding to that state quantity and control quantity is.
Therefore, the state quantity at the current time step in the training sample set is input into the prediction network to obtain the predicted control quantity at the next time step output by the prediction network, and the state quantity at the next time step of that state quantity in the training sample and the corresponding control quantity are input into the target network to obtain the value evaluation of the corresponding policy, so that the difference between the control quantities obtained from different policies at the next time step can be compared.
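The patent does not name a network architecture or framework. The sketch below, written under the assumption of PyTorch, shows one plausible shape: a policy head that maps a 15-dimensional state (three vehicles, five quantities each) to the two control quantities, and a Q head that scores a (state, control) pair; the pre-trained rule-based target network is assumed to expose the same interface.

```python
import torch
import torch.nn as nn

STATE_DIM, CONTROL_DIM = 15, 2  # 3 vehicles x (x, y, heading, speed, accel); (speed, yaw rate)

class PredictionNetwork(nn.Module):
    """Learning-based prediction network: policy head pi(s) and value head Q(s, a)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, CONTROL_DIM))
        self.q_head = nn.Sequential(
            nn.Linear(STATE_DIM + CONTROL_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def act(self, state: torch.Tensor) -> torch.Tensor:
        """Predicted control quantity a for the next time step, given state s."""
        return self.policy(state)

    def q_value(self, state: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        """Value-evaluation Q value of a (state, control) pair under the current parameters."""
        return self.q_head(torch.cat([state, control], dim=-1))
```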
S220: Use the predicted control quantity as the input of a pre-built environment simulator, and obtain the environmental reward output by the environment simulator and the state quantity of the next time step.
To compute the value-evaluation Q value of the predicted control quantity output by the prediction network, the predicted control quantity needs to be executed and the feedback reward obtained from the environment. The pre-built environment simulator is used to simulate the execution of the predicted control quantity, so as to obtain its execution result and the environmental reward, evaluate the predicted control quantity, and then construct a loss function to update the prediction network.
S230: Store the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as a group of experience data.
Storing the predicted control quantity together with the corresponding environmental reward and the state quantity of the next time step in the experience pool firstly yields more usable data on vehicle lane changes, and secondly makes it possible to update the parameters of the target network from the experience data so as to obtain a more reasonable value evaluation of the control policy, allowing the trained decision model to make more human-like decisions.
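A minimal sketch of such an experience pool, holding the (s, a, r, s1) groups described in S230, is shown below; the helper for selecting high-reward experiences anticipates the target-network update described further on, and all names are illustrative.

```python
import random
from collections import deque

class ExperiencePool:
    """Experience pool of (state, predicted control, environmental reward, next state) groups."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, control, reward, next_state):
        self.buffer.append((state, control, reward, next_state))

    def sample(self, n: int):
        """Draw n groups of experience data for one loss computation."""
        return random.sample(self.buffer, n)

    def top_by_reward(self, k: int):
        """Experiences with the highest environmental reward, later used to refresh
        the target network's training sample set."""
        return sorted(self.buffer, key=lambda e: e[2], reverse=True)[:k]

    def __len__(self):
        return len(self.buffer)
```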
S240: Each time the number of groups of experience data reaches a first preset number, calculate a loss function from the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimize the loss function to obtain the gradient of the prediction network's parameter change, and update the prediction network's parameters until the loss function converges.
A Q value characterizing the value evaluation of a predicted control quantity is computed from the environmental reward obtained for that predicted control quantity, and a loss function is constructed from the value-evaluation Q values of multiple predicted control quantities and the value-evaluation Q values corresponding to the training samples at the corresponding time steps. The loss function characterizes the difference between the policy currently learned by the prediction network and the target policy in the training samples. The loss function is optimized by stochastic gradient descent to obtain the gradient of the prediction network's parameter change, and the parameters are updated continuously until the loss function converges, so that the difference between the prediction network's policy and the target policy gradually decreases and the decision model can output more reasonable, more human-like decision control quantities.
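Putting S210 to S240 together, one update round could look like the sketch below (continuing the PyTorch assumption above). How the target-side (s', a') pair is carried from S210 into the loss is not pinned down by the text; this sketch simply re-queries the rule-based target network at the stored next state, which is one of several reasonable readings.

```python
import torch
import torch.nn.functional as F

def update_prediction_network(pred_net, target_net, pool, optimizer, batch_size=64):
    """One S240-style update: mean square error between the prediction network's Q values
    and the target network's Q values over a batch of experience groups."""
    if len(pool) < batch_size:                  # wait until the first preset number is reached
        return None
    batch = pool.sample(batch_size)
    s = torch.tensor([e[0] for e in batch], dtype=torch.float32)
    a = torch.tensor([e[1] for e in batch], dtype=torch.float32)
    s_next = torch.tensor([e[3] for e in batch], dtype=torch.float32)
    with torch.no_grad():
        a_next = target_net.act(s_next)                # rule-based control at the next step
        q_target = target_net.q_value(s_next, a_next)  # Q_T output by the target network
    q_pred = pred_net.q_value(s, a)                    # Q value of the prediction network
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```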
In a specific embodiment, after the step of calculating the loss function from the experience data each time the number of groups of experience data reaches the first preset number and optimizing and iterating the loss function to update the parameters of the prediction network, the method further includes: when the number of updates of the prediction network's parameters reaches a second preset number, obtaining from the experience pool the predicted control quantities and corresponding state quantities whose environmental reward is higher than a preset value, or obtaining the predicted control quantities and corresponding state quantities whose environmental reward ranks within the top third preset number, and adding these predicted control quantities and corresponding state quantities to the target network's training sample set to train and update the parameters of the target network.
By updating the parameters of the target network in this way, the decision model can be optimized online, so that the decision model has better planning efficiency and achieves more robust results.
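A small sketch of this selection step is given below. How the target network is retrained on the selected pairs is not specified by the patent, so the final `fit` call is only a placeholder for that unspecified routine:

```python
def maybe_refresh_target_network(pool, target_net, update_count,
                                 second_preset=1000, reward_threshold=None, third_preset=256):
    """After every `second_preset` prediction-network updates, select high-reward experiences
    and add their (state, predicted control) pairs to the target network's training sample set."""
    if update_count == 0 or update_count % second_preset != 0:
        return
    if reward_threshold is not None:
        selected = [e for e in pool.buffer if e[2] > reward_threshold]   # reward above preset value
    else:
        selected = pool.top_by_reward(third_preset)                      # top-ranked by reward
    extra_samples = [(state, control) for state, control, _, _ in selected]
    target_net.fit(extra_samples)   # placeholder: retrain/update the rule-based target network
```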
In a specific embodiment, the loss function is the mean square error between a first preset number of value-evaluation Q values of the prediction network and the value-evaluation Q values of the target network; the value-evaluation Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity and the policy parameters of the prediction network, and the value-evaluation Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity and the policy parameters of the target network.
In this embodiment, the training method optimizes the prediction network's parameters by constructing a loss function so that the prediction network finds a better policy for solving the complex problems in vehicle lane changing, and uses a rule-based policy to guide the learning-based neural network's search of the space from state quantities to control quantities, thereby incorporating the planning-based optimization algorithm into the reinforcement learning framework, improving the planning efficiency of the prediction network, and increasing the stability of the model.
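Read literally, and writing the first preset number as $N$, the prediction network's Q value as $Q(s_i, a_i; \theta)$ and the target network's Q value as $Q_T(s'_i, a'_i; \theta_T)$ (this notation is ours, not the patent's), the loss described above can be expressed as

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( Q(s_i, a_i; \theta) - Q_T(s'_i, a'_i; \theta_T) \right)^2,$$

where $s_i$ and $a_i$ are the state quantity and the predicted control quantity stored in the $i$-th group of experience data, and $s'_i$ and $a'_i$ are the next-step state quantity and the corresponding control quantity taken from the training sample.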
Fig. 3 is a schematic diagram of the principle of the training process of the lane-changing decision model provided according to an embodiment of this specification. As shown in Fig. 3, for the training sample set previously added to the experience pool, any state quantity s in each group of training samples is used as the input of the prediction network to obtain the prediction network's predicted control quantity a for the next time step of that state quantity; the state quantity s' at the next time step of that state quantity in the training sample and the corresponding control quantity a' are used as the input of the target network, and the value-evaluation Q_T value output by the target network is obtained. The predicted control quantity a is used as the input of the pre-built environment simulator, and the environmental reward r output by the environment simulator and the state quantity s1 of the next time step are obtained. The state quantity s, the corresponding predicted control quantity a, the environmental reward r, and the state quantity s1 of the next time step are stored in the experience pool as a group of experience data. Each time the number of groups of experience data reaches the first preset number, a loss function is calculated from the multiple groups of experience data and the Q_T value output by the target network for each group of experience data, and the loss function is optimized and iterated to update the parameters of the prediction network until convergence.
In this embodiment, the rule-based policy in the target network guides the policy optimization of the learning-based neural network, and the planning-based optimization algorithm is incorporated into the reinforcement learning framework, which not only retains the neural network's advantage of being able to receive sensor data input directly, but also improves the planning efficiency of the prediction network; the addition of the planning policy increases the stability of the model.
Fig. 4 is a schematic flowchart of the unmanned vehicle lane-changing decision method provided according to an embodiment of this specification. The steps of the unmanned vehicle lane-changing decision method include:
S310: At the determined lane-change moment, acquire sensor data from the body sensors of the target vehicle, where the sensor data includes the pose, speed and acceleration of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane.
The pose, speed and acceleration of the target vehicle, of the vehicle ahead in its own lane, and of the following vehicle in the target lane are acquired, and the control quantities that the target vehicle needs to execute to complete the lane change are derived from these data.
S320: Invoke the lane-changing decision model, and obtain, through the lane-changing decision model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
S330: Send the control quantity at each moment of the lane-change process to the actuator, so that the target vehicle completes the lane change.
Starting from the initial moment of the lane change, the lane-changing decision model is invoked on the state quantities obtained at each time step of the target vehicle to compute the corresponding control quantity, so that the target vehicle can change lanes smoothly by executing the corresponding control quantities.
In this embodiment, the sensor data obtained from the body sensors of the target vehicle is input directly into the lane-changing decision model trained according to the lane-changing decision model generation method, and the corresponding control quantity output by the decision model at the corresponding moment is obtained, so that the target vehicle changes lanes smoothly; the decision model thus directly receives the sensor input and has good planning efficiency.
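A minimal sketch of this per-time-step inference loop (S310 to S330) is shown below; `sensors.read_state()` and `actuator.apply()` are illustrative interfaces only, since the patent does not describe the sensor or actuator APIs:

```python
import torch

def run_lane_change(model, sensors, actuator, max_steps=200):
    """Read the body-sensor state at each time step, query the trained lane-changing
    decision model, and send (speed, yaw rate) to the actuator until the lane change ends."""
    model.eval()
    for _ in range(max_steps):
        state = sensors.read_state()   # poses/speeds/accelerations of ego, lead and follower
        with torch.no_grad():
            control = model.act(torch.as_tensor(state, dtype=torch.float32))
        speed, yaw_rate = control.tolist()
        actuator.apply(speed=speed, yaw_rate=yaw_rate)
        if sensors.lane_change_completed():
            break
```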
Fig. 5 is a schematic diagram of the principle of the unmanned vehicle lane-changing decision method provided according to an embodiment of this specification. As shown in Fig. 5, at the determined lane-change moment, the sensor data in the body sensors of the target vehicle is acquired; the sensor data includes the pose, speed and acceleration of the target vehicle, the pose, speed and acceleration of the vehicle ahead in the target vehicle's own lane, and the pose, speed and acceleration of the following vehicle in the target lane. The lane-changing decision model is invoked, the control quantity of the target vehicle at each moment of the lane-change process is obtained through the lane-changing decision model, and the control quantity at each moment is executed so that the target vehicle completes the lane change.
In this embodiment, the lane-changing decision model trained according to the lane-changing decision model generation method can directly receive the sensor data input obtained from the body sensors of the target vehicle and output the corresponding control quantity at the corresponding moment, so that the target vehicle changes lanes smoothly. This lane-changing decision method uses the sensor data as the direct input of the decision model and enables the unmanned vehicle to complete the lane change smoothly according to human-like decisions.
Corresponding to the foregoing lane-changing decision model generation method and unmanned vehicle lane-changing decision method, this specification also provides embodiments of a lane-changing decision model generation device and an unmanned vehicle lane-changing decision device. The device embodiments may be implemented by software, or by hardware or a combination of software and hardware. Taking software implementation as an example, as a device in the logical sense, it is formed by the processor of the device where it resides reading the corresponding computer program instructions from the non-volatile memory into memory and running them. In terms of hardware, the hardware structure of the device where the lane-changing decision model generation device and the unmanned vehicle lane-changing decision device of this specification are located may include a processor, a network interface, memory, and non-volatile memory, and may also include other hardware, which is not repeated here.
Fig. 6 is a schematic structural diagram of the lane-changing decision model generation device 400 provided according to an embodiment of this specification. The lane-changing decision model generation device 400 includes:
a sample acquisition module 410, configured to acquire a training sample set for vehicle lane changing, where the training sample set includes multiple training sample groups, each training sample group includes a training sample at each time step in the process of a vehicle completing a lane change according to a planned lane-change trajectory, and each training sample includes a set of state quantities and the corresponding control quantities; the state quantities include the pose, speed and acceleration of the target vehicle, the pose, speed and acceleration of the vehicle ahead in the target vehicle's own lane, and the pose, speed and acceleration of the following vehicle in the target lane, and the control quantities include the speed and angular velocity of the target vehicle;
a model training module 420, configured to train a decision model based on a deep reinforcement learning network with the training sample set to obtain a lane-changing decision model, where the lane-changing decision model associates the state quantities of the target vehicle with the corresponding control quantities.
In a specific embodiment, the sample acquisition module 410 obtains the training sample set in at least one of the following ways:
First acquisition method:
in a simulator, the vehicle completes lane changes according to a rule-based optimization algorithm, and the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane are obtained at each time step over multiple lane-change processes;
Second acquisition method:
vehicle data from lane-change processes is sampled from a database storing vehicle lane-change information, where the vehicle data includes the state quantities and corresponding control quantities of the target vehicle, of the vehicle ahead in the target vehicle's own lane, and of the following vehicle in the target lane at each time step.
Fig. 7 is a schematic structural diagram of the lane-changing decision model training module provided according to an embodiment of this specification. The decision model based on the deep reinforcement learning network includes a learning-based prediction network and a pre-trained rule-based target network, and the model training module 420 includes:
a sample input unit 402, configured to, for the training sample set previously added to the experience pool, use any state quantity in each group of training samples as the input of the prediction network to obtain the prediction network's predicted control quantity for the next time step of that state quantity, and to use the state quantity at the next time step of that state quantity in the training sample and the corresponding control quantity as the input of the target network to obtain the value-evaluation Q value output by the target network;
a reward generation unit 404, configured to use the predicted control quantity as the input of a pre-built environment simulator to obtain the environmental reward output by the environment simulator and the state quantity of the next time step;
an experience saving unit 406, configured to store the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as a group of experience data;
a parameter update unit 408, configured to, each time the number of groups of experience data reaches a first preset number, calculate a loss function from the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimize the loss function to obtain the gradient of the prediction network's parameter change, and update the prediction network's parameters until the loss function converges.
In a specific embodiment, the parameter updating unit 408 is further configured to:
after the number of updates of the prediction network parameters reaches a second preset number, obtain from the experience pool the predicted control quantities whose environmental reward is higher than a preset value and the corresponding state quantities, or obtain from the experience pool the predicted control quantities whose environmental reward ranks within a top third preset number and the corresponding state quantities, and add the predicted control quantities and the corresponding state quantities to the target network training sample set of the target network, so as to train and update the parameters of the target network.
In a specific embodiment, the loss function of the parameter updating unit is the mean squared error between the value-assessment Q values of the prediction network over a first preset number of samples and the value-assessment Q values of the target network; the value-assessment Q value of the prediction network is a function of the input state quantity, the corresponding predicted control quantity, and the parameters of the prediction network, and the value-assessment Q value of the target network is a function of the state quantity in the input training sample, the corresponding control quantity, and the parameters of the target network.
Fig. 8 is a structural diagram illustrating an unmanned-vehicle lane-change decision-making apparatus 500 according to an embodiment of this specification. The unmanned-vehicle lane-change decision-making apparatus 500 specifically includes the following modules:
a data acquisition module 510, configured to, at the determined lane-change moment, acquire sensor data from the on-board sensors of the target vehicle, the sensor data including the pose, velocity, and acceleration of the target vehicle, of the preceding vehicle in the target vehicle's current lane, and of the following vehicle in the target lane;
a control quantity generation module 520, configured to invoke the lane-change decision-making model and obtain, through the lane-change decision-making model, the control quantity of the target vehicle at each moment of the lane-change process, where the lane-change decision-making model associates the state quantity of the target vehicle with the corresponding control quantity;
a control quantity output module 530, configured to send the control quantity at each moment of the lane-change process to the actuators so that the target vehicle completes the lane change.
For the implementation of the functions and roles of the individual units in the above apparatuses, reference is made to the implementation of the corresponding steps in the above methods, which will not be repeated here.
In summary, a decision-making model based on a deep reinforcement learning network is trained with the acquired training sample set, and the prediction network parameters are optimized by constructing a loss function so that the prediction network finds a better policy for the complex problems of vehicle lane changing, the policy of the prediction network continually approximating the policy underlying the training sample data. The decision-making model associates the state quantity of the target vehicle with the corresponding control quantity. Compared with traditional offline optimization algorithms, it can directly receive sensor input and has good online planning efficiency, overcoming the decision-making difficulties caused in the prior art by complex sensors and environmental uncertainty; compared with a purely deep neural network, it has better learning efficiency and greater adaptability to specific application scenarios.
Those of ordinary skill in the art will understand that the drawings are only schematic diagrams of one embodiment, and that the modules or flows in the drawings are not necessarily required for implementing the present invention.
Those of ordinary skill in the art will understand that the modules of the apparatuses in the embodiments may be distributed in the apparatuses as described in the embodiments, or may, with corresponding changes, be located in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be merged into one module or further split into multiple sub-modules.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features therein, and that such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A method for generating a lane-change decision-making model, comprising:
    acquiring a training sample set for vehicle lane changing, wherein the training sample set comprises a plurality of training sample groups, each training sample group comprises training samples at each time step of a process in which a vehicle completes a lane change along a planned lane-change trajectory, each training sample comprises a group of state quantities and corresponding control quantities, the state quantities comprise the pose, velocity, and acceleration of a target vehicle, the pose, velocity, and acceleration of a preceding vehicle in the target vehicle's current lane, and the pose, velocity, and acceleration of a following vehicle in a target lane, and the control quantities comprise a speed and an angular speed of the target vehicle;
    training a decision-making model based on a deep reinforcement learning network with the training sample set to obtain a lane-change decision-making model, wherein the lane-change decision-making model associates the state quantity of the target vehicle with the corresponding control quantity.
  2. The method according to claim 1, wherein the training sample set is obtained in at least one of the following ways:
    a first acquisition way:
    having a vehicle complete lane changes in a simulator according to a rule-based optimization algorithm, and acquiring, for multiple lane-change processes, the state quantities and the corresponding control quantities at each time step of the target vehicle, of the preceding vehicle in the target vehicle's current lane, and of the following vehicle in the target lane;
    a second acquisition way:
    sampling, from a database storing vehicle lane-change information, vehicle data of lane-change processes, wherein the vehicle data comprises the state quantities and the corresponding control quantities at each time step of the target vehicle, of the preceding vehicle in the target vehicle's current lane, and of the following vehicle in the target lane.
  3. The method according to claim 1, wherein the decision-making model based on a deep reinforcement learning network comprises a learning-based prediction network and a pre-trained rule-based target network, and the step of training the decision-making model based on a deep reinforcement learning network with the training sample set to obtain the lane-change decision-making model comprises:
    for the training sample set previously added to an experience pool, taking any state quantity in each group of training samples as an input of the prediction network to obtain a predicted control quantity of the prediction network for the next time step of that state quantity, and taking the state quantity of the next time step of that state quantity in the training sample and the corresponding control quantity as an input of the target network to obtain a value-assessment Q value output by the target network;
    taking the predicted control quantity as an input of a pre-built environment simulator to obtain an environmental reward output by the environment simulator and a state quantity of the next time step;
    storing the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as one group of experience data;
    each time the number of groups of experience data reaches a first preset number, computing a loss function according to the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimizing the loss function to obtain a gradient of the change of the prediction network parameters, and updating the prediction network parameters until the loss function converges.
  4. The method according to claim 3, wherein after the step of, each time the number of groups of experience data reaches the first preset number, computing the loss function according to the experience data, optimizing the loss function to obtain the gradient of the change of the prediction network parameters, and updating the prediction network parameters until the loss function converges, the method further comprises:
    after the number of updates of the prediction network parameters reaches a second preset number, obtaining from the experience pool predicted control quantities whose environmental reward is higher than a preset value and the corresponding state quantities, or obtaining from the experience pool predicted control quantities whose environmental reward ranks within a top third preset number and the corresponding state quantities, and adding the predicted control quantities and the corresponding state quantities to a target network training sample set of the target network, so as to train and update the parameters of the target network.
  5. The method according to claim 3, wherein the loss function is the mean squared error between the value-assessment Q values of the prediction network over a first preset number of samples and the value-assessment Q values of the target network, the value-assessment Q value of the prediction network being a function of the input state quantity, the corresponding predicted control quantity, and the policy parameters of the prediction network, and the value-assessment Q value of the target network being a function of the state quantity in the input training sample, the corresponding control quantity, and the policy parameters of the target network.
  6. A lane-change decision-making method for an unmanned vehicle, comprising:
    at a determined lane-change moment, acquiring sensor data from on-board sensors of a target vehicle, the sensor data comprising the pose, velocity, and acceleration of the target vehicle, of a preceding vehicle in the target vehicle's current lane, and of a following vehicle in a target lane;
    invoking a lane-change decision-making model and obtaining, through the lane-change decision-making model, a control quantity of the target vehicle at each moment of the lane-change process, wherein the lane-change decision-making model associates the state quantity of the target vehicle with the corresponding control quantity;
    sending the control quantity at each moment of the lane-change process to actuators so that the target vehicle completes the lane change.
  7. An apparatus for generating a lane-change decision-making model, comprising:
    a sample acquisition module, configured to acquire a training sample set for vehicle lane changing, wherein the training sample set comprises a plurality of training sample groups, each training sample group comprises training samples at each time step of a process in which a vehicle completes a lane change along a planned lane-change trajectory, each training sample comprises a group of state quantities and corresponding control quantities, the state quantities comprise the pose, velocity, and acceleration of a target vehicle, the pose, velocity, and acceleration of a preceding vehicle in the target vehicle's current lane, and the pose, velocity, and acceleration of a following vehicle in a target lane, and the control quantities comprise a speed and an angular speed of the target vehicle;
    a model training module, configured to train a decision-making model based on a deep reinforcement learning network with the training sample set to obtain a lane-change decision-making model, wherein the lane-change decision-making model associates the state quantity of the target vehicle with the corresponding control quantity.
  8. The apparatus according to claim 7, wherein the decision-making model based on a deep reinforcement learning network comprises a learning-based prediction network and a pre-trained rule-based target network, and the model training module comprises:
    a sample input unit, configured to, for the training sample set previously added to an experience pool, take any state quantity in each group of training samples as an input of the prediction network to obtain a predicted control quantity of the prediction network for the next time step of that state quantity, and take the state quantity of the next time step of that state quantity in the training sample and the corresponding control quantity as an input of the target network to obtain a value-assessment Q value output by the target network;
    a reward generation unit, configured to take the predicted control quantity as an input of a pre-built environment simulator to obtain an environmental reward output by the environment simulator and a state quantity of the next time step;
    an experience saving unit, configured to store the state quantity, the corresponding predicted control quantity, the environmental reward, and the state quantity of the next time step in the experience pool as one group of experience data;
    a parameter updating unit, configured to, each time the number of groups of experience data reaches a first preset number, compute a loss function according to the multiple groups of experience data and the Q value output by the target network for each group of experience data, optimize the loss function to obtain a gradient of the change of the prediction network parameters, and update the prediction network parameters until the loss function converges.
  9. The apparatus according to claim 7, wherein the parameter updating unit is further configured to:
    after the number of updates of the prediction network parameters reaches a second preset number, obtain from the experience pool predicted control quantities whose environmental reward is higher than a preset value and the corresponding state quantities, or obtain from the experience pool predicted control quantities whose environmental reward ranks within a top third preset number and the corresponding state quantities, and add the predicted control quantities and the corresponding state quantities to a target network training sample set of the target network, so as to train and update the parameters of the target network.
  10. A lane-change decision-making apparatus for an unmanned vehicle, comprising:
    a data acquisition module, configured to, at a determined lane-change moment, acquire sensor data from on-board sensors of a target vehicle, the sensor data comprising the pose, velocity, and acceleration of the target vehicle, of a preceding vehicle in the target vehicle's current lane, and of a following vehicle in a target lane;
    a control quantity generation module, configured to invoke a lane-change decision-making model and obtain, through the lane-change decision-making model, a control quantity of the target vehicle at each moment of the lane-change process, wherein the lane-change decision-making model associates the state quantity of the target vehicle with the corresponding control quantity;
    a control quantity output module, configured to send the control quantity at each moment of the lane-change process to actuators so that the target vehicle completes the lane change.
PCT/CN2020/121339 2019-11-27 2020-10-16 Method for generating a lane-change decision-making model, and lane-change decision-making method and apparatus for an unmanned vehicle WO2021103834A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE112020003136.5T DE112020003136T5 (de) 2019-11-27 2020-10-16 Verfahren zum Erzeugen eines Spurwechsel-Entscheidungsmodells, Verfahren und Vorrichtung zur Spurwechsel-Entscheidung eines unbemannten Fahrzeugs
US17/773,378 US20220363259A1 (en) 2019-11-27 2020-10-16 Method for generating lane changing decision-making model, method for lane changing decision-making of unmanned vehicle and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911181338.0A CN112937564B (zh) 2019-11-27 Method for generating a lane-change decision-making model, and lane-change decision-making method and apparatus for an unmanned vehicle
CN201911181338.0 2019-11-27

Publications (1)

Publication Number Publication Date
WO2021103834A1 (zh)

Family

ID=76129958

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121339 WO2021103834A1 (zh) 2019-11-27 2020-10-16 Method for generating a lane-change decision-making model, and lane-change decision-making method and apparatus for an unmanned vehicle

Country Status (4)

Country Link
US (1) US20220363259A1 (zh)
CN (1) CN112937564B (zh)
DE (1) DE112020003136T5 (zh)
WO (1) WO2021103834A1 (zh)


Also Published As

Publication number Publication date
CN112937564A (zh) 2021-06-11
DE112020003136T5 (de) 2022-03-24
CN112937564B (zh) 2022-09-02
US20220363259A1 (en) 2022-11-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20892710; Country of ref document: EP; Kind code of ref document: A1)
122 Ep: pct application non-entry in european phase (Ref document number: 20892710; Country of ref document: EP; Kind code of ref document: A1)