WO2022088798A1 - An automatic driving decision-making method, system, device, and computer storage medium - Google Patents

An automatic driving decision-making method, system, device, and computer storage medium

Info

Publication number
WO2022088798A1
WO2022088798A1 PCT/CN2021/109174 CN2021109174W WO2022088798A1 WO 2022088798 A1 WO2022088798 A1 WO 2022088798A1 CN 2021109174 W CN2021109174 W CN 2021109174W WO 2022088798 A1 WO2022088798 A1 WO 2022088798A1
Authority
WO
WIPO (PCT)
Prior art keywords
environment information
traffic environment
value
learning model
reinforcement learning
Prior art date
Application number
PCT/CN2021/109174
Other languages
English (en)
French (fr)
Inventor
李茹杨
李仁刚
赵雅倩
李雪雷
魏辉
徐哲
张亚强
Original Assignee
浪潮(北京)电子信息产业有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浪潮(北京)电子信息产业有限公司 filed Critical 浪潮(北京)电子信息产业有限公司
Priority to US18/246,126 priority Critical patent/US20230365163A1/en
Publication of WO2022088798A1 publication Critical patent/WO2022088798A1/zh

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • B60W60/0027Planning or execution of driving tasks using trajectory prediction for other traffic participants
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • B60W60/0015Planning or execution of driving tasks specially adapted for safety
    • B60W60/0016Planning or execution of driving tasks specially adapted for safety of the vehicle or its occupants
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/02Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to ambient conditions
    • B60W40/04Traffic conditions
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/0097Predicting future conditions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • B60W2554/40Dynamic objects, e.g. animals, windblown objects
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00Input parameters relating to data
    • B60W2556/10Historical data
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2556/00Input parameters relating to data
    • B60W2556/40High definition maps

Definitions

  • the present invention relates to the technical field of automatic driving, and more particularly, to an automatic driving decision-making method, system, device and computer storage medium.
  • DRL: Deep Reinforcement Learning
  • AGI: Artificial General Intelligence
  • Deep reinforcement learning can guide a vehicle to learn autonomous driving from scratch, and it can learn to drive through continuous "trial and error" even when facing entirely new traffic scenarios, so it has a wide range of applications.
  • However, in the process of learning autonomous driving from scratch, the vehicle usually needs several or even dozens of training steps to make a good decision; this low sampling efficiency is contrary to the instantaneous decision-making requirements of autonomous driving scenarios.
  • At the same time, steps in which poor actions are chosen lead to large variance, which manifests as unstable driving of the vehicle and even accidents such as running off the lane or collisions.
  • In view of this, the purpose of this application is to provide an automatic driving method that can, to a certain extent, solve the technical problem of how to realize fast and stable automatic driving.
  • the present application also provides an automatic driving system, a device, and a computer-readable storage medium.
  • A method of autonomous driving, comprising: acquiring real-time traffic environment information of the autonomous vehicle during driving at the current moment; mapping the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information; adjusting a target deep reinforcement learning model based on a pre-stored existing deep reinforcement learning model and the mapped traffic environment information; and judging whether to end automatic driving, and if not, returning to the step of acquiring the real-time traffic environment information.
  • The mapping relationship includes a mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • The adjustment of the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information includes: processing the mapped traffic environment information based on the parameters of the existing policy network of the existing deep reinforcement learning model to obtain a vehicle action; calculating the value function value of the vehicle action based on the evaluation network of the target deep reinforcement learning model; obtaining the reward value of the vehicle action; and updating the parameters of the evaluation network based on the reward value and the value function value.
  • The updating of the parameters of the evaluation network based on the reward value and the value function value includes: calculating a loss value based on the reward value and the value function value through a loss function calculation formula, and updating the parameters of the evaluation network by minimizing the loss value.
  • The loss function calculation formula includes:
  • L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
  • where L represents the loss value; N represents the number of samples collected; r_t represents the reward value at time t; γ represents the discount factor, 0 < γ < 1; Q′_ω(s_{t+1}, a_{t+1}) represents the value function value calculated by the target network in the evaluation network at time t+1; s_{t+1} represents the traffic environment information at time t+1; a_{t+1} represents the vehicle action at time t+1; Q_ω(s_t, a_t) represents the value function value calculated by the prediction network in the evaluation network at time t; s_t represents the traffic environment information at time t; and a_t represents the vehicle action at time t.
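  • As a concrete illustration of this loss (a hedged sketch, not code from the application: the `EvalNet` class, layer sizes, and tensor names below are assumptions), the following Python snippet computes L for a batch of N sampled transitions using a prediction network Q_ω and a periodically synchronized target network Q′_ω:

```python
import torch
import torch.nn as nn

class EvalNet(nn.Module):
    """Hypothetical evaluation (critic) network: maps a state-action pair to a scalar Q value."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def critic_loss(pred_net: EvalNet, target_net: EvalNet, batch, gamma: float = 0.99) -> torch.Tensor:
    """L = (1/N) * sum_t [ r_t + gamma * Q'_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t) ]^2"""
    s_t, a_t, r_t, s_tp1, a_tp1 = batch             # N sampled transitions, stacked as tensors
    with torch.no_grad():
        y = r_t + gamma * target_net(s_tp1, a_tp1)  # target network Q'_w, updated only at intervals
    q = pred_net(s_t, a_t)                          # prediction network Q_w, updated every step
    return ((y - q) ** 2).mean()                    # mean squared TD error over the N samples
```

  • The parameters of the evaluation network would then be updated by minimizing this value with a gradient-based optimizer, e.g. `loss.backward()` followed by an Adam step on `pred_net.parameters()`.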
  • After updating the parameters of the evaluation network based on the reward value and the value function value, the method further includes: updating the parameters of the policy network of the target deep reinforcement learning model.
  • Before mapping the real-time traffic environment information based on the preset mapping relationship to obtain the mapped traffic environment information, the method further includes: acquiring target traffic environment information; reading the existing traffic environment information; calculating, in a reproducing kernel Hilbert space, the distance value between the target traffic environment information and the existing traffic environment information; and determining the mapping relationship by minimizing the distance value.
  • The calculating of the distance value between the target traffic environment information and the existing traffic environment information includes: calculating the distance value in the reproducing kernel Hilbert space through a distance value calculation formula.
  • The distance value calculation formula includes:
  • MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
  • where MMD_H(D_S, D_T) represents the distance value; D_S represents the existing traffic environment information; D_T represents the target traffic environment information; n represents the number of samples in the existing traffic environment information; m represents the number of samples in the target traffic environment information; A represents the mapping relationship; T represents transposition; s^S represents traffic environment information in the existing traffic environment information; s^T represents traffic environment information in the target traffic environment information; and H represents the reproducing kernel Hilbert space.
  • The determining of the mapping relationship by minimizing the distance value includes: determining the mapping relationship by minimizing the distance value based on a regularized linear regression method, a support vector machine method, or a principal component analysis method.
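  • A minimal numerical sketch of this domain-alignment step is shown below. It assumes, purely for illustration, that the mapping A is a linear matrix applied to state vectors stored row-wise in NumPy arrays, and it chooses A by plain gradient descent on the discrepancy rather than by the regularized linear regression, support vector machine, or principal component analysis variants named above.

```python
import numpy as np

def mmd_sq(A: np.ndarray, S_src: np.ndarray, S_tgt: np.ndarray) -> float:
    """Distance value with a linear feature map:
    || (1/n) sum_i A^T s_i^S  -  (1/m) sum_j A^T s_j^T ||^2"""
    mu_src = (S_src @ A).mean(axis=0)   # (1/n) * sum over existing (source) samples
    mu_tgt = (S_tgt @ A).mean(axis=0)   # (1/m) * sum over target samples
    diff = mu_src - mu_tgt
    return float(diff @ diff)

def fit_mapping(S_src: np.ndarray, S_tgt: np.ndarray, dim_out: int,
                lr: float = 0.1, iters: int = 500, seed: int = 0) -> np.ndarray:
    """Choose A by gradient descent on the distance value (illustrative, unconstrained variant)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.1, size=(S_src.shape[1], dim_out))
    mean_gap = S_src.mean(axis=0) - S_tgt.mean(axis=0)   # difference of sample means, shape (d,)
    for _ in range(iters):
        diff = mean_gap @ A                               # current discrepancy in feature space
        grad = 2.0 * np.outer(mean_gap, diff)             # d/dA || mean_gap @ A ||^2
        A -= lr * grad
    return A

# Example usage with random stand-in data (5-dimensional states mapped to 3 features):
S_existing = np.random.randn(200, 5) + 1.0
S_target = np.random.randn(150, 5)
A = fit_mapping(S_existing, S_target, dim_out=3)
print(mmd_sq(A, S_existing, S_target))
```

  • Note that an unconstrained minimizer can collapse toward a trivial mapping (A orthogonal to the mean gap), which is why the regularized linear regression, SVM, or PCA formulations mentioned above, or an added constraint on A, would be used in practice.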
  • An autonomous driving system comprising:
  • the first acquisition module is used to acquire the real-time traffic environment information of the automatic driving vehicle during the driving process at the current moment;
  • a first mapping module configured to map the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information
  • a first adjustment module configured to adjust the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information
  • a first judging module for judging whether to end the automatic driving, and if not, returning to the step of obtaining the real-time traffic environment information of the automatic driving vehicle during the driving process at the current moment;
  • the mapping relationship includes a mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • An autonomous driving device comprising: a memory for storing a computer program; and a processor configured to implement the steps of any of the above automatic driving methods when executing the computer program.
  • a computer-readable storage medium storing a computer program in the computer-readable storage medium, when the computer program is executed by a processor, implements the steps of any one of the above automatic driving methods.
  • An automatic driving method provided by the present application acquires real-time traffic environment information of the autonomous vehicle during driving at the current moment; maps the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information; adjusts the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information; and judges whether to end automatic driving, and if not, returns to the step of acquiring the real-time traffic environment information of the autonomous vehicle during driving at the current moment; wherein the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • In this way, the target deep reinforcement learning model can be adjusted with the help of the mapping relationship and the existing deep reinforcement learning model, which avoids adjusting the target deep reinforcement learning model from scratch, speeds up the decision-making of the target deep reinforcement learning model, and thereby achieves fast and stable autonomous driving.
  • The automatic driving system, device, and computer-readable storage medium provided by the present application solve the corresponding technical problems in the same way.
  • FIG. 2 is a flowchart of the adjustment of the target deep reinforcement learning model in the present application.
  • FIG. 3 is a schematic structural diagram of an automatic driving system provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an automatic driving device provided by an embodiment of the present application.
  • FIG. 5 is another schematic structural diagram of an automatic driving device provided by an embodiment of the present application.
  • FIG. 1 is a flowchart of an automatic driving method provided by an embodiment of the present application.
  • Step S101 Acquire real-time traffic environment information of the autonomous vehicle during driving at the current moment.
  • the real-time traffic environment information of the autonomous driving vehicle during the driving process at the current moment can be obtained first.
  • the type of environmental information can be determined according to actual needs.
  • vehicle-mounted sensor devices such as cameras, global positioning systems, inertial measurement units, millimeter-wave radars, and lidars can be used to obtain the driving environment status, such as weather data, traffic lights, and traffic topology information.
  • Raw traffic environment information, such as the positions and operating states of the autonomous vehicle and other traffic participants and the raw image data obtained directly by the camera, can be used directly as the real-time traffic environment information; alternatively, depth maps and semantic segmentation maps obtained by processing the raw traffic environment information with models such as RefineNet can be used as the real-time traffic environment information.
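  • As a purely illustrative sketch (none of the field names below appear in the application; they are assumptions standing in for whatever sensor outputs are actually used), such real-time traffic environment information could be assembled into a single state vector like this:

```python
import numpy as np

def build_state(obs: dict) -> np.ndarray:
    """Flatten a hypothetical sensor observation dict into one state vector.

    Assumed keys (for illustration only):
      'ego'     - position/heading/speed of the autonomous vehicle
      'others'  - positions and speeds of nearby traffic participants, flat list
      'light'   - one-hot encoding of the current traffic light state
      'weather' - scalar weather code
    """
    parts = [
        np.asarray(obs["ego"], dtype=np.float32),
        np.asarray(obs["others"], dtype=np.float32),
        np.asarray(obs["light"], dtype=np.float32),
        np.asarray([obs["weather"]], dtype=np.float32),
    ]
    return np.concatenate(parts)

# Example usage with made-up numbers:
state = build_state({
    "ego": [12.0, 3.5, 0.1, 8.2],
    "others": [20.0, 4.0, 6.5, -15.0, 2.0, 7.0],
    "light": [0, 1, 0],
    "weather": 2,
})
```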
  • Step S102 Map the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information; the mapping relationship includes a mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • Because the existing deep reinforcement learning model needs to be used to adjust the target deep reinforcement learning model in this application, if the real-time traffic environment information were processed directly by the existing deep reinforcement learning model, the processing result might not match the real-time traffic environment information.
  • To avoid this, the real-time traffic environment information can first be mapped based on the preset mapping relationship to obtain the mapped traffic environment information.
  • Because the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model, the mapped traffic environment information can meet the processing requirements of the existing deep reinforcement learning model while carrying the relevant information of the real-time traffic environment information; adjusting the target deep reinforcement learning model with the mapped traffic environment information therefore preserves the adjustment accuracy of the target deep reinforcement learning model.
  • the existing deep reinforcement learning model refers to the deep reinforcement learning model that has been trained and meets the conditions.
  • For example, the existing deep reinforcement learning model can be a deep reinforcement learning model obtained after training for a preset duration on the existing traffic environment information, or a deep reinforcement learning model obtained after training for a preset number of steps on the existing traffic environment information; the training process based on the existing traffic environment information can follow the prior art and is not specifically limited in this application.
  • Step S103 Adjust the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information.
  • In practical applications, after the mapped traffic environment information is obtained, the target deep reinforcement learning model can be adjusted based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information.
  • The adjustment process of the target deep reinforcement learning model can be determined according to actual needs and the specific structure of the target deep reinforcement learning model, and the structure of the target deep reinforcement learning model can be determined according to the applied deep reinforcement learning algorithm, which is not specifically limited in this application. Examples include:
  • DQN (Deep Q-Network)
  • DDPG (Deep Deterministic Policy Gradient)
  • A3C (Asynchronous Advantage Actor-Critic)
  • SAC (Soft Actor-Critic)
  • TD3 (Twin Delayed Deep Deterministic Policy Gradient)
  • Step S104: Determine whether to end automatic driving; if not, return to step S101; if yes, execute step S105: end.
  • In practical applications, because each adjustment only uses the real-time traffic environment information at the current moment, multiple adjustments may be needed to refine the parameters of the target deep reinforcement learning model, so the condition for judging whether to end automatic driving can be determined according to actual needs.
  • For example, the condition for ending automatic driving can be that the number of adjustments reaches a preset number of times, or that the adjustment duration reaches a preset time period, etc., which is not specifically limited in this application.
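  • Putting steps S101 to S105 together, the outer control flow could be sketched as below; the helper functions are placeholders for the operations described above, not APIs from the application, and the stopping condition shown is the preset-number-of-adjustments variant:

```python
def autonomous_driving_loop(get_real_time_info, apply_mapping, adjust_target_model,
                            max_adjustments: int = 1000) -> None:
    """Sketch of steps S101-S105: acquire -> map -> adjust -> check the end condition."""
    for _ in range(max_adjustments):                 # S104: stop after a preset number of adjustments
        real_time_info = get_real_time_info()        # S101: real-time traffic environment information
        mapped_info = apply_mapping(real_time_info)  # S102: apply the preset mapping relationship
        adjust_target_model(mapped_info)             # S103: adjust the target DRL model
    # S105: end
```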
  • An automatic driving method provided by the present application acquires real-time traffic environment information of the autonomous vehicle during driving at the current moment; maps the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information; adjusts the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information; and judges whether to end automatic driving, and if not, returns to the step of acquiring the real-time traffic environment information of the autonomous vehicle during driving at the current moment; wherein the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • In this way, the target deep reinforcement learning model can be adjusted with the help of the mapping relationship and the existing deep reinforcement learning model, which avoids adjusting the target deep reinforcement learning model from scratch, speeds up its decision-making, and thereby achieves fast and stable autonomous driving.
  • FIG. 2 is a flowchart of the adjustment of the target deep reinforcement learning model in the present application.
  • the process of adjusting the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information may include the following steps:
  • Step S201 processing the mapped traffic environment information based on the parameters of the existing strategy network of the existing deep reinforcement learning model to obtain vehicle actions.
  • In the case where both the existing deep reinforcement learning model and the target deep reinforcement learning model include a policy network and an evaluation network, the mapped traffic environment information can first be processed based on the parameters of the existing policy network of the existing deep reinforcement learning model to obtain a vehicle action, such as acceleration, deceleration, steering, lane changing, or braking.
  • Step S202 Calculate the value function value of the vehicle action based on the evaluation network of the target deep reinforcement learning model.
  • After the vehicle action is obtained, the value function value of the vehicle action can be calculated based on the evaluation network of the target deep reinforcement learning model, so that the decision-making ability of the policy network can be evaluated with the help of the value function value.
  • Step S203 Obtain the reward value of the vehicle action.
  • After the value function value of the vehicle action is calculated, the reward value of the vehicle action can also be obtained.
  • Specifically, according to the vehicle action taken by the autonomous vehicle and in combination with set benchmarks such as the average driving speed of the autonomous vehicle, the distance from the lane center, running a red light, and collisions, a reward value is given to the autonomous vehicle.
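  • A hedged illustration of such a benchmark-based reward is sketched below; the particular weights and thresholds are invented for the example and are not specified by the application:

```python
def reward_value(avg_speed_mps: float, dist_from_lane_center_m: float,
                 ran_red_light: bool, collided: bool,
                 target_speed_mps: float = 10.0) -> float:
    """Toy benchmark-based reward: track a target speed, keep the lane, punish violations."""
    r = -abs(avg_speed_mps - target_speed_mps) / target_speed_mps  # speed-tracking term
    r -= dist_from_lane_center_m                                   # lane-center deviation term
    if ran_red_light:
        r -= 10.0                                                  # traffic-rule violation penalty
    if collided:
        r -= 100.0                                                 # collision penalty
    return r
```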
  • Step S204 Update the parameters of the evaluation network based on the reward value and the value function value.
  • the parameters of the evaluation network can be updated based on the reward value and the value function value.
  • In a specific application scenario, in the process of updating the parameters of the evaluation network based on the reward value and the value function value, the loss value can be calculated based on the reward value and the value function value through the loss function calculation formula, and the parameters of the evaluation network can then be updated by minimizing the loss value.
  • The loss function calculation formula includes:
  • L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
  • where L represents the loss value; N represents the number of samples collected; r_t represents the reward value at time t; γ represents the discount factor, 0 < γ < 1; Q′_ω(s_{t+1}, a_{t+1}) represents the value function value calculated by the target network in the evaluation network at time t+1; s_{t+1} represents the traffic environment information at time t+1; a_{t+1} represents the vehicle action at time t+1; Q_ω(s_t, a_t) represents the value function value calculated by the prediction network in the evaluation network at time t; s_t represents the traffic environment information at time t; and a_t represents the vehicle action at time t.
  • After updating the parameters of the evaluation network based on the reward value and the value function value, in order to further ensure the accuracy of the target deep reinforcement learning model, the parameters of the policy network of the target deep reinforcement learning model can also be updated.
  • the process of updating the parameters of the policy network can be determined according to actual needs, which is not specifically limited in this application.
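  • Combining steps S201 to S204 with this subsequent policy update, one adjustment iteration might look like the following sketch. It reuses the critic idea from the loss sketch above, assumes a deterministic policy network (a common but not mandated choice here), and updates the policy with a DDPG-style gradient that maximizes the critic value; the application leaves the exact policy update open, so that part is an assumption. `critic_opt` and `policy_opt` are optimizers over the evaluation network and the target policy network, respectively.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Hypothetical policy network: maps a (mapped) traffic state to a continuous vehicle action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def adjustment_step(existing_policy, critic_pred, critic_target, target_policy,
                    critic_opt, policy_opt, s_t, env_step, gamma: float = 0.99):
    # S201: the existing (source-domain) policy network chooses the vehicle action
    #       for the mapped traffic environment information s_t.
    with torch.no_grad():
        a_t = existing_policy(s_t)
    # S202: the evaluation network of the target model scores that action.
    q_t = critic_pred(s_t, a_t)
    # S203: obtain the reward value and the next mapped state from the environment.
    r_t, s_tp1 = env_step(a_t)
    # S204: update the evaluation network by minimizing the TD loss (see the loss sketch above).
    with torch.no_grad():
        a_tp1 = existing_policy(s_tp1)
        y = r_t + gamma * critic_target(s_tp1, a_tp1)
    critic_loss = ((y - q_t) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Policy update (assumed DDPG-style, not mandated by the application):
    # ascend the critic's value of the target policy's own action.
    policy_loss = -critic_pred(s_t, target_policy(s_t)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()
    return s_tp1
```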
  • Before mapping the real-time traffic environment information based on the preset mapping relationship to obtain the mapped traffic environment information, the method may also: obtain target traffic environment information; read the existing traffic environment information; calculate, in the reproducing kernel Hilbert space, the distance value between the target traffic environment information and the existing traffic environment information; and determine the mapping relationship by minimizing the distance value. That is, the present application can quickly determine the mapping relationship through the target traffic environment information, the existing traffic environment information, and the reproducing kernel Hilbert space.
  • In a specific application scenario, the distance value between the target traffic environment information and the existing traffic environment information can be calculated in the reproducing kernel Hilbert space through a distance value calculation formula.
  • The formula for calculating the distance value includes:
  • MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
  • where MMD_H(D_S, D_T) represents the distance value; D_S represents the existing traffic environment information; D_T represents the target traffic environment information; n represents the number of samples in the existing traffic environment information; m represents the number of samples in the target traffic environment information; A represents the mapping relationship; T represents transposition; s^S represents traffic environment information in the existing traffic environment information; s^T represents traffic environment information in the target traffic environment information; and H represents the reproducing kernel Hilbert space.
  • In the process of determining the mapping relationship by minimizing the distance value, the mapping relationship can be determined by minimizing the distance value based on a regularized linear regression method, a support vector machine method, or a principal component analysis method.
  • In the process of obtaining the target traffic environment information, a simple deep reinforcement learning algorithm, such as the DQN algorithm, can be used to pre-train the autonomous vehicle in the target domain.
  • For example, two neural networks with the same structure but different parameter update frequencies can be built, namely a target network (Target Net) updated at a certain interval and a prediction network (Pred Net) updated at every step; both the target network and the prediction network can use a simple 3-layer neural network with only one hidden layer in the middle.
  • The traffic environment state collected by the vehicle sensor devices is then input, the output target value Q_target and predicted value Q_pred are calculated, and the action a_Tt corresponding to the maximum value is selected as the driving action of the autonomous vehicle; the reward r_Tt and the new traffic environment state s_Tt+1 are then obtained, and the learning experience c_Ti = (s_Ti, a_Ti, r_Ti, s_Ti+1) is stored in the replay buffer D_T, thereby generating the target traffic environment information.
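  • The following sketch illustrates this pretraining loop in the spirit of the description: two structurally identical 3-layer networks with a single hidden layer, an interval-updated target network, greedy action selection on the predicted values, and storage of the experience tuples c_Ti in a replay buffer D_T. The environment callbacks, dimensions, and hyperparameters are assumptions for illustration; a real setup would also add epsilon-greedy exploration and gradient updates on sampled minibatches, which are omitted here.

```python
from collections import deque
import torch
import torch.nn as nn

def make_qnet(state_dim: int, n_actions: int, hidden: int = 64) -> nn.Module:
    # Simple 3-layer network with a single hidden layer, as described above.
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_actions))

def pretrain_collect(env_reset, env_step, state_dim: int, n_actions: int,
                     episodes: int = 10, max_steps: int = 200,
                     gamma: float = 0.99, target_update_every: int = 50):
    pred_net = make_qnet(state_dim, n_actions)     # Pred Net: would be updated at every step
    target_net = make_qnet(state_dim, n_actions)   # Target Net: synchronized at a fixed interval
    target_net.load_state_dict(pred_net.state_dict())
    replay_buffer = deque(maxlen=100_000)          # D_T: target-domain learning experiences
    steps = 0
    for _ in range(episodes):
        s = env_reset()
        for _ in range(max_steps):
            with torch.no_grad():
                q_pred = pred_net(torch.as_tensor(s, dtype=torch.float32))
            a = int(torch.argmax(q_pred))          # a_Tt: action with the largest predicted value
            r, s_next, done = env_step(a)          # reward r_Tt and new state s_Tt+1
            with torch.no_grad():                  # Q_target, shown for illustration; the actual
                q_target = r + gamma * target_net( # TD training step on it is omitted here
                    torch.as_tensor(s_next, dtype=torch.float32)).max()
            replay_buffer.append((s, a, r, s_next))  # c_Ti = (s_Ti, a_Ti, r_Ti, s_Ti+1)
            s = s_next
            steps += 1
            if steps % target_update_every == 0:
                target_net.load_state_dict(pred_net.state_dict())
            if done:
                break
    return replay_buffer
```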
  • FIG. 3 is a schematic structural diagram of an automatic driving system according to an embodiment of the present application.
  • the first obtaining module 101 is used to obtain the real-time traffic environment information of the autonomous vehicle during driving at the current moment;
  • a first mapping module 102 configured to map real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information
  • the first adjustment module 103 is configured to adjust the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information;
  • the first judging module 104 is used to judge whether to end the automatic driving, and if not, return to the step of obtaining the real-time traffic environment information of the automatic driving vehicle during the driving process at the current moment;
  • the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • the first adjustment module may include:
  • the first processing unit is used for processing the mapped traffic environment information based on the parameters of the existing strategy network of the existing deep reinforcement learning model to obtain the vehicle action;
  • a first calculation unit used for calculating the value function value of the vehicle action based on the evaluation network of the target deep reinforcement learning model
  • a first obtaining unit used for obtaining the reward value of the vehicle action
  • the first updating unit is used to update the parameters of the evaluation network based on the reward value and the value function value.
  • the first update unit may include:
  • the second calculation unit is used to calculate the loss value based on the return value and the value function value through the loss function calculation formula
  • a second updating unit configured to update the parameters of the evaluation network by minimizing the loss value
  • the loss function calculation formula includes:
  • L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
  • where L represents the loss value; N represents the number of samples collected; r_t represents the reward value at time t; γ represents the discount factor, 0 < γ < 1; Q′_ω(s_{t+1}, a_{t+1}) represents the value function value calculated by the target network in the evaluation network at time t+1; s_{t+1} represents the traffic environment information at time t+1; a_{t+1} represents the vehicle action at time t+1; Q_ω(s_t, a_t) represents the value function value calculated by the prediction network in the evaluation network at time t; s_t represents the traffic environment information at time t; and a_t represents the vehicle action at time t.
  • the third updating unit is used for updating the parameters of the strategy network of the target deep reinforcement learning model after the first updating unit updates the parameters of the target evaluation network based on the reward value and the value function value.
  • the second obtaining module is used for the first mapping module to map the real-time traffic environment information based on the preset mapping relationship, and obtain the target traffic environment information before obtaining the mapped traffic environment information;
  • a first reading module used for reading existing traffic environment information
  • the first calculation module is used to calculate the distance value between the target traffic environment information and the existing traffic environment information in the reproducing kernel Hilbert space;
  • the first determining module is configured to determine the mapping relationship by minimizing the distance value.
  • the first computing module may include:
  • the third calculation unit is used to calculate the distance value between the target traffic environment information and the existing traffic environment information in the reproducing kernel Hilbert space through the distance value calculation formula;
  • the formula for calculating the distance value includes:
  • MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
  • where MMD_H(D_S, D_T) represents the distance value; D_S represents the existing traffic environment information; D_T represents the target traffic environment information; n represents the number of samples in the existing traffic environment information; m represents the number of samples in the target traffic environment information; A represents the mapping relationship; T represents transposition; s^S represents traffic environment information in the existing traffic environment information; s^T represents traffic environment information in the target traffic environment information; and H represents the reproducing kernel Hilbert space.
  • the first determination module may include:
  • the first determining unit is configured to determine the mapping relationship by minimizing the distance value based on a regularized linear regression method, a support vector machine method, or a principal component analysis method.
  • the present application also provides an automatic driving device and a computer-readable storage medium, both of which have the corresponding effects of the automatic driving method provided by the embodiments of the present application.
  • FIG. 4 is a schematic structural diagram of an automatic driving device according to an embodiment of the present application.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202, where a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program:
  • the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • When the processor 202 executes the computer program, the following steps are implemented: processing the mapped traffic environment information based on the parameters of the existing policy network of the existing deep reinforcement learning model to obtain the vehicle action; calculating the value function value of the vehicle action based on the evaluation network of the target deep reinforcement learning model; obtaining the reward value of the vehicle action; and updating the parameters of the evaluation network based on the reward value and the value function value.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • When the processor 202 executes the computer program, the following steps are implemented: calculating the loss value based on the reward value and the value function value through the loss function calculation formula, and updating the parameters of the evaluation network by minimizing the loss value; wherein the loss function calculation formula includes:
  • L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
  • where L represents the loss value; N represents the number of samples collected; r_t represents the reward value at time t; γ represents the discount factor, 0 < γ < 1; Q′_ω(s_{t+1}, a_{t+1}) represents the value function value calculated by the target network in the evaluation network at time t+1; s_{t+1} represents the traffic environment information at time t+1; a_{t+1} represents the vehicle action at time t+1; Q_ω(s_t, a_t) represents the value function value calculated by the prediction network in the evaluation network at time t; s_t represents the traffic environment information at time t; and a_t represents the vehicle action at time t.
  • An automatic driving device includes a memory 201 and a processor 202, where a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: after updating the parameters of the evaluation network based on the reward value and the value function value, updating the parameters of the policy network of the target deep reinforcement learning model.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • When the processor 202 executes the computer program, the following steps are implemented: before the real-time traffic environment information is mapped based on the preset mapping relationship to obtain the mapped traffic environment information, obtaining the target traffic environment information; reading the existing traffic environment information; calculating, in the reproducing kernel Hilbert space, the distance value between the target traffic environment information and the existing traffic environment information; and determining the mapping relationship by minimizing the distance value.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202.
  • a computer program is stored in the memory 201.
  • When the processor 202 executes the computer program, the following steps are implemented: calculating, through the distance value calculation formula, the distance value between the target traffic environment information and the existing traffic environment information in the reproducing kernel Hilbert space;
  • the formula for calculating the distance value includes:
  • MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
  • where MMD_H(D_S, D_T) represents the distance value; D_S represents the existing traffic environment information; D_T represents the target traffic environment information; n represents the number of samples in the existing traffic environment information; m represents the number of samples in the target traffic environment information; A represents the mapping relationship; T represents transposition; s^S represents traffic environment information in the existing traffic environment information; s^T represents traffic environment information in the target traffic environment information; and H represents the reproducing kernel Hilbert space.
  • An automatic driving device provided by an embodiment of the present application includes a memory 201 and a processor 202, a computer program is stored in the memory 201, and the processor 202 implements the following steps when executing the computer program: based on a regular linear regression method or a support vector machine method or The principal component analysis method determines the mapping relationship by minimizing the distance value.
  • Referring to FIG. 5, another automatic driving device provided by an embodiment of the present application may further include: an input port 203 connected to the processor 202 for transmitting externally input commands to the processor 202; a display unit 204 connected to the processor 202 for displaying the processing results of the processor 202 to the outside; and a communication module 205 connected to the processor 202 for realizing communication between the automatic driving device and the outside.
  • The display unit 204 can be a display panel, a laser scanning display, or the like; the communication methods adopted by the communication module 205 include, but are not limited to, mobile high-definition link technology (HML), universal serial bus (USB), high-definition multimedia interface (HDMI), and wireless connections such as wireless fidelity technology (WiFi), Bluetooth communication technology, Bluetooth low energy communication technology, and communication technology based on IEEE 802.11s.
  • a computer-readable storage medium provided by an embodiment of the present application, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented:
  • the mapping relationship includes the mapping relationship between the real-time traffic environment information and the existing traffic environment information of the existing deep reinforcement learning model.
  • In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: processing the mapped traffic environment information based on the parameters of the existing policy network of the existing deep reinforcement learning model to obtain the vehicle action; calculating the value function value of the vehicle action based on the evaluation network of the target deep reinforcement learning model; obtaining the reward value of the vehicle action; and updating the parameters of the evaluation network based on the reward value and the value function value.
  • In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: calculating the loss value based on the reward value and the value function value through the loss function calculation formula, and updating the parameters of the evaluation network by minimizing the loss value; wherein the loss function calculation formula includes:
  • L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
  • where L represents the loss value; N represents the number of samples collected; r_t represents the reward value at time t; γ represents the discount factor, 0 < γ < 1; Q′_ω(s_{t+1}, a_{t+1}) represents the value function value calculated by the target network in the evaluation network at time t+1; s_{t+1} represents the traffic environment information at time t+1; a_{t+1} represents the vehicle action at time t+1; Q_ω(s_t, a_t) represents the value function value calculated by the prediction network in the evaluation network at time t; s_t represents the traffic environment information at time t; and a_t represents the vehicle action at time t.
  • a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: after updating the parameters of the evaluation network based on the reward value and the value function value, Update the parameters of the policy network of the target deep reinforcement learning model.
  • In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: before the real-time traffic environment information is mapped based on the preset mapping relationship to obtain the mapped traffic environment information, obtaining the target traffic environment information; reading the existing traffic environment information; calculating, in the reproducing kernel Hilbert space, the distance value between the target traffic environment information and the existing traffic environment information; and determining the mapping relationship by minimizing the distance value.
  • In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: calculating, through the distance value calculation formula, the distance value between the target traffic environment information and the existing traffic environment information in the reproducing kernel Hilbert space;
  • the formula for calculating the distance value includes:
  • MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
  • where MMD_H(D_S, D_T) represents the distance value; D_S represents the existing traffic environment information; D_T represents the target traffic environment information; n represents the number of samples in the existing traffic environment information; m represents the number of samples in the target traffic environment information; A represents the mapping relationship; T represents transposition; s^S represents traffic environment information in the existing traffic environment information; s^T represents traffic environment information in the target traffic environment information; and H represents the reproducing kernel Hilbert space.
  • In a computer-readable storage medium provided by an embodiment of the present application, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the following steps are implemented: determining the mapping relationship by minimizing the distance value based on a regularized linear regression method, a support vector machine method, or a principal component analysis method.
  • The computer-readable storage medium involved in this application includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disks, removable disks, CD-ROMs, or any other form of storage medium known in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Traffic Control Systems (AREA)

Abstract

An automatic driving method, comprising the following steps: step S101, acquiring real-time traffic environment information of the autonomous vehicle during driving at the current moment; step S102, mapping the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information; step S103, adjusting a target deep reinforcement learning model based on a pre-stored existing deep reinforcement learning model and the mapped traffic environment information; step S104, judging whether to end automatic driving, and if not, returning to the step of acquiring the real-time traffic environment information of the autonomous vehicle during driving at the current moment. The automatic driving method adjusts the target deep reinforcement learning model with the help of the mapping relationship and the existing deep reinforcement learning model, which avoids adjusting the target deep reinforcement learning model from scratch, speeds up the decision-making of the target deep reinforcement learning model, and thereby achieves fast and stable autonomous driving. Also provided are an automatic driving system, an automatic driving device, and a computer medium storing the automatic driving method.

Description

一种自动驾驶决策方法、系统、设备及计算机存储介质
本申请要求于2020年10月29日提交中国专利局、申请号为202011181627.3、发明名称为“一种自动驾驶决策方法、系统、设备及计算机存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及自动驾驶技术领域,更具体地说,涉及一种自动驾驶决策方法、系统、设备及计算机存储介质。
背景技术
现代城市交通中,机动车数量日益增多,道路拥堵情况严重,且交通事故频发。为最大程度降低人为因素造成的危害,人们将目光转向自动驾驶领域。结合深度学习的深度强化学习(DRL,Deep Reinforcement Learning)是近年来快速发展的一类机器学习方法,智能体-环境交互作用和序列决策机制接近人类学习的过程,因此也被称为实现“通用人工智能(AGI,Artificial General Intelligence)”的关键步骤,被应用于自动驾驶决策过程中。
虽然深度强化学习能够指导车辆从头开始学习自动驾驶,在面对全新交通场景时也能够通过不断“试错”的方式学会自动驾驶,具有广泛的应用性。但是,从头开始学习自动驾驶的过程中,车辆通常需要几步、甚至几十步的训练才能做出一个较好的决策,采样效率较低,这与自动驾驶场景的瞬时决策要求相悖。同时,选取较差动作的步骤会导致方差较大,体现为车辆行驶不平稳,甚至出现冲出车道、碰撞等事故。
综上所述,如何实现快速、稳定的自动驾驶是目前本领域技术人员亟待解决的问题。
发明内容
有鉴于此,本申请的目的是提供一种自动驾驶方法,其能在一定程度上解决如何实现快速、稳定的自动驾驶的技术问题。本申请还提供了一种自动驾驶系统、设备及计算机可读存储介质。
为了实现上述目的,本申请提供如下技术方案:
一种自动驾驶方法,包括:
获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;
基于预设的映射关系对所述实时交通环境信息进行映射,得到映射交通环境信息;
基于预先存储的已有深度强化学习模型及所述映射交通环境信息,对目标深度强化学习模型进行调整;
判断是否结束自动驾驶,若否,则返回执行所述获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;
其中,所述映射关系包括所述实时交通环境信息与所述已有深度强化学习模型的已有交通环境信息间的映射关系。
优选的,所述基于预先存储的已有深度强化学习模型及所述映射交通环境信息,对目标深度强化学习模型进行调整,包括:
基于所述已有深度强化学习模型的已有策略网络的参数对所述映射交通环境信息进行处理,得到车辆动作;
基于所述目标深度强化学习模型的评价网络计算所述车辆动作的价值函数值;
获取所述车辆动作的回报值;
基于所述回报值、所述价值函数值更新所述评价网络的参数。
优选的,所述基于所述回报值、所述价值函数值更新所述评价网络的参数,包括:
通过损失函数计算公式,基于所述回报值、所述价值函数值计算损失值;
通过最小化所述损失值来更新所述评价网络的参数;
其中,所述损失函数计算公式包括:
L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
其中,L表示所述损失值;N表示采集的样本数量;r t表示t时刻下的回报值;γ表示折扣因子,0<γ<1;Q′ ω(s t+1,a t+1)表示所述评价网络中的目标网络在t+1时刻下计算得到的价值函数值;s t+1表示t+1时刻下的交通环境信息;a t+1表示t+1时刻下的车辆动作;Q ω(s t,a t)表示所述评价网络中的预测网络在t时刻下计算得到的价值函数值;s t表示t时刻下的交通环境信息;a t表示t时刻下的车辆动作。
优选的,所述基于所述回报值、所述价值函数值更新所述评价网络的参数之后,还包括:
对所述目标深度强化学习模型的策略网络的参数进行更新。
优选的,所述基于预设的映射关系对所述实时交通环境信息进行映射,得到映射交通环境信息之前,还包括:
获取目标交通环境信息;
读取所述已有交通环境信息;
在再生核希尔伯特空间中,计算所述目标交通环境信息与所述已有交通环境信息间的距离值;
通过最小化所述距离值来确定所述映射关系。
优选的,所述在再生核希尔伯特空间中,计算所述目标交通环境信息与所述已有交通环境信息间的距离值,包括:
通过距离值计算公式,在再生核希尔伯特空间中,计算所述目标交通环境信息与所述已有交通环境信息间的所述距离值;
所述距离值计算公式包括:
MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
其中,MMD H(D S,D T)表示所述距离值;D S表示所述已有交通环境信息;D T表示所述目标交通环境信息;n表示所述已有交通环境信息中的样本数量;m表示所述目标交通环境信息中的样本数量;A表示所述映射关系;T 表示转置;s S表示所述已有交通环境信息中的交通环境信息;s T表示所述目标交通环境信息中的交通环境信息;H表示所述再生核希尔伯特空间。
优选的,所述通过最小化所述距离值来确定所述映射关系,包括:
基于正则线性回归方法或支持向量机方法或主成分分析方法,通过最小化所述距离值来确定所述映射关系。
一种自动驾驶系统,包括:
第一获取模块,用于获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;
第一映射模块,用于基于预设的映射关系对所述实时交通环境信息进行映射,得到映射交通环境信息;
第一调整模块,用于基于预先存储的已有深度强化学习模型及所述映射交通环境信息,对目标深度强化学习模型进行调整;
第一判断模块,用于判断是否结束自动驾驶,若否,则返回执行所述获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;
其中,所述映射关系包括所述实时交通环境信息与所述已有深度强化学习模型的已有交通环境信息间的映射关系。
一种自动驾驶设备,包括:
存储器,用于存储计算机程序;
处理器,用于执行所述计算机程序时实现如上任一所述自动驾驶方法的步骤。
一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序被处理器执行时实现如上任一所述自动驾驶方法的步骤。
本申请提供的一种自动驾驶方法,获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整;判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;其中,映射关系包括实时交通环境信 息与已有深度强化学习模型的已有交通环境信息间的映射关系。本申请中,可以借助映射关系和已有深度强化学习模型来对目标深度强化学习模型进行调整,可以避免从头对目标深度强化学习模型进行调整,加快目标深度强化学习模型的决策效率,进行可以实现快速、稳定的自动驾驶。本申请提供的一种自动驾驶系统、设备及计算机可读存储介质也解决了相应技术问题。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例提供的一种自动驾驶方法的流程图;
图2为本申请中对目标深度强化学习模型的调整流程图;
图3为本申请实施例提供的一种自动驾驶系统的结构示意图;
图4为本申请实施例提供的一种自动驾驶设备的结构示意图;
图5为本申请实施例提供的一种自动驾驶设备的另一结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
请参阅图1,图1为本申请实施例提供的一种自动驾驶方法的流程图。
本申请实施例提供的一种自动驾驶方法,可以包括以下步骤:
步骤S101:获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息。
实际应用中,在自动驾驶过程中,需要根据当前的交通环境信息预测自动驾驶车辆的下一步驾驶动作,所以可以先获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息,实时交通环境信息的类型可以根据实际需要确定,比如可以借助摄像头、全球定位系统、惯性测量单元、毫米波雷达、激光雷达等车载传感器装置,获取行车环境状态,如天气数据、交通信号灯、交通拓扑信息,自动驾驶车辆、其他交通参与者的位置、运行状态等信息,摄像头获取的直接原始图像数据等原始交通环境信息来直接作为实时交通环境信息,还可以通过RefineNet等模型对原始交通环境信息处理得到的深度图和语义分割图作为实时交通环境信息等。
步骤S102:基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;映射关系包括实时交通环境信息与已有深度强化学习模型的已有交通环境信息间的映射关系。
实际应用中,因为本申请中需要借助已有深度强化学习模型来对目标深度强化学习模型进行调整,如果直接借助已有深度强化学习模型来对实时交通环境信息进行处理的话,可能存在处理结果无法与实时交通环境信息相匹配的情况,为了避免此种情况,可以先基于预设的映射关系来对实时交通环境信息进行映射,得到映射交通环境信息;因为映射关系包括实时交通环境信息与已有深度强化学习模型的已有交通环境信息间的映射关系,所以映射交通环境信息可以满足已有深度强化学习模型的处理要求且可以携带实时交通环境信息的相关信息,这样后续借助映射交通环境信息来对目标深度强化学习模型进行调整的话,可以保证目标深度强化学习模型的调整准确性。
应当指出,已有深度强化学习模型指的是已经训练的满足条件的深度强化学习模型,比如已有深度强化学习模型可以为按照已有交通环境信息进行预设时长训练后得到的深度强化学习模型,也可以为按照已有交通环境信息进行预设步长训练后得到的深度强化学习模型等,基于已有交通环境信息进行深度学习模型训练的过程可以参阅现有技术,本申请在此不做具体限定。
步骤S103:基于预先存储的已有深度强化学习模型及映射交通环境信 息,对目标深度强化学习模型进行调整。
实际应用中,在基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息之后,便可以基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整。
应当指出,对目标深度强化学习模型的调整过程可以根据实际需要及目标深度强化学习模型的具体结构来确定,且目标深度强化学习模型的结构可以根据所应用的深度强化学习算法,比如DQN(Deep-Q-Network,深度Q网络)算法、DDPG(Deep Deterministic Policy Gradient,深度确定策略梯度算法)算法、A3C(Asynchronous Advantage Actor-Critic,异步优势Actor-Critic算法)算法、SAC(Soft Actor-Critic,松弛Actor-Critic算法)算法、TD3(Twin Delayed Deep Deterministic policy gradient,双延迟确定性策略梯度算法)算法等,来确定本申请在此不做具体限定。
步骤S104:判断是否结束自动驾驶,若否,则返回执行步骤S101;若是,则执行步骤S105:结束。
实际应用中,因为每次调整过程中只是应用了当前时刻下的实时交通环境信息,可能需要进行多次调整才能完善目标深度强化学习模型的参数,所以在基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整之后,可以判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;若是,则可以直接结束。
应当指出,判断是否结束自动驾驶的条件可以根据实际需要确定,比如结束自动驾驶的条件可以为调整次数达到预设次数,调整时长达到预设时长等,本申请在此不做具体限定。
本申请提供的一种自动驾驶方法,获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整;判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;其中,映射关系包括实时交通环境信 息与已有深度强化学习模型的已有交通环境信息间的映射关系。本申请中,可以借助映射关系和已有深度强化学习模型来对目标深度强化学习模型进行调整,可以避免从头对目标深度强化学习模型进行调整,加快目标深度强化学习模型的决策效率,进行可以实现快速、稳定的自动驾驶。
请参阅图2,图2为本申请中对目标深度强化学习模型的调整流程图。
本申请实施例提供的一种自动驾驶方法中,基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整的过程中,可以包括以下步骤:
步骤S201:基于已有深度强化学习模型的已有策略网络的参数对映射交通环境信息进行处理,得到车辆动作。
实际应用中,在已有深度强化学习模型及目标深度强化学习模型中均包括策略网络和评价网络的情况下,可以先基于已有深度强化学习模型的已有策略网络的参数来对映射交通环境信息进行处理,得到车辆动作,如加速、减速、转向、变道、刹车等。
步骤S202:基于目标深度强化学习模型的评价网络计算车辆动作的价值函数值。
实际应用中,在基于已有深度强化学习模型的已有策略网络的参数对映射交通环境信息进行处理,得到车辆动作之后,便可以基于目标深度强化学习模型的评价网络计算车辆动作的价值函数值,以借助价值函数值对策略网络的决策能力进行评价。
步骤S203:获取车辆动作的回报值。
实际应用中,在基于目标深度强化学习模型的目标评价网络计算车辆动作的价值函数值之后,还可以获取车辆动作的回报值,具体的,可以根据自动驾驶车辆采取的车辆动作,结合设定的基准,如自动驾驶车辆平均行驶速度、偏离车道中心距离、闯红灯、发生碰撞等因素,给予自动驾驶车辆一个回报值。
步骤S204:基于回报值、价值函数值更新评价网络的参数。
实际应用中,在获取车辆动作的回报值之后,便可以基于回报值、价值函数值更新评价网络的参数。
具体应用场景中,在基于回报值、价值函数值更新评价网络的参数的过程中,可以通过损失函数计算公式,基于回报值、价值函数值计算损失值;通过最小化损失值来更新评价网络的参数;其中,损失函数计算公式包括:
L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
其中,L表示损失值;N表示采集的样本数量;r t表示t时刻下的回报值;γ表示折扣因子,0<γ<1;Q′ ω(s t+1,a t+1)表示评价网络中的目标网络在t+1时刻下计算得到的价值函数值;s t+1表示t+1时刻下的交通环境信息;a t+1表示t+1时刻下的车辆动作;Q ω(s t,a t)表示评价网络中的预测网络在t时刻下计算得到的价值函数值;s t表示t时刻下的交通环境信息;a t表示t时刻下的车辆动作。
本申请实施例提供的一种自动驾驶方法中,在基于回报值、价值函数值更新评价网络的参数之后,为了进一步保证目标深度强化学习模型的准确性,还可以对目标深度强化学习模型的策略网络的参数进行更新。对策略网络的参数进行更新的过程可以根据实际需要确定,本申请在此不做具体限定。
本申请实施例提供的一种自动驾驶方法中,在基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息之前,还可以:获取目标交通环境信息;读取已有交通环境信息;在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;通过最小化距离值来确定映射关系。也即本申请可以通过目标交通环境信息、已有交通环境信息及再生核希尔伯特空间快速确定映射关系。
具体应用场景中,在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值的过程中,可以通过距离值计算公式,在 再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;
距离值计算公式包括:
MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
其中,MMD H(D S,D T)表示距离值;D S表示已有交通环境信息;D T表示目标交通环境信息;n表示已有交通环境信息中的样本数量;m表示目标交通环境信息中的样本数量;A表示映射关系;T表示转置;s S表示已有交通环境信息中的交通环境信息;s T表示目标交通环境信息中的交通环境信息;H表示再生核希尔伯特空间。
具体应用场景中,在通过最小化距离值来确定映射关系的过程中,可以基于正则线性回归方法或支持向量机方法或主成分分析方法等,通过最小化距离值来确定映射关系。
具体应用场景中,在获取目标交通环境信息的过程中,可以使用简单的深度学习算法,如DQN算法对目标领域自动驾驶车辆进行预训练,比如构建2个结构相同但参数更新频率不同的神经网络,即间隔一定时间更新的目标网络(Target Net)和每步更新的预测网络(Pred Net),目标网络和预测网络可以均使用简单的3层神经网络,中间仅包含1层隐藏层;此时输入车辆传感器装置采集到的交通环境状态,计算输出目标价值Q target和预测价值Q pred,并选择最大的价值对应的动作a Tt作为自动驾驶车辆的驾驶动作。随后,获得回报r Tt和新的交通环境状态s Tt+1,并将学习经历c Ti=(s Ti,a Ti,r Ti,s Ti+1)存储到回放缓冲区D T中,以此生成目标交通环境信息。
请参阅图3,图3为本申请实施例提供的一种自动驾驶系统的结构示意图。
本申请实施例提供的一种自动驾驶系统,可以包括:
第一获取模块101,用于获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;
第一映射模块102,用于基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;
第一调整模块103,用于基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整;
第一判断模块104,用于判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;
其中,映射关系包括实时交通环境信息与已有深度强化学习模型的已有交通环境信息间的映射关系。
本申请实施例提供的一种自动驾驶系统,第一调整模块可以包括:
第一处理单元,用于基于已有深度强化学习模型的已有策略网络的参数对映射交通环境信息进行处理,得到车辆动作;
第一计算单元,用于基于目标深度强化学习模型的评价网络计算车辆动作的价值函数值;
第一获取单元,用于获取车辆动作的回报值;
第一更新单元,用于基于回报值、价值函数值更新评价网络的参数。
本申请实施例提供的一种自动驾驶系统,第一更新单元可以包括:
第二计算单元,用于通过损失函数计算公式,基于回报值、价值函数值计算损失值;
第二更新单元,用于通过最小化损失值来更新评价网络的参数;
其中,损失函数计算公式包括:
L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
其中,L表示损失值;N表示采集的样本数量;r t表示t时刻下的回报值;γ表示折扣因子,0<γ<1;Q′ ω(s t+1,a t+1)表示评价网络中的目标网络在t+1时刻下计算得到的价值函数值;s t+1表示t+1时刻下的交通环境信息;a t+1表示t+1时刻下的车辆动作;Q ω(s t,a t)表示评价网络中的预测网络在t时刻下计算得到的价值函数值;s t表示t时刻下的交通环境信息;a t表示t时刻下的车辆动作。
本申请实施例提供的一种自动驾驶系统,还可以包括:
第三更新单元,用于第一更新单元基于回报值、价值函数值更新目标评价网络的参数之后,对目标深度强化学习模型的策略网络的参数进行更新。
本申请实施例提供的一种自动驾驶系统,还可以包括:
第二获取模块,用于第一映射模块基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息之前,获取目标交通环境信息;
第一读取模块,用于读取已有交通环境信息;
第一计算模块,用于在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;
第一确定模块,用于通过最小化距离值来确定映射关系。
本申请实施例提供的一种自动驾驶系统,第一计算模块可以包括:
第三计算单元,用于通过距离值计算公式,在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;
距离值计算公式包括:
MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
其中,MMD H(D S,D T)表示距离值;D S表示已有交通环境信息;D T表示目标交通环境信息;n表示已有交通环境信息中的样本数量;m表示目标交通环境信息中的样本数量;A表示映射关系;T表示转置;s S表示已有交通环境信息中的交通环境信息;s T表示目标交通环境信息中的交通环境信息;H表示再生核希尔伯特空间。
本申请实施例提供的一种自动驾驶系统,第一确定模块可以包括:
第一确定单元,用于基于正则线性回归方法或支持向量机方法或主成分分析方法,通过最小化距离值来确定映射关系。
本申请还提供了一种自动驾驶设备及计算机可读存储介质,其均具有本申请实施例提供的一种自动驾驶方法具有的对应效果。请参阅图4,图4为本申请实施例提供的一种自动驾驶设备的结构示意图。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:
获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;
基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;
基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整;
判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;
其中,映射关系包括实时交通环境信息与已有深度强化学习模型的已有交通环境信息间的映射关系。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:基于已有深度强化学习模型的已有策略网络的参数对映射交通环境信息进行处理,得到车辆动作;基于目标深度强化学习模型的评价网络计算车辆动作的价值函数值;获取车辆动作的回报值;基于回报值、价值函数值更新评价网络的参数。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:通过损失函数计算公式,基于回报值、价值函数值计算损失值;通过最小化损失值来更新评价网络的参数;其中,损失函数计算公式包括:
L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
其中,L表示损失值;N表示采集的样本数量;r t表示t时刻下的回报值;γ表示折扣因子,0<γ<1;Q′ ω(s t+1,a t+1)表示评价网络中的目标网络在t+1时刻下计算得到的价值函数值;s t+1表示t+1时刻下的交通环境信息;a t+1表示t+1时刻下的车辆动作;Q ω(s t,a t)表示评价网络中的预测网络在t时刻下计 算得到的价值函数值;s t表示t时刻下的交通环境信息;a t表示t时刻下的车辆动作。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:基于回报值、价值函数值更新评价网络的参数之后,对目标深度强化学习模型的策略网络的参数进行更新。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息之前,获取目标交通环境信息;读取已有交通环境信息;在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;通过最小化距离值来确定映射关系。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:通过距离值计算公式,在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;
距离值计算公式包括:
MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
其中,MMD H(D S,D T)表示距离值;D S表示已有交通环境信息;D T表示目标交通环境信息;n表示已有交通环境信息中的样本数量;m表示目标交通环境信息中的样本数量;A表示映射关系;T表示转置;s S表示已有交通环境信息中的交通环境信息;s T表示目标交通环境信息中的交通环境信息;H表示再生核希尔伯特空间。
本申请实施例提供的一种自动驾驶设备,包括存储器201和处理器202,存储器201中存储有计算机程序,处理器202执行计算机程序时实现如下步骤:基于正则线性回归方法或支持向量机方法或主成分分析方法,通过最小化距离值来确定映射关系。
请参阅图5,本申请实施例提供的另一种自动驾驶设备中还可以包括: 与处理器202连接的输入端口203,用于传输外界输入的命令至处理器202;与处理器202连接的显示单元204,用于显示处理器202的处理结果至外界;与处理器202连接的通信模块205,用于实现自动驾驶设备与外界的通信。显示单元204可以为显示面板、激光扫描使显示器等;通信模块205所采用的通信方式包括但不局限于移动高清链接技术(HML)、通用串行总线(USB)、高清多媒体接口(HDMI)、无线连接:无线保真技术(WiFi)、蓝牙通信技术、低功耗蓝牙通信技术、基于IEEE802.11s的通信技术。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:
获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息;
基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息;
基于预先存储的已有深度强化学习模型及映射交通环境信息,对目标深度强化学习模型进行调整;
判断是否结束自动驾驶,若否,则返回执行获取当前时刻下,自动驾驶车辆在行驶过程中的实时交通环境信息的步骤;
其中,映射关系包括实时交通环境信息与已有深度强化学习模型的已有交通环境信息间的映射关系。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:基于已有深度强化学习模型的已有策略网络的参数对映射交通环境信息进行处理,得到车辆动作;基于目标深度强化学习模型的评价网络计算车辆动作的价值函数值;获取车辆动作的回报值;基于回报值、价值函数值更新评价网络的参数。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:通过损失函数计算公式,基于回报值、价值函数值计算损失值;通过最小化损失值来更新评价网络的参数;其中,损失函数计算公式包括:
L = (1/N) Σ_t [ r_t + γ·Q′_ω(s_{t+1}, a_{t+1}) − Q_ω(s_t, a_t) ]²
其中,L表示损失值;N表示采集的样本数量;r t表示t时刻下的回报值;γ表示折扣因子,0<γ<1;Q′ ω(s t+1,a t+1)表示评价网络中的目标网络在t+1时刻下计算得到的价值函数值;s t+1表示t+1时刻下的交通环境信息;a t+1表示t+1时刻下的车辆动作;Q ω(s t,a t)表示评价网络中的预测网络在t时刻下计算得到的价值函数值;s t表示t时刻下的交通环境信息;a t表示t时刻下的车辆动作。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:基于回报值、价值函数值更新评价网络的参数之后,对目标深度强化学习模型的策略网络的参数进行更新。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:基于预设的映射关系对实时交通环境信息进行映射,得到映射交通环境信息之前,获取目标交通环境信息;读取已有交通环境信息;在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;通过最小化距离值来确定映射关系。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:通过距离值计算公式,在再生核希尔伯特空间中,计算目标交通环境信息与已有交通环境信息间的距离值;
距离值计算公式包括:
MMD_H(D_S, D_T) = ‖ (1/n) Σ_{i=1}^{n} A^T s_i^S − (1/m) Σ_{j=1}^{m} A^T s_j^T ‖²_H
其中,MMD H(D S,D T)表示距离值;D S表示已有交通环境信息;D T表示目标交通环境信息;n表示已有交通环境信息中的样本数量;m表示目标交通环境信息中的样本数量;A表示映射关系;T表示转置;s S表示已有 交通环境信息中的交通环境信息;s T表示目标交通环境信息中的交通环境信息;H表示再生核希尔伯特空间。
本申请实施例提供的一种计算机可读存储介质,计算机可读存储介质中存储有计算机程序,计算机程序被处理器执行时实现如下步骤:基于正则线性回归方法或支持向量机方法或主成分分析方法,通过最小化距离值来确定映射关系。
本申请所涉及的计算机可读存储介质包括随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质。
本申请实施例提供的自动驾驶系统、设备及计算机可读存储介质中相关部分的说明请参见本申请实施例提供的自动驾驶方法中对应部分的详细说明,在此不再赘述。另外,本申请实施例提供的上述技术方案中与现有技术中对应技术方案实现原理一致的部分并未详细说明,以免过多赘述。
还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. An autonomous driving method, comprising:
    acquiring real-time traffic environment information of an autonomous vehicle during driving at a current moment;
    mapping the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information;
    adjusting a target deep reinforcement learning model based on a pre-stored existing deep reinforcement learning model and the mapped traffic environment information;
    determining whether autonomous driving has ended, and if not, returning to the step of acquiring the real-time traffic environment information of the autonomous vehicle during driving at the current moment;
    wherein the mapping relationship comprises a mapping relationship between the real-time traffic environment information and existing traffic environment information of the existing deep reinforcement learning model.
  2. The method according to claim 1, wherein the adjusting the target deep reinforcement learning model based on the pre-stored existing deep reinforcement learning model and the mapped traffic environment information comprises:
    processing the mapped traffic environment information based on parameters of an existing policy network of the existing deep reinforcement learning model to obtain a vehicle action;
    calculating a value function value of the vehicle action based on a critic network of the target deep reinforcement learning model;
    acquiring a reward value of the vehicle action; and
    updating parameters of the critic network based on the reward value and the value function value.
  3. The method according to claim 2, wherein the updating the parameters of the critic network based on the reward value and the value function value comprises:
    calculating a loss value based on the reward value and the value function value through a loss function formula; and
    updating the parameters of the critic network by minimizing the loss value;
    wherein the loss function formula comprises:
    L = \frac{1}{N}\sum_{t=1}^{N}\left[r_t + \gamma Q'_{\omega}(s_{t+1},a_{t+1}) - Q_{\omega}(s_t,a_t)\right]^2
    wherein L denotes the loss value; N denotes the number of collected samples; r_t denotes the reward value at time t; γ denotes the discount factor, 0 < γ < 1; Q'_ω(s_{t+1}, a_{t+1}) denotes the value function value calculated at time t+1 by a target network in the critic network; s_{t+1} denotes the traffic environment information at time t+1; a_{t+1} denotes the vehicle action at time t+1; Q_ω(s_t, a_t) denotes the value function value calculated at time t by a prediction network in the critic network; s_t denotes the traffic environment information at time t; and a_t denotes the vehicle action at time t.
  4. The method according to claim 2, wherein after the updating the parameters of the critic network based on the reward value and the value function value, the method further comprises:
    updating parameters of a policy network of the target deep reinforcement learning model.
  5. The method according to any one of claims 1 to 4, wherein before the mapping the real-time traffic environment information based on the preset mapping relationship to obtain the mapped traffic environment information, the method further comprises:
    acquiring target traffic environment information;
    reading the existing traffic environment information;
    calculating, in a reproducing kernel Hilbert space, a distance value between the target traffic environment information and the existing traffic environment information; and
    determining the mapping relationship by minimizing the distance value.
  6. The method according to claim 5, wherein the calculating, in the reproducing kernel Hilbert space, the distance value between the target traffic environment information and the existing traffic environment information comprises:
    calculating, through a distance value formula, the distance value between the target traffic environment information and the existing traffic environment information in the reproducing kernel Hilbert space;
    wherein the distance value formula comprises:
    \mathrm{MMD}_H(D_S,D_T) = \left\| \frac{1}{n}\sum_{i=1}^{n} A^{T}s_i^{S} - \frac{1}{m}\sum_{j=1}^{m} A^{T}s_j^{T} \right\|_H
    wherein MMD_H(D_S, D_T) denotes the distance value; D_S denotes the existing traffic environment information; D_T denotes the target traffic environment information; n denotes the number of samples in the existing traffic environment information; m denotes the number of samples in the target traffic environment information; A denotes the mapping relationship; T denotes the transpose; s^S denotes a piece of traffic environment information in the existing traffic environment information; s^T denotes a piece of traffic environment information in the target traffic environment information; and H denotes the reproducing kernel Hilbert space.
  7. The method according to claim 6, wherein the determining the mapping relationship by minimizing the distance value comprises:
    determining the mapping relationship by minimizing the distance value based on a regularized linear regression method, a support vector machine method, or a principal component analysis method.
  8. An autonomous driving system, comprising:
    a first acquisition module, configured to acquire real-time traffic environment information of an autonomous vehicle during driving at a current moment;
    a first mapping module, configured to map the real-time traffic environment information based on a preset mapping relationship to obtain mapped traffic environment information;
    a first adjustment module, configured to adjust a target deep reinforcement learning model based on a pre-stored existing deep reinforcement learning model and the mapped traffic environment information;
    a first determination module, configured to determine whether autonomous driving has ended, and if not, return to the step of acquiring the real-time traffic environment information of the autonomous vehicle during driving at the current moment;
    wherein the mapping relationship comprises a mapping relationship between the real-time traffic environment information and existing traffic environment information of the existing deep reinforcement learning model.
  9. An autonomous driving device, comprising:
    a memory configured to store a computer program; and
    a processor configured to implement, when executing the computer program, the steps of the autonomous driving method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the autonomous driving method according to any one of claims 1 to 7.
PCT/CN2021/109174 2020-10-29 2021-07-29 Autonomous driving decision-making method, system, device, and computer storage medium WO2022088798A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/246,126 US20230365163A1 (en) 2020-10-29 2021-07-29 Automatic Driving Decision Making Method, System And Device And Computer Storage Medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011181627.3 2020-10-29
CN202011181627.3A CN112249032B (zh) 2020-10-29 Autonomous driving decision-making method, system, device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022088798A1 true WO2022088798A1 (zh) 2022-05-05

Family

ID=74267165

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109174 WO2022088798A1 (zh) 2020-10-29 2021-07-29 Autonomous driving decision-making method, system, device, and computer storage medium

Country Status (3)

Country Link
US (1) US20230365163A1 (zh)
CN (1) CN112249032B (zh)
WO (1) WO2022088798A1 (zh)


Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112249032B (zh) * 2020-10-29 2022-02-18 浪潮(北京)电子信息产业有限公司 Autonomous driving decision-making method, system, device, and computer storage medium
CN112947466B (zh) * 2021-03-09 2023-04-07 湖北大学 Parallel planning method, device, and storage medium for autonomous driving
CN113511215B (zh) * 2021-05-31 2022-10-04 西安电子科技大学 Hybrid autonomous driving decision-making method, device, and computer storage medium
CN113253612B (zh) * 2021-06-01 2021-09-17 苏州浪潮智能科技有限公司 Autonomous driving control method, apparatus, device, and readable storage medium
CN114291111B (zh) * 2021-12-30 2024-03-08 广州小鹏自动驾驶科技有限公司 Target path determination method and apparatus, vehicle, and storage medium
CN114104005B (zh) * 2022-01-26 2022-04-19 苏州浪潮智能科技有限公司 Decision-making method, apparatus, and device for autonomous driving equipment, and readable storage medium
CN114859921B (zh) * 2022-05-12 2024-06-28 鹏城实验室 Reinforcement-learning-based autonomous driving optimization method and related device
CN118313484B (zh) * 2024-05-24 2024-08-06 中国科学技术大学 Model-based offline-to-online reinforcement learning method for autonomous driving
CN118410875B (zh) * 2024-06-27 2024-10-11 广汽埃安新能源汽车股份有限公司 Autonomous driving behavior model generation method and system


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102166811B1 (ko) * 2019-01-21 2020-10-19 한양대학교 산학협력단 Method and apparatus for controlling an autonomous vehicle using deep reinforcement learning and a driver assistance system
US11560146B2 (en) * 2019-03-26 2023-01-24 Ford Global Technologies, Llc Interpreting data of reinforcement learning agent controller
CN110673602B (zh) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforcement learning model, vehicle autonomous driving decision-making method, and on-board device
CN111273676B (zh) * 2020-03-24 2023-04-18 广东工业大学 End-to-end autonomous driving method and system
CN111401556B (zh) * 2020-04-22 2023-06-30 清华大学深圳国际研究生院 Method for selecting a reward function in adversarial imitation learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168303A (zh) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 Autonomous driving method and apparatus for a vehicle
CN107506830A (zh) * 2017-06-20 2017-12-22 同济大学 Artificial intelligence training platform for the planning and decision-making module of intelligent vehicles
EP3570214A2 (en) * 2018-09-12 2019-11-20 Baidu Online Network Technology (Beijing) Co., Ltd. Automobile image processing method and apparatus, and readable storage medium
CN109835375A (zh) * 2019-01-29 2019-06-04 中国铁道科学研究院集团有限公司通信信号研究所 Automatic train operation system for high-speed railways based on artificial intelligence technology
CN110647839A (zh) * 2019-09-18 2020-01-03 深圳信息职业技术学院 Method and apparatus for generating autonomous driving policies, and computer-readable storage medium
CN111123738A (zh) * 2019-11-25 2020-05-08 的卢技术有限公司 Method and system for improving the training efficiency of deep reinforcement learning algorithms in simulation environments
CN112249032A (zh) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 Autonomous driving decision-making method, system, device, and computer storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708568A (zh) * 2022-06-07 2022-07-05 东北大学 Vision-only autonomous driving control system, method, and medium based on improved RTFNet
CN115903457A (zh) * 2022-11-02 2023-04-04 曲阜师范大学 Deep-reinforcement-learning-based control method for a low-wind-speed permanent magnet synchronous wind turbine
CN115903457B (zh) * 2022-11-02 2023-09-08 曲阜师范大学 Deep-reinforcement-learning-based control method for a low-wind-speed permanent magnet synchronous wind turbine
CN116128013A (zh) * 2023-04-07 2023-05-16 中国人民解放军国防科技大学 Ad hoc collaboration method, apparatus, and computer device based on diverse population training
CN116128013B (zh) * 2023-04-07 2023-07-04 中国人民解放军国防科技大学 Ad hoc collaboration method, apparatus, and computer device based on diverse population training

Also Published As

Publication number Publication date
US20230365163A1 (en) 2023-11-16
CN112249032B (zh) 2022-02-18
CN112249032A (zh) 2021-01-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21884533

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21884533

Country of ref document: EP

Kind code of ref document: A1