WO2022252457A1 - Autonomous driving control method, apparatus and device, and readable storage medium - Google Patents

Autonomous driving control method, apparatus and device, and readable storage medium

Info

Publication number
WO2022252457A1
WO2022252457A1 PCT/CN2021/121903 CN2021121903W WO2022252457A1 WO 2022252457 A1 WO2022252457 A1 WO 2022252457A1 CN 2021121903 W CN2021121903 W CN 2021121903W WO 2022252457 A1 WO2022252457 A1 WO 2022252457A1
Authority
WO
WIPO (PCT)
Prior art keywords
strategy
noise
noisy
automatic driving
network
Prior art date
Application number
PCT/CN2021/121903
Other languages
English (en)
French (fr)
Inventor
李仁刚
赵雅倩
李茹杨
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 filed Critical 苏州浪潮智能科技有限公司
Priority to US18/039,271 priority Critical patent/US11887009B2/en
Publication of WO2022252457A1 publication Critical patent/WO2022252457A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • B60W60/0015Planning or execution of driving tasks specially adapted for safety
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/01Detecting movement of traffic to be counted or controlled
    • G08G1/0104Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0108Measuring and analyzing of parameters relative to traffic conditions based on the source of data
    • G08G1/0112Measuring and analyzing of parameters relative to traffic conditions based on the source of data from the vehicle, e.g. floating car data [FCD]
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/01Detecting movement of traffic to be counted or controlled
    • G08G1/0104Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125Traffic data processing
    • G08G1/0133Traffic data processing for classifying traffic situation
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/01Detecting movement of traffic to be counted or controlled
    • G08G1/0104Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0137Measuring and analyzing of parameters relative to traffic conditions for specific applications
    • G08G1/0141Measuring and analyzing of parameters relative to traffic conditions for specific applications for traffic information dissemination
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0062Adapting control system settings
    • B60W2050/0075Automatic parameter input, automatic initialising or calibrating means
    • B60W2050/0082Automatic parameter input, automatic initialising or calibrating means for initialising the control system
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2554/00Input parameters relating to objects
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/0265Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B13/027Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the present application relates to the technical field of automatic driving, in particular to an automatic driving control method, device, equipment and readable storage medium.
  • Autonomous driving is a very complex integrated technology, covering hardware devices such as on-board sensors, data processors, and controllers. With the help of modern mobile communication and network technologies, information transmission and sharing among traffic participants is realized, and complex algorithms complete functions such as environment perception, decision-making and planning, and control execution, realizing automatic acceleration/deceleration, steering, overtaking, braking, and other vehicle operations.
  • the existing autonomous driving research and application methods are mainly divided into two categories, modular approach and end-to-end approach.
  • the reinforcement learning method in the end-to-end method explores and improves the autonomous driving policy from scratch with the help of Markov decision process (MDP). Due to the rapid development of advanced machine learning methods represented by reinforcement learning and the inherent potential to surpass human drivers, the research and application of autonomous driving based on reinforcement learning has broad prospects for development.
  • in this process, the vehicle selects an action based on the current traffic environment state and the driving strategy represented by the neural network.
  • an exploration noise is added on the basis of the selected action to increase the exploratory nature of the automatic driving strategy.
  • the exploration noise generally takes the form of Gaussian distribution sampling. Because the exploration noise is random, it is not related to the environment state or the driving strategy, and the magnitude of the added noise is uncontrollable.
  • as a result, self-driving vehicles may make different decisions in the face of the same traffic state, and if there is a problem with the finally generated decision, it cannot be determined whether the problem is caused by the neural network or by the disturbance; this makes exploration harder to predict and poses safety hazards for autonomous driving.
  • the purpose of the present application is to provide an automatic driving control method, device, equipment and readable storage medium, which can improve the stability and safety of the automatic driving vehicle.
  • An automatic driving control method comprising:
  • initializing system parameters of a deep reinforcement learning automatic driving decision-making system, wherein the deep reinforcement learning automatic driving decision-making system includes: a noise-free strategy network and a noisy strategy network;
  • acquiring vehicle traffic environment state information;
  • inputting the vehicle traffic environment state information into the noise-free strategy network and the noisy strategy network respectively to generate automatic driving strategies, and obtaining a noise-free strategy and a noisy strategy;
  • adjusting a noise parameter injected into the noisy strategy network within a disturbance threshold range according to the noisy strategy and the noise-free strategy;
  • performing parameter optimization on the system parameters of the noisy strategy network according to the noise parameter, to generate an optimized noisy strategy network;
  • performing automatic driving control according to the driving strategy generated by the optimized noisy strategy network.
  • the adjusting of the noise parameters injected into the noisy strategy network within a disturbance threshold range according to the noisy strategy and the noise-free strategy includes: calculating a strategy difference between the noisy strategy and the noise-free strategy; judging whether the strategy difference exceeds the disturbance threshold; if it exceeds, using the quotient of the strategy difference and a modulation factor as the noise parameter; if it does not exceed, using the product of the strategy difference and the modulation factor as the noise parameter, wherein the modulation factor is greater than 1.
  • the parameter optimization of the system parameters of the noisy strategy network according to the noise parameters includes: optimizing the system parameters of the noise-free strategy network according to the noisy strategy, and using the optimized system parameters of the noise-free strategy network as original parameters;
  • the sum of the original parameters and the noise parameter is then used as the optimized system parameters of the noisy strategy network.
  • before the automatic driving control is performed according to the driving strategy generated by the optimized noisy strategy network, the method also includes: determining the number of executions of the parameter optimization, and judging whether the number of executions reaches a training times threshold;
  • if the number of executions reaches the training times threshold, the step of performing automatic driving control according to the driving strategy generated by the optimized noisy strategy network is executed;
  • if the number of executions does not reach the training times threshold, the step of acquiring vehicle traffic environment state information is executed.
  • the automatic driving control method also includes:
  • if a driving accident notification is received, executing the step of initializing the system parameters of the deep reinforcement learning automatic driving decision-making system.
  • An automatic driving control device comprising:
  • the parameter initialization unit is used to initialize the system parameters of the deep reinforcement learning automatic driving decision-making system; wherein, the deep reinforcement learning automatic driving decision-making system includes: a noiseless strategy network and a noisy strategy network;
  • An environment acquisition unit configured to acquire vehicle traffic environment state information
  • a strategy generation unit configured to input the vehicle traffic environment state information into the noise-free strategy network and the noisy strategy network respectively to generate automatic driving strategies, and obtain a noise-free strategy and a noisy strategy;
  • a noise adjustment unit configured to adjust noise parameters injected into the noisy strategy network within a disturbance threshold range according to the noisy strategy and the noise-free strategy;
  • a parameter optimization unit configured to perform parameter optimization on the system parameters of the noisy strategy network according to the noise parameters, to generate an optimized noisy strategy network
  • a driving control unit configured to perform automatic driving control according to the driving strategy generated by the optimized noisy strategy network.
  • the noise adjustment unit includes:
  • a difference calculation subunit configured to calculate a strategy difference between the noisy strategy and the noise-free strategy
  • the difference judging subunit is used to judge whether the policy difference exceeds the disturbance threshold; if it exceeds, trigger the first processing subunit; if not, trigger the second processing subunit;
  • the first processing subunit is configured to use the quotient of the strategy difference and the modulation factor as the noise parameter;
  • the second processing subunit is configured to use the product of the strategy difference and the modulation factor as the noise parameter; wherein the modulation factor is greater than 1.
  • the parameter optimization unit includes:
  • a parameter determination subunit configured to optimize the system parameters of the noise-free strategy network according to the noisy strategy, and use the optimized system parameters of the noise-free strategy network as original parameters;
  • the sum optimization subunit is configured to use the sum of the original parameters and the noise parameters as the optimized system parameters of the noisy strategy network.
  • An automatic driving control device comprising:
  • a memory configured to store a computer program; and a processor configured to implement the steps of the above automatic driving control method when executing the computer program.
  • a readable storage medium where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the above automatic driving control method are realized.
  • the method provided by the embodiments of the present application uses a noisy and a noise-free dual-strategy network for optimized parameter setting: the same vehicle traffic environment state information is input into the noisy and noise-free dual-strategy networks, the noise-free strategy network is used as the comparison and benchmark, and an action-space disturbance threshold is set for adaptive adjustment of the noise parameter. By adaptively injecting noise into the policy network parameter space and thereby indirectly adding action noise, the method can effectively improve the deep reinforcement learning algorithm's exploration of the environment and action space, improve the exploration performance and stability of autonomous driving based on deep reinforcement learning, and ensure that vehicle decision-making and action selection fully consider the influence of the environment state and the driving strategy, thereby improving the stability and safety of the autonomous vehicle.
  • the embodiment of the present application also provides an automatic driving control device, equipment, and readable storage medium corresponding to the above automatic driving control method, which have the above technical effects, and will not be repeated here.
  • Fig. 1 is a flowchart of an automatic driving control method in an embodiment of the present application;
  • FIG. 2 is a schematic structural diagram of an automatic driving control device in an embodiment of the present application.
  • Fig. 3 is a schematic structural diagram of an automatic driving control device in an embodiment of the present application.
  • the core of the present application is to provide an automatic driving control method, which can improve the stability and safety of the automatic driving vehicle.
  • the automatic driving sequential decision-making process based on deep reinforcement learning is as follows: the autonomous vehicle uses the driving strategy represented by a neural network to select an action, such as acceleration/deceleration, steering, lane changing, or braking, according to the current traffic environment state, and obtains a reward.
  • the self-driving vehicle adjusts the driving strategy according to the obtained reward and, combined with the new traffic state, enters the next decision-making step.
  • autonomous vehicles make sequential decisions through interaction with the environment and learn an optimal driving strategy to achieve safe driving.
  • to help the autonomous vehicle fully explore the action space, the main method currently used in the existing technology is to add an exploration noise to the action selected in each decision-making step, generally in the form of Gaussian distribution sampling; for example, if the policy network generates an action command to speed up to 50km/h and a random value, such as 10, is sampled from the Gaussian distribution, the finally generated action command is to speed up to 60km/h (50+10).
  • this method of adding exploration noise is very simple; however, such random noise is not related to the environment state or the driving strategy, so the autonomous vehicle may make different decisions in the face of the same traffic state, which makes exploration more unpredictable and brings safety risks.
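  • as an illustration only, the following minimal Python sketch shows this conventional action-space exploration scheme; the speed value and the noise scale are hypothetical examples, not parameters taken from this application:

```python
import numpy as np

def noisy_action(policy_action: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Conventional exploration: add Gaussian noise directly to the action.

    The noise is sampled independently of the traffic state and the driving
    strategy, so its effect on the final command is uncontrolled.
    """
    return policy_action + np.random.normal(0.0, sigma, size=policy_action.shape)

# Example: the policy commands a speed-up to 50 km/h; a sampled noise value of
# about +10 turns it into a command of roughly 60 km/h.
action = np.array([50.0])   # hypothetical "target speed" action
print(noisy_action(action, sigma=10.0))
```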
  • the deep reinforcement learning automatic driving decision-making system is a system for generating automatic driving strategy information built in this embodiment.
  • the deep reinforcement learning automatic driving decision-making system includes two strategy networks, a noiseless strategy network and a noisy strategy network.
  • the noise-free policy network refers to the policy network without noise (No_Noise_Net), and the noisy policy network refers to the policy network with hidden noise (Noise_Net); a policy network here is a network built on the deep reinforcement learning policy parameter space.
  • the deep reinforcement learning algorithm used in the deep reinforcement learning automatic driving decision-making system is not limited in this embodiment; considering the continuity of the state space and action space of the automatic driving problem, deep reinforcement learning algorithms such as DDPG, A3C, SAC, and TD3 can be selected.
  • this embodiment mainly takes the relatively simple DDPG algorithm as an example for illustration; the application of other deep reinforcement learning algorithms can refer to the introduction of this embodiment and will not be repeated here.
  • the system parameters initialized for the deep reinforcement learning automatic driving decision-making system mainly include four items: θ_0 (the initial policy parameters without noise), θ′_0 (the initial policy parameters with implicit noise), ω_0 (the initial network parameters), and the initial policy parameter noise σ_0.
  • the deep reinforcement learning automatic driving decision system also includes an evaluation network (Critic_Net).
  • the vehicle traffic environment state information refers to the traffic environment state information around the vehicle to be controlled by automatic driving. The collection process of the vehicle traffic environment state information and the information items it specifically contains (as long as automatic driving control can be realized on their basis) are not limited in this embodiment.
  • for example, vehicle-mounted sensor devices such as cameras, a global positioning system, inertial measurement units, millimeter-wave radar, and lidar can be used to obtain the driving environment state (such as weather data, traffic lights, and traffic topology information), information such as the positions and running states of the autonomous vehicle and other traffic participants, the raw image data directly obtained by the camera, and the depth map and semantic segmentation map processed by a deep learning model (such as RefineNet). These driving environment states, the current autonomous vehicle information, the positions of other traffic participants, the running states of other traffic participants, and the semantic segmentation map are used as the vehicle traffic environment state information.
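  • as an illustration of how such heterogeneous observations could be packed into a single state representation, a minimal sketch is given below; all field names and shapes are hypothetical assumptions, not requirements of this application:

```python
import numpy as np

def build_state(weather, traffic_light, ego_pose, other_agents,
                depth_map, semantic_map) -> np.ndarray:
    """Flatten and concatenate heterogeneous observations into one state vector s_t.

    In practice the image-like inputs (depth map, semantic segmentation map)
    would usually be encoded by a CNN before concatenation; a plain flatten is
    used here only to keep the sketch self-contained.
    """
    parts = [weather, traffic_light, ego_pose, other_agents, depth_map, semantic_map]
    return np.concatenate([np.asarray(p, dtype=np.float32).ravel() for p in parts])

# Hypothetical sizes: 3 weather features, 4 traffic-light states, 6 ego-pose
# values, 8 values for nearby agents, and two small 16x16 maps.
s_t = build_state(np.zeros(3), np.zeros(4), np.zeros(6), np.zeros(8),
                  np.zeros((16, 16)), np.zeros((16, 16)))
print(s_t.shape)
```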
  • the vehicle traffic environment state information is input into the noise-free policy network and the noisy policy network respectively. The policy network without noise (the noise-free policy network) and the policy network with hidden noise (the noisy policy network) share one policy function π; that is, the two networks share one set of calculation methods for automatic driving, and each can complete the calculation independently. For the input state s_t, the noise-free policy network generates the noise-free action a_t = π(a_t|s_t, θ) based on the noise-free policy parameters θ, and the noisy policy network generates the noisy action a′_t = π(a′_t|s_t, θ′) based on the implicitly noisy policy parameters θ′.
  • the process of invoking the two networks to respectively process the vehicle traffic environment state information can refer to the information processing method of the current policy network, which is not limited here.
  • the difference between the noisy strategy and the noise-free strategy can indicate the degree to which the noise influences the automatic driving decision. If the difference is too large, the added noise may be too large and may interfere severely with normal decision-making, causing the noisy strategy to deviate.
  • for example, if the original strategy indicates accelerating to 50km/h, adding a large noise may cause the strategy to change to accelerating to 70km/h, resulting in overspeed and other factors unfavorable to safe and stable driving.
  • a disturbance threshold is set.
  • the disturbance threshold is the range of the added noise, and the noise is limited within the range of the disturbance threshold.
  • the specific noise value adjustment rules are not limited in this embodiment, and can be set according to actual use needs.
  • One implementation is as follows:
  • the evaluation standard of the strategy difference is not limited in this embodiment; for example, the distance between the two actions can be used as the evaluation standard of the strategy difference.
  • accordingly, the strategy difference between the noisy strategy and the noise-free strategy is calculated, that is, the disturbance magnitude of the policy parameter noise on the action, d = ‖a_t − a′_t‖.
  • the disturbance threshold is a preset policy difference threshold.
  • in this application, the strategy difference between the noisy strategy network and the noise-free strategy in actual strategy generation is controlled so that it does not exceed the disturbance threshold, which prevents the noise parameters from having an excessive influence on strategy generation and degrading the accuracy and stability of the generated strategy.
  • if the strategy difference exceeds the disturbance threshold, the quotient of the distance and the modulation factor is used as the noise parameter; if it does not exceed the threshold, the product of the distance and the modulation factor is used as the noise parameter; wherein the modulation factor is greater than 1.
  • if the strategy difference exceeds the disturbance threshold, the current noise disturbance is too large and the noise parameter needs to be reduced; the reduction strategy proposed in this embodiment is to take the quotient. If the strategy difference does not exceed the disturbance threshold, the noise parameter can be increased to enhance the exploration of deep reinforcement learning while keeping the noise disturbance within the disturbance threshold;
  • the strategy for increasing the noise parameter proposed in this embodiment is to take the product.
  • this embodiment only introduces the above noise parameter adjustment method as an example.
  • other calculation methods can also be adopted, such as subtracting a certain value if the disturbance threshold is exceeded and adding a certain value if it is not exceeded; these can refer to the introduction of this embodiment and are not repeated here.
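  • a minimal sketch of the quotient/product rule described above is given below, assuming the strategy difference is measured as the Euclidean distance between the two actions; the threshold and modulation factor values are illustrative only:

```python
import numpy as np

def adapt_noise(a_noise_free: np.ndarray, a_noisy: np.ndarray,
                delta: float, alpha: float = 1.05) -> float:
    """Adaptively update the parameter-space noise scale sigma_{t+1}.

    d     -- strategy difference, here the Euclidean distance between the
             noise-free action a_t and the noisy action a'_t
    delta -- disturbance threshold on the action-space perturbation
    alpha -- modulation factor, required to be greater than 1
    """
    d = float(np.linalg.norm(a_noise_free - a_noisy))
    if d > delta:
        return d / alpha   # disturbance too large: shrink the noise (quotient)
    return d * alpha       # within the threshold: grow the noise (product)

sigma_next = adapt_noise(np.array([50.0]), np.array([53.0]), delta=5.0, alpha=1.05)
print(sigma_next)
```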
  • the system parameters of the noisy policy network are optimized according to the noise parameters.
  • specifically, the evaluation network (Critic_Net) parameters ω, the noise-free policy network (No_Noise_Net) parameters θ, and the hidden-noise network (Noise_Net) parameters θ′ need to be updated.
  • after the noise parameter is determined, the implementation of updating the evaluation network (Critic_Net) parameters ω and the noise-free policy network (No_Noise_Net) parameters θ can refer to implementations in the related art, which is not limited in this embodiment. To aid understanding, one implementation is given here:
  • the evaluation network computes the value function Q(s_t, a′_t) based on the implicitly noisy action a′_t and obtains the reward r_t given by the environment.
  • the network parameters ω are updated by minimizing a loss function.
  • the loss function is defined as L(ω) = (1/N) Σ_t ( r_Dt + γ Q′(s_Dt+1, a′_Dt+1) − Q(s_Dt, a′_Dt) )²,
  • where N is the number of collected samples,
  • and γ is the discount factor, usually taken as a constant between 0 and 1; the value Q′(s_Dt+1, a′_Dt+1) is computed from the data in the replay buffer D, which consists of previously collected historical transitions c_Dt = (s_Dt, a′_Dt, r_Dt, s_Dt+1) that all contain noisy actions.
  • J(θ) is the objective function of the policy gradient method, usually expressed as a function of the reward r_t. Maximizing the objective function yields the policy gradient ∇_θJ(θ), and the No_Noise_Net parameters θ are updated through θ ← θ + κ∇_θJ(θ), where κ is a fixed time-step parameter.
  • the adaptive noise parameter σ_{t+1} obtained above is then combined with the optimized noise-free policy network parameters θ by setting θ′ = θ + σ_{t+1}; that is, the sum of the optimized system parameters of the noise-free policy network and the noise parameter is used as the optimized system parameters of the noisy policy network.
  • this update method can ensure that the parameters of the noisy policy network remain accurately oriented toward the optimized noise-free parameters.
  • only the above-mentioned system parameter update method of the noisy policy network is used as an example for introduction. For other implementation methods, reference may be made to the introduction of this embodiment, which is not limited here.
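  • a minimal PyTorch-style sketch of this composition step is given below, assuming both policy networks share the same architecture; drawing the per-parameter noise as a Gaussian with scale σ is one common way to realize "injecting noise of magnitude σ into the parameter space" and is an assumption of this sketch rather than a detail fixed by the application:

```python
import torch

@torch.no_grad()
def refresh_noisy_network(no_noise_net: torch.nn.Module,
                          noise_net: torch.nn.Module,
                          sigma: float) -> None:
    """Set the noisy network parameters to theta' = theta + noise(sigma).

    no_noise_net -- optimized noise-free policy network (parameters theta)
    noise_net    -- policy network with hidden noise (parameters theta')
    sigma        -- adaptively adjusted noise scale sigma_{t+1}
    """
    for p_clean, p_noisy in zip(no_noise_net.parameters(), noise_net.parameters()):
        p_noisy.copy_(p_clean + sigma * torch.randn_like(p_clean))
```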
  • automatic driving control can be carried out according to the optimized noisy strategy network.
  • specifically, the vehicle traffic environment state information collected in real time is transmitted to the optimized noisy strategy network, and the driving strategy output by the optimized noisy strategy network is used as the driving strategy to be executed for automatic driving control; how automatic driving control is then carried out according to the determined driving strategy can refer to the introduction of the related art and is not repeated here.
  • the above is an exit method of parameter optimization, that is, when the number of executions of parameter optimization reaches the preset threshold of training times (for example, 10,000), the step of parameter optimization will be exited, and the currently generated system parameters will be used as optimized parameters.
  • in that case, the current deep reinforcement learning automatic driving decision-making system is used as the optimized network, and the steps of automatic driving control according to the optimized network are then performed; if the threshold has not been reached, new vehicle traffic environment state information is added on the basis of the system parameters generated in the previous round of optimization training, and the steps following step S102 are performed.
  • to ensure the safety of autonomous driving when deviations during parameter optimization threaten it, the current parameter optimization step can be exited if a driving accident notification is received: the step of initializing the system parameters of the deep reinforcement learning automatic driving decision-making system is performed, and the optimization training of the system parameters is carried out again on the basis of the re-initialized system parameters.
  • driving accidents include, for example, the current vehicle colliding or running out of its lane; they are not limited here.
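  • putting the exit and reset conditions together, a minimal training-loop skeleton might look as follows; the function names are placeholders for steps S101 to S106 described in this embodiment, not APIs defined by this application:

```python
def train(system, env, max_updates: int = 10_000):
    """Skeleton of the optimization loop with the two exit/reset conditions."""
    system.initialize_parameters()                    # S101
    updates = 0
    while updates < max_updates:                      # training-times threshold
        state = env.observe()                         # S102: traffic environment state
        a_clean, a_noisy = system.act(state)          # S103: both policy networks
        sigma = system.adapt_noise(a_clean, a_noisy)  # S104: threshold-based adjustment
        system.optimize(sigma)                        # S105: update omega, theta, theta'
        updates += 1
        if env.accident_reported():                   # e.g. collision or leaving the lane
            system.initialize_parameters()            # restart from re-initialized parameters
            updates = 0
    return system                                     # S106: drive with the optimized noisy network
```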
  • the technical solution provided by the embodiments of the present application uses a noisy and a noise-free dual-strategy network for optimized parameter setting and inputs the same vehicle traffic environment state information into the noisy and noise-free dual-strategy networks.
  • the noise-free strategy network is used as the comparison and benchmark, and an action-space disturbance threshold is set for adaptive adjustment of the noise parameter.
  • by adaptively injecting noise into the policy network parameter space and indirectly adding action noise, the deep reinforcement learning algorithm's exploration of the environment and action space can be effectively improved.
  • this improves the exploration performance and stability of autonomous driving based on deep reinforcement learning, ensuring that vehicle decision-making and action selection fully consider the influence of the environment state and the driving strategy, thereby improving the stability and safety of autonomous vehicles.
  • the embodiment of the present application also provides an automatic driving control device, and the automatic driving control device described below and the automatic driving control method described above can be referred to in correspondence.
  • the device includes the following modules:
  • the parameter initialization unit 110 is mainly used to initialize the system parameters of the deep reinforcement learning automatic driving decision-making system; wherein, the deep reinforcement learning automatic driving decision-making system includes: a noiseless strategy network and a noisy strategy network;
  • the environment acquiring unit 120 is mainly used to acquire vehicle traffic environment state information
  • the strategy generation unit 130 is mainly used to input the vehicle traffic environment state information into the noise-free strategy network and the noisy strategy network respectively to generate automatic driving strategies, and obtain a noise-free strategy and a noisy strategy;
  • the noise adjustment unit 140 is mainly used to adjust the noise parameters injected into the noisy strategy network within the disturbance threshold range according to the noisy strategy and the noise-free strategy;
  • the parameter optimization unit 150 is mainly used to optimize the system parameters of the noisy strategy network according to the noise parameters, and generate an optimized noisy strategy network;
  • the driving control unit 160 is mainly used to perform automatic driving control according to the driving strategy generated by the optimized noisy strategy network.
  • the noise adjustment unit includes:
  • a difference calculation subunit used to calculate the policy difference between the noisy policy and the noise-free policy
  • the difference judging subunit is used to judge whether the policy difference exceeds the disturbance threshold; if it exceeds, trigger the first processing subunit; if not, trigger the second processing subunit;
  • the first processing subunit is configured to use the quotient of the strategy difference and the modulation factor as the noise parameter;
  • the second processing subunit is configured to use the product of the strategy difference and the modulation factor as the noise parameter; wherein the modulation factor is greater than 1.
  • the parameter optimization unit includes:
  • a parameter determination subunit configured to optimize the system parameters of the noise-free strategy network according to the noisy strategy, and use the optimized system parameters of the noise-free strategy network as original parameters;
  • the sum optimization subunit is configured to use the sum of the original parameters and the noise parameters as the optimized system parameters of the noisy strategy network.
  • the embodiment of the present application also provides an automatic driving control device, and the automatic driving control device described below and the automatic driving control method described above can be referred to in correspondence.
  • the automatic driving control equipment includes:
  • the processor is configured to implement the steps of the automatic driving control method in the above method embodiment when executing the computer program.
  • FIG. 3 is a schematic structural diagram of an automatic driving control device provided in this embodiment.
  • the automatic driving control device may differ greatly due to different configurations or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors) and a memory 332, where the memory 332 stores one or more computer application programs 342 or data 344.
  • the memory 332 may be transient storage or persistent storage.
  • the program stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device.
  • the central processing unit 322 can be configured to communicate with the memory 332, and execute a series of instruction operations in the memory 332 on the automatic driving control device 301.
  • the automatic driving control device 301 may also include one or more power sources 326 , one or more wired or wireless network interfaces 350 , one or more input and output interfaces 358 , and/or, one or more operating systems 341 .
  • the steps in the automatic driving control method described above can be realized by the structure of the automatic driving control device.
  • the embodiment of the present application also provides a readable storage medium, and a readable storage medium described below and an automatic driving control method described above can be referred to in correspondence.
  • a readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the automatic driving control method in the above method embodiment are implemented.
  • the readable storage medium can be a USB flash drive, a mobile hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, etc., which can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
  • Traffic Control Systems (AREA)

Abstract

The present application discloses an automatic driving control method. The method uses a noisy and a noise-free dual policy network for optimized parameter setting, inputs the same vehicle traffic environment state information into the noisy and noise-free dual policy networks, takes the noise-free policy network as the comparison and benchmark, and sets an action-space disturbance threshold for adaptive adjustment of the noise parameter. By adaptively injecting noise into the policy network parameter space and thereby indirectly adding action noise, the exploration of the environment and action space by the deep reinforcement learning algorithm can be effectively improved, the exploration performance and stability of autonomous driving based on deep reinforcement learning are improved, and vehicle decision-making and action selection are guaranteed to fully consider the influence of the environment state and the driving strategy, thereby improving the stability and safety of the autonomous vehicle. The present application further discloses an automatic driving control apparatus, device, and readable storage medium, which have corresponding technical effects.

Description

Autonomous driving control method, apparatus and device, and readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on June 1, 2021, with application number 202110606769.8 and entitled "Autonomous driving control method, apparatus and device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of autonomous driving, and in particular to an automatic driving control method, apparatus, device, and readable storage medium.
Background
In modern urban traffic, the number of motor vehicles is increasing, road congestion is severe, and traffic accidents occur frequently. Driver assistance and autonomous driving, as the approaches with the greatest potential to improve traffic conditions and increase travel safety and convenience, are receiving more and more attention. Autonomous driving is a very complex integrated technology, covering hardware devices such as on-board sensors, data processors, and controllers; with the help of modern mobile communication and network technologies it realizes information transmission and sharing among traffic participants, and through complex algorithms it completes functions such as environment perception, decision-making and planning, and control execution, realizing automatic acceleration/deceleration, steering, overtaking, braking, and other vehicle operations.
Existing autonomous driving research and application approaches fall mainly into two categories: modular methods and end-to-end methods. Among the end-to-end methods, reinforcement learning explores and improves the autonomous driving policy from scratch with the help of the Markov decision process (MDP). Owing to the rapid development of advanced machine learning methods represented by reinforcement learning, and their inherent potential to surpass human drivers, research on and application of autonomous driving based on reinforcement learning has broad prospects.
At present, in the sequential decision-making process of autonomous driving based on deep reinforcement learning, the vehicle selects an action according to the current traffic environment state using a driving strategy represented by a neural network. To help the autonomous vehicle fully explore the action space, an exploration noise needs to be added to the action selected in each decision step to increase the exploratory nature of the autonomous driving strategy. The exploration noise generally takes the form of Gaussian distribution sampling. Because the exploration noise is random, it is unrelated to the environment state and the driving strategy, so the magnitude of the added noise is uncontrollable, and the autonomous vehicle may make different decisions when facing the same traffic state. If a problem arises in the finally generated decision, it cannot be determined whether the problem was caused by the neural network or by the disturbance, which makes exploration even harder to predict and easily brings safety hazards to autonomous driving.
In summary, how to improve the stability and safety of autonomous vehicles is a technical problem that those skilled in the art urgently need to solve.
Summary
The purpose of the present application is to provide an automatic driving control method, apparatus, device, and readable storage medium, which can improve the stability and safety of autonomous vehicles.
To solve the above technical problem, the present application provides the following technical solutions:
An automatic driving control method, including:
initializing system parameters of a deep reinforcement learning automatic driving decision-making system, where the deep reinforcement learning automatic driving decision-making system includes a noise-free policy network and a noisy policy network;
acquiring vehicle traffic environment state information;
inputting the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy;
adjusting, according to the noisy strategy and the noise-free strategy, a noise parameter injected into the noisy policy network within a disturbance threshold range;
performing parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network;
performing automatic driving control according to a driving strategy generated by the optimized noisy policy network.
Optionally, the adjusting, according to the noisy strategy and the noise-free strategy, of the noise parameter injected into the noisy policy network within the disturbance threshold range includes:
calculating a policy difference between the noisy strategy and the noise-free strategy;
judging whether the policy difference exceeds the disturbance threshold;
if it exceeds the threshold, taking the quotient of the policy difference and a modulation factor as the noise parameter;
if it does not exceed the threshold, taking the product of the policy difference and the modulation factor as the noise parameter, where the modulation factor is greater than 1.
Optionally, the performing parameter optimization on the system parameters of the noisy policy network according to the noise parameter includes:
performing parameter optimization on the system parameters of the noise-free policy network according to the noisy strategy, and taking the optimized system parameters of the noise-free policy network as original parameters;
taking the sum of the original parameters and the noise parameter as the optimized system parameters of the noisy policy network.
Optionally, before the performing automatic driving control according to the driving strategy generated by the optimized noisy policy network, the method further includes:
determining the number of executions of the parameter optimization;
judging whether the number of executions reaches a training-times threshold;
if the number of executions reaches the training-times threshold, executing the step of performing automatic driving control according to the driving strategy generated by the optimized noisy policy network;
if the number of executions does not reach the training-times threshold, executing the step of acquiring vehicle traffic environment state information.
Optionally, the automatic driving control method further includes:
if a driving accident notification is received, executing the step of initializing the system parameters of the deep reinforcement learning automatic driving decision-making system.
An automatic driving control apparatus, including:
a parameter initialization unit, configured to initialize system parameters of a deep reinforcement learning automatic driving decision-making system, where the deep reinforcement learning automatic driving decision-making system includes a noise-free policy network and a noisy policy network;
an environment acquisition unit, configured to acquire vehicle traffic environment state information;
a strategy generation unit, configured to input the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy;
a noise adjustment unit, configured to adjust, according to the noisy strategy and the noise-free strategy, a noise parameter injected into the noisy policy network within a disturbance threshold range;
a parameter optimization unit, configured to perform parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network;
a driving control unit, configured to perform automatic driving control according to a driving strategy generated by the optimized noisy policy network.
Optionally, the noise adjustment unit includes:
a difference calculation subunit, configured to calculate a policy difference between the noisy strategy and the noise-free strategy;
a difference judgment subunit, configured to judge whether the policy difference exceeds the disturbance threshold; if it exceeds, trigger a first processing subunit; if not, trigger a second processing subunit;
the first processing subunit, configured to take the quotient of the policy difference and a modulation factor as the noise parameter;
the second processing subunit, configured to take the product of the policy difference and the modulation factor as the noise parameter, where the modulation factor is greater than 1.
Optionally, the parameter optimization unit includes:
a parameter determination subunit, configured to perform parameter optimization on the system parameters of the noise-free policy network according to the noisy strategy, and take the optimized system parameters of the noise-free policy network as original parameters;
a summation optimization subunit, configured to take the sum of the original parameters and the noise parameter as the optimized system parameters of the noisy policy network.
An automatic driving control device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the above automatic driving control method when executing the computer program.
A readable storage medium, where a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the above automatic driving control method are implemented.
In the method provided by the embodiments of the present application, a noisy and a noise-free dual policy network are used for optimized parameter setting; the same vehicle traffic environment state information is input into the noisy and noise-free dual policy networks, the noise-free policy network is taken as the comparison and benchmark, and an action-space disturbance threshold is set for adaptive adjustment of the noise parameter. By adaptively injecting noise into the policy network parameter space and thereby indirectly adding action noise, the exploration of the environment and action space by the deep reinforcement learning algorithm can be effectively improved, the exploration performance and stability of autonomous driving based on deep reinforcement learning are improved, and vehicle decision-making and action selection are guaranteed to fully consider the influence of the environment state and the driving strategy, thereby improving the stability and safety of the autonomous vehicle.
Correspondingly, the embodiments of the present application further provide an automatic driving control apparatus, device, and readable storage medium corresponding to the above automatic driving control method, which have the above technical effects and are not repeated here.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application or in the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flowchart of an automatic driving control method in an embodiment of the present application;
Fig. 2 is a schematic structural diagram of an automatic driving control apparatus in an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an automatic driving control device in an embodiment of the present application.
Detailed Description
The core of the present application is to provide an automatic driving control method, which can improve the stability and safety of autonomous vehicles.
To enable those skilled in the art to better understand the solutions of the present application, the present application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
At present, the sequential decision-making process of autonomous driving based on deep reinforcement learning is as follows: the autonomous vehicle selects an action, such as acceleration/deceleration, steering, lane changing, or braking, according to the current traffic environment state using a driving strategy represented by a neural network, and obtains a reward. The autonomous vehicle adjusts the driving strategy according to the obtained reward and, in combination with the new traffic state, enters the next decision-making step. Through interaction with the environment, the autonomous vehicle makes sequential decisions and learns an optimal driving strategy to achieve safe driving.
To help the autonomous vehicle fully explore the action space, the main method adopted in the related art is to add an exploration noise to the action selected in each decision step, generally in the form of Gaussian distribution sampling. For example, if the policy network generates an action command to speed up to 50 km/h, a random value, such as 10, is selected from the Gaussian distribution, and the finally generated action command is to speed up to 60 km/h (50 + 10). This method of adding exploration noise is very simple; however, such random noise is unrelated to the environment state and the driving strategy, and the autonomous vehicle may make different decisions when facing the same traffic state, which makes exploration even harder to predict and brings safety hazards.
To avoid the instability and unsafety that the added exploration noise brings to automatic driving control, this embodiment proposes an automatic driving control method. Please refer to Fig. 1, which is a flowchart of an automatic driving control method in an embodiment of the present application. The method includes the following steps:
S101: Initialize system parameters of a deep reinforcement learning automatic driving decision-making system.
The deep reinforcement learning automatic driving decision-making system is the system built in this embodiment for generating autonomous driving strategy information. Specifically, it contains two policy networks: a noise-free policy network and a noisy policy network, where the noise-free policy network refers to the policy network without noise (No_Noise_Net), the noisy policy network refers to the policy network with hidden noise (Noise_Net), and a policy network is a network built on the deep reinforcement learning policy parameter space. The deep reinforcement learning algorithm used in the decision-making system is not limited in this embodiment; considering the continuity of the state space and action space of the autonomous driving problem, deep reinforcement learning algorithms such as DDPG, A3C, SAC, and TD3 are available for selection. This embodiment mainly takes the relatively simple DDPG algorithm as an example for illustration; the application of other deep reinforcement learning algorithms can refer to the introduction of this embodiment and is not repeated here. Correspondingly, the system parameters initialized for the deep reinforcement learning automatic driving decision-making system mainly include four items: θ_0 (the initial policy parameters without noise), θ′_0 (the initial policy parameters with implicit noise), ω_0 (the initial network parameters), and the initial policy parameter noise σ_0.
Besides the policy networks, the deep reinforcement learning automatic driving decision-making system also includes an evaluation network (Critic_Net). It should be noted that the specific network structures of the noise-free policy network, the noisy policy network, and the evaluation network are not limited in this embodiment; the corresponding network structures can be built with reference to the related art and are not repeated here.
S102: Acquire vehicle traffic environment state information.
The vehicle traffic environment state information refers to the traffic environment state information around the vehicle to be controlled by automatic driving. The collection process of the vehicle traffic environment state information and the information items it specifically contains (as long as automatic driving control can be realized on their basis) are not limited in this embodiment. For example, on-board sensor devices such as cameras, a global positioning system, inertial measurement units, millimeter-wave radar, and lidar can be used to obtain the driving environment state (such as weather data, traffic lights, and traffic topology information), information such as the positions and running states of the autonomous vehicle and other traffic participants, the raw image data directly obtained by the camera, and the depth map and semantic segmentation map obtained by processing with a deep learning model (such as RefineNet). These driving environment states, the information of the current autonomous vehicle, the positions of other traffic participants, the running states of other traffic participants, and the semantic segmentation map are taken as the vehicle traffic environment state information. This embodiment only takes the above information forms and acquisition methods as examples; the acquisition of other information can refer to the introduction of this embodiment and is not repeated here.
S103: Input the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy.
The vehicle traffic environment state information is input into the noise-free policy network and the noisy policy network respectively. The policy network without noise (the noise-free policy network) and the policy network with hidden noise (the noisy policy network) share one policy function π; that is, the two networks share one set of calculation methods for autonomous driving, and each of them can complete the autonomous driving calculation independently.
For the input vehicle traffic environment state information s_t, the noise-free policy network performs automatic driving strategy generation based on the noise-free policy parameters θ, obtaining the noise-free action a_t = π(a_t|s_t, θ); the noisy policy network performs automatic driving strategy generation based on the implicitly noisy policy parameters θ′, obtaining the noisy action a′_t = π(a′_t|s_t, θ′). It should be noted that the process of invoking the two networks to process the vehicle traffic environment state information can refer to the information processing of existing policy networks and is not limited here.
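As a minimal sketch of the shared-architecture dual policy network described above (the layer sizes and the bounded tanh output are illustrative assumptions, not details fixed by this application):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Shared policy architecture pi used by both No_Noise_Net and Noise_Net."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded continuous actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.body(state)

state_dim, action_dim = 32, 2
no_noise_net = PolicyNet(state_dim, action_dim)   # parameters theta
noise_net = PolicyNet(state_dim, action_dim)      # parameters theta' = theta + noise

s_t = torch.randn(1, state_dim)    # one vehicle traffic environment state
a_t = no_noise_net(s_t)            # noise-free action  a_t  = pi(s_t; theta)
a_noisy_t = noise_net(s_t)         # noisy action       a'_t = pi(s_t; theta')
```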
S104: Adjust, according to the noisy strategy and the noise-free strategy, the noise parameter injected into the noisy policy network within a disturbance threshold range.
The difference between the noisy strategy and the noise-free strategy can indicate the degree to which the noise influences the autonomous driving decision. If the difference is too large, the added noise may be too large, which may interfere severely with normal decision-making and cause the noisy strategy to deviate. For example, if the original strategy indicates accelerating to 50 km/h, adding a large noise may change the strategy to accelerating to 70 km/h, causing overspeed and other factors unfavorable to safe and stable driving. In this embodiment, to avoid the influence of random noise on the stability and correctness of the strategy while keeping the strategy exploratory, a disturbance threshold is set. The disturbance threshold is the allowed range of the added noise; limiting the noise within the disturbance threshold range avoids the influence of excessive noise. Meanwhile, adjusting the noise value according to the noisy strategy and the noise-free strategy allows the generated strategies to feed back on the added noise, and the noise σ_{t+1} injected into the policy parameter space next time is adjusted adaptively based on the set disturbance threshold.
The specific noise value adjustment rule is not limited in this embodiment and can be set according to actual needs. One implementation is as follows:
(1) Calculate the policy difference between the noisy strategy and the noise-free strategy.
The policy difference between the action a_t of the autonomous vehicle without noise (the noise-free strategy) and the action a′_t with implicit noise (the noisy strategy) is calculated. The evaluation criterion of the policy difference is not limited in this embodiment; for example, the distance can be used as the evaluation criterion. Correspondingly, the policy difference between the noisy strategy and the noise-free strategy, that is, the disturbance magnitude of the policy parameter noise on the action, is calculated as d = ‖a_t − a′_t‖.
This embodiment only takes the distance as the evaluation criterion of the policy difference; other evaluation criteria can refer to the introduction of this embodiment and are not repeated here.
(2) Judge whether the policy difference exceeds the disturbance threshold.
The disturbance threshold is a preset policy difference threshold. In the present application, the policy difference between the noisy policy network and the noise-free strategy in actual strategy generation is controlled so that it does not exceed the disturbance threshold, preventing the noise parameter from influencing strategy generation excessively and degrading the accuracy and stability of the generated strategy.
(3) If it exceeds the threshold, take the quotient of the distance and the modulation factor as the noise parameter.
(4) If it does not exceed the threshold, take the product of the distance and the modulation factor as the noise parameter, where the modulation factor is greater than 1.
If the policy difference exceeds the disturbance threshold, the current noise disturbance is too large and the noise parameter needs to be reduced; the reduction strategy proposed in this embodiment is to take the quotient. If the policy difference does not exceed the disturbance threshold, the noise parameter can be increased to enhance the exploration of deep reinforcement learning while avoiding a noise disturbance that exceeds the disturbance threshold; the increase strategy proposed in this embodiment is to take the product.
According to the relation between the policy difference d and the disturbance threshold δ, the parameter noise σ_{t+1} is updated adaptively: σ_{t+1} = d/α if d > δ, and σ_{t+1} = d·α if d ≤ δ,
where the modulation factor α > 1.
This embodiment only takes the above noise parameter adjustment method as an example. Other calculation methods can also be adopted, for example, subtracting a certain value when the disturbance threshold is exceeded and adding a certain value when it is not exceeded; other adjustment methods can refer to the introduction of this embodiment and are not repeated here.
S105: Perform parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network.
After the noise parameter σ_{t+1} to be injected into the policy parameter space has been adjusted adaptively based on the set disturbance threshold, parameter optimization is performed on the system parameters of the noisy policy network according to the noise parameter. Specifically, the evaluation network (Critic_Net) parameters ω, the noise-free policy network (No_Noise_Net) parameters θ, and the hidden-noise network (Noise_Net) parameters θ′ need to be updated.
After the noise parameter is determined, the implementation of updating the evaluation network (Critic_Net) parameters ω and the noise-free policy network (No_Noise_Net) parameters θ can refer to implementations in the related art, which is not limited in this embodiment. To aid understanding, one implementation is introduced here:
(1) The evaluation network (Critic_Net) computes the value function Q(s_t, a′_t) based on the implicitly noisy action a′_t and obtains the reward r_t given by the environment. The network parameters ω are updated by minimizing a loss function, defined as
L(ω) = (1/N) Σ_t ( r_Dt + γ Q′(s_Dt+1, a′_Dt+1) − Q(s_Dt, a′_Dt) )²,
where N is the number of collected samples and γ is the discount factor, usually taken as a constant between 0 and 1. The value function Q′(s_Dt+1, a′_Dt+1) is computed from the data in the replay buffer D, which consists of a series of historical transitions c_Dt = (s_Dt, a′_Dt, r_Dt, s_Dt+1) obtained from prior training, all of which contain noisy actions.
The noise-free policy network (No_Noise_Net) parameters θ are updated through the following policy gradient: J(θ) is the objective function of the policy gradient method, usually expressed as a function of the reward r_t. Maximizing the objective function yields the policy gradient ∇_θJ(θ), and the No_Noise_Net parameters θ are updated through θ ← θ + κ∇_θJ(θ), where κ is a fixed time-step parameter.
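A minimal DDPG-style sketch of these two updates is given below; the replay-buffer handling, the absence of a separate target critic, the learning rates, and the network sizes are simplifying assumptions of this sketch rather than details fixed by this application:

```python
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 32, 2, 0.99       # illustrative sizes; gamma in (0, 1)

# Actor networks: No_Noise_Net (theta) and Noise_Net (theta'), same architecture.
def make_actor() -> nn.Module:
    return nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                         nn.Linear(256, action_dim), nn.Tanh())

no_noise_net, noise_net = make_actor(), make_actor()

class CriticNet(nn.Module):
    """Evaluation network Q(s, a) with parameters omega."""
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
                               nn.Linear(256, 1))
    def forward(self, s, a):
        return self.q(torch.cat([s, a], dim=-1))

critic = CriticNet()
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(no_noise_net.parameters(), lr=1e-4)

def update_step(s, a_noisy, r, s_next):
    """One update of omega (critic) and theta (No_Noise_Net) from a replay batch.

    Shapes: s, s_next: (N, state_dim); a_noisy: (N, action_dim); r: (N, 1).
    Transitions correspond to c_Dt = (s_Dt, a'_Dt, r_Dt, s_Dt+1) from buffer D.
    """
    # Critic: minimize L(omega) = mean (r + gamma * Q'(s_next, a'_next) - Q(s, a'))^2.
    # A slowly-updated target critic is common in DDPG; omitted here for brevity.
    with torch.no_grad():
        target = r + gamma * critic(s_next, noise_net(s_next))
    critic_loss = ((target - critic(s, a_noisy)) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: gradient ascent on J(theta) ~ Q(s, pi(s; theta)) to update theta.
    actor_loss = -critic(s, no_noise_net(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```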
As for the way of optimizing the noisy policy network (Noise_Net) parameters θ′, this embodiment proposes the following optimization: the adaptive noise parameter σ_{t+1} obtained in the above steps is combined with the optimized noise-free policy network (No_Noise_Net) parameters θ by setting θ′ = θ + σ_{t+1}; that is, the sum of the optimized system parameters of the noise-free policy network and the noise parameter is taken as the optimized system parameters θ′ of the noisy policy network. This update method can ensure that the parameters of the noisy policy network remain accurately oriented. This embodiment only takes the above system parameter update method of the noisy policy network as an example; other implementations can refer to the introduction of this embodiment and are not limited here.
S106: Perform automatic driving control according to the driving strategy generated by the optimized noisy policy network.
After the parameters of the noisy policy network are optimized, automatic driving control can be performed according to the optimized noisy policy network. Specifically, the vehicle traffic environment state information collected in real time is transmitted to the optimized noisy policy network, and the driving strategy output by the optimized noisy policy network is taken as the driving strategy to be executed for automatic driving control. How automatic driving control is implemented according to the determined driving strategy to be executed can refer to the introduction of the related art and is not repeated here.
Further, the above steps describe one round of system parameter optimization. To enhance the accuracy of autonomous driving decisions, the optimization can generally be executed several times, and the finally obtained noisy policy network is taken as the network to be invoked for autonomous driving control.
Correspondingly, to improve the optimization effect, before performing automatic driving control according to the driving strategy generated by the optimized noisy policy network, the following steps can be further executed:
(1) Determine the number of executions of the parameter optimization.
(2) Judge whether the number of executions reaches a training-times threshold.
(3) If the number of executions reaches the training-times threshold, execute the step of performing automatic driving control according to the driving strategy generated by the optimized noisy policy network.
(4) If the number of executions does not reach the training-times threshold, execute the step of acquiring vehicle traffic environment state information.
The above is an exit method for the parameter optimization: when the number of executions of the parameter optimization reaches the preset training-times threshold (for example, 10,000), the parameter optimization step is exited, the currently generated system parameters are taken as the optimized parameters, the current deep reinforcement learning automatic driving decision-making system is taken as the optimized network, and the step of performing automatic driving control according to the optimized network is then executed; if the preset training-times threshold has not been reached, new vehicle traffic environment state information is added on the basis of the system parameters generated in the previous round of parameter optimization training, and the steps after step S102 are executed.
In some cases, deviations during parameter optimization may threaten the safety of autonomous driving. To ensure safety, if a driving accident notification is received, the current parameter optimization step can be exited, the step of initializing the system parameters of the deep reinforcement learning automatic driving decision-making system is executed, and the optimization training of the system parameters is performed again on the basis of the re-initialized system parameters. Driving accidents include, for example, the current vehicle colliding or running out of its lane, which is not limited here.
Based on the above introduction, the technical solution provided by the embodiments of the present application uses a noisy and a noise-free dual policy network for optimized parameter setting, inputs the same vehicle traffic environment state information into the noisy and noise-free dual policy networks, takes the noise-free policy network as the comparison and benchmark, and sets an action-space disturbance threshold for adaptive adjustment of the noise parameter. By adaptively injecting noise into the policy network parameter space and thereby indirectly adding action noise, the exploration of the environment and action space by the deep reinforcement learning algorithm can be effectively improved, the exploration performance and stability of autonomous driving based on deep reinforcement learning are improved, and vehicle decision-making and action selection are guaranteed to fully consider the influence of the environment state and the driving strategy, thereby improving the stability and safety of the autonomous vehicle.
Corresponding to the above method embodiments, the embodiments of the present application further provide an automatic driving control apparatus. The automatic driving control apparatus described below and the automatic driving control method described above can be referred to in correspondence with each other.
Referring to Fig. 2, the apparatus includes the following modules:
a parameter initialization unit 110, mainly used to initialize system parameters of a deep reinforcement learning automatic driving decision-making system, where the deep reinforcement learning automatic driving decision-making system includes a noise-free policy network and a noisy policy network;
an environment acquisition unit 120, mainly used to acquire vehicle traffic environment state information;
a strategy generation unit 130, mainly used to input the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy;
a noise adjustment unit 140, mainly used to adjust, according to the noisy strategy and the noise-free strategy, the noise parameter injected into the noisy policy network within a disturbance threshold range;
a parameter optimization unit 150, mainly used to perform parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network;
a driving control unit 160, mainly used to perform automatic driving control according to the driving strategy generated by the optimized noisy policy network.
In a specific implementation of the present application, the noise adjustment unit includes:
a difference calculation subunit, configured to calculate the policy difference between the noisy strategy and the noise-free strategy;
a difference judgment subunit, configured to judge whether the policy difference exceeds the disturbance threshold; if it exceeds, trigger a first processing subunit; if not, trigger a second processing subunit;
the first processing subunit, configured to take the quotient of the policy difference and a modulation factor as the noise parameter;
the second processing subunit, configured to take the product of the policy difference and the modulation factor as the noise parameter, where the modulation factor is greater than 1.
In a specific implementation of the present application, the parameter optimization unit includes:
a parameter determination subunit, configured to perform parameter optimization on the system parameters of the noise-free policy network according to the noisy strategy, and take the optimized system parameters of the noise-free policy network as original parameters;
a summation optimization subunit, configured to take the sum of the original parameters and the noise parameter as the optimized system parameters of the noisy policy network.
Corresponding to the above method embodiments, the embodiments of the present application further provide an automatic driving control device. The automatic driving control device described below and the automatic driving control method described above can be referred to in correspondence with each other.
The automatic driving control device includes:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the automatic driving control method of the above method embodiments when executing the computer program.
Specifically, please refer to Fig. 3, which is a schematic structural diagram of an automatic driving control device provided in this embodiment. The automatic driving control device may differ greatly due to different configurations or performance, and may include one or more central processing units (CPU) 322 (for example, one or more processors) and a memory 332, where the memory 332 stores one or more computer application programs 342 or data 344. The memory 332 may be transient storage or persistent storage. The programs stored in the memory 332 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the data processing device. Furthermore, the central processing unit 322 may be configured to communicate with the memory 332 and execute, on the automatic driving control device 301, the series of instruction operations in the memory 332.
The automatic driving control device 301 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341.
The steps of the automatic driving control method described above can be implemented by the structure of the automatic driving control device.
Corresponding to the above method embodiments, the embodiments of the present application further provide a readable storage medium. The readable storage medium described below and the automatic driving control method described above can be referred to in correspondence with each other.
A readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the automatic driving control method of the above method embodiments.
The readable storage medium may specifically be various readable storage media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Those skilled in the art may further realize that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of functions. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.

Claims (8)

  1. An automatic driving control method, comprising:
    initializing system parameters of a deep reinforcement learning automatic driving decision-making system, wherein the deep reinforcement learning automatic driving decision-making system comprises a noise-free policy network and a noisy policy network;
    acquiring vehicle traffic environment state information;
    inputting the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy;
    adjusting, according to the noisy strategy and the noise-free strategy, a noise parameter injected into the noisy policy network within a disturbance threshold range, wherein the adjusting, according to the noisy strategy and the noise-free strategy, of the noise parameter injected into the noisy policy network within the disturbance threshold range comprises: calculating a policy difference between the noisy strategy and the noise-free strategy; judging whether the policy difference exceeds the disturbance threshold; if it exceeds, taking the quotient of the policy difference and a modulation factor as the noise parameter; and if it does not exceed, taking the product of the policy difference and the modulation factor as the noise parameter, wherein the modulation factor is greater than 1;
    performing parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network; and
    performing automatic driving control according to a driving strategy generated by the optimized noisy policy network.
  2. The automatic driving control method according to claim 1, wherein the performing parameter optimization on the system parameters of the noisy policy network according to the noise parameter comprises:
    performing parameter optimization on the system parameters of the noise-free policy network according to the noisy strategy, and taking the optimized system parameters of the noise-free policy network as original parameters; and
    taking the sum of the original parameters and the noise parameter as the optimized system parameters of the noisy policy network.
  3. The automatic driving control method according to claim 1, further comprising, before the performing automatic driving control according to the driving strategy generated by the optimized noisy policy network:
    determining the number of executions of the parameter optimization;
    judging whether the number of executions reaches a training-times threshold;
    if the number of executions reaches the training-times threshold, executing the step of performing automatic driving control according to the driving strategy generated by the optimized noisy policy network; and
    if the number of executions does not reach the training-times threshold, executing the step of acquiring vehicle traffic environment state information.
  4. The automatic driving control method according to claim 3, further comprising:
    if a driving accident notification is received, executing the step of initializing the system parameters of the deep reinforcement learning automatic driving decision-making system.
  5. An automatic driving control apparatus, comprising:
    a parameter initialization unit, configured to initialize system parameters of a deep reinforcement learning automatic driving decision-making system, wherein the deep reinforcement learning automatic driving decision-making system comprises a noise-free policy network and a noisy policy network;
    an environment acquisition unit, configured to acquire vehicle traffic environment state information;
    a strategy generation unit, configured to input the vehicle traffic environment state information into the noise-free policy network and the noisy policy network respectively for automatic driving strategy generation, to obtain a noise-free strategy and a noisy strategy;
    a noise adjustment unit, configured to adjust, according to the noisy strategy and the noise-free strategy, a noise parameter injected into the noisy policy network within a disturbance threshold range, wherein the noise adjustment unit comprises: a difference calculation subunit, configured to calculate a policy difference between the noisy strategy and the noise-free strategy; a difference judgment subunit, configured to judge whether the policy difference exceeds the disturbance threshold, and if it exceeds, trigger a first processing subunit, or if it does not exceed, trigger a second processing subunit; the first processing subunit, configured to take the quotient of the policy difference and a modulation factor as the noise parameter; and the second processing subunit, configured to take the product of the policy difference and the modulation factor as the noise parameter, wherein the modulation factor is greater than 1;
    a parameter optimization unit, configured to perform parameter optimization on the system parameters of the noisy policy network according to the noise parameter, to generate an optimized noisy policy network; and
    a driving control unit, configured to perform automatic driving control according to a driving strategy generated by the optimized noisy policy network.
  6. The automatic driving control apparatus according to claim 5, wherein the parameter optimization unit comprises:
    a parameter determination subunit, configured to perform parameter optimization on the system parameters of the noise-free policy network according to the noisy strategy, and take the optimized system parameters of the noise-free policy network as original parameters; and
    a summation optimization subunit, configured to take the sum of the original parameters and the noise parameter as the optimized system parameters of the noisy policy network.
  7. An automatic driving control device, comprising:
    a memory, configured to store a computer program; and
    a processor, configured to implement the steps of the automatic driving control method according to any one of claims 1 to 4 when executing the computer program.
  8. A readable storage medium, wherein a computer program is stored on the readable storage medium, and when the computer program is executed by a processor, the steps of the automatic driving control method according to any one of claims 1 to 4 are implemented.
PCT/CN2021/121903 2021-06-01 2021-09-29 一种自动驾驶控制方法、装置、设备及可读存储介质 WO2022252457A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/039,271 US11887009B2 (en) 2021-06-01 2021-09-29 Autonomous driving control method, apparatus and device, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110606769.8A CN113253612B (zh) 2021-06-01 2021-06-01 一种自动驾驶控制方法、装置、设备及可读存储介质
CN202110606769.8 2021-06-01

Publications (1)

Publication Number Publication Date
WO2022252457A1 true WO2022252457A1 (zh) 2022-12-08

Family

ID=77185702

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121903 WO2022252457A1 (zh) 2021-06-01 2021-09-29 一种自动驾驶控制方法、装置、设备及可读存储介质

Country Status (3)

Country Link
US (1) US11887009B2 (zh)
CN (1) CN113253612B (zh)
WO (1) WO2022252457A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113253612B (zh) * 2021-06-01 2021-09-17 苏州浪潮智能科技有限公司 一种自动驾驶控制方法、装置、设备及可读存储介质
CN114444718B (zh) * 2022-01-26 2023-03-24 北京百度网讯科技有限公司 机器学习模型的训练方法、信号控制方法和装置
CN117376661B (zh) * 2023-12-06 2024-02-27 山东大学 一种基于神经网络的细粒度视频流自适应调节系统及方法
CN118343164B (zh) * 2024-06-17 2024-10-01 北京理工大学前沿技术研究院 一种自动驾驶车辆行为决策方法、系统、设备及存储介质
CN118393900B (zh) * 2024-06-26 2024-08-27 山东海量信息技术研究院 自动驾驶决策控制方法、装置、系统、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180009445A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Online learning and vehicle control method based on reinforcement learning without active exploration
CN109492763A (zh) * 2018-09-17 2019-03-19 同济大学 一种基于强化学习网络训练的自动泊车方法
CN109657800A (zh) * 2018-11-30 2019-04-19 清华大学深圳研究生院 基于参数噪声的强化学习模型优化方法及装置
CN110322017A (zh) * 2019-08-13 2019-10-11 吉林大学 基于深度强化学习的自动驾驶智能车轨迹跟踪控制策略
CN112249032A (zh) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 一种自动驾驶决策方法、系统、设备及计算机存储介质
CN113253612A (zh) * 2021-06-01 2021-08-13 苏州浪潮智能科技有限公司 一种自动驾驶控制方法、装置、设备及可读存储介质

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018211139A1 (en) * 2017-05-19 2018-11-22 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
WO2018215344A1 (en) * 2017-05-20 2018-11-29 Deepmind Technologies Limited Noisy neural network layers
US11669769B2 (en) * 2018-12-13 2023-06-06 Diveplane Corporation Conditioned synthetic data generation in computer-based reasoning systems
US20200033869A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers that implement autonomous driver agents and a policy server for serving policies to autonomous driver agents for controlling an autonomous vehicle
KR102267316B1 (ko) * 2019-03-05 2021-06-21 네이버랩스 주식회사 심층 강화 학습에 기반한 자율주행 에이전트의 학습 방법 및 시스템
US11699062B2 (en) * 2019-09-06 2023-07-11 Honda Motor Co., Ltd. System and method for implementing reward based strategies for promoting exploration
CN111845741B (zh) * 2020-06-28 2021-08-03 江苏大学 一种基于分层强化学习的自动驾驶决策控制方法及系统
CN111966118B (zh) * 2020-08-14 2021-07-20 哈尔滨工程大学 一种rov推力分配与基于强化学习的运动控制方法
CN112099496B (zh) * 2020-09-08 2023-03-21 苏州浪潮智能科技有限公司 一种自动驾驶训练方法、装置、设备及介质
CN112255931B (zh) * 2020-10-10 2024-04-16 万物镜像(北京)计算机系统有限公司 数据处理方法、装置、存储介质及电子设备
CN112462792B (zh) * 2020-12-09 2022-08-09 哈尔滨工程大学 一种基于Actor-Critic算法的水下机器人运动控制方法
CN112580148B (zh) * 2020-12-20 2022-11-18 东南大学 基于深度强化学习的重型营运车辆防侧翻驾驶决策方法
US20220261630A1 (en) * 2021-02-18 2022-08-18 International Business Machines Corporation Leveraging dynamical priors for symbolic mappings in safe reinforcement learning
US20220309383A1 (en) * 2021-03-24 2022-09-29 International Business Machines Corporation Learning of operator for planning problem

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180009445A1 (en) * 2016-07-08 2018-01-11 Toyota Motor Engineering & Manufacturing North America, Inc. Online learning and vehicle control method based on reinforcement learning without active exploration
CN109492763A (zh) * 2018-09-17 2019-03-19 同济大学 一种基于强化学习网络训练的自动泊车方法
CN109657800A (zh) * 2018-11-30 2019-04-19 清华大学深圳研究生院 基于参数噪声的强化学习模型优化方法及装置
CN110322017A (zh) * 2019-08-13 2019-10-11 吉林大学 基于深度强化学习的自动驾驶智能车轨迹跟踪控制策略
CN112249032A (zh) * 2020-10-29 2021-01-22 浪潮(北京)电子信息产业有限公司 一种自动驾驶决策方法、系统、设备及计算机存储介质
CN113253612A (zh) * 2021-06-01 2021-08-13 苏州浪潮智能科技有限公司 一种自动驾驶控制方法、装置、设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MATTHIAS PLAPPERT; REIN HOUTHOOFT; PRAFULLA DHARIWAL; SZYMON SIDOR; RICHARD Y. CHEN; XI CHEN; TAMIM ASFOUR; PIETER ABBEEL; MARCIN : "Parameter Space Noise for Exploration", ARXIV.ORG, 6 June 2017 (2017-06-06), XP080767963 *

Also Published As

Publication number Publication date
US20230351200A1 (en) 2023-11-02
US11887009B2 (en) 2024-01-30
CN113253612B (zh) 2021-09-17
CN113253612A (zh) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2022252457A1 (zh) 一种自动驾驶控制方法、装置、设备及可读存储介质
WO2022052406A1 (zh) 一种自动驾驶训练方法、装置、设备及介质
CN111898211B (zh) 基于深度强化学习的智能车速度决策方法及其仿真方法
WO2022148282A1 (zh) 一种车辆轨迹规划方法、装置、存储介质及设备
US12045061B2 (en) Multi-AGV motion planning method, device and system
CN107229973B (zh) 一种用于车辆自动驾驶的策略网络模型的生成方法及装置
CN114358128B (zh) 一种训练端到端的自动驾驶策略的方法
WO2021244207A1 (zh) 训练驾驶行为决策模型的方法及装置
US20210271988A1 (en) Reinforcement learning with iterative reasoning for merging in dense traffic
WO2023082726A1 (zh) 换道策略生成方法和装置、计算机存储介质、电子设备
CN115973179A (zh) 模型训练方法、车辆控制方法、装置、电子设备及车辆
CN117406756B (zh) 一种运动轨迹参数的确定方法、装置、设备和存储介质
CN113264064B (zh) 用于交叉路口场景的自动驾驶方法及相关设备
Nan et al. Interaction-aware planning with deep inverse reinforcement learning for human-like autonomous driving in merge scenarios
CN110390398A (zh) 在线学习方法
CN116653957A (zh) 一种变速变道方法、装置、设备及存储介质
Elallid et al. Vehicles control: Collision avoidance using federated deep reinforcement learning
CN117950395A (zh) 轨迹规划方法及装置、移动工具、存储介质
Lin et al. Connectivity guaranteed multi-robot navigation via deep reinforcement learning
Wang et al. An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle
Yang et al. Deep Reinforcement Learning Lane-Changing Decision Algorithm for Intelligent Vehicles Combining LSTM Trajectory Prediction
Zangirolami et al. Dealing with uncertainty: Balancing exploration and exploitation in deep recurrent reinforcement learning
Molaie et al. Auto-Driving Policies in Highway based on Distributional Deep Reinforcement Learning
Wang et al. Research on decision making of intelligent vehicle based on composite priority experience replay
Shi et al. Research on Autonomous Driving Decision Based on Improved Deep Deterministic Policy Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21943810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21943810

Country of ref document: EP

Kind code of ref document: A1