CN116176572A - Automobile emergency collision avoidance control method based on DQN deep reinforcement learning - Google Patents


Info

Publication number
CN116176572A
Authority
CN
China
Prior art keywords
automobile
training
braking
action
longitudinal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310168297.1A
Other languages
Chinese (zh)
Inventor
卢晓晖
郑馨義
吕新展
李绍松
李佳纯
董旭升
张鹏飞
张袅娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority claimed from CN202310168297.1A
Publication of CN116176572A


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/09 Taking automatic action to avoid collision, e.g. braking and steering
    • B60W30/095 Predicting travel path or likelihood of collision
    • B60W30/0953 Predicting travel path or likelihood of collision, the prediction being responsive to vehicle dynamic parameters
    • B60W30/0956 Predicting travel path or likelihood of collision, the prediction being responsive to traffic or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/10 Longitudinal speed
    • B60W2554/00 Input parameters relating to objects
    • B60W2554/40 Dynamic objects, e.g. animals, windblown objects
    • B60W2554/402 Type
    • B60W2554/4029 Pedestrians
    • B60W2554/80 Spatial relation or speed relative to objects
    • B60W2554/802 Longitudinal distance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides an automobile emergency collision avoidance control method based on DQN deep reinforcement learning, belonging to the field of new energy automobile braking. It aims to solve the problem that an AEB system relies only on longitudinal braking to avoid collision in an emergency, and to make braking actions more targeted when facing different obstacles. The method comprises subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training; it improves algorithm training efficiency, improves the safety of the automobile, and makes the avoidance strategy of the automobile more human-like.

Description

Automobile emergency collision avoidance control method based on DQN deep reinforcement learning
Technical field:
The invention belongs to the field of new energy automobile braking, and particularly relates to an automobile emergency collision avoidance control method based on DQN deep reinforcement learning.
Background art:
Automatic emergency braking (Autonomous Emergency Braking, AEB) is an active safety function in which environment-perception sensors detect the risk of collision with a vehicle, pedestrian or other traffic participant ahead, and the system automatically triggers the actuator to apply braking so as to avoid the collision or mitigate its severity.
Most researchers study the automatic emergency braking function under urban conditions with traditional methods such as rule-based methods, PID-based methods, fuzzy control and model predictive control. However, these methods suffer from extensive manual parameter tuning, low control precision, dependence on model accuracy, high model complexity, large computational load and low computation speed. They are also limited in handling all scenarios that may occur on real roads, adapt poorly to complex traffic environments, lack robustness, rely heavily on manual experience to formulate rules, and, using only environment-perception sensor information, cannot take different braking or steering actions according to the category of the obstacle. An end-to-end structure based on deep reinforcement learning can obtain control actions such as throttle, brake and steering angle directly from perception inputs, which greatly reduces the workload and parameter-tuning cost of building each algorithm layer and improves the generalization ability of automated driving; corresponding target requirements can also be added to the reward function so that multiple objectives are optimized. Deep reinforcement learning combines the powerful abstract feature perception of deep neural networks, can learn adaptively in changing environments, has a certain adaptability to scenarios outside the training environment, and, once trained, computes faster than traditional algorithms. At present, deep reinforcement learning performs well in fields such as object recognition, automatic control and games, and combining it with intelligent driving is a popular direction of autonomous driving research.
Currently, among commercialized automated driving technologies, the AEB system is a longitudinal collision avoidance assistance system, i.e. it reduces the risk of collision in the longitudinal direction of travel by early warning or active braking control. However, investigation shows that the main disadvantage of relying solely on longitudinal braking is that collisions cannot be avoided when the distance to the obstacle is smaller than the total braking distance, and that at higher driving speeds steering-based collision avoidance maneuvers are more effective than braking-based ones, because under such conditions the critical collision avoidance distance required for steering is shorter than that required for braking control; that is, even when emergency braking alone cannot avoid a collision, the vehicle can still effectively avoid it through a steering control strategy. According to prior studies, as the time to collision (Time To Collision, TTC) decreases from 2.5 s to 1.5 s, the proportion of drivers who avoid collision by braking alone decreases from 72% to 43%, the proportion who avoid collision by steering alone decreases from 14% to 0%, and the proportion who avoid collision by combined steering and braking increases from 14% to 57%.
The invention uses deep reinforcement learning to let the intelligent automobile adaptively learn an automatic emergency collision avoidance control strategy. The state input combines high-dimensional image information with one-dimensional sensor information: at the same position and speed, pedestrians and stationary obstacle vehicles have different shapes and colors in the image, so the automobile can better distinguish the type of the obstacle ahead, whereas with sensor information alone a car and a person would yield the same data. The state information in the environment is acquired and processed by a deep neural network that fits the reinforcement learning value function, and the trial-and-error idea of reinforcement learning is used to fully explore the environment, combining the advantages of reinforcement learning and deep learning and compensating their weaknesses. Steering actions are added on top of longitudinal braking, so that when the system detects a dangerous area it tries to improve the collision situation by steering or changing lanes; the automobile thus makes different driving decisions in different emergency situations, the emergency collision avoidance system becomes more human-like, and this is of great significance for improving vehicle safety.
Summary of the invention:
In order to give the emergency collision avoidance system of a new energy automobile the characteristics of safety and human-like behavior, the invention provides an automobile emergency collision avoidance control method based on DQN deep reinforcement learning. The method uses deep reinforcement learning to control speed and steering when a collision is imminent, reducing road collision accidents. The proposed autonomous braking system uses cameras and sensors to obtain obstacle information and automatically decides at each time step whether to brake or steer when there is a risk of collision. The braking control design problem is formulated as finding an optimal policy in a Markov Decision Process (MDP) model, where the state is given by the image together with the vehicle speed, the relative distance to the obstacle and the relative speed, and the action space contains five actions: 1) no braking, 2) weak braking, 3) strong braking, 4) braking and turning left, 5) braking and turning right. The braking and steering control strategy is learned through computer simulation using the deep reinforcement learning method called Deep Q Network (DQN). To derive an ideal braking and steering strategy, a multi-task reinforcement learning method is proposed, and a multi-objective reward function balances the damage to the obstacle in the event of an accident against the reward obtained for leaving the risk area as soon as possible. The DQN is trained in scenarios where the vehicle encounters pedestrians crossing an urban road or a stationary obstacle.
The technical scheme adopted for solving the technical problems is as follows:
The automobile emergency collision avoidance control method based on DQN deep reinforcement learning comprises subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training. The subtask design defines different training tasks for different time-to-collision (TTC) intervals. The state and action space design first splices high-dimensional semantic segmentation image information with one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) to form the state input, and then designs different action spaces for the different training tasks. The multi-objective reward function mainly guides the automobile to complete the emergency avoidance task while taking efficiency and comfort into account. Finally, DQN parameter setting and training configures the hyper-parameters according to the actual training task, then trains cyclically in the Carla simulation platform, iteratively optimizing the parameters of the neural network so that the automobile learns to execute correct actions in different emergency states: when an obstacle appears ahead, the ego vehicle brakes or performs automatic emergency steering to avoid rear-end or other collision accidents. Following the multi-task reinforcement learning method, two policy networks and a multi-objective reward function are established, which greatly improves training efficiency; at the same time, introducing images as part of the state input allows the vehicle to take different actions when facing different obstacles, making it more human-like.
The method comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process:
The emergency collision avoidance process has the Markov property, i.e. the speed, acceleration, position information, image information and so on of the vehicle at the next moment depend only on the current state and not on the history. Braking or steering at the current moment affects the state of the vehicle at the next moment. The Markov decision process is built around the interaction between an agent and the environment and contains three elements: states, actions and the reward function. The agent perceives the current system state and acts on the environment according to its policy, causing the environment state to change and a reward to be returned. The states, actions, state transition probabilities, rewards and discount factor form the five-tuple (S, A, P, R, γ) of the reinforcement learning Markov decision process. The invention models the emergency collision avoidance process as a Markov decision process; the optimal decision is learned over a discrete action space by maximizing the accumulated reward, realizing the joint optimization of safety and efficiency.
Step 1.2, designing a longitudinal brake control training task:
In this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is trained to take only longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles.
Step 1.3, designing a transverse and longitudinal combined brake control training task:
In this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 0.5 s and 1.5 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces an emergency scenario: when it has driven into the dangerous area, it is trained to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary vehicle, so that the braking actions of the host vehicle are more targeted to the type of obstacle.
Step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
Firstly, the viewing angle of the semantic segmentation camera is adjusted to a top-down view and the horizontal position of the camera is adjusted to obtain a semantic segmentation bird's-eye view with the host vehicle at the center of the picture. Semantic segmentation assigns each object a pixel category, and each category corresponds to one color in a palette; every object color in a normal semantic segmentation image has its own numeric label, so the final output is a picture made up of blocks of different colors. In the invention, however, only the numeric labels (and thus colors) of the lane lines, the host vehicle and the obstacles (pedestrians and stationary vehicles) are kept distinct, and the numeric labels of everything else are changed to the same number, so that only 5 different colors remain in the preprocessed image: the road, the lane lines, the vehicles (host vehicle and stationary cars), the pedestrians, and everything else. The purpose is to simplify the colors of objects of little relevance around the automobile and to emphasize the automobile itself and the road, lane lines and pedestrians.
Step 2.1.2, information splicing:
After the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network; the convolutional network π(z, p) consists of three convolutional layers and one fully connected layer, where z is the high-dimensional image information and p is the one-dimensional information (host vehicle speed V, relative distance d, relative speed V_rel). The fully connected layer FC1 processes the flattened output of the third convolutional layer Conv3, and the resulting one-dimensional feature matrix of the image is then spliced with the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) through the Cat concatenation function, giving a new one-dimensional matrix. This is the state input of the DQN algorithm and serves as the input of the subsequent fully connected network that outputs the actions.
Step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
Under this task the host vehicle has a sufficient longitudinal braking distance, so the action space contains only the longitudinal braking actions of the vehicle (no braking, weak braking, strong braking). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network, which has two fully connected layers; the number of output neurons is 3, corresponding to the three actions, and the Q values of the three actions are output.
Step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
In most emergency scenes of this task the car faces a critical situation, so not only the longitudinal braking actions of the car are trained but also steering-braking actions; the action space contains the longitudinal and lateral actions of the car (no braking, weak braking, strong braking, braking and turning right, braking and turning left). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network, which has two fully connected layers; the number of output neurons is 5, corresponding to the five actions, and the Q values of the five actions are output.
Two different Q networks are thus built for the two agents with different tasks; when the automobile faces different emergency situations, the system switches between the real Q networks to output actions and complete the collision avoidance task.
Step 3, designing a multi-objective reward function, whose objective is to make the learned strategy of the automobile balance safety, efficiency and comfort;
in order to achieve stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance, the invention designs the reward function using a division into an ideal parking area and a dangerous area. When the host vehicle is outside the dangerous area it is encouraged to brake to a stop; when it is inside the dangerous area it is encouraged to steer and change lanes; when the distance falls below the dangerous area, a collision occurs with high probability. The reward function is designed as follows:
Step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
the task is aimed at emergency situations with TTC between 1.5 and 4 seconds, and the invention adopts ideal parking areas (3-6 m) and dangerous areas (0-3 m) to design decision-making reward functions, wherein the reward functions are designed as follows:
(Formula (1) is given as an image in the original publication.)
where V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the start of each episode, and ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment. When the automobile satisfies any judgment condition of formula (1) except the fourth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. The parking area is defined for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for ride comfort, preventing excessive changes of speed.
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
This task targets emergency situations with TTC between 0.5 s and 1.5 s; the dangerous area is modified to (0.5-3 m) and the ideal parking area is (3-6 m). In order to train the automobile, once it has driven into the dangerous area, to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary car, the reward function is designed as follows:
(Formula (2) is given as an image in the original publication.)
where V is the current speed of the car, d is the longitudinal distance between the car and the obstacle ahead, and k1, k2 are the weighting coefficients for encouraging and penalizing steering actions respectively: k1 = 1 when the obstacle is a pedestrian and k1 = 10 when the obstacle is a stationary car, so that when no collision has occurred in the dangerous area the car is encouraged more strongly to steer when facing a stationary car and to brake longitudinally when facing a pedestrian, while k2 = -10; V_init is the initial speed of the host vehicle at the start of each episode, ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment, and d_lat is the lateral distance between the car and the lane center line. The width of the car in the Carla simulation software is 1.8 m, and the car is considered to have collided with the stationary car or the pedestrian, ending the current episode, when the lateral distance is smaller than 2 m; therefore, when the car satisfies any judgment condition of formula (2) except the fifth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. A constraint on the lateral distance is added to the reward of the dangerous area, so that compared with a pedestrian the car is encouraged more strongly to take steering actions for emergency avoidance when facing a stationary car. The ideal parking area is 3 m to 6 m, within which the car is encouraged to stop by longitudinal braking; if the car takes a steering lane-change action in this area, a larger negative reward is given. In addition, taking a steering action more than 6 m from the obstacle is dangerous and is also given a negative reward.
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
Step 4.1, DQN parameter setting: several important hyper-parameters are configured in the DQN algorithm. The discount factor γ adjusts the relative weight of near-term and long-term returns in reinforcement learning; its value lies in (0, 1) and should be as large as possible provided the algorithm still converges, and it is taken as 0.95 in this model. The learning rate lr of the fully connected neural network is set in this model. The buffer size is the capacity of the experience replay pool used in DQN to improve data efficiency; its value is usually chosen in [10000, 100000], with the specific value selected according to the performance of the computer. The exploration time proportion and the final epsilon jointly determine the balance between exploration and exploitation of the DQN: epsilon falls from 1 to the final epsilon over the exploration period, and in this model the final epsilon is 0.05, with epsilon decaying slowly from 1 to 0.05 over 10000 episodes.
Step 4.2, iterative optimization training of a longitudinal braking collision avoidance control task:
Initialize the hyper-parameters and carry out cyclic training. In each training episode, the real Q1 network receives the state input and outputs the Q values of the three actions, and the agent selects an action using an ε-greedy policy, obtains the reward and reaches the next state. The state, action, reward, next state and termination flag are packed into a five-tuple (S_t, A_t, R_t, S_(t+1), done) and stored as one experience in the experience replay pool. A batch of experiences is then randomly drawn from the pool; the mean square error between the Q value output by the target Q1 network and the Q value output by the real Q1 network is the loss LOSS of the neural network, and the optimization objective is to minimize this loss so that the output of the real Q1 network approaches the Q value output by the target Q1 network as closely as possible. The next action is then executed, and training proceeds cyclically in this way. The goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle under the constraints of safety, efficiency and comfort.
Step 4.3, iterative optimization training of a transverse and longitudinal combined braking collision avoidance control task:
Initialize the hyper-parameters and carry out cyclic training. In each training episode, the real Q2 network receives the state input and outputs the Q values of the five actions, and the agent selects an action using an ε-greedy policy, obtains the reward and reaches the next state. The state, action, reward, next state and termination flag are packed into a five-tuple (S_t, A_t, R_t, S_(t+1), done) and stored as one experience in the experience replay pool. A batch of experiences is then randomly drawn from the pool; the mean square error between the Q value output by the target Q2 network and the Q value output by the real Q2 network is the loss LOSS of the neural network, and the optimization objective is to minimize this loss so that the output of the real Q2 network approaches the Q value output by the target Q2 network as closely as possible. The next action is then executed, and training proceeds cyclically in this way. The goal of the algorithm is likewise to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle under the constraints of safety, efficiency and comfort.
Step 4.4, the neural network parameters of the real networks Q1 and Q2 from the two tasks are saved as an online neural network controller. When TTC is between 1.5 s and 4 s, the real network Q1 is selected as the controller and outputs longitudinal braking actions; when TTC is between 0.5 s and 1.5 s, the real network Q2 is selected as the controller and outputs lateral or longitudinal control actions to accomplish the collision avoidance task.
The beneficial effects of the invention are as follows: the emergency collision avoidance control method based on the DQN deep reinforcement learning algorithm can improve the safety and reliability of the braking effect in complex, changeable environments with multiple constraints, and reduce the number of traffic accidents and the losses they cause. Starting from end-to-end control and avoiding formal modeling, the invention forms an interactive learning method for a human-like automatic emergency braking strategy oriented to multi-scenario driving environments and carried by an agent, which can take the appropriate driving behavior according to the current state of the vehicle and thus effectively guarantees the adaptive capability of the vehicle braking strategy. On the technical side, steering actions are added to the traditional AEB system, which supports the development of intelligent driving safety technology, guarantees the safety of intelligent driving, improves the braking safety of new energy automobiles, and provides a reference for the development of intelligent braking of new energy automobiles.
Drawings
FIG. 1 is a flow chart of the steps of the invention.
Fig. 2 is a schematic diagram of an experimental scenario of the present invention.
Fig. 3 is a structural diagram of a neural network of the present invention.
Fig. 4 is a control flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the embodiments.
The invention provides an emergency collision avoidance control method based on DQN deep reinforcement learning, which uses the combination of high-dimensional image information and one-dimensional sensor information as the state input together with a training method based on a multi-objective reward function and multi-task division. Its advantage is that, when an obstacle appears ahead and the automobile cannot achieve the collision avoidance target by braking alone, it can complete automatic emergency steering and avoid rear-end or other collision accidents. As shown in fig. 1, the method specifically comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process:
The emergency collision avoidance process has the Markov property, i.e. the speed, acceleration, position information, image information and so on of the vehicle at the next moment depend only on the current state and not on the history. Braking or steering at the current moment affects the state of the vehicle at the next moment. The Markov decision process is built around the interaction between an agent and the environment and contains three elements: states, actions and the reward function. The agent perceives the current system state and acts on the environment according to its policy, causing the environment state to change and a reward to be returned. The states, actions, state transition probabilities, rewards and discount factor form the five-tuple (S, A, P, R, γ) of the reinforcement learning Markov decision process. The invention models the emergency collision avoidance process as a Markov decision process; the optimal decision is learned over a discrete action space by maximizing the accumulated reward, realizing the joint optimization of safety and efficiency.
Step 1.2, designing a longitudinal brake control training task:
In this task, as shown in fig. 2, the stop line is set 3 m from the obstacle and less than 3 m is defined as the dangerous area; the ideal stop line is set 6 m from the obstacle and 3 m to 6 m is defined as the ideal parking area; the pedestrian crossing trigger point Ptrig is the position at which, when the automobile passes it, the pedestrian starts moving across the road. The initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is trained to take only longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles.
Step 1.3, designing a transverse and longitudinal combined brake control training task:
In this task, the stop line is set 0.5 m from the obstacle, and 0.5 m to 3 m is defined as the dangerous area; the ideal stop line is set 6 m from the obstacle and 3 m to 6 m is defined as the ideal parking area; the pedestrian crossing trigger point Ptrig is the position at which, when the automobile passes it, the pedestrian starts moving across the road. The initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 0.5 s and 1.5 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces an emergency scenario: when it has driven into the dangerous area, it is trained to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary vehicle, so that the braking actions of the host vehicle are more targeted to the type of obstacle.
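The scenario randomization described in steps 1.2 and 1.3 can be summarized in a short sketch. The following Python snippet is illustrative only: the function and field names are assumptions, while the numeric ranges (initial speed 2.67-16.67 m/s, pedestrian speed 1-3 m/s, spawn distance 5·V_init, trigger distance TTC·V_init, and TTC in [1.5, 4] s for the longitudinal task or [0.5, 1.5] s for the combined task) are taken from the two training tasks above.

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    v_init: float            # initial host-vehicle speed (m/s)
    obstacle: str            # "pedestrian" or "stationary_car"
    v_ped: float             # pedestrian crossing speed (m/s), 0 for a stationary car
    spawn_distance: float    # obstacle placed 5 * v_init meters ahead of the host vehicle
    ttc: float               # time to collision at which the obstacle is triggered (s)
    trigger_distance: float  # longitudinal distance ttc * v_init at which the pedestrian starts / car appears

def sample_scenario(task: str) -> Scenario:
    """Sample one training episode for the 'longitudinal' or 'combined' task (illustrative)."""
    v_init = random.uniform(2.67, 16.67)
    obstacle = random.choice(["pedestrian", "stationary_car"])
    v_ped = random.uniform(1.0, 3.0) if obstacle == "pedestrian" else 0.0
    ttc_low, ttc_high = (1.5, 4.0) if task == "longitudinal" else (0.5, 1.5)
    ttc = random.uniform(ttc_low, ttc_high)
    return Scenario(v_init, obstacle, v_ped, 5.0 * v_init, ttc, ttc * v_init)
```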
Step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
Firstly, the viewing angle of the semantic segmentation camera is adjusted to a top-down view and the horizontal position of the camera is adjusted to obtain a semantic segmentation bird's-eye view with the host vehicle at the center of the picture. Semantic segmentation assigns each object a pixel category, and each category corresponds to one color in a palette; every object color in a normal semantic segmentation image has its own numeric label, so the final output is a picture made up of blocks of different colors. In Carla, for example, the numeric label of a building is 1, the numeric label of a pedestrian is 4, the numeric label of a street lamp is 5, and so on. In the invention, however, only the numeric labels (colors) of the lane lines, the host vehicle and the obstacles (pedestrians and stationary vehicles) are kept distinct: the numeric label of the road is changed to 1, the numeric label of the lane lines to 2, the numeric label of vehicles to 5, the numeric label of pedestrians to 8, and the numeric label of everything else to 4, i.e. everything else gets the same color. Only 5 different colors therefore remain in the preprocessed image: the road, the lane lines, the vehicles, the pedestrians and everything else. The purpose is to simplify the colors of objects of little relevance around the automobile and to emphasize the automobile itself and the road, lane lines and pedestrians.
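A minimal sketch of the label remapping described in step 2.1.1, assuming the semantic segmentation camera returns an integer label map as a NumPy array. The target labels (road = 1, lane line = 2, other = 4, vehicle = 5, pedestrian = 8) follow the text; the source Carla tag IDs used below are assumptions based on the commonly used Carla palette and should be checked against the simulator version.

```python
import numpy as np

# Source Carla semantic tags (assumed, check against the simulator version in use):
# 4 = pedestrian, 6 = road line, 7 = road, 10 = vehicle.
CARLA_PEDESTRIAN, CARLA_ROAD_LINE, CARLA_ROAD, CARLA_VEHICLE = 4, 6, 7, 10

# Target labels after preprocessing (step 2.1.1):
# road -> 1, lane line -> 2, everything else -> 4, vehicle -> 5, pedestrian -> 8.
REMAP = {CARLA_ROAD: 1, CARLA_ROAD_LINE: 2, CARLA_VEHICLE: 5, CARLA_PEDESTRIAN: 8}
OTHER_LABEL = 4

def preprocess_labels(seg: np.ndarray) -> np.ndarray:
    """Collapse a Carla semantic-segmentation label map to the 5 classes kept by the method."""
    out = np.full_like(seg, OTHER_LABEL)
    for src, dst in REMAP.items():
        out[seg == src] = dst
    return out
```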
Step 2.1.2, information splicing:
After the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network; the convolutional network π(z, p) consists of three convolutional layers and one fully connected layer, where z is the high-dimensional image information and p is the one-dimensional information (host vehicle speed V, relative distance d, relative speed V_rel). As shown in the image preprocessing part of fig. 3, the three convolutional layers all use 5×5 convolution kernels with stride = 2 and the ReLU activation function; they are followed by the first fully connected layer FC1. FC1 processes the flattened output of the third convolutional layer Conv3 into an image feature of size 1×256, which is then spliced with the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) through the Cat concatenation function to obtain a new one-dimensional matrix of size 1×259. This is the state input of the DQN algorithm and serves as the input of the subsequent fully connected network that outputs the actions.
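A sketch of the state encoder of step 2.1.2, assuming a PyTorch implementation: three 5×5 convolutions with stride 2 and ReLU, a fully connected layer FC1 producing the 1×256 image feature, and concatenation with the three sensor values (V, d, V_rel) to form the 1×259 state. The input resolution and the convolution channel counts are not given in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes (segmentation image, sensor vector) into the 1 x 259 DQN state (illustrative)."""

    def __init__(self, in_channels: int = 1, img_size: int = 84):  # resolution/channels are assumptions
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),  # Conv1
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),           # Conv2
            nn.Conv2d(64, 64, kernel_size=5, stride=2), nn.ReLU(),           # Conv3
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size of Conv3's output
            n_flat = self.conv(torch.zeros(1, in_channels, img_size, img_size)).shape[1]
        self.fc1 = nn.Linear(n_flat, 256)  # FC1: flattened Conv3 output -> 1 x 256 image feature

    def forward(self, image: torch.Tensor, sensors: torch.Tensor) -> torch.Tensor:
        # sensors = (host speed V, relative distance d, relative speed V_rel), shape (batch, 3)
        img_feat = self.fc1(self.conv(image))
        return torch.cat([img_feat, sensors], dim=1)  # Cat concatenation -> shape (batch, 259)
```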
Step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
Under this task the host vehicle has a sufficient longitudinal braking distance, so the action space contains only the longitudinal braking actions of the vehicle (no braking, weak braking, strong braking). As shown in the lower part of fig. 3, the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network. The network has two fully connected layers: the first fully connected layer has 259 input neurons and 128 output neurons, and the second fully connected layer has 128 input neurons, uses the LeakyReLU activation function and has 3 output neurons, which correspond to the three actions and output their Q values.
Step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
In most emergency situations of this task, the longitudinal braking distance of the host vehicle is insufficient, so not only the longitudinal braking actions of the vehicle are trained but also steering-braking actions; the action space contains the longitudinal and lateral actions of the vehicle (no braking, weak braking, strong braking, braking and turning right, braking and turning left). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network. The network has two fully connected layers: the first fully connected layer FC2 has 259 input neurons and 128 output neurons, and the second fully connected layer FC3 has 128 input neurons, uses the LeakyReLU activation function and has 5 output neurons, which correspond to the five actions and output their Q values.
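A sketch of the two fully connected Q heads of steps 2.2.1 and 2.2.2, again assuming PyTorch: 259 inputs, a 128-unit hidden layer with LeakyReLU activation, and 3 outputs for the longitudinal task (real Q1 network) or 5 outputs for the combined task (real Q2 network).

```python
import torch.nn as nn

class QHead(nn.Module):
    """Fully connected Q head: 259 -> 128 (LeakyReLU) -> n_actions (illustrative)."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(259, 128),        # first fully connected layer (259 -> 128)
            nn.LeakyReLU(),
            nn.Linear(128, n_actions),  # second fully connected layer: Q values of the actions
        )

    def forward(self, state):
        return self.net(state)

# Q1: no braking / weak braking / strong braking.
q1 = QHead(n_actions=3)
# Q2: no braking / weak braking / strong braking / brake + turn right / brake + turn left.
q2 = QHead(n_actions=5)
```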
Two different Q networks are thus built for the two agents with different tasks; when the automobile faces different emergency situations, the system switches between the real Q networks to output actions and complete the collision avoidance task.
Step 3, designing a multi-objective reward function, whose objectives comprise safety, efficiency and comfort;
in order to achieve stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance, the invention designs the decision reward function using a safe parking area and a dangerous area. When the host vehicle is outside the dangerous area it is encouraged to brake to a stop; when it is inside the dangerous area it is encouraged to steer and change lanes; when the distance falls below the dangerous area, a collision occurs with high probability. The reward function is designed as follows:
Step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
This task is designed for the general case of TTC between 1.5 s and 4 s, defining an ideal parking area (3-6 m) and a dangerous area (0-3 m); the reward function is designed as:
(Formula (3) is given as an image in the original publication.)
where V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the start of each episode, and ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment. When the automobile satisfies any judgment condition of formula (3) except the fourth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. The parking area is defined for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for ride comfort, preventing excessive changes of speed.
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
This task targets emergency situations with TTC between 0.5 s and 1.5 s; the dangerous area is modified to (0.5-3 m) and the ideal parking area is (3-6 m). In order to train the automobile, once it has driven into the dangerous area, to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary car, the reward function is designed as follows:
(Formula (4) is given as an image in the original publication.)
where V is the current speed of the car, d is the longitudinal distance between the car and the obstacle ahead, and k1, k2 are the weighting coefficients for encouraging and penalizing steering actions respectively: k1 = 1 when the obstacle is a pedestrian and k1 = 10 when the obstacle is a stationary car, so that when no collision has occurred in the dangerous area the car is encouraged more strongly to steer when facing a stationary car and to brake longitudinally when facing a pedestrian, while k2 = -10; V_init is the initial speed of the car at the start of each episode, ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment, and d_lat is the lateral distance between the car and the lane center line. The width of the car in the Carla simulation software is 1.8 m, and the host car is considered to have collided with the stationary car or the pedestrian, ending the current episode, when the lateral distance is smaller than 2 m; therefore, when the car satisfies any judgment condition of formula (4) except the fifth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. A constraint on the lateral distance is added to the reward of the dangerous area, so that compared with a pedestrian the car is encouraged more strongly to take steering actions for emergency avoidance when facing a stationary car. The ideal parking area is 3 m to 6 m, within which the car is encouraged to stop by longitudinal braking; if the car takes a steering lane-change action in this area, a larger negative reward is given. In addition, taking a steering action more than 6 m from the obstacle is dangerous and is also given a negative reward.
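The exact piecewise values of formulas (3) and (4) appear only as images in the original publication and are not reproduced here. The sketch below therefore only illustrates the structure described in the text for the combined task: region-dependent terms, a ΔV comfort term, and obstacle-dependent steering weights k1 and k2. Every constant other than k1 and k2 is a placeholder, not the patent's actual value, and the action labels are hypothetical.

```python
def combined_task_reward(v, d, d_lat, delta_v, obstacle, action, collided):
    """Structural sketch of formula (4); all numeric constants except k1, k2 are placeholders."""
    k1 = 10.0 if obstacle == "stationary_car" else 1.0   # encourage steering more for a stationary car
    k2 = -10.0                                            # penalty weight for steering actions
    steering = action in ("brake_left", "brake_right")    # hypothetical action labels

    R_COLLISION, R_STOPPED, R_STEP = -100.0, 50.0, -0.1   # placeholder values only

    if collided:
        return R_COLLISION
    if 3.0 <= d <= 6.0:              # ideal parking area: stop by longitudinal braking
        if steering:
            return k2                 # larger negative reward for changing lanes here
        return R_STOPPED if v < 0.1 else R_STEP - abs(delta_v)   # comfort term on speed change
    if 0.5 <= d < 3.0:               # dangerous area: lateral evasion weighted by obstacle type
        return k1 * d_lat if steering else R_STEP - abs(delta_v)
    if d > 6.0 and steering:         # steering far from the obstacle is dangerous
        return k2
    return R_STEP - abs(delta_v)
```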
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
Step 4.1, DQN parameter setting: several important hyper-parameters are configured in the DQN algorithm. The discount factor γ adjusts the relative weight of near-term and long-term returns in reinforcement learning; its value lies in (0, 1) and should be as large as possible provided the algorithm still converges, and it is taken as 0.95 in this model. The learning rate lr of the fully connected neural network is set in this model. The buffer size is the capacity of the experience replay pool used in DQN to improve data efficiency; its value is usually chosen in [10000, 100000], with the specific value selected according to the performance of the computer. The exploration time proportion and the final epsilon jointly determine the balance between exploration and exploitation of the DQN: epsilon falls from 1 to the final epsilon over the exploration period, and in this model the final epsilon is 0.05, with epsilon decaying slowly from 1 to 0.05 over 10000 episodes.
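The hyper-parameters of step 4.1 can be gathered into a single configuration object. In the sketch below, γ = 0.95 and the ε schedule (1 → 0.05 over 10000 episodes, assumed linear) follow the description above; the learning rate, buffer size, batch size and target-update interval are example values consistent with the stated ranges, since the exact values are not given in the text.

```python
from dataclasses import dataclass

@dataclass
class DQNConfig:
    gamma: float = 0.95              # discount factor, in (0, 1), as large as convergence allows
    lr: float = 1e-4                 # learning rate of the fully connected network (example value)
    buffer_size: int = 50000         # experience replay pool size, typically in [10000, 100000]
    batch_size: int = 64             # minibatch size (assumption, not stated in the text)
    eps_start: float = 1.0           # initial exploration rate
    eps_final: float = 0.05          # final exploration rate
    eps_decay_episodes: int = 10000  # episodes over which epsilon decays from 1 to 0.05
    target_update_steps: int = 1000  # C: steps between target-network updates (assumption)

def epsilon_at(episode: int, cfg: DQNConfig) -> float:
    """Linear epsilon decay from eps_start to eps_final (the decay shape is an assumption)."""
    frac = min(episode / cfg.eps_decay_episodes, 1.0)
    return cfg.eps_start + frac * (cfg.eps_final - cfg.eps_start)
```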
Step 4.2, iterative optimization training of the longitudinal braking collision avoidance control task; the specific algorithm steps are as follows:
Step 4.2.1, initialization. First initialize the experience replay pool D1 with capacity N; initialize the real Q1 network with randomly generated weights ω1; initialize the target Q1 network with weights ω1' = ω1;
Step 4.2.2, loop over each episode, episode = 1, 2, …, M:
Step 4.2.3, at the start of each episode, initialize the state S1;
Step 4.2.4, generate the action a_t with an ε-greedy policy: select a random action with probability ε, otherwise select a_t = argmax_a Q(s_t, a; ω1);
Step 4.2.5, the host vehicle executes the action a_t, interacts with the environment in Carla, and receives the immediate reward r_t and the new state S_(t+1);
Step 4.2.6, store the transition sample (S_t, a_t, r_t, S_(t+1)) in the experience replay pool D1 as a data set for training the neural network;
Step 4.2.7, randomly sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from the experience replay pool D1;
Step 4.2.8, if step j+1 reaches a terminal state, let y_j = r_j; otherwise let y_j = r_j + γ·max_a' Q(s_(j+1), a'; ω1');
Step 4.2.9, the loss function L = (y_j - Q(s_j, a_j; ω1))^2 is the mean square error between the target Q value and the current Q value; the loss value is minimized during training, and the parameters of the real Q1 network are updated by gradient descent of L with respect to ω1;
Step 4.2.10, every C steps, update the target Q1 network by copying the parameters of the real Q1 network: ω1' = ω1;
Step 4.2.11, repeat the training loop until the algorithm converges (a code sketch of this loop is given below);
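A compact sketch of the training loop of steps 4.2.1 to 4.2.11, assuming PyTorch; the env object (reset/step) stands in for the interaction with Carla and is an assumption, as is the cfg object, which carries the hyper-parameters of step 4.1 (gamma, lr, buffer_size, batch_size, eps_start, eps_final, eps_decay_episodes, target_update_steps). The same loop applies to the Q2 network of step 4.3 with five actions instead of three.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, cfg, n_actions=3, n_episodes=10000):
    """Illustrative DQN loop for the real Q1 network (use n_actions=5 for the Q2 network of step 4.3)."""
    target_net.load_state_dict(q_net.state_dict())             # step 4.2.1: omega' = omega
    replay = deque(maxlen=cfg.buffer_size)                     # experience replay pool D1
    optimizer = torch.optim.Adam(q_net.parameters(), lr=cfg.lr)
    step = 0
    for episode in range(n_episodes):                          # step 4.2.2
        state = env.reset()                                    # step 4.2.3: initial state, 1 x 259 tensor
        done = False
        eps = max(cfg.eps_final,
                  cfg.eps_start - (cfg.eps_start - cfg.eps_final) * episode / cfg.eps_decay_episodes)
        while not done:
            if random.random() < eps:                          # step 4.2.4: epsilon-greedy action
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
            next_state, reward, done = env.step(action)        # step 4.2.5: interact with Carla
            replay.append((state, action, reward, next_state, float(done)))  # step 4.2.6
            state = next_state
            if len(replay) >= cfg.batch_size:
                batch = random.sample(replay, cfg.batch_size)  # step 4.2.7: random minibatch
                s = torch.stack([b[0] for b in batch])
                a = torch.tensor([b[1] for b in batch])
                r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                s2 = torch.stack([b[3] for b in batch])
                d = torch.tensor([b[4] for b in batch])
                with torch.no_grad():                          # step 4.2.8: targets y_j
                    y = r + cfg.gamma * target_net(s2).max(dim=1).values * (1.0 - d)
                q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q, y)                        # step 4.2.9: mean squared error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            step += 1
            if step % cfg.target_update_steps == 0:            # step 4.2.10: copy omega -> omega'
                target_net.load_state_dict(q_net.state_dict())
```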
Step 4.3, iterative optimization training of the combined lateral and longitudinal braking collision avoidance control task; the specific algorithm steps are as follows:
Step 4.3.1, initialization. First initialize the experience replay pool D2 with capacity N; initialize the real Q2 network with randomly generated weights ω2; initialize the target Q2 network with weights ω2' = ω2;
Step 4.3.2, loop over each episode, episode = 1, 2, …, M:
Step 4.3.3, at the start of each episode, initialize the state S1;
Step 4.3.4, generate the action a_t with an ε-greedy policy: select a random action with probability ε, otherwise select a_t = argmax_a Q(s_t, a; ω2);
Step 4.3.5, the host vehicle executes the action a_t, interacts with the environment in Carla, and receives the immediate reward r_t and the new state S_(t+1);
Step 4.3.6, store the transition sample (S_t, a_t, r_t, S_(t+1)) in the experience replay pool D2 as a data set for training the neural network;
Step 4.3.7, randomly sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from the experience replay pool D2;
Step 4.3.8, if step j+1 reaches a terminal state, let y_j = r_j; otherwise let y_j = r_j + γ·max_a' Q(s_(j+1), a'; ω2');
Step 4.3.9, the loss function L = (y_j - Q(s_j, a_j; ω2))^2 is the mean square error between the target Q value and the current Q value; the loss value is minimized during training, and the parameters of the real Q2 network are updated by gradient descent of L with respect to ω2;
Step 4.3.10, every C steps, update the target Q2 network by copying the parameters of the real Q2 network: ω2' = ω2;
Step 4.3.11, repeat the training loop until the algorithm converges;
Step 4.4, the neural network parameters of the real networks Q1 and Q2 from the two tasks are saved as an online neural network controller. As shown in fig. 4, when TTC is between 1.5 s and 4 s, the real network Q1 is selected as the controller and outputs longitudinal braking actions; when TTC is between 0.5 s and 1.5 s, the real network Q2 is selected as the controller and outputs lateral or longitudinal control actions to accomplish the collision avoidance task.
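A minimal sketch of the online controller switching of step 4.4 and fig. 4, assuming the QHead networks above have been loaded with the saved parameters and that TTC has already been estimated from the relative distance and relative speed; the action names are illustrative labels for the discrete actions, not identifiers from the patent.

```python
import torch

LONGITUDINAL_ACTIONS = ["no_brake", "weak_brake", "strong_brake"]
COMBINED_ACTIONS = LONGITUDINAL_ACTIONS + ["brake_right", "brake_left"]

def select_action(state: torch.Tensor, ttc: float, q1, q2) -> str:
    """Pick the controller by TTC: Q1 for 1.5-4 s (longitudinal), Q2 for 0.5-1.5 s (lateral/longitudinal)."""
    with torch.no_grad():
        if 1.5 <= ttc <= 4.0:
            return LONGITUDINAL_ACTIONS[q1(state.unsqueeze(0)).argmax(dim=1).item()]
        if 0.5 <= ttc < 1.5:
            return COMBINED_ACTIONS[q2(state.unsqueeze(0)).argmax(dim=1).item()]
        return "no_brake"  # outside the emergency TTC range: no intervention (assumption)
```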

Claims (1)

1. The automobile emergency collision avoidance control method based on DQN deep reinforcement learning is characterized by comprising subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training; the subtask design defines different training tasks for different emergency situations; the state and action space design first splices the image and the sensor information to form the state input, and then designs different lateral and longitudinal action spaces for the different training tasks; the multi-objective reward function design gives the avoidance strategy of the automobile safety, efficiency and comfort; finally, the DQN parameters are set according to the actual training task, and the network parameters are iteratively optimized by cyclic training;
The method comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process;
step 1.2, designing a longitudinal brake control training task:
in this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane; the time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m; in the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is only trained to take longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles;
step 1.3, designing a transverse and longitudinal combined brake control training task:
in this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, the pedestrian or stationary automobile is placed at a distance of (5·V_init) m from the host vehicle, the pedestrians randomly cross the road from either end of the road with a pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, and the stationary automobile appears in the center of the lane; the time to collision TTC is randomly selected from 0.5 s to 1.5 s, and the pedestrian starts moving, or the stationary automobile appears, when the longitudinal distance between the host vehicle and the pedestrian or the stationary automobile is (TTC·V_init) m; within the TTC interval of this task the host vehicle faces an emergency burst scene, so when the host vehicle drives into the dangerous area it is trained to preferentially take a longitudinal braking action to ensure safety when facing a pedestrian crossing the road, and to preferentially take a steering braking action to realize avoidance when facing a stationary automobile, so that the braking action of the host vehicle is more targeted to different obstacles (a sketch of the scenario randomization of both tasks is given below);
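A minimal sketch of the scenario randomization of steps 1.2 and 1.3; the numeric ranges come from the text, while the Scenario dataclass, the task labels and the field names are assumptions made only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    v_init: float            # initial host speed, m/s
    v_ped: float             # pedestrian speed, m/s (unused for a stationary automobile)
    ttc: float               # time to collision, s
    spawn_distance: float    # obstacle placed 5 * v_init metres ahead of the host
    trigger_distance: float  # obstacle starts moving / appears at ttc * v_init metres


def sample_scenario(task: str) -> Scenario:
    """Randomize one training round for the 'longitudinal' or 'combined' task."""
    v_init = random.uniform(2.67, 16.67)
    v_ped = random.uniform(1.0, 3.0)
    # longitudinal task: TTC in [1.5, 4] s; combined task: TTC in [0.5, 1.5] s
    ttc = random.uniform(1.5, 4.0) if task == "longitudinal" else random.uniform(0.5, 1.5)
    return Scenario(v_init, v_ped, ttc,
                    spawn_distance=5.0 * v_init,
                    trigger_distance=ttc * v_init)
```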
step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
firstly, the view angle of the semantic segmentation camera is adjusted to a top-down view to obtain a semantic segmentation bird's-eye view centered on the position of the host vehicle; semantic segmentation assigns a pixel category to each object, and each category corresponds to one color in a palette, so the output segmentation picture is a picture made up of blocks of different colors; in the invention, however, only the digital labels of the colors of the lane lines, the host vehicle and the obstacles (pedestrians and stationary automobiles) are retained, and the digital labels of the colors of everything else are changed to the same number, i.e. only 5 different colors remain in the preprocessed image: lane lines, host vehicle, obstacles (pedestrians and stationary automobiles) and everything else (a sketch of this label remapping is given below);
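A minimal sketch of the label remapping of step 2.1.1: keep lane lines, host vehicle, pedestrians and stationary automobiles, and collapse every other class into one "other" label; the concrete source label IDs are placeholders, not the actual Carla palette values.

```python
import numpy as np

# hypothetical source class IDs -> compact IDs used as state input
KEPT_LABELS = {6: 1,    # lane line
               10: 2,   # host vehicle
               4: 3,    # pedestrian
               14: 4}   # stationary automobile
OTHER = 0


def remap_segmentation(label_image: np.ndarray) -> np.ndarray:
    """Map a per-pixel class-ID image to the 5 classes kept in the preprocessed image."""
    out = np.full_like(label_image, OTHER)
    for src, dst in KEPT_LABELS.items():
        out[label_image == src] = dst
    return out
```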
Step 2.1.2, information splicing:
after the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network comprising three convolutional layers and one fully connected layer: the convolutional layers extract the feature information of the semantic segmentation top view, and the fully connected layer flattens the image information into a one-dimensional matrix; the one-dimensional image feature matrix and the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) are then spliced by the Cat splicing function to obtain a new one-dimensional matrix, namely the state input of the DQN algorithm (a sketch of this state encoder is given below);
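A minimal sketch of the state encoder of step 2.1.2: three convolutional layers, one fully connected layer, then concatenation with the three-element sensor vector via torch.cat; the channel counts, kernel sizes and feature dimension are assumptions.

```python
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Three conv layers + one fully connected layer, then torch.cat with the sensor vector."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # fully connected layer that flattens the image features into a 1-D vector
        self.fc = nn.LazyLinear(feature_dim)

    def forward(self, image: torch.Tensor, sensors: torch.Tensor) -> torch.Tensor:
        # sensors = (ego speed, relative distance to obstacle, relative speed to obstacle)
        img_feat = torch.relu(self.fc(self.conv(image)))
        return torch.cat([img_feat, sensors], dim=1)   # the "Cat splicing" of step 2.1.2
```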
step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
under this task the host vehicle has a sufficient longitudinal braking distance, so the action space only comprises the longitudinal braking actions of the automobile (no braking, weak braking, strong braking); the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network, which has two fully connected layers and outputs the Q values of the three actions;
step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
in most scenes of this task the automobile faces an emergency situation, so not only the longitudinal braking actions of the automobile are trained but also the steering braking actions; the action space comprises the longitudinal and transverse actions of the automobile (no braking, weak braking, strong braking, braking and turning right, braking and turning left); the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network, which has two fully connected layers and outputs the Q values of the five actions (a sketch of the Q1 and Q2 network heads is given below);
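A minimal sketch of the Q-value heads of steps 2.2.1 and 2.2.2: two fully connected layers on the spliced state vector, with 3 outputs for the real Q1 network and 5 outputs for the real Q2 network; the hidden-layer width and the state dimension (matching the encoder sketch above) are assumptions.

```python
import torch
import torch.nn as nn


class QHead(nn.Module):
    """Two fully connected layers mapping the spliced state to one Q value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


# Q1: no braking, weak braking, strong braking
q1_net = QHead(state_dim=256 + 3, n_actions=3)
# Q2: no braking, weak braking, strong braking, brake + turn right, brake + turn left
q2_net = QHead(state_dim=256 + 3, n_actions=5)
```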
Step 3, designing a multi-objective reward function, whose objective is to make the learned strategy of the automobile balance safety, efficiency and comfort;
in order to achieve both stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance control, the invention designs the reward function by dividing the space ahead into an ideal parking area and a dangerous area: when the host vehicle is outside the dangerous area it is encouraged to take braking measures and stop, when the host vehicle is inside the dangerous area it is encouraged to take steering lane-change measures, and when the distance becomes smaller than the dangerous area the vehicle will collide with a large probability; the reward function is designed as follows:
step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
this task targets the general case of a collision time TTC between 1.5 seconds and 4 seconds; an ideal parking area (3-6 m) and a dangerous area (0-3 m) are defined, and the reward function is designed as:
[Formula (1): piecewise reward function for the longitudinal braking task, reproduced only as image FDA0004096917220000031 in the original application]
wherein V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the beginning of each round, and ΔV is the difference between the speed V_{t-1} of the automobile at the previous moment and the speed V_t at the current moment; when the automobile satisfies any judgment condition of formula (1) except the fourth and the last one, it obtains the corresponding reward, the current round of training ends, and the next round of training starts immediately; the parking area is divided for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for the comfort of the automobile, preventing excessive speed changes (a sketch of a reward with this structure is given below);
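A hypothetical sketch of a reward with the structure described for formula (1); since the formula itself is reproduced only as an image, every numeric reward below is a placeholder, and only the 3-6 m parking area, the 0-3 m dangerous area, the comfort term on ΔV and the terminal/non-terminal split are taken from the text.

```python
def longitudinal_reward(v, d, v_init, delta_v, collided):
    """Placeholder reward for the longitudinal braking task; values are illustrative only."""
    if collided or d <= 0.0:
        return -10.0, True                             # collision: large negative reward, round ends
    if v == 0.0 and 3.0 <= d <= 6.0:
        return 10.0, True                              # stopped inside the ideal parking area (3-6 m)
    if v == 0.0 and d > 6.0:
        return -5.0, True                              # stopped too early: hurts traffic efficiency
    if d < 3.0:
        return -1.0, False                             # inside the dangerous area (0-3 m), round continues
    return -abs(delta_v) / max(v_init, 1e-3), False    # comfort term: penalize large speed changes
```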
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
this task targets emergency situations with a collision time TTC between 0.5 seconds and 1.5 seconds; the dangerous area is modified to (0.5-3 m) and the ideal parking area to (3-6 m); in order to train the automobile, when it drives into the dangerous area, to preferentially take a longitudinal braking action to ensure safety when facing a pedestrian crossing the road, and to preferentially take a steering braking action to realize avoidance when facing a stationary automobile, the reward function is designed as:
[Formula (2): piecewise reward function for the transverse and longitudinal combined braking task, reproduced only as image FDA0004096917220000032 in the original application]
wherein V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, and k1, k2 are the weighting factors for encouraging and punishing steering actions respectively: k1 = 1 when the obstacle is a pedestrian, k1 = 10 when the obstacle is a stationary automobile, and k2 = -10, so that inside the dangerous area, before any collision occurs, the automobile is encouraged more strongly to take a steering action when facing a stationary automobile and preferentially takes a longitudinal braking action when facing a pedestrian; V_init is the initial speed of the automobile at the beginning of each round, ΔV is the difference between the speed V_{t-1} of the automobile at the previous moment and the speed V_t at the current moment, and d_lat is the transverse distance between the automobile and the stationary automobile or pedestrian; the width of the automobile in the Carla simulation software is 1.8 m, and the host vehicle is considered to collide with the stationary automobile or pedestrian, ending the current round, when the transverse distance is smaller than 2 m; therefore, when the automobile satisfies any judgment condition of formula (2) except the fifth and the last one, it obtains the corresponding reward, the current round of training ends, and the next round of training starts immediately; a limitation on the transverse distance is added to the reward in the dangerous area so that, compared with a pedestrian, the automobile is encouraged more to take a steering action for emergency avoidance when facing a stationary automobile; the ideal parking area is 3-6 m, where the automobile is encouraged to brake longitudinally and stop, and a large negative reward is given if the automobile takes a steering lane-change action in this area; in addition, taking a steering action more than 6 m away from the obstacle is dangerous, so a negative reward is also given (a sketch of these steering-related terms is given below);
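A hypothetical sketch of the steering-related terms described for formula (2); the branch rewards are placeholders, and only the weights k1 and k2, the 2 m lateral threshold and the area boundaries come from the text.

```python
def steering_terms(d, d_lat, obstacle_is_pedestrian, took_steering_action):
    """Placeholder for the steering-related terms of the combined task; values are illustrative only."""
    k1 = 1.0 if obstacle_is_pedestrian else 10.0     # steering is encouraged more for a stationary automobile
    k2 = -10.0                                       # steering is punished where it is unwanted
    if d_lat < 2.0 and d <= 0.0:
        return -10.0, True          # passed the obstacle with less than 2 m lateral gap: counted as collision
    if 0.5 <= d < 3.0 and took_steering_action:
        return k1, False            # dangerous area: steering rewarded, weighted by obstacle type
    if 3.0 <= d <= 6.0 and took_steering_action:
        return k2, False            # ideal parking area: lane changing is punished
    if d > 6.0 and took_steering_action:
        return k2, False            # steering far from the obstacle is dangerous, punished
    return 0.0, False               # otherwise longitudinal terms analogous to formula (1) apply
```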
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
step 4.1, setting the DQN parameters and designing several important hyper-parameters of the DQN algorithm: first, the discount factor γ, whose value range is (0, 1), is taken as large as possible on the premise that the algorithm can converge, and is set to 0.95 in this model; second, the size of the experience replay pool (Buffer) used in DQN to improve data efficiency is usually chosen within [10000, 100000] and appropriately enlarged for complex tasks so that a sufficient number of earlier steps is taken into account, and is set to 20000 in this model; finally, the balance between exploration and exploitation in DQN is jointly determined by the final value of epsilon and the time over which epsilon decays from 1 to the final epsilon; a final epsilon of about 0.1 is usually taken for complex tasks, and in this model the final epsilon is set to 0.05 (a sketch of this hyper-parameter configuration is given below);
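A minimal sketch of the hyper-parameter configuration of step 4.1; only the discount factor, the replay-pool size and the final epsilon are taken from the text, and the remaining fields (learning rate, batch size, decay steps, target update period) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DQNConfig:
    gamma: float = 0.95              # discount factor in (0, 1), as large as convergence allows
    buffer_size: int = 20000         # experience replay pool, typically within [10000, 100000]
    epsilon_start: float = 1.0       # fully random exploration at the start of training
    epsilon_final: float = 0.05      # final exploration rate used in this model
    epsilon_decay_steps: int = 50000 # assumed length of the decay from 1 to epsilon_final
    batch_size: int = 64             # assumed minibatch size
    learning_rate: float = 1e-3      # assumed optimizer step size
    target_update_c: int = 1000      # assumed period C for copying to the target network


def epsilon_at(step: int, cfg: DQNConfig) -> float:
    """Linear decay of epsilon from epsilon_start to epsilon_final."""
    frac = min(step / cfg.epsilon_decay_steps, 1.0)
    return cfg.epsilon_start + frac * (cfg.epsilon_final - cfg.epsilon_start)
```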
Step 4.2, iterative optimization training of a longitudinal braking collision avoidance control task:
initializing the hyper-parameters and performing cyclic training: in each training round, the real Q1 network receives the state input and outputs the Q values of the three actions; the agent selects an action with the ε-greedy algorithm, obtains a reward and reaches the next state; the state, action, reward, next state and done flag are packed into a five-tuple and stored in the experience replay pool as one experience; the target Q1 network randomly extracts a batch of experiences from the experience pool, and the mean square error between the target Q value it yields and the Q value output by the real Q1 network is the LOSS of the neural network; the optimization goal of the neural network is to minimize this LOSS so that the output of the real Q1 network gets as close as possible to the output Q value of the target Q1 network, after which the next action is executed and the training loop continues; the goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle while ensuring safety, efficiency and comfort (a sketch of one such training round is given below);
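A minimal sketch of one training round of step 4.2, combining ε-greedy action selection, replay storage and the dqn_update sketch given earlier after step 4.3.11 of the description; the environment object and its reset/step interface are assumptions standing in for the Carla scenario.

```python
import random
from collections import deque

import torch


def run_round(env, real_q1, target_q1, optimizer, replay_pool: deque,
              epsilon: float, gamma: float, step: int, batch_size: int = 64, c: int = 1000):
    """One training round of the longitudinal braking task (illustrative interface)."""
    state = env.reset()                                   # new randomized scenario
    done = False
    while not done:
        if random.random() < epsilon:                     # epsilon-greedy exploration
            action = random.randrange(3)                  # 3 longitudinal braking actions
        else:
            with torch.no_grad():
                q = real_q1(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1))
        next_state, reward, done = env.step(action)       # reward as described for formula (1)
        replay_pool.append((state, action, reward, next_state, float(done)))  # five-tuple experience
        dqn_update(real_q1, target_q1, optimizer, replay_pool,
                   batch_size, gamma, step, c)            # minimize the MSE LOSS
        state = next_state
        step += 1
    return step
```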
step 4.3, iterative optimization training of a transverse and longitudinal combined braking collision avoidance control task:
initializing the hyper-parameters and performing cyclic training: in each training round, the real Q2 network receives the state input and outputs the Q values of the five actions; the agent selects an action with the ε-greedy algorithm, obtains a reward and reaches the next state; the state, action, reward, next state and done flag are packed into a five-tuple and stored in the experience replay pool as one experience; the target Q2 network randomly extracts a batch of experiences from the experience pool, and the mean square error between the target Q value it yields and the Q value output by the real Q2 network is the LOSS of the neural network; the optimization goal of the neural network is to minimize this LOSS so that the output of the real Q2 network gets as close as possible to the output Q value of the target Q2 network, after which the next action is executed and the training loop continues; the goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle while ensuring safety, efficiency and comfort;
and 4.4, saving the neural network parameters of the target Q1 network and the target Q2 network from the two tasks for use as the online neural network controller: when the TTC is between 1.5 s and 4 s, the target Q1 network is selected as the controller and outputs a longitudinal braking action; when the TTC is between 0.5 s and 1.5 s, the target Q2 network is selected as the controller and outputs a transverse or longitudinal control action to accomplish the collision avoidance task.
CN202310168297.1A 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning Pending CN116176572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310168297.1A CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310168297.1A CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116176572A true CN116176572A (en) 2023-05-30

Family

ID=86451997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310168297.1A Pending CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116176572A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822655A (en) * 2023-08-24 2023-09-29 南京邮电大学 Acceleration method for automatically controlled training process
CN116822655B (en) * 2023-08-24 2023-11-24 南京邮电大学 Acceleration method for automatically controlled training process

Similar Documents

Publication Publication Date Title
CN111081065B (en) Intelligent vehicle collaborative lane change decision model under road section mixed traveling condition
CN113291308B (en) Vehicle self-learning lane-changing decision-making system and method considering driving behavior characteristics
CN112233413B (en) Multilane space-time trajectory optimization method for intelligent networked vehicle
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN111311945A (en) Driving decision system and method fusing vision and sensor information
CN110525428B (en) Automatic parking method based on fuzzy depth reinforcement learning
CN113954837B (en) Deep learning-based lane change decision-making method for large-scale commercial vehicle
CN111679660B (en) Unmanned deep reinforcement learning method integrating human-like driving behaviors
US11263465B2 (en) Low-dimensional ascertaining of delimited regions and motion paths
CN115039142A (en) Method and apparatus for predicting intent of vulnerable road user
Zong et al. Obstacle avoidance for self-driving vehicle with reinforcement learning
EP3705367B1 (en) Training a generator unit and a discriminator unit for collision-aware trajectory prediction
CN114564016A (en) Navigation obstacle avoidance control method, system and model combining path planning and reinforcement learning
US11801864B1 (en) Cost-based action determination
CN116176572A (en) Automobile emergency collision avoidance control method based on DQN deep reinforcement learning
CN111899509B (en) Intelligent networking automobile state vector calculation method based on vehicle-road information coupling
CN114399743A (en) Method for generating future track of obstacle
CN114103893A (en) Unmanned vehicle trajectory prediction anti-collision method
Guo et al. Toward human-like behavior generation in urban environment based on Markov decision process with hybrid potential maps
CN115257819A (en) Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment
CN111824182A (en) Three-axis heavy vehicle self-adaptive cruise control algorithm based on deep reinforcement learning
CN115973169A (en) Driving behavior decision method based on risk field model, electronic device and medium
CN115107767A (en) Artificial intelligence-based automatic driving brake and anti-collision control method
DE102022109385A1 (en) Reward feature for vehicles
Guo et al. Research on integrated decision control algorithm for autonomous vehicles under multi-task hybrid constraints in intelligent transportation scenarios

Legal Events

Date Code Title Description
PB01 Publication