CN116176572A - Automobile emergency collision avoidance control method based on DQN deep reinforcement learning - Google Patents


Info

Publication number
CN116176572A
Authority
CN
China
Prior art keywords
automobile
training
braking
action
longitudinal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310168297.1A
Other languages
Chinese (zh)
Inventor
卢晓晖
郑馨義
吕新展
李绍松
李佳纯
董旭升
张鹏飞
张袅娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority claimed from CN202310168297.1A
Publication of CN116176572A


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W30/00 Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
    • B60W30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W30/09 Taking automatic action to avoid collision, e.g. braking and steering
    • B60W30/095 Predicting travel path or likelihood of collision
    • B60W30/0953 Predicting travel path or likelihood of collision, the prediction being responsive to vehicle dynamic parameters
    • B60W30/0956 Predicting travel path or likelihood of collision, the prediction being responsive to traffic or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • B60W2520/00 Input parameters relating to overall vehicle dynamics
    • B60W2520/10 Longitudinal speed
    • B60W2554/00 Input parameters relating to objects
    • B60W2554/40 Dynamic objects, e.g. animals, windblown objects
    • B60W2554/402 Type
    • B60W2554/4029 Pedestrians
    • B60W2554/80 Spatial relation or speed relative to objects
    • B60W2554/802 Longitudinal distance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides an automobile emergency collision avoidance control method based on DQN deep reinforcement learning, belonging to the field of new energy automobile braking. It aims to solve the problem that an AEB system relies only on longitudinal braking to avoid collision in an emergency, and to make braking actions more targeted when facing different obstacles. The method comprises subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training; it improves algorithm training efficiency, improves the safety of the automobile, and makes the avoidance strategy of the automobile more human-like.

Description

Automobile emergency collision avoidance control method based on DQN deep reinforcement learning
Technical field:
The invention belongs to the field of new energy automobile braking, and particularly relates to an automobile emergency collision avoidance control method based on DQN deep reinforcement learning.
Background art:
Automatic emergency braking (Autonomous Emergency Braking, AEB) is an active safety function in which environment-perception sensors detect the risk of collision with a vehicle, pedestrian or other traffic participant ahead, and the system automatically triggers the actuator to apply braking so as to avoid the collision or mitigate its severity.
Most researchers study the automatic emergency braking function under urban conditions with traditional methods such as rule-based methods, PID-based methods, fuzzy control and model predictive control. However, these methods suffer from extensive manual parameter tuning, low control precision, dependence on model accuracy, high model complexity, large computational load and low computation speed. They are also limited in handling all scenarios that may occur on real roads, adapt poorly to complex traffic environments, lack robustness, rely heavily on manual experience to formulate rules, and, using only environment-perception sensor information, cannot take different braking or steering actions according to the category of the obstacle. An end-to-end structure based on deep reinforcement learning can obtain control actions such as throttle, brake and steering angle directly from perception inputs, which greatly reduces the workload and parameter-tuning cost of building each algorithm layer and improves the generalization ability of automated driving; corresponding target requirements can also be added to the reward function so that multiple objectives are optimized. Deep reinforcement learning combines the powerful abstract feature perception of deep neural networks, can learn adaptively in changing environments, has a certain adaptability to scenarios outside the training environment, and, once trained, computes faster than traditional algorithms. At present, deep reinforcement learning performs well in fields such as object recognition, automatic control and games, and combining it with intelligent driving is a popular direction of autonomous driving research.
Currently, among commercialized automated driving technologies, the AEB system is a longitudinal collision avoidance assistance system, i.e. it reduces the risk of collision in the longitudinal direction of travel by early warning or active braking control. However, investigation shows that the main disadvantage of relying solely on longitudinal braking is that collisions cannot be avoided when the distance to the obstacle is smaller than the total braking distance, and that at higher driving speeds steering-based collision avoidance maneuvers are more effective than braking-based ones, because under such conditions the critical collision avoidance distance required for steering is shorter than that required for braking control; that is, even when emergency braking alone cannot avoid a collision, the vehicle can still effectively avoid it through a steering control strategy. According to prior studies, as the time to collision (Time To Collision, TTC) decreases from 2.5 s to 1.5 s, the proportion of drivers who avoid collision by braking alone decreases from 72% to 43%, the proportion who avoid collision by steering alone decreases from 14% to 0%, and the proportion who avoid collision by combined steering and braking increases from 14% to 57%.
The invention uses deep reinforcement learning to let the intelligent automobile adaptively learn an automatic emergency collision avoidance control strategy. The state input combines high-dimensional image information with one-dimensional sensor information: at the same position and speed, pedestrians and stationary obstacle vehicles have different shapes and colors in the image, so the automobile can better distinguish the type of the obstacle ahead, whereas with sensor information alone a car and a person would yield the same data. The state information in the environment is acquired and processed by a deep neural network that fits the reinforcement learning value function, and the trial-and-error idea of reinforcement learning is used to fully explore the environment, combining the advantages of reinforcement learning and deep learning and compensating their weaknesses. Steering actions are added on top of longitudinal braking, so that when the system detects a dangerous area it tries to improve the collision situation by steering or changing lanes; the automobile thus makes different driving decisions in different emergency situations, the emergency collision avoidance system becomes more human-like, and this is of great significance for improving vehicle safety.
Summary of the invention:
In order to give the emergency collision avoidance system of a new energy automobile the characteristics of safety and human-like behavior, the invention provides an automobile emergency collision avoidance control method based on DQN deep reinforcement learning. The method uses deep reinforcement learning to control speed and steering when a collision is imminent, reducing road collision accidents. The proposed autonomous braking system uses cameras and sensors to obtain obstacle information and automatically decides at each time step whether to brake or steer when there is a risk of collision. The braking control design problem is formulated as finding an optimal policy in a Markov Decision Process (MDP) model, where the state is given by the image together with the vehicle speed, the relative distance to the obstacle and the relative speed, and the action space contains five actions: 1) no braking, 2) weak braking, 3) strong braking, 4) braking and turning left, 5) braking and turning right. The braking and steering control strategy is learned through computer simulation using the deep reinforcement learning method called Deep Q Network (DQN). To derive an ideal braking and steering strategy, a multi-task reinforcement learning method is proposed, and a multi-objective reward function balances the damage to the obstacle in the event of an accident against the reward obtained for leaving the risk area as soon as possible. The DQN is trained in scenarios where the vehicle encounters pedestrians crossing an urban road or a stationary obstacle.
The technical scheme adopted for solving the technical problems is as follows:
The automobile emergency collision avoidance control method based on DQN deep reinforcement learning comprises subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training. The subtask design defines different training tasks for different time-to-collision (TTC) intervals. The state and action space design first splices high-dimensional semantic segmentation image information with one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) to form the state input, and then designs different action spaces for the different training tasks. The multi-objective reward function mainly guides the automobile to complete the emergency avoidance task while taking efficiency and comfort into account. Finally, DQN parameter setting and training configures the hyper-parameters according to the actual training task, then trains cyclically in the Carla simulation platform, iteratively optimizing the parameters of the neural network so that the automobile learns to execute correct actions in different emergency states: when an obstacle appears ahead, the ego vehicle brakes or performs automatic emergency steering to avoid rear-end or other collision accidents. Following the multi-task reinforcement learning method, two policy networks and a multi-objective reward function are established, which greatly improves training efficiency; at the same time, introducing images as part of the state input allows the vehicle to take different actions when facing different obstacles, making it more human-like.
The method comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process:
The emergency collision avoidance process has the Markov property, i.e. the speed, acceleration, position information, image information and so on of the vehicle at the next moment depend only on the current state and not on the history. Braking or steering at the current moment affects the state of the vehicle at the next moment. The Markov decision process is built around the interaction between an agent and the environment and contains three elements: states, actions and the reward function. The agent perceives the current system state and acts on the environment according to its policy, causing the environment state to change and a reward to be returned. The states, actions, state transition probabilities, rewards and discount factor form the five-tuple (S, A, P, R, γ) of the reinforcement learning Markov decision process. The invention models the emergency collision avoidance process as a Markov decision process; the optimal decision is learned over a discrete action space by maximizing the accumulated reward, realizing the joint optimization of safety and efficiency.
Step 1.2, designing a longitudinal brake control training task:
In this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is trained to take only longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles.
Step 1.3, designing a transverse and longitudinal combined brake control training task:
In this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 0.5 s and 1.5 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces an emergency scenario: when it has driven into the dangerous area, it is trained to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary vehicle, so that the braking actions of the host vehicle are more targeted to the type of obstacle.
Step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
Firstly, the viewing angle of the semantic segmentation camera is adjusted to a top-down view and the horizontal position of the camera is adjusted to obtain a semantic segmentation bird's-eye view with the host vehicle at the center of the picture. Semantic segmentation assigns each object a pixel category, and each category corresponds to one color in a palette; every object color in a normal semantic segmentation image has its own numeric label, so the final output is a picture made up of blocks of different colors. In the invention, however, only the numeric labels (and thus colors) of the lane lines, the host vehicle and the obstacles (pedestrians and stationary vehicles) are kept distinct, and the numeric labels of everything else are changed to the same number, so that only 5 different colors remain in the preprocessed image: the road, the lane lines, the vehicles (host vehicle and stationary cars), the pedestrians, and everything else. The purpose is to simplify the colors of objects of little relevance around the automobile and to emphasize the automobile itself and the road, lane lines and pedestrians.
Step 2.1.2, information splicing:
After the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network; the convolutional network π(z, p) consists of three convolutional layers and one fully connected layer, where z is the high-dimensional image information and p is the one-dimensional information (host vehicle speed V, relative distance d, relative speed V_rel). The fully connected layer FC1 processes the flattened output of the third convolutional layer Conv3, and the resulting one-dimensional feature matrix of the image is then spliced with the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) through the Cat concatenation function, giving a new one-dimensional matrix. This is the state input of the DQN algorithm and serves as the input of the subsequent fully connected network that outputs the actions.
Step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
Under this task the host vehicle has a sufficient longitudinal braking distance, so the action space contains only the longitudinal braking actions of the vehicle (no braking, weak braking, strong braking). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network, which has two fully connected layers; the number of output neurons is 3, corresponding to the three actions, and the Q values of the three actions are output.
Step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
In most emergency scenes of this task the car faces a critical situation, so not only the longitudinal braking actions of the car are trained but also steering-braking actions; the action space contains the longitudinal and lateral actions of the car (no braking, weak braking, strong braking, braking and turning right, braking and turning left). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network, which has two fully connected layers; the number of output neurons is 5, corresponding to the five actions, and the Q values of the five actions are output.
Two different Q networks are thus built for the two agents with different tasks; when the automobile faces different emergency situations, the system switches between the real Q networks to output actions and complete the collision avoidance task.
Step 3, designing a multi-objective reward function, whose objective is to make the learned strategy of the automobile balance safety, efficiency and comfort;
in order to achieve stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance, the invention designs the reward function using a division into an ideal parking area and a dangerous area. When the host vehicle is outside the dangerous area it is encouraged to brake to a stop; when it is inside the dangerous area it is encouraged to steer and change lanes; when the distance falls below the dangerous area, a collision occurs with high probability. The reward function is designed as follows:
Step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
the task is aimed at emergency situations with TTC between 1.5 and 4 seconds, and the invention adopts ideal parking areas (3-6 m) and dangerous areas (0-3 m) to design decision-making reward functions, wherein the reward functions are designed as follows:
(Formula (1) is given as an image in the original publication.)
where V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the start of each episode, and ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment. When the automobile satisfies any judgment condition of formula (1) except the fourth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. The parking area is defined for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for ride comfort, preventing excessive changes of speed.
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
This task targets emergency situations with TTC between 0.5 s and 1.5 s; the dangerous area is modified to (0.5-3 m) and the ideal parking area is (3-6 m). In order to train the automobile, once it has driven into the dangerous area, to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary car, the reward function is designed as follows:
(Formula (2) is given as an image in the original publication.)
where V is the current speed of the car, d is the longitudinal distance between the car and the obstacle ahead, and k1, k2 are the weighting coefficients for encouraging and penalizing steering actions respectively: k1 = 1 when the obstacle is a pedestrian and k1 = 10 when the obstacle is a stationary car, so that when no collision has occurred in the dangerous area the car is encouraged more strongly to steer when facing a stationary car and to brake longitudinally when facing a pedestrian, while k2 = -10; V_init is the initial speed of the host vehicle at the start of each episode, ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment, and d_lat is the lateral distance between the car and the lane center line. The width of the car in the Carla simulation software is 1.8 m, and the car is considered to have collided with the stationary car or the pedestrian, ending the current episode, when the lateral distance is smaller than 2 m; therefore, when the car satisfies any judgment condition of formula (2) except the fifth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. A constraint on the lateral distance is added to the reward of the dangerous area, so that compared with a pedestrian the car is encouraged more strongly to take steering actions for emergency avoidance when facing a stationary car. The ideal parking area is 3 m to 6 m, within which the car is encouraged to stop by longitudinal braking; if the car takes a steering lane-change action in this area, a larger negative reward is given. In addition, taking a steering action more than 6 m from the obstacle is dangerous and is also given a negative reward.
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
Step 4.1, DQN parameter setting: several important hyper-parameters are configured in the DQN algorithm. The discount factor γ adjusts the relative weight of near-term and long-term returns in reinforcement learning; its value lies in (0, 1) and should be as large as possible provided the algorithm still converges, and it is taken as 0.95 in this model. The learning rate lr of the fully connected neural network is set in this model. The buffer size is the capacity of the experience replay pool used in DQN to improve data efficiency; its value is usually chosen in [10000, 100000], with the specific value selected according to the performance of the computer. The exploration time proportion and the final epsilon jointly determine the balance between exploration and exploitation of the DQN: epsilon falls from 1 to the final epsilon over the exploration period, and in this model the final epsilon is 0.05, with epsilon decaying slowly from 1 to 0.05 over 10000 episodes.
Step 4.2, iterative optimization training of a longitudinal braking collision avoidance control task:
Initialize the hyper-parameters and carry out cyclic training. In each training episode, the real Q1 network receives the state input and outputs the Q values of the three actions, and the agent selects an action using an ε-greedy policy, obtains the reward and reaches the next state. The state, action, reward, next state and termination flag are packed into a five-tuple (S_t, A_t, R_t, S_(t+1), done) and stored as one experience in the experience replay pool. A batch of experiences is then randomly drawn from the pool; the mean square error between the Q value output by the target Q1 network and the Q value output by the real Q1 network is the loss LOSS of the neural network, and the optimization objective is to minimize this loss so that the output of the real Q1 network approaches the Q value output by the target Q1 network as closely as possible. The next action is then executed, and training proceeds cyclically in this way. The goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle under the constraints of safety, efficiency and comfort.
Step 4.3, iterative optimization training of a transverse and longitudinal combined braking collision avoidance control task:
Initialize the hyper-parameters and carry out cyclic training. In each training episode, the real Q2 network receives the state input and outputs the Q values of the five actions, and the agent selects an action using an ε-greedy policy, obtains the reward and reaches the next state. The state, action, reward, next state and termination flag are packed into a five-tuple (S_t, A_t, R_t, S_(t+1), done) and stored as one experience in the experience replay pool. A batch of experiences is then randomly drawn from the pool; the mean square error between the Q value output by the target Q2 network and the Q value output by the real Q2 network is the loss LOSS of the neural network, and the optimization objective is to minimize this loss so that the output of the real Q2 network approaches the Q value output by the target Q2 network as closely as possible. The next action is then executed, and training proceeds cyclically in this way. The goal of the algorithm is likewise to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle under the constraints of safety, efficiency and comfort.
Step 4.4, the neural network parameters of the real networks Q1 and Q2 from the two tasks are saved as an online neural network controller. When TTC is between 1.5 s and 4 s, the real network Q1 is selected as the controller and outputs longitudinal braking actions; when TTC is between 0.5 s and 1.5 s, the real network Q2 is selected as the controller and outputs lateral or longitudinal control actions to accomplish the collision avoidance task.
The beneficial effects of the invention are as follows: the emergency collision avoidance control method based on the DQN deep reinforcement learning algorithm can improve the safety and reliability of the braking effect in complex, changeable environments with multiple constraints, and reduce the number of traffic accidents and the losses they cause. Starting from end-to-end control and avoiding formal modeling, the invention forms an interactive learning method for a human-like automatic emergency braking strategy oriented to multi-scenario driving environments and carried by an agent, which can take the appropriate driving behavior according to the current state of the vehicle and thus effectively guarantees the adaptive capability of the vehicle braking strategy. On the technical side, steering actions are added to the traditional AEB system, which supports the development of intelligent driving safety technology, guarantees the safety of intelligent driving, improves the braking safety of new energy automobiles, and provides a reference for the development of intelligent braking of new energy automobiles.
Drawings
FIG. 1 is a flow chart of the steps of the invention.
Fig. 2 is a schematic diagram of an experimental scenario of the present invention.
Fig. 3 is a structural diagram of a neural network of the present invention.
Fig. 4 is a control flow chart of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and the embodiments.
The invention provides an emergency collision avoidance control method based on DQN deep reinforcement learning, which uses the combination of high-dimensional image information and one-dimensional sensor information as the state input together with a training method based on a multi-objective reward function and multi-task division. Its advantage is that, when an obstacle appears ahead and the automobile cannot achieve the collision avoidance target by braking alone, it can complete automatic emergency steering and avoid rear-end or other collision accidents. As shown in fig. 1, the method specifically comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process:
The emergency collision avoidance process has the Markov property, i.e. the speed, acceleration, position information, image information and so on of the vehicle at the next moment depend only on the current state and not on the history. Braking or steering at the current moment affects the state of the vehicle at the next moment. The Markov decision process is built around the interaction between an agent and the environment and contains three elements: states, actions and the reward function. The agent perceives the current system state and acts on the environment according to its policy, causing the environment state to change and a reward to be returned. The states, actions, state transition probabilities, rewards and discount factor form the five-tuple (S, A, P, R, γ) of the reinforcement learning Markov decision process. The invention models the emergency collision avoidance process as a Markov decision process; the optimal decision is learned over a discrete action space by maximizing the accumulated reward, realizing the joint optimization of safety and efficiency.
Step 1.2, designing a longitudinal brake control training task:
In this task, as shown in fig. 2, the stop line is set 3 m from the obstacle and less than 3 m is defined as the dangerous area; the ideal stop line is set 6 m from the obstacle and 3 m to 6 m is defined as the ideal parking area; the pedestrian crossing trigger point Ptrig is the position at which, when the automobile passes it, the pedestrian starts moving across the road. The initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is trained to take only longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles.
Step 1.3, designing a transverse and longitudinal combined brake control training task:
In this task, the stop line is set 0.5 m from the obstacle, and 0.5 m to 3 m is defined as the dangerous area; the ideal stop line is set 6 m from the obstacle and 3 m to 6 m is defined as the ideal parking area; the pedestrian crossing trigger point Ptrig is the position at which, when the automobile passes it, the pedestrian starts moving across the road. The initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane. The time to collision TTC is randomly selected between 0.5 s and 1.5 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m. In the TTC interval of this task the host vehicle faces an emergency scenario: when it has driven into the dangerous area, it is trained to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary vehicle, so that the braking actions of the host vehicle are more targeted to the type of obstacle.
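The scenario randomization described in steps 1.2 and 1.3 can be summarized in a short sketch. The following Python snippet is illustrative only: the function and field names are assumptions, while the numeric ranges (initial speed 2.67-16.67 m/s, pedestrian speed 1-3 m/s, spawn distance 5·V_init, trigger distance TTC·V_init, and TTC in [1.5, 4] s for the longitudinal task or [0.5, 1.5] s for the combined task) are taken from the two training tasks above.

```python
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    v_init: float            # initial host-vehicle speed (m/s)
    obstacle: str            # "pedestrian" or "stationary_car"
    v_ped: float             # pedestrian crossing speed (m/s), 0 for a stationary car
    spawn_distance: float    # obstacle placed 5 * v_init meters ahead of the host vehicle
    ttc: float               # time to collision at which the obstacle is triggered (s)
    trigger_distance: float  # longitudinal distance ttc * v_init at which the pedestrian starts / car appears

def sample_scenario(task: str) -> Scenario:
    """Sample one training episode for the 'longitudinal' or 'combined' task (illustrative)."""
    v_init = random.uniform(2.67, 16.67)
    obstacle = random.choice(["pedestrian", "stationary_car"])
    v_ped = random.uniform(1.0, 3.0) if obstacle == "pedestrian" else 0.0
    ttc_low, ttc_high = (1.5, 4.0) if task == "longitudinal" else (0.5, 1.5)
    ttc = random.uniform(ttc_low, ttc_high)
    return Scenario(v_init, obstacle, v_ped, 5.0 * v_init, ttc, ttc * v_init)
```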
Step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
Firstly, the viewing angle of the semantic segmentation camera is adjusted to a top-down view and the horizontal position of the camera is adjusted to obtain a semantic segmentation bird's-eye view with the host vehicle at the center of the picture. Semantic segmentation assigns each object a pixel category, and each category corresponds to one color in a palette; every object color in a normal semantic segmentation image has its own numeric label, so the final output is a picture made up of blocks of different colors. In Carla, for example, the numeric label of a building is 1, the numeric label of a pedestrian is 4, the numeric label of a street lamp is 5, and so on. In the invention, however, only the numeric labels (colors) of the lane lines, the host vehicle and the obstacles (pedestrians and stationary vehicles) are kept distinct: the numeric label of the road is changed to 1, the numeric label of the lane lines to 2, the numeric label of vehicles to 5, the numeric label of pedestrians to 8, and the numeric label of everything else to 4, i.e. everything else gets the same color. Only 5 different colors therefore remain in the preprocessed image: the road, the lane lines, the vehicles, the pedestrians and everything else. The purpose is to simplify the colors of objects of little relevance around the automobile and to emphasize the automobile itself and the road, lane lines and pedestrians.
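A minimal sketch of the label remapping described in step 2.1.1, assuming the semantic segmentation camera returns an integer label map as a NumPy array. The target labels (road = 1, lane line = 2, other = 4, vehicle = 5, pedestrian = 8) follow the text; the source Carla tag IDs used below are assumptions based on the commonly used Carla palette and should be checked against the simulator version.

```python
import numpy as np

# Source Carla semantic tags (assumed, check against the simulator version in use):
# 4 = pedestrian, 6 = road line, 7 = road, 10 = vehicle.
CARLA_PEDESTRIAN, CARLA_ROAD_LINE, CARLA_ROAD, CARLA_VEHICLE = 4, 6, 7, 10

# Target labels after preprocessing (step 2.1.1):
# road -> 1, lane line -> 2, everything else -> 4, vehicle -> 5, pedestrian -> 8.
REMAP = {CARLA_ROAD: 1, CARLA_ROAD_LINE: 2, CARLA_VEHICLE: 5, CARLA_PEDESTRIAN: 8}
OTHER_LABEL = 4

def preprocess_labels(seg: np.ndarray) -> np.ndarray:
    """Collapse a Carla semantic-segmentation label map to the 5 classes kept by the method."""
    out = np.full_like(seg, OTHER_LABEL)
    for src, dst in REMAP.items():
        out[seg == src] = dst
    return out
```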
Step 2.1.2, information splicing:
After the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network; the convolutional network π(z, p) consists of three convolutional layers and one fully connected layer, where z is the high-dimensional image information and p is the one-dimensional information (host vehicle speed V, relative distance d, relative speed V_rel). As shown in the image preprocessing part of fig. 3, the three convolutional layers all use 5×5 convolution kernels with stride = 2 and the ReLU activation function; they are followed by the first fully connected layer FC1. FC1 processes the flattened output of the third convolutional layer Conv3 into an image feature of size 1×256, which is then spliced with the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) through the Cat concatenation function to obtain a new one-dimensional matrix of size 1×259. This is the state input of the DQN algorithm and serves as the input of the subsequent fully connected network that outputs the actions.
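A sketch of the state encoder of step 2.1.2, assuming a PyTorch implementation: three 5×5 convolutions with stride 2 and ReLU, a fully connected layer FC1 producing the 1×256 image feature, and concatenation with the three sensor values (V, d, V_rel) to form the 1×259 state. The input resolution and the convolution channel counts are not given in the text and are assumptions here.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes (segmentation image, sensor vector) into the 1 x 259 DQN state (illustrative)."""

    def __init__(self, in_channels: int = 1, img_size: int = 84):  # resolution/channels are assumptions
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),  # Conv1
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),           # Conv2
            nn.Conv2d(64, 64, kernel_size=5, stride=2), nn.ReLU(),           # Conv3
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size of Conv3's output
            n_flat = self.conv(torch.zeros(1, in_channels, img_size, img_size)).shape[1]
        self.fc1 = nn.Linear(n_flat, 256)  # FC1: flattened Conv3 output -> 1 x 256 image feature

    def forward(self, image: torch.Tensor, sensors: torch.Tensor) -> torch.Tensor:
        # sensors = (host speed V, relative distance d, relative speed V_rel), shape (batch, 3)
        img_feat = self.fc1(self.conv(image))
        return torch.cat([img_feat, sensors], dim=1)  # Cat concatenation -> shape (batch, 259)
```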
Step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
Under this task the host vehicle has a sufficient longitudinal braking distance, so the action space contains only the longitudinal braking actions of the vehicle (no braking, weak braking, strong braking). As shown in the lower part of fig. 3, the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network. The network has two fully connected layers: the first fully connected layer has 259 input neurons and 128 output neurons, and the second fully connected layer has 128 input neurons, uses the LeakyReLU activation function and has 3 output neurons, which correspond to the three actions and output their Q values.
Step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
In most emergency situations of this task, the longitudinal braking distance of the host vehicle is insufficient, so not only the longitudinal braking actions of the vehicle are trained but also steering-braking actions; the action space contains the longitudinal and lateral actions of the vehicle (no braking, weak braking, strong braking, braking and turning right, braking and turning left). The one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network. The network has two fully connected layers: the first fully connected layer FC2 has 259 input neurons and 128 output neurons, and the second fully connected layer FC3 has 128 input neurons, uses the LeakyReLU activation function and has 5 output neurons, which correspond to the five actions and output their Q values.
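A sketch of the two fully connected Q heads of steps 2.2.1 and 2.2.2, again assuming PyTorch: 259 inputs, a 128-unit hidden layer with LeakyReLU activation, and 3 outputs for the longitudinal task (real Q1 network) or 5 outputs for the combined task (real Q2 network).

```python
import torch.nn as nn

class QHead(nn.Module):
    """Fully connected Q head: 259 -> 128 (LeakyReLU) -> n_actions (illustrative)."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(259, 128),        # first fully connected layer (259 -> 128)
            nn.LeakyReLU(),
            nn.Linear(128, n_actions),  # second fully connected layer: Q values of the actions
        )

    def forward(self, state):
        return self.net(state)

# Q1: no braking / weak braking / strong braking.
q1 = QHead(n_actions=3)
# Q2: no braking / weak braking / strong braking / brake + turn right / brake + turn left.
q2 = QHead(n_actions=5)
```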
Two different Q networks are thus built for the two agents with different tasks; when the automobile faces different emergency situations, the system switches between the real Q networks to output actions and complete the collision avoidance task.
Step 3, designing a multi-objective reward function, whose objectives comprise safety, efficiency and comfort;
in order to achieve stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance, the invention designs the decision reward function using a safe parking area and a dangerous area. When the host vehicle is outside the dangerous area it is encouraged to brake to a stop; when it is inside the dangerous area it is encouraged to steer and change lanes; when the distance falls below the dangerous area, a collision occurs with high probability. The reward function is designed as follows:
Step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
This task is designed for the general case of TTC between 1.5 s and 4 s, defining an ideal parking area (3-6 m) and a dangerous area (0-3 m); the reward function is designed as:
(Formula (3) is given as an image in the original publication.)
where V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the start of each episode, and ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment. When the automobile satisfies any judgment condition of formula (3) except the fourth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. The parking area is defined for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for ride comfort, preventing excessive changes of speed.
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
This task targets emergency situations with TTC between 0.5 s and 1.5 s; the dangerous area is modified to (0.5-3 m) and the ideal parking area is (3-6 m). In order to train the automobile, once it has driven into the dangerous area, to preferentially take longitudinal braking actions to ensure safety when facing a pedestrian crossing the road, and to preferentially take steering-braking actions to achieve avoidance when facing a stationary car, the reward function is designed as follows:
(Formula (4) is given as an image in the original publication.)
where V is the current speed of the car, d is the longitudinal distance between the car and the obstacle ahead, and k1, k2 are the weighting coefficients for encouraging and penalizing steering actions respectively: k1 = 1 when the obstacle is a pedestrian and k1 = 10 when the obstacle is a stationary car, so that when no collision has occurred in the dangerous area the car is encouraged more strongly to steer when facing a stationary car and to brake longitudinally when facing a pedestrian, while k2 = -10; V_init is the initial speed of the car at the start of each episode, ΔV is the difference between the speed V_(t-1) at the previous moment and the speed V_t at the current moment, and d_lat is the lateral distance between the car and the lane center line. The width of the car in the Carla simulation software is 1.8 m, and the host car is considered to have collided with the stationary car or the pedestrian, ending the current episode, when the lateral distance is smaller than 2 m; therefore, when the car satisfies any judgment condition of formula (4) except the fifth and the last one, it obtains the corresponding reward, the training episode ends, and the next episode starts immediately. A constraint on the lateral distance is added to the reward of the dangerous area, so that compared with a pedestrian the car is encouraged more strongly to take steering actions for emergency avoidance when facing a stationary car. The ideal parking area is 3 m to 6 m, within which the car is encouraged to stop by longitudinal braking; if the car takes a steering lane-change action in this area, a larger negative reward is given. In addition, taking a steering action more than 6 m from the obstacle is dangerous and is also given a negative reward.
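The exact piecewise values of formulas (3) and (4) appear only as images in the original publication and are not reproduced here. The sketch below therefore only illustrates the structure described in the text for the combined task: region-dependent terms, a ΔV comfort term, and obstacle-dependent steering weights k1 and k2. Every constant other than k1 and k2 is a placeholder, not the patent's actual value, and the action labels are hypothetical.

```python
def combined_task_reward(v, d, d_lat, delta_v, obstacle, action, collided):
    """Structural sketch of formula (4); all numeric constants except k1, k2 are placeholders."""
    k1 = 10.0 if obstacle == "stationary_car" else 1.0   # encourage steering more for a stationary car
    k2 = -10.0                                            # penalty weight for steering actions
    steering = action in ("brake_left", "brake_right")    # hypothetical action labels

    R_COLLISION, R_STOPPED, R_STEP = -100.0, 50.0, -0.1   # placeholder values only

    if collided:
        return R_COLLISION
    if 3.0 <= d <= 6.0:              # ideal parking area: stop by longitudinal braking
        if steering:
            return k2                 # larger negative reward for changing lanes here
        return R_STOPPED if v < 0.1 else R_STEP - abs(delta_v)   # comfort term on speed change
    if 0.5 <= d < 3.0:               # dangerous area: lateral evasion weighted by obstacle type
        return k1 * d_lat if steering else R_STEP - abs(delta_v)
    if d > 6.0 and steering:         # steering far from the obstacle is dangerous
        return k2
    return R_STEP - abs(delta_v)
```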
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
Step 4.1, DQN parameter setting: several important hyper-parameters are configured in the DQN algorithm. The discount factor γ adjusts the relative weight of near-term and long-term returns in reinforcement learning; its value lies in (0, 1) and should be as large as possible provided the algorithm still converges, and it is taken as 0.95 in this model. The learning rate lr of the fully connected neural network is set in this model. The buffer size is the capacity of the experience replay pool used in DQN to improve data efficiency; its value is usually chosen in [10000, 100000], with the specific value selected according to the performance of the computer. The exploration time proportion and the final epsilon jointly determine the balance between exploration and exploitation of the DQN: epsilon falls from 1 to the final epsilon over the exploration period, and in this model the final epsilon is 0.05, with epsilon decaying slowly from 1 to 0.05 over 10000 episodes.
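The hyper-parameters of step 4.1 can be gathered into a single configuration object. In the sketch below, γ = 0.95 and the ε schedule (1 → 0.05 over 10000 episodes, assumed linear) follow the description above; the learning rate, buffer size, batch size and target-update interval are example values consistent with the stated ranges, since the exact values are not given in the text.

```python
from dataclasses import dataclass

@dataclass
class DQNConfig:
    gamma: float = 0.95              # discount factor, in (0, 1), as large as convergence allows
    lr: float = 1e-4                 # learning rate of the fully connected network (example value)
    buffer_size: int = 50000         # experience replay pool size, typically in [10000, 100000]
    batch_size: int = 64             # minibatch size (assumption, not stated in the text)
    eps_start: float = 1.0           # initial exploration rate
    eps_final: float = 0.05          # final exploration rate
    eps_decay_episodes: int = 10000  # episodes over which epsilon decays from 1 to 0.05
    target_update_steps: int = 1000  # C: steps between target-network updates (assumption)

def epsilon_at(episode: int, cfg: DQNConfig) -> float:
    """Linear epsilon decay from eps_start to eps_final (the decay shape is an assumption)."""
    frac = min(episode / cfg.eps_decay_episodes, 1.0)
    return cfg.eps_start + frac * (cfg.eps_final - cfg.eps_start)
```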
Step 4.2, iterative optimization training of the longitudinal braking collision avoidance control task; the specific algorithm steps are as follows:
Step 4.2.1, initialization. First initialize the experience replay pool D1 with capacity N; initialize the real Q1 network with randomly generated weights ω1; initialize the target Q1 network with weights ω1' = ω1;
Step 4.2.2, loop over each episode, episode = 1, 2, …, M:
Step 4.2.3, at the start of each episode, initialize the state S1;
Step 4.2.4, generate the action a_t with an ε-greedy policy: select a random action with probability ε, otherwise select a_t = argmax_a Q(s_t, a; ω1);
Step 4.2.5, the host vehicle executes the action a_t, interacts with the environment in Carla, and receives the immediate reward r_t and the new state S_(t+1);
Step 4.2.6, store the transition sample (S_t, a_t, r_t, S_(t+1)) in the experience replay pool D1 as a data set for training the neural network;
Step 4.2.7, randomly sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from the experience replay pool D1;
Step 4.2.8, if step j+1 reaches a terminal state, let y_j = r_j; otherwise let y_j = r_j + γ·max_a' Q(s_(j+1), a'; ω1');
Step 4.2.9, the loss function L = (y_j - Q(s_j, a_j; ω1))^2 is the mean square error between the target Q value and the current Q value; the loss value is minimized during training, and the parameters of the real Q1 network are updated by gradient descent of L with respect to ω1;
Step 4.2.10, every C steps, update the target Q1 network by copying the parameters of the real Q1 network: ω1' = ω1;
Step 4.2.11, repeat the training loop until the algorithm converges (a code sketch of this loop is given below);
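A compact sketch of the training loop of steps 4.2.1 to 4.2.11, assuming PyTorch; the env object (reset/step) stands in for the interaction with Carla and is an assumption, as is the cfg object, which carries the hyper-parameters of step 4.1 (gamma, lr, buffer_size, batch_size, eps_start, eps_final, eps_decay_episodes, target_update_steps). The same loop applies to the Q2 network of step 4.3 with five actions instead of three.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_dqn(env, q_net, target_net, cfg, n_actions=3, n_episodes=10000):
    """Illustrative DQN loop for the real Q1 network (use n_actions=5 for the Q2 network of step 4.3)."""
    target_net.load_state_dict(q_net.state_dict())             # step 4.2.1: omega' = omega
    replay = deque(maxlen=cfg.buffer_size)                     # experience replay pool D1
    optimizer = torch.optim.Adam(q_net.parameters(), lr=cfg.lr)
    step = 0
    for episode in range(n_episodes):                          # step 4.2.2
        state = env.reset()                                    # step 4.2.3: initial state, 1 x 259 tensor
        done = False
        eps = max(cfg.eps_final,
                  cfg.eps_start - (cfg.eps_start - cfg.eps_final) * episode / cfg.eps_decay_episodes)
        while not done:
            if random.random() < eps:                          # step 4.2.4: epsilon-greedy action
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
            next_state, reward, done = env.step(action)        # step 4.2.5: interact with Carla
            replay.append((state, action, reward, next_state, float(done)))  # step 4.2.6
            state = next_state
            if len(replay) >= cfg.batch_size:
                batch = random.sample(replay, cfg.batch_size)  # step 4.2.7: random minibatch
                s = torch.stack([b[0] for b in batch])
                a = torch.tensor([b[1] for b in batch])
                r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                s2 = torch.stack([b[3] for b in batch])
                d = torch.tensor([b[4] for b in batch])
                with torch.no_grad():                          # step 4.2.8: targets y_j
                    y = r + cfg.gamma * target_net(s2).max(dim=1).values * (1.0 - d)
                q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q, y)                        # step 4.2.9: mean squared error
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            step += 1
            if step % cfg.target_update_steps == 0:            # step 4.2.10: copy omega -> omega'
                target_net.load_state_dict(q_net.state_dict())
```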
Step 4.3, iterative optimization training of the combined lateral and longitudinal braking collision avoidance control task; the specific algorithm steps are as follows:
Step 4.3.1, initialization. First initialize the experience replay pool D2 with capacity N; initialize the real Q2 network with randomly generated weights ω2; initialize the target Q2 network with weights ω2' = ω2;
Step 4.3.2, loop over each episode, episode = 1, 2, …, M:
Step 4.3.3, at the start of each episode, initialize the state S1;
Step 4.3.4, generate the action a_t with an ε-greedy policy: select a random action with probability ε, otherwise select a_t = argmax_a Q(s_t, a; ω2);
Step 4.3.5, the host vehicle executes the action a_t, interacts with the environment in Carla, and receives the immediate reward r_t and the new state S_(t+1);
Step 4.3.6, store the transition sample (S_t, a_t, r_t, S_(t+1)) in the experience replay pool D2 as a data set for training the neural network;
Step 4.3.7, randomly sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from the experience replay pool D2;
Step 4.3.8, if step j+1 reaches a terminal state, let y_j = r_j; otherwise let y_j = r_j + γ·max_a' Q(s_(j+1), a'; ω2');
Step 4.3.9, the loss function L = (y_j - Q(s_j, a_j; ω2))^2 is the mean square error between the target Q value and the current Q value; the loss value is minimized during training, and the parameters of the real Q2 network are updated by gradient descent of L with respect to ω2;
Step 4.3.10, every C steps, update the target Q2 network by copying the parameters of the real Q2 network: ω2' = ω2;
Step 4.3.11, repeat the training loop until the algorithm converges;
Step 4.4, the neural network parameters of the real networks Q1 and Q2 from the two tasks are saved as an online neural network controller. As shown in fig. 4, when TTC is between 1.5 s and 4 s, the real network Q1 is selected as the controller and outputs longitudinal braking actions; when TTC is between 0.5 s and 1.5 s, the real network Q2 is selected as the controller and outputs lateral or longitudinal control actions to accomplish the collision avoidance task.
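A minimal sketch of the online controller switching of step 4.4 and fig. 4, assuming the QHead networks above have been loaded with the saved parameters and that TTC has already been estimated from the relative distance and relative speed; the action names are illustrative labels for the discrete actions, not identifiers from the patent.

```python
import torch

LONGITUDINAL_ACTIONS = ["no_brake", "weak_brake", "strong_brake"]
COMBINED_ACTIONS = LONGITUDINAL_ACTIONS + ["brake_right", "brake_left"]

def select_action(state: torch.Tensor, ttc: float, q1, q2) -> str:
    """Pick the controller by TTC: Q1 for 1.5-4 s (longitudinal), Q2 for 0.5-1.5 s (lateral/longitudinal)."""
    with torch.no_grad():
        if 1.5 <= ttc <= 4.0:
            return LONGITUDINAL_ACTIONS[q1(state.unsqueeze(0)).argmax(dim=1).item()]
        if 0.5 <= ttc < 1.5:
            return COMBINED_ACTIONS[q2(state.unsqueeze(0)).argmax(dim=1).item()]
        return "no_brake"  # outside the emergency TTC range: no intervention (assumption)
```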

Claims (1)

1. The automobile emergency collision avoidance control method based on DQN deep reinforcement learning is characterized by comprising subtask design, state and action space design, multi-objective reward function design, and DQN parameter setting and training; the subtask design defines different training tasks for different emergency situations; the state and action space design first splices the image and the sensor information to form the state input, and then designs different lateral and longitudinal action spaces for the different training tasks; the multi-objective reward function design gives the avoidance strategy of the automobile safety, efficiency and comfort; finally, the DQN parameters are set according to the actual training task, and the network parameters are iteratively optimized by cyclic training;
The method comprises the following steps:
step 1, designing a subtask, wherein the process comprises the following substeps:
step 1.1, establishing a Markov model of an emergency collision avoidance process;
step 1.2, designing a longitudinal brake control training task:
in this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, and the pedestrian or stationary car is placed at a distance of 5·V_init m from the host vehicle; pedestrians randomly cross the road from either side, with pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, while a stationary car appears in the center of the lane; the time to collision TTC is randomly selected between 1.5 s and 4 s, and the pedestrian starts moving (or the stationary car appears) when the longitudinal distance between the host vehicle and the pedestrian or stationary car equals TTC·V_init m; in the TTC interval of this task the host vehicle faces a general scenario and has a sufficient longitudinal braking distance, so the host vehicle is only trained to take longitudinal braking actions to avoid pedestrians crossing the road or stationary vehicles;
step 1.3, designing a transverse and longitudinal combined brake control training task:
in this task, the initial speed V_init of the host vehicle is randomly selected between 2.67 m/s and 16.67 m/s, the pedestrian or stationary automobile is placed at a distance of (5·V_init) m from the host vehicle, the pedestrians randomly cross the road from either end of the road with a pedestrian speed V_ped randomly selected between 1 m/s and 3 m/s, and the stationary automobile appears in the center of the lane; the time to collision TTC is randomly selected from 0.5 s to 1.5 s, and the pedestrian starts moving, or the stationary automobile appears, when the longitudinal distance between the host vehicle and the pedestrian or the stationary automobile is (TTC·V_init) m; within the TTC interval of this task the host vehicle faces an emergency burst scene, so when the host vehicle drives into the dangerous area it is trained to preferentially take a longitudinal braking action to ensure safety when facing a pedestrian crossing the road, and to preferentially take a steering braking action to realize avoidance when facing a stationary automobile, so that the braking action of the host vehicle is more targeted to different obstacles (a sketch of the scenario randomization of both tasks is given below);
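A minimal sketch of the scenario randomization of steps 1.2 and 1.3; the numeric ranges come from the text, while the Scenario dataclass, the task labels and the field names are assumptions made only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    v_init: float            # initial host speed, m/s
    v_ped: float             # pedestrian speed, m/s (unused for a stationary automobile)
    ttc: float               # time to collision, s
    spawn_distance: float    # obstacle placed 5 * v_init metres ahead of the host
    trigger_distance: float  # obstacle starts moving / appears at ttc * v_init metres


def sample_scenario(task: str) -> Scenario:
    """Randomize one training round for the 'longitudinal' or 'combined' task."""
    v_init = random.uniform(2.67, 16.67)
    v_ped = random.uniform(1.0, 3.0)
    # longitudinal task: TTC in [1.5, 4] s; combined task: TTC in [0.5, 1.5] s
    ttc = random.uniform(1.5, 4.0) if task == "longitudinal" else random.uniform(0.5, 1.5)
    return Scenario(v_init, v_ped, ttc,
                    spawn_distance=5.0 * v_init,
                    trigger_distance=ttc * v_init)
```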
step 2, designing a state space and an action space, wherein the process comprises the state space design and the action space design;
step 2.1, designing a state space, wherein the process comprises image preprocessing and information splicing:
step 2.1.1, preprocessing an image:
firstly, the view angle of the semantic segmentation camera is adjusted to a top-down view to obtain a semantic segmentation bird's-eye view centered on the position of the host vehicle; semantic segmentation assigns a pixel category to each object, and each category corresponds to one color in a palette, so the output segmentation picture is a picture made up of blocks of different colors; in the invention, however, only the digital labels of the colors of the lane lines, the host vehicle and the obstacles (pedestrians and stationary automobiles) are retained, and the digital labels of the colors of everything else are changed to the same number, i.e. only 5 different colors remain in the preprocessed image: lane lines, host vehicle, obstacles (pedestrians and stationary automobiles) and everything else (a sketch of this label remapping is given below);
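A minimal sketch of the label remapping of step 2.1.1: keep lane lines, host vehicle, pedestrians and stationary automobiles, and collapse every other class into one "other" label; the concrete source label IDs are placeholders, not the actual Carla palette values.

```python
import numpy as np

# hypothetical source class IDs -> compact IDs used as state input
KEPT_LABELS = {6: 1,    # lane line
               10: 2,   # host vehicle
               4: 3,    # pedestrian
               14: 4}   # stationary automobile
OTHER = 0


def remap_segmentation(label_image: np.ndarray) -> np.ndarray:
    """Map a per-pixel class-ID image to the 5 classes kept in the preprocessed image."""
    out = np.full_like(label_image, OTHER)
    for src, dst in KEPT_LABELS.items():
        out[label_image == src] = dst
    return out
```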
Step 2.1.2, information splicing:
after the image is preprocessed in step 2.1.1, the feature information of the image is extracted by a convolutional neural network comprising three convolutional layers and one fully connected layer: the convolutional layers extract the feature information of the semantic segmentation top view, and the fully connected layer flattens the image information into a one-dimensional matrix; the one-dimensional image feature matrix and the one-dimensional sensor information (the speed of the automobile, the relative distance between the automobile and the obstacle, and the relative speed between the automobile and the obstacle) are then spliced by the Cat splicing function to obtain a new one-dimensional matrix, namely the state input of the DQN algorithm (a sketch of this state encoder is given below);
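A minimal sketch of the state encoder of step 2.1.2: three convolutional layers, one fully connected layer, then concatenation with the three-element sensor vector via torch.cat; the channel counts, kernel sizes and feature dimension are assumptions.

```python
import torch
import torch.nn as nn


class StateEncoder(nn.Module):
    """Three conv layers + one fully connected layer, then torch.cat with the sensor vector."""

    def __init__(self, feature_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # fully connected layer that flattens the image features into a 1-D vector
        self.fc = nn.LazyLinear(feature_dim)

    def forward(self, image: torch.Tensor, sensors: torch.Tensor) -> torch.Tensor:
        # sensors = (ego speed, relative distance to obstacle, relative speed to obstacle)
        img_feat = torch.relu(self.fc(self.conv(image)))
        return torch.cat([img_feat, sensors], dim=1)   # the "Cat splicing" of step 2.1.2
```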
step 2.2, designing an action space;
step 2.2.1, action space of a longitudinal brake control training task:
under this task the host vehicle has a sufficient longitudinal braking distance, so the action space only comprises the longitudinal braking actions of the automobile (no braking, weak braking, strong braking); the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q1 network, which has two fully connected layers and outputs the Q values of the three actions;
step 2.2.2, action space of the transverse and longitudinal combined brake control training task:
in most scenes of this task the automobile faces an emergency situation, so not only the longitudinal braking actions of the automobile are trained but also the steering braking actions; the action space comprises the longitudinal and transverse actions of the automobile (no braking, weak braking, strong braking, braking and turning right, braking and turning left); the one-dimensional matrix spliced in step 2.1.2 is input as the state into the real Q2 network, which has two fully connected layers and outputs the Q values of the five actions (a sketch of the Q1 and Q2 network heads is given below);
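A minimal sketch of the Q-value heads of steps 2.2.1 and 2.2.2: two fully connected layers on the spliced state vector, with 3 outputs for the real Q1 network and 5 outputs for the real Q2 network; the hidden-layer width and the state dimension (matching the encoder sketch above) are assumptions.

```python
import torch
import torch.nn as nn


class QHead(nn.Module):
    """Two fully connected layers mapping the spliced state to one Q value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


# Q1: no braking, weak braking, strong braking
q1_net = QHead(state_dim=256 + 3, n_actions=3)
# Q2: no braking, weak braking, strong braking, brake + turn right, brake + turn left
q2_net = QHead(state_dim=256 + 3, n_actions=5)
```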
Step 3, designing a multi-objective reward function, whose objective is to make the learned strategy of the automobile balance safety, efficiency and comfort;
in order to achieve both stable vehicle control and safe obstacle avoidance during emergency obstacle avoidance control, the invention designs the reward function by dividing the space ahead into an ideal parking area and a dangerous area: when the host vehicle is outside the dangerous area it is encouraged to take braking measures and stop, when the host vehicle is inside the dangerous area it is encouraged to take steering lane-change measures, and when the distance becomes smaller than the dangerous area the vehicle will collide with a large probability; the reward function is designed as follows:
step 3.1, designing a reward function of a training task of longitudinal emergency collision avoidance control:
this task targets the general case of a collision time TTC between 1.5 seconds and 4 seconds; an ideal parking area (3-6 m) and a dangerous area (0-3 m) are defined, and the reward function is designed as:
[Formula (1): piecewise reward function for the longitudinal braking task, reproduced only as image FDA0004096917220000031 in the original application]
wherein V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, V_init is the initial speed of the automobile at the beginning of each round, and ΔV is the difference between the speed V_{t-1} of the automobile at the previous moment and the speed V_t at the current moment; when the automobile satisfies any judgment condition of formula (1) except the fourth and the last one, it obtains the corresponding reward, the current round of training ends, and the next round of training starts immediately; the parking area is divided for safety and road traffic efficiency, and the term on the speed change between two adjacent moments is for the comfort of the automobile, preventing excessive speed changes (a sketch of a reward with this structure is given below);
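A hypothetical sketch of a reward with the structure described for formula (1); since the formula itself is reproduced only as an image, every numeric reward below is a placeholder, and only the 3-6 m parking area, the 0-3 m dangerous area, the comfort term on ΔV and the terminal/non-terminal split are taken from the text.

```python
def longitudinal_reward(v, d, v_init, delta_v, collided):
    """Placeholder reward for the longitudinal braking task; values are illustrative only."""
    if collided or d <= 0.0:
        return -10.0, True                             # collision: large negative reward, round ends
    if v == 0.0 and 3.0 <= d <= 6.0:
        return 10.0, True                              # stopped inside the ideal parking area (3-6 m)
    if v == 0.0 and d > 6.0:
        return -5.0, True                              # stopped too early: hurts traffic efficiency
    if d < 3.0:
        return -1.0, False                             # inside the dangerous area (0-3 m), round continues
    return -abs(delta_v) / max(v_init, 1e-3), False    # comfort term: penalize large speed changes
```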
Step 3.2, designing a reward function of the transverse and longitudinal combined emergency collision avoidance control training task:
this task targets emergency situations with a collision time TTC between 0.5 seconds and 1.5 seconds; the dangerous area is modified to (0.5-3 m) and the ideal parking area to (3-6 m); in order to train the automobile, when it drives into the dangerous area, to preferentially take a longitudinal braking action to ensure safety when facing a pedestrian crossing the road, and to preferentially take a steering braking action to realize avoidance when facing a stationary automobile, the reward function is designed as:
[Formula (2): piecewise reward function for the transverse and longitudinal combined braking task, reproduced only as image FDA0004096917220000032 in the original application]
wherein V is the current speed of the automobile, d is the longitudinal distance between the automobile and the obstacle ahead, and k1, k2 are the weighting factors for encouraging and punishing steering actions respectively: k1 = 1 when the obstacle is a pedestrian, k1 = 10 when the obstacle is a stationary automobile, and k2 = -10, so that inside the dangerous area, before any collision occurs, the automobile is encouraged more strongly to take a steering action when facing a stationary automobile and preferentially takes a longitudinal braking action when facing a pedestrian; V_init is the initial speed of the automobile at the beginning of each round, ΔV is the difference between the speed V_{t-1} of the automobile at the previous moment and the speed V_t at the current moment, and d_lat is the transverse distance between the automobile and the stationary automobile or pedestrian; the width of the automobile in the Carla simulation software is 1.8 m, and the host vehicle is considered to collide with the stationary automobile or pedestrian, ending the current round, when the transverse distance is smaller than 2 m; therefore, when the automobile satisfies any judgment condition of formula (2) except the fifth and the last one, it obtains the corresponding reward, the current round of training ends, and the next round of training starts immediately; a limitation on the transverse distance is added to the reward in the dangerous area so that, compared with a pedestrian, the automobile is encouraged more to take a steering action for emergency avoidance when facing a stationary automobile; the ideal parking area is 3-6 m, where the automobile is encouraged to brake longitudinally and stop, and a large negative reward is given if the automobile takes a steering lane-change action in this area; in addition, taking a steering action more than 6 m away from the obstacle is dangerous, so a negative reward is also given (a sketch of these steering-related terms is given below);
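A hypothetical sketch of the steering-related terms described for formula (2); the branch rewards are placeholders, and only the weights k1 and k2, the 2 m lateral threshold and the area boundaries come from the text.

```python
def steering_terms(d, d_lat, obstacle_is_pedestrian, took_steering_action):
    """Placeholder for the steering-related terms of the combined task; values are illustrative only."""
    k1 = 1.0 if obstacle_is_pedestrian else 10.0     # steering is encouraged more for a stationary automobile
    k2 = -10.0                                       # steering is punished where it is unwanted
    if d_lat < 2.0 and d <= 0.0:
        return -10.0, True          # passed the obstacle with less than 2 m lateral gap: counted as collision
    if 0.5 <= d < 3.0 and took_steering_action:
        return k1, False            # dangerous area: steering rewarded, weighted by obstacle type
    if 3.0 <= d <= 6.0 and took_steering_action:
        return k2, False            # ideal parking area: lane changing is punished
    if d > 6.0 and took_steering_action:
        return k2, False            # steering far from the obstacle is dangerous, punished
    return 0.0, False               # otherwise longitudinal terms analogous to formula (1) apply
```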
Step 4, setting and training DQN parameters, wherein the process comprises algorithm environment configuration and iterative optimization training;
step 4.1, setting the DQN parameters and designing several important hyper-parameters of the DQN algorithm: first, the discount factor γ, whose value range is (0, 1), is taken as large as possible on the premise that the algorithm can converge, and is set to 0.95 in this model; second, the size of the experience replay pool (Buffer) used in DQN to improve data efficiency is usually chosen within [10000, 100000] and appropriately enlarged for complex tasks so that a sufficient number of earlier steps is taken into account, and is set to 20000 in this model; finally, the balance between exploration and exploitation in DQN is jointly determined by the final value of epsilon and the time over which epsilon decays from 1 to the final epsilon; a final epsilon of about 0.1 is usually taken for complex tasks, and in this model the final epsilon is set to 0.05 (a sketch of this hyper-parameter configuration is given below);
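A minimal sketch of the hyper-parameter configuration of step 4.1; only the discount factor, the replay-pool size and the final epsilon are taken from the text, and the remaining fields (learning rate, batch size, decay steps, target update period) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DQNConfig:
    gamma: float = 0.95              # discount factor in (0, 1), as large as convergence allows
    buffer_size: int = 20000         # experience replay pool, typically within [10000, 100000]
    epsilon_start: float = 1.0       # fully random exploration at the start of training
    epsilon_final: float = 0.05      # final exploration rate used in this model
    epsilon_decay_steps: int = 50000 # assumed length of the decay from 1 to epsilon_final
    batch_size: int = 64             # assumed minibatch size
    learning_rate: float = 1e-3      # assumed optimizer step size
    target_update_c: int = 1000      # assumed period C for copying to the target network


def epsilon_at(step: int, cfg: DQNConfig) -> float:
    """Linear decay of epsilon from epsilon_start to epsilon_final."""
    frac = min(step / cfg.epsilon_decay_steps, 1.0)
    return cfg.epsilon_start + frac * (cfg.epsilon_final - cfg.epsilon_start)
```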
Step 4.2, iterative optimization training of a longitudinal braking collision avoidance control task:
initializing the hyper-parameters and performing cyclic training: in each training round, the real Q1 network receives the state input and outputs the Q values of the three actions; the agent selects an action with the ε-greedy algorithm, obtains a reward and reaches the next state; the state, action, reward, next state and done flag are packed into a five-tuple and stored in the experience replay pool as one experience; the target Q1 network randomly extracts a batch of experiences from the experience pool, and the mean square error between the target Q value it yields and the Q value output by the real Q1 network is the LOSS of the neural network; the optimization goal of the neural network is to minimize this LOSS so that the output of the real Q1 network gets as close as possible to the output Q value of the target Q1 network, after which the next action is executed and the training loop continues; the goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle while ensuring safety, efficiency and comfort (a sketch of one such training round is given below);
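A minimal sketch of one training round of step 4.2, combining ε-greedy action selection, replay storage and the dqn_update sketch given earlier after step 4.3.11 of the description; the environment object and its reset/step interface are assumptions standing in for the Carla scenario.

```python
import random
from collections import deque

import torch


def run_round(env, real_q1, target_q1, optimizer, replay_pool: deque,
              epsilon: float, gamma: float, step: int, batch_size: int = 64, c: int = 1000):
    """One training round of the longitudinal braking task (illustrative interface)."""
    state = env.reset()                                   # new randomized scenario
    done = False
    while not done:
        if random.random() < epsilon:                     # epsilon-greedy exploration
            action = random.randrange(3)                  # 3 longitudinal braking actions
        else:
            with torch.no_grad():
                q = real_q1(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1))
        next_state, reward, done = env.step(action)       # reward as described for formula (1)
        replay_pool.append((state, action, reward, next_state, float(done)))  # five-tuple experience
        dqn_update(real_q1, target_q1, optimizer, replay_pool,
                   batch_size, gamma, step, c)            # minimize the MSE LOSS
        state = next_state
        step += 1
    return step
```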
step 4.3, iterative optimization training of a transverse and longitudinal combined braking collision avoidance control task:
initializing the hyper-parameters and performing cyclic training: in each training round, the real Q2 network receives the state input and outputs the Q values of the five actions; the agent selects an action with the ε-greedy algorithm, obtains a reward and reaches the next state; the state, action, reward, next state and done flag are packed into a five-tuple and stored in the experience replay pool as one experience; the target Q2 network randomly extracts a batch of experiences from the experience pool, and the mean square error between the target Q value it yields and the Q value output by the real Q2 network is the LOSS of the neural network; the optimization goal of the neural network is to minimize this LOSS so that the output of the real Q2 network gets as close as possible to the output Q value of the target Q2 network, after which the next action is executed and the training loop continues; the goal of the algorithm is to train the agent to learn a reward-maximizing strategy that avoids collision with the target obstacle while ensuring safety, efficiency and comfort;
and 4.4, saving the neural network parameters of the target Q1 network and the target Q2 network from the two tasks for use as the online neural network controller: when the TTC is between 1.5 s and 4 s, the target Q1 network is selected as the controller and outputs a longitudinal braking action; when the TTC is between 0.5 s and 1.5 s, the target Q2 network is selected as the controller and outputs a transverse or longitudinal control action to accomplish the collision avoidance task.
CN202310168297.1A 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning Pending CN116176572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310168297.1A CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310168297.1A CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116176572A true CN116176572A (en) 2023-05-30

Family

ID=86451997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310168297.1A Pending CN116176572A (en) 2023-02-27 2023-02-27 Automobile emergency collision avoidance control method based on DQN deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116176572A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116822655A (en) * 2023-08-24 2023-09-29 南京邮电大学 Acceleration method for automatically controlled training process
CN116822655B (en) * 2023-08-24 2023-11-24 南京邮电大学 Acceleration method for automatically controlled training process

Similar Documents

Publication Publication Date Title
CN111081065B (en) Intelligent vehicle collaborative lane change decision model under road section mixed traveling condition
CN113291308B (en) Vehicle self-learning lane-changing decision-making system and method considering driving behavior characteristics
CN112233413B (en) Multilane space-time trajectory optimization method for intelligent networked vehicle
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN111311945A (en) Driving decision system and method fusing vision and sensor information
CN110525428B (en) Automatic parking method based on fuzzy depth reinforcement learning
CN113954837B (en) Deep learning-based lane change decision-making method for large-scale commercial vehicle
CN111679660B (en) Unmanned deep reinforcement learning method integrating human-like driving behaviors
US11263465B2 (en) Low-dimensional ascertaining of delimited regions and motion paths
CN115039142A (en) Method and apparatus for predicting intent of vulnerable road user
Zong et al. Obstacle avoidance for self-driving vehicle with reinforcement learning
EP3705367B1 (en) Training a generator unit and a discriminator unit for collision-aware trajectory prediction
CN114564016A (en) Navigation obstacle avoidance control method, system and model combining path planning and reinforcement learning
US11801864B1 (en) Cost-based action determination
CN116176572A (en) Automobile emergency collision avoidance control method based on DQN deep reinforcement learning
CN111899509B (en) Intelligent networking automobile state vector calculation method based on vehicle-road information coupling
CN114399743A (en) Method for generating future track of obstacle
CN114103893A (en) Unmanned vehicle trajectory prediction anti-collision method
Guo et al. Toward human-like behavior generation in urban environment based on Markov decision process with hybrid potential maps
CN115257819A (en) Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment
CN111824182A (en) Three-axis heavy vehicle self-adaptive cruise control algorithm based on deep reinforcement learning
CN115973169A (en) Driving behavior decision method based on risk field model, electronic device and medium
CN115107767A (en) Artificial intelligence-based automatic driving brake and anti-collision control method
DE102022109385A1 (en) Reward feature for vehicles
Guo et al. Research on integrated decision control algorithm for autonomous vehicles under multi-task hybrid constraints in intelligent transportation scenarios

Legal Events

Date Code Title Description
PB01 Publication