CN114880938A - Method for realizing decision of automatically driving automobile behavior - Google Patents
- Publication number
- CN114880938A (Application CN202210528980.7A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- quantile
- surrounding
- implicit
- network
- Prior art date
- Legal status (the status listed is an assumption and is not a legal conclusion)
- Granted
Classifications
- G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model (under G06F30/20, Design optimisation, verification or simulation; G06F30/00, Computer-aided design [CAD])
- G06F30/15 — Vehicle, aircraft or watercraft design (under G06F30/10, Geometric CAD)
- G06N3/045 — Combinations of networks (under G06N3/04, Architecture, e.g. interconnection topology; G06N3/02, Neural networks)
- G06N3/08 — Learning methods (under G06N3/02, Neural networks)
- G06Q10/0637 — Strategic management or analysis, e.g. setting a goal or target of an organisation; planning actions based on goals; analysis or evaluation of effectiveness of goals (under G06Q10/06, Resources, workflows, human or project management)
- G06Q50/40
- Y02T10/40 — Engine management systems (under Y02T10/10, Internal combustion engine [ICE] based vehicles; climate change mitigation technologies related to transportation)
Abstract
The invention relates to a method for realizing behavior decision of an automatic driving automobile, belonging to the technical field of automatic driving automobiles. The method comprises the following steps: S1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors; S2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function; S3: optimizing, by means of a neural network, the implicit quantile network model constructed in step S2; S4: generating a behavior decision with risk-perception capability by combining the Wang function with the reward distribution information output by the implicit quantile network model optimized in step S3. The method can sense risks caused by uncertainty factors in the environment, and improves the safety of the automatic driving automobile at signal lamp crossroads.
Description
Technical Field
The invention belongs to the technical field of automatic driving automobiles, and relates to a method for realizing behavior decision of an automatic driving automobile.
Background
When an automatic driving automobile runs in a real environment, its decision-making system needs to consider various environmental factors, including surrounding vehicles, pedestrians and the like. However, how to ensure driving safety under complex driving conditions remains unsolved. Especially at signal lamp crossroads, accounting in the behavior decision system for violations by surrounding vehicles and pedestrians, such as running a red light, is very important for improving the safety of automatic driving automobiles.
At present, decision-making methods for automatic driving automobiles at crossroads mainly include: rule-based decision methods, decision methods based on partially observable Markov decision processes (POMDPs), and decision methods based on deep reinforcement learning. To improve the adaptability of automatic driving decision systems to complex traffic scenes, methods based on deep reinforcement learning are now widely adopted. Compared with rule-based decision methods, such methods avoid the complicated design steps and parameter tuning required by rule-based algorithms. In addition, they can overcome the difficulty POMDP-based methods have in scaling to large decision problems. Generally, a decision method based on deep reinforcement learning generates driving data through continuous interaction between the automobile and the environment, and autonomously learns a decision strategy adapted to the complex environment; representative methods include the deep Q network (DQN) and soft actor-critic (SAC). However, these methods rarely consider violations by traffic participants at signal lamp crossroads, so it is difficult to ensure the driving safety of vehicles at the intersection.
Therefore, a safety decision method capable of considering the violation of the traffic participants is needed to ensure the safety of the autonomous vehicle.
Disclosure of Invention
In view of this, the present invention provides a method for implementing a behavior decision of an autonomous vehicle, which can sense risks caused by uncertainty factors in an environment and can improve the safety of the autonomous vehicle passing through a signal lamp intersection.
In order to achieve the purpose, the invention provides the following technical scheme:
a method for realizing automatic driving automobile behavior decision comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors;
s2: constructing an Implicit Quantile Network (IQN) model, including constructing a state space, an action space and a reward function;
S3: optimizing, by means of a neural network, the Implicit Quantile Network (IQN) model constructed in step S2;
S4: generating a behavior decision with risk-perception capability by combining the Wang function with the reward distribution information output by the IQN model optimized in step S3.
Further, step S1 specifically includes the following steps:
s11: setting a pedestrian model: describing a pedestrian motion track in a simulation training scene by adopting the following kinematics model;
wherein v_p is the pedestrian speed, ω_p is the angular velocity, and x_p, y_p, θ_p are respectively the abscissa, ordinate and course angle of the pedestrian's center of gravity; the dotted quantities are respectively the derivatives of x_p, y_p, θ_p;
s12: setting a surrounding vehicle model, and defining the motion of the vehicle and the surrounding vehicle in a simulation training scene, wherein the motion is described by the following equation:
wherein x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the center-of-mass speed, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, and a_c is the vehicle acceleration; the dotted quantities are respectively the derivatives of x, y, θ, v;
to enable the surrounding vehicles in the simulated training scenario to interact with the host vehicle, it is provided that the surrounding motor vehicles are controlled by a Velocity Difference Model (Velocity Difference Model):
a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]
wherein k is a sensitivity coefficient, Δv is the relative speed between the own vehicle and the surrounding vehicle, λ is a speed-difference reaction coefficient, and V_1, V_2, C_1, C_2 are user-defined parameters that can generally be obtained through experiments; x_front is the lateral coordinate of the surrounding vehicle's center of mass, L_length,front is the body length of the surrounding vehicle, and x is the lateral coordinate of the own vehicle's center of mass;
s13: setting behavior types of surrounding motor vehicles and pedestrians;
In order to simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set to four categories: regular vehicles, regular pedestrians, illegal vehicles and illegal pedestrians. Specifically, a regular vehicle complies with the traffic light rules, while an illegal vehicle does not, i.e., it may run a red light; similarly, a regular pedestrian complies with the traffic light rules, while an illegal pedestrian does not and may cross against a red light. When the simulation environment runs, one of the four categories is randomly drawn at each simulation step and added to the simulation environment.
S14: initializing an environment: randomly initializing the initial state of a signal lamp, the initial speed, the position and the target speed of surrounding motor vehicles; the simulation environment outputs environment information E at each simulation time t, which is defined as:
E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light}, si = 1, 2, ..., ns, pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}
wherein the subscript e denotes the own vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e, θ_e are respectively the lateral coordinate, longitudinal coordinate, center-of-mass speed and yaw angle of the own vehicle's center of mass; x_si, y_si, v_si, θ_si are the same quantities for the surrounding vehicle's center of mass; x_pi, y_pi, v_pi, θ_pi are the same quantities for the pedestrian's center of mass; traffic_light denotes the traffic signal state.
Further, in step S2,
1) The constructed state space S includes: the position of the own vehicle (x_e, y_e), its speed v_e and yaw angle θ_e; the position of each surrounding vehicle relative to the own vehicle (Δx_si, Δy_si), its relative speed Δv_si and relative yaw angle Δθ_si; the position of each surrounding pedestrian relative to the own vehicle (Δx_pi, Δy_pi), its relative speed Δv_pi and relative yaw angle Δθ_pi; and the traffic signal state traffic_light. The state space S is thus represented as:
S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light}, si = 1, 2, ..., ns, pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
2) The constructed action space A includes: the vehicle acceleration a_c and the front-wheel steering angle δ_f, which control the motion of the target vehicle, i.e., A(S) = {a_c, δ_f};
3) The constructed reward function R includes: a collision-safety reward r_col, a goal reward r_goal and a traffic-signal reward r_light, namely:
R = χ_1 r_col + χ_2 r_goal + χ_3 r_light
wherein χ_1, χ_2, χ_3 are the weight coefficients of the respective terms in the reward function;
the collision-safety reward r_col requires the own vehicle to avoid colliding with other traffic-participating vehicles and pedestrians;
when the own vehicle collides with a surrounding vehicle or pedestrian, Cind = 1, otherwise Cind = 0;
the goal reward r_goal requires the own vehicle to reach the destination safely within the specified time as far as possible;
when the own vehicle safely reaches the destination within the specified time, Gind = 1, otherwise Gind = 0;
the traffic-signal reward r_light requires the own vehicle to comply with the traffic light rules;
when the own vehicle complies with the traffic rules while passing through the intersection, Lind = 1, otherwise Lind = 0.
Further, step S3 specifically includes the following steps:
s31: construction of implicit quantile network Z Using neural network τ (S, A) with inputs of state space S, quantile τ and parameter θ τ (ii) a Constructing a target implicit quantile network Z using a neural network τ′ (S, A) with the input as state space S, quantile τ', and parameter θ τ′ (ii) a In addition, a hyper-parameter K, N, N' is set, where K is the implicit quantile network Z τ Outputting the sampling times of the optimal action, wherein N is an implicit quantile network Z τ The number of samples in calculating the loss function, N' being the target implicit quantile network Z τ′ Calculating the sampling times of the loss function;
s32: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and calculating the action A_t from the following formula;
meanwhile, calculating the reward R_t obtained at the current time t according to the reward function, and calculating the state S_{t+1} at time t+1 from the simulation-environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replaces old training data on a first-in-first-out basis;
S34: randomly extracting B samples from the experience pool and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A). Specifically: first, for any two quantiles τ_i, τ′_j, the temporal-difference error between them is computed, expressed as:
wherein γ is the discount factor, the target term uses the optimal action at time t+1, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution;
second, the gradient of the loss function can be expressed as:
wherein ∇L is the gradient of the loss function, κ is the set threshold, L_κ is the Huber function, and 1{·} is the indicator function, equal to 1 when its condition is satisfied and 0 otherwise.
Further, step S4 specifically includes the following steps:
Step S41: based on the reward distribution information Z_τ obtained in step S3, the original distribution is distorted using the Wang function ρ_Wang, computed as follows:
wherein Φ is the standard normal cumulative distribution function, Φ⁻¹ is its inverse, the overline denotes the mean, and α is a user-defined risk parameter;
Step S42: selecting the optimal action: maximizing ρ_Wang(Z_τ) yields the risk-sensitive action decision instruction:
The invention has the beneficial effects that:
1) The invention constructs a signal lamp crossroad simulation training scene containing environmental uncertainty factors; the training scene can simulate violations such as surrounding vehicles and pedestrians running red lights, and better matches real traffic scenes.
2) The invention constructs an Implicit Quantile Network (IQN) based model, which can calculate the distribution information of the reward.
3) Based on the reward distribution information output by the IQN model, combined with the Wang function, a behavior decision with risk-perception capability can be generated, improving the safety of automatic driving decisions.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a logic framework diagram of the overall implementation of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1-2, the present invention provides a method for realizing behavior decision of an automatic driving automobile. Considering that violations such as surrounding vehicles and pedestrians running red lights exist in real traffic scenes, a signal lamp crossroad simulation training scene containing environmental uncertainty factors is designed; this training scene can simulate such violations and better matches real traffic scenes. To improve the safety of the automatic driving vehicle, the method specifically comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors; the method specifically comprises the following steps:
s101: setting a pedestrian model: describing the pedestrian motion track in the simulation training scene by adopting the following kinematics model:
wherein v_p is the pedestrian speed, ω_p is the angular velocity, and x_p, y_p, θ_p are respectively the abscissa, ordinate and course angle of the pedestrian's center of gravity; the dotted quantities are respectively the derivatives of x_p, y_p, θ_p.
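The kinematics equations themselves were rendered as an image in the original and do not survive in this text. Given the variables listed above, a standard unicycle-type pedestrian model — offered here as a plausible reconstruction, not the patent's exact formula — reads:

```latex
\dot{x}_p = v_p \cos\theta_p, \qquad
\dot{y}_p = v_p \sin\theta_p, \qquad
\dot{\theta}_p = \omega_p
```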
S102: setting a surrounding vehicle model, and defining the motion of the vehicle and the surrounding vehicle in a simulation environment, wherein the motion is described by the following equations:
wherein x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the center-of-mass speed, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, and a_c is the vehicle acceleration; the dotted quantities are respectively the derivatives of x, y, θ, v.
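The vehicle equations are likewise missing from this extraction. The variable list (center-of-mass slip angle β, axle distances l_f and l_r, front-wheel steering angle δ_f, acceleration a_c) matches the standard kinematic bicycle model, sketched here as an assumed reconstruction:

```latex
\begin{aligned}
\dot{x} &= v\cos(\theta + \beta), \\
\dot{y} &= v\sin(\theta + \beta), \\
\dot{\theta} &= \frac{v}{l_r}\sin\beta, \\
\dot{v} &= a_c, \\
\beta &= \arctan\!\left(\frac{l_r}{l_f + l_r}\tan\delta_f\right).
\end{aligned}
```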
To enable the surrounding motor vehicles in the simulation environment to interact with the own vehicle, it is provided that the surrounding motor vehicles are controlled by a Velocity Difference Model (Velocity Difference Model):
a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]
wherein a_c is the vehicle acceleration, k is a sensitivity coefficient, V is the optimal velocity given by the second equation, Δv is the relative speed between the own vehicle and the surrounding vehicle, λ is a speed-difference reaction coefficient, and V_1, V_2, C_1, C_2 are user-defined parameters that can be obtained through experiments; x_front is the lateral coordinate of the surrounding vehicle's center of mass, L_length,front is the body length of the surrounding vehicle, and x is the lateral coordinate of the own vehicle's center of mass.
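The two Velocity Difference Model equations above translate directly into code. The numerical parameter values below (k, λ, V_1, V_2, C_1, C_2) are illustrative placeholders, not values stated in the patent:

```python
import math

def optimal_velocity(x, x_front, l_front, v1=6.75, v2=7.91, c1=0.13, c2=1.57):
    """V = V1 + V2 * tanh(C1 * (x_front + L_length,front - x) + C2)."""
    return v1 + v2 * math.tanh(c1 * (x_front + l_front - x) + c2)

def velocity_difference_accel(v, dv, x, x_front, l_front, k=0.4, lam=0.5):
    """Velocity Difference Model: a_c = k * [V - v + lambda * dv]."""
    big_v = optimal_velocity(x, x_front, l_front)
    return k * (big_v - v + lam * dv)

# A stopped follower far behind a leader should accelerate...
a_far = velocity_difference_accel(v=0.0, dv=0.0, x=0.0, x_front=60.0, l_front=5.0)
# ...while a fast, closing follower near the leader should brake.
a_near = velocity_difference_accel(v=20.0, dv=-5.0, x=0.0, x_front=8.0, l_front=5.0)
```

The tanh term saturates the desired speed V between V_1 - V_2 and V_1 + V_2 as the gap to the leading vehicle grows or shrinks.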
S103: setting the behavior types of surrounding motor vehicles and pedestrians: in order to simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set as follows: and the four categories of the conventional vehicles, the conventional pedestrians, the illegal vehicles and the illegal pedestrians. Specifically, the conventional vehicles comply with the traffic light rule, and the illegal vehicles do not comply with the traffic light rule, namely, the behavior of running the red light can occur; regular pedestrians can obey the traffic light rule, and illegal pedestrians cannot obey the traffic light rule, namely, the behavior of running the red light can be generated. When the simulation environment operates, one of four types of conventional vehicles, conventional pedestrians, illegal vehicles and illegal pedestrians is randomly extracted at each simulation moment and added into the simulation environment.
S104: initializing the environment: the initial state of the signal lamp, the initial speed, the position and the target speed of the surrounding motor vehicles are initialized randomly. The simulation environment outputs environment information E at each simulation time t. E is specifically defined as:
E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light}, si = 1, 2, ..., ns, pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}
wherein the subscript e denotes the own vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e, θ_e are respectively the lateral coordinate, longitudinal coordinate, center-of-mass speed and yaw angle of the own vehicle's center of mass; x_si, y_si, v_si, θ_si are the same quantities for the surrounding vehicle's center of mass; x_pi, y_pi, v_pi, θ_pi are the same quantities for the pedestrian's center of mass.
S2: constructing and optimizing an implicit quantile-based network (IQN) model; the method specifically comprises the following steps:
s201: constructing a state space S including a position (x) of the own vehicle e ,y e ) Velocity v e Yaw angle theta e Relative position of surrounding vehicle with respect to own vehicle (Δ x) si ,Δy si ) Relative velocity Δ v si And its relative yaw angle delta theta si Relative position of surrounding pedestrian with respect to own vehicle (Δ x) pi ,Δy pi ) Relative velocity Δ v pi And its relative yaw angle delta theta pi The traffic signal status traffic _ light, i.e., S, is represented as:
S={s e ,s s1 ,s s2 ,...,s si ,...,s p1 ,s p2 ,...,s pi ,...,traffic_light} si=1,2, ... ,ns,pi=1,2, ... ,np
s e ={x e ,y e ,v e ,θ e }
s si ={Δx si ,Δy si ,Δv si ,Δθ si }
s pi ={Δx pi ,Δy pi ,Δv pi ,Δθ pi }
where the subscript e denotes the own vehicle, the subscript si denotes the si th surrounding vehicle, i.e., s1 denotes the first surrounding vehicle, ns denotes the number of surrounding traffic participating vehicles, the subscript pi denotes the pi th pedestrian, i.e., p1 is the first pedestrian, and np denotes the number of pedestrians.
S202: constructing an action space A consisting of the acceleration of the vehicle and the steering angle of the front wheels, thereby controlling the movement of the target vehicle, i.e.
A(S)={a c ,δ f }
Wherein, a c For vehicle acceleration, δ f Is the front wheel steering angle.
S203: constructing a reward function R comprising a collision safety R col Target prize r goal Traffic signal light reward r light Namely:
R=χ 1 r col +χ 2 r goal +χ 3 r light
wherein, χ 1 ,χ 2 ,χ 3 For each of the reward functionsThe weight coefficient of the term;
safety in collision col The self-vehicle is required to avoid collision with other traffic participating vehicles and pedestrians.
When the own vehicle collides with a surrounding vehicle or a pedestrian, Cind is 1, otherwise Cind is 0.
Targeted reward r goal The running speed of the vehicle is required to reach the destination safely within a specified time as much as possible.
When the self vehicle can safely reach the destination within the specified time, Gind is 1, otherwise, Gind is 0.
Traffic signal light reward r light The vehicle is required to comply with the traffic light regulations.
When the vehicle passes through the intersection, the traffic rules are observed, and Lind is 1, otherwise Lind is 0.
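Putting the three indicator terms together, the composite reward can be sketched as follows. The weight values for χ_1, χ_2, χ_3 are illustrative assumptions (a negative collision weight so that collisions are penalized); the patent does not state them here:

```python
def step_reward(cind, gind, lind, chi1=-10.0, chi2=5.0, chi3=1.0):
    """R = chi1 * r_col + chi2 * r_goal + chi3 * r_light, where
    Cind, Gind, Lind are the 0/1 indicators defined in the text."""
    r_col, r_goal, r_light = float(cind), float(gind), float(lind)
    return chi1 * r_col + chi2 * r_goal + chi3 * r_light

# Safe arrival while obeying the light is rewarded; a collision dominates.
r_good = step_reward(cind=0, gind=1, lind=1)
r_crash = step_reward(cind=1, gind=0, lind=0)
```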
S204: construction of implicit quantile networks Z Using neural networks τ (S, A) with inputs of state space S, quantile τ and parameter θ τ (ii) a Constructing a target implicit quantile network Z using a neural network τ′ (S, A) with inputs of state space S and quantile τ', the parameter being represented by θ τ′ . In addition, a hyper-parameter K, N, N' is set, where K is the implicit quantile network Z τ Outputting the sampling times of the optimal action, wherein N is an implicit quantile network Z τ The number of samples in calculating the loss function, N' being the target implicit quantile network Z τ′ The number of samples in the loss function is calculated.
S205: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
s206: implicit quantile-based network Z τ (S, A), inputting the state S at the current time t t Calculating the action A based on the following formula t ,
Meanwhile, the reward function calculates the reward R acquired at the current time t t Calculating the state S at the t +1 moment based on the simulation environment output E t+1 (ii) a Establishing an experience pool, and combining data { S } t ,A t ,R t ,S t+1 Putting the training data into an experience pool, and replacing the old training data with the new training data according to a first-in first-out principle when the training data volume exceeds the capacity of the experience pool;
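The first-in-first-out experience pool described above maps naturally onto a bounded deque; a minimal sketch (class and field names are assumptions, not from the patent):

```python
import random
from collections import deque

class ExperiencePool:
    """FIFO replay buffer: once full, the oldest transition is discarded."""
    def __init__(self, capacity):
        self._buf = deque(maxlen=capacity)

    def push(self, s_t, a_t, r_t, s_next):
        self._buf.append((s_t, a_t, r_t, s_next))  # evicts oldest when full

    def sample(self, batch_size):
        return random.sample(self._buf, batch_size)  # B random transitions

    def __len__(self):
        return len(self._buf)

pool = ExperiencePool(capacity=2)
for i in range(3):  # push 3 transitions into a capacity-2 pool
    pool.push(f"s{i}", "a", 0.0, f"s{i + 1}")
```

Using `deque(maxlen=...)` gives the first-in-first-out replacement rule for free: appending to a full deque silently drops the oldest element.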
s207: randomly extracting B samples from the experience pool, and carrying out implicit quantile network Z τ (S, A) and target implicit quantile network Z τ′ And (S, A) updating. First, for any two quantiles τ i ,τ′ j The difference can be expressed as:
wherein, the first and the second end of the pipe are connected with each other,for the optimal action at time t +1, γ is the discount factor, R t Is the instant reward at the time t, A is the action space, K is more than or equal to 1 and less than or equal to K, i is more than or equal to 1 and less than or equal to N, j is more than or equal to 1 and less than or equal to N', tau k ,τ i ,τ j U (0,1), U is uniformly distributed.
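The temporal-difference formula itself is missing from this extraction. Based on the surrounding definitions (instant reward R_t, discount γ, optimal action at time t+1), the standard IQN form — given here as a reconstruction under that assumption — is:

```latex
\delta^{t}_{ij} = R_t + \gamma\, Z_{\tau'_j}\!\left(S_{t+1}, A^{*}_{t+1}\right) - Z_{\tau_i}\!\left(S_t, A_t\right),
\qquad
A^{*}_{t+1} = \arg\max_{a' \in A} \frac{1}{K} \sum_{k=1}^{K} Z_{\tau_k}\!\left(S_{t+1}, a'\right)
```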
Second, the gradient of the loss function can be expressed as:
wherein ∇L is the gradient of the loss function, κ is the set threshold, L_κ is the Huber function, and 1{·} is the indicator function, equal to 1 when its condition is satisfied and 0 otherwise.
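The quantile Huber loss whose gradient is taken above can be sketched in plain Python. The τ-weighting |τ_i - 1{δ < 0}| is the standard IQN/QR-DQN form, stated here as an assumption since the formula image is missing from this text:

```python
def huber(delta, kappa=1.0):
    """Huber function L_kappa: quadratic near zero, linear in the tails."""
    a = abs(delta)
    return 0.5 * delta * delta if a <= kappa else kappa * (a - 0.5 * kappa)

def quantile_huber_loss(td_errors, taus, kappa=1.0):
    """Mean over the N x N' pairwise TD errors delta_ij of
    |tau_i - 1{delta_ij < 0}| * L_kappa(delta_ij) / kappa."""
    total = 0.0
    count = 0
    for tau_i, row in zip(taus, td_errors):  # row holds delta_ij for j = 1..N'
        for delta in row:
            indicator = 1.0 if delta < 0 else 0.0
            total += abs(tau_i - indicator) * huber(delta, kappa) / kappa
            count += 1
    return total / count
```

The asymmetric weight penalizes over- and under-estimation differently per quantile, which is what lets the network learn a full return distribution rather than only its mean.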
S3: generating a behavior decision with risk perception capability by combining a Wang function based on reward distribution information output by an Implicit Quantile Network (IQN) model; the method specifically comprises the following steps:
step S301: the reward distribution information Z obtained based on step S2 τ Using Wang function rho Wang The original distribution information is changed by the following specific formula:
wherein, phi is a standard normal distribution probability density function, phi -1 Which is the inverse of the standard normally distributed probability density function,and expressing the mean value, wherein alpha is a self-defined risk parameter value.
Step S302: selecting an optimal action: maximizing ρ Wang (Z τ ) And (3) calculating the action decision instruction with risk sensitivity:
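A minimal sketch of the Wang-distorted, risk-sensitive action selection, assuming the standard Wang transform with Φ the standard normal CDF; the toy return distribution and all names are illustrative, not taken from the patent:

```python
from statistics import NormalDist
import numpy as np

def wang_distort(taus, alpha):
    """Wang distortion g(tau) = Phi(Phi^-1(tau) + alpha); alpha < 0 weights
    bad outcomes more heavily (risk-averse), alpha > 0 good ones (risk-seeking)."""
    nd = NormalDist()
    return [nd.cdf(nd.inv_cdf(t) + alpha) for t in taus]

def rho_wang(z_net, state, action, alpha, taus):
    # Evaluate the return distribution at the distorted quantiles and average.
    return float(np.mean([z_net(state, action, g) for g in wang_distort(taus, alpha)]))

def risk_sensitive_action(z_net, state, actions, alpha=-0.5, n=16):
    taus = [(k + 0.5) / n for k in range(n)]   # midpoint quantile grid on (0, 1)
    return max(actions, key=lambda a: rho_wang(z_net, state, a, alpha, taus))

# Toy return distribution: higher actions pay off only at high quantiles.
toy_z = lambda s, a, tau: a * tau
act = risk_sensitive_action(toy_z, None, [0, 1, 2], alpha=-0.5)
print(act)  # 2
```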
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to these technical solutions without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.
Claims (6)
1. A method for implementing automated driving vehicle behavior decision-making, characterized by comprising the following steps:
S1: constructing a signal-light intersection simulation training scene containing environmental uncertainty factors;
S2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function;
S3: optimizing, by neural-network training, the implicit quantile network model constructed in step S2;
S4: generating a behavior decision with risk-perception capability by combining the Wang function, according to the reward distribution information output by the implicit quantile network model optimized in step S3.
2. The method for implementing automated driving vehicle behavior decision-making as claimed in claim 1, wherein step S1 specifically comprises the steps of:
S11: setting a pedestrian model: a kinematic model is adopted to describe the pedestrian motion trajectory in the simulation training scene:

ẋ_p = v_p cos θ_p,  ẏ_p = v_p sin θ_p,  θ̇_p = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, and x_p, y_p, θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity; ẋ_p, ẏ_p, θ̇_p are the derivatives of x_p, y_p, θ_p, respectively;
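Assuming the standard unicycle kinematics implied by the symbols above (an illustrative sketch, not the patent's implementation), one Euler integration step is:

```python
import math

def pedestrian_step(x_p, y_p, theta_p, v_p, omega_p, dt):
    """One Euler step of the unicycle kinematics:
    x_dot = v_p*cos(theta_p), y_dot = v_p*sin(theta_p), theta_dot = omega_p."""
    x_p += v_p * math.cos(theta_p) * dt
    y_p += v_p * math.sin(theta_p) * dt
    theta_p += omega_p * dt
    return x_p, y_p, theta_p

# A pedestrian walking along +x at 1.2 m/s, one 0.1 s step.
nx, ny, nth = pedestrian_step(0.0, 0.0, 0.0, 1.2, 0.0, 0.1)
print(nx, ny, nth)  # approx. 0.12 0.0 0.0
```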
S12: setting a surrounding-vehicle model, wherein the motion of the own vehicle and the surrounding vehicles in the simulation training scene is described by the following kinematic bicycle-model equations:

ẋ = v cos(θ + β),  ẏ = v sin(θ + β),  θ̇ = (v/l_r) sin β,  v̇ = a_c,  β = arctan( (l_r/(l_f + l_r)) tan δ_f )

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the center-of-mass speed, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles of the vehicle, δ_f is the front-wheel steering angle, and a_c is the vehicle acceleration; ẋ, ẏ, θ̇, v̇ are the derivatives of x, y, θ, v, respectively;
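Assuming the standard kinematic bicycle model consistent with the symbols above (an illustrative sketch; the wheelbase values are made up), one integration step can be written as:

```python
import math

def bicycle_step(x, y, theta, v, a_c, delta_f, l_f, l_r, dt):
    """One Euler step of the kinematic bicycle model; beta is the slip
    angle at the center of mass induced by the front-wheel steering angle."""
    beta = math.atan(l_r / (l_f + l_r) * math.tan(delta_f))
    x += v * math.cos(theta + beta) * dt
    y += v * math.sin(theta + beta) * dt
    theta += v / l_r * math.sin(beta) * dt
    v += a_c * dt
    return x, y, theta, v

# Straight-line check: zero steering and zero acceleration keep heading and speed.
x, y, th, v = bicycle_step(0.0, 0.0, 0.0, 10.0, 0.0, 0.0, l_f=1.2, l_r=1.6, dt=0.1)
print(x, y, th, v)  # 1.0 0.0 0.0 10.0
```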
to enable the surrounding vehicles in the simulation training scene to interact with the own vehicle, the surrounding motor vehicles are controlled by a speed-difference model:

a_c = k[V − v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front − x) + C_2]

where k is a sensitivity coefficient, Δv is the relative speed between the own vehicle and the surrounding vehicle, λ is the speed-difference reaction coefficient, V_1, V_2, C_1, C_2 are defined parameters, x_front is the abscissa of the surrounding vehicle's center of mass, L_length,front is the body length of the surrounding vehicle, and x is the abscissa of the own vehicle's center of mass;
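A sketch of this speed-difference (full velocity difference) controller; the parameter values below are illustrative placeholders commonly used in the car-following literature, not values given in the patent:

```python
import math

def desired_speed(x, x_front, L_front, V1, V2, C1, C2):
    """V = V1 + V2 * tanh(C1 * (x_front + L_front - x) + C2)."""
    return V1 + V2 * math.tanh(C1 * (x_front + L_front - x) + C2)

def acceleration(v, delta_v, x, x_front, L_front,
                 k=0.4, lam=0.5, V1=6.75, V2=7.91, C1=0.13, C2=1.57):
    """a_c = k * [V - v + lambda * delta_v] (speed-difference model)."""
    V = desired_speed(x, x_front, L_front, V1, V2, C1, C2)
    return k * (V - v + lam * delta_v)

# Large gap ahead: the follower accelerates even while slowly closing in.
a = acceleration(v=8.0, delta_v=-1.0, x=0.0, x_front=30.0, L_front=4.5)
print(round(a, 3))
```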
S13: setting the behavior types of the surrounding motor vehicles and pedestrians, comprising four categories: conventional vehicles, conventional pedestrians, rule-violating vehicles and rule-violating pedestrians;
S14: initializing the environment: randomly initializing the initial state of the signal light and the initial speeds, positions and target speeds of the surrounding motor vehicles; at each simulation time t, the simulation environment outputs environment information E, defined as:

E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light},  si = 1, 2, ..., ns, pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}

where subscript e denotes the own vehicle; subscript si denotes the si-th surrounding vehicle (s1 being the first surrounding vehicle), and ns is the number of surrounding traffic-participating vehicles; subscript pi denotes the pi-th pedestrian (p1 being the first pedestrian), and np is the number of pedestrians; x_e, y_e, v_e, θ_e are respectively the abscissa, ordinate, center-of-mass speed and yaw angle of the own vehicle's center of mass; x_si, y_si, v_si, θ_si are the corresponding quantities for the surrounding vehicles; x_pi, y_pi, v_pi, θ_pi are the corresponding quantities for the pedestrians; traffic_light denotes the traffic signal light state.
3. The method for implementing automated driving vehicle behavior decision-making as claimed in claim 2, wherein, in step S2:

1) the constructed state space S includes: the position (x_e, y_e), speed v_e and yaw angle θ_e of the own vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the own vehicle; and the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the own vehicle; i.e., the state space S is expressed as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns, pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
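The assembly of this state vector from the environment output E can be sketched as follows (the dictionary layout of `env` is an assumed format for illustration, not the patent's data structure):

```python
def build_state(env):
    """Build state space S: ego absolute pose plus relative position,
    speed and yaw angle of each surrounding vehicle and pedestrian."""
    ego = env["e"]
    state = [ego["x"], ego["y"], ego["v"], ego["theta"]]
    for kind in ("vehicles", "pedestrians"):
        for other in env[kind]:
            state += [other["x"] - ego["x"], other["y"] - ego["y"],
                      other["v"] - ego["v"], other["theta"] - ego["theta"]]
    state.append(env["traffic_light"])  # traffic signal light status
    return state

env = {"e": {"x": 0.0, "y": 0.0, "v": 10.0, "theta": 0.0},
       "vehicles": [{"x": 20.0, "y": 3.5, "v": 8.0, "theta": 0.0}],
       "pedestrians": [], "traffic_light": 1}
print(build_state(env))  # [0.0, 0.0, 10.0, 0.0, 20.0, 3.5, -2.0, 0.0, 1]
```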
2) the constructed action space A includes: the vehicle acceleration a_c and the front-wheel steering angle δ_f, i.e., A = {a_c, δ_f};
3) the constructed reward function R includes: a collision-avoidance safety reward r_col, a goal reward r_goal and a traffic-signal-light reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2, χ_3 are the weighting coefficients of the respective terms in the reward function;
the collision-avoidance safety reward r_col requires the own vehicle to avoid colliding with other traffic-participating vehicles and pedestrians: Cind = 1 when the own vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise;

the goal reward r_goal requires the own vehicle to reach the destination safely within the specified time: Gind = 1 when the own vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise;

the traffic-signal-light reward r_light requires the own vehicle to comply with the traffic-light rules: Lind = 1 when the own vehicle complies with the traffic rules while passing through the intersection, and Lind = 0 otherwise.
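An illustrative sketch of this weighted reward; the weights χ_1, χ_2, χ_3 and the mapping from the indicators Cind, Gind, Lind to the per-term rewards are assumed values, since the patent leaves the exact expressions to the formulas:

```python
def reward(collided, reached_goal, obeyed_light,
           chi1=-1.0, chi2=1.0, chi3=0.5):
    """R = chi1*r_col + chi2*r_goal + chi3*r_light, built from the
    indicators Cind, Gind, Lind (weights here are placeholders)."""
    r_col = 1.0 if collided else 0.0        # Cind
    r_goal = 1.0 if reached_goal else 0.0   # Gind
    r_light = 1.0 if obeyed_light else 0.0  # Lind
    return chi1 * r_col + chi2 * r_goal + chi3 * r_light

r = reward(collided=False, reached_goal=True, obeyed_light=True)
print(r)  # 1.5
```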
4. The method for implementing automated driving vehicle behavior decision-making according to claim 3, wherein step S3 specifically comprises the following steps:

S31: constructing the implicit quantile network Z_τ(S, A) using a neural network, with inputs the state space S and the quantile τ, and parameters θ_τ; constructing the target implicit quantile network Z_{τ'}(S, A) using a neural network, with inputs the state space S and the quantile τ', and parameters θ_{τ'}; in addition, setting hyper-parameters K, N, N', where K is the number of quantile samples used by Z_τ to output the optimal action, N is the number of quantile samples used by Z_τ in computing the loss function, and N' is the number of quantile samples used by the target network Z_{τ'} in computing the loss function;

S32: randomly initializing the decision model based on deep reinforcement learning, including the model's hyper-parameters and network structure parameters;

S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by the following formula:

A_t = argmax_{a ∈ A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0,1);

meanwhile, computing, according to the reward function, the reward R_t obtained at the current time t, and computing the state S_{t+1} at time t+1 based on the simulation environment output E; establishing an experience pool, putting each transition {S_t, A_t, R_t, S_{t+1}} into the experience pool, and, when the amount of training data exceeds the pool capacity, replacing old training data with new training data on a first-in-first-out basis;

S34: randomly extracting B samples from the experience pool and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_{τ'}(S, A).
5. The method for implementing automated driving vehicle behavior decision-making according to claim 4, wherein step S34 specifically comprises: first, for any two quantiles τ_i, τ'_j, expressing the temporal-difference error as:

δ_{ij} = R_t + γ Z_{τ'_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N', and τ_k, τ_i, τ'_j ~ U(0,1), where U denotes the uniform distribution;

second, expressing the gradient of the loss function as:

∇L = ∇ [ (1/N') Σ_{i=1}^{N} Σ_{j=1}^{N'} ρ^κ_{τ_i}(δ_{ij}) ],  with  ρ^κ_τ(δ) = |τ − 𝟙{δ < 0}| · L_κ(δ)/κ.
6. The method for implementing automated driving vehicle behavior decision-making according to claim 5, wherein step S4 specifically comprises the following steps:

step S41: based on the reward distribution information Z_τ obtained in step S3, distorting the original distribution using the Wang function ρ_Wang, the calculation formula being:

ρ_Wang(Z_τ) = E[ Z_{Φ(Φ^{-1}(τ) + α)} ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter;

step S42: selecting the optimal action by maximizing the value of ρ_Wang(Z_τ), i.e., computing the risk-sensitive behavior decision instruction:

A_t = argmax_{a ∈ A} ρ_Wang(Z_τ(S_t, a)).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210528980.7A CN114880938B (en) | 2022-05-16 | 2022-05-16 | Method for realizing decision of automatically driving automobile behavior |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114880938A true CN114880938A (en) | 2022-08-09 |
CN114880938B CN114880938B (en) | 2023-04-18 |
Family
ID=82675965
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210528980.7A Active CN114880938B (en) | 2022-05-16 | 2022-05-16 | Method for realizing decision of automatically driving automobile behavior |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114880938B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107169567A (en) * | 2017-03-30 | 2017-09-15 | 深圳先进技术研究院 | The generation method and device of a kind of decision networks model for Vehicular automatic driving |
CN114013443A (en) * | 2021-11-12 | 2022-02-08 | 哈尔滨工业大学 | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning |
CN114312830A (en) * | 2021-12-14 | 2022-04-12 | 江苏大学 | Intelligent vehicle coupling decision model and method considering dangerous driving conditions |
Non-Patent Citations (1)
Title |
---|
Will Dabney et al.: "Implicit Quantile Networks for Distributional Reinforcement Learning", https://arxiv.org/pdf/1806.06923.pdf * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3678911B1 (en) | Pedestrian behavior predictions for autonomous vehicles | |
CN109598934B (en) | Rule and learning model-based method for enabling unmanned vehicle to drive away from high speed | |
CN111775949B (en) | Personalized driver steering behavior auxiliary method of man-machine co-driving control system | |
CN110969848A (en) | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes | |
CN106991251B (en) | Cellular machine simulation method for highway traffic flow | |
CN105857306A (en) | Vehicle autonomous parking path programming method used for multiple parking scenes | |
CN112249008B (en) | Unmanned automobile early warning method aiming at complex dynamic environment | |
CN114013443B (en) | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning | |
CN113753026B (en) | Decision-making method for preventing rollover of large commercial vehicle by considering road adhesion condition | |
CN110716562A (en) | Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning | |
CN110956851B (en) | Intelligent networking automobile cooperative scheduling lane changing method | |
CN112896188B (en) | Automatic driving decision control system considering front vehicle encounter | |
US20220242422A1 (en) | Systems and methods for updating the parameters of a model predictive controller with learned external parameters generated using simulations and machine learning | |
CN114644017A (en) | Method for realizing safety decision control of automatic driving vehicle | |
CN114035575B (en) | Unmanned vehicle motion planning method and system based on semantic segmentation | |
CN113255998B (en) | Expressway unmanned vehicle formation method based on multi-agent reinforcement learning | |
CN113722835B (en) | Personification random lane change driving behavior modeling method | |
Wang et al. | Vehicle trajectory prediction by knowledge-driven LSTM network in urban environments | |
CN113715842A (en) | High-speed moving vehicle control method based on simulation learning and reinforcement learning | |
CN115303289A (en) | Vehicle dynamics model based on depth Gaussian, training method, intelligent vehicle trajectory tracking control method and terminal equipment | |
CN115593433A (en) | Remote take-over method for automatic driving vehicle | |
US20220242401A1 (en) | Systems and methods for updating the parameters of a model predictive controller with learned controls parameters generated using simulations and machine learning | |
CN114880938B (en) | Method for realizing decision of automatically driving automobile behavior | |
CN115123217B (en) | Mine obstacle vehicle driving track generation method and device and computer equipment | |
CN113033902B (en) | Automatic driving lane change track planning method based on improved deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||