CN114880938A - Method for realizing decision of automatically driving automobile behavior - Google Patents

Method for realizing decision of automatically driving automobile behavior

Info

Publication number
CN114880938A
CN114880938A (application CN202210528980.7A)
Authority
CN
China
Prior art keywords
vehicle
quantile
surrounding
implicit
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210528980.7A
Other languages
Chinese (zh)
Other versions
CN114880938B (en
Inventor
唐小林
杨凯
李深
汪锋
沈子超
邓忠伟
胡晓松
李佳承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210528980.7A priority Critical patent/CN114880938B/en
Publication of CN114880938A publication Critical patent/CN114880938A/en
Application granted granted Critical
Publication of CN114880938B publication Critical patent/CN114880938B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F30/00 Computer-aided design [CAD]
                    • G06F30/20 Design optimisation, verification or simulation
                        • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
                    • G06F30/10 Geometric CAD
                        • G06F30/15 Vehicle, aircraft or watercraft design
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
            • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q10/00 Administration; Management
                    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
                        • G06Q10/063 Operations research, analysis or management
                            • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
                • G06Q50/40
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
                • Y02T10/00 Road transport of goods or passengers
                    • Y02T10/10 Internal combustion engine [ICE] based vehicles
                        • Y02T10/40 Engine management systems

Abstract

The invention relates to a method for realizing behavior decisions for an autonomous vehicle, and belongs to the technical field of autonomous driving. The method comprises the following steps: S1: constructing a signalized-intersection simulation training scene containing environmental uncertainty factors; S2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function; S3: optimizing the implicit quantile network model constructed in step S2 through neural-network training; S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the implicit quantile network model optimized in step S3, combined with the Wang function. The method can perceive risks caused by uncertainty factors in the environment and improves the safety of the autonomous vehicle at signalized intersections.

Description

Method for realizing decision of automatically driving automobile behavior
Technical Field
The invention belongs to the technical field of autonomous vehicles, and relates to a method for realizing behavior decisions for an autonomous vehicle.
Background
When an autonomous vehicle operates in a real environment, its decision-making system must account for many environmental factors, including surrounding vehicles and pedestrians. However, how to guarantee driving safety under complex driving conditions remains unsolved. At signalized intersections in particular, accounting in the behavior decision system for violations by surrounding vehicles and pedestrians, such as dangerous behaviors like running a red light, is essential for improving the safety of autonomous vehicles.
Existing decision-making methods for autonomous vehicles at intersections mainly include rule-based methods, methods based on partially observable Markov decision processes, and methods based on deep reinforcement learning. To improve the adaptability of automated-driving decision systems to complex traffic scenes, deep-reinforcement-learning methods are now widely adopted. Compared with rule-based methods, they avoid the tedious design steps and parameter-tuning work that a rule-based algorithm requires; they also overcome the difficulty that partially observable Markov methods have in scaling to large decision problems. In general, a deep-reinforcement-learning decision method generates driving data through continuous interaction between the vehicle and the environment and autonomously learns a decision policy adapted to complex environments; representative methods include the deep Q-network (DQN) and the soft actor-critic (SAC). However, these methods rarely consider violations by traffic participants at signalized intersections, so it is difficult for them to guarantee driving safety at the intersection.
Therefore, a safety decision method that can account for violations by traffic participants is needed to ensure the safety of the autonomous vehicle.
Disclosure of Invention
In view of this, the present invention provides a method for realizing behavior decisions for an autonomous vehicle, which can perceive risks caused by uncertainty factors in the environment and improve the safety of the autonomous vehicle when passing through a signalized intersection.
In order to achieve the purpose, the invention provides the following technical scheme:
A method for realizing behavior decisions for an autonomous vehicle comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors;
s2: constructing an Implicit Quantile Network (IQN) model, including constructing a state space, an action space and a reward function;
S3: optimizing the Implicit Quantile Network (IQN) model constructed in step S2 through neural-network training;
S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the Implicit Quantile Network (IQN) model optimized in step S3, combined with the Wang function.
Further, step S1 specifically includes the following steps:
S11: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by the following kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives;
S12: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation training scene is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v;
To enable the surrounding vehicles in the simulation training scene to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where k is a sensitivity coefficient, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters that are generally obtained through experiments, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass;
S13: setting the behavior types of surrounding motor vehicles and pedestrians;
To simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set to four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians. Specifically, a regular vehicle obeys the traffic-light rules, while a violating vehicle does not, i.e. it may run a red light; likewise, a regular pedestrian obeys the traffic-light rules, while a violating pedestrian does not and may cross against a red light. When the simulation environment runs, at each simulation time one of the four categories is randomly drawn and added to the simulation environment.
S14: initializing an environment: randomly initializing the initial state of a signal lamp, the initial speed, the position and the target speed of surrounding motor vehicles; the simulation environment outputs environment information E at each simulation time t, which is defined as:
E={E e ,E s1 ,E s2 ,...,E si ,...,E p1 ,E p2 ,...,E pi ,...,traffic_light} si=1,2,...,ns,pi=1,2,...,np
E e ={x e ,y e ,v ee }
E vi ={x si ,y si ,v sisi }
E pi ={x pi ,y pi ,v pipi }
wherein subscript e represents own vehicle; subscript si denotes the si th surrounding vehicle, i.e., s1 denotes the first surrounding vehicle, ns denotes the number of surrounding traffic participating vehicles; the subscript pi indicates the pi-th pedestrian, i.e., p1 is the first pedestrian, np indicates the number of pedestrians; x is the number of e ,y e ,v ee Respectively a transverse coordinate, a longitudinal coordinate, a mass center speed and a yaw angle of the mass center of the bicycle; x is the number of vi ,y vi ,v vivi Respectively the transverse coordinate, the longitudinal coordinate, the centroid speed and the yaw angle of the centroid of the surrounding vehicle; x is the number of pi ,y pi ,v pipi Respectively a transverse coordinate, a longitudinal coordinate, a mass center speed and a yaw angle of the mass center of the pedestrian; traffic _ light represents traffic signal light status.
Further, in step S2,
1) The constructed state space S includes: the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light. The state space S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
2) The constructed action space A includes the vehicle acceleration a_c and the front-wheel steering angle δ_f, through which the motion of the host vehicle is controlled, i.e. A(S) = {a_c, δ_f};
3) The constructed reward function R includes a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function;

the collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise;

the target reward r_goal requires the host vehicle to reach the destination safely within the specified time as far as possible:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise;

the traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
Further, step S3 specifically includes the following steps:
S31: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′; in addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function;
s32: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
S34: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A). Specifically: first, for any two quantiles τ_i and τ′_j, the temporal-difference error is formed as:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, R_t is the immediate reward at time t, A is the action space, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution;
second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
Further, step S4 specifically includes the following steps:
Step S41: based on the reward distribution information Z_τ obtained in step S3, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value;
Step S42: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
The invention has the beneficial effects that:
1) The invention constructs a signalized-intersection simulation training scene containing environmental uncertainty factors; the training scene can simulate violations such as surrounding vehicles and pedestrians running a red light and therefore matches real traffic scenes more closely.
2) The invention constructs an Implicit Quantile Network (IQN)-based model that can compute the distribution information of the reward.
3) Based on the reward distribution information output by the Implicit Quantile Network (IQN) model, combined with the Wang function, a behavior decision with risk-perception capability can be generated, which improves the safety of autonomous-vehicle decisions.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a logic framework diagram of the overall implementation of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1-2, the present invention provides a method for realizing behavior decisions for an autonomous vehicle. Considering that violations such as surrounding vehicles and pedestrians running a red light exist in real traffic, a signalized-intersection simulation training scene containing environmental uncertainty factors is designed; this training scene can simulate such violations and therefore matches real traffic scenes more closely. To improve the safety of the autonomous vehicle, the method specifically comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors; the method specifically comprises the following steps:
S101: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by the following kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives.
S102: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation environment is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v.
To enable the surrounding motor vehicles in the simulation environment to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where a_c is the vehicle acceleration, k is a sensitivity coefficient, v is the vehicle speed, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters that can be obtained through experiments, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass.
S103: setting the behavior types of surrounding motor vehicles and pedestrians: to simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set to four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians. Specifically, a regular vehicle obeys the traffic-light rules, while a violating vehicle does not, i.e. it may run a red light; likewise, a regular pedestrian obeys the traffic-light rules, while a violating pedestrian does not and may cross against a red light. When the simulation environment runs, at each simulation time one of the four categories is randomly drawn and added to the simulation environment.
S104: initializing the environment: randomly initializing the initial state of the traffic signal and the initial speeds, positions and target speeds of the surrounding motor vehicles. The simulation environment outputs environment information E at each simulation time t, defined as:

E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}

where the subscript e denotes the host vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e and θ_e are respectively the abscissa, ordinate, speed and yaw angle of the host vehicle's center of mass; x_si, y_si, v_si and θ_si are those of the si-th surrounding vehicle's center of mass; x_pi, y_pi, v_pi and θ_pi are those of the pi-th pedestrian's center of gravity; traffic_light denotes the traffic signal state.
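A minimal sketch of how the simulation environment could sample participant types and assemble the per-step environment information E is given below; the function and key names are assumptions for illustration and are not prescribed by the invention.

    import random

    PARTICIPANT_TYPES = ["regular_vehicle", "regular_pedestrian",
                         "violating_vehicle", "violating_pedestrian"]

    def sample_participant_type():
        # At each simulation time one of the four behavior types of step S103 is drawn at random
        return random.choice(PARTICIPANT_TYPES)

    def observe(ego, vehicles, pedestrians, traffic_light):
        # Assemble the environment information E output at each simulation time t;
        # ego, vehicles[i] and pedestrians[i] are dicts with keys x, y, v, theta.
        E = {"E_e": (ego["x"], ego["y"], ego["v"], ego["theta"]),
             "traffic_light": traffic_light}
        for i, veh in enumerate(vehicles, start=1):        # E_s1 ... E_sns
            E[f"E_s{i}"] = (veh["x"], veh["y"], veh["v"], veh["theta"])
        for i, ped in enumerate(pedestrians, start=1):     # E_p1 ... E_pnp
            E[f"E_p{i}"] = (ped["x"], ped["y"], ped["v"], ped["theta"])
        return E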
S2: constructing and optimizing an Implicit Quantile Network (IQN)-based model; this specifically comprises the following steps:
S201: constructing a state space S, which includes the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light. S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}

where the subscript e denotes the host vehicle, the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle), ns is the number of surrounding traffic-participating vehicles, the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian), and np is the number of pedestrians.
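As an illustration, the state S can be built from the environment information E by expressing each surrounding participant relative to the host vehicle; the helper name build_state, the flat-vector layout and the simple numeric encoding of traffic_light below are assumptions for illustration.

    def build_state(E, ns, n_ped):
        # s_e: absolute host-vehicle state; s_si / s_pi: quantities relative to the host vehicle
        x_e, y_e, v_e, th_e = E["E_e"]
        state = [x_e, y_e, v_e, th_e]
        for i in range(1, ns + 1):
            x, y, v, th = E[f"E_s{i}"]
            state += [x - x_e, y - y_e, v - v_e, th - th_e]
        for i in range(1, n_ped + 1):
            x, y, v, th = E[f"E_p{i}"]
            state += [x - x_e, y - y_e, v - v_e, th - th_e]
        state.append(1.0 if E["traffic_light"] == "green" else 0.0)  # assumed encoding of the signal state
        return state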
S202: constructing an action space A consisting of the vehicle acceleration and the front-wheel steering angle, through which the motion of the host vehicle is controlled, i.e.

A(S) = {a_c, δ_f}

where a_c is the vehicle acceleration and δ_f is the front-wheel steering angle.
S203: constructing a reward function R comprising a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function.

The collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise.

The target reward r_goal requires the host vehicle to reach the destination safely within the specified time as far as possible:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise.

The traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
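A sketch of how the weighted reward of step S203 could be evaluated is shown below. The patent defines the three terms only through the indicators Cind, Gind and Lind, so the individual magnitudes and the default weights used here are illustrative placeholders, not values disclosed by the invention.

    def reward(Cind, Gind, Lind, chi1=1.0, chi2=1.0, chi3=1.0):
        # R = chi1*r_col + chi2*r_goal + chi3*r_light
        r_col = -1.0 if Cind else 0.0      # collision penalty (illustrative magnitude)
        r_goal = 1.0 if Gind else 0.0      # bonus for reaching the destination in time
        r_light = 0.0 if Lind else -1.0    # penalty for violating the traffic signal
        return chi1 * r_col + chi2 * r_goal + chi3 * r_light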
S204: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′. In addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function.
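The patent does not spell out the internal structure of Z_τ; the sketch below follows one common construction from the cited IQN literature (a cosine embedding of the quantile level combined multiplicatively with a state encoding) and assumes a discretized action set, PyTorch, and the layer sizes shown.

    import math

    import torch
    import torch.nn as nn

    class ImplicitQuantileNetwork(nn.Module):
        # Z_tau(S, A): state encoder and cosine quantile embedding combined by element-wise product
        def __init__(self, state_dim, n_actions, hidden=128, n_cos=64):
            super().__init__()
            self.n_cos = n_cos
            self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.tau_enc = nn.Sequential(nn.Linear(n_cos, hidden), nn.ReLU())
            self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

        def forward(self, state, tau):
            # state: [B, state_dim]; tau: [B, N] quantile levels in (0, 1)
            psi = self.state_enc(state)                                    # [B, hidden]
            i_pi = torch.arange(1, self.n_cos + 1, device=tau.device).float() * math.pi
            cos_emb = torch.cos(tau.unsqueeze(-1) * i_pi)                  # [B, N, n_cos]
            phi = self.tau_enc(cos_emb)                                    # [B, N, hidden]
            return self.head(psi.unsqueeze(1) * phi)                       # [B, N, n_actions]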
S205: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S206: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
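A sketch of the greedy action selection of step S206 and of a first-in first-out experience pool follows; the sample count K, the buffer capacity and the names are assumptions for illustration.

    import random
    from collections import deque

    import torch

    def select_action(iqn, state, K=32):
        # A_t = argmax_a (1/K) * sum_k Z_tau_k(S_t, a), with tau_k ~ U(0, 1); state is a 1-D tensor
        with torch.no_grad():
            tau = torch.rand(1, K)                    # K quantile samples
            z = iqn(state.unsqueeze(0), tau)          # [1, K, n_actions]
            return int(z.mean(dim=1).argmax(dim=1).item())

    class ExperiencePool:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)      # deque drops the oldest tuples first (FIFO)

        def push(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)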
S207: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A). First, for any two quantiles τ_i and τ′_j, the temporal-difference error is formed as:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, R_t is the immediate reward at time t, A is the action space, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution.
Second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
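The update of step S207 can be sketched with the quantile Huber loss below, following the form in the cited IQN literature; the threshold value m = 1.0, the tensor layout and the reduction over the batch are illustrative assumptions.

    import torch

    def quantile_huber_loss(z_pred, z_target, tau, m=1.0):
        # z_pred:   [B, N, 1]  quantile estimates Z_tau_i(S_t, A_t)
        # z_target: [B, 1, N'] targets R_t + gamma * Z_tau'_j(S_{t+1}, A*_{t+1})
        # tau:      [B, N, 1]  quantile levels tau_i
        delta = z_target - z_pred                                # pairwise TD errors, [B, N, N']
        huber = torch.where(delta.abs() <= m,
                            0.5 * delta.pow(2),
                            m * (delta.abs() - 0.5 * m))
        loss = (tau - (delta.detach() < 0).float()).abs() * huber / m
        return loss.sum(dim=1).mean()                            # sum over N, mean over N' and batch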
S3: generating a behavior decision with risk perception capability by combining a Wang function based on reward distribution information output by an Implicit Quantile Network (IQN) model; the method specifically comprises the following steps:
Step S301: based on the reward distribution information Z_τ obtained in step S2, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value.
Step S302: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A method for realizing behavior decision of an autonomous vehicle, characterized by comprising the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors;
s2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function;
S3: optimizing the implicit quantile network model constructed in step S2 through neural-network training;
S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the implicit quantile network model optimized in step S3, combined with the Wang function.
2. The method for realizing behavior decision of an autonomous vehicle according to claim 1, wherein step S1 specifically comprises the following steps:
S11: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by a kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives;
S12: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation training scene is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v;
to enable the surrounding vehicles in the simulation training scene to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where k is a sensitivity coefficient, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass;
S13: setting the behavior types of surrounding motor vehicles and pedestrians, comprising four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians;
S14: initializing the environment: randomly initializing the initial state of the traffic signal and the initial speeds, positions and target speeds of the surrounding motor vehicles; the simulation environment outputs environment information E at each simulation time t, defined as:

E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}

where the subscript e denotes the host vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e and θ_e are respectively the abscissa, ordinate, speed and yaw angle of the host vehicle's center of mass; x_si, y_si, v_si and θ_si are those of the si-th surrounding vehicle's center of mass; x_pi, y_pi, v_pi and θ_pi are those of the pi-th pedestrian's center of gravity; traffic_light denotes the traffic signal state.
3. The method for realizing behavior decision of an autonomous vehicle according to claim 2, wherein, in step S2,
1) The constructed state space S includes: the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light, i.e. the state space S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
2) The constructed action space A includes the vehicle acceleration a_c and the front-wheel steering angle δ_f, i.e. A(S) = {a_c, δ_f};
3) The constructed reward function R includes a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function;

the collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise;

the target reward r_goal requires the host vehicle to reach the destination safely within the specified time:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise;

the traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
4. The method for realizing behavior decision of an autonomous vehicle according to claim 3, wherein step S3 specifically comprises the following steps:
S31: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′; in addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function;
s32: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
S34: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A).
5. The method for realizing behavior decision of an autonomous vehicle according to claim 4, wherein step S34 specifically comprises: first, for any two quantiles τ_i and τ′_j, forming the temporal-difference error:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution;
second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
6. The method for realizing behavior decision of an autonomous vehicle according to claim 5, wherein step S4 specifically comprises the following steps:
Step S41: based on the reward distribution information Z_τ obtained in step S3, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value;
Step S42: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
CN202210528980.7A 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior Active CN114880938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210528980.7A CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210528980.7A CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Publications (2)

Publication Number Publication Date
CN114880938A true CN114880938A (en) 2022-08-09
CN114880938B CN114880938B (en) 2023-04-18

Family

ID=82675965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210528980.7A Active CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Country Status (1)

Country Link
CN (1) CN114880938B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
CN114013443A (en) * 2021-11-12 2022-02-08 哈尔滨工业大学 Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN114312830A (en) * 2021-12-14 2022-04-12 江苏大学 Intelligent vehicle coupling decision model and method considering dangerous driving conditions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
CN114013443A (en) * 2021-11-12 2022-02-08 哈尔滨工业大学 Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN114312830A (en) * 2021-12-14 2022-04-12 江苏大学 Intelligent vehicle coupling decision model and method considering dangerous driving conditions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WILL DABNEY et al.: "Implicit Quantile Networks for Distributional Reinforcement Learning", https://arxiv.org/pdf/1806.06923.pdf *

Also Published As

Publication number Publication date
CN114880938B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
EP3678911B1 (en) Pedestrian behavior predictions for autonomous vehicles
CN109598934B (en) Rule and learning model-based method for enabling unmanned vehicle to drive away from high speed
CN111775949B (en) Personalized driver steering behavior auxiliary method of man-machine co-driving control system
CN110969848A (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN106991251B (en) Cellular machine simulation method for highway traffic flow
CN105857306A (en) Vehicle autonomous parking path programming method used for multiple parking scenes
CN112249008B (en) Unmanned automobile early warning method aiming at complex dynamic environment
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113753026B (en) Decision-making method for preventing rollover of large commercial vehicle by considering road adhesion condition
CN110716562A (en) Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110956851B (en) Intelligent networking automobile cooperative scheduling lane changing method
CN112896188B (en) Automatic driving decision control system considering front vehicle encounter
US20220242422A1 (en) Systems and methods for updating the parameters of a model predictive controller with learned external parameters generated using simulations and machine learning
CN114644017A (en) Method for realizing safety decision control of automatic driving vehicle
CN114035575B (en) Unmanned vehicle motion planning method and system based on semantic segmentation
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN113722835B (en) Personification random lane change driving behavior modeling method
Wang et al. Vehicle trajectory prediction by knowledge-driven LSTM network in urban environments
CN113715842A (en) High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN115303289A (en) Vehicle dynamics model based on depth Gaussian, training method, intelligent vehicle trajectory tracking control method and terminal equipment
CN115593433A (en) Remote take-over method for automatic driving vehicle
US20220242401A1 (en) Systems and methods for updating the parameters of a model predictive controller with learned controls parameters generated using simulations and machine learning
CN114880938B (en) Method for realizing decision of automatically driving automobile behavior
CN115123217B (en) Mine obstacle vehicle driving track generation method and device and computer equipment
CN113033902B (en) Automatic driving lane change track planning method based on improved deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant