CN114880938A - Method for realizing decision of automatically driving automobile behavior - Google Patents

Method for realizing decision of automatically driving automobile behavior

Info

Publication number
CN114880938A
CN114880938A (application CN202210528980.7A)
Authority
CN
China
Prior art keywords
vehicle
quantile
surrounding
implicit
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210528980.7A
Other languages
Chinese (zh)
Other versions
CN114880938B (en
Inventor
唐小林
杨凯
李深
汪锋
沈子超
邓忠伟
胡晓松
李佳承
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202210528980.7A priority Critical patent/CN114880938B/en
Publication of CN114880938A publication Critical patent/CN114880938A/en
Application granted granted Critical
Publication of CN114880938B publication Critical patent/CN114880938B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F30/00 Computer-aided design [CAD]
                    • G06F30/20 Design optimisation, verification or simulation
                        • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
                    • G06F30/10 Geometric CAD
                        • G06F30/15 Vehicle, aircraft or watercraft design
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00 Computing arrangements based on biological models
                    • G06N3/02 Neural networks
                        • G06N3/04 Architecture, e.g. interconnection topology
                            • G06N3/045 Combinations of networks
                        • G06N3/08 Learning methods
            • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
                • G06Q10/00 Administration; Management
                    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
                        • G06Q10/063 Operations research, analysis or management
                            • G06Q10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
                • G06Q50/40
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
                • Y02T10/00 Road transport of goods or passengers
                    • Y02T10/10 Internal combustion engine [ICE] based vehicles
                        • Y02T10/40 Engine management systems

Abstract

The invention relates to a method for realizing behavior decisions for an autonomous vehicle, and belongs to the technical field of autonomous driving. The method comprises the following steps: S1: constructing a signalized-intersection simulation training scene containing environmental uncertainty factors; S2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function; S3: optimizing the implicit quantile network model constructed in step S2 through neural-network training; S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the implicit quantile network model optimized in step S3, combined with the Wang function. The method can perceive risks caused by uncertainty factors in the environment and improves the safety of the autonomous vehicle at signalized intersections.

Description

Method for realizing decision of automatically driving automobile behavior
Technical Field
The invention belongs to the technical field of autonomous vehicles, and relates to a method for realizing behavior decisions for an autonomous vehicle.
Background
When an autonomous vehicle operates in a real environment, its decision-making system must account for many environmental factors, including surrounding vehicles and pedestrians. However, how to guarantee driving safety under complex driving conditions remains unsolved. At signalized intersections in particular, accounting in the behavior decision system for violations by surrounding vehicles and pedestrians, such as dangerous behaviors like running a red light, is essential for improving the safety of autonomous vehicles.
Existing decision-making methods for autonomous vehicles at intersections mainly include rule-based methods, methods based on partially observable Markov decision processes, and methods based on deep reinforcement learning. To improve the adaptability of automated-driving decision systems to complex traffic scenes, deep-reinforcement-learning methods are now widely adopted. Compared with rule-based methods, they avoid the tedious design steps and parameter-tuning work that a rule-based algorithm requires; they also overcome the difficulty that partially observable Markov methods have in scaling to large decision problems. In general, a deep-reinforcement-learning decision method generates driving data through continuous interaction between the vehicle and the environment and autonomously learns a decision policy adapted to complex environments; representative methods include the deep Q-network (DQN) and the soft actor-critic (SAC). However, these methods rarely consider violations by traffic participants at signalized intersections, so it is difficult for them to guarantee driving safety at the intersection.
Therefore, a safety decision method that can account for violations by traffic participants is needed to ensure the safety of the autonomous vehicle.
Disclosure of Invention
In view of this, the present invention provides a method for realizing behavior decisions for an autonomous vehicle, which can perceive risks caused by uncertainty factors in the environment and improve the safety of the autonomous vehicle when passing through a signalized intersection.
In order to achieve the purpose, the invention provides the following technical scheme:
A method for realizing behavior decisions for an autonomous vehicle comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors;
s2: constructing an Implicit Quantile Network (IQN) model, including constructing a state space, an action space and a reward function;
S3: optimizing the Implicit Quantile Network (IQN) model constructed in step S2 through neural-network training;
S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the Implicit Quantile Network (IQN) model optimized in step S3, combined with the Wang function.
Further, step S1 specifically includes the following steps:
S11: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by the following kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives;
S12: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation training scene is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v;
To enable the surrounding vehicles in the simulation training scene to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where k is a sensitivity coefficient, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters that are generally obtained through experiments, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass;
S13: setting the behavior types of surrounding motor vehicles and pedestrians;
To simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set to four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians. Specifically, a regular vehicle obeys the traffic-light rules, while a violating vehicle does not, i.e. it may run a red light; likewise, a regular pedestrian obeys the traffic-light rules, while a violating pedestrian does not and may cross against a red light. When the simulation environment runs, at each simulation time one of the four categories is randomly drawn and added to the simulation environment.
S14: initializing an environment: randomly initializing the initial state of a signal lamp, the initial speed, the position and the target speed of surrounding motor vehicles; the simulation environment outputs environment information E at each simulation time t, which is defined as:
E={E e ,E s1 ,E s2 ,...,E si ,...,E p1 ,E p2 ,...,E pi ,...,traffic_light} si=1,2,...,ns,pi=1,2,...,np
E e ={x e ,y e ,v ee }
E vi ={x si ,y si ,v sisi }
E pi ={x pi ,y pi ,v pipi }
wherein subscript e represents own vehicle; subscript si denotes the si th surrounding vehicle, i.e., s1 denotes the first surrounding vehicle, ns denotes the number of surrounding traffic participating vehicles; the subscript pi indicates the pi-th pedestrian, i.e., p1 is the first pedestrian, np indicates the number of pedestrians; x is the number of e ,y e ,v ee Respectively a transverse coordinate, a longitudinal coordinate, a mass center speed and a yaw angle of the mass center of the bicycle; x is the number of vi ,y vi ,v vivi Respectively the transverse coordinate, the longitudinal coordinate, the centroid speed and the yaw angle of the centroid of the surrounding vehicle; x is the number of pi ,y pi ,v pipi Respectively a transverse coordinate, a longitudinal coordinate, a mass center speed and a yaw angle of the mass center of the pedestrian; traffic _ light represents traffic signal light status.
Further, in step S2,
1) The constructed state space S includes: the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light. The state space S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
2) The constructed action space A includes the vehicle acceleration a_c and the front-wheel steering angle δ_f, through which the motion of the host vehicle is controlled, i.e. A(S) = {a_c, δ_f};
3) The constructed reward function R includes a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function;

the collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise;

the target reward r_goal requires the host vehicle to reach the destination safely within the specified time as far as possible:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise;

the traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
Further, step S3 specifically includes the following steps:
S31: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′; in addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function;
s32: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
S34: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A). Specifically: first, for any two quantiles τ_i and τ′_j, the temporal-difference error is formed as:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, R_t is the immediate reward at time t, A is the action space, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution;
second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
Further, step S4 specifically includes the following steps:
Step S41: based on the reward distribution information Z_τ obtained in step S3, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value;
Step S42: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
The invention has the beneficial effects that:
1) The invention constructs a signalized-intersection simulation training scene containing environmental uncertainty factors; the training scene can simulate violations such as surrounding vehicles and pedestrians running a red light and therefore matches real traffic scenes more closely.
2) The invention constructs an Implicit Quantile Network (IQN)-based model that can compute the distribution information of the reward.
3) Based on the reward distribution information output by the Implicit Quantile Network (IQN) model, combined with the Wang function, a behavior decision with risk-perception capability can be generated, which improves the safety of autonomous-vehicle decisions.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a logic framework diagram of the overall implementation of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
Referring to fig. 1-2, the present invention provides a method for realizing behavior decisions for an autonomous vehicle. Considering that violations such as surrounding vehicles and pedestrians running a red light exist in real traffic, a signalized-intersection simulation training scene containing environmental uncertainty factors is designed; this training scene can simulate such violations and therefore matches real traffic scenes more closely. To improve the safety of the autonomous vehicle, the method specifically comprises the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors; the method specifically comprises the following steps:
S101: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by the following kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives.
S102: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation environment is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v.
To enable the surrounding motor vehicles in the simulation environment to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where a_c is the vehicle acceleration, k is a sensitivity coefficient, v is the vehicle speed, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters that can be obtained through experiments, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass.
S103: setting the behavior types of surrounding motor vehicles and pedestrians: to simulate a real traffic scene, the behavior types of surrounding motor vehicles and pedestrians are set to four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians. Specifically, a regular vehicle obeys the traffic-light rules, while a violating vehicle does not, i.e. it may run a red light; likewise, a regular pedestrian obeys the traffic-light rules, while a violating pedestrian does not and may cross against a red light. When the simulation environment runs, at each simulation time one of the four categories is randomly drawn and added to the simulation environment.
S104: initializing the environment: randomly initializing the initial state of the traffic signal and the initial speeds, positions and target speeds of the surrounding motor vehicles. The simulation environment outputs environment information E at each simulation time t, defined as:

E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}

where the subscript e denotes the host vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e and θ_e are respectively the abscissa, ordinate, speed and yaw angle of the host vehicle's center of mass; x_si, y_si, v_si and θ_si are those of the si-th surrounding vehicle's center of mass; x_pi, y_pi, v_pi and θ_pi are those of the pi-th pedestrian's center of gravity; traffic_light denotes the traffic signal state.
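A minimal sketch of how the simulation environment could sample participant types and assemble the per-step environment information E is given below; the function and key names are assumptions for illustration and are not prescribed by the invention.

    import random

    PARTICIPANT_TYPES = ["regular_vehicle", "regular_pedestrian",
                         "violating_vehicle", "violating_pedestrian"]

    def sample_participant_type():
        # At each simulation time one of the four behavior types of step S103 is drawn at random
        return random.choice(PARTICIPANT_TYPES)

    def observe(ego, vehicles, pedestrians, traffic_light):
        # Assemble the environment information E output at each simulation time t;
        # ego, vehicles[i] and pedestrians[i] are dicts with keys x, y, v, theta.
        E = {"E_e": (ego["x"], ego["y"], ego["v"], ego["theta"]),
             "traffic_light": traffic_light}
        for i, veh in enumerate(vehicles, start=1):        # E_s1 ... E_sns
            E[f"E_s{i}"] = (veh["x"], veh["y"], veh["v"], veh["theta"])
        for i, ped in enumerate(pedestrians, start=1):     # E_p1 ... E_pnp
            E[f"E_p{i}"] = (ped["x"], ped["y"], ped["v"], ped["theta"])
        return E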
S2: constructing and optimizing an Implicit Quantile Network (IQN)-based model; this specifically comprises the following steps:
S201: constructing a state space S, which includes the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light. S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}

where the subscript e denotes the host vehicle, the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle), ns is the number of surrounding traffic-participating vehicles, the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian), and np is the number of pedestrians.
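As an illustration, the state S can be built from the environment information E by expressing each surrounding participant relative to the host vehicle; the helper name build_state, the flat-vector layout and the simple numeric encoding of traffic_light below are assumptions for illustration.

    def build_state(E, ns, n_ped):
        # s_e: absolute host-vehicle state; s_si / s_pi: quantities relative to the host vehicle
        x_e, y_e, v_e, th_e = E["E_e"]
        state = [x_e, y_e, v_e, th_e]
        for i in range(1, ns + 1):
            x, y, v, th = E[f"E_s{i}"]
            state += [x - x_e, y - y_e, v - v_e, th - th_e]
        for i in range(1, n_ped + 1):
            x, y, v, th = E[f"E_p{i}"]
            state += [x - x_e, y - y_e, v - v_e, th - th_e]
        state.append(1.0 if E["traffic_light"] == "green" else 0.0)  # assumed encoding of the signal state
        return state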
S202: constructing an action space A consisting of the vehicle acceleration and the front-wheel steering angle, through which the motion of the host vehicle is controlled, i.e.

A(S) = {a_c, δ_f}

where a_c is the vehicle acceleration and δ_f is the front-wheel steering angle.
S203: constructing a reward function R comprising a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function.

The collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise.

The target reward r_goal requires the host vehicle to reach the destination safely within the specified time as far as possible:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise.

The traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
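A sketch of how the weighted reward of step S203 could be evaluated is shown below. The patent defines the three terms only through the indicators Cind, Gind and Lind, so the individual magnitudes and the default weights used here are illustrative placeholders, not values disclosed by the invention.

    def reward(Cind, Gind, Lind, chi1=1.0, chi2=1.0, chi3=1.0):
        # R = chi1*r_col + chi2*r_goal + chi3*r_light
        r_col = -1.0 if Cind else 0.0      # collision penalty (illustrative magnitude)
        r_goal = 1.0 if Gind else 0.0      # bonus for reaching the destination in time
        r_light = 0.0 if Lind else -1.0    # penalty for violating the traffic signal
        return chi1 * r_col + chi2 * r_goal + chi3 * r_light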
S204: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′. In addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function.
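The patent does not spell out the internal structure of Z_τ; the sketch below follows one common construction from the cited IQN literature (a cosine embedding of the quantile level combined multiplicatively with a state encoding) and assumes a discretized action set, PyTorch, and the layer sizes shown.

    import math

    import torch
    import torch.nn as nn

    class ImplicitQuantileNetwork(nn.Module):
        # Z_tau(S, A): state encoder and cosine quantile embedding combined by element-wise product
        def __init__(self, state_dim, n_actions, hidden=128, n_cos=64):
            super().__init__()
            self.n_cos = n_cos
            self.state_enc = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.tau_enc = nn.Sequential(nn.Linear(n_cos, hidden), nn.ReLU())
            self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, n_actions))

        def forward(self, state, tau):
            # state: [B, state_dim]; tau: [B, N] quantile levels in (0, 1)
            psi = self.state_enc(state)                                    # [B, hidden]
            i_pi = torch.arange(1, self.n_cos + 1, device=tau.device).float() * math.pi
            cos_emb = torch.cos(tau.unsqueeze(-1) * i_pi)                  # [B, N, n_cos]
            phi = self.tau_enc(cos_emb)                                    # [B, N, hidden]
            return self.head(psi.unsqueeze(1) * phi)                       # [B, N, n_actions]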
S205: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S206: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
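A sketch of the greedy action selection of step S206 and of a first-in first-out experience pool follows; the sample count K, the buffer capacity and the names are assumptions for illustration.

    import random
    from collections import deque

    import torch

    def select_action(iqn, state, K=32):
        # A_t = argmax_a (1/K) * sum_k Z_tau_k(S_t, a), with tau_k ~ U(0, 1); state is a 1-D tensor
        with torch.no_grad():
            tau = torch.rand(1, K)                    # K quantile samples
            z = iqn(state.unsqueeze(0), tau)          # [1, K, n_actions]
            return int(z.mean(dim=1).argmax(dim=1).item())

    class ExperiencePool:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)      # deque drops the oldest tuples first (FIFO)

        def push(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, batch_size):
            return random.sample(self.buffer, batch_size)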
S207: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A). First, for any two quantiles τ_i and τ′_j, the temporal-difference error is formed as:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, R_t is the immediate reward at time t, A is the action space, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution.
Second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
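The update of step S207 can be sketched with the quantile Huber loss below, following the form in the cited IQN literature; the threshold value m = 1.0, the tensor layout and the reduction over the batch are illustrative assumptions.

    import torch

    def quantile_huber_loss(z_pred, z_target, tau, m=1.0):
        # z_pred:   [B, N, 1]  quantile estimates Z_tau_i(S_t, A_t)
        # z_target: [B, 1, N'] targets R_t + gamma * Z_tau'_j(S_{t+1}, A*_{t+1})
        # tau:      [B, N, 1]  quantile levels tau_i
        delta = z_target - z_pred                                # pairwise TD errors, [B, N, N']
        huber = torch.where(delta.abs() <= m,
                            0.5 * delta.pow(2),
                            m * (delta.abs() - 0.5 * m))
        loss = (tau - (delta.detach() < 0).float()).abs() * huber / m
        return loss.sum(dim=1).mean()                            # sum over N, mean over N' and batch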
S3: generating a behavior decision with risk perception capability by combining a Wang function based on reward distribution information output by an Implicit Quantile Network (IQN) model; the method specifically comprises the following steps:
Step S301: based on the reward distribution information Z_τ obtained in step S2, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value.
Step S302: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A method for realizing behavior decision of an autonomous vehicle, characterized by comprising the following steps:
s1: constructing a signal lamp crossroad simulation training scene containing environmental uncertainty factors;
s2: constructing an implicit quantile network model, including constructing a state space, an action space and a reward function;
S3: optimizing the implicit quantile network model constructed in step S2 through neural-network training;
S4: generating a behavior decision with risk-perception capability from the reward distribution information output by the implicit quantile network model optimized in step S3, combined with the Wang function.
2. The method for realizing behavior decision of an autonomous vehicle according to claim 1, wherein step S1 specifically comprises the following steps:
S11: setting a pedestrian model: the pedestrian motion trajectory in the simulation training scene is described by a kinematics model:

dx_p/dt = v_p cos θ_p
dy_p/dt = v_p sin θ_p
dθ_p/dt = ω_p

where v_p is the pedestrian speed, ω_p is the angular velocity, x_p, y_p and θ_p are respectively the abscissa, ordinate and heading angle of the pedestrian's center of gravity, and dx_p/dt, dy_p/dt and dθ_p/dt are their time derivatives;
S12: setting a surrounding vehicle model: the motion of the host vehicle and the surrounding vehicles in the simulation training scene is described by the following kinematic bicycle model:

dx/dt = v cos(θ + β)
dy/dt = v sin(θ + β)
dθ/dt = (v / l_r) sin β
dv/dt = a_c
β = arctan(l_r tan δ_f / (l_f + l_r))

where x and y are respectively the abscissa and ordinate of the vehicle's center of mass, v is the speed of the center of mass, θ is the vehicle yaw angle, β is the slip angle at the center of mass, l_f and l_r are the distances from the center of mass to the front and rear axles, δ_f is the front-wheel steering angle, a_c is the vehicle acceleration, and dx/dt, dy/dt, dθ/dt and dv/dt are the time derivatives of x, y, θ and v;
to enable the surrounding vehicles in the simulation training scene to interact with the host vehicle, the surrounding motor vehicles are controlled by a velocity difference model:

a_c = k[V - v + λΔv]
V = V_1 + V_2 tanh[C_1(x_front + L_length,front - x) + C_2]

where k is a sensitivity coefficient, Δv is the relative speed between the host vehicle and the surrounding vehicle, λ is the velocity-difference reaction coefficient, V_1, V_2, C_1 and C_2 are user-defined parameters, x_front is the abscissa of the center of mass of the preceding surrounding vehicle, L_length,front is the body length of that vehicle, and x is the abscissa of the vehicle's center of mass;
S13: setting the behavior types of surrounding motor vehicles and pedestrians, comprising four categories: regular vehicles, regular pedestrians, violating vehicles and violating pedestrians;
S14: initializing the environment: randomly initializing the initial state of the traffic signal and the initial speeds, positions and target speeds of the surrounding motor vehicles; the simulation environment outputs environment information E at each simulation time t, defined as:

E = {E_e, E_s1, E_s2, ..., E_si, ..., E_p1, E_p2, ..., E_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
E_e = {x_e, y_e, v_e, θ_e}
E_si = {x_si, y_si, v_si, θ_si}
E_pi = {x_pi, y_pi, v_pi, θ_pi}

where the subscript e denotes the host vehicle; the subscript si denotes the si-th surrounding vehicle (s1 is the first surrounding vehicle) and ns is the number of surrounding traffic-participating vehicles; the subscript pi denotes the pi-th pedestrian (p1 is the first pedestrian) and np is the number of pedestrians; x_e, y_e, v_e and θ_e are respectively the abscissa, ordinate, speed and yaw angle of the host vehicle's center of mass; x_si, y_si, v_si and θ_si are those of the si-th surrounding vehicle's center of mass; x_pi, y_pi, v_pi and θ_pi are those of the pi-th pedestrian's center of gravity; traffic_light denotes the traffic signal state.
3. The method for realizing behavior decision of an autonomous vehicle according to claim 2, wherein, in step S2,
1) The constructed state space S includes: the position (x_e, y_e), speed v_e and yaw angle θ_e of the host vehicle; the relative position (Δx_si, Δy_si), relative speed Δv_si and relative yaw angle Δθ_si of each surrounding vehicle with respect to the host vehicle; the relative position (Δx_pi, Δy_pi), relative speed Δv_pi and relative yaw angle Δθ_pi of each surrounding pedestrian with respect to the host vehicle; and the traffic signal state traffic_light, i.e. the state space S is represented as:

S = {s_e, s_s1, s_s2, ..., s_si, ..., s_p1, s_p2, ..., s_pi, ..., traffic_light},  si = 1, 2, ..., ns,  pi = 1, 2, ..., np
s_e = {x_e, y_e, v_e, θ_e}
s_si = {Δx_si, Δy_si, Δv_si, Δθ_si}
s_pi = {Δx_pi, Δy_pi, Δv_pi, Δθ_pi}
2) The constructed action space A includes the vehicle acceleration a_c and the front-wheel steering angle δ_f, i.e. A(S) = {a_c, δ_f};
3) The constructed reward function R includes a collision-safety reward r_col, a target reward r_goal and a traffic-signal reward r_light, namely:

R = χ_1 r_col + χ_2 r_goal + χ_3 r_light

where χ_1, χ_2 and χ_3 are the weighting coefficients of the respective terms in the reward function;

the collision-safety reward r_col requires the host vehicle to avoid collisions with other traffic-participating vehicles and pedestrians:

[formula image: r_col defined in terms of the indicator Cind]

where Cind = 1 when the host vehicle collides with a surrounding vehicle or pedestrian, and Cind = 0 otherwise;

the target reward r_goal requires the host vehicle to reach the destination safely within the specified time:

[formula image: r_goal defined in terms of the indicator Gind]

where Gind = 1 when the host vehicle reaches the destination safely within the specified time, and Gind = 0 otherwise;

the traffic-signal reward r_light requires the host vehicle to obey the traffic-light rules:

[formula image: r_light defined in terms of the indicator Lind]

where Lind = 1 when the host vehicle obeys the traffic rules while passing through the intersection, and Lind = 0 otherwise.
4. The method for realizing behavior decision of an autonomous vehicle according to claim 3, wherein step S3 specifically comprises the following steps:
S31: constructing an implicit quantile network Z_τ(S, A) using a neural network, whose inputs are the state space S and a quantile τ and whose parameters are θ_τ; constructing a target implicit quantile network Z_τ′(S, A) using a neural network, whose inputs are the state space S and a quantile τ′ and whose parameters are θ_τ′; in addition, setting hyper-parameters K, N and N′, where K is the number of quantile samples used by the implicit quantile network Z_τ when outputting the optimal action, N is the number of quantile samples of the implicit quantile network Z_τ used when computing the loss function, and N′ is the number of quantile samples of the target implicit quantile network Z_τ′ used when computing the loss function;
s32: randomly initializing a decision model based on deep reinforcement learning, wherein the decision model comprises hyper-parameters and network structure parameters of the model;
S33: based on the implicit quantile network Z_τ(S, A), inputting the state S_t at the current time t and computing the action A_t by:

A_t = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_t, a),  τ_k ~ U(0, 1)

meanwhile, computing the reward R_t obtained at the current time t according to the reward function, and computing the state S_{t+1} at time t+1 from the simulation environment output E; establishing an experience pool and putting the data tuple {S_t, A_t, R_t, S_{t+1}} into it; when the amount of training data exceeds the capacity of the experience pool, new training data replace old training data on a first-in first-out basis;
S34: randomly extracting B samples from the experience pool, and updating the implicit quantile network Z_τ(S, A) and the target implicit quantile network Z_τ′(S, A).
5. The method for realizing behavior decision of an autonomous vehicle according to claim 4, wherein step S34 specifically comprises: first, for any two quantiles τ_i and τ′_j, forming the temporal-difference error:

δ_t^{τ_i, τ′_j} = R_t + γ Z_{τ′_j}(S_{t+1}, A*_{t+1}) − Z_{τ_i}(S_t, A_t)
A*_{t+1} = argmax_{a∈A} (1/K) Σ_{k=1}^{K} Z_{τ_k}(S_{t+1}, a)

where A*_{t+1} is the optimal action at time t+1, γ is the discount factor, 1 ≤ k ≤ K, 1 ≤ i ≤ N, 1 ≤ j ≤ N′, and τ_k, τ_i, τ′_j ~ U(0, 1), where U denotes the uniform distribution;
second, the loss function, whose gradient is used for the update, is expressed as:

L = (1/N′) Σ_{i=1}^{N} Σ_{j=1}^{N′} ρ^m_{τ_i}(δ_t^{τ_i, τ′_j})
ρ^m_{τ}(δ) = |τ − I{δ < 0}| · L_m(δ) / m
L_m(δ) = (1/2) δ²,  if |δ| ≤ m;  L_m(δ) = m(|δ| − m/2),  otherwise

where ∇_{θ_τ} L is the gradient of the loss function with respect to the network parameters θ_τ, L_m is the Huber function, I{·} is the indicator function, i.e. it equals 1 when the condition is satisfied and 0 otherwise, and m is the set threshold.
6. The method for realizing behavior decision of an autonomous vehicle according to claim 5, wherein step S4 specifically comprises the following steps:
Step S41: based on the reward distribution information Z_τ obtained in step S3, the original distribution information is distorted using the Wang function ρ_Wang, computed as:

ρ_Wang(Z_τ) = E_{τ~U(0,1)}[ Z_{Φ(Φ^{-1}(τ) + α)}(S, A) ]

where Φ is the standard normal cumulative distribution function, Φ^{-1} is its inverse, E[·] denotes the mean, and α is a user-defined risk parameter value;
Step S42: selecting the optimal action: maximizing the value of ρ_Wang(Z_τ), i.e. computing the risk-sensitive behavior decision instruction:

A*_t = argmax_{a∈A} ρ_Wang(Z_τ(S_t, a))

where A*_t is the optimal action selected at time t.
CN202210528980.7A 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior Active CN114880938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210528980.7A CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210528980.7A CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Publications (2)

Publication Number Publication Date
CN114880938A true CN114880938A (en) 2022-08-09
CN114880938B CN114880938B (en) 2023-04-18

Family

ID=82675965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210528980.7A Active CN114880938B (en) 2022-05-16 2022-05-16 Method for realizing decision of automatically driving automobile behavior

Country Status (1)

Country Link
CN (1) CN114880938B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
CN114013443A (en) * 2021-11-12 2022-02-08 哈尔滨工业大学 Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN114312830A (en) * 2021-12-14 2022-04-12 江苏大学 Intelligent vehicle coupling decision model and method considering dangerous driving conditions

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169567A (en) * 2017-03-30 2017-09-15 深圳先进技术研究院 The generation method and device of a kind of decision networks model for Vehicular automatic driving
CN114013443A (en) * 2021-11-12 2022-02-08 哈尔滨工业大学 Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN114312830A (en) * 2021-12-14 2022-04-12 江苏大学 Intelligent vehicle coupling decision model and method considering dangerous driving conditions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WILL DABNEY et al.: "Implicit Quantile Networks for Distributional Reinforcement Learning", https://arxiv.org/pdf/1806.06923.pdf *

Also Published As

Publication number Publication date
CN114880938B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
EP3678911B1 (en) Pedestrian behavior predictions for autonomous vehicles
CN109598934B (en) Rule and learning model-based method for enabling unmanned vehicle to drive away from high speed
CN111775949B (en) Personalized driver steering behavior auxiliary method of man-machine co-driving control system
CN110969848A (en) Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes
CN106991251B (en) Cellular machine simulation method for highway traffic flow
CN105857306A (en) Vehicle autonomous parking path programming method used for multiple parking scenes
CN112249008B (en) Unmanned automobile early warning method aiming at complex dynamic environment
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113753026B (en) Decision-making method for preventing rollover of large commercial vehicle by considering road adhesion condition
CN110716562A (en) Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110956851B (en) Intelligent networking automobile cooperative scheduling lane changing method
CN112896188B (en) Automatic driving decision control system considering front vehicle encounter
US20220242422A1 (en) Systems and methods for updating the parameters of a model predictive controller with learned external parameters generated using simulations and machine learning
CN114644017A (en) Method for realizing safety decision control of automatic driving vehicle
CN114035575B (en) Unmanned vehicle motion planning method and system based on semantic segmentation
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN113722835B (en) Personification random lane change driving behavior modeling method
Wang et al. Vehicle trajectory prediction by knowledge-driven LSTM network in urban environments
CN113715842A (en) High-speed moving vehicle control method based on simulation learning and reinforcement learning
CN115303289A (en) Vehicle dynamics model based on depth Gaussian, training method, intelligent vehicle trajectory tracking control method and terminal equipment
CN115593433A (en) Remote take-over method for automatic driving vehicle
US20220242401A1 (en) Systems and methods for updating the parameters of a model predictive controller with learned controls parameters generated using simulations and machine learning
CN114880938B (en) Method for realizing decision of automatically driving automobile behavior
CN115123217B (en) Mine obstacle vehicle driving track generation method and device and computer equipment
CN113033902B (en) Automatic driving lane change track planning method based on improved deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant