CN115079697A - Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy - Google Patents
Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy
- Publication number: CN115079697A
- Application number: CN202210748792.5A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0214—…with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
- G05D1/0221—…with means for defining a desired trajectory involving a learning process
- G05D1/0287—…involving a plurality of land vehicles, e.g. fleet or convoy travelling
- G05D1/0289—…involving a plurality of land vehicles, with means for avoiding collisions between vehicles
Abstract
The invention discloses a commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning with the RSS strategy. An A3C framework is introduced: using multithreading, the vehicles in a fleet each interact with and learn from the environment in separate threads simultaneously, and each thread summarizes its learning results and stores them in Global_net. The learning results of the different vehicles are periodically pulled back from Global_net to guide each vehicle's subsequent interaction with the environment. Meanwhile, speed planning is performed on the ST graph using the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and ensures the smoothness of the commercial vehicles' driving trajectories. Finally, the invention incorporates the RSS safety-constraint strategy; this mathematically formulated safety strategy for automated vehicles provides a framework for otherwise implicit rules, achieving organic integration with other participants on the road and effectively addressing the safety problem posed by other vehicles merging into the queue while it is driving.
Description
Technical Field
The invention belongs to the field of automatic driving within artificial intelligence, and relates to a commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning with the RSS (responsibility-sensitive safety) model.
Background
The intelligent automobile is a high-tech product built on environment perception, intelligent driving, wireless communication and computer technologies, and the transformation and upgrading of the automobile industry is the process of gradually making automobiles intelligent. An automobile driving in an intelligent state takes safety, environmental protection, energy conservation and comfort as comprehensive control targets, cooperatively building an efficient and orderly transportation network.
Commercial fleets are now widely used in engineering practice. Commercial vehicles mainly comprise five types: passenger cars, trucks, semi-trailer tractors, incomplete passenger vehicles and incomplete trucks. They are characterized by large size, heavy weight and large driver blind spots. Commercial fleet path planning currently faces several problems during training. First, multiple vehicle bodies participate in training simultaneously, so training is difficult and the network may even fail to converge. Second, the reward function is hard to design: each fleet member has its own reward function, the actions they output interfere with one another, and rewards can cancel each other out, making exploration during training difficult. Finally, because commercial vehicles are large and heavily loaded, their safety cannot be well guaranteed under unmanned driving. Therefore, finding a commercial vehicle queue planning method that is simultaneously safe and efficient has become an important subject.
Disclosure of Invention
In order to solve the commercial vehicle queue problem, the A3C framework is introduced. A3C uses multithreading so that the vehicles in a fleet interact with and learn from the environment in separate threads simultaneously, and each thread collects its learning results and stores them in Global_net. The learning results of the different vehicles are periodically pulled back from Global_net to guide each vehicle's subsequent interaction with the environment. Meanwhile, speed planning is performed on the ST graph using the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and ensures the smoothness of the commercial vehicles' trajectories. Finally, the invention incorporates a safety-constraint strategy, namely the RSS (responsibility-sensitive safety) strategy; this mathematically formulated safety strategy for automated vehicles provides a framework for otherwise implicit rules, achieving organic integration with other participants on the road and effectively addressing the safety problem of vehicles merging during queue driving.
The invention provides a commercial vehicle queue planning method combining deep reinforcement learning with the RSS (responsibility-sensitive safety) strategy, which improves the learning efficiency of the fleet by means of the A3C framework and improves the safety and stability of the fleet while driving through the constraints of the Lattice algorithm and the RSS strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the commercial vehicle queue planning method combining deep reinforcement learning and RSS (responsibility sensitive safety) strategies comprises the following steps:
step 1: in order to better acquire information of surrounding traffic environment, the invention designs an effective and simple time sequence aerial view as the state quantity of the strategy network, thereby greatly improving the learning of the strategy network and ensuring the safety of output tracks. The generation of the time-series bird's eye view comprises the following two steps: (1) and obtaining surrounding environment information including moving and static obstacles and lane lines. Obtaining 0-t future of dynamic barrier by using prediction module (both lstm and GCN network) end Position information within time of (a); (2) and generating a characteristic aerial view of three dimensions of transverse dimension, longitudinal dimension and time by using the information obtained by the perception module and the prediction.
Step 2: perform the Frenet coordinate transformation and obtain the state quantity of the agent at the current time from the feature bird's-eye view. The coordinate information of the vehicle in the Cartesian coordinate system, (X, θ_X, k_X, v_X, a_X), can be converted into (s, ṡ, s̈), wherein X is the (vector) coordinate of the vehicle in the Cartesian coordinate system, θ_X is the heading of the vehicle, k_X is the curvature, v_X is the linear velocity and a_X is the acceleration of the vehicle in the Cartesian coordinate system; s is the longitudinal displacement in the Frenet coordinate system, ṡ is the first derivative of s with respect to time, and s̈ is the second derivative of s with respect to time. This yields the state quantity z = (s, ṡ, s̈) together with the time-series bird's-eye view. The action space is designed as the longitudinal end state of the trajectory: a = (s_end, ṡ_end, s̈_end, t_end).
and 3, step 3: the invention uses an A3C algorithm framework to generate training samples to fill an experience pool in an exploration process by using a training method of a fleet shared network. All the agents share the strategy network and participate in the training of the network together, so that the problem of network non-convergence is avoided.
The obtained state quantity z and action space a are taken as input; a policy gradient algorithm is used to improve the Lattice planning algorithm, and at the same time a reward function is designed in combination with the RSS (responsibility-sensitive safety) strategy to train the end-state sampling point of the agent.
The optimization goal of the policy network π_θ(z, a) is to maximize the expected return of the output planned trajectory:

J(θ) = Σ_τ p(τ, θ)·r(τ)

wherein z is the state feature of the surrounding traffic environment, a is the action output by the network (namely the longitudinal end state of the trajectory), θ is the network parameter, p(τ, θ) is the probability of executing action a and outputting trajectory τ under parameter θ and state z, r(τ) is the reward function of trajectory τ, and π denotes the policy network.

The policy network π_θ(z, a) is optimized by gradient ascent:

θ ← θ + α·∇_θ J(θ)

wherein α is the learning rate that controls the update step size.

Calculating the derivative of the optimization objective with respect to the network parameter θ:

∇_θ J(θ) = Σ_τ ∇_θ p(τ, θ)·r(τ) = Σ_τ p(τ, θ)·∇_θ log p(τ, θ)·r(τ)

In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its strategy according to the rewards, and stores the experience data <z, a, τ, r> into the experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool by the Monte Carlo method, and according to the law of large numbers the gradient of the objective function is approximated as:

∇_θ J(θ) ≈ (1/n)·Σ_{i=1}^{n} ∇_θ log π_θ(z_i, a_i)·r(τ_i)

The update direction of the final policy parameter θ is therefore:

θ ← θ + α·(1/n)·Σ_{i=1}^{n} ∇_θ log π_θ(z_i, a_i)·r(τ_i)
to reduce the variance, the baseline b is increased at the reward r (τ) to reduce the variance:
J(π)=∑ τ p(τ,θ)·[r(τ)-b]
the part BL containing the baseline is differentiated by the network parameter theta:
obviously, according to the derivation result of BL on the network parameter theta, increasing the baseline b in the objective function J (pi) does not affect the gradient of the final optimization objective J (pi)Adding a baseline b that is independent of action a does not affect the gradient of the final strategy.
The variance is calculated according to the formula:
as is apparent from the above description of the preferred embodiment,the smaller the variance, the smaller the variance. Designing a function of the part with respect to b
Then, the derivative of f (b) with respect to b is calculated:
f (b) derivative with respect to b f' (b) where b ═ Σ τ r (τ) is 0, i.e. when b ═ Σ τ r (τ) is the minimum of f (b) and the minimum of variance. Obviously, Σ τ r (tau) is an implicit state value V (z), so the state value V (z) can be used as a base line b to reduce the variance, and the convergence rate and the effect of the policy network are improved.
A reinforcement learning agent continuously improves its ability during training, and in this process it must repeatedly try and err in unfamiliar regions of the state space. In an unfamiliar state, a new action may earn the agent a higher reward, but may also turn out worse. An "exploration action" tries some new action; an "exploitation action" takes the known action that achieves the maximum reward, simply executing the current policy. Too much exploration (too little exploitation) makes convergence of the agent harder, while too little exploration (too much exploitation) makes the agent very likely to converge to a locally optimal region, so a trade-off between exploration and exploitation is required. To strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a local optimum, the policy network π_θ(z, a) designed in the invention outputs a normal distribution, consisting of a mean μ(z, θ) and a variance σ(z, θ):

π_θ(z, a) = N(a; μ(z, θ), σ(z, θ)²)

Theoretically, during learning the output mean μ(z, θ) of the policy network π_θ(z, a) continuously approaches the optimal strategy arg max_a Q(z, a), where Q(z, a) is the action value function and arg max_a returns the action that maximizes Q. The output variance σ(z, θ) continuously approaches 0 and the randomness of the strategy decreases. When the strategy is executed, the action is sampled from the normal distribution, a ~ N(μ(z, θ), σ(z, θ)²), and then output and executed.
Step 4: fit the longitudinal trajectory polynomial. Using the current longitudinal state of the vehicle (s_0, ṡ_0, s̈_0) and the optimal longitudinal end state (s_end, ṡ_end, s̈_end) at time t_end output by reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

s(t) = a_0 + a_1·t + a_2·t² + a_3·t³ + a_4·t⁴ + a_5·t⁵

The boundary conditions are:

s(0) = s_0, ṡ(0) = ṡ_0, s̈(0) = s̈_0, s(t_end) = s_end, ṡ(t_end) = ṡ_end, s̈(t_end) = s̈_end

Solving the fifth-order polynomial of the longitudinal trajectory with these boundary conditions yields the coefficients a_0 a_1 a_2 a_3 a_4 a_5, and hence the longitudinal trajectory polynomial s(t).
The optimal trajectory is thus obtained and input to the control module.
The invention provides an intelligent automobile controller, wherein an execution program of the method is arranged in the controller.
The invention also provides a storage device, which is internally provided with the program code of the method.
The invention has the beneficial effects that:
(1) For the automatic driving task, the invention adopts a method combining the Lattice algorithm with deep reinforcement learning to solve the problem of commercial vehicle queue driving under the RSS (responsibility-sensitive safety) strategy. Using the A3C framework greatly improves training efficiency and promotes network convergence; at the same time, under the RSS framework, the safety of the paths planned by reinforcement learning is greatly improved.
(2) Compared with the Lattice algorithm, the method abandons the time-consuming sampling and the cost-function evaluation of every candidate trajectory, greatly improving the real-time performance of the algorithm. Meanwhile, the reinforcement learning training process generalizes better, and designing the reward function around the final control effect makes the method better suited to complex traffic scenes and complex vehicle dynamics.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is the Policy Gradient neural network architecture used by the present invention;
FIG. 3 is trajectory point sampling on the ST graph.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention provides a commercial vehicle queue path planning method combining deep reinforcement learning (DRL) with the Lattice algorithm under the RSS (responsibility-sensitive safety) strategy, which improves the safety and stability of large commercial vehicles driving in a queue. As shown in FIG. 1, the method specifically comprises the following steps:
as shown in fig. 1, using the A3C framework, a training method using a fleet sharing network generated respective gradient-filled Global _ net during exploration. Compared with the method that the states of all the members of the fleet are used as input and the tracks of the number of the members are output, the method only takes the state of each intelligent agent as input and outputs the track of the intelligent agent, so that network convergence is promoted, and the phenomena of mutual interference of member actions and reward offset are avoided. Meanwhile, all intelligent networked automobiles share the decision network and participate in network training together. The individual training process for each agent is described below.
The longitudinal end-state sampling point is output by an agent trained with deep reinforcement learning:
(1) Design the feature bird's-eye view. The invention designs an effective and simple time-series bird's-eye view as the state quantity of the policy network, which greatly improves the learning of the policy network and ensures the safety of the output trajectory.
Generating the time-series bird's-eye view comprises two steps: (1) from the perception module of the automated vehicle, obtain the surrounding environment information, including moving and static obstacles and lane lines, and use the prediction module to obtain the position information of dynamic obstacles over the future time interval 0 to t_end; (2) generate a feature bird's-eye view over the three dimensions of lateral position, longitudinal position and time from the information obtained by the perception and prediction modules.
The three-dimensional time-series bird's-eye-view matrix has size (40, 400, 80). The first dimension, 40, covers the lateral range of 10 m on each side of the reference line with a lateral interval of 0.5 m; the second dimension, 400, covers the range 200 m longitudinally forward from the origin of the ego vehicle with a longitudinal interval of 0.5 m; and the third dimension, 80, covers the future time range of 8 s with a time interval of 0.1 s. Specifically, when a point [α, β, γ] of the time-series bird's-eye-view matrix equals −1, that point is an obstacle or a non-drivable area in time-space; when it equals 0, the point is a drivable area; and when it equals 1, the point is a point of the reference line.
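The (40, 400, 80) occupancy convention can be illustrated with a short sketch. The index mapping `to_index` and the helper names are assumptions for illustration, not taken from the patent.

```python
import numpy as np

FREE, OBSTACLE, REFERENCE = 0.0, -1.0, 1.0

def make_bev():
    # (lateral, longitudinal, time) = (40, 400, 80):
    # +/-10 m at 0.5 m, 200 m ahead at 0.5 m, 8 s at 0.1 s
    return np.full((40, 400, 80), FREE)

def to_index(d, s, t):
    """Map lateral offset d [m], longitudinal s [m] and time t [s] to grid
    indices. This mapping convention is an assumption for the sketch."""
    return (int(round((d + 10.0) / 0.5)),
            int(round(s / 0.5)),
            int(round(t / 0.1)))

bev = make_bev()
# mark a predicted obstacle 3 m right of the reference line, 50 m ahead, at t = 2 s
a, b, g = to_index(3.0, 50.0, 2.0)
bev[a, b, g] = OBSTACLE
# mark the reference line itself (d = 0) at the same s and t
a0, _, _ = to_index(0.0, 50.0, 2.0)
bev[a0, b, g] = REFERENCE
```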
(2) Design the state quantity. To obtain the state quantity of the agent at the current time from the feature bird's-eye view, the Frenet coordinate transformation is performed: the coordinate information of the vehicle in the Cartesian coordinate system, (X, θ_X, k_X, v_X, a_X), is converted into (s, ṡ, s̈), wherein X is the (vector) coordinate of the vehicle in the Cartesian coordinate system, θ_X is the heading of the vehicle, k_X is the curvature, v_X is the linear velocity and a_X is the acceleration of the vehicle in the Cartesian coordinate system; s is the longitudinal displacement in the Frenet coordinate system, ṡ is the first derivative of s with respect to time, and s̈ is the second derivative of s with respect to time. This yields the state quantity z = (s, ṡ, s̈) together with the time-series bird's-eye view. The action space is designed as the longitudinal end state of the trajectory: a = (s_end, ṡ_end, s̈_end, t_end).
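For the special case of a straight reference line, the transform reduces to a projection, which can be sketched as below. The general transform also involves the curvature k_X; the straight-line restriction and the function name are simplifying assumptions.

```python
import math

def cartesian_to_frenet_straight(x, y, theta, v, a):
    """Cartesian -> Frenet for a straight reference line along the x-axis
    (a simplifying assumption; the general case also uses curvature k_X).
    Returns (s, s_dot, s_ddot)."""
    s = x                          # longitudinal displacement = arc length along x
    s_dot = v * math.cos(theta)    # longitudinal component of the velocity
    s_ddot = a * math.cos(theta)   # longitudinal acceleration, heading assumed ~constant
    return s, s_dot, s_ddot

s, sd, sdd = cartesian_to_frenet_straight(x=12.0, y=1.5, theta=0.0, v=10.0, a=0.5)
```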
(3) Design the policy network. The obtained state quantity z and action space a are taken as input; a policy gradient algorithm is used to improve the Lattice planning algorithm, and a reward function is designed in combination with the RSS (responsibility-sensitive safety) strategy to train the end-state sampling point of the agent. The optimization goal of the policy network π_θ(z, a) is to maximize the expected return of the output planned trajectory:

J(θ) = Σ_τ p(τ, θ)·r(τ)

wherein z is the state feature of the surrounding traffic environment, a is the action output by the network (namely the longitudinal end state of the trajectory), θ is the network parameter, p(τ, θ) is the probability of executing action a and outputting trajectory τ under parameter θ and state z, and r(τ) is the reward function of trajectory τ.

Taking the derivative of the expected return J(θ) with respect to the parameter θ yields the optimal θ, at which the policy network π and the trajectory are optimal. The derivative of the optimization objective with respect to the network parameter θ is:

∇_θ J(θ) = Σ_τ ∇_θ p(τ, θ)·r(τ) = Σ_τ p(τ, θ)·∇_θ log p(τ, θ)·r(τ)

In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its strategy according to the rewards, and stores the experience data <z, a, τ, r> into the experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool by the Monte Carlo method, and according to the law of large numbers the gradient of the objective function is approximated as:

∇_θ J(θ) ≈ (1/n)·Σ_{i=1}^{n} ∇_θ log π_θ(z_i, a_i)·r(τ_i)
to reduce the variance, the baseline b is increased at the reward r (τ) to reduce the variance:
J(π)=∑ τ p(τ,θ)·[r(τ)-b]
the part BL containing the baseline is differentiated by the network parameter theta:
obviously, according to the derivation result of BL on the network parameter theta, increasing the baseline b in the objective function J (pi) does not affect the gradient of the final optimization objective J (pi)Adding a baseline b that is independent of action a does not affect the gradient of the final strategy.
The variance of the gradient estimate is calculated according to the formula:

Var = E[(∇_θ log p(τ, θ)·(r(τ) − b))²] − (E[∇_θ log p(τ, θ)·(r(τ) − b)])²

As is apparent from the above, the smaller the term (r(τ) − b)², the smaller the variance. Designing this part as a function of b:

f(b) = Σ_τ p(τ, θ)·(r(τ) − b)²

Then the derivative of f(b) with respect to b is calculated:

f′(b) = −2·Σ_τ p(τ, θ)·(r(τ) − b)

f′(b) = 0 when b = Σ_τ p(τ, θ)·r(τ), i.e. this value of b gives the minimum of f(b) and the minimum of the variance. Obviously, Σ_τ p(τ, θ)·r(τ) is the expected return, namely an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence rate and effect of the policy network.
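The variance-reduction effect of the baseline can be checked numerically. The synthetic data below is an assumption purely for illustration: returns with a large mean, multiplied by an independent zero-mean score term.

```python
import numpy as np

rng = np.random.default_rng(1)
grad_log = rng.normal(size=10_000)        # stand-in for the grad-log-probability terms
returns = 5.0 + rng.normal(size=10_000)   # noisy returns with a large mean
baseline = returns.mean()                 # sample estimate of V(z)

var_no_b = np.var(grad_log * returns)               # without baseline: ~E[r^2] = 26
var_with_b = np.var(grad_log * (returns - baseline))  # with baseline:  ~Var[r] = 1
```

Subtracting the mean return shrinks the variance of the gradient estimate by roughly a factor of 26 on this data while leaving its expectation unchanged.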
In order to strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a local optimum, the policy network π_θ(z, a) designed in the invention outputs a normal distribution, consisting of a mean μ(z, θ) and a variance σ(z, θ):

π_θ(z, a) = N(a; μ(z, θ), σ(z, θ)²)

Theoretically, during learning the output mean μ(z, θ) of the policy network π_θ(z, a) continuously approaches the optimal strategy arg max_a Q(z, a), while the output variance σ(z, θ) continuously approaches 0 and the randomness of the strategy decreases. When the strategy is executed, the action is sampled from the normal distribution, a ~ N(μ(z, θ), σ(z, θ)²), and then output and executed.
(4) Design the reward function. The reward function comprises the following parts, where k_1 to k_3 are the corresponding proportionality coefficients:

reward = k_1·r_speed + k_2·r_acc + k_3·r_safe
Here r_speed is the speed reward, whose goal is to keep the vehicle speed at the target speed, where v_target is the desired target speed, t_total is the number of trajectory points in time and v_t is the speed of the planned trajectory at time t:

r_speed = −(1/t_total)·Σ_t |v_t − v_target|
Here r_acc is the longitudinal comfort reward, whose goal is to keep the longitudinal jerk small, where a_t is the longitudinal acceleration of the planned trajectory at time t:

r_acc = −(1/t_total)·Σ_t |ȧ_t|
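The speed and comfort terms can be sketched as below. The exact normalization and signs are assumptions, and the function names are mine; the sketch only shows the structure of the weighted sum.

```python
import numpy as np

def speed_reward(v, v_target):
    """Penalize deviation from the target speed (sign/normalization assumed)."""
    v = np.asarray(v, dtype=float)
    return -np.mean(np.abs(v - v_target))

def comfort_reward(acc, dt):
    """Penalize longitudinal jerk, i.e. the finite difference of acceleration."""
    acc = np.asarray(acc, dtype=float)
    jerk = np.diff(acc) / dt
    return -np.mean(np.abs(jerk))

def total_reward(v, acc, v_target, dt, k1=1.0, k2=1.0, k3=1.0, r_safe=0.0):
    """Weighted sum reward = k1*r_speed + k2*r_acc + k3*r_safe."""
    return (k1 * speed_reward(v, v_target)
            + k2 * comfort_reward(acc, dt)
            + k3 * r_safe)
```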
Here r_safe is the safety reward; the reward function is further designed under the RSS (responsibility-sensitive safety) policy with the goal of generating trajectories that meet the safety standard.

Longitudinal safe distance:

d_min^long = max(0, v_r·ρ + ½·a_max,accel·ρ² + (v_r + ρ·a_max,accel)²/(2·a_min,brake) − v_f²/(2·a_max,brake))

wherein v_f is the speed of the front vehicle, v_r is the speed of the rear vehicle, ρ is the driver reaction time, a_min,brake is the minimum braking acceleration, a_max,brake is the maximum braking acceleration and a_max,accel is the maximum acceleration.
Lateral safe distance:

d_min^lat = μ + max(0, ((v_1 + v_{1,ρ})/2)·ρ + v_{1,ρ}²/(2·a^lat_min,brake) − (((v_2 + v_{2,ρ})/2)·ρ − v_{2,ρ}²/(2·a^lat_min,brake)))

with v_{1,ρ} = v_1 + ρ·a^lat_max,accel and v_{2,ρ} = v_2 − ρ·a^lat_max,accel, wherein v_1 is the lateral speed of the ego vehicle, v_2 is the lateral speed of the other vehicle attempting to cut into the fleet, μ is the minimum lateral distance when the lateral speeds of both vehicles are 0, a^lat_max,accel is the maximum lateral acceleration, a^lat_min,brake is the minimum lateral braking acceleration and ρ is the driver reaction time.
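The longitudinal safe distance can be computed directly. The function name and the example parameter values are assumptions for illustration.

```python
def rss_longitudinal_safe_distance(v_f, v_r, rho, a_max_accel, a_min_brake, a_max_brake):
    """RSS minimum longitudinal gap: the rear vehicle (speed v_r) accelerates at
    a_max_accel for the reaction time rho and then brakes at a_min_brake, while
    the front vehicle (speed v_f) brakes at a_max_brake."""
    d = (v_r * rho
         + 0.5 * a_max_accel * rho**2
         + (v_r + rho * a_max_accel) ** 2 / (2.0 * a_min_brake)
         - v_f ** 2 / (2.0 * a_max_brake))
    return max(d, 0.0)  # a negative value means any gap is already safe

# example: both vehicles at 20 m/s, 1 s reaction time
d_min = rss_longitudinal_safe_distance(v_f=20.0, v_r=20.0, rho=1.0,
                                       a_max_accel=2.0, a_min_brake=4.0,
                                       a_max_brake=8.0)
# 20 + 1 + 22^2/8 - 400/16 = 56.5 m
```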
When the vehicle drives along the trajectory generated by the policy network, if the lateral or longitudinal distance d to the vehicles ahead of or behind it in the fleet, or to other vehicles cutting in, is less than the minimum safe distance, the reward is −100; otherwise it is 0:

r_safe = −100 if d < d_min, otherwise r_safe = 0

where d is the distance to the other vehicle.
(5) Fit the longitudinal trajectory polynomial. Using the current longitudinal state of the vehicle (s_0, ṡ_0, s̈_0) and the optimal longitudinal end state (s_end, ṡ_end, s̈_end) at time t_end output by reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

s(t) = a_0 + a_1·t + a_2·t² + a_3·t³ + a_4·t⁴ + a_5·t⁵

The boundary conditions are:

s(0) = s_0, ṡ(0) = ṡ_0, s̈(0) = s̈_0, s(t_end) = s_end, ṡ(t_end) = ṡ_end, s̈(t_end) = s̈_end

Solving the fifth-order polynomial of the longitudinal trajectory with these boundary conditions yields the coefficients a_0 a_1 a_2 a_3 a_4 a_5, and hence the longitudinal trajectory polynomial s(t).
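The six boundary conditions define a 6×6 linear system for a_0..a_5, which can be solved directly; the function name and example values are illustrative assumptions.

```python
import numpy as np

def fit_quintic(s0, sd0, sdd0, s1, sd1, sdd1, T):
    """Solve for the coefficients a_0..a_5 of s(t) = sum_i a_i t^i from the
    six boundary conditions at t = 0 and t = T (the agent's end state)."""
    A = np.array([
        [1.0, 0.0, 0.0,   0.0,      0.0,       0.0],       # s(0)   = s0
        [0.0, 1.0, 0.0,   0.0,      0.0,       0.0],       # s'(0)  = sd0
        [0.0, 0.0, 2.0,   0.0,      0.0,       0.0],       # s''(0) = sdd0
        [1.0, T,   T**2,  T**3,     T**4,      T**5],      # s(T)   = s1
        [0.0, 1.0, 2*T,   3*T**2,   4*T**3,    5*T**4],    # s'(T)  = sd1
        [0.0, 0.0, 2.0,   6*T,      12*T**2,   20*T**3],   # s''(T) = sdd1
    ])
    b = np.array([s0, sd0, sdd0, s1, sd1, sdd1], dtype=float)
    return np.linalg.solve(A, b)

# example: start at s = 0 with 10 m/s, reach s = 90 m at 8 m/s after T = 8 s
coef = fit_quintic(0.0, 10.0, 0.0, 90.0, 8.0, 0.0, T=8.0)
s_of_T = sum(c * 8.0**i for i, c in enumerate(coef))  # evaluate s(T)
```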
Finally, the obtained trajectory is input to the control module for trajectory tracking control.
As shown in FIG. 2, the policy network π_θ(z, a) consists of a convolutional (CNN) feature extraction network and a fully connected network (FCN). Here z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical track of the ego vehicle; a is the output of the policy network, namely the end state of the planned trajectory; and θ denotes the weight and bias parameters of the network. The input of the CNN feature extraction network is the space-time bird's-eye-view matrix and its output is the extracted environmental feature information. The input of the fully connected network (FCN) is the environmental feature information output by the CNN together with the historical trajectory information of the automated vehicle, and its output is the end state of the trajectory.
The convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers:
- Input layer: combines three 256 × 256 × 3 matrices into one 256 × 256 × 9 matrix.
- Conv1: (3 × 3 × 9) × 32 kernels, stride 2; input 256 × 256 × 9 (the output of the input layer), output a 128 × 128 × 32 feature.
- Pool1: (2 × 2) pooling kernels, stride 2; input the 128 × 128 × 32 feature from Conv1, output a 64 × 64 × 32 feature.
- Conv2: (3 × 3 × 32) × 64 kernels, stride 2; input the 64 × 64 × 32 feature from Pool1, output a 32 × 32 × 128 feature.
- Pool2: (2 × 2) pooling kernels, stride 2; input the 32 × 32 × 128 feature from Conv2, output a 16 × 16 × 128 feature.
- Conv3: (3 × 3 × 128) × 128 kernels, stride 2; input the 16 × 16 × 128 feature from Pool2, output an 8 × 8 × 128 feature.
- FC: size (8 × 8 × 128) × 512; input the 8 × 8 × 128 feature from Conv3, output a 1 × 512 feature.
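The halving pattern of the feature maps (256 → 128 → 64 → 32 → 16 → 8 across Conv1, Pool1, Conv2, Pool2 and Conv3) can be checked with simple stride arithmetic; the 'same'-style padding is an assumption.

```python
def stride2_size(n):
    """Spatial size after a stride-2 conv/pool layer, assuming 'same'-style padding."""
    return (n + 1) // 2

sizes = [256]
for layer in ("Conv1", "Pool1", "Conv2", "Pool2", "Conv3"):
    sizes.append(stride2_size(sizes[-1]))
# sizes matches the feature-map widths listed above: 256, 128, 64, 32, 16, 8

flattened = sizes[-1] * sizes[-1] * 128  # 8 x 8 x 128 feature feeding the FC layer
```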
The fully connected layers FC-μ and FC-σ form a parallel structure; both take as input the 1 × 512 feature extracted by the convolutional neural network. The output of FC-μ is a 1 × 512 feature, and the output of FC-σ is a 1 × 512 feature. The features extracted by FC-μ and FC-σ together constitute the state feature z.
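As a quick sanity check on the layer sizes above, the spatial dimension can be traced through the five stride-2 stages. This is only a sketch with assumed padding (padding 1 for the 3-wide convolutions, none for the pooling layers); the patent does not state the padding explicitly:

```python
def out_size(n, k, stride, pad):
    # standard conv/pool output-size formula: floor((n + 2*pad - k) / stride) + 1
    return (n + 2 * pad - k) // stride + 1

# Conv1, Pool1, Conv2, Pool2, Conv3 as (kernel, stride, pad) — assumed padding
layers = [(3, 2, 1), (2, 2, 0), (3, 2, 1), (2, 2, 0), (3, 2, 1)]
size, trace = 256, [256]
for k, s, p in layers:
    size = out_size(size, k, s, p)
    trace.append(size)
# trace reproduces the 256 -> 128 -> 64 -> 32 -> 16 -> 8 progression in the text
```

Under these assumptions the trace matches the dimensions stated for Conv1 through Conv3.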
As shown in FIG. 3, in the ST diagram the traffic scene is divided into two main cases:
(1) There is no obstacle ahead of the own vehicle. A trajectory is fitted from the initial state quantity and the end-state trajectory points obtained from deep reinforcement learning training, and the longitudinal speed is planned.
(2) There is an obstacle ahead of the own vehicle. The invention draws obstacles as parallelograms that block portions of the road for a particular period of time. For example, in the lower graph, the prediction module predicts that another vehicle will cut into the own lane from time t0 to t1, during which it occupies positions s0 to s1; a rectangle is therefore drawn on the ST diagram that blocks positions s0 to s1 during time t0 to t1. To avoid collisions, the speed profile must not intersect this rectangle.
When following a vehicle, the speed profile must lie below the lower boundary of the car-following sampling region; when overtaking, the speed profile must lie above the lower limit of the overtaking sampling region.
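The ST-graph rule above can be illustrated with a minimal sketch. The helper name `profile_is_safe` and the numeric profiles are illustrative assumptions, not from the patent; the rectangle [t0, t1] × [s0, s1] is the blocked region:

```python
def profile_is_safe(profile, t0, t1, s0, s1):
    """Return True if no sampled (t, s) point of the speed profile falls
    inside the obstacle rectangle [t0, t1] x [s0, s1] on the ST graph."""
    return not any(t0 <= t <= t1 and s0 <= s <= s1 for t, s in profile)

# blocked region: a cut-in vehicle occupies s in [40, 100] m for t in [2, 5] s
follow = [(t, 5.0 * t) for t in range(9)]     # stays below s0 while blocked
overtake = [(t, 60.0 * t) for t in range(9)]  # already past s1 when blocked
collide = [(t, 20.0 * t) for t in range(9)]   # enters the rectangle
```

Here both the car-following and the overtaking profiles avoid the rectangle, while the intermediate profile intersects it and would be rejected.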
In summary, for the automated driving task, the invention solves the platoon driving problem of commercial vehicles by combining the Lattice algorithm with deep reinforcement learning under an RSS (Responsibility-Sensitive Safety) strategy. By applying the A3C framework, training efficiency is greatly improved and network convergence is promoted. Meanwhile, under the RSS framework, the safety of the path planned by reinforcement learning is greatly improved. Compared with the Lattice algorithm, the method abandons the time-complex sampling and the cost-function evaluation of every candidate trajectory, greatly improving the timeliness of the algorithm. Meanwhile, the reinforcement learning training process has better generality, and the design of the reward function based on the final control effect is better suited to complex traffic scenes and complex vehicle dynamics.
In addition, an embodiment of the invention further provides an intelligent automobile controller, in which an executable program of the above method is installed.
An embodiment of the invention further provides a storage device, in which the program code of the above method is stored.
The detailed descriptions listed above are merely specific illustrations of possible embodiments of the present invention; they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within its scope.
Claims (10)
1. A method for planning a queue path of a commercial vehicle by combining deep reinforcement learning and RSS strategies is characterized by comprising the following steps:
s1, designing a time sequence aerial view as a state quantity of the strategy network;
s2, Frenet coordinate transformation is carried out, and the state quantity of the intelligent agent at the current time is obtained from the characteristic aerial viewAnd the action space is designed as the longitudinal end state of the track:
s3, obtaining the state quantityAnd the motion spaceAs a strategy network input, improving a Lattice planning algorithm by utilizing a strategy gradient algorithm, designing a reward function by combining with an RSS strategy, and training the final state longitudinal state of the intelligent agent;
2. The method for planning a queue path of a commercial vehicle according to claim 1, wherein the step S1 comprises the following two steps: (1) obtaining surrounding environment information, including dynamic and static obstacles and lane lines, and predicting the position information of dynamic obstacles within the future time 0 to t_end; and (2) generating a characteristic bird's eye view with three dimensions (lateral, longitudinal and time) from the obtained environment information and the prediction information.
3. The method for planning a queue path of a commercial vehicle according to claim 1, wherein the three-dimensional time-series bird's-eye-view matrix has dimensions (40, 400, 80), wherein the first dimension 40 represents a lateral range of 10 m on each side of the reference line with a lateral displacement interval of 0.5 m; the second dimension 400 represents a longitudinal range of 200 m forward from the origin of the own vehicle with a longitudinal displacement interval of 0.5 m; and the third dimension 80 represents a time range within the future 8 s with a time interval of 0.1 s. When a point [α, β, γ] in the time-series bird's-eye-view matrix is -1, the point contains an obstacle or is an undrivable area in that time space; when a point [α, β, γ] is 0, the point is a drivable area in that time space; and when a point [α, β, γ] is 1, the point is a point of the reference line.
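An illustrative encoder for this (40, 400, 80) grid might look as follows. `make_bev` and the cell lists are hypothetical names, assuming lateral offset d in [-10, 10) m, longitudinal s in [0, 200) m, and 0.1 s time cells over the 8 s horizon:

```python
import numpy as np

def make_bev(reference_cells, obstacle_cells):
    """Build the time-series bird's-eye-view matrix:
    -1 = obstacle / undrivable, 0 = drivable, 1 = reference-line cell.
    Cells are (d, s, t) tuples in metres / seconds."""
    bev = np.zeros((40, 400, 80), dtype=np.int8)
    for d, s, t in reference_cells:
        bev[int((d + 10.0) / 0.5), int(s / 0.5), int(t / 0.1)] = 1
    for d, s, t in obstacle_cells:  # obstacles override the reference line
        bev[int((d + 10.0) / 0.5), int(s / 0.5), int(t / 0.1)] = -1
    return bev
```

For example, a reference-line cell at (d=0, s=10 m, t=0 s) lands at index [20, 20, 0], and a predicted obstacle at (0, 50 m, 2 s) at [20, 100, 20].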
4. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S2, the Frenet coordinate transformation is as follows:
The coordinate information (X, θ_X, k_X, v_X, a_X) of the vehicle in the Cartesian coordinate system is transformed into (s, ṡ, s̈) in the Frenet coordinate system, wherein X is the position vector of the vehicle in the Cartesian coordinate system, θ_X is the heading of the vehicle in the Cartesian coordinate system, k_X is the curvature, v_X is the linear velocity of the vehicle in the Cartesian coordinate system, and a_X is the acceleration of the vehicle in the Cartesian coordinate system; s is the longitudinal displacement in the Frenet coordinate system, ṡ is the first derivative of the longitudinal displacement s with respect to time, and s̈ is the second derivative of the longitudinal displacement s with respect to time.
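A simplified sketch of the longitudinal part of this transformation is shown below. It assumes a locally straight reference line (curvature coupling terms dropped, i.e. k_X ≈ 0); the function name is hypothetical:

```python
import math

def longitudinal_frenet_state(s_proj, theta_ref, v_x, a_x, theta_x):
    """Longitudinal Frenet state (s, s_dot, s_ddot): s is the arc length of
    the projection of the vehicle onto the reference line, and with heading
    error dtheta = theta_X - theta_ref,
        s_dot  = v_X * cos(dtheta)
        s_ddot ~ a_X * cos(dtheta)   (curvature terms neglected)."""
    dtheta = theta_x - theta_ref
    return s_proj, v_x * math.cos(dtheta), a_x * math.cos(dtheta)
```

For a vehicle aligned with the reference line (dtheta = 0), ṡ and s̈ reduce to v_X and a_X directly.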
5. The method for planning the queue path of the commercial vehicle according to claim 1, wherein in S3 the optimization goal of the policy network π_θ(z, a) is to maximize the expected return of the output planned trajectory: J(θ) = Σ_τ p(τ, θ)·R(τ);
wherein z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e. the longitudinal end state of the trajectory), θ is the network parameter, p(τ, θ) is the probability of outputting trajectory τ by executing action a under parameter θ and state z, and R(τ) is the reward function of trajectory τ;
differentiating J(θ) gives the derivative of the optimization objective with respect to the network parameter θ: ∇_θ J(θ) = Σ_τ p(τ, θ)·∇_θ log p(τ, θ)·R(τ);
in the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its strategy according to the rewards, and stores the experience data ⟨z, a, τ, r⟩ into an experience pool (Memory) in real time; during training, n pieces of experience data ⟨z, a, τ, r⟩ are randomly sampled from the experience pool by the Monte Carlo method, and by the law of large numbers the gradient of the objective function is simplified and approximated as: ∇_θ J(θ) ≈ (1/n)·Σ_{i=1}^{n} ∇_θ log π_θ(a_i|z_i)·r(τ_i);
to reduce the variance, a baseline b is subtracted from the reward r(τ):
J(π) = Σ_τ p(τ, θ)·[r(τ) − b]
the part BL containing the baseline is differentiated with respect to the network parameter θ:
calculating the variance:
Then, the derivative of f (b) with respect to b is calculated:
the derivative f′(b) of f(b) with respect to b equals 0 when b = Σ_τ r(τ); that is, when b = Σ_τ r(τ), f(b) is minimal and the variance is minimal. Σ_τ r(τ) is an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and effect of the policy network;
the output of the policy network π_θ(z, a) follows a normal distribution and specifically comprises two parts, the mean μ(z, θ) and the variance σ(z, θ):
during learning of the policy network π_θ(z, a), the output mean μ(z, θ) continuously approaches the optimal strategy arg max_a Q(z, a), the output variance σ(z, θ) continuously approaches 0, and the randomness of the strategy decreases; when the strategy is executed, an action is sampled from the normal distribution, output and executed.
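The Monte Carlo policy-gradient update with a value baseline and the Gaussian action sampling described in this claim can be sketched as follows. This is an illustrative sketch, not the patent's implementation; `reinforce_grad` and `sample_action` are hypothetical helper names:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad(logprob_grads, returns, baseline):
    """Monte Carlo estimate grad J ~ (1/n) * sum_i grad log pi(a_i|z_i) * (r_i - b),
    with the state value V(z) supplied as baseline b to reduce variance."""
    g = np.zeros_like(logprob_grads[0])
    for glp, r in zip(logprob_grads, returns):
        g += glp * (r - baseline)
    return g / len(returns)

def sample_action(mu, sigma):
    """Draw the longitudinal end-state action a ~ N(mu(z, theta), sigma(z, theta)^2);
    as training converges sigma approaches 0 and a approaches mu."""
    return mu + sigma * rng.standard_normal(np.shape(mu))
```

With σ = 0 (the converged, deterministic limit) the sampled action equals the mean μ.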
6. The method for planning the queue path of the commercial vehicle by combining the deep reinforcement learning and the RSS strategy according to claim 1, wherein the reward function of the strategy network is designed as follows:
reward = k_1·r_speed + k_2·r_acc + k_3·r_safe
wherein r_speed is the speed reward, whose goal is to keep the vehicle speed at the target speed; v_target is the desired target vehicle speed, t_total is the number of trajectory points in time units, and v_t is the speed of the planned trajectory at time t:
wherein r_acc is the longitudinal comfort reward, whose goal is to maintain a small longitudinal jerk, computed from the longitudinal acceleration of the planned trajectory at time t:
wherein r_safe is the safety reward, whose goal is that the generated trajectory meets the safety standard;
longitudinal safety distance: d_min = v_r·ρ + (1/2)·a_max,accel·ρ² + (v_r + ρ·a_max,accel)²/(2·a_min,brake) − v_f²/(2·a_max,brake),
wherein v_f is the speed of the front vehicle, v_r is the speed of the rear vehicle, ρ is the driver reaction time, a_min,brake is the minimum braking acceleration, a_max,brake is the maximum braking acceleration, and a_max,accel is the maximum acceleration;
transverse safe distance:
wherein μ is the minimum lateral distance when the lateral speeds of the two vehicles are 0, a_max,accel^lat is the maximum lateral acceleration, a_min,brake^lat is the minimum lateral braking acceleration, and ρ is the driver reaction time;
when the vehicle travels along the trajectory generated by the policy network, if the lateral or longitudinal distance to the front or rear vehicle, or to another vehicle cutting into the platoon, is less than the minimum safety distance, the reward is -100; otherwise the reward is 0:
wherein d is the distance to the other vehicle.
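The longitudinal part of this safety reward can be sketched as follows, using the published RSS longitudinal safe-distance formula. The function names and the numeric parameter values in the example are assumptions for illustration, not values from the patent:

```python
def rss_longitudinal_min_gap(v_f, v_r, rho, a_max_accel, a_min_brake, a_max_brake):
    """RSS longitudinal safe distance: the rear vehicle (speed v_r) accelerates
    at a_max_accel during the reaction time rho, then brakes at a_min_brake,
    while the front vehicle (speed v_f) brakes at a_max_brake."""
    v_rho = v_r + rho * a_max_accel
    d = (v_r * rho + 0.5 * a_max_accel * rho ** 2
         + v_rho ** 2 / (2.0 * a_min_brake)
         - v_f ** 2 / (2.0 * a_max_brake))
    return max(d, 0.0)

def r_safe(gap, d_min):
    # reward term: -100 if the actual gap violates the safe distance, else 0
    return -100.0 if gap < d_min else 0.0
```

For example, with both vehicles at 20 m/s, ρ = 1 s, a_max,accel = 2, a_min,brake = 4 and a_max,brake = 8 m/s², the minimum safe gap evaluates to 56.5 m.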
7. The method for planning a queue path of a commercial vehicle combining deep reinforcement learning and RSS strategy according to claim 1, wherein the policy network π_θ(z, a) comprises a convolutional (CNN) feature extraction network and a fully connected network (FCN); wherein z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical trajectory of the own vehicle; a is the output of the policy network, i.e. the end state of the planned trajectory; θ denotes the weight and bias parameters of the network; the input of the CNN feature extraction network is the space-time bird's-eye-view matrix and its output is the finally extracted environmental feature information; the input of the fully connected network (FCN) is the environmental feature information output by the CNN feature extraction network and the historical trajectory information of the autonomous vehicle, and its output is the end state of the trajectory.
The convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers, wherein the input layer combines 3 matrices of 256 × 3 into one 256 × 9 matrix; convolutional layer Conv1 consists of (3 × 9) × 32 convolution kernels with stride = 2, its input is the output of the input layer, a 256 × 9 matrix, and its output is a 128 × 32 feature; pooling layer Pool1 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv1, a 128 × 32 feature, and its output is a 64 × 32 feature; convolutional layer Conv2 consists of (3 × 32) × 64 convolution kernels with stride = 2, its input is the output of Pool1, a 64 × 32 feature, and its output is a 32 × 128 feature; pooling layer Pool2 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv2, a 32 × 128 feature, and its output is a 16 × 128 feature; convolutional layer Conv3 consists of (3 × 128) × 128 convolution kernels with stride = 2, its input is the output of Pool2, a 16 × 128 feature, and its output is an 8 × 128 feature; the fully connected layer FC has size (8 × 128) × 512, its input is the output of Conv3, an 8 × 128 feature, and its output is a 1 × 512 feature; the fully connected layers FC-μ and FC-σ form a parallel structure, both taking as input the 1 × 512 feature extracted by the convolutional neural network; the output of FC-μ is a 1 × 512 feature and the output of FC-σ is a 1 × 512 feature; and the features extracted by FC-μ and FC-σ together constitute the state feature z.
8. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S4, the longitudinal trajectory polynomial is fitted as follows:
the boundary conditions are as follows:
according to the fifth-order polynomial of the longitudinal trajectory s(t) = a_0 + a_1·t + a_2·t² + a_3·t³ + a_4·t⁴ + a_5·t⁵ and the boundary conditions:
based on the obtained a 0 a 1 a 2 a 3 a 4 a 5 Obtaining the fifth-order polynomial s of the longitudinal track trajectory 。
9. An intelligent automobile controller, characterized in that the controller is internally provided with an execution program of the method of any one of claims 1 to 8.
10. A storage device, characterized in that it houses the program code of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210748792.5A CN115079697A (en) | 2022-06-29 | 2022-06-29 | Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115079697A true CN115079697A (en) | 2022-09-20 |
Family
ID=83256365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210748792.5A Pending CN115079697A (en) | 2022-06-29 | 2022-06-29 | Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115079697A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115871658A (en) * | 2022-12-07 | 2023-03-31 | 之江实验室 | Intelligent driving speed decision method and system for dense pedestrian flow |
CN117542218A (en) * | 2023-11-17 | 2024-02-09 | 上海智能汽车融合创新中心有限公司 | Vehicle-road cooperative system based on vehicle speed-vehicle distance guiding control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||