CN115079697A - Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy - Google Patents

Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Info

Publication number
CN115079697A
Authority
CN
China
Prior art keywords
output
network
strategy
time
longitudinal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210748792.5A
Other languages
Chinese (zh)
Inventor
朱子轩
蔡英凤
陈龙
孙晓强
何友国
袁朝春
方啸
陆文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202210748792.5A
Publication of CN115079697A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D 1/02: Control of position or course in two dimensions
    • G05D 1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D 1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D 1/0214: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D 1/0221: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D 1/0287: Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D 1/0289: Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling with means for avoiding collisions between vehicles

Abstract

The invention discloses a commercial vehicle platoon path planning method, a controller and a storage device combining deep reinforcement learning and the RSS strategy. An A3C framework is introduced: using multithreading, the vehicles in the fleet each interact with and learn from the environment in separate threads simultaneously, and each thread aggregates its learning results and stores them in Global_net. The learning results of the different fleet vehicles are periodically pulled back from Global_net to guide the subsequent interaction between each vehicle and the environment. Meanwhile, speed planning is performed on the ST graph with the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and guarantees the smoothness of the commercial vehicles' driving trajectories. Finally, the invention incorporates the safety-constraint RSS strategy; this mathematically formulated safety strategy for autonomous vehicles provides a framework for otherwise implicit driving rules, enabling organic integration with other road participants and effectively handling the safety problem when other vehicles merge into the lane while the platoon is driving.

Description

Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy
Technical Field
The invention belongs to the field of automatic driving within artificial intelligence, and relates to a commercial vehicle queue path planning method, a controller and a storage device combining deep reinforcement learning and the Responsibility-Sensitive Safety (RSS) model.
Background
The intelligent automobile is a high-tech product built on environment perception, intelligent driving, wireless communication and computer technologies, and the transformation and upgrading of the automobile industry is the process of gradually making vehicles intelligent. Vehicles in an intelligent driving state take safety, environmental protection, energy saving and comfort as comprehensive control objectives and cooperatively build an efficient and orderly transportation network.
Commercial fleets are currently widely used in engineering practice. Commercial vehicles mainly comprise five categories: passenger cars, trucks, semi-trailer tractors, incomplete passenger vehicles and incomplete trucks. They are characterized by large size, heavy weight and large driver blind spots. At present, commercial fleet path planning faces several problems during training. First, multiple vehicle bodies participate in training at the same time, which makes training difficult and may even prevent the network from converging. Second, the reward function is hard to design: each fleet member has its own reward function, the actions output by the members interfere with one another, and rewards can cancel each other out, making exploration during training difficult. Finally, because commercial vehicles are large and heavily loaded, their safety cannot be well guaranteed in unmanned driving. How to find a commercial vehicle queue planning method that is both safe and efficient has therefore become an important subject.
Disclosure of Invention
In order to solve the commercial vehicle queue problem, the A3C framework is introduced. A3C uses multithreading so that the vehicles in the fleet interact with and learn from the environment in separate threads simultaneously, and each thread aggregates its learning results and stores them in Global_net. The learning results of the different fleet vehicles are periodically pulled back from Global_net to guide the subsequent interaction between each vehicle and the environment. Meanwhile, speed planning is performed on the ST graph with the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and guarantees the smoothness of the commercial vehicles' driving trajectories. Finally, the invention incorporates a safety-constraint strategy, namely the Responsibility-Sensitive Safety (RSS) strategy; this mathematically formulated safety strategy for autonomous vehicles provides a framework for otherwise implicit driving rules, enabling organic integration with other road participants and effectively handling the safety problem of merging vehicles during platoon driving.
The invention provides a commercial vehicle queue planning method combining deep reinforcement learning and the RSS strategy, which improves the learning efficiency of the fleet by means of the A3C framework and improves the safety and stability of the fleet while driving through the constraints of the Lattice algorithm and the RSS strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the commercial vehicle queue planning method combining deep reinforcement learning and RSS (responsibility sensitive safety) strategies comprises the following steps:
Step 1: In order to better acquire information about the surrounding traffic environment, the invention designs an effective and simple time-series bird's eye view as the state quantity of the policy network, which greatly improves the learning of the policy network and guarantees the safety of the output trajectory. Generating the time-series bird's eye view comprises the following two steps: (1) obtain the surrounding environment information, including moving and static obstacles and lane lines, and use a prediction module (e.g., an LSTM or GCN network) to obtain the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) use the information obtained from the perception module and the prediction to generate a characteristic bird's eye view with three dimensions: lateral, longitudinal and time.
Step 2: frenet coordinate transformation is performed, and the state quantity of the agent at the current time is obtained from the characteristic aerial view. Coordinate information of the vehicle in a Cartesian coordinate system(X,θ X ,k X ,v X ,a X ) Can be converted into
Figure BDA0003720472470000028
Wherein X is the coordinate of the vehicle in a Cartesian coordinate system and is a vector, and theta X For the orientation of the vehicle in a Cartesian coordinate system, k X Is curvature, v X Is the linear velocity of the vehicle in a Cartesian coordinate system, a X Is the acceleration of the vehicle in a cartesian coordinate system. s is the longitudinal displacement in the Frenet coordinate system,
Figure BDA0003720472470000021
the first derivative of the longitudinal displacement s with respect to time in the Frenet coordinate system,
Figure BDA0003720472470000022
the second derivative of the longitudinal displacement s with respect to time in the Frenet coordinate system. Thereby obtaining the state quantity:
Figure BDA0003720472470000023
the motion space is designed as the longitudinal end state of the track:
Figure BDA0003720472470000024
and 3, step 3: the invention uses an A3C algorithm framework to generate training samples to fill an experience pool in an exploration process by using a training method of a fleet shared network. All the agents share the strategy network and participate in the training of the network together, so that the problem of network non-convergence is avoided.
The obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ are taken as input, the Lattice planning algorithm is improved with a policy-gradient algorithm, and the reward function is designed in combination with the RSS strategy, so that the end-state sampling point of the agent is trained.
The optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, $r(\tau)$ is the reward function of trajectory τ, and π denotes the policy network.
The policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$

where α is the learning rate, i.e., the step size of the gradient ascent.
The derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \sum_\tau p(\tau, \theta)\, \nabla_\theta \log p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$
In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$

The update direction of the final policy parameter θ is therefore:

$$\theta \leftarrow \theta + \alpha\, \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$
To reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$
The part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

Differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = b\, \nabla_\theta 1 = 0$$
Clearly, according to this derivative of BL with respect to the network parameter θ, introducing a baseline b that is independent of the action a into the objective function J(π) does not affect the gradient $\nabla_\theta J(\pi)$ of the final optimization objective, and therefore does not affect the gradient of the final policy.
The variance of the gradient estimate is:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

It follows that the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance. Considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b), and hence the variance, is minimized at this value of b. Clearly, $\sum_\tau p(\tau, \theta)\, r(\tau)$ is the expected return, i.e., an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network.
The reinforcement-learning agent continuously improves its own ability during training, and in this process it has to keep exploring by trial and error in unfamiliar regions of the state space. In an unfamiliar state, a new action may bring the agent a higher reward, but may also turn out to be a worse action. An "exploration" action tries a new action; an "exploitation" action takes the known action with the maximum reward, i.e., it simply executes the current policy. Too much exploration (too little exploitation) makes convergence of the agent more difficult, while too little exploration (too much exploitation) makes it very likely that the agent converges to a locally optimal region. A trade-off between exploration and exploitation is therefore needed. In order to strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a locally optimal region, the output of the policy network $\pi_\theta(z, a)$ designed in the invention follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$
Theoretically, during learning the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, where Q(z, a) is the action-value function and $\arg\max_a$ denotes the action that maximizes Q for the given state. The output standard deviation σ(z, θ) continuously approaches 0, so the randomness of the policy decreases. When the policy is executed, an action is sampled from the normal distribution,

$$a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$$

and is output and executed.
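As a concrete illustration of such a Gaussian policy head, the following minimal PyTorch sketch (the `policy_net` module and tensor shapes are assumptions, not part of the patent) shows how an action is sampled from $\mathcal{N}(\mu(z,\theta), \sigma(z,\theta)^2)$ during execution and how its log-probability is kept for the later policy-gradient update:

```python
import torch

def sample_action(policy_net, z):
    """Sample a longitudinal end-state action a ~ N(mu(z, theta), sigma(z, theta)^2).

    policy_net is assumed to map the state feature z to (mu, sigma),
    with sigma kept positive inside the network (e.g. via softplus).
    """
    mu, sigma = policy_net(z)                      # each of shape (action_dim,)
    dist = torch.distributions.Normal(mu, sigma)   # diagonal Gaussian policy
    a = dist.sample()                              # exploration: random draw
    log_prob = dist.log_prob(a).sum(-1)            # used later in grad(log pi) * r
    return a, log_prob
```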
Step 4: Fit the longitudinal trajectory polynomial. Using the current longitudinal state of the vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the optimal longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ at time $t_1$ output by the reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

The boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

From the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

Based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained, giving the optimal trajectory, which is input to the control module.
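To make the boundary-condition solve concrete, the following minimal NumPy sketch (function and variable names are illustrative) computes the quintic coefficients from the current state and the end state exactly as in the linear system above:

```python
import numpy as np

def fit_quintic(s0, s0_d, s0_dd, s1, s1_d, s1_dd, t1):
    """Coefficients a0..a5 of s(t) = a0 + a1*t + ... + a5*t^5 that match the
    start state at t = 0 and the end state at t = t1."""
    a0, a1, a2 = s0, s0_d, 0.5 * s0_dd              # fixed by the conditions at t = 0
    A = np.array([[t1**3,   t1**4,    t1**5],
                  [3*t1**2, 4*t1**3,  5*t1**4],
                  [6*t1,    12*t1**2, 20*t1**3]])
    b = np.array([s1   - (a0 + a1*t1 + a2*t1**2),
                  s1_d - (a1 + 2*a2*t1),
                  s1_dd - 2*a2])
    a3, a4, a5 = np.linalg.solve(A, b)              # conditions at t = t1
    return np.array([a0, a1, a2, a3, a4, a5])

# e.g. the planned longitudinal position at time t:
# coeffs = fit_quintic(0.0, 15.0, 0.0, 120.0, 18.0, 0.0, 8.0)
# s_t = np.polyval(coeffs[::-1], 2.0)
```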
The invention provides an intelligent automobile controller, wherein an execution program of the method is arranged in the controller.
The invention also provides a storage device, which is internally provided with the program code of the method.
The invention has the beneficial effects that:
(1) For the automatic driving task, the invention solves the commercial vehicle queue driving problem under the Responsibility-Sensitive Safety (RSS) strategy by combining the Lattice algorithm with deep reinforcement learning. Using the A3C framework greatly improves training efficiency and promotes network convergence. Meanwhile, under the RSS framework, the safety of the reinforcement-learning planned path is greatly improved.
(2) Compared with the original Lattice algorithm, the method discards the time-consuming sampling and the cost-function evaluation of every candidate trajectory, greatly improving the real-time performance of the algorithm. Meanwhile, the reinforcement-learning training process has better generality, and designing the reward function based on the final control effect makes the method better suited to complex traffic scenes and complex vehicle dynamics.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a Policy Gradient network neural network architecture used by the present invention;
FIG. 3 shows the trajectory point sampling on the ST graph.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention provides a commercial vehicle queue path planning method that combines deep reinforcement learning (DRL) and the Lattice algorithm under the Responsibility-Sensitive Safety (RSS) strategy, which improves the safety and stability of large commercial vehicles driving in a queue. As shown in FIG. 1, the method specifically comprises the following steps:
As shown in FIG. 1, the A3C framework is used with a fleet-shared-network training method: during exploration, each vehicle computes its own gradients and pushes them to Global_net. Compared with taking the states of all fleet members as input and outputting trajectories for all members, the method takes only the state of each individual agent as input and outputs that agent's trajectory, which promotes network convergence and avoids the phenomena of member actions interfering with one another and rewards cancelling out. Meanwhile, all intelligent connected vehicles share the decision network and jointly participate in its training. The individual training process of each agent is described below.
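The gradient-sharing loop between the per-vehicle workers and Global_net can be sketched as follows (a simplified, self-contained PyTorch sketch: the toy PolicyNet, the random placeholder state and the placeholder reward are illustrative stand-ins for the actual bird's-eye-view state, environment interaction and reward defined later, not the patent's implementation):

```python
import threading
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy stand-in for the shared policy network (Global_net)."""
    def __init__(self, state_dim=8, action_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, action_dim)
        self.log_sigma = nn.Linear(64, action_dim)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()

global_net = PolicyNet()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-4)
lock = threading.Lock()

def worker(vehicle_id, n_updates=10):
    """One fleet vehicle: pull the latest weights, interact, push gradients back."""
    local_net = PolicyNet()
    for _ in range(n_updates):
        local_net.load_state_dict(global_net.state_dict())    # pull from Global_net
        z = torch.randn(1, 8)                                  # placeholder state feature
        mu, sigma = local_net(z)
        dist = torch.distributions.Normal(mu, sigma)
        a = dist.sample()
        reward = -a.pow(2).sum()                               # placeholder reward
        loss = -(dist.log_prob(a).sum() * reward.detach())     # policy-gradient loss
        local_net.zero_grad()
        loss.backward()
        with lock:                                             # push gradients to Global_net
            for gp, lp in zip(global_net.parameters(), local_net.parameters()):
                gp.grad = lp.grad.clone()
            optimizer.step()
            optimizer.zero_grad()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```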
The deep-reinforcement-learning agent is trained to output the longitudinal end-state sampling point:
(1) Characteristic bird's eye view design. The invention designs an effective and simple time-series bird's eye view as the state quantity of the policy network, which greatly improves the learning of the policy network and guarantees the safety of the output trajectory.
Generating the time-series bird's eye view comprises the following two steps: (1) obtain the surrounding environment information, including moving and static obstacles and lane lines, from the perception module of the autonomous vehicle, and use the prediction module to obtain the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) use the information obtained from the perception module and the prediction to generate a characteristic bird's eye view with three dimensions: lateral, longitudinal and time.
The size of the three-dimensional time-series bird's-eye-view matrix is (40, 400, 80). The first dimension, 40, represents a lateral range of 10 m on each side of the reference line, with a lateral displacement interval of 0.5 m; the second dimension, 400, represents a range of 200 m longitudinally forward from the origin of the own vehicle, with a longitudinal displacement interval of 0.5 m; and the third dimension, 80, represents the time range within the next 8 s, with a time interval of 1 s. Specifically, when an entry [α, β, γ] of the time-series bird's-eye-view matrix is -1, that point is an obstacle or a non-drivable area in the time space; when an entry [α, β, γ] is 0, that point is a drivable area in the time space; and when an entry [α, β, γ] is 1, that point is a point of the reference line.
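A minimal NumPy sketch of how such a (40, 400, 80) occupancy matrix could be filled (the helper functions and the example obstacle are illustrative assumptions; the grid resolutions follow the description above):

```python
import numpy as np

LAT_BINS, LON_BINS, T_BINS = 40, 400, 80             # 0.5 m x 0.5 m x 1 s grid as described
bev = np.zeros((LAT_BINS, LON_BINS, T_BINS), dtype=np.int8)

def mark_reference_line(bev, lon_indices):
    """Cells on the reference line are set to 1 (center lateral bin, all time steps)."""
    bev[LAT_BINS // 2, lon_indices, :] = 1

def mark_obstacle(bev, lat_m, lon_m, t_s):
    """Mark a predicted obstacle position (Frenet meters / seconds) as -1."""
    i = int(lat_m / 0.5) + LAT_BINS // 2              # lateral offset from the reference line
    j = int(lon_m / 0.5)                              # longitudinal offset from the ego origin
    k = int(t_s)                                      # prediction time step
    if 0 <= i < LAT_BINS and 0 <= j < LON_BINS and 0 <= k < T_BINS:
        bev[i, j, k] = -1

mark_reference_line(bev, np.arange(LON_BINS))
mark_obstacle(bev, lat_m=1.0, lon_m=35.0, t_s=3.0)    # example predicted obstacle cell
```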
(2) State-quantity design. To obtain the state quantity of the agent at the current time from the characteristic bird's eye view, a Frenet coordinate transformation is carried out: the coordinate information of the vehicle in the Cartesian coordinate system, $(X, \theta_X, k_X, v_X, a_X)$, is converted into

$$(s, \dot{s}, \ddot{s})$$

where X is the (vector-valued) position of the vehicle in the Cartesian coordinate system, $\theta_X$ is the heading of the vehicle, $k_X$ is the curvature, $v_X$ is the linear velocity of the vehicle and $a_X$ is the acceleration of the vehicle in the Cartesian coordinate system; s is the longitudinal displacement in the Frenet coordinate system, $\dot{s}$ is the first derivative of the longitudinal displacement s with respect to time, and $\ddot{s}$ is the second derivative of s with respect to time. The state quantity at the current time is thus obtained:

$$(s_0, \dot{s}_0, \ddot{s}_0)$$

The action space is designed as the longitudinal end state of the trajectory:

$$(s_1, \dot{s}_1, \ddot{s}_1, t_1)$$
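For a straight or gently curving reference line, the longitudinal part of this conversion can be approximated as in the sketch below (a simplification under that assumption; a full Cartesian-to-Frenet conversion also accounts for the reference-line curvature and the lateral component, which are omitted here):

```python
import numpy as np

def cartesian_to_frenet_longitudinal(x, theta_x, v_x, a_x, ref_xy, ref_s, ref_theta):
    """Approximate (s, s_dot, s_ddot) by projecting onto a discretized reference line.

    ref_xy: (N, 2) reference-line points, ref_s: (N,) arc lengths, ref_theta: (N,) headings.
    """
    d = np.linalg.norm(ref_xy - np.asarray(x), axis=1)
    i = int(np.argmin(d))                  # closest reference-line point
    dtheta = theta_x - ref_theta[i]        # heading offset from the reference line
    s = ref_s[i]
    s_dot = v_x * np.cos(dtheta)           # longitudinal velocity component
    s_ddot = a_x * np.cos(dtheta)          # longitudinal acceleration component (approx.)
    return s, s_dot, s_ddot
```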
(3) Policy-network design. The obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ are taken as input, the Lattice planning algorithm is improved with a policy-gradient algorithm, the reward function is designed in combination with the RSS strategy, and the end-state sampling point of the agent is trained. The optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, and $r(\tau)$ is the reward function of trajectory τ.
The policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$
The expected return J(π) is differentiated with respect to the parameter θ in order to obtain the optimal θ, at which the policy network π, and hence the trajectory, is optimal. The derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \sum_\tau p(\tau, \theta)\, \nabla_\theta \log p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$
In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$
To reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$

The part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

Differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = b\, \nabla_\theta 1 = 0$$

Clearly, according to this derivative of BL with respect to the network parameter θ, introducing a baseline b that is independent of the action a into the objective function J(π) does not affect the gradient $\nabla_\theta J(\pi)$ of the final optimization objective, and therefore does not affect the gradient of the final policy.
The variance of the gradient estimate is:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

It follows that the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance. Considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b), and hence the variance, is minimized at this value of b. Clearly, $\sum_\tau p(\tau, \theta)\, r(\tau)$ is the expected return, i.e., an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network.
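The Monte Carlo policy-gradient update with the state value V(z) as baseline can be sketched as follows (PyTorch; `policy_net`, `value_net` and the batch layout are assumptions standing in for the networks and experience pool described here):

```python
import torch

def policy_gradient_step(policy_net, value_net, optimizer, batch):
    """One update from n sampled experiences <z, a, tau, r>.

    batch: dict of tensors 'z' (n, state_dim), 'a' (n, action_dim), 'r' (n,).
    The trajectory return r(tau) is used directly and V(z) serves as the baseline b.
    """
    z, a, r = batch["z"], batch["a"], batch["r"]
    mu, sigma = policy_net(z)
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(a).sum(-1)                     # log pi_theta(z, a)
    baseline = value_net(z).squeeze(-1)                     # b = V(z)
    advantage = r - baseline
    policy_loss = -(log_prob * advantage.detach()).mean()   # gradient ascent on J(pi)
    value_loss = advantage.pow(2).mean()                    # fit V(z) to the sampled returns
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```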
In order to strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a locally optimal region, the output of the policy network $\pi_\theta(z, a)$ designed in the invention follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$

Theoretically, during learning the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, the output standard deviation σ(z, θ) continuously approaches 0, and the randomness of the policy decreases. When the policy is executed, an action is sampled from the normal distribution, $a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$, and is output and executed.
(4) Reward-function design. The reward function consists of the following parts, where $k_1 \sim k_3$ are the proportional coefficients of the respective parts:

$$reward = k_1 \cdot r_{speed} + k_2 \cdot r_{acc} + k_3 \cdot r_{safe}$$

Here $r_{speed}$ is the speed reward, whose goal is to keep the vehicle speed at the target speed: with $v_{target}$ the desired target speed, $t_{total}$ the number of trajectory points in time, and $v_t$ the speed of the planned trajectory at time t, $r_{speed}$ penalizes the deviation of $v_t$ from $v_{target}$ accumulated over the $t_{total}$ trajectory points.

$r_{acc}$ is the longitudinal comfort reward, whose goal is to keep the longitudinal jerk small: with $\ddot{s}_t$ the longitudinal acceleration of the planned trajectory at time t, $r_{acc}$ penalizes large changes of $\ddot{s}_t$ along the trajectory.
$r_{safe}$ is the safety reward; it is designed under the Responsibility-Sensitive Safety (RSS) strategy, with the goal of generating trajectories that meet the safety standard.

Longitudinal safe distance:

$$d^{lon}_{min} = \left[\, v_r\,\rho + \frac{1}{2}\, a_{max,accel}\, \rho^2 + \frac{\big(v_r + \rho\, a_{max,accel}\big)^2}{2\, a_{min,brake}} - \frac{v_f^2}{2\, a_{max,brake}} \,\right]_{+}$$

where $[x]_+ = \max(x, 0)$, $v_f$ is the speed of the front vehicle, $v_r$ is the speed of the rear vehicle, ρ is the driver reaction time, $a_{min,brake}$ is the minimum braking acceleration, $a_{max,brake}$ is the maximum braking acceleration, and $a_{max,accel}$ is the maximum acceleration.
Transverse safe distance:
Figure BDA0003720472470000107
wherein v is 1 Is the speed of the bicycle, v 2 The speed of the other vehicles in the transverse direction, namely the speed of the other vehicles trying to jam the vehicle entering the motorcade,
Figure BDA0003720472470000108
mu is the minimum value of the transverse distance when the transverse speed of the two vehicles is 0.
Figure BDA0003720472470000111
The acceleration is the maximum acceleration in the lateral direction,
Figure BDA0003720472470000112
and p is the driver reaction time, which is the lateral minimum braking acceleration.
When the vehicle runs according to the track generated by the strategy network, when the transverse and longitudinal distances between the vehicle and the front and rear vehicles or other vehicles which are jammed in the fleet are less than the minimum safety distance, the reward is-100, otherwise, the reward is 0:
Figure BDA0003720472470000113
where d is the distance to other vehicles.
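A small sketch of this safety term under the RSS constraints described above (the parameter values and the AND-combination of the two gap checks are illustrative assumptions):

```python
def rss_longitudinal_min_gap(v_r, v_f, rho=0.5, a_max_accel=2.0,
                             a_min_brake=4.0, a_max_brake=8.0):
    """Minimum longitudinal gap between a rear vehicle (v_r) and a front vehicle (v_f)."""
    d = (v_r * rho + 0.5 * a_max_accel * rho**2
         + (v_r + rho * a_max_accel)**2 / (2 * a_min_brake)
         - v_f**2 / (2 * a_max_brake))
    return max(d, 0.0)

def safety_reward(d_long, d_lat, d_long_min, d_lat_min):
    """r_safe = -100 when both the longitudinal and lateral RSS gaps are violated, else 0."""
    violated = (d_long < d_long_min) and (d_lat < d_lat_min)
    return -100.0 if violated else 0.0
```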
(5) Longitudinal trajectory polynomial fitting. Using the current longitudinal state of the vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the optimal longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ at time $t_1$ output by the reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

The boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

From the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

Based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained. Finally, the obtained trajectory is input to the control module for trajectory tracking control.
As shown in FIG. 2, the policy network $\pi_\theta(z, a)$ specifically comprises a convolutional (CNN) feature-extraction network and a fully connected network (FCN). Here z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical trajectory of the own vehicle; a is the output of the policy network, i.e., the end state of the planned trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$; and θ denotes the weight and bias parameters of the network. The input of the convolutional (CNN) feature-extraction network is the space-time bird's-eye-view matrix and its output is the extracted environmental feature information. The input of the fully connected network (FCN) is the environmental feature information output by the CNN feature-extraction network together with the historical trajectory information of the autonomous vehicle, and its output is the end state of the trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$.
The convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers. The input layer combines 3 matrices of 256 × 3 into a matrix of 256 × 9. The convolutional layer Conv1 consists of (3 × 9) × 32 convolution kernels with stride = 2; its input is the output of the input layer, a 256 × 9 matrix, and its output is a 128 × 32 feature. The pooling layer Pool1 consists of (2 × 2) pooling kernels with stride = 2; its input is the output of Conv1, a 128 × 32 feature, and its output is a 64 × 32 feature. The convolutional layer Conv2 consists of (3 × 32) × 64 convolution kernels with stride = 2; its input is the output of Pool1, a 64 × 32 feature, and its output is a 32 × 128 feature. The pooling layer Pool2 consists of (2 × 2) pooling kernels with stride = 2; its input is the output of Conv2, a 32 × 128 feature, and its output is a 16 × 128 feature. The convolutional layer Conv3 consists of (3 × 128) × 128 convolution kernels with stride = 2; its input is the output of Pool2, a 16 × 128 feature, and its output is an 8 × 128 feature. The fully connected layer FC has size (8 × 128) × 512; its input is the output of Conv3, an 8 × 128 feature, and its output is a 1 × 512 feature. The fully connected layers FC-μ and FC-σ form a parallel structure; both take as input the 1 × 512 feature extracted by the convolutional neural network, and each outputs a 1 × 512 feature. The features extracted by FC-μ and FC-σ together constitute the state feature z.
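A PyTorch sketch in the same spirit is given below (channel counts follow the description loosely; since the original layer sizes are not fully consistent, the exact dimensions here are illustrative, and the μ/σ heads are shown producing the action mean and scale directly rather than 1 × 512 features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvPolicyNet(nn.Module):
    """Sketch of the CNN + FC policy head: a 1-D convolutional feature extractor over
    the flattened bird's-eye-view channels with parallel FC-mu / FC-sigma outputs."""
    def __init__(self, in_channels=9, seq_len=256, action_dim=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        feat_len = seq_len // 32                       # 256 -> 8 after the strides and pools
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(128 * feat_len, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, action_dim)        # mean of the Gaussian policy
        self.fc_sigma = nn.Linear(512, action_dim)     # scale of the Gaussian policy

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.fc_mu(h), F.softplus(self.fc_sigma(h)) + 1e-3

# e.g. x = torch.randn(1, 9, 256); mu, sigma = ConvPolicyNet()(x)
```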
As shown in fig. 3, in the ST diagram, the traffic scene is mainly divided into two main cases:
(1) there is no obstacle in front of the bicycle. And fitting a track by the initial state quantity and the final state track points trained from the deep reinforcement learning, and planning the longitudinal speed.
(2) There is a barrier in front of the bicycle. The present invention draws obstacles as parallelograms that block portions of a road for a particular period of time. For example, in the lower graph, the prediction module predicts that the vehicle will be at t 0 To t 1 Into the lane in which the vehicle will be drivenDuring which it occupies position s 0 To s 1 Therefore, a rectangle is drawn on the ST diagram, which will be at time t 0 To t 1 Period blocking position s 0 To s 1 . To avoid collisions, the velocity profile must not intersect this rectangle.
When following a car, the speed curve is below the lower boundary of the car following sampling. During overtaking, the speed curve is above the lower overtaking sampling limit.
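A small sketch of the corresponding feasibility check: the longitudinal profile s(t) obtained from the quintic polynomial must not enter any blocked ST rectangle (the function names and the sampling step are illustrative):

```python
import numpy as np

def st_profile_is_safe(coeffs, blocked, horizon, dt=0.1):
    """Check that the longitudinal profile s(t) avoids every blocked ST rectangle.

    coeffs:  quintic coefficients a0..a5 of s(t)
    blocked: list of (t_start, t_end, s_low, s_high) rectangles from the prediction module
    """
    for t in np.arange(0.0, horizon + dt, dt):
        s = sum(c * t**i for i, c in enumerate(coeffs))
        for t_start, t_end, s_low, s_high in blocked:
            if t_start <= t <= t_end and s_low <= s <= s_high:
                return False                  # the profile enters the obstacle rectangle
    return True

# e.g. st_profile_is_safe(coeffs=[0, 15, 0, 0.1, -0.01, 0.0],
#                         blocked=[(2.0, 4.0, 30.0, 45.0)], horizon=8.0)
```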
In summary, the invention solves the problem of the queue driving of the commercial vehicles by adopting a method of combining a Lattice algorithm and deep reinforcement learning under an RSS (responsibility sensitivity safety) strategy aiming at the automatic driving task. By applying the A3C framework, the training efficiency is greatly improved, and the network convergence is promoted. Meanwhile, under the RSS (responsibility sensitive safety) framework, the safety of the reinforcement learning planning path is greatly improved. Compared with the Lattice algorithm, the method abandons the sampling with higher time complexity and the evaluation process of each alternative track cost function, and greatly improves the timeliness of the algorithm. Meanwhile, the training process of reinforcement learning has better universality, and the design of the reward function based on the final control effect can be more suitable for complex traffic scenes and complex vehicle dynamics characteristics.
In addition, the embodiment of the invention also provides an intelligent automobile controller, and the controller is internally provided with an execution program of the method.
The embodiment of the invention also provides a storage device, and the storage device is internally provided with the program code of the method.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for planning a queue path of a commercial vehicle by combining deep reinforcement learning and RSS strategies is characterized by comprising the following steps:
S1, designing a time-series bird's eye view as the state quantity of the policy network;

S2, performing a Frenet coordinate transformation and obtaining from the characteristic bird's eye view the state quantity of the agent at the current time, $(s_0, \dot{s}_0, \ddot{s}_0)$, the action space being designed as the longitudinal end state of the trajectory, $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$;

S3, taking the obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ as the policy-network input, improving the Lattice planning algorithm with a policy-gradient algorithm, designing the reward function in combination with the RSS strategy, and training the longitudinal end state of the agent;

S4, using the current longitudinal state of the ego vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ as boundary conditions, performing polynomial fitting of the longitudinal trajectory to obtain the optimal trajectory.
2. The method for planning a queue path of a commercial vehicle according to claim 1, wherein step S1 comprises the following two steps: (1) obtaining the surrounding environment information, including dynamic and static obstacles and lane lines, and predicting the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) generating a characteristic bird's eye view with three dimensions, lateral, longitudinal and time, from the obtained environment information and the predicted information.
3. The method for planning a queue path of a commercial vehicle according to claim 1, wherein the three-dimensional time-series bird's-eye-view matrix has dimensions (40, 400, 80), wherein the first dimension, 40, represents a lateral range of 10 m on each side of the reference line with a lateral displacement interval of 0.5 m; the second dimension, 400, represents a range of 200 m longitudinally forward from the origin of the own vehicle with a longitudinal displacement interval of 0.5 m; and the third dimension, 80, represents the time range within the next 8 s with a time interval of 1 s; when an entry [α, β, γ] of the time-series bird's-eye-view matrix is -1, that point has an obstacle or is a non-drivable area in the time space; when an entry [α, β, γ] is 0, that point is a drivable area in the time space; and when an entry [α, β, γ] is 1, that point is a point of the reference line.
4. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S2 the Frenet coordinate transformation is as follows:

the coordinate information of the vehicle in the Cartesian coordinate system, $(X, \theta_X, k_X, v_X, a_X)$, is converted by the coordinate transformation into $(s, \dot{s}, \ddot{s})$, where X is the (vector-valued) position of the vehicle in the Cartesian coordinate system, $\theta_X$ is the heading of the vehicle, $k_X$ is the curvature, $v_X$ is the linear velocity of the vehicle, $a_X$ is the acceleration of the vehicle in the Cartesian coordinate system, s is the longitudinal displacement in the Frenet coordinate system, $\dot{s}$ is the first derivative of the longitudinal displacement s with respect to time, and $\ddot{s}$ is the second derivative of the longitudinal displacement s with respect to time.
5. The method for planning the queue path of the commercial vehicle according to claim 1, wherein in S3 the optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, and $r(\tau)$ is the reward function of trajectory τ;

the policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$

differentiating J(π), the derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$

in the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory); during training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$

to reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$

the part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = 0$$

the variance is calculated as:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance; considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b) and hence the variance are minimized; $\sum_\tau p(\tau, \theta)\, r(\tau)$ is an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network;

the output of the policy network $\pi_\theta(z, a)$ follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$

during learning, the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, the output standard deviation σ(z, θ) continuously approaches 0, and the randomness of the policy decreases; when the policy is executed, an action is sampled from the normal distribution, $a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$, and is output and executed.
6. The method for planning the queue path of the commercial vehicle by combining deep reinforcement learning and the RSS strategy according to claim 1, wherein the reward function of the policy network is designed as:

$$reward = k_1 \cdot r_{speed} + k_2 \cdot r_{acc} + k_3 \cdot r_{safe}$$

where $r_{speed}$ is the speed reward, whose goal is to keep the vehicle speed at the target speed: with $v_{target}$ the desired target speed, $t_{total}$ the number of trajectory points in time and $v_t$ the speed of the planned trajectory at time t, $r_{speed}$ penalizes the deviation of $v_t$ from $v_{target}$ accumulated over the $t_{total}$ trajectory points;

$r_{acc}$ is the longitudinal comfort reward, whose goal is to keep the longitudinal jerk small: with $\ddot{s}_t$ the longitudinal acceleration of the planned trajectory at time t, $r_{acc}$ penalizes large changes of $\ddot{s}_t$ along the trajectory;

$r_{safe}$ is the safety reward, whose goal is that the generated trajectory meets the safety standard;

longitudinal safe distance:

$$d^{lon}_{min} = \left[\, v_r\,\rho + \frac{1}{2}\, a_{max,accel}\, \rho^2 + \frac{\big(v_r + \rho\, a_{max,accel}\big)^2}{2\, a_{min,brake}} - \frac{v_f^2}{2\, a_{max,brake}} \,\right]_{+}$$

where $[x]_+ = \max(x, 0)$, $v_f$ is the speed of the front vehicle, $v_r$ is the speed of the rear vehicle, ρ is the driver reaction time, $a_{min,brake}$ is the minimum braking acceleration, $a_{max,brake}$ is the maximum braking acceleration, and $a_{max,accel}$ is the maximum acceleration;

lateral safe distance:

$$d^{lat}_{min} = \mu + \left[\, \frac{v_1 + v_{1,\rho}}{2}\,\rho + \frac{v_{1,\rho}^2}{2\, a^{lat}_{min,brake}} - \left( \frac{v_2 + v_{2,\rho}}{2}\,\rho - \frac{v_{2,\rho}^2}{2\, a^{lat}_{min,brake}} \right) \right]_{+}$$

with $v_{1,\rho} = v_1 + \rho\, a^{lat}_{max,accel}$ and $v_{2,\rho} = v_2 - \rho\, a^{lat}_{max,accel}$, where $v_1$ is the lateral speed of the own vehicle, $v_2$ is the lateral speed of the other vehicle, μ is the minimum lateral gap when the lateral speeds of both vehicles are 0, $a^{lat}_{max,accel}$ is the maximum lateral acceleration, $a^{lat}_{min,brake}$ is the minimum lateral braking acceleration, and ρ is the driver reaction time;

when the vehicle travels along the trajectory generated by the policy network, the reward is -100 whenever the lateral and longitudinal distances between the own vehicle and the preceding/following vehicles of the fleet, or a cutting-in vehicle, are smaller than the minimum safe distances, and 0 otherwise:

$$r_{safe} = \begin{cases} -100, & d < d_{min} \\ 0, & d \ge d_{min} \end{cases}$$

where d is the distance to the other vehicle.
7. The method of claim 1, wherein the policy network $\pi_\theta(z, a)$ comprises a convolutional (CNN) feature-extraction network and a fully connected network (FCN); z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical trajectory of the own vehicle; a is the output of the policy network, i.e., the end state of the planned trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$; θ denotes the weight and bias parameters of the network; the input of the convolutional (CNN) feature-extraction network is the space-time bird's-eye-view matrix and its output is the extracted environmental feature information; the input of the fully connected network (FCN) is the environmental feature information output by the CNN feature-extraction network together with the historical trajectory information of the autonomous vehicle, and its output is the end state of the trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$;

the convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers, wherein the input layer combines 3 matrices of 256 × 3 into a matrix of 256 × 9; the convolutional layer Conv1 consists of (3 × 9) × 32 convolution kernels with stride = 2, its input is the output of the input layer, a 256 × 9 matrix, and its output is a 128 × 32 feature; the pooling layer Pool1 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv1, a 128 × 32 feature, and its output is a 64 × 32 feature; the convolutional layer Conv2 consists of (3 × 32) × 64 convolution kernels with stride = 2, its input is the output of Pool1, a 64 × 32 feature, and its output is a 32 × 128 feature; the pooling layer Pool2 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv2, a 32 × 128 feature, and its output is a 16 × 128 feature; the convolutional layer Conv3 consists of (3 × 128) × 128 convolution kernels with stride = 2, its input is the output of Pool2, a 16 × 128 feature, and its output is an 8 × 128 feature; the fully connected layer FC has size (8 × 128) × 512, its input is the output of Conv3, an 8 × 128 feature, and its output is a 1 × 512 feature; the fully connected layers FC-μ and FC-σ form a parallel structure, both take as input the 1 × 512 feature extracted by the convolutional neural network, and each outputs a 1 × 512 feature; the features extracted by FC-μ and FC-σ together constitute the state feature z.
8. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S4 the longitudinal trajectory polynomial is fitted as follows:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

the boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

from the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained.
9. An intelligent automobile controller, characterized in that the controller is internally provided with an execution program of the method of any one of claims 1 to 8.
10. A storage device, characterized in that it houses the program code of the method according to any one of claims 1 to 8.
CN202210748792.5A 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy Pending CN115079697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210748792.5A CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210748792.5A CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Publications (1)

Publication Number Publication Date
CN115079697A true CN115079697A (en) 2022-09-20

Family

ID=83256365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210748792.5A Pending CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Country Status (1)

Country Link
CN (1) CN115079697A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115871658A (en) * 2022-12-07 2023-03-31 之江实验室 Intelligent driving speed decision method and system for dense pedestrian flow
CN115871658B (en) * 2022-12-07 2023-10-27 之江实验室 Dense people stream-oriented intelligent driving speed decision method and system
CN117542218A (en) * 2023-11-17 2024-02-09 上海智能汽车融合创新中心有限公司 Vehicle-road cooperative system based on vehicle speed-vehicle distance guiding control


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination