CN115079697A - Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy - Google Patents

Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Info

Publication number
CN115079697A
Authority
CN
China
Prior art keywords
output
network
strategy
time
longitudinal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210748792.5A
Other languages
Chinese (zh)
Inventor
朱子轩
蔡英凤
陈龙
孙晓强
何友国
袁朝春
方啸
陆文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University
Priority to CN202210748792.5A
Publication of CN115079697A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D 1/02: Control of position or course in two dimensions
    • G05D 1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D 1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D 1/0214: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D 1/0221: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D 1/0287: Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D 1/0289: Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling with means for avoiding collisions between vehicles

Abstract

The invention discloses a commercial vehicle platoon path planning method, a controller and a storage device combining deep reinforcement learning and the RSS strategy. An A3C framework is introduced: using multithreading, the vehicles in the fleet each interact with and learn from the environment in separate threads simultaneously, and each thread aggregates its learning results and stores them in Global_net. The learning results of the different fleet vehicles are periodically pulled back from Global_net to guide the subsequent interaction between each vehicle and the environment. Meanwhile, speed planning is performed on the ST graph with the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and guarantees the smoothness of the commercial vehicles' driving trajectories. Finally, the invention incorporates the safety-constraint RSS strategy; this mathematically formulated safety strategy for autonomous vehicles provides a framework for otherwise implicit driving rules, enabling organic integration with other road participants and effectively handling the safety problem when other vehicles merge into the lane while the platoon is driving.

Description

Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy
Technical Field
The invention belongs to the field of automatic driving within artificial intelligence, and relates to a commercial vehicle queue path planning method, a controller and a storage device combining deep reinforcement learning and the Responsibility-Sensitive Safety (RSS) model.
Background
The intelligent automobile is a high-tech product built on environment perception, intelligent driving, wireless communication and computer technologies, and the transformation and upgrading of the automobile industry is the process of gradually making vehicles intelligent. Vehicles in an intelligent driving state take safety, environmental protection, energy saving and comfort as comprehensive control objectives and cooperatively build an efficient and orderly transportation network.
Commercial fleets are currently widely used in engineering practice. Commercial vehicles mainly comprise five categories: passenger cars, trucks, semi-trailer tractors, incomplete passenger vehicles and incomplete trucks. They are characterized by large size, heavy weight and large driver blind spots. At present, commercial fleet path planning faces several problems during training. First, multiple vehicle bodies participate in training at the same time, which makes training difficult and may even prevent the network from converging. Second, the reward function is hard to design: each fleet member has its own reward function, the actions output by the members interfere with one another, and rewards can cancel each other out, making exploration during training difficult. Finally, because commercial vehicles are large and heavily loaded, their safety cannot be well guaranteed in unmanned driving. How to find a commercial vehicle queue planning method that is both safe and efficient has therefore become an important subject.
Disclosure of Invention
In order to solve the commercial vehicle queue problem, the A3C framework is introduced. A3C uses multithreading so that the vehicles in the fleet interact with and learn from the environment in separate threads simultaneously, and each thread aggregates its learning results and stores them in Global_net. The learning results of the different fleet vehicles are periodically pulled back from Global_net to guide the subsequent interaction between each vehicle and the environment. Meanwhile, speed planning is performed on the ST graph with the Lattice algorithm, which effectively improves the driving stability and comfort of the fleet and guarantees the smoothness of the commercial vehicles' driving trajectories. Finally, the invention incorporates a safety-constraint strategy, namely the Responsibility-Sensitive Safety (RSS) strategy; this mathematically formulated safety strategy for autonomous vehicles provides a framework for otherwise implicit driving rules, enabling organic integration with other road participants and effectively handling the safety problem of merging vehicles during platoon driving.
The invention provides a commercial vehicle queue planning method combining deep reinforcement learning and the RSS strategy, which improves the learning efficiency of the fleet by means of the A3C framework and improves the safety and stability of the fleet while driving through the constraints of the Lattice algorithm and the RSS strategy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the commercial vehicle queue planning method combining deep reinforcement learning and RSS (responsibility sensitive safety) strategies comprises the following steps:
Step 1: In order to better acquire information about the surrounding traffic environment, the invention designs an effective and simple time-series bird's eye view as the state quantity of the policy network, which greatly improves the learning of the policy network and guarantees the safety of the output trajectory. Generating the time-series bird's eye view comprises the following two steps: (1) obtain the surrounding environment information, including moving and static obstacles and lane lines, and use a prediction module (e.g., an LSTM or GCN network) to obtain the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) use the information obtained from the perception module and the prediction to generate a characteristic bird's eye view with three dimensions: lateral, longitudinal and time.
Step 2: frenet coordinate transformation is performed, and the state quantity of the agent at the current time is obtained from the characteristic aerial view. Coordinate information of the vehicle in a Cartesian coordinate system(X,θ X ,k X ,v X ,a X ) Can be converted into
Figure BDA0003720472470000028
Wherein X is the coordinate of the vehicle in a Cartesian coordinate system and is a vector, and theta X For the orientation of the vehicle in a Cartesian coordinate system, k X Is curvature, v X Is the linear velocity of the vehicle in a Cartesian coordinate system, a X Is the acceleration of the vehicle in a cartesian coordinate system. s is the longitudinal displacement in the Frenet coordinate system,
Figure BDA0003720472470000021
the first derivative of the longitudinal displacement s with respect to time in the Frenet coordinate system,
Figure BDA0003720472470000022
the second derivative of the longitudinal displacement s with respect to time in the Frenet coordinate system. Thereby obtaining the state quantity:
Figure BDA0003720472470000023
the motion space is designed as the longitudinal end state of the track:
Figure BDA0003720472470000024
and 3, step 3: the invention uses an A3C algorithm framework to generate training samples to fill an experience pool in an exploration process by using a training method of a fleet shared network. All the agents share the strategy network and participate in the training of the network together, so that the problem of network non-convergence is avoided.
The obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ are taken as input, the Lattice planning algorithm is improved with a policy-gradient algorithm, and the reward function is designed in combination with the RSS strategy, so that the end-state sampling point of the agent is trained.
The optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, $r(\tau)$ is the reward function of trajectory τ, and π denotes the policy network.
The policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$

where α is the learning rate, i.e., the step size of the gradient ascent.
The derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \sum_\tau p(\tau, \theta)\, \nabla_\theta \log p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$
In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$

The update direction of the final policy parameter θ is therefore:

$$\theta \leftarrow \theta + \alpha\, \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$
To reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$
The part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

Differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = b\, \nabla_\theta 1 = 0$$
Clearly, according to this derivative of BL with respect to the network parameter θ, introducing a baseline b that is independent of the action a into the objective function J(π) does not affect the gradient $\nabla_\theta J(\pi)$ of the final optimization objective, and therefore does not affect the gradient of the final policy.
The variance of the gradient estimate is:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

It follows that the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance. Considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b), and hence the variance, is minimized at this value of b. Clearly, $\sum_\tau p(\tau, \theta)\, r(\tau)$ is the expected return, i.e., an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network.
The reinforcement-learning agent continuously improves its own ability during training, and in this process it has to keep exploring by trial and error in unfamiliar regions of the state space. In an unfamiliar state, a new action may bring the agent a higher reward, but may also turn out to be a worse action. An "exploration" action tries a new action; an "exploitation" action takes the known action with the maximum reward, i.e., it simply executes the current policy. Too much exploration (too little exploitation) makes convergence of the agent more difficult, while too little exploration (too much exploitation) makes it very likely that the agent converges to a locally optimal region. A trade-off between exploration and exploitation is therefore needed. In order to strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a locally optimal region, the output of the policy network $\pi_\theta(z, a)$ designed in the invention follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$
Theoretically, during learning the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, where Q(z, a) is the action-value function and $\arg\max_a$ denotes the action that maximizes Q for the given state. The output standard deviation σ(z, θ) continuously approaches 0, so the randomness of the policy decreases. When the policy is executed, an action is sampled from the normal distribution,

$$a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$$

and is output and executed.
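As a concrete illustration of such a Gaussian policy head, the following minimal PyTorch sketch (the `policy_net` module and tensor shapes are assumptions, not part of the patent) shows how an action is sampled from $\mathcal{N}(\mu(z,\theta), \sigma(z,\theta)^2)$ during execution and how its log-probability is kept for the later policy-gradient update:

```python
import torch

def sample_action(policy_net, z):
    """Sample a longitudinal end-state action a ~ N(mu(z, theta), sigma(z, theta)^2).

    policy_net is assumed to map the state feature z to (mu, sigma),
    with sigma kept positive inside the network (e.g. via softplus).
    """
    mu, sigma = policy_net(z)                      # each of shape (action_dim,)
    dist = torch.distributions.Normal(mu, sigma)   # diagonal Gaussian policy
    a = dist.sample()                              # exploration: random draw
    log_prob = dist.log_prob(a).sum(-1)            # used later in grad(log pi) * r
    return a, log_prob
```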
Step 4: Fit the longitudinal trajectory polynomial. Using the current longitudinal state of the vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the optimal longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ at time $t_1$ output by the reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

The boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

From the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

Based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained, giving the optimal trajectory, which is input to the control module.
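To make the boundary-condition solve concrete, the following minimal NumPy sketch (function and variable names are illustrative) computes the quintic coefficients from the current state and the end state exactly as in the linear system above:

```python
import numpy as np

def fit_quintic(s0, s0_d, s0_dd, s1, s1_d, s1_dd, t1):
    """Coefficients a0..a5 of s(t) = a0 + a1*t + ... + a5*t^5 that match the
    start state at t = 0 and the end state at t = t1."""
    a0, a1, a2 = s0, s0_d, 0.5 * s0_dd              # fixed by the conditions at t = 0
    A = np.array([[t1**3,   t1**4,    t1**5],
                  [3*t1**2, 4*t1**3,  5*t1**4],
                  [6*t1,    12*t1**2, 20*t1**3]])
    b = np.array([s1   - (a0 + a1*t1 + a2*t1**2),
                  s1_d - (a1 + 2*a2*t1),
                  s1_dd - 2*a2])
    a3, a4, a5 = np.linalg.solve(A, b)              # conditions at t = t1
    return np.array([a0, a1, a2, a3, a4, a5])

# e.g. the planned longitudinal position at time t:
# coeffs = fit_quintic(0.0, 15.0, 0.0, 120.0, 18.0, 0.0, 8.0)
# s_t = np.polyval(coeffs[::-1], 2.0)
```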
The invention provides an intelligent automobile controller, wherein an execution program of the method is arranged in the controller.
The invention also provides a storage device, which is internally provided with the program code of the method.
The invention has the beneficial effects that:
(1) For the automatic driving task, the invention solves the commercial vehicle queue driving problem under the Responsibility-Sensitive Safety (RSS) strategy by combining the Lattice algorithm with deep reinforcement learning. Using the A3C framework greatly improves training efficiency and promotes network convergence. Meanwhile, under the RSS framework, the safety of the reinforcement-learning planned path is greatly improved.
(2) Compared with the original Lattice algorithm, the method discards the time-consuming sampling and the cost-function evaluation of every candidate trajectory, greatly improving the real-time performance of the algorithm. Meanwhile, the reinforcement-learning training process has better generality, and designing the reward function based on the final control effect makes the method better suited to complex traffic scenes and complex vehicle dynamics.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a Policy Gradient network neural network architecture used by the present invention;
FIG. 3 shows the trajectory point sampling on the ST graph.
Detailed Description
The invention will be further explained with reference to the drawings.
The invention provides a commercial vehicle queue path planning method that combines deep reinforcement learning (DRL) and the Lattice algorithm under the Responsibility-Sensitive Safety (RSS) strategy, which improves the safety and stability of large commercial vehicles driving in a queue. As shown in FIG. 1, the method specifically comprises the following steps:
As shown in FIG. 1, the A3C framework is used with a fleet-shared-network training method: during exploration, each vehicle computes its own gradients and pushes them to Global_net. Compared with taking the states of all fleet members as input and outputting trajectories for all members, the method takes only the state of each individual agent as input and outputs that agent's trajectory, which promotes network convergence and avoids the phenomena of member actions interfering with one another and rewards cancelling out. Meanwhile, all intelligent connected vehicles share the decision network and jointly participate in its training. The individual training process of each agent is described below.
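The gradient-sharing loop between the per-vehicle workers and Global_net can be sketched as follows (a simplified, self-contained PyTorch sketch: the toy PolicyNet, the random placeholder state and the placeholder reward are illustrative stand-ins for the actual bird's-eye-view state, environment interaction and reward defined later, not the patent's implementation):

```python
import threading
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy stand-in for the shared policy network (Global_net)."""
    def __init__(self, state_dim=8, action_dim=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, action_dim)
        self.log_sigma = nn.Linear(64, action_dim)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.log_sigma(h).exp()

global_net = PolicyNet()
optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-4)
lock = threading.Lock()

def worker(vehicle_id, n_updates=10):
    """One fleet vehicle: pull the latest weights, interact, push gradients back."""
    local_net = PolicyNet()
    for _ in range(n_updates):
        local_net.load_state_dict(global_net.state_dict())    # pull from Global_net
        z = torch.randn(1, 8)                                  # placeholder state feature
        mu, sigma = local_net(z)
        dist = torch.distributions.Normal(mu, sigma)
        a = dist.sample()
        reward = -a.pow(2).sum()                               # placeholder reward
        loss = -(dist.log_prob(a).sum() * reward.detach())     # policy-gradient loss
        local_net.zero_grad()
        loss.backward()
        with lock:                                             # push gradients to Global_net
            for gp, lp in zip(global_net.parameters(), local_net.parameters()):
                gp.grad = lp.grad.clone()
            optimizer.step()
            optimizer.zero_grad()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```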
The deep-reinforcement-learning agent is trained to output the longitudinal end-state sampling point:
(1) Characteristic bird's eye view design. The invention designs an effective and simple time-series bird's eye view as the state quantity of the policy network, which greatly improves the learning of the policy network and guarantees the safety of the output trajectory.
Generating the time-series bird's eye view comprises the following two steps: (1) obtain the surrounding environment information, including moving and static obstacles and lane lines, from the perception module of the autonomous vehicle, and use the prediction module to obtain the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) use the information obtained from the perception module and the prediction to generate a characteristic bird's eye view with three dimensions: lateral, longitudinal and time.
The size of the three-dimensional time-series bird's-eye-view matrix is (40, 400, 80). The first dimension, 40, represents a lateral range of 10 m on each side of the reference line, with a lateral displacement interval of 0.5 m; the second dimension, 400, represents a range of 200 m longitudinally forward from the origin of the own vehicle, with a longitudinal displacement interval of 0.5 m; and the third dimension, 80, represents the time range within the next 8 s, with a time interval of 1 s. Specifically, when an entry [α, β, γ] of the time-series bird's-eye-view matrix is -1, that point is an obstacle or a non-drivable area in the time space; when an entry [α, β, γ] is 0, that point is a drivable area in the time space; and when an entry [α, β, γ] is 1, that point is a point of the reference line.
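A minimal NumPy sketch of how such a (40, 400, 80) occupancy matrix could be filled (the helper functions and the example obstacle are illustrative assumptions; the grid resolutions follow the description above):

```python
import numpy as np

LAT_BINS, LON_BINS, T_BINS = 40, 400, 80             # 0.5 m x 0.5 m x 1 s grid as described
bev = np.zeros((LAT_BINS, LON_BINS, T_BINS), dtype=np.int8)

def mark_reference_line(bev, lon_indices):
    """Cells on the reference line are set to 1 (center lateral bin, all time steps)."""
    bev[LAT_BINS // 2, lon_indices, :] = 1

def mark_obstacle(bev, lat_m, lon_m, t_s):
    """Mark a predicted obstacle position (Frenet meters / seconds) as -1."""
    i = int(lat_m / 0.5) + LAT_BINS // 2              # lateral offset from the reference line
    j = int(lon_m / 0.5)                              # longitudinal offset from the ego origin
    k = int(t_s)                                      # prediction time step
    if 0 <= i < LAT_BINS and 0 <= j < LON_BINS and 0 <= k < T_BINS:
        bev[i, j, k] = -1

mark_reference_line(bev, np.arange(LON_BINS))
mark_obstacle(bev, lat_m=1.0, lon_m=35.0, t_s=3.0)    # example predicted obstacle cell
```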
(2) State-quantity design. To obtain the state quantity of the agent at the current time from the characteristic bird's eye view, a Frenet coordinate transformation is carried out: the coordinate information of the vehicle in the Cartesian coordinate system, $(X, \theta_X, k_X, v_X, a_X)$, is converted into

$$(s, \dot{s}, \ddot{s})$$

where X is the (vector-valued) position of the vehicle in the Cartesian coordinate system, $\theta_X$ is the heading of the vehicle, $k_X$ is the curvature, $v_X$ is the linear velocity of the vehicle and $a_X$ is the acceleration of the vehicle in the Cartesian coordinate system; s is the longitudinal displacement in the Frenet coordinate system, $\dot{s}$ is the first derivative of the longitudinal displacement s with respect to time, and $\ddot{s}$ is the second derivative of s with respect to time. The state quantity at the current time is thus obtained:

$$(s_0, \dot{s}_0, \ddot{s}_0)$$

The action space is designed as the longitudinal end state of the trajectory:

$$(s_1, \dot{s}_1, \ddot{s}_1, t_1)$$
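For a straight or gently curving reference line, the longitudinal part of this conversion can be approximated as in the sketch below (a simplification under that assumption; a full Cartesian-to-Frenet conversion also accounts for the reference-line curvature and the lateral component, which are omitted here):

```python
import numpy as np

def cartesian_to_frenet_longitudinal(x, theta_x, v_x, a_x, ref_xy, ref_s, ref_theta):
    """Approximate (s, s_dot, s_ddot) by projecting onto a discretized reference line.

    ref_xy: (N, 2) reference-line points, ref_s: (N,) arc lengths, ref_theta: (N,) headings.
    """
    d = np.linalg.norm(ref_xy - np.asarray(x), axis=1)
    i = int(np.argmin(d))                  # closest reference-line point
    dtheta = theta_x - ref_theta[i]        # heading offset from the reference line
    s = ref_s[i]
    s_dot = v_x * np.cos(dtheta)           # longitudinal velocity component
    s_ddot = a_x * np.cos(dtheta)          # longitudinal acceleration component (approx.)
    return s, s_dot, s_ddot
```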
(3) Policy-network design. The obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ are taken as input, the Lattice planning algorithm is improved with a policy-gradient algorithm, the reward function is designed in combination with the RSS strategy, and the end-state sampling point of the agent is trained. The optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, and $r(\tau)$ is the reward function of trajectory τ.
The policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$
The expected return J(π) is differentiated with respect to the parameter θ in order to obtain the optimal θ, at which the policy network π, and hence the trajectory, is optimal. The derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \sum_\tau p(\tau, \theta)\, \nabla_\theta \log p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$
In the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory) in real time. During training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$
To reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$

The part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

Differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = b\, \nabla_\theta 1 = 0$$

Clearly, according to this derivative of BL with respect to the network parameter θ, introducing a baseline b that is independent of the action a into the objective function J(π) does not affect the gradient $\nabla_\theta J(\pi)$ of the final optimization objective, and therefore does not affect the gradient of the final policy.
The variance of the gradient estimate is:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

It follows that the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance. Considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b), and hence the variance, is minimized at this value of b. Clearly, $\sum_\tau p(\tau, \theta)\, r(\tau)$ is the expected return, i.e., an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network.
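The Monte Carlo policy-gradient update with the state value V(z) as baseline can be sketched as follows (PyTorch; `policy_net`, `value_net` and the batch layout are assumptions standing in for the networks and experience pool described here):

```python
import torch

def policy_gradient_step(policy_net, value_net, optimizer, batch):
    """One update from n sampled experiences <z, a, tau, r>.

    batch: dict of tensors 'z' (n, state_dim), 'a' (n, action_dim), 'r' (n,).
    The trajectory return r(tau) is used directly and V(z) serves as the baseline b.
    """
    z, a, r = batch["z"], batch["a"], batch["r"]
    mu, sigma = policy_net(z)
    dist = torch.distributions.Normal(mu, sigma)
    log_prob = dist.log_prob(a).sum(-1)                     # log pi_theta(z, a)
    baseline = value_net(z).squeeze(-1)                     # b = V(z)
    advantage = r - baseline
    policy_loss = -(log_prob * advantage.detach()).mean()   # gradient ascent on J(pi)
    value_loss = advantage.pow(2).mean()                    # fit V(z) to the sampled returns
    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```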
In order to strengthen the agent's ability to explore unfamiliar state space during training and to prevent it from falling into a locally optimal region, the output of the policy network $\pi_\theta(z, a)$ designed in the invention follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$

Theoretically, during learning the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, the output standard deviation σ(z, θ) continuously approaches 0, and the randomness of the policy decreases. When the policy is executed, an action is sampled from the normal distribution, $a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$, and is output and executed.
(4) Reward-function design. The reward function consists of the following parts, where $k_1 \sim k_3$ are the proportional coefficients of the respective parts:

$$reward = k_1 \cdot r_{speed} + k_2 \cdot r_{acc} + k_3 \cdot r_{safe}$$

Here $r_{speed}$ is the speed reward, whose goal is to keep the vehicle speed at the target speed: with $v_{target}$ the desired target speed, $t_{total}$ the number of trajectory points in time, and $v_t$ the speed of the planned trajectory at time t, $r_{speed}$ penalizes the deviation of $v_t$ from $v_{target}$ accumulated over the $t_{total}$ trajectory points.

$r_{acc}$ is the longitudinal comfort reward, whose goal is to keep the longitudinal jerk small: with $\ddot{s}_t$ the longitudinal acceleration of the planned trajectory at time t, $r_{acc}$ penalizes large changes of $\ddot{s}_t$ along the trajectory.
$r_{safe}$ is the safety reward; it is designed under the Responsibility-Sensitive Safety (RSS) strategy, with the goal of generating trajectories that meet the safety standard.

Longitudinal safe distance:

$$d^{lon}_{min} = \left[\, v_r\,\rho + \frac{1}{2}\, a_{max,accel}\, \rho^2 + \frac{\big(v_r + \rho\, a_{max,accel}\big)^2}{2\, a_{min,brake}} - \frac{v_f^2}{2\, a_{max,brake}} \,\right]_{+}$$

where $[x]_+ = \max(x, 0)$, $v_f$ is the speed of the front vehicle, $v_r$ is the speed of the rear vehicle, ρ is the driver reaction time, $a_{min,brake}$ is the minimum braking acceleration, $a_{max,brake}$ is the maximum braking acceleration, and $a_{max,accel}$ is the maximum acceleration.
Transverse safe distance:
Figure BDA0003720472470000107
wherein v is 1 Is the speed of the bicycle, v 2 The speed of the other vehicles in the transverse direction, namely the speed of the other vehicles trying to jam the vehicle entering the motorcade,
Figure BDA0003720472470000108
mu is the minimum value of the transverse distance when the transverse speed of the two vehicles is 0.
Figure BDA0003720472470000111
The acceleration is the maximum acceleration in the lateral direction,
Figure BDA0003720472470000112
and p is the driver reaction time, which is the lateral minimum braking acceleration.
When the vehicle runs according to the track generated by the strategy network, when the transverse and longitudinal distances between the vehicle and the front and rear vehicles or other vehicles which are jammed in the fleet are less than the minimum safety distance, the reward is-100, otherwise, the reward is 0:
Figure BDA0003720472470000113
where d is the distance to other vehicles.
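A small sketch of this safety term under the RSS constraints described above (the parameter values and the AND-combination of the two gap checks are illustrative assumptions):

```python
def rss_longitudinal_min_gap(v_r, v_f, rho=0.5, a_max_accel=2.0,
                             a_min_brake=4.0, a_max_brake=8.0):
    """Minimum longitudinal gap between a rear vehicle (v_r) and a front vehicle (v_f)."""
    d = (v_r * rho + 0.5 * a_max_accel * rho**2
         + (v_r + rho * a_max_accel)**2 / (2 * a_min_brake)
         - v_f**2 / (2 * a_max_brake))
    return max(d, 0.0)

def safety_reward(d_long, d_lat, d_long_min, d_lat_min):
    """r_safe = -100 when both the longitudinal and lateral RSS gaps are violated, else 0."""
    violated = (d_long < d_long_min) and (d_lat < d_lat_min)
    return -100.0 if violated else 0.0
```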
(5) Longitudinal trajectory polynomial fitting. Using the current longitudinal state of the vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the optimal longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ at time $t_1$ output by the reinforcement learning as boundary conditions, s is a fifth-order polynomial of time t:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

The boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

From the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

Based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained. Finally, the obtained trajectory is input to the control module for trajectory tracking control.
As shown in FIG. 2, the policy network $\pi_\theta(z, a)$ specifically comprises a convolutional (CNN) feature-extraction network and a fully connected network (FCN). Here z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical trajectory of the own vehicle; a is the output of the policy network, i.e., the end state of the planned trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$; and θ denotes the weight and bias parameters of the network. The input of the convolutional (CNN) feature-extraction network is the space-time bird's-eye-view matrix and its output is the extracted environmental feature information. The input of the fully connected network (FCN) is the environmental feature information output by the CNN feature-extraction network together with the historical trajectory information of the autonomous vehicle, and its output is the end state of the trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$.
The convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers. The input layer combines 3 matrices of 256 × 3 into a matrix of 256 × 9. The convolutional layer Conv1 consists of (3 × 9) × 32 convolution kernels with stride = 2; its input is the output of the input layer, a 256 × 9 matrix, and its output is a 128 × 32 feature. The pooling layer Pool1 consists of (2 × 2) pooling kernels with stride = 2; its input is the output of Conv1, a 128 × 32 feature, and its output is a 64 × 32 feature. The convolutional layer Conv2 consists of (3 × 32) × 64 convolution kernels with stride = 2; its input is the output of Pool1, a 64 × 32 feature, and its output is a 32 × 128 feature. The pooling layer Pool2 consists of (2 × 2) pooling kernels with stride = 2; its input is the output of Conv2, a 32 × 128 feature, and its output is a 16 × 128 feature. The convolutional layer Conv3 consists of (3 × 128) × 128 convolution kernels with stride = 2; its input is the output of Pool2, a 16 × 128 feature, and its output is an 8 × 128 feature. The fully connected layer FC has size (8 × 128) × 512; its input is the output of Conv3, an 8 × 128 feature, and its output is a 1 × 512 feature. The fully connected layers FC-μ and FC-σ form a parallel structure; both take as input the 1 × 512 feature extracted by the convolutional neural network, and each outputs a 1 × 512 feature. The features extracted by FC-μ and FC-σ together constitute the state feature z.
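A PyTorch sketch in the same spirit is given below (channel counts follow the description loosely; since the original layer sizes are not fully consistent, the exact dimensions here are illustrative, and the μ/σ heads are shown producing the action mean and scale directly rather than 1 × 512 features):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvPolicyNet(nn.Module):
    """Sketch of the CNN + FC policy head: a 1-D convolutional feature extractor over
    the flattened bird's-eye-view channels with parallel FC-mu / FC-sigma outputs."""
    def __init__(self, in_channels=9, seq_len=256, action_dim=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        feat_len = seq_len // 32                       # 256 -> 8 after the strides and pools
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(128 * feat_len, 512), nn.ReLU())
        self.fc_mu = nn.Linear(512, action_dim)        # mean of the Gaussian policy
        self.fc_sigma = nn.Linear(512, action_dim)     # scale of the Gaussian policy

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.fc_mu(h), F.softplus(self.fc_sigma(h)) + 1e-3

# e.g. x = torch.randn(1, 9, 256); mu, sigma = ConvPolicyNet()(x)
```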
As shown in fig. 3, in the ST diagram, the traffic scene is mainly divided into two main cases:
(1) there is no obstacle in front of the bicycle. And fitting a track by the initial state quantity and the final state track points trained from the deep reinforcement learning, and planning the longitudinal speed.
(2) There is a barrier in front of the bicycle. The present invention draws obstacles as parallelograms that block portions of a road for a particular period of time. For example, in the lower graph, the prediction module predicts that the vehicle will be at t 0 To t 1 Into the lane in which the vehicle will be drivenDuring which it occupies position s 0 To s 1 Therefore, a rectangle is drawn on the ST diagram, which will be at time t 0 To t 1 Period blocking position s 0 To s 1 . To avoid collisions, the velocity profile must not intersect this rectangle.
When following a car, the speed curve is below the lower boundary of the car following sampling. During overtaking, the speed curve is above the lower overtaking sampling limit.
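A small sketch of the corresponding feasibility check: the longitudinal profile s(t) obtained from the quintic polynomial must not enter any blocked ST rectangle (the function names and the sampling step are illustrative):

```python
import numpy as np

def st_profile_is_safe(coeffs, blocked, horizon, dt=0.1):
    """Check that the longitudinal profile s(t) avoids every blocked ST rectangle.

    coeffs:  quintic coefficients a0..a5 of s(t)
    blocked: list of (t_start, t_end, s_low, s_high) rectangles from the prediction module
    """
    for t in np.arange(0.0, horizon + dt, dt):
        s = sum(c * t**i for i, c in enumerate(coeffs))
        for t_start, t_end, s_low, s_high in blocked:
            if t_start <= t <= t_end and s_low <= s <= s_high:
                return False                  # the profile enters the obstacle rectangle
    return True

# e.g. st_profile_is_safe(coeffs=[0, 15, 0, 0.1, -0.01, 0.0],
#                         blocked=[(2.0, 4.0, 30.0, 45.0)], horizon=8.0)
```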
In summary, the invention solves the problem of the queue driving of the commercial vehicles by adopting a method of combining a Lattice algorithm and deep reinforcement learning under an RSS (responsibility sensitivity safety) strategy aiming at the automatic driving task. By applying the A3C framework, the training efficiency is greatly improved, and the network convergence is promoted. Meanwhile, under the RSS (responsibility sensitive safety) framework, the safety of the reinforcement learning planning path is greatly improved. Compared with the Lattice algorithm, the method abandons the sampling with higher time complexity and the evaluation process of each alternative track cost function, and greatly improves the timeliness of the algorithm. Meanwhile, the training process of reinforcement learning has better universality, and the design of the reward function based on the final control effect can be more suitable for complex traffic scenes and complex vehicle dynamics characteristics.
In addition, the embodiment of the invention also provides an intelligent automobile controller, and the controller is internally provided with an execution program of the method.
The embodiment of the invention also provides a storage device, and the storage device is internally provided with the program code of the method.
The above-listed series of detailed descriptions are merely specific illustrations of possible embodiments of the present invention, and they are not intended to limit the scope of the present invention, and all equivalent means or modifications that do not depart from the technical spirit of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A method for planning a queue path of a commercial vehicle by combining deep reinforcement learning and RSS strategies is characterized by comprising the following steps:
S1, designing a time-series bird's eye view as the state quantity of the policy network;

S2, performing a Frenet coordinate transformation and obtaining from the characteristic bird's eye view the state quantity of the agent at the current time, $(s_0, \dot{s}_0, \ddot{s}_0)$, the action space being designed as the longitudinal end state of the trajectory, $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$;

S3, taking the obtained state quantity $(s_0, \dot{s}_0, \ddot{s}_0)$ and action space $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$ as the policy-network input, improving the Lattice planning algorithm with a policy-gradient algorithm, designing the reward function in combination with the RSS strategy, and training the longitudinal end state of the agent;

S4, using the current longitudinal state of the ego vehicle $(s_0, \dot{s}_0, \ddot{s}_0)$ and the longitudinal end state $(s_1, \dot{s}_1, \ddot{s}_1)$ as boundary conditions, performing polynomial fitting of the longitudinal trajectory to obtain the optimal trajectory.
2. The method for planning a queue path of a commercial vehicle according to claim 1, wherein step S1 comprises the following two steps: (1) obtaining the surrounding environment information, including dynamic and static obstacles and lane lines, and predicting the position information of the dynamic obstacles within the future time interval $[0, t_{end}]$; (2) generating a characteristic bird's eye view with three dimensions, lateral, longitudinal and time, from the obtained environment information and the predicted information.
3. The method for planning a queue path of a commercial vehicle according to claim 1, wherein the three-dimensional time-series bird's-eye-view matrix has dimensions (40, 400, 80), wherein the first dimension, 40, represents a lateral range of 10 m on each side of the reference line with a lateral displacement interval of 0.5 m; the second dimension, 400, represents a range of 200 m longitudinally forward from the origin of the own vehicle with a longitudinal displacement interval of 0.5 m; and the third dimension, 80, represents the time range within the next 8 s with a time interval of 1 s; when an entry [α, β, γ] of the time-series bird's-eye-view matrix is -1, that point has an obstacle or is a non-drivable area in the time space; when an entry [α, β, γ] is 0, that point is a drivable area in the time space; and when an entry [α, β, γ] is 1, that point is a point of the reference line.
4. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S2 the Frenet coordinate transformation is as follows:

the coordinate information of the vehicle in the Cartesian coordinate system, $(X, \theta_X, k_X, v_X, a_X)$, is converted by the coordinate transformation into $(s, \dot{s}, \ddot{s})$, where X is the (vector-valued) position of the vehicle in the Cartesian coordinate system, $\theta_X$ is the heading of the vehicle, $k_X$ is the curvature, $v_X$ is the linear velocity of the vehicle, $a_X$ is the acceleration of the vehicle in the Cartesian coordinate system, s is the longitudinal displacement in the Frenet coordinate system, $\dot{s}$ is the first derivative of the longitudinal displacement s with respect to time, and $\ddot{s}$ is the second derivative of the longitudinal displacement s with respect to time.
5. The method for planning the queue path of the commercial vehicle according to claim 1, wherein in S3 the optimization goal of the policy network $\pi_\theta(z, a)$ is to maximize the expected return of the output planned trajectory:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau)$$

where z is the state feature of the surrounding traffic environment, a is the action output by the network (i.e., the longitudinal end state of the trajectory), θ is the network parameter, $p(\tau, \theta)$ is the probability that executing action a under parameter θ and state z outputs trajectory τ, and $r(\tau)$ is the reward function of trajectory τ;

the policy network $\pi_\theta(z, a)$ is optimized by gradient ascent:

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi)$$

differentiating J(π), the derivative of the optimization objective with respect to the network parameter θ is:

$$\nabla_\theta J(\pi) = \sum_\tau \nabla_\theta p(\tau, \theta)\, r(\tau) = \mathbb{E}_{\tau \sim p(\tau, \theta)}\big[\nabla_\theta \log \pi_\theta(z, a)\, r(\tau)\big]$$

in the actual sampling process, the agent continuously obtains trajectories and rewards from traffic scenes, adjusts its policy according to the rewards, and stores the experience data <z, a, τ, r> into an experience pool (Memory); during training, n pieces of experience data <z, a, τ, r> are randomly sampled from the experience pool with the Monte Carlo method, and according to the law of large numbers the gradient of the objective function $\nabla_\theta J(\pi)$ is approximated as:

$$\nabla_\theta J(\pi) \approx \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(z_i, a_i)\, r(\tau_i)$$

to reduce the variance, a baseline b is subtracted from the reward r(τ):

$$J(\pi) = \sum_\tau p(\tau, \theta) \cdot \big[r(\tau) - b\big]$$

the part of the objective function J(π) containing the baseline is then separated out:

$$J(\pi) = \sum_\tau p(\tau, \theta)\, r(\tau) - \underbrace{\sum_\tau p(\tau, \theta)\, b}_{BL}$$

differentiating the baseline part BL with respect to the network parameter θ:

$$\nabla_\theta BL = \nabla_\theta \sum_\tau p(\tau, \theta)\, b = b\, \nabla_\theta \sum_\tau p(\tau, \theta) = 0$$

the variance is calculated as:

$$\mathrm{Var} = \mathbb{E}_\tau\Big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\Big] - \Big(\mathbb{E}_\tau\big[\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big]\Big)^2$$

the smaller the term $\mathbb{E}_\tau\big[\big(\nabla_\theta \log \pi_\theta(z, a)\,(r(\tau) - b)\big)^2\big]$, the smaller the variance; considering this term as a function of b,

$$f(b) = \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)^2$$

the derivative of f(b) with respect to b is

$$f'(b) = -2 \sum_\tau p(\tau, \theta)\,\big(r(\tau) - b\big)$$

which is 0 when $b = \sum_\tau p(\tau, \theta)\, r(\tau)$, i.e., f(b) and hence the variance are minimized; $\sum_\tau p(\tau, \theta)\, r(\tau)$ is an implicit state value V(z), so the state value V(z) can be used as the baseline b to reduce the variance and improve the convergence speed and performance of the policy network;

the output of the policy network $\pi_\theta(z, a)$ follows a normal distribution, specified by a mean μ(z, θ) and a standard deviation σ(z, θ):

$$\pi_\theta(z, a) = \frac{1}{\sqrt{2\pi}\,\sigma(z, \theta)}\, \exp\!\left(-\frac{\big(a - \mu(z, \theta)\big)^2}{2\,\sigma(z, \theta)^2}\right)$$

during learning, the output mean μ(z, θ) of the policy network $\pi_\theta(z, a)$ continuously approaches the optimal action $\arg\max_a Q(z, a)$, the output standard deviation σ(z, θ) continuously approaches 0, and the randomness of the policy decreases; when the policy is executed, an action is sampled from the normal distribution, $a \sim \mathcal{N}\big(\mu(z, \theta),\ \sigma(z, \theta)^2\big)$, and is output and executed.
6. The method for planning the queue path of the commercial vehicle by combining deep reinforcement learning and the RSS strategy according to claim 1, wherein the reward function of the policy network is designed as:

$$reward = k_1 \cdot r_{speed} + k_2 \cdot r_{acc} + k_3 \cdot r_{safe}$$

where $r_{speed}$ is the speed reward, whose goal is to keep the vehicle speed at the target speed: with $v_{target}$ the desired target speed, $t_{total}$ the number of trajectory points in time and $v_t$ the speed of the planned trajectory at time t, $r_{speed}$ penalizes the deviation of $v_t$ from $v_{target}$ accumulated over the $t_{total}$ trajectory points;

$r_{acc}$ is the longitudinal comfort reward, whose goal is to keep the longitudinal jerk small: with $\ddot{s}_t$ the longitudinal acceleration of the planned trajectory at time t, $r_{acc}$ penalizes large changes of $\ddot{s}_t$ along the trajectory;

$r_{safe}$ is the safety reward, whose goal is that the generated trajectory meets the safety standard;

longitudinal safe distance:

$$d^{lon}_{min} = \left[\, v_r\,\rho + \frac{1}{2}\, a_{max,accel}\, \rho^2 + \frac{\big(v_r + \rho\, a_{max,accel}\big)^2}{2\, a_{min,brake}} - \frac{v_f^2}{2\, a_{max,brake}} \,\right]_{+}$$

where $[x]_+ = \max(x, 0)$, $v_f$ is the speed of the front vehicle, $v_r$ is the speed of the rear vehicle, ρ is the driver reaction time, $a_{min,brake}$ is the minimum braking acceleration, $a_{max,brake}$ is the maximum braking acceleration, and $a_{max,accel}$ is the maximum acceleration;

lateral safe distance:

$$d^{lat}_{min} = \mu + \left[\, \frac{v_1 + v_{1,\rho}}{2}\,\rho + \frac{v_{1,\rho}^2}{2\, a^{lat}_{min,brake}} - \left( \frac{v_2 + v_{2,\rho}}{2}\,\rho - \frac{v_{2,\rho}^2}{2\, a^{lat}_{min,brake}} \right) \right]_{+}$$

with $v_{1,\rho} = v_1 + \rho\, a^{lat}_{max,accel}$ and $v_{2,\rho} = v_2 - \rho\, a^{lat}_{max,accel}$, where $v_1$ is the lateral speed of the own vehicle, $v_2$ is the lateral speed of the other vehicle, μ is the minimum lateral gap when the lateral speeds of both vehicles are 0, $a^{lat}_{max,accel}$ is the maximum lateral acceleration, $a^{lat}_{min,brake}$ is the minimum lateral braking acceleration, and ρ is the driver reaction time;

when the vehicle travels along the trajectory generated by the policy network, the reward is -100 whenever the lateral and longitudinal distances between the own vehicle and the preceding/following vehicles of the fleet, or a cutting-in vehicle, are smaller than the minimum safe distances, and 0 otherwise:

$$r_{safe} = \begin{cases} -100, & d < d_{min} \\ 0, & d \ge d_{min} \end{cases}$$

where d is the distance to the other vehicle.
7. The method of claim 1, wherein the policy network $\pi_\theta(z, a)$ comprises a convolutional (CNN) feature-extraction network and a fully connected network (FCN); z is the input state quantity of the policy network, comprising the time-series bird's-eye-view matrix and the historical trajectory of the own vehicle; a is the output of the policy network, i.e., the end state of the planned trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$; θ denotes the weight and bias parameters of the network; the input of the convolutional (CNN) feature-extraction network is the space-time bird's-eye-view matrix and its output is the extracted environmental feature information; the input of the fully connected network (FCN) is the environmental feature information output by the CNN feature-extraction network together with the historical trajectory information of the autonomous vehicle, and its output is the end state of the trajectory $(s_1, \dot{s}_1, \ddot{s}_1, t_1)$;

the convolutional neural network of the policy network comprises three convolutional layers, two pooling layers and three fully connected layers, wherein the input layer combines 3 matrices of 256 × 3 into a matrix of 256 × 9; the convolutional layer Conv1 consists of (3 × 9) × 32 convolution kernels with stride = 2, its input is the output of the input layer, a 256 × 9 matrix, and its output is a 128 × 32 feature; the pooling layer Pool1 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv1, a 128 × 32 feature, and its output is a 64 × 32 feature; the convolutional layer Conv2 consists of (3 × 32) × 64 convolution kernels with stride = 2, its input is the output of Pool1, a 64 × 32 feature, and its output is a 32 × 128 feature; the pooling layer Pool2 consists of (2 × 2) pooling kernels with stride = 2, its input is the output of Conv2, a 32 × 128 feature, and its output is a 16 × 128 feature; the convolutional layer Conv3 consists of (3 × 128) × 128 convolution kernels with stride = 2, its input is the output of Pool2, a 16 × 128 feature, and its output is an 8 × 128 feature; the fully connected layer FC has size (8 × 128) × 512, its input is the output of Conv3, an 8 × 128 feature, and its output is a 1 × 512 feature; the fully connected layers FC-μ and FC-σ form a parallel structure, both take as input the 1 × 512 feature extracted by the convolutional neural network, and each outputs a 1 × 512 feature; the features extracted by FC-μ and FC-σ together constitute the state feature z.
8. The method for planning a queue path of a commercial vehicle according to claim 1, wherein in S4 the longitudinal trajectory polynomial is fitted as follows:

$$s(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$

the boundary conditions are:

$$s(0) = s_0,\quad \dot{s}(0) = \dot{s}_0,\quad \ddot{s}(0) = \ddot{s}_0,\quad s(t_1) = s_1,\quad \dot{s}(t_1) = \dot{s}_1,\quad \ddot{s}(t_1) = \ddot{s}_1$$

from the fifth-order polynomial of the longitudinal trajectory and the boundary conditions:

$$a_0 = s_0,\quad a_1 = \dot{s}_0,\quad a_2 = \tfrac{1}{2}\ddot{s}_0,\qquad
\begin{bmatrix} t_1^3 & t_1^4 & t_1^5 \\ 3t_1^2 & 4t_1^3 & 5t_1^4 \\ 6t_1 & 12t_1^2 & 20t_1^3 \end{bmatrix}
\begin{bmatrix} a_3 \\ a_4 \\ a_5 \end{bmatrix} =
\begin{bmatrix} s_1 - s_0 - \dot{s}_0 t_1 - \tfrac{1}{2}\ddot{s}_0 t_1^2 \\ \dot{s}_1 - \dot{s}_0 - \ddot{s}_0 t_1 \\ \ddot{s}_1 - \ddot{s}_0 \end{bmatrix}$$

based on the obtained coefficients $a_0, a_1, a_2, a_3, a_4, a_5$, the fifth-order longitudinal trajectory polynomial $s_{trajectory}(t)$ is obtained.
9. An intelligent automobile controller, characterized in that the controller is internally provided with an execution program of the method of any one of claims 1 to 8.
10. A storage device, characterized in that it houses the program code of the method according to any one of claims 1 to 8.
CN202210748792.5A 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy Pending CN115079697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210748792.5A CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210748792.5A CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Publications (1)

Publication Number Publication Date
CN115079697A true CN115079697A (en) 2022-09-20

Family

ID=83256365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210748792.5A Pending CN115079697A (en) 2022-06-29 2022-06-29 Commercial vehicle queue path planning method, controller and storage device combining deep reinforcement learning and RSS strategy

Country Status (1)

Country Link
CN (1) CN115079697A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115871658A (en) * 2022-12-07 2023-03-31 之江实验室 Intelligent driving speed decision method and system for dense pedestrian flow
CN115871658B (en) * 2022-12-07 2023-10-27 之江实验室 Dense people stream-oriented intelligent driving speed decision method and system
CN117542218A (en) * 2023-11-17 2024-02-09 上海智能汽车融合创新中心有限公司 Vehicle-road cooperative system based on vehicle speed-vehicle distance guiding control


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination