CN114442630A - Intelligent vehicle planning control method based on reinforcement learning and model prediction - Google Patents

Intelligent vehicle planning control method based on reinforcement learning and model prediction

Info

Publication number
CN114442630A
CN114442630A (application CN202210088325.4A; granted publication CN114442630B)
Authority
CN
China
Prior art keywords
vehicle
intelligent vehicle
intelligent
potential field
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210088325.4A
Other languages
Chinese (zh)
Other versions
CN114442630B (en
Inventor
陈剑
戚子恒
王通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210088325.4A priority Critical patent/CN114442630B/en
Publication of CN114442630A publication Critical patent/CN114442630A/en
Application granted granted Critical
Publication of CN114442630B publication Critical patent/CN114442630B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G05D1/0231 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • G05D1/0257 Control of position or course in two dimensions specially adapted to land vehicles using a radar
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • G05D1/0278 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an intelligent vehicle planning control method based on reinforcement learning and model prediction. The method comprises the following steps: acquiring and calculating road boundary information and obstacle information in the vehicle body coordinate system through a vehicle-mounted laser radar sensor; acquiring and calculating global reference waypoints in the vehicle body coordinate system with a vehicle-mounted GPS sensor; building a virtual scene in which the intelligent vehicle is located; in the virtual scene of the intelligent vehicle, planning a path for the intelligent vehicle with a path generation module based on the road boundary information, obstacle information and global reference waypoints in the vehicle body coordinate system, thereby obtaining the planned path of the intelligent vehicle; and tracking the planned path of the intelligent vehicle with the tracking control module, thereby realizing planning control of the intelligent vehicle. The invention accelerates the network training of the planning part, guarantees the path planning performance of the intelligent vehicle when its positioning is inaccurate, and improves the stability and comfort of the vehicle body motion.

Description

Intelligent vehicle planning control method based on reinforcement learning and model prediction
Technical Field
The invention belongs to the field of automatic driving of intelligent vehicles, and particularly relates to an intelligent vehicle planning control method based on reinforcement learning and model prediction in a weak-GPS environment.
Background
With the development of the economy and the improvement of the technical level of the automobile industry in recent years, the number of vehicles in use has continued to grow, and problems such as traffic accidents, traffic congestion, exhaust emissions and driver fatigue have worsened. Unmanned vehicles have the advantages of energy saving, environmental protection, comfort and high efficiency; they are an important trend in future automobile development and are highly valued by countries around the world.
Path planning and tracking control are key technologies for autonomous driving. For the path planning module, the planning effect depends heavily on a high-precision map and high-precision positioning equipment. Compared with a traditional electronic map with meter-level precision, a centimeter-level high-precision map can represent details such as the number, shape and width of lanes more faithfully, and can help the intelligent vehicle plan and make decisions more accurately. However, the information collection, quality inspection, operation and maintenance involved in producing a high-precision map make drawing and maintaining it expensive. Meanwhile, because GPS signals are easily degraded or lost due to weather, tall buildings, tunnels and the like, high-precision positioning equipment often has to be supplemented with expensive IMU equipment for auxiliary positioning, which greatly hinders the popularization of intelligent vehicles. The difficulty for the tracking control module is how to handle the nonlinear behavior of the vehicle system and the constraints on the state variables and manipulated variables while tracking the path. Meanwhile, errors are easily introduced when the sensors measure the motion state of the vehicle body, so the robustness of the controller under such disturbances must be ensured.
In recent years, reinforcement learning has achieved great success in fields such as image recognition, speech recognition and robotics. Q-learning is one branch of reinforcement learning. In Q-learning there is an agent with states and corresponding actions. At any time the agent is in some feasible state; in the next time step it transitions to a new state by performing an action, and each action is accompanied by a reward or penalty. The objective of the agent is to maximize its cumulative reward. Through constant trial and error in an initially unknown environment, the algorithm interacts with the environment, continually directs the vehicle to take actions that maximize its return, and eventually finds a collision-free path that avoids the obstacles.
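The Q-learning update described above can be illustrated with a minimal sketch. The environment, state and action sizes, learning rate and reward values below are assumptions chosen only for demonstration and are not taken from the invention:

```python
import numpy as np

# Minimal tabular Q-learning sketch on an abstract environment.
# n_states, n_actions, alpha, gamma and epsilon are illustrative assumptions.
n_states, n_actions = 100, 4
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Placeholder environment: returns (next_state, reward, done).
    A real implementation would query the grid map / simulator."""
    next_state = (state + action + 1) % n_states
    reward = -1.0            # small penalty per step encourages short paths
    done = next_state == n_states - 1
    if done:
        reward = 10.0        # reward for reaching the goal state
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy exploration: mostly exploit, sometimes explore
        if np.random.rand() < epsilon:
            action = np.random.randint(n_actions)
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
```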
The DDPG (Deep Deterministic Policy Gradient) algorithm adopts the Actor-Critic network structure and borrows the experience replay pool from the DQN (Deep Q-Network) algorithm: a database called the experience pool stores the data generated by the interaction between the agent and the environment. During training, the agent randomly samples training data from the experience pool to train the neural networks, which breaks the temporal correlation of the training data and effectively improves training efficiency and sample utilization.
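The experience pool mechanism can be sketched as follows; the capacity and batch size are assumed values, and the actor and critic networks themselves are omitted:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool in the DQN/DDPG sense: stores (s, a, r, s', done)
    transitions and returns random mini-batches, breaking the temporal
    correlation of consecutive samples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# usage: the agent pushes each interaction, then trains on random batches
pool = ReplayBuffer()
for i in range(1000):
    pool.push(i, 0.0, -1.0, i + 1, False)
states, actions, rewards, next_states, dones = pool.sample(batch_size=64)
```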
Model Predictive Control (MPC) is an effective method for conveniently handling multivariable constrained control problems and has been widely used in industrial systems. In recent years, MPC has been extended to the tracking control of moving bodies, achieving a predetermined goal in a suboptimal manner while satisfying the system constraints. In this control scheme, the control sequence is recalculated at every sample time by minimizing a cost function under input and state constraints. After the first control input of the sequence is applied to the system, the online optimization problem is solved again at the next time step based on the latest system state.
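The receding-horizon scheme described above can be sketched as follows, with the constrained optimization and the plant model replaced by placeholders (the step size, horizon and reference values are illustrative assumptions):

```python
import numpy as np

def solve_finite_horizon(x, ref, horizon):
    """Placeholder for the constrained optimization solved at each sample time:
    a real controller minimizes the cost function under input/state constraints
    and returns the optimal control sequence; a zero sequence stands in here."""
    return np.zeros(horizon)

def receding_horizon_control(x0, reference, n_steps, horizon=10):
    x, applied = x0, []
    for t in range(n_steps):
        u_seq = solve_finite_horizon(x, reference[t:t + horizon], horizon)
        u = u_seq[0]          # apply only the first control input of the sequence
        x = x + 0.1 * u       # placeholder plant update; a real system uses the vehicle model
        applied.append(u)
        # at the next step the optimization is repeated from the newest measured state
    return applied

reference = [0.0] * 40        # illustrative reference signal
controls = receding_horizon_control(x0=0.0, reference=reference, n_steps=20)
```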
Disclosure of Invention
In order to solve the problem of inaccurate positioning of the intelligent vehicle described in the background art, the invention provides an intelligent vehicle planning control method based on reinforcement learning and model prediction, improving existing planning and control algorithms so as to improve the stability and comfort of the intelligent vehicle when its positioning is inaccurate.
The technical scheme adopted by the invention is as follows:
the invention comprises the following steps:
step 1: obtaining an obstacle grid map through a vehicle-mounted laser radar sensor, determining road boundary information and obstacle information around a vehicle body in a laser radar sensor coordinate system based on the obstacle grid map, and then obtaining the road boundary information and the obstacle information in the vehicle body coordinate system after coordinate conversion;
step 2: acquiring global reference waypoints under a coordinate system of the vehicle-mounted GPS sensor by using the vehicle-mounted GPS sensor, acquiring vehicle body positioning and motion states by using the vehicle-mounted GPS sensor, and finally performing coordinate conversion on the global reference waypoints based on the vehicle body positioning and the motion states to acquire the global reference waypoints under a vehicle body coordinate system;
Step 3: building a virtual scene in which the intelligent vehicle is located from the obstacle grid map and the global reference waypoints;
Step 4: in the virtual scene of the intelligent vehicle, based on the road boundary information, obstacle information and global reference waypoints in the vehicle body coordinate system, planning a path for the intelligent vehicle with a path generation module to obtain the planned path of the intelligent vehicle;
Step 5: tracking the planned path of the intelligent vehicle with the tracking control module, thereby realizing planning control of the intelligent vehicle.
The path generation module in step 4 is obtained by training through the following steps:
S1: the training of the DDPG-based reinforcement learning agent is divided into an initial stage, an intermediate stage and a final stage in sequence; the first state space, input in the initial stage, consists of the distances from the intelligent vehicle to the left and right road boundaries and the position of the accurate global reference waypoint in the vehicle body coordinate system; the second state space, input in the intermediate stage, adds the position of the nearest obstacle in front of the intelligent vehicle in the vehicle body coordinate system; the third state space, input in the final stage, replaces the accurate global reference waypoint with the inaccurate global reference waypoint;
s2: constructing an action space which is the corner delta of the front wheel of the intelligent vehiclef
S3: forming training sets from the action space and the different state spaces to train the DDPG-based reinforcement learning agent, setting reward and punishment values to supervise the training process, and obtaining the trained reinforcement learning agent.
The reward and punishment values include a reward value R_arrive for reaching the endpoint, a punishment value R_collision for intelligent vehicle collision, and an intermediate-state reward and punishment value R_temp.
The reward and punishment value R_temp of the intermediate state is obtained by the following steps:
A1: using a potential field method, assigning corresponding potential field functions to the road boundary, the obstacle and the global reference waypoint in each training stage;
A2: according to the three potential field functions, respectively calculating the corresponding road boundary potential field P_R, obstacle potential field P_O, accurate global reference waypoint potential field P_W and inaccurate global reference waypoint potential field P_W′; after superposing the potential fields corresponding to the current training stage, the total potential field P_U of the current training stage is obtained and used as the reward and punishment value R_temp of the intermediate state;
A3: during training, according to the three-dimensional gradient map of the total potential field P_U, the potential field parameters of all the potential field functions of each training stage in A1 are tuned with a path planning method based on the potential field method; the total potential field of each training stage is updated according to the tuned potential field functions, and the updated total potential field is used as the reward and punishment value R_temp of the intermediate state of that training stage.
In the tracking control module in step 5, firstly a vehicle dynamics model is established for the intelligent vehicle, and then a prediction equation of the vehicle state is established based on the vehicle dynamics model;
then, according to a prediction equation of the vehicle state, a target optimization function and a constraint condition are established by using a model prediction control algorithm, and a path tracking controller is further established;
and finally, tracking the planned path of the intelligent vehicle by using the path tracking controller, thereby realizing the planning control of the intelligent vehicle.
The target optimization function is as follows:
min_{U(t)} J = Σ_{i=1}^{N_p−1} ‖y(t+i|t) − r(t+i|t)‖²_Q + Σ_{i=0}^{N_p−1} ‖Δu(t+i|t)‖²_R + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

The constraint conditions of the objective optimization function are as follows:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein min_{U(t)} J denotes taking the set of front wheel steering angle control quantities that minimizes the target optimization value of the intelligent vehicle over the prediction horizon corresponding to time t; J is the target optimization value of the intelligent vehicle, and U(t) is the set of front wheel steering angle control quantities over the prediction horizon corresponding to time t; ‖·‖²_Q, ‖·‖²_R and ‖·‖²_P denote the squared norms weighted by the first weight matrix Q, the second weight matrix R and the third weight matrix P, respectively; y(t+i|t) is the predicted value of the i-th vehicle-state yaw angle and lateral position at time t, r(t+i|t) is the expected value of the i-th vehicle-state yaw angle and lateral position at time t, u(t+i|t) is the i-th control quantity at time t, y(t+N_p|t) is the predicted value of the N_p-th vehicle-state yaw angle and lateral position at time t, and r(t+N_p|t) is the expected value of the N_p-th vehicle-state yaw angle and lateral position at time t; N_p is the prediction horizon, and Q, R and P are the first, second and third weight matrices; Δu_max and Δu_min are the right and left limit increments of the vehicle front wheel steering angle; Δu(k|t) is the control increment of the front wheel steering angle at time k given the current time t, and u(k|t) is the control quantity of the front wheel steering angle at time k given the current time t; u_max and u_min are the right and left extreme positions of the front wheel steering angle; y(k|t) is the vehicle-state yaw angle and lateral position at time k given the current time t, and y_min and y_max are the minimum and maximum of the vehicle-state yaw angle and lateral position; β(k|t) is the vehicle centroid slip angle at time k given the current time t, and β_min and β_max are the minimum and maximum of the vehicle centroid slip angle; Ω denotes the terminal constraint domain.
And the terminal constraint domain in the objective optimization function is subjected to linearization preprocessing.
The invention has the beneficial effects that:
the invention provides a planning control method aiming at the scene of inaccurate intelligent vehicle positioning, which comprises a path planning method based on DDPG reinforcement learning and a path tracking method based on model prediction control, namely a path generating module and a tracking control module.
In the path planning method, the path generation of the intelligent vehicle under the inaccurate positioning scene is realized based on the DDPG algorithm, and the safety and the smoothness of the path are ensured. The reward and punishment value of the DDPG is improved by a potential field method, and the training stage is divided into an initial stage, an intermediate stage and a final stage, so that the convergence speed and the training efficiency of the algorithm are improved.
In the tracking control method, the path tracking controller is realized based on a model prediction control algorithm, and the terminal cost and the terminal constraint are added into the target optimization function, so that the stability and the control precision of the control system are improved. And the terminal constraint domain is linearized, so that the real-time performance of the intelligent vehicle control system is ensured.
The planning control algorithm combining the path planning method and the tracking control method can smoothly complete obstacle avoidance in a scene where the intelligent vehicle is inaccurately positioned, complete a navigation task safely according to a designed path, and can ensure the smoothness and stability of a track.
Drawings
Fig. 1 is a schematic diagram of the offset of an acquired reference point.
FIG. 2 is a schematic diagram of the misalignment of the vehicle body causing the reference waypoints to shift.
Fig. 3 is a schematic diagram of a DDPG network structure.
Fig. 4 is a virtual environment path generation flow block diagram.
FIG. 5 is a smart vehicle kinematics model.
FIG. 6 is a schematic diagram of path generation in a virtual environment.
FIG. 7 is a vehicle dynamics model.
FIG. 8 is a graph of reward functions for reinforcement learning training.
Fig. 9 is a planning control implementation flow of the present invention.
FIG. 10 is a diagram of a smart vehicle motion profile when positioning is inaccurate.
Fig. 11 shows the centroid slip angle variation of the three methods when positioning is inaccurate.
Fig. 12 shows the lateral acceleration variation of the three methods when positioning is inaccurate.
Detailed Description
The invention will be further illustrated and described with reference to specific embodiments. The technical features of the embodiments of the present invention can be combined correspondingly without mutual conflict.
As shown in fig. 9, the present invention includes the steps of:
step 1: the intelligent vehicle is provided with a laser radar sensor and a GPS sensor. Obtaining an obstacle grid map through a vehicle-mounted laser radar sensor, determining road boundary information and obstacle information around a vehicle body in a laser radar sensor coordinate system based on the obstacle grid map, and then obtaining the road boundary information and the obstacle information in the vehicle body coordinate system after coordinate conversion; the obstacle information is specifically the position of the nearest obstacle in front of the intelligent vehicle.
Step 2: acquiring global reference waypoints in the coordinate system of the vehicle-mounted GPS sensor, acquiring the vehicle body positioning and motion state with the vehicle-mounted GPS sensor, and finally performing coordinate conversion on the global reference waypoints based on the vehicle body positioning and motion state to obtain the global reference waypoints in the vehicle body coordinate system. The signals of the vehicle-mounted GPS sensor may be disturbed by the environment and shifted, which causes the acquired global reference waypoints to shift, as shown in fig. 1. When the GPS signal is disturbed and the vehicle body is positioned inaccurately, the global reference waypoints in the vehicle body coordinate system are offset, as shown in fig. 2.
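The conversion of global reference waypoints into the vehicle body coordinate system, and the way a positioning error shifts those waypoints (as in figs. 1 and 2), can be sketched as follows; the poses and waypoint values are illustrative only:

```python
import numpy as np

def global_to_body(waypoints_xy, vehicle_xy, vehicle_yaw):
    """Transform global reference waypoints into the vehicle body frame.

    waypoints_xy : (N, 2) array of waypoints in the global frame
    vehicle_xy   : (2,) vehicle position in the same global frame
    vehicle_yaw  : heading angle of the vehicle (rad) in the global frame
    """
    c, s = np.cos(vehicle_yaw), np.sin(vehicle_yaw)
    # rotation from the global frame to the body frame (inverse of body-to-global)
    R = np.array([[c, s],
                  [-s, c]])
    return (np.asarray(waypoints_xy) - np.asarray(vehicle_xy)) @ R.T

# a GPS offset of the vehicle position shifts every waypoint in the body frame
waypoints = np.array([[10.0, 2.0], [20.0, 2.5]])
true_pose = np.array([0.0, 0.0])
noisy_pose = true_pose + np.array([0.8, -0.5])   # simulated positioning error
print(global_to_body(waypoints, true_pose, 0.05))
print(global_to_body(waypoints, noisy_pose, 0.05))
```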
Step 3: establishing a virtual scene in which the intelligent vehicle is located from the obstacle grid map and the global reference waypoints;
Step 4: as shown in fig. 4, in the virtual scene of the intelligent vehicle, based on the road boundary information, obstacle information and global reference waypoints in the vehicle body coordinate system, a path generation module is used to plan a path for the intelligent vehicle, obtaining the planned path of the intelligent vehicle. The kinematic model of the intelligent vehicle is shown in fig. 5, and the planned path generated in the virtual environment is shown in fig. 6.
The path generation module in the step 4 is obtained by training the following steps:
S1: the network structure of the DDPG-based reinforcement learning agent is shown in fig. 3, and the training of the agent is divided, from simple to difficult according to the training scene, into an initial stage, an intermediate stage and a final stage. The first state space, input in the initial stage, consists of the distances d_l and d_r from the intelligent vehicle to the left and right road boundaries and the position d_wx, d_wy of the accurate global reference waypoint in the vehicle body coordinate system. The second state space, input in the intermediate stage, consists of the first state space plus the position d_ox, d_oy of the nearest obstacle in front of the intelligent vehicle in the vehicle body coordinate system. The third state space, input in the final stage, consists of the distances from the intelligent vehicle to the left and right road boundaries, the position of the nearest obstacle in front of the intelligent vehicle in the vehicle body coordinate system, and the position d_wx′, d_wy′ of the inaccurate reference waypoint in the vehicle body coordinate system; i.e. the third state space is s = {d_l, d_r, d_ox, d_oy, d_wx′, d_wy′}.
S2: constructing an action space, which is the front wheel steering angle δ_f of the intelligent vehicle;
S3: forming training sets from the action space and the different state spaces to train the DDPG-based reinforcement learning agent, setting reward and punishment values to supervise the training process, and obtaining the trained reinforcement learning agent;
The reward and punishment values include a reward value R_arrive for reaching the endpoint, a punishment value R_collision for intelligent vehicle collision, and an intermediate-state reward and punishment value R_temp.
The reward and punishment value R_temp of the intermediate state is obtained by the following steps:
A1: using a potential field method, assigning corresponding potential field functions to the road boundary, the obstacle and the global reference waypoint in each training stage;
A2: according to the three potential field functions, respectively calculating the corresponding road boundary potential field P_R, obstacle potential field P_O, accurate global reference waypoint potential field P_W and inaccurate global reference waypoint potential field P_W′; after superposing the potential fields corresponding to the current training stage, the total potential field P_U of the current training stage is obtained and used as the reward and punishment value R_temp of the intermediate state; i.e. the reward and punishment value of the intermediate state of the final stage is R_temp = P_R + P_O + P_W′.
The potential field function of a road boundary is:
[Formula provided as an image in the original publication]
wherein P_R(d_l, d_r) is the road boundary potential field, a_R is the intensity parameter of the potential field, and d_s is the safe distance from the intelligent vehicle to the road boundary.
The potential field function of an obstacle is:
[Formula provided as an image in the original publication]
wherein P_O(d_ox, d_oy) is the obstacle potential field, and a_o and b_o are the intensity parameter and shape parameter of the obstacle potential function, respectively. X_s and Y_s denote the safe distances between the vehicle and the obstacle in the longitudinal and lateral directions, where the longitudinal direction is the driving direction of the intelligent vehicle and the lateral direction is perpendicular to it, both in the horizontal plane. They are defined as:
X_s = X_0 − vT_0
Y_s = Y_0 + (v·sinθ_e − v_o·sinθ_e)T_0
wherein X_0 and Y_0 denote the minimum safe distances in the longitudinal and lateral directions respectively, T_0 is the safe time interval, v is the speed of the intelligent vehicle, v_o is the speed of the obstacle, and θ_e is the heading angle deviation between the intelligent vehicle and the obstacle.
The potential field functions of the accurate and inaccurate global reference waypoints are the same, wherein the potential field function of the global reference waypoint is:
[Formula provided as an image in the original publication]
wherein P_W(d_wy) is the accurate global reference waypoint potential field, d_a is the error range of the lateral position of the global reference waypoint, and a_w is the potential field strength of the global reference waypoint.
A3: during training, according to the three-dimensional gradient map of the total potential field P_U, the potential field parameters of all the potential field functions of each training stage in A1 are tuned with a path planning method based on the potential field method; the total potential field of each training stage is updated according to the tuned potential field functions, and the updated total potential field is used as the reward and punishment value R_temp of the intermediate state of that training stage.
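The superposition of the stage-specific potential fields into the intermediate reward R_temp can be sketched as follows. The potential field functions are treated as generic callables here; the lambdas are simple placeholders and are not the potential field formulas of the invention:

```python
def total_potential(potentials, state):
    """Superpose the stage-specific potential fields to obtain P_U, which is
    used directly as the intermediate reward/punishment R_temp.  `potentials`
    is a list of callables whose actual forms follow the patent's formulas."""
    return sum(p(state) for p in potentials)

# Final-stage example: R_temp = P_R + P_O + P_W' (inaccurate waypoint potential).
# The lambdas below are placeholders, NOT the patent's actual potential functions.
P_R = lambda s: -1.0 if min(s["d_l"], s["d_r"]) < s["d_s"] else 0.0
P_O = lambda s: -2.0 if abs(s["d_ox"]) < s["X_s"] and abs(s["d_oy"]) < s["Y_s"] else 0.0
P_Wp = lambda s: -0.1 * abs(s["d_wy_prime"])

state = {"d_l": 1.8, "d_r": 1.6, "d_s": 0.5,
         "d_ox": 12.0, "d_oy": 0.4, "X_s": 8.0, "Y_s": 1.2,
         "d_wy_prime": 0.3}
R_temp = total_potential([P_R, P_O, P_Wp], state)
```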
Step 5: tracking the planned path of the intelligent vehicle with the tracking control module, thereby realizing planning control of the intelligent vehicle.
In the tracking control module in step 5, firstly a vehicle dynamics model is established for the intelligent vehicle, and then a prediction equation of the vehicle state is established based on the vehicle dynamics model; the vehicle dynamics model is shown in fig. 7.
Then, according to a prediction equation of the vehicle state, a target optimization function with terminal constraint and terminal cost and constraint conditions are established by using a model prediction control algorithm, and a path tracking controller is further established;
and finally, tracking the planned path of the intelligent vehicle by controlling the corner of the front wheel of the vehicle by using a path tracking controller, thereby realizing the planning control of the intelligent vehicle.
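The prediction of future vehicle states from a prediction equation can be sketched with a discrete linear model; the matrices A and B below are placeholders with suitable shapes only, not the linearized vehicle dynamics of fig. 7:

```python
import numpy as np

def predict_states(A, B, x0, u_sequence):
    """Roll a discrete linear model x_{k+1} = A x_k + B u_k over the prediction
    horizon and return the predicted state trajectory.  A and B would come from
    linearizing and discretizing the vehicle dynamics model; here they are
    placeholders with the right shapes only."""
    xs = [np.asarray(x0, dtype=float)]
    for u in u_sequence:
        xs.append(A @ xs[-1] + B.flatten() * u)
    return np.array(xs[1:])

# assumed state = [lateral position, yaw angle], control = front wheel steering angle
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])
B = np.array([[0.05],
              [0.10]])
horizon_controls = [0.02, 0.02, 0.0, -0.01]
print(predict_states(A, B, [0.0, 0.0], horizon_controls))
```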
The objective optimization function with terminal constraints and terminal costs is:
min_{U(t)} J = Σ_{i=1}^{N_p−1} ‖y(t+i|t) − r(t+i|t)‖²_Q + Σ_{i=0}^{N_p−1} ‖Δu(t+i|t)‖²_R + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

The constraint conditions of the objective optimization function are as follows:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein the term ‖y(t+N_p|t) − r(t+N_p|t)‖²_P is the added terminal cost and y(t+N_p|t) − r(t+N_p|t) ∈ Ω is the added terminal constraint. min_{U(t)} J denotes taking the set of front wheel steering angle control quantities that minimizes the target optimization value of the intelligent vehicle over the prediction horizon corresponding to time t. J is the target optimization value of the intelligent vehicle; it reflects the requirements on the path tracking error and on smooth changes of the control quantity over a future time horizon. U(t) is the set of front wheel steering angle control quantities over the prediction horizon corresponding to time t. ‖·‖²_Q, ‖·‖²_R and ‖·‖²_P denote the squared norms weighted by the first weight matrix Q, the second weight matrix R and the third weight matrix P: the Q-weighted term weighs the tracking error of the intelligent vehicle at the i-th instant of time t and reflects the requirement on the path tracking error, the R-weighted term weighs the control smoothness of the intelligent vehicle at the i-th instant of time t and reflects the requirement on smooth changes of the control quantity, and the P-weighted term weighs the tracking error of the intelligent vehicle at the N_p-th instant of time t. y(t+i|t) is the predicted value of the i-th vehicle-state yaw angle and lateral position at time t, and r(t+i|t) is the expected value of the i-th vehicle-state yaw angle and lateral position at time t; the expected values of the yaw angle and lateral position are obtained from the planned path of the intelligent vehicle. u(t+i|t) is the i-th control quantity at time t, y(t+N_p|t) is the predicted value of the N_p-th vehicle-state yaw angle and lateral position at time t, and r(t+N_p|t) is the expected value of the N_p-th vehicle-state yaw angle and lateral position at time t. N_p is the prediction horizon, and Q, R and P are the first, second and third weight matrices. Δu_max and Δu_min are the right and left limit increments of the vehicle front wheel steering angle; Δu(k|t) is the control increment of the front wheel steering angle at time k given the current time t, and u(k|t) is the control quantity of the front wheel steering angle at time k given the current time t; u_max and u_min are the right and left extreme positions of the front wheel steering angle. y(k|t) is the vehicle-state yaw angle and lateral position at time k given the current time t, and y_min and y_max are the minimum and maximum of the vehicle-state yaw angle and lateral position. β(k|t) is the vehicle centroid slip angle at time k given the current time t, and β_min and β_max are the minimum and maximum of the vehicle centroid slip angle. Ω denotes the terminal constraint domain.
The terminal constraint domain in the objective optimization function is subjected to linearization preprocessing, which ensures the real-time performance of the control system.
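Evaluation of the objective optimization value J for a candidate sequence of control increments, including the terminal cost, can be sketched as follows; the weights and trajectories are illustrative assumptions, and the constrained optimization and terminal-set check are omitted:

```python
import numpy as np

def mpc_cost(y_pred, y_ref, du_seq, Q, R, P):
    """J = sum ||y - r||_Q^2 + sum ||du||_R^2 + terminal cost ||y_Np - r_Np||_P^2.
    y_pred, y_ref : (Np, 2) arrays of predicted / expected [yaw angle, lateral position]
    du_seq        : (Np,) control increments of the front wheel steering angle
    Q, R, P       : weights (R is a scalar here because the input is scalar)"""
    stage = sum(e @ Q @ e for e in (y_pred[:-1] - y_ref[:-1]))
    effort = sum(R * du * du for du in du_seq)
    terminal = (y_pred[-1] - y_ref[-1]) @ P @ (y_pred[-1] - y_ref[-1])
    return stage + effort + terminal

Np = 5
Q = np.diag([1.0, 10.0])      # illustrative weights, not the patent's values
P = np.diag([5.0, 50.0])
R = 0.5
y_ref = np.zeros((Np, 2))
y_pred = 0.01 * np.ones((Np, 2))
du = 0.005 * np.ones(Np)
print(mpc_cost(y_pred, y_ref, du, Q, R, P))
```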
In this embodiment, the training environment is a joint simulation of MATLAB/Simulink and Carsim. The network structure, state space, action space and reward function of the reinforcement learning algorithm are designed in MATLAB/Simulink, and a high-precision, high-fidelity vehicle model is obtained from Carsim.
After the potential field design is finished, the potential field parameters are set using a potential-field-based path planning method. If the planned path does not meet the safety requirements, the potential field parameters are adjusted.
When setting the reinforcement learning training scenes, the training is divided into three stages from simple to difficult: the initial stage contains only the road boundaries and accurate reference waypoints; the intermediate stage adds obstacles to the initial stage; and the final stage adds inaccurate reference waypoints to the intermediate stage.
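The simple-to-difficult arrangement of the training scenes can be sketched as a curriculum schedule; the episode counts and stage flags below are assumptions for illustration only:

```python
# Illustrative three-stage curriculum for the DDPG training scenes.
stages = [
    {"name": "initial", "obstacles": False, "waypoint_noise": False, "episodes": 300},
    {"name": "intermediate", "obstacles": True, "waypoint_noise": False, "episodes": 300},
    {"name": "final", "obstacles": True, "waypoint_noise": True, "episodes": 400},
]

def train_agent_on(stage):
    """Placeholder for one stage of DDPG training in the simulated scene."""
    print(f"training stage '{stage['name']}' for {stage['episodes']} episodes")

for stage in stages:
    train_agent_on(stage)
```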
The result of the reinforcement learning training is shown in fig. 8, and both the network training effect and the convergence rate of the method are improved compared with the traditional DDPG network.
The controller provided by the invention is tested under a double-lane-change condition, with noise added to the yaw rate and lateral velocity, and its tracking performance is compared with that of the traditional model predictive control method. The mean absolute error (MAE) of the tracking performance is given in the following table:
table 1: mean absolute error of tracking effect (MAE)
[Table 1 provided as an image in the original publication]
As can be seen from Table 1, the tracking accuracy of the proposed tracking control method is improved over the traditional model predictive control method in the error-free case, with yaw-rate noise, and with lateral-velocity noise.
The path planning method and the tracking control method provided by the invention are combined to deal with the scenario of inaccurate vehicle body positioning; the implementation flow is shown in fig. 9. Fig. 10 compares the planning control effects in a designed scenario where the reference waypoints and the vehicle body positioning are inaccurate: frame A is the planning control method proposed by the present invention, frame B is traditional DDPG planning with pure pursuit tracking control, and PF + MPC is potential-field planning with model predictive tracking control. In fig. 11, (a), (b) and (c) show the centroid slip angle changes of the three methods in sequence, and in fig. 12, (a), (b) and (c) show the lateral acceleration changes of the three methods in sequence, reflecting the stability and comfort of the trajectory. Table 2 gives a statistical analysis of the experimental data.
Table 2: table for analyzing experimental results of the present invention and other methods
[Table 2 provided as an image in the original publication]
As can be seen from fig. 9, fig. 10, fig. 11, fig. 12 and table 2, the planning control method designed by the present invention can make the intelligent vehicle have a more comfortable and stable motion state when the positioning is not accurate.

Claims (7)

1. An intelligent vehicle planning control method based on reinforcement learning and model prediction is characterized by comprising the following steps:
step 1: obtaining an obstacle grid map through a vehicle-mounted laser radar sensor, determining road boundary information and obstacle information around a vehicle body in a laser radar sensor coordinate system based on the obstacle grid map, and then obtaining the road boundary information and the obstacle information in the vehicle body coordinate system after coordinate conversion;
step 2: acquiring global reference waypoints under a coordinate system of the vehicle-mounted GPS sensor by using the vehicle-mounted GPS sensor, acquiring vehicle body positioning and motion states by using the vehicle-mounted GPS sensor, and finally performing coordinate conversion on the global reference waypoints based on the vehicle body positioning and the motion states to acquire the global reference waypoints under a vehicle body coordinate system;
Step 3: building a virtual scene in which the intelligent vehicle is located from the obstacle grid map and the global reference waypoints;
Step 4: in the virtual scene of the intelligent vehicle, based on the road boundary information, obstacle information and global reference waypoints in the vehicle body coordinate system, planning a path for the intelligent vehicle with a path generation module to obtain the planned path of the intelligent vehicle;
Step 5: tracking the planned path of the intelligent vehicle with the tracking control module, thereby realizing planning control of the intelligent vehicle.
2. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 1, wherein the path generation module in step 4 is obtained by training through the following steps:
S1: the training of the DDPG-based reinforcement learning agent is divided into an initial stage, an intermediate stage and a final stage in sequence; the first state space, input in the initial stage, consists of the distances from the intelligent vehicle to the left and right road boundaries and the position of the accurate global reference waypoint in the vehicle body coordinate system; the second state space, input in the intermediate stage, adds the position of the nearest obstacle in front of the intelligent vehicle in the vehicle body coordinate system; the third state space, input in the final stage, replaces the accurate global reference waypoint with the inaccurate global reference waypoint;
s2: constructing an action space which is the corner delta of the front wheel of the intelligent vehiclef
S3: forming training sets from the action space and the different state spaces to train the DDPG-based reinforcement learning agent, setting reward and punishment values to supervise the training process, and obtaining the trained reinforcement learning agent.
3. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 2, wherein the reward and punishment values include a reward value R_arrive for reaching the endpoint, a punishment value R_collision for intelligent vehicle collision, and an intermediate-state reward and punishment value R_temp.
4. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 3, wherein the reward and punishment value R_temp of the intermediate state is obtained by the following steps:
A1: using a potential field method, assigning corresponding potential field functions to the road boundary, the obstacle and the global reference waypoint in each training stage;
A2: according to the three potential field functions, respectively calculating the corresponding road boundary potential field P_R, obstacle potential field P_O, accurate global reference waypoint potential field P_W and inaccurate global reference waypoint potential field P_W′; after superposing the potential fields corresponding to the current training stage, the total potential field P_U of the current training stage is obtained and used as the reward and punishment value R_temp of the intermediate state;
A3: during training, according to the three-dimensional gradient map of the total potential field P_U, the potential field parameters of all the potential field functions of each training stage in A1 are tuned with a path planning method based on the potential field method; the total potential field of each training stage is updated according to the tuned potential field functions, and the updated total potential field is used as the reward and punishment value R_temp of the intermediate state of that training stage.
5. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 1, wherein the tracking control module of step 5 firstly establishes a vehicle dynamics model according to the intelligent vehicle, and then establishes a prediction equation of the vehicle state based on the vehicle dynamics model;
then, according to a prediction equation of the vehicle state, a target optimization function and a constraint condition are established by using a model prediction control algorithm, and a path tracking controller is further established;
and finally, tracking the planned path of the intelligent vehicle by using the path tracking controller, thereby realizing the planning control of the intelligent vehicle.
6. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 5, wherein the objective optimization function is:
min_{U(t)} J = Σ_{i=1}^{N_p−1} ‖y(t+i|t) − r(t+i|t)‖²_Q + Σ_{i=0}^{N_p−1} ‖Δu(t+i|t)‖²_R + ‖y(t+N_p|t) − r(t+N_p|t)‖²_P

The constraint conditions of the objective optimization function are as follows:

Δu_min ≤ Δu(k|t) ≤ Δu_max
u_min ≤ u(k|t) ≤ u_max
y_min ≤ y(k|t) ≤ y_max
β_min ≤ β(k|t) ≤ β_max
k = t, …, t+N_p−1
y(t+N_p|t) − r(t+N_p|t) ∈ Ω

wherein min_{U(t)} J denotes taking the set of front wheel steering angle control quantities that minimizes the target optimization value of the intelligent vehicle over the prediction horizon corresponding to time t; J is the target optimization value of the intelligent vehicle, and U(t) is the set of front wheel steering angle control quantities over the prediction horizon corresponding to time t; ‖·‖²_Q, ‖·‖²_R and ‖·‖²_P denote the squared norms weighted by the first weight matrix Q, the second weight matrix R and the third weight matrix P, respectively; y(t+i|t) is the predicted value of the i-th vehicle-state yaw angle and lateral position at time t, r(t+i|t) is the expected value of the i-th vehicle-state yaw angle and lateral position at time t, u(t+i|t) is the i-th control quantity at time t, y(t+N_p|t) is the predicted value of the N_p-th vehicle-state yaw angle and lateral position at time t, and r(t+N_p|t) is the expected value of the N_p-th vehicle-state yaw angle and lateral position at time t; N_p is the prediction horizon, and Q, R and P are the first, second and third weight matrices; Δu_max and Δu_min are the right and left limit increments of the vehicle front wheel steering angle; Δu(k|t) is the control increment of the front wheel steering angle at time k given the current time t, and u(k|t) is the control quantity of the front wheel steering angle at time k given the current time t; u_max and u_min are the right and left extreme positions of the front wheel steering angle; y(k|t) is the vehicle-state yaw angle and lateral position at time k given the current time t, and y_min and y_max are the minimum and maximum of the vehicle-state yaw angle and lateral position; β(k|t) is the vehicle centroid slip angle at time k given the current time t, and β_min and β_max are the minimum and maximum of the vehicle centroid slip angle; Ω denotes the terminal constraint domain.
7. The intelligent vehicle planning control method based on reinforcement learning and model prediction as claimed in claim 6, wherein the terminal constraint domain in the objective optimization function is subjected to linearization preprocessing.
CN202210088325.4A 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction Active CN114442630B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210088325.4A CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210088325.4A CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Publications (2)

Publication Number Publication Date
CN114442630A true CN114442630A (en) 2022-05-06
CN114442630B CN114442630B (en) 2023-12-05

Family

ID=81368785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210088325.4A Active CN114442630B (en) 2022-01-25 2022-01-25 Intelligent vehicle planning control method based on reinforcement learning and model prediction

Country Status (1)

Country Link
CN (1) CN114442630B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799386A (en) * 2019-10-25 2021-05-14 中国科学院沈阳自动化研究所 Robot path planning method based on artificial potential field and reinforcement learning
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN112650237A (en) * 2020-12-21 2021-04-13 武汉理工大学 Ship path planning method and device based on clustering processing and artificial potential field

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JUNQIANG LIN: "APF-DPPO: an automatic driving policy learning method based on the artificial potential field method to optimize the reward function", MACHINES
刘和祥; 边信黔; 秦政; 王宏健: "Research on AUV collision avoidance planning based on forward-looking sonar information", Journal of System Simulation, no. 24
王通: "Low-cost navigation of intelligent vehicles based on reinforcement learning", China Master's Theses Full-text Database, Engineering Science and Technology II, pages 035-484
韩光信: "Nonlinear predictive control for trajectory tracking of constrained nonholonomic mobile robots", Journal of Jilin University (Engineering and Technology Edition), pages 177-181

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114578834A (en) * 2022-05-09 2022-06-03 北京大学 Target layered double-perception domain-based reinforcement learning unmanned vehicle path planning method
CN115540896A (en) * 2022-12-06 2022-12-30 广汽埃安新能源汽车股份有限公司 Path planning method, path planning device, electronic equipment and computer readable medium
CN115540896B (en) * 2022-12-06 2023-03-07 广汽埃安新能源汽车股份有限公司 Path planning method and device, electronic equipment and computer readable medium

Also Published As

Publication number Publication date
CN114442630B (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN110187639B (en) Trajectory planning control method based on parameter decision framework
Li et al. Shared control driver assistance system based on driving intention and situation assessment
Li et al. Real-time trajectory planning for autonomous urban driving: Framework, algorithms, and verifications
Weiskircher et al. Predictive guidance and control framework for (semi-) autonomous vehicles in public traffic
CN113276848B (en) Intelligent driving lane changing and obstacle avoiding track planning and tracking control method and system
CN111289978A (en) Method and system for making decision on unmanned driving behavior of vehicle
CN114442630B (en) Intelligent vehicle planning control method based on reinforcement learning and model prediction
CN110568841A (en) Automatic driving decision method and system
CN112249008B (en) Unmanned automobile early warning method aiming at complex dynamic environment
CN112965476B (en) High-speed unmanned vehicle trajectory planning system and method based on multi-window model
CN112577506B (en) Automatic driving local path planning method and system
CN113255998B (en) Expressway unmanned vehicle formation method based on multi-agent reinforcement learning
CN113433947B (en) Intersection trajectory planning and control method based on obstacle vehicle estimation and prediction
CN115257745A (en) Automatic driving lane change decision control method based on rule fusion reinforcement learning
Wei et al. Game theoretic merging behavior control for autonomous vehicle at highway on-ramp
CN114942642A (en) Unmanned automobile track planning method
Zhang et al. Structured road-oriented motion planning and tracking framework for active collision avoidance of autonomous vehicles
CN115257746A (en) Uncertainty-considered decision control method for lane change of automatic driving automobile
CN113200054B (en) Path planning method and system for automatic driving take-over
CN115447615A (en) Trajectory optimization method based on vehicle kinematics model predictive control
Li et al. Distributed MPC for multi-vehicle cooperative control considering the surrounding vehicle personality
CN115140048A (en) Automatic driving behavior decision and trajectory planning model and method
CN114291112A (en) Decision planning cooperative enhancement method applied to automatic driving automobile
Smit et al. Informed sampling-based trajectory planner for automated driving in dynamic urban environments
Wang Control system design for autonomous vehicle path following and collision avoidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant