CN116884238A - Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning - Google Patents

Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning

Info

Publication number
CN116884238A
Authority
CN
China
Prior art keywords
vehicle
intelligent vehicle
intelligent
network
speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310564555.8A
Other languages
Chinese (zh)
Inventor
谢宪毅
刘国峰
金立生
郭柏苍
韩广德
朱文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN202310564555.8A priority Critical patent/CN116884238A/en
Publication of CN116884238A publication Critical patent/CN116884238A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G08 — SIGNALLING
    • G08G — TRAFFIC CONTROL SYSTEMS
    • G08G 1/00 — Traffic control systems for road vehicles
    • G08G 1/07 — Controlling traffic signals
    • G08G 1/075 — Ramp control

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application relates to a reinforcement-learning-based decision method for merging an intelligent vehicle onto an expressway from an entrance ramp. It belongs to the field of intelligent vehicles and aims to improve the merging success rate of intelligent vehicles and reduce traffic accidents. The method comprises: setting the intersection of the expressway ramp and the main road as the merge point, and setting the ramp and the main road within S1 m before the merge point and S2 m beyond it as the control area; in the control area, projecting the intelligent vehicle onto the target lane of the main road so that the distance from the projection to the merge point equals the distance from the intelligent vehicle to the merge point on the ramp; determining the vehicles in front of and behind the projected intelligent vehicle and acquiring their speeds and positions as environment vehicle information; acquiring the distance from the intelligent vehicle to the merge point and the speed and acceleration of the intelligent vehicle as intelligent vehicle information; and, according to the intelligent vehicle information and the environment vehicle information, adjusting the acceleration and front-wheel steering angle with a reinforcement learning DDPG model so that a successful merge is achieved step by step.

Description

Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning
Technical Field
The application relates to intelligent vehicles, and in particular to a reinforcement-learning-based decision method for merging an intelligent vehicle onto an expressway from an entrance ramp.
Background
The expressway entrance ramp is an important scenario for automated driving. It is characterized by a complex traffic environment and a high accident rate, and it is a bottleneck section for vehicles travelling on the expressway. Existing rule-based decision algorithms tend to be too conservative and cannot cope with unpredictable sudden situations, while existing reinforcement-learning-based decision algorithms suffer from low sample utilization and poor training stability; in addition, the design of the reward function still leaves room for further optimization.
Disclosure of Invention
In order to solve the problems in the prior art, the application aims to provide a reinforcement-learning-based decision method for merging an intelligent vehicle onto an expressway from an entrance ramp. The method assists in controlling an intelligent vehicle within a control area so that, under a given speed limit, it makes an effective decision within a set distance of the merge point, thereby improving the safety of the intelligent vehicle.
To achieve the above object, the technical solution of the application is as follows.
In a first aspect, the application provides a reinforcement-learning-based intelligent vehicle expressway ramp merging decision method, characterized in that the intersection of the expressway ramp and the main road is set as the merge point, the ramp and the main road within S1 m before the merge point and S2 m beyond it are set as the control area, and S1 and S2 are set values; in the control area, the intelligent vehicle is projected onto the target lane of the main road such that the distance from the projection to the merge point equals the distance from the intelligent vehicle to the merge point on the ramp; the vehicles in front of and behind the projected intelligent vehicle are determined, and their speeds and positions are acquired as environment vehicle information; the distance from the intelligent vehicle to the merge point and the speed and acceleration of the intelligent vehicle are acquired as intelligent vehicle information; according to the intelligent vehicle information and the environment vehicle information, the intelligent vehicle adjusts its acceleration and front-wheel steering angle with a reinforcement learning DDPG model, achieving a successful merge step by step.
In the above technical solution, the reward obtained by the agent of the reinforcement learning DDPG model includes at least a first reward, whose calculation steps comprise:
dividing the merging process into a plurality of stages and setting an influence factor for each stage;
determining the stage to which the intelligent vehicle belongs according to its distance to the merge point, and calculating the reward with the influence factor of that stage.
In the above technical solution, the first reward is calculated as follows:
where k_i is the influence factor of the i-th stage, i = 1, 2, …, n, n being the set total number of stages; w_m is the weight of the merge-location reward; Δv_max is the maximum allowable difference between the intelligent vehicle's speed and the average speed of the front and rear vehicles; v_p1 is the speed of the first vehicle in front of the intelligent vehicle; v_f1 is the speed of the first vehicle behind it; v_m is the intelligent vehicle's speed; d_m is the distance from the center point of the intelligent vehicle to the merge point; and w takes values in [0, 1], where 0 means the gaps to the first front vehicle and to the first rear vehicle are equal, and 1 means the gap to the first front vehicle or to the first rear vehicle is zero. w is defined as follows:
where d_p1 is the distance from the first vehicle in front of the intelligent vehicle to the merge point, d_f1 is the distance from the first vehicle behind it to the merge point, and l_p1 and l_m are the vehicle lengths of the first front vehicle and of the intelligent vehicle, respectively.
In the above technical solution, the reward obtained by the agent of the reinforcement learning DDPG model is the sum of a first reward and a second reward, the second reward being any one, or the sum of any two or more, of the following:
a collision penalty and non-collision reward, a safe driving speed reward, a penalty for stopping without reaching the designated destination, a destination arrival reward, and a passenger comfort penalty; wherein:
the collision penalty and non-collision reward are calculated as follows:
where r_col denotes the vehicle collision penalty and r_collision is the penalty value set for a vehicle collision;
the safe driving speed reward is calculated as follows:
where r_v(v) denotes the high-speed reward at the current speed v of the intelligent vehicle, the highest speed of the intelligent vehicle is used as the reference speed, and r_h_s is the reward value for high-speed driving;
the penalty for stopping without reaching the designated destination is calculated as follows:
where r_s denotes the penalty for stopping without reaching the designated destination and r_stop is the corresponding penalty value;
the destination arrival reward is calculated as follows:
where r_a denotes the reward for successfully reaching the destination and r_arrival is the reward value set for successful arrival at the destination;
the passenger comfort penalty is calculated as follows:
where w_j is the weight of the comfort penalty, j_max is the maximum jerk allowed for passenger comfort, the derivative of the intelligent vehicle's acceleration is used, and j_m is the degree of impact (jerk) experienced by the passenger.
In the above technical solution, the agent of the reinforcement learning DDPG model selects actions through an Actor network and evaluates the Q value of the selected action through a Critic network, both the Actor network and the Critic network consisting of a temporal neural network layer and a fully connected layer; wherein:
the action output by the Actor network is concatenated with the output of the temporal neural network layer in the Critic network and passed to the fully connected layer in the Critic network, which computes the evaluation Q value for that action.
In the above technical solution, the Actor network and the Critic network each have network parameters and target network parameters, the two being identical at initialization;
during updating, N historical state-action transitions are randomly sampled from the experience replay pool D, N being a set value; the network parameters θ^Q of the Critic network are updated by minimizing the loss function, and the network parameters θ^μ of the Actor network are then updated by maximizing the Q value estimated by the Critic network;
the target network parameters θ^{μ'} and θ^{Q'} of the Actor and Critic networks are obtained by the soft updates θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'} and θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'}, where τ is the approximation coefficient;
after training of the reinforcement learning DDPG model is finished, the Actor network uses the target network parameters to make the intelligent vehicle's merging decisions.
In the above technical solution, the experience replay pool is divided into a positive-sample experience replay pool and a negative-sample experience replay pool;
when the agent samples from the experience pools, it samples data evenly from the two replay pools according to the set sample number and then combines the sampled data for training;
a positive sample is a sample in which the intelligent vehicle merges successfully, and a negative sample is a sample in which the merge fails.
In the above technical solution, the temporal neural network is any one of LSTM, GRU, Bi-LSTM and RNN.
In a second aspect, the application provides an intelligent vehicle equipped with a system implementing any one of the above methods, the system controlling the intelligent vehicle so that it merges successfully from the expressway ramp onto the expressway main road.
In a third aspect, the present application proposes a computer readable storage medium storing a computer program capable of being loaded by a processor and executing any one of the methods described above.
The technical scheme of the application has the following technical effects:
(1) The intelligent vehicle is projected onto the target lane of the main road and, taking the perception range of the intelligent vehicle into account, several vehicles in front of and behind the projection are taken as environment vehicles and their information is acquired. Combining this with the intelligent vehicle's own information lets the reinforcement learning take the traffic density of the main road into account, helping the intelligent vehicle merge from the ramp into the main-road traffic while avoiding collisions and improving the merging success rate.
(2) The distance of the projected intelligent vehicle from the merge point affects the merging decision to different degrees, so the merging process is divided into several stages and an influence factor is set for each stage. This improves the success rate of merging from the ramp onto the expressway and reduces traffic accidents.
(3) By adding a temporal neural network to the DDPG, merging failures such as collisions and leaving the road boundary can be effectively avoided.
(4) In training, designing positive- and negative-sample experience replay pools improves sample utilization and training efficiency, and the trained decision model has strong robustness and stability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a schematic diagram of expressway ramp merging in an embodiment, with the intelligent vehicle in the AB merging stage;
FIG. 2 is a schematic diagram of expressway ramp merging with the intelligent vehicle in the BC merging stage in one embodiment;
FIG. 3 is a schematic diagram of expressway ramp merging with the intelligent vehicle in the CD merging stage in one embodiment;
FIG. 4 is a schematic diagram of a DDPG structure incorporating an LSTM in one embodiment;
FIG. 5 is a schematic diagram of a positive and negative sample experience playback pool structure in one embodiment;
FIG. 6 is a schematic diagram of a reinforcement learning model.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
The expressway entrance-ramp scenario consists of the expressway main road and a merging lane. The merging environment includes the geometry and structure of the merging lane together with the merging vehicle, the environment vehicles, and so on. The length of the acceleration lane of the merging lane is determined according to the highway route design specification, and the acceleration-lane length of the entrance ramp is determined by the highest design speed of the main line. According to the road traffic safety law, the minimum speed on an expressway must not be lower than 60 km/h, i.e. a vehicle must travel at 60 km/h or more to enter the expressway main road, and the main-road speed is limited to 120 km/h.
The merging vehicle is an intelligent vehicle equipped with perception sensors. Merging at an expressway entrance ramp is forced merging: within the fixed-length acceleration lane and under the prescribed speed limits, and taking the traffic density of the main line into account, the vehicle must comprehensively consider the spatial and temporal constraints, formulate a safe, reasonable and generalizable merging decision strategy, select the best merging opportunity, and complete the merge into the main line.
To improve the success rate and safety of an intelligent vehicle merging from a ramp onto the expressway main road, the application provides a reinforcement-learning-based intelligent vehicle expressway ramp merging decision method. The intersection of the expressway ramp and the main road is set as the merge point, the ramp and the main road within S1 m before the merge point and S2 m beyond it are set as the control area, and S1 and S2 are set values; in the control area, the intelligent vehicle is projected onto the target lane of the main road such that the distance from the projection to the merge point equals the distance from the intelligent vehicle to the merge point on the ramp; the vehicles in front of and behind the projected intelligent vehicle are determined, and their speeds and positions are acquired as environment vehicle information; the distance from the intelligent vehicle to the merge point and the speed and acceleration of the intelligent vehicle are acquired as intelligent vehicle information; according to the intelligent vehicle information and the environment vehicle information, the intelligent vehicle adjusts its acceleration and front-wheel steering angle with a reinforcement learning DDPG model, achieving a successful merge step by step.
In an embodiment in which an intelligent vehicle merges from a ramp onto the expressway main road, it is assumed that there are no merging vehicles on the entrance ramp other than the intelligent vehicle, and that the refresh rate of the merging environment is 15 Hz, i.e. the environment is refreshed 15 times per second so that the states of the vehicles in the environment, such as position and driving direction, are updated in real time. The perception range of the intelligent vehicle is a circle with a radius of 200 m, which is the range of a typical perception sensor. The intelligent vehicle starts at the intersection of the entrance ramp and the control area and drives towards the merge point, i.e. the intersection of the expressway ramp and the main road. Its initial speed is chosen randomly in the range 16.7–27.8 m/s, and a merging simulation ends when the vehicle stops, collides or merges successfully. One simulation run corresponds to one training round described later. After a merging simulation is completed, the intelligent vehicle is deleted; if another simulation is run, the intelligent vehicle is regenerated with new initial conditions.
The simulated environment vehicles reflect the randomness of the traffic into which the expressway entrance ramp merges: the initial position is chosen randomly within 0–150 m of the starting point and the speed within 25–33 m/s. The longitudinal driving strategy of the environment vehicles is based on the Intelligent Driver Model (IDM), so they follow the vehicle ahead and avoid collisions; since the main road is a single lane, the environment vehicles perform no lane changes.
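The IDM itself is not detailed in the text; for reference, a minimal sketch of the standard IDM acceleration update is given below. All parameter values (desired speed, time headway, minimum gap, maximum acceleration, comfortable deceleration) are illustrative assumptions, not values prescribed by this application.

import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0,      # desired speed (m/s), assumed
                     T=1.5,        # desired time headway (s), assumed
                     s0=2.0,       # minimum standstill gap (m), assumed
                     a_max=1.5,    # maximum acceleration (m/s^2), assumed
                     b=2.0,        # comfortable deceleration (m/s^2), assumed
                     delta=4.0):
    """Standard IDM: car-following acceleration of an environment vehicle.

    v       -- own speed (m/s)
    v_lead  -- speed of the vehicle ahead (m/s)
    gap     -- bumper-to-bumper distance to the vehicle ahead (m)
    """
    dv = v - v_lead                                    # closing speed
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 0.1)) ** 2)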
In view of the 200 m perception range of the intelligent vehicle, the control area is set, by way of example, to 230 m of the ramp before the merge point (an exemplary value of S1) and 100 m of the main road beyond it (an exemplary value of S2). The reinforcement-learning merging decision is only used to control the intelligent vehicle inside the control area. The intelligent vehicle is assumed to make an effective decision within the 230 m before the merge point because, under the given speed limit, it can come to a complete stop within 230 m if the merge is unsuccessful; moreover, the initial part of the entrance ramp is designed for merging vehicles to accelerate from low speed, and the intelligent vehicle makes no merging decision there. The 100 m beyond the merge point is used to evaluate whether the merge was successful, because if the intelligent vehicle does not choose the right merging opportunity and vehicle gap, a collision may occur after the merge.
One merging process of the intelligent vehicle can be regarded as a Markov Decision Process (MDP), so the merging problem can be modelled as an MDP and represented by (S, A, R), where S is the state space, A is the action space, and R is the reward function. The details are as follows.
(1) State space
In the expressway entrance-ramp merging environment built with highway-env, the intelligent vehicle can acquire the environment state within its detectable range through its on-board sensors, including the positions and speeds of other environment vehicles within that range, the positions of obstacles in the road, and the intelligent vehicle's own speed, acceleration, front-wheel steering angle, and so on.
Illustratively, the environment state within the perception range covers the states of 5 vehicles: the merging vehicle (m), its two front vehicles (p1, p2) and two rear vehicles (f1, f2). The front/rear order of the vehicles takes the intelligent vehicle as the reference, with its driving direction as the front; the first front vehicle and the first rear vehicle are the vehicles adjacent to the intelligent vehicle.
When the intelligent vehicle on the entrance ramp is projected onto the target lane of the main line, the distance from the projection to the merge point is the same as the distance from the intelligent vehicle to the merge point on the ramp, the distance being measured from the center point of the intelligent vehicle. The state of the intelligent vehicle comprises the distance d_m from its center point to the merge point, its speed v_m and its acceleration a_m, which form the intelligent vehicle information; the states of the two front and two rear vehicles comprise their distances (d_p1, d_p2, d_f1, d_f2) and speeds (v_p1, v_p2, v_f1, v_f2), which form the environment vehicle information. The state space consists of the intelligent vehicle information and the environment vehicle information and can be expressed as S = [d_p2, v_p2, d_p1, v_p1, d_m, v_m, a_m, d_f1, v_f1, d_f2, v_f2]; the order of the state entries does not limit the application of the method.
Given the intelligent vehicle's perception range and an average vehicle body length (here an exemplary length of 5 m), it is reasonable to consider about 4 vehicles in the surrounding environment, two in front and two behind. Since the speeds and distances of the intelligent vehicle and the environment vehicles change in real time during simulation, when fewer than two vehicles are detected in front of or behind the intelligent vehicle, a virtual vehicle can be assumed at the intersection of the perception range and the main road to complete the 5-vehicle state vector; as such a vehicle would be just outside the perception range, the virtual vehicle is given a speed within the main-road speed limit. Note that whether a vehicle is in front of or behind is determined relative to the intelligent vehicle or its projection: if the intelligent vehicle decelerates on the ramp and its projection falls behind a main-road vehicle, that rear vehicle becomes a front vehicle. In practice, as the intelligent vehicle drives along the ramp, the front/rear relation of the projected vehicles on the main road may change, and the environment vehicle information is acquired in real time, so the merging decision of each step is adjusted in real time without disturbing the environment vehicles, which improves the merging success rate.
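For illustration, the 11-dimensional state vector and the virtual-vehicle padding described above could be assembled as in the following sketch. The helper names, the placement of the virtual vehicle at the edge of the perception range and its padding speed are assumptions of this sketch, not requirements of the method.

PERCEPTION_RANGE = 200.0              # sensor radius (m)
MAIN_ROAD_SPEED_LIMIT = 120.0 / 3.6   # 120 km/h in m/s

def build_state(d_m, v_m, a_m, front_vehicles, rear_vehicles):
    """Build S = [d_p2, v_p2, d_p1, v_p1, d_m, v_m, a_m, d_f1, v_f1, d_f2, v_f2].

    front_vehicles / rear_vehicles: lists of (distance_to_merge_point, speed) for
    main-road vehicles ahead of / behind the projected intelligent vehicle,
    ordered from nearest to farthest.
    """
    def pad(vehicles, ahead):
        vehicles = list(vehicles)[:2]
        while len(vehicles) < 2:
            # assume a virtual vehicle at the edge of the perception range,
            # travelling within the main-road speed limit
            edge = d_m - PERCEPTION_RANGE if ahead else d_m + PERCEPTION_RANGE
            vehicles.append((edge, MAIN_ROAD_SPEED_LIMIT))
        return vehicles

    (d_p1, v_p1), (d_p2, v_p2) = pad(front_vehicles, ahead=True)
    (d_f1, v_f1), (d_f2, v_f2) = pad(rear_vehicles, ahead=False)
    return [d_p2, v_p2, d_p1, v_p1, d_m, v_m, a_m, d_f1, v_f1, d_f2, v_f2]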
(2) Action space
The action space of the intelligent vehicle consists of the longitudinal acceleration and the front-wheel steering angle; the actions are continuous and are expressed as A = {a, θ}, where a is the longitudinal acceleration and θ is the front-wheel steering angle.
(3) Reward function
The reward at each time step includes at least the first reward described in section (3.1) below, or the sum of the first reward of section (3.1) and the second reward of section (3.2).
(3.1) Reward functions with different degrees of influence for different road sections, designed according to the merging stage of the intelligent vehicle.
Within the control area, the distance of the projected intelligent vehicle from the merge point changes as the vehicle moves from the entrance ramp towards the merge point, and this distance affects how strongly a decision contributes to a successful merge. Therefore, reward functions with different degrees of influence are designed for the different merging stages, i.e. for the different road-section positions of the intelligent vehicle on the entrance ramp, so as to provide accurate and efficient rewards and penalties for the merging decision, thereby improving the success rate of merging from the ramp onto the expressway and reducing traffic accidents.
As shown in FIG. 1, when the intelligent vehicle is in the AB merging stage it is relatively far from the merge point, and a decision has only a small effect on a successful merge; the following first reward function is designed:
The projected position of the intelligent vehicle should lie between the front vehicle and the rear vehicle, and its speed should be close to the average speed of those two vehicles; the corresponding reward/penalty for the AB section is defined accordingly.
Here k_ab is the AB-stage influence factor, with an exemplary value of 0.5; w_m is the weight of the merge-location reward, with an exemplary value of 0.015; Δv_max is the maximum allowable difference between the intelligent vehicle's speed and the average speed of the front and rear vehicles, with an exemplary value of 5 m/s; v_m is the intelligent vehicle's speed; d_m is the distance from the center point of the intelligent vehicle to the merge point; v_p1 is the speed of the first vehicle in front of the intelligent vehicle; v_f1 is the speed of the first vehicle behind it; and w takes values in [0, 1], where 0 means the gaps to the first front vehicle and to the first rear vehicle are equal, and 1 means the gap to the first front vehicle or to the first rear vehicle is zero. w is defined as follows:
where l_p1 and l_m are the actual vehicle lengths of the first vehicle in front of the intelligent vehicle and of the intelligent vehicle, respectively, d_p1 is the distance from the center point of the first front vehicle to the merge point, and d_f1 is the distance from the center point of the first rear vehicle to the merge point.
As shown in FIG. 2, when the intelligent vehicle is in the BC section it is relatively close to the merge point; this is an important stage for choosing the proper timing and actions to achieve a successful merge, and the following first reward function is designed:
Here k_bc, the BC-stage influence factor, has an exemplary value of 1.
As shown in FIG. 3, when the intelligent vehicle is in the CD section, i.e. the merging stage closest to the merge point, its action output is critical for avoiding collisions and achieving a successful merge, and the following first reward function is designed:
Here k_cd, the CD-stage influence factor, has an exemplary value of 1.5.
(3.2) The second reward function, which remains unchanged throughout the merging process, comprises the following parts:
(1) To account for the impact of the merging process on passenger comfort, the corresponding reward function is defined as follows:
where w_j is the weight of the comfort penalty, with an exemplary value of 0.012; j_max is the maximum jerk allowed for passenger comfort, with an exemplary value of 3 m/s³; the derivative of the intelligent vehicle's acceleration is used; and j_m is the degree of impact (jerk) experienced by the passenger.
(2) When an intelligent vehicle collides with any vehicle, a penalty is given and the corresponding reward function is defined as follows:
where r_col denotes the vehicle collision penalty and r_collision, the penalty value set for a vehicle collision, is −1.
(3) Chinese expressway regulations stipulate a minimum driving speed of 60 km/h, which is set to prevent congestion caused by slow-moving vehicles. A high-speed driving reward is therefore set to encourage the intelligent vehicle to drive fast while remaining safe, and the corresponding reward function is defined as follows:
where r_v(v) denotes the high-speed reward at the current speed v of the intelligent vehicle; the highest speed of the intelligent vehicle used here is at least equal to the minimum driving speed of 60 km/h; and r_h_s, the reward value for high-speed driving, has an exemplary value of 0.5.
(4) When the intelligent vehicle successfully reaches the point 100 m beyond the merge point, an arrival reward is given, and the corresponding reward function is defined as follows:
where r_a denotes the reward for successfully reaching the destination and r_arrival, the reward value for successful arrival, has an exemplary value of 1.
(5) When the intelligent vehicle stops without reaching the designated destination, a penalty is given and the current episode of the simulation ends; the corresponding reward function is defined as follows:
where r_s denotes the penalty for stopping without reaching the designated destination and r_stop, the corresponding penalty value, is −0.5.
In summary, the overall reward function is as follows:
R = r_m + r_j + r_col + r_v(v) + r_a + r_s    (1)
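For illustration, the following sketch wires the exemplary constants given above into the summation of equation (1). The analytic forms of the stage reward r_m, the comfort term r_j and the speed term r_v(v) are not reproduced from the original formulas; where a form is needed, a plausible stand-in is used and marked as an assumption.

R_COLLISION = -1.0    # collision penalty value (exemplary)
R_HIGH_SPEED = 0.5    # high-speed reward value (exemplary)
R_ARRIVAL = 1.0       # destination arrival reward value (exemplary)
R_STOP = -0.5         # stop-without-arrival penalty value (exemplary)
W_J = 0.012           # comfort penalty weight (exemplary)
J_MAX = 3.0           # maximum jerk allowed for comfort, m/s^3 (exemplary)
V_MAX = 120.0 / 3.6   # assumed highest vehicle speed (main-road limit)

def total_reward(r_m, v, jerk, collided, arrived, stopped):
    """R = r_m + r_j + r_col + r_v(v) + r_a + r_s, as in equation (1).

    r_m is the stage-dependent first reward computed separately; the forms of
    r_j and r_v(v) below are illustrative assumptions, not the original formulas.
    """
    r_j = -W_J * min(abs(jerk) / J_MAX, 1.0)          # comfort penalty (assumed form)
    r_col = R_COLLISION if collided else 0.0          # collision penalty
    r_v = R_HIGH_SPEED * min(v / V_MAX, 1.0)          # high-speed reward (assumed form)
    r_a = R_ARRIVAL if arrived else 0.0               # destination arrival reward
    r_s = R_STOP if stopped and not arrived else 0.0  # stop-without-arrival penalty
    return r_m + r_j + r_col + r_v + r_a + r_s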
as a further improvement, by adding the time-series neural network into the DDPG, the behavior of converging failure such as collision, exiting the road boundary and the like can be effectively avoided. Specifically, in the process of merging the expressway entrance ramp, the intelligent vehicle finds the optimal driving strategy by maximizing long-term return through interaction with the environment, the merging process involves interaction with other vehicles on the main road, and the behavior state information of the other vehicles influences the most driving strategy of the merging vehicles, so that a time sequence neural network is merged into a DDPG model frame, the characteristic that the time sequence neural network processes the time sequence information is utilized, the predicted behavior of avoiding merging failures such as collision and departure from a road boundary is realized during action selection, and the training efficiency of the DDPG merging decision model is improved. The time sequence neural network inputs the historical information of the running states of the intelligent vehicle and the environment vehicle for a plurality of continuous time steps, namely the historical information of the intelligent vehicle information and the environment vehicle information. The time-series neural network can be any one of LSTM, GRU, bi-LSTM and RNN.
Taking the temporal neural network LSTM as an example, one embodiment of incorporating an LSTM into the DDPG framework is shown in FIG. 4. The Actor network in the DDPG framework consists of three layers: a 2-layer LSTM and 1 fully connected layer. The LSTM has an input size of 44 and 256 hidden units; the following fully connected layer takes the 256-dimensional output as input and produces the 2-dimensional action output. The LSTM part of the Critic network is set up identically to that of the Actor network, and the input is the historical driving state information of the intelligent vehicle and the environment vehicles over 4 consecutive time steps, i.e. the intelligent vehicle information and environment vehicle information of 4 consecutive time steps. The action output by the Actor network is concatenated with the output of the LSTM network in the Critic network into one vector, which is passed to the next fully connected layer; this layer computes the evaluation Q value of the action. The LSTM may be replaced by any of GRU, Bi-LSTM or RNN.
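A minimal PyTorch sketch of the LSTM-based Actor and Critic described above is shown below. The 2-layer LSTM with 256 hidden units, the 2-dimensional action output and the concatenation of the action with the Critic's LSTM output follow the text; feeding the 11-dimensional state per time step over 4 steps (rather than a flattened 44-dimensional input) and the tanh action scaling are assumptions of this sketch.

import torch
import torch.nn as nn

STATE_DIM, HIST_LEN, HIDDEN, ACTION_DIM = 11, 4, 256, 2

class LSTMActor(nn.Module):
    """2-layer LSTM followed by a fully connected layer producing the 2-dim action."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(STATE_DIM, HIDDEN, num_layers=2, batch_first=True)
        self.fc = nn.Linear(HIDDEN, ACTION_DIM)

    def forward(self, state_seq):                 # state_seq: (batch, HIST_LEN, STATE_DIM)
        out, _ = self.lstm(state_seq)
        return torch.tanh(self.fc(out[:, -1]))    # action in [-1, 1] (assumed scaling)

class LSTMCritic(nn.Module):
    """Same LSTM part as the Actor; the action is concatenated with the LSTM output
    and passed to a fully connected layer that outputs the evaluation Q value."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(STATE_DIM, HIDDEN, num_layers=2, batch_first=True)
        self.fc = nn.Linear(HIDDEN + ACTION_DIM, 1)

    def forward(self, state_seq, action):
        out, _ = self.lstm(state_seq)
        return self.fc(torch.cat([out[:, -1], action], dim=-1))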
As a further improvement, the experience replay pool is split into a positive-sample replay pool and a negative-sample replay pool. When the agent samples from the experience pools, it samples from the two pools according to the set sample number and then combines the sampled data for training. This addresses the problem that the proportions of positive and negative samples generated at the beginning of DDPG training are unbalanced, which makes it hard for the agent to sample them evenly and destabilizes the network early in training. A positive sample is a sample in which the intelligent vehicle merges successfully, and a negative sample is a sample in which the merge fails.
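A minimal sketch of the positive/negative-sample experience replay pools with balanced sampling is given below; the pool capacity and the use of Python deques are implementation assumptions.

import random
from collections import deque

class DualReplayPool:
    """Separate pools for positive samples (successful merges) and negative samples
    (failed merges); each mini-batch draws from both pools."""
    def __init__(self, capacity=50_000):
        self.positive = deque(maxlen=capacity)
        self.negative = deque(maxlen=capacity)

    def store(self, transition, merged_successfully):
        (self.positive if merged_successfully else self.negative).append(transition)

    def sample(self, batch_size):
        half = batch_size // 2
        pos = random.sample(self.positive, min(half, len(self.positive)))
        neg = random.sample(self.negative, min(batch_size - len(pos), len(self.negative)))
        batch = pos + neg
        random.shuffle(batch)
        return batch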
An exemplary reinforcement learning model is shown in FIG. 6. The training steps of the reinforcement learning DDPG model incorporating a temporal neural network are essentially the same as those of the standard DDPG, except that the temporal neural network is trained in every training period of the DDPG. One exemplary training procedure comprises:
s1, randomly initializing network parameters theta of an Actor network and a Critic network μ And theta Q And initializing experience pool D and target network parameters theta of the Actor network and the Critic network μ’ And theta Q’ Make θ μ’ And theta μ Identical, let θ Q’ And theta Q The same applies.
S2. Take the drive over the specified distance from the ramp entrance to the merge point as one training period, and set the number of training periods to M.
In each training period the intelligent vehicle performs T single-step training rounds from the ramp entrance to the merge point. Before these rounds, random noise N is initialized for action exploration, and the initial states of X consecutive time steps are acquired as the current state, X being a set value.
S3. In each single-step training round, select an action A_t = {a, θ} according to the current Actor network parameters and the random noise N_t, where t is the current time step, a is the acceleration, and θ is the front-wheel steering angle.
S4. Determine the ramp stage in which the intelligent vehicle is currently located, execute the selected action A_t, obtain the immediate reward R_t and the next state S_{t+1}, and store the state-action transition {S_t, A_t, R_t, S_{t+1}} in the experience pool D.
S5. Randomly sample N historical transitions {S_t, A_t, R_t, S_{t+1}} from the experience replay pool D, N being a set value, and use the Adam optimizer to optimize the parameters θ^Q of Q(s, a; θ^Q), updating the Critic network parameters θ^Q by minimizing the loss function L; the objective function is as follows:
L(θ^Q) = (1/N) Σ_t (y_t − Q(s_t, A_t; θ^Q))²,  with  y_t = r_t + γ Q'(s_{t+1}, μ'(s_{t+1}; θ^{μ'}); θ^{Q'})
where r_t is the immediate reward calculated at step t according to equation (1), γ is the discount factor, y_t is the target Q-value function, N is the number of samples, Q' is the Q value computed with the Actor target network parameters θ^{μ'} and the Critic target network parameters θ^{Q'}, and μ'(s_{t+1}; θ^{μ'}) is the policy function that selects the action for state s_{t+1} under the Actor target network parameters θ^{μ'}.
S6. Use the Adam optimizer to optimize the parameters θ^μ of the Actor network's action policy function μ(s; θ^μ), updating the Actor network parameters θ^μ by maximizing the Q-value estimate from the Critic network. The gradient of the objective function is:
∇_{θ^μ} J ≈ (1/N) Σ_t ∇_a Q(s, a; θ^Q) |_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s; θ^μ) |_{s=s_t}
s7, updating target network parameters of an Actor network and a Critic network:
θ μ′ ←τθ μ +(1-τ)θ μ′
θ Q′ ←τθ Q +(1-τ)θ Q′
wherein: τ is an approximation coefficient, whose value is 0.01.
S8. After completing T single-step training rounds (i.e. t = T), set m = m + 1, i.e. start training in the next period. After M periods have been trained (i.e. m = M), the trained action policy function μ'(s; θ^{μ'}) can be used directly for the merging decisions of the intelligent vehicle.
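A minimal sketch of steps S1–S8, reusing the network and replay-pool sketches above, is given below. The environment interface, the noise process, the discount factor, the learning rates and the batch size are assumptions of this sketch; τ = 0.01 and the use of the Adam optimizer follow the text.

import copy
import torch

GAMMA, TAU, BATCH = 0.99, 0.01, 64       # gamma and batch size assumed; tau = 0.01 as in the text

actor, critic = LSTMActor(), LSTMCritic()                           # sketches from the FIG. 4 section
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)     # S1: targets start equal to online nets
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)         # S6: Adam optimizer (lr assumed)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)         # S5: Adam optimizer (lr assumed)
pool = DualReplayPool()                                              # positive/negative pools from above

def train_period(env, T, noise):
    s = env.reset()                       # S2: one drive from ramp entrance towards the merge point
    episode = []                          # transitions of this period, labelled at the end
    for t in range(T):
        with torch.no_grad():
            a = actor(s.unsqueeze(0)).squeeze(0) + noise.sample()   # S3: action + exploration noise
        s2, r, done, info = env.step(a)                              # S4: execute, observe reward/state
        episode.append((s, a, r, s2, float(done)))
        update_networks()                                            # S5-S7 on a sampled mini-batch
        if done:
            break
        s = s2
    merged = info.get("merged", False)    # hypothetical flag: episode ended in a successful merge
    for tr in episode:                    # label all transitions of the period as positive/negative
        pool.store(tr, merged)

def update_networks():
    batch = pool.sample(BATCH)            # S5: balanced sampling from both pools
    if not batch:
        return
    S  = torch.stack([b[0] for b in batch])
    A  = torch.stack([b[1] for b in batch])
    R  = torch.tensor([[b[2]] for b in batch])
    S2 = torch.stack([b[3] for b in batch])
    D  = torch.tensor([[b[4]] for b in batch])
    with torch.no_grad():
        y = R + GAMMA * (1 - D) * critic_t(S2, actor_t(S2))         # target Q value y_t
    critic_loss = ((critic(S, A) - y) ** 2).mean()                  # S5: minimise the critic loss
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_loss = -critic(S, actor(S)).mean()                        # S6: maximise the Q estimate
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for tgt, net in ((actor_t, actor), (critic_t, critic)):         # S7: soft target updates
        for tp, p in zip(tgt.parameters(), net.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)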
In one embodiment, the trained action policy function μ'(s; θ^{μ'}) is further implemented as a controller for the intelligent vehicle's merging decisions, controlling the longitudinal acceleration a and the front-wheel steering angle θ of the intelligent vehicle so that it merges successfully from the expressway ramp.
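A usage sketch of such a controller, under the same assumptions as the sketches above:

import torch

@torch.no_grad()
def merge_controller(actor, state_history):
    """Map the last X observed states (here a (4, 11) tensor) to (acceleration, steering angle)."""
    action = actor(state_history.unsqueeze(0)).squeeze(0)
    accel, steering = action.tolist()
    return accel, steering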
An exemplary pseudocode description of the training process is shown in Table 1.
Table 1
It can be seen from the training process that, in one training period, the intelligent vehicle either travels from the start to the end of the ramp, is stopped early by events such as a mid-way collision, or merges successfully. In every case it proceeds step by step, and T is chosen so that the intelligent vehicle can reach the destination within T steps of the merging process.
In summary, as can be seen from the above embodiments, the merging decision method of the application controls an intelligent vehicle merging from an entrance ramp onto the expressway main road. The reward function is designed taking into account how the distance of the vehicle's projected position on the main road from the merge point influences a successful merge, which improves the merging success rate and reduces traffic accidents; the influence of the decisions on passenger comfort and vehicle safety during merging is also considered, improving the practicality of the intelligent vehicle. By integrating a temporal neural network into the reinforcement learning DDPG model, the robustness and stability of model training are improved, as is the reliability of the method.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present disclosure may be implemented by means of software plus necessary general purpose hardware, or of course may be implemented by dedicated hardware including application specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. Generally, functions performed by computer programs can be easily implemented by corresponding hardware, and specific hardware structures for implementing the same functions can be varied, such as analog circuits, digital circuits, or dedicated circuits. However, in more cases for the present disclosure, a software program implementation is a better implementation.
Reference throughout this specification to "one embodiment," "another embodiment," "an embodiment," and so forth, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application as broadly described. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with any embodiment, it is intended that such feature, structure, or characteristic be implemented within the scope of the application.
Although the embodiments of the present application have been described above with reference to the accompanying drawings, the present application is not limited to the above-described specific embodiments and application fields, and the above-described specific embodiments are merely illustrative, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous forms of the application without departing from the scope of the application as claimed.

Claims (10)

1. An intelligent vehicle expressway ramp merging decision method based on reinforcement learning, characterized in that:
the intersection of the expressway ramp and the main road is set as the merge point, the ramp and the main road within S1 m before the merge point and S2 m beyond it are set as the control area, and S1 and S2 are set values;
in the control area, the intelligent vehicle is projected onto the target lane of the main road such that the distance from the projection to the merge point equals the distance from the intelligent vehicle to the merge point on the ramp;
the vehicles in front of and behind the projected intelligent vehicle are determined, and their speeds and positions are acquired as environment vehicle information;
the distance from the intelligent vehicle to the merge point and the speed and acceleration of the intelligent vehicle are acquired as intelligent vehicle information;
according to the intelligent vehicle information and the environment vehicle information, the intelligent vehicle adjusts its acceleration and front-wheel steering angle with a reinforcement learning DDPG model, achieving a successful merge step by step.
2. The method according to claim 1, characterized in that:
the rewards obtained by the agents of the reinforcement learning DDPG model at least comprise first rewards, and the calculating steps comprise:
dividing the afflux process into a plurality of stages, and setting an influence factor for each stage;
and judging the stage to which the intelligent vehicle belongs according to the distance from the projection intelligent vehicle to the current point, and calculating rewards by using the influence factors of the stage.
3. The method according to claim 2, characterized in that:
the calculation formula of the first prize is as follows:
wherein: k (k) i I=1, 2, for the influence factor of the i-th phase, n, n is the set total number of phases, w m Is the weight of the afflux location rewards, deltav max Is the maximum allowable speed difference between the intelligent vehicle speed and the average speed of the front vehicle and the rear vehicle, v p1 For the speed of the first vehicle in front of the intelligent vehicle, v f1 For the speed, v, of the first vehicle behind the intelligent vehicle m For intelligent vehicle speed, d m The value range of w is [0,1 ] for the distance from the central point of the intelligent vehicle to the junction point]Wherein 0 represents the same distance gap between the intelligent vehicle and the preceding first vehicle and the distance gap between the intelligent vehicle and the following first vehicle, 1 represents the distance gap between the intelligent vehicle and the preceding first vehicle or the following first vehicle as zero, and w is defined as follows:
wherein: d, d p1 D, the distance from a vehicle in front of the intelligent vehicle to a junction point f1 For the distance l from the vehicle behind the intelligent vehicle to the junction p1 、l m The first vehicle in front of the intelligent vehicle and the vehicle length of the intelligent vehicle respectively.
4. The method according to claim 2, characterized in that:
the rewards earned by the agents of the reinforcement learning DDPG model are the sum of a first reward and a second reward, the second reward being the value of any one or the sum of any two or the sum of any number of the following:
collision punishment and non-collision punishment, safe driving speed punishment, stopping punishment without reaching a designated destination, destination punishment, passenger comfort punishment; wherein:
the collision penalty and non-collision rewards are calculated as follows:
wherein: r is (r) col Representing a vehicle collision penalty, r collision A penalty value set for a collision of the vehicle;
the safe driving speed rewards are calculated as follows:
wherein: r is (r) v (v) Indicating a high vehicle speed incentive at the current vehicle speed v of the intelligent vehicle,for the highest vehicle speed of the intelligent vehicle, r h_s A bonus value for high-speed running of the vehicle;
failing to specify the destination stop penalty:
wherein: r is (r) s Indicating successful arrival of the vehicle at the reward r stop A prize value for successful arrival of the vehicle;
the destination prize is calculated as follows:
wherein: r is (r) a Indicating successful arrival of the vehicle at the reward r arrival A prize value set for successful arrival of the vehicle at the destination;
passenger comfort is calculated as follows:
wherein: w (w) j Weight of punishment rewards for comfort, j max Maximum shock allowed for passenger comfort,for the acceleration derivative, j of the intelligent vehicle m The degree of impact received by the passenger.
5. The method according to claim 1, characterized in that:
the intelligent agent for strengthening the learning DDPG model selects actions through an Actor network, and evaluates the Q value of the selected actions through a Critic network, wherein the Actor network and the Critic network are composed of a time sequence neural network layer and a full connection layer; wherein:
the Action and the time sequence neural network layer in the Critical network are spliced after the Action output by the Actor network is transmitted to the full-connection layer in the Critical network, and the full-connection layer calculates and generates an evaluation Q value for the Action.
6. The method according to claim 5, wherein:
the Actor network and the Critic network both have network parameters and target network parameters, and the two parameters are the same when initialized;
during updating, N pieces of historical state-action pair data are randomly sampled from the experience playback pool D, N is a set value, and the network parameter theta of the Critic network is updated by minimizing the loss function Q Estimation from Critic network by maximizationQ value, updating network parameter theta of Actor network μ
Obtaining target network parameters theta of an Actor network and a Critic network μ′ By τθ μ +(1-τ)θ μ′ Updating target network parameter θ of an Actor network μ’ τ is the approximation coefficient, τθ Q +(1-τ)θ Q′ Updating target network parameter θ for Critic networks Q’
And after the reinforcement learning DDPG model training is finished, the Actor network uses the target network parameters to make decisions on intelligent vehicle import.
7. The method according to claim 6, wherein:
the experience playback pool is divided into a positive sample experience playback pool and a negative sample experience playback pool;
when the intelligent agent samples in the experience pools, respectively and averagely sampling data from two experience playback pools according to the set sample number, and then combining the sampled data together for training;
the positive sample is a sample of successful import of the intelligent vehicle, and the negative sample is a sample of failure of import of the intelligent vehicle.
8. The method according to claim 5, wherein:
the time sequence neural network is any one of LSTM, GRU, bi-LSTM and RNN.
9. An intelligent vehicle, characterized in that: the intelligent vehicle is provided with a system implementing the method of any one of claims 1 to 8, and is controlled by the system so that it merges successfully from the expressway ramp onto the expressway main road.
10. A computer-readable storage medium, characterized in that: it stores a computer program which can be loaded by a processor to execute the method according to any one of claims 1 to 8.
CN202310564555.8A 2023-05-18 2023-05-18 Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning Pending CN116884238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310564555.8A CN116884238A (en) 2023-05-18 2023-05-18 Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310564555.8A CN116884238A (en) 2023-05-18 2023-05-18 Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116884238A true CN116884238A (en) 2023-10-13

Family

ID=88262742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310564555.8A Pending CN116884238A (en) 2023-05-18 2023-05-18 Intelligent vehicle expressway ramp remittance decision method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116884238A (en)


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination