CN113353102B

CN113353102B - Unprotected left-turn driving control method based on deep reinforcement learning

Info

Publication number: CN113353102B
Application number: CN202110773027.4A
Authority: CN
Inventors: 赵敏; 孙棣华; 陈进
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2021-07-08
Filing date: 2021-07-08
Publication date: 2022-11-25
Anticipated expiration: 2041-07-08
Also published as: CN113353102A

Abstract

The invention discloses an unprotected left-turn driving control method based on deep reinforcement learning, which comprises the following steps: 1. establishing a simulation and training environment, wherein the specific method comprises the following steps: 1) Constructing two identical closed road environment simulation scenes; 2) Setting proper simulation running time, and generating any number of unprotected LTAP/OD events; 3) Setting a plurality of straight vehicles and three left-turning vehicle candidate paths; 2. designing a reward function, and processing a complex unprotected LTAP/OD event by adopting the driving skill of a human driver; 3. designing a strategy structure, updating parameters of a deep convolution fuzzy system by using a learning algorithm, and searching an optimal value function; 4. and designing a learning algorithm, and improving training efficiency by adopting data of a human driver and a deep convolution fuzzy system algorithm. The combination of the driving skill of the human driver and the deep convolution fuzzy algorithm effectively improves the interpretability of the deep reinforcement learning algorithm, the error correction capability of the training efficiency and the traffic efficiency of the vehicle.

Description

Unprotected left-turn driving control method based on deep reinforcement learning

Technical Field

The invention belongs to the field of motion control of medium and high-grade automatic driving automobiles, and particularly relates to a method for training an unprotected left-turning control model for generating an automatic driving strategy.

Background

At a crossroad without traffic signals or other stop sign guidance, a straight-going vehicle (SDV) and a left-Turning Vehicle (TV) drive oppositely (LTAP/OD, figure 1), and an unprotected left-turning is efficiently and safely finished, so that the method has strong challenge for an automatic driving vehicle and is also suitable for human drivers. When the existing automatic driving automobile finishes unprotected left turning, the robustness of the algorithm is emphasized more, the manual customization rule is mainly adopted, the over-conservative strategy is often adopted, and the passing efficiency is low although the safety is ensured to a certain extent. Against experienced human drivers, they attempt to "negotiate" with a straight-ahead vehicle in the process of road right competition, primarily through body motions such as steering, braking, and acceleration, in an attempt to quickly complete a left turn.

In the aspect of research of simulating human driving strategies, a deep neural network-based reinforcement learning paradigm is commonly adopted in the industry, and a patent CN110824912B directly obtains available automatic driving strategies by using high-dimensional data; the patent CN112784485A discloses an automatic driving key scene generation method based on reinforcement learning; patent CN108009587B invented a method and apparatus for determining driving strategy based on reinforcement learning and rules. But due to the inexplicability of the deep neural network model, the training efficiency and the error correction capability of the model are greatly limited.

Disclosure of Invention

The invention aims to provide a reinforcement learning method based on a deep convolution fuzzy system, which is used for learning the driving skill of a human driver, improving the traffic efficiency and improving the interpretability of a deep reinforcement learning algorithm.

In order to achieve the above object, the technical scheme of the invention is as follows: an unprotected left-turn driving control method based on deep reinforcement learning is characterized by comprising the following steps:

step (1) establishes a simulation and training environment, and the specific method comprises the following steps:

(1.1) constructing two identical closed road environment simulation scenes;

(1.2) setting a proper simulation running time, and generating any number of unprotected LTAP/OD events;

(1.3) setting a plurality of straight-going vehicles (SDV) and three left-Turning Vehicle (TV) candidate paths;

designing a reward function, and processing a complex unprotected LTAP/OD event by adopting the driving skill of a human driver;

designing a strategy structure, updating parameters of a deep convolution fuzzy system by using a learning algorithm, and searching an optimal value function;

designing a learning algorithm, and improving training efficiency by adopting data of a human driver and a deep convolution fuzzy system algorithm, wherein the specific method comprises the following steps:

(4.1) setting a function Q for recording a learning algorithm;

(4.2) initializing function Q using data of a human driver;

(4.3) obtaining a new value of the function Q through deep convolution fuzzy system operation;

and (4.4) updating the value of the function Q by using deep reinforcement learning to obtain an optimal solution.

In step (1), the unprotected LTAP/OD events, each unprotected LTAP/OD event, are a deep reinforcement learning training round.

In step (2), the reward function functions as follows:

s is _t The state of the environment at the moment t;

a is a _t An action taken by the agent at time t;

c is mentioned ₁ ,c ₂ ,c ₃ ,c ₄ 0 is its weight parameter, where c1=0.5, c2=4, c3=0.5, c4=4, the vehicle maximum speed limit is 17m/s ≈ 60km/h;

the | v ^TV -v ^SDV I is the absolute value of the speed difference between TV and SDV;

the described

Indicating the distance of the TV to the border of the collision area,

i.e. the first bonus function is active before the TV passes the conflict area and the second bonus function is active after the TV passes the conflict areaTV velocity v _TV The larger the traffic efficiency is;

d represents the distance between the centers of gravity of the TV and the SDV, the larger the distance is, the smaller the collision risk is, and when D is less than or equal to 3.5m, a third reward function acts.

In step (2), the driving skills of the human driver include vehicle steering, braking, and accelerating body motion.

In step (3), the membership function of the fuzzy system is A ¹ ,A ² ,…A ^q The mathematical expression of the ith fuzzy subsystem of the ith layer is as follows:

input set corresponding to the fuzzy subsystem

Is selected from the input space of the l-th layer through a sliding window with width m and moving step length s, and the input of the l-th layer is composed of all the outputs of the l-1-th layer, wherein,

then fuzzy system

Can be represented by the following q ^m The bar fuzzy IF-THEN rule constitutes:

if it is not

Is composed of

And is

Is composed of

X is then

Is composed of

Said parameter

Is a fuzzy set

Is the core parameter of the deep convolution fuzzy system;

the value function based on the deep convolution fuzzy system comprises the following input and output data pairs according to the collected data: (x) ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ；y)＝(x ^TV ,y ^TV ,v ^TV ,v ^SDV ,D,a ^SDV Action; value), 7 inputs and 1 output, where x ^TV For the transverse position, y, of a vehicle turning left under the geodetic coordinate system ^TV For the longitudinal position of a vehicle turning left under the geodetic coordinate system, v ^TV Speed of a left-turn vehicle, v ^SDV Is the speed of the straight-driving vehicle, D is the distance between the straight-driving and left-turning vehicles, a ^SDV The acceleration of the straight-ahead vehicle, the action is the control action taken by the intelligent agent, and the value is the value of the action value function; the deep convolution fuzzy system structure is divided into three layers, 9 fuzzy subsystems in total, wherein each fuzzy subsystem

There are 3 inputs, i.e. m =3, and the convolution window is shifted by a step size s =1.

The invention has the beneficial effects that: a deep convolution fuzzy system model is adopted, the non-linear mapping relation between input and output is simulated by utilizing the universal approximation characteristic of a fuzzy system, a high-dimensional input space is processed by adopting a layering mode and a convolution window, and dimension cursing is overcome. By adopting the driving skill of a human driver and the deep convolution fuzzy system algorithm, the interpretability of the deep reinforcement learning algorithm, the error correction capability of the training efficiency and the traffic efficiency of vehicles are improved.

Drawings

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings:

FIG. 1 is a schematic diagram of two cases of successful negotiation between a left-turn vehicle and a straight-ahead vehicle;

FIG. 2 is a schematic diagram of a traffic simulation scenario in Prescan: the method comprises the following steps of (a) schematically representing a double-loop scene, (b) schematically representing a multi-straight-going vehicle scene, and (c) schematically representing three candidate paths;

FIG. 3 is a schematic diagram of an ensemble structure of reinforcement learning based on a deep convolution fuzzy system;

FIG. 4 is a schematic diagram of membership functions of a fuzzy subsystem in a deep convolutional fuzzy system;

fig. 5 is a diagram of a learning paradigm based on a cost function.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.

Example 1: as shown in fig. 1 to 5, the invention provides an unprotected left-turn driving control method based on deep reinforcement learning, in order to create enough unprotected LTAP/OD events in one simulation, two identical closed road loop simulation scenes are constructed, as shown in fig. 2 a. It can be seen that after two vehicles pass through the target intersection of interest (the box location), they will return through the loop, forming a loop. By setting the appropriate simulation runtime, any number of unprotected LTAP/OD events can be obtained. For the training process of deep reinforcement learning, each unprotected LTAP/OD event becomes a round (Episode). Training and testing also requires a scenario of multiple straight-going vehicles (see fig. 2 b) and three candidate paths (see fig. 2 c).

It is desirable for the intelligence to be able to master human-like negotiation skills to handle complex unprotected LTAP/OD events. In particular, it is desirable for the TV to be able to safely complete a left turn while avoiding inefficiencies resulting from overly conservative driving decisions. The reward function is therefore as follows:

wherein, said s _t Is the state of the environment at time t, said a _t For the action taken by the agent at time t, in a first reward function, c ₁ ,c ₂ ,c ₃ ,c ₄ > 0 is its weight parameter, where c1=0.5, c2=4, c3=0.5, c4=4, the vehicle maximum speed limit is 17m/s ≈ 60km/h. First term | v ^TV -v ^SDV The greater the absolute value of the speed difference between the TV and the SDV, the more effective the "opposing vehicles" can be avoided, since a large speed difference means that the two vehicles do not accelerate or decelerate in synchronization. Second item

Represents the distance of the TV to the border of the collision area, and a smaller distance represents a higher efficiency of the TV. If it is not

I.e. before the TV passes through the conflict area, the first bonus function is active, and after the TV passes through the conflict area, the second bonus function is active for pursuing efficient traffic. For the third term, TV velocity v _TV The larger, the largerThe higher the traffic efficiency. The last term is to ensure safety, D represents the distance between the centers of gravity of both the TV and SDV, the greater the distance, the less the risk of collision. When D is less than or equal to 3.5m, which means that two vehicles collide with each other, the third reward function is acted.

The value of each action is examined using a nonlinear function approximator (deep convolutional fuzzy system) as a Critic (criticic) and the action with the maximum value is selected. Therefore, the parameters of the deep convolution fuzzy system are updated by using a learning algorithm, and an optimal value function is searched.

The model structure of the deep convolution fuzzy system has the main modeling idea that the universal approximation characteristic of the fuzzy system is utilized to simulate the nonlinear mapping relation between input and output; a high-dimensional input space is processed by adopting a layering mode and a convolution window, and the problem of regular explosion caused by dimension cursing is solved.

As shown in fig. 3, in the DCFS-based value function, each square represents a fuzzy system. Membership function A ¹ ,A ² ,…A ^q As given by figure 4. The mathematical expression of the ith fuzzy subsystem of the ith layer is as follows:

input set corresponding to the fuzzy subsystem

Is selected from the input space of the l-th layer through a sliding window with width m and moving step length s, and the input of the l-th layer is composed of all the outputs of the l-1-th layer. Wherein, the first and the second end of the pipe are connected with each other,

then fuzzy system

if it is not

Is composed of

And is

Is composed of

X is then

Is composed of

(3)

Wherein the parameters

Is a fuzzy set

The parameters are the core parameters of the deep convolution fuzzy system, and are designed by an online training algorithm introduced in the next step, so that the parameters of the deep convolution fuzzy system have clear physical meanings, which is the reason why the method has interpretability.

According to the collected data, for the value function based on the depth convolution fuzzy system, the input-output data pair is as follows: (x) ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ；y)＝(x ^TV ,y ^TV ,v ^TV ,v ^SDV ,D,a ^SDV Action; value), a deep convolution fuzzy system structure composed of 7 inputs and 1 output is shown in fig. 5, wherein x ^TV For the lateral position, y, of a left-hand vehicle in a geodetic coordinate system ^TV As coordinates of the earthLongitudinal position of a left-turning vehicle, v ^TV Speed of a left-turn vehicle, v ^SDV Is the speed of the straight-driving vehicle, D is the distance between the straight-driving and left-turning vehicles, a ^SDV The action is the control action taken by the agent for the acceleration of the straight-ahead vehicle, and the value is the value of the action value function. In general, the system is divided into three layers, 9 fuzzy subsystems, wherein each fuzzy subsystem

To improve training efficiency, the Q function (i.e., action-state-value function) is initialized with data of a human driver, see pseudo code of algorithm 1 for details. And obtaining a Q function after the initial parameters of the depth convolution fuzzy system, and updating the parameters of the Q function by using an algorithm 2 to obtain an optimal solution.

The above-mentioned embodiments are merely preferred embodiments for fully illustrating the present invention, and the scope of the present invention is not limited thereto. The equivalent substitution or change made by the technical personnel in the technical field on the basis of the invention is all within the protection scope of the invention. The protection scope of the invention is subject to the claims.

Claims

1. An unprotected left-turn driving control method based on deep reinforcement learning is characterized by comprising the following steps:

(1.1) constructing two identical closed road environment simulation scenes;

(4.1) setting a function Q for recording a learning algorithm;

(4.2) initializing function Q using data of a human driver;

(4.4) updating the value of the function Q by using deep reinforcement learning to obtain an optimal solution;

in step (2), the reward function functions are as follows:

s is _t The state of the environment at the moment t;

a is a _t An action taken by the agent at time t;

c is said ₁ ,c ₂ ,c ₃ ,c ₄ 0 is its weight parameter, where c1=0.5, c2=4, c3=0.5, c4=4, the vehicle maximum speed limit is 17m/s ≈ 60km/h;

the above-mentioned

Indicating the distance of the TV to the border of the collision area,

i.e. the first bonus function is active before the TV passes the conflict area and the second bonus function is active after the TV passes the conflict area, the TV speed v ^TV The larger the traffic efficiency is;

2. The deep-reinforcement-learning-based unprotected left-turn driving control method according to claim 1, wherein in step (1), the unprotected LTAP/OD events are each an intensive learning training round.

3. The unprotected left-turn driving control method based on deep reinforcement learning of claim 1, wherein in step (2), the driving technique of the human driver comprises vehicle steering, braking and accelerating vehicle body action.

4. The unprotected left-turn driving control method based on deep reinforcement learning of claim 1, wherein in step (3), the fuzzy system membership function is A ¹ ,A ² ,…A ^q The mathematical expression of the ith fuzzy subsystem of the ith layer is as follows:

input set corresponding to the fuzzy subsystem

then fuzzy system

if it is not

Is composed of

And is

Is composed of

Then x is

y is

Said parameter

Is a fuzzy set

Is the core parameter of the deep convolution fuzzy system;

the value function based on the deep convolution fuzzy system comprises the following input and output data pairs according to the collected data: (x) ₁ ,x ₂ ,x ₃ ,x ₄ ,x ₅ ,x ₆ ,x ₇ ；y)＝(x ^TV ,y ^TV ,v ^TV ,v ^SDV ,D,a ^SDV Action; value), 7 input and 1 output groupsA deep convolution fuzzy system structure where x ^TV For the transverse position, y, of a vehicle turning left under the geodetic coordinate system ^TV For the longitudinal position of a vehicle turning left under the geodetic coordinate system, v ^TV Speed of a left-turn vehicle, v ^SDV Is the speed of the straight-driving vehicle, D is the distance between the straight-driving and left-turning vehicles, a ^SDV The acceleration of the straight-ahead vehicle, the action is the control action taken by the intelligent agent, and the value is the value of the action value function; the deep convolution fuzzy system structure is divided into three layers, 9 fuzzy subsystems in total, wherein each fuzzy subsystem