CN112498354A - Multi-time scale self-learning lane changing method considering personalized driving experience - Google Patents
Multi-time scale self-learning lane changing method considering personalized driving experience
- Publication number
- CN112498354A (application CN202011561553.6A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- data
- time scale
- learning
- driver
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W30/00—Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
- B60W30/18—Propelling the vehicle
- B60W30/18009—Propelling the vehicle related to particular drive situations
- B60W30/18163—Lane change; Overtaking manoeuvres
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
- B60W2050/0028—Mathematical models, e.g. for simulation
- B60W2050/0031—Mathematical model of the vehicle
Landscapes
- Engineering & Computer Science (AREA)
- Automation & Control Theory (AREA)
- Transportation (AREA)
- Mechanical Engineering (AREA)
- Human Computer Interaction (AREA)
- Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention discloses a multi-time scale self-learning lane changing method considering personalized driving experience, comprising the following steps: a first preparation step; a second off-line learning step; and a third on-line operation step. The electric control device controls the host vehicle to drive automatically at level L4 through the multi-time scale self-learning algorithm while learning the driver's driving habits on line; it updates the personalized driving experience data set, the multi-time scale neural network, the multi-time scale self-learning algorithm, the lane change model and the reward function according to those habits, and introduces transition probabilities through the Markov-decision lane change model to capture variation between and within individuals, so that the automatic lane-change control output of the electric control device gradually approaches the driving habits of the host vehicle's driver and the driving experience improves. The invention adopts a learning structure combining an off-line strategy with an on-line strategy, accounting for both generality and particularity, which fits well with the characteristics of L4-level intelligent driving.
Description
Technical Field
The invention belongs to the field of intelligent driving, and particularly relates to a multi-time scale self-learning lane changing method considering personalized driving experience.
Background
In the field of intelligent driving, as vehicles become more intelligent, the intelligent control unit increasingly shares the vehicle's low-level control authority with the driver. An intelligent vehicle can hardly avoid taking authority away from the driver, or may interfere with the driver at critical moments while executing a control strategy meant to serve the driver's interests, creating potential safety hazards. Intelligent vehicles therefore cannot neglect understanding and perceiving the vehicle's highest decision maker, namely the driver.
Advanced driver assistance systems at the present stage already provide a preliminary driving-behavior monitoring function through detection of the driver's state, the vehicle and the surrounding environment.
However, from the viewpoint of human-machine co-driving of intelligent vehicles, the personalized differences between drivers and the different driving states of the same driver are not considered, so it remains difficult to meet the requirements of intelligent vehicle driving.
For example, a lane change system uses radar to detect the distance and relative speed between the host vehicle and the vehicles ahead and around it, and performs a lane change only if these exceed set thresholds (below which a collision risk may arise); otherwise no lane change is performed.
This requires the vehicle manufacturer to conduct extensive experiments to find thresholds that reflect average human driving behaviour and reaction times.
Different drivers, however, observe the relative distance and vehicle speed and choose to slow down or change lanes according to personal, dynamic preferences. An intelligent lane-changing method therefore not only needs to "infer" and "learn" a human's personalized dynamic preferences, but must also continuously "adapt" and "take action", so as to realize a self-learning lane-changing method that considers the personalized driving experience.
Reinforcement learning is a machine learning method that does not depend on an environment model or prior knowledge and can continuously optimize a control strategy through trial and error and delayed returns, providing a feasible approach for developing a personalized self-learning lane change method.
However, reinforcement learning also faces the following challenges in developing a specific application of the personalized self-learning lane-changing method:
first, in general reinforcement learning, performing the same action in the same environmental state yields the same reward. In practice the reward changes dynamically with the driver's individual preference: facing a lane change decision, the driver may take a corresponding action (such as braking and decelerating) or may accelerate to change lanes.
Thus, the rewards earned by reinforcement learning do not always reflect the algorithm's own behavior, but are influenced by the driver's personalized driving decisions.
Furthermore, general reinforcement learning assumes that the reward earned in each step is due to the action performed in that step. In practice the driver's response (on the order of seconds) is not always instantaneous; it depends on personalized dynamic preferences and varies from moment to moment. Evaluating the reward may therefore require several steps until the effect of the performed action is observed and the corresponding reward value is received.
Disclosure of Invention
The invention aims to solve the problem of poor driving experience caused by the lack of consideration of personalized differences in conventional active lane changing, and provides a multi-time scale self-learning lane change method considering the personalized driving experience.
In order to achieve the purpose, the multi-time scale self-learning lane-changing method considering the personalized driving experience is carried out according to the following steps:
the first step is preparation;
establishing an individualized driving experience data set, a multi-time scale neural network, a multi-time scale self-learning algorithm, a Markov decision-based lane change model and a dynamic time-varying reward function considering driver preference in an electric control device of a host vehicle;
the personalized driving experience data set comprises environmental vehicle data, control data and a driver preference metric matrix; the environmental vehicle data and the control data are derived from public data;
the second step is off-line learning;
before a host vehicle is started for the first time, enabling a multi-time scale neural network to read environmental vehicle data and control data in an individualized driving experience data set, and establishing a mapping relation from the environmental vehicle data to the control data;
the third step is on-line operation; the electric control device controls the host vehicle to drive automatically at level L4 through the multi-time scale self-learning algorithm and learns the driver's driving habits on line, and it updates the personalized driving experience data set, the multi-time scale neural network, the multi-time scale self-learning algorithm, the lane change model and the reward function according to those habits, so that the automatic lane-change control output of the electric control device gradually approaches the driving habits of the host vehicle's driver, improving the driving experience.
The environmental vehicle data in the first step includes x_t, y_t, φ_t, Δx_t, Δy_t and Δv_t;
where x_t, y_t denote the longitudinal and lateral positions of the host vehicle, φ_t denotes the yaw rate, Δx_t the distance between the host vehicle and a surrounding vehicle along the x-axis, Δy_t the distance along the y-axis, and Δv_t the speed difference between the host vehicle and the surrounding vehicle;
the control data includes vehicle steering wheel target angle data and vehicle target speed data;
the driver preference metric matrix is given by formula one:
M = [a_x^+, a_x^-, |a_y|, |z_x|, |z_y|] (formula one)
where a_x^+ is the driver discomfort threshold for longitudinal acceleration, a_x^- is the longitudinal deceleration threshold, |a_y| is the lateral acceleration threshold, and |z_x| and |z_y| are the maximum allowed longitudinal and transverse impulse, respectively;
in the second step, the multi-time scale neural network is represented by formula two:
f(x, u | w), with time-scale factors τ_i, τ_σ ~ N(τ_0, σ_0) (formula two)
where f(x, u | w) is the nonlinear function of the system output, x and u are the system state and input respectively, w is the weight of the neural network, and τ_i is a time-scale factor that can vary adaptively; τ_σ obeys the Gaussian normal distribution N, and τ_0, σ_0 are the mean and variance vectors of the Gaussian normal distribution N;
in the second step, the electric control device performs off-line learning according to the individualized driving experience data set through the multi-time scale neural network, and obtains a mapping relation from environmental vehicle data to control data in an off-line state;
in the third step, the multi-time scale self-learning algorithm is as follows:
3.1, initializing parameters; the electric control device initializes the discount parameter γ, the learning step length α, the exploration parameter ε, the multi-time scale parameters t_s ≤ t_a ≤ t_l, and the events Q(s, a) = 1 (lane change) and Q(s, a) = 0 (no lane change);
t_s is the sampling period of the vehicle-mounted sensor and also the period at which the lane change model acquires information; the vehicle-mounted sensor acquires the environmental vehicle data; t_a is the time interval from information acquisition to the output of control data by the electric control device through the multi-time scale self-learning algorithm, the control data comprising vehicle steering-wheel target angle data and vehicle target speed data; t_l is the learning and update period of the multi-time scale self-learning algorithm;
3.2, observing the vehicle state; the lane change model in the electric control device acquires the current environmental vehicle data through the vehicle-mounted sensor connected to the electric control device, obtaining the current environmental state s; the multi-time scale self-learning algorithm in the electric control device acquires the current environmental state s through the lane change model;
3.3, executing a control action; every t_a, the multi-time scale self-learning algorithm in the electric control device selects and outputs control data a according to the environmental state s using a greedy algorithm; if the driver does not intervene, the host vehicle is controlled to change lanes or keep its lane according to the vehicle steering-wheel target angle data and the vehicle target speed data in control data a; if the driver intervenes, the host vehicle is controlled to change lanes or keep its lane according to the driver's operation.
During the third step, a learning and updating operation is performed once every t_l;
τ is the current time, s' is the current environmental state, and s is the environmental state at the time the control data a was applied (i.e., at τ − t_l); for all times t_i between τ − t_l and τ, the driver preference metric matrix M is evaluated to obtain the reward function R(s_{t_i}, a_{t_i}) at each time t_i; t_l is the learning update period of the multi-time scale self-learning algorithm, t_l > t_a,
The electric control device correspondingly updates the individualized driving experience data set on line through a sixth formula according to the data, wherein the sixth formula is as follows:
in the formula six, R is a formula five, namely a reward function; wherein s is the environmental state of the host vehicle expressed by formula three; a is the control data that actually occurs; alpha represents the learning step length, and gamma is a discount factor;
each time learning and updating are performed (every t_l), the control data a that maximizes the Q value is selected with probability 1 − ε, or an action learned off-line is selected at random with probability ε as the newly learned action; wherein ε represents the transition probability, 0 < ε < 0.5;
formula three is stored in the electric control device: s_t = [x_t, y_t, φ_t, Δx_t, Δy_t, Δv_t]; in formula three, s_t denotes the environmental state of the host vehicle, with the road width direction as the y-axis (the lateral direction) and the road length direction as the x-axis (the longitudinal direction); x_t, y_t denote the longitudinal and lateral positions of the host vehicle, φ_t the yaw rate, Δx_t the distance between the host vehicle and a surrounding vehicle along the x-axis, Δy_t the distance along the y-axis, and Δv_t the speed difference between the host vehicle and the surrounding vehicle;
if there are several surrounding vehicles around the host vehicle, Δx_t, Δy_t, Δv_t are column vectors whose elements correspond to the different surrounding vehicles; formula three constitutes the lane change model;
the dynamic time-varying reward function considering driver preference in the first step is defined as follows:
the expression for performing action a is defined as formula four:
a_t = [δ_t, v_t] (formula four); in formula four, δ_t is the steering wheel angle and v_t is the speed;
the electric control device stores the reward function expressed by formula five;
in formula five, M* is the safety boundary constructed from each parameter range in formula one, and s* and a* are the reference state and execution action corresponding to formula one;
and finally, the electric control device trains the multi-time scale neural network in the second step by using the new strategy data learned by the multi-time scale self-learning algorithm, and updates the off-line strategy.
The invention provides a multi-time scale self-learning lane changing method considering personalized driving experience, offering intelligent vehicle users a personalized and comfortable driving experience.
The invention has the following advantages:
(1) the invention adopts a learning structure combining an off-line strategy with an on-line strategy: the off-line strategy learned from historical data reflects the general lane-changing behavior of personalized driving experience, while the two-dimensional steering-and-speed actions generated at each step of on-line self-learning account for the particularity of the actual lane-change conditions; this learning architecture covers both generality and particularity and fits well with the characteristics of L4-level intelligent driving;
(2) the invention defines a driver preference metric matrix M (formula one) consisting of the user's preferred lateral and longitudinal accelerations and maximum allowable impulse region, representing the combined result of driver preference and the perceived risk level of the dynamic motion in a given environment, and providing an acceptable comfort standard for the personalized driving experience;
(3) the Markov-decision lane change model introduces transition probabilities to capture variation between individuals and within the same individual;
(4) a dynamic time-varying reward function considering driver preference is given and applied, so that the self-learning lane-change method can be evaluated in real time against actual operating conditions to obtain the self-learning strategy; meanwhile, the multi-time scale self-learning lane-change method runs state acquisition, action evaluation and action execution on separate time scales, which better matches the actual lane-change decision behavior of a human driver.
As the third step is repeatedly executed and the models are continuously updated, the lane changing method of the invention comes ever closer to the driver's driving habits through continued use, bringing the driver a better automated driving experience.
Drawings
FIG. 1 is a functional block diagram of a multi-time scale self-learning lane-change method of the present invention that considers a personalized driving experience;
fig. 2 is a schematic lane change of the present invention.
Detailed Description
As shown in fig. 1 and 2, the multi-time scale self-learning lane-changing method considering the personalized driving experience of the present invention is performed according to the following steps:
the first step is preparation;
establishing an individualized driving experience data set, a multi-time scale neural network, a multi-time scale self-learning algorithm, a Markov decision-based lane change model and a dynamic time-varying reward function considering driver preference in an electric control device of a host vehicle; the electric control device of the host vehicle is an in-vehicle ECU of the host vehicle.
The personalized driving experience data set comprises environmental vehicle data, control data and a driver preference metric matrix; the environmental vehicle data and the control data are derived from public data;
the second step is off-line learning;
before a host vehicle is started for the first time, enabling a multi-time scale neural network to read environmental vehicle data and control data in an individualized driving experience data set, and establishing a mapping relation from the environmental vehicle data to the control data;
the third step is on-line operation; the electric control device controls the host vehicle to drive automatically at level L4 through the multi-time scale self-learning algorithm and learns the driver's driving habits on line (i.e., it learns the control data the driver outputs under specific environmental vehicle data), and it updates the personalized driving experience data set, the multi-time scale neural network, the multi-time scale self-learning algorithm, the lane change model and the reward function according to those habits, so that the automatic lane-change control output of the electric control device gradually approaches the driving habits of the host vehicle's driver, improving the driving experience.
The environmental vehicle data in the first step includes x_t, y_t, φ_t, Δx_t, Δy_t and Δv_t;
where x_t, y_t denote the longitudinal and lateral positions of the host vehicle, φ_t denotes the yaw rate, Δx_t the distance between the host vehicle and a surrounding vehicle along the x-axis (i.e., in the road length direction), Δy_t the distance along the y-axis (i.e., in the road width direction), and Δv_t the speed difference between the host vehicle and the surrounding vehicle;
the control data includes vehicle steering wheel target angle data and vehicle target speed data;
the driver preference metric matrix is given by formula one:
M = [a_x^+, a_x^-, |a_y|, |z_x|, |z_y|] (formula one)
where a_x^+ is the driver discomfort threshold for longitudinal acceleration, a_x^- is the longitudinal deceleration threshold, |a_y| is the lateral acceleration threshold, and |z_x| and |z_y| are the maximum allowed longitudinal and lateral impulse (i.e., rate of change of acceleration), respectively;
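By way of illustration only (the patent specifies no code, and all names and threshold values below are hypothetical assumptions), a minimal Python sketch of the driver preference metric matrix of formula one and its use as a comfort test might look like this:

```python
# Hypothetical driver preference metric matrix M (formula one):
# comfort thresholds for longitudinal acceleration/deceleration,
# lateral acceleration, and longitudinal/lateral impulse (jerk).
M = {
    "ax_acc_max": 2.0,   # m/s^2, longitudinal acceleration discomfort threshold
    "ax_dec_max": 3.0,   # m/s^2, longitudinal deceleration threshold
    "ay_max": 1.5,       # m/s^2, lateral acceleration threshold
    "zx_max": 0.9,       # m/s^3, maximum allowed longitudinal jerk
    "zy_max": 0.6,       # m/s^3, maximum allowed lateral jerk
}

def within_comfort(ax, ay, zx, zy, m=M):
    """True if the measured dynamics stay inside the driver's comfort
    boundary defined by the preference metric matrix."""
    return (-m["ax_dec_max"] <= ax <= m["ax_acc_max"]
            and abs(ay) <= m["ay_max"]
            and abs(zx) <= m["zx_max"]
            and abs(zy) <= m["zy_max"])
```

A manoeuvre whose measured dynamics fail this test would fall outside the safety boundary M* used later by the reward function of formula five.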
in the second step, the multi-time scale neural network is represented by formula two:
f(x, u | w), with time-scale factors τ_i, τ_σ ~ N(τ_0, σ_0) (formula two)
where f(x, u | w) is the nonlinear function of the system output, x and u are the system state and input respectively, w is the weight of the neural network, and τ_i is a time-scale factor that can vary adaptively; τ_σ obeys the Gaussian normal distribution N, and τ_0, σ_0 are the mean and variance vectors of the Gaussian normal distribution N;
in the second step, the electric control device performs off-line learning according to the individualized driving experience data set through the multi-time scale neural network, and obtains a mapping relation from environmental vehicle data to control data in an off-line state;
in formula two, all neurons process information according to the newly arriving connection inputs and their previous internal states; the time-scale factor τ_i determines the weights for retaining previous information versus processing newly arriving information.
The invention introduces learnable parameters into the time-scale factors, providing adaptively variable time scales. With the initial mean and variance vectors of the Gaussian normal distribution N set to τ_0 = 0 and σ_0 = 1, time sequences are drawn from N so as to map the actual multidimensional state signal at different time scales; off-line strategy learning is then performed with the individualized driving experience data set, obtaining the mapping from environmental vehicle data to control data in the off-line state.
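As an illustrative sketch only, one common reading of such a multi-time scale network is a leaky-integrator recurrent network whose per-neuron time constants are drawn from the Gaussian N(τ_0, σ_0); the patent does not give the exact dynamics, so the structure, dimensions and sampling below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tau0, sigma0 = 0.0, 1.0                    # initial mean/variance (formula two)
n_hidden = 32
# Time-scale factors sampled from N(tau0, sigma0); exponentiated and shifted
# so every time constant exceeds 1 (an assumption, not from the patent).
tau = np.exp(rng.normal(tau0, sigma0, n_hidden)) + 1.0

W_in  = rng.normal(0, 0.1, (n_hidden, 6))        # 6-dim state s_t (formula three)
W_rec = rng.normal(0, 0.1, (n_hidden, n_hidden))
W_out = rng.normal(0, 0.1, (2, n_hidden))        # 2-dim action a_t = [delta_t, v_t]

def step(u, x):
    """One leaky-integrator update: neurons with large tau retain their
    previous state (slow time scale), small tau track new input (fast)."""
    pre = W_in @ x + W_rec @ np.tanh(u)
    u_new = (1.0 - 1.0 / tau) * u + (1.0 / tau) * pre
    return u_new, W_out @ np.tanh(u_new)         # control output [steer, speed]

u = np.zeros(n_hidden)
u, a = step(u, np.zeros(6))                      # one forward pass of off-line training
```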
In the third step:
standard online reinforcement learning algorithms require immediate evaluation of the performed actions made before the next iteration is performed, whereas the driving preferences of each driver are different for lane-change behavior; the control data output by different drivers is different in the face of the same environmental vehicle data.
Factors such as the degree of attention, the reaction time, and the surrounding environment determine that the control data (the vehicle steering wheel target angle data and the vehicle target speed data) of the control action to be taken based on the same environmental vehicle data are different every lane change even for the same driver. Thus, the evaluation of the execution action may require several recursive steps (looping several times) until the effect of the operation is observed and a corresponding reward is obtained. Therefore, the invention provides a multi-time scale self-learning lane-changing algorithm as follows:
the multi-time scale self-learning algorithm comprises the following steps:
3.1, initializing parameters; the electric control device initializes the discount parameter γ, the learning step length α, the exploration parameter ε, the multi-time scale parameters t_s ≤ t_a ≤ t_l, and the events Q(s, a) = 1 (lane change) and Q(s, a) = 0 (no lane change);
t_s is the sampling period of the vehicle-mounted sensor and also the period at which the lane change model acquires information; the vehicle-mounted sensor acquires the environmental vehicle data; t_a is the time interval from information acquisition to the output of control data by the electric control device through the multi-time scale self-learning algorithm, the control data comprising vehicle steering-wheel target angle data and vehicle target speed data; t_l is the learning and update period of the multi-time scale self-learning algorithm;
various sensors on the vehicle, such as a speed sensor, a distance sensor, an angle sensor and the like, are all in the prior art, and various environmental vehicle data, including environmental data and state data of the vehicle itself, can be provided for the vehicle-mounted ECU for realizing automatic driving, and details are not described.
3.2, observing the vehicle state; the lane change model in the electric control device acquires the current environmental vehicle data through the vehicle-mounted sensor connected to the electric control device, obtaining the current environmental state s, which comprises the current environmental vehicle data; the multi-time scale self-learning algorithm in the electric control device acquires the current environmental state s through the lane change model;
3.3, executing a control action; every t_a, the multi-time scale self-learning algorithm in the electric control device selects and outputs control data a according to the environmental state s using a greedy algorithm; if the driver does not intervene, the host vehicle is controlled to change lanes or keep its lane according to the vehicle steering-wheel target angle data and the vehicle target speed data in control data a; if the driver intervenes, the host vehicle is controlled to change lanes or keep its lane according to the driver's operation.
The greedy algorithm is a conventional algorithm and will not be described in detail.
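Purely as a sketch of steps 3.1 to 3.3 (the period values, the Q-table representation and the sense()/actuate() helpers are hypothetical, not from the patent), the on-line loop with ε-greedy selection over the two events a = 1 (lane change) and a = 0 (no lane change) might be organized as:

```python
import random
from collections import defaultdict

t_s, t_a, t_l = 0.02, 0.1, 0.3      # hypothetical periods with t_s <= t_a <= t_l
gamma, alpha, eps = 0.9, 0.1, 0.1   # discount, learning step, exploration (3.1)
Q = defaultdict(float)              # Q(s, a); state discretization not shown

def select_action(s):
    """Epsilon-greedy over a = 1 (lane change) and a = 0 (no lane change)."""
    if random.random() < eps:
        return random.choice([0, 1])                 # explore / off-line action
    return max((0, 1), key=lambda a: Q[(s, a)])      # exploit: greedy on Q

def control_cycle(sense, actuate, driver_action=None):
    s = sense()                      # 3.2: observe state via vehicle-mounted sensor
    a = select_action(s)             # 3.3: every t_a, output control data a
    if driver_action is not None:    # driver intervention always takes priority
        a = driver_action
    actuate(a)                       # steering-wheel target angle + target speed
    return s, a
```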
During the third step, a learning and updating operation is performed once every t_l;
τ is the current time, s' is the current environmental state, and s is the environmental state at the time the control data a was applied (i.e., at τ − t_l); for all times t_i between τ − t_l and τ, the driver preference metric matrix M is evaluated to obtain the reward function R(s_{t_i}, a_{t_i}) at each time t_i; t_l is the learning update period of the multi-time scale self-learning algorithm, with t_l > t_a; t_l is determined by the actual values of t_s and t_a, and is about three times t_a;
the electric control device correspondingly updates the individualized driving experience data set on line through formula six:
Q(s, a) ← Q(s, a) + α[R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)] (formula six)
in formula six, R is the reward function of formula five; s is the environmental state of the host vehicle expressed by formula three; a is the control data actually applied; α represents the learning step length and γ is the discount factor. As the third step is repeatedly executed and the models are continuously updated, the lane changing method of the invention comes ever closer to the driver's driving habits through continued use, bringing the driver a better automated driving experience.
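For illustration (a hedged sketch under the assumption that formula six is the standard Q-learning update suggested by the α, γ and max-Q terms above), the learning update performed every t_l could replay the transitions recorded since the last update:

```python
def learn_update(Q, transitions, reward_fn, alpha=0.1, gamma=0.9):
    """Every t_l: apply the formula-six style update to each recorded
    (s, a, s') tuple, using the time-varying reward of formula five."""
    for s, a, s_next in transitions:
        r = reward_fn(s, a)                                   # R(s_ti, a_ti)
        td_target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    transitions.clear()                                       # start a new window
```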
When modeling the driver, the host vehicle and the environment, the differences between individuals and between different states within an individual must be considered in order to accurately capture the relevant real-time state information, take corresponding actions, evaluate the executed actions and update the state. For this purpose, a Markov-based lane change model that takes transition probabilities into account is proposed.
Each time a learning and updating operation is performed (every t_l), the actually applied control data a is selected for each state s using an ε-greedy algorithm;
specifically, the control data a that maximizes the Q value is selected with probability 1 − ε, or an action learned off-line is selected at random with probability ε as the newly learned action; wherein ε represents the transition probability, 0 < ε < 0.5;
ε expresses the trade-off between exploiting what has been learned (probability 1 − ε) and exploring what has not (probability ε); for conservative reasons ε is generally chosen small (0 < ε < 0.5).
Formula three is stored in the electric control device: s_t = [x_t, y_t, φ_t, Δx_t, Δy_t, Δv_t]; in formula three, s_t denotes the environmental state of the host vehicle, with the road width direction as the y-axis (the lateral direction) and the road length direction as the x-axis (the longitudinal direction); x_t, y_t denote the longitudinal and lateral positions of the host vehicle, φ_t the yaw rate, Δx_t the distance between the host vehicle and a surrounding vehicle along the x-axis, Δy_t the distance along the y-axis, and Δv_t the speed difference between the host vehicle and the surrounding vehicle;
if there are several surrounding vehicles around the host vehicle, Δx_t, Δy_t, Δv_t are column vectors whose elements correspond to the different surrounding vehicles; formula three constitutes the lane change model;
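For illustration (helper names assumed), the formula-three state for one or several surrounding vehicles might be assembled as follows, with Δx_t, Δy_t, Δv_t becoming column vectors when several vehicles are present:

```python
import numpy as np

def build_state(x, y, phi, surrounding):
    """Formula three: s_t = [x_t, y_t, phi_t, dx_t, dy_t, dv_t].
    `surrounding` is a list of (dx, dy, dv) tuples, one per nearby vehicle."""
    dx = np.array([v[0] for v in surrounding])
    dy = np.array([v[1] for v in surrounding])
    dv = np.array([v[2] for v in surrounding])
    return np.concatenate(([x, y, phi], dx, dy, dv))

# Host at x=120 m, y=3.5 m, yaw rate 0.01; two surrounding vehicles.
s = build_state(120.0, 3.5, 0.01, [(25.0, -3.5, -2.0), (-15.0, 0.0, 1.5)])
```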
the markov decision state transition depends on the driver preference metric matrix M and the environmental state of the host vehicle, with state changes between the two different states being linked by transition probabilities. The transition probabilities are used for capturing variation among the varying individuals and in the same individual, the unknown transition probabilities are updated by adopting a multi-time scale self-learning algorithm, and corresponding actions are taken (lane change is 1 or lane change is not 1);
the dynamic time-varying reward function considering driver preference in the first step is defined as follows:
the goal of reinforcement learning is to continually generate strategies that guide the system from a "bad" state to a "good" state; "bad" and "good" are evaluated by assigning a reward value to every executed action in every state. The expression for performing action a is defined as formula four:
a_t = [δ_t, v_t] (formula four); in formula four, δ_t is the steering wheel angle (in degrees) and v_t is the speed (in km/h);
in the usual reinforcement learning setting the reward function is invariant (static); in the present invention, however, the reward function varies over time with the driver's personalized driving preference (to decide lane change or no lane change). For this reason, the electric control device stores the reward function expressed by formula five;
in formula five, M* is the safety boundary constructed from each parameter range of the driver preference metric matrix defined by formula one, and s* and a* are the reference state and execution action corresponding to formula one;
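Since formula five itself is not reproduced here, the following is only a hedged sketch of one plausible shape for such a reward: a hard penalty outside the safety boundary M* (checked with the comfort test sketched earlier) and otherwise a reward that grows as the executed state and action approach the driver's reference s* and a*; the penalty value and distance measure are assumptions:

```python
import numpy as np

def reward(s, a, s_ref, a_ref, dynamics, within_comfort):
    """Hypothetical time-varying reward in the spirit of formula five.
    `dynamics` = (ax, ay, zx, zy) measured over the evaluation window;
    s_ref, a_ref are the driver's reference state and action (formula one)."""
    if not within_comfort(*dynamics):          # outside safety boundary M*
        return -10.0                           # hard penalty (assumed value)
    err = (np.linalg.norm(np.asarray(s) - np.asarray(s_ref))
           + np.linalg.norm(np.asarray(a) - np.asarray(a_ref)))
    return -float(err)                         # closer to the reference, higher
```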
and finally, the electric control device trains the multi-time scale neural network in the second step by using the new strategy data learned by the multi-time scale self-learning algorithm, and updates the off-line strategy, so that the off-line strategy is closer to the driving habit of the driver.
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art will understand that modifications and equivalents may be made without departing from the spirit and scope of the invention, which is defined by the appended claims.
Claims (3)
1. The multi-time scale self-learning lane changing method considering the personalized driving experience is characterized by comprising the following steps of:
the first step is preparation;
establishing an individualized driving experience data set, a multi-time scale neural network, a multi-time scale self-learning algorithm, a Markov decision-based lane change model and a dynamic time-varying reward function considering driver preference in an electric control device of a host vehicle;
the personalized driving experience data set comprises environmental vehicle data, control data and a driver preference metric matrix; the environmental vehicle data and the control data are derived from public data;
the second step is off-line learning;
before a host vehicle is started for the first time, enabling a multi-time scale neural network to read environmental vehicle data and control data in an individualized driving experience data set, and establishing a mapping relation from the environmental vehicle data to the control data;
the third step is on-line operation; the electric control device controls the host vehicle to automatically drive at the level of L4 through the multi-time scale self-learning algorithm and learns the driving habits of the driver on line, and updates the personalized driving experience data set, the multi-time scale neural network, the multi-time scale self-learning algorithm, the lane change model and the reward function according to the driving habits of the driver, so that the automatic control output of the electric control device to the lane change is gradually close to the driving habits of the driver of the host vehicle, and the driving experience of the driver is improved.
2. The multi-time scale self-learning lane-changing method taking personalized driving experience into account of claim 1, characterized in that:
the environmental vehicle data in the first step includes xt,yt,φt,Δxt,ΔytAnd Δ vt;
Wherein x ist,ytIndicates the longitudinal and lateral positions, phi, of the host vehicletRepresenting yaw rate, Δ xtRepresenting the distance between the host vehicle and the surrounding vehicle along the x-axis, aytRepresenting the distance between the host vehicle and the surrounding vehicle along the y-axis, avtRepresenting a speed difference between the host vehicle and the surrounding vehicle;
the control data includes vehicle steering wheel target angle data and vehicle target speed data;
the driver preference metric matrix is given by formula one:
M = [a_x^+, a_x^-, |a_y|, |z_x|, |z_y|] (formula one)
where a_x^+ is the driver discomfort threshold for longitudinal acceleration, a_x^- is the longitudinal deceleration threshold, |a_y| is the lateral acceleration threshold, and |z_x| and |z_y| are the maximum allowed longitudinal and transverse impulse, respectively;
in the second step, the multi-time scale neural network is represented by formula two:
f(x, u | w), with time-scale factors τ_i, τ_σ ~ N(τ_0, σ_0) (formula two)
where f(x, u | w) is the nonlinear function of the system output, x and u are the system state and input respectively, w is the weight of the neural network, and τ_i is a time-scale factor that can vary adaptively; τ_σ obeys the Gaussian normal distribution N, and τ_0, σ_0 are the mean and variance vectors of the Gaussian normal distribution N;
in the second step, the electric control device performs off-line learning according to the individualized driving experience data set through the multi-time scale neural network, and obtains a mapping relation from environmental vehicle data to control data in an off-line state;
in the third step, the multi-time scale self-learning algorithm comprises the following steps:
3.1, initializing parameters; the electric control device initializes the discount parameter γ, the learning step length α, the exploration parameter ε, the multi-time scale parameters t_s ≤ t_a ≤ t_l, and the events Q(s, a) = 1 (lane change) and Q(s, a) = 0 (no lane change);
t_s is the sampling period of the vehicle-mounted sensor and also the period at which the lane change model acquires information; the vehicle-mounted sensor acquires the environmental vehicle data; t_a is the time interval from information acquisition to the output of control data by the electric control device through the multi-time scale self-learning algorithm, the control data comprising vehicle steering-wheel target angle data and vehicle target speed data; t_l is the learning and update period of the multi-time scale self-learning algorithm;
3.2, observing the vehicle state; the lane change model in the electric control device acquires the current environmental vehicle data through the vehicle-mounted sensor connected to the electric control device, obtaining the current environmental state s; the multi-time scale self-learning algorithm in the electric control device acquires the current environmental state s through the lane change model;
3.3, executing a control action; every t_a, the multi-time scale self-learning algorithm in the electric control device selects and outputs control data a according to the environmental state s using a greedy algorithm; if the driver does not intervene, the host vehicle is controlled to change lanes or keep its lane according to the vehicle steering-wheel target angle data and the vehicle target speed data in control data a; if the driver intervenes, the host vehicle is controlled to change lanes or keep its lane according to the driver's operation.
3. The multi-time scale self-learning lane-changing method taking personalized driving experience into account of claim 2, characterized in that: during the third step, a learning and updating operation is performed once every t_l;
τ is the current time, s' is the current environmental state, and s is the environmental state at the time the control data a was applied (i.e., at τ − t_l); for all times t_i between τ − t_l and τ, the driver preference metric matrix M is evaluated to obtain the reward function R(s_{t_i}, a_{t_i}) at each time t_i; t_l is the learning update period of the multi-time scale self-learning algorithm, t_l > t_a,
The electric control device correspondingly updates the individualized driving experience data set on line through formula six:
Q(s, a) ← Q(s, a) + α[R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)] (formula six)
in formula six, R is the reward function of formula five; s is the environmental state of the host vehicle expressed by formula three; a is the control data actually applied; α represents the learning step length and γ is the discount factor;
each time learning and updating are performed (every t_l), the control data a that maximizes the Q value is selected with probability 1 − ε, or an action learned off-line is selected at random with probability ε as the newly learned action; wherein ε represents the transition probability, 0 < ε < 0.5;
formula three is stored in the electric control device: s_t = [x_t, y_t, φ_t, Δx_t, Δy_t, Δv_t]; in formula three, s_t denotes the environmental state of the host vehicle, with the road width direction as the y-axis (the lateral direction) and the road length direction as the x-axis (the longitudinal direction); x_t, y_t denote the longitudinal and lateral positions of the host vehicle, φ_t the yaw rate, Δx_t the distance between the host vehicle and a surrounding vehicle along the x-axis, Δy_t the distance along the y-axis, and Δv_t the speed difference between the host vehicle and the surrounding vehicle;
if there are several surrounding vehicles around the host vehicle, Δx_t, Δy_t, Δv_t are column vectors whose elements correspond to the different surrounding vehicles; formula three constitutes the lane change model;
the dynamic time-varying reward function considering driver preference in the first step is defined as follows:
the expression for performing action a is defined as formula four:
a_t = [δ_t, v_t] (formula four); in formula four, δ_t is the steering wheel angle and v_t is the speed;
the electric control device stores the reward function expressed by formula five;
in formula five, M* is the safety boundary constructed from each parameter range in formula one, and s* and a* are the reference state and execution action corresponding to formula one;
and finally, the electric control device trains the multi-time scale neural network in the second step by using the new strategy data learned by the multi-time scale self-learning algorithm, and updates the off-line strategy.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011561553.6A CN112498354B (en) | 2020-12-25 | 2020-12-25 | Multi-time scale self-learning lane changing method considering personalized driving experience |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011561553.6A CN112498354B (en) | 2020-12-25 | 2020-12-25 | Multi-time scale self-learning lane changing method considering personalized driving experience |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112498354A true CN112498354A (en) | 2021-03-16 |
CN112498354B CN112498354B (en) | 2021-11-12 |
Family
ID=74922076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011561553.6A Active CN112498354B (en) | 2020-12-25 | 2020-12-25 | Multi-time scale self-learning lane changing method considering personalized driving experience |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112498354B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113978470A (en) * | 2021-12-13 | 2022-01-28 | 郑州轻工业大学 | On-line rapid estimation method for friction force between tire and road surface |
CN114013443A (en) * | 2021-11-12 | 2022-02-08 | 哈尔滨工业大学 | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning |
CN114023108A (en) * | 2021-11-02 | 2022-02-08 | 河北工业大学 | Mixed traffic flow lane change model and lane change simulation method |
CN115018016A (en) * | 2022-08-03 | 2022-09-06 | 苏州大学 | Method and system for identifying lane changing intention of manually-driven vehicle |
CN115195757A (en) * | 2022-09-07 | 2022-10-18 | 郑州轻工业大学 | Electric bus starting driving behavior modeling and recognition training method |
CN115512540A (en) * | 2022-09-20 | 2022-12-23 | 中国第一汽车股份有限公司 | Information processing method and device for vehicle, storage medium and processor |
FR3137642A1 (en) * | 2022-07-05 | 2024-01-12 | Psa Automobiles Sa | Method and device for controlling a system for semi-automatically changing the lane of a vehicle as a function of a maximum value of a dynamic parameter |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109624986A (en) * | 2019-03-01 | 2019-04-16 | 吉林大学 | A kind of the study cruise control system and method for the driving style based on pattern switching |
CN110733506A (en) * | 2019-10-17 | 2020-01-31 | 上海舵敏智能科技有限公司 | Lane changing method and apparatus for unmanned vehicle |
EP3650297A1 (en) * | 2018-11-08 | 2020-05-13 | Bayerische Motoren Werke Aktiengesellschaft | Method and apparatus for determining information related to a lane change of a target vehicle, method and apparatus for determining a vehicle comfort metric for a prediction of a driving maneuver of a target vehicle and computer program |
US20200189597A1 (en) * | 2018-12-12 | 2020-06-18 | Visteon Global Technologies, Inc. | Reinforcement learning based approach for sae level-4 automated lane change |
CN111483468A (en) * | 2020-04-24 | 2020-08-04 | 广州大学 | Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning |
- 2020-12-25: application CN202011561553.6A filed in China; granted as patent CN112498354B (active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3650297A1 (en) * | 2018-11-08 | 2020-05-13 | Bayerische Motoren Werke Aktiengesellschaft | Method and apparatus for determining information related to a lane change of a target vehicle, method and apparatus for determining a vehicle comfort metric for a prediction of a driving maneuver of a target vehicle and computer program |
US20200189597A1 (en) * | 2018-12-12 | 2020-06-18 | Visteon Global Technologies, Inc. | Reinforcement learning based approach for sae level-4 automated lane change |
CN109624986A (en) * | 2019-03-01 | 2019-04-16 | 吉林大学 | A kind of the study cruise control system and method for the driving style based on pattern switching |
CN110733506A (en) * | 2019-10-17 | 2020-01-31 | 上海舵敏智能科技有限公司 | Lane changing method and apparatus for unmanned vehicle |
CN111483468A (en) * | 2020-04-24 | 2020-08-04 | 广州大学 | Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning |
Non-Patent Citations (3)
Title |
---|
HEYE HUANG等: "A probabilistic risk assessment framework considering lane-changing behavior interaction", 《SCIENCE CHINA(INFORMATION SCIENCES)》 * |
YANG Wei et al.: "Vehicle active collision-avoidance warning system incorporating recognition of the preceding vehicle's driving intention", China Sciencepaper *
WANG Qidong et al.: "Bounded control for lane departure prevention with driver intention recognition", China Journal of Highway and Transport *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114023108A (en) * | 2021-11-02 | 2022-02-08 | 河北工业大学 | Mixed traffic flow lane change model and lane change simulation method |
CN114013443A (en) * | 2021-11-12 | 2022-02-08 | 哈尔滨工业大学 | Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning |
CN113978470A (en) * | 2021-12-13 | 2022-01-28 | 郑州轻工业大学 | On-line rapid estimation method for friction force between tire and road surface |
CN113978470B (en) * | 2021-12-13 | 2024-01-12 | 郑州轻工业大学 | On-line quick estimation method for friction force between tire and road surface |
FR3137642A1 (en) * | 2022-07-05 | 2024-01-12 | Psa Automobiles Sa | Method and device for controlling a system for semi-automatically changing the lane of a vehicle as a function of a maximum value of a dynamic parameter |
CN115018016A (en) * | 2022-08-03 | 2022-09-06 | 苏州大学 | Method and system for identifying lane changing intention of manually-driven vehicle |
CN115195757A (en) * | 2022-09-07 | 2022-10-18 | 郑州轻工业大学 | Electric bus starting driving behavior modeling and recognition training method |
CN115195757B (en) * | 2022-09-07 | 2023-08-04 | 郑州轻工业大学 | Electric bus starting driving behavior modeling and recognition training method |
CN115512540A (en) * | 2022-09-20 | 2022-12-23 | 中国第一汽车股份有限公司 | Information processing method and device for vehicle, storage medium and processor |
Also Published As
Publication number | Publication date |
---|---|
CN112498354B (en) | 2021-11-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112498354B (en) | Multi-time scale self-learning lane changing method considering personalized driving experience | |
EP2065842B1 (en) | Adaptive driver assistance system with robust estimation of object properties | |
US6230111B1 (en) | Control system for controlling object using pseudo-emotions and pseudo-personality generated in the object | |
CN109927725A (en) | A kind of self-adaption cruise system and implementation method with driving style learning ability | |
EP3750765A1 (en) | Methods, apparatuses and computer programs for generating a machine-learning model and for generating a control signal for operating a vehicle | |
CN112347567A (en) | Vehicle intention and track prediction method | |
CN111332362B (en) | Intelligent steer-by-wire control method integrating individual character of driver | |
CN112677982B (en) | Vehicle longitudinal speed planning method based on driver characteristics | |
CN112109708A (en) | Adaptive cruise control system considering driving behaviors and control method thereof | |
US20210331663A1 (en) | Electric vehicle control system | |
JP7415471B2 (en) | Driving evaluation device, driving evaluation system, in-vehicle device, external evaluation device, and driving evaluation program | |
CN113655794A (en) | Multi-vehicle cooperative control method based on robust model predictive control | |
CN110103960B (en) | Vehicle self-adaptive cruise control method and system and vehicle | |
Selvaraj et al. | An ML-aided reinforcement learning approach for challenging vehicle maneuvers | |
CN113184040B (en) | Unmanned vehicle line-controlled steering control method and system based on steering intention of driver | |
US20210213977A1 (en) | Nearby Driver Intent Determining Autonomous Driving System | |
CN110271557B (en) | Vehicle user feature recognition system | |
CN114503133A (en) | Information processing apparatus, information processing method, and program | |
CN114269632A (en) | Method and device for estimating a mechanically fed steering wheel torque on a steering wheel of a steering system of a motor vehicle | |
CN115848369A (en) | Personalized self-adaptive cruise system based on deep reinforcement learning and control method thereof | |
US11738804B2 (en) | Training a vehicle to accommodate a driver | |
CN115649197A (en) | Automatic driving control method based on driver characteristics and storage medium | |
CN111413974B (en) | Automobile automatic driving motion planning method and system based on learning sampling type | |
CN112835362B (en) | Automatic lane change planning method and device, electronic equipment and storage medium | |
Guo et al. | Optimal design of a driver assistance controller based on surrounding vehicle’s social behavior game model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |