CN115257789A - Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment - Google Patents
- Publication number: CN115257789A
- Application number: CN202211070522.XA
- Authority: CN (China)
- Prior art keywords: driving, collision, substep, strategy, vehicle
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W30/00—Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units, or advanced driver assistance systems for ensuring comfort, stability and safety or drive control systems for propelling or retarding the vehicle
- B60W30/08—Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
- B60W2050/0028—Mathematical models, e.g. for simulation
- B60W2050/0029—Mathematical model of the driver
- B60W2300/00—Indexing codes relating to the type of vehicle
- B60W2300/12—Trucks; Load vehicles
Abstract
The invention discloses a decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment. First, an urban traffic scene is constructed on a hardware-in-the-loop driving simulation platform, and safe driving behaviors under different driving conditions and working conditions are simulated and collected. Second, the safe driving behavior of human drivers is imitated with a dataset aggregation algorithm in an imitation-learning manner. Finally, the lateral anti-collision strategy is further learned with a proximal policy optimization algorithm in an unsupervised-learning manner, realizing high-level decision output of lateral anti-collision driving behaviors of the commercial vehicle. The method imitates the safe driving behavior of human drivers, considers the influence of factors such as visual blind areas and traffic-participant types on driving safety, provides a more reasonable and effective anti-collision driving strategy for large commercial vehicles, and realizes lateral anti-collision driving decisions for commercial vehicles in the urban low-speed environment.
Description
Technical Field
The invention relates to a driving decision-making method for commercial vehicles, and in particular to a decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment; it belongs to the technical field of automobile safety.
Background
Among commercial-vehicle traffic accidents in urban environments, the proportion caused by visual blind areas is the highest. Owing to factors such as the long vehicle body, high driving position, large inner-outer wheel path difference and large right-turn radius of a commercial vehicle, a crescent-shaped dynamic visual blind area forms when the vehicle turns, especially to the right, and pedestrians and non-motor vehicles inside this blind area are easily struck or even run over. The right side of a commercial vehicle is therefore one of the most dangerous of all visual blind areas and the main region where serious safety accidents such as side collision and run-over occur. In an urban traffic environment with many types and dense numbers of traffic participants, especially when the vehicle runs at low speed (starting, turning right, etc.), avoiding lateral collisions caused by the visual blind areas of commercial vehicles has become a core problem for ensuring road traffic and transportation safety.
If the driver can be warned before a collision or run-over accident occurs, and reminded to decelerate, steer, or take similar action, the frequency of traffic accidents caused by visual blind areas can be greatly reduced, or the damage they cause mitigated. Researching an effective and reliable lateral anti-collision driving decision method for commercial vehicles in an urban low-speed environment with mixed motorized and non-motorized traffic therefore plays an important role in reducing the frequency of lateral vehicle collisions and improving road traffic safety.
Generally, although existing methods can provide a degree of early warning, they still have shortcomings in the effectiveness and reliability of lateral collision avoidance, and they do not address lateral anti-collision driving strategies that give concrete driving suggestions such as driving speed and steering. In particular, an effective and reliable decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment is lacking.
Disclosure of Invention
Purpose of the invention: aiming at large commercial vehicles such as buses, trucks and urban logistics vehicles, the invention provides a lateral anti-collision driving decision method for commercial vehicles in an urban low-speed environment, so as to realize lateral anti-collision driving decisions and ensure driving safety. The method imitates the safe driving behavior of human drivers, considers the influence of driving conditions, visual blind areas, traffic-participant types and other factors on driving safety, and can provide a more reasonable and effective anti-collision driving strategy for large commercial vehicles, further guaranteeing their driving safety. Moreover, the method requires no complex vehicle-dynamics equations or body parameters, its calculation is simple and clear, it can output the lateral anti-collision decision strategy of a large commercial vehicle in real time, and the sensors used are inexpensive, which facilitates large-scale deployment.
Technical scheme: to realize the purpose of the invention, the adopted technical scheme is a decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment. First, an urban traffic scene is constructed on a hardware-in-the-loop driving simulation platform, and safe driving behaviors under different driving conditions and working conditions are simulated and collected. Second, the safe driving behavior of human drivers is imitated with the dataset aggregation algorithm in an imitation-learning manner. Finally, the lateral anti-collision strategy is further learned with the proximal policy optimization algorithm in an unsupervised-learning manner, realizing high-level decision output of lateral anti-collision driving behaviors of the commercial vehicle. The method considers the influence of visual blind areas and traffic-participant types on driving safety and provides a more reasonable and effective anti-collision driving strategy for large commercial vehicles. It specifically comprises the following three steps:
the method comprises the following steps: urban traffic scene construction by using driving simulation platform
To reduce the frequency of commercial-vehicle side-collision accidents caused by factors such as visual blind areas and to improve driving safety, the invention provides a decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment, applicable to the following scene: the commercial vehicle runs in an urban low-speed environment, other traffic participants (motor vehicles, non-motor vehicles or pedestrians) are present on its left or right side, and an effective and reliable lateral anti-collision driving strategy is provided to the driver in order to avoid a lateral collision accident.
According to this scene, first, an urban traffic scene covering straight roads, curves and intersections is constructed on the driving simulation platform, and highly randomized traffic flows and traffic participants are set. Second, multiple drivers control the commercial vehicle with a driving simulator (steering wheel, accelerator and brake pedal), and safe driving behaviors are collected under 8 driving conditions: lane change, lane keeping, car following, left turn, right turn, acceleration, deceleration and constant speed. Finally, a safe-driving-behavior database D is constructed from the collected safe driving behaviors.
Step two: simulation of safe driving behavior of driver by using imitation learning method
Dataset Aggregation (DAgger) is an advanced behavior-cloning method: it actively selects strategies from the safe-driving-behavior database, matches the safe driving behavior of human drivers more easily in subsequent training, and has stronger imitation-learning capability. The invention therefore uses the DAgger algorithm to imitate the safe driving behavior of human drivers. The safe-driving-behavior database D continuously aggregates a new dataset D_i at each time step i. The specific training process is as follows:
substep 1: initialize the parameter φ;
substep 2: initialize the policy π;
substep 3: run a loop of N time steps, each iteration comprising substeps 3.1 to 3.5, specifically:
substep 3.1: update the policy using:
π_i = β_i·π* + (1 − β_i)·π̂_i (1)
where π_i denotes the policy at the ith iteration, π* the expert policy, β_i the soft-update parameter at the ith iteration, and π̂_i the learned policy at the ith iteration;
substep 3.2: sample expert trajectories with π_i;
substep 3.3: build the dataset D_i = {(S_t, π*(S_t))} composed of the states visited by π_i and the actions given by the expert, where S_t denotes the state space at time t;
substep 3.4: aggregate the datasets: D ← D ∪ D_i;
substep 3.5: train the policy π̂_{i+1} on dataset D, where π̂_{i+1} denotes the learned policy at iteration i + 1;
Step three: further learning collision avoidance strategies using unsupervised learning methods
In actual driving decision tasks, driving decisions based on imitation learning lack sufficient generalization ability and can hardly handle, effectively and accurately, driving conditions not covered by the safe-driving-behavior database. To further improve the effectiveness and reliability of the lateral collision-avoidance decision, the decision network must be trained further. Deep reinforcement learning, used here as an unsupervised learning method, acquires an understanding of the traffic environment through continuous exploration and trial and error, and the reward fed back by the environment guides the improvement of the policy network so as to maximize the return. The proximal policy optimization algorithm draws on the trust-region policy optimization algorithm and, by using first-order optimization, strikes a new balance among sampling efficiency, algorithm performance, and the complexity of implementation and debugging. The invention therefore constructs the anti-collision decision model with the proximal policy optimization algorithm and trains it on the basis of step two.
First, the lateral collision-avoidance decision problem of the commercial vehicle is converted into a Markov decision process under a certain reward function, described as (S, A, P, R), where S is the state space, A the driving action, P the state-transition probability arising from the uncertainty of target-vehicle motion, and R the reward function. Second, the basic parameters of the Markov decision process are defined, specifically:
(1) Establishing a state space
First, a state space is constructed from the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants:
S_t = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s, Δd_1, Δv_1, …, Δd_6, Δv_6] (2)
where p_x, p_y denote the lateral and longitudinal position of the ego vehicle, in meters; v_x, v_y its lateral and longitudinal velocity, in meters per second; a_x, a_y its lateral and longitudinal acceleration, in meters per second squared; and θ_s its heading angle, in degrees. Δd_j, Δv_j denote the relative distance (meters) and relative velocity (meters per second) between the ego vehicle and the jth surrounding traffic participant, where j = 1, 2, 3, 4, 5, 6 indexes the traffic participants around the vehicle (front, rear, left-front, left-rear, right-front and right-rear). Since the number of traffic participants in a real traffic scene is not fixed, when the sensors observe only i (i < 6) traffic participants, the last 6 − i rows of the state space are zero-filled.
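A minimal sketch of assembling this state vector with zero-filled rows for unobserved participants; the function name `build_state` and the tuple layout are illustrative assumptions, not part of the patent.

```python
import numpy as np

def build_state(ego, participants, max_participants=6):
    """Assemble the state space of Eq. (2): ego motion plus relative
    distance/velocity rows for up to six surrounding traffic participants.
    Rows for unobserved participants are zero-filled.

    ego:          (p_x, p_y, v_x, v_y, a_x, a_y, theta_s)
    participants: list of (rel_distance_m, rel_velocity_mps) tuples
    """
    rel = np.zeros((max_participants, 2))      # zero-fill by default
    for j, (d, v) in enumerate(participants[:max_participants]):
        rel[j] = (d, v)
    return np.concatenate([np.asarray(ego, dtype=float), rel.ravel()])
```

With two observed participants, the remaining four rows stay zero, matching the fixed-size input a decision network expects.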
(2) Establishing an action space
To output high-level driving decisions, the invention defines the action space as discrete lateral and longitudinal actions:
A_t = [a_1, a_2, a_3, a_4, a_5, a_6] (3)
where A_t denotes the action space at time t; a_1, a_2, a_3 denote turning left, turning right and going straight, and a_4, a_5, a_6 denote accelerating, decelerating and keeping the speed constant.
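The discrete action space of Eq. (3) can be represented as an enumeration; the names and index assignment below are illustrative only.

```python
from enum import IntEnum

class Action(IntEnum):
    """Discrete action space A_t of Eq. (3); names are illustrative."""
    TURN_LEFT = 0    # a_1: lateral action, turn left
    TURN_RIGHT = 1   # a_2: lateral action, turn right
    GO_STRAIGHT = 2  # a_3: lateral action, keep straight
    ACCELERATE = 3   # a_4: longitudinal action, speed up
    DECELERATE = 4   # a_5: longitudinal action, slow down
    KEEP_SPEED = 5   # a_6: longitudinal action, hold speed constant
```

An `IntEnum` keeps the actions usable both as readable names and as integer indices into a policy network's output layer.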
(3) Establishing a reward function
To quantitatively evaluate the quality of an anti-collision strategy, the invention establishes an anti-collision reward function that considers the influence of traffic-participant type on driving safety:
where R_t denotes the reward function at time t, and x_min_1, x_min_2 denote lateral safety-distance thresholds in meters; in the invention, x_min_1 = 2 and x_min_2 = 2.5.
In addition, negative feedback is applied to decisions that cause a side collision: when the output decision strategy results in a side collision, 50 is subtracted from the reward obtained at the current moment.
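The reward shaping described above can be illustrated with the sketch below. Only the stated ingredients are grounded in the text: the two lateral safety-distance thresholds (2 m and 2.5 m) and the −50 side-collision penalty. The linear encroachment penalty and the rule that the larger threshold applies to vulnerable road users are assumptions for illustration; the patent's exact piecewise form of Eq. (4) is not reproduced here.

```python
def reward(lateral_distance, vulnerable, collided,
           x_min_1=2.0, x_min_2=2.5, collision_penalty=-50.0):
    """Illustrative anti-collision reward.

    lateral_distance: lateral gap to the nearest participant, in meters
    vulnerable:       True for pedestrians / non-motor vehicles (assumption:
                      the larger threshold x_min_2 applies to them)
    collided:         True if the decision led to a side collision
    """
    r = 0.0
    threshold = x_min_2 if vulnerable else x_min_1
    if lateral_distance < threshold:
        # assumed shaping: penalty grows linearly with encroachment depth
        r -= (threshold - lateral_distance)
    if collided:
        r += collision_penalty          # the -50 negative feedback
    return r
```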
Next, the constructed anti-collision decision model is trained, specifically comprising the following substeps:
substep 1: initialize the policy parameters θ_0 and the value-function parameters φ_0;
substep 2: run a loop of T time steps, each iteration comprising substeps 2.1 to 2.5, specifically:
substep 2.1: run the policy π_k = π(θ_k) in the environment;
substep 2.4: update the policy using:
θ_{k+1} = argmax_θ Ê_t[ min( r_t(θ)·Â^{π_k}(S_t, a_t), clip(r_t(θ), 1 − ε, 1 + ε)·Â^{π_k}(S_t, a_t) ) ], with r_t(θ) = π_θ(a_t | S_t) / π_{θ_k}(a_t | S_t)
where θ_{k+1} denotes the policy-network parameters at step k + 1; ε is a hyperparameter; π_θ denotes the policy network with parameters θ; clip(·) is a truncation function that truncates the probability ratio r_t(θ) to [1 − ε, 1 + ε]; τ denotes a hyperparameter determining the magnitude of the soft update; argmax(·) denotes the variable maximizing the objective; and Â^{π_k}(S_t, a_t) denotes the advantage value of the state-action pair.
substep 2.5: update the value function using:
φ_{k+1} = argmin_φ Ê_t[ (V_φ(S_t) − R̂_t)² ]
where φ_{k+1} denotes the value-function parameters at step k + 1, V_φ(S_t) denotes the value function in state S_t, and R̂_t denotes the return.
Finally, once the anti-collision decision model has been trained, inputting the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants into the model outputs driving suggestions such as acceleration, deceleration and lane change, realizing effective and reliable lateral anti-collision driving decisions for large commercial vehicles.
Beneficial effects: compared with general driving decision methods, the method of the invention is more effective and reliable, specifically:
(1) The method provided by the invention can simulate the safe driving behavior of a human driver, provides a more reasonable and safe lateral anti-collision decision strategy for large-scale commercial vehicles, realizes the lateral anti-collision driving decision of the commercial vehicles in the low-speed environment of the city, and can ensure the running safety of the commercial vehicles.
(2) The method provided by the invention comprehensively considers the influence of factors such as driving conditions, visual blind areas, traffic participant types and the like on driving safety, sets a refined reward function aiming at different traffic participant types, realizes the lateral anti-collision driving decision under different driving conditions, and further improves the effectiveness and reliability of the decision.
(3) The method provided by the invention does not need to consider complex vehicle dynamics equations and vehicle body parameters, the calculation method is simple and clear, the lateral anti-collision strategy of the large-scale commercial vehicle can be output in real time, and the used sensor has low cost and is convenient for large-scale popularization.
Drawings
FIG. 1 is a technical roadmap for the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
The invention provides a decision-making method for lateral anti-collision driving in an urban low-speed environment, aimed at large commercial vehicles such as buses, trucks and urban logistics vehicles. First, an urban traffic scene is constructed on a hardware-in-the-loop driving simulation platform, and safe driving behaviors under different driving conditions and working conditions are simulated and collected. Second, the safe driving behavior of human drivers is imitated with a dataset aggregation algorithm in an imitation-learning manner. Finally, the lateral anti-collision strategy is further learned with a proximal policy optimization algorithm in an unsupervised-learning manner, realizing high-level decision output of lateral anti-collision driving behaviors of the commercial vehicle. The method imitates the safe driving behavior of human drivers, considers the influence of factors such as visual blind areas and traffic-participant types on driving safety, and provides a more reasonable and effective anti-collision driving strategy for large commercial vehicles. The technical route of the invention is shown in FIG. 1, and the specific steps are as follows:
the method comprises the following steps: urban traffic scene construction by using driving simulation platform
To reduce the frequency of commercial-vehicle side-collision accidents caused by factors such as visual blind areas and to improve driving safety, the invention provides a decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment, applicable to the following scene: the commercial vehicle runs in an urban low-speed environment, other traffic participants (motor vehicles, non-motor vehicles or pedestrians) are present on its left or right side, and an effective and reliable lateral anti-collision driving strategy is provided to the driver in order to avoid a lateral collision accident.
According to this scene, first, an urban traffic scene covering straight roads, curves and intersections is constructed on the driving simulation platform, and highly randomized traffic flows and traffic participants are set. Second, multiple drivers control the commercial vehicle with a driving simulator (steering wheel, accelerator and brake pedal), and safe driving behaviors are collected under 8 driving conditions: lane change, lane keeping, car following, left turn, right turn, acceleration, deceleration and constant speed. Finally, a safe-driving-behavior database D is constructed from the collected safe driving behaviors.
Step two: simulation of safe driving behavior of driver by using imitation learning method
Dataset Aggregation (DAgger) is an advanced behavior-cloning method: it actively selects strategies from the safe-driving-behavior database, matches the safe driving behavior of human drivers more easily in subsequent training, and has stronger imitation-learning capability. The invention therefore uses the DAgger algorithm to imitate the safe driving behavior of human drivers. The safe-driving-behavior database D continuously aggregates a new dataset D_i at each time step i. The specific training process is as follows:
substep 1: initialize the parameter φ;
substep 2: initialize the policy π;
substep 3: run a loop of N time steps, each iteration comprising substeps 3.1 to 3.5, specifically:
substep 3.1: update the policy using:
π_i = β_i·π* + (1 − β_i)·π̂_i (1)
where π_i denotes the policy at the ith iteration, π* the expert policy, β_i the soft-update parameter at the ith iteration, and π̂_i the learned policy at the ith iteration;
substep 3.2: sample expert trajectories with π_i;
substep 3.3: build the dataset D_i = {(S_t, π*(S_t))} composed of the states visited by π_i and the actions given by the expert, where S_t denotes the state space at time t;
substep 3.4: aggregate the datasets: D ← D ∪ D_i;
substep 3.5: train the policy π̂_{i+1} on dataset D, where π̂_{i+1} denotes the learned policy at iteration i + 1;
substep 4: finally, return the learned policy π̂_{N+1} at iteration N + 1.
Step three: further learning of collision avoidance strategies using unsupervised learning methods
In actual driving decision tasks, driving decisions based on imitation learning lack sufficient generalization ability and can hardly handle, effectively and accurately, driving conditions not covered by the safe-driving-behavior database. To further improve the effectiveness and reliability of the lateral collision-avoidance decision, the decision network must be trained further. Deep reinforcement learning, used here as an unsupervised learning method, acquires an understanding of the traffic environment through continuous exploration and trial and error, and the reward fed back by the environment guides the improvement of the policy network so as to maximize the return. The proximal policy optimization algorithm draws on the trust-region policy optimization algorithm and, by using first-order optimization, strikes a new balance among sampling efficiency, algorithm performance, and the complexity of implementation and debugging. The invention therefore constructs the anti-collision decision model with the proximal policy optimization algorithm and trains it on the basis of step two.
First, the lateral collision-avoidance decision problem of the commercial vehicle is converted into a Markov decision process under a certain reward function, described as (S, A, P, R), where S is the state space, A the driving action, P the state-transition probability arising from the uncertainty of target-vehicle motion, and R the reward function. Second, the basic parameters of the Markov decision process are defined, specifically:
(1) Establishing a state space
First, a state space is constructed from the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants:
S_t = [p_x, p_y, v_x, v_y, a_x, a_y, θ_s, Δd_1, Δv_1, …, Δd_6, Δv_6] (2)
where p_x, p_y denote the lateral and longitudinal position of the ego vehicle, in meters; v_x, v_y its lateral and longitudinal velocity, in meters per second; a_x, a_y its lateral and longitudinal acceleration, in meters per second squared; and θ_s its heading angle, in degrees. Δd_j, Δv_j denote the relative distance (meters) and relative velocity (meters per second) between the ego vehicle and the jth surrounding traffic participant, where j = 1, 2, 3, 4, 5, 6 indexes the traffic participants around the vehicle (front, rear, left-front, left-rear, right-front and right-rear). Since the number of traffic participants in a real traffic scene is not fixed, when the sensors observe only i (i < 6) traffic participants, the last 6 − i rows of the state space are zero-filled.
(2) Establishing an action space
To output high-level driving decisions, the invention defines the action space as discrete lateral and longitudinal actions:
A_t = [a_1, a_2, a_3, a_4, a_5, a_6] (3)
where A_t denotes the action space at time t; a_1, a_2, a_3 denote turning left, turning right and going straight, and a_4, a_5, a_6 denote accelerating, decelerating and keeping the speed constant.
(3) Establishing a reward function
To quantitatively evaluate the quality of an anti-collision strategy, the invention establishes an anti-collision reward function that considers the influence of traffic-participant type on driving safety:
where R_t denotes the reward function at time t, and x_min_1, x_min_2 denote lateral safety-distance thresholds in meters; in the invention, x_min_1 = 2 and x_min_2 = 2.5.
In addition, negative feedback is applied to decisions that cause a side collision: when the output decision strategy results in a side collision, 50 is subtracted from the reward obtained at the current moment.
Next, the constructed anti-collision decision model is trained, specifically comprising the following substeps:
substep 1: initialize the policy parameters θ_0 and the value-function parameters φ_0;
substep 2: run a loop of T time steps, each iteration comprising substeps 2.1 to 2.5, specifically:
substep 2.1: run the policy π_k = π(θ_k) in the environment;
substep 2.4: update the policy using:
θ_{k+1} = argmax_θ Ê_t[ min( r_t(θ)·Â^{π_k}(S_t, a_t), clip(r_t(θ), 1 − ε, 1 + ε)·Â^{π_k}(S_t, a_t) ) ], with r_t(θ) = π_θ(a_t | S_t) / π_{θ_k}(a_t | S_t)
where θ_{k+1} denotes the policy-network parameters at step k + 1; ε is a hyperparameter; π_θ denotes the policy network with parameters θ; clip(·) is a truncation function that truncates the probability ratio r_t(θ) to [1 − ε, 1 + ε]; τ denotes a hyperparameter determining the magnitude of the soft update; argmax(·) denotes the variable maximizing the objective; and Â^{π_k}(S_t, a_t) denotes the advantage value of the state-action pair.
substep 2.5: update the value function using:
φ_{k+1} = argmin_φ Ê_t[ (V_φ(S_t) − R̂_t)² ]
where φ_{k+1} denotes the value-function parameters at step k + 1, V_φ(S_t) denotes the value function in state S_t, and R̂_t denotes the return.
Finally, once the anti-collision decision model has been trained, inputting the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants into the model outputs driving suggestions such as acceleration, deceleration and lane change, realizing effective and reliable lateral anti-collision driving decisions for large commercial vehicles.
Claims (1)
1. A decision-making method for lateral anti-collision driving of commercial vehicles in an urban low-speed environment, comprising: first, constructing an urban traffic scene on a hardware-in-the-loop driving simulation platform, and simulating and collecting safe driving behaviors under different driving conditions and working conditions; second, imitating the safe driving behavior of the driver with a dataset aggregation algorithm in an imitation-learning manner; finally, further learning a lateral anti-collision strategy with a proximal policy optimization algorithm in an unsupervised-learning manner, providing an anti-collision driving strategy for large commercial vehicles and realizing lateral anti-collision driving decisions for commercial vehicles in an urban low-speed environment; the method being characterized in that:
the method comprises the following steps: urban traffic scene construction by using driving simulation platform
The commercial vehicle travels in an urban low-speed environment, with other traffic participants, including motor vehicles, non-motor vehicles or pedestrians, present on its left or right side;
according to the scenes described above, first, a driving simulation platform is used to construct urban traffic scenes covering straight roads, curved roads and intersections, with highly randomized traffic flows and traffic participants; second, multiple drivers control the commercial vehicle through a driving simulator equipped with a steering wheel, an accelerator pedal and a brake pedal, and safe driving behaviors are collected under 8 driving conditions: lane changing, lane keeping, car following, left turning, right turning, acceleration, deceleration and constant speed; finally, a safe-driving-behavior database D is constructed from the collected behaviors;
step two: simulating safe driving behavior of driver by using imitation learning method
The safe driving behaviors of human drivers are imitated using a dataset-aggregation algorithm; the safe-driving-behavior database D continuously aggregates a new dataset D_i at each iteration i; the specific training process is as follows:
substep 1: initializing a parameter phi;
substep 2: initializing a strategy pi;
substep 3: performing a loop of N time steps, each loop comprising sub-steps 3.1 to 3.5, in particular:
substep 3.1: the strategy is updated using the following equation:
where π_i denotes the policy at iteration i, π* denotes the expert policy, β_i denotes the soft-update parameter of the policy at iteration i, and π̂_i denotes the trained policy at iteration i;
substep 3.2: sample expert trajectories using π_i;
substep 3.3: collect the dataset D_i = {(S_t, π*(S_t))} of states visited by π_i and the corresponding actions given by the expert, where S_t denotes the state space at time t;
substep 3.4: aggregate the datasets: D ← D ∪ D_i;
Substep 3.5: train the policy π̂_{i+1} on the aggregated dataset D, where π̂_{i+1} denotes the trained policy at iteration i+1;
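Substeps 1 through 3.5 can be sketched as a small DAgger-style loop. This is only an illustrative toy, not the patent's code: the expert, the training routine and the 1-D "state" are assumptions standing in for the driving simulator and the real policy network.

```python
import random

# Toy expert: brake (action 4) when the gap is small, else keep speed (action 5).
def expert_policy(state):
    return 4 if state < 2.0 else 5

# Stand-in for supervised training of pi_hat_{i+1} on the aggregated dataset D.
def train_on(dataset):
    acts = [a for _, a in dataset]
    majority = max(set(acts), key=acts.count)
    return lambda s: majority

def dagger(n_iters=3, beta0=1.0):
    D = []                                   # safe-driving-behavior database
    learner = lambda s: 5                    # Substep 2: initial policy
    for i in range(n_iters):
        beta = beta0 * (0.5 ** i)            # Substep 3.1: mix expert and learner
        for _ in range(10):                  # Substep 3.2: sample trajectories
            s = random.uniform(0.0, 5.0)
            acting = expert_policy(s) if random.random() < beta else learner(s)
            # (acting would drive the simulator; here we only log the state)
            D.append((s, expert_policy(s)))  # Substep 3.3: expert labels states
        learner = train_on(D)                # Substeps 3.4-3.5: aggregate, retrain
    return D, learner
```

The key DAgger idea is visible in the loop: states are gathered under the learner's own distribution, but labels always come from the expert, so the aggregated dataset corrects the learner's compounding errors.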
Step three: further learning of collision avoidance strategies using unsupervised learning methods
A collision-avoidance decision model is constructed using a proximal policy optimization algorithm and trained on the basis of Step two; first, the lateral collision-avoidance decision problem of the commercial vehicle is converted into a Markov decision process under a given reward function, described as (S, A, P, R), where S is the state space, A is the driving action, P denotes the state-transition probability caused by the uncertainty of the target vehicle's motion, and R is the reward function; second, the basic elements of the Markov decision process are defined, specifically:
(1) Establishing a state space
First, the state space is constructed from the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants:
where S_t denotes the state space at time t; p_x and p_y denote the lateral and longitudinal positions of the ego vehicle, in meters; v_x and v_y denote its lateral and longitudinal velocities, in meters per second; a_x and a_y denote its lateral and longitudinal accelerations, in meters per second squared; θ_s denotes the heading angle of the ego vehicle, in degrees; the relative distance and relative velocity between the ego vehicle and the j-th surrounding traffic participant are given in meters and meters per second respectively, where j = 1, 2, 3, 4, 5, 6 indexes the surrounding traffic participants ahead, at left front, at left rear, behind, at right rear and at right front, respectively; considering that the number of traffic participants in an actual traffic scene is not fixed, when the sensors observe only i (i < j) traffic participants, the last j − i rows of the state space are zero-filled;
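The state construction with zero-padding can be sketched as follows. Function and variable names are illustrative assumptions; the real state layout in the patent is given only as an image:

```python
# Hedged sketch: ego-vehicle states plus (relative distance, relative velocity)
# rows for up to 6 surrounding participants; when only i < 6 are observed,
# the last 6 - i rows are zero-filled, keeping the state dimension fixed.

def build_state(ego, observed, j_max=6):
    """ego: [p_x, p_y, v_x, v_y, a_x, a_y, theta_s];
    observed: list of (delta_d, delta_v) pairs, len(observed) <= j_max."""
    rows = list(observed) + [(0.0, 0.0)] * (j_max - len(observed))
    # flatten into a single fixed-length feature vector: 7 + 2 * j_max entries
    return ego + [x for pair in rows for x in pair]
```

A fixed-length vector is what lets a single policy network handle scenes with a varying number of traffic participants.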
(2) Establishing an action space
Defining an action space as discrete lateral and longitudinal actions for outputting advanced driving decisions;
A_t = [a_1, a_2, a_3, a_4, a_5, a_6] (3)
where A_t denotes the action space at time t; a_1, a_2 and a_3 denote turning left, turning right and going straight, respectively; a_4, a_5 and a_6 denote accelerating, decelerating and keeping the speed unchanged, respectively;
(3) Establishing a reward function
In order to quantitatively evaluate the merits of the collision-avoidance strategy, a collision-avoidance reward function is established that considers the influence of the traffic-participant type on driving safety:
in the formula, R t Reward function, x, indicating time t min_1 ,x min_2 Representing the lateral safety distance threshold in meters, in the present invention, x min_1 =2,x min_2 =2.5;
In addition, negative feedback is applied to decisions that cause a side collision: when the output decision strategy causes a side collision, 50 is subtracted from the reward value obtained at the current moment;
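The stated reward elements can be sketched as below. The exact reward shape in the patent is given only as an image, so this is an assumed illustrative form that uses only the quantities the text defines: the thresholds x_min_1 = 2 m and x_min_2 = 2.5 m, the participant-type dependence, and the −50 side-collision penalty.

```python
# Hedged sketch of a type-aware lateral collision-avoidance reward.
# The threshold choice per participant type and the shape below the threshold
# are assumptions, not the patent's exact formula.

def collision_reward(lateral_gap, participant_is_vulnerable, collided,
                     x_min_1=2.0, x_min_2=2.5):
    # a larger safety margin is assumed for vulnerable road users
    threshold = x_min_2 if participant_is_vulnerable else x_min_1
    # positive feedback above the threshold, penalty growing with the intrusion
    reward = 1.0 if lateral_gap >= threshold else -(threshold - lateral_gap)
    if collided:
        reward -= 50.0   # negative feedback for decisions causing a side collision
    return reward
```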
secondly, training the constructed collision avoidance decision model, and specifically comprising the following substeps:
substep 1: initialize the policy parameters θ_0 and the value-function parameters φ_0;
Substep 2: a loop of T time steps is performed, each loop comprising sub-step 2.1 to sub-step 2.5, in particular:
substep 2.1: run the policy π_k = π(θ_k) in the environment, where θ_k denotes the policy-network parameters at time k;
Substep 2.4: policy updates are made using the following equation:
where θ_{k+1} denotes the policy-network parameters at time k+1, ε denotes a hyper-parameter, π_θ denotes the policy network with parameters θ, clip(·) denotes a truncation function that clips its argument to [1−ε, 1+ε], τ denotes a hyper-parameter determining the magnitude of the soft update, argmax(·) denotes the variable that maximizes the objective function, and Â(S_t, A_t) denotes the advantage value of the state-action pair;
substep 2.5: the value function update is performed using:
where φ_{k+1} denotes the value-function parameters at time k+1, and V_φ(S_t) denotes the value function of the state space S_t;
and finally, once the collision-avoidance decision model has been trained, the motion-state information of the ego vehicle and its relative motion with respect to the surrounding traffic participants are input into the model, which outputs driving recommendations such as acceleration, deceleration and lane changing, realizing effective and reliable lateral collision-avoidance driving decisions for the large commercial vehicle.
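The value-function update of Substep 2.5 can be sketched as mean-squared-error regression of V_φ(S_t) toward observed returns. A toy linear value function is assumed here; names and the learning rate are illustrative, not from the patent:

```python
# Hedged sketch: one gradient step of MSE regression for a linear value
# function V_phi(s) = phi . s, fitted toward return targets.

def value_update(phi, states, returns, lr=0.01):
    """phi: weight list; states: list of feature lists; returns: targets."""
    grad = [0.0] * len(phi)
    for s, r in zip(states, returns):
        v = sum(w * x for w, x in zip(phi, s))   # V_phi(S_t)
        err = v - r                              # regression residual
        for k, x in enumerate(s):                # gradient of 0.5 * err^2
            grad[k] += err * x
    n = len(states)
    return [w - lr * g / n for w, g in zip(phi, grad)]
```

Repeating this step drives V_φ toward the empirical returns, giving the advantage estimates the policy update of Substep 2.4 relies on.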
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211070522.XA CN115257789A (en) | 2022-09-02 | 2022-09-02 | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211070522.XA CN115257789A (en) | 2022-09-02 | 2022-09-02 | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115257789A true CN115257789A (en) | 2022-11-01 |
Family
ID=83755043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211070522.XA Pending CN115257789A (en) | 2022-09-02 | 2022-09-02 | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115257789A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116822659A (en) * | 2023-08-31 | 2023-09-29 | 浪潮(北京)电子信息产业有限公司 | Automatic driving motor skill learning method, system, equipment and computer medium |
CN116822659B (en) * | 2023-08-31 | 2024-01-23 | 浪潮(北京)电子信息产业有限公司 | Automatic driving motor skill learning method, system, equipment and computer medium |
CN116959260A (en) * | 2023-09-20 | 2023-10-27 | 东南大学 | Multi-vehicle driving behavior prediction method based on graph neural network |
CN116959260B (en) * | 2023-09-20 | 2023-12-05 | 东南大学 | Multi-vehicle driving behavior prediction method based on graph neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110969848B (en) | Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes | |
CN110297494B (en) | Decision-making method and system for lane change of automatic driving vehicle based on rolling game | |
CN106740846B (en) | A kind of electric car self-adapting cruise control method of double mode switching | |
CN102109821B (en) | System and method for controlling adaptive cruise of vehicles | |
CN113291308B (en) | Vehicle self-learning lane-changing decision-making system and method considering driving behavior characteristics | |
CN115257789A (en) | Decision-making method for side anti-collision driving of commercial vehicle in urban low-speed environment | |
CN106874597A (en) | A kind of highway passing behavior decision-making technique for being applied to automatic driving vehicle | |
CN112622886B (en) | Anti-collision early warning method for heavy operation vehicle comprehensively considering front and rear obstacles | |
CN107813820A (en) | A kind of unmanned vehicle lane-change paths planning method for imitating outstanding driver | |
CN113253739B (en) | Driving behavior decision method for expressway | |
CN112249008B (en) | Unmanned automobile early warning method aiming at complex dynamic environment | |
Zhu et al. | Safe model-based off-policy reinforcement learning for eco-driving in connected and automated hybrid electric vehicles | |
CN110956851A (en) | Intelligent networking automobile cooperative scheduling lane changing method | |
CN110320916A (en) | Consider the autonomous driving vehicle method for planning track and system of occupant's impression | |
CN114580302A (en) | Decision planning method for automatic driving automobile based on maximum entropy reinforcement learning | |
CN115257819A (en) | Decision-making method for safe driving of large-scale commercial vehicle in urban low-speed environment | |
Yeom | Model predictive control and deep reinforcement learning based energy efficient eco-driving for battery electric vehicles | |
CN113901718A (en) | Deep reinforcement learning-based driving collision avoidance optimization method in following state | |
CN112201070A (en) | Deep learning-based automatic driving expressway bottleneck section behavior decision method | |
CN113120003B (en) | Unmanned vehicle motion behavior decision method | |
Li et al. | Deep reinforcement learning-based eco-driving control for connected electric vehicles at signalized intersections considering traffic uncertainties | |
CN114802306A (en) | Intelligent vehicle integrated decision-making system based on man-machine co-driving concept | |
Zhang et al. | Enhancement of Driving Strategy of Electric Vehicle by Consideration of Individual Driver Intention | |
Zhang et al. | Lane Change Decision Algorithm Based on Deep Q Network for Autonomous Vehicles | |
Pathare et al. | Improved Tactical Decision Making and Control Architecture for Autonomous Truck in SUMO Using Reinforcement Learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |