CN118212808A - Method, system and equipment for planning traffic decision of signalless intersection - Google Patents

Method, system and equipment for planning traffic decision of signalless intersection

Info

Publication number
CN118212808A
Authority
CN
China
Prior art keywords
vehicle
decision
risk
driving
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410150462.5A
Other languages
Chinese (zh)
Inventor
李立
赵峥程
杨文臣
刘晓锋
王润民
路庆昌
许文鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University
Priority to CN202410150462.5A
Publication of CN118212808A
Legal status: Pending

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The invention discloses a method, a system and equipment for traffic decision planning at signalless intersections, relating to the technical field of intelligent transportation. The method comprises the following steps: constructing a modeling scene of a connected signalless intersection and obtaining a decision-motion planning framework for turning vehicles under multi-vehicle interaction conditions; performing driving risk perception, defining risk levels according to accident severity, and obtaining risk perception coefficients; calculating passing gaps from the relative states of the interacting vehicles, deriving a passing strategy from the gaps, and obtaining the expected vehicle speed with a particle swarm algorithm; obtaining the relation between vehicle actions and the vehicle state relative to the current environment, training a reward-and-penalty strategy on the risk levels with the RA-SAC algorithm, and deciding continuous driving actions; and finally obtaining the vehicle decision-motion planning model. The invention can effectively evaluate intelligent connected vehicles at signalless intersections and has strong operability.

Description

Method, system and equipment for planning traffic decision of signalless intersection
Technical Field
The invention relates to the technical field of intelligent transportation, in particular to a method, a system and equipment for planning traffic decision of a signalless intersection.
Background
In recent years, with the rapid development of intelligent connected vehicle technology, it has shown good potential for reducing traffic conflicts and improving traffic efficiency and economic benefit. Based on the collaborative perception, fusion and interaction of full spatio-temporal traffic information, it enables cooperative decision-making and intelligent control of vehicle groups, advances China's autonomous-driving development route based on vehicle-road cooperation, and has become strategic development content of intelligent transportation in China. Intelligent connected vehicles send and collect environment information through V2I (Vehicle to Infrastructure) devices, achieving holographic road perception so that roads and vehicles form an interconnected whole. In particular, for irregular at-grade intersection scenes such as signalless intersections, connected-vehicle technology can reduce the negative influence of imperfect information collection on driving safety and provides the conditions for avoiding traffic collisions, improving traffic efficiency and improving the driver and passenger experience.
However, the transition from today's predominantly human-driven vehicles to widespread autonomous vehicles will be a lengthy phase. During this process, human-driven and autonomous vehicles coexist on the road, and the heterogeneous traffic mix aggravates the complexity of the driving environment at signalless intersections, posing severe tests for autonomous-vehicle perception, decision and control. How to quantify the motion characteristics of human-driven vehicles and incorporate them into the motion-planning factors of autonomous vehicles is therefore one of the important tasks in advancing intelligent driving. In addition, unlike a signalized intersection, where ensuring passing efficiency mainly requires considering the speed and acceleration of the vehicle before the stop line, a signalless intersection also requires planning the vehicle's motion within the internal conflict area to avoid collisions, which poses no small challenge for left-turning behavior in which vehicle speed and steering angle must be planned simultaneously. Therefore, how to construct a left-turn motion-planning model for multi-vehicle interaction conditions that, from the viewpoint of driving safety, accounts for driving risk and the driving styles of other vehicles in a heterogeneous environment is a problem that intelligent connected vehicles need to solve in signalless intersection scenes.
At present, reinforcement learning methods have been applied to study autonomous driving decisions in signalless intersection scenes with safety, efficiency and comfort as objectives. However, the safety objective usually considers only whether a collision occurs, for example by formulating the return function around vehicle collision events, while few studies consider potential collision risk. Using low-frequency accident data as the evaluation criterion has inherent drawbacks: the sample size is small and the risk evolution process is difficult to reflect. If a vehicle is already in an accident-critical state without an actual collision, its dynamic driving safety can hardly be reflected by collision events alone. In addition, common hierarchical reinforcement learning methods need to train different levels separately, which increases training cost, and their higher network complexity also increases running cost, making them difficult to apply to complex and changeable driving task scenes.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method, a system and equipment for traffic decision planning at signalless intersections, which solve the problems that common hierarchical reinforcement learning methods in the prior art need to train different levels separately, increasing training cost, and have higher network complexity, increasing running cost, so that they are difficult to apply to complex and changeable driving task scenes.
The invention specifically provides the following technical scheme: a traffic decision planning method for a signalless intersection comprises the following steps:
constructing a modeling scene of the network-connected signalless intersection, and constructing a steering vehicle decision-motion planning frame under the multi-vehicle interaction working condition based on the modeling scene of the network-connected signalless intersection;
Performing driving risk perception through the steering vehicle decision-motion planning framework, defining risk levels according to perceived accident severity, and acquiring risk perception coefficients under different risk levels;
calculating a passing gap according to the relative states of the interactive vehicles, obtaining a passing strategy through the passing gap, and obtaining the expected speed of each vehicle in the passing strategy by adopting a particle swarm algorithm;
Selecting a target path point based on each vehicle position and the global path, and matching the expected vehicle speed with the target path point by using a pure tracking algorithm to obtain a driving decision of continuous driving action;
Carrying out reward and punishment strategy training on the risk perception coefficient by using an RA-SAC algorithm, and changing the gradient updating amplitude of the driving decision through the trained risk perception coefficient to obtain a vehicle decision motion planning model;
And making a decision on the passing of each vehicle by using the vehicle decision motion planning model.
Preferably, the steering vehicle decision-motion planning framework comprises: the system comprises a network connection no-signal intersection environment, a perception and decision module and a vehicle motion planning module.
Preferably, defining risk levels according to the perceived accident severity and obtaining the risk perception coefficients under different risk levels comprises the following steps:
carrying out risk classification on traffic conflict events according to collision avoidance acceleration thresholds, and calculating conditional probabilities under different risk classes;
and obtaining risk perception coefficients under each given state by adopting a Bayesian theory.
Preferably, the risk classification of traffic conflict events according to the collision-avoidance acceleration threshold and the calculation of the conditional probabilities under different risk levels are performed as follows:
where Ds, Dr and Dd are the rDRAC thresholds for safe, risky and dangerous events, respectively, and σ denotes a random variable; τ is the risk level, represented by the values 0, 1 and 2 corresponding to the safe, risky and dangerous levels respectively, i.e. τ ∈ {0,1,2}.
Preferably, the risk perception coefficient under each given state is obtained by Bayesian theory. The posterior probability of risk level τ is computed by Bayes' rule,
P(τ|D) = P(D|τ)·P(τ) / Στ′∈Ω P(D|τ′)·P(τ′),
where ε is the risk perception coefficient obtained by combining the posterior probabilities over all risk levels, P(τ|D) is the posterior probability of risk level τ, P(τ) is the prior probability of the risk level, and P(D|τ) is the conditional probability under the different risk levels.
Preferably, selecting a target path point based on each vehicle position and the global path and matching the expected vehicle speed with the target path point using a pure tracking algorithm includes the following steps:
Designing a steering tracking function module of the vehicle based on a pure tracking algorithm and a PID controller, selecting a target path point based on the current vehicle position and a global path, and adjusting the steering angle of the vehicle by combining the pure tracking algorithm and the PID controller;
and deciding a basic forward looking distance parameter in a pure tracking algorithm according to the current vehicle state, matching the expected vehicle speed with a target path point, and tracking the target path point.
Preferably, obtaining the driving decision of continuous driving actions comprises the following steps:
defining a vehicle passing process at a signalless intersection as a markov decision process;
the method comprises the steps of determining two parameters of a basic forward looking distance and a throttle valve/brake pedal of motion control through a deep reinforcement learning method based on a Markov decision process;
Obtaining rewards through the actions taken under each decision, and obtaining the relation between the driving continuous actions and the current environment and the vehicle state through the rewards;
and obtaining a driving decision of the driving continuous action according to the relation between the current environment and the vehicle state by the driving continuous action.
Preferably, the reward-and-penalty strategy training on the risk perception coefficient using the RA-SAC algorithm includes the following steps:
identifying potential collision risk during training by changing the reward and penalty force through the risk perception coefficient;
placing the current vehicle driving decision into the corresponding environment for evaluation, so that the reward-and-penalty result better matches the influence of the vehicle's action on the actual environment;
changing the gradient update amplitude according to the potential collision risk of the current driving, and receiving feedback of different degrees according to the amplitude;
and obtaining the optimal continuous driving actions from the feedback and the reward-and-penalty results, and updating the driving decision through the optimal continuous driving actions.
Preferably, the present invention also provides a system for traffic decision planning at signalless intersections, comprising:
the frame construction module is used for constructing a modeling scene of the network-connected signalless intersection and constructing a steering vehicle decision-motion planning frame under the multi-vehicle interaction working condition based on the modeling scene of the network-connected signalless intersection;
the risk acquisition module is used for carrying out driving risk perception through the steering vehicle decision-motion planning framework, defining risk levels according to the perceived accident severity, and acquiring risk perception coefficients under different risk levels;
The vehicle speed acquisition module is used for calculating a passing gap according to the relative states of the interactive vehicles, acquiring a passing strategy through the passing gap, and acquiring the expected speed of each vehicle in the passing strategy by adopting a particle swarm algorithm;
The state acquisition module is used for selecting a target path point based on each vehicle position and the global path, matching the expected vehicle speed with the target path point by using a pure tracking algorithm, and obtaining a driving decision of continuous driving action;
The decision model construction module is used for carrying out reward and punishment strategy training on the risk perception coefficient by using an RA-SAC algorithm, and changing the gradient update amplitude of the driving decision through the trained risk perception coefficient to obtain a vehicle decision motion planning model;
and the vehicle decision module is used for deciding the passing of each vehicle by using the vehicle decision motion planning model.
Preferably, the present invention further provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the steps of a method for planning traffic decisions of a signalless intersection.
Compared with the prior art, the invention has the following remarkable advantages:
The invention provides a vehicle decision-motion planning framework for signalless intersections that combines a risk-perception method with a deep reinforcement learning algorithm: the risk level, risk perception coefficient, passing gap, expected vehicle speed, vehicle position and target path point are obtained, continuous driving actions are constructed and executed to change the vehicle's state in the environment, and feedback is completed through the transmission of environment information, realizing closed-loop control of vehicle motion planning. At the same time, the RA-SAC algorithm, a mechanism that adjusts the reward strategy according to the risk level, is designed; through this mechanism the current driving decision of the vehicle is evaluated in the corresponding environment so that the reward-and-penalty result better matches the influence of the vehicle's action on the actual environment, no separate training of different levels is required, and the vehicle motion planning becomes more accurate and less costly. The invention can thus effectively evaluate intelligent connected vehicles at signalless intersections and has strong operability.
Drawings
FIG. 1 is a diagram of an intelligent networked vehicle motion planning framework;
FIG. 2 is a schematic illustration of a signalless intersection in a networked environment;
FIG. 3 is a traffic sequence decision flow diagram;
FIG. 4 is a schematic diagram of a trajectory change process;
FIG. 5 is a flow chart of a particle swarm algorithm;
FIG. 6 is a vehicle motion planning framework diagram;
FIG. 7 is a diagram of the geometry of a pure tracking algorithm;
FIG. 8 is a steering controller based on a pure tracking algorithm;
FIG. 9 is a network structure diagram of the RA-SAC algorithm;
FIG. 10 is a flow chart of a simulation experiment;
FIG. 11 (a) is a RA-SAC training process rewards graph and (b) is a SAC training process rewards graph;
In fig. 12, (a) is a TD3 training process reward graph and (b) is a DDPG training process reward graph.
Detailed Description
The following description of the embodiments of the present invention, taken in conjunction with the accompanying drawings, will clearly and completely describe the embodiments of the present invention, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Referring to fig. 1, an embodiment of the present application provides a method for planning traffic decisions at a signalless intersection, including the following steps:
Step S1: and constructing a modeling scene of the network-connected signalless intersection, and constructing a steering vehicle decision-motion planning frame under the multi-vehicle interaction working condition based on the modeling scene of the network-connected signalless intersection.
A modeling scene of the network-connected signalless intersection is constructed, the characteristics and basic assumptions of the scene are described in detail, and then a steering vehicle decision-motion planning framework under the working condition of multi-vehicle interaction of the signalless intersection is provided, as shown in figure 1.
(1) The invention builds the scene with a T-shaped intersection as the prototype. The road geometry is shown in fig. 2: the alignment of the roads at the T-shaped intersection is irregular, and the road section is located in a suburb, characterized by low traffic density, high speeds and high driving risk, which matches the research scene of the invention. The participants in the multi-vehicle interaction condition comprise 1 Connected Automated Vehicle (CAV) and 2 connected human-driven vehicles (CHV).
Because the road environment of the real signalless intersection is complex, in order to reduce the interference of other unstable factors in the modeling process, the invention makes the following assumptions on the research scene:
1) In the process that the vehicle enters the observation area until the vehicle leaves the intersection, no communication time delay exists between all intelligent vehicle-mounted units and the road side units.
2) Each vehicle is the lead vehicle in its lane for its direction, and there is no car-following behavior while driving through the upstream transition zone.
3) No pedestrians cross the road, no vehicles parked along the roadside pull out, and no other conditions that would disturb the running state of the vehicles occur.
(2) Vehicle decision-motion planning framework
Intelligent networked vehicle control can be divided into four phases of environment awareness, behavior decision, track planning and path tracking. In the network-connected signalless intersections of vehicles with different automation degrees, CAV needs to perform risk sensing, make reasonable traffic decisions and then plan vehicle movement. The integral frame of the invention comprises: the system comprises a network connection no-signal intersection environment, a perception and decision module and a vehicle motion planning module.
1) In a mixed traffic scene of a network-connected signalless intersection, intelligent network-connected vehicles and manual driving network-connected vehicles are mixed, a road side is required to be provided with a sensing terminal, basic information such as the position and the speed of the vehicle and the like sent by the vehicle in the range of the intersection is received in real time, and state information of other traffic units is sent to the vehicle in the communication range, so that the interconnection and the intercommunication of vehicle-road information are realized. The information in the scene is the basis of driving decisions and motion planning, and control effect feedback is provided at the same time, and the characteristics are shown in fig. 1 (a).
2) The intelligent network-connected vehicle completes reasonable driving decision according to the correct understanding of traffic environment, which is an important precondition for motion trail planning. In the frame setting scene, the vehicle needs to evaluate the risk level of the driving environment where the vehicle is positioned on the premise of sensing the information such as the motion state, the relative acceleration, the distance and the like of the interactive object vehicle so as to adjust the driving strategy. The risk perception sub-module is designed, as shown in fig. 1 (b), by calculating rDRAC (relative Deceleration Rate to Avoid the Crash) values of the vehicle and the environmental vehicle, different risk level probabilities are estimated to be used for calculating risk perception coefficients, and the current driving risk level of the vehicle is reflected. The vehicle speed decision module, as shown in fig. 1 (c), decides the vehicle speed from the upstream of the intersection to the arrival at the intersection. The passing process of the host vehicle at the intersection is regarded as a multi-objective planning problem, vehicles with different directions are mapped to a virtual queue, and whether the current passing gap is accepted is judged according to the critical gap, so that the passing sequence of the host vehicle is determined. After the traffic sequence is determined, the expected speed of the vehicle is solved by adopting a particle swarm algorithm and is transmitted to the bottom layer controller as a sub-target.
3) The vehicle control module adopts decoupled, parallel longitudinal and steering control, and the reinforcement learning agent decides the key parameters of the two controllers, as shown in fig. 1 (d). The reward and penalty force during training is based on the risk level obtained by module (b). The longitudinal control module adjusts the throttle/brake pedal pressure signal based on the expected vehicle speed from module (c), and this adjustment determines the safety and comfort of the occupants. The steering control module consists of a pure tracking algorithm and a PID controller: the vehicle selects a target point according to its current position and the global path and adjusts the front-wheel steering angle to realize path-tracking control.
Step S2: and carrying out driving risk sensing through a steering vehicle decision-motion planning framework, defining risk levels according to the sensed accident severity, and acquiring risk sensing coefficients under different risk levels.
The method comprises the following steps: in order to make a correct motion planning guidance strategy for different road conditions, in the risk perception module shown in fig. 1 (b), risk classification is carried out on traffic conflict events according to collision avoidance acceleration thresholds, conditional probabilities under different risk classes are calculated, and a bayesian theory is adopted to obtain risk perception coefficients under each given state. I.e. driving risk perception.
1) Risk ranking
Depending on the severity of a vehicle crash, it is classified as mild (no casualties or property loss), moderate (property loss only) or severe (endangering the occupants). The three degrees correspond in turn to three levels: safe, risky and dangerous, and the level set Ω is defined as:
Ω={safe,risky,dangerous}={s,r,d} (1)
the risk level τ is represented by the values 0, 1,2, corresponding to the safety, risk and risk level, respectively, as shown in equation (2):
τ∈Ω={0,1,2} (2)
The rDRAC between the host vehicle and the interacting environment vehicle is denoted by the symbol D. For any rDRAC value, the conditional probabilities under the different risk levels are given by equation (3):
where Ds, Dr and Dd are the rDRAC thresholds for safe, risky and dangerous events, respectively, obtained from the distribution of the inD dataset, and σ denotes the random variable. A larger D indicates a higher probability of a higher risk level and a lower probability of a lower risk level, and vice versa.
2) Driving risk estimation
The driving risk estimation is based on Bayesian theory: prior information about the parameter value and the data (or likelihood) currently at hand are used to obtain a posterior estimate of the parameter, i.e. the posterior probability of a certain risk level is obtained from the prior probabilities of the different risk levels and the current conditional probability.
For a given state, the posterior probability of a certain risk level τ is:
P(τ|D) = P(D|τ)·P(τ) / Στ′∈Ω P(D|τ′)·P(τ′) (4)
where P(τ) is the prior probability of the risk level, with the prior probabilities of the different risk levels obtained from statistics of the inD dataset as shown in Table 1, and P(D|τ) is the conditional probability under the different risk levels.
Table 1a priori probabilities of different risk levels
Risk level 0 1 2
Probability of 0.748 0.239 0.013
Combining all risk levels, the risk assessment result is converted into a quantifiable continuous value by defining the risk perception coefficient ε, calculated as shown in formula (5).
The risk perception coefficient ε is the output of the risk perception module and serves as a strategy parameter of the deep reinforcement learning algorithm in vehicle motion planning: the agent adjusts the model's reward strategy according to ε, which affects the updating of the network parameters and improves the agent's training.
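For illustration, the risk-perception step can be sketched in code as follows. This is a minimal sketch, not the invention's implementation: the conditional likelihood P(D|τ) of formula (3) and the aggregation of the posteriors into ε of formula (5) are placeholders/assumptions, while the priors follow Table 1.

```python
# Minimal sketch of the risk-perception step (Bayes posterior -> coefficient).
# p_d_given_tau and the mapping from posteriors to epsilon are placeholders;
# formulas (3) and (5) of the description define them precisely.

PRIORS = {0: 0.748, 1: 0.239, 2: 0.013}  # P(tau), Table 1 (inD dataset)

def risk_perception_coefficient(rdrac, p_d_given_tau):
    """Return the risk perception coefficient epsilon for one rDRAC observation.

    rdrac         : observed rDRAC value D between host and environment vehicle
    p_d_given_tau : callable (D, tau) -> conditional probability P(D | tau)
    """
    # Bayes' rule: P(tau | D) = P(D | tau) P(tau) / sum_tau' P(D | tau') P(tau')
    joint = {tau: p_d_given_tau(rdrac, tau) * PRIORS[tau] for tau in PRIORS}
    evidence = sum(joint.values()) or 1e-12
    posterior = {tau: joint[tau] / evidence for tau in PRIORS}

    # Assumed aggregation: expected risk level, normalised to [0, 1].
    # The patent combines all levels into a continuous value via formula (5).
    epsilon = sum(tau * p for tau, p in posterior.items()) / 2.0
    return epsilon, posterior
```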
Step S3: calculating a passing clearance according to the relative states of the interactive vehicles, obtaining a passing strategy through the passing clearance, and obtaining the expected speed of each vehicle in the passing strategy by adopting a particle swarm algorithm.
Specifically, the vehicle speed decision module calculates a predicted time gap from the relative state of the interacting vehicle ahead and judges the passing order of the host vehicle according to the acceptable passing gap; then the expected speed of the host vehicle is solved with a particle swarm algorithm guided by safety and efficiency; finally, the expected speed is sent to the bottom layer as a control sub-target, where the throttle/brake pedal adjustment is completed. Compared with letting the agent control acceleration directly from the environment information, the expected vehicle speed plays a guiding role and simplifies the speed-adjustment task of the reinforcement learning agent.
1) Intersection passing model
According to Chinese road traffic regulations, the maximum speed on a road upstream of an intersection without traffic signals or speed-limit signs shall not exceed 50 km/h. When the vehicle speed is below 12.4 km/h, fuel economy deteriorates greatly, so vlower is set to 12.4 km/h. At the terminal time tf before reaching the intersection, the vehicle speed should satisfy the following boundary constraints:
the boundary value of the acceleration satisfies the positive and negative maximum acceleration in inD data set:
In an independent conflict event, the difference between the times at which two vehicles pass through the conflict zone is generally represented by the surrogate safety indicator GT (Gap Time). The smaller the GT, the smaller the time gap with which the following vehicle passes through the conflict area and the higher the potential collision risk, so the magnitude of the GT value is used as the basis of the safety index. Assuming there are m vehicles in the scene and the passing order of the host vehicle is n, the sub-objective function f1 is established as shown in formula (8):
where GT(m-1,m)(t) is the GT value between the host vehicle, whose passing order is m, and the adjacent vehicle m-1 at time t. If the passing order of the host vehicle is m-1, i.e. it passes between two vehicles (m-2 and m), f1 is the sum of the GT values between the host vehicle and the two adjacent vehicles in front of and behind it.
The relative speed of the two vehicles during the interaction reflects their relative motion state at that moment. When the two vehicles are in the upstream stage, an excessive speed difference means that one of them may be idling, which affects the experience of the occupants and is likely to have a negative effect on traffic inside the intersection. When the two vehicles are near the intersection, an excessive speed difference means that one party's speed is too high, the safe reaction time is short, and the potential collision risk is further increased as either party's speed changes. Therefore, the speed difference between the host vehicle and the adjacent vehicles during passing should be as small as possible. Assuming the adjacent vehicles are numbered m and m-1, the sub-objective function f2 is shown in formula (9):
Provided that driver safety is ensured, passing through the intersection as quickly as possible reduces travel delay and shortens the time needed to handle a single conflict event, which benefits both road traffic efficiency and fuel economy. The efficiency sub-objective function f3 is shown in equation (10):
where Vm(t) is the speed of vehicle m at time t and vlower and vupper are the lowest and highest speed limits at the intersection, respectively.
Comprehensively considering performance indexes of speed decision in multiple aspects, combining the motion state constraint of the vehicle, and integrating the different objective functions into a total objective function f, wherein the total objective function f is shown in a formula (11):
f=f1+f2+f3 (11)
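An illustrative sketch of the combined speed-decision objective is given below. The function names and the exact forms of the sub-objectives are assumptions (formulas (8)-(10) are not reproduced here); in particular the safety term is written as a reciprocal GT penalty only so that the combined objective remains a minimization problem.

```python
# Hedged sketch of the total objective f = f1 + f2 + f3 used by the speed decision.
# GT values are approximated elsewhere from expected arrival times at the conflict zone.

def f1_safety(gt_values):
    """Safety sub-objective built from GT; assumed here as a reciprocal penalty
    so that small passing gaps (higher potential risk) increase the objective."""
    return sum(1.0 / max(gt, 1e-3) for gt in gt_values)

def f2_speed_difference(v_host, v_neighbours):
    """Keep the speed difference with adjacent vehicles as small as possible."""
    return sum(abs(v_host - v) for v in v_neighbours)

def f3_efficiency(v_host, v_lower, v_upper):
    """Encourage passing the intersection as fast as the speed limits allow."""
    return (v_upper - v_host) / (v_upper - v_lower)

def total_objective(v_host, v_neighbours, gt_values, v_lower=3.44, v_upper=13.89):
    # The description minimises the combined objective subject to the speed and
    # acceleration bounds of formulas (6)-(7).
    return (f1_safety(gt_values)
            + f2_speed_difference(v_host, v_neighbours)
            + f3_efficiency(v_host, v_lower, v_upper))
```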
2) Traffic sequence decision
A virtual lane converts the two-dimensional intersection scene into a one-dimensional queue: the queue order in the virtual lane represents a collision-free right-of-way allocation scheme for the vehicles, and the headways of the vehicles in the queue can be converted into passing gaps in the intersection area. Projecting vehicles from lanes in different directions onto the virtual lane and modeling the virtual queue is a common cooperative control method for signalless intersections. Through centralized control of the virtual queue, situations where vehicles arrive at the intersection simultaneously can be avoided, negative effects such as traffic jams and uncoordinated platooning at multi-lane intersections are eliminated, and traffic efficiency is improved.
Based on the acceptance of critical gaps, the invention designs a traffic decision scheme, and the flow is shown in figure 3.
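A hedged sketch of a gap-acceptance passing-order decision based on the virtual-queue idea follows; the critical-gap value and the queue construction are illustrative assumptions, not the exact flow of fig. 3.

```python
# Sketch: project all vehicles onto one virtual queue (sorted by expected arrival
# time) and accept a slot for the host only if it is separated from both the
# vehicle ahead and the vehicle behind by at least the critical gap.

def passing_order(host_eta, env_etas, critical_gap=3.0):
    """Return the host's position in the virtual queue (0 = first to pass).

    host_eta : expected arrival time of the host vehicle at the conflict zone [s]
    env_etas : expected arrival times of the environment vehicles [s]
    """
    queue = sorted(env_etas)
    for idx, eta in enumerate(queue):
        leader_eta = queue[idx - 1] if idx > 0 else float("-inf")
        if eta - host_eta >= critical_gap and host_eta - leader_eta >= critical_gap:
            return idx          # gap accepted: host passes before vehicle `idx`
    return len(queue)           # otherwise the host yields and passes last
```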
3) Desired vehicle speed decision
After the passing order is obtained, the host vehicle needs to plan its speed according to the expected times at which the other vehicles arrive at the intersection, avoiding situations where its trajectory overlaps with an environment vehicle's trajectory or the arrival times are too close. It is worth noting that the passing strategy is not output separately: the expected vehicle speed already reflects both the passing strategy and the speed planning result.
For the expected-speed planning problem, the invention adopts a particle swarm algorithm to solve the objective function proposed above. The particle swarm algorithm is derived from research on the foraging behavior of bird flocks: in each iteration the fitness of a particle's position is recorded as experience, the particle adjusts its direction of movement using its own experience and memory, and the optimal fitness is found through continuous movement, completing the iterative optimization.
The positions of the particles are represented by solution vectors, and if one solution of the objective function consists of D arguments, the position of one particle is D-dimensional space, and the position of the i-th particle is represented as:
The value obtained by substituting the position of the particle into the objective function is the fitness, the fitness of the position of each particle in one iteration is recorded through the historical fitness, and the quality of the position is judged according to the fitness. The fitness of the particles is recorded as:
In general, the smaller the fitness value is, the better. Assuming that the ith particle is iterated for d times altogether, and in the nth iteration, when the fitness of the position reached after the particle moves is better than the historical fitness, the historical optimal position before the position is replaced is recorded as the individual historical optimal fitness. By the d-th iteration, the individual history best fitness is noted as:
The sum of fitness of each particle is the fitness of the particle swarm, which is recorded as:
By the d-th iteration, the population history best fitness is noted as:
The position of each particle change is moved according to a certain speed, and the movement speed of the ith particle is recorded as follows:
The velocity of a particle at step d is the sum of its velocity inertia, self-cognition and social cognition from the previous step, expressed as:
Vi(d) = w·Vi(d-1) + C1·r1·(Pbest,i - Xi(d-1)) + C2·r2·(Gbest - Xi(d-1)) (18)
where C1 is the individual learning factor of the particle (the larger the factor, the more the particle tends to move toward its own historical best position; its value is generally 2); C2 is the social learning factor (the larger the factor, the more the particle tends to move toward the historical best position of the other particles; its value is generally 2); r1 and r2 are random numbers in the range [0,1] used to increase search randomness; and w is the inertia weight, a non-negative number used to adjust the search range over the solution space.
To avoid the host vehicle getting too close to the vehicle in front or behind, an unsafe time tunsafe is defined, meaning that a vehicle arriving at the intersection within that time range is too close to the adjacent vehicle; the safe time tsafe means that a vehicle arriving at the intersection within that range is staggered from the adjacent vehicles by a sufficient interval to ensure safety. A passing schematic of the intersection is shown in fig. 4.
After the particle positions are initialized, the expected arrival times of each feasible solution and of the front and rear vehicles are calculated; solutions whose expected arrival time falls within tunsafe are screened out, and the algorithm optimizes only the feasible solutions within the safe range tsafe, avoiding situations in which the expected speed represented by certain extreme particles ignores safety in blind pursuit of efficiency. Combining the particle swarm algorithm with the discussion above, the expected-speed decision algorithm is obtained; its flow chart is shown in fig. 5.
In summary, the host vehicle (HV) formulates a passing strategy according to the expected arrival times of the environment vehicles and can reasonably decide the expected vehicle speed to pass through the intersection safely.
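The expected-speed decision of fig. 5 can be sketched as a particle swarm loop as follows; the swarm size, inertia weight, fitness function and the unsafe-time filter interface are assumptions, while C1 = C2 = 2 and the speed bounds follow the description above.

```python
# Minimal particle-swarm sketch for the expected-speed decision.
import random

def pso_expected_speed(fitness, v_lower, v_upper, is_unsafe,
                       n_particles=30, n_iter=100, c1=2.0, c2=2.0, w=0.6):
    x = [random.uniform(v_lower, v_upper) for _ in range(n_particles)]   # positions
    v = [0.0] * n_particles                                              # velocities
    pbest, pbest_fit = list(x), [float("inf")] * n_particles
    gbest, gbest_fit = x[0], float("inf")

    for _ in range(n_iter):
        for i in range(n_particles):
            if is_unsafe(x[i]):               # drop speeds whose ETA falls in t_unsafe
                continue
            fit = fitness(x[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i], fit
            if fit < gbest_fit:
                gbest, gbest_fit = x[i], fit
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            v[i] = (w * v[i] + c1 * r1 * (pbest[i] - x[i])
                    + c2 * r2 * (gbest - x[i]))
            x[i] = min(max(x[i] + v[i], v_lower), v_upper)   # keep within speed limits
    return gbest   # expected vehicle speed handed to the low-level controller
```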
The steering tracking function module of the vehicle is designed based on a pure tracking algorithm and a PID (Proportional-Integral-Derivative) controller.
Step S4: and selecting a target path point based on each vehicle position and the global path, and matching the expected vehicle speed with the target path point by using a pure tracking algorithm to obtain a driving decision of continuous driving action. And constructing a complete vehicle decision-motion planning framework, defining a passing process of the vehicle at a signalless intersection as a Markov decision process, obtaining the relation between the vehicle action and the vehicle state relative to the current environment, and deciding the driving continuous action by adopting a deep reinforcement learning method.
The method comprises the following steps:
Through a deep reinforcement learning method based on the Markov decision process, the two motion-control parameters, the basic forward-looking distance and the throttle/brake pedal signal, are decided; by introducing a risk measurement method, a training mechanism in which the agent's reward and penalty force changes with the risk degree is designed, and the RA-SAC (Risk Awareness Soft Actor-Critic) algorithm is proposed.
1) The vehicle motion planning framework comprises a decision maker based on a reinforcement learning intelligent agent and an underlying tracker, wherein the underlying tracker comprises a vehicle speed control module and a steering control module, and the structural schematic diagram is shown in fig. 6. The vehicle speed control module is responsible for controlling the longitudinal acceleration of the vehicle, the reinforcement learning intelligent body makes a decision on a throttle valve/brake pedal pressure signal of the vehicle, and the vehicle speed control module converts the throttle valve/brake pedal pressure signal into a vehicle dynamics model to finish the vehicle speed control. The steering control module selects a target path point based on the current vehicle position and the global path, and adopts a pure tracking algorithm and a PID controller to adjust the steering angle of the vehicle in a combined way. The intelligent agent makes a decision on basic forward looking distance parameters in a pure tracking algorithm according to the current vehicle state so as to adapt to the current vehicle speed and a target path point, and tracking of the target path point is realized.
The complete workflow of the vehicle motion planning section is: HV obtains other vehicle information from the environment, including X/Y coordinates, speed, etc.; and receiving risk perception and expected vehicle speed results from a decision layer. After the vehicle senses the environmental information, the reinforcement learning intelligent body carries out motion planning decision, wherein the decision content comprises a throttle valve/brake pedal pressure signal and a front wheel steering angle, and the result is used as the input of a vehicle dynamics model to control the vehicle to complete appointed motion. And the motion execution changes the state of the vehicle in the environment, and feedback is completed through the transmission of environment information, so that the closed-loop control of the vehicle motion planning is realized.
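The closed loop described above can be sketched schematically as follows; env, agent, pure_pursuit and the action interface are assumed names, not components defined by the invention.

```python
# Schematic closed loop of the motion-planning workflow: perceive -> decide
# (throttle/brake and base look-ahead) -> steer toward the target path point ->
# execute through the dynamics model -> receive the new environment state.

def run_episode(env, agent, pure_pursuit, global_path, steps=500):
    state = env.reset()                       # positions, speeds, risk level, expected speed, ...
    for _ in range(steps):
        u, l_c = agent.act(state)             # throttle/brake signal and base look-ahead distance
        target = pure_pursuit.select_target(state, global_path, l_c)
        delta = pure_pursuit.steering_angle(state, target)
        state, reward, done = env.step((u, delta))   # dynamics model executes the motion
        if done:                              # collision, success or timeout ends the episode
            break
```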
2) Steering tracking control module
In the running process of the main vehicle, a reasonable steering angle is required to be output to track a global path besides the speed adjustment, and the steering angle is mainly designed by a steering control model and a tracker, wherein the steering model comprises a geometric model, an automobile dynamics tracking model and the like. The steering geometry model used in the present invention is a pure tracking algorithm (Pure Pursuit Algorithm).
1. Pure tracking algorithm
The pure tracking algorithm is a path-tracking control strategy based on a geometric model. Its basic principle is that, by controlling the steering radius R of the vehicle, the control point at the center of the rear axle reaches a reference path point at look-ahead distance ld along an arc, and the required front-wheel steering angle δ is calculated from the Ackermann model; the geometric relations are shown in fig. 7.
According to the variables in fig. 7, with α denoting the angle between the vehicle heading and the line from the rear axle to the target point, the relation between the radius R and ld follows from the law of sines, as shown in equation (19):
R = ld / (2·sin α) (19)
The curvature κ of the path is:
κ = 1/R = 2·sin α / ld (20)
Taking the lateral error e = ld·sin α into account, we get:
κ = 2e / ld² (21)
From the rear-axle kinematics of the vehicle model, the steering angle can be determined as:
δ = arctan(κ·Lw) = arctan(2·Lw·sin α / ld) (22)
where Lw is the wheelbase of the vehicle and ld is the look-ahead distance, whose value is related to the look-ahead coefficient k, the vehicle speed V and the basic look-ahead distance lc, as shown in formula (23):
ld = k·V + lc (23)
Combining formulas (21), (22) and (23), the steering-angle calculation formula is obtained as:
δ = arctan(2·Lw·e / (k·V + lc)²) (24)
the pure tracking algorithm has simple structure, good robustness to road curvature disturbance, and is suitable for path tracking control under lower vehicle speed and small lateral acceleration.
2. Steering control design
The pure tracking algorithm, based on the geometric model, is adopted as the basic steering-tracking method. Controlling directly with its output angle causes problems such as larger tracking errors and slow recovery from heading deviations, so the pure tracking algorithm is optimized by combining it with a PID controller, which smooths the change of the control angle between consecutive control periods and reduces the heading error during tracking. The steering tracking controller is designed as shown in fig. 8.
The PID controller is widely used and has the advantages of a simple structure, good stability, reliable operation and convenient tuning. PID regulation is a linear combination of proportional, integral and derivative regulation rules, which combines the fast reaction of proportional regulation, the elimination of steady-state error by integral regulation and the predictive effect of derivative regulation, improving the responsiveness, accuracy and dynamic performance of the control. The control law is shown in formula (25):
u(k) = kp·e(k) + ki·Σj e(j) + kd·[e(k) - e(k-1)] (25)
where e(k) is the system error at step k, and kp, ki and kd are the proportional, integral and derivative control parameters, respectively.
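A sketch of the combined controller of fig. 8, in which a PID correction is added to the pure pursuit angle, might look as follows; the gains are illustrative, not tuned values from the invention.

```python
# Discrete PID correction applied on top of the pure-pursuit angle.
class PID:
    def __init__(self, kp=1.0, ki=0.05, kd=0.2):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def steering_command(pure_pursuit_angle, heading_error, pid):
    # the PID term smooths the heading error between consecutive control periods
    return pure_pursuit_angle + pid.step(heading_error)
```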
3) Exercise planning based on reinforcement learning
In the hierarchical decision-making planning architecture of an autopilot system, higher-level behavioural decisions (e.g., lane changes, overtaking, enqueuing, dequeuing, etc.) are represented as discrete states of mutual transitions, while lower-level execution actions (e.g., acceleration, deceleration, etc.) are represented as a continuous sequence of actions. The behavior decision layer determines discrete driving behavior states such as lane change, overtaking, left/right turning and the like and conversion rules thereof, and after making the behavior decision, the motion planning layer is responsible for providing a safe, comfortable and feasible continuous action sequence so as to realize the driving behavior selected by the decision system. The reinforcement learning intelligent agent can gradually adapt to the environment in continuous trial and error, and initial modeling cost is reduced. Furthermore, reinforcement learning can exhibit greater advantages at the anthropomorphic driving level under the direction of reasonable rewards.
1. Markov decision process
A reinforcement learning agent interacts with the environment with the goal of maximizing the cumulative reward to derive the next action; the action depends only on the current state and not on historical states, so the problem is typically modeled as a Markov Decision Process (MDP).
The Markov decision process is described by the tuple (S, A, P, R, γ), where S is the finite set of possible states of the dynamic environment, A is the set of actions available to the agent in a particular state, P is the state-transition probability matrix giving the probability of the system transitioning between each pair of states, and R is the reward obtained by taking an action in a particular state, representing the environment's feedback on how good that action is. γ ∈ [0, 1) is the discount factor, reflecting the importance of future rewards relative to the current reward and ensuring convergence of the overall return.
The agent takes a decision from a strategy pi (a|s) in the environment, wherein the strategy refers to a certain state s in the environment, the agent takes probability distribution of each action, and the calculation mode is as follows:
π(a|s)=p[At=a|St=s] (26)
The agent accumulates the reward obtained at each action step; the total accumulated reward is called the return and is calculated as:
Gt = Σk=0..∞ γ^k·Rt+k+1 (27)
When k goes to infinity the task can be regarded as a continuing task; here Rt+1 is the reward obtained when state St transitions to St+1. The agent must constantly adjust its actions or action-selection strategy to maximize the return. The value function of state s under policy π is denoted vπ(s), the expected return obtained by the agent's decisions under policy π, as shown in formula (28):
vπ(s) = Eπ[Gt | St = s] (28)
The return obtained by taking a particular action a under policy π is shown in equation (29); the return earned for each particular action reflects how good that action is for the current environment and state:
qπ(s, a) = Eπ[Gt | St = s, At = a] (29)
In the driving process at a signalless intersection, the driving behavior the vehicle needs to complete at the current moment is related only to the current road traffic conditions and not to any past states, and thus satisfies the Markov property. The invention therefore designs the host-vehicle control model with a deep reinforcement learning method based on the Markov decision process.
Step S5: and carrying out reward and punishment strategy training on the risk perception coefficient by using an RA-SAC algorithm, and changing the gradient updating amplitude of the driving decision by the trained risk perception coefficient to obtain a vehicle decision motion planning model.
The design combines a method for adjusting the reward-and-penalty strategy based on the risk perception result with the basic SAC algorithm, and is named the RA-SAC algorithm. Soft Actor-Critic (SAC) is a deep reinforcement learning algorithm that combines an off-policy strategy, an Actor-Critic structure and maximum entropy. Compared with methods that seek only a single optimal policy maximizing the cumulative return, SAC requires the selected policy to also maximize the entropy of each output action, which ensures task completion while increasing policy randomness. The SAC main network structure comprises one Actor network and four Critic networks, as shown in fig. 9. The policy optimization objective of SAC is:
J(π) = Σt E(st,at)∼ρπ[ r(st,at) + α·h(π(·|st)) ] (30)
where h(π(·|st)) is the entropy, calculated as shown in formula (31), and α is the temperature coefficient, which determines the importance of entropy relative to the reward; adjusting this parameter controls whether the agent favors reward or entropy, and thus the degree of randomness in the agent's exploration.
h(π(·|st))=-logπ(at|st) (31)
The Actor network outputs the parameters of the action probability distribution, and the agent's action is obtained by sampling from this distribution. The V-Critic network outputs v(st), an estimate of the state value, and the Q-Critic network outputs q(s, a), an estimate of the action value; each of them consists of a value-estimation Critic network and a target Critic network. Data (si, ai, ri, si+1) sampled from the experience pool are used to update the Q-Critic network parameters ω, and the true-value estimate of state st is obtained from the optimal Bellman equation as follows:
where Eπ is the expectation of the cumulative return from the current state.
The Q-Critic network samples a batch of data B and is updated by gradient descent as follows:
The V-Critic network updates its parameters θ with the data (si, ai, ri, si+1) sampled from the experience pool; its output true value is:
where A(si) is the set of all possible actions under policy π, a′i is a predicted action, and log πθ(a′i|si) is the entropy under that predicted action. The V-Critic network is updated as follows:
The SAC updates two target strategy Critic networks according to the super parameter rho, and the updating method comprises the following steps:
φtj←ρφtj+(1-ρ)φj,j={1,2} (36)
The essence of the Actor network parameter update is to minimize KL divergence, i.e., minimize relative entropy. The updating mode of the Actor network is as follows:
where the minimum of the two target-network estimates is taken, which effectively prevents overestimation, and log πθ(ai|si) is the entropy under the selected action.
The invention uses the coefficient ε obtained by the risk measurement method to change the agent's feedback under different risk degrees. The risk-based reward R is obtained from the risk perception coefficient ε and the basic reward r as shown in formula (38), so that under high risk the agent obtains a larger reward for safe driving behavior and, conversely, receives a larger penalty for aggressive driving behavior.
R(s,a) = (1+ε)·r(s,a) (38)
The change in the reward affects the network parameter update, and the Q-Critic network estimate becomes:
This method of adjusting the reward-and-penalty strategy based on the risk awareness result, combined with the basic SAC algorithm, is named the RA-SAC algorithm. Changing the reward and penalty force through the risk perception coefficient allows the agent to identify potential collision risk during training; the current driving decision of the vehicle is evaluated in the corresponding environment, so the reward-and-penalty result better matches the influence of the vehicle's action on the actual environment. The gradient update amplitude is changed by the current driving risk, so the agent receives feedback of different degrees according to the risk level. The pseudo-code of the RA-SAC algorithm is shown in Table 2.
TABLE 2 pseudo code for RA-SAC algorithm
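As an illustration of how the risk-adjusted reward of formula (38) enters the critic update, a hedged sketch is given below; it assumes a standard SAC target network and replay buffer and uses PyTorch, none of which is prescribed by the invention.

```python
# Sketch: the risk perception coefficient epsilon scales the base reward, which
# scales the Q-Critic regression target and hence the gradient update amplitude.
import torch

def ra_sac_critic_target(batch, v_target_net, epsilon, gamma=0.99):
    """Compute the Q-network regression target with risk-adjusted rewards."""
    s, a, r, s_next, done = batch                  # tensors sampled from the replay pool
    r_risk = (1.0 + epsilon) * r                   # formula (38): R(s,a) = (1+eps)*r(s,a)
    with torch.no_grad():
        target = r_risk + gamma * (1.0 - done) * v_target_net(s_next)
    return target                                  # used in the Q-Critic MSE loss

def q_loss(q_net, batch, target):
    s, a, *_ = batch
    return torch.nn.functional.mse_loss(q_net(s, a), target)
```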
State space, action space, design of reward functions:
1) State space
The state is the agent's observation of the environment and the basis of its decisions; the design of the state space runs through the whole training process of the agent. For the signalless intersection scene studied in the invention, a state space that accurately reflects the agent's attributes and reward changes is designed, since it determines whether model training can converge, the convergence speed and the final performance; reasonable state selection is therefore critical to the agent's training.
The state space is composed of internal attributes, namely the position coordinates, motion state, expected speed and lateral path error of the intelligent connected vehicle driving at the signalless intersection, and external attributes, namely the relevant information of the environment vehicles; the aggregate state-space expression is shown in formula (40):
where sHV, the state of the host vehicle, represents the internal attributes of the state space and is defined as sHV = (x, y, v, a, vgoal, e), and the state set of each environment vehicle i represents the external attributes. The meaning of each state set is shown in Table 3.
TABLE 3 vehicle State space at signalless intersection
Because the quantities in the state space have different dimensions, feature extraction is easily disturbed and the agent tends to select actions close to the boundary values. For a set of data X = {x1, x2, …, xn}, the normalization method is:
X′i = (Xi − Xmean) / Xstd (41)
where X′i is the i-th data item after normalization, Xmean is the mean of the data set, and Xstd is its standard deviation.
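A direct sketch of the z-score normalization of formula (41):

```python
# Normalise one state dimension to zero mean and unit variance.
def normalize(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]
```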
2) Action space
The intelligent agent makes a decision on parameters of longitudinal speed and steering tracking control. The longitudinal control aims at adjusting acceleration according to the current road environment and the expected vehicle speed, and the action is designed to be the output value of a throttle opening and brake pedal pressure signal. The steering control aims at adjusting the basic forward looking distance according to the current vehicle speed, so that a reasonable front wheel corner is obtained, and the basic forward looking distance of a pure tracking algorithm is designed as the action. The action space definition is shown in formula (42):
a=(u,lc) (42)
Where u is the output value of the throttle/brake pedal and lc is the basic forward-looking distance parameter of the pure tracking algorithm; the specific information of the action-space variables is shown in Table 4.
Table 4 main vehicle action space
To avoid the unreasonable situation in which the throttle and brake pedals are commanded simultaneously while the vehicle speed is being adjusted, the value of the action u is converted after checking its sign, as shown in formula (43):
where throttle is the throttle pedal output and brake is the brake pedal output pressure. The output value u must be converted into an input parameter of the vehicle dynamics model in the simulation environment, i.e. multiplied by the corresponding base unit so that the action is mapped linearly onto the specific range of the input parameter; the base unit of the throttle is 100 (%) and the base unit of the brake is 150 (bar).
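A sketch of the action conversion, assuming the usual sign-based split of u into throttle and brake, scaled by the base units above (the exact piecewise form of formula (43) is not reproduced here):

```python
# Split one signed action u into throttle and brake so that both pedals are
# never commanded at once, then scale to the simulator's base units (assumed).
def convert_action(u):
    throttle = max(u, 0.0) * 100.0    # throttle opening in %
    brake = max(-u, 0.0) * 150.0      # brake pressure in bar
    return throttle, brake
```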
3) Reward function
One purpose of the reward-function design is to guide the agent to gradually complete the tasks the designer intends; an unreasonable reward design can cause interference between different objectives and lead the agent to satisfy one objective at the expense of another. For example, to avoid the penalty for excessive acceleration changes during speed adjustment, the vehicle may prefer to accept only the idling penalty and remain in place; or, accepting the overspeed penalty, it may rush toward the end point at high acceleration to avoid the long accumulated penalty incurred at normal speeds. For the intelligent connected vehicle studied in the invention, the main task in the signalless intersection scene is to track the global path while ensuring driving safety. The speed control range, driving comfort and efficiency are additional objectives built on the main task, and their reward and penalty weights must be adjusted according to their importance to avoid misleading the agent. Combining prior knowledge with experimental tuning, the reward function is divided into two parts: the vehicle driving-state reward and the interaction-safety reward.
3.1 Driving state rewards
The driving state reward mainly evaluates the rationality of the vehicle state during driving: whether the vehicle conforms to driving norms, whether the tracking control of the driving decision is completed, whether the speed changes smoothly, and so on. The driving state reward function is designed from the following four aspects.
3.1.1 Speed limit
The vehicle speed during driving should conform to road standards. Referring to the speed-limit requirements of actual roads, the maximum lane speed limit in the upstream stage of the scene is v_upper = 50 km/h (about 13.89 m/s), and the minimum speed limit is v_lower = 3.44 m/s. The speed reward of the vehicle in the transition zone and the departure zone is r_vel1, calculated as shown in formula (44):
where V is the current speed. Overspeed has a negative effect on driving safety, so different penalty strengths are applied to the different cases, with a heavier penalty for overspeed. Since the vehicle may stop and idle inside the intersection, only overspeed is penalized there; the speed reward of the vehicle in the intersection conflict area is r_vel2, calculated as shown in formula (45):
where v_limit is the intersection speed limit, 30 km/h (about 8.33 m/s).
Combining the above, the vehicle speed reward consists of the reward r_vel1 on the entrance and exit lanes and the reward r_vel2 inside the intersection, as shown in formula (46):
r_vel = r_vel1 + r_vel2 (46)
3.1.2 Speed tracking
After the decision layer obtains the desired passing speed in the transition zone, the low-level controller must complete the speed adjustment toward that target. To guide the vehicle to track the target speed, the reward function r_track_vel is set as shown in formula (47):
r_track_vel = -|V - v_goal| (47)
where V is the current vehicle speed and v_goal is the target desired speed.
3.1.3 Path tracking
The vehicle must complete tracking driving along the preset global path. To reflect the path-tracking performance during training, the reward function r_track_path is set as shown in formula (48):
r_track_path = -e (48)
Where e is the path tracking lateral error.
3.1.4 Comfort level
The jerk, i.e. the rate of change of acceleration per unit time, has a direct effect on driving comfort. A larger jerk indicates a more abrupt change in vehicle speed and a worse ride experience; conversely, a smaller jerk means a smoother ride, so jerk is often used as a measure of driving comfort. Let J denote the jerk; the comfort reward r_jerk is defined as shown in formula (49):
r_jerk = -J^2 (49)
Combining the four rewards above (3.1.1-3.1.4), the vehicle driving state reward r_drive is expressed as:
r_drive = ω_1·r_vel + ω_2·r_track_vel + ω_3·r_track_path + ω_4·r_jerk (50)
where ω_1, ω_2, ω_3 and ω_4 are the weighting coefficients of the respective rewards.
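A minimal Python sketch of the driving state reward of formula (50) follows. The piecewise speed reward stands in for formulas (44)-(45), which are not reproduced above, and the weights ω_1-ω_4 are example values only; both are assumptions of this sketch rather than the values used in the invention.

```python
def speed_reward(v, v_lower=3.44, v_upper=13.89, in_intersection=False, v_limit=8.33):
    """Illustrative stand-in for formulas (44)-(45): overspeed is penalized more
    heavily than underspeed; inside the intersection only overspeed is punished."""
    if in_intersection:
        return -(v - v_limit) if v > v_limit else 0.0
    if v > v_upper:
        return -2.0 * (v - v_upper)       # stronger penalty for overspeed (assumed factor)
    if v < v_lower:
        return -(v_lower - v)
    return 0.0

def driving_state_reward(v, v_goal, e, jerk, in_intersection,
                         w=(1.0, 0.5, 0.5, 0.1)):   # example weights omega_1..omega_4
    r_vel = speed_reward(v, in_intersection=in_intersection)
    r_track_vel = -abs(v - v_goal)        # formula (47)
    r_track_path = -e                     # formula (48), e = lateral tracking error
    r_jerk = -jerk ** 2                   # formula (49)
    return w[0]*r_vel + w[1]*r_track_vel + w[2]*r_track_path + w[3]*r_jerk   # formula (50)

print(driving_state_reward(v=15.0, v_goal=10.0, e=0.3, jerk=1.2, in_intersection=False))
```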
3.2 Driving safety rewards
The driving safety reward is based mainly on whether the interaction between the host vehicle and the environment vehicles remains safe during driving. The interaction safety reward function is designed according to both whether a collision occurs and the potential-collision risk indicator Gap Time. When the host vehicle collides with an environment vehicle, the collision reward is r_collision, with the value shown in formula (51):
r_collision = -100 (51)
When the Gap Time between the host vehicle and an environment vehicle is smaller than a certain threshold, the estimated times at which the two vehicles reach the conflict area are too close and the potential collision risk is high; the reward r_gap is therefore set as shown in formula (52):
Combining r_collision and r_gap, the interaction safety reward r_safe is obtained as shown in formula (53):
r_safe = r_collision + r_gap (53)
To keep the reward values within a reasonable range, r_drive is linearly scaled into a reasonable interval. Combining the driving state reward r_drive and the safety reward r_safe yields the agent training reward function r, calculated as shown in formula (54):
r = λ_1·r_drive + λ_2·r_safe (54)
where λ_1 and λ_2 are the coefficients of the driving state reward and the driving safety reward, respectively.
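Similarly, the interaction safety reward and the final combination of formulas (51)-(54) can be sketched as follows. The Gap-Time penalty shape and threshold are placeholders for formula (52), and the coefficients λ_1, λ_2 are example values; the text only states that r_drive is linearly scaled before the combination.

```python
def safety_reward(collided: bool, gap_time: float, gap_threshold: float = 2.0) -> float:
    """Formula (53): r_safe = r_collision + r_gap.
    r_collision = -100 on collision (formula (51)); the Gap-Time penalty below
    is an assumed linear form standing in for formula (52)."""
    r_collision = -100.0 if collided else 0.0
    r_gap = -(gap_threshold - gap_time) if gap_time < gap_threshold else 0.0
    return r_collision + r_gap

def total_reward(r_drive: float, r_safe: float,
                 lam1: float = 0.01, lam2: float = 1.0) -> float:
    """Formula (54): r = lambda_1 * r_drive + lambda_2 * r_safe.
    The coefficients here are illustrative; the text only requires r_drive to be
    scaled into a reasonable interval before the combination."""
    return lam1 * r_drive + lam2 * r_safe

r_safe = safety_reward(collided=False, gap_time=1.2)
print(total_reward(r_drive=-8.5, r_safe=r_safe))
```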
Step S6: and making a decision on the passing of each vehicle by using the vehicle decision motion planning model.
Based on the method and discussion above, the invention provides a signalless-intersection vehicle traffic decision planning system, which comprises a frame construction module, a risk acquisition module, a vehicle speed acquisition module, a vehicle state acquisition module, a decision model construction module and a networked-vehicle decision module.
The frame construction module is used for constructing the modeling scene of the networked signalless intersection and, based on it, constructing the steering-vehicle decision-motion planning framework under the multi-vehicle interaction working condition. The risk acquisition module is used for performing driving risk perception through the steering-vehicle decision-motion planning framework, defining risk levels according to the perceived accident severity, and obtaining the risk perception coefficients under different risk levels. The vehicle speed acquisition module is used for calculating the passing gap from the relative states of the interacting vehicles, obtaining the passing strategy from the passing gap, and obtaining the desired speed of each vehicle in the passing strategy with a particle swarm algorithm. The vehicle state acquisition module is used for selecting a target path point based on each vehicle position and the global path, and matching the desired speed with the target path point using the pure tracking algorithm to obtain a driving decision of continuous driving actions. The decision model construction module is used for training a reward and punishment strategy on the risk perception coefficient with the RA-SAC algorithm; the trained risk perception coefficient changes the gradient-update amplitude of the driving decision to yield the vehicle decision motion planning model. The networked-vehicle decision module is used for deciding the passing of each vehicle with the vehicle decision motion planning model.
The invention also provides a computer device comprising a memory and a processor; the memory stores a computer program which, when executed by the processor, causes the processor to execute the steps of the above signalless-intersection vehicle traffic decision planning method.
Example simulation results and analysis: A joint simulation platform is constructed, and within it a networked simulation scene is built using a signalless intersection from the inD dataset as the prototype, including road modeling, environment-vehicle trajectory calibration, collision-detection settings, and so on. The basic vehicle motion control module is completed, a basic vehicle reinforcement-learning framework is built and, combined with the perception and decision content, embedded into the host-vehicle motion planning decision layer. Finally, the control effect of the vehicle in this scene is verified, compared with other reinforcement-learning algorithms suitable for continuous-action tasks, and the performance of the method is evaluated.
1. On the basis of the joint simulation platform, the multi-vehicle signalless-intersection scene is built, the vehicle control module is embedded, and the reinforcement-learning model is designed; the experimental flow is shown in Fig. 10, and the steps are as follows:
Simulation scene construction: road parameters are set according to the alignment of the signalless-intersection roads, with the upstream and downstream road sections of the intersection being 200 meters; vehicle models are selected for the host vehicle and the environment vehicles, a basic vehicle control interface is generated in Simulink, and the driving paths, initial speeds and desired speed profiles of the environment vehicles are calibrated, completing the construction of the simulated road environment.
Vehicle decision and control module design: functions are designed to process the basic vehicle state information and to complete the definition of decision-layer quantities such as the risk perception coefficient and the desired speed; a steering control module is designed based on the pure tracking algorithm and a PID controller, and the longitudinal and steering control interfaces are set; a collision detection module, an interaction-stage judgment module and similar components are added to the scene as environment condition checks that assist the reinforcement-learning training.
Reinforcement learning model design: the reinforcement-learning agent is built on the RL Agent block in the Simulink environment; the states, actions, reward function and training termination conditions are defined, the state and action values of the agent are specified, the network structure is built, and the agent parameters and training parameters are set.
Model evaluation: the reinforcement-learning training of the agent is completed, and the performance of the agent is compared with that of other algorithms.
2. Algorithm training results and analysis. The Critic and Actor network settings of the RA-SAC algorithm are shown in Tables 5 and 6, respectively. The Critic network consists of 1 input layer, 3 fully connected layers and 1 output layer; the input size of the input layer depends on the numbers of states and actions. The intermediate layers are fully connected networks whose input and output sizes are the numbers of neurons; too many neurons can cause the gradient to vanish during training, so 64 and 32 neurons are mainly adopted as the fully connected layer sizes.
The Actor network consists of 1 input layer, 2 fully connected layers and 2 output layers; the input size of the input layer depends on the number of states. The fully connected layers use ReLU as the activation function, and output layers 1 and 2 output the two actions u and l_c, respectively, also with ReLU activations (an illustrative re-implementation of both networks is sketched after the tables below).
Table 5 Critic network settings
Name                  | Input size | Activation function | Output size
Input layer           | 14         | /                   | 14
Full connection layer | 14         | Relu                | 64
Full connection layer | 64         | Relu                | 32
Full connection layer | 32         | Relu                | 16
Output layer          | 16         | /                   | 1
Table 6 Actor network settings
Name                  | Input size | Activation function | Output size
Input layer           | 12         | /                   | 12
Full connection layer | 12         | Relu                | 64
Full connection layer | 64         | Relu                | 32
Output layer 1 (u)    | 32         | Relu                | 1
Output layer 2 (l_c)  | 32         | Relu                | 1
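The original networks are built with the RL Agent facilities of the Simulink environment; as an illustration in a different toolchain, the layer sizes of Tables 5 and 6 can be reproduced with the following PyTorch sketch. The ReLU activations listed for the Actor output layers are omitted here so that the action u can take negative values; this deviation, and the use of PyTorch itself, are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network matching Table 5: 14-dim input (12 states + 2 actions) -> 64 -> 32 -> 16 -> 1."""
    def __init__(self, state_dim: int = 12, action_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 1),                      # Q-value output, no activation
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    """Policy network matching Table 6: 12-dim state input -> 64 -> 32, then two
    one-dimensional heads for the actions u and l_c."""
    def __init__(self, state_dim: int = 12):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head_u = nn.Linear(32, 1)             # throttle/brake action u
        self.head_lc = nn.Linear(32, 1)            # basic forward-looking distance l_c

    def forward(self, state):
        h = self.trunk(state)
        return self.head_u(h), self.head_lc(h)

# Shape check with a dummy batch of 4 states (illustrative only)
s = torch.randn(4, 12)
a = torch.randn(4, 2)
print(Critic()(s, a).shape, [t.shape for t in Actor()(s)])
```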
After the RA-SAC network structure is designed, the agent parameters and training parameters are set; the model hyperparameter information is shown in Table 7, and the reinforcement-learning training is completed in the constructed simulation environment.
Table 7 RA-SAC Algorithm training Supermameters
TD3 and DDPG are deterministic policy algorithms that explore the action space by adding noise to randomize the actions; the degree of exploration can be tuned by adjusting the noise variance. The SAC, TD3 and DDPG algorithms are adopted as comparison methods, and results such as the training average reward and average number of steps are shown in Table 8. It is worth noting that the three comparison algorithms are not equipped with the perception decision layer, so they produce no risk perception coefficient or desired-speed decision results. The training reward curves of each algorithm are shown in Figs. 11-12.
As can be seen from the training results in Table 8, the average reward of RA-SAC is improved by 10.39% over the original SAC algorithm, and the average number of steps the vehicle spends in the training environment is reduced by 31.06%; with the designed speed guidance and risk perception, the training performance of the model is clearly improved. Compared with TD3 and DDPG, the average reward of the method is 13.32% lower than TD3 with 16.20% more average steps, and 48.67% higher than DDPG with 65.17% fewer average steps.
Table 8 Training results of different algorithms
Algorithm       | RA-SAC  | SAC     | TD3     | DDPG
Average reward  | -190.36 | -212.42 | -167.99 | -370.82
Average steps   | 495     | 718     | 426     | 1421
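The percentage comparisons quoted above can be checked directly from the Table 8 values; the small Python snippet below reproduces them (since all average rewards are negative, a positive reward difference means RA-SAC performs better).

```python
# Reproducing the comparison figures in the text from the Table 8 values
rewards = {"RA-SAC": -190.36, "SAC": -212.42, "TD3": -167.99, "DDPG": -370.82}
steps   = {"RA-SAC": 495,     "SAC": 718,     "TD3": 426,     "DDPG": 1421}

def pct(a, b):
    """Relative difference of a with respect to b, in percent."""
    return 100.0 * (a - b) / abs(b)

print(f"reward vs SAC : {pct(rewards['RA-SAC'], rewards['SAC']):+.2f}%")   # ~ +10.39%
print(f"steps  vs SAC : {pct(steps['RA-SAC'], steps['SAC']):+.2f}%")       # ~ -31.06%
print(f"reward vs TD3 : {pct(rewards['RA-SAC'], rewards['TD3']):+.2f}%")   # ~ -13.32% (lower than TD3)
print(f"steps  vs TD3 : {pct(steps['RA-SAC'], steps['TD3']):+.2f}%")       # ~ +16.20%
print(f"reward vs DDPG: {pct(rewards['RA-SAC'], rewards['DDPG']):+.2f}%")  # ~ +48.67%
print(f"steps  vs DDPG: {pct(steps['RA-SAC'], steps['DDPG']):+.2f}%")      # ~ -65.17%
```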
The convergence rates of the RA-SAC and SAC algorithms are clearly higher than those of the TD3 and DDPG algorithms, which shows that the stochastic policy gradient has certain advantages in practical models and can adapt quickly to a complex training environment.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by hardware associated with program instructions; the aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks or optical disks.
The present invention has been described in further detail with reference to specific preferred embodiments, and it should be understood by those skilled in the art that the present invention may be embodied with several simple deductions or substitutions without departing from the spirit of the invention.

Claims (10)

1. A traffic decision planning method for a signalless intersection, characterized by comprising the following steps:
constructing a modeling scene of the network-connected signalless intersection, and constructing a steering vehicle decision-motion planning frame under the multi-vehicle interaction working condition based on the modeling scene of the network-connected signalless intersection;
Performing driving risk perception through the steering vehicle decision-motion planning framework, defining risk levels according to perceived accident severity, and acquiring risk perception coefficients under different risk levels;
calculating a passing gap according to the relative states of the interactive vehicles, obtaining a passing strategy through the passing gap, and obtaining the expected speed of each vehicle in the passing strategy by adopting a particle swarm algorithm;
Selecting a target path point based on each vehicle position and the global path, and matching the expected vehicle speed with the target path point by using a pure tracking algorithm to obtain a driving decision of continuous driving action;
Carrying out reward and punishment strategy training on the risk perception coefficient by using an RA-SAC algorithm, and changing the gradient updating amplitude of the driving decision through the trained risk perception coefficient to obtain a vehicle decision motion planning model;
And making a decision on the passing of each vehicle by using the vehicle decision motion planning model.
2. A signalless intersection vehicle transit decision making method as recited in claim 1, wherein said steer vehicle decision-motion planning framework comprises: the system comprises a network connection no-signal intersection environment, a perception and decision module and a vehicle motion planning module.
3. A signalless intersection vehicle transit decision making method as claimed in claim 1, wherein said defining risk classes based on perceived accident severity and obtaining risk perception coefficients at different risk classes comprises the steps of:
carrying out risk classification on traffic conflict events according to collision avoidance acceleration thresholds, and calculating conditional probabilities under different risk classes;
and obtaining risk perception coefficients under each given state by adopting a Bayesian theory.
4. The signalless intersection vehicle traffic decision-making method of claim 3, wherein risk classification is performed on traffic collision events according to collision avoidance acceleration thresholds, and conditional probabilities under different risk classes are calculated, wherein the specific expression is:
wherein D_s, D_r and D_d are the rDRAC thresholds of the safe, risky and dangerous events, respectively, and σ represents a random variable; τ is the risk level, represented by the values 0, 1, 2, corresponding to the safe, risky and dangerous levels, respectively; the specific expression is: τ = {0, 1, 2}.
5. The method for traffic decision planning at signalless intersections of claim 4, wherein the risk perception coefficients in each given state are obtained by bayesian theory, and the specific expression is:
where ε is the risk perception coefficient, P(τ|D) is the posterior probability of a given risk level τ, P(τ) is the prior probability of the risk level, and P(D|τ) is the conditional probability under different risk levels.
6. The signalless intersection vehicle transit decision making method of claim 1, wherein the selecting a target waypoint based on each vehicle location and global path and matching the desired vehicle speed with the target waypoint using a pure tracking algorithm comprises the steps of:
Designing a steering tracking function module of the vehicle based on a pure tracking algorithm and a PID controller, selecting a target path point based on the current vehicle position and a global path, and adjusting the steering angle of the vehicle by combining the pure tracking algorithm and the PID controller;
and deciding a basic forward looking distance parameter in a pure tracking algorithm according to the current vehicle state, matching the expected vehicle speed with a target path point, and tracking the target path point.
7. A signalless intersection vehicle transit decision making method according to claim 1, wherein said obtaining a driving decision for driving a continuous motion comprises the steps of:
defining a vehicle passing process at a signalless intersection as a markov decision process;
determining, through a deep reinforcement learning method and based on the Markov decision process, the two motion-control parameters, namely the basic forward-looking distance and the throttle valve/brake pedal value;
Obtaining rewards through the actions taken under each decision, and obtaining the relation between the driving continuous actions and the current environment and the vehicle state through the rewards;
and obtaining a driving decision of the driving continuous action according to the relation between the current environment and the vehicle state by the driving continuous action.
8. The signalless intersection vehicle transit decision making method of claim 1, wherein said using RA-SAC algorithm to reward and punish the risk perception coefficient training strategy, changing the gradient update amplitude of the driving decision by the trained risk perception coefficient, comprising the steps of:
The potential collision risk is identified in the training process by changing the reward and punishment strength through the risk perception coefficient;
Putting the current vehicle driving decision into a corresponding environment for evaluation to obtain a reward and punishment result which accords with the influence of the vehicle action on the actual environment;
Changing the gradient update amplitude according to the potential collision risk of the current vehicle driving, and receiving feedback of different degrees according to the amplitude;
And obtaining optimal driving continuous actions through the feedback and the reward and punishment results, and updating driving decisions through the optimal driving continuous actions.
9. A signalless intersection vehicle transit decision making system, comprising:
the frame construction module is used for constructing a modeling scene of the network-connected signalless intersection and constructing a steering vehicle decision-motion planning frame under the multi-vehicle interaction working condition based on the modeling scene of the network-connected signalless intersection;
the risk acquisition module is used for carrying out driving risk perception through the steering vehicle decision-motion planning framework, defining risk levels according to the perceived accident severity, and acquiring risk perception coefficients under different risk levels;
The vehicle speed acquisition module is used for calculating a passing gap according to the relative states of the interactive vehicles, acquiring a passing strategy through the passing gap, and acquiring the expected speed of each vehicle in the passing strategy by adopting a particle swarm algorithm;
The state acquisition module is used for selecting a target path point based on each vehicle position and the global path, matching the expected vehicle speed with the target path point by using a pure tracking algorithm, and obtaining a driving decision of continuous driving action;
The decision model construction module is used for carrying out reward and punishment strategy training on the risk perception coefficient by using an RA-SAC algorithm, and changing the gradient update amplitude of the driving decision through the trained risk perception coefficient to obtain a vehicle decision motion planning model;
and the vehicle decision module is used for deciding the passing of each vehicle by using the vehicle decision motion planning model.
10. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of a signalless intersection vehicle transit decision planning method according to any one of claims 1 to 8.