CN113460090B - T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle


Info

Publication number
CN113460090B
Authority
CN
China
Prior art keywords
vehicle
control
setting condition
collision avoidance
control input
Prior art date
Legal status
Active
Application number
CN202110948176.XA
Other languages
Chinese (zh)
Other versions
CN113460090A (en
Inventor
侯晓慧
张俊智
何承坤
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110948176.XA
Publication of CN113460090A
Application granted
Publication of CN113460090B
Status: Active


Classifications

    • B60W 60/0016 Planning or execution of driving tasks specially adapted for safety of the vehicle or its occupants
    • B60W 30/08 Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
    • B60W 40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub-unit, e.g. by using mathematical models
    • G06F 30/15 Vehicle, aircraft or watercraft design (Geometric CAD)
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F 2119/02 Reliability analysis or reliability optimisation; Failure analysis, e.g. worst case scenario performance, failure mode and effects analysis [FMEA]


Abstract

The application relates to a T-shaped emergency collision avoidance control method, system, medium and equipment for an autonomous vehicle, comprising the following steps: calculating the control input of a rule-based optimal control problem from a preset vehicle model, reward function and initial state; while a first setting condition is met, updating the reinforcement-learning network parameters based on this control input until a second setting condition is met; and once the second setting condition is met, updating the reinforcement-learning network parameters with the TD3 Actor-Critic framework until a third setting condition is met, then outputting the optimal control quantity. The application can exploit the collision avoidance potential of the autonomous vehicle to the greatest extent and improve its performance in high-speed emergency avoidance and extreme driving conditions. The application can be widely applied in the technical field of active safety control for autonomous vehicles.

Description

T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle
Technical Field
The application relates to the technical field of active safety control for autonomous vehicles, and in particular to a T-shaped emergency collision avoidance control method, system, medium and equipment for an autonomous vehicle based on deep reinforcement learning.
Background
With the rapid development of the automotive industry, vehicle active safety faces growing challenges. Manufacturers at home and abroad have developed and deployed a variety of active safety systems, including the Anti-lock Braking System (ABS), Acceleration Slip Regulation (ASR) and the Electronic Stability Program (ESP). These systems help the driver avoid "abnormal" driving situations caused by the nonlinear dynamics of the vehicle, such as skidding, oversteering and understeering, mainly by restricting the driving state of the vehicle to a linear, stable range. From the viewpoint of vehicle controllability, however, this stability-oriented approach is too conservative: it is mainly suited to routine conditions and cannot cope with sudden scenarios and extreme driving conditions such as a T-shaped (side-impact) collision. Moreover, these active safety systems do not consider how to control the vehicle so as to reduce collision losses when a collision is unavoidable.
A T-shaped collision refers to one vehicle striking the side of another vehicle. T-shaped collisions often occur when a vehicle enters an intersection against a red light or a stop sign and collides with another vehicle travelling perpendicular to it. Such collisions may result from mechanical failure (a stuck throttle or failed brakes), insufficient braking force (wet or icy roads), driver inattention, and so on. Because the side structure of a car lacks energy-absorbing devices, T-shaped collisions cause greater injury and loss in traffic accidents than other collision modes. Accident data indicate that drivers in T-shaped crashes usually only brake, and that this is not the optimal action for avoiding the collision or mitigating its losses. Under such an emergency condition it is necessary to exploit the adhesion capability of the tires fully and to extend the driving limit of the vehicle as far as possible in order to avoid the collision or reduce the collision loss. Conventional collision avoidance strategies generally adopt a layered path-planning/tracking architecture in which constraints based on vehicle dynamics are added during path planning; these constraints may prevent the vehicle from fully developing its dynamic potential, or the planned path may be impossible to track, leading to instability. In professional motor racing, by contrast, the driver often deliberately controls wheel locking or slipping to cut corners or avoid obstacles, an operation known as "drifting". The essence of drifting is a critically stable equilibrium state: through precise control the vehicle is kept in an oversteered state in which the rear wheels reach the adhesion limit. An expert driver can simultaneously and precisely control both the sideslip of the vehicle and its travelled path during a drift, even though the vehicle operates entirely outside its stability limits.
Under adhesion-limit conditions the vehicle is a highly nonlinear system, and the braking, driving and steering controls are strongly coupled, which makes a coordinated control algorithm considerably more complex.
Disclosure of Invention
In view of these problems, the purpose of the application is to provide a T-shaped emergency collision avoidance control method, system, medium and equipment for an autonomous vehicle based on deep reinforcement learning, which can exploit the collision avoidance potential of the autonomous vehicle to the greatest extent and improve its performance in high-speed emergency avoidance and extreme driving conditions.
To achieve the above purpose, the application adopts the following technical scheme: a T-shaped emergency collision avoidance control method for an autonomous vehicle, comprising: calculating the control input of a rule-based optimal control problem from a preset vehicle model, reward function and initial state; while a first setting condition is met, updating the reinforcement-learning network parameters based on this control input until a second setting condition is met; and once the second setting condition is met, updating the reinforcement-learning network parameters with the TD3 Actor-Critic framework until a third setting condition is met, then outputting the optimal control quantity.
Further, the method further comprises the following steps: presetting a state space and an action space in a Markov decision model for T-shaped collision avoidance of the autonomous vehicle;
the state space contains all the information required for T-shaped emergency collision avoidance of the autonomous vehicle, including ego-vehicle state information and surrounding environment information;
the action space comprises the steering angle of the front wheels of the ego vehicle and the longitudinal slip ratios of its left and right rear wheels.
Further, the setting of the reward function includes: superposing a first-type reward and a second-type reward to form the reward;
the first-type reward is an instant reward given after each decision during the collision avoidance process;
the second-type reward is a termination-state reward given, after each training round ends, according to the different state modes of the ego vehicle; the different state modes of the ego vehicle include collision and rollover during the collision avoidance process.
Further, the calculating of the control input of the rule-based optimal control problem includes:
in the rule-based optimal control problem, the vehicle first brakes with full force and, after the set time, steers with full force so that it yaws to the greatest possible extent;
the control input vector is composed of the lateral and longitudinal forces of the current tires;
the objective function of the rule-based optimal control problem is set to the termination-state reward.
Further, the first setting condition is: episode ≤ i_control;
the second setting condition is: episode > i_control;
the third setting condition is: episode = i_max;
where episode is the sequence number of the current training round, i_control is the number of rounds used for learning optimal control, and i_max is the set maximum number of training rounds.
Further, the updating of the reinforcement-learning network parameters based on the control input includes:
obtaining a new measurement and the current reward value based on the control input, assembling the original measurement, the control input, the new measurement and the current reward value into a state-transition quadruple, and storing it in the experience pool;
sampling randomly from the experience pool, calculating the target values of the two evaluation networks in the TD3 Actor-Critic framework, and taking the minimum;
updating the evaluation network parameters by minimizing the loss function;
updating the action network by minimizing the difference between the optimal control input and the control produced by the action network, and then updating the target evaluation networks and the target action network.
Further, the updating of the reinforcement-learning network parameters based on the TD3 Actor-Critic framework comprises the following steps:
selecting a control input, obtaining a new measurement and the current reward value from it, assembling the original measurement, the control input, the new measurement and the current reward value into a state-transition quadruple, and storing it in the experience pool;
sampling randomly from the experience pool, calculating the target values of the two evaluation networks in the TD3 Actor-Critic framework, and taking the minimum;
updating the evaluation network parameters by minimizing the loss function;
updating the action network by the policy gradient method, and then updating the target evaluation networks and the target action network.
A T-shaped emergency collision avoidance control system for an autonomous vehicle, comprising a calculation module, a first updating module and a second updating module; the calculation module calculates the control input of the rule-based optimal control problem according to a preset vehicle model, reward function and initial state; the first updating module updates the reinforcement-learning network parameters based on this control input while the first setting condition is met, until the second setting condition is met; and the second updating module updates the reinforcement-learning network parameters based on the TD3 Actor-Critic framework when the second setting condition is met, until the third setting condition is met, and outputs the optimal control quantity.
A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described above.
A computing apparatus, comprising: one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods described above.
Due to the adoption of the technical scheme, the application has the following advantages:
1. The application uses deep reinforcement learning combined with prior knowledge to design the decision and control of T-shaped emergency collision avoidance for the autonomous vehicle in an integrated way. Compared with a layered path-planning/tracking control architecture, this control architecture can exploit the collision avoidance potential of the autonomous vehicle to the greatest extent and, even in extreme situations where a collision is unavoidable, produces a control plan that reduces the collision loss as much as possible, thereby improving the performance of the autonomous vehicle in high-speed emergency avoidance and extreme driving conditions.
2. The application combines a deep reinforcement learning algorithm incorporating prior knowledge, namely the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm combined with optimal control, in a T-shaped emergency collision avoidance control system designed for a distributed rear-wheel-drive autonomous vehicle, so that the vehicle can avoid the collision or reduce the collision loss to the greatest extent in a T-shaped emergency collision avoidance scenario.
Drawings
FIG. 1 is a schematic diagram of a T-shaped obstacle avoidance learning process of a vehicle based on a TD3 algorithm in an embodiment of the application;
FIG. 2 is a schematic representation of a vehicle dynamics model in an embodiment of the application;
FIG. 3 is a schematic view showing a combination of a collision position and a collision angle in an embodiment of the present application;
fig. 4 is a schematic diagram of a network structure of a TD3 action network according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a network architecture of a TD3 evaluation network in accordance with an embodiment of the application;
FIG. 6 is a schematic view of an initial state of T-shaped collision avoidance according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the round rewards of TD3 in an embodiment of the application;
FIG. 8 is a schematic view of a T-shaped collision avoidance path according to an embodiment of the present application;
fig. 9 is a schematic diagram of a computing device in accordance with an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which are obtained by a person skilled in the art based on the described embodiments of the application, fall within the scope of protection of the application.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The active safety systems and collision avoidance strategies currently applied in vehicles cannot handle extreme T-shaped collision conditions. Under such emergency conditions it is necessary to borrow from the drifting maneuvers of professional racing drivers and to extend the driving limit of the vehicle as far as possible in order to avoid the collision or mitigate the collision loss. The application discloses a T-shaped emergency collision avoidance control system for an autonomous vehicle based on deep reinforcement learning. It combines the twin delayed deep deterministic policy gradient algorithm with optimal control and provides an integrated design of the T-shaped collision avoidance decision and control system for a distributed rear-wheel-drive vehicle, exploiting the collision avoidance potential of the autonomous vehicle to the greatest extent, producing a control plan that reduces the collision loss as much as possible even in extreme situations where a collision is unavoidable, and improving the performance of the autonomous vehicle in high-speed emergency avoidance and extreme driving conditions. Training and test results demonstrate the feasibility of the proposed scheme and provide a new approach to T-shaped emergency collision avoidance control for autonomous vehicles.
In one embodiment of the present application, as shown in FIG. 1, a T-shaped emergency collision avoidance control method for an autonomous vehicle based on deep reinforcement learning is provided. This embodiment uses six deep neural networks: one action network π(s | θ^π), one target action network π(s | θ^π′), two evaluation networks Q_1 and Q_2, and two target evaluation networks Q_1′ and Q_2′. Because the T-shaped emergency collision avoidance scenario is dangerous, the training of the control model is completed in the MATLAB/Simulink simulation environment. In this embodiment, the method comprises the following steps:
Step 1: calculate the control input of the rule-based optimal control problem according to the preset vehicle model, reward function and initial state;
Step 2: while the first setting condition is met, update the reinforcement-learning network parameters based on this control input, until the second setting condition is met;
Step 3: when the second setting condition is met, update the reinforcement-learning network parameters using the TD3 Actor-Critic framework, until the third setting condition is met, and output the optimal control quantity. A schematic sketch of this two-phase training schedule is given below.
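Purely for illustration, the two-phase schedule of Steps 1 to 3 can be written as a simple training loop; the helper callables and their names are assumptions introduced here, not part of the patent, and serve only to make the episode-count switching explicit.

```python
def train(i_control, i_max, run_imitation_episode, run_td3_episode):
    """Two-phase schedule: imitate the rule-based optimal control for the first
    i_control episodes (Step 2), then switch to plain TD3 updates (Step 3)."""
    for episode in range(1, i_max + 1):
        if episode <= i_control:        # first setting condition: episode <= i_control
            run_imitation_episode(episode)
        else:                           # second setting condition: episode > i_control
            run_td3_episode(episode)
    # loop ends when episode = i_max (third setting condition)
```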
The control method in this embodiment further includes a step of presetting the state space and the action space of the Markov decision model for T-shaped collision avoidance of the autonomous vehicle.
Specifically, the state space S, the action space A and the reward function R of the Markov decision model for T-shaped collision avoidance of the autonomous vehicle are constructed, where:
(1) State space S
The state space contains all the information required for T-shaped emergency collision avoidance of the autonomous vehicle, including ego-vehicle state information and surrounding environment information, as shown below:
S = [x_e, x_r]^T
x_e = [V_x, V_y, ω, X_e, Y_e, ψ, M]^T
x_r = [X_r, Y_r, c_eX, c_eY, c_rX, c_rY]^T
where x_e and x_r are the ego-vehicle state information and the surrounding environment information respectively. V_x, V_y and ω are the longitudinal speed, lateral speed and yaw rate of the ego vehicle in the vehicle coordinate system; X_e, Y_e and ψ are the centroid position and yaw angle of the ego vehicle in the geodetic coordinate system. M is the current vehicle state mode: 1 - no collision, 2 - collision, 3 - collision avoidance completed, 4 - rollover during collision avoidance. X_r and Y_r give the centroid position of the other vehicle in the geodetic coordinate system. (c_eX, c_eY) and (c_rX, c_rY) are the coordinates, in the geodetic coordinate system, of the points on the ego vehicle and on the other vehicle whose connecting line is the minimum distance between the two vehicles; they exist only in the non-collision state. In this embodiment, the T-shaped collision avoidance strategy is described taking a scenario with a stationary other vehicle as an example.
(2) Action space A
The action space contains the following three elements:
A = [δ, λ_3, λ_4]^T
where δ is the steering angle of the front wheels of the ego vehicle, and λ_3 and λ_4 are the longitudinal slip ratios of its left and right rear wheels respectively, with δ ∈ [-30°, 30°], λ_3 ∈ [-1, 1] and λ_4 ∈ [-1, 1].
In this embodiment, the T-shaped collision avoidance strategy is designed for a distributed rear-wheel-drive ego vehicle. So that the vehicle can sideslip more easily and thus avoid the collision, or reduce the collision loss, under limit conditions, the front/rear braking force distribution coefficient is set to 0:1, i.e. braking force is generated only at the rear wheels, imitating the way a professional driver uses the handbrake to complete a drift in a real driving environment. Based on the control quantity [δ, λ_3, λ_4]^T, and by combining the vehicle dynamics model and the tire model, the longitudinal and lateral forces of the corresponding tires and the current motion state of the vehicle can be obtained; the state and action spaces are illustrated in the sketch below.
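For illustration only, the 13-dimensional state and the bounded 3-dimensional action described above could be represented as follows; the variable names are assumptions introduced here, not notation from the patent.

```python
import numpy as np

# State: [Vx, Vy, yaw_rate, Xe, Ye, psi, M, Xr, Yr, c_eX, c_eY, c_rX, c_rY]  (13-dim)
STATE_DIM = 13

# Action: [front-wheel steering angle delta (rad), rear-left slip ratio, rear-right slip ratio]
ACTION_LOW = np.array([-np.deg2rad(30.0), -1.0, -1.0])
ACTION_HIGH = np.array([np.deg2rad(30.0), 1.0, 1.0])

def clip_action(a):
    """Keep a raw network output inside the admissible action space A."""
    return np.clip(np.asarray(a, dtype=float), ACTION_LOW, ACTION_HIGH)
```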
In this embodiment, a two-track, three-degree-of-freedom vehicle dynamics model is employed, as shown in FIG. 2.
The dynamics equations and their coefficient matrix B are not reproduced here. In these equations, ψ is the yaw angle of the vehicle and its second time derivative is the yaw acceleration; the longitudinal and lateral accelerations of the vehicle also appear; m is the vehicle mass; I_z is the yaw moment of inertia of the vehicle; L_a and L_b are the distances from the center of mass to the front and rear axles respectively; L_w is half the track width; F_xj and F_yj denote the tangential (longitudinal) and lateral tire-ground forces of wheel j, where j = 1, 2, 3, 4 denotes the front-left, front-right, rear-left and rear-right wheels respectively; and F_roll and F_air are the rolling resistance and the air resistance of the vehicle:
F_roll = f m g
where f is the rolling resistance coefficient and g is the gravitational acceleration; F_air is the aerodynamic drag, determined by the air density ρ, the air resistance coefficient C_d, the frontal cross-sectional area A of the vehicle and the vehicle speed.
The tire model uses a look-up table based on experimental data. The tire test data are collected under pure-slip-ratio or pure-sideslip conditions, whereas in reality the tire force is the resultant of the lateral force and the traction force, which influence each other. The model therefore uses the Pacejka tire model, which accounts for the longitudinal-lateral coupling characteristics, to combine the two force components of the experimental data on a friction ellipse and correct the look-up data. Finally, from the longitudinal slip ratio λ_i, the sideslip angle α_i and the vertical force F_zi of each tire, the longitudinal force F_xi and the lateral force F_yi of the tire (i = 1, 2, 3, 4) are obtained by table look-up, i.e.
F_xi = T_1(λ_i, α_i, F_zi)
F_yi = T_2(λ_i, α_i, F_zi)
where T_1 and T_2 denote the mapping from the longitudinal slip ratio λ_i, the sideslip angle α_i and the vertical force F_zi to the tire longitudinal force F_xi and lateral force F_yi respectively.
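A minimal sketch of such a table look-up is shown below, assuming the corrected test data have been gridded over (λ, α, F_z); the use of scipy interpolation, the grid ranges and the table names are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical data grids: slip ratio, sideslip angle (rad), vertical load (N)
lam_axis = np.linspace(-1.0, 1.0, 41)
alpha_axis = np.deg2rad(np.linspace(-20.0, 20.0, 41))
fz_axis = np.linspace(1000.0, 8000.0, 15)
FX_TABLE = np.zeros((41, 41, 15))   # would hold the corrected F_x test data
FY_TABLE = np.zeros((41, 41, 15))   # would hold the corrected F_y test data

_T1 = RegularGridInterpolator((lam_axis, alpha_axis, fz_axis), FX_TABLE,
                              bounds_error=False, fill_value=None)
_T2 = RegularGridInterpolator((lam_axis, alpha_axis, fz_axis), FY_TABLE,
                              bounds_error=False, fill_value=None)

def tire_forces(lam, alpha, fz):
    """Return (F_x, F_y) for one tire by interpolating the look-up tables T_1, T_2."""
    p = np.array([[lam, alpha, fz]])
    return float(_T1(p)[0]), float(_T2(p)[0])
```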
The sideslip angle α_i of each wheel is computed from the vehicle motion states and geometry (the detailed expressions are not reproduced here), where the total speed of the vehicle is V = sqrt(V_x² + V_y²) and β = arctan(V_y / V_x) is the centroid sideslip angle of the vehicle.
The vertical load F_zi of each wheel is computed from the vehicle parameters and accelerations (the detailed expressions are not reproduced here), where h_g is the height of the center of mass of the vehicle.
(3) Reward function R
The setting of the reward function includes superposing a first-type reward and a second-type reward. The first type is the instant reward given after each decision during the collision avoidance process; the second type is the termination-state reward given, after each training round ends, according to the final state mode of the ego vehicle, where the state modes of the ego vehicle include collision and rollover during the collision avoidance process.
Specifically, under the TD3 framework the agent learns how to interact with the environment only according to the definition of the reward function, so as to maximize it; the design of the reward function therefore directly determines the control performance of the agent. The reward function must define rewards and penalties for the corresponding actions in different driving states; if the definition is unclear, the model easily fails to converge or converges to a locally optimal solution. There are two types of reward in the T-shaped emergency collision avoidance problem of the autonomous vehicle, denoted R_i and R_t respectively. The first type, R_i, is an instant reward given to the agent after each decision step during collision avoidance; its purpose is to overcome the sparsity of rewards in reinforcement learning and to accelerate the learning of the agent. The second type, R_t, is a termination-state reward given according to the final state mode of the vehicle after each training round ends. The three possible outcomes are a collision, completed collision avoidance, and rollover during the collision avoidance process. The definition of each reward term is described in detail below.
(31) Instant reward R_i
A well-designed instant reward helps the agent learn faster and converge more stably. The instant reward mainly considers the following aspects:
(311) Relative velocity term R_i1
The relative velocity term R_i1 encourages the relative speed of the ego vehicle with respect to the other vehicle to be as small as possible, thereby reducing the potential collision or collision loss. R_i1 is defined in terms of D, ΔV and k_1, where D is the minimum relative distance between the ego vehicle and the other vehicle, ΔV is the component of their relative velocity along the direction of D, and k_1 is a negative constant used to adjust the reward weight of the relative velocity term.
(312) Relative heading angle term R_i2
Accident studies report that when the two vehicle bodies are relatively parallel at the moment of collision, the remaining kinetic energy is distributed over a larger contact area, which lessens the impact of the collision. R_i2 is therefore defined in terms of the yaw angles of the two vehicles, where k is any integer, k_2 is a negative constant used to adjust the reward weight of the relative heading angle term, and ψ is the yaw angle of the ego vehicle; in this example the other vehicle is stationary and its yaw angle is the constant π/2.
(313) Input magnitude and rate-of-change term R_i3
The inputs of the agent are the three elements of the action space:
A = [δ, λ_3, λ_4]^T
where δ is the steering angle of the front wheels of the ego vehicle and λ_3, λ_4 are the longitudinal slip ratios of its left and right rear wheels, with δ ∈ [-30°, 30°], λ_3 ∈ [-1, 1] and λ_4 ∈ [-1, 1]. The magnitudes of the inputs and of their rates of change are negatively correlated with the reward: the smaller the inputs and their rates of change, the more easily the vehicle stays within the linearly stable region and the less prone it is to instability. R_i3 is defined in terms of these input magnitudes and rates of change, where k_3 and k_4 are negative constants used to adjust the corresponding reward weights.
(32) Termination-state reward R_t
When the T-shaped emergency collision avoidance reaches a termination state, the training round ends and a termination-state reward is given according to the state mode of the vehicle. The termination state has three possible outcomes: collision avoidance is completed, a collision occurs, or the vehicle rolls over during the collision avoidance process.
In the definition of R_t, k_5 is a positive constant, and a large reward is given when the vehicle completes the T-shaped collision avoidance without collision or rollover; k_6 is a negative constant, and a large penalty is given when the vehicle rolls over during collision avoidance; R_tc is the reward given when the ego vehicle finally collides with the other vehicle, and its magnitude reflects the severity of the collision, depending on a combination of factors including the collision speed and the collision position and angle. R_tc is expressed as
R_tc = k_7 + R_tc1 + R_tc2
where k_7 is a negative constant representing a basic penalty for any collision, R_tc1 is the collision velocity term and R_tc2 is the collision position and angle term. The definition of R_tc is detailed below.
(321) Collision velocity term R_tc1
In this embodiment the other vehicle is assumed to be stationary, so the greater the speed of the ego vehicle before the collision, the greater the kinetic energy it carries and the greater the collision loss. R_tc1 is therefore defined in terms of the collision speed, where k_8 is a negative constant used to adjust the reward weight of the relative collision velocity term.
(322) Collision position and angle term R_tc2
The collision position and angle, i.e. the area and direction of the interaction forces between the colliding vehicles, directly affect how the collision energy is transferred and are therefore important factors in the severity of the collision.
The collision position is usually the most severely damaged area of the vehicle body; because different parts of the vehicle differ in structure, material and collision deformation, the collision position has a large influence on the collision loss. Based on a statistical analysis of vehicle collision accidents, the collision position I_p can be divided into several regions.
The collision angle is the angle between the longitudinal axes of the two vehicles at the moment of collision. According to the statistical analysis of vehicle collision accidents, the collision angle I_a is divided into six regions from 0° to 180°: 0°±5° (180°±5°), 20°±15°, 50°±15°, 90°±25°, 130°±15° and 160°±15°. These six regions are then merged according to their effect.
The collision position and the collision angle are coupled, and different combinations of collision states correspond to different collision severities. The combinations of collision position and collision angle are shown in FIG. 3, and the reward value R_tc2 corresponding to each collision state is expressed in terms of k_9 and β_i, where k_9 is a negative constant used to adjust the reward weight of the collision position and angle term, and β_i are the coefficients corresponding to the different combinations of collision position and collision angle in FIG. 3.
Combining all of the above factors, the final reward function R of the agent is
R = R_i + R_t
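Purely as an illustration of how these reward terms could be assembled in code: the specific functional forms of R_i1, R_i2, R_i3, R_tc1 and R_tc2 are defined by the patent's formulas and FIG. 3 and are only stubbed here, and all constants and names are assumptions.

```python
import numpy as np

K1 = K2 = K3 = K4 = -0.1            # negative weights of the instant-reward terms (illustrative)
K5, K6, K7 = 100.0, -100.0, -50.0   # termination reward/penalties (illustrative)

def instant_reward(delta_v_along_d, heading_term, action, action_rate):
    """R_i = R_i1 + R_i2 + R_i3 with stubbed (assumed) linear forms."""
    r_i1 = K1 * delta_v_along_d                      # relative velocity term
    r_i2 = K2 * heading_term                         # relative heading angle term
    r_i3 = K3 * np.abs(action).sum() + K4 * np.abs(action_rate).sum()  # input magnitude / rate
    return r_i1 + r_i2 + r_i3

def terminal_reward(mode, r_tc1=0.0, r_tc2=0.0):
    """R_t chosen by the final state mode M of the ego vehicle."""
    if mode == 3:                    # collision avoidance completed
        return K5
    if mode == 4:                    # rollover during collision avoidance
        return K6
    if mode == 2:                    # collision occurred: R_tc = K7 + R_tc1 + R_tc2
        return K7 + r_tc1 + r_tc2
    return 0.0                       # round not yet terminated
```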
In the above embodiment, the network parameters of TD3 are initialized before the reinforcement-learning network parameters are updated. Specifically:
the parameters θ^π of the action network and the parameters of the two evaluation networks are randomly initialized; the parameters of the target action network and of the target evaluation networks are initialized by assigning them the corresponding values; and the experience pool D is constructed.
The network structure of the action network is shown in FIG. 4 and consists of an input layer, two hidden layers and an output layer. The input state is 13-dimensional, the first hidden layer consists of 400 neurons, the second hidden layer consists of 300 neurons, and the control output layer is 3-dimensional. The activation function of each hidden layer is the rectified linear unit (ReLU), and the activation function of the control output layer is the hyperbolic tangent (Tanh), which limits the magnitude of the control quantity.
The network structure of the evaluation network is shown in FIG. 5 and consists of two input layers, three hidden layers and an output layer. The state input is 13-dimensional and the control input is 3-dimensional; the first hidden layer consists of 400 neurons, the second hidden layer of 300 neurons, and the output is the 1-dimensional action-value function. The state input layer and the control input layer skip the first hidden layer and connect directly to the second hidden layer. The activation function of each hidden layer is the rectified linear unit (ReLU), and the activation function of the output layer is the identity.
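A minimal PyTorch sketch consistent with the stated dimensions (13-dimensional state, 3-dimensional action, hidden layers of 400 and 300 neurons) is given below for illustration; the exact wiring of the skip connection in FIG. 5 is not fully specified by the text, so the common layout in which the action joins the network at the second hidden layer is assumed, and only the hidden layers whose sizes are stated are modelled.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """pi(s | theta_pi): 13-d state -> 3-d control, Tanh output bounds the control quantity."""
    def __init__(self, state_dim=13, action_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 400), nn.ReLU(),
            nn.Linear(400, 300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, u): the state passes a 400-unit layer, the action joins at the 300-unit layer."""
    def __init__(self, state_dim=13, action_dim=3):
        super().__init__()
        self.fc_s = nn.Linear(state_dim, 400)
        self.fc_h = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, s, u):
        h = torch.relu(self.fc_s(s))
        h = torch.relu(self.fc_h(torch.cat([h, u], dim=-1)))
        return self.out(h)              # identity activation on the output layer
```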
In the above embodiment, the first setting condition is: episode ≤ i_control; the second setting condition is: episode > i_control; and the third setting condition is: episode = i_max, where episode is the sequence number of the current training round, i_control is the number of rounds used for learning optimal control, and i_max is the set maximum number of training rounds.
In the above embodiment, the preset initial state is shown in FIG. 6.
In this embodiment, the initial state measurement s_0 is set accordingly, and the initial action is
[δ, λ_3, λ_4]^T = [0, 0, 0]^T.
The overall length and width of the ego vehicle and of the other vehicle are set respectively as
[L_e, W_e, L_r, W_r]^T = [3.5 m, 1.66 m, 8 m, 3 m]^T.
In the above embodiment, in Step 1, the rule-based optimal control problem is that the vehicle first brakes with full force and, after a set time, steers with full force so that it yaws to the greatest possible extent; the control input vector is composed of the lateral and longitudinal forces of the current tires; and the objective function of the rule-based optimal control problem is set to the termination-state reward.
In this embodiment, to convert the T-shaped emergency collision avoidance problem into a rule-based optimal control problem, a rule-based collision avoidance behavior policy is set according to the operating experience of drivers performing emergency collision avoidance. During T-shaped collision avoidance the ego vehicle first brakes with full force and, after a set time t_0, steers with full force so that the vehicle yaws to the greatest possible extent, enabling it to avoid the collision or to reduce the collision loss as much as possible in the T-shaped emergency collision avoidance scenario. The control optimization model is described as follows:
When t ≤ t_0, the rear axle of the vehicle brakes with full force (it is assumed that braking/driving force is provided only by the rear wheels). According to the vehicle model adopted in this embodiment, the control input vector u_control is then
u_control = [F_y1, F_y2, F_y3, F_y4, F_x3, F_x4]^T = [0, 0, 0, 0, μF_z3, μF_z4]^T
where μ is the road adhesion coefficient, F_zi (i = 1, 2, 3, 4) can be derived from the tire vertical-force formula of the vehicle model, and μF_zi is the maximum tire force that can be provided under the adhesion constraint.
When t>t 0 As can be seen from the initial state and the reward function corresponding to the collision position and angle items shown in fig. 6, the vehicle should take a left turn and the final Y-axis displacement is as large as possible, so as to avoid collision or reduce collision loss to the greatest extent. At this time:
δ=δ max =30°
the tire slip angle formula described by the vehicle model can be used for obtaining the slip angle alpha of the front axle two wheels 1 And alpha 2 Then, the side force of the front axle two wheels is obtained by a table look-up method (the longitudinal slip rate of the front axle two wheels is assumed to be 0):
the two wheels of the rear axle respectively provide maximum longitudinal forces in opposite directions, so that the vehicle can perform yaw movement to the greatest extent under the moment and steering action. At this time, the input vector u is controlled control The method comprises the following steps:
u control =[F y1 ,F y2 ,F y3 ,F y4 ,F x3 ,F x4 ] T =[T 2 (0,α 1 ,F z1 ),T 2 (0,α 2 ,F z2 ),0,0,-μF z3 ,μF z4 ] T
The objective function J is set to the termination-state reward R_t:
J = R_t
The only variable in this optimization problem is t_0; once t_0 is determined, the real-time control input u_control and the motion state of the vehicle throughout the whole collision avoidance process are also determined. Therefore, the t_0 that maximizes the objective function J can be found by iteration in the MATLAB/Simulink simulation software, as sketched below.
In the above embodiment, in Step 2, while the first setting condition episode ≤ i_control is satisfied, the reinforcement-learning network parameters are updated based on the optimal control input. This specifically comprises the following steps:
Step 21: obtain a new measurement and the current reward value based on the control input, assemble the original measurement, the control input, the new measurement and the current reward value into a state-transition quadruple, and store it in the experience pool.
Specifically: the control input u_t of the rule-based optimal control problem is calculated by combining the vehicle model, the reward function and the initial state. During reinforcement-learning training, the control quantity u_t is executed to obtain the new measurement s_{t+1} and the current reward value r_t, and the state-transition quadruple (s_t, u_t, r_t, s_{t+1}) is stored in the experience pool D.
Step 22, randomly sampling in an experience pool, calculating target values of two evaluation networks in an Actor-Critic framework of TD3, and taking a minimum value;
the method comprises the following steps: randomly sampling N groups of data in the experience pool D, calculating target values of two evaluation networks, and taking the minimum value:
step 23, updating the evaluation network parameters by minimizing the loss function:
Step 24: update the action network by minimizing the difference between the optimal control input and the control produced by the action network, and then update the target evaluation networks and the target action network.
Specifically: every d rounds, the action network is updated by minimizing the difference between the optimal control input and the control produced by the action network, where f(·) is the mapping function from the output of the current action network, π(s_t | θ^π) = [δ, λ_3, λ_4]^T, to the control input determined by the optimal control problem; this mapping can be determined from the vehicle dynamics equations and the table look-up method.
The target evaluation networks and the target action network are then updated:
θ^π′ ← τ θ^π + (1 - τ) θ^π′
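A hedged PyTorch-style sketch of this imitation phase (Steps 21 to 24) is given below, using the Actor and Critic classes sketched earlier; the mapping f(·) is passed in as a callable and is assumed differentiable here, the delayed (every d rounds) actor update is omitted, and all hyperparameter values are assumptions rather than values from the patent.

```python
import torch
import torch.nn.functional as F

def imitation_update(batch, actor, critic1, critic2, target_actor,
                     target_critic1, target_critic2, opt_actor, opt_critics,
                     f_map, u_optimal, gamma=0.99, tau=0.005):
    s, u, r, s_next = batch                      # tensors sampled from the experience pool D

    # Twin target values; take the element-wise minimum (TD3)
    with torch.no_grad():
        u_next = target_actor(s_next)
        q_next = torch.min(target_critic1(s_next, u_next),
                           target_critic2(s_next, u_next))
        y = r + gamma * q_next

    # Update both evaluation networks by minimizing the TD loss
    critic_loss = F.mse_loss(critic1(s, u), y) + F.mse_loss(critic2(s, u), y)
    opt_critics.zero_grad(); critic_loss.backward(); opt_critics.step()

    # Update the action network by minimizing the difference between f(actor output)
    # and the optimal control input (imitation of the rule-based solution)
    actor_loss = F.mse_loss(f_map(actor(s)), u_optimal)
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

    # Soft update of the target networks
    for net, tgt in [(actor, target_actor), (critic1, target_critic1), (critic2, target_critic2)]:
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```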
in the above embodiment, when the second setting condition epi code is satisfied in step 3>i control The method for updating the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 comprises the following steps:
Step 31: select a control input, obtain a new measurement and the current reward value from it, assemble the original measurement, the control input, the new measurement and the current reward value into a state-transition quadruple, and store it in the experience pool.
Specifically: a control quantity u_t = π(s_t | θ^π) + ε is selected according to the action network policy and the exploration policy, where ε is exploration noise.
From the control quantity u_t, the new measurement s_{t+1} and the current reward value r_t are obtained, and the state-transition quadruple (s_t, u_t, r_t, s_{t+1}) is stored in the experience pool D.
Step 32: sample randomly from the experience pool, calculate the target values of the two evaluation networks in the TD3 Actor-Critic framework, and take the minimum.
Specifically: N groups of data are randomly sampled from the experience pool D, the target values of the evaluation networks are calculated, and the minimum is taken.
Step 33: update the evaluation network parameters by minimizing the loss function.
Step 34: update the action network by the policy gradient method, and then update the target evaluation networks and the target action network.
Specifically: every d rounds, the action network is updated by the policy gradient algorithm,
and the target evaluation networks and the target action network are updated:
θ^π′ ← τ θ^π + (1 - τ) θ^π′
This continues until the third setting condition episode = i_max is satisfied; a sketch of this phase follows.
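For completeness, a hedged sketch of this plain TD3 phase (Steps 31 to 34); the target-policy smoothing noise, the policy delay and all other hyperparameter values are assumptions rather than values disclosed in the patent.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critic1, critic2, target_actor,
               target_critic1, target_critic2, opt_actor, opt_critics, step,
               gamma=0.99, tau=0.005, policy_delay=2, noise_std=0.2, noise_clip=0.5):
    s, u, r, s_next = batch

    # Target action with clipped smoothing noise; twin targets, take the minimum
    with torch.no_grad():
        noise = (torch.randn_like(u) * noise_std).clamp(-noise_clip, noise_clip)
        u_next = (target_actor(s_next) + noise).clamp(-1.0, 1.0)
        y = r + gamma * torch.min(target_critic1(s_next, u_next),
                                  target_critic2(s_next, u_next))

    # Update both evaluation networks by minimizing the TD loss
    critic_loss = F.mse_loss(critic1(s, u), y) + F.mse_loss(critic2(s, u), y)
    opt_critics.zero_grad(); critic_loss.backward(); opt_critics.step()

    # Delayed actor update by the deterministic policy gradient, then soft target update
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
        for net, tgt in [(actor, target_actor), (critic1, target_critic1), (critic2, target_critic2)]:
            for p, p_t in zip(net.parameters(), tgt.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```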
In summary, FIG. 7 and FIG. 8 show the effect of the proposed deep-reinforcement-learning-based T-shaped emergency collision avoidance control method for an autonomous vehicle after training and testing in the simulation environment.
FIG. 7 shows the round-reward curve of the TD3 algorithm during learning, where the grey curve is the actual reward of each round and the dark curve is the average reward over every 200 rounds. As can be seen from FIG. 7, the return values obtained in the first 8000 rounds show an overall increasing trend as the number of rounds grows, which indicates that the control capability of the algorithm improves through the interaction process. The return values obtained in rounds 8000-12000 gradually level off, indicating that the algorithm has obtained a near-optimal policy at the end of training.
FIG. 8 shows the T-shaped collision avoidance trajectory. Under the set initial state, the collision cannot be avoided in this extreme condition, but the ego vehicle yaws through steering so that, at the moment of collision with the other vehicle, the two vehicle bodies are essentially parallel; this increases the collision contact area and reduces the collision loss.
In one embodiment of the present application, a T-shaped emergency collision avoidance control system for an autonomous vehicle is provided, comprising a calculation module, a first updating module and a second updating module;
the calculation module is used to calculate the control input of the rule-based optimal control problem according to the preset vehicle model, reward function and initial state;
the first updating module is used to update the reinforcement-learning network parameters based on this control input while the first setting condition is met, until the second setting condition is met;
and the second updating module is used to update the reinforcement-learning network parameters based on the TD3 Actor-Critic framework when the second setting condition is met, until the third setting condition is met, and to output the optimal control quantity.
The system provided in this embodiment is used to execute the above method embodiments, and specific flow and details refer to the above embodiments, which are not described herein.
As shown in fig. 9, a schematic structural diagram of a computing device provided in an embodiment of the present application, where the computing device may be a terminal, and may include: a processor (processor), a communication interface (Communications Interface), a memory (memory), a display screen, and an input device. The processor, the communication interface and the memory communicate with one another through a communication bus. The processor is configured to provide computing and control capabilities. The memory includes a non-volatile storage medium storing an operating system and a computer program which, when executed by the processor, implements the control method; the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The communication interface is used for wired or wireless communication with an external terminal; the wireless mode can be realized through WIFI, a carrier network, NFC (near field communication) or other technologies. The display screen can be a liquid crystal display or an electronic ink display; the input device can be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computing device, or an external keyboard, touch pad or mouse. The processor may call logic instructions in the memory to perform the following method:
calculating the control input quantity of the optimal control problem based on the rule according to a preset vehicle model, a reward function and an initial state; when the first setting condition is met, updating the network parameters of reinforcement learning based on the control input quantity until the second setting condition is met; and when the second setting condition is met, updating the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 until the third setting condition is met, and outputting the optimal control quantity.
Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It will be appreciated by those skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting of the computing devices to which the present inventive arrangements may be applied, and that a particular computing device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment of the present application, there is provided a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the method embodiments described above, for example comprising: calculating the control input quantity of the optimal control problem based on the rule according to a preset vehicle model, a reward function and an initial state; when the first setting condition is met, updating the network parameters of reinforcement learning based on the control input quantity until the second setting condition is met; and when the second setting condition is met, updating the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 until the third setting condition is met, and outputting the optimal control quantity.
In one embodiment of the present application, there is provided a non-transitory computer-readable storage medium storing server instructions that cause a computer to perform the methods provided by the above embodiments, for example, including: calculating the control input quantity of the optimal control problem based on the rule according to a preset vehicle model, a reward function and an initial state; when the first setting condition is met, updating the network parameters of reinforcement learning based on the control input quantity until the second setting condition is met; and when the second setting condition is met, updating the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 until the third setting condition is met, and outputting the optimal control quantity.
The foregoing embodiment provides a computer readable storage medium, which has similar principles and technical effects to those of the foregoing method embodiment, and will not be described herein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (7)

1. A T-shaped emergency collision avoidance control method for an autonomous vehicle, comprising:
calculating the control input quantity of the optimal control problem based on the rule according to a preset vehicle model, a reward function and an initial state;
when the first setting condition is met, updating the network parameters of reinforcement learning based on the control input quantity until the second setting condition is met;
updating the network parameters of reinforcement learning based on an Actor-Critic framework of TD3 when the second setting condition is met until a third setting condition is met, and outputting the optimal control quantity;
the setting of the reward function comprises: superimposing first-type rewards and second-type rewards to form the reward;
the first-type reward is an instant reward given after each decision in the collision avoidance process;
the second-type reward is a termination-state reward given, after each training round is finished, according to different state modes of the self-vehicle; the different state modes of the self-vehicle comprise collision and rollover in the collision avoidance process;
the calculating the control input quantity of the rule-based optimal control problem comprises:
the rule-based optimal control problem is that the vehicle first brakes at full force and, after the set time, steers at full force so as to produce the greatest possible yaw motion;
the control input quantity consists of the lateral force and the longitudinal force of the current tires;
setting the objective function of the rule-based optimal control problem as the termination-state reward;
the first setting condition is: epinode is less than or equal to i control
The second setting condition is: epi code>i control
The third setting condition is: epi code=i max
The epinode is the current trainingNumber of sequences of training, i control The number of sequences for learning optimal control; i.e max Is the set maximum training round number.
2. The control method as set forth in claim 1, further comprising: presetting a state space and an action space in a Markov decision model based on T-shaped collision avoidance of an automatic driving vehicle;
the state space contains all information required for T-shaped emergency collision avoidance of the automatic driving vehicle, including self-vehicle state information and surrounding environment information;
the action space comprises the steering angle of the front wheels of the self-vehicle and the longitudinal slip ratios of the left and right rear wheels of the self-vehicle.
3. The control method of claim 1, wherein updating the network parameters of reinforcement learning based on the control input quantity comprises:
obtaining a new measured value and a current reward value based on the control input quantity, forming a four-element state transition from the original measured value, the control input quantity, the new measured value and the current reward value, and storing the four elements in an experience pool;
randomly sampling from the experience pool, calculating the target values of the two evaluation networks in the Actor-Critic framework of TD3, and taking the minimum value;
updating the evaluation network parameters by minimizing the loss function;
updating the action network by minimizing the difference between the optimal control input quantity and the control quantity output by the action network, and then updating the target evaluation network and the target action network.
4. The control method of claim 1, wherein updating the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 comprises:
selecting a control input quantity, obtaining a new measured value and a current reward value according to the control input quantity, forming a four-element state transition from the original measured value, the control input quantity, the new measured value and the current reward value, and storing the four elements in an experience pool;
randomly sampling from the experience pool, calculating the target values of the two evaluation networks in the Actor-Critic framework of TD3, and taking the minimum value;
updating the evaluation network parameters by minimizing the loss function;
updating the action network by the policy gradient method, and then updating the target evaluation network and the target action network.
5. A T-shaped emergency collision avoidance control system for an autonomous vehicle, comprising: a calculation module, a first updating module and a second updating module;
the calculation module calculates the control input quantity of the rule-based optimal control problem according to a preset vehicle model, a reward function and an initial state;
when the first setting condition is met, the first updating module updates the network parameters of reinforcement learning based on the control input quantity until the second setting condition is met;
when the second setting condition is met, the second updating module updates the network parameters of reinforcement learning based on the Actor-Critic framework of TD3 until a third setting condition is met, and outputs the optimal control quantity;
the setting of the reward function comprises: superimposing first-type rewards and second-type rewards to form the reward;
the first-type reward is an instant reward given after each decision in the collision avoidance process;
the second-type reward is a termination-state reward given, after each training round is finished, according to different state modes of the self-vehicle; the different state modes of the self-vehicle comprise collision and rollover in the collision avoidance process;
the calculating the control input quantity of the rule-based optimal control problem comprises:
the rule-based optimal control problem is that the vehicle first brakes at full force and, after the set time, steers at full force so as to produce the greatest possible yaw motion;
the control input quantity consists of the lateral force and the longitudinal force of the current tires;
setting the objective function of the rule-based optimal control problem as the termination-state reward;
the first setting condition is: epinode is less than or equal to i control
The second setting condition is: epi code>i control
The third setting condition is: epi code=i max
The epoode is the number of sequences, i, of the current training control The number of sequences for learning optimal control; i.e max Is the set maximum training round number.
6. A computer readable storage medium storing one or more programs, wherein the one or more programs comprise instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-4.
7. A computing device, comprising: one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-4.
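For concreteness, the state space, action space and reward composition recited in claims 1 and 2 can be organised as in the following Python sketch; the field names and the numerical values of the termination-state rewards are assumptions made for illustration only and are not values given in the patent.

from dataclasses import dataclass

@dataclass
class State:
    # self-vehicle state information
    x: float
    y: float
    yaw: float
    longitudinal_speed: float
    lateral_speed: float
    yaw_rate: float
    # surrounding environment information (e.g. the approaching vehicle)
    obstacle_x: float
    obstacle_y: float
    obstacle_speed: float

@dataclass
class Action:
    front_wheel_steer_angle: float   # steering angle of the self-vehicle front wheels
    rear_left_slip_ratio: float      # longitudinal slip ratio, left rear wheel
    rear_right_slip_ratio: float     # longitudinal slip ratio, right rear wheel

def episode_reward(instant_rewards, terminal_mode):
    """First-type (instant) rewards accumulated over the episode plus a
    second-type (termination-state) reward chosen by the final state mode."""
    termination_reward = {"collision": -100.0, "rollover": -100.0, "success": 100.0}
    return sum(instant_rewards) + termination_reward.get(terminal_mode, 0.0)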
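The two ways of updating the network parameters recited in claims 3 and 4 can be sketched with PyTorch as below; the network objects, the replay-batch format and all hyperparameter values are assumptions for illustration, not details taken from the patent.

import torch
import torch.nn.functional as F

GAMMA, TAU, POLICY_NOISE, NOISE_CLIP = 0.99, 0.005, 0.2, 0.5

def critic_update(batch, actor_target, critic1, critic2, critic1_t, critic2_t, critic_opt):
    # Shared step of claims 3 and 4: compute the targets of the two evaluation
    # networks, take the minimum, and update by minimizing the loss function.
    s, a, r, s2, done = batch
    with torch.no_grad():
        noise = (torch.randn_like(a) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        a2 = (actor_target(s2) + noise).clamp(-1.0, 1.0)
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))
        y = r + GAMMA * (1.0 - done) * q_target
    loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); loss.backward(); critic_opt.step()

def guided_actor_update(batch, actor, actor_opt, rule_based_control):
    # Claim 3: update the action network by minimizing the difference between
    # the optimal control input quantity and the action-network output.
    s = batch[0]
    optimal_action = rule_based_control(s)   # assumed to return a detached action tensor
    loss = F.mse_loss(actor(s), optimal_action)
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()

def td3_actor_update(batch, actor, critic1, actor_opt):
    # Claim 4: update the action network by the policy gradient method.
    s = batch[0]
    loss = -critic1(s, actor(s)).mean()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()

def soft_update(online_net, target_net):
    # After either actor update, the target evaluation networks and the target
    # action network track the online networks.
    for p, p_t in zip(online_net.parameters(), target_net.parameters()):
        p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)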
CN202110948176.XA 2021-08-18 2021-08-18 T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle Active CN113460090B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110948176.XA CN113460090B (en) 2021-08-18 2021-08-18 T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110948176.XA CN113460090B (en) 2021-08-18 2021-08-18 T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle

Publications (2)

Publication Number Publication Date
CN113460090A (en) 2021-10-01
CN113460090B (en) 2023-09-12

Family

ID=77866713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110948176.XA Active CN113460090B (en) 2021-08-18 2021-08-18 T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle

Country Status (1)

Country Link
CN (1) CN113460090B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116946162B (en) * 2023-09-19 2023-12-15 东南大学 Intelligent network combined commercial vehicle safe driving decision-making method considering road surface attachment condition

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018052444A (en) * 2016-09-30 2018-04-05 株式会社Subaru Collision input reduction device of vehicle
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112224202A (en) * 2020-10-14 2021-01-15 南京航空航天大学 Multi-vehicle cooperative collision avoidance system and method under emergency working condition
WO2021053474A1 (en) * 2019-09-17 2021-03-25 Kpit Technologies Limited System and method for dynamic evasive maneuver trajectory planning of a host vehicle
CN112633474A (en) * 2020-12-20 2021-04-09 东南大学 Backward collision avoidance driving decision method for heavy commercial vehicle
CN112906126A (en) * 2021-01-15 2021-06-04 北京航空航天大学 Vehicle hardware in-loop simulation training system and method based on deep reinforcement learning
CN112896170A (en) * 2021-01-30 2021-06-04 同济大学 Automatic driving transverse control method under vehicle-road cooperative environment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018052444A (en) * 2016-09-30 2018-04-05 株式会社Subaru Collision input reduction device of vehicle
WO2021053474A1 (en) * 2019-09-17 2021-03-25 Kpit Technologies Limited System and method for dynamic evasive maneuver trajectory planning of a host vehicle
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
CN111985614A (en) * 2020-07-23 2020-11-24 中国科学院计算技术研究所 Method, system and medium for constructing automatic driving decision system
CN112224202A (en) * 2020-10-14 2021-01-15 南京航空航天大学 Multi-vehicle cooperative collision avoidance system and method under emergency working condition
CN112633474A (en) * 2020-12-20 2021-04-09 东南大学 Backward collision avoidance driving decision method for heavy commercial vehicle
CN112906126A (en) * 2021-01-15 2021-06-04 北京航空航天大学 Vehicle hardware in-loop simulation training system and method based on deep reinforcement learning
CN112896170A (en) * 2021-01-30 2021-06-04 同济大学 Automatic driving transverse control method under vehicle-road cooperative environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Control method of an automatic cart based on a deep Q-value network; Wang Liqun; Zhu Shun; Han Xiao; He Jun; Electronic Measurement Technology (11); pp. 226-229 *

Also Published As

Publication number Publication date
CN113460090A (en) 2021-10-01

Similar Documents

Publication Publication Date Title
CN109849899B (en) Electro-hydraulic composite vehicle body stability control system and method for electric wheel vehicle
CN111890951B (en) Intelligent electric automobile trajectory tracking and motion control method
Li et al. Comprehensive tire–road friction coefficient estimation based on signal fusion method under complex maneuvering operations
Yoon et al. Design and evaluation of a unified chassis control system for rollover prevention and vehicle stability improvement on a virtual test track
CN103213582B (en) Anti-rollover pre-warning and control method based on body roll angular estimation
CN106004870A (en) Vehicle stability integrated control method based on variable-weight model prediction algorithm
EP4253181A1 (en) Vehicle front and rear drive torque distribution method and apparatus, and vehicle
Wang et al. Constrained H∞ control for road vehicles after a tire blow-out
Chakraborty et al. Vehicle posture control through aggressive maneuvering for mitigation of T-bone collisions
CN110606079A (en) Layered control vehicle rollover prevention method and multi-shaft distributed driving vehicle
JP2002087310A (en) Action to vehicle track based on measurement of lateral force
Singh et al. Trajectory tracking and integrated chassis control for obstacle avoidance with minimum jerk
US20190276009A1 (en) Control apparatus for vehicle and control method for vehicle
Chakraborty et al. Time-optimal vehicle posture control to mitigate unavoidable collisions using conventional control inputs
CN113733929B (en) Wheel torque coordination control method and device for in-wheel motor driven vehicle
CN113460090B (en) T-shaped emergency collision avoidance control method, system, medium and equipment for automatic driving vehicle
Mok et al. A post impact stability control for four hub-motor independent-drive electric vehicles
CN108569288B (en) definition and collision avoidance control method for dangerous working conditions of automobile
CN113002527B (en) Robust fault-tolerant control method for lateral stability of autonomous electric vehicle
Hajiloo et al. A model predictive control of electronic limited slip differential and differential braking for improving vehicle yaw stability
Zhang et al. A fuzzy control strategy and optimization for four wheel steering system
Guastadisegni et al. Vehicle stability control through pre-emptive braking
CN114162110B (en) Transverse stability control method for unmanned vehicle
CN115230687A (en) Vehicle drifting and collision avoidance control method and system under brake failure working condition
Hakima et al. Designing a fuzzy logic controller to adjust the angle of tires in four wheel steering vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant