CN112784485A

CN112784485A - Automatic driving key scene generation method based on reinforcement learning

Info

Publication number: CN112784485A
Application number: CN202110082493.8A
Authority: CN
Inventors: 董乾; 薛云志; 孟令中; 杨光; 王鹏淇; 师源; 武斌
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2021-01-21
Filing date: 2021-01-21
Publication date: 2021-05-11
Anticipated expiration: 2041-01-21
Also published as: CN112784485B

Abstract

The invention discloses an automatic driving key scene generation method based on reinforcement learning, which comprises the following steps: 1) selecting a road scene from a map library, setting a driving route of a main vehicle in a simulation system and respectively establishing a probability model for each dynamic environment element; 2) the simulation system controls the main vehicle to start executing a simulation task; training the probability models of all dynamic elements in the selected road scene based on a reinforcement learning technology to obtain the optimal parameters of all probability models for the selected road scene and storing the optimal parameters in a test case library; 3) the step 1-2) is circulated, and the optimal parameters of each probability model for each road scene in the map library are obtained; 4) acquiring a plurality of road scenes from the map library, combining the road scenes to obtain a test map, and selecting dynamic elements required in a simulation environment; 5) and importing the probability model and the corresponding optimal parameters of each dynamic element contained in the test map from the test case library to generate a key scene test case.

Description

Automatic driving key scene generation method based on reinforcement learning

Technical Field

The invention relates to an automatic driving key scene generation method based on reinforcement learning, and belongs to the technical field of computer software.

Background

Today, the performance of most perceptual and predictive algorithms is very sensitive to imbalances in training data (also known as the long tail problem), rare events are often difficult to collect, and are easily overlooked in large data streams, which greatly challenges the application of robots in the real world, especially in safety critical areas (e.g. autopilot).

In the automotive industry, it is common to reproduce, through simulation, the key scenes collected during a test drive. The prior art proposes an alternative method, called worst case evaluation, to search for controllers in the field of worst case evaluation vehicles. Although it may be useful to evaluate certain cases of mining by worst case, it is almost impossible to have some cases at risk in the real world, and it is of little instructive interest for practical use. In addition, the prior art mainly simulates the route or task completion condition of a simulation main body (such as an unmanned vehicle) for automatic driving, but no modeling method is provided for how the deployment of the simulation environment meets the key safety scene requirement required by an enterprise.

Reinforcement learning is a branch of the field of artificial intelligence machine learning for controlling agents that are capable of autonomous actions in an environment by interacting with the environment, including sensing and rewarding, to continuously improve its behavior. The two most important features in reinforcement learning are trial and error and late rewards. Therefore, the invention provides a key scene generation method in the automatic driving test process based on the reinforcement learning theory.

Disclosure of Invention

The invention aims to provide an automatic driving key scene generation method based on reinforcement learning, and solves the problems that training of dynamic environment elements in an automatic driving simulation environment is lacked, and the generation of an automatic driving key safety scene for how the dynamic environment elements are deployed is lacked in the prior art. Aiming at dynamic environment elements in an automatic driving simulation scene, the invention obtains the neural network model of the dynamic environment elements in different road scenes by continuously training model parameters in the simulation process through reinforcement learning, thereby generating a series of key scene test cases. The model parameters of the dynamic environment elements include initial position, movement speed, movement route, trigger distance, and the like. The invention designs a reasonable dynamic environment element rewarding mechanism, combines with road scenes, and fully considers the motion trail of dynamic environment elements such as pedestrians, vehicles, traffic lights and the like and the influence on the main vehicle.

In the invention, the map library of the automatic driving test scene can be preset by the test system, and the map scene can also be imported by a user; the main vehicle is a virtual vehicle to be tested in a test system, and the motion trail and the behavior mode of the main vehicle are controlled by a decision module of a simulation system; the dynamic environment elements mainly comprise three types of pedestrians, other running vehicles and traffic lights, and can cause dynamic interference to the running of the tested virtual vehicle in the simulation system, wherein the pedestrians are road participants in the test scene, the other running vehicles are non-tested vehicles which commonly use the road of the test scene, and the traffic lights are relatively static traffic elements and are used for controlling the time conversion of the traffic lights at the intersection.

The method for generating the automatic driving key scene based on reinforcement learning comprises the following steps:

step 1: initializing a test scene, selecting a road scene from a map library, setting a driving route of a main vehicle, and respectively establishing an initial probability model for three dynamic environment elements, namely pedestrians, other driving vehicles and traffic lights;

step 2: a decision module of the simulation system controls the main vehicle to start executing a simulation task; training probability model parameters of three types of dynamic elements in the selected road scene based on a reinforcement learning technology;

and step 3: the three types of dynamic elements finally obtain the optimal parameters of the probability model aiming at the selected road condition, and the optimal parameters are stored in a test case library;

and 4, step 4: the step 1-3 is circulated until all road scenes of the three types of dynamic elements in the map library are trained to obtain optimal parameters of the probability model;

and 5: importing road combinations into a map library to be any test map, and selecting dynamic elements required by a user in a simulation environment, wherein the dynamic elements mainly comprise pedestrians, other running vehicles, traffic lights and the like;

step 6: and according to the road scenes in the test map, importing the dynamic element probability model corresponding to each road and the corresponding optimal parameters from the test case library to generate a series of key scene test cases.

Further, step 1 specifically comprises:

the dynamic environment elements are other dynamic elements except the main vehicle in the test scene, and mainly comprise three types of pedestrians, other running vehicles and traffic lights.

The road conditions in the map library comprise a one-way lane, a two-way lane, a crossroad, a T-shaped intersection, a Y-shaped intersection, an entrance and an exit of a highway, an overpass and the like; the key scenes appearing on different road conditions are different, the situations of collision, line pressing, retrograde motion, red light running and the like can exist, different dynamic environment element models exist for each road condition, and the combination of the dynamic environment elements is equivalent to modeling of the key scenes of the road conditions.

For the initial models of the pedestrians and the vehicles, the parameters mainly comprise the elements of initial positions, movement routes, movement speeds, trigger distances and the like; the pedestrian movement route comprises straight movement along the direction of the road, crossing the road and the like, the vehicle movement route comprises straight movement, left turning, right turning, turning around, lane changing and the like, and the pedestrian movement route and the vehicle movement route have different options according to different road conditions and initial positions, but all the options are required to accord with traffic rules; for the initial model of traffic lights, it was primarily the time setting of the traffic lights, including the duration of the red, green, and yellow lights.

Further, step 2 specifically comprises:

step 2.1: setting the total iteration times E of model training; initializing the iteration times e to be 0;

step 2.2: obtaining a road scene, selecting various dynamic element types (pedestrians, other running vehicles and traffic lights) needing training in the road scene, obtaining a main vehicle running route, obtaining an initial model of each type of dynamic elements, wherein the number of the selected each type of dynamic elements is more than or equal to 1; it should be noted that, for pedestrians and vehicles, the set movement route and the initial position are in compliance with the traffic rules;

step 2.3: for each road scene, the state S is determined, the state S comprises the road type, the route of the host vehicle and the speed of the host vehicle, the probability distribution of each dynamic element in the selected road scene can be calculated according to the current state of the host vehicle (the current state of the host vehicle is determined in step 1; the road scene and the current state of the host vehicle are known conditions, such as the condition that the road scene is a crossroad and the current state of the host vehicle is a right turn, is set before the test), wherein the probability distribution formula of the pedestrians and the vehicles is as follows:

wherein, the ith action element a in the formula (1)_iThe motion element a includes specific elements such as initial position (X, Y), motion path L, motion speed V, and trigger distance D of the dynamic element_iThe probability of (c) is:

the ith motion route is a discrete variable li _ init _ state obtained by discretizing the continuous random variable li _ init _ index, and the discrete variable li _ init _ state is the motion route initial state of the ith scene li. The complexity of the motion route is that the possible route options are strongly dependent on the road structure and the initial point, and assuming that the total number of options of the motion route under a specific condition is N (i.e. the selectable total number of motion routes), the conditional probability density of li _ init _ index can be modeled by using a probability density function on a [0,1] interval of the neural network structure, and the formula is as follows (3):

li_init_index～P(li_init_index|S,a₁,...,a_i-1,x_i,y_i) (3)

the discretization of the continuous random variable li _ init _ index is detailed in step 2.4.

The duration of the traffic light is a continuous variable, and the probability distribution formula of the duration of the traffic light is as follows (4):

light _ init _ index, t _ red, t _ green, and t _ yellow in equation (4) represent the initial state (initial red, green, and yellow colors), red light duration, green light duration, and yellow light duration, respectively, of the traffic light.

Conditional probability densities for traffic light durations t _ red, t _ green, t _ yellow

P(t_red|S,a₁,...,a_i-1,light_init_index)、P(t_green|S,a₁,...,a_i-1,light_init_index，t_red)、P(t_yellow|S,a₁,...,a_i-1Light _ init _ index, t _ red, t _ green) can be modeled using a gaussian distribution.

The initial state light _ init _ state of the traffic light is a discrete variable, and can be obtained by discretizing a continuous random variable light _ init _ index, and the conditional probability density of the light _ init _ index can be modeled by a probability density function on a section [0,1] of a neural network structure to obtain a formula (5);

light_init_index～P(light_init_index|S,a₁,...,a_i-1) (5)

the discretization of the continuous random variable light _ init _ index is detailed in step 2.4.

Step 2.4: randomly sampling the probability distribution of each dynamic element to obtain the action parameters of the dynamic element model in the state S, namely obtaining the initial position X, the initial position Y, the movement speed v, the movement route L and the trigger distance D for pedestrians, obtaining the initial position X, the initial position Y, the movement speed v, the movement route L and the trigger distance D for vehicles, and obtaining the initial state light _ init _ state, the red light time setting t _ red, the yellow light time setting t _ yellow and the green light time setting t _ green of traffic lights;

step 2.4.1: modeling dynamic elements by using Gaussian distribution N (mu, sigma) for continuous random variables such as initial position (X, Y), motion speed v, trigger distance D, red light time setting t _ red, yellow light time setting t _ yellow and green light time setting t _ green, modeling dynamic elements by using polynomial distribution for discrete random variables such as initial state light _ init _ state of traffic light and motion route L, and using Neural Network (NN) for conditional probability inference;

the gaussian distribution probability sampling formula is as follows:

μ_k,σ_k←M_k(S) (6)

ε～N(0,1) (7)

a_k＝μ_k+σ_k*ε (8)

random variable a_kIs an operation of sampling from the kth node, M_kIs a model representing the conditional distribution of the kth action, followed by a_kParameters that scale and move to represent the real scene:

b_k＝a_k*l_k+s_k (9)

wherein l_kAnd s_kRespectively the range and mean of the kth action.

Step 2.4.2: for discrete random variables, such as the movement route L, the initial state light _ init _ index of the traffic light, [0,1] distribution, the probability sampling formula is as follows:

1) for the moving route L of pedestrians and other traveling vehicles, knowing the probability of li _ init _ index obeys equation (3), firstly, probability sampling is performed on the initial state li _ init _ index of the route, a discrete random variable li _ init _ state which can be further constructed by using a continuous random variable li _ init _ index is used, and when the kth route is selected, the correspondence between the continuous type and the discrete type is as follows:

li _ init _ state ═ k, where li _ init _ index ∈ ((k-1)/N, k/N) (10)

2) For the initial state of the traffic light, knowing the probability of light _ init _ index obeys equation (5), first performing probability sampling on the light _ init _ index of the initial state of the traffic light, and then performing mapping from the light _ init _ index to the light _ init _ state, the random variable light _ init _ state of the initial state of the traffic light can be further constructed as follows:

thus obtaining the initial state of the traffic light (initial red, green, yellow colors).

Step 2.5: and (3) testing the main vehicle by taking the random sampling result in the step (2.4) as a condition, and calculating an incentive value R by using the operation result, wherein the design principle of the incentive value R is a key scene in which some main vehicle accidents are expected to occur or the main vehicle violates the traffic rules, and the calculation formula is as follows:

w1, w2, w3, w4 and w5 in the formula (12) are all non-negative weight coefficients, and w1+ w2+ w3+ w4+ w5 is 1; here, ped denotes a set of pedestrian dynamic elements, c denotes a set of other traveling vehicle dynamic elements, l denotes a set of traffic light dynamic elements, r denotes a set of traffic regulations violated by the object to be measured (host vehicle), and p denotes a set of penalty terms for the object to be measured (host vehicle).

Wherein, the first term R of formula (12)_pedExpressing the reward value of the pedestrian, and the calculation formula is as follows:

wherein b1 and b2 are bothA non-negative weight coefficient and b1+ b2 is 1;

indicates for the ith action element a_iAccording to the minimum distance dis between the pedestrian and the main vehicle_pThe prize value earned, i.e. for the ith action element a_iThe host vehicle presents the reward value obtained about the key scene of the pedestrian, and the following same reason.

Of formula (13)

Indicating the minimum distance dis between the pedestrian and the host vehicle_pAvailable prize value, dis_pIndicating the distance between the pedestrian and the host vehicle, as the distance dis_pLess than threshold_pWhen the distance between the pedestrian and the main vehicle is smaller than the safe distance, the corresponding reward value DIS is obtained>0, DIS is a specific value which can be set, otherwise, the reward value is 0, and the calculation formula is as follows:

of formula (13)

Col for indicating traffic accident between main car and pedestrian_pAvailable prize value, col_pIndicating that if the host vehicle and the pedestrian have a traffic accident, the corresponding reward value COL is obtained, and the COL is a specific value which can be set, and the calculation formula is as follows:

second term R of formula (12)_cIndicating the reward value of the vehicle, including the distance between the vehicle and the host vehicle being less than the safe distance dis_cCollision accident col of main vehicle_cIn this case, the calculation formula is as follows:

here, c1 and c2 are non-negative weight coefficients, and c1+ c2 is 1;

of formula (16)

A bonus value, dis, being available representing the minimum distance between the other travelling vehicles and the host vehicle_cIndicating the distance between the other running vehicle and the host vehicle, when the distance dis_cLess than threshold_cWhen the distance between other driving vehicles and the main vehicle is smaller than the safe distance, obtaining a corresponding reward value DIS, wherein DIS is a settable specific numerical value, otherwise, the reward value is 0, and the calculation formula is as follows;

of formula (16)

The prize value, col, available to indicate a traffic accident between the host vehicle and other vehicles_cIt means that if the main vehicle and other running vehicles have a traffic accident, then the corresponding reward value COL is obtained, and the COL is a specific value which can be set, and the calculation formula is as follows:

a third term R of formula (12)_lThe reward value of the traffic light comprises that the main vehicle runs the red light and the main vehicle runs the yellow light, and the calculation formula is as follows:

R_l＝f1*R_red(a_i)+f2*R_yellow(a_i) (19)

both f1 and f2 are non-negative weight coefficients, and f1+ f2 is 1;

r of formula (19)_redIndicating the reward value, red, available to the host vehicle for running a red lightWhen the condition that the main vehicle runs the RED light is shown, a corresponding reward value RED is obtained, wherein RED is a specific numerical value which can be set, and the calculation formula is as follows:

r of formula (19)_yellowThe reward value which can be obtained when the main vehicle runs the red light is shown, YELLOW shows that the main vehicle runs the YELLOW light, the corresponding reward value YELLOW is obtained, YELLOW is a specific numerical value which can be set, and the calculation formula is as follows:

dis of formula (21)_yellowThe distance value of the host car exceeding the stop line is shown after the host car finds that the yellow light is turned on until the host car stops, and alpha represents the coefficient of the distance.

Fourth term R of formula (12)_rThe traffic rule violation behaviors such as main vehicle line pressing rate, reverse driving, illegal lane change and the like are represented by the following calculation formula:

R_r＝g1*R_cross(a_i)+g2*R_converse(a_i)+g3*R_{lane_change}(a_i) (22)

g1, g2 and g3 are all non-negative weight coefficients, and g1+ g2+ g3 is 1;

r of formula (22)_crossThe reward value which can be obtained when the main straight line runs is represented, CROSS represents the condition that the main straight line runs, and then a corresponding reward value CROSS is obtained, wherein CROSS is a specific value which can be set, and the calculation formula is as follows:

r of formula (22)_converseIndicating the reward value obtainable by the reverse driving of the host vehicle, reverse indicating the presence of reverse driving of the host vehicle, and obtaining the corresponding reward value convert, which is a settable specific numberThe value, the calculation formula is as follows:

r of formula (22)_{lane_change}The LANE _ CHANGE indicates that the host vehicle has an illegal LANE CHANGE, and the LANE _ CHANGE indicates that the host vehicle has an illegal LANE CHANGE, so that a corresponding reward value LANE _ CHANGE is obtained, the LANE _ CHANGE is a settable specific numerical value, and the calculation formula is as follows:

a fifth term R of formula (12)_pThe penalty is used for avoiding special conditions, namely, some unreasonable conditions of the dynamic element occur, which are usually related to the distance between the main vehicle and the dynamic element, and the calculation formula is as follows;

wherein eta is_iIs that the main vehicle is in state s_iRun-way of, p₀Indicates the position of the dynamic element, and γ is a set distance threshold.

Step 2.6: optimizing the probability model of the dynamic elements by using a strategy gradient method, wherein the objective function formula is as follows:

wherein a is the distribution from the strategy pi_φMiddle sampling action, phi ═ a₁,...,a_n) (ii) a E is the expectation function and R is the prize value.

And (3) sampling and approximating the target function for N times, wherein the gradient for updating the model parameter phi is as follows:

in order to make the selection of the strategy as diverse as possible, an entropy term H (pi) is added in the objective function_φ)：

H(π_φ)＝-∫π_φ(x)logπ_φ(x)dx (29)

π_φIs a distribution characterized by phi, where x is an independent variable, which, when taken at different values, has a probability density value of pi_φ(x) Will change correspondingly; for entropy term H (pi)_φ) And synchronizing with the reward value to maximize, and then adding the entropy term to obtain the gradient of the objective function:

the updating formula of the parameter phi is as follows, the gradient descent method is used for optimizing the parameter to obtain the minimum value of the formula, and thus the maximum reward value and entropy value are obtained:

when using the autoregressive Gaussian distribution Pair strategy_φIn modeling, the joint probabilities can be computed using chain rules:

π_φ,iis the model parameter phi corresponding to the ith dynamic element model.

Step 2.7: adding 1 to the iteration number e; when the iteration number E of the model training is smaller than E, returning to the step 2.2; and when the iteration times of the model training are equal to E, completing the model parameter training of the dynamic elements.

Further, the final test case obtained in step 4 specifically is:

for the selected road condition, the state of the test scene is determined, after the three types of dynamic elements are subjected to iterative training based on reinforcement learning, the model of each type of dynamic elements can obtain the probability distribution of the configuration parameters of the key scene, and the configuration parameters of the key scene comprise an initial position, a trigger position, a movement speed, a movement route, the change time of a traffic light and the like; based on the probability distribution of the configuration parameters of the dynamic element models, a key scene test case for unmanned vehicle simulation can be generated quickly.

Further, the map of the test case in step 5 is a free combination of arbitrary road conditions that have undergone iterative training based on reinforcement learning in a map library.

Further, the test case generation in step 6 specifically includes:

step 6.1: selecting required road types from a map library, and freely combining the road types to form a test case map;

step 6.2: setting a main vehicle running route, and selecting the type and the number of the dynamic elements;

step 6.3: after the road scene is added into the main vehicle driving route, the state S of the test scene is determined; according to the road condition and the movement track of the main vehicle, a dynamic element model M trained in the state S can be found in a test case library; for the ith state Si, the dynamic element model M obtains the probability distribution Pi of the action parameters, random sampling is carried out on the probability distribution according to the step 2.4 to obtain specific values ai of the action parameters, for pedestrians and vehicles, ai comprises an initial position X, an initial position Y, a movement speed v, a movement route L and a trigger distance D, for traffic lights, ai comprises a traffic light initial state light _ init _ state, a red light time setting t _ red, a yellow light time setting t _ yellow, a green light time setting t _ green, and specific values of the action parameters are set for the dynamic elements;

step 6.4: and generating a final test case.

The invention has the positive effects that:

(1) in the prior art, simulation of an unmanned vehicle is mostly considered for simulation of automatic driving, dynamic environment elements in a simulation scene are ignored, initial positions, movement speeds, movement routes, trigger distances and the like of the dynamic environment elements in the scene are mainly researched, and based on a reinforcement learning technology, a series of test cases of key scenes can be quickly generated on the premise that the movement of the dynamic elements accords with actual conditions.

(2) The dynamic environment elements in the simulation scene are trained in the simulation through the reinforcement learning technology to obtain the key scene of accidents such as high-probability collision and the like in automatic driving, invalid actions are avoided, the problems of more invalid explorations and low training speed in the training process are solved, and the training efficiency is obviously improved.

(3) The reward mechanism is reasonable in design, and the influence of pedestrians, vehicles, traffic lights and the like on the main vehicle is fully considered in combination with real traffic rules.

(4) The automatic driving test scene generation is put on the key scenes with few participants, such as an AV and a dynamic vehicle, the probability distribution calculation is simple and convenient, and the model training is simple and easy to realize.

Drawings

FIG. 1 is a flow chart of a method for generating an autopilot key scene;

FIG. 2 is a flow chart of model parameter training for three types of dynamic elements.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, belong to the scope of the present invention.

An automatic driving key scene generation method based on reinforcement learning comprises the following steps:

step 2: the main vehicle starts to execute a simulation task; based on a reinforcement learning technology, aiming at the selected road condition, carrying out probability model parameter training on the three types of dynamic elements;

and step 3: the three types of dynamic elements finally obtain the optimal parameters of the probability model for the selected road condition, and the optimal parameters are stored in a test case library;

step 6: and according to the selected road, importing a dynamic element probability model corresponding to each road from a test case library to generate a series of key scene test cases.

Further, step 1 specifically comprises:

In one embodiment, the moving routes of the pedestrian and the vehicle have different options according to different road conditions and initial positions, such as: the road condition selects a straight road, the initial position of the pedestrian is on the sidewalk on one side of the road, and then the moving route of the pedestrian is straight along the sidewalk; the road condition selects an intersection, the initial position of the pedestrian is at the intersection, and then the moving route of the pedestrian is to pass through the road.

In one embodiment, the road condition selects an intersection; the initial position of the pedestrian is in the northeast corner of the crossroad, and the movement route passes through two roads of the crossroad and reaches the southwest corner of the crossroad; the initial position of the vehicle is behind the right of the main vehicle, the movement route passes through the crossroad together with the main vehicle, and the condition of merging can occur; the time of the traffic lights was set to 60 seconds red, 30 seconds green and 3 seconds yellow.

The detailed description of step 2 is provided in the summary of the invention.

Further, the final test case obtained in step 4 specifically is:

Further, the test case generation in step 6 specifically includes:

step 6.4: and generating a final test case.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the principle and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. An automatic driving key scene generation method based on reinforcement learning comprises the following steps:

1) selecting a road scene from a map library, setting a driving route of a main vehicle in a simulation system and respectively establishing a probability model for each dynamic environment element; the dynamic environment elements comprise pedestrians, other running vehicles except the main vehicle and traffic lights;

2) the simulation system controls the main vehicle to start executing a simulation task; training the probability models of all dynamic elements in the selected road scene based on a reinforcement learning technology to obtain the optimal parameters of all probability models for the selected road scene and storing the optimal parameters in a test case library;

3) the step 1-2) is circulated, and the optimal parameters of each probability model for each road scene in the map library are obtained;

4) acquiring a plurality of road scenes from the map library, combining the road scenes to obtain a test map, and selecting dynamic elements required in a simulation environment;

5) and importing the probability model and the corresponding optimal parameters of each dynamic element contained in the test map from a test case library to generate a key scene test case as an automatic driving key scene.

2. The method as claimed in claim 1, wherein in step 2), the method for training the probability model of each dynamic element in the selected road scene is as follows:

21) setting the total iteration times E of model training; initializing the iteration times e to be 0;

22) setting a motion route and an initial position of each dynamic element in the selected road scene;

23) calculating the probability distribution of each dynamic element in the selected road scene according to the current state of the main vehicle;

24) randomly sampling the probability distribution of each dynamic element to obtain the action parameters of the probability model of each dynamic element in the state S;

25) testing the main vehicle by using the random sampling result of the step 24) as a condition, and then calculating an award value R according to the running result of the test, wherein the award value R

a_iIs the ith dynamic element, and n is the number of the dynamic elements; w1, w2, w3, w4 and w5 are all non-negative weight coefficients, ped represents a pedestrian set in the selected road scene, c represents other running vehicle sets in the selected road scene, l represents a traffic light set in the selected road scene, r represents a set of the host vehicle violating the traffic rules, and p represents a set of host vehicle penalty terms;

reward value

b1 and b2 are non-negative weight coefficients;

indicates for the ith action element a_iAccording to the minimum distance dis between the pedestrian and the main vehicle_pThe value of the benefit to be obtained is,

(a_i) Indicates for the ith action element a_iTraffic accident col according to the main car and the pedestrian_p(a_i) The value of the reward earned;

reward value

Wherein c1, c2 are non-negative weighting coefficients,

indicates for the ith action element a_iThe bonus value obtained on the basis of the minimum distance between the other running vehicles and the host vehicle,

indicates for the ith action element a_iObtaining a reward value according to the traffic accident of the main vehicle and other running vehicles;

reward value R_l＝f1*R_red(a_i)+f2*R_yellow(a_i) F1 and f2 are all non-negative weight coefficients, R_red(a_i) Indicates for the ith action element a_iObtaining a reward value according to the condition that the main vehicle runs the red light;

R_r＝g1*R_cross(a_i)+g2*R_converse(a_i)+g3*R_{lane_change}(a_i) G1, g2 and g3 are all non-negative weight coefficients, R_cross(a_i) Indicates for the ith action element a_iAccording to the reward value, R, obtained by driving the main vehicle line-pressing line_converse(a_i) Indicates for the ith action element a_iBased on the reward value, R, available for retrograde driving of the host vehicle_{lane_change}(a_i) Indicates for the ith action element a_iThe reward value can be obtained according to the illegal lane change of the main vehicle;

wherein eta is_iIs the driving route of the main vehicle in the state si, rho₀The position of the dynamic element is shown, gamma is a set threshold value, and RP is a driving state reward value;

26) optimizing the probability model of the dynamic elements by using a strategy gradient method; wherein the objective function for optimization is determined based on the reward value as

a is the distribution of the slave strategies π_φMiddle sampling action, phi ═ a₁,...,a_n) (ii) a E is an expectation function;

27) adding 1 to the iteration number e; when the iteration number E of the model training is less than E, returning to the step 22); and when the iteration number of the model training is equal to E, finishing the training of the probability model of the dynamic element.

3. The method of claim 2, wherein the probability distributions for pedestrians and vehicles are

Wherein, the dynamic element a_iThe method comprises the steps of obtaining an initial position (X, Y), a movement route L, a movement speed V and a trigger distance D of a dynamic element; dynamic element a_iThe probability of (c) is:

the probability distribution of the duration of the traffic light is

Wherein light _ init _ index, t _ red, t _ green, and t _ yellow respectively represent initial states of the traffic lightsState, red light duration, green light duration, yellow light duration; the conditional probability density P (t _ red | S, a) of the traffic light duration t _ red is respectively obtained by modeling by using Gaussian distribution₁,...,a_i-1Light _ init _ index), and conditional probability density P of t _ green (t _ green | S, a)₁,...,a_i-1Light _ init _ index, t _ red), and t _ yellow conditional probability density P (t _ yellow | S, a)₁,...,a_i-1,light_init_index，t_red，t_green)。

4. The method according to claim 3, wherein the initial state li _ init _ state of the motion route in each road scene is obtained by discretizing a continuous random variable li _ init _ index, and li _ init _ state is the initial state of the motion route of the ith road scene li, and the conditional probability density of li _ init _ index is obtained by modeling with a probability density function on a neural network construction [0,1] section; the initial state of the traffic light is obtained by discretizing the continuous random variable light _ init _ index, and the conditional probability density of the light _ init _ index is obtained by modeling by using a probability density function on a neural network structure [0,1] interval.

5. The method of claim 4, wherein the step 24) of randomly sampling the probability distribution of each dynamic element comprises:

241) modeling dynamic elements by using Gaussian distribution for continuous random variables; for discrete random variables, modeling the dynamic elements using a polynomial distribution;

242) sampling the probability distribution of discrete random variables: a) for the moving routes of pedestrians and other traveling vehicles, firstly, probability sampling is carried out on an initial state li _ init _ index of the moving route, a discrete random variable li _ init _ state is constructed by using a continuous random variable li _ init _ index, and when a k-th route is selected from the discrete random variable li _ init _ state, the correspondence relationship li _ init _ state from continuous to discrete is equal to k, wherein li _ init _ index belongs to ((k-1)/N, k/N); b) for the initial state of the traffic light, probability sampling is firstly carried out on the light _ init _ index of the initial state of the traffic light, then mapping from the light _ init _ index to the light _ init _ state is carried out, and the random variable light _ init _ state of the initial state of the traffic light is constructed.

6. The method of claim 5, wherein for an initial state of the traffic light, the light _ init _ index to light _ init _ state mapping relationship is

7. An automatic driving test method, characterized in that a simulation system adopts the automatic driving key scene obtained by the method of claim 1 to test a target main vehicle.