CN112508164B - End-to-end automatic driving model pre-training method based on asynchronous supervised learning - Google Patents


Info

Publication number
CN112508164B
CN112508164B
Authority
CN
China
Prior art keywords
vehicle
model
training
strategy
automatic driving
Prior art date
Legal status
Active
Application number
CN202010727803.2A
Other languages
Chinese (zh)
Other versions
CN112508164A (en)
Inventor
田大新
郑坤贤
段续庭
周建山
韩旭
郎平
林椿眄
赵元昊
郝威
龙科军
刘赫
拱印生
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202010727803.2A
Publication of CN112508164A
Application granted
Publication of CN112508164B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001 Details of the control system
    • B60W2050/0002 Automatic control, details of type of controller or control system architecture


Abstract

An end-to-end automatic driving model pre-training method based on asynchronous supervised learning executes a plurality of supervised learning processes asynchronously and in parallel on a plurality of demonstration data sets collected by real vehicles, which improves the stability of the supervised learning process and accelerates the convergence of pre-training. After the end-to-end automatic driving model is pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning training stage is improved and its convergence is accelerated. In addition, the invention provides a visual analysis method for the end-to-end automatic driving model training process, which analyzes, from a microscopic perspective, the performance improvement brought by the asynchronous-supervised-learning-based pre-training method. The invention also designs a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which is used to collect expert demonstration data and to verify the feasibility of applying the pre-training method in the real world.

Description

End-to-end automatic driving model pre-training method based on asynchronous supervised learning
Technical Field
The invention relates to the field of transportation, and in particular to an end-to-end model pre-training method for autonomous driving vehicles.
Background Art
Current automated driving faces a significant challenge: the traditional automatic driving system is excessively large and complex in structure. In order to make the automatic driving system as complete as possible so that it can meet the requirements of different working conditions, a traditional system cannot avoid the huge and complicated structure brought by ever more comprehensive logic. Such an over-complex traditional automatic driving system faces three problems: algorithm bloat, limited performance, and decision contradiction:
(1) Algorithm bloat: the traditional automatic driving system needs a manually constructed rule base to cover the driving states of the unmanned vehicle, and the algorithm scale keeps growing as driving scenes become more numerous and complex;
(2) Limited performance: the system structure imposes bottlenecks on the depth of scene traversal and on decision accuracy, so complex working conditions are difficult to handle;
(3) Decision contradiction: the traditional automatic driving system uses a finite-state machine to switch between driving behaviors in different states, and the state partition of the finite-state machine must rest on definite boundary conditions. In fact, 'grey zones' exist between driving behaviors, i.e., more than one behavior choice can be reasonable in the same scene, so driving states may conflict.
The widespread success of deep reinforcement learning (Deep RL) has led to its increasing application to the training of end-to-end autonomous driving models. Learning-based algorithms abandon the hierarchical structure of rule-based algorithms, are more concise and direct, and greatly simplify the structure of the decision system. During Deep RL model training, the mapping between environment states and optimal actions can be established with very little prior knowledge through a cyclic process of state observation, action execution and reward acquisition. However, because of this lack of prior knowledge, the initial performance of Deep RL is poor, so training an automatic driving model that can actually be deployed takes a long time (it requires excessive real-world experience). In a simulation environment, the poor initial performance of a Deep RL model can be tolerated. However, if a Deep RL based autonomous vehicle is to operate routinely in the real world, it inevitably has to be trained in the real world with real vehicles. In that case, poor initial performance means that the real vehicles may collide frequently, or that training is interrupted by frequent human interventions to avoid danger, which greatly increases the workload of the testers and the risk during training. Therefore, in order to deploy a Deep RL based end-to-end autopilot model on an actual autonomous vehicle, the problem of poor initial performance of the Deep RL model must be solved.
The invention introduces prior knowledge into the training of the Deep RL model to solve the problem of poor initial performance when the Deep RL model is trained in the real world. The invention provides an asynchronous supervised learning method for continuous-action Deep RL models, which executes a plurality of supervised learning processes in parallel and asynchronously on a plurality of training data sets collected from the real world. By running different supervised learning processes in different threads, a plurality of agents update the model parameters online, in parallel and asynchronously; compared with the parameter update process of a single agent, the temporal correlation of strategy exploration is greatly reduced, so the supervised learning process is more stable. To avoid the time-consuming and labor-intensive collection of human expert driving demonstrations, the invention also uses a Manually Designed Heuristic Driving Policy (MDHDP) to drive the vehicle and generate high-reward experience data as expert demonstrations, forming the supervised-learning training data set. In order to analyze visually, from a microscopic perspective, the improvement brought by the pre-training process, the invention provides a visualization method suitable for continuous-action Deep RL models; this visual analysis method is of great significance for testing and verifying neural network models with continuous outputs. Finally, the invention designs a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which is used to collect expert demonstration data and to verify the feasibility of applying the pre-training method in the real world.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an end-to-end automatic driving model pre-training method that addresses the poor initial performance and slow model convergence encountered when a reinforcement-learning-driven end-to-end automatic driving model is trained in the real world.
The technical solution adopted by the invention is as follows: an end-to-end automatic driving model pre-training method based on asynchronous supervised learning is designed, in which a plurality of supervised learning processes are executed asynchronously and in parallel on a plurality of demonstration data sets collected by real vehicles, improving the stability of the supervised learning process and accelerating the convergence of pre-training. After the end-to-end automatic driving model is pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning training stage is improved and its convergence is accelerated. In addition, the invention provides a visual analysis method for the end-to-end automatic driving model training process, used to analyze the effectiveness of the pre-training method from a microscopic perspective. To avoid the time-consuming and labor-intensive collection of human expert driving demonstrations, the invention also uses a Manually Designed Heuristic Driving Policy (MDHDP) to drive the vehicle and generate high-reward experience data as expert demonstrations, forming the supervised-learning training data set. Finally, in order to verify the feasibility of applying the pre-training method in the real world, the invention also designs a matching multi-vehicle distributed reinforcement-learning-driven automatic driving model training system.
An end-to-end automatic driving model pre-training method based on asynchronous supervised learning: a plurality of supervised learning processes are executed asynchronously and in parallel (asynchronous supervised learning) on a plurality of demonstration data sets collected by real vehicles, which improves the stability of the supervised learning process and accelerates the convergence of the pre-training process.
The demonstration data sets are collected by data collection vehicles N'_i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of the preview theory as follows: the strategy determines the wheel steering angle and the brake/throttle amount. The wheel steering angle is determined by collection vehicle i from its current speed v_it and the position of the preceding vehicle, in the following specific steps:
(1) Collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) according to
l_EF = L + v_it × Δt,
where L is a fixed preview distance and Δt is a preview coefficient;
(2) The steering angle pointing at the preview point F is calculated from the geometric relation between F and the center of collection vehicle i [formula shown only as an image in the original publication];
(3) The steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision, using the weight W = D/C, where D is the lateral distance between the preceding vehicle j and collection vehicle i and C is a side-collision safety threshold [correction formula shown only as an image in the original publication].
The brake and throttle amounts are determined by collection vehicle i from its current speed v_it, the speed limit of the current road section r_t (denoted v_max below), and the distance d_it to the preceding vehicle j, in the following specific steps:
(1) Determine the speed limit v_max of the current road section r_t [formula shown only as an image in the original publication], where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) When the vehicle speed v_it does not exceed v_max, the throttle amount is increased; when v_it exceeds v_max, or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
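To make the heuristic strategy concrete, the following is a minimal Python sketch of an MDHDP-style controller. The steering-correction and speed-limit formulas are shown only as images in the original publication, so standard pure-pursuit steering toward the preview point and fixed illustrative thresholds are used as stand-ins; all names (preview_distance, path_point_at, the numeric constants) are assumptions of this sketch, not values taken from the patent.

```python
import math

# Minimal sketch of a manually designed heuristic driving policy (MDHDP).
# The patent's exact steering-correction and speed-limit formulas are shown only
# as images, so pure-pursuit style steering and fixed thresholds are stand-ins.

L_FIXED = 2.0        # fixed preview distance L (m), illustrative value
DT_PREVIEW = 0.5     # preview coefficient delta t (s), illustrative value
SIDE_SAFE = 1.5      # side-collision safety threshold C (m), illustrative value
FWD_SAFE = 5.0       # forward-collision safety threshold (m), illustrative value

def preview_distance(v):
    """l_EF = L + v * delta_t, as given in the text."""
    return L_FIXED + v * DT_PREVIEW

def steering_angle(ego_xy, ego_heading, v, path_point_at, front_xy=None):
    """Steer toward the preview point F; damp the angle when the front vehicle j is close."""
    fx, fy = path_point_at(ego_xy, preview_distance(v))   # preview point F on the reference path
    angle = math.atan2(fy - ego_xy[1], fx - ego_xy[0]) - ego_heading
    if front_xy is not None:
        d_lateral = abs(front_xy[1] - ego_xy[1])          # lateral distance D to front vehicle j
        w = d_lateral / SIDE_SAFE                         # correction weight W = D / C
        angle *= min(w, 1.0)                              # reduce steering toward a close vehicle
    return angle

def throttle_brake(v, v_limit, dist_front):
    """Increase throttle below the road-section speed limit; brake above it or when too close."""
    if v > v_limit or dist_front < FWD_SAFE:
        return 0.0, 0.3    # (throttle, brake): increase the braking amount
    return 0.3, 0.0        # increase the throttle amount
```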
The pre-training process is defined by a five-tuple ⟨S, A, L, P, γ⟩ as follows:
State S: let s_n^Ω ∈ S be the set of time-varying environment states collected by demonstration vehicle i, where s_n^Ω denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
Action A: let a_n^Ω ∈ A be the set of demonstrated driving actions collected by demonstration vehicle i, where a_n^Ω denotes the demonstrated action (wheel steering angle) of the n-th experience in Ω, and a_n^Ω is a continuous value;
Loss function L: let l_n^Ω denote the loss of the n-th experience in Ω; it measures the deviation between the demonstrated action a_n^Ω and the action â_n output by the pre-trained model after the input s_n^Ω, where μ̂ and σ̂ are the mean and variance variables corresponding to â_n [the exact expression is shown only as an image in the original publication];
State transition function P: let P(s_{n+1}^Ω | s_n^Ω, a_n^Ω) denote the probability that, given state s_n^Ω and action a_n^Ω (assuming n corresponds to the t-th time slot), the system transitions to state s_{n+1}^Ω in the next time slot;
Discount coefficient γ: γ ∈ [0,1], used to balance current loss and long-term loss.
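The five-tuple above can be represented concretely as a small data structure. The sketch below is illustrative only: field names, shapes, and the default discount value are assumptions, since the patent specifies only that a state is 4 consecutive single-channel front-camera frames and an action is the continuous wheel steering angle.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

# Illustrative containers for the pre-training demonstration data set Omega.
# Shapes, field names, and the default gamma are assumptions of this sketch.

@dataclass
class DemoExperience:
    state: np.ndarray     # shape (4, H, W): 4 stacked single-channel front-camera frames
    action: float         # demonstrated wheel steering angle (continuous value)

@dataclass
class DemoDataset:
    experiences: List[DemoExperience] = field(default_factory=list)  # the set Omega from vehicle i
    gamma: float = 0.99                                              # discount coefficient in [0, 1]
```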
The pre-training process comprises the following steps:
(1) Given a stochastic strategy π_i(a|s_n^Ω): after the state s_n^Ω is input, it outputs a probability distribution over the action a_n^Ω;
(2) Derive the expected total loss function J^{π_i}(s_n^Ω): it denotes the total loss accumulated from the current state s_n^Ω to the final state when strategy π_i is always followed [expression shown only as an image in the original publication];
(3) Derive the random-exploration total loss function Q^{π_i}(s_n^Ω, a): if the agent, in state s_n^Ω, does not execute the action prescribed by strategy π_i but instead executes another action a, while still following strategy π_i in subsequent states, the expected total loss is Q^{π_i}(s_n^Ω, a);
(4) Derive the advantage function A^{π_i}(s_n^Ω, a): it denotes the advantage of an action a outside the random-exploration strategy π_i [expression shown only as an image in the original publication];
(5) Determine the problem formulation: given the current state s_n^Ω, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function J^{π_i}; when the exploration process converges, π_i* satisfies π_i* = argmin over π_i ∈ Π of J^{π_i}(s_n^Ω), where Π is the set of stochastic strategies.
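Since the patent's own expressions for these quantities appear only as equation images, the following LaTeX block records one plausible set of standard-notation forms consistent with the verbal definitions above (cumulative discounted loss, one-step deviation, advantage, and the converged optimum); it is a reconstruction for readability, not the original formulas.

```latex
% Standard-notation reconstruction consistent with the verbal definitions above;
% the patent's own equations are reproduced only as images.
\begin{align}
J^{\pi_i}(s^{\Omega}_n)   &= \mathbb{E}_{\pi_i}\!\Big[\sum_{k \ge 0} \gamma^{k}\, l^{\Omega}_{n+k} \,\Big|\, s^{\Omega}_n\Big]
  && \text{expected total loss when always following } \pi_i \\
Q^{\pi_i}(s^{\Omega}_n,a) &= \mathbb{E}\!\Big[\, l^{\Omega}_n + \gamma\, J^{\pi_i}(s^{\Omega}_{n+1}) \,\Big|\, s^{\Omega}_n, a\Big]
  && \text{loss when } a \text{ is executed once, then } \pi_i \\
A^{\pi_i}(s^{\Omega}_n,a) &= Q^{\pi_i}(s^{\Omega}_n,a) - J^{\pi_i}(s^{\Omega}_n)
  && \text{advantage (extra loss) of the off-policy action } a \\
\pi_i^{*}                 &= \arg\min_{\pi_i \in \Pi}\, J^{\pi_i}(s^{\Omega}_n)
  && \text{optimal strategy at convergence}
\end{align}
```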
The asynchronous supervised learning introduces an actor-critic neural network as a nonlinear function approximator to predict the stochastic action strategy π(a|s; θ) and the expected total loss function J(s; θ_v), so as to solve the problem formulation above, where θ and θ_v are the parameters of the actor and the critic networks respectively. They are updated asynchronously: each thread accumulates gradients with respect to its thread-specific parameters θ' and θ'_v and applies them to the globally shared parameters θ and θ_v [the update formulas are shown only as images in the original publication].
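A minimal sketch of one asynchronous supervised-learning worker is given below, assuming a PyTorch-style API and an A3C/Hogwild-like arrangement in which every thread keeps thread-local parameters and writes its gradients into a globally shared actor-critic model. Because the patent's loss and update formulas are shown only as images, a Gaussian negative log-likelihood on the demonstrated steering angle plus an L2 critic loss is used as a stand-in; every identifier here (asl_worker, demo_loader, the loss weighting) is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of one asynchronous supervised-learning worker thread. The shared
# actor-critic model plays the role of the global parameters (theta, theta_v); the
# local copy plays the role of the thread parameters (theta', theta'_v). The exact
# loss and update formulas in the patent are shown only as images, so a Gaussian
# negative log-likelihood on the demonstrated action plus an L2 value loss is a
# stand-in here.

def asl_worker(shared_model, make_local_model, demo_loader, lr=1e-4):
    local_model = make_local_model()
    optimizer = torch.optim.Adam(shared_model.parameters(), lr=lr)
    for states, demo_actions in demo_loader:                     # one demonstration data set
        local_model.load_state_dict(shared_model.state_dict())   # sync theta' <- theta
        mean, log_std, value = local_model(states)                # actor heads and critic head
        demo_actions = demo_actions.view_as(mean)                 # align shapes (batch, 1)
        # Behavior-cloning style loss on the demonstrated steering angle:
        nll = 0.5 * ((demo_actions - mean) / log_std.exp()).pow(2) + log_std
        policy_loss = nll.mean()
        # Critic regresses the per-experience supervised loss (an assumed target):
        value_loss = F.mse_loss(value, nll.detach())
        optimizer.zero_grad()
        local_model.zero_grad()
        (policy_loss + 0.5 * value_loss).backward()
        # Hand the thread-local gradients to the globally shared parameters:
        for local_p, shared_p in zip(local_model.parameters(), shared_model.parameters()):
            shared_p.grad = local_p.grad.clone()
        optimizer.step()
```

In a real implementation the workers would typically run in separate processes (for example torch.multiprocessing with a shared optimizer) rather than Python threads; the threading variant is kept here only to mirror the per-thread parameters θ' and θ'_v described above.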
The visual analysis method is designed on the basis of single-factor (univariate) analysis. Specifically, while all other pixels of the image input to the model are kept unchanged, the value of one pixel o is changed by an amount Δo; the influence of this change on the output of a given layer of the neural network is then computed from the weight and bias parameters of that layer [the propagation formula is shown only as an image in the original publication]. In this way the influence of each pixel of the input image on the final output of the model is obtained and plotted as an attention heat map of the end-to-end automatic driving model.
The number of pixels in the attention heat map is identical to the number of pixels of the image input to the model. Image areas that strongly influence the model output are highlighted in the heat map, so it can be checked from a microscopic perspective whether the areas the model attends to are areas relevant to the driving decision, thereby verifying the effectiveness of model training.
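The single-factor analysis can also be approximated without access to the layer-by-layer formula (which the publication shows only as an image) by perturbing each input pixel and measuring the change in the model output directly. The sketch below is a black-box variant under that assumption; `model` is any callable mapping the stacked frames to the steering output and is not a name taken from the patent.

```python
import numpy as np

# Black-box variant of the single-factor visual analysis: change one pixel o by
# delta_o, keep every other pixel fixed, and record how much the model output
# moves. The result is the per-pixel attention heat map described above.

def attention_heatmap(model, frames, delta=0.05):
    """frames: np.ndarray of shape (4, H, W); returns an (H, W) heat map in [0, 1]."""
    base = float(model(frames))
    heat = np.zeros(frames.shape[1:], dtype=np.float32)
    for y in range(frames.shape[1]):
        for x in range(frames.shape[2]):
            perturbed = frames.copy()
            perturbed[:, y, x] += delta                 # change pixel o by delta_o in every frame
            heat[y, x] = abs(float(model(perturbed)) - base)
    return heat / (heat.max() + 1e-8)                   # normalize for plotting
```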
A multi-vehicle distributed reinforcement-learning-driven automatic driving model training system is composed of a plurality of robot vehicles, building models, a road-surface map and the like, and comprises a strategy learning scene, a strategy verification scene and a UWB positioning-reinforcement learning reward system. The robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training; the vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks in parallel and asynchronously.
The robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene; the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score.
The UWB positioning-reinforcement learning reward system determines the position of each robot vehicle from the UWB positioning tag bound to it and, according to the reinforcement learning reward function, gives the rewards acquired in real time by the strategy-learning and strategy-verification vehicles during reinforcement learning training.
Compared with the prior art, the invention has the following advantages and positive effects: aiming at the poor initial performance and slow convergence of existing reinforcement-learning-driven end-to-end automatic driving models, the invention proposes a series of methods, centered on the asynchronous-supervised-learning-based pre-training method, covering end-to-end automatic driving model pre-training, effect analysis, and real-world deployment verification. It effectively addresses the difficulty of deploying reinforcement-learning-driven end-to-end automatic driving models in practice, greatly advances the development of learning-driven end-to-end automatic driving technology, and supports the development of automatic driving technology in China. In summary, the method is of great significance for improving the overall performance of end-to-end automatic driving systems on vehicles.
Drawings
FIG. 1 is a diagram of an end-to-end autopilot model architecture;
FIG. 2 is a diagram of the theoretical architecture of an asynchronous supervised learning approach;
FIG. 3 is a diagram of a multi-vehicle distributed reinforcement learning driven autopilot model training system architecture;
FIG. 4 is an example of the visual analysis method.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents likewise fall within the scope defined by the appended claims. The end-to-end automatic driving model used by the invention is shown in FIG. 1. The input of the model is four pre-processed driving images captured by the front camera. The first convolution layer contains 32 convolution kernels of size 8 × 8 with stride 4, immediately followed by 32 convolution kernels of size 4 × 4 with stride 2, then 32 convolution kernels of size 3 × 3 with stride 1, and finally a fully connected layer with 256 hidden units. Each of these four hidden layers is followed by a rectified linear unit (ReLU) activation layer. The neural network in FIG. 1 has two sets of outputs: two linear outputs representing the mean and variance of the normal distribution of the model's output action, and one linear output representing the value function.
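Under the reading of the layer sizes given above (8 × 8 kernels with stride 4, 4 × 4 with stride 2, 3 × 3 with stride 1, then a 256-unit fully connected layer, with ReLU activations and three output heads), the model of FIG. 1 can be sketched as follows. The kernel sizes, the input resolution, and the head names are assumptions drawn from that reading of the garbled text, not values confirmed by the patent.

```python
import torch
import torch.nn as nn

# Sketch of the end-to-end automatic driving model of FIG. 1 under the assumptions
# stated above. Input: 4 stacked single-channel front-camera frames.

class EndToEndDrivingModel(nn.Module):
    def __init__(self, in_frames=4, img_h=84, img_w=84):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                       # infer the flattened feature size
            flat = self.features(torch.zeros(1, in_frames, img_h, img_w)).shape[1]
        self.fc = nn.Sequential(nn.Linear(flat, 256), nn.ReLU())
        self.mean = nn.Linear(256, 1)               # mean of the action's normal distribution
        self.log_std = nn.Linear(256, 1)            # log standard deviation of that distribution
        self.value = nn.Linear(256, 1)              # linear output for the value function

    def forward(self, x):
        h = self.fc(self.features(x))
        return self.mean(h), self.log_std(h), self.value(h)
```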
Pre-training process modeling
The pre-training process based on asynchronous supervised learning is defined by a five-tuple ⟨S, A, L, P, γ⟩ as follows:
State S: let s_n^Ω ∈ S be the set of time-varying environment states collected by demonstration vehicle i, where s_n^Ω denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
Action A: let a_n^Ω ∈ A be the set of demonstrated driving actions collected by demonstration vehicle i, where a_n^Ω denotes the demonstrated action (wheel steering angle) of the n-th experience in Ω, and a_n^Ω is a continuous value;
Loss function L: let l_n^Ω denote the loss of the n-th experience in Ω; it measures the deviation between the demonstrated action a_n^Ω and the action â_n output by the pre-trained model after the input s_n^Ω, where μ̂ and σ̂ are the mean and variance variables corresponding to â_n [the exact expression is shown only as an image in the original publication];
State transition function P: let P(s_{n+1}^Ω | s_n^Ω, a_n^Ω) denote the probability that, given state s_n^Ω and action a_n^Ω (assuming n corresponds to the t-th time slot), the system transitions to state s_{n+1}^Ω in the next time slot;
Discount coefficient γ: γ ∈ [0,1], used to balance current loss and long-term loss.
Problem formula derivation
According to the pre-training process model, we further derive the problem formulation of the pre-training process:
(1) Given a stochastic strategy π_i(a|s_n^Ω): after the state s_n^Ω is input, it outputs a probability distribution over the action a_n^Ω;
(2) Derive the expected total loss function J^{π_i}(s_n^Ω): it denotes the total loss accumulated from the current state s_n^Ω to the final state when strategy π_i is always followed [expression shown only as an image in the original publication];
(3) Derive the random-exploration total loss function Q^{π_i}(s_n^Ω, a): if the agent, in state s_n^Ω, does not execute the action prescribed by strategy π_i but instead executes another action a, while still following strategy π_i in subsequent states, the expected total loss is Q^{π_i}(s_n^Ω, a);
(4) Derive the advantage function A^{π_i}(s_n^Ω, a): it denotes the advantage of an action a outside the random-exploration strategy π_i [expression shown only as an image in the original publication];
(5) Determine the problem formulation: given the current state s_n^Ω, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function J^{π_i}; when the exploration process converges, π_i* satisfies π_i* = argmin over π_i ∈ Π of J^{π_i}(s_n^Ω), where Π is the set of stochastic strategies.
Pre-training demonstration data acquisition
The demonstration data sets used for pre-training are collected by data collection vehicles N'_i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of the preview theory as follows:
Wheel steering: collection vehicle i determines the wheel steering angle from its current speed v_it and the position of the preceding vehicle:
(1) Collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) according to
l_EF = L + v_it × Δt,
where L is a fixed preview distance and Δt is a preview coefficient;
(2) The steering angle pointing at the preview point F is calculated from the geometric relation between F and the center of collection vehicle i [formula shown only as an image in the original publication];
(3) The steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision, using the weight W = D/C, where D is the lateral distance between the preceding vehicle j and collection vehicle i and C is a side-collision safety threshold [correction formula shown only as an image in the original publication].
Brake/throttle: collection vehicle i determines the brake and throttle amounts from its current speed v_it, the speed limit of the current road section r_t (denoted v_max below), and the distance d_it to the preceding vehicle j:
(1) Determine the speed limit v_max of the current road section r_t [formula shown only as an image in the original publication], where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) When the vehicle speed v_it does not exceed v_max, the throttle amount is increased; when v_it exceeds v_max, or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
Asynchronous supervised learning method
To solve the problem formulation above, we introduce an actor-critic neural network as a nonlinear function approximator to predict the stochastic action strategy π(a|s; θ) and the expected total loss function J(s; θ_v), where θ and θ_v are the parameters of the actor and the critic networks respectively. They are updated asynchronously: each thread accumulates gradients with respect to its thread-specific parameters θ' and θ'_v and applies them to the globally shared parameters θ and θ_v [the update formulas are shown only as images in the original publication]. Executing a plurality of supervised learning processes in this parallel and asynchronous manner on a plurality of pre-training demonstration data sets constitutes the asynchronous supervised learning method.
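Putting the pieces together, a driver like the following would launch one supervised learning process per pre-training demonstration data set and let them update the shared model asynchronously. It reuses the hypothetical asl_worker, shared_model, make_local_model and demo_loaders names from the sketch given earlier in this document; none of them come from the patent itself.

```python
import threading

# Illustrative driver for the asynchronous supervised learning method:
# one worker thread per pre-training demonstration data set, all threads
# updating one globally shared actor-critic model.

def pretrain_asynchronously(shared_model, make_local_model, demo_loaders):
    threads = [threading.Thread(target=asl_worker,
                                args=(shared_model, make_local_model, loader))
               for loader in demo_loaders]
    for t in threads:
        t.start()
    for t in threads:
        t.join()          # pre-training ends once every supervised learning process finishes
    return shared_model
```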
Visual analysis
The visual analysis method is designed on the basis of single-factor (univariate) analysis. Specifically, while all other pixels of the image input to the model are kept unchanged, the value of one pixel o is changed by an amount Δo; the influence of this change on the output of a given layer of the neural network is computed from the weight and bias parameters of that layer [the propagation formula is shown only as an image in the original publication]. The influence of each pixel of the input image on the final output of the model is finally obtained and plotted as an attention heat map of the end-to-end automatic driving model. The number of pixels in the attention heat map is identical to the number of pixels of the image input to the model; image areas that strongly influence the model output are highlighted in the heat map, so it can be checked from a microscopic perspective whether the areas the model attends to are relevant to the driving decision, thereby verifying the effectiveness of model training. For example, if the highlighted regions lie on the sky or roadside buildings in the input image, it can be inferred that the model training is problematic; conversely, if the highlighted regions lie on the road surface, other vehicles, and similar elements of the image, the training can be judged effective.
Multi-vehicle distributed reinforcement-learning-driven automatic driving model training system
In order to verify the engineering feasibility of the pre-training method in the real world, the invention provides a multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, which comprises a strategy learning scene, a strategy verification scene and a UWB positioning-reinforcement learning reward system and is composed of a plurality of robot vehicles, building models, a road-surface map and the like. The robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training; the vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks in parallel and asynchronously. The robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene; the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score. The UWB positioning-reinforcement learning reward system determines the position of each vehicle from the UWB positioning tag bound to it and, according to the reinforcement learning reward function, gives the rewards acquired in real time by the strategy-learning and strategy-verification vehicles during reinforcement learning training.
In this embodiment, the theoretical framework of the asynchronous supervised learning method provided by the present invention is shown in FIG. 2. A plurality of agents, each carrying the end-to-end autopilot model shown in FIG. 1, execute a plurality of supervised learning processes asynchronously and in parallel on a plurality of demonstration data sets collected by real vehicles, so as to improve the stability of the supervised learning process and accelerate the convergence of pre-training. After the end-to-end automatic driving model is pre-trained with expert demonstration data collected from the real world, its initial performance in the subsequent real-vehicle reinforcement learning training stage is improved and its convergence is accelerated.
In this embodiment, the architecture of the multi-vehicle distributed reinforcement-learning-driven automatic driving model training system provided by the present invention is shown in FIG. 3; it comprises 2 strategy learning scenes, 1 strategy verification scene and a UWB positioning-reinforcement learning reward system. After the real-vehicle training system is built, the manually designed heuristic strategy π'_i provided by the invention drives the robot vehicles to collect demonstration data in the strategy learning scenes, constructing the pre-training demonstration data set Ω. The end-to-end automatic driving model is then pre-trained on the data set Ω with the asynchronous supervised learning method; after pre-training, the model is deployed on the robot vehicles and subsequent reinforcement learning training is carried out in the real-vehicle training system.
The training samples may be biased, so that the trained model cannot actually be used to solve the intended problem. This is difficult to analyze macroscopically from the existing training data; it is then necessary to determine microscopically whether the model reacts to the correct positions in the input image. Therefore, the training-effect visual analysis method is designed on the basis of single-factor analysis: by making a slight change to each pixel of the input image in turn and observing the change of the model output, the 'attention' the model pays to each pixel is obtained, and the heat map of the sensitive areas of the end-to-end automatic driving model is drawn. As shown in FIG. 4, for example, changing the pixel marked by the blue square on the left yields a different result after the image is input to the model; this difference is the importance of that pixel to the model output, and obtaining the importance of every pixel allows the heat map to be drawn.

Claims (7)

1. An end-to-end automatic driving model pre-training method based on asynchronous supervised learning, characterized in that a plurality of supervised learning processes (asynchronous supervised learning) are executed asynchronously and in parallel on a plurality of demonstration data sets collected by real vehicles, so as to improve the stability of the supervised learning process and accelerate the convergence of the pre-training process;
the pre-training process is defined by a five-tuple ⟨S, A, L, P, γ⟩ as follows:
state S: let s_n^Ω ∈ S be the set of time-varying environment states collected by demonstration vehicle i, where s_n^Ω denotes the state of the n-th experience in the pre-training demonstration data set Ω and consists of 4 consecutive single-channel images captured by the front camera;
action A: let a_n^Ω ∈ A be the set of demonstrated driving actions collected by demonstration vehicle i, where a_n^Ω denotes the demonstrated action of the n-th experience in Ω, and a_n^Ω is a continuous value;
loss function L: let l_n^Ω denote the loss of the n-th experience in Ω; it measures the deviation between the demonstrated action a_n^Ω and the action â_n output by the pre-trained model after the input s_n^Ω, where μ̂ and σ̂ are the mean and variance variables corresponding to â_n [the exact expression is shown only as an image in the original publication];
state transition function P: let P(s_{n+1}^Ω | s_n^Ω, a_n^Ω) denote the probability that, given state s_n^Ω and action a_n^Ω, the system transitions to state s_{n+1}^Ω in the next time slot;
discount coefficient γ: γ ∈ [0,1], used to balance current loss and long-term loss;
the pre-training process comprises the following steps:
(1) given a stochastic strategy π_i(a|s_n^Ω): after the state s_n^Ω is input, it outputs a probability distribution over the action a_n^Ω;
(2) derive the expected total loss function J^{π_i}(s_n^Ω): it denotes the total loss accumulated from the current state s_n^Ω to the final state when strategy π_i is always followed [expression shown only as an image in the original publication];
(3) derive the random-exploration total loss function Q^{π_i}(s_n^Ω, a): if the agent, in state s_n^Ω, does not execute the action prescribed by strategy π_i but instead executes another action a, while still following strategy π_i in subsequent states, the expected total loss is Q^{π_i}(s_n^Ω, a);
(4) derive the advantage function A^{π_i}(s_n^Ω, a): it denotes the advantage of an action a outside the random-exploration strategy π_i [expression shown only as an image in the original publication];
(5) determine the problem formulation: given the current state s_n^Ω, find the optimal strategy π_i* by minimizing the advantage function, so as to minimize the expected total loss function J^{π_i}; when the exploration process converges, π_i* satisfies π_i* = argmin over π_i ∈ Π of J^{π_i}(s_n^Ω), where Π is the set of stochastic strategies;
the asynchronous supervised learning introduces an actor-critic neural network as a nonlinear function approximator to predict the stochastic action strategy π(a|s; θ) and the expected total loss function J(s; θ_v), so as to solve the problem formulation, where θ and θ_v are the parameters of the actor and the critic networks respectively; they are updated asynchronously, each thread accumulating gradients with respect to its thread-specific parameters θ' and θ'_v and applying them to the globally shared parameters θ and θ_v [the update formulas are shown only as images in the original publication].
2. The method of claim 1, wherein the demonstration data sets are collected by data collection vehicles N'_i driven by a manually designed heuristic strategy π'_i, which is designed on the basis of the preview theory as follows: the strategy determines the wheel steering angle and the brake/throttle amount, where the wheel steering angle is determined by collection vehicle i from its current speed v_it and the position of the preceding vehicle in the following specific steps:
(1) collection vehicle i determines the preview point F(x'_it, y'_it) from its current speed v_it and position E(x_it, y_it) according to l_EF = L + v_it × Δt, where L is a fixed preview distance and Δt is a preview coefficient;
(2) the steering angle pointing at the preview point F is calculated from the geometric relation between F and the center of collection vehicle i [formula shown only as an image in the original publication];
(3) the steering angle is corrected according to the position and speed of the preceding vehicle j to avoid collision, using the weight W = D/C, where D is the lateral distance between the preceding vehicle j and collection vehicle i and C is a side-collision safety threshold [correction formula shown only as an image in the original publication];
and the brake and throttle amounts are determined by collection vehicle i from its current speed v_it, the speed limit of the current road section r_t (denoted v_max below), and the distance d_it to the preceding vehicle j in the following specific steps:
(1) determine the speed limit v_max of the current road section r_t [formula shown only as an image in the original publication], where g is the acceleration of gravity, u is the friction coefficient, M is the vehicle mass, CA is the downforce coefficient, and ρ is the curvature of road section r_t;
(2) when the vehicle speed v_it does not exceed v_max, the throttle amount is increased; when v_it exceeds v_max, or the distance to the preceding vehicle j is less than the forward-collision safety threshold, the braking amount is increased.
3. A visual analysis method for the end-to-end automatic driving model training process, proposed for the training method of claim 1, characterized in that the visual analysis method is designed on the basis of single-factor (univariate) analysis; specifically, while all other pixels of the image input to the model are kept unchanged, the value of one pixel o is changed by an amount Δo, and the influence of this change on the output of a given layer of the neural network is computed from the weight and bias parameters of that layer [the propagation formula is shown only as an image in the original publication]; the influence of each pixel of the input image on the final output of the model is finally obtained and plotted as an attention heat map of the end-to-end automatic driving model.
4. The analysis method according to claim 3, wherein the number of pixels in the attention heat map of the end-to-end automatic driving model is identical to the number of pixels of the image input to the model; image areas that strongly influence the model output are highlighted in the heat map, so whether the areas the model attends to are areas relevant to the driving decision is checked from a microscopic perspective, and the effectiveness of model training is verified.
5. A multi-vehicle distributed reinforcement-learning-driven automatic driving model training system, provided for the end-to-end automatic driving model pre-training method based on asynchronous supervised learning, characterized in that the training system is composed of a plurality of robot vehicles, building models and a road-surface map and comprises a strategy learning scene, a strategy verification scene and a UWB positioning-reinforcement learning reward system, wherein the robot vehicles in the strategy learning scene act as learners of the driving strategy during reinforcement learning training, and the plurality of vehicles explore the environment in a distributed manner and execute the actor-critic network parameter update tasks asynchronously and in parallel.
6. The training system of claim 5, wherein the robot vehicle in the strategy verification scene inherits the global driving strategy updated in parallel and asynchronously by the other vehicles and drives in the strategy verification scene, the UWB positioning-reinforcement learning reward system gives rewards, and the strategy verification vehicle records the score.
7. The training system of claim 6, wherein the UWB positioning-reinforcement learning reward system determines the position of each vehicle from the UWB positioning tag bound to the robot vehicle and, according to the reinforcement learning reward function, gives the rewards acquired in real time by the strategy-learning and strategy-verification vehicles during reinforcement learning training.
CN202010727803.2A 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning Active CN112508164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727803.2A CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010727803.2A CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Publications (2)

Publication Number Publication Date
CN112508164A CN112508164A (en) 2021-03-16
CN112508164B true CN112508164B (en) 2023-01-10

Family

ID=74953327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010727803.2A Active CN112508164B (en) 2020-07-24 2020-07-24 End-to-end automatic driving model pre-training method based on asynchronous supervised learning

Country Status (1)

Country Link
CN (1) CN112508164B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743469B (en) * 2021-08-04 2024-05-28 北京理工大学 Automatic driving decision method integrating multi-source data and comprehensive multi-dimensional indexes
CN113561986B (en) * 2021-08-18 2024-03-15 武汉理工大学 Automatic driving automobile decision making method and device
CN113449823B (en) * 2021-08-31 2021-11-19 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN114895560B (en) * 2022-04-25 2024-03-19 浙江大学 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
AT526259A1 (en) * 2022-06-23 2024-01-15 Avl List Gmbh Method for training an artificial neural network of a driver model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018102425A1 (en) * 2016-12-02 2018-06-07 Starsky Robotics, Inc. Vehicle control system and method of use
CN109492763B (en) * 2018-09-17 2021-09-03 同济大学 Automatic parking method based on reinforcement learning network training

Also Published As

Publication number Publication date
CN112508164A (en) 2021-03-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant