WO2022221979A1 - Automated driving scenario generation method, apparatus, and system - Google Patents
Automated driving scenario generation method, apparatus, and system
- Publication number
- WO2022221979A1 (PCT/CN2021/088037)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ads
- automatic driving
- action
- feedback information
- network
- Prior art date
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Definitions
- the present application relates to the technical field of automatic driving simulation, and in particular, to a method, device and system for generating automatic driving scenarios.
- Security issues include information security issues (security) and functional safety issues (safety).
- Security mainly refers to safety problems caused by deliberate human attacks, such as property loss, privacy leakage, the system being taken over, or system failure.
- Safety mainly refers to safety problems caused by non-malicious faults inside the system; information security problems caused by malicious attacks can also lead to functional safety problems such as the system being taken over or system failure.
- The safety of conventional vehicles is assessed through certification, such as certifying whether the vehicle meets safety requirements according to the ISO 26262 standard and the ISO/PAS 21448 SOTIF standard.
- However, smart vehicles have the following two problems: (1) it is difficult to manually check the completeness of the specification, because the vehicle's autonomous driving system is mainly a machine learning system whose perception, prediction, planning and control logic is obtained through training and learning, so the requirement specification is vague; (2) it is difficult to manually check the correctness of the implementation.
- The neural network of a machine learning system consists of an input layer, an output layer, multiple hidden layers and many neurons, and passes high-dimensional vectors through linear or nonlinear activation functions, which is logically incomprehensible. Therefore, the safety of smart vehicles currently cannot be evaluated through certification.
- The present application provides a method, device and system for generating an automatic driving scenario, so as to solve the problems in the prior art that the test driving scenarios are limited and the testers are not as numerous as real users, resulting in long testing time, high cost, and difficulty in covering all driving scenarios.
- an embodiment of the present application provides a method for generating an automatic driving scene, including:
- Acquiring feedback information output by the automatic driving system ADS when testing in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;
- obtaining safety violation parameters and coverage parameters based on the feedback information; wherein the safety violation parameters are used to indicate the probability of safety violations when the vehicle corresponding to the ADS drives according to the vehicle control instructions in the reference automatic driving scenario;
- the coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference automatic driving scenario;
- if the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scenario is updated based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameters and coverage parameters obtained based on the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset index.
- the feedback information output by the ADS during the test in the reference automatic driving scenario is obtained, and the safety violation parameters and coverage parameters are obtained based on the feedback information.
- If the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scenario is updated to obtain an updated reference automatic driving scenario, until the safety violation parameters and coverage parameters obtained based on the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset indicators.
- In this way, the driving scenario is guided to be updated in a direction that is likely to lead to safety violations and has not been tested before, that is, the driving scenarios that are most likely to lead to safety violations and have high coverage are generated, which can not only improve the efficiency of testing driving scenarios, but also cover more driving scenarios; in addition, closed-loop automated testing can further reduce time overhead and labor overhead.
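The closed loop described above can be illustrated with a minimal sketch. The function arguments (run_ads_test, compute_parameters, update_scenario) and the preset-index keys are hypothetical placeholders rather than part of the patent or of any real simulator API; the patent only requires that the loop continues until both parameters meet the preset indicators.

```python
def generate_scenario(scenario, run_ads_test, compute_parameters, update_scenario,
                      preset, max_iterations=1000):
    """Iterate until both the safety-violation and coverage parameters meet the preset indicators."""
    for _ in range(max_iterations):
        feedback = run_ads_test(scenario)                  # vehicle control instructions + NN behavior info
        violation, coverage = compute_parameters(feedback) # safety-violation and coverage parameters
        if violation >= preset["violation"] and coverage >= preset["coverage"]:
            break                                          # preset indicators satisfied
        scenario = update_scenario(scenario, feedback)     # e.g. via the reinforcement learning agent
    return scenario
```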
- the autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
- the safety violation parameter includes any one or more of: the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, and the probability that the vehicle corresponding to the ADS exceeds the speed limit;
- the coverage parameter includes any one or more of: the number of activated neurons in the neural network corresponding to the ADS, and the weight of the influence of input subsets on the prediction result in the heat map of the neural network corresponding to the ADS.
- the updating of the reference automatic driving scenario based on the feedback information includes:
- acquiring the action selected by a reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information; wherein the reinforcement learning agent is an agent that determines a reward based on the feedback information and the current state of the vehicle corresponding to the ADS, and selects the action from the action space based on the reward and the current state of the vehicle corresponding to the ADS;
- the reference driving scenario is updated based on the selected action.
- the reward is the sum of the safety-violation-based reward and the coverage-based reward;
- the safety-violation-based reward is the degree to which the safety violation probability of the vehicle corresponding to the ADS is close to the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information, and the coverage-based reward is the degree to which the coverage of the reference automatic driving scenario determined based on the neural network behavior information in the feedback information is close to the preset index;
- the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
- In the above technical solution, the reinforcement learning agent determines the reward based on the feedback information and the state of the vehicle corresponding to the current ADS, selects the action from the action space based on the reward and the state of the vehicle corresponding to the current ADS, and then updates the reference driving scenario based on the selected action.
- The reward, determined from the vehicle control instructions and the neural network behavior information in the feedback information, guides the behavior: the reinforcement learning agent selects actions from the action space of the reference automatic driving scenario with the goal of obtaining the maximum reward, so as to guide the driving scenario to be updated in a direction that is likely to lead to safety violations and has not been tested before, that is, to generate the driving scenarios that are most likely to lead to safety violations and have high coverage, which can not only improve the efficiency of testing driving scenarios, but also cover more driving scenarios.
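The reward definition above can be read as the sum of two closeness terms. The sketch below uses a simple ratio-to-target as the closeness measure; this measure is an assumption, since the patent only states that each reward reflects how close the corresponding quantity is to its preset index.

```python
def closeness(value, target):
    """How close a quantity is to its preset indicator, clipped to [0, 1] (assumed measure)."""
    if target <= 0:
        return 1.0
    return min(value / target, 1.0)

def reward(violation_probability, coverage, preset_violation, preset_coverage):
    """Total reward = safety-violation-based reward + coverage-based reward."""
    r_violation = closeness(violation_probability, preset_violation)
    r_coverage = closeness(coverage, preset_coverage)
    return r_violation + r_coverage
```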
- the neural network model of the reinforcement learning agent includes a value network and a strategy network; the value network is used to calculate the value of a set action in a set state, and the strategy network is used to obtain the action probability distribution in the set state.
- Before acquiring the action selected by the reinforcement learning agent from the action space of the reference autonomous driving scenario based on the feedback information, the method further comprises: calculating, based on the value network, the first value of the first action in the first state and the second value of the second action in the second state; determining a time difference error based on the first value, the second value and the reward of the first action, wherein the time difference error is the difference between the value predicted by the value network and the actual value; obtaining the gradients of the value network and the strategy network; updating the parameters of the value network based on the time difference error and the gradient of the value network; and updating the parameters of the strategy network based on the time difference error and the gradient of the strategy network.
- the two neural networks, the value network and the strategy network respectively approximate the value function and the strategy function of the reinforcement learning agent.
- the value function refers to the rules by which the agent evaluates the quality of actions and states using the rewards provided by the environment in reinforcement learning.
- In the above technical solution, the reinforcement learning agent is trained so that it can strategically update the reference autonomous driving scenario based on the vehicle control instructions and neural network behavior information in the feedback information, thereby guiding the autonomous driving scenario to be updated in a direction that is prone to safety violations and has not been tested before.
- That is, the driving scenarios that most need to be tested, which are likely to lead to safety violations and have high coverage, are generated; this can not only improve the efficiency of testing driving scenarios, but also cover more driving scenarios.
- The present application further provides a device for generating an automatic driving scenario, the device having the function of implementing the method in the first aspect or any possible design of the first aspect; the function can be realized by hardware, or by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the above functions, such as a first obtaining module, a second obtaining module, and an updating module.
- the first obtaining module is used to obtain feedback information output by the automatic driving system ADS when testing in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;
- the second obtaining module is configured to obtain safety violation parameters and coverage parameters based on the feedback information; wherein the safety violation parameters are used to indicate the probability of safety violations when the vehicle corresponding to the ADS drives according to the vehicle control instructions in the reference automatic driving scenario, and the coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference autonomous driving scenario;
- the updating module is configured to, if the safety violation parameter or the coverage parameter does not meet the preset index, update the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained from the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset index.
- the device further includes a creation module, configured to create the reference automatic driving scenario, for example based on a distribution analysis of the initial automatic driving scenario acquired for the ADS.
- the autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
- the safety violation parameter includes any one or more of: the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, and the probability that the vehicle corresponding to the ADS exceeds the speed limit;
- the coverage parameter includes any one or more of: the number of activated neurons in the neural network corresponding to the ADS, and the weight of the influence of input subsets on the prediction result in the heat map of the neural network corresponding to the ADS.
- the update module is specifically used for:
- acquiring the action selected by a reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information; wherein the reinforcement learning agent is an agent that determines a reward based on the feedback information and the current state of the vehicle corresponding to the ADS, and selects the action from the action space based on the reward and the current state of the vehicle corresponding to the ADS;
- the reference driving scenario is updated based on the selected action.
- the reward is the sum of the safety-violation-based reward and the coverage-based reward;
- the safety-violation-based reward is the degree to which the safety violation probability of the vehicle corresponding to the ADS is close to the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information, and the coverage-based reward is the degree to which the coverage of the reference automatic driving scenario determined based on the neural network behavior information in the feedback information is close to the preset index;
- the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
- the neural network model of the reinforcement learning agent includes a value network and a strategy network; the value network is used to calculate the value of a set action in a set state, and the strategy network is used to obtain the action probability distribution in the set state;
- a time difference error is determined, wherein the time difference error is the difference between the value predicted by the value network and the actual value;
- the present application further provides a system for generating an autonomous driving scenario
- the system for generating an autonomous driving scenario may include: at least one processor; and a memory and a communication interface communicatively connected to the at least one processor;
- the memory stores instructions that can be executed by the at least one processor, and the at least one processor implements the function of the method in the first aspect or any possible design of the first aspect by executing the instructions stored in the memory.
- The present application further provides a computer storage medium, where the computer storage medium includes computer instructions which, when executed on a computer, cause the computer to execute the method in the first aspect or any possible design of the first aspect.
- The present application further provides a computer program product which, when run on a computer, causes the computer to execute the method in the first aspect or any possible design of the first aspect.
- FIG. 1 is a schematic diagram of a training process of reinforcement learning provided by an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of a system for generating an automatic driving scene according to an embodiment of the present application
- FIG. 3 is a schematic flowchart of a method for generating an automatic driving scene according to an embodiment of the present application
- FIG. 4 is a schematic structural diagram of an apparatus for generating an automatic driving scene according to an embodiment of the present application
- FIG. 5 is a schematic structural diagram of another automatic driving scene generation system provided by an embodiment of the present application.
- ADS automatic driving system
- An autonomous driving system is a system for controlling a vehicle, and the vehicle is capable of autonomous driving under the control of the autonomous driving system.
- the autonomous driving system may include a collection device, two main processing devices, an auxiliary processing device, and a vehicle control device.
- the collecting device is used for collecting the initial environment information of the vehicle, and sending the initial environment information to the two main processing devices.
- the main processing device is used to process the received initial environment information to obtain target environment information, and then generate a vehicle control instruction according to the target environment information, and send the vehicle control instruction to the auxiliary processing device.
- the auxiliary processing device is used for sending the vehicle control command sent by one of the main processing devices to the vehicle control device, so that the vehicle control device controls the vehicle (such as forward, reverse or turn, etc.) according to the received vehicle control command.
- the auxiliary processing device can send the vehicle control command sent by the other main processing device to the vehicle control device.
- the automatic driving system of the vehicle is mainly a machine learning system, which is composed of a neural network.
- the neural network includes an input layer, an output layer, multiple hidden layers and many neurons.
- Agent is a very important concept in the field of artificial intelligence. Any independent entity that can think and interact with the environment can be abstracted as an agent.
- an agent can be a computer system or a part of a computer system in a specific environment. The agent can communicate and cooperate with other agents according to its own perception of the environment, according to the existing instructions or through self-learning, and autonomously complete the set goals in the environment where it is located.
- An agent can be software or a combination of software and hardware.
- Reinforcement learning (also known as evaluative learning)
- Reinforcement learning is used to describe and solve the problem in which an agent maximizes rewards or achieves a specific goal by learning a strategy during its interaction with the environment; that is, the agent learns in a "trial and error" way, and the reward obtained by interacting with the environment through actions guides the behavior, with the goal of enabling the agent to obtain the maximum reward.
- Reinforcement learning does not require a training data set.
- the reinforcement signal (i.e., reward) provided by the environment evaluates the quality of the action rather than telling the reinforcement learning system how to generate the correct action. Since the external environment provides little information, the agent must learn from its own experience; in this way, the agent acquires knowledge in an action-evaluation (i.e., reward) environment and improves its action plan to adapt to the environment.
- FIG. 1 is a schematic diagram of a training process of reinforcement learning provided by an embodiment of the application.
- reinforcement learning mainly involves the following elements: agent, environment, state, action and reward.
- the input of the agent is the state
- the output is the action.
- the policy function refers to the rules of adopting behavior used by the agent in reinforcement learning.
- the action can be output according to the state, and the action can be used to explore the environment to update the state.
- the update of the policy function depends on the policy gradient (PG); the policy function is usually a neural network.
- the training process of reinforcement learning is as follows: through multiple interactions between the agent and the environment, the actions, states and rewards of each interaction are obtained, and multiple sets of actions, states and rewards are used as training data to train the agent once; this process is then repeated for the next round of training on the agent until the convergence conditions are met.
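The interaction loop of FIG. 1 can be summarized with the generic sketch below. Here `env` and `agent` are hypothetical objects exposing a gym-style interface (reset/step and select_action/update); the patent does not prescribe any particular API, so this is only an illustration of the loop structure.

```python
def train(agent, env, episodes=100, steps_per_episode=200):
    """Generic reinforcement learning loop: collect (state, action, reward) tuples and train the agent."""
    for _ in range(episodes):
        state = env.reset()
        trajectory = []
        for _ in range(steps_per_episode):
            action = agent.select_action(state)          # agent outputs an action for the current state
            next_state, reward, done = env.step(action)  # action explores the environment; state updates
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                break
        agent.update(trajectory)                          # one training pass on the collected data
    return agent
```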
- the simulator has the ability to simulate driving scenarios by configuring various parameters, such as configuring parameters such as road network, traffic, pedestrians, landscape, weather, etc.
- the main modules in the simulator include the camera image, radar image, lidar image, dynamic model, vehicle position update, and inertial navigation (i.e., global positioning system and inertial sensor, GPS+IMU) modules; the first three capture images of the simulated driving scene, and the latter three are mainly used to dynamically update the position of the vehicle in the driving scene.
- a simulator-based simulation environment generates multiple driving scenarios and tests the generated driving scenarios. There are usually the following two methods:
- The first method is that a simulator-based simulation environment randomly generates as many driving scenarios as possible in a brute-force way for testing and learning.
- the steps are as follows: randomly or by brute force select a real driving scene; model the selected real driving scene, such as its road topology, traffic state, and weather, and configure the model in the simulator; automatically or manually update the weather, landscape and other information in the simulator, so as to generate a simulated driving scene in the simulator; set the ego-vehicle information in the simulated driving scene, such as the position of the ego vehicle and the type and location of its sensors; and use the ADS to test and verify the driving scenes generated in the simulator. Since the driving scenes of this method are imported by brute force or generated randomly, the method is blind, a large number of driving scenes are inevitably repeated, and it is difficult to traverse all driving scenes, so the efficiency is low and the coverage of the tested driving scenes cannot be effectively improved.
- The above two methods have certain shortcomings: the driving scenes generated by brute force are random, and the generated driving scenes contain a large number of repeated driving scenes, which entails huge time and labor costs, is inefficient, and cannot effectively improve the coverage of the tested driving scenarios; the formal method basically generates driving scenarios manually, so the driving scenarios are usually relatively simple, the construction efficiency is low, it is difficult to construct diverse driving scenes, and no safety guarantee can be provided. That is to say, the efficiency of the above two methods for testing driving scenarios is low, it is difficult to provide diverse driving scenarios, and the coverage of the tested driving scenarios cannot be improved quickly.
- In view of this, an embodiment of the present application provides a method for generating an automatic driving scene, which strategically updates the driving scenario through reinforcement learning, thereby improving the diversity of the driving scenarios and, at the same time, guiding the driving scenario to be updated in a direction that is likely to lead to safety violations and has not been tested before, improving the efficiency of testing driving scenarios; in addition, closed-loop automated testing can further reduce time overhead and labor overhead.
- FIG. 2 is a schematic structural diagram of a system for generating an automatic driving scene provided by an embodiment of the present application, as shown in FIG. 2 .
- the generation system of the automatic driving scene includes the vehicle 100 , the ADS 200 , the reinforcement learning agent 300 and the simulator 400 , wherein the ADS 200 can be set on the vehicle 100 , and a closed loop is formed between the ADS 200 , the reinforcement learning agent 300 and the simulator 400 .
- the simulator 400 is used to configure various parameters to simulate an autonomous driving scenario of the vehicle 100 .
- the ADS200 is used to test the automated driving scenario of the simulated vehicle 100 .
- the reinforcement learning agent 300 is used to take the output of the ADS 200 as the environment, and update the automatic driving scene of the vehicle 100 according to the reward of the environment.
- the ADS 200 may include a collection device, a main processing device, an auxiliary processing device, and a vehicle control device that are connected in sequence.
- the acquisition device may include: at least one sensor among a variety of sensors, such as a camera, a radar, a gyroscope, and an accelerometer.
- the processing capability of the main processing device may be stronger than that of the auxiliary processing device, and the main processing device may be a device integrating image processing functions, scalar computing functions, vector computing functions and matrix computing functions.
- the auxiliary processing device may be a microcontroller unit (MCU).
- the ADS may further include a target radar, and the target radar is connected to the auxiliary processing device.
- the radars in the embodiments of the present invention may all be lidars (light detection and ranging, Lidar).
- the radars in the embodiments of the present invention may also be other types of radars, such as millimeter-wave radars or Ultrasonic radar, which is not limited in this embodiment of the present invention.
- first and second in the embodiments of the present application are only used for description purposes, and cannot be interpreted as indicating or implying relative importance or implicitly indicating the number of indicated technical features.
- a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
- At least one means one or more
- plural means two or more.
- “And/or”, which describes the association relationship of the associated objects, indicates that there can be three kinds of relationships, for example, A and/or B, which can indicate: the existence of A alone, the existence of A and B at the same time, and the existence of B alone, where A, B can be singular or plural.
- FIG. 3 is a schematic flowchart of a method for generating an automatic driving scene provided by an embodiment of the present application.
- the method for generating an automatic driving scene may be applied to the system for generating an automatic driving scene shown in FIG. 2, or to a system similar in functional structure to that of FIG. 2.
- the specific flow of the method for generating the automatic driving scene is described as follows.
- Before the test, the system for generating the automatic driving scenario needs to acquire the initial automatic driving scenario of the ADS, where the initial automatic driving scenario may include any one or more of road, time, weather, vehicles, pedestrians, traffic lights, traffic signs, traffic police, and landscape. It then performs a distribution analysis on the acquired initial automatic driving scenario to determine the road types in the initial automatic driving scenario (such as straight roads, T-junctions, overpasses, and winding roads) and the probability of safety violations when the vehicle corresponding to the ADS drives in the initial automatic driving scenario (such as the probability of vehicle safety accidents or traffic violations). Finally, based on the results of the distribution analysis, it creates a reference automatic driving scenario, where the reference automatic driving scenario includes any one or more of a typical automatic driving scenario, a missing automatic driving scenario, and a violation-prone automatic driving scenario, which is not limited in this embodiment of the present application.
- the distribution analysis is off-line, that is, the distribution analysis of the initial automatic driving scene is not performed online in the simulator in the automatic driving scene generation system.
- Using offline distribution analysis to determine the road types in the initial autonomous driving scenario and the probability of safety violations when the vehicle corresponding to the ADS drives in the initial autonomous driving scenario reduces the difficulty of the analysis.
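A minimal sketch of such an offline distribution analysis follows. The record field names ("road_type", "violation") are hypothetical; the patent only says that road types and safety-violation probabilities are extracted from the initial scenarios.

```python
from collections import Counter

def analyze_initial_scenarios(scenarios):
    """Offline distribution analysis: road-type frequencies and per-road-type violation rates."""
    road_counts = Counter(s["road_type"] for s in scenarios)            # e.g. straight, T-junction, overpass
    violations = Counter(s["road_type"] for s in scenarios if s["violation"])
    violation_rate = {road: violations[road] / count
                      for road, count in road_counts.items()}
    return road_counts, violation_rate

# Typical scenarios correspond to frequent road types, missing scenarios to absent or rare ones,
# and violation-prone scenarios to road types with a high estimated violation rate.
```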
- the automatic driving scene generation system can simulate the reference automatic driving scenario by configuring various parameters of the simulator in the system, and obtain the feedback information output by the ADS in the system when tested in the simulated reference autonomous driving scenario; the feedback information output by the ADS includes the vehicle control instructions of the ADS in the reference autonomous driving scenario and the neural network behavior information of the ADS in the reference autonomous driving scenario.
- the vehicle control command may be a steering signal, a speed signal or a body control signal, which is not limited in the embodiment of the present application.
- the ADS collection device collects the initial environmental information of the vehicle in the reference automatic driving environment and sends the initial environmental information to the two main processing devices; the main processing device processes the received initial environmental information to obtain the target environment information, and then generates vehicle control instructions according to the target environment information, so that the vehicle control device controls the vehicle, such as moving forward, reversing, or turning, according to the received vehicle control instructions.
- the neural network of the ADS may be a deep neural network (DNN), a convolutional neural network (CNN), or a recurrent neural network (RNN), which is not limited in the embodiments of the present application.
- DNN deep neural networks
- CNN convolutional neural networks
- RNN recurrent neural network
- After obtaining the feedback information, the system for generating the automatic driving scenario may determine the safety violation parameter and the coverage parameter respectively based on the vehicle control instructions and the neural network behavior information in the feedback information.
- the safety violation parameter is used to indicate the probability of a safety violation when the vehicle corresponding to the ADS drives according to the vehicle control instruction in the reference automatic driving scenario.
- For example, the safety violation parameter may be the probability that the vehicle corresponding to the ADS violates the command of the traffic police, or the probability that the vehicle corresponding to the ADS exceeds the speed limit, which is not limited in this embodiment of the present application.
- For example, the vehicle control command may be to control the vehicle corresponding to the ADS to move forward by 50 meters or to reverse by 100 meters.
- the coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference autonomous driving scenario, for example, the number of activated neurons in the neural network.
- the neural network is sometimes called a multi-layer perceptron (MLP); its layers are divided according to their positions.
- MLP multi-layer perceptron
- the network layer can be divided into three categories, input layer, hidden layer and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.
- Any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer, so when the ADS is tested in the reference autonomous driving scenario, the activated neurons in the ADS neural network can be determined from the neural network behavior information.
- For example, if the neural network behavior information indicates that 2 neurons are activated in the input layer, 1 neuron is activated in the output layer, and 6 neurons are activated in the hidden layers, it can be determined that 9 neurons are activated in the ADS neural network when the ADS is tested in the reference autonomous driving scenario. If the neural network behavior information is a hierarchical correlation, a subset of the input layer can be found by back-propagating from the output layer according to the hierarchical correlation, and the heat map of the neural network can be determined; the heat map represents the contribution or weight of each input subset to the output result, and the influence of the input subsets on the output result can be seen intuitively from the heat map.
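The neuron-based coverage count can be sketched as below. The activation threshold is an assumed criterion (the patent does not define when a neuron counts as activated), and the activation values are illustrative.

```python
import numpy as np

def count_activated_neurons(layer_activations, threshold=0.0):
    """Count neurons whose activation exceeds a threshold, per layer and in total (assumed criterion)."""
    per_layer = [int(np.sum(a > threshold)) for a in layer_activations]
    return per_layer, int(sum(per_layer))

# Matching the example above: 2 activated in the input layer, 6 in the hidden layers,
# 1 in the output layer, i.e. 9 activated neurons in total.
layers = [
    np.array([0.7, 0.3, 0.0]),                           # input layer: 2 activated
    np.array([0.5, 0.2, 0.9, 0.1, 0.4, 0.6, 0.0, 0.0]),  # hidden layers: 6 activated
    np.array([0.8, 0.0]),                                # output layer: 1 activated
]
per_layer, total = count_activated_neurons(layers)       # per_layer == [2, 6, 1], total == 9
```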
- Further, the automatic driving scene generation system may determine whether the safety violation parameter and the coverage parameter meet the preset indicators. If the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scenario is updated based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameters and coverage parameters obtained based on the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset indicators.
- For example, if the safety violation parameter includes the probability that the vehicle corresponding to the ADS violates the command of the traffic police and the probability that the vehicle corresponding to the ADS exceeds the speed limit, the probability that the vehicle corresponding to the ADS violates the command of the traffic police is 40%, the probability that the vehicle corresponding to the ADS exceeds the speed limit is 30%, and both probabilities in the preset indicators should be 80%, then the test result does not meet the preset indicators. Similarly, if the coverage parameter includes the number of activated neurons in the neural network, the number of activated neurons in the neural network is 6, and the number of activated neurons in the neural network in the preset index should be 8, then the test result does not meet the preset index.
- In that case, the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information is obtained, wherein an action in the action space is a discrete or continuous update of the road topology, road degradation, dynamic time, dynamic weather, dynamic traffic or landscape information in the automatic driving scenario. For example, the road topology is the type of road in the reference autonomous driving scenario, such as a straight road, T-junction, overpass, or winding mountain road, and the road degradation is the degree of road degradation in the reference autonomous driving scenario.
- The reference driving scenario is then updated based on the selected action to obtain the updated reference automatic driving scenario, until the safety violation parameters and coverage parameters obtained based on the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset indicators.
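A sketch of how a selected action might be applied to the reference scenario is shown below. The dictionary keys and action fields are illustrative assumptions; the patent only requires discrete or continuous updates of road topology, road degradation, time, weather, traffic, or landscape information.

```python
def apply_action(scenario, action):
    """Apply a discrete or continuous update to one scenario dimension (illustrative keys)."""
    updated = dict(scenario)
    if action["type"] == "discrete":
        # e.g. switch road topology: straight road -> T-junction -> overpass -> winding mountain road
        updated[action["key"]] = action["value"]
    else:
        # e.g. continuously increase road degradation or rainfall intensity
        updated[action["key"]] = updated.get(action["key"], 0.0) + action["delta"]
    return updated

scenario = {"road_topology": "straight", "road_degradation": 0.1, "weather": "clear"}
scenario = apply_action(scenario, {"type": "discrete", "key": "road_topology", "value": "T-junction"})
scenario = apply_action(scenario, {"type": "continuous", "key": "road_degradation", "delta": 0.2})
```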
- Specifically, the reinforcement learning agent of the automatic driving scene generation system determines the reward and the state of the vehicle corresponding to the current ADS based on the feedback information, and selects the action from the action space of the reference autonomous driving scenario based on the reward and the state of the vehicle corresponding to the current ADS.
- the reward is the sum of the safety violation-based reward and the coverage-based reward, and the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference autonomous driving scene.
- the safety-violation-based reward is the degree to which the safety violation probability of the vehicle corresponding to the ADS is close to the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information. For example, the larger the safety-violation-based reward, the closer the safety violation probability of the vehicle corresponding to the ADS is to the preset index after the state of the vehicle is updated based on the vehicle control instruction, and the higher the probability of safety violations when the vehicle drives in the reference autonomous driving scenario; this encourages the reference autonomous driving scenario to evolve towards autonomous driving scenarios that are prone to safety violations;
- the coverage-based reward is the degree to which the coverage of the reference autonomous driving scenario determined based on the neural network behavior information in the feedback information is close to the preset index. For example, the larger the coverage-based reward, the closer the coverage of the reference autonomous driving scenario determined from the neural network behavior information is to the preset index, the more neurons and/or hierarchical correlations are activated in the neural network corresponding to the ADS when the ADS is tested in the reference autonomous driving scenario, and the higher the coverage of the reference autonomous driving scenario; encouraging the reference autonomous driving scenario to evolve towards previously untested autonomous driving scenarios allows more autonomous driving scenarios to be covered.
- In order for the reinforcement learning agent of the automatic driving scene generation system to strategically select actions from the action space of the reference automatic driving scenario based on the feedback information, the reinforcement learning agent needs to be trained.
- the neural network model of the reinforcement learning agent is set as a value network and a strategy network, wherein the value network is used to calculate the value of the set action in the set state, and the strategy network is used to obtain the action probability distribution in the set state.
- Specifically, the first value of the first action in the first state and the second value of the second action in the second state are calculated based on the value network; the time difference error is determined based on the first value, the second value and the reward of the first action, where the time difference error is the difference between the value predicted by the value network and the actual value; the gradients of the value network and the policy network are obtained; the parameters of the value network are updated based on the time difference error and the gradient of the value network; and the parameters of the policy network are updated based on the time difference error and the gradient of the policy network.
- the policy function refers to the rules of adopting behaviors used by the agent in reinforcement learning. For example, during the learning process, an action can be output according to the state, and the environment can be explored with this action to update the state.
- the value function refers to the rule by which the agent uses the reinforcement signal (i.e., reward) provided by the environment in reinforcement learning to evaluate the quality of actions and states; for example, the action value is used to evaluate how good an action is, and the state value is used to evaluate how good the current state is. Setting the neural network model of the reinforcement learning agent as a value network and a policy network means that the two neural networks respectively approximate the policy function and the action value function of the reinforcement learning agent, and thereby approximate the state value function.
- The state value function V(s; θ, ω) (i.e., the value of the current state) is obtained from the policy function π(a|s; θ) (i.e., the probability of the action) and the action value function q(s, a; ω) (i.e., the value of the action), for example V(s; θ, ω) = Σ_a π(a|s; θ) · q(s, a; ω).
- TD error temporal-difference error
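A compact PyTorch-style sketch of the update described above follows. It is an illustrative actor-critic step: the state dimension (4), number of actions (3), network sizes, and discount factor are arbitrary assumptions, and the patent does not fix any of these details.

```python
import torch
import torch.nn as nn

gamma = 0.99  # discount factor (assumed; not specified in the patent)

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(4 + 3, 32), nn.ReLU(), nn.Linear(32, 1))  # q(s, a)
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def one_hot(a, n=3):
    v = torch.zeros(n)
    v[a] = 1.0
    return v

def actor_critic_step(s, a, r, s2, a2):
    """One TD update: q-values of (s, a) and (s2, a2), TD error, then value and policy updates."""
    q_sa = value_net(torch.cat([s, one_hot(a)]))          # first value: q of the first action in the first state
    with torch.no_grad():
        q_s2a2 = value_net(torch.cat([s2, one_hot(a2)]))  # second value: q of the second action in the second state
    td_error = r + gamma * q_s2a2 - q_sa                  # difference between predicted and bootstrapped value

    value_loss = td_error.pow(2).mean()                   # update value-network parameters with the TD error
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    log_prob = torch.log(policy_net(s)[a])
    policy_loss = (-td_error.detach() * log_prob).mean()  # update policy-network parameters with the TD error
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

# Example call: actor_critic_step(torch.randn(4), 0, 1.0, torch.randn(4), 1)
```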
- In the above technical solution, the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information is obtained, wherein the reinforcement learning agent determines the reward based on the feedback information and the state of the vehicle corresponding to the current ADS, and selects the action from the action space based on the reward and the state of the vehicle corresponding to the current ADS; the reference driving scenario is then updated based on the selected action to obtain the updated reference automatic driving scenario, until the safety violation parameters and coverage parameters obtained from the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset indicators.
- Since the reinforcement learning agent performs reinforcement learning in a "trial and error" manner, the reward determined from the vehicle control instructions and neural network behavior information in the feedback information guides the driving scenario to be updated in a direction that is likely to lead to safety violations and has not been tested before, that is, the driving scenarios that most need to be tested, which are likely to lead to safety violations and have high coverage, are generated. This can not only improve the efficiency of testing driving scenarios, but also cover more driving scenarios.
- Closed-loop automated testing can further reduce time and labor costs.
- the methods provided by the embodiments of the present application are introduced from the perspective of an automatic driving scene generation system as an execution subject.
- the generation system for automatic driving scenarios may include a hardware structure and/or a software module, and implement the above functions in the form of a hardware structure, a software module, or a combination of a hardware structure and a software module. Whether one of the above functions is performed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
- an embodiment of the present application also provides an apparatus 400 for generating an automatic driving scene.
- the apparatus 400 may be a system for generating an automatic driving scene, or an apparatus in a system for generating an automatic driving scene.
- the apparatus 400 includes modules for executing the method shown in FIG. 3 above.
- the apparatus 400 may include:
- the first obtaining module 401 is used to obtain feedback information output by the automatic driving system ADS when testing in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;
- the second obtaining module 402 is configured to obtain safety violation parameters and coverage parameters based on the feedback information; wherein the safety violation parameters are used to indicate the probability of safety violations when the vehicle corresponding to the ADS drives according to the vehicle control instructions in the reference automatic driving scenario, and the coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference autonomous driving scenario;
- the updating module 403 is configured to, if the safety violation parameter or the coverage parameter does not meet the preset index, update the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained from the feedback information output by the ADS when testing in the updated reference automatic driving scenario meet the preset index.
- the device further includes a creation module, configured to create the reference automatic driving scenario, for example based on a distribution analysis of the initial automatic driving scenario acquired for the ADS.
- the autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
- the safety violation parameter includes any one or more of: the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or pedestrian is less than the safety distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, and the probability that the vehicle corresponding to the ADS exceeds the speed limit;
- the coverage parameter includes any one or more of: the number of activated neurons in the neural network corresponding to the ADS, and the weight of the influence of input subsets on the prediction result in the heat map of the neural network corresponding to the ADS.
- the update module 403 is specifically used for:
- acquiring the action selected by a reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information; wherein the reinforcement learning agent is an agent that determines a reward based on the feedback information and the current state of the vehicle corresponding to the ADS, and selects the action from the action space based on the reward and the current state of the vehicle corresponding to the ADS;
- the reference driving scenario is updated based on the selected action.
- the reward is the sum of the safety-violation-based reward and the coverage-based reward;
- the safety-violation-based reward is the degree to which the safety violation probability of the vehicle corresponding to the ADS is close to the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information, and the coverage-based reward is the degree to which the coverage of the reference automatic driving scenario determined based on the neural network behavior information in the feedback information is close to the preset index;
- the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
- the neural network model of the reinforcement learning agent includes a value network and a strategy network; the value network is used to calculate the value of a set action in a set state, and the strategy network is used to obtain the action probability distribution in the set state;
- a time difference error is determined, wherein the time difference error is the difference between the value predicted by the value network and the actual value;
- an embodiment of the present application further provides a system 500 for generating an automatic driving scenario, including:
- at least one processor 501; and a memory 502 and a communication interface 503 communicatively connected to the at least one processor 501;
- the at least one processor 501 causes the automatic driving scenario generation system 500 to execute the method shown in FIG. 3 by executing the instructions stored in the memory 502 .
- the memory 502 is located outside the automatic driving scenario generation system 500 .
- the automatic driving scenario generation system 500 includes the memory 502, the memory 502 is connected to the at least one processor 501, and the memory 502 stores instructions that can be executed by the at least one processor 501.
- FIG. 5 shows, with dashed lines, that memory 502 is optional to system 500 for generating autonomous driving scenarios.
- the processor 501 and the memory 502 may be coupled through an interface circuit, or may be integrated together, which is not limited here.
- the specific connection medium between the processor 501 , the memory 502 , and the communication interface 503 is not limited in the embodiments of the present application.
- In FIG. 5, the processor 501, the memory 502, and the communication interface 503 are connected through a bus 504; the bus is represented by a thick line in FIG. 5, and the way other components are connected is only schematically illustrated and is not limited thereto.
- the bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of presentation, only one thick line is used in FIG. 5, but it does not mean that there is only one bus or one type of bus.
- the processor mentioned in the embodiments of the present application may be implemented by hardware or software.
- the processor When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like.
- the processor When implemented in software, the processor may be a general-purpose processor implemented by reading software codes stored in memory.
- the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- CPU central processing unit
- DSP digital signal processor
- ASIC application specific integrated circuit
- FPGA field programmable gate array
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory mentioned in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
- Volatile memory may be random access memory (RAM), which acts as an external cache.
- RAM random access memory
- SRAM static random access memory
- DRAM dynamic random access memory
- SDRAM synchronous dynamic random access memory
- DDR SDRAM double data rate synchronous dynamic random access memory
- ESDRAM enhanced synchronous dynamic random access memory
- SLDRAM synchlink dynamic random access memory
- DR RAM direct rambus random access memory
- It should be noted that when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) is integrated in the processor.
- The memory described herein is intended to include, but not be limited to, these and any other suitable types of memory.
- an embodiment of the present application further provides a computer storage medium, including computer instructions, when the computer instructions are executed on a computer, the method shown in FIG. 3 is executed.
- an embodiment of the present application further provides a chip, which is coupled to a memory and used to read and execute program instructions stored in the memory, so that the method shown in FIG. 3 is executed.
- an embodiment of the present application also provides a computer program product, which enables the method shown in FIG. 3 to be executed when the computer program product runs on a computer.
- the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
- these computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Traffic Control Systems (AREA)
Abstract
An automated driving scenario generation method, apparatus (400), and system (500), which can solve the prior-art problems of a long test time, large costs and overhead, and difficulty in covering all driving scenarios, caused by the limited test driving scenarios and by the test personnel not being as numerous as real users. The method comprises: acquiring feedback information outputted when an automated driving system (ADS) (200) is tested in a reference automated driving scenario (S301); obtaining safety violation parameters and coverage parameters on the basis of the feedback information (S302); and if the safety violation parameters or the coverage parameters do not meet a preset index, obtaining an updated reference automated driving scenario by updating the reference automated driving scenario on the basis of the feedback information, until the safety violation parameters and the coverage parameters obtained on the basis of the feedback information outputted when the ADS (200) is tested in the updated reference automated driving scenario meet the preset index (S303).
Description
The present application relates to the technical field of automatic driving simulation, and in particular, to a method, apparatus and system for generating an automatic driving scenario.
With the rapid development of science and technology, vehicle automatic driving technology is becoming increasingly widespread. Since the safety of the vehicle is the prerequisite for automatic driving, to realize true automatic driving it is necessary to determine the safety problems that may occur while the vehicle drives automatically and to evaluate the safety of the vehicle. Safety problems include information security problems (security) and functional safety problems (safety): security mainly refers to problems caused by malicious human attacks, such as property loss, privacy leakage, system takeover and system failure, whereas safety mainly refers to problems caused by non-human internal system faults. Information security problems caused by malicious human attacks can also trigger functional safety problems such as the system being taken over or failing.
At present, the safety of conventional vehicles is evaluated through certification, for example, by certifying whether a vehicle meets the safety requirements of the ISO-26262 standard and the ISO/PAS 21448 SOTIF standard. However, intelligent vehicles have the following two problems: (1) it is difficult to manually check the completeness of the specification, because the automatic driving system of the vehicle is mainly a machine learning system whose perception, prediction, planning and control logic is obtained through training and learning, so the requirement specification is vague; and (2) it is difficult to manually check the correctness of the implementation, because the neural network of a machine learning system is composed of an input layer, an output layer, multiple hidden layers and numerous neurons, which pass high-dimensional vectors through linear or nonlinear activation functions and are logically difficult to interpret. Therefore, the safety of intelligent vehicles currently cannot be evaluated through certification.
For intelligent vehicles to solve safety problems and gain the trust of stakeholders such as users, regulators, insurers and government departments, the only way is to improve the intelligence level and safety of the vehicle through sufficient exercise, that is, to perform as much training and testing as possible. There are usually two ways to exercise the vehicle: (1) exercise through real user driving, that is, sell the vehicle first and improve its intelligence level and safety by solving the problems users discover during use; this approach clearly conflicts with stakeholder trust, because putting the vehicle on the road before its safety has been established conflicts with the personal safety of users, the risk assessment of insurers, and the regulatory responsibilities of regulators and government departments; and (2) exercise through experimental road testing, that is, have test engineers rather than real users test various driving scenarios in a specific test area; however, because the driving scenarios that can be tested are limited and there are not as many testers as real users, the time required is very long, the cost and overhead are large, and it is difficult to cover all driving scenarios.
It can be seen that when a vehicle is exercised through experimental road testing to improve its intelligence level and safety, the limited driving scenarios that can be tested and the fact that there are not as many testers as real users mean that the time required is very long, the cost and overhead are large, and it is difficult to cover all driving scenarios.
SUMMARY OF THE INVENTION
The present application provides a method, apparatus and system for generating an automatic driving scenario, to solve the problems in the prior art that, because the driving scenarios that can be tested are limited and there are not as many testers as real users, the test time is long, the cost and overhead are large, and it is difficult to cover all driving scenarios.
In a first aspect, an embodiment of the present application provides a method for generating an automatic driving scenario, including:
acquiring feedback information output by an automatic driving system (ADS) when the ADS is tested in a reference automatic driving scenario, where the feedback information includes a vehicle control instruction and neural network behavior information;
obtaining a safety violation parameter and a coverage parameter based on the feedback information, where the safety violation parameter is used to indicate the probability of a safety violation when the vehicle corresponding to the ADS drives according to the vehicle control instruction in the reference automatic driving scenario, and the coverage parameter is used to indicate the activated neurons and/or the layer-wise relevance in the neural network corresponding to the ADS when the test is performed in the reference automatic driving scenario; and
if the safety violation parameter or the coverage parameter does not meet a preset index, updating the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output when the ADS is tested in the updated reference automatic driving scenario meet the preset index.
Based on the above technical solution, the feedback information output when the ADS is tested in the reference automatic driving scenario is acquired, and the safety violation parameter and the coverage parameter are obtained based on the feedback information. If the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scenario is updated based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output when the ADS is tested in the updated reference automatic driving scenario meet the preset index. The driving scenario is thereby guided to be updated in directions that easily lead to safety violations and that have not been tested before, that is, the driving scenarios most in need of testing, which easily lead to safety violations and have high coverage, are generated. This not only improves the efficiency of testing driving scenarios, but also covers more driving scenarios. In addition, the closed-loop-based automated testing can further reduce the time overhead and labor overhead.
In a possible design, before the acquiring of the feedback information output by the automatic driving system ADS when the ADS is tested in the reference automatic driving scenario, the method further includes:
acquiring an initial automatic driving scenario of the ADS; and
analyzing the road types in the initial automatic driving scenario and the probability of a safety violation when the vehicle corresponding to the ADS drives in the initial automatic driving scenario, and creating the reference automatic driving scenario based on the analysis result, where the reference automatic driving scenario includes any one or more of a typical automatic driving scenario, a missing automatic driving scenario, and a violation-prone automatic driving scenario.
In a possible design, the safety violation parameter includes any one or more of: the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, and the probability that the vehicle corresponding to the ADS exceeds the speed limit; and
the coverage parameter includes any one or more of the number of activated neurons in the neural network corresponding to the ADS and the weight of the influence of an input subset on the prediction result in the heat map of the neural network corresponding to the ADS.
In a possible design, the updating of the reference automatic driving scenario based on the feedback information includes:
acquiring an action selected, based on the feedback information, by a reinforcement learning agent from the action space of the reference automatic driving scenario, where an action in the action space is a discrete or continuous update of road topology, road degradation, dynamic time, dynamic weather, dynamic traffic or landscape information in the automatic driving scenario; and
updating the reference driving scenario based on the selected action.
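As a purely illustrative sketch of such an action space (not part of the claimed method), the following Python fragment enumerates a few hypothetical discrete and continuous scenario updates; the names ScenarioAction and apply_action, the element keys, and the value ranges are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical discrete choices for a few scenario elements (labels are illustrative).
ROAD_TOPOLOGIES = ["straight", "t_junction", "overpass", "winding_mountain_road"]
WEATHER_STATES = ["clear", "rain", "fog", "snow"]


@dataclass
class ScenarioAction:
    """One update drawn from the action space of the reference driving scenario."""
    kind: str                 # e.g. "road_topology", "road_degradation", "time",
                              # "weather", "traffic_density", "landscape"
    value: Union[str, float]  # discrete label or continuous amount


def apply_action(scenario: dict, action: ScenarioAction) -> dict:
    """Return a copy of the scenario description with one element updated."""
    updated = dict(scenario)
    if action.kind == "road_topology":
        updated["road_topology"] = action.value              # discrete update
    elif action.kind == "road_degradation":
        updated["road_degradation"] = float(action.value)    # continuous update in [0, 1]
    elif action.kind == "time":
        updated["time_of_day_h"] = float(action.value) % 24.0
    elif action.kind == "weather":
        updated["weather"] = action.value
    elif action.kind == "traffic_density":
        updated["traffic_density"] = float(action.value)
    elif action.kind == "landscape":
        updated["landscape"] = action.value
    return updated
```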
In a possible design, the reinforcement learning agent is an agent that determines a reward and the current state of the vehicle corresponding to the ADS based on the feedback information, and that selects an action from the action space based on the reward and the current state of the vehicle corresponding to the ADS;
where the reward is the sum of a safety-violation-based reward and a coverage-based reward, the safety-violation-based reward is how closely the safety violation probability of the vehicle corresponding to the ADS approaches the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information, the coverage-based reward is how closely the coverage of the reference automatic driving scenario, determined based on the neural network behavior information in the feedback information, approaches the preset index, and the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
Based on the above technical solution, the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information is first acquired, where the reinforcement learning agent determines the reward and the current state of the vehicle corresponding to the ADS based on the feedback information and selects an action from the action space based on the reward and the current state of the vehicle corresponding to the ADS; the reference driving scenario is then updated based on the selected action. After the reinforcement learning agent performs reinforcement learning in a 'trial and error' manner, the reward determined from the vehicle control instruction and the neural network behavior information in the feedback information guides the behavior, and an action is selected from the action space of the reference automatic driving scenario with the goal of maximizing the reward obtained by the reinforcement learning agent. The driving scenario is thereby guided to be updated in directions that easily lead to safety violations and that have not been tested before, that is, the driving scenarios most in need of testing, which easily lead to safety violations and have high coverage, are generated. This not only improves the efficiency of testing driving scenarios, but also covers more driving scenarios.
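The reward described above is the sum of two closeness terms. A minimal sketch, assuming a simple clipped-ratio measure of closeness (the measure itself and the function names are assumptions), is:

```python
def closeness(value: float, target: float) -> float:
    """How close a measured value is to a preset target, clipped to [0, 1]; an assumed measure."""
    if target <= 0.0:
        return 1.0 if value >= target else 0.0
    return max(0.0, min(1.0, value / target))


def scenario_reward(violation_prob: float, coverage: float,
                    violation_target: float, coverage_target: float) -> float:
    """Reward = safety-violation-based reward + coverage-based reward."""
    safety_reward = closeness(violation_prob, violation_target)  # closeness to the preset violation index
    coverage_reward = closeness(coverage, coverage_target)       # closeness to the preset coverage index
    return safety_reward + coverage_reward
```

For example, scenario_reward(0.3, 0.6, violation_target=0.5, coverage_target=0.8) evaluates to 0.6 + 0.75 = 1.35.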
In a possible design, the neural network model of the reinforcement learning agent includes a value network and a policy network, where the value network is used to calculate the value of a set action in a set state, and the policy network is used to obtain the action probability distribution in a set state; before the acquiring of the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information, the method further includes:
acquiring a first state of the vehicle currently corresponding to the ADS, and a first action selected by the reinforcement learning agent from the action space based on the policy network;
updating the reference automatic driving scenario based on the first action, and acquiring feedback information output by the ADS when the ADS is tested in the updated reference automatic driving scenario;
acquiring a reward of the first action determined by the reinforcement learning agent based on the feedback information, a second state of the vehicle currently corresponding to the ADS, and a second action selected from the action space based on the policy network;
calculating, based on the value network, a first value of the first action in the first state and a second value of the second action in the second state;
determining a temporal difference error based on the first value, the second value and the reward of the first action, where the temporal difference error is the difference between the value predicted by the value network and the actual value; and
acquiring gradients of the value network and the policy network, updating the parameters of the value network based on the temporal difference error and the gradient of the value network, and updating the parameters of the policy network based on the temporal difference error and the gradient of the policy network.
Based on the above technical solution, the value network and the policy network are set as the neural network model of the reinforcement learning agent, that is, these two neural networks respectively approximate the value function and the policy function of the reinforcement learning agent, where the value function refers to the rule by which the agent in reinforcement learning uses the reward provided by the environment to evaluate how good an action and a state are, and the policy function refers to the rule by which the agent chooses behaviors in reinforcement learning. The reinforcement learning agent is trained so that it can strategically update the reference automatic driving scenario based on the vehicle control instruction and the neural network behavior information in the feedback information, thereby guiding the automatic driving scenario to be updated in directions that easily lead to safety violations and that have not been tested before, that is, generating the driving scenarios most in need of testing, which easily lead to safety violations and have high coverage. This not only improves the efficiency of testing driving scenarios, but also covers more driving scenarios.
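The training steps described above amount to a one-step actor-critic update driven by a temporal difference error. The PyTorch sketch below is one hypothetical realization under stated assumptions: the state and action dimensions, the network sizes, the discount factor and the optimizers are illustrative, the action is encoded as a one-hot vector for the value network, and the simulator/ADS interaction that produces s1, a1, r1, s2 and a2 is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 8, 6, 0.99  # illustrative sizes and discount factor

# Policy network: action probability distribution for a given state.
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
# Value network: value of a given action in a given state (action passed as a one-hot vector).
value_net = nn.Sequential(nn.Linear(STATE_DIM + NUM_ACTIONS, 64), nn.ReLU(), nn.Linear(64, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)


def action_value(state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
    """Value of the given action in the given state, computed by the value network."""
    one_hot = torch.nn.functional.one_hot(action, NUM_ACTIONS).float()
    return value_net(torch.cat([state, one_hot], dim=-1)).squeeze(-1)


def td_update(s1, a1, r1, s2, a2):
    """One training step: compute the temporal difference error, then update both networks."""
    log_prob = torch.distributions.Categorical(logits=policy_net(s1)).log_prob(a1)
    q1 = action_value(s1, a1)                 # first value: first action in the first state
    q2 = action_value(s2, a2)                 # second value: second action in the second state
    td_error = r1 + GAMMA * q2.detach() - q1  # bootstrapped target minus predicted value

    value_opt.zero_grad()
    (td_error ** 2).mean().backward()         # update the value network parameters
    value_opt.step()

    policy_opt.zero_grad()
    (-log_prob * td_error.detach()).mean().backward()  # update the policy network parameters
    policy_opt.step()
```

Here s1 and s2 would be the vehicle states read back from the simulator, a1 and a2 the scenario-update actions sampled from the policy network, and r1 the reward computed from the feedback information, as described above.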
In a second aspect, the present application further provides an apparatus for generating an automatic driving scenario, where the apparatus has the function of implementing the method in the first aspect or any possible design of the first aspect. The function may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, for example, a first acquiring module, a second acquiring module, and an updating module.
The first acquiring module is configured to acquire feedback information output by the automatic driving system ADS when the ADS is tested in a reference automatic driving scenario, where the feedback information includes a vehicle control instruction and neural network behavior information.
The second acquiring module is configured to obtain a safety violation parameter and a coverage parameter based on the feedback information, where the safety violation parameter is used to indicate the probability of a safety violation when the vehicle corresponding to the ADS drives according to the vehicle control instruction in the reference automatic driving scenario, and the coverage parameter is used to indicate the activated neurons and/or the layer-wise relevance in the neural network corresponding to the ADS when the test is performed in the reference automatic driving scenario.
The updating module is configured to: if the safety violation parameter or the coverage parameter does not meet a preset index, update the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output when the ADS is tested in the updated reference automatic driving scenario meet the preset index.
In a possible design, the apparatus further includes a creating module, configured to:
acquire an initial automatic driving scenario of the ADS; and
analyze the road types in the initial automatic driving scenario and the probability of a safety violation when the vehicle corresponding to the ADS drives in the initial automatic driving scenario, and create the reference automatic driving scenario based on the analysis result, where the reference automatic driving scenario includes any one or more of a typical automatic driving scenario, a missing automatic driving scenario, and a violation-prone automatic driving scenario.
In a possible design, the safety violation parameter includes any one or more of: the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, and the probability that the vehicle corresponding to the ADS exceeds the speed limit; and
the coverage parameter includes any one or more of the number of activated neurons in the neural network corresponding to the ADS and the weight of the influence of an input subset on the prediction result in the heat map of the neural network corresponding to the ADS.
In a possible design, the updating module is specifically configured to:
acquire an action selected, based on the feedback information, by a reinforcement learning agent from the action space of the reference automatic driving scenario, where an action in the action space is a discrete or continuous update of road topology, road degradation, dynamic time, dynamic weather, dynamic traffic or landscape information in the automatic driving scenario; and
update the reference driving scenario based on the selected action.
In a possible design, the reinforcement learning agent is an agent that determines a reward and the current state of the vehicle corresponding to the ADS based on the feedback information, and that selects an action from the action space based on the reward and the current state of the vehicle corresponding to the ADS;
where the reward is the sum of a safety-violation-based reward and a coverage-based reward, the safety-violation-based reward is how closely the safety violation probability of the vehicle corresponding to the ADS approaches the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information, the coverage-based reward is how closely the coverage of the reference automatic driving scenario, determined based on the neural network behavior information in the feedback information, approaches the preset index, and the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
In a possible design, the neural network model of the reinforcement learning agent includes a value network and a policy network, where the value network is used to calculate the value of a set action in a set state, and the policy network is used to obtain the action probability distribution in a set state; the apparatus further includes a training module, configured to:
acquire a first state of the vehicle currently corresponding to the ADS, and a first action selected by the reinforcement learning agent from the action space based on the policy network;
update the reference automatic driving scenario based on the first action, and acquire feedback information output by the ADS when the ADS is tested in the updated reference automatic driving scenario;
acquire a reward of the first action determined by the reinforcement learning agent based on the feedback information, a second state of the vehicle currently corresponding to the ADS, and a second action selected from the action space based on the policy network;
calculate, based on the value network, a first value of the first action in the first state and a second value of the second action in the second state;
determine a temporal difference error based on the first value, the second value and the reward of the first action, where the temporal difference error is the difference between the value predicted by the value network and the actual value; and
acquire gradients of the value network and the policy network, update the parameters of the value network based on the temporal difference error and the gradient of the value network, and update the parameters of the policy network based on the temporal difference error and the gradient of the policy network.
In a third aspect, the present application further provides a system for generating an automatic driving scenario, where the system may include: at least one processor; and a memory and a communication interface communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the at least one processor performs the function of the method in the first aspect or any possible design of the first aspect by executing the instructions stored in the memory.
In a fourth aspect, the present application further provides a computer storage medium, where the computer storage medium includes computer instructions which, when run on a computer, cause the computer to perform the method in the first aspect or any possible design of the first aspect.
In a fifth aspect, the present application further provides a computer program product which, when run on a computer, causes the computer to perform the method in the first aspect or any possible design of the first aspect.
FIG. 1 is a schematic diagram of a training process of reinforcement learning according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a system for generating an automatic driving scenario according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for generating an automatic driving scenario according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for generating an automatic driving scenario according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another system for generating an automatic driving scenario according to an embodiment of the present application.
To make the objectives, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application.
To facilitate understanding of the embodiments of the present application, the technical terms involved in the embodiments of the present application are explained first.
1. Automatic driving system (ADS)
An automatic driving system is a system for controlling a vehicle, and the vehicle can drive automatically under the control of the automatic driving system. The automatic driving system may include a collection apparatus, two main processing apparatuses, an auxiliary processing apparatus, and a vehicle control apparatus. The collection apparatus is configured to collect initial environment information of the vehicle and send the initial environment information to the two main processing apparatuses. Each main processing apparatus is configured to process the received initial environment information to obtain target environment information, then generate a vehicle control instruction according to the target environment information, and send the vehicle control instruction to the auxiliary processing apparatus. The auxiliary processing apparatus is configured to send the vehicle control instruction sent by one of the main processing apparatuses to the vehicle control apparatus, so that the vehicle control apparatus controls the vehicle (for example, to move forward, reverse or turn) according to the received vehicle control instruction. When one of the main processing apparatuses fails, the auxiliary processing apparatus may send the vehicle control instruction sent by the other main processing apparatus to the vehicle control apparatus.
When the automatic driving system is used to train and test the automatic driving of a vehicle, the automatic driving system of the vehicle is mainly a machine learning system composed of a neural network, where the neural network includes an input layer, an output layer, multiple hidden layers and numerous neurons. The more neurons are activated and the higher the layer-wise relevance, the higher the coverage of driving scenarios during automatic driving training and testing of the vehicle, that is, the more driving scenarios can be covered when the vehicle is trained and tested for automatic driving.
2. Reinforcement learning agent
An agent is a very important concept in the field of artificial intelligence; any independent entity that can think and interact with its environment can be abstracted as an agent. For example, an agent may be a computer system, or part of a computer system, in a specific environment. Based on its own perception of the environment, an agent can follow existing instructions or learn autonomously, communicate and cooperate with other agents, and autonomously accomplish set goals in the environment in which it is located. An agent may be software, or an entity combining software and hardware.
Reinforcement learning (RL), also called evaluative learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a policy during its interaction with the environment so as to maximize its return or achieve a specific goal. That is, in reinforcement learning the agent learns in a 'trial and error' manner, and the reward obtained by interacting with the environment through actions guides the behavior, the goal being for the agent to obtain the maximum reward. Reinforcement learning does not require a training data set. In reinforcement learning, the reinforcement signal (that is, the reward) provided by the environment evaluates how good an action is, rather than telling the reinforcement learning system how to produce the correct action. Since the external environment provides little information, the agent must learn from its own experience; in this way, the agent acquires knowledge in an action-evaluation (that is, reward) environment and improves its action plan to adapt to the environment.
For example, FIG. 1 is a schematic diagram of a training process of reinforcement learning according to an embodiment of the present application. As shown in FIG. 1, reinforcement learning mainly involves an agent and an environment, together with states, actions and rewards. The input of the agent is a state and its output is an action. The policy function refers to the rule that the agent uses to choose behaviors in reinforcement learning; for example, during learning, an action can be output according to the state, and this action is used to explore the environment so as to update the state. The update of the policy function depends on the policy gradient (PG), and the policy function is usually a neural network. In the current technology, the training process of reinforcement learning is as follows: the agent interacts with the environment multiple times to obtain the action, state and reward of each interaction; multiple groups of actions, states and rewards are used as training data to train the agent once; and the above training process is used for the next round of training of the agent, until a convergence condition is met.
The process of obtaining the action, state and reward of one interaction is shown in FIG. 1. The current state s(t) of the environment is input to the agent to obtain the action a(t) output by the agent, and the reward r(t) of this interaction is calculated according to the relevant performance indicators of the environment under the effect of the action a(t); at this point, the action a(t) and the reward r(t) of this interaction have been obtained. The action a(t) and the reward r(t) of this interaction are recorded for later use in training the agent, and the next state s(t+1) of the environment under the effect of the action a(t) is also recorded, so that the next interaction between the agent and the environment can be carried out.
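Written as code, the interaction loop described above might look like the following minimal sketch; select_action and environment_step are placeholders for the agent and the environment in FIG. 1, and the record format is an assumption.

```python
def run_episode(initial_state, select_action, environment_step, num_steps):
    """Collect (state, action, reward) tuples from agent-environment interaction."""
    trajectory = []
    s = initial_state
    for _ in range(num_steps):
        a = select_action(s)                # agent outputs action a(t) for state s(t)
        s_next, r = environment_step(s, a)  # environment yields reward r(t) and next state s(t+1)
        trajectory.append((s, a, r))        # recorded for later training of the agent
        s = s_next                          # the next interaction starts from s(t+1)
    return trajectory
```

The recorded trajectory can then be used as one batch of training data for the agent, as in the training process described above.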
3. Simulator
A simulator can simulate driving scenarios by configuring various parameters, for example, parameters such as the road network, traffic, pedestrians, landscape and weather. The main modules in the simulator include the camera image, radar image and lidar image modules, a dynamics model, vehicle position update, and inertial navigation (that is, a global positioning system and inertial sensors, GPS+IMU). The first three capture images of the simulated driving scenario, and the last three are mainly used to dynamically update the position of the vehicle in the driving scenario.
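As an illustration only, the parameters mentioned above could be grouped into a single configuration object as sketched below; the field names, default values and the render_scenario stub are hypothetical and do not refer to any particular simulator's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScenarioConfig:
    """Hypothetical bundle of the simulator parameters mentioned in the text."""
    road_network: str = "t_junction"
    traffic_density: float = 0.3          # assumed scale: fraction of lanes occupied
    pedestrians: int = 5
    landscape: str = "urban"
    weather: str = "rain"
    time_of_day_h: float = 18.5
    ego_position_m: List[float] = field(default_factory=lambda: [0.0, 0.0])
    ego_sensors: List[str] = field(default_factory=lambda: ["camera", "radar", "lidar"])


def render_scenario(cfg: ScenarioConfig) -> dict:
    """Stand-in for one simulator step: sensor frames plus the updated ego pose."""
    return {
        "camera_image": None,            # would come from the camera image module
        "radar_image": None,             # radar image module
        "lidar_image": None,             # lidar image module
        "ego_pose": cfg.ego_position_m,  # updated via the dynamics model and GPS+IMU
    }
```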
At present, a simulator-based simulation environment is used to generate multiple driving scenarios and to test the generated driving scenarios, usually in one of the following two ways:
(1) Brute-force learning
Based on the simulation environment of the simulator, as many driving scenarios as possible are randomly generated in a brute-force manner for testing and learning. The steps are as follows: randomly or exhaustively select real driving scenarios, model the selected real driving scenarios (for example, road topology, traffic state and weather), and configure the models in the simulator; automatically or manually update information such as weather and landscape in the simulator so as to generate simulated driving scenarios in the simulator; set the ego-vehicle information in the simulated driving scenario, for example, the position of the ego vehicle and the types and positions of its sensors; and use the ADS to test and verify the driving scenarios generated by the simulator. Because the driving scenarios of this method are imported by brute force or generated randomly, the method is blind: a large number of driving scenarios are inevitably repeated, and it is difficult to traverse all driving scenarios, so the efficiency is low and the coverage of the tested driving scenarios cannot be effectively improved.
(2) Formal methods
Based on the simulation environment of the simulator, safety-related specific driving scenarios are constructed through formal reasoning for testing and learning. The steps of this method are as follows: construct specific driving scenarios through reasoning, model the specific driving scenarios (for example, road topology, traffic state and weather), and configure the models in the simulator; automatically or manually update information such as weather and landscape in the simulator so as to generate simulated driving scenarios in the simulator; set the ego-vehicle information in the simulated driving scenario, for example, the position of the ego vehicle and the types and positions of its sensors; and use the ADS to test and verify the driving scenarios generated by the simulator. Because the driving scenarios of this method rely mainly on manual reasoning and specific driving scenarios are constructed manually, the construction efficiency is low, the scenarios are generally simple, it is difficult to construct diverse driving scenarios, and no safety guarantee can be provided.
It can be seen that both of the above methods have certain shortcomings. Driving scenarios generated by brute-force learning are random, the generated driving scenarios contain a large number of repeated scenarios, and it is difficult to traverse all scenarios when testing the generated driving scenarios, so the time and labor costs are huge, the efficiency is low, and the coverage of the tested driving scenarios cannot be effectively improved. Formal methods basically generate driving scenarios manually, so the driving scenarios are usually relatively simple, the construction efficiency is low, it is difficult to construct diverse driving scenarios, and no safety guarantee can be provided. In other words, both methods test driving scenarios with low efficiency, both have difficulty providing diverse driving scenarios, and neither can quickly improve the coverage of the tested driving scenarios.
In view of this, an embodiment of the present application provides a method for generating an automatic driving scenario, in which the driving scenario is strategically updated through reinforcement learning, thereby increasing the diversity of the driving scenarios while guiding the driving scenario to be updated in directions that easily lead to safety violations and that have not been tested before, which improves the efficiency of testing driving scenarios. In addition, closed-loop-based automated testing can further reduce the time overhead and labor overhead.
It should be understood that the embodiments of the present application may be applied to a system for generating an automatic driving scenario. For example, FIG. 2 is a schematic structural diagram of a system for generating an automatic driving scenario according to an embodiment of the present application. As shown in FIG. 2, the system for generating an automatic driving scenario includes a vehicle 100, an ADS 200, a reinforcement learning agent 300 and a simulator 400, where the ADS 200 may be disposed on the vehicle 100, and a closed loop is formed among the ADS 200, the reinforcement learning agent 300 and the simulator 400. The simulator 400 is configured to configure various parameters to simulate an automatic driving scenario of the vehicle 100. The ADS 200 is configured to test the simulated automatic driving scenario of the vehicle 100. The reinforcement learning agent 300 is configured to treat the output of the ADS 200 as the environment and update the automatic driving scenario of the vehicle 100 according to the reward from the environment.
The ADS 200 may include a collection apparatus, a main processing apparatus, an auxiliary processing apparatus and a vehicle control apparatus that are connected in sequence. Optionally, the collection apparatus may include at least one of a variety of sensors such as a camera, a radar, a gyroscope and an accelerometer. The processing capability of the main processing apparatus may be stronger than that of the auxiliary processing apparatus, and the main processing apparatus may be an apparatus integrating an image processing function, a scalar computing function, a vector computing function and a matrix computing function. The auxiliary processing apparatus may be a microcontroller unit (MCU). Optionally, the ADS may further include a target radar, and the target radar is connected to the auxiliary processing apparatus. It should be noted that the radars in the embodiments of the present invention may all be lidars (light detection and ranging, Lidar); optionally, the radars in the embodiments of the present invention may also be other types of radars, such as millimeter-wave radars or ultrasonic radars, which is not limited in the embodiments of the present invention.
The system for generating an automatic driving scenario provided by the embodiments of the present application has been described above. The method for generating an automatic driving scenario provided by the embodiments of the present application is described next with reference to the accompanying drawings.
It should be understood that the terms 'first' and 'second' in the embodiments of the present application are used for description purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined with 'first' or 'second' may explicitly or implicitly include one or more of the features. 'At least one' means one or more, and 'multiple' means two or more. 'And/or' describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the following cases: A exists alone, both A and B exist, and B exists alone, where A and B may be singular or plural. The character '/' generally indicates an 'or' relationship between the associated objects. 'At least one of the following items' or a similar expression refers to any combination of these items, including any combination of a single item or multiple items; for example, at least one of a, b or c may represent: a, b, c, a and b, a and c, b and c, or a and b and c.
FIG. 3 is a schematic flowchart of a method for generating an automatic driving scenario according to an embodiment of the present application. The method may be applied to the system for generating an automatic driving scenario shown in FIG. 2, or to a system with a functional structure similar to that in FIG. 2. As shown in FIG. 3, the specific flow of the method for generating an automatic driving scenario is described as follows.
S301. Acquire feedback information output by the automatic driving system ADS when the ADS is tested in a reference automatic driving scenario.
In some embodiments, before acquiring the feedback information output by the ADS when the ADS is tested in the reference automatic driving scenario, the system for generating an automatic driving scenario needs to acquire an initial automatic driving scenario of the ADS, where the initial automatic driving scenario may include any one or more of roads, time, weather, vehicles, pedestrians, traffic lights, traffic signs, traffic police and landscape. Distribution analysis is then performed on the acquired initial automatic driving scenario to determine the road types in the initial automatic driving scenario, for example, straight roads, T-junctions, overpasses and winding mountain roads, and the probability of a safety violation when the vehicle corresponding to the ADS drives in the initial automatic driving scenario, for example, the probability that the vehicle has a safety accident or commits a traffic violation. Finally, the reference automatic driving scenario is created based on the result of the distribution analysis, where the reference automatic driving scenario includes any one or more of a typical automatic driving scenario, a missing automatic driving scenario and a violation-prone automatic driving scenario, which is not limited in this embodiment of the present application.
It should be noted that, in this embodiment of the present application, the distribution analysis is off-line, that is, the distribution analysis of the initial automatic driving scenario is not performed online in the simulator of the system for generating an automatic driving scenario; instead, off-line distribution analysis is used to determine the road types in the initial automatic driving scenario and the probability of a safety violation when the vehicle corresponding to the ADS drives in the initial automatic driving scenario, thereby reducing the difficulty of the analysis.
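A minimal sketch of such an off-line distribution analysis, assuming the initial scenarios are available as logged records with a road type and a violation flag (the record format is an assumption), is:

```python
from collections import Counter


def analyze_scenarios(records):
    """records: iterable of dicts such as {"road_type": "t_junction", "violation": True}."""
    road_counts = Counter(r["road_type"] for r in records)
    violation_rate = {}
    for road in road_counts:
        subset = [r for r in records if r["road_type"] == road]
        violation_rate[road] = sum(1 for r in subset if r["violation"]) / len(subset)
    return road_counts, violation_rate
```

Road types that appear rarely or that show a high violation rate could then seed the missing and violation-prone reference scenarios, respectively.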
In some other embodiments, after the reference automatic driving scenario is created, the system for generating an automatic driving scenario may simulate the reference automatic driving scenario by configuring the parameters of the simulator in the system, and acquire the feedback information output by the ADS in the system when the ADS is tested in the simulated reference automatic driving scenario, where the feedback information output by the ADS includes the vehicle control instruction of the ADS in the reference automatic driving scenario and the neural network behavior information of the ADS in the reference automatic driving scenario.
It should be noted that, in this embodiment of the present application, the vehicle control instruction may be a steering signal, a speed signal or a body control signal, which is not limited in this embodiment of the present application. For example, when the ADS is tested in the simulated reference automatic driving scenario, the collection apparatus of the ADS collects initial environment information of the vehicle in the reference automatic driving environment and sends the initial environment information to the two main processing apparatuses; the main processing apparatuses process the received initial environment information to obtain target environment information, and then generate a vehicle control instruction according to the target environment information, so that the vehicle control apparatus controls the vehicle, for example, to move forward, reverse or turn, according to the received vehicle control instruction.
It should be noted that, in this embodiment of the present application, the neural network of the ADS may be a deep neural network (DNN), a convolutional neural network (CNN) or a recurrent neural network (RNN), which is not limited in this embodiment of the present application.
S302. Obtain a safety violation parameter and a coverage parameter based on the feedback information.
In some embodiments, after acquiring the feedback information output by the ADS when the ADS is tested in the reference automatic driving scenario, the system for generating an automatic driving scenario may determine the safety violation parameter and the coverage parameter based on the vehicle control instruction and the neural network behavior information in the feedback information, respectively.
It should be noted that, in this embodiment of the present application, the safety violation parameter is used to indicate the probability of a safety violation when the vehicle corresponding to the ADS drives according to the vehicle control instruction in the reference automatic driving scenario, for example, the probability that the parallel distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vertical distance between the vehicle corresponding to the ADS and another vehicle or a pedestrian is less than a safe distance, the probability that the vehicle corresponding to the ADS violates a traffic light indication, the probability that the vehicle corresponding to the ADS violates a traffic sign indication, the probability that the vehicle corresponding to the ADS violates a traffic police command, or the probability that the vehicle corresponding to the ADS exceeds the speed limit, which is not limited in this embodiment of the present application.
For example, if the vehicle control instruction controls the vehicle corresponding to the ADS to move forward 50 meters or backward 100 meters, then for the vehicle corresponding to the ADS in the reference automatic driving scenario it can be determined, according to the vehicle control instruction, the probability that its parallel distance to another vehicle or pedestrian (or a road shoulder, an obstacle, or the like) is less than the safe distance, the probability that its vertical distance to another vehicle or pedestrian (or a road shoulder, an obstacle, or the like) is less than the safe distance, the probability of violating a traffic light indication, the probability of violating a traffic sign indication, the probability of violating a traffic police command, and the probability of the vehicle exceeding the speed limit.
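One simple, purely illustrative way to estimate such a probability from simulated rollouts is sketched below; the per-step distance field and the safe-distance threshold are assumptions.

```python
def parallel_distance_violation_prob(rollouts, safe_distance_m=2.0):
    """Fraction of simulated steps in which the ego vehicle's parallel distance to another
    vehicle or pedestrian falls below the safe distance (threshold assumed for illustration)."""
    steps = [step for rollout in rollouts for step in rollout]
    if not steps:
        return 0.0
    violations = sum(1 for step in steps
                     if step["min_parallel_distance_m"] < safe_distance_m)
    return violations / len(steps)
```

The other probabilities listed above (vertical distance, traffic lights, traffic signs, traffic police, speeding) could be estimated analogously from the corresponding per-step flags.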
需要说明的是,在本申请实施例中,覆盖参数用于指示在参考自动驾驶场景下进行测试时ADS对应的神经网络中激活的神经元和/或层级相关性,例如,神经网络中激活的神经元的数量和神经网络的热力图中输入子集对预测结果的影响,神经网络有时也叫做多层感知机(multi-layer perceptron,MLP),按不同层的位置划分,神经网络内部的神经网络层可以分为三类,输入层,隐藏层和输出层,一般来说第一层是输入层,最后一层是输出层,而中间的层数都是隐藏层,层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连,所以当ADS在参考自动驾驶场景进行测试时,ADS的神经网络中激活的神经元的数量越高,或者ADS的神经网络中层级相关性越高,如热力图中输入子集对预测结果的影响越大,参考自动驾驶场景的覆盖率越高,即能够测试到以前未测试的自动驾驶场景,从而覆盖更多的自动驾驶场景。It should be noted that, in this embodiment of the present application, the coverage parameter is used to indicate the activated neurons and/or the layer-wise relevance in the neural network corresponding to the ADS when the test is performed in the reference automatic driving scenario, for example, the number of activated neurons in the neural network and the influence of input subsets on the prediction result in the heat map of the neural network. A neural network is sometimes called a multi-layer perceptron (MLP). According to the position of the layers, the layers inside a neural network can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Therefore, when the ADS is tested in the reference automatic driving scenario, the higher the number of activated neurons in the neural network of the ADS, or the higher the layer-wise relevance in the neural network of the ADS (i.e., the greater the influence of the input subsets on the prediction result in the heat map), the higher the coverage of the reference automatic driving scenario, which means that previously untested automatic driving scenarios can be reached and more automatic driving scenarios can be covered.
示例性地,若神经网络行为信息为输入层激活的神经元的个数为2个,输出层激活的神经元的个数为1个,隐藏层激活的神经元的个数为6个,则可确定ADS在参考自动驾驶场景下进行测试时ADS的神经网络中激活的神经元的数量为8个;若神经网络行为信息为层级相关性,则可根据层级相关性,从输出层反向传播找到输入层的子集,确定神经网络的热力图,其中,热力图用于表示所有输入子集对输出结果的贡献或权重,从热力图可以直观看出输入子集对输出结果的影响。For example, if the neural network behavior information indicates that 2 neurons are activated in the input layer, 1 neuron is activated in the output layer and 6 neurons are activated in the hidden layers, it can be determined that 8 neurons are activated in the neural network of the ADS when the ADS is tested in the reference automatic driving scenario. If the neural network behavior information is the layer-wise relevance, a subset of the input layer can be found by back-propagating from the output layer according to the layer-wise relevance, so as to determine the heat map of the neural network, where the heat map represents the contribution or weight of each input subset to the output result, and the influence of the input subsets on the output result can be seen intuitively from the heat map.
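The neuron-count form of the coverage parameter can be illustrated with a minimal sketch; the ReLU activations, the activation threshold and the toy 2-6-1 network below are assumptions made only for illustration and are not fixed by this embodiment.

```python
# Minimal sketch (illustrative assumptions): counting activated neurons in a
# small fully-connected network as a coverage signal.
import numpy as np

def count_activated_neurons(weights, biases, x, threshold=0.0):
    """Forward-propagate x through an MLP with ReLU activations and count
    neurons whose activation exceeds `threshold` (an assumed criterion)."""
    activated = int(np.sum(np.asarray(x) > threshold))   # input-layer neurons
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = np.maximum(0.0, W @ h + b)                    # ReLU layer activations
        activated += int(np.sum(h > threshold))           # hidden / output neurons
    return activated

# Toy 2-6-1 network matching the 2 + 6 + 1 example in the text.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(6, 2)), rng.normal(size=(1, 6))]
biases = [np.zeros(6), np.zeros(1)]
print(count_activated_neurons(weights, biases, np.array([1.0, -0.5])))
```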
S303、若安全违规参数或覆盖参数不满足预设指标,则基于反馈信息对参考自动驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于ADS在更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足预设指标。S303. If the safety violation parameter or the coverage parameter does not meet the preset index, update the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario meet the preset index.
在一些实施例中,自动驾驶场景的生成系统在基于反馈信息中的车辆控制指令和神经网络行为信息分别确定安全违规参数和覆盖参数之后,可以判断安全违规参数和覆盖参数是否满足预设指标,若安全违规参数或覆盖参数不满足预设指标,则基于反馈信息对参考自动驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于ADS在更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足预设指标。In some embodiments, after determining the safety violation parameter and the coverage parameter based on the vehicle control command and the neural network behavior information in the feedback information, respectively, the automatic driving scenario generation system may determine whether the safety violation parameter and the coverage parameter meet the preset index. If the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scenario is updated based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario meet the preset index.
示例性地,若安全违规参数包括ADS对应的车辆违反交通警察指挥的概率和ADS对应的车辆超速的概率,其中,ADS对应的车辆违反交通警察指挥的概率为40%,ADS对应的车辆超速的概率为30%,而预设指标中ADS对应的车辆违反交通警察指挥的概率和ADS对应的车辆超速的概率应均为80%,则测试结果不满足预设指标,或者,若覆盖参数包括神经网络中激活的神经元的数量,其中,神经网络中激活的神经元的数量为6个,而预设指标中神经网络中激活的神经元的数量应为8个,则测试结果不满足预设指标。For example, suppose the safety violation parameters include the probability that the vehicle corresponding to the ADS violates the command of a traffic police officer and the probability that the vehicle corresponding to the ADS exceeds the speed limit, where the former is 40% and the latter is 30%, while in the preset index both probabilities should be 80%; then the test result does not meet the preset index. Alternatively, suppose the coverage parameters include the number of activated neurons in the neural network, where the number of activated neurons is 6, while in the preset index the number of activated neurons should be 8; then the test result does not meet the preset index.
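The closed-loop check described in S303 can be sketched as below; the hooks run_test, get_params, select_action and apply_action are hypothetical interfaces standing in for the ADS simulator and the reinforcement learning agent, and are not the names used by this embodiment.

```python
# Minimal sketch (assumed interfaces): keep updating the reference scenario
# until the safety-violation and coverage parameters meet their preset indices.

def meets_preset(params, preset):
    return all(params[k] >= preset[k] for k in preset)

def generate_scenario(scenario, run_test, get_params, select_action, apply_action,
                      preset, max_iters=100):
    for _ in range(max_iters):
        feedback = run_test(scenario)          # ADS tested in the current scenario
        params = get_params(feedback)          # safety-violation + coverage parameters
        if meets_preset(params, preset):
            break
        scenario = apply_action(scenario, select_action(feedback))
    return scenario

# Toy usage: the "scenario" is just an integer the agent increments each round.
preset = {"speeding_prob": 0.8, "neuron_coverage": 0.8}
result = generate_scenario(
    scenario=0,
    run_test=lambda s: {"score": s},
    get_params=lambda fb: {"speeding_prob": fb["score"] / 10,
                           "neuron_coverage": fb["score"] / 10},
    select_action=lambda fb: 1,
    apply_action=lambda s, a: s + a,
    preset=preset,
)
print(result)  # loop stops once both toy parameters reach 0.8
```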
示例性地,当确定安全违规参数或覆盖参数不满足预设指标之后,获取强化学习智能体基于反馈信息从参考自动驾驶场景的动作空间中选择的动作,其中,动作空间中的动作为自动驾驶场景中的道路拓扑、道路降解、动态时间、动态天气、动态交通或景观信息的离散或连续更新,例如,道路拓扑为参考自动驾驶场景中的道路的类型,如直路、丁字路口、立交桥、盘山公路等,道路降解为参考自动驾驶场景中的道路退化的程度。基于选择的动作对参考驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于ADS在更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足预设指标。For example, after it is determined that the safety violation parameter or the coverage parameter does not meet the preset index, the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information is obtained, where the actions in the action space are discrete or continuous updates of the road topology, road degradation, dynamic time, dynamic weather, dynamic traffic or landscape information of the automatic driving scenario. For example, the road topology is the type of road in the reference automatic driving scenario, such as a straight road, a T-junction, an overpass or a winding mountain road, and the road degradation is the degree of road deterioration in the reference automatic driving scenario. The reference driving scenario is updated based on the selected action to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario meet the preset index.
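A discrete action space of scenario updates might look like the following sketch; the particular categories and values are illustrative assumptions, and continuous updates (for example, a rain-intensity value) could be used instead.

```python
# Minimal sketch (illustrative values only): a discrete action space of scenario
# updates over road topology, road degradation, time, weather and traffic.
from itertools import product

ROAD_TOPOLOGY = ["straight", "t_junction", "overpass", "winding_mountain_road"]
ROAD_DEGRADATION = ["none", "mild", "severe"]
TIME_OF_DAY = ["day", "dusk", "night"]
WEATHER = ["clear", "rain", "fog", "snow"]
TRAFFIC = ["sparse", "dense"]

ACTION_SPACE = list(product(ROAD_TOPOLOGY, ROAD_DEGRADATION, TIME_OF_DAY, WEATHER, TRAFFIC))
print(len(ACTION_SPACE))  # 4 * 3 * 3 * 4 * 2 = 288 discrete scenario updates

def apply_action(scenario, action):
    """Overwrite the corresponding fields of a scenario dict with the chosen update."""
    keys = ["road_topology", "road_degradation", "time", "weather", "traffic"]
    return {**scenario, **dict(zip(keys, action))}
```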
需要说明的是,在本申请实施例中,自动驾驶场景的生成系统的强化学习智能体为基于反馈信息确定奖励和当前ADS对应的车辆的状态,并基于奖励和当前ADS对应的车辆的状态从参考自动驾驶场景的动作空间中选择动作的智能体。其中,奖励为基于安全违规的奖励和基于覆盖的奖励之和,ADS对应的车辆的状态用于指示ADS对应的车辆在参考自动驾驶场景中的位置。It should be noted that, in this embodiment of the present application, the reinforcement learning agent of the automatic driving scenario generation system is an agent that determines a reward and the current state of the vehicle corresponding to the ADS based on the feedback information, and selects an action from the action space of the reference automatic driving scenario based on the reward and the current state of the vehicle corresponding to the ADS. The reward is the sum of a safety-violation-based reward and a coverage-based reward, and the state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scenario.
需要说明的是,在本申请实施例中,基于安全违规的奖励为基于反馈信息中的车辆控制指令更新ADS对应的车辆的状态后ADS对应的车辆的安全违规概率接近预设指标的接近程度,例如,如果基于安全违规的奖励越大,即基于反馈信息中的车辆控制指令更新ADS对应的车辆的状态后ADS对应的车辆的安全违规概率接近预设指标的接近程度越高,则ADS对应的车辆在参考自动驾驶场景中行驶时安全违规的概率越高,从而可以鼓励参考自动驾驶场景朝容易安全违规的自动驾驶场景方向演变;It should be noted that, in this embodiment of the present application, the safety-violation-based reward is the degree to which the safety violation probability of the vehicle corresponding to the ADS is close to the preset index after the state of the vehicle corresponding to the ADS is updated based on the vehicle control command in the feedback information. For example, the larger the safety-violation-based reward, that is, the closer the safety violation probability of the vehicle corresponding to the ADS is to the preset index after its state is updated based on the vehicle control command in the feedback information, the higher the probability of a safety violation when the corresponding vehicle drives in the reference automatic driving scenario, which encourages the reference automatic driving scenario to evolve toward automatic driving scenarios that are prone to safety violations;
基于覆盖的奖励为基于反馈信息中的神经网络行为信息确定的参考自动驾驶场景的覆盖率接近预设指标的接近程度,例如,如果基于测试覆盖的奖励越大,即基于反馈信息中的神经网络行为信息确定的参考自动驾驶场景的覆盖率接近预设指标的接近程度越高,则ADS在参考自动驾驶场景进行测试时ADS对应的神经网络中激活的神经元和/或层级相关性越高,参考自动驾驶场景的覆盖率越高,从而可以鼓励参考自动驾驶场景朝之前未测试的自动驾驶场景方向演变,覆盖更多的自动驾驶场景。The coverage-based reward is the degree to which the coverage of the reference automatic driving scenario, determined based on the neural network behavior information in the feedback information, is close to the preset index. For example, the larger the coverage-based reward, that is, the closer the coverage of the reference automatic driving scenario determined based on the neural network behavior information in the feedback information is to the preset index, the more neurons are activated and/or the higher the layer-wise relevance in the neural network corresponding to the ADS when the ADS is tested in the reference automatic driving scenario, and the higher the coverage of the reference automatic driving scenario, which encourages the reference automatic driving scenario to evolve toward previously untested automatic driving scenarios and to cover more automatic driving scenarios.
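One possible way to combine the two reward terms is sketched below; the linear "closeness" measure and the preset values are assumptions, since this embodiment does not fix a particular formula.

```python
# Minimal sketch (assumed definitions): total reward as the sum of a
# safety-violation-based term and a coverage-based term, each measured as the
# closeness of the corresponding quantity to its preset index.

def closeness(value, target):
    """1.0 when `value` reaches `target`; decreases linearly with the gap
    (one possible choice among many)."""
    return max(0.0, 1.0 - abs(target - value) / target) if target > 0 else 0.0

def total_reward(violation_prob, coverage, preset_violation=0.8, preset_coverage=0.8):
    r_safety = closeness(violation_prob, preset_violation)   # safety-violation-based reward
    r_coverage = closeness(coverage, preset_coverage)        # coverage-based reward
    return r_safety + r_coverage

print(total_reward(violation_prob=0.4, coverage=0.75))
```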
需要说明的是,在本申请实施例中,为了自动驾驶场景的生成系统强化学习智能体可以基于反馈信息有策略地从参考自动驾驶场景的动作空间中选择动作,需要对强化学习智能体进行训练。具体的,设置强化学习智能体的神经网络模型为价值网络和策略网络,其中,价值网络用于计算设定状态下的设定动作的价值,策略网络用于获取设定状态下的动作概率分布。获取当前ADS对应的车辆的第一状态,以及强化学习智能体基于策略网络从参考自动驾驶场景的动作空间中选择的第一动作,基于第一动作对参考自动驾驶场景进行更新,并获取ADS在更新后的参考自动驾驶场景下进行测试时输出的反馈信息,获取强化学习智能体基于反馈信息确定的第一动作的奖励和当前ADS对应的车辆的第二状态,以及基于策略网络从参考自动驾驶场景的动作空间中选择的第二动作,基于价值网络计算第一状态下的第一动作的第一价值和第二状态下的第二动作的第二价值,基于第一价值、第二价值以及第一动作的奖励,确定时间差分误差,其中,时间差分误差为价值网络预测的价值和实际的价值之差,获取价值网络和策略网络的梯度,并基于时间差分误差和价值网络的梯度更新价值网络的参数,基于时间差分误差和策略网络的梯度更新策略网络的参数。It should be noted that, in this embodiment of the present application, in order for the reinforcement learning agent of the automatic driving scenario generation system to select actions from the action space of the reference automatic driving scenario strategically based on the feedback information, the reinforcement learning agent needs to be trained. Specifically, the neural network model of the reinforcement learning agent is set as a value network and a policy network, where the value network is used to compute the value of a given action in a given state, and the policy network is used to obtain the action probability distribution in a given state. The first state of the vehicle corresponding to the current ADS is obtained, together with a first action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the policy network; the reference automatic driving scenario is updated based on the first action, and the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario is obtained; the reward of the first action determined by the reinforcement learning agent based on the feedback information and the second state of the vehicle corresponding to the current ADS are obtained, together with a second action selected from the action space of the reference automatic driving scenario based on the policy network; a first value of the first action in the first state and a second value of the second action in the second state are computed based on the value network; a temporal-difference error is determined based on the first value, the second value and the reward of the first action, where the temporal-difference error is the difference between the value predicted by the value network and the actual value; the gradients of the value network and the policy network are obtained, the parameters of the value network are updated based on the temporal-difference error and the gradient of the value network, and the parameters of the policy network are updated based on the temporal-difference error and the gradient of the policy network.
示例性地,策略函数是指智能体在强化学习中使用的采用行为的规则,例如,在学习过程中,可以根据状态输出动作,并以此动作探索环境,以更新状态。价值函数是指智能体在强化学习中由环境提供的强化信号(即奖励)对动作和状态的好坏作一种评价的规则,例如,动作价值用来评价动作的好坏,状态价值用来评价当前状态的好坏。设置强化学习智能体的神经网络模型为价值网络和策略网络,即将策略网络和价值网络这两个神经网络分别近似强化学习智能体的策略函数和动作价值函数,进而近似状态价值函数,例如,若V(s;θ,ω)是状态价值函数(即当前状态的价值),它是策略函数π(a|s;θ)(即动作的概率)和动作价值函数q(s,a;ω)(即动作的价值)的连加,如V(s;θ,ω)=Σ_a π(a|s;θ)·q(s,a;ω)。Illustratively, the policy function refers to the rule by which the agent selects behaviors in reinforcement learning; for example, during learning, an action can be output according to the current state, and the environment is explored with this action so as to update the state. The value function refers to the rule by which the agent evaluates actions and states using the reinforcement signal (i.e., the reward) provided by the environment: the action value evaluates how good an action is, and the state value evaluates how good the current state is. The neural network model of the reinforcement learning agent is set as a value network and a policy network, i.e., the policy network and the value network approximate the agent's policy function and action value function respectively, and thereby approximate the state value function. For example, if V(s; θ, ω) is the state value function (i.e., the value of the current state), it is obtained by summing, over the actions, the policy function π(a|s; θ) (i.e., the probability of an action) multiplied by the action value function q(s, a; ω) (i.e., the value of an action), e.g., V(s; θ, ω) = Σ_a π(a|s; θ)·q(s, a; ω).
对上述强化学习智能体进行训练的步骤如下:The steps for training the above reinforcement learning agent are as follows:
(1)获取当前ADS对应的车辆的第一状态s_t,以及强化学习智能体基于神经网络模型中的策略网络随机从参考自动驾驶场景的动作空间中选择的第一动作a_t;(1) Obtain the first state s_t of the vehicle corresponding to the current ADS, and the first action a_t randomly selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the policy network in the neural network model;
(2)基于第一动作a_t对参考自动驾驶场景进行更新,并获取ADS在更新后的参考自动驾驶场景下进行测试时输出的反馈信息;(2) Update the reference automatic driving scenario based on the first action a_t, and obtain the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario;
(3)获取强化学习智能体基于反馈信息确定的第一动作a_t的奖励r_1以及当前ADS对应的车辆的第二状态s_{t+1},以及强化学习智能体基于神经网络模型中的策略网络随机从参考自动驾驶场景的动作空间中选择的第二动作a_{t+1};(3) Obtain the reward r_1 of the first action a_t determined by the reinforcement learning agent based on the feedback information, the second state s_{t+1} of the vehicle corresponding to the current ADS, and the second action a_{t+1} randomly selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the policy network in the neural network model;
(4)获取强化学习智能体基于神经网络模型中的价值网络计算的第一状态下的第一动作的第一价值q_t=q(s_t,a_t;ω_t)和第二状态下的第二动作的第二价值q_{t+1}=q(s_{t+1},a_{t+1};ω_{t+1});(4) Obtain the first value q_t = q(s_t, a_t; ω_t) of the first action in the first state and the second value q_{t+1} = q(s_{t+1}, a_{t+1}; ω_{t+1}) of the second action in the second state, calculated by the reinforcement learning agent based on the value network in the neural network model;
(5)基于第一价值q_t、第二价值q_{t+1}以及第一动作a_t的奖励r_1,确定时间差分误差(temporal-difference error,TD error),其中,TD error为价值网络预测的价值和实际的价值之差,如δ_t=q_t-(r_1+γ·q_{t+1});(5) Determine the temporal-difference error (TD error) based on the first value q_t, the second value q_{t+1} and the reward r_1 of the first action a_t, where the TD error is the difference between the value predicted by the value network and the actual value, e.g., δ_t = q_t - (r_1 + γ·q_{t+1});
(7)基于TD error,使用价值网络的梯度以梯度下降的方式更新价值网络的参数,如ω_{t+1}=ω_t-α·δ_t·d_{ω,t};(7) Based on the TD error, update the parameters of the value network by gradient descent using the gradient d_{ω,t} of the value network, e.g., ω_{t+1} = ω_t - α·δ_t·d_{ω,t};
(9)基于TD error,使用策略网络的梯度以梯度上升的方式更新策略网络的参数,如θ_{t+1}=θ_t+β·δ_t·d_{θ,t}。(9) Based on the TD error, update the parameters of the policy network by gradient ascent using the gradient d_{θ,t} of the policy network, e.g., θ_{t+1} = θ_t + β·δ_t·d_{θ,t}.
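For illustration only, the following sketch implements steps (1) to (9) above with linear function approximators standing in for the value network and the policy network; the state featurisation, the learning rates and the toy environment are assumptions, and the sign of the policy update follows the advantage r_1 + γ·q_{t+1} - q_t = -δ_t implied by the TD-error definition in step (5).

```python
# Minimal sketch (not the patent's code): one actor-critic training step with
# linear function approximation, mirroring steps (1)-(9) above.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEAT = 4, 8
theta = np.zeros((N_ACTIONS, N_FEAT))   # policy network parameters (linear softmax)
omega = np.zeros((N_ACTIONS, N_FEAT))   # value network parameters (linear q)
alpha, beta, gamma = 0.05, 0.05, 0.9    # value lr, policy lr, discount factor

def features(state):
    return np.tanh(state)                # assumed featurisation of the vehicle state

def policy_probs(phi):
    z = theta @ phi                      # softmax policy pi(a|s; theta)
    e = np.exp(z - z.max())
    return e / e.sum()

def q_value(phi, a):
    return omega[a] @ phi                # action value q(s, a; omega)

def train_step(s_t, env_step):
    global theta, omega
    phi_t = features(s_t)
    a_t = rng.choice(N_ACTIONS, p=policy_probs(phi_t))      # (1) first state / first action
    s_t1, r_1 = env_step(s_t, a_t)                          # (2)-(3) update scenario, read feedback
    phi_t1 = features(s_t1)
    a_t1 = rng.choice(N_ACTIONS, p=policy_probs(phi_t1))    # (3) second action
    q_t, q_t1 = q_value(phi_t, a_t), q_value(phi_t1, a_t1)  # (4) first and second values
    delta = q_t - (r_1 + gamma * q_t1)                      # (5) TD error as defined above
    omega[a_t] -= alpha * delta * phi_t                     # (7) gradient descent on the value net
    grad_log_pi = np.outer(-policy_probs(phi_t), phi_t)     # d log pi(a_t|s_t) / d theta
    grad_log_pi[a_t] += phi_t
    theta += beta * (-delta) * grad_log_pi                  # (9) gradient ascent on the policy net
    return s_t1

# Toy stand-in for the simulator: reward 1 whenever the agent picks action 2.
def toy_env(state, action):
    return rng.normal(size=N_FEAT), 1.0 if action == 2 else 0.0

s = rng.normal(size=N_FEAT)
for _ in range(500):
    s = train_step(s, toy_env)
print(policy_probs(features(s)))  # probability mass should shift toward action 2
```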
上述技术方案,获取强化学习智能体基于反馈信息从参考自动驾驶场景的动作空间中选择的动作,其中,强化学习智能体为基于反馈信息确定奖励和当前ADS对应的车辆的状态,并基于奖励和当前ADS对应的车辆的状态从动作空间中选择动作的智能体,再基于选择的动作对参考驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足所述预设指标,通过获取强化学习智能体以“试错”的方式进行强化学习后,基于反馈信息中的车辆控制指令和神经网络行为信息确定的奖励去指导行为,以强化学习智能体获得最大的奖励为目标从参考自动驾驶场景的动作空间中选择的动作,从而引导驾驶场景朝容易导致安全违规和以前未测试的方向更新,即生成最可能需要测试的容易导致安全违规和覆盖率较高的驾驶场景,不仅可以提高测试驾驶场景的效率,同时还可以覆盖更多的驾驶场景,另外,基于闭环的自动化测试可进一步降低时间开销和人力开销。According to the above technical solution, the action selected by the reinforcement learning agent from the action space of the reference automatic driving scenario based on the feedback information is obtained, where the reinforcement learning agent is an agent that determines a reward and the current state of the vehicle corresponding to the ADS based on the feedback information and selects an action from the action space based on the reward and the current state of the vehicle corresponding to the ADS; the reference driving scenario is then updated based on the selected action to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario meet the preset index. After the reinforcement learning agent performs reinforcement learning in a "trial-and-error" manner, the reward determined from the vehicle control command and the neural network behavior information in the feedback information is used to guide its behavior, and actions are selected from the action space of the reference automatic driving scenario with the goal of maximizing the reward obtained by the reinforcement learning agent, thereby guiding the driving scenario to be updated toward directions that easily lead to safety violations and that have not been tested before, that is, generating the driving scenarios that most need to be tested, namely those prone to safety violations and with high coverage. This not only improves the efficiency of testing driving scenarios but also covers more driving scenarios; in addition, the closed-loop automated test further reduces time overhead and labor overhead.
上述各个实施例可以单独使用,也可以相互结合使用,以达到不同的技术效果。The above embodiments can be used alone or in combination with each other to achieve different technical effects.
上述本申请提供的实施例中,从自动驾驶场景的生成系统作为执行主体的角度对本申请实施例提供的方法进行了介绍。为了实现上述本申请实施例提供的方法中的各功能,自动驾驶场景的生成系统可以包括硬件结构和/或软件模块,以硬件结构、软件模块、或硬件结构加软件模块的形式来实现上述各功能。上述各功能中的某个功能以硬件结构、软件模块、还是硬件结构加软件模块的方式来执行,取决于技术方案的特定应用和设计约束条件。In the above-mentioned embodiments of the present application, the methods provided by the embodiments of the present application are introduced from the perspective of an automatic driving scene generation system as an execution subject. In order to realize the functions in the methods provided by the above embodiments of the present application, the generation system for automatic driving scenarios may include a hardware structure and/or a software module, and implement the above-mentioned various functions in the form of a hardware structure, a software module, or a hardware structure plus a software module. Function. Whether one of the above functions is performed in the form of a hardware structure, a software module, or a hardware structure plus a software module depends on the specific application and design constraints of the technical solution.
基于同一技术构思,本申请实施例还提供一种自动驾驶场景的生成装置400,该装置400可以是自动驾驶场景的生成系统,或者是自动驾驶场景的生成系统中的装置,该装置400包括用于执行上述图3所示方法的模块。示例性地,参见图4,该装置400可以包括:Based on the same technical concept, an embodiment of the present application further provides an apparatus 400 for generating an automatic driving scenario. The apparatus 400 may be an automatic driving scenario generation system, or an apparatus in an automatic driving scenario generation system, and the apparatus 400 includes modules for executing the method shown in FIG. 3 above. Exemplarily, referring to FIG. 4, the apparatus 400 may include:
第一获取模块401,用于获取自动驾驶系统ADS在参考自动驾驶场景下进行测试时输出的反馈信息;其中,所述反馈信息包括车辆控制指令和神经网络行为信息;The first obtaining module 401 is used to obtain feedback information output by the automatic driving system ADS when testing in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;
第二获取模块402,用于基于所述反馈信息获取安全违规参数和覆盖参数;其中,所述安全违规参数用于指示在所述参考自动驾驶场景中所述ADS对应的车辆根据所述车辆控制指令行驶时安全违规的概率,所述覆盖参数用于指示在所述参考自动驾驶场景下进行测试时所述ADS对应的神经网络中激活的神经元和/或层级相关性;The second obtaining module 402 is configured to obtain a safety violation parameter and a coverage parameter based on the feedback information; where the safety violation parameter is used to indicate the probability of a safety violation when the vehicle corresponding to the ADS drives according to the vehicle control command in the reference automatic driving scenario, and the coverage parameter is used to indicate the activated neurons and/or layer-wise relevance in the neural network corresponding to the ADS when the test is performed in the reference automatic driving scenario;
更新模块403,用于若所述安全违规参数或所述覆盖参数不满足预设指标,则基于所述反馈信息对所述参考自动驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足所述预设指标。The updating module 403 is configured to: if the safety violation parameter or the coverage parameter does not meet the preset index, update the reference automatic driving scenario based on the feedback information to obtain an updated reference automatic driving scenario, until the safety violation parameter and the coverage parameter obtained based on the feedback information output by the ADS when it is tested in the updated reference automatic driving scenario meet the preset index.
一种可能的设计中,所述装置还包括创建模块,用于:In a possible design, the device further includes a creation module for:
获取所述ADS的初始自动驾驶场景;obtaining the initial automatic driving scenario of the ADS;
分析所述初始自动驾驶场景中的道路类型以及所述ADS对应的车辆在所述初始自动驾驶场景中行驶时安全违规的概率,基于分析结果,创建所述参考自动驾驶场景;其中,所述参考自动驾驶场景包括典型自动驾驶场景、缺失自动驾驶场景和易违规自动驾驶场景中的任一种或多种。Analyze the road type in the initial automatic driving scene and the probability of safety violations when the vehicle corresponding to the ADS is driving in the initial automatic driving scene, and create the reference automatic driving scene based on the analysis result; wherein, the reference The autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
一种可能的设计中,所述安全违规参数包括所述ADS对应的车辆与其他车辆或行人的平行距离小于安全距离的概率,所述ADS对应的车辆与其他车辆或行人的垂直距离小于安全距离的概率,所述ADS对应的车辆违反交通灯指示的概率,所述ADS对应的车辆违反交通标志指示的概率,所述ADS对应的车辆违反交通警察指挥的概率和所述ADS对应的车辆超速的概率中的任一种或多种;In a possible design, the safety violation parameter includes the probability that the parallel distance between the vehicle corresponding to the ADS and other vehicles or pedestrians is less than the safety distance, and the vertical distance between the vehicle corresponding to the ADS and other vehicles or pedestrians is less than the safety distance. The probability that the vehicle corresponding to the ADS violates the traffic light indication, the probability that the vehicle corresponding to the ADS violates the traffic sign indication, the probability that the vehicle corresponding to the ADS violates the command of the traffic police and the speeding rate of the vehicle corresponding to the ADS any one or more of the probabilities;
所述覆盖参数包括所述ADS对应的神经网络中激活的神经元的数量和所述ADS对应的神经网络的热力图中输入子集对预测结果的影响的权重中的任一种或多种。The coverage parameter includes any one or more of the number of activated neurons in the neural network corresponding to the ADS and the weight of the influence of the input subset on the prediction result in the heatmap of the neural network corresponding to the ADS.
一种可能的设计中,所述更新模块403,具体用于:In a possible design, the update module 403 is specifically used for:
获取强化学习智能体基于所述反馈信息从所述参考自动驾驶场景的动作空间中选择的动作;其中,所述动作空间中的动作为自动驾驶场景中的道路拓扑、道路降解、动态时间、动态天气、动态交通或景观信息的离散或连续更新;Acquiring actions selected by the reinforcement learning agent from the action space of the reference autonomous driving scene based on the feedback information; wherein the actions in the action space are road topology, road degradation, dynamic time, dynamic time in the autonomous driving scene Discrete or continuous updates of weather, dynamic traffic or landscape information;
基于所述选择的动作对所述参考驾驶场景进行更新。The reference driving scenario is updated based on the selected action.
一种可能的设计中,所述强化学习智能体为基于反馈信息确定奖励和当前所述ADS对应的车辆的状态,并基于所述奖励和当前所述ADS对应的车辆的状态从所述动作空间中选择动作的智能体;In a possible design, the reinforcement learning agent determines the reward based on the feedback information and the current state of the vehicle corresponding to the ADS, and based on the reward and the current state of the vehicle corresponding to the ADS, from the action space The agent that selects the action in ;
其中,所述奖励为基于安全违规的奖励和基于覆盖的奖励之和,所述基于安全违规的奖励为基于所述反馈信息中的车辆控制指令更新所述ADS对应的车辆的状态后所述ADS对应的车辆的安全违规概率接近预设指标的接近程度,所述基于覆盖的奖励为基于所述反馈信息中的神经网络行为信息确定的所述参考自动驾驶场景的覆盖率接近预设指标的接近程度,所述ADS对应的车辆的状态用于指示所述ADS对应的车辆在所述参考自动驾驶场景中的位置。Wherein, the reward is the sum of the reward based on safety violation and the reward based on coverage, and the reward based on safety violation is the ADS after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information The degree to which the safety violation probability of the corresponding vehicle is close to the preset index, and the coverage-based reward is the closeness that the coverage rate of the reference automatic driving scene determined based on the neural network behavior information in the feedback information is close to the preset index The state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scene.
一种可能的设计中,所述强化学习智能体的神经网络模型包括价值网络和策略网络,所述价值网络用于计算设定状态下的设定动作的价值,所述策略网络用于获取设定状态下的动作概率分布;所述装置还包括训练模块,用于:In a possible design, the neural network model of the reinforcement learning agent includes a value network and a policy network, where the value network is used to calculate the value of a given action in a given state, and the policy network is used to obtain the action probability distribution in a given state; the apparatus further includes a training module, configured to:
获取当前所述ADS对应的车辆的第一状态,以及所述强化学习智能体基于所述策略网络从所述动作空间中选择的第一动作;obtaining the first state of the vehicle corresponding to the current ADS, and the first action selected by the reinforcement learning agent from the action space based on the policy network;
基于所述第一动作对所述参考自动驾驶场景进行更新,并获取所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息;updating the reference automatic driving scene based on the first action, and acquiring feedback information output by the ADS when testing in the updated reference automatic driving scene;
获取所述强化学习智能体基于所述反馈信息确定的所述第一动作的奖励和当前所述ADS对应的车辆的第二状态,以及基于所述策略网络从所述动作空间中选择的第二动作;obtain the reward of the first action determined by the reinforcement learning agent based on the feedback information, the current second state of the vehicle corresponding to the ADS, and a second action selected from the action space based on the policy network;
基于所述价值网络计算所述第一状态下的第一动作的第一价值和所述第二状态下的第二动作的第二价值;computing a first value for a first action in the first state and a second value for a second action in the second state based on the value network;
基于所述第一价值、所述第二价值以及所述第一动作的奖励,确定时间差分误差;其中,所述时间差分误差为所述价值网络预测的价值和实际的价值之差;Based on the first value, the second value and the reward of the first action, a time difference error is determined; wherein, the time difference error is the difference between the value predicted by the value network and the actual value;
获取所述价值网络和所述策略网络的梯度,并基于所述时间差分误差和所述价值网络的梯度更新所述价值网络的参数,基于所述时间差分误差和所述策略网络的梯度更新所述策略网络的参数。obtain the gradients of the value network and the policy network, update the parameters of the value network based on the temporal-difference error and the gradient of the value network, and update the parameters of the policy network based on the temporal-difference error and the gradient of the policy network.
基于同一技术构思,参见图5,本申请实施例还提供一种自动驾驶场景的生成系统500,包括:Based on the same technical concept, referring to FIG. 5 , an embodiment of the present application further provides a system 500 for generating an automatic driving scenario, including:
至少一个处理器501;以及,与所述至少一个处理器501通信连接的通信接口503;at least one processor 501; and, a communication interface 503 communicatively connected to the at least one processor 501;
其中,所述至少一个处理器501通过执行存储器502存储的指令,使得所述自动驾驶场景的生成系统500执行图3所示的方法。Wherein, the at least one processor 501 causes the automatic driving scenario generation system 500 to execute the method shown in FIG. 3 by executing the instructions stored in the memory 502 .
可选的,所述存储器502位于所述自动驾驶场景的生成系统500之外。Optionally, the memory 502 is located outside the automatic driving scenario generation system 500 .
可选的,所述自动驾驶场景的生成系统500包括所述存储器502,所述存储器502与所述至少一个处理器501相连,所述存储器502存储有可被所述至少一个处理器501执行的指令。附图5用虚线表示存储器502对于自动驾驶场景的生成系统500是可选的。Optionally, the automatic driving scenario generation system 500 includes the memory 502, the memory 502 is connected to the at least one processor 501, and the memory 502 stores instructions executable by the at least one processor 501. FIG. 5 uses a dashed line to indicate that the memory 502 is optional for the automatic driving scenario generation system 500.
其中,所述处理器501和所述存储器502可以通过接口电路耦合,也可以集成在一起,这里不做限制。The processor 501 and the memory 502 may be coupled through an interface circuit, or may be integrated together, which is not limited here.
本申请实施例中不限定上述处理器501、存储器502以及通信接口503之间的具体连接介质。本申请实施例在图5中以处理器501、存储器502以及通信接口503之间通过总线504连接,总线在图5中以粗线表示,其它部件之间的连接方式,仅是进行示意性说明,并不引以为限。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图5中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。The specific connection medium between the processor 501, the memory 502 and the communication interface 503 is not limited in the embodiments of the present application. In this embodiment of the present application, the processor 501, the memory 502 and the communication interface 503 in FIG. 5 are connected through a bus 504; the bus is represented by a thick line in FIG. 5, and the connection manner between other components is only schematically illustrated and is not limiting. The bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 5, but this does not mean that there is only one bus or only one type of bus.
应理解,本申请实施例中提及的处理器可以通过硬件实现也可以通过软件实现。当通过硬件实现时,该处理器可以是逻辑电路、集成电路等。当通过软件实现时,该处理器可以是一个通用处理器,通过读取存储器中存储的软件代码来实现。It should be understood that the processor mentioned in the embodiments of the present application may be implemented by hardware or software. When implemented in hardware, the processor may be a logic circuit, an integrated circuit, or the like. When implemented in software, the processor may be a general-purpose processor implemented by reading software codes stored in memory.
示例性地,处理器可以是中央处理单元(central processing unit,CPU),还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。Exemplarily, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
应理解,本申请实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。It should be understood that the memory mentioned in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM) or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM) and a direct rambus RAM (DR RAM).
需要说明的是,当处理器为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)可以集成在处理器中。It should be noted that when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, the memory (storage module) can be integrated in the processor.
应注意,本文描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。It should be noted that the memory described herein is intended to include, but not be limited to, these and any other suitable types of memory.
基于同一技术构思,本申请实施例还提供一种计算机存储介质,包括计算机指令,当所述计算机指令在计算机上运行时,使得如图3所示的方法被执行。Based on the same technical concept, an embodiment of the present application further provides a computer storage medium, including computer instructions, when the computer instructions are executed on a computer, the method shown in FIG. 3 is executed.
基于同一技术构思,本申请实施例还提供一种芯片,所述芯片与存储器耦合,用于读取并执行所述存储器中存储的程序指令,使得如图3所示的方法被执行。Based on the same technical concept, an embodiment of the present application further provides a chip, which is coupled to a memory and used to read and execute program instructions stored in the memory, so that the method shown in FIG. 3 is executed.
基于同一技术构思,本申请实施例还提供一种计算机程序产品,当所述计算机程序产品在计算机上运行时,使得如图3所示的方法被执行。Based on the same technical concept, an embodiment of the present application also provides a computer program product, which enables the method shown in FIG. 3 to be executed when the computer program product runs on a computer.
应理解,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。It should be understood that, all relevant contents of the steps involved in the above method embodiments can be cited in the functional descriptions of the corresponding functional modules, which will not be repeated here.
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
本申请是参照根据本申请的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the present application. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的保护范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the protection scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.
Claims (15)
- 一种自动驾驶场景的生成方法,其特征在于,包括:A method for generating an automatic driving scene, comprising:获取自动驾驶系统ADS在参考自动驾驶场景下进行测试时输出的反馈信息;其中,所述反馈信息包括车辆控制指令和神经网络行为信息;Acquiring feedback information output by the automatic driving system ADS when testing in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;基于所述反馈信息获取安全违规参数和覆盖参数;其中,所述安全违规参数用于指示在所述参考自动驾驶场景中所述ADS对应的车辆根据所述车辆控制指令行驶时安全违规的概率,所述覆盖参数用于指示在所述参考自动驾驶场景下进行测试时所述ADS对应的神经网络中激活的神经元和/或层级相关性;Obtain safety violation parameters and coverage parameters based on the feedback information; wherein, the safety violation parameters are used to indicate the probability of safety violations when the vehicle corresponding to the ADS in the reference automatic driving scenario is driving according to the vehicle control instruction, The coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference automatic driving scenario;若所述安全违规参数或所述覆盖参数不满足预设指标,则基于所述反馈信息对所述参考自动驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足所述预设指标。If the safety violation parameter or the coverage parameter does not meet the preset index, the reference automatic driving scene is updated based on the feedback information to obtain an updated reference automatic driving scene, until the updated reference automatic driving scene is performed based on the ADS. The safety violation parameters and coverage parameters obtained by referring to the feedback information output during the test in the automatic driving scenario meet the preset indicators.
- 如权利要求1所述的方法,其特征在于,获取自动驾驶系统ADS在参考自动驾驶场景下进行测试时输出的反馈信息之前,还包括:The method according to claim 1, before acquiring the feedback information output by the automatic driving system ADS when testing in the reference automatic driving scenario, further comprising:获取所述ADS的初始自动驾驶场景;obtaining the initial automatic driving scenario of the ADS;分析所述初始自动驾驶场景中的道路类型以及所述ADS对应的车辆在所述初始自动驾驶场景中行驶时安全违规的概率,基于分析结果,创建所述参考自动驾驶场景;其中,所述参考自动驾驶场景包括典型自动驾驶场景、缺失自动驾驶场景和易违规自动驾驶场景中的任一种或多种。Analyze the road type in the initial automatic driving scene and the probability of safety violations when the vehicle corresponding to the ADS is driving in the initial automatic driving scene, and create the reference automatic driving scene based on the analysis result; wherein, the reference The autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
- 如权利要求1或2所述的方法,其特征在于,所述安全违规参数包括所述ADS对应的车辆与其他车辆或行人的平行距离小于安全距离的概率,所述ADS对应的车辆与其他车辆或行人的垂直距离小于安全距离的概率,所述ADS对应的车辆违反交通灯指示的概率,所述ADS对应的车辆违反交通标志指示的概率,所述ADS对应的车辆违反交通警察指挥的概率和所述ADS对应的车辆超速的概率中的任一种或多种;The method according to claim 1 or 2, wherein the safety violation parameter includes a probability that the parallel distance between the vehicle corresponding to the ADS and other vehicles or pedestrians is less than a safe distance, and the vehicle corresponding to the ADS is connected to other vehicles. Or the probability that the vertical distance of the pedestrian is less than the safety distance, the probability that the vehicle corresponding to the ADS violates the traffic light indication, the probability that the vehicle corresponding to the ADS violates the traffic sign indication, the probability that the vehicle corresponding to the ADS violates the traffic police command and any one or more of the probabilities of the vehicle speeding corresponding to the ADS;所述覆盖参数包括所述ADS对应的神经网络中激活的神经元的数量和所述ADS对应的神经网络的热力图中输入子集对预测结果的影响的权重中的任一种或多种。The coverage parameter includes any one or more of the number of activated neurons in the neural network corresponding to the ADS and the weight of the influence of the input subset on the prediction result in the heatmap of the neural network corresponding to the ADS.
- 如权利要求1~3任一所述的方法,其特征在于,基于所述反馈信息对所述参考自动驾驶场景进行更新,包括:The method according to any one of claims 1 to 3, wherein updating the reference automatic driving scene based on the feedback information comprises:获取强化学习智能体基于所述反馈信息从所述参考自动驾驶场景的动作空间中选择的动作;其中,所述动作空间中的动作为自动驾驶场景中的道路拓扑、道路降解、动态时间、动态天气、动态交通或景观信息的离散或连续更新;Acquiring actions selected by the reinforcement learning agent from the action space of the reference autonomous driving scene based on the feedback information; wherein the actions in the action space are road topology, road degradation, dynamic time, dynamic time in the autonomous driving scene Discrete or continuous updates of weather, dynamic traffic or landscape information;基于所述选择的动作对所述参考驾驶场景进行更新。The reference driving scenario is updated based on the selected action.
- 如权利要求4所述的方法,其特征在于,所述强化学习智能体为基于反馈信息确定奖励和当前所述ADS对应的车辆的状态,并基于所述奖励和当前所述ADS对应的车辆的状态从所述动作空间中选择动作的智能体;The method of claim 4, wherein the reinforcement learning agent determines the reward based on the feedback information and the state of the vehicle corresponding to the current ADS, and based on the reward and the current state of the vehicle corresponding to the ADS Agents whose states select actions from said action space;其中,所述奖励为基于安全违规的奖励和基于覆盖的奖励之和,所述基于安全违规的奖励为基于所述反馈信息中的车辆控制指令更新所述ADS对应的车辆的状态后所述ADS对应的车辆的安全违规概率接近预设指标的接近程度,所述基于覆盖的奖励为基于所述反馈信息中的神经网络行为信息确定的所述参考自动驾驶场景的覆盖率接近预设指标的接 近程度,所述ADS对应的车辆的状态用于指示所述ADS对应的车辆在所述参考自动驾驶场景中的位置。Wherein, the reward is the sum of the reward based on safety violation and the reward based on coverage, and the reward based on safety violation is the ADS after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information The degree to which the safety violation probability of the corresponding vehicle is close to the preset index, and the coverage-based reward is the closeness that the coverage rate of the reference automatic driving scene determined based on the neural network behavior information in the feedback information is close to the preset index The state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scene.
- 如权利要求5所述的方法,其特征在于,所述强化学习智能体的神经网络模型包括价值网络和策略网络,所述价值网络用于计算设定状态下的设定动作的价值,所述策略网络用于获取设定状态下的动作概率分布;获取强化学习智能体基于所述反馈信息从所述参考自动驾驶场景的动作空间中选择的动作之前,还包括:The method of claim 5, wherein the neural network model of the reinforcement learning agent comprises a value network and a policy network, the value network is used to calculate the value of a set action in a set state, the The policy network is used to obtain the action probability distribution in the set state; before obtaining the action selected by the reinforcement learning agent from the action space of the reference automatic driving scene based on the feedback information, the method further includes:获取当前所述ADS对应的车辆的第一状态,以及所述强化学习智能体基于所述策略网络从所述动作空间中选择的第一动作;obtaining the first state of the vehicle corresponding to the current ADS, and the first action selected by the reinforcement learning agent from the action space based on the policy network;基于所述第一动作对所述参考自动驾驶场景进行更新,并获取所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息;updating the reference automatic driving scene based on the first action, and acquiring feedback information output by the ADS when testing in the updated reference automatic driving scene;获取所述强化学习智能体基于所述反馈信息确定的所述第一动作的奖励和当前所述ADS对应的车辆的第二状态,以及基于所述策略网络从所述动作空间中选择的第二动作;Obtain the reward of the first action determined by the reinforcement learning agent based on the feedback information and the current second state of the vehicle corresponding to the ADS, and the second state selected from the action space based on the policy network. action;基于所述价值网络计算所述第一状态下的第一动作的第一价值和所述第二状态下的第二动作的第二价值;computing a first value for a first action in the first state and a second value for a second action in the second state based on the value network;基于所述第一价值、所述第二价值以及所述第一动作的奖励,确定时间差分误差;其中,所述时间差分误差为所述价值网络预测的价值和实际的价值之差;Based on the first value, the second value and the reward of the first action, a time difference error is determined; wherein, the time difference error is the difference between the value predicted by the value network and the actual value;获取所述价值网络和所述策略网络的梯度,并基于所述时间差分误差和所述价值网络的梯度更新所述价值网络的参数,基于所述时间差分误差和所述策略网络的梯度更新所述策略网络的参数。Obtain the gradients of the value network and the strategy network, and update the parameters of the value network based on the time difference error and the gradient of the value network, and update the parameters based on the time difference error and the gradient of the strategy network. Describe the parameters of the policy network.
- 一种自动驾驶场景的生成装置,其特征在于,包括:A device for generating automatic driving scenarios, comprising:第一获取模块,用于获取自动驾驶系统ADS在参考自动驾驶场景下进行测试时输出的反馈信息;其中,所述反馈信息包括车辆控制指令和神经网络行为信息;a first acquisition module, configured to acquire feedback information output by the automatic driving system ADS when it is tested in a reference automatic driving scenario; wherein the feedback information includes vehicle control instructions and neural network behavior information;第二获取模块,用于基于所述反馈信息获取安全违规参数和覆盖参数;其中,所述安全违规参数用于指示在所述参考自动驾驶场景中所述ADS对应的车辆根据所述车辆控制指令行驶时安全违规的概率,所述覆盖参数用于指示在所述参考自动驾驶场景下进行测试时所述ADS对应的神经网络中激活的神经元和/或层级相关性;The second obtaining module is configured to obtain safety violation parameters and coverage parameters based on the feedback information; wherein the safety violation parameters are used to indicate that the vehicle corresponding to the ADS in the reference automatic driving scenario is based on the vehicle control instruction The probability of safety violation while driving, the coverage parameter is used to indicate the activated neurons and/or hierarchical correlations in the neural network corresponding to the ADS when testing in the reference autonomous driving scenario;更新模块,用于若所述安全违规参数或所述覆盖参数不满足预设指标,则基于所述反馈信息对所述参考自动驾驶场景进行更新得到更新后的参考自动驾驶场景,直至基于所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息获取的安全违规参数和覆盖参数满足所述预设指标。an update module, configured to update the reference automatic driving scene based on the feedback information to obtain an updated reference automatic driving scene if the safety violation parameter or the coverage parameter does not meet the preset index, until the reference automatic driving scene is updated based on the feedback information. The safety violation parameter and the coverage parameter obtained from the feedback information output by the ADS during the test in the updated reference automatic driving scenario satisfy the preset index.
- 如权利要求7所述的装置,其特征在于,所述装置还包括创建模块,用于:The apparatus of claim 7, wherein the apparatus further comprises a creation module for:获取所述ADS的初始自动驾驶场景;obtaining the initial automatic driving scenario of the ADS;分析所述初始自动驾驶场景中的道路类型以及所述ADS对应的车辆在所述初始自动驾驶场景中行驶时安全违规的概率,基于分析结果,创建所述参考自动驾驶场景;其中,所述参考自动驾驶场景包括典型自动驾驶场景、缺失自动驾驶场景和易违规自动驾驶场景中的任一种或多种。Analyze the road type in the initial automatic driving scene and the probability of safety violations when the vehicle corresponding to the ADS is driving in the initial automatic driving scene, and create the reference automatic driving scene based on the analysis result; wherein, the reference The autonomous driving scenarios include any one or more of typical autonomous driving scenarios, missing autonomous driving scenarios, and autonomous driving scenarios that are prone to violations.
- 如权利要求7或8所述的装置,其特征在于,所述安全违规参数包括所述ADS对应的车辆与其他车辆或行人的平行距离小于安全距离的概率,所述ADS对应的车辆与其他车辆或行人的垂直距离小于安全距离的概率,所述ADS对应的车辆违反交通灯指示的概率,所述ADS对应的车辆违反交通标志指示的概率,所述ADS对应的车辆违反交通警察 指挥的概率和所述ADS对应的车辆超速的概率中的任一种或多种;The device according to claim 7 or 8, wherein the safety violation parameter includes a probability that the parallel distance between the vehicle corresponding to the ADS and other vehicles or pedestrians is less than a safe distance, and the vehicle corresponding to the ADS is connected to other vehicles. Or the probability that the vertical distance of the pedestrian is less than the safety distance, the probability that the vehicle corresponding to the ADS violates the traffic light indication, the probability that the vehicle corresponding to the ADS violates the traffic sign indication, the probability that the vehicle corresponding to the ADS violates the traffic police command and any one or more of the probabilities of the vehicle speeding corresponding to the ADS;所述覆盖参数包括所述ADS对应的神经网络中激活的神经元的数量和所述ADS对应的神经网络的热力图中输入子集对预测结果的影响的权重中的任一种或多种。The coverage parameter includes any one or more of the number of activated neurons in the neural network corresponding to the ADS and the weight of the influence of the input subset on the prediction result in the heatmap of the neural network corresponding to the ADS.
- 如权利要求7~9任一所述的装置,其特征在于,所述更新模块,具体用于:The device according to any one of claims 7 to 9, wherein the update module is specifically configured to:获取强化学习智能体基于所述反馈信息从所述参考自动驾驶场景的动作空间中选择的动作;其中,所述动作空间中的动作为自动驾驶场景中的道路拓扑、道路降解、动态时间、动态天气、动态交通或景观信息的离散或连续更新;Acquiring actions selected by the reinforcement learning agent from the action space of the reference autonomous driving scene based on the feedback information; wherein the actions in the action space are road topology, road degradation, dynamic time, dynamic time in the autonomous driving scene Discrete or continuous updates of weather, dynamic traffic or landscape information;基于所述选择的动作对所述参考驾驶场景进行更新。The reference driving scenario is updated based on the selected action.
- 如权利要求10所述的装置,其特征在于,所述强化学习智能体为基于反馈信息确定奖励和当前所述ADS对应的车辆的状态,并基于所述奖励和当前所述ADS对应的车辆的状态从所述动作空间中选择动作的智能体;The device of claim 10, wherein the reinforcement learning agent determines the reward based on the feedback information and the state of the vehicle corresponding to the current ADS, and based on the reward and the current state of the vehicle corresponding to the ADS Agents whose states select actions from said action space;其中,所述奖励为基于安全违规的奖励和基于覆盖的奖励之和,所述基于安全违规的奖励为基于所述反馈信息中的车辆控制指令更新所述ADS对应的车辆的状态后所述ADS对应的车辆的安全违规概率接近预设指标的接近程度,所述基于覆盖的奖励为基于所述反馈信息中的神经网络行为信息确定的所述参考自动驾驶场景的覆盖率接近预设指标的接近程度,所述ADS对应的车辆的状态用于指示所述ADS对应的车辆在所述参考自动驾驶场景中的位置。Wherein, the reward is the sum of the reward based on safety violation and the reward based on coverage, and the reward based on safety violation is the ADS after the state of the vehicle corresponding to the ADS is updated based on the vehicle control instruction in the feedback information The degree to which the safety violation probability of the corresponding vehicle is close to the preset index, and the coverage-based reward is the closeness that the coverage rate of the reference automatic driving scene determined based on the neural network behavior information in the feedback information is close to the preset index The state of the vehicle corresponding to the ADS is used to indicate the position of the vehicle corresponding to the ADS in the reference automatic driving scene.
- 如权利要求11所述的装置,其特征在于,所述强化学习智能体的神经网络模型包括价值网络和策略网络,所述价值网络用于计算设定状态下的设定动作的价值,所述策略网络用于获取设定状态下的动作概率分布;所述装置还包括训练模块,用于:The device according to claim 11, wherein the neural network model of the reinforcement learning agent comprises a value network and a policy network, the value network is used to calculate the value of a set action in a set state, the The strategy network is used to obtain the action probability distribution under the set state; the device further includes a training module for:获取当前所述ADS对应的车辆的第一状态,以及所述强化学习智能体基于所述策略网络从所述动作空间中选择的第一动作;obtaining the first state of the vehicle corresponding to the current ADS, and the first action selected by the reinforcement learning agent from the action space based on the policy network;基于所述第一动作对所述参考自动驾驶场景进行更新,并获取所述ADS在所述更新后的参考自动驾驶场景下进行测试时输出的反馈信息;updating the reference automatic driving scene based on the first action, and acquiring feedback information output by the ADS when testing in the updated reference automatic driving scene;获取所述强化学习智能体基于所述反馈信息确定的所述第一动作的奖励和当前所述ADS对应的车辆的第二状态,以及基于所述策略网络从所述动作空间中选择的第二动作;Obtain the reward of the first action determined by the reinforcement learning agent based on the feedback information and the current second state of the vehicle corresponding to the ADS, and the second state selected from the action space based on the policy network. action;基于所述价值网络计算所述第一状态下的第一动作的第一价值和所述第二状态下的第二动作的第二价值;computing a first value for a first action in the first state and a second value for a second action in the second state based on the value network;基于所述第一价值、所述第二价值以及所述第一动作的奖励,确定时间差分误差;其中,所述时间差分误差为所述价值网络预测的价值和实际的价值之差;Based on the first value, the second value and the reward of the first action, a time difference error is determined; wherein, the time difference error is the difference between the value predicted by the value network and the actual value;获取所述价值网络和所述策略网络的梯度,并基于所述时间差分误差和所述价值网络的梯度更新所述价值网络的参数,基于所述时间差分误差和所述策略网络的梯度更新所述策略网络的参数。Obtain the gradients of the value network and the strategy network, and update the parameters of the value network based on the time difference error and the gradient of the value network, and update the parameters based on the time difference error and the gradient of the strategy network. Describe the parameters of the policy network.
- 一种自动驾驶场景的生成系统,其特征在于,所述系统包括存储器和处理器;所述存储器,用于存储计算机指令;所述处理器,用于调用所述存储器存储的计算机指令,以执行如权利要求1-6中任一项所述的自动驾驶场景的生成方法。A system for generating automatic driving scenarios, characterized in that the system includes a memory and a processor; the memory is used to store computer instructions; the processor is used to call the computer instructions stored in the memory to execute The method for generating an automatic driving scene according to any one of claims 1-6.
- A computer storage medium, comprising computer instructions which, when run on a computer, cause the computer to perform the method for generating an automated driving scenario according to any one of claims 1 to 6.
- A computer program product which, when run on a computer, causes the computer to perform the method for generating an automated driving scenario according to any one of claims 1 to 6.
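The combined reward described in claim 11 can be illustrated with a minimal Python sketch. The `closeness` helper, the variable names `violation_prob` and `coverage`, and the preset targets are assumptions introduced here for exposition only; they are not taken from the patent disclosure.

```python
# Hypothetical sketch of the combined reward described in claim 11.
# The closeness metric, variable names and preset targets are illustrative
# assumptions, not taken from the patent disclosure.

def closeness(value: float, target: float) -> float:
    """Score in (0, 1] that grows as `value` approaches `target`."""
    return 1.0 / (1.0 + abs(target - value))

def combined_reward(violation_prob: float,
                    coverage: float,
                    violation_target: float = 1.0,
                    coverage_target: float = 1.0) -> float:
    """Sum of a safety-violation-based reward and a coverage-based reward.

    violation_prob: safety violation probability of the ADS vehicle after its
        state is updated from the vehicle control instruction in the feedback.
    coverage: scenario coverage derived from the neural network behavior
        information in the feedback (for example, neuron coverage).
    """
    return closeness(violation_prob, violation_target) + \
           closeness(coverage, coverage_target)

# Example: feedback suggesting a 0.3 violation probability and 0.6 coverage
# yields a reward of about 1.30 (0.59 + 0.71).
print(combined_reward(violation_prob=0.3, coverage=0.6))
```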
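The training steps in claim 12 follow a one-step actor-critic pattern: sample an action from the policy network, score state-action pairs with the value network, form a temporal-difference error, and use it to update both networks. The following PyTorch sketch is a minimal illustration under assumed dimensions and a dummy environment interface; the network architectures, the discount factor `GAMMA`, and the example inputs are assumptions, not part of the claimed method.

```python
# Minimal one-step actor-critic sketch of the training module in claim 12.
# Network sizes, GAMMA and the dummy inputs below are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 5, 0.99

# Policy network: maps a state to action logits (an action probability distribution).
policy_net = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(),
                           nn.Linear(32, NUM_ACTIONS))
# Value network: maps a (state, one-hot action) pair to its value.
value_net = nn.Sequential(nn.Linear(STATE_DIM + NUM_ACTIONS, 32), nn.ReLU(),
                          nn.Linear(32, 1))
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def one_hot(action: int) -> torch.Tensor:
    return nn.functional.one_hot(torch.tensor(action), NUM_ACTIONS).float()

def select_action(state: torch.Tensor):
    """Sample an action from the policy network's probability distribution."""
    dist = torch.distributions.Categorical(logits=policy_net(state))
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def train_step(s1, a1, log_p1, reward, s2, a2):
    """One update of both networks from a single transition."""
    q1 = value_net(torch.cat([s1, one_hot(a1)]))            # first value
    q2 = value_net(torch.cat([s2, one_hot(a2)])).detach()   # second value
    td_error = reward + GAMMA * q2 - q1                     # predicted vs. actual value

    value_loss = td_error.pow(2).mean()                     # update the value network
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    policy_loss = -(td_error.detach() * log_p1).mean()      # update the policy network
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

# Dummy usage: in the described system the states and reward would come from
# testing the ADS in the updated reference scenario.
s1, s2 = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
a1, log_p1 = select_action(s1)
a2, _ = select_action(s2)
train_step(s1, a1, log_p1, reward=1.3, s2=s2, a2=a2)
```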
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202180000816.2A CN112997128B (en) | 2021-04-19 | 2021-04-19 | Method, device and system for generating automatic driving scene |
PCT/CN2021/088037 WO2022221979A1 (en) | 2021-04-19 | 2021-04-19 | Automated driving scenario generation method, apparatus, and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/088037 WO2022221979A1 (en) | 2021-04-19 | 2021-04-19 | Automated driving scenario generation method, apparatus, and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022221979A1 (en) | 2022-10-27 |
Family
ID=76337132
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/088037 WO2022221979A1 (en) | 2021-04-19 | 2021-04-19 | Automated driving scenario generation method, apparatus, and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112997128B (en) |
WO (1) | WO2022221979A1 (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113485300B (en) * | 2021-07-15 | 2022-10-04 | 南京航空航天大学 | Automatic driving vehicle collision test method based on reinforcement learning |
CN113326639B (en) * | 2021-08-03 | 2021-11-02 | 北京赛目科技有限公司 | Method and device for determining automatic driving test scene and electronic equipment |
CN113609784B (en) * | 2021-08-18 | 2024-03-22 | 清华大学 | Traffic limit scene generation method, system, equipment and storage medium |
CN113987751A (en) * | 2021-09-27 | 2022-01-28 | 蜂巢智能转向系统(江苏)有限公司保定分公司 | Scheme screening method and device, electronic equipment and storage medium |
CN114139342A (en) * | 2021-10-20 | 2022-03-04 | 武汉光庭信息技术股份有限公司 | Reinforced learning automatic driving test method and system |
CN113867367B (en) * | 2021-11-30 | 2022-02-22 | 腾讯科技(深圳)有限公司 | Processing method and device for test scene and computer program product |
EP4443257A1 (en) * | 2022-01-13 | 2024-10-09 | Huawei Technologies Co., Ltd. | Test method and apparatus |
WO2023137727A1 (en) * | 2022-01-21 | 2023-07-27 | 华为技术有限公司 | Method and apparatus for controlling intelligent driving function or system |
CN115392438B (en) * | 2022-09-14 | 2023-07-07 | 吉林建筑大学 | Deep reinforcement learning algorithm, equipment and storage medium based on multi-Agent environment |
CN115900725B (en) * | 2023-01-06 | 2023-06-16 | 阿里巴巴达摩院(杭州)科技有限公司 | Path planning device, electronic equipment, storage medium and related method |
CN117718973A (en) * | 2024-02-08 | 2024-03-19 | 国机传感科技有限公司 | Robot discrete control system and method based on axial acceleration |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109902899B (en) * | 2017-12-11 | 2020-03-10 | 百度在线网络技术(北京)有限公司 | Information generation method and device |
US20190361454A1 (en) * | 2018-05-24 | 2019-11-28 | GM Global Technology Operations LLC | Control systems, control methods and controllers for an autonomous vehicle |
US10955842B2 (en) * | 2018-05-24 | 2021-03-23 | GM Global Technology Operations LLC | Control systems, control methods and controllers for an autonomous vehicle |
CN108791302B (en) * | 2018-06-25 | 2020-05-19 | 大连大学 | Driver behavior modeling system |
US11036232B2 (en) * | 2018-09-14 | 2021-06-15 | Huawei Technologies Co., Ltd | Iterative generation of adversarial scenarios |
US11157006B2 (en) * | 2019-01-10 | 2021-10-26 | International Business Machines Corporation | Training and testing automated driving models |
US11899748B2 (en) * | 2019-09-06 | 2024-02-13 | Volkswagen Aktiengesellschaft | System, method, and apparatus for a neural network model for a vehicle |
CN111122175B (en) * | 2020-01-02 | 2022-02-25 | 阿波罗智能技术(北京)有限公司 | Method and device for testing automatic driving system |
CN111444604B (en) * | 2020-03-24 | 2023-09-15 | 上海汽车集团股份有限公司 | Virtual test scene detection method and device |
CN111625457A (en) * | 2020-05-27 | 2020-09-04 | 多伦科技股份有限公司 | Virtual automatic driving test optimization method based on improved DQN algorithm |
CN112256590B (en) * | 2020-11-12 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Virtual scene effectiveness judgment method and device and automatic driving system |
CN112784485B (en) * | 2021-01-21 | 2021-09-10 | 中国科学院软件研究所 | Automatic driving key scene generation method based on reinforcement learning |
CN113158560B (en) * | 2021-04-09 | 2024-02-09 | 中国科学院合肥物质科学研究院 | Intelligent driving vehicle autonomous capability test method based on scene opposition |
CN113609016B (en) * | 2021-08-05 | 2024-03-15 | 北京赛目科技股份有限公司 | Method, device, equipment and medium for constructing automatic driving test scene of vehicle |
2021
- 2021-04-19 CN CN202180000816.2A patent/CN112997128B/en active Active
- 2021-04-19 WO PCT/CN2021/088037 patent/WO2022221979A1/en active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110378460A (en) * | 2018-04-13 | 2019-10-25 | 北京智行者科技有限公司 | Decision-making technique |
CN111091739A (en) * | 2018-10-24 | 2020-05-01 | 百度在线网络技术(北京)有限公司 | Automatic driving scene generation method and device and storage medium |
CN110597086A (en) * | 2019-08-19 | 2019-12-20 | 深圳元戎启行科技有限公司 | Simulation scene generation method and unmanned system test method |
WO2021032715A1 (en) * | 2019-08-21 | 2021-02-25 | Dspace Digital Signal Processing And Control Engineering Gmbh | Computer implemented method and test unit for approximating a subset of test results |
CN111950726A (en) * | 2020-07-09 | 2020-11-17 | 华为技术有限公司 | Decision method based on multi-task learning, decision model training method and device |
CN112130472A (en) * | 2020-10-14 | 2020-12-25 | 广州小鹏自动驾驶科技有限公司 | Automatic driving simulation test system and method |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116824207A (en) * | 2023-04-27 | 2023-09-29 | 国科赛赋河北医药技术有限公司 | Multidimensional pathological image classification and early warning method based on reinforcement learning mode |
CN116824207B (en) * | 2023-04-27 | 2024-04-12 | 国科赛赋河北医药技术有限公司 | Multidimensional pathological image classification and early warning method based on reinforcement learning mode |
CN118574288A (en) * | 2024-07-31 | 2024-08-30 | 国网湖北省电力有限公司电力科学研究院 | Lighting parameter detection method and device in night construction lighting |
Also Published As
Publication number | Publication date |
---|---|
CN112997128B (en) | 2022-08-26 |
CN112997128A (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022221979A1 (en) | Automated driving scenario generation method, apparatus, and system | |
WO2022052406A1 (en) | Automatic driving training method, apparatus and device, and medium | |
CN112703459B (en) | Iterative generation of confrontational scenarios | |
US10832140B2 (en) | Method and device for providing information for evaluating driving habits of driver by detecting driving scenarios occurring during driving | |
Koren et al. | Efficient autonomy validation in simulation with adaptive stress testing | |
Makantasis et al. | Deep reinforcement‐learning‐based driving policy for autonomous road vehicles | |
Guo et al. | DRL-TP3: A learning and control framework for signalized intersections with mixed connected automated traffic | |
CN110686906B (en) | Automatic driving test method and device for vehicle | |
Rahmati et al. | Helping automated vehicles with left-turn maneuvers: A game theory-based decision framework for conflicting maneuvers at intersections | |
JP7520444B2 (en) | Vehicle-based data processing method, data processing device, computer device, and computer program | |
Crosato et al. | Human-centric autonomous driving in an av-pedestrian interactive environment using svo | |
Johnson et al. | Experimental Evaluation and Formal Analysis of High‐Level Tasks with Dynamic Obstacle Anticipation on a Full‐Sized Autonomous Vehicle | |
Makantasis et al. | A deep reinforcement learning driving policy for autonomous road vehicles | |
Song et al. | Identifying critical test scenarios for lane keeping assistance system using analytic hierarchy process and hierarchical clustering | |
Irshayyid et al. | A Review on Reinforcement Learning-based Highway Autonomous Vehicle Control | |
Youssef et al. | Deep reinforcement learning with external control: Self-driving car application | |
Lu et al. | DeepQTest: Testing Autonomous Driving Systems with Reinforcement Learning and Real-world Weather Data | |
Shoman et al. | Autonomous Vehicle–Pedestrian Interaction Modeling Platform: A Case Study in Four Major Cities | |
Mohammed et al. | Reinforcement learning and deep neural network for autonomous driving | |
Zhang et al. | Situation analysis and adaptive risk assessment for intersection safety systems in advanced assisted driving | |
Caballero et al. | Some statistical challenges in automated driving systems | |
Gunawardena et al. | Advancing Enhancing Autonomous Vehicle Safety: Integrating Functional Safety and Advanced Path Planning Algorithms for Precise Control | |
Menendez et al. | Detecting and Predicting Smart Car Collisions in Hybrid Environments from Sensor Data | |
Gandy | Automotive sensor fusion systems for traffic aware adaptive cruise control | |
Lv et al. | A Lane‐Changing Decision‐Making Model of Bus Entering considering Bus Priority Based on GRU Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21937232; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 21937232; Country of ref document: EP; Kind code of ref document: A1 |