CN117242449A - Performance test of mobile robot trajectory planner - Google Patents


Info

Publication number
CN117242449A
Authority
CN
China
Prior art keywords
agent
scene
rule
failure
autonomous
Legal status
Pending
Application number
CN202280030303.0A
Other languages
Chinese (zh)
Inventor
伊恩·怀特赛德
马尔科·费里
Current Assignee
Faber Artificial Intelligence Co ltd
Original Assignee
Faber Artificial Intelligence Co ltd
Application filed by Faber Artificial Intelligence Co ltd
Priority claimed from PCT/EP2022/060764 (WO2022223816A1)
Publication of CN117242449A


Landscapes

  • Traffic Control Systems (AREA)

Abstract

A computer-implemented method of evaluating the performance of a trajectory planner of a mobile robot in a real or simulated scene, the method comprising: receiving a scene ground truth of the scene, the scene ground truth having been generated using the trajectory planner to control an autonomous agent of the scene in response to at least one other agent of the scene, and the scene ground truth comprising an autonomous trace of the autonomous agent and an agent trace of the other agent; evaluating the autonomous trace in a test oracle in order to assign at least one time series of test results to the autonomous agent, the time series of test results pertaining to at least one performance evaluation rule; extracting one or more predetermined blame assessment parameters based on the agent trace; and applying one or more predetermined blame assessment rules to the blame assessment parameters in order to determine whether a failure on the at least one performance evaluation rule is acceptable.

Description

Performance test of mobile robot trajectory planner
Technical Field
The present disclosure relates to methods for evaluating the performance of a trajectory planner in a real or simulated scene, and to computer programs and systems for implementing such methods. Such planners are capable of autonomously planning trajectories for fully or semi-autonomous vehicles or other forms of mobile robot. Example applications include ADS (Autonomous Driving System) and ADAS (Advanced Driver Assistance System) performance testing.
Background
The field of autonomous vehicles has evolved significantly and rapidly. An autonomous vehicle (AV) is a vehicle equipped with sensors and control systems that enable it to operate without a human controlling its behavior. An autonomous vehicle is equipped with sensors that enable it to perceive its physical environment, such sensors including, for example, cameras, radar, and lidar. Autonomous vehicles are equipped with suitably programmed computers capable of processing the data received from those sensors and making safe and predictable decisions based on the context that the sensors have perceived. An autonomous vehicle may be fully autonomous (in that it is designed to operate without human supervision or intervention, at least under certain circumstances) or semi-autonomous. Semi-autonomous systems require varying levels of human supervision and intervention; such systems include advanced driver assistance systems and Level 3 autonomous driving systems. There are different facets to testing the behavior of the sensors and control systems aboard a particular autonomous vehicle, or of a type of autonomous vehicle.
A "class 5" vehicle is a vehicle that can run fully automatically in any environment, as it is always guaranteed that a certain minimum level of safety is met. Such vehicles do not require manual control at all (steering wheel, pedals, etc.).
In contrast, Level 3 and Level 4 vehicles can operate fully autonomously, but only within certain defined circumstances (e.g., within geofenced areas). A Level 3 vehicle must be equipped to handle autonomously any situation that requires an immediate response (such as emergency braking); however, a change in circumstances may trigger a "transition demand", requiring the driver to take control of the vehicle within some limited timeframe. A Level 4 vehicle has similar limitations; however, in the event that the driver does not respond within the required timeframe, a Level 4 vehicle must also be capable of autonomously implementing a "minimum risk maneuver" (MRM), i.e., some appropriate action to bring the vehicle into a safe condition (e.g., slowing down and parking the vehicle). A Level 2 vehicle requires the driver to be ready to intervene at any time, and it is the responsibility of the driver to intervene if the autonomous driving system fails to respond properly at any time. With Level 2 automation, it is the responsibility of the driver to determine when their intervention is needed; for Level 3 and Level 4, this responsibility shifts to the vehicle's autonomous driving system, which must alert the driver when intervention is required.
As the level of autonomy increases and more responsibility shifts from human to machine, safety is an increasing challenge. In autonomous driving, the importance of guaranteed safety has been recognized. Guaranteed safety does not necessarily mean zero accidents, but rather means guaranteeing that some minimum level of safety is met within defined circumstances. It is generally assumed that this minimum level of safety must significantly exceed that of a human driver for autonomous driving to be viable.
According to Shalev-Shwartz et al., "On a Formal Model of Safe and Scalable Self-driving Cars" (2017), arXiv:1708.06374 (the RSS paper), which is incorporated herein by reference in its entirety, human driving is estimated to cause serious accidents at a rate of the order of 10⁻⁶ per hour. On the assumption that autonomous driving systems will need to reduce this by at least three orders of magnitude, the RSS paper concludes that a minimum safety level of the order of 10⁻⁹ serious accidents per hour needs to be guaranteed, noting that a purely data-driven approach would therefore require vast quantities of driving data to be collected every time a change is made to the software or hardware of the AV system.
The RSS paper provides a model-based approach to guaranteed safety. A rule-based Responsibility-Sensitive Safety (RSS) model is constructed by formalizing a small number of "common sense" driving rules:
"1, do not hit a person from behind.
2. No reckless insertion is required.
3. The right of way is given, not solicited.
4. Care was taken in areas of limited visibility.
5. You must do this if you can avoid one accident without causing another. "
The RSS model is provably safe, in the sense that no accidents will occur if all agents follow the rules of the RSS model at all times. The aim is to reduce, by several orders of magnitude, the amount of driving data that needs to be collected in order to demonstrate the required safety level.
A safety model (such as RSS) may be used as a basis for evaluating the quality of trajectories that are realized by an autonomous agent in a real or simulated scene under the control of an autonomous driving system (stack). The stack is tested by exposing it to different scenes and evaluating whether the resulting autonomous trajectories comply with the rules of the safety model (rule-based testing). A rule-based testing approach can also be applied to other facets of performance, such as comfort or progress towards a defined goal.
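For illustration only, the following sketch shows how a single safety-model rule of this kind might be checked over a scene run. The minimum-gap formula follows the form of the RSS safe longitudinal distance from the RSS paper; the parameter values, trace format and function names are assumptions made for the sketch and are not taken from the disclosure.

```python
# Illustrative sketch (not from the patent): evaluating an RSS-style
# "safe longitudinal distance" rule over a scene run.
def rss_min_gap(v_rear, v_front, rho=0.7, a_accel=3.0, a_brake_min=4.0, a_brake_max=8.0):
    """Minimum safe gap (m) behind a lead vehicle, following the form of the
    RSS paper's formula. rho: response time (s); accelerations in m/s^2
    (all numerical values here are assumptions)."""
    v_resp = v_rear + rho * a_accel
    gap = (v_rear * rho + 0.5 * a_accel * rho ** 2
           + v_resp ** 2 / (2 * a_brake_min)
           - v_front ** 2 / (2 * a_brake_max))
    return max(gap, 0.0)

def evaluate_safe_distance(ego_trace, agent_trace):
    """Return a per-timestep pass/fail series for the safe-distance rule.
    Each trace element is assumed to be a dict with longitudinal position "s"
    (m) and speed "v" (m/s), with the other agent ahead of the ego agent."""
    results = []
    for ego, agent in zip(ego_trace, agent_trace):   # one state per timestep
        gap = agent["s"] - ego["s"]                  # longitudinal separation (m)
        results.append(gap >= rss_min_gap(ego["v"], agent["v"]))
    return results
```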
Disclosure of Invention
Identifying "interesting" events is a significant challenge when testing complex robotic systems, such as autonomous vehicles. Today, such systems are capable of extremely high standards with few failures. For rule-based security models, it is relatively simple to isolate failure instances from the security model. However, not every failure instance must provide information. For example, if a stack is tested in a simulation, the failure of the stack in an unrealistic or highly unlikely scenario is typically less informative than the failure in a more realistic or more likely scenario.
According to a first aspect herein, there is provided a computer-implemented method of evaluating the performance of a trajectory planner of a mobile robot in a real or simulated scene, the method comprising:
receiving a scene ground truth of the scene, the scene ground truth having been generated using the trajectory planner to control an autonomous agent of the scene in response to at least one other agent of the scene, and the scene ground truth comprising an autonomous trace of the autonomous agent and an agent trace of the other agent;
evaluating the autonomous trace in a test oracle in order to assign at least one time series of test results to the autonomous agent, the time series of test results pertaining to at least one performance evaluation rule;
extracting one or more predetermined blame assessment parameters based on the agent trace; and
applying one or more predetermined blame assessment rules to the blame assessment parameters in order to determine whether a failure on the at least one performance evaluation rule is acceptable.
In this way, rule-based testing is extended to incorporate the concept of blame. The blame assessment rules define what is referred to herein as an "acceptable failure" model for the scene in question. For example, human driving ability may provide a benchmark for attributing blame: if the circumstances are such that no reasonable human driver could have prevented the failure event, the other agent is identified as the cause of the failure event.
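A minimal sketch of how the method of the first aspect might be realised is given below. The object shapes, function names and the way the blame assessment rules are applied are illustrative assumptions rather than the disclosed implementation.

```python
# Illustrative sketch: oracle evaluation followed by blame assessment.
# All names and the structure of "blame_rules" are assumptions.
def assess_run(ground_truth, oracle, extract_params, blame_rules):
    """ground_truth is assumed to expose ego_trace, agent_trace and context;
    oracle.evaluate returns {rule_name: [pass/fail per timestep]}."""
    # Step 1: per-rule time series of pass/fail results for the autonomous trace.
    results = oracle.evaluate(ground_truth.ego_trace, ground_truth.context)

    # Step 2: predetermined blame assessment parameters from the agent trace.
    params = extract_params(ground_truth.ego_trace, ground_truth.agent_trace)

    # Step 3: apply the blame assessment rules to decide whether each failure
    # is acceptable (e.g. excused because the other agent caused it).
    verdicts = {}
    for rule_name, series in results.items():
        failed = any(not ok for ok in series)
        acceptable = failed and all(rule(params) for rule in blame_rules)
        verdicts[rule_name] = ("acceptable failure" if acceptable
                               else "unacceptable failure" if failed else "pass")
    return verdicts
```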
In an embodiment, the method may comprise the step of detecting an action of the other agent of a predetermined type, wherein the blame assessment parameters are extracted based on the detected action.
The blame assessment parameters may include a distance between the autonomous agent and the other agent when the action is detected.
The blame assessment parameters may include at least one motion parameter of the other agent when the action is detected.
The one or more predetermined blame assessment rules may be applied to identify one of the autonomous agent and the other agent as having caused a failure event in the at least one time series of test results.
This action may occur prior to the failure event.
The blame assessment parameters may include a time interval between the detected action and the failure event.
In many contexts, rule-based evaluation of a complex scene ground truth requires significant computational resources. Some embodiments limit the extent to which additional processing is required by initiating the blame assessment process in response to a failure event, and by limiting the evaluation of the other agent's trace based on the timing of the failure event (e.g., to within some predetermined time window before and/or after the failure event).
The predetermined blame assessment rules may be applied to only a portion of the agent trace, within a time period defined by the timing of the failure event.
The one or more predetermined blame assessment parameters may be extracted, in response to a failure event in the at least one time series of test results, based on the agent trace and the timing of the failure event.
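The following sketch illustrates this event-triggered, time-windowed extraction of blame assessment parameters for a cut-in style action. The window length, sampling interval, trace fields and the cut-in detection heuristic are all assumptions made for illustration.

```python
# Illustrative sketch: blame assessment parameters extracted only in response
# to a failure event, and only within a window ending at that event.
def blame_parameters_near_failure(results, ego_trace, agent_trace,
                                  dt=0.1, window_s=5.0):
    """results: pass/fail per timestep; traces: dicts with "lane", "s", "v"."""
    fail_idx = next((t for t, ok in enumerate(results) if not ok), None)
    if fail_idx is None:
        return None                          # no failure event: nothing to assess
    lo = max(1, fail_idx - int(window_s / dt))
    window = range(lo, fail_idx + 1)         # only this slice of the trace is used

    # Detect a cut-in: the other agent enters the ego agent's lane.
    cut_in_idx = next((t for t in window
                       if agent_trace[t]["lane"] == ego_trace[t]["lane"]
                       and agent_trace[t - 1]["lane"] != ego_trace[t - 1]["lane"]),
                      None)
    if cut_in_idx is None:
        return {"action_detected": False}
    return {
        "action_detected": True,
        "gap_at_action": agent_trace[cut_in_idx]["s"] - ego_trace[cut_in_idx]["s"],
        "agent_speed_at_action": agent_trace[cut_in_idx]["v"],
        "time_to_failure": (fail_idx - cut_in_idx) * dt,
    }
```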
Alternatively, the predetermined blame assessment rules may be applied irrespective of whether any failure event occurs in the at least one time series of test results.
The scene may be assigned a category label denoting that:
an acceptable failure event occurred in the at least one time series of test results,
an unacceptable failure event occurred in the at least one time series of test results,
no failure event occurred in the at least one time series of test results and such a failure event would have been unacceptable, or
no failure event occurred in the at least one time series of test results and such a failure event would have been acceptable.
The category label may be stored in association with a set of scene parameters that parameterize the scene.
The method may include generating display data for a visualization summarizing the scene parameters and category labels.
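A minimal sketch of the four categories and their association with a scene parameterization might look as follows; the enum names and the way the results are keyed are assumptions.

```python
# Illustrative sketch of the four scene categories and their storage against
# the scene parameterization (all names are assumptions).
from enum import Enum

class RunCategory(Enum):
    ACCEPTABLE_FAILURE = "failure occurred, excused by the blame assessment rules"
    UNACCEPTABLE_FAILURE = "failure occurred, not excused"
    ROBUST_PASS = "no failure, and a failure would not have been excused"
    EXCUSED_PASS = "no failure, but a failure would have been excused"

def categorize(failed: bool, excusable: bool) -> RunCategory:
    if failed:
        return RunCategory.ACCEPTABLE_FAILURE if excusable else RunCategory.UNACCEPTABLE_FAILURE
    return RunCategory.EXCUSED_PASS if excusable else RunCategory.ROBUST_PASS

# Category label stored in association with the point in the scene's
# parameter space that produced the run, e.g. for later visualization.
results_db = {}
results_db[("cut_in_scene", (12.5, 30.0))] = categorize(failed=True, excusable=False)
```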
The method may comprise the step of generating display data for displaying a rule timeline with a visual indication of whether a failure on the at least one performance evaluation rule is acceptable, the rule timeline being a visual representation of the time series.
The failure result and the agent that caused the failure event may be visually identified on the rule timeline.
The method may include the step of presenting a graphical user interface including a rule timeline with visual indications.
The method may comprise the step of storing the time series of results in a test database, together with an indication of whether a failure on the at least one performance evaluation rule is acceptable.
The time series of results may be stored in the test database together with an indication of the agent that caused the failure event.
According to a second aspect herein, there is provided a computer-implemented method of evaluating the performance of a trajectory planner of a mobile robot in a real or simulated scene, the method comprising:
receiving a scene ground truth of the scene, the scene ground truth having been generated using the trajectory planner to control an autonomous agent of the scene in response to at least one other agent of the scene, and the scene ground truth comprising an autonomous trace of the autonomous agent and an agent trace of the other agent;
evaluating the autonomous trace in a test oracle in order to assign at least one time series of test results to the autonomous agent, the time series of test results pertaining to at least one performance evaluation rule;
extracting, in response to a failure event in the at least one time series of test results, one or more predetermined blame assessment parameters based on the agent trace and the timing of the failure event; and
applying one or more predetermined blame assessment rules to the blame assessment parameters in order to identify one of the autonomous agent and the other agent as having caused the failure event.
In an embodiment, the method may comprise the step of detecting an action of the other agent of a predetermined type, the action occurring before the failure event, wherein the blame assessment parameters are extracted based on the detected action.
The blame assessment parameters may include a time interval between the detected action and the failure event.
The blame assessment parameters may include a distance between the autonomous agent and the other agent when the action is detected.
The blame assessment parameters may include at least one motion parameter of the other agent when the action is detected.
The method may include the step of generating display data for displaying a rule timeline, the rule timeline being a visual representation of the time series, wherein the failure result and the agent that caused the failure event are visually identified.
The method may comprise the step of presenting a graphical user interface comprising the rule timeline.
The method may include the step of storing the results of the time series in a test database together with an indication of the agent responsible for the failure event.
The method may include applying the predetermined blame assessment rules to only a portion of the agent trace, within a time period defined by the timing of the failure event.
Another aspect provides a computer system comprising one or more computers configured to implement the method of the first or second aspect or any embodiment thereof, and executable program instructions for programming a computer system to implement the method of the first or second aspect or any embodiment thereof. One or more computer programs may be embodied on transitory or non-transitory media and configured, when executed by one or more computers, to implement the method.
Drawings
For a better understanding of the present disclosure, and to show how embodiments of the disclosure may be carried into effect, reference is made, by way of example only, to the following drawings, in which:
FIG. 1A shows a schematic functional block diagram of an autonomous vehicle stack;
FIG. 1B shows a schematic overview of an autonomous vehicle test case;
FIG. 1C shows a schematic block diagram of a scene extraction pipeline;
FIG. 2 shows a schematic block diagram of a test pipeline;
FIG. 2A shows further details of a possible implementation of a test pipeline;
FIG. 3A illustrates an example of a rule tree evaluated within a test oracle;
FIG. 3B illustrates an output example of a node of a rule tree;
FIG. 4A illustrates an example of a rule tree to be evaluated within a test oracle;
FIG. 4B illustrates a second example of a rule tree evaluated from a set of scene ground truth data;
FIG. 4C shows how rules may be selectively applied within a test oracle;
FIG. 5 shows a schematic block diagram of a visualization component for presenting a graphical user interface;
FIGS. 5A, 5B, and 5C illustrate different views available in a graphical user interface;
FIG. 6A shows a first example of a cut-in scenario;
FIG. 6B illustrates an example oracle output for the first scenario example;
FIG. 6C shows a second example of a cut-in scenario;
FIG. 6D illustrates an example oracle output for the second scenario example;
FIG. 7 shows a block diagram of an extensible test oracle capable of receiving and applying an acceptable failure model;
FIG. 8 shows a flow chart of a method of blame assessment; and
FIG. 9 illustrates an extensible graphical user interface presented in a computer system;
Fig. 10 and 11 illustrate respective scene space visualizations having points corresponding to scene runs categorized according to different failure categories.
Detailed Description
The described embodiments provide a test pipeline to facilitate rule-based testing of a mobile robot stack in real or simulated scenes. Agent (actor) behavior in a real or simulated scene is evaluated by a test oracle based on defined performance evaluation rules. Such rules may evaluate different facets of safety. For example, a safety rule set may be defined to assess the performance of the stack against a particular safety standard, regulation, or safety model (such as RSS), or bespoke rule sets may be defined for testing any aspect of performance. The applicability of the test pipeline is not limited to safety, and it can be used to test any aspect of performance, such as comfort or progress towards some defined goal. A rule editor allows performance evaluation rules to be defined or modified and passed to the test oracle.
The "full" stack generally involves processing (perceived) from low-level sensor data and interpreting everything fed to the main high-level functions such as prediction and planning, as well as control logic that generates appropriate control signals to implement (e.g., control braking, steering, acceleration, etc.) planning level decisions. For an autonomous vehicle, the 3-level stack includes some logic to implement the transition requirement, and the 4-level stack also includes some logic to implement the minimum risk policy. The stack may also implement auxiliary control functions such as signals, headlamps, windshield wipers, etc.
The term "stack" may also refer to individual subsystems (sub-stacks) of the entire stack, such as a sense, predict, program, or control stack, which may be tested alone or in any desired combination. A stack may refer entirely to software, i.e., one or more computer programs that may be executed on one or more general-purpose computer processors.
A scene, whether real or simulated, requires an autonomous agent to maneuver in a real or modeled physical context. The autonomous agent is a real or simulated mobile robot that moves under the control of the stack under test. The physical context includes static and/or dynamic elements to which the stack under test is required to respond effectively. For example, the mobile robot may be a fully or semi-autonomous vehicle (the autonomous vehicle) under the control of the stack. The physical context may comprise a static road layout and a given set of environmental conditions (e.g., weather, time of day, lighting conditions, humidity, pollution/particulate levels, etc.) that can be maintained or varied as the scene progresses. An interactive scene additionally includes one or more other agents ("external" agents, e.g., other vehicles, pedestrians, cyclists, animals, etc.).
The following examples consider the application of an autonomous vehicle test. However, these principles are equally applicable to other forms of mobile robots.
Scenes may be represented or defined at different levels of abstraction. More abstract scenes accommodate a greater degree of variation. For example, a "cut-in scenario" or a "lane change scenario" are examples of highly abstract scenarios, characterized by a maneuver or behavior of interest, that accommodate many variations (e.g., different agent starting positions and speeds, road layouts, environmental conditions, etc.). A "scene run" refers to a concrete occurrence of an agent maneuvering in a physical context, optionally in the presence of one or more other agents. For example, multiple runs of a cut-in or lane change scenario could be performed (in the real world and/or in a simulator) with different agent parameters (e.g., starting position, speed, etc.), different road layouts, different environmental conditions, and/or different stack configurations, etc. The terms "run" and "instance" are used interchangeably in this context.
In the following examples, the performance of the stack is evaluated, at least in part, by evaluating the behavior of the autonomous agent in a test oracle against a given set of performance evaluation rules, over the course of one or more runs. The rules are applied to the "ground truth" of the (or each) scene run, which in general simply means an appropriate representation of the scene run (including the behavior of the autonomous agent) that is taken as authoritative for the purposes of testing. Ground truth is inherent to simulation; a simulator computes a sequence of scene states, which is, by definition, a perfect, authoritative representation of the simulated scene run. In a real-world scene run, a "perfect" representation of the scene run does not exist in the same sense; nevertheless, suitably informative ground truth can be obtained in numerous ways, e.g., based on manual annotation of on-board sensor data, automated/semi-automated annotation of such data (e.g., using offline/non-real-time processing), and/or using external information sources (such as external sensors, maps, etc.), etc.
The scene ground truth typically includes a "trace" of the autonomous agent and of any other (salient) agents, as applicable. A trace is a history of an agent's location and motion over the course of a scene. There are many ways a trace can be represented. Trace data will typically include spatial and motion data of an agent within the environment. The term is used in relation to both real scenes (with real-world traces) and simulated scenes (with simulated traces). A trace typically records the actual trajectory realized by the agent in the scene. With regard to terminology, a "trace" and a "trajectory" may contain the same or similar types of information (such as a series of spatial and motion states over time). The term trajectory is generally favored in the context of planning (and may refer to future/predicted trajectories), whereas the term trace is generally favored in relation to past behavior in the context of testing/evaluation.
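By way of illustration, a trace might be represented as a time-ordered list of per-agent states, e.g. as follows (the field names and units are assumptions, not a format prescribed by the disclosure):

```python
# Illustrative sketch of one possible trace representation.
from dataclasses import dataclass
from typing import List

@dataclass
class AgentState:
    t: float        # timestamp (s)
    x: float        # position (m), world frame
    y: float
    heading: float  # yaw (rad)
    speed: float    # m/s
    accel: float    # m/s^2

# A trace: the full spatial/motion history of one agent over a scene run.
Trace = List[AgentState]
```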
In a simulation context, a "scene description" is provided to a simulator as input. For example, the scene description may be encoded using a scenario description language (SDL), or in any other form that can be consumed by the simulator. A scene description is typically a more abstract representation of a scene that can give rise to multiple simulated runs. Depending on the implementation, a scene description may have one or more configurable parameters that can be varied to increase the extent of possible variation. The degree of abstraction and parameterization is a design choice. For example, a scene description may encode a fixed layout with parameterized environmental conditions (such as weather, lighting, etc.). Further abstraction is possible, however, e.g., with configurable road parameters (such as road curvature, lane configuration, etc.). The input to the simulator comprises the scene description together with a chosen set of parameter values (as applicable). The latter may be referred to as a parameterization of the scene. The configurable parameters define a parameter space (also referred to as the scene space), and the parameterization corresponds to a point in the parameter space. In this context, a "scene instance" may refer to an instantiation of a scene in a simulator based on the scene description and (if applicable) a chosen parameterization.
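For illustration, a scene description with configurable parameters, together with one parameterization (i.e., one point in the scene space), might be sketched as below. The encoding shown is an assumption and is not the SDL referred to above.

```python
# Illustrative sketch: an abstract scene description with configurable
# parameters, and a concrete parameterization of that description.
scene_description = {
    "road_layout": "two_lane_motorway",        # fixed ("hard-coded") element
    "parameters": {                            # configurable elements and ranges
        "agent_initial_speed": (15.0, 35.0),   # m/s
        "agent_initial_gap": (5.0, 50.0),      # m ahead of the ego vehicle
        "road_curvature": (0.0, 0.02),         # 1/m
    },
}

# One parameterization = one choice of value per configurable parameter,
# i.e. a single point in the parameter (scene) space.
parameterization = {"agent_initial_speed": 25.0,
                    "agent_initial_gap": 12.0,
                    "road_curvature": 0.005}

# scene_description + parameterization together define a scene instance
# that can be run (possibly multiple times) in the simulator.
```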
For brevity, the term scene may also be used to refer to a scene run, as well as a scene in a more abstract sense. The meaning of the term scene will be clearly visible from the context in which it is used.
Trajectory planning is an important function in the current context, and the terms "trajectory planner," "trajectory planning system," and "trajectory planning stack" are used interchangeably herein to refer to one or more components that may plan a future trajectory for a mobile robot. The trajectory planning decisions ultimately determine the actual trajectory implemented by the autonomous agent (although in some test scenarios this may be affected by other factors such as the implementation of these decisions in the control stack, and the real or modeled dynamic response of the autonomous agent to the generated control signals).
The trajectory planner may be tested in isolation, or in combination with one or more other systems (e.g., perception, prediction, and/or control). Within a full stack, planning generally refers to higher-level autonomous decision-making capability (such as trajectory planning), whilst control generally refers to the lower-level generation of control signals for carrying out those autonomous decisions. However, in the context of performance testing, the term control is also used in the broader sense. For the avoidance of doubt, when a trajectory planner is said to control an autonomous agent in simulation, that does not necessarily imply that a control system (in the narrower sense) is tested in combination with the trajectory planner.
Example AV stack:
In order to provide relevant context to the described embodiments, further details of an example form of the AV stack will now be described.
FIG. 1A shows a high-level schematic block diagram of an AV runtime stack 100. The runtime stack 100 is shown to comprise a perception (sub-)system 102, a prediction (sub-)system 104, a planning (sub-)system (planner) 106, and a control (sub-)system (controller) 108. As previously mentioned, the term (sub-)stack may also be used to describe the aforementioned components 102-108.
In a real world scenario, the perception system 102 receives sensor outputs from the AV's in-vehicle sensor system 110 and uses these sensor outputs to detect external agents and measure their physical states, such as their position, velocity, acceleration, etc. The in-vehicle sensor system 110 may take different forms, but typically includes various sensors such as image capture devices (cameras/optical sensors), lidar and/or radar units, satellite positioning sensors (GPS, etc.), motion/inertial sensors (accelerometers, gyroscopes, etc.), and the like. The in-vehicle sensor system 110 thus provides rich sensor data from which detailed information about the surrounding environment can be extracted, as well as the state of the AV and any external participants (vehicles, pedestrians, cyclists, etc.) within the environment. The sensor output typically includes sensor data for a plurality of sensor modalities, such as stereo images from one or more stereo optical sensors, lidar, radar, and the like. The sensor data of the plurality of sensor modalities may be combined using filters, fusion components, or the like.
The perception system 102 typically includes multiple perception components that cooperate to interpret the sensor outputs, thereby providing perception outputs to the prediction system 104.
In a simulation context, it may or may not be necessary to model the on-board sensor system 110, depending on the nature of the testing, and in particular on where the stack 100 is "sliced" for the purpose of testing (see below). With higher-level slicing, simulated sensor data is not required, and therefore complex sensor modelling is not required.
The prediction system 104 uses the perceived output from the sensing system 102 to predict future behavior of external participants (agents), such as other vehicles in the vicinity of the AV.
The predictions calculated by the prediction system 104 are provided to the planner 106, which planner 106 uses these predictions to make the autopilot decisions to be performed by the AV in a given driving scenario. The input received by the planner 106 is typically indicative of a drivable zone and will also capture the predictable movement of any external agents (obstacles from the AV perspective) within the drivable zone. The drivable region may be determined using the perceived output from the sensing system 102 in combination with map information such as HD (high definition) maps.
A core function of the planner 106 is to plan trajectories for the AV (autonomous trajectories), taking into account the predicted agent motion. This may be referred to as trajectory planning. A trajectory is planned in order to carry out a desired goal within a scene. The goal could, for example, be to enter a roundabout and leave it at a desired exit; to overtake a vehicle in front; or to stay in the current lane at a target speed (lane following). The goal may, for example, be determined by an autonomous route planner (not shown).
The controller 108 executes the decisions taken by the planner 106 by providing suitable control signals to an on-board actor system 112 of the AV. In particular, the planner 106 plans trajectories for the AV, and the controller 108 generates control signals to implement the planned trajectories. Typically, the planner 106 will plan into the future, such that a planned trajectory may only be partially implemented at the control level before a new trajectory is planned by the planner 106. The actor system 112 includes "primary" vehicle systems (such as braking, acceleration and steering systems), as well as secondary systems (e.g., signalling, windscreen wipers, headlights, etc.).
Note that there may be a distinction between the planned trajectory at a given moment and the actual trajectory followed by the autonomous agent. The planning system typically operates on a series of planning steps, such that the planning trajectory is updated at each planning step to account for any changes in the scene since the previous planning step (or more precisely, any changes that deviate from the predicted changes). Planning system 106 may infer the future such that the planned trajectory at each planning step extends beyond the next planning step. Thus, any individually planned trajectory may not be fully implemented (if planning system 106 is tested individually in simulation, then the autonomous agent may simply proceed precisely along the planned trajectory to the next planning step, however, as previously described, in other real and simulated contexts, the planned trajectory may not follow precisely to the next planning step because the behavior of the autonomous agent may be affected by other factors such as the operation of control system 108 and the real or modeled dynamics of the autonomous vehicle). In many test scenarios, the actual trajectory of the autonomous agent is ultimately important; in particular whether the actual trajectory is safe or not, and other factors such as comfort and progress. However, the rule-based test methods herein may also be applied to planned trajectories (even if those planned trajectories are not fully or precisely implemented by autonomous agents). For example, even if the actual trajectory of an agent is considered safe according to a given set of security rules, it is possible that the instantaneous planned trajectory is unsafe; the fact that the planner 106 is considering an unsafe course of action may be exposed even though it does not result in unsafe agent behavior in this scenario. In addition to the actual agent behavior in the simulation, the instantaneous planning trajectory constitutes a form of internal state that can be effectively evaluated. Other forms of internal stacks may perform similar evaluations.
The example of FIG. 1A considers a relatively "modular" architecture, with separable perception, prediction, planning and control systems 102-108. The sub-stacks may themselves be modular, e.g., with separable planning modules within the planning system 106. For example, the planning system 106 may comprise multiple trajectory planning modules that can be applied in different physical contexts (e.g., simple lane driving versus complex junctions or roundabouts). This is relevant to simulation testing for the reasons noted above, as it allows components (such as the planning system 106 or individual planning modules thereof) to be tested individually or in different combinations. For the avoidance of doubt, with modular stack architectures, the term stack can refer not only to the full stack but also to any individual sub-system or module thereof.
The extent to which the various stack functions are integrated or separable can vary significantly between different stack implementations, and in some stacks certain aspects may be so tightly coupled as to be indistinguishable. For example, in some stacks, planning and control may be integrated (e.g., such stacks could plan directly in terms of control signals), whereas other stacks (such as that depicted in FIG. 1A) may be architected in a way that draws a clear distinction between the two (e.g., with planning in terms of trajectories, and with separate control optimizations to determine how best to execute a planned trajectory at the control signal level). Similarly, in some stacks, prediction and planning may be more tightly coupled. At the extreme, in so-called "end-to-end" driving, perception, prediction, planning and control may be essentially inseparable. Unless otherwise indicated, the perception, prediction, planning and control terminology used herein does not imply any particular coupling or modularity of those aspects.
It should be understood that the term "stack" encompasses software, but can also encompass hardware. In simulation, software of the stack may be tested on a "generic" off-board computer system before it is eventually uploaded to an on-board computer system of a physical vehicle. However, in "hardware-in-the-loop" testing, the testing may extend to the underlying hardware of the vehicle itself. For example, the stack software may be run on an on-board computer system (or a replica thereof) that is coupled to the simulator for the purpose of testing. In this context, the stack under test extends to the underlying computer hardware of the vehicle. As another example, certain functions of the stack 100 (e.g., perception functions) may be implemented in dedicated hardware. In a simulation context, hardware-in-the-loop testing could involve feeding synthetic sensor data to dedicated hardware perception components.
Example test case:
FIG. 1B shows a highly schematic overview of an autonomous vehicle test case. An ADS/ADAS stack 100, e.g., of the kind depicted in FIG. 1A, is subjected to repeated testing and evaluation in simulation, by running multiple scene instances in a simulator 202 and evaluating the performance of the stack 100 (and/or individual sub-stacks thereof) in a test oracle 252. The output of the test oracle 252 provides useful information to an expert 122 (team or individual), allowing them to identify issues in the stack 100 and modify the stack 100 to mitigate those issues (S124). The results also assist the expert 122 in selecting further scenes for testing (S126), and the process continues, repeatedly modifying, testing and evaluating the performance of the stack 100 in simulation. The improved stack 100 is eventually incorporated (S125) in a real-world AV 101, equipped with a sensor system 110 and an actor system 112. The improved stack 100 typically includes program instructions (software) executed in one or more computer processors of an on-board computer system of the vehicle 101 (not shown). The software of the improved stack is uploaded to the AV 101 at step S125. Step S125 may also involve modifications to the underlying vehicle hardware. On board the AV 101, the improved stack 100 receives sensor data from the sensor system 110 and outputs control signals to the actor system 112. Real-world testing (S128) can be used in combination with simulation-based testing. For example, having reached an acceptable level of performance through the process of simulation testing and stack refinement, suitable real-world scenes may be selected (S130), and the performance of the AV 101 in those real scenes may be captured and similarly evaluated in the test oracle 252.
The scene for simulation may be obtained in various ways, including manual coding. The system is also capable of extracting scenes for simulation from real world operation, allowing the real world situation and its changes to be recreated in simulator 202.
FIG. 1C shows a high-level schematic block diagram of a scene extraction pipeline. Data 140 of a real-world run is passed to a "ground-truthing" pipeline 142 for the purpose of generating scene ground truth. The run data 140 could comprise, for example, sensor data and/or perception outputs captured/generated on board one or more vehicles (which could be autonomous, human-driven, or a combination thereof), and/or data captured from other sources such as external sensors (CCTV, etc.). The run data is processed within the ground-truthing pipeline 142 in order to generate appropriate ground truth 144 (trace(s) and contextual data) for the real-world run. As discussed, the ground-truthing process could be based on manual annotation of the "raw" run data 140, or the process could be entirely automated (e.g., using offline perception methods), or a combination of manual and automated ground-truthing could be used. For example, 3D bounding boxes may be placed around vehicles and/or other agents captured in the run data 140, in order to determine the spatial and motion states of their traces. A scene extraction component 146 receives the scene ground truth 144, and processes the scene ground truth 144 to extract a more abstracted scene description 148 that can be used for the purpose of simulation. The scene description 148 is consumed by the simulator 202, allowing multiple simulated runs to be performed. The simulated runs are variations of the original real-world run, with the degree of possible variation determined by the extent of the abstraction. Ground truth 150 is provided for each simulated run.
Test pipeline:
Further details of the test pipeline and the test oracle 252 will now be described. The examples that follow focus on simulation-based testing. However, as noted, the test oracle 252 can equally be applied to evaluate stack performance on real scenes, and the following description applies equally to real scenes. The following description takes the stack 100 of FIG. 1A as an example. However, as noted, the test pipeline 200 is highly flexible and can be applied to any stack or sub-stack operating at any level of autonomy.
FIG. 2 shows a schematic block diagram of the test pipeline, denoted by reference numeral 200. The test pipeline 200 is shown to comprise the simulator 202 and the test oracle 252. The simulator 202 runs simulated scenes for the purpose of testing all or part of the AV runtime stack 100, and the test oracle 252 evaluates the performance of the stack (or sub-stack) on the simulated scenes. As discussed, it may be that only a sub-stack of the runtime stack is tested, but for simplicity the following description refers to the (full) AV stack 100 throughout. However, the description applies equally to a sub-stack in place of the full stack 100. The term "slicing" is used herein to refer to the selection of a set or subset of stack components for testing.
As previously mentioned, the idea of simulation-based testing is to run a simulated driving scene in which an autonomous agent must navigate under the control of the stack 100 being tested. Typically, the scene includes a static drivable area (e.g., a particular static road layout) that the autonomous agent is required to navigate, typically in the presence of one or more other dynamic agents (such as other vehicles, bicycles, pedestrians, etc.). To this end, simulated inputs 203 are provided from the simulator 202 to the stack 100 under test.
The slicing of the stack dictates the form of the simulated inputs 203. By way of example, FIG. 2 shows the prediction, planning and control systems 104, 106 and 108 within the AV stack 100 being tested. To test the full AV stack of FIG. 1A, the perception system 102 could also be applied during testing. In this case, the simulated inputs 203 would comprise synthetic sensor data that is generated using appropriate sensor model(s) and processed within the perception system 102 in the same way as real sensor data. This requires the generation of sufficiently realistic synthetic sensor inputs (such as photorealistic image data and/or equally realistic simulated lidar/radar data, etc.). The resulting outputs of the perception system 102 would, in turn, feed into the higher-level prediction and planning systems 104, 106.
In contrast, so-called "planning level" simulations will substantially bypass the perception system 102. Instead, the simulator 202 will provide simpler, higher-level inputs 203 directly to the prediction system 104. In some cases, the prediction system 104 may even be bypassed appropriately to test the planner 106 on predictions obtained directly from the simulation scenario (i.e., the "best" predictions).
Between these extremes, there is scope for many different levels of input slicing, e.g., testing only a subset of the perception system 102, such as "later" (higher-level) perception components, e.g., components such as filters or fusion components that operate on the outputs from lower-level perception components (such as object detectors, bounding box detectors, motion detectors, etc.).
Whatever form they take, the simulated inputs 203 are used (directly or indirectly) as a basis for decision-making by the planner 106. The controller 108, in turn, implements the planner's decisions by outputting control signals 109. In a real-world context, those control signals would drive the physical actor system 112 of the AV. In simulation, an autonomous vehicle dynamics model 204 is used to translate the resulting control signals 109 into realistic motion of the autonomous agent within the simulation, thereby simulating the physical response of an autonomous vehicle to the control signals 109.
Alternatively, a simpler form of simulation assumes that the autonomous agent follows each planned trajectory exactly between planning steps. This approach bypasses the control system 108 (to the extent that it is separable from planning) and removes the need for the autonomous vehicle dynamics model 204. This may be sufficient for testing certain facets of planning.
To the extent that external agents exhibit autonomous behavior/decision-making within the simulator 202, some form of agent decision logic 210 is implemented to carry out those decisions and determine agent behavior within the scene. The agent decision logic 210 may be comparable in complexity to the autonomous stack 100 itself, or it may have more limited decision-making capability. The aim is to provide sufficiently realistic external agent behavior within the simulator 202 so that the decision-making capabilities of the autonomous stack 100 can be usefully tested. In some contexts, this does not require any agent decision-making logic 210 at all (open-loop simulation), whilst in other contexts useful testing can be provided using relatively limited agent logic 210, such as basic adaptive cruise control (ACC). One or more agent dynamics models 206 may be used to provide more realistic agent behavior, if appropriate.
The scene is run according to the scene description 201a and, if applicable, the selected parameterization 201b of the scene. Scenes typically have static and dynamic elements that may be "hard coded" or configurable in the scene description 201a and are thus determined by the scene description 201a in combination with the selected parameterization 201 b. In a driving scenario, the static elements typically include a static road layout.
Dynamic elements typically include one or more external agents in a scene, such as other vehicles, pedestrians, bicycles, and the like.
The range of dynamic information provided to simulator 202 for each external agent may vary. For example, a scene may be described by separable static and dynamic layers. A given static layer (e.g., defining a road layout) may be used in combination with different dynamic layers to provide different instances of a scene. For each external agent, the dynamic layer may include a spatial path that the agent is to follow and one or both of motion data and behavior data associated with the path. In a simple open loop simulation, the external participants simply follow the spatial paths and motion data defined in the dynamic layer that are non-reactive, i.e., do not react to autonomous agents in the simulation. Such open-loop simulation may be implemented without any proxy decision logic 210. However, in closed loop simulation, the dynamic layer instead defines at least one behavior (such as ACC behavior) to follow along the static path. In this case, the proxy decision logic 210 implements this behavior in a reactive manner (i.e., reacting to autonomous agents and/or other external agents) in the simulation. The motion data may still be associated with a static path, but in this case is less canonical and may be used, for example, as a target along the path. For example, with ACC behavior, the target speed may be set along a path that the agent will seek a match, but the agent decision logic 210 may be allowed to reduce the speed of the foreign agent below the target at any point along the path in order to maintain a target tracking interval with the lead vehicle.
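As an illustration of the kind of limited closed-loop agent logic mentioned above, a basic ACC-style behavior might be sketched as follows; the gain, headway value and function names are assumptions made for the sketch.

```python
# Illustrative sketch of closed-loop agent decision logic of the basic ACC
# kind: follow a target speed along a fixed path, but slow down to maintain
# a target headway to the vehicle in front.
def acc_speed_command(target_speed, gap_to_lead, lead_speed,
                      target_headway_s=2.0, gain=0.5):
    desired_gap = target_headway_s * max(lead_speed, 0.0)
    if gap_to_lead < desired_gap:
        # Reduce speed below the target in order to restore the headway.
        return max(0.0, lead_speed - gain * (desired_gap - gap_to_lead))
    return target_speed
```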
As will be appreciated, the scenarios for simulation may be described in a variety of ways, with any degree of configurability. For example, the number and type of agents and their motion information may be configured as part of scene parameterization 201 b.
The output of simulator 202 for a given simulation includes an autonomous trace 212a of an autonomous agent and one or more agent traces 212b (traces 212) of one or more external agents. Each trace 212a, 212b is a complete history of agent behavior in a simulation with spatial and motion components. For example, each trace 212a, 212b may take the form of a spatial path having motion data associated with points along the path, such as velocity, acceleration, jerk (rate of change of acceleration), snap (rate of change of jerk), and the like.
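For illustration, the higher-order motion quantities mentioned above (acceleration, jerk, snap) can be recovered from a sampled speed profile by finite differences, e.g. as follows (assuming numpy is available):

```python
# Illustrative sketch: deriving acceleration, jerk and snap from a sampled
# speed profile by finite differences.
import numpy as np

def motion_profile(speed: np.ndarray, dt: float):
    accel = np.gradient(speed, dt)   # m/s^2
    jerk = np.gradient(accel, dt)    # m/s^3 (rate of change of acceleration)
    snap = np.gradient(jerk, dt)     # m/s^4 (rate of change of jerk)
    return accel, jerk, snap
```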
Additional information is also provided to supplement trace 212 and provide context to trace 212. Such additional information is referred to as "context" data 214. The context data 214 belongs to the physical context of the scene and may have a static component (such as road layout) and a dynamic component (such as the degree of change in weather conditions during simulation). The context data 214 may be "pass-through" in part because it is defined directly by the choice of scene description 201a or parameterization 201b and is therefore not affected by the simulation results. For example, the context data 214 may include a static road layout directly from the scene description 201a or the parameterization 201 b. However, typically the contextual data 214 will include at least some elements derived within the simulator 202. For example, this may include simulated environmental data, such as weather data, where simulator 202 may freely change weather conditions as the simulation proceeds. In this case, the weather data may be time-dependent, and this time dependence will be reflected in the context data 214.
The test oracle 252 receives the traces 212 and the contextual data 214, and scores those outputs against a set of performance evaluation rules 254. The performance evaluation rules 254 are shown to be provided as an input to the test oracle 252.
The rules 254 are categorical in nature (e.g., pass/fail type rules). Certain performance evaluation rules are also associated with numerical performance metrics used to "score" trajectories (e.g., indicating a degree of success or failure, or some other quantity that helps to explain or is otherwise relevant to the categorical results). The evaluation of the rules 254 is time-based; a given rule may have a different outcome at different points in the scene. The scoring is also time-based: for each performance evaluation metric, the test oracle 252 tracks how the value (score) of that metric changes over time as the simulation progresses. The test oracle 252 provides an output 256 comprising a time sequence 256a of categorical (e.g., pass/fail) results for each rule, and a score-time plot 256b for each performance metric, as described in further detail below. The results and scores 256a, 256b are informative to the expert 122 and can be used to identify and mitigate performance issues within the tested stack 100. The test oracle 252 also provides an overall (aggregate) result for the scene (e.g., overall pass/fail). The output 256 of the test oracle 252 is stored in a test database 258, in association with information about the scene to which the output 256 pertains. For example, the output 256 may be stored in association with the scene description 201a (or an identifier thereof) and the chosen parameterization 201b. As well as the time-dependent results and scores, an overall score may also be assigned to the scene and stored as part of the output 256, e.g., an aggregate score for each rule (e.g., overall pass/fail) and/or an aggregate result (e.g., pass/fail) across all of the rules 254.
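A sketch of the general shape of the oracle output 256 described above is given below; the class and field names are assumptions, not the disclosed data model.

```python
# Illustrative sketch of the shape of the oracle output 256: per-rule time
# series of categorical results plus a score-time series, and aggregates.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class RuleOutput:
    results: List[bool]            # pass/fail per timestep (cf. 256a)
    scores: List[float]            # numerical metric per timestep (cf. 256b)

@dataclass
class OracleOutput:
    per_rule: Dict[str, RuleOutput] = field(default_factory=dict)

    def rule_passed(self, name: str) -> bool:
        return all(self.per_rule[name].results)      # aggregate result per rule

    def overall_pass(self) -> bool:
        return all(self.rule_passed(n) for n in self.per_rule)  # across all rules
```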
Fig. 2A shows another slice selection and indicates a full stack and a sub-stack using reference numerals 100 and 100S, respectively. The sub-stack 100S will be tested within the test pipeline 200 of fig. 2.
A plurality of "post" sense components 102B form part of the sub-stack 100S to be tested and are applied to the analog sense inputs 203 during testing. The rear sensing component 102B can, for example, include other fusion components that filter or fuse the sensory input from the plurality of front sensing components.
In the full stack 100, the rear sense unit 102B receives the actual sense input 213 from the front sense unit 102A. For example, the front sense component 102A may include one or more 2D or 3D bounding box detectors, in which case the analog sense input provided to the rear sense component may include analog 2D or 3D bounding box detection derived in the simulation via ray tracing. The front sensing component 102A generally includes components that directly operate on sensor data. With the slice of fig. 2A, the analog sensory input 203 will correspond in form to the actual sensory input 213 typically provided by the front sensory component 102A. However, instead of being applied as part of the test, the front sense component 102A is used to train one or more sense error models 208, which one or more sense error models 208 may be used to introduce real errors into the simulated sense input 203 in a statistically stringent manner, the simulated sense input 203 being fed to the rear sense component 102B of the tested sub-stack 100.
Such a perceptual error model may be referred to as a perceptual statistical performance model (Perception Statistical Performance Model, PSPM), or simply "PRISM". Further details of the principles of PSPM and suitable techniques for constructing and training PSPM may be incorporated in international patent publications WO2021037763, WO2021037760, WO2021037765, WO2021037761 and WO2021037766, each of which is incorporated herein by reference in its entirety. The idea behind the PSPM is to effectively introduce real errors into the analog perceived input provided to the sub-stack 100S (i.e., which reflects the type of errors expected when the pre-perceived component 102A is applied in the real world). In the simulation scenario, the "best" ground truth perceived input 203G is provided by the simulator, but these are all used to derive a more realistic perceived input 203 with the real errors introduced by the perceived error model 208.
As described in the above references, the PSPM may depend on one or more variables ("interference factors") representing physical conditions, allowing for the introduction of different levels of error reflecting different possible real world conditions. Thus, simulator 202 may simulate different physical conditions (e.g., different weather conditions) by simply changing the value of the weather disturbance factor, which in turn will change the manner in which the perceived error is introduced.
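As a purely illustrative sketch of the idea (not the trained models of the cited publications), perception errors might be sampled with a spread that depends on a confounder such as rain intensity; all distributions and parameters below are assumptions.

```python
# Illustrative sketch: perturbing a ground-truth detection with an error
# distribution whose spread depends on a confounder (here, "rain intensity").
import random

def sample_noisy_position(gt_x, gt_y, rain_intensity):
    sigma = 0.2 + 0.8 * rain_intensity          # more rain -> larger position error
    return (random.gauss(gt_x, sigma), random.gauss(gt_y, sigma))

def sample_detection(gt_box, rain_intensity, p_missed_clear=0.01):
    # Occasionally drop the detection entirely; misses become more likely in rain.
    if random.random() < p_missed_clear + 0.1 * rain_intensity:
        return None
    x, y = sample_noisy_position(gt_box["x"], gt_box["y"], rain_intensity)
    return {"x": x, "y": y, "w": gt_box["w"], "h": gt_box["h"]}
```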
The later perception components 102B within the sub-stack 100S process the simulated perception inputs 203 in exactly the same way as they would process the real-world perception inputs 213 within the full stack 100, and their outputs, in turn, drive prediction, planning and control.
Alternatively, PRISMs could be used to model the entire perception system 102, including the later perception components 102B, in which case a PSPM(s) is used to generate realistic perception outputs that are passed directly as inputs to the prediction system 104.
Depending on the implementation, there may or may not be a deterministic relationship between a given scene parameterization 201b, a given configuration of the stack 100, and the outcome of the simulation (i.e., the same parameterization may or may not always lead to the same outcome for the same stack 100). Non-determinism can arise in various ways. For example, when simulation is based on PRISMs, a PRISM might model a distribution over possible perception outputs at each given time step of the scene, from which realistic perception outputs are sampled probabilistically. This leads to non-deterministic behavior within the simulator 202, whereby different outcomes may be obtained for the same stack 100 and scene parameterization because different perception outputs are sampled. Alternatively or additionally, the simulator 202 may be inherently non-deterministic, e.g., weather, lighting or other environmental conditions may be randomized/probabilistic within the simulator 202 to a degree. As will be appreciated, this is a design choice; in other implementations, varying environmental conditions could instead be fully specified in the parameterization 201b of the scene. With non-deterministic simulation, multiple scene instances could be run for each parameterization. An aggregate pass/fail result could be assigned to a particular choice of parameterization 201b, e.g., as a count or percentage of pass or fail outcomes.
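For illustration, aggregation over multiple non-deterministic runs of one parameterization might be sketched as follows; the function names and the pass-rate aggregate are assumptions.

```python
# Illustrative sketch: running several scene instances for one
# parameterization and assigning an aggregate pass rate to that point.
def aggregate_for_parameterization(run_simulation, oracle, parameterization, n_runs=20):
    passes = 0
    for seed in range(n_runs):                       # multiple non-deterministic runs
        ground_truth = run_simulation(parameterization, seed=seed)
        output = oracle.evaluate(ground_truth)       # assumed to return an OracleOutput
        passes += 1 if output.overall_pass() else 0
    return {"parameterization": parameterization,
            "pass_rate": passes / n_runs,
            "runs": n_runs}
```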
A test orchestration component 260 is responsible for selecting scenes for simulation. For example, the test orchestration component 260 may select the scene description 201a and a suitable parameterization 201b automatically, based on the test oracle outputs 256 from previous scenes.
Test forecast rules:
The performance evaluation rules 254 are constructed as computational graphs (rule trees) to be applied within the test forecast. Unless otherwise indicated, the term "rule tree" herein refers to a computational graph configured to implement a given rule. Each rule is constructed as a rule tree, and a set of multiple rules may be referred to as a "forest" of multiple rule trees.
Fig. 3A shows an example of a rule tree 300 constructed from a combination of extractor nodes (leaf objects) 302 and evaluator nodes (non-leaf objects) 304. Each extractor node 302 extracts a time-varying numerical (e.g., floating point) signal (score) from a set of scene data 310. The scene data 310 is a scene ground truth in the sense described above, and may be referred to as such. The scene data 310 is obtained by deploying a trajectory planner (such as the planner 106 of fig. 1A) in a real or simulated scenario, and is shown to comprise autonomous and agent traces 212 as well as contextual data 214. In the simulation context of fig. 2 or fig. 2A, the scene ground truth 310 is provided as an output of the simulator 202.
Each evaluator node 304 is shown as having at least one child object (node), where each child object is either an extractor node 302 or another evaluator node 304. Each evaluator node receives the output(s) of its child node(s) and applies an evaluator function to those output(s). The output of the evaluator function is a time series of classification results. The following examples consider simple binary pass/fail results, but the techniques can readily be extended to non-binary results. Each evaluator function evaluates the output(s) of its child node(s) against a predetermined atomic rule. Such rules can be flexibly combined according to a desired safety model.
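As an illustration of the extractor/evaluator structure described above, the following Python sketch builds a tiny rule tree from a leaf extractor node and a non-leaf evaluator node that applies an atomic predicate at each time step. The node classes, signal names and example values are hypothetical simplifications, not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Signal = List[float]   # one numerical value per time step of the scenario
Results = List[bool]   # one TRUE/FALSE result per time step

@dataclass
class ExtractorNode:
    """Leaf object: extracts a time-varying numerical signal from the scene ground truth."""
    extract: Callable[[dict], Signal]

    def signal(self, scene_data: dict) -> Signal:
        return self.extract(scene_data)

@dataclass
class EvaluatorNode:
    """Non-leaf object: applies an atomic predicate to its children's outputs,
    producing a TRUE/FALSE classification result per time step."""
    children: Sequence[ExtractorNode]      # child evaluators omitted for brevity
    predicate: Callable[..., bool]

    def results(self, scene_data: dict) -> Results:
        child_signals = [child.signal(scene_data) for child in self.children]
        return [self.predicate(*values) for values in zip(*child_signals)]

# Hypothetical scene ground truth sampled at four time steps.
scene_data = {
    "ego_speed": [9.0, 11.5, 13.0, 14.0],
    "speed_limit": [13.4, 13.4, 13.4, 13.4],
}

ego_speed = ExtractorNode(lambda s: s["ego_speed"])
speed_limit = ExtractorNode(lambda s: s["speed_limit"])
is_within_limit = EvaluatorNode(children=[speed_limit, ego_speed],
                                predicate=lambda limit, v: limit > v)  # Gt(limit, v)

print(is_within_limit.results(scene_data))  # [True, True, True, False]
```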
Further, each evaluator node 304 derives, from the output(s) of its child node(s), a time-varying numerical signal that is related to the classification results by a threshold condition (see below).
The top-level root node 304a is an evaluator node that is not a child of any other node. The top-level node 304a outputs the final time series of results, and its descendants (i.e., nodes that are direct or indirect children of the top-level node 304a) provide the underlying signals and intermediate results.
Fig. 3B visually depicts an example of a derived signal 312 and a corresponding time series of results 314 computed by an evaluator node 304. The results 314 are correlated with the derived signal 312 in that a pass result is returned when (and only when) the derived signal exceeds a failure threshold 316. As will be appreciated, this is merely one example of a threshold condition relating a time series of results to a corresponding signal.
A signal extracted directly from the scene ground truth 310 by an extractor node 302 may be referred to as a "raw" signal, to distinguish it from the "derived" signals computed by evaluator nodes 304. Results and raw/derived signals may be discretized in time.
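The relationship between a derived signal and its time series of results (cf. fig. 3B) can be sketched as follows; the margin-style derived signal and the zero failure threshold are illustrative assumptions rather than the disclosed implementation.

```python
from typing import List

def derived_signal(latd: List[float], latsd: List[float]) -> List[float]:
    """A derived, time-varying numerical signal: here, the margin between the
    measured lateral distance and the safe lateral distance."""
    return [d - sd for d, sd in zip(latd, latsd)]

def results_from_signal(signal: List[float], failure_threshold: float = 0.0) -> List[bool]:
    """Threshold condition relating the derived signal to the time series of
    results: pass exactly when the signal exceeds the failure threshold."""
    return [value > failure_threshold for value in signal]

latd = [3.0, 2.0, 1.25, 0.5]    # raw signal extracted from the scene ground truth
latsd = [1.5, 1.5, 1.5, 1.5]    # raw safe-distance signal

margin = derived_signal(latd, latsd)
print(margin)                        # [1.5, 0.5, -0.25, -1.0]
print(results_from_signal(margin))   # [True, True, False, False]
```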
Fig. 4A shows an example of a rule tree implemented within test platform 200.
A rule editor 400 is provided for constructing rules to be implemented within the test forecast 252. The rule editor 400 receives rule creation inputs from a user (who may or may not be an end user of the system). In this example, the rule creation inputs are encoded in a domain specific language (DSL) and define at least one rule graph 408 to be implemented within the test forecast 252. The rules in the following examples are logical rules, where TRUE and FALSE represent pass and fail respectively (as will be appreciated, this is purely a design choice).
The following examples consider rules formulated using combinations of atomic logic predicates. Examples of basic atomic predicates include elementary logic gates (OR, AND, etc.) and logical functions such as "greater than" (Gt(a, b)), which returns TRUE when a is greater than b and FALSE otherwise.
The Gt function is used to implement a safe lateral distance rule between the autonomous agent and another agent in the scene (with an agent identifier "other agent id"). Two extractor nodes (latd, latsd) apply LateralDistance and LateralSafeDistance extractor functions respectively. These functions operate directly on the scene ground truth 310 to extract, respectively, a time-varying lateral distance signal (measuring the lateral distance between the autonomous agent and the identified other agent) and a time-varying safe lateral distance signal for the autonomous agent and the identified other agent. The safe lateral distance signal may depend on various factors, such as the speed of the autonomous agent and the speed of the other agent (captured in the traces 212), and the environmental conditions (e.g., weather, lighting, road type, etc.) captured in the contextual data 214.
The evaluator node (is_latd_safe) is a parent of the extractor nodes latd and latsd, and is mapped to the atomic predicate Gt. Accordingly, when the rule tree 408 is implemented, the evaluator node is_latd_safe applies the Gt function to the outputs of the extractor nodes latd and latsd to compute a TRUE/FALSE result for each time step of the scenario, returning TRUE for each time step at which the latd signal exceeds the latsd signal, and FALSE otherwise. In this manner, a "safe lateral distance" rule is constructed from atomic extractor functions and predicates; the autonomous agent fails the safe lateral distance rule when the lateral distance reaches or falls below the safe lateral distance threshold. As will be appreciated, this is a very simple example of a rule tree; rules of arbitrary complexity can be constructed according to the same principles.
The test forecast 252 applies the rule tree 408 to the scene ground truth 310 and provides results via a User Interface (UI) 418.
Fig. 4B shows an example of a rule tree that includes the lateral distance branch corresponding to fig. 4A. In addition, the rule tree includes a longitudinal distance branch and a top-level OR predicate (safe distance node, is_d_safe) to implement a safe distance metric. Similarly to the lateral distance branch, the longitudinal distance branch extracts longitudinal distance and safe longitudinal distance threshold signals from the scene data (extractor nodes lond and lonsd respectively), and a longitudinal safety evaluator node (is_lond_safe) returns TRUE when the longitudinal distance is above the safe longitudinal distance threshold. The top-level OR node returns TRUE when one or both of the lateral and longitudinal distances is safe (above the applicable threshold), and FALSE if neither is safe. In this context it is sufficient for only one of the distances to exceed its safety threshold (for example, if two vehicles are traveling in adjacent lanes, their longitudinal separation is zero or near zero when they are side by side; but this is not unsafe if the vehicles have sufficient lateral separation).
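A minimal sketch of the safe distance rule of fig. 4B is given below, assuming simple per-time-step lists for the latd/latsd/lond/lonsd signals; the helper functions stand in for the Gt and OR evaluator nodes and are not the actual DSL or rule editor syntax.

```python
from typing import Dict, List

Signal = List[float]

def gt(a: Signal, b: Signal) -> List[bool]:
    """Atomic Gt predicate applied element-wise over time."""
    return [x > y for x, y in zip(a, b)]

def logical_or(a: List[bool], b: List[bool]) -> List[bool]:
    return [x or y for x, y in zip(a, b)]

def is_d_safe(scene_data: Dict[str, Signal]) -> List[bool]:
    """Top-level safe distance rule: safe when the lateral OR the longitudinal
    separation exceeds its respective safe-distance threshold."""
    is_latd_safe = gt(scene_data["latd"], scene_data["latsd"])
    is_lond_safe = gt(scene_data["lond"], scene_data["lonsd"])
    return logical_or(is_latd_safe, is_lond_safe)

# Hypothetical pairwise signals for the autonomous agent and one other agent:
scene_data = {
    "latd":  [3.5, 3.5, 1.0, 0.5],    # lateral distance
    "latsd": [1.5, 1.5, 1.5, 1.5],    # safe lateral distance
    "lond":  [0.0, 0.0, 0.0, 20.0],   # longitudinal distance (side by side -> 0)
    "lonsd": [10.0, 10.0, 10.0, 10.0],
}
print(is_d_safe(scene_data))  # [True, True, False, True]
```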
The numerical output of the top-level node could, for example, be a time-varying robustness score.
Different rule trees can be constructed, for example to implement different rules of a given safety model, to implement different safety models, or to apply rules selectively to different scenarios (not every rule is necessarily applicable to every scenario in a given safety model; by this means, different rules or combinations of rules can be applied to different scenarios). Within this framework, rules can also be constructed for evaluating comfort (e.g., based on instantaneous acceleration and/or jerk along the trajectory), progress (e.g., based on the time taken to reach a defined goal), and the like.
The above examples consider simple logical predicates, such as OR, AND and Gt, that evaluate results or signals at a single instant in time. In practice, however, it may be desirable to formulate certain rules in terms of temporal logic.
Hekmatnejad et al., "Encoding and Monitoring Responsibility Sensitive Safety Rules for Automated Vehicles in Signal Temporal Logic" (2019), MEMOCODE '19: Proceedings of the 17th ACM-IEEE International Conference on Formal Methods and Models for System Design (incorporated herein by reference in its entirety) discloses a signal temporal logic (STL) encoding of the RSS safety rules. Temporal logic provides a formal framework for constructing predicates that are qualified in terms of time. This means that the result computed by an evaluator at a given moment in time can depend on results and/or signal values at another moment in time.
For example, a requirement of the safety model might be that the autonomous agent responds to a certain event within a set time frame. Such rules can be encoded in a similar manner using temporal logic predicates within the rule tree.
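As a hedged illustration of a temporal predicate of this kind, the sketch below checks that a response condition becomes true within a bounded number of time steps after a trigger event, in the spirit of a bounded "eventually" operator; the signal names and window length are invented for the example.

```python
from typing import List

def responds_within(trigger: List[bool], response: List[bool], n_steps: int) -> List[bool]:
    """For each time step t at which `trigger` is TRUE, return TRUE only if
    `response` becomes TRUE at some step in [t, t + n_steps]; trigger-free
    steps pass trivially. This mirrors a bounded 'eventually' operator."""
    out = []
    for t, triggered in enumerate(trigger):
        if not triggered:
            out.append(True)
            continue
        window = response[t : t + n_steps + 1]
        out.append(any(window))
    return out

# Hypothetical example: a lead vehicle brakes at step 2; the autonomous agent
# must begin braking within 3 time steps.
lead_vehicle_braking = [False, False, True, False, False, False, False]
ego_braking          = [False, False, False, False, True, False, False]
print(responds_within(lead_vehicle_braking, ego_braking, n_steps=3))
# [True, True, True, True, True, True, True]  (response at step 4 is within the window)
```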
In the above examples, the performance of the stack 100 is evaluated at each time step of the scenario. An overall test result (e.g., pass/fail) can be derived from this; for example, certain rules (e.g., safety-critical rules) may result in an overall failure if the rule is failed at any time step within the scenario (that is, the rule must be passed at every time step to obtain an overall pass on the scenario). For other types of rules, the overall pass/fail criteria may be "softer" (e.g., a failure may only be triggered for a given rule if that rule is failed over some number of consecutive time steps), and such criteria may be context dependent.
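The two kinds of overall criterion described above might be expressed as follows; the "hard" and "soft" helper functions are illustrative only, and the consecutive-failure threshold is an assumed parameter.

```python
from typing import List

def overall_result_hard(step_results: List[bool]) -> bool:
    """Safety-critical style: the rule must pass at every time step."""
    return all(step_results)

def overall_result_soft(step_results: List[bool], max_consecutive_failures: int) -> bool:
    """Softer criterion: fail overall only if the rule fails for more than
    `max_consecutive_failures` consecutive time steps."""
    run = 0
    for passed in step_results:
        run = 0 if passed else run + 1
        if run > max_consecutive_failures:
            return False
    return True

results = [True, True, False, False, True, True]
print(overall_result_hard(results))                               # False
print(overall_result_soft(results, max_consecutive_failures=2))   # True
print(overall_result_soft(results, max_consecutive_failures=1))   # False
```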
Rule evaluation hierarchy:
fig. 4C schematically depicts a hierarchy of rule evaluation implemented in test forecast 252. A set of rules 254 for implementation in the test forecast 252 is received.
Some rules apply only to autonomous agents (examples are comfort rules that evaluate whether an autonomous agent exceeds a certain maximum acceleration or jerk threshold at any given moment).
Other rules pertain to the interaction of the autonomous agent with other agents (for example, the "no collision" rule or the safe distance rule considered above). Each such rule is evaluated in a pairwise fashion between the autonomous agent and each other agent. As another example, a "pedestrian emergency braking" rule might only be activated when a pedestrian walks out in front of the autonomous vehicle, and only for that pedestrian agent.
Not every rule will necessarily be applicable to every scenario, and some rules may only be applicable for part of a scenario. Rule activation logic 422 within the test forecast 252 determines if and when each of the rules 254 is applicable to the scenario in question, and selectively activates rules when they apply. A rule may therefore remain active for the entirety of a scenario, may never be activated in a given scenario, or may be activated for only part of a scenario. Moreover, a rule may be evaluated for different numbers of agents at different points in the scenario. Selectively activating rules in this manner can significantly improve the efficiency of the test forecast 252.
The activation or deactivation of a given rule may depend on the activation/deactivation of one or more other rules. For example, an "optimal comfort" rule may be deemed inapplicable when the pedestrian emergency braking rule is activated (because the pedestrian's safety is the primary concern), and the comfort rule may be deactivated whenever the emergency braking rule is activated.
Rule evaluation logic 424 evaluates each active rule for any period of time during which it remains active. Each interaction rule is evaluated in a pairwise fashion between the autonomous agent and whichever other agent it applies to.
There may also be a degree of interdependency in the application of the rules. For example, another way to handle the relationship between the comfort rule and the emergency braking rule would be to increase the jerk/acceleration threshold of the comfort rule whenever the emergency braking rule is activated for at least one other agent.
While pass/fail results have been considered so far, rules may also be non-binary. For example, two categories of failure, namely "acceptable" and "unacceptable", may be introduced. Again considering the relationship between the comfort rule and the emergency braking rule, an acceptable failure on the comfort rule might be one occurring at a moment when the comfort rule is failed but the emergency braking rule is active. Interdependency between rules can therefore be handled in various ways.
The activation criteria for the rules 254 can be specified in the rule creation code provided to the rule editor 400, as can the nature of any rule interdependencies and the mechanism(s) for implementing them.
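One possible way to express rule activation and a simple inter-rule dependency of the kind described above is sketched below. The rule names, activation conditions and thresholds are assumptions for illustration; they are not the disclosed rule activation logic 422.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    name: str
    # Activation condition evaluated per time step on the scene state.
    is_active: Callable[[dict], bool]
    # Pass/fail evaluation, only invoked while the rule is active.
    evaluate: Callable[[dict], bool]

def pedestrian_ahead(state: dict) -> bool:
    return state["pedestrian_distance_m"] is not None and state["pedestrian_distance_m"] < 20.0

emergency_braking = Rule(
    name="pedestrian_emergency_braking",
    is_active=pedestrian_ahead,
    evaluate=lambda s: s["ego_decelerating"],
)

comfort = Rule(
    name="comfort",
    # Dependency: the comfort rule is deactivated whenever emergency braking is active.
    is_active=lambda s: not pedestrian_ahead(s),
    evaluate=lambda s: abs(s["ego_accel_mps2"]) < 3.0,
)

def evaluate_rules(rules: List[Rule], scene_states: List[dict]) -> Dict[str, List[str]]:
    timeline: Dict[str, List[str]] = {r.name: [] for r in rules}
    for state in scene_states:
        for rule in rules:
            if not rule.is_active(state):
                timeline[rule.name].append("inactive")
            else:
                timeline[rule.name].append("pass" if rule.evaluate(state) else "fail")
    return timeline

states = [
    {"pedestrian_distance_m": None, "ego_decelerating": False, "ego_accel_mps2": 1.0},
    {"pedestrian_distance_m": 15.0, "ego_decelerating": True,  "ego_accel_mps2": -5.0},
    {"pedestrian_distance_m": None, "ego_decelerating": False, "ego_accel_mps2": 0.5},
]
print(evaluate_rules([emergency_braking, comfort], states))
```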
Graphical user interface:
Fig. 5 shows a schematic block diagram of a visualization component 520. The visualization component is shown as having an input connected to the test database 258 for rendering the outputs 256 of the test forecast 252 on a graphical user interface (GUI) 500. The GUI is rendered on a display system 522.
Fig. 5A shows an example view of the GUI 500. The view pertains to a particular scenario containing multiple agents. In this example, the test forecast output 256 pertains to multiple external agents, and the results are organized by agent. For each agent, there is a time series of results for each rule applicable to that agent at some point in the scenario. The visual representation of a time series of results is referred to as a "rule timeline". In the example shown, a summary view has been selected for "Agent 01", causing a "top-level" result computed for each applicable rule to be displayed. The top-level results are those computed at the root node of each rule tree. For each rule, the timeline visually distinguishes periods in which the rule is inactive, active and passed, and active and failed.
A first selection element 534a is provided for each time series of results. This allows lower-level results of the rule tree, i.e., results computed lower down in the rule tree, to be accessed.
Fig. 5B shows a first expanded view of the results for "Rule 02", in which the results of lower-level nodes are also visually displayed. For example, for the "safe distance" rule of fig. 4B, the results of the "is_latd_safe" and "is_lond_safe" nodes (labeled "C1" and "C2" in fig. 5B) may be visually displayed. In the first expanded view of Rule 02, it can be seen that success/failure on Rule 02 is defined by a logical OR relationship between the results C1 and C2; Rule 02 is failed only when both C1 and C2 are failed (as per the "safe distance" rule described above).
A second selection element 534b is provided for each time series of results, allowing the associated numerical performance scores to be accessed.
Fig. 5C shows a second expanded view in which the results of Rule 02 and the "C1" results have been expanded to reveal the associated scores for the periods in which those rules are active for Agent 01. The scores are displayed as a visual score-time plot, color-coded in a similar manner to denote pass/fail.
Example scenario:
Fig. 6A depicts a first instance of an insertion scenario in the simulator 202 that terminates in a collision event between an autonomous vehicle 602 and another vehicle 604. The insertion scenario is characterized as a multi-lane driving scenario, in which the autonomous vehicle 602 moves along a first lane 612 (the autonomous lane) and the other vehicle 604 initially moves along a second, adjacent lane 614. At some point in the scenario, the other vehicle 604 moves from the adjacent lane 614 into the autonomous lane 612 in front of the autonomous vehicle 602 (at some insertion distance). In this scenario, the autonomous vehicle 602 is unable to avoid colliding with the other vehicle 604. The first scenario instance terminates in response to the collision event.
Fig. 6B depicts an example of a first test forecast output 256a obtained from the ground truth 310a of the first scenario instance. A "no collision" rule is evaluated between the autonomous vehicle 602 and the other vehicle 604 over the duration of the scenario. The collision event causes this rule to be failed at the end of the scenario. In addition, the "safe distance" rule of fig. 4B is evaluated. As the other vehicle 604 moves laterally closer to the autonomous vehicle 602, there comes a point in time (t1) at which both the safe lateral distance and safe longitudinal distance thresholds are breached, resulting in a failure on the safe distance rule that persists up to the collision event at time t2.
Fig. 6C depicts a second instance of the insertion scenario. In this second instance, the insertion event does not result in a collision, and the autonomous vehicle 602 is able to reach a safe distance from the other vehicle 604 after the insertion event.
Fig. 6D depicts an example of a second test forecast output 256b obtained from the ground truth 310b of the second scenario instance. In this case, the "no collision" rule is passed throughout. The safe distance rule is breached at time t3, when the lateral distance between the autonomous vehicle 602 and the other vehicle 604 becomes unsafe. At time t4, however, the autonomous vehicle 602 manages to reach a safe distance from the other vehicle 604. The safe distance rule is therefore only failed between times t3 and t4.
Delinquent evaluation:
Delinquency, i.e., blame or responsibility, is an important concept in scenarios with interacting agents. When a failure occurs in a scenario run, the question of whether the autonomous agent was at fault in the given scenario is important in determining whether the unexpected event was caused by a problem within the stack under test 100. In one sense, delinquency is an intuitive concept. However, it is a challenging concept to apply rigorously in the context of formal safety models and rule-based performance testing.
For example, in the first scenario instance of fig. 6A, the collision event might be the fault of the autonomous agent 602 or of the other agent 604, depending on the circumstances of the other agent's 604 insertion maneuver.
An extension of the testing framework will now be described which formalizes the concept of delinquency, allowing delinquency to be assessed objectively in a similarly rigorous and unambiguous manner.
Fig. 7 illustrates an extension of the test forecast 252 that includes "external" delinquent evaluation logic 702. When a rule fails for a given agent pairing (the autonomous agent and another agent), it is assessed whether the autonomous agent or the other agent should be held responsible for the failure. The following examples consider collision events, characterized by a failure on a top-level "no collision" rule. However, the same principles can be applied to any type of rule, and anywhere in the hierarchy of rules run on a given scenario. In the following description, a rule failure that is determined to be the fault of the other agent, rather than of the autonomous agent, may be referred to as an "acceptable failure". The concept of the formal safety model is extended to include an "acceptable failure" model, in order to formally distinguish between failures that the autonomous agent should have been able to prevent and failures that no autonomous agent could reasonably have prevented.
Note that the external delinquent evaluation is distinct from any "internal" handling of relationships between rules by the internal rule evaluation logic 704. For example, as described above, in some embodiments a failure on a given comfort rule may be deemed acceptable, or justified in a more general sense, when another rule that takes precedence over the comfort rule (such as an emergency braking rule) is activated.
The external delinquent evaluation is also distinct from the rule activation logic 422. The rule activation logic 422 selectively activates the rules that are applicable to a scenario. For example, the safe distance rule might be deactivated for any agent that is more than a certain distance behind the autonomous vehicle. The motivation for deactivating the safe distance rule in that case might be that maintaining a safe distance is then the responsibility of the other agent, not of the autonomous vehicle.
The external delinquent evaluation logic 702, by contrast, applies to activated rules, and operates to determine whether the autonomous agent or the other agent is responsible for a failure on an activated rule.
To this end, an acceptable failure model 700 is defined for a given scenario and provided as a second input to the test forecast 252. The functionality of the rule editor 400 is extended to allow acceptable failure models to be defined. The focus of the acceptable failure model 700 described below is on failures of activated rules that are not explained or justified by the internal rule hierarchy applicable to a given scenario run, and which therefore require investigation of the other agent's behavior in the scenario.
The described examples introduce at least three classifications of results: "pass", plus two different classifications or categories of "failure", namely an "acceptable failure" that is the fault of the other agent according to the acceptable failure model 700, and an "unacceptable failure" that is not the fault of the other agent according to the acceptable failure model 700. Note that the term "unacceptable" here refers specifically to the outcome with respect to the acceptable failure model 700; it does not exclude the possibility that the failure is justified in some other sense (e.g., according to the internal rule hierarchy).
Another option would be to encode some implicit notion of acceptable failure in the pass/fail-type rules themselves. For example, consider a basic "no collision" rule that would otherwise be failed whenever a region of the autonomous agent 602 intersects a region of the other agent 604. That rule could be extended with a further failure condition that depends on the behavior of the other agent 604. For example, the rule might be formulated as "fail whenever the region of the autonomous agent 602 intersects the region of the other agent 604 (a collision event), unless the other agent performed an insertion maneuver less than T seconds before the collision event". However, this approach has problems. First, it can result in a pass on the no-collision rule even when a collision has occurred, which is a highly misleading characterization of the scenario run and potentially has critical implications for safety testing.
An efficient two-stage implementation of acceptable failure handling is instead described. The rules 254 are formulated as pass/fail-type rules, and the first stage evaluates each applicable rule, computing a pass/fail result at each time step for which the rule is active. The first stage is independent of the acceptable failure model 700. The second stage is performed only in response to a failure on a rule, and evaluates the other agent's behavior against the acceptable failure model 700 (delinquency analysis). This may be done for all failures, or only for certain failures, e.g., only failures on particular rules or combinations of rules, and/or failures that are not justified by the internal rule hierarchy.
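A minimal sketch of the two-stage approach is given below: stage one's per-step pass/fail results are taken as given, and stage two classifies each failing step as acceptable or unacceptable by applying a delinquency rule to the other agent's trace. The trace representation and the insertion-based delinquency rule are hypothetical simplifications, not the disclosed implementation.

```python
from typing import Callable, List

def two_stage_evaluation(
    step_results: List[bool],
    agent_trace: List[dict],
    delinquency_rule: Callable[[List[dict], int], bool],
) -> List[str]:
    """Stage 1 has already produced a pass/fail result per active time step
    (`step_results`), independently of the acceptable failure model. Stage 2
    is applied only to failing steps: the other agent's trace is analysed
    around the failure to classify it as acceptable or unacceptable."""
    outcomes = []
    for t, passed in enumerate(step_results):
        if passed:
            outcomes.append("pass")
        elif delinquency_rule(agent_trace, t):
            outcomes.append("acceptable_failure")
        else:
            outcomes.append("unacceptable_failure")
    return outcomes

def insertion_within_threshold(agent_trace: List[dict], failure_step: int,
                               threshold_steps: int = 4) -> bool:
    """Hypothetical delinquency rule: the failure is the other agent's fault if
    that agent crossed into the autonomous lane within the last few steps."""
    window = agent_trace[max(0, failure_step - threshold_steps): failure_step + 1]
    return any(state["crossed_into_ego_lane"] for state in window)

agent_trace = [{"crossed_into_ego_lane": c} for c in [False, False, True, False, False, False]]
step_results = [True, True, True, False, False, True]
print(two_stage_evaluation(step_results, agent_trace, insertion_within_threshold))
# ['pass', 'pass', 'pass', 'acceptable_failure', 'acceptable_failure', 'pass']
```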
Fig. 8 shows a schematic flow chart for the second-stage processing, together with a high-level visual representation of the processing performed at each step. The example of fig. 8 considers a delinquency analysis triggered by the collision event in the scenario run of fig. 6A, but the description applies more generally to other types of rule failure.
In step S802, a collision event is detected in a given scenario run, as a failure on the top-level "no collision" rule evaluated pairwise between the autonomous agent 602 and the other agent 604. The collision event is determined to have occurred at time t2 of the scenario run.
In step S804, in response to the detected collision event, the trace of the other agent 604 is analyzed over a time period before and/or after the time of the collision event. In this example, the trace of the other agent 604 is used to locate an earlier insertion event, at time t1, occurring within the time period under consideration. The insertion event is defined at the point at which the other agent 604 crosses from the adjacent lane 614 into the autonomous lane 612.
The portion of the other agent's 604 trace between time t1 and time t2 is shown as partial trace 704, and forms part of the ground truth of the scenario run.
In step S806, the partial agent trace 704 is used to extract one or more delinquent evaluation parameters. The delinquent evaluation parameters are the parameters required to evaluate the acceptable failure model 700 applicable to the scenario.
In step S808, the acceptable failure model 700 is applied to the extracted delinquent evaluation parameters. That is, a rule-based evaluation of the delinquent evaluation parameters is performed according to the rules of the acceptable failure model 700, in order to classify the failure as acceptable or unacceptable in the sense described above.
In the described insertion scenario, one such parameter might be the collision time, t = t2 - t1, i.e., the time interval between the insertion event and the rule failure. For example, a simple delinquent evaluation rule may be defined as follows:
"in an insertion scenario, if another agent crosses the lane boundary of the autonomous lane with a collision time less than T, then the collision is acceptable"
where T is some predefined threshold (e.g., 2 seconds).
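This rule could be expressed directly as a small predicate over the extracted timings, for example as below; the timings and the 2-second default are illustrative values only.

```python
def collision_acceptable(t_insertion: float, t_collision: float, T: float = 2.0) -> bool:
    """Sketch of the delinquent evaluation rule quoted above: in an insertion
    scenario, the collision is deemed acceptable (the other agent's fault) if
    the collision time t_collision - t_insertion is less than the threshold T."""
    collision_time = t_collision - t_insertion
    return collision_time < T

# Hypothetical timings extracted from the agent trace (seconds into the run):
print(collision_acceptable(t_insertion=10.2, t_collision=11.5))  # True  (about 1.3 s < 2 s)
print(collision_acceptable(t_insertion=7.0, t_collision=10.5))   # False (3.5 s >= 2 s)
```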
Other examples of potentially relevant parameters in the insertion scenario are the speed v of the other agent 604 at the time t1 of the insertion event, and the insertion distance d between the autonomous agent 602 and the other agent 604.
In the insertion example, a key requirement of this particular delinquent evaluation rule is that the insertion event occurred before the failure of the rule under investigation. This requirement can be evaluated by checking for the presence of an insertion event within the period between time t2 - T and time t2. In this case, the requirement for attributing the fault to the other agent is that an insertion event occurred within that time period.
The insertion distance d is an example of a delinquent evaluation parameter that also requires the insertion event at time t1 to have been identified. A partial trace 702 of the autonomous agent 602 is depicted in the visual representation of step S804, and the insertion distance d is defined in this example as the lateral distance between the front bumper of the autonomous agent 602 and the rear bumper of the other agent 604.
Fig. 9 illustrates an example of the GUI extended to incorporate the results of the delinquent evaluation analysis. Different colors (shadings) are used to denote pass, acceptable failure, and unacceptable failure. Over the duration of the depicted run, failure intervals can be seen in the timelines of "Rule 01" and "Rule 02". For example, Rule 01 might be a "no collision" rule and Rule 02 a "safe distance" rule. A delinquency analysis has been performed for each failure interval. First and second failure intervals 904 and 906 occur at the end of the scenario, on Rule 01 and Rule 02 respectively, when the insertion and subsequent collision event occur. These intervals 904, 906 have been visually marked as acceptable, whereas a third failure interval 908 has been visually marked as unacceptable.
A visual representation 501 of the scenario run is shown, relating to the time t1 of the collision event. Details 906 of the delinquency analysis are also displayed in relation to time t1. For example, the details 906 may be displayed in response to the user selecting the corresponding interval 904 of the Rule 01 timeline and/or navigating the visualization 501 to time t1. For the latter, a suitable GUI element, such as a slider 912, may be provided for this purpose.
Fig. 9A shows the visual representation 501 at an earlier time, together with details 912 of the unacceptable failure interval 908 on Rule 02 obtained in the delinquent evaluation analysis. This failure occurs before the other agent's 604 insertion, and is therefore not explained by it. According to the acceptable failure model 700, the fault lies with the autonomous agent 602, and further investigation of the stack under test 100 is warranted.
One possible basis for the acceptable failure model 700 in the context of lane driving can be found in the "Proposal for a new UN Regulation on uniform provisions concerning the approval of vehicles with regards to Automated Lane Keeping System" (https://undocs.org/ECE/TRANS/WP.29/2020/81), the contents of which are incorporated herein by reference in their entirety. That reference considers an attentive human driver performance model (referred to herein as the "ALKS unavoidable collision model") applicable to insertion (cut-in), exit (cut-out) and deceleration scenarios. For the insertion scenario, a normal lateral "wandering" distance of an agent within its lane is defined, and an insertion is perceived to begin when the other vehicle exceeds the normal lateral wandering distance (possibly before the actual lane change). The model is applied to the following set of parameters: Ve0 (autonomous vehicle speed), Ve0-Vo0 (speed relative to the other vehicle performing the insertion), dy0 (lateral distance between the autonomous vehicle and the other vehicle), dx0 (longitudinal distance between the autonomous vehicle and the other vehicle) and Vy (lateral speed of the other vehicle), all measured at the start of the insertion (when the perceived insertion boundary is exceeded). By applying the ALKS unavoidable collision model to a given combination of these parameters, it can be determined whether a collision could have been avoided under the conditions represented by that parameter combination.
A variation of the above embodiments applies the acceptable failure model 700 to every scenario run, in order to determine whether a failure on a rule or combination of rules would be acceptable, regardless of whether any such failure event actually occurred. For an AV developer, this approach has the benefit of revealing scenarios in which a failure would have been acceptable according to the acceptable failure model 700, but in which the stack 100 did not in fact fail. In those cases, the stack 100 has outperformed the acceptable failure model 700 (e.g., outperformed a reasonable human driving baseline).
The following examples refer to the ALKS unavoidable collision model, but the description applies equally to other forms of acceptable failure model 700 and/or other types of failure event (other than collisions).
A user creates an abstract scenario, such as an insertion scenario, with certain specified parameters, such as the starting position and starting speed of the other agent.
At run time, in order to sample the scenario space, the user specifies ranges for certain parameters, such as dx0 (the longitudinal distance ahead of the autonomous agent at which the other agent inserts) and Vy (its lateral speed). Scenario instances are run with different combinations of these parameters, and the test forecast 252 evaluates the traces produced by each run.
In addition to the performance evaluation rules, at each step of a run the test forecast 252 determines whether the acceptable failure condition is satisfied. Taking the ALKS unavoidable collision model as an example, the acceptable failure condition is determined to be satisfied if and when (i) the other vehicle crosses the perceived insertion boundary (the start of the insertion maneuver) and (ii) the relevant parameters of the insertion maneuver at that point in time (e.g., the parameters (Ve0, Ve0-Vo0, dy0, dx0, Vy), or some subset thereof) are such that a collision event would be acceptable according to the ALKS model. Satisfaction of the acceptable failure condition means that a failure by the autonomous agent on a given rule or combination of rules (e.g., a failure on the "no collision" rule) would be an acceptable outcome according to the acceptable failure model 700, whether or not such a failure event actually occurs. This in turn allows each scenario run to be classified in one of four ways (a minimal classification helper is sketched after the following list):
1. The acceptable failure condition is not satisfied and no failure event occurs (the autonomous agent was expected to avoid failure, and did so);
2. The acceptable failure condition is not satisfied, but a failure event occurs (an unacceptable failure by the autonomous agent, indicating a problem within the stack under test 100);
3. The acceptable failure condition is satisfied and a failure event occurs (the autonomous agent did not avoid failure, but was not expected to avoid it; the failure is acceptable);
4. The acceptable failure condition is satisfied, but no failure event occurs (the autonomous agent was not expected to avoid failure, but managed to do so nonetheless; the stack 100 has outperformed the acceptable failure model 700).
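The classification helper referred to above might look as follows; the category labels are paraphrases of the four cases listed, not terminology from the source.

```python
def classify_run(acceptable_failure_condition_met: bool, failure_event_occurred: bool) -> str:
    """Maps the two per-run determinations onto the four categories above."""
    if not acceptable_failure_condition_met and not failure_event_occurred:
        return "1: failure avoidable and avoided"
    if not acceptable_failure_condition_met and failure_event_occurred:
        return "2: unacceptable failure (investigate stack)"
    if acceptable_failure_condition_met and failure_event_occurred:
        return "3: acceptable failure"
    return "4: failure expected but avoided (stack outperformed the model)"

print(classify_run(acceptable_failure_condition_met=False, failure_event_occurred=True))
# 2: unacceptable failure (investigate stack)
```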
A distinction is drawn between the parameterization 201b of a scene that is input to the simulator 202 and the delinquent evaluation parameters of the acceptable failure model 700. The latter are extracted from the traces and describe the actual behavior of the agents in the scenario run; depending on how the scene is configured, the actual behavior may deviate from the parameterization, because the outcome of the scenario is determined by decisions within the stack 100 and by any agent decision logic 210 (e.g., in some cases an insertion may not actually occur in an insertion scenario, and the described techniques are robust to such outcomes).
A scenario run may nevertheless be characterized within the system by its parameterization 201b, and that parameterization 201b (corresponding to a point in the scenario space) may be assigned one of the four classifications described above. To perform this classification, the scenario is run with the given parameter combination and the processing steps described above are applied to the resulting traces 212a, 212b.
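A sweep over a region of the scenario space might then be sketched as below; the run_and_classify function is a synthetic stand-in (its acceptable/failed conditions are invented), intended only to show how each parameterization point receives one of the four classifications.

```python
import itertools
from typing import Dict, Tuple

def run_and_classify(dx0: float, vy: float) -> str:
    """Hypothetical stand-in for running one scenario instance with the given
    parameterization and applying the classification described above. A real
    implementation would drive the simulator and test forecast; here the
    outcome is a simple synthetic function of the parameters."""
    acceptable = vy > 1.0 and dx0 < 15.0   # stand-in for the acceptable failure condition
    failed = dx0 / max(vy, 0.1) < 10.0     # stand-in for an actual collision occurring
    if not acceptable and not failed:
        return "pass (failure avoidable and avoided)"
    if not acceptable and failed:
        return "unacceptable failure"
    if acceptable and failed:
        return "acceptable failure"
    return "failure expected but avoided"

# Sweep a small region of the scenario space.
results: Dict[Tuple[float, float], str] = {}
for dx0, vy in itertools.product([5.0, 15.0, 25.0], [0.5, 1.5, 2.5]):
    results[(dx0, vy)] = run_and_classify(dx0, vy)

for point, category in results.items():
    print(point, "->", category)
```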
Fig. 10 shows an example of acceptable failure results aggregated over multiple runs. It shows a region of the scenario space in which each point corresponds to a particular parameterization 201b of a given abstract scenario, with the points classified according to the outcome of the corresponding run.
Fig. 11 shows similar results for an (at least) three-dimensional scenario space.
While the above examples consider collision events, the techniques can be applied more generally to other types of failure event. A failure event may be a failure result on a particular rule, or a particular combination of failure results on one or more rules. Once a failure event has been identified, a delinquent evaluation analysis can be triggered and its results communicated in a similar manner.
While the above examples consider AV stack testing, the techniques can equally be applied to test components of other forms of mobile robot. Other mobile robots are being developed, for example for carrying freight supplies in internal and external industrial zones. Such mobile robots would have no people on board and belong to a class of mobile robot termed a UAV (unmanned autonomous vehicle). Autonomous airborne mobile robots (unmanned aerial vehicles) are also under development.
A computer system comprises execution hardware which may be configured to perform the method/algorithm steps disclosed herein and/or to implement models trained using the present techniques. The term "execution hardware" encompasses any form/combination of hardware configured to perform the relevant method/algorithm steps. The execution hardware may take the form of one or more processors, which may be programmable or non-programmable, or a combination of programmable and non-programmable hardware may be used. Examples of suitable programmable processors include general-purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors, and the like. Such general-purpose processors typically execute computer-readable instructions held in memory coupled to or internal to the processor, and carry out the relevant steps in accordance with those instructions. Other forms of programmable processor include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code. Examples of non-programmable processors include application specific integrated circuits (ASICs). Code, instructions, etc. may be stored as appropriate on transitory or non-transitory media (examples of which include solid state, magnetic and optical storage devices, and the like). The subsystems 102-108 of the runtime stack 100 may be implemented in programmable or dedicated processors, or a combination of both, on board a vehicle or in an off-board computer system in the context of testing and the like. The various components of fig. 2, such as the simulator 202 and the test forecast 252, may similarly be implemented in programmable and/or dedicated hardware.

Claims (20)

1. A computer-implemented method of evaluating performance of a trajectory planner of a mobile robot in a real or simulated scenario, the method comprising:
receiving a scene ground truth of the scene, controlling an autonomous agent of the scene in response to at least one other agent of the scene using the trajectory planner to generate the scene ground truth, and the scene ground truth comprising an autonomous trace of the autonomous agent and an agent trace of the other agent;
evaluating the autonomous trace by a test forecast to assign at least one time-series of test results to the autonomous agent, the time-series of test results belonging to at least one performance evaluation rule;
extracting one or more predetermined delinquent evaluation parameters based on the agent trace; and
applying one or more predetermined delinquent evaluation rules to the delinquent evaluation parameters, thereby determining whether failure of the at least one performance evaluation rule is acceptable.
2. The method of claim 1, comprising the step of detecting an action of another agent of a predetermined type, wherein the delinquent evaluation parameter is extracted based on the detected action.
3. The method of claim 2, wherein the delinquent evaluation parameter comprises a distance between the autonomous agent and the other agent upon detection of the action.
4. A method according to claim 2 or 3, wherein the delinquent assessment parameters comprise at least one movement parameter of the further agent when the action is detected.
5. A method according to any preceding claim, wherein the one or more predetermined delinquent evaluation rules are applied to identify that one of the autonomous agent and the further agent caused a failure event in the at least one time-series of test results.
6. A method according to claim 5 when dependent on claim 2, wherein the action occurs before the failure event.
7. The method of claim 6, wherein the delinquent evaluation parameter comprises a time interval between the detected action and the failure event.
8. The method of any of claims 5 to 7, wherein the one or more predetermined delinquent evaluation parameters are extracted in response to the failure event in the at least one time series of test results based on timing of the agent trace and the failure event.
9. The method of any of claims 1 to 4, wherein the predetermined delinquent evaluation rules are applied regardless of whether any failure events occur in the at least one time series of test results.
10. The method of claim 9, wherein the scene is assigned a category label, the category label representing:
acceptable failure events occurring in the at least one time series of test results,
unacceptable failure events occurring in the at least one time series of test results,
no failure event occurs in the at least one time series of test results and such failure event is unacceptable, or
No failure event occurs in the at least one time series of test results, and such failure event is acceptable.
11. The method of claim 10, wherein the class labels are stored in association with a set of scene parameters that parameterize the scene.
12. The method of claim 11, comprising generating display data for summarizing the visualization of the scene parameters and the category labels.
13. A method according to any preceding claim, comprising the step of generating display data for displaying a rule timeline having a visual indication of whether a failure on the at least one performance assessment rule is acceptable, the rule timeline being a visual representation of the time series.
14. A method according to claim 13 when dependent on claim 5, wherein the failure result and the agent responsible for the failure event are visually identified in the rule timeline.
15. A method according to claim 13 or 14, comprising the step of presenting a graphical user interface comprising the rule timeline with the visual indication.
16. A method according to any preceding claim, comprising the step of storing the results of the time series in a test database, the results having an indication of whether failure of the at least one performance assessment rule is acceptable.
17. A method according to claim 16 when dependent on claim 5, wherein the results of the time series are stored in the test database together with an indication of the agent responsible for the failure event.
18. A method according to claim 5 or any claim dependent on claim 5, wherein the predetermined delinquent evaluation rules apply to only a portion of the agent trace for a period of time defined by the timing of the failure event.
19. A computer system comprising one or more computers configured to implement the method of any preceding claim.
20. One or more computer programs embodied in a transitory or non-transitory medium and configured, when executed by one or more computers, to implement the method of any of claims 1-18.
CN202280030303.0A 2021-04-23 2022-04-22 Performance test of mobile robot trajectory planner Pending CN117242449A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
GB2105836.7 2021-04-23
GB2107876.1 2021-06-02
GB2115740.9 2021-11-02
GBGB2115740.9A GB202115740D0 (en) 2021-11-02 2021-11-02 Performance testing for mobile robot trajectory planners
PCT/EP2022/060764 WO2022223816A1 (en) 2021-04-23 2022-04-22 Performance testing for mobile robot trajectory planners

Publications (1)

Publication Number Publication Date
CN117242449A true CN117242449A (en) 2023-12-15

Family

ID=78828273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280030303.0A Pending CN117242449A (en) 2021-04-23 2022-04-22 Performance test of mobile robot trajectory planner

Country Status (2)

Country Link
CN (1) CN117242449A (en)
GB (1) GB202115740D0 (en)

Also Published As

Publication number Publication date
GB202115740D0 (en) 2021-12-15

Similar Documents

Publication Publication Date Title
CN112868022B (en) Driving scenario for an autonomous vehicle
US20230234613A1 (en) Testing and simulation in autonomous driving
US20230289281A1 (en) Simulation in autonomous driving
US20240123615A1 (en) Performance testing for mobile robot trajectory planners
US20240194004A1 (en) Performance testing for mobile robot trajectory planners
US20240143491A1 (en) Simulation based testing for trajectory planners
KR20240019231A (en) Support tools for autonomous vehicle testing
CN116940933A (en) Performance testing of an autonomous vehicle
CN117242449A (en) Performance test of mobile robot trajectory planner
WO2023227776A1 (en) Identifying salient test runs involving mobile robot trajectory planners
CN116888578A (en) Performance testing for mobile robot trajectory planner
EP4374277A1 (en) Perception testing
EP4373726A1 (en) Performance testing for mobile robot trajectory planners
WO2024115764A1 (en) Support tools for autonomous vehicle testing
CN117529711A (en) Autonomous vehicle test support tool
WO2024115772A1 (en) Support tools for autonomous vehicle testing
CN117501249A (en) Test visualization tool
EP4338058A1 (en) Tools for performance testing autonomous vehicle planners
KR20240019294A (en) Test visualization tool

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination