CN117501249A - Test visualization tool

Test visualization tool

Info

Publication number
CN117501249A
Authority
CN
China
Prior art keywords
time
scene
rule
run
visualization
Prior art date
Legal status
Pending
Application number
CN202280041327.6A
Other languages
Chinese (zh)
Inventor
Iain Whiteside
Marco Ferri
Ben Graves
Jim Cruickshank
Current Assignee
Faber Artificial Intelligence Co ltd
Original Assignee
Faber Artificial Intelligence Co ltd
Priority date
Filing date
Publication date
Application filed by Faber Artificial Intelligence Co ltd
Priority claimed from PCT/EP2022/065484 (published as WO2022258657A1)
Publication of CN117501249A


Landscapes

  • Traffic Control Systems (AREA)

Abstract

A computer system for presenting a graphical user interface for visualizing a run of a driving scenario in which an autonomous agent navigates a road layout, the computer system comprising an input and a presentation component. The input is configured to receive a map of the road layout and operational data of the run, the operational data comprising: a sequence of time-stamped autonomous agent states, and a time-varying numerical score quantifying the performance of the autonomous agent with respect to each rule of a set of run evaluation rules. The presentation component is configured to cause the graphical user interface to display, for each rule: a graph of the time-varying numerical score and a marker representing a selected time index of the graph, the marker being movable along a time axis to change the selected time index; and a scene visualization comprising a visualization of the run at the selected time index, whereby moving the marker along the time axis causes the scene visualization to update as the time index changes.

Description

Test visualization tool
Technical Field
The present disclosure relates to computer systems and methods for visualizing and assessing mobile robot behavior.
Background
The field of autonomous vehicles has grown significantly and rapidly. An autonomous vehicle (AV) is a vehicle equipped with sensors and a control system that enable it to operate without a human controlling its behavior. Such sensors enable the vehicle to perceive its physical environment and include, for example, cameras, radar, and lidar. An autonomous vehicle is equipped with a suitably programmed computer that processes the data received from the sensors and makes safe and predictable decisions based on the context perceived by the sensors. An autonomous vehicle may be fully autonomous (designed to operate without human supervision or intervention, at least in some circumstances) or semi-autonomous. Semi-autonomous systems require varying levels of human supervision and intervention; such systems include advanced driver assistance systems and Level 3 autonomous driving systems. There are different facets to testing the behavior of the sensors and control systems on a particular autonomous vehicle, or on a type of autonomous vehicle. Other mobile robots are also being developed, for example for carrying freight in indoor and outdoor industrial zones. Such mobile robots have no people on board and belong to a class of mobile robots termed UAVs (unmanned autonomous vehicles). Unmanned aerial mobile robots (drones) are also under development.
In autonomous driving, the importance of guaranteed safety has been recognized. Guaranteed safety does not necessarily imply zero accidents, but rather means ensuring that some minimum level of safety is met in defined circumstances. It is generally assumed that this minimum level of safety must significantly exceed that of human drivers for autonomous driving to be viable.
Rule-based models may be used to test aspects of autonomous vehicle performance both in real-world driving scenarios and in simulation. These models set out criteria that an autonomous vehicle stack should meet in order to be considered safe. A large number of real-world or simulated driving runs need to be evaluated to ensure that potentially dangerous scenarios are encountered in testing, so a large amount of real or simulated driving data must be processed. The rules defined for a rule-based test model are applied to each real or simulated driving run to generate a set of test results, which can be complex and difficult for a user to interpret.
Disclosure of Invention
The RSS model provides a rule-based model for testing the behavior of autonomous agents in order to evaluate the planning and control of an autonomous vehicle stack. Other aspects of autonomous vehicle performance may also be tested using rule-based models. For example, perception errors of a real or simulated autonomous vehicle stack may be determined based on a perception ground truth (which may be a simulated ground truth or a "pseudo" ground truth generated from real-world sensor data). A user may evaluate whether the perception outputs of the autonomous vehicle stack are within acceptable accuracy bounds by defining a set of perception error rules and evaluating the determined perception errors against those rules.
In rule-based testing of an autonomous vehicle stack, the driving performance of an autonomous agent is evaluated against one or more defined rules, both in real-world driving scenarios and in simulation. These rules may include: driving rules that evaluate the behavior of the autonomous agent against some model of expected safe driving behavior in similar driving scenarios, and/or perception rules that evaluate the accuracy of the autonomous vehicle's perception of its surroundings. Many rules may be defined for each scenario, and it is important that the evaluation of these rules in testing can be interpreted by the user, whether for an individual scenario or in an aggregated result set representing the performance of the autonomous agent over a large number of scenarios. One way to provide interpretable results at the scenario level is to provide a graphical user interface that displays results for each given instance (or "run") of a scenario driven by the autonomous agent under a given set of (real or simulated) conditions. In one example graphical user interface, a visualization of the scenario is presented along with a set of timelines indicating, for each rule, whether the rule passed or failed over the course of the run. This visualization provides a useful summary to the user of the rules that passed and failed in a given run, giving an overall summary of the autonomous agent's performance for that run. Rules may be user-definable and/or arbitrarily complex. A numerical score may be provided for each rule in the user interface, and multiple conditions may contribute to a rule and hence to its numerical performance score. While this flexibility is desirable to accommodate the nuances of driving across a wide range of real/simulated driving runs, it can be difficult for a user to relate the rule evaluations directly to the events of the scenario, particularly for complex rules and/or rules based on multiple conditions. The interpretation of run evaluation rules in AV performance testing is one technical challenge addressed herein.
A system for visualizing the driving performance of an autonomous agent is described herein, which provides a visualization of a scenario alongside a set of time plots of the autonomous agent's numerical performance against a respective set of rules. A time marker is provided for the scene visualization and for the graph associated with each rule, allowing the user to select a given time in the scenario in order to visualize what happens in the scenario at that time, while the time marker on each rule's graph of numerical performance moves to the corresponding time step; this allows the user to quickly identify how a rule failure corresponds to the actual events of the scenario. The user can thereby visualize the relationship between the defined rules and the behavior of the autonomous agent, and other conditions of the scenario, at any given time of the driving run. Regardless of how the underlying rules are defined, this novel graphical user interface mechanism makes the numerical performance scores more interpretable to the user.
A first aspect herein provides a computer system for presenting a graphical user interface for visualizing a run of a driving scenario in which an autonomous agent navigates a road layout, the computer system comprising: at least one input configured to receive a map of the road layout of the driving scenario and operational data of the run of the driving scenario, wherein the operational data comprises: a sequence of time-stamped autonomous agent states, and a time-varying numerical score quantifying the performance of the autonomous agent with respect to each rule of a set of run evaluation rules, the time-varying numerical score calculated by applying the run evaluation rules to the run; and a presentation component configured to generate presentation data for causing the graphical user interface to display, for each of the run evaluation rules: a graph of the time-varying numerical score and a marker representing a selected time index on a time axis of the graph, the marker being movable along the time axis via user input at the graphical user interface to change the selected time index; and a scene visualization comprising a visualization of the road layout overlaid with an agent visualization of the run at the selected time index, whereby moving the marker along the time axis causes the presentation component to update the scene visualization as the selected time index changes.
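By way of illustration only, the following sketch shows one possible (hypothetical) arrangement of the operational data and the marker-driven update described in the first aspect; the class and field names (RunData, ScenarioVisualizer, etc.) are illustrative assumptions and do not form part of the described embodiments.

```python
# Illustrative sketch only: a minimal representation of run data and the
# marker-driven scene update. All names here are hypothetical.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class AgentState:
    t: float          # timestamp (seconds)
    x: float          # position within the road layout
    y: float
    heading: float

@dataclass
class RunData:
    agent_states: List[AgentState]                      # time-stamped autonomous agent states
    rule_scores: Dict[str, List[Tuple[float, float]]]   # rule name -> [(t, numerical score)]

class ScenarioVisualizer:
    def __init__(self, road_map, run: RunData):
        self.road_map = road_map
        self.run = run
        self.selected_time = run.agent_states[0].t

    def on_marker_moved(self, new_time: float):
        """Called when the user drags the marker along the time axis."""
        self.selected_time = new_time
        state = min(self.run.agent_states, key=lambda s: abs(s.t - new_time))
        self.render_scene(state)          # agent visualization overlaid on the road layout
        for rule, series in self.run.rule_scores.items():
            t, score = min(series, key=lambda p: abs(p[0] - new_time))
            self.update_score_readout(rule, score)   # numerical score at the selected time

    def render_scene(self, state: AgentState):
        ...  # draw the road layout and the agent at `state` (rendering backend not shown)

    def update_score_readout(self, rule: str, score: float):
        ...  # update the per-rule graph marker / displayed score
```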
The input may also be configured to receive second operational data of a second run of the driving scenario, the second operational data comprising: a second sequence of time-stamped autonomous agent states and a time-varying numerical score quantifying the performance of the autonomous agent with respect to each rule of the set of driving performance and/or perception rules, the time-varying numerical score calculated by applying the run evaluation rules to the second run; and wherein the presentation component is further configured to generate presentation data for causing the graphical user interface to display, for each rule of the set of driving performance and/or perception rules: a second graph of the time-varying numerical score of the second run, and a second agent visualization of the second run at the selected time index, wherein the time-varying numerical scores of the run and of the second run are plotted against a common set of axes including at least a common time axis, wherein the marker represents the selected time index on the common time axis, and wherein the scene visualization is overlaid with the second agent visualization.
When testing autonomous agents in simulated or real-world driving scenarios, multiple runs may be evaluated for a single scenario, where some aspect of the agent's configuration or behavior differs between runs. In this case, the evaluation of rules and metrics for each run does not by itself provide a detailed picture of how the differences in the agent's configuration and/or behavior affect the progression of the scenario. Described herein is a system that includes a run comparison user interface in which two driving runs can be compared in a common scene visualization along a common timeline: the user can interactively select a time index for the scenario, and the user interface displays a visualization of the vehicle state at that time for each of the two runs. This enables the behavior of the vehicle in the two runs to be compared in a playback of the scenario, allowing the user to identify the specific actions or features of each run that contribute to better or worse performance.
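The run comparison described above can be illustrated, purely as a hypothetical sketch assuming a matplotlib-style plotting backend, by drawing both runs' scores for one rule against a common time axis with a movable marker:

```python
# Illustrative sketch: plotting one rule's numerical score for two runs against
# a common time axis, with a vertical line marking the selected time index.
import matplotlib.pyplot as plt

def plot_rule_comparison(rule_name, run_a, run_b, selected_time):
    # run_a / run_b: RunData as sketched above; both series share one time axis
    fig, ax = plt.subplots()
    for label, run in (("run 1", run_a), ("run 2", run_b)):
        ts, scores = zip(*run.rule_scores[rule_name])
        ax.plot(ts, scores, label=label)
    ax.axvline(selected_time, linestyle="--")   # marker at the selected time index
    ax.set_xlabel("time (s)")
    ax.set_ylabel(f"score: {rule_name}")
    ax.legend()
    return fig
```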
The time-varying numerical score may be calculated by applying one or more rules to a time-varying signal extracted from the operational data, wherein changes in the signal are visible in the scene visualization.
In response to a deselection input at the graphical user interface indicating one of the first run and the second run, the presentation component may be configured to: remove, for each driving rule, the plot of the time-varying numerical score of the deselected run from the common set of axes; and remove the agent visualization of the deselected run from the single visualization of the road layout, whereby the user is able to switch from a run comparison view relating to both the first run and the second run to a single run view relating to only one of the first run and the second run.
The graphical user interface may also include a comparison table having an entry for each rule of the set of run evaluation rules, the entry containing an aggregate result for the rule in the first run and an aggregate result for the rule in the second run.
The entry for each rule may also include a description of the rule.
In response to an expand input at the graphical user interface, the presentation component may be configured to hide the graph of the time-varying numerical score for each rule and to display a timeline view that includes an indication of the pass/fail results of the rules over time.
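As an illustrative sketch only, a rule's time-varying numerical score could be collapsed into the pass/fail indications of such a timeline view as follows, assuming the convention that a non-negative score denotes a pass:

```python
# Illustrative sketch: collapsing a time-varying numerical score into the
# pass/fail intervals shown on a timeline view (pass assumed when score >= 0).
def pass_fail_intervals(series):
    """series: [(t, score)] -> [(t_start, t_end, 'PASS'|'FAIL')]"""
    intervals = []
    for t, score in series:
        status = "PASS" if score >= 0 else "FAIL"
        if intervals and intervals[-1][2] == status:
            intervals[-1] = (intervals[-1][0], t, status)   # extend the current interval
        else:
            intervals.append((t, t, status))                # start a new interval
    return intervals
```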
The presentation component may be configured to cause the graphical user interface to display the numerical score at the selected time index for each rule in the set of run evaluation rules.
The run evaluation rules may include a perception rule, wherein the scene visualization includes a set of perception outputs generated by a perception component of the autonomous vehicle.
The scene visualization may include sensor data overlaid on the visualization of the road layout.
The scene visualization may include a scenario timeline having a scenario time marker, whereby moving the marker along the scenario timeline causes the presentation component to update the respective time marker of each graph of the time-varying numerical score as the selected time index changes.
The scenario timeline may include: a frame index corresponding to the selected time index, and a set of controls for stepping forward or backward by incrementing or decrementing the frame number, respectively.
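A hypothetical sketch of such frame-index controls, reusing the ScenarioVisualizer sketched above, might look as follows:

```python
# Illustrative sketch: frame-index controls for the scenario timeline. Stepping
# forward/backward changes the frame number, which in turn moves the marker on
# every per-rule graph (names are hypothetical).
class TimelineControls:
    def __init__(self, visualizer, frame_times):
        self.visualizer = visualizer        # e.g. the ScenarioVisualizer sketched earlier
        self.frame_times = frame_times      # timestamp of each frame
        self.frame_index = 0

    def step_forward(self):
        self.set_frame(self.frame_index + 1)

    def step_backward(self):
        self.set_frame(self.frame_index - 1)

    def set_frame(self, index: int):
        self.frame_index = max(0, min(index, len(self.frame_times) - 1))
        self.visualizer.on_marker_moved(self.frame_times[self.frame_index])
```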
The driving scenario may be a simulated driving scenario in which the autonomous agent navigates a simulated road layout, wherein the operational data is received from a simulator.
The driving scenario may be a real-world driving scenario in which the autonomous agent navigates a real-world road layout, wherein the operational data is computed based on data generated on board the autonomous vehicle during the run.
The graph of the time-varying numerical score may comprise an xy plot of the numerical score against time.
Alternatively or additionally, the time-varying numerical score may be plotted using color coding.
A second aspect herein provides a method for visualizing a run of a driving scenario in which an autonomous agent navigates a road layout, the method comprising: receiving a map of the road layout of the driving scenario and operational data of the run of the driving scenario, wherein the operational data comprises: a sequence of time-stamped autonomous agent states, and a time-varying numerical score quantifying the performance of the autonomous agent with respect to each of a set of run evaluation rules, the time-varying numerical score calculated by applying the run evaluation rules to the run; and generating presentation data for causing a graphical user interface to display, for each of the run evaluation rules: a graph of the time-varying numerical score and a marker representing a selected time index on a time axis of the graph, the marker being movable along the time axis via user input at the graphical user interface to change the selected time index; and a scene visualization comprising a visualization of the road layout overlaid with an agent visualization of the run at the selected time index, whereby moving the marker along the time axis causes the scene visualization to be updated as the selected time index changes.
Another aspect herein provides a computer program comprising executable instructions for programming a computer system to implement the method or system functionality of any of the above aspects.
Drawings
For a better understanding of the present disclosure, and to show how embodiments of the disclosure may be carried into effect, reference is made, by way of example only, to the following drawings, in which:
figure 1A shows a schematic block diagram of an autonomous vehicle stack,
figure 1B shows a schematic overview of an autonomous vehicle test paradigm,
figure 1C shows a schematic block diagram of a scene extraction pipeline,
figure 2A shows a schematic block diagram of a test pipeline,
figure 2B shows further details of a possible implementation of the test pipeline,
figure 3A shows an example of a rule graph evaluated within a test oracle,
figure 3B shows an example output of a node of the rule graph,
figure 4 shows a schematic block diagram of a computer system for presenting a run visualization user interface,
figure 5 illustrates a single-run view of an example run visualization user interface,
figure 6 illustrates a run comparison view of an example run visualization user interface,
figure 7 shows a reverse report view of the run visualization user interface,
Figure 8 shows an architecture for evaluating perception errors,
figure 9A illustrates an example graphical user interface for a triage tool,
figure 9B shows a schematic representation of a driving scenario comprising sensor data displayed in a graphical user interface,
figure 9C illustrates an example user interface with a zoom function and a timeline scrubber,
figure 9D shows the selection of a segment of a scenario in a user interface,
FIG. 10 illustrates an example graph of a numerical perception error score with a defined error threshold.
Detailed Description
In one example graphical user interface, disclosed in international patent application nos. PCT/EP2022/053413 and PCT/EP2022/053406, a visualization of a scenario is presented along with a set of timelines indicating, for each rule, whether the rule passed or failed over the course of the run. The visualization provides a useful summary to the user of the rules that passed and failed in a given run, giving an overall summary of the autonomous agent's performance for that run. However, as described later, the rules may be user-defined and arbitrarily complex, with multiple conditions contributing to the numerical performance score provided in the user interface, making it difficult for the user to relate the rule evaluations directly to the events of the scenario.
The described embodiments provide a test pipeline to facilitate rule-based testing of a mobile robot stack in real or simulated scenarios. A set of interactive graphical user interface (GUI) features improves the interpretability of the applied rules, allowing an expert to more easily and reliably assess stack performance in a given driving scenario from the GUI output.
A "full" stack generally involves everything from the processing and interpretation of low-level sensor data (perceptions) to major higher-level functions such as prediction and planning, and control logic for generating appropriate control signals to implement planning-level decisions (e.g., control braking, steering, acceleration, etc.). For an autonomous vehicle, the 3-level stack includes some logic for implementing transition requirements, and the 4-level stack also includes some logic for implementing minimum risk maneuvers. The stack may also implement auxiliary control functions such as signaling, headlights, windshield wipers, etc.
The term "stack" may also refer to an individual subsystem (sub-stack) of the full stack, such as a perception, prediction, planning, or control stack, which may be tested individually or in any desired combination. A stack may refer purely to software, i.e., one or more computer programs executable on one or more general-purpose computer processors.
The test framework described below provides a pipeline for generating scenario ground truth from real-world data. The ground truth may be used as a basis for perception testing, by comparing the generated ground truth with the perception outputs of the perception stack under test, and for assessing driving behavior against driving rules.
Agent (actor) behavior in real or simulated scenarios is evaluated by a test oracle based on defined performance evaluation rules. Such rules may evaluate different facets of safety. For example, a safety rule set may be defined to assess the performance of the stack against a particular safety standard, regulation, or safety model (such as RSS), or a bespoke rule set may be defined for testing any aspect of performance. The application of the test pipeline is not limited to safety and may be used to test any aspect of performance, such as comfort or progress toward some defined goal. A rule editor allows performance evaluation rules to be defined or modified and passed to the test oracle.
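Purely as an illustrative sketch (the registration mechanism and the rule shown are assumptions, not the actual rule editor), user-defined performance evaluation rules could be registered and handed to the test oracle as follows:

```python
# Illustrative sketch: user-defined performance evaluation rules registered via
# a rule editor and passed to the test oracle (all names are hypothetical).
RULES = {}

def rule(name):
    """Decorator registering a rule; each rule maps run data to a score at time t."""
    def register(fn):
        RULES[name] = fn
        return fn
    return register

@rule("comfort.max_lateral_accel")
def comfort_rule(traces, env, t, threshold=3.0):
    # Positive while below the comfort threshold, negative once it is exceeded.
    # `traces.ego.lateral_acceleration(t)` is a hypothetical accessor.
    return threshold - abs(traces.ego.lateral_acceleration(t))
```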
Similarly, vehicle perception may be assessed by a "perception oracle" based on defined perception rules. These may be defined within a perception error specification, which provides a standard format for defining errors in perception.
Defining rules within the perception error framework allows regions of interest in a real-world driving scenario to be highlighted to the user, for example by marking those regions in a playback of the scenario presented in the user interface, as described in more detail below. This enables the user to review significant errors in the perception stack and identify possible causes of those errors, such as occlusions in the raw sensor data. Assessing perception errors in this way also allows a "contract" to be defined between the perception and planning components of the AV stack, in which requirements on perception performance can be specified and in which the stack promises that it can plan safely provided those perception requirements are met. The unified framework may be used to evaluate real perception errors from real-world driving scenarios as well as simulated errors, which are computed using a perception error model, either simulated directly or by applying a perception stack to simulated sensor data (e.g., photorealistic simulated camera images).
The ground truth determined by the pipeline may itself be evaluated within the same perception error specification, by comparing it, according to the defined rules, with a "true" ground truth determined by manually reviewing and annotating the scenario. Finally, the results of applying the perception error test framework may be used to guide a test strategy for testing both the perception and prediction subsystems of the stack.
A scenario, whether real or simulated, requires an autonomous agent to navigate a real or modeled physical context. The autonomous agent is a real or simulated mobile robot that moves under the control of the stack under test. The physical context includes static and/or dynamic elements to which the stack under test needs to respond effectively. For example, the mobile robot may be a fully or semi-autonomous vehicle under the control of the stack (an autonomous vehicle). The physical context may include: a static road layout and a given set of environmental conditions (e.g., weather, time of day, lighting conditions, humidity, pollution/particulate levels, etc.), which may be fixed or may change as the scenario progresses. An interactive scenario additionally includes one or more other agents ("external" agents, e.g., other vehicles, pedestrians, cyclists, animals, etc.).
The following examples consider applications to autonomous vehicle testing. However, the principles apply equally to other forms of mobile robot.
Scenarios may be represented or defined at different levels of abstraction. More abstract scenarios accommodate a greater degree of variation. For example, a "cut-in scenario" or a "lane change scenario" is an example of a highly abstract scenario, characterized by a maneuver or behavior of interest, that accommodates many variations (e.g., different agent start positions and speeds, road layouts, environmental conditions, etc.). A "scenario run" refers to a concrete occurrence of an agent navigating a physical context (optionally in the presence of one or more other agents). For example, multiple runs of a cut-in or lane change scenario (in the real world and/or in a simulator) could be performed with different agent parameters (e.g., starting positions, speeds, etc.), different road layouts, different environmental conditions, and/or different stack configurations, etc. The terms "run" and "instance" are used interchangeably in this context.
In the following examples, the performance of the stack is assessed, at least in part, by evaluating the behavior of the autonomous agent in the test oracle against a given set of performance evaluation rules, over the course of one or more runs. The rules are applied to the "ground truth" of the (or each) scenario run; in general, "ground truth" simply means an appropriate representation of the scenario run (including the behavior of the autonomous agent) that is taken as authoritative for the purposes of testing. Ground truth is inherent to simulation: the simulator computes a sequence of scenario states, which is, by definition, a perfect, authoritative representation of the simulated scenario run. In a real-world scenario run, a "perfect" representation of the run does not exist in the same sense; nevertheless, suitably informative ground truth can be obtained in numerous ways, such as manual annotation based on on-board sensor data, automated/semi-automated annotation of such data (e.g., using offline/non-real-time processing), and/or using external information sources (such as external sensors, maps, etc.), etc.
The scenario ground truth typically includes a "trace" of the autonomous agent and of any other (salient) agents, as applicable. A trace is a history of an agent's location and motion over the course of the scenario. There are many ways a trace can be represented. Trace data will typically include spatial and motion data of an agent within the environment. An agent trace comprising a sequence of time-stamped agent states is provided for each agent, allowing the agent's state at different time steps to be visualized. The term is used in relation to both real scenarios (with real-world traces) and simulated scenarios (with simulated traces). A trace typically records an actual trajectory realized by the agent in the scenario. With regard to terminology, "trace" and "trajectory" may contain the same or similar types of information (such as a series of spatial and motion states over time). The term trajectory is generally favored in the context of planning (and may refer to future/predicted trajectories), whereas the term trace is generally favored in relation to past behavior in the context of testing/evaluation.
In a simulation context, a "scenario description" is provided to a simulator as input. For example, the scenario description may be encoded using a scenario description language (SDL) or in any other form that can be consumed by a simulator. A scenario description is typically a more abstract representation of a scenario that can give rise to multiple simulated runs. Depending on the implementation, a scenario description may have one or more configurable parameters that can be varied to increase the degree of possible variation. The degree of abstraction and parameterization is a design choice. For example, a scenario description may encode a fixed layout with parameterized environmental conditions (such as weather, lighting, etc.). Further abstraction is possible, however, for example with configurable road parameters (e.g., road curvature, lane configuration, etc.). The input to the simulator comprises the scenario description together with a chosen set of parameter values (as applicable). The latter may be referred to as a parameterization of the scenario. The configurable parameters define a parameter space (also referred to as the scenario space), and the parameterization corresponds to a point in the parameter space. In this context, a "scenario instance" may refer to an instantiation of a scenario in a simulator based on a scenario description and, if applicable, a chosen parameterization.
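As an illustrative sketch only (the dictionary-based format below is a hypothetical stand-in for an actual scenario description language), a parameterized scenario description and one chosen parameterization might look as follows:

```python
# Illustrative sketch: a scenario description with configurable parameters.
# A chosen parameterization is a single point in the scenario's parameter space;
# the simulator instantiates one scenario run per point.
cut_in_scenario = {
    "road_layout": "two_lane_highway",
    "parameters": {                      # configurable parameters and their ranges
        "challenger_start_gap_m": (10.0, 60.0),
        "challenger_speed_mps":   (15.0, 30.0),
        "ego_speed_mps":          (20.0, 30.0),
    },
}

parameterization = {                     # one selected point in the parameter space
    "challenger_start_gap_m": 25.0,
    "challenger_speed_mps":   22.0,
    "ego_speed_mps":          27.0,
}

def instantiate(scenario, point):
    """Check the point lies inside the parameter space, then hand it to the simulator."""
    for name, (lo, hi) in scenario["parameters"].items():
        assert lo <= point[name] <= hi, f"{name} outside scenario space"
    return {"road_layout": scenario["road_layout"], **point}
```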
For conciseness, the term scenario may also be used to refer both to a scenario run and to a scenario in the more abstract sense. The intended meaning will be clear from the context in which the term is used.
Example AV stack:
In order to provide relevant context for the described embodiments, further details of an example form of AV stack will now be described.
Fig. 1A shows a high-level schematic block diagram of an AV runtime stack 100. The runtime stack 100 is shown to include a perception (sub)system 102, a prediction (sub)system 104, a planning (sub)system 106, and a control (sub)system 108. As noted, the term (sub-)stack may also be used to describe the aforementioned components 102-108.
In a real-world context, the perception system 102 receives sensor outputs from an on-board sensor system 110 of the AV and uses those sensor outputs to detect external agents and measure their physical state, such as their position, velocity, acceleration, etc. The on-board sensor system 110 can take different forms, but generally comprises a variety of sensors such as image capture devices (cameras/optical sensors), lidar and/or radar units, satellite positioning sensors (GPS, etc.), motion/inertial sensors (accelerometers, gyroscopes, etc.), and so on. The on-board sensor system 110 thus provides rich sensor data from which detailed information about the surrounding environment and the state of the AV and any external actors (vehicles, pedestrians, cyclists, etc.) within that environment can be extracted. The sensor outputs typically comprise sensor data of multiple sensor modalities, such as stereo images from one or more stereo optical sensors, lidar, radar, etc. The sensor data of the multiple sensor modalities may be combined using filters, fusion components, etc.
The perception system 102 generally includes a plurality of perception components that cooperate to interpret the sensor outputs, thereby providing perception outputs to the prediction system 104.
In a simulation context, it may or may not be necessary to model the on-board sensor system 110, depending on the nature of the test, and in particular on where the stack 100 is "sliced" for testing purposes (see below). With higher-level slicing, no simulated sensor data is required, and therefore no complex sensor modeling is required.
The perception outputs from the perception system 102 are used by the prediction system 104 to predict the future behavior of external actors (agents), such as other vehicles in the vicinity of the AV.
The predictions computed by the prediction system 104 are provided to the planner 106, which uses the predictions to make autonomous driving decisions to be executed by the AV in a given driving scenario. The inputs received by the planner 106 would typically indicate a drivable area and would also capture the predicted movements of any external agents (obstacles, from the AV's perspective) within the drivable area. The drivable area can be determined using perception outputs from the perception system 102 in combination with map information, such as a high definition (HD) map.
A core function of the planner 106 is to plan trajectories for the AV (autonomous trajectories), taking into account the predicted agent motion. This may be referred to as trajectory planning. A trajectory is planned in order to carry out a desired goal within a scenario. The goal could, for example, be to enter a roundabout and leave it at a desired exit, to overtake a vehicle in front, or to stay in the current lane at a target speed (lane following). The goal may, for example, be determined by an autonomous route planner (not shown).
The controller 108 executes the decisions taken by the planner 106 by providing suitable control signals to an on-board actor system 112 of the AV. In particular, the planner 106 plans trajectories for the AV and the controller 108 generates control signals to implement the planned trajectories. Typically, the planner 106 will plan into the future, such that a planned trajectory may only be partially implemented at the control level before a new trajectory is planned by the planner 106. The actor system 112 includes "primary" vehicle systems, such as braking, acceleration, and steering systems, as well as secondary systems (e.g., signaling, wipers, headlights, etc.).
Note that there may be a distinction between a planned trajectory at a given time instant and the actual trajectory followed by the autonomous agent. Planning systems typically operate over a sequence of planning steps, updating the planned trajectory at each planning step to account for any changes in the scenario since the previous planning step (or, more precisely, any changes that deviate from the predicted changes). The planning system 106 may reason into the future, such that the planned trajectory at each planning step extends beyond the next planning step. Any individual planned trajectory may therefore not be fully realized (if the planning system 106 is tested in isolation in simulation, the autonomous agent may simply be assumed to follow the planned trajectory exactly up to the next planning step; however, as noted, in other real and simulated contexts the planned trajectory may not be followed exactly up to the next planning step, as the behavior of the autonomous agent may be influenced by other factors, such as the operation of the control system 108 and the real or modeled dynamics of the autonomous vehicle).
The example of fig. 1A considers a relatively "modular" architecture, with separable perception, prediction, planning, and control systems 102-108. The sub-stacks themselves may also be modular, for example with separable planning modules within the planning system 106. For example, the planning system 106 may comprise multiple trajectory planning modules that can be applied in different physical contexts (e.g., simple lane driving versus complex junctions or roundabouts). This is relevant to simulation testing for the reasons noted above, as it allows components (such as the planning system 106 or individual planning modules thereof) to be tested individually or in different combinations. For the avoidance of doubt, with a modular stack architecture, the term stack can refer not only to the full stack but to any individual subsystem or module thereof.
The extent to which the various stack functions are integrated or separable can vary significantly between different stack implementations, and in some stacks certain aspects may be so tightly coupled as to be indistinguishable. For example, in some stacks planning and control may be integrated (e.g., such stacks may plan directly in terms of control signals), whereas other stacks (such as that depicted in fig. 2B) may be architected in a way that draws a clear distinction between the two (e.g., with planning in terms of trajectories, and with separate control optimizations to determine how best to execute a planned trajectory at the control-signal level). Similarly, in some stacks prediction and planning may be more tightly coupled. At the extreme, in so-called "end-to-end" driving, perception, prediction, planning, and control may be essentially inseparable. Unless otherwise indicated, the perception, prediction, planning, and control terminology used herein does not imply any particular coupling or modularity of those aspects.
It will be understood that the term "stack" encompasses software, but can also encompass hardware. In simulation, the stack software may be tested on a "generic" off-board computer system before it is eventually uploaded to the on-board computer system of a physical vehicle. However, in "hardware-in-the-loop" testing, the testing may extend to the underlying hardware of the vehicle itself. For example, the stack software may be run on the on-board computer system (or a replica thereof) coupled to the simulator for the purpose of testing. In this context, the stack under test extends to the underlying computer hardware of the vehicle. As another example, certain functions of the stack 100 (e.g., perception functions) may be implemented in dedicated hardware. In a simulation context, hardware-in-the-loop testing could involve feeding synthetic sensor data to dedicated hardware perception components.
Test oracle
Fig. 1B shows a highly schematic overview of a testing paradigm for autonomous vehicles. An ADS/ADAS stack 100 (e.g., of the kind depicted in fig. 1A) is subject to repeated testing and evaluation in simulation, by running multiple scenario instances in a simulator 202 and evaluating the performance of the stack 100 (and/or individual sub-stacks thereof) in a test oracle 252. The output of the test oracle 252 is informative to an expert 122 (team or individual), allowing them to identify issues in the stack 100 and modify the stack 100 to mitigate those issues (S124). The results also assist the expert 122 in selecting further scenarios for testing (S126), and the process continues, repeatedly modifying, testing, and evaluating the performance of the stack 100 in simulation. The improved stack 100 is eventually incorporated (S125) in a real-world AV 101, equipped with a sensor system 110 and an actor system 112. The improved stack 100 typically includes program instructions (software) executed in one or more computer processors of an on-board computer system of the vehicle 101 (not shown). At step S125, the software of the improved stack is uploaded to the AV 101. Step S125 may also involve modifications to the underlying vehicle hardware. On board the AV 101, the improved stack 100 receives sensor data from the sensor system 110 and outputs control signals to the actor system 112. Real-world testing (S128) can be used in combination with simulation-based testing. For example, having reached an acceptable level of performance through the process of simulation testing and stack refinement, suitable real-world scenarios may be selected (S130), and the performance of the AV 101 in those real scenarios may be captured and similarly evaluated in the test oracle 252.
Scenarios can be obtained for the purpose of simulation in various ways, including manual encoding. The system is also capable of extracting scenarios from real-world runs for simulation purposes, allowing real-world scenarios and variations thereof to be recreated in the simulator 202.
Fig. 1C shows a high-level schematic block diagram of a scenario extraction pipeline. Real-world operational data 140 is passed to a "ground-truthing" pipeline 142 for the purpose of generating scenario ground truth. The operational data 140 could comprise, for example, sensor data and/or perception outputs captured/generated on board one or more vehicles (which could be autonomous, human-driven, or a combination thereof), and/or data captured from other sources, such as external sensors (CCTV, etc.). The operational data is processed within the ground-truthing pipeline 142 in order to generate appropriate ground truth 144 (traces and contextual data) for the real-world run. As discussed, the ground-truthing process could be based on manual annotation of the "raw" operational data 140, or the process could be entirely automated (e.g., using offline perception methods), or a combination of manual and automated ground-truthing could be used. For example, 3D bounding boxes may be placed around vehicles and/or other agents captured in the operational data 140, in order to determine the spatial and motion states of their traces. A scenario extraction component 146 receives the scenario ground truth 144 and processes it to extract a more abstract scenario description 148 that can be used for simulation purposes. The scenario description 148 is consumed by the simulator 202, allowing multiple simulated runs to be performed. The simulated runs are variations of the original real-world run, with the degree of possible variation determined by the extent of abstraction. Ground truth 150 is provided for each simulated run.
The scenario extraction shown in fig. 1C may also be used to extract real-world driving run data for testing and visualization; that is, real autonomous agent states of a driving run may be extracted in order to visualize the real-world driving run in a visualization user interface, as described later. It should be noted that the term "operational data" is used herein to refer both to the "raw" operational data (such as sensor data, etc.) collected by the autonomous vehicle during a real-world driving run, and to the processed operational data used in testing and visualization (which includes the "ground truth" traces and contextual data of the autonomous vehicle). In the context of rule evaluation, the operational data further includes the numerical scores measuring the performance of the autonomous vehicle against one or more run evaluation rules, which may include perception error rules or driving rules, as described in more detail below.
The test oracle 252 applies a rule-based model to evaluate the real or simulated behavior of the autonomous vehicle stack (also referred to herein as the autonomous agent), as determined by the planner 106. However, the testing paradigm shown in fig. 1B may also be implemented in the context of assessing perception errors, in which a "perception oracle" takes the place of the test oracle 252 and processes the scenario to assess perception errors against rules defined within a rule-based model (referred to herein as the "perception error framework") similar to that described below for planning. Perception errors are obtained by comparing the perception outputs generated by the perception component 102 with the scenario ground truth which, as described above, is inherent to simulation and can be generated for real-world scenarios using the ground-truthing pipeline 142. The evaluation of perception errors within the perception error framework is described in further detail below. Perception error evaluation is also described in United Kingdom patent application nos. 2108182.3, 2108958.6, 2108952.9 and 2111765.0, each of which is incorporated herein by reference in its entirety.
Simulation context
Further details of the test pipeline and the test oracle 252 will now be described. The following examples focus on simulation-based testing. However, as noted, the test oracle 252 can equally be applied to evaluate stack performance on real scenarios, and the relevant description below applies equally to real scenarios. The following description refers to the stack 100 of fig. 1A by way of example. However, as noted, the test pipeline 200 is highly flexible and can be applied to any stack or sub-stack operating at any level of autonomy.
Fig. 2A shows a schematic block diagram of the test pipeline 200. The test pipeline 200 is shown to comprise the simulator 202 and the test oracle 252. The simulator 202 runs simulated scenarios for the purpose of testing all or part of an AV runtime stack, and the test oracle 252 evaluates the performance of the stack (or sub-stack) on the simulated scenarios.
The idea of simulation-based testing is to run a simulated driving scenario that an autonomous agent must navigate under the control of the stack (or sub-stack) being tested. Typically, the scenario includes a static drivable area (e.g., a particular static road layout) that the autonomous agent is required to navigate in the presence of one or more other dynamic agents (such as other vehicles, bicycles, pedestrians, etc.). Simulated inputs are fed to the stack under test, where they are used to make decisions. The autonomous agent is, in turn, caused to carry out those decisions, thereby simulating the behavior of an autonomous vehicle in those circumstances.
Simulated inputs 203 are provided to the stack under test. "Slicing" refers to the selection of a set or subset of stack components for testing. This, in turn, dictates the form of the simulated inputs 203.
For example, fig. 2A shows the prediction system 104, the planning system 106, and the control system 108 within the AV stack 100 being tested. To test the full AV stack of fig. 1A, the perception system 102 could also be applied during testing. In this case, the simulated inputs 203 would comprise synthetic sensor data that is generated using appropriate sensor models and processed within the perception system 102 in the same way as real sensor data. This requires the generation of sufficiently realistic synthetic sensor inputs (such as photorealistic image data and/or equally realistic simulated lidar/radar data, etc.). The resulting outputs of the perception system 102 would, in turn, feed into the higher-level prediction system 104 and planning system 106.
By contrast, so-called "planning-level" simulation would essentially bypass the perception system 102. The simulator 202 would instead provide simpler, higher-level inputs 203 directly to the prediction system 104. In some contexts, it may even be appropriate to bypass the prediction system 104 as well, in order to test the planner 106 on predictions obtained directly from the simulated scenario.
Between these extremes, there is a range of different levels of input slicing, for example testing only a subset of the perception system, such as the "later" (higher-level) perception components, i.e., components such as filters or fusion components that operate on the outputs from lower-level perception components (such as object detectors, bounding box detectors, motion detectors, etc.).
By way of example only, the test pipeline 200 is described with reference to the runtime stack 100 of fig. 1A. As discussed, it may be that only a sub-stack of the runtime stack is tested, but for simplicity the following description refers to the AV stack 100 throughout. In fig. 2A, reference numeral 100 can therefore denote a full AV stack or only a sub-stack, depending on the context.
Whatever their form, the simulated inputs 203 are used (directly or indirectly) as a basis for decision-making by the planner 106.
The controller 108 in turn implements the planner decisions by outputting control signals 109. In a real world context, these control signals will drive the physical actor system 112 of the AV.
In the simulation, the autonomous vehicle dynamics model 204 is used to convert the generated control signals 109 into realistic movements of the autonomous agent within the simulation, thereby simulating the physical response of the autonomous vehicle to the control signals 109.
To the extent that external agents exhibit autonomous behavior/decision-making within the simulator 202, some form of agent decision logic 210 is implemented to carry out those decisions and determine agent behavior within the scenario. The agent decision logic 210 could be comparable in complexity to the autonomous stack 100 itself, or it could have more limited decision-making capability. The aim is to provide sufficiently realistic external agent behavior within the simulator 202 to be able to usefully test the decision-making capabilities of the autonomous stack 100. In some contexts, this does not require any agent decision logic 210 at all (open-loop simulation), and in other contexts useful testing can be provided using relatively limited agent logic 210, such as basic adaptive cruise control (ACC). One or more agent dynamics models 206 may be used to provide more realistic agent behavior.
The simulation of a driving scenario is run according to a scenario description 201 having both a static layer 201a and a dynamic layer 201b.
The static layer 201a defines static elements of a scene that will typically include a static road layout.
The dynamic layer 201b defines dynamic information about external agents within the scenario, such as other vehicles, pedestrians, bicycles, etc. The extent of the dynamic information provided can vary. For example, for each external agent, the dynamic layer 201b may comprise a spatial path to be followed by the agent, together with one or both of motion data and behavior data associated with the path. In simple open-loop simulation, an external actor simply follows the spatial path and motion data defined in the dynamic layer, which is non-reactive, i.e., does not respond to the autonomous agent within the simulation. Such open-loop simulation can be implemented without any agent decision logic 210. However, in closed-loop simulation, the dynamic layer 201b instead defines at least one behavior (such as an ACC behavior) to be followed along a static path. In this case, the agent decision logic 210 implements that behavior within the simulation in a reactive manner, i.e., reactive to the autonomous agent and/or other external agents. Motion data may still be associated with the static path, but in this case it is less prescriptive and may, for example, serve as a target along the path. For example, with an ACC behavior, a target speed may be set along the path, which the agent will seek to match, but the agent decision logic 210 may be permitted to reduce the speed of the external agent below the target at any point along the path in order to maintain a target separation from a vehicle in front.
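Purely by way of illustration, a minimal ACC-like behavior of the kind described above could be sketched as follows (the function, parameter names, and numbers are assumptions):

```python
# Illustrative sketch of closed-loop agent decision logic with a simple ACC-like
# behaviour: follow a target speed along a fixed path, but slow down to maintain
# a target separation from the vehicle in front (numbers are arbitrary).
def acc_speed(target_speed, gap_to_lead, lead_speed,
              target_gap=15.0, gain=0.5):
    if gap_to_lead >= target_gap:
        return target_speed                      # free driving at the target speed
    # reduce speed below the target to restore the separation distance
    return max(0.0, min(target_speed, lead_speed + gain * (gap_to_lead - target_gap)))
```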
For a given simulation, the output of simulator 202 includes an autonomous trace 212a of the autonomous agent and one or more agent traces 212b (traces 212) of one or more external agents.
A trace is a complete history of an agent's behavior within the simulation, having both spatial and motion components. For example, a trace can take the form of a spatial path having motion data associated with points along the path, such as speed, acceleration, jerk (rate of change of acceleration), snap (rate of change of jerk), and so on.
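As an illustrative sketch, the motion components of a trace could be derived from time-stamped positions by finite differencing, for example:

```python
# Illustrative sketch: deriving the motion components of a trace (velocity,
# acceleration, jerk, snap) from time-stamped positions along a path.
import numpy as np

def motion_profile(t, x):
    """t: array of timestamps, x: array of positions along the path."""
    v = np.gradient(x, t)        # velocity
    a = np.gradient(v, t)        # acceleration
    jerk = np.gradient(a, t)     # rate of change of acceleration
    snap = np.gradient(jerk, t)  # rate of change of jerk
    return v, a, jerk, snap
```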
Additional information is also provided to supplement and provide context to the traces 212. Such additional information is referred to as "environmental" data 214, which can have both static components (such as the road layout) and dynamic components (such as weather conditions, to the extent that they vary over the course of the simulation). The environmental data 214 may be "passthrough" in part, in that it is directly defined by the scenario description 201 and is unaffected by the outcome of the simulation. For example, the environmental data 214 may include a static road layout that comes directly from the scenario description 201. However, typically the environmental data 214 would include at least some elements derived within the simulator 202. This could, for example, include simulated weather data, where the simulator 202 is free to change the weather conditions as the simulation progresses. In that case, the weather data may be time-dependent, and that time dependency will be reflected in the environmental data 214.
The test oracle 252 receives the traces 212 and the environmental data 214 and scores those outputs in the manner described below. The scoring is time-based: for each performance metric, the test oracle 252 tracks how the value (score) of that metric changes over time as the simulation progresses. The test oracle 252 provides an output 256 comprising a score-time plot for each performance metric, as described in more detail later. The scores are output for storage in a database 258, where they can be accessed, for example, to display results in a user interface as described above. The metrics 254 are informative to an expert, and the scores can be used to identify and mitigate performance issues within the tested stack 100.
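A hypothetical sketch of such time-based scoring and storage might look as follows (the metric interface and the sqlite schema are illustrative assumptions, not the actual implementation of the test oracle 252 or database 258):

```python
# Illustrative sketch: a time-based scoring loop in the style of the test oracle.
# Each metric is evaluated per timestep, giving a score-versus-time series that
# can be plotted and stored.
import sqlite3

def score_run(run_id, traces, environment_data, metrics, db_path="results.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS scores (run_id TEXT, metric TEXT, t REAL, score REAL)")
    for name, metric_fn in metrics.items():          # metric_fn(traces, env, t) -> float
        for t in traces.timestamps:                   # `timestamps` is a hypothetical attribute
            s = metric_fn(traces, environment_data, t)
            conn.execute("INSERT INTO scores VALUES (?, ?, ?, ?)", (run_id, name, t, s))
    conn.commit()
    conn.close()
```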
Perception error model
Fig. 2B illustrates a particular form of slicing and uses reference numerals 100 and 100S to denote the full stack and the sub-stack, respectively. The sub-stack 100S is to be subjected to testing within the test pipeline 200 of fig. 2A.
A number of "later" perception components 102B form part of the sub-stack 100S to be tested and are applied, during testing, to simulated perception inputs 203. The later perception components 102B could, for example, include filtering or other fusion components that fuse perception inputs from multiple earlier perception components.
In the full stack 100, the later perception components 102B would receive actual perception inputs 213 from the earlier perception components 102A. For example, the earlier perception components 102A might comprise one or more 2D or 3D bounding box detectors, in which case the simulated perception inputs provided to the later perception components could include simulated 2D or 3D bounding box detections, derived in the simulation via ray tracing. The earlier perception components 102A would typically include components that operate directly on sensor data.
With this slicing, the simulated perception inputs 203 would correspond in form to the actual perception inputs 213 that would normally be provided by the earlier perception components 102A. However, the earlier perception components 102A are not applied as part of the testing; instead, they are used to train one or more perception error models 208 that can be used to introduce realistic errors, in a statistically rigorous manner, into the simulated perception inputs 203 that are fed to the later perception components 102B of the sub-stack 100S under testing.
Such perception error models may be referred to as Perception Statistical Performance Models (PSPMs) or, synonymously, "PRISMs". Further details of the principles of PSPMs, and suitable techniques for building and training them, may be found in international patent application nos. PCT/EP2020/073565, PCT/EP2020/073562, PCT/EP2020/073568, PCT/EP2020/073563, and PCT/EP2020/073569, each of which is incorporated herein by reference in its entirety. The idea behind PSPMs is to efficiently introduce realistic errors into the simulated perception inputs provided to the later perception components 102B (i.e., errors that reflect the kind of errors that would be expected were the earlier perception components 102A applied in the real world). In a simulation context, "perfect" ground truth perception inputs 203G are provided by the simulator, but these are used to derive more realistic perception inputs 203 with realistic errors introduced by the perception error models 208.
As described in the above references, a PSPM can be dependent on one or more variables representing physical conditions ("confounders"), allowing different levels of error to be introduced that reflect different possible real-world conditions. Hence, the simulator 202 can simulate different physical conditions (e.g., different weather conditions) by simply changing the value of a weather confounder, which will, in turn, change how perception errors are introduced.
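Purely as an illustrative sketch (the noise model, confounder values, and numbers are invented for illustration and are not an actual PSPM), a confounder-dependent perception error model might be applied as follows:

```python
# Illustrative sketch of a PSPM-style perception error model: ground-truth
# detections are perturbed with noise whose magnitude depends on a confounder
# (here weather), so the planner under test sees realistically imperfect inputs.
import random

WEATHER_NOISE_STD = {"clear": 0.1, "rain": 0.3, "fog": 0.6}   # metres, hypothetical values

def apply_perception_error(ground_truth_boxes, weather="clear", drop_prob=0.02):
    std = WEATHER_NOISE_STD[weather]
    noisy = []
    for box in ground_truth_boxes:                     # box: (x, y, width, length)
        if random.random() < drop_prob:
            continue                                   # occasional missed detection
        x, y, w, l = box
        noisy.append((x + random.gauss(0, std), y + random.gauss(0, std), w, l))
    return noisy
```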
The later perception components 102B within the sub-stack 100S process the simulated perception inputs 203 in exactly the same way as they would process the real-world perception inputs 213 within the full stack 100, and their outputs, in turn, drive prediction, planning, and control. Alternatively, PSPMs can be used to model the entire perception system 102, including the later perception components 102B.
One example rule considered herein for evaluation by the test oracle 252 is a "safe distance" rule, which applies in a lane-following context and is evaluated between the autonomous agent and another agent. The safe distance rule requires the autonomous agent to maintain a safe distance from the other agent at all times. Both lateral and longitudinal distances are considered, and for the safe distance rule to pass it is sufficient for only one of those distances to meet some safety threshold. (Consider a lane driving scenario in which the autonomous agent and the other agent are in adjacent lanes: when driving alongside each other their longitudinal separation along the road may be zero or close to zero, and this is safe as long as a sufficient lateral separation between the agents is maintained. Similarly, for an autonomous agent driving behind the other agent in the same lane, with both agents essentially following the center line of the lane, their lateral separation perpendicular to the direction of the road may be zero or close to zero, and this is safe provided a sufficient longitudinal separation is maintained.) A numerical score is calculated for the safe distance rule at a given point in time based on whichever distance (lateral or longitudinal) is currently determined to be safe.
The safe distance rule has been chosen to illustrate certain principles underpinning the described approach because it is simple and intuitive. However, it will be appreciated that the described techniques can be applied to any rule designed to quantify some aspect (or aspects) of driving performance, such as safety, comfort and/or progress towards some defined goal, by way of a numerical "robustness score". The time-varying robustness score over the duration of a scenario run is denoted s(t), and the overall robustness score for the run is denoted y. For example, a robustness scoring framework for driving rules may be constructed based on signal temporal logic.
In general, a robustness score (such as the scores described below with reference to figs. 3A and 3B) is defined in terms of quantities such as the absolute or relative position, velocity, or other quantities describing the relative motion of the agents. A robustness score is typically defined with respect to a threshold placed on one or more of these quantities by a given rule (e.g. the threshold may define the minimum lateral distance to the nearest agent that is considered acceptable in terms of safety). The robustness score is then determined by whether a given quantity is above or below that threshold, and by the extent to which the quantity exceeds or falls short of it. The robustness score therefore provides a measure of how an agent is performing, or is expected to perform, relative to the other agents and its environment (including, for example, any speed limits defined within the drivable area of a given scenario, etc.). Note that numerical scores can be defined in a similar way for other aspects of the evaluation of a run, including, for example, for evaluating perception errors.
Fig. 3A schematically illustrates the geometrical principle of the safety distance rule evaluated between the autonomous agent E and another agent C (challenger).
Fig. 3A and 3B are described in conjunction with each other.
The longitudinal distance is measured along a road reference line (which may be straight or curved) and the lateral distance is measured in a direction perpendicular to the road reference line. The lateral and longitudinal separations (the distances between the autonomous agent E and the challenger C) are denoted d_lat and d_lon respectively. The lateral and longitudinal distance thresholds (safety distances) are denoted d_slat and d_slon respectively.

The safety distances d_slat, d_slon are typically not fixed, but vary depending on the relative speed of the agents (and/or other factors such as weather, road curvature, road surface, lighting, etc.). With the separations and safety distances expressed as functions of time t, the lateral and longitudinal "headroom" distances are defined as:

D_lat(t) = d_lat(t) − d_slat(t)

D_lon(t) = d_lon(t) − d_slon(t).
Fig. 3A (1) shows the autonomous agent E at a safe distance from the challenger C by virtue of the fact that, for the agent pair, the lateral separation d_lat is greater than the current lateral safety distance d_slat (positive D_lat).

Fig. 3A (2) shows the autonomous agent E at a safe distance from the challenger C by virtue of the fact that, for the agent pair, the longitudinal separation d_lon is greater than the current longitudinal safety distance d_slon (positive D_lon).

Fig. 3A (3) shows the autonomous agent E at an unsafe distance from the challenger C. The safe distance rule fails because D_lat and D_lon are both negative.
Fig. 3B illustrates a safe distance rule implemented as a computational graph applied to a set of scene ground truths 310 (or other scene data).
The lateral separation, lateral safety distance, longitudinal separation, and longitudinal safety distance are extracted from the scenario ground truth 310 as time-varying signals by first, second, third and fourth extractor nodes 302, 304, 312, 314 of the computational graph 300, respectively. The lateral and longitudinal headroom distances are computed by a first computational (estimator) node 306 and a second computational (estimator) node 316, and are converted to robustness scores as follows. The following example considers a normalised robustness score over some fixed range, such as [-1, 1], with 0 as the pass threshold.
The headroom quantifies the degree to which the relevant safety distance is, or is not, breached: a positive lateral/longitudinal headroom distance means that the lateral/longitudinal separation between the autonomous agent E and the challenger C is greater than the current lateral/longitudinal safety distance, and a negative headroom distance means the opposite. Following the principles set out above, the robustness scores for the lateral and longitudinal distances may, for example, be defined as follows:
s_lat(t) = min[1, D_lat(t)/A_lat] if D_lat(t) > 0

s_lat(t) = max[−1, D_lat(t)/B_lat] if D_lat(t) ≤ 0

s_lon(t) = min[1, D_lon(t)/A_lon] if D_lon(t) > 0

s_lon(t) = max[−1, D_lon(t)/B_lon] if D_lon(t) ≤ 0
Here, A and B denote predefined normalisation distances (which may be the same or different for the lateral and longitudinal scores). For example, it can be seen that the longitudinal robustness score s_lon(t) varies between 1 and −1 as D_lon(t) varies between A_lon and −B_lon. For D_lon(t) > A_lon the longitudinal robustness score is fixed at 1, and for D_lon(t) < −B_lon it is fixed at −1. Between those limits, the longitudinal robustness score s_lon(t) varies continuously with the longitudinal headroom. The same considerations apply to the lateral robustness score. As will be appreciated, this is merely one example, and the robustness score s(t) may be defined differently in terms of the headroom distances.
Score normalization is convenient because it makes rules more interpretable and facilitates comparing scores between different rules. However, normalizing the score in this manner is not necessary. The score may be defined over any range having any failure threshold (not necessarily zero).
The overall robustness score s(t) for the safe distance rule is computed by a third estimator node 308 as:

s(t) = max[s_lat(t), s_lon(t)]

The rule passes when s(t) > 0 and fails when s(t) < 0. When s(t) = 0 (implying that one of the longitudinal and lateral separations is equal to its safety distance), the rule "only just" fails, with s = 0 representing the boundary (performance category) between PASS and FAIL outcomes. Alternatively, s = 0 could be defined as the point at which the autonomous agent only just passes; this is an immaterial design choice, and for that reason the terms "pass threshold" and "fail threshold" are used interchangeably herein to refer to a subset of the parameter space for which the robustness score y = 0.
A pass/fail result (or, more generally, a performance category) may be assigned to each time step of the scenario run based on the robustness score s(t) at that time, which is useful to an expert interpreting the results.
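A minimal Python sketch of the scoring described above, assuming illustrative normalisation distances and precomputed separation and safety-distance signals for a single time step (the function names are for illustration only, not part of the described system):

```python
def normalised_score(headroom, a=5.0, b=5.0):
    """Map a headroom distance D(t) to a robustness score in [-1, 1],
    with 0 as the pass/fail threshold. a and b are illustrative
    normalisation distances (A and B in the text)."""
    if headroom > 0:
        return min(1.0, headroom / a)
    return max(-1.0, headroom / b)

def safe_distance_score(d_lat, d_slat, d_lon, d_slon):
    """Overall robustness for the safe distance rule: the rule passes if
    either the lateral or the longitudinal separation is safe, so the
    combined score is the larger of the two component scores."""
    s_lat = normalised_score(d_lat - d_slat)   # lateral headroom D_lat(t)
    s_lon = normalised_score(d_lon - d_slon)   # longitudinal headroom D_lon(t)
    return max(s_lat, s_lon)

# Example time step: agents side by side in adjacent lanes - no longitudinal
# headroom, but sufficient lateral separation, so the rule passes.
s = safe_distance_score(d_lat=3.2, d_slat=1.5, d_lon=0.0, d_slon=10.0)
category = "PASS" if s > 0 else "FAIL"
```

Evaluating such a function at every time step of a run yields the time series s(t) from which the per-time-step performance categories are derived.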
In addition to evaluating driving behaviour against driving rules, the rule framework described above may be used to evaluate other aspects of the autonomous vehicle stack that contribute to performance, for example by defining rules over perception errors. A perception error is determined relative to a set of ground truth detections. In simulation, the ground truth is inherently available; for a real-world driving scenario, the ground truth detections may be generated by manual annotation, or by applying an offline perception pipeline that uses offline detection and refinement techniques, not available to the autonomous vehicle in real time, to produce high-quality perception outputs, referred to herein as "pseudo ground truth" perception outputs.
Fig. 8 shows an architecture for evaluating perception errors. A diversion tool 152, which includes a perception predictor 1108, is used to extract and evaluate perception errors for real and simulated driving scenarios, and outputs results to be presented in the GUI 500 alongside the results from the test predictor 252. Note that while the diversion tool 152 is referred to herein as a perception diversion tool, it may more generally be used to extract and present driving data to the user, including both perception data and driving performance data, which is useful for testing and improving the autonomous vehicle stack.
For real sensor data 140 from a driving run, the output of the online perception stack 102 is passed to the diversion tool 152 to determine numerical "real-world" perception errors 1102 based on the extracted ground truth 144, which is obtained by running both the real sensor data 140 and the online perception outputs through the ground truth pipeline 400.
Similarly, for a simulated driving run in which the sensor data is simulated from scratch and a perception stack is applied to the simulated sensor data, simulated perception errors 1104 are computed by the diversion tool 152 based on a comparison of the detections from the perception stack with the simulated ground truth. In the simulated case, however, the ground truth is available directly from the simulator 202.
In the case where the simulator models the perception error directly to simulate the output of the perception stack, the difference between the simulated detection and the simulated ground truth, i.e., the simulated perception error 1110, is known and this is passed directly to the perception predictor 1108.
The perception predictor 1108 receives a set of perception rule definitions 1106, which may be defined via a user interface or written in a domain-specific language, described in more detail later. The perception rule definitions 1106 may apply thresholds or rules defining perception errors and limits on them. The perception predictor applies the defined rules to the real or simulated perception errors obtained for a driving scenario and determines where the perception errors violate the defined rules. These results are passed to a presentation component 1120, which renders visual indicators of the evaluated perception rules for display in the graphical user interface 500. Note that, for clarity, the inputs to the test predictor are not shown in fig. 8, but the test predictor 252 also operates on the ground truth scenario obtained from the ground truth pipeline 400 or the simulator 202.
Further details of the framework for evaluating the perception errors of a real-world driving stack against the extracted ground truth will now be described. As noted above, both the perception error analysis and the driving rule analysis performed by the test predictor 252 may be incorporated into a real-world driving analysis tool, described in more detail below.
Not all errors are equally important. For example, a translational error of 10 cm for an agent ten metres from the autonomous vehicle is much more significant than the same translational error for an agent one hundred metres away. A straightforward solution would be to scale the error based on distance from the autonomous vehicle. However, the relative importance of different perception errors, or the sensitivity of driving performance to different errors, depends on the use case of a given stack. For example, a cruise control system designed to run on a straight road should be sensitive to translational errors, but need not be particularly sensitive to orientation errors. By contrast, an AV handling entry to a roundabout is highly sensitive to orientation errors, because it uses the detected orientation of an agent as an indicator of whether that agent is leaving the roundabout, and hence of whether it is safe to enter. It is therefore desirable for the sensitivity of the system to different perception errors to be configurable for each use case.
Perception errors are defined using a domain-specific language. This may be used to generate a perception rule, for example by defining allowable limits on translational error. The rules implement a configurable set of acceptable error levels for different distances from the autonomous vehicle. For example, when a vehicle is less than 10 metres away, the error in its position (i.e. the distance between the car's detection and the refined pseudo ground truth detection) may be required to be no more than 10 cm. If an agent is one hundred metres away, an acceptable error may be defined as up to 50 cm. Using a look-up table, rules can be defined to suit any given use case. More complex rules may be built on these principles. For example, rules may be defined such that errors for certain other agents are ignored entirely based on their position relative to the autonomous vehicle, such as agents in an oncoming lane when the autonomous vehicle's lane is separated from the oncoming traffic by a divider. Traffic beyond a defined cut-off distance behind the autonomous vehicle may likewise be ignored by rule definition.
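The following sketch is illustrative only (the table values, names and cut-off are assumptions rather than the actual DSL); it shows how such a distance-dependent position-error rule might be expressed with a look-up table and a cut-off for traffic behind the ego vehicle:

```python
# Illustrative lookup table: (maximum agent range in metres, allowed position error in metres).
POSITION_ERROR_LIMITS = [(10.0, 0.10), (30.0, 0.25), (100.0, 0.50)]
REAR_CUTOFF_M = 20.0  # ignore agents more than 20 m behind the autonomous vehicle

def allowed_position_error(agent_range):
    """Return the allowed position error for an agent at the given range."""
    for max_range, limit in POSITION_ERROR_LIMITS:
        if agent_range <= max_range:
            return limit
    return float("inf")  # beyond the table, the error is unconstrained

def position_error_rule(agent_range, longitudinal_offset, position_error):
    """True if the detection satisfies the rule, or if the agent is ignored entirely."""
    if longitudinal_offset < -REAR_CUTOFF_M:  # agent is far behind the autonomous vehicle
        return True
    return position_error <= allowed_position_error(agent_range)

# e.g. a 12 cm position error on an agent 8 m ahead violates the 10 cm limit
violated = not position_error_rule(agent_range=8.0, longitudinal_offset=8.0, position_error=0.12)
```

Adjusting the table and cut-off per stack is one way the use-case-specific sensitivity described above could be configured.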
A set of rules may then be applied together to a given driving scenario by defining a perception error specification that includes all of the rules to be applied. Typical perception rules that may be included in such a specification define thresholds for longitudinal and lateral translational errors (the average error of a detection relative to ground truth in the longitudinal and lateral directions respectively), orientation errors (the minimum angle through which a detection would need to be rotated to align it with the corresponding ground truth), and size errors (errors in each dimension of a detected bounding box, or a measure based on the union of the aligned ground truth and detected boxes). Other rules may relate to vehicle dynamics, including errors in the speed and acceleration of agents, and to classification errors, such as defining a penalty for misclassifying a vehicle as a pedestrian or a truck. Rules may also cover false positive detections, missed detections and detection latency.
A robustness score may be established based on the defined perception rules. In practice this captures the requirement that, if the detections are within the thresholds specified by the rules, the system should be capable of driving safely; if they are not (e.g. because they are too noisy), something may occur that the autonomous vehicle cannot handle, and this should be formally captured. More complex rule combinations may be included, for example to evaluate detections over time, or to incorporate dependencies on weather conditions.
The perceptual error framework is described in more detail in uk patent application nos. 2108182.3, 2108958.6, 2108952.9 and 2111765.0, which are incorporated herein by reference in their entirety.
User interface
The test framework described above (i.e. the test predictor 252 and the perception diversion tool 152) may be combined in a real-world driving analysis tool, in which both the perception and the driving evaluation are applied to the pseudo ground truth extracted by the ground truth pipeline 400, as shown in fig. 1C.
The rule-based analysis described above, for both the planning and perception aspects of the AV stack, produces numerical scores that provide an indicator of autonomous vehicle performance for each scenario. This numerical data can be interpreted directly by an expert, as described above, to identify problems with the stack and thereby improve it. A user interface will now be described that provides a visualization of the scenario under test together with the results of the rule evaluation, so as to give the user the context of the scenario when identifying problems with the stack based on the test results. The graphical user interface described in detail below provides graphs of the numerical scores obtained by applying the defined rules to signals extracted from the scenario, and also provides a visualization of the scenario data, such that any changes in the signals on which the numerical scores are based are also visible to the user in the scenario visualization. This is useful in a variety of applications.
In one example application, the user interface may be used to visualize a real-world scenario, and the visualization may include a representation of the scene with the perception outputs (e.g. bounding boxes) generated by the perception component 102 shown alongside annotations of those outputs produced, for example, by a ground truth pipeline or by manual annotation. This allows an expert user to identify easily where the autonomous vehicle's perception deviates significantly from the "ground truth" perception outputs; for example, the user may notice a significant difference in the orientation of the bounding box representing an agent in front of the autonomous vehicle, indicating an orientation error. This may be used to establish where the perception component 102 of the autonomous stack has made a perception error and thus to improve the perception stack. Another possible application is to identify where the ground truth annotations themselves are incorrect, in which case this information can be used to refine the ground truthing method (whether manual or using an automated ground truth pipeline). The visualization may also display raw sensor data alongside the perception outputs, which can help an expert user identify whether the source of an error is a failure in the autonomous vehicle's perception stack or a failure in the ground truth perception. For example, where there is an orientation error between a bounding box output by the perception stack 102 and the ground truth bounding box, and a camera image or set of lidar measurements is overlaid on the visual representation of the scene, the expert user can easily identify the correct orientation of the agent in the scene, and therefore which perception output is in error. The user could not easily identify the source of the perception error from the numerical difference between the orientations of the two bounding boxes alone.
FIG. 9A illustrates an example user interface for analyzing a driving scenario extracted from real-world data. In the example of fig. 9A, a schematic top-down representation 1204 of the scene is shown, derived from point cloud data (e.g. lidar, radar, or data derived from stereo or monocular depth imaging), with a corresponding camera frame 1224 shown as an inset. Road layout information may be obtained from high-definition map data. The camera frame 1224 may also be annotated with detections. The UI can also show sensor data collected during driving, such as lidar, radar or camera data; this is shown in fig. 9B. The scene visualization 1204 is also overlaid with annotations based on the derived pseudo ground truth and on detections from the on-board perception components.
In the example shown, there are three vehicles, each annotated by a box. Solid boxes 1220 show the pseudo ground truth for the agents of the scene, while outlines 1222 show the unrefined detections from the autonomous vehicle's perception stack 102. A visualization menu 1218 is shown, in which the user can select which sensor data and which online and offline detections to display; these can be turned on and off as desired. Raw sensor data may be shown alongside both the vehicle detections and the ground truth detections to allow the user to identify or confirm certain errors in the vehicle detections. The UI 500 allows playback of the selected footage and shows a timeline view in which the user can select any point 1216 in the footage to show the bird's-eye view and a snapshot of the camera frame corresponding to the selected point in time.
As described above, the perception stack 102 may be evaluated by comparing its detections with the refined pseudo ground truth 144. The perception is evaluated against the defined perception rules 1106, which may be based on the use case of a particular AV stack. These rules specify acceptable ranges for the differences between the position, orientation or size of the vehicle's detections and the position, orientation or size of the pseudo ground truth detections. As described above, the rules may be defined in a domain-specific language. As shown in fig. 9A, the perception rule results are shown along a "top-level" perception timeline 1206 for the driving scenario, which aggregates the results of the perception rules, marking the periods on the timeline during which any of the perception rules is violated. This can be expanded into a separate set of perception rule timelines 1210, one for each defined rule.
The perception error timeline may be "zoomed out" to show longer periods of a driving run. In the zoomed-out view, the perception errors may not be displayed at the same granularity as when zoomed in. In this case, the timeline may display an aggregation of the perception errors over a time window to provide a summarized set of perception errors for the zoomed-out view.
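A minimal sketch of the kind of windowed aggregation mentioned above (the function and window size are assumptions): per-frame rule results are grouped into fixed windows and each window is summarised so that short failures remain visible when zoomed out.

```python
def aggregate_timeline(per_frame_failures, window_size):
    """per_frame_failures: list of booleans, True where a perception rule is
    violated at that frame. Returns one summary value per window (the number
    of failing frames), so a window is flagged if any frame within it failed."""
    windows = []
    for start in range(0, len(per_frame_failures), window_size):
        window = per_frame_failures[start:start + window_size]
        windows.append(sum(window))  # 0 = clean window, >0 = contains failures
    return windows

# e.g. 6000 frames summarised into 1-second windows of 100 frames each
summary = aggregate_timeline([False] * 5990 + [True] * 10, window_size=100)
```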
A second, driving evaluation timeline 1208 shows how the pseudo ground truth data is evaluated against the driving rules. The aggregated driving rules are displayed in a top-level timeline 1208, which may be expanded to display a separate set of timelines 1212, one showing performance against each defined driving rule. As shown, each rule timeline may be further expanded to display a graph of the numerical performance score for the given rule over time. In this case, the pseudo ground truth detections 144 are treated as the actual driving behaviour of the agents in the scene. The autonomous vehicle's behaviour can then be evaluated against defined driving rules, for example based on a digital highway code, to see whether the vehicle behaved safely in the given scenario.
In fig. 9A, each driving rule timeline is expandable to show a graph of the associated robustness score. The timeline of the "comfort_02" driving rule is shown in an expanded state, such that the xy graph of its robustness score 1240 is visible. A pass/fail threshold 1242 is shown, where a failing region 1244 on the timeline corresponds to the region 1246 of the graph in which the score is below the threshold 1242. The user may "scrub" along the graph (e.g. around the pass/fail boundary) to visually map the time-varying graph 1240 to the corresponding changes in the run visualization 1204. A marker (scrubber bar) 1248 is shown extending vertically through all of the rule timelines to indicate the current time step of the visualization 1204. By moving the marker 1248 horizontally along the rule timelines, the user scrubs through the scenario (to see the scene visualization at different time steps). Colour coding may be applied to the xy graph to show the regions above the pass/fail threshold in a different colour from those below it. Further details of the scrubbing mechanism are described below with reference to figs. 9C and 9D.
In summary, both the perception rule evaluation and the driving evaluation are based on detections from real-world driving refined using the offline perception methods described above. For the driving evaluation, the refined pseudo ground truth 144 is used to evaluate the autonomous vehicle's behaviour against the driving rules. As shown in fig. 1C, it may also be used to generate simulated scenarios for testing. For the perception rule evaluation, the perception diversion tool 152 compares the recorded vehicle detections with the offline refined detections to quickly identify possible perception failures.
Driving notes may also be displayed in a driving notes timeline view 1214, in which notable events marked during the drive can be displayed. For example, the driving notes may include points at which the vehicle braked or turned, or points at which a human driver took over from the AV stack.
Additional timelines may be displayed in which user-defined metrics are shown to assist the user in debugging and diagnosing potential problems. User-defined metrics may be defined to identify errors or stack defects and to isolate errors when they occur. The user may define custom metrics based on the goals of a given AV stack. Example user-defined metrics might flag when messages arrive out of order, or when perception messages are delayed. This is useful for diagnosis, because it can be used to determine whether a planning failure occurred due to a planner error or due to messages arriving late or out of order.
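As an illustration of such a user-defined metric (the message structure and latency limit are assumptions, not part of the described system), the following sketch flags perception messages that arrive out of order or later than an allowed latency:

```python
def message_order_metric(messages, max_latency_s=0.1):
    """messages: list of dicts with a 'stamp' (capture time) and a 'received'
    (arrival time), both in seconds. Returns per-message flags that could be
    plotted on an additional user-defined metric timeline."""
    flags = []
    last_stamp = float("-inf")
    for msg in messages:
        flags.append({
            "out_of_order": msg["stamp"] < last_stamp,
            "delayed": (msg["received"] - msg["stamp"]) > max_latency_s,
        })
        last_stamp = max(last_stamp, msg["stamp"])
    return flags
```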
Fig. 9B illustrates an example of the UI visualization 1204 displaying sensor data, with a camera frame 1224 displayed in an inset view. Typically, the sensor data shown is from a single snapshot in time. However, where high-definition map data is not available, each frame may show sensor data aggregated over multiple time steps to obtain a map of the static scene. As shown on the left, there are a number of visualization options 1218 to display or hide data, such as camera, radar or lidar data collected during the real scenario, or the online detections from the autonomous vehicle's own perception. In this example, the online detections from the vehicle are shown as coloured boxes 1222 overlaid on top of grey boxes 1220 representing the ground truth refined detections. An orientation error can be seen between the ground truth and the vehicle's detection.
The refinement process performed by the ground truth pipeline 400 is used to generate the pseudo ground truth 144 that serves as the basis for a number of tools. The UI shown displays results from the perception diversion tool 152, which, together with the test predictor 252, allows the driving capability of an ADAS to be evaluated for a single driving instance, defects to be detected, scenarios that reproduce a problem to be extracted (see fig. 1C), and identified problems to be sent to developers to improve the stack.
FIG. 9C illustrates an example user interface configured to enable a user to zoom in on a sub-segment of a scenario. Fig. 9C shows a snapshot of the scene, with the schematic representation 1204 and the camera frame 1224 shown in an inset view, as described above for fig. 9A. The set of perception error timelines 1206, 1210, as well as the expandable driving evaluation timeline 1208 and the driver notes timeline 1214 described above, are also shown in fig. 9C.
In the example shown in fig. 9C, the current snapshot of the driving scenario is indicated by a scrubber bar 1230 that extends over all of the timeline views simultaneously. This may be used in place of the indication 1216 of the current point in the scenario on a single playback bar. The user may click on the scrubber bar 1230 to select it and move it to any point in time of the driving scenario. For example, the user may be interested in a particular error, such as a point within a red segment, or a segment otherwise indicated on the position error timeline as containing an error, where the indication is determined based on the position error observed between the "ground truth" and the detections over the time period corresponding to the indicated segment. The user can click on the scrubber bar and drag it to the point of interest within the position error timeline. Alternatively, the user can click on a point on any timeline that the scrubber extends across in order to place the scrubber at that point. This updates the schematic view 1204 and the inset view 1224 to show the top-down schematic view and camera frame corresponding to the selected point in time. The user can then examine the schematic view and the available camera data or other sensor data to see the position error and identify possible causes of the perception error.
A "scale" bar 1232 is shown above the perception timeline 1206 and below the schematic view. This includes a series of notches indicating the time intervals of the driving scenario. For example, where a time interval of 10 seconds is displayed in the timeline view, notches indicating intervals of 1 second are shown. Some points in time are also marked with a numerical label (e.g. "0 seconds", "10 seconds", etc.).
The numerical score associated with a perception error rule may be continuous (e.g. floating point) or discrete (e.g. integer). A count of missed detections (as a function of time) is an example of an integer score. The degree of deviation from the perception ground truth (e.g. the position or orientation offset of a detection from the corresponding ground truth) is an example of a floating-point score. Colour coding may be used on the perception timelines to plot the change (or approximate change) in the score over time. For example, for integer scores, a different colour may be used for each integer value. A continuous score may be plotted using a colour gradient, or "quantized" into discrete bins indicated using discrete colour coding. Alternatively or additionally, the perception error timelines may be expandable (as in fig. 9A), in the same manner as the driving rule timelines, to view xy graphs of the associated robustness scores.
A zoom slider 1234 is provided at the bottom of the user interface. The user may drag a pointer along the zoom slider to change the portion of the driving scenario shown on the timelines. Alternatively, the position of the pointer may be adjusted by clicking on the desired point on the slider to which the pointer should be moved. A percentage is shown to indicate the currently selected zoom level. For example, if the full driving scenario is one minute long, the timelines 1206, 1208, 1214 show the respective perception errors, driving evaluations and driver notes over that one-minute drive, and the zoom slider shows 100%, with the pointer at its leftmost position. If the user slides the pointer until the zoom slider shows 200%, the timelines are adjusted to show only the results corresponding to a thirty-second segment of the scenario.
The zoom may be configured to adjust the displayed portion of the timeline according to the position of the scrubber bar. For example, with the zoom set to 200% for a one-minute scenario, the zoomed timeline shows a 30-second segment centred on the point in time at which the scrubber is located, i.e. the timeline shows the 15 seconds before and after the point indicated by the scrubber. Alternatively, the zoom may be applied with respect to a reference point (e.g. the beginning of the scenario); in this case, the enlarged segment shown on the timeline after zooming always starts at the beginning of the scenario. The granularity of the notches and numerical labels of the scale bar 1232 may be adjusted according to the degree to which the timeline is zoomed in or out. For example, where the view is zoomed in from a 30-second segment to a 3-second segment, the numerical labels may be displayed at 10-second intervals with notches at one-second intervals before zooming, and at one-second intervals with notches at 100 ms intervals after zooming. The visualizations of the time steps in the timelines 1206, 1208, 1214 are "stretched" to correspond to the enlarged segment. A higher level of detail can be displayed on the timeline in the zoomed-in view, since a smaller segment in time is represented by a larger area of the timeline display within the UI. Thus, errors spanning a very short time within a longer scenario may become visible in the timeline view only once zoomed in.
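The zoom behaviour described above can be summarised by a small calculation (a sketch under the assumption that zooming is centred on the scrubber position and clamped to the scene bounds):

```python
def visible_segment(scene_duration_s, zoom_percent, scrubber_time_s):
    """Return the (start, end) times of the timeline segment displayed at the
    given zoom level, centred on the scrubber where possible and clamped to
    the scene bounds."""
    visible = scene_duration_s * 100.0 / zoom_percent  # 200% of a 60 s scene -> 30 s
    start = min(max(scrubber_time_s - visible / 2.0, 0.0), scene_duration_s - visible)
    return start, start + visible

print(visible_segment(60.0, 200.0, 40.0))  # (25.0, 55.0): a 30 s window around t = 40 s
```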
Other zoom inputs may be used to adjust the timelines to display shorter or longer segments of the scenario. For example, where the user interface is implemented on a touch-screen device, the user may apply a zoom to the timelines with a pinch gesture. In another example, the user may scroll the mouse wheel forwards or backwards to change the zoom level.
Where the timelines are zoomed in to show only a subset of the driving scenario, the timelines may be scrolled in time to move the displayed portion, so that the user can examine different parts of the scenario in the timeline view. The user may scroll by clicking and dragging a scroll bar (not shown) at the bottom of the timeline view, or by using, for example, a touchpad of the device on which the UI is running.
The user may also select a segment of the scenario, for example to export it for further analysis or as the basis for a simulation. Fig. 9D shows how a user may select a portion of the driving scenario. The user clicks with a cursor on the relevant point on the scale bar 1232; this may be done at any zoom level, and sets a first limit of the user's selection. The user then drags the cursor along the timeline to expand the selection to the desired point in time. If zoomed in, this scrolls the timeline forwards and allows the selection to be expanded further by continuing to drag to the end of the displayed segment of the scenario. The user may stop dragging at any point, and the point at which the user stops sets the end limit of the selection. The bar 1230 at the bottom of the user interface displays the length of time of the selected segment, and this value is updated as the user drags the cursor to expand or reduce the selection. The selected segment 1238 is shown as a shaded segment on the scale bar; it may be indicated by a segment having a different colour from the rest of the scale bar. A number of buttons 1236 are shown providing user actions such as "extract trace scenario" to extract the data corresponding to the selection. This may be stored in a database of extracted scenarios, and may be used for further analysis or as a basis for simulating similar scenarios. After making the selection, the user may zoom in or out, and the selection 1238 on the scale bar 1232 expands or contracts along with the scale bar and the perception, driving evaluation and driving notes timelines.
The DSL may also be used to define a contract between the perception and planning parts of the system based on the robustness scores calculated for the defined rules. Fig. 10 shows an example graph of a robustness score defined for a given error (e.g. translational error). If the robustness score is above a defined threshold 1502, this indicates that the perception error is within expected performance and the system as a whole is expected to drive safely. If the robustness score falls below the threshold 1502, as shown in fig. 10, the error is "out of contract", in that the planner 106 cannot be expected to drive safely given that level of perception error. The contract essentially becomes a specification of requirements for the perception system. This can be used to assign responsibility to either the perception subsystem 102 or the planning subsystem 106: if the error is identified as being within the contract when the vehicle fails, this indicates a problem with the planner rather than with perception, and vice versa, where the perception is out of contract, the perception error is responsible.
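A minimal sketch of this contract check (the threshold value and names are assumptions): the robustness score of a perception rule is compared with the contract threshold and, when a driving rule fails, responsibility is suggested accordingly.

```python
CONTRACT_THRESHOLD = 0.0  # stands in for the threshold 1502 in fig. 10

def assign_responsibility(perception_robustness, driving_rule_failed):
    """Suggest which subsystem to investigate when a driving rule fails,
    based on whether the perception error was within the agreed contract."""
    in_contract = perception_robustness >= CONTRACT_THRESHOLD
    if not driving_rule_failed:
        return "no failure"
    return "planning subsystem" if in_contract else "perception subsystem"
```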
The contract information may be displayed in the UI 500 by annotating whether a perception error is considered in-contract or out-of-contract. This uses a mechanism that takes the contract specification from the DSL and automatically flags out-of-contract errors in the front end.
Further details of the above-described example user interfaces for visualizing perceived errors and driving rules are described in uk patent application nos. 2108182.3, 2108958.6, 2108952.9 and 2111765.0.
In another example application, as described in greater detail herein, the visualization may be used to allow expert users to investigate errors in driving behavior generated based on the output of the autonomous vehicle's planner 106. As described above, driving rules may be defined based on safety criteria that specify safety distances between vehicles in various situations, such that violations of these rules indicate a possible safety risk. However, as described with respect to fig. 3A-3C, the robustness score of the driving rules is not necessarily based on a measurable quantity that is easily interpretable. In the example given above, the robustness score for the lateral and longitudinal distances is equal to the normalized difference between the actual distance and the threshold distance, or 1 if the normalized difference is higher than some predetermined difference. This value is useful for easily determining the severity of rule failure, but is not easily interpreted in terms of real world driving. By viewing these results within a visualization of the scene, where the actual autonomous vehicles and other agents driving along the road are shown, the user can see the relative speed of the vehicles and the distance between them throughout the scene. The expert user may navigate to a point in the scene corresponding to, for example, the failure point based on the robustness score and navigate back in the scene to see what caused the rule to fail and possibly make an adjustment to the AV planner 106 to determine whether it may be avoided in the future.
The foregoing describes a framework for evaluating agents within a scenario according to a predefined set of rules and metrics relating to the agents' behaviour and/or perception errors. As described above, the AV stack 100 may be evaluated in simulation by assessing the performance of the autonomous agent over many simulated runs (or instances) of each of a set of abstract scenarios defined in a scenario description language and parameterized by a set of parameter values. A given instance of an AV stack is typically tested against a large number of scenarios with different parameters in a "test suite". A test suite is defined by a set of parameter ranges for the parameters of the scenarios to be run and a set of rules (or "rule set") against which the autonomous agent is evaluated for that test suite. Once the test suite has been run, a set of autonomous agent traces is generated, each comprising a time series of the autonomous agent's states in a run, together with a set of results comprising a pass/fail result for the autonomous agent against each rule for each scenario, and a time series of numerical scores (robustness scores) for the autonomous agent against each rule for each scenario, quantifying the degree of success or failure throughout the run. These results may be aggregated over the test suite to obtain an overall view of the autonomous vehicle's performance across the set of scenario parameters tested.
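As an illustration of the kind of per-run record such a test suite produces (the field names and structure are assumptions, not the actual schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class RunResult:
    run_id: str
    scenario: str
    parameters: Dict[str, float]                          # e.g. {"y_speed": 1.6}
    ego_trace: List[Tuple[float, float, float]]           # time-stamped ego states (t, x, y)
    rule_scores: Dict[str, List[float]] = field(default_factory=dict)  # rule name -> s(t) samples

    def passed(self, rule: str) -> bool:
        """Aggregate pass/fail: the run passes the rule if its robustness
        score never drops to or below the failure threshold of zero."""
        return min(self.rule_scores[rule]) > 0.0
```

Aggregating many such records over a test suite gives the overall view of performance across the tested parameter ranges.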
It is also useful to compare two runs directly. In one example, a user testing an AV stack may wish to compare the performance of the autonomous vehicle in two versions of the same abstract scenario that differ in a small number of scenario parameters, in order to obtain a fine-grained view of how a given parameter value affects the perception or behaviour of the autonomous agent in that abstract scenario. In another example, the same scenario with the same parameter values may be run for two different versions of the autonomous agent's stack, e.g. where the planner has changed from one instance of a given test suite to the next. In this case, where the pass or fail result for a given rule differs between the previous stack version and the current stack version, and particularly in a scenario where the autonomous vehicle previously passed the rule but fails it for the updated version (referred to herein as a regression), it is useful to view these runs in a common visualization tool to determine at what point in the scenario the behaviour of the two versions of the autonomous agent diverged and to allow the user to identify the cause of the regression.
FIG. 4 shows a schematic block diagram of a computer system for presenting a run visualization interface, according to an embodiment of the present disclosure. Fig. 4 shows the data for a first run 402 and a second run 404 provided as input to a renderer. However, the visualization interface may also be implemented to display a single run. As described above, the first and second runs may be instances of a scenario in which one or more scenario parameters take different values for the two runs; alternatively, where the two runs are tests of two respective versions of the autonomous stack, the scenario parameters may be the same. Each run comprises a time series 416 of autonomous agent states, including the spatial and motion coordinates of the autonomous vehicle at each time step of the run, as described above, together with a set of robustness scores 418 for the autonomous agent against each rule of a set of rules defined for its perception and/or behaviour during the run.
In addition to the run data, a map is provided to the renderer, defining the static road layout of the scenario. This includes representations of road lanes and road features such as junctions and roundabouts. Each scenario instance has an associated map, which may be obtained from a map database.
The presentation component 408 receives the run data for the two runs and the map data 406, and renders a common visualization 412 showing snapshots of the two runs overlaid on the same map, together with a graph 414 of the robustness score for each rule of the rule set, where the robustness scores of the two runs are plotted on a common set of axes. Controls may be provided for the user to manually align the two runs so that equivalent points of the two runs are shown together, allowing a direct visual comparison. Both the map visualization 412 and the robustness score graphs 414 include a timeline with a time marker 410 marking a common moment within both runs. The time marker 410 of the robustness score graphs may be implemented in the form of a scrubber bar 1230, as described above with reference to fig. 9C, or as a point, circle or other indicator along each timeline, as described below with reference to figs. 5 and 6.
A user control is provided to move the time marker of the map visualization 412 forwards, updating the visualization to show the state of each run's autonomous agent at the moment to which the marker is moved along the time axis. The control also updates the time marker 410 on the graph for each rule, identifying the robustness score of each run's autonomous agent at the selected moment, as indicated by a line in the robustness graph. The robustness graph 414 is shown in an expanded view in fig. 4, with the numerical robustness score along the y-axis and time along the x-axis. An alternative view of the robustness graph provides a binary indicator based on a pass/fail scheme, in which the timeline shows the time segments where the robustness score is above or below the pass/fail threshold, identified for example by a colour scheme in which the parts of the scenario run where the autonomous agent fails a given rule are shown in red on the timeline, while the parts where it complies with the rule are shown in green. This is shown and described in more detail below with reference to figs. 5 and 6.
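A sketch of how such a binary pass/fail timeline view could be derived from the time-varying robustness score (the segmentation approach shown is an assumption):

```python
def failure_segments(times, scores, threshold=0.0):
    """Return (start_time, end_time) intervals in which the robustness score is
    below the pass/fail threshold; these are the intervals drawn in red."""
    segments, start = [], None
    for t, s in zip(times, scores):
        if s < threshold and start is None:
            start = t                    # a failing interval begins
        elif s >= threshold and start is not None:
            segments.append((start, t))  # the failing interval ends
            start = None
    if start is not None:
        segments.append((start, times[-1]))
    return segments

# failure between t = 2 s and t = 4 s in a run sampled at 1 Hz
print(failure_segments([0, 1, 2, 3, 4, 5, 6], [1.0, 0.5, -0.2, -0.4, 0.1, 0.3, 0.6]))
```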
In the map visualization 412, the autonomous agent may be represented in a different colour for each run. Although not shown in fig. 4, a scenario run typically includes one or more external agents driving within the same road layout, and these are also represented in different colours so as to visually distinguish the agents of each run. The map visualization 412 and the robustness graphs are provided within a common user interface display, which is described in more detail with reference to figs. 5 and 6.
FIG. 5 illustrates an example run visualization user interface in a single-run view, in which two runs are available for display but only one is selected, via a selection pane having a first check box 506 for the first run and a second check box 504 for selecting the second run for display. As shown in fig. 5, the second check box is deselected, so only the first run is displayed in the visualization 412, and a set of rule evaluation timelines 508 is provided showing only the performance of the first run's autonomous agent. In this example, as described above, the rule evaluation timelines are displayed in a non-expanded view, with a single timeline shown as a line, where a failure of the autonomous agent against a given rule is shown as a red portion of that rule's timeline, and the time for which the autonomous agent did not fail the rule is shown in green. Each rule timeline 508 is identified by the name of the rule (e.g. DR_01) and the title of the rule (such as "collision"). Also shown is a numerical indicator 512 providing the numerical robustness score for the selected time step. An expansion control 514 is provided, which can be clicked by the user to display an expanded view of the rule timeline including a robustness graph for the rule, as described in more detail with reference to fig. 6.
The time step within the run is indicated by a time marker 410, shown as a small circle at the start of each rule timeline 508 and on an overall timeline provided at the bottom of the display. The user may adjust the marker on the overall timeline by clicking and dragging the indicator along the timeline to move the visualization to a selected point within the run. The displayed time markers 410 of the set of rule timelines and the time marker of the overall timeline refer to the same underlying data, so that a user control adjusting the time marker of one timeline also adjusts the time markers of all of the rule timelines 508. The robustness scores for each rule are indexed in time, so that updating the time marker for each rule causes the robustness score displayed in the numerical indicator 512 to be updated to reflect the selected point in time. A search bar is provided in which the user can enter a text filter to display only those rules relevant to a given keyword. For example, the user may enter "collision" to return the rule evaluation timelines of the rules whose name or description contains the word "collision".
In the example of fig. 5, the set of robustness/rule evaluation graphs/timelines 508 is displayed within the scene visualization 412, which comprises a map visualization and an overall timeline. Within the map visualization, the agent of the first run is shown travelling along a highway lane. At the selected time (in this example, the start time of the run), no other agents are within view.
A set of controls 516 is provided to adjust the display of the map. For example, these may include controls that reorient the map according to some predetermined default orientation (e.g. adjusting the map so that north corresponds to the upwards direction in the visualization). On the left, a "track agent" control is shown, which can be clicked to track the autonomous agent, so that the autonomous vehicle is always shown at the centre of the visualization during playback of the scenario. Sensor controls may be used to show a visualization of the field of view of each sensor of the host vehicle. A button with additional controls may be provided to display further options to the user, including, for example, measurement tools, a debug mode and different camera location views. A scale indicator shows a reference distance for comparison with distances in the driving scene.
In addition to the visualization 412, the user interface also includes a comparison table 502, which shows the rules applicable to the scenario and the aggregate pass/fail result of the autonomous agent for each selected run. As shown in fig. 5, the comparison table 502 identifies, at the top, the instances being compared, each identified by a corresponding index. The parameter values of the scenario for each instance are also shown; in this example, the y-speed is set to "1.6" for both runs. Other parameters may include weather conditions, lighting, etc. A table of rules is shown, in which each rule and a brief description of its function are displayed in a single row, and a fail or pass indicator for the rule is shown for each instance in a respective column of the table. As described above, the pass and fail conditions of each rule are specified in the rule definition. For example, a rule specifying a minimum distance from another vehicle may fail if the autonomous agent is less than that minimum distance from another agent even for a short period of time. The pass and fail results are shown in green and red to allow the user to quickly identify rules for which the two runs disagree. Where a given rule fails for one run and passes for the other, the rule may be examined in a single-run view by selecting one of the runs and reviewing the rule evaluation timeline 508 corresponding to that rule. The user may review each of the runs in turn in the single-run view by selecting the check box corresponding to that run and ensuring that the check box for the other run is deselected (as shown for the check boxes 504, 506 in fig. 5).
FIG. 6 shows the user interface in a run comparison view, in which two runs are compared in the common visualization 412. In this example, both the first run check box 506 and the second run check box 504 are selected, so that both runs are displayed. Instead of selecting a check box, the user may instead hover over the eye icon corresponding to one of the runs to display only that run in the visualization. In the map view, the autonomous agent 610a and external agent 608a of the first run are shown, while the autonomous agent 610b and external agent 608b of the second run are shown at different respective positions on the same road layout. The autonomous agents and external agents may be indicated in different colours in the UI; for example, the first run's autonomous agent may be shown in blue and its external agent in grey, while the agents of the second run may be indicated by a different colour, for example orange. A time marker 410 shows the progress of the two runs along a common timeline. A frame number 602 is also shown, and controls 604a, 604b are provided with which the user can step through the run frame by frame, forwards or backwards in time, where each frame corresponds to one state of the time series of autonomous agent states received by the renderer. The time (from the start of the scenario) is also shown to the right of the timeline. In this example, the frames correspond to regular intervals of 0.01 s, so that the current time of 4.950 s corresponds to frame 495.
In the run comparison view, rule evaluation timelines are displayed for the first run, as previously shown for the single-run view. The time marker of each rule evaluation timeline is positioned at the same relative point along the timeline as the selected point on the main timeline of the overall visualization 412. Fig. 6 shows an expanded view of the "ALKS_03" rule, which examines the cut-in response, including a robustness graph 414a in which the robustness scores for both the first and second runs are plotted on the same axes. As the user moves the time marker along the expanded timeline of a given rule, the time markers of all the other rules, as well as of the overall visualization timeline, are updated to the corresponding time step selected by the user.
Another rule, "ALKS_05 - StableLateralPosition", is shown in an expanded view with a robustness graph 414b. In the graph, the robustness scores of both the first run and the second run are plotted. The time marker has an associated line, parallel to the y-axis at the selected time, that intersects the graph for each run, and labels show the value of the robustness score of each run at the selected time. In this example, the first run has a robustness score of 0.24 at the selected time and the second run has a robustness score of 2. The scale of the graph is indicated by the 12 and −12 marks on the y-axis. The robustness graph of the first run appears as a line almost lying on the x-axis, since its robustness score remains close to zero for the duration of the run. In contrast, the graph for the second run starts at a high value, then drops below zero and remains near zero for the remainder of the run. For this example rule, the autonomous agent therefore complies with the rule for the duration of the first run, but the robustness score drops below zero in the second run. The UI may be configured to display the corresponding portion of the graph in red. In this example, the first run is the run for which the rule evaluation timelines are displayed, so the rule evaluation timeline 508 of the ALKS_05 rule is displayed green throughout the run.
The user may click on the time marker 410 and drag it along the timeline (referred to herein as "scrubbing") to select a different time to visualize within the duration of the two runs. As the user moves the time marker, the visualization of the agents in the road layout is updated to reflect the states of the corresponding agents at the selected time within each run. The time markers of the rule timelines and the robustness values 512 displayed alongside them are also updated to reflect the selected time. The scrubbing mechanism can be applied in the run comparison view (as shown in fig. 6) as well as in the single-run view (as described above). The scrubbing mechanism is described in more detail above with reference to fig. 9A. Fig. 9A shows a single-run view, but the description applies equally to the run comparison view (where the user can scrub along rule timelines relating to a vertical stack of multiple runs).
Thus, as a run plays back, the user can compare the behaviour of the autonomous agents in different runs to understand why their behaviour diverges and to inform future tests. For example, where the parameters of the two runs are the same but the comparison is between runs of two different versions of the autonomous stack, and a given rule (e.g. a stable lateral position rule) fails for the updated version of the stack, the user can review the autonomous vehicle's position in the run corresponding to the updated stack, determine the nature of the error in its lateral position, and attempt to identify the cause. As described above, since the agents of the second run are displayed in a different colour from the agents of the first run, the two runs can easily be distinguished in the visualization. Alternatively, some other means of visually identifying the agents of the corresponding run may be used, for example a visual effect such as a lower opacity for the agents, or a label on or around a given run's agents.
The run comparison interface may be used to evaluate changes made to the stack. For example, if the AV planner is updated to change how the autonomous vehicle behaves when exiting a roadway, the previous version of the stack (before the change) and the current version may be compared based on corresponding runs of a scenario in which the autonomous vehicle exits the roadway, to identify any changes in the autonomous vehicle's behaviour for the same scenario parameters. The new version of the stack may also be evaluated against scenarios with different scenario parameters, and these runs compared, to identify how different scenario parameters affect the autonomous vehicle's decision-making now that the change to the planner has been implemented.
As described above, in a typical use case of the run comparison interface, the user may identify from the comparison table 502 that a given rule passed for one of the compared runs but failed for the other. With both runs selected, the user may then deselect the check box associated with the run that passed the given rule and determine an approximate point at which the failure occurred based on the rule timeline of the failing run. The user may then move the time marker to a point shortly before the autonomous agent fails the rule, in order to view playback of the autonomous vehicle's behaviour around the time of the failure. Then, to compare with the autonomous behaviour in the other run, the check box corresponding to the passing run may be re-selected, and the scenario played back to show how the behaviour of the autonomous agent differs between the two runs.
The above description relates to the use of the comparison tool for behavioural rules, but the user interface may also include the perception rules described earlier, in which the vehicle's perception is assessed against ground truth (for example, realistic detections simulated using a perception error model, or detections generated in real time by the autonomous vehicle). For example, where a change has been made to the perception system and the same scenario is re-run, if the autonomous agent fails a collision rule by colliding with a lead vehicle, but passed that rule for the previous version of the stack, the user can replay both runs in the same visualization, determine that a failure to detect the lead agent in sufficient time caused the collision, and review the most recent changes to the perception stack to establish how the regression occurred.
In the regression comparison use case, one or more test suites, each defining a set of scenarios, are run for two different versions of the autonomous stack. Typically, each test suite contains a very large number of scenario instances (e.g. tens of thousands or more), and the vast majority of the results are the same from one stack version to the next. It is impractical for a user to manually review all of these results to identify the rules for which the two versions differ. Instead, an aggregation may be performed over the runs of the two test suites that identifies and reports only those rules, for each scenario, that produce different results between the two stack versions.
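The aggregation step might be sketched as follows (assuming per-scenario, per-rule pass/fail results are available for both stack versions; the names are illustrative):

```python
def compare_suites(previous, current):
    """previous, current: dicts mapping (scenario, rule) -> True (pass) / False (fail)
    for runs of the same test suite against two stack versions. Only results
    that changed between the versions are reported."""
    regressions, improvements = [], []
    for key, passed_before in previous.items():
        passed_now = current.get(key, passed_before)
        if passed_before and not passed_now:
            regressions.append(key)      # passed previously, fails now
        elif not passed_before and passed_now:
            improvements.append(key)     # failed previously, passes now
    return regressions, improvements
```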
An interface showing the result of such aggregation is shown in FIG. 7. The regression report interface may be provided as an additional display within the test tool that also includes the aforementioned run comparison visualization. The report includes a test results column identifying the test suite comparisons 702, where each test suite defines a set of scenario parameters and a set of rules, as described above. For each compared test suite, a pair of IDs 704 is shown, each ID 704 identifying a separate run of the test suite. For a given test suite regression comparison, each rule 706 for which a regression is found is shown in the second column, and an overview 708 of the improvements and regressions for that rule is shown in the third column. Details of each individual regression and improvement are shown in the rows below, with the scenario name 710 identified in the second column, and the results 712 identifying the two runs of the scenario that differ and the result of each run. In the example shown, the result for the current version of the stack is shown first, and the result for the previous version of the stack is shown in brackets. In this example, for the given "STAY_IN_LANE_JCTN" rule, 5 improvements and 1 regression are found. The first listed scenario is a regression, where the run corresponding to the current stack version resulted in a failure and the previous stack version resulted in a pass. The remaining five results show the current version of the stack passing where the previous version of the stack failed. A "compare" link 714 is provided for each improvement and regression, and when the user clicks on the link, this directs the user to the run comparison interface described above with reference to FIGS. 5-6, where the respective instance IDs identify the first and second run data for presentation in the run comparison interface.
As described above, the evaluation results may be stored in a results database, which the graphical user interface may access in order to display the graphs of the digital performance scores.
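By way of example only, and assuming a simple relational schema in which each row stores one time-stamped score for one rule of one run, the graphical user interface might retrieve a series to plot as follows; the table and column names are hypothetical and the actual schema is not disclosed.

```python
import sqlite3

def load_score_series(db_path: str, run_id: str, rule_id: str) -> list[tuple[float, float]]:
    """Fetch the (time, score) series for one rule of one run, ordered by time.

    Assumes a hypothetical `scores` table with columns (run_id, rule_id, t, score).
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT t, score FROM scores WHERE run_id = ? AND rule_id = ? ORDER BY t",
            (run_id, rule_id),
        ).fetchall()
    return rows  # list of (t, score) pairs ready for plotting
```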
References herein to components, functions, modules, and the like denote functional components of a computer system that may be implemented in different ways at the hardware level. The computer system comprises execution hardware that may be configured to perform the method/algorithm steps disclosed herein and/or to implement models trained using the present techniques. The term execution hardware encompasses any form or combination of hardware configured to carry out the relevant method/algorithm steps. The execution hardware may take the form of one or more processors, which may be programmable or non-programmable, or a combination of programmable and non-programmable hardware may be used. Examples of suitable programmable processors include general-purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors, and the like. Such general-purpose processors typically execute computer-readable instructions stored in memory coupled to, or internal to, the processor and carry out the relevant steps in accordance with those instructions. Other forms of programmable processor include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code. Examples of non-programmable processors include application specific integrated circuits (ASICs). Code, instructions, etc. may be stored, as appropriate, on transitory or non-transitory media (examples of which include solid state, magnetic and optical storage devices, and the like). The subsystems 102-108 of the runtime stack of FIG. 1A may be implemented in programmable or dedicated processors on the vehicle, or a combination of both, or in an onboard or offboard computer system in the context of testing and the like. The components of FIGS. 2A, 2B, 3B, 4 and 8 may similarly be implemented in programmable and/or dedicated hardware.

Claims (18)

1. A computer system for presenting a graphical user interface for visualizing the operation of a driving scenario in which an autonomous agent navigates a road layout, the computer system comprising:
at least one input configured to receive a map of a road layout of the driving scenario and operational data of operation of the driving scenario, wherein the operational data comprises:
a time-stamped sequence of autonomous agent states, and
a time-varying digital score quantifying the performance of the autonomous agent with respect to each rule in a set of run evaluation rules, the time-varying digital score being calculated by applying the run evaluation rules to the run; and
a presentation component configured to generate presentation data for causing a graphical user interface to display:
for each rule in the set of run evaluation rules:
a plot of the time-varying digital score, and
A marker representing a selected time index on a time axis of the graph, the marker being movable along the time axis via user input at the graphical user interface to change the selected time index; and
A scene visualization comprising a visualization of the road layout overlaid with the running agent visualization at the selected time index, whereby moving the marker along the time axis causes the presentation component to update the scene visualization as the selected time index changes.
2. The computer system of claim 1, wherein the input is further configured to receive second operational data of a second run of the driving scenario, the second operational data comprising: a second sequence of time-stamped autonomous agent states and a second time-varying digital score that quantifies performance of the autonomous agent with respect to each rule in the set of run evaluation rules, the second time-varying digital score being calculated by applying the run evaluation rules to the second run; wherein the presentation component is further configured to generate presentation data for causing the graphical user interface to display, for each rule in the set of run evaluation rules:
a graph of the second time-varying digital score, wherein the time-varying digital score and the second time-varying digital score are plotted against a common set of axes including at least a common time axis, wherein the indicia represents a selected time index on the common time axis, an
a second agent visualization of the second run at the selected time index, wherein the scene visualization is overlaid with the second agent visualization.
3. The computer system of claim 1 or 2, wherein the time-varying digital score is calculated by applying one or more rules to a time-varying signal extracted from the operational data, and wherein a change in the signal is visible in the scene visualization.
4. A computer system according to claim 2 or 3, wherein, in response to a deselection input at the graphical user interface representing one of the first run and the second run, the presentation component is configured to:
remove, from the common set of axes and for each rule, the plot of the time-varying digital score of the deselected run, and
remove the agent visualization of the deselected run from the visualization of the road layout,
whereby the user can switch from a run comparison view relating to both the first run and the second run to a single run view relating to only one of the first run and the second run.
5. The computer system of any of claims 2-4, wherein the graphical user interface further comprises: a comparison table having an entry for each rule in the set of run evaluation rules, the entry containing an aggregate performance result for the rule in the first run and an aggregate performance result for the rule in the second run.
6. The computer system of claim 5, wherein the entry for each rule further comprises a description of the rule.
7. The computer system of any preceding claim, wherein, in response to an expansion input at the graphical user interface, the presentation component is configured to hide the graph of the time-varying digital score for each rule and to display a timeline view including an indication of pass/fail results of the rule over time.
8. A computer system according to any preceding claim, wherein the presentation component is configured to cause the graphical user interface to display a digital score at the selected time index for each rule in the set of run evaluation rules.
9. The computer system of any preceding claim, wherein the run evaluation rules comprise perception rules, and wherein the scene visualization comprises a set of perception outputs generated by a perception component of the autonomous vehicle.
10. The computer system of claim 9, wherein the scene visualization comprises: sensor data overlaid on a visualization of the road layout.
11. The computer system of any preceding claim, wherein the scene visualization comprises: a scene timeline having a scene time marker, whereby moving the marker along the scene timeline causes the presentation component to update a respective time marker of each graph of the time-varying digital score as the selected time index changes.
12. The computer system of claim 10, wherein the scene timeline comprises: a frame index corresponding to the selected time index, and a set of controls for moving forward or backward by incrementing or decrementing the frame number, respectively.
13. The computer system of any preceding claim, wherein the driving scenario is a simulated driving scenario in which the autonomous agent navigates a simulated road layout, and wherein the operational data is received from a simulator.
14. The computer system of any of claims 1-12, wherein the driving scenario is a real-world driving scenario in which an autonomous agent navigates a real-world road layout, and wherein the operational data is calculated based on data generated on the autonomous agent during the operation.
15. A computer system according to any preceding claim, wherein the plot of time-varying digital scores comprises an xy plot of the time-varying digital scores.
16. The computer system of any preceding claim, wherein the time-varying digital score is drawn using color coding.
17. A method for visualizing the operation of a driving scenario in which an autonomous agent navigates a road layout, the method comprising:
receiving a map of a road layout of the driving scene and operation data of operation of the driving scene, wherein the operation data comprises:
a sequence of time-stamped autonomous agent states,
a time-varying digital score quantifying the performance of the autonomous agent with respect to each rule in a set of run evaluation rules, the time-varying digital score being calculated by applying the run evaluation rules to the run; and
generating presentation data for causing a graphical user interface to display:
for each rule in the set of run evaluation rules:
a plot of the time-varying digital score, and
A marker representing a selected time index on a time axis of the graph, the marker being movable along the time axis via user input at the graphical user interface to change the selected time index; and
A scene visualization comprising a visualization of the road layout overlaid with the running agent visualization at the selected time index, whereby moving the marker along the time axis causes a presentation component to update the scene visualization as the selected time index changes.
18. A computer program comprising executable instructions for programming a computer system to implement the method or system functions of any preceding claim.
CN202280041327.6A 2021-06-08 2022-06-08 Test visualization tool Pending CN117501249A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
GB2108182.3 2021-06-08
GB2108952.9 2021-06-22
GB2108958.6 2021-06-22
GB2111765.0 2021-08-17
GB2204797.1 2022-04-01
GBGB2204797.1A GB202204797D0 (en) 2022-04-01 2022-04-01 Test visualisation tool
PCT/EP2022/065484 WO2022258657A1 (en) 2021-06-08 2022-06-08 Test visualisation tool

Publications (1)

Publication Number Publication Date
CN117501249A true CN117501249A (en) 2024-02-02

Family

ID=81581479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280041327.6A Pending CN117501249A (en) 2021-06-08 2022-06-08 Test visualization tool

Country Status (2)

Country Link
CN (1) CN117501249A (en)
GB (1) GB202204797D0 (en)

Also Published As

Publication number Publication date
GB202204797D0 (en) 2022-05-18

Similar Documents

Publication Publication Date Title
CN112868022A (en) Driving scenarios for autonomous vehicles
US20230234613A1 (en) Testing and simulation in autonomous driving
WO2022171812A1 (en) Performance testing for trajectory planners
CN116171455A (en) Operation design domain in autonomous driving
WO2022258660A1 (en) Support tools for autonomous vehicle testing
US20240194004A1 (en) Performance testing for mobile robot trajectory planners
US20240143491A1 (en) Simulation based testing for trajectory planners
CN117501249A (en) Test visualization tool
WO2022258657A1 (en) Test visualisation tool
WO2024115764A1 (en) Support tools for autonomous vehicle testing
JP2024524036A (en) Support Tools for Autonomous Vehicle Testing
WO2024115772A1 (en) Support tools for autonomous vehicle testing
CN117425882A (en) Autonomous vehicle test support tool
CN117242449A (en) Performance test of mobile robot trajectory planner
EP4338054A1 (en) Tools for performance testing autonomous vehicle planners
WO2023227776A1 (en) Identifying salient test runs involving mobile robot trajectory planners
EP4338059A1 (en) Tools for performance testing autonomous vehicle planners
CN118284886A (en) Generating a simulated environment for testing the behavior of an autonomous vehicle
WO2022248692A1 (en) Tools for performance testing autonomous vehicle planners
Montanari Automatic Generation of Test Scenarios and Simulation Scenarios based on Real-World Driving Data
WO2023017090A1 (en) Perception testing
WO2023194552A1 (en) Performance testing for robotic systems
EP4373726A1 (en) Performance testing for mobile robot trajectory planners
WO2022248694A1 (en) Tools for performance testing autonomous vehicle planners.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination