GB2603179A - Dense 3-D occupancy mapping - Google Patents

Dense 3-D occupancy mapping

Info

Publication number
GB2603179A
Authority
GB
United Kingdom
Prior art keywords
sub
map
occupancy
depth
voxel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2101281.0A
Other versions
GB202101281D0 (en)
Inventor
Juan Tarrio
Pablo F. Alcantarilla
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Slamcore Ltd
Original Assignee
Slamcore Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Slamcore Ltd filed Critical Slamcore Ltd
Priority to GB2101281.0A priority Critical patent/GB2603179A/en
Publication of GB202101281D0 publication Critical patent/GB202101281D0/en
Priority to PCT/EP2022/051922 priority patent/WO2022162075A1/en
Priority to GB2312319.3A priority patent/GB2618278A/en
Publication of GB2603179A publication Critical patent/GB2603179A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/579Depth or shape recovery from multiple images from motion
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30Map- or contour-matching
    • G01C21/32Structuring or formatting of map data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/96Tree coding, e.g. quad-tree coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Processing Or Creating Images (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Image Generation (AREA)

Abstract

A computer-implemented method for building and updating a 3-D occupancy map of occupied and unoccupied spaces in a 3-D environment. Each sub-map comprises occupancy information within its respective volumetric region, including occupancy indications for a plurality of voxels. Each occupancy indication may indicate a likelihood that the respective voxel is occupied or unoccupied. Updating the at least one sub-map based on the 2-D depth image and associated pose estimate may comprise updating the occupancy information — in particular, updating the occupancy indications — based on the 2-D depth image and associated pose estimate. A single depth image may contain depth information that is relevant to multiple sub-maps. The method therefore updates multiple sub-maps based on depth information from one frame (for example, the most recently obtained frame). In some embodiments, all sub-maps in a list of active sub-maps may be updated with depth information from each 2-D depth image. A mobile device for building and updating the occupancy map may be a handheld device, robot, or a vehicle. The vehicle may be an unmanned and/or autonomous vehicle, a land vehicle, an aerial vehicle, a marine vehicle or a submarine vehicle.

Description

DESCRIPTION
DENSE 3-D OCCUPANCY MAPPING
FIELD OF THE INVENTION
This invention relates to 3-D occupancy mapping, meaning the creation and/or maintenance of a volumetric map describing whether space is occupied or unoccupied. It relates in particular to occupancy mapping in the context of a visual-inertial simultaneous localisation and mapping (VI-SLAM) system, in which visual data and inertial measurements are combined to produce pose estimates of a mobile device as it moves around an environment.
BACKGROUND OF THE INVENTION
Interest in autonomous and semi-autonomous robots and vehicles, such as drones, is growing as the technology matures and they become increasingly widely used.
Trajectory planning, for robots or autonomous vehicles, requires a map representation that captures information about which parts of an environment are occupied (e.g. by solid objects) and which parts are free space. It is also desirable for the map to capture information about which parts of the environment remain unexplored - that is, parts where the occupancy is currently unknown. The vehicle or robot needs to be able to determine a safe travel path that will avoid collisions. In general, this is a 3-D problem - a 2-D map of the environment is not enough in most cases. This may be either because the robot or vehicle is capable of motion in three dimensions, or - even if the vehicle/robot does not move in three dimensions - simply because it has a finite size in three dimensions, as do the obstacles in the environment. This implies a need for a dense, volumetric occupancy map, capable of accurately representing the occupancy state of all parts of the environment.
In the context of drones, applications of interest may include aerial surveillance, infrastructure inspection, and search and rescue. In the context of robotics, applications may include close inspection and grasping of objects.
For greater autonomy, it may be desirable for the robot or vehicle to be able to create and update the occupancy map of the environment by itself, as far as possible, without a constant need for external communications or additional processing resources. It is therefore desirable that the techniques used to build and maintain the map should be computationally tractable on embedded processing resources.
To date, little research has focused on techniques for dense volumetric occupancy mapping suitable for embedded implementation. Also, in the past, research on dense 3-D mapping has traditionally focused on accurate reconstruction of surfaces, rather than the accurate representation of occupied and unoccupied 3-D space, which is the primary concern here.
SUMMARY OF THE INVENTION
Existing surface mapping techniques are generally not directly applicable to path-planning problems, because they focus on accurate reconstruction of the surface, rather than on mapping the occupancy of space. For example, a typical surface mapping algorithm stores information about the occupancy of space only close to the surface. This can make the algorithm more efficient in terms of storage and computational requirements, for mapping/reconstructing surfaces, but means that the system lacks a definition of whether the rest of the space is occupied, unoccupied, or not yet explored.
Naive techniques applied to dense 3-D mapping tend to be computationally complex, and tend to result in large volumes of data. There is a need for improved techniques, which are more efficient, in terms of both computational effort to build and maintain the map and data storage requirements to store it. These improved techniques will facilitate implementation of 3-D occupancy mapping on embedded processors of mobile devices.
The invention is defined by the claims.
According to an aspect of the invention there is provided a computer-implemented method for building and updating a 3-D occupancy map of occupied and unoccupied spaces in a 3-D environment, according to claim 1.
Each sub-map comprises occupancy information. In particular, each sub-map may contain occupancy indications for a plurality of voxels. Each occupancy indication may indicate a likelihood that the respective voxel is occupied or unoccupied. Updating the at least one sub-map based on the 2-D depth image and associated pose estimate may comprise updating the occupancy information - in particular, updating the occupancy indications - based on the 2-D depth image and associated pose estimate.
The present inventors have recognised that a single depth image may contain depth information that is relevant to multiple sub-maps. The method therefore updates multiple sub-maps based on depth information from one frame (for example, the most recently obtained frame). In this way, the method can make more effective use of the depth information from each frame. In some embodiments, all sub-maps in a list of active sub-maps may be updated with depth information from each 2-D depth image.
The method may further comprise: obtaining updated pose information; and in response, updating one or more of the sub-map poses, based on the received updated pose information.
In this way, the method can enable the relative positions and orientations of the different sub-maps to be adjusted, when updated information (in particular, pose information) becomes available. Neighbouring sub-maps within the 3-D occupancy map overlap one another, in order to support their reconfiguration in response to the updated pose information. The overlap ensures that small changes in the position and/or orientation of one volumetric region, relative to its neighbours, will not result in any gaps in the 3-D occupancy map.
The updated pose information may be received from a visual-inertial simultaneous localisation and mapping (VI-SLAM) system.
Each sub-map may contain a rigid model describing the occupancy of space within its respective volumetric region. The sub-maps are effectively coupled to one another non-rigidly, in an articulated manner, by means of the sub-map poses. By describing the occupancy of space with several rigid models (each associated with a respective sub-map), instead of a single global rigid model, the present method helps avoid the need to update the entire 3-D occupancy map in response to updated pose information that affects just one part of the map. Instead, only the affected sub-map(s) need be updated, along with the associated sub-map pose(s).
Receiving the updated pose information may comprise obtaining updated pose estimates for one or more of the frames, and the method may comprise updating one or more of the sub-map poses based on the updated pose estimates. In this way, the sub-maps are effectively anchored to the poses of the frames that were used to create them.
Each sub-map may comprise an octree for storing occupancy information, each octree comprising a hierarchy of nodes, each node being associated with a voxel, wherein each leaf node of the octree includes an occupancy indication.
The occupancy indication may indicate a likelihood that the respective voxel is occupied or unoccupied. The set of occupancy indications may define a scalar field indicating the probability of occupation of each voxel. The occupancy indication may comprise a log-odds probability.
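By way of illustration only, the following Python sketch shows how a log-odds occupancy indication might be stored and fused with new measurements. The clamping bounds and the measurement-independence assumption are illustrative assumptions, not values taken from this disclosure:

```python
import math

def prob_to_log_odds(p: float) -> float:
    """Convert an occupancy probability to log-odds."""
    return math.log(p / (1.0 - p))

def log_odds_to_prob(l: float) -> float:
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-l))

def fuse_measurement(stored_log_odds: float, measurement_prob: float,
                     lo_min: float = -5.0, lo_max: float = 5.0) -> float:
    """Fuse one new measurement into a stored log-odds value.

    In the log-odds domain, Bayesian fusion of independent measurements
    reduces to addition; clamping (an assumed design choice) keeps a
    voxel able to change state quickly if the environment changes.
    """
    updated = stored_log_odds + prob_to_log_odds(measurement_prob)
    return max(lo_min, min(lo_max, updated))
```

The log-odds form is convenient because the unknown state corresponds to a value of zero (probability 0.5), and repeated updates reduce to simple additions.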
Optionally, in some embodiments, non-leaf nodes may also include occupancy information.
Optionally, in some embodiments, the octree may store confidence information.
For example, in addition to an occupancy indication, each leaf node of the octree may include a confidence indication, which indicates a confidence that the occupancy indication is correct.
Each sub-map may comprise an octree for storing occupancy information, each octree comprising a hierarchy of nodes, each node being associated with a voxel, wherein each leaf node of the octree contains an occupancy indication indicating whether the respective voxel is occupied or unoccupied, and updating the at least one sub-map based on a given 2-D depth image may comprise: projecting each voxel of the sub-map into the 2-D depth image, to define a projected area within the 2-D depth image; and updating each voxel based on depth values of the 2-D depth image that are located within the respective projected area.
The use of the octree may help to reduce the quantity of data needed to store the map - in particular, when there are large unoccupied areas in the environment. Storing a reduced amount of data may have the additional benefit that less computational effort is required to update the 3-D occupancy map, when a new frame / 2-D depth image is obtained.
Updating each voxel optionally comprises: defining a first interval of depths encompassed by the voxel; defining a second interval of depth values in the depth image that are located within the voxel's projected area; calculating, based on the first interval and the second interval, an expected variation in a value of the occupancy indication within the voxel; and, if the expected variation is greater than a threshold, subdividing the voxel into eight sub-voxels.
This can provide an efficient approach to update the octree, when a new 2-D depth image is obtained.
The subdivision of voxels may be repeated, recursively, until all leaf-voxels have a sufficiently small expected variation in the occupancy indication.
The expected variation in the occupancy indication may be calculated using an inverse sensor model, which determines the probability of occupation along a ray given a distance between a queried point on the ray and a depth value in the depth image.
Optionally, updating at least one voxel comprises updating the occupancy indication of the voxel, based on the depth values that are located within the voxel's projected area, wherein the occupancy indication is updated using an inverse sensor model, which determines the probability of occupation along a ray, given a distance between a queried point on the ray and a depth value in the depth image.
This can offer a principled way to integrate the 2-D depth information from the depth image into the 3-D occupancy map. Optionally, the occupancy indication is updated only for voxels whose expected variation in occupancy indication is not greater than the threshold. In other words, a voxel is subdivided if it has a high expected variation in occupancy-indication, and its occupancy indication is updated if it has a low expected variation in occupancy indication.
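The specific inverse sensor functions used are illustrated in Fig. 10. As an indicative sketch only, a piecewise model of this general kind might be written as follows; the probabilities, the sigma parameter, and the 3-sigma band are illustrative assumptions rather than values from this disclosure:

```python
def inverse_sensor_model(query_range: float, measured_range: float,
                         sigma: float = 0.05) -> float:
    """Occupancy probability at a queried point along a ray, given the
    depth measurement that intersects the ray.

    Typical piecewise shape: space well in front of the measured surface
    is likely free; space near the surface is likely occupied; space
    behind the surface is occluded and therefore unknown (0.5).
    """
    delta = query_range - measured_range
    if delta < -3.0 * sigma:
        return 0.35   # well in front of the surface: probably free
    if delta <= 3.0 * sigma:
        return 0.75   # near the measured surface: probably occupied
    return 0.5        # behind the surface (occluded): no information
```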
The method may further comprise, before updating each voxel, pre-computing a pooling structure based on the 2-D depth image, wherein the pooling structure optionally comprises one of: a pyramidal min-max pooling structure, describing the minimum and maximum depth values for each of a plurality of regions in the depth image, the regions including regions at different resolutions, the regions at each scale being arranged in a regular array; a mean-variance integral image pooling structure, comprising integral images computed for the zeroth, first and second moments of depth; and a smooth-max integral image pooling structure, comprising integral images computed for positive and negative alpha-softmax functions of depth.
The step of defining the second interval of depth values, mentioned above, may comprise: determining a bounding box that encloses the voxel's projected area; and querying the pooling structure based on the bounding box to obtain the second interval of depth values.
The pyramidal min-max pooling structure may be constructed by iterative pooling. It can be used to rapidly compute an estimate of the minimum and maximum depth values in a rectangular bounding box within the depth image, given the location and size of the bounding box.
The mean-variance integral image pooling structure can be used to rapidly compute the mean and variance of any given rectangular bounding box in the depth image.
The smooth-max integral image pooling structure can be used to rapidly compute the sum of exponential functions of depth, for any given rectangular bounding box in the depth image. A softmax estimate of the maximum can be calculated as the ratio of the sums of two positive exponential functions. A softmin estimate of the minimum can be calculated as the ratio of the sums of two negative exponential functions. Each ratio forms an alpha-softmax function of depth.
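As an illustrative sketch of the first option, the pyramidal min-max pooling structure can be built by iterative 2x2 pooling and queried at a pyramid level matched to the bounding-box size. The level-selection heuristic below is an implementation assumption:

```python
import numpy as np

def build_minmax_pyramid(depth: np.ndarray, levels: int):
    """Build per-cell (min, max) depth pyramids. Level 0 is the input
    image; each subsequent level halves the resolution by taking the
    min/max over 2x2 blocks."""
    mins, maxs = [depth], [depth]
    for _ in range(levels):
        m, M = mins[-1], maxs[-1]
        h, w = (m.shape[0] // 2) * 2, (m.shape[1] // 2) * 2
        m, M = m[:h, :w], M[:h, :w]
        mins.append(np.minimum.reduce(
            [m[0::2, 0::2], m[0::2, 1::2], m[1::2, 0::2], m[1::2, 1::2]]))
        maxs.append(np.maximum.reduce(
            [M[0::2, 0::2], M[0::2, 1::2], M[1::2, 0::2], M[1::2, 1::2]]))
    return mins, maxs

def query_minmax(mins, maxs, x0, y0, x1, y1):
    """Estimate (min, max) depth over a non-empty bounding box, using
    the coarsest level at which the box spans only a few cells."""
    level = max(0, int(np.floor(np.log2(max(x1 - x0, y1 - y0, 1)))))
    level = min(level, len(mins) - 1)
    s = 2 ** level
    ax0, ay0 = x0 // s, y0 // s
    ax1, ay1 = -(-x1 // s), -(-y1 // s)   # ceiling division
    return (mins[level][ay0:ay1, ax0:ax1].min(),
            maxs[level][ay0:ay1, ax0:ax1].max())
```

Because the selected cells cover a superset of the bounding box, the returned interval is conservative: it always contains the true minimum and maximum within the box.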
Updating the at least one sub-map optionally comprises pruning a group of eight child nodes from the octree if it is determined that their occupancy can be indicated sufficiently accurately by a single occupancy indication associated with the parent node of the group of eight child nodes.
Typically, in order to be eligible for pruning, the group of eight child nodes should be leaf nodes. The method may comprise assigning to the parent node an occupancy indication that is an average of the occupancy indications of the eight child nodes. The parent node may be determined to indicate the occupancy sufficiently accurately if an error introduced by the average is below a threshold. In other words, the parent node may indicate the occupancy sufficiently accurately if a spread (variation) of the occupancy indications of the child nodes is below a threshold.
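A minimal sketch of this pruning test, assuming log-odds occupancy indications (the threshold value and the use of a simple mean are illustrative assumptions consistent with, but not specified by, the text above):

```python
from typing import List, Optional

def prune_value(children_log_odds: List[float],
                spread_threshold: float = 0.1) -> Optional[float]:
    """Decide whether eight leaf children can be collapsed into their
    parent. Returns the parent's occupancy indication (the average of
    the children), or None if the children's indications disagree too
    much to be summarised accurately by a single value."""
    spread = max(children_log_odds) - min(children_log_odds)
    if spread >= spread_threshold:
        return None
    return sum(children_log_odds) / len(children_log_odds)
```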
The method may further comprise: before updating the at least one sub-map, fusing two or more of the 2-D depth images into a combined depth image, and updating the at least one sub-map based on the combined depth image.
Updating sub-maps is typically computationally intensive. Therefore, it is advantageous to aggregate information from multiple frames, where possible, before updating the sub-map. This can reduce the number of sub-map updates needed, without discarding useful depth information.
The fusing may comprise: re-projecting depth values from one or more of the 2D depth images, based on the pose estimates associated with the frames, to produce one or more re-projected depth images; and calculating average depth values based at least in part on the re-projected depth images.
Optionally, calculating the average depth values comprises, for each pixel of the combined depth image: computing a histogram of depth values, based on the re-projected depth images; and calculating the average depth value based at least in part on the mode of the histogram.
In particular, the average may be calculated as the mean of depth values falling within a predetermined interval about the mode. For example, the mean may be calculated as the mean of depth values falling within selected histogram bins, wherein the selected bins include the bin defining the mode, as well as the adjacent bins on either side of the mode.
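For a single pixel, this averaging could be sketched as follows. The floor-division binning is an assumed discretisation (as noted later in this description, inverse-depth binning may also be used):

```python
import numpy as np

def fuse_pixel_depths(depths: np.ndarray, bin_size: float) -> float:
    """Robust per-pixel average: find the modal histogram bin, then take
    the mean of the samples in that bin and its two neighbours.
    Ideally, bin_size is matched to the expected depth noise sigma."""
    bins = np.floor(depths / bin_size).astype(int)
    values, counts = np.unique(bins, return_counts=True)
    modal = values[np.argmax(counts)]
    inliers = depths[np.abs(bins - modal) <= 1]
    return float(inliers.mean())
```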
The method may further comprise: maintaining a list of currently active sub-maps; and defining a new sub-map if occupancy information derived from the 2-D depth image extends outside all of the currently active sub-maps.
A new sub-map may be created aligned with a virtual grid that is fixed in the global coordinate system. That is, each new sub-map may be created at a substantially fixed distance and in a substantially fixed orientation relative to previously created sub-maps. The step of defining the new sub-map may be performed before the step of updating the two or more sub-maps based on the 2-D depth image.
Also provided is a computer program comprising computer program code configured to cause one or more physical computing devices to perform all the steps of a method as summarized above when said computer program is run on the one or more physical computing devices. The computer program may be stored on a computer readable storage medium, which may be a non-transitory storage medium.
Also provided is a mobile device configured to implement a method as summarized above.
Also provided is a mobile device configured to build and update a 3-D occupancy map of occupied and unoccupied spaces in a 3-D environment, according to claim 16.
The one or more processors may be further configured to: obtain updated pose information; and in response, update one or more of the sub-map poses, based on the updated pose information.
The one or more processors may be configured to update the at least one sub-map in accordance with any of the methods as summarised above.
The at least one camera may comprise a stereo camera pair configured to capture stereo image pairs. The one or more processors may be configured to produce the 2-D depth images based on the stereo image pairs.
The one or more processors may be configured to: calculate an updated pose estimate for each frame; and update one or more of the sub-map poses based on the updated pose estimates.
The mobile device may be comprised in a handheld device, robot, or a vehicle.
The vehicle may be an unmanned and/or autonomous vehicle. The vehicle may be a land vehicle, an aerial vehicle, a marine vehicle or a submarine vehicle.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described by way of example with reference to the accompanying drawings, in which:
Fig. 1 is a conceptual representation of a 3-D occupancy map according to an example;
Fig. 2 is a block diagram of a mobile device according to an example;
Fig. 3 illustrates a method for building and updating a 3-D occupancy map, according to an example;
Fig. 4 illustrates the steps of a method for obtaining frames, suitable for use in the method of Fig. 3;
Fig. 5 illustrates a method of updating a sub-map, based on a 2-D depth image;
Fig. 6 illustrates a method for updating a voxel within a sub-map, based on a 2-D depth image;
Fig. 7 illustrates a method for defining an interval spanned by depth values in a part of a 2-D depth image, suitable for use in the method of Fig. 6;
Fig. 8 shows a method for fusing multiple depth images, suitable for use in the method of Fig. 3;
Fig. 9 is a 2-D plan view schematic illustrating how the method of fusing depth images works;
Fig. 10 illustrates several inverse sensor functions that may be used according to examples; and
Fig. 11 illustrates a procedure for querying a multiresolution pyramidal pooling structure with a bounding box.
It should be noted that these figures are diagrammatic and not drawn to scale.
Relative dimensions and proportions of parts of these figures have been shown exaggerated or reduced in size, for the sake of clarity and convenience in the drawings.
DETAILED DESCRIPTION
Examples described herein have been designed with the aim of compatibility with visual inertial simultaneous localisation and mapping (VI-SLAM) systems. Nevertheless, it should be understood that they can also be used in other contexts, and the scope of the present disclosure is not limited to systems incorporating a VI-SLAM system.
Visual-inertial systems for simultaneous localisation and mapping (SLAM) are known in the art. These systems combine an inertial measurement unit (IMU) and a camera. The IMU provides one or more of accelerometer, gyroscope and compass measurements; and the camera provides visual input, in the form of a sequence of images. These may be either monocular images, stereo image-pairs, or a set of images from a multi-camera rig.
The mobile device comprising the camera and IMU may be handheld or mounted on a vehicle or robot moving through an environment. The goals of the system are twofold: to localise - that is, estimate the position and orientation (together, the "pose") of the camera - and also to create a map of the environment through which the camera is moving.
An overview of SLAM systems in general can be found in Cadena et al. ("Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age," in IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309-1332, Dec. 2016, doi: 10.1109/TRO.2016.2624754). An overview of VI-SLAM systems, as well as a specific algorithm for VI-SLAM, can be found in Campos et al. (Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel and Juan D. Tardós, "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM", arXiv:2007.11898v1 [cs.RO], 23 Jul 2020), for example.
VI-SLAM systems attempt to solve an optimisation problem - namely, to establish the pose of the mobile device, given all of the visual and inertial information available. As the number of measurements increases, it becomes possible to refine previous pose estimates. This can arise, in particular, in the context of "loop closure" - when the system recognises that it has returned to a previously visited location. When this happens, the system may refine the pose estimates making up the entire trajectory over the loop, thereby correcting for any drift that had crept in before the loop-closure was detected. However, refinement of pose estimates can also occur at other times. For example, at the end of a mapping session, it may be desirable to perform a global refinement of the entire trajectory, based on all of the information that has been gathered.
The pose estimates provided by a VI-SLAM system are very useful for supporting occupancy mapping. However, in order to make best use of the pose estimates, the occupancy mapping framework should be able to cope with the periodic refinements of historical pose estimates that may be produced by the VI-SLAM system.
Examples disclosed herein have been designed with this aim in mind.
Fig. 1 shows a conceptual representation of a 3-D occupancy map according to an example. Fig. 1A is a perspective view. Fig. 1B shows a plan view, looking down on the map from above (in the orientation shown in Fig. 1A). The map comprises a number of sub-maps 10-14. Each sub-map describes the occupancy of a different volumetric region of the 3-D environment to be mapped. Fig. 1C illustrates schematically how one sub-map 10 describes the occupancy of space. The sub-map is divided into a plurality of voxels 15. The voxels densely cover the space - that is, every point in the space falls uniquely within one of the voxels. Each voxel describes the occupancy of its respective portion of the space. A voxel may be occupied, unoccupied, or unknown (not yet mapped). In Fig. 1C, the shaded voxels 15 represent portions of the space that are occupied. Empty (free, unoccupied) and unknown portions of the space are also described by voxels, but these unoccupied/undetermined voxels are not shown explicitly in Fig. 1C, to avoid cluttering this schematic diagram. It should be understood that the sub-maps 11-14 are also divided into voxels which describe the occupancy of space - this is not shown in Fig. 1, for simplicity and clarity. Each sub-map has an associated pose defining the position and orientation of the relevant volumetric region within the 3-D environment, relative to a global reference coordinate system of the overall occupancy map. Within each sub-map, the occupancy of space may be described rigidly. However, the poses (that is, positions and orientations) of the sub-maps relative to one another can be adjusted by modifying the pose associated with each sub-map.
In this example, the sub-maps are cubes, which overlap at their faces. When they are first instantiated, the cubes are aligned with one another in a regular three-dimensional array. However, as updated pose estimates become available (for example, from a VI-SLAM system), the poses of the sub-maps can be updated.
Effectively, the cubes can be shifted and rotated in order to accommodate refinements in the relative pose estimates of the different sub-maps. The overlap between the sub-maps allows this translation and reorientation to be performed without creating unmapped gaps in the overall 3-D occupancy map. The amount of overlap that is desirable depends on the expected magnitude of changes to the poses of the sub-maps. In the present example, an overlap between adjacent cubes of 15% of the volume of each cube was found to produce a good compromise between the desire to avoid gaps and the desire to minimise unnecessary redundancy in the 3-D occupancy map.
Note that, in Figs. 1A-1B, a first sub-map 10 is shown with neighbouring sub-maps 11, 12, 13, and 14 on four sides, for clarity and simplicity; however, in general, each sub-map may be surrounded by neighbours on all six sides.
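To make the data model concrete, a minimal sketch of a sub-map record follows; the field names and types are illustrative assumptions, not taken from the claims:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SubMap:
    """One cubic sub-map: a rigid occupancy model plus a pose placing
    its volumetric region in the global map frame."""
    pose: np.ndarray     # 4x4 transform: sub-map frame -> global frame
    side_length: float   # edge length of the cubic region, in metres
    occupancy: object    # rigid occupancy model, e.g. an octree root

    def to_global(self, p_local: np.ndarray) -> np.ndarray:
        """Map a 3-D point from sub-map coordinates to the global frame."""
        return self.pose[:3, :3] @ p_local + self.pose[:3, 3]
```

Adjusting a sub-map's pose then moves its entire rigid occupancy model as a unit, without touching the voxel data inside it.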
Fig. 2 shows a block diagram of a mobile device 100 according to an example.
The mobile device 100 is configured to perform two functions. Firstly, it performs simultaneous localisation and mapping (SLAM), using visual and inertial data. This produces frames and associated pose estimates, describing the environment and the trajectory followed by the mobile device 100 during a mapping session. Secondly, the mobile device builds and updates a 3-D occupancy map. Both of these processes may be performed in parallel, in real-time, while the mobile device is moving around the 3-D environment. The mobile device may be part of a handheld device, a robot, or a vehicle (such as a drone), for example.
As shown in Fig. 2, the mobile device comprises: a memory 110; two cameras, operating as a stereo pair 120; an inertial measurement unit (IMU) 130; and a microprocessor 140. The IMU includes an accelerometer 132 and a gyroscope 134. In some examples, it may further comprise a compass.
Fig. 3 is a flowchart illustrating a method for building and updating a 3-D occupancy map, which is performed by the mobile device 100. The 3-D occupancy map maps the occupancy of space in the 3-D environment around the mobile device. In particular, the occupancy map indicates whether space is occupied, unoccupied, or unknown (for example, because that part of the environment has not yet been visited/observed by the mobile device 100). The occupancy map is stored in the memory 110.
The method is iterative, as shown by the loop structure in the diagram. Before the first iteration, it is assumed that at least one sub-map has been instantiated. As explained above, in the present example, each sub-map represents a cubic volume. The sub-maps describe respective different volumetric regions of the 3-D environment. Each sub-map has an associated pose defining its position and orientation within the overall 3-D occupancy map. The sub-map volumes overlap with one another. A new sub-map may be defined whenever the mobile device visits a new part of the environment, which it has not previously mapped.
In step 220, the mobile device obtains the next frame characterising the 3-D environment. Each frame comprises a 2-D depth image and a pose estimate describing the position and orientation from which the 2-D depth image was captured. This process is illustrated in greater detail in Fig. 4. Step 220 includes steps 222-228. In step 222, the stereo camera pair 120 captures visual data, comprising frames. In step 224, the inertial measurement unit generates inertial data, comprising acceleration measurements from the accelerometer 132 and rotational measurements from the gyroscope 134. In step 226, the processor 140 calculates a pose estimate for the mobile device, for each frame, based on the visual data and the inertial data. This step is performed as part of the VI-SLAM algorithm. The details of the VI-SLAM algorithm are outside the scope of this disclosure. Any suitable VI-SLAM algorithm may be used, such as the one by Campos et al. referenced above. Additionally, in step 228, the processor produces a 2-D depth image for each frame. In the present example, the 2-D depth image is produced from the stereo image pair captured by the cameras 120, using stereo disparity.
In general, the 2-D depth image may describe the depths of points viewed by the cameras 120 in any suitable way. For example, the depth image may contain an array of depth values, being the distances of respective points in the 3-D environment from an image plane of the stereo cameras 120. However, in the exemplary implementation described below, the depth image contains an array of range values, being the distances of respective points in the 3-D environment from a point associated with the stereo cameras. The use of range, rather than depth, has been found to offer a more convenient formulation in the present implementation. However, in other implementations, traditional depth values (as defined above) may be used, or a mixture of depth and range values may be used. It is possible to convert between a depth representation and a range representation, using knowledge of the camera geometry. In the following description, the term "depth" will be used in a general sense, to encompass both traditional "depth" values as well as "range" values. The term "range" will refer to the Euclidean distance from the camera point to the measured point.
In step 225, the processor 140 assesses whether updating of the pose estimates is required. An update may be required, for example, if the VI-SLAM algorithm has detected a loop-closure, or if the VI-SLAM algorithm has finished a mapping session. If an update is required, the method proceeds to step 230. In this step, the processor 140 calculates an updated pose estimate for each frame. This step is typically performed as part of the VI-SLAM algorithm. Next, in step 240, the processor 140 updates any of the sub-map poses that are affected by the updated pose estimates for the frames. For example, if frames associated with a sub-map have had their poses updated, the pose of the sub-map may be updated based on an aggregate or average of the individual updated frame poses. Optionally, the "rigid" volumetric data within the sub-map may also be updated, to reflect the updated pose information regarding the individual frames. The pose estimates of the frames within each sub-map may be defined relative to the pose of the entire sub-map within the 3-D occupancy map.
After the poses of any affected sub-maps have been updated, the method proceeds to step 250 (and from there, eventually, to steps 260, 210 and 300). The method also proceeds directly to step 250 if it was determined in step 225 that no pose update was required. In step 250, a number of depth images will be subjected to a fusion step. In step 210, any new sub-maps required will be defined. In steps 260 and 300, the 3-D occupancy map will be updated, based on either new frames or new pose updates for existing frames. In the present example, the method for updating sub-maps is essentially the same, irrespective of whether the update concerns a new frame, containing fresh depth information not previously available to the system, or updated pose estimates for an old frame. Of course, in other examples, other strategies could be adopted.
Steps 250 and 260 are pre-processing steps, before the actual update process 300. In step 250, the processor fuses two or more of the 2-D depth images into a single combined depth image. Typically, the 2-D depth images selected for fusion will be images taken from adjacent points on the trajectory travelled by the mobile device. Therefore, there will be a large degree of overlap in the depth data that they contain.
Fusing such depth images into a single combined depth image can allow this redundancy to be reduced, and indeed exploited, to produce smoother depth data with less noise. Fusing multiple depth images into a single combined depth image also reduces the number of updates to the sub-maps, since the combined depth image can be used for the 3-D occupancy update, instead of updating the 3-D occupancy map separately for every single frame.
In the present example, as shown in Fig. 3, the depth images are obtained iteratively. They may be fused incrementally to produce the combined ("fused") depth image. If, after an iteration of incremental fusion in step 250, not enough depth images have been obtained/fused, then the fused depth image is not yet ready, and the method returns to step 220, to obtain another frame. When enough depth images have been obtained and fused, the combined ("fused") depth image is ready, and the method can proceed to step 260.
In step 260, the processor 140 pre-computes a pooling structure from the combined depth image produced in step 250. The pooling structure will be used in the update process (step 300), to facilitate faster and more efficient updating of the 3-D occupancy map. Various types of pooling structure may be adopted. Details of suitable pooling structures will be described later, below.
In step 210, the processor 140 of the mobile device defines any new sub-maps that are required, within the 3-D occupancy map. A new sub-map is created if the occupancy information derived from the latest combined depth image extends outside the volume covered by the currently existing active sub-maps. Details of how the required number and position of new sub-maps are computed are provided later, below. In step 300, the processor 140 updates at least one sub-map, based on the combined depth image, using the pre-computed pooling structure produced in step 260.
As illustrated in Fig. 5, this process comprises two steps, which are repeated for all voxels in the sub-map. In the present example, each sub-map represents the occupancy of 3-D space using an octree. The octree structure comprises a hierarchy of nodes, each node being associated with a voxel in the 3-D environment. The octree starts at a root node, which represents the entire cubic volume contained by the current sub-map (for example, the cube 10 in Fig. 1A). At each lower level of the octree, a parent node from the preceding level may be divided into Nx x Ny x Nz constituent smaller volumes. Typically, Nx = Ny = Nz = 2; thus, each parent node has eight children, corresponding to eight smaller constituent cubes (sub-voxels). The volume continues to be sub-divided at successive levels of the octree, creating a successively finer representation of the occupancy of space. The octree provides a multiresolution representation - large unoccupied parts of the environment can be represented by a small number of large voxels (nodes), while parts of the environment near to a solid object can be described at a finer resolution, with a larger number of smaller voxels (nodes). This allows occupancy to be represented accurately, but efficiently, avoiding a high degree of redundancy. Each leaf node of the octree - that is, each voxel that is not subdivided into eight children - has an occupancy indication associated with it, representing a likelihood that the voxel is occupied or unoccupied.
Note that the term "octree", as used herein, should be understood to include any hierarchical division of parent nodes (voxels) into Nx x Ny x Nz child nodes (child-voxels or sub-voxels). Optionally, Nx = Ny = Nz. Optionally, each of Nx, Ny, and Nz has a value from 1 to 8.
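A minimal sketch of an octree node of the kind described above follows. The log-odds field and explicit child list are implementation assumptions; compact production implementations typically use pooled allocation and implicit child addressing:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OctreeNode:
    """A node in the occupancy octree. A leaf stores an occupancy
    indication (here, log-odds); an interior node stores eight
    children."""
    log_odds: float = 0.0                       # 0.0 == unknown (p = 0.5)
    children: Optional[List["OctreeNode"]] = None

    def is_leaf(self) -> bool:
        return self.children is None

    def subdivide(self) -> None:
        """Split into eight children, each inheriting the parent's
        occupancy indication."""
        if self.children is None:
            self.children = [OctreeNode(self.log_odds) for _ in range(8)]
```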
In order to update a sub-map based on a depth image, each voxel in the sub-map is projected into the depth image (step 310). This is done based on the pose of the sub-map, the position of the voxel within the sub-map, and the pose of the 2-D depth image. After projecting the voxel into the plane of the 2-D depth image, the voxel is updated (step 400). This may include updating the voxels associated with child nodes of the present voxel. In this way, the update function may be called recursively, to update the entire octree, starting from the root node. This is how the process is implemented in the present example. However, it should be understood that it is not mandatory to use a recursive implementation. Other forms of iteration can be used, alternatively, or in addition to, recursion.
After all of the voxels of all of the relevant sub-maps have been updated, the method returns to step 220, obtains the next frame, and begins the next iteration of the process.
Fig. 6 illustrates the process 400 used to update a voxel, according to the present example. In step 410, the processor 140 defines a first interval, describing the spread of depths that are encompassed by the voxel. In step 420, the processor defines a second interval, describing the spread of depth values from the depth image that fall within the projected area of the voxel. As will be seen below, these intervals may be approximations. In particular, the second interval may describe the spread of depth values approximately. It may also describe the spread of depth values based on a rectangular bounding box that encloses the projected area of the voxel, instead of considering the exact area of the projection (which, in general, is not rectangular). By using such approximations judiciously, examples of the present method can speed up the voxel update procedure, without significantly harming accuracy.
In step 430, the processor 140 calculates an expected variation of the occupancy indication within the voxel. This calculation is based on the first interval and the second interval, and uses an inverse sensor model, which determines the probability of occupation along a ray emanating from the camera, given a queried point along the ray and the depth value in the depth image that intersects the ray.
The processor 140 decides whether to subdivide the voxel into eight children based on the calculated expected variation of the occupancy indication. If the expected variation is low, it suggests that the entire voxel has the same probability of occupation - that is, either the entire voxel is occupied, the entire voxel is unoccupied, or the occupancy of the entire voxel is unknown. This would suggest that there is no need to subdivide the voxel. On the other hand, if the expected variation in the occupancy indication is high, it suggests that part of the voxel is occupied and part is not occupied (or part is occupied/unoccupied, and another part is unknown). This would suggest that the voxel should be subdivided. Voxels may be subdivided until each individual voxel has a uniform probability of occupation. This test is indicated in Fig. 6 by the decision step 435. In step 435, the processor 140 tests whether the expected variation in the occupancy indication is above a threshold variation. If it is above the threshold, the processor proceeds to step 445; if it is not above the threshold, the processor proceeds to step 452.
In step 445, the processor 140 checks whether the current voxel is a leaf node in the octree. If the current voxel is not a leaf node - that is, if the current voxel already has children - the processor proceeds directly to step 448, which calls the voxel-update procedure 400 for each of the child nodes of the current voxel. On the other hand, if the current voxel is a leaf node - that is, if it currently has no children - the processor proceeds firstly to step 446, in which the processor subdivides the current voxel into eight children. After instantiating the eight children, the processor proceeds to step 448, in which the occupancy indications of these children are updated.
After the occupancy indications of the voxels of the child nodes have been updated, the processor 140 proceeds to step 460. In this step, the processor prunes the updated child nodes, if appropriate. Pruning involves removing the child nodes from the tree, so that their parent node becomes a leaf node. This can be done when the occupancy indications of the eight child nodes are sufficiently similar, because this suggests that the entire parent voxel is uniformly occupied.
In the other branch stemming from step 435, if the expected variation in the occupancy indication is not above the threshold, the processor 140 proceeds to step 452. In this step, the processor 140 computes a new occupancy value for the voxel using the inverse sensor model. In step 455, the processor checks whether the current voxel is associated with a leaf node. If so, in step 456, the processor updates the occupancy indication of the current voxel, using the new occupancy value computed using the inverse sensor model. Note that this update does not necessarily replace the old occupancy indication with the new value. The information provided by the new occupancy value may be integrated with the information already stored in the old occupancy indication. If, in step 455, the processor determines that the present voxel is not associated with a leaf node, the processor proceeds to step 458. In this step, the processor propagates the newly computed occupancy value to the eight children of the octree node associated with the present voxel. This propagation is done recursively until each leaf node below the current voxel in the hierarchy has been reached. In some examples, occupancy indications are only stored for leaf nodes of the octree, not for parent nodes in the hierarchical structure. In some other examples, some occupancy information may be stored also in the parent nodes. After propagation of the new value to the child nodes, in step 458, the processor proceeds to step 460. Here, the processor checks whether the child nodes should be pruned, such that the present voxel becomes a leaf node.
Note that no pruning is carried out after step 456. This is because step 456 is only performed for leaf nodes, and leaf nodes have no children that could be pruned.
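Pulling the Fig. 6 logic together, the voxel-update procedure 400 could be sketched as follows. Every helper function named here is a hypothetical placeholder for a step described above, not an API defined by this disclosure:

```python
def update_voxel(node, voxel, depth_image, pooling, var_threshold):
    """Recursive voxel update following the Fig. 6 flow (a sketch)."""
    first = voxel_depth_interval(voxel)                             # step 410
    second = projected_depth_interval(voxel, depth_image, pooling)  # step 420
    variation = expected_occupancy_variation(first, second)        # step 430

    if variation > var_threshold:                                  # step 435
        if node.is_leaf():
            node.subdivide()                                       # step 446
        for child_node, child_voxel in zip(node.children,
                                           subdivide_voxel(voxel)):
            update_voxel(child_node, child_voxel, depth_image,
                         pooling, var_threshold)                   # step 448
        try_prune(node)                                            # step 460
    else:
        value = inverse_sensor_model_value(first, second)          # step 452
        if node.is_leaf():                                         # step 455
            integrate(node, value)                                 # step 456
        else:
            propagate_to_leaves(node, value)                       # step 458
            try_prune(node)                                        # step 460
```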
Fig. 7 illustrates step 420 in greater detail. According to the present example, the step 420 of defining the second interval (describing the spread of depth values from the depth image falling within the projected area of the voxel) comprises two component steps. In step 422, the processor 140 determines a bounding box that encloses the projection of the voxel in the 2-D depth map. Then, in step 424, the processor queries the pooling structure (pre-computed in step 260) with the bounding box coordinates. The pooling structure returns an approximation of the spread of depth values of the 2-D depth image that fall within the bounding box.
Fig. 8 shows a method for fusing multiple depth images, suitable for use in the pre-processing step 250 of Fig. 3. The method starts with a set of 2-D depth images. In step 252, the processor 140 re-projects the depth values of one or more of the depth images to the viewpoint of one of the other depth images. This is possible because each depth image is associated with its own pose estimate, and the relative differences between the poses determine the re-projection. According to the present example, a central depth image among the set of depth images is chosen as a reference. The depth values from depth images captured before and after this reference depth image are re-projected to the viewpoint of the central reference depth image. (The depth values of the reference depth image do not need to be re-projected.) In steps 254 and 256, the processor 140 combines the re-projected depth values (and the depth values of the reference depth image). It does this by computing a histogram of depth values for each pixel of the reference depth image (step 254). Then, in step 256, the processor uses the histograms to calculate a combined depth image. This step essentially involves an averaging operation. In general, the averaging may be performed in a variety of ways. In the present example, the processor identifies (for each pixel of the reference depth image) the histogram bin containing the largest number of depth values - that is, the mode of the histogram. The average is then calculated as the mean of the depth values contained in (i) the identified (modal) bin and (ii) those bins adjacent to the identified (modal) bin. By taking into account all depth values close to the mode of the histogram, the present method can robustly exclude outlier depth values, and can calculate a smooth and reliable estimate of the actual depth based on the mean of the other (non-outlier) depth values. Note that, if there are no outliers, the present method should return the mean of all of the depth values in the histogram, since all of the depth values should be clustered in the modal bin and its neighbouring bins.
The combined depth image output by the method of Fig. 8 is a 2-D depth image identical in format to the 2-D depth images produced in step 228. The only difference is that there are now fewer 2-D depth images to be assimilated into the 3-D occupancy map. This helps to reduce the computational burden of the method of building and updating the 3-D occupancy map.
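The re-projection in step 252 can be sketched as follows, assuming a shared pinhole intrinsic matrix K and z-depth values (range values, as used in the exemplary implementation, would additionally require ray normalisation). Occlusion handling is omitted here, whereas a real system would need to resolve cases where several source pixels land on the same reference pixel:

```python
import numpy as np

def reproject_depth(depth_src: np.ndarray, K: np.ndarray,
                    T_ref_src: np.ndarray) -> np.ndarray:
    """Re-project a source depth image into the reference view.

    depth_src: HxW z-depth image for the source frame.
    K:         3x3 pinhole intrinsics (assumed shared by both views).
    T_ref_src: 4x4 transform taking source-frame points into the
               reference frame, from the per-frame pose estimates.
    """
    h, w = depth_src.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project every source pixel to a 3-D point in the source frame.
    rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(),
                                        np.ones(h * w)])
    pts = rays * depth_src.ravel()
    # Transform into the reference frame and project with the same K.
    pts = T_ref_src[:3, :3] @ pts + T_ref_src[:3, 3:4]
    uvw = K @ pts
    out = np.full((h, w), np.nan)   # NaN marks pixels with no sample
    valid = uvw[2] > 0
    ur = np.round(uvw[0, valid] / uvw[2, valid]).astype(int)
    vr = np.round(uvw[1, valid] / uvw[2, valid]).astype(int)
    inb = (ur >= 0) & (ur < w) & (vr >= 0) & (vr < h)
    out[vr[inb], ur[inb]] = pts[2, valid][inb]
    return out
```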
The various steps, processes, and algorithms discussed above will now be described in greater detail, with specific examples.
Depth image fusion
The idea of pre-fusion of depth images (see step 250 in Fig. 3) is to decimate the number of frames to be assimilated into the volumetric map (3-D occupancy map).
This is motivated by the fact that the process of updating the 3-D occupancy map may be time consuming and therefore might not be possible at high frame-rates, in real-time, particularly on constrained platforms. While such decimation could be achieved by simply taking 1 frame every N frames and discarding the rest, pre-fusion tries to merge the depth information contained in all frames, rather than simply discarding it. The present example utilises a histogram-based method, which makes the system more robust to outliers and improves the overall quality of the reconstruction, even compared with individually assimilating every 2-D depth image into the 3-D occupancy map. The system works by collecting batches of N frames and processing each batch independently, using a statistical smoothing filter. The output of each batch will be a single depth image to be assimilated into the 3-D occupancy map. The chosen value of N can vary. In the present example, a value of 11 frames is used. This effectively reduces the time required to assimilate individual 2-D depth images by a factor of 11. At a frame-rate of 30 frames per second, assuming every frame is used, N=11 implies that the time span of each batch would be of the order of 0.3 s. It can be assumed that, under normal motion conditions, the change of viewpoint will be minimal in this 0.3 s period. This approach works well when the noise in the depth measurements is approximately Gaussian in distribution.
Alternatively, batches could be formed of N frames whose successive pose distance (defined over a metric in transformation space SE3) is above a threshold. In this case, not all frames are fused. A new frame is added into the batch only if the relative transformation with respect to the last frame satisfies the selected criteria; otherwise it is discarded. This approach may yield better results when the measurement error in depth is highly correlated with the viewpoint - as is the case when depth is extracted from stereo, for example. When deriving depth from stereo, some systematic errors (in particular, artefacts at boundaries) persist until the camera moves. It is therefore desirable to wait for some movement before adding to the batch.
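One simple metric on SE(3) that could implement this criterion is a weighted sum of translation distance and rotation angle. This particular metric, and its weighting, are illustrative assumptions; the text above only requires some metric in transformation space:

```python
import numpy as np

def pose_distance(T_a: np.ndarray, T_b: np.ndarray,
                  rot_weight: float = 1.0) -> float:
    """Distance between two 4x4 poses: translation distance plus a
    weighted relative-rotation angle (radians)."""
    t = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
    R = T_a[:3, :3].T @ T_b[:3, :3]
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    return t + rot_weight * angle

def should_add_to_batch(T_new: np.ndarray, T_last: np.ndarray,
                        threshold: float) -> bool:
    """Add a frame to the current fusion batch only if the camera has
    moved far enough since the last accepted frame."""
    return pose_distance(T_new, T_last) > threshold
```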
The filtering method used is explained in Fig. 9. Depth histograms are used at the pixel level to compute the depth mode. Using the mode instead of the mean can make the filter more resilient to outliers. In practice, most outliers are rejected at this stage - particularly those that result from a high frequency random process and are therefore detectable from time consistency alone. The filtering is based on an assumption that the process which generates depth values is a mixture of a Gaussian plus an outlier distribution. Under this assumption, if the histogram bin size is selected to match the Gaussian standard deviation, it can be expected that most measurements will fall in single adjacent bins. The filter selects the bin with most measurements and computes the mean depth of all the points that fell sufficiently close to this bin (in particular, fell in its immediately adjacent bins), discarding any remaining depth values as outliers.
As illustrated in Fig. 9, in a batch of N frames, the central frame is used as the reference pose for the output filtered depth map. The middle triangle in the drawing represents the viewpoint (pose) of the central frame. The other triangles represent the viewpoints (poses) of other frames along the trajectory of the mobile device 100. Points from all other frames are re-projected into the central frame and stored in the pixel they re-project to. Then, for each pixel, a histogram of depths is computed. This histogram can be discretised either linearly with depth or with inverse-depth, depending on the sensor error model. After the histogram is computed, the bin with the highest re-projected point count (that is, the mode) is selected. A final per-pixel refinement is made by averaging the points that fall close to the selected bin. In the drawing, the grey shaded box is an observed object, and the dots are the noisy measurements. The set of histograms is represented schematically by the trapezoidal grid. For example, three adjacent histogram bins for a pixel at the right hand edge of the depth image are denoted by reference numerals 20-1, 20-2, and 20-3.
There is no visibility-reasoning in the pre-fusion process, as the surface is assumed to be observable from all frames in the same batch. This can be assumed because the filter is local - that is, it operates on small batches where the camera "sees" almost the same view - allowing the system to use a fast yet robust method.
Another advantage of pre-fusion is that it not only smooths the data through time filtering but also fills holes of missing data. This addresses another important problem with real depth sensors. Note that the pre-fusion method uses the poses provided by the VI-SLAM algorithm, which can enable accurate re-projection of 2-D depth image points into the central reference view.
Sub-maps
In order to handle maps of ever-growing size, we subdivide space into sub-maps (see step 210 in Fig. 3). The sub-maps are cubic blocks of volumetrically reconstructed maps that can be moved as independent units. While octrees, and non-parametric volume discretisations in general, do not provide an easy way of correcting deformations (in contrast to meshes, for example), a cheap but effective piecewise approximation can be made by means of sub-maps.
It is desirable to accommodate map deformations because the depth observations are anchored to camera poses which are being continuously optimised by the VI-SLAM system. Loop closures may introduce substantial changes in absolute pose that need to be reflected for the map to remain consistent. In light of this, it is proposed to anchor the sub-maps to the part of the trajectories from which those sub-maps were observed, and to move them following the changes in trajectory due to optimisation. Sub-map subdivision has the additional potential benefit of facilitating an incremental handling of the map. From a data transport and saving point of view, only sub-maps that have been updated need to be re-sent.
Sub-maps are created when a part of space needs to be allocated and is not currently covered by any active sub-map. New sub-maps are created aligned with a virtual grid which is fixed to the world coordinate system.
More specifically, in order to decide when a new sub-map is required, a low-resolution auxiliary global occupancy map is created from the new depth image observations, using the same techniques described later in this document. The voxel centres of this low-resolution map are used to build a set of sampling points. Only voxels presenting an indication of free or occupied space are used. The active maps are sampled according to this set and, if a point is found not to be inside the bounding volume of any active map, a new sub-map is added according to the aligned grid. The auxiliary map is discarded after the allocation process.
The resolution of this auxiliary map is chosen to be sufficiently finer than the sub-map size, in order to reduce the likelihood that a required new map might be skipped due to lack of sampling. It is also chosen coarse enough to make the time required to compute this auxiliary map negligible in comparison with the other computations. As an example, for a sub-map size of 2.56 m, the auxiliary map resolution could be chosen to be 0.256 m. By employing the same occupancy map methodology to compute the auxiliary map and later to update the sub-maps, the approach helps to ensure that enough sub-maps are added to cover the occupancy scalar field defined by the new depth measurements and the inverse sensor model, noting that this scalar field is not confined to the space near the measured surface and instead spans from the camera, covering the frustum and beyond the surface.
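A sketch of the allocation logic might look as follows (Python; the Submap class, the grid step and the interfaces are illustrative assumptions, with the 2.56 m sub-map size taken from the example above):

from dataclasses import dataclass
import numpy as np

SUBMAP_SIZE = 2.56   # metres (example value from the text)
GRID_STEP = 2.0      # grid step smaller than the sub-map size, so neighbours overlap

@dataclass
class Submap:
    origin: np.ndarray   # corner of the cubic block on the virtual world grid
    size: float = SUBMAP_SIZE
    def contains(self, p):
        return bool(np.all(p >= self.origin) and np.all(p <= self.origin + self.size))

def allocate_submaps(sample_points, active_submaps):
    # sample_points: centres of auxiliary-map voxels indicating free or occupied space.
    new_origins = set()
    for p in sample_points:
        if any(sm.contains(p) for sm in active_submaps):
            continue
        # Snap to the virtual grid fixed to the world coordinate system.
        new_origins.add(tuple(np.floor(p / GRID_STEP) * GRID_STEP))
    return [Submap(origin=np.array(o)) for o in new_origins]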
When a map is created, it is added to an active list and kept in that state until no update has been made for a certain amount of time, indicating that the camera has moved away from it. As a consequence, if a part of space is revisited after the corresponding sub-map has been dropped from the active list, a new sub-map will be created in that place. This reflects the way that the VI-SLAM system works: while the camera sees a place, it will continuously be optimising the same landmarks, thus keeping pose drift low. On the contrary, when the camera moves away, landmarks are marginalised and drift starts to accumulate. Consequently, when the camera revisits a place, it is not possible to ensure that the alignment is correct. Rewriting an old map in these circumstances could potentially corrupt the information in it. Therefore, we choose to create a new map. Optionally, there may be a supervision system running in parallel with the 3-D occupancy mapping algorithm, which merges or replaces maps once several checks have been made of their consistency.
When sub-maps move, it is possible that holes might be created in the map. This is mitigated by creating the sub-maps with a certain amount of overlap, which is determined by the grid step in relation to the sub-map size.
The poses of sub-maps are updated when the trajectory is optimised. In order to determine the relation between a change in trajectory and a sub-map pose, when a 2-D depth image is assimilated into a sub-map, the camera pose associated with it is saved together with the current sub-map position in a CameraObservation structure, alongside the frame ID and current timestamp. If an update in trajectory occurs at a later time t2, each sub-map will go separately through the list of its CameraObservations, generating for every one of them a new pose proposal based on the relative pose change in the trajectory:

T_{wm,k}^{t2} = T_{wc_k}^{t2} (T_{wc_k}^{t1})^{-1} T_{wm}^{t1},   ∀k ∈ CameraObservations    (1)

Here, CameraObservations is a subset of poses, built by the sub-map based on which 2-D depth images were assimilated into it.
Finally, a mean pose is obtained by averaging all proposals in SE(3) space by means of the exponential and logarithmic maps:

T_{wm}^{t2} = exp_{SE(3)}[ (1/N) Σ_k log_{SE(3)}(T_{wm,k}^{t2}) ]    (2)

While the complexity of this equation grows linearly with the number of observations and also with the number of sub-maps, it should be noted that these are reduced-dimensionality operations. Additionally, as this process is mostly needed when a loop closure occurs, it can be handled in parallel in a back-end thread.
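As a sketch, equation (2) could be realised as follows (Python; scipy's generic matrix logarithm and exponential are used here for brevity, whereas a production system would more likely use closed-form SE(3) maps):

import numpy as np
from scipy.linalg import expm, logm

def average_poses(pose_proposals):
    # pose_proposals: list of 4x4 homogeneous proposals T_wm,k from equation (1).
    # Average the se(3) twists of all proposals, then map back to SE(3).
    mean_twist = sum(logm(T) for T in pose_proposals) / len(pose_proposals)
    return np.real(expm(mean_twist))   # logm may return a complex-typed result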
Sub-map fusion
Due to the overlapping of sub-maps and the revisiting of places by the mobile device, space will be mapped redundantly by many sub-maps. These might not all agree on the occupancy state of a given volume. These sub-maps should be merged together in order to disambiguate this information. There are several strategies that could be followed in terms of when it is best to perform this merging. In the present implementation, all sub-maps are merged at once at the end of the session.
Another aspect is how the merging is done. In the present implementation, a clamped addition is used, as described below. This method behaves differently depending on the choice of the clamping parameter. When it is chosen high, the clamping never occurs and effectively the method adds together the occupancy indications of all sub-maps. This approach is described in Prisacariu et al. (Victor Adrian Prisacariu, Olaf Kahler, Stuart Golodetz, Michael Sapienza, Tommaso Cavallari, Philip H. S. Torr, and David W. Murray, "InfiniTAM v3: A framework for large-scale 3D reconstruction with loop closure", arXiv preprint arXiv:1708.00783, 2017). On the contrary, when the clamping is low, the method will prefer newer measurements, effectively ignoring old ones. The latter can be beneficial when changes in the map render the old measurements erroneous, but it has the downside of potentially losing useful information.
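A minimal sketch of clamped-addition merging for a single volume element is given below (Python; the clamp limit is an assumed parameter - a high limit approaches a plain sum over sub-maps, while a low limit lets newer measurements dominate):

import numpy as np

def merge_occupancy(values, clamp_limit):
    # values: occupancy log-odds for the same volume from overlapping sub-maps,
    # ordered oldest to newest.
    merged = 0.0
    for v in values:
        merged = float(np.clip(merged + v, -clamp_limit, clamp_limit))
    return merged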
Occupancy mapping
The large-scale mapping system already described uses sub-maps to subdivide the mapping volume in order to support map deformation and incremental updates. As such, the system is agnostic to the particular volumetric mapping strategy used to update each sub-map. This section explains this aspect of the proposed method in greater detail.
As shall be seen, a method is provided for storing and operating on a dense 3-D scalar field f(x) representing occupancy in an efficient manner. Here, x ∈ R^3 is a sampling point in space. The aim is to have a system that can give an indication of occupancy for any point of space, given the evidence provided by the sensor observations.
The output of the method can be thought of as a 3-D scalar field of occupancy log-odds. The value at each point represents the log-odds of the probability of that point being occupied:

l_occ = log( P_occ / (1 − P_occ) )    (3)

This means the more negative the value, the higher the probability that the space is free; and the more positive the value, the higher the probability that the space is occupied. Unknown space takes a value of 0 - that is, the probability that the space is occupied is 50%.
The occupancy field is uniform in most places, so it can be efficiently represented using adaptive resolution. In particular, higher resolutions are needed only near surfaces and highly detailed objects, while free space (which accounts for most of space) can be represented at much lower resolutions. This is mainly a sampling problem, in which we aim to choose a sampling resolution such that the underlying function is correctly represented. This sampling resolution is not constant and is made adaptive, such that the problem remains tractable in large-scale scenarios.
In this formulation, each 2-D depth image defines occupancy as a 3-D scalar field that takes the specific form:

f_k(x) = S( ||T_cm x|| − D_r(π(T_cm x)) )    (4)

Here, T_cm is a transformation from map to camera, D_r is a range map (computed from depth values in the 2-D depth image) and π(·) is the projection function. S(·) is a 1-D inverse sensor model function that takes the range difference between the point and the measurement and returns an indication of the probability of occupation of that point, or an implicit function that defines a surface to be reconstructed. In the former case, this is normally interpreted as log-odds of occupancy probability (equation 3), while in the latter case, a truncated signed distance function (TSDF) is the normal choice.
Inverse Sensor Models
The inverse sensor model determines the probability of occupation along a ray, given a depth measurement (see step 452 in Fig. 6, for example). The current implementation is not tied to a particular sensor model, the only requirement being that the model takes the form described in the previous paragraph. Experiments have been conducted with three sensor model functions, which follow heuristic formulations:
1. Narrow-band TSDF (NB-TSDF) - this function takes the following form:

S(r, r̂, σ) = (r − r̂)/σ   if |r − r̂| < σ
           = 0            otherwise    (5)

where r is the range corresponding to the sample point x, r̂ is the measured range (that is, a range value obtained from the 2-D depth image) and σ is a parameter that defines the narrow band and is normally correlated with the expected sensor uncertainty.
2. Half-Extended TSDF (HE-TSDF) - this function can be seen either as a piecewise approximation to a more traditional occupancy-mapping inverse sensor model, or simply as an extension to the NB-TSDF that provides an indication of free space:

S(r, r̂, σ) = −1           if r − r̂ < −σ
           = (r − r̂)/σ    if |r − r̂| < σ
           = 0             otherwise    (6)

3. Half-Extended with back-weighting (HEW-TSDF) - this is the default function and is the result of multiplying the back part of the HE-TSDF function by a linearly decaying weight that smooths the function and reduces the interference between surfaces observed from different viewpoints. It takes the following form:

S(r, r̂, σ) = −1                           if r − r̂ < −σ
           = (r − r̂)/σ                    if −σ ≤ r − r̂ < 0
           = ((r − r̂)/σ)(1 − (r − r̂)/σ)   if 0 ≤ r − r̂ < σ
           = 0                             otherwise    (7)

While this function is slightly more complex than the HE-TSDF, it has the benefit of providing a smooth decay towards zero in the tail. This makes the function easier to represent, as it avoids a discontinuity, and also reduces the interference between thin object surfaces seen from different sides.
All three models are shown in Fig. 10. In all models, negative values represent free space, while positive values represent occupied volume. For comparison, Fig. 10 also shows a scaled version of the spline model of Loop et al. (Charles Loop, Qin Cai, Sergio Orts-Escolano, and Philip A. Chou, "A closed-form Bayesian fusion equation using occupancy probabilities", in 3D Vision (3DV), 2016 Fourth International Conference on, pages 380-388, IEEE, 2016).
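For illustration, the three models (equations 5-7, as reconstructed above) could be implemented as follows (Python; the exact branch boundaries follow the reconstruction above and should be checked against Fig. 10):

def nb_tsdf(r, r_hat, sigma):
    dr = r - r_hat
    return dr / sigma if abs(dr) < sigma else 0.0   # no information outside the band

def he_tsdf(r, r_hat, sigma):
    dr = r - r_hat
    if dr < -sigma:
        return -1.0                                 # extended free-space indication
    return dr / sigma if abs(dr) < sigma else 0.0

def hew_tsdf(r, r_hat, sigma):
    dr = r - r_hat
    if dr < -sigma:
        return -1.0
    if dr < 0.0:
        return dr / sigma
    if dr < sigma:
        # Back part of the HE-TSDF times a linearly decaying weight,
        # giving a smooth decay towards zero in the tail.
        return (dr / sigma) * (1.0 - dr / sigma)
    return 0.0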
Integration of occupancy indications from multiple 2-D depth images
The integration (see step 456 in Fig. 6) is done incrementally, by adding to the previous state the occupancy field defined by the latest 2-D depth image and applying a clamping function:

f^t(x) = clamp[ f^{t−1}(x) + f_k(x) ]    (8)

The clamping prevents values going above a predefined threshold and thereby avoids the occupancy growing indefinitely if, for example, the camera remains stationary, observing the same volume. This can allow the system to handle changing environments better, for example.
The update function can be computed in one of two ways. In the first approach, as defined above in equation (8), the occupancy may be stored as a single scalar value per sample, in which case the clamping is simply a saturation. In the second approach, surface reconstruction based on TSDF may store both a weight and a mean value instead of the absolute sum. Hence, two scalars are needed per sample and the update is computed as a weighted mean:

F^t(x) = ( w^{t−1} F^{t−1}(x) + F_k(x) ) / ( 1 + w^{t−1} )    (9)

w^t = clamp( 1 + w^{t−1} )    (10)

where it should be noted that the clamping is applied to the weight instead of the absolute sum.
The main advantage of the first formulation is its simplicity and memory footprint; however, this comes at the cost of losing the information about how many measurements were actually integrated. Also, due to the different clamping, the behaviour is significantly different between the two cases. Consider the case of a single measurement being integrated a number of times. Using the first formulation, negative values will saturate to the negative limit, while positive values will saturate to the positive limit, leaving a step in the transition from free to occupied space that loses all information about the distance to nearby surfaces. The second formulation retains the original sensor model profile. Finally, while the first model saturates in finite time, the second will converge asymptotically, with behaviour similar to a first-order infinite impulse response filter or a rolling average.
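The two update rules (equations 8-10, as reconstructed above) can be sketched as follows (Python; the clamp limits are assumed parameters chosen for illustration):

import numpy as np

def update_clamped_sum(f_prev, f_k, limit=10.0):
    # First formulation: a single scalar per sample; clamping is a saturation.
    return float(np.clip(f_prev + f_k, -limit, limit))

def update_weighted_mean(f_prev, w_prev, f_k, w_max=100.0):
    # Second formulation: two scalars per sample; behaves like a rolling average
    # and retains the original sensor-model profile.
    f_new = (w_prev * f_prev + f_k) / (1.0 + w_prev)
    w_new = min(1.0 + w_prev, w_max)   # clamping applied to the weight
    return f_new, w_new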
Adaptive Sampling of a 3-D scalar field
The fact that NB-TSDF and similar models store information in the full camera frustum changes the problem from being sparse (bounded by surfaces) to dense (bounded by volume). This has motivated the present inventors to use octrees, as they allow a high degree of flexibility in adapting the resolution to the underlying data. This adaptive resolution is a way of mitigating the volumetric growth during exploration.
In order to represent the occupancy field we make the following assumptions and definitions: 1. The octree represents a partition of the 3D volume in which each voxel completely represents the volume it covers. This implies that no voxel-interpolation is needed to find the function value at a given point.
2. Occupancy data is stored at least in the minimum size (leaf) voxels of the tree (optionally, occupancy data may also be stored in intermediate nodes). Note that this is merely an algorithmic decision.
Definitions 1 and 2 imply that, in order to infer occupancy at a given point, one only has to traverse the tree until one finds the minimum-size voxel that contains the point, and query the value of that voxel. This is a desirable characteristic due to its simplicity. In particular, Definition 1 permits a splitting approach to allocating and updating the tree, where each node is evaluated independently, only having to consider the values that the function takes in the volume covered by the voxel, without taking into account its neighbours, making the algorithm well-suited to parallel implementations.
Choosing The Right Resolution
We try to choose a sampling resolution such that the error induced is below a certain limit. This error will depend on how we approximate the function inside the voxel.
The current implementation assumes a constant value inside the voxel, thus approximating the field as piece-wise constant. Therefore, the error criterion takes the following form:

|f_k(x) − v_i| < ε,   ∀x ∈ voxel_i    (11)

where v_i is the stored voxel value.
This criterion is used in the allocation and update algorithm to decide if a node is good enough to represent the function, or if it must be split into smaller voxels (see step 435 in Fig. 6). Note that this criterion is used to choose the resolution independently for each 2-D depth image; therefore, it does not bound the error of the integrated (posterior) occupancy field.
In other implementations, instead of approximating the function inside a voxel as constant, one may use a higher-order function. The simplest alternative to a constant is a linear function. In this case, the error criterion takes the form:

|f_k(x) − v_i − g_i^T (x − c_i)| < ε,   ∀x ∈ voxel_i    (12)

where g_i is the stored voxel gradient and c_i is the voxel centre. This implies that four values, instead of just one, are needed per voxel, but it allows the function to be represented more accurately at a lower resolution (that is, with a smaller number of larger voxels).
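By way of illustration, the two error criteria (equations 11 and 12) could be checked on a finite set of sample points as follows (Python; the least-squares fit is an illustrative choice - the implementation described below instead bounds the variation analytically):

import numpy as np

def constant_fit_ok(f_values, epsilon):
    # f_values: occupancy field sampled at points inside the voxel (equation 11).
    v_i = f_values.mean()                               # stored voxel value
    return bool(np.all(np.abs(f_values - v_i) < epsilon)), v_i

def linear_fit_ok(f_values, points, centre, epsilon):
    # Piecewise-linear variant (equation 12): fit a value and gradient jointly.
    A = np.hstack([np.ones((len(points), 1)), points - centre])
    coeffs, *_ = np.linalg.lstsq(A, f_values, rcond=None)
    v_i, g_i = coeffs[0], coeffs[1:]
    residual = f_values - A @ coeffs
    return bool(np.all(np.abs(residual) < epsilon)), v_i, g_i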
Allocation and Evaluation Recursion
In the light of the above definitions, new 2-D depth images are integrated by recursively applying the error criterion to subsequent levels of the tree, starting from the tree root node, until a voxel small enough to meet the error criterion is reached. The most difficult part of this process is how to quickly and efficiently determine whether the error criterion is met for a given volume of any size. To solve this problem, consider that the occupancy field is defined by a projective depthmap and the sensor model.
Therefore, some upper bounds on the spread of occupancy values can be established if we project the queried voxel into the 2-D depth image, identify a bounding box that encloses the projection, and consider the spread of range values found in the 2-D depth image within that bounding box. At the same time, we also consider the spread of distance values inside the voxel - that is, the maximum and minimum distances from the camera to any part of the voxel. The spread of range values and the spread of distance values are considered together, in the context of the sensor model. Note that, in the case of a piece-wise constant approximation, the spread of values taken by the occupancy field inside a voxel is directly related to the error of the constant approximation.
In particular, we define a closed-form expected variation function Ψ(r_min, r_max, r̂_min, r̂_max, σ) that provides an upper bound on the occupancy variation, given the spread of distances in the voxel r_min, r_max and the spread of range values r̂_min, r̂_max:

|S(r(x1), r̂(x1), σ) − S(r(x2), r̂(x2), σ)| ≤ Ψ(r_min, r_max, r̂_min, r̂_max, σ),   ∀x1, x2 ∈ voxel_i

The particular shape of this function depends on the sensor model chosen; but, for some of the example cases considered, where the occupancy is a function of the difference δr = r̂ − r and the uncertainty σ is constant, this function can be simplified to:

Ψ(r_min, r_max, r̂_min, r̂_max, σ) = Ψ(δr_min, δr_max, σ)

where δr_min = r̂_min − r_max and δr_max = r̂_max − r_min. Given that, in this case, the sensor model function S(δr) is one-dimensional, the evaluation of Ψ(δr_min, δr_max) can be performed by considering the critical points of S inside the interval [δr_min, δr_max], alongside the values at the extrema, S(δr_min) and S(δr_max).
These computations are performed in an approximate manner, posing the spread of occupancy values as a function of (i) the sensor model; (ii) the spread of measured range values; and (iii) the spread of point distances to the camera inside the voxel. These are all estimated conservatively, meaning that the method may tend to err on the side of choosing an overly high resolution. However, ultimately, this will not harm the accuracy of the estimation.
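A sketch of this bound for a one-dimensional sensor model S(δr) is given below (Python; S and its critical points are passed in explicitly, as an illustrative interface - for the HE-TSDF sketched earlier, the critical points would be −σ, 0 and σ):

def expected_variation(S, critical_points, r_min, r_max, rhat_min, rhat_max):
    # Smallest and largest possible range differences delta_r inside the voxel,
    # using the convention delta_r = r_hat - r from the text above.
    dr_min = rhat_min - r_max
    dr_max = rhat_max - r_min
    # Evaluate S at the interval end-points and at any interior critical points;
    # the spread of these values bounds the occupancy variation in the voxel.
    candidates = [S(dr_min), S(dr_max)]
    candidates += [S(c) for c in critical_points if dr_min < c < dr_max]
    return max(candidates) - min(candidates)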
Sometimes, there may be unknown or invalid depth measurements which back-project into the voxel. These are depth image points whose depth could not be determined for any reason. One way of dealing with these unknown measurements is to assign an occupancy log-odds value of 0 along the ray. This value can then be used to adjust the expected variation in occupancy accordingly. However, this may tend to over-segment the space, as voxels will be subdivided wherever the occupancy does not appear to be reasonably uniform. An alternative method is to split voxels that contain these unknown depth measurements only up to a certain predefined resolution which is less than the maximum resolution of the tree. This can help to avoid the over-division of voxels due to missing (invalid) depth measurements.
2-D depth image pooling
While the spread of distances inside a given voxel can be approximated easily by considering a circumscribing sphere, the spread of depth measurements falling within the projected area of the voxel is generally harder to approximate efficiently. The simplest approach would be an exhaustive search of all the pixels inside the bounding box that encloses the projection of the voxel, but the run-time required would make this challenging for a real-time embedded implementation. The presently proposed solution is to pre-compute a min-max pooling structure over the 2-D depth image (see step 260 in Fig. 3), which allows the system to query the spread of ranges (the range variance) for arbitrary bounding boxes with only a small number of memory accesses and minimal calculation at query time.
In particular, for a bounding box centred at c and with size w × h, we are interested in getting a conservative estimate of the following metrics:
* Maximum measured range.
* Minimum measured range.
* Number of valid depth measurements.
* Number of invalid depth measurements.
Three possible ways of doing this pooling have been devised. These will now be described.
1. Pyramidal Min-Max Pooling. This is the pooling structure currently adopted. An image pyramid is constructed by iteratively pooling the pixels of the 2-D depth image, until a level of one pixel is reached. At each level of the pyramid, each pixel stores the minimum and maximum of all of the corresponding pixels at the lower levels.
Later, when a query is needed, the required level is computed with the following formula:

l = ceil( log2( max(w, h) ) )    (13)

Then, as shown in Fig. 11, for any bounding box 30, a single query 32 at level l and five queries 34 at level l − 1 are made. The single query 32 at level l retrieves the element at this level of the pyramid that corresponds to the centre of the bounding box.
The five queries 34 at the lower level correspond to the portions of the bounding box that lie outside the area covered by the query 32 at level l. The results of these six queries are min-max pooled to obtain a single maximum, a single minimum, and the total numbers of invalid and valid depth measurements. It can be seen that the area queried is larger than the bounding box - that is, the maximum and minimum are determined for a set of pixels that is a superset of the pixels within the bounding box. This is due to the rounding up of the level in equation 13. This will tend to overestimate the spread of depth values, ensuring that the true maximum and true minimum are encompassed and giving the conservative estimate that is aimed for. This may lead to a higher resolution than absolutely required, which is not a problem. If a bounding box has a certain number of invalid measurements, the method may choose to split the voxel (that is, increase resolution), in order to account for the fact that it cannot know what that measurement is. A sketch of this pooling, and of Method 2 below, is given after this list.
2. Integral Image Mean-Variance Pooling. Another approach is to use integral images to efficiently compute moments using box filters. In this case, three integral images are pre-computed, for the zeroth, first, and second moments of depth. At query time, only a few lookups in these integral images are needed to compute the mean μ and variance σ² for any bounding box. The min-max spread can then be approximated as μ ± 3σ. This approximation is not as conservative as using the multiresolution min-max pyramid; however, it may be more accurate, as the size of the bounding box does not have to be rounded to a power of two (which is effectively the case for Method 1).
3. Smooth Max Integral Image Pooling. As another alternative, a potentially better approximation than Method 2 is to use a smooth maximum approximation:

max(x_BB) ≈ Σ_i ( x_i exp(α x_i) ) / Σ_i ( exp(α x_i) ),   ∀x_i ∈ BB    (14)

min(x_BB) ≈ Σ_i ( x_i exp(−α x_i) ) / Σ_i ( exp(−α x_i) ),   ∀x_i ∈ BB    (15)

Here, x_i are the depth values (specifically, range values). The parameter α determines how dominant the maximum and minimum values are over the other values in the summation. Similarly to Method 2, the summations over the bounding box BB are computed using box filtering with pre-computed integral images. However, this computation is also not conservative in terms of the absolute maximum and minimum range. Choosing too large a value for the parameter α may cause numerical problems, as overflows can occur when calculating integral images for an array of very large numbers. Care should be taken in the integral image calculation to avoid overflows.
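By way of illustration, sketches of Method 1 and Method 2 are given below (Python; NaN marks invalid measurements, and all interfaces are illustrative assumptions rather than the actual implementation). First, a simplified pyramidal min-max pooling, where for brevity the query uses only the single central lookup at level l from equation (13), omitting the five finer-level queries and the valid/invalid counts:

import numpy as np

def build_minmax_pyramid(range_image):
    # Each level halves the resolution; each cell keeps the min and max of the
    # cells it covers. NaN (invalid) pixels are ignored by nanmin/nanmax.
    lo, hi = range_image, range_image
    levels = [(lo, hi)]
    while max(lo.shape) > 1:
        h = (lo.shape[0] + 1) // 2 * 2
        w = (lo.shape[1] + 1) // 2 * 2
        lo_p = np.full((h, w), np.nan); lo_p[:lo.shape[0], :lo.shape[1]] = lo
        hi_p = np.full((h, w), np.nan); hi_p[:hi.shape[0], :hi.shape[1]] = hi
        lo = np.nanmin(lo_p.reshape(h // 2, 2, w // 2, 2), axis=(1, 3))
        hi = np.nanmax(hi_p.reshape(h // 2, 2, w // 2, 2), axis=(1, 3))
        levels.append((lo, hi))
    return levels

def query_spread(levels, row, col, w, h):
    # Conservative spread for a w x h box centred at (row, col): equation (13)
    # rounds the level up, so the queried area is a superset of the box.
    level = min(int(np.ceil(np.log2(max(w, h, 1)))), len(levels) - 1)
    lo, hi = levels[level]
    return lo[row >> level, col >> level], hi[row >> level, col >> level]

Second, mean-variance pooling with integral images, where the spread is approximated as the mean plus or minus three standard deviations; the box-sum sketch assumes the box does not touch the top or left image border and contains at least one valid pixel:

def build_integrals(range_image):
    valid = np.isfinite(range_image)
    z = np.where(valid, range_image, 0.0)
    ii = lambda a: np.cumsum(np.cumsum(a, axis=0), axis=1)
    return ii(valid.astype(float)), ii(z), ii(z * z)   # zeroth, first, second moments

def box_sum(ii_img, y0, x0, y1, x1):
    # Inclusive box sum via four lookups.
    return (ii_img[y1, x1] - ii_img[y0 - 1, x1]
            - ii_img[y1, x0 - 1] + ii_img[y0 - 1, x0 - 1])

def spread_estimate(integrals, y0, x0, y1, x1):
    n, s1, s2 = (box_sum(ii_img, y0, x0, y1, x1) for ii_img in integrals)
    mean = s1 / n
    std = max(s2 / n - mean * mean, 0.0) ** 0.5
    return mean - 3.0 * std, mean + 3.0 * std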
Pruning
Octree pruning (see step 460 in Fig. 6) is the process of reducing the resolution, where possible, in order to make the octree representation more efficient. Considering equation 11, a similar criterion is used to identify instances where eight child-voxels can be simplified into a single parent node. This is done by first computing a merged value of the occupancy indication, as the average of the eight child-voxels, and then checking that this average value approximates the voxel values to within a predefined error margin. If the test is passed, the merged average value of occupancy is stored in the parent node and its former child-voxels are released from memory. Thus, the former parent node becomes a leaf node.
For a node to be eligible for pruning, two conditions need to be satisfied: 1. Its children must be leaf-nodes (that is, they must not have children of their own); 2. The occupancy values of the child-nodes need to be sufficiently uniform.
This is tested using an epsilon value (the pruning epsilon). The value of epsilon for pruning should be set lower than the value of epsilon for splitting (equation 11), for the system to work correctly.
This process is applied after equation 11, as part of the map update process, meaning that the map is always kept pruned. Note that, in contrast to the resolution selection governed by equation 11, the pruning criterion is applied to the posterior - that is, after the integration of new measurements - possibly yielding different results.
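A sketch of the pruning test is given below (Python; the Node class is an illustrative assumption of the octree interface):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    value: float = 0.0
    children: List["Node"] = field(default_factory=list)

def try_prune(node, pruning_epsilon):
    # Condition 1: the children must themselves be leaf nodes.
    if len(node.children) != 8 or any(c.children for c in node.children):
        return False
    values = [c.value for c in node.children]
    merged = sum(values) / 8.0
    # Condition 2: the merged average must approximate every child value.
    if any(abs(v - merged) > pruning_epsilon for v in values):
        return False
    node.value = merged    # parent stores the merged occupancy indication
    node.children = []     # former children are released; parent becomes a leaf
    return True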
Data Down Propagation
Data down propagation along the octree refers to the process of increasing the sampling resolution of the occupancy field. This happens when the occupancy field of the current 2-D depth image needs to be evaluated at a lower resolution than the one stored in the tree (for example, in step 458 of Fig. 6), or when the resolution of the tree needs to be increased (for example, in step 448 of Fig. 6). Data is propagated down the octree taking into account the manner in which the voxel occupancy values approximate the occupancy field. In the present implementation, which is based on a piecewise-constant approximation, the value from the parent node is copied down to the children without interpolation. If a piecewise-linear approximation were to be adopted instead, interpolation could be used to compute the occupancy values propagated down to the child nodes, for greater accuracy.
Data Up Propagation (Node Pooling)
When the occupancy value of a voxel is updated, the updated value is propagated to the higher (lower-resolution) nodes of the tree by means of max-min pooling. This technique of keeping the max and min values at every level of the octree is relatively cheap to compute and allows the system to obtain, for any volume, a conservative estimate of the maximum and minimum occupancy, by doing only a few voxel queries at the appropriate level. By knowing these max and min values of occupancy, the system can determine whether any part of that space is free (unoccupied), occupied, or unknown.
Allocation, Evaluation, Propagation and Pruning Recursion
The full update recursion used in the present implementation is presented below in the form of pseudo-code. Note that, in this algorithm, everything is done in a single pass. The method goes recursively down the tree until a resolution is found such that equation 11 is met. While traversing down, new nodes are allocated and data is down-propagated if necessary. Once the required level is reached, the recursion ends, and pruning and pooling are performed for the higher-level nodes in the octree while returning from the respective calls.
The single-pass update facilitates a parallelised implementation of the method. Although it may not be straightforward to see, this is enabled by Definitions 1 and 2, above. These define that voxels represent a segmentation of space in which the values of the occupancy field inside a voxel are determined only by that voxel's occupancy value; therefore, no information from neighbouring voxels is needed (in particular, no interpolation is required).
The update recursion is called for the octree in every sub-map in the 3-D occupancy map. The pseudo-code is as follows:

function update_tree(tree, depthmap)
    ImagePoolingClass.compute(depthmap)   // compute the depthmap pooling
    update_node(tree.root())              // call update_node on the root
end

function update_node(node)
    BoundingBox aabb = getProjectionBoundingBox(node.centre, node.size)
        // query the voxel projection AABB (in 2-D image space)
    max_range, min_range = ImagePoolingClass.getMaxMinRange(aabb)
        // use the pooling structure to quickly estimate the spread of ranges inside the voxel
    model_variance = SensorModel.expectedVariance(max_range, min_range, node.max_range, node.min_range)
        // query the sensor model for the variance, given the measurement and voxel range spans
    if model_variance > epsilon
        if is_leaf(node)
            split_node(node)              // converts the node into internal, allocating its 8 children
            propagate_value_down(node)    // moves the current node value to its children (current node value is cleared)
        endif
        for i = 0 to 7
            update_node(node.child(i))    // call recursively on each child
        endfor
        compute_node_pooling(node)        // pooling is optional here but used in later stages
        if is_node_prunable(node)
            prune_node(node)              // remove children, making the node a leaf; assign the node value as an average of the children values
        endif
    else
        range_measurement = depthmap.query(position_to_image(node.centre))
            // query the depthmap at the node centre position
        node_distance = position_to_distance(node.centre)
            // compute the distance from the node centre to the camera centre
        newValue = SensorModel.computeValue(node_distance, range_measurement)
            // compute an inverse sensor model value, given the measured range and query distance
        if has_children(node)
            update_children_values(node)  // propagate the value down to the children, updating the lower leaf nodes
            if is_node_prunable(node)
                prune_node(node)          // remove children, making the node a leaf; assign the node value as an average of the children values
            endif
        else
            update_node_value(node, newValue)   // update the current node value
        endif
    endif
end

Sampling Methods
In the present implementation, once a sampling resolution is chosen, the voxel is sampled by evaluating the sensor model at the centre of the voxel. This works reasonably well as long as the sampling resolution is allowed to increase as dictated by the complexity of the occupancy function to be represented. However, there may be some cases in which it is desired to limit the resolution of the 3-D occupancy map - for example, for performance or memory reasons. This should be done carefully, in order not to introduce aliasing errors.
Several different strategies are possible for the sampling:
1. Nearest neighbour. This is adopted in the present implementation and is expected to be the fastest method. The voxel centre is projected into the 2-D depth image, and the depth pixel closest to the projected voxel centre is used to compute an occupancy value, using the inverse sensor model. Provided that the minimum sampling resolution is kept below the expected object size (4 cm, in the present implementation), this method has been found to produce good results.
2. Bilinear sampling. Instead of just considering the depth pixel that falls closest to the projected voxel centre (as in the nearest neighbour approach), a bilinear interpolation may be performed, using the four surrounding pixels, to obtain a more accurate result. It is preferable to do this interpolation in sensor model (occupancy) space, rather than in depth space, as the latter may introduce artefacts near to the boundaries of objects (that is, near to the boundaries of the occupied space). A sketch of this interpolation is given after this list.
3. Sphere projection sampling. Bilinear sampling does not consider situations in which multiple depth measurements may project into the same voxel. A possible way to mitigate this is to approximate a voxel by its circumscribing sphere, and to consider all the pixels that fall within the projection of this sphere in the 2-D depth image. With this approach, the maximum occupancy value indicated by any depth measurement inside the projection is considered. This makes the method conservative with thin objects, even when they are smaller than the minimum voxel sampling resolution. However, the time required to sample every pixel renders this strategy computationally more expensive.
4. Sphere projection with pooling approximation. Max-min pooling of the 2-D depth image may be used to optimize Method 3, by querying a bounding box and considering the spread of measurements. However, this approach may result in too coarse an approximation, when used for sampling.
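A sketch of bilinear sampling performed in occupancy space (Method 2 above) is given below (Python; `model` is any inverse sensor model function of the form S(r, r_hat, sigma), such as the illustrative sketches given earlier, and bounds checking is omitted for brevity):

import numpy as np

def sample_bilinear_occupancy(range_image, u, v, r, sigma, model):
    # u, v: projected voxel centre in pixels; r: voxel-centre distance to camera.
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    a, b = u - u0, v - v0
    # Push each of the four surrounding range pixels through the sensor model
    # first; interpolating occupancies (not depths) avoids boundary artefacts.
    occ = [[model(r, float(range_image[v0 + j, u0 + i]), sigma) for i in (0, 1)]
           for j in (0, 1)]
    return ((1 - a) * (1 - b) * occ[0][0] + a * (1 - b) * occ[0][1]
            + (1 - a) * b * occ[1][0] + a * b * occ[1][1])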
Other features
The memory 110 may store one or more computer programs (or software or code) and/or data. The computer programs may include an operating system for the processor 140 to execute in order for the mobile device 100 to function. The computer programs may include computer programs according to embodiments of the invention, or computer programs that, when executed by the processor 140, cause the processor to carry out a method according to an embodiment of the invention. The computer programs may be stored in a non-transitory computer-readable storage medium as well as, or instead of, the memory 110.
The processor 140 may be any data processing unit suitable for executing one or more computer readable program instructions, such as those belonging to computer programs stored in the computer-readable storage medium and/or the memory 110. The processor 140 may comprise a single data processing unit or multiple data processing units operating in parallel or in cooperation with each other. The processor 140 may, as part of the execution of one or more computer readable program instructions, store data to and/or read data from the computer-readable storage medium and/or the memory 110.
It should be noted that the above-mentioned embodiments illustrate, rather than limit, the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.
For example, the stereo camera pair 120 may be replaced by a single camera, or by a multi-camera rig (for example, providing an omnidirectional view of the environment). When a single camera is used, the depth information can be gathered by another depth-and/or ranging-sensor, such as a LiDAR or time-of-flight (ToF) optical sensor.
As mentioned already above, each parent voxel in the octree need not be subdivided into 2x2x2 = 8 child voxels. In particular, the subdivision at the leaf nodes of the octree may use a different number of child voxels. For example, the system may use NxNxN voxel blocks at the leaf nodes, with N ranging from 1 to 8. This feature can make the integration run faster; however, this may come at the expense of increased memory use.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The embodiments may be implemented by means of hardware comprising several distinct elements. In a device claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to some advantage. Furthermore, in the appended claims, lists comprising "at least one of: A; B; and C" should be interpreted as (A and/or B) and/or C.
Furthermore, in general, the various embodiments may be implemented in hardware or special-purpose circuits, software, logic, or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor, or other computing device, although these are not limiting examples. While various aspects described herein may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
The embodiments described herein may be implemented by computer software executable by a data processor of the apparatus, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks, and functions, or a combination of program steps and logic circuits, blocks, and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), gate-level circuits, and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments as discussed herein may be practiced in various components such as integrated circuit modules. The design of integrated circuits is largely a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Claims (20)

CLAIMS
1. A computer-implemented method for building and updating a 3-D occupancy map of occupied and unoccupied spaces in a 3-D environment, the method comprising: defining (210) a plurality of sub-maps (10, 11, 12, 13, 14) of the 3-D occupancy map, each sub-map describing a different volumetric region of the 3-D environment, each sub-map having associated with it a sub-map pose defining a position and orientation of the respective volumetric region within the 3-D occupancy map, wherein the volumetric regions overlap with one another; obtaining (220) a set of frames characterising the environment, each frame comprising a 2-D depth image and a pose estimate describing a position and orientation at which the depth image was captured; and updating (300) two or more of the sub-maps based on one of the 2-D depth images and its associated pose estimate.
2. The method of claim 1, further comprising: obtaining (230) updated pose information; and in response, updating (240) one or more of the sub-map poses, based on the received updated pose information.
3. The method of claim 2, wherein receiving the updated pose information comprises obtaining (230) updated pose estimates for one or more of the frames, and the method comprises updating (240) one or more of the sub-map poses based on the updated pose estimates.
4. The method of any one of claims 1-3, wherein each sub-map comprises an octree for storing occupancy information, each octree comprising a hierarchy of nodes, each node being associated with a voxel, wherein each leaf node of the octree includes an occupancy indication.
5. The method of claim 2 or claim 3, wherein each sub-map comprises an octree for storing occupancy information, each octree comprising a hierarchy of nodes, each node being associated with a voxel, wherein each leaf node of the octree contains an occupancy indication indicating whether the respective voxel is occupied or unoccupied, and wherein updating the at least one sub-map based on a given 2-D depth image comprises: projecting (310) each voxel of the sub-map into the 2-D depth image, to define a projected area within the 2-D depth image; and updating (400) each voxel based on depth values of the 2-D depth image that are located within the respective projected area.
6. The method of claim 5, wherein updating (400) each voxel comprises: defining (410) a first interval of depths encompassed by the voxel; defining (420) a second interval of depth values in the depth image that are located within the voxel's projected area; calculating (430), based on the first interval and the second interval, an expected variation in a value of the occupancy indication within the voxel; and, if (435) the expected variation is greater than a threshold, sub-dividing (446) the voxel into eight sub-voxels.
7. The method of claim 5 or claim 6, wherein updating at least one voxel comprises updating (456) the occupancy indication of the voxel, based on the depth values that are located within the voxel's projected area, wherein the occupancy indication is updated using an inverse sensor model, which determines the probability of occupation along a ray, given a distance between a queried point on the ray and a depth value in the depth image.
8. The method of any one of claims 5-7, further comprising, before updating (400) each voxel, pre-computing (260) a pooling structure based on the 2-D depth image, wherein the pooling structure comprises one of: a pyramidal min-max pooling structure, describing the minimum and maximum depth values for each of a plurality of regions in the depth image, the regions including regions at different resolutions, the regions at each scale being arranged in a regular array; a mean-variance integral image pooling structure, comprising integral images computed for the zeroth, first and second moments of depth; and a smooth-max integral image pooling structure, comprising integral images computed for positive and negative alpha-softmax functions of depth.
9. The method of any one of claims 5-8, wherein updating the at least one sub-map comprises pruning (460) a group of eight child nodes from the octree if their occupancy can be indicated sufficiently accurately by a single occupancy indication associated with the parent node of the group of eight child nodes.
10. The method of any one of claims 2-3 or 5-9, further comprising: before updating (300) the at least one sub-map, fusing (250) two or more of the 2-D depth images into a combined depth image, and updating (300) the at least one sub-map based on the combined depth image.
11. The method of claim 10, wherein the fusing comprises: re-projecting (252) depth values from one or more of the 2-D depth images, based on the pose estimates associated with the frames, to produce one or more re-projected depth images; and calculating (256) average depth values based at least in part on the re-projected depth images.
12. The method of claim 11, wherein calculating the average depth values comprises, for each pixel of the combined depth image: computing (254) a histogram of depth values, based on the re-projected depth images; and calculating (256) the average depth value based at least in part on the mode of the histogram.
13. The method of any one of the preceding claims, further comprising: maintaining a list of currently active sub-maps; and defining (210) a new sub-map if occupancy information derived from the 2-D depth image extends outside all of the currently active sub-maps.
14. A computer program comprising computer program code configured to cause one or more physical computing devices to perform all the steps of the method of any one of the preceding claims when said computer program is run on the one or more physical computing devices.
15. A mobile device configured to implement the method of any one of claims 1-13.
16. A mobile device (100) configured to build and update a 3-D occupancy map of occupied and unoccupied spaces in a 3-D environment, the mobile device comprising: a memory (110), for storing the 3-D occupancy map; and one or more processors (140), configured to: define (210) a plurality of sub-maps of the 3-D occupancy map, each sub-map describing a different volumetric region of the 3-D environment, each sub-map having associated with it a sub-map pose defining a position and orientation of the respective volumetric region within the 3-D occupancy map, wherein the volumetric regions overlap with one another; wherein the mobile device further comprises: at least one camera (120), configured to capture (222) visual data, comprising a set of frames characterising the environment; and an inertial measurement unit (134), configured to generate (224) inertial data, wherein the inertial measurement unit comprises at least one or any combination of two or more of: an accelerometer (132), a gyroscope (134) and a compass, and wherein the one or more processors (140) are configured to: calculate (226) a pose estimate of the mobile device for each frame, based on the visual data and the inertial data; produce (228) a 2-D depth image for each frame; and update (300) two or more of the sub-maps based on one of the 2-D depth images and its associated pose estimate.
17. The mobile device of claim 16, wherein the one or more processors are further configured to: obtain (230) updated pose information, and in response, update (240) one or more of the sub-map poses, based on the updated pose information.
18. The mobile device of claim 16 or claim 17, wherein the at least one camera comprises a stereo camera pair (120) configured to capture stereo image pairs, and the one or more processors are configured to produce (228) the 2-D depth images based on the stereo image pairs.
19. The mobile device of any one of claims 16 to 18, wherein the one or more processors (140) are configured to: calculate (230) an updated pose estimate for each frame; and update (240) one or more of the sub-map poses based on the updated pose estimates.
20. The mobile device of any one of claims 16-19, wherein the mobile device is comprised in a handheld device, a robot, or a vehicle.