GB2537831A - Method of generating a 3D representation of an environment and related apparatus - Google Patents

Method of generating a 3D representation of an environment and related apparatus

Info

Publication number
GB2537831A
GB2537831A GB1507013.9A GB201507013A GB2537831A GB 2537831 A GB2537831 A GB 2537831A GB 201507013 A GB201507013 A GB 201507013A GB 2537831 A GB2537831 A GB 2537831A
Authority
GB
United Kingdom
Prior art keywords
depth
points
environment
map
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1507013.9A
Other versions
GB201507013D0 (en)
Inventor
Newman Paul
Maria Paz Lina
Pinies Pedro
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oxford University Innovation Ltd
Original Assignee
Oxford University Innovation Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oxford University Innovation Ltd filed Critical Oxford University Innovation Ltd
Priority to GB1507013.9A priority Critical patent/GB2537831A/en
Publication of GB201507013D0 publication Critical patent/GB201507013D0/en
Priority to GB1511065.3A priority patent/GB2537696A/en
Priority to PCT/GB2016/051098 priority patent/WO2016170332A1/en
Publication of GB2537831A publication Critical patent/GB2537831A/en
Withdrawn legal-status Critical Current


Classifications

    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/128Adjusting depth or disparity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/207Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • H04N13/211Image signal generators using stereoscopic image cameras using a single 2D image sensor using temporal multiplexing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/207Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • H04N13/221Image signal generators using stereoscopic image cameras using a single 2D image sensor using the relative movement between cameras and objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/271Image signal generators wherein the generated image signals comprise depth maps or disparity maps
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/2224Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • H04N5/2226Determination of depth image, e.g. for foreground/background separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/08Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T3/02
    • G06T3/06
    • G06T5/80
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection

Abstract

Method of generating a 3D representation of an environment comprising a plurality of points with estimated depths, comprising: obtaining a depth map from the scene; calculating a certainty value for the estimated depths of the points; calculating a new estimated depth for points with a certainty below a first threshold using a geometric assumption about the environment (eg that it contains flat, planar or affine geometry) together with depth information for points with a certainty above a second threshold; and generating the 3D representation using the original estimated depths for points with a certainty above the first threshold and the new estimated depths otherwise. The depth map may be generated using parallax, by analysing how points move between successive frames of an image sequence. The shape assumption may be a strong prior. Each point of the depth map may be a pixel of the image. Secondary information such as colour and reflectance may be used to calculate the certainty value.

Description

Intellectual Property Office Application No. GB1507013.9 RTM Date: 28 September 2015. The following terms are registered trade marks and should be read as such wherever they occur in this document: WiFi (Page 9), UNITS (Page 9), Nvidia (Page 13). Intellectual Property Office is an operating name of the Patent Office. www.gov.uk/ipo

A METHOD OF GENERATING A 3D REPRESENTATION OF AN ENVIRONMENT AND RELATED APPARATUS

This invention relates to a method and system for generating a three dimensional (3D) representation of an environment. In particular, but not exclusively, the system or method may be used to construct depth-maps from images, which are typically monocular images. Further, and again not exclusively, the invention may relate to depth-maps based on architectural priors. Specifically, and again not exclusively, the invention may have particular utility in environments wherein plain planar surfaces, for example walls or ceilings, are predominant features. Embodiments may find utility in the localisation of robots or other vehicles within an environment and/or in surveying an environment. A man-portable system could also be used.
It is convenient to describe the background in terms of generating a 3D model of the environment around a vehicle, robot, or the like. However, the skilled person will appreciate that embodiments of the invention have wider applicability.
Lack of surface texture is problematic when building depth-maps of indoor scenes from images. Use of monocular cameras for mapping is desirable due to the low cost of such cameras, and the ability to perceive the complete 3D structure of the local environment with low cost sensors is desired. That is, it would be convenient to generate a 3D model of an environment using a low cost sensor, such as a camera.
Current state of the art methods for creating depth-maps with a monocular camera include J. Stuehmer, S. Gumhold, and D. Cremers, "Real-Time Dense Geometry from a Handheld Camera", Darmstadt, Germany, September 2010, pp. 11-20; R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "Dense tracking and mapping in real-time", in Proceedings of the 2011 International Conference on Computer Vision ICCV, ser. ICCV '11, Washington, DC, USA: IEEE Computer Society, 2011, pp. 2320-2327; and G. Graber, T. Pock, and H. Bischof, "Online 3D reconstruction using Convex Optimization", in 1st Workshop on Live Dense Reconstruction From Moving Cameras, ICCV 2011, 2011.
These prior art methods are based on variational optimisation algorithms that are able to produce real time, dense 3D reconstructions of desktop-size environments under stable lighting conditions.
Such state of the art techniques for depth-map generation have been shown to work in real time with admirable performance in desktop-sized environments. Unfortunately, when applied to larger indoor environments, performance often degrades. A common cause of degradation is the presence of large affine textureless areas like walls, floors and ceilings, and drab objects such as chairs and tables. These plain or bland areas produce noisy and grossly erroneous initial seeds for the depth-map that greatly impede successful optimisation.
In "Manhattan and piecewise-planar constraints for dense monocular mapping," in Proceedings of Robotics: Science and Systems, Berkeley, USA, July 2014, A. Concha, W. Hussain, L. Montano, and J. Civcra consider planar constraints to improve the estimation of monocular depth-maps. In this work, there is a requirement for a non-trivial pre-processing stage to first acquire plane normals, from external means, before constructing optimisation constraints. The method therefore requires segmentation of keyframes into a set of "superpixels"; a superpixel being a polygonal part of a digital image, larger than a normal pixel, which is rendered in the same colour and brightness. Classification of the superpixel into four predefined classes (wall, floor, ceiling, clutter) or matching of superpixels between different keyframes using homographies is then required. In contrast, the method and system described herein require neither a pre-processing step before the optimisation nor an additional penalty term for regularisation.
Prior art methods often minimise an objective function (or energy function) which generally consists of a data term that measures the photoconsistency over a set of consecutive images and a regularisation term that tends to preserve sharp discontinuities between objects located at different depths while simultaneously enforcing depth smoothness for homogeneous surfaces. A step of the minimisation process involves the application of a primal-dual optimisation scheme which is widely used for solving variational convex energy functions that arise in many image processing problems (see, for example, A. Chambolle and T. Pock, "A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging", Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120-145, May 2011).
Figures 2a, 2b and 2c show, for comparison, examples of depth-map creation using a prior art method based on an algorithm incorporating a Total Variation (TV) regulariser.
The performance of the prior art algorithm is shown for one synthetic dataset (Figure 2a) and for two real scenarios (Figures 2b and 2c), covering desktop-sized (Figures 2a and 2b) and office-sized (Figure 2c) environments.
For the real desktop-sized example (Figure 2b), a sideways, wavy movement was applied to the camera, as is typical in such experiments. The wavy movement increases the parallax and therefore improves the accuracy of the estimated depth. For the office-sized example (Figure 2c) the camera was mounted on a robot 10 that was moving forward, which is the most challenging movement for calculating parallax, but also the natural movement for collecting 3D models of indoor and outdoor environments.
In all three datasets, the image set comprised 10 consecutive RGB (Red-Green-Blue colour model) images, of which one image 202a, 202b, 202c is shown for each dataset.
The sets of consecutive images 202a, 202b, 202c were used to generate initial seeds 204a, 204b, 204c for depth-maps 206a, 206b, 206c. The initial seeds 204a, 204b, 204c were generated by minimising the data term E_D(ξ) (see below) using exhaustive search.
The final depth-maps 206a, 206b, 206c were created after regularisation of the initial seeds 204a, 204b, 204c.
Coloured 3D point clouds 208a, 208b, 208c were then obtained from back-projecting the pixels estimated in the depth-maps.
For the majority of the pixels in the initial seed 204a of the synthetic image dataset 202a shown in Figure 2a there is a reasonable estimate of the depth. At first sight this appears to be true even in low-textured areas like the right wall 203.
This surprising result may be explained by the shadows (eg 205) cast by some objects on the wall (like the computer monitor 207) and the fact that the illumination pattern of the synthetic rendered scene 202a is an approximation of the usually more complex Lambertian reflectances found in real scenes. For the headset dataset shown in Figure 2b, the initial seed depth-map 204b is clearly wrong for some of the pixels 211 of the table 209. However, the presence of cables 213, papers 215, the headset 217 and the calibration pattern 219, which have a good depth estimate and occupy most of the image 202b, helps the regulariser 28 to propagate good depth estimates from pixels for which certainty of the depth estimate is high to pixels for which certainty of the depth estimate is lower.
Finally, Figure 2c shows an image 202c corresponding to that in Figure 1b. Figure 2c shows an extremely noisy initial seed 204c for the office-sized real environment due to the fact that most of the scene shown by image 202c consists of plain white walls 221, and the robot 10 was moving in the direction of the field of view. Although the regulariser 28 improves the initial solution 204c, to generate the final depth-map 206c, it cannot cope with the vast number of initially wrong depth estimates for pixels and the final depth-map 206c obtained is of poor quality, as can be seen more clearly in the 3D point cloud reconstruction 208c. Depth-map 208c can be compared to the depth-map shown in Figure 1c, which was generated from the same set of images 100, 202c using the method described herein in place of prior art methods.
In summary, for the office-sized environment an extremely noisy initial seed 204c is obtained due to the lack of texture in the images 202c, 100 (walls 221 and ceiling) and the type of movement applied to the camera. As a result, the final depth-map 206c calculated by the prior art method is of lower quality than the depth-maps 206a, 206b obtained for the previous datasets, as can be verified in the corresponding 3D point clouds, 208a, 208b, 208c.
According to a first aspect of the invention there is provided a method of generating a 3D representation of an environment. Conveniently, the 3D representation comprises a plurality of points, substantially each point having an estimated depth of that point relative to a reference. The method conveniently comprises at least one of the following steps: i) obtaining a depth-map which is typically generated from the environment; and ii) calculating a certainty value for the estimated depths of at least some, and typically all, of the points within the depth-map.
Further, the method may, for points having a certainty value below a first threshold, use a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold to calculate a new estimated depth for those points below the first threshold.
The method may further comprise generating the 3D representation of the environment using the new estimated depths for points having a certainty value below the first threshold and the estimated depths from the depth-map for points having a certainty value above the first threshold.
Embodiments may find utility in generating 3D representations of the environment from images or the like. The use of the geometric assumption of the environment helps to address a problem of prior art methods where insufficient data is provided in the image (or other data of the environment) to allow an accurate 3D representation to be created.
Additionally or alternatively, the depth map may be generated by a LIDAR (laser-based detection and ranging) system, or by any other technique known to one skilled in the art.
Conveniently, the depth-map generated in step (i) is generated by processing at least two images of the environment to determine how points move between the at least two images. Such a method provides a convenient way to generate depth information for points within the depth-map.
In at least some embodiments, it is the motion of the sensor that generated the at least two images of the environment that is used to determine how points move between the at least two images. Conveniently, the method captures odometry data that allows the motion of the sensor to be captured.
In at least some embodiments, the geometric assumption favours affine geometry. Such affine geometry is convenient as it is found in many environments. Affine geometry may also be thought of as the presence of largely planar surfaces.
Embodiments may implement the geometric assumption as a strong prior, such that it is assumed that surfaces within the 3D representation are likely to adhere to the geometric assumption.
Conveniently, at least some embodiments, obtain an image of the environment for which the depth-map is to be generated. The image may be obtained from a camera or other sensor.
Typically, at least some embodiments use each pixel within an image of the environment as a point of the depth-map for which the depth-map is to be generated.
Conveniently, embodiments use secondary information associated with each point to determine the certainty value.
In some embodiments, the secondary information comprises one of: colour and reflectance.
According to a second aspect of the invention there is provided a system arranged to generate a 3D representation of an environment. Conveniently, the 3D representation comprises a plurality of points, substantially each point having an estimated depth of that point relative to a reference. The system comprises processing circuitry arranged to perform at least one of the following steps: i) obtain a depth-map generated from the environment; and ii) calculate a certainty value for the estimated depths of at least some of the points within the depth-map.
Further, the system may, for points having a certainty value below a first threshold, use a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold to calculate a new estimated depth for those points below the first threshold.
The system may further generate the 3D representation of the environment using the new estimated depths for points having a certainty value below the first threshold and the estimated depths from the depth-map for points having a certainty value above the first threshold.
According to a third aspect of the invention there is provided a machine readable medium containing instructions which when read by a machine cause that machine to provide the method, or at least a portion of the method, of the first aspect of the invention, or to provide the system, or at least a portion of the system, of the second aspect of the invention.
The machine readable medium referred to in any of the above aspects of the invention may be any of the following: a CDROM; a DVD ROM / RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive, an SD card, a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc. Features described in relation to any of the above aspects of the invention may be applied, mutatis mutandis, to any of the other aspects of the invention.
There now follows, by way of example only, a detailed description of embodiments of the invention with reference to the accompanying drawings in which: Figure 1a is a schematic view of a robot utilising a camera to take and process RGB images of an environment; Figure 1b shows an example of an RGB image which may be used to create a depth-map, such as taken by the robot in Figure 1a; Figure 1c shows a depth-map obtained from the RGB image shown in Figure 1b using an embodiment described herein; Figure 1d shows a 3D coloured point cloud obtained from back-projecting pixels from the depth-map shown in Figure 1c; Figures 2a, 2b and 2c (Prior art) show examples of depth-map creation using methods known in the prior art - in each Figure, the topmost image shows an initial RGB image used, the second shows an initial seed for a depth-map, the third shows a generated depth-map and the fourth shows the corresponding coloured point cloud representation of the environment; Figure 3 shows the profile of the data term E_D(ξ(u)) for a textured pixel, u_t (shown by the lighter grey line), and for a texture-less pixel, u_tl (shown by the darker grey line); a quadratic approximation of the data term at the minimum cost of u_t is shown by the dashed line; Figure 4a shows meaningful depth pixels that have been selected by thresholding the curvature at the initial depth solution; Figure 4b shows a simple heuristic used to select a set of non-local meaningful candidates N(u_tl) with proper depth estimates for texture-less pixels u_tl; Figures 5a, 5b and 5c show dense depth-map results for three datasets using an embodiment described herein - in each Figure, the topmost image shows a ground truth depth-map, the second shows an estimated depth-map where each point of the depth-map has an estimated depth, the third shows differences between ground truth and the estimated depth-map and the fourth shows the corresponding coloured point cloud representation of the environment; Figure 6a shows a histogram of errors in depth, in metres, for the estimated depth-map shown in Figure 5a; Figure 6b shows a histogram of errors in depth, in metres, for the estimated depth-map shown in Figure 5b; Figure 6c shows a histogram of errors in depth, in metres, for the estimated depth-map shown in Figure 5c; and Figure 7 shows a flow chart illustrating the method of an embodiment.
It is noted that the original depth maps presented in the Figures were in colour, allowing a wider range of definition between depths. Although efforts have been made to render the Figures accurately in greyscale, the skilled person should keep the transformation to greyscale in mind when reviewing the Figures. Copies of the colour drawings may be available from the patent office.
Embodiments of the invention are described in relation to a sensor 12 mounted upon a robot 10. The skilled person would understand that the robot 10 could be replaced by a manned vehicle, or by a person carrying a sensor 12, amongst other options. The sensor 12 is arranged to monitor its environment 14, 15 and generate data based upon the monitoring, thereby providing data on a sensed scene around the robot 10. In the embodiment being described, since the sensor 12 is mounted upon a robot 10, the sensor 12 is also arranged to monitor the environment 14, 15 of the robot 10.
In the embodiment being described, the sensor 12 is a passive sensor (ie it does not create radiation and merely detects radiation) such as a camera. In the embodiment being described, the sensor 12 is a monocular camera.
The skilled person will appreciate that other kinds of sensor 12 could be used. In other embodiments, the sensor 12 may comprise other forms of sensor such as a laser scanner or the like. As such, the sensor 12 may also be an active sensor arranged to send radiation out therefrom and detect reflected radiation.
In the embodiment shown in Figure la, the robot 10 is travelling along a corridor 14 within a building 13 and the sensor 12 is imaging the environment (eg the corridor 14, door 15, etc.) as the robot 10 moves. The skilled person would understand that the robot may be remotely controlled, may be following a pre-programmed route, or may calculate its own route, or any combination of these or the like.
In the embodiment being described, the robot 10 comprises processing circuitry 16 arranged to capture data from the sensor 12 and subsequently to process the data (in this embodiment, these data comprise images) generated by the sensor 12. Embodiments of the invention are described in relation to generating 3D representations of the environment around the sensor from RGB images 100 taken from a moving sensor 12. The skilled person would understand that other image types may be used, and that a camera 12 taking the images 100 may not be in motion. Further, the skilled person would understand that other forms of data may be used in the place of images -for example LIDAR point clouds.
The 3D representation is typically provided by a plurality of points, which points may correspond to, or at least be generated from, the pixels of images of the environment. In embodiments wherein LIDAR point clouds are used instead of, or as well as, images, the points of the point cloud are used in the same way as the pixels of the images.
As described hereinafter, colour taken from the image (here an RGB image) can be used as a soft segmentation cue. Here a soft segmentation cue may be thought of as being secondary information about a pixel in addition to the positional information provided by the pixel. In alternative embodiments, in which representations of the environment other than images are used, other soft segmentation cues may be used. For example, reflectance may be used.
Here a depth-map is intended to mean a record of the distance of the surfaces of objects within the environment observed by the sensor 12 from a reference associated with the sensor 12. The reference may be a point reference, such as a point based on the sensor 12, or may be a reference plane. The distance to the surface may be recorded in any suitable manner.
In some embodiments, the distance to the surface may be recorded as a single value, associated with a pixel of an image 100. The image 100 may be thought of as providing an x-y plane. In one embodiment, the value associated with (or provided by) a pixel of an image may provide a depth value, and may be thought of as a z-value.
Thus, the processing circuitry 16 captures data from the sensor 12, which data provides an image, or other representation, of the environment around the robot 10 at a current time. In the embodiment being described, the processing circuitry 16 also comprises, or has access to, a storage device 17 on the robot 10. As such, the embodiment being described may be thought of as generating 3D representations of an environment on-line. Here, online means in what may be termed real-time, as the robot 10 moves within its environment 14, 15. As such, real time might mean that the processing circuitry is able to process images at substantially any of the following frequencies: 1Hz; 2Hz; 5Hz; 10Hz; 15Hz; 20Hz; 25Hz; 30Hz; 50Hz (or any frequency in-between these). The skilled person would understand that the speed of data processing is limited by the hardware available, and would increase with hardware improvements.
The lower portion of Figure 1a shows components that may be found in a typical processing circuitry 16. A processor 18 may be provided which may be an Intel® X86 processor such as an i5, i7 processor, an AMD™ Phenom™, Opteron™, etc, an Apple A7, A8 processor, or the like. The processor 18 is arranged to communicate, via a system bus 19, with an I/O subsystem 20 (and thereby with external networks, displays, and the like) and a memory 21.
The skilled person will appreciate that memory 21 may be provided by a variety of components including a volatile memory, a hard drive, a non-volatile memory, etc. Indeed, the memory 21 may comprise a plurality of components under the control of, or at least accessible by, the processor 18.
However, typically the memory 21 provides a program storage portion 22 arranged to store program code 24 which when executed performs an action, and a data storage portion 23 which can be used to store data temporarily and/or permanently. The data storage portion stores image data 26 generated by the sensor 12 (or data for other representations). Trajectory data 25 may also be stored; trajectory data 25 may comprise data concerning a pre-programmed route and/or odometry data concerning the route taken - for example data concerning movement of the wheels, data from an INS system (Inertial Navigation System), or the like.
In other embodiments at least a portion of the processing circuitry 16 may be provided remotely from the robot 10. As such, it is conceivable that processing of the data generated by the sensor 12 is performed off the robot 10, or partially on and partially off the robot 10. In embodiments in which the processing circuitry is provided both on and off the robot then a network connection (such as 3G (eg UMTS - Universal Mobile Telecommunication System), 4G (LTE - Long Term Evolution), WiFi (IEEE 802.11) or the like) may be used.
It is convenient to refer to a robot 10 travelling along a corridor 14 but the skilled person will appreciate that embodiments need not be limited to any particular mobile apparatus or environment. Likewise, it is convenient in the following description to refer to image data 100 generated by a camera 12 but other embodiments may generate and use other types of data.
For example, a sparse, semi-dense, or the like LIDAR point cloud could be generated by a LIDAR system and an embodiment could use the sparse or semi-dense LIDAR point cloud as a starting point for generating a dense depth map of the environment portrayed.
The sensor 12, together with the processing circuitry 16 to which the sensor 12 is connected, and with the software running on the processing circuitry 16, form a system capable of producing 3D representations of the environment 14, 15 around the sensor 12 from the images 100 collected. In the embodiment being described, the 3D representation takes the form of a depth-map. As the sensor 12/robot 10 moves, a set of images is generated and the data providing the images is input to the processing circuitry 16. Typically, parallax between consecutive images 100, together with the trajectory data 25, is used to generate depth estimates for points within the images 100. Each point may correspond to a pixel of any one of the images. The depth estimate information for each pixel forms a depth-map of the environment 14, 15. Each, or at least the majority, of the depth-maps may be stored in the data storage portion 23 as depth map data 27.
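By way of background illustration only, the sketch below shows one generic way in which parallax between two frames and known camera motion yield a depth value for a single pixel correspondence: linear (DLT) triangulation. The intrinsic matrix K, the 4x4 camera-to-world poses and the pixel correspondence are assumed inputs; the embodiment itself seeds its depth-map from the photometric cost described later, so this is not the patented method but a minimal sketch of the underlying geometry.

```python
import numpy as np

def triangulate_depth(u1, u2, K, T_w_c1, T_w_c2):
    """Estimate the depth of a scene point observed at pixel u1 in frame 1 and
    at pixel u2 in frame 2, given the 3x3 intrinsic matrix K and the 4x4
    camera-to-world poses of the two frames (taken from the trajectory data).
    Returns the z-value of the point in the first camera's frame."""
    # Projection matrices P = K [R | t] mapping world points to image pixels.
    P1 = K @ np.linalg.inv(T_w_c1)[:3, :]
    P2 = K @ np.linalg.inv(T_w_c2)[:3, :]

    # Linear (DLT) triangulation: each view contributes two linear equations.
    A = np.stack([
        u1[0] * P1[2] - P1[0],
        u1[1] * P1[2] - P1[1],
        u2[0] * P2[2] - P2[0],
        u2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    X = X[:3] / X[3]                                   # world point (x, y, z)

    # Express the point in the first camera frame and read off its depth.
    X_c1 = (np.linalg.inv(T_w_c1) @ np.append(X, 1.0))[:3]
    return X_c1[2]
```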
Referring to Figure 1b, an RGB image 100 of a long office corridor 14 is shown. The image comprises a large number of individual pixels (which may be referred to as points). The environment provided by the corridor 14, as shown in the image 100, includes large, texture-less regions, such as regions 102. Here the texture-less regions are provided by walls, which are planar areas of substantially uniform colour with relatively few features. The lack of texture and features is known to present problems for depth estimation.
If the camera 12 taking the images 100 is moving directly forward along the corridor 14 shown, the parallax is low, particularly for objects and/or points further away from the camera. The skilled person would understand that low parallax also increases difficulty in estimating depth.
In environments wherein there are large bland (texture-less, low texture, featureless or few features) regions 102, it is known that the depth at the edges (eg 104) of bland regions 102 is often well estimated, whereas the depths appropriate for inner pixels (ie pixels within the bland region 102, away from edges; these pixels may also be described as texture-less pixels) are problematic to estimate because there are no or few features to track between consecutive images.
In the embodiments described herein, a new non-local higher-order regularisation term that enforces piecewise affine constraints between image pixels that are far apart in the image 100 is introduced.
The skilled person would understand that, in different environments, different constraints may be applicable.
Applying a constraint, or assumption, concerning expected properties of the environment (such as the presence of plain planes, leading to a choice of affine constraints) facilitates utilisation of depth information of pixels at the edges 104 of bland regions 102 to improve the depth estimates for pixels within the bland regions 102. That is, and in one example, if it can be assumed that a bland region 102 is planar then points, for which there is a high degree of certainty as to their distance, can be used to determine the distance for points for which there is a low degree of certainty as to their distance.
Figure lc shows a depth-map 140 obtained, in the embodiment being described, using ten consecutive RGB images 100 taken by a monocular camera 12 moving forward along the corridor 14; that is, the depth-map has been obtained from a depth-map generated from images of the environment. Lighter shades of grey are used for close objects/points and darker shades for distant objects/points. The depth-map 140 can be seen to be smooth along the affine surfaces 142 (walls, floor and ceiling) despite the lack of texture. A consistent U-shaped distribution of grey scale shades for elements in the corridor 14 which are at the same distance from the camera 12 is shown.
Figure 1d shows a 3D point cloud 160 which is obtained from back-projecting the pixels of the depth-map 140, using the colour values from the initial RGB image 100. Depth and colour information is therefore combined to create the 3D point cloud 160. Whilst the image within Figure 1d looks similar to the image within Figure 1b, the underlying data providing the image of Figure 1d now has depth information associated with each of the pixels, whereas the image of Figure 1b is simply an image. Thus, the depth-map of Figure 1d provides a 3D model of the environment around the sensor 12 which the image of Figure 1b does not.
At least some embodiments are arranged to generate a surface normal for each pixel (or for at least some of the pixels) within the depth-map.
As described above, Figure 1b shows an image 100 of a target environment: an indoor office environment having large plain structures 142, which are typically also planar, like floors, walls or ceilings. In such regions, the photoconsistency term used in prior art methods is of little help since all pixels tend to be similar from all views, and regularisers which promote smoothness struggle to propagate information from distant boundaries to points within the large plain regions.
The point cloud shown in Figure 1d, generated by the embodiment being described herein, provides a higher quality representation when compared to the depth map 208c shown in Figure 2c, which was generated by a prior art method.
In the following, further details and examples of an implementation of the method described herein are presented.
The embodiment discussed herein comprises imposing a geometric assumption about the predominant shape of objects in the environment through which the sensor 12 is moved. For indoor environments such as the office environment shown in Figures 1a-d, affine surfaces are considered. Here, affine surfaces is intended to mean surfaces which can be described by the equation of a plane; parallel edges are often present in such surfaces.
The estimated depth of pixels located within regions at the border of a bland region can be propagated to pixels within the bland region by using a geometric assumption in a regularisation term that favours solutions with the predetermined family of shapes.
The geometric assumption may be implemented as a strong prior. A strong prior is a type of informative prior in which the information contained in the prior distribution dominates the information contained in the data being analysed (such as the images 100). When affine surfaces are considered, the geometric assumption favours plain planes. The program code 24 favours depth-map solutions in which at least portions of the depth-map comprise planes.
The skilled person would understand that other geometric assumptions of the environment could be used. The method described herein could equally be applied to environments with texture-less surfaces which are not generally affine but which tend to have a different specified geometry. For example, in a chemical plant, a sewer system, or the like, cylindrical pipes may be present and may occupy a substantial portion of images 100 of the environment. Quadratic curves and surfaces could also be prevalent in some environments. In other embodiments further geometric assumptions might be used.
In embodiments wherein two or more different types of geometry are expected in the environment, two or more geometric assumptions may be utilised in parallel. In some embodiments, images 100 may be segmented into two or more segments, and a different geometric assumption applied to each segment. For example, in a chemical plant or the like, images may be segmented into "pipe" and "not pipe", and a geometrical assumption favouring cylinders could be implemented for the pipe segment, and a geometrical assumption favouring affine surfaces implemented for the not pipe segment.
Unlike traditional variational methods that just consider local information of neighbouring pixels, distant pixels are related in a cost function in the embodiments being described. The non-local higher-order regularisation term introduced herein facilitates, in the embodiment being described, the generation of depth-maps across large affine yet bland scene regions (plain planes). Typically, these depth-maps can be referred to as dense in that they contain data for a large proportion of the environment. Here a depth-map may be considered dense if it contains depth information for at least roughly 75% or more of the pixels of the image from which it was generated. In other embodiments this might be roughly 80%, 90% or 95%.
Thus, a skilled person would understand that a dense depth map provides depth data for substantially all pixels within a representation of the environment being portrayed and typically, depth data is provided for every pixel. At a given resolution, a dense depth map may be used to generate a 3D model of an environment.
Typically, a so-called semi-dense depth map has an uneven data distribution, being dense in some areas, and sparse in others. For example, a semi-dense depth map may have depth information for pixels in the vicinity of corners and/or edges, but lack depth information for pixels away from these regions, such as within planar regions, or the like.
A sparse depth map may have an uneven or even data distribution, but does not have data for substantially all pixels in a representation of the environment. A 3D model generated directly from a sparse depth map would have gaps and distortions due to missing data.
The skilled person will understand that depth information may be discounted where the reliability of the depth information is known to be poor. A dense depth map may therefore be defined as a depth map which has good/reliable data for substantially all pixels of a representation of the environment being portrayed. Likewise, a sparse depth map may have depth estimates for substantially all pixels, but only have good depth estimates for a small selection of pixels, and uncertain and/or inaccurate depth estimates for the remaining points.
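Purely as an illustrative sketch of the working definition above (the roughly 75% figure is only a guide, and the function name and cut-off are assumptions introduced here), a depth map could be tested for density as follows; the semi-dense versus sparse distinction additionally depends on how the reliable pixels are distributed spatially, which this sketch does not attempt to capture.

```python
import numpy as np

def is_dense(reliable_mask, dense_fraction=0.75):
    """Return True if the depth map counts as dense under the rough working
    definition above: reliable depth for at least ~75% of pixels.
    reliable_mask is a boolean array, True where a pixel has a good estimate."""
    fraction = np.count_nonzero(reliable_mask) / reliable_mask.size
    return fraction >= dense_fraction
```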
In contrast to previous methods, the method described herein requires neither a pre-processing step before the optimisation nor an additional penalty term for regularisation. Such steps are implicitly wrapped up in the form and structure of a new objective function.
When a monocular camera is used, the state of the art algorithm to create a depth-map ξ(u) from a set of images I_C = {I_1(u), ..., I_C(u)} (see R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time", in Proceedings of the 2011 International Conference on Computer Vision ICCV, ser. ICCV '11, Washington, DC, USA: IEEE Computer Society, 2011, pp. 2320-2327) solves the following variational problem:

\min_{\xi} \; E_R(\xi) + E_D(\xi)    (Eq. 1)

where E_D(ξ) is a nonconvex data term that calculates the average photometric error ρ between a reference image I_r(u) and the warp of the remaining images in the set I_C:

E_D(\xi) = \frac{\lambda}{|I_C|} \int_{\Omega} \sum_{k \in I_C} \rho\big(I_r(u), I_k(u), \xi(u)\big) \, du    (Eq. 2)

and E_R(ξ) is a regularisation term, usually a Total Variation (TV) or Huber regulariser, that is able to preserve depth discontinuities while smoothing homogeneous regions:

E_R(\xi) = \int_{\Omega} \omega(u) \, \lVert \nabla \xi(u) \rVert \, du    (Eq. 3)

In Eq. 2, λ is a parameter used to define the trade-off between the regulariser and the data term, whereas ω(u) in Eq. 3 is a per pixel weight, based on the gradient of the reference image, that reduces smoothing effects of the regulariser across image edges.
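For concreteness, the following is a minimal sketch of how an initial depth-map seed can be obtained from the data term of Eq. 2 by exhaustive search over a discretised inverse-depth range. The `warp` helper and the (image, relative pose) input format are assumptions made for the example - in practice the warp would encapsulate the camera intrinsics and the relative pose of each image with respect to the reference frame - and this is an illustrative CPU sketch rather than the real-time GPU implementation referred to below.

```python
import numpy as np

def initial_seed(I_ref, other_images, warp, xi_samples):
    """Build an initial depth-map seed by exhaustive search over a discretised
    inverse-depth range, in the spirit of the data term of Eq. 2. `other_images`
    is a list of (image, relative_pose) pairs and warp(image, pose, xi) is an
    assumed helper that resamples that image into the reference view under the
    hypothesis that every pixel has inverse depth xi. Returns the per-pixel
    inverse depth minimising the average photometric error, plus the full cost
    volume (useful later for the curvature-based certainty test)."""
    H, W = I_ref.shape
    xi_samples = np.asarray(xi_samples, dtype=float)
    cost_volume = np.zeros((len(xi_samples), H, W))
    for d, xi in enumerate(xi_samples):
        for I_k, pose_k in other_images:
            # rho: absolute photometric difference between the reference image
            # and image k warped into the reference view at inverse depth xi.
            cost_volume[d] += np.abs(I_ref - warp(I_k, pose_k, xi))
        cost_volume[d] /= len(other_images)
    best = np.argmin(cost_volume, axis=0)     # index of the cheapest hypothesis
    xi_seed = xi_samples[best]                # per-pixel inverse depth seed
    return xi_seed, cost_volume
```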
The optimisation problem in Eq. 1 is solved using an iterative alternating optimisation method based on an exhaustive search step that involves the non-convex data term E_D(ξ), and a Primal-Dual algorithm (see A. Chambolle and T. Pock, cited previously) that solves a convex cost function involving the regularisation term E_R(ξ) (see, for example, Y. Nesterov, "Smooth minimization of non-smooth functions", Math. Program., vol. 103, no. 1, pp. 127-152, May 2005. Available: http://dx.doi.org/10.1007/s10107-004-0552-5). A CUDA (Compute Unified Device Architecture) implementation of the previous (TV) algorithm was tested on a NVIDIA GeForce GT 650M 1024 MB card using different synthetic scenarios found in A. Handa, R. A. Newcombe, A. Angeli, and A. J. Davison, "Real-time camera tracking: when is high frame-rate best?" in Proceedings of the 12th European Conference on Computer Vision - Volume Part VII, Berlin, Heidelberg: Springer-Verlag, 2012, pp. 222-235, and A. Handa, T. Whelan, J. McDonald, and A. Davison, "A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM," in IEEE Intl. Conf. on Robotics and Automation, Hong Kong, May 2014, obtaining a median depth-map error that is usually below 2 centimetres, after 800 iterations in 500 ms.
However, as discussed above, when office-sized datasets such as that shown in Figure 2c are used, depth-map 206c quality is poor due to the presence of large, bland regions, eg 221.
In order to deal with the problems discussed in the example shown in Figure 2c in the context of indoor environments, the regulariser 28 is arranged to satisfy two requirements: 1) be able to cope with many pixels (points) without (reliable) depth information; and 2) favour particular solutions which align with the geometric assumption.
In the embodiment being described, many pixels may have no reliable depth information, since their initial depth-map estimates (eg 204c) are in gross error. In Figure 2c, it can be seen that the walls of the corridor within the depth map 204c show no consistency of depth and are highly speckled. This should be compared with Figure 1c, in which the same walls show highly consistent depth.
In the embodiment being described, depth estimates for which the certainty is low/for which errors are likely to be high (unreliable depth estimates) are discounted and removed from the dataset entirely, leaving no depth information for the corresponding pixels.
The skilled person would understand that, in other embodiments, unreliable depth estimates may be retained but given a lower weighting than more reliable depth estimates.
In alternative or additional embodiments, no depth information may have been collected for certain pixels. The method being described can therefore be used to create a dense depth-map from a sparse depth-map, semi-dense depth-map or the like, and/or to fill in gaps where data collection was missed or data were corrupted.
In the embodiment being described, the regulariser 28 favours affine solutions, since it is likely in the indoor environments 14,15 targeted that many pixels in the depth-map (eg 206c) belong to the same 3D plane.
A brief review of the standard primal-dual optimisation algorithm is provided since this algorithm is used to optimise the energy function proposed herein. Space precludes a step by step introduction to this algorithm and the reader is pointed to papers such as A. Chambolle and T. Pock's "A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging", Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120-145, May 2011, for a more detailed description.
Given a noisy signal η(u), a standard application of the primal-dual algorithm consists of calculating a de-noised signal ξ(u) by minimising the following energy function:

\min_{\xi} \int_{\Omega} \lVert \nabla \xi(u) \rVert + \lambda \big(\xi(u) - \eta(u)\big)^2 \, du    (Eq. 4)

Eq. 4 can be discretised to obtain:

\min_{\xi} \; \lVert K \xi \rVert_1 + \lambda \lVert \xi - \eta \rVert_2^2    (Eq. 5)

where ξ and η are the images represented in vectorised form and the operator K is a discretised version of the gradient operator.
Algorithm 1 (see Table 1, below) shows the basic steps of the primal-dual algorithm, where p is an internal dual variable used during the optimisation and σ and τ are parameters that control the step size of the algorithm.
Table 1

Algorithm 1: primal-dual
1: {Initialisation of variables:}
2: choose σ, τ > 0; θ ∈ [0, 1]
3: ξ^0 = η, p^0 = 0, ξ̄^0 = ξ^0
4: while k ≤ N do
5: {Dual step:}
6: p^(k+1) = prox_σ(p^k + σ K ξ̄^k)
7: {Primal step:}
8: ξ^(k+1) = prox_τ(ξ^k - τ K^T p^(k+1))
9: {Relaxation step:}
10: ξ̄^(k+1) = ξ^(k+1) + θ (ξ^(k+1) - ξ^k)
11: end while
Finally, the proximal map operators for the primal and the dual steps can be calculated for each pixel ij individually using:

\xi_{ij} = \frac{\tilde{\xi}_{ij} + \lambda \tau \, \eta_{ij}}{1 + \lambda \tau}    (Eq. 6)

p_{ij} = \frac{\tilde{p}_{ij}}{\max(1, \lvert \tilde{p}_{ij} \rvert)}    (Eq. 7)

Once depth estimates have been obtained, the certainty/accuracy of each estimated depth is assessed.
Matchings, which may be non-local, are then found to improve the depth estimates for which the certainty/accuracy is below a certain threshold. The skilled person would understand that the embodiments being described allow pixels which are not adjacent to the pixel for which the certainty/accuracy of the depth estimate is below a certain threshold to be matched to the pixel in question. The matched pixels may be remote from one another within the image, ie non-local. For example, pixels in the centre of a bland region may be matched to pixels at the edge of the bland region, to pixels adjacent a feature within the bland region, or the like. A certainty value may be calculated for each depth estimate, as discussed below. A first threshold may be specified - estimated depths with a certainty below the threshold may be designated as unreliable and discarded, leaving some pixels without depth information. Estimated depths with higher certainty values - for example equal to or above the first threshold - may then be used to calculate depth estimates for pixels without depth information. The skilled person would understand that, in some embodiments, a second, higher threshold may be set and only depth estimates with certainty values equal to or exceeding the second threshold may be used to calculate depth estimates for pixels without depth information.
Pixels without depth information are matched to one or more pixels with a depth estimate with a certainty value equal to or above the second threshold. The depth information of the matched pixels is used to calculate depth estimates for the pixels without depth information.
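To make the optimisation machinery above concrete, the following is a minimal sketch of Algorithm 1 applied to the TV denoising problem of Eq. 5, using proximal maps of the form given in Eqs 6 and 7. The finite-difference operators, step sizes and iteration count are illustrative assumptions rather than the patented implementation; setting the weight λ to zero at a pixel switches off the data term there, so that pixel is in-painted from its neighbours as discussed in the surrounding text.

```python
import numpy as np

def grad(x):
    """Forward-difference gradient (the operator K) with Neumann borders."""
    gx = np.zeros_like(x)
    gy = np.zeros_like(x)
    gx[:, :-1] = x[:, 1:] - x[:, :-1]
    gy[:-1, :] = x[1:, :] - x[:-1, :]
    return gx, gy

def div(px, py):
    """Discrete divergence, the negative adjoint of grad (i.e. -K^T)."""
    dx = np.zeros_like(px)
    dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]
    dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]
    dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]
    dy[1:-1, :] = py[1:-1, :] - py[:-2, :]
    dy[-1, :] = -py[-2, :]
    return dx + dy

def tv_denoise(eta, lam, n_iter=300, tau=0.25, sigma=0.5, theta=1.0):
    """Algorithm 1 applied to the TV denoising problem of Eq. 5. eta is the
    noisy (inverse) depth image; lam is the trade-off weight and may be a
    per-pixel array, with lam == 0 switching the data term off at that pixel
    so it is in-painted from its neighbours."""
    lam = np.broadcast_to(np.asarray(lam, dtype=float), eta.shape)
    xi = eta.astype(float).copy()
    xi_bar = xi.copy()
    px = np.zeros_like(xi)
    py = np.zeros_like(xi)
    for _ in range(n_iter):
        # Dual step (Eq. 7): gradient ascent then projection onto |p| <= 1.
        gx, gy = grad(xi_bar)
        px, py = px + sigma * gx, py + sigma * gy
        norm = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))
        px, py = px / norm, py / norm
        # Primal step (Eq. 6): gradient descent then prox of the data term.
        xi_new = xi + tau * div(px, py)
        xi_new = (xi_new + lam * tau * eta) / (1.0 + lam * tau)
        # Relaxation step.
        xi_bar = xi_new + theta * (xi_new - xi)
        xi = xi_new
    return xi
```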
As shown for example in Figure 2c, and in particular image 204c, most of the pixels that correspond to low-textured regions have noisy and meaningless depth values in the initial seed 204c obtained from exhaustive search of the data term E_D(ξ). Figure 3 shows, for a textured pixel u_t (light grey, 302), the values taken by the data term E_D(ξ(u)) along a range of inverse depth values [ξ_min, ξ_max], and also the corresponding cost profile for a texture-less pixel u_tl (dark grey, 304).
For the textured pixel 302 there is a clear minimum ξ(u_t)* of the data term that corresponds to a well-estimated depth, whereas the low-textured pixel shows a flat profile 304. The flat profile explains why any small noise in the original intensity images can randomly change the position of the minimum.
The curvature of a second order approximation (shown by dashed line 306 for the textured pixel only) of the data term at the minimum cost 308 is a measure of the reliability of the initial depth estimate ξ(u_t)*. The curvature can therefore be used as a certainty value for the depth estimate - calculating the certainty value may therefore comprise calculating the curvature. The higher the curvature, the higher the certainty of the estimated depth. In some embodiments, the curvature may be used in calculating the certainty value. The calculation of the certainty value may also comprise other inputs, such as lighting change information or the like.
Using the certainty values, those pixels of the initial depth-map 204c that have meaningful depths (certainty value above a specified threshold) can be selected as can be seen in Figure 4a.
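A minimal sketch of this curvature-based selection is given below. It reuses the cost volume and inverse-depth samples from the exhaustive-search sketch earlier; the three-point finite-difference stencil and the form of the threshold test are assumptions, since the text does not prescribe an exact numerical recipe.

```python
import numpy as np

def certainty_from_curvature(cost_volume, xi_samples, threshold):
    """Certainty value from the curvature of the data term around its minimum.
    cost_volume has shape (n_depths, H, W), e.g. as produced by the
    exhaustive-search sketch above, and xi_samples are the (uniformly spaced)
    inverse-depth hypotheses. Pixels whose curvature exceeds `threshold` are
    kept as 'meaningful' (cf. Figure 4a); the rest are discarded."""
    xi_samples = np.asarray(xi_samples, dtype=float)
    n, H, W = cost_volume.shape
    best = np.argmin(cost_volume, axis=0)
    b = np.clip(best, 1, n - 2)              # keep the 3-point stencil in range
    rows, cols = np.indices((H, W))
    step = xi_samples[1] - xi_samples[0]
    # Second-order central difference approximates the curvature at the minimum.
    curvature = (cost_volume[b - 1, rows, cols]
                 - 2.0 * cost_volume[b, rows, cols]
                 + cost_volume[b + 1, rows, cols]) / step ** 2
    reliable = curvature > threshold
    xi_seed = xi_samples[best]
    return xi_seed, curvature, reliable
```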
During the optimisation process, invalid depth estimates of texture-less pixels are disregarded by setting λ = 0 in Equation 4. As explained in A. Chambolle and T. Pock, cited previously, this is equivalent to generating an interpolation solution (also known as in-painting in the computer vision literature) for those pixels, which depends on the regulariser chosen.
Disregarded pixels are evidenced by the blank white areas 406 visible in Figures 4a and 4b. Thus, it will be seen, in Figure 4a, that the depth-map is much less dense when compared to the depth-map of Figure 1c; ie far fewer pixels (or points) are occupied with information in Figure 4a, and this is because of the thresholding that is being described, which removes points of low certainty.
In at least some embodiments, during the optimisation, and in order to speed up the transfer of depth information to invalid pixels, the following are selected:
• a local neighbourhood for each pixel; and
• a set of potential meaningful non-local pixel candidates with valid depths.
In the embodiment being described, a simple approach is used to find potential candidates: the closest valid pixels along the main eight star directions 404 (E, NE, N, NW, W, SW, S, SE) are selected, as shown in Figure 4b. The skilled person would understand that different approaches could be used. For example, a circle centred on the pixel in question with a gradually-expanding radius could be used and candidate pixels on the circumference selected.
The non-local neighbourhood of a texture-less pixel u_tl is denoted by N(u_tl).
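A minimal sketch of this candidate-selection heuristic might look as follows; the boolean `reliable` mask (pixels whose certainty passed the threshold) and the step limit are assumed inputs introduced for the illustration.

```python
import numpy as np

# The eight "star" directions E, NE, N, NW, W, SW, S, SE as (row, col) steps.
STAR_DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
                   (0, -1), (1, -1), (1, 0), (1, 1)]

def non_local_neighbours(u, reliable, max_steps=None):
    """Return N(u_tl): for a texture-less pixel u = (row, col), walk along each
    of the eight star directions until the first pixel whose depth estimate was
    marked reliable is reached. `reliable` is a boolean mask such as the one
    produced by the curvature-thresholding sketch above."""
    H, W = reliable.shape
    if max_steps is None:
        max_steps = max(H, W)
    candidates = []
    for dr, dc in STAR_DIRECTIONS:
        r, c = u
        for _ in range(max_steps):
            r, c = r + dr, c + dc
            if not (0 <= r < H and 0 <= c < W):
                break                      # ran off the image without a match
            if reliable[r, c]:
                candidates.append((r, c))  # closest valid pixel this direction
                break
    return candidates
```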
Below, it is shown that if two pixels u_1 and u_2 in an inverse depth image ξ(u) belong to the same planar surface in 3D, their inverse depth values ξ(u_1) and ξ(u_2) are constrained by the following affine equation:

\xi(u_1) - \xi(u_2) = \langle w, u_1 - u_2 \rangle    (Eq. 8)

where ⟨·,·⟩ is the inner product between two vectors.
Given a camera 12 with intrinsic parameters (f_u, f_v, c_u, c_v), where f is the focal length and c the optical centre, the corresponding 3D point x = (x, y, z) for a pixel u = (u, v) in the image 100 can be calculated using the following back-projection equation:

\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \frac{(u - c_u)}{f_u} z \\ \frac{(v - c_v)}{f_v} z \\ z \end{bmatrix}    (Eq. 9)

If, in addition, the corresponding point x belongs to a plane π in 3D it has to satisfy:

d = \langle n, x \rangle = n_x x + n_y y + n_z z    (Eq. 10)

where n = (n_x, n_y, n_z) is the unitary normal vector of the plane and d is the orthogonal distance to the origin. Substituting x in Eq. 9 into Eq. 10 allows the following to be obtained:

\xi(u) = \frac{1}{z} = \frac{n_x}{f_u d} u + \frac{n_y}{f_v d} v + \left( \frac{n_z}{d} - \frac{n_x c_u}{f_u d} - \frac{n_y c_v}{f_v d} \right) = \langle w, u \rangle + \mathrm{const.}    (Eq. 11)

where w = (w_1, w_2) = (n_x / (f_u d), n_y / (f_v d)) codifies the projection of the 3D surface normals into the image plane.
Finally, if u_1 and u_2 belong to the same 3D plane then, from Eq. 11:

\xi(u_1) - \xi(u_2) = (\langle w, u_1 \rangle + \mathrm{const.}) - (\langle w, u_2 \rangle + \mathrm{const.}) = \langle w, u_1 - u_2 \rangle    (Eq. 12)

The semi-dense depth-map ξ(u) containing meaningful depth estimates (eg Figure 4a) generated using the method explained above is then used to generate a dense depth-map.
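As an illustration of Eqs 9 to 12, the short sketch below computes the projected normal w for an assumed plane and camera intrinsics, evaluates the inverse depth predicted by Eq. 11 at a pixel, and checks the affine constraint of Eq. 12 for a pair of pixels. The numerical values and function names are chosen for the example only.

```python
import numpy as np

def projected_normal(n, d, fu, fv):
    """w = (n_x / (f_u d), n_y / (f_v d)): projection of the plane normal n
    (with plane offset d) into the image plane, as in Eq. 11."""
    return np.array([n[0] / (fu * d), n[1] / (fv * d)])

def inverse_depth_on_plane(u, n, d, fu, fv, cu, cv):
    """Inverse depth xi(u) predicted by Eq. 11 for pixel u = (u, v) lying on
    the plane <n, x> = d, given intrinsics (f_u, f_v, c_u, c_v)."""
    w = projected_normal(n, d, fu, fv)
    const = n[2] / d - n[0] * cu / (fu * d) - n[1] * cv / (fv * d)
    return float(w @ np.asarray(u, dtype=float) + const)

# Check Eq. 12 for two pixels assumed to lie on the same (slanted) plane.
n = np.array([0.2, -0.1, 0.97])               # plane normal (illustrative)
d, fu, fv, cu, cv = 2.5, 525.0, 525.0, 320.0, 240.0
u1, u2 = (100.0, 80.0), (400.0, 300.0)
w = projected_normal(n, d, fu, fv)
lhs = inverse_depth_on_plane(u1, n, d, fu, fv, cu, cv) - \
      inverse_depth_on_plane(u2, n, d, fu, fv, cu, cv)
rhs = w @ (np.asarray(u1) - np.asarray(u2))
assert abs(lhs - rhs) < 1e-9                  # Eq. 12 holds
```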
Once potential meaningful non-local pixel candidates with reliable depth estimates (certainty value above or equal to a threshold -such pixels may be described as "meaningful" pixels) have been identified, the geometric assumption is implemented. Estimated depths are calculated for pixels for which there is no (reliable) depth information which are seen as being likely to belong to the same affine surface as one or more of the meaningful non-local pixel candidates using the assumption that the pixels are located on the same affine surface.
Making use of the geometric assumption explained in Equations 8 to 12, the proposed energy function to be minimised is given by:

\min_{\xi, w} \iint_{\Omega \times \Omega} \alpha_1(u_1, u_2) \, \lvert \xi(u_1) - \xi(u_2) - \langle w(u_1), u_1 - u_2 \rangle \rvert    (Eq. 13)

+ \alpha_2(u_1, u_2) \, \lvert w_1(u_1) - w_1(u_2) \rvert    (Eq. 14)

+ \alpha_2(u_1, u_2) \, \lvert w_2(u_1) - w_2(u_2) \rvert \, du_1 \, du_2    (Eq. 15)

+ \int_{\Omega} \lambda(u) \big( \xi(u) - \eta(u) \big)^2 \, du    (Eq. 16)

Equation 13 is the part of the regulariser 28 in which affine surfaces are favoured between pixels u_1 and u_2, using the constraint shown in Equation 8. Equations 14 and 15 impose a kind of total variation constraint on the components of the estimated projected normal w(u) = (w_1(u), w_2(u)). Thus, these terms try to impose similar normal vectors for homogeneous surfaces whilst allowing for large discontinuities between different surfaces. As is explained in R. Ranftl, K. Bredies, and T. Pock, "Non-local total generalized variation for optical flow estimation", in Computer Vision - ECCV 2014, Springer, 2014, pp. 439-454, this regularisation term can be considered as a non-local extension of a Total Generalised Variation norm (see, for example, K. Bredies, K. Kunisch, and T. Pock, "Total generalized variation", SIAM J. Img. Sci., vol. 3, no. 3, pp. 492-526, Sep. 2010).
Finally, Eq. 16 is a standard data term that enforces pixels for which λ(u) ≠ 0 to be close to the input depth-map η(u) (eg 204a, 204b, 204c).
Coefficients α_1(u_1, u_2) and α_2(u_1, u_2) are used to incorporate soft-segmentation cues into the regulariser 28. The soft-segmentation cues are used to assist in determining which pixels may be assumed to form part of the same surface. In the current implementation, these weights are based on the intensity similarity between pixels u_1 and u_2 in the reference image (eg 100, 202a, 202b, 202c):

\alpha_i(u_1, u_2) \propto \exp\!\left( - \frac{\lVert I_r(u_1) - I_r(u_2) \rVert}{\sigma_I} \right)    (Eq. 17)

where σ_I controls the influence of the neighbouring pixels. In the embodiment being described, colour (RGB) information of the pixels is used as a soft-segmentation cue. If multiple pixels within a texture-less region have the same colour, it may be assumed that they form part of the same surface. In embodiments wherein a different sensor 12 is used, different information may be used. For example, in embodiments wherein LIDAR is used, reflectance information for each point may be used instead of colour.
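A minimal sketch of such a support weight, following the intensity-similarity form reconstructed in Eq. 17, is shown below; the choice of norm and the absence of an explicit normalisation constant are assumptions.

```python
import numpy as np

def support_weight(I_ref, u1, u2, sigma_I):
    """Soft-segmentation support weight between pixels u1 and u2 (given as
    (row, col) tuples) of the reference image I_ref, which may be greyscale or
    RGB. The weight decays with the intensity/colour difference, with sigma_I
    controlling how quickly; the exact form is an assumed reading of Eq. 17."""
    diff = np.asarray(I_ref[u1], dtype=float) - np.asarray(I_ref[u2], dtype=float)
    return float(np.exp(-np.linalg.norm(diff) / sigma_I))
```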
In addition, the coefficients are used as support weights to control the local and non-local influence of pixels. For all pixels, support weight values are calculated for a local window, which in the embodiment being described is a 7x7 window. To allow the transmission of depth information from meaningful pixels to a texture-less pixel u_tl, additional support weights are calculated with its non-local neighbours. In other embodiments, other sized windows may be used to calculate the support weights. Indeed, the window may not be square, and may for example be rectangular, circular, or the like.
In at least some embodiments, an adaptive approach is used. Initially, depth information for a pixel is sought based on depth information for other pixels within the window (local pixels, or neighbours). If no reliable depth information is found based on the other pixels within the window, pixels elsewhere in the image 100 are then used (non-local neighbours). In some embodiments, if no useful depth information is found within the initial window, and the texture-less area is large compared to the size of the window, a larger window may be used before looking for pixels outside of a window.
Assuming that m is the total number of support weights αi different from zero and that the images have n pixels, the proposed energy function can be expressed in a more compact matrix form after discretisation:

\min_{\xi} \left\lVert \alpha \circ K\xi \right\rVert_1 + \left\lVert \tilde{\lambda} \circ (\xi - \tilde{\eta}) \right\rVert_2^2    Eq. 18

where ∘ represents point-wise multiplication, α = [α1^T, α2^T, α2^T]^T is a 3m x 1 vector containing all support weights, ξ = [d^T, w1^T, w2^T]^T is a 3n x 1 extended vector containing the optimised depths and the first w1 and second w2 components of the normals, η̃ = [η^T, 0_{1x2n}]^T is a 3n x 1 extended input vector with the semi-dense depth-map η and additional padded zeros to match size (the corresponding λ is set to zero in λ̃ for these additional components), and K is a sparse selection matrix that takes into account distances between matched pixels.
Finally it should be noted that the expressions in Eq. 18 and Eq. 5 are almost identical, allowing use of Algorithm 1 (see Table 1) to solve the minimisation problem.
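For illustration, a first-order primal-dual iteration of the Chambolle-Pock type that minimises an energy of the form of Eq. 18 is sketched below. It is not a reproduction of Algorithm 1 from Table 1; the step-size choice, the iteration count and the exact weighting of the quadratic data term (here taken as a plain λ-weighted sum of squares, consistent with Eq. 16) are assumptions.

```python
# A hedged sketch of a primal-dual solver for the discretised energy of Eq. 18.
import numpy as np
import scipy.sparse.linalg as spla

def solve_eq18(K, alpha, lam, eta, n_iter=500):
    """Minimise ||alpha o (K xi)||_1 + sum_i lam_i (xi_i - eta_i)^2."""
    m, n = K.shape
    x = eta.copy()           # primal variable xi (depths + normal components)
    x_bar = x.copy()
    p = np.zeros(m)          # dual variable for the weighted L1 term

    # Conservative step sizes: tau * sigma * ||K||^2 <= 1 (Frobenius bound).
    L = spla.norm(K)
    tau = sigma = 1.0 / max(L, 1e-12)

    for _ in range(n_iter):
        # Dual ascent followed by projection onto the weighted L-infinity ball.
        p = np.clip(p + sigma * (K @ x_bar), -alpha, alpha)
        # Primal descent plus closed-form prox of the quadratic data term.
        x_new = (x - tau * (K.T @ p) + 2.0 * tau * lam * eta) / (1.0 + 2.0 * tau * lam)
        x_bar = 2.0 * x_new - x      # over-relaxation with theta = 1
        x = x_new
    return x
```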
One embodiment of the method described herein was evaluated using three real datasets for which ground truth models are available, computed from a pushbroom laser. By projecting the laser points into the reference image, ground truth depth-maps are generated 500a, 500b, 500c to be compared to the depth-maps 506a, 506b, 506c generated according to the embodiment tested, as shown in Figures 5a through 5c respectively.
Since not all laser points may be projected, only the subset of points which are projected is used to obtain the statistics.
The error 510a, 510b, 510c in the generated depth-maps 506a, 506b, 506c (as compared to ground truth 500a, 500b, 500c) is shown. Errors range from 0 m (black) to 0.5 m (white). The coloured 3D point clouds 508a, 508b, 508c clearly illustrate that these errors are low in terms of quality of reconstruction of the scene.
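An illustrative computation of such error statistics is sketched below; the mask of projected laser points and the function name are assumptions, and the 0.5 m saturation follows the visualisation convention described for the figures.

```python
# An illustrative sketch of gathering the error statistics: compare the
# estimated depth-map with the laser-derived ground truth only at pixels
# where a laser point projects (valid_mask). Not the evaluation code used
# for the patent.
import numpy as np

def depth_error_stats(estimated, ground_truth, valid_mask, saturation=0.5):
    """Return absolute depth errors (metres, saturated) and their median."""
    errors = np.abs(estimated[valid_mask] - ground_truth[valid_mask])
    errors = np.minimum(errors, saturation)   # saturate for visualisation
    return errors, float(np.median(errors))
```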
Figures 6a, 6b and 6c show histograms of the depth errors for Figures 5a, 5b and 5c respectively. For visualisation purposes, all errors are saturated to a maximum of 0.5 metres. Table 2 shows the median error for each dataset:
Table 2
Median Error (Range = [1.655, 3.445] m)
Dataset 1: 4 cm
Dataset 2: 6.74 cm
Dataset 3: 3.62 cm

Figure 7 provides a flow-chart of an embodiment of the method 700 described above. At step 702, a depth-map 140 is obtained. The depth-map 140 may have been generated from an image of an environment, from LIDAR data, or by any other technique known to one skilled in the art. The depth-map 140 comprises a plurality of points with depth estimates. Some or all of the points may have a depth estimate.
At step 704, a certainty value is calculated for the depth estimates of at least some of the points. A determination 706 is then made as to whether or not the certainty value corresponding to a point is below a first threshold. The skilled person would understand that points with certainty values below a first threshold may or may not include points with certainty values equal to the first threshold. Thresholds can therefore be applied with strictly less than, or less than or equal to, criteria, or with strictly more than, or more than or equal to, criteria.
It will be understood that herein, higher certainty values indicate increased confidence in a depth estimate - ie better depth estimates. In alternative embodiments, a scale with lower certainty values indicating higher confidence may be used. The skilled person would understand that the below/above thresholds and higher/lower certainty value comparisons should be reversed in such embodiments.
If the certainty value is below (or below or equal to) a first threshold, a new depth estimate is calculated 714 for the corresponding point using a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold.
If the certainty value is above or equal to (or strictly above) a first threshold, the extant depth estimate is maintained 708. If the certainty value is above or equal to (or strictly above) a second threshold, the depth estimate for that point is used 712 in the calculation 714 of improved depth estimates for points with lower certainty values.
In some embodiments, the second threshold may be higher than the first threshold. In other embodiments, the two thresholds may be equal - ie a single threshold may be used.
A 3D representation of the environment may therefore be generated 716 by using the original depth estimates where certainty values are satisfactory, and by using the newly calculated depth estimates to replace the original depth estimates where certainty values are not satisfactory.
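The decision logic of steps 704 to 716 can be summarised in a short sketch. The refine_depth() callable stands in for the geometric-assumption optimisation described above and is assumed to exist; the threshold values and function names are placeholders.

```python
# A minimal sketch of the thresholding logic of Figure 7 (steps 704-716).
import numpy as np

def generate_depths(depth_map, certainty, refine_depth,
                    first_threshold=0.5, second_threshold=0.5):
    """Keep reliable depths, recompute the rest from reliable points."""
    unreliable = certainty < first_threshold     # step 706: below first threshold
    reliable = certainty >= second_threshold     # step 712: reliable points to reuse

    improved = depth_map.copy()                  # step 708: keep extant estimates
    # Step 714: new estimates for unreliable points, driven by reliable ones.
    improved[unreliable] = refine_depth(depth_map, reliable)[unreliable]
    return improved                              # input to step 716
```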
In some embodiments, the 3D representation may be generated 716 by back-projecting the pixels with the estimated depths contained in the improved depth-map. In some embodiments, the 3D representation may be a point cloud.
In embodiments wherein the depth-map was generated from an image, colour information from the image may also be used. The 3D representation may therefore be a coloured point cloud. The skilled person would understand that other 3D representations or models may be generated. For example, 3D printing may be used to generate a physical model of the environment, 3D vector graphics may be used to generate a virtual model, or the like.
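As an illustration of the back-projection in step 716, the sketch below uses a simple pinhole camera model; the intrinsic parameters (fx, fy, cx, cy) and the absence of any distortion handling are assumptions rather than part of the described embodiment.

```python
# A hedged sketch of generating a coloured point cloud by back-projecting
# pixels with their refined depths through an assumed pinhole camera model.
import numpy as np

def back_project(depth, image, fx, fy, cx, cy):
    """Return an (N, 6) coloured point cloud: x, y, z, r, g, b."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]        # pixel row (v) and column (u) grids
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colours = image.reshape(-1, 3).astype(float)
    # Pixels with invalid (eg zero) depth could be masked out here if desired.
    return np.hstack([points, colours])
```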
The method 700 and system introduced herein allow dense depth-maps 140, 506a, 506b, 506c to be reconstructed from sparse or semi-dense depth-maps. The skilled person would understand that the embodiments described may be applied to any sparse or semi-dense depth-map, whether generated by an external process or internally.
The embodiments described may have particular utility in environments which contain texture-less yet affine surfaces 102 - plain planes/bland regions - wherein initial depth estimates for pixels within plain planes may be unreliable, or indeed meaningless.
As demonstrated above, for a selected geometric assumption, the problem can be framed as a non-convex optimisation problem which includes an energy term designed to propagate depth information across the scene -in particular from boundaries 104 of bland regions 102 to interiors thereof.
It has also been shown that the optimisation can be expressed in a familiar form which admits primal-dual optimisation. The efficacy of the approach has been demonstrated on a variety of data gathered from a robot moving on trajectories designed to challenge the reconstruction process.
The skilled person will appreciate that embodiments described herein implement elements thereof as software. The skilled person will also appreciate that those elements may also be implemented in firmware or hardware. Thus, software, firmware and/or hardware elements may be interchangeable as will be appreciated by the skilled person.

Claims (11)

  1. A method of generating a 3D representation of an environment, the 3D representation comprising a plurality of points, substantially each point having an estimated depth of that point relative to a reference, wherein the method comprises the following steps: i) obtaining a depth-map generated from the environment; ii) calculating a certainty value for the estimated depths of at least some of the points within the depth-map; iii) for points having a certainty value below a first threshold, using a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold to calculate a new estimated depth for those points below the first threshold; and iv) generating the 3D representation of the environment using the new estimated depths for points having a certainty value below the first threshold and the estimated depths from the depth-map for points having a certainty value above the first threshold.
  2. The method of claim 1 where the depth-map is generated by processing at least two images of the environment to determine how points move between the at least two images.
  3. The method of claim 2 where motion of the sensor that generated the at least two images of the environment is used to determine how points move between the at least two images.
  4. The method of any preceding claim wherein the geometric assumption favours affine geometry.
  5. The method of any preceding claim wherein the geometric assumption is implemented as a strong prior.
  6. The method of any preceding claim wherein the depth map is generated from an image of the environment, the method further comprising obtaining the image of the environment for which the depth-map is to be generated.
  7. The method of any preceding claim wherein each point of the depth-map is a pixel of an image of the environment for which the depth-map is to be generated.
  8. The method of any preceding claim wherein secondary information associated with each point is used to calculate the certainty value.
  9. The method of claim 8 in which the secondary information comprises one of colour and reflectance.
  10. A system arranged to generate a 3D representation of an environment, the 3D representation comprising a plurality of points, substantially each point having an estimated depth of that point relative to a reference, wherein the system comprises processing circuitry arranged to perform the following steps: i) obtain a depth-map generated from the environment; ii) calculate a certainty value for the estimated depths of at least some of the points within the depth-map; iii) for points having a certainty value below a first threshold, use a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold to calculate a new estimated depth for those points below the first threshold; and iv) generate the 3D representation of the environment using the new estimated depths for points having a certainty value below the first threshold and the estimated depths from the depth-map for points having a certainty value above the first threshold.
  11. A machine readable medium containing instructions which when read by a machine cause that machine to generate a 3D representation of an environment; the 3D representation comprising a plurality of points, substantially each point having an estimated depth of that point relative to a reference, wherein the instructions cause the machine to perform the following steps: i) obtain a depth-map generated from the environment; ii) calculate a certainty value for the estimated depths of at least some of the points within the depth-map; iii) for points having a certainty value below a first threshold, use a geometric assumption of the environment together with depth information for points having a certainty value above a second threshold to calculate a new estimated depth for those points below the first threshold; and iv) generate the 3D representation of the environment using the new estimated depths for points having a certainty value below the first threshold and the estimated depths from the depth-map for points having a certainty value above the first threshold.
GB1507013.9A 2015-04-24 2015-04-24 Method of generating a 3D representation of an environment and related apparatus Withdrawn GB2537831A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB1507013.9A GB2537831A (en) 2015-04-24 2015-04-24 Method of generating a 3D representation of an environment and related apparatus
GB1511065.3A GB2537696A (en) 2015-04-24 2015-06-23 A method of generating a three dimensional representation of an environment or system
PCT/GB2016/051098 WO2016170332A1 (en) 2015-04-24 2016-04-21 A method of generating a 3d representation of an environment and related apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1507013.9A GB2537831A (en) 2015-04-24 2015-04-24 Method of generating a 3D representation of an environment and related apparatus

Publications (2)

Publication Number Publication Date
GB201507013D0 GB201507013D0 (en) 2015-06-10
GB2537831A true GB2537831A (en) 2016-11-02

Family

ID=53488614

Family Applications (2)

Application Number Title Priority Date Filing Date
GB1507013.9A Withdrawn GB2537831A (en) 2015-04-24 2015-04-24 Method of generating a 3D representation of an environment and related apparatus
GB1511065.3A Withdrawn GB2537696A (en) 2015-04-24 2015-06-23 A method of generating a three dimensional representation of an environment or system

Family Applications After (1)

Application Number Title Priority Date Filing Date
GB1511065.3A Withdrawn GB2537696A (en) 2015-04-24 2015-06-23 A method of generating a three dimensional representation of an environment or system

Country Status (2)

Country Link
GB (2) GB2537831A (en)
WO (1) WO2016170332A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190164445A1 (en) * 2017-11-27 2019-05-30 Cae Inc. Method and system for simulating a radar image
EP3834179A4 (en) * 2018-08-07 2022-04-20 Groundprobe Pty Ltd Wall visualisation from virtual point of view
CN115187843B (en) * 2022-07-28 2023-03-14 中国测绘科学研究院 Depth map fusion method based on object space voxel and geometric feature constraint

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139423B1 (en) * 1999-09-17 2006-11-21 Thomson Licensing Method for building a three-dimensional scene by analyzing a sequence of images
US20070024614A1 (en) * 2005-07-26 2007-02-01 Tam Wa J Generating a depth map from a two-dimensional source image for stereoscopic and multiview imaging
EP2570990A1 (en) * 2011-09-13 2013-03-20 Thomson Licensing Apparatus and method for determining a confidence value of a disparity estimate
WO2014074039A1 (en) * 2012-11-12 2014-05-15 Telefonaktiebolaget L M Ericsson (Publ) Processing of depth images

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2597227B1 (en) * 1986-04-14 1992-09-11 Pixar METHOD FOR PRODUCING A TWO-DIMENSIONAL DISPLAY REPRESENTING A THREE-DIMENSIONAL DATA SET
CA1320009C (en) * 1988-12-12 1993-07-06 Harvey Ellis Cline System and method for detecting internal structures contained within the interior region of a solid object
AU2001251539A1 (en) * 2000-04-11 2001-10-23 Cornell Research Foundation Inc. System and method for three-dimensional image rendering and analysis
US7365745B2 (en) * 2005-09-15 2008-04-29 St. Jude Medical, Atrial Fibrillation Division, Inc. Method of rendering a surface from a solid graphical image
WO2012175731A1 (en) * 2011-06-24 2012-12-27 Softkinetic Software Depth measurement quality enhancement
US8660362B2 (en) * 2011-11-21 2014-02-25 Microsoft Corporation Combined depth filtering and super resolution
WO2013145554A1 (en) * 2012-03-29 2013-10-03 パナソニック株式会社 Image processing apparatus and image processing method
KR101913321B1 (en) * 2012-05-10 2018-10-30 삼성전자주식회사 Method of geometry acquisition for specular object based on depth sensor and the device thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139423B1 (en) * 1999-09-17 2006-11-21 Thomson Licensing Method for building a three-dimensional scene by analyzing a sequence of images
US20070024614A1 (en) * 2005-07-26 2007-02-01 Tam Wa J Generating a depth map from a two-dimensional source image for stereoscopic and multiview imaging
EP2570990A1 (en) * 2011-09-13 2013-03-20 Thomson Licensing Apparatus and method for determining a confidence value of a disparity estimate
WO2014074039A1 (en) * 2012-11-12 2014-05-15 Telefonaktiebolaget L M Ericsson (Publ) Processing of depth images

Also Published As

Publication number Publication date
WO2016170332A1 (en) 2016-10-27
GB2537696A (en) 2016-10-26
GB201511065D0 (en) 2015-08-05
GB201507013D0 (en) 2015-06-10

Similar Documents

Publication Publication Date Title
EP3471057B1 (en) Image processing method and apparatus using depth value estimation
US9426444B2 (en) Depth measurement quality enhancement
JP6261489B2 (en) Non-primary computer-readable medium storing method, image processing apparatus, and program for extracting plane from three-dimensional point cloud
JP5487298B2 (en) 3D image generation
Hulik et al. Continuous plane detection in point-cloud data based on 3D Hough Transform
EP2420975B1 (en) System and method for 3d wireframe reconstruction from video
Greene et al. Multi-level mapping: Real-time dense monocular slam
JP7448485B2 (en) Methods and systems used in point cloud coloring
US20160321838A1 (en) System for processing a three-dimensional (3d) image and related methods using an icp algorithm
US20110274343A1 (en) System and method for extraction of features from a 3-d point cloud
US7747106B2 (en) Method and system for filtering, registering, and matching 2.5D normal maps
Pinies et al. Dense mono reconstruction: Living with the pain of the plain plane
WO2009023044A2 (en) Method and system for fast dense stereoscopic ranging
KR20120031012A (en) Piecewise planar reconstruction of three-dimensional scenes
CN110243390B (en) Pose determination method and device and odometer
US20140168204A1 (en) Model based video projection
CN107403451B (en) Self-adaptive binary characteristic monocular vision odometer method, computer and robot
Li et al. Dense surface reconstruction from monocular vision and LiDAR
GB2537831A (en) Method of generating a 3D representation of an environment and related apparatus
CN110992393A (en) Target motion tracking method based on vision
JP4836065B2 (en) Edge tracking method and computer program therefor
Paudel et al. 2D-3D camera fusion for visual odometry in outdoor environments
Rothermel et al. Fast and robust generation of semantic urban terrain models from UAV video streams
JP2009237847A (en) Information processor, information processing method, and computer program
Noraky et al. Depth estimation of non-rigid objects for time-of-flight imaging

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 1227529

Country of ref document: HK

REG Reference to a national code

Ref country code: HK

Ref legal event code: WD

Ref document number: 1227529

Country of ref document: HK