WO2019092439A1 - Detecting static parts of a scene - Google Patents

Detecting static parts of a scene

Info

Publication number
WO2019092439A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
environment
static
ephemerality
representations
Prior art date
Application number
PCT/GB2018/053259
Other languages
French (fr)
Inventor
Ingmar POSNER
Daniel BARNES
Will MADDERN
Geoffrey PASCOE
Original Assignee
Oxford University Innovation Limited
Priority date
Filing date
Publication date
Application filed by Oxford University Innovation Limited filed Critical Oxford University Innovation Limited
Priority to EP18804091.9A priority Critical patent/EP3710985A1/en
Publication of WO2019092439A1 publication Critical patent/WO2019092439A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 17/00 Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S 17/86 Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S 17/00 Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
    • G01S 17/88 Lidar systems specially adapted for specific applications
    • G01S 17/89 Lidar systems specially adapted for specific applications for mapping or imaging
    • G01S 17/894 3D imaging with simultaneous measurement of time-of-flight at a 2D array of receiver pixels, e.g. time-of-flight cameras or flash lidar
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Definitions

  • This invention relates to processing a representation (such as an image) of a part of an environment in order to distinguish between static content and ephemeral content.
  • the method may relate to a sensed part of an environment proximal to a vehicle.
  • the method may relate to a method of route determination, egomotion determination, and/or obstacle detection.
  • the processing may be used as part of autonomous vehicle navigation systems.
  • surveillance systems which may be arranged to detect non-permanent objects in a scene
  • smartphone applications; surveying applications which may be arranged to detect change relative to a previous survey; and the like.
  • Recent 3D SLAM approaches have integrated per-pixel semantic segmentation layers to improve reconstruction quality ([11] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. Montiel, "Towards semantic SLAM using a monocular camera," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1277-1284; [12] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, "Probabilistic data association for semantic SLAM," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1722-1729), but again rely on laboriously manually-annotated training data and chosen classes that encompass all object categories.
  • a method of distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment comprises one or more of the following steps: (i) automatically generating training data comprising a set of training representations and corresponding ephemerality masks.
  • the ephemerality masks may segment each of the training representations into static and ephemeral parts.
  • the training representations may be representations of a training environment.
  • the ephemerality masks may be generated by one or more of the following: comparing each training representation to a corresponding portion of a 3D static model; computing a discrepancy between the training representation and the corresponding portion of the 3D static model of the static parts of the training environment; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral parts;
  • the method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
  • the generating training data may comprise: obtaining data for multiple traversals of a given route through the training environment, the data comprising 3D data;
  • the 3D data may be a 3D point cloud.
  • the prediction may comprise generating an ephemerality mask for the representation of the experienced environment.
  • the training representations may comprise visual images of the training environment.
  • the visual images comprise stereoscopic pairs of visual images.
  • the ephemerality masks may be generated by: comparing a stereoscopic pair of the training visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the training environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy.
  • the representations of the experienced environment may be visual images of the experienced environment.
  • the discrepancy may comprise depth discrepancy (disparity).
  • the discrepancy may comprise a normal error.
  • the discrepancy may comprise both depth discrepancy (disparity) and normal error.
  • the training data may further comprise a depth mask for each training representation.
  • the method may further comprise predicting, using the trained neural network, depth values for elements of the representation of the experienced environment.
  • the method may further comprise at least one of: i. determining a route for a vehicle using the predicted depth values;
  • the training environment may be different from the environment in which the method is used (ie an experienced environment).
  • embodiments of the invention may therefore work in non-surveyed areas.
  • the inventors extended the map prior approach of [16] (C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales " in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) to multi-session mapping and quantified ephemerality using a structural entropy metric. The result may then be used to automatically generate training data for a deep convolutional network.
  • embodiments described herein do not rely on live localisation against a prior map or live dense depth estimation from a stereo camera feed, and hence can operate in a wider range of (unmapped) locations, even with a reduced (monocular-only) sensor suite and limited or no localisation.
  • a computer-implemented method of automatically distinguishing between static and ephemeral parts of an environment in representations of the environment may comprise at least one of the following: i) obtaining data for multiple traversals of a given route through an environment, the data comprising 3D data giving information on a 3D shape of the environment; and ii) comparing the 3D data from different traversals so as to split the data into at least one of: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
  • the data may further comprise visual images.
  • the method may further comprise: generating a prior 3D static model of the environment using the static points and not the ephemeral points; comparing a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
  • the discrepancy may comprise at least one of depth discrepancy (disparity) and normal error.
  • the present application is conceptually different in that what is classed as ephemeral is not determined based on class, but rather on whether or not it has moved.
  • embodiments of the present invention would recognise the museum display as a static part of the scene (non-ephemeral).
  • prior art class-based approaches trained to recognise vehicles as a class, would recognise the museum display as vehicles and therefore classify them as ephemeral objects.
  • a method of automatically distinguishing between static and ephemeral parts of an experienced environment in representations of that experienced environment may comprise one or more of the following: taking a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; providing representations of the experienced environment to the neural network; and predicting, using the neural network, which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
  • the neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content.
  • the training representations may be representations of a training environment.
  • the ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model.
  • the training data may comprise a set of training visual images and the corresponding ephemerality masks may mark the segmentation of each of the training visual images into static and ephemeral content.
  • the ephemerality masks may be generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
  • the predicting may comprise predicting an ephemerality mask for each representation of the experienced environment using the trained neural network, the ephemerality mask segmenting the representation of the experienced environment into static and ephemeral content.
  • the method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
  • a system for automatically distinguishing between static and ephemeral parts of representations of an experienced environment around a vehicle as it moves within that environment may comprise one or more of the following: (i) one or more survey vehicles each equipped with one or more sensors arranged to collect a set of training representations, wherein the training representations are representations of a training environment;
  • (ii) processing circuitry arranged to generate training data comprising the set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content, wherein the ephemerality masks are generated by: comparing each training representation to a corresponding portion of a 3D model of the static parts of the training environment; computing a discrepancy between the training representation and the corresponding portion of the 3D static model for each element of each training representation; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral content; (iii) a neural network arranged to be trained, using the training data, to distinguish between ephemeral and static content in representations of environments; and
  • a vehicle comprising a sensor arranged to generate a representation of the environment through which it moves, and arranged to provide the representation of that environment to the trained neural network; and wherein the trained neural network is arranged to predict which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
  • the processing circuitry of the vehicle may comprise the trained neural network.
  • the one or more survey vehicles may be arranged to obtain training representations for multiple traversals of a given route through the training environment.
  • the training representations may comprise 3D data; and the processing circuitry may be arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
  • the trained neural network may be arranged to predict an ephemerality mask for the representation of the experienced environment.
  • the or each survey vehicle may be equipped with a camera and the set of training representations may comprise visual images.
  • the processing circuitry may be arranged to: generate a prior 3D static model of the training environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
  • a system for automatically distinguishing between static and ephemeral parts of an environment in representations of the environment may comprise: one or more survey vehicles each equipped with one or more sensors, the one or more survey vehicles being arranged to obtain data for multiple traversals of a given route through the environment, the data comprising 3D data; and processing circuitry arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
  • the data may further comprise visual images.
  • the processing circuitry may be further arranged to: generate a prior 3D static model of the environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
  • the discrepancy may comprise at least one of a depth discrepancy (disparity) and normal error.
  • a system for automatically distinguishing between static and ephemeral parts of an environment in representations of a runtime environment may comprise at least one of the following: i) a vehicle equipped with at least one of: a sensor arranged to obtain representations of the environment; and a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; ii) processing circuitry arranged to provide the representations of the experienced environment to the neural network; and wherein the neural network may be arranged to predict which parts of the representations of the experienced environment relate to static parts of the environment and which to ephemeral parts of the environment.
  • the neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content.
  • the training representations may be representations of a training environment, and the ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model for each element of each training representation.
  • the training data may comprise a set of training visual images and the corresponding ephemerality masks may segment each of the training visual images into static and ephemeral content, wherein the ephemerality masks are generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
  • the processing circuitry may be further arranged to do at least one of the following: i. determine a route for a vehicle using the predicted ephemerality mask;
  • the neural network may be arranged to predict an ephemerality mask for each representation of the experienced environment using the trained neural network.
  • According to a seventh aspect of the invention there is provided a machine-readable medium containing instructions arranged to, when read by processing circuitry, cause the processing circuitry to perform the method of the first or second aspects of the invention.
  • the machine readable medium referred to may be any of the following: a CDROM; a DVD ROM / RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive, an SD card, a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc.
  • Figure 1 is a schematic view of a vehicle utilising an embodiment for route determination
  • Figure 2 shows input images, corresponding ephemerality masks according to an embodiment, and the effect of their use on visual odometry;
  • Figure 3 shows a survey vehicle suitable for use with embodiments of the invention;
  • Figure 4 shows static 3D model generation in accordance with embodiments of the invention
  • Figure 5 schematically illustrates an ephemerality labelling process according to various embodiments of the invention
  • Figure 6 shows a network architecture according to various embodiments of the invention.
  • Figure 7 shows input data, a depth map, and ephemerality determinations according to an embodiment of the invention
  • Figure 8 shows a selection of input images and corresponding ephemerality masks of an embodiment of the invention
  • Figure 9 shows graphs illustrating the performance of an embodiment with respect to velocity estimation errors
  • Figure 10 shows an input image and ephemerality mask of an embodiment
  • Figure 11 shows a flow chart outlining steps of a method of predicting ephemerality masks of an embodiment
  • Figure 12 shows a flow chart outlining steps of a method of preparing training data of an embodiment
  • Figure 13 shows a flow chart outlining steps of a further embodiment.
  • embodiments of the invention may find wider applicability.
  • the ability to determine which parts of a scene are ephemeral (non-constant, and/or moving, such as vehicles, road works and pedestrians) and/or which parts relate to static elements (such as buildings, roads, and trees) may find applicability in a number of other fields.
  • embodiments may find utility in surveillance systems (for example to aid object detection), smartphone applications, and surveying applications interested in change detection (e.g. returning to a previously surveyed environment to determine whether any infrastructure has changed).
  • embodiments of the invention can also be applied to other representations of an environment; for example labelling a LIDAR point cloud or other 3D or quasi-3D representation of an environment instead of or as well as labelling visual images.
  • embodiments of the invention are described in relation to a sensor 100 mounted upon a vehicle 102 and in relation to the flow charts of Figures 12 and 13.
  • the sensor 100 is arranged to monitor its locale and generate data based upon the monitoring thereby providing data giving a representation of a sensed scene around the vehicle; ie an experienced environment.
  • the sensor 100 is also arranged to monitor the locale of the vehicle.
  • the vehicle 102 is a truck.
  • the sensor 100 is a passive sensor (i.e. it does not create radiation and merely detects radiation) and in particular is a monocular camera 100.
  • the skilled person will appreciate that different or multiple cameras could be used in some embodiments.
  • the sensor 100 may comprise other forms of sensor.
  • the sensor 100 may also be an active sensor arranged to send radiation out therefrom and detect reflected radiation, such as a LIDAR system.
  • the vehicle 102 is a road vehicle travelling along a road 108 and the sensor 100 is imaging the locale (eg the building 110, road 108, car 109, etc.) as the vehicle 102 travels.
  • the vehicle 102 also comprises processing circuitry 112 arranged to capture data from the sensor 100 and subsequently to process the data (in this case visual images 200) generated by the sensor 100.
  • the processing circuitry captures data from the sensor 100 which data provides a sensed scene from around the vehicle 102 at a current time; as the vehicle 102 moves the sensed scene changes.
  • the processing circuitry 112 also comprises, or has access to, a storage device 114 on the vehicle 102.
  • the storage device 114 comprises program storage 126 and data storage 128 in the embodiment being described.
  • the visual images 200 are stored in the data storage 128 portion.
  • the data storage 128 may be differently partitioned, or not partitioned.
  • some of the objects 110 remain static (i.e. do not move or change other than changes in lighting, etc) and an example of such a static object within Figure 1 would be the building 110.
  • Such static parts of the scene may be thought of as being structural or unchanging parts of the scene.
  • Other objects 109 are not static, are not fixed relative to the sensed scene, and/or are only temporarily static and may not be there should the locale be revisited in the future; such objects may be referred to as ephemeral objects.
  • An example of such an ephemeral object in Figure 1 would be the car 109 (whether or not the car is parked at the time).
  • the processing circuitry 112 comprises a neural network, or segmentation unit, trained to predict whether or not features within the data captured by the sensor 100 are ephemeral.
  • captured data are visual images, and the neural network is also trained to predict depth.
  • Features determined to be ephemeral can be ignored as distractors when performing visual odometry and localisation, and may be taken into account as potential obstacles.
  • the captured data may be of a different type, and/or depth may not be predicted.
  • the vehicle 102 may have a LIDAR sensor or other sensor as well as or instead of a monocular camera 100. As depth is determined directly, no depth prediction is needed in this example.
  • Such embodiments may be advantageous for example when a user wishes to determine which elements in a point-cloud are static and which are ephemeral in a single traversal of an environment; the trained neural network allows a prediction of ephemerality to be made for the LIDAR point-clouds collected as an environment is experienced (ie at run-time).
  • Features determined to be ephemeral can be ignored as distractors when trying to produce a representation of the structure of the environment traversed.
  • a self-supervised approach to ignoring such "distractors" in camera images for the purposes of robustly estimating vehicle motion is described herein.
  • the approach may have particular utility in cluttered urban environments.
  • the approach described herein leverages multi-session mapping (ie maps that have been generated in a number of sessions) to automatically generate ephemerality masks for input images.
  • a depth map may also be generated for the input images.
  • the generation of the ephemerality masks and depth maps may be performed offline, prior to use of embodiments of the invention onboard a vehicle 102.
  • the ephemerality mask is a per-pixel ephemerality mask in the embodiment being described, such that each pixel of each input image 200 is assigned an ephemerality value.
  • each image may be divided differently, for example grouping pixels and providing a value for each group of pixels.
  • the depth map is a per-pixel depth map in the embodiment being described, such that each pixel of each input image is assigned a depth value.
  • each image may be divided differently, for example grouping pixels and providing a value for each group of pixels.
  • the images and their associated ephemerality masks and depth maps are then used to train a deep convolutional network (a neural network) in the embodiment being described - the images, ephemerality masks and depth maps may therefore be thought of as training data.
  • the trained network can then predict ephemerality and depth for other images, even images 200 of environments outside of the environment covered by the training data.
  • the following describes embodiments in which the trained network is so used to process representations of an environment that is experienced; ie an experienced environment.
  • a vehicle 102 using an embodiment of the invention therefore does not have to be within an environment surveyed for the training data to successfully use an embodiment of the invention, nor does the vehicle 102 require knowledge of its location. Embodiments of the invention may therefore offer greater flexibility of use than prior art alternatives.
  • the predicted ephemerality and depth can then be used as an input to a monocular visual odometry (VO) pipeline, for example using either sparse features or dense photometric matching.
  • Embodiments of the invention may therefore yield metric-scale VO using only a single camera (due to the metric depth estimation enabled by the depth prediction training), and experiments have shown that embodiments may be able to recover the correct egomotion even when 90% of the image is obscured by dynamic, independently moving objects.
  • Embodiments of the invention may therefore yield reduced odometry drift and significantly improved egomotion estimation in the presence of large moving vehicles in urban traffic.
  • Figure 2 shows an example of robust motion estimation in urban environments using a single camera and a learned ephemerality mask.
  • Figure 2 shows three visual images 200a-c of a particular part of an environment at consecutive times (top left).
  • the images 200a-c are captured by a camera 100 of a vehicle 102.
  • the images 200a-c show a bus 202 driving along a road 108.
  • Figure 2 shows three ephemerality masks 204a-c, one corresponding to each image 200a-c (top right).
  • the bus 202 which is an example of an ephemeral object, can be easily distinguished from the road, pavement and wall, which are examples of static objects.
  • a large bus 202 passes in front of the vehicle 102 (the arrow indicates the direction in which the vehicle 102 is facing), obscuring the view of the scene (top left).
  • the learned ephemerality mask 204 correctly identifies the bus 202 as an unreliable region of the image 200a-c for the purposes of motion estimation (top right).
  • the ephemerality mask 204a-c predicts stable image regions (e.g. buildings, road markings, static landmarks, shown as dark grey or black in Figure 2) that are likely to be useful for motion estimation, in contrast to dynamic, or ephemeral, objects (e.g. pedestrian and vehicle traffic, vegetation, temporary signage, shown as light grey or white in Figure 2).
  • the embodiment being described uses repeated traversals of the same route to generate training data without requiring manual labelling or object class recognition.
  • the training data is gathered by a survey vehicle equipped with a LIDAR sensor and a stereoscopic camera.
  • alternative or additional sensors may be used in other embodiments, and that a stereoscopic camera alone may be sufficient in some embodiments.
  • the survey vehicle 302 traverses a particular route through a particular environment multiple times, gathering data on the environment, the data including 3D data and visual images. The gathered data is then used to produce depth and ephemerality labels for the images, and the labelled images (training data) are then passed to a deep convolutional network in a self- supervised process (no manual assistance or annotation required).
  • the depth maps for the training data are generated using stereoscopic image pairs taken by a stereoscopic camera C mounted on a survey vehicle 302. Per-pixel depth estimation is performed by warping the left image onto the right image (or vice versa) and matching intensities. A shift in the position of an object between the left and right images provides an estimate of the distance from the camera C to that object: the larger the shift, the smaller the distance.
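  • By way of non-limiting illustration, the relationship between pixel shift (disparity) and distance may be sketched as below; the focal length and baseline values in the sketch are assumptions chosen only for the example and are not those of the embodiment.

```python
# Illustrative sketch (not the embodiment's implementation): recovering per-pixel
# depth from a rectified stereo pair. The focal length and baseline values below
# are assumptions chosen only for the example.
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px=983.0, baseline_m=0.24):
    """Convert a disparity map (pixels) to metric depth (metres).

    depth = f * B / d: a larger shift (disparity) between the left and right
    images means the point is closer to the camera.
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0            # zero disparity => point at infinity / invalid
    depth[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth

# Example: with these assumed parameters, a 10-pixel shift corresponds to roughly 23.6 m.
print(depth_from_disparity(np.array([10.0])))
```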
  • the per-pixel depth estimates are refined using the static 3D model in the embodiment being described; this model is generated from the LIDAR data, but it may be produced differently in other embodiments
  • this refinement step may not be used.
  • LIDAR data or the like may be mapped onto images to provide per-pixel depth estimates and stereoscopic data may not be used.
  • the ephemerality mask is integrated as a component of a monocular visual odometry (VO) pipeline as an outlier rejection scheme.
  • a static 3D model (i.e. a model of the environment with the ephemeral objects removed) is used in generating the training data.
  • the method disclosed herein is advantageous over many prior art techniques in that manual labelling is not needed and classifications of ephemeral objects and image recognition based on those classes are not required.
  • prior art techniques could be used to generate a suitable static 3D model.
  • LIDAR and stereo camera sensors are only provided on the survey vehicle 302 to collect training data in the embodiment being described; at run time, a monocular camera 100 on the vehicle 102 is sufficient (although additional or alternative sensors may be used or present in some embodiments).
  • a synthetic 3D static model, which may be termed a prior 3D static structure, is generated using the static/structural components of the environment.
  • the prior 3D static structure is projected into every stereo camera image collected during the survey.
  • a dense stereo approach (similar to [16] - C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales " in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) is used to compare the stereo camera image to the prior 3D static structure.
  • where ephemeral objects are present in the scene, the two 3D representations differ, i.e. there is some discrepancy between them.
  • ephemerality is computed as a weighted sum of the discrepancies.
  • the calculated discrepancy includes both the depth disparity and the normal difference between the prior/static 3D structure and the "true"/current 3D structure.
  • 3) Network Training:
  • a deep convolutional network is then trained to predict the resulting pixel-wise depth and ephemerality mask for input monocular images 200 and at run time, live depth and ephemerality masks are predicted for images taken by a camera 100 even in locations not traversed by the survey vehicle 302. These three steps are described in more detail below.
  • a survey vehicle 302 equipped with a camera C and a LIDAR L, illustrated in Figure 3, performs a number of traverses j of an environment.
  • a neighbourhood function N(·) is defined, where a point belongs to the neighbourhood of a query point if its distance from the query point is less than α.
  • α is a neighbourhood size parameter, typically set to 0.5 m in the experiments described herein.
  • a histogram of the traverses j from which points fell in the neighbourhood of the query point is then built.
  • neighbourhoods of points that are well-distributed between different traversals indicate static structure, whereas neighbourhoods of points that were only sourced from one or two traversals are likely to be ephemeral objects.
  • a point p_i is classified as static structure P if the neighbourhood entropy H(p_i) exceeds a minimum threshold; all other points are estimated to be ephemeral and are removed from the static 3D prior.
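  • By way of non-limiting illustration, the entropy-based split of the accumulated point cloud into static and ephemeral sets may be sketched as below; the entropy threshold value, the use of the natural logarithm and the helper names are assumptions for illustration only.

```python
# Hedged sketch of the entropy-based static/ephemeral split described above.
# The threshold value and log base are assumptions, not those of the embodiment.
import numpy as np
from scipy.spatial import cKDTree

def split_static_ephemeral(points, traversal_ids, alpha=0.5, entropy_min=1.0):
    """points: (N, 3) array of 3D points aggregated over all traversals.
    traversal_ids: (N,) integer array giving the traversal j each point came from.
    Returns a boolean mask that is True for points kept as static structure.
    """
    tree = cKDTree(points)
    n_traversals = int(traversal_ids.max()) + 1
    static = np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        neighbours = tree.query_ball_point(p, r=alpha)        # neighbourhood N(p_i)
        counts = np.bincount(traversal_ids[neighbours], minlength=n_traversals)
        h = counts / counts.sum()                             # distribution over traversals
        entropy = -np.sum(h[h > 0] * np.log(h[h > 0]))        # neighbourhood entropy H(p_i)
        static[i] = entropy >= entropy_min                    # well-distributed => static
    return static
```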
  • Figure 4 illustrates prior 3D mapping to determine the static 3D scene structure. Alignment of multiple traversals of a route (image 402, top left) yields a large number of points only present in single traversals, e.g. traffic or parked vehicles, here shown in white, around the survey vehicle 302. These points corrupt a synthetic depth map (image 404, top right).
  • the embodiment being described removes 3D points that were only observed in a small number of traversals, and retains the structure that remained static for the duration of data collection (image 406, bottom left), resulting in high-quality synthetic depth maps (image 408, bottom right).
  • black denotes areas of invalid data (e.g. the sky for which no LIDAR returns were received) and white denotes points which are discarded from the map optimisation process as a result of the ephemerality judgements.
  • grayscale values from around 0.2 to 0.8 (i.e. with a gap at each end so that the white and black regions remain distinguishable) illustrate depth from the sensor viewpoint, with lighter coloured areas being further away.
  • Figure 5 illustrates the ephemerality labelling process 500 of the embodiment being described.
  • a synthetic normal image can also be generated for each survey image, as is also illustrated in Figure 5.
  • the true disparity d_t 504, i.e. depth from the camera, and the true normals n_t 514, i.e. lines perpendicular to the local plane, are obtained from the stereo image taken by the camera;
  • the prior 3D pointcloud p^s is projected into the image to form the prior disparity d^s 506 and prior normal n^s 516.
  • a difference between the true and prior disparities is termed a disparity error 508.
  • a difference between the true and prior normals is termed a normal error 518.
  • the disparity and normal error terms are combined to form the ephemerality mask 520 (right).
  • normal errors may not be used and the ephemerality mask 520 may be based on disparity (depth error) alone, or on depth error and one or more other factors.
  • black denotes invalid pixels.
  • for example, the distance from the camera to a first point on the road may be similar to the distance from the camera to a tyre blocking the view of that point, so the presence of the tyre may not be easy to determine from depth alone.
  • the road is substantially horizontal whereas the side of the tyre is substantially vertical, so the normal error is large.
  • use of normal errors can assist in providing crisp and accurate outlines of ephemeral objects.
  • the ephemerality mask E_t is defined as a weighted combination of the differences between the expected static and the true disparities and normals (equation (5)), where two weighting parameters balance the disparity and normal terms, and E_t is bounded to [0, 1] after computation.
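  • By way of non-limiting illustration, the combination of disparity error and normal error into a per-pixel ephemerality mask may be sketched as below; the weighting values and the dot-product form of the normal error are assumptions standing in for the weighting parameters of equation (5).

```python
# Minimal sketch of combining disparity error and normal error into a per-pixel
# ephemerality mask. The weights and the dot-product normal error are assumptions.
import numpy as np

def ephemerality_mask(d_true, d_prior, n_true, n_prior, w_disp=0.05, w_norm=0.5):
    """d_true, d_prior: (H, W) true and prior disparity maps.
    n_true, n_prior: (H, W, 3) unit surface normals.
    Returns an (H, W) mask bounded to [0, 1]; higher values indicate ephemeral content.
    """
    disparity_error = np.abs(d_true - d_prior)
    normal_error = 1.0 - np.sum(n_true * n_prior, axis=-1)   # 0 when the normals agree
    mask = w_disp * disparity_error + w_norm * normal_error
    return np.clip(mask, 0.0, 1.0)
```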
  • a convolutional encoder-multi-decoder network architecture is trained and used to predict both disparity and ephemerality masks from a single image in the embodiment being described, as illustrated in Figure 6.
  • Figure 6 shows a network architecture 600 for ephemerality and disparity learning (termed a "deep distraction network" as it is arranged to perform deep machine learning to identify distractions, i.e. ephemeral parts of scenes).
  • a common encoder 602 is used which splits to multiple decoders 604, 606 for the ephemerality mask 608 and disparity 610 outputs.
  • the deep distraction network comprises a single-encoder multi-decoder network with skip connections connecting each scale of the encoder to corresponding scales in the decoders similar to the UNet architecture (see reference [21]).
  • the encoder used in the embodiment being described is based on the VGG network (see Simonyan, K. & Zisserman, A., "Very deep convolutional networks for large-scale image recognition", in Proc. International Conference on Learning Representations) and is used to extract a low-resolution feature map from the input monocular image 200.
  • the decoders perform the opposite of the encoder, essentially reversing the VGG operations. Specifically, the decoders map the low-resolution encoder feature map to full input-resolution feature maps for pixel-wise classification of ephemerality and disparity. At each scale in the decoders, a skip connection passes higher-resolution features from the corresponding scale in the encoder.
  • Each decoder is identical at initialisation but is trained independently for its specific tasks.
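  • By way of non-limiting illustration, a single-encoder, multi-decoder network with skip connections of the kind described above may be sketched as below (here in PyTorch); the channel counts, depth and layer choices are placeholders rather than the VGG-based encoder of the embodiment.

```python
# Hedged sketch of a shared encoder with two decoders (ephemerality, disparity)
# and UNet-style skip connections. Channel counts and depth are placeholders.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    def __init__(self, channels, out_channels, out_activation):
        super().__init__()
        self.ups, self.convs = nn.ModuleList(), nn.ModuleList()
        for c_deep, c_skip in zip(channels[:0:-1], channels[-2::-1]):
            self.ups.append(nn.ConvTranspose2d(c_deep, c_skip, 2, stride=2))
            self.convs.append(conv_block(2 * c_skip, c_skip))
        self.head = nn.Conv2d(channels[0], out_channels, 1)
        self.act = out_activation

    def forward(self, feats):
        x = feats[-1]
        for up, conv, skip in zip(self.ups, self.convs, feats[-2::-1]):
            x = conv(torch.cat([up(x), skip], dim=1))   # skip connection at each scale
        return self.act(self.head(x))

class DistractionNet(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.enc_blocks = nn.ModuleList()
        c_prev = 3
        for c in channels:
            self.enc_blocks.append(conv_block(c_prev, c))
            c_prev = c
        self.pool = nn.MaxPool2d(2)
        self.ephemerality = Decoder(channels, 1, nn.Sigmoid())   # bounded to [0, 1]
        self.disparity = Decoder(channels, 1, nn.Identity())

    def forward(self, image):
        feats, x = [], image
        for i, block in enumerate(self.enc_blocks):
            x = block(x if i == 0 else self.pool(x))
            feats.append(x)
        return self.ephemerality(feats), self.disparity(feats)

# Example: a single 640x256 monocular image yields full-resolution ephemerality
# and disparity maps from the two decoder heads.
masks, disp = DistractionNet()(torch.randn(1, 3, 256, 640))
```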
  • a pixel-wise loss is used to replicate the pre-computed ground truth ephemerality masks in the training data, which were calculated using the 3D static model and corresponding stereoscopic images in the embodiment being described.
  • the stereo photometric loss proposed in [22] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in CVPR, 2017 is used, optionally semi-supervised using the prior LIDAR disparity d^s to ensure metric-scaled outputs.
  • the losses between the different output stages are balanced using the multi-task learning approach in [23] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics " arXiv preprint arXiv: 1705.07115, 2017, which continuously updates the inter-task weighting during training.
  • the Adam optimiser as described in Kingma, D. and Ba, J., 2014, "Adam: A method for stochastic optimization", arXiv preprint arXiv: 1412.6980, and an initial learning rate of 1×10⁻⁴ were used. The skilled person will appreciate that such implementation details may vary between embodiments.
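  • By way of non-limiting illustration, the uncertainty-based inter-task weighting of [23] used to balance the ephemerality and disparity losses may be sketched as below; the learned log-variance parameterisation and the exact loss terms shown are assumptions for illustration.

```python
# Hedged sketch of uncertainty-weighted multi-task loss balancing in the style
# of [23]; the parameterisation shown is an assumption, not the embodiment's code.
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    def __init__(self, n_tasks=2):
        super().__init__()
        # One learnable log-variance per task; updated continuously during training,
        # so the inter-task weighting adapts as training progresses.
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        total = 0.0
        for loss, log_var in zip(task_losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total

# Example usage: combined = MultiTaskLoss()([ephemerality_loss, disparity_loss])
```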
  • the live depth and ephemerality mask produced by the network are leveraged to produce reliable visual odometry estimates accurate to metric scale.
  • the warping function projects the matched feature x_j into the current image according to the relative pose and the camera intrinsics.
  • the set of all extracted features is typically a small subset of the total number of pixels in the image.
  • the step function s(E_j) is used to disable the residual according to the predicted ephemerality: the residual is retained when the predicted ephemerality E_j is below a maximum ephemerality threshold for a valid feature, and is disabled otherwise.
  • the maximum ephemerality threshold for a valid feature is typically set to 0.5 in the embodiments being described.
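  • By way of non-limiting illustration, the gating of sparse feature residuals by the predicted ephemerality may be sketched as below; the squared-error residual and the threshold name are assumptions.

```python
# Hedged sketch of the ephemerality gate on sparse feature residuals.
import numpy as np

E_MAX = 0.5   # assumed maximum ephemerality for a feature to be considered valid

def gated_residuals(reprojected, matched, ephemerality):
    """reprojected, matched: (N, 2) pixel locations of warped and observed features.
    ephemerality: (N,) predicted ephemerality sampled at each feature location.
    Features predicted to lie on ephemeral objects contribute nothing to the cost.
    """
    s = (ephemerality < E_MAX).astype(float)           # step function s(E_j)
    residuals = np.sum((reprojected - matched) ** 2, axis=1)
    return s * residuals
```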
  • sparse features are detected using FAST corners ([26] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection " Computer Vision-ECCV 2006, pp. 430-443, 2006) and matched using BRIEF descriptors ([27] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: binary robust independent elementary features " Computer Vision-ECCV 2010, pp. 778-792, 2010) for real-time operation.
  • Figure 7 illustrates the predicted depth, selected sparse features and weighted dense intensity values used for a typical urban scene.
  • Figure 7 shows input data for ephemerality-aware visual odometry.
  • the network predicts a dense depth map 704 (top right) and an ephemerality mask.
  • the ephemerality mask is used to select which features are used for optimisation 706 (bottom left), and for dense VO approaches the photometric error term is weighted directly by the ephemerality mask 708 (bottom right).
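  • By way of non-limiting illustration, the direct weighting of the dense photometric error by the ephemerality mask may be sketched as below; the (1 - ephemerality) weighting form is an assumption consistent with the description above.

```python
# Hedged sketch of the dense variant: per-pixel photometric error is down-weighted
# in ephemeral regions rather than hard-gated.
import numpy as np

def weighted_photometric_error(image_warped, image_current, ephemerality_mask):
    error = np.abs(image_warped.astype(float) - image_current.astype(float))
    weight = 1.0 - ephemerality_mask                    # static pixels get full weight
    return np.sum(weight * error)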
  • white crosses with a black border correspond to points which are identified as static and black crosses with a white border correspond to points which are identified as ephemeral.
  • the approach of the embodiments described above was benchmarked using hundreds of kilometres of data collected from an autonomous vehicle platform in a complex urban environment.
  • the goal was to quantify the performance of the ephemerality-aware visual odometry approach in the presence of large dynamic objects in traffic.
  • the network was trained using eight 10km traversals from the Oxford RobotCar dataset ([30] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "7 year, 1000 km: The Oxford RobotCar dataset " The International Journal of Robotics Research, vol. 36, no. 1, pp. 3-15, 2017) for a total of approximately 80km of driving.
  • the RobotCar vehicle 302 is equipped with a Bumblebee XB3 stereo camera C and a LMS-151 pushbroom LIDAR scanner L.
  • a stereo camera C may be used to provide all of the required information, without using LIDAR.
  • point clouds generated by stereo reconstruction from the stereoscopic images taken on different traversals could be compared to establish the synthetic, static 3D prior point cloud. Then, each unmodified stereo-reconstructed point cloud could be compared to the relevant portion of the synthetic static point cloud in order to calculate the discrepancies and ephemerality masks.
  • the input images were down-sampled to 640 × 256 pixels and sub-sampled to one image every metre before use; a total of 60,850 images were used for training.
  • Ephemerality masks and depth maps were produced at 50Hz using a single GTX 1080 Ti GPU.
  • Figure 8 shows two columns of input images (800a, 810a), each adjacent to a corresponding column of ephemerality masks 800b, 810b.
  • Figure 8 shows ephemerality masks 800b, 810b produced in challenging urban environments.
  • the masks reliably highlight a diverse range of dynamic objects (cars, buses, trucks, cyclists, pedestrians, strollers) with highly varied distances and orientations. Even buses and trucks that almost entirely obscure the camera image are successfully masked despite the lack of other scene context.
  • Robust VO approaches that make use of the ephemerality mask may therefore provide correct motion estimates even when more than 90% of the static scene is occluded by an independently moving object.
  • the sparse VO approach provided lower overall translational drift, whereas the dense approach produced lower orientation drift.
  • Figure 9 shows velocity estimation errors in the presence of distractors.
  • the sparse ephemerality-aware approach significantly outperforms the baseline approach, producing far fewer outliers above 0.5 m/s.
  • the dense ephemerality-aware approach does not perform as well, but still outperforms the baseline.
  • the vertical axis is scaled to highlight the outliers.
  • An ephemerality mask estimates the likelihood that any pixel in an input image corresponds to either reliable static structure or dynamic objects in the environment. Further, prediction of the ephemerality mask can be learned using an automatic self-supervised approach as disclosed herein with respect to various embodiments.
  • Figure 10 illustrates a static/ephemeral segmentation performed using the ephemerality mask; the "background" (static features) may be used to guide motion estimation and/or the "foreground" (ephemeral features) may be used for obstacle detection.
  • ephemerality masks are widely applicable for autonomous vehicles.
  • the ephemerality mask can be used to inform localisation against only the static scene (bottom left) whilst guiding object detection to only the ephemeral elements (bottom right).
  • Figure 11 illustrates an overall method 1100 for predicting ephemerality masks for representations of an environment that is being experienced, as disclosed herein.
  • training data is generated.
  • the training data comprises a set of training representations, the training representations being representations of a training environment.
  • the training data also comprises corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content.
  • the ephemerality masks are generated by comparing 1102 each training representation to a corresponding portion of a 3D model of the static parts of the training environment.
  • a discrepancy between the training representation and the corresponding portion of the 3D static model is then calculated 1104.
  • the training representations comprise visual images and discrepancy is assessed on a per-pixel basis for a visual image. In other cases, larger elements than a single pixel may be used, or the training representation may not comprise pixels, for example being a point-cloud or another type of representation.
  • An ephemerality mask is then calculated 1106 for the training representation based on the discrepancy. The ephemerality mask marks the segmentation of the training representation into static and ephemeral content.
  • Figure 12 depicts a method 1200 for generating training data in accordance with various embodiments of the invention.
  • the method 1200 comprises obtaining 1202 data for multiple traversals of a given route through an environment, the data comprising 3D data.
  • One or more survey vehicles may be used to obtain the data.
  • the 3D data from different traversals is then compared 1204 so as to split the data into a set of static points, comprising elements of the 3D data which meet a threshold for matching across the multiple traversals, and a set of ephemeral points, comprising elements which do not.
  • the separation of points of the point clouds into ephemeral and static sets may be sufficient for use as training data, for example when the vehicle 102 will have a LIDAR system.
  • the training data preferably comprises labelled visual images.
  • the data comprises visual images and 3D data, and the following steps apply:
  • a prior 3D static model of the environment is generated 1206 using the static points and not the ephemeral points;
  • a stereoscopic pair of the visual images is compared 1208 to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images.
  • a discrepancy between the synthetic model and the 3D static model is then calculated 1210 for each pixel of each stereoscopic visual image.
  • An ephemerality mask for the visual image is then calculated 1212 based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
  • Figure 13 illustrates a method 1300 of various embodiments of the invention.
  • a trained neural network is obtained.
  • the neural network is trained to distinguish between static and ephemeral parts of representations of environments.
  • the neural network may be trained using training data generated as discussed with respect to Figure 12.
  • the neural network is also trained to predict depth based on visual images.
  • at step 1304, representations of an environment that is experienced by the vehicle 102 are provided to the neural network for analysis.
  • the trained neural network predicts an ephemerality mask for each representation of the experienced environment.
  • the trained neural network may also predict depth values (which may be thought of as a depth mask) for each representation of the experienced environment.
  • the skilled person will appreciate that the predicted ephemerality mask (and the predicted depth mask where applicable) can be used in many different ways. Steps 1308 to 1314 illustrate four examples of such uses: i. determining a route 1314 for a vehicle using the predicted ephemerality mask;

Abstract

A method of distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment, the method comprising automatically generating training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral parts, wherein the training representations are representations of a training environment, and wherein the ephemerality masks are generated by comparing each training representation to a corresponding portion of a 3D static model of the static parts of the training environment, computing a discrepancy between the training representation and the corresponding portion of the 3D static model; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral parts, training a neural network with the training data, providing experienced representations of the environment to the trained neural network; and predicting, using the trained neural network, which parts of the experienced representation relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.

Description

DETECTING STATIC PARTS OF A SCENE
This invention relates to processing a representation (such as an image) of a part of an environment in order to distinguish between static content and ephemeral content. In particular, the method may relate to a sensed part of an environment proximal to a vehicle. Particularly, but not exclusively, the method may relate to a method of route determination, egomotion determination, and/or obstacle detection. In particular, and again not exclusively, the processing may be used as part of autonomous vehicle navigation systems.
It is convenient to describe the background to embodiments of this invention by referring to autonomous navigation systems but there may be embodiments in other fields. Other embodiments may relate to any of the following fields: surveillance systems which may be arranged to detect non-permanent objects in a scene; smartphone applications; surveying applications which may be arranged to detect change relative to a previous survey; and the like.
However, in an autonomous navigation system, a changing environment presents a challenge as motion within a scene (i.e. change) can degrade standard outlier rejection schemes and result in erroneous motion estimates and therefore cause problems for navigation systems relying on an analysis of the environment.
One prior art approach has been to use a trained detector and tracking system. However, such systems can be problematic as they require a great deal of time to train, are challenging to implement, and require knowledge of all of the various distraction classes (i.e. types of object likely to be observed) which in a real-world environment can be numerous.
For ease of understanding, it is convenient to refer to 'image' and this language is used below. However, the skilled person will appreciate that some embodiments of the invention may use representations of an environment other than images (LIDAR scans, point clouds, etc.).
Autonomous vehicle operation in crowded urban environments presents a number of challenges to any system based on visual navigation and motion estimation. In urban traffic where up to 90% of an image can be obscured by a large moving object (e.g. a bus or truck), standard outlier rejection schemes such as RANSAC ([1] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography " Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.) will produce incorrect motion estimates due to the large consensus of features tracked on the moving object.
Background subtraction approaches such as [2] M. Piccardi, "Background subtraction techniques: a review " in Systems, man and cybernetics, 2004 IEEE international conference on, vol. 4. IEEE, 2004, pp. 3099-3104, and [3] S. Jeeva and M. Sivabalakrishnan, "Survey on background modeling and foreground detection for real time video surveillance " Procedia Computer Science, vol. 50, pp. 566-571, 2015) build statistics over background appearance based on training data from a static camera to identify discrepancies in live images. These methods are typically used in surveillance applications and have limited robustness to general 3D camera motion in complex scenes, as experienced on a vehicle ([4] E. Hayman and J.-O. Eklundh, "Statistical background subtraction for a mobile observer " in CVPR. IEEE, 2003, p. 67, [5] Y. Sheikh, O. Javed, and T. Kanade, "Background subtraction for freely moving cameras " in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 1219-1225).
Conversely, there is a significant body of work on detection and tracking of moving (foreground) objects ([6] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey " Acm computing surveys (CSUR), vol. 38, no. 4, p. 13, 2006, [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models " IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627-1645, 2010, [8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards realtime object detection with region proposal networks ," in Advances in neural information processing systems, 2015, pp. 91-99), which has been applied to robust visual odometry (VO) in dynamic environments ([9] A. Bak, S. Bouchafa, and D. Aubert, "Dynamic objects detection through visual odometry and stereo-vision: a study of inaccuracy and improvement sources " Machine vision and applications, pp. 1-17, 2014) and scale references for monocular Simultaneous Localisation and Mapping (SLAM) ([10] S. Song and M. Chandraker, "Robust scale estimation in real-time monocular SFM for autonomous driving " in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp.1566-1573).
However, these approaches require large quantities of manually-labelled training data of moving objects (e.g. cars, pedestrians), and the chosen object classes must cover all possibly-moving objects to avoid false negatives.
Recent 3D SLAM approaches have integrated per-pixel semantic segmentation layers to improve reconstruction quality ([11] J. Civera, D. G'alvez-L'opez, L. Riazuelo, J. D. Tard'os, and J. Montiel, "Towards semantic SLAM using a monocular camera " in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1277-1284, [12] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, "Probabilistic data association for semantic SLAM " in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1722-1729), but again rely on laboriously manually-annotated training data and chosen classes that encompass all object categories.
Unsupervised approaches have recently been introduced to estimate depth ([13] R. Garg, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue " in European Conference on Computer Vision. Springer, 2016, pp. 740-756), egomotion ([14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video " in CVPR, 2017), and 3D reconstruction ([15] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, "SfM-Net: learning of structure and motion from video " arXiv preprint arXiv: 1704.07804, 2017).
These methods are attractive for large-scale use as they only require raw video footage from a monocular or stereo camera, without any ground-truth motion estimates or semantic labels. In particular, [14] introduces an "explainability mask", which highlights image regions that disagree with the dominant motion estimate. However, the explainability mask only recognises non-dominant moving objects, and hence will still produce incorrect motion estimates when significantly occluded by a large, independently moving object.
The distraction-suppression methods presented in [16] C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales ," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769 and [17] R. W. Wolcott and R. Eustice, "Probabilistic obstacle partitioning of monocular video for autonomous vehicles " in BMVC, 2016, use a prior 3D map to estimate a mask that quantifies reliability for motion estimation within the mapped environment, which is integrated into a VO pipeline. However, a prior map of the whole environment in which the approach is to be used is required, as is localisation within that map - the approach does not work in unmapped environments, and fails when localisation fails.
According to a first aspect of the invention there is provided a method of distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment. The method comprises one or more of the following steps: (i) automatically generating training data comprising a set of training representations and corresponding ephemerality masks. The ephemerality masks may segment each of the training representations into static and ephemeral parts. The training representations may be representations of a training environment. The ephemerality masks may be generated by one or more of the following: comparing each training representation to a corresponding portion of a 3D static model; computing a discrepancy between the training representation and the corresponding portion of the 3D static model of the static parts of the training environment; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral parts;
(ii) training a neural network with the training data;
(iii) providing a representation of the experienced environment to the trained neural network; and
(iv) predicting, using the trained neural network, which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
The method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
ii. altering a route for a vehicle using the predicted ephemerality mask;
iii. determining egomotion using the predicted ephemerality mask; and
iv. performing obstacle detection using the predicted ephemerality mask.
No localisation is needed for the representation with respect to the experienced or training environments in order to predict which parts of the experienced representation relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
Optionally, no localisation is performed at run-time, as the method is executed. The generating training data may comprise: obtaining data for multiple traversals of a given route through the training environment, the data comprising 3D data;
comparing the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals; and generating a 3D static model of the training environment using the static points and not the ephemeral points.
The 3D data may be a 3D point cloud. The prediction may comprise generating an ephemerality mask for the representation of the experienced environment.
The training representations may comprise visual images of the training environment.
In embodiments in which the training representations comprise visual images, optionally the visual images comprise stereoscopic pairs of visual images. In such embodiments, the ephemerality masks may be generated by: comparing a stereoscopic pair of the training visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the training environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy.
The representations of the experienced environment may be visual images of the experienced environment.
The discrepancy may comprise depth discrepancy (disparity). The discrepancy may comprise a normal error.
The discrepancy may comprise both depth discrepancy (disparity) and normal error. The training data may further comprise a depth mask for each training representation.
The method may further comprise predicting, using the trained neural network, depth values for elements of the experienced representation of the environment that is experienced.
The method may further comprise at least one of: i. determining a route for a vehicle using the predicted depth values;
ii. altering a route for a vehicle using the predicted depth values;
iii. determining egomotion using the predicted depth values; and
iv. performing obstacle detection using the depth values.
The training environment may be different from the environment in which the method is used (ie an experienced environment). Advantageously, embodiments of the invention may therefore work in non-surveyed areas.
The inventors appreciated that an aspect to providing robust "distraction-free" visual navigation is a deeper understanding of which regions of an image, or other representation, are static and which are ephemeral in order to better decide which features to use for motion estimation. For the avoidance of doubt, it will be appreciated that motion estimation will be inherently more accurate if based upon regions representing static parts of the sensed environment.
The inventors extended the map prior approach of [16] (C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales " in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) to multi-session mapping and quantified ephemerality using a structural entropy metric. The result may then be used to automatically generate training data for a deep convolutional network.
As a result, embodiments described herein do not rely on live localisation against a prior map or live dense depth estimation from a stereo camera feed, and hence can operate in a wider range of (unmapped) locations, even with a reduced (monocular-only) sensor suite and limited or no localisation.
The skilled person will appreciate that benefits of the approach disclosed herein are not restricted to improving motion estimation. For example, it may also be of use in object detection.
According to a second aspect of the invention, there is provided a computer-implemented method of automatically distinguishing between static and ephemeral parts of an environment in representations of the environment. The method may comprise at least one of the following: i) obtaining data for multiple traversals of a given route through an environment, the data comprising 3D data giving information on a 3D shape of the environment; and ii) comparing the 3D data from different traversals so as to split the data into at least one of: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
The data may further comprise visual images.
The method may further comprise: generating a prior 3D static model of the environment using the static points and not the ephemeral points; comparing a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
The discrepancy may comprise at least one of depth discrepancy (disparity) and normal error.
The skilled person will appreciate that the method disclosed herein is a departure from prior art approaches, which required identification of classes of objects that can move, image recognition to identify objects of those classes within representations of environments, and then classification of those identified objects as moving objects.
By contrast, the present application is conceptually different in that what is classed as ephemeral is not determined based on class, but rather on whether or not it has moved.
As such, for example, if an automotive history museum has vehicles permanently parked outside, embodiments of the present invention would recognise the museum display as a static part of the scene (non-ephemeral). By contrast, prior art class-based approaches, trained to recognise vehicles as a class, would recognise the museum display as vehicles and therefore classify them as ephemeral objects.
According to a third aspect of the invention, there is provided a method of automatically distinguishing between static and ephemeral parts of an experienced environment in representations of that experienced environment. The method may comprise one or more of the following: taking a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; providing representations of the experienced environment to the neural network; and predicting, using the neural network, which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
The neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content. The training representations may be representations of a training environment. The ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model.
The training data may comprise a set of training visual images and the corresponding ephemerality masks may mark the segmentation of each of the training visual images into static and ephemeral content. The ephemerality masks may be generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
The predicting may comprise predicting an ephemerality mask for each representation of the experienced environment using the trained neural network, the ephemerality mask segmenting the representation of the experienced environment into static and ephemeral content.
The method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
ii. altering a route for a vehicle using the predicted ephemerality mask;
iii. determining egomotion using the predicted ephemerality mask; and
iv. performing obstacle detection using the predicted ephemerality mask.
According to a fourth aspect of the invention, there is provided a system for automatically distinguishing between static and ephemeral parts of representations of an experienced environment around a vehicle as it moves within that environment. The system may comprise one or more of the following: (i) one or more survey vehicles each equipped with one or more sensors arranged to collect a set of training representations, wherein the training representations are representations of a training environment;
(ii) a processing circuitry arranged to generate training data comprising the set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content, wherein the ephemerality masks are generated by: comparing each training representation to a corresponding portion of a 3D model of the static parts of the training environment; computing a discrepancy between the training representation and the corresponding portion of the 3D static model for each element of each training representation; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral content;

(iii) a neural network arranged to be trained, using the training data, to distinguish between ephemeral and static content in representations of environments; and
(iv) a vehicle comprising a sensor arranged to generate a representation of the environment through which it moves, and arranged to provide the representation of that environment to the trained neural network; and wherein the trained neural network is arranged to predict which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
The processing circuitry of the vehicle may comprise the trained neural network. The one or more survey vehicles may be arranged to obtain training representations for multiple traversals of a given route through the training environment. The training representations may comprise 3D data; and the processing circuitry may be arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
The trained neural network may be arranged to predict an ephemerality mask for the representation of the experienced environment.
The or each survey vehicle may be equipped with a camera and the set of training representations may comprise visual images. The processing circuitry may be arranged to: generate a prior 3D static model of the training environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
According to a fifth aspect of the invention, there is provided a system for automatically distinguishing between static and ephemeral parts of an environment in representations of the environment. The system may comprise: one or more survey vehicles each equipped with one or more sensors, the one or more survey vehicles being arranged to obtain data for multiple traversals of a given route through the environment, the data comprising 3D data; and processing circuitry arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
The data may further comprise visual images. The processing circuitry may be further arranged to: generate a prior 3D static model of the environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
The discrepancy may comprise at least one of a depth discrepancy (disparity) and normal error.
According to a sixth aspect of the invention, there is provided a system for automatically distinguishing between static and ephemeral parts of an environment in representations of a runtime environment. The system may comprise at least one of the following: i) a vehicle equipped with at least one of: a sensor arranged to obtain representations of the environment; and a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; ii) processing circuitry arranged to provide the representations of the experienced environment to the neural network; and wherein the neural network may be arranged to predict which parts of the representations of the experienced environment relate to static parts of the environment and which to ephemeral parts of the environment.
The neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content. The training representations may be representations of a training environment and the ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model for each element of each training representation.
The training data may comprise a set of training visual images and the corresponding ephemerality masks may segment each of the training visual images into static and ephemeral content, wherein the ephemerality masks are generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
The processing circuitry may be further arranged to do at least one of the following: i. determine a route for a vehicle using the predicted ephemerality mask;
ii. alter a route for a vehicle using the predicted ephemerality mask;
iii. determine egomotion using the predicted ephemerality mask; and
iv. perform obstacle detection using the predicted ephemerality mask.
The neural network may be arranged to predict an ephemerality mask for each representation of the experienced environment using the trained neural network.

According to a seventh aspect of the invention, there is provided a machine-readable medium containing instructions arranged to, when read by a processing circuitry, cause the processing circuitry to perform the method of the first or second aspects of the invention.
The machine-readable medium referred to may be any of the following: a CDROM; a DVD ROM / RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive; an SD card; a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc.
Features described in relation to any of the above aspects of the invention may be applied, mutatis mutandis, to any of the other aspects of the invention.
There now follows, by way of example only and with reference to the accompanying figures, a detailed description of embodiments of the present invention of which:
Figure 1 is a schematic view of a vehicle utilising an embodiment for route determination;
Figure 2 shows input images, corresponding ephemerality masks according to an embodiment, and the effect of their use on visual odometry;

Figure 3 shows a survey vehicle suitable for use with embodiments of the invention;
Figure 4 shows static 3D model generation in accordance with embodiments of the invention;
Figure 5 schematically illustrates an ephemerality labelling process according to various embodiments of the invention;
Figure 6 shows a network architecture according to various embodiments of the invention;
Figure 7 shows input data, a depth map, and ephemerality determinations according to an embodiment of the invention;
Figure 8 shows a selection of input images and corresponding ephemerality masks of an embodiment of the invention;
Figure 9 shows graphs illustrating the performance of an embodiment with respect to velocity estimation errors;

Figure 10 shows an input image and ephemerality mask of an embodiment;
Figure 11 shows a flow chart outlining steps of a method of predicting ephemerality masks of an embodiment;
Figure 12 shows a flow chart outlining steps of a method of preparing training data of an embodiment; and

Figure 13 shows a flow chart outlining steps of a further embodiment.
Whilst it is convenient to describe embodiments in relation to a vehicle which is arranged to process its locale (i.e. the environment through which it is moving), embodiments of the invention may find wider applicability. The ability to determine which parts of a scene are ephemeral (non-constant, and/or moving, such as vehicles, road works and pedestrians) and/or which parts relate to static elements (such as buildings, roads, and trees) may find applicability in a number of other fields. For example, embodiments may find utility in surveillance systems, perhaps to aid object detection; in smartphone applications; and in surveying applications interested in change detection (e.g. returning to a pre-surveyed environment to check whether any infrastructure has changed). In addition, it is convenient to describe embodiments in relation to a vehicle which is arranged to take monocular visual images of its surroundings and determine what is static and what is ephemeral within those images. However, embodiments of the invention can also be applied to other representations of an environment; for example labelling a LIDAR point cloud or other 3D or quasi-3D representation of an environment instead of or as well as labelling visual images.
Thus, embodiments of the invention are described in relation to a sensor 100 mounted upon a vehicle 102 and in relation to the flow chart of Figures 12 and 13. The sensor 100 is arranged to monitor its locale and generate data based upon the monitoring thereby providing data giving a representation of a sensed scene around the vehicle; ie an experienced environment. In the embodiment being described, because the sensor is mounted upon a vehicle 102 then the sensor 100 is also arranged to monitor the locale of the vehicle.
In the embodiment shown in Figure 1, the vehicle 102 is a truck. However, the skilled person will appreciate that any kind of vehicle may be used instead. In the embodiment being described, the sensor 100 is a passive sensor (i.e. it does not create radiation and merely detects radiation) and in particular is a monocular camera 100. The skilled person will appreciate that different or multiple cameras could be used in some embodiments.
In other embodiments, the sensor 100 may comprise other forms of sensor. As such, the sensor 100 may also be an active sensor arranged to send radiation out therefrom and detect reflected radiation, such as a LIDAR system.
However, a skilled person will appreciate that monocular cameras are relatively cheap and readily available and may therefore be preferable in many embodiments.
In the embodiment shown in Figure 1, the vehicle 102 is a road vehicle travelling along a road 108 and the sensor 100 is imaging the locale (eg the building 110, road 108, car 109, etc.) as the vehicle 102 travels. In this embodiment, the vehicle 102 also comprises processing circuitry 112 arranged to capture data from the sensor 100 and subsequently to process the data (in this case visual images 200) generated by the sensor 100. In particular, the processing circuitry captures data from the sensor 100, which data provides a sensed scene from around the vehicle 102 at a current time; as the vehicle 102 moves, the sensed scene changes. In the embodiment being described, the processing circuitry 112 also comprises, or has access to, a storage device 114 on the vehicle 102.
The storage device 114 comprises program storage 126 and data storage 128 in the embodiment being described. The visual images 200 are stored in the data storage 128 portion. In additional or alternative embodiments, the data storage 128 may be differently partitioned, or not partitioned. Within the sensed scene, some of the objects 110 remain static (i.e. do not move or change other than changes in lighting, etc) and an example of such a static object within Figure 1 would be the building 110. Such static parts of the scene may be thought of as being structural or unchanging parts of the scene. Other objects 109 are not static, are not fixed relative to the sensed scene, and/or are only temporarily static and may not be there should the locale be revisited in the future; such objects may be referred to as ephemeral objects. An example of such an ephemeral object in Figure 1 would be the car 109 (whether or not the car is parked at the time).
These ephemeral objects may be thought of as "distractors" in camera images, as their presence can make image processing for the purposes of route planning, visual odometry and localisation more difficult.
In the embodiment being described, the processing circuitry 112 comprises a neural network, or segmentation unit, trained to predict whether or not features within the data captured by the sensor 100 are ephemeral. In the embodiment being described, captured data are visual images, and the neural network is also trained to predict depth. Features determined to be ephemeral can be ignored as distractors when performing visual odometry and localisation, and may be taken into account as potential obstacles.
In additional or alternative embodiments, the captured data may be of a different type, and/or depth may not be predicted. For example, the vehicle 102 may have a LIDAR sensor or other sensor as well as or instead of a monocular camera 100. As depth is determined directly, no depth prediction is needed in this example. Such embodiments may be advantageous for example when a user wishes to determine which elements in a point-cloud are static and which are ephemeral in a single traversal of an environment; the trained neural network allows a prediction of ephemerality to be made for the LIDAR point-clouds collected as an environment is experienced (ie at run-time). Features determined to be ephemeral can be ignored as distractors when trying to produce a representation of the structure of the environment traversed.
A self-supervised approach to ignoring such "distractors" in camera images for the purposes of robustly estimating vehicle motion is described herein. The approach may have particular utility in cluttered urban environments. The approach described herein leverages multi-session mapping (ie maps that have been generated in a number of sessions) to automatically generate ephemerality masks for input images. A depth map may also be generated for the input images. The generation of the ephemerality masks and depth maps may be performed offline, prior to use of embodiments of the invention onboard a vehicle 102.
The ephemerality mask is a per-pixel ephemerality mask in the embodiment being described, such that each pixel of each input image 200 is assigned an ephemerality value. In alternative embodiments, each image may be divided differently, for example grouping pixels and providing a value for each group of pixels.
The depth map is a per-pixel depth map in the embodiment being described, such that each pixel of each input image is assigned a depth value. In alternative embodiments, each image may be divided differently, for example grouping pixels and providing a value for each group of pixels. The images and their associated ephemerality masks and depth maps are then used to train a deep convolutional network (a neural network) in the embodiment being described - the images, ephemerality masks and depth maps may therefore be thought of as training data. The skilled person will appreciate that alternative network types may be trained in other embodiments. The trained network can then predict ephemerality and depth for other images, even images 200 of environments outside of the environment covered by the training data. The following describes embodiments in which the trained network is so used to process representations of an environment that is experienced; ie an experienced environment.
Visual images are received and ephemerality and depth are predicted. The skilled person will appreciate that localisation is not required in order for the predictions to be made according to embodiments of the invention. Advantageously, a vehicle 102 using an embodiment of the invention therefore does not have to be within an environment surveyed for the training data to successfully use an embodiment of the invention, nor does the vehicle 102 require knowledge of its location. Embodiments of the invention may therefore offer greater flexibility of use than prior art alternatives.
The predicted ephemerality and depth can then be used as an input to a monocular visual odometry (VO) pipeline, for example using either sparse features or dense photometric matching. The skilled person will appreciate that many other uses are possible and that VO is listed herein by way of example only. Embodiments of the invention may therefore yield metric-scale VO using only a single camera (due to the metric depth estimation enabled by the depth prediction training), and experiments have shown that embodiments may be able to recover the correct egomotion even when 90% of the image is obscured by dynamic, independently moving objects.
Embodiments of the invention may therefore yield reduced odometry drift and significantly improved egomotion estimation in the presence of large moving vehicles in urban traffic.
In the embodiment being described, large-scale offline mapping and deep learning approaches are leveraged to produce a per-pixel ephemerality mask 204 without requiring any semantic classification or manual labelling, as illustrated in Figure 2.
Figure 2 shows an example of robust motion estimation in urban environments using a single camera and a learned ephemerality mask.
Figure 2 shows three visual images 200a-c of a particular part of an environment at consecutive times (top left). In the embodiment being described, the images 200a-c are captured by a camera 100 of a vehicle 102. The images 200a-c show a bus 202 driving along a road 108.
Figure 2 shows three ephemerality masks 204a-c, one corresponding to each image 200a-c (top right). The bus 202, which is an example of an ephemeral object, can be easily distinguished from the road, pavement and wall, which are examples of static objects.
When making a left turn onto a main road, a large bus 202 passes in front of the vehicle 102 (the arrow indicating the direction in which the vehicle 102 is facing), obscuring the view of the scene (top left). The learned ephemerality mask 204 correctly identifies the bus 202 as an unreliable region of the image 200a-c for the purposes of motion estimation (top right).
Traditional visual odometry (VO) approaches will incorrectly estimate a strong translational motion to the right due to the dominant motion of the bus (image 210, bottom left), whereas the approach disclosed herein correctly recovers the vehicle egomotion (image 220, bottom right).
The ephemerality mask 204a-c predicts stable image regions (e.g. buildings, road markings, static landmarks, shown as dark grey or black in Figure 2) that are likely to be useful for motion estimation, in contrast to dynamic, or ephemeral, objects (e.g. pedestrian and vehicle traffic, vegetation, temporary signage, shown as light grey or white in Figure 2).
In contrast to semantic segmentation approaches that explicitly label objects belonging to a priori chosen classes (e.g., "vehicle", "pedestrian", "cyclist") and hence require manually annotated training data, the embodiment being described uses repeated traversals of the same route to generate training data without requiring manual labelling or object class recognition. The skilled person would appreciate that the more limited or labour-intensive prior art approaches to generating training data could be used to train the neural network of some embodiments of the invention. In the embodiment being described, the training data is gathered by a survey vehicle equipped with a LIDAR sensor and a stereoscopic camera. The skilled person will appreciate that alternative or additional sensors may be used in other embodiments, and that a stereoscopic camera alone may be sufficient in some embodiments.
The survey vehicle 302 traverses a particular route through a particular environment multiple times, gathering data on the environment, the data including 3D data and visual images. The gathered data is then used to produce depth and ephemerality labels for the images, and the labelled images (training data) are then passed to a deep convolutional network in a self-supervised process (no manual assistance or annotation required).
In the embodiment being described, the depth maps for the training data are generated using stereoscopic image pairs taken by a stereoscopic camera C mounted on a survey vehicle 302. Per-pixel depth estimation is performed by warping the left image onto the right image (or vice versa) and matching intensities. A shift in the position of an object between the left and right images provides an estimate of the distance from the camera C to that object - the larger the shift, the smaller the distance. In the embodiment being described, to further refine this depth estimation for the labelling of images, the static 3D model (in the embodiment being described, this model is generated from the LIDAR data, but it may be produced differently in other embodiments) is used to provide depth estimates for the parts of the image designated as non-ephemeral. In alternative embodiments, this refinement step may not be used. In alternative or additional embodiments, LIDAR data or the like may be mapped onto images to provide per-pixel depth estimates and stereoscopic data may not be used.
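By way of illustration only, the following Python sketch shows one way the refinement just described might be carried out, assuming a stereo disparity map, a depth image rendered from the static 3D model, a per-pixel ephemerality labelling, a focal length f (in pixels) and a stereo baseline B (in metres) are available; the array names, the threshold and the exact substitution rule are illustrative assumptions rather than details of the embodiment.

```python
import numpy as np

def refine_depth(disparity, static_model_depth, ephemerality, f, B, tau=0.5):
    """Convert stereo disparity to metric depth and, for pixels judged static
    (low ephemerality), substitute the depth rendered from the prior static
    3D model.  All inputs are H x W arrays except the scalars f (focal
    length, pixels) and B (stereo baseline, metres)."""
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > 0
    depth[valid] = f * B / disparity[valid]       # standard pinhole stereo relation

    # Where the pixel is deemed static and the prior model provides a depth,
    # prefer the model depth (ephemeral objects are absent from the model).
    use_prior = (ephemerality < tau) & np.isfinite(static_model_depth)
    depth[use_prior] = static_model_depth[use_prior]
    return depth
```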
In the embodiment being described, the ephemerality mask is integrated as a component of a monocular visual odometry (VO) pipeline as an outlier rejection scheme. By leveraging the depth and ephemerality outputs of the network, robust metric scale VO can be performed using only a single camera 100 mounted to a vehicle 102.
The skilled person will appreciate that, in other embodiments, different methods of generating a static 3D model (i.e. a model of the environment with the ephemeral objects removed) may be used. The method disclosed herein (comparisons between multiple traversals) is advantageous over many prior art techniques in that manual labelling is not needed and classifications of ephemeral objects and image recognition based on those classes are not required. However, prior art techniques could be used to generate a suitable static 3D model.
I. LEARNING EPHEMERALITY MASKS
In this section, an approach for automatically building ephemerality masks 204a-c leveraging an offline 3D mapping pipeline 500, 600 in line with an embodiment of the invention is described.
Note that LIDAR and stereo camera sensors are only provided on the survey vehicle 302 to collect training data in the embodiment being described; a monocular camera 100 is sufficient (although additional or alternative sensors may be used or present in some embodiments).
The method of the embodiment being described comprises the following steps:
1) Prior 3D Mapping:
Using a survey vehicle 302 equipped with a stereo camera, C, and LIDAR scanner, L, multiple traversals of the target environment are performed. By analysing structural consistency across multiple mapping sessions with an entropy-based approach, what constitutes the static (non-ephemeral) structure of the scene is determined in the embodiment being described.
A synthetic 3D static model, which may be termed a prior 3D static structure, is generated using the static/structural components of the environment.

2) Ephemerality Labelling:
The prior 3D static structure is projected into every stereo camera image collected during the survey. A dense stereo approach (similar to [16] - C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales " in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) is used to compare the stereo camera image to the prior 3D static structure.
In the presence of traffic or dynamic objects, the two 3D representations differ, i.e. there is some discrepancy between them. In the embodiment being described, ephemerality is computed as a weighted sum of the discrepancies.
In the embodiment being described, the calculated discrepancy includes both depth disparity and normal difference between the prior/static 3D structure and the "true"/current 3D structure.

3) Network Training:
A deep convolutional network is then trained to predict the resulting pixel-wise depth and ephemerality mask for input monocular images 200; at run time, live depth and ephemerality masks are predicted for images taken by a camera 100, even in locations not traversed by the survey vehicle 302. These three steps are described in more detail below.
A. Prior 3D Mapping
A survey vehicle 302 equipped with a camera C and LIDAR L, illustrated in Figure 3, performs a number of traversals j of an environment.
The 3D mapping is described below in terms of point clouds; however, the skilled person will appreciate that other models and/or representations may be used instead of, or alongside, point clouds.
Each global camera pose $G_{C_tW}$ at time t relative to world frame W is recovered with a large-scale offline process using the stereo mapping and navigation approach in [18] C. Linegar, W. Churchill, and P. Newman, "Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera " in 2016 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 787-794.
The position of each 3D LIDAR point $\mathbf{p}_i \in \mathbb{R}^3$ in world frame W is computed using the camera pose and the LIDAR-camera calibration $G_{LC}$ as follows:

$$\mathbf{p}_i^{W} = G_{C_tW}\, G_{LC}\, \mathbf{p}_i \qquad \ldots(1)$$
Given the pointcloud of all points p collected from all traversals j, the local entropy of each region of the pointcloud is then calculated, in the embodiment being described, to quantify how reliable the region is across each traversal. A neighbourhood function $\mathcal{N}(\cdot)$ is defined, where a point belongs to a neighbourhood if it satisfies the following condition:
$$\mathbf{p}_j \in \mathcal{N}(\mathbf{p}_i) \iff \left\|\mathbf{p}_i - \mathbf{p}_j\right\| < \alpha \qquad \ldots(2)$$

where $\alpha$ is a neighbourhood size parameter, typically set to 0.5m in the experiments described herein. For each query point, a distribution $\rho$ is then built over the traversals j from which points fell in the neighbourhood of the query point, as follows:
$$\rho_j(\mathbf{p}_i) = \frac{\left|\left\{\mathbf{p} \in \mathcal{N}(\mathbf{p}_i) : \mathrm{traversal}(\mathbf{p}) = j\right\}\right|}{\left|\mathcal{N}(\mathbf{p}_i)\right|} \qquad \ldots(3)$$
Intuitively, neighbourhoods of points that are well-distributed between different traversals indicate static structure, whereas neighbourhoods of points that were only sourced from one or two traversals are likely to be ephemeral objects.
The neighbourhood entropy $H(\mathbf{p}_i)$ of each point across all n traversals is computed as follows:

$$H(\mathbf{p}_i) = -\sum_{j=1}^{n} \rho_j(\mathbf{p}_i)\,\log \rho_j(\mathbf{p}_i) \qquad \ldots(4)$$
A point $\mathbf{p}_i$ is classified as static structure P if the neighbourhood entropy $H(\mathbf{p}_i)$ exceeds a minimum threshold $\beta$; all other points are estimated to be ephemeral and are removed from the static 3D prior.
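A minimal Python sketch of this entropy-based classification is given below, assuming the pooled points from all traversals are held in an (N, 3) array with a per-point traversal index; the use of a k-d tree for the neighbourhood query and the placeholder value of the threshold β are illustrative assumptions, while α = 0.5 m follows the text above.

```python
import numpy as np
from scipy.spatial import cKDTree

def classify_static_points(points, traversal_ids, alpha=0.5, beta=1.0):
    """points: (N, 3) array of world-frame LIDAR points pooled from all
    traversals; traversal_ids: (N,) integer array giving the traversal j
    each point came from.  Returns a boolean mask of points classified as
    static structure."""
    tree = cKDTree(points)
    n_traversals = int(traversal_ids.max()) + 1
    static = np.zeros(len(points), dtype=bool)

    for i, p in enumerate(points):
        neighbours = tree.query_ball_point(p, r=alpha)            # equation (2)
        counts = np.bincount(traversal_ids[neighbours], minlength=n_traversals)
        rho = counts / counts.sum()                               # equation (3)
        nonzero = rho > 0
        entropy = -np.sum(rho[nonzero] * np.log(rho[nonzero]))    # equation (4)
        static[i] = entropy > beta    # points below the threshold are ephemeral
    return static
```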
The pointcloud construction, neighbourhood entropy and ephemeral point removal process are illustrated in Figure 4.
Figure 4 illustrates prior 3D mapping to determine the static 3D scene structure. Alignment of multiple traversals of a route (image 402, top left) yields a large number of points only present in single traversals, e.g. traffic or parked vehicles, here shown in white, around the survey vehicle 302. These points corrupt a synthetic depth map (image 404, top right).
By contrast, the embodiment being described removes 3D points that were only observed in a small number of traversals, and retains the structure that remained static for the duration of data collection (image 406, bottom left), resulting in high-quality synthetic depth maps (image 408, bottom right).
In Figure 4, black denotes areas of invalid data (e.g. the sky for which no LIDAR returns were received) and white denotes points which are discarded from the map optimisation process as a result of the ephemerality judgements. Grayscale from around 0.2 to 0.8 (i.e. with a gap at each end to allow the white and black regions to be distinguished) illustrates depth from the sensor viewpoint, with lighter coloured areas being further away.
B. Ephemerality Labelling
Figure 5 illustrates the ephemerality labelling process 500 of the embodiment being described.
Given the prior 3D static pointcloud ps and globally aligned camera poses C, a synthetic depth map can be produced for each survey image, as illustrated in Figure 5.
A synthetic normal image can also be generated for each survey image, as is also illustrated in Figure 5.
From input images 502 (left) the true disparity $d_i$ 504 (i.e. depth from the camera) and true normals $\mathbf{n}_i$ 514 (i.e. the line perpendicular to the local plane in the stereo image taken by the camera) are computed, in this embodiment using an offline dense stereo approach.
Then the prior 3D pointcloud ps is projected into the image to form the prior disparity $d_i^P$ 506 and prior normal $\mathbf{n}_i^P$ 516.
A difference between the true and prior disparities is termed a disparity error 508.
A difference between the true and prior normals is termed a normal error 518. In the embodiment being described, the disparity and normal error terms are combined to form the ephemerality mask 520 (right). In alternative embodiments, normal errors may not be used and the ephemerality mask 520 may be based on disparity (depth error) alone, or on depth error and one or more other factors. In Figure 5, black denotes invalid pixels.
The skilled person will appreciate that representing "True Normals", 514 and "Prior Normals" 516 involves three components, for x, y, and z respectively (hence RGB is generally used). Representing the normals in greyscale as shown is therefore not valid, but gives a feel for the data nonetheless.
The skilled person will appreciate that when a picture of a portion of road with no traffic is compared to a picture of the same portion of road with a vehicle on it, the distance from a camera to a first point on the road in the first picture may be similar to the distance from the camera to a tyre blocking the view of the first point on the road, and so the presence of the tyre of the vehicle may not be easy to determine from depth alone. By contrast, the road is substantially horizontal whereas the side of the tyre is substantially vertical, so the normal error is large. As such, the skilled person will appreciate that use of normal errors can assist in providing crisp and accurate outlines of ephemeral objects.
To handle visibility constraints the hidden point removal approach in [19] S. Katz, A. Tal, and R. Basri, "Direct visibility of point sets " in ACM Transactions on Graphics (TOG), vol. 26, no. 3. ACM, 2007, p. 24 is utilised in the embodiment being described. For every pixel i into which a valid prior 3D point projects, the expected disparity $d_i^P$ and normal $\mathbf{n}_i^P$ are computed using the local 3D structure of the pointcloud.
In the presence of dynamic objects, the scene observed from the camera will differ from the expected prior 3D map.
In the embodiment being described, the offline dense stereo reconstruction approach of [20] H. Hirschmuller, "Accurate and efficient stereo processing by semiglobal matching and mutual information " in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 807-814 is used to compute the true disparity $d_i$ and normal $\mathbf{n}_i$ for each pixel in the survey image, as illustrated in Figure 5.
The ephemerality mask $\mathcal{E}_i$ is defined as the weighted difference between the expected static and true disparity and normals as follows:

$$\mathcal{E}_i = \gamma\left|d_i - d_i^{P}\right| + \delta\left(1 - \mathbf{n}_i \cdot \mathbf{n}_i^{P}\right) \qquad \ldots(5)$$

where $\gamma$ and $\delta$ are weighting parameters, and $\mathcal{E}_i$ is bounded to [0, 1] after computation.
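A minimal numerical sketch of this combination is given below; it assumes the disparity term is an absolute difference and the normal error is one minus the dot product of unit normals, which is consistent with the description above but is an illustrative choice, as are the array names.

```python
import numpy as np

def ephemerality_mask(d_true, d_prior, n_true, n_prior, gamma, delta):
    """d_true, d_prior: H x W disparity maps from dense stereo and from the
    projected static model; n_true, n_prior: H x W x 3 unit normal maps.
    gamma and delta are the weighting parameters of equation (5)."""
    disparity_error = np.abs(d_true - d_prior)
    # Normal error: 1 - cosine similarity between true and prior normals.
    normal_error = 1.0 - np.sum(n_true * n_prior, axis=-1)
    mask = gamma * disparity_error + delta * normal_error
    return np.clip(mask, 0.0, 1.0)     # bounded to [0, 1] as described above
```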
C. Network Architecture
A convolutional encoder-multi-decoder network architecture is trained and used to predict both disparity and ephemerality masks from a single image in the embodiment being described, as illustrated in Figure 6.
Figure 6 shows a network architecture 600 for ephemerality and disparity learning (termed a "deep distraction network" as it is arranged to perform deep machine learning to identify distractions, i.e. ephemeral parts of scenes). In the embodiment being described, a common encoder 602 is used which splits to multiple decoders 604, 606 for the ephemerality mask 608 and disparity 610 outputs.
To preserve fine structure, skip connections are added between the higher convolutional layers, similar to the U-Net approach ([21] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation " in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241). In the embodiment being described, the deep distraction network comprises a single-encoder multi-decoder network with skip connections connecting each scale of the encoder to corresponding scales in the decoders similar to the UNet architecture (see reference [21]).
The Encoder used in the embodiment being described is based on the VGG network (see Simonyan, K. & Zisserman, A. "Very deep convolutional networks for large-scale image recognition" in Proc. International Conference on Learning Representations, 2015) and is used to extract a low resolution feature map from the input monocular image 200.
The decoders perform the opposite of the encoder; essentially reversing the VGG operations. Specifically the Decoders map the low resolution encoder feature map to full input resolution feature maps for pixel-wise classification of ephemerality and disparity. At each scale in the decoders a skip connection passes higher resolution features from the corresponding scale in the encoder.
Each decoder is identical at initialisation but is trained independently for its specific tasks.
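The following PyTorch-style sketch illustrates, in highly simplified form, the single-encoder, multi-decoder arrangement with skip connections described above; the layer types, channel counts, two-scale depth and output activations are illustrative assumptions and do not reproduce the VGG-based encoder of the embodiment.

```python
import torch
import torch.nn as nn

class DistractionNet(nn.Module):
    """Illustrative single-encoder, multi-decoder network: one encoder feeds
    separate ephemerality and disparity decoders, with a skip connection
    passing full-resolution features to each decoder."""

    def __init__(self):
        super().__init__()
        self.enc1 = self._block(3, 32)      # full-resolution features
        self.enc2 = self._block(32, 64)     # half-resolution features
        self.pool = nn.MaxPool2d(2)
        self.dec_ephemerality = self._decoder()
        self.dec_disparity = self._decoder()   # identical at initialisation

    @staticmethod
    def _block(c_in, c_out):
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

    def _decoder(self):
        return nn.ModuleDict({
            "up":   nn.ConvTranspose2d(64, 32, 2, stride=2),
            "fuse": self._block(64, 32),    # 64 = 32 upsampled + 32 skip channels
            "out":  nn.Conv2d(32, 1, 1),
        })

    def _decode(self, dec, skip, bottom):
        x = dec["up"](bottom)
        x = dec["fuse"](torch.cat([x, skip], dim=1))   # skip connection
        return torch.sigmoid(dec["out"](x))            # per-pixel value in [0, 1]

    def forward(self, image):                # image: (B, 3, H, W), H and W even
        f1 = self.enc1(image)
        f2 = self.enc2(self.pool(f1))
        ephemerality = self._decode(self.dec_ephemerality, f1, f2)
        disparity = self._decode(self.dec_disparity, f1, f2)   # scaled elsewhere
        return ephemerality, disparity
```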
To train the Ephemerality Decoder, a pixel-wise loss is used to replicate the pre-computed ground truth ephemerality masks in the training data, which were calculated using the 3D static model and corresponding stereoscopic images in the embodiment being described. To train the Disparity Decoder, the stereo photometric loss proposed in [22] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in CVPR, 2017 is used, optionally semi-supervised using the prior LIDAR disparity $d_i^P$ to ensure metric-scaled outputs. In the embodiment being described, semi-supervision using the prior LIDAR disparity (from the optimised map with ephemeral objects removed) is used to ensure metric-scaled outputs. This additional supervision is masked by the ground truth ephemerality masks as naturally ephemeral objects in the camera images do not appear in the optimised LIDAR prior (the static 3D model), i.e. LIDAR depth estimates obtained from the static 3D model are not used to estimate depth to objects identified as being ephemeral, as these objects are not present in the static 3D model.
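A sketch of how this supervision might be expressed is given below; the use of L1 losses, the 0.5 threshold on the ground-truth mask, and the function and argument names are assumptions made for illustration, while the masking of LIDAR supervision by the ground-truth ephemerality follows the description above.

```python
import torch
import torch.nn.functional as F

def supervision_losses(pred_ephemerality, pred_disparity,
                       gt_ephemerality, lidar_disparity):
    """Illustrative supervision terms.  The ephemerality decoder regresses the
    pre-computed ground-truth mask pixel-wise; the LIDAR disparity from the
    static prior supervises the disparity decoder only at pixels the ground
    truth marks as static, since ephemeral objects are absent from the
    static 3D model."""
    loss_ephemerality = F.l1_loss(pred_ephemerality, gt_ephemerality)

    static = (gt_ephemerality < 0.5) & (lidar_disparity > 0)   # valid static pixels
    loss_lidar = (F.l1_loss(pred_disparity[static], lidar_disparity[static])
                  if static.any() else torch.tensor(0.0))

    # The stereo photometric loss of [22] would be computed separately and
    # combined with these terms, e.g. via the uncertainty weighting of [23];
    # a plain sum of the supervised terms is returned here for brevity.
    return loss_ephemerality + loss_lidar
```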
In the embodiment being described, the losses between the different output stages are balanced using the multi-task learning approach in [23] A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics " arXiv preprint arXiv: 1705.07115, 2017, which continuously updates the inter-task weighting during training. In the embodiment being described, the Adam optimiser, as described in Kingma, D. and Ba, J., 2014, "Adam: A method for stochastic optimization", arXiv preprint arXiv: 1412.6980, and an initial learning rate of $1 \times 10^{-4}$ were used. The skilled person will appreciate that such implementation details may vary between embodiments.
II. EPHEMERALITY-AWARE VISUAL ODOMETRY

In various embodiments, the live depth and ephemerality mask produced by the network are leveraged to produce reliable visual odometry estimates accurate to metric scale.
Two robust VO approaches are presented herein: a sparse feature-based approach and a dense photometric approach. Each integrates the ephemerality mask in order to estimate egomotion using only static parts of the scene, and uses the learned depth to estimate relative motion to the correct scale. This improves upon traditional monocular VO systems that cannot recover absolute scale ([24] H. Strasdat, J. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM " Robotics: Science and Systems VI, vol. 2, 2010). Both of the odometry approaches described herein are optimised for realtime performance on a vehicle platform.
A. Sparse Monocular Odometry

The sparse monocular VO approach described herein is derived from well-known stereo approaches ([25] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry for ground vehicle applications " Journal of Field Robotics, vol. 23, no. 1, pp. 3-20, 2006) where sets of features are detected and matched across successive frames to build a relative pose estimate. Each feature $\mathbf{x}_i$ is parameterised as follows:
$$\mathbf{x}_i = \left(u_i,\; v_i,\; d_i\right)^{\top} \qquad \ldots(6)$$
where $(u_i, v_i)$ are the pixel coordinates and $d_i$ is the disparity predicted by the deep convolutional network. The relative pose $\xi$ is recovered by minimising the reprojection error between matched features $\mathbf{x}_i$ and $\mathbf{x}_j$:
$$\xi^{*} = \underset{\xi}{\operatorname{argmin}} \sum_{i} s(\mathcal{E}_i)\,\left\|\mathbf{x}_i - \omega\!\left(\mathbf{x}_j, \xi\right)\right\|^{2} \qquad \ldots(7)$$
The warping function $\omega(\cdot)$ projects the matched feature $\mathbf{x}_j$ into the current image according to the relative pose $\xi$ and the camera intrinsics. The set of all extracted features is typically a small subset of the total number of pixels in the image. The step function $s(\mathcal{E}_i)$ is used to disable the residual according to the predicted ephemerality as follows:
$$s(\mathcal{E}_i) = \begin{cases} 1, & \mathcal{E}_i \leq \tau \\ 0, & \text{otherwise} \end{cases} \qquad \ldots(8)$$
where τ is the maximum ephemerality threshold for a valid feature, typically set to 0.5 in the embodiments being described.
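By way of illustration, a detected feature could be gated by the predicted ephemerality in the manner of the step function above, as in the following sketch; the representation of features as (u, v, d) tuples and the array indexing are assumptions made for the example.

```python
def select_static_features(features, ephemerality, tau=0.5):
    """features: iterable of (u, v, d) tuples from the feature detector;
    ephemerality: H x W mask predicted by the network.  A feature is kept
    only when the ephemerality at its pixel is at most the threshold tau,
    i.e. when the step function of equation (8) equals 1."""
    kept = []
    for (u, v, d) in features:
        if ephemerality[int(v), int(u)] <= tau:
            kept.append((u, v, d))
    return kept
```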
In the embodiments being described, sparse features are detected using FAST corners ([26] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection " Computer Vision-ECCV 2006, pp. 430-443, 2006) and matched using BRIEF descriptors ([27] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: binary robust independent elementary features " Computer Vision-ECCV 2010, pp. 778-792, 2010) for real-time operation.

B. Dense Monocular Odometry
For the dense monocular approach, the method of [28] K. Tateno, F. Tombari, I. Laina, and N. Navab, "CNN-SLAM: realtime dense monocular SLAM with learned depth prediction " in CVPR, 2017 is adopted and the learned depth maps are combined with the photometric relative pose estimation of [29] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera " in Proceedings of the IEEE international conference on computer vision, 2013, pp. 1449-1456. Rather than a subset of feature pixels, all pixels i within the reference keyframe image $I_r$ are warped into the current image $I_c$ and the relative pose $\xi$ is recovered by minimising the photometric error as follows:
$$\xi^{*} = \underset{\xi}{\operatorname{argmin}} \sum_{i} \left(1 - \mathcal{E}_i\right)\left\|I_r\!\left(\mathbf{x}_i\right) - I_c\!\left(\omega\!\left(\mathbf{x}_i, \xi\right)\right)\right\|^{2} \qquad \ldots(9)$$
where the image function $I(\mathbf{x}_i)$ returns the pixel intensity at location $(u_i, v_i)$. Note that the ephemerality mask is used directly to weight the photometric residual; no thresholding is required. Figure 7 illustrates the predicted depth, selected sparse features and weighted dense intensity values used for a typical urban scene.
Figure 7 shows input data for ephemerality-aware visual odometry. For a given input image 702 (top left), the network predicts a dense depth map 704 (top right) and an ephemerality mask. For sparse VO approaches, the ephemerality mask is used to select which features are used for optimisation 706 (bottom left), and for dense VO approaches the photometric error term is weighted directly by the ephemerality mask 708 (bottom right).
In Figure 7, white crosses with a black border correspond to points which are identified as static and black crosses with a white border correspond to points which are identified as ephemeral.
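For the dense approach just described, the following sketch illustrates weighting the photometric residual directly by the ephemerality mask; the (1 − ε) weighting and the absolute-difference residual are illustrative assumptions consistent with the description above, as are the array names.

```python
import numpy as np

def weighted_photometric_cost(ref_intensity, warped_intensity, ephemerality):
    """ref_intensity: reference keyframe image; warped_intensity: current
    image warped into the keyframe under a candidate pose; ephemerality:
    per-pixel mask predicted by the network.  All are H x W float arrays.
    Pixels predicted as ephemeral contribute little to the cost."""
    residual = np.abs(ref_intensity - warped_intensity)
    return float(np.sum((1.0 - ephemerality) * residual))
```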
III. EXPERIMENTAL SETUP
The approach of the embodiments described above was benchmarked using hundreds of kilometres of data collected from an autonomous vehicle platform in a complex urban environment. The goal was to quantify the performance of the ephemerality-aware visual odometry approach in the presence of large dynamic objects in traffic.
A. Network Training
In the embodiments being described, the network was trained using eight 10km traversals from the Oxford RobotCar dataset ([30] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "7 year, 1000 km: The Oxford RobotCar dataset " The International Journal of Robotics Research, vol. 36, no. 1, pp. 3-15, 2017) for a total of approximately 80km of driving.
The RobotCar vehicle 302 is equipped with a Bumblebee XB3 stereo camera C and a LMS-151 pushbroom LIDAR scanner L. The skilled person will appreciate that fewer, alternative or additional sensors may be used in other embodiments. For example, a stereo camera C may be used to provide all of the required information, without using LIDAR. In such embodiments, point clouds generated by stereo reconstruction from the stereoscopic images taken on different traversals could be compared to establish the synthetic, static 3D prior point cloud. Then, each unmodified stereo-reconstructed point cloud could be compared to the relevant portion of the synthetic static point cloud.
For training, the input images were down-sampled to 640 x 256 pixels and sub-sampled to one image every metre before use; a total of 60,850 images were used for training.
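A minimal sketch of this distance-based sub-sampling is given below, assuming vehicle positions (for example from the INS) are available for every captured image; the names and the simple greedy selection are assumptions of the sketch.

    import numpy as np

    def subsample_by_distance(positions, spacing=1.0):
        # positions : (N, 2) or (N, 3) array of vehicle positions, one row per captured image
        # spacing   : minimum distance in metres between retained images
        steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
        keep, travelled = [0], 0.0
        for i, step in enumerate(steps, start=1):
            travelled += step
            if travelled >= spacing:
                keep.append(i)
                travelled = 0.0
        return np.asarray(keep)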
Ephemerality masks and depth maps were produced at 50Hz using a single GTX 1080 Ti GPU.
B. Evaluation Metrics

The approach was evaluated using 42 further traversals of the Oxford route for a total of over 400km. The evaluation datasets contain multiple detours and alternative routes, ensuring the method is tested in (unmapped) locations not present in the training datasets.
To quantify the performance of the ephemerality-aware VO systems, translational and rotational drift rates were computed using the approach proposed in the KITTI odometry benchmark ([31] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354-3361). Specifically, the average end-point-error was computed for all sub-sequences of length (100, 200, 800) metres compared to the inertial navigation (INS) system installed on the vehicle 302. In addition, the instantaneous translational velocities of each method were compared to those reported by the INS system (based on Doppler velocity measurements). 6,000 locations that include distractors were manually selected, and velocity estimation errors were evaluated in comparison to the average of all locations. This allows the analysis to focus on dynamic scenes where independently moving objects produce erroneous velocity estimates in the baseline VO methods.
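For illustration only, a simplified (translation-only) form of this sub-sequence end-point-error evaluation might look as follows; the trajectory format and helper name are assumptions, and the full KITTI protocol additionally evaluates rotational error, which is omitted here for brevity.

    import numpy as np

    def endpoint_errors(est_xyz, gt_xyz, lengths=(100.0, 200.0, 800.0)):
        # est_xyz, gt_xyz : (N, 3) estimated / ground-truth positions for the same frames
        gt_dist = np.concatenate([[0.0], np.cumsum(np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1))])
        errors = {}
        for length in lengths:
            errs = []
            for start in range(len(gt_xyz)):
                end = np.searchsorted(gt_dist, gt_dist[start] + length)   # first frame >= length further along
                if end >= len(gt_xyz):
                    break
                est_delta = est_xyz[end] - est_xyz[start]                  # relative motion of each trajectory
                gt_delta = gt_xyz[end] - gt_xyz[start]
                errs.append(np.linalg.norm(est_delta - gt_delta))
            errors[length] = float(np.mean(errs)) if errs else float("nan")
        return errors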
IV. RESULTS
In addition to the quantitative results listed below, qualitative results for ephemerality masks produced in a range of different locations are presented in Figure 8. Figure 8 shows two columns of input images (800a, 810a), each adjacent to a corresponding column of ephemerality masks 800b, 810b.
Figure 8 shows ephemerality masks 800b, 810b produced in challenging urban environments. The masks reliably highlight a diverse range of dynamic objects (cars, buses, trucks, cyclists, pedestrians, strollers) with highly varied distances and orientations. Even buses and trucks that almost entirely obscure the camera image are successfully masked despite the lack of other scene context. Robust VO approaches that make use of the ephemerality mask may therefore provide correct motion estimates even when more than 90% of the static scene is occluded by an independently moving object.
A. Odometry Drift Rates

The end-point-error evaluation for each of the methods is presented in Table I.
TABLE I: Odometry Drift Evaluation

VO Method               Translation [%]    Rotation [deg/m]
Sparse                  6.55               0.0353
Sparse w/Ephemerality   6.38               0.0321
Dense                   7.15               0.0373
Dense w/Ephemerality    6.52               0.0307
In both cases, the addition of the ephemerality mask reduced both average translational and rotational drift over the full set of evaluation datasets. Note that the metric scale for translational drift is derived from the depth map produced by the network, and hence both systems report translation in units of metres with low overall error rates using only a monocular camera.
The sparse VO approach provided lower overall translational drift, whereas the dense approach produced lower orientation drift.
B. Velocity Estimates

The velocity error evaluation for each of the methods is presented in Table II.
TABLE II: Velocity Error Evaluation

VO Method               All [m/s]    Distractors [m/s]
Sparse                  0.0548       0.220
Sparse w/Ephemerality   0.0406       0.0489
Dense                   0.0568       0.766
Dense w/Ephemerality    0.0407       0.424
Across all the evaluation datasets, the ephemerality-aware odometry approaches tested were found to produce lower average velocity errors; in locations with distractors, however, the ephemerality-aware approaches produce significantly more accurate velocity estimates than the baseline approaches.
In particular, the robust sparse VO approach is almost unaffected by distractors, whereas the baseline method reports errors four times greater. The dense VO approach generally produces poorer translational velocity estimates than the sparse approach, which corresponds with higher translational drift rates reported in the previous section. Figure 9 presents the distribution of velocity errors for each of the approaches in the presence of distractors.
Figure 9 shows velocity estimation errors in the presence of distractors. The sparse ephemerality-aware approach significantly outperforms the baseline approach, producing far fewer outliers above 0.5 m/s. The dense ephemerality-aware approach does not perform as well, but still outperforms the baseline. The vertical axis is scaled to highlight the outliers.
V. SUMMARY
The concept of an ephemerality mask was introduced above. An ephemerality mask estimates the likelihood that any pixel in an input image corresponds to either reliable static structure or dynamic objects in the environment. Further, prediction of the ephemerality mask can be learned using an automatic self-supervised approach as disclosed herein with respect to various embodiments.
No manual labelling or choice of semantic classes is needed in order to train a processor to implement the approach disclosed herein and a single monocular camera is sufficient to produce reliable ephemerality-aware visual odometry to metric scale.
Over hundreds of kilometres, the approach disclosed herein has been shown to produce odometry with lower drift rates and more robust velocity estimates in the presence of large dynamic objects in urban scenes. Figure 10 illustrates a static/ephemeral segmentation performed using the ephemerality mask; concurrently, the "background" (static features) may be used to guide motion estimation and/or the "foreground" (ephemeral features) may be used for obstacle detection.
The skilled person will appreciate that ephemerality masks are widely applicable for autonomous vehicles. In the scene shown in Figure 10, the ephemerality mask can be used to inform localisation against only the static scene (bottom left) whilst guiding object detection to only the ephemeral elements (bottom right).
Figure 11 illustrates an overall method 1100 for predicting ephemerality masks for representations of an environment that is being experienced, as disclosed herein. At step 1101, training data is generated. In the embodiment being described, the training data comprises a set of training representations, the training representations being representations of a training environment.
The training data also comprises corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content. In the embodiment 1100 being described, the ephemerality masks are generated by comparing 1102 each training representation to a corresponding portion of a 3D model of the static parts of the training environment.
A discrepancy between the training representation and the corresponding portion of the 3D static model is then calculated 1104. In the embodiment being described, the training representations comprise visual images and discrepancy is assessed on a per-pixel basis for a visual image. In other cases, larger elements than a single pixel may be used, or the training representation may not comprise pixels, for example being a point-cloud or another type of representation. An ephemerality mask is then calculated 1106 for the training representation based on the discrepancy. The ephemerality mask marks the segmentation of the training representation into static and ephemeral content.
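Purely as an illustrative sketch of step 1106 (the scaling constants, the saturation to the range [0, 1] and the optional surface-normal term are assumptions and are not limiting):

    import numpy as np

    def ephemerality_mask_from_discrepancy(depth_discrepancy, normal_error=None,
                                           depth_scale=0.5, normal_scale=1.0):
        # depth_discrepancy : (H, W) absolute depth difference between the training
        #                     representation and the 3D static model, in metres
        # normal_error      : optional (H, W) angular error between surface normals, in radians
        mask = np.clip(depth_discrepancy / depth_scale, 0.0, 1.0)
        if normal_error is not None:
            mask = np.maximum(mask, np.clip(normal_error / normal_scale, 0.0, 1.0))
        return mask   # 0 = static, 1 = ephemeral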
Figure 12 depicts a method 1200 for generating training data in accordance with various embodiments of the invention. The method 1200 comprises obtaining 1202 data for multiple traversals of a given route through an environment, the data comprising 3D data. One or more survey vehicles may be used to obtain the data.
The 3D data from different traversals is then compared 1204 so as to split the data into:
(i) a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and
(ii) a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
In some embodiments, for example where the data are 3D point-clouds generated by LIDAR systems or the like, the separation of the points of the point clouds into ephemeral and static sets may be sufficient for use as training data, for example when the vehicle 102 will have a LIDAR system.
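A minimal sketch of the comparison of step 1204 is given below; it assumes the traversal point clouds have already been placed in a common frame and uses a nearest-neighbour distance test, with the matching radius, the minimum number of supporting traversals and the SciPy KD-tree being assumptions of the sketch.

    import numpy as np
    from scipy.spatial import cKDTree

    def split_static_ephemeral(clouds, radius=0.2, min_support=3):
        # clouds      : list of (N_i, 3) point clouds, one per traversal, in a common frame
        # radius      : maximum distance for a point to count as matched in another traversal
        # min_support : number of other traversals a point must match to be labelled static
        trees = [cKDTree(c) for c in clouds]
        static, ephemeral = [], []
        for i, cloud in enumerate(clouds):
            support = np.zeros(len(cloud), dtype=int)
            for j, tree in enumerate(trees):
                if i == j:
                    continue
                d, _ = tree.query(cloud, k=1)
                support += (d < radius).astype(int)
            keep = support >= min_support
            static.append(cloud[keep])
            ephemeral.append(cloud[~keep])
        return static, ephemeral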
In other embodiments, a segmented point-cloud alone is not sufficient. For example, in embodiments in which the vehicle 102 will have only a monocular camera 100, the training data preferably comprises labelled visual images. In such embodiments, the data comprises visual images and 3D data, and the following steps apply:
A prior 3D static model of the environment is generated 1206 using the static points and not the ephemeral points;
A stereoscopic pair of the visual images is compared 1208 to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images.
A discrepancy between the synthetic model and the 3D static model is then calculated 1210 for each pixel of each stereoscopic visual image.
An ephemerality mask for the visual image is then calculated 1212 based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
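By way of non-limiting illustration, steps 1208 and 1210 may be sketched as projecting the static model into the camera and differencing depths per pixel; the crude z-buffering, the handling of pixels without model coverage and all names are assumptions of the sketch.

    import numpy as np

    def per_pixel_depth_discrepancy(stereo_depth, static_points_cam, K):
        # stereo_depth      : (H, W) depth map from stereo reconstruction of the training pair
        # static_points_cam : (M, 3) static-model points already transformed into the camera frame
        # K                 : (3, 3) camera intrinsics
        H, W = stereo_depth.shape
        pts = static_points_cam[static_points_cam[:, 2] > 0]        # keep points in front of the camera
        proj = K @ pts.T
        z = proj[2]
        u = np.round(proj[0] / z).astype(int)
        v = np.round(proj[1] / z).astype(int)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        model_depth = np.full((H, W), np.nan)
        order = np.argsort(-z[ok])                                  # write far points first so near points win
        model_depth[v[ok][order], u[ok][order]] = z[ok][order]
        discrepancy = np.abs(stereo_depth - model_depth)
        return np.nan_to_num(discrepancy, nan=0.0)                  # pixels with no model coverage get zero discrepancy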
The skilled person will appreciate that, when data from different sensors is to be matched (e.g. images from the camera C and the LIDAR system L), precise synchronisation helps in correctly matching the data streams. The TICSync approach described in Harrison, A. and Newman, P., "TICSync: Knowing when things happened," in Robotics and Automation (ICRA), 2011 IEEE International Conference on (pp. 356-363), IEEE was used for the embodiments described herein.
Similarly, when matching data to the same place in different traversals, so as to identify ephemeral objects in embodiments using that approach, precise localisation is again helpful. The approaches described with respect to various references cited above were used in the embodiments being described.
Figure 13 illustrates a method 1300 of various embodiments of the invention.
At step 1302, a trained neural network is obtained. The neural network is trained to distinguish between static and ephemeral parts of representations of environments. The neural network may be trained using training data generated as discussed with respect to Figure 12.
In some embodiments, the neural network is also trained to predict depth based on visual images.
At step 1304, representations of an environment that are experienced by the vehicle 102 are provided to the neural network for analysis.
At step 1306, the trained neural network predicts an ephemerality mask for each representation of the experienced environment.
The trained neural network may also predict depth values (which may be thought of as a depth mask) for each representation of the experienced environment. The skilled person will appreciate that the predicted ephemerality mask (and the predicted depth mask where applicable) can be used in many different ways. Steps 1308 to 1314 illustrate four examples of such uses:
i. determining a route 1314 for a vehicle using the predicted ephemerality mask;
ii. altering a route 1312 for a vehicle using the predicted ephemerality mask;
iii. determining egomotion 1310 using the predicted ephemerality mask; and
iv. performing 1308 obstacle detection using the predicted ephemerality mask.
The skilled person will appreciate that other additional or alternative uses are possible and that these are listed by way of example only.
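By way of example only, the obstacle-detection use of step 1308 could treat the predicted ephemerality mask as a foreground segmentation; the threshold, the minimum region size and the use of SciPy connected-component labelling are assumptions of this sketch rather than features of the embodiments.

    import numpy as np
    from scipy import ndimage

    def ephemeral_obstacle_regions(ephemerality_mask, threshold=0.5, min_pixels=200):
        # Returns bounding boxes (x0, y0, x1, y1) of connected regions of ephemeral pixels.
        foreground = ephemerality_mask > threshold
        labels, n_regions = ndimage.label(foreground)
        boxes = []
        for region in range(1, n_regions + 1):
            ys, xs = np.where(labels == region)
            if len(xs) < min_pixels:
                continue
            boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
        return boxes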

Claims

1. A method of distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment, the method comprising: automatically generating training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral parts, wherein the training representations are representations of a training environment, and wherein the ephemerality masks are generated by: comparing each training representation to a corresponding portion of a 3D static model of the static parts of the training environment; computing a discrepancy between the training representation and the corresponding portion of the 3D static model; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral parts; training a neural network with the training data; providing experienced representations of the environment to the trained neural network; and predicting, using the trained neural network, which parts of the experienced representation relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
2. The method of any preceding claim further comprising at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
ii. altering a route for a vehicle using the predicted ephemerality mask;
iii. determining egomotion using the predicted ephemerality mask; and
iv. performing obstacle detection using the predicted ephemerality mask.
3. The method of claim 2 wherein no localisation, of a vehicle using the method, is performed and wherein the experienced environment is the environment experienced by the vehicle.
4. The method of any preceding claim wherein the generating training data comprises: obtaining data for multiple traversals of a given route through the training environment, the data comprising 3D data; comparing the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals; and generating a 3D static model of the training environment using the static points and not the ephemeral points.
5. The method of any preceding claim wherein no localisation is needed for the representation of the experienced environment with respect to the environment being experienced or training environments in order to predict which parts of the experienced representation relate to static parts of the experienced environment and which relate to ephemeral parts of the experienced environment.
6. The method of any preceding claim wherein the prediction comprises generating an ephemerality mask for the representation of the experienced environment.
7. The method of any preceding claim wherein the training representations comprise visual images of the training environment, and optionally wherein the visual images comprise stereoscopic pairs of visual images.
8. The method of claim 7 wherein the ephemerality masks are generated by: comparing a stereoscopic pair of the training visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the training environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy.
9. The method of any preceding claim wherein the experienced representations are visual images of the experienced environment.
10. The method of any preceding claim wherein discrepancy comprises at least one of a depth discrepancy (disparity) and a normal error.
11. The method of any preceding claim wherein the training data further comprises a depth mask for each training representation.
12. The method of claim 11 further comprising predicting, using the trained neural network, depth values for elements of the experienced representation of the experienced environment.
13. The method of claim 12 further comprising at least one of: i. determining a route for a vehicle using the predicted depth values;
ii. altering a route for a vehicle using the predicted depth values;
iii. determining egomotion using the predicted depth values; and
iv. performing obstacle detection using the depth values.
14. The method of claim 3, or any claim dependent thereon, wherein the 3D data is a 3D point cloud.
15. The method of any preceding claim wherein the training environment is different from the experienced environment.
16. A computer-implemented method of automatically distinguishing between static and ephemeral parts of an environment in representations of the environment, the method comprising: obtaining data for multiple traversals of a given route through an environment, the data comprising 3D data giving information on a 3D shape of the environment; comparing the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and
a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
17. The method of claim 16 wherein the data further comprises visual images and the method further comprises: generating a prior 3D static model of the environment using the static points and not the ephemeral points; comparing a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
18. The method of claim 17 wherein the discrepancy comprises at least one of depth discrepancy (disparity) and normal error.
19. A method of automatically distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment, the method comprising: taking a neural network trained to distinguish between static and ephemeral parts of a training environment; providing representations of the experienced environment to the neural network; and predicting, using the neural network, which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
20. The method of claim 19 wherein the neural network is trained with training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content, wherein the training representations are representations of a training environment and ephemerality masks are generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model.
21. The method of claim 20 wherein the training data comprises a set of training visual images and the corresponding ephemerality masks mark the segmentation of each of the training visual images into static and ephemeral content, wherein the ephemerality masks are generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
22. The method of any of claims 19 to 21 wherein the predicting comprises predicting an ephemerality mask for each representation of the experienced environment using the trained neural network, the ephemerality mask segmenting the representation of the experienced environment into static and ephemeral content.
23. The method of any of claims 19 to 22 further comprising at least one of:
i. determining a route for a vehicle using the predicted ephemerality mask;
ii. altering a route for a vehicle using the predicted ephemerality mask;
iii. determining egomotion using the predicted ephemerality mask; and
iv. performing obstacle detection using the predicted ephemerality mask.
24. A system for automatically distinguishing between static and ephemeral parts of an experienced environment around a vehicle in representations of the environment that the vehicle is experiencing, the system comprising: one or more survey vehicles each equipped with one or more sensors arranged to collect a set of training representations, wherein the training representations are representations of a training environment; processing circuitry arranged to generate training data comprising the set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content, wherein the ephemerality masks are generated by: comparing each training representation to a corresponding portion of a 3D model of the static parts of the training environment; computing a discrepancy between the training representation and the corresponding portion of the 3D static model for each element of each training representation; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral content; a neural network arranged to be trained, using the training data, to distinguish between ephemeral and static content in representations of experienced environments; a vehicle comprising a sensor arranged to generate a representation of the experienced environment, and arranged to provide the representation of the experienced environment to the trained neural network; and wherein the trained neural network is arranged to predict which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
25. The system of claim 24 wherein processing circuitry of the vehicle comprises the trained neural network.
26. The system of claim 24 or claim 25 wherein the one or more survey vehicles are arranged to obtain training representations for multiple traversals of a given route through the training environment, the training representations comprising 3D data; and the processing circuitry is arranged to compare the 3D data from different traversals so as to split the data into:
a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and
a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
27. The system of any of claims 24 to 26 wherein the trained neural network is arranged to predict an ephemerality mask for the representation of the experienced environment.
28. The system of any of claims 24 to 27 wherein the or each survey vehicle is equipped with a camera and the set of training representations comprises visual images.
29. The system of claim 27 or 28 as dependent on claim 26 wherein the processing circuitry is arranged to: generate a prior 3D static model of the training environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
30. A system for automatically distinguishing between static and ephemeral parts of an environment in representations of the environment, the system comprising: one or more survey vehicles each equipped with one or more sensors, the one or more survey vehicles being arranged to obtain data for multiple traversals of a given route through the environment, the data comprising 3D data; and processing circuitry arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
31. The system of claim 30 wherein the data further comprises visual images and the processing circuitry is further arranged to: generate a prior 3D static model of the environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
32. The system of claim 30 or 31 wherein the discrepancy comprises at least one of a depth discrepancy (disparity) and normal error.
33. A system for automatically distinguishing between static and ephemeral parts of an environment in representations of an experienced environment, the system comprising: a vehicle equipped with: a sensor arranged to obtain representations of the environment experienced by the vehicle; and a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; processing circuitry arranged to provide the representations of the experienced environment to the neural network; and wherein the neural network is arranged to predict which parts of the representations of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
34. The system of claim 33 wherein the neural network is trained with training data comprising a set of training representations and corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content, wherein the training representations are representations of a training environment and ephemerality masks are generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model for each element of each training representation.
35. The system of claim 34 wherein the training data comprises a set of training visual images and the corresponding ephemerality masks segment each of the training visual images into static and ephemeral content, wherein the ephemerality masks are generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
36. The system of any of claims 33 to 35 wherein the processing circuitry is further arranged to do at least one of the following: i. determine a route for a vehicle using the predicted ephemerality mask;
ii. alter a route for a vehicle using the predicted ephemerality mask;
iii. determine egomotion using the predicted ephemerality mask; and
iv. perform obstacle detection using the predicted ephemerality mask.
37. The system of any of claims 33 to 36 wherein the neural network is arranged to predict an ephemerality mask for each representation of the experienced environment using the trained neural network.
38. A machine-readable medium containing instructions arranged to, when read by a processing circuitry, cause the processing circuitry to perform:
(i) the method of any of claims 1 to 15; or
(ii) the method of any of claims 16-18.
PCT/GB2018/053259 2017-11-13 2018-11-12 Detecting static parts of a scene WO2019092439A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP18804091.9A EP3710985A1 (en) 2017-11-13 2018-11-12 Detecting static parts of a scene

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1718692.5 2017-11-13
GBGB1718692.5A GB201718692D0 (en) 2017-11-13 2017-11-13 Detecting static parts of a scene

Publications (1)

Publication Number Publication Date
WO2019092439A1 true WO2019092439A1 (en) 2019-05-16

Family

ID=60788343

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/053259 WO2019092439A1 (en) 2017-11-13 2018-11-12 Detecting static parts of a scene

Country Status (3)

Country Link
EP (1) EP3710985A1 (en)
GB (1) GB201718692D0 (en)
WO (1) WO2019092439A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021097087A1 (en) * 2019-11-15 2021-05-20 Waymo Llc Generating depth from camera images and known depth data using neural networks
US11636334B2 (en) 2019-08-20 2023-04-25 Micron Technology, Inc. Machine learning with feature obfuscation
US11640692B1 (en) 2020-02-04 2023-05-02 Apple Inc. Excluding objects during 3D model generation
US11670124B2 (en) 2019-01-31 2023-06-06 Micron Technology, Inc. Data recorders of autonomous vehicles
US11675083B2 (en) * 2019-01-03 2023-06-13 Nvidia Corporation Removal of ephemeral points from point cloud of a high-definition map for navigating autonomous vehicles
US11705004B2 (en) 2018-04-19 2023-07-18 Micron Technology, Inc. Systems and methods for automatically warning nearby vehicles of potential hazards
US11755884B2 (en) 2019-08-20 2023-09-12 Micron Technology, Inc. Distributed machine learning with privacy protection

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992367B (en) * 2019-10-31 2024-02-02 北京交通大学 Method for semantically segmenting image with occlusion region

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014114923A1 (en) * 2013-01-24 2014-07-31 Isis Innovation Limited A method of detecting structural parts of a scene
CN106997466A (en) * 2017-04-12 2017-08-01 百度在线网络技术(北京)有限公司 Method and apparatus for detecting road

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014114923A1 (en) * 2013-01-24 2014-07-31 Isis Innovation Limited A method of detecting structural parts of a scene
CN106997466A (en) * 2017-04-12 2017-08-01 百度在线网络技术(北京)有限公司 Method and apparatus for detecting road

Non-Patent Citations (35)

* Cited by examiner, † Cited by third party
Title
A. BAK; S. BOUCHAFA; D. AUBERT: "Dynamic objects detection through visual odometry and stereo-vision: a study of inaccuracy and improvement sources", MACHINE VISION AND APPLICATIONS, 2014, pages 1 - 17
A. GEIGER; P. LENZ; R. URTASUN: "Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on", 2012, IEEE, article "Are we ready for autonomous driving? The KITTI vision benchmark suite", pages: 3354 - 3361
A. KENDALL; Y. GAL; R. CIPOLLA: "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics", ARXIV PREPRINT ARXIV: 1705.07115, 2017
A. YILMAZ; O. JAVED; M. SHAH: "Object tracking: A survey", ACM COMPUTING SURVEYS (CSUR, vol. 38, no. 4, 2006, pages 13, XP007902942
ALBERTO GARCIA-GARCIA ET AL: "A Review on Deep Learning Techniques Applied to Semantic Segmentation", 22 April 2017 (2017-04-22), XP055547760, Retrieved from the Internet <URL:https://arxiv.org/pdf/1704.06857.pdf> [retrieved on 20190128] *
C. GODARD; O. MAC AODHA; G. J. BROSTOW: "Unsupervised monocular depth estimation with left-right consistency", CVPR, 2017
C. LINEGAR; W. CHURCHILL; P. NEWMAN: "2016 IEEE International Conference on Robotics and Automation (ICRA", 2016, IEEE, article "Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera", pages: 787 - 794
C. MCMANUS; W. CHURCHILL; A. NAPIER; B. DAVIS; P. NEWMAN: "Robotics and Automation (ICRA), 2013 IEEE International Conference on", 2013, IEEE, article "Distraction suppression for vision-based pose estimation at city scales", pages: 3762 - 3769
D. NISTER; O. NARODITSKY; J. BERGEN: "Visual odometry for ground vehicle applications", JOURNAL OF FIELD ROBOTICS, vol. 23, no. 1, 2006, pages 3 - 20, XP009133367
E. HAYMAN; J.-O. EKLUNDH: "CVPR", 2003, IEEE, article "Statistical background subtraction for a mobile observer", pages: 67
E. ROSTEN; T. DRUMMOND: "Machine learning for high-speed corner detection", COMPUTER VISION-ECCV 2006, 2006, pages 430 - 443, XP047429679, DOI: doi:10.1007/11744023_34
H. HIRSCHMULLER: "Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on", vol. 2, 2005, IEEE, article "Accurate and efficient stereo processing by semiglobal matching and mutual information", pages: 807 - 814
H. STRASDAT; J. MONTIEL; A. J. DAVISON: "Scale drift-aware large scale monocular SLAM", ROBOTICS: SCIENCE AND SYSTEMS VI, vol. 2, 2010
HARRISON, A.; NEWMAN, P.: "Robotics and Automation (ICRA), 2011 IEEE International Conference on", IEEE, article "TICSync: Knowing when things happened", pages: 356 - 363
J. CIVERA; D. G'ALVEZ-L'OPEZ; L. RIAZUELO; J. D. TARD'OS; J. MONTIEL: "Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on", 2011, IEEE, article "Towards semantic SLAM using a monocular camera", pages: 1277 - 1284
J. ENGEL; J. STURM; D. CREMERS: "Semi-dense visual odometry for a monocular camera", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2013, pages 1449 - 1456, XP032572973, DOI: doi:10.1109/ICCV.2013.183
K. TATENO; F. TOMBARI; I. LAINA; N. NAVAB: "CNN-SLAM: realtime dense monocular SLAM with learned depth prediction", CVPR, 2017
KINGMA, D.; BA, J.: "Adam: A method for stochastic optimization", ARXIV PREPRINT ARXIV: 1412.6980, 2014
M. A. FISCHLER; R. C. BOLLES: "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,", COMMUNICATIONS OF THE ACM, vol. 24, no. 6, 1981, pages 381 - 395, XP001149167, DOI: doi:10.1145/358669.358692
M. CALONDER; V. LEPETIT; C. STRECHA; P. FUA: "BRIEF: binary robust independent elementary features", COMPUTER VISION-ECCV 2010, 2010, pages 778 - 792, XP019150781
M. PICCARDI: "Systems, man and cybernetics, 2004 IEEE international conference on", vol. 4, 2004, IEEE, article "Background subtraction techniques: a review", pages: 3099 - 3104
O. RONNEBERGER; P. FISCHER; T. BROX: "International Conference on Medical Image Computing and Computer-Assisted Intervention", 2015, SPRINGER, article "U-net: Convolutional networks for biomedical image segmentation", pages: 234 - 241
P. F. FELZENSZWALB; R. B. GIRSHICK; D. MCALLESTER; D. RAMANAN: "Object detection with discriminatively trained part-based models", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 32, no. 9, 2010, pages 1627 - 1645, XP011327422, DOI: doi:10.1109/TPAMI.2009.167
R. GARG; G. CARNEIRO; I. REID: "European Conference on Computer Vision", 2016, SPRINGER, article "Unsupervised CNN for single view depth estimation: Geometry to the rescue", pages: 740 - 756
R. W. WOLCOTT; R. EUSTICE: "Probabilistic obstacle partitioning of monocular video for autonomous vehicles", BMVC, 2016
S. JEEVA; M. SIVABALAKRISHNAN: "Survey on background modeling and foreground detection for real time video surveillance", PROCEDIA COMPUTER SCIENCE, vol. 50, 2015, pages 566 - 571, XP029589856, DOI: doi:10.1016/j.procs.2015.04.085
S. KATZ; A. TAL; R. BASRI: "ACM Transactions on Graphics (TOG", vol. 26, 2007, ACM, article "Direct visibility of point sets", pages: 24
S. L. BOWMAN; N. ATANASOV; K. DANIILIDIS; G. J. PAPPAS: "Robotics and Automation (ICRA), 2017 IEEE International Conference on", 2017, IEEE, article "Probabilistic data association for semantic SLAM", pages: 1722 - 1729
S. REN; K. HE; R. GIRSHICK; J. SUN: "Faster R-CNN: towards realtime object detection with region proposal networks", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2015, pages 91 - 99
S. SONG; M. CHANDRAKER: "Robust scale estimation in real-time monocular SFM for autonomous driving", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2014, pages 1566 - 1573, XP032649454, DOI: doi:10.1109/CVPR.2014.203
S. VIJAYANARASIMHAN; S. RICCO; C. SCHMID; R. SUKTHANKAR; K. FRAGKIADAKI: "SfM-Net: learning of structure and motion from video", ARXIV PREPRINT ARXIV: 1704.07804, 2017
SIMONYAN, K.; ZISSERMAN, A.: "Very deep convolutional networks for large-scale image recognition", IN PROC. INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS
T. ZHOU; M. BROWN; N. SNAVELY; D. G. LOWE: "Unsupervised learning of depth and ego-motion from video", CVPR, 2017
W. MADDERN; G. PASCOE; C. LINEGAR; P. NEWMAN: "1 year, 1000 km: The Oxford RobotCar dataset", THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH, vol. 36, no. 1, 2017, pages 3 - 15
Y. SHEIKH; O. JAVED; T. KANADE: "Computer Vision, 2009 IEEE 12th International Conference on", 2009, IEEE, article "Background subtraction for freely moving cameras", pages: 1219 - 1225

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11705004B2 (en) 2018-04-19 2023-07-18 Micron Technology, Inc. Systems and methods for automatically warning nearby vehicles of potential hazards
US11675083B2 (en) * 2019-01-03 2023-06-13 Nvidia Corporation Removal of ephemeral points from point cloud of a high-definition map for navigating autonomous vehicles
US11670124B2 (en) 2019-01-31 2023-06-06 Micron Technology, Inc. Data recorders of autonomous vehicles
US11636334B2 (en) 2019-08-20 2023-04-25 Micron Technology, Inc. Machine learning with feature obfuscation
US11755884B2 (en) 2019-08-20 2023-09-12 Micron Technology, Inc. Distributed machine learning with privacy protection
WO2021097087A1 (en) * 2019-11-15 2021-05-20 Waymo Llc Generating depth from camera images and known depth data using neural networks
US11755917B2 (en) 2019-11-15 2023-09-12 Waymo Llc Generating depth from camera images and known depth data using neural networks
US11640692B1 (en) 2020-02-04 2023-05-02 Apple Inc. Excluding objects during 3D model generation

Also Published As

Publication number Publication date
EP3710985A1 (en) 2020-09-23
GB201718692D0 (en) 2017-12-27

Similar Documents

Publication Publication Date Title
Barnes et al. Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments
Chen et al. Suma++: Efficient lidar-based semantic slam
Barnes et al. Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy
Chen et al. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data
US10192113B1 (en) Quadocular sensor design in autonomous platforms
WO2019092439A1 (en) Detecting static parts of a scene
US10991156B2 (en) Multi-modal data fusion for enhanced 3D perception for platforms
US10496104B1 (en) Positional awareness with quadocular sensor in autonomous platforms
Leibe et al. Coupled object detection and tracking from static cameras and moving vehicles
US20200026283A1 (en) Autonomous route determination
EP4191532A1 (en) Image annotation
Barth et al. Estimating the driving state of oncoming vehicles from a moving platform using stereo vision
Zhou et al. Moving object detection and segmentation in urban environments from a moving platform
EP3844672A1 (en) Structure annotation
Jebamikyous et al. Autonomous vehicles perception (avp) using deep learning: Modeling, assessment, and challenges
US20200082182A1 (en) Training data generating method for image processing, image processing method, and devices thereof
Murali et al. Utilizing semantic visual landmarks for precise vehicle navigation
Jeong et al. Multimodal sensor-based semantic 3D mapping for a large-scale environment
WO2016170330A1 (en) Processing a series of images to identify at least a portion of an object
McManus et al. Distraction suppression for vision-based pose estimation at city scales
Gaspar et al. Urban@ CRAS dataset: Benchmarking of visual odometry and SLAM techniques
Suleymanov et al. Online inference and detection of curbs in partially occluded scenes with sparse lidar
Laflamme et al. Driving datasets literature review
Nguyen et al. Confidence-aware pedestrian tracking using a stereo camera
Lim et al. A review of visual odometry methods and its applications for autonomous driving

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18804091

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018804091

Country of ref document: EP

Effective date: 20200615