EP3710985A1 - Detecting static parts of a scene - Google Patents
Info
- Publication number
- EP3710985A1 (application EP18804091.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- training
- environment
- static
- ephemerality
- representations
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/86—Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S17/00—Systems using the reflection or reradiation of electromagnetic waves other than radio waves, e.g. lidar systems
- G01S17/88—Lidar systems specially adapted for specific applications
- G01S17/89—Lidar systems specially adapted for specific applications for mapping or imaging
- G01S17/894—3D imaging with simultaneous measurement of time-of-flight at a 2D array of receiver pixels, e.g. time-of-flight cameras or flash lidar
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
Definitions
- This invention relates to processing a representation (such as an image) of a part of an environment in order to distinguish between static content and ephemeral content.
- the method may relate to a sensed part of an environment proximal to a vehicle.
- the method may relate to a method of route determination, egomotion determination, and/or obstacle detection.
- the processing may be used as part of autonomous vehicle navigation systems.
- surveillance systems which may be arranged to detect non-permanent objects in a scene
- smartphone applications; and surveying applications which may be arranged to detect change relative to a previous survey, and the like.
- Recent 3D SLAM approaches have integrated per-pixel semantic segmentation layers to improve reconstruction quality ([11] J. Civera, D. Gálvez-López, L. Riazuelo, J. D. Tardós, and J. Montiel, "Towards semantic SLAM using a monocular camera," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 1277-1284; [12] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, "Probabilistic data association for semantic SLAM," in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1722-1729), but again rely on laboriously manually-annotated training data and chosen classes that encompass all object categories.
- a method of distinguishing between static and ephemeral parts of an experienced environment in representations of the experienced environment comprises one or more of the following steps: (i) automatically generating training data comprising a set of training representations and corresponding ephemerality masks.
- the ephemerality masks may segment each of the training representations into static and ephemeral parts.
- the training representations may be representations of a training environment.
- the ephemerality masks may be generated by one or more of the following: comparing each training representation to a corresponding portion of a 3D static model; computing a discrepancy between the training representation and the corresponding portion of the 3D static model of the static parts of the training environment; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral parts;
- the method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
- the generating training data may comprise: obtaining data for multiple traversals of a given route through the training environment, the data comprising 3D data;
- the 3D data may be a 3D point cloud.
- the prediction may comprise generating an ephemerality mask for the representation of the experienced environment.
- the training representations may comprise visual images of the training environment.
- the visual images comprise stereoscopic pairs of visual images.
- the ephemerality masks may be generated by: comparing a stereoscopic pair of the training visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the training environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy.
- the representations of the experienced environment may be visual images of the experienced environment.
- the discrepancy may comprise depth discrepancy (disparity).
- the discrepancy may comprise a normal error.
- the discrepancy may comprise both depth discrepancy (disparity) and normal error.
- the training data may further comprise a depth mask for each training representation.
- the method may further comprise predicting, using the trained neural network, depth values for elements of the representation of the environment that is experienced.
- the method may further comprise at least one of: i. determining a route for a vehicle using the predicted depth values;
- the training environment may be different from the environment in which the method is used (ie an experienced environment).
- embodiments of the invention may therefore work in non-surveyed areas.
- the inventors extended the map prior approach of [16] (C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) to multi-session mapping and quantified ephemerality using a structural entropy metric. The result may then be used to automatically generate training data for a deep convolutional network.
- embodiments described herein do not rely on live localisation against a prior map or live dense depth estimation from a stereo camera feed, and hence can operate in a wider range of (unmapped) locations, even with a reduced (monocular-only) sensor suite and limited or no localisation.
- a computer-implemented method of automatically distinguishing between static and ephemeral parts of an environment in representations of the environment may comprise at least one of the following: i) obtaining data for multiple traversals of a given route through an environment, the data comprising 3D data giving information on a 3D shape of the environment; and ii) comparing the 3D data from different traversals so as to split the data into at least one of: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
- the data may further comprise visual images.
- the method may further comprise: generating a prior 3D static model of the environment using the static points and not the ephemeral points; comparing a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculating a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculating an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
- the discrepancy may comprise at least one of depth discrepancy (disparity) and normal error.
- the present application is conceptually different in that what is classed as ephemeral is not determined based on class, but rather on whether or not it has moved.
- for example, considering a museum display of stationary vehicles, embodiments of the present invention would recognise the museum display as a static part of the scene (non-ephemeral).
- prior art class-based approaches, trained to recognise vehicles as a class, would recognise the museum display as vehicles and therefore classify them as ephemeral objects.
- a method of automatically distinguishing between static and ephemeral parts of an experienced environment in representations of that experienced environment may comprise one or more of the following: taking a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; providing representations of the experienced environment to the neural network; and predicting, using the neural network, which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
- the neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content.
- the training representations may be representations of a training environment.
- the ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model.
- the training data may comprise a set of training visual images and the corresponding ephemerality masks may mark the segmentation of each of the training visual images into static and ephemeral content.
- the ephemerality masks may be generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
- the predicting may comprise predicting an ephemerality mask for each representation of the experienced environment using the trained neural network, the ephemerality mask segmenting the representation of the experienced environment into static and ephemeral content.
- the method may further comprise at least one of: i. determining a route for a vehicle using the predicted ephemerality mask;
- a system for automatically distinguishing between static and ephemeral parts of representations of an experienced environment around a vehicle as it moves within that environment may comprise one or more of the following: (i) one or more survey vehicles each equipped with one or more sensors arranged to collect a set of training representations, wherein the training representations are representations of a training environment;
- (ii) processing circuitry arranged to generate training data comprising the set of training representations and corresponding ephemerality masks segmenting each of the training representations into static and ephemeral content, wherein the ephemerality masks are generated by: comparing each training representation to a corresponding portion of a 3D model of the static parts of the training environment; computing a discrepancy between the training representation and the corresponding portion of the 3D static model for each element of each training representation; and calculating an ephemerality mask for the training representation based on the discrepancy, the ephemerality mask segmenting the training representation into static and ephemeral content; (iii) a neural network arranged to be trained, using the training data, to distinguish between ephemeral and static content in representations of environments; and
- a vehicle comprising a sensor arranged to generate a representation of the environment through which it moves, and arranged to provide the representation of that environment to the trained neural network; and wherein the trained neural network is arranged to predict which parts of the representation of the experienced environment relate to static parts of the experienced environment and which to ephemeral parts of the experienced environment.
- the processing circuitry of the vehicle may comprise the trained neural network.
- the one or more survey vehicles may be arranged to obtain training representations for multiple traversals of a given route through the training environment.
- the training representations may comprise 3D data; and the processing circuitry may be arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
- the trained neural network may be arranged to predict an ephemerality mask for the representation of the experienced environment.
- the or each survey vehicle may be equipped with a camera and the set of training representations may comprise visual images.
- the processing circuitry may be arranged to: generate a prior 3D static model of the training environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
- a system for automatically distinguishing between static and ephemeral parts of an environment in representations of the environment may comprise: one or more survey vehicles each equipped with one or more sensors, the one or more survey vehicles being arranged to obtain data for multiple traversals of a given route through the environment, the data comprising 3D data; and processing circuitry arranged to compare the 3D data from different traversals so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet a threshold for matching across the multiple traversals.
- the data may further comprise visual images.
- the processing circuitry may be further arranged to: generate a prior 3D static model of the environment using the static points and not the ephemeral points; compare a stereoscopic pair of the visual images to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images; calculate a discrepancy between the synthetic model and the 3D static model for each pixel of each stereoscopic visual image; and calculate an ephemerality mask for the visual image based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
- the discrepancy may comprise at least one of a depth discrepancy (disparity) and normal error.
- a system for automatically distinguishing between static and ephemeral parts of an environment in representations of a runtime environment may comprise at least one of the following: i) a vehicle equipped with at least one of: a sensor arranged to obtain representations of the environment; and a neural network trained to distinguish between static and ephemeral parts of a training environment in representations of the training environment; ii) processing circuitry arranged to provide the representations of the experienced environment to the neural network; and wherein the neural network may be arranged to predict which parts of the representations of the experienced environment relate to static parts of the environment and which to ephemeral parts of the environment.
- the neural network may be trained with training data comprising a set of training representations and corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content.
- the training representations may be representations of a training environment and the ephemerality masks may be generated by comparing representations of the training environment to a 3D model of the static parts of the training environment and computing a discrepancy between the training representations and the static 3D model for each element of each training representation.
- the training data may comprise a set of training visual images and the corresponding ephemerality masks may segment each of the training visual images into static and ephemeral content, wherein the ephemerality masks are generated by comparing pairs of stereoscopic visual images selected from the training visual images to the 3D static model by producing a synthetic 3D model of the training environment using stereo reconstruction of each pair of stereoscopic visual images and computing a discrepancy between the synthetic 3D model and the static 3D model for each pixel of each stereoscopic visual image.
- the processing circuitry may be further arranged to do at least one of the following: i. determine a route for a vehicle using the predicted ephemerality mask;
- the neural network may be arranged to predict an ephemerality mask for each representation of the experienced environment using the trained neural network.
- according to a seventh aspect of the invention there is provided a machine-readable medium containing instructions arranged to, when read by a processing circuitry, cause the processing circuitry to perform the method of the first or second aspects of the invention.
- the machine-readable medium referred to may be any of the following: a CDROM; a DVD ROM / RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive, an SD card, a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc.
- Figure 1 is a schematic view of a vehicle utilising an embodiment for route determination
- Figure 2 shows input images, corresponding ephemerality masks according to an embodiment, and the effect of their use on visual odometry;
- Figure 3 shows a survey vehicle suitable for use with embodiments of the invention;
- Figure 4 shows static 3D model generation in accordance with embodiments of the invention
- Figure 5 schematically illustrates an ephemerality labelling process according to various embodiments of the invention
- Figure 6 shows a network architecture according to various embodiments of the invention.
- Figure 7 shows input data, a depth map, and ephemerality determinations according to an embodiment of the invention
- Figure 8 shows a selection of input images and corresponding ephemerality masks of an embodiment of the invention
- Figure 9 shows graphs illustrating the performance of an embodiment with respect to velocity estimation errors
- Figure 10 shows an input image and ephemerality mask of an embodiment
- Figure 11 shows a flow chart outlining steps of a method of predicting ephemerality masks of an embodiment
- Figure 12 shows a flow chart outlining steps of a method of preparing training data of an embodiment
- Figure 13 shows a flow chart outlining steps of a further embodiment.
- embodiments of the invention may find wider applicability.
- the ability to determine which parts of a scene are ephemeral (non-constant, and/or moving, such as vehicles, road works and pedestrians) and/or which parts relate to static elements (such as buildings, roads, and trees) may find applicability in a number of other fields.
- embodiments may find utility in surveillance systems perhaps to aid object detection, smartphone applications; surveying applications interested in change detection (e.g., maybe returning to a pre-surveyed environment to see if any infrastructure has changed).
- embodiments of the invention can also be applied to other representations of an environment; for example labelling a LIDAR point cloud or other 3D or quasi-3D representation of an environment instead of or as well as labelling visual images.
- embodiments of the invention are described in relation to a sensor 100 mounted upon a vehicle 102 and in relation to the flow chart of Figures 12 and 13.
- the sensor 100 is arranged to monitor its locale and generate data based upon the monitoring thereby providing data giving a representation of a sensed scene around the vehicle; ie an experienced environment.
- the sensor 100 is also arranged to monitor the locale of the vehicle.
- the vehicle 102 is a truck.
- the sensor 100 is a passive sensor (i.e. it does not create radiation and merely detects radiation) and in particular is a monocular camera 100.
- the skilled person will appreciate that different or multiple cameras could be used in some embodiments.
- the sensor 100 may comprise other forms of sensor.
- the sensor 100 may also be an active sensor arranged to send radiation out therefrom and detect reflected radiation, such as a LIDAR system.
- the vehicle 102 is a road vehicle travelling along a road 108 and the sensor 100 is imaging the locale (eg the building 110, road 108, car 109, etc.) as the vehicle 102 travels.
- the vehicle 102 also comprises processing circuitry 112 arranged to capture data from the sensor 100 and subsequently to process the data (in this case visual images 200) generated by the sensor 100.
- the processing circuitry captures data from the sensor 100 which data provides a sensed scene from around the vehicle 102 at a current time; as the vehicle 102 moves the sensed scene changes.
- the processing circuitry 112 also comprises, or has access to, a storage device 114 on the vehicle 102.
- the storage device 114 comprises program storage 126 and data storage 128 in the embodiment being described.
- the visual images 200 are stored in the data storage 128 portion.
- the data storage 128 may be differently partitioned, or not partitioned.
- some of the objects 110 remain static (i.e. do not move or change other than changes in lighting, etc) and an example of such a static object within Figure 1 would be the building 110.
- Such static parts of the scene may be thought of as being structural or unchanging parts of the scene.
- Other objects 109 are not static, are not fixed relative to the sensed scene, and/or are only temporarily static and may not be there should the locale be revisited in the future; such objects may be referred to as ephemeral objects.
- An example of such an ephemeral object in Figure 1 would be the car 109 (whether or not the car is parked at the time).
- the processing circuitry 112 comprises a neural network, or segmentation unit, trained to predict whether or not features within the data captured by the sensor 100 are ephemeral.
- captured data are visual images, and the neural network is also trained to predict depth.
- Features determined to be ephemeral can be ignored as distractors when performing visual odometry and localisation, and may be taken into account as potential obstacles.
- the captured data may be of a different type, and/or depth may not be predicted.
- the vehicle 102 may have a LIDAR sensor or other sensor as well as or instead of a monocular camera 100. As depth is determined directly, no depth prediction is needed in this example.
- Such embodiments may be advantageous for example when a user wishes to determine which elements in a point-cloud are static and which are ephemeral in a single traversal of an environment; the trained neural network allows a prediction of ephemerality to be made for the LIDAR point-clouds collected as an environment is experienced (ie at run-time).
- Features determined to be ephemeral can be ignored as distractors when trying to produce a representation of the structure of the environment traversed.
- a self-supervised approach to ignoring such "distractors" in camera images for the purposes of robustly estimating vehicle motion is described herein.
- the approach may have particular utility in cluttered urban environments.
- the approach described herein leverages multi-session mapping (ie maps that have been generated in a number of sessions) to automatically generate ephemerality masks for input images.
- a depth map may also be generated for the input images.
- the generation of the ephemerality masks and depth maps may be performed offline, prior to use of embodiments of the invention onboard a vehicle 102.
- the ephemerality mask is a per-pixel ephemerality mask in the embodiment being described, such that each pixel of each input image 200 is assigned an ephemerality value.
- each image may be divided differently, for example grouping pixels and providing a value for each group of pixels.
- the depth map is a per-pixel depth map in the embodiment being described, such that each pixel of each input image is assigned a depth value.
- each image may be divided differently, for example grouping pixels and providing a value for each group of pixels.
- the images and their associated ephemerality masks and depth maps are then used to train a deep convolutional network (a neural network) in the embodiment being described - the images, ephemerality masks and depth maps may therefore be thought of as training data.
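- purely as an illustration, training triples of this kind might be packaged for network training along the following lines; the file layout, formats and names below are assumptions rather than details taken from the embodiment:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class EphemeralityDataset(Dataset):
    """Pairs each survey image with its pre-computed ephemerality mask and
    depth (disparity) map, produced offline as described above."""

    def __init__(self, image_paths, mask_paths, depth_paths):
        self.items = list(zip(image_paths, mask_paths, depth_paths))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        image_path, mask_path, depth_path = self.items[idx]
        image = np.load(image_path)  # (H, W, 3) float32 in [0, 1]
        mask = np.load(mask_path)    # (H, W) ephemerality in [0, 1]
        depth = np.load(depth_path)  # (H, W) disparity or metric depth
        return (torch.from_numpy(image).permute(2, 0, 1),
                torch.from_numpy(mask),
                torch.from_numpy(depth))
```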
- the trained network can then predict ephemerality and depth for other images, even images 200 of environments outside of the environment covered by the training data.
- the following describes embodiments in which the trained network is so used to process representations of an environment that is experienced; ie an experienced environment.
- a vehicle 102 using an embodiment of the invention therefore does not have to be within an environment surveyed for the training data to successfully use an embodiment of the invention, nor does the vehicle 102 require knowledge of its location. Embodiments of the invention may therefore offer greater flexibility of use than prior art alternatives.
- the predicted ephemerality and depth can then be used as an input to a monocular visual odometry (VO) pipeline, for example using either sparse features or dense photometric matching.
- Embodiments of the invention may therefore yield metric-scale VO using only a single camera (due to the metric depth estimation enabled by the depth prediction training), and experiments have shown that embodiments may be able to recover the correct egomotion even when 90% of the image is obscured by dynamic, independently moving objects.
- Embodiments of the invention may therefore yield reduced odometry drift and significantly improved egomotion estimation in the presence of large moving vehicles in urban traffic.
- Figure 2 shows an example of robust motion estimation in urban environments using a single camera and a learned ephemerality mask.
- Figure 2 shows three visual images 200a-c of a particular part of an environment at consecutive times (top left).
- the images 200a-c are captured by a camera 100 of a vehicle 102.
- the images 200a-c show a bus 202 driving along a road 108.
- Figure 2 shows three ephemerality masks 204a-c, one corresponding to each image 200a-c (top right).
- the bus 202 which is an example of an ephemeral object, can be easily distinguished from the road, pavement and wall, which are examples of static objects.
- a large bus 202 passes in front of the vehicle 102 (arrow to indicate direction vehicle is facing) obscuring the view of the scene (top left).
- the learned ephemerality mask 204 correctly identifies the bus 202 as an unreliable region of the image 200a-c for the purposes of motion estimation (top right).
- the ephemerality mask 204a-c predicts stable image regions (e.g. buildings, road markings, static landmarks, shown as dark grey or black in Figure 2) that are likely to be useful for motion estimation, in contrast to dynamic, or ephemeral, objects (e.g. pedestrian and vehicle traffic, vegetation, temporary signage, shown as light grey or white in Figure 2).
- the embodiment being described uses repeated traversals of the same route to generate training data without requiring manual labelling or object class recognition.
- the training data is gathered by a survey vehicle equipped with a LIDAR sensor and a stereoscopic camera.
- alternative or additional sensors may be used in other embodiments, and that a stereoscopic camera alone may be sufficient in some embodiments.
- the survey vehicle 302 traverses a particular route through a particular environment multiple times, gathering data on the environment, the data including 3D data and visual images. The gathered data is then used to produce depth and ephemerality labels for the images, and the labelled images (training data) are then passed to a deep convolutional network in a self- supervised process (no manual assistance or annotation required).
- the depth maps for the training data are generated using stereoscopic image pairs taken by a stereoscopic camera C mounted on a survey vehicle 302. Per-pixel depth estimation is performed by warping the left image onto the right image (or vice versa) and matching intensities. A shift in the position of an object between the left and right images provides an estimate of the distance from the camera C to that object - the larger the shift, the smaller the distance.
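- purely as an illustration of the stereo geometry just described (not the embodiment's own implementation; the focal length, baseline and function name below are assumptions), converting per-pixel disparity to metric depth might be sketched as:

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert per-pixel disparity (in pixels) to metric depth for a
    rectified stereo pair: depth = focal * baseline / disparity, so a
    larger shift between left and right images means a smaller distance."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.inf)  # zero disparity -> effectively infinite depth
    valid = disparity_px > eps
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

# Hypothetical numbers: a 400 px focal length and a 0.24 m baseline.
print(disparity_to_depth(np.array([64.0, 8.0]), focal_px=400.0, baseline_m=0.24))  # [ 1.5 12. ]
```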
- the depth estimates may then be refined using the static 3D model in the embodiment being described; this model is generated from the LIDAR data, but it may be produced differently in other embodiments
- this refinement step may not be used.
- LIDAR data or the likes may be mapped onto images to provide per-pixel depth estimates and stereoscopic data may not be used.
- the ephemerality mask is integrated as a component of a monocular visual odometry (VO) pipeline as an outlier rejection scheme.
- a static 3D model is a model of the environment with the ephemeral objects removed.
- the method disclosed herein is advantageous over many prior art techniques in that manual labelling is not needed and classifications of ephemeral objects and image recognition based on those classes are not required.
- prior art techniques could be used to generate a suitable static 3D model.
- LIDAR and stereo camera sensors are only provided on the survey vehicle 302 to collect training data in the embodiment being described; a monocular camera 100 is sufficient (although additional or alternative sensors may be used or present in some embodiments).
- a synthetic 3D static model which may be termed a prior 3D static structure, is generated using the static/structural components of the environment.
- the prior 3D static structure is projected into every stereo camera image collected during the survey.
- a dense stereo approach (similar to [16] - C. McManus, W. Churchill, A. Napier, B. Davis, and P. Newman, "Distraction suppression for vision-based pose estimation at city scales," in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 3762-3769) is used to compare the stereo camera image to the prior 3D static structure.
- the two 3D representations differ, i.e. there is some discrepancy between them.
- ephemerality is computed as a weighted sum of the discrepancies.
- the calculated discrepancy includes both the depth disparity and the normal difference between the prior/static 3D structure and the "true"/current 3D structure.
- a deep convolutional network is then trained to predict the resulting pixel-wise depth and ephemerality mask for input monocular images 200 and at run time, live depth and ephemerality masks are predicted for images taken by a camera 100 even in locations not traversed by the survey vehicle 302. These three steps are described in more detail below.
- a survey vehicle 302 equipped with a camera C and a LIDAR L, as illustrated in Figure 3, performs a number of traverses j of an environment.
- a neighbourhood function N(·) is defined, where a point pj belongs to the neighbourhood N(pi) of a query point pi if it satisfies the condition ||pi − pj|| < α.
- α is a neighbourhood size parameter, typically set to 0.5 m in experiments described herein.
- a normalised histogram over the traversals j from which points fell in the neighbourhood of the query point is then built, and its entropy H(pi) = −Σj qj log qj is computed, where qj is the fraction of neighbourhood points sourced from traversal j.
- neighbourhoods of points that are well-distributed between different traversals indicate static structure, whereas neighbourhoods of points that were only sourced from one or two traversals are likely to be ephemeral objects.
- a point pi is classified as static structure P if the neighbourhood entropy H(pi) exceeds a minimum threshold; all other points are estimated to be ephemeral and are removed from the static 3D prior.
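- a minimal sketch of this neighbourhood-entropy test is given below; it is illustrative only, and the point alignment step, neighbourhood radius and entropy threshold used are assumptions rather than the embodiment's exact values:

```python
import numpy as np
from scipy.spatial import cKDTree

def static_point_mask(points, traversal_ids, alpha=0.5, entropy_threshold=1.0):
    """Label a point as static when its neighbourhood is well-distributed
    across traversals (high entropy), and as ephemeral otherwise.

    points        : (N, 3) array of aligned 3D points from all traversals
    traversal_ids : (N,) integer id of the traversal each point came from
    alpha         : neighbourhood radius in metres
    """
    tree = cKDTree(points)
    n_traversals = int(traversal_ids.max()) + 1
    is_static = np.zeros(len(points), dtype=bool)
    for i, p in enumerate(points):
        neighbours = tree.query_ball_point(p, r=alpha)
        # Distribution over the traversals contributing to this neighbourhood.
        counts = np.bincount(traversal_ids[neighbours], minlength=n_traversals)
        q = counts / counts.sum()
        entropy = -np.sum(q[q > 0] * np.log(q[q > 0]))
        is_static[i] = entropy >= entropy_threshold
    return is_static
```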
- Figure 4 illustrates prior 3D mapping to determine the static 3D scene structure. Alignment of multiple traversals of a route (image 402, top left) yields a large number of points only present in single traversals, e.g. traffic or parked vehicles, here shown in white, around the survey vehicle 302. These points corrupt a synthetic depth map (image 404, top right).
- the embodiment being described removes 3D points that were only observed in a small number of traversals, and retains the structure that remained static for the duration of data collection (image 406, bottom left), resulting in high-quality synthetic depth maps (image 408, bottom right).
- black denotes areas of invalid data (e.g. the sky for which no LIDAR returns were received) and white denotes points which are discarded from the map optimisation process as a result of the ephemerality judgements.
- grayscale values from around 0.2 to 0.8 (i.e. with a gap at each end so that the white and black regions can be distinguished) illustrate depth from the sensor viewpoint, with lighter coloured areas being further away.
- Figure 5 illustrates the ephemerality labelling process 500 of the embodiment being described.
- a synthetic normal image can also be generated for each survey image, as is also illustrated in Figure 5.
- the true disparity dt (504), i.e. the depth from the camera, is computed from the stereo image taken by the camera;
- the true normals nt (514), i.e. vectors perpendicular to the local surface plane, are likewise computed from the stereo image taken by the camera;
- the prior 3D pointcloud is projected into the image to form the prior disparity (506) and the prior normal (516).
- a difference between the true and prior disparities is termed a disparity error 508.
- a difference between the true and prior normals is termed a normal error 518.
- the disparity and normal error terms are combined to form the ephemerality mask 520 (right).
- normal errors may not be used and the ephemerality mask 520 may be based on disparity (depth error) alone, or on depth error and one or more other factors.
- black denotes invalid pixels.
- the distance from a camera to a first point on the road in the first picture may be similar to the distance from the camera to a tyre blocking the view of the first point on the road, and so the presence of the tyre of the vehicle may not be easy to determine from depth alone.
- the road is substantially horizontal whereas the side of the tyre is substantially vertical, so the normal error is large.
- use of normal errors can assist in providing crisp and accurate outlines of ephemeral objects.
- the ephemerality mask is defined as the weighted difference between the expected static and true disparity and normals, as set out in equation (5) (not reproduced here), where two weighting parameters control the relative contributions of the disparity and normal error terms; the mask is bounded to [0, 1] after computation.
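- as a sketch of this kind of combination only (the exact form of equation (5) is not reproduced above, so the weights, the normal-error definition and the clipping below are assumptions):

```python
import numpy as np

def ephemerality_mask(d_true, d_prior, n_true, n_prior,
                      w_disparity=0.5, w_normal=0.5):
    """Combine per-pixel disparity error and normal error into an
    ephemerality value in [0, 1]; higher values suggest ephemeral content.

    d_true, d_prior : (H, W) disparities from live stereo and the static prior
    n_true, n_prior : (H, W, 3) unit surface normals from stereo and the prior
    """
    disparity_error = np.abs(d_true - d_prior)
    # 0 where the normals agree, 1 where they are perpendicular or opposed.
    normal_error = 1.0 - np.clip(np.sum(n_true * n_prior, axis=-1), 0.0, 1.0)
    return np.clip(w_disparity * disparity_error + w_normal * normal_error, 0.0, 1.0)
```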
- a convolutional encoder-multi-decoder network architecture is trained and used to predict both disparity and ephemerality masks from a single image in the embodiment being described, as illustrated in Figure 6.
- Figure 6 shows a network architecture 600 for ephemerality and disparity learning (termed a "deep distraction network" as it is arranged to perform deep machine learning to identify distractions, i.e. ephemeral parts of scenes).
- a common encoder 602 is used which splits to multiple decoders 604, 606 for the ephemerality mask 608 and disparity 610 outputs.
- the deep distraction network comprises a single-encoder multi-decoder network with skip connections connecting each scale of the encoder to corresponding scales in the decoders similar to the UNet architecture (see reference [21]).
- the encoder used in the embodiment being described is based on the VGG network (see Simonyan, K. & Zisserman, A., "Very deep convolutional networks for large-scale image recognition", in Proc. International Conference on Learning Representations) and is used to extract a low resolution feature map from the input monocular image 200.
- the decoders perform the opposite of the encoder, essentially reversing the VGG operations. Specifically, the decoders map the low resolution encoder feature map to full input resolution feature maps for pixel-wise classification of ephemerality and disparity. At each scale in the decoders a skip connection passes higher resolution features from the corresponding scale in the encoder.
- Each decoder is identical at initialisation but is trained independently for its specific tasks.
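- a minimal PyTorch sketch of a single-encoder, two-decoder layout with skip connections is given below; the layer widths, depth and output activations are illustrative assumptions and do not reproduce the VGG-based configuration of the embodiment:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    def __init__(self, channels, out_channels, out_activation):
        super().__init__()
        # One up-sampling stage per encoder scale, each consuming a skip connection.
        self.ups = nn.ModuleList(
            [conv_block(channels[i] + channels[i - 1], channels[i - 1])
             for i in range(len(channels) - 1, 0, -1)])
        self.head = nn.Conv2d(channels[0], out_channels, 1)
        self.out_activation = out_activation

    def forward(self, feats):
        x = feats[-1]  # lowest-resolution encoder feature map
        for up, skip in zip(self.ups, reversed(feats[:-1])):
            x = nn.functional.interpolate(x, scale_factor=2, mode="nearest")
            x = up(torch.cat([x, skip], dim=1))  # skip connection from the encoder
        return self.out_activation(self.head(x))

class DistractionNet(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        chans_in = (3,) + channels[:-1]
        self.enc = nn.ModuleList([conv_block(ci, co) for ci, co in zip(chans_in, channels)])
        self.pool = nn.MaxPool2d(2)
        # Two task-specific decoders share the same encoder features.
        self.ephemerality = Decoder(channels, 1, nn.Sigmoid())  # values in [0, 1]
        self.disparity = Decoder(channels, 1, nn.Softplus())    # non-negative disparity

    def forward(self, image):  # image: (B, 3, H, W) with H and W divisible by 8
        feats, x = [], image
        for i, block in enumerate(self.enc):
            x = block(x if i == 0 else self.pool(x))
            feats.append(x)
        return self.ephemerality(feats), self.disparity(feats)
```

- both decoders consume the same encoder feature pyramid, mirroring the shared-encoder design described above, and each can be trained independently for its task.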
- a pixel-wise loss is used to replicate the pre-computed ground truth ephemerality masks in the training data, which were calculated using the 3D static model and corresponding stereoscopic images in the embodiment being described.
- the stereo photometric loss proposed in [22] (C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in CVPR, 2017) is used, optionally semi-supervised using the prior LIDAR disparity to ensure metric-scaled outputs.
- the losses between the different output stages are balanced using the multi-task learning approach in [23] (A. Kendall, Y. Gal, and R. Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," arXiv preprint arXiv:1705.07115, 2017), which continuously updates the inter-task weighting during training.
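- a sketch of that style of uncertainty-based weighting, with one learned log-variance per task, is given below; this illustrates the general idea rather than the exact formulation used in the embodiment:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Balance per-task losses with learned log-variances so that the
    inter-task weighting is updated continuously during training."""

    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        total = 0.0
        for loss, log_var in zip(task_losses, self.log_vars):
            total = total + torch.exp(-log_var) * loss + log_var
        return total

# Hypothetical usage with the two decoder losses:
# weighting = UncertaintyWeightedLoss(n_tasks=2)
# total_loss = weighting([ephemerality_loss, disparity_loss])
```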
- the Adam optimiser, as described in Kingma, D. and Ba, J., 2014, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, and an initial learning rate of 1×10⁻⁴ were used. The skilled person will appreciate that such implementation details may vary between embodiments.
- the live depth and ephemerality mask produced by the network are leveraged to produce reliable visual odometry estimates accurate to metric scale.
- the warping function projects the matched feature xj into the current image according to the relative pose and the camera intrinsics.
- the set of all extracted features is typically a small subset of the total number of pixels in the image.
- a step function s(·) is used to disable the residual according to the predicted ephemerality: the residual is retained if the predicted ephemerality of the matched feature is below a maximum threshold, and disabled otherwise.
- the maximum ephemerality threshold for a valid feature is typically set to 0.5 in the embodiments being described.
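- a sketch of how such a gating step might be applied to feature residuals (feature extraction and the pose solver are outside the scope of this snippet, and the names used are illustrative):

```python
import numpy as np

def gate_residuals(residuals, feature_ephemerality, tau=0.5):
    """Zero out reprojection residuals for features predicted to be ephemeral.

    residuals            : (N, 2) reprojection residuals for matched features
    feature_ephemerality : (N,) ephemerality values sampled from the mask at
                           each feature location
    tau                  : maximum ephemerality for a valid feature
    """
    s = (feature_ephemerality < tau).astype(residuals.dtype)  # step function
    return residuals * s[:, None]
```

- for dense VO approaches the photometric error term may instead be weighted directly by the ephemerality mask, as described below with reference to Figure 7.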
- sparse features are detected using FAST corners ([26] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," Computer Vision-ECCV 2006, pp. 430-443, 2006) and matched using BRIEF descriptors ([27] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: binary robust independent elementary features," Computer Vision-ECCV 2010, pp. 778-792, 2010) for real-time operation.
- Figure 7 illustrates the predicted depth, selected sparse features and weighted dense intensity values used for a typical urban scene.
- Figure 7 shows input data for ephemerality-aware visual odometry.
- the network predicts a dense depth map 704 (top right) and an ephemerality mask.
- the ephemerality mask is used to select which features are used for optimisation 706 (bottom left), and for dense VO approaches the photometric error term is weighted directly by the ephemerality mask 708 (bottom right).
- white crosses with a black border correspond to points which are identified as static and black crosses with a white border correspond to points which are identified as ephemeral.
- the approach of the embodiments described above was benchmarked using hundreds of kilometres of data collected from an autonomous vehicle platform in a complex urban environment.
- the goal was to quantify the performance of the ephemerality-aware visual odometry approach in the presence of large dynamic objects in traffic.
- the network was trained using eight 10km traversals from the Oxford RobotCar dataset ([30] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "7 year, 1000 km: The Oxford RobotCar dataset " The International Journal of Robotics Research, vol. 36, no. 1, pp. 3-15, 2017) for a total of approximately 80km of driving.
- the RobotCar vehicle 302 is equipped with a Bumblebee XB3 stereo camera C and a LMS-151 pushbroom LIDAR scanner L.
- a stereo camera C may be used to provide all of the required information, without using LIDAR.
- point clouds generated by stereo reconstruction from the stereoscopic images taken on different traversals could be compared to establish the synthetic, static 3D prior point cloud. Then, each unmodified stereo-reconstructed point cloud could be compared to the relevant portion of the synthetic static point cloud
- the input images were down-sampled to 640 × 256 pixels and sub-sampled to one image every metre before use; a total of 60,850 images were used for training.
- Ephemerality masks and depth maps were produced at 50Hz using a single GTX 1080 Ti GPU.
- Figure 8 shows two columns of input images (800a, 810a), each adjacent to a corresponding column of ephemerality masks 800b, 810b.
- Figure 8 shows ephemerality masks 800b, 810b produced in challenging urban environments.
- the masks reliably highlight a diverse range of dynamic objects (cars, buses, trucks, cyclists, pedestrians, strollers) with highly varied distances and orientations. Even buses and trucks that almost entirely obscure the camera image are successfully masked despite the lack of other scene context.
- Robust VO approaches that make use of the ephemerality mask may therefore provide correct motion estimates even when more than 90% of the static scene is occluded by an independently moving object.
- the sparse VO approach provided lower overall translational drift, whereas the dense approach produced lower orientation drift.
- Figure 9 shows velocity estimation errors in the presence of distractors.
- the sparse ephemerality-aware approach significantly outperforms the baseline approach, producing far fewer outliers above 0.5 m/s.
- the dense ephemerality-aware approach does not perform as well, but still outperforms the baseline.
- the vertical axis is scaled to highlight the outliers.
- An ephemerality mask estimates the likelihood that any pixel in an input image corresponds to either reliable static structure or dynamic objects in the environment. Further, prediction of the ephemerality mask can be learned using an automatic self-supervised approach as disclosed herein with respect to various embodiments.
- Figure 10 illustrates a static/ephemeral segmentation performed using the ephemerality mask; currently the "background" (static features) may be used to guide motion estimation and/or the "foreground" (ephemeral features) may be used for obstacle detection.
- ephemerality masks are widely applicable for autonomous vehicles.
- the ephemerality mask can be used to inform localisation against only the static scene (bottom left) whilst guiding object detection to only the ephemeral elements (bottom right).
- Figure 11 illustrates an overall method 1100 for predicting ephemerality masks for representations of an environment that is being experienced, as disclosed herein.
- training data is generated.
- the training data comprises a set of training representations, the training representations being representations of a training environment.
- the training data also comprises corresponding ephemerality masks marking the segmentation of each of the training representations into static and ephemeral content.
- the ephemerality masks are generated by comparing 1102 each training representation to a corresponding portion of a 3D model of the static parts of the training environment.
- a discrepancy between the training representation and the corresponding portion of the 3D static model is then calculated 1104.
- the training representations comprise visual images and discrepancy is assessed on a per-pixel basis for a visual image. In other cases, larger elements than a single pixel may be used, or the training representation may not comprise pixels, for example being a point-cloud or another type of representation.
- An ephemerality mask is then calculated 1106 for the training representation based on the discrepancy. The ephemerality mask marks the segmentation of the training representation into static and ephemeral content.
- Figure 12 depicts a method 1200 for generating training data in accordance with various embodiments of the invention.
- the method 1200 comprises obtaining 1202 data for multiple traversals of a given route through an environment, the data comprising 3D data.
- One or more survey vehicles may be used to obtain the data.
- the 3D data from different traversals is then compared 1204 so as to split the data into: a set of static points comprising elements of the 3D data which meet a threshold for matching across the multiple traversals; and a set of ephemeral points comprising elements of the 3D data which do not meet that threshold.
- the separation of points of the point clouds into ephemeral and static sets may be sufficient for use as training data, for example when the vehicle 102 will have a LIDAR system.
- the training data preferably comprises labelled visual images.
- the data comprises visual images and 3D data, and the following steps apply:
- a prior 3D static model of the environment is generated 1206 using the static points and not the ephemeral points;
- a stereoscopic pair of the visual images is compared 1208 to a corresponding portion of the static 3D model by producing a synthetic model of a portion of the environment using stereo reconstruction of the stereoscopic pair of visual images.
- a discrepancy between the synthetic model and the 3D static model is then calculated 1210 for each pixel of each stereoscopic visual image.
- An ephemerality mask for the visual image is then calculated 1212 based on the discrepancy, the ephemerality mask marking the segmentation of the visual image into static and ephemeral content.
- Figure 13 illustrates a method 1300 of various embodiments of the invention.
- a trained neural network is obtained.
- the neural network is trained to distinguish between static and ephemeral parts of representations of environments.
- the neural network may be trained using training data generated as discussed with respect to Figure 12.
- the neural network is also trained to predict depth based on visual images.
- in step 1304, representations of an environment that is experienced by the vehicle 102 are provided to the neural network for analysis.
- the trained neural network predicts an ephemerality mask for each representation of the experienced environment.
- the trained neural network may also predict depth values (which may be thought of as a depth mask) for each representation of the experienced environment.
- the skilled person will appreciate that the predicted ephemerality mask (and the predicted depth mask where applicable) can be used in many different ways. Steps 1308 to 1314 illustrate four examples of such uses: i. determining a route 1314 for a vehicle using the predicted ephemerality mask;
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Remote Sensing (AREA)
- Radar, Positioning & Navigation (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Medical Informatics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Electromagnetism (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GBGB1718692.5A GB201718692D0 (en) | 2017-11-13 | 2017-11-13 | Detecting static parts of a scene |
PCT/GB2018/053259 WO2019092439A1 (en) | 2017-11-13 | 2018-11-12 | Detecting static parts of a scene |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3710985A1 true EP3710985A1 (en) | 2020-09-23 |
Family
ID=60788343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP18804091.9A Withdrawn EP3710985A1 (en) | 2017-11-13 | 2018-11-12 | Detecting static parts of a scene |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP3710985A1 (en) |
GB (1) | GB201718692D0 (en) |
WO (1) | WO2019092439A1 (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10522038B2 (en) | 2018-04-19 | 2019-12-31 | Micron Technology, Inc. | Systems and methods for automatically warning nearby vehicles of potential hazards |
US11675083B2 (en) * | 2019-01-03 | 2023-06-13 | Nvidia Corporation | Removal of ephemeral points from point cloud of a high-definition map for navigating autonomous vehicles |
US11373466B2 (en) | 2019-01-31 | 2022-06-28 | Micron Technology, Inc. | Data recorders of autonomous vehicles |
US11755884B2 (en) | 2019-08-20 | 2023-09-12 | Micron Technology, Inc. | Distributed machine learning with privacy protection |
US11636334B2 (en) | 2019-08-20 | 2023-04-25 | Micron Technology, Inc. | Machine learning with feature obfuscation |
CN110992367B (en) * | 2019-10-31 | 2024-02-02 | 北京交通大学 | Method for semantically segmenting image with occlusion region |
US11755917B2 (en) * | 2019-11-15 | 2023-09-12 | Waymo Llc | Generating depth from camera images and known depth data using neural networks |
US11640692B1 (en) | 2020-02-04 | 2023-05-02 | Apple Inc. | Excluding objects during 3D model generation |
EP3885970A1 (en) * | 2020-03-23 | 2021-09-29 | Toyota Jidosha Kabushiki Kaisha | System for processing an image having a neural network with at least one static feature map |
US20230180018A1 (en) * | 2021-12-03 | 2023-06-08 | Hewlett Packard Enterprise Development Lp | Radio frequency plan generation for network deployments |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB201301281D0 (en) * | 2013-01-24 | 2013-03-06 | Isis Innovation | A Method of detecting structural parts of a scene |
CN106997466B (en) * | 2017-04-12 | 2021-05-04 | 百度在线网络技术(北京)有限公司 | Method and device for detecting road |
2017
- 2017-11-13 GB GBGB1718692.5A patent/GB201718692D0/en not_active Ceased
2018
- 2018-11-12 EP EP18804091.9A patent/EP3710985A1/en not_active Withdrawn
- 2018-11-12 WO PCT/GB2018/053259 patent/WO2019092439A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2019092439A1 (en) | 2019-05-16 |
GB201718692D0 (en) | 2017-12-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Barnes et al. | Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments | |
EP3710985A1 (en) | Detecting static parts of a scene | |
Barnes et al. | Find your own way: Weakly-supervised segmentation of path proposals for urban autonomy | |
Chen et al. | Suma++: Efficient lidar-based semantic slam | |
US10192113B1 (en) | Quadocular sensor design in autonomous platforms | |
US10496104B1 (en) | Positional awareness with quadocular sensor in autonomous platforms | |
US20200026283A1 (en) | Autonomous route determination | |
Leibe et al. | Coupled object detection and tracking from static cameras and moving vehicles | |
EP4191532A1 (en) | Image annotation | |
Barth et al. | Estimating the driving state of oncoming vehicles from a moving platform using stereo vision | |
CN111611853B (en) | Sensing information fusion method, device and storage medium | |
Zhou et al. | Moving object detection and segmentation in urban environments from a moving platform | |
EP3844672A1 (en) | Structure annotation | |
US20200082182A1 (en) | Training data generating method for image processing, image processing method, and devices thereof | |
Parra et al. | Robust visual odometry for vehicle localization in urban environments | |
Murali et al. | Utilizing semantic visual landmarks for precise vehicle navigation | |
Jeong et al. | Multimodal sensor-based semantic 3D mapping for a large-scale environment | |
WO2016170330A1 (en) | Processing a series of images to identify at least a portion of an object | |
Gaspar et al. | Urban@ CRAS dataset: Benchmarking of visual odometry and SLAM techniques | |
Held et al. | A probabilistic framework for car detection in images using context and scale | |
McManus et al. | Distraction suppression for vision-based pose estimation at city scales | |
Suleymanov et al. | Online inference and detection of curbs in partially occluded scenes with sparse lidar | |
Saleem et al. | Neural network-based recent research developments in SLAM for autonomous ground vehicles: A review | |
Nguyen et al. | Confidence-aware pedestrian tracking using a stereo camera | |
CN116597122A (en) | Data labeling method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20200513 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent | Extension state: BA ME |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20220601 |