GB2541153A - Processing a series of images to identify at least a portion of an object - Google Patents

Processing a series of images to identify at least a portion of an object

Info

Publication number
GB2541153A
GB2541153A
Authority
GB
United Kingdom
Prior art keywords
image
region
images
identify
vehicle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1507009.7A
Other versions
GB201507009D0 (en)
Inventor
Maria Paz Lina
Pinies Pedro
Newman Paul
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oxford University Innovation Ltd
Original Assignee
Oxford University Innovation Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oxford University Innovation Ltd filed Critical Oxford University Innovation Ltd
Priority to GB1507009.7A
Publication of GB201507009D0
Priority to PCT/GB2016/051096 (WO2016170330A1)
Publication of GB2541153A


Classifications

    • G06T 7/174 Segmentation; Edge detection involving the use of two or more images
    • G06T 7/11 Region-based segmentation
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/64 Three-dimensional objects
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/10028 Range image; Depth image; 3D point clouds
    • G06T 2207/20156 Automatic seed setting
    • G06T 2207/30256 Lane; Road marking

Abstract

A series of images (e.g. representing the environment of a vehicle) are processed to identify a portion of an object of interest (e.g. navigable region), each image representing a part of an environment. A first two-dimensional (2D) image (102) is obtained, in which at least one point (i.e. a seed), having a predetermined property, is labelled (104a) as belonging to the object of interest. The image is segmented (e.g. using colour) to identify the region (e.g. navigable region) (104b) corresponding to the labelled point to identify a portion of the object of interest within the first image. A second 2D image of the environment is then obtained, and at least a portion of the region from the first image is propagated to the second image using three dimensional (3D) geometric data (108) (e.g. odometry data, depth map, point cloud from LIDAR). The second image is segmented to identify the region having the predetermined property thereby identifying the portion of the object of interest in the second image. The method may be repeated such that the identified region in the second image is used as the seed to identify a further region in a further image.

Description

PROCESSING A SERIES OF IMAGES TO IDENTIFY AT LEAST A PORTION OF AN
OBJECT
This invention relates to a method and system arranged to process a series of images to identify at least a portion of an object of interest within the images. In particular, the object of interest may typically be a navigable region within that image. In particular, but not exclusively, the system or method may be used to automatically identify navigable regions within a series of monocular images. Further, and again not exclusively, the invention may relate to automatically segmenting and labelling road visible in the images of an outdoor environment. Specifically, and again not exclusively, the invention may have particular utility in driver assistance systems and for self-driving vehicles.
The skilled person would understand that the invention may be applied to the field of road detection, but is not limited to this field. By way of non-limiting example, instead of using a moving sensor to detect road or buildings, the invention could be applied to a stationary sensor used to detect and track moving vehicles, people or other objects. In the following, reference is made to road for convenience, but it will be appreciated that what is in fact referred to is a ‘navigable region’, be that a road, a pavement, a path, a track, a canal, a river, or the like. As such, reference to road should be viewed in this wider context.
It is convenient to describe the background with reference to road detection, which is a challenging task but provides a useful role for supporting advanced driver assistance systems, such as road following or vehicle and pedestrian detection. Moreover, the estimation of the road geometry (eg slopes and/or borders, etc) and the localisation of a vehicle are each useful tasks in this context since they aid the lateral and/or longitudinal control of the vehicle. Onboard vision systems have been widely used as they offer advantages over active sensors such as Radar or Lidar (for example higher resolution, low power consumption, low cost, easy aesthetic integration, and non-intrusive nature) that allow for reduced processing time and computational effort.
Current state of the art two dimensional (2D) segmentation solutions use single colour images and machine learning algorithms that require supervised training on extensive image databases. H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. R. Bradski, “Self-supervised monocular road detection in desert terrain”, in Robotics: Science and Systems II, August 16-19, 2006, University of Pennsylvania, 2006, teaches an approach where self-supervised learning is achieved by combining the camera image with a laser range finder to identify a drivable surface area in the near vicinity of the vehicle. Once identified, this area is used as training data for vision-based road classification.
The problem of detecting road geometry from visual information seems simple. However, complexity is introduced as the road is imaged from a mobile vehicle/camera with a constantly changing background, under the presence of different objects, such as vehicles and pedestrians, whilst being exposed to varying ambient illumination and weather conditions. A particularly difficult scenario manifests when the road surface has both shadowed and non-shadowed areas. J. M. Alvarez and A. M. Lopez, “Road detection based on illuminant invariance”, IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 184-193, 2010, proposes an approach to vision-based road detection that is robust to shadows. The approach relies on using a shadow-invariant feature space combined with a model-based classifier. The model is built online to improve the adaptability of the algorithm to the current lighting conditions and the presence of other vehicles in the scene.
Similarly, the work of M. Beyeler, F. Mirus and A. Verl, “Vision-based robust road lane detection in urban environments”, in IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 2014, pp. 243-252 presents a low cost approach to ego-lane detection on illumination-invariant colour images. H. Kong, J.-Y. Audibert, and J. Ponce, “General road detection from a single image”, Trans. Img. Proc., vol. 19, no. 8, pp. 2211-2220, Aug. 2010 addresses road detection within a general dataset using a single image. The process is split into two steps: the estimation of the vanishing point associated with the central (straight) part of the road, followed by the segmentation of the corresponding road area based upon that vanishing point using a technique for detecting and refining road boundaries.
In R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time”, in Proceedings of the 2011 International Conference on Computer Vision ICCV, Washington, DC, USA: IEEE Computer Society, pp. 2320-2327, the total variation (TV) is used as a regularisation term to preserve sharp depth discontinuities whilst simultaneously enforcing smoothness of homogeneous surfaces. The solution is based on a primal-dual formulation applied in solving variational convex functions that arise in many image processing problems (see, for example, A. Chambolle and T. Pock, “A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging”, Journal of Mathematical Imaging and Vision, vol. 40, no. 1, pp. 120-145, May 2011).
According to a first aspect of the invention, there is provided a method of processing a series of images. Each image may represent at least a part of an environment. The method may be used to identify at least a portion of an object of interest within the images.
The method may comprise at least one of the following steps: (i) obtaining a first two-dimensional (2D) image of the environment, in which at least one point, having a predetermined property, may be labelled as forming at least part of the object of interest; (ii) segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify at least a portion of the object of interest within the first 2D image; (iii) obtaining a second 2D image of the environment; (iv) propagating at least a portion of the region from the first 2D image to the second 2D image, optionally using three dimensional (3D) geometric data; and (v) segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
The skilled person would understand that the steps may not be performed in the order in which they are listed above. For example, the two (or more) 2D images could be obtained before any segmentation is performed. Additionally or alternatively, if the point identified in step (i) can be projected into the second image using 3D geometric data, segmentation of the first 2D image may be unnecessary. In such a case, projecting the point corresponds to propagation of the region - step (iv) - the region being a single point.
Optionally, the method may be repeated such that the region having the predetermined property within the second 2D image is used as the seed to identify a further region in a further image.
In some embodiments, each image may represent at least a portion of the environment in the vicinity of a vehicle, and the object of interest may be a navigable region within the images. In such embodiments, the obtaining a first 2D image of the environment may comprise obtaining a 2D image in which at least one point, having a predetermined property, is labelled as being navigable.
Further, the segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point may comprise segmenting the first 2D image to identify the navigable region within the first 2D image. The segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image may comprise segmenting the second 2D image to identify a navigable region in the second 2D image.
In embodiments wherein each image may represent at least a portion of the environment in the vicinity of a vehicle, odometry data may be generated as the vehicle moves. The skilled person would understand that the odometry data may be used to determine movement of the vehicle from the position at which the first image was taken to the position at which the second image was taken. Further, information on the movement of the vehicle may be used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image. The 3D geometric data may therefore comprise odometry data.
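By way of a minimal sketch only (not taken from the patent), the relative motion needed for propagation can be computed from two odometry poses as follows; the 4x4 homogeneous matrix representation of the poses is an assumption.

```python
import numpy as np

def relative_motion(T_world_cam1, T_world_cam2):
    """Relative SE(3) motion of the camera between two image poses.

    Given 4x4 poses of the camera in a common (world/odometry) frame, returns
    the transform taking points expressed in the first camera frame into the
    second camera frame - the quantity used to propagate a labelled region
    from the first image to the second.
    """
    return np.linalg.inv(T_world_cam2) @ T_world_cam1
```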
In embodiments wherein each image may represent at least a portion of the environment in the vicinity of a vehicle, the first and second 2D images may be captured from a sensor mounted on the vehicle. Optionally, the sensor may be a camera, which may typically be a monocular camera. The skilled person would understand that many other types of sensor may be used, and that multiple sensors may be used, optionally of different types.
In some embodiments, the first and second 2D images may be illumination-invariant colour images. The method may therefore include the step of generating these illumination invariant first and second 2D images from images generated by a sensor.
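The patent does not prescribe a particular illumination-invariant transform. As an illustration only, one common log-chromaticity formulation from the wider literature maps an RGB image to a single channel using a camera-dependent mixing parameter; the formula and the default value of alpha below are assumptions, not part of the described method.

```python
import numpy as np

def illumination_invariant(rgb, alpha=0.48):
    """Map an RGB image to a single-channel illumination-invariant image.

    One common log-chromaticity form (an assumption, not taken from the
    patent): ii = 0.5 + log(G) - alpha*log(B) - (1 - alpha)*log(R), where
    alpha depends on the camera's spectral response.
    """
    rgb = rgb.astype(np.float64) / 255.0 + 1e-6     # avoid log(0)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.5 + np.log(g) - alpha * np.log(b) - (1.0 - alpha) * np.log(r)
```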
The skilled person would understand that, in some embodiments, the object of interest may be a mobile object. In such embodiments, the series of 2D images may be processed to identify the object of interest, and optionally also to track the object of interest through the series of images.
In embodiments wherein the object of interest is a mobile object, the object of interest may comprise a sensor which generates odometry data as the object of interest moves. Advantageously, the object of interest may transmit the odometry data. The skilled person would understand that transmitted odometry data may be used in the method as part or all of the 3D geometric data used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image.
The skilled person will understand that a variety of sensors can be used to generate odometry data, for example cameras, rotary encoders, accelerometers and the like.
Alternatively or additionally, the 3D geometric data may comprise a depth-map and/or at least one geometric assumption about the environment.
In embodiments wherein the 2D images comprise pixels, the 3D geometric data may provide a depth estimate for each pixel. Additionally or alternatively, in such embodiments, morphology regularisation may be used on pixels located near boundaries of a region to reduce depth inaccuracies.
In alternative or additional embodiments, the method may further comprise the step of transferring the region to a 3D prior of the environment. The 3D prior may be a point-cloud, which may be generated from a LIDAR.
In alternative or additional embodiments, statistical colour segmentation may be used to segment the first and second 2D images to identify the at least a portion of the object of interest therein.
In embodiments wherein a depth-map is used, the depth-map may be created from a pair of consecutive 2D images of the environment, using parallax.
Optionally, at least one geometric assumption about the environment is used in addition to the pair of consecutive images. The skilled person would understand that using a geometric assumption may improve the accuracy of the depth map.
In embodiments wherein one or more geometric assumptions are used, a geometric assumption may be that the geometry of the environment contains affine surfaces. The skilled person would understand that an assumption that the environment contains affine surfaces may be particularly useful in urban or indoor environments, wherein the environment comprises many planar surfaces.
In alternative or additional embodiments, the method may further comprise generating a visual representation of at least a portion of the environment. In the visual representation, the identified region is highlighted. The skilled person would understand that a different colour, texture or brightness or the like may be used to highlight the identified region. Alternatively or additionally, a border could be used to mark the boundary of the region.
According to a second aspect of the invention, there is provided a system for processing a series of images, the system being arranged to identify at least a portion of an object of interest within the images. Each image may represent at least a part of an environment.
The system comprising processing circuitry may be arranged to perform at least one of the following steps: (i) obtain a first two-dimensional (2D) image of the environment, in which at least one point, having a predetermined property, may be labelled as forming at least part of the object of interest; (ii) segment the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify at least a portion of the object of interest within the first 2D image; (iii) obtain a second 2D image of the environment; (iv) propagate at least a portion of the region from the first 2D image to the second 2D image using three dimensional (3D) geometric data; and (v) segment the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
The skilled person would understand that the method may then be repeated as many times as desired. For example, the identified region in the second 2D image may be used as the seed region for segmenting a third 2D image, etc.
Optionally, the system may further comprise one or more sensors arranged to generate at least one of the first 2D image and the second 2D image. The sensor may be a camera, and may preferably be a monocular camera. The skilled person would understand that many different types of sensor may be used.
In additional or alternative embodiments, the system may further comprise a storage device arranged to store at least one of the first 2D image, the second 2D image and information relating to the predetermined property.
In some embodiments, the system may comprise a server arranged to communicate with the processing circuitry. The skilled person would understand that a storage device may be located locally to the processing circuitry, or remotely from the processing circuitry. In embodiments wherein the storage device is located remotely from the processing circuitry, the server may be used to transfer data between the processing circuitry and the storage device.
In some embodiments, the system may further comprise a vehicle on which at least some of the processing circuitry is mounted. In such embodiments, the system may comprise a vehicle-based portion and a remote portion. The system may be arranged to transfer data between the vehicle-based portion and the remote portion.
According to a third aspect of the invention, there is provided a vehicle having a sensor mounted thereon, wherein the sensor is arranged to generate two-dimensional (2D) images. Each 2D image may represent at least a portion of the environment in the vicinity of the vehicle.
The vehicle may have a processing circuitry arranged to process the 2D images. The processing circuitry may be arranged to perform at least one of the following steps: (i) obtain a first 2D image of the environment, in which at least one point, having a predetermined property, may be labelled as forming at least part of the object of interest; (ii) segment the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify the at least a portion of the object of interest within the first 2D image; (iii) obtain a second 2D image of the environment; (iv) propagate at least a portion of the region from the first 2D image to the second 2D image using three dimensional (3D) geometric data; and (v) segment the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
In some embodiments, the obtaining a first 2D image of the environment comprises obtaining a 2D image in which at least one point, having a predetermined property, is labelled as being navigable.
The skilled person would therefore understand that the predetermined property may be “navigability”, which may be defined in terms of at least some of the following characteristics, or the like: (i) colour; (ii) texture; (iii) position relative to the environment; (iv) geometry (for example, large, flat surface); and (v) markers (for example, road marking, kerb or pavement edge).
In such embodiments, the segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point may comprise segmenting the first 2D image to identify the navigable region within the first 2D image.
Further, the segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image may comprise segmenting the second 2D image to identify a navigable region in the second 2D image.
In additional or alternative embodiments, the vehicle may further comprise one or more sensors arranged to generate odometry data as the vehicle moves. The skilled person would understand that the odometry data may be used to determine movement of the vehicle from the position at which the first image was taken to the position at which the second image was taken. Data concerning the movement can be used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image. The 3D geometric data may therefore comprise odometry data.
According to a fourth aspect of the invention, there is provided a machine readable medium containing instructions which when read by a machine cause that machine to perform as at least one of the following: (i) the method of the first aspect of the invention; (ii) at least a portion of the system of the second aspect of the invention; and/or (iii) the vehicle of the third aspect of the invention.
The machine readable medium referred to in any of the above aspects of the invention may be any of the following: a CDROM; a DVD ROM / RAM (including -R/-RW or +R/+RW); a hard drive; a memory (including a USB drive; an SD card; a compact flash card or the like); a transmitted signal (including an Internet download, ftp file transfer or the like); a wire; etc.
Features described in relation to any of the above aspects of the invention may be applied, mutatis mutandis, to any of the other aspects of the invention.
There now follows, by way of example only, a detailed description of embodiments of the invention with reference to the accompanying drawings in which:
Figure 1 is a schematic view of a vehicle utilising a camera to take and process images of an environment;
Figure 2a schematically illustrates 2D segmentation and propagation of labels from frame to frame on a sequence of collected images of the environment;
Figure 2b shows a dense 3D point cloud, rendered using data from a 2D pushbroom laser, which is used in the segmentation and label propagation illustrated in Figure 2a;
Figure 2c shows the section of the dense 3D point cloud of Figure 2b which is identified as being “road” in accordance with an embodiment;
Figure 2d shows the section of the dense 3D point cloud of Figure 2b which is identified as being “not road” in accordance with an embodiment;
Figure 3a shows a schematic pipeline for road segmentation using monocular images according to an embodiment;
Figure 3b shows a close-up of two of the images shown within Figure 3a;
Figures 4a and 4b show the transfer of 2D labels to 3D point clouds according to an embodiment;
Figure 5 is a graph of running time per step for associated processes of the pipeline shown in Figure 3a: label propagation, depth-map estimation and segmentation resetting;
Figure 6 is a graph plotting the probability that segmentation has to be reset for a given traversed distance; and
Figure 7 shows an aerial view of a sample environment in which the vehicle shown schematically in Figure 1 moves, with the vehicle trajectory.
Embodiments of the invention are described in relation to a sensor 12 mounted upon a vehicle 10, as is shown schematically in Figure 1. The skilled person would understand that the vehicle 10 could be replaced by a plane, boat, aerial vehicle, robot or other vehicle, or by a person carrying a sensor 12. The sensor 12 is arranged to monitor its environment 14, 15 and generate data based upon the monitoring, thereby providing data on a sensed scene around the vehicle 10. In the embodiment being described, since the sensor 12 is mounted upon a vehicle 10, the sensor 12 is also arranged to monitor the environment 14, 15 of the vehicle 10.
Typically, the sensor 12 is a passive sensor (ie it does not create radiation and merely detects radiation) and in particular, in the embodiment being described, the sensor is a monocular camera. Thus, in the embodiment being described, the camera 12 provides images 100 (ie image data) of the environment 14, 15 through which it moves. The images 100 are acquired one after another as the vehicle moves, so may be described as a series of images.
Typically, embodiments are arranged to process the series of images 100 (or other data) and identify an object of interest within those images. In the embodiment being described, the object of interest is navigable regions within those images. In other embodiments, objects of interest may be a vehicle, a landmark, a person, a building, a geological feature, or the like.
For a land-vehicle, the navigable regions may be roads, tracks, paths or the like. For convenience and ease of description, the phrase “road” has been used but the skilled person should read this as meaning “navigable region” and realise that road is but one example.
The skilled person will appreciate that other kinds of sensor 12 could be used, and that the sensor 12 may be stationary in some embodiments and used to monitor moving objects. For example, the sensor 12 may be an active sensor arranged to send radiation out therefrom and detect reflected radiation. In embodiments wherein the sensor 12 is stationary, the sensor 12 may not be located on a vehicle 10, and may instead be connected to a building 15 or fixture (not shown).
In the embodiment shown in Figures 1, 2a and 2b, the vehicle 10 is travelling along a road 14 and the sensor 12 is imaging the environment (eg the road 14, building 15, etc.) as the vehicle 10 moves.
In this embodiment, the vehicle 10 also comprises processing circuitry 16 arranged to receive data from the sensor 12 and subsequently to process the data (in this case image data) generated by the sensor 12. Embodiments of the invention are described in relation to using RGB images 100 taken from a moving monocular camera 12. The skilled person would understand that other image types may be used.
Thus, the processing circuitry 16 receives data from the sensor 12, which data provides images of the environment around the vehicle 10. In the embodiment being described, the processing circuitry 16 also comprises, or has access to, a storage device 17 on the vehicle 10.
The lower portion of Figure 1 shows components that may be found in a typical processing circuitry 16. A processor 18 may be provided which may be an Intel® X86 processor such as an i5, i7 processor, AMD™ Phenom™, Opteron™, etc, or the like. The processor 18 is arranged to communicate, via a system bus 19, with an I/O subsystem 20 (and thereby with external networks, displays, and the like) and a memory 21.
The skilled person will appreciate that memory 21 may be provided by a variety of components including a volatile memory, a hard drive, a non-volatile memory, etc. Indeed, the memory 21 may comprise a plurality of components under the control of, or at least accessible by, the processor 18.
However, typically the memory 21 provides a program storage portion 22 arranged to store program code 24 which when executed performs an action and a data storage portion 23 which can be used to store data either temporarily and/or permanently. The data storage portion 23 stores image data 26 generated by the sensor 12. Trajectory data 25 may also be stored; trajectory data 25 may comprise data concerning a pre-programmed route and/or odometry data concerning the route taken - for example trajectory data from movement of the wheels or from parallax comparison of images taken. In additional or alternative embodiments wherein the sensor 12 is tracking a moving object, trajectory data 25 for the moving object may be stored instead of, or in addition to, trajectory data 25 for the sensor 12.
Dense 3D point-cloud data 27 is also stored in the data storage portion 23. The 3D point-cloud data is generated, for example using a 2D pushbroom laser, in advance of the image segmentation described and is used to assist in the segmentation and label propagation processes.
In other embodiments, at least a portion of the processing circuitry 16 may be provided remotely from the vehicle 10. As such, it is conceivable that processing of the data generated by the sensor 12 is performed off the vehicle 10 or partially on and partially off the vehicle 10. In embodiments in which the processing circuitry is provided both on and off the vehicle then a network connection (such as a 3G (eg UMTS - Universal Mobile Telecommunication System), 4G (LTE - Long Term Evolution) or WiFi (IEEE 802.11) or the like) may be used in order for the processing circuitry 16 on the vehicle to communicate with the remote systems.
It is convenient to refer to a vehicle 10 travelling along a road 14 but the skilled person will appreciate that embodiments need not be limited to any particular mobile apparatus or environment. Likewise, it is convenient in the following description to refer to images 100 (ie image data) generated by a camera 12 but other embodiments may generate and use other types of data.
The skilled person would understand that some embodiments do not include generation of the images 100, and may instead obtain images 100 from a separate system. The images 100 may therefore be generated in advance of implementation of such embodiments.
In the embodiment being described, the sensor 12, the processing circuitry 16 to which the sensor 12 is connected and the software running on the processing circuitry 16 form a segmentation and labelling system to segment and label the images 100 collected by the sensor 12.
As the sensor 12/vehicle 10 moves, a set of images 100 is generated. Typically, parallax between consecutive images 100 may be used to generate depth estimates for points within the images 100 as described in more detail below. Each point may correspond to a pixel of an image 100.
The embodiment being described provides an online approach to segmenting roads 14 to determine what is and what is not road within an image 100. Here “online” is intended to mean that embodiments can generate this information as the vehicle 10 is used, in order for the segmentation to be useful in navigation for that vehicle. Typically, embodiments will therefore be able to process images 100 from the camera 12 (or other sensor) at a rate of at least a few hertz, which may be substantially any of the following, or values in-between: 0.2Hz; 0.5Hz; 5Hz; 10Hz; 15Hz; 20Hz; 30Hz; 50Hz or greater. The skilled person would understand that there is no set upper limit to processing speed, as processing speed is determined by the processing power available. However, to be commercially useful, it is likely that embodiments should be able to perform on-line on processing power that may typically be found in a laptop or similar.
Thus, in the embodiment being described, a labelled region, generated from the segmentation, is a region of an image 100 identified as having a predetermined property, which may be, as in the example being described, what is road. Here regions are labelled (in this embodiment as being road) if they have at least one predetermined property. In the embodiment being described, the label is “road” and the predetermined property is one or more characteristics associated with the label “road”, which may comprise geometric data (eg a large, flat surface expected to be located below the sensor 12) and visual data (eg illumination invariant colour information, texture). As such, more than one predetermined characteristic may need to be present in a region for that region to be labelled with the label (eg road).
To initialise the method of the embodiment being described, a region 104a, which may be a single point, or is more typically a number of points, is labelled within an image 100. The image 100 comprises multiple pixels; each pixel may be thought of as a point within the image 100 and thus the region 104a may comprise one or more pixels.
It is conceivable that the labelled region 104a may be manually labelled. However, other embodiments, including the one being described automatically label the region 104a using image recognition software. Prior information, such as information as to where the ground level is, can be used to assist automatic labelling. The skilled person would understand that the initial seed label can be provided by any known method. The initial labelled region 104a acts as a seed for propagating the region to any other points within the image 100 which have the same predetermined property, ie which correspond to the labelled region/point 104a. In the Figure, this propagation creates a larger region 104b, when compared to region 104a; ie region 104 is expanded from the initial seed region 104a to the larger region 104b. The identification of the larger region 104b corresponding to the seed labelled region 104a is referred to as segmentation of the image 100.
Here, in the embodiment being described, segmentation of the images 100 to determine the extent of the region 104a/104b is performed by statistical colour segmentation. In alternative embodiments, different or additional statistical segmentation techniques can be used. For example statistical texture segmentation and/or histogram separation may be used. Statistical texture segmentation techniques use local colour scale variations. Textons, giving information on small-scale structure within images, may also be used.
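As a rough, simplified sketch of seeded statistical colour segmentation (not the spatially varying model used later in the detailed embodiment), the example below fits a single Gaussian colour model to the seed pixels and accepts every pixel within a Mahalanobis distance threshold; the single-Gaussian model and the threshold value are assumptions made for illustration.

```python
import numpy as np

def segment_from_seed(image, seed_mask, max_mahalanobis=3.0):
    """Grow a labelled region from a seed using a Gaussian colour model.

    image:     H x W x 3 float array (e.g. illumination-invariant colour).
    seed_mask: H x W boolean array marking the seed pixels (e.g. "road").
    Returns a boolean mask of all pixels consistent with the seed colours.
    """
    seed_pixels = image[seed_mask].reshape(-1, 3)
    mean = seed_pixels.mean(axis=0)
    cov = np.cov(seed_pixels, rowvar=False) + 1e-6 * np.eye(3)
    cov_inv = np.linalg.inv(cov)

    diff = image.reshape(-1, 3) - mean
    # Squared Mahalanobis distance of every pixel to the seed colour model.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return (d2 < max_mahalanobis ** 2).reshape(image.shape[:2])
```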
Labelled regions are propagated from frame to frame (ie from image to image) using depth priors of the environment. The skilled person will understand that label propagation reduces user interaction requirements, as there is no requirement to re-seed a labelled region for “road” in subsequent images, as the labels (ie regions) are propagated from image to image.
The labelled regions generated from the segmentation are then transferred to 3D point-clouds (for example 3D point-clouds generated from Lidar data). The skilled person will understand that the transfer of labels reduces the complexity of state-of-the-art segmentation algorithms running on 3D Lidar data.
Thus, in the embodiment being described, the use of geometric priors (in the form of 3D point-clouds in the embodiment being described) is combined with segmentation (statistical colour segmentation in the embodiment being described). Unlike approaches in the prior art, the embodiments described do not require any assumptions about the type of road 14. Instead, it is assumed that the road images can be captured from an arbitrary camera orientation.
The embodiment being described comprises three stages: 1. Automatic propagation of labels from image to image using depth priors of the environment; 2. Segmentation of each image according to the labels (in the example given, “road” and “not road”/background); and 3. Transferral of the labelled regions to 3D laser point clouds.
Figures 2a and 2b illustrate the method of an embodiment schematically. The segmentation and labelling system is provided with a dense 3D point cloud (Figure 2b, 140) corresponding to the environment 14, 15 through which the vehicle 10 is moving and with images 100 from the sensor 12. 2D segmentation is performed on the images 100 and labels are propagated from frame to frame on a sequence 310 of collected images 100. The segmentation of the 3D point cloud 140 into two point cloud sections 141, 142 is shown in Figures 2c and 2d, which have been shown separately for reasons of clarity so that the segmented road 14 can be seen clearly in relation to the surroundings 15 from which the road 14 has been segmented. Some embodiments may process each image 100 generated from the sensor 12. However, other embodiments may drop images and process only a sub-set of images so generated.
In the embodiment being described, image 102 (termed I_Sk-1), in which a region has been labelled (ie a labelled image), serves as an initial seed (ie provides a seed region) for a two-region segmentation which generates a region deemed to correspond to road and a region deemed not to be road. The skilled person would understand that, in additional or alternative embodiments, more than two regions may be seeded and generated. A region 104 within image I_Sk-1 102 is labelled as being “road”. The image I_Sk-1 102 is then segmented to form segmented image I_φk-1 106 in which the regions generated by the segmentation are highlighted. Seed region 104a is extended to the region the algorithm of the embodiment determines to be “road” - ie to the entire region labelled 104b in Figure 2a. A depth-map 108 corresponding to image I_Sk-1 102 is used in conjunction with the segmented image I_φk-1 106 for label propagation. Road labels are transferred from 2D to 3D.
Here a depth-map is intended to mean a record of the distance of the surfaces of objects within the environment observed by the sensor 12 from a reference associated with the sensor 12. The reference may be a point reference, such as a point based around the sensor 12, or may be a reference plane. The distance to the surface may be recorded in any suitable manner. However, a depth map commonly has depth information associated with each pixel within an image.
In the embodiment being described, the depth map used is an inverse depth map. The skilled person would understand that, in other embodiments, the depth map used may not be an inverse depth map. Whether or not the depth map used is an inverse depth map, the depth data is stored for elements within the images that were used to generate the depth map. In an inverse depth map, depth values are in the range zero to one, with lower numbers indicating a greater depth. In depth maps which are not inverse depth maps, depth values can extend to infinity (or at least to the greatest value that can be stored in the number format being used), and higher values indicate a greater depth. The skilled person would understand that rescaling depths over the range zero to one can facilitate calculations.
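A minimal sketch of the relationship between metric depth and the inverse depth representation described above is given below; the clipping range is an assumption introduced only to keep the example self-contained.

```python
import numpy as np

def to_inverse_depth(depth, max_depth=100.0):
    """Convert metric depths (metres) to inverse depths in (0, 1].

    Larger metric depth -> smaller inverse depth, matching the convention
    described above. Depths are clipped to [1, max_depth] (an assumption)
    so that the inverse values stay within a convenient range.
    """
    depth = np.clip(depth, 1.0, max_depth)
    return 1.0 / depth

def to_metric_depth(inv_depth, eps=1e-6):
    """Recover metric depth from inverse depth, guarding against zeros."""
    return 1.0 / np.maximum(inv_depth, eps)
```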
Time propagation of labels is performed by exploiting geometric information: dense depth estimation is run using a pair of sequential monocular images. At least a portion of the region 104b identified in the first image 100 is projected to the second image using the 3D geometric data.
In the embodiment being described, the 3D geometric data used comprises a geometric assumption about the environment - namely that the environment comprises affine surfaces. Additionally, the 3D geometric data comprises parallax measurements obtained from sequential 2D images. In other embodiments, the 3D geometric data may comprise one or more of the following, or the like: (i) a depth-map of at least a portion of the environment; (ii) data obtained using the first and second images and parallax; (iii) one or more geometrical assumptions about the environment; (iv) odometry data generated by sensors on the vehicle concerning movement between one image 100 and the next image in the series; and/or (v) global positioning data such as GPS, GLONASS, QZSS and the like.
Regions in the second image corresponding to the region in the first image can therefore be identified and provided with the same label (ie the label is propagated using the projected at least a portion of the region of the first image 100). The projection of at least a portion of the region 104b identified in the first image 100 to the second image, and propagation of the label, such that the corresponding at least a portion of the region 104b identified in the second image is given the label from region 104b in the first image, may be described as propagating the at least a portion of the region from the first image to the second image.
The identified region then serves as a seed 104a for segmentation of the second image; the process can be repeated for subsequent images.
Binary image I_hk 110 is then formed. The resulting image I_Sk 112 is fed into the multi-labelling optimisation process, producing the desired segmented image I_φk 114. In at least some embodiments, the binary image 110 reduces the size of the labelled region 104b by excluding border areas where certainty of identification is lower. A smaller, more certain seed 104a is therefore provided for generation of the image 114. The skilled person will understand that this can reduce the risk of mis-identifying pavement as road, for example, which can appear similar.
In the embodiment being described, a 3D prior generated from a laser point-cloud is used. Using a calibration between the current position of the vehicle 10 and the point-cloud, the 3D point-cloud obtained from a pushbroom laser is projected onto the corresponding image frame. As a result, a label is assigned to each projected point of the 3D point-cloud using the label at the corresponding pixel coordinate of the segmented image I_φk.
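A minimal sketch of this 2D-to-3D label transfer is given below, assuming a pinhole projection model and a known transform from the point-cloud frame into the camera frame; the function and variable names are placeholders rather than anything specified in the patent.

```python
import numpy as np

def transfer_labels_to_cloud(points_world, T_cam_world, K, label_image):
    """Assign 2D segmentation labels to a 3D point cloud.

    points_world: N x 3 laser points in the point-cloud (world) frame.
    T_cam_world:  4 x 4 transform taking world coordinates into the camera frame.
    K:            3 x 3 camera intrinsic matrix.
    label_image:  H x W integer label image (e.g. 1 = road, 0 = not road).
    Returns an N-vector of labels, with -1 for points that lie behind the
    camera or project outside the image.
    """
    h, w = label_image.shape
    homog = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_world @ homog.T).T[:, :3]

    labels = np.full(len(points_world), -1, dtype=int)
    in_front = pts_cam[:, 2] > 0.1
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[valid]
    labels[idx] = label_image[v[valid], u[valid]]
    return labels
```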
The embodiment being described uses per pixel inverse depth estimates as a geometric prior for road segmentation from a sequence of images 100.
Dense inverse depth-map estimation is performed from a variational method perspective. An energy function is optimised based on a data fidelity term that measures the photoconsistency over a set of small-baseline, monocular frames.
The embodiment being described has a forwards-facing camera 12 mounted on a vehicle 10 travelling forwards and sensing distant objects with low parallax. Thus, embodiments are processing the images with less information (ie lower parallax) when compared to prior art methods, which often involve moving the camera from side to side to increase parallax. The lack of lateral (sideways) motion of the camera 12 reduces the parallax, so creating difficulties for depth estimation. As such, an improved regularisation method is used to reinforce depth on critical parts of the scene.
In the embodiment being described, a suitable assumption is to expect to find many affine surfaces in the environment, like roads 14, pathways, building facades 15 or vehicle surfaces. A Total Generalised Variation (TGV) regularisation is applied which has been shown to favour piecewise linear regions of a structure (see, for example, K. Bredies, K. Kunisch, and T. Pock, “Total generalized variation”, SIAM J. Img. Sci., vol. 3, no. 3, pp. 492-526, Sep. 2010).
With regards to image segmentation, a multitude of algorithms are widely used for specific tasks such as image editing, detection of regions of interest in medical images or object tracking in video sequences. The most popular prior art approaches try to efficiently compute minimum energy solutions for cost functions, using graph cuts (see, for example, Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222-1239, 2001), level set methods (see, for example, T. Brox, M. Rousson, R. Deriche, and J. Weickert, “Colour, texture, and motion in level set based segmentation and tracking”, Image Vision Comput., vol. 28, no. 3, pp. 376-390, 2010), random walks (see, for example, L. Grady, “Random walks for image segmentation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1768-1783, 2006), and convex relaxation techniques (see, for example, T. Pock, A. Chambolle, D. Cremers, and H. Bischof, “A convex relaxation approach for computing minimal partitions”, in 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, USA. IEEE, pp. 810-817).
Embodiments employing these methods combine two important concepts in their energy models: 1. A data fidelity term that measures how well a pixel fits to each label; and 2. A regularisation term that measures the consistency of the segmentation with respect to some prior knowledge (a prior).
Examples of priors are the object boundary length, the number of labels, specific intra-label cost functions and label co-occurrence.
In the embodiments being described herein, a similar approach to that used in C. Nieuwenhuis and D. Cremers, “Spatially varying color distributions for interactive multilabel segmentation”, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1234-1247, 2013, is used to reduce the intra-label variability caused by different road textures and lighting conditions.
The embodiments described extend current vision-based approaches for online road segmentation by spatiotemporally relating camera and laser data (ie using the segmentation performed on the images to locate regions within the 3D prior - ie 3D point-cloud in the embodiment being described). The problem of transferring the road labels learnt on the 2D image sequence 310 to 3D laser point clouds (eg Figure 2b) is discussed under the assumption of perfect synchronisation and accurate camera-laser extrinsic calibration below.
Below, the pipeline 300 of the vision-based system for road segmentation of an embodiment is described, and how the depth prior can be used to propagate labels between consecutive frames in the image sequence 310 is explained.
SYSTEM DESCRIPTION
The pipeline 300 used in the embodiment being described for road detection from multiple input images 100, 310 is visualised schematically in Figure 3a.
In the embodiment being described, the main steps, which are described in more detail below, comprise: 1. Obtaining an image 100 in a sequence of images 310; 2. Labelling a region within the image; 3. Segmenting 330 the image 100 in accordance with the or each label; 4. In parallel with the labelling 320 and segmentation 330, estimating 340 a depth-map from the sequence of images 310, using parallax; 5. Predicting 350 locations of labels for a next sequential image in the sequence of images 310; 6. Automatically labelling 320 the next sequential image by propagating the labels from the previous image 100; and 7. Obtaining the next image in the sequence and returning to step (3) for that image.
Steps (3) to (7) are repeated until all images in the sequence 310 have been segmented 330 and labelled 320. Step (1) may continue throughout, with new images 100 being added to the sequence of images 310. It is conceivable that the method fails to correctly propagate a labelled region between images in which case the algorithm may return to step (1).
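Purely to make the control flow of steps (1) to (7) concrete, the skeleton below strings the stages together; every callable passed in stands for one of the processes described in this section, and none of the names come from the patent itself.

```python
def run_pipeline(images, make_seed, segment, estimate_depth, propagate, needs_reset):
    """Skeleton of the segmentation and label-propagation loop (steps 1-7).

    Every argument after `images` is a caller-supplied callable standing in
    for a process described in the text; the names are placeholders.
    """
    results = []
    seed, prev_image = None, None
    for image in images:                           # step (1)/(7): next image
        if seed is None:
            seed = make_seed(image)                # step (2): initial seed label
        segmentation = segment(image, seed)        # step (3): segment this image
        depth_map = (estimate_depth(prev_image, image)
                     if prev_image is not None else None)       # step (4)
        results.append(segmentation)
        seed = propagate(segmentation, depth_map)  # steps (5)-(6): warp labels
        if needs_reset(segmentation):              # reset 360 if propagation fails
            seed = None
        prev_image = image
    return results
```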
The labelling process in Step (2) is the initial creation of a seed label which is propagated to all subsequent images 310.
The labelling process in Step (6) projects the labelled region into the next sequential image. The corresponding region in the next sequential image is labelled with the label of the region in the first image 100.
In the embodiment being described, switches 355 are provided to enable selection of a segmentation reset step 360. The skilled person will understand that, although the segmentation 330 and label propagation 320, 350 process (Steps 3 to 7) is implemented to run automatically, static camera motions (for example due to the vehicle 10 stopping) can affect the accuracy of the depth-map and therefore, the label prediction 350, necessitating a system reset 360 - ie a return to Step (1). The skilled person will appreciate that these switches may be used as an automatic detection that the labelled region is not being correctly transferred from one image to the next.
From a front-facing camera 12 mounted on a vehicle 10, it is assumed that a sequence 310 of n RGB images I_1, ..., I_n : Ω ⊂ R² → R is obtained, of which 100 is an example image. In a sample implementation of the embodiment being described, the images 310 were collected during a driving trial in a semi-urban environment.
The corresponding camera poses T_1, ..., T_n ∈ SE(3) are estimated in practice from an onboard scaled Visual Odometry system (see, for example, M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman, “The new college vision and laser data set”, The International Journal of Robotics Research, vol. 28, no. 5, pp. 595-599, May 2009, available: http://www.robots.ox.ac.uk/NewCollegeData). The skilled person will know that SE(3) is the Special Euclidean group used for the kinematics of a rigid body in classical mechanics.
In the embodiment being described, the pipeline 300 initially creates a dense inverse depth-map ξ(u) 108 from consecutive monocular images 310. Unlike prior art approaches where longer sequences of images are integrated for accurate depth-map estimation, in the embodiment being described, a pair of images I_k-1, I_k is sufficient. This choice enables depth to be estimated for dynamic objects (for example cars - particularly important in urban environments), which could be potentially disregarded by prior art long sequence integration approaches.
Then, road labelling 320 is performed on the reference image I_k-1 (100), where the depth-map 108 is created.
The labelling process 320 described above can be carried out by performing a projection of a viewing frustum onto the ground plane using the depth estimates of the central image pixels.
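One plausible way to realise this initial labelling step, assuming a camera at a known height above a locally flat ground plane, is sketched below; the ground-plane assumption, the numeric values and the variable names are illustrative only and are not taken from the patent.

```python
import numpy as np

def initial_seed_from_ground(inv_depth, K, max_range=15.0, ground_band=0.3,
                             camera_height=1.5):
    """Label an initial "road" seed using depth estimates and a ground plane.

    inv_depth:     H x W inverse depth map for the current image.
    K:             3 x 3 camera intrinsic matrix.
    camera_height: assumed height of the camera above the ground (metres).
    Pixels whose back-projected points lie within `max_range` of the camera
    and within `ground_band` of the assumed ground plane are marked as seed.
    All numeric defaults here are assumptions.
    """
    h, w = inv_depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = 1.0 / np.maximum(inv_depth, 1e-6)           # metric depth per pixel

    # Back-project every pixel to a 3D point in the camera frame (+y downwards).
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z

    near = z < max_range
    on_ground = np.abs(y - camera_height) < ground_band
    return near & on_ground
```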
The output of the labelling process 320 is a labelled image I_Sk-1 102 that serves as a seed for a two-region segmentation 330. The skilled person would understand that, in alternative or additional embodiments, more than two regions may be segmented.
In the segmentation step 330, the spatial variation of colour distribution is considered in a general Bayesian maximum a posteriori probability (MAP) estimation approach, which allows different textures and lighting conditions on the road to be handled. It will be appreciated that embodiments that can cope with varying lighting are advantageous as, in real life, a road will often look very different from day-to-day. At this point, a new image I_φk-1 106 with the desired labelled road region 104 is available. When the next image in the image sequence 310 (the next frame) arrives, both background (subscript b) and road (subscript r) labels are warped into the next frame as follows:
First, the classified (ie labelled) pixels are back-projected to 3D space:
...Eq. (1) where π⁻¹ refers to the back-projection of a pixel u with inverse depth ξ and intrinsic calibration matrix K.
Then, the perspective projection of the labelled points is calculated:
...Eq. (2)
Equation 3, below, provides the new set of labelled pixels that will be used as initialisation for road segmentation 330 on image frame Ik. To prevent the incorrect propagation of labels owing to depth inaccuracies, the intra-label variability is reduced using morphology regularisation on pixels at the label boundaries.
The morphology regularisation operation is synthesised as:
...Eq. (3) where I_h is a binary image rendered from the predicted label pixels, B is a structuring element and B_w is the translation of B by the vector w. In practice, it is assumed that the structure of B is isotropic with the centre located at the origin of Ω.
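The warping of Eqs. (1) and (2), followed by the morphological shrinking of Eq. (3), might be sketched as follows, assuming a pinhole camera model and a relative pose obtained from visual odometry; the erosion radius and the helper names are assumptions made so the example is concrete.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def warp_labels(label_mask, inv_depth, K, T_k_km1, erosion_radius=2):
    """Warp a binary label mask from frame k-1 into frame k.

    label_mask: H x W boolean mask of labelled ("road") pixels in frame k-1.
    inv_depth:  H x W inverse depth map for frame k-1.
    K:          3 x 3 camera intrinsic matrix.
    T_k_km1:    4 x 4 pose taking frame k-1 camera coordinates into frame k.
    Back-projects labelled pixels (Eq. 1), reprojects them into frame k
    (Eq. 2) and erodes the result to drop unreliable border pixels (Eq. 3).
    """
    h, w = label_mask.shape
    v, u = np.nonzero(label_mask)
    xi = np.maximum(inv_depth[v, u], 1e-6)

    # Eq. (1): back-project pixels to 3D using inverse depth.
    pix = np.stack([u, v, np.ones_like(u)], axis=0).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix) / xi             # 3 x N points in frame k-1
    pts = np.vstack([pts, np.ones((1, pts.shape[1]))])

    # Eq. (2): transform into frame k and project back to pixels.
    pts_k = (T_k_km1 @ pts)[:3]
    uv = K @ pts_k
    u_k = np.round(uv[0] / uv[2]).astype(int)
    v_k = np.round(uv[1] / uv[2]).astype(int)
    ok = (uv[2] > 0) & (u_k >= 0) & (u_k < w) & (v_k >= 0) & (v_k < h)

    warped = np.zeros_like(label_mask)
    warped[v_k[ok], u_k[ok]] = True

    # Eq. (3): isotropic morphological erosion to remove uncertain borders.
    struct = np.ones((2 * erosion_radius + 1, 2 * erosion_radius + 1), bool)
    return binary_erosion(warped, structure=struct)
```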
The resulting image I_Sk 112 is fed into the multi-labelling optimisation process, producing the desired segmented image I_φk 114.
Although the pipeline 300 runs automatically, the possibility of resetting 360 the pipeline 300 when segmentation 330 fails is provided in at least some embodiments.
Below, the techniques used in the pipeline 300 to create dense depth-maps 108 and to perform multi-labelling optimisation are described.
A. Total Generalised Variation (TGV) Depth-map calculation
To create a depth-map ξ(u) 108 from a pair of consecutive images [I_k-1(u), I_k(u)], a variational approach similar to the one presented by R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, cited previously, is followed. However, in the embodiment being described, the camera 12 is mounted on the front of a moving vehicle 10 and facing a distant horizon, which provides a constraint that must be taken into account.
Therefore, for most of the pixels (particularly those that correspond to distant objects or points) there is insufficient parallax for reliable depth estimation. Therefore, for estimating the depth, a regulariser 28 containing some prior knowledge of the environment is required.
In the environment 14, 15 used for the embodiment being described, a reasonable assumption is that affine surfaces should be found in the environment, such as roads 14, pathways, building facades 15 or vehicle surfaces. For this reason, the regulariser 28 is implemented as a Total Generalised Variation (TGV) norm that favours piecewise linear solutions.
The energy function to be minimised is given by:
.....Eq. (4) where E_D(ξ) is a nonconvex data term that calculates the average photometric error ρ between the new image I_k(u) and a warping of the previous image I_k-1(u):
.....Eq. (5) and E_R(ξ) is the TGV regularisation term given by:
.....Eq. (6)
By introducing an additional variable v, TGV can intrinsically yield a balance between the zero and first order derivatives of the solution signal. This property allows the generalisation of the piecewise constant behaviour of the classical TV norm and favours the reconstruction of piecewise affine surfaces. The TGV regularisation term depends on two constants, α1 and α2, that control the piecewise smoothing.
Equation 6 provides some intuition about why TGV regularisation favours piecewise affine functions. Think of v as the slope of ξ. If ξ is piecewise linear then v should be a piecewise constant signal, which explains the TV ||∇v||_1 penalty term for v.
Regarding the first term ||∇ξ - v||_1, if v properly estimates the slope of an affine region of ξ then the contribution to the energy cost will be zero because ∇ξ = v, and the only cost due to the penalty term will be the TV of v.
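To make the role of the two TGV terms concrete, the sketch below evaluates α1·||∇ξ - v||_1 + α2·||∇v||_1 for a discrete inverse depth map ξ and auxiliary slope field v using forward differences; the discretisation and the default weights are assumptions, and the patent's exact energy is the one referred to in Eqs. (4) to (6).

```python
import numpy as np

def grad(f):
    """Forward-difference gradient of a 2D field, returned as (df/dy, df/dx)."""
    gy = np.zeros_like(f)
    gx = np.zeros_like(f)
    gy[:-1, :] = f[1:, :] - f[:-1, :]
    gx[:, :-1] = f[:, 1:] - f[:, :-1]
    return gy, gx

def tgv_energy(xi, v, alpha1=1.0, alpha2=2.0):
    """Second-order TGV regularisation cost for an inverse depth map xi.

    xi: H x W inverse depth map.
    v:  H x W x 2 auxiliary field approximating the slope (gradient) of xi.
    Returns alpha1 * ||grad(xi) - v||_1 + alpha2 * ||grad(v)||_1, which is
    zero on affine (piecewise linear) regions where v matches the slope of xi.
    """
    gy, gx = grad(xi)
    term1 = np.abs(gy - v[..., 0]).sum() + np.abs(gx - v[..., 1]).sum()

    vgy0, vgx0 = grad(v[..., 0])
    vgy1, vgx1 = grad(v[..., 1])
    term2 = (np.abs(vgy0).sum() + np.abs(vgx0).sum()
             + np.abs(vgy1).sum() + np.abs(vgx1).sum())
    return alpha1 * term1 + alpha2 * term2
```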
The optimisation problem in Eq. 4 is solved using an iterative alternating optimisation method, as explained in the paper of R. A. Newcombe, S. J. Lovegrove, and A. J. Davison cited previously, that is based on an exhaustive search step which involves the data term E_D(ξ), and a Primal-Dual algorithm (see the “A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging” paper of A. Chambolle and T. Pock, cited previously) involving the regularisation term E_R(ξ).
B. Multilabel optimisation for road detection
In the embodiment being described, the reference image I_k 112 and the corresponding depth map ξ_k 108 are considered. The image is to be split up into n_l pair-wise disjoint regions (eg the road and background, n_l = 2 in the embodiment described):
...Eq. (7)
Regions are pairwise disjoint (or “mutually disjoint”) if every two different regions are disjoint, ie have no point or region in common. Each point/pixel of an image can therefore only be present in a single region.
The problem is solved by assigning a label i ∈ {1, ..., n_l} to each pixel u, such that Ω_i = {u | φ(u) = i}, where φ_i is an indicator function defined as:

φ_i(u) = 1 if u ∈ Ω_i, and φ_i(u) = 0 otherwise ...Eq. (8)

A challenge is to find an optimal label configuration for all image pixels among all possible configurations. To efficiently solve the problem a general Bayesian estimation approach is adopted to compute the segmentation in a MAP sense,

arg max_φ P(φ | I) = arg max_φ P(I | φ) P(φ) ...Eq. (9)

where P(I | φ) and P(φ) are the likelihood and prior probabilities over the colour and region functions.
To deal with different textures and lighting conditions on the road, the likelihood is modelled as:

P(I | φ) = Π_{i=1..n_l} Π_{u ∈ Ω_i} P(I(u), u | Ω_i) ...Eq. (10)

The term inside the product denotes the joint probability of observing a colour value I at spatial location u ∈ Ω_i, which can be approximated using Gaussian kernels with variances σ and ρ on the colour and location variables respectively.
Here the initial seed ISk 112 is used, calculated automatically from the prediction Ihk 110. In the following, the index k is dropped to simplify notation.
Considering the set of points with label i as S_i = {I_s, u_s}, the joint probability is computed as:

P(I(u), u | Ω_i) = (1/|S_i|) Σ_{(I_s, u_s) ∈ S_i} k_σ(I(u) − I_s) k_ρ(u − u_s) ...Eq. (11)

where k_σ and k_ρ are Gaussian kernels and σ and ρ are set up as described in the paper of Nieuwenhuis and Cremers, cited above. The segmentation problem in Eq. 9 also requires the specification of a prior, P(φ), over all regions. A common choice is to use priors that favour regions of shorter length such as:
P(φ) ∝ exp( − Σ_{i=1..n_l} ∫_Ω g(u) ||∇φ_i(u)||_1 du ) ...Eq. (12)
Eq. 12 considers the perimeter of each region, measured by g||∇φ_i||_1, also known as the (weighted) Total Variation of the region represented by φ_i. In practice, g is a function of the form exp(−γ||∇I(u)||_2), typically used to promote edges. A more general formulation of the Total Variation is:

∫_Ω g(u) ||∇φ_i(u)||_1 du = sup_{Ψ_i : ||Ψ_i(u)|| ≤ g(u)} ∫_Ω φ_i(u) div Ψ_i(u) du ...Eq. (13)

where Ψ_i is the dual variable of region φ_i.
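As an illustration of the likelihood model of Eqs. 10 and 11, the sketch below evaluates, for every pixel, a spatially varying colour/location likelihood for one region from a small set of seed points using Gaussian kernels. The image, the seed set and the kernel widths σ and ρ are illustrative assumptions; in the embodiment described these are set up as in the paper of Nieuwenhuis and Cremers, cited above, and the computation is performed on the GPU.

```python
import numpy as np

def region_likelihood(image, seed_px, seed_colours, sigma=0.2, rho=20.0):
    """Per-pixel kernel-density estimate P(I(u), u | region), in the spirit of Eq. 11.
    image: HxWx3 float array; seed_px: Nx2 array of (row, col) seed coordinates;
    seed_colours: Nx3 array of the colour values observed at those seeds."""
    h, w, _ = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    lik = np.zeros((h, w))
    for (r, c), col in zip(seed_px, seed_colours):
        colour_dist2 = ((image - col) ** 2).sum(axis=2)
        spatial_dist2 = (rows - r) ** 2 + (cols - c) ** 2
        lik += np.exp(-colour_dist2 / (2 * sigma ** 2)) * np.exp(-spatial_dist2 / (2 * rho ** 2))
    return lik / len(seed_px)

# Toy usage: dark "road" pixels in the lower half of the image, brighter background above.
h, w = 60, 80
image = np.ones((h, w, 3)) * 0.8
image[h // 2:, :, :] = 0.2
road_seeds = np.array([[50, 10], [55, 40], [52, 70]])
road_lik = region_likelihood(image, road_seeds, image[road_seeds[:, 0], road_seeds[:, 1]])
print(road_lik[55, 40] > road_lik[5, 40])  # True: road pixels score higher for the road region
```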
By substituting Equations 10 to 13 into Eq. 9, the equivalent optimisation problem with linear constraints is obtained:

...Eq. (14)

...Eq. (15)

...Eq. (16)
Recently, T. Pock and A. Chambolle, "Diagonal preconditioning for first order primal-dual algorithms in convex optimization", in IEEE Int. Conf. on Computer Vision (ICCV), 2011, pp. 1762-1769, have shown that a suboptimal stable solution for the multi-labelling problem (and an optimal one for the two-region case) can be found by introducing the Lagrange multipliers F(u), such that the energy function can be rewritten as follows:
...Eq. (17)
An iterative primal-dual algorithm, applied to the saddle-point formulation in Eq. 17, is summarised in the following set of equations:

...Eq. (18)

where τ, ω and μ are the lengths of the gradient steps.
In practice, each variable is updated by performing pixel-wise calculations, while the gradient operator ∇ is approximated by finite differences. The parameters τ, ω and μ are set to 1/2, 1/4 and 1/5 respectively through the use of preconditioning, as described in the IEEE paper of T. Pock and A. Chambolle, cited above.
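For illustration, a minimal sketch of the forward-difference gradient and backward-difference divergence operators of the kind commonly used in such pixel-wise primal-dual updates is given below, together with a check of the adjoint relation they satisfy. The exact discretisation and the update equations of Eq. 18 used in the embodiment are not reproduced here; this is an assumed, generic discretisation.

```python
import numpy as np

def gradient(f):
    """Forward differences; the last row/column difference is set to zero."""
    gy = np.zeros_like(f)
    gx = np.zeros_like(f)
    gy[:-1, :] = f[1:, :] - f[:-1, :]
    gx[:, :-1] = f[:, 1:] - f[:, :-1]
    return gy, gx

def divergence(py, px):
    """Backward-difference divergence, the negative adjoint of `gradient`."""
    div = np.zeros_like(py)
    div[0, :] += py[0, :]
    div[1:-1, :] += py[1:-1, :] - py[:-2, :]
    div[-1, :] += -py[-2, :]
    div[:, 0] += px[:, 0]
    div[:, 1:-1] += px[:, 1:-1] - px[:, :-2]
    div[:, -1] += -px[:, -2]
    return div

# Sanity check of the adjoint relation <grad f, p> = -<f, div p>,
# on which primal-dual updates of this kind implicitly rely.
rng = np.random.default_rng(0)
f = rng.standard_normal((16, 16))
py, px = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
gy, gx = gradient(f)
lhs = (gy * py).sum() + (gx * px).sum()
rhs = -(f * divergence(py, px)).sum()
print(np.allclose(lhs, rhs))  # True
```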
Analogous to the depth-map estimation problem, the primal-dual approach followed allows advantage to be taken of general-purpose graphics processing unit (GPU) hardware for parallel computing. For a detailed derivation of these update equations, the interested reader is referred to the IEEE paper of T. Pock and A. Chambolle, cited above.
TRANSFER OF 2D ROAD LABELS TO 3D LASER POINT CLOUDS
Autonomous transport systems can be envisioned in which vehicles 10 equipped with monocular cameras 12 are provided with a labelled 3D laser point cloud 140 as a prior to support advanced driver assistance systems. The skilled person would understand that the propagation of labels described herein could be used to provide labels for such a point cloud 140, as described below.
The transfer of 2D labels to 3D point-clouds may be carried out by re-projection of the 3D points onto the image plane at the camera reference frame T_r, r ∈ W, in which the point-cloud is represented. A label is assigned to each laser point from the corresponding segmented image using the pixel coordinates of the projected points. Figures 4a and 4b depict the labels propagated from 2D images to 3D point clouds for two regions.
Given the camera-laser calibration and proper sensor synchronisation, the 3D point cloud is projected onto the image plane of the reference frame. As a result, each projected point is assigned the label of the pixel at the corresponding coordinates in the segmented image 406a, 406b.
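A minimal sketch of this re-projection and label-assignment step is given below: 3D laser points expressed in the camera frame are projected through a pinhole model, and each point that lies in front of the camera and projects inside the image is assigned the label of the corresponding pixel of the segmented image. The intrinsic matrix, image size and handling of points that do not project into the image are illustrative assumptions; the embodiment relies on the actual camera-laser calibration and sensor synchronisation described above.

```python
import numpy as np

def label_point_cloud(points_cam, labels_img, K, unknown=-1):
    """Assign to each 3D point (Nx3, camera frame) the label of the pixel it projects to.
    labels_img: HxW integer label image (e.g. road / not road); K: 3x3 intrinsic matrix."""
    h, w = labels_img.shape
    out = np.full(len(points_cam), unknown, dtype=int)
    in_front = points_cam[:, 2] > 0.0              # keep points in front of the camera
    p = points_cam[in_front]
    uv = (K @ p.T).T                               # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    u = np.round(uv[:, 0]).astype(int)             # column index
    v = np.round(uv[:, 1]).astype(int)             # row index
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]
    out[idx] = labels_img[v[inside], u[inside]]
    return out

# Toy usage: label image with "road" (1) in the lower half and "not road" (0) above.
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
labels_img = np.zeros((240, 320), dtype=int)
labels_img[120:, :] = 1
points = np.array([[0.0, 1.0, 5.0],    # below the optical axis -> lower half (road)
                   [0.0, -1.0, 5.0],   # above the optical axis -> upper half (not road)
                   [0.0, 0.0, -2.0]])  # behind the camera -> unknown
print(label_point_cloud(points, labels_img, K))  # [ 1  0 -1]
```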
In Figures 4a and 4b, depth-map estimates 408a, 408b are respectively shown for two separate images. 2D road segmentation 406a, 406b is shown for the same two reference images. The result of the transfer of labels to corresponding 3D point clouds 440a, 440b is also shown. “Road” 450a, 450b and “not road” 460a, 460b are distinguished. The 3D point clouds 440a, 440b can therefore be separated into two sections, 441a, 441b and 442a, 442b. As with Figures 2c and 2d, two images are shown for reasons of clarity and to ensure that the road 450a, 450b can be clearly distinguished from the non-road areas 460a, 460b.
In the embodiment being described, labels from 2D images 406a, 406b, taken by a monocular camera 12, are transferred to a 3D point cloud 440a, 440b. Labels on data obtained from one sensor (the monocular camera 12) are therefore transferred to data from a different sensor (here, a pushbroom laser or other sensor used to generate the point cloud 440a, 440b). The skilled person would understand that, in alternative or additional embodiments, labels may be transferred to data from other sensors. For example, labels could be transferred to infra-red images, 3D models, sonar data, or the like.
In order to demonstrate the robustness and scalability of the road segmentation pipeline 300 disclosed herein, experiments were performed on an outdoor image sequence 310 gathered from a 65x50 Field of View (FOV) monocular camera 12 at 25 Hz. The camera 12 was mounted on the roof of a car 10, looking forwards in the direction of vehicle motion.
In the sample implementation of the embodiment described, the sequence 310 is composed of 13,600 images of resolution 512x84 collected in a village. The full trajectory is approximately 10 km long from the initial position. The dense mapping approach and the multi-labelling optimisation process are implemented in CUDA C++. The whole pipeline runs on a laptop computer equipped with an Intel i7 processor at 2.3 GHz and an NVIDIA GeForce GT 750M graphics card with 2048 MB of device memory.
Figure 5 shows a time assessment of the online road segmentation. The graph depicts the contribution of the depth-map creation (crosses, 520) and the label propagation stages (circles, 510) to the total running time. The depth-map estimation requires 10 TGV primal-dual iterations (less than 500 ms on average). The label propagation stage 510 comprises the label warping induced by Equations 1 to 3 and the primal-dual iterations run to achieve the final segmented image 114.
In practice, fewer than 100 primal-dual iterations are performed during multi-labelling optimisation (3.5 s average time).
Triangles 530 denote the running time in those steps in which the pipeline 300 was reset 360. In such cases, time is required for the segmentation 330 from the simple label initialisation (projection of the viewing frustum onto the ground plane).
Figure 7 shows an aerial view of the environment in which the vehicle 10 moved, and the continuous paths driven 710 (multiple loops are shown). The vehicle 10 trajectory 710 is shown according to the INS recordings (Inertial Navigation System). Only 3% of the places along the trajectory required the system reset 360.
Figure 7 also shows the places 720 along the whole trajectory where a system reset 360 was required. The road segmentation system was restarted in 172 places 720 out of the total number of places visited (ie less than 3% of the time).
The analysis was expanded to study the performance of the propagation of road labels on consecutive images 310, to determine if a label could be propagated continuously over long periods of time.
To this end, P(X < x_d) is introduced: a probability that denotes how likely it is that the propagation needs to be reset before a minimum distance x_d is travelled. X is defined as the distance travelled with respect to the vehicle localisation. Figure 6 shows a plot 600 of P(X < x_d) against distance travelled. The plot 600 shows that, for instance, after label propagation for more than 80 m there is a 95% probability of requiring a reset. For the embodiment being described, it is very unlikely that the system would require a reset below a minimum distance of 20 m.
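P(X < x_d) can be estimated empirically as the fraction of propagation runs that were reset before a distance x_d had been travelled. A short sketch of such an estimate is given below; the distances used are hypothetical and are not the data behind Figure 6.

```python
import numpy as np

def reset_cdf(distances_between_resets, x_d):
    """Empirical P(X < x_d): fraction of propagation runs reset before travelling x_d metres."""
    d = np.asarray(distances_between_resets, dtype=float)
    return np.mean(d < x_d)

# Hypothetical distances (metres) travelled before each of a handful of resets.
distances = [35.0, 52.0, 28.0, 75.0, 61.0, 44.0, 90.0, 38.0]
for x_d in (20.0, 50.0, 80.0):
    print(f"P(X < {x_d:.0f} m) = {reset_cdf(distances, x_d):.2f}")
```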

Claims (29)

1. A method of processing a series of images, each image representing at least a part of an environment, to identify at least a portion of an object of interest within the images, the method comprising: obtaining a first two-dimensional (2D) image, in which at least one point, having a predetermined property, is labelled as forming at least part of the object of interest; segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify the at least a portion of the object of interest within the first 2D image; obtaining a second 2D image of the environment; propagating at least a portion of the region from the first 2D image to the second 2D image using three dimensional (3D) geometric data; and segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
2. The method of claim 1 wherein the method is repeated such that the region having the predetermined property within the second 2D image is used as the seed to identify a further region in a further image.
3. The method of any preceding claim wherein each image represents at least a portion of the environment in the vicinity of a vehicle and the object of interest is a navigable region within the images, and wherein: the obtaining a first 2D image comprises obtaining a 2D image in which at least one point, having a predetermined property, is labelled as being navigable; the segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point comprises segmenting the first 2D image to identify the navigable region within the first 2D image; and the segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image comprises segmenting the second 2D image to identify a navigable region in the second 2D image.
4. The method of claim 3 wherein odometry data is generated as the vehicle moves and wherein the odometry data is used to determine movement of the vehicle from the position at which the first image was taken to the position at which the second image was taken, and wherein the movement is used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image.
5. The method of claim 3 or claim 4 wherein the first and second 2D images are captured from a sensor and wherein the sensor is mounted on the vehicle.
6. The method of claim 4 or claim 5 wherein the sensor is a camera which is typically a monocular camera.
7. The method of claim 5 or claim 6 wherein the first and second 2D images are illumination-invariant colour images and the method may include the step of generating these illumination invariant first and second images from images generated by the sensor.
8. The method of claim 1 or claim 2 wherein the object of interest is a mobile object, and wherein the series of 2D images is processed to identify the object of interest and to track the object of interest through the series of images.
9. The method of claim 8 wherein the object of interest comprises a sensor which generates odometry data as the object of interest moves.
10. The method of claim 9 wherein the odometry data is used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image.
11. The method of any preceding claim further comprising the step of transferring the region to a 3D prior of the environment, wherein the 3D prior may be a point-cloud which may be generated from a LIDAR.
12. The method of any preceding claim wherein the 3D geometric data is a depth-map.
13. The method of any preceding claim wherein the 2D images comprise pixels, and wherein the 3D geometric data provides a depth estimate for each pixel.
14. The method of claim 13 wherein morphology regularisation is used on pixels located near boundaries of a region to reduce depth inaccuracies.
15. The method of any preceding claim wherein statistical colour segmentation is used to segment the first and second 2D images to identify the at least a portion of the object of interest.
16. The method of any preceding claim wherein at least one geometric assumption about the environment is used in the step of propagating the at least one label.
17. The method of claim 12 wherein the depth-map is created from a pair of consecutive 2D images of the environment, using parallax and at least one geometric assumption about the environment.
18. The method of claim 16 or claim 17 wherein the at least one geometric assumption is that the geometry of the environment contains affine surfaces.
19. The method of any preceding claim, further comprising generating a visual representation of at least a portion of the environment in which the identified region is highlighted.
20. A system for processing a series of images, each image representing at least a part of an environment, the system being arranged to identify at least a portion of an object of interest within the images, the system comprising processing circuitry arranged to: obtain a first two-dimensional (2D) image, in which at least one point, having a predetermined property, is labelled as forming at least part of the object of interest; segment the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify the at least a portion of the object of interest within the first 2D image; obtain a second 2D image of the environment; propagate at least a portion of the region from the first 2D image to the second 2D image using three dimensional (3D) geometric data; and segment the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
21. The system of claim 20, further comprising one or more sensors arranged to generate at least one of the first 2D image and the second 2D image.
22. The system of claim 20 or claim 21 in which at least one of the following applies: (i) the system further comprises a storage device arranged to store at least one of the first 2D image, the second 2D image and information relating to the predetermined property; (ii) the system further comprises a server arranged to communicate with the processing circuitry.
23. The system of any of claims 20 to 22, further comprising a vehicle on which at least some of the processing circuitry is mounted.
24. The system of claim 23 wherein the system comprises a vehicle-based portion and a remote portion, and wherein the system is arranged to transfer data between the vehicle-based portion and the remote portion.
25. A vehicle having a sensor mounted thereon, wherein the sensor is arranged to generate two-dimensional (2D) images wherein each image represents at least a portion of the environment in the vicinity of the vehicle, the vehicle having a processing circuitry arranged to process the 2D images, wherein the processing circuitry is arranged to: obtain a first 2D image, in which at least one point, having a predetermined property, is labelled as forming at least part of the object of interest; segment the first 2D image to identify the at least one region corresponding to the at least one labelled point to identify the at least a portion of the object of interest within the first 2D image; obtain a second 2D image of the environment; propagate at least a portion of the region from the first 2D image to the second 2D image using three dimensional (3D) geometric data; and segment the second 2D image to identify the at least one region having the predetermined property in the second 2D image thereby identifying the at least a portion of the object of interest in the second 2D image.
26. The vehicle of claim 25, wherein the obtaining a first 2D image of the environment comprises obtaining a 2D image in which at least one point, having a predetermined property, is labelled as being navigable; segmenting the first 2D image to identify the at least one region corresponding to the at least one labelled point comprises segmenting the first 2D image to identify the navigable region within the first 2D image; and segmenting the second 2D image to identify the at least one region having the predetermined property in the second 2D image comprises segmenting the second 2D image to identify a navigable region in the second 2D image.
27. The vehicle of claim 25 or claim 26, wherein the vehicle further comprises one or more sensors arranged to generate odometry data as the vehicle moves, and wherein the odometry data is used to determine movement of the vehicle from the position at which the first image was taken to the position at which the second image was taken, and wherein the movement is used in the step of propagating the at least a portion of the region from the first 2D image to the second 2D image.
28. The vehicle of any of claims 25 to 27 wherein the sensor is a camera which is typically a monocular camera.
29. A machine readable medium containing instructions which when read by a machine cause that machine to perform as at least one of the following: (i) the method of any of claims 1 to 19; (ii) at least a portion of the system of any of claims 20 to 24; and (iii) the vehicle of any of claims 25 to 28.
GB1507009.7A 2015-04-24 2015-04-24 Processing a series of images to identify at least a portion of an object Withdrawn GB2541153A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1507009.7A GB2541153A (en) 2015-04-24 2015-04-24 Processing a series of images to identify at least a portion of an object
PCT/GB2016/051096 WO2016170330A1 (en) 2015-04-24 2016-04-21 Processing a series of images to identify at least a portion of an object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1507009.7A GB2541153A (en) 2015-04-24 2015-04-24 Processing a series of images to identify at least a portion of an object

Publications (2)

Publication Number Publication Date
GB201507009D0 GB201507009D0 (en) 2015-06-10
GB2541153A true GB2541153A (en) 2017-02-15

Family

ID=53488610

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1507009.7A Withdrawn GB2541153A (en) 2015-04-24 2015-04-24 Processing a series of images to identify at least a portion of an object

Country Status (2)

Country Link
GB (1) GB2541153A (en)
WO (1) WO2016170330A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107403154A (en) * 2017-07-20 2017-11-28 四川大学 A kind of gait recognition method based on dynamic visual sensor

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529604B (en) * 2016-11-24 2019-09-27 苏州大学 A kind of adaptive image tag Robust Prediction method and system
US10449956B2 (en) 2017-01-18 2019-10-22 Ford Global Technologies, Llc Object tracking by unsupervised learning
US10254758B2 (en) 2017-01-18 2019-04-09 Ford Global Technologies, Llc Object tracking by unsupervised learning
CN109117690A (en) * 2017-06-23 2019-01-01 百度在线网络技术(北京)有限公司 Drivable region detection method, device, equipment and storage medium
EP3506160B1 (en) * 2017-12-28 2022-06-01 Dassault Systèmes Semantic segmentation of 2d floor plans with a pixel-wise classifier
GB201804082D0 (en) * 2018-03-14 2018-04-25 Five Ai Ltd Image annotation
CN108957448B (en) * 2018-06-06 2022-10-28 西安电子科技大学 Radar correlation imaging method based on generalized total variation regularization
CN110188687B (en) * 2019-05-30 2021-08-20 爱驰汽车有限公司 Method, system, device and storage medium for identifying terrain of automobile
JP2021028770A (en) * 2019-08-09 2021-02-25 株式会社日立製作所 Information processing device and table recognition method
CN111612806B (en) * 2020-01-10 2023-07-28 江西理工大学 Building facade window extraction method and device
CN114531580B (en) * 2020-11-23 2023-11-21 北京四维图新科技股份有限公司 Image processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013114257A2 (en) * 2012-02-03 2013-08-08 Koninklijke Philips N.V. Imaging apparatus for imaging an object
US20140064584A1 (en) * 2012-09-06 2014-03-06 Siemens Aktiengesellschaft Determining a planar examination area

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200907859A (en) * 2007-06-15 2009-02-16 Koninkl Philips Electronics Nv Method , apparatus, system and computer program product for depth-related information propagation
WO2012175731A1 (en) * 2011-06-24 2012-12-27 Softkinetic Software Depth measurement quality enhancement
US8660362B2 (en) * 2011-11-21 2014-02-25 Microsoft Corporation Combined depth filtering and super resolution
EP2858032A1 (en) * 2013-10-02 2015-04-08 Thomson Licensing Method and apparatus for generating temporally consistent depth maps


Also Published As

Publication number Publication date
WO2016170330A1 (en) 2016-10-27
GB201507009D0 (en) 2015-06-10

Similar Documents

Publication Publication Date Title
Menze et al. Object scene flow
WO2016170330A1 (en) Processing a series of images to identify at least a portion of an object
US9990736B2 (en) Robust anytime tracking combining 3D shape, color, and motion with annealed dynamic histograms
JP6095018B2 (en) Detection and tracking of moving objects
Zhou et al. Efficient road detection and tracking for unmanned aerial vehicle
US9483703B2 (en) Online coupled camera pose estimation and dense reconstruction from video
US9864927B2 (en) Method of detecting structural parts of a scene
Taneja et al. City-scale change detection in cadastral 3d models using images
Zhou et al. Moving object detection and segmentation in urban environments from a moving platform
US11049270B2 (en) Method and apparatus for calculating depth map based on reliability
EP3516582A1 (en) Autonomous route determination
Pretto et al. Omnidirectional dense large-scale mapping and navigation based on meaningful triangulation
Pascoe et al. Robust direct visual localisation using normalised information distance.
EP3710985A1 (en) Detecting static parts of a scene
Peng et al. Globally-optimal event camera motion estimation
Suleymanov et al. Inferring road boundaries through and despite traffic
Dimitrievski et al. Semantically aware multilateral filter for depth upsampling in automotive lidar point clouds
Paz et al. A variational approach to online road and path segmentation with monocular vision
Xu et al. Direct visual-inertial odometry with semi-dense mapping
Liu et al. Accurate real-time visual SLAM combining building models and GPS for mobile robot
Sommer et al. Sf2se3: Clustering scene flow into se (3)-motions via proposal and selection
Wong et al. Monocular localization within sparse voxel maps
Gadipudi et al. A review on monocular tracking and mapping: from model-based to data-driven methods
Li et al. Ground target tracking and trajectory prediction by UAV using a single camera and 3D road geometry recovery
Zaslavskiy et al. Method for automated data collection for 3d reconstruction

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)