AU2003264580B2 - Range Estimation Using Multi-dimensional Segmentation - Google Patents

Range Estimation Using Multi-dimensional Segmentation

Info

Publication number
AU2003264580B2
Authority
AU
Australia
Prior art keywords
dimensional
images
segment
range
pixel data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2003264580A
Other versions
AU2003264580A1 (en)
Inventor
Brian John Parker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2002953003A external-priority patent/AU2002953003A0/en
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2003264580A priority Critical patent/AU2003264580B2/en
Publication of AU2003264580A1 publication Critical patent/AU2003264580A1/en
Application granted granted Critical
Publication of AU2003264580B2 publication Critical patent/AU2003264580B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current


Description

S&F Ref: 649513
AUSTRALIA
PATENTS ACT 1990 COMPLETE SPECIFICATION FOR A STANDARD PATENT Name and Address of Applicant: Canon Kabushiki Kaisha 30-2, Shimomaruko 3-chome, Ohta-ku Tokyo 146 Japan Actual Inventor(s): Address for Service: Invention Title: Brian John Parker Spruson Ferguson St Martins Tower Level 31 Market Street Sydney NSW 2000 (CCN 3710000177) Range Estimation Using Multi-dimensional Segmentation ASSOCIATED PROVISIONAL APPLICATION DETAILS [33] Country [31] Applic. No(s) AU 2002953003 [32] Application Date 29 Nov 2002 The following statement is a full description of this invention, including the best method of performing it known to me/us:- 5815c -1- RANGE ESTIMATION USING MULTI-DIMENSIONAL SEGMENTATION Technical Field of the Invention The present invention relates generally to computer vision and, in particular, to passive range estimation of a scene.
Background

Estimating range from video or image data is a well-studied problem in computer vision. Having accurate and spatially well-demarcated range data allows for such post-processing as foreground/background object separation, which is a fundamental initial step in image understanding. Another application of foreground/background object separation is the replacement of the background portion of the image with a new background, the application being known as range-keying.
There are two broad approaches to range estimation, those being active and passive ranging. Active range estimators use lasers or other known light sources to illuminate a scene, which helps greatly in estimating the range of features in the scene.
Accordingly, active range estimators typically give very accurate and well-demarcated range data, but are limited in their application due to the need for additional equipment.
Furthermore, the active lighting often limits the range of applicability due to a limited maximum distance, as well as sensitivity to ambient conditions.
In contrast thereto, a passive range estimator does not use an ancillary light source. More than one image obtained through stereo cameras or range-from-defocus are typically used to estimate the range of features in the scene.
In the case of stereo camera passive range estimators, the fundamental problem to be solved is to find the corresponding features between the images captured by the two (or more) cameras. The range data is then determined from the estimated disparity (change in position) between such corresponding features. A typical approach employed for finding the corresponding features between the images is to perform a local area matching between the image pairs at the pixel level, and to estimate the local disparity per pixel. However, such local area correlation approaches do not distinguish between such features as object boundaries and internal texture, and as a consequence are inherently poorly demarcated spatially. Also, such local area correlation approaches rely upon local texture for the feature matching, causing the local area correlation approaches to often give inaccurate results in areas of low texture. This lack of good spatial demarcation severely limits the applicability of the resulting range data for applications requiring pixel-accurate foreground/background object separation.
Summary of the Invention

It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
According to a first aspect of the invention, there is provided a method of estimating range measures of a scene from pixel data of at least two images, said images being captured of said scene from displaced relative positions, said method comprising the steps of: forming a multi-dimensional pixel data set from said at least two images, the dimensions of said pixel data set comprising spatial dimensions corresponding to dimensions of said images, and a dimension corresponding to said displaced relative positions; segmenting said multi-dimensional data set into multi-dimensional segments using multi-dimensional segmentation; for each multi-dimensional segment, estimating a disparity for said multi-dimensional segment by estimating a displacement of intersections in a spatial plane of said multi-dimensional segment; and calculating said range measures from said disparities.
According to another aspect of the invention, there is provided an apparatus for implementing the aforementioned method.
According to another aspect of the invention there is provided a computer program stored on a computer readable medium for implementing the method described above.
Other aspects of the invention are also disclosed.
Brief Description of the Drawings

One or more embodiments of the present invention will now be described with reference to the drawings, in which:

Fig. 1 shows a schematic block diagram representation of a passive range estimation device for estimating a range map of a scene;
Figs. 2A to 2D show schematic flow diagrams of methods of estimating a range map of the scene;
Fig. 3 illustrates the formation of a three-dimensional block of colour values from D images and the coordinate system used;
Fig. 4 illustrates one of the three-dimensional segments formed by the segmentation step and segmented images that may be formed from the segmented three-dimensional block of data by slicing through the segmented three-dimensional block in the image plane;
Fig. 5 illustrates the occlusion of one two-dimensional segment by another;
Fig. 6 shows a graphical representation of a data structure for use with the segment-merging segmentation step;
Fig. 7 shows the segment-merging segmentation step in more detail;
Fig. 8 is a plot of the value of the merging cost as the segment-merging segmentation proceeds in a typical case;
Fig. 9A illustrates a three-dimensional segment which always results in contiguous two-dimensional segments for all time dimension values;
Fig. 9B illustrates a three-dimensional segment which does not result in contiguous two-dimensional segments for all time dimension values; and
Figs. 10A and 10B illustrate the parameters used to test whether two neighbouring segments may be merged while maintaining contiguous two-dimensional segments for all time dimension values.
Detailed Description

Some portions of the description that follows are explicitly or implicitly presented in terms of algorithms and symbolic representations of operations on data within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities.
Fig. 1 shows a schematic block diagram representation of a passive range estimation device 200 for estimating a range map of a scene. Such a device 200 may be specially constructed for the required purposes, or may comprise a general-purpose computer or other device selectively activated or reconfigured by a computer program stored in the computer or device. The range estimation device 200 comprises a computer module 201 interfaced with two or more cameras 250-1 to 250-D. A display device 214 and controls 202 are also interfaced with the computer module 201.
The cameras 250-1 to 250-D are arranged along a baseline 260, preferably equally spaced, and all viewing the same scene from displaced positions along the baseline 260.
The computer module 201 typically includes at least one processor 205, a memory unit 206, a storage device 209, input/output interfaces including a video interface 207, an I/O interface 213 for the controls 203, and an interface 208 for the cameras 250-1 to 250-D. The components 205 to 213 of the computer module 201 communicate via an interconnected bus 204 and in a manner which results in a conventional mode of operation of the computer module 201 known to those in the relevant art. Image data received from the cameras 250-1 to 250-D is typically stored in the semiconductor memory 206, possibly in concert with the storage device 209.
The computer module 201 may be constructed from one or more integrated circuits performing the functions or sub functions of the range estimation. In one implementation the computer module 201 is a general-purpose computer interfaced with the cameras 250-1 to 250-D. In such an implementation the storage device 209 typically includes a hard disk drive and a floppy disk drive.
Fig. 2A shows a schematic flow diagram of a method 100 of estimating a range map of the scene according to a first embodiment. In the first embodiment, the cameras 250-1 to 250-D are still cameras for capturing still images of the scene. The steps of method 100 are preferably effected by instructions in an application program that are executed by the processor 205 of the range estimation device 200 (Fig. 1). The application program may be stored on the storage device 209. In the implementation where the range estimation device 200 is implemented using a general-purpose computer, the application program is typically read into the computer module 201 from a computer readable medium. A computer readable medium having such software or computer program recorded on it is a computer program product.
The method 100 starts in step 105 where each of the cameras 250-1 to 250-D captures an image of the scene. The image data is fed through I/O interface 208 to the memory 206 from where the processor 205 processes the image data. Each of the D images comprises a two-dimensional array of colour values f(x,y), typically in some colour space such as RGB or LUV. In the preferred implementation the LUV colour space is used, with the luminance component L down-weighted. The colour values f(x,y) may alternatively be chrominance values only, in which case the colour value f(x,y) is a 2-vector.
Step 110 follows where the processor 205 forms a three-dimensional block of colour values f(x,y,d) from the colour values f(x,y) of the individual images. Two of the dimensions, those being the x and y dimensions, are the spatial dimensions and correspond to the coordinates of a particular pixel in the image, and the other dimension, hereinafter referred to as the parallax dimension, corresponds to the camera 250-d that captured the image. Fig. 3 illustrates the formation of the three-dimensional block of colour values f(x,y,d) from D images 301-1 to 301-D and the coordinate system used.
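A minimal sketch of step 110, assuming the D captured images are already loaded as equally sized NumPy arrays; the block is simply the images stacked along a new parallax axis d (the function and variable names below are illustrative, not from the patent):

```python
import numpy as np

def form_parallax_block(images):
    """Stack D colour images of shape (H, W, C) into a block f(x, y, d).

    The returned array has shape (H, W, D, C): two spatial dimensions,
    one parallax dimension d (the camera index), and the colour channels.
    """
    if len({img.shape for img in images}) != 1:
        raise ValueError("all camera images must share the same shape")
    # axis=2 becomes the parallax dimension d
    return np.stack(images, axis=2)

# Example: three 480x640 LUV images from cameras 250-1 to 250-3
images = [np.random.rand(480, 640, 3) for _ in range(3)]
block = form_parallax_block(images)
print(block.shape)  # (480, 640, 3, 3)
```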
Referring again to Fig. 2A, step 115 follows wherein the processor 205 performs spatio-parallax (3D) segmentation on the three-dimensional block of colour values f(x,y,d) based on colour to form a set of contiguous three-dimensional segments {Si}.
Every pixel in the block is related by the 3D segmentation step 115 to one of the three-dimensional segments Si such that all pixels belonging to the same segment Si have homogeneous colour values f(x,y,d). A contiguous three-dimensional segment Si is one in which each pixel in the segment Si can be reached from every other pixel in that same segment Si via a neighbourhood relation. Such a requirement forbids disjoint segments Si.
In the preferred implementation the Mumford-Shah segment-merging segmentation is used to perform the spatio-parallax (3D) segmentation on the three-dimensional block of colour values f(x,y,d). An assumption underlying the segmentation problem is that each colour value f(x,y,d) is associated with a particular state defined by an approximation vector ui. The Mumford-Shah segment-merging segmentation is a variational segmentation which requires that a cost functional M be assigned to each possible segmentation. The overall cost functional M may be defined as:

M = \sum_i E_i(u_i, S_i) + \lambda\, l(B)    (1)

where B is the set of boundaries with total length l(B) defining the segmentation, Ei denotes the approximation error over segment Si as a result of associating the colour values f(x,y,d) with approximation vector ui, and λ is a regulation parameter. Each approximation error Ei is given by:

E_i(u_i, S_i) = \sum_{(x,y,d) \in S_i} [f(x,y,d) - u_i]^T [f(x,y,d) - u_i]    (2)

The approximation vector ui that minimises the approximation error Ei for three-dimensional segment Si, denoted by ûi, is simply the mean of all colour values f(x,y,d) in the three-dimensional segment Si. The minimised approximation error Ei is denoted by Êi.
From Equation (1) it can be seen that the cost functional M balances the image approximation resulting from the segmentation with the overall boundary length, and the regulation parameter λ controls the relative importance of the image approximation and the overall boundary length l(B).
The contribution of the image approximation to the cost functional M encourages a proliferation of segments, while the overall boundary length l(B) encourages few segments. The functional M must therefore balance the two components to achieve a reasonable result.
If parameter λ is low, many boundaries are allowed, giving "fine" segmentation.
At one extreme, where the overall boundary length l(B) is completely discounted, the trivial segmentation is obtained where every pixel constitutes its own segment Si.
As parameter λ increases, the segmentation gets coarser. At the other extreme the error due to the image approximation is discounted, resulting in a segmentation in which the entire block is represented by a single segment Si.
To find an approximate solution to the variational segmentation problem, a segment-merging strategy is employed, wherein properties of neighbouring segments Si and Sj are used to determine whether those segments come from the same state, thus allowing the segments Si and Sj to be merged as a single segment Sij. The approximation error also increases after any two neighbouring segments Si and Sj are merged.
Knowing that the trivial segmentation is the optimal segmentation for the smallest possible parameter λ value of 0, in segment merging, each pixel in the block is initially labelled as its own unique segment with approximation vector ui set to the colour values f(x,y,d) of that pixel. Adjacent segments Si and Sj are then compared using some similarity criterion and merged if they are sufficiently similar. In this way, small segments take shape and are gradually built into larger ones.
The segmentations before and after the merger differ only in the two segments Si and Sj. Accordingly, in determining the effect on the total cost functional M after such a merger, a computation may be confined to those segments Si and Sj. Accordingly, a merging cost for the adjacent segment pair {Si, Sj} may be written as:

\tau_{ij} = \frac{E_{ij} - (E_i + E_j)}{l(\partial(S_i, S_j))}    (3)

where the numerator is the increase in approximation error due to the merge, and l(∂(Si, Sj)) is the area of the common boundary between three-dimensional segments Si and Sj. If the merging cost τij is less than parameter λ, the merge is allowed.
The numerator of the merging cost τij in Equation (3) may be shown to evaluate to an area-weighted Euclidean distance, and hence may be computed as:

\tau_{ij} = \frac{|S_i|\,|S_j|}{|S_i| + |S_j|} \cdot \frac{\| \hat{u}_i - \hat{u}_j \|^2}{l(\partial(S_i, S_j))}    (4)

where |Si| is the area of segment Si.
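A short sketch of Equation (4), under the assumption that each segment is represented by its pixel count and mean colour vector (the Segment class and field names are illustrative, not part of the patent):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Segment:
    area: int              # |S_i|: number of pixels in the segment
    mean: np.ndarray       # mean colour vector u_i over the segment

def merging_cost(si: Segment, sj: Segment, boundary_area: float) -> float:
    """Area-weighted Euclidean merging cost of Equation (4)."""
    colour_dist2 = float(np.sum((si.mean - sj.mean) ** 2))
    weight = (si.area * sj.area) / (si.area + sj.area)
    return weight * colour_dist2 / boundary_area
```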
During segment-merging segmentation, a search is performed over all segment pairs {Si, Sj} to find the pair {Si, Sj} with the smallest merging cost τij. The segments Si and Sj are merged if the associated merging cost τij is below some monotonically increasing λ-schedule, else the merging stops.
Fig. 6 shows a graphical representation of a data structure 600 for use with the segment-merging segmentation step 115, which is essentially an adjacency list representation of the segmentation. The data structure 600 includes a segment list 610, an adjacency list 620 for each segment in the segment list 610, and a priority queue 630.
The segment list 610 contains for each segment Si a segment identifier i, the co-ordinates of all pixels in that segment Si, and the approximation vector ui. The priority queue 630 stores the merging costs τij of two segments Si and Sj, the area l(∂(Si, Sj)) of the common boundary between three-dimensional segments Si and Sj, and points to the corresponding segments Si and Sj. This data structure 600 allows the segment-merging segmentation step 115 to efficiently store all the data and parameters required for the segmentation. This data structure 600 also allows for dynamic modifications to the data and parameters stored.
Fig. 7 shows the segment-merging segmentation step 115 (Fig. 2A) in more detail. The segmentation step 115 starts in sub-step 704 with the trivial segmentation where each pixel forms its own segment Si. Hence, the approximation vectors ui are set to the colour values f for each pixel in the block. Referring also to Fig. 6, the segment list 610 is populated with one pixel co-ordinate for each segment Si, and the approximation vector ui, which corresponds to the colour value f of that pixel.
When populating the graph, the area l(∂(Si, Sj)) of a boundary unit perpendicular to each axis is set by default to 1. This default area l(∂(Si, Sj)) can be scaled for a particular axis to favour the creation of segments Si along that axis.
The processor 205 then computes in sub-step 706 the merging cost τij according to Equation (4) for each of the boundaries between adjacent segment pairs Si and Sj, and stores the merging costs τij in the priority queue 630 (Fig. 6). Sub-step 708 follows where the processor 205 inserts the boundary unit area l(∂(Si, Sj)) of the common boundary between three-dimensional segments Si and Sj into the priority queue 630 in priority order of the lowest merging cost τij computed in sub-step 706.
In sub-step 710 the processor 205 removes the first entry from the priority queue, that being the lowest merging cost, and merges the corresponding segment pair Si and Sj to form a new segment Sij. The approximation vector uij of the new segment Sij is the area-weighted average of those of the respective segments Si and Sj that are being merged, that is:

\hat{u}_{ij} = \frac{|S_i|\,\hat{u}_i + |S_j|\,\hat{u}_j}{|S_i| + |S_j|}    (5)

The adjacency list 620 is also updated.
Sub-step 712 identifies any multi-edges that have been introduced by the merge in sub-step 710, merges any such multi-edges, and sets the area l(∂(Sij, Sk)) of the common boundary between three-dimensional segments Sij and Sk to the combined area of the two constituent boundaries ∂(Si, Sk) and ∂(Sj, Sk). Sub-step 714 follows where the processor 205 calculates a new merging cost for each boundary ∂(Sij, Sk) between adjacent segments Sij and Sk and inserts the new boundary into the priority queue 630 so as to maintain the correct priority order.
The processor 205 then determines in sub-step 716 whether the merging cost τij at the top of the priority queue 630 has a value greater than a threshold λstop, which signifies the stopping point of the merging. If the merging has reached the stopping point, then the three-dimensional segmentation step 115 ends. Alternatively, control is returned to sub-step 710 from where sub-steps 710 to 716 are repeated, merging the two segments with the lowest merging cost τij every cycle, until the stopping point is reached.
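Sub-steps 704 to 716 amount to a greedy merge loop driven by a priority queue. A simplified sketch follows, assuming the Segment class and merging_cost() helper from the sketch above, and ignoring the multi-edge bookkeeping of sub-steps 712 and 714 for brevity (all names are illustrative, not the patent's implementation):

```python
import heapq

def segment_merge(segments, boundaries, lam_stop):
    """Greedy Mumford-Shah segment merging (sub-steps 704-716, simplified).

    segments:   dict id -> Segment (area, mean colour vector)
    boundaries: dict frozenset({i, j}) -> boundary area l(d(Si, Sj))
    lam_stop:   stopping threshold for the merging cost
    """
    # Sub-steps 706/708: prime the priority queue with all boundary costs.
    heap = [(merging_cost(segments[i], segments[j], a), i, j)
            for (i, j), a in ((tuple(k), v) for k, v in boundaries.items())]
    heapq.heapify(heap)

    parent = {i: i for i in segments}          # union-find style labels
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    while heap:
        cost, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri == rj:
            continue                           # stale entry, already merged
        if cost > lam_stop:                    # sub-step 716: stopping point
            break
        si, sj = segments[ri], segments[rj]
        # Sub-step 710: merge the pair, Equation (5) area-weighted mean.
        merged = Segment(area=si.area + sj.area,
                         mean=(si.area * si.mean + sj.area * sj.mean)
                              / (si.area + sj.area))
        segments[ri] = merged
        parent[rj] = ri
        # Sub-step 714 would recompute costs against the neighbours of the
        # merged segment and push them onto the heap; omitted here.
    return segments, parent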
In determining the threshold λstop it is noted that different thresholds λstop are useful in different contexts, such as empirically determining a final merging cost τij, or halting when a certain fixed number of segments is reached. These suffer from the disadvantage that they are not adaptive to changing image data.
Fig. 8 is a plot of the value of the merging cost τij as the segment-merging step 115 proceeds in a typical case. It is noted that, as merging proceeds, the merging cost τij of the segments Si and Sj that are being merged generally increases. As can be seen from Fig. 8, the increase in merging cost τij is not purely monotonic. In fact, the overall rise in merging cost τij is punctuated by departures from monotonicity, which herein are termed local minima. A local minimum represents the collapse of what might be termed a self-supporting group of adjacent segments. Such a collapse occurs if one boundary within the group is removed, and the merging costs τij for adjacent boundaries then suddenly reduce. In effect, the hypothesis that the segments of the group are part of the same object is confirmed as more segments merge and so the merging cost τij decreases. The result is that all the boundaries in the group are removed in quick succession. These self-supporting groups tend to represent the internal structure of objects and background clutter. In the preferred implementation the merging cost τij immediately after the last local minimum is used as the threshold in sub-step 716 (Fig. 7).
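One way to realise this stopping rule, assuming the sequence of accepted merging costs has been recorded in merge order (a sketch under that assumption, not the patent's exact procedure):

```python
def threshold_after_last_local_minimum(costs):
    """Return the merging cost immediately after the last local minimum.

    costs: non-empty list of merging costs in the order the merges occurred.
    A local minimum is a cost lower than its immediate predecessor, i.e. a
    departure from the generally monotonic rise shown in Fig. 8.
    """
    last_min = None
    for k in range(1, len(costs)):
        if costs[k] < costs[k - 1]:
            last_min = k
    if last_min is None or last_min + 1 >= len(costs):
        return costs[-1]            # no local minimum: fall back to last cost
    return costs[last_min + 1]      # cost immediately after the last minimum
```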
Referring again to Fig. 2A, following the 3D segmentation step 115, the processor 205 forms in step 120 two segmented images from the segmented three-dimensional block of data by "slicing through" the segmented three-dimensional block in the image plane (x-y plane), that is in planes having constant parallax dimension values d.
The segmented images include two-dimensional segments sid, with pixels in the two segmented images that belong to the same three-dimensional segment Si having the same two-dimensional segment identifier i but different values of d. Accordingly, the corresponding features between the images are matched automatically, removing the need for a subsequent algorithm to solve the correspondence problem between features of different images. In the preferred implementation, the two segmented images are formed by slicing through the segmented three-dimensional block in image planes having parallax dimension values 1 and D. This allows the two-dimensional segments sid to have the maximum possible separation in the parallax dimension d.
Fig. 4 illustrates one of the three-dimensional segments Si formed by the segmentation step 115 and segmented images 410-1 to 410-D that may be formed from the segmented three-dimensional block of data by slicing through the segmented three-dimensional block in the image plane (x-y plane). In particular, the parallax dimension values d of the images illustrated correspond with those of the D images received in step 105. Each of the segmented images 410-1 to 410-D includes one two-dimensional segment si1 to siD that corresponds to the three-dimensional segment Si. Also illustrated are the two segmented images 420-1 and 420-D formed in step 120, those images 420-1 and 420-D corresponding with images 410-1 and 410-D respectively.
Returning again to Fig. 2A, the processor 205 then estimates, in step 125, the disparity δi of each of the three-dimensional segments Si by measuring the displacement δi of corresponding two-dimensional segments sid of the three-dimensional segment Si under consideration, those two-dimensional segments sid being the two-dimensional segments sid in the segmented images formed in step 120. In particular, the displacement δi is measured in a direction parallel with the baseline 260 (Fig. 1). In the preferred implementation the displacement δi of the centroids of the corresponding two-dimensional segments sid is measured.
Referring again to Fig. 4, the disparity δi of the illustrated three-dimensional segment Si is then the displacement δi of the corresponding two-dimensional segments sid in the x direction, that being the alignment of the baseline 260 (Fig. 1).
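A sketch of the preferred centroid-based disparity estimate of step 125, assuming the two segmented images are available as integer label maps in which corresponding 2D segments share the same label (NumPy-based; the names are illustrative):

```python
import numpy as np

def centroid(label_map, seg_id):
    """Centroid (x, y) of the 2D segment with the given label, or None."""
    ys, xs = np.nonzero(label_map == seg_id)
    if xs.size == 0:
        return None
    return xs.mean(), ys.mean()

def segment_disparity(labels_cam1, labels_camD, seg_id):
    """Displacement of the segment centroid along the baseline (x axis)."""
    c1 = centroid(labels_cam1, seg_id)
    cD = centroid(labels_camD, seg_id)
    if c1 is None or cD is None:
        return None                 # segment missing in one view
    return c1[0] - cD[0]            # disparity delta_i along x
```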
Occlusion of objects in the scene may cause inaccurate disparity estimates, and hence an inaccurate range estimation. Fig. 5 illustrates two segmented images 510-1 and 510-D formed in step 120, with only two two-dimensional segments s1d and s2d illustrated in each of the segmented images 510-1 and 510-D. In the example, the object corresponding to two-dimensional segments s11 and s1D is closer to the cameras 250-1 to 250-D (Fig. 1) and partially occludes the object corresponding to two-dimensional segments s21 and s2D. Also, because the object corresponding to two-dimensional segments s11 and s1D is closer to the cameras 250-1 to 250-D, that object has a large displacement δ1 of the centroids of the corresponding two-dimensional segments s11 and s1D due to the parallax between the views of cameras 250-1 and 250-D respectively. In contrast thereto, the displacement δ2 of the object corresponding to two-dimensional segments s21 and s2D should be smaller because of the larger distance of that object from the cameras 250-1 to 250-D. However, as is illustrated in Fig. 5, the displacement of the object corresponding to two-dimensional segments s21 and s2D is mainly caused by the shift of the centroids of two-dimensional segments s21 and s2D caused by the partial occlusion, rather than displacement due to the parallax between the views of cameras 250-1 and 250-D respectively. Should the range be calculated from that exaggerated displacement δ2, then the range of such an object would be incorrectly determined to be closer than the actual range.
Accordingly, in step 130 (Fig. 2A) the processor 205 tests for occlusions that may cause inaccurate disparity estimates δi by detecting changes in area and shape of corresponding two-dimensional segments sid. If the processor 205 determines that the area and/or shape of corresponding two-dimensional segments sid have changed above a threshold, the disparity estimate δi of the three-dimensional segment Si under consideration is flagged as inaccurate. In the preferred implementation, if one or more of the two-dimensional segment area, Hu invariant moments shape descriptors, and aligned image difference change more than predetermined thresholds, then the disparity estimate δi of the three-dimensional segment Si under consideration is flagged as inaccurate.
Aligned image difference is computed by first aligning the two corresponding two-dimensional segments sid by shifting the second segment by the inverse of the computed disparity estimate δi, and then summing the absolute pixel difference between the two segments sid over the area of intersection between the segments sid.
In order to calculate the Hu invariant moments shape descriptors, let the central moment of order (p + q) be defined as:

\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q\, g(x, y)    (6)

where g(x, y) is a binary image function that has a value of 1 where the pixel is included in the segment sid and 0 otherwise, and where x̄ and ȳ are the x and y coordinates respectively of the segment centroid. Further, let the normalized central moments be defined as:

\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}    (7)

where

\gamma = \frac{p + q}{2} + 1    (8)

The seven invariant Hu moments can then be derived from the second and third order normalized moments as follows:

\phi_1 = \eta_{20} + \eta_{02}    (9)

\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2    (10)

\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2    (11)

\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2    (12)

\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]    (13)

\phi_6 = (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})    (14)

\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]    (15)

Additionally, in the preferred implementation this occlusion test also requires that the three-dimensional segment Si under consideration spans all camera views, that is, have corresponding two-dimensional segments sid for each d value.
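A sketch of the occlusion test of step 130, computing the central moments of Equations (6)–(8) with NumPy and comparing segment area, the first two Hu invariants of Equations (9)–(10), and the aligned image difference against thresholds. The masks are assumed boolean, the images single-channel float arrays, and the threshold values and function names are illustrative assumptions rather than the patent's chosen parameters:

```python
import numpy as np

def normalized_central_moment(mask, p, q):
    """eta_pq of Equation (7) for a binary segment mask g(x, y)."""
    ys, xs = np.nonzero(mask)
    xbar, ybar = xs.mean(), ys.mean()
    mu_pq = np.sum((xs - xbar) ** p * (ys - ybar) ** q)
    mu_00 = float(xs.size)
    gamma = (p + q) / 2.0 + 1.0
    return mu_pq / (mu_00 ** gamma)

def hu_phi1_phi2(mask):
    """First two Hu invariants, Equations (9) and (10)."""
    eta20 = normalized_central_moment(mask, 2, 0)
    eta02 = normalized_central_moment(mask, 0, 2)
    eta11 = normalized_central_moment(mask, 1, 1)
    return eta20 + eta02, (eta20 - eta02) ** 2 + 4.0 * eta11 ** 2

def occluded(mask1, maskD, img1, imgD, disparity,
             area_tol=0.2, shape_tol=0.1, diff_tol=0.1):
    """Flag a segment if area, shape or aligned image difference change too much."""
    a1, aD = mask1.sum(), maskD.sum()
    if abs(int(a1) - int(aD)) / max(a1, aD) > area_tol:
        return True
    if any(abs(h1 - hD) > shape_tol
           for h1, hD in zip(hu_phi1_phi2(mask1), hu_phi1_phi2(maskD))):
        return True
    # Aligned image difference: shift the second view back by the disparity
    # and compare pixels over the intersection of the two masks.
    shift = int(round(disparity))
    maskD_aligned = np.roll(maskD, shift, axis=1)
    imgD_aligned = np.roll(imgD, shift, axis=1)
    overlap = mask1 & maskD_aligned
    if overlap.any():
        mean_diff = np.abs(img1[overlap] - imgD_aligned[overlap]).mean()
        if mean_diff > diff_tol:
            return True
    return False
```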
From the disparity estimates δi, the processor 205 then calculates in step 135 a range measure for each three-dimensional segment Si that has not been flagged as having an inaccurate disparity. The range measure is the inverse of the disparity estimate δi of the three-dimensional segment Si under consideration. The range Ri of the three-dimensional segment Si under consideration may be calculated as follows:

R_i = \frac{b\,f}{\delta_i}    (16)

where b is the distance along the baseline 260 between the cameras 250-1 and 250-D from which the images are used for the disparity estimate δi, and f is the focal distance of the lenses of the cameras 250-1 and 250-D. Equation (16) assumes that, if the optical axes of the cameras 250-1 and 250-D are not completely parallel, or there is other intrinsic camera distortion, then these distortions are first removed by a calibration and rectification stage, the techniques of which are known in the art.
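A minimal sketch of Equation (16), assuming a rectified camera pair with the disparity and focal length expressed in pixels and the baseline in metres (illustrative numbers only):

```python
def range_from_disparity(disparity, baseline_m, focal_px):
    """Equation (16): R = b*f / disparity, in the units of baseline_m."""
    if disparity is None or disparity == 0:
        return None                 # flagged or degenerate segment
    return baseline_m * focal_px / abs(disparity)

# Example: 0.30 m baseline, 800 px focal length, 16 px disparity -> 15.0 m
print(range_from_disparity(16.0, 0.30, 800.0))
```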
Step 140 follows where the processor 205 forms a range map by forming a two-dimensional array of range measures, with the range measure of grid position (x,y) corresponding to the range measure of the three-dimensional segment Si to which the pixel (x,y,d) belongs for a predetermined parallax dimension value d. Equivalently, for a predetermined parallax dimension value d, the range measure of grid position (x,y) corresponds to the range measure of the three-dimensional segment Si that corresponds to the two-dimensional segment sid to which the pixel belongs. In the preferred implementation the parallax dimension value d used for the range map is a central one.
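A sketch of step 140, under the assumption that the segmentation is available as a 3D label volume and the per-segment ranges as a dictionary (the names are illustrative):

```python
import numpy as np

def form_range_map(label_block, segment_ranges, d_central):
    """Build the 2D range map from the segmented block for one parallax slice.

    label_block:    int array of shape (H, W, D) of 3D segment identifiers
    segment_ranges: dict mapping segment id -> range measure (or None if the
                    segment was flagged as having an inaccurate disparity)
    d_central:      parallax index of the slice used for the map
    """
    labels = label_block[:, :, d_central]
    range_map = np.full(labels.shape, np.nan)   # NaN marks unpopulated cells
    for seg_id, r in segment_ranges.items():
        if r is not None:
            range_map[labels == seg_id] = r
    return range_map
```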
In the preferred implementation more than two cameras 250-1 to 250-D are used. Even though only two segmented images are formed in step 120, which preferably correspond to the images of the outer cameras 250-1 and 250-D, the benefit of such an implementation is that the middle image(s) provide connectivity of segments in the two segmented images. Hence, the separation of the outer cameras 250-1 and 250-D may be extended much further than would otherwise be possible without losing the connectivity.
According to a further preferred implementation, if the focal distance to an object of interest is known, such as is the case in range-keying applications, then the individual images from the cameras 250-1 to 250-D are offset by the respective disparities when forming the three-dimensional block of colour values f(x,y,d) in step 110. This implementation has the effect of aligning segments corresponding to objects in the focal plane, thereby improving the connectivity in the block of colour values f(x,y,d) for such objects. Accordingly, this implementation is particularly useful in the case of a small number of cameras 250-1 to 250-D separated widely along the baseline 260.
In range-keying applications where the foreground is selected based on the estimated range map, and a background portion of the image is replaced with a new background, it is desirable to "feather" the foreground with the new background. During feathering, for those pixels in the range-keyed image that are on the border of the foreground portion and the background portion, it is necessary to determine what percentage of the colour of each pixel is due to the foreground portion so that those pixels may be appropriately blended. However, the segment-merging segmentation described above inherently only gives a binary per-pixel classification for each segment Si.
Accordingly, in estimating a range map for range-keying applications, the images from cameras 250-1 to 250-D are enlarged K times using an accurate filter, such as bicubic interpolation, after receipt of the D images in step 105. The enlarged images are then processed by steps 110 to 140 in the usual manner. After the range map has been formed and range-keying has been applied to the enlarged range map, the range-keyed image is then resized by a factor 1/K to the size of the images captured by the cameras 250-1 to 250-D using an averaging filter. The averaging generates 2K-level gray-scale coverage information for the border pixels.
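A sketch of the feathering idea, assuming the K-times enlarged foreground mask produced by range-keying the enlarged range map; block-averaging it back down by 1/K yields fractional coverage for the border pixels, which is then used as a blending weight (helper names and the blending form are illustrative assumptions):

```python
import numpy as np

def coverage_from_enlarged_mask(fg_mask_enlarged, K):
    """Average a K-times enlarged binary foreground mask back down by 1/K.

    The mean over each K x K tile gives fractional coverage for border
    pixels, usable as an alpha value for feathered compositing.
    """
    H, W = fg_mask_enlarged.shape
    assert H % K == 0 and W % K == 0
    tiles = fg_mask_enlarged.reshape(H // K, K, W // K, K)
    return tiles.mean(axis=(1, 3)).astype(np.float32)

def feather_composite(fg, new_bg, alpha):
    """Blend the foreground and replacement background with per-pixel coverage."""
    return alpha[..., None] * fg + (1.0 - alpha[..., None]) * new_bg
```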
Typical images of scenes include objects that have depth. Due to the fact that the segmentation performed in step 115 continues until a "full" segmentation has been reached, that is, until a small number of segments Si are formed, many of the resulting segments Si may include objects that span more than one range. However, only one range measure is estimated for each three-dimensional segment Si, and the depth of objects represented by such segments Si is lost.
Fig. 2B shows a schematic flow diagram of a method 101 of estimating a range map of the scene according to a second embodiment. In the second embodiment, the cameras 250-1 to 250-D again are still cameras for capturing still images of the scene.
The method 101 is similar to the method 100 described with reference to Fig. 2A, but the method 101 aims to ameliorate the deficiency of method 100 by providing a range map with improved depth accuracy.
The method 101 also starts in step 105 where each of the cameras 250-1 to 250- D captures an image of the scene, followed by step 110 where the processor 205 forms a three-dimensional block of colour values f(x,y,d) from the colour values f(x,y) of the individual images. Steps 105 and 110 are performed in the manner described with reference to Fig. 2A.
The processor 205 then, in step 117, performs spatio-parallax (3D) segmentation on the three-dimensional block of colour values f(x,y,d) based on colour to form a set of contiguous three-dimensional segments {Si}. Again the Mumford-Shah segment-merging segmentation is preferably used, in a manner similar to that described with reference to Fig. 7. However, in this embodiment, instead of fully segmenting the three-dimensional block of colour values f(x,y,d) as is the case in step 115 (Fig. 2A), step 117 halts the segment-merging segmentation at monotonically increasing intermediate threshold λ'stop values during the merge sequence. The segmentation with low intermediate threshold λ'stop values results in small segments Si.
Following the 3D segmentation step 117, the processor 205 performs steps 120 to 140 in the manner described with reference to Fig. 2A. Typically, many of the three-dimensional segments Si, which are smaller in the initial stages of this embodiment caused by the lower starting intermediate threshold values, are flagged as having inaccurate disparity estimates δi in step 130. The resulting range map formed in step 140 is thus sparsely populated with range measures.
The processor 205 next determines in step 145 whether the intermediate threshold λ'stop used in step 117 is smaller than the threshold λstop which signifies the stopping point of the merging and which was determined with reference to Fig. 8. If the intermediate threshold used in step 117 is smaller than the threshold λstop, then the method 101 continues to step 146 where the processor 205 increases the value of the intermediate threshold λ'stop by a predetermined amount. The method 101 then returns to step 117 where the processor 205 further segments the three-dimensional block of colour values f(x,y,d) to provide a somewhat coarser segmentation.
Steps 120 to 140 are repeated with this coarser segmentation to again calculate a two-dimensional array of range measures. These newly calculated range measures are added to the range map in their appropriate positions in a manner whereby old range measure values in the range map are not replaced with newly calculated range measure values. Hence, range measures in the range map resulting from small three-dimensional segments Si are retained and unpopulated grid positions of the range map are filled in with newly calculated range measures.
Steps 117 to 145 are repeated until the processor 205 determines in step 145 that the intermediate threshold λ'stop used in step 117 is equal to the threshold λstop which signifies the stopping point of the merging. The method 101 then ends in step 147.
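A sketch of the fine-to-coarse refinement loop of method 101, assuming a helper along the lines of the sketches above that runs steps 117 to 140 at one intermediate threshold (the schedule and helper names are illustrative):

```python
import numpy as np

def build_range_map_fine_to_coarse(block, lam_schedule, estimate_range_map):
    """Method 101 sketch: accumulate range measures over rising λ'stop values.

    estimate_range_map(block, lam) is assumed to segment the block at one
    intermediate threshold and return a 2D array with NaN where no reliable
    range measure was obtained.
    """
    accumulated = None
    for lam in lam_schedule:                 # monotonically increasing λ'stop
        partial = estimate_range_map(block, lam)
        if accumulated is None:
            accumulated = partial.copy()
        else:
            # Fill only the still-unpopulated cells; earlier (finer) values
            # are retained and never overwritten by coarser segmentations.
            empty = np.isnan(accumulated)
            accumulated[empty] = partial[empty]
    return accumulated
```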
The range map formed by the method 101 thus typically includes more than one different range measure for objects spanning different ranges, thus retaining the depth of objects. However, for this to occur, sufficient colour differences, or shading or texture across such objects are needed.
Due to the occlusion test performed in step 130 of the method 101, the resulting range map may include segments that have no range measures assigned thereto. In order to form a fully populated range map, in the preferred implementation of method 101, step 130 is not performed during the final iteration of steps 117 to 145.
Also, if high range precision is not required, the disparity of each segment of the final segmentation may be set to the area-weighted average of the disparity of the finer segments contained within the boundaries.
In the first and second embodiments described above, the cameras 250-1 to 250- D are still cameras for capturing still images of the scene. In a third embodiment, the cameras 250-1 to 250-D are video cameras, each video camera 250-1 to 250-D arranged to capture a sequence of images, also known as frames, of the scene. Fig. 2C shows a schematic flow diagram of a method 102 of estimating a range map of the scene according to a third embodiment. In particular, a sequence of range maps is formed by method 102. The method 102 is similar to the method 100 described with reference to Fig. 2A, but the method 102 uses the sequence of images of the scene to estimate the sequence of range maps.
The method 102 starts in step 150 where the processor 205 receives a predetermined number L of images from each of the cameras 250-1 to 250-D. Preferably, the predetermined number L of images received from each of the cameras 250-1 to 250-D is 2 to 8. Each of the images comprises a two-dimensional array of colour values f(x,y) and is captured at frame interval t.
Step 152 follows where the processor 205 forms a four-dimensional block of colour values f(x,y,d,t) from the colour values f(x,y) of the individual images from each of the cameras 250-1 to 250-D. The first two of the dimensions, those being the x and y dimensions, are the spatial dimensions of the images which correspond to the coordinates of a particular pixel in the image, the third dimension is the parallax dimension which corresponds to the camera 250-d that captured the image, and the fourth dimension is the time dimension which corresponds to the frame interval t at which the image was captured.
The processor 205 then, in step 154, performs spatio-parallax-temporal (4D) segmentation on the four-dimensional block of colour values f(x,y,d,t) based on colour to form a set of contiguous four-dimensional segments {Si}. Again the Mumford-Shah segment-merging segmentation is preferably used, in a manner similar to that described with reference to Fig. 7, except that the segments Si are four-dimensional.
Additionally, the segment-merging segmentation has to ensure that all two-dimensional segments sid that can be formed from the four-dimensional segments Si are contiguous, that is, do not consist of multiple, disjoint lobes.
Fig. 9A illustrates a three-dimensional segment SA which always results in contiguous two-dimensional segments SAd for all time dimension values. The parallax dimension is not illustrated for simplicity. The segment SA may be compared with the three-dimensional segment SB illustrated in Fig. 9B, which results in disconnected two-dimensional segments SBd and S'Bd for certain time dimension values. The configuration illustrated in Fig. 9B is termed a "branching" configuration herein.
In the preferred embodiment, segments Si and Sj that would produce a branching configuration are prevented from merging, thereby preventing disconnected two-dimensional segments. Before merging neighbouring segments Si and Sj in step 710 (Fig. 7), the processor 205 tests whether the resulting segment Sij would have a branching configuration. If the resulting segment Sij would have a branching configuration, then the merging of segments Si and Sj is prevented by effectively setting the merging cost τij to infinity.
Fig. 10A illustrates two neighbouring segments SA and SB having a common boundary. A minimum and maximum extent of segment SA along the time axis is defined as RAtmin and RAtmax respectively. Similarly, the minimum and maximum extent of segment SB along the time axis is defined as RBtmin and RBtmax respectively. The boundary separating segments SA and SB has a time extent (Bndrytmin, Bndrytmax). With the block of colour values f including images taken at time intervals Wtmin to Wtmax, segments SA and SB are allowed to merge if, and only if:

[max(RAtmin, Wtmin) = max(Bndrytmin, Wtmin) OR max(RBtmin, Wtmin) = max(Bndrytmin, Wtmin)]
AND
[min(RBtmax, Wtmax) = min(Bndrytmax, Wtmax) OR min(RAtmax, Wtmax) = min(Bndrytmax, Wtmax)]    (17)

An example of two neighbouring segments SA and SB that would not be allowed to merge is illustrated in Fig. 10B. As can be seen from the illustration, Equation (17) will not be satisfied for the illustrated segments SA and SB.
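A direct sketch of the merge-admissibility test of Equation (17), with the segment, boundary and window time extents passed in as plain numbers (parameter names are illustrative):

```python
def merge_allowed(ra_tmin, ra_tmax, rb_tmin, rb_tmax,
                  bndry_tmin, bndry_tmax, w_tmin, w_tmax):
    """Equation (17): may segments A and B merge without creating a branch?

    ra_*/rb_* are the time extents of the two segments, bndry_* the time
    extent of their common boundary, and w_* the time window of the block.
    """
    lower_ok = (max(ra_tmin, w_tmin) == max(bndry_tmin, w_tmin) or
                max(rb_tmin, w_tmin) == max(bndry_tmin, w_tmin))
    upper_ok = (min(rb_tmax, w_tmax) == min(bndry_tmax, w_tmax) or
                min(ra_tmax, w_tmax) == min(bndry_tmax, w_tmax))
    return lower_ok and upper_ok
```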
Referring again to Fig. 2C, the processor 205 then forms in step 156 two segmented images from the segmented four-dimensional block of data by "slicing through" the segmented four-dimensional block in the image plane (x-y plane), that is in planes having constant parallax dimension values d and constant time dimension values t.
649513.doc -24- In the preferred implementation the two segmented images are formed by slicing through the segmented four-dimensional block in planes having the time dimension t corresponding to the oldest image in the block, and parallax dimension values 1 and D.
The processor 205 then estimates, in step 158, the disparity 6i of each of the four-dimensional segments Si by measuring the displacement 6 in a direction parallel with the baseline 260 (Fig. 1) of corresponding two-dimensional segments sidi of the fourdimensional segment Si under consideration.
In step 160 the processor 205 tests for occlusions, followed by calculating in step 162 a range measure from the disparity 65 for each four-dimensional segment S, that has not been flagged as having an inaccurate disparity. A range map is then formed in step 164. Steps 160, 162 and 164 are performed in a manner similar to steps 130, 135 and 140 respectively, described with reference to Fig. 2A.
The method 102 then proceeds to step 175 where the processor 205 receives a next image in the sequence of images from each of the cameras 250-1 to 250-D. In step 176 the processor 205 adds the D images received in step 175 to the four-dimensional block of data while removing the oldest D images from the block of data. Step 154 is then repeated by the processor 205 with the newly formed block of data. In particular, the segments Si formed in a previous segmentation are maintained in sub-step 704 (Fig. 7), with only the approximation vectors ui of the pixels of the new images being set to the colour values f. The segmentation step 154 thus merges the unsegmented pixels of the new images into the existing segments Si from the previous segmentation.
Steps 156 to 164 are then repeated to form a next range map in step 164. A sequence of range maps is thus formed. Because the two-dimensional segments sidt are associated with the same four-dimensional segment Si for different values of time interval t, the object represented by segment Si may be tracked, and any changes in its range monitored. In the instance where the two segmented images are formed in step 156 by slicing through the segmented four-dimensional block in planes having the time dimension t corresponding to the oldest image in the block, the sequence of range maps has a time lag of L-1 image intervals behind the sequence of images from the cameras.
The per-segment disparity estimate may be averaged across time over each segment to give a more stable range estimate.
Fig. 2D shows a schematic flow diagram of yet another method 103 of estimating a range map of the scene according to a fourth embodiment. The method 103 is similar to method 102 described with reference to Fig. 2C in that it also uses a sequence of images of the scene to estimate a sequence of range maps, and method 103 is similar to the method 101 described with reference to Fig. 2B in that each range map includes more than one different range measure for objects spanning different ranges.
In order to obtain segmentations ranging from fine to coarse to enable the creation of different range measures for objects spanning different ranges, the segmentation may be restarted during every iteration of the segmentation, that is after the pixels from the oldest images are removed from the block of data, and the colour values of the pixels of the newly received images are added to the block of data. The range map for each time interval t is then gradually built up, as is the case in method 101 (Fig. 2B).
However, in the preferred implementation, when performing segmentation which includes a time dimension t, the segments Si formed in a previous segmentation are maintained, and the unsegmented pixels of the new images are merged into the existing segments Si from the previous segmentation, as is described above with reference to method 102 (Fig. 2C). Accordingly, method 103 forms a number N of four-dimensional blocks of colour values, and performs segmentation and range estimation on each of the four-dimensional blocks in parallel. Each block is segmented to a differing level of coarseness, controlled by the threshold λstop used for each segmentation.
The method 103 starts in step 150 where the processor 205 receives the predetermined number L of images from each of the cameras 250-1 to 250-D. The processor 205 then, in step 151, forms N four-dimensional blocks of colour values f(x,y,d,t) from the colour values f(x,y) of the individual images from each of the cameras 250-1 to 250-D, and in a manner similar to that described in relation to step 152 (Fig. 2C).
Parallel steps 153-1 to 153-N follow, where each parallel step 153-n includes steps 154 to 164 as is described with reference to Fig. 2C above. The processor 205 segments each block in step 154 with differing threshold λstop values. The segmentation with the lowest threshold λstop value results in small segments Si and, as the threshold λstop increases, so do the sizes of the resulting segments Si. It is noted that the test to determine whether two neighbouring segments are allowed to merge, as set out in Equation (17), has to be included in each segmentation.
Each parallel step 153-n then forms two segmented images, estimates the disparity δi of each of the four-dimensional segments Si, tests for occlusions, calculates a range measure from the disparity δi for each four-dimensional segment Si that has not been flagged as having an inaccurate disparity, and finally, forms a range map. The range maps formed from fine segmentations (low threshold values) are typically sparsely populated with range measures.
The parallel steps 153-1 to 153-N are followed by step 170 where the processor 205 combines the range maps formed in each of the parallel steps 153-1 to 153-N to form a final range map. The range measures from the coarsest segmentation (highest threshold value) are first added to the final range map at corresponding grid positions, followed by the range measures from progressively finer segmentations. Range measures from a finer segmentation overwrite those from coarser segmentations. The final range map thus typically includes more than one different range measure for objects spanning different ranges.
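A sketch of the combination of step 170, assuming the N per-block range maps are ordered from coarsest to finest and use NaN to mark unpopulated cells (names illustrative):

```python
import numpy as np

def combine_range_maps(maps_coarse_to_fine):
    """Step 170 sketch: overlay range maps, finer segmentations winning.

    maps_coarse_to_fine: list of 2D arrays ordered from the coarsest
    segmentation (highest λstop) to the finest; NaN marks unpopulated cells.
    """
    final = maps_coarse_to_fine[0].copy()
    for finer in maps_coarse_to_fine[1:]:
        populated = ~np.isnan(finer)
        final[populated] = finer[populated]   # finer measures overwrite coarser
    return final
```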
The method 103 then proceeds to step 175 where the processor 205 receives a next image in the sequence of images from each of the cameras 250-1 to 250-D. In step 177 the processor 205 adds the D images received in step 175 to each of the N four-dimensional blocks of data while removing the oldest D images from each of the N blocks. Parallel steps 153-1 to 153-N and 170 are then repeated by the processor 205 with the newly formed blocks of data to form a next final range map in a sequence of final range maps.
From the embodiments described with reference to Figs 2A to 2D it can be seen that, after the block of colour values is formed, the dimensionality of the block of colour values is not really used again. Accordingly, the embodiments described may be changed to form higher dimensional blocks of colour values, and to estimate range maps therefrom, without departing from the invention. For example, instead of arranging the cameras 250-1 to 250-D along a baseline as is described with reference to Fig. 1, the cameras may be arranged in a two-dimensional grid. In the case where the cameras are still image cameras, a four-dimensional block of colour values is formed, followed by four-dimensional segmentation. Separate horizontal and vertical disparities are then estimated from which the range map is formed.
Similarly, in the case where the cameras are video cameras, a five-dimensional block of colour values is formed, followed by five-dimensional segmentation. Again, separate horizontal and vertical disparities are estimated from which a sequence of range maps is formed.
In yet a further implementation, the cameras are arranged in a three-dimensional grid, in which case five- or six-dimensional segmentation is performed.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
For example, in the case where the amount of data in the block of colour values f is large, the processing costs may be reduced by averaging several adjacent images from cameras 250-1 to 250-D together, while maintaining connectivity.
Further, although the methods of Figs. 2A and 2B have been described in terms of multiple physical cameras 250-1 to 250-D, those methods may similarly be applied to multiple displaced images from any source. For example, the multiple displaced images may be captured by a single camera tracking along the baseline 260.
Yet further, the multiple displaced images may be captured by a single handheld camera moved along the baseline 260, in which case a pre-registration step is required to remove camera pan/tilt and perspective effects before forming the block of colour values f.

In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including" and not "consisting only of". Variations of the word comprising, such as "comprise" and "comprises", have corresponding meanings.

Claims (4)

1. A method of estimating range information of a scene from pixel data of at least two images, said images being captured of said scene from displaced relative positions, said method comprising the steps of:
forming a multi-dimensional data set comprising pixel data from said at least two images, the dimensions of said pixel data set comprising spatial dimensions corresponding to dimensions of said images, and a dimension corresponding to said displaced relative positions;
segmenting said multi-dimensional data set into multi-dimensional segments using multi-dimensional segmentation;
for each multi-dimensional segment, estimating a disparity for said multi-dimensional segment by estimating a displacement between first and second intersections in a spatial plane of said multi-dimensional segment; and
calculating range measures from said disparities.
2. The method as claimed in claim 1 comprising the further step of: forming a two-dimensional array of range measures, said array being associated with a third intersection in said spatial plane of said multi-dimensional segment, and the values of said array are range measures of the multi-dimensional segment in said third intersection.
3. The method as claimed in claim 2, said method further comprising testing for occlusion of said first and second intersections, wherein disparities of multi-dimensional segments having occluded intersections are disregarded.
4. The method as claimed in claim 3, wherein testing for occlusion comprises testing for differences in one or more of area, Hu invariant moments and image difference.

5. The method as claimed in claim 3 or 4, wherein steps to are repeated with segmentations to differing levels of coarseness, and each value in said two-dimensional array of range measures is the range measure resulting from the finest segmentation for which said intersections are not occluded.

6. The method as claimed in any one of claims 1 to 5, wherein said images comprise sequences of images, each sequence of images being captured from said displaced relative positions, and said dimensions of said pixel data set further comprise a time dimension.

7. The method as claimed in claim 6 when dependent on any one of claims 2 to 4, wherein steps to are performed in parallel with segmentations to differing levels of coarseness, said method comprises the further step of:
combining said two-dimensional arrays of range measures by retaining range measures resulting from the finest segmentation for which said intersections are not occluded.

8. The method as claimed in claim 6 or 7, wherein said multi-dimensional pixel data set comprises images from a predetermined number of time intervals, and said method further comprises:
removing pixel data corresponding to the oldest images in said sequences of images from said multi-dimensional pixel data set;
adding pixel data corresponding to next images in said sequences of images to said multi-dimensional pixel data set; and
repeating at least steps to

9. The method as claimed in any one of claims 1 to 8, wherein a focal distance to an object of interest is known, said method comprising offsetting said images when forming said multi-dimensional pixel data set to compensate for displacement of said object in said images.

10. An apparatus for estimating range information of a scene from pixel data of at least two images, said images being captured of said scene from displaced relative positions, said apparatus comprising:
means for forming a multi-dimensional data set comprising pixel data from said at least two images, the dimensions of said pixel data set comprising spatial dimensions corresponding to dimensions of said images, and a dimension corresponding to said displaced relative positions;
means for segmenting said multi-dimensional data set into multi-dimensional segments using multi-dimensional segmentation;
for each multi-dimensional segment, means for estimating a disparity for said multi-dimensional segment by estimating a displacement between first and second intersections in a spatial plane of said multi-dimensional segment; and
means for calculating range measures from said disparities.

11. The apparatus as claimed in claim 10 further comprising:
means for forming a two-dimensional array of range measures, said array being associated with a third intersection in said spatial plane of said multi-dimensional segment, and the values of said array are range measures of the multi-dimensional segment in said third intersection.

12. The apparatus as claimed in claim 11, said apparatus further comprising means for testing for occlusion of said first and second intersections, wherein disparities of multi-dimensional segments having occluded intersections are disregarded.

13. A program stored on a memory medium for estimating range information of a scene from pixel data of at least two images, said images being captured of said scene from displaced relative positions, said program comprising:
code for forming a multi-dimensional data set comprising pixel data from said at least two images, the dimensions of said pixel data set comprising spatial dimensions corresponding to dimensions of said images, and a dimension corresponding to said displaced relative positions;
code for segmenting said multi-dimensional data set into multi-dimensional segments using multi-dimensional segmentation;
for each multi-dimensional segment, code for estimating a disparity for said multi-dimensional segment by estimating a displacement between first and second intersections in a spatial plane of said multi-dimensional segment; and
code for calculating range measures from said disparities.

14. The program as claimed in claim 13 further comprising:
code for forming a two-dimensional array of range measures, said array being associated with a third intersection in said spatial plane of said multi-dimensional segment, and the values of said array are range measures of the multi-dimensional segment in said third intersection.

15. The program as claimed in claim 14, said program further comprising code for testing for occlusion of said first and second intersections, wherein disparities of multi-dimensional segments having occluded intersections are disregarded.

16. A method of estimating range information of a scene from pixel data of at least two images, said method being substantially as herein described with reference to any one of the embodiments shown in the accompanying drawings.

17. An apparatus for estimating range information of a scene from pixel data of at least two images, said apparatus being substantially as herein described with reference to any one of the embodiments shown in the accompanying drawings.

DATED this Sixth Day of November 2003
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
SPRUSON & FERGUSON
AU2003264580A 2002-11-29 2003-11-26 Range Estimation Using Multi-dimensional Segmentation Ceased AU2003264580B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2003264580A AU2003264580B2 (en) 2002-11-29 2003-11-26 Range Estimation Using Multi-dimensional Segmentation

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
AU2002953003 2002-11-29
AU2002953003A AU2002953003A0 (en) 2002-11-29 2002-11-29 Range estimation via multi-dimensional segmentation
AU2003264580A AU2003264580B2 (en) 2002-11-29 2003-11-26 Range Estimation Using Multi-dimensional Segmentation

Publications (2)

Publication Number Publication Date
AU2003264580A1 (en) 2004-06-17
AU2003264580B2 (en) 2006-08-10

Family

ID=34314556

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2003264580A Ceased AU2003264580B2 (en) 2002-11-29 2003-11-26 Range Estimation Using Multi-dimensional Segmentation

Country Status (1)

Country Link
AU (1) AU2003264580B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10884396B2 (en) 2019-02-27 2021-01-05 General Electric Company Sensor based smart segmentation

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5547301B2 (en) * 2010-01-25 2014-07-09 トムソン ライセンシング Separate video encoder, video decoder, video encoding method and video decoding method for each color plane

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0738872A2 (en) * 1995-04-21 1996-10-23 Matsushita Electric Industrial Co., Ltd. Stereo matching method and disparity measuring method
US5652616A (en) * 1996-08-06 1997-07-29 General Instrument Corporation Of Delaware Optimal disparity estimation for stereoscopic video coding
WO2001016868A1 (en) * 1999-08-30 2001-03-08 Koninklijke Philips Electronics N.V. System and method for biometrics-based facial feature extraction

Also Published As

Publication number Publication date
AU2003264580A1 (en) 2004-06-17

Similar Documents

Publication Publication Date Title
JP7081140B2 (en) Object recognition device, object recognition method and object recognition program
CN107346061B (en) System and method for parallax detection and correction in images captured using an array camera
US9800856B2 (en) Systems and methods for synthesizing images from image data captured by an array camera using restricted depth of field depth maps in which depth estimation precision varies
US9762881B2 (en) Reducing disparity and depth ambiguity in three-dimensional (3D) images
US8326025B2 (en) Method for determining a depth map from images, device for determining a depth map
KR100953076B1 (en) Multi-view matching method and device using foreground/background separation
US8897545B2 (en) Apparatus and method for determining a confidence value of a disparity estimate
WO2015134996A1 (en) System and methods for depth regularization and semiautomatic interactive matting using rgb-d images
WO2007081784A1 (en) Segmenting image elements
EP1442614A1 (en) Method and apparatus for image matching
CN109636732A (en) A kind of empty restorative procedure and image processing apparatus of depth image
CN102209969B (en) Character area extracting device, image picking-up device provided with character area extracting function and character area extracting program
EP3311361A1 (en) Method and apparatus for determining a depth map for an image
US20150302595A1 (en) Method and apparatus for generating depth information
US20130051657A1 (en) Method and apparatus for determining a similarity or dissimilarity measure
CN107578419B (en) Stereo image segmentation method based on consistency contour extraction
AU2003264580B2 (en) Range Estimation Using Multi-dimensional Segmentation
Schauwecker et al. A comparative study of stereo-matching algorithms for road-modeling in the presence of windscreen wipers
Chari et al. Augmented reality using over-segmentation
CN116347056A (en) Image focusing method, device, computer equipment and storage medium
US20230419524A1 (en) Apparatus and method for processing a depth map
CN112529773B (en) QPD image post-processing method and QPD camera
Lee et al. Removing foreground objects by using depth information from multi-view images
CN111630569B (en) Binocular matching method, visual imaging device and device with storage function
Xu et al. Accurate image specular highlight removal based on light field imaging

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)
MK14 Patent ceased section 143(a) (annual fees not paid) or expired