WO2000027131A2 - Improved methods and apparatus for 3-d imaging - Google Patents

Improved methods and apparatus for 3-d imaging

Info

Publication number
WO2000027131A2
WO2000027131A2 PCT/GB1999/003584
Authority
WO
WIPO (PCT)
Prior art keywords
image
images
pair
disparity map
pyramid
Prior art date
Application number
PCT/GB1999/003584
Other languages
French (fr)
Other versions
WO2000027131A3 (en
Inventor
Joseph Zhengping Jin
Timothy Bryan Niblett
Colin William Urquhart
Original Assignee
C3D Limited
Priority date
Filing date
Publication date
Application filed by C3D Limited filed Critical C3D Limited
Priority to EP99954099A priority Critical patent/EP1125249A2/en
Priority to AU10543/00A priority patent/AU1054300A/en
Publication of WO2000027131A2 publication Critical patent/WO2000027131A2/en
Publication of WO2000027131A3 publication Critical patent/WO2000027131A3/en

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01BMEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B11/00Measuring arrangements characterised by the use of optical techniques
    • G01B11/24Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures
    • G01B11/25Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures by projecting a pattern, e.g. one or more lines, moiré fringes on the object
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/243Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T7/85Stereo camera calibration
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/246Calibration of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/25Image signal generators using stereoscopic image cameras using two or more image sensors with different characteristics other than in their location or field of view, e.g. having different resolutions or colour pickup characteristics; using image signals from one sensor to control the characteristics of another sensor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/257Colour aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/296Synchronisation thereof; Control thereof
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/30Image reproducers
    • H04N13/388Volumetric displays, i.e. systems where the image is built up from picture elements distributed through a volume
    • H04N13/393Volumetric displays, i.e. systems where the image is built up from picture elements distributed through a volume the volume being generated by a moving, e.g. vibrating or rotating, surface
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C11/00Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
    • G01C11/04Interpretation of pictures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/189Recording image signals; Reproducing recorded image signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/239Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N2013/0074Stereoscopic image analysis
    • H04N2013/0081Depth or disparity estimation from stereoscopic image signals

Definitions

  • the present invention relates to apparatus and methods for the acquisition of three-dimensional (3-D) model images using stereo imaging techniques. More specifically, although not exclusively, the invention provides novel calibration and stereo matching methods for use in stereo imaging.
  • Three-dimensional or "3-D" imaging is of great interest as there are numerous potentially useful applications of such a technique including, for example, capturing 3-D images of objects, human faces or the environment, for quantitative measurement or visual recordal purposes.
  • One known technique involves capturing a pair of stereo images of an object, commonly using two cameras, and "matching" the two images in order to construct a so-called "disparity map" specifying the stereo disparity for each pixel in one image relative to the other image.
  • the disparity map can be used to construct a 3-D model image, from the pair of captured images, for viewing on a computer screen, the 3-D image being rotatable on screen for viewing from different apparent angles.
  • the disparity map is a two dimensional array which specifies, for each pixel p(x,y) in one image (e.g. the left image) of the pair, the displacement, as a vector, to a corresponding point p r in the other image (i.e. the right image).
  • By a "corresponding point" in the right image we mean the point in the right image at which the scene component imaged by pixel p(x,y) in the left image appears.
  • the construction of the disparity map is a crucial step in the 3-D imaging process and the accuracy of the final 3-D model which is obtained will depend on the quality of the disparity map. To date, the matching techniques used to construct such disparity maps have required substantial processing power and computation time, for example around 45 minutes to match two 512x512 pixel images.
  • one known correlation-based matching method uses a 5 x 5 pixel window to compute correlation between the left and right images.
  • the correlation estimate opens a 5 x 5 (in the preferred embodiment) window around u in L and u' in R to determine the most likely u' . To be accurate this formula assumes that there is no disparity gradient over this small window. This assumption is invalid, as in fact the geometric distortion over a 5 x 5 pixel window can be as much as 1 pixel.
  • Another important aspect of the stereo imaging process is the calibration of the cameras used to record the raw images. This usually requires either skilled intervention by the operator of the imaging system to carry out a calibration procedure in order to achieve accurate calibration, or "factory" calibration, which reduces the range of operating volumes of the 3-D imaging process. Projective distortion effects in the imaging system can also affect the accuracy of the calibration achieved.
  • a method of measuring stereo image disparity for use in a 3-D modelling system comprising the steps of:
  • step (g) matching said shifted version of said one image of the next-coarsest pair of images with the other image of said next-coarsest pair of images so as to obtain a respective disparity map for said other image and said shifted image, which respective disparity map is combined with said initial disparity map so as to obtain a new, updated, disparity map for said next-coarsest pair of images; and (h) repeating steps (f) and (g) for the pair of filtered images in each subsequent pyramid level, at each level using the new, updated disparity map obtained for the previous level as said initial disparity map for carrying out the warping process in step (f), so as to arrive at a final disparity map for the least coarse pair of images in the pyramid.
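  • By way of illustration only, the following sketch shows the coarse-to-fine control flow of steps (e) to (h) above in C++. It assumes horizontal-only disparities, a simple sum-of-absolute-differences block matcher, nearest-neighbour upsampling and a factor of 2 between pyramid levels; all type and function names are illustrative and are not taken from the patent.

```cpp
// Minimal sketch of the coarse-to-fine loop in steps (f)-(h); illustrative only.
// Assumes: horizontal-only disparities, square SAD windows, pyramids ordered
// coarsest-first, and a scale factor of 2 between levels.
#include <vector>
#include <cmath>
#include <algorithm>

struct Image {
    int w = 0, h = 0;
    std::vector<float> px;               // row-major brightness values
    Image(int w_, int h_) : w(w_), h(h_), px(w_ * h_, 0.f) {}
    float at(int x, int y) const {       // clamped access
        x = std::clamp(x, 0, w - 1); y = std::clamp(y, 0, h - 1);
        return px[y * w + x];
    }
    float& ref(int x, int y) { return px[y * w + x]; }
};

// Warp one image of a pair using the current disparity estimate.
Image warpHorizontal(const Image& img, const Image& disp) {
    Image out(img.w, img.h);
    for (int y = 0; y < img.h; ++y)
        for (int x = 0; x < img.w; ++x)
            out.ref(x, y) = img.at(x + (int)std::lround(disp.at(x, y)), y);
    return out;
}

// Brute-force SAD block matching giving a (residual) disparity per pixel.
Image matchLevel(const Image& L, const Image& R, int radius, int search) {
    Image disp(L.w, L.h);
    for (int y = 0; y < L.h; ++y)
        for (int x = 0; x < L.w; ++x) {
            float best = 1e30f; int bestD = 0;
            for (int d = -search; d <= search; ++d) {
                float sad = 0.f;
                for (int dy = -radius; dy <= radius; ++dy)
                    for (int dx = -radius; dx <= radius; ++dx)
                        sad += std::fabs(L.at(x + dx, y + dy) - R.at(x + dx + d, y + dy));
                if (sad < best) { best = sad; bestD = d; }
            }
            disp.ref(x, y) = (float)bestD;
        }
    return disp;
}

// Move a disparity map one level down the pyramid (factor-2 scaling assumed).
Image upsample2x(const Image& d) {
    Image out(d.w * 2, d.h * 2);
    for (int y = 0; y < out.h; ++y)
        for (int x = 0; x < out.w; ++x)
            out.ref(x, y) = 2.f * d.at(x / 2, y / 2);  // values scale with resolution
    return out;
}

// Steps (e)-(h): initial match at the coarsest level, then warp-and-refine.
Image coarseToFine(const std::vector<Image>& leftPyr,
                   const std::vector<Image>& rightPyr) {
    Image disp = matchLevel(leftPyr[0], rightPyr[0], 2, 4);      // step (e)
    for (size_t lvl = 1; lvl < leftPyr.size(); ++lvl) {
        disp = upsample2x(disp);                                 // inherit the estimate
        Image warped = warpHorizontal(rightPyr[lvl], disp);      // step (f)
        Image residual = matchLevel(leftPyr[lvl], warped, 2, 2); // step (g)
        for (size_t i = 0; i < disp.px.size(); ++i)              // combine the maps
            disp.px[i] += residual.px[i];
        // step (h): repeat for each subsequent (finer) level
    }
    return disp;
}
```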
  • Said processing step (d) for image pyramid generation may conveniently comprise operating on said first and second digitised camera output images with a scaling and convolution function.
  • the plurality of pairs of filtered images produced by said scaling and convolution function are preferably Difference of Gaussian (DoG) images.
  • the scaling and convolution function may produce a plurality of pairs of Laplacian images.
  • Other filtering functions could alternatively be chosen, as long as each pair of filtered images produced by the chosen function are of a relatively lower resolution than the resolution of the pair of images in the previous pyramid level .
  • the first pair of filtered images produced in step (d) may in fact be of the same scale (i.e. equal in size) as the digitised first and second camera output images.
  • each subsequent pyramid level contains images which are scaled by a factor of f, where 0 < f < 1, relative to the previous level.
  • scaling and summing over all levels of the pyramid provides the original digitised first and second camera output images.
  • step (d) above preferably includes successively producing a scale pyramid of pairs of left and right filtered images, preferably Difference of Gaussian (DoG) images, from said first and second digitised camera output images, each successive scale providing smaller images having a lower (“coarser”) resolution.
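  • As a sketch (assuming a separable Gaussian blur, a fixed blur parameter per level and factor-of-2 downsampling, none of which is prescribed by the text), one way to build such a scale pyramid is shown below; each level is computed as the image minus its blurred version, i.e. a Laplacian-pyramid-style band-pass, which the text allows as an alternative to true DoG images.

```cpp
// Sketch of scale pyramid generation for step (d); illustrative only.
#include <vector>
#include <cmath>
#include <algorithm>

using Img = std::vector<std::vector<float>>;

static float at(const Img& im, int y, int x) {
    y = std::clamp(y, 0, (int)im.size() - 1);
    x = std::clamp(x, 0, (int)im[0].size() - 1);
    return im[y][x];
}

// Separable Gaussian blur with standard deviation sigma.
Img gaussian(const Img& im, float sigma) {
    int r = (int)std::ceil(3.f * sigma);
    std::vector<float> k(2 * r + 1);
    float sum = 0.f;
    for (int i = -r; i <= r; ++i) { k[i + r] = std::exp(-0.5f * i * i / (sigma * sigma)); sum += k[i + r]; }
    for (float& v : k) v /= sum;
    int H = im.size(), W = im[0].size();
    Img tmp(H, std::vector<float>(W, 0.f)), out = tmp;
    for (int y = 0; y < H; ++y)                       // horizontal pass
        for (int x = 0; x < W; ++x)
            for (int i = -r; i <= r; ++i) tmp[y][x] += k[i + r] * at(im, y, x + i);
    for (int y = 0; y < H; ++y)                       // vertical pass
        for (int x = 0; x < W; ++x)
            for (int i = -r; i <= r; ++i) out[y][x] += k[i + r] * at(tmp, y + i, x);
    return out;
}

// Pyramid of band-pass images, pyr[0] being the finest (largest) level.
std::vector<Img> bandPassPyramid(Img im, int levels, float sigma = 1.0f) {
    std::vector<Img> pyr;
    for (int l = 0; l < levels; ++l) {
        Img blurred = gaussian(im, sigma);
        Img bp = im;                                  // band-pass = image - blurred image
        for (size_t y = 0; y < im.size(); ++y)
            for (size_t x = 0; x < im[0].size(); ++x)
                bp[y][x] = im[y][x] - blurred[y][x];
        pyr.push_back(bp);
        Img next(im.size() / 2, std::vector<float>(im[0].size() / 2));  // halve the scale
        for (size_t y = 0; y < next.size(); ++y)
            for (size_t x = 0; x < next[0].size(); ++x)
                next[y][x] = blurred[2 * y][2 * x];
        im = next;
    }
    return pyr;
}
```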
  • each pair of left and right images in the scale pyramid may successively and sequentially be used to calculate a new, updated, disparity map at the next level down in the pyramid.
  • This process proceeds from the coarsest to the finest scale of detail, propagating down new, further refined, values for the disparity map for said first and second camera output images at each transition between scales of the pyramid.
  • the final disparity map provides an accurate measure of the disparity between the first and second digitised camera output images.
  • each disparity map in practice preferably comprises two two-dimensional data arrays, a first said array comprising the horizontal disparity values for each pixel (in a chosen one of said first and second images relative to the other of said first and second images) and a second said array comprising the respective vertical disparity values for each pixel .
  • a significant advantage of the above-described invention is that the warping operation ensures that the current best estimate of the disparity (for each pixel) , inherited from the previous scale or "level” of the pyramid, is always used when matching the pair of images in the next level down in order to calculate a new, even better estimate of the disparity. This tends to minimise the adverse effects of geometric distortion.
  • the method may include repeating steps (f) and
  • the method preferably further includes constructing a confidence map conveniently in the form of a two-dimensional data array, during each iteration of the above-described method.
  • Each confidence map provides an estimate of the confidence with which the calculated new, further refined (horizontal and vertical) disparity values for each pixel are held.
  • the method preferably also includes the further step of carrying out a smoothing operation on the disparity and confidence maps produced at each level in the scale pyramid of images (for said first and second camera output images of the object scene) , prior to using these smoothed maps in the calculation of the new, further refined disparity and confidence maps at the next level.
  • the smoothing operation preferably comprises convolving each map with a predetermined weight construction function W(I,a,b,P) which, preferably, is dependent upon the original image intensity (brightness) values, and the confidence values, associated with each pixel.
  • This weight construction function preferably computes a convolution kernel, for example a 3 X 3 kernel, on an image (data array) I at pixel p(a,b) using a probability array P, which is preferably the confidence map C.
  • this convolution is repeated a plurality of times, preferably at least ten, advantageously twenty, times in order to obtain a final smoothed version of each disparity map and confidence map produced at each pyramid level.
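  • The exact weight construction function W(I,a,b,P) is not given in this extract; the sketch below therefore uses an illustrative choice in which the 3 x 3 kernel weights are the neighbouring confidence values attenuated by intensity similarity to the centre pixel, and the convolution is iterated twenty times as suggested above.

```cpp
// Sketch of confidence-weighted smoothing of a disparity map; the weighting
// is an illustrative stand-in for the patent's W(I,a,b,P).
#include <vector>
#include <cmath>
#include <algorithm>

using Img = std::vector<std::vector<float>>;

static float at(const Img& im, int y, int x) {
    y = std::clamp(y, 0, (int)im.size() - 1);
    x = std::clamp(x, 0, (int)im[0].size() - 1);
    return im[y][x];
}

// One pass: disp = disparity, conf = confidence, inten = image brightness.
// The confidence map itself can be smoothed by passing it as 'disp'.
Img smoothOnce(const Img& disp, const Img& conf, const Img& inten, float sigmaI = 10.f) {
    int H = disp.size(), W = disp[0].size();
    Img out = disp;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float num = 0.f, den = 0.f, c = at(inten, y, x);
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    float dI = at(inten, y + dy, x + dx) - c;
                    float w = at(conf, y + dy, x + dx)                    // trust confident pixels
                            * std::exp(-0.5f * dI * dI / (sigmaI * sigmaI)); // stay within similar brightness
                    num += w * at(disp, y + dy, x + dx);
                    den += w;
                }
            if (den > 0.f) out[y][x] = num / den;
        }
    return out;
}

// The text suggests iterating the 3 x 3 convolution around twenty times per level.
Img smooth(Img disp, const Img& conf, const Img& inten, int iterations = 20) {
    for (int i = 0; i < iterations; ++i) disp = smoothOnce(disp, conf, inten);
    return disp;
}
```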
  • the above-described smoothing operation results in a further improvement in the accuracy of the final calculated disparity map which, in turn, results in improved quality in the 3-D image which is ultimately constructed using the first and second digitised camera output images and the final disparity map (together with calibration parameters for the cameras) .
  • In steps (e) and (g) of the above-described method, the matching process by means of which one image is matched with another is preferably carried out in each case by:
  • the data values (which are real numbers representing image brightness) of the image for fractional points located between pixels may be obtained by a suitable interpolation method, preferably using bilinear interpolation.
  • An advantage of this matching method is that the correlation process is carried out at sub-pixel level i.e. correlation values for fractional points located between actual pixels are calculated and used to fit the parabolic curve. This results in a more accurate estimate of the horizontal and vertical disparities being achieved.
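  • The two sub-pixel ingredients mentioned above can be sketched as follows (illustrative only; the patent's full correlation procedure is not reproduced): bilinear interpolation of brightness at fractional positions, and a parabolic fit through three correlation samples to refine an integer peak to sub-pixel accuracy.

```cpp
// Sketch of sub-pixel sampling and peak refinement; illustrative only.
#include <vector>
#include <algorithm>

using Img = std::vector<std::vector<float>>;

// Brightness at a fractional point (x, y), by bilinear interpolation.
float sampleBilinear(const Img& im, float x, float y) {
    int H = im.size(), W = im[0].size();
    int x0 = std::clamp((int)x, 0, W - 2), y0 = std::clamp((int)y, 0, H - 2);
    float fx = x - x0, fy = y - y0;
    return (1 - fx) * (1 - fy) * im[y0][x0]     + fx * (1 - fy) * im[y0][x0 + 1]
         + (1 - fx) * fy       * im[y0 + 1][x0] + fx * fy       * im[y0 + 1][x0 + 1];
}

// Given correlation values c(-1), c(0), c(+1) around an integer peak at 0,
// fit a parabola and return the fractional offset of its maximum.
float parabolicPeak(float cm1, float c0, float cp1) {
    float denom = cm1 - 2.f * c0 + cp1;
    if (denom == 0.f) return 0.f;                 // flat: keep the integer peak
    float offset = 0.5f * (cm1 - cp1) / denom;
    return std::clamp(offset, -0.5f, 0.5f);       // sub-pixel shift within one pixel
}
```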
  • The terms "horizontal" and "vertical" are used above with reference to the horizontal lines (rows) and vertical lines (columns) of pixels in the images, and are not intended to refer to the actual orientation of the image, for example with respect to a surface or horizon.
  • The method as described above calculates the disparity for each pixel in one of the first and second (digitised) camera images, e.g. the disparity for each pixel in the first (hereinafter referred to as the "left") image relative to the second (hereinafter referred to as the "right") image - these could be termed the left-to-right disparities.
  • the entire above- described stereo matching method may be repeated, this time calculating the disparities in the reverse direction i.e. the right-to-left disparities. By comparing the right-to-left disparities with the left-to-right disparities we can detect occluded areas.
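  • A minimal sketch of this occlusion test is given below, assuming horizontal-only disparities and a one-pixel consistency tolerance (both illustrative choices):

```cpp
// Sketch of the left-right consistency (occlusion) check described above.
#include <vector>
#include <cmath>
#include <algorithm>

using Img = std::vector<std::vector<float>>;

// occluded[y][x] is true where the two disparity maps are inconsistent.
std::vector<std::vector<bool>> occlusionMask(const Img& dispLR, const Img& dispRL,
                                             float tolerance = 1.f) {
    int H = dispLR.size(), W = dispLR[0].size();
    std::vector<std::vector<bool>> occluded(H, std::vector<bool>(W, false));
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            // Follow the left pixel into the right image...
            int xr = std::clamp((int)std::lround(x + dispLR[y][x]), 0, W - 1);
            // ...and back again; a consistent match returns close to x.
            float xBack = xr + dispRL[y][xr];
            occluded[y][x] = std::fabs(xBack - x) > tolerance;
        }
    return occluded;
}
```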
  • the method may also include illuminating the object scene with a textured pattern.
  • This may conveniently be achieved by projecting the pattern onto the object scene using a projector means.
  • the pattern comprises a fractal pattern, for example in the form of a digitally generated fractal random pattern of dots of different levels of transparency.
  • Illuminating the object scene with a textured pattern has the advantage of generating detectable visual features on surfaces in the object scene which, due to a uniformity in colour, would otherwise be visually flat .
  • the use of a fractal pattern ensures that texture information will be available in the Difference of Gaussian (DoG) images at all levels in the scale pyramid.
  • a 3-D image modelling system comprising: first camera imaging means for producing a first camera output image of an object scene; second camera imaging means for producing a second camera output image of said object scene; digitising means for digitising each of said first and second camera output images; storage means for storing said digitised first and second camera output images; and image processing means programmed to:
  • Difference of Gaussian images from said first and second digitised camera output images, each successive level of the pyramid providing smaller images having coarser resolution, and said storage means being capable of also storing the pairs of filtered images so produced;
  • the 3-D modelling system preferably further includes projector means for projecting a textured pattern onto the object scene, preferably for projecting a fractal random dot pattern onto the object scene.
  • a computer program product comprising: a computer usable medium having computer readable code means embodied in said medium for carrying out a method of measuring stereo image disparity in a 3-D image modelling system, said computer program product having computer readable code means for: processing data corresponding to a pair of first and second digitised camera output images of an object scene so as to produce filtered data corresponding to a plurality of successively produced pairs of filtered images, each pair of filtered images providing one level in the pyramid, each pair of filtered images being scaled relative to the pair of filtered images in the previous level by a predetermined amount and having coarser resolution than the pair of images in said previous level; calculating an initial disparity map for the coarsest pair of filtered images by matching filtered data of one image of said coarsest pair of filtered images with the filtered data of the other image of said coarsest pair of filtered images; using said initial disparity map to carry out a warping operation on the data of one image of the next-coarsest
  • the computer program product preferably further includes computer readable code means for: operating on said first and second digitised camera output image data using said final disparity map, in a 3-D model construction process, in order to generate data corresponding to a three-dimensional image model from said first and second digitised camera output image data.
  • a fourth aspect of the invention we provide a method of calibrating cameras for use in a 3-D modelling system so as to determine external orientation parameters of the cameras relative to a fixed reference frame (one of the said cameras may be used as the fixed reference frame, although this need not always be the case) , and determine internal orientation parameters of the cameras, the method comprising the steps of: (a) providing at least one calibration object having a multiplicity of circular targets marked thereon, wherein said targets lie in a plurality of planes in three dimensional space and are arranged such that they can be individually identified automatically in a camera image of said at least one calibration object showing at least a predetermined number of said circular targets not all of which lie in the same plane; (b) storing in a memory means of the modelling system the relative spatial locations of the centres of each of said target circles on said at least one calibration object; (c) capturing a plurality of images of said at least one calibration object with each of a pair of first and second cameras of the modelling system, wherein at least some points, on said at least one calibration object
  • the method preferably includes the further step of :
  • the method according to the fourth described aspect of the invention calculates, for each said camera, initial estimates of at least the following internal orientation parameters: the position (in x-y co-ordinates) of the principal point of the camera; the focal length f of the camera; and the relative size of an image pixel (relative to the x and y axes) .
  • the method preferably further includes calculating estimates for further internal orientation parameters, namely lens distortion parameters.
  • the external orientation parameters for which initial estimates are calculated preferably comprise three location (i.e. linear position) parameters and three angular (position) parameters.
  • Step (e) may conveniently be done using a Direct Linear Transform (DLT) technique.
  • Step (f) may conveniently be carried out by applying a modified version of the co-linearity constraint, conveniently in the form of an iterative non-linear least squares method, to the initial estimates of the internal and external orientation parameters of the cameras, to calculate a more accurate model of the internal and external orientation parameters of the cameras, which may include lens distortion parameters.
  • One advantage of the above-described calibration method is that, by using a calibration object designed to provide a source of points the location of which in space is known accurately, and the location of which can be accurately identified in images, calibration of the cameras can be achieved automatically without user intervention, for example manipulation of the camera positions, being required.
  • the method preferably also includes the step of modelling perspective distortion, which causes the centres of circular targets on the or each said calibration object not to be in the centre of the corresponding ellipses appearing in the captured camera images.
  • This process is preferably incorporated in the afore-mentioned iterative non-linear least squares method used to calculate a more accurate model of the internal and external orientation parameters of the cameras, so as to calculate ellipse correction parameters .
  • This enables further improved accuracy to be achieved in the model of the internal and external orientation parameters of the cameras. It will be appreciated that it is the feature of using circular targets in the calibration object (s) which enables such ellipse correction to be achieved.
  • a 3-D modelling method incorporating the afore-described method of calibrating cameras, and the afore-described method for measuring stereo image disparity, and wherein the estimated internal and external parameters of the first and second cameras, and the final calculated disparity map for the first and second camera images, are used to construct a 3-D model of the object scene.
  • the 3-D model may be produced in the form of a polygon mesh.
  • a 3-D modelling system as afore-described in which the image processing means is further programmed to carry out steps (d) and (e) , and preferably also step (f) , in the above- described method of calibrating cameras, and wherein the system further includes at least one said calibration object and storage means for storing constructed 3-D models.
  • more than two cameras may be used, for example a plurality of pairs of left and right cameras may be used, in order to allow simultaneous capture of multiple pairs of images of the object scene.
  • each pair of images may be used to produce a 3-D model of the object scene, using one or more of the afore-described techniques, and the plurality of said 3-D models thus produced are preferably combined together in a predetermined manner to produce a single, output 3-D model.
  • This single, output 3-D model is preferably produced in the form of a polygon mesh.
  • the plurality of 3-D models are combined in an intermediate 3-D voxel image which may then be triangulated, conveniently using an isosurface extraction method, in order to form the polygon mesh 3-D model.
  • the method may also include integrating one or more render images onto said polygon mesh so as to provide texturing of the polygon mesh.
  • One or more dedicated cameras may be provided for capturing said one or more render images .
  • the 3-D modelling system of the invention may therefore incorporate three cameras, a left and right camera pair for capturing the left and right object scene images, and a third, central camera for capturing at least one image for providing visual render information.
  • Such a camera triple may be referred to as a "pod".
  • the 3-D modelling system of the invention preferably incorporates one or more such pods.
  • one of said first and second cameras may be used to capture at least one image for providing visual render.
  • said one or more render images are merged onto the polygon mesh in such a way as to achieve substantially seamless texturing of the polygon mesh, with image blurring kept to a minimum.
  • This may be achieved by using a boundary- based merging technique, rather than an area-based merging approach.
  • Each triangle of the polygon mesh has one or more of said render images projected or mapped thereon, and the system determines which image projects or maps most accurately onto the said triangle by analysing the confidence, or weighting, values associated with the vertices of the said triangle, and preferably also taking into account the size (i.e. area) of the said triangle onto which the render image (s) is/are being projected.
  • More than one calibration object may sometimes be used in the 3-D modelling method and apparatus of the invention. Where multiple pairs or pods of cameras are used, there is preferably provided a different calibration object for use with each said pair, or each said pod.
  • Each calibration object may advantageously be provided with at least one optically readable bar code pattern which is unique to that calibration object.
  • the image processing means is preferably programmed to locate said at least one bar code pattern for each calibration object imaged by the cameras, and to read and identify said bar code pattern as one of a pre-programmed selection of bar code patterns stored in a memory means of the apparatus, each said stored bar code pattern being associated with a similarly stored set of data corresponding to the respective calibration object.
  • each said calibration object is preferably additionally provided with bar code location means in the form of a relatively simple locating pattern which the image processing means can locate relatively easily and, from the location of said locating pattern, identify that portion of the image containing said at least one bar code pattern, prior to reading said bar code pattern.
  • Fig.1 is a schematic diagram of a camera apparatus according to one embodiment of the invention.
  • Figs.2 (a) , (b) , and (c) show three image frames captured of a human subject, comprising left texture, right texture and left render images respectively;
  • Fig.3 is a disparity map for left and right captured images
  • Fig.4 is a Lambertian shaded view of a 3-D model image of the human subject of Figs.2;
  • Figs.5 (a) and (b) are 3-D mesh models of the human subject;
  • Fig.6 is the 3-D mesh model of Fig.5 (a) overlaid with the left render image of Fig.2(c);
  • Fig.7 is a random dot pattern;
  • Fig.8 is a stereo pair of images of a shoe last illuminated with the random dot pattern of Fig.7;
  • Fig.9 shows the stereo image pair of Fig.8 at level five of the scale pyramid obtained therefrom;
  • Fig.10 shows a fractal random dot image
  • Fig.11 shows a stereo pair of images of the shoe last illuminated with the random dot pattern of Fig.10;
  • Fig.12 shows the stereo image pair of Fig.11 at level five of the scale pyramid obtained therefrom;
  • Fig.13 is a barcode design;
  • Fig.14 shows front, side and bottom views of a calibration object used in one embodiment of the invention;
  • Fig.15 is a flow diagram illustrating part of the calibration process used in one embodiment of the invention.
  • Fig.16 illustrates a contour following set-up;
  • Figs. 17(a) and (b) illustrate the definitions of the slant angle and the tilt angle respectively;
  • Fig. 18 illustrates a barcode reading process
  • Fig.19 illustrates projective distortion of the center of a circle
  • Fig.20 is a flow chart illustrating the formation of a
  • Fig.21 is a flow chart illustrating a main loop of a stereo matching process of the invention.
  • Fig.22 is a flow diagram illustrating a matching process carried out at each level of the scale pyramid.
  • Fig.23 is a flow chart illustrating in further detail how the matching process of Fig.22 is used to operate on all levels of the scale pyramid.
  • the present invention is a 3-D camera apparatus and associated processing methods, capable of capturing 3-D models of a wide variety of objects.
  • Figure 1 shows a 3-D camera apparatus 1 consisting of: a monochrome or colour left camera 3; a monochrome or colour right camera 4; an (optional) central colour camera (not shown in Fig.1); a calibration object 7 for determining the external and internal orientation parameters of the cameras 3, 4, relative either to the object 7 or to one of the cameras.
  • the cameras 3,4 are, in this embodiment, digital still cameras. However, in other alternative embodiments the cameras 3,4 may, for example, be videocameras or 35mm cameras.
  • the calibration object 7 contains an identifying code and a number of circular targets; a Central Processing Unit 11 (hereinafter referred to as "the CPU" or "the computer") for controlling the operation of the cameras and projector, and for processing the images to produce a 3-D rendered model; a digitiser 9 for digitizing each of the output images from the left and right cameras 3, 4 and storing them in the CPU;
  • a projector 13 is controlled by the CPU 11 and, in use, projects a fractal pattern over an object scene imaged by the left and right cameras; mounting means, such as tripods (not shown) , for mounting the cameras 3 , 4 so as to allow for user controlled alteration to the separation and orientation of the cameras; a storage medium (e.g. memory) 15 for storing 3-D models; and two frame buffers 17,18 connected between the digitizer 9 and the CPU 11.
  • the cameras, image processing hardware, mounting hardware, CPU and storage medium for the present invention are commercially available. Suitable cameras are: left camera Sony XC77-CE, right camera JVC FY55-RGB. Suitable image processing hardware is the DataCell Snapper 24.
  • the separation and rotation of the left and right cameras 3,4 are adjustable, as are the two lenses 6, 8 attached to the respective cameras. This allows a variety of viewing volumes to be imaged.
  • the use of the calibration target 7 (described in detail later) , allows rapid calibration without user intervention after alteration of the positions of the cameras 3 , 4 or changes in lenses 6, 8.
  • the cameras 3 , 4 are calibrated by capturing several images of the calibration object 7, with the object 7 visible from both cameras in at least one of the images. This calibration process is described in more detail later.
  • the projector 13 is set to project a fractal pattern onto the object scene. (Details of the fractal pattern are described later) .
  • a target object or human subject to be modeled (not shown) is then placed in front of the camera.
  • the subject is situated at a distance of 1.5 meters from the cameras 3, 4 which are separated by 30cm.
  • the projector 13 is used to light up the subject or target with a textured illumination pattern. This is derived from a 35mm slide with a digitally generated random fractal pattern of dots of different levels of transparency.
  • the purpose of the illumination with textured light is to generate detectable visual features on surfaces which, due to a uniformity in colour, would otherwise be visually flat . (The use of a fractal pattern ensures that texture information will be available at all levels of a Difference of Gaussian (DoG) image pyramid) .
  • the CPU 11 instructs the two cameras 3, 4 to simultaneously capture a single frame of the scene containing the subject.
  • the computer instructs the projector 13 to illuminate the subject or target with uniform white light.
  • the purpose of this is to allow the underlying colour and texture of the subject's face, or the target's surfaces to be recorded.
  • One of the cameras, preferably the left camera, is instructed to capture a frame 9 (or "render image") of the target or subject illuminated with white light.
  • Fig 2(c) shows such a frame captured of the human subject of Figs. 2(a) and (b) .
  • the computer calculates the disparity between the left and right images. For each pixel position a disparity (a horizontal disparity and a vertical disparity) is calculated between the left and right images. The confidence of the disparity estimates, stored as a floating point image with elements in the range 0-1, is also generated. The disparity and confidence (in the horizontal and vertical directions) are then organised as two image frames, D f and E f. The horizontal disparity image D f for the human subject of Fig. 2 is shown in Figure 3. Details of the matching method used to calculate the disparities are given later.
  • This disparity map is translated, using the known internal and external orientation parameters of the left and right cameras 3, 4, into a dense 3-D map or image F f (shown in Fig. 4), where each pixel in the image F f represents a position in space.
  • A representation of this map is shown in Fig. 4, where the model has been rendered into a byte-coded image by Lambertian shading of the model, with a light source positioned at the principal point of the left camera.
  • This 3-D model image is translated, using the known internal and external orientation parameters of cameras, into a 3-D mesh representing the subject.
  • This mesh can be conveniently written out as a VRML file on the storage medium 15.
  • Figs. 5 (a) and (b) show the mesh from two different angles of view. The fact that the mesh is a 3D model allows the angle of view to be altered so that a computer can present the same object from alternative angles.
  • the CPU 11 stores on the storage medium 15, along with the 3D mesh, Frame C f , the image obtained from the left camera with white light. With appropriate graphics software it is possible to reconstruct different fully textured views of the original subject, such as that shown in fig. 6, using the stored 3-D mesh and the render image (frame C f ) .
  • more than two cameras are used. These are described herebelow.
  • the preferred operation of the system when more than two cameras are used organises the cameras in triples, with left and right monochrome cameras and a central color camera in each triple. We shall call each such camera triple a pod.
  • the left and right cameras 3, 4 form a stereo pair for matching purposes, while the central camera is used to capture the visual render image Cf .
  • the cameras are connected in a tree structure via a Universal Serial Bus (USB) (not shown) , using USB hubs to form the connection topology.
  • Multiple projectors 13 may be used, and the projectors are positioned so that the fractal texture pattern projected thereby covers the object or objects to be captured.
  • the design of the fractal texture is such that the output of multiple projectors may overlap.
  • Multiple calibration objects 7 may be used.
  • the design of the objects 7 is such that multiple calibration objects may be recognised when they occur in a single image, and identified.
  • the design of the objects and the methods employed are described in full detail herebelow.
  • operation proceeds as follows:
  • 1. The cameras are calibrated by capturing multiple images of the calibration objects. In order to ensure that all cameras are calibrated within the same co-ordinate frame the following constraints must be satisfied.
  • the vertices are C ∪ O, where C is the set of cameras and O is the set of calibration objects being used (an edge joins a camera and a calibration object whenever that camera images that object).
  • the constraint that must hold for calibration to succeed is that the graph V is connected.
  • Each projector is set to project the fractal pattern onto the subject. (Details of the fractal pattern are provided later) .
  • a target object or human subject is placed in front of the cameras .
  • the computer issues a signal to the left and right camera pairs in the pods via the USB, instructing them to capture an image.
  • the captured image is stored in internal memory in each camera.
  • the computer instructs the central cameras on the pods to fire simultaneously, via the USB, and store the resulting image. These pictures are captured in ambient light. For optimal texture the ambient light should be controlled to
  • the frames from each pod are transferred under control of the CPU via the USB, either to the storage medium 15 or to
  • the computer calculates the disparity between the left and right images (as described later) .
  • the implicit surface is polygonized, using a method such as Marching Cubes [Lorensen and Cline, SIGGRAPH '87 Conference Proceedings, Anaheim, CA, July 1987, p.163-170]; Mesh Propagation [Howie and Blake, Computer Graphics Forum, 13(3):C/65-C/74, October 1994]; or span space methods [Livnat et al., IEEE Transactions on Visualisation and Computer Graphics, 2(1):73-84, March 1996], to generate a triangular mesh.
  • the triangle mesh can be decimated by applying, for example, the method of Garland and Heckbert (SIGGRAPH 96 Conference Proceedings, Annual Conference
  • the white light (render) images from the center cameras, the internal and external orientation parameters of the center cameras, and the location in space of the vertices of the generated polygon mesh are used to merge the render images, so that they can be projected seamlessly onto the polygon mesh. This method is also described later.
  • the polygon mesh can be used to merge the dense models, using the same method as in 11 above, and can also be used to mask the areas of the models which correspond to the object (s) being imaged.
  • the computer 11 stores along with the 3D mesh on the storage medium 15, the images obtained from the central cameras and their internal and external orientation parameters. If accurate measurements and the highest quality rendered images of the object (s) from different viewpoints are required, the merged dense 3-D model images obtained in step 12 above can also be stored in the storage medium 15.
  • Images of objects illuminated with natural light can contain visually flat areas where accurate estimation of disparity is difficult.
  • the current invention uses texture projection to illuminate the object(s) of interest with a textured pattern which provides information for the stereo matching method (described later).
  • One way to do this is to project a random dot pattern onto the object to be modeled.
  • a random dot pattern can be generated as follows.
  • This image should have a user-defined size. In the preferred implementation this is 600 X 800.
  • Fig. 7 shows a random dot pattern.
  • Fig. 8 shows a stereo image pair of a shoe last illuminated with the random dot pattern of Fig. 7.
  • Fig. 7 is a non-fractal image.
  • the match method (described later) estimates disparities at each level of a DoG pyramid created from a stereo image pair.
  • the fine grain of the random dot pattern gradually diminishes towards the top of the pyramid, as illustrated in Fig. 9, which shows the stereo image pair of Fig. 8 at level five of the scale pyramid.
  • The preferred embodiment therefore uses a fractal random dot pattern which maintains the same image properties at all levels of the DoG pyramid.
  • An example is shown first, and then the method to generate the fractal random dot image is described.
  • Fig. 10 shows a fractal random dot image
  • Fig. 11 shows a stereo image pair of the same shoe last illuminated with the fractal random dot image projection.
  • Fig. 12 shows the stereo image pair of Fig. 11 at level 5 of the pyramid.
  • the ratios of X l : X l+1 and Y l : Y l+1 are the same, and are consistent with the ratio between pyramid levels of the matcher. It is 2 for the preferred embodiment of the stereo matching method.
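  • The generation method itself is only summarised in this extract; purely as an illustration, one way to build a fractal (multi-scale) random dot pattern with these properties is to accumulate random dot layers whose cell size doubles at each level, matching the factor-of-2 ratio between matcher pyramid levels. The 600 x 800 size follows the preferred value quoted earlier; everything else in the sketch is an assumption.

```cpp
// Sketch of a multi-scale ("fractal") random dot pattern generator;
// an illustrative construction, not the patent's own.
#include <vector>
#include <random>
#include <algorithm>

using Img = std::vector<std::vector<float>>;

Img fractalDots(int height = 600, int width = 800, int levels = 6, unsigned seed = 1) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution dot(0.5);            // each grid cell is dark or light
    Img out(height, std::vector<float>(width, 0.f));
    for (int l = 0, cell = 1; l < levels; ++l, cell *= 2) {
        // Random dots on a grid whose cell size doubles at each level, so
        // texture survives at every level of the DoG pyramid.
        int gh = (height + cell - 1) / cell, gw = (width + cell - 1) / cell;
        Img layer(gh, std::vector<float>(gw));
        for (auto& row : layer) for (auto& v : row) v = dot(rng) ? 1.f : 0.f;
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x)
                out[y][x] += layer[y / cell][x / cell];
    }
    // Normalise to [0, 1] so the pattern can be written out as transparencies.
    float maxv = 0.f;
    for (auto& row : out) for (float v : row) maxv = std::max(maxv, v);
    for (auto& row : out) for (float& v : row) v /= (maxv > 0.f ? maxv : 1.f);
    return out;
}
```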
  • Suitable cameras for the preferred embodiment with two cameras are: left camera, Sony XC77-CE, right camera, JVC FY55-RGB.
  • USB connection supplying both power and control.
  • the preferred embodiment uses a WL 5850 CMOS sensor, and a Hitachi 3048G processor, with 16MB of on-board memory.
  • the camera is connected via USB to the host computer, thereby allowing at least 32 cameras to be connected to a single computer, and each camera to be powered over the USB.
  • The calibration object. It is an objective of the system that accurate camera calibration can be obtained automatically.
  • a simple way to determine camera calibration is to use the position of known points in the world, which are imaged by the camera. If lens distortion is to be modelled accurately, a large number of points whose positions are known with precision is required.
  • the calibration target is designed to provide a source of points the location of which in space is known accurately, and the location of which can be determined accurately in images.
  • the calibration target is designed in such a way that it can be recognised automatically from any orientation, and several objects can be present in the field of view at any time and each be individually recognised. This enables many cameras to be calibrated together, which is necessary when more than 1 pair (or triple) of cameras is used.
  • the calibration object is illustrated in Figs. 13 (a), (b) and (c) .
  • the preferred embodiment of the calibration object includes two main features .
  • Feature one is a barcode and feature two is a set of targets.
  • the barcode consists of a barcode section and two anchor points, one on either side of the barcode section.
  • Figure 14 shows the barcode scheme.
  • the set of targets consists of twelve circles 30 lying in two planes 32,33 in 3-D space. (In other possible embodiments only 8 target circles are used) .
  • the centre of each circle is a "target point" .
  • the arrangement of the twelve circles in two planes is such that the calibration object can be recognised by a camera positioned anywhere within an angular viewing range of approximately 60 degrees (subtended from the object plane containing the calibration object), said angular viewing range being in a plane perpendicular to the object plane, and centered on an axis extending perpendicularly to the object plane.
  • the barcode comprises 6 digits and therefore handles up to 64 calibration objects. When that number is exceeded, some mechanism needs to be put in place to handle that . It is straightforward to extend from 6 digits to 18 digits (up to 262144 calibration objects) and still maintain backward compatibility by adding one barcode sequence on top of the current one and one below.
  • This data is built into the program.
  • the data are design data, not measured data.
  • the program works for any calibration object of this configuration up to a scale factor.
  • the target finder has three tasks to carry out. These are to identify the calibration object (if more than one calibration object is used), as each one is different from the others; to find and measure the target points; and to identify the targets. To identify the calibration object, the barcode anchor points, namely the 2 concentric circles 35,36 on either side of the barcode, are identified. From their positions, the portion of the photograph that holds the barcode is outlined and the barcode is read.
  • FIG. 15 is a flow chart illustrating the various steps carried out in the above process which will now be described in further detail .
  • a contour is defined as a list of closed and connected zero crossings. Closed means that the last zero crossing in the list is the same as the first zero crossing of the list. Connected means that their positions are not more than 2 pixels apart.
  • a zero crossing is defined as a position in an image which has one of the 4 configurations: a) a negative pixel on its left side and positive pixel on its right side; b) a positive pixel on its left side and negative pixel on its right side; c) a negative pixel on its top side and positive pixel on its bottom side; d) a positive pixel on its bottom side and negative pixel on its top side.
  • (The pixel values, which represent pixel brightness, in the various DoG images are real numbers, i.e. they can be "positive" or "negative", as referred to in (a) to (d) above.)
  • the contour following process uses the setup shown in Figure 16.
  • the current pixel is d(i,j).
  • a contour enters through one of 4 sides UP, DOWN, LEFT or RIGHT. It exits through one of the other sides, and the adjacent pixel on that side is then tracked.
  • the adjacent pixel to d(i,j) for a given side is determined by the following mapping (adjacent): UP → d(i-1,j), DOWN → d(i+1,j), LEFT → d(i,j-1), RIGHT → d(i,j+1). If any of these pixels lies outside the bounds of the image, then the adjacent pixel is NONE.
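  • A minimal sketch of the zero-crossing test defined above is given below (the contour-following bookkeeping of entry and exit sides is omitted):

```cpp
// Sketch of zero-crossing detection in a DoG image: a position is marked when
// the values on opposite sides change sign horizontally or vertically,
// covering configurations (a)-(d) above.
#include <vector>

using Img = std::vector<std::vector<float>>;

std::vector<std::vector<bool>> zeroCrossings(const Img& dog) {
    int H = dog.size(), W = dog[0].size();
    std::vector<std::vector<bool>> zc(H, std::vector<bool>(W, false));
    for (int y = 1; y + 1 < H; ++y)
        for (int x = 1; x + 1 < W; ++x) {
            bool horiz = dog[y][x - 1] * dog[y][x + 1] < 0.f;  // (a) and (b)
            bool vert  = dog[y - 1][x] * dog[y + 1][x] < 0.f;  // (c) and (d)
            zc[y][x] = horiz || vert;
        }
    return zc;
}
```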
  • the input of the routine is an image, I(i,j).
  • It filters the image I with a DoG filter to get I d(σ)(i,j) such that a transition from either black to white or vice versa in I becomes a zero crossing in I d(σ)(i,j), where σ is a positive parameter that controls the shape of the DoG filter.
  • the parameter σ is chosen to be as small as possible to maximise zero crossing accuracy and at the same time to be great enough to suppress noise in the image. See Figure 4 for an example.
  • Polarity is a value that indirectly describes whether a circle in the image is black on a white background or the other way round.
  • When an image that contains white ellipses on a black background is filtered by a DoG filter, the inside of the contours of zero crossings that correspond to the edges of the ellipses can be either negative or positive. But it is consistent for a given DoG filter, say negative. When the DoG filter is negated, it will be positive.
  • Polarity takes binary values, positive-inside or negative-inside, to account for this.
  • the maximum radius (r max) of the fitted ellipse is computed;
  • the minimum radius (r min) of the fitted ellipse is computed;
  • the orientation (θ) is arctan(-s/c).
  • This routine is to organise the points into a barcode anchor point group and a target point group. In the case of an image of multiple calibration objects, a further organisation is carried out to group together the group of barcode anchor points and the group of target points that belong to the same calibration object.
  • This routine is to match the points of the calibration object, each point o being represented as (id, x, y, z, d).
  • every point in the image has an id assigned.
  • the anchor points are those with concentric circles
  • the calibration object points that lie behind other points in the orthogonal projection are detected and ignored.
  • the diameter of a circle in the calibration object is used to do the detection.
  • an occluded point is found when the projection of the point lies inside the radius of another point that is closer to the image plane .
  • the purpose of this routine is to find a parallelogram in the image that contains the barcode .
  • the geometrical relations between the points are identified. For instance, the line passing through point 5 and point 9 and the line passing through point 6 and point 10 approximate the direction of the vertical sides of a parallelogram.
  • the two barcode points approximate the direction of the horizontal sides of the parallelogram.
  • the actual positions are approximated through the known positions of the barcode points and the known sizes of the barcode point circles.
  • the parallelogram is represented as 4 points: the image pixel locations of the top left corner, the top right corner, the bottom left corner and the bottom right corner of the parallelogram.
  • This routine is to read the 6 digit barcode of a calibration object. It reads evenly spaced lines parallel to the horizontal sides of the parallelogram, starting from the left vertical side and proceeding to the right vertical side of the parallelogram. Then it combines the readings and converts them into an integer.
  • Figure 18 shows an example of reading number 4 in operation. The operation of the reader is as follows:
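  • The enumerated steps of the reader are not included in this extract. Purely as an illustration of the scan-line reading just described, the sketch below samples several lines across the located parallelogram, thresholds each line into six bits and combines the per-line readings by majority vote; the sampling density, threshold and voting rule are assumptions, not taken from the patent.

```cpp
// Sketch of scan-line barcode reading inside the located parallelogram.
#include <vector>
#include <cmath>
#include <algorithm>

using Img = std::vector<std::vector<float>>;
struct Pt { float x, y; };

static Pt lerp(Pt a, Pt b, float t) { return {a.x + t * (b.x - a.x), a.y + t * (b.y - a.y)}; }

static float sample(const Img& im, Pt p) {           // nearest-neighbour, clamped to the image
    int y = std::clamp((int)std::lround(p.y), 0, (int)im.size() - 1);
    int x = std::clamp((int)std::lround(p.x), 0, (int)im[0].size() - 1);
    return im[y][x];
}

// Read a 'bits'-digit barcode from the parallelogram with corners tl, tr, bl, br.
int readBarcode(const Img& im, Pt tl, Pt tr, Pt bl, Pt br,
                int bits = 6, int lines = 8, int samplesPerBit = 10, float threshold = 0.5f) {
    std::vector<int> votes(bits, 0);
    for (int l = 0; l < lines; ++l) {
        float v = (l + 1.f) / (lines + 1.f);          // evenly spaced lines inside the parallelogram
        Pt s = lerp(tl, bl, v), e = lerp(tr, br, v);  // one scan line, left side to right side
        for (int b = 0; b < bits; ++b) {
            float mean = 0.f;
            for (int k = 0; k < samplesPerBit; ++k) {
                float u = (b + (k + 0.5f) / samplesPerBit) / bits;
                mean += sample(im, lerp(s, e, u));
            }
            if (mean / samplesPerBit > threshold) ++votes[b];   // this line reads a 1 for digit b
        }
    }
    int code = 0;
    for (int b = 0; b < bits; ++b)                    // combine the readings into an integer
        code = (code << 1) | (votes[b] > lines / 2 ? 1 : 0);
    return code;
}
```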
  • Calibration is achieved using a novel version of the colinearity constraint as described by C.C. Slama in "Manual of Photogrammetry", fourth edition, American Society of Photogrammetry and Remote Sensing, Virginia, 1980. This allows a large number of accurately measured points to be used to determine the orientation and internal parameters of the cameras.
  • the location and position of the centers of the target circles on the calibration object are determined for each image .
  • M is the rotation matrix of the camera w.r.t. the default world coordinate frame;
  • (t x, t y, t z) T is the location of the camera's principal point;
  • (x, y, z) T is the location of a point in the world (w.r.t. the default co-ordinate frame);
  • (u, v) T is the location in the image plane, w.r.t.
  • Δu and Δv are corrections to the values of u and v due to lens distortion and the distortion due to the observed center of a calibration circle in the image not being the image of the circle center in space;
  • f is the focal length of the camera, and (u c, v c) T is the location, in camera space, of the principal point.
  • This equation differs somewhat from the standard equation in that the location of the world co-ordinate is on the LHS of the equation. Our preferred implementation extends this equation to take account of the fact that we are taking many images of the calibration object at different orientations, so that all the points on a given pair (or triple) of images of the calibration object share a common reference frame which is not the default world frame.
  • the revised equation is:
  • M° and T° are the rotation matrix and translation for the object reference frame.
  • X p The size of an image pixel along the x-axis.
  • Δu and Δv are functions of the distortion parameters, the rotation and translation parameters for both the object and the camera, and the radius of the target.
  • the initial estimate of camera and target parameters is derived using the Direct Linear Transform (DLT), the preferred implementation of which is described below.
  • the DLT provides an estimate of camera orientation and internal parameters w.r.t. a set of world points. To achieve an estimate of the orientation of multiple cameras and the calibration target in various orientations requires several applications of the DLT.
  • the algorithm for initial parameter estimation takes as input a set of image pairs (triples) of the calibration target with target points identified and located in each image.
  • the DLT implementation, given a set of points, provides an estimate of the camera parameters relative to the co-ordinate frame of the calibration object.
  • the internal parameters of the camera are of course independent of the coordinate frame.
  • a C++ pseudo-code version of the algorithm is given below:
  • the step in this algorithm labelled with an asterisk determines the orientation of an object with respect to a camera with known orientation.
  • Let a camera c have external orientation M, expressed as a homogeneous 4 x 4 matrix, which includes both the rotation and translation parameters. If the DLT is run and determines a matrix M 0 for the camera relative to the object in the default world frame, then the orientation for the object relative to camera c will be M M 0 -1. From this matrix it is possible to recover rotation and translation parameters.
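  • A small sketch of this composition is shown below; it assumes the poses are rigid homogeneous transforms stored row-major, so the inverse can be formed from the transposed rotation (helper names are illustrative):

```cpp
#include <array>

using Mat4 = std::array<std::array<double, 4>, 4>;   // row-major rigid pose [R | t; 0 0 0 1]

// Inverse of a rigid transform: [R | t]^-1 = [R^T | -R^T t].
Mat4 invertRigid(const Mat4& m) {
    Mat4 inv{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) inv[i][j] = m[j][i];
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) inv[i][3] -= m[j][i] * m[j][3];
    inv[3][3] = 1.0;
    return inv;
}

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k) c[i][j] += a[i][k] * b[k][j];
    return c;
}

// The step marked with an asterisk: object pose relative to camera c is M * M0^-1.
Mat4 objectRelativeToCamera(const Mat4& M, const Mat4& M0) {
    return multiply(M, invertRigid(M0));
}
```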
  • a least squares refinement process is used to provide a more accurate estimate of these parameters, of the lens distortion parameters for each camera, and of the elliptical correction for each object point in each image.
  • the least squares problem is set up as follows: 1. For each point on each image of the calibration object, two constraints can be created, as detailed above in the description of the modified co-linearity constraint.
  • the total number of parameters to be solved for is: 15n c + 6(n i - 1) + 12 x 3, where n c is the number of cameras, n i is the number of images of the calibration object taken by the cameras (at least 1 for each camera), and 12 is the number of target points on the calibration object.
  • the parameters for each camera break down as: 6 for external orientation, 3 for internal orientation, 1 to specify the ratio between pixel sizes along the x and y dimensions, and 5 for lens distortion.
  • the number of constraints available is 2 Σ p i + 3 x 12, summed over all images i, where p i is the number of circular targets detected in the ith image, which is at most 12.
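  • As an illustrative check of these counts, using the formulae exactly as stated above (the numbers are not taken from the patent): with n c = 2 cameras and n i = 10 images in each of which all 12 targets are detected, the number of parameters is 15 x 2 + 6 x (10 - 1) + 12 x 3 = 30 + 54 + 36 = 120, while the number of constraints is 2 x (10 x 12) + 3 x 12 = 240 + 36 = 276, so the least squares problem is comfortably over-determined.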
  • the Levenberg-Marquardt algorithm (described later) is used to solve this problem.
  • an algorithm such as that proposed by J.J. Moré (in G.A. Watson, Lecture Notes in Mathematics 630, pages 105-116, Berlin, 1978, published by Springer-Verlag) or J.E. Dennis, D.M. Gay and R.E. Welsch (Algorithm 573: NL2SOL - An Adaptive Nonlinear Least-Squares Algorithm, ACM Transactions on Mathematical Software, 7:369-383, 1981) should be used which can approximate the Jacobian.
  • the reason for this is that partial derivatives are hard to calculate for the ellipse correction, which depends on several other parameters . No degradation in the performance of the method when using a finite differencing approximation to the Jacobian has been observed.
  • Photogrammetry in our system requires the solution of nonlinear least squares equations.
  • a major problem is the determination of a starting position for the iterative solution procedure.
  • a basic photogrammetric problem is to determine a camera's internal and external parameters given a set of known world points which are imaged in the camera.
  • the standard method used by C.C. Slama (referenced above) is to use the colinearity constraint to set up a series of non-linear equations which, given a suitable starting point, can be iterated to get an accurate solution for the camera's internal and external orientation parameters.
  • DLT Direct Linear Transform
  • the method uses a modified pinhole model for cameras.
  • the modification being that some allowance is made for lens distortion in the system.
  • Lens distortion is not used in the DLT, although versions have been developed that make allowance for it.
  • the basic equation relating the image coordinates to the world position of the camera is
  • a rotation vector (ω, φ, κ) represents a rotation of ω radians about the x-axis, followed by a rotation of φ radians about the y-axis and finally a rotation of κ radians about the z-axis.
  • the matrix for this rotation is:
  • the camera internal parameters are: p the size of a pixel in the x axis; s the ratio between a pixel in the x and y axes;
  • the first step is to divide the third line into the first two in order to eliminate λ, and to simplify slightly.
  • angles can be recovered using the arcsin function using the following relations:
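  • the projection equation, the rotation matrix and the angle-recovery relations referred to above appear only as figures in the source. In the usual photogrammetric notation they take approximately the following form; this is a reconstruction under assumed conventions (f the focal length, (x_0, y_0) the principal point, p the pixel size, s the pixel-aspect ratio), not the patent's exact typography:

    % Homogeneous pinhole projection; dividing the third row into the first two
    % eliminates lambda (assumed form of K; the placement of s depends on convention).
    \[
    \lambda \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
      = K \,[\,R \mid t\,] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix},
    \qquad
    K = \begin{pmatrix} f/p & 0 & x_0 \\ 0 & f/(sp) & y_0 \\ 0 & 0 & 1 \end{pmatrix}
    \]

    % Rotation matrix for (omega, phi, kappa): x-axis, then y-axis, then z-axis.
    \[
    R(\omega,\phi,\kappa) =
    \begin{pmatrix}
    \cos\phi\cos\kappa &
    \sin\omega\sin\phi\cos\kappa-\cos\omega\sin\kappa &
    \cos\omega\sin\phi\cos\kappa+\sin\omega\sin\kappa \\
    \cos\phi\sin\kappa &
    \sin\omega\sin\phi\sin\kappa+\cos\omega\cos\kappa &
    \cos\omega\sin\phi\sin\kappa-\sin\omega\cos\kappa \\
    -\sin\phi & \sin\omega\cos\phi & \cos\omega\cos\phi
    \end{pmatrix}
    \]

    % The angles are then recoverable from entries of R using arcsin:
    \[
    \phi = \arcsin(-r_{31}), \qquad
    \omega = \arcsin\!\left(\frac{r_{32}}{\cos\phi}\right), \qquad
    \kappa = \arcsin\!\left(\frac{r_{21}}{\cos\phi}\right).
    \]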
  • the image of a circle in a camera is an ellipse.
  • the center of the ellipse in the image plane is not necessarily coincident with the image of the circle's center, due to perspective distortion. This effect is shown, in exaggerated form for two dimensions ("2-D") in Figure 19.
  • the camera has principal point P. Its focal plane is AC and the line AB is the cross-section of an ellipse. O_1 is the center of the projection of the line on the image plane, and O_2 is the projection of the center of the ellipse O.
  • the camera has a rotation R and translation T w.r.t. the coordinate frame of the ellipse.
  • de_p(x) is the affine value of the x-coordinate of de_p.
  • Photogrammetry in our system requires the solution of nonlinear least squares equations.
  • the non-linear least-squares problem is to find a vector of parameter values, a, to minimize the sum of squares of differences between a given vector, y, of observed values and a vector of fitted values f(a; x), where the values of x and the function f are known.
  • the values y_i are the observations of the image coordinates of known world points
  • the known parameters x_i are the positions of the known world points
  • the parameters a_j are the internal and external orientation parameters of the camera(s).
  • λ is set to a relatively large number (say 1). If the first iteration reduces e then λ is reduced by some factor (say 5-10), otherwise λ is increased by some factor and we keep trying until e is reduced.
  • the Levenberg-Marquardt method converges upon a stationary point (or subspace if the system is degenerate), which could be the global minimum we want, a local minimum, a saddle point, or one or more of the parameters could diverge to infinity.
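  • the damping schedule described above (start λ at about 1, divide by a factor of 5-10 after a successful step, increase it otherwise), together with a finite-difference approximation to the Jacobian, is illustrated by the toy two-parameter curve fit sketched below. This is only an illustration of the Levenberg-Marquardt mechanics, not the photogrammetric bundle adjustment itself; the model, data and constants are invented for the example.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Toy model used purely for illustration: f(a; x) = a0 * exp(a1 * x).
    static double model(const double a[2], double x) { return a[0] * std::exp(a[1] * x); }

    static double sumSq(const double a[2], const std::vector<double>& xs, const std::vector<double>& ys) {
        double e = 0.0;
        for (size_t i = 0; i < xs.size(); ++i) { double r = ys[i] - model(a, xs[i]); e += r * r; }
        return e;
    }

    int main() {
        // Synthetic observations of y = 2 * exp(0.5 * x).
        std::vector<double> xs, ys;
        for (int i = 0; i < 20; ++i) { double x = 0.1 * i; xs.push_back(x); ys.push_back(2.0 * std::exp(0.5 * x)); }

        double a[2] = {1.0, 0.1};     // starting estimate
        double lambda = 1.0;          // damping: start relatively large (say 1)
        const double h = 1e-6;        // finite-difference step for the Jacobian

        for (int iter = 0; iter < 100; ++iter) {
            // Build J^T J and J^T r with a finite-difference Jacobian (r = y - f).
            double JTJ[2][2] = {{0, 0}, {0, 0}}, JTr[2] = {0, 0};
            for (size_t i = 0; i < xs.size(); ++i) {
                double r = ys[i] - model(a, xs[i]);
                double J[2];
                for (int k = 0; k < 2; ++k) {
                    double ap[2] = {a[0], a[1]};
                    ap[k] += h;
                    J[k] = (model(ap, xs[i]) - model(a, xs[i])) / h;
                }
                for (int p = 0; p < 2; ++p) {
                    JTr[p] += J[p] * r;
                    for (int q = 0; q < 2; ++q) JTJ[p][q] += J[p] * J[q];
                }
            }
            double e0 = sumSq(a, xs, ys);
            // Try damped Gauss-Newton steps, adapting lambda until the error decreases.
            while (true) {
                double A[2][2] = {{JTJ[0][0] * (1 + lambda), JTJ[0][1]},
                                  {JTJ[1][0], JTJ[1][1] * (1 + lambda)}};
                double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
                double da0 = ( A[1][1] * JTr[0] - A[0][1] * JTr[1]) / det;
                double da1 = (-A[1][0] * JTr[0] + A[0][0] * JTr[1]) / det;
                double trial[2] = {a[0] + da0, a[1] + da1};
                if (sumSq(trial, xs, ys) < e0) {        // success: accept step, relax damping
                    a[0] = trial[0]; a[1] = trial[1];
                    lambda /= 5.0;
                    break;
                }
                lambda *= 5.0;                          // failure: damp more heavily and retry
                if (lambda > 1e12) break;
            }
            if (e0 < 1e-18) break;
        }
        std::printf("fitted a0 = %f, a1 = %f\n", a[0], a[1]);
        return 0;
    }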
  • the goal of stereo matching is to construct a disparity map from a pair of digitised images, obtained simultaneously from a pair of cameras.
  • by a disparity map we mean a two dimensional array which specifies, for each pixel p_l in the left image of the pair, the distance to a corresponding point p_r in the right image (i.e. the stereo image disparity).
  • by a point we mean a pair of real-valued co-ordinates in the frame of reference of the right image.
  • by a corresponding point we mean one that most probably corresponds to the point in the right image at which the scene component π_s, imaged by p_l in the left camera, will appear when imaged by the right camera.
  • the present technique uses five main data structures each of which is an image in the form of a two dimensional array of real numbers:
  • the right image R, derived from the right camera 4, in which the real numbers represent brightness values.
  • the horizontal displacement image H, which specifies for each pixel (x,y) in the left image the displacement from x at which the corresponding point occurs in the right image.
  • the vertical displacement image V, with similar properties.
  • the confidence image C which specifies the degree of confidence with which the disparity is held.
  • the images L and R are available to the algorithm at multiple scales with each scale being the result of a decimation and filtering process as follows .
  • G_i (0 ≤ i ≤ n+1) is a Gaussian convolution defined by:
  • Convolve is a function returning a Gaussian convolution of an image
  • scale (I, f) is a function that scales an image I by the real number f .
  • the scale factor is 0.5, such that each next level in the pyramid is scaled by a factor of 0.5 relative to the level below it (in each linear dimension i.e. x and y dimensions) .
  • each image P_i contains only information at a characteristic frequency, all higher and lower frequencies having been removed by the filtering process.
  • Fig. 20 is a flow diagram illustrating, in principle, the method used to obtain the DoG pyramid. Fig. 20 shows how each input image is used to obtain the corresponding image at each level in the pyramid.
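  • a compact sketch of such a pyramid construction is given below. The exact Gaussian kernels G_i appear only as an equation figure in the source, so the two blur widths and the 2 x 2 box decimation standing in for scale(I, 0.5) are assumptions; only the overall structure (a band-pass DoG image per level, each level half the size of the one below) follows the text.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Minimal greyscale image: row-major brightness values.
    struct Image {
        int w = 0, h = 0;
        std::vector<float> px;
        float  at(int x, int y) const { return px[y * w + x]; }
        float& at(int x, int y)       { return px[y * w + x]; }
    };

    // Convolve: separable Gaussian convolution with the given sigma (borders clamped).
    Image convolve(const Image& in, float sigma) {
        int r = std::max(1, (int)std::ceil(3.0f * sigma));
        std::vector<float> k(2 * r + 1);
        float sum = 0.0f;
        for (int i = -r; i <= r; ++i) { k[i + r] = std::exp(-0.5f * i * i / (sigma * sigma)); sum += k[i + r]; }
        for (float& v : k) v /= sum;
        Image tmp = in, out = in;
        for (int y = 0; y < in.h; ++y)                       // horizontal pass
            for (int x = 0; x < in.w; ++x) {
                float s = 0.0f;
                for (int i = -r; i <= r; ++i) s += k[i + r] * in.at(std::min(std::max(x + i, 0), in.w - 1), y);
                tmp.at(x, y) = s;
            }
        for (int y = 0; y < in.h; ++y)                       // vertical pass
            for (int x = 0; x < in.w; ++x) {
                float s = 0.0f;
                for (int i = -r; i <= r; ++i) s += k[i + r] * tmp.at(x, std::min(std::max(y + i, 0), in.h - 1));
                out.at(x, y) = s;
            }
        return out;
    }

    // scale(I, 0.5): halve the image in each linear dimension by 2 x 2 averaging.
    Image scaleHalf(const Image& in) {
        Image out;
        out.w = in.w / 2; out.h = in.h / 2; out.px.resize((size_t)out.w * out.h);
        for (int y = 0; y < out.h; ++y)
            for (int x = 0; x < out.w; ++x)
                out.at(x, y) = 0.25f * (in.at(2*x, 2*y) + in.at(2*x+1, 2*y) + in.at(2*x, 2*y+1) + in.at(2*x+1, 2*y+1));
        return out;
    }

    // Build an n-level Difference-of-Gaussians pyramid: each level keeps a band of
    // spatial frequencies and is half the size of the level below it.
    std::vector<Image> dogPyramid(Image img, int levels, float sigma1 = 1.0f, float sigma2 = 2.0f) {
        std::vector<Image> pyr;
        for (int l = 0; l < levels; ++l) {
            Image g1 = convolve(img, sigma1), g2 = convolve(img, sigma2), dog = g1;
            for (size_t i = 0; i < dog.px.size(); ++i) dog.px[i] = g1.px[i] - g2.px[i];
            pyr.push_back(dog);
            if (img.w >= 2 && img.h >= 2) img = scaleHalf(g2);   // low-pass then decimate for the next level
        }
        return pyr;
    }

    int main() {
        Image I; I.w = 64; I.h = 64; I.px.assign(64 * 64, 0.0f);
        I.at(32, 32) = 255.0f;                                   // toy input: one bright pixel
        for (const Image& p : dogPyramid(I, 5)) std::printf("level %d x %d\n", p.w, p.h);
        return 0;
    }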
  • a matching process is run on each successive scale starting with the coarsest (i.e. top-most level of the pyramid), as will be described.
  • the initial estimates of disparity for each level are provided from the results of matching the previous level.
  • a flowchart for this process is provided in figure 21.
  • the benefits of using a scale DoG pyramid are :
  • DoG filter provides immunity from illumination changes across the two views of the scene.
  • Fig. 22 illustrates the key stages of the "matching" process carried out at any one scale of the scale pyramid and the sequence of these stages. It will be observed that the disparity and confidence maps (H,V,C) are circulated through four processing blocks 52,54,56,58 in each cycle of the process. The L and R images for the appropriate level of the pyramid are input at the respective cycles of the process. The whole process can be seen as an iterative process aimed at arriving at a good estimate of (H,V,C). It is run for five iterations in the preferred embodiment of the invention. The process will now be described in further detail.
  • the initialisation block 50 performs the following function:
  • it warps the left image L into a shifted image L', using the current disparity estimates (H, V).
  • L' is not necessarily on the integer image grid of L (i.e. L' may be at a fractional position between integer pixel positions in L)
  • the value (which will be a real number representing brightness of the pixel) for L' is calculated using 4-point bilinear interpolation.
  • the four pixels that enclose L' (in L) are used to estimate L' .
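  • a short sketch of the 4-point bilinear interpolation, and of the warp that uses it, is given below. The sampling direction (reading L at the displaced position (x + H(x,y), y + V(x,y))) is an assumption consistent with the definitions of H and V above; the Image type and function names are illustrative.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Image {
        int w, h;
        std::vector<float> px;
        float at(int x, int y) const { return px[y * w + x]; }
    };

    // 4-point bilinear interpolation of image I at a fractional position (x, y): the
    // four integer-grid pixels that enclose the point are blended by the fractional offsets.
    float bilinear(const Image& I, float x, float y) {
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        int x1 = std::min(x0 + 1, I.w - 1), y1 = std::min(y0 + 1, I.h - 1);
        x0 = std::max(0, std::min(x0, I.w - 1));
        y0 = std::max(0, std::min(y0, I.h - 1));
        float fx = x - std::floor(x), fy = y - std::floor(y);
        return (1 - fx) * (1 - fy) * I.at(x0, y0) + fx * (1 - fy) * I.at(x1, y0)
             + (1 - fx) * fy * I.at(x0, y1) + fx * fy * I.at(x1, y1);
    }

    // Warp L by the current disparity estimates: L'(x, y) = L(x + H(x,y), y + V(x,y)).
    Image warp(const Image& L, const Image& H, const Image& V) {
        Image Lp = L;
        for (int y = 0; y < L.h; ++y)
            for (int x = 0; x < L.w; ++x)
                Lp.px[y * L.w + x] = bilinear(L, x + H.at(x, y), y + V.at(x, y));
        return Lp;
    }

    int main() {
        Image L{4, 4, std::vector<float>(16, 0.0f)};
        L.px[1 * 4 + 1] = 100.0f;
        std::printf("value at (0.5, 0.5): %f\n", bilinear(L, 0.5f, 0.5f));   // prints 25.0
        return 0;
    }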
  • the correlation measure used by the matching block 54 is:
  • Corr_{L,R}(x, y) = Σ L(x+u, y+v) · R(x+u, y+v) · W(u, v), the sum being taken over -2 ≤ u ≤ 2 and -2 ≤ v ≤ 2
  • u and v are integral pixel locations
  • W(u, v) is a Gaussian kernel centered at (0,0).
  • the matching phase carried out by block 54 proceeds as follows:
  • L'_{i,j}. (b) Compute the horizontal correlations Corr(x-δ, y), Corr(x, y), Corr(x+δ, y).
  • δ is a value which is a (decimal) fraction of one pixel, i.e. such that the location (x-δ), for example, is a non-integral location along the x-axis, falling between the integer pixel positions x and x-1.
  • the initial chosen value of δ might be 0.5.
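  • elsewhere in this description the three correlation samples are fitted with a parabolic (quadratic) curve whose peak gives a sub-pixel disparity estimate. A minimal sketch of that fit is given below; the clamp to the sampled interval is an added safeguard, not something stated in the source.

    #include <cstdio>

    // Fit a parabola through three correlation values sampled at horizontal offsets
    // -d, 0 and +d from the current estimate and return the offset of its maximum.
    // The same is done with vertically displaced samples for the vertical disparity.
    double parabolicPeak(double cMinus, double cCentre, double cPlus, double d) {
        double denom = cMinus - 2.0 * cCentre + cPlus;   // curvature (second difference)
        if (denom >= 0.0) return 0.0;                    // no interior maximum: keep current estimate
        double offset = 0.5 * d * (cMinus - cPlus) / denom;
        if (offset >  d) offset =  d;                    // clamp to the sampled interval (assumption)
        if (offset < -d) offset = -d;
        return offset;
    }

    int main() {
        double d = 0.5;                                  // initial sub-pixel step, as suggested above
        // Correlation slightly stronger on the -d side, so the peak moves left.
        double dx = parabolicPeak(0.9, 1.0, 0.5, d);
        std::printf("sub-pixel disparity update: %f\n", dx);
        return 0;
    }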
  • the purpose of the smoothing or regularization phase (carried out by regulating block 56) is to adjust the disparity H, V and confidence C maps to arrive at a more accurate estimate of the disparity.
  • the reason this is desirable is that the initial estimates derived from the matching phase are inevitably noisy. These estimates are adjusted, taking into account the strength of the correlation, and data from the original image.
  • 1. New values of H, V, and C are obtained by applying a convolution to the neighborhoods surrounding each pixel in H, V, C such that:
  • H_new(x, y) = conv(H, x, y, W(L, x, y, C))
  • V_new(x, y) = conv(V, x, y, W(L, x, y, C))
  • C_new(x, y) = conv(C, x, y, W(L, x, y, C))
  • W(I,a,b,P) is a weight construction function that computes a 3 x 3 convolution kernel on image I at co-ordinate a,b, using a probability array P.
  • the confidence map C is chosen as the probability array P.
  • conv(I,a,b,K) evaluates a convolution on image I at position a,b using kernel K.
  • Step 1 is repeated 20 times.
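  • a sketch of one such smoothing pass is given below. The exact weight construction W(I, a, b, P) is not reproduced in this extract, so the particular choice here (each neighbour weighted by its confidence times a brightness-similarity term with an assumed width of 10 grey levels, normalised to sum to one) is an assumption; only the overall structure - a confidence-driven 3 x 3 kernel applied to H, V and C, repeated 20 times - follows the text.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Image {
        int w, h;
        std::vector<float> px;
        float  at(int x, int y) const { return px[y * w + x]; }
        float& at(int x, int y)       { return px[y * w + x]; }
    };

    // W(I, a, b, P): build a 3 x 3 kernel around (a, b). Each neighbour's weight is
    // assumed to be its confidence P scaled by a brightness-similarity term from the
    // original image I, normalised so the kernel sums to one.
    void buildKernel(const Image& I, int a, int b, const Image& P, float K[3][3]) {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int x = std::min(std::max(a + dx, 0), I.w - 1);
                int y = std::min(std::max(b + dy, 0), I.h - 1);
                float dI = I.at(x, y) - I.at(a, b);
                float w = P.at(x, y) * std::exp(-dI * dI / (2.0f * 10.0f * 10.0f));
                K[dy + 1][dx + 1] = w;
                sum += w;
            }
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) K[i][j] = (sum > 0.0f) ? K[i][j] / sum : 0.0f;
    }

    // conv(I, a, b, K): evaluate the 3 x 3 kernel K on image I at position (a, b).
    float conv(const Image& I, int a, int b, float K[3][3]) {
        float s = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int x = std::min(std::max(a + dx, 0), I.w - 1);
                int y = std::min(std::max(b + dy, 0), I.h - 1);
                s += K[dy + 1][dx + 1] * I.at(x, y);
            }
        return s;
    }

    // One smoothing pass over H, V and C; the full regularization repeats this 20 times.
    void smoothOnce(const Image& L, Image& H, Image& V, Image& C) {
        Image Hn = H, Vn = V, Cn = C;
        for (int y = 0; y < L.h; ++y)
            for (int x = 0; x < L.w; ++x) {
                float K[3][3];
                buildKernel(L, x, y, C, K);
                Hn.at(x, y) = conv(H, x, y, K);
                Vn.at(x, y) = conv(V, x, y, K);
                Cn.at(x, y) = conv(C, x, y, K);
            }
        H = Hn; V = Vn; C = Cn;
    }

    int main() {
        Image L{8, 8, std::vector<float>(64, 100.0f)}, H = L, V = L, C{8, 8, std::vector<float>(64, 0.5f)};
        for (int i = 0; i < 20; ++i) smoothOnce(L, H, V, C);
        std::printf("H(4,4) after smoothing: %f\n", H.at(4, 4));
        return 0;
    }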
  • the system then moves on to carry out all of these phases for the pair of images in the next level down (moving from coarse to fine) in the image pyramid.
  • the complete stereo matching process is iterated through all the levels of the pyramid, as illustrated by Fig.21. (There are five iterations in the present described embodiment.)
  • H and V, the horizontal and vertical disparities, are expressed in pixels at the scale of the current pyramid level.
  • H and V must therefore be multiplied by the factor 2 when moving from one pyramid level down to the next level.
  • the scaling factor f used to calculate the DoG images is 0.5 in our embodiment, as mentioned above. It will though be appreciated that other scaling factors could be used in different possible embodiments.
  • Fig.23 shows in flow chart form the general scheme of the above-described stereo matching method, and the sequence of steps used to iterate through all levels in the pyramid.
  • the confidence map can be adjusted to take occluded areas into account.
  • the pixel value of C_l is set to 0 to indicate that this is an occlusion, and thus that we have no confidence in our disparity measurement. (In our system the minimum value of confidence C is 0.04.) This is done for every pixel of C_l and C_r.
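  • the occlusion test itself is described later in this document as a comparison of the left-to-right and right-to-left disparities. A minimal sketch of such a consistency check is given below; the one-pixel tolerance, the use of the horizontal component only, and writing the 0.04 floor rather than 0 are all assumptions.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Image {
        int w, h;
        std::vector<float> px;
        float  at(int x, int y) const { return px[y * w + x]; }
        float& at(int x, int y)       { return px[y * w + x]; }
    };

    // Mark occlusions by a left/right consistency check: a pixel whose left-to-right
    // disparity is not (approximately) cancelled by the right-to-left disparity at the
    // point it maps to is treated as occluded, and its confidence is set to the minimum.
    void markOcclusions(const Image& Hlr, const Image& Hrl, Image& Cl,
                        float tol = 1.0f, float minConf = 0.04f) {
        for (int y = 0; y < Hlr.h; ++y)
            for (int x = 0; x < Hlr.w; ++x) {
                int xr = (int)std::lround(x + Hlr.at(x, y));          // where this pixel lands on the right
                if (xr < 0 || xr >= Hrl.w ||
                    std::fabs(Hlr.at(x, y) + Hrl.at(xr, y)) > tol)    // disparities should cancel
                    Cl.at(x, y) = minConf;                            // no confidence in this disparity
            }
    }

    int main() {
        Image Hlr{4, 1, {2.f, 2.f, 2.f, 2.f}}, Hrl{4, 1, {-2.f, -2.f, 0.f, -2.f}};
        Image Cl{4, 1, {1.f, 1.f, 1.f, 1.f}};
        markOcclusions(Hlr, Hrl, Cl);
        std::printf("confidences: %g %g %g %g\n", Cl.at(0,0), Cl.at(1,0), Cl.at(2,0), Cl.at(3,0));
        return 0;
    }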
  • X_p is the size of an image pixel along the x-axis.
  • This formula is used for each pixel in the left image L, and produces an output image, in one-to-one correspondence with the left camera image L, which contains a 3-D world position at each pixel .
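  • the formula itself is not reproduced in this extract. For the idealised case of rectified, parallel cameras with baseline B, focal length f, pixel size X_p and horizontal disparity d measured in pixels, the textbook relations that play this role would be as follows; this is an assumption given for orientation only, since the patent's formula uses the full internal and external orientation parameters and so also covers converged camera geometries:

    \[
    Z = \frac{f\,B}{d\,X_p}, \qquad
    X = \frac{(x - x_0)\,X_p\,Z}{f}, \qquad
    Y = \frac{(y - y_0)\,X_p\,Z}{f}
    \]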
  • given one or more dense 3-D models, represented as an image in memory with each pixel containing the distance from the principal point to the object surface, and given the internal and external orientation parameters of the camera for each model, the present invention includes a method for generating triangular polygon meshes from these models for the purposes of display and transmission. In addition, a novel method of merging the render images associated with each model has been developed.
  • Each point is categorized as UNSEEN, EMPTY or BOUNDARY.
  • an EMPTY point is one which lies between the principal point of a camera and the model surface seen from that camera, with a distance greater than a threshold δ.
  • a BOUNDARY point is one which has a distance from the principal point within tolerance δ of that seen in the model. Note that boundary points contain a signed distance. Points start with the label UNSEEN.
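  • a compact sketch of this per-voxel classification is given below. It assumes the reading given above (a point well in front of the surface is EMPTY, a point near the surface is BOUNDARY with a signed distance, everything else keeps its previous label, starting from UNSEEN); the function signature and the precedence between views are illustrative assumptions.

    #include <cmath>
    #include <cstdio>

    enum class VoxelLabel { UNSEEN, EMPTY, BOUNDARY };

    struct Classification { VoxelLabel label; float signedDistance; };

    // Classify one voxel against one camera/model.
    //   distToVoxel  : distance from the camera's principal point to the voxel centre
    //   distToSurface: distance from the principal point to the model surface along the same ray
    //   delta        : the distance tolerance
    Classification classify(VoxelLabel previous, float distToVoxel, float distToSurface, float delta) {
        if (std::fabs(distToSurface - distToVoxel) <= delta)
            return {VoxelLabel::BOUNDARY, distToSurface - distToVoxel};   // signed distance to the surface
        if (distToVoxel < distToSurface - delta)
            return {VoxelLabel::EMPTY, 0.0f};                             // strictly in front: known empty space
        return {previous, 0.0f};                                          // behind the surface: no new information
    }

    int main() {
        Classification c = classify(VoxelLabel::UNSEEN, 0.9f, 1.0f, 0.05f);
        std::printf("label = %d (0=UNSEEN, 1=EMPTY, 2=BOUNDARY)\n", (int)c.label);
        return 0;
    }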
  • the efficiency of the process of constructing the voxel image can be improved by two algorithmic methods .
  • the amount of space required by the run-length encoded image is a function of the complexity of the surface. Typically each non-empty x-row will intersect the surface twice. Assuming the depth of the boundary cells is fixed, the space requirement is quadratic in the linear dimension of the volume, rather than cubic.
  • Each level of the volume pyramid is constructed and processed as described above, with some minor variations to be described below.
  • the bottom level of this pyramid is described in section above entitled “Method for Building Polygon Meshes from Dense 3-D images”.
  • the other levels are scaled by a factor of 0.5 relative to the level below them.
  • the voxel size at level i is s_i and the distance tolerance is δ_i.
  • model images are, if necessary, scaled using a box filter to ensure that the projections in the image of the voxel edge vectors of length s_i along the x, y and z axes are less than one pixel in size.
  • the value of s_{n-1} is chosen.
  • the value of δ_{n-1} is chosen.
  • the top level (level 0) of the pyramid is processed as described in the afore-mentioned above section entitled "Method for Building Polygon Meshes ... " .
  • the render images captured from the center cameras of the pods can be used to provide an image texture to a polygon mesh M, constructed as described above.
  • the present invention contains a novel method for merging the different images to provide seamless texturing of the polygon mesh.
  • a vertex is visible if (a) the surface at that point is facing the camera, and (b) no other part of the surface is between the camera and the vertex.
  • the test for (a) is whether N · V < 0, i.e. whether the dot product of the surface normal and the viewing direction is negative.
  • the test for (b) is performed by a standard hidden surface computation on the whole mesh.
  • let the images be I_k, k ∈ {1,...,n}
  • let the confidence images be W_k, k ∈ {1,...,n}
  • let the cameras be c_k, k ∈ {1,...,n}
  • the goal of the merging method is to merge seamlessly, and with as little blurring of the image as possible.
  • Area-based merging techniques are likely to introduce blurring of the images due to misregistration and to differences in illumination between images .
  • in order to obtain a seamless merge of the textures with minimal blurring we use a boundary-based approach.
  • each triangle in the mesh M has one or more of the texture images projected onto it.
  • We determine which image projects to which triangles by optimising a combination of the confidence, or weight, associated with each triangle, and the size of connected patches to which an image projects.
  • we define the neighbors of a triangle t as the set N_t.
  • a neighbor is a triangle that shares an edge.
  • the algorithm for determining which images project to which triangles is as follows (a simplified illustrative sketch is given below, after these points):
  • each triangle t is associated with an image I_k via the partition P.
  • vertices v_i, i ∈ I, that belong to triangles in more than one partition of P are identified. These are vertices on the boundary between two partitions. These vertices will be used to define the merge between textures.
  • the number of additional vertices to be added is user-specified. The rationale for this addition is to ensure that the boundaries, when projected onto the images, are dense and will provide a smooth merge.
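  • a deliberately simplified sketch of the partitioning step is given below. It assigns each triangle greedily to the image with the highest weight and then collects the boundary vertices; the connected-patch-size term of the optimisation described above is omitted, and the type and function names, and the way the per-triangle weights are obtained, are assumptions.

    #include <cstdio>
    #include <set>
    #include <vector>

    struct Triangle { int v[3]; };                       // indices into the vertex array

    // weight[k][t]: confidence with which image k sees triangle t (e.g. derived from
    // the per-vertex visibility tests and confidence images W_k). Returns, for each
    // triangle, the index of the image chosen to texture it.
    std::vector<int> assignImages(const std::vector<std::vector<float>>& weight, int numTriangles) {
        std::vector<int> part(numTriangles, -1);
        for (int t = 0; t < numTriangles; ++t) {
            float best = -1.0f;
            for (size_t k = 0; k < weight.size(); ++k)
                if (weight[k][t] > best) { best = weight[k][t]; part[t] = (int)k; }
        }
        return part;
    }

    // Boundary vertices: vertices belonging to triangles assigned to more than one image.
    std::vector<int> boundaryVertices(const std::vector<Triangle>& tris, const std::vector<int>& part,
                                      int numVertices) {
        std::vector<std::set<int>> seen(numVertices);
        for (size_t t = 0; t < tris.size(); ++t)
            for (int i = 0; i < 3; ++i) seen[tris[t].v[i]].insert(part[t]);
        std::vector<int> boundary;
        for (int v = 0; v < numVertices; ++v)
            if (seen[v].size() > 1) boundary.push_back(v);
        return boundary;
    }

    int main() {
        std::vector<Triangle> tris = {{{0, 1, 2}}, {{1, 2, 3}}};
        std::vector<std::vector<float>> weight = {{0.9f, 0.2f}, {0.3f, 0.8f}};  // two images, two triangles
        std::vector<int> part = assignImages(weight, 2);
        std::vector<int> b = boundaryVertices(tris, part, 4);
        std::printf("triangle 0 -> image %d, triangle 1 -> image %d, %zu boundary vertices\n",
                    part[0], part[1], b.size());
        return 0;
    }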

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Length Measuring Devices By Optical Means (AREA)

Abstract

A method and apparatus for measuring stereo image disparity, for use in a 3-D modelling system. The method includes processing left and right camera images to form an image pyramid, calculating a disparity map at the coarsest level in the pyramid, and using this disparity map to carry out a warping operation on one of the images at the next-coarsest level, prior to calculating a disparity map for that level. This process is repeated for each subsequent pyramid level, at each level using the disparity map obtained at the previous level for carrying out the warping process, until a final disparity map for the least coarse pair of images in the pyramid is obtained. A computer program product for implementing this method is claimed, as well as a new method and apparatus for calibrating the cameras.

Description

IMPROVED METHODS AND APPARATUS FOR 3-D IMAGING
The present invention relates to apparatus and methods for the acquisition of three-dimensional (3-D) model images using stereo imaging techniques . More specifically, although not exclusively, the invention provides novel calibration and stereo matching methods for use in stereo imaging.
Three-dimensional or "3-D" imaging is of great interest as there are numerous potentially useful applications of such a technique including, for example, capturing 3-D images of objects, human faces or the environment, for quantitative measurement or visual recordal purposes . One known technique involves capturing a pair of stereo images of an object, commonly using two cameras, and "matching" the two images in order to construct a so-called "disparity map" specifying the stereo disparity for each pixel in one image relative to the other image. The disparity map can be used to construct a 3-D model image, from the pair of captured images, for viewing on a computer screen, the 3-D image being rotatable on screen for viewing from apparent different angles.
The disparity map is a two dimensional array which specifies for each pixel p(x,y) in one image (e.g. the left image) of the pair, the displacement {as a vector (x,y) } to a corresponding point πr in the other image (i.e. the right image) . By a "corresponding point" in the right image we mean the point in the right image at which the scene component π8 imaged by pixel p(x,y) in the left image appears in the right image. The construction of the disparity map is a crucial step in the 3-D imaging process and the accuracy of the final 3-D model which is obtained will depend on the quality of the disparity map. To date, the matching techniques used to construct such disparity maps have required substantial processing power and computation time, for example around 45 minutes to match two 512x512 pixel images.
Furthermore, one known correlation-based matching method uses a 5 x 5 pixel window to compute correlation between the left and right images. The assumption is that given an image location u = (x,y) in the left image, L, there is a corresponding point u' in the right image, R, such that L_u = R_u' + Δ_u, where Δ is a noise process. The correlation estimate opens a 5 x 5 (in the preferred embodiment) window around u in L and u' in R to determine the most likely u'. To be accurate this formula assumes that there is no disparity gradient over this small window. This assumption is invalid, as in fact the geometric distortion over a 5 x 5 pixel window can be as much as 1 pixel. One solution is to reduce the size of the window, when such distortion occurs. Unfortunately, this increases the effect of noise, and in any case the discrete nature of image sampling cannot eliminate the problem. Another solution is to apply a local affine map. Unfortunately local computation of this map is difficult, and any local fitting process is liable to over-fitting because of the lack of available data.
Another important aspect of the stereo imaging process is the calibration of the cameras used to record the raw images. This usually requires skilled intervention by the operator of the imaging system to carry out a calibration procedure in order to achieve accurate calibration, or "factory" calibration which reduces the range of operating volumes of the 3-D imaging process. Projective distortion effects in the imaging system can also affect the accuracy of the calibration achieved.
The construction or so-called "recovery" of three-dimensional surfaces from disparity maps, once camera calibration has been achieved, is well understood in the art. Given a number of 3-D models generated from a stereo matching system, or other range-finding device, there are two problems: (a) determine transformations that will bring the different models into registration, and (b) integrate the different 3-D models into a single model. The integration of the 3-D models can be achieved via the popular method of volumetric representation, for example as described by Curless & Levoy (SIGGRAPH 96 Conference Proceedings, pages 303-312) . Problems are though often encountered when attempting to smoothly merge photographic render images associated with the individual 3-D models into the integrated 3-D model.
It is an object of the present invention to avoid or minimise one or more of the foregoing disadvantages.
According to a first aspect of the invention we provide a method of measuring stereo image disparity for use in a 3-D modelling system, the method comprising the steps of :
(a) producing a first camera output image of an object scene;
(b) producing a second camera output image of said object scene;
(c) digitising each of said first and second camera output images and storing them in a storage means; (d) processing said first and second digitised camera output images so as to produce an image pyramid comprising a plurality of successively produced pairs of filtered images, each said pair of filtered images providing one level in the pyramid, each successive pair of filtered images being scaled relative to the pair of filtered images in the previous pyramid level by a predetermined amount and having coarser resolution than the pair of filtered images in said previous pyramid level, and storing these filtered images; (e) calculating an initial disparity map for the coarsest pair of filtered images in the pyramid by matching one image of said pair of coarsest filtered images with the other image of said coarsest pair of filtered images;
(f) using said initial disparity map to carry out a warping operation on one image of the next-coarsest pair of filtered images in the pyramid, said warping operation producing a shifted version of said one image; and
(g) matching said shifted version of said one image of the next -coarsest pair of images with the other image of said next-coarsest pair of images so as to obtain a respective disparity map for said other image and said shifted image, which respective disparity map is combined with said initial disparity map so as to obtain a new, updated, disparity map for said next-coarsest pair of images; and (h) repeating steps (f) and (g) for the pair of filtered images in each subsequent pyramid level, at each level using the new, updated disparity map obtained for the previous level as said initial disparity map for carrying out the warping process in step (f) , so as to arrive at a final disparity map for the least coarse pair of images in the pyramid. Said processing step (d) for image pyramid generation may conveniently comprise operating on said first and second digitised camera output images with a scaling and convolution function. The plurality of pairs of filtered images produced by said scaling and convolution function are preferably Difference of Gaussian (DoG) images. Alternatively, the scaling and convolution function may produce a plurality of pairs of Laplacian images. Other filtering functions could alternatively be chosen, as long as each pair of filtered images produced by the chosen function are of a relatively lower resolution than the resolution of the pair of images in the previous pyramid level . The first pair of filtered images produced in step (d) may in fact be of the same scale (i.e. equal in size) as the digitised first and second camera output images. Preferably, each subsequent pyramid level contains images which are scaled by a factor of f, where 0 < f < 1, relative to the previous level. Preferably, scaling and summing over all levels of the pyramid provides the original digitised first and second camera output images.
Thus, it will be appreciated that step (d) above preferably includes successively producing a scale pyramid of pairs of left and right filtered images, preferably Difference of Gaussian (DoG) images, from said first and second digitised camera output images, each successive scale providing smaller images having a lower ("coarser") resolution. Advantageously, starting with the pair of images of smallest size and lowest resolution (namely the pair of images from the top level of the pyramid), and the initial disparity map obtained therefor, each pair of left and right images in the scale pyramid may successively and sequentially be used to calculate a new, updated, disparity map at the next level down in the pyramid.
This process proceeds from the coarsest to the finest scale of detail, propagating down new, further refined, values for the disparity map for said first and second camera output images at each transition between scales of the pyramid. Preferably, there are at least five levels in the scale pyramid and the process is therefore iterated at least five times. The final disparity map provides an accurate measure of the disparity between the first and second digitised camera output images.
It will be appreciated that the filtered "images" and shifted or "warped" images referred to above are in the form of two- dimensional data arrays. Each disparity map in practice preferably comprises two two-dimensional data arrays, a first said array comprising the horizontal disparity values for each pixel (in a chosen one of said first and second images relative to the other of said first and second images) and a second said array comprising the respective vertical disparity values for each pixel .
A significant advantage of the above-described invention is that the warping operation ensures that the current best estimate of the disparity (for each pixel) , inherited from the previous scale or "level" of the pyramid, is always used when matching the pair of images in the next level down in order to calculate a new, even better estimate of the disparity. This tends to minimise the adverse effects of geometric distortion. By iterating this process through all the scale pyramid levels we can obtain a final, very accurate disparity map which can then be used to construct a 3-D or "stereo" image from the original (i.e. first and second) digitised camera output images, as will be described hereinafter. Optionally, the method may include repeating steps (f) and
(g) , namely the warping and matching steps, once or more, at one or more of said pyramid levels, using the latest available disparity map for each warping operation. This tends to further improve the accuracy of the final disparity map for the object scene.
The method preferably further includes constructing a confidence map conveniently in the form of a two-dimensional data array, during each iteration of the above-described method. Each confidence map provides an estimate of the confidence with which the calculated new, further refined (horizontal and vertical) disparity values for each pixel is held. The method preferably also includes the further step of carrying out a smoothing operation on the disparity and confidence maps produced at each level in the scale pyramid of images (for said first and second camera output images of the object scene) , prior to using these smoothed maps in the calculation of the new, further refined disparity and confidence maps at the next level. The smoothing operation preferably comprises convolving each map with a predetermined weight construction function W(I,a,b,P) which, preferably, is dependent upon the original image intensity (brightness) values, and the confidence values, associated with each pixel. This weight construction function preferably computes a convolution kernel, for example a 3 X 3 kernel, on an image (data array) I at pixel p(a,b) using a probability array P, which is preferably the confidence map C. Advantageously, this convolution is repeated a plurality of times, preferably at least ten, advantageously twenty, times in order to obtain a final smoothed version of each disparity map and confidence map produced at each pyramid level. The above-described smoothing operation results in a further improvement in the accuracy of the final calculated disparity map which, in turn, results in improved quality in the 3-D image which is ultimately constructed using the first and second digitised camera output images and the final disparity map (together with calibration parameters for the cameras) .
In steps (e) and (g) of the above-described method, said matching process by means of which one image is matched with another is preferably carried out in each case by:
(a) calculating horizontal correlation values for the correlation between the neighbourhood around each pixel p(x,y) in one image and the neighbourhoods around at least three horizontally co-linear points in a spatially corresponding area of the second image, and fitting a parabolic curve (namely, a quadratic function) to said horizontal correlation values and analysing said curve (i.e. quadratic function) in order to estimate a horizontal disparity value for the said pixel p(x,y) ; and
(b) calculating vertical correlation values for the correlation between the neighbourhood around each pixel p(x,y) in one image and the neighbourhoods around at least three vertically co-linear points in a spatially corresponding area of the second image, and fitting a parabolic curve (namely, a quadratic function) to said vertical correlation values and analysing said curve (i.e. quadratic function) in order to estimate a vertical disparity value for the said pixel p(x,y).
The data values (which are real numbers representing image brightness) of the image for fractional points located between pixels may be obtained by a suitable interpolation method, preferably using bilinear interpolation.
An advantage of this matching method is that the correlation process is carried out at sub-pixel level i.e. correlation values for fractional points located between actual pixels are calculated and used to fit the parabolic curve. This results in a more accurate estimate of the horizontal and vertical disparities being achieved.
For the avoidance of doubt, the terms "horizontal" and "vertical" are used above with reference to the horizontal lines (rows) and vertical lines (columns) of pixels in the images, and not intended to refer to the actual orientation of the image, for example with respect to a surface or horizon.
It will be appreciated that in the above-described stereo matching method we calculate the disparity for each pixel in one of the first and second (digitised) camera images relative to the other of these two camera images e.g. the disparity for each pixel in the first (hereinafter referred to as the "left") image relative to the second (hereinafter referred to as the "right") image - these could be termed the left-to-right disparities. In a further improvement, the entire above-described stereo matching method may be repeated, this time calculating the disparities in the reverse direction i.e. the right-to-left disparities. By comparing the right-to-left disparities with the left-to-right disparities we can detect occluded areas.
For additional accuracy, the method may also include illuminating the object scene with a textured pattern. This may conveniently be achieved by projecting the pattern onto the object scene using a projector means. Preferably, the pattern comprises a fractal pattern, for example in the form of a digitally generated fractal random pattern of dots of different levels of transparency. Illuminating the object scene with a textured pattern has the advantage of generating detectable visual features on surfaces in the object scene which, due to a uniformity in colour, would otherwise be visually flat . The use of a fractal pattern ensures that texture information will be available in the Difference of Gaussian (DoG) images at all levels in the scale pyramid.
According to a second aspect of the invention we provide a 3-D image modelling system comprising: first camera imaging means for producing a first camera output image of an object scene; second camera imaging means for producing a second camera output image of said object scene; digitising means for digitising each of said first and second camera output images; storage means for storing said digitised first and second camera output images; and image processing means programmed to:
(a) process said first and second camera output images so as to produce an image pyramid of pairs of filtered, preferably
Difference of Gaussian (DoG) , images from said first and second digitised camera output images, each successive level of the pyramid providing smaller images having coarser resolution, and said storage means being capable of also storing the pairs of filtered images so produced;
(b) process the coarsest pair of filtered images in the pyramid so as to: calculate an initial disparity map for said coarsest pair of filtered images; use said initial disparity map to carry out a warping operation on one said next-coarsest pair of filtered images, said warping operation producing a shifted version of said one of said next-coarsest pair of filtered images; matching said shifted version of said one of said next-coarsest pair of filtered images with the other of said next-coarsest pair of filtered images to obtain a respective disparity map for said other image and said shifted image, which disparity map is combined with said initial disparity map to obtain a new, updated, disparity map for said next-coarsest pair of filtered images;
(c) iterating said warping and matching processes for the pair of images at each subsequent level of the scale pyramid, at each level using the new, updated disparity map from the previous iteration as the "initial" disparity map for carrying out the warping step of the next iteration for the next level, prior to calculating the new, updated, disparity map at this next level, so as to obtain a final disparity map for the least coarse pair (i.e. the finest reolution pair) of filtered images in the image pyramid; and
(d) operating on said first and second digitised camera output images using said final disparity map, in a 3-D model construction process, in order to generate a three-dimensional model from said first and second camera output images.
The 3-D modelling system preferably further includes projector means for projecting a textured pattern onto the object scene, preferably for projecting a fractal random dot pattern onto the object scene.
According to a third aspect of the invention we provide a computer program product comprising: a computer usable medium having computer readable code means embodied in said medium for carrying out a method of measuring stereo image disparity in a 3-D image modelling system, said computer program product having computer readable code means for: processing data corresponding to a pair of first and second digitised camera output images of an object scene so as to produce filtered data corresponding to a plurality of successively produces pairs of filtered images, each pair of filtered images providing one level in the pyramid, each pair of filtered images being being scaled relative to the pair of filtered images in the previous level by a predetermined amount and having coarser resolution than the pair of images in said previous level; calculating an initial disparity map for the coarsest pair of filtered images by matching filtered data of one image of said coarsest pair of filtered images with the filtered data of the other image of said coarsest pair of filtered images; using said initial disparity map to carry out a warping operation on the data of one image of the next-coarsest pair of filtered images in the pyramid, said warping operation producing a shifted version of said one image; and matching said shifted version of said one of said next- coarsest pair of images with the other of said next-coarsest pair of images so as to obtain a respective disparity map for said other image and said shifted image, which disparity map is combined with said initial disparity map so as to obtain a new, updated, disparity map for said next-coarsest pair of filtered images; and iterating said warping and matching processes for the pair of images at each subsequent level of the scale pyramid, at each level using the new, updated disparity map from the previous iteration as the "initial" disparity map for carrying out the warping step of the next iteration for the next level, prior to calculating the new, updated, disparity map at this next level, so as to obtain a final disparity map for the least coarse pair (i.e. the finest reolution pair) of filtered images in the image pyramid.
The computer program product preferably further includes computer readable code means for: operating on said first and second digitised camera output image data using said final disparity map, in a 3-D model construction process, in order to generate data corresponding to a three-dimensional image model from said first and second digitised camera output image data.
According to a fourth aspect of the invention we provide a method of calibrating cameras for use in a 3-D modelling system so as to determine external orientation parameters of the cameras relative to a fixed reference frame (one of the said cameras may be used as the fixed reference frame, although this need not always be the case) , and determine internal orientation parameters of the cameras, the method comprising the steps of: (a) providing at least one calibration object having a multiplicity of circular targets marked thereon, wherein said targets lie in a plurality of planes in three dimensional space and are arranged such that they can be individually identified automatically in a camera image of said at least one calibration object showing at least a predetermined number of said circular targets not all of which lie in the same plane; (b) storing in a memory means of the modelling system the relative spatial locations of the centres of each of said target circles on said at least one calibration object; (c) capturing a plurality of images of said at least one calibration object with each of a pair of first and second cameras of the modelling system, wherein at least some points, on said at least one calibration object, imaged by one of said cameras are also imaged by the other of said cameras;
(d) analysing said captured images so as to: locate each said circular target on said at least one calibration object which is visible in each said captured image and determine the centre of each such located circular target, preferably with an accuracy of at least 0.05 pixel, most preferably with up to 0.01 pixel accuracy; and identify each located circular target as a known target on said at least one calibration object;
(e) calculating initial estimates of the internal and external orientation parameters of the cameras using the positions determined for the centres of the identified circular targets.
The method preferably includes the further step of :
(f) refining the initial estimates of the internal and external orientation parameters of the cameras using a least squares estimation procedure.
The terms "external orientation parameters" and "internal orientation parameters" of the cameras are well known and understood in the art, and are therefore not described in detail herein. For the avoidance of doubt, the method according to the fourth described aspect of the invention calculates, for each said camera, initial estimates of at least the following internal orientation parameters: the position (in x-y co-ordinates) of the principal point of the camera; the focal length f of the camera; and the relative size of an image pixel (relative to the x and y axes) . The method preferably further includes calculating estimates for further internal orientation parameters, namely lens distortion parameters. The external orientation parameters for which initial estimates are calculated preferably comprise three location (i.e. linear position) parameters and three angular (position) parameters.
Step (e) may conveniently be done using a Direct Linear Transform (DLT) technique.
Step (f) may conveniently be carried out by applying a modified version of the co-linearity constraint, conveniently in the form of an iterative non-linear least squares method, to the initial estimates of the internal and external orientation parameters of the cameras, to calculate a more accurate model of the internal and external orientation parameters of the cameras, which may include lens distortion parameters.
One advantage of the above-described calibration method is that, by using a calibration object designed to provide a source of points the location of which in space is known accurately, and the location of which can be accurately identified in images, calibration of the cameras can be achieved automatically without user intervention, for example manipulation of the camera positions, being required.
The method preferably also includes the step of modelling perspective distortion, which causes the centres of circular targets on the or each said calibration object not to be in the centre of the corresponding ellipses appearing in the captured camera images. This process is preferably incorporated in the afore-mentioned iterative non-linear least squares method used to calculate a more accurate model of the internal and external orientation parameters of the cameras, so as to calculate ellipse correction parameters . This enables further improved accuracy to be achieved in the model of the internal and external orientation parameters of the cameras. It will be appreciated that it is the feature of using circular targets in the calibration object (s) which enables such ellipse correction to be achieved.
According to yet another aspect of the present invention we provide a 3-D modelling method incorporating the afore- described method of calibrating cameras, and the afore- described method for measuring stereo image disparity, and wherein the estimated internal and external parameters of the first and second cameras, and the final calculated disparity map for the first and second camera images, are used to construct a 3-D model of the object scene. The 3-D model may be produced in the form of a polygon mesh.
Accordingly, in a yet further aspect of the invention, we provide a 3-D modelling system as afore-described in which the image processing means is further programmed to carry out steps (d) and (e) , and preferably also step (f) , in the above- described method of calibrating cameras, and wherein the system further includes at least one said calibration object and storage means for storing constructed 3-D models. Optionally, in any of the above-described 3-D modelling methods and apparatus, more than two cameras may be used, for example a plurality of pairs of left and right cameras may be used, in order to allow simultaneous capture of multiple pairs of images of the object scene. In a preferred embodiment of the 3-D modelling method of the invention, in which multiple pairs of images of the object scene are captured in this manner, each pair of images may be used to produce a 3-D model of the object scene, using one or more of the afore-described techniques, and the plurality of said 3-D models thus produced are preferably combined together in a predetermined manner to produce a single, output 3-D model. This single, output 3-D model is preferably produced in the form of a polygon mesh. Preferably, the plurality of 3-D models are combined in an intermediate 3-D voxel image which may then be triangulated, conveniently using an isosurface extraction method, in order to form the polygon mesh 3-D model.
The method may also include integrating one or more render images onto said polygon mesh so as to provide texturing of the polygon mesh. One or more dedicated cameras may be provided for capturing said one or more render images . In one possible embodiment the 3-D modelling system of the invention may therefore incorporate three cameras, a left and right camera pair for capturing the left and right object scene images, and a third, central camera for capturing at least one image for providing visual render information. Such a camera triple may be referred to as a "pod". The 3-D modelling system of the invention preferably incorporates one or more such pods. Alternatively, in another possible embodiment one of said first and second cameras may be used to capture at least one image for providing visual render. Preferably, said one or more render images are merged onto the polygon mesh is such a way as to achieve substantially seamless texturing of the polygon mesh, with image blurring kept to a minimum. This may be achieved by using a boundary- based merging technique, rather than an area-based merging approach. Each triangle of the polygon mesh has one or more of said render images projected or mapped thereon, and the system determines which image projects or maps most accurately onto the said triangle by analysing the confidence, or weighting, values associated with the vertices of the said triangle, and preferably also taking into account the size (i.e. area) of the said triangle onto which the render image (s) is/are being projected.
More than one calibration object may sometimes be used in the 3-D modelling method and apparatus of the invention. Where multiple pairs or pods of cameras are used, there is preferably provided a different calibration object for use with each said pair, or each said pod. Each calibration object may advantageously be provided with at least one optically readable bar code pattern which is unique to that calibration object. The image processing means is preferably programmed to locate said at least one bar code pattern for each calibration object imaged by the cameras, and to read and identify said bar code pattern as one of a pre-programmed selection of bar code patterns stored in a memory means of the apparatus, each said stored bar code pattern being associated with a similarly stored set of data corresponding to the respective calibration object. In this manner individual ones of the calibration object may be recognised by the image processing means where multiple calibration object appear in a single captured image. Each said calibration object is preferably additionally provided with bar code location means in the form of a relatively simple locating pattern which the image processing means can locate relatively easily and, from the location of said locating pattern, identify that portion of the image containing said at least one bar code pattern, prior to reading said bar code pattern.
Preferred embodiments of the invention will now be described by way of example only and with reference to the accompanying drawings in which:
Fig.1 is a schematic diagram of a camera apparatus according to one embodiment of the invention;
Figs.2 (a) , (b) , and (c) show three image frames captured of a human subject, comprising left texture, right texture and left render images respectively;
Fig.3 is a disparity map for left and right captured images;
Fig.4 is a Lambertian shaded view of a 3-D model image of the human subject of Figs.2; Figs.5 (a) and (b) are 3-D mesh models of the human subject;
Fig.6 is the 3-D mesh model of Fig.5 (a) overlaid with the left render image of Fig.2(c);
Fig.7 is a random dot pattern;
Fig.8 is a stereo pair of images of a shoe last illuminated with the random dot pattern of Fig.7;
Fig.9 shows the stereo image pair of Fig.8 at level five of the scale pyramid obtained therefrom;
Fig.10 shows a fractal random dot image;
Fig.11 shows a stereo pair of images of the shoe last illuminated with the random dot pattern of Fig.10;
Fig.12 shows the stereo image pair of Fig.11 at level five of the scale pyramid obtained therefrom; Fig.13 shows a barcode design;
Fig.14 shows front, side and bottom views of a calibration object used in one embodiment of the invention;
Fig.15 is a flow diagram illustrating part of the calibration process used in one embodiment of the invention;
Fig.16 illustrates a contour following set-up;
Figs. 17(a) and (b) illustrate the definitions of slant (σ) angle and tilt (τ) angle respectively;
Fig. 18 illustrates a barcode reading process; Fig.19 illustrates projective distortion of the center of a circle;
Fig.20 is a flow chart illustrating the formation of a
Difference of Gaussian (DoG) pyramid;
Fig.21 is a flow chart illustrating a main loop of a stereo matching process of the invention;
Fig.22 is a flow diagram illustrating a matching process carried out at each level of the scale pyramid; and
Fig.23 is a flow chart illustrating in further detail how the matching process of Fig.22 is used to operate on all levels of the scale pyramid.
The present invention is a 3-D camera apparatus and associated processing methods, capable of capturing 3-D models of a wide variety of objects. Figure 1 shows a 3-D camera apparatus 1 consisting of : a monochrome or colour left camera 3 ; a monochrome or colour right camera 4 ; an (optional) central colour camera (not shown in Fig.l); a calibration object 7 for determining the external and internal orientation parameters of the cameras 3, 4, relative either to the object 7 or to one of the cameras. The cameras 3,4 are, in this embodiment, digital still cameras. However, in other alternative embodiments the cameras 3,4 may, for example, be videocameras or 35mm cameras.
The calibration object 7 contains an identifying code and a number of circular targets; a Central Processing Unit 11 (hereinafter referred to as "the
CPU" or "the computer") for controlling the operation of the cameras, and projector, and for processing the images to produce a 3-D rendered model; a digitiser 9 for digitizing each of the output images from the left and right cameras 3, 4 and storing them in the CPU
11; a projector 13 is controlled by the CPU 11 and, in use, projects a fractal pattern over an object scene imaged by the left and right cameras; mounting means, such as tripods (not shown) , for mounting the cameras 3 , 4 so as to allow for user controlled alteration to the separation and orientation of the cameras; a storage medium (e.g. memory) 15 for storing 3-D models; and two frame buffers 17,18 connected between the digitizer 9 and the CPU 11.
The cameras, imageprocessing hardware, mounting hardware, CPU and storage medium for the present invention are commercially available. Suitable cameras are: left camera Sony XC77-CE, right camera JVC FY55-RGB. Suitable image processing hardware is the DataCell Snapper 24. The separation and rotation of the left and right cameras 3,4 are adjustable, as are the two lenses 6, 8 attached to the respective cameras. This allows a variety of viewing volumes to be imaged. The use of the calibration target 7 (described in detail later) , allows rapid calibration without user intervention after alteration of the positions of the cameras 3 , 4 or changes in lenses 6, 8.
A summary of the operation of the apparatus 1 of Fig. 1 will now be given, followed by a more detailed description of the apparatus and methods used.
1. The cameras 3 , 4 are calibrated by capturing several images of the calibration object 7, with the object 7 visible from both cameras in at least one of the images. This calibration process is described in more detail later.
2. The projector 13 is set to project a fractal pattern onto the object scene. (Details of the fractal pattern are described later) .
3. A target object or human subject to be modeled (not shown) is then placed in front of the camera. In the preferred embodiment for a single pair of cameras using 50mm C-mount lenses the subject is situated at a distance of 1.5 meters from the cameras 3, 4 which are separated by 30cm.
4. The projector 13 is used to light up the subject or target with a textured illumination pattern. This is derived from a 35mm slide with a digitally generated random fractal pattern of dots of different levels of transparency. The purpose of the illumination with textured light is to generate detectable visual features on surfaces which, due to a uniformity in colour, would otherwise be visually flat . (The use of a fractal pattern ensures that texture information will be available at all levels of a Difference of Gaussian (DoG) image pyramid) .
5. The CPU 11 instructs the two cameras 3, 4 to simultaneously capture a single frame of the scene containing the subject.
6. The frames from the left and right cameras 3, 4 are downloaded respectively into the two frame buffers 17, 18. Call these frames Af and Bf. Two such frames Af, Bf obtained of a human subject's head are shown in Figs. 2(a) and (b) .
7. The computer instructs the projector 13 to illuminate the subject or target with uniform white light. The purpose of this is to allow the underlying colour and texture of the subject's face, or the target's surfaces to be recorded.
8. One of the cameras, preferably the left camera, is instructed to capture a frame Cf (or "render image") of the target or subject illuminated with white light.
9. The frame Cf is downloaded into the left frame buffer 17. Fig 2(c) shows such a frame captured of the human subject of Figs. 2(a) and (b) .
10. The computer calculates the disparity between the left and right images. For each pixel position a horizontal disparity and a vertical disparity are calculated between the left and right images. The confidence of the disparity estimates, stored as a floating point image with elements in the range 0-1, is also generated. The disparity and confidence (in the horizontal and vertical directions) are then organised as two image frames, Df, Ef. The horizontal disparity image Df for the human subject of Fig. 2 is shown in Figure 3. Details of the matching method used to calculate the disparities are given later.
11. This disparity map is translated, using the known internal and external orientation parameters of the left and right cameras 3, 4, into a dense 3-D map or image Ff (shown in Fig. 4), where each pixel in the image Ff represents a position in space. A representation of this map is shown in Fig. 4 where the model has been rendered into a byte-coded image by Lambertian shading of the model, with a light source positioned at the principal point of the left camera.
12. This 3-D model image is translated, using the known internal and external orientation parameters of the cameras, into a 3-D mesh representing the subject. This mesh can be conveniently written out as a VRML file on the storage medium 15. Figs. 5(a) and (b) show the mesh from two different angles of view. The fact that the mesh is a 3D model allows the angle of view to be altered so that a computer can present the same object from alternative angles.
13. The CPU 11 stores on the storage medium 15, along with the 3D mesh, Frame Cf, the image obtained from the left camera with white light. With appropriate graphics software it is possible to reconstruct different fully textured views of the original subject, such as that shown in fig. 6, using the stored 3-D mesh and the render image (frame Cf) . In the preferred embodiment more than two cameras are used. These are described herebelow. The preferred operation of the system, when more than two cameras are used organises the cameras in triples, with left and right monochrome cameras and a central color camera in each triple. We shall call each such camera triple a pod. The left and right cameras 3, 4 form a stereo pair for matching purposes, while the central camera is used to capture the visual render image Cf . The cameras are connected in a tree structure via a Universal Serial Bus (USB) (not shown) , using USB hubs to form the connection topology.
Multiple projectors 13 may be used, and the projectors are positioned so that, the fractal texture pattern projected thereby covers the object or objects to be captured. The design of the fractal texture is such that the output of multiple projectors may overlap.
Multiple calibration objects 7 may be used. The design of the objects 7 is such that multiple calibration objects may be recognised when they occur in a single image, and identified. The design of the objects and the methods employed are described in full detail herebelow. Where multiple calibration objects 7 are used, operation proceeds as follows: 1. The cameras are calibrated by capturing multiple images of the calibration objects. In order to ensure that all cameras are calibrated within the same co-ordinate frame the following constraints must be satisfied. Define a graph <V,E> with edges E and vertices V. The vertices are C ∪ O, where C is the set of cameras and O is the set of calibration objects being used. An edge (c,o), c ∈ C, o ∈ O, is in the graph if there exists an image I captured by camera c with o visible in the image. Similarly an edge (o,c), c ∈ C, o ∈ O, is in the graph if there exists an image I that contains o and I was captured by camera c. The constraint that must hold for calibration to succeed is that the graph <V,E> is connected (a sketch of this connectivity check is given after this numbered list).
(Details of the calibration method can be found in a later section of this description).
2. Each projector is set to project the fractal pattern onto the subject. (Details of the fractal pattern are provided later) .
3. A target object or human subject is placed in front of the cameras.
4. The computer issues a signal to the left and right camera pairs in the pods via the USB, instructing them to capture an image. The captured image is stored in internal memory in each camera.
5. The computer instructs the central cameras on the pods to fire simultaneously, via the USB, and store the resulting image. These pictures are captured in ambient light. For optimal texture the ambient light should be controlled to
achieve the illumination desired, as with a conventional film camera. This can conveniently be done using standard photographic flash equipment.
6. The frames from each pod are transferred under control of the CPU via the USB, either to the storage medium 15 or to the computer memory.
7. For each pair of left and right cameras, the computer calculates the disparity between the left and right images (as described later) .
8. The disparity maps are used together with the known internal and external orientation parameters of the left and right cameras to generate dense 3-D maps as described above.
9. The dense 3-D maps, together with the internal and external orientation parameters of the left cameras, are used to generate a 3-D implicit surface image of the complete region imaged. This process is explained in detail later.
10. The implicit surface is polygonized, using a method such as Marching Cubes [Lorensen and Cline, SIGGRAPH '87 Conference Proceedings, Anaheim, CA, July 1987, p.163-170]; Mesh Propagation [Howie and Blake, Computer Graphics Forum, 13(3):C/65-C/74, October 1994]; or span space methods [Livnat et al., IEEE Transactions on Visualisation and Computer Graphics, 2(1):73-84, March 1996], to generate a triangular mesh. The triangle mesh can be decimated by applying, for example, the method of Garland and Heckbert [SIGGRAPH 96 Conference Proceedings, Annual Conference Series, p.209-216, 1997] if required for ease of display.
11. The white light (render) images from the center cameras, the internal and external orientation parameters of the center cameras, and the location in space of the vertices of the generated polygon mesh are used to merge the render images, so that they can be projected seamlessly onto the polygon mesh. This method is also described later.
12. If accurate measurement from the generated models is required, the polygon mesh can be used to merge the dense models, using the same method as in 11 above, and can also be used to mask the areas of the models which correspond to the object (s) being imaged.
13. The computer 11 stores on the storage medium 15, along with the 3D mesh, the images obtained from the central cameras and their internal and external orientation parameters. If accurate measurements and the highest quality rendered images of the object(s) from different viewpoints are required, the merged dense 3-D model images obtained in step 12 above can also be stored on the storage medium 15.
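By way of illustration only, the connectivity constraint of step 1 above can be checked with a simple union-find structure. The sketch below is not taken from the specification; the type names and the encoding of observations (one (camera, object) index pair per captured image) are assumptions.

#include <numeric>
#include <utility>
#include <vector>

// Hypothetical sketch: verifies that the camera/calibration-object graph is
// connected, so that all cameras can be placed in one co-ordinate frame.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Cameras are numbered 0..nCameras-1, objects 0..nObjects-1; each observation
// is a (camera, object) pair recorded when that object is visible in an image
// captured by that camera.
bool calibrationGraphConnected(int nCameras, int nObjects,
                               const std::vector<std::pair<int,int>>& observations)
{
    UnionFind uf(nCameras + nObjects);            // object vertices offset by nCameras
    for (const auto& obs : observations)
        uf.unite(obs.first, nCameras + obs.second);
    int root = uf.find(0);
    for (int v = 1; v < nCameras + nObjects; ++v)
        if (uf.find(v) != root) return false;     // a camera or object is disconnected
    return true;
}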
FRACTAL TEXTURE PATTERN
Images of objects illuminated with natural light can contain visually flat areas where accurate estimation of disparity is difficult. When maximum accuracy is required the current invention uses texture projection to illuminate the object(s) of interest with a textured pattern which provides information for the stereo matching method (described later). One way to do this is to project a random dot pattern onto the object to be modeled. A random dot pattern can be generated as follows.
1. Initialise a byte-coded image, arranged as an array of numbers in memory. This image should have a user-defined size. In the preferred implementation this is 600 x 800.
2. Initialise a random number generator, which generates uniformly distributed random numbers in the range 0-255 (the range of a byte).
3. For each pixel of the image, assign a new random value generated by the random number generator.
Fig. 7 shows a random dot pattern and Fig. 8 shows a stereo image pair of a shoe last illuminated with the random dot pattern. Fig. 7 is a non-fractal image. The match method (described later) estimates disparities at each level of a DoG pyramid created from a stereo image pair. The fine grain of the random dot pattern gradually diminishes towards the top of the pyramid, as illustrated in Fig. 9, which shows the stereo image pair of Fig. 8 at level 5 of the pyramid (the original image is taken to be level 0). To eliminate this problem of diminishing pattern, in the method and apparatus of the present invention we use a fractal random dot pattern, which maintains the same image properties at all levels of the DoG pyramid. An example is shown first and then the method to generate the fractal random dot image is described. Fig. 10 shows a fractal random dot image and Fig. 11 shows a stereo image of the same last illuminated with the fractal random dot image projection. Fig. 12 shows the stereo image pair of Fig. 11 at level 5 of the pyramid. Comparing these Figs. with the respective ones of Figs. 7 to 9: as seen from Fig. 10, the fractal pattern is in general slightly more dense in some areas than the random dot pattern of Fig. 7; and comparing Fig. 11 with Fig. 8, and their level-5 counterparts (Figs. 12 and 9), it can be seen that the fractal dot pattern is preserved better throughout the pyramid than the non-fractal pattern of Fig. 7.
The method used in the current invention to generate the fractal image is:
1. Create a stack of random dot images, r_l(x_l, y_l) ∈ {0, 1, ..., 255}; l = 0, 1, ..., L-1; x_l = 0, 1, ..., X_l-1; y_l = 0, 1, ..., Y_l-1. The ratios X_l : X_{l+1} and Y_l : Y_{l+1} are the same, and are consistent with the ratio between pyramid levels of the matcher. It is 2 for the preferred embodiment of the stereo matching method.
2. Each of these images is enlarged by pixel repetition to the same size as the largest one, X_0 x Y_0: r'_l(x, y) = r_l(x_l, y_l) for x = x_l 2^l, x_l 2^l + 1, ..., (x_l + 1) 2^l - 1 and y = y_l 2^l, y_l 2^l + 1, ..., (y_l + 1) 2^l - 1.
3. The average of the r'_l(x, y) is taken as the fractal random dot image, r(x, y) = (1/L) Σ_{l=0}^{L-1} r'_l(x, y).
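The following is a minimal sketch of that construction. The image size, the number of levels L, and the clamping used when a dimension is not an exact multiple of 2^l are our assumptions.

#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

// Builds a fractal random dot image by averaging a stack of random images at
// halved resolutions, each upsampled by pixel repetition.
std::vector<std::uint8_t> makeFractalDotImage(int width = 800, int height = 600, int levels = 5)
{
    std::mt19937 rng(67890);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<double> accum(static_cast<std::size_t>(width) * height, 0.0);

    for (int l = 0; l < levels; ++l) {
        int scale = 1 << l;                               // pixel-repetition factor 2^l
        int w = width / scale, h = height / scale;
        std::vector<std::uint8_t> level(static_cast<std::size_t>(w) * h);
        for (auto& p : level) p = static_cast<std::uint8_t>(dist(rng));

        for (int y = 0; y < height; ++y)                  // enlarge and accumulate
            for (int x = 0; x < width; ++x) {
                int xs = std::min(x / scale, w - 1);
                int ys = std::min(y / scale, h - 1);
                accum[static_cast<std::size_t>(y) * width + x] +=
                    level[static_cast<std::size_t>(ys) * w + xs];
            }
    }
    std::vector<std::uint8_t> result(accum.size());
    for (std::size_t i = 0; i < accum.size(); ++i)        // r(x,y) = (1/L) * sum of levels
        result[i] = static_cast<std::uint8_t>(accum[i] / levels);
    return result;
}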
CAMERA APPARATUS

Suitable cameras for the preferred embodiment with two cameras are: left camera, Sony XC77-CE; right camera, JVC FY55-RGB.
When multiple cameras are used, conventional cameras of this type are less suitable. In addition, the need for many cameras calls for a cheaper system with some additional capability, in particular the ability for a subset of the cameras to capture images simultaneously. The preferred embodiment for multi-camera systems therefore uses cameras with the following functionality:
1. Progressive scan.
2. Ability to fire asynchronously, to enable multiple cameras to fire simultaneously.
3. Ability to store images in local memory, for progressive download over serial (USB) bus.
4. USB connection supplying both power and control.
The preferred embodiment uses a WL 5850 CMOS sensor and a Hitachi 3048G processor, with 16MB of on-board memory. The camera is connected via USB to the host computer, thereby allowing at least 32 cameras to be connected to a single computer, and each camera to be powered over the USB.
THE CALIBRATION OBJECT

It is an objective of the system that accurate camera calibration can be obtained automatically. A simple way to determine camera calibration is to use the position of known points in the world, which are imaged by the camera. If lens distortion is to be modeled accurately, a large number of points whose positions are known with precision is required. The calibration target is designed to provide a source of points the location of which in space is known accurately, and the location of which can be determined accurately in images.
In order for several cameras to be calibrated within the same world co-ordinate framework, it is necessary that some points imaged by a camera are common with at least one other camera. The calibration target is designed in such a way that it can be recognised automatically from any orientation, and several objects can be present in the field of view at any time and each be individually recognised. This enables many cameras to be calibrated together, which is necessary when more than 1 pair (or triple) of cameras is used.
The calibration object is illustrated in Figs. 13 (a), (b) and (c). The preferred embodiment of the calibration object includes two main features. Feature one is a barcode and feature two is a set of targets. The barcode consists of a barcode section and two anchor points on each side of the barcode section. Figure 14 shows the barcode scheme. The set of targets consists of twelve circles 30 lying in two planes 32, 33 in 3-D space. (In other possible embodiments only 8 target circles are used.) The centre of each circle is a "target point". The arrangement of the twelve circles in two planes is such that the calibration object can be recognised by a camera positioned at any position within an angular viewing range of approximately 60 degrees (subtended from the object plane containing the calibration object), said angular viewing range being in a plane perpendicular to the object plane, and centered on an axis extending perpendicularly to the object plane. The barcode comprises 6 digits and therefore handles up to 64 calibration objects. When that number is exceeded, some mechanism needs to be put in place to handle the overflow. It is straightforward to extend from 6 digits to 18 digits (up to 262144 calibration objects) and still maintain backward compatibility by adding one barcode sequence on top of the current one and one below.
The data of the calibration object is O_i = <(id, x, y, z, d) | id ∈ 0, 1, ..., n_Oi>, where d is the diameter of the circular target. This data is built into the program. The data are design data, not measured data. The program works for any calibration object of this configuration up to a scale factor.
The target finder has three tasks to carry out. These are to identify the calibration object (if more than one calibration object is used), as each one is different from the others; to find and measure the target points; and to identify the targets. To identify the calibration object, the barcode anchor points, namely the two concentric circles 35, 36 on either side of the barcode, are identified. From their positions, the portion of the photograph that holds the barcode is outlined and the barcode is read.
To find and measure the target points, the contours of the circles are followed, and their centers are calculated.
To identify the individual targets, the configuration of the photograph target points is matched with projection of calibration object target points. Since at this stage the camera parameters are unknown, a search of the projection is carried out to find one projection that gives the best match. Figure 15 is a flow chart illustrating the various steps carried out in the above process which will now be described in further detail .
IMPLEMENTATION
For each photograph or image I(i, j), i = 0, 1, ..., M-1; j = 0, 1, ..., N-1, the target finder does the following:
CONTOUR FOLLOWING
The purpose of the contour following routine is to find all contours in the image. A contour is defined as a list of closed and connected zero crossings. Closed means that the last zero crossing in the list is the same as the first zero crossing of the list. Connected means that their positions are not more than 2 pixels apart. A zero crossing is defined as a position in an image which has one of the 4 configurations: a) a negative pixel on its left side and positive pixel on its right side; b) a positive pixel on its left side and negative pixel on its right side; c) a negative pixel on its top side and positive pixel on its bottom side; d) a positive pixel on its bottom side and negative pixel on its top side.
(It will be appreciated that the pixel values, which represent pixel brightness, in the various DoG images are real numbers, i.e. can be "positive" or "negative", as referred to in (a) to (d) above.) The contour following process uses the setup shown in Figure 16. The current pixel is d(i,j). A contour enters through one of 4 sides UP, DOWN, LEFT or RIGHT. It exits through one of the other sides, and the adjacent pixel on that side is then tracked. If the contour exits d(i,j) through side s it enters the adjacent pixel through next(s), where the function next implements the following mapping: UP → DOWN, DOWN → UP, LEFT → RIGHT, RIGHT → LEFT. The adjacent pixel to d(i,j) for a given side is determined by the following mapping (adjacent): UP → d(i-1,j), DOWN → d(i+1,j), LEFT → d(i,j-1), RIGHT → d(i,j+1). If any of these pixels lies outside the bounds of the image, then the adjacent pixel is NONE.
1. The input of the routine is an image, I(i,j).
2. The output of this routine is a list of contours, X = <C_l, l = 0, 1, ..., L-1>, where C_l = <(x_n^l, y_n^l, z_n^l), n = 0, ..., L_l-1>, where x and y provide the image position of a zero crossing and z the product of the 2 pixels that define the zero crossing. Note that z is always negative, and the magnitude of z indicates the strength of the zero crossing.
3. It creates a binary registration image R(i,j), i = 0,...,M- 1; j = 0,...,N-1. Each pixel of R, r(i,j), has the value of either processed or unprocessed, and is initialised to unprocessed.
4. It filters the image I with a DoG filter to get Id(σ)(i,j), such that a transition from either black to white or vice versa in I becomes a zero crossing in Id(σ)(i,j), where σ is a positive parameter that controls the shape of the DoG filter. The parameter σ is chosen to be as small as possible, to maximise zero crossing accuracy, while at the same time being large enough to suppress noise in the image. See Figure 4 for an example.
5. For each pixel d(i,j) in Id(σ)(i,j), start to trace if a zero crossing is found.
(a) Set last to NONE. (b) If r(i,j) is processed return failure.
(c) Create a temporary empty contour, C .
(d) Set current_pixel to (i,j).
(e) For each side s ≠ last, if there is a zero-crossing at s then calculate the precise location of the zero crossing by linear interpolation and add it to C.
Set last to next(s) and set the current pixel to adjacent(current_pixel). If the current pixel is NONE return failure. If the current pixel is equal to the starting pixel d(i,j) return success and add C to the set of contours. Otherwise go to 5(e).
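The sketch below illustrates only the zero-crossing test and the sub-pixel linear interpolation used by the contour follower, not the full contour tracing logic. The data layout (a row-major array of DoG values) is an assumption.

#include <vector>

struct ZeroCrossing { double x, y, strength; };   // strength = product of the two pixels (negative)

// Detects zero crossings between horizontally and vertically adjacent pixels
// of a DoG-filtered image of size w x h, with linear interpolation of the
// crossing position.
std::vector<ZeroCrossing> findZeroCrossings(const std::vector<double>& dog, int w, int h)
{
    std::vector<ZeroCrossing> out;
    auto at = [&](int x, int y) { return dog[static_cast<std::size_t>(y) * w + x]; };

    for (int y = 0; y < h; ++y)
        for (int x = 0; x + 1 < w; ++x) {
            double a = at(x, y), b = at(x + 1, y);
            if (a * b < 0.0) {                            // sign change between horizontal neighbours
                double t = a / (a - b);                   // linear interpolation of the crossing
                out.push_back({x + t, static_cast<double>(y), a * b});
            }
        }
    for (int y = 0; y + 1 < h; ++y)
        for (int x = 0; x < w; ++x) {
            double a = at(x, y), b = at(x, y + 1);
            if (a * b < 0.0) {                            // sign change between vertical neighbours
                double t = a / (a - b);
                out.push_back({static_cast<double>(x), y + t, a * b});
            }
        }
    return out;
}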
ELLIPSE COMPUTATION
Given a contour C = <(x_i, y_i, z_i), i = 0, ..., L_C-1>, where (x_i, y_i) is the position of the zero crossing in floating point co-ordinates and z_i is a negative floating point number that indicates the strength of the zero crossing, we can determine an ellipse with the following parameters:
zerox - The average strength of the zero crossings, defined as

zerox = (1/L_C) Σ_{i=0}^{L_C-1} z_i

polarity - Polarity is a value that indirectly describes whether a circle in the image is black on a white background or the other way round. When an image that contains white ellipses on a black background is filtered by a DoG filter, the inside of the contours of zero crossings that correspond to the edges of the ellipses can be either negative or positive. But it is consistent for a DoG filter, say negative. When the DoG filter is negated, it will be positive. Polarity takes binary values, positive-inside or negative-inside, to account for this.

length - Defined as

length = Σ_{n=0}^{L_C-1} |(x_n, y_n) - (x_{(n-1) mod L_C}, y_{(n-1) mod L_C})|

centre - Defined as the length-weighted centroid of the contour,

centre = (1/length) Σ_{n=0}^{L_C-1} (1/2)((x_n, y_n) + (x_{(n-1) mod L_C}, y_{(n-1) mod L_C})) |(x_n, y_n) - (x_{(n-1) mod L_C}, y_{(n-1) mod L_C})|

mean radius - Defined as

meanr = (1/L_C) Σ_{n=0}^{L_C-1} |(x_n, y_n) - centre|

circularity - Defined via the variance of the radius,

variance = (1/L_C) Σ_{n=0}^{L_C-1} (|(x_n, y_n) - centre| - meanr)²,  circularity = meanr / variance
minimum radius, maximum radius, orientation These three parameters are determined by a least squares fit of the data points to an ellipse. This is done using the following process:
1. Centralise the contour: (x_n, y_n) ← (x_n, y_n) - centre, n = 0, ..., L_C-1.
2. We can express the ellipse in the form ax² + 2bxy + cy² = 1, since we assume that the ellipse is centered on the origin. Minimise the expression E = Σ_{n=0}^{L_C-1} (a x_n² + 2b x_n y_n + c y_n² - 1)² by a linear least squares method.
3. In matrix form the ellipse equation is

(x y) P (x y)^T = 1,  P = ( a  b ; b  c )

The matrix P is symmetric and therefore there is a decomposition

T^T Λ T = P

where T = ( c  s ; -s  c ) is a rotation, Λ = diag(λ_1, λ_2) and λ_1 ≥ λ_2.
4. The maximum radius (rmax) is λ_1, the minimum radius (rmin) is λ_2 and the orientation (θ) is arctan(-s/c).
fit - The goodness of fit of the calculated ellipse to the data points, given by the mean squared deviation of the transformed contour points from the unit circle,

fit = (1/L_C) Σ_{n=0}^{L_C-1} (rx_n² + ry_n² - 1)²

where (rx_n, ry_n) = SR((x_n, y_n) - centre), R being the rotation through -θ and S the scaling diag(1/rmax, 1/rmin).
The above parameters are used to filter the contour set . In the preferred implementation rmin, fit and zerox are used. When any of these parameters lies outside specified bounds the contour is ignored.
Using the above technique we have been able to determine the centers of the circular targets on the calibration object to within 0.02 of a pixel for a circle with a radius of 30 pixels in the image .
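A minimal sketch of the conic least-squares fit and the symmetric 2 x 2 decomposition follows. It assumes the contour points have already been centred on the contour centre; the 3 x 3 normal-equation solve by Cramer's rule and the closed-form eigenvalue expressions are implementation choices, not the patent's exact code.

#include <array>
#include <cmath>
#include <vector>

struct EllipseFit { double a, b, c;               // conic a*x^2 + 2b*x*y + c*y^2 = 1
                    double lambda1, lambda2, orientation; };

EllipseFit fitCentredEllipse(const std::vector<std::array<double,2>>& pts)
{
    double M[3][3] = {{0}}, v[3] = {0};
    for (const auto& p : pts) {
        double row[3] = { p[0]*p[0], 2.0*p[0]*p[1], p[1]*p[1] };
        for (int i = 0; i < 3; ++i) {
            v[i] += row[i];                               // target value is 1 for every point
            for (int j = 0; j < 3; ++j) M[i][j] += row[i]*row[j];
        }
    }
    auto det3 = [](double m[3][3]) {
        return m[0][0]*(m[1][1]*m[2][2]-m[1][2]*m[2][1])
             - m[0][1]*(m[1][0]*m[2][2]-m[1][2]*m[2][0])
             + m[0][2]*(m[1][0]*m[2][1]-m[1][1]*m[2][0]);
    };
    double D = det3(M), sol[3];
    for (int k = 0; k < 3; ++k) {                         // Cramer's rule, column k replaced by v
        double Mk[3][3];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j) Mk[i][j] = (j == k) ? v[i] : M[i][j];
        sol[k] = det3(Mk) / D;
    }
    double a = sol[0], b = sol[1], c = sol[2];
    double mean = 0.5*(a + c), diff = 0.5*(a - c);
    double r = std::sqrt(diff*diff + b*b);
    // Eigenvalues and axis orientation of the symmetric matrix P = [[a,b],[b,c]];
    // the text derives rmax, rmin and the orientation from this decomposition.
    return { a, b, c, mean + r, mean - r, 0.5*std::atan2(2.0*b, a - c) };
}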
CONCENTRIC CIRCLE DETECTION AND POINT GROUPING
The purpose of this routine is to organise the points into a barcode anchor point group and a target point group. In the case of an image of multiple calibration objects, a further organisation is carried out to group together the group of barcode anchor points and the group of target points that belong to the same calibration object. The input of this routine is a set of ellipses, E = <E_l>, l = 0, ..., N_E-1, and the output is a set of group pairs, where each pair consists of a group of barcode anchor points and a group of target points. Namely, T = <(G_b^l, G_t^l)>, l = 0, ..., G-1, where G_b^l = <E_n>, n = 0, 1 and G_t^l = <E_n>, n = 0, ..., T_l-1.
1. Determine the concentric circles by determining whether the centers of pairs of ellipses coincide.
2. Divide the concentric circles into groups by pairing them off, since the number of barcode anchor points for each calibration object is 2.
3. Divide the remaining ellipses that are not concentric circles into groups by comparing their distance from the barcode anchor points.
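The concentric-circle test of step 1 can be sketched as below; the centre-coincidence tolerance (here 1.5 pixels) is our assumption.

#include <cmath>
#include <utility>
#include <vector>

struct Ellipse { double cx, cy, meanRadius; };

// Returns index pairs of ellipses whose centres coincide to within 'tolerance',
// i.e. candidate concentric pairs for the barcode anchor points.
std::vector<std::pair<int,int>> findConcentricPairs(const std::vector<Ellipse>& es,
                                                    double tolerance = 1.5)
{
    std::vector<std::pair<int,int>> pairs;
    for (std::size_t i = 0; i < es.size(); ++i)
        for (std::size_t j = i + 1; j < es.size(); ++j) {
            double dx = es[i].cx - es[j].cx, dy = es[i].cy - es[j].cy;
            if (std::sqrt(dx*dx + dy*dy) < tolerance)     // centres (nearly) coincide
                pairs.emplace_back(static_cast<int>(i), static_cast<int>(j));
        }
    return pairs;
}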
POINT CONFIGURATION MATCHING
The purpose of this routine is to match the points of the calibration object, O_i = <(id, x, y, z, d) | id ∈ 0, 1, ..., n_Oi>, with the image points, namely G = <G_b, G_t>, where G_b = <E_n>, n = 0, 1 and G_t = <E_n>, n = 0, ..., T-1. When done, every point in the image has an id assigned.
1. Pre-processing image points
(a) Barcode points: i. Get the point p_b^i = G_b.E_i.centre, i = 0, 1. ii. Negate the y-axis: p_b^i.y = -p_b^i.y, i = 0, 1.
(b) Target points: i. p_t^i = G_t.E_i.centre, i = 0, ..., T-1. ii. Negate the y-axis: p_t^i.y = -p_t^i.y, i = 0, ..., T-1.
2. Pre-processing calibration object points
(a) Centre the calibration object points. Let the points of calibration object o be p^0, ..., p^{n_o - 1}. Then:

center = (1/n_o) Σ_{i=0}^{n_o - 1} p^i,  (p_x^i, p_y^i, p_z^i) ← (p_x^i, p_y^i, p_z^i) - center

(b) Extract the target points and the anchor points. The anchor points are those with concentric circles.
3. Find the best match between the calibration object target points and the image target points.
(a) Find the best slant angle σ of the calibration object (the slant angle σ is illustrated by Figure 17(a)).
(b) Find the best tilt angle τ (illustrated in Figure 17(b)) of the calibration object. This is done via an exhaustive search with respect to the anchor points and the tilt angle. For the two possible orderings of the anchor points, and for a number of angles that span the range 0 to 2π, a match is made between the orthogonal projection in the z-axis of the slant- and tilt-rotated calibration object points and the affine transformed image target points, where the affine transform is determined by the mapping between the rotated anchor points and the image barcode points. The combination of the tilt angle and the ordering of anchor points that gives the best match is chosen as the true value. To avoid occlusion problems, the calibration object points that lie behind other points in the orthogonal projection are detected and ignored. The diameter of a circle in the calibration object is used to do the detection. After rotation, an occluded point is found when the projection of the point lies inside the radius of another point that is closer to the image plane. To do a match for a particular tilt angle and ordering of anchor points, the following is carried out.
(i) Rotate the anchor and target points. Ignore occluded points, (ii) Determine the affine transform, and carry out the transform on the image target points.
(iii) Match the transformed calibration object target points and the image target points using a least squares estimator. Let the k-th image point be t_k^i and the k-th transformed object point be t_k^o; the error estimate is then

E = Σ_k |t_k^i - t_k^o|²
4. Assign ids to image target points
(a) Accept the best match.
(b) Assign to each image target point the id of a calibration object target point that is closest.
(c) When an image target point has more than one id, those ids whose associated calibration object points are further away are ignored. (d) Assign to each image barcode point the id of the calibration object anchor point that is closest.
This is to establish the direction of barcode reading.
BARCODE READING
The purpose of this routine is to find a parallelogram in the image that contains the barcode. After the ids of the image points are identified, the geometrical relations between points are identified. For instance, the line passing through point 5 and point 9 and the line passing through point 6 and point 10 approximate the direction of the vertical sides of a parallelogram. The two barcode points approximate the direction of the horizontal sides of the parallelogram. The actual positions are approximated through the known positions of barcode points and known sizes of barcode point circles. The parallelogram is represented as 4 points (t_l, t_r, b_l, b_r): the image pixel locations of the top left corner, the top right corner, the bottom left corner and the bottom right corner of the parallelogram.
The purpose of this routine is to read the 6 digit barcode of a calibration object. It reads evenly spaced lines parallel to the horizontal sides of the parallelogram, starting from the left vertical side to the right vertical side of the parallelogram. Then it combines the readings and converts them into an integer. Figure 18 shows an example of reading the number 4 in operation. The operation of the reader is as follows:
1. Sample 225 pixel values along five lines in the parallelogram.
2. Determine a threshold between black and white on the barcode by histogramming, and threshold these lines.
3. Determine edges along the 5 sample lines, and convert to a binary number.
4. Convert to a decimal barcode, and determine the object identity.
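A very rough sketch of the sampling and thresholding of one such line is given below. The sample count, the nearest-pixel sampling, and the choice of threshold (midpoint of the observed grey range) are all assumptions and do not reproduce the patent's exact barcode decoding scheme.

#include <algorithm>
#include <cstdint>
#include <vector>

// Samples grey values at evenly spaced positions along the line (x0,y0)-(x1,y1)
// and thresholds them into 1 = dark (bar) / 0 = light (space).
std::vector<int> sampleAndThresholdLine(const std::vector<std::uint8_t>& image, int w, int h,
                                        double x0, double y0, double x1, double y1,
                                        int samples = 45)
{
    std::vector<std::uint8_t> grey(samples);
    for (int s = 0; s < samples; ++s) {
        double t = static_cast<double>(s) / (samples - 1);
        int x = static_cast<int>(x0 + t * (x1 - x0) + 0.5);   // nearest-pixel sampling
        int y = static_cast<int>(y0 + t * (y1 - y0) + 0.5);
        x = std::clamp(x, 0, w - 1);
        y = std::clamp(y, 0, h - 1);
        grey[s] = image[static_cast<std::size_t>(y) * w + x];
    }
    auto [mn, mx] = std::minmax_element(grey.begin(), grey.end());
    int threshold = (static_cast<int>(*mn) + static_cast<int>(*mx)) / 2;

    std::vector<int> bits(samples);
    for (int s = 0; s < samples; ++s)
        bits[s] = grey[s] < threshold ? 1 : 0;
    return bits;
}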
CALIBRATION
Calibration is achieved using a novel version of the colinearity constraint as described by C.C. Slama in "Manual of Photogrammetry", fourth edition, American Society of Photogrammetry and Remote Sensing, Virginia, 1980. This allows a large number of accurately measured points to be used to determine the orientation and internal parameters of the cameras.
In the preferred implementation calibration proceeds as follows :
1. A number of image pairs (or triples if 3 cameras are used) of the calibration object are captured. To determine the cameras' external orientation parameters, and the internal parameters excluding lens distortion, 1 pair (or triple) of images is sufficient. To accurately model lens distortion a larger number of images, up to 30, is required.
2. The location and position of the centers of the target circles on the calibration object are determined for each image .
3. An initial estimate of the position and orientation parameters for the calibration objects, and the internal and external orientation parameters of the cameras, is obtained using a version of the DLT algorithm.
4. By use of a modified version of the colinearity constraint, described below, a precise estimate of the parameters is obtained from the initial estimate, using a non-linear least squares method.
THE MODIFIED COLINEARITY CONSTRAINT
The colinearity constraint in our preferred implementation is now described. The basic equation is:
(x, y, z)^T - (t_x, t_y, t_z)^T = λ M (u + Δu - u_c, v + Δv - v_c, -f)^T
where M is the rotation matrix of the camera w.r.t. the default world coordinate frame; (tx,ty,t2)τ is the location of the camera's principal point; (x,y,z)τ is the location of a point in the world (w.r.t. the default co-ordinate frame); (u,v)τ is the location in the image plane, w.r.t. the camera coordinate system, of the image of (x,y,z)τ; Δu and Δv are corrections to the values of u and v due to lens distortion and the distortion due to the observed center of a calibration circle in the image not being the image of the circle center in space; f is the focal length of the camera, and (uc,vc)τ is the location, in camera space, of the principal point.
This equation differs somewhat from the standard equation in that the location of the world co-ordinate is on the LHS of the equation. Our preferred implementation extends this equation to take account of the fact that we are taking many images of the calibration object at different orientations, so that all the points on a given pair (or triple) of images of the calibration object share a common reference frame which is not the default world frame. The revised equation is:
M^o (x, y, z)^T + T^o - (t_x, t_y, t_z)^T = λ M (u + Δu - u_c, v + Δv - v_c, -f)^T
where M° and T° are the rotation matrix and translation for the object reference frame.
This equation is converted into a constraint by dividing through by the z-component to eliminate λ. This gives us:
(m^o_00 x + m^o_01 y + m^o_02 z + t^o_x - t_x) / (m^o_20 x + m^o_21 y + m^o_22 z + t^o_z - t_z) = (m_00(u + Δu - u_c) + m_01(v + Δv - v_c) - m_02 f) / (m_20(u + Δu - u_c) + m_21(v + Δv - v_c) - m_22 f)

(m^o_10 x + m^o_11 y + m^o_12 z + t^o_y - t_y) / (m^o_20 x + m^o_21 y + m^o_22 z + t^o_z - t_z) = (m_10(u + Δu - u_c) + m_11(v + Δv - v_c) - m_12 f) / (m_20(u + Δu - u_c) + m_21(v + Δv - v_c) - m_22 f)
The parameters in these equations, the values of which are to be determined are :
(r^o_x, r^o_y, r^o_z) The Eulerian rotation angles for the calibration object w.r.t. the default world co-ordinate frame.
(t^o_x, t^o_y, t^o_z) The translation for the calibration object w.r.t. the default world co-ordinate frame.
(r_x, r_y, r_z) The Eulerian rotation angles for the camera w.r.t. the default world co-ordinate frame.
(tx, ty, tz) The translation for the camera w.r.t. the default world co-ordinate frame.
(uc, vc, f) The principal point of the camera w.r.t. the camera frame. uc and vc are in pixel co-ordinates. The translation from camera space to pixel coordinates involves:
Xp The size of an image pixel along the x-axis.
s The ratio between an x-pixel and y-pixel.
(k1,k2,k3,k4,k5) Distortion parameters for the camera.
r The radius of the target circle, the center of which is at (x, y, z)^T. This is necessary to calculate the ellipse distortion.
It should be noted that the expressions Δu and Δv are functions of the distortion parameters, the rotation and translation parameters for both the object and the camera, and the radius of the target .
Given a pair of cameras and multiple image pairs of a calibration object captured from the object, calibration proceeds in two steps.
1. An initial estimate of the camera parameters, both internal and external, together with transformation parameters for the positions of the calibration object in the various image pairs, is derived.
2. Given this initial estimate a non-linear iterative least squares method is used to refine the estimate and achieve precise calibration. At this stage lens distortion and ellipse correction parameters are derived, following the model described above.
The initial estimate of camera and target parameters is derived using the Direct Linear Transform (DLT), the preferred implementation of which is described below.
INITIAL PARAMETER ESTIMATION USING THE DLT
The DLT provides an estimate of camera orientation and internal parameters w.r.t. a set of world points. To achieve an estimate of the orientation of multiple cameras and the calibration target in various orientations requires several applications of the DLT. The algorithm for initial parameter estimation takes as input a set of image pairs (triples) of the calibration target with target points identified and located in each image. The DLT implementation, given a set of points, provides an estimate of the camera parameters relative to the co-ordinate frame of the calibration object. The internal parameters of the camera are of course independent of the co-ordinate frame. A C++ pseudo-code version of the algorithm is given below:
void initial_estimate (set<imagepoints> images, set<cameras> cameras) {
    <set all cameras to undone>
    <set all images to undone>
    <set first image's translation and rotation to 0>
    <set first image as done>
    while <exists camera that is not done> {
        <foreach done image i> {
            if <exists undone camera c for i> {
                <set camera c from i using the DLT>
            }
        }
        <foreach done camera c> {
            <foreach undone image i for c> {
                <calculate the DLT for camera c' from i>            (*)
                <apply the inverse of this transform to determine the orientation of i>
                <set i done>
            }
        }
    }
}
The step in this algorithm labelled with an asterisk determines the orientation of an object with respect to a camera with known orientation. Let a camera c have external orientation M, expressed as a homogeneous 4 x 4 matrix, which includes both the rotation and translation parameters. If the DLT is run and determines a matrix M_0 for the camera relative to the object in the default world frame, then the orientation for the object relative to camera c will be M M_0^(-1). From this matrix it is possible to recover rotation and translation parameters. On termination of this algorithm, we have an initial estimate for the internal and external orientation parameters of both (all three) cameras and the orientation and location of the co-ordinate system for each image object.
REFINEMENT OF THE ESTIMATE USING NON-LINEAR LEAST SQUARES
Given an initial estimate of parameters obtained by the above algorithm, a least squares refinement process is used to provide a more accurate estimate of these parameters, of the lens distortion parameters for each camera, and of the elliptical correction for each object point in each image. The least squares problem is set up as follows: 1. For each point on each image of the calibration object, two constraints can be created, as detailed above in the description of the modified co-linearity constraint.
2. For each point on the calibration object three constraints can be applied specifying the x, y, and z location of the points w.r.t. the default world coordinate frame.
The total number of parameters to be solved for is: 15 n_c + 6(n_i - 1) + 12 x 3, where n_c is the number of cameras, n_i is the number of images of the calibration object taken by the cameras (at least 1 for each camera), and 12 is the number of target points on the calibration object. The parameters for each camera break down as: 6 for external orientation, 3 for internal orientation, 1 to specify the ratio between pixel sizes along the x and y dimensions, and 5 for lens distortion.
The number of constraints available is 2 Σ_i p_i + 3 x 12, where p_i is the number of circular targets detected in the i-th image, which is at most 12.
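By way of a purely illustrative example (these numbers are assumptions, not taken from the specification): with n_c = 2 cameras and n_i = 10 images there are 15·2 + 6·(10 - 1) + 36 = 120 parameters, while, if all 12 targets are detected in every image, there are 2·(12·10) + 36 = 276 constraints, so the system is comfortably over-determined.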
In practice it is often satisfactory to use only one lens distortion parameter, in which case one image from each of two cameras is sufficient.
The Levenberg-Marquardt algorithm (described later) is used to solve this problem. In our preferred implementation an algorithm such as that proposed by J J More (in G A Watson, Lecture Notes in Mathematics 630, pages 105-116, Berlin, 1978, published by Springer-Verlag) or J E Dennis, D M Gay and R E Welsch (Algorithm 573 NL2SOL: An adaptive non-linear least-squares algorithm, ACM Transactions on Mathematical Software, 7:369-383, 1981) should be used which can approximate the Jacobian. The reason for this is that partial derivatives are hard to calculate for the ellipse correction, which depends on several other parameters. No degradation in the performance of the method when using a finite differencing approximation to the Jacobian has been observed.
THE PREFERRED EMBODIMENT OF THE DLT ALGORITHM
Photogrammetry in our system requires the solution of nonlinear least squares equations. A major problem is the determination of a starting position for the iterative solution procedure.
A basic photogrammetric problem is to determine a camera's internal and external parameters given a set of known world points which are imaged in the camera. The standard method used by C.C. Slama (referenced above) is to use the colinearity constraint to set up a series of non-linear equations which, given a suitable starting point, can be iterated to get an accurate solution for the camera's internal and external orientation parameters.
To determine a suitable starting point, the Direct Linear Transform (DLT) introduced by Y F Abdel-Aziz and H M Karara ("Direct linear transformation from comparator co-ordinates into object co-ordinates in close range photogrammetry", Proceedings of the ASP Symposium on Close-Range Photogrammetry, pages 1-18, Illinois, 1971) is the preferred method used in the present invention. This method is now described.
The method uses a modified pinhole model for cameras, the modification being that some allowance is made for lens distortion in the system. Lens distortion is not used in the DLT, although versions have been developed that make allowance for it. The basic equation relating the image co-ordinates to the world position of the camera is

λ ( u x_p + p_x, v s x_p + p_y, -p_z )^T = R (X, Y, Z)^T + (t_x, t_y, t_z)^T
where the components r_ij of R form a rotation matrix, representing the rotation of the camera. Our system uses a right handed co-ordinate frame, with the camera by default pointing down the -ve z axis. A rotation vector (ω, ρ, κ) represents a rotation of ω radians about the x-axis, followed by a rotation of ρ radians about the y-axis and finally a rotation of κ radians about the z-axis. The matrix for this rotation is the product of the three elementary rotations about the x-, y- and z-axes, applied in that order.
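For illustration, one way of building such a rotation matrix is sketched below. It assumes the three elementary rotations are taken about the fixed x, y and z axes and composed as R = Rz·Ry·Rx; the exact sign and ordering convention of the patent's own matrix may differ.

#include <array>
#include <cmath>

using Mat3 = std::array<std::array<double,3>,3>;

static Mat3 mul(const Mat3& A, const Mat3& B)
{
    Mat3 C{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k) C[i][j] += A[i][k] * B[k][j];
    return C;
}

Mat3 rotationFromAngles(double omega, double rho, double kappa)
{
    double cw = std::cos(omega), sw = std::sin(omega);
    double cp = std::cos(rho),   sp = std::sin(rho);
    double ck = std::cos(kappa), sk = std::sin(kappa);
    Mat3 Rx = {{{1,0,0},{0,cw,-sw},{0,sw,cw}}};       // rotation about x by omega
    Mat3 Ry = {{{cp,0,sp},{0,1,0},{-sp,0,cp}}};       // rotation about y by rho
    Mat3 Rz = {{{ck,-sk,0},{sk,ck,0},{0,0,1}}};       // rotation about z by kappa
    return mul(Rz, mul(Ry, Rx));                      // assumed composition order
}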
The camera internal parameters are: x_p, the size of a pixel in the x axis; s, the ratio between a pixel in the x and y axes; (p_x, p_y, p_z), the location in camera space of the principal point; and (u, v), the location in image space of an observed point (note: in our system the images have the (0,0) co-ordinate at the top left, so the y co-ordinate is reversed).
We now simplify this equation by substitution to obtain two equations, as follows. The first step is to divide the third line into the first two in order to eliminate λ, and to simplify slightly.
a_1 + a_2 u + f (r[1,1]X + r[1,2]Y + r[1,3]Z + t[1]) / U = 0
a_3 + a_4 v + f (r[2,1]X + r[2,2]Y + r[2,3]Z + t[2]) / U = 0

Substitution: a_1 ← p_x, a_2 ← x_p, a_3 ← p_y, a_4 ← s x_p, U ← r_20 X + r_21 Y + r_22 Z + t_z
We now eliminate pz (the focal length)
a'_1 + a'_2 u + (r_00 X + r_01 Y + r_02 Z + t_x) / U = 0
a'_3 + a'_4 v + (r_10 X + r_11 Y + r_12 Z + t_y) / U = 0

Substitution: a'_1 ← a_1/p_z, a'_2 ← a_2/p_z, a'_3 ← a_3/p_z, a'_4 ← a_4/p_z
Next we divide top and bottom by tz:
a'_1 + a'_2 u + (r'_00 X + r'_01 Y + r'_02 Z + t'_x) / U' = 0
a'_3 + a'_4 v + (r'_10 X + r'_11 Y + r'_12 Z + t'_y) / U' = 0

Substitution: r'_ij ← r_ij / t_z, t'_i ← t_i / t_z, U' ← r'_20 X + r'_21 Y + r'_22 Z + 1
Finally, we remove the a'_i, and then make a final substitution to get a linear form for the DLT:
l_1 X + l_2 Y + l_3 Z + l_4 + l_9 u X + l_10 u Y + l_11 u Z = -u
l_5 X + l_6 Y + l_7 Z + l_8 + l_9 v X + l_10 v Y + l_11 v Z = -v

Substitution 1: r''_0j ← (r'_0j + a'_1 r'_2j)/a'_2, t''_x ← (t'_x + a'_1)/a'_2, and similarly r''_1j ← (r'_1j + a'_3 r'_2j)/a'_4, t''_y ← (t'_y + a'_3)/a'_4.
Substitution 2: l_1 ← r''_00, l_2 ← r''_01, l_3 ← r''_02, l_4 ← t''_x, l_5 ← r''_10, l_6 ← r''_11, l_7 ← r''_12, l_8 ← t''_y, l_9 ← r'_20, l_10 ← r'_21, l_11 ← r'_22.
We end up with two linear equations in eleven unknowns. A minimum of five points is required to provide a solution, and the points cannot be colinear. This point is discussed in slightly more detail below. Given the l_i it is possible to recover the basic camera parameters, as follows:
t_z² = 1 / (l_9² + l_10² + l_11²)
p_x = -x_p t_z² (l_1 l_9 + l_2 l_10 + l_3 l_11)
p_y/s = x_p t_z² (l_5 l_9 + l_6 l_10 + l_7 l_11)

The pixel ratio s and the focal distance p_z are recovered from the remaining l_i in the same way.

The angles can be recovered using the arcsin function using the following relations:

sin ρ = -x_p t_z ((p_x l_11 / x_p - l_3) / p_z)
r_01 = t_z (p_x l_10 - x_p l_2) / p_z
cos κ = r_00 / cos ρ,  sin κ = r_01 / cos ρ,  sin ω = r_12 / cos ρ,  cos ω = r_11 / cos ρ

where r_00, r_11 and r_12 are obtained from the corresponding l_i analogously to r_01.
ELLIPSE CORRECTION
The image of a circle in a camera is an ellipse. The center of the ellipse in the image plane is not necessarily coincident with the image of the circle's center, due to perspective distortion. This effect is shown, in exaggerated form for two dimensions ("2-D"), in Figure 19. The camera has principal point P. Its focal plane is AC and the line AB is the cross section of an ellipse. O1 is the center of the projection of the line on the image plane, and O2 is the projection of the center of the ellipse O. We now derive an expression for the difference between the observed center of the ellipse in an image, and the projected center of the circle. This correction is used by our system in its photogrammetric calculations.
Without loss of generality we assume that the ellipse is of unit radius, in the x-y plane, with center at (0,0,0). This situation can always be obtained by a suitable rigid body transformation of the camera and scaling of the focal length of the camera. This derivation explains the principle behind the specific calculations carried out in the system.
The ellipse is expressed parametrically in homogeneous co-ordinates as e_t = (sin t, cos t, 0, 1)^T.
The camera has a rotation R and translation T w.r.t. the co-ordinate frame of the ellipse. We represent this rigid body transform with a 4 x 4 homogeneous matrix M:

M = ( R  T ; 0  1 )

where R is the 3 x 3 rotation and T the translation column vector.
In calculating the projection of the circle and its center onto the image plane, we are interested only in the distance between the two (in meters). The projection matrix therefore needs only consider the focal distance of the lens, and not the principal point. We also ignore the effects of lens distortion, which over the distances being considered have only a second-order effect. We represent the focal distance by f = r/p_z, where p_z is the distance from the projective center of the camera to the focal plane, and r is the radius of the circle. The projection matrix P is

P = ( 1 0 0 0 ; 0 1 0 0 ; 0 0 f 0 )
We can now derive a parametric expression for the ellipse and its center c^p in the image plane. The center is c^p = P M (0, 0, 0, 1)^T and the ellipse is e^p_t = P M e_t. Determination of the center of the ellipse in the image is done by locating the two points of the ellipse tangent to lines parallel to the y-axis. The mid-point of the line connecting these two points is the center of gravity of the ellipse. These points are located by:
1. Differentiating the ellipse w.r.t. t, which gives (e_x, e_y, e_w)^T = de^p/dt, and is algebraic in both sin t and cos t.
2. In order to avoid a transcendental equation, we make the substitution u ← sin t, √(1 - u²) ← cos t.
3. We solve the equation e_x/e_w = 0 for u, which is quadratic in u and gives (in general) two real solutions for the tangent points.
The ellipse equation in the plane in homogeneous co-ordinates is:

e^p_t = ( t_x + r_1 cos t + r_0 sin t,  t_y + r_3 cos t + r_2 sin t,  f t_z + f r_5 cos t + f r_4 sin t )^T
Differentiating the ellipse and performing the change of variables u ← sin t, √(1 - u²) ← cos t expresses de^p(x), the affine value of the x-co-ordinate of de^p, as a ratio of two expressions that are algebraic in u. To determine the zeros of this expression w.r.t. u we need only consider the zeros of the top half of the fraction. If we make the substitution a ← r_0 r_5 - r_1 r_4, b ← r_0 t_z - r_4 t_x, c ← r_5 t_x - r_1 t_z we have an equation of the form a + bu + c√(1 - u²) = 0, which has solution:
u = [ -2ab ± 2√( a²b² - (a² - c²)(b² + c²) ) ] / ( 2(b² + c²) )
Given the values for u we can back-substitute in the equation for e^p to determine the extremal points p_0 and p_1. The center of the ellipse is (p_0 + p_1)/2.
THE PREFERRED EMBODIMENT OF THE LEVENBERG-MARQUARDT ALGORITHM

Photogrammetry in our system requires the solution of non-linear least squares equations. We use a variant of the Levenberg-Marquardt algorithm. The non-linear least-squares problem is to find a vector of parameter values, a, to minimize the sum of squares of differences between a given vector, y, of observed values and a vector of fitted values f(a; x), where the values of x and the function f are known.

When we are doing bundle adjustment the values y_i are the observations of the image co-ordinates of known world points, the known parameters x_i are the positions of the known world points, and the parameters a_i are the internal and external orientation parameters of the camera(s). We assume that the errors in observations are all normally distributed. We discuss how the variance of different measurements can be incorporated below.
We start with an estimate a_0 of a, which will be iteratively refined. Initially we have the relationship:

y = f(a_0; x) + e_0

We assume that a = (a_0, ..., a_m)^T and that y = (y_0, ..., y_n)^T. We write the i-th equation as y_i = f(a; x)_i + e_0(i). We wish to find an a that minimises the 2-norm of e. We assume that f can be approximated locally by first derivatives. Let J be the Jacobian, with entries J_ij = ∂f_i/∂a_j. A better approximation for a is then a_1 = a_0 + Δ, where Δ is chosen to minimise the 2-norm of y - f(a_0; x) - JΔ. The value of Δ can be determined by the method of normal equations, which evaluates the pseudo-inverse of J. We have:

Δ = (J^T J)^(-1) J^T e_0
When we have some idea of the errors associated with individual measurements, we can incorporate a weight matrix into this equation
Δ = (J^T W J)^(-1) J^T W e_0
where W_ij ∝ 1/σ²_ij and σ²_ij is the covariance between the i-th and j-th measurements. In practice this matrix is usually diagonal, representing independent measurements. Once Δ is calculated we iterate until we are satisfied there is a solution. There are many choices of termination criterion (see Slama, referred to above). The advantage of this Newton-like method is that when the function f is reasonably well behaved (and models reality), and the initial approximation a_0 is close to the correct value, we get quadratic convergence. The problem with the Newton-like approach is that when the function is not well approximated by the Jacobian the solution will not be found. The Levenberg-Marquardt algorithm is a variation of the Newton method which improves stability by varying between the Newton step and a direct descent method. The Levenberg-Marquardt algorithm uses the following formula to determine Δ.
Δ = (J^T W J + λD)^(-1) J^T W e_0,  D = diag(J^T W J)
At the first iteration λ is set to a relatively large number (say 1) . If the first iteration reduces e then λ is reduced by some factor (say 5-10) , otherwise λ is increased by some factor and we keep trying until e is reduced.
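The toy sketch below illustrates only the damped update and the λ adaptation loop, for a single-parameter model y = exp(a·x) chosen purely for illustration (it is not the bundle-adjustment problem, and W is taken as the identity), so J^T J and D collapse to scalars.

#include <cmath>
#include <vector>

double fitExponentLM(const std::vector<double>& xs, const std::vector<double>& ys,
                     double a = 0.0, int maxIter = 50)
{
    double lambda = 1.0;                                  // start with a relatively large damping
    auto sumSq = [&](double p) {
        double s = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double e = ys[i] - std::exp(p * xs[i]);
            s += e * e;
        }
        return s;
    };
    double err = sumSq(a);
    for (int it = 0; it < maxIter; ++it) {
        double jtj = 0.0, jte = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            double f = std::exp(a * xs[i]);
            double J = xs[i] * f;                         // df/da
            double e = ys[i] - f;
            jtj += J * J;
            jte += J * e;
        }
        double delta = jte / (jtj + lambda * jtj);        // (J^T J + lambda*D)^(-1) J^T e, D = diag(J^T J)
        double trialErr = sumSq(a + delta);
        if (trialErr < err) { a += delta; err = trialErr; lambda /= 8.0; }  // accept, reduce damping
        else                { lambda *= 8.0; }                              // reject, increase damping
    }
    return a;
}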
The Levenberg-Marquardt method converges upon a stationary point (or subspace if the system is degenerate), which could be the global minimum we want, a local minimum, a saddle point, or one or more of the parameters could diverge to infinity.
The standard references are K Levenberg, "A Method for the Solution of Certain Non-linear Problems in Least Squares" (Quarterly of Applied Mathematics, 2:164-168, 1944) and D W Marquardt (Journal of the Society for Industrial and Applied Mathematics, 11:431-441, 1963). Robust public domain implementations are available. The preferred implementations used in our system are the Linpack version as described by More (see above reference) and the implementation described by Dennis et al (see above reference), which is somewhat more robust.
STEREO MATCHING
The purpose of stereo matching is to construct a disparity map from a pair of digitised images, obtained simultaneously from a pair of cameras. By a disparity map we mean a two dimensional array which specifies, for each pixel p_l in the left image of the pair, the distance to a corresponding point p_r in the right image (i.e. the stereo image disparity). By a point we mean a pair of real valued co-ordinates in the frame of reference of the right image. By a corresponding point we mean the point in the right image at which the scene component imaged by p_l in the left camera will most probably appear when imaged by the right camera.
The present technique uses five main data structures each of which is an image in the form of a two dimensional array of real numbers:
1. The left image L, derived from the left camera 3, in which the real numbers represent brightness values.
2. The right image R, derived from the right camera 4, in which the real numbers represent brightness values.
3. The horizontal displacement image H. This specifies for each pixel (x, y) in the left image the displacement from x at which the corresponding point occurs in the right image.
4. The vertical displacement image V, with similar properties.
5. The confidence image C, which specifies the degree of confidence with which the disparity is held.
At any stage in the process the best available estimate of the disparity map is provided by H and V. Thus to pixel L_{x,y} there corresponds a point R_{x+H(x,y), y+V(x,y)}. This will in general have real valued indices.
FORMING THE SCALE PYRAMID
The images L and R are available to the algorithm at multiple scales, with each scale being the result of a decimation and filtering process as follows. An image pyramid is formed for each of L and R, with n+1 levels, labelled 0, 1, ..., n. If P_i is an image in the pyramid at scale i and P_{i-1} is an image at the previous, larger, scale (which will be the next level down in the pyramid), by which we mean that if δ(P_i) is the length of the diagonal of the image at scale i then δ(P_i) = f·δ(P_{i-1}) for some f in the range 0 < f < 1, then

P_i = scale(G_{i+1}, 1/f) - G_i

where G_i (0 ≤ i < n+1) is a Gaussian convolution defined by:

G_0 = convolve(I), where I is either L or R, at the largest scale
G_{i+1} = scale(convolve(G_i), f) for all other scales.
Convolve is a function returning a Gaussian convolution of an image; scale (I, f) is a function that scales an image I by the real number f . In the present described embodiment the scale factor is 0.5, such that each next level in the pyramid is scaled by a factor of 0.5 relative to the level below it (in each linear dimension i.e. x and y dimensions) .
If we define a convolution of an image I at point (x, y) by an n x n matrix K to be conv(I, x, y, K), where

conv(I, x, y, K) = Σ_{i=0}^{n-1} Σ_{j=0}^{n-1} I_{x-m+i, y-m+j} K_{i,j},  where m = (n-1)/2,

then for the Gaussian convolution above the kernel is given by K_{i,j} = ω(i)ω(j), where

ω(i) = ω'(i) / [ Σ_{j=-m}^{m} ω'(j) ]  and  ω'(j) = (1/N) Σ_{u=0}^{N-1} g(j - 1/2 + u/(N-1); σ),

N is the number of sample steps used to approximate the value of the Gaussian distribution for any given point in the weight vector ω', and g(·; σ) is the Gaussian density with standard deviation σ.
In practice the scaling and convolution are applied at the same time using separable horizontal and vertical convolution kernels of size 5. Thus each image P_i contains only information at a characteristic frequency, all higher and lower frequencies having been removed by the filtering process.
The present stereo matching technique uses a scale pyramid formed by a difference of Gaussian (DoG) convolutions as immediately above-described. Fig. 20 is a flow diagram illustrating, in principle, the method used to obtain the DoG pyramid. Fig. 20 shows how each input image is used to obtain the corresponding image at each level in the pyramid.
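The following is a simplified sketch of such a pyramid construction: each Gaussian level is a blurred, half-size copy of the previous one, and each DoG level is the upsampled next Gaussian level minus the current one. The fixed 5-tap kernel weights and the nearest-neighbour upsampling are simplifications, not the patent's exact separable kernels.

#include <vector>

struct Image { int w, h; std::vector<float> px; };

static float at(const Image& im, int x, int y)
{
    x = x < 0 ? 0 : (x >= im.w ? im.w - 1 : x);           // clamp at the borders
    y = y < 0 ? 0 : (y >= im.h ? im.h - 1 : y);
    return im.px[static_cast<std::size_t>(y) * im.w + x];
}

static Image blurAndHalve(const Image& in)
{
    static const float k[5] = {1/16.f, 4/16.f, 6/16.f, 4/16.f, 1/16.f};
    Image tmp{in.w, in.h, std::vector<float>(in.px.size())};
    for (int y = 0; y < in.h; ++y)                        // horizontal pass
        for (int x = 0; x < in.w; ++x) {
            float s = 0;
            for (int i = -2; i <= 2; ++i) s += k[i+2] * at(in, x+i, y);
            tmp.px[static_cast<std::size_t>(y) * in.w + x] = s;
        }
    Image out{in.w / 2, in.h / 2,
              std::vector<float>(static_cast<std::size_t>(in.w/2) * (in.h/2))};
    for (int y = 0; y < out.h; ++y)                       // vertical pass + decimation by 2
        for (int x = 0; x < out.w; ++x) {
            float s = 0;
            for (int i = -2; i <= 2; ++i) s += k[i+2] * at(tmp, 2*x, 2*y+i);
            out.px[static_cast<std::size_t>(y) * out.w + x] = s;
        }
    return out;
}

std::vector<Image> buildDoGPyramid(const Image& input, int levels)
{
    std::vector<Image> G{input};                          // G0: the (already convolved) input
    for (int i = 1; i <= levels; ++i) G.push_back(blurAndHalve(G.back()));

    std::vector<Image> dog;
    for (int i = 0; i < levels; ++i) {                    // P_i = scale(G_{i+1}, 2) - G_i
        Image d{G[i].w, G[i].h, std::vector<float>(G[i].px.size())};
        for (int y = 0; y < d.h; ++y)
            for (int x = 0; x < d.w; ++x)
                d.px[static_cast<std::size_t>(y) * d.w + x] =
                    at(G[i+1], x/2, y/2) - at(G[i], x, y);
        dog.push_back(d);
    }
    return dog;
}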
A matching process is run on each successive scale starting with the coarsest (i.e. top-most level of the pyramid), as will be described. In moving to successively greater levels of detail (moving "down" the pyramid) , the initial estimates of disparity for each level are provided from the results of matching the previous level. A flowchart for this process is provided in figure 21. The benefits of using a scale DoG pyramid are :
• It is easier to find the globally optimum disparity estimate, rather than local optima. • The inter-scale relationship provides guarantees about the amount of search that must be done at a lower scale, given an estimate at a higher scale, and assuming that this estimate is accurate.
• The use of a DoG filter provides immunity from illumination changes across the two views of the scene.
• No user intervention to determine an initial set of matching points or regions is required.
Fig. 22 illustrates the key stages of the "matching" process carried out at any one scale of the scale pyramid and the sequence of these stages. It will be observed that the disparity and confidence maps (H, V, C) are circulated through four processing blocks 52, 54, 56, 58 in each cycle of the process. The L and R images for the appropriate level of the pyramid are input at the respective cycles of the process. The whole process can be seen as an iterative process aimed at arriving at a good estimate of (H, V, C). It is run for five iterations in the preferred embodiment of the invention. The process will now be described in further detail.
Initialisation

The initialisation block 50 performs the following function:
If it is the start then set all pixels in H and V to 0.0 and set all pixels in C to 1. Otherwise leave H, V, C with the values obtained from the last iteration of the smoother.
Starting with the pair of images from the top level of the pyramid (i.e. the "coarsest" images) we then move on to the "Warping" phase.
Warping
The "warping" block 52 takes as its input (R,H,V) and generates an output L' , where L'xy = Rx+(Hx,y) ,y+(vx,y) For a perfect estimate of H and V, and for a scene with no occlusions we should have L = L' . (To the extent that they differ, either the estimated disparity is wrong or there are too many occluding edges) .
Because L' is not necessarily on the integer image grid of L (i.e. L' may be at a fractional position between integer pixel positions in L) , the value (which will be a real number representing brightness of the pixel) for L' is calculated using 4-point bilinear interpolation. Thus, the four pixels that enclose L' (in L) are used to estimate L' .
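A minimal sketch of this warping step with 4-point bilinear interpolation follows; the image structure and border clamping are assumptions.

#include <cmath>
#include <vector>

struct FloatImage { int w, h; std::vector<float> px; };

// Samples im at a fractional position (fx, fy) using the four enclosing pixels.
float bilinear(const FloatImage& im, float fx, float fy)
{
    int x0 = static_cast<int>(std::floor(fx)), y0 = static_cast<int>(std::floor(fy));
    float ax = fx - x0, ay = fy - y0;
    auto at = [&](int x, int y) {
        x = x < 0 ? 0 : (x >= im.w ? im.w - 1 : x);
        y = y < 0 ? 0 : (y >= im.h ? im.h - 1 : y);
        return im.px[static_cast<std::size_t>(y) * im.w + x];
    };
    return (1-ax)*(1-ay)*at(x0,y0) + ax*(1-ay)*at(x0+1,y0)
         + (1-ax)*ay*at(x0,y0+1)   + ax*ay*at(x0+1,y0+1);
}

// L'(x,y) = R(x + H(x,y), y + V(x,y))
FloatImage warpRightToLeft(const FloatImage& R, const FloatImage& H, const FloatImage& V)
{
    FloatImage Lp{R.w, R.h, std::vector<float>(R.px.size())};
    for (int y = 0; y < R.h; ++y)
        for (int x = 0; x < R.w; ++x) {
            std::size_t i = static_cast<std::size_t>(y) * R.w + x;
            Lp.px[i] = bilinear(R, x + H.px[i], y + V.px[i]);
        }
    return Lp;
}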
Matching

The correlation measure used by the matching block 54 is:

κ(x, y) = cov_lr(x, y) / √( cov_ll(x, y) · cov_rr(x, y) )

with

cov_lr(x, y) = Σ_{-2≤u≤2} Σ_{-2≤v≤2} L(x+u, y+v) R(x+u, y+v) ω(u, v)

and cov_ll and cov_rr defined in the same way from L with itself and R with itself, where u and v are integral pixel locations, and ω is a Gaussian kernel centered at (0,0).
The matching phase carried out by block 54 proceeds as follows:
1. For each pixel p=(x,y) in L
(a) Let κ_xy(i, j) be the correlation between the neighborhood around L_{x,y} and the neighborhood around L'_{i,j}.
(b) Compute the horizontal correlations κ_xy(x-δ, y), κ_xy(x, y), κ_xy(x+δ, y). δ is a value which is a (decimal) fraction of one pixel, i.e. such that the location (x-δ), for example, is a non-integral location along the x-axis, falling between the integer pixel positions x and x-1. For example, the initial chosen value of δ might be 0.5.
(c) Fit a parabolic curve through (i.e. in practice, fit a quadratic function to) these points and (i)(a) if the curve has a negative second derivative and has a maximum x_max in the range x ± 1.5δ, then (b) add x - x_max to H_{x,y} to give the new estimate of the horizontal disparity; (ii) if the curve has a negative second derivative and a maximum outside this range, set x_max = x ± 1.5δ and proceed as in (i)(b) above; (iii) if the curve has a positive second derivative, evaluate it at (x ± 1.5δ, y), set x_max to whichever position yields the greatest predicted correlation and then proceed as in (i)(b) above.
(d) Proceed similarly in the vertical direction, updating V_{x,y}.
(e) Set κ_max to be the average of the maximums of the correlations in the horizontal and vertical directions.
(f) Set C_{x,y} ← 0.7 κ_max + 0.3 C_{x,y}.
2. Set δ ← 0.5δ
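The sub-pixel refinement of step (c) can be sketched as below: a quadratic is fitted through the correlations at offsets -δ, 0 and +δ and the position of its maximum is clamped to ±1.5δ. The sign convention for applying the returned offset to H follows the text and is not reproduced here.

double parabolicOffset(double kMinus, double kCentre, double kPlus, double delta)
{
    double a = (kPlus + kMinus - 2.0 * kCentre) / (2.0 * delta * delta);   // half the curvature
    double b = (kPlus - kMinus) / (2.0 * delta);                           // slope at offset 0
    double limit = 1.5 * delta;

    if (a < 0.0) {                                        // negative second derivative: a real peak
        double t = -b / (2.0 * a);                        // position of the quadratic's maximum
        if (t >  limit) t =  limit;                       // clamp to x +/- 1.5*delta
        if (t < -limit) t = -limit;
        return t;
    }
    // Positive second derivative: evaluate the parabola at +/-1.5*delta and move
    // towards whichever end predicts the greater correlation.
    double fPlus  = a * limit * limit + b * limit + kCentre;
    double fMinus = a * limit * limit - b * limit + kCentre;
    return fPlus >= fMinus ? limit : -limit;
}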
Smoothing
The purpose of the smoothing or regularisation phase (carried out by the regularising block 56) is to adjust the disparity maps H, V and confidence map C to arrive at a more accurate estimate of the disparity. The reason this is desirable is that the initial estimates derived from the matching phase are inevitably noisy. These estimates are adjusted, taking into account the strength of the correlation and data from the original image. 1. New values of H, V, and C are obtained by applying a convolution to the neighborhoods surrounding each pixel in H, V, C such that:
H_new = conv(H, x, y, W(L, x, y, C))
V_new = conv(V, x, y, W(L, x, y, C))
C_new = conv(C, x, y, W(L, x, y, C))
where : W(I,a,b,P) is a weight construction function that computes a 3 x 3 convolution kernel on image I at co-ordinate a,b, using a probability array P. As shown above, in the present case the confidence map C is chosen as the probability array P.
conv(I,a,b,K) evaluates a convolution on image I at position a,b using kernel K.
The weight construction function currently used produces a 3 x 3 matrix W whose entries are derived from the confidence values in the probability array P and the local brightness values of the image I around (a, b).
2. Step 1 is repeated 20 times.
Once the initialisation 50, warping 52, matching 54, and smoothing (regularising) 56 phases have been carried out as above described, the system moves on to carry out all of these phases for the pair of images in the next level down (moving from coarse to fine) in the image pyramid. The complete stereo matching process is iterated through all the levels of the pyramid, as illustrated by Fig. 21. (There are five iterations in the present described embodiment.)
It will be appreciated that before carrying out the warping stage at the next level down in the pyramid, the values of H and V (the horizontal and vertical disparities) must be appropriately scaled. (In the present embodiment H and V must therefore be multiplied by the factor 2 when moving from one pyramid level down to the next level. This is because the scaling factor f used to calculate the DoG images is 0.5 in our embodiment, as mentioned above. It will though be appreciated that other scaling factors could be used in different possible embodiments.)
By way of illustration, Fig.23 shows in flow chart form the general scheme of the above-described stereo matching method, and the sequence of steps used to iterate through all levels in the pyramid.
Using the above-described technique, together with textured illumination of the object scene, we have been able to "match" the left and right images to an accuracy of 0.15 pixels over approximately 90% of the image.
ALTERNATIVE METHOD USING BOTH LEFT-RIGHT AND RIGHT-LEFT DISPARITIES

The accuracy of the reconstructed image can be increased by calculating both left-to-right and right-to-left disparities. This allows occluded areas to be detected. (Occluded areas are those which are visible to only one of the pair of cameras 3, 4.) Since the disparities calculated for occluded areas are almost certainly erroneous, the confidence map can be adjusted to take this into account.
To detect occlusions the warping (block 52) and matching (block 54) are done twice; everything left (l) is swapped with right (r) the second time. After this we get (H_l, V_l, C_l) and (H_r, V_r, C_r). Subsequently occlusions are detected and used to modify C_l and C_r. Smoothing (block 56) is applied to both (H_l, V_l, C_l) and (H_r, V_r, C_r).
OCCLUSION DETECTION
The above-mentioned detection of occlusion is done as follows:
Consider pixel L(x, y) of the left image. From (H_l, V_l) we get the corresponding position of the pixel in the right image, R(x', y'). From (H_r, V_r) we get the corresponding position of that pixel back in the left image, L(x'', y''). If the Euclidean distance between L(x, y) and L(x'', y'') is greater than some threshold [which is preferably equal to one (1)], the pixel L(x, y) of the left image is classified as an occluded pixel. The pixel value of C_l is set to 0 to indicate that this is an occlusion, and thus that we have no confidence in our disparity measurement. (In our system the minimum value of confidence C is 0.04.) This is done for every pixel of C_l and C_r.
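The left/right consistency check can be sketched as below. The nearest-pixel lookup into the right-image maps (instead of interpolation) and the use of the minimum confidence value 0.04 for occluded pixels are simplifying assumptions.

#include <cmath>
#include <vector>

struct Map { int w, h; std::vector<float> px; };

// Marks as occluded any left pixel whose round trip through the left-to-right
// and right-to-left disparity maps lands more than 'threshold' pixels away
// from where it started, and floors its confidence.
void markOcclusions(const Map& Hl, const Map& Vl,
                    const Map& Hr, const Map& Vr,
                    Map& Cl, double threshold = 1.0)
{
    for (int y = 0; y < Hl.h; ++y)
        for (int x = 0; x < Hl.w; ++x) {
            std::size_t i = static_cast<std::size_t>(y) * Hl.w + x;
            int xr = static_cast<int>(std::lround(x + Hl.px[i]));   // nearest right-image pixel
            int yr = static_cast<int>(std::lround(y + Vl.px[i]));
            if (xr < 0 || xr >= Hr.w || yr < 0 || yr >= Hr.h) {     // fell outside the right image
                Cl.px[i] = 0.04f;
                continue;
            }
            std::size_t j = static_cast<std::size_t>(yr) * Hr.w + xr;
            double xb = xr + Hr.px[j], yb = yr + Vr.px[j];          // mapped back into the left image
            if (std::hypot(xb - x, yb - y) > threshold)
                Cl.px[i] = 0.04f;                                   // occluded: minimum confidence
        }
}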
BUILDING A 3-D MODEL
Once the left and right cameras are calibrated (using the afore-described calibration technique) (i.e. their internal and external orientation is known) , and the afore-described (stereo) matching method has determined the disparity map between the two images, it is relatively straightforward to then build a dense 3-D world model. Recall from above the camera parameters used by our system for a camera c :
(r_x, r_y, r_z) The Eulerian rotation angles for the camera w.r.t. the default world co-ordinate frame. The (3 x 3) rotation matrix corresponding to these angles (in the order x-axis, y-axis then z-axis) is R_c.
(t_x, t_y, t_z) The translation for the camera w.r.t. the default world co-ordinate frame; let t_c = (t_x, t_y, t_z)^T.
(uc,vc,f) The principal point of the camera w.r.t. the camera frame. uc and vc are in pixel co-ordinates. The translation from camera space to pixel coordinates involves :
Xp The size of an image pixel along the x-axis.
s The ratio between an x-pixel and y-pixel.
(k1,k2,k3,k4,k5) Distortion parameters for the camera.
Given this information about both the left and right cameras, and a pixel location (u_l, v_l) in the left camera c_l, we can calculate the world position of this point as follows. Let the disparity at (u_l, v_l) be (d_x, d_y). The co-ordinate of the corresponding point in the right camera c_r is then (u_r, v_r) = (u_l + d_x, v_l + d_y).
For each camera 3, 4 the vector from the principal point to the image point (u, v) in pixel co-ordinates is p = ((u - u_c) x_p, (v - v_c) x_p s, -f). Let these vectors for c_l and c_r be V_l and V_r. Let ω_l = R_l V_l and ω_r = R_r V_r, and let c = t_l - t_r. Let (a, b)^T be the solution of the 2 x 2 linear system

(ω_l · ω_l) a - (ω_l · ω_r) b = -ω_l · c
(ω_l · ω_r) a - (ω_r · ω_r) b = -ω_r · c

Then the point in space is p = t_l + a ω_l.
This formula is used for each pixel in the left image L, and produces an output image, in one-to-one correspondence with the left camera image L, which contains a 3-D world position at each pixel .
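The per-pixel triangulation above might be sketched as follows, ignoring the lens-distortion parameters (k1..k5) for brevity; the `cam` dictionary layout (keys R, t, uc, vc, f, Xp, s) is an assumed convenience, not part of the described apparatus.

```python
import numpy as np

def pixel_ray(u, v, cam):
    """Ray from the principal point through pixel (u, v), rotated into world
    co-ordinates: omega = R * ((u-uc)Xp, (v-vc)Xp*s, -f)."""
    p = np.array([(u - cam["uc"]) * cam["Xp"],
                  (v - cam["vc"]) * cam["Xp"] * cam["s"],
                  -cam["f"]])
    return cam["R"] @ p

def triangulate(ul, vl, dx, dy, cam_l, cam_r):
    """World position of left pixel (ul, vl) with disparity (dx, dy), taken as
    the least-squares intersection of the two viewing rays."""
    wl = pixel_ray(ul, vl, cam_l)
    wr = pixel_ray(ul + dx, vl + dy, cam_r)
    c = cam_l["t"] - cam_r["t"]
    # 2x2 normal equations for the ray parameters (lambda_r, lambda_l).
    A = np.array([[wr @ wr, -(wl @ wr)],
                  [-(wl @ wr), wl @ wl]])
    b = np.array([wr @ c, -(wl @ c)])
    lam_r, lam_l = np.linalg.solve(A, b)
    return cam_l["t"] + lam_l * wl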
Given one or more dense 3-D models, represented as an image in memory, with each pixel containing the distance from the principal point to the object surface, and given the internal and external orientation parameters of the camera for each model, the present invention includes a method for generating triangular polygon meshes from these models for the purposes of display and transmission. In addition, a novel method of merging render images associated with each model has been developed.
Method for Building Polygon Meshes from Dense 3-D images
In order to combine multiple dense models an intermediate structure is used, namely a 3-D voxel image. The preferred implementation is a variant of that proposed by Curless and Levoy (referenced above). This image is an array of points v_{i,j,k} = (x_i, y_j, z_k)^T, 0 <= i < n_x, 0 <= j < n_y, 0 <= k < n_z. The array is arranged uniformly in space with separation s, so that v_{i+1,j+1,k+1} - v_{i,j,k} = (s, s, s)^T.
Each point is categorized as UNSEEN, EMPTY or BOUNDARY. An EMPTY point is one which lies between the principal point of a camera and the model surface seen from that camera, at a distance from the surface greater than a threshold τ. A BOUNDARY point is one whose distance from the principal point is within tolerance τ of that seen in the model. Note that boundary points contain a signed distance. Points start with the label UNSEEN.
Before describing the method, we first introduce some terms. Let the model image be I = I(x, y), (x, y) ∈ (0, 0)-(n, m). Let the corresponding confidence map C derived from the initial match (see above) be the image W = W(x, y), (x, y) ∈ (0, 0)-(n, m). Let the projection matrix that maps points in space to image co-ordinates in the image from camera c be P_c. Let the axis vectors in space be i, j, k, where i = (1, 0, 0)^T for example. Let the scale of camera c be s_c = 1/max{||P_c i||_2, ||P_c j||_2, ||P_c k||_2}. Given an image point u = (u_x, u_y) in I with real co-ordinates, the value of I_u, if it is in the image, is computed by bilinear interpolation from the 4 neighboring pixels. Let the Euclidean distance of any point v in space from the principal point of camera c be d_c(v). Let the cosine of the ray from the principal point of c to the model surface at image point I_u be cos_c(u).
The method of adding a dense 3-D model, represented by a (float coded) image array I and camera c, to an image volume V with distance threshold τ and confidence threshold ω_t is as follows:
1. If no models have yet been added to V, initialize all points in V to UNSEEN. We assume that the weight and distance of each point are both initialized at 0.
2. If sc < 1 scale I and W by sc, and scale the focal length of c by sc. This ensures that the sampling of the image by V will not alias.
3. Process the confidence image W, by:
(a) Setting all pixels below ωt to 0.
(b) Blob filtering W to remove all blobs of non-zero pixels whose area relative to that of the whole image falls below 3%.
(c) Reducing the weight of non-zero pixels if they are near the edge of a blob.
4. For each point v ∈ V compute u = P_c v. If u is in I then let d_i = I_u, ω_i = W_u, let d_v be the distance from v to the principal point of c, and let cos_i = cos_c(u). Let d = d_i - d_v.
(a) If cos_i · d > τ then set v to EMPTY.
(b) else if cos_i · d < -τ, ignore v;
(c) else if ω_i > 0, let the current distance of v be d_0 and the current weight be ω_0; set the distance of v to (d cos_i ω_i + d_0 ω_0)/(ω_i + ω_0) and the weight of v to ω_i + ω_0.
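For a single voxel, step 4 might be sketched as below. The UNSEEN/EMPTY/BOUNDARY constants are illustrative, and the signed-distance blend follows the reconstructed formula given above, which is one reading of the original text.

```python
UNSEEN, EMPTY, BOUNDARY = 0, 1, 2   # illustrative labels

def update_voxel(label, d0, w0, d_model, w_pix, d_voxel, cos_i, tau):
    """One voxel update.  `d_model` is the model distance sampled at the voxel's
    projection, `w_pix` the confidence there, `d_voxel` the voxel's distance from
    the principal point and `cos_i` the ray cosine.  Returns the updated
    (label, signed distance, weight)."""
    d = d_model - d_voxel
    if cos_i * d > tau:                       # voxel lies well in front of the surface
        return EMPTY, d0, w0
    if cos_i * d < -tau:                      # voxel lies well behind the surface: ignore
        return label, d0, w0
    if w_pix > 0:                             # near the surface: blend signed distances
        new_d = (d * cos_i * w_pix + d0 * w0) / (w_pix + w0)
        return BOUNDARY, new_d, w_pix + w0
    return label, d0, w0
```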
Once this image has been computed it is possible to extract a triangulated mesh from it using a suitable known isosurface extraction method (e.g. as proposed by Lorensen and Cline in SIGGRAPH '87 Conference Proceedings (Anaheim, CA), pages 163-170; or Howie & Blake in Computer Graphics Forum, 13(3): C/65-C/74, October 1994; or Livnat et al, IEEE Transactions on Visualisation and Computer Graphics, 2(1): 73-84, March 1996).
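Any off-the-shelf marching-cubes routine can serve here; for example, scikit-image provides one. This is simply a readily available implementation of the cited isosurface-extraction approach, not the particular embodiment described in the patent, and the file name below is hypothetical.

```python
import numpy as np
from skimage import measure

# `volume` is the fused signed-distance array; UNSEEN/EMPTY voxels would need a
# sentinel value in practice.  Extract the zero isosurface as a triangle mesh.
volume = np.load("fused_sdf.npy")            # hypothetical file name
verts, faces, normals, values = measure.marching_cubes(volume, level=0.0)
```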
The efficiency of the process of constructing the voxel image can be improved by two algorithmic methods .
1) Run-length Encoding
Since it is not necessary to make distinctions between any two points with labels UNSEEN or EMPTY, it is possible to run- length encode the voxel image. This is done by storing the x- ordinate as a variable run-length encoded array. Since the principal mode of access to the voxel image is iterative, this can be done without a performance penalty.
The amount of space required by the run-length encoded image is a function of the complexity of the surface. Typically each non-empty x-row will intersect the surface twice. Assuming the depth of the boundary cells is fixed, the space requirement is quadratic in the linear dimension of the volume, rather than cubic.
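As a toy illustration of the run-length encoding of one x-row of voxel labels (the variable run-length encoded array mentioned above), consider the sketch below; only BOUNDARY runs would need to carry per-voxel distance and weight data alongside.

```python
def rle_encode_row(labels):
    """Run-length encode one x-row of voxel labels as (label, run_length) pairs.
    UNSEEN and EMPTY runs collapse to a single pair each."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], i - start))
            start = i
    return runs

# Example: a row crossing the surface twice stores only a handful of runs.
row = ["EMPTY"] * 40 + ["BOUNDARY"] * 3 + ["UNSEEN"] * 20 + ["BOUNDARY"] * 3 + ["EMPTY"] * 40
print(rle_encode_row(row))
# [('EMPTY', 40), ('BOUNDARY', 3), ('UNSEEN', 20), ('BOUNDARY', 3), ('EMPTY', 40)]
```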
The memory organisation of the run-length encoded volume is amenable to multi-processing, whereby multiple range images can be simultaneously added to the volume. The most convenient way to do this is for each range image to be added using a random permutation of the x-coded arrays. This minimises possible memory contention.
2) Pyramid coding
When the number of voxels is very large, most of the processing undertaken is redundant examination of non-BOUNDARY cells. Use of a voxel pyramid can significantly reduce processing time by focusing on BOUNDARY cells, the number of which grows only quadratically with the linear volume dimension.
Each level of the volume pyramid is constructed and processed as described above, with some minor variations to be described below. The bottom level of this pyramid is as described in the section above entitled "Method for Building Polygon Meshes from Dense 3-D images". The other levels are scaled by a factor of 0.5 relative to the level below them. Let there be n levels in the pyramid, with the bottom level numbered n-1 and the top level 0. Using the terminology of the afore-mentioned section, the voxel size at level i is s_i and the distance tolerance is τ_i.
In order to avoid aliasing, the model images are, if necessary, scaled using a box filter to ensure that the projections of the vectors s_i i, s_i j, s_i k in the image are less than one pixel in size.
In order to add a new model I and camera c to an n-level image pyramid volume V with distance threshold τ_{n-1} and confidence threshold ω_t, the following extension to the algorithm used in the section entitled "Method for Building Polygon Meshes..." above is used:
1. The value of s_{n-1} is chosen. The other voxel sizes are set so that each level is twice as coarse as the level below it, s_i = 2 s_{i+1}, i ∈ {0, ..., n-2}.
2. The value of τ_{n-1} is chosen. The other tolerances are set as τ_i = max{2 τ_{i+1}, 2 s_i}, i ∈ {0, ..., n-2}.
3. The top level (level 0) of the pyramid is processed as described in the afore-mentioned above section entitled "Method for Building Polygon Meshes ... " .
4. For subsequent levels of the pyramid, computation is only done for those voxels whose corresponding voxel in the level above has the label BOUNDARY. When a voxel at level i is processed and has a distance greater than τ_i it is set to EMPTY; when it has a distance less than -τ_i it is set to UNSEEN; otherwise it is processed in the same manner as in the previous algorithm.
The specific representation used for voxel volumes, and the manner of processing them, provides both a very significant speed-up over single-level computation and improved memory performance.
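A sketch of the per-level restriction follows, assuming each coarse voxel covers a 2 x 2 x 2 block of voxels at the next finer level and reusing the illustrative label constants from the earlier fusion sketch.

```python
import numpy as np

UNSEEN, EMPTY, BOUNDARY = 0, 1, 2   # as in the earlier fusion sketch

def fine_level_candidates(parent_labels):
    """Boolean mask over the next (finer) pyramid level, True only for voxels
    whose parent at the coarser level is BOUNDARY.  Children of UNSEEN or EMPTY
    parents simply inherit the parent label and are never examined."""
    boundary = (parent_labels == BOUNDARY)
    # Each coarse voxel covers a 2 x 2 x 2 block of fine voxels.
    return np.repeat(np.repeat(np.repeat(boundary, 2, axis=0), 2, axis=1), 2, axis=2)
```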
APPLYING SEAMLESS RENDER
The render images captured from the center cameras of the pods can be used to provide an image texture for a polygon mesh M, constructed as described above. The present invention contains a novel method for merging the different images to provide seamless texturing of the polygon mesh. Before describing the merging algorithm in detail, we describe the method by which render images, captured from the central (or left) cameras of the 3-D camera apparatus, can be mapped onto polygon meshes.
Let the image to be mapped be I, as above. Let the camera associated with the image be c. Let the vertices of the polygon mesh be V = {v_i}, i ∈ I. Let the normal at each vertex be n_i. If the projection matrix of camera c is P_c as above, and the vector from the principal point of c to v_i is ω_i, then we associate a texture co-ordinate u_i = P_c v_i with v_i if the vertex is visible from camera c.
Informally, a vertex is visible if (a) the surface at that point is facing the camera, and (b) no other part of the surface is between the camera and the vertex. The test for (a) is whether n_i · ω_i < 0. The test for (b) is performed by a standard hidden-surface computation on the whole mesh.
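The projection of mesh vertices to texture co-ordinates, together with the facing test (a), might be sketched as follows; P is assumed to be a 3 x 4 projection matrix acting on homogeneous world points, and the hidden-surface test (b) is not shown.

```python
import numpy as np

def texture_coords(vertices, normals, P, principal_point):
    """Project mesh vertices into a render image.  Returns an (N, 2) array of
    texture co-ordinates and a boolean mask of vertices that pass the facing
    test n_i . w_i < 0, where w_i is the vector from the principal point to the
    vertex."""
    w = vertices - principal_point                 # rays from camera to vertices
    facing = np.einsum("ij,ij->i", normals, w) < 0
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    uvw = (P @ homo.T).T                           # homogeneous pixel co-ordinates
    uv = uvw[:, :2] / uvw[:, 2:3]
    return uv, facing
```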
With this preliminary we turn to the seamless merging of texture images. Let the images be I_k, k ∈ {1,...,n}, the confidence images be W_k, k ∈ {1,...,n}, the cameras be c_k, k ∈ {1,...,n}, and the vertices be V = {v_i}, i ∈ I, as above.
The goal of the merging method is to merge seamlessly, and with as little blurring of the image as possible. Area-based merging techniques are likely to introduce blurring of the images due to misregistration and to differences in illumination between images. In order to obtain a seamless merge of the textures with minimal blurring we use a boundary-based approach. Each triangle in the mesh M is projected to by one or more of the texture images. We determine which image projects to which triangles by optimising a combination of the confidence, or weight, associated with each triangle, and the size of the connected patches to which an image projects. Before presenting this algorithm we define the neighbors of a triangle t as the set N_t. A neighbor is a triangle that shares an edge. The algorithm for determining which images project to which triangles is as follows:
1. For each triangle t in the mesh, associate the image I_k for which t has the highest confidence. The confidence C_t of triangle t is defined as the average of the weights of its vertices with respect to the image. If any weight is missing, or the point does not project to the image, the triangle's confidence is 0. We define a function m, where m(t) is the image I_k with the maximum value of C_t.
2. Generate a partition P of the triangles of M where each element p of P is maximal in the sense that there do not exist t ∈ p and t' ∉ p, with t' ∈ N_t and m(t) = m(t').
3. We say that a triangle t ∈ M is consistent with an image I and camera c if C_t is non-zero. We extend this definition to an element p of the partition P in the natural way. We reduce the cardinality of P by repeating the following steps:
(a) Sort P by cardinality.
(b) For each element p, smallest first: if p is consistent with an image I, and there is a triangle t ∈ p neighboring a triangle in a different element p' ∈ P, and the cardinality of p is not greater than that of p', then assign each triangle of p to the partition p' and reset m(t) for each t ∈ p.
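Step 1 of this algorithm, choosing m(t) and C_t for every triangle, could look like the sketch below; `tri_vertex_ids` and `vertex_weights` are assumed array layouts, and the greedy patch-merging of steps 2 and 3 is omitted.

```python
import numpy as np

def assign_images_to_triangles(tri_vertex_ids, vertex_weights):
    """For each triangle choose the image with the highest confidence.  A
    triangle's confidence w.r.t. an image is the mean of its vertices' weights
    in that image, or 0 if any vertex does not project (weight 0).
    `tri_vertex_ids` is (n_tris, 3) ints; `vertex_weights` is
    (n_images, n_vertices) floats with 0 marking 'does not project'."""
    n_imgs = vertex_weights.shape[0]
    conf = np.zeros((n_imgs, len(tri_vertex_ids)))
    for k in range(n_imgs):
        w = vertex_weights[k][tri_vertex_ids]          # (n_tris, 3) vertex weights
        conf[k] = np.where((w > 0).all(axis=1), w.mean(axis=1), 0.0)
    m = conf.argmax(axis=0)                            # image index per triangle
    Ct = conf.max(axis=0)                              # its confidence
    return m, Ct
```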
On termination of this algorithm each triangle t is associated with an image I_k via the partition P. Consider the vertices v_i, i ∈ I, that belong to triangles in more than one element of P. These are vertices on the boundary between two partitions. These vertices will be used to define the merge between textures. We extend the set I with additional elements corresponding to new vertices. These vertices are positioned between vertices v_i, i ∈ I, connected by edges in M. The number of additional vertices to be added is user-specified. The rationale for this addition is to ensure that the boundaries, when projected onto the images, are dense and will provide a smooth merge.
For each i ∈ I let the subset of images in which v_i has an associated texture co-ordinate be S_i ⊆ {1,...,n}, let the texture co-ordinates be u_i^s, s ∈ S_i, and the weights be ω_i^s, s ∈ S_i.
These points are the projections of the point v_i in each image. In general the pixel intensities at these image points will be different from each other. In order to merge the textures, we define a new common value p_i at this image point in each of the images I_s, s ∈ S_i. This value is weighted by the corresponding pixel weight in each image, and is
p_i = ( Σ_{s ∈ S_i} ω_i^s I_s(u_i^s) ) / ( Σ_{s ∈ S_i} ω_i^s )
We also define δ_i^s = p_i - I_s(u_i^s), the difference between the new common value and the original value at that pixel in image I_s.
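For a single boundary vertex this weighted blend, and the per-image corrections δ, might be computed as in the sketch below (bilinear sampling via scipy's map_coordinates); the (x, y) co-ordinate convention is an assumption.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def blended_value_and_offsets(images, weights, coords):
    """For one boundary vertex, sample each image in which it is visible at its
    texture co-ordinate, form the confidence-weighted average p_i, and return
    (p_i, p_i - samples), the per-image corrections.  `images`/`weights` are
    lists of 2-D arrays and `coords` a list of (x, y) positions, one per image."""
    samples, ws = [], []
    for img, wmap, (x, y) in zip(images, weights, coords):
        # Bilinear sampling at the (sub-pixel) texture co-ordinate (row, col) = (y, x).
        samples.append(map_coordinates(img, [[y], [x]], order=1)[0])
        ws.append(map_coordinates(wmap, [[y], [x]], order=1)[0])
    samples, ws = np.array(samples), np.array(ws)
    p = (ws * samples).sum() / ws.sum()
    return p, p - samples
```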
For each image I_k, k ∈ {1,...,n}, let the set of points in V where p_i is defined over more than one image be Λ_k ⊆ I. Let the rectangular domain over which I_k is defined be Ω_k ⊂ R². The method of texture integration is based on determining a function φ_k : Ω_k → R which is as "smooth" as possible subject to the pointwise constraint φ_k(u_i^k) = δ_i^k, i ∈ Λ_k. Given this function, each image I_k can be "warped" by adding the value of φ_k evaluated at a pixel to that pixel.
This is a version of the problem of data interpolation from scattered points. The preferred embodiment of the 3-D camera apparatus uses the method described by S. Lee et al: "Scattered Data Interpolation with Multilevel B-Splines", IEEE Transactions on Visualisation and Computer Graphics, 3(3): 228-244, 1997.
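As an illustration only, a smooth correction field satisfying scattered pointwise constraints can be built with a radial-basis-function interpolator; this stands in for, and is not, the multilevel B-spline scheme of Lee et al. used in the preferred embodiment.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def correction_field(points, deltas, shape):
    """Build a per-pixel correction image phi_k from scattered boundary
    constraints: `points` are (N, 2) pixel positions (x, y), `deltas` the
    required value changes there.  The warped texture is then I_k + phi_k."""
    phi = RBFInterpolator(points, deltas, smoothing=0.0)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    grid = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    return phi(grid).reshape(shape)
```

For large images the `neighbors` argument of RBFInterpolator keeps this tractable by using only nearby constraints.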

Claims

1. A method of measuring stereo image disparity for use in a 3-D modelling system, the method comprising the steps of: (a) producing a first camera output image of an object scene;
(b) producing a second camera output image of said object scene;
(c) digitising each of said first and second camera output images and storing them in storage means; (d) processing said first and second digitised camera output images so as to produce an image pyramid comprising a plurality of successively produced pairs of filtered images, each said pair of filtered images providing one level in the pyramid, each successive pair of filtered images being scaled relative to the pair of filtered images in the previous pyramid level by a predetermined amount and having coarser resolution than the pair of filtered images in said previous pyramid level, and storing these filtered images;
(e) calculating an initial disparity map for the coarsest pair of filtered images in the pyramid by matching one image of said pair of coarsest filtered images with the other image of said coarsest pair of filtered images;
(f) using said initial disparity map to carry out a warping operation on one image of the next-coarsest pair of filtered images in the pyramid, said warping operation producing a shifted version of said one image; and
(g) matching said shifted version of said one image of the next-coarsest pair of images with the other image of said next-coarsest pair of images so as to obtain a respective disparity map for said other image and said shifted image, which respective disparity map is combined with said initial disparity map so as to obtain a new, updated, disparity map for said next-coarsest pair of images; and (h) repeating steps (f) and (g) for the pair of filtered images in each subsequent pyramid level, at each level using the new, updated disparity map obtained for the previous level as said initial disparity map for carrying out the warping process in step (f), so as to arrive at a final disparity map for the least coarse pair of images in the pyramid.
2. A method according to claim 1, wherein said processing step (d) for image pyramid generation comprises operating on said first and second digitised camera output images with a scaling and convolution function.
3. A method according to claim 2, wherein the plurality of pairs of filtered images produced by said scaling and convolution function are Difference of Gaussian (DoG) images .
4. A method according to any preceding claim, wherein the first pair of filtered images produced in step (d) are of the same scale as the digitised first and second camera output images, and each subsequent pyramid level contains images which are scaled by a factor of f, where 0 < f < 1, relative to the previous level.
5. A method according to any preceding claim, wherein there are at least five levels in the image pyramid.
6. A method according to any preceding claim, wherein the method includes repeating steps (f) and (g) at least once, at at least one of said pyramid levels, using the latest new, updated disparity map for each warping operation.
7. A method according to any preceding claim, further including constructing a confidence map during each iteration of steps (f) and (g) of the method, the contents of said confidence map constructed at any one level of the pyramid representing the confidence with which the contents of the disparity map at said one level are held.
8. A method according to claim 7, further including the step of carrying out a smoothing operation on the disparity and confidence maps produced at each level in the image pyramid, prior to using these smoothed maps in the calculation of the new, updated disparity maps and confidence maps at the next level.
9. A method according to claim 8, wherein said smoothing operation comprises convolving each said map with a predetermined weight construction function W(I,a,b,P).
10. A method according to claim 9, wherein said weight construction function W(I,a,b,P) is dependent upon original image intensity values, and confidence values, associated with each pixel.
11. A method according to claim 9, wherein said weight construction function W(I,a,b,P) computes a convolution kernel on a data array representing an image I, at a pixel p(a,b), using a probability array P, which is the confidence map C.
12. A method according to any preceding claim, wherein in steps (e) and (g) said matching process by means of which one image is matched with another is carried out in each case by:
(a) calculating horizontal correlation values for the correlation between a neighbourhood around each pixel p(x,y) in one image and neighbourhoods around at least three horizontally co-linear points in a spatially corresponding area of the second image, and fitting a parabolic curve to said horizontal correlation values and analysing said curve in order to estimate a horizontal disparity value for the said pixel p(x,y); and
(b) calculating vertical correlation values for the correlation between a neighbourhood around each pixel p(x,y) in one image and neighbourhoods around at least three vertically co-linear points in a spatially corresponding area of the second image, and fitting a parabolic curve to said vertical correlation values and analysing said curve in order to estimate a vertical disparity value for the said pixel p(x,y).
13. A method according to claim 12, wherein data values of the image for fractional points located between pixels are obtained by using bilinear interpolation.
14. A method according to any preceding claim, wherein each of said first and second output images comprises an array of pixels, and the final disparity map comprises the calculated disparity for each pixel in the first image relative to the second image, and the method further includes repeating steps (e) to (h) of the method to calculate a final disparity map for each pixel in the second image relative to the first image, and then comparing the two final disparity maps so as to detect areas of the object scene which are occluded in one of said first and second output images.
15. A method according to any preceding claim, further including illuminating the object scene with a textured pattern.
16. A method according to claim 15, wherein the textured pattern comprises a digitally generated fractal random pattern of dots of different levels of transparency.
17. A 3-D image modelling system comprising: first camera imaging means (3) for producing a first camera output image of an object scene; second camera imaging means (4) for producing a second camera output image of said object scene; digitising means (9) for digitising each of said first and second camera output images; storage means (17,18) for storing said digitised first and second camera output images; and image processing means (11) programmed to:
(a) process said first and second camera output images so as to produce an image pyramid of pairs of filtered, preferably Difference of Gaussian (DoG) , images from said first and second digitised camera output images, each successive level of the pyramid providing smaller images having coarser resolution, and said storage means being capable of also storing the pairs of filtered images so produced;
(b) process the coarsest pair of filtered images in the pyramid so as to: calculate an initial disparity map for said coarsest pair of filtered images; use said initial disparity map to carry out a warping operation (52) on one of said next-coarsest pair of filtered images, said warping operation producing a shifted version of said one of said next-coarsest pair of filtered images; match (54) said shifted version of said one of said next-coarsest pair of filtered images with the other of said next-coarsest pair of filtered images to obtain a respective disparity map for said other image and said shifted image, which disparity map is combined with said initial disparity map to obtain a new, updated, disparity map for said next-coarsest pair of filtered images; (c) iterating said warping and matching processes for the pair of images at each subsequent level of the scale pyramid, at each level using the new, updated disparity map from the previous iteration as the "initial" disparity map for carrying out the warping step of the next iteration for the next level, prior to calculating the new, updated, disparity map at this next level, so as to obtain a final disparity map for the least coarse pair of filtered images in the image pyramid; and (d) operating on said first and second digitised camera output images using said final disparity map, in a 3-D model construction process, in order to generate a three-dimensional model from said first and second camera output images.
18. A 3-D modelling system according to claim 17, further including projector means (13) for projecting a textured pattern onto the object scene.
19. A computer program product comprising: a computer usable medium having computer readable code means embodied in said medium for carrying out a method of measuring stereo image disparity in a 3-D image modelling system, said computer program product having computer readable code means for: processing data corresponding to a pair of first and second digitised camera output images of an object scene so as to produce filtered data corresponding to a plurality of successively produced pairs of filtered images, each pair of filtered images providing one level in the pyramid, each pair of filtered images being scaled relative to the pair of filtered images in the previous level by a predetermined amount and having coarser resolution than the pair of images in said previous level; calculating an initial disparity map for the coarsest pair of filtered images by matching filtered data of one image of said coarsest pair of filtered images with the filtered data of the other image of said coarsest pair of filtered images; using said initial disparity map to carry out a warping operation on the data of one image of the next-coarsest pair of filtered images in the pyramid, said warping operation producing a shifted version of said one image; and matching said shifted version of said one of said next-coarsest pair of images with the other of said next-coarsest pair of images so as to obtain a respective disparity map for said other image and said shifted image, which disparity map is combined with said initial disparity map so as to obtain a new, updated, disparity map for said next-coarsest pair of filtered images; and iterating said warping and matching processes for the pair of images at each subsequent level of the scale pyramid, at each level using the new, updated disparity map from the previous iteration as the "initial" disparity map for carrying out the warping step of the next iteration for the next level, prior to calculating the new, updated, disparity map at this next level, so as to obtain a final disparity map for the least coarse pair of filtered images in the image pyramid.
20. A computer program product according to claim 19, further including computer readable code means for: operating on said first and second digitised camera output 5 image data using said final disparity map, in a 3-D model construction process, in order to generate data corresponding to a three-dimensional image model from said first and second digitised camera output image data.
21. A method of calibrating cameras for use in a 3-D modelling system so as to determine external orientation parameters of the cameras relative to a fixed reference frame, and determine internal orientation parameters of the cameras, the method comprising the steps of:
(a) providing at least one calibration object having a multiplicity of circular targets marked thereon, wherein said targets lie in a plurality of planes in three dimensional space and are arranged such that they can be individually identified automatically in a camera image of said at least one calibration object showing at least a predetermined number of said circular targets not all of which lie in the same plane;
(b) storing in a memory means of the modelling system the relative spatial locations of the centres of each of said
target circles on said at least one calibration object;
(c) capturing a plurality of images of said at least one calibration object with each of a pair of first and second cameras of the modelling system, wherein at least some points, on said at least one calibration object, imaged by one of said cameras are also imaged by the other of said cameras;
(d) analysing said captured images so as to: locate each said circular target on said at least one calibration object which is visible in each said captured image and determine the centre of each such located circular target, preferably with an accuracy of at least 0.05 pixel, most preferably with up to 0.01 pixel accuracy; and identify each located circular target as a known target on said at least one calibration object;
(e) calculating initial estimates of the internal and external orientation parameters of the cameras using the positions determined for the centres of the identified circular targets.
22. A method according to claim 21, wherein step (e) is carried out using a Direct Linear Transform (DLT) technique.
23. A method according to claim 21 or claim 22, including the further step of:
(f) refining the initial estimates of the internal and external orientation parameters of the cameras using a least squares estimation procedure.
24. A method according to claim 23, wherein step (f) is carried out by applying a modified version of the co-linearity constraint, in the form of an iterative non-linear least squares method, to the initial estimates of the internal and external orientation parameters of the cameras, to calculate a more accurate model of the internal and external orientation parameters of the cameras.
25. A method according to claim 24, where modelling of perspective distortion is incorporated in said iterative non-linear least squares method used to calculate a more accurate model of the internal and external orientation parameters of the cameras.
26. A 3-D modelling method incorporating the method of calibrating cameras according to claim 21, and the method for measuring stereo image disparity according to claim 1, and wherein the estimated internal and external parameters of the first and second cameras, and the final calculated disparity map for the first and second camera output images, are used to construct a 3-D model of the object scene.
27. A 3-D modelling method according to claim 26, wherein the 3-D model is in the form of a polygon mesh.
28. A 3-D modelling system according to claim 17 or claim 18, wherein the image processing means is further programmed to carry out steps (d) and (e) in the method of claim 21, and wherein the system further includes at least one said calibration object and storage means for storing constructed 3-D models.
29. A 3-D modelling system according to claim 17 or claim 18, wherein a plurality of pairs of left and right cameras are used, in order to allow simultaneous capture of multiple pairs of images of the object scene.
30. A 3-D modelling method according to claim 26 or claim 27, wherein multiple pairs of images of the object scene are captured and each pair of images is used to produce a 3-D model of the object scene, and the plurality of said 3-D models thus produced are combined together in a predetermined manner to produce a single output 3-D model.
31. A 3-D modelling method according to claim 30, wherein the plurality of 3-D models are combined in an intermediate 3-D voxel image which is then triangulated using an isosurface extraction method, in order to form the single output 3-D model which is in the form of a polygon mesh.
32. A method according to claim 31, further including integrating at least one render image onto said polygon mesh so as to provide texturing of the polygon mesh.
33. A method according to claim 32, wherein said at least one render image is integrated onto the polygon mesh by using a boundary-based merging technique.
34. A 3-D modelling system according to any of claims 17, 18 and 28, wherein the system incorporates a camera pod comprising three cameras, namely said first and second camera imaging means for capturing left and right object scene images, and a third camera imaging means for capturing at least one image for providing visual render information.
35. A system according to claim 34, wherein the system incorporates a plurality of such camera pods.
36. A modelling system according to claim 28 or claim 29, wherein the or each said calibration object is provided with at least one optically readable bar code pattern which is unique to that calibration object, and the image processing means is programmed to locate said at least one bar code pattern and to read and identify said bar code pattern as one of a pre-programmed selection of bar code patterns stored in a memory means of the system, each said stored bar code pattern being associated with a similarly stored set of data corresponding to a respective said calibration object.
37. A system according to claim 36, wherein the or each said calibration object is additionally provided with bar code location means in the form of a relatively simple locating pattern which the image processing means is configured to locate and, from the location of said locating pattern, identify that portion of the image containing said at least one bar code pattern, prior to reading said bar code pattern.
PCT/GB1999/003584 1998-10-30 1999-10-29 Improved methods and apparatus for 3-d imaging WO2000027131A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP99954099A EP1125249A2 (en) 1998-10-30 1999-10-29 Improved methods and apparatus for 3-d imaging
AU10543/00A AU1054300A (en) 1998-10-30 1999-10-29 Improved methods and apparatus for 3-d imaging

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB9823689.6 1998-10-30
GBGB9823689.6A GB9823689D0 (en) 1998-10-30 1998-10-30 Improved methods and apparatus for 3-D imaging

Publications (2)

Publication Number Publication Date
WO2000027131A2 true WO2000027131A2 (en) 2000-05-11
WO2000027131A3 WO2000027131A3 (en) 2000-10-12

Family

ID=10841507

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB1999/003584 WO2000027131A2 (en) 1998-10-30 1999-10-29 Improved methods and apparatus for 3-d imaging

Country Status (4)

Country Link
EP (1) EP1125249A2 (en)
AU (1) AU1054300A (en)
GB (1) GB9823689D0 (en)
WO (1) WO2000027131A2 (en)

Cited By (69)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1408702A2 (en) * 2002-10-10 2004-04-14 Olympus Corporation Three-dimensional photographing apparatus, three-dimensional photographing method and stereo adapter
EP1430534A1 (en) * 2001-08-23 2004-06-23 University of Washington Image acquisition with depth enhancement
GB2417628A (en) * 2004-08-26 2006-03-01 Sharp Kk Creating a new image from two images of a scene
WO2007041690A2 (en) 2005-10-04 2007-04-12 Alexander Eugene J Device for generating three dimensional surface models of moving objects
WO2008066368A1 (en) * 2006-11-28 2008-06-05 Prefixa Vision Systems, S.A. De C.V. Method and apparatus for rapid three-dimensional restoration
US7710032B2 (en) 2003-07-11 2010-05-04 Koninklijke Philips Electronics N.V. Encapsulation structure for display devices
CN101806586A (en) * 2010-04-21 2010-08-18 上海交通大学 Method and device for measuring section contour of sealing strip of vehicle based on image matching
WO2010107347A1 (en) * 2009-03-18 2010-09-23 Saab Ab Calculating time to go and size of an object based on scale correlation between images from an electro optical sensor
CN102427540A (en) * 2011-09-28 2012-04-25 深圳超多维光电子有限公司 Three-dimensional image filtering method and corresponding image processing device as well as terminal equipment
US8223208B2 (en) 2005-11-10 2012-07-17 Motion Analysis Corporation Device and method for calibrating an imaging device for generating three dimensional surface models of moving objects
WO2013116694A1 (en) * 2012-02-03 2013-08-08 The Trustees Of Dartmouth College Method and apparatus for determining tumor shift during surgery using a stereo-optical three-dimensional surface-mapping system
US20140063188A1 (en) * 2012-09-06 2014-03-06 Nokia Corporation Apparatus, a Method and a Computer Program for Image Processing
DE102013110615B3 (en) * 2013-09-26 2014-11-27 Sick Ag 3D camera according to the stereoscopic principle and method for acquiring depth maps
CN104205827A (en) * 2012-03-30 2014-12-10 富士胶片株式会社 Image processing device and method, and imaging device
WO2015082580A1 (en) * 2013-12-04 2015-06-11 Sirona Dental Systems Gmbh Method for reading a two-dimensional code by means of a camera used for three-dimensional optical measurement of objects
US9142026B2 (en) 2010-02-26 2015-09-22 Thomson Licensing Confidence map, method for generating the same and method for refining a disparity map
US9208690B2 (en) 2009-03-18 2015-12-08 Saab Ab Calculating time to go and size of an object based on scale correlation between images from an electro optical sensor
EP3048581A1 (en) * 2015-01-20 2016-07-27 Ricoh Company, Ltd. Image processing apparatus, system, image processing method, calibration method, and computer-readable recording medium
US9497380B1 (en) 2013-02-15 2016-11-15 Red.Com, Inc. Dense field imaging
US9591281B2 (en) 2010-12-22 2017-03-07 Thomson Licensing Apparatus and method for determining a disparity estimate
US9681115B2 (en) 2011-07-25 2017-06-13 Sony Corporation In-painting method for 3D stereoscopic views generation using left and right images and a depth map
US9704252B2 (en) 2014-03-07 2017-07-11 Thomson Licensing Method and apparatus for disparity estimation
US9756312B2 (en) 2014-05-01 2017-09-05 Ecole polytechnique fédérale de Lausanne (EPFL) Hardware-oriented dynamically adaptive disparity estimation algorithm and its real-time hardware
CN107590829A (en) * 2017-09-18 2018-01-16 西安电子科技大学 A kind of seed point pick-up method for being applied to the intensive cloud data registration of various visual angles
DE112013004103B4 (en) 2012-08-20 2018-09-13 Denso Corporation Method and apparatus for generating a disparity card
US10380753B1 (en) 2018-05-30 2019-08-13 Aimotive Kft. Method and apparatus for generating a displacement map of an input dataset pair
EP3536379A1 (en) * 2002-07-08 2019-09-11 Vision RT Limited Image processing system for use with a patient positioning device
US10568535B2 (en) 2008-05-22 2020-02-25 The Trustees Of Dartmouth College Surgical navigation with stereovision and associated methods
WO2020183312A1 (en) * 2019-03-09 2020-09-17 Corephotonics Ltd. System and method for dynamic stereoscopic calibration
US10884321B2 (en) 2017-01-12 2021-01-05 Corephotonics Ltd. Compact folded camera
US10904444B2 (en) 2013-06-13 2021-01-26 Corephotonics Ltd. Dual aperture zoom digital camera
US10904512B2 (en) 2017-09-06 2021-01-26 Corephotonics Ltd. Combined stereoscopic and phase detection depth mapping in a dual aperture camera
US10911740B2 (en) 2018-04-22 2021-02-02 Corephotonics Ltd. System and method for mitigating or preventing eye damage from structured light IR/NIR projector systems
US10917576B2 (en) 2015-08-13 2021-02-09 Corephotonics Ltd. Dual aperture zoom camera with video support and switching / non-switching dynamic control
US10935870B2 (en) 2015-12-29 2021-03-02 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US10951834B2 (en) 2017-10-03 2021-03-16 Corephotonics Ltd. Synthetically enlarged camera aperture
USRE48477E1 (en) 2012-11-28 2021-03-16 Corephotonics Ltd High resolution thin multi-aperture imaging systems
US10976527B2 (en) 2014-08-10 2021-04-13 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US10976567B2 (en) 2018-02-05 2021-04-13 Corephotonics Ltd. Reduced height penalty for folded camera
US11048060B2 (en) 2016-07-07 2021-06-29 Corephotonics Ltd. Linear ball guided voice coil motor for folded optic
US11125975B2 (en) 2015-01-03 2021-09-21 Corephotonics Ltd. Miniature telephoto lens module and a camera utilizing such a lens module
US11150447B2 (en) 2016-05-30 2021-10-19 Corephotonics Ltd. Rotational ball-guided voice coil motor
US11172127B2 (en) 2016-06-19 2021-11-09 Corephotonics Ltd. Frame synchronization in a dual-aperture camera system
CN113949796A (en) * 2020-07-16 2022-01-18 诺基亚技术有限公司 Apparatus, method and computer program for image capture
US11268830B2 (en) 2018-04-23 2022-03-08 Corephotonics Ltd Optical-path folding-element with an extended two degree of freedom rotation range
US11287668B2 (en) 2013-07-04 2022-03-29 Corephotonics Ltd. Thin dual-aperture zoom digital camera
US11287081B2 (en) 2019-01-07 2022-03-29 Corephotonics Ltd. Rotation mechanism with sliding joint
US11333955B2 (en) 2017-11-23 2022-05-17 Corephotonics Ltd. Compact folded camera structure
US11363180B2 (en) 2018-08-04 2022-06-14 Corephotonics Ltd. Switchable continuous display information system above camera
US11368631B1 (en) 2019-07-31 2022-06-21 Corephotonics Ltd. System and method for creating background blur in camera panning or motion
US11470235B2 (en) 2013-08-01 2022-10-11 Corephotonics Ltd. Thin multi-aperture imaging system with autofocus and methods for using same
US11510600B2 (en) 2012-01-04 2022-11-29 The Trustees Of Dartmouth College Method and apparatus for quantitative and depth resolved hyperspectral fluorescence and reflectance imaging for surgical guidance
US11564639B2 (en) 2013-02-13 2023-01-31 The Trustees Of Dartmouth College Method and apparatus for medical imaging using differencing of multiple fluorophores
US11637977B2 (en) 2020-07-15 2023-04-25 Corephotonics Ltd. Image sensors and sensing methods to obtain time-of-flight and phase detection information
US11635596B2 (en) 2018-08-22 2023-04-25 Corephotonics Ltd. Two-state zoom folded camera
US11659135B2 (en) 2019-10-30 2023-05-23 Corephotonics Ltd. Slow or fast motion video using depth information
US11671711B2 (en) 2017-03-15 2023-06-06 Corephotonics Ltd. Imaging system with panoramic scanning range
CN116503585A (en) * 2023-06-25 2023-07-28 广州视景医疗软件有限公司 Fusion function training control method and device based on cellular automaton
US11770618B2 (en) 2019-12-09 2023-09-26 Corephotonics Ltd. Systems and methods for obtaining a smart panoramic image
US11770609B2 (en) 2020-05-30 2023-09-26 Corephotonics Ltd. Systems and methods for obtaining a super macro image
US11798146B2 (en) 2020-08-06 2023-10-24 Apple Inc. Image fusion architecture
US11832018B2 (en) 2020-05-17 2023-11-28 Corephotonics Ltd. Image stitching in the presence of a full field of view reference image
US11910089B2 (en) 2020-07-15 2024-02-20 Corephotonics Lid. Point of view aberrations correction in a scanning folded camera
US11937951B2 (en) 2013-02-13 2024-03-26 The Trustees Of Dartmouth College Method and apparatus for medical imaging using differencing of multiple fluorophores
US11949976B2 (en) 2019-12-09 2024-04-02 Corephotonics Ltd. Systems and methods for obtaining a smart panoramic image
US11946775B2 (en) 2020-07-31 2024-04-02 Corephotonics Ltd. Hall sensor—magnet geometry for large stroke linear position sensing
CN117853754A (en) * 2024-02-20 2024-04-09 蚂蚁云创数字科技(北京)有限公司 Image processing method and device
US11968453B2 (en) 2020-08-12 2024-04-23 Corephotonics Ltd. Optical image stabilization in a scanning folded camera
US12003874B2 (en) 2023-08-16 2024-06-04 Corephotonics Ltd. Image sensors and sensing methods to obtain Time-of-Flight and phase detection information

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019144281A1 (en) * 2018-01-23 2019-08-01 深圳市大疆创新科技有限公司 Surface pattern determining method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996041482A1 (en) * 1995-06-07 1996-12-19 Futuretel, Inc. Hybrid hierarchical/full-search mpeg encoder motion estimation

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996041482A1 (en) * 1995-06-07 1996-12-19 Futuretel, Inc. Hybrid hierarchical/full-search mpeg encoder motion estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AHEARN S C: "Combining Laplacian images of different spatial frequencies (scales): implications for remote sensing image analysis" IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, NOV. 1988, USA, vol. 26, no. 6, pages 826-831, XP002130351 ISSN: 0196-2892 *
PALANIAPPAN K ET AL: "Robust stereo analysis" PROCEEDINGS INTERNATIONAL SYMPOSIUM ON COMPUTER VISION (CAT. NO.95TB100006), PROCEEDINGS OF INTERNATIONAL SYMPOSIUM ON COMPUTER VISION - ISCV, CORAL GABLES, FL, USA, 21-23 NOV. 1995, pages 175-181, XP002130350 1995, Los Alamitos, CA, USA, IEEE Comput. Soc. Press, USA ISBN: 0-8186-7190-4 *

Cited By (146)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1430534A4 (en) * 2001-08-23 2009-07-01 Univ Washington Image acquisition with depth enhancement
EP1430534A1 (en) * 2001-08-23 2004-06-23 University of Washington Image acquisition with depth enhancement
EP3536379A1 (en) * 2002-07-08 2019-09-11 Vision RT Limited Image processing system for use with a patient positioning device
US6996339B2 (en) 2002-10-10 2006-02-07 Olympus Corporation Three-dimensional photographing apparatus and three-dimensional photographing method, and stereo adapter
EP1408702A3 (en) * 2002-10-10 2004-11-17 Olympus Corporation Three-dimensional photographing apparatus, three-dimensional photographing method and stereo adapter
EP1408702A2 (en) * 2002-10-10 2004-04-14 Olympus Corporation Three-dimensional photographing apparatus, three-dimensional photographing method and stereo adapter
US7710032B2 (en) 2003-07-11 2010-05-04 Koninklijke Philips Electronics N.V. Encapsulation structure for display devices
GB2417628A (en) * 2004-08-26 2006-03-01 Sharp Kk Creating a new image from two images of a scene
WO2007041690A2 (en) 2005-10-04 2007-04-12 Alexander Eugene J Device for generating three dimensional surface models of moving objects
EP1946567A2 (en) * 2005-10-04 2008-07-23 Eugene J. Alexander Device for generating three dimensional surface models of moving objects
EP1946567A4 (en) * 2005-10-04 2009-02-18 Eugene J Alexander Device for generating three dimensional surface models of moving objects
US8848035B2 (en) 2005-10-04 2014-09-30 Motion Analysis Corporation Device for generating three dimensional surface models of moving objects
US8223208B2 (en) 2005-11-10 2012-07-17 Motion Analysis Corporation Device and method for calibrating an imaging device for generating three dimensional surface models of moving objects
EP2104365A1 (en) * 2006-11-28 2009-09-23 Prefixa Vision Systems, S.A. De C.V. Method and apparatus for rapid three-dimensional restoration
EP2104365A4 (en) * 2006-11-28 2011-11-23 Prefixa Vision Systems S A De C V Method and apparatus for rapid three-dimensional restoration
WO2008066368A1 (en) * 2006-11-28 2008-06-05 Prefixa Vision Systems, S.A. De C.V. Method and apparatus for rapid three-dimensional restoration
US10568535B2 (en) 2008-05-22 2020-02-25 The Trustees Of Dartmouth College Surgical navigation with stereovision and associated methods
US11129562B2 (en) 2008-05-22 2021-09-28 The Trustees Of Dartmouth College Surgical navigation with stereovision and associated methods
WO2010107347A1 (en) * 2009-03-18 2010-09-23 Saab Ab Calculating time to go and size of an object based on scale correlation between images from an electro optical sensor
US9208690B2 (en) 2009-03-18 2015-12-08 Saab Ab Calculating time to go and size of an object based on scale correlation between images from an electro optical sensor
US8774457B2 (en) 2009-03-18 2014-07-08 Saab Ab Calculating time to go and size of an object based on scale correlation between images from an electro optical sensor
US9142026B2 (en) 2010-02-26 2015-09-22 Thomson Licensing Confidence map, method for generating the same and method for refining a disparity map
CN101806586A (en) * 2010-04-21 2010-08-18 上海交通大学 Method and device for measuring section contour of sealing strip of vehicle based on image matching
US9591281B2 (en) 2010-12-22 2017-03-07 Thomson Licensing Apparatus and method for determining a disparity estimate
US9681115B2 (en) 2011-07-25 2017-06-13 Sony Corporation In-painting method for 3D stereoscopic views generation using left and right images and a depth map
CN102427540B (en) * 2011-09-28 2014-04-09 深圳超多维光电子有限公司 Three-dimensional image filtering method and corresponding image processing device as well as terminal equipment
CN102427540A (en) * 2011-09-28 2012-04-25 深圳超多维光电子有限公司 Three-dimensional image filtering method and corresponding image processing device as well as terminal equipment
US11510600B2 (en) 2012-01-04 2022-11-29 The Trustees Of Dartmouth College Method and apparatus for quantitative and depth resolved hyperspectral fluorescence and reflectance imaging for surgical guidance
US11857317B2 (en) 2012-01-04 2024-01-02 The Trustees Of Dartmouth College Method and apparatus for quantitative and depth resolved hyperspectral fluorescence and reflectance imaging for surgical guidance
WO2013116694A1 (en) * 2012-02-03 2013-08-08 The Trustees Of Dartmouth College Method and apparatus for determining tumor shift during surgery using a stereo-optical three-dimensional surface-mapping system
CN104205827B (en) * 2012-03-30 2016-03-16 富士胶片株式会社 Image processing apparatus and method and camera head
CN104205827A (en) * 2012-03-30 2014-12-10 富士胶片株式会社 Image processing device and method, and imaging device
DE112013004103B4 (en) 2012-08-20 2018-09-13 Denso Corporation Method and apparatus for generating a disparity card
CN104662896A (en) * 2012-09-06 2015-05-27 诺基亚技术有限公司 An apparatus, a method and a computer program for image processing
JP2015536057A (en) * 2012-09-06 2015-12-17 ノキア テクノロジーズ オーユー Image processing apparatus, method, and computer program
US20140063188A1 (en) * 2012-09-06 2014-03-06 Nokia Corporation Apparatus, a Method and a Computer Program for Image Processing
WO2014037603A1 (en) * 2012-09-06 2014-03-13 Nokia Corporation An apparatus, a method and a computer program for image processing
USRE49256E1 (en) 2012-11-28 2022-10-18 Corephotonics Ltd. High resolution thin multi-aperture imaging systems
USRE48697E1 (en) 2012-11-28 2021-08-17 Corephotonics Ltd. High resolution thin multi-aperture imaging systems
USRE48945E1 (en) 2012-11-28 2022-02-22 Corephotonics Ltd. High resolution thin multi-aperture imaging systems
USRE48477E1 (en) 2012-11-28 2021-03-16 Corephotonics Ltd High resolution thin multi-aperture imaging systems
US11564639B2 (en) 2013-02-13 2023-01-31 The Trustees Of Dartmouth College Method and apparatus for medical imaging using differencing of multiple fluorophores
US11937951B2 (en) 2013-02-13 2024-03-26 The Trustees Of Dartmouth College Method and apparatus for medical imaging using differencing of multiple fluorophores
US10547828B2 (en) 2013-02-15 2020-01-28 Red.Com, Llc Dense field imaging
US10277885B1 (en) 2013-02-15 2019-04-30 Red.Com, Llc Dense field imaging
US10939088B2 (en) 2013-02-15 2021-03-02 Red.Com, Llc Computational imaging device
US9769365B1 (en) 2013-02-15 2017-09-19 Red.Com, Inc. Dense field imaging
US9497380B1 (en) 2013-02-15 2016-11-15 Red.Com, Inc. Dense field imaging
US10904444B2 (en) 2013-06-13 2021-01-26 Corephotonics Ltd. Dual aperture zoom digital camera
US11838635B2 (en) 2013-06-13 2023-12-05 Corephotonics Ltd. Dual aperture zoom digital camera
US11470257B2 (en) 2013-06-13 2022-10-11 Corephotonics Ltd. Dual aperture zoom digital camera
US11287668B2 (en) 2013-07-04 2022-03-29 Corephotonics Ltd. Thin dual-aperture zoom digital camera
US11614635B2 (en) 2013-07-04 2023-03-28 Corephotonics Ltd. Thin dual-aperture zoom digital camera
US11852845B2 (en) 2013-07-04 2023-12-26 Corephotonics Ltd. Thin dual-aperture zoom digital camera
US11716535B2 (en) 2013-08-01 2023-08-01 Corephotonics Ltd. Thin multi-aperture imaging system with auto-focus and methods for using same
US11856291B2 (en) 2013-08-01 2023-12-26 Corephotonics Ltd. Thin multi-aperture imaging system with auto-focus and methods for using same
US11991444B2 (en) 2013-08-01 2024-05-21 Corephotonics Ltd. Thin multi-aperture imaging system with auto-focus and methods for using same
US11470235B2 (en) 2013-08-01 2022-10-11 Corephotonics Ltd. Thin multi-aperture imaging system with autofocus and methods for using same
US9473762B2 (en) 2013-09-26 2016-10-18 Sick Ag 3D camera in accordance with the stereoscopic principle and method of detecting depth maps
DE102013110615B3 (en) * 2013-09-26 2014-11-27 Sick Ag 3D camera according to the stereoscopic principle and method for acquiring depth maps
JP2016539334A (en) * 2013-12-04 2016-12-15 シロナ・デンタル・システムズ・ゲゼルシャフト・ミット・ベシュレンクテル・ハフツング A method of reading a two-dimensional code with a camera used for three-dimensional optical measurement of an object
WO2015082580A1 (en) * 2013-12-04 2015-06-11 Sirona Dental Systems Gmbh Method for reading a two-dimensional code by means of a camera used for three-dimensional optical measurement of objects
US9704252B2 (en) 2014-03-07 2017-07-11 Thomson Licensing Method and apparatus for disparity estimation
US9756312B2 (en) 2014-05-01 2017-09-05 Ecole polytechnique fédérale de Lausanne (EPFL) Hardware-oriented dynamically adaptive disparity estimation algorithm and its real-time hardware
US11262559B2 (en) 2014-08-10 2022-03-01 Corephotonics Ltd Zoom dual-aperture camera with folded lens
US11982796B2 (en) 2014-08-10 2024-05-14 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US11042011B2 (en) 2014-08-10 2021-06-22 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US11543633B2 (en) 2014-08-10 2023-01-03 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US11703668B2 (en) 2014-08-10 2023-07-18 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US11002947B2 (en) 2014-08-10 2021-05-11 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US10976527B2 (en) 2014-08-10 2021-04-13 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
US11125975B2 (en) 2015-01-03 2021-09-21 Corephotonics Ltd. Miniature telephoto lens module and a camera utilizing such a lens module
US11994654B2 (en) 2015-01-03 2024-05-28 Corephotonics Ltd. Miniature telephoto lens module and a camera utilizing such a lens module
US10621694B2 (en) 2015-01-20 2020-04-14 Ricoh Company, Ltd. Image processing apparatus, system, image processing method, calibration method, and computer-readable recording medium
EP3048581A1 (en) * 2015-01-20 2016-07-27 Ricoh Company, Ltd. Image processing apparatus, system, image processing method, calibration method, and computer-readable recording medium
US10186034B2 (en) 2015-01-20 2019-01-22 Ricoh Company, Ltd. Image processing apparatus, system, image processing method, calibration method, and computer-readable recording medium
US11350038B2 (en) 2015-08-13 2022-05-31 Corephotonics Ltd. Dual aperture zoom camera with video support and switching / non-switching dynamic control
US11546518B2 (en) 2015-08-13 2023-01-03 Corephotonics Ltd. Dual aperture zoom camera with video support and switching / non-switching dynamic control
US10917576B2 (en) 2015-08-13 2021-02-09 Corephotonics Ltd. Dual aperture zoom camera with video support and switching / non-switching dynamic control
US11770616B2 (en) 2015-08-13 2023-09-26 Corephotonics Ltd. Dual aperture zoom camera with video support and switching / non-switching dynamic control
US11726388B2 (en) 2015-12-29 2023-08-15 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US11599007B2 (en) 2015-12-29 2023-03-07 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US11392009B2 (en) 2015-12-29 2022-07-19 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US10935870B2 (en) 2015-12-29 2021-03-02 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US11314146B2 (en) 2015-12-29 2022-04-26 Corephotonics Ltd. Dual-aperture zoom digital camera with automatic adjustable tele field of view
US11650400B2 (en) 2016-05-30 2023-05-16 Corephotonics Ltd. Rotational ball-guided voice coil motor
US11977210B2 (en) 2016-05-30 2024-05-07 Corephotonics Ltd. Rotational ball-guided voice coil motor
US11150447B2 (en) 2016-05-30 2021-10-19 Corephotonics Ltd. Rotational ball-guided voice coil motor
US11172127B2 (en) 2016-06-19 2021-11-09 Corephotonics Ltd. Frame synchronization in a dual-aperture camera system
US11689803B2 (en) 2016-06-19 2023-06-27 Corephotonics Ltd. Frame synchronization in a dual-aperture camera system
US11977270B2 (en) 2016-07-07 2024-05-07 Corephotonics Lid. Linear ball guided voice coil motor for folded optic
US11048060B2 (en) 2016-07-07 2021-06-29 Corephotonics Ltd. Linear ball guided voice coil motor for folded optic
US11550119B2 (en) 2016-07-07 2023-01-10 Corephotonics Ltd. Linear ball guided voice coil motor for folded optic
US11693297B2 (en) 2017-01-12 2023-07-04 Corephotonics Ltd. Compact folded camera
US10884321B2 (en) 2017-01-12 2021-01-05 Corephotonics Ltd. Compact folded camera
US11815790B2 (en) 2017-01-12 2023-11-14 Corephotonics Ltd. Compact folded camera
US11809065B2 (en) 2017-01-12 2023-11-07 Corephotonics Ltd. Compact folded camera
US11671711B2 (en) 2017-03-15 2023-06-06 Corephotonics Ltd. Imaging system with panoramic scanning range
US10904512B2 (en) 2017-09-06 2021-01-26 Corephotonics Ltd. Combined stereoscopic and phase detection depth mapping in a dual aperture camera
CN107590829A (en) * 2017-09-18 2018-01-16 西安电子科技大学 A kind of seed point pick-up method for being applied to the intensive cloud data registration of various visual angles
US11695896B2 (en) 2017-10-03 2023-07-04 Corephotonics Ltd. Synthetically enlarged camera aperture
US10951834B2 (en) 2017-10-03 2021-03-16 Corephotonics Ltd. Synthetically enlarged camera aperture
US11619864B2 (en) 2017-11-23 2023-04-04 Corephotonics Ltd. Compact folded camera structure
US11333955B2 (en) 2017-11-23 2022-05-17 Corephotonics Ltd. Compact folded camera structure
US11809066B2 (en) 2017-11-23 2023-11-07 Corephotonics Ltd. Compact folded camera structure
US11686952B2 (en) 2018-02-05 2023-06-27 Corephotonics Ltd. Reduced height penalty for folded camera
US10976567B2 (en) 2018-02-05 2021-04-13 Corephotonics Ltd. Reduced height penalty for folded camera
US10911740B2 (en) 2018-04-22 2021-02-02 Corephotonics Ltd. System and method for mitigating or preventing eye damage from structured light IR/NIR projector systems
US11733064B1 (en) 2018-04-23 2023-08-22 Corephotonics Ltd. Optical-path folding-element with an extended two degree of freedom rotation range
US11268829B2 (en) 2018-04-23 2022-03-08 Corephotonics Ltd Optical-path folding-element with an extended two degree of freedom rotation range
US11976949B2 (en) 2018-04-23 2024-05-07 Corephotonics Lid. Optical-path folding-element with an extended two degree of freedom rotation range
US11867535B2 (en) 2018-04-23 2024-01-09 Corephotonics Ltd. Optical-path folding-element with an extended two degree of freedom rotation range
US11359937B2 (en) 2018-04-23 2022-06-14 Corephotonics Ltd. Optical-path folding-element with an extended two degree of freedom rotation range
US11268830B2 (en) 2018-04-23 2022-03-08 Corephotonics Ltd. Optical-path folding-element with an extended two degree of freedom rotation range
US10380753B1 (en) 2018-05-30 2019-08-13 Aimotive Kft. Method and apparatus for generating a displacement map of an input dataset pair
WO2019229486A1 (en) 2018-05-30 2019-12-05 Aimotive Kft. Generating a displacement map of an input dataset pair of image or audio data
US11363180B2 (en) 2018-08-04 2022-06-14 Corephotonics Ltd. Switchable continuous display information system above camera
US11635596B2 (en) 2018-08-22 2023-04-25 Corephotonics Ltd. Two-state zoom folded camera
US11852790B2 (en) 2018-08-22 2023-12-26 Corephotonics Ltd. Two-state zoom folded camera
US11287081B2 (en) 2019-01-07 2022-03-29 Corephotonics Ltd. Rotation mechanism with sliding joint
US11315276B2 (en) 2019-03-09 2022-04-26 Corephotonics Ltd. System and method for dynamic stereoscopic calibration
US11527006B2 (en) 2019-03-09 2022-12-13 Corephotonics Ltd. System and method for dynamic stereoscopic calibration
WO2020183312A1 (en) * 2019-03-09 2020-09-17 Corephotonics Ltd. System and method for dynamic stereoscopic calibration
US11368631B1 (en) 2019-07-31 2022-06-21 Corephotonics Ltd. System and method for creating background blur in camera panning or motion
US11659135B2 (en) 2019-10-30 2023-05-23 Corephotonics Ltd. Slow or fast motion video using depth information
US11770618B2 (en) 2019-12-09 2023-09-26 Corephotonics Ltd. Systems and methods for obtaining a smart panoramic image
US11949976B2 (en) 2019-12-09 2024-04-02 Corephotonics Ltd. Systems and methods for obtaining a smart panoramic image
US11832018B2 (en) 2020-05-17 2023-11-28 Corephotonics Ltd. Image stitching in the presence of a full field of view reference image
US11962901B2 (en) 2020-05-30 2024-04-16 Corephotonics Ltd. Systems and methods for obtaining a super macro image
US11770609B2 (en) 2020-05-30 2023-09-26 Corephotonics Ltd. Systems and methods for obtaining a super macro image
US11637977B2 (en) 2020-07-15 2023-04-25 Corephotonics Ltd. Image sensors and sensing methods to obtain time-of-flight and phase detection information
US11832008B2 (en) 2020-07-15 2023-11-28 Corephotonics Ltd. Image sensors and sensing methods to obtain time-of-flight and phase detection information
US11910089B2 (en) 2020-07-15 2024-02-20 Corephotonics Ltd. Point of view aberrations correction in a scanning folded camera
CN113949796A (en) * 2020-07-16 2022-01-18 诺基亚技术有限公司 Apparatus, method and computer program for image capture
US11946775B2 (en) 2020-07-31 2024-04-02 Corephotonics Ltd. Hall sensor—magnet geometry for large stroke linear position sensing
US11798146B2 (en) 2020-08-06 2023-10-24 Apple Inc. Image fusion architecture
US11968453B2 (en) 2020-08-12 2024-04-23 Corephotonics Ltd. Optical image stabilization in a scanning folded camera
US12007668B2 (en) 2021-01-27 2024-06-11 Corephotonics Ltd. Split screen feature for macro photography
US12007671B2 (en) 2022-06-07 2024-06-11 Corephotonics Ltd. Systems and cameras for tilting a focal plane of a super-macro image
US12007582B2 (en) 2023-05-16 2024-06-11 Corephotonics Ltd. Reduced height penalty for folded camera
US12007537B2 (en) 2023-06-20 2024-06-11 Corephotonics Ltd. Zoom dual-aperture camera with folded lens
CN116503585B (en) * 2023-06-25 2023-09-05 广州视景医疗软件有限公司 Fusion function training control method and device based on cellular automaton
CN116503585A (en) * 2023-06-25 2023-07-28 广州视景医疗软件有限公司 Fusion function training control method and device based on cellular automaton
US12003874B2 (en) 2023-08-16 2024-06-04 Corephotonics Ltd. Image sensors and sensing methods to obtain Time-of-Flight and phase detection information
US12007672B2 (en) 2023-10-01 2024-06-11 Corephotonics Ltd. Compact folded camera structure
CN117853754A (en) * 2024-02-20 2024-04-09 蚂蚁云创数字科技(北京)有限公司 Image processing method and device

Also Published As

Publication number Publication date
GB9823689D0 (en) 1998-12-23
EP1125249A2 (en) 2001-08-22
WO2000027131A3 (en) 2000-10-12
AU1054300A (en) 2000-05-22

Similar Documents

Publication Publication Date Title
EP1125249A2 (en) Improved methods and apparatus for 3-d imaging
Lhuillier et al. A quasi-dense approach to surface reconstruction from uncalibrated images
Fitzgibbon et al. Automatic 3D model acquisition and generation of new images from video sequences
Pollefeys et al. Automated reconstruction of 3D scenes from sequences of images
Koch et al. Multi viewpoint stereo from uncalibrated video sequences
US8326025B2 (en) Method for determining a depth map from images, device for determining a depth map
Galliani et al. Gipuma: Massively parallel multi-view stereo reconstruction
Coorg et al. Acquisition of a large pose-mosaic dataset
CN103649998A (en) Method for determining a parameter set designed for determining the pose of a camera and/or for determining a three-dimensional structure of the at least one real object
EP0842498A2 (en) Method and system for image combination using a parallax-based technique
Heyden et al. Multiple view geometry
Kuschk Large scale urban reconstruction from remote sensing imagery
Pagani et al. Dense 3D Point Cloud Generation from Multiple High-resolution Spherical Images.
Cheng et al. Multi-view 3d reconstruction of a texture-less smooth surface of unknown generic reflectance
Alsadik Guided close range photogrammetry for 3D modelling of cultural heritage sites
Fua et al. Reconstructing surfaces from unstructured 3d points
Pollefeys et al. Automatic 3D modeling from image sequences
Lhuillier Toward flexible 3d modeling using a catadioptric camera
Jokinen Area-based matching for simultaneous registration of multiple 3-D profile maps
Koch et al. Automatic 3d model acquisition from uncalibrated image sequences
Remondino 3D reconstruction of static human body with a digital camera
Wong et al. 3D object model reconstruction from image sequence based on photometric consistency in volume space
Martins et al. Camera calibration using reflections in planar mirrors and object reconstruction using volume carving method
Özuysal Manual and auto calibration of stereo camera systems
Fua et al. From points to surfaces

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref country code: AU

Ref document number: 2000 10543

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

WWE Wipo information: entry into national phase

Ref document number: 1999954099

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1999954099

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWW Wipo information: withdrawn in national office

Ref document number: 1999954099

Country of ref document: EP