GB2623119A - A computer vision method and system

A computer vision method and system

Info

Publication number
GB2623119A
Authority
GB
United Kingdom
Prior art keywords
normal map
reconstruction
stereo
camera
estimate
Prior art date
Legal status
Pending
Application number
GB2214751.6A
Other versions
GB202214751D0 (en)
Inventor
Logothetis Fotios
Mecca Roberto
Budvytis Ignas
Cipolla Roberto
Current Assignee
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date
Filing date
Publication date
Application filed by Toshiba Corp filed Critical Toshiba Corp
Priority to GB2214751.6A priority Critical patent/GB2623119A/en
Publication of GB202214751D0 publication Critical patent/GB202214751D0/en
Priority to JP2023139292A priority patent/JP2024055772A/en
Publication of GB2623119A publication Critical patent/GB2623119A/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/50: Depth or shape recovery
    • G06T7/55: Depth or shape recovery from multiple images
    • G06T7/593: Depth or shape recovery from multiple images from stereo images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Generation (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Processing (AREA)

Abstract

A computer vision method for generating a three-dimensional reconstruction of an object comprises receiving a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set comprising at least one image taken from a camera in a first position using illumination from different directions and the second set comprising at least one image taken from a camera in a second position using illumination from different directions. A first normal map of the object is generated using the first set of photometric stereo images and a second normal map of the object is generated using the second set of photometric stereo images, and a stereo estimate of the shape of the object is generated by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map. The normal maps are then used with the stereo estimate of the shape of the object to generate a reconstruction of the object.

Description

A Computer Vision Method and System
FIELD
Embodiments are concerned with a computer vision system and method for performing 3D imaging of an object.
BACKGROUND
Many computer vision tasks require retrieving an accurate 3D reconstruction of objects from the way they reflect light. However, reconstructing the 3D geometry is a challenge as global illumination effects such as cast shadows, self-reflections and ambient light come into play, especially for specular surfaces.
Photometric Stereo is a long standing problem in computer vision. Recent methods have achieved impressive normal estimation accuracy on both real and synthetic datasets.
However, the progress in the quality and practical usefulness of the estimated shape has been much less convincing, since it is highly limited due to inaccuracies in retrieving the global geometry, especially when dealing with general light reflections.
When photometric stereo techniques rely only on acquisition of images from a single camera position (monocular Photometric Stereo), the accuracy of the reconstruction is usually dependent on a rough estimation of the distance between the camera and target object. For example, prior to or following acquisition of the images, an initial estimate of the object geometry is obtained and a depth map is initialised based on the initial object geometry.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 is a schematic of a system in accordance with an example useful for understanding of the current invention; Figure 2 is a schematic showing an arrangement of a camera and light sources for performing 3D imaging of an object; Figure 3 is a high level schematic of a method for recovering surface normals from a set of photometric stereo images of an object or a scene; Figure 4 is a schematic showing an arrangement of a camera and light sources for performing 3D imaging of an object; Figure 5 is a flow diagram of a method for 3D reconstruction according to an embodiment; Figure 6 is a schematic of a CNN which can be used for retrieving normal maps from observation maps; Figure 7 illustrates the working principles of the reconstruction method, according to an embodiment; Figure 8 illustrates steps of the method, according to an embodiment; Figure 9 illustrates results of the method compared to monocular (single view) photometric stereo; and Figure 10 is a schematic of a system in accordance with an embodiment.
DETAILED DESCRIPTION
In a first aspect, a computer vision method is provided for generating a three-dimensional reconstruction of an object, the method comprising: receiving a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set comprising at least one image taken from a camera in a first position using illumination from different directions, the second set comprising at least one image taken from a camera in a second position using illumination from different directions; generating a first normal map of the object using the first set of photometric stereo images; generating a second normal map of the object using the second set of photometric stereo images; determining a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and using the normal maps with the stereo estimate of the shape of the object to generate a reconstruction of the object.
The disclosed method addresses a technical problem tied to computer technology and arising in the realm of computing, namely the technical problem of generating a three-dimensional reconstruction of an object or a scene. The disclosed method solves this technical problem of how to improve the quality of the reconstruction. The improvement is provided by providing binocular based photometric stereo which merges the strengths of Photometric Stereo in estimating dense local shape variation and Structure from Motion in providing sparse but accurate depth estimates.
The object may be placed within the depth range of the camera of the capturing apparatus, and the distance (i.e. depth z) between the object and the camera is approximately measured with a ruler. The depth map is then initialised by setting the depth of all points to a constant value, as estimated with the ruler or another method of estimating the mean distance from the camera to the object. Other methods of initialising the depth could be used, for example, using a CAD model, a Kinect-type depth sensor, etc. The addition of any further sophistication in the estimation of the initial object geometry could reduce the run time of the method, for example, by a single iteration.
In summary, an intrinsic ambiguity in photometric stereo comes from determining the rough depth of the target object from the camera. This is crucial especially for real world applications where the scale factor has to be determined in order to ensure measurability of the geometry.
To at least partially address this issue, a binocular based photometric stereo is discussed herein which merges the strengths of the Photometric Stereo in estimating dense local shape variation and the strengths of Structure from Motion in providing sparse but accurate depth estimates.
Because the embodiments described herein perform matching on the normals, instead of on the initial two sets of images, more reliable stereo matching can be performed, which is robust even when dealing with objects which are textureless or shiny. As a result, constraining the reconstruction of the object to the pairwise correspondences of the stereo matching step can provide a more accurate representation of the object geometry, which in turn provides a better initial estimate for the iterative procedure of updating the object geometry to converge to a single object geometry.
In an embodiment, the first normal map and the second normal map are generated using an estimate of the shape of the object to recalculate light distribution for near field effects of the illumination on the object.
Once a first reconstruction of the object has been generated, the method comprises recalculating the light distribution due to near field effects using the first reconstruction of the object, re-calculating the first normal map and the second normal map from the remodelled light distribution and producing a further reconstruction of the object.
The above can be used in an iterative manner, wherein re-calculating the light distribution due to near field effects comprises: (a) using the reconstruction of the object to re-calculate the first and second normal maps from the remodelled light distribution; (b) determining a further estimate of the shape of the object by performing stereo matching between patches of normals in the recalculated first normal map and patches of normals in the recalculated second normal map; (c) producing a further reconstruction of the shape from the at least one of the first and second recalculated normal maps; and repeating (a) to (c) until the further reconstructions of the object converge.
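As an illustration only, the iterative procedure of steps (a) to (c) might be organised as in the following Python sketch. None of the helper functions exist in the disclosure; estimate_normals, match_normal_patches, integrate_normals, fuse_reconstructions and converged are hypothetical placeholders for the corresponding operations described in this specification.

```python
# Hypothetical orchestration of the iterative refinement described above.
# All helper functions are placeholders for steps detailed later in the text.

def reconstruct(images_left, images_right, lights, z_init, max_iters=5, tol=1e-3):
    """Alternate between normal estimation, stereo matching and constrained
    integration until successive reconstructions converge."""
    shape_estimate = z_init          # e.g. a constant depth map measured with a ruler
    prev_shape = None
    for _ in range(max_iters):
        # (a) re-model the near-field light distribution using the current shape
        #     and recompute both normal maps
        normals_l = estimate_normals(images_left, lights, shape_estimate)
        normals_r = estimate_normals(images_right, lights, shape_estimate)
        # (b) stereo match patches of normals to obtain sparse absolute depth points
        stereo_points = match_normal_patches(normals_l, normals_r, shape_estimate)
        # (c) integrate each normal map constrained by the stereo points, then fuse
        recon_l = integrate_normals(normals_l, stereo_points)
        recon_r = integrate_normals(normals_r, stereo_points)
        shape_estimate = fuse_reconstructions(recon_l, recon_r)
        if prev_shape is not None and converged(shape_estimate, prev_shape, tol):
            break
        prev_shape = shape_estimate
    return shape_estimate
```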
In an embodiment, using the normal maps with the first estimate of the shape of the object to generate a reconstruction of the object comprises: integrating the first normal map with a constraint of the first estimate of the shape to produce a first reconstruction; integrating the second normal map with a constraint of the first estimate of the shape to produce a second reconstruction; and combining the first and second reconstructions to produce a fused reconstruction which is the reconstruction of the object.
The fused reconstruction may be produced using a Poisson solver.
Performing stereo matching on the first normal map and the second normal map may comprise: selecting at least one pixel group on the first normal map, and searching for a matched pixel group in the second normal map by scanning across the second normal map for matches.
To reduce the amount of data processed while looking for matches, scanning across the second normal map for matches may be performed along an epipolar line. Searching for a matched pixel group may also be constrained by the current estimated reconstruction of the object. Thus, as the shape of the object converges, the searching for a match can be made more efficient.
In a further embodiment, performing stereo matching on the first normal map and the second normal map comprises: performing patch warping on the at least one pixel group of the first normal map; and determining the corresponding pixel group of the second normal map using the warped at least one pixel group of the first normal map.
In a further embodiment, the searching for a matched pixel group in the second normal map by scanning across the second normal map for matches is used to produce a first partial stereo estimate, the method further comprising: selecting at least one pixel group on the second normal map, searching for a matched pixel group in the first normal map by scanning across the first normal map for matches to produce a second partial stereo estimate; combining the first and second partial stereo estimates to form a stereo estimate of the shape, wherein points from the first partial stereo estimate and the second stereo estimate which do not coincide are discarded.
The above allows two stereo maps to be generated which are combined to provide robustness to errors since points on the partial stereo map which indicate different depths for the same point can be discarded.
The first set of photometric stereo images comprises a first plurality of images taken from a camera in the first position using illumination from different directions, and the second set comprises a second plurality of images taken from a camera in the second position using illumination from different directions.
In an embodiment, generating the first normal map comprises inputting information representing the first set of photometric stereo images, an estimate of the shape of the object and positional information of the light sources into a neural network which has been trained to output the first normal map.
For example, the information representing the first set of photometric stereo images, the estimate of the shape of the object and the positional information of the light sources is provided in the form of an observation map, wherein an observation map is generated for each pixel of the camera, each observation map comprising a projection of the lighting directions onto a 2D plane, the lighting directions for each pixel being derived from each photometric stereo image.
In a further embodiment, a system for generating a three-dimensional (3D) reconstruction of an object is provided, the system comprising an interface and a processor: the interface having an image input and configured to receive a set of photometric stereo images of an object, the set of photometric stereo images comprising a plurality of images using illumination from different directions using one or more light sources, the processor being configured to: receive a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set comprising at least one image taken from a camera in a first position using illumination from different directions, the second set comprising at least one image taken from a camera in a second position using illumination from different directions; generate a first normal map of the object using the first set of photometric stereo images; generate a second normal map of the object using the second set of photometric stereo images; determine a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and use the normal maps with the stereo estimate of the shape of the object to generate a reconstruction of the object.
In the above system, there may be a first camera and a second camera, the first camera being configured to produce the first set of images and the second camera being configured to produce the second set of images. In further embodiments, a single camera is used which is moved between the first position and the second position.
A carrier medium may be provided carrying computer readable instructions adapted to cause a computer to perform the method as described above.
Figure 1 shows a schematic of a system which may be used for capturing three-dimensional (3D) image data of an object, and reconstructing the object. As used herein, the term "object" will be used to denote what is being imaged. However, it is to be understood that this term could cover a plurality of objects, a scene, or a combination of objects and a scene et cetera.
The 3D image data of object 10 is captured using apparatus 11. Further details of apparatus 11 will be provided in reference to Figure 2.
The 3D image data captured by apparatus 11 is provided to computer 12, where it is processed. In Figure 1, the computer 12 is shown as a desktop computer, however, it is to be understood that it can be any processor, for example, a distributed processor or a processor of a mobile phone et cetera. Details of an example processor will be described further below in this description with reference to Figure 10.
The system of Figure 1 can be provided in an existing hardware setup and be used for quality control of the processes of said setup. Such setups include but are not limited to 3D printing setups and industrial pipelines. For example, the system of Figure 1 can be provided in a 3D printer setup, where the system is used to perform quality control of the printing process. More specifically, the system can be used to capture photometric stereo images of intermediate results of the printing process, and confirm the proper execution of the printing process. Furthermore, the system of Figure 1 can be implemented as a handheld device used by a user to obtain 3D models of everyday objects.
Figure 2 shows an exemplary arrangement of apparatus 11 of Figure 1 which can be used for photometric stereo. Figure 2 shows a mount that holds camera 20, comprising a plurality of pixels, and a plurality of light sources 21 in a fixed relationship with one another and with camera 20. This arrangement of apparatus 11 allows camera 20 and the plurality of light sources 21 to be moved together while maintaining a constant separation from each other.
In the specific exemplary arrangement of apparatus 11, the light sources 21 are provided surrounding camera 20. However, it is understood that the light sources 21 and camera 20 can be provided in a different arrangement. Camera 20 is used together with light sources 21 to obtain photometric stereo data of object 10. Individual light sources 21 are activated one after another to allow camera 20 to capture photometric stereo data.
In a particular arrangement of apparatus 11, a FLEA 3.2 Megapixel camera provided with an 8 mm lens is used as camera 20. The camera 20 is rigidly attached to a printed circuit board. The printed circuit board further comprises 16 bright white LEDs, used as light sources 21, arranged co-planar with the image plane and provided surrounding camera 20 at a maximum distance of 6.5 centimetres.
Apparatus 11 may be used to capture photometric stereo images of an object 10. The object 10 is positioned in front of the camera 20 within the depth range of the camera 20.
The depth range, also referred to as the depth of field (DOF), is a term used to denote the range of distances between the camera 20 and the object 10 at which the object 10 is in focus. If an object 10 is too close to or too far from the camera 20, i.e. outside the depth range of the camera 20, the object goes out of focus and details cannot be resolved. For example, the camera 20, equipped with an 8 mm lens, may have a depth range of between 5 and 30 centimetres.
Light sources 21 are individually activated one after the other to allow camera 20 to capture photometric stereo data of the object under different lighting conditions. This is achieved by switching on just one light source at a time. The amount of light reflected from the surface of the object onto each pixel of the camera 20 is then measured. For example, the apparatus 11 may comprise 16 LEDs and thus a set of 16 photometric stereo images may be captured.
Photometric stereo allows the surface normals to be estimated. In its most basic form, for any point on a Lambertian surface the intensity $I$ of the reflected light at the camera can be expressed by:

$$I = k\,(L \cdot n)$$

where $I$ is the intensity of the reflected light at the camera, $L$ is a unit vector corresponding to the light direction of the irradiating light, $n$ is the normal to the surface at that point and $k$ is the albedo (reflectivity) at the point. By solving the above for three or more different light directions, it is possible to estimate the surface normal $n$. Once all of the surface normals have been determined for all points on the surface, it is possible to reconstruct the surface by integrating the normals.
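For illustration, a minimal sketch of this basic Lambertian estimation is given below, assuming at least three calibrated, distant (far-field) light sources and grayscale images; this is the textbook least-squares formulation rather than the near-field pipeline of the embodiments described later.

```python
import numpy as np

def lambertian_normals(intensities, light_dirs):
    """Basic Lambertian photometric stereo: solve I = k (L . n) per pixel.
    intensities: (m, H, W) stack of grayscale images, one per light
    light_dirs:  (m, 3) unit light direction vectors
    returns: unit normals (H, W, 3) and albedo (H, W)"""
    m, h, w = intensities.shape
    I = intensities.reshape(m, -1)                            # (m, H*W)
    # least-squares solution of L @ G = I, where G = k * n per pixel
    G, _, _, _ = np.linalg.lstsq(light_dirs, I, rcond=None)   # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = G / np.maximum(albedo, 1e-8)                    # normalise to unit length
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)
```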
In practice, the situation is more complex as the surface is unlikely to have a fixed albedo.
Also, near-field light attenuation can affect the shape of the object recovered using photometric stereo. To model this, the following formulation can be used, where each set of photometric stereo images comprises $m$ images, the number of images corresponding to the number of illumination directions (which in this example corresponds to the number of light sources used). Each image $i_j$ for $j = 1, \dots, m$ can be seen as a set of pixels $p$. For each of the $m$ light sources, the direction $L_m$ and the brightness $\phi_m$ are known.

As noted above, near-field light attenuation will affect the reflection of light from the object. The near-field light attenuation is modelled using the following non-linear radial model of dissipation:

$$a_m(X) = \phi_m \frac{(\hat{S}_m \cdot \hat{L}_m(X))^{\mu_m}}{\|L_m(X)\|^2} \qquad (1)$$

where $\phi_m$ is the intrinsic brightness of the light source, $\hat{S}_m$ is the principal direction denoting the orientation of the LED point light source, $\mu_m$ is an angular dissipation factor, and the lighting direction is expressed as

$$\hat{L}_m(X) = \frac{L_m(X)}{\|L_m(X)\|} \qquad (2)$$

Calibrated point light sources are assumed at positions $P_m$ with respect to the camera centre at point $O$, which results in variable lighting vectors $L_m = P_m - X$, where $X$ is the 3D surface point coordinate, expressed as $X = [x, y, z]^T$. Defining the viewing vector $\hat{V}$ as

$$\hat{V} = -\frac{X}{\|X\|} \qquad (3)$$

the general image irradiance equation is expressed as

$$i_m = a_m B(N, \hat{L}_m, \hat{V}, \rho) \qquad (4)$$

where $N$ is the surface normal, $B$ is assumed to be a general bidirectional reflectance distribution function (BRDF) and $\rho$ is the surface albedo. The albedo $\rho$ and the images are RGB and the reflectance is different per channel, thus allowing for the most general case. Furthermore, global illumination effects such as shadows and self-reflections could also be incorporated into $B$.

To allow the above to be modelled, the photometric stereo images are converted to observation maps. An observation map is generated to describe pixel-wise illumination information. In an embodiment, an observation map is generated for each pixel. Each observation map comprises a projection of the lighting directions onto a 2D plane, the lighting directions for each pixel being derived from each photometric stereo image.

Details of how the observation maps may be constructed will be described later. However, in summary, the observation maps are generated using $L_m$, the lighting vector. As described above, this is defined with respect to the 3D position of the point on the surface of the object which corresponds to the pixel. Therefore, the shape of the object influences $L_m$ and hence influences the observation maps. When the observation maps are generated for the first time, a simple estimate of the shape is used; for example, it can be assumed that all pixels are at a fixed distance from the camera.

The observation maps are intended to account for every illumination direction at each surface pixel and allow merging the information of a variable number of images into a single d x d image map. Given a plurality of images, for each pixel p of camera 20, its value at image j is denoted as $i_{j,p}$. All observations of the pixel in the images with varied illumination are combined into a single d x d x 4 map $O$, where d is the dimension. How this is achieved will be described later.

Figure 3 shows a high-level diagram of the basic steps of the method for recovering surface normals from a set of photometric stereo images of an object or a scene, which can take into account near-field optical effects.

Images of an object are obtained using apparatus 11 following the above-described procedure. Alternatively, the photometric stereo images can be obtained from a remote photometric stereo imaging apparatus, communicated to the computer 12 and thus provided as an input.

For each pixel of the photometric images, an observation map is rendered by combining the pixel observations of all photometric images onto an observation map (as described above). Subsequently, the observation maps are processed to generate a normal (orientation) map.
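As a sketch only, the attenuation of Equations (1) and (2) could be computed per surface point as below. The sign convention for the angular term (whether the LED principal direction is compared against the direction towards the surface point or away from it) is an assumption here and may need to be flipped to match a particular calibration.

```python
import numpy as np

def near_field_attenuation(X, P_m, S_m, phi_m, mu_m):
    """Near-field attenuation a_m(X) and unit lighting direction, Eqs. (1)-(2).
    X:     (..., 3) 3D surface points in the camera frame
    P_m:   (3,) position of LED m relative to the camera centre
    S_m:   (3,) unit principal direction of LED m
    phi_m: intrinsic brightness of LED m
    mu_m:  angular dissipation factor of LED m"""
    L_m = P_m - X                                        # lighting vectors L_m = P_m - X
    dist = np.linalg.norm(L_m, axis=-1, keepdims=True)
    l_hat = L_m / np.maximum(dist, 1e-8)                 # Eq. (2)
    # angular dissipation term; the assumed convention (S_m . l_hat) may need flipping
    angular = np.clip((l_hat * S_m).sum(axis=-1), 0.0, None) ** mu_m
    a_m = phi_m * angular / np.maximum(dist[..., 0] ** 2, 1e-8)   # Eq. (1)
    return a_m, l_hat
```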
In an embodiment, the observation maps may be processed by a Convolutional Neural Network (CNN). A possible configuration of such a CNN will be described with reference to Figure 6.

Figure 4 shows an apparatus 40 according to an embodiment. Figure 4 shows a mount that holds a first camera 41 and a second camera 42. The mount further holds a plurality of light sources 43 in a fixed relationship with one another and with the first camera 41 and the second camera 42. This arrangement of apparatus 40 allows the first camera 41, the second camera 42 and the plurality of light sources 43 to be moved together while maintaining a constant separation from each other.

In an embodiment, 15 LEDs are used as the plurality of light sources; however, any number and arrangement of light sources can be used as long as their positions relative to each other, the first camera 41 and the second camera 42 are known.

Although Figure 4 shows an apparatus comprising a first camera 41 and a second camera 42, the described method could be implemented by a single camera, wherein the single camera is configured to move between the position of the first camera 41 of Figure 4 and the position of the second camera 42 of Figure 4, for given positions of the LEDs and known distances between the first and second positions, between the first position and the LEDs and between the second position and the LEDs.

Figure 5 is a flow chart showing an outline of a method in accordance with an embodiment.

In data capture with Camera 1, step S501, a first set of photometric stereo images of an object is acquired. For example, a first set of photometric stereo images can be acquired from the first camera (or Camera 1) shown in Figure 4 in a first position. Simultaneously, in data capture with Camera 2, step S503, a second set of photometric stereo images is acquired from the second camera (or Camera 2) shown in Figure 4 in a second position. Both the first set and second set of images are acquired under known light (LED) positions and known camera positions as described above.

In an alternative embodiment, the two sets of photometric stereo images of the object are acquired from a memory device, wherein the two sets of photometric stereo images were acquired previously, at a time before execution of the described method. In a further variation, there is just one camera and the first set of photometric stereo images is acquired in the first position and the second set of photometric stereo images is acquired in the second position by the same camera.

For ease of reference in the following description, the first set of photometric stereo images may be referred to as the left stereo images and the second set of photometric stereo images may be referred to as the right stereo images. However, the first position may be to the right of the second position, above the second position, below the second position or in any position separated from the second position. The first set, second set, left and right terminology is used to distinguish an image, or set of images, obtained from a camera in the first position from an image, or set of images, obtained from a camera in the second position.

Each set of photometric stereo images comprises $m$ images, where the number of images corresponds to the number of light sources used. Each image $i_j$ for $j = 1, \dots, m$ can be seen as a set of pixels $p$. For each of the $m$ light sources, the direction $L_m$ and the brightness $\phi_m$ are known and are used in the calculation of the normals $N$.
In step S505, a normal map is generated from the data capture step S501. This will be referred to as the left normal map $N_L$. In step S507, a normal map is generated from the data capture step S503. This will be referred to as the right normal map $N_R$.

In this embodiment, the left and right normal maps are generated using the method described above, where a left observation map is created from the data from Camera 1 and a right observation map is created from the data from Camera 2. To create these initial observation maps, an estimate of the initial object geometry $z_{est}$ is used. This may be a very rough depth initialisation $z_{est}$ (which can be accurate to a few centimetres). An estimate of object geometry can comprise a depth map with constant values corresponding to $z_{est}$.

It is assumed that the two sets of photometric images provided from S501 and S503 provide a number of calibrated stereo pairs comprising left and right views (views from a first camera position and a second camera position).

Assuming a camera focal length $f$ and stereo baseline $b$, the well known depth ($z$) disparity ($d$) relation applies:

$$z = \frac{fb}{d} \quad \text{or equivalently} \quad d = \frac{fb}{z} \qquad (5)$$

Assuming photometric stereo images for both the first position and the second position and a depth initialisation $z_{est}$, left and right normal maps $N_L$ and $N_R$ can be computed.

In step S509, stereo matching is performed on the left normal map and the right normal map. In this step, stereo matching (or disparity estimation) is performed on both the left normal map and the right normal map to output a set of sparse pairwise shape-preserving correspondences between the left normal map and the right normal map.

The normal maps $N_L$ and $N_R$ from S505 and S507 are expected to be accurate to within a few degrees of the ground truth. However, the depth maps $Z_{0L}$ and $Z_{0R}$ (which are calculated from $N_L$ and $N_R$ respectively using numerical integration) suffer from scale ambiguity. This is an inevitable consequence of the single view version of the problem, as numerical integration preserves the overall mean depth, i.e. $\bar{Z}_{0L} = \bar{Z}_{0R} = z_{est}$. Thus, for an almost perfectly integrable surface with true depth $z_T$, it is expected that

$$\frac{z_T}{\bar{z}_T} \approx \frac{z_0}{\bar{z}_0} \quad \text{or, equivalently,} \quad z_T \approx \frac{\bar{z}_T}{\bar{z}_0} z_0 \qquad (6)$$

In practice, Equation (6) only applies to an infinitely smooth surface and it is very sensitive to error propagation for any reasonably sized image. This in turn means that the overall shape $Z_0$ of the object exhibits some low frequency deformation or bending.

However, if attention is restricted to a small size image patch, then the smoothness constraint is a lot easier to satisfy (for patches not including depth discontinuities) and thus Equation (6) can be exploited to facilitate left to right matching.

In an embodiment, sparse stereo matching (or disparity estimation) is used to generate a sparse stereo representation of the object. Formally, the aim of this stereo matching step is to find, for each pixel with position $(x, y)$ on the left normal map $N_L$, the best match position $(x - d, y)$ on the right normal map $N_R$, as this directly allows the absolute surface depth to be recovered using Equation (5).

Single pixel matching can be too ambiguous and too sensitive to noise; therefore, in an embodiment, the aim is to match a patch of $w$ pixels around $(x, y)$, which can be expressed as $N_{PL} = N_L[x - w : x + w,\; y - w : y + w]$, where $N_{PL}$ is the patch of the left normal map $N_L$.

In an embodiment, for a given pixel patch $N_{PL}$ on the left image, with a centre around pixel position $(x, y)$,
a search for $N_{PR}[x - d, y]$ is performed on the right normal map $N_R$, starting with a first candidate pixel patch at $(x, y)$ in the right normal map $N_R$. Further candidate matches are obtained, for example, by shifting the first candidate pixel patch to the left or right along the x dimension. The search is constrained to the epipolar line (or scan line). The epipolar constraint states that, given a pixel in one of the images, potential conjugate pixels in the other image belong to a straight line called the epipolar line. This constraint shows that stereo matching is fundamentally a one-dimensional problem.

Then a new pixel patch $N_{PL}$ is selected on the left image and the search for a match in the right normal map is performed again.

Candidate pixel patches may overlap. Each candidate pixel patch is associated with a tentative disparity $d_t$, which is the shift in the x direction from the first candidate pixel patch.

For the given pixel patch $N_{PL}$ on the left image, each candidate pixel patch $N_{PR}$ on the right image is evaluated to determine which candidate patch is the best match. In an embodiment, each candidate pixel patch is evaluated using the angular distance (cosine similarity of the normals). However, other metrics can be used for the evaluation, such as the least squares difference.

This results in pairwise shape-preserving correspondences between the left normal map $N_L$ and the right normal map $N_R$.

In an embodiment, left-right consistency of disparity is enforced. This enhances the disparity estimation by referring to the information from the opposite view, i.e. from right to left. In an embodiment, this is achieved by, for the given pixel patch $N_{PL}$ on the left image, identifying the matching pixel patch $N_{PR}$ on the right normal map. Then, for $N_{PR}$ with a centre around pixel position $(x, y)$, a search for $N_{PL}[x - d, y]$ is performed on the left normal map $N_L$, starting with a first candidate pixel patch at $(x, y)$ in the left normal map $N_L$. Further candidate matches are obtained by shifting the first candidate pixel patch to the left or right. Candidate pixel patches may overlap. Each candidate pixel patch is associated with a tentative disparity $d_t$.

Next, for a given pixel patch $N_{PR}$ on the right image, each candidate pixel patch $N_{PL}$ on the left image is evaluated to determine which candidate patch is the best match. In an embodiment, each candidate pixel patch is evaluated using the angular distance (cosine similarity of the normals). However, other metrics can be used for the evaluation, such as the least squares difference.

Left-right consistency of disparity is enforced by checking that a match between a patch on the left normal map $N_L$ and a patch on the right normal map $N_R$, and the corresponding match from right to left, have a disparity difference that is under a given threshold. If the disparity difference exceeds the threshold then that match may be eliminated. For example, in an embodiment, all matches with more than 0.5 disparity difference are eliminated. However, other metrics based on disparity can be used.

In an embodiment, left-right consistency of the single pixel normal match is also enforced, by keeping only points with less than 5° normal difference.

While the above describes a windowing method for searching for candidate pixel patches, alternative methods of identifying the candidate patches may be used.
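A minimal sketch of this normal-patch matching with a left-right consistency check is given below, assuming rectified views (so the epipolar line is the image row), integer disparities and mean cosine similarity of the normals as the matching score.

```python
import numpy as np

def match_patch(src, dst, x, y, w, d_max, sign=-1):
    """Best disparity for the patch of normals around (x, y) in `src` against
    patches at (x + sign*d, y) in `dst` (sign=-1: left-to-right search)."""
    h, width = src.shape[:2]
    if not (w <= y < h - w and w <= x < width - w):
        return None
    patch_s = src[y - w:y + w + 1, x - w:x + w + 1]
    best_d, best_score = None, -np.inf
    for d in range(d_max + 1):                        # scan along the epipolar line
        xc = x + sign * d
        if xc - w < 0 or xc + w >= width:
            break
        patch_d = dst[y - w:y + w + 1, xc - w:xc + w + 1]
        score = (patch_s * patch_d).sum(axis=-1).mean()   # mean cosine similarity
        if score > best_score:
            best_score, best_d = score, d
    return best_d

def left_right_consistent_disparity(NL, NR, x, y, w, d_max, tol=0.5):
    """Keep a left-to-right match only if the reverse (right-to-left) search
    agrees to within `tol` pixels (exact agreement for integer disparities)."""
    d_lr = match_patch(NL, NR, x, y, w, d_max, sign=-1)
    if d_lr is None:
        return None
    d_rl = match_patch(NR, NL, x - d_lr, y, w, d_max, sign=+1)
    if d_rl is None or abs(d_lr - d_rl) > tol:
        return None
    return d_lr
```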
In an embodiment, to provide more accurate stereo matching, patch warping can be used to correct for the curvature associated with the normal maps.

It is noted that, for any non-frontal facing planar surface, the corresponding patch on the right normal map $N_{PR}$ will not be a square of size $2w$ centred around $(x - d, y)$, as pixels with different depth have different disparities. However, the approximate relative depth of Equation (6) can be used to compute a more accurate mapping from the left to the right image (i.e. patch warping).

More precisely, in an embodiment, for each tentative disparity $d_t$, the warping procedure is as follows:
1. Compute the mean tentative depth as $z_t = \frac{fb}{d_t}$
2. Compute the relative depth scaling factor $s = \frac{z_t}{Z_{0L}[x, y]}$
3. For each patch pixel $(x_p, y_p)$, scale its relative depth to $z_p = s\, Z_{0L}[x_p, y_p]$
4. Compute the patch pixel disparity as $d_p = \frac{fb}{z_p}$
5. Sample an interpolated value from the right normal map at position $N_R[x_p - d_p, y_p]$

Thus, using the above warping procedure, for each patch of pixels on the left normal map $N_L$, a tentative correspondence to the right normal map $N_R$ is computed. As above, the matching is evaluated using the angular distance (cosine similarity of the normals). In an embodiment, left-right consistency of disparity is enforced, as described above.
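The five-step warping procedure above might be sketched as follows, where Z0L denotes the depth map integrated from the left normal map and nearest-neighbour sampling is used in place of interpolation for brevity.

```python
import numpy as np

def warped_patch_score(NL, NR, Z0L, x, y, w, d_t, f, b):
    """Score one tentative disparity d_t for the patch around (x, y) by warping
    the patch footprint into the right view using relative depths from Z0L."""
    h, width = NL.shape[:2]
    z_t = f * b / d_t                          # 1. mean tentative depth
    s = z_t / Z0L[y, x]                        # 2. relative depth scaling factor
    scores = []
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            xp, yp = x + dx, y + dy
            if not (0 <= yp < h and 0 <= xp < width):
                continue
            z_p = s * Z0L[yp, xp]              # 3. scaled relative depth of patch pixel
            d_p = f * b / z_p                  # 4. per-pixel disparity
            xr = int(round(xp - d_p))          # 5. sample the right normal map
            if 0 <= xr < width:                #    (nearest neighbour, not interpolated)
                scores.append(float(NL[yp, xp] @ NR[yp, xr]))   # cosine similarity
    return np.mean(scores) if scores else -np.inf
```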
This produces a set of sparse pairwise shape-preserving correspondences between the left normal map $N_L$ and the right normal map $N_R$. From this, two shapes are reconstructed: one from the patch matching that started from the left normal map and one from the patch matching that started from the right normal map.

The two shapes (partial stereo estimates) that are constructed are then combined into one. If corresponding points are estimated to be at different depths on the two partial stereo estimates, then the corresponding points are both discarded. The recombined stereo estimate is sparse and may be integrated or interpolated to increase its number of points. However, the sparse points from the stereo estimate can be used directly in the next step.

From the pairwise correspondences, a set of stereo points with absolute depth is computed using triangulation methods utilising the principle of Equation (5) and the calibration matrices of the cameras. The set of sparse pairwise shape-preserving correspondences is then input into the 3D shape retrieval steps S511 and S513.

In step S511, a first 3D shape (or reconstruction) is obtained using the normal map obtained in step S505 from Camera 1, constrained by the pairwise correspondences obtained from the stereo matching as detailed above.

A second 3D shape (or reconstruction) is obtained in step S513 using the normal map obtained in step S507 from Camera 2, constrained by the pairwise correspondences obtained from the stereo matching as detailed above.

The reconstructions are generated by numerically integrating the normal maps to generate a depth map, the integration being constrained by the depth information determined from stereo matching the two normal maps.

In an embodiment, this may be done by following the variational solver of Queau and Durou [Y. Queau and J.-D. Durou. Edge-preserving integration of a normal field: Weighted least squares, TV and L1 approaches. In SSVM, 2015], where the surface derivatives are approximated using first order finite differences. The numerical integration of each normal map ($N_L$, $N_R$) to obtain a depth map ($Z_{0L}$, $Z_{0R}$) can be constrained to follow these stereo points. This essentially adds another term $\lambda \| z - z_0 \|$ to the integration equation, where $z$ is the depth, $z_0$ is the estimated depth, and $\lambda$ is a constant weight that could also be scaled based on the confidence of the matches.

This eliminates the scale ambiguity as well as reducing the low frequency bending and thus significantly improves the single view Photometric Stereo depth accuracy.

The term for integration with respect to $dx\,dy$ is

$$\| Dz - Dz_d \| + \lambda_{st} \| z - z_{st} \| + \lambda_{prev} \| z - z_{prev} \| \qquad (7)$$

where $D$ is the derivative operator, $z$ is the depth being solved for, $z_d$ is the depth estimated from the normals, $\lambda_{st} \| z - z_{st} \|$ is the constraint using the stereo points, and $\lambda_{prev} \| z - z_{prev} \|$ is the constraint using the quadratic prior determined from the previous reconstruction. If, due to the sparsity of the stereo reconstruction, a point is reconstructed where there is no corresponding value from the stereo reconstruction, the term $\lambda_{st} \| z - z_{st} \|$ is set to 0.

This sparse stereo reconstruction is less affected by global distortions than a monocular photometric stereo reconstruction.
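As a simplified stand-in for the constrained integration of Equation (7), the following sketch builds an orthographic, first-order finite-difference least-squares system with soft stereo and prior terms and solves it with a sparse solver. The cited variational solver of Queau and Durou is more sophisticated; this only illustrates how the lambda-weighted constraints enter the system.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def integrate_normals(N, z_st, mask_st, lam_st=1.0, lam_prev=0.1, z_prev=None):
    """Least-squares normal integration with soft depth constraints (orthographic,
    first-order finite differences). N: (H, W, 3) unit normals; z_st/mask_st:
    sparse stereo depths and their validity mask; z_prev: optional prior depth."""
    h, w = N.shape[:2]
    n = h * w
    idx = np.arange(n).reshape(h, w)
    nz = np.clip(N[..., 2], 1e-3, None)
    p, q = -N[..., 0] / nz, -N[..., 1] / nz        # target gradients dz/dx, dz/dy

    rows, cols, vals, rhs = [], [], [], []

    def add_diff(i_plus, i_minus, target):
        r = len(rhs)
        rows.extend([r, r]); cols.extend([i_plus, i_minus]); vals.extend([1.0, -1.0])
        rhs.append(target)

    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                add_diff(idx[y, x + 1], idx[y, x], p[y, x])
            if y + 1 < h:
                add_diff(idx[y + 1, x], idx[y, x], q[y, x])
            if mask_st[y, x]:                       # lam_st * || z - z_st ||
                rows.append(len(rhs)); cols.append(idx[y, x]); vals.append(lam_st)
                rhs.append(lam_st * z_st[y, x])
            if z_prev is not None:                  # lam_prev * || z - z_prev ||
                rows.append(len(rhs)); cols.append(idx[y, x]); vals.append(lam_prev)
                rhs.append(lam_prev * z_prev[y, x])

    A = sp.csr_matrix((vals, (rows, cols)), shape=(len(rhs), n))
    z = lsqr(A, np.array(rhs))[0]
    return z.reshape(h, w)
```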
It is noted that, at this stage, the output is still a set of 2.5D depth maps, one per view (in the case of multi-view photometric stereo these can correspond to multiple rectified stereo pairs), and thus fusion is still required to get a complete surface. The stereo-constrained single view reconstructions ($Z_{0L}$, $Z_{0R}$) are metrically consistent and thus can be merged into a unified surface in step S515. To minimise the impact of inaccuracies at different views on the fusion, each point is weighted using its viewing term, i.e. $n \cdot v$ (as detailed in Fotios Logothetis, Ignas Budvytis, Roberto Mecca, and Roberto Cipolla. A differential volumetric approach to multi-view photometric stereo. In ICCV, 2019). This allows points in the centre of the image to be given more weight as they are more likely to be correct. In addition, points that lie outside the object's visual hull are completely eliminated.
In an embodiment, Poisson reconstruction is performed. The stereo point cloud of the sparse stereo points can be made denser with a Poisson reconstruction. The output of the Poisson reconstruction is a dense surface which can be projected to all the views and can be used as a better initialisation for the next stage of the iterative procedure.
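The specification does not name a particular implementation for this step; as one possible off-the-shelf choice, Open3D's Poisson surface reconstruction could be used to densify the fused point cloud, roughly as follows (the octree depth parameter is an assumption).

```python
import numpy as np
import open3d as o3d

def poisson_surface(points, normals, depth=8):
    """Densify a fused, oriented point cloud into a surface mesh."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(points, dtype=np.float64))
    pcd.normals = o3d.utility.Vector3dVector(np.asarray(normals, dtype=np.float64))
    mesh, _densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth)
    return mesh   # can be projected back to each view as the next initialisation
```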
In an embodiment, the process described above is iterated a number of times to improve the reconstruction. In a subsequent iteration, the merged stereo-constrained reconstruction from the previous iteration is used as the rough depth estimate or initial 3D shape. Using this estimate, the normals are computed, stereo matching is performed on the normals and the object is reconstructed, constrained by the stereo points obtained from stereo matching.
In addition, having a dense surface estimate also allows for estimating discontinuity boundaries, which can be respected in the single view reconstruction stage (essentially removing the normal integrability constraint for the pixels across boundaries).
The merged stereo-constrained single view reconstruction in step S515 is then used as the regenerated shape in step S517 for re-computing light distributions which take into account near field effects as described above in relation to equations (1) to (4) which explain how the shape of the object impacts near field effects.
In step S519, new observation maps are generated for Cameras 1 and 2. These observation maps are generated in the same manner as described above. However, here, the shape regenerated in S517 is used as $z_{est}$. Normal maps are generated in step S521 from the observation maps generated in S519.
Next, the method loops back to S509, where the new normal maps generated in step S521 are stereo matched. The process then continues as above until the shape reconstructed in S515 converges.
The above method relies on the formation of observation maps and the generation of normal maps from these observation maps, which are now described in more detail.
In more detail, if the photometric stereo images are RGB images, a pre-processing step may be performed, where the RGB channels are averaged and thus the images are converted to grayscale images. In the pre-processing stage, the photometric stereo images are also compensated with the intrinsic light source brightness, where the intrinsic light source brightness $\phi_m$ is a constant property of each LED. The resultant image values are referred to as the RAW gray image values and are included in the observation maps as discussed further below in this specification.
The general image irradiance equation can thus be re-arranged into a BRDF inversion problem, as:

$$j_m = \frac{i_m}{a_m} = B(N, \hat{L}_m, \hat{V}, \rho) \qquad (8)$$

where $j_m$ denotes the BRDF samples. It should be noted that the viewing vector $\hat{V}$ is known, but, due to the non-linear dependence on the distance between the object and the photometric stereo capturing device, denoted by depth $z$, the lighting direction $\hat{L}_m$ and the near-field light attenuation $a_m$ are unknowns. The objective of the convolutional neural network is therefore to solve the inverted BRDF problem in the general viewing direction, which is input into the network through the 3rd and 4th channels of the map as explained further below, and recover the surface normals $N$ and subsequently the depth $z$. Having assumed an initial estimate of the depth $z$ (local surface depth) of each pixel, the near-field attenuation $a_m(X)$ can be computed following Equation (1), and thus the equivalent far-field reflectance samples $j_m$, representing the observations in the photometric stereo images after the compensation for the near-field attenuation, can be obtained using the first part of Equation (8). The second part of Equation (8) is modelled by the CNN and is used to approximate the normals $N$.

The set of equivalent far-field reflectance samples $j_m$ is used to generate observation maps which are consequently provided to the input of a Convolutional Neural Network (CNN) which is used to calculate the surface normals $N$. Each equivalent far-field reflectance sample $j_m$ comprises a set of pixels $x$, where, for each sample $j_m$, with $m$ denoting the number of light sources and thus the number of samples in the set of equivalent far-field reflectance samples, the light directions $L_m$ and the brightness $\phi_m$ are known and used in the estimation of the surface normals $N(x)$ for each pixel $x$.
The observation maps are computed by combining the information from all the light sources into a single map. In particular, for each pixel $x$, all observations in the set of equivalent far-field reflectance samples $j_m$ are merged into a single d x d observation map.
Initially, normalised observations $\bar{i}_{m,p}$ for each pixel $x$ are computed. Normalising the observations is performed by compensating for the variation in the light source brightness $\phi_m$ and dividing by the maximum brightness over all light sources $m$:

$$\bar{i}_{m,p} = \frac{i_{m,p} / \phi_m}{\max_m \left[ i_{m,p} / \phi_m \right]} \qquad (9)$$

Compensating for the light source variations is aimed at compensating for the albedo variation of the different pixels. Consequently, this also results in a reduced range of the data associated with the observations of each pixel.
Subsequently, the normalised observations $\bar{i}_{m,p}(x)$ for each pixel $x$ are placed onto a normalised observation map $O_n$. The normalised observation map is a square observation map with dimensions d x d. In certain embodiments, d is 32. The size of the observation maps is independent of the number or size of the photometric stereo images used.
The normalised observations are mapped onto the normalised observation map by projecting the light source direction $\hat{L}_m = [l_x, l_y, l_z]$ onto the d x d map, following the equation below:

$$O_n\!\left(\left\lfloor d\,\tfrac{l_x + 1}{2} \right\rfloor, \left\lfloor d\,\tfrac{l_y + 1}{2} \right\rfloor\right) = \bar{i}_{m,p} \qquad (10)$$

In certain instances, the normalised observation data may be corrupted by the division operation. For example, corruption of the normalised observation data may occur when the maximum value of the observations is saturated. Division by a saturated value results in an overestimate in the normalised observation values. In other instances, the division of very dark points in the observations becomes numerically unstable and any amount of noise or any discretisation inaccuracies is amplified.
Therefore, the d x d observation map for each pixel $x$ is extended to a three-dimensional observation map with dimensions d x d x 2 by the addition of a RAW channel map $O_r$. The RAW channel map may be a RAW grey scale channel map. The RAW channel map is defined as follows:

$$O_r\!\left(\left\lfloor d\,\tfrac{l_x + 1}{2} \right\rfloor, \left\lfloor d\,\tfrac{l_y + 1}{2} \right\rfloor\right) = i_{m,p} \qquad (11)$$

The d x d x 2 observation map, denoted by $O$ below, is created by a concatenation operation on the third axis of the normalised observation map $O_n$ and the RAW channel map $O_r$:

$$O = [O_r ; O_n] \qquad (12)$$

In certain embodiments, the observation maps $O$ with dimensions d x d x 2 can be extended to d x d x 4 observation maps, by augmenting the observation map to include two additional channels, set constant to the first two components of the viewing vector, $\hat{v}_x$ and $\hat{v}_y$, respectively. The components $\hat{v}_x$ and $\hat{v}_y$ are scalar components and fully determine the viewing vector $\hat{V}$, which itself is used in the BRDF inversion problem equation.
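A sketch of the observation map construction of Equations (9) to (12) is given below, assuming the per-pixel lighting directions have already been computed from the current depth estimate; the exact rounding of the projection indices is an assumption, and the plain Python loops are written for clarity rather than speed.

```python
import numpy as np

def observation_maps(samples, phi, light_dirs, view_xy, d=32):
    """Build a d x d x 4 observation map per pixel: normalised channel (Eq. 9-10),
    RAW channel (Eq. 11) and two constant viewing-vector channels.
    samples:    (m, H, W) per-light pixel values
    phi:        (m,) intrinsic LED brightnesses
    light_dirs: (m, H, W, 3) per-pixel unit lighting directions
    view_xy:    (H, W, 2) first two components of the per-pixel viewing vector"""
    m, h, w = samples.shape
    comp = samples / phi[:, None, None]                               # i / phi
    norm = comp / np.maximum(comp.max(axis=0, keepdims=True), 1e-8)   # Eq. (9)
    u = np.clip((d * (light_dirs[..., 0] + 1) / 2).astype(int), 0, d - 1)
    v = np.clip((d * (light_dirs[..., 1] + 1) / 2).astype(int), 0, d - 1)
    obs = np.zeros((h, w, d, d, 4), dtype=np.float32)
    for j in range(m):
        for y in range(h):
            for x in range(w):
                obs[y, x, v[j, y, x], u[j, y, x], 0] = norm[j, y, x]      # Eq. (10)
                obs[y, x, v[j, y, x], u[j, y, x], 1] = samples[j, y, x]   # Eq. (11)
    obs[..., 2] = view_xy[..., 0][..., None, None]    # constant v_x channel
    obs[..., 3] = view_xy[..., 1][..., None, None]    # constant v_y channel
    return obs
```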
The observation maps record the relative pixel intensities from the BRDF samples on a 2D grid of discretised light directions. The observation map representation is a highly convenient representation for use with classical CNN architectures as it provides a fixed-size 2D input per pixel despite the potentially varying number of lights and hence photometric stereo images used.
In steps S505, S507 and S521, the observation maps for each pixel $x$ are provided to the input of the CNN, which is used to solve the BRDF inversion problem and calculate the surface normals for each point based on the relative pixel intensities in the observation maps. As the CNN is designed to be robust to real world effects, the modelled BRDF inversion problem equation is an inexact representation of the true BRDF inversion problem.
A high-level flow diagram of a convolutional neural network that may be used for generating surface normals from the pixel-wise observation maps is presented in Figure 6.
The network comprises 7 convolutional layers which are used to learn robust features for dealing with real world data. This is done by employing an augmentation strategy during training of the network, as discussed in GB 2598711. The network further comprises 2 fully connected layers, and a fully connected layer in combination with a logarithm layer at the end, which is used to solve the inverse BRDF problem and hence compute a surface normal for each pixel, in steps S505 and S507.
The network has around 4.5M parameters in total. A full diagram of the architecture of the network is shown pictorially in Figure 6. While Figure 6 presents a specific network architecture, it is understood that different neural network architectures can be used for estimating surface normals from observation maps.
The proposed network comprises a single branch network. The 7 convolutional layers 603, 605, 609, 613, 619, 623, and 627 and the first two fully connected layers 631 and 635 are each followed by a RELU activation function. The size of the convolutional filter of each of the convolutional layers is denoted in Figure 6, and thus the size of the output volume for each layer can be inferred. In particular, the convolutional layer 603, comprises 32 convolutional filters with dimensions (3 x 3) and outputs an output volume with dimensions (32, 32, 32); convolutional layer 605 comprises 32 convolutional filters with dimensions (3 x 3) and outputs an output volume with dimensions (32, 32, 32). The output volume of the first concatenation layer 608 has dimensions of (32, 32, 64). The convolutional layer 609 comprises 32 convolutional filters with dimensions (3 x 3) and outputs an output volume with dimensions of (32, 32, 32). The output volume of the second concatenation layer 612 has dimensions (32, 32, 96). Convolutional layer 613 comprises 64 convolutional filters with dimensions (1 x 1) and outputs an output volume with dimensions (32, 32, 64). The output volume of the average pooling layer 617 has dimensions of (16, 16, 64). The convolutional layer 619 comprises 64 convolutional filters with dimensions of (3 x3) and outputs an output volume with dimensions (16, 16, 64).
The output volume of the third concatenation layer 622 has dimensions (16, 16, 128). The convolutional layer 623 comprises 64 convolutional filters with dimensions (3 x 3) and outputs an output volume with dimensions of (16, 16, 64). Convolutional layer 627 comprises 128 filters with dimensions (3x3) and outputs an output volume with dimensions of (16, 16, 128).
Furthermore, after convolutional layers 605, 609, 613, 619 and 623, dropout layers 607, 611, 615, 621 and 625 respectively are used. Dropout is a training approach which reduces the independent learning amongst the neurons in the network. During training, a random set of nodes is dropped out of the network, so that a reduced version of the network is created. The reduced version of the network learns independently of other sections of the neural network and hence prevents neurons from developing co-dependency amongst one another.
Each dropout layer is associated with a dropout parameter that specifies the probability at which outputs of the layer are dropped out. In layers 607, 611, 615, 621 and 625, the dropout parameter is 0.2 and thus 20% of the parameters are dropped out.
Skip connections are also employed in the network architecture to speed up convergence.
Skip connections are used in order to allow the output of a convolutional layer to bypass the immediately following layers and to allow the outputs of subsequent layers to be concatenated together, before being provided as an input to the next layer of the network. An average pooling layer 617 is also employed.
The first part of the convolutional neural network, comprising the 7 convolutional layers 603, 605, 609, 613, 619, 623, and 627, is effectively separated from the second part of the network, comprising the fully connected layers 631, 635, 637 and the logarithmic layer 633, by a Flattening layer 629. Fully connected layers are used because they provide good approximation of non-linear mathematical functions.
Flattening layer 629 rearranges the feature maps output by convolutional layer 627 into a vector of input data that is provided to the first fully connected layer 631. The output volume of convolutional layer 627 is flattened into a 32,768 element vector, which is provided to the first fully connected layer 631.
The BRDF model, described above in this specification, follows the Blinn-Phong reflectance model which is considered to be a good approximation of many real BRDFs, where the BRDF is modelled as a summation of a diffuse component and an exponential component. Therefore, in order to solve the inverse BRDF problem, the inverse of these operations is approximated as a combination of a linear and logarithmic summation. The combination of the linear and logarithmic summation is implemented in the CNN using the combination of dense layer 635 and logarithmic layer 633. Dense layer 635 represents the linear component of the summation and the logarithmic layer represents the logarithmic component of the summation.
Finally, a normalisation layer 639 is used to convert the features extracted by the network into a unit vector, the surface normal of the pixel, and thus the network outputs a surface normal from the pixel-wise observation map input.
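For illustration only, a network in the spirit of Figure 6 could be sketched in PyTorch as follows. The convolutional widths, concatenations, pooling and dropout follow the description above (and with 128-wide fully connected layers give roughly the 4.5M parameters mentioned), but the widths of the fully connected layers and the exact wiring of the final dense/logarithmic combination are assumptions, as they are not fully specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObservationMapNet(nn.Module):
    """Sketch of a single-branch network with concatenation skip connections,
    dropout, average pooling, and a final linear + logarithmic combination
    before normalisation to a unit surface normal."""

    def __init__(self, d=32, in_ch=4, p_drop=0.2):
        super().__init__()
        self.conv603 = nn.Conv2d(in_ch, 32, 3, padding=1)
        self.conv605 = nn.Conv2d(32, 32, 3, padding=1)
        self.conv609 = nn.Conv2d(64, 32, 3, padding=1)
        self.conv613 = nn.Conv2d(96, 64, 1)
        self.pool617 = nn.AvgPool2d(2)
        self.conv619 = nn.Conv2d(64, 64, 3, padding=1)
        self.conv623 = nn.Conv2d(128, 64, 3, padding=1)
        self.conv627 = nn.Conv2d(64, 128, 3, padding=1)
        self.drop = nn.Dropout(p_drop)
        flat = (d // 2) * (d // 2) * 128          # 16 * 16 * 128 = 32768 for d = 32
        self.fc631 = nn.Linear(flat, 128)         # hidden width is an assumption
        self.fc635_lin = nn.Linear(128, 128)      # linear branch
        self.fc633_log = nn.Linear(128, 128)      # logarithmic branch
        self.fc637 = nn.Linear(128, 3)

    def forward(self, x):                         # x: (B, 4, 32, 32) observation maps
        a = F.relu(self.conv603(x))               # (B, 32, 32, 32)
        b = self.drop(F.relu(self.conv605(a)))
        cat608 = torch.cat([a, b], dim=1)         # (B, 64, 32, 32)
        c = self.drop(F.relu(self.conv609(cat608)))
        cat612 = torch.cat([cat608, c], dim=1)    # (B, 96, 32, 32)
        e = self.drop(F.relu(self.conv613(cat612)))
        p = self.pool617(e)                       # (B, 64, 16, 16)
        f_ = self.drop(F.relu(self.conv619(p)))
        cat622 = torch.cat([p, f_], dim=1)        # (B, 128, 16, 16)
        g = self.drop(F.relu(self.conv623(cat622)))
        h = F.relu(self.conv627(g))               # (B, 128, 16, 16)
        h = torch.flatten(h, 1)                   # 32,768-element vector
        h = F.relu(self.fc631(h))
        # assumed wiring: sum of a linear and a logarithmic component
        h = self.fc635_lin(h) + self.fc633_log(torch.log(h + 1e-6))
        return F.normalize(self.fc637(h), dim=1)  # unit surface normal
```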
Figure 7 illustrates steps of the method, according to an embodiment. Figure 7 shows a set of images arranged in rows and columns. In the following description, each image will be referred to as image (row, column). Where an image spans two rows, the image will be referred to as (row1:row2, column). The images are acquired for an object for which a ground truth is known for each stage of the method.
Image (1,1) is an image from the first set of photometric stereo images and image (2,1) is an image from the second set of photometric stereo images. Image (1,1) may be the right image, and Image (2,1) may be the left image or vice versa.
Image (1,2) shows the error of the estimated normal map computed for image (1,1) compared to the ground truth. Image (2,2) shows the error of the estimated normal map computed for image (2,1) compared to the ground truth.
Image (2,3) shows the structure from normals (SfN) obtained by integrating the normal map estimated for image (2,1). Image (2,4) shows the error of the SfN map with respect to the ground truth. Image (1,3) shows the structure from normals (SfN) obtained by integrating the normal map estimated for image (1,1). Image (1,4) shows the error of the SfN map with respect to the ground truth.
Image (1, 6) shows the disparity estimation map determined from matching of the normal map for image (1,1) with the normal map of image (2,1) as explained with reference to S509. Image (2, 6) shows the disparity estimation map determined from matching of the normal map for image (2,1) with the normal map of image (1,1). Black is low disparity, while white is high disparity. Image (1, 5) shows the error in the disparity map of image (1, 6). Image (2, 5) shows the error in the disparity map of image (2, 6).
Image (1:2,7) illustrates the stereo points obtained from the stereo matching. Image (1:2, 8) shows the error in the stereo points compared to the ground truth.
Image (1:2, 9) shows the reconstructed object obtained from fusing depth maps obtained for image (1,1) and image (2, 1), constrained by the stereo points of image (1:2, 7). Image (1:2, 10) shows the error in the fused shape compared to the ground truth.
Image (1:2, 11) shows the reconstructed object using Poisson reconstruction. Image (1:2, 12) shows the error in the fused shape compared to the ground truth.
Rows 3 to 4 show the steps of the process for a second iteration of the method. Rows 5 to 6 show the steps of the process for a third iteration of the method.

Figure 8 illustrates the working principles of the reconstruction method. S901 shows a sample of the input images coming from the two cameras where the object is lit by one (out of 15) LED light sources. Following the arrow path to S902 is the pipeline where normal maps have been computed from monocular photometric stereo. Such normal maps are then used to geometrically match the two views and provide the sparse stereo representation of the object. Patch warping may be used at S903. S904 shows a sparse stereo point reconstruction. S905 shows the fusion of the two stereo-constrained single view reconstructions. At S906, Poisson reconstruction is performed to further refine the reconstruction.
At S907, the reconstruction from S906 is used as an initial estimate of the shape geometry for a subsequent iteration of the method. This estimate is used in conjunction with the input images to compensate the input images for near-field light attenuation.
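As an illustration of this compensation step, the sketch below divides each measured image by a near-field attenuation factor computed from the current shape estimate. It assumes an anisotropic point-light (LED) model with inverse-square fall-off and an angular dissipation term; the model, the parameters mu and phi, and the function name are assumptions for the sketch, not details taken from the embodiment.

```python
import numpy as np

def compensate_attenuation(image, points, light_pos, light_dir, mu=1.0, phi=1.0):
    """Divide out the near-field attenuation of a point light so the
    image approximates far-field (directional) lighting.
    image:     (H, W) measured intensities
    points:    (H, W, 3) current 3D estimate of each pixel (camera frame)
    light_pos: (3,) LED position, light_dir: (3,) LED principal direction.
    The inverse-square / angular-dissipation model and mu, phi are assumptions."""
    to_surface = points - light_pos                       # vectors from LED to surface
    dist = np.linalg.norm(to_surface, axis=-1)
    l = to_surface / dist[..., None]
    # anisotropic LED: brightness falls off away from its principal direction
    angular = np.clip(l @ light_dir, 1e-6, None) ** mu
    attenuation = phi * angular / (dist ** 2)
    return image / np.clip(attenuation, 1e-9, None)
```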
Figure 8 also illustrates steps of the fusion process, from left to right: point-cloud from stereo; normal integration using the point-cloud prior; fusion from all views and projection back to initialise the next round; and integration using the surface prior while respecting the depth discontinuities.
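The per-view normal-integration step can be illustrated with the classical Frankot-Chellappa frequency-domain integrator, sketched below. This is a stand-in for illustration only: it recovers depth from a normal map up to an unknown offset and omits the point-cloud prior and the handling of depth discontinuities referred to above.

```python
import numpy as np

def integrate_normals(normals):
    """Frankot-Chellappa integration of a unit normal map (H, W, 3)
    into a depth map, up to an unknown offset. A stand-in integrator:
    the described method additionally constrains the result with the
    stereo point cloud, which is omitted here."""
    nz = np.clip(normals[..., 2], 1e-6, None)
    p = -normals[..., 0] / nz            # dz/dx
    q = -normals[..., 1] / nz            # dz/dy
    H, W = p.shape
    wx = np.fft.fftfreq(W) * 2 * np.pi
    wy = np.fft.fftfreq(H) * 2 * np.pi
    u, v = np.meshgrid(wx, wy)           # column (x) and row (y) frequencies
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                    # avoid division by zero at DC
    Z = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    Z[0, 0] = 0.0                        # absolute depth is unknown
    return np.real(np.fft.ifft2(Z))
```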
Figure 9 illustrates results of the proposed system compared to monocular (single view) photometric stereo. At 1001, pairs of binocular photometric stereo images of three objects (bunny, queen and squirrel) made of different materials are shown. The ground truth (1002) is followed by the monocular photometric stereo reconstruction (1003) and the reconstructions of the proposed methods (1004).
The above method and systems have been described in relation to binocular photometric stereo, i.e. two views or two sets of images taken from cameras in two different positions. However, the described methods may also be applied to more views. For example, in an embodiment there are three views, or three sets of images taken from cameras in three different positions. The described stereo matching can then be performed between the first view and the second view, the first view and the third view, and the second view and the third view. The resultant stereo-constrained single view reconstructions (the first reconstruction, the second reconstruction and the third reconstruction) can be merged together.
The above method can be applied to any number of views, or sets of photometric images.
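A minimal sketch of this generalisation is given below: every pair of views is stereo-matched on its normal maps, each view is integrated under the resulting stereo constraints, and the per-view reconstructions are fused. The functions stereo_match, integrate_view and fuse are placeholders standing in for the steps described above, not a definitive implementation.

```python
from itertools import combinations

def reconstruct_multiview(views, stereo_match, integrate_view, fuse):
    """views: list of per-view data (normal map, camera pose, ...).
    stereo_match, integrate_view and fuse are placeholders for the
    pairwise matching, stereo-constrained integration and fusion
    steps described above."""
    stereo_points = []
    for i, j in combinations(range(len(views)), 2):   # (0,1), (0,2), (1,2), ...
        stereo_points.append(stereo_match(views[i], views[j]))
    # one stereo-constrained reconstruction per view, then a single merge
    single_view = [integrate_view(v, stereo_points) for v in views]
    return fuse(single_view)
```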
Figure 10 is a schematic of the hardware that can be used to implement methods in accordance with embodiments. It should be noted that this is just one example and other arrangements can be used.
The hardware comprises a computing section 1100. In this particular example, the components of this section will be described together. However, it will be appreciated they are not necessarily co-located.
Components of the computing system 1100 may include, but are not limited to, a processing unit 1113 (such as a central processing unit, CPU), a system memory 1101, and a system bus 1111 that couples various system components, including the system memory 1101, to the processing unit 1113. The system bus 1111 may be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus and a local bus using any of a variety of bus architectures. The computing system 1100 also includes external memory 1115 connected to the bus 1111.
The system memory 1101 includes computer storage media in the form of volatile and/or non-volatile memory such as read-only memory. A basic input output system (BIOS) 1103, containing the routines that help to transfer information between the elements within the computer, such as during start-up, is typically stored in system memory 1101. In addition, the system memory contains the operating system 1105, application programs 1107 and program data 1109 that are in use by the CPU 1113.
Also, an interface 1125 is connected to the bus 1111. The interface may be a network interface allowing the computer system to receive information from further devices. The interface may also be a user interface that allows a user to respond to certain commands, and so on.
A graphics processing unit (GPU) 1119 is particularly well suited to the above described method due to its ability to perform a large number of operations in parallel. Therefore, in an embodiment, the processing may be divided between CPU 1113 and GPU 1119.
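One way such a split could look in practice is sketched below using PyTorch: the pixel-wise normal-estimation network runs on the GPU when one is available, and results are moved back to the CPU for the subsequent matching and fusion stages. The batching, function name and the choice of which stages run where are assumptions, not requirements of the embodiment.

```python
import torch

def estimate_normals(model, observation_maps, batch_size=4096):
    """Run the pixel-wise network on the GPU when available, batching
    observation maps to bound memory; falls back to the CPU otherwise.
    The batch size is an arbitrary choice."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    outputs = []
    with torch.no_grad():
        for start in range(0, observation_maps.shape[0], batch_size):
            chunk = observation_maps[start:start + batch_size].to(device)
            outputs.append(model(chunk).cpu())   # back to CPU for later stages
    return torch.cat(outputs, dim=0)
```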
The embodiments described above combine two steps: (i) estimating sparse, pairwise, shape-preserving correspondences; and (ii) using them to initialize reconstructions, for example Poisson reconstruction, guided by estimated pixel-wise normals, although other reconstruction methods can be used.
Stereo matching requires matching view-invariant features, which is very challenging for texture-less objects, especially objects exhibiting specular reflection, which changes the appearance and pixel intensities between views. In addition, local surface curvature distorts the local appearance between the left and right (or first and second) views, so that patch-match based methods struggle to match rectangular patches of pixels in the two views.
Single view near-field photometric stereo can compute dense (for all foreground pixels) surface normals which can be very accurate even with an approximate depth initialisation. Surface normals are view-invariant features and also exhibit variation for any non-planar surface, essentially allowing for patch matching.
In addition, integration of normals provides a shape estimate which is locally accurate (and only suffers from low-frequency deformation or bending), and this local shape can be used to perform patch warping and thus improve the stereo matching, as described below.
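One standard way to realise such warping is via the homography induced by the local tangent plane at the reference pixel, estimated from the locally integrated shape. The sketch below uses the textbook plane-induced homography; the intrinsics K1 and K2, the relative pose [R | t] and the function names are assumptions for illustration, and this is not asserted to be the exact warping used by the embodiment.

```python
import numpy as np

def plane_induced_homography(K1, K2, R, t, n, X0):
    """Homography warping pixels of view 1 into view 2, induced by the
    local tangent plane at 3D point X0 (camera-1 frame) with unit
    normal n. Uses H = K2 (R - t n^T / d) K1^{-1} with the plane written
    as n^T X + d = 0, so d = -n.dot(X0). [R | t] maps camera-1
    coordinates to camera-2 coordinates."""
    d = -float(n @ X0)
    H = K2 @ (R - np.outer(t, n) / d) @ np.linalg.inv(K1)
    return H / H[2, 2]

def warp_patch_coords(H, coords):
    """Apply H to an (N, 2) array of pixel coordinates (x, y)."""
    pts = np.hstack([coords, np.ones((coords.shape[0], 1))])
    warped = (H @ pts.T).T
    return warped[:, :2] / warped[:, 2:3]
```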
Hence, by performing matching on the normals instead of on the initial two sets of images, more reliable stereo matching can be performed, which is robust even for objects that are texture-less or shiny. As a result, constraining the reconstruction of the object to the pairwise correspondences from the stereo matching step provides a more accurate representation of the object geometry, which in turn provides a better initial estimate for the iterative procedure of updating the object geometry until it converges to a single object geometry.
The above described architecture also lends itself to mobile telephones using GPUs.
Whilst certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel devices, and methods described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the devices, methods and products described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (17)

1. A computer vision method for generating a three dimensional reconstruction of an object, the method comprising: receiving a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set comprising at least one image taken from a camera in a first position using illumination from different directions, the second set comprising at least one image taken from a camera in a second position using illumination from different directions; generating a first normal map of the object using the first set of photometric stereo images; generating a second normal map of the object using the second set of photometric stereo images; determining a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and using the normal maps with the first estimate of the shape of the object to generate a reconstruction of the object.
2. A method according to claim 1, wherein the first normal map and the second normal map are generated using an estimate of the shape of the object to recalculate light distribution for near field effects of the illumination on the object.
3. A method according to claim 2, further comprising re-calculating the light distribution due to near field effects using the first reconstruction of the object, re-calculating the first normal map and the second normal map from the remodelled light distribution and producing a further reconstruction of the object.
4. A method according to claim 3, wherein re-calculating the light distribution due to near field effects comprises: (a) using the reconstruction of the object to re-calculate the first and second normal maps from the remodelled light distribution; (b) determining a further estimate of the shape of the object by performing stereo matching between patches of normals in the recalculated first normal map and patches of normals in the recalculated second normal map; (c) producing a further reconstruction of the shape from at least one of the first and second recalculated normal maps; and repeating (a) to (c) until the further reconstructions of the object converge.
5. A method according to any preceding claim, wherein using the normal maps with the first estimate of the shape of the object to generate a reconstruction of the object comprises: integrating the first normal map with a constraint of the first estimate of the shape to produce a first reconstruction; integrating the second normal map with a constraint of the first estimate of the shape to produce a second reconstruction; and combining the first and second reconstructions to produce a fused reconstruction which is the reconstruction of the object.
6. A method according to claim 5, wherein the fused reconstruction is produced using a Poisson solver.
7. A method according to any preceding claim, wherein performing stereo matching on the first normal map and the second normal map comprises: selecting at least one pixel group on the first normal map, searching for a matched pixel group in the second normal map by scanning across the second normal map for matches.
8. A method according to claim 7, wherein scanning across the second normal map for matches is performed across an epi-polar line.
9. A method according to either of claims 7 or 8, wherein searching for a matched pixel group is constrained by the current estimated reconstruction of the object.
10. A method according to any of claims 7 to 9, wherein performing stereo matching on the first normal map and the second normal map comprises: performing patch warping on the at least one pixel group of the first normal map; and determining the corresponding pixel group of the second normal map using the warped at least one pixel group of the first normal map.
11. A method according to any of claims 7 to 10, wherein selecting at least one pixel group on the first normal map, searching for a matched pixel group in the second normal map by scanning across the second normal map for matches is used to produce a first partial stereo estimate, the method further comprising: selecting at least one pixel group on the second normal map, searching for a matched pixel group in the first normal map by scanning across the first normal map for matches to produce a second partial stereo estimate; combining the first and second partial stereo estimates to form a stereo estimate of the shape, wherein points from the first partial stereo estimate and the second partial stereo estimate which do not coincide are discarded.
12. A method according to any preceding claim, wherein the first set comprises a first plurality of images taken from a camera in the first position using illumination from different directions, and the second set comprises a second plurality of images taken from a camera in the second position using illumination from different directions.
13. A method according to any preceding claim, wherein generating the first normal map comprises inputting information representing the first set of photometric stereo images, an estimate of the shape of the object and positional information of the light sources into a neural network which has been trained to output the first normal map.
14. A method according to claim 13, wherein the information representing the first set of photometric stereo images, an estimate of the shape of the object and positional information of the light sources is provided in the form of an observation map, wherein an observation map is generated for each pixel of the camera, each observation map comprising a projection of the lighting directions onto a 2D plane, the lighting directions for each pixel being derived from each photometric stereo image.
15. A system for generating a three dimensional (3D) reconstruction of an object, the system comprising an interface and a processor: the interface having an image input and configured to receive a set of photometric stereo images of an object, the set of photometric stereo images comprising a plurality of images using illumination from different directions using one or more light sources, the processor being configured to: receive a first set of photometric stereo images of the object and a second set of photometric stereo images of the object, the first set comprising at least one image taken from a camera in a first position using illumination from different directions, the second set comprising at least one image taken from a camera in a second position using illumination from different directions; generate a first normal map of the object using the first set of photometric stereo images; generate a second normal map of the object using the second set of photometric stereo images; determine a stereo estimate of the shape of the object by performing stereo matching between patches of normals in the first normal map and patches of normals in the second normal map; and use the normal maps with the first estimate of the shape of the object to generate a reconstruction of the object.
16. A system according to claim 15, further comprising a first camera and a second camera, the first camera being configured to produce the first set of images and the second camera being configured to produce the second set of images.
17. A carrier medium carrying computer readable instructions adapted to cause a computer to perform the method of any of claims 1 to 14.