WO2013074153A1

WO2013074153A1 - Generating three dimensional models from range sensor data

Info

Publication number: WO2013074153A1
Application number: PCT/US2012/042792
Authority: WO
Inventors: Gerard Guy Medioni; Matthias HERNANDEZ; Jongmoo Choi
Original assignee: University Of Southern California
Priority date: 2011-11-17
Filing date: 2012-06-15
Publication date: 2013-05-23

Abstract

The present disclosure describes systems and techniques relating to generating three dimensional models from range sensor data, for example, performing three dimensional face modeling using a low resolution range sensor. According to an aspect, unwrapped 2D images are generated (in canonical form and according to a generalized cylinder model) from clouds of 3D points in a 3D coordinate system. This includes registering a 3D input cloud to one or multiple 3D reference frames, where the registering can include registering the 3D input cloud to all points in the 3D reference frame(s), or to only a portion of the points in the 3D reference frame(s), in accordance with an assessed rigid body transformation between the clouds of 3D points. In addition, the unwrapped 2D images are processed in a 2D image domain, and the processed 2D images are transformed to the 3D coordinate system to help form the 3D model.

Description

Generating Three Dimensional Models From Range Sensor Data

Cross Reference To Related Applications

[0001] This application claims priority to U.S. Provisional Application Serial No.

61/561,218, entitled "ACCURATE 3D FACE MODELING USING A LOW-RESOLUTION RANGE SENSOR", filed November 17, 2011, and which is hereby incorporated by reference.

Background

[0002] The present disclosure describes systems and techniques relating to generating three dimensional models from range sensor data, for example, performing three dimensional face modeling using a low resolution range sensor.

[0003] Three dimensional modeling from range sensor information is an active field. Many advances have been made in using software to build complex three dimensional models using range sensor information. For example, U.S. Patent No. 7,583,275 to

Newmann et al. describes generating a three dimensional model of an environment from range sensor information representing a height field for the environment. In addition, much work has gone into face recognition and reconstruction. For example, U.S. Patent No.

7,856, 125 to Medioni et al. describes a three dimensional face reconstruction technique using two dimensional images, such as photographs of a face. Other approaches for three dimensional modeling using images include those described in U.S. Patent No. 7,224,357 to Chen et al.

Summary

[0004] The present disclosure describes systems and techniques relating to generating three dimensional (3D) models from range sensor data. According to an aspect, unwrapped two dimensional (2D) images are generated (in canonical form and according to a generalized cylinder model) from clouds of three dimensional (3D) points in a 3D coordinate system. This includes registering a 3D input cloud to one or multiple 3D reference frame(s), where the registering can include registering the 3D input cloud to all points in the 3D reference frame(s), or to only a portion of the points in 3D reference frame(s), in accordance with an assessed rigid body transformation between the clouds of 3D points. In addition, the unwrapped 2D images are processed in a 2D image domain, and the processed 2D images are transformed to the 3D coordinate system to help form the 3D model.

[0005] The generating can include estimating a pose difference for the 3D input cloud and accepting or rejecting an unwrapped 2D image for the 3D input cloud based on the estimated pose difference, or detected occlusions, or presence of non-rigid motion. In addition, the processing can include applying temporal filtering, such as a running mean, to a pixel of an unwrapped 2D image, applying one or more image-based operators to filter the unwrapped 2D image, and performing interpolation, in an image domain, on the unwrapped 2D image to fill holes.

[0006] In various implementations, one or more of the following features and advantages can be provided. A system can track a face in real-time and generate at the same time a realistic high-resolution 3D model using a low resolution range sensor. The disclosed approach need not rely on any prior knowledge and can produce faithful models of any kind of object. The modeling can be fast compared with prior approaches and can provide accurate results while using a low-resolution noisy input. The first frame can be set as a reference, an initial pose (e.g., a 3D head pose) can be computed, and this can be used to add new information to the model. For dense reconstruction, a cylindrical representation can be used, which enables the addition of information in a finite amount of data and also faster processing of the 3D information. The described methods can perform as well or better that prior state of the art methods, while using an affordable low resolution sensor, such as the PRIMESENSE™ camera available from PrimeSense Ltd. of Tel Aviv, Israel.

[0007] The above and other aspects and embodiments are described in greater detail in the drawings, the description and the claims.

Description of Drawings

[0008] FIG. 1A shows a PRIMESENSE™ camera.

[0009] FIG. IB shows an exmaple of depth information received from the PRIMESENSE™ camera.

[0010] FIG. 1C shows an example of the corresponding RGB (Red/Green/Blue) image received from the PRIMESENSE™ camera.

[001 1] FIG. 2A shows an example of projected 3D data on the YZ plane.

[0012] FIG. 2B shows this same example of projected 3D data on the XY plane.

[0013] FIG. 2C shows projected 3D data, in which a hand is now present in the image.

[0014] FIG. 2D shows projected 3D data, in which the hand occludes the face.

[0015] FIG. 2F shows successful face detection in projected 3D data for various challenging situations.

[0016] FIG. 3 A shows an example of registration between a reference frame and a current frame. [0017] FIG. 3B shows an example of wrong registration between a reference frame and a current frame.

[0018] FIG. 3C shows an example of splitting a face into two halves to improve registration.

[0019] FIG. 4A shows an example of a cylindrical representation used to model a face.

[0020] FIG. 4B shows an example of using a running mean to remove noise for one pixel.

[0021] FIGs. 4C and 4D show an example of the effect of bilateral filtering on an example of a model of a face.

[0022] FIGs. 4E-4H show an example of results for noise removal for an example of a model of a face being generated.

[0023] FIG. 5A shows a model projected onto an image and an individual projection from the camera and image to the model.

[0024] FIG. 5B shows the weighted sum of four pixels around a projected value.

[0025] FIG. 5C shows a cylindrical map obtained for a face after a first image and also a cylindrical map obtained for the face after ten seconds with several poses.

[0026] FIG. 6A shows laser scans for a Caucasian person and an Asian person.

[0027] FIG. 6B shows a heat map for the Caucasian person.

[0028] FIG. 6C shows a heat map for the Asian person.

[0029] FIG. 6D shows error distrubutions for both the Caucasian person's face and the Asian person's face.

[0030] FIG. 7A shows an example terminal interface.

[0031] FIG. 7B shows various monkey heads used to indicate display pose estimation.

[0032] FIG. 8A shows a method of creating a high resolution 3D model.

[0033] FIG. 8B shows an example of a method of generating unwrapped 2D images from clouds of 3D points.

[0034] FIG. 8C shows an example of a method of processing unwrapped 2D images in a 2D image domain before transformation of the 2D images to a 3D coordinate system.

Detailed Description

[0035] The following description details a method to estimate the pose of a face from range data, which can include tracking the position and direction of a face in a range image. Also described is a method to build a dense 3D model of the face and a process for using the information on the pose estimation to increase the resolution of the reconstructed model. The following detailed examples are presented in sections and subsections, including in which, section 1 provides an introduction, section 2 explains a simple face cropping algorithm, section 3 gives more details regarding an example of a head pose estimation algorithm, section 4 presents an example of a modeling algorithm, section 5 describes a user interface and some further improvements, and section 6 provides a concluding overview regarding the detailed examples. These detailed examples are provided in the interest of clarity, but it will be appreciated that other implementations are also possible.

Section 1 - Introduction

[0036] Three dimensional (3D) head pose estimation and automatic face modeling are two challenging problems which have many potential applications, for example, in a face recognition system. Such biometrics system can be robust to both illumination changes and pose changes.

[0037] A method to find the 3D pose of a head can be performed in real-time, and the information created can be used to generate high-resolution 3D models. The modeling approach can provide outstanding results given the affordability and the low quality of the sensor in some implementations. Noisy information can be accumulated and refined through time, and the resolution of a sensor can be increase by filtering the provided information.

[0038] The method can be purely data-driven and provide faithful models. The 3D pose of the head can be estimated in real-time using a registration algorithm (e.g., Z. Zhang, "Iterative point matching for registration of free-form curves and surfaces", International Journal of Computer Vision, 13(2), which is hereby incorporated by reference) that is able to provide the rigid transformation between two point clouds. The speed can be increased using one or more Graphics Processing Units (GPUs) that enable computation on graphics hardware (e.g., as described in B. Amberg et, "Reconstructing high quality face-surfaces using model based stereo", ICCV, 2007, which is hereby incorporated by reference). The new information can be aligned and added to the existing model using the estimated pose. For dense reconstruction, a cylindrical representation (e.g., as described in Y. Lin et al,

"Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010, which is hereby incorporated by reference) can be used. A running mean can be performed on every pixel to reduce the noise. Moreover, a bilateral filter (e.g., as described in C. Tomasi et al, "Bilateral filtering for gray and color images", IEEE Conference of Computer Vision, 1998, which is hereby incorporated by reference) can be used to remove remaining noise.

[0039] FIG. 1A shows a PRIMESENSE™ camera 100, which can be used as the acquisition hardware in some implementations. The PRIMESENSE™ camera 100 includes an infrared (IR) light source, an RGB camera and a depth camera. The PRIMESENSE™ camera 100 is sold as a single unit, and can thus be understood as a single camera or sensor, even though it includes multiple sensor devices. The sensor 100 can provide both a standard RGB image and a depth image containing the 3D information at 30 frames per second in Video Graphics Array (VGA) format. The sensor 100 can also provide RGB information in Super Extended Graphics Array (SXGA) format at 15 frames per second.

[0040] FIG. IB shows an example of depth information 1 10 received from the sensor. FIG. 1 C shows an example of the corresponding RGB image 120 received from the sensor. The 3D is computed in the infra-red domain thanks to a triangulation method. The sensor 100 can therefore provide results robust to illumination changes and can work in the dark. The hardware is inexpensive, but the low cost comes with a drop in the quality compared to the other state of the art sensors. The resolution is only VGA and the depth data is very noisy, which is a challenge that can be overcome, using the techniques described herein, for various kinds of applications in computer vision. The openNI library (see http://www.openni.org) can be used to facilitate working with the depth information 1 10. The depth information 1 10 can be converted to actual 3D information, and the RGB and depth data can be aligned properly, which enables working with both inputs at the same time.

[0041] Getting an accurate model from the PRIMESENSE™ camera is a difficult problem. Indeed, the quality of the input is poor. The resolution is VGA, the depth data is very noisy, and the depth information cannot be well computed on the edges and on some other occluded parts. That is why modeling from only a frontal view will not typically provide a good model, even if the data is aggregated during a long period of time.

[0042] To compensate for the poor depth data, several poses can be used. This enables adding the information on the occluded parts and refining the information on the edges, and a face pose estimator can be used. However, a drift in the pose estimation can impact the model by adding even more bad information. To minimize the added noise, the pose estimator should be very accurate. Thus, a registration algorithm can be used to address these issues. Moreover, a process of removing the noise from both the input data and the drift in pose estimation can be used to increase the resolution.

Section 2 - Face Segmentation

[0043] An object of interest, such as a face in this example, can be detected and located in the range image using a segmentation approach. In the following example, a simple and fast method to segment a face from the range data using a face cropping algorithm is described. The first step in face tracking is the segmentation of the face, and since the pose is to be detected, the detection should be robust to pose-variation. Moreover, the technique should be fast so it can be used as a pre-processing operation.

[0044] The pose constraint may not allow the use of openCV face detector, which may not work well for every pose of the face. Using only the depth information has the advantage of providing results that are not impacted significantly by illumination changes. Some methods using only the depth information rely on complex statistical modeling (see e.g., S. Malassiotis et al, "Robust real-time 3D head pose estimation from range data", Elsevier Science, 2004, which is hereby incorporated by reference). Such methods can be used in some implementations, but a simpler method is now described.

[0045] A simple and fast method can extract accurately the face from one person standing in front of the camera. This technique is described in the context of an assumption that there is only one person in front of the camera. However, the present invention is not limited to this approach, and other techniques can be used in other implementations, where multiple faces (or other objects) can be identified during pre-processing.

[0046] The upper body can be extracted from the background using a threshold for the depth value. It can be assumed that, in the first image, the arms are under the head, which is a casual body pose. The highest point can be taken and considered to be the top of the head. Then, the data can be analyzed from the top down to look for a discontinuity in the depth map. In order to find a discontinuity, the closest point to the camera can be found for each height .

[0047] For instance, FIG. 2 A shows an example of projected 3D data 200 on the YZ plane. The height y corresponds to the line 205 in FIG. 2A. When comparing the closest point at height y and height y+1, there should be two big discontinuities: the first one is the nose and the second one is the chin, as shown in FIG. 2A. Then, the right-most and left-most points in the heights between the top of the head and the chin can be used for form a bounding box for the face. FIG. 2B shows projected 3D data 210 on the XY plane, with a bound box 215 defined therefor.

[0048] For the following images in the sequence of images received from the camera, the segmented face can be looked for in a neighborhood of the previous detected face. The neighborhood can be fixed to a predefined limit (e.g., 5 cm) in every direction around the previous face. This approach can work since the typical move between two consecutive images is small. Moreover, it facilitates detection of the face even when the arms come over the head. [0049] FIG. 2C shows projected 3D data 220, in which a hand is now present in the image. As shown, the face is still accurately segmented and defined by the bounding box. In order to handle small occlusions, the height of the chin at time t+1 can be set to the value it had at time t if it cannot be found. False positives can be removed by checking that the size of the face remains consistent. This approximation allows detection of the face even when there is a partial occlusion, such as shown in projected 3D data 230 in FIG. 2D, in which the hand occludes the face.

[0050] This approach can detect the face in every pose regardless of the illumination. It can handle some challenging cases such as expression change, facial hair, presence/absence of glasses and some occlusions. For example, FIG. 2F shows successful face detection in projected 3D data including when the face is (a) looking right, (b) looking up, (c) looking down, (d) looking back, (e) horizontal, (f) with glasses, (g) with expression, and (h) with occlusion. Moreover, using a single-core Windows 7 (x32) system with a 2.79 GHz processor, face segmentation can be achieve in 1 ms on average, which is suitable for a realtime application where face segmentation is a pre-processing step for the overall system.

[0051] This face segmentation algorithm can provide fair results robust to pose change, illumination changes and other challenging problems. It can run extremely fast, which is suitable for real-time applications. The main idea is to consider that the chin is a

discontinuity in the depth map, which is easy and fast to find. This algorithm provides a starting point for the remainder of this description, but as will be appreciated, the following systems and techniques can be implemented in combination with other, more complex face segmentation algorithms, or with other object segmentation algorithms, not limited to faces or people.

Section 3 - 3D Face Pose Estimation

[0052] The 3D pose of a face can be computed in real-time using an affordable range sensor (e.g., the PRIMESENSE™ camera 100 from FIG. 1A). The approach described need not rely on any prior knowledge. The following description provides details regarding the head pose estimation algorithm.

[0053] The first frame of an image stream can be assumed to be close to frontal, or a designated frame of an image stream can be indicated as close to frontal. This image can be set as a reference, and a registration algorithm can be used between each new frame and the reference frame. This algorithm can provide accurate results and can be implemented using one or more GPUs to enable fast computation. Moreover, the use of a reference frame can help prevent error propagation. [0054] A rigid transformation between the reference frame and the current input can be computed using a registration algorithm (e.g., Z. Zhang, "Iterative point matching for registration of free-form curves and surfaces", International Journal of Computer Vision, 13(2)). The data of the reference frame can be refined in order to deal with pose-related occlusions, as described further below. The described method performs as well or better than prior state of the art methods, while using an affordable low resolution device. Moreover, the method can work for any kind of object since no prior knowledge on faces need be used.

[0055] Automatic and robust algorithms for head pose estimation can have many applications in human-computer interfaces, augmented reality or face recognition systems. However, it is a complex problem, and obtaining accurate results in real-time is very challenging. Most methods process classic RGB images and have trouble dealing with illumination changes or areas without texture. The release of affordable range sensors, such as the PRIMESENSE™ camera, provides stable and reliable results at a good price. However, the low-cost of these sensors comes with a drop in the quality of the input. The value provided on each pixel is very noisy. As a result, information regarding curvature for the object is not reliable, which can prevent the detection of facial features directly from the depth data.

[0056] In the following approach, the pose can be estimated using Expectation- Maximization - Iterative Closest Point (EM-ICP) on Compute Unified Device Architecture (CUDA) (which is described in T. Tarnaki et al, "Softassign and EM-ICP on GPU", CVPR, 2010, and which is hereby incorporated by reference). This facilitates real-time performance. The approach need not rely on any specific facial feature, which can be challenging to detect, depending on the pose.

[0057] Face pose estimation is a problem which has been widely studied (see e.g., E. Murphy-Chutorian et al., "Head pose estimation in computer vision: A survey", TPAMI, 31(4):607-626, 2009). Existing methods rely on either classic RGB images or range data. The methods dealing with RGB images can be split into appearance-based methods and feature-based methods. Some other methods rely on finding the relative motion between consecutive images.

[0058] The main idea in appearance-based methods is to discretize the head poses in order to learn pose-related models, such as described in M. Jones et al, "Fast multi-view face detection", Technical report, Mitsubishi Electric Research Laboratories, 2003. The input is then compared to the models in order to find the most-resembling one. The best feature-based methods use pose-dependent features. One was developed by Yao and Cham, who select feature points manually and match them to a generic wireframe model (see J. Yao et al., "Efficient model-based linear head motion recovery from movies", CVPR, 2004).

[0059] Methods dealing with 2D images are typically very sensitive to illumination changes and partial occlusions. Using range sensors providing the 3D information can make the systems more stable and robust, and as noted above, these sensors are becoming more affordable. That is likely why many of the recent works on pose estimation use 3D sensors, either purely (see e.g., S. Malassiotis et al, "Robust real-time 3D head pose estimation from range data", Elsevier Science, 2004; and G. Fanelli et al, "Real time head pose estimation with random regression forests", CVPR, 201 1) or coupled with RGB information (see e.g., Bleiweiss et al, "Robust head pose estimation by fusing time-of-flight depth and color", MMSP, 2010).

[0060] Malassiotis uses a feature-dependent method. He uses robust nose tip detection in the range data and then finds the pose by detecting the nose ridge. Fanelli's method relies on machine learning and is able to provide the face pose in real-time on one or more Central Processing Units (CPUs) and without any initialization step. However, the machine learning algorithms are highly dependent on the training data, and the sensors used for the experiments are very high quality. Simon (D. Simon et al, "Real-time 3-D pose estimation using a high-speed range sensor", IEEE International Conference on Robotics and Automation, ICRA, 3 :2235-2241 , 1994) used Iterative Closest Point (ICP) for pose estimation and got good results in a small range.

[0061] In the following method, only the range data provided by the PRIMESENSE™ camera is processed. EM-ICP on GPU can be used in a way that obtains a high rotation range and handles fast moves, the goal being to robustly find the 3D pose of a face in realtime. As before, an initial frame (e.g., the first frame or a designated frame) is set as a reference frame. Every new input is then registered to that reference frame. The face region is segmented, such as described above, where it is assumes that there is only one person standing in front of the camera, and someone standing way behind the main user will be considered as background and be ignored. If someone stands next to the person, the same algorithm can be used by splitting out regions of interest.

[0062] At each frame, the face region is segmented, and the points on the face are sampled to obtain an input point cloud, and registration between the input point cloud and the point cloud from the reference frame is performed. Note that one could register consecutive images and incrementally deduce the pose. However, such methods may require a very accurate pose computation since any drift would likely be propagated. Using a reference frame can be more robust and stable and can recover the pose at any time if an error occurs.

[0063] The development of more advanced graphic cards, such as those available from NVIDIA of Santa Clara, California, enables the use of one or more GPUs to dramatically increase the speed of the computation. An implementation of EM-ICP on GPU was developed by Tamaki (see T. Tarnaki et al, "Softassign and EM-ICP on GPU", CVPR, 2010). It can handle point clouds of different size and some occlusions. For example, FIG. 3A shows an instance of registration between a reference frame 300 and a current frame 305 to produce a registration result 310, where a portion of the points in the cloud 310 are points from the cloud 300 after registration (points of the reference frame 300 are in blue, points of the current frame 305 are in red, and the transformed blue points after registration in the registration result 310 are in green). However, to use this approach for real-time pose estimation, several problems should be handled properly.

[0064] First, the initialization step is decisive for both accuracy and speed. A wrong initialization could either make the system really slow or converge towards a local minimum and not provide the desired results. This can be handled by initializing the transformation matrix at time t by the value it had at time (t-1). This hypothesis seems decent since the difference of object position is typically small between two consecutive frames.

[0065] Second, point clouds which are not similar enough will not be well registered. Unfortunately, a frontal face and a profile face are two objects that globally look different and cannot be well registered for big yaw angles. For example, FIG. 3B shows an instance of wrong registration between a reference frame 320 and a current frame 325. As shown, a registration result 330 is incorrect in the nose region 335. This problem comes from the fact that many points from the frontal face are occluded in the profile face and many points from the profile face are occluded in the frontal face. The minimized overall error will eventually not give the transformation desired.

[0066] To address this issue, some patterns can be artificially created from the reference frame. One way to do this is to remove the points which should be occluded in the views with high yaw angle value. For example, the frontal face can be split into the left part and the right part. The left half of the frontal face will be occluded for a right profile view. If the input is a right profile, using only the right half of the frontal face can provide a good registration.

[0067] FIG. 3C shows an example of splitting a face 350 into two halves to improve registration. Frontal face input 355 is split into two halves, including a left half 360. This makes a facial input from the front and an input 365 with a high yaw angle, with respect to the camera (C) and image plane (I), resemble each other more. This then facilitates registration with inputs having high yaw angles, and the system can thus handle larger pose variations. In this approach, a strategy to switch between methods using the full reference face or only half of it should be adopted. For example, the transformation at time (t-1) can be used to decide which strategy should be applied at time t. If the previous yaw angle is inferior to -15°, only the left half is used, if it superior to +15°, the right half is used, and otherwise the whole face is used. This has been set arbitrarily and provides fair results, but it will be appreciated that other cut off values can be used, or other strategies altogether.

[0068] Such a system can processes VGA images and can provide accurate results for - 40° to 70° for pitch angles, -70° to 70° for yaw angles and 360° for roll angle, which is enough for casual behavior in front of a camera. The system can handle some occlusions, some translations along Z and expression changes. Moreover, it can recover if the person goes out of the screen and back in. Note that the system need not rely on any specific facial features, and it can thus work for any kind of face and even on any kind of object, such as a teapot.

[0069] The speed of the ICP algorithm depends on the number of points in the points cloud. The points can be sampled but the number of points should be chosen in order to have a decent speed while keeping good accuracy. The system can run at 6 frames per second on a GeForce GTX460, but other graphic cards can also be used. In some implementations, around 1,000 points can be used for each frame, which provides a good trade-off between speed and accuracy.

[0070] Thus, a method for fast head pose tracking from low resolution depth data can use registration for pose estimation that does not depend on any specific feature and works for any kind of object. While this approach provides accurate results without a specific model for the object at hand, as shown by the handling of high yaw angle when performing registration, some form of modeling of the object may be desirable. In particular, modeling the object using a generalized approach, which can be applied to many different types of objects may have significant advantages. Moreover, such an approach may have particular relevance in applications involving face recognition.

[0071] Thus, the following description provides details of a method to generate a realistic high-resolution 3D face model using a low resolution depth sensor. The described approach is purely data driven and can produce faithful models without any prior knowledge, where the model can be built up over time by adding information along the way. A live input taken from a depth camera can be processed, and for each processed frame, the input data can be added to the model. This approach can result in a faithful dense 3D face model in a small amount of time.

[0072] As before, an initial frame (e.g., the first frame or a designated frame) is set as a reference frame. The rigid transformation between the reference frame and the current input can be computed using the registration algorithm discussed above (e.g., Z. Zhang, "Iterative point matching for registration of free- form curves and surfaces", International Journal of Computer Vision, 13(2)). That transformation can be used to align the input data to the model. For dense reconstruction, a cylindrical representation can be used (e.g., such as described in Y. Lin et al, "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010) which enables the addition of information in a finite amount of data and faster processing of the 3D information.

Section 4 - 3D Face Modeling

[0073] Dense personalized 3D model reconstruction is an active research subject in the computer vision field. An accurate model can have many applications. For instance, it can be used for face recognition purposes in biometric systems or as an avatar in video games. The following description focuses on face modeling, but it will be appreciated that the systems and techniques described are applicable to modeling other types of objects.

[0074] The release of affordable sensors providing the depth information enables 3D face modeling for home use. However, these sensors (e.g., the PRIMESENSE™ camera discussed above) typically provide low resolution images and very noisy depth information.

Nonetheless, such sensors can be used to build a dense 3D face model as described herein, and the method described can work for nearly any kind of real-time range sensor. The approach can intrinsically handle asymmetry of face and facial hair, unlike most model-based methods. It is also robust to pose-related occlusions when using a single view point. In general, the data-driven reconstruction can create an accurate model from a range sensor providing low-resolution noisy data, where the quality of the model depends on the length of the processed video and the distance of the object (e.g., the user's face) from the camera.

[0075] Initially, a 3D head pose estimator can be used to find the best rigid

transformation between the current input and the model. Thus, the new input is aligned to the model, and the information is added. Typically, this registration is not perfect and adds some noise to the model.

[0076] For dense reconstruction, a cylindrical representation (e.g., such as described in Y. Lin et al, "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010) can be used. This can facilitate processing the data quickly and accumulating the information using a finite amount of memory. The data can be refined in order to remove the noise of both the input and the error on pose estimation. In order to do that, a running mean algorithm can be used on each pixel, and the data can be post-processed with a bilateral filtering (e.g., such as described in as described in C. Tomasi et al, "Bilateral filtering for gray and color images", IEEE Conference of Computer Vision, 1998).

[0077] Most proposed methods to build a face model start from a generic model which is then deformed in order to fit the face. These methods provide decent results, but the generic model tends to bias the reconstructed model. Moreover, face models have been created by locating facial features on the input face and then deforming the generic face model accordingly. Such reconstructed models are decent but still not dense enough to have a good accuracy.

[0078] Other approaches involve fitting a generic face model to the scans and then refining the registration thanks to features detected in the RGB space. The results are decent but do not fully resemble the person and may not handle every pose for the faces. Therefore, all the parts of the face which are occluded will be unknown and wrongly filled by the generic model. Furthermore, in Y. Lin et al, "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010, Lin builds a data-driven model by taking advantage of a stereo system with five different views. The reconstructed models are very accurate. However, the data acquisition requires a special studio environment and the resolution of the processed images must be high in order to achieve good results. In contrast, the present method can be fully data-driven and can run with an affordable noisy low-resolution sensor with good results.

[0079] The present method, which can reconstruct the dense 3D surface of a face, can be understood in terms of two main parts: the acquisition of the 3D information and the acquisition of the texture. Input data can be acquired using the PRIMESENSE™ system, or other such systems. The acquisition system can provide both an RGB image and a depth map at 30 frames per second in VGA format. The sensor system can also provide SXGA format for the RGB image, which is a 1260x1024 resolution.

[0080] To build a fully data-driven high-resolution model from a single view point, one frame is not enough. Indeed, some parts are occluded in one image, but what is needed is accurate information for the whole face. The idea is to get the information through time. By processing a video with several poses for the face, the information that is missing in a single image can be obtained. Moreover, the noise of the depth information provided by the sensor can be significantly removed by averaging through time.

[0081] At each new frame in the video, information is obtained from the frame and added to the model. The new information should be aligned to the model in an accurate way to obtain a consistent result. It can be presumed that the rigid transformation (R, t), composed of the rotation R and the translation t, between a new input frame and the first frame is known. This hypothesis is quite strong. However, the transformation matrix can be provided by a 3D head pose estimation algorithm. For example, the EM-ICP algorithm on CUDA detailed in T. Tarnaki et al, "Softassign and EM-ICP on GPU", CVPR, 2010 can be applied to register the input data to the reference frame. For each new frame, the input can be registered to the model using the inverse of the rigid transformation (R, t). Note that the results of the registration are good but not perfect and create some noise in the model.

[0082] Then, to aggregate the information in a finite amount of data, a cylindrical model can be used. This has proven to give good results, such as described in Y. Lin et al, "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010. In essence, a cylinder is set around the object in the reference frame. For example, FIG. 4A shows a cylindrical representation used to model a face. A cylinder is set around the face in the 3D information 400 received from the sensor. For each vertex of coordinates (x, y, z), the cylindrical coordinate (p, Θ, y) can be computed using the following equations:

p jx + z if(x = 0 A y = O)

z

Θ arcsin— if x≥0

P

. z

- arcsin— h π if x < 0

P

[0083] The 3D information is projected onto the cylinder as shown at 405. The geometry of a facial surface can be represented using an unwrapped cylindrical depth map D, where the value at D(0, y) is the horizontal distance pt the cylinder axis. Thus, an unwrapped map 410 can be generated from one image, as shown in FIG. 4A. This model thus facilitates transformation of the 3D data into a 2D image, which can have several advantages. For example, it limits the amount of data to be retained for the 3D information to a single image, which is suitable for an algorithm where information is continuously added at each new frame. In addition, the 3D data can be processed as a 2D image, which means processing such as filtering becomes easier to use and can be applied faster. Moreover, meshes can be readily generated by creating triangles among the neighboring pixels on the image.

[0084] One potential drawback of this model is that it may be limited to only star-shaped objects. Other objects may not be readily modeled in this way since one angle Θ will have several p values. This means the model may not be able to handle glasses for example. However, a face being a star-shaped object, this model is suitable and enables fast computation.

[0085] In order to remove the noise from both the input and the error on transformation, two combined strategies can be used. First, a running mean can be applied on the /r value of each pixel of the unwrapped cylinder map. FIG. 4B shows an example 420 of using a running mean to remove noise from raw input for the p value for one pixel. This temporal integration enables reduction of the intrinsic noise while aggregating the data. When the whole data has been aggregated, a spatial smoothing can also be applied to remove any remaining noise.

[0086] For the running mean, if the value has been updated n times already, and the value v is to be added, then the update can be in accordance with:

_ nV + v

V

n + \

n = n + \

The model thus obtained, after aggregating the whole data, may not be perfectly smooth. In order to refine it, the unwrapped cylindrical map can be further processed. A closing operation can be performed to fill any remaining holes and a bilateral filter (e.g., such as described in as described in C. Tomasi et al, "Bilateral filtering for gray and color images", IEEE Conference of Computer Vision, 1998) can be used to remove any remaining noise.

[0087] FIGs. 4C and 4D show an example of the effect of bilateral filtering on an example of a model of a face. FIG. 4C shows noise removal on a slice 430 of the nose on the model of the face, including the data of the model both before filtering and after filtering. FIG. 4D shows the model 435 of the face and the corresponding slice 437. Note that the use of a bilateral filter can facilitate removal of the noise while keeping the edges. Moreover, this filtering process is relatively fast thanks to the cylindrical representation of the model.

[0088] Furthermore, the noise from both the input data and the error in pose estimation can be removed by the combination of a temporal integration and a spatial filtering. FIGs. 4E-4H show an example of results for noise removal for an example of a model of a face being generated. FIG. 4E shows accumulated raw input 450 for the model. FIG. 4F shows model data 460 after applying the running mean only. FIG. 4G shows model data 470 after applying bilateral filtering only. FIG. 4H shows model data 480 after applying both the running mean and bilateral filtering. As shown, this process can result in obtaining a smooth model.

[0089] In addition, a good head pose estimation should be used to obtain a good model. In order to handle big pose estimation failure, we can reject the images in which the pose could not be properly computed. Let us note M and / the unwrapped cylindrical depth images containing respectively the model and the new information. We can compute the difference d between the two images M and / and reject images with a high d value. If the two images are not similar enough, this means either that the registration process was wrong or that an object is occluding the face. For example, the intensity of the difference between the model and the new frame can reveal the overlapping error, which can be used to exclude cases of bad registration or occlusion (e.g., by a hand in front of the face). Therefore, this rejection algorithm can also handle some hard occlusions. If we note A the set of pixels which are non-zero in both M and /, the function d can be: d(M, I) = -∑ \M(x, y) - I(x, y)\

[0090] Adding the color information is another complex problem. Applying a similar algorithm on the RGB input does not typically provide good results. Indeed, averaging makes every part of the face blurry and we get a result which does not look natural.

Moreover, creating texture by aggregating RGB images can make it difficult to distinguish details in the color image, such as the pupils of the eyes.

[0091] In order to get a neater texture, a single image (e.g., the reference image) can be used as the source of color information. When the 3D information of the model is computed, every point of the model can be projected onto this image in order to get the RGB value. FIG. 5A shows the model projected onto the image at 500 and an individual projection from the camera and image to the model at 510. However, this approach has some drawbacks in that the three points shown on the model will get the same projected value even though this value should be assigned only to the closest pixel. The values for the occluded parts are wrong, as shown on FIG. 5A, but can still be close enough for many applications. In addition, in some implemantions, multiple reference images can be set (e.g., frontal, left-side and right-side views for a face, or four or more ordinal reference images to get a full 360 degree view of the object) and used to assign color values to the model.

[0092] To reduce the pixelization effect, the RGB values can be computed as a weighted sum of the pixels around the projected value. FIG. 5B shows the weighted sum of four pixels around a projected value. Let (xp, yp) 520 be the projected coordinates of the vertex of coordinates (X Y, Z) onto the image plane. Let (x, dx), (y, dy) respectively be the integer part and the decimal part of p and yp. The red value R can be computed as follows:

R(xP, yP) = (\ - S_x) R(x, y)

+ (\ - S_x)S_y R(x, y + \)

+ S_x(l - S_y) R(x + l, y)

+ S_xS_y R(x + \, y + \)

The green value G and the blue value B can be computed in a similar way, thus providing a good texture as the final product.

[0093] In our experiments, we choose a size of 360x200 for the unwrapped cylindrical map. This contains enough information to get an accurate model and can be filled quickly. However, other sizes are also possible. In addition, although the input was VGA for the depth map and SXGA for the RGB image in the experiments discussed below, it will be appreciated that alternative formats are also possible.

[0094] A single-core Windows 7 (x32) system with a 2.79GHz processor was used. A GPU was used for pose estimation, and this uses a GeForce GTX460. Adding a new frame to the model is very fast and the speed depends on how much information is obtained from the input. It takes about 8 ms for a face of 120x170 pixels (about 14,000 points) and up to 14 ms for a face of 160x200 pixels (about 25,000 points). Note that a bigger face may be obtained in some implementations, provided that the depth map can still be well computed (e.g., the face does not get too close to the camera). A complete model can be obtained in about 10 seconds of live video. FIG. 5C shows a cylindrical map 530 obtained for a face after a first image and also a cylindrical map 535 obtained for the face after ten seconds with several poses. [0095] The systems and techniques have been shown to provide quality reconstruction results for several people and objects. The results are visually accurate, especially the 3D information. In addition, a comparison of the modeling results from the present systems and techniques with those provided by the Geometrix ActivelD face recognition software shows that the present approach can provide more accurate results on the shape while using low- resolution images.

[0096] To quantify the accuracy of the present approach to modeling, we compare it to a commercial laser scan bought at Cyber F/X. FIG. 6A shows laser scans 600 and 605 for a Caucasian person and an Asian person. The first thing to notice is that our method can actually get some value for the hair while the laser scanning systems cannot. That is why the error on the hair region is that high. FIG. 6B shows a heat map 610 for the Caucasian person. FIG. 6C shows a heat map 615 for the Asian person.

[0097] FIG. 6D shows error distributions 620 for both the Caucasian person's face and the Asian person's face. As shown, our model is very close to a laser scan. The average error is about 1mm. For these results, we consider that the laser scans can be used as ground truth. However, in some cases our approach can actually provide better results in some areas, such as the nose area of the face.

[0098] In addition, the quality of the models generated using the present approach can be robust to changes in lighting conditions, at least in part because the 3D information is provided by the sensor and computed with infrared radiations. Thus, a good model can be reconstructed in the dark even if good texture information is not available.

[0099] If we assume that a face has on average a neutral expression, a change in expression can be considered as another noise factor which can be removed by running mean and bilateral filtering (e.g., such as described in as described in C. Tomasi et al., "Bilateral filtering for gray and color images", IEEE Conference of Computer Vision, 1998).

Moreover, our approach should be robust to pose changes since for each new frame we compute the pose and transform the point cloud to align it to the model.

[00100] Thus, a method to generate an accurate 3D face model from a video taken from a low resolution range sensor is presented. A real-time face pose estimator can be used to align the new data to the model. For dense reconstruction, a cylindrical representation, which enables N-view aggregation, can be used. A running mean can be used to remove the noise from both the input data and the drift in pose estimation.

[00101] The combination of temporal integration and spatial smoothing can be used to reduce the noise. Reducing the noise on each pixel results in reducing the variance of the data, which enables an increase in the precision and facilitates a higher resolution.

Experimental results confirm the accuracy, robustness and stability of our method. Our method also performs as well as other state of the art methods even while using a cheap low- resolution noisy sensor.

Section 5 - User Interface and Further Details

[00102] Various user interfaces can be used with the present systems and techniques. FIG. 7A shows an example terminal interface 700. When the program runs, a terminal can open with two openCV windows as shown in FIG. 7A. The terminal (a) displays all the options that can be used. The first window (b) displays the depth input, with the detected face highlighted at 710, the estimated pose at the top left-hand corner and the speed at the top right-hand corner. FIG. 7B shows various monkey heads 750 used to indicate the display pose estimation; from left to right: frontal, looking right, looking up, looking down and rolling the head. Referring again to FIG. 7A, the second window (c) shows the current unwrapped cylindrical map of the the reconstructed model for the face.

[00103] Several options are possible to interact with the program. For example, in the implementation shown in FIG. 7A, the user can press: 'd' to switch between depth and RGB input; 't' to enable/disable the display of the detected face; 's' to start/stop recording the frames; 'm' to freeze/launch the modeling; 'b' to render the estimated pose in a simpler way; 'r' to reset the reference frame and restart the model; and 'q' to terminate the program and issue an OBJ file for the model. When the program is exited, an OBJ file containing the model is release in the model folder, called myModel.obj. The OBJ file can be opened in any software for 3D display such as Meshlab.

[00104] Other implementations can use other 3D file formats to save 3D information. In addition, in some implementations, the 3D model can be displayed directly. Further, as will be appreciated, rather than using the keys of a keyboard to control the system, a user- friendly interface can employ buttons in a graphical user interface in some implementations. In addition, openGL can be used for display instead of using other software to open the model.

[00105] In order to improve the results, the texture information of the model can be improved. Using only one image can give wrong information on the occluded parts, but stitching several images may not work because of the changes in the illumination conditions. One way to deal with this problem would be to remember several RGB images with the corresponding pose and change the texture as a function of the direction of the model. This could provide better results but may require significant memory for the images. Moreover, rather than export a fixed OBJ file, one may need to create the whole interface with other software, such as openGL.

[00106] Another approach would be to virtually build the entire room and compute the value on each pixel of the face based on the lighting conditions and the pose. The skin would be modeled with a Bidirectional Reflectance Distribution Function (BRDF) where the facial reflectance can be computed thanks to the several illumination conditions, as in: A. Ghosh et al., Practical modeling and acquisition of layered facial reflectance, ACM SIGGRAPH Asia, 2008; and P. Debeve et al., Acquiring tile reflectance field of a human face. SIGGRAPH, 2000. P. Debeve et al. uses a more complex model where the layers of the skin are taken into consideration. This could provide faithful results for the whole face. However, this may be heavy in terms of computation and require a very high accuracy for the pose estimation.

[00107] The pose estimation algorithm can be improved by using the created model as an input of the EM-ICP algorithm (T. Tarnaki et al., "Softassign and EM-ICP on GPU", CVPR, 2010) instead of the reference frame only. This kind of feedback loop would make the pose estimation more stable since all the noise of the points would be removed. Moreover, we could reach higher yaw angles, superior to 90° since we would add the information that used to be occluded in the reference frame. Another way to improve it can be to set a region of interest removing the mouth region. This would make the system more robust to emotion changes.

[00108] Other improvements can include adding landmarks on the model to assist in detecting and incorporating user-dependent emotions into the model. The RGB image data can be used to detect the emotions since the depth information is very noisy and may not be useful in this regard. Moreover, another improvement would be to incorporate a face recognition module so that each time a user enters the screen, we can refine his model little by little.

Section 6 - Concluding Overview

[00109] FIG. 8A shows a method of creating a high resolution 3D model. Unwrapped two dimensional (2D) images are generated 800 from clouds of three dimensional (3D) points in a 3D coordinate system. The generating can include registering a 3D input cloud to a 3D reference frame, such as described above, and the unwrapped 2D images can be generated in canonical form and according to a generalized cylinder model. For example, a cylindrical model for N-view aggregation can be used, such as described in Y. Lin et al., "Accurate 3D face reconstruction from weakly calibrated wide baseline images with profile contours", CVPR, 2010. [00110] FIG. 8B shows an example of a method of generating unwrapped 2D images from clouds of 3D points. A first 3D point cloud of the clouds can be set 830 as the 3D reference frame. Registering 832 the 3D input cloud to the 3D reference frame can include registering the 3D input cloud to all points in the 3D reference frame, or to only a portion of the points in the 3D reference frame, in accordance with an assessed rigid body transformation between the clouds of 3D points, such as described in detail above for the case of face reconstruction.

[00111] A pose difference can be estimated 834 for the 3D input cloud. As noted previously, the pose estimation method can work rapidly for any kind of object and any star- shaped object can be modelled. The pose estimation approach can use an EM-ICP algorithm and the speed can be increased by using GPU (see T. Tarnaki et al., "Softassign and EM-ICP on GPU", CVPR, 2010). This approach is fast, reliable, robust to some occlusions, and does not rely on any prior knowledge. Moreover, the pose estimation can be used in generating the 3D model. Thus, a 2D image for the 3D input cloud can be accepted or rejected 836 based on the estimated pose difference, or detected occlusions, or presence of non-rigid motion.

[00112] Referring again to FIG. 8A, the unwrapped 2D images can be processed 802 in a 2D image domain, before the processed 2D images are transformed 804 to the 3D coordinate system. FIG. 8C shows an example of a method of processing unwrapped 2D images in a 2D image domain before transformation of the 2D images to a 3D coordinate system. A running mean can be applied 870 to pixels of an unwrapped 2D image for noise removal.

Interpolation can be performed 872, in an image domain, on the unwrapped 2D image to fill holes. In addition, one or more image-based operators can be applied 874 to filter the unwrapped 2D image. For example, bilateral filtering can be performed, as discussed above. Using these additional processing techniques in some implementations can provide high accuracy and robustness, even though the input data has low-resolution and is very noisy.

[00113] The processes described above, and all of the functional operations described in this specification, can be implemented in electronic circuitry, or in computer hardware, firmware, software, or in combinations of them, such as the structural means disclosed in this specification and structural equivalents thereof, including potentially a program (stored in a machine-readable medium) operable to cause one or more programmable machines including processor(s) (e.g., a computer) to perform the operations described. It will be appreciated that the order of operations presented is shown only for the purpose of clarity in this description. No particular order may be required for these operations to achieve desirable results, and various operations can occur simultaneously or at least concurrently. In certain implementations, multitasking and parallel processing may be preferable.

[00114] The various implementations described above have been presented by way of example only, and not limitation. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[00115] Thus, the principles, elements and features described may be employed in varied and numerous implementations, and various modifications may be made to the described embodiments without departing from the spirit and scope of the invention. Accordingly, other embodiments may be within the scope of the following claims.

Claims

Claims What is claimed is:

1. A method performed by a computer system comprising processor electronics and at least one memory device, the method comprising:

generating, in canonical form and according to a generalized cylinder model, unwrapped two dimensional (2D) images from clouds of three dimensional (3D) points in a 3D coordinate system, the generating comprising registering a 3D input cloud to a 3D reference frame;

processing the unwrapped 2D images in a 2D image domain; and

transforming the processed 2D images to the 3D coordinate system.

2. The method of claim 1, wherein the 3D reference frame is one or more 3D reference frames, and the method comprises setting one or more 3D point clouds of the clouds as the one or more 3D reference frames.

3. The method of claim 1, wherein the registering comprises registering the 3D input cloud to all points in the 3D reference frame, or to only a portion of the points in the 3D reference frame, in accordance with an assessed rigid body transformation between the clouds of 3D points.

4. The method of any of claims 1-3, comprising:

estimating a pose difference for the 3D input cloud; and

accepting or rejecting an unwrapped 2D image for the 3D input cloud based on the estimated pose difference, or detected occlusions, or presence of non-rigid motion.

5. The method of claims 4, wherein the processing comprises applying pixel based comparison to accept or reject an unwrapped 2D image.

6. The method of claim 1, wherein the processing comprises applying temporal filtering including a running mean to a pixel of an unwrapped 2D image.

7. The method of claim 6, wherein the processing comprises applying one or more image-based operators to filter the unwrapped 2D image.

8. The method of claim 7, wherein applying the one or more image-based operators comprises performing bilateral filtering.

9. The method of any of claims 6-8, wherein the processing comprises performing interpolation, in an image domain, on the unwrapped 2D image to fill holes.

10. A computer-readable medium encoding a program that causes data processing apparatus to perform operations in accordance with any of method claims 1-3 and 6-8.

1 1. The computer-readable medium of claim 10, wherein generating the unwrapped 2D images comprises:

estimating a pose difference for the 3D input cloud; and

12. The computer-readable medium of claim 11, wherein the generating comprises applying pixel based comparison to accept or reject an unwrapped 2D image.

13. The computer-readable medium of claim 10, wherein the processing comprises performing interpolation, in an image domain, on the unwrapped 2D image to fill holes.

14. A system comprising:

processor electronics; and

computer-readable media configured and arranged to cause the processor electronics to: generate, in canonical form and according to a generalized cylinder model, unwrapped 2D images from clouds of 3D points in a 3D coordinate system, the generating comprising registering a 3D input cloud to a 3D reference frame; process the unwrapped 2D images in a 2D image domain; and transform the processed 2D images to the 3D coordinate system.

15. The system of claim 14, wherein the computer-readable media is configured and arranged to cause the processor electronics to set a first 3D point cloud of the clouds as the 3D reference frame.

16. The system of claim 14, wherein the computer-readable media is configured and arranged to cause the processor electronics to register the 3D input cloud to all points in the 3D reference frame, or to only a portion of the points in the 3D reference frame, in accordance with an assessed rigid body transformation between the clouds of 3D points.

17. The system of any of claims 14-16, wherein the computer-readable media is configured and arranged to cause the processor electronics to: estimate a pose difference for the 3D input cloud; and accept or reject an unwrapped 2D image for the 3D input cloud based on the estimated pose difference, or detected occlusions, or presence of non-rigid motion.

18. The system of claim 17, wherein the computer-readable media is configured and arranged to cause the processor electronics to apply pixel based comparison to accept or reject an unwrapped 2D image.

19. The system of claim 17, wherein the computer-readable media is configured and arranged to cause the processor electronics to apply temporal filtering including a running mean to a pixel of an unwrapped 2D image.

20. The system of claim 19, wherein the computer-readable media is configured and arranged to cause the processor electronics to apply one or more image-based operators to filter the unwrapped 2D image.

21. The system of claim 20, wherein the computer-readable media is configured and arranged to cause the processor electronics to perform bilateral filtering.

22. The system of claim 20, wherein the computer-readable media is configured and arranged to cause the processor electronics to perform interpolation, in an image domain, on the unwrapped 2D image to fill holes.

23. The system of claim 14, comprising a server computer system and a user- interface computer, wherein the user-interface computer comprises the processor electronics, and the server computer system comprises the computer-readable media.