WO2021120052A1 - 3d reconstruction from an insufficient number of images - Google Patents

3d reconstruction from an insufficient number of images

Info

Publication number
WO2021120052A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
images
user
image sequence
silhouettes
Prior art date
Application number
PCT/CN2019/126298
Other languages
French (fr)
Inventor
Sato Hiroyuki
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/CN2019/126298
Publication of WO2021120052A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G06T7/564 - Depth or shape recovery from multiple images from contours

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A device (100) is provided. The device (100) includes: a camera (115) for capturing an image sequence of a subject, a three dimension (3D) reconstruction unit (123) for reconstructing a 3D model from the image sequence, and a model refinement unit (124) for refining the 3D model so as to be fitted to one or more images selected by a user from the image sequence. The device (100) closes holes on the reconstructed 3D model caused by an insufficient number of images.

Description

3D RECONSTRUCTION FROM AN INSUFFICIENT NUMBER OF IMAGES
TECHNICAL FIELD
The present invention relates to three dimension (3D) reconstruction from a plurality of two dimension (2D) images captured from a subject.
BACKGROUND
Color image-based 3D reconstruction is a well-studied field. SfM (Structure-from-Motion) (for example, refer to: Schönberger, Johannes L. and Jan-Michael Frahm, "Structure-from-motion revisited", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016) estimates camera motion and sparse 3D points, and then MVS (Multi-View Stereo) (for example, refer to: Schönberger, Johannes L. et al., "Pixelwise view selection for unstructured multi-view stereo", European Conference on Computer Vision, Springer, Cham, 2016) is applied to make a dense 3D model from them. Recent depth image-based 3D reconstruction such as KinectFusion (for example, refer to: Newcombe, Richard A. et al., "KinectFusion: Real-time dense surface mapping and tracking", ISMAR, 2011) can make a dense 3D model in real time. These methods are able to make a complete 3D model if a sufficient number of images is captured. However, in a casual scan by a non-professional user with a consumer device such as a smart phone, a large portion of the subject's surface is often not validly captured. There are several reasons: a limited camera field of view, a large or complex object shape, limited scanning space, limited user interface (UI) feedback, fast camera or user motion, lighting conditions, and material properties (e.g., a depth sensor based on IR (Infrared) emission cannot capture valid depth values on some black materials with low IR reflection, and MVS methods are not able to recover dense depth values on uniformly colored surfaces). As a result, an insufficient number of images is captured to reconstruct the subject. Such an insufficient number of images cannot yield a complete 3D model; it leaves large holes on the reconstructed 3D model of the subject.
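For orientation only (this pipeline is prior art, not part of the claimed invention), a hedged sketch of the sparse SfM stage using the pycolmap bindings is shown below; the function names extract_features, match_exhaustive, and incremental_mapping are pycolmap's, their exact signatures vary between versions, and all paths are hypothetical.

```python
# Hedged sketch of a standard SfM front end with pycolmap (an assumption: the
# embodiment does not prescribe any particular SfM/MVS implementation).
import pycolmap

image_dir = "images/"      # hypothetical folder with the captured image sequence
database = "colmap.db"     # feature/match database created by pycolmap
sparse_dir = "sparse/"     # output folder for the sparse reconstruction

pycolmap.extract_features(database_path=database, image_path=image_dir)
pycolmap.match_exhaustive(database_path=database)
maps = pycolmap.incremental_mapping(database_path=database,
                                    image_path=image_dir,
                                    output_path=sparse_dir)
# maps[0] contains camera intrinsics/extrinsics and sparse points; a dense MVS
# stage (e.g. patch-match stereo) would then be run on top of this output.
print(maps[0])
```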
SUMMARY
A device is provided to close holes on the reconstructed 3D model that are caused by an insufficient number of images.
According to a first aspect, a device is provided, where the device includes: a camera for capturing an image sequence of a subject, a three dimension (3D) reconstruction unit for reconstructing a 3D model from the image sequence, and a model refinement unit for refining the 3D model so as to be fitted to one or more images selected by a user from the image sequence.
In a first possible implementation manner of the first aspect, the 3D model is refined based on one or more silhouettes of the subject that are extracted from the one or more selected images.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the device further includes: a user interface unit for showing one or more silhouettes of the subject that are extracted from the one or more selected images, and making the user check whether the one or more silhouettes are accurate or not.
With reference to the first possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the 3D model is reconstructed as a set of points, and holes on the 3D model are closed by one or more parts of a set of tangent surfaces computed from the one or more silhouettes, wherein the one or more parts of the set of tangent surfaces are inside a 3D model reconstructed as a 3D mesh from the set of points.
According to a second aspect, a method performed by a device is provided, where the method includes: capturing an image sequence of a subject, reconstructing a three dimension (3D) model from the image sequence, and refining the 3D model so as to be fitted to one or more images selected by a user from the image sequence.
According to a third aspect, a computer readable storage medium storing a program thereon is provided, where when the program is executed by a processor, the program causes the processor to perform the method according to the second aspect.
BRIEF DESCRIPTION OF DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
Fig. 1 depicts an example of a usage scene of a 3D human model reconstruction  application according to a first embodiment of the present invention;
Fig. 2 depicts an example of a block diagram of a hardware configuration;
Fig. 3 depicts an example of a block diagram of a functional configuration;
Fig. 4 (a) depicts an example of an overall flowchart of model refinement;
Fig. 4 (b) depicts an example of a detailed flowchart of the model refinement;
Fig. 5 (a) depicts an example of a UI shown on the display 117;
Fig. 5 (b) depicts an example of a UI shown on the display 117;
Fig. 6 (a) depicts an example of a 3D model 300 with a hole;
Fig. 6 (b) depicts an example of a tangent surface 301;
Fig. 6 (c) depicts an example of a 3D model 300 and corresponding part of the tangent surface 302 that will be merged to fill the hole;
Fig. 6 (d) depicts an example of a refined 3D model 303.
DESCRIPTION OF EMBODIMENTS
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are merely some but not all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
A first embodiment of the present invention is a 3D human model reconstruction application on a mobile device. Fig. 1 depicts an example of a usage scene of the 3D human model reconstruction application on a mobile device 100, for example, a smart phone. A user 102 holding and operating the mobile device 100 scans a static target person 101. In Fig. 1, only a hand of the user 102 is shown. The target person 101 in Fig. 1 is drawn as a simplified human shape for convenience, but it represents an actual human. The subject is not limited to a human; it may be anything from a small object to a large one, such as a stuffed toy or a car. The user 102 is supposed to move around the target person 101 while keeping a camera 115 (Fig. 2) on the mobile device 100 pointed toward the target person 101 and operating a user interface (UI) on a display 117 (Fig. 2).
The term "scan" means capturing images of a subject from various directions. Ideally, enough images are captured to cover almost all of the surface of the subject; however, the images captured by non-professional users are often not sufficient. In many cases, 3D depth information of, for example, the top of the head, the armpits, and the crotch cannot be obtained, and holes are left on the reconstructed 3D model of the subject. Some of the reasons for the missing images are that it is difficult to capture the top of the head of the static target person 101 without moving to a higher position, and that the armpits and crotch are usually occluded by other parts of the body. Existing techniques for closing the holes are as follows:
Existing Method 1: Screened Poisson surface reconstruction (for example, refer to: Kazhdan, Michael, and Hugues Hoppe, "Screened Poisson surface reconstruction", ACM Transactions on Graphics (ToG) 32.3 (2013): 29), which assumes that a continuous implicit surface underlies the observed points, is widely used to make a 3D mesh from a set of points. Method 1 fills holes implicitly at the same time as meshing. However, Screened Poisson surface reconstruction often fails to naturally close large holes around locally steep geometry, and produces inflated artifacts that are bigger/fatter than the actual surface. As a simplified example, if there is a large hole on the surface of a sphere, the surface around the hole is extended in the tangent direction; as a result, the hole is closed with a cone-like shape rather than a part of the sphere, and this extended surface expands beyond the original spherical surface. After texture mapping, such inflated artifacts become even more noticeable and unsightly, because conspicuous background color is mapped onto them.
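For reference, a minimal sketch of this kind of meshing using the Open3D library follows; the library choice, parameters, and file names are assumptions, since the cited method is generic. When the input point set has large unobserved regions, the closed output surface shows exactly the inflated artifacts described above.

```python
# Hedged sketch of Existing Method 1 (screened Poisson meshing) via Open3D.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan_points.ply")   # hypothetical reconstructed point set
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.05, max_nn=30))
pcd.orient_normals_consistent_tangent_plane(30)    # Poisson needs oriented normals

# Poisson reconstruction returns a closed (watertight) surface, so any large hole
# in the input is bridged by inflated geometry rather than left open.
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
o3d.io.write_triangle_mesh("poisson_mesh.ply", mesh)
```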
Existing Method 2: Hole filling (for example, refer to: Liepa, Peter, "Filling holes in meshes", Proceedings of the 2003 Eurographics/ACM SIGGRAPH symposium on Geometry processing, Eurographics Association, 2003) is also widely used to fill holes on a 3D model. Method 2 detects hole boundaries on a 3D model, parameterizes them, and finally polygonizes them. However, Method 2 is not robust in practice and sometimes fails to fill holes, or fills them unnaturally, because it cannot handle complex hole boundaries on a noisy mesh.
Existing Method 3: Visual Hull (for example, refer to: United States Patent Application, Publication No. US2015/0178988A1, "Method and a system for generating a realistic 3d reconstruction model for an object or being"), which reconstructs a 3D model from a plurality of silhouette images, is another approach to 3D reconstruction. Visual Hull is usually performed under well-calibrated settings. For instance, the subject is placed in a special room where a sufficient number of cameras are rigidly fixed and the walls and floor are covered with a distinct color so that accurate silhouettes of the subject can be extracted. Under such a lab setting, Visual Hull can reconstruct an accurate 3D model.
Method 3 describes a system in such a special room. It basically relies on Visual Hull, but enhances the fidelity of the face by fusing a high-resolution mesh obtained from structured-light based triangulation. A special smoothing method is applied to the boundary of the face to alleviate visible geometric steps caused by combining two independent meshes.
There are two main problems when the approach of Method 3 is used in a casual setup. First, nothing is well calibrated: a moving camera trajectory estimated by SfM or SLAM is prone to drift. A heuristic or machine-learning-based method can be applied to extract a silhouette of a subject in front of an unknown background, but the boundary of the silhouette will be noisy. A drifted camera trajectory and noisy silhouettes degrade the quality of the Visual Hull output. The second problem is that it is unclear which parts should be taken from Visual Hull and which from other methods, and how to identify the boundary region to be smoothed. Therefore, using the approach of Method 3 in a casual setting fails to generate a visually good 3D model.
In the present invention, one or more silhouettes of a subject are used to close holes on the 3D model. The silhouettes are useful for closing holes that appear in unobservable regions such as the top of the head, the crotch, or the armpits.
Fig. 2 depicts an example of a block diagram of a hardware configuration of the first embodiment. The mobile device 100 includes a CPU (Central Processing Unit) 110, a RAM (Random Access Memory) 111, a ROM (Read Only Memory) 112, a bus 113, an Input/Output I/F (Interface) 114, a display 117, and a touch panel 118. The mobile device 100 also has a camera 115 and a storage device 116 that are connected to the bus 113 via the Input/Output I/F 114. The CPU 110 controls each element connected through the bus 113. The RAM 111 is used as the main memory of the CPU 110, among other purposes. The ROM 112 stores the OS (Operating System), programs, device drivers, and so on. The camera 115 connected via the Input/Output I/F 114 captures still images or videos. The storage device 116 connected via the Input/Output I/F 114 is a large-capacity storage, for example, a hard disk or a flash memory. The Input/Output I/F 114 converts data captured by the camera 115 into an image format and stores it in the storage device 116. The display 117 shows a user interface. The touch panel 118 embedded in the display 117 accepts touch operations by the user 102 and transfers them to the CPU 110.
Fig. 3 depicts an example of a block diagram of a functional configuration of the first embodiment. The mobile device 100 includes a user interface control unit 120, an image acquisition unit 121, a silhouette extraction unit 122, a 3D reconstruction unit 123, a model refinement unit 124, and a storage unit 125.
The user interface control unit 120 controls a user interface shown on the display 117 according to the states of the other units and touch operations by the user 102 to the touch panel 118. For example, the user interface control unit 120 is realized by the CPU 110, the RAM 111, programs in the ROM 112, the bus 113, the display 117, and the touch panel 118.
The image acquisition unit 121 obtains a sequence of still images or a video from the camera 115, and stores it in the RAM 111 or the storage device 116. For example, the image acquisition unit 121 is realized by the CPU 110, the RAM 111, programs in the ROM 112, the bus  113, the Input/Output I/F 114, and the camera 115.
The silhouette extraction unit 122 extracts a silhouette of the target person 101 from a still image or a frame of video captured by the image acquisition unit 121 and stored in the RAM 111 or the storage unit 125. The silhouette extraction unit 122 could be implemented in various ways, for example, background subtraction or CNN (Convolutional Neural Network) . For example, the silhouette extraction unit 122 is realized by the CPU 110, the RAM 111, programs in the ROM 112, and the bus 113.
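As one possible (assumed) realization of the silhouette extraction unit 122, the sketch below uses OpenCV's GrabCut seeded with a rough bounding box around the subject; the embodiment may equally use background subtraction or a CNN, as stated above.

```python
# Hedged sketch of silhouette extraction with OpenCV GrabCut (illustrative only).
import cv2
import numpy as np

def extract_silhouette(image_bgr, rect):
    """Return a binary mask (255 = subject, 0 = background) from a rough box."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model, 5,
                cv2.GC_INIT_WITH_RECT)
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)   # definite or probable foreground
    return np.where(fg, 255, 0).astype(np.uint8)

# Hypothetical usage:
# frame = cv2.imread("frame_0001.png")
# silhouette = extract_silhouette(frame, rect=(50, 20, 400, 900))
```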
The 3D reconstruction unit 123 reconstructs a 3D model of the target person 101 from the sequence of still images or the video captured by the image acquisition unit 121 and stored in the RAM 111 or the storage unit 125. The 3D reconstruction unit 123 also estimates extrinsic parameters that define 3D rigid transformation between each image used for the reconstruction and the 3D model. The 3D reconstruction unit 123 could be implemented in various ways, for example, SfM (Structure-from-Motion) and MVS (Multi-View Stereo) for color images or KinectFusion for depth images. For example, the 3D reconstruction unit 123 is realized by the CPU 110, the RAM 111, programs in the ROM 112, and the bus 113.
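For clarity, the sketch below writes out the pinhole camera model that these intrinsic and extrinsic parameters define, together with its inverse ("unprojection"), which step S200 relies on later; the helper names project and unproject are illustrative, not taken from the patent.

```python
# Plain-numpy sketch of the pinhole projection defined by K (intrinsics) and R, t
# (extrinsics): world -> camera -> pixels, and the inverse mapping given depths.
import numpy as np

def project(points_3d, K, R, t):
    """Project Nx3 world points to Nx2 pixel coordinates (no lens distortion)."""
    cam = points_3d @ R.T + t              # rigid transform: world -> camera frame
    uvw = cam @ K.T                        # apply intrinsics -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]        # perspective division

def unproject(pixels, depths, K, R, t):
    """Map Nx2 pixels with per-pixel depths back to Nx3 world points."""
    ones = np.ones((pixels.shape[0], 1))
    rays = np.hstack([pixels, ones]) @ np.linalg.inv(K).T   # camera-frame rays, z = 1
    cam = rays * depths[:, None]                            # scale rays to the given depths
    return (cam - t) @ R                                    # camera -> world (R orthonormal)
```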
The model refinement unit 124 refines the 3D model reconstructed by the 3D reconstruction unit 123 to make a refined 3D model with one or more silhouettes selected by the user 102. The details will be described later. For example, the model refinement unit 124 is realized by the CPU 110, the RAM 111, programs in the ROM 112, and the bus 113.
The storage unit 125 stores the captured images and the refined 3D model into the storage device 116 for further use. For example, the storage unit 125 is realized by the Input/Output I/F 114 and the storage device 116.
The CPU 110 controls the above-mentioned units in this embodiment.
Fig. 4 (a) depicts an example of an overall flowchart of model refinement according to the first embodiment. Fig. 4 (b) depicts an example of a detailed flowchart of the model refinement according to the first embodiment. Each step of Figs. 4 (a) and 4 (b) is executed by the CPU 110, and data are stored in the RAM 111 or the storage device 116 and loaded from them as needed.
At step S100, the CPU 110 obtains an image sequence via the image acquisition unit 121 with the camera 115 and stores it in the RAM 111. It is assumed in this embodiment that the images are color images. The sequence could also be stored in the storage device 116 by the storage unit 125. Fig. 1 shows how the mobile device 100 is operated in this step. The user 102 holding and operating the mobile device 100 scans the static target person 101 as completely as possible. The user 102 is supposed to move around the target person 101 while keeping the camera 115 on the back of the mobile device 100 pointed toward the target person 101.
At step S101, the CPU 110 processes the image sequence obtained at step S100 to generate a 3D model. The 3D reconstruction unit 123 reconstructs the 3D model and estimates extrinsic camera parameters (mentioned above) and, if necessary, intrinsic camera parameters (mentioned later). All of the outputs of step S101 are stored in the RAM 111 or the storage device 116.
A complete 3D model is rarely reconstructed at S101 because an insufficient number of images is often captured at S100. At step S102, the UI on the display 117 requests the user 102 to select, from the image sequence, one or more images to which the user 102 wishes to fit the 3D model. The message "Select frontal view" in Fig. 5 (a) is merely an example; the user 102 is requested to select "one or more images" to be used for fitting the 3D model to the silhouettes of the subject that are extracted from those "one or more images". The UI is controlled by the user interface control unit 120.
Fig. 5 (a) shows the UI of this step. On the display 117, thumbnails of the image sequence 200 are shown. The images captured by the camera 115 are used for the thumbnails in Figs. 5 (a) and 5 (b) (the face of the person in the thumbnails in Figs. 5 (a) and 5 (b) has been obscured for privacy protection because this patent application document will be made public). In Fig. 5 (a), the upper-left image is a photo of a person standing in a room captured from the front; the upper-right image, the middle-left image, the middle-right image, and the lower-right image are captured from the rear right, from the back, from the left, and from the front right, respectively; and the lower-left image is a photo of the lower body of the person captured from the front right. The user 102 is supposed to select one or more frames corresponding to the thumbnails by a touch operation. If there are too many images to show on the display 117 at one time, a next page button 201 is shown to change the thumbnails 200 so that the other images can be shown. After the user 102 selects one of the images, for example, the upper-left image, the UI changes and the selected image 202 is displayed as shown in Fig. 5 (b).
At step S103, the silhouette extraction unit 122 extracts the silhouette of the target person 101 from the frames selected at step S102.
At step S104, the user 102 checks whether the silhouette extraction result shown on the display 117 is acceptable or not in terms of silhouette accuracy. Fig. 5 (b) depicts an example of the UI at step S104. The selected image 202 and the corresponding extracted silhouette 203 are shown. The silhouette is shown in white and the background is shown in black. There may be cases where a wrong silhouette is extracted because of an algorithm error. The user 102 taps one of the response buttons 204, namely "OK" or "NG", to accept or reject the extracted silhouette. After the user 102 responds, the UI continues to show another selected image and the corresponding silhouette. Once all of the one or more selected images and corresponding silhouettes have been checked by the user 102, the process goes to the next step: if at least one silhouette does not have acceptable quality (the user 102 responded "NG" at least once), the process goes to step S105; if all silhouettes are acceptable (the user 102 responded "OK" for all of the one or more selected images and corresponding silhouettes), it goes to step S107.
The above-mentioned UI interaction may be eliminated by automatically selecting one or more images for silhouette refinement.
All or part of the process at step S104 could be performed in advance or integrated into earlier steps. For example, at step S100, while the user 102 is capturing the image sequence, the UI could show the corresponding silhouette and the response buttons for each captured image in real time. In this case, the user 102 could select one or more images and check the corresponding silhouettes during step S100.
At step S105, the UI asks whether the user 102 wishes to select other images from the existing image sequence (the image sequence obtained at step S100). If yes, the process goes back to step S102; if no, it goes to step S106.
At step S106, the user 102 captures another image sequence in the same way as at step S100. After capturing, camera parameters for each additional image are estimated in the same way as at step S101. Images captured at this step are merged into the existing image sequence. Then, the process goes back to step S102.
At step S107, the model refinement unit 124 refines the 3D model reconstructed at step S101 by using the silhouettes extracted at step S103 and confirmed by the user 102 at step S104. The details of the model refinement are shown in Fig. 4 (b) and will be explained later. The refined 3D model is stored in the RAM 111 or the storage device 116 for further use, for instance, in 3D model viewers or Augmented Reality applications.
To summarize the user operations other than capturing the image sequence: the user operating the device selects one or more images of the subject from the image sequence that the user is capturing and/or has already captured. The device according to the present invention then extracts one or more silhouettes of the subject from the selected images and asks the user whether the silhouettes are accurate. If they are, the device refines the reconstructed 3D model based on the silhouettes to close holes on it. If not, the user is requested to select other images from the image sequence or to capture additional images.
Next, the details of the model refinement by the model refinement unit 124 are described with reference to the flowchart in Fig. 4 (b). The model refinement unit 124 refines the 3D model reconstructed by the 3D reconstruction unit 123 to make a refined 3D model with the one or more silhouettes selected by the user 102. The 3D model often has large holes caused by the insufficient number of input images. Fig. 6 (a) depicts an example of a 3D model 300 of a human with a hole on the top of the head, caused by the difficulty of capturing that region in a casual scan. Fig. 6 (a) shows a simplified 3D model of a human only from the shoulders up, viewed from obliquely above. The model refinement unit 124 fills this hole by using the one or more silhouettes.
At step S200, the model refinement unit 124 computes a set of 3D curved tangent surfaces from each silhouette based on the principle of perspective projection and camera parameters, in the same spirit as Visual Hull. Various methods could be used at this step. For example, a signed distance function on a voxel grid set to cover the 3D model is updated by unprojecting the pixels inside a silhouette ("projection" means mapping 3D points to 2D points on an image, and "unprojection" means the reverse, namely, mapping 2D points on an image back to 3D points) to make an implicit tangent surface of the silhouette. Then marching cubes is applied to extract the tangent surface. The tangent surface should cover the holes if the 3D model and the silhouette are sufficiently accurate. Such a tangent surface 301 is shown in Fig. 6 (b), in which the white part corresponds to the silhouette, and the dark gray part corresponds to a possible surface calculated from the silhouette. This process is applied to all of the one or more silhouettes. To perform the unprojection in the same coordinate system as the 3D model, camera parameters are required; they are already obtained before this step S200 is performed. In the first embodiment, the camera 115 has a standard field of view and can be approximated by a pinhole camera model. The intrinsic camera parameters, which are the focal length and principal point, and the distortion coefficients could be calibrated independently before the application runs or estimated by the 3D reconstruction unit 123. The extrinsic camera parameters are estimated by the 3D reconstruction unit 123.
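A hedged sketch of step S200 is given below, assuming a fixed-resolution voxel grid, a simple binary inside/outside field in place of a full signed distance function, marching cubes from scikit-image, and the project helper from the earlier pinhole-camera sketch; the grid bounds and resolution are illustrative.

```python
# Hedged sketch of step S200: carve the silhouette's viewing cone into a volume
# covering the 3D model and extract the tangent surface with marching cubes.
import numpy as np
from skimage import measure

def tangent_surface_from_silhouette(silhouette, K, R, t, bounds, res=128):
    (xmin, ymin, zmin), (xmax, ymax, zmax) = bounds        # box around the 3D model
    xs = np.linspace(xmin, xmax, res)
    ys = np.linspace(ymin, ymax, res)
    zs = np.linspace(zmin, zmax, res)
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    voxels = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

    uv = project(voxels, K, R, t)                          # project() from the earlier sketch
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, silhouette.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, silhouette.shape[0] - 1)
    inside = silhouette[v, u] > 0                          # voxel falls inside the silhouette

    # +1 inside the viewing cone, -1 outside: the zero level set is the tangent surface.
    field = np.where(inside, 1.0, -1.0).reshape(res, res, res)
    verts, faces, _, _ = measure.marching_cubes(field, level=0.0)

    # marching_cubes returns voxel-index coordinates; map them back to world units.
    scale = np.array([xmax - xmin, ymax - ymin, zmax - zmin]) / (res - 1)
    return verts * scale + np.array([xmin, ymin, zmin]), faces
```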
At step S201, the model refinement unit 124 calculates the parts of the tangent surfaces that lie over the holes in order to close them. In this step, the locations of the holes are identified. Various methods could be used for this step. For instance, Poisson surface reconstruction (refer to "Existing Method 1" mentioned earlier) is applied to the 3D model to make a closed surface with inflated artifacts over the holes. By projecting vertices or faces of the closed surface onto the silhouettes and checking whether the projected positions fall outside the silhouettes, the inflated-artifact parts of the closed surface are determined. Such outside parts lie over the holes. Then a nearest-neighbor search from the outside parts to the tangent surfaces is used to find the corresponding parts of the tangent surfaces that close the holes. Such a part of a tangent surface 302 is shown in Fig. 6 (c).
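The sketch below illustrates one assumed realization of this selection: Poisson-mesh vertices that project outside a confirmed silhouette are treated as inflated artifacts over holes, and a nearest-neighbor query (scipy's cKDTree) picks the tangent-surface vertices that should patch them; the distance threshold and the project helper are illustrative.

```python
# Hedged sketch of step S201: locate hole regions via the silhouettes and pick
# the matching part of the tangent surface with a nearest-neighbor search.
import numpy as np
from scipy.spatial import cKDTree

def find_hole_patch(poisson_verts, tangent_verts, silhouettes_with_cams, max_dist=0.02):
    """Return indices of tangent-surface vertices lying over the holes."""
    outside = np.zeros(len(poisson_verts), dtype=bool)
    for silhouette, (K, R, t) in silhouettes_with_cams:
        uv = project(poisson_verts, K, R, t)               # project() from the earlier sketch
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, silhouette.shape[1] - 1)
        v = np.clip(np.round(uv[:, 1]).astype(int), 0, silhouette.shape[0] - 1)
        outside |= silhouette[v, u] == 0                    # inflated vertex: outside silhouette

    tree = cKDTree(tangent_verts)                           # tangent-surface vertices
    dists, idx = tree.query(poisson_verts[outside])
    return np.unique(idx[dists < max_dist])
```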
At step S202, the parts of the tangent surfaces calculated at step S201 are merged into the 3D model to make a refined 3D model. The merged surface of the refined 3D model 303 is shown in Fig. 6 (d).
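A minimal sketch of the merge is shown below; it simply appends the selected tangent-surface patch to the original mesh by concatenating the vertex arrays and re-indexing the patch faces, leaving the stitching of the seam between the two parts out of scope.

```python
# Hedged sketch of step S202: append the hole-closing patch to the original mesh.
import numpy as np

def merge_meshes(verts_a, faces_a, verts_b, faces_b):
    verts = np.vstack([verts_a, verts_b])
    faces = np.vstack([faces_a, faces_b + len(verts_a)])   # offset the patch's indices
    return verts, faces
```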
According to the embodiment of the present invention, holes on the 3D model are closed and aligned with the silhouettes of the subject. A closed surface is not only visually pleasing but also important for further applications of the 3D model, because many computer graphics and computer vision algorithms assume that the input surface is closed.
What is disclosed above is merely exemplary embodiments of the present invention, and certainly is not intended to limit the protection scope of the present invention. A person of ordinary skill in the art may understand that all or some of processes that implement the foregoing embodiments and equivalent modifications made in accordance with the claims of the present invention shall fall within the scope of the present invention.

Claims (6)

  1. A device, comprising:
    a camera for capturing an image sequence of a subject,
    a three dimension (3D) reconstruction unit for reconstructing a 3D model from the image sequence, and
    a model refinement unit for refining the 3D model so as to be fitted to one or more images selected by a user from the image sequence.
  2. The device according to claim 1, wherein the 3D model is refined based on one or more silhouettes of the subject that are extracted from the one or more selected images.
  3. The device according to claim 2, further comprising:
    a user interface unit for showing one or more silhouettes of the subject that are extracted from the one or more selected images, and making the user check whether the one or more silhouettes are accurate or not.
  4. The device according to claim 2, wherein the 3D model is reconstructed as a set of points, and holes on the 3D model are closed by one or more parts of a set of tangent surfaces computed from the one or more silhouettes, wherein the one or more parts of the set of tangent surfaces are inside a 3D model reconstructed as a 3D mesh from the set of points.
  5. A method performed by a device, comprising:
    capturing an image sequence of a subject,
    reconstructing a three dimension (3D) model from the image sequence, and
    refining the 3D model so as to be fitted to one or more images selected by a user from the image sequence.
  6. A computer readable storage medium storing a program thereon, wherein when the program is executed by a processor, the program causes the processor to perform the method according to claim 5.
PCT/CN2019/126298 2019-12-18 2019-12-18 3d reconstruction from an insufficient number of images WO2021120052A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/126298 WO2021120052A1 (en) 2019-12-18 2019-12-18 3d reconstruction from an insufficient number of images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/126298 WO2021120052A1 (en) 2019-12-18 2019-12-18 3d reconstruction from an insufficient number of images

Publications (1)

Publication Number Publication Date
WO2021120052A1 (en)

Family

ID=76476978

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126298 WO2021120052A1 (en) 2019-12-18 2019-12-18 3d reconstruction from an insufficient number of images

Country Status (1)

Country Link
WO (1) WO2021120052A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1308902A2 (en) * 2001-11-05 2003-05-07 Canon Europa N.V. Three-dimensional computer modelling
WO2009006273A2 (en) * 2007-06-29 2009-01-08 3M Innovative Properties Company Synchronized views of video data and three-dimensional model data
US20140111507A1 (en) * 2012-10-23 2014-04-24 Electronics And Telecommunications Research Institute 3-dimensional shape reconstruction device using depth image and color image and the method
CN104282040A (en) * 2014-09-29 2015-01-14 北京航空航天大学 Finite element preprocessing method for reconstructing three-dimensional entity model
CN109242954A (en) * 2018-08-16 2019-01-18 叠境数字科技(上海)有限公司 Multi-view angle three-dimensional human body reconstruction method based on template deformation
CN109658449A (en) * 2018-12-03 2019-04-19 华中科技大学 A kind of indoor scene three-dimensional rebuilding method based on RGB-D image

Similar Documents

Publication Publication Date Title
US11210838B2 (en) Fusing, texturing, and rendering views of dynamic three-dimensional models
WO2020192706A1 (en) Object three-dimensional model reconstruction method and device
EP3323249B1 (en) Three dimensional content generating apparatus and three dimensional content generating method thereof
KR101560508B1 (en) Method and arrangement for 3-dimensional image model adaptation
US9886530B2 (en) Computing camera parameters
KR101613721B1 (en) Methodology for 3d scene reconstruction from 2d image sequences
JP6685827B2 (en) Image processing apparatus, image processing method and program
Shen et al. Virtual mirror rendering with stationary rgb-d cameras and stored 3-d background
EP3429195A1 (en) Method and system for image processing in video conferencing for gaze correction
JP2018530045A (en) Method for 3D reconstruction of objects from a series of images, computer-readable storage medium and apparatus configured to perform 3D reconstruction of objects from a series of images
EP2089852A1 (en) Methods and systems for color correction of 3d images
Slabaugh et al. Image-based photo hulls
WO2021078179A1 (en) Image display method and device
CN113628327A (en) Head three-dimensional reconstruction method and equipment
US20220277512A1 (en) Generation apparatus, generation method, system, and storage medium
CN113516755A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN111742352A (en) 3D object modeling method and related device and computer program product
US9998724B2 (en) Image processing apparatus and method for processing three-dimensional data that describes a space that includes physical objects
WO2021120052A1 (en) 3d reconstruction from an insufficient number of images
Lim et al. 3-D reconstruction using the kinect sensor and its application to a visualization system
Ha et al. Normalfusion: Real-time acquisition of surface normals for high-resolution rgb-d scanning
EP3236422A1 (en) Method and device for determining a 3d model
Lee et al. Panoramic mesh model generation from multiple range data for indoor scene reconstruction
CN115272604A (en) Stereoscopic image acquisition method and device, electronic equipment and storage medium
Savakar et al. A relative 3D scan and construction for face using meshing algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19956820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19956820

Country of ref document: EP

Kind code of ref document: A1