WO2020118565A1

WO2020118565A1 - Keyframe selection for texture mapping wien generating 3d model

Info

Publication number: WO2020118565A1
Application number: PCT/CN2018/120635
Authority: WO
Inventors: Sato Hiroyuki; Chiba NAOKI
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2020-06-18

Abstract

A method for generating a 3D model, includes: reconstructing a 3D shape of an object based on a plurality of frames captured from a plurality of different positions, wherein each of the plurality of captured frames includes a depth image and a view image; selecting a plurality of key-frames from the plurality of captured frames, based on a distance between the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and mapping a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames.

Description

[Title established by the ISA under Rule 37.2] KEYFRAME SELECTION FOR TEXTURE MAPPING WIEN GENERATING 3D MODEL

Technical Field

The present disclosure relates to generating a 3D (three dimensional) model of one more objects in a physical environment from input image data. In more detail, the disclosure relates to optimally selecting key frames, such as according to correspondence with a finally reconstructed 3D shape of the 3D model, for use in texture mapping. The disclosure also relates to re-estimating camera pose based on the final 3D shape. In particular, the disclosure relates to capture and processing equipment such as in RGB-D cameras and in mobile devices, e.g. handhelds, phones that have comparatively limited power and processing capabilities.

Background Art

In recent years, techniques for simultaneously capturing color images and depth images and reconstructing 3D models of objects have been extensively studied. Such techniques use depth images to reconstruct a 3D shape (3D shape) , and use color images to map a texture to the 3D shape, thereby generating a 3D model of an object. A technique called KinectFusion has become a well-known means of capturing and reconstructing a 3D model of an object using just a RGB-D camera. Compared with color images, depth maps are nosier. KinectFusion refines the noise over time, by estimating camera motion at the same time as reconstructing the shape of the object. In order to refine noisy data, each depth-map captured by the RGB-D camera is recorded in voxel space, smoothed over time frames. The camera motion between consecutive frames is estimated by minimizing the distance between the depth-map and the reconstructed shape from frames until the previous frame. Once the camera motion is estimated, the corresponding depth-map is integrated into the reconstructed shape by averaging over past frames in voxel space. After processing all of the frames, the final shape is converted from the voxel space to mesh patches. By using the mesh and color images, the textured model is generated.

Texture mapping uses the camera motion when color images and depth images are acquired. The camera motion may be estimated, for example, in the process of reconstructing a 3D shape. When the estimation of the camera motion is inaccurate, a texture may be mapped to an inaccurate location and an unnatural 3D model may be generated.

Summary

Embodiments provide a method for generating a 3D model, an apparatus such as an encoder, a mobile device and a computer-readable storage medium. For example, the mobile device may be a phone, specialized video capture apparatus, an encoder chip and the like.

To achieve the foregoing objective, the following technical solutions are used in the embodiments.

The first aspect of an embodiment provides a method for generating a 3D model. The method according to the first aspect, includes: reconstructing a 3D shape of an object based on a plurality of captured frames corresponding to view points of the object, wherein each of the plurality of captured frames includes a depth image and a view image; selecting a plurality of key-frames from the plurality of captured frames, based on the distance between of the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and mapping a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames. In the reconstructing, the 3D model configured by the 3D shape and the texture may be generated. The object may be a physical real world object, and the reconstructing may apply to at least one object set each including a plurality of physical real world objects. The plurality of captured frames may be captured from a plurality of different positions e.g. as the camera and/or object move relative to each other. The plurality of different positions may correspond to time points. At each of the plurality of different positions, at least part of the object may be included in a view angle of the camera. Accordingly, the plurality of different positions may further correspond to angle ranges covering intended parts of the object. Each angle range may be expressed by using a center of the object, a position of the camera at each time point and the view angle of the camera, because a straight line joining the center of the object and the position of the camera may correspond to an optical axis of the camera, and the view angle of the camera can be determined based on specifications of a lens and an image sensor equipped in the camera. The centre of the object may be configured based on a 3D shape estimated in the reconstructing. The view image may be a color image or a non-color image. The distance may indicate a distance between a surface or point of the reconstructed 3D shape and the 3D coordinate. The method relates to image encoding and is hence computational operation performed by a processing circuitry, on real world input data. The processing circuitry may be implemented via at least one processing unit which may include application-specific integrated circuit (ASIC) logic, graphics processor (s) , general purpose processor (s) , or the like. In some examples, the processing circuitry may be implemented via hardware, imaging dedicated hardware, or the like.

According to the first aspect, since key-frames are selected based on the distance between the reconstructed 3D shape and a depth image, texture mapping is carried out by using a view image corresponding to a location where the deviation of the camera motion is small. As a result, improved results of texture mapping are obtained, and a high-quality 3D model is generated. In addition, as compared to the optimization method of document D of enhancing the accuracy of texture mapping by using all view images, a 3D model with a sufficiently high quality is obtained at a reduced calculation cost of a processor.

In the first possible implementation form of the method according to the first aspect, the selecting the plurality of key-frames includes: extracting a plurality of key-frame candidates from the plurality of captured frames, wherein the plurality of key-frame candidates satisfy a predetermined condition relevant to angle ranges to an object and/or a characteristic of the view image; and selecting the plurality of key-frames from the plurality of key-frame candidates based on the distance between the reconstructed 3D shape and the 3D coordinate. For example, the predetermined angle corresponding to the captured frames at 5 sec., 10 sec., 15 sec. ..., or similarly at certain angular intervals e.g. 5 degrees, after the reconstructing may be configured as the condition relevant to the angle ranges. The condition relevant to the angle ranges may further include that the captured frame corresponding to a predetermined angle is selected as a key-frame candidate in an angle range including the predetermined angle and one or more captured frames before and/or after this key-frame candidate are also selected as key-frame candidates in this angle range. A characteristic of the view image may be a proxy for a quality of the view image, which might correspond to the extent of motion blur in the view image; a lower motion blur might correspond to a higher quality. Higher and lower values may be expressed numerically e.g. by an index, and may be absolute values or relative values in a set of view images such as the captured ones.

According to the first possible implementation form, frames each including a high-quality view image among view images respectively corresponding to predetermined angle ranges for an object are extracted as key-frame candidates, and a key-frame is selected from the key-frame candidates, so that the calculation cost can be reduced as compared with a case where all frames are used.

In the second possible implementation form of the method according to the first aspect, the method further includes: estimating a motion of a camera capturing the plurality of captured frames, based on a captured frame and an intermediate 3D shape in the reconstructing; transforming the 3D coordinate obtained by using the depth image on a camera coordinate system to a world coordinate system based on the estimated motion; and calculating the distance based on the transformed 3D coordinate and a finally reconstructed 3D shape. For example, the finally reconstructed 3D shape may be reconstructed by sequentially using the depth images included in the captured frames #1, #2, ..., #n. In this case, the intermediate 3D shape is generated by using the depth images in the captured frames #1, #2, ..., #k (k<n) , and a camera motion between the captured frame #k and the captured frame #k+1 is estimated based on the intermediate 3D shape and the captured frame #k+1. The intermediate 3D shape refers to a temporal, mid-stage 3D shape reconstructed by using the captured frames #1 to #k prior to the last captured frame #n that is used to reconstruct the finally reconstructed 3D shape.

According to the second possible implementation form, the motion of the camera can be efficiently estimated based on an intermediate 3D shape and a depth image in the process of reconstructing a 3D shape, without using a motion sensor such as an accelerator sensor or a gyro sensor.

In the third possible implementation form of the method according to the first aspect, the method further includes: re-estimating a motion of the camera based on the depth image and the finally reconstructed 3D shape, wherein a thereby obtained re-estimated motion of the camera is used in the mapping the texture to the reconstructed 3D shape.

According to the third possible implementation form, the motion of the camera is re-estimated by using the finally obtained 3D shape and the depth image, so that a more accurate motion of the camera is obtained. As texture mapping is carried out by using the re-estimated motion of the camera, the mapping accuracy may be improved to generate a 3D model with a higher quality. From another aspect, the camera’s motion is smoothed over time, resulting in more accurate motion of the camera, which confers the higher mapping accuracy.

In the fourth possible implementation form of the method according to the first aspect, the selecting the plurality of key-frames includes: calculating an error indicating mismatching between the estimated motion and the finally reconstructed 3D shape based on the distance, with respect to each of the plurality of the key-frame candidates, and selecting as a key-frame, a key-frame candidate corresponding to the minimum error from a set of key-frame candidates corresponding to each angle range. According to the fourth possible implementation form, the mapping accuracy may be improved to generate a 3D model with a higher quality.

In the fifth possible implementation form of the method according to the first aspect, only the plurality of key-frames selected in the selecting are used in the mapping the texture. According to the fifth possible implementation form, the calculation cost can be reduced as compared with a case where all frames are used.

The second aspect of an embodiment provides the following apparatus such as the mobile device. For example, the mobile device may be a phone, specialized video capture apparatus, an encoder chip and the like. The apparatus may be configured to perform the methods disclosed herein.

The apparatus, preferably the mobile device according to the second aspect, includes processing circuitry which is configured to perform: reconstructing a 3D shape of an object based on a plurality of captured frames from a plurality of different positions, wherein each of the plurality of captured frames includes a depth image and a view image; selecting a plurality of key-frames from the plurality of captured frames, based on the distance between the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and mapping a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames.

In the reconstructing, the 3D model configured by the 3D shape and the texture may be generated. The object may be a physical real world object, and the reconstructing may apply to at least one object set each including a plurality of physical real world objects. The plurality of captured frames may be captured from a plurality of different positions. The plurality of different positions may correspond to time points. At each of the plurality of different positions, at least part of the object may be included in a view angle of the camera. Accordingly, the plurality of different positions may further correspond to angle ranges covering intended parts of the object. Each angle range may be expressed by using a center of the object, a position of the camera at each time point and the view angle of the camera, because a straight line joining the center of the object and the position of the camera may correspond to an optical axis of the camera, and the view angle of the camera can be determined based on specifications of a lens and an image sensor equipped in the camera. The center of the object may be configured based on a 3D shape estimated in the reconstructing. The view image may be a color image or a non-color image. The distance may indicate a distance between a surface or point of the reconstructed 3D shape and the 3D coordinate. The method relates to image encoding and is hence computational operation performed by a processing circuitry, on real world input data. In some examples, the processing circuitry may be implemented via processing unit (s) . Processing unit (s) may include application-specific integrated circuit (ASIC) logic, graphics processor (s) , general purpose processor (s) , or the like. In some examples, the processing circuitry may be implemented via hardware, imaging dedicated hardware, or the like.

According to the second aspect, since key-frames are selected based on the distance between the reconstructed 3D shape and a depth image, texture mapping is carried out by using a view image corresponding to a location where the deviation of the camera motion is small. As a result, improved results of texture mapping are obtained, and a high-quality 3D model is generated. In addition, as compared to the optimization method of document D of enhancing the accuracy of texture mapping by using view images, a 3D model with a sufficiently high quality is obtained at a reduced calculation cost of a processor.

In the first possible implementation form of the mobile device according to the second aspect, in the selecting the plurality of key-frames, the mobile device is configured to perform: extracting a plurality of key-frame candidates from the plurality of captured frames, wherein the plurality of key-frame candidates satisfy a predetermined condition relevant to angle ranges to an object and a quality of the view image; and selecting the plurality of key-frames from the plurality of key-frame candidates based on the distance between the reconstructed 3D shape and the 3D coordinate. For example, the predetermined angle corresponding to the captured frames at 5 sec., 10 sec., 15 sec. ... after the reconstructing may be configured as the condition relevant to the angle ranges. The condition relevant to the angle ranges may further include that the captured frame corresponding to a predetermined angle is selected as a key-frame candidate in an angle range including the predetermined angle and one or more captured frames before and/or after this key-frame candidate are also selected as key-frame candidates in this angle range.

In the second possible implementation form of the mobile device according to the second aspect, the mobile device is further configured to perform: estimating a motion of a camera capturing the plurality of captured frames, based on a captured frame and an intermediate 3D shape in the reconstructing; transforming the 3D coordinate obtained by using the depth image on a camera coordinate system to a world coordinate system based on the estimated motion; and calculating the distance based on the transformed 3D coordinate and a finally reconstructed 3D shape. For example, the finally reconstructed 3D shape may be reconstructed by sequentially using the depth images included in the captured frames #1, #2, ..., #n. In this case, the intermediate 3D shape is generated by using the depth images in the captured frames #1, #2, ..., #k (k<n) , and a camera motion between the captured frame #k and the captured frame #k+1 is estimated based on the intermediate 3D shape and the captured frame #k+1. The intermediate 3D shape refers to a temporal, mid-stage 3D shape reconstructed by using the captured frames #1 to #k prior to the last captured frame #n that is used to reconstruct the finally reconstructed 3D shape.

In the third possible implementation form of the mobile device according to the second aspect, the mobile device is further configured to perform: re-estimating a motion of the camera based on the depth image and the finally reconstructed 3D shape, wherein a thereby obtained re-estimated motion of the camera is used in the mapping the texture to the reconstructed 3D shape.

According to the third possible implementation form, the motion of the camera is re-estimated by using the finally obtained 3D shape and the depth image, so that a more accurate motion of the camera is obtained. As texture mapping is carried out by using the re-estimated motion of the camera, the mapping accuracy may be improved to generate a 3D model with a higher quality. [0000] In the fourth possible implementation form of the mobile device according to the second aspect, the selecting the plurality of key-frames includes: calculating an error indicating mismatching between the estimated motion and the finally reconstructed 3D shape based on the distance, with respect to each of the plurality of the key-frame candidates, and selecting as a key-frame, a key-frame candidate corresponding to the minimum error from a set of key-frame candidates corresponding each angle range. According to the fourth possible implementation form, the mapping accuracy may be improved to generate a 3D model with a higher quality.

In the fifth possible implementation form of the mobile device according to the second aspect, only the plurality of key-frames selected in the selecting are used in the mapping the texture. According to the fifth possible implementation form, the calculation cost can be reduced as compared with a case where all frames are used.

The third aspect of an embodiment provides the computer-readable storage medium storing a computer program, the computer program causing a computer to implement the method specified by any one of the first aspect and the first to third possible implementation form according to the first aspect.

According to the third aspect, since key-frames are selected based on the distance between the reconstructed 3D shape and a depth image, texture mapping is carried out by using a view image corresponding to a location where the deviation of the camera motion is small. As a result, favorable results of texture mapping are obtained, and a high-quality 3D model is generated. In addition, as compared to the optimization method of enhancing the accuracy of texture mapping by using view images, a 3D model with a sufficiently high quality is obtained at a less calculation cost.

For the purpose of clarity, any one of the foregoing embodiments may be combined with any one or more of the other foregoing embodiments to create a new embodiment within the scope of the present disclosure. These and other features will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings and claims.

Brief Description of Drawings

[Fig. 1] Fig. 1 is a diagram for describing a manner of capturing frames to be used in three-dimensional model reconstruction according to an embodiment of the present disclosure.

[Fig. 2] Fig. 2 is a diagram for describing color images and depth images included in captured frames, and an outline of the processing that is performed by using the color images and the depth images.

[Fig. 3] Fig. 3 is a block diagram for describing an example of a hardware configuration of a mobile device according to an embodiment of the present disclosure.

[Fig. 4] Fig. 4 is a block diagram for describing the functions of the mobile device according to the embodiment of the present disclosure.

[Fig. 5] Fig. 5 is a diagram for describing a manner of evaluating a motion error.

[Fig. 6] Fig. 6 is a flow diagram for describing processing of textured model generation according to an embodiment of the present disclosure.

[Fig. 7] Fig. 7 is a flow diagram for further describing processing of key-frame selection in the flow diagram shown in Fig. 6.

[Fig. 8] Fig. 8 is a flow diagram showing a modification of the processing of key-frame selection shown in Fig. 7.

Description of Embodiments

The following describes technical solutions of the embodiments, referring to the accompanying drawings. It will be understood that the embodiments described below are not all but just some of embodiments relating to the present disclosure. It is to be noted that all other embodiments which may be derived by a person skilled in the art based on the embodiments described below without creative efforts shall fall within the protection scope of the present disclosure.

(Capture frames for 3D model reconstruction)

With reference to Fig. 1, explanation will be hereinafter provided for a manner of capturing frames to be used in 3D model reconstruction according to an embodiment of the present disclosure. Fig. 1 is a diagram for describing the manner of capturing frames to be used in 3D model reconstruction according to an embodiment of the present disclosure.

Fig. 1 exemplarily shows an operation for acquiring twenty frames of an object by using a mobile device 10, as an example. It should be noted that Fig. 1 shows a box-shaped object for descriptive convenience, but the shape, the quantity, and the size of an object 20, and the like are merely illustrative.

The mobile device 10 includes a depth camera 11 and a color camera 12 as shown in a dashed-line block in Fig. 1. The dashed-line block represents the inside of the mobile device 10. The depth camera 11 detects a distance (depth) to an object in a depth direction perpendicular to the imaging surface of the camera, and outputs a depth image (depth map) . The color camera 12 outputs a color image by using an imaging device having RGB sensors.

The depth camera 11 and the color camera 12 operate in such a way that at least some depth images and color images are synchronized with each other. Even when the frame rate of the depth camera 11 differs from the frame rate of the color camera 12, for example, a synchronized pair of a depth image and a color image can be extracted as long as at least some frames are in synchronism.

In addition, the correlation between a depth image and a color image can be set by using, for example, a time sequence. In the following description, a synchronized pair of a depth image and a color image is referred as a captured frame for descriptive convenience.

A shooter moves the mobile device 10 while directing the imaging surfaces (lens side) of the depth camera 11 and the color camera 12 toward the object 20 to shoot the object 20 from various angles. The following refers to that motion of the mobile device 10 as "camera motion. " It should be noted that with the mobile device 10 being stationary, the object 20 may be shot while being moved. In this case, changes in the relative position of the mobile device 10 to the object 20 becomes the "camera motion. "

(Overview of 3D reconstruction)

When n (n=1, 2, ... ) captured frames are obtained in the operation exemplified in Fig. 1, a 3D model of the object 20 is reconstructed in procedures shown in Fig. 2. Fig. 2 is a diagram for describing the outline of the processing of 3D reconstruction that is performed by using the color images and the depth images included in frames captured by the

cameras

12 and 11.

As shown in Fig. 2, n captured frames (#1, #2, ..., #n) each include a color image and a depth image. In the example of Fig. 2, first, a world coordinate system is determined by using the depth image included in the first captured frame (#1) (S11) . The world coordinate system is a coordinate system in which the object 20 is located in a certain scene, and is set as a fixed coordinate system with reference to that location independent of the camera motion.

Then, in the example of Fig. 2, key-frame candidates are extracted from the n captured frames (#1, #2, ..., #n) (S12) . In Fig. 2, two captured frames (#3, #k: k≤n-1) are exemplified as key-frame candidates, which is not however restrictive. As will be described later, the key-frame candidates are candidates of captured frames (key-frames) each including a depth image which may be used at the time of reconstructing a 3D shape, as well as a color image to be mapped to a 3D shape through texture mapping.

When an angle range with reference to the object 20 is preset as an example of extraction of the captured frames (#3, #k) which are key-frame candidates, for example, then captured frames corresponding to the angle range are extracted as key-frame candidates. When the angle range is 5 degrees, for example, every time the angle increases by five degrees starting from 0 degrees (0 degrees, 5 degrees, ... ) , captured frames corresponding to that angle are extracted as key-frame candidates. Further, captured frames including color images whose qualities are higher than a predetermined quality may be selected based on an index (for example, the amount of blurring) relating to the quality of color images, and the captured frames may be extracted as key-frame candidates. Alternatively, as an example of extraction of key-frame candidates, a time (for example, a time period of 1 second or the like from the initiation of shooting by a camera) can be set in advance, so that captured frames corresponding to the time can be sequentially extracted as key-frame candidates.

Moreover, in the example of Fig. 2, a 3D shape is reconstructed by using depth images included in the individual captured frames (#1, #2, ..., #n) (S13) . Then, in the 3D reconstruction according to the embodiment of the present disclosure, key-frames which are used in texture mapping are selected , such as for each angle range based on the depth images included in the captured frames (for example, the captured frames #2, #3 in a certain angle) extracted for each predetermined angle range (for example, 5 degrees) as key-frame candidates in S12, and the 3D shape reconstructed in S13 (S14) . In the embodiment of the present disclosure, all of some of the key-frames selected for each angle range in S14 are used in texture mapping. Thus in general the candidate (extracted) frames are fewer than the captured frames, and the selected frames are fewer than the candidate frames.

The following describes the methods described the following Documents A to D as comparative examples to the foregoing manner of reconstructing a 3D shape.

Document A: Newcombe, Richard A., et al. "KinectFusion: Real-time dense surface mapping and tracking" , Mixed and augmented reality (ISMAR) , 2011 10th IEEE international symposium on IEEE, 2011.

Document B: Kahler, Olaf, et al., "Very high frame rate volumetric integration of depth images on mobile devices, "IEEE transactions on visualization and computer graphics 21.11 (2015) , 1241-1250.

Document C: Keller, Maik, et al., "Real-time 3d reconstruction in dynamic scenes by using point-based fusion" , 3D Vision-3DV 2013, 2013 International Conference on IEEE, 2013.

Document D: Zhou, Qian-Yi, and Vladlen Koltun, "Color map optimization for 3D reconstruction with consumer depth cameras, "ACM Transactions on Graphics (TOG) 33.4 (2014) , 155.

The Documents A and B have proposed methods of mapping evaluation values based on the distance in the depth direction, obtained from depth images, into voxel space, and averaging the evaluation values mapped into the voxel space along the direction of time to reproduce a 3D shape (Volumetric SDF (Signed Distance Function) fusion) . The foregoing Document C has proposed a method of integrating 3D points, obtained from depth images, into a world coordinate system, and averaging a 3D shape by weighted averaging.

Each of Documents A to C has had a problem that mismatching between the camera motion and the 3D shape may reduce the accuracy of texture mapping since a 3D shape is smoothened while sacrificing the accuracy of the camera motion. On the other hand, the foregoing method of Document D may enhance the accuracy of texture mapping by using the color-image based optimization, but has had a problem of an increased calculation cost due to the construction of a 3D shape by using all the key-frame candidates.

According to the method of 3D model reconstruction according to the embodiment of the present disclosure schematically shown in Fig. 2, unlike the foregoing methods of Documents A to C, key-frames selected preferably based on the finally obtained 3D shape (3D shape reconstructed based on the depth images of the captured frames #1 to #n) are used in texture mapping. This suppresses a reduction in accuracy of texture mapping originated from the mismatching between the camera motion and the 3D shape. Unlike the foregoing Document D, the method of 3D model reconstruction according to the embodiment of the present disclosure reconstructs a 3D shape by using key-frames obtained when the final 3D shape is reconstructed, that is, key-frames which are part of key-frame candidates are used (the optimization that results in high calculation cost is not applied) , so that the calculation cost is reduced. The part of the key-frame candidates may be the selected key frames from the key frame candidates. In addition, a 3D model with a sufficiently high quality is obtained.

(Hardware configuration of mobile device)

The following describes a hardware configuration of the mobile device 10 according to the embodiment of the present disclosure with reference to Fig. 3. Fig. 3 is a block diagram for describing an example of the hardware configuration of the mobile device 10 according to the embodiment of the present disclosure.

As shown in Fig. 3, the mobile device 10 includes the depth camera 11, the color camera 12, an input/output interface 13, a bus 14, a CPU 15, a RAM (Random Access Memory) 16, and a ROM (Read Only Memory) 17.

The input/output interface 13, the CPU 15, the RAM 16, and the ROM 17 are connected to one another by the bus 14. The depth camera 11 and the color camera 12 are connected to the bus 14 via the input/output interface 13.

A storage device 18 may be connected to the input/output interface 13. The storage device 18 is a memory device, such as an HDD (Hard Disk Drive) , an SSD (Solid State Drive) , an optical disc, a magneto-optical disk, or a semiconductor memory. The storage device 18 may also be a non-transitory computer-readable storage medium where a computer program for controlling the operation of the mobile device 10 is stored.

The CPU 15 may read the computer program stored in the storage device 18 and store the computer program in the RAM 16, and control the operation of the mobile device 10 according to the computer program read from the RAM 16. It should be noted that the computer program for controlling the operation of the mobile device 10 may be stored in advance in the ROM 17, or may be downloaded over a network by using the communication capability of the mobile device 10.

(Functional configuration of mobile device)

Next, with reference to Figs. 2 and 4, the functions of the mobile device 10 according to the embodiment of the present disclosure will be described. Fig. 4 is a block diagram for describing the functions of the mobile device 10 according to the embodiment of the present disclosure.

As shown in Fig. 4, the mobile device 10 includes an image acquisition unit 101, a 3D reconstruction unit 102, a key-frame selection unit 103, a texture mapping unit 104, and a storage unit 105.

The functions of the image acquisition unit 101, the 3D reconstruction unit 102, the key-frame selection unit 103 and the texture mapping unit 104 may be achieved by the foregoing CPU 15. The functions of the storage unit 105 may be achieved by the foregoing RAM 16 and/or storage device 18.

The image acquisition unit 101 controls the depth camera 11 to acquire depth images, and controls the color camera 12 to acquire color images. Also, the image acquisition unit 101 outputs synchronized pairs of depth images and color images as captured frames to the 3D reconstruction unit 102.

The 3D reconstruction unit 102 reconstructs a 3D shape by sequentially using depth images included in the captured frames #1, #2, ..., #n. The 3D reconstruction unit 102 also generates an intermediate 3D shape by using depth images in the captured frames #1, #2, ..., #k (k≤n-1) , and estimates a camera motion between the captured frame #k and the captured frame #k+1 based on the intermediate 3D shape and the captured frame #k+1. The intermediate 3D shape refers to a temporal, mid-stage 3D shape reconstructed by using the captured frames #1 to #k prior to the last captured frame #n that is used to reconstruct a 3D shape.

Information on the 3D shape finally obtained through the reconstruction and information on the camera motion are output to the key-frame selection unit 103 and the texture mapping unit 104.

The key-frame selection unit 103 selects key-frame candidates from among the captured frames #1, #2, ..., #n. For example, the key-frame selection unit 103 selects, as key-frame candidates, captured frames satisfying a predetermined condition (for example, captured frames each including a blurless color image when the predetermined condition is the absence of color blurring) from among a plurality of captured frames respectively corresponding to a plurality of preset angle ranges (for example, 0 degrees, 5 degrees and so forth) . The level that indicates if there is some blurring can be calculated based on, for example, the dispersion or histogram or the like of pixel values obtained through transform of color images to gray images.

Also, the key-frame selection unit 103 calculates, for each of the key-frame candidates, a motion error E given by the following equation (1) based on the camera motion estimated by, and the final 3D shape obtained by, the 3D reconstruction unit 102.

E (T _k) =Σ _u|T _kV _k (u) -V (u) ^TN (u) | ² (1)

where V _k, V, and N are matrices and u is a vector. Further, |…| represents the norm. The superscription T represents transposition. u represents two-dimensional coordinates on a depth image. V _k (u) is a 3D vertex map with each element indicating a point on a camera coordinate system which corresponds to a point on a depth image indicated by the vector u for the kth captured frame #k. T _k is a transform matrix for transforming a camera coordinate system for the captured frame #k to a world coordinate system based on the camera motion.

V (u) is a matrix of a 3D vertex map, expressed by the world coordinate system, with each element as a vector indicating a point on the surface of the final 3D shape. N (u) is a matrix of a unit normal map, with each element of a vector indicating a direction perpendicular to the surface of the final 3D shape. The relationship among T _k, V _k, V and N is shown in Fig. 5. Fig. 5 is a diagram for describing a manner of evaluating a motion error. As shown in Fig. 5, the motion error E (T _k) indicates a square error between a point on a depth image expressed by the world coordinate system and the surface of the final 3D shape. That is, in Fig. 5, the value of E (T _k) becomes smaller as the value (distance) of {T _kV _k (u) -V (u) ^TN (u) } becomes smaller. In other words, as the distance between the surface of the reconstructed 3D shape and the 3D coordinate obtained by using depth images becomes smaller, the reproducibility of the 3D shape which is influence by the estimation error of the of the camera motion may become higher.

The key-frame selection unit 103 selects, as a key-frame, the captured frame whose motion error E becomes minimum. As a modification, the key-frame selection unit 103 may select, as a key-frame, the captured frame whose motion error E becomes smaller than a predetermined threshold. Moreover, as another modification, the key-frame selection unit 103 may re-calculate the camera motion based on the final 3D shape and the depth images, for each of the key-frame candidates or selected therefrom.

Information on the key-frame selected by the key-frame selection unit 103 is output to the texture mapping unit 104. In the case of the other modification (i.e. in the case of re-calculation of the camera motion based on the final 3D shape) , information on the camera motion re-calculated by the key-frame selection unit 103 is output to the texture mapping unit 104.

The texture mapping unit 104 divides the final 3D shape to polygons, and pastes textures or parts thereof, such as color images, to the respective polygons based on the camera motion output from the 3D reconstruction unit 102. In the case of the other modification (in the case of re-calculation of the camera motion by the key-frame selection unit 103) , color images may be pasted to the final 3D shape based on the re-calculated camera motion. In this case, the color images can also be pasted to more accurate positions to provide a high-quality 3D model.

(Textured model generation)

Next, with reference to Figs. 1 to 6, the processing flow of the textured model generation according to an embodiment of the present disclosure will be described. Fig. 6 is a flow diagram for describing processing of textured model generation according to the embodiment of the present disclosure.

In S101, the image acquisition unit 101 acquires time-synchronized pairs of depth and color images (captured frames #1, #2, ..., #n) .

In S102, the 3D reconstruction unit 102 estimates the camera motion based on depth images and an intermediate 3D shape. For example, the 3D reconstruction unit 102 generates the intermediate 3D shape by using the depth images of the captured frames #1, #2, ..., #k (k≤n-1) , and estimates the camera motion between the captured frame #k and the captured frame #k+1 based on the intermediate 3D shape and the captured frame #k+1.

In S103, the key-frame selection unit 103 selects the key-frame based on the depth images of the captured frames, the camera motion and the final 3D shape so that the camera motion matches the 3D shape. The processing flow of selection will be described later in further detail.

In S104, the texture mapping unit 104 performs texture mapping based on the color image of the key-frame selected by the key-frame selection unit 103, the camera motion estimated by the 3D reconstruction unit 102, and the final 3D shape. When the processing in S104 is completed and a textured model is generated, the series of processes shown in Fig. 6 is terminated.

When the camera motion is estimated by the key-frame selection unit 103 based on the final 3D shape and the depth images as in a modification described later, the camera motion estimated by the key-frame selection unit 103 may be used in texture mapping.

With reference to Fig. 7, now, processing of key-frame selection corresponding to S103 in Fig. 6 will be further described. Fig. 7 is a flow diagram for further describing processing of key-frame selection in the flow diagram shown in Fig. 6.

In S131, the key-frame selection unit 103 selects key-frame candidates from among the captured frames #1, #2, ..., #n.

For example, the key-frame selection unit 103 selects, as key-frame candidates, captured frames satisfying a predetermined condition (for example, captured frames each including a blurless color image) from among a plurality of captured frames respectively corresponding to a plurality of preset angle ranges (for example, 0 degrees, 5 degrees, 10 degrees, and so forth) . For example, a predetermined number of (for example, ten) key-frame candidates are selected for each angle range, such as an angle range of 0 to 5 degrees, an angle range of 5 to 10 degrees, an angle range of 10 to 15 degrees, ..., an angle range of 175 to 180 degrees. The level of blurring can be calculated based on, for example, the dispersion or histogram or the like of pixel values obtained through transform of color images to gray images.

Processing between S132 and S138 is performed for each angle range. For example, the processing between S133 to S137 is repeatedly performed on a set of key-frame candidates corresponding to each angle range while varying the angle range. In the processing of S133 to S137, a key-frame for a target angle range is selected from the key-frame candidates. When the key-frames are selected for all the angle ranges, the series of processes shown in Fig. 7 is terminated.

In S133, the key-frame selection unit 103 selects one key-frame candidate from a set of target key-frame candidates, and evaluates the motion error E based on the depth image of that key-frame candidate, the estimated camera motion, and the final 3D shape. The motion error E is given by the foregoing equation (1) .

In S134, the key-frame selection unit 103 determines whether the motion error E evaluated in S133 is smaller than the minimum value in the key-frame candidates. When the motion error E is the minimum value, which means that the camera motion matches the 3D shape, the reproducibility of the 3D shape may become higher. When the motion error E is smaller than the minimum value, the processing proceeds to S135. When the motion error E is not smaller than the minimum value, the processing proceeds to S137. It is to be noted that when the minimum value in the key-frame candidates is not stored in the storage unit 105, the processing proceeds to S135.

In S135, the key-frame selection unit 103 stores the motion error E evaluated in S133 in the storage unit 105 as the minimum value in the key-frame candidates.

In S136, the key-frame selection unit 103 stores the estimated camera motion in the storage unit 105, and then stores, as key-frames, the color images used in the evaluation of the motion error E in the storage unit 105.

In S137, the key-frame selection unit 103 determines whether there is a key-frame candidate which has not been selected in S133. When there is an unselected key-frame candidate, the processing proceeds to S133. When all the key-frame candidates have been selected, the processing proceeds to S138. In S138, when the processing is completed for all the angle ranges, the series of processes shown in Fig. 7 is terminated.

With reference to Fig. 8, now, a modification of the processing of key-frame selection shown in Fig. 7 is described. Fig. 8 is a flow diagram showing a modification of the processing of key-frame selection shown in Fig. 7.

In S201, the key-frame selection unit 103 selects key-frame candidates from among the captured frames #1, #2, ..., #n. For example, the key-frame selection unit 103 selects, as key-frame candidates, captured frames satisfying a predetermined condition (for example, captured frames each including a blurless color image) from among a plurality of captured frames respectively corresponding to a plurality of preset angle ranges. For example, a predetermined number of (for example, ten) key-frame candidates are selected for each angle range, such as an angle range of 0 to 5 degrees, an angle range of 5 to 10 degrees, an angle range of 10 to 15 degrees, ..., an angle range of 175 to 180 degrees.

Processing between S202 and S208 is performed for each angle range. For example, the processing between S203 to S207 is repeatedly performed on a set of key-frame candidates corresponding to each angle range while varying the angle range. In the processing of S203 to S207, a key-frame for a target angle range is selected from the key-frame candidates. When the key-frames are selected for all the angle ranges, the series of processes shown in Fig. 8 is terminated.

In S203, the key-frame selection unit 103 selects one key-frame candidate from a set of target key-frame candidates, and re-estimates the camera motion based on the depth image of that key-frame candidate and the final 3D shape. For example, the key-frame selection unit 103 specifies the camera motion in which depth information of the depth image is most approximate to the surface shape of the final 3D shape. Information on the specified camera motion is output to the texture mapping unit 104 to be used in texture mapping. Moreover, the key-frame selection unit 103 evaluates the motion error E based on the depth image of that key-frame candidate, the estimated camera motion, and the final 3D shape. The motion error E is given by the foregoing equation (1) .

In S204, the key-frame selection unit 103 determines whether the motion error E evaluated in S203 is smaller than the minimum value in the key-frame candidates. When the motion error E is smaller than the minimum value, the processing proceeds to S205. When the motion error E is not smaller than the minimum value, the processing proceeds to S207. It is to be noted that when the minimum value in the key-frame candidates is not stored in the storage unit 105, the processing proceeds to S205.

In S205, the key-frame selection unit 103 stores the motion error E evaluated in S203 in the storage unit 105 as the minimum value in the key-frame candidates.

In S206, the key-frame selection unit 103 stores the estimated camera motion in the storage unit 105, and then stores, as key-frames, the color images used in the evaluation of the motion error E in the storage unit 105.

In S207, the key-frame selection unit 103 determines whether there is a key-frame candidate which has not been selected in S203. When there is an unselected key-frame candidate, the processing proceeds to S203. When all the key-frame candidates have been selected, the processing proceeds to S208. In S208, when the processing is completed for all the angle ranges, the series of processes shown in Fig. 8 is terminated.

According to the 3D reconstruction method according to the embodiment of the present disclosure, as described above, a key-frame is selected by using the final 3D shape. The selected key-frame is the captured frame which causes small mismatching (motion error) between the final 3D shape and the camera motion due to the time average of the shape that is carried out in the process of generating a 3D shape. The selected key-frame is used in texture mapping.

The use of the key-frame having a small motion error in texture mapping provides a 3D textured model having a sufficiently high quality. Further, unlike the foregoing method described in Document D, the 3D reconstruction method according to the embodiment of the present disclosure does not use the optimization method that demands a high calculation cost, the embodiment of the present disclosure provides a 3D textured model having a sufficiently high quality at a lower calculation cost.

Moreover, according to the method of the foregoing modification, the camera motion matched with the final 3D shape is re-calculated based on the final 3D shape and the depth images. Then, the re-calculated camera motion is used in texture mapping. The application of this method can provide a 3D textured model having a higher quality. In addition, the calculation cost for the re-calculation of the camera motion is sufficiently small as compared with the foregoing optimization method described in Document D, so that even the application of the method according to the foregoing modification can provide a 3D textured model having a high quality at a sufficiently small calculation cost.

The foregoing disclosure merely discloses exemplary embodiments, and is not intended to limit the protection scope of the present invention. It will be appreciated by those skilled in the art that the foregoing embodiments and all or some of other embodiments and modifications which may be derived based on the scope of claims of the present invention will of course fall within the scope of the present invention.

Claims

A method for generating a 3D model, comprising:

reconstructing a 3D shape of an object based on a plurality of frames captured from a plurality of different positions, wherein each of the plurality of captured frames includes a depth image and a view image;

selecting a plurality of key-frames from the plurality of captured frames, based on a distance between the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and

mapping a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames.
The method according to claim 1, wherein the selecting the plurality of key-frames includes:

extracting a plurality of key-frame candidates from the plurality of captured frames, wherein the plurality of key-frame candidates satisfy a predetermined condition relevant to angle ranges to the object and/or a quality of the view image; and

selecting the plurality of key-frames from the plurality of key-frame candidates based on the distance between the reconstructed 3D shape and the 3D coordinate.
The method according to claim 2, further comprising:

estimating a motion of a camera configured to capture the plurality of captured frames, based on a captured frame and an intermediate 3D shape in the reconstructing of the 3D shape;

transforming the 3D coordinate obtained by using the depth image on a camera coordinate system to a world coordinate system based on the estimated motion; and

calculating the distance based on the transformed 3D coordinate and a finally reconstructed 3D shape.
The method according to any one of claims 1 to 3, further comprising

re-estimating a motion of the camera based on the depth image and the finally reconstructed 3D shape, wherein a thereby obtained re-estimated motion of the camera is used in the mapping the texture to the reconstructed 3D shape.
The method according to claim 3, wherein the selecting the plurality of key-frames includes:

calculating an error indicating mismatching between the estimated motion and the finally reconstructed 3D shape based on the distance, with respect to each of the plurality of the key-frame candidates, and

selecting as a key-frame, a key-frame candidate corresponding to the minimum error from a set of key-frame candidates corresponding to each angle range.
The method according to any one of claims 1 to 5, wherein only the plurality of key-frames selected in the selecting are used in the mapping the texture.
An apparatus, preferably a mobile device, comprising a processing circuitry, wherein the processing circuitry is configured to:

reconstruct a 3D shape of an object based on a plurality of frames captured from a plurality of different positions, wherein each of the plurality of captured frames includes a depth image and a view image;

select a plurality of key-frames from the plurality of captured frames, based on a distance between the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and

map a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames.
The mobile device according to claim 7, wherein in the selecting of the plurality of key-frames, the mobile device is configured to:

extract a plurality of key-frame candidates from the plurality of captured frames, wherein the plurality of key-frame candidates satisfy a predetermined condition relevant to angle ranges to the object and/or a quality of the view image; and

select the plurality of key-frames from the plurality of key-frame candidates based on the distance between the reconstructed 3D shape and the 3D coordinate.
The mobile device according to claim8, wherein the mobile device is further configured to:

estimate a motion of a camera capturing the plurality of captured frames, based on a captured frame and an intermediate 3D shape in the reconstructing of the 3D shape;

transform the 3D coordinate obtained by using the depth image on a camera coordinate system to a world coordinate system based on the estimated motion; and

calculate the distance based on the transformed 3D coordinate and a finally reconstructed 3D shape.
The mobile device according to any one of claims 7 to 9, wherein the mobile device is further configured to:

re-estimate a motion of the camera based on the depth image and the finally reconstructed 3D shape, wherein the a thereby obtained re-estimated motion of the camera is used in the mapping the texture to the reconstructed 3D shape.
The mobile device according to claim 9, wherein the mobile device is further configured to:

in the selecting the plurality of key-frames,

calculate an error indicating mismatching between the estimated motion and the finally reconstructed 3D shape based on the distance, with respect to each of the plurality of the key-frame candidates, and

select as a key-frame, a key-frame candidate corresponding to the minimum error from a set of key-frame candidates corresponding each angle range.
The mobile device according to any one of claims 7 to 11, wherein only the plurality of key-frames are used for mapping the texture.
A computer-readable storage medium storing a computer program, the computer program causing a computer to execute a method, the method comprising:

reconstructing a 3D shape of an object based on a plurality of frames captured from a plurality of different positions, wherein each of the plurality of captured frames includes a depth image and a view image;

selecting a plurality of key-frames from the plurality of captured frames, based on a distance between the reconstructed 3D shape and a 3D coordinate obtained by using the depth image; and

mapping a texture to the reconstructed 3D shape by using the view image in each of the plurality of key-frames.
The computer-readable storage medium according to claim 13, wherein the selecting the plurality of key-frames includes:

extracting a plurality of key-frame candidates from the plurality of captured frames, wherein the plurality of key-frame candidates satisfy a predetermined condition relevant to angle ranges to the object and/or a quality of the view image; and

selecting the plurality of key-frames from the plurality of key-frame candidates based on the distance between the reconstructed 3D shape and the 3D coordinate.
The computer-readable storage medium according to claim 14, the method further comprising:

estimating a motion of a camera configured to capture the plurality of captured frames, based on a captured frame and an intermediate 3D shape in the reconstructing of the 3D shape;

transforming the 3D coordinate obtained by using the depth image on a camera coordinate system to a world coordinate system based on the estimated motion; and

calculating the distance based on the transformed 3D coordinate and a finally reconstructed 3D shape.
The computer-readable storage medium according to any one of claims 13 to 15, the method further comprising:

re-estimating a motion of the camera based on the depth image and the finally reconstructed 3D shape, wherein a thereby obtained re-estimated motion of the camera is used in the mapping the texture to the reconstructed 3D shape.
The computer-readable storage medium according to claim 15, wherein the selecting the plurality of key-frames includes:

calculating an error indicating mismatching between the estimated motion and the finally reconstructed 3D shape based on the distance, with respect to each of the plurality of the key-frame candidates, and

selecting as a key-frame, a key-frame candidate corresponding to the minimum error from a set of key-frame candidates corresponding each angle range.
The computer-readable storage medium according to any one of claims 13 to 17, wherein only the plurality of key-frames selected in the selecting are used in the mapping the texture.