WO2018133119A1 - Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera - Google Patents

Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera

Info

Publication number
WO2018133119A1
Authority
WO
WIPO (PCT)
Prior art keywords
depth image
depth
frame
segments
fusion
Prior art date
Application number
PCT/CN2017/072257
Other languages
French (fr)
Chinese (zh)
Inventor
李建伟 (Li Jianwei)
高伟 (Gao Wei)
吴毅红 (Wu Yihong)
Original Assignee
中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences) filed Critical 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority to PCT/CN2017/072257 priority Critical patent/WO2018133119A1/en
Publication of WO2018133119A1 publication Critical patent/WO2018133119A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects

Definitions

  • the present invention relates to the field of computer vision technology, and in particular, to a method and system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera.
  • High-precision 3D reconstruction of indoor scenes is one of the challenging research topics in computer vision, involving theories and techniques in computer vision, computer graphics, pattern recognition, optimization and many other fields.
  • the traditional method is to use laser or radar ranging sensors or structured light technology to acquire the structural information of the scene or the surface of the object for 3D reconstruction.
  • these instruments are mostly expensive and difficult to carry, so their applications are limited.
  • researchers began to study the use of pure vision methods for 3D reconstruction, which has produced a lot of useful research work.
  • the KinectFusion algorithm proposed by Newcombe et al. uses Kinect to obtain the depth of each point in the image and estimates the pose of the current frame camera by aligning the coordinates of the 3D points in the current camera coordinate system with their coordinates in the global model using the Iterative Closest Point (ICP) algorithm; volume data fusion is then performed iteratively through a Truncated Signed Distance Function (TSDF) to obtain a dense three-dimensional model.
  • TSDF: Truncated Signed Distance Function
  • Whelan et al. proposed the Kintinuous algorithm, which is a further extension of KinectFusion.
  • the algorithm uses a Shifting TSDF Volume to recycle GPU memory, which solves the problem of mesh-model memory consumption during large-scene reconstruction. It also uses DBoW to find matching key frames for closed-loop detection. Finally, the pose graph and the model are optimized to obtain a large-scale 3D model.
  • Choi et al. proposed the Elastic Fragments idea: the RGB-D data stream is first split into segments of 50 frames each, visual odometry is estimated separately for each segment, and the FPFH geometric descriptor is extracted from the point cloud data of every pair of segments to find matches.
  • a method and system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera are provided.
  • a method for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera may include:
  • weighted volume data fusion is performed to reconstruct a three-dimensional model of the complete scene in the room.
  • the performing adaptive bilateral filtering on the depth image specifically includes:
  • Adaptive bilateral filtering is performed according to the following formula:
  • Z̃(u) = (1/W) · Σ_{u_k ∈ Ω} w_s(u, u_k) · w_c(Z(u), Z(u_k)) · Z(u_k)
  • where u and u_k respectively denote any pixel on the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) denote the depth values at u and u_k; Z̃(u) denotes the filtered depth value; W is the normalization factor over the neighborhood Ω; w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
  • the Gaussian kernel functions for spatial-domain and range-domain filtering are determined according to the following formulas:
  • w_s(u, u_k) = exp(−‖u − u_k‖² / (2·δ_s²)),  w_c(Z(u), Z(u_k)) = exp(−(Z(u) − Z(u_k))² / (2·δ_c²))
  • δ_s and δ_c are the variances of the spatial-domain and range-domain Gaussian kernel functions, respectively; they are set adaptively from the depth value and the camera parameters;
  • f represents the focal length of the depth camera;
  • K_s and K_c represent constants.
  • the performing of visual-content-based block fusion and registration processing on the filtered depth image specifically includes: segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and globally optimizing the result of the closed-loop detection.
  • the segmenting of the depth image sequence based on visual content, the block fusion of each segment, the closed-loop detection between segments, and the global optimization of the closed-loop detection result specifically include:
  • the depth image sequence is segmented by an automatic segmentation method based on visual-content detection, similar depth image content is grouped into one segment, and each segment is block-fused to determine the transformation relationships between the depth images; closed-loop detection is then performed between segments according to the transformation relationships to achieve global optimization.
  • the automatic segmentation method based on visual-content detection segments the depth image sequence, groups similar depth image content into one segment, and block-fuses each segment to determine the transformation relationships between the depth images;
  • a graph is constructed and optimized using the G2O framework to obtain the optimized camera trajectory information, thereby achieving the global optimization.
  • Step 1 calculating a similarity between the depth image of each frame and the depth image of the first frame
  • Step 2 determining whether the similarity is lower than a similarity threshold
  • Step 3 If yes, segment the depth image sequence
  • Step 4 The next frame depth image is taken as the starting frame depth image of the next segment, and steps 1 and 2 are repeatedly performed until all frame depth images are processed.
  • the step 1 specifically includes:
  • the first spatial three-dimensional point corresponding to each pixel on the depth image is calculated from the projection relationship and the depth value of any frame's depth image: p = π⁻¹(u_p, Z(u_p));
  • u_p is any pixel on the depth image;
  • Z(u_p) and p respectively denote the depth value at u_p and the first spatial three-dimensional point;
  • the first spatial three-dimensional point is transformed into the world coordinate system by rotation and translation according to the following formula to obtain the second spatial three-dimensional point: q = T_i · p
  • T_i denotes the rotation-translation matrix from the spatial 3D points of the i-th frame depth image to the world coordinate system;
  • p denotes the first spatial three-dimensional point and q denotes the second spatial three-dimensional point;
  • i takes a positive integer value;
  • the second spatial three-dimensional point is back-projected onto the two-dimensional image plane according to the following formula to obtain the projected depth image: u_q = π(q) = (f_x·x_q/z_q + c_x, f_y·y_q/z_q + c_y)ᵀ
  • u_q is the pixel on the projected depth image corresponding to q;
  • f_x, f_y, c_x and c_y denote the intrinsic parameters of the depth camera;
  • x_q, y_q, z_q denote the coordinates of q;
  • T denotes matrix transposition;
  • the numbers of valid pixels on the starting-frame depth image and on the projected depth image of any frame are counted separately, and their ratio is taken as the similarity.
  • the performing of weighted volume data fusion according to the processing result to reconstruct the three-dimensional model of the complete indoor scene specifically includes: fusing the depth image of each frame using the truncated signed distance function grid model according to the processing result, and representing the three-dimensional space with a voxel grid, thereby obtaining the three-dimensional model of the complete indoor scene.
  • fusing the depth image of each frame using the truncated signed distance function grid model and representing the three-dimensional space with a voxel grid to obtain the three-dimensional model of the complete indoor scene specifically includes:
  • performing weighted fusion of the truncated signed distance function data using the Volumetric method framework, based on the noise characteristics and the region-of-interest model;
  • performing Mesh model extraction using the Marching cubes algorithm to obtain the three-dimensional model of the complete indoor scene.
  • the truncated signed distance function is determined according to the following formula: f_i(v) = [K⁻¹ · z_i(u) · [uᵀ, 1]ᵀ]_z − [v_i]_z
  • f_i(v) represents the truncated signed distance function, that is, the distance from the grid cell to the surface of the object model; its sign indicates whether the cell lies on the occluded side or on the visible side of the surface, and the zero crossing lies on the surface itself;
  • K represents the intrinsic parameter matrix of the camera;
  • u represents a pixel;
  • z_i(u) represents the depth value corresponding to the pixel u;
  • v_i represents a voxel.
  • the weighted data fusion is performed according to the following formulas: F(v) = Σ_{i=1..n} w_i(v)·f_i(v) / Σ_{i=1..n} w_i(v),  W(v) = Σ_{i=1..n} w_i(v)
  • v represents a voxel;
  • f_i(v) and w_i(v) respectively represent the truncated signed distance function corresponding to the voxel v and its weight function;
  • n takes a positive integer value;
  • F(v) represents the fused truncated signed distance function value corresponding to the voxel v;
  • W(v) represents the weight of the fused truncated signed distance function value corresponding to the voxel v;
  • the weight function can be determined from the noise characteristics of the depth data and the region of interest: regions with low noise or of high interest receive larger weights, while noisy regions or regions of no interest receive smaller weights.
  • a system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera comprising:
  • a filtering module configured to perform adaptive bilateral filtering on the depth image
  • a block fusion and registration module for performing visual content-based block fusion and registration processing on the filtered depth image
  • the volume data fusion module is configured to perform weighted volume data fusion according to the processing result, thereby reconstructing a three-dimensional model of the complete scene in the room.
  • the filtering module is specifically configured to:
  • Adaptive bilateral filtering is performed according to the formula Z̃(u) = (1/W) · Σ_{u_k ∈ Ω} w_s(u, u_k) · w_c(Z(u), Z(u_k)) · Z(u_k),
  • where u and u_k respectively denote any pixel on the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) denote the depth values at u and u_k; Z̃(u) denotes the filtered depth value; W is the normalization factor over the neighborhood Ω; w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
  • the block fusion and registration module is specifically configured to: segment the depth image sequence based on visual content, perform block fusion on each segment, perform closed-loop detection between the segments, and globally optimize the results of the closed-loop detection.
  • the block fusion and registration module is further specifically configured to:
  • segment the depth image sequence by the automatic segmentation method based on visual-content detection, group similar depth image content into one segment, and block-fuse each segment to determine the transformation relationships between the depth images; closed-loop detection is then performed between segments according to the transformation relationships to achieve global optimization.
  • the block fusion and registration module specifically includes:
  • the camera pose information acquisition unit is configured to perform visual odometry estimation using the Kintinuous framework to obtain the camera pose information for each frame's depth image;
  • a segmentation unit is configured to back-project the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, compare the similarity between the projected depth image and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initialize the camera pose and start a new segment;
  • a registration unit is configured to extract the FPFH geometric descriptor from the point cloud data of each segment, perform coarse registration between every two segments, and perform fine registration using the GICP algorithm to obtain the matching relationships between segments;
  • an optimization unit is configured to use the pose information of each segment and the matching relationships between segments to construct a graph and perform graph optimization using the G2O framework, obtaining the optimized camera trajectory information and thereby achieving the global optimization.
  • the segmentation unit specifically includes:
  • a calculating unit configured to calculate a similarity between the depth image of each frame and the depth image of the first frame
  • a determining unit configured to determine whether the similarity is lower than a similarity threshold
  • a segmentation subunit configured to segment the depth image sequence when the similarity is lower than a similarity threshold
  • a processing unit configured to use the next frame depth image as the starting frame depth image of the next segment, and repeatedly execute the calculating unit and the determining unit until all the frame depth images are processed.
  • the volume data fusion module is specifically configured to: according to the processing result, fuse the depth image of each frame using the truncated signed distance function grid model and represent the three-dimensional space with a voxel grid, thereby obtaining the three-dimensional model of the complete indoor scene.
  • the volume data fusion module specifically includes:
  • a weighted fusion unit configured to perform weighted fusion of the truncated signed distance function data using the Volumetric method framework, based on the noise characteristics and the region of interest;
  • An extracting unit is configured to perform Mesh model extraction by using a Marching cubes algorithm to obtain a three-dimensional model of the indoor complete scene.
  • Embodiments of the present invention provide a method and system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera.
  • the method includes: acquiring a depth image; performing adaptive bilateral filtering on the depth image; performing visual-content-based block fusion and registration processing on the filtered depth image; and performing weighted volume data fusion according to the processing result, thereby reconstructing a three-dimensional model of the complete indoor scene.
  • by performing visual-content-based block fusion and registration on the depth images, the embodiment of the invention can effectively reduce the cumulative error in visual odometry estimation and improve the registration precision; it also adopts a weighted volume data fusion algorithm that effectively preserves the geometric details of object surfaces. This solves the technical problem of how to improve the accuracy of three-dimensional reconstruction in indoor scenes, so that a complete, accurate and refined indoor scene model can be obtained.
  • FIG. 1 is a flow chart showing a method for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera according to an embodiment of the present invention
  • FIG. 2a is a color image corresponding to a depth image according to an embodiment of the present invention;
  • FIG. 2b is a schematic diagram of a point cloud obtained from a depth image according to an embodiment of the present invention;
  • FIG. 2c is a schematic diagram of a point cloud obtained by bilateral filtering of a depth image according to an embodiment of the present invention;
  • FIG. 2d is a schematic diagram of a point cloud obtained by adaptive bilateral filtering of a depth image according to an embodiment of the present invention;
  • FIG. 3 is a schematic flow chart of segmentation fusion and registration based on visual content according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a weighted volume data fusion process according to an embodiment of the present invention.
  • FIG. 5a is a schematic diagram of a three-dimensional reconstruction result using an unweighted volume data fusion algorithm
  • Figure 5b is a partial detail view of the three-dimensional model of Figure 5a;
  • FIG. 5c is a schematic diagram of a three-dimensional reconstruction result obtained by a weighted volume data fusion algorithm according to an embodiment of the present invention.
  • Figure 5d is a partial detail view of the three-dimensional model of Figure 5c;
  • FIG. 6 is a schematic diagram of an effect of performing three-dimensional reconstruction on a 3D Scene Data data set using the method proposed by the embodiment of the present invention
  • FIG. 7 is a schematic diagram showing the effect of performing three-dimensional reconstruction on the Augmented ICL-NUIM Dataset data set using the method proposed by the embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing an effect of three-dimensional reconstruction of indoor scene data collected by Microsoft Kinect for Windows according to an embodiment of the present invention
  • FIG. 9 is a schematic structural diagram of a system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera according to an embodiment of the present invention.
  • Embodiments of the present invention provide a method for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera. As shown in Figure 1, the method includes:
  • this step may include: acquiring depth images with a consumer-grade depth camera based on the structured-light principle.
  • the consumer-level depth camera (Microsoft Kinect for Windows and Xtion, referred to as the depth camera) based on the structured light principle acquires the depth data of the depth image by transmitting the structured light and receiving the reflection information.
  • real indoor scene data can be acquired using the handheld consumer depth camera Microsoft Kinect for Windows.
  • the depth data can be calculated according to the following formula: Z = f·B / D (a minimal numerical sketch is given below), where
  • f represents the focal length of the consumer-grade depth camera;
  • B represents the baseline;
  • D represents the disparity (parallax).
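The following is a minimal numerical sketch of this triangulation relation; the focal length, baseline and disparity values are illustrative assumptions rather than calibration data of any particular camera, and the last lines illustrate why the depth noise grows with the square of the depth value, as noted in the filtering step below.

```python
import numpy as np

# Illustrative values only (not the calibration of any real device).
f = 580.0                               # focal length in pixels
B = 0.075                               # baseline in meters
D = np.array([87.0, 43.5, 21.75])       # disparities in pixels

# Depth from structured-light triangulation: Z = f * B / D
Z = f * B / D
print(Z)                                # -> [0.5 1.  2. ] meters

# A fixed disparity error dD maps to a depth error of roughly
# dZ ~ (Z**2 / (f * B)) * dD, i.e. the depth noise grows with Z^2,
# which is the noise characteristic exploited by the adaptive filter below.
dD = 1.0
dZ = Z ** 2 / (f * B) * dD
print(dZ)
```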
  • S110 Perform adaptive bilateral filtering on the depth image.
  • the acquired depth image is adaptively bilaterally filtered, exploiting the noise characteristics of the structured-light-based consumer-grade depth camera.
  • the adaptive bilateral filtering algorithm filters in both the spatial domain and the range domain of the depth image.
  • the parameters of the adaptive bilateral filtering algorithm can be set according to the noise characteristics of the depth camera and its intrinsic parameters, which effectively removes noise while preserving edge information.
  • the noise of the depth data is mainly generated in the quantization process. It can be seen from the above equation that the variance of the depth noise is proportional to the square of the depth value, that is, the larger the depth value, the larger the noise.
  • embodiments of the present invention define a filtering algorithm based on this noise characteristic.
  • the above adaptive bilateral filtering can be performed according to the following formula:
  • Z̃(u) = (1/W) · Σ_{u_k ∈ Ω} w_s(u, u_k) · w_c(Z(u), Z(u_k)) · Z(u_k)
  • u and u_k respectively denote any pixel on the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) respectively denote the depth values at u and u_k; Z̃(u) denotes the filtered depth value; W is the normalization factor over the neighborhood Ω; w_s and w_c denote the Gaussian kernel functions filtered in the spatial domain and the range domain, respectively.
  • w_s and w_c can be determined according to the following formulas:
  • w_s(u, u_k) = exp(−‖u − u_k‖² / (2·δ_s²)),  w_c(Z(u), Z(u_k)) = exp(−(Z(u) − Z(u_k))² / (2·δ_c²))
  • δ_s and δ_c are the variances of the spatial-domain and range-domain Gaussian kernel functions, respectively.
  • δ_s and δ_c are related to the magnitude of the depth value and are not fixed.
  • δ_s and δ_c can be determined from the depth value Z(u), the focal length f of the depth camera, and the constants K_s and K_c, whose specific values are related to the parameters of the depth camera (a minimal sketch of the filter follows below).
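The following is a minimal NumPy sketch of the adaptive bilateral filter described above. Only the dependence of δ_s and δ_c on the depth value, the focal length f and the constants K_s and K_c is stated in the text, so the concrete forms used here (delta_s = K_s * f / z, interpreted as a fixed metric support projected into pixels, and delta_c = K_c * z**2, following the Z² noise growth) are assumptions for illustration, as are the parameter values.

```python
import numpy as np

def adaptive_bilateral_filter(Z, f, K_s=0.01, K_c=0.0025, radius=3):
    """Edge-preserving denoising of a depth image Z (in meters).
    Invalid pixels are encoded as 0 and are ignored."""
    H, W = Z.shape
    out = np.zeros_like(Z)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial_dist2 = xs ** 2 + ys ** 2                 # ||u - u_k||^2 per neighbor
    Zp = np.pad(Z, radius, mode='edge')
    for y in range(H):
        for x in range(W):
            z = Z[y, x]
            if z <= 0:                                # skip invalid depth
                continue
            # Depth-adaptive kernel widths (assumed forms, see lead-in).
            delta_s = K_s * f / z                     # spatial sigma in pixels
            delta_c = K_c * z * z                     # range sigma in meters
            patch = Zp[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            w_s = np.exp(-spatial_dist2 / (2.0 * delta_s ** 2))
            w_c = np.exp(-(patch - z) ** 2 / (2.0 * delta_c ** 2))
            w = w_s * w_c * (patch > 0)               # ignore invalid neighbors
            W_norm = w.sum()                          # normalization factor W
            if W_norm > 0:
                # Z~(u) = (1/W) * sum w_s * w_c * Z(u_k)
                out[y, x] = (w * patch).sum() / W_norm
    return out

# Example: smooth a noisy synthetic depth map while keeping the step edge.
rng = np.random.default_rng(0)
Z = np.full((64, 64), 1.0)
Z[:, 32:] = 2.0
Z += rng.normal(0.0, 0.005, Z.shape) * Z ** 2         # noise grows with Z^2
Z_filtered = adaptive_bilateral_filter(Z, f=580.0)
```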
  • FIG. 2a shows a color image corresponding to the depth image.
  • Figure 2b shows a point cloud derived from a depth image.
  • Figure 2c shows a point cloud resulting from bilateral filtering of the depth image.
  • Figure 2d shows a point cloud obtained by adaptive bilateral filtering of depth images.
  • the embodiment of the present invention can implement edge preservation and denoising of the depth map by adopting an adaptive bilateral filtering method.
  • S120 Perform visual content-based block fusion and registration processing on the depth image.
  • the depth image sequence is segmented based on the visual content, and each segment is block-fused, and closed-loop detection is performed between segments, and the result of the closed-loop detection is globally optimized.
  • the depth image sequence is a depth image data stream.
  • the step may include: segmenting the depth image sequence by the automatic segmentation method based on visual content, grouping similar depth image content into one segment, performing block fusion on each segment to determine the transformation relationships between the depth images, and performing closed-loop detection between segments according to the transformation relationships to achieve global optimization.
  • this step may include:
  • S122 Back-project the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, compare the similarity between the projected depth image and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initialize the camera pose and start a new segment.
  • This step performs closed-loop detection between segments.
  • S124 Using the pose information of each segment and the matching relationship between segments and segments, constructing a graph and performing graph optimization using a G2O framework to obtain optimized camera trajectory information, thereby achieving global optimization.
  • This step applies the Simultaneous Localization and Calibration (SLAC) mode to correct non-rigid distortion and introduces line-process constraints to remove incorrect closed-loop matches.
  • SLAC: Simultaneous Localization and Calibration
  • the foregoing step S122 may further include:
  • S1221 Calculate the similarity between the depth image of each frame and the depth image of the first frame.
  • This step segments the depth image sequence based on the visual content. In this way, the cumulative error caused by visual odometry estimation can be effectively reduced, and similar content can be fused together, thereby improving the registration accuracy.
  • The next frame's depth image is taken as the starting-frame depth image of the next segment, and steps S1221 and S1222 are repeated until all frame depth images are processed.
  • the step of calculating the similarity between the depth image of each frame and the depth image of the first frame may specifically include:
  • S12211 Calculate the first spatial three-dimensional point corresponding to each pixel on the depth image from the projection relationship and the depth value of any frame's depth image: p = π⁻¹(u_p, Z(u_p))
  • u_p is any pixel on the depth image;
  • Z(u_p) and p respectively denote the depth value at u_p and the first spatial three-dimensional point;
  • π denotes the projection relationship, that is, the 2D-3D projection transformation by which the point cloud data corresponding to each depth image is back-projected into the initial coordinate system.
  • Transform the first spatial three-dimensional point into the world coordinate system by rotation and translation: q = T_i · p, where T_i denotes the rotation-translation matrix from the spatial 3D points of the i-th frame depth image to the world coordinate system, which can be estimated by visual odometry; i takes a positive integer value; p denotes the first spatial three-dimensional point and q the second spatial three-dimensional point.
  • Back-project the second spatial three-dimensional point onto the two-dimensional image plane to obtain the projected depth image: u_q = π(q) = (f_x·x_q/z_q + c_x, f_y·y_q/z_q + c_y)ᵀ, where u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x and c_y denote the intrinsic parameters of the depth camera; x_q, y_q, z_q denote the coordinates of q; T denotes matrix transposition.
  • S12214 Count the numbers of valid pixels on the starting-frame depth image and on the projected depth image of any frame, and take their ratio as the similarity: η = n_i / n_0.
  • n_0 and n_i respectively denote the numbers of valid pixels on the starting-frame depth image and on the projected depth image of any frame; η denotes the similarity (a minimal sketch of this similarity test follows below).
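The following is a minimal NumPy sketch of the similarity test and of the segmentation loop of steps 1-4: every frame is back-projected, moved into the coordinate frame of the current segment's starting frame using the pose estimated by the visual odometry, re-projected onto the image plane, and the ratio η = n_i / n_0 triggers a new segment when it drops below the threshold. The intrinsics, the threshold value and the use of 4x4 world-pose matrices are illustrative assumptions.

```python
import numpy as np

def backproject(Z, fx, fy, cx, cy):
    """p = pi^{-1}(u_p, Z(u_p)): lift every valid depth pixel to a 3D point."""
    H, W = Z.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    valid = Z > 0
    z = Z[valid]
    x = (us[valid] - cx) / fx * z
    y = (vs[valid] - cy) / fy * z
    return np.stack([x, y, z], axis=1)               # N x 3 points

def projected_valid_count(points, T, fx, fy, cx, cy, shape):
    """q = T * p, u_q = pi(q): count points that land inside the image plane
    of the starting frame (the valid-pixel count n_i)."""
    R, t = T[:3, :3], T[:3, 3]
    q = points @ R.T + t
    z = q[:, 2]
    front = z > 1e-6
    u = fx * q[front, 0] / z[front] + cx
    v = fy * q[front, 1] / z[front] + cy
    H, W = shape
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return int(inside.sum())

def segment_sequence(depths, poses, fx, fy, cx, cy, threshold=0.6):
    """Split the depth sequence whenever the overlap with the current segment's
    starting frame drops below the similarity threshold (steps 1-4 above).
    `poses` are assumed to be 4x4 camera-to-world matrices from the odometry."""
    segments, start = [], 0
    T0_inv = np.linalg.inv(poses[0])
    n0 = int((depths[0] > 0).sum())
    for i in range(1, len(depths)):
        pts = backproject(depths[i], fx, fy, cx, cy)
        T_rel = T0_inv @ poses[i]                    # frame i expressed in the start frame
        n_i = projected_valid_count(pts, T_rel, fx, fy, cx, cy, depths[i].shape)
        eta = n_i / max(n0, 1)                       # similarity eta = n_i / n_0
        if eta < threshold:                          # content changed too much:
            segments.append((start, i - 1))          # close the current segment
            start = i                                # frame i starts the next one
            T0_inv = np.linalg.inv(poses[i])
            n0 = int((depths[i] > 0).sum())
    segments.append((start, len(depths) - 1))
    return segments
```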
  • FIG. 3 exemplarily shows a flow diagram of segmentation fusion and registration based on visual content.
  • the embodiment of the invention adopts an automatic segmentation algorithm based on visual content, which can effectively reduce the cumulative error in visual odometry estimation and improve the registration accuracy.
  • the step may include: fusing the depth image of each frame using the truncated signed distance function (TSDF) grid model, according to the result of the visual-content-based block fusion and registration processing, and representing the three-dimensional space with a voxel grid to obtain the three-dimensional model of the complete indoor scene.
  • TSDF: truncated signed distance function
  • This step may further include:
  • the TSDF grid model can be used to fuse the depth images of each frame, representing the three-dimensional space with a voxel grid of resolution m, that is, each dimension of the three-dimensional space is divided into m cells.
  • Each voxel v stores two values: the truncated signed distance function f_i(v) and its weight w_i(v).
  • the truncated signed distance function can be determined according to the following formula: f_i(v) = [K⁻¹ · z_i(u) · [uᵀ, 1]ᵀ]_z − [v_i]_z
  • f_i(v) represents the truncated signed distance function, that is, the distance from the grid cell to the surface of the object model; its sign indicates whether the cell lies on the occluded side or on the visible side of the surface, and the zero crossing lies on the surface itself;
  • K represents the intrinsic parameter matrix of the camera;
  • u represents the pixel;
  • z_i(u) represents the depth value corresponding to the pixel u;
  • v_i represents the voxel.
  • the camera here refers to the depth camera.
  • data weighted fusion can be performed according to the following formulas: F(v) = Σ_{i=1..n} w_i(v)·f_i(v) / Σ_{i=1..n} w_i(v),  W(v) = Σ_{i=1..n} w_i(v)
  • f_i(v) and w_i(v) respectively represent the truncated signed distance function (TSDF) corresponding to the voxel v and its weight function;
  • n takes a positive integer value;
  • F(v) represents the fused truncated signed distance function value corresponding to the voxel v;
  • W(v) represents the weight of the fused truncated signed distance function value corresponding to the voxel v.
  • the weight function may be determined according to the noise characteristics of the depth data and the region of interest, and its value is not fixed. In order to preserve the geometric details of object surfaces, the weights of low-noise areas and areas of interest are set large, while the weights of high-noise areas or areas of no interest are set small.
  • the weight function depends on the following quantities:
  • d_i represents the radius of the region of interest; the smaller the radius, the higher the interest and the greater the weight;
  • δ_s is the noise variance of the depth data, and its value is consistent with the variance of the spatial-domain kernel function of the adaptive bilateral filtering algorithm;
  • w is a constant, which may preferably take the value 1 or 0 (a minimal sketch of this weighted fusion follows below).
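The following is a minimal NumPy sketch of the weighted volume data fusion: each depth frame contributes f_i(v) to every voxel it sees, and F(v) and W(v) are updated as the running weighted average given above. The text only describes the weight qualitatively (large for low-noise areas and areas of interest, small otherwise), so the concrete weight used here, a product of a depth-noise term and a Gaussian falloff with the distance d_i from an assumed region-of-interest center, is an illustrative assumption, as are all parameter values.

```python
import numpy as np

def integrate_frame(F, Wt, depth, K, T_wc, voxel_origin, voxel_size,
                    trunc=0.03, roi_center=None, roi_sigma=0.5,
                    noise_coeff=0.0025, w=1.0):
    """Fuse one depth frame into the TSDF volume F with weight volume Wt:
    F(v) <- (W(v)*F(v) + w_i(v)*f_i(v)) / (W(v) + w_i(v)), W(v) <- W(v) + w_i(v)."""
    # Voxel centers v in world coordinates (flat ordering matches F.reshape(-1)).
    idx = np.indices(F.shape).reshape(3, -1).T
    v_world = voxel_origin + (idx + 0.5) * voxel_size
    # Move voxels into the camera frame of this depth image and project them.
    T_cw = np.linalg.inv(T_wc)
    v_cam = v_world @ T_cw[:3, :3].T + T_cw[:3, 3]
    z_v = v_cam[:, 2]
    safe_z = np.where(z_v > 0, z_v, 1.0)
    u = np.round(K[0, 0] * v_cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v_px = np.round(K[1, 1] * v_cam[:, 1] / safe_z + K[1, 2]).astype(int)
    H, W_img = depth.shape
    ok = (z_v > 1e-6) & (u >= 0) & (u < W_img) & (v_px >= 0) & (v_px < H)
    z_d = np.zeros_like(z_v)
    z_d[ok] = depth[v_px[ok], u[ok]]
    ok &= z_d > 0
    # f_i(v): signed distance along the viewing ray, truncated to [-trunc, trunc];
    # voxels far behind the observed surface are not updated.
    sdf = z_d - z_v
    ok &= sdf > -trunc
    f_i = np.clip(sdf, -trunc, trunc)
    # Assumed weight: down-weight noisy (far) measurements and voxels far
    # from the region of interest (distance d_i).
    w_noise = 1.0 / (1.0 + noise_coeff * z_d ** 2)
    if roi_center is not None:
        d_i = np.linalg.norm(v_world - roi_center, axis=1)
        w_roi = np.exp(-d_i ** 2 / (2.0 * roi_sigma ** 2))
    else:
        w_roi = 1.0
    w_i = np.where(ok, w * w_noise * w_roi, 0.0)
    # Running weighted average F(v), W(v).
    F_flat, W_flat = F.reshape(-1), Wt.reshape(-1)
    new_W = W_flat + w_i
    upd = new_W > 0
    F_flat[upd] = (W_flat[upd] * F_flat[upd] + w_i[upd] * f_i[upd]) / new_W[upd]
    W_flat[:] = new_W
    return F, Wt
```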
  • FIG. 4 exemplarily shows a schematic diagram of a weighted volume data fusion process.
  • the weighted volume data fusion algorithm in the embodiment of the invention can effectively maintain the geometric details of the surface of the object, and can obtain a complete, accurate and refined indoor scene model, which has good robustness and expandability.
  • Figure 5a exemplarily shows a three-dimensional reconstruction result using an unweighted volume data fusion algorithm
  • Figure 5b exemplarily shows a partial detail of the three-dimensional model in Figure 5a
  • Figure 5c exemplarily shows the three-dimensional reconstruction result obtained by the weighted volume data fusion algorithm proposed by the embodiment of the invention;
  • Figure 5d exemplarily shows the local details of the three-dimensional model in Figure 5c.
  • FIG. 6 exemplarily shows an effect of performing three-dimensional reconstruction on the 3D Scene Data data set using the method proposed by the embodiment of the present invention
  • FIG. 7 exemplarily shows the effect of performing three-dimensional reconstruction on the Augmented ICL-NUIM Dataset using the method proposed by the embodiment of the present invention;
  • FIG. 8 exemplarily shows the effect of three-dimensional reconstruction of the indoor scene data collected by Microsoft Kinect for Windows.
  • the embodiment of the present invention further provides a system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera.
  • the system 90 includes: an obtaining module 92, a filtering module 94, a block fusion and registration module 96, and a volume data fusion module 98.
  • the obtaining module 92 is configured to acquire a depth image.
  • the filtering module 94 is configured to perform adaptive bilateral filtering on the depth image.
  • the block fusion and registration module 96 is configured to perform visual content based block fusion and registration processing on the filtered depth image.
  • the volume data fusion module 98 is configured to perform weighted volume data fusion according to the processing result, thereby reconstructing a three-dimensional model of the indoor complete scene.
  • the embodiment of the invention can effectively reduce the cumulative error in visual odometry estimation, improve the registration precision, and effectively preserve the geometric details of object surfaces, so that a complete, accurate and refined indoor scene model can be obtained.
  • the filtering module is specifically configured to: perform adaptive bilateral filtering according to the formula Z̃(u) = (1/W) · Σ_{u_k ∈ Ω} w_s(u, u_k) · w_c(Z(u), Z(u_k)) · Z(u_k),
  • where u and u_k respectively denote any pixel on the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) respectively denote the depth values at u and u_k; Z̃(u) denotes the filtered depth value; W is the normalization factor over the neighborhood Ω; w_s and w_c denote the Gaussian kernel functions filtered in the spatial domain and the range domain, respectively.
  • the block fusion and registration module may be specifically configured to: segment the depth image sequence based on visual content, perform block fusion on each segment, perform closed-loop detection between segments, and globally optimize the results of the closed-loop detection.
  • the block fusion and registration module is further specifically configured to: segment the depth image sequence by the automatic segmentation method based on visual-content detection, group similar depth image content into one segment, block-fuse each segment to determine the transformation relationships between the depth images, and perform closed-loop detection between segments according to the transformation relationships to achieve global optimization.
  • the block fusion and registration module may specifically include: a camera pose information acquisition unit, a segmentation unit, a registration unit, and an optimization unit.
  • the camera pose information acquisition unit is configured to perform visual odometry estimation using the Kintinuous framework to obtain the camera pose information for each frame's depth image.
  • the segmentation unit is configured to back-project the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, compare the similarity between the projected depth image and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initialize the camera pose and start a new segment.
  • the registration unit is used to extract the FPFH geometric descriptor from the point cloud data of each segment, perform coarse registration between every two segments, and perform fine registration using the GICP algorithm to obtain the matching relationships between segments (a rough sketch of this coarse-to-fine registration follows below).
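The following is a rough sketch of the coarse-to-fine registration between two fused segments using Open3D; its availability and the exact function signatures, which have changed between Open3D releases, are assumptions here, and the voxel size, radii and thresholds are illustrative. The coarse step uses RANSAC over FPFH correspondences, followed by generalized ICP refinement as described above.

```python
import open3d as o3d

def register_segments(src, dst, voxel=0.05):
    """Estimate the transformation aligning segment point cloud `src` to `dst`:
    FPFH-based coarse registration followed by GICP refinement."""
    reg = o3d.pipelines.registration
    src_d = src.voxel_down_sample(voxel)
    dst_d = dst.voxel_down_sample(voxel)
    for pc in (src_d, dst_d):
        pc.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
    feat = lambda pc: reg.compute_fpfh_feature(
        pc, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
    # Coarse registration from FPFH feature correspondences.
    coarse = reg.registration_ransac_based_on_feature_matching(
        src_d, dst_d, feat(src_d), feat(dst_d), True, 1.5 * voxel,
        reg.TransformationEstimationPointToPoint(False), 3, [],
        reg.RANSACConvergenceCriteria(100000, 0.999))
    # Fine registration with generalized ICP, initialized by the coarse result.
    fine = reg.registration_generalized_icp(
        src_d, dst_d, voxel, coarse.transformation)
    return fine.transformation
```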
  • the optimization unit is used to construct a graph from the pose information of each segment and the matching relationships between segments, and to perform graph optimization using the G2O framework to obtain the optimized camera trajectory information, thereby achieving global optimization.
  • the segmentation unit may specifically include: a calculation unit, a determination unit, a segmentation subunit, and a processing unit.
  • the calculation unit is configured to calculate the similarity between the depth image of each frame and the depth image of the first frame.
  • the judging unit is configured to judge whether the similarity is lower than the similarity threshold.
  • the segmentation subunit is configured to segment the depth image sequence when the similarity is below the similarity threshold.
  • the processing unit is configured to use the next frame depth image as the start frame depth image of the next segment, and repeatedly execute the calculation unit and the determination unit until all frame depth images are processed.
  • the volume data fusion module can be specifically used to fuse the depth image of each frame using the truncated signed distance function grid model according to the processing result, and to represent the three-dimensional space with a voxel grid, thereby obtaining the three-dimensional model of the complete indoor scene.
  • the volume data fusion module may specifically include a weighted fusion unit and an extraction unit.
  • the weighted fusion unit is configured to perform weighted fusion of the truncated symbol distance function data based on the noise feature and the region of interest using the Volumetric method framework.
  • the extraction unit is used to extract the Mesh model using the Marching cubes algorithm, thereby obtaining the three-dimensional model of the complete indoor scene (a minimal sketch follows below).
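The following is a minimal sketch of the Mesh extraction step using the marching cubes implementation from scikit-image (assumed available); a synthetic truncated signed distance volume stands in for the fused volume F, and the zero level set is extracted as the surface.

```python
import numpy as np
from skimage import measure

# Synthetic stand-in for the fused TSDF volume F: signed distance to a sphere
# of radius 0.2 m, truncated to +/- 0.03 m, sampled on a 64^3 voxel grid.
res, voxel_size = 64, 0.01
coords = (np.indices((res, res, res)).reshape(3, -1).T + 0.5) * voxel_size
F = np.clip(np.linalg.norm(coords - 0.32, axis=1) - 0.2, -0.03, 0.03)
F = F.reshape(res, res, res)

# Marching cubes on the zero level set of the truncated signed distance function.
verts, faces, normals, values = measure.marching_cubes(F, level=0.0)
verts_world = verts * voxel_size        # voxel indices back to metric coordinates
print(len(verts), "vertices,", len(faces), "triangles")
```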
  • the system for performing three-dimensional reconstruction of complete indoor scenes based on the consumer-level depth camera includes an acquisition module, a filtering module, a block fusion and registration module, and a volume data fusion module, wherein:
  • the acquisition module is used for depth image acquisition of indoor scenes using a depth camera.
  • the filtering module is configured to perform adaptive bilateral filtering processing on the acquired depth image.
  • the acquisition module here is equivalent to the obtaining module described above.
  • real indoor scene data can be acquired using the handheld consumer depth camera Microsoft Kinect for Windows.
  • adaptive bilateral filtering is performed on the acquired depth image, and the parameters of the adaptive bilateral filtering method are set automatically according to the noise characteristics of the depth camera and its intrinsic parameters; therefore, the embodiment of the present invention can effectively remove noise while preserving edge information.
  • the block fusion and registration module is used to automatically segment the data stream based on the visual content, each segment performs block fusion, and the closed-loop detection is performed between segments, and the result of the closed-loop detection is globally optimized.
  • the block fusion and registration module performs automatic block fusion and registration based on visual content.
  • the block fusion and registration module specifically includes: a pose information acquisition module, a segmentation module, a coarse registration module, a fine registration module, and an optimization module.
  • the pose information acquisition module is configured to perform visual odometry estimation using the Kintinuous framework to obtain the camera pose information for each frame's depth image.
  • the segmentation module is configured to back-project the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, and to compare the similarity between the projected depth image and the depth image of the initial frame; if the similarity is lower than the similarity threshold, the camera pose is initialized and a new segment is started.
  • the coarse registration module is used to extract the FPFH geometric descriptor from the point cloud data of each segment and to perform coarse registration between every two segments;
  • the fine registration module is used for fine registration using the GICP algorithm to obtain the matching relationship between segments.
  • the optimization module is used to construct the map and use the G2O framework for graph optimization by using the pose information of each segment and the matching relationship between segments.
  • the optimization module is further used to apply the SLAC (Simultaneous Localization and Calibration) mode to correct non-rigid distortion, and to use line-process constraints to remove incorrect closed-loop matches.
  • SLAC: Simultaneous Localization and Calibration
  • the above-mentioned block fusion and registration module segments the RGB-D data stream based on the visual content, which can effectively solve the cumulative error problem caused by visual odometry estimation and can fuse similar content together, thereby improving the registration precision.
  • the volume data fusion module is configured to perform weighted volume data fusion according to the optimized camera track information to obtain a three-dimensional model of the scene.
  • the volume data fusion module defines the weight function of the truncated signed distance function according to the noise characteristics of the depth camera and the region of interest, so as to preserve the geometric details of object surfaces.
  • the system embodiment for performing three-dimensional reconstruction of complete indoor scenes based on a consumer-level depth camera may be used to implement the corresponding method embodiment; its technical principle, the technical problems it solves, and the technical effects it produces are similar, and the embodiments may be referred to each other. For convenience and brevity of description, the parts that the embodiments have in common are not repeated.
  • in the above system and method for performing three-dimensional reconstruction of complete indoor scenes based on a consumer-level depth camera, the division into the above functional modules, units or steps is only illustrative.
  • the acquisition module described above may also serve as the obtaining module.
  • in practice, the functions may be distributed among different functional modules, units or steps as required; that is, the modules, units or steps in the embodiment of the present invention may be decomposed or combined. For example, the obtaining module and the filtering module may be combined into a data preprocessing module.

Abstract

Provided are a method and system for three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera. The method comprises: acquiring a depth image and performing adaptive bilateral filtering; performing visual odometer estimation using the filtered depth image, automatically segmenting an image sequence on the basis of visual content, performing closed-loop detection between segments, and performing global optimization; and performing weighted volume data fusion according to optimized camera trajectory information, so as to reconstruct a three-dimensional model of a complete indoor scene. The method and system realize edge preservation and denoising of a depth image by means of an adaptive bilateral filtering algorithm. An automatic segmentation algorithm based on visual content can effectively reduce a cumulative error in the process of visual odometer estimation and improve the registration accuracy. The use of the weighted volume data fusion algorithm can effectively preserve the geometric details of the surface of an object. Thus, the technical problem of how to improve the accuracy of three-dimensional reconstruction in an indoor scene is solved, such that a complete, accurate and refined indoor scene model can be obtained.

Description

Method and system for three-dimensional reconstruction of a complete indoor scene based on a depth camera
Technical Field
The present invention relates to the field of computer vision technology, and in particular, to a method and system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera.
Background Art
High-precision 3D reconstruction of indoor scenes is one of the challenging research topics in computer vision, involving theories and techniques from computer vision, computer graphics, pattern recognition, optimization and many other fields. There are many ways to realize 3D reconstruction. The traditional approach is to use laser or radar ranging sensors or structured-light technology to acquire the structural information of the scene or object surface. However, these instruments are mostly expensive and difficult to carry, so their applications are limited. With the development of computer vision technology, researchers began to study purely vision-based methods for 3D reconstruction, which has produced a great deal of useful research work.
After the introduction of the consumer-grade depth camera Microsoft Kinect, people can directly use depth data to conveniently perform three-dimensional reconstruction of indoor scenes. The KinectFusion algorithm proposed by Newcombe et al. uses Kinect to obtain the depth of each point in the image and estimates the pose of the current frame camera by aligning the coordinates of 3D points in the current camera coordinate system with their coordinates in the global model using the Iterative Closest Point (ICP) algorithm; volume data fusion is then performed iteratively through a Truncated Signed Distance Function (TSDF) to obtain a dense three-dimensional model. Although the depth acquired by Kinect is not affected by lighting conditions or texture richness, its depth range is only 0.5-4 m, and the position and size of the mesh model are fixed, so it is only suitable for local, static indoor scenes.
Three-dimensional reconstruction of indoor scenes with a consumer-grade depth camera generally suffers from the following problems: (1) the depth images acquired by a consumer-grade depth camera have low resolution and high noise, which makes it difficult to preserve object surface details, and the limited depth range prevents direct use for complete-scene 3D reconstruction; (2) the cumulative error of camera pose estimation leads to erroneous, distorted 3D models; (3) consumer-grade depth cameras are generally handheld, the camera motion is rather arbitrary, and the quality of the acquired data varies, which affects the reconstruction result.
In order to perform complete 3D reconstruction of indoor scenes, Whelan et al. proposed the Kintinuous algorithm, a further extension of KinectFusion. The algorithm uses a Shifting TSDF Volume to recycle GPU memory, which solves the problem of mesh-model memory consumption during large-scene reconstruction; it uses DBoW to find matching key frames for closed-loop detection, and finally optimizes the pose graph and the model to obtain a large-scale 3D model. Choi et al. proposed the Elastic Fragments idea: the RGB-D data stream is first split into segments of 50 frames each, visual odometry is estimated separately for each segment, the FPFH geometric descriptor is extracted from the point cloud data of every pair of segments to find matches for closed-loop detection, line-process constraints are then introduced to optimize the detection results and remove incorrect loop closures, and finally the optimized odometry information is used for volume data fusion. Complete indoor scene reconstruction is thus achieved through segmentation and closed-loop detection, but the local geometric details of objects are not preserved, and this fixed-length segmentation is not robust when reconstructing real indoor scenes. Zeng et al. proposed the 3D Match descriptor: the RGB-D data stream is first split into fixed segments and reconstructed into local models, key points are extracted from the 3D model of each segment as the input of a 3D convolutional network (ConvNet), the feature vectors learned by this network are fed into a metric network, and matching results are output by similarity comparison. Since deep networks have a very clear advantage in feature learning, using 3D Match for geometric registration can improve reconstruction accuracy compared with other descriptors. However, this method requires local 3D reconstruction first, uses a deep learning network for geometric registration, and then outputs the global 3D model; network training also requires a large amount of data, so the whole reconstruction pipeline is inefficient.
To improve the accuracy of 3D reconstruction, Angela et al. proposed the VSBR algorithm, whose main idea is to use Shape from Shading (SFS) to hierarchically optimize the TSDF data before fusion, in order to solve the problem that over-smoothing during TSDF data fusion causes the loss of object surface details, thereby obtaining a more refined three-dimensional structural model. However, this method is only effective for reconstructing single objects under ideal lighting; for indoor scenes the accuracy improvement is not obvious because the lighting varies greatly.
In view of this, the present invention is proposed.
Summary of the Invention
In order to solve the above problems in the prior art, namely the technical problem of how to improve the accuracy of three-dimensional reconstruction in indoor scenes, a method and system for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera are provided.
In order to achieve the above object, in one aspect, the following technical solution is provided:
A method for performing three-dimensional reconstruction of a complete indoor scene based on a consumer-level depth camera, which may include:
acquiring a depth image;
performing adaptive bilateral filtering on the depth image;
performing visual-content-based block fusion and registration processing on the filtered depth image;
performing weighted volume data fusion according to the processing result, thereby reconstructing a three-dimensional model of the complete indoor scene.
Preferably, the performing of adaptive bilateral filtering on the depth image specifically includes:
performing adaptive bilateral filtering according to the following formula:
Z̃(u) = (1/W) · Σ_{u_k ∈ Ω} w_s(u, u_k) · w_c(Z(u), Z(u_k)) · Z(u_k)
where u and u_k respectively denote any pixel on the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) respectively denote the depth values at u and u_k; Z̃(u) denotes the filtered depth value; W is the normalization factor over the neighborhood Ω; w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
Preferably, the Gaussian kernel functions for spatial-domain and range-domain filtering are determined according to the following formulas:
w_s(u, u_k) = exp(−‖u − u_k‖² / (2·δ_s²)),  w_c(Z(u), Z(u_k)) = exp(−(Z(u) − Z(u_k))² / (2·δ_c²))
where δ_s and δ_c are the variances of the spatial-domain and range-domain Gaussian kernel functions, respectively;
δ_s and δ_c are determined adaptively from the depth value and the camera parameters, where f denotes the focal length of the depth camera and K_s and K_c denote constants.
Preferably, the performing of visual-content-based block fusion and registration processing on the filtered depth image specifically includes: segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and globally optimizing the result of the closed-loop detection.
Preferably, the segmenting of the depth image sequence based on visual content, the block fusion of each segment, the closed-loop detection between segments, and the global optimization of the closed-loop detection result specifically include:
segmenting the depth image sequence by an automatic segmentation method based on visual-content detection, grouping similar depth image content into one segment, block-fusing each segment to determine the transformation relationships between the depth images, and performing closed-loop detection between segments according to the transformation relationships to achieve global optimization.
Preferably, the segmenting of the depth image sequence by the automatic segmentation method based on visual-content detection, the grouping of similar depth image content into one segment, the block fusion of each segment to determine the transformation relationships between the depth images, and the closed-loop detection between segments according to the transformation relationships to achieve global optimization specifically include:
performing visual odometry estimation using the Kintinuous framework to obtain the camera pose information for each frame's depth image;
back-projecting the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, comparing the similarity between the projected depth image and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initializing the camera pose and starting a new segment;
extracting the FPFH geometric descriptor from the point cloud data of each segment, performing coarse registration between every two segments, and performing fine registration using the GICP algorithm to obtain the matching relationships between segments;
constructing a graph from the pose information of each segment and the matching relationships between segments, and performing graph optimization using the G2O framework to obtain the optimized camera trajectory information, thereby achieving the global optimization.
Preferably, the back-projecting of the point cloud data corresponding to each frame's depth image into the initial coordinate system according to the camera pose information, the similarity comparison between the projected depth image and the depth image of the initial frame, and the initialization of the camera pose and segmentation when the similarity is lower than the similarity threshold specifically include:
Step 1: calculating the similarity between each frame's depth image and the depth image of the first frame;
Step 2: determining whether the similarity is lower than the similarity threshold;
Step 3: if so, segmenting the depth image sequence;
Step 4: taking the next frame's depth image as the starting-frame depth image of the next segment, and repeating steps 1 and 2 until all frame depth images are processed.
Preferably, step 1 specifically includes:
calculating, from the projection relationship and the depth value of any frame's depth image, the first spatial three-dimensional point corresponding to each pixel on the depth image using the following formula:
p = π⁻¹(u_p, Z(u_p))
where u_p is any pixel on the depth image; Z(u_p) and p respectively denote the depth value at u_p and the first spatial three-dimensional point; π denotes the projection relationship;
transforming the first spatial three-dimensional point into the world coordinate system by rotation and translation according to the following formula to obtain the second spatial three-dimensional point:
q = T_i · p
where T_i denotes the rotation-translation matrix from the spatial 3D points of the i-th frame depth image to the world coordinate system; p denotes the first spatial three-dimensional point and q denotes the second spatial three-dimensional point; i takes a positive integer value;
back-projecting the second spatial three-dimensional point onto the two-dimensional image plane according to the following formula to obtain the projected depth image:
u_q = π(q) = (f_x·x_q/z_q + c_x, f_y·y_q/z_q + c_y)ᵀ
where u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x and c_y denote the intrinsic parameters of the depth camera; x_q, y_q, z_q denote the coordinates of q; T denotes matrix transposition;
counting the numbers of valid pixels on the starting-frame depth image and on the projected depth image of any frame, and taking their ratio as the similarity.
Preferably, performing weighted volumetric data fusion according to the processing result, thereby reconstructing the 3D model of the complete indoor scene, specifically includes: fusing the depth image of each frame with a truncated signed distance function (TSDF) grid model according to the processing result, and representing the 3D space with a voxel grid, thereby obtaining the 3D model of the complete indoor scene.

Preferably, fusing the depth image of each frame with the truncated signed distance function grid model according to the processing result and representing the 3D space with a voxel grid, thereby obtaining the 3D model of the complete indoor scene, specifically includes:

performing weighted fusion of the truncated signed distance function data within the Volumetric Method framework, based on the noise characteristics and a region-of-interest model;

extracting a mesh model with the Marching Cubes algorithm, thereby obtaining the 3D model of the complete indoor scene.

Preferably, the truncated signed distance function is determined according to the following formula:

f_i(v) = [K^(-1) · z_i(u) · [u^T, 1]^T]_z − [v_i]_z

where f_i(v) denotes the truncated signed distance function, i.e., the distance from the voxel to the surface of the object model; its sign indicates whether the voxel lies on the occluded side of the surface or on the visible side, and the zero crossing corresponds to a point on the surface; K denotes the intrinsic parameter matrix of the camera; u denotes a pixel; z_i(u) denotes the depth value corresponding to pixel u; and v_i denotes the voxel.

Preferably, the weighted data fusion is performed according to the following formulas:

F(v) = ( Σ_{i=1..n} w_i(v) · f_i(v) ) / ( Σ_{i=1..n} w_i(v) )

W(v) = Σ_{i=1..n} w_i(v)

where v denotes a voxel; f_i(v) and w_i(v) denote the truncated signed distance function of voxel v and its weight function, respectively; n is a positive integer; F(v) denotes the fused truncated signed distance function value of voxel v; and W(v) denotes the weight of the fused truncated signed distance function value of voxel v;

where the weight function may be determined according to the following formula:

Figure PCTCN2017072257-appb-000008

where d_i denotes the radius of the region of interest; δ_s is the noise variance of the depth data; and w is a constant.
In order to achieve the above object, in another aspect, a system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera is further provided. The system includes:

an acquisition module, configured to acquire depth images;

a filtering module, configured to perform adaptive bilateral filtering on the depth images;

a block-wise fusion and registration module, configured to perform visual-content-based block-wise fusion and registration on the filtered depth images;

a volumetric data fusion module, configured to perform weighted volumetric data fusion according to the processing result, thereby reconstructing the 3D model of the complete indoor scene.
Preferably, the filtering module is specifically configured to:

perform adaptive bilateral filtering according to the following formula:

Z~(u) = (1 / W) · Σ_{u_k ∈ Ω} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

where u and u_k denote any pixel of the depth image and a pixel in its neighborhood, respectively; Z(u) and Z(u_k) denote the depth values corresponding to u and u_k; Z~(u) denotes the corresponding filtered depth value; W denotes the normalization factor over the neighborhood Ω; and w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
Preferably, the block-wise fusion and registration module may be specifically configured to: segment the depth image sequence based on visual content, perform block-wise fusion on each segment, perform loop closure detection between the segments, and globally optimize the loop closure detection results.

Preferably, the block-wise fusion and registration module may be further specifically configured to:

segment the depth image sequence with the automatic segmentation method based on visual-content detection, group similar depth image content into one segment, perform block-wise fusion on each segment, determine the transformation relationships between the depth images, and perform loop closure detection between segments according to the transformation relationships to achieve global optimization.

Preferably, the block-wise fusion and registration module specifically includes:

a camera pose information acquisition unit, configured to perform visual odometry estimation with the Kintinuous framework to obtain the camera pose for each depth image frame;

a segmentation unit, configured to back-project the point cloud data corresponding to each depth image frame into the initial coordinate system according to the camera pose, compare the similarity between the projected depth image and the depth image of the starting frame, and, when the similarity falls below the similarity threshold, re-initialize the camera pose and start a new segment;

a registration unit, configured to extract the FPFH geometric descriptor from the point cloud data of each segment, perform coarse registration between every two segments, and perform fine registration with the GICP algorithm to obtain the matching relationship between segments;

an optimization unit, configured to construct a graph using the pose information of each segment and the inter-segment matching relationships, and to perform graph optimization with the g2o framework to obtain the optimized camera trajectory, thereby achieving the global optimization.
Preferably, the segmentation unit specifically includes:

a computing unit, configured to compute the similarity between each depth image frame and the first depth image frame;

a judging unit, configured to determine whether the similarity is below the similarity threshold;

a segmentation subunit, configured to segment the depth image sequence when the similarity is below the similarity threshold;

a processing unit, configured to take the next depth image frame as the starting frame of the next segment, and to repeatedly invoke the computing unit and the judging unit until all depth image frames have been processed.
Preferably, the volumetric data fusion module is specifically configured to: fuse the depth image of each frame with a truncated signed distance function grid model according to the processing result, and represent the 3D space with a voxel grid, thereby obtaining the 3D model of the complete indoor scene.

Preferably, the volumetric data fusion module specifically includes:

a weighted fusion unit, configured to perform weighted fusion of the truncated signed distance function data within the Volumetric Method framework, based on the noise characteristics and the region of interest;

an extraction unit, configured to extract a mesh model with the Marching Cubes algorithm, thereby obtaining the 3D model of the complete indoor scene.
Embodiments of the present invention provide a method and system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera. The method includes: acquiring a depth image; performing adaptive bilateral filtering on the depth image; performing visual-content-based block-wise fusion and registration on the filtered depth image; and performing weighted volumetric data fusion according to the processing result, thereby reconstructing a 3D model of the complete indoor scene. By performing visual-content-based block-wise fusion and registration on the depth images, the embodiments of the present invention effectively reduce the accumulated error in visual odometry estimation and improve registration accuracy; the weighted volumetric data fusion algorithm effectively preserves the geometric details of object surfaces. The technical problem of improving the accuracy of 3D reconstruction of indoor scenes is thereby solved, so that a complete, accurate and refined indoor scene model can be obtained.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic flowchart of a method for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera according to an embodiment of the present invention;

FIG. 2a is a color image corresponding to a depth image according to an embodiment of the present invention;

FIG. 2b is a schematic diagram of a point cloud obtained from the depth image according to an embodiment of the present invention;

FIG. 2c is a schematic diagram of a point cloud obtained by bilateral filtering of the depth image according to an embodiment of the present invention;

FIG. 2d is a schematic diagram of a point cloud obtained by adaptive bilateral filtering of the depth image according to an embodiment of the present invention;

FIG. 3 is a schematic flowchart of visual-content-based segment fusion and registration according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a weighted volumetric data fusion process according to an embodiment of the present invention;

FIG. 5a is a schematic diagram of a 3D reconstruction result obtained with an unweighted volumetric data fusion algorithm;

FIG. 5b is a schematic diagram of local details of the 3D model in FIG. 5a;

FIG. 5c is a schematic diagram of a 3D reconstruction result obtained with the weighted volumetric data fusion algorithm proposed in an embodiment of the present invention;

FIG. 5d is a schematic diagram of local details of the 3D model in FIG. 5c;

FIG. 6 is a schematic diagram of 3D reconstruction results obtained with the method proposed in an embodiment of the present invention on the 3D Scene Data dataset;

FIG. 7 is a schematic diagram of 3D reconstruction results obtained with the method proposed in an embodiment of the present invention on the Augmented ICL-NUIM Dataset;

FIG. 8 is a schematic diagram of 3D reconstruction results obtained on indoor scene data captured with a Microsoft Kinect for Windows according to an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera according to an embodiment of the present invention.
DETAILED DESCRIPTION

Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are only used to explain the technical principles of the present invention and are not intended to limit the protection scope of the present invention.

An embodiment of the present invention provides a method for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera. As shown in FIG. 1, the method includes:

S100: acquiring a depth image.

Specifically, this step may include: acquiring the depth image with a consumer-grade depth camera based on the structured-light principle.

A consumer-grade depth camera based on the structured-light principle (e.g., Microsoft Kinect for Windows or Xtion, hereinafter referred to as the depth camera) acquires the depth data of the depth image by emitting structured light and receiving the reflected information.

In practical applications, real indoor scene data can be captured with the handheld consumer-grade depth camera Microsoft Kinect for Windows.
The depth data can be computed according to the following formula:

Z = f · B / D

where f denotes the focal length of the consumer-grade depth camera, B denotes the baseline, and D denotes the disparity.
S110: performing adaptive bilateral filtering on the depth image.

In this step, adaptive bilateral filtering is applied to the acquired depth image based on the noise characteristics of the consumer-grade depth camera built on the structured-light principle.

The adaptive bilateral filtering algorithm filters the depth image in both the spatial domain and the range domain.

In practical applications, the parameters of the adaptive bilateral filtering algorithm can be set according to the noise characteristics of the depth camera and its intrinsic parameters, so that noise is effectively removed while edge information is preserved.

Taking the partial derivative of the depth Z with respect to the disparity D gives the following relationship:

∂Z/∂D = −f·B / D² = −Z² / (f·B)

The noise of the depth data is mainly produced by the quantization process. It can be seen from the above formula that the variance of the depth noise is proportional to the square of the depth value; that is, the larger the depth value, the larger the noise. In order to effectively remove the noise in the depth image, the embodiment of the present invention defines the filtering algorithm based on this noise characteristic.
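As an illustration of the two relations above, the following minimal Python sketch converts a disparity map to depth and propagates the disparity quantization noise through Z = f·B/D; the focal length, baseline and disparity values are hypothetical placeholders, not parameters taken from this filing.

```python
import numpy as np

def disparity_to_depth(disparity, f=580.0, B=0.075):
    """Convert a disparity map to depth via Z = f * B / D (f in pixels, B in meters)."""
    depth = np.zeros_like(disparity, dtype=np.float64)
    valid = disparity > 0
    depth[valid] = f * B / disparity[valid]
    return depth

def depth_noise_std(depth, f=580.0, B=0.075, sigma_d=0.5):
    """Disparity quantization noise propagated through Z = f*B/D:
    sigma_Z ~ (Z^2 / (f*B)) * sigma_D, i.e. noise grows with the square of the depth."""
    return (depth ** 2) / (f * B) * sigma_d

# Example with synthetic disparities (hypothetical values).
disparity = np.array([[30.0, 15.0], [10.0, 5.0]])
Z = disparity_to_depth(disparity)
print(Z)                   # smaller disparity -> larger depth
print(depth_noise_std(Z))  # noise level grows quadratically with depth
```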
Specifically, the above adaptive bilateral filtering can be performed according to the following formula:

Z~(u) = (1 / W) · Σ_{u_k ∈ Ω} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

where u and u_k denote any pixel of the depth image and a pixel in its neighborhood, respectively; Z(u) and Z(u_k) denote the depth values corresponding to u and u_k; Z~(u) denotes the corresponding filtered depth value; W denotes the normalization factor over the neighborhood Ω; and w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
In the above embodiment, w_s and w_c can be determined according to the following formulas:

w_s(x) = exp(−x² / (2·δ_s²)),  w_c(x) = exp(−x² / (2·δ_c²))

where δ_s and δ_c are the variances of the Gaussian kernel functions in the spatial domain and the range domain, respectively.

δ_s and δ_c are related to the magnitude of the depth value, and their values are not fixed.

Specifically, in the above embodiment, δ_s and δ_c can be determined according to the following formula:

Figure PCTCN2017072257-appb-000018

where f denotes the focal length of the depth camera, and K_s and K_c denote constants whose specific values are related to the parameters of the depth camera.
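The following sketch illustrates the idea of depth-adaptive bilateral filtering. It is only a minimal illustration, assuming a fixed spatial kernel width and a range kernel width that follows the depth-dependent noise level σ_Z(Z) = Z²/(f·B)·σ_D; the exact rule for δ_s and δ_c in this filing is given only by the formula image referenced above, so the parameter values and the adaptation rule here are assumptions.

```python
import numpy as np

def adaptive_bilateral_filter(Z, f=580.0, B=0.075, sigma_d=0.5,
                              sigma_s=2.0, radius=2):
    """Edge-preserving depth smoothing whose range kernel width adapts to the
    depth-dependent noise level (assumed rule, not the filing's exact formula)."""
    H, W = Z.shape
    out = Z.copy()
    for y in range(H):
        for x in range(W):
            z0 = Z[y, x]
            if z0 <= 0:                                   # invalid depth stays invalid
                continue
            sigma_c = (z0 ** 2) / (f * B) * sigma_d       # adaptive range kernel width
            acc = norm = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W and Z[yy, xx] > 0:
                        ws = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
                        wc = np.exp(-((Z[yy, xx] - z0) ** 2) / (2.0 * sigma_c ** 2))
                        acc += ws * wc * Z[yy, xx]
                        norm += ws * wc
            out[y, x] = acc / norm                        # center pixel guarantees norm > 0
    return out
```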
FIGS. 2a-2d exemplarily show a comparison of the effects of different filtering algorithms. FIG. 2a shows the color image corresponding to a depth image. FIG. 2b shows the point cloud obtained from the depth image. FIG. 2c shows the point cloud obtained by bilateral filtering of the depth image. FIG. 2d shows the point cloud obtained by adaptive bilateral filtering of the depth image.

By adopting the adaptive bilateral filtering method, the embodiment of the present invention achieves edge-preserving denoising of the depth map.
S120: performing visual-content-based block-wise fusion and registration on the depth image.

In this step, the depth image sequence is segmented based on visual content, each segment is fused block-wise, loop closure detection is performed between segments, and the loop closure detection results are globally optimized. The depth image sequence is the depth image data stream.

Preferably, this step may include: determining the transformation relationships between the depth images, segmenting the depth image sequence with the visual-content-based automatic segmentation method, grouping similar depth image content into one segment, performing block-wise fusion on each segment, and performing loop closure detection between segments according to the transformation relationships to achieve global optimization.

Further, this step may include:

S121: performing visual odometry estimation with the Kintinuous framework to obtain the camera pose for each depth image frame.

S122: back-projecting the point cloud data corresponding to each depth image frame into the initial coordinate system according to the camera pose, comparing the similarity between the projected depth image and the depth image of the starting frame, and, when the similarity falls below the similarity threshold, re-initializing the camera pose and starting a new segment.

S123: extracting the FPFH geometric descriptor from the point cloud data of each segment, performing coarse registration between every two segments, and performing fine registration with the GICP algorithm to obtain the matching relationship between segments.

This step performs loop closure detection between segments.

S124: constructing a graph using the pose information of each segment and the inter-segment matching relationships, and performing graph optimization with the g2o framework to obtain the optimized camera trajectory, thereby achieving global optimization.

During optimization, this step applies the SLAC (Simultaneous Localization and Calibration) mode to correct non-rigid distortion, and introduces line processes constraints to remove incorrect loop closure matches.
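The inter-segment matching of S123 uses FPFH features for coarse alignment and GICP for refinement, and S124 then optimizes the resulting pose graph with g2o; those components are not reproduced here. The sketch below is only a simplified stand-in for the refinement step: a point-to-point ICP with an SVD-based pose solve between two small segment point clouds. The FPFH coarse alignment, the GICP cost, and the graph optimization with SLAC and line processes are deliberately omitted.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rigid transform (Kabsch/SVD) aligning src to dst, both of shape (N, 3)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # fix a possible reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src
    return R, t

def icp(src, dst, iters=30):
    """Point-to-point ICP: a simplified stand-in for the GICP fine registration."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # brute-force nearest neighbours (adequate for small downsampled segment clouds)
        d = np.linalg.norm(cur[:, None, :] - dst[None, :, :], axis=2)
        matches = dst[d.argmin(axis=1)]
        R, t = best_rigid_transform(cur, matches)
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```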
The above step S122 may further specifically include:

S1221: computing the similarity between each depth image frame and the first depth image frame.

S1222: determining whether the similarity is below the similarity threshold; if so, performing step S1223; otherwise, performing step S1224.

S1223: segmenting the depth image sequence.

This step segments the depth image sequence based on visual content. This both effectively alleviates the accumulated error produced by visual odometry estimation and fuses similar content together, thereby improving registration accuracy.

S1224: not segmenting the depth image sequence.

S1225: taking the next depth image frame as the starting frame of the next segment, and repeating steps S1221 and S1222 until all depth image frames have been processed.
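The following sketch outlines the segmentation loop of steps S1221-S1225. It assumes a similarity function such as the one detailed in steps S12211-S12214 below; the threshold value is an illustrative assumption, not a value specified in the filing.

```python
def segment_sequence(depth_frames, similarity, threshold=0.75):
    """Split a depth image sequence into segments whenever the similarity to the
    current segment's starting frame drops below the threshold (S1221-S1225)."""
    segments, start = [], 0
    for i in range(1, len(depth_frames)):
        rho = similarity(depth_frames[start], depth_frames[i])   # S1221
        if rho < threshold:                                      # S1222 / S1223
            segments.append((start, i - 1))
            start = i                                            # S1225: new starting frame
    segments.append((start, len(depth_frames) - 1))
    return segments
```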
In the above embodiment, the step of computing the similarity between each depth image frame and the first depth image frame may specifically include:

S12211: computing the first 3D spatial point corresponding to each pixel of the depth image from the projection relationship and the depth value of any frame, using the following formula:

p = π^(-1)(u_p, Z(u_p))

where u_p is any pixel of the depth image; Z(u_p) and p denote the depth value corresponding to u_p and the first 3D spatial point, respectively; and π denotes the projection relationship, i.e., the 2D-3D projection transformation used to back-project the point cloud data corresponding to each depth image frame into the initial coordinate system.

S12212: transforming the first 3D spatial point by rotation and translation into the world coordinate system according to the following formula to obtain a second 3D spatial point:

q = T_i · p

where T_i denotes the rotation-translation matrix that maps the 3D points of the i-th depth frame to the world coordinate system, which can be estimated by visual odometry; i is a positive integer; p denotes the first 3D spatial point and q denotes the second 3D spatial point, with coordinates:

p = (x_p, y_p, z_p),  q = (x_q, y_q, z_q).

S12213: back-projecting the second 3D spatial point onto the 2D image plane according to the following formula to obtain the projected depth image:

u_q = [ f_x·x_q/z_q + c_x , f_y·y_q/z_q + c_y ]^T

where u_q is the pixel of the projected depth image corresponding to q; f_x, f_y, c_x and c_y denote the intrinsic parameters of the depth camera; x_q, y_q and z_q denote the coordinates of q; and T denotes the matrix transpose.

S12214: counting the number of valid pixels in the starting-frame depth image and in the projected depth image of any frame, respectively, and taking their ratio as the similarity.

For example, the similarity is computed according to the following formula:

ρ = n_i / n_0

where n_0 and n_i denote the number of valid pixels in the starting-frame depth image and in the projected depth image of any frame, respectively; and ρ denotes the similarity.
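A compact sketch of the similarity measure of S12211-S12214 follows: the valid pixels of frame i are lifted to 3D with the intrinsics, transformed by the odometry pose T_i, projected back into the view of the starting frame, and the ratio of valid projected pixels to valid pixels of the starting frame is returned. The intrinsic values are placeholders, and counting a projected pixel as "valid" when it lands inside the image with positive depth is an interpretation rather than the filing's exact criterion.

```python
import numpy as np

def view_similarity(Z0, Zi, T_i, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """rho = n_i / n_0: valid pixels of frame i re-projected into the initial view,
    divided by the number of valid pixels of the starting frame (S12211-S12214)."""
    H, W = Zi.shape
    v, u = np.nonzero(Zi > 0)                        # valid pixels of frame i
    z = Zi[v, u]
    # back-project to 3D camera points: p = pi^{-1}(u, Z(u))
    p = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    # transform into the initial (world) frame: q = T_i p
    q = p @ T_i[:3, :3].T + T_i[:3, 3]
    # project back onto the image plane of the initial frame
    u_q = fx * q[:, 0] / q[:, 2] + cx
    v_q = fy * q[:, 1] / q[:, 2] + cy
    inside = (q[:, 2] > 0) & (u_q >= 0) & (u_q < W) & (v_q >= 0) & (v_q < H)
    n_i = int(inside.sum())
    n_0 = int((Z0 > 0).sum())
    return n_i / n_0 if n_0 > 0 else 0.0
```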
FIG. 3 exemplarily shows a flowchart of visual-content-based segment fusion and registration.

The embodiment of the present invention adopts the visual-content-based automatic segmentation algorithm, which effectively reduces the accumulated error in visual odometry estimation and improves registration accuracy.
S130: performing weighted volumetric data fusion according to the processing result, thereby reconstructing the 3D model of the complete indoor scene.

Specifically, this step may include: fusing the depth image of each frame with a truncated signed distance function (TSDF) grid model according to the results of the visual-content-based block-wise fusion and registration, and representing the 3D space with a voxel grid, thereby obtaining the 3D model of the complete indoor scene.

This step may further include:

S131: performing weighted fusion of the TSDF data within the Volumetric Method framework, based on the noise characteristics and the region of interest.

S132: extracting a mesh model with the Marching Cubes algorithm.

In practical applications, the depth images of the frames can be fused with the TSDF grid model according to the visual odometry estimation results. The 3D space is represented with a voxel grid of resolution m, i.e., the 3D space is divided into m blocks, and each voxel v stores two values: the truncated signed distance function f_i(v) and its weight w_i(v).

The truncated signed distance function can be determined according to the following formula:

f_i(v) = [K^(-1) · z_i(u) · [u^T, 1]^T]_z − [v_i]_z

where f_i(v) denotes the truncated signed distance function, i.e., the distance from the voxel to the surface of the object model; its sign indicates whether the voxel lies on the occluded side of the surface or on the visible side, and the zero crossing corresponds to a point on the surface; K denotes the intrinsic parameter matrix of the camera; u denotes a pixel; z_i(u) denotes the depth value corresponding to pixel u; and v_i denotes the voxel. The camera may be a depth camera or a depth video camera.
The weighted data fusion can be performed according to the following formulas:

F(v) = ( Σ_{i=1..n} w_i(v) · f_i(v) ) / ( Σ_{i=1..n} w_i(v) )

W(v) = Σ_{i=1..n} w_i(v)

where f_i(v) and w_i(v) denote the truncated signed distance function (TSDF) of voxel v and its weight function, respectively; n is a positive integer; F(v) denotes the fused TSDF value of voxel v; and W(v) denotes the weight of the fused TSDF value of voxel v.

In the above embodiment, the weight function can be determined according to the noise characteristics of the depth data and the region of interest, and its value is not fixed. In order to preserve the geometric details of object surfaces, the weights of low-noise regions and regions of interest are set large, while the weights of high-noise regions or regions of no interest are set small.

Specifically, the weight function can be determined according to the following formula:

Figure PCTCN2017072257-appb-000022

where d_i denotes the radius of the region of interest (the smaller the radius, the higher the interest and the larger the weight); δ_s is the noise variance of the depth data, consistent with the variance of the spatial-domain kernel function of the adaptive bilateral filtering algorithm; and w is a constant, which preferably takes the value 1 or 0.
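A minimal sketch of the weighted TSDF update for a block of voxels is given below. The observation weight used here is only an assumed, illustrative form that down-weights distant (noisier) measurements and favors voxels near a region of interest; the filing's actual weight formula appears only as the image referenced above. Mesh extraction with Marching Cubes (S132) would subsequently run on the fused volume F.

```python
import numpy as np

def fuse_tsdf(F, W, f_new, w_new, trunc=0.05):
    """Running weighted average of TSDF values per voxel:
    F <- (W*F + w_new*f_new) / (W + w_new),  W <- W + w_new."""
    f_new = np.clip(f_new, -trunc, trunc)              # truncation of the signed distance
    W_out = W + w_new
    F_out = (W * F + w_new * f_new) / np.maximum(W_out, 1e-12)
    return F_out, W_out

def observation_weight(depth, roi_dist=0.0, w=1.0):
    """Illustrative weight (assumed form, not the filing's formula): larger for near,
    low-noise measurements and for voxels close to a region of interest."""
    return w / (1.0 + depth ** 2) * np.exp(-roi_dist)

# Example: fuse two observations of a small voxel block.
F = np.zeros((2, 2, 2)); W = np.zeros((2, 2, 2))
F, W = fuse_tsdf(F, W, np.full((2, 2, 2), 0.04), observation_weight(1.5))
F, W = fuse_tsdf(F, W, np.full((2, 2, 2), -0.01), observation_weight(0.8, roi_dist=0.1))
print(F, W)
```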
FIG. 4 exemplarily shows a schematic diagram of the weighted volumetric data fusion process.

The weighted volumetric data fusion algorithm adopted in the embodiment of the present invention effectively preserves the geometric details of object surfaces, yields a complete, accurate and refined indoor scene model, and has good robustness and extensibility.

FIG. 5a exemplarily shows the 3D reconstruction result obtained with an unweighted volumetric data fusion algorithm; FIG. 5b exemplarily shows local details of the 3D model in FIG. 5a; FIG. 5c exemplarily shows the 3D reconstruction result obtained with the weighted volumetric data fusion algorithm proposed in the embodiment of the present invention; FIG. 5d exemplarily shows local details of the 3D model in FIG. 5c.

FIG. 6 exemplarily shows 3D reconstruction results obtained with the method proposed in the embodiment of the present invention on the 3D Scene Data dataset; FIG. 7 exemplarily shows 3D reconstruction results on the Augmented ICL-NUIM Dataset; FIG. 8 exemplarily shows 3D reconstruction results on indoor scene data captured with a Microsoft Kinect for Windows.
It should be noted that although the embodiments of the present invention are described herein in the above order, those skilled in the art will understand that the present invention may also be implemented in an order different from the one described here, and such simple variations shall also fall within the protection scope of the present invention.
Based on the same technical concept as the method embodiments, an embodiment of the present invention further provides a system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera. As shown in FIG. 9, the system 90 includes: an acquisition module 92, a filtering module 94, a block-wise fusion and registration module 96 and a volumetric data fusion module 98. The acquisition module 92 is configured to acquire depth images. The filtering module 94 is configured to perform adaptive bilateral filtering on the depth images. The block-wise fusion and registration module 96 is configured to perform visual-content-based block-wise fusion and registration on the filtered depth images. The volumetric data fusion module 98 is configured to perform weighted volumetric data fusion according to the processing result, thereby reconstructing the 3D model of the complete indoor scene.

By adopting the above technical solution, the embodiment of the present invention effectively reduces the accumulated error in visual odometry estimation, improves registration accuracy, effectively preserves the geometric details of object surfaces, and yields a complete, accurate and refined indoor scene model.
In some embodiments, the filtering module is specifically configured to perform adaptive bilateral filtering according to the following formula:

Z~(u) = (1 / W) · Σ_{u_k ∈ Ω} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

where u and u_k denote any pixel of the depth image and a pixel in its neighborhood, respectively; Z(u) and Z(u_k) denote the depth values corresponding to u and u_k; Z~(u) denotes the corresponding filtered depth value; W denotes the normalization factor over the neighborhood Ω; and w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
In some embodiments, the block-wise fusion and registration module may be specifically configured to: segment the depth image sequence based on visual content, perform block-wise fusion on each segment, perform loop closure detection between segments, and globally optimize the loop closure detection results.

In other embodiments, the block-wise fusion and registration module may be further specifically configured to: segment the depth image sequence with the automatic segmentation method based on visual-content detection, group similar depth image content into one segment, perform block-wise fusion on each segment, determine the transformation relationships between the depth images, perform loop closure detection between segments according to the transformation relationships, and achieve global optimization.

In some preferred embodiments, the block-wise fusion and registration module may specifically include: a camera pose information acquisition unit, a segmentation unit, a registration unit and an optimization unit. The camera pose information acquisition unit is configured to perform visual odometry estimation with the Kintinuous framework to obtain the camera pose for each depth image frame. The segmentation unit is configured to back-project the point cloud data corresponding to each depth image frame into the initial coordinate system according to the camera pose, compare the similarity between the projected depth image and the depth image of the starting frame, and, when the similarity falls below the similarity threshold, re-initialize the camera pose and start a new segment. The registration unit is configured to extract the FPFH geometric descriptor from the point cloud data of each segment, perform coarse registration between every two segments, and perform fine registration with the GICP algorithm to obtain the matching relationship between segments. The optimization unit is configured to construct a graph using the pose information of each segment and the inter-segment matching relationships, and to perform graph optimization with the g2o framework to obtain the optimized camera trajectory, thereby achieving global optimization.

The above segmentation unit may specifically include: a computing unit, a judging unit, a segmentation subunit and a processing unit. The computing unit is configured to compute the similarity between each depth image frame and the first depth image frame. The judging unit is configured to determine whether the similarity is below the similarity threshold. The segmentation subunit is configured to segment the depth image sequence when the similarity is below the similarity threshold. The processing unit is configured to take the next depth image frame as the starting frame of the next segment, and to repeatedly invoke the computing unit and the judging unit until all depth image frames have been processed.

In some embodiments, the volumetric data fusion module may be specifically configured to fuse the depth image of each frame with the truncated signed distance function grid model according to the processing result, and to represent the 3D space with a voxel grid, thereby obtaining the 3D model of the complete indoor scene.

In some embodiments, the volumetric data fusion module may specifically include a weighted fusion unit and an extraction unit. The weighted fusion unit is configured to perform weighted fusion of the truncated signed distance function data within the Volumetric Method framework, based on the noise characteristics and the region of interest. The extraction unit is configured to extract a mesh model with the Marching Cubes algorithm, thereby obtaining the 3D model of the complete indoor scene.
The present invention is described in detail below with a preferred embodiment.

The system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera includes a capture module, a filtering module, a block-wise fusion and registration module and a volumetric data fusion module, wherein:

the capture module is configured to capture depth images of the indoor scene with the depth camera;

the filtering module is configured to perform adaptive bilateral filtering on the acquired depth images.

The capture module is an equivalent replacement of the above acquisition module. In practical applications, real indoor scene data can be captured with the handheld consumer-grade depth camera Microsoft Kinect for Windows. Adaptive bilateral filtering is then applied to the captured depth images, and the parameters of the adaptive bilateral filtering method are set automatically according to the noise characteristics of the depth camera and its intrinsic parameters. Therefore, the embodiment of the present invention can effectively remove noise while preserving edge information.

The block-wise fusion and registration module is configured to automatically segment the data stream based on visual content, perform block-wise fusion on each segment, perform loop closure detection between segments, and globally optimize the loop closure detection results.

The block-wise fusion and registration module performs visual-content-based automatic block-wise fusion and registration.

In a more preferred embodiment, the block-wise fusion and registration module specifically includes: a pose information acquisition module, a segmentation module, a coarse registration module, a fine registration module and an optimization module. The pose information acquisition module is configured to perform visual odometry estimation with the Kintinuous framework to obtain the camera pose for each depth image frame. The segmentation module is configured to back-project the point cloud data corresponding to each depth image frame into the initial coordinate system according to the camera pose, compare the similarity between the projected depth image and the depth image of the starting frame, and, if the similarity is below the similarity threshold, re-initialize the camera pose and start a new segment. The coarse registration module is configured to extract the FPFH geometric descriptor from the point cloud data of each segment and to perform coarse registration between every two segments; the fine registration module is configured to perform fine registration with the GICP algorithm to obtain the matching relationship between segments. The optimization module is configured to construct a graph using the pose information of each segment and the inter-segment matching relationships, and to perform graph optimization with the g2o framework.

Preferably, the above optimization module is further configured to apply the SLAC (Simultaneous Localization and Calibration) mode to correct non-rigid distortion, and to remove incorrect loop closure matches using line processes constraints.

The above block-wise fusion and registration module segments the RGBD data stream based on visual content, which both effectively alleviates the accumulated error produced by visual odometry estimation and fuses similar content together, thereby improving registration accuracy.

The volumetric data fusion module is configured to perform weighted volumetric data fusion according to the optimized camera trajectory to obtain the 3D model of the scene.

The volumetric data fusion module defines the weight function of the truncated signed distance function according to the noise characteristics of the depth camera and the region of interest, so as to preserve the geometric details of object surfaces.

Experiments on the system for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera show that the high-precision 3D reconstruction method based on a consumer-grade depth camera yields a complete, accurate and refined indoor scene model, and that the system has good robustness and extensibility.

The above system embodiment for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera can be used to execute the corresponding method embodiment. Their technical principles, the technical problems solved and the technical effects produced are similar, and reference can be made between them; for convenience and brevity of description, the parts that are identical across the embodiments are not repeated.

It should be noted that the system and method for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera provided by the above embodiments are illustrated only with the above division of functional modules, units or steps. For example, the aforementioned acquisition module may also serve as a capture module. In practical applications, the above functions may be assigned to different functional modules, units or steps as needed; that is, the modules, units or steps in the embodiments of the present invention may be further decomposed or combined. For example, the capture module and the filtering module may be combined into a data preprocessing module.

Heretofore, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the accompanying drawings. However, those skilled in the art will readily understand that the protection scope of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art may make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions shall fall within the protection scope of the present invention.

Claims (20)

1. A method for 3D reconstruction of a complete indoor scene based on a consumer-grade depth camera, wherein the method comprises:

acquiring a depth image;

performing adaptive bilateral filtering on the depth image;

performing visual-content-based block-wise fusion and registration on the filtered depth image;

performing weighted volumetric data fusion according to the processing result, thereby reconstructing a 3D model of the complete indoor scene.
2. The method according to claim 1, wherein performing adaptive bilateral filtering on the depth image specifically comprises:

performing adaptive bilateral filtering according to the following formula:

Z~(u) = (1 / W) · Σ_{u_k ∈ Ω} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

where u and u_k denote any pixel of the depth image and a pixel in its neighborhood, respectively; Z(u) and Z(u_k) denote the depth values corresponding to u and u_k; Z~(u) denotes the corresponding filtered depth value; W denotes the normalization factor over the neighborhood Ω; and w_s and w_c denote the Gaussian kernel functions for filtering in the spatial domain and the range domain, respectively.
3. The method according to claim 2, wherein the Gaussian kernel functions for filtering in the spatial domain and the range domain are determined according to the following formulas:

w_s(x) = exp(−x² / (2·δ_s²)),  w_c(x) = exp(−x² / (2·δ_c²))

where δ_s and δ_c are the variances of the Gaussian kernel functions in the spatial domain and the range domain, respectively;

wherein δ_s and δ_c are determined according to the following formula:

Figure PCTCN2017072257-appb-100005

where f denotes the focal length of the depth camera, and K_s and K_c denote constants.
4. The method according to claim 1, wherein performing visual-content-based block-wise fusion and registration on the filtered depth image specifically comprises: segmenting the depth image sequence based on visual content, performing block-wise fusion on each segment, performing loop closure detection between the segments, and globally optimizing the loop closure detection results.

5. The method according to claim 4, wherein segmenting the depth image sequence based on visual content, performing block-wise fusion on each segment, performing loop closure detection between the segments, and globally optimizing the loop closure detection results specifically comprises:

segmenting the depth image sequence with an automatic segmentation method based on visual-content detection, grouping similar depth image content into one segment, performing block-wise fusion on each segment, determining the transformation relationships between the depth images, and performing loop closure detection between segments according to the transformation relationships to achieve global optimization.
6. The method according to claim 5, wherein segmenting the depth image sequence with the automatic segmentation method based on visual-content detection, grouping similar depth image content into one segment, performing block-wise fusion on each segment, determining the transformation relationships between the depth images, and performing loop closure detection between segments according to the transformation relationships to achieve global optimization specifically comprises:

performing visual odometry estimation with the Kintinuous framework to obtain the camera pose for each depth image frame;

back-projecting the point cloud data corresponding to each depth image frame into the initial coordinate system according to the camera pose, comparing the similarity between the projected depth image and the depth image of the starting frame, and, when the similarity is below the similarity threshold, re-initializing the camera pose and starting a new segment;

extracting the FPFH geometric descriptor from the point cloud data of each segment, performing coarse registration between every two segments, and performing fine registration with the GICP algorithm to obtain the matching relationship between segments;

constructing a graph using the pose information of each segment and the inter-segment matching relationships, and performing graph optimization with the g2o framework to obtain the optimized camera trajectory, thereby achieving the global optimization.
7. The method according to claim 6, wherein back-projecting the point cloud data corresponding to each frame of the depth image into the initial coordinate system according to the camera pose information, comparing the similarity between the depth image obtained by the projection and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initializing the camera pose and starting a new segment specifically comprises:
Step 1: calculating the similarity between each frame of the depth image and the first frame of the depth image;
Step 2: determining whether the similarity is lower than the similarity threshold;
Step 3: if so, segmenting the depth image sequence;
Step 4: taking the depth image of the next frame as the starting frame depth image of the next segment, and repeating Step 1 and Step 2 until all frames of depth images have been processed.
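The four steps of claim 7 amount to a simple loop over the sequence; a minimal sketch follows, assuming a `compute_similarity` callable (defined as in claim 8) and an illustrative threshold value, neither of which is fixed by the claim.

```python
SIM_THRESHOLD = 0.7  # illustrative value; the claim only requires some threshold

def segment_depth_sequence(depth_frames, poses, compute_similarity):
    """Split a depth-image sequence into segments of similar visual content."""
    segments = []
    current = [0]          # frame indices of the current segment
    ref_idx = 0            # starting frame of the current segment
    for i in range(1, len(depth_frames)):
        sim = compute_similarity(depth_frames[ref_idx], depth_frames[i], poses[i])
        if sim < SIM_THRESHOLD:
            segments.append(current)   # Step 3: close the current segment
            ref_idx = i                # Step 4: next frame starts the next segment
            current = [i]              # (camera pose is re-initialized here)
        else:
            current.append(i)
    segments.append(current)
    return segments
```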
8. The method according to claim 7, wherein Step 1 specifically comprises:
calculating, according to the projection relationship and the depth value of any frame of the depth image, the first spatial three-dimensional point corresponding to each pixel of the depth image with the following formula:
p = π^{-1}(u_p, Z(u_p))
where u_p is any pixel of the depth image; Z(u_p) and p respectively denote the depth value corresponding to u_p and the first spatial three-dimensional point; and π denotes the projection relationship;
transforming the first spatial three-dimensional point into the world coordinate system by rotation and translation according to the following formula to obtain the second spatial three-dimensional point:
q = T_i p
where T_i denotes the rotation-translation matrix from the spatial three-dimensional points of the i-th frame depth image to the world coordinate system; p denotes the first spatial three-dimensional point; q denotes the second spatial three-dimensional point; and i is a positive integer;
back-projecting the second spatial three-dimensional point onto the two-dimensional image plane according to the following formula to obtain the projected depth image:
u_q = π(q) = (f_x·x_q/z_q + c_x, f_y·y_q/z_q + c_y)^T
where u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x and c_y denote the intrinsic parameters of the depth camera; x_q, y_q and z_q denote the coordinates of q; and T denotes matrix transposition;
counting the valid pixels in the starting frame depth image and in the projected depth image of any frame, respectively, and taking the ratio of the two counts as the similarity.
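The similarity measure of claim 8 can be sketched with NumPy as follows, assuming a pinhole camera whose intrinsics f_x, f_y, c_x, c_y are known, that the first frame defines the initial (world) coordinate system, and that a pixel is "valid" when its depth is positive; these are illustrative assumptions, not part of the claim.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """p = pi^{-1}(u_p, Z(u_p)): lift every pixel with valid depth to a 3D point."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[depth.reshape(-1) > 0]

def project(points, fx, fy, cx, cy, h, w):
    """u_q = (fx*x/z + cx, fy*y/z + cy)^T: render 3D points back into a depth image."""
    depth = np.zeros((h, w))
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = z > 0
    u = np.round(fx * x[keep] / z[keep] + cx).astype(int)
    v = np.round(fy * y[keep] / z[keep] + cy).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth[v[inside], u[inside]] = z[keep][inside]   # last write wins (sketch only)
    return depth

def similarity(first_depth, frame_depth, T_i, intrinsics):
    """Ratio of valid pixel counts after reprojecting frame i into the first frame."""
    fx, fy, cx, cy = intrinsics
    h, w = first_depth.shape
    p = backproject(frame_depth, fx, fy, cx, cy)
    q = (T_i[:3, :3] @ p.T).T + T_i[:3, 3]          # q = T_i p (world = first frame)
    reproj = project(q, fx, fy, cx, cy, h, w)
    return np.count_nonzero(reproj) / max(np.count_nonzero(first_depth), 1)
```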
9. The method according to claim 1, wherein performing weighted volume data fusion according to the processing result to reconstruct the three-dimensional model of the complete indoor scene specifically comprises: fusing the depth images of the frames with a truncated signed distance function (TSDF) grid model according to the processing result, and representing the three-dimensional space with a voxel grid, thereby obtaining the three-dimensional model of the complete indoor scene.
10. The method according to claim 9, wherein fusing the depth images of the frames with the truncated signed distance function grid model according to the processing result, and representing the three-dimensional space with a voxel grid to obtain the three-dimensional model of the complete indoor scene specifically comprises:
performing weighted fusion of the truncated signed distance function data within the Volumetric method framework, based on the noise characteristics and the region of interest;
extracting a Mesh model with the Marching cubes algorithm, thereby obtaining the three-dimensional model of the complete indoor scene.
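Assuming the fused TSDF values are stored in a dense NumPy volume, the Mesh extraction step can be illustrated with scikit-image's marching-cubes implementation (an illustrative choice; the claim only names the Marching cubes algorithm):

```python
import numpy as np
from skimage import measure

def extract_mesh(tsdf_volume, voxel_size, origin):
    """Extract the zero level set of a fused TSDF volume as a triangle mesh."""
    verts, faces, normals, _ = measure.marching_cubes(tsdf_volume, level=0.0)
    verts = verts * voxel_size + np.asarray(origin)   # voxel indices -> world coordinates
    return verts, faces, normals
```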
11. The method according to claim 9 or 10, wherein the truncated signed distance function is determined according to the following formula:
f_i(v) = [K^{-1} z_i(u) [u^T, 1]^T]_z − [v_i]_z
where f_i(v) denotes the truncated signed distance function, i.e. the distance from the grid cell to the surface of the object model, whose sign indicates whether the cell lies on the occluded side or the visible side of the surface, the zero crossing being a point on the surface; K denotes the intrinsic parameter matrix of the camera; u denotes a pixel; z_i(u) denotes the depth value corresponding to the pixel u; and v_i denotes a voxel.
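Reading the formula of claim 11 per voxel, a minimal evaluation routine might look as follows; the explicit truncation bound and the nearest-pixel lookup are assumptions added for illustration.

```python
import numpy as np

def tsdf_value(voxel_center, depth, K, T_world_to_cam, trunc=0.03):
    """f_i(v) = [K^{-1} z_i(u) [u^T,1]^T]_z - [v_i]_z, clipped to [-trunc, trunc]."""
    # Voxel centre expressed in the camera frame of view i
    v_cam = T_world_to_cam[:3, :3] @ voxel_center + T_world_to_cam[:3, 3]
    if v_cam[2] <= 0:
        return None                                   # behind the camera
    # Project the voxel centre to find the pixel u that observes it
    u_h = K @ (v_cam / v_cam[2])
    col, row = int(round(u_h[0])), int(round(u_h[1]))
    if not (0 <= row < depth.shape[0] and 0 <= col < depth.shape[1]):
        return None
    z = depth[row, col]
    if z <= 0:
        return None                                   # no depth measurement at u
    # Signed distance along the viewing ray: measured surface depth minus voxel depth
    return float(np.clip(z - v_cam[2], -trunc, trunc))
```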
12. The method according to claim 10, wherein the weighted data fusion is performed according to the following formulas:
F(v) = Σ_{i=1}^{n} w_i(v)·f_i(v) / Σ_{i=1}^{n} w_i(v),    W(v) = Σ_{i=1}^{n} w_i(v)
where v denotes a voxel; f_i(v) and w_i(v) respectively denote the truncated signed distance function corresponding to the voxel v and its weight function; n is a positive integer; F(v) denotes the truncated signed distance function value corresponding to the voxel v after fusion; and W(v) denotes the weight of the truncated signed distance function value corresponding to the voxel v after fusion;
where the weight function can be determined according to the following formula:
[Formula image PCTCN2017072257-appb-100008: definition of the weight function w_i(v)]
where d_i denotes the radius of the region of interest; δ_s is the noise variance in the depth data; and w is a constant.
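The fusion formulas of claim 12 are usually applied incrementally, frame by frame, as a running weighted average over the voxel grid; the sketch below assumes the per-frame weights w_i(v) have already been computed from the noise variance and region-of-interest radius mentioned above.

```python
import numpy as np

def fuse_frame(F, W, f_i, w_i):
    """Incremental form of F(v) = sum(w_i f_i) / sum(w_i) and W(v) = sum(w_i).

    F, W     accumulated TSDF values and weights over the voxel grid
    f_i, w_i TSDF values and weights contributed by the current frame
             (NaN in f_i marks voxels the frame does not observe)
    """
    seen = ~np.isnan(f_i)
    F_new, W_new = F.copy(), W.copy()
    W_new[seen] = W[seen] + w_i[seen]
    F_new[seen] = (F[seen] * W[seen] + f_i[seen] * w_i[seen]) / W_new[seen]
    return F_new, W_new
```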
13. A system for three-dimensional reconstruction of a complete indoor scene based on a consumer-grade depth camera, wherein the system comprises:
an acquisition module, configured to acquire depth images;
a filtering module, configured to perform adaptive bilateral filtering on the depth images;
a block fusion and registration module, configured to perform visual-content-based block fusion and registration processing on the filtered depth images; and
a volume data fusion module, configured to perform weighted volume data fusion according to the processing result, thereby reconstructing a three-dimensional model of the complete indoor scene.
14. The system according to claim 13, wherein the filtering module is specifically configured to:
perform adaptive bilateral filtering according to the following formula:
Ẑ(u) = (1/W) Σ_{u_k ∈ Ω} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)
where u and u_k respectively denote any pixel of the depth image and a pixel in its neighborhood; Z(u) and Z(u_k) respectively denote the depth values corresponding to u and u_k; Ẑ(u) denotes the corresponding depth value after filtering; W denotes the normalization factor over the neighborhood Ω; and w_s and w_c respectively denote the Gaussian kernel functions for filtering in the spatial domain and in the range domain.
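A direct (unoptimized) sketch of the bilateral filter recited in claim 14 is given below, with Gaussian kernels w_s and w_c of standard deviations sigma_s and sigma_c; in the adaptive variant these parameters would be derived from the depth-dependent noise model, which is not reproduced here, so the fixed values are assumptions.

```python
import numpy as np

def bilateral_filter_depth(Z, sigma_s=3.0, sigma_c=0.03, radius=4):
    """Edge-preserving smoothing of a depth map Z (invalid pixels have depth <= 0)."""
    h, w = Z.shape
    Z_hat = np.zeros_like(Z, dtype=np.float64)
    for r in range(h):
        for c in range(w):
            if Z[r, c] <= 0:
                continue                                  # skip invalid pixels
            r0, r1 = max(r - radius, 0), min(r + radius + 1, h)
            c0, c1 = max(c - radius, 0), min(c + radius + 1, w)
            patch = Z[r0:r1, c0:c1].astype(np.float64)
            ys, xs = np.mgrid[r0:r1, c0:c1]
            w_s = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma_s ** 2))
            w_c = np.exp(-((patch - Z[r, c]) ** 2) / (2 * sigma_c ** 2))
            weights = w_s * w_c * (patch > 0)             # valid neighbours only
            norm = weights.sum()
            if norm > 0:
                Z_hat[r, c] = (weights * patch).sum() / norm
    return Z_hat
```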
15. The system according to claim 13, wherein the block fusion and registration module is specifically configured to: segment the depth image sequence based on visual content, perform block-wise fusion on each segment, perform closed-loop detection between the segments, and globally optimize the results of the closed-loop detection.
16. The system according to claim 15, wherein the block fusion and registration module is further specifically configured to:
segment the depth image sequence with the automatic segmentation method based on visual content detection, group similar depth image content into one segment, perform block-wise fusion on each segment, determine the transformation relationship between the depth images, and perform closed-loop detection between segments according to the transformation relationship, so as to achieve global optimization.
17. The system according to claim 16, wherein the block fusion and registration module specifically comprises:
a camera pose information acquisition unit, configured to perform visual odometry estimation with the Kintinuous framework to obtain the camera pose information for each frame of the depth image;
a segmentation unit, configured to back-project the point cloud data corresponding to each frame of the depth image into the initial coordinate system according to the camera pose information, compare the similarity between the depth image obtained by the projection and the depth image of the initial frame, and, when the similarity is lower than the similarity threshold, initialize the camera pose and start a new segment;
a registration unit, configured to extract the FPFH geometric descriptor from the point cloud data of each segment, perform coarse registration between every two segments, and perform fine registration with the GICP algorithm to obtain the matching relationships between segments; and
an optimization unit, configured to construct a graph from the pose information of each segment and the matching relationships between segments, and perform graph optimization with the G2O framework to obtain optimized camera trajectory information, thereby achieving the global optimization.
18. The system according to claim 17, wherein the segmentation unit specifically comprises:
a calculation unit, configured to calculate the similarity between each frame of the depth image and the first frame of the depth image;
a determination unit, configured to determine whether the similarity is lower than the similarity threshold;
a segmentation subunit, configured to segment the depth image sequence when the similarity is lower than the similarity threshold; and
a processing unit, configured to take the depth image of the next frame as the starting frame depth image of the next segment, and to repeatedly invoke the calculation unit and the determination unit until all frames of depth images have been processed.
19. The system according to claim 13, wherein the volume data fusion module is specifically configured to: fuse the depth images of the frames with the truncated signed distance function grid model according to the processing result, and represent the three-dimensional space with a voxel grid, thereby obtaining the three-dimensional model of the complete indoor scene.
20. The system according to claim 19, wherein the volume data fusion module specifically comprises:
a weighted fusion unit, configured to perform weighted fusion of the truncated signed distance function data within the Volumetric method framework, based on the noise characteristics and the region of interest; and
an extraction unit, configured to extract a Mesh model with the Marching cubes algorithm, thereby obtaining the three-dimensional model of the complete indoor scene.
PCT/CN2017/072257 2017-01-23 2017-01-23 Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera WO2018133119A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/072257 WO2018133119A1 (en) 2017-01-23 2017-01-23 Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/072257 WO2018133119A1 (en) 2017-01-23 2017-01-23 Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera

Publications (1)

Publication Number Publication Date
WO2018133119A1 true WO2018133119A1 (en) 2018-07-26

Family

ID=62907634

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072257 WO2018133119A1 (en) 2017-01-23 2017-01-23 Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera

Country Status (1)

Country Link
WO (1) WO2018133119A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101860667A (en) * 2010-05-06 2010-10-13 中国科学院西安光学精密机械研究所 Method of quickly eliminating composite noise in images
CN102800127A (en) * 2012-07-18 2012-11-28 清华大学 Light stream optimization based three-dimensional reconstruction method and device
CN104599314A (en) * 2014-06-12 2015-05-06 深圳奥比中光科技有限公司 Three-dimensional model reconstruction method and system
CN104732492A (en) * 2015-03-09 2015-06-24 北京工业大学 Depth image denoising method
CN106169179A (en) * 2016-06-30 2016-11-30 北京大学 Image denoising method and image noise reduction apparatus

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109166171A (en) * 2018-08-09 2019-01-08 西北工业大学 Motion recovery structure three-dimensional reconstruction method based on global and incremental estimation
CN109166171B (en) * 2018-08-09 2022-05-13 西北工业大学 Motion recovery structure three-dimensional reconstruction method based on global and incremental estimation
CN110375765A (en) * 2019-06-28 2019-10-25 上海交通大学 Visual odometry method, system and storage medium based on direct method
CN110375765B (en) * 2019-06-28 2021-04-13 上海交通大学 Visual odometer method, system and storage medium based on direct method
CN110807789A (en) * 2019-08-23 2020-02-18 腾讯科技(深圳)有限公司 Image processing method, model, device, electronic equipment and readable storage medium
CN111260709A (en) * 2020-01-15 2020-06-09 浙江大学 Ground-assisted visual odometer method for dynamic environment
CN111260709B (en) * 2020-01-15 2022-04-19 浙江大学 Ground-assisted visual odometer method for dynamic environment
CN111524075A (en) * 2020-03-26 2020-08-11 北京迈格威科技有限公司 Depth image filtering method, image synthesis method, device, equipment and medium
CN111524075B (en) * 2020-03-26 2023-08-22 北京迈格威科技有限公司 Depth image filtering method, image synthesizing method, device, equipment and medium
CN115346002A (en) * 2022-10-14 2022-11-15 佛山科学技术学院 Virtual scene construction method and rehabilitation training application thereof
CN115346002B (en) * 2022-10-14 2023-01-17 佛山科学技术学院 Virtual scene construction method and rehabilitation training application thereof

Similar Documents

Publication Publication Date Title
CN106910242B (en) Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
CN111815757B (en) Large member three-dimensional reconstruction method based on image sequence
US11727587B2 (en) Method and system for scene image modification
CN109872397B (en) Three-dimensional reconstruction method of airplane parts based on multi-view stereo vision
Zach et al. A globally optimal algorithm for robust tv-l 1 range image integration
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN109242873B (en) Method for carrying out 360-degree real-time three-dimensional reconstruction on object based on consumption-level color depth camera
Li et al. Detail-preserving and content-aware variational multi-view stereo reconstruction
Hiep et al. Towards high-resolution large-scale multi-view stereo
CN111243093B (en) Three-dimensional face grid generation method, device, equipment and storage medium
Sibbing et al. Sift-realistic rendering
Hane et al. Class specific 3d object shape priors using surface normals
CN109961506A (en) A kind of fusion improves the local scene three-dimensional reconstruction method of Census figure
CN113178009B (en) Indoor three-dimensional reconstruction method utilizing point cloud segmentation and grid repair
Park et al. Translation-symmetry-based perceptual grouping with applications to urban scenes
Xu et al. Survey of 3D modeling using depth cameras
CN113674400A (en) Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
Nouduri et al. Deep realistic novel view generation for city-scale aerial images
CN112927251A (en) Morphology-based scene dense depth map acquisition method, system and device
Kim et al. Multi-view object extraction with fractional boundaries
Park et al. A tensor voting approach for multi-view 3D scene flow estimation and refinement
Cushen et al. Markerless real-time garment retexturing from monocular 3d reconstruction
Li et al. Multi-view stereo via depth map fusion: A coordinate decent optimization method
Nguyen et al. High resolution 3d content creation using unconstrained and uncalibrated cameras
Nicosevici et al. Efficient 3D scene modeling and mosaicing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17892520

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17892520

Country of ref document: EP

Kind code of ref document: A1