CN106910242B - Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera

Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera

Info

Publication number
CN106910242B
CN106910242B (application CN201710051366.5A)
Authority
CN
China
Prior art keywords
depth image
depth
frame
fusion
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710051366.5A
Other languages
Chinese (zh)
Other versions
CN106910242A (en)
Inventor
李建伟
高伟
吴毅红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201710051366.5A priority Critical patent/CN106910242B/en
Publication of CN106910242A publication Critical patent/CN106910242A/en
Application granted granted Critical
Publication of CN106910242B publication Critical patent/CN106910242B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
          • G06T5/00 Image enhancement or restoration
            • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
            • G06T5/70 Denoising; Smoothing
          • G06T2207/00 Indexing scheme for image analysis or image enhancement
            • G06T2207/20 Special algorithmic details
              • G06T2207/20024 Filtering details
                • G06T2207/20028 Bilateral filtering
              • G06T2207/20212 Image combination
                • G06T2207/20221 Image fusion; Image merging
            • G06T2207/30 Subject of image; Context of image processing
              • G06T2207/30244 Camera pose

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method and a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera. The method comprises the steps of obtaining a depth image and carrying out adaptive bilateral filtering; carrying out visual odometry estimation by utilizing the filtered depth image, automatically segmenting the image sequence based on visual content, carrying out closed-loop detection between segments, and carrying out global optimization; and performing weighted volume data fusion according to the optimized camera trajectory information so as to reconstruct an indoor complete scene three-dimensional model. The embodiment of the invention achieves edge-preserving denoising of the depth map through the adaptive bilateral filtering algorithm, effectively reduces the accumulated error in visual odometry estimation and improves registration precision through the automatic segmentation algorithm based on visual content, and effectively preserves the geometric details of object surfaces by adopting the weighted volume data fusion algorithm. The technical problem of how to improve three-dimensional reconstruction precision in indoor scenes is thereby solved, and a complete, accurate and refined indoor scene model can be obtained.

Description

Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera.
Background
High-precision three-dimensional reconstruction of an indoor scene is one of challenging research subjects in computer vision, and relates to theories and technologies in multiple fields of computer vision, computer graphics, pattern recognition, optimization and the like. There are many ways to realize three-dimensional reconstruction, and the traditional method adopts ranging sensors such as laser and radar or structured light technology to acquire the structural information of the scene or object surface for three-dimensional reconstruction, but most of the instruments are expensive and not easy to carry, so the application occasions are limited. With the development of computer vision technology, researchers have begun to research three-dimensional reconstruction using purely visual methods, and a great deal of useful research work has emerged.
With the release of the Microsoft Kinect, a consumer-grade depth camera, people can directly and conveniently reconstruct indoor scenes in three dimensions using depth data. The KinectFusion algorithm proposed by Newcombe et al. obtains depth information for each point in the image using the Kinect, aligns the coordinates of the three-dimensional points in the current frame's camera coordinate system with the coordinates in a global model through the Iterative Closest Point (ICP) algorithm to estimate the pose of the current frame's camera, and iteratively performs volume data fusion with a Truncated Signed Distance Function (TSDF) to obtain a dense three-dimensional model. Although the depth acquired by the Kinect is not affected by illumination conditions or texture richness, the depth range is only 0.5-4 m, and the position and size of the grid model are fixed, so the method is only suitable for local and static indoor scenes.
Three-dimensional reconstruction of indoor scenes based on a consumer-grade depth camera generally faces the following problems: (1) the depth images acquired by a consumer-grade depth camera are low in resolution and high in noise, so details on object surfaces are difficult to preserve, and the limited depth range means the data cannot be used directly for three-dimensional reconstruction of a complete scene; (2) accumulated errors produced by camera pose estimation can lead to erroneous and distorted three-dimensional models; (3) a consumer-grade depth camera is generally hand-held during shooting, the motion of the camera is arbitrary, and the quality of the captured data varies, which affects the reconstruction result.
To perform a complete three-dimensional reconstruction of an indoor scene, Whelan et al. proposed the Kintinuous algorithm, a further extension of KinectFusion. The algorithm solves the GPU-memory consumption of the grid model when reconstructing large scenes by cyclically reusing memory with a Shifting TSDF Volume, searches for matching key frames through DBoW for closed-loop detection, and finally optimizes the pose graph and the model, thereby obtaining a large-scene three-dimensional model. Choi et al. proposed the Elastic Fragments idea: the RGBD data stream is segmented every 50 frames, visual odometry estimation is performed on each segment separately, the geometric descriptor FPFH is extracted from the point cloud data between every two segments to search for matches for closed-loop detection, line processes constraints are introduced to optimize the detection result and remove wrong closed loops, and finally volume data fusion is performed using the optimized odometry information. Reconstruction of an indoor complete scene is realized through segmented processing and closed-loop detection, but the preservation of local geometric details of objects is not considered, and the fixed segmentation scheme is not robust when reconstructing real indoor scenes. Zeng et al. proposed the 3DMatch descriptor: the RGBD data stream is subjected to fixed segmentation and reconstructed to obtain local models, key points are extracted from each segmented 3D model as the input of a 3D convolutional network (ConvNet), the feature vectors learned by the network are fed to a metric network, and matching results are output through similarity comparison. Because deep networks have clear advantages in feature learning, geometric registration with 3DMatch can improve reconstruction accuracy compared with other descriptors. However, the method needs to perform local three-dimensional reconstruction first, carry out geometric registration with a deep learning network, and then output a global three-dimensional model, and network training requires a large amount of data, so the efficiency of the whole reconstruction pipeline is low.
In terms of improving three-dimensional reconstruction precision, Angela et al. proposed the VSBR algorithm, whose main idea is to hierarchically optimize the TSDF data using the Shape from Shading (SFS) technique before fusion, so as to address the loss of object surface details caused by over-smoothing during TSDF data fusion, thereby obtaining a relatively fine three-dimensional structure model. However, the method is only effective for single-object reconstruction under an ideal light source; for indoor scenes with large lighting variation, the accuracy improvement is not obvious.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the technical problem of how to improve the three-dimensional reconstruction accuracy in an indoor scene, a method and a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera are provided.
In order to achieve the above object, on one hand, the following technical solutions are provided:
a method for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera may include:
acquiring a depth image;
performing adaptive bilateral filtering on the depth image;
carrying out block fusion and registration processing based on visual contents on the filtered depth image;
and according to the processing result, performing weighted volume data fusion so as to reconstruct an indoor complete scene three-dimensional model.
Preferably, the performing adaptive bilateral filtering on the depth image specifically includes:
adaptive bilateral filtering is performed according to:

Ẑ(u) = (1/W) · Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

wherein u and u_k respectively represent any pixel on the depth image and a pixel in its neighborhood N(u); Z(u) and Z(u_k) respectively represent the depth values corresponding to u and u_k; Ẑ(u) represents the corresponding depth value after filtering; W represents the normalization factor over the neighborhood N(u); w_s and w_c represent the Gaussian kernel functions for filtering in the spatial domain and the value domain, respectively.
Preferably, the Gaussian kernel functions for filtering in the spatial domain and the value domain are determined according to the following equations:

w_s(‖u − u_k‖) = exp(−‖u − u_k‖² / (2δ_s²)),  w_c(|Z(u) − Z(u_k)|) = exp(−(Z(u) − Z(u_k))² / (2δ_c²))

wherein δ_s and δ_c are the variances of the spatial-domain and value-domain Gaussian kernel functions, respectively;
wherein δ_s and δ_c are determined according to the following formula:
Figure BDA0001217870140000034
wherein f represents the focal length of the depth camera, and K_s and K_c represent constants.
Preferably, the process of performing the visual content-based block fusion and registration on the filtered depth image specifically includes: and segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection.
Preferably, the segmenting the depth image sequence based on the visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection specifically includes:
segmenting a depth image sequence based on an automatic segmentation method for visual content detection, dividing similar depth image contents into segments, performing block fusion on each segment, determining a transformation relation between the depth images, and performing closed-loop detection between the segments according to the transformation relation so as to realize global optimization.
Preferably, the automatic segmentation method based on visual content detection is configured to segment a depth image sequence, divide similar depth image contents into one segment, perform block fusion on each segment, determine a transformation relationship between the depth images, and perform closed-loop detection between segments according to the transformation relationship, so as to implement global optimization, and specifically includes:
estimating visual odometry using the Kintinuous framework to obtain the camera pose information for each frame of the depth image;
according to the camera pose information, back projecting the point cloud data corresponding to each frame of depth image to an initial coordinate system, comparing the similarity of the depth image obtained after projection with the depth image of the initial frame, and initializing a camera pose and segmenting when the similarity is lower than a similarity threshold value;
extracting the FPFH geometric descriptor from each segment's point cloud data, performing coarse registration between every two segments, and performing fine registration using the GICP algorithm to obtain the matching relationship between the segments;
and constructing a graph by using the pose information of each segment and the matching relations between the segments, and performing graph optimization using the G2O framework to obtain optimized camera trajectory information, thereby realizing the global optimization.
Preferably, the back projecting the point cloud data corresponding to each frame of depth image to an initial coordinate system according to the camera pose information, comparing the similarity between the depth image obtained after projection and the depth image of the initial frame, and initializing a camera pose and segmenting when the similarity is lower than a similarity threshold, specifically comprising:
step 1: calculating the similarity between each frame of depth image and a first frame of depth image;
step 2: judging whether the similarity is lower than a similarity threshold value;
step 3: if yes, segmenting the depth image sequence;
step 4: taking the depth image of the next frame as the starting-frame depth image of the next segment, and repeatedly executing step 1 and step 2 until all frames of the depth image have been processed.
Preferably, the step 1 specifically includes:
according to the projection relation and the depth values of any frame of the depth image, calculating the first spatial three-dimensional point corresponding to each pixel on the depth image using the following formula:

p = π⁻¹(u_p, Z(u_p))

wherein u_p is any pixel on the depth image; Z(u_p) and p respectively represent the depth value corresponding to u_p and the first spatial three-dimensional point; π represents the projection relation;

and rotating and translating the first spatial three-dimensional point into the world coordinate system according to the following formula to obtain the second spatial three-dimensional point:

q = T_i · p

wherein T_i represents the rotation-translation matrix from the spatial three-dimensional points corresponding to the depth map of the i-th frame to the world coordinate system; p represents the first spatial three-dimensional point, and q represents the second spatial three-dimensional point; i is a positive integer;

and back-projecting the second spatial three-dimensional point onto the two-dimensional image plane according to the following formula to obtain the projected depth image:

u_q = [f_x · x_q / z_q + c_x, f_y · y_q / z_q + c_y]ᵀ

wherein u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x and c_y represent the intrinsic parameters of the depth camera; x_q, y_q, z_q represent the coordinates of q; T denotes the transpose of a matrix;

and respectively counting the number of valid pixels on the starting-frame depth image and on the projected depth image of any frame, and taking the ratio of the two as the similarity.
Preferably, performing weighted volume data fusion according to the processing result so as to reconstruct the indoor complete scene three-dimensional model specifically includes: according to the processing result, fusing each frame of the depth image using a truncated signed distance function grid model, and representing the three-dimensional space with a voxel grid, so as to obtain the indoor complete scene three-dimensional model.
Preferably, fusing each frame of the depth image using a truncated signed distance function grid model according to the processing result, and representing the three-dimensional space with a voxel grid so as to obtain the indoor complete scene three-dimensional model, specifically includes:
performing weighted fusion of the truncated signed distance function data using the volumetric method framework, based on the noise characteristics and the region of interest;
and extracting a Mesh model using the Marching Cubes algorithm, thereby obtaining the indoor complete scene three-dimensional model.
Preferably, the truncated signed distance function is determined according to:

f_i(v) = [K⁻¹ z_i(u) [uᵀ, 1]ᵀ]_z − [v_i]_z

wherein f_i(v) represents the truncated signed distance function, i.e. the distance from the grid cell to the object model surface, whose sign indicates whether the cell lies on the occluded side or the visible side of the surface, the zero-crossings being points on the surface; K represents the intrinsic parameter matrix of the camera; u represents a pixel; z_i(u) represents the depth value corresponding to the pixel u; v_i represents the voxel.
Preferably, the weighted data fusion is performed according to the following formulas:

F(v) = Σ_i w_i(v) · f_i(v) / Σ_i w_i(v),  W(v) = Σ_i w_i(v),  i = 1, …, n

wherein v represents a voxel; f_i(v) and w_i(v) respectively represent the truncated signed distance function corresponding to the voxel v and its weight function; n is a positive integer; F(v) represents the fused truncated signed distance function value corresponding to the voxel v; W(v) represents the weight of the fused truncated signed distance function value corresponding to the voxel v;
wherein the weight function may be determined according to the following equation:
Figure BDA0001217870140000062
wherein d_i represents the radius of the region of interest; δ_s is the noise variance in the depth data; W is a constant.
In order to achieve the above object, in another aspect, there is also provided a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera, the system including:
the acquisition module is used for acquiring a depth image;
the filtering module is used for carrying out self-adaptive bilateral filtering on the depth image;
the block fusion and registration module is used for carrying out block fusion and registration processing based on visual content on the filtered depth image;
and the volume data fusion module is used for carrying out weighted volume data fusion according to the processing result so as to reconstruct an indoor complete scene three-dimensional model.
Preferably, the filtering module is specifically configured to:
adaptive bilateral filtering is performed according to:

Ẑ(u) = (1/W) · Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

wherein u and u_k respectively represent any pixel on the depth image and a pixel in its neighborhood N(u); Z(u) and Z(u_k) respectively represent the depth values corresponding to u and u_k; Ẑ(u) represents the corresponding depth value after filtering; W represents the normalization factor over the neighborhood N(u); w_s and w_c represent the Gaussian kernel functions for filtering in the spatial domain and the value domain, respectively.
Preferably, the block fusion and registration module may be specifically configured to: and segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection.
Preferably, the block fusion and registration module may be further specifically configured to:
segmenting a depth image sequence based on an automatic segmentation method for visual content detection, dividing similar depth image contents into segments, performing block fusion on each segment, determining a transformation relation between the depth images, and performing closed-loop detection between the segments according to the transformation relation so as to realize global optimization.
Preferably, the block fusion and registration module specifically includes:
the camera pose information acquisition unit is used for estimating visual odometry using the Kintinuous framework to obtain the camera pose information for each frame of the depth image;
the segmentation unit is used for back projecting the point cloud data corresponding to each frame of depth image to an initial coordinate system according to the camera pose information, comparing the similarity of the depth image obtained after projection with the depth image of the initial frame, and initializing a camera pose for segmentation when the similarity is lower than a similarity threshold value;
the registration unit is used for extracting the FPFH geometric descriptor from each segment's point cloud data, performing coarse registration between every two segments, and performing fine registration using the GICP algorithm to obtain the matching relationship between the segments;
and the optimization unit is used for constructing a graph by using the pose information of each segment and the matching relations between the segments, and performing graph optimization using the G2O framework to obtain optimized camera trajectory information, so as to realize the global optimization.
Preferably, the segmentation unit specifically includes:
the calculating unit is used for calculating the similarity between each frame of depth image and the first frame of depth image;
the judging unit is used for judging whether the similarity is lower than a similarity threshold value;
a segmentation subunit, configured to segment the depth image sequence when the similarity is lower than a similarity threshold;
and the processing unit is used for taking the next frame depth image as the starting frame depth image of the next segmentation, and repeatedly executing the calculating unit and the judging unit until all the frame depth images are processed.
Preferably, the volume data fusion module is specifically configured to: according to the processing result, fuse each frame of the depth image using a truncated signed distance function grid model, and represent the three-dimensional space with a voxel grid, so as to obtain the indoor complete scene three-dimensional model.
Preferably, the volume data fusion module specifically includes:
the weighted fusion unit is used for performing weighted fusion of the truncated signed distance function data using the volumetric method framework, based on the noise characteristics and the region of interest;
and the extraction unit is used for extracting the Mesh model by adopting a Marching cubes algorithm so as to obtain the indoor complete scene three-dimensional model.
The embodiment of the invention provides a method and a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera. The method comprises the steps of obtaining a depth image; carrying out adaptive bilateral filtering on the depth image; carrying out visual content-based block fusion and registration processing on the filtered depth image; and performing weighted volume data fusion according to the processing result so as to reconstruct an indoor complete scene three-dimensional model. By performing visual content-based block fusion and registration on the depth images, the embodiment of the invention can effectively reduce the accumulated error in visual odometry estimation and improve registration precision, and by adopting the weighted volume data fusion algorithm it can effectively preserve the geometric details of object surfaces. The technical problem of how to improve three-dimensional reconstruction precision in indoor scenes is thereby solved, and a complete, accurate and refined indoor scene model can be obtained.
Drawings
Fig. 1 is a schematic flow chart of a method for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera according to an embodiment of the present invention;
FIG. 2a is a color image corresponding to a depth image according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of a point cloud derived from a depth image according to an embodiment of the invention;
FIG. 2c is a schematic diagram of a point cloud obtained by bilateral filtering of a depth image according to an embodiment of the present invention;
FIG. 2d is a schematic diagram of a point cloud obtained by performing adaptive bilateral filtering on a depth image according to an embodiment of the present invention;
FIG. 3 is a flow chart of visual content segmentation based fusion and registration according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a weighted volumetric data fusion process according to an embodiment of the present invention;
FIG. 5a is a schematic diagram of a three-dimensional reconstruction result using a non-weighted volumetric data fusion algorithm;
FIG. 5b is a schematic partial detail view of the three-dimensional model of FIG. 5 a;
fig. 5c is a schematic diagram of a three-dimensional reconstruction result obtained by the weighted volume data fusion algorithm according to the embodiment of the present invention;
FIG. 5d is a schematic partial detail view of the three-dimensional model of FIG. 5 c;
fig. 6 is a schematic diagram illustrating an effect of performing three-dimensional reconstruction on a 3D Scene Data set by using the method according to the embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating the effect of three-dimensional reconstruction performed on an Augmented ICL-NUIM Dataset by using the method proposed by the embodiment of the present invention according to the embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating an effect of three-dimensional reconstruction using indoor scene data collected by Microsoft Kinect for Windows according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The embodiment of the invention provides a method for carrying out indoor complete scene three-dimensional reconstruction based on a consumption-level depth camera. As shown in fig. 1, the method includes:
s100: a depth image is acquired.
Specifically, the step may include: acquiring depth images with a consumer-grade depth camera based on the structured-light principle.
A consumer-grade depth camera based on the structured-light principle (Microsoft Kinect for Windows and similar devices, hereinafter referred to as the depth camera) obtains the depth data of a depth image by emitting structured light and receiving the reflected information.
In practical applications, a handheld consumer-grade depth camera Microsoft Kinect for Windows can be used to collect real indoor scene data.
The depth data may be calculated according to the following equation:

Z = f · b / d

wherein Z represents the depth; f represents the focal length of the consumer-grade depth camera; b represents the baseline; and d represents the disparity.
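As an illustration of this triangulation relation, the following minimal NumPy sketch converts a disparity map to depth; the focal length, baseline, and the treatment of zero (invalid) disparities are assumed placeholder choices rather than values taken from the patent:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Structured-light triangulation: Z = f * b / d, with d = 0 treated as invalid."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0                       # zero disparity means no measurement
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example with assumed Kinect-like parameters (f ~ 580 px, baseline ~ 7.5 cm).
depth_map = disparity_to_depth(np.random.uniform(10, 60, (480, 640)), 580.0, 0.075)
```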
S110: and carrying out self-adaptive bilateral filtering on the depth image.
The step carries out self-adaptive bilateral filtering on the acquired depth image by using the noise characteristics of the consumer-grade depth camera based on the structured light principle.
The adaptive bilateral filtering algorithm is to perform filtering in both the spatial domain and the value domain of the depth image.
In practical application, parameters of the adaptive bilateral filtering algorithm can be set according to the noise characteristics and internal parameters of the depth camera, so that noise can be effectively removed and edge information is kept.
Taking the partial derivative of the depth Z with respect to the disparity d, the following relationship holds:

∂Z/∂d = −f · b / d² = −Z² / (f · b)

The noise in the depth data is mainly generated during quantization, and it can be seen from the above equation that the variance of the depth noise is proportional to the square of the depth value, i.e. the larger the depth value, the larger the noise. In order to effectively remove noise from the depth image, the embodiment of the invention defines the filtering algorithm based on this noise characteristic.
Specifically, the adaptive bilateral filtering may be performed according to the following formula:

Ẑ(u) = (1/W) · Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

wherein u and u_k respectively represent any pixel on the depth image and a pixel in its neighborhood N(u); Z(u) and Z(u_k) respectively represent the depth values corresponding to u and u_k; Ẑ(u) represents the corresponding depth value after filtering; W represents the normalization factor over the neighborhood N(u); w_s and w_c represent the Gaussian kernel functions for filtering in the spatial domain and the value domain, respectively.
In the above embodiment, w_s and w_c can be determined according to the following equations:

w_s(‖u − u_k‖) = exp(−‖u − u_k‖² / (2δ_s²)),  w_c(|Z(u) − Z(u_k)|) = exp(−(Z(u) − Z(u_k))² / (2δ_c²))

wherein δ_s and δ_c are the variances of the spatial-domain and value-domain Gaussian kernel functions, respectively.
The values of δ_s and δ_c are related to the magnitude of the depth value and are not fixed.
Specifically, in the above-described embodiment, δ_s and δ_c can be determined according to the following equation:
where f denotes the focal length of the depth camera, and K_s and K_c denote constants whose specific values are related to the parameters of the depth camera.
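Since the exact expressions for δ_s and δ_c appear only as an image in the original document, the following NumPy sketch treats the depth-dependent deviations as user-supplied functions; the example deviations at the bottom (growing linearly for the spatial term and quadratically for the value term) are illustrative assumptions consistent with the noise model above, not the patent's own parameter settings:

```python
import numpy as np

def adaptive_bilateral_filter(Z, sigma_s_fn, sigma_c_fn, radius=3):
    """Edge-preserving denoising of a depth map Z.

    sigma_s_fn / sigma_c_fn map a depth value to the spatial / value-domain
    Gaussian deviations, so both kernels adapt to the depth-dependent noise
    level; pixels with Z <= 0 are treated as invalid and left unchanged.
    """
    H, W = Z.shape
    out = Z.copy()
    for y in range(H):
        for x in range(W):
            z = Z[y, x]
            if z <= 0:
                continue
            ss, sc = sigma_s_fn(z), sigma_c_fn(z)
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            patch = Z[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            ws = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * ss ** 2))
            wc = np.exp(-((patch - z) ** 2) / (2.0 * sc ** 2))
            w = ws * wc * (patch > 0)          # ignore invalid neighbours
            out[y, x] = (w * patch).sum() / w.sum()
    return out

# Usage with assumed depth-dependent deviations (placeholders, not the patent's
# Ks/Kc formula): spatial deviation grows linearly, value-domain quadratically.
Z = np.random.uniform(0.5, 4.0, (120, 160)).astype(np.float32)
Z_filtered = adaptive_bilateral_filter(Z, lambda z: 1.0 + 0.5 * z, lambda z: 0.01 * z * z)
```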
Fig. 2a-d schematically show a comparison of the effect of different filtering algorithms. Wherein fig. 2a shows a color image corresponding to the depth image. Fig. 2b shows a point cloud obtained from a depth image. Fig. 2c shows the point cloud resulting from bilateral filtering of the depth image. Fig. 2d shows the point cloud resulting from adaptive bilateral filtering of the depth image.
The embodiment of the invention achieves edge-preserving denoising of the depth map by adopting the adaptive bilateral filtering method.
S120: and carrying out visual content-based block fusion and registration processing on the depth image.
This step segments the depth image sequence based on visual content, performs block fusion on each segment, performs closed-loop detection between segments, and performs global optimization on the closed-loop detection results. The depth image sequence here is the depth image data stream.
Preferably, this step may include: segmenting the depth image sequence with an automatic segmentation method based on visual content, dividing depth images with similar content into one segment, performing block fusion on each segment, determining the transformation relations between the depth images, performing closed-loop detection between the segments according to these transformation relations, and thereby realizing global optimization.
Further, the step may include:
s121: and (3) estimating a visual odometer by adopting a Kintinuous framework to obtain the camera pose information under each frame of depth image.
S122: and according to the camera pose information, back projecting the point cloud data corresponding to each frame of depth image to an initial coordinate system, comparing the similarity of the depth image obtained after projection with the depth image of the initial frame, and initializing the camera pose for segmentation when the similarity is lower than a similarity threshold value.
S123: extracting PFFH geometric descriptors in each segmented point cloud data, performing coarse registration between each two segments, and performing fine registration by adopting a GICP algorithm to obtain a matching relationship between the segments.
In the step, closed loop detection is carried out between the sections.
S124: and constructing a graph by using the pose information of each segment and the matching relation between the segments, and performing graph optimization by using a G2O frame to obtain optimized camera track information, thereby realizing global optimization.
In this step, the SLAC (Simultaneous Localization and Calibration) scheme is applied during optimization to correct non-rigid distortion, and line processes constraints are introduced to remove erroneous closed-loop matches.
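To make the graph construction in this step concrete, here is a minimal sketch of the pose-graph bookkeeping it implies; the class names and the switchable line-process weight on loop-closure edges are illustrative assumptions, and the actual non-linear optimization (G2O with SLAC) would be delegated to an external solver rather than reproduced here:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PoseNode:
    seg_id: int
    pose: np.ndarray              # 4x4 rigid transform of the segment in world coordinates

@dataclass
class PoseEdge:
    src: int
    dst: int
    relative: np.ndarray          # 4x4 relative transform from inter-segment registration
    line_process: float = 1.0     # switchable weight; driven towards 0 for wrong closures

@dataclass
class PoseGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_segments(self, segment_poses):
        """Create one node per segment and chain consecutive segments by odometry."""
        for i, T in enumerate(segment_poses):
            self.nodes.append(PoseNode(i, T))
            if i > 0:
                rel = np.linalg.inv(segment_poses[i - 1]) @ T
                self.edges.append(PoseEdge(i - 1, i, rel))

    def add_loop_closure(self, i, j, T_ij):
        """Edge from FPFH coarse + GICP fine registration between segments i and j."""
        self.edges.append(PoseEdge(i, j, T_ij))
```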
The step S122 may further specifically include:
s1221: and calculating the similarity of each frame of depth image and the first frame of depth image.
S1222: judging whether the similarity is lower than a similarity threshold, if so, executing a step S1223; otherwise, step S1224 is performed.
S1223: the sequence of depth images is segmented.
This step segments the depth image sequence based on visual content. It can thus effectively alleviate the accumulated error produced by visual odometry estimation and fuse similar content together, thereby improving registration accuracy.
S1224: the depth image sequence is not segmented.
S1225: the next frame depth image is taken as the starting frame depth image of the next segment, and step S1221 and step S1222 are repeatedly performed until all frame depth images are processed.
In the above embodiment, the step of calculating the similarity between each frame of depth image and the first frame of depth image may specifically include:
s12211: and calculating a first space three-dimensional point corresponding to each pixel on the depth image according to the projection relation and the depth value of any frame of depth image by using the following formula:
p = π⁻¹(u_p, Z(u_p))

wherein u_p is any pixel on the depth image; Z(u_p) and p respectively represent the depth value corresponding to u_p and the first spatial three-dimensional point; π represents the projection relation, i.e. the 2D-3D projective transformation by which the point cloud data corresponding to each frame of the depth image is back-projected to the initial coordinate system.
S12212: rotating and translating the first spatial three-dimensional point into the world coordinate system according to the following formula to obtain the second spatial three-dimensional point:

q = T_i · p

wherein T_i represents the rotation-translation matrix from the spatial three-dimensional points corresponding to the depth map of the i-th frame to the world coordinate system, which can be obtained from the visual odometry estimation; i is a positive integer; p represents the first spatial three-dimensional point and q represents the second spatial three-dimensional point, with coordinates respectively:

p = (x_p, y_p, z_p),  q = (x_q, y_q, z_q).

S12213: back-projecting the second spatial three-dimensional point onto the two-dimensional image plane according to the following formula to obtain the projected depth image:

u_q = [f_x · x_q / z_q + c_x, f_y · y_q / z_q + c_y]ᵀ

wherein u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x and c_y represent the intrinsic parameters of the depth camera; x_q, y_q, z_q represent the coordinates of q; T denotes the transpose of a matrix.
S12214: respectively counting the number of valid pixels on the starting-frame depth image and on the projected depth image of any frame, and taking the ratio of the two as the similarity.
For example, the similarity is calculated according to the following formula:

ρ = n_i / n_0

wherein n_0 and n_i respectively represent the number of valid pixels on the starting-frame depth image and on the projected depth image of any frame; ρ represents the similarity.
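The following NumPy sketch strings steps S12211-S12214 and the segmentation decision together; the similarity threshold, the helper names, and the convention that poses are camera-to-world matrices are illustrative assumptions:

```python
import numpy as np

def project_to_start_frame(depth_i, T_i, T_start, fx, fy, cx, cy):
    """Back-project frame i (pi^-1), move its points into the starting frame's camera,
    and re-render them as a depth image on that frame's 2D image plane."""
    H, W = depth_i.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth_i
    valid = z > 0
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z, np.ones_like(z)], axis=-1)
    T = np.linalg.inv(T_start) @ T_i              # frame-i camera -> starting-frame camera
    q = pts[valid] @ T.T
    q = q[q[:, 2] > 1e-6]                         # keep only points in front of the camera
    proj = np.zeros((H, W), dtype=np.float32)
    uq = np.round(fx * q[:, 0] / q[:, 2] + cx).astype(int)
    vq = np.round(fy * q[:, 1] / q[:, 2] + cy).astype(int)
    inside = (uq >= 0) & (uq < W) & (vq >= 0) & (vq < H)
    proj[vq[inside], uq[inside]] = q[inside, 2]
    return proj

def similarity(depth_start, depth_i_projected):
    """rho = n_i / n_0: valid pixels after projection over valid pixels in the start frame."""
    n0 = np.count_nonzero(depth_start > 0)
    ni = np.count_nonzero(depth_i_projected > 0)
    return ni / max(n0, 1)

def segment_sequence(depths, poses, intrinsics, threshold=0.6):
    """Close the current segment whenever the overlap with its start frame drops too low."""
    fx, fy, cx, cy = intrinsics
    segments, start = [], 0
    for i in range(1, len(depths)):
        proj = project_to_start_frame(depths[i], poses[i], poses[start], fx, fy, cx, cy)
        if similarity(depths[start], proj) < threshold:
            segments.append((start, i))
            start = i                              # the next frame starts a new segment
    segments.append((start, len(depths)))
    return segments
```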
Fig. 3 exemplarily shows a flow diagram of fusion, registration based on visual content segmentation.
By adopting the automatic segmentation algorithm based on visual content, the embodiment of the invention can effectively reduce the accumulated error in visual odometry estimation and improve the registration precision.
S130: and according to the processing result, performing weighted volume data fusion so as to reconstruct an indoor complete scene three-dimensional model.
Specifically, the step may include: and according to the block fusion and registration processing result based on the visual content, fusing the depth image of each frame by using a Truncated Symbolic Distance Function (TSDF) grid model, and representing a three-dimensional space by using a voxel grid, thereby obtaining the indoor complete scene three-dimensional model.
The step may further comprise:
s131: and performing weighted fusion on truncated symbol distance function data by using a Volumetric method framework based on the noise characteristic and the interest region.
S132: and (4) extracting the Mesh model by adopting a Marching cubes algorithm.
In practical application, according to the visual odometry estimation result, each frame of the depth image is fused using the TSDF grid model, and the three-dimensional space is represented by a voxel grid with a resolution of m, i.e. the three-dimensional space is divided into m blocks, and each grid cell v stores two values: the truncated signed distance function f_i(v) and its weight w_i(v).
The truncated signed distance function may be determined according to:

f_i(v) = [K⁻¹ z_i(u) [uᵀ, 1]ᵀ]_z − [v_i]_z

wherein f_i(v) represents the truncated signed distance function, i.e. the distance from the grid cell to the object model surface; its sign indicates whether the cell lies on the occluded side or the visible side of the surface, and the zero-crossings are points on the surface; K represents the intrinsic parameter matrix of the camera; u represents a pixel; z_i(u) represents the depth value corresponding to the pixel u; v_i represents the voxel. The camera here may be a depth camera or a depth camcorder.
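A minimal sketch of evaluating f_i(v) for a single voxel against one depth frame follows; the truncation band and the world-to-camera pose convention are illustrative assumptions:

```python
import numpy as np

def tsdf_at_voxel(voxel_world, depth, K, T_cw, trunc=0.03):
    """f_i(v) = [K^-1 z_i(u) [u^T,1]^T]_z - [v_i]_z: measured depth at the voxel's
    projection minus the voxel's depth in camera coordinates, clamped to +/- trunc."""
    v_cam = (T_cw @ np.append(voxel_world, 1.0))[:3]   # voxel in camera coordinates
    if v_cam[2] <= 0:
        return None                                    # behind the camera
    u = K @ (v_cam / v_cam[2])                         # perspective projection
    px, py = int(round(u[0])), int(round(u[1]))
    H, W = depth.shape
    if not (0 <= px < W and 0 <= py < H) or depth[py, px] <= 0:
        return None                                    # outside the image or invalid depth
    sdf = depth[py, px] - v_cam[2]
    return float(np.clip(sdf, -trunc, trunc))
```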
The weighted data fusion may be performed according to the following formulas:

F(v) = Σ_i w_i(v) · f_i(v) / Σ_i w_i(v),  W(v) = Σ_i w_i(v),  i = 1, …, n

wherein f_i(v) and w_i(v) respectively represent the Truncated Signed Distance Function (TSDF) corresponding to the voxel v and its weight function; n is a positive integer; F(v) represents the fused truncated signed distance function value corresponding to the voxel v; W(v) represents the weight of the fused truncated signed distance function value corresponding to the voxel v.
In the above embodiment, the weight function may be determined according to the noise characteristics of the depth data and the region of interest, and its value is not fixed. In order to preserve the geometric details of object surfaces, the weights of low-noise regions and of the region of interest are set large, while the weights of high-noise regions or regions outside the region of interest are set small.
Specifically, the weight function may be determined according to the following equation:
Figure BDA0001217870140000142
wherein d isiThe radius of the interest area is represented, and the smaller the radius is, the more interest is represented, and the weight is larger; deltasThe noise variance in the depth data is consistent with the variance of a kernel function in a spatial domain of the self-adaptive bilateral filtering algorithm in value; w is a constant, which may preferably take the value 1 or 0.
Fig. 4 exemplarily shows a weighted volumetric data fusion process diagram.
By adopting the weighted volume data fusion algorithm, the embodiment of the invention can effectively preserve the geometric details of object surfaces and obtain a complete, accurate and refined indoor scene model, with good robustness and extensibility.
FIG. 5a schematically shows the result of a three-dimensional reconstruction using a non-weighted volumetric data fusion algorithm; FIG. 5b schematically shows a partial detail of the three-dimensional model of FIG. 5 a; FIG. 5c is a schematic diagram illustrating a three-dimensional reconstruction result obtained by using a weighted volume data fusion algorithm proposed by an embodiment of the present invention; fig. 5d schematically shows a partial detail of the three-dimensional model in fig. 5 c.
Fig. 6 is a schematic diagram illustrating an effect of three-dimensional reconstruction on a 3D Scene Data set by using the method proposed by the embodiment of the present invention; FIG. 7 is a schematic diagram illustrating the effect of three-dimensional reconstruction using the method proposed by the embodiment of the present invention on an Augmented ICL-NUIM Dataset; fig. 8 exemplarily shows an effect diagram of three-dimensional reconstruction using indoor scene data acquired by Microsoft Kinect for Windows.
It should be noted that although the embodiments of the present invention have been described herein in the above order, those skilled in the art will appreciate that the present invention may be practiced in other than the order described, and that such simple variations are intended to be within the scope of the present invention.
Based on the same technical concept as the method embodiment, an embodiment of the present invention further provides a system for performing indoor complete scene three-dimensional reconstruction based on a consumer-level depth camera, as shown in fig. 9, where the system 90 includes: an acquisition module 92, a filtering module 94, a block fusion and registration module 96, and a volume data fusion module 98. The obtaining module 92 is configured to obtain a depth image. The filtering module 94 is configured to perform adaptive bilateral filtering on the depth image. The block fusion and registration module 96 is configured to perform a visual content-based block fusion and registration process on the filtered depth image. And the volume data fusion module 98 is used for performing weighted volume data fusion according to the processing result so as to reconstruct an indoor complete scene three-dimensional model.
By adopting the technical scheme, the embodiment of the invention can effectively reduce the accumulated error in the estimation of the visual odometer, improve the registration precision, effectively keep the geometric details of the surface of the object and obtain a complete, accurate and refined indoor scene model.
In some embodiments, the filtering module is specifically configured to perform adaptive bilateral filtering according to:

Ẑ(u) = (1/W) · Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)

wherein u and u_k respectively represent any pixel on the depth image and a pixel in its neighborhood N(u); Z(u) and Z(u_k) respectively represent the depth values corresponding to u and u_k; Ẑ(u) represents the corresponding depth value after filtering; W represents the normalization factor over the neighborhood N(u); w_s and w_c represent the Gaussian kernel functions for filtering in the spatial domain and the value domain, respectively.
In some embodiments, the block fusion and registration module may be specifically configured to: segment the depth image sequence based on visual content, perform block fusion on each segment, perform closed-loop detection between the segments, and perform global optimization on the results of the closed-loop detection.
In other embodiments, the block fusion and registration module may be further specifically configured to: segment the depth image sequence with an automatic segmentation method based on visual content detection, divide depth images with similar content into one segment, perform block fusion on each segment, determine the transformation relations between the depth images, and perform closed-loop detection between segments according to these transformation relations, thereby realizing global optimization.
In some preferred embodiments, the block fusion and registration module may specifically include: a camera pose information acquisition unit, a segmentation unit, a registration unit and an optimization unit. The camera pose information acquisition unit is used for estimating visual odometry using the Kintinuous framework to obtain the camera pose information for each frame of the depth image. The segmentation unit is used for back-projecting the point cloud data corresponding to each frame of the depth image to the initial coordinate system according to the camera pose information, comparing the similarity of the projected depth image with the starting-frame depth image, and, when the similarity is lower than the similarity threshold, initializing the camera pose and starting a new segment. The registration unit is used for extracting the FPFH geometric descriptor from each segment's point cloud data, performing coarse registration between every two segments, and performing fine registration with the GICP algorithm to obtain the matching relationships between segments. The optimization unit is used for constructing a graph by using the pose information of each segment and the matching relations between segments, and performing graph optimization with the G2O framework to obtain optimized camera trajectory information, thereby realizing global optimization.
The segmentation unit may specifically include: the device comprises a calculating unit, a judging unit, a segmenting subunit and a processing unit. The calculating unit is used for calculating the similarity between each frame of depth image and the first frame of depth image. The judging unit is used for judging whether the similarity is lower than a similarity threshold value. The segmentation subunit is configured to segment the depth image sequence when the similarity is lower than a similarity threshold. The processing unit is used for taking the depth image of the next frame as the depth image of the starting frame of the next segment, and repeatedly executing the calculating unit and the judging unit until all the depth images of the frames are processed.
In some embodiments, the volume data fusion module may be specifically configured to fuse the depth images of the frames by using a truncated symbolic distance function mesh model according to the processing result, and use a voxel mesh to represent a three-dimensional space, so as to obtain a three-dimensional model of an indoor complete scene.
In some embodiments, the volume data fusion module may specifically include a weighted fusion unit and an extraction unit. The weighted fusion unit is used for performing weighted fusion on truncated symbol distance function data by using a Volumetric method frame based on noise characteristics and interest areas. The extraction unit is used for extracting the Mesh model by adopting a Marching cubes algorithm so as to obtain an indoor complete scene three-dimensional model.
The invention is explained in more detail below with reference to a preferred embodiment.
The system for performing indoor complete scene three-dimensional reconstruction based on the consumer-grade depth camera comprises an acquisition module, a filtering module, a block fusion and registration module and a volume data fusion module. Wherein:
the acquisition module is used for acquiring depth images of indoor scenes by using the depth camera.
The filtering module is used for carrying out self-adaptive bilateral filtering processing on the acquired depth image.
The acquisition module here is an equivalent replacement of the acquisition module described above. In practical applications, a handheld consumer-grade depth camera (Microsoft Kinect for Windows) can be used to collect real indoor scene data. The acquired depth image is then subjected to adaptive bilateral filtering, with the parameters of the adaptive bilateral filtering method set automatically according to the noise characteristics and intrinsic parameters of the depth camera, so that the embodiment of the invention can effectively remove noise while preserving edge information.
The block fusion and registration module is used for automatically segmenting the data stream based on visual content, carrying out block fusion on each segment, carrying out closed-loop detection between segments and carrying out global optimization on the result of the closed-loop detection.
The block fusion and registration module performs automatic block fusion and registration based on visual contents.
In a more preferred embodiment, the block fusion and registration module specifically includes: a pose information acquisition module, a segmentation module, a coarse registration module, a fine registration module and an optimization module. The pose information acquisition module is used for estimating visual odometry using the Kintinuous framework to obtain the camera pose information for each frame of the depth image. The segmentation module is used for back-projecting the point cloud data corresponding to each frame of the depth image to the initial coordinate system according to the camera pose information, comparing the similarity of the projected depth image with the starting-frame depth image, and, if the similarity is lower than the similarity threshold, initializing the camera pose and starting a new segment. The coarse registration module is used for extracting the FPFH geometric descriptor from each segment's point cloud data and performing coarse registration between every two segments; the fine registration module is used for performing fine registration with the GICP algorithm to acquire the matching relations between segments. The optimization module is used for constructing a graph from the pose information of each segment and the matching relations between segments, and optimizing the graph with the G2O framework.
Preferably, the optimization module is further configured to apply the SLAC (Simultaneous Localization and Calibration) scheme to correct non-rigid distortion and to remove erroneous closed-loop matches using line processes constraints.
The block fusion and registration module processes the RGBD data stream in segments based on visual content, which effectively alleviates the accumulated error produced by visual odometry estimation and fuses similar content together, thereby improving registration precision.
And the volume data fusion module is used for performing weighted volume data fusion according to the optimized camera track information to obtain a three-dimensional model of the scene.
The volume data fusion module defines a weight function of a truncated symbol distance function according to the noise characteristics of the depth camera and the region of interest to realize the retention of the geometric details of the object surface.
Experiments on the system for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera show that the high-precision three-dimensional reconstruction method based on a consumer-grade depth camera can obtain a complete, accurate and refined indoor scene model, and that the system has good robustness and extensibility.
The system embodiment for performing indoor complete scene three-dimensional reconstruction based on the consumer-grade depth camera can be used for executing the method embodiment for performing indoor complete scene three-dimensional reconstruction based on the consumer-grade depth camera, and the technical principle, the solved technical problems and the generated technical effects are similar and can be mutually referred; for convenience and brevity of description, the same parts as those described in the respective embodiments are omitted.
It should be noted that, for the system and the method for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera provided in the foregoing embodiments, the division into the above functional modules, units or steps is only an example; in practical application, the above functions may be allocated to different functional modules, units or steps as needed, i.e. the modules, units or steps in the embodiments of the present invention may be decomposed or recombined. For example, the acquisition module and the filtering module may be combined into a data preprocessing module.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (15)

1. A method for performing indoor complete scene three-dimensional reconstruction based on a consumer-grade depth camera, the method comprising:
acquiring a depth image;
performing adaptive bilateral filtering on the depth image;
carrying out block fusion and registration processing based on visual contents on the filtered depth image;
according to the processing result, performing weighted volume data fusion so as to reconstruct an indoor complete scene three-dimensional model;
the performing weighted volume data fusion according to the processing result, so as to reconstruct an indoor complete-scene three-dimensional model, specifically comprises: according to the processing result, fusing each frame of the depth image by using a truncated signed distance function grid model, and representing the three-dimensional space by a voxel grid, so as to obtain the indoor complete-scene three-dimensional model;
the truncated signed distance function is determined according to:
f_i(v) = [ K^{-1} z_i(u) [u^T, 1]^T ]_z − [ v_i ]_z
wherein f_i(v) represents the truncated signed distance function, i.e. the distance from the voxel grid to the surface of the object model, its sign indicating whether the grid lies on the occluded side or the visible side of the surface, with zero-crossings corresponding to points on the surface; K represents the intrinsic parameter matrix of the camera; u represents a pixel; z_i(u) represents the depth value corresponding to the pixel u; and v_i represents the voxel.
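As a concrete reading of the formula in claim 1, the following minimal sketch evaluates the truncated signed distance for one voxel. It assumes the voxel has already been transformed into the camera frame, that K is a 3x3 pinhole intrinsic matrix, and that depth values are metric; the function name and the truncation threshold are illustrative, not taken from the patent.

```python
from typing import Optional
import numpy as np

def tsdf_value(voxel_cam: np.ndarray, depth: np.ndarray, K: np.ndarray,
               trunc: float = 0.03) -> Optional[float]:
    """Projective truncated signed distance for a voxel given in camera coordinates.

    Projects the voxel into the depth image, reads the measured depth z_i(u),
    and returns the truncated difference between measured depth and voxel depth:
    positive on the visible side of the surface, negative on the occluded side.
    """
    x, y, z = voxel_cam
    if z <= 0:
        return None                      # voxel lies behind the camera
    u = K @ np.array([x, y, z])
    px, py = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
    h, w = depth.shape
    if not (0 <= px < w and 0 <= py < h):
        return None                      # projects outside the image
    z_measured = float(depth[py, px])
    if z_measured <= 0:
        return None                      # invalid depth measurement
    # [K^-1 z_i(u) [u, 1]^T]_z equals z_i(u); [v_i]_z is the voxel's depth z.
    return float(np.clip(z_measured - z, -trunc, trunc))
```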
2. The method according to claim 1, wherein the adaptive bilateral filtering of the depth image specifically comprises:
adaptive bilateral filtering is performed according to:
Z̃(u) = (1 / W) Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)
wherein u and u_k respectively represent a pixel on the depth image and a pixel in its neighbourhood; Z(u) and Z(u_k) represent the depth values corresponding to u and u_k; Z̃(u) represents the filtered depth value; W is the normalization factor over the neighbourhood N(u); and w_s and w_c represent the Gaussian kernels for filtering in the spatial domain and the value domain, respectively.
3. The method of claim 2, wherein the Gaussian kernel functions for spatial-domain and value-domain filtering are determined according to the following equations:
w_s(x) = exp(−x² / (2δ_s²)),  w_c(x) = exp(−x² / (2δ_c²))
wherein δ_s and δ_c are the variances of the spatial-domain and value-domain Gaussian kernel functions, respectively;
wherein δ_s and δ_c are determined according to a formula (given as an equation image in the original publication) in which f represents the focal length of the depth camera, and K_s and K_c represent constants.
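Claims 2 and 3 together define the adaptive bilateral filter. The sketch below is a minimal implementation; because the adaptation formula for δ_s and δ_c is given only by reference, the per-pixel kernel widths are passed in as callables, and the example rules shown in the final comment are assumptions rather than the patent's formula.

```python
import numpy as np

def adaptive_bilateral_filter(Z, sigma_s_of_z, sigma_c_of_z, radius=2):
    """Adaptive bilateral filtering of a depth map.

    Z             : depth image (0 marks invalid pixels).
    sigma_s_of_z  : callable returning the spatial-kernel width for a depth value.
    sigma_c_of_z  : callable returning the range-kernel width for a depth value.
    """
    h, w = Z.shape
    out = np.zeros_like(Z, dtype=float)
    for y in range(h):
        for x in range(w):
            z = Z[y, x]
            if z <= 0:
                continue
            ss, sc = sigma_s_of_z(z), sigma_c_of_z(z)
            acc, weight = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    zk = Z[yy, xx]
                    if zk <= 0:
                        continue
                    ws = np.exp(-(dx * dx + dy * dy) / (2.0 * ss * ss))  # spatial kernel
                    wc = np.exp(-((z - zk) ** 2) / (2.0 * sc * sc))      # range kernel
                    acc += ws * wc * zk
                    weight += ws * wc
            out[y, x] = acc / weight if weight > 0 else z
    return out

# Example with hypothetical adaptation rules (the depth dependence and the
# constants are illustrative assumptions, not the patent's formula):
# filtered = adaptive_bilateral_filter(Z, lambda z: 2.0 + z, lambda z: 0.01 * z * z)
```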
4. The method according to claim 1, wherein the visual-content-based block fusion and registration of the filtered depth image specifically comprises: segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection.
5. The method according to claim 4, wherein segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection specifically comprises:
segmenting the depth image sequence with an automatic segmentation method based on visual content detection, grouping frames with similar depth image content into the same segment, performing block fusion on each segment, determining the transformation relationships between the depth images, and performing closed-loop detection between the segments according to the transformation relationships, so as to realize the global optimization.
6. The method according to claim 5, wherein the automatic segmentation method based on visual content detection is configured to segment a depth image sequence, segment similar depth image contents into one segment, perform block fusion on each segment, determine a transformation relationship between the depth images, and perform closed-loop detection between segments according to the transformation relationship, so as to implement global optimization, and specifically includes:
estimating visual odometry using the Kintinuous framework to obtain the camera pose for each frame of the depth image sequence;
according to the camera pose information, back-projecting the point cloud data corresponding to each frame of the depth image into the initial coordinate system, comparing the similarity between the projected depth image and the depth image of the starting frame, and, when the similarity is lower than a similarity threshold, re-initializing the camera pose and starting a new segment;
extracting PFFH geometric descriptors from the point cloud data of each segment, performing coarse registration between every two segments, and performing fine registration with the GICP algorithm to obtain the matching relationships between segments;
and constructing a graph from the pose information of each segment and the matching relationships between segments, and performing graph optimization with the G2O framework to obtain the optimized camera trajectory, thereby realizing the global optimization.
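Claim 6 assembles per-segment poses (odometry) and inter-segment matches (loop closures) into a graph that is then optimized globally. The sketch below only shows how such a graph could be represented; the container and helper names are assumptions, and the actual optimization would be delegated to a solver such as G2O.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PoseGraph:
    """Nodes are per-segment camera poses; edges are relative-pose constraints."""
    poses: list = field(default_factory=list)   # 4x4 world-from-segment transforms
    edges: list = field(default_factory=list)   # (i, j, T_ij, is_loop_closure)

    def add_segment(self, T_world_seg: np.ndarray) -> int:
        self.poses.append(T_world_seg)
        return len(self.poses) - 1

    def add_odometry_edge(self, i: int, j: int) -> None:
        # Relative pose implied by the current estimates of consecutive segments.
        T_ij = np.linalg.inv(self.poses[i]) @ self.poses[j]
        self.edges.append((i, j, T_ij, False))

    def add_loop_closure(self, i: int, j: int, T_ij: np.ndarray) -> None:
        # T_ij would come from descriptor-based coarse registration refined by GICP.
        self.edges.append((i, j, T_ij, True))

# The assembled nodes and edges would then be handed to a graph optimizer
# (e.g. G2O) to obtain the optimized camera trajectory.
```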
7. The method according to claim 6, wherein back-projecting the point cloud data corresponding to each frame of the depth image into the initial coordinate system according to the camera pose information, comparing the similarity between the projected depth image and the depth image of the starting frame, and re-initializing the camera pose and starting a new segment when the similarity is lower than the similarity threshold, specifically comprises:
Step 1: calculating the similarity between each frame of the depth image and the starting frame of the depth image;
Step 2: judging whether the similarity is lower than the similarity threshold;
Step 3: if so, segmenting the depth image sequence at the current frame;
Step 4: taking the next frame of the depth image as the starting frame of the next segment, and repeating Step 1 and Step 2 until all frames of the depth image have been processed.
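The four steps above amount to a simple segmentation loop. A minimal sketch follows, assuming a `frame_similarity` function of the kind detailed in claim 8 and an illustrative threshold value:

```python
def segment_sequence(depth_frames, poses, frame_similarity, threshold=0.65):
    """Split a depth sequence into segments of visually similar content.

    depth_frames     : list of depth images.
    poses            : list of camera poses, one per frame.
    frame_similarity : callable(start_frame, start_pose, frame, pose) -> float in [0, 1].
    threshold        : similarity below which a new segment is started (illustrative value).
    """
    segments, current, start = [], [0], 0
    for i in range(1, len(depth_frames)):
        s = frame_similarity(depth_frames[start], poses[start],
                             depth_frames[i], poses[i])
        if s < threshold:
            segments.append(current)      # close the current segment
            start, current = i, [i]       # the next frame starts the next segment
        else:
            current.append(i)
    segments.append(current)
    return segments
```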
8. The method according to claim 7, wherein the step 1 specifically comprises:
according to the projection relation and the depth value of any frame of depth image, calculating a first space three-dimensional point corresponding to each pixel on the depth image by using the following formula:
p = π^{-1}(u_p, Z(u_p))
wherein u_p is any pixel on the depth image; Z(u_p) and p respectively represent the depth value corresponding to u_p and the first spatial three-dimensional point; and π represents the projection relation;
applying the following rotation-translation transform to the first spatial three-dimensional point to bring it into the world coordinate system, thereby obtaining a second spatial three-dimensional point:
q = T_i p
wherein T_i represents the rotation-translation matrix that maps the spatial three-dimensional points corresponding to the i-th frame depth image into the world coordinate system; p represents the first spatial three-dimensional point and q represents the second spatial three-dimensional point; and i is a positive integer;
and back-projecting the second spatial three-dimensional point onto the two-dimensional image plane according to the following formula to obtain the projected depth image:
u_q = [ f_x · x_q / z_q + c_x,  f_y · y_q / z_q + c_y ]^T
wherein u_q is the pixel on the projected depth image corresponding to q; f_x, f_y, c_x, and c_y represent the intrinsic parameters of the depth camera; x_q, y_q, z_q represent the coordinates of q; and T denotes the transpose of a matrix;
and respectively counting the number of valid pixels on the depth image of the starting frame and on the projected depth image of the given frame, and taking the ratio of the two as the similarity.
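A minimal sketch of this similarity computation, assuming the depth maps are NumPy arrays, that T_start_from_i maps frame i's camera coordinates into the starting frame's coordinate system, and that the ratio takes the valid pixels of the projected image over those of the starting frame (the claim does not fix which count is the numerator):

```python
import numpy as np

def projected_similarity(depth_i, depth_start, T_start_from_i, K):
    """Similarity between frame i and the segment's starting frame.

    Back-projects frame i's depth into 3D, maps the points into the starting
    frame's coordinate system, re-projects them with the intrinsics K, and
    returns the ratio of valid projected pixels to valid pixels in the
    starting frame's depth image.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth_start.shape
    ys, xs = np.nonzero(depth_i > 0)
    z = depth_i[ys, xs]
    # pi^{-1}: pixel + depth -> 3D point in the camera frame of frame i.
    pts = np.stack([(xs - cx) * z / fx, (ys - cy) * z / fy, z], axis=1)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    q = (T_start_from_i @ pts_h.T).T[:, :3]   # points in the starting frame
    q = q[q[:, 2] > 0]                        # keep points in front of the camera
    u = fx * q[:, 0] / q[:, 2] + cx
    v = fy * q[:, 1] / q[:, 2] + cy
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    valid_projected = int(np.count_nonzero(inside))
    valid_start = int(np.count_nonzero(depth_start > 0))
    return valid_projected / valid_start if valid_start else 0.0
```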
9. The method according to claim 1, wherein fusing each frame of the depth image by using a truncated signed distance function grid model according to the processing result, and representing the three-dimensional space by a voxel grid, thereby obtaining the indoor complete-scene three-dimensional model, specifically comprises:
performing weighted fusion of the truncated signed distance function data within the Volumetric method framework, based on the noise characteristics and the region of interest;
and extracting a Mesh model with the Marching cubes algorithm, thereby obtaining the indoor complete-scene three-dimensional model.
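For the Mesh extraction step, one readily available route is the marching cubes implementation in scikit-image, applied to the fused TSDF volume. This is a sketch under the assumption that `tsdf` is a dense 3D array and that `voxel_size` and `origin` describe the voxel grid; it is not necessarily the implementation used in the patent.

```python
import numpy as np
from skimage import measure

def extract_mesh(tsdf: np.ndarray, voxel_size: float, origin: np.ndarray):
    """Extract the zero level set of a fused TSDF volume as a triangle mesh."""
    verts, faces, normals, _ = measure.marching_cubes(tsdf, level=0.0)
    verts = verts * voxel_size + origin   # voxel indices -> world coordinates
    return verts, faces, normals
```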
10. The method of claim 9, wherein the weighted data fusion is performed according to the following equations:
F(v) = ( Σ_{i=1}^{n} w_i(v) · f_i(v) ) / ( Σ_{i=1}^{n} w_i(v) ),  W(v) = Σ_{i=1}^{n} w_i(v)
wherein v represents a voxel; f_i(v) and w_i(v) respectively represent the truncated signed distance function and its weight function for the voxel v in the i-th frame; n is a positive integer; F(v) represents the fused truncated signed distance value of the voxel v; and W(v) represents the weight of the fused truncated signed distance value of the voxel v;
wherein the weight function may be determined according to the following equation:
[weight function formula given as an equation image in the original publication]
wherein d_i represents the radius of the region of interest; δ_s is the noise variance of the depth data; and w is a constant.
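The fusion in claim 10 is usually applied as a running weighted average so that frames can be integrated one at a time. A minimal sketch follows; the region-of-interest weight uses a Gaussian-style form built from the symbols d_i, δ_s and w listed in the claim, but its exact expression is an assumption since the claim gives the formula only as an equation image.

```python
import numpy as np

def fuse_frame(F, W, f_i, w_i):
    """One fusion step: F <- (W*F + w_i*f_i) / (W + w_i), W <- W + w_i.

    F, W     : running TSDF values and weights per voxel (same-shape arrays).
    f_i, w_i : TSDF values and weights from the current frame; NaN in f_i marks
               voxels this frame does not observe, and w_i is assumed positive
               wherever the voxel is observed.
    """
    observed = ~np.isnan(f_i)
    new_W = W[observed] + w_i[observed]
    F[observed] = (W[observed] * F[observed] + w_i[observed] * f_i[observed]) / new_W
    W[observed] = new_W
    return F, W

def roi_weight(d_i, sigma, w=1.0):
    """Hypothetical region-of-interest weight: largest near the region of interest,
    damped where the depth noise is large. Illustrative only."""
    return w * np.exp(-(d_i ** 2) / (2.0 * sigma ** 2))
```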
11. A system for three-dimensional reconstruction of a complete scene indoors based on a consumer-grade depth camera, the system comprising:
the acquisition module is used for acquiring a depth image;
the filtering module is used for carrying out self-adaptive bilateral filtering on the depth image;
the block fusion and registration module is used for carrying out block fusion and registration processing based on visual content on the filtered depth image;
the volume data fusion module is used for carrying out weighted volume data fusion according to the processing result so as to reconstruct an indoor complete scene three-dimensional model;
the block fusion and registration module is specifically configured to: segmenting the depth image sequence based on visual content, performing block fusion on each segment, performing closed-loop detection between the segments, and performing global optimization on the result of the closed-loop detection;
the block fusion and registration module is further specifically configured to:
segmenting a depth image sequence based on a visual content detection automatic segmentation method, dividing similar depth image contents into segments, performing block fusion on each segment, determining a transformation relation between the depth images, and performing closed-loop detection between the segments according to the transformation relation so as to realize global optimization;
the block fusion and registration module specifically comprises:
the camera pose information acquisition unit is used for estimating visual odometry with the Kintinuous framework to obtain the camera pose for each frame of the depth image sequence;
the segmentation unit is used for back-projecting the point cloud data corresponding to each frame of the depth image into the initial coordinate system according to the camera pose information, comparing the similarity between the projected depth image and the depth image of the starting frame, and re-initializing the camera pose and starting a new segment when the similarity is lower than the similarity threshold;
the registration unit is used for extracting PFFH geometric descriptors from the point cloud data of each segment, performing coarse registration between every two segments, and performing fine registration with the GICP algorithm to obtain the matching relationships between segments;
and the optimization unit is used for constructing a graph from the pose information of each segment and the matching relationships between segments, and performing graph optimization with the G2O framework to obtain the optimized camera trajectory, so as to realize the global optimization.
12. The system of claim 11, wherein the filtering module is specifically configured to:
adaptive bilateral filtering is performed according to:
Z̃(u) = (1 / W) Σ_{u_k ∈ N(u)} w_s(‖u − u_k‖) · w_c(|Z(u) − Z(u_k)|) · Z(u_k)
wherein u and u_k respectively represent a pixel on the depth image and a pixel in its neighbourhood; Z(u) and Z(u_k) represent the depth values corresponding to u and u_k; Z̃(u) represents the filtered depth value; W is the normalization factor over the neighbourhood N(u); and w_s and w_c represent the Gaussian kernels for filtering in the spatial domain and the value domain, respectively.
13. The system according to claim 11, wherein the segmentation unit comprises in particular:
the calculating unit is used for calculating the similarity between each frame of depth image and the first frame of depth image;
the judging unit is used for judging whether the similarity is lower than a similarity threshold value;
a segmentation subunit, configured to segment the depth image sequence when the similarity is lower than a similarity threshold;
and the processing unit is used for taking the next frame depth image as the starting frame depth image of the next segmentation, and repeatedly executing the calculating unit and the judging unit until all the frame depth images are processed.
14. The system of claim 11, wherein the volume data fusion module is specifically configured to: according to the processing result, fuse each frame of the depth image by using a truncated signed distance function grid model, and represent the three-dimensional space by a voxel grid, so as to obtain the indoor complete-scene three-dimensional model.
15. The system according to claim 14, wherein the volume data fusion module specifically comprises:
the weighted fusion unit is used for performing weighted fusion of the truncated signed distance function data within the Volumetric method framework, based on the noise characteristics and the region of interest;
and the extraction unit is used for extracting the Mesh model by adopting a Marching cubes algorithm so as to obtain the indoor complete scene three-dimensional model.
CN201710051366.5A 2017-01-23 2017-01-23 Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera Active CN106910242B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710051366.5A CN106910242B (en) 2017-01-23 2017-01-23 Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710051366.5A CN106910242B (en) 2017-01-23 2017-01-23 Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera

Publications (2)

Publication Number Publication Date
CN106910242A CN106910242A (en) 2017-06-30
CN106910242B true CN106910242B (en) 2020-02-28

Family

ID=59207090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710051366.5A Active CN106910242B (en) 2017-01-23 2017-01-23 Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera

Country Status (1)

Country Link
CN (1) CN106910242B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107067470B (en) * 2017-04-05 2019-09-06 东北大学 Portable three-dimensional reconstruction of temperature field system based on thermal infrared imager and depth camera
CN107464278B (en) * 2017-09-01 2020-01-24 叠境数字科技(上海)有限公司 Full-view sphere light field rendering method
CN109492656B (en) * 2017-09-11 2022-04-29 阿波罗智能技术(北京)有限公司 Method and apparatus for outputting information
CN107833270B (en) * 2017-09-28 2020-07-03 浙江大学 Real-time object three-dimensional reconstruction method based on depth camera
CN109819173B (en) * 2017-11-22 2021-12-03 浙江舜宇智能光学技术有限公司 Depth fusion method based on TOF imaging system and TOF camera
CN108053476B (en) * 2017-11-22 2021-06-04 上海大学 Human body parameter measuring system and method based on segmented three-dimensional reconstruction
CN108133496B (en) * 2017-12-22 2021-11-26 北京工业大学 Dense map creation method based on g2o and random fern algorithm
CN108227707B (en) * 2017-12-25 2021-11-26 清华大学苏州汽车研究院(吴江) Automatic driving method based on laser radar and end-to-end deep learning method
CN108537876B (en) * 2018-03-05 2020-10-16 清华-伯克利深圳学院筹备办公室 Three-dimensional reconstruction method, device, equipment and storage medium
CN108550181B (en) * 2018-03-12 2020-07-31 中国科学院自动化研究所 Method, system and equipment for online tracking and dense reconstruction on mobile equipment
CN108564616B (en) * 2018-03-15 2020-09-01 中国科学院自动化研究所 Fast robust RGB-D indoor three-dimensional scene reconstruction method
CN108961176B (en) * 2018-06-14 2021-08-03 中国科学院半导体研究所 Self-adaptive bilateral reference restoration method for range-gated three-dimensional imaging
CN109472820B (en) * 2018-10-19 2021-03-16 清华大学 Monocular RGB-D camera real-time face reconstruction method and device
CN109737974B (en) * 2018-12-14 2020-11-27 中国科学院深圳先进技术研究院 3D navigation semantic map updating method, device and equipment
CN110007754B (en) * 2019-03-06 2020-08-28 清华大学 Real-time reconstruction method and device for hand-object interaction process
CN110148217A (en) * 2019-05-24 2019-08-20 北京华捷艾米科技有限公司 A kind of real-time three-dimensional method for reconstructing, device and equipment
CN112598778B (en) * 2020-08-28 2023-11-14 国网陕西省电力公司西咸新区供电公司 VR three-dimensional reconstruction method based on improved texture mapping algorithm
CN112053435A (en) * 2020-10-12 2020-12-08 武汉艾格美康复器材有限公司 Self-adaptive real-time human body three-dimensional reconstruction method
CN113436338A (en) * 2021-07-14 2021-09-24 中德(珠海)人工智能研究院有限公司 Three-dimensional reconstruction method and device for fire scene, server and readable storage medium
CN113902846B (en) * 2021-10-11 2024-04-12 岱悟智能科技(上海)有限公司 Indoor three-dimensional modeling method based on monocular depth camera and mileage sensor
CN113989451B (en) * 2021-10-28 2024-04-09 北京百度网讯科技有限公司 High-precision map construction method and device and electronic equipment
CN115358156B (en) * 2022-10-19 2023-03-24 南京耀宇视芯科技有限公司 Adaptive indoor scene modeling and optimization analysis system
CN116563118A (en) * 2023-07-12 2023-08-08 浙江华诺康科技有限公司 Endoscopic image stitching method and device and computer equipment


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751697A (en) * 2010-01-21 2010-06-23 西北工业大学 Three-dimensional scene reconstruction method based on statistical model
CN103559737A (en) * 2013-11-12 2014-02-05 中国科学院自动化研究所 Object panorama modeling method
CN103927717A (en) * 2014-03-28 2014-07-16 上海交通大学 Depth image recovery method based on improved bilateral filters
CN105913489A (en) * 2016-04-19 2016-08-31 东北大学 Indoor three-dimensional scene reconstruction method employing plane characteristics
CN106056664A (en) * 2016-05-23 2016-10-26 武汉盈力科技有限公司 Real-time three-dimensional scene reconstruction system and method based on inertia and depth vision

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera; Shahram Izadi et al.; UIST '11; 2011-10-19; pp. 1-10 *
Kintinuous: Spatially Extended KinectFusion; Thomas Whelan et al.; Computer Science and Artificial Intelligence Laboratory; 2012-07-19; pp. 1-10 *
Research on Real-time 3D Reconstruction and Filtering Algorithms Based on Kinect Depth Information; Chen Xiaoming, Jiang Letian, Ying Rendong; Application Research of Computers; 2013-04-30; pp. 1216-1218 *

Also Published As

Publication number Publication date
CN106910242A (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN106910242B (en) Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
US11727587B2 (en) Method and system for scene image modification
CN111815757B (en) Large member three-dimensional reconstruction method based on image sequence
Huang et al. 3Dlite: towards commodity 3D scanning for content creation.
Liu et al. Continuous depth estimation for multi-view stereo
US9426444B2 (en) Depth measurement quality enhancement
CN113178009B (en) Indoor three-dimensional reconstruction method utilizing point cloud segmentation and grid repair
US11348267B2 (en) Method and apparatus for generating a three-dimensional model
WO2018133119A1 (en) Method and system for three-dimensional reconstruction of complete indoor scene based on depth camera
CN109961506A (en) A kind of fusion improves the local scene three-dimensional reconstruction method of Census figure
Sibbing et al. Sift-realistic rendering
EP2064675A1 (en) Method for determining a depth map from images, device for determining a depth map
Kordelas et al. State-of-the-art algorithms for complete 3d model reconstruction
Xu et al. Survey of 3D modeling using depth cameras
CN115393519A (en) Three-dimensional reconstruction method based on infrared and visible light fusion image
Blanchet et al. Fattening free block matching
Nouduri et al. Deep realistic novel view generation for city-scale aerial images
Jisen A study on target recognition algorithm based on 3D point cloud and feature fusion
Labatut et al. Hierarchical shape-based surface reconstruction for dense multi-view stereo
Novacheva Building roof reconstruction from LiDAR data and aerial images through plane extraction and colour edge detection
Kim et al. Automatic registration of LiDAR and optical imagery using depth map stereo
Lyra et al. Development of an efficient 3D reconstruction solution from permissive open-source code
Murayama et al. Depth Image Noise Reduction and Super-Resolution by Pixel-Wise Multi-Frame Fusion
WO2011080669A1 (en) System and method for reconstruction of range images from multiple two-dimensional images using a range based variational method
Xiang et al. A method of scene flow estimation with bilateral filter and adaptive TV (Total Variation) penalty function

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Gao Wei

Inventor after: Li Jianwei

Inventor after: Wu Yihong

Inventor before: Li Jianwei

Inventor before: Gao Wei

Inventor before: Wu Yihong

CB03 Change of inventor or designer information