CN114782628A - Indoor real-time three-dimensional reconstruction method based on depth camera - Google Patents

Indoor real-time three-dimensional reconstruction method based on depth camera Download PDF

Info

Publication number
CN114782628A
CN114782628A
Authority
CN
China
Prior art keywords
point
depth
points
point cloud
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210440569.4A
Other languages
Chinese (zh)
Inventor
吴学毅
耿瑞琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202210440569.4A priority Critical patent/CN114782628A/en
Publication of CN114782628A publication Critical patent/CN114782628A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10028 Range image; Depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an indoor real-time three-dimensional reconstruction method based on a depth camera, which comprises the following steps. Step 1: acquire a depth map and a color image of an indoor scene. Step 2: complete the holes of the depth map. Step 3: convert the depth map into a point cloud and downsample the point cloud. Step 4: detect and match feature points and solve the initial pose parameter estimate. Step 5: register the point clouds with the ICP point-to-plane algorithm, then track the pose by combining the feature point method and the direct method. Step 6: complete loop detection with the bag-of-words method and optimize the detected closed loops. Step 7: construct a mesh map from the truncated signed distance function (TSDF), reconstruct the object surface with the marching cubes (MC) algorithm, and complete texture mapping by weighted fusion to obtain a three-dimensional reconstruction model of the indoor object. The invention solves the problem of drift caused by accumulated errors in the prior art and can obtain a more accurate three-dimensional reconstruction model.

Description

Indoor real-time three-dimensional reconstruction method based on depth camera
Technical Field
The invention belongs to the technical field of machine vision, and relates to an indoor real-time three-dimensional reconstruction method based on a depth camera.
Background
With the rise of three-dimensional vision technology and the spread of three-dimensional spatial data scanning hardware, simultaneous localization and mapping (SLAM) in the three-dimensional vision field and real-time three-dimensional reconstruction based on visual SLAM have developed continuously, and interaction between virtual three-dimensional space and real scenes has become a trend.
In the process of acquiring the depth map, the Kinect v2 camera sends continuous light pulses to the target object and the sensor receives the light returned from it; near indoor light sources and in regions directly irradiated by strong light the returning infrared laser is interfered with, so environment-induced hole noise appears. Because there is a baseline between the infrared transmitter and the infrared receiver, visual occlusion occurs: if part of the scene lies outside the infrared emitter's field of view, occluded regions are produced, a step-like depth discontinuity appears between foreground and background, and hole noise appears at edges. When the surface of the target object is a reflective material such as glass or a computer screen, the infrared laser pulse cannot return completely to the receiver, so hole noise appears on the object surface. Depth-map noise and holes caused by object occlusion, object material, ambient light and similar conditions are commonly reduced with Gaussian filtering, bilateral filtering, adaptive median filtering and the like, but holes remain in the depth map after the noise is filtered, which degrades the visual effect of three-dimensional reconstruction.
Three-dimensional reconstruction needs a large number of feature points to guarantee dense reconstruction and accurate camera pose tracking; however, the commonly used direct method requires scenes with strong discrimination and easily falls into local minima, while the feature point method suffers from long matching time and low efficiency. The number of feature points and the real-time performance therefore have to be balanced when completing real-time three-dimensional reconstruction.
Accumulated errors cause the reconstruction result to drift: the reconstructed room deforms, and the roof of a cuboid room may be reconstructed as an ellipsoidal surface.
Disclosure of Invention
The invention aims to provide an indoor real-time three-dimensional reconstruction method based on a depth camera, which solves the problem of drift caused by accumulated errors in the prior art.
The technical scheme adopted by the invention is as follows:
an indoor real-time three-dimensional reconstruction method based on a depth camera comprises the following steps:
step 1: acquiring a depth map and a color image of an indoor scene;
step 2: completing holes of the depth map by using a depth completion algorithm based on hole neighborhood depth values and color information;
step 3: obtaining the camera internal parameters by using a checkerboard calibration method, converting the depth map into a point cloud according to the relevant coordinate-system transformations, and downsampling the point cloud by using a voxel filtering method;
step 4: detecting and matching feature points by using an improved SURF algorithm, and solving the initial pose parameter estimate;
step 5: registering the point clouds by using an ICP point-to-plane algorithm, and tracking the pose by combining the feature point method and the direct method;
step 6: completing loop detection by using a bag-of-words method, and optimizing the detected closed loops;
step 7: constructing a mesh map from the truncated signed distance function (TSDF), reconstructing the object surface by using the marching cubes (MC) algorithm, and completing texture mapping by weighted fusion to obtain a three-dimensional reconstruction model of the indoor object.
The invention is also characterized in that:
the step 2 is: setting effective depth values, and performing hole completion on holes with pixel lengths larger than the effective values by adopting an interpolation method, and performing hole completion on holes with pixel lengths not larger than the effective values by sequentially adopting a hole neighborhood depth value hole completion method and a color information depth completion method;
wherein to the length of pixel be not more than the hole of virtual value, concrete operation is:
step 2.1, the types of the holes are divided into two types according to the effective depth values: holes on the right side of the effective depth value and holes on the left side of the effective depth value are filled by respectively adopting a left-side indexing method and a right-side indexing method;
step 2.2, after the hole filling of step 2.1, further hole completion is performed by a depth completion method based on color information, with the formulas as follows:
[Formulas for v(i)_final, the Gaussian color weights w_RGB1 and w_RGB2, and the judgment function J_k; reproduced only as images in the original publication (equations (7)-(10) of the description).]
wherein the Gaussian weight of the color information is denoted w_RGB: holes on the right side of the effective depth value are the first type of hole, with weight w_RGB1, and holes on the left side of the effective depth value are the second type of hole, with weight w_RGB2; σ_RGB represents the standard deviation of the Gaussian weight w_RGB; J_k is used to determine whether a hole is of the first or the second type; v(i) is the depth value that takes the neighborhood depth values and the slope into account; v(i)_final additionally imposes the color-information constraint; L_valid is the length of the available depth values; L_miss is the hole length; N is the set of neighborhood color information; and v(i)_final is the final completed depth value at i.
In step 2.1, when the hole extends to the rightmost pixel of the current object, depth values are indexed from the left side and propagated to the right, filling the hole in sequence; the specific calculation formula is:
v(i)=v(i-1)+[v(i-L)-v(i-L-1)]×slope (1)
where i is a pixel, v(i) is the completed depth value considering the slope and the neighborhood, slope is the slope, and L is the hole length; slope is initialized to 1 and v_0(i) is initialized to zero, i.e. v_0(i) = 0. The solving formulas for slope and v_0(i) are:
[Equation (2): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i-1) + v(i-L) - v(i-L-1)   (3)
in step 2.1, when the hole is on the left side of the effective depth value, the depth value is indexed from the right side, and filling is performed in sequence to the left, specifically, formula (4) is adopted for filling:
v(i)=v(i-1)+[v(i+L)-v(i+L+1)]×slope (4)
where the solving formulas for slope and v_0(i) are:
[Equation (5): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i+1) + v(i+L) - v(i+L+1)   (6).
the step 3 comprises the following steps:
step 3.1, obtaining the camera internal parameters: capture more than 18 images of a planar checkerboard in the three-dimensional scene from different angles; the calibration checkerboard is purpose-made, and the corner coordinates are obtained with a corner extraction algorithm; the homography between the checkerboard plane and the image plane is computed, and finally the least-squares problem is solved by SVD to obtain all internal parameters of the camera, namely the focal lengths f_x, f_y of the camera's x and y axes and the offsets C_x, C_y of the camera optical axis in the image coordinate system;
Step 3.2: the process of converting the depth map into a point cloud: define the phase difference of each beam corresponding to adjacent pixels, set the six degrees of freedom, and complete the conversion from depth map to point cloud with pcl_range_image in the common module of the PCL point cloud library. A depth-map pixel (u, v) is converted into a point (x_w, y_w, z_w) in the world coordinate system according to equation (11), where Z_C is the depth value of the point and f_x, f_y, C_x, C_y are the camera internal parameters:
x_w = (u - C_x)·Z_C / f_x,   y_w = (v - C_y)·Z_C / f_y,   z_w = Z_C   (11)
Step 3.3: removing outliers by using a statistical filtering method, and completing point cloud downsampling by using a voxel filtering method; the method comprises the following specific steps:
step 3.3.1: for each point, the statistical filtering method first calculates its average distance to all of its neighboring points; assuming the resulting distribution is Gaussian, the mean μ and standard deviation σ are computed;
step 3.3.2: from the mean μ and standard deviation σ of all the average distances, the distance threshold dmax can be expressed as dmax = μ + α × σ, where α is a constant called the proportionality coefficient; the point cloud is traversed again, and points whose average distance to their k nearest neighbors is greater than dmax are regarded as outliers and removed from the point cloud data, completing the statistical filtering;
step 3.3.3: voxel filtering voxelizes the point cloud data with an AABB bounding box, and noise points and outliers are removed through the voxel grid.
Step 4 comprises the following steps:
step 4.1, whether a pixel is a corner is determined from the difference between its pixel value and those of its neighborhood points: select a pixel in the image, denote its brightness I_a, and set a threshold k; search on a circle centered at the selected point p and obtain the brightness of the points on it; if there are N consecutive points whose brightness I > I_a + k or I < I_a - k, the point can be defined as a FAST feature point, with N taken as 9; the steps are repeated until all pixels have been traversed; if there are enough points in the neighborhood of the feature point whose gray values change sufficiently, it is regarded as a feature point;
step 4.2, feature point screening: and dividing the extracted feature points by using a quadtree, setting the number of divided grids of the quadtree to be 25, enabling all the feature points to fall into 25 grids, and selecting the feature point with the highest FAST feature point response value in each grid as the feature point, thereby realizing feature point homogenization. Then adding 3 layers of Gaussian pixel pyramids and scaling factors, gradually decreasing pyramid templates layer by layer, respectively extracting features under different scales, and enhancing the robustness of feature points; and finally, adding the Euclidean distance, determining the number of the extracted feature points, and finishing the screening process of the feature points.
Step 4.3, feature point description: select a rectangular frame of side length 20s in the neighborhood of the keypoint and divide it into 16 sub-rectangles of equal size; compute Haar wavelet responses of size 2s, take the direction of the longest vector as the dominant orientation, and accumulate the response values to generate the feature vector (Σdx, Σ|dx|, Σdy, Σ|dy|);
step 4.4, feature matching: if the Euclidean distance is smaller than a set threshold value, the feature points can be considered to be successfully matched; and the trace of the Hessian matrix can be used as a judgment basis, and the signs of the traces of the matrix formed by the characteristic points are the same, so that the direction among the characteristic points is proved to be the same, and the characteristic points are successfully matched.
The step 5 specifically comprises the following steps:
step 5.1: inputting the point cloud in the pcd format obtained in the step 3;
and step 5.2: the method comprises the following steps of registering point cloud by using an ICP point-to-plane algorithm:
step 5.2.1: calculating the gravity center position coordinates of the point set, and performing point set centralization to generate a new point set;
step 5.2.2: calculating a positive definite matrix N by the new point set, and calculating the maximum eigenvalue of N and the maximum eigenvector thereof;
step 5.2.3: since the eigenvector corresponding to the maximum eigenvalue is the rotation quaternion when the sum of squared residuals is minimum, converting this quaternion into a rotation matrix R;
step 5.2.4: after R is determined, the translation vector t is only the difference of the barycenters of the two point sets;
equation (12) describes the error function J to be minimized, where p_i is the source point set and p_i' is the target point set:
J = min_(R,t) (1/2) Σ_(i=1..n) ||p_i - (R·p_i' + t)||²   (12)
expression for the rotational component R:
R* = argmin_R (1/2) Σ_(i=1..n) ||q_i - R·q_i'||²,  where q_i = p_i - p and q_i' = p_i' - p' are the centered points and p, p' are the centroids of the two point sets   (13)
expression for translation component t:
t*=p-R*·p′ (14)
and 5.3, combining a feature point method and a direct method to obtain an optimal camera pose estimation result.
The step 6 is specifically as follows:
step 6.1: creating a feature dictionary: creating and training a dictionary by using a SURF (speeded up robust features) feature description algorithm, and dividing the dictionary by using a kd tree, wherein parameters are set as follows: the branching number k is 10, the depth d is 5;
step 6.2: completing the construction process of the dictionary: storing the extracted image features into a container, and calling a DBoW dictionary generation interface;
step 6.3: judging whether a loop is generated by using a TF-IDF weighting method; the L1 norm is used to represent the difference between the two image descriptions X and Y;
step 6.4: if the similarity between the current frame and some earlier key frame exceeds 3 times the similarity between the current frame and the previous (adjacent) key frame, a loop may exist, and this information is passed to the back end for optimization; when the two frames are spatially consistent, feature matching is performed on the two frames detected by the loop and the camera motion is estimated; loop detection produces tighter and more accurate constraints than the back end, and these constraints can form a topologically consistent trajectory map; if the closed loop can be detected and optimized, the accumulated error can be eliminated.
The step 7 is specifically as follows:
step 7.1: expressing point cloud data by using an octree;
step 7.2: estimating an implicit expression function of a potential curved surface using known point cloud data; classifying the vertexes of the octree nodes by using the function, and respectively marking the vertexes as the inner side and the outer side of the curved surface;
step 7.3: extracting the surface of the triangular mesh by adopting an MC moving cube algorithm;
step 7.4: and fusing the values of the grid colors according to a weighting mode to obtain a three-dimensional reconstruction model with color textures.
The invention has the beneficial effects that:
firstly, the invention can realize the indoor real-time three-dimensional reconstruction process by utilizing the scanning data of the depth camera. The method comprises the steps of obtaining a depth map and a color image by using a Kinectv2 camera, completing holes of the depth map by using a depth completion algorithm based on hole neighborhood depth values and color information, solving camera internal parameters by using a checkerboard calibration method, and converting the completed depth map into point cloud. And finishing point cloud down-sampling by using a voxel filtering method. Detecting and matching feature points by using an improved SURF algorithm, obtaining initial value pose parameter estimation, registering point cloud by using an ICP point-to-plane algorithm, realizing pose tracking by using a feature point method combined with a direct method, solving the drift problem caused by accumulated errors by using loop detection based on a bag-of-words method, and finally realizing object curved surface reconstruction and texture mapping by using TSDF data fusion and a moving cube algorithm to finish the indoor real-time three-dimensional reconstruction process.
Secondly, the method traverses the depth map point by point, guided by the color information of the acquired color image, and completes hole-type classification based on boundary constraints and the completion process under the joint action of the valid neighborhood depth values, the gradient and color consistency. Applying this depth-map hole completion algorithm in the indoor real-time three-dimensional reconstruction system improves the accuracy of the reconstructed model, addresses the depth-map noise and holes caused by object occlusion, object material, ambient light and similar conditions, and realizes hole completion.
Thirdly, the corners extracted by FAST are described with SURF, feature-point homogenization is realized with a quadtree, and the feature points are further screened by Euclidean distance, so that a sufficient number of feature points can be acquired in a short time; during feature matching, wrong matches are eliminated with the RANSAC algorithm. In the reconstruction process this satisfies the number of feature points needed for dense reconstruction while increasing the running speed, solving the balance between the number of feature points and reconstruction real-time performance.
Fourthly, scene drift caused by accumulated errors is removed by loop detection based on the bag-of-words method; the pose-tracking mode that fuses the feature point method and the direct method addresses camera tracking loss and scene re-identification; and combining ICP with the NDT algorithm addresses the joint handling of translation and rotation in rigid motion.
Drawings
FIG. 1 is a complete flow chart of the present invention
FIG. 2 is a color image and depth map acquired with a Kinect v2 depth camera
FIG. 3 is a depth map after completion using the hole completion algorithm of the present invention
FIG. 4 is a schematic diagram of a camera reference calibration process
FIG. 5 is a transformation of a depth map into a point cloud using an internal reference
FIG. 6 shows two frames of chair point cloud before and after registration
FIG. 7 is a diagram of feature points and trajectories for tracking pose changes by fusing a feature point method and a direct method
FIG. 8 is a curved surface reconstruction model of a chair
FIG. 9 is a textured three-dimensional model of the final three-dimensional reconstructed indoor chair
FIG. 10 is a three-dimensional reconstruction of an indoor scene including a chair
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The invention relates to an indoor real-time three-dimensional reconstruction method based on a depth camera, the process of which is shown in figure 1 and specifically comprises the following steps:
Step 1: scan the indoor scene with a Kinect v2 depth camera to obtain a depth map and a color image of the indoor scene.
Step 2: complete the holes of the depth map using a depth completion algorithm based on hole-neighborhood depth values and color information.
Step 3: solve the camera internal parameters by a checkerboard calibration method, convert the depth map into a point cloud according to the relevant coordinate-system transformations, and downsample the point cloud by voxel filtering.
Step 4: detect and match feature points using an improved SURF algorithm, and solve the initial pose parameter estimate.
Step 5: register the point clouds with an ICP point-to-plane algorithm, then realize pose tracking by combining the feature point method and the direct method.
Step 6: complete loop detection using a bag-of-words method; if a closed loop can be detected and optimized, the drift caused by accumulated errors can be eliminated.
Step 7: construct a mesh map from the truncated signed distance function (TSDF), reconstruct the object surface with the marching cubes (MC) algorithm, and complete texture mapping by weighted fusion to obtain a three-dimensional reconstruction model of the indoor object.
In step 1, a Kinect v2 depth camera is used; the specific process of acquiring the color image and the depth image is as follows:
Step 1.1: configure the Libfreenect2 library on the PC to connect the Kinect v2;
Step 1.2: acquire data with the OPENNI2 library: a handheld Kinect v2 camera captures color images and depth images of the indoor scene;
step 1.3: the resolution of the acquired color image is 1920 × 1080, the resolution of the depth map is 512 × 424, and the format of the acquired image is.
Step 2 is as follows: set an effective depth value; for holes whose pixel length is greater than the effective value, perform hole completion by interpolation; for holes whose pixel length is not greater than the effective value, perform hole completion by the hole-neighborhood depth value method followed by the color-information depth completion method.
For holes whose pixel length is not greater than the effective value, the specific operations are:
step 2.1, the types of the holes are divided into two types according to the effective depth values: and holes on the right side of the effective depth value and holes on the left side of the effective depth value are filled by respectively adopting a left-side index method and a right-side index method for the two types of holes.
When the hole extends to the rightmost pixel of the current object, the depth values are indexed from the left side; there is no depth information on the right side of the hole, so continuity of the filling on that side need not be considered. It suffices to identify, on the left side of the hole, a pattern of depth change longer than the hole itself and to propagate this pattern to the right, filling the hole in sequence. A constrained hole does not reach the leftmost or rightmost boundary of the scene object, i.e. the hole is contained within an object with valid depth; in that case the depth-change pattern may be sampled from either side, depending on where valid depth values are available within the same object. To predict missing depth values correctly, the continuity between the valid values on the left and right of the hole is considered, and the slope is used to ensure that the predictor reasonably connects the depth values to the left and right of the hole. The specific calculation formula is:
v(i)=v(i-1)+[v(i-L)-v(i-L-1)]×slope (1)
Equation (1) states that the depth value at i is completed as the depth value at i-1 plus the product of the slope and the difference between the depth values at i-L and i-L-1. Here i is a pixel, v(i) is the completed depth value considering the slope and the neighborhood, slope is the slope, and L is the hole length; slope is initialized to 1 and v_0(i) is initialized to zero, i.e. v_0(i) = 0. The solving formulas for slope and v_0(i) are:
[Equation (2): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i-1) + v(i-L) - v(i-L-1)   (3)
equation (2) describes the solution for the slope in equation (1), and equation (3) describes the solution for the initialized depth value at i.
When the hole is on the left side of the effective depth value, the depth value is indexed from the right side. The concrete filling is carried out by adopting a formula (4):
v(i)=v(i-1)+[v(i+L)-v(i+L+1)]×slope (4)
where i is a pixel, v(i) is the completed depth value considering the slope and the neighborhood, slope is the slope, and L is the hole length; slope is initialized to 1 and v_0(i) is initialized to zero, i.e. v_0(i) = 0.
This case applies when the constrained hole does not reach the leftmost or rightmost border of the object but the number of valid depth values on the left side of the hole is smaller than the hole length, while there are enough valid depth values on the right side of the hole, from which the depth values are completed. The solving formulas for slope and v_0(i) are:
[Equation (5): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i+1) + v(i+L) - v(i+L+1)   (6)
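As an illustration of the index-based filling above, a minimal Python sketch follows. The slope of equations (2) and (5) is reproduced only as images in the original publication, so it is kept at its initialization value of 1 here; the array layout, function names, and the right-index recursion (written in the v(i+1)-based form of equation (6)) are illustrative assumptions rather than the patent's exact implementation.

```python
import numpy as np

def fill_hole_left_index(v, start, length, slope=1.0):
    """Fill a hole of `length` pixels beginning at `start`, propagating the
    depth-change pattern found on the left of the hole (equations (1)/(3)).
    slope stays at its initialization value 1; equation (2), which refines it,
    is given only as an image in the original publication."""
    for i in range(start, start + length):
        v[i] = v[i - 1] + (v[i - length] - v[i - length - 1]) * slope
    return v

def fill_hole_right_index(v, start, length, slope=1.0):
    """Fill a hole from its right end leftwards, indexing valid depth values on
    the right side of the hole (the v(i+1)-based form of equation (6))."""
    for i in range(start + length - 1, start - 1, -1):
        v[i] = v[i + 1] + (v[i + length] - v[i + length + 1]) * slope
    return v

# Toy example: a scan line with a 3-pixel hole (zeros mark missing depth).
row = np.array([2.0, 2.1, 2.2, 2.3, 0.0, 0.0, 0.0, 2.7, 2.8])
print(fill_hole_left_index(row.copy(), start=4, length=3))   # -> 2.4, 2.5, 2.6
```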
and 2.2, filling the 2.1 holes, and then performing hole completion further by adopting a depth completion method of color information, wherein in order to increase the semantic rationality of the completed holes, the consistency of color areas of the color image needs to be considered, so that the missing depth value is estimated by the constraint of the color information of the corresponding area of the color image corresponding to the depth map, and the interference of other areas is prevented. Using the Gaussian weight of the color information, the above v (i), and the judgment function JkThe average of the product sums of (a) and (b) to construct a completion hole.
A Gaussian weight model of the color information is established in the completion algorithm. The Gaussian weight of the color information is denoted w_RGB; holes on the right side of the effective depth value are the first type of hole, with weight w_RGB1, and holes on the left side of the effective depth value are the second type of hole, with weight w_RGB2. The weight is determined jointly by the R, G, B values of the color image at the neighborhood pixel i+1 or i-1 and the values R_i, G_i, B_i at the hole pixel i; σ_RGB represents the standard deviation of the Gaussian weight w_RGB; the function J_k determines whether the hole lies to the right or to the left of the effective depth value; v(i) is the depth value that takes the neighborhood depth values and the slope into account; v(i)_final additionally imposes the color-information constraint; L_valid is the length of the available depth values; L_miss is the hole length; N is the set of neighborhood color information; and v(i)_final is the final completed depth value at i.
[Equations (7)-(10): formulas for v(i)_final, the Gaussian color weights w_RGB1 and w_RGB2, and the judgment function J_k; reproduced only as images in the original publication.]
Equation (7) gives the final depth value at i: the products of the judgment function J_k, v(i), and the Gaussian color weight are averaged over the neighborhood within the object containing the hole. Equations (8) and (9) give the color-information weight of the first type of hole, w_RGB1, and that of the second type, w_RGB2. Equation (10) determines whether a hole belongs to the first or the second type.
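Because equations (7)-(10) are reproduced only as images, the sketch below can only illustrate the general idea stated in the text: a Gaussian weight on the color difference between the hole pixel and its neighborhood, used to average the slope-based estimates v(i). The exponential form, the σ_RGB value and the function names are assumptions, not the patent's actual formulas.

```python
import numpy as np

def gaussian_color_weight(rgb_neighbor, rgb_hole, sigma_rgb=10.0):
    """Illustrative Gaussian weight on the RGB difference between a neighborhood
    pixel (i+1 or i-1) and the hole pixel i; the exact w_RGB1 / w_RGB2 of the
    patent are shown only as formula images."""
    diff2 = float(np.sum((np.asarray(rgb_neighbor, float) - np.asarray(rgb_hole, float)) ** 2))
    return float(np.exp(-diff2 / (2.0 * sigma_rgb ** 2)))

def color_constrained_depth(v_estimates, neighbor_rgbs, hole_rgb, sigma_rgb=10.0):
    """Average the slope-based estimates v(i) over the neighborhood N, each
    weighted by its color consistency with the hole pixel (illustrative only)."""
    w = np.array([gaussian_color_weight(c, hole_rgb, sigma_rgb) for c in neighbor_rgbs])
    v = np.asarray(v_estimates, float)
    return float(np.sum(w * v) / (np.sum(w) + 1e-12))
```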
Step 3: solve the camera internal parameters by a checkerboard calibration method, convert the depth map into a point cloud according to the relevant coordinate-system transformations, and downsample the point cloud by voxel filtering. The specific steps are as follows:
Step 3.1: obtain the camera internal parameters. Capture more than 18 images of the planar checkerboard in the three-dimensional scene from different angles; the calibration checkerboard is purpose-made, and the corner coordinates are obtained with a corner extraction algorithm; the homography between the checkerboard plane and the image plane is computed, and finally the least-squares problem is solved by SVD to obtain each internal parameter of the camera: the focal lengths f_x, f_y of the camera's x and y axes and the offsets C_x, C_y of the camera optical axis in the image coordinate system.
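Step 3.1 corresponds to the standard checkerboard (Zhang) calibration; a minimal sketch using OpenCV is given below. The board size, square size and file paths are illustrative assumptions, and the patent itself does not prescribe OpenCV; homography estimation and the SVD least-squares solution are handled internally by calibrateCamera.

```python
import glob
import cv2
import numpy as np

# Illustrative settings: inner-corner grid and square size of the printed checkerboard.
pattern = (9, 6)          # inner corners per row/column (assumption)
square = 0.025            # square edge length in metres (assumption)

objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # >18 views from different angles
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K holds the internal parameters f_x, f_y, C_x, C_y.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("f_x, f_y =", K[0, 0], K[1, 1], "  C_x, C_y =", K[0, 2], K[1, 2])
```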
Step 3.2: convert the depth map into a point cloud. Define the phase difference of each beam corresponding to adjacent pixels, set the six degrees of freedom, and complete the conversion from depth map to point cloud with pcl_range_image in the common module of the PCL point cloud library. A depth-map pixel (u, v) is converted into a point (x_w, y_w, z_w) in the world coordinate system according to equation (11), where Z_C is the depth value of the point and f_x, f_y, C_x, C_y are the camera internal parameters; equation (11) gives the point cloud coordinates (x_w, y_w, z_w) of the point (u, v) in the world coordinate system:
x_w = (u - C_x)·Z_C / f_x,   y_w = (v - C_y)·Z_C / f_y,   z_w = Z_C   (11)
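A minimal numpy sketch of the back-projection in equation (11) follows; the patent performs this with pcl_range_image, so the vectorised numpy form below is only an illustrative equivalent, and the intrinsic and depth-scale values are placeholders.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth map (u, v) -> (x_w, y_w, z_w) using the camera
    internal parameters from step 3.1. `depth` holds raw depth values;
    depth_scale converts them to metres (assumed value for Kinect-style data)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64) * depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]          # drop pixels with no valid depth

# Example with placeholder intrinsics for a 512 x 424 Kinect v2 depth map.
cloud = depth_to_point_cloud(np.random.randint(500, 4500, (424, 512)),
                             fx=365.0, fy=365.0, cx=256.0, cy=212.0)
```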
Step 3.3: removing outliers by using a statistical filtering method, and completing point cloud downsampling by using a voxel filtering method; the specific operation is as follows:
step 3.3.1: for each point, the statistical filtering method first calculates its average distance to all of its neighboring points. Assuming the resulting distribution is Gaussian, the mean μ and standard deviation σ are computed;
step 3.3.2: from the mean μ and standard deviation σ of all the average distances, the distance threshold dmax can be expressed as dmax = μ + α × σ, where α is a constant called the proportionality coefficient; the point cloud is traversed again, and points whose average distance to their k nearest neighbors is greater than dmax are regarded as outliers and removed from the point cloud data, completing the statistical filtering;
step 3.3.3: voxel filtering voxelizes the point cloud data with an AABB bounding box, and noise points and outliers are removed through the voxel grid.
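A compact numpy sketch of steps 3.3.1-3.3.3 follows (statistical outlier removal with threshold dmax = μ + α·σ, then voxel-grid downsampling). The values of k, α and the voxel size are illustrative, and the PCL implementation used in practice differs in detail (e.g. kd-tree neighbor search instead of the brute-force distances below).

```python
import numpy as np

def statistical_outlier_removal(pts, k=20, alpha=1.0):
    """Remove points whose mean distance to their k nearest neighbors
    exceeds dmax = mu + alpha * sigma (steps 3.3.1-3.3.2)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # O(n^2); fine for a sketch
    d.sort(axis=1)
    mean_knn = d[:, 1:k + 1].mean(axis=1)          # skip the point itself
    dmax = mean_knn.mean() + alpha * mean_knn.std()
    return pts[mean_knn <= dmax]

def voxel_downsample(pts, voxel=0.05):
    """Voxel-grid downsampling (step 3.3.3): keep one centroid per occupied voxel."""
    keys = np.floor((pts - pts.min(axis=0)) / voxel).astype(np.int64)
    _, inv = np.unique(keys, axis=0, return_inverse=True)
    cnt = np.bincount(inv).astype(float)
    out = np.zeros((inv.max() + 1, 3))
    for dim in range(3):
        out[:, dim] = np.bincount(inv, weights=pts[:, dim]) / cnt
    return out
```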
Step 4: detect and match feature points using an improved SURF algorithm and solve the initial pose parameter estimate. Feature points of two adjacently acquired images are matched with the improved SURF algorithm, and wrong matches are eliminated with RANSAC. The specific steps of the improved SURF algorithm are as follows.
Step 4.1: the advantage of this detection step is that the determination factor is brightness only, a single factor, which accelerates the detection process. FAST corner detection decides whether a pixel is a corner from the difference between its pixel value and those of its neighborhood points: select a pixel in the image and denote its brightness I_a, set a threshold k, and search on a circle centered at the selected point p, obtaining the brightness of the points on it. If there are N consecutive points whose brightness I > I_a + k or I < I_a - k, the point can be defined as a FAST feature point, with N taken as 9; the above steps are repeated until all pixels have been traversed. If there are enough points in the neighborhood of the feature point whose gray values change sufficiently, it is regarded as a feature point.
Step 4.2: and (3) feature point screening, namely dividing the feature points extracted by the FAST by utilizing a quadtree, setting the number of divided grids of the quadtree to be 25, enabling all corner points to fall into 25 grids, and selecting the corner point with the highest FAST corner point response value in each grid as the feature point, thereby realizing corner point homogenization. Then adding 3 layers of Gaussian pixel pyramids and scaling factors, gradually decreasing pyramid templates layer by layer, respectively extracting features under different scales, and enhancing the robustness of feature points; and finally, adding the Euclidean distance, determining the number of the extracted feature points, and finishing the screening process of the feature points.
Step 4.3: characterization is performed using SURF algorithms. Selecting a rectangular frame with the side length of 20s in the neighborhood of the key point, dividing the rectangular frame into 16 rectangles with the same size, calculating by harr with the size of 2s to obtain a response, taking the direction of the longest vector as a main direction, and counting response values to generate characteristic vectors of sigma dy, sigma dx and sigma dy.
Step 4.4: and (5) matching the features. If the Euclidean distance is smaller than the set threshold value, the feature point matching can be considered to be successful. The trace of the Hessian matrix can also be used as a judgment basis, and if the symbols of the traces of the matrix formed by the characteristic points are the same, the direction among the characteristic points is proved to be the same, and the matching of the characteristic points is successful.
Step 5: register the point clouds with the ICP point-to-plane algorithm, and track the pose by combining the feature point method and the direct method.
Step 5.1: inputting the point cloud in the pcd format obtained in the step 3;
and step 5.2: and registering the point cloud by utilizing an ICP point-to-plane algorithm, and specifically comprising the following steps.
Step 5.2.1: calculating the coordinates of the gravity center position of the point set, and performing centralization on the point set to generate a new point set;
step 5.2.2: calculating a positive definite matrix N by the new point set, and calculating the maximum eigenvalue and the maximum eigenvector of N;
step 5.2.3: because the maximum eigenvector is equivalent to the rotation quaternion when the sum of the squares of the residuals is minimum, converting the quaternion into a rotation matrix R;
step 5.2.4: after R is determined, the translation vector t is simply the difference in the center of gravity of the two point sets.
Equation (12) describes the error function J to be minimized, where p_i is the source point set and p_i' is the target point set:
J = min_(R,t) (1/2) Σ_(i=1..n) ||p_i - (R·p_i' + t)||²   (12)
expression for the rotational component R:
R* = argmin_R (1/2) Σ_(i=1..n) ||q_i - R·q_i'||²,  where q_i = p_i - p and q_i' = p_i' - p' are the centered points and p, p' are the centroids of the two point sets   (13)
expression for translation component t:
t*=p-R*·p′ (14)
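Steps 5.2.1-5.2.4 derive R from the maximum eigenvector of a 4×4 matrix N (the quaternion form). The sketch below solves the same centered least-squares problem (12)-(14) with the equivalent SVD (Kabsch) solution instead of the quaternion eigenvector, assuming correspondences between the two point sets are already known; it is an illustrative alternative, not the patent's implementation.

```python
import numpy as np

def align_point_to_point(src, dst):
    """One alignment step for corresponding point sets: center both sets
    (step 5.2.1), solve for R, then t = dst_mean - R @ src_mean (cf. eq. (14))."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    P, Q = src - src_mean, dst - dst_mean          # centered point sets
    H = P.T @ Q                                    # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                       # guard against reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Sanity check with a known rigid motion.
rng = np.random.default_rng(0)
src = rng.random((100, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
dst = src @ R_true.T + np.array([0.1, -0.2, 0.05])
R, t = align_point_to_point(src, dst)
assert np.allclose(R, R_true, atol=1e-8)
```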
step 5.3: the method aims to solve the problems that the direct method is high in applied scene discrimination, prone to falling into a local minimum value, long in matching time with the direct method and low in efficiency, and the method can be used in a fusion mode to obtain the optimal camera pose estimation result.
Step 5.3.1: a sequence of images is initialized. And if the number of FAST corner points detected in the image is larger than a set threshold value, setting the frame as a key frame, tracking the feature points by using an optical flow method, and converting the feature points into depth normalized points in a camera coordinate system. The sequence of consecutive images after the first is processed and if there are enough inliers, this frame is considered to be the frame suitable for triangularization. And carrying out scale transformation according to the pose and map points recovered by the tracked key frame angular points, taking the pose of the previous frame as the initial pose of the current frame, and sending the frame serving as the key frame into the depth filter. And the depth filter searches for a matching point on the epipolar line for the seed point, updates the seed point, and converges the seed point to obtain a new candidate map point.
Step 5.3.2: estimate the pose based on the brightness of sparse points. All image patches on the reference frame associated with map points are projected onto the image pyramid of the current frame. On the pyramid of the current frame, computation starts from the highest layer; each layer inherits the optimization result of the layer before it, and each layer is optimized for 30 iterations. If an iteration does not reduce the error compared with the previous one, the previously optimized pose is kept. The pose of the previous frame is taken as the initial pose of the current frame, and iteration proceeds across the image pyramid layers. If the number of tracked pixels is too small, tracking is considered to have failed and relocalization starts from the next frame.
Step 5.3.3: feature point matching based on image patches. If an 8×8 image patch can be obtained at the projected position of a map point on the current frame, the map point is stored in the grid cell of that projected position. If a candidate point fails to project successfully for 10 frames, it is deleted. All map points are traversed, and all keyframes in which a map point is observed are found; the angle between the line joining the keyframe's optical center to the map point and the line joining the map point to the current frame's optical center is obtained. The keyframe with the smallest angle and its corresponding feature point are selected as the reference; the affine matrix, the target search level of the current frame, and the inverse of the affine matrix are computed; a reference patch of a certain size is taken from the warped patches and adjusted by optimization to match the patch at the target position.
Step 5.3.4: optimize the pose and map points. For each map point, its three-dimensional position is optimized so that the reprojection error on each keyframe in which it is observed is minimized; each map point is optimized 5 times. If the sum of squared errors after an optimization is smaller than that of the previous optimization, the result is updated.
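For the direct-method part of the tracking (step 5.3.2), the quantity being minimized is a photometric error over sparse points. The sketch below only shows how that residual is evaluated for a candidate pose (pinhole projection plus bilinear sampling); the symbol names and the pose parameterisation are illustrative, not the patent's exact implementation, and a Gauss-Newton solver would minimise the sum of squared residuals over the pose.

```python
import numpy as np

def bilinear(img, x, y):
    """Sample image intensity at a sub-pixel location."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def photometric_residuals(ref_img, cur_img, ref_pts, depths, K, R, t):
    """For sparse reference pixels with known depth, back-project, transform by
    the candidate pose (R, t), re-project into the current frame and compare
    intensities; the residual vector feeds a pose optimiser."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    res = []
    for (u, v), z in zip(ref_pts, depths):
        p = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])
        q = R @ p + t
        u2, v2 = fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy
        if 0 <= u2 < cur_img.shape[1] - 1 and 0 <= v2 < cur_img.shape[0] - 1:
            res.append(bilinear(cur_img, u2, v2) - float(ref_img[v, u]))
    return np.array(res)
```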
Step 6, completing loop detection based on a bag-of-words method, efficiently eliminating accumulated errors, and constructing a dictionary:
step 6.1: creating a feature dictionary: creating and training a dictionary by using a SURF (speeded up robust features) feature description algorithm, and dividing the dictionary by using a kd-tree, wherein parameters are set as follows: the branching number k is 10, the depth d is 5;
step 6.2: completing the construction process of the dictionary: storing the extracted image features into a container, and calling a DBoW dictionary generation interface;
step 6.3: and judging whether a loop is generated or not by using a TF-IDF weighting method. By means of L1The norm represents the difference between the two images.
Step 6.4: if the similarity between the current frame and some earlier key frame exceeds 3 times the similarity between the current frame and the previous (adjacent) key frame, a loop is considered possible, and the information is passed to the back end for optimization. That is, when spatial consistency holds, feature matching is performed on the two frames detected by the loop and the camera motion is then estimated. Loop detection produces tighter and more accurate constraints than the back end alone, and these constraints can form a topologically consistent trajectory map. If the closed loop can be detected and optimized, the accumulated error can be eliminated.
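A sketch of the similarity test of steps 6.3-6.4 follows, assuming the L1-based score commonly used with DBoW-style bag-of-words vectors; the exact score formula is an assumption, while the 3× ratio test against the previous (adjacent) key frame follows the description above.

```python
import numpy as np

def bow_similarity(v1, v2):
    """L1-based similarity between two TF-IDF bag-of-words vectors
    (assumed form: s = 1 - 0.5 * || v1/|v1|_1 - v2/|v2|_1 ||_1)."""
    v1 = v1 / (np.abs(v1).sum() + 1e-12)
    v2 = v2 / (np.abs(v2).sum() + 1e-12)
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

def is_loop_candidate(v_cur, v_candidate, v_prev_keyframe, ratio=3.0):
    """A candidate key frame suggests a loop when its similarity to the current
    frame exceeds `ratio` times the similarity to the previous key frame."""
    return bow_similarity(v_cur, v_candidate) > ratio * bow_similarity(v_cur, v_prev_keyframe)
```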
Step 7: construct a mesh map from the truncated signed distance function (TSDF) to fuse the point cloud models obtained after pose transformation, and reconstruct the object surface with the marching cubes (MC) algorithm. Texture mapping is completed by weighted fusion to obtain the three-dimensional reconstruction model of the indoor object. The reconstructed scene is generated from the indoor point cloud model scanned with a handheld Kinect v2; the reconstruction details are accurate and fine, and the method can be used for three-dimensional reconstruction of both small indoor objects and large scenes. The specific process is as follows:
step 7.1: expressing point cloud data by using an octree;
and 7.2: estimating an implicit expression function of a potential curved surface using known point cloud data; classifying the vertexes of the octree nodes by using the function, and respectively marking the vertexes as the inner side and the outer side of the curved surface;
step 7.3: extracting the surface of the triangular mesh by adopting an MC moving cube algorithm;
step 7.4: and fusing the values of the grid colors according to a weighting mode to obtain a three-dimensional reconstruction model with color textures.
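A minimal sketch of the weighted TSDF and color fusion used before mesh extraction in step 7 follows (a running weighted average per voxel); the truncation distance, weight handling and array layout are illustrative assumptions. The marching-cubes surface is then extracted from the zero level set of the fused TSDF.

```python
import numpy as np

def fuse_tsdf(tsdf, weight, color, sdf_obs, color_obs, w_obs=1.0, trunc=0.05):
    """Update per-voxel truncated signed distance and color by weighted fusion:
    new = (W*old + w*obs) / (W + w). Inputs are flat arrays over observed voxels;
    `color` and `color_obs` have shape (n, 3)."""
    d = np.clip(sdf_obs, -trunc, trunc) / trunc          # truncated, normalised SDF
    w_new = weight + w_obs
    tsdf_new = (weight * tsdf + w_obs * d) / w_new
    color_new = (weight[:, None] * color + w_obs * color_obs) / w_new[:, None]
    return tsdf_new, w_new, color_new
```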
Example 1
The method is used to carry out indoor real-time three-dimensional reconstruction of an indoor chair scene;
step 1 is executed, the acquired image is shown as figure 2, and comprises a color image and a depth map;
and (5) executing the step (2), and completing the depth map of the hole as shown in the figure 3.
Step 3 is executed to realize the internal reference calibration of the camera, and the calibration process is as shown in figure 4; the depth map is converted into a point cloud as shown in fig. 5 using the calibrated parameters.
Steps 4 and 5 are executed to carry out the registration of two point cloud frames: the two frames are shown in the left image of fig. 6 and the registered point cloud in the right image of fig. 6. The pose change of the chair is tracked; the right side of fig. 7 shows the trajectory of the camera pose changes and the feature map points of the chair scene.
Steps 6 and 7 are executed to reconstruct the curved surface of the chair, converting the points into a mesh as shown in fig. 8. The final result is the textured chair model shown in fig. 9, and fig. 10 is a three-dimensional reconstruction of an indoor scene containing the chair.

Claims (8)

1. The indoor real-time three-dimensional reconstruction method based on the depth camera is characterized by comprising the following steps of:
step 1: acquiring a depth map and a color image of an indoor scene;
step 2: completing holes of the depth map by using a depth completion algorithm based on hole neighborhood depth values and color information;
step 3: obtaining the camera internal parameters by using a checkerboard calibration method, converting the depth map into a point cloud according to the relevant coordinate-system transformations, and downsampling the point cloud by using a voxel filtering method;
step 4: detecting and matching feature points by using an improved SURF algorithm, and solving the initial pose parameter estimate;
step 5: registering the point clouds by using an ICP point-to-plane algorithm, and tracking the pose by combining the feature point method and the direct method;
step 6: completing loop detection by using a bag-of-words method, and optimizing the detected closed loops;
step 7: constructing a mesh map from the truncated signed distance function (TSDF), reconstructing the object surface by using the marching cubes (MC) algorithm, and completing texture mapping by weighted fusion to obtain a three-dimensional reconstruction model of the indoor object.
2. The depth camera-based indoor real-time three-dimensional reconstruction method of claim 1, wherein the step 2 is: setting effective depth values, and for holes with pixel point lengths larger than the effective values, performing hole completion by adopting an interpolation method, and for holes with pixel point lengths not larger than the effective values, performing hole completion by sequentially adopting a hole neighborhood depth value hole completion method and a color information depth completion method;
wherein, for holes whose pixel length is not greater than the effective value, the specific operations are:
step 2.1, the types of the holes are divided into two types according to the effective depth values: holes on the right side of the effective depth value and holes on the left side of the effective depth value are filled by respectively adopting a left-side indexing method and a right-side indexing method;
step 2.2, after the hole filling of step 2.1, further hole completion is performed by a depth completion method based on color information, with the formulas as follows:
[Formulas for v(i)_final, the Gaussian color weights w_RGB1 and w_RGB2, and the judgment function J_k; reproduced only as images in the original publication.]
wherein the Gaussian weight of the color information is denoted w_RGB: holes on the right side of the effective depth value are the first type of hole, with weight w_RGB1, and holes on the left side of the effective depth value are the second type of hole, with weight w_RGB2; σ_RGB represents the standard deviation of the Gaussian weight w_RGB; J_k is used to determine whether a hole is of the first or the second type; v(i) is the depth value that takes the neighborhood depth values and the slope into account; v(i)_final additionally imposes the color-information constraint; L_valid is the length of the available depth values; L_miss is the hole length; N is the set of neighborhood color information; and v(i)_final is the final completed depth value at i.
3. The method of claim 1, wherein in step 2.1, when the hole extends to the rightmost pixel in the current object, the depth values are propagated from the left index depth value to the right, and the hole is filled in sequence, and the specific calculation formula is as follows:
v(i)=v(i-1)+[v(i-L)-v(i-L-1)]×slope (1)
wherein i is a pixel, v(i) is the completed depth value considering the slope and the neighborhood, slope is the slope, and L is the hole length; slope is initialized to 1 and v_0(i) is initialized to zero, i.e. v_0(i) = 0; the solving formulas for slope and v_0(i) are:
[Equation (2): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i-1) + v(i-L) - v(i-L-1)   (3)
in step 2.1, when the hole is on the left side of the effective depth value, the depth value is indexed from the right side, and filling is performed in sequence to the left, specifically, formula (4) is used for filling:
v(i)=v(i-1)+[v(i+L)-v(i+L+1)]×slope (4)
wherein the solving formulas for slope and v_0(i) are:
[Equation (5): formula for slope, reproduced only as an image in the original publication.]
v_0(i) = v(i+1) + v(i+L) - v(i+L+1)   (6).
4. the depth camera-based indoor real-time three-dimensional reconstruction method of claim 1, wherein the step 3 comprises:
step 3.1, obtaining the camera internal parameters: capturing more than 18 images of the planar checkerboard in the three-dimensional scene from different angles; the calibration checkerboard is purpose-made, and the corner coordinates are obtained with a corner extraction algorithm; obtaining the homography between the checkerboard plane and the image plane, and finally solving the least-squares problem by SVD to obtain the internal parameters of the camera, namely the focal lengths f_x, f_y of the camera's x and y axes and the offsets C_x, C_y of the camera optical axis in the image coordinate system;
Step 3.2: the process of converting the depth map into a point cloud defines the phase difference of each beam corresponding to adjacent pixels, sets the six degrees of freedom, and completes the conversion from depth map to point cloud with pcl_range_image in the common module of the PCL point cloud library; a depth-map pixel (u, v) is converted into a point (x_w, y_w, z_w) in the world coordinate system according to equation (11), where Z_C is the depth value of the point and f_x, f_y, C_x, C_y are the camera internal parameters:
x_w = (u - C_x)·Z_C / f_x,   y_w = (v - C_y)·Z_C / f_y,   z_w = Z_C   (11)
Step 3.3: removing outliers by using a statistical filtering method, and completing point cloud down-sampling by using a voxel filtering method; the method comprises the following specific steps:
step 3.3.1: for each point, firstly calculating the average distance from the point to all adjacent points in a statistical filtering method; assuming that the obtained distribution is Gaussian distribution, and calculating a mean value mu and a standard deviation sigma;
step 3.3.2: calculating the mean value μ and the standard deviation σ of all the average distances, and then expressing the distance threshold dmax as dmax = μ + α × σ, where α is a constant that can be called the proportionality coefficient; traversing the point cloud again, regarding points whose average distance to their k nearest neighbors is larger than dmax as outliers, and removing them from the point cloud data to complete the statistical filtering;
step 3.3.3: voxel filtering is to use AABB bounding box to voxelate the point cloud data and remove noise points and outliers through a voxel grid.
5. The depth camera-based indoor real-time three-dimensional reconstruction method of claim 1, wherein the step 4 comprises:
step 4.1, determining whether a pixel is a corner from the difference between the pixel values of the detected pixel and its neighborhood points: selecting a pixel in the image and denoting its brightness I_a, setting a threshold k, searching on a circle centered at the selected point p and obtaining the brightness; if there are N consecutive points whose brightness I > I_a + k or I < I_a - k, the point can be defined as a FAST feature point, with N taken as 9, and the steps are repeated until all pixels have been traversed; if, within the neighborhood of the feature point, there are enough points whose gray values change sufficiently, it is regarded as a feature point;
step 4.2, feature point screening: dividing the extracted feature points by using a quadtree, setting the number of divided grids of the quadtree to be 25, enabling all the feature points to fall into 25 grids, and selecting the feature point with the highest response value of the FAST feature point in each grid as the feature point, thereby realizing feature point homogenization; then adding 3 layers of Gaussian pixel pyramids and scaling factors, gradually decreasing pyramid templates layer by layer, respectively extracting features under different scales, and enhancing the robustness of feature points; finally, adding the Euclidean distance, determining the number of the extracted feature points, and finishing the screening process of the feature points;
step 4.3, describing feature points: selecting a rectangular frame of side length 20s in the neighborhood of the keypoint and dividing it into 16 sub-rectangles of equal size; computing Haar wavelet responses of size 2s, taking the direction of the longest vector as the dominant orientation, and then accumulating the response values to generate the feature vector (Σdx, Σ|dx|, Σdy, Σ|dy|);
step 4.4, feature matching: if the Euclidean distance is smaller than a set threshold value, the feature points are considered to be successfully matched; or the trace of the Hessian matrix is used as a judgment basis, and the symbols of the traces of the matrix formed by the characteristic points are the same, so that the direction among the characteristic points is proved to be the same, and the characteristic points are successfully matched.
6. The depth camera-based indoor real-time three-dimensional reconstruction method of claim 1, wherein the step 5 is specifically:
step 5.1: inputting the point cloud in the pcd format obtained in the step 3;
step 5.2: registering the point clouds by using the ICP point-to-plane algorithm, which comprises the following steps:
step 5.2.1: calculating the centroid coordinates of each point set and centering the point sets to generate new point sets;
step 5.2.2: constructing the positive definite matrix N from the new point sets, and calculating the maximum eigenvalue of N and its corresponding eigenvector;
step 5.2.3: since the eigenvector corresponding to the maximum eigenvalue is the rotation quaternion that minimizes the sum of squared residuals, converting this quaternion into the rotation matrix R;
step 5.2.4: once R is determined, obtaining the translation vector t directly as the difference between the centroid of one point set and the rotated centroid of the other, as in equation (14);
Equation (12) describes the error function J to be minimized, where pi denotes the source point set and pi′ the target point set:

    J = (1/N) Σi ‖ pi − (R·pi′ + t) ‖²    (12)

Expression for the rotation component R, with qi = pi − p and qi′ = pi′ − p′ the centered points (p and p′ being the centroids of the two point sets):

    R* = argminR Σi ‖ qi − R·qi′ ‖²    (13)

Expression for the translation component t:

    t* = p − R*·p′    (14)
step 5.3, combining the feature point method and the direct method to obtain the optimal camera pose estimation result (a sketch of the closed-form alignment of step 5.2 follows this claim).
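A sketch of the closed-form rigid alignment behind steps 5.2.1–5.2.4 and equations (12)–(14), assuming the correspondences between the two clouds are already known. It solves for R with the SVD (Kabsch) decomposition instead of the quaternion eigen-decomposition named in the claim; both yield the same minimizer of equation (12). In a full ICP loop this solve is repeated after re-computing nearest-neighbor (or point-to-plane) correspondences.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Rigid (R, t) minimizing sum_i || dst_i - (R @ src_i + t) ||^2.
    Relative to equation (12), src plays the role of p_i' and dst of p_i."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)   # step 5.2.1: centroids
    q_src, q_dst = src - c_src, dst - c_dst             # centered point sets
    H = q_src.T @ q_dst                                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                      # step 5.2.3: optimal rotation
    if np.linalg.det(R) < 0:                            # a reflection is not a valid rotation
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = c_dst - R @ c_src                               # step 5.2.4 / equation (14)
    return R, t
```

For the point-to-plane variant named in step 5.2, Open3D's `registration_icp` with `TransformationEstimationPointToPlane()` runs the same outer loop but uses the target normals in the error term.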
7. The depth camera-based indoor real-time three-dimensional reconstruction method of claim 1, wherein the step 6 is specifically:
step 6.1: creating a feature dictionary: creating and training the dictionary by using the SURF (speeded-up robust features) descriptor algorithm, and partitioning the dictionary with a k-d tree, with the parameters set as follows: branching factor k = 10, depth d = 5;
step 6.2: completing the construction process of the dictionary: storing the extracted image features into a container, and calling a DBoW dictionary generation interface;
step 6.3: judging whether a loop closure has occurred by using a TF-IDF weighting method; representing the difference between the two images X and Y by means of the L1 norm;
step 6.4: if the similarity between the current frame and some earlier key frame exceeds 3 times the similarity between the current frame and its previous key frame, considering that a loop exists and passing the information to the back end for optimization; when the two frames are spatially consistent, performing feature matching on the two loop-detected frames and estimating the camera motion; loop detection yields tighter and more accurate constraints than the back end alone, and these constraints produce a topologically consistent trajectory map; once a closed loop is detected and optimized, the accumulated error is eliminated (a sketch of the similarity test follows this claim).
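A small sketch of the similarity test in steps 6.3–6.4, assuming each image has already been converted into a TF-IDF-weighted bag-of-words vector (for example by DBoW); the L1-based score formula and the factor 3 follow the claim, while the vector contents are assumed inputs.

```python
import numpy as np

def l1_similarity(v1, v2):
    """Similarity derived from the L1 distance between normalized BoW vectors.
    Returns 1.0 for identical vectors and 0.0 for disjoint ones."""
    v1 = v1 / np.abs(v1).sum()
    v2 = v2 / np.abs(v2).sum()
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

def is_loop_candidate(v_current, v_candidate, v_prev_keyframe, ratio=3.0):
    """Step 6.4: accept the candidate key frame as a loop closure only if it is
    'ratio' times more similar to the current frame than the previous key frame is."""
    s_cand = l1_similarity(v_current, v_candidate)
    s_prev = l1_similarity(v_current, v_prev_keyframe)
    return s_cand > ratio * s_prev
```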
8. The depth camera-based indoor real-time three-dimensional reconstruction method according to claim 1, wherein the step 7 is specifically:
step 7.1: expressing point cloud data by using an octree;
step 7.2: estimating an implicit function of the underlying surface from the known point cloud data; using this function to classify the vertices of the octree nodes, marking them as lying inside or outside the surface respectively;
step 7.3: extracting the triangular mesh surface with the marching cubes (MC) algorithm;
step 7.4: fusing the mesh color values in a weighted manner to obtain a three-dimensional reconstruction model with color texture (a hedged fusion-and-meshing sketch follows this claim).
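A hedged Open3D sketch of the fusion-and-meshing stage in step 7: it integrates the registered frames into a scalable TSDF volume and extracts a colored mesh with marching cubes. This is not the patent's octree/implicit-function pipeline verbatim, and the `frames` iterable, the camera `intrinsic` and the parameter values are assumptions made for illustration.

```python
import numpy as np
import open3d as o3d

# Assumed inputs: `frames` yields (color, depth, pose) with color/depth as
# o3d.geometry.Image and pose as the 4x4 camera-to-world matrix from step 5;
# `intrinsic` is an o3d.camera.PinholeCameraIntrinsic for the depth camera.
volume = o3d.pipelines.integration.ScalableTSDFVolume(
    voxel_length=0.01,                      # 1 cm voxels (illustrative)
    sdf_trunc=0.04,                         # truncation distance of the signed distance field
    color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

for color, depth, pose in frames:
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)
    volume.integrate(rgbd, intrinsic, np.linalg.inv(pose))   # extrinsic = world-to-camera

mesh = volume.extract_triangle_mesh()       # marching-cubes surface with fused vertex colors
mesh.compute_vertex_normals()
o3d.io.write_triangle_mesh("reconstruction.ply", mesh)
```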
CN202210440569.4A 2022-04-25 2022-04-25 Indoor real-time three-dimensional reconstruction method based on depth camera Pending CN114782628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210440569.4A CN114782628A (en) 2022-04-25 2022-04-25 Indoor real-time three-dimensional reconstruction method based on depth camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210440569.4A CN114782628A (en) 2022-04-25 2022-04-25 Indoor real-time three-dimensional reconstruction method based on depth camera

Publications (1)

Publication Number Publication Date
CN114782628A true CN114782628A (en) 2022-07-22

Family

ID=82433739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210440569.4A Pending CN114782628A (en) 2022-04-25 2022-04-25 Indoor real-time three-dimensional reconstruction method based on depth camera

Country Status (1)

Country Link
CN (1) CN114782628A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294277A (en) * 2022-08-10 2022-11-04 广州沃佳科技有限公司 Three-dimensional reconstruction method and device of object, electronic equipment and storage medium
CN115222790A (en) * 2022-08-11 2022-10-21 中国科学技术大学 Single photon three-dimensional reconstruction method, system, equipment and storage medium
CN115222790B (en) * 2022-08-11 2022-12-30 中国科学技术大学 Single photon three-dimensional reconstruction method, system, equipment and storage medium
WO2024066891A1 (en) * 2022-09-30 2024-04-04 合肥美亚光电技术股份有限公司 Curved-surface fusion method and apparatus, and medical imaging device
CN116662728A (en) * 2023-07-28 2023-08-29 北京图知天下科技有限责任公司 Roof plane extraction method, electronic equipment and storage medium
CN116662728B (en) * 2023-07-28 2023-10-27 北京图知天下科技有限责任公司 Roof plane extraction method, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
KR101195942B1 (en) Camera calibration method and 3D object reconstruction method using the same
Zhu et al. Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs
Beall et al. 3D reconstruction of underwater structures
Fitzgibbon et al. Automatic 3D model acquisition and generation of new images from video sequences
CN110853075B (en) Visual tracking positioning method based on dense point cloud and synthetic view
US9426444B2 (en) Depth measurement quality enhancement
CN114782628A (en) Indoor real-time three-dimensional reconstruction method based on depth camera
CN111899328B (en) Point cloud three-dimensional reconstruction method based on RGB data and generation countermeasure network
Ummenhofer et al. Point-based 3D reconstruction of thin objects
CN110211169B (en) Reconstruction method of narrow baseline parallax based on multi-scale super-pixel and phase correlation
CN113362457B (en) Stereoscopic vision measurement method and system based on speckle structured light
Xu et al. Survey of 3D modeling using depth cameras
CN112630469B (en) Three-dimensional detection method based on structured light and multiple light field cameras
Yuan et al. 3D reconstruction of background and objects moving on ground plane viewed from a moving camera
CN110245199A (en) A kind of fusion method of high inclination-angle video and 2D map
Furukawa High-fidelity image-based modeling
CN115222884A (en) Space object analysis and modeling optimization method based on artificial intelligence
CN114463521B (en) Building target point cloud rapid generation method for air-ground image data fusion
CN117036612A (en) Three-dimensional reconstruction method based on nerve radiation field
Zhao et al. Mvpsnet: Fast generalizable multi-view photometric stereo
CN112243518A (en) Method and device for acquiring depth map and computer storage medium
CN116681839B (en) Live three-dimensional target reconstruction and singulation method based on improved NeRF
Szeliski et al. Stereo correspondence
Jisen A study on target recognition algorithm based on 3D point cloud and feature fusion
Ortiz-Cayon et al. Automatic 3d car model alignment for mixed image-based rendering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination