CN109544677B - Indoor scene main structure reconstruction method and system based on depth image key frame - Google Patents

Indoor scene main structure reconstruction method and system based on depth image key frame

Info

Publication number
CN109544677B
CN109544677B (application CN201811278361.7A)
Authority
CN
China
Prior art keywords
depth image
point
frame
camera
coordinate system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811278361.7A
Other languages
Chinese (zh)
Other versions
CN109544677A (en)
Inventor
周元峰
高凤毅
张彩明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201811278361.7A priority Critical patent/CN109544677B/en
Publication of CN109544677A publication Critical patent/CN109544677A/en
Application granted granted Critical
Publication of CN109544677B publication Critical patent/CN109544677B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T7/00 - Image analysis
    • G06T7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20024 - Filtering details
    • G06T2207/20028 - Bilateral filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The method and system for reconstructing the main structure of an indoor scene based on depth image key frames acquire a depth image: a depth image is acquired from a depth camera and processed to obtain corresponding point cloud data, and normal vectors are obtained from the point cloud data; the camera pose matrix of the current frame is calculated; whether the current frame depth image is a key frame depth image is judged, and if so, the current frame depth image is added to the key frame sequence; for each frame depth image added to the key frame sequence, a main structure plane equation set is calculated; the main structure plane equations are converted from the camera coordinate system to the world coordinate system based on the camera pose matrix; the newly added main structure plane equation set is registered and fused with the previously fused main structure plane equation set; and the main structure of the indoor scene is reconstructed from the finally fused main structure plane equation set once all frames in the key frame sequence have been processed.

Description

Indoor scene main structure reconstruction method and system based on depth image key frame
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a method and a system for reconstructing a main structure of an indoor scene based on a depth image key frame.
Background
The Kinect sensor is an RGB-D sensor that can obtain color values and depth values of the environment simultaneously; its fast acquisition speed, high precision, and low price have led to its application in many fields. Object recognition, scene segmentation, three-dimensional reconstruction and related problems for indoor scenes based on depth images have accordingly become research hotspots.
A depth map obtained by the Kinect camera can be mapped into space to form a set of discrete three-dimensional point clouds; camera pose estimation and registration fusion are carried out on these point clouds, and reconstruction algorithms that finally produce a three-dimensional model of the indoor scene are being continuously optimized.
Resolving the main structure of an indoor scene from a single-frame depth image provides a foundation for locating beds, tables, doors and windows in the scene at a later stage. In recent years, a number of researchers have proposed methods for analyzing the main structure of indoor scenes, but because the limited field of view of a single-frame image cannot accurately identify the real main structure of the scene, some of the analysis results are wrong, and the mislabeling affects later object recognition.
In recent years, three-dimensional reconstruction based on various depth cameras, online images, unordered point clouds and the like has become a very popular research direction in the fields of computer graphics and computer vision. SLAM (Simultaneous Localization and Mapping) has become a research hotspot in the field of robotics; its aim is to map the surrounding physical environment while tracking a user or robot. Scanning indoor environments with a Kinect sensor to obtain depth data for modeling has been the subject of much research. The University of Washington and Microsoft Research (Henry P, Krainin M, Herbst E, et al. RGB-D mapping: Using depth cameras for dense 3D modeling of indoor environments [C]) developed a visual SLAM system based on SIFT (Scale Invariant Feature Transform) feature matching for localization and the TORO (Tree-based network Optimizer) optimization algorithm to build 3D point cloud maps. Newcombe R A, Izadi S, Hilliges O, et al. Kinectfusion: Real-time dense surface mapping and tracking [C]// Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on. IEEE, 2011: 127-136 presents Kinectfusion, a real-time dense surface mapping and tracking system. Kähler O, Prisacariu V A, Ren C Y, et al. Very high frame rate volumetric integration of depth images on mobile devices [J]. IEEE Transactions on Visualization and Computer Graphics, 2015, 21(11): 1241-1250 presents a voxel-hash-based indoor scene reconstruction algorithm that reduces memory through voxel block hashing and also uses GPU acceleration to achieve very high real-time performance.
A number of researchers have proposed various methods for the main structure analysis of indoor scenes. Lee D C, Hebert M, Kanade T. Geometric reasoning for single image structure recovery [C]// Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009: 2136-2143 recovers the room structure of a single image by geometric reasoning; Taylor C J, Cowley A. Parsing indoor scenes using RGB-D imagery [C]// Robotics: Science and Systems. 2013, 8: 401-408 starts from RGB image segmentation, fuses the plane segmentation results, groups the images, and finally converts the problem of extracting the wall layout into an optimal labeling problem solved with dynamic programming; Zhou Y, Xu H, Pan X, et al. Modeling main structures of indoor scenes from a single RGB-D image [J]. Journal of Advanced Mechanical Design, Systems, and Manufacturing, 2016, 10(8): JAMDSM0102 uses a clustering-based method to perform plane fitting and generate the final main structure. Because of the limitation of a single-frame image, all of the above methods can produce a wrong extracted main structure.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the embodiments of the present application provide an indoor scene main structure reconstruction method based on depth image key frames. The extraction of the main structure from a single-frame depth image is extended to multiple frames: the transformation matrix of each frame is calculated using the camera pose estimation of the Kinectfusion algorithm, the main structure plane equations of multiple key frames are registered and fused, and a three-dimensional model of the indoor scene main structure is generated.
The first aspect provides a method for reconstructing a main structure of an indoor scene based on a depth image key frame;
the method for reconstructing the main structure of the indoor scene based on the key frame of the depth image comprises the following steps:
step (1): acquiring a depth image: acquiring a depth image from a depth camera, processing the depth image to obtain corresponding point cloud data, and obtaining a normal vector according to the point cloud data;
step (2): based on point cloud data and normal vectors, searching corresponding pixel points of the current frame depth image in the previous frame depth image through a projection algorithm, and calculating a camera pose matrix of the current frame by minimizing the distance from the pixel points of the current frame depth image to the tangent plane of the corresponding pixel points of the previous frame depth image;
and (3): judging whether the current frame depth image is a key frame depth image or not, and if so, adding the current frame depth image into the key frame sequence;
and (4): calculating a main structure plane equation set for each frame of depth image added into the key frame sequence; converting the main structure plane equation from a camera coordinate system to a world coordinate system based on the camera pose matrix;
and (5): registering and fusing the newly added main structure plane equation set with the previously fused main structure plane equation set;
and (6): and reconstructing the main structure of the indoor scene according to the finally fused main structure plane equation set until all frames in the key frame sequence are processed.
Optionally, in some possible implementations, the step (1) includes:
a step (101): calibrating the Kinect depth camera to obtain internal parameters of the Kinect depth camera;
a step (102): scanning an indoor scene by using a Kinect depth camera to obtain a depth image;
step (103): performing noise reduction on the obtained depth image by using a bilateral filtering algorithm, calculating the point cloud three-dimensional coordinate corresponding to each pixel point of the noise-reduced depth image through the internal parameters of the Kinect depth camera, thereby obtaining three-dimensional point cloud data, which is denoted Vi, where i represents the i-th frame;
a step (104): computing the normal vector Ni(x, y) of pixel (x, y):
Ni(x,y)=(Vi(x+1,y)-Vi(x,y))×(Vi(y+1,x)-Vi(x,y));
wherein × represents the cross product; Vi(x+1, y) represents the three-dimensional point cloud data of pixel point (x+1, y) of the i-th frame image; Vi(x, y) represents the three-dimensional point cloud data of pixel point (x, y) of the i-th frame image; Vi(y+1, x) represents the three-dimensional point cloud data of pixel point (y+1, x) of the i-th frame image;
the normal vectors of all pixel points in the current frame are then calculated in the same way.
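As an illustration of steps (103)-(104), a minimal Python sketch is given below for back-projecting a depth image into a point cloud with pinhole intrinsics and estimating per-pixel normals by the cross product of neighboring point differences. The intrinsic values, the function names, and the omission of the bilateral filtering step are illustrative assumptions, not part of the claimed method.

    import numpy as np

    def depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
        """depth: HxW array in meters; returns the HxWx3 camera-frame points Vi."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        return np.dstack((x, y, z))

    def point_cloud_normals(V):
        """Normal at each pixel: cross product of the two forward-difference vectors."""
        N = np.zeros_like(V, dtype=float)
        d_right = V[:, 1:, :] - V[:, :-1, :]   # Vi(x+1, y) - Vi(x, y)
        d_down = V[1:, :, :] - V[:-1, :, :]    # neighbor along the other image axis
        N[:-1, :-1, :] = np.cross(d_right[:-1, :, :], d_down[:, :-1, :])
        norms = np.linalg.norm(N, axis=2, keepdims=True)
        np.divide(N, norms, out=N, where=norms > 0)
        return N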
Optionally, in some possible implementations, the step (2) includes:
step (201): determining the relation between corresponding pixel points of two frames of depth images at adjacent moments by using a projection method;
step (202): finding the best accuracy of the current relative pose by using the minimum total error among all corresponding pixel points;
step (203): judging whether the set number of iterations has been reached; if so, proceeding to step (204), and if not, repeating steps (201)-(202);
a step (204): and obtaining an optimal relative pose matrix.
Optionally, in some possible implementations, the step (201) includes:
using a camera to sample objects in the same environment at two positions, where Ok denotes the origin of the camera coordinate system at time k and Ok-1 denotes the origin of the camera coordinate system at time k-1;
converting a point in a world coordinate system at the moment k-1 into a three-dimensional point coordinate and a normal vector thereof in a camera coordinate system at the moment k-1 through a rotation matrix; then, converting three-dimensional points in a camera coordinate system at the moment k-1 into pixel points P in an image coordinate system at the moment k-1 by using camera internal parameters;
finding a pixel point P with the same pixel coordinate at the moment k, and calculating the three-dimensional point coordinate and the normal vector of the pixel point P by using the rotation matrix at the moment k;
judging whether the coordinates of the two three-dimensional points corresponding to the pixel point P are consistent at the time k-1 and the time k;
judging whether two normal vectors corresponding to the pixel point P are consistent at the time k-1 and the time k;
if the coordinates of the two three-dimensional points corresponding to the pixel point P are consistent, and the two normal vectors corresponding to the pixel point P are consistent; the pixel point P at the time k-1 and the time k are corresponding.
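A minimal sketch of the projective correspondence search of step (201) is given below, assuming a pinhole intrinsic matrix K and per-frame vertex/normal maps; the threshold values and the function name are illustrative assumptions rather than values from the patent.

    import numpy as np

    def projective_correspondence(p_prev, n_prev, V_k, N_k, K,
                                  dist_thresh=0.05, angle_thresh_deg=20.0):
        """p_prev, n_prev: a 3D point and its normal from frame k-1, already expressed
        in frame k's camera coordinates via the current pose estimate.
        V_k, N_k: HxWx3 vertex and normal maps of frame k. Returns (u, v) or None."""
        if p_prev[2] <= 0:
            return None
        u = int(round(K[0, 0] * p_prev[0] / p_prev[2] + K[0, 2]))
        v = int(round(K[1, 1] * p_prev[1] / p_prev[2] + K[1, 2]))
        h, w = V_k.shape[:2]
        if not (0 <= u < w and 0 <= v < h):
            return None
        p_cur, n_cur = V_k[v, u], N_k[v, u]
        if np.linalg.norm(p_cur - p_prev) > dist_thresh:          # position consistency
            return None
        cos_angle = np.clip(np.dot(n_cur, n_prev), -1.0, 1.0)
        if np.degrees(np.arccos(cos_angle)) > angle_thresh_deg:   # normal consistency
            return None
        return (u, v)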
Optionally, in some possible implementations, the step (202) of calculating the total error includes:
the error between corresponding pixel points P1 and P2 is taken as the distance from P1 to the tangent at P2;
the total error E(Tg,k) over all corresponding pixel points is calculated as:

E(Tg,k) = Σ_{Ωk(u)≠null} ‖ (Tg,k·V̇k(u) - V̂g,k-1(û))T · N̂g,k-1(û) ‖²    (1)

wherein Ωk(u) ≠ null indicates that point u in the point cloud at time k has a corresponding point; Tg,k is a 4×4 pose matrix which represents the absolute pose of the camera at time k in the world coordinate system, the world coordinate system being defined as the camera coordinate system of the first frame; V̇k(u) is the vertex coordinate of point u in the image frame at time k; V̂g,k-1(û) is the vertex coordinate of the corresponding point of u in the image frame at time k-1; N̂g,k-1(û) is the normal vector of the corresponding point; and û is the corresponding point of u in the image at time k-1.
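A minimal sketch of evaluating the total point-to-plane error of formula (1) follows, assuming the correspondences have already been collected into matched arrays; the array layout and function name are assumptions for illustration.

    import numpy as np

    def point_to_plane_error(T_gk, pts_k, pts_prev_g, normals_prev_g):
        """T_gk: 4x4 camera pose at time k; pts_k: Nx3 points of frame k (camera frame);
        pts_prev_g, normals_prev_g: Nx3 corresponding points and normals in the world
        frame at time k-1. Returns the sum of squared point-to-plane distances."""
        pts_h = np.hstack([pts_k, np.ones((len(pts_k), 1))])   # homogeneous coordinates
        pts_g = (T_gk @ pts_h.T).T[:, :3]                      # transform into world frame
        residuals = np.einsum('ij,ij->i', pts_g - pts_prev_g, normals_prev_g)
        return float(np.sum(residuals ** 2))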
Optionally, in some possible implementations, step (204): the optimal solution x is calculated by solving the linear equation system of formula (2) by the least squares method, and the optimal relative pose matrix Tg,k is thereby obtained:

Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · (GT(u)·N̂g,k-1(û))T · x = Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · N̂g,k-1(û)T · (V̂g,k-1(û) - Ṽg,k(u))    (2)

wherein,
x = (β, γ, α, tx, ty, tz)T ∈ R6,
Ṽg,k(u) = T̃g,k^(z-1)·V̇k(u),
where T̃g,k^(z-1) represents the pose matrix obtained in the (z-1)-th iteration and V̇k(u) is the homogeneous representation of the point cloud coordinate of pixel u at time k;
wherein Tg,k represents the camera pose matrix at time k; points in the camera coordinate system at time k are converted into points of the global coordinate system through this camera pose matrix; the pose matrix Tg,k comprises a rotation matrix Rg,k and a translation matrix tg,k; the parameters β, α, γ in the rotation matrix represent the degrees of rotation about the three coordinate axes X, Y and Z; the parameters tx, ty, tz in the translation matrix represent the distances moved along the three coordinate axes X, Y and Z;
N̂g,k(u) and V̂g,k(u) represent the normal map and point cloud map of pixel point u in the world coordinate system at time k; N̂g,k-1(û) and V̂g,k-1(û) represent the normal map and point cloud map, in the world coordinate system at time k-1, of the point û corresponding to pixel point u; Ωk(u) represents the set of corresponding points of point u found by the projection method at time k-1; û is the pixel coordinate obtained by transforming the three-dimensional point at time k corresponding to u into the coordinate system at time k-1 and projecting it into the pixel coordinate system at time k-1; T̃g,k^(z-1) represents the camera pose matrix solved in the (z-1)-th iteration; V̇k(u) is the homogeneous representation of the point cloud coordinate of pixel u at time k; Ṽg,k(u) represents the three-dimensional point coordinate, in the world coordinate system, of pixel u of the depth map at time k after z-1 iterations; G(u) is a 3×6 matrix whose first three columns are the skew-symmetric matrix [Ṽg,k(u)]× and whose last three columns are the 3×3 identity matrix; GT(u) is the transpose of G(u); I3×3 is the 3×3 identity matrix.
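A minimal sketch of the linearized least-squares step of formula (2) is given below: each correspondence contributes one row GT(u)·N̂ and one residual N̂T·(V̂ - Ṽ), and the 6-vector x = (β, γ, α, tx, ty, tz) solves the accumulated normal equations. The small-angle rotation convention and helper names follow the standard Kinectfusion derivation and are assumptions for illustration, not the patent text.

    import numpy as np

    def skew(p):
        return np.array([[0.0, -p[2], p[1]],
                         [p[2], 0.0, -p[0]],
                         [-p[1], p[0], 0.0]])

    def solve_pose_increment(pts_g, pts_prev_g, normals_prev_g):
        """pts_g: Nx3 current points already mapped by the previous pose iterate;
        pts_prev_g, normals_prev_g: Nx3 corresponding points and normals (world frame)."""
        ATA = np.zeros((6, 6))
        ATb = np.zeros(6)
        for p, q, n in zip(pts_g, pts_prev_g, normals_prev_g):
            G = np.hstack([skew(p), np.eye(3)])   # 3x6: first three columns are [p]x
            A = G.T @ n                           # 6-vector, i.e. G(u)^T times the normal
            b = n @ (q - p)                       # point-to-plane residual
            ATA += np.outer(A, A)
            ATb += A * b
        x = np.linalg.solve(ATA, ATb)             # (beta, gamma, alpha, tx, ty, tz)
        beta, gamma, alpha, tx, ty, tz = x
        T_inc = np.eye(4)                         # incremental transform, one common
        T_inc[:3, :3] = [[1.0, alpha, -gamma],    # small-angle sign convention
                         [-alpha, 1.0, beta],
                         [gamma, -beta, 1.0]]
        T_inc[:3, 3] = [tx, ty, tz]
        return T_inc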
Optionally, in some possible implementations, the step (3) includes:
determining whether the rotation transformation matrix between the frame pre-added to the key frame sequence and the current frame in the key frame sequence is greater than a threshold θ0; if it is greater, the pre-added frame is added to the key frame sequence, otherwise it is not added;
Tg,i^(-1) * Tg,k > θ0 (7)
the current frame in the key frame sequence is taken as the k-th frame and the frame newly pre-added to the key frame sequence as the i-th frame; if formula (7) is satisfied, the i-th frame is a key frame and is added to the key frame sequence; if formula (7) is not satisfied, the i-th frame is not added to the key frame sequence, and it is then judged whether the rotation transformation matrix of the (i+1)-th frame is greater than the threshold θ0; if none of 10 consecutive newly pre-added frames exceeds θ0, the 10th frame is added to the key frame sequence.
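A minimal sketch of the key-frame test of formula (7) follows; interpreting the "matrix greater than a threshold" condition as a test on the rotation angle of the relative transform, and the value of θ0, are assumptions used only for illustration.

    import numpy as np

    def is_key_frame(T_gi, T_gk, theta0_deg=15.0):
        """True when the rotation between key frame pose T_gi and candidate pose T_gk
        exceeds theta0; the 10-consecutive-frame fallback of the text is handled by
        the caller."""
        T_rel = np.linalg.inv(T_gi) @ T_gk                      # Tg,i^(-1) * Tg,k
        R_rel = T_rel[:3, :3]
        cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        return np.degrees(np.arccos(cos_a)) > theta0_deg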
Optionally, in some possible implementations, the specific step of calculating the main structure plane equation set for each depth image added to the sequence of key frames in step (4) is:
a step (401): clustering normal vectors of all points to generate a plurality of core normal vectors;
step (402): denoising the clustering result, and removing points which have a direction difference of more than 30 degrees with the core normal vector in the same type of clustering result;
step (403): establishing a distance histogram for the projection distance of each type of point on the core normal vector according to the clustering result;
a step (404): and searching the maximum plane from large to small in the distance histograms of each type, and performing weighted least square fitting on the three-dimensional point cloud in the maximum plane to obtain a main structure plane of the frame depth image.
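A minimal sketch of steps (401)-(404) is given below: normals are grouped around a few core directions, points deviating from the core normal by more than 30 degrees are discarded, the projection distances along the core normal are histogrammed, and the dominant bin is kept as a candidate main-structure plane. Using KMeans as the clustering step, the number of cores, and the bin width are stand-in assumptions for illustration.

    import numpy as np
    from sklearn.cluster import KMeans

    def extract_plane_candidates(points, normals, n_cores=6, bin_width=0.05):
        """points, normals: Nx3 arrays. Returns a list of (core_normal, point_mask)."""
        labels = KMeans(n_clusters=n_cores, n_init=10).fit_predict(normals)
        candidates = []
        for c in range(n_cores):
            members = labels == c
            if not np.any(members):
                continue
            core = normals[members].mean(axis=0)
            core /= np.linalg.norm(core)
            # denoising: keep points within 30 degrees of the core normal
            keep = members & (normals @ core > np.cos(np.radians(30.0)))
            if not np.any(keep):
                continue
            d = points[keep] @ core            # projection distances on the core normal
            nbins = max(1, int(np.ceil((d.max() - d.min()) / bin_width)))
            hist, edges = np.histogram(d, bins=nbins)
            top = int(np.argmax(hist))         # the largest plane of this class
            in_plane = keep.copy()
            in_plane[keep] = (d >= edges[top]) & (d <= edges[top + 1])
            candidates.append((core, in_plane))
        return candidates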
Optionally, in some possible implementations, the main structure plane of the frame depth image is obtained by solving:
min Σ_{(u,v)∈hf} wu,v (A·xu,v + B·yu,v + C·zu,v + D)²,
wherein hf denotes the points of the denoised core region; wu,v is a weight such that the closer the normal vector of point (u, v) is to the core normal, the more that point contributes to the plane fit; A, B, C and D are the parameters of the plane to be fitted.
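A minimal sketch of the weighted least-squares plane fit follows. Enforcing A² + B² + C² = 1 via the smallest eigenvector of the weighted scatter matrix is one standard way of solving this constrained fit; the weight definition and function name are assumptions consistent with the description above.

    import numpy as np

    def fit_plane_weighted(points, normals, core_normal):
        """points, normals: Mx3 arrays of the denoised core-region points hf.
        Returns (A, B, C, D) with A^2 + B^2 + C^2 = 1."""
        w = np.clip(normals @ core_normal, 0.0, 1.0)       # w_{u,v}: agreement with core
        centroid = np.average(points, axis=0, weights=w)
        centered = points - centroid
        M = (centered * w[:, None]).T @ centered           # weighted scatter matrix
        _, vecs = np.linalg.eigh(M)
        n = vecs[:, 0]                                     # direction of least weighted spread
        A, B, C = n
        D = -float(n @ centroid)
        return float(A), float(B), float(C), D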
Optionally, in some possible implementations, the specific step of converting the main structure plane equation to the world coordinate system in step (4) is:
Ak,ix+Bk,iy+Ck,iz+Dk,i=0;
Ak,i² + Bk,i² + Ck,i² = 1;
i={R+,1≤i≤n};
wherein Ak,i, Bk,i, Ck,i, Dk,i ∈ Rn, and n represents the number of main structure planes in the frame depth image;
determining a plane equation according to the plane normal vector and any point on the plane, converting the plane equation in the camera coordinate system into a world coordinate system, and solving a plane parameter equation of the plane equation in the world coordinate system;
using a rotation matrix Rg,kConverting the normal vector from the camera coordinate system to the world coordinate system:
[Ag,k,Bg,k,Cg,k]T=Rg,k[Ak,Bk,Ck]T (3)
using a rotation matrix Rg,kAnd a translation matrix tg,kConverting points in the camera coordinate system to the world coordinate system:
[Xg,k,Yg,k,Zg,k]T=Rg,k[-DkAk,-DkBk,-DkCk]T+tg,k(4)
[-DkAk, -DkBk, -DkCk]T is a point located on the plane;
Dg,k=-[Ag,k,Bg,k,Cg,k][Xg,k,Yg,k,Zg,k]T (5)
according to equations (3) and (5), the plane equation in the camera coordinate system at time k is converted into the plane parameter equation in the world coordinate system as follows:
Ag,kx+Bg,ky+Cg,kz+Dg,k=0 (6).
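A minimal sketch of the coordinate conversion of formulas (3)-(6) follows; the function name is an illustrative assumption.

    import numpy as np

    def plane_camera_to_world(plane_k, R_gk, t_gk):
        """plane_k: (A_k, B_k, C_k, D_k) with unit normal; R_gk: 3x3 rotation;
        t_gk: 3-vector translation. Returns (A_gk, B_gk, C_gk, D_gk)."""
        A, B, C, D = plane_k
        n_k = np.array([A, B, C], dtype=float)
        n_g = R_gk @ n_k                                  # formula (3): rotate the normal
        p_k = np.array([-D * A, -D * B, -D * C])          # a point lying on the plane
        p_g = R_gk @ p_k + t_gk                           # formula (4): move the point
        D_g = -float(n_g @ p_g)                           # formula (5)
        return float(n_g[0]), float(n_g[1]), float(n_g[2]), D_g   # formula (6)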
optionally, in some possible implementations, the specific step of step (5) is:
step (501): computing pairwise cosine distances between the plane normal vectors of the input frame and all normal vectors of the currently fused plane equations, and judging whether formula (8) is satisfied:
abs(Ncur,i*Nnew,k)>θ1 (8)
wherein Ncur,i represents the i-th normal vector among the plane equations of the currently fused key frame sequence, Nnew,k represents the k-th normal vector among the plane equations of the newly added key frame, and θ1 is a set threshold;
step (502): if the two planes meet the formula (8), judging whether the two planes are the same plane according to the equation parameter D of the two planes, and if the two planes meet the formula (9), indicating that the two planes are the same plane; if the formula (9) is not satisfied, judging that the two planes are parallel, and directly entering the step (503);
abs(Di-Dk)<θ2 (9)
wherein Di represents the constant term of the plane parameter equation in the i-th frame corresponding to formula (8), Dk represents the constant term of the plane parameter equation in the k-th frame corresponding to formula (8), and θ2 is a set threshold;
a step (503): registration plane weighted fusion:
the registered planes in the input frame and current plane equations are weighted and fused using equation (10):
Pmerge=wPcur,i+(1-w)Pnew,k (10)
wherein Pcur,i represents a plane parameter equation resulting from the fusion of the current key frame sequence, Pnew,k represents a plane parameter equation of the newly added key frame, w represents the fusion weight, and Pmerge represents the fused plane parameter equation.
Step (504): generating a new plane set:
Pnow={Pmerge,Pcur,n,Pnew,m}
wherein Pcur,n represents the planes of the current key frame sequence that are not registered with any plane of the newly added key frame, and Pnew,m represents the planes of the newly added key frame that are not registered with any plane of the current key frame sequence.
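A minimal sketch of steps (501)-(504) follows, using the threshold values reported later in the experiments (θ1 = 0.93, θ2 = 0.5, w = 0.4); the greedy one-to-one matching order and the function name are illustrative assumptions.

    import numpy as np

    def fuse_plane_sets(P_cur, P_new, theta1=0.93, theta2=0.5, w=0.4):
        """P_cur, P_new: lists of planes (A, B, C, D) with unit normals.
        Returns the new plane set P_now = {P_merge, P_cur_n, P_new_m}."""
        P_cur = [np.asarray(p, dtype=float) for p in P_cur]
        P_new = [np.asarray(p, dtype=float) for p in P_new]
        fused, used_new = [], set()
        for p in P_cur:
            merged = None
            for j, q in enumerate(P_new):
                if j in used_new:
                    continue
                same_orientation = abs(np.dot(p[:3], q[:3])) > theta1   # formula (8)
                same_offset = abs(p[3] - q[3]) < theta2                 # formula (9)
                if same_orientation and same_offset:
                    merged = w * p + (1.0 - w) * q                      # formula (10)
                    used_new.add(j)
                    break
            fused.append(merged if merged is not None else p)           # P_merge or P_cur_n
        fused += [q for j, q in enumerate(P_new) if j not in used_new]  # P_new_m
        return fused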
Optionally, in some possible implementations, the specific step of step (6) is:
according to the plane set P after fusion registrationnowAnd solving an intersection line between every two planes and an intersection point between the three planes so as to construct an indoor scene main structure.
In a second aspect, an indoor scene main structure reconstruction system based on a depth image key frame is provided;
indoor scene main structure reconstruction system based on depth image key frame includes: a depth camera and a processor;
the depth camera is used for acquiring a depth image of an indoor scene;
the processor is used for acquiring a depth image from the depth camera, processing the depth image to obtain corresponding point cloud data, and obtaining normal vectors from the point cloud data; based on the point cloud data and normal vectors, searching for the pixel points of the current frame depth image that correspond to pixel points of the previous frame depth image through a projection algorithm, and calculating the camera pose matrix of the current frame by minimizing the distances from the pixel points to the tangent planes of the corresponding pixel points; judging whether the current frame depth image is a key frame depth image, and if so, adding it to the key frame sequence; calculating a main structure plane equation set for each frame depth image added to the key frame sequence; converting the main structure plane equations into the world coordinate system based on the camera pose matrix; registering and fusing the newly added main structure plane equation set with the previously fused main structure plane equation set; and reconstructing the indoor scene main structure from the finally fused main structure plane equation set once all frames in the key frame sequence have been processed.
In a third aspect, an electronic device is provided;
an electronic device, comprising: a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when executed by the processor, complete the steps of any of the above methods.
In a fourth aspect, a computer-readable storage medium is presented;
a computer readable storage medium having computer instructions embodied thereon, which, when executed by a processor, perform the steps of any of the above methods.
Compared with the prior art, the beneficial effects of the embodiment of the application are that:
and registering corresponding points of an indoor scene three-dimensional reconstruction frame based on a Kinectfusion algorithm to acquire the camera pose of each frame, and acquiring the camera pose change between key frames. If the camera pose change between the subsequent frame and the previous key frame exceeds the threshold value, adding the frame into the key frame sequence, and if the camera pose change is less than theta0Then the key frame is fetched at regular intervals. And clustering by taking the normal vector of each point in the key frame depth image as input, extracting the plane of the main structure of the indoor scene according to the clustering result, registering according to the pose of the key frame, fusing the plane equation of the main structure, and reconstructing the complete main structure of the indoor scene through plane intersection. The method is characterized by comprising the steps of expanding to multiple frames on the basis of a single-frame image, carrying out registration fusion on adjacent key frames based on a rotation matrix of a Kinectfusion algorithm, and finally generating a three-dimensional indoor scene main structure reconstruction model of a key frame sequence.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a Kinectfusion algorithm flow;
FIG. 2(a) shows two consecutive samples of the same scene;
FIG. 2(b) is a schematic diagram of a projection method point cloud corresponding point search and point-to-surface error mechanism;
FIG. 3(a) is an RGB map extracted from the main structure plane;
FIG. 3(b) is a main structure diagram extracted from the main structure plane;
FIG. 3(c) is a graph of the results of main structure plane extraction;
FIG. 4 is a schematic plan view of an intersection;
FIGS. 5(a)-5(g) are partial key frames of Bedroom-0091;
FIG. 5(h) shows Kinectfusion reconstruction results;
FIG. 5(i) is the main structure point cloud reconstruction result;
FIG. 5(j) is a reconstruction result of an embodiment of the present application;
FIGS. 6(a)-6(g) are partial key frames of Bedroom-0097;
FIG. 6(h) shows Kinectfusion reconstruction results;
FIG. 6(i) is the main structure point cloud reconstruction result;
FIG. 6(j) is a reconstruction result of an embodiment of the present application;
fig. 7 is a flowchart of an embodiment of the present application.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used in the examples herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
With the popularization of depth cameras, depth images have been widely applied in various fields of image processing because they contain depth information. Analyzing the main structure of an indoor scene from input depth images is a challenging task in the fields of image analysis and computer vision; continuous depth image input can reduce the cases in which the main structure extracted from a single depth image is wrong and can optimize the extracted main structure. The present application combines indoor scene main structure extraction with visual SLAM to generate a complete three-dimensional model of the indoor scene main structure. The camera pose of each frame is obtained by registering corresponding points within the Kinectfusion indoor scene three-dimensional reconstruction framework, and the camera pose change between key frames is obtained. If the camera pose change between a subsequent frame and the previous key frame exceeds a threshold, the frame is added to the key frame sequence; if the camera pose change remains small, a key frame is taken at a fixed interval. The normal vector of each point in the key frame depth image is taken as the input of clustering, the planes of the indoor scene main structure are extracted from the clustering result, registration is performed according to the key frame poses, the main structure plane equations are fused, and the complete indoor scene main structure is reconstructed through plane intersection.
The Kinectfusion algorithm realizes 3D scene reconstruction by matching, positioning and fusing depth data acquired by Kinect. The algorithm flow of the method is shown in fig. 1 and 7, and mainly comprises four parts:
a) the depth data processing is to convert the original depth data of the sensor into 3D point cloud to obtain the three-dimensional coordinates and normal vectors of the vertexes in the point cloud;
b) performing ICP (Iterative Closest Point) matching between the current frame 3D point cloud and the predicted 3D point cloud generated from the current model, and calculating the pose of the current-frame camera;
c) point cloud fusion, namely fusing the 3D point cloud of the current frame into the existing model by using the TSDF point cloud fusion algorithm [Curless B, Levoy M. A volumetric method for building complex models from range images [C]// Proceedings of ACM SIGGRAPH. New York, USA: ACM, 1996: 303-312];
d) scene rendering, namely predicting the environmental point cloud observed by the current camera according to the existing model and the current camera pose by using a ray tracing method, and feeding the environmental point cloud to a user on one hand and providing the environmental point cloud for b) to carry out ICP matching on the other hand.
ICP (Iterative Closest Point, i.e., the nearest-point iterative algorithm) positioning in Kinectfusion
In the ICP positioning link, when matching the current frame 3D point cloud with the predicted 3D point cloud, the method is realized by the following steps:
(1) The relationship between corresponding points is determined using a projective method. The process is illustrated with a 2-dimensional ICP: the black curve in FIG. 2(a) is an object in the environment, which the camera samples and predicts at two successive locations; Ok and Ok-1 are respectively the origins of the coordinate systems of the current camera and the previous-frame camera. The two point clouds at times k and k-1 are first converted into the coordinate system of the current frame k, and then projected through the camera center Ok onto the image plane; points of the two point clouds that have the same projection point on the image plane are corresponding points, such as P1 and P2 in FIG. 2(b). The algorithm further screens the corresponding points by the Euclidean distance and the normal angle between them.
(2) The accuracy of the current relative pose is measured with a point-to-plane error mechanism. In FIG. 2(b), in the 2-dimensional case, the error between P1 and P2 is the distance d from P1 to the tangent at P2. The total error between all corresponding points is given by:

E(Tg,k) = Σ_{Ωk(u)≠null} ‖ (Tg,k·V̇k(u) - V̂g,k-1(û))T · N̂g,k-1(û) ‖²    (1)

wherein Ωk(u) ≠ null indicates that a corresponding point exists for point u in the current point cloud; Tg,k is a 4×4 pose matrix which represents the absolute pose of the camera of the current frame in the world coordinate system, the world coordinate system being defined as the camera coordinate system of the first frame; V̇k(u) is the vertex coordinate of point u in the current frame; V̂g,k-1(û) is the vertex coordinate of the corresponding point of u in the predicted frame; N̂g,k-1(û) is the normal vector of the corresponding point.
(3) The optimal relative pose Tg,k is obtained by optimizing formula (1).
The optimization problem is converted into a least-squares problem through linearization, and the optimal solution x is calculated by solving the linear equation system of formula (2):

Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · (GT(u)·N̂g,k-1(û))T · x = Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · N̂g,k-1(û)T · (V̂g,k-1(û) - Ṽg,k(u))    (2)

wherein x = (β, γ, α, tx, ty, tz)T ∈ R6, Ṽg,k(u) = T̃g,k^(z-1)·V̇k(u), T̃g,k^(z-1) represents the pose matrix obtained in the (z-1)-th iteration, V̇k(u) is the homogeneous representation of the point cloud coordinate of pixel u at time k, and G(u) = [[Ṽg,k(u)]× | I3×3] is the corresponding 3×6 matrix.
(4) Iterating the step (1) to the step (2) for ten times
1.1 Main Structure plane extraction based on Single frame depth image
In the embodiment of the application, an indoor scene main structure is extracted according to each input frame of depth image, and a plane parameter equation is adopted to express a main structure plane in the frame. The method mainly comprises the following steps:
1) clustering is carried out according to the normal vector, and the density of each point and the distance between each point and a point with larger density are calculated according to the cosine distance to obtain a clustering center;
2) denoising the clustering result, and removing points which have too large difference with the direction of the core normal vector in the same type of clustering result;
3) establishing a distance histogram for the projection distance of each type of points in the core normal direction according to the clustering result;
4) large planes are searched in the distance histogram of each class in descending order of size, and weighted least-squares fitting is performed on the three-dimensional point cloud within each large plane to obtain the parameter equation of the main structure plane in the frame:
min Σ_{(u,v)∈hf} wu,v (A·xu,v + B·yu,v + C·zu,v + D)²,
wherein hf denotes the points of the denoised core region;
2 indoor scene main structure reconstruction
2.1 transformation of the coordinate System
Through the preprocessing of the single-frame depth images, the transformation matrix Tg,k from each frame image to the world coordinate system and the parameter equations of the indoor scene main structure planes of each frame image are obtained: Ak,i·x + Bk,i·y + Ck,i·z + Dk,i = 0, Ak,i² + Bk,i² + Ck,i² = 1, i = {R+, 1 ≤ i ≤ n}, Ak,i, Bk,i, Ck,i, Dk,i ∈ Rn, where n represents the number of main structure planes in the frame. A plane equation is determined by the plane normal vector and a point on the plane, and the plane equation in the camera coordinate system is converted into the world coordinate system to obtain its plane parameter equation in the world coordinate system.
Using Rg,k, the normal vector is converted from the camera coordinate system to the world coordinate system:
[Ag,k,Bg,k,Cg,k]T=Rg,k[Ak,Bk,Ck]T (3)
Using Rg,k and tg,k, a point in the camera coordinate system is converted to the world coordinate system:
[Xg,k,Yg,k,Zg,k]T=Rg,k[-DkAk,-DkBk,-DkCk]T+tg,k (4)
[-DkAk,-DkBk,-DkCk]Tis a point located on the plane.
Dg,k=-[Ag,k,Bg,k,Cg,k][Xg,k,Yg,k,Zg,k]T (5)
According to the formulas (3) and (5), the plane equation in the k frame camera coordinate system is converted into the plane parameter equation in the world coordinate system as follows:
Ag,kx+Bg,ky+Cg,kz+Dg,k=0 (6)
2.2 Key frame selection
In order to accelerate the main structure extraction and reconstruction, the embodiment of the application selects a key frame sequence to reconstruct the main structure of the indoor scene. Whether a frame is added to the key frame sequence is determined by judging whether the rotation transformation matrix between the two frames is greater than a threshold θ0:
Tg,i^(-1) * Tg,k > θ0 (7)
The current frame in the key frame sequence is taken as the k-th frame and the newly added frame as the i-th frame; if formula (7) is satisfied, the i-th frame is a key frame and is added to the key frame sequence. If not, the newly added frame is not added to the key frame sequence, and it is then judged whether the rotation transformation matrix of the next frame is greater than the user-set threshold θ0. If none of 10 consecutive newly added frames exceeds θ0, the 10th frame is added to the key frame sequence.
2.3 planar registration fusion
1) The cosine distances between the plane normal vectors of the input frame and all the currently fused planes are computed pairwise, and it is judged whether the following formula is satisfied:
abs(Ncur,i*Nnew,k)>θ1 (8)
wherein Ncur,i represents a normal vector of a plane equation in the current key frame sequence, Nnew,k represents a normal vector of a plane equation in the newly added k-th key frame, and θ1 takes the value 0.93 in the experiments.
If the normal vectors of the two planes conform to the formula (8), it is determined that the two planes are parallel.
2) In a hexahedral scene, the situation that two wall surfaces are parallel is very likely to occur, under the situation, the normal directions of the two wall surfaces are basically consistent, and if the two planes satisfy the above formula (8), whether the two planes are the same plane is judged according to the equation parameter D of the two planes.
abs(Di-Dk)<θ2 (9)
wherein Di and Dk represent the constant terms of the plane parameter equations corresponding to formula (8) in the i-th and k-th frames, respectively, and θ2 takes the value 0.5 in the experiments.
3) Registration plane weighted fusion
The registered planes (considered computationally identical planes) in the input frame and current plane equations are weighted fused using equation (10).
Pmerge=wPcur,i+(1-w)Pnew,k (10)
wherein Pcur,i represents a plane parameter equation resulting from the fusion of the current key frame sequence, and Pnew,k represents a plane parameter equation of the newly added key frame; w represents the fusion weight and takes the value 0.4 in the experiments.
4) Generating a new set of planes
Pnow={Pmerge,Pcur,n,Pnew,m}
wherein Pcur,n represents the planes in the current key frame sequence that are not registered with any plane of the newly added key frame, and Pnew,m represents the planes in the newly added key frame that are not registered with any plane in the current key frame sequence.
2.4 plane intersection
According to the obtained fused and registered plane set Pnow, the intersection lines and intersection points are solved, and the directions of the intersection lines pointing towards the viewpoint are selected to reconstruct the main structure of the indoor scene; in FIG. 4, the small sun mark is the position of the viewpoint.
Algorithm: indoor scene main structure reconstruction
Input: depth image sequence S = {Si | i = 1, 2, ……, n}
Output: fused and registered plane parameter equation set Pnow
1. Acquire a depth image from the Kinect, perform noise reduction, and compute the point cloud and normal information;
2. Obtain the camera pose Tg,i of each frame using the Kinectfusion algorithm;
3. Judge whether the current frame is added to the key frame sequence (a subset of the input image sequence S) according to formula (7);
4. For each frame depth image added to the key frame sequence, compute the main structure plane equation set Pcur and convert the plane equations into the world coordinate system according to formula (6);
5. Register and fuse the newly added plane equation set with the previous plane equation set according to formulas (8), (9) and (10);
6. Reconstruct the indoor scene main structure from the finally fused plane equation set, once all frames in the key frame sequence have been processed.
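Tying the algorithm above together, a minimal end-to-end Python sketch is given below. It reuses the illustrative helpers sketched earlier in this description (depth_to_point_cloud, point_cloud_normals, is_key_frame, extract_plane_candidates, fit_plane_weighted, plane_camera_to_world, fuse_plane_sets); the estimate_pose callback standing in for the Kinectfusion ICP, and all function names, are assumptions for illustration rather than the claimed implementation.

    import numpy as np

    def reconstruct_main_structure(depth_frames, K, estimate_pose):
        """depth_frames: iterable of HxW depth images; K: 3x3 intrinsic matrix;
        estimate_pose(V, N, prev_V, prev_N, prev_T) -> 4x4 pose Tg,k (Kinectfusion ICP)."""
        P_now, key_poses = [], []
        prev = None
        frames_since_key = 0
        for depth in depth_frames:
            V = depth_to_point_cloud(depth, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
            N = point_cloud_normals(V)
            T_gk = np.eye(4) if prev is None else estimate_pose(V, N, *prev)
            prev = (V, N, T_gk)
            # key frame selection (formula (7)), with the 10-frame fallback
            if key_poses and not is_key_frame(key_poses[-1], T_gk):
                frames_since_key += 1
                if frames_since_key < 10:
                    continue
            key_poses.append(T_gk)
            frames_since_key = 0
            valid = V[:, :, 2].reshape(-1) > 0
            pts = V.reshape(-1, 3)[valid]
            nrm = N.reshape(-1, 3)[valid]
            # per-key-frame main structure planes, converted to the world frame
            P_new = []
            for core, mask in extract_plane_candidates(pts, nrm):
                plane_cam = fit_plane_weighted(pts[mask], nrm[mask], core)
                P_new.append(plane_camera_to_world(plane_cam, T_gk[:3, :3], T_gk[:3, 3]))
            P_now = fuse_plane_sets(P_now, P_new) if P_now else P_new
        return P_now   # intersect the planes of P_now to obtain the final main structure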
4 results of the experiment
Experimental data are taken from the NYU v2 data set, which is widely used by researchers. We reconstruct the main structure of the indoor scene from Bedroom-0091 and Bedroom-0097, and compare the method with a point-cloud-based indoor scene main structure reconstruction method, as shown in FIGS. 5(a)-5(j) and 6(a)-6(j). As can be seen from the result figures, for the cabinet at the corner in Bedroom-0091, because the wall plane extends leftwards and is occluded by the cabinet, the main structure extraction does not obtain a left plane, and the point cloud fusion method cannot eliminate redundant main structure planes, which affects the subsequent rendering result; the redundant part of the earlier main structure fitting, however, can easily be eliminated through the plane intersection of the method in the embodiment of the application. The ceiling in the upper right corner of Bedroom-0097 is not clearly shown in the Kinectfusion and point cloud main structure reconstructions because its area is too small, but the ceiling plane can be explicitly reconstructed by the main structure extraction and reconstruction method in the embodiment of the application.
Compared with the extraction of the main structure of the indoor scene of a single frame image, the method can more reliably extract the main structure of the indoor scene, and avoids the situation of the extraction error of the main structure caused by the shielding of a large plane; meanwhile, the main structure of the scene can be completely reconstructed by reconstructing the main structure of the indoor scene based on the plane parameters, and a basic frame is provided for understanding the indoor scene and adding objects into the scene one by one in the later period. Compared with the reconstruction of the main structure of the indoor scene based on the point cloud, the reconstruction method based on the plane equation in the embodiment of the application has the obvious effect of being more efficient.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. The method for reconstructing the main structure of the indoor scene based on the key frame of the depth image is characterized by comprising the following steps:
step (1): acquiring a depth image: acquiring a depth image from a depth camera, processing the depth image to obtain corresponding point cloud data, and obtaining a normal vector according to the point cloud data;
step (2): based on point cloud data and normal vectors, searching corresponding pixel points of the current frame depth image in the previous frame depth image through a projection algorithm, and calculating a camera pose matrix of the current frame by minimizing the distance from the pixel points of the current frame depth image to the tangent plane of the corresponding pixel points of the previous frame depth image;
and (3): judging whether the current frame depth image is a key frame depth image or not, and if so, adding the current frame depth image into the key frame sequence;
and (4): calculating a main structure plane equation set for each frame of depth image added into the key frame sequence; converting the main structure plane equation from a camera coordinate system to a world coordinate system based on the camera pose matrix;
and (5): registering and fusing the newly added main structure plane equation set with the previously fused main structure plane equation set;
and (6): reconstructing an indoor scene main structure according to a final fused main structure plane equation set until all frames in the key frame sequence are processed;
the step (2) comprises the following steps:
step (201): determining the relation between corresponding pixel points of two frames of depth images at adjacent moments by using a projection method;
the step (201) comprises the following steps:
using a camera to sample objects in the same environment at two positions, where Ok denotes the origin of the camera coordinate system at time k and Ok-1 denotes the origin of the camera coordinate system at time k-1;
converting a point in a world coordinate system at the moment k-1 into a three-dimensional point coordinate and a normal vector thereof in a camera coordinate system at the moment k-1 through a rotation matrix; then, converting three-dimensional points in a camera coordinate system at the moment k-1 into pixel points P in an image coordinate system at the moment k-1 by using camera internal parameters;
finding a pixel point P with the same pixel coordinate at the moment k, and calculating the three-dimensional point coordinate and the normal vector of the pixel point P by using the rotation matrix at the moment k;
judging whether the coordinates of the two three-dimensional points corresponding to the pixel point P are consistent at the time k-1 and the time k;
judging whether two normal vectors corresponding to the pixel point P are consistent at the time k-1 and the time k;
if the coordinates of the two three-dimensional points corresponding to the pixel point P are consistent, and the two normal vectors corresponding to the pixel point P are consistent; the pixel point P at the k-1 moment and the k moment is corresponding;
the specific step of calculating the main structure plane equation set for each frame depth image added into the key frame sequence in the step (4) is as follows:
a step (401): clustering normal vectors of all points to generate a plurality of core normal vectors;
step (402): denoising the clustering result, and removing points which have a direction difference of more than 30 degrees with the core normal vector in the same type of clustering result;
step (403): establishing a distance histogram for the projection distance of each type of point on the core normal vector according to the clustering result;
a step (404): and searching the maximum plane from large to small in the distance histograms of each type, and performing weighted least square fitting on the three-dimensional point cloud in the maximum plane to obtain a main structure plane of the frame depth image.
2. The method for reconstructing a main structure of an indoor scene based on a key frame of a depth image as claimed in claim 1, wherein the step (1) comprises the steps of:
a step (101): calibrating the Kinect depth camera to obtain internal parameters of the Kinect depth camera;
a step (102): scanning an indoor scene by using a Kinect depth camera to obtain a depth image;
step (103): performing noise reduction on the obtained depth image by using a bilateral filtering algorithm, calculating the point cloud three-dimensional coordinate corresponding to each pixel point of the noise-reduced depth image through the internal parameters of the Kinect depth camera, thereby obtaining three-dimensional point cloud data, which is denoted Vi, where i represents the i-th frame;
a step (104): computing the normal vector Ni(x, y) of pixel (x, y):
Ni(x,y)=(Vi(x+1,y)-Vi(x,y))×(Vi(y+1,x)-Vi(x,y));
Wherein, x represents cross product, Vi(x +1, y) represents three-dimensional point cloud data of a pixel point (x +1, y) of the ith frame image; vi(x, y) three-dimensional point cloud data representing pixel points (x, y) of the ith frame of image; vi(y +1, x) represents three-dimensional point cloud data of a pixel point (y +1, x) of the ith frame of image;
and then, calculating normal vectors of all pixel points in the current frame.
3. The method for reconstructing a main structure of an indoor scene based on a key frame of a depth image as claimed in claim 1, wherein the step (2) further comprises:
step (202): finding the best accuracy of the current relative pose by using the minimum total error among all corresponding pixel points;
step (203): judging whether the set iteration times are met, if so, entering the step (204), and if not, entering the repeated steps (201) - (202);
a step (204): and obtaining an optimal relative pose matrix.
4. The method as claimed in claim 3, wherein the total error of step (202) is calculated by:
the error between corresponding pixel points P1 and P2 is taken as the distance from P1 to the tangent at P2;
the total error E(Tg,k) over all corresponding pixel points is calculated as:

E(Tg,k) = Σ_{Ωk(u)≠null} ‖ (Tg,k·V̇k(u) - V̂g,k-1(û))T · N̂g,k-1(û) ‖²    (1)

wherein Ωk(u) ≠ null indicates that point u in the point cloud at time k has a corresponding point; Tg,k is a 4×4 pose matrix which represents the absolute pose of the camera at time k in the world coordinate system, the world coordinate system being defined as the camera coordinate system of the first frame; V̇k(u) is the vertex coordinate of point u in the image frame at time k; V̂g,k-1(û) is the vertex coordinate of the corresponding point of u in the image frame at time k-1; N̂g,k-1(û) is the normal vector of the corresponding point; û is the corresponding point of u in the image at time k-1;
a step (204): the optimal solution x is calculated by solving the linear equation system of formula (2) by the least squares method, and the optimal relative pose matrix Tg,k is thereby obtained:

Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · (GT(u)·N̂g,k-1(û))T · x = Σ_{Ωk(u)≠null} (GT(u)·N̂g,k-1(û)) · N̂g,k-1(û)T · (V̂g,k-1(û) - Ṽg,k(u))    (2)

wherein,
x = (β, γ, α, tx, ty, tz)T ∈ R6,
Ṽg,k(u) = T̃g,k^(z-1)·V̇k(u),
where T̃g,k^(z-1) represents the pose matrix obtained in the (z-1)-th iteration and V̇k(u) is the homogeneous representation of the point cloud coordinate of pixel u at time k;
wherein Tg,k is the 4×4 pose matrix; points in the camera coordinate system at time k are converted into points of the global coordinate system through this camera pose matrix; the pose matrix Tg,k comprises a rotation matrix Rg,k and a translation matrix tg,k; the parameters β, α, γ in the rotation matrix represent the degrees of rotation about the three coordinate axes X, Y and Z; the parameters tx, ty, tz in the translation matrix represent the distances moved along the three coordinate axes X, Y and Z;
N̂g,k(u) and V̂g,k(u) represent the normal map and point cloud map of pixel point u in the world coordinate system at time k; N̂g,k-1(û) and V̂g,k-1(û) represent the normal map and point cloud map, in the world coordinate system at time k-1, of the point û corresponding to pixel point u; Ωk(u) represents the set of corresponding points of point u found by the projection method at time k-1; û is the pixel coordinate obtained by transforming the three-dimensional point at time k corresponding to u into the coordinate system at time k-1 and projecting it into the pixel coordinate system at time k-1; T̃g,k^(z-1) represents the camera pose matrix solved in the (z-1)-th iteration; V̇k(u) is the homogeneous representation of the point cloud coordinate of pixel u at time k; Ṽg,k(u) represents the three-dimensional point coordinate, in the world coordinate system, of pixel u of the depth map at time k after z-1 iterations; G(u) is a 3×6 matrix whose first three columns are the skew-symmetric matrix [Ṽg,k(u)]× and whose last three columns are the 3×3 identity matrix; GT(u) is the transpose of G(u); I3×3 is the 3×3 identity matrix.
5. The method for reconstructing a main structure of an indoor scene based on a key frame of a depth image as claimed in claim 1, wherein the step (3) comprises the steps of:
determining whether the rotation transformation matrix between the frame pre-added to the key frame sequence and the current frame in the key frame sequence is greater than a threshold θ0; if it is greater, the pre-added frame is added to the key frame sequence, otherwise it is not added;
Tg,i^(-1) * Tg,k > θ0 (7)
the current frame in the key frame sequence is taken as the k-th frame and the frame newly pre-added to the key frame sequence as the i-th frame; if formula (7) is satisfied, the i-th frame is a key frame and is added to the key frame sequence; if formula (7) is not satisfied, the i-th frame is not added to the key frame sequence, and it is then judged whether the rotation transformation matrix of the (i+1)-th frame is greater than the threshold θ0; if none of 10 consecutive newly pre-added frames exceeds θ0, the 10th frame is added to the key frame sequence.
6. Indoor scene main structure reconstruction system based on depth image key frame, characterized by includes: a depth camera and a processor;
the depth camera is used for acquiring a depth image of an indoor scene;
the processor is used for acquiring a depth image from the depth camera, processing the depth image to obtain corresponding point cloud data, and obtaining normal vectors from the point cloud data; based on the point cloud data and normal vectors, searching for the pixel points of the current frame depth image that correspond to pixel points of the previous frame depth image through a projection algorithm, and calculating the camera pose matrix of the current frame by minimizing the distances from the pixel points to the tangent planes of the corresponding pixel points; judging whether the current frame depth image is a key frame depth image, and if so, adding it to the key frame sequence; calculating a main structure plane equation set for each frame depth image added to the key frame sequence; converting the main structure plane equations into the world coordinate system based on the camera pose matrix; registering and fusing the newly added main structure plane equation set with the previously fused main structure plane equation set; reconstructing the indoor scene main structure from the finally fused main structure plane equation set once all frames in the key frame sequence have been processed;
the step of calculating the camera pose matrix of the current frame comprises:
step (201): determining the relation between corresponding pixel points of two frames of depth images at adjacent moments by using a projection method;
the step (201) comprises the following steps:
using the camera to sample objects in the same environment from two positions, where O_k denotes the origin of the camera coordinate system at time k and O_{k-1} denotes the origin of the camera coordinate system at time k-1;
converting a point in the world coordinate system into its three-dimensional coordinate and normal vector in the camera coordinate system at time k-1 through the rotation matrix; then converting the three-dimensional point in the camera coordinate system at time k-1 into a pixel point P in the image coordinate system at time k-1 using the camera intrinsic parameters;
finding the pixel point with the same pixel coordinate as P at time k, and calculating its three-dimensional point coordinate and normal vector using the rotation matrix at time k;
judging whether the two three-dimensional point coordinates corresponding to the pixel point P at time k-1 and time k are consistent;
judging whether the two normal vectors corresponding to the pixel point P at time k-1 and time k are consistent;
if both the three-dimensional point coordinates and the normal vectors corresponding to the pixel point P are consistent, the pixel point P at time k-1 and the pixel point at time k are corresponding points;
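A minimal sketch of this projective data association (Python with NumPy), assuming per-pixel vertex and normal maps, a known intrinsic matrix K, and illustrative consistency thresholds for the two checks above; none of these names or threshold values come from the claims:

```python
import numpy as np

def projective_association(V_k, N_k, V_km1, N_km1, T_k_to_km1, K,
                           dist_thresh=0.05, angle_thresh_deg=20.0):
    """
    For each pixel of frame k, find its corresponding pixel in frame k-1 by projection.
    V_k, N_k:      HxWx3 vertex / normal maps of frame k   (camera-k coordinates)
    V_km1, N_km1:  HxWx3 vertex / normal maps of frame k-1 (camera-(k-1) coordinates)
    T_k_to_km1:    4x4 transform from camera-k to camera-(k-1) coordinates
    K:             3x3 camera intrinsic matrix
    Returns a list of ((u_k, v_k), (u_km1, v_km1)) correspondences.
    """
    H, W, _ = V_k.shape
    R, t = T_k_to_km1[:3, :3], T_k_to_km1[:3, 3]
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    matches = []
    for v in range(H):
        for u in range(W):
            p = V_k[v, u]
            if not np.isfinite(p).all() or p[2] <= 0:
                continue
            p_km1 = R @ p + t                    # point expressed in camera k-1
            q = K @ p_km1                        # project with the intrinsics
            if q[2] <= 0:
                continue
            uu, vv = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if not (0 <= uu < W and 0 <= vv < H):
                continue
            q_pt, q_n = V_km1[vv, uu], N_km1[vv, uu]
            n_k = R @ N_k[v, u]                  # rotate the frame-k normal into frame k-1
            # accept only if the 3D positions and the normal directions are consistent
            if (np.linalg.norm(p_km1 - q_pt) < dist_thresh
                    and np.dot(n_k, q_n) > cos_thresh):
                matches.append(((u, v), (uu, vv)))
    return matches
```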
the specific steps of calculating the main structure plane equation set for each frame of depth image added to the key frame sequence are as follows:
step (401): clustering the normal vectors of all points to generate a plurality of core normal vectors;
step (402): denoising the clustering result by removing, within each cluster, points whose normal direction differs from the core normal vector by more than 30 degrees;
step (403): establishing, for each cluster, a distance histogram of the projection distances of its points onto the core normal vector;
step (404): searching each cluster's distance histogram from largest to smallest for the maximum plane, and performing weighted least-squares fitting on the three-dimensional point cloud within the maximum plane to obtain a main structure plane of the frame's depth image.
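A minimal sketch of steps (401)-(404) (Python with NumPy and scikit-learn), under assumptions: k-means stands in for the unspecified normal clustering, and the bin width, minimum cluster size, and distance-to-bin-centre weighting are illustrative choices rather than values from the claims:

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in clustering; the claims do not name an algorithm

def extract_main_planes(points, normals, n_clusters=6, bin_width=0.05,
                        angle_limit_deg=30.0, min_points=100):
    """
    points, normals: Nx3 arrays in the key frame's camera coordinate system.
    Returns a list of plane equations (n, d) with n . x = d.
    """
    planes = []
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(normals)
    cos_limit = np.cos(np.radians(angle_limit_deg))
    for c in range(n_clusters):
        idx = labels == c
        if idx.sum() < min_points:
            continue
        core = normals[idx].mean(axis=0)          # step (401): core normal of the cluster
        core /= np.linalg.norm(core)
        # step (402): drop points deviating more than 30 degrees from the core normal
        keep = idx & (normals @ core > cos_limit)
        pts = points[keep]
        if len(pts) < min_points:
            continue
        # step (403): histogram of projection distances onto the core normal
        d = pts @ core
        nbins = max(1, int(np.ceil(np.ptp(d) / bin_width)))
        hist, edges = np.histogram(d, bins=nbins)
        b = np.argmax(hist)                       # step (404): the largest plane
        in_bin = (d >= edges[b]) & (d <= edges[b + 1])
        plane_pts = pts[in_bin]
        # weighted least squares: weight points by closeness to the bin centre (assumed weighting)
        centre = 0.5 * (edges[b] + edges[b + 1])
        w = 1.0 / (1e-6 + np.abs(plane_pts @ core - centre))
        centroid = (w[:, None] * plane_pts).sum(axis=0) / w.sum()
        X = (plane_pts - centroid) * np.sqrt(w)[:, None]
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        n = Vt[-1]                                # plane normal = smallest singular vector
        planes.append((n, float(n @ centroid)))
    return planes
```

Each returned pair (n, d) is a plane equation n · x = d in the key frame's camera coordinate system, ready to be transformed into the world coordinate system with the camera pose matrix.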
7. An electronic device, comprising: a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-5.
8. A computer-readable storage medium having computer instructions stored thereon which, when executed by a processor, perform the steps of the method of any one of claims 1-5.
CN201811278361.7A 2018-10-30 2018-10-30 Indoor scene main structure reconstruction method and system based on depth image key frame Active CN109544677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811278361.7A CN109544677B (en) 2018-10-30 2018-10-30 Indoor scene main structure reconstruction method and system based on depth image key frame

Publications (2)

Publication Number Publication Date
CN109544677A CN109544677A (en) 2019-03-29
CN109544677B (en) 2020-12-25

Family

ID=65845972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811278361.7A Active CN109544677B (en) 2018-10-30 2018-10-30 Indoor scene main structure reconstruction method and system based on depth image key frame

Country Status (1)

Country Link
CN (1) CN109544677B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223336B (en) * 2019-05-27 2023-10-17 上海交通大学 Plane fitting method based on TOF camera data
CN110675360B (en) * 2019-08-02 2022-04-01 杭州电子科技大学 Real-time plane detection and extraction method based on depth image
CN112106112A (en) * 2019-09-16 2020-12-18 深圳市大疆创新科技有限公司 Point cloud fusion method, device and system and storage medium
CN112541950A (en) * 2019-09-20 2021-03-23 杭州海康机器人技术有限公司 Method and device for calibrating external parameter of depth camera
CN110706332B (en) * 2019-09-25 2022-05-17 北京计算机技术及应用研究所 Scene reconstruction method based on noise point cloud
CN110675390A (en) * 2019-09-27 2020-01-10 广东博智林机器人有限公司 Building quality global detection method and device
CN110874851A (en) * 2019-10-25 2020-03-10 深圳奥比中光科技有限公司 Method, device, system and readable storage medium for reconstructing three-dimensional model of human body
CN110793441B (en) * 2019-11-05 2021-07-27 北京华捷艾米科技有限公司 High-precision object geometric dimension measuring method and device
CN111145238B (en) * 2019-12-12 2023-09-22 中国科学院深圳先进技术研究院 Three-dimensional reconstruction method and device for monocular endoscopic image and terminal equipment
CN111144349B (en) * 2019-12-30 2023-02-24 中国电子科技集团公司信息科学研究院 Indoor visual relocation method and system
CN111311664B (en) * 2020-03-03 2023-04-21 上海交通大学 Combined unsupervised estimation method and system for depth, pose and scene flow
CN111340949B (en) * 2020-05-21 2020-09-18 超参数科技(深圳)有限公司 Modeling method, computer device and storage medium for 3D virtual environment
CN111539899A (en) * 2020-05-29 2020-08-14 深圳市商汤科技有限公司 Image restoration method and related product
CN111986086B (en) * 2020-08-27 2021-11-09 贝壳找房(北京)科技有限公司 Three-dimensional image optimization generation method and system
CN112232274A (en) * 2020-11-03 2021-01-15 支付宝(杭州)信息技术有限公司 Depth image model training method and device
CN112348958A (en) * 2020-11-18 2021-02-09 北京沃东天骏信息技术有限公司 Method, device and system for acquiring key frame image and three-dimensional reconstruction method
CN113096185B (en) * 2021-03-29 2023-06-06 Oppo广东移动通信有限公司 Visual positioning method, visual positioning device, storage medium and electronic equipment
CN113240789B (en) * 2021-04-13 2023-05-23 青岛小鸟看看科技有限公司 Virtual object construction method and device
CN113160390B (en) * 2021-04-28 2022-07-22 北京理工大学 Three-dimensional dense reconstruction method and system
CN113327318B (en) * 2021-05-18 2022-07-29 禾多科技(北京)有限公司 Image display method, image display device, electronic equipment and computer readable medium
CN113902846B (en) * 2021-10-11 2024-04-12 岱悟智能科技(上海)有限公司 Indoor three-dimensional modeling method based on monocular depth camera and mileage sensor
CN116824070B (en) * 2023-08-31 2023-11-24 江西求是高等研究院 Real-time three-dimensional reconstruction method and system based on depth image

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942832A (en) * 2014-04-11 2014-07-23 浙江大学 Real-time indoor scene reconstruction method based on on-line structure analysis
CN105205858A (en) * 2015-09-18 2015-12-30 天津理工大学 Indoor scene three-dimensional reconstruction method based on single depth vision sensor
CN107240129A (en) * 2017-05-10 2017-10-10 同济大学 Object and indoor small scene based on RGB D camera datas recover and modeling method
WO2018095789A1 (en) * 2016-11-22 2018-05-31 Lego A/S System for acquiring a 3d digital representation of a physical object

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106875437B (en) * 2016-12-27 2020-03-17 北京航空航天大学 RGBD three-dimensional reconstruction-oriented key frame extraction method
CN107862735B (en) * 2017-09-22 2021-03-05 北京航空航天大学青岛研究院 RGBD three-dimensional scene reconstruction method based on structural information
CN107833253B (en) * 2017-09-22 2020-08-04 北京航空航天大学青岛研究院 RGBD three-dimensional reconstruction texture generation-oriented camera attitude optimization method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on RGBD data-driven 3D modeling methods for indoor scenes; Gong Yusong; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15 (No. 03); pp. I138-7412 *
Stereo and kinect fusion for continuous 3D reconstruction and visual odometry; Ozgur Yilmaz et al.; 2013 International Conference on Electronics, Computer and Computation (ICECCO); 2013-12-31; pp. 115-118 *
A robust skeleton extraction method for 3D point clouds; Wang Xiaojie et al.; Scientia Sinica Informationis; 2017-07-20; Vol. 47, No. 7; pp. 832-845 *

Similar Documents

Publication Publication Date Title
CN109544677B (en) Indoor scene main structure reconstruction method and system based on depth image key frame
CN111563442B (en) Slam method and system for fusing point cloud and camera image data based on laser radar
CN112258618B (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
KR102647351B1 (en) Modeling method and modeling apparatus using 3d point cloud
CN108898676B (en) Method and system for detecting collision and shielding between virtual and real objects
Xu et al. Reconstruction of scaffolds from a photogrammetric point cloud of construction sites using a novel 3D local feature descriptor
Lin et al. Topology aware object-level semantic mapping towards more robust loop closure
Wei et al. Visual navigation using projection of spatial right-angle in indoor environment
CN112927353A (en) Three-dimensional scene reconstruction method based on two-dimensional target detection and model alignment, storage medium and terminal
CN115035260A (en) Indoor mobile robot three-dimensional semantic map construction method
Bu et al. Semi-direct tracking and mapping with RGB-D camera for MAV
CN108961385A (en) A kind of SLAM patterning process and device
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
Yuan et al. 3D point cloud recognition of substation equipment based on plane detection
Pan et al. Optimization algorithm for high precision RGB-D dense point cloud 3D reconstruction in indoor unbounded extension area
Yin et al. Virtual reconstruction method of regional 3D image based on visual transmission effect
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
Zhang et al. A robust visual odometry based on RGB-D camera in dynamic indoor environments
Mi et al. 3D reconstruction based on the depth image: A review
Guo et al. Efficient planar surface-based 3D mapping method for mobile robots using stereo vision
CN111915632B (en) Machine learning-based method for constructing truth database of lean texture target object
Nakagawa et al. Topological 3D modeling using indoor mobile LiDAR data
Zhang et al. Object detection based on deep learning and b-spline level set in color images
Nakagawa et al. Panoramic rendering-based polygon extraction from indoor mobile LiDAR data
Wu et al. Object Pose Estimation with Point Cloud Data for Robot Grasping

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant