CN110555908A - Three-dimensional reconstruction method based on indoor moving target background restoration - Google Patents

Three-dimensional reconstruction method based on indoor moving target background restoration

Info

Publication number
CN110555908A
CN110555908A
Authority
CN
China
Prior art keywords
target
rgb image
depth
image
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910799527.8A
Other languages
Chinese (zh)
Other versions
CN110555908B (en)
Inventor
吴宪祥
耿煜恒
张晋新
孙牧野
陈笑
赵博
周旭阳
程开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Electronic Science and Technology
Original Assignee
Xian University of Electronic Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Electronic Science and Technology filed Critical Xian University of Electronic Science and Technology
Priority to CN201910799527.8A priority Critical patent/CN110555908B/en
Publication of CN110555908A publication Critical patent/CN110555908A/en
Application granted granted Critical
Publication of CN110555908B publication Critical patent/CN110555908B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20164Salient point detection; Corner detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a three-dimensional reconstruction method based on indoor moving target background restoration, which is used for solving the technical problem of low three-dimensional reconstruction accuracy in indoor dynamic scenes caused by the large number of noise points introduced by moving targets in the prior art. The method comprises the following specific steps: (1) acquiring an RGB image sequence and a depth image sequence of the indoor scene to be reconstructed; (2) acquiring the target areas of each RGB image; (3) performing feature matching between adjacent frames of the RGB image sequence; (4) calculating the rotation matrix and translation vector of the depth camera pose transformation between adjacent frames; (5) determining the moving target area in each RGB image; (6) repairing the static background occluded by the moving target within the moving target area of each RGB image; (7) obtaining the three-dimensional reconstruction result. The three-dimensional reconstruction accuracy in indoor dynamic environments is significantly higher than that of the prior art, and the method can be used for acquiring and analyzing three-dimensional information of indoor dynamic scenes.

Description

Three-dimensional reconstruction method based on indoor moving target background restoration
Technical Field
The invention belongs to the technical field of computer vision and image processing, relates to a three-dimensional reconstruction method, and particularly relates to a three-dimensional reconstruction method based on indoor moving target background restoration.
Background
Three-dimensional reconstruction uses a computer to model a real-world three-dimensional object and acquire its complete three-dimensional information, including structure, texture and dimensions. At present two three-dimensional reconstruction approaches are common: contact measurement and non-contact measurement. The contact measurement method is based on a force-trigger principle: the coordinates of sampling points on the object surface are obtained through direct contact between a probe and the object. The non-contact measurement method obtains and measures the three-dimensional shape of a target without touching the object. Three-dimensional reconstruction from an RGB image and a depth image is a typical non-contact measurement method, which reconstructs the three-dimensional shape of an object from the depth image and the feature information of the RGB image. Reconstruction accuracy is an important index for evaluating the reconstruction result.
The existing three-dimensional reconstruction methods based on RGB images and depth images mainly include the following types:
1. Static scene reconstruction: images of the object to be reconstructed are captured from different viewing angles, an effective imaging model is established through camera calibration to solve the intrinsic and extrinsic parameters of the camera, image feature points are extracted, the camera pose is solved with the ICP (Iterative Closest Point) algorithm, and finally the individual point cloud images are stitched together.
2. Dynamic scene reconstruction: a moving target in the scene is detected with an object detection technique, the point cloud corresponding to the moving target is removed in the point cloud stitching stage, and a point cloud map without the moving target is reconstructed. To improve the accuracy of three-dimensional reconstruction in dynamic scenes, the point cloud data of moving objects must be removed from the reconstructed scene. For example, a master's thesis entitled "Research on Key Technologies of Vision-based Semantic SLAM", published by the Information Engineering University of the Strategic Support Force on October 15, 2018, discloses a three-dimensional reconstruction method with moving object removal, which provides a SLAM mapping method based on a lookup table. The method first segments the image and estimates the eight-neighborhood motion direction of each region, and then uses these motion directions to optimize the map when the scene map is constructed. A method for removing the influence of dynamic targets based on the combination of a lookup table and optical flow is also provided: optical flow is used to detect dynamic targets in the scene, and a lookup table is used to build the scene map. The thesis further studies a deep-learning-based object detection method and realizes an improved SLAM method that removes the influence of dynamic targets using a Faster R-CNN network; the effect of visual SLAM is improved by detecting and eliminating dynamic targets in the scene. Experiments show that the method can effectively identify pedestrians in the scene and eliminate them when the map is constructed. However, this method leaves a large number of holes at the positions of the moving targets; the static background occluded by the moving targets is not repaired, and the accuracy of the three-dimensional reconstruction is still insufficient. How to remove moving targets in a dynamic scene, recover the point cloud of the static background occluded by them, and reconstruct a three-dimensional map of the dynamic scene is an important problem to be solved.
Disclosure of the Invention
The invention aims to overcome the defects in the prior art and provides a three-dimensional reconstruction method based on indoor moving target background restoration, so as to solve the technical problem that a dynamic indoor scene cannot be accurately reconstructed by the prior art.
In order to achieve this purpose, the technical solution adopted by the invention comprises the following steps:
(1) Acquire an RGB image sequence and a depth image sequence of the dynamic indoor scene to be reconstructed:
Continuously shoot the dynamic indoor scene to be reconstructed N times with a depth camera to obtain an RGB image sequence I = {I_1, I_2, ..., I_i, ..., I_N} and a depth image sequence D = {D_1, D_2, ..., D_i, ..., D_N}, where I_i denotes the ith RGB image, D_i denotes the depth image corresponding to I_i, and N > 50;
(2) Acquire the target areas of each RGB image I_i:
(2a) Using the Yolo deep neural network method for target feature extraction and recognition, input each RGB image I_i of the image sequence I into the target detection network Yolo one by one in shooting order to obtain the k detection targets C_i1, C_i2, ..., C_ij, ..., C_ik of I_i, where C_ij denotes the jth detection target of the ith RGB image I_i, C_ij = (x_ij, y_ij, w_ij, h_ij, s_ij, c_ij), (x_ij, y_ij) are the coordinates of the center of C_ij, w_ij and h_ij are the width and height of the pixel region of C_ij, s_ij denotes the target class, and c_ij denotes the confidence of s_ij;
(2b) Mark the rectangular pixel region centered at (x_ij, y_ij) with width w_ij and height h_ij to obtain the target region B_ij of C_ij;
(3) Perform feature matching between adjacent frames of the RGB image sequence I:
(3a) Extract m ORB feature points from the RGB image I_i with the FAST corner detection algorithm and compute the rotation-invariant BRIEF descriptor of each ORB feature point using the rotation-invariant BRIEF descriptor formula, where m > 300;
(3b) Compute the Hamming distance between each rotation-invariant BRIEF descriptor of I_i and each rotation-invariant BRIEF descriptor of I_(i+1), and match the feature points of I_i with those of I_(i+1) using a brute-force matching algorithm to obtain a set of v matching pairs {(p_i1, p_i1'), ..., (p_iu, p_iu'), ..., (p_iv, p_iv')}, where p_i = {p_i1, ..., p_iu, ..., p_iv} is the set of matched ORB feature points of I_i, p_i' = {p_i1', ..., p_iu', ..., p_iv'} is the set of matched ORB feature points of I_(i+1), and 0 ≤ v ≤ m;
(4) Calculate the rotation matrix and translation vector of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
(4a) Fuse p_i with the depth information of the corresponding depth image to obtain the three-dimensional point set pd_i = {pd_i1, ..., pd_iu, ..., pd_iv}, and fuse p_i' with the depth information of its corresponding depth image to obtain the three-dimensional point set pd_i' = {pd_i1', ..., pd_iu', ..., pd_iv'};
(4b) Calculate the centroid coordinates C_i and C_i' of pd_i and pd_i' and the de-centered coordinates q_iu and q_iu' of pd_iu and pd_iu', and calculate the rotation matrix R_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i using the SVD (singular value decomposition) algorithm:
(4c) Calculate the translation vector t_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
t_i = C_i − R_i·C_i'
(5) Determine the moving target area in each RGB image I_i:
(5a) Calculate the target matching degree between each target detection result C_ij in I_i and each target detection result C_(i+1)j' in I_(i+1):
f(C_ij, C_(i+1)j') = (x_ij − x_(i+1)j')^2 + (y_ij − y_(i+1)j')^2 + (w_ij − w_(i+1)j')^2 + (c_ij − c_(i+1)j')^2, with s_ij = s_(i+1)j'
When f(C_ij, C_(i+1)j') < 0.5, C_ij and C_(i+1)j' are determined to be the same target;
(5b) For the s pairs of feature matching points located in the target regions B_ij and B_(i+1)j' corresponding to the matched target detection results C_ij and C_(i+1)j', obtain the corresponding three-dimensional point set by the method of step (4a):
{(td_i1, td_i1'), ..., (td_iw, td_iw'), ..., (td_is, td_is')}, where 0 < s < m;
(5c) Calculate the dynamic variation A of the same target region between I_i and I_(i+1); when A > 0.1, mark the corresponding B_ij as a moving target region B_ij'; the calculation formula of A is as follows:
(6) Repair the static background occluded by the moving target in the moving target region B_ij' of each RGB image I_i:
(6a) With step length L = 5, select from the RGB image sequence I the 5 comparison frames I_(i+5), I_(i+10), I_(i+15), I_(i+20) and I_(i+25) after I_i when i < N − 25, or the 5 comparison frames I_(i−5), I_(i−10), I_(i−15), I_(i−20) and I_(i−25) before I_i when N − 25 ≤ i ≤ N;
(6b) Calculate the center coordinates (x_ij, y_ij)' of B_ij' in I_i and the corresponding projection coordinates (x_ij, y_ij)_l' in the 5 comparison frames, where l = 1, 2, 3, 4, 5;
(6c) Perform ORB feature point extraction and matching between B_ij' and the pixel region H_ijl in each of the 5 comparison frames that is centered at the projection coordinates (x_ij, y_ij)_l' and has the same width and height as B_ij'; update the pixel values of B_ij' with the pixel values of the region H_ijl' that has the fewest successfully matched feature point pairs to obtain the background-restored RGB image I_i' of I_i, and at the same time update the pixel values of the depth image region corresponding to B_ij' with those of the depth image region corresponding to H_ijl' to obtain the background-restored depth image D_i' of D_i;
(7) Obtain the three-dimensional reconstruction result:
Fuse each background-restored RGB image I_i' with each background-restored depth image D_i' to obtain N three-dimensional space point clouds Q_1, Q_2, ..., Q_i, ..., Q_N, and iteratively stitch Q_1, Q_2, ..., Q_i, ..., Q_N using the rotation matrices R_i and translation vectors t_i between adjacent frames of I to obtain the global three-dimensional point cloud Q with the moving targets eliminated:
Compared with the prior art, the invention has the following advantages:
First, the invention adopts a background restoration technique: after the moving target in the scene is removed, the static area occluded by it is repaired, so that the three-dimensional reconstruction system can reconstruct the static scene even under the interference of moving objects. This avoids the defect of the prior art that removing a moving target leaves a background hole, and improves the accuracy of the three-dimensional reconstruction result.
Second, the invention adopts the deep neural network Yolo and detects moving targets by combining it with the disparity information between two frames. The moving target detection is accurate and effective, several moving targets in a scene can be detected, and the defects of the prior art that the type of the moving target must be known in advance and that only a single moving target can be detected are avoided.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating the effect of background restoration on a moving target area according to the present invention;
FIG. 3 is a simulation comparison graph of the reconstruction results of the present invention and the prior art.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the present invention includes the steps of:
Step 1) Acquire an RGB image sequence and a depth image sequence of the dynamic indoor scene to be reconstructed:
Continuously shoot the indoor scene to be reconstructed N times with a depth camera to obtain an RGB image sequence I = {I_1, I_2, ..., I_i, ..., I_N} and a depth image sequence D = {D_1, D_2, ..., D_i, ..., D_N}, where I_i denotes the ith RGB image, D_i denotes the depth image corresponding to I_i, and N > 50; in this embodiment, N = 605;
Step 2) Obtain the target detection network Yolo:
Take the labeled images under the three categories of humans, animals and indoor articles in the PASCAL VOC public dataset as the input of a target detection network Yolo whose backbone network is Darknet-53, and iteratively train the Yolo network until it converges, obtaining a trained target detection network Yolo that can detect common targets in indoor scenes;
Step 3) Obtain the target areas of each RGB image I_i:
Step 3a) Using the Yolo deep neural network method for target feature extraction and recognition, input each RGB image I_i of the image sequence I into the target detection network Yolo one by one in shooting order to obtain the k detection targets C_i1, C_i2, ..., C_ij, ..., C_ik of I_i, where C_ij denotes the jth detection target of the ith RGB image I_i, C_ij = (x_ij, y_ij, w_ij, h_ij, s_ij, c_ij), (x_ij, y_ij) are the coordinates of the center of C_ij, w_ij and h_ij are the width and height of the image pixel region where C_ij is located, s_ij denotes the target class, and c_ij denotes the confidence of s_ij;
Step 3b) Mark the rectangular pixel region centered at (x_ij, y_ij) with width w_ij and height h_ij to obtain the target region B_ij of C_ij;
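For illustration, a minimal Python sketch of the detection data structure and the extraction of a target region B_ij is given below. The fields mirror C_ij = (x_ij, y_ij, w_ij, h_ij, s_ij, c_ij); `run_yolo` is a hypothetical wrapper around whatever Yolo implementation is used and is not part of the patented method.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float   # center x of the bounding box, x_ij (pixels)
    y: float   # center y of the bounding box, y_ij (pixels)
    w: float   # box width w_ij (pixels)
    h: float   # box height h_ij (pixels)
    s: str     # target class label s_ij
    c: float   # confidence c_ij of the class label

def target_region(img, det):
    """Cut the rectangular pixel region B_ij centred at (x, y) with size (w, h)."""
    x0 = max(int(det.x - det.w / 2), 0)
    y0 = max(int(det.y - det.h / 2), 0)
    x1 = min(int(det.x + det.w / 2), img.shape[1])
    y1 = min(int(det.y + det.h / 2), img.shape[0])
    return img[y0:y1, x0:x1], (x0, y0, x1, y1)

# detections_per_frame[i] would hold the k detections C_i1 ... C_ik of frame I_i,
# e.g. detections_per_frame[i] = run_yolo(I[i])   # run_yolo: hypothetical detector call
```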
Step 4) Perform feature matching between adjacent frames of the RGB image sequence I:
Step 4a) Extract m ORB feature points from the RGB image I_i with the FAST corner detection algorithm and compute the rotation-invariant BRIEF descriptor of each ORB feature point using the rotation-invariant BRIEF descriptor formula, where m > 300;
Step 4b) Compute the Hamming distance between each rotation-invariant BRIEF descriptor of I_i and each rotation-invariant BRIEF descriptor of I_(i+1), and match the feature points of I_i with those of I_(i+1) using a brute-force matching algorithm to obtain a set of v matching pairs {(p_i1, p_i1'), ..., (p_iu, p_iu'), ..., (p_iv, p_iv')}, where p_i = {p_i1, ..., p_iu, ..., p_iv} is the set of matched ORB feature points of I_i, p_i' = {p_i1', ..., p_iu', ..., p_iv'} is the set of matched ORB feature points of I_(i+1), and 0 ≤ v ≤ m;
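Steps 4a)-4b) correspond closely to the ORB pipeline available in OpenCV. The following is a minimal sketch, assuming OpenCV is used and that 500 features per frame suffice (the text only requires m > 300); it is an illustration, not the patented implementation itself.

```python
import cv2

def _gray(img):
    """ORB expects a single-channel image; convert if a colour frame is passed in."""
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img

def match_orb(img_a, img_b, n_features=500):
    """Extract ORB features (FAST corners + rotated BRIEF descriptors) from two
    adjacent frames and brute-force match them with the Hamming distance."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp_a, des_a = orb.detectAndCompute(_gray(img_a), None)
    kp_b, des_b = orb.detectAndCompute(_gray(img_b), None)
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)  # keep mutual best matches
    matches = bf.match(des_a, des_b)
    p_i  = [kp_a[m.queryIdx].pt for m in matches]   # matched points in I_i
    p_ip = [kp_b[m.trainIdx].pt for m in matches]   # matched points in I_(i+1)
    return p_i, p_ip
```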
Step 5) Calculate the rotation matrix and translation vector of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
Step 5a) Fuse p_i with the depth information of the corresponding depth image to obtain the three-dimensional point set pd_i = {pd_i1, ..., pd_iu, ..., pd_iv}, and at the same time fuse p_i' with the depth information of its corresponding depth image to obtain the three-dimensional point set pd_i' = {pd_i1', ..., pd_iu', ..., pd_iv'};
Step 5b) Calculate the centroid coordinates C_i and C_i' of pd_i and pd_i', and the de-centered coordinates q_iu and q_iu' of pd_iu and pd_iu':
C_i = (pd_i1 + pd_i2 + ... + pd_iv) / v
C_i' = (pd_i1' + pd_i2' + ... + pd_iv') / v
q_iu = pd_iu − C_i
q_iu' = pd_iu' − C_i'
Step 5c) Calculate the rotation matrix R_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i using the SVD (singular value decomposition) algorithm:
Step 5d) Calculate the translation vector t_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
t_i = C_i − R_i·C_i'
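Steps 5a)-5d) amount to back-projecting the matched pixels with the camera intrinsics and solving the rigid transform by the centroid/SVD procedure. A minimal NumPy sketch is given below; the intrinsic parameters fx, fy, cx, cy are assumed to be known from calibration, and the reflection guard is a standard addition not spelled out in the text.

```python
import numpy as np

def backproject(points_2d, depth, fx, fy, cx, cy):
    """Step 5a): fuse matched 2D feature points with the depth image to get 3D points."""
    pts_3d = []
    for u, v in points_2d:
        z = float(depth[int(v), int(u)])           # depth value at the pixel
        pts_3d.append([(u - cx) * z / fx, (v - cy) * z / fy, z])
    return np.asarray(pts_3d)

def rigid_transform(pd_i, pd_ip):
    """Steps 5b)-5d): centroids, de-centred coordinates, SVD rotation and translation
    such that pd_i ≈ R_i · pd_i' + t_i."""
    C_i, C_ip = pd_i.mean(axis=0), pd_ip.mean(axis=0)   # centroid coordinates
    q, qp = pd_i - C_i, pd_ip - C_ip                    # de-centred coordinates
    U, _, Vt = np.linalg.svd(q.T @ qp)                  # SVD of the 3x3 correlation
    R = U @ Vt
    if np.linalg.det(R) < 0:                            # guard against a reflection
        U[:, -1] *= -1
        R = U @ Vt
    t = C_i - R @ C_ip                                  # t_i = C_i - R_i · C_i'
    return R, t
```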
Step 6) Determine the moving target area in each RGB image I_i:
Step 6a) Calculate the target matching degree between each target detection result C_ij in I_i and each target detection result C_(i+1)j' in I_(i+1). The target matching degree combines the class of the Yolo detection result with the disparity information between two adjacent frames: because a moving target cannot move "instantaneously" in three-dimensional space, the center coordinates and the bounding-box width and height of the same target do not change abruptly between two adjacent frames, so whether two detections of the same class in adjacent frames are the same target can be decided by checking their matching degree against a threshold:
f(C_ij, C_(i+1)j') = (x_ij − x_(i+1)j')^2 + (y_ij − y_(i+1)j')^2 + (w_ij − w_(i+1)j')^2 + (c_ij − c_(i+1)j')^2, with s_ij = s_(i+1)j'
When f(C_ij, C_(i+1)j') < 0.5, C_ij and C_(i+1)j' are determined to be the same target;
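As a small illustration of step 6a), the matching degree and the same-target decision can be written as follows; detections are plain (x, y, w, h, s, c) tuples, and 0.5 is the threshold stated above.

```python
def same_target(det_a, det_b, thresh=0.5):
    """Step 6a): decide whether two detections from adjacent frames are the same target."""
    xa, ya, wa, ha, sa, ca = det_a
    xb, yb, wb, hb, sb, cb = det_b
    if sa != sb:                       # class labels s_ij and s_(i+1)j' must agree
        return False
    f = (xa - xb) ** 2 + (ya - yb) ** 2 + (wa - wb) ** 2 + (ca - cb) ** 2
    return f < thresh                  # same target when f(C_ij, C_(i+1)j') < 0.5
```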
Step 6b) Fuse the s pairs of feature matching points located in the target regions B_ij and B_(i+1)j' corresponding to the matched target detection results C_ij and C_(i+1)j' with the corresponding depth image information to obtain the corresponding three-dimensional point set, recorded as:
{(td_i1, td_i1'), ..., (td_iw, td_iw'), ..., (td_is, td_is')}, where 0 < s < m;
Step 6c) Calculate the dynamic variation A of the same target region between I_i and I_(i+1); when A > 0.1, mark B_ij as a moving target region B_ij'; the calculation formula of A is as follows:
Step 7) Repair the static background occluded by the moving target in the moving target region B_ij' of each RGB image I_i:
Step 7a) With step length L = 5, select from the RGB image sequence I the 5 comparison frames I_(i+5), I_(i+10), I_(i+15), I_(i+20) and I_(i+25) after I_i when i < N − 25, or the 5 comparison frames I_(i−5), I_(i−10), I_(i−15), I_(i−20) and I_(i−25) before I_i when N − 25 ≤ i ≤ N;
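The frame-selection rule of step 7a) can be sketched as follows; indices follow the patent's 1-based frame numbering, and the boundary handling is read from the condition above.

```python
def comparison_frames(i, N, L=5, count=5):
    """Step 7a): indices of the 5 comparison frames for frame i (1-based)."""
    if i < N - count * L:
        return [i + L * l for l in range(1, count + 1)]   # I_(i+5), ..., I_(i+25)
    return [i - L * l for l in range(1, count + 1)]        # I_(i-5), ..., I_(i-25)
```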
Step 7b) Calculate the center coordinates (x_ij, y_ij)' of B_ij' in I_i and the corresponding projection coordinates (x_ij, y_ij)_l' in the 5 comparison frames, where l = 1, 2, 3, 4, 5. The projection coordinates are the center coordinates of the region in a comparison frame that shares the same static background as B_ij'. Because the object inside B_ij' is moving, that region of the comparison frame may no longer be occluded by the moving object, so the image at the corresponding position in the comparison frame can be used for background restoration;
Step 7b1) Perform camera calibration on the image sequence I with an SfM algorithm to obtain an initialized intrinsic matrix K of the depth camera, and optimize the initialized intrinsic matrix K by bundle adjustment to obtain the optimized intrinsic matrix K;
Step 7b2) Calculate the camera coordinates (x_ij', y_ij', z_ij') corresponding to (x_ij, y_ij)' in the camera coordinate system:
[x_ij', y_ij', z_ij']^T = z_ij'·K^(−1)·[x_ij, y_ij, 1]^T
Step 7b3) Input (x_ij', y_ij', z_ij') into F and iterate 5 times, the numbers of iterations being 5, 10, 15, 20 and 25 in turn, where the lth iteration gives the projected camera coordinates (x_ij', y_ij', z_ij')_l in the comparison frame I_(i+5l) or I_(i−5l):
Step 7b4) Calculate the coordinates (x_ij, y_ij)_l' of (x_ij', y_ij', z_ij')_l in the image coordinate system as the projection coordinates of (x_ij, y_ij)' in the comparison frame I_(i+5l) or I_(i−5l); whether I_(i+5l) or I_(i−5l) applies depends on the direction in which the comparison frames of the current frame were selected: I_(i+5l) if they were selected after the current frame, I_(i−5l) if before:
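A minimal sketch of steps 7b2)-7b4) follows. The text does not give the explicit form of F, so chaining the inter-frame rigid transforms (R, t) here is an assumption, and depending on the convention of (R_i, t_i) their inverses may be the correct factors.

```python
import numpy as np

def project_center(center, depth_val, K, poses):
    """Back-project the centre (x_ij, y_ij)' of B_ij' with the intrinsic matrix K,
    chain the inter-frame pose transforms up to the chosen comparison frame, and
    re-project to obtain the projection coordinates (x_ij, y_ij)_l'."""
    u, v = center
    p_cam = depth_val * np.linalg.inv(K) @ np.array([u, v, 1.0])  # camera coordinates
    for R, t in poses:
        # Depending on the convention of (R_i, t_i), the inverse transform
        # (R^T, -R^T t) may be required here instead.
        p_cam = R @ p_cam + t
    uvw = K @ p_cam                                               # re-project with K
    return uvw[0] / uvw[2], uvw[1] / uvw[2]                       # image coordinates
```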
Step 7c) Perform ORB feature point extraction and matching between B_ij' and the pixel region H_ijl in each of the 5 comparison frames that is centered at the projection coordinates (x_ij, y_ij)_l' and has the same width and height as B_ij'; update the pixel values of B_ij' with the pixel values of the region H_ijl' that has the fewest successfully matched feature point pairs to obtain the background-restored RGB image I_i' of I_i, and update the pixel values of the depth image region corresponding to B_ij' with those of the depth image region corresponding to H_ijl' to obtain the background-restored depth image D_i' of D_i;
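Step 7c) can be sketched as below, again assuming OpenCV: the candidate patch with the fewest ORB matches to the moving-target region is taken as H_ijl' and copied into both the RGB and the depth image.

```python
import cv2

def orb_match_count(patch_a, patch_b):
    """Number of successfully matched ORB feature point pairs between two patches."""
    to_gray = lambda im: cv2.cvtColor(im, cv2.COLOR_BGR2GRAY) if im.ndim == 3 else im
    orb = cv2.ORB_create()
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    _, da = orb.detectAndCompute(to_gray(patch_a), None)
    _, db = orb.detectAndCompute(to_gray(patch_b), None)
    if da is None or db is None:
        return 0
    return len(bf.match(da, db))

def repair_region(rgb_i, depth_i, box, patches):
    """Step 7c): overwrite the moving-target region with the comparison patch H_ijl'
    that has the fewest ORB matches with it, in both the RGB and the depth image.
    'patches' is a list of (rgb_patch, depth_patch) cut from the 5 comparison frames."""
    x0, y0, x1, y1 = box
    region = rgb_i[y0:y1, x0:x1]
    counts = [orb_match_count(region, p_rgb) for p_rgb, _ in patches]
    best = counts.index(min(counts))
    rgb_i[y0:y1, x0:x1] = patches[best][0]      # background-restored RGB image I_i'
    depth_i[y0:y1, x0:x1] = patches[best][1]    # background-restored depth image D_i'
    return rgb_i, depth_i
```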
Step 8) Obtain the three-dimensional reconstruction result:
Fuse each background-restored RGB image I_i' with each background-restored depth image D_i' to obtain N three-dimensional space point clouds Q_1, Q_2, ..., Q_i, ..., Q_N, and iteratively stitch Q_1, Q_2, ..., Q_i, ..., Q_N using the rotation matrices R_i and translation vectors t_i between adjacent frames of I to obtain the global three-dimensional point cloud Q with the moving targets eliminated:
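For completeness, a minimal NumPy sketch of step 8) is shown below: each restored RGB-D pair is back-projected into a coloured point cloud with the intrinsic matrix K, and the clouds are stitched by accumulating the inter-frame (R_i, t_i). The direction in which the poses are accumulated is an assumption about the convention used above.

```python
import numpy as np

def frame_to_cloud(rgb, depth, K):
    """Back-project every valid pixel of a restored frame into a coloured 3D point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    valid = z > 0
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x[valid], y[valid], z[valid]], axis=1)
    cols = rgb[valid]
    return pts, cols

def stitch(clouds, transforms):
    """Step 8): map every cloud Q_i into the first frame's coordinate system by
    accumulating the inter-frame (R_i, t_i) and concatenate into the global cloud Q."""
    R_acc, t_acc = np.eye(3), np.zeros(3)
    all_pts, all_cols = [clouds[0][0]], [clouds[0][1]]
    for (pts, cols), (R_i, t_i) in zip(clouds[1:], transforms):
        R_acc, t_acc = R_acc @ R_i, R_acc @ t_i + t_acc   # compose the poses
        all_pts.append(pts @ R_acc.T + t_acc)
        all_cols.append(cols)
    return np.vstack(all_pts), np.vstack(all_cols)
```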
The technical effects of the invention are further explained below in combination with simulation experiments.
1. Experimental conditions and contents:
Experimental conditions: the experiment is carried out on a machine with Ubuntu 16.04, 32 GB of memory, an Intel E5-2620 dual-core processor and a GTX 1080Ti GPU. The input is the image sequence (605 frames, 768 × 574) of the "freiburg3_walking_xyz" dataset, which contains two moving human targets.
Experimental contents: in this experiment, the image sequence freiburg3_walking_xyz (605 frames, 768 × 574) is used as input, and three-dimensional point cloud reconstruction is performed on it with the method provided by the present invention and with the existing image-sequence-based three-dimensional point cloud reconstruction method; the results are shown in fig. 3.
2. Analysis of experimental results:
Referring to fig. 2, fig. 2(a) is an image containing a moving human target, and fig. 2(b) is an image of the same scene after background restoration. Referring to fig. 3, fig. 3(a) is the indoor three-dimensional point cloud model reconstructed by the existing three-dimensional reconstruction method, and fig. 3(b) is the indoor three-dimensional point cloud model reconstructed by the three-dimensional reconstruction method of the present invention. In fig. 3(a), because the dynamic object is not removed, a large number of three-dimensional points of the moving human body appear in the reconstruction result. Fig. 3(b) uses the background-restored images for reconstruction, removes the moving human body from the indoor scene, restores the static background occluded by it, and improves the accuracy of three-dimensional reconstruction in the indoor dynamic scene.

Claims (3)

1. A three-dimensional reconstruction method based on indoor moving target background restoration, characterized by comprising the following steps:
(1) Acquire an RGB image sequence and a depth image sequence of the dynamic indoor scene to be reconstructed:
Continuously shoot the dynamic indoor scene to be reconstructed N times with a depth camera to obtain an RGB image sequence I = {I_1, I_2, ..., I_i, ..., I_N} and a depth image sequence D = {D_1, D_2, ..., D_i, ..., D_N}, where I_i denotes the ith RGB image, D_i denotes the depth image corresponding to I_i, and N > 50;
(2) Acquire the target areas of each RGB image I_i:
(2a) Using the Yolo deep neural network method for target feature extraction and recognition, input each RGB image I_i of the image sequence I into the target detection network Yolo one by one in shooting order to obtain the k detection targets C_i1, C_i2, ..., C_ij, ..., C_ik of I_i, where C_ij denotes the jth detection target of the ith RGB image I_i, C_ij = (x_ij, y_ij, w_ij, h_ij, s_ij, c_ij), (x_ij, y_ij) are the coordinates of the center of C_ij, w_ij and h_ij are the width and height of the pixel region of C_ij, s_ij denotes the target class, and c_ij denotes the confidence of s_ij;
(2b) Mark the rectangular pixel region centered at (x_ij, y_ij) with width w_ij and height h_ij to obtain the target region B_ij of C_ij;
(3) Perform feature matching between adjacent frames of the RGB image sequence I:
(3a) Extract m ORB feature points from the RGB image I_i with the FAST corner detection algorithm and compute the rotation-invariant BRIEF descriptor of each ORB feature point using the rotation-invariant BRIEF descriptor formula, where m > 300;
(3b) Compute the Hamming distance between each rotation-invariant BRIEF descriptor of I_i and each rotation-invariant BRIEF descriptor of I_(i+1), and match the feature points of I_i with those of I_(i+1) using a brute-force matching algorithm to obtain a set of v matching pairs {(p_i1, p_i1'), ..., (p_iu, p_iu'), ..., (p_iv, p_iv')}, where p_i = {p_i1, ..., p_iu, ..., p_iv} is the set of matched ORB feature points of I_i, p_i' = {p_i1', ..., p_iu', ..., p_iv'} is the set of matched ORB feature points of I_(i+1), and 0 ≤ v ≤ m;
(4) Calculate the rotation matrix and translation vector of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
(4a) Fuse p_i with the depth information of the corresponding depth image to obtain the three-dimensional point set pd_i = {pd_i1, ..., pd_iu, ..., pd_iv}, and fuse p_i' with the depth information of its corresponding depth image to obtain the three-dimensional point set pd_i' = {pd_i1', ..., pd_iu', ..., pd_iv'};
(4b) Calculate the centroid coordinates C_i and C_i' of pd_i and pd_i' and the de-centered coordinates q_iu and q_iu' of pd_iu and pd_iu', and calculate the rotation matrix R_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i using the SVD (singular value decomposition) algorithm:
(4c) Calculate the translation vector t_i of the depth camera pose transformation of shooting I_(i+1) relative to shooting I_i:
t_i = C_i − R_i·C_i'
(5) Determine the moving target area in each RGB image I_i:
(5a) Calculate the target matching degree between each target detection result C_ij in I_i and each target detection result C_(i+1)j' in I_(i+1):
f(C_ij, C_(i+1)j') = (x_ij − x_(i+1)j')^2 + (y_ij − y_(i+1)j')^2 + (w_ij − w_(i+1)j')^2 + (c_ij − c_(i+1)j')^2, with s_ij = s_(i+1)j'
When f(C_ij, C_(i+1)j') < 0.5, C_ij and C_(i+1)j' are determined to be the same target;
(5b) Fuse the s pairs of feature matching points located in the target regions B_ij and B_(i+1)j' corresponding to the matched target detection results C_ij and C_(i+1)j' with the corresponding depth image information to obtain the corresponding three-dimensional point set:
{(td_i1, td_i1'), ..., (td_iw, td_iw'), ..., (td_is, td_is')}, where 0 < s < m;
(5c) Calculate the dynamic variation A of the same target region between I_i and I_(i+1); when A > 0.1, mark the corresponding B_ij as a moving target region B_ij'; the calculation formula of A is as follows:
(6) Repair the static background occluded by the moving target in the moving target region B_ij' of each RGB image I_i:
(6a) With step length L = 5, select from the RGB image sequence I the 5 comparison frames I_(i+5), I_(i+10), I_(i+15), I_(i+20) and I_(i+25) after I_i when i < N − 25, or the 5 comparison frames I_(i−5), I_(i−10), I_(i−15), I_(i−20) and I_(i−25) before I_i when N − 25 ≤ i ≤ N;
(6b) Calculate the center coordinates (x_ij, y_ij)' of B_ij' in I_i and the corresponding projection coordinates (x_ij, y_ij)_l' in the 5 comparison frames, where l = 1, 2, 3, 4, 5;
(6c) Perform ORB feature point extraction and matching between B_ij' and the pixel region H_ijl in each of the 5 comparison frames that is centered at the projection coordinates (x_ij, y_ij)_l' and has the same width and height as B_ij'; update the pixel values of B_ij' with the pixel values of the region H_ijl' that has the fewest successfully matched feature point pairs to obtain the background-restored RGB image I_i' of I_i, and update the pixel values of the depth image region corresponding to B_ij' with those of the depth image region corresponding to H_ijl' to obtain the background-restored depth image D_i' of D_i;
(7) Obtain the three-dimensional reconstruction result:
Fuse each background-restored RGB image I_i' with each background-restored depth image D_i' to obtain N three-dimensional space point clouds Q_1, Q_2, ..., Q_i, ..., Q_N, and iteratively stitch Q_1, Q_2, ..., Q_i, ..., Q_N using the rotation matrices R_i and translation vectors t_i between adjacent frames of I to obtain the global three-dimensional point cloud Q with the moving targets eliminated:
2. The three-dimensional reconstruction method based on indoor moving target background restoration according to claim 1, characterized in that the centroid coordinates C_i and C_i' of pd_i and pd_i' and the de-centered coordinates q_iu and q_iu' of pd_iu and pd_iu' in step (4b) are calculated respectively as:
C_i = (pd_i1 + pd_i2 + ... + pd_iv) / v
C_i' = (pd_i1' + pd_i2' + ... + pd_iv') / v
q_iu = pd_iu − C_i
q_iu' = pd_iu' − C_i'
3. The three-dimensional reconstruction method based on indoor moving target background restoration according to claim 1, characterized in that the projection coordinates (x_ij, y_ij)_l' in step (6b) are calculated by the following steps:
(6b1) Perform camera calibration on the image sequence I with an SfM algorithm to obtain an initialized intrinsic matrix of the depth camera, and optimize the initialized intrinsic matrix by bundle adjustment to obtain the optimized intrinsic matrix;
(6b2) Convert (x_ij, y_ij)' into the corresponding camera coordinates (x_ij', y_ij', z_ij') in the camera coordinate system using the intrinsic matrix;
(6b3) Input (x_ij', y_ij', z_ij') into F and iterate 5 times, the numbers of iterations being 5, 10, 15, 20 and 25 in turn, where the lth iteration gives the projected camera coordinates (x_ij', y_ij', z_ij')_l in the comparison frame I_(i+5l) or I_(i−5l):
(6b4) Convert (x_ij', y_ij', z_ij')_l into the image coordinates (x_ij, y_ij)_l' in the corresponding image coordinate system using the intrinsic matrix, as the projection coordinates of (x_ij, y_ij)' in the comparison frame I_(i+5l) or I_(i−5l).
CN201910799527.8A 2019-08-28 2019-08-28 Three-dimensional reconstruction method based on indoor moving target background restoration Active CN110555908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910799527.8A CN110555908B (en) 2019-08-28 2019-08-28 Three-dimensional reconstruction method based on indoor moving target background restoration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910799527.8A CN110555908B (en) 2019-08-28 2019-08-28 Three-dimensional reconstruction method based on indoor moving target background restoration

Publications (2)

Publication Number Publication Date
CN110555908A true CN110555908A (en) 2019-12-10
CN110555908B CN110555908B (en) 2022-12-02

Family

ID=68738106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910799527.8A Active CN110555908B (en) 2019-08-28 2019-08-28 Three-dimensional reconstruction method based on indoor moving target background restoration

Country Status (1)

Country Link
CN (1) CN110555908B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170019653A1 (en) * 2014-04-08 2017-01-19 Sun Yat-Sen University Non-feature extraction-based dense sfm three-dimensional reconstruction method
CN105825518A (en) * 2016-03-31 2016-08-03 西安电子科技大学 Sequence image rapid three-dimensional reconstruction method based on mobile platform shooting
CN107292965A (en) * 2017-08-03 2017-10-24 北京航空航天大学青岛研究院 A kind of mutual occlusion processing method based on depth image data stream
CN108460779A (en) * 2018-02-12 2018-08-28 浙江大学 A kind of mobile robot image vision localization method under dynamic environment

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444811A (en) * 2020-03-23 2020-07-24 复旦大学 Method for detecting three-dimensional point cloud target
CN111444811B (en) * 2020-03-23 2023-04-28 复旦大学 Three-dimensional point cloud target detection method
TWI808321B (en) * 2020-05-06 2023-07-11 圓展科技股份有限公司 Object transparency changing method for image display and document camera
CN111709982B (en) * 2020-05-22 2022-08-26 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN111709982A (en) * 2020-05-22 2020-09-25 浙江四点灵机器人股份有限公司 Three-dimensional reconstruction method for dynamic environment
CN111627061A (en) * 2020-06-03 2020-09-04 贝壳技术有限公司 Pose detection method and device, electronic equipment and storage medium
CN112116534A (en) * 2020-08-07 2020-12-22 贵州电网有限责任公司 Ghost eliminating method based on position information
CN112509115B (en) * 2020-11-26 2021-09-07 中国人民解放军战略支援部队信息工程大学 Three-dimensional time-varying unconstrained reconstruction method and system for dynamic scene of sequence image
CN112509115A (en) * 2020-11-26 2021-03-16 中国人民解放军战略支援部队信息工程大学 Three-dimensional time-varying unconstrained reconstruction method and system for dynamic scene of sequence image
CN112419305A (en) * 2020-12-09 2021-02-26 深圳云天励飞技术股份有限公司 Face illumination quality detection method and device, electronic equipment and storage medium
CN112419305B (en) * 2020-12-09 2024-06-11 深圳云天励飞技术股份有限公司 Face illumination quality detection method and device, electronic equipment and storage medium
CN114913064A (en) * 2022-03-15 2022-08-16 天津理工大学 Large parallax image splicing method and device based on structure keeping and many-to-many matching
CN114913064B (en) * 2022-03-15 2024-07-02 天津理工大学 Large parallax image splicing method and device based on structure maintenance and many-to-many matching

Also Published As

Publication number Publication date
CN110555908B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN110555908B (en) Three-dimensional reconstruction method based on indoor moving target background restoration
KR102647351B1 (en) Modeling method and modeling apparatus using 3d point cloud
US10977818B2 (en) Machine learning based model localization system
CN111063021B (en) Method and device for establishing three-dimensional reconstruction model of space moving target
Schindler et al. Line-based structure from motion for urban environments
CN113012293B (en) Stone carving model construction method, device, equipment and storage medium
US20110249865A1 (en) Apparatus, method and computer-readable medium providing marker-less motion capture of human
Tabb Shape from silhouette probability maps: reconstruction of thin objects in the presence of silhouette extraction and calibration error
CN111080776B (en) Human body action three-dimensional data acquisition and reproduction processing method and system
CN110751097A (en) Semi-supervised three-dimensional point cloud gesture key point detection method
Laycock et al. Aligning archive maps and extracting footprints for analysis of historic urban environments
CN112613123A (en) AR three-dimensional registration method and device for aircraft pipeline
CN114862973A (en) Space positioning method, device and equipment based on fixed point location and storage medium
CN117274515A (en) Visual SLAM method and system based on ORB and NeRF mapping
Rao et al. Extreme feature regions detection and accurate quality assessment for point-cloud 3D reconstruction
CN110458177B (en) Method for acquiring image depth information, image processing device and storage medium
Mo et al. Soft-aligned gradient-chaining network for height estimation from single aerial images
Dong et al. Learning stratified 3D reconstruction
CN112766313B (en) Crystal segmentation and positioning method, device, equipment and medium based on U-net structure
CN112950787B (en) Target object three-dimensional point cloud generation method based on image sequence
Wang et al. Stratification approach for 3-d euclidean reconstruction of nonrigid objects from uncalibrated image sequences
CN113470073A (en) Animal center tracking method based on deep learning
Seoud et al. Increasing the robustness of CNN-based human body segmentation in range images by modeling sensor-specific artifacts
Li et al. Research on MR virtual scene location method based on image recognition
Zhu et al. Toward the ghosting phenomenon in a stereo-based map with a collaborative RGB-D repair

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant