CN116309774A - Dense three-dimensional reconstruction method based on event camera - Google Patents

Dense three-dimensional reconstruction method based on event camera Download PDF

Info

Publication number
CN116309774A
CN116309774A (application CN202211668022.6A)
Authority
CN
China
Prior art keywords
convolution
output
image
layer
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211668022.6A
Other languages
Chinese (zh)
Inventor
张飞虎
张威
侯旭佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211668022.6A priority Critical patent/CN116309774A/en
Publication of CN116309774A publication Critical patent/CN116309774A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]


Abstract

The invention relates to a dense three-dimensional reconstruction method based on an event camera, wherein the event camera is a dynamic vision sensor and the three-dimensional reconstruction comprises three parts: intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction. An intensity image is first reconstructed from the events using deep learning; the internal parameters, poses and sparse point cloud of the camera are then estimated using structure from motion (SfM), and dense reconstruction is finally completed using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera.

Description

Dense three-dimensional reconstruction method based on event camera
Technical Field
The invention belongs to the field of computer vision three-dimensional reconstruction, relates to a dense three-dimensional reconstruction method based on an event camera, and belongs to a dynamic vision sensor three-dimensional reconstruction method.
Background
An event camera (dynamic vision sensor) is a new type of sensor different from a conventional video camera: each pixel is triggered asynchronously by events, where an event refers to a change of the brightness falling on a pixel; when the brightness increases or decreases by more than a set threshold, an event is output. Compared with a conventional camera, the event camera has the advantages of low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur. Since events are caused mainly by significant motion of intensity edges, most event-based three-dimensional reconstructions consist only of scene edges, i.e. semi-dense reconstructions, which is insufficient for some applications. In view of this, a dense three-dimensional reconstruction method based on an event camera is proposed herein, which first reconstructs intensity images from events using deep learning, then estimates the internal parameters, poses and sparse point cloud of the camera using structure from motion (SfM), and finally completes the dense reconstruction using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a dense three-dimensional reconstruction method based on an event camera, which exploits the low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur of the event camera to perform dense three-dimensional reconstruction in low-illumination and fast-motion environments. The experiments are performed on an industrial personal computer to which the event camera is connected.
Technical proposal
The dense three-dimensional reconstruction method based on the event camera is characterized by comprising the following steps of: the event camera is a dynamic vision sensor, and the three-dimensional reconstruction step comprises three parts of intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction, and the specific process is as follows:
1. intensity image reconstruction:
step 1.1: establishing an intensity image reconstruction neural network based on UNet, wherein the neural network comprises a head layer H, a main body layer B and a prediction layer P; the main body layer B comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5 with identical parameters, and three sub-pixel convolution modules U1, U2 and U3;
step 1.2: encoding the events into a channel tensor that is input to the neural network; the tensor passes through a convolution layer and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels, wherein the convolution kernel size is 3×3;
step 1.3: the output of the head layer H is sent to the three cyclic convolution modules R1, R2 and R3 for downsampling; after the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved;
each cyclic convolution block comprises a CBR and a ConvLSTM module, and the ConvLSTM module retains previous state information, which is combined with the current input to update the current state;
the CBR consists of a convolution layer with a convolution kernel size of 5×5, a batch normalization (BatchNorm) layer and a ReLU activation function;
the convolution kernel size of the ConvLSTM module is 3×3;
step 1.4: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks G1 and G2, which group the feature maps of the input layer and then convolve each group separately;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.5: the outputs of the three cyclic convolution modules are also each input into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.6: the output with N channels is up-sampled by three consecutive sub-pixel convolution modules U1, U2 and U3;
each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3, giving the output of the whole main body layer B;
wherein: after each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled; the input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch G3, G4 or G5, together with the output of the previous sub-pixel convolution module;
step 1.7: the output of the first cyclic convolution module R1 is input into a sub-pixel convolution module U0 for up-sampling, and the up-sampling result together with the output of the whole main body layer B is fed into the prediction layer; the prediction layer sends the input to a convolution layer for convolution, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function, the output being a predicted value between 0 and 1 for each pixel;
2. motion structure recovery:
step 2.1: taking the intensity images reconstructed by the neural network as the input image set Ψ = {I_i | i = 1,2,...,N_I}, the features and descriptors of each input image are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi};
step 2.2: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene;
through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output;
Step 2.3: computing basis matrices F for each image pair using epipolar geometry for potential matching successful image pairs, computing camera internal parameters and poses, if one valid camera internal parameter and pose maps features between more than 30 images, considering that geometry verification is passed, then outlier filtering the geometry verified image pairs using RANSAC method, and outputting the final output as geometry verified image pairs
Figure BDA0004015348210000042
Corresponding to their relevant features->
Figure BDA0004015348210000043
Step 2.4: selecting an image pair with the most matching features in the image pair with the baseline distance being more than 10cm from the image pair subjected to geometric verification, performing triangulation and nonlinear optimization, and estimating a transformation matrix of a camera and coordinates of 3D points in space;
step 2.5: adding the rest image pairs into nonlinear optimization by taking the order of the feature matching quality from more to less as input, repeating the process of step 2.4 until all the image pairs are optimized, and finally obtaining a transformation matrix set P= { P of the camera as the output c ∈SE(3)|c=1,2...N p Sum space 3D coordinate point set x= { X k ∈R 3 |k=1,2...N X Sparse reconstruction is completed;
3. dense three-dimensional reconstruction:
step 3.1: taking the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images as inputs, dense point cloud reconstruction is carried out using the Patch-Match Stereo method, and the dense point cloud and its corresponding images are output;
step 3.2: taking the dense point cloud and its corresponding images as input, a rough surface reconstruction is performed on the dense point cloud using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction;
step 3.3: taking the rough surface reconstruction as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images, and the output is a fine surface reconstruction;
step 3.4: and carrying out texture mapping on the fine surface reconstruction, and outputting a final dense three-dimensional reconstruction result.
The number of groups of the grouping convolution modules of steps 1.4 and 1.5 is η = N/8.
The nonlinear optimization formula is:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
The intensity image reconstructed by the event camera is a gray image, and the result of the dense three-dimensional reconstruction is gray.
Advantageous effects
The invention provides a dense three-dimensional reconstruction method based on an event camera, wherein the event camera is a dynamic vision sensor and the three-dimensional reconstruction comprises three parts: intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction. An intensity image is first reconstructed from the events using deep learning; the internal parameters, poses and sparse point cloud of the camera are then estimated using structure from motion (SfM), and dense reconstruction is finally completed using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera. The invention exploits the low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur of the event camera to perform dense three-dimensional reconstruction in low-illumination and fast-motion environments.
Drawings
FIG. 1 is a flow chart of a dense three-dimensional reconstruction method based on event cameras
FIG. 2 is an intensity image reconstruction network
FIG. 3 is a flow chart of a motion restoration structure and dense three-dimensional reconstruction
FIG. 4 is an original image of an event camera
FIG. 5 is an intensity image reconstruction result
FIG. 6 is a dense three-dimensional reconstruction result
Detailed Description
The invention will now be further described with reference to the examples and figures:
the invention is mainly divided into three parts: intensity image reconstruction, motion structure recovery, and dense three-dimensional reconstruction.
Part one: intensity image reconstruction
The first step: an intensity image reconstruction neural network based on UNet is established and divided into a head layer H, a main body layer B and a prediction layer P. The main body layer is the central part of the whole network and completes feature extraction and fusion; it comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5, and three sub-pixel convolution modules U1, U2 and U3.
Second step: the events are encoded into a channel tensor of fixed size that is input to the neural network; the tensor passes through a convolution layer (kernel size 3×3) and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels.
Third step: the output of the head layer H is fed into the three cyclic convolution blocks R1, R2 and R3 for downsampling. Each block consists of a CBR (convolution layer + batch normalization (BatchNorm) layer + ReLU activation function) with a convolution kernel size of 5×5 and a ConvLSTM block (convolution kernel size 3×3). The purpose of using ConvLSTM is to keep previous state information, which is combined with the current input to update the current state. After the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved.
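For illustration, a minimal PyTorch sketch of one such recurrent downsampling block is given below. PyTorch has no built-in ConvLSTM, so a compact cell is included; the class names, the stride-2 convolution used to halve the spatial size, and the channel counts in the usage comment are assumptions of the sketch, not details specified in the patent.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: keeps a hidden/cell state per spatial location."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Input and hidden state are concatenated, then mapped to the four gates.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=pad)

    def forward(self, x, state=None):
        if state is None:
            h, c = torch.zeros_like(x), torch.zeros_like(x)
        else:
            h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, (h, c)

class RecurrentConvBlock(nn.Module):
    """CBR (5x5 conv + BatchNorm + ReLU) followed by a ConvLSTM cell."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),  # halves H and W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.lstm = ConvLSTMCell(out_ch, kernel_size=3)

    def forward(self, x, state=None):
        return self.lstm(self.cbr(x), state)

# e.g. R1 = RecurrentConvBlock(32, 64); y, state = R1(torch.randn(1, 32, 128, 128))
# y has 64 channels at 64x64, i.e. channels doubled, height and width halved.
```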
Fourth step: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks, G1 and G2. Grouped convolution divides the feature maps of the input layer into groups and then convolves each group with its own kernels. The grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors (number of groups η = N/8), obtaining an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels. After the downsampling by the cyclic convolution modules, passing through two grouping convolution modules fully extracts the most abstract features.
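A sketch of such a grouping convolution module in PyTorch, assuming the 1×1 / grouped 3×3 / 1×1 structure described above (the class name `GroupConvBlock` is illustrative; `groups = channels // 8` encodes η = N/8, so that each group holds 4 channel tensors):

```python
import torch
import torch.nn as nn

class GroupConvBlock(nn.Module):
    """1x1 conv (halve channels) -> grouped 3x3 conv -> 1x1 conv (restore channels)."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        groups = channels // 8            # eta = N/8, i.e. 4 tensors per group
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        return self.block(x)

# e.g. G1 = GroupConvBlock(256); y = G1(torch.randn(1, 256, 32, 32))  # y: (1, 256, 32, 32)
```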
Fifth step: the outputs of the three cyclic convolution modules are also each fed into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5. These grouping convolution modules have the same structure as above: a 1×1 convolution halving the number of channels, a grouped 3×3 convolution (groups of 4 tensors, number of groups η = N/8) producing N/2 channels, and a 1×1 convolution doubling the number of channels back to N. The branches process the skip connections from the encoder before they are fused with the decoder in the sub-pixel convolution modules.
Sixth step: an up-sampling operation is then performed by three consecutive sub-pixel convolution modules U1, U2 and U3. Each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3. After each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled. The input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch (G3, G4 or G5), together with the output of the previous sub-pixel convolution module. Traditional up-sampling is achieved by non-learnable methods such as linear interpolation, so the interpolation is replaced here by sub-pixel convolution. This yields the output of the whole main body layer B.
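A corresponding PyTorch sketch of the sub-pixel up-sampling module, using the built-in `nn.PixelShuffle` for the sub-pixel rearrangement; combining the skip branch and the previous decoder output by concatenation, and the channel counts in the usage comment, are assumptions, since the patent only states that both are inputs to the module:

```python
import torch
import torch.nn as nn

class SubPixelUpBlock(nn.Module):
    """PixelShuffle(2): (a, b, c) -> (2a, 2b, c/4), followed by a 3x3 convolution."""
    def __init__(self, skip_ch, dec_ch, out_ch):
        super().__init__()
        in_ch = skip_ch + dec_ch           # skip branch (G3/G4/G5) concatenated with decoder
        self.shuffle = nn.PixelShuffle(2)  # rearranges channels into a 2x larger spatial grid
        self.conv = nn.Conv2d(in_ch // 4, out_ch, kernel_size=3, padding=1)

    def forward(self, skip, dec):
        x = torch.cat([skip, dec], dim=1)
        return self.conv(self.shuffle(x))

# e.g. U1 = SubPixelUpBlock(skip_ch=256, dec_ch=256, out_ch=128)
# y = U1(torch.randn(1, 256, 16, 16), torch.randn(1, 256, 16, 16))   # y: (1, 128, 32, 32)
```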
Seventh step: the output of the first cyclic convolution module R1 is input to a sub-pixel convolution module U0 for up-sampling, and the up-sampled result together with the output of the whole main body layer B is fed into the prediction layer. The prediction layer sends the input to a convolution layer, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function; the output is a value between 0 and 1 predicted for each pixel.
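A sketch of the prediction layer under the same assumptions (concatenation of the two inputs, a single 3×3 convolution producing one channel; kernel size and output channel count are assumptions, since the patent only specifies convolution, BatchNorm and Sigmoid):

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Fuse the U0 skip path with the body output and predict a per-pixel value in (0, 1)."""
    def __init__(self, skip_ch, body_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(skip_ch + body_ch, 1, kernel_size=3, padding=1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, skip, body):
        return self.head(torch.cat([skip, body], dim=1))

# e.g. intensity = PredictionLayer(skip_ch=16, body_ch=32)(u0_out, body_out)
```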
Part two: motion structure recovery
The first step: the intensity images reconstructed by the neural network are taken as the input image set Ψ = {I_i | i = 1,2,...,N_I}. For each input image, its features and descriptors are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi}.
Second step: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene. Through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output.
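As a concrete illustration, SIFT detection and descriptor matching between two reconstructed intensity images could be done with OpenCV as sketched below; the ratio-test threshold of 0.75 is an assumption, not a value given in the patent:

```python
import cv2

def sift_match(img_a, img_b, ratio=0.75):
    """Detect SIFT features in two grayscale images and return putative matches."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)   # keypoint coordinates x, descriptors f
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each feature of I_a, take the two nearest descriptors in I_b and keep
    # only matches that clearly pass Lowe's ratio test.
    knn = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    pts_a = [kp_a[m.queryIdx].pt for m in good]
    pts_b = [kp_b[m.trainIdx].pt for m in good]
    return pts_a, pts_b

# pts_a, pts_b = sift_match(cv2.imread("i0.png", 0), cv2.imread("i1.png", 0))
```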
Third step: for the potentially matching image pairs, the fundamental matrix F of each image pair is computed using the epipolar geometry, and the camera parameters and poses are computed. If one valid set of camera parameters and poses maps features between more than 30 images, geometric verification is considered to be passed. Outlier filtering is then applied to the geometrically verified image pairs using the RANSAC method, and the output is the set of geometrically verified image pairs C̄ ⊆ C together with their associated feature correspondences M̄_ab.
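A hedged sketch of the geometric verification for one image pair, using OpenCV's RANSAC-based fundamental-matrix estimation (the 1.0-pixel reprojection threshold and 0.999 confidence are assumptions):

```python
import cv2
import numpy as np

def verify_pair(pts_a, pts_b, thresh=1.0, conf=0.999):
    """Estimate the fundamental matrix with RANSAC and keep only inlier correspondences."""
    pts_a = np.asarray(pts_a, dtype=np.float64)
    pts_b = np.asarray(pts_b, dtype=np.float64)
    if len(pts_a) < 8:                      # 8-point minimum for the fundamental matrix
        return None, None
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, thresh, conf)
    if F is None:
        return None, None
    inliers = mask.ravel().astype(bool)
    return F, (pts_a[inliers], pts_b[inliers])   # geometrically verified correspondences

# F, (in_a, in_b) = verify_pair(pts_a, pts_b)
```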
Fourth step: from the geometrically verified image pairs whose baseline distance is greater than 10 cm, the pair with the most matching features is selected for triangulation and nonlinear optimization (bundle adjustment), estimating the transformation matrix of the camera and the coordinates of the 3D points in space; the nonlinear optimization formula is as follows:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
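The reprojection residual minimized by this cost can be sketched in Python as below; a full bundle adjustment stacks such residuals over all observations and hands them to a nonlinear least-squares solver such as `scipy.optimize.least_squares`. The pinhole projection with intrinsics `K` and the axis-angle pose parameterization are assumptions of the sketch:

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

def reprojection_residual(params, K, points_2d):
    """params = [rvec(3), tvec(3), X_k flattened]; returns pi(P_c, X_k) - x_j for all points."""
    rvec, tvec = params[:3], params[3:6]
    X = params[6:].reshape(-1, 3)                        # 3D points X_k
    X_cam = Rotation.from_rotvec(rvec).apply(X) + tvec   # apply camera transform P_c
    proj = (K @ X_cam.T).T                               # pinhole projection pi(.)
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - points_2d).ravel()                    # residuals against observed x_j

# x0 = np.r_[rvec0, tvec0, X0.ravel()]
# res = least_squares(reprojection_residual, x0, args=(K, observed_xy), loss="huber")
```

The robust `loss="huber"` plays the role of the weighting ρ_j by down-weighting large residuals.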
Fifth step: the remaining image pairs are added to the nonlinear optimization as input in descending order of feature-matching quality, and the fourth step is repeated until all image pairs have been optimized; the final output is the set of camera transformation matrices P = {P_c ∈ SE(3) | c = 1,2,...,N_P} and the set of spatial 3D coordinate points X = {X_k ∈ R³ | k = 1,2,...,N_X}, completing the sparse reconstruction.
Part three: dense three-dimensional reconstruction
The first step: the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images are taken as inputs, and dense point cloud reconstruction is carried out using the Patch-Match Stereo method, which obtains a complete and accurate dense point cloud at a reasonable speed. The output of this part is the dense point cloud and its corresponding images.
Second step: taking the dense point cloud and its corresponding images from the first step as input, a rough surface reconstruction of the point cloud is performed using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction.
Third step: taking the rough surface reconstruction from the second step as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images so that it is maximized, and the output is a fine surface reconstruction.
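The photometric-consistency score used in such refinement is typically a normalized cross-correlation between corresponding patches in different views; the minimal sketch below uses zero-mean NCC, which is an assumption, since the patent does not name a specific measure:

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-mean normalized cross-correlation between two image patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return float(a @ b / denom)        # 1.0 = photometrically consistent, -1.0 = inverted

# score = zncc(view1[y-3:y+4, x-3:x+4], view2[v-3:v+4, u-3:u+4])
```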
Fourth step: texture mapping is performed on the fine surface reconstruction generated in the third step, and the final dense three-dimensional reconstruction result is output; since the intensity image reconstructed from the event camera is a grayscale image, the final dense three-dimensional reconstruction result is gray.
Specific examples:
as shown in fig. 1, the dense three-dimensional reconstruction method based on the event camera is specifically as follows:
the first step: an intensity image reconstruction neural network based on UNet is established, and the network structure is shown in fig. 2 and is divided into a head layer H, a main body layer B and a prediction layer P. The main body layer is the central part of the whole network, the feature extraction and fusion are completed, and the main body layer comprises five grouping convolution modules G1, G2, G3, G4 and G5 of three circular convolution modules R1, R2 and R3 and three sub-pixel convolution modules U1, U2 and U3.
Second step: so that the convolutional recurrent neural network described herein can process the event stream, the event stream is encoded into a three-dimensional tensor with 5 channels.
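A common way to build such a 5-channel tensor is a time-binned voxel grid in which each event's polarity is distributed bilinearly over the two nearest temporal bins; the sketch below follows that convention as an assumption, since the patent does not spell out the encoding, and it assumes polarity values in {−1, +1} (a 0/1 polarity from the text file would need remapping):

```python
import numpy as np

def events_to_voxel(events, height, width, bins=5):
    """events: array of (t, x, y, p) rows with p in {-1, +1}; returns a (bins, H, W) tensor."""
    voxel = np.zeros((bins, height, width), dtype=np.float32)
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (bins - 1)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    left = np.floor(t_norm).astype(int)
    right = np.clip(left + 1, 0, bins - 1)
    w_right = t_norm - left                       # bilinear weights along the time axis
    np.add.at(voxel, (left, y, x), p * (1.0 - w_right))
    np.add.at(voxel, (right, y, x), p * w_right)
    return voxel

# tensor_5ch = events_to_voxel(np.loadtxt("events.txt"), height=180, width=240)
```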
Third step: training is performed using the ECD dataset with an image size of 240×180; during training the data are randomly flipped, rotated within (−20°, 20°) and cropped to 128×128 to expand the dataset, using the following loss function:
L = SSIM(X_1, X_2) + d(X_1, X_2)
wherein X_1 and X_2 denote two images, and SSIM(X_1, X_2) = L(X_1, X_2) · C(X_1, X_2) · S(X_1, X_2) is the structural similarity index measuring the similarity between the two images, where
L(X_1, X_2) = (2·μ_1·μ_2 + C_1) / (μ_1² + μ_2² + C_1)
denotes the luminance similarity,
C(X_1, X_2) = (2·σ_1·σ_2 + C_2) / (σ_1² + σ_2² + C_2)
denotes the contrast similarity, and
S(X_1, X_2) = (σ_12 + C_3) / (σ_1·σ_2 + C_3)
denotes the structure score. Here μ_1 and μ_2 are the means of the two images, σ_1 and σ_2 are their standard deviations, σ_12 is their covariance, and C_1, C_2, C_3 are constants that keep the denominators from being zero.
d(X_1, X_2) = Σ_l [ w_l / (H_l·W_l) ] · ‖ Y_l(X_1) − Y_l(X_2) ‖²
is the learned perceptual similarity: the two images are passed through a VGG19 network and the difference of the output values of each layer is computed, where l indexes the layers of the network, H_l and W_l are the height and width of the current layer, w_l is a scaling factor and Y_l is the corresponding output of each layer.
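A compact PyTorch sketch of this loss under the stated definitions is given below; the choice of VGG19 feature layers, the single-channel-to-RGB replication, the omission of ImageNet normalization, and treating the SSIM term exactly as written in the formula above are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def ssim(x1, x2, c1=0.01**2, c2=0.03**2):
    """Whole-image SSIM: luminance, contrast and structure terms merged in the usual way."""
    mu1, mu2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(), x2.std()
    cov = ((x1 - mu1) * (x2 - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / ((mu1**2 + mu2**2 + c1) * (s1**2 + s2**2 + c2))

class PerceptualDistance(torch.nn.Module):
    """d(X1, X2): per-layer squared differences of VGG19 features, scaled by w_l / (H_l * W_l)."""
    def __init__(self, layer_ids=(3, 8, 17, 26), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        self.layer_ids = set(layer_ids)
        self.w = dict(zip(layer_ids, weights))
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x1, x2):
        d = 0.0
        y1, y2 = x1.repeat(1, 3, 1, 1), x2.repeat(1, 3, 1, 1)   # grayscale -> 3 channels for VGG
        for idx, layer in enumerate(self.vgg):
            y1, y2 = layer(y1), layer(y2)
            if idx in self.layer_ids:
                h, w = y1.shape[-2:]
                d = d + self.w[idx] / (h * w) * F.mse_loss(y1, y2, reduction="sum")
        return d

def reconstruction_loss(x1, x2, perceptual):
    return ssim(x1, x2) + perceptual(x1, x2)   # L = SSIM + d, as stated in the text
```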
The deep learning framework is PyTorch; training runs for 200 epochs with a batch size of 4, using the ADAM optimizer with a maximum learning rate of 5×10⁻⁴ and a dynamically adjusted learning rate, and the trained model parameters are saved.
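A minimal training-loop configuration consistent with these settings is sketched below; the use of `OneCycleLR` as the dynamic learning-rate schedule and the loader/criterion names are assumptions:

```python
import torch

def train(model, train_loader, criterion, epochs=200, max_lr=5e-4, device="cuda"):
    """ADAM optimizer, dynamically adjusted learning rate, 200 epochs; batch size is set by the loader."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        for events, target in train_loader:          # 5-channel event tensors and target frames
            optimizer.zero_grad()
            pred = model(events.to(device))
            loss = criterion(pred, target.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
    torch.save(model.state_dict(), "intensity_reconstruction.pth")
```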
Fourth step: information about a scene is collected using the event camera and recorded into a rosbag; an original image of the event camera is shown in fig. 4. The /dvs/events topic in the rosbag recorded by the event camera is subscribed to; the message type is dvs_msgs and the content consists of pixel coordinates, polarity and timestamp. These three items are read and stored in a text file in the sequential format (timestamp, pixel coordinates, polarity) for subsequent processing.
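One way to dump the events to such a text file with the rosbag Python API is sketched below; the field layout of the message, i.e. `msg.events` entries with `x`, `y`, `ts` and `polarity`, is assumed from the dvs_msgs convention:

```python
import rosbag

def dump_events(bag_path, out_path, topic="/dvs/events"):
    """Read dvs events from a rosbag and write one 'timestamp x y polarity' line per event."""
    with rosbag.Bag(bag_path) as bag, open(out_path, "w") as out:
        for _, msg, _ in bag.read_messages(topics=[topic]):
            for e in msg.events:
                out.write("%.9f %d %d %d\n" % (e.ts.to_sec(), e.x, e.y, int(e.polarity)))

# dump_events("recording.bag", "events.txt")
```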
Fifth step: the encoding process of the second step is repeated, and the encoded 5-channel tensor is input into the trained intensity image reconstruction network; it passes through the head layer, the main body layer and the prediction layer in sequence and outputs a 3-channel tensor, which is converted into an image, i.e. the result of the intensity image reconstruction, as shown in fig. 5.
Sixth step: motion structure recovery and dense three-dimensional reconstruction are then carried out; the specific flow is shown in fig. 3. Feature extraction and matching are performed on the reconstructed intensity images using SIFT, and a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} together with their associated feature correspondences M_ab ∈ F_a × F_b is output.
Seventh step: the fundamental matrix F of each successfully matched image pair is computed using the epipolar geometry, and the internal parameters and poses of the camera are computed; outlier filtering is then performed using the RANSAC method, and the geometrically verified image pairs C̄ ⊆ C together with their associated feature correspondences M̄_ab are finally output.
Eighth step: image pairs with sufficient matching features and sufficient baseline distances are selected from the geometrically validated image pairs for triangulation and nonlinear optimization (Bundle Adjustment), estimating the transformation matrix of the camera and coordinates of the 3D points in space.
Ninth step: the remaining image pairs are added to the nonlinear optimization as input in order of feature-matching quality from good to poor, and the eighth step is repeated until all image pairs have been optimized; the final output is the set of camera transformation matrices P = {P_c ∈ SE(3) | c = 1,2,...,N_P} and the set of spatial 3D coordinate points X = {X_k ∈ R³ | k = 1,2,...,N_X}, completing the sparse reconstruction.
Tenth step: and taking the transformation matrix output by the motion recovery structure, 3D point coordinates in the space and corresponding images as inputs, carrying out dense point cloud reconstruction by using a Patch-Match Stereo method, and outputting dense point clouds and corresponding images thereof.
Eleventh step: taking the dense point cloud and its corresponding images from the tenth step as input, a rough surface reconstruction of the point cloud is performed using a surface reconstruction method based on binary segmentation, and the rough surface reconstruction is output.
Twelfth step: taking the rough surface reconstruction in the eleventh step as input, optimizing the details of the grid surface based on photometric consistency, and outputting a fine surface reconstruction.
Thirteenth step: texture mapping is carried out on the fine surface reconstruction generated in the twelfth step, and a final dense three-dimensional reconstruction result is output, as shown in fig. 6.

Claims (4)

1. The dense three-dimensional reconstruction method based on the event camera is characterized by comprising the following steps of: the event camera is a dynamic vision sensor, and the three-dimensional reconstruction step comprises three parts of intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction, and the specific process is as follows:
1. intensity image reconstruction:
step 1.1: establishing an intensity image reconstruction neural network based on UNet, wherein the neural network comprises a head layer H, a main body layer B and a prediction layer P; the main body layer B comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5 with identical parameters, and three sub-pixel convolution modules U1, U2 and U3;
step 1.2: encoding the events into a channel tensor that is input to the neural network; the tensor passes through a convolution layer and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels, wherein the convolution kernel size is 3×3;
step 1.3: the output of the head layer H is sent to the three cyclic convolution modules R1, R2 and R3 for downsampling; after the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved;
each cyclic convolution block comprises a CBR and a ConvLSTM module, and the ConvLSTM module retains previous state information, which is combined with the current input to update the current state;
the CBR consists of a convolution layer with a convolution kernel size of 5×5, a batch normalization (BatchNorm) layer and a ReLU activation function;
the convolution kernel size of the ConvLSTM module is 3×3;
step 1.4: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks G1 and G2, which group the feature maps of the input layer and then convolve each group separately;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.5: the outputs of the three cyclic convolution modules are also each input into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.6: the output with N channels is up-sampled by three consecutive sub-pixel convolution modules U1, U2 and U3;
each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3, giving the output of the whole main body layer B;
wherein: after each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled; the input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch G3, G4 or G5, together with the output of the previous sub-pixel convolution module;
step 1.7: the output of the first cyclic convolution module R1 is input into a sub-pixel convolution module U0 for up-sampling, and the up-sampling result together with the output of the whole main body layer B is fed into the prediction layer; the prediction layer sends the input to a convolution layer for convolution, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function, the output being a predicted value between 0 and 1 for each pixel;
2. motion structure recovery:
step 2.1: taking the intensity images reconstructed by the neural network as the input image set Ψ = {I_i | i = 1,2,...,N_I}, the features and descriptors of each input image are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi};
step 2.2: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene;
through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output;
Step 2.3: computing basis matrices F for each image pair using epipolar geometry for potential matching successful image pairs, computing camera internal parameters and poses, if one valid camera internal parameter and pose maps features between more than 30 images, considering that geometry verification is passed, then outlier filtering the geometry verified image pairs using RANSAC method, and outputting the final output as geometry verified image pairs
Figure FDA0004015348200000031
Corresponding to their relevant features->
Figure FDA0004015348200000032
Step 2.4: selecting an image pair with the most matching features in the image pair with the baseline distance being more than 10cm from the image pair subjected to geometric verification, performing triangulation and nonlinear optimization, and estimating a transformation matrix of a camera and coordinates of 3D points in space;
step 2.5: adding the rest image pairs into nonlinear optimization by taking the order of the feature matching quality from more to less as input, repeating the process of step 2.4 until all the image pairs are optimized, and finally obtaining a transformation matrix set P= { P of the camera as the output c ∈SE(3)|c=1,2...N p Sum space 3D coordinate point set x= { X k ∈R 3 |k=1,2...N X Sparse reconstruction is completed;
3. dense three-dimensional reconstruction:
step 3.1: taking the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images as inputs, dense point cloud reconstruction is carried out using the Patch-Match Stereo method, and the dense point cloud and its corresponding images are output;
step 3.2: taking the dense point cloud and its corresponding images as input, a rough surface reconstruction is performed on the dense point cloud using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction;
step 3.3: taking the rough surface reconstruction as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images, and the output is a fine surface reconstruction;
step 3.4: and carrying out texture mapping on the fine surface reconstruction, and outputting a final dense three-dimensional reconstruction result.
2. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein: the number of groups of the grouping convolution modules of steps 1.4 and 1.5 is η = N/8.
3. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein the nonlinear optimization formula is:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
4. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein: the intensity image reconstructed by the event camera is a gray image, and the result of the dense three-dimensional reconstruction is gray.
CN202211668022.6A 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera Pending CN116309774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211668022.6A CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211668022.6A CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Publications (1)

Publication Number Publication Date
CN116309774A true CN116309774A (en) 2023-06-23

Family

ID=86812015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211668022.6A Pending CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Country Status (1)

Country Link
CN (1) CN116309774A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197472A (en) * 2023-11-07 2023-12-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis
CN117197472B (en) * 2023-11-07 2024-03-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis
CN118247418A (en) * 2024-05-28 2024-06-25 长春师范大学 Method for reconstructing nerve radiation field by using small quantity of blurred images
CN118247418B (en) * 2024-05-28 2024-07-16 长春师范大学 Method for reconstructing nerve radiation field by using small quantity of blurred images

Similar Documents

Publication Publication Date Title
US10593021B1 (en) Motion deblurring using neural network architectures
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
Chen et al. Cross parallax attention network for stereo image super-resolution
CN105741252B (en) Video image grade reconstruction method based on rarefaction representation and dictionary learning
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
CN116309774A (en) Dense three-dimensional reconstruction method based on event camera
CN110443842A (en) Depth map prediction technique based on visual angle fusion
CN110348330A (en) Human face posture virtual view generation method based on VAE-ACGAN
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN111325851A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN110570377A (en) group normalization-based rapid image style migration method
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
Thasarathan et al. Automatic temporally coherent video colorization
CN115187638A (en) Unsupervised monocular depth estimation method based on optical flow mask
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
CN113962858A (en) Multi-view depth acquisition method
CN110599411A (en) Image restoration method and system based on condition generation countermeasure network
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
CN112906675B (en) Method and system for detecting non-supervision human body key points in fixed scene
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN111640172A (en) Attitude migration method based on generation of countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination