CN116309774A - Dense three-dimensional reconstruction method based on event camera - Google Patents

Dense three-dimensional reconstruction method based on event camera Download PDF

Info

Publication number
CN116309774A
CN116309774A (application CN202211668022.6A)
Authority
CN
China
Prior art keywords
convolution
output
image
layer
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211668022.6A
Other languages
Chinese (zh)
Inventor
张飞虎
张威
侯旭佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202211668022.6A priority Critical patent/CN116309774A/en
Publication of CN116309774A publication Critical patent/CN116309774A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]


Abstract

The invention relates to a dense three-dimensional reconstruction method based on an event camera, wherein the event camera is a dynamic vision sensor and the three-dimensional reconstruction comprises three parts: intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction. An intensity image is first reconstructed from the events using deep learning; the internal parameters, poses and sparse point cloud of the camera are then estimated using structure from motion (SfM), and dense reconstruction is finally completed using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera.

Description

Dense three-dimensional reconstruction method based on event camera
Technical Field
The invention belongs to the field of computer vision three-dimensional reconstruction, relates to a dense three-dimensional reconstruction method based on an event camera, and belongs to a dynamic vision sensor three-dimensional reconstruction method.
Background
An event camera (dynamic vision sensor) is a new type of sensor different from a conventional video camera: each pixel is triggered asynchronously by events, where an event refers to a change of the brightness falling on a pixel; when the brightness increases or decreases by more than a set threshold, an event is output. Compared with a conventional camera, the event camera has the advantages of low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur. Since events are caused mainly by significant motion of intensity edges, most event-based three-dimensional reconstructions consist only of scene edges, i.e. semi-dense reconstructions, which is insufficient for some applications. In view of this, a dense three-dimensional reconstruction method based on an event camera is proposed herein, which first reconstructs intensity images from events using deep learning, then estimates the internal parameters, poses and sparse point cloud of the camera using structure from motion (SfM), and finally completes the dense reconstruction using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera.
Disclosure of Invention
Technical problem to be solved
In order to avoid the defects of the prior art, the invention provides a dense three-dimensional reconstruction method based on an event camera, which exploits the low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur of the event camera to perform dense three-dimensional reconstruction in low-illumination and fast-motion environments. The experiments are performed on an industrial personal computer to which the event camera is connected.
Technical proposal
The dense three-dimensional reconstruction method based on the event camera is characterized by comprising the following steps of: the event camera is a dynamic vision sensor, and the three-dimensional reconstruction step comprises three parts of intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction, and the specific process is as follows:
1. intensity image reconstruction:
step 1.1: establishing an intensity image reconstruction neural network based on UNet, wherein the neural network comprises a head layer H, a main body layer B and a prediction layer P; the main body layer B comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5 with identical parameters, and three sub-pixel convolution modules U1, U2 and U3;
step 1.2: encoding the events into a channel tensor that is input to the neural network; the tensor passes through a convolution layer and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels, wherein the convolution kernel size is 3×3;
step 1.3: the output of the head layer H is sent to the three cyclic convolution modules R1, R2 and R3 for downsampling; after the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved;
each cyclic convolution block comprises a CBR and a ConvLSTM module, and the ConvLSTM module retains previous state information, which is combined with the current input to update the current state;
the CBR consists of a convolution layer with a convolution kernel size of 5×5, a batch normalization (BatchNorm) layer and a ReLU activation function;
the convolution kernel size of the ConvLSTM module is 3×3;
step 1.4: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks G1 and G2, which group the feature maps of the input layer and then convolve each group separately;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.5: the outputs of the three cyclic convolution modules are also each input into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.6: the output with N channels is up-sampled by three consecutive sub-pixel convolution modules U1, U2 and U3;
each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3, giving the output of the whole main body layer B;
wherein: after each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled; the input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch G3, G4 or G5, together with the output of the previous sub-pixel convolution module;
step 1.7: the output of the first cyclic convolution module R1 is input into a sub-pixel convolution module U0 for up-sampling, and the up-sampling result together with the output of the whole main body layer B is fed into the prediction layer; the prediction layer sends the input to a convolution layer for convolution, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function, the output being a predicted value between 0 and 1 for each pixel;
2. motion structure recovery:
step 2.1: taking the intensity images reconstructed by the neural network as the input image set Ψ = {I_i | i = 1,2,...,N_I}, the features and descriptors of each input image are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi};
step 2.2: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene;
through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output;
Step 2.3: computing basis matrices F for each image pair using epipolar geometry for potential matching successful image pairs, computing camera internal parameters and poses, if one valid camera internal parameter and pose maps features between more than 30 images, considering that geometry verification is passed, then outlier filtering the geometry verified image pairs using RANSAC method, and outputting the final output as geometry verified image pairs
Figure BDA0004015348210000042
Corresponding to their relevant features->
Figure BDA0004015348210000043
Step 2.4: selecting an image pair with the most matching features in the image pair with the baseline distance being more than 10cm from the image pair subjected to geometric verification, performing triangulation and nonlinear optimization, and estimating a transformation matrix of a camera and coordinates of 3D points in space;
step 2.5: adding the rest image pairs into nonlinear optimization by taking the order of the feature matching quality from more to less as input, repeating the process of step 2.4 until all the image pairs are optimized, and finally obtaining a transformation matrix set P= { P of the camera as the output c ∈SE(3)|c=1,2...N p Sum space 3D coordinate point set x= { X k ∈R 3 |k=1,2...N X Sparse reconstruction is completed;
3. dense three-dimensional reconstruction:
step 3.1: taking the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images as inputs, dense point cloud reconstruction is carried out using the Patch-Match Stereo method, and the dense point cloud and its corresponding images are output;
step 3.2: taking the dense point cloud and its corresponding images as input, a rough surface reconstruction is performed on the dense point cloud using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction;
step 3.3: taking the rough surface reconstruction as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images, and the output is a fine surface reconstruction;
step 3.4: and carrying out texture mapping on the fine surface reconstruction, and outputting a final dense three-dimensional reconstruction result.
The number of groups of the grouping convolution modules of steps 1.4 and 1.5 is η = N/8.
The nonlinear optimization formula is:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
The intensity image reconstructed by the event camera is a gray image, and the result of the dense three-dimensional reconstruction is gray.
Advantageous effects
The invention provides a dense three-dimensional reconstruction method based on an event camera, wherein the event camera is a dynamic vision sensor and the three-dimensional reconstruction comprises three parts: intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction. An intensity image is first reconstructed from the events using deep learning; the internal parameters, poses and sparse point cloud of the camera are then estimated using structure from motion (SfM), and dense reconstruction is finally completed using multi-view stereo (MVS). In special environments with high dynamic range, lack of illumination or high-speed motion, a conventional camera suffers from blur, overexposure and underexposure, which greatly degrade the quality of the reconstruction; owing to the advantages of the event camera, the quality of the dense three-dimensional reconstruction is far higher than that obtained with a conventional camera. The invention exploits the low latency, extremely low power consumption, high information availability, high dynamic range and freedom from motion blur of the event camera to perform dense three-dimensional reconstruction in low-illumination and fast-motion environments.
Drawings
FIG. 1 is a flow chart of a dense three-dimensional reconstruction method based on event cameras
FIG. 2 is an intensity image reconstruction network
FIG. 3 is a flow chart of a motion restoration structure and dense three-dimensional reconstruction
FIG. 4 is an original image of an event camera
FIG. 5 is an intensity image reconstruction result
FIG. 6 is a dense three-dimensional reconstruction result
Detailed Description
The invention will now be further described with reference to the examples and figures:
the invention is mainly divided into three parts: intensity image reconstruction, motion structure recovery, and dense three-dimensional reconstruction.
Part one: intensity image reconstruction
The first step: an intensity image reconstruction neural network based on UNet is established and divided into a head layer H, a main body layer B and a prediction layer P. The main body layer is the central part of the whole network and completes feature extraction and fusion; it comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5, and three sub-pixel convolution modules U1, U2 and U3.
Second step: the events are encoded into a channel tensor of fixed size that is input to the neural network; the tensor passes through a convolution layer (kernel size 3×3) and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels.
Third step: the output of the head layer H is fed into the three cyclic convolution blocks R1, R2 and R3 for downsampling. Each block consists of a CBR (convolution layer + batch normalization (BatchNorm) layer + ReLU activation function) with a convolution kernel size of 5×5 and a ConvLSTM block (convolution kernel size 3×3). The purpose of using ConvLSTM is to keep previous state information, which is combined with the current input to update the current state. After the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved.
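For illustration, a minimal PyTorch sketch of one such recurrent downsampling block is given below. PyTorch has no built-in ConvLSTM, so a compact cell is included; the class names, the stride-2 convolution used to halve the spatial size, and the channel counts in the usage comment are assumptions of the sketch, not details specified in the patent.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: keeps a hidden/cell state per spatial location."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Input and hidden state are concatenated, then mapped to the four gates.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=pad)

    def forward(self, x, state=None):
        if state is None:
            h, c = torch.zeros_like(x), torch.zeros_like(x)
        else:
            h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, (h, c)

class RecurrentConvBlock(nn.Module):
    """CBR (5x5 conv + BatchNorm + ReLU) followed by a ConvLSTM cell."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2),  # halves H and W
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.lstm = ConvLSTMCell(out_ch, kernel_size=3)

    def forward(self, x, state=None):
        return self.lstm(self.cbr(x), state)

# e.g. R1 = RecurrentConvBlock(32, 64); y, state = R1(torch.randn(1, 32, 128, 128))
# y has 64 channels at 64x64, i.e. channels doubled, height and width halved.
```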
Fourth step: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks, G1 and G2. Grouped convolution divides the feature maps of the input layer into groups and then convolves each group with its own kernels. The grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors (number of groups η = N/8), obtaining an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels. After the downsampling by the cyclic convolution modules, passing through two grouping convolution modules fully extracts the most abstract features.
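A sketch of such a grouping convolution module in PyTorch, assuming the 1×1 / grouped 3×3 / 1×1 structure described above (the class name `GroupConvBlock` is illustrative; `groups = channels // 8` encodes η = N/8, so that each group holds 4 channel tensors):

```python
import torch
import torch.nn as nn

class GroupConvBlock(nn.Module):
    """1x1 conv (halve channels) -> grouped 3x3 conv -> 1x1 conv (restore channels)."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        groups = channels // 8            # eta = N/8, i.e. 4 tensors per group
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=groups),
            nn.Conv2d(mid, channels, kernel_size=1),
        )

    def forward(self, x):
        return self.block(x)

# e.g. G1 = GroupConvBlock(256); y = G1(torch.randn(1, 256, 32, 32))  # y: (1, 256, 32, 32)
```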
Fifth step: the outputs of the three cyclic convolution modules are also each fed into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5. These grouping convolution modules have the same structure as above: a 1×1 convolution halving the number of channels, a grouped 3×3 convolution (groups of 4 tensors, number of groups η = N/8) producing N/2 channels, and a 1×1 convolution doubling the number of channels back to N. The branches process the skip connections from the encoder before they are fused with the decoder in the sub-pixel convolution modules.
Sixth step: an up-sampling operation is then performed by three consecutive sub-pixel convolution modules U1, U2 and U3. Each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3. After each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled. The input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch (G3, G4 or G5), together with the output of the previous sub-pixel convolution module. Traditional up-sampling is achieved by non-learnable methods such as linear interpolation, so the interpolation is replaced here by sub-pixel convolution. This yields the output of the whole main body layer B.
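A corresponding PyTorch sketch of the sub-pixel up-sampling module, using the built-in `nn.PixelShuffle` for the sub-pixel rearrangement; combining the skip branch and the previous decoder output by concatenation, and the channel counts in the usage comment, are assumptions, since the patent only states that both are inputs to the module:

```python
import torch
import torch.nn as nn

class SubPixelUpBlock(nn.Module):
    """PixelShuffle(2): (a, b, c) -> (2a, 2b, c/4), followed by a 3x3 convolution."""
    def __init__(self, skip_ch, dec_ch, out_ch):
        super().__init__()
        in_ch = skip_ch + dec_ch           # skip branch (G3/G4/G5) concatenated with decoder
        self.shuffle = nn.PixelShuffle(2)  # rearranges channels into a 2x larger spatial grid
        self.conv = nn.Conv2d(in_ch // 4, out_ch, kernel_size=3, padding=1)

    def forward(self, skip, dec):
        x = torch.cat([skip, dec], dim=1)
        return self.conv(self.shuffle(x))

# e.g. U1 = SubPixelUpBlock(skip_ch=256, dec_ch=256, out_ch=128)
# y = U1(torch.randn(1, 256, 16, 16), torch.randn(1, 256, 16, 16))   # y: (1, 128, 32, 32)
```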
Seventh step: the output of the first cyclic convolution module R1 is input to a sub-pixel convolution module U0 for up-sampling, and the up-sampled result together with the output of the whole main body layer B is fed into the prediction layer. The prediction layer sends the input to a convolution layer, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function; the output is a value between 0 and 1 predicted for each pixel.
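A sketch of the prediction layer under the same assumptions (concatenation of the two inputs, a single 3×3 convolution producing one channel; kernel size and output channel count are assumptions, since the patent only specifies convolution, BatchNorm and Sigmoid):

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Fuse the U0 skip path with the body output and predict a per-pixel value in (0, 1)."""
    def __init__(self, skip_ch, body_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(skip_ch + body_ch, 1, kernel_size=3, padding=1),
            nn.BatchNorm2d(1),
            nn.Sigmoid(),
        )

    def forward(self, skip, body):
        return self.head(torch.cat([skip, body], dim=1))

# e.g. intensity = PredictionLayer(skip_ch=16, body_ch=32)(u0_out, body_out)
```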
Part two: motion structure recovery
The first step: the intensity images reconstructed by the neural network are taken as the input image set Ψ = {I_i | i = 1,2,...,N_I}. For each input image, its features and descriptors are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi}.
Second step: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene. Through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output.
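As a concrete illustration, SIFT detection and descriptor matching between two reconstructed intensity images could be done with OpenCV as sketched below; the ratio-test threshold of 0.75 is an assumption, not a value given in the patent:

```python
import cv2

def sift_match(img_a, img_b, ratio=0.75):
    """Detect SIFT features in two grayscale images and return putative matches."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)   # keypoint coordinates x, descriptors f
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each feature of I_a, take the two nearest descriptors in I_b and keep
    # only matches that clearly pass Lowe's ratio test.
    knn = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]
    pts_a = [kp_a[m.queryIdx].pt for m in good]
    pts_b = [kp_b[m.trainIdx].pt for m in good]
    return pts_a, pts_b

# pts_a, pts_b = sift_match(cv2.imread("i0.png", 0), cv2.imread("i1.png", 0))
```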
Third step: for the potentially matching image pairs, the fundamental matrix F of each image pair is computed using the epipolar geometry, and the camera parameters and poses are computed. If one valid set of camera parameters and poses maps features between more than 30 images, geometric verification is considered to be passed. Outlier filtering is then applied to the geometrically verified image pairs using the RANSAC method, and the output is the set of geometrically verified image pairs C̄ ⊆ C together with their associated feature correspondences M̄_ab.
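A hedged sketch of the geometric verification for one image pair, using OpenCV's RANSAC-based fundamental-matrix estimation (the 1.0-pixel reprojection threshold and 0.999 confidence are assumptions):

```python
import cv2
import numpy as np

def verify_pair(pts_a, pts_b, thresh=1.0, conf=0.999):
    """Estimate the fundamental matrix with RANSAC and keep only inlier correspondences."""
    pts_a = np.asarray(pts_a, dtype=np.float64)
    pts_b = np.asarray(pts_b, dtype=np.float64)
    if len(pts_a) < 8:                      # 8-point minimum for the fundamental matrix
        return None, None
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, thresh, conf)
    if F is None:
        return None, None
    inliers = mask.ravel().astype(bool)
    return F, (pts_a[inliers], pts_b[inliers])   # geometrically verified correspondences

# F, (in_a, in_b) = verify_pair(pts_a, pts_b)
```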
Fourth step: from the geometrically verified image pairs whose baseline distance is greater than 10 cm, the pair with the most matching features is selected for triangulation and nonlinear optimization (bundle adjustment), estimating the transformation matrix of the camera and the coordinates of the 3D points in space; the nonlinear optimization formula is as follows:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
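The reprojection residual minimized by this cost can be sketched in Python as below; a full bundle adjustment stacks such residuals over all observations and hands them to a nonlinear least-squares solver such as `scipy.optimize.least_squares`. The pinhole projection with intrinsics `K` and the axis-angle pose parameterization are assumptions of the sketch:

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

def reprojection_residual(params, K, points_2d):
    """params = [rvec(3), tvec(3), X_k flattened]; returns pi(P_c, X_k) - x_j for all points."""
    rvec, tvec = params[:3], params[3:6]
    X = params[6:].reshape(-1, 3)                        # 3D points X_k
    X_cam = Rotation.from_rotvec(rvec).apply(X) + tvec   # apply camera transform P_c
    proj = (K @ X_cam.T).T                               # pinhole projection pi(.)
    proj = proj[:, :2] / proj[:, 2:3]
    return (proj - points_2d).ravel()                    # residuals against observed x_j

# x0 = np.r_[rvec0, tvec0, X0.ravel()]
# res = least_squares(reprojection_residual, x0, args=(K, observed_xy), loss="huber")
```

The robust `loss="huber"` plays the role of the weighting ρ_j by down-weighting large residuals.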
Fifth step: the remaining image pairs are added to the nonlinear optimization as input in descending order of feature-matching quality, and the fourth step is repeated until all image pairs have been optimized; the final output is the set of camera transformation matrices P = {P_c ∈ SE(3) | c = 1,2,...,N_P} and the set of spatial 3D coordinate points X = {X_k ∈ R³ | k = 1,2,...,N_X}, completing the sparse reconstruction.
Part three: dense three-dimensional reconstruction
The first step: the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images are taken as inputs, and dense point cloud reconstruction is carried out using the Patch-Match Stereo method, which obtains a complete and accurate dense point cloud at a reasonable speed. The output of this part is the dense point cloud and its corresponding images.
Second step: taking the dense point cloud and its corresponding images from the first step as input, a rough surface reconstruction of the point cloud is performed using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction.
Third step: taking the rough surface reconstruction from the second step as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images so that it is maximized, and the output is a fine surface reconstruction.
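The photometric-consistency score used in such refinement is typically a normalized cross-correlation between corresponding patches in different views; the minimal sketch below uses zero-mean NCC, which is an assumption, since the patent does not name a specific measure:

```python
import numpy as np

def zncc(patch_a, patch_b, eps=1e-8):
    """Zero-mean normalized cross-correlation between two image patches."""
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return float(a @ b / denom)        # 1.0 = photometrically consistent, -1.0 = inverted

# score = zncc(view1[y-3:y+4, x-3:x+4], view2[v-3:v+4, u-3:u+4])
```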
Fourth step: texture mapping is performed on the fine surface reconstruction generated in the third step, and the final dense three-dimensional reconstruction result is output; since the intensity image reconstructed from the event camera is a grayscale image, the final dense three-dimensional reconstruction result is gray.
Specific examples:
as shown in fig. 1, the dense three-dimensional reconstruction method based on the event camera is specifically as follows:
the first step: an intensity image reconstruction neural network based on UNet is established, and the network structure is shown in fig. 2 and is divided into a head layer H, a main body layer B and a prediction layer P. The main body layer is the central part of the whole network, the feature extraction and fusion are completed, and the main body layer comprises five grouping convolution modules G1, G2, G3, G4 and G5 of three circular convolution modules R1, R2 and R3 and three sub-pixel convolution modules U1, U2 and U3.
Second step: so that the convolutional recurrent neural network described herein can process the event stream, the event stream is encoded into a three-dimensional tensor with 5 channels.
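A common way to build such a 5-channel tensor is a time-binned voxel grid in which each event's polarity is distributed bilinearly over the two nearest temporal bins; the sketch below follows that convention as an assumption, since the patent does not spell out the encoding, and it assumes polarity values in {−1, +1} (a 0/1 polarity from the text file would need remapping):

```python
import numpy as np

def events_to_voxel(events, height, width, bins=5):
    """events: array of (t, x, y, p) rows with p in {-1, +1}; returns a (bins, H, W) tensor."""
    voxel = np.zeros((bins, height, width), dtype=np.float32)
    t = events[:, 0]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (bins - 1)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    p = events[:, 3]
    left = np.floor(t_norm).astype(int)
    right = np.clip(left + 1, 0, bins - 1)
    w_right = t_norm - left                       # bilinear weights along the time axis
    np.add.at(voxel, (left, y, x), p * (1.0 - w_right))
    np.add.at(voxel, (right, y, x), p * w_right)
    return voxel

# tensor_5ch = events_to_voxel(np.loadtxt("events.txt"), height=180, width=240)
```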
Third step: training is performed using the ECD dataset with an image size of 240×180; during training the data are randomly flipped, rotated within (−20°, 20°) and cropped to 128×128 to expand the dataset, using the following loss function:
L = SSIM(X_1, X_2) + d(X_1, X_2)
wherein X_1 and X_2 denote two images, and SSIM(X_1, X_2) = L(X_1, X_2) · C(X_1, X_2) · S(X_1, X_2) is the structural similarity index measuring the similarity between the two images, where
L(X_1, X_2) = (2·μ_1·μ_2 + C_1) / (μ_1² + μ_2² + C_1)
denotes the luminance similarity,
C(X_1, X_2) = (2·σ_1·σ_2 + C_2) / (σ_1² + σ_2² + C_2)
denotes the contrast similarity, and
S(X_1, X_2) = (σ_12 + C_3) / (σ_1·σ_2 + C_3)
denotes the structure score. Here μ_1 and μ_2 are the means of the two images, σ_1 and σ_2 are their standard deviations, σ_12 is their covariance, and C_1, C_2, C_3 are constants that keep the denominators from being zero.
d(X_1, X_2) = Σ_l [ w_l / (H_l·W_l) ] · ‖ Y_l(X_1) − Y_l(X_2) ‖²
is the learned perceptual similarity: the two images are passed through a VGG19 network and the difference of the output values of each layer is computed, where l indexes the layers of the network, H_l and W_l are the height and width of the current layer, w_l is a scaling factor and Y_l is the corresponding output of each layer.
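A compact PyTorch sketch of this loss under the stated definitions is given below; the choice of VGG19 feature layers, the single-channel-to-RGB replication, the omission of ImageNet normalization, and treating the SSIM term exactly as written in the formula above are assumptions of the sketch:

```python
import torch
import torch.nn.functional as F
from torchvision import models

def ssim(x1, x2, c1=0.01**2, c2=0.03**2):
    """Whole-image SSIM: luminance, contrast and structure terms merged in the usual way."""
    mu1, mu2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(), x2.std()
    cov = ((x1 - mu1) * (x2 - mu2)).mean()
    return ((2 * mu1 * mu2 + c1) * (2 * cov + c2)) / ((mu1**2 + mu2**2 + c1) * (s1**2 + s2**2 + c2))

class PerceptualDistance(torch.nn.Module):
    """d(X1, X2): per-layer squared differences of VGG19 features, scaled by w_l / (H_l * W_l)."""
    def __init__(self, layer_ids=(3, 8, 17, 26), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        self.layer_ids = set(layer_ids)
        self.w = dict(zip(layer_ids, weights))
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x1, x2):
        d = 0.0
        y1, y2 = x1.repeat(1, 3, 1, 1), x2.repeat(1, 3, 1, 1)   # grayscale -> 3 channels for VGG
        for idx, layer in enumerate(self.vgg):
            y1, y2 = layer(y1), layer(y2)
            if idx in self.layer_ids:
                h, w = y1.shape[-2:]
                d = d + self.w[idx] / (h * w) * F.mse_loss(y1, y2, reduction="sum")
        return d

def reconstruction_loss(x1, x2, perceptual):
    return ssim(x1, x2) + perceptual(x1, x2)   # L = SSIM + d, as stated in the text
```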
The deep learning framework is PyTorch; training runs for 200 epochs with a batch size of 4, using the ADAM optimizer with a maximum learning rate of 5×10⁻⁴ and a dynamically adjusted learning rate, and the trained model parameters are saved.
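A minimal training-loop configuration consistent with these settings is sketched below; the use of `OneCycleLR` as the dynamic learning-rate schedule and the loader/criterion names are assumptions:

```python
import torch

def train(model, train_loader, criterion, epochs=200, max_lr=5e-4, device="cuda"):
    """ADAM optimizer, dynamically adjusted learning rate, 200 epochs; batch size is set by the loader."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        for events, target in train_loader:          # 5-channel event tensors and target frames
            optimizer.zero_grad()
            pred = model(events.to(device))
            loss = criterion(pred, target.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
    torch.save(model.state_dict(), "intensity_reconstruction.pth")
```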
Fourth step: information about a scene is collected using the event camera and recorded into a rosbag; an original image of the event camera is shown in fig. 4. The /dvs/events topic in the rosbag recorded by the event camera is subscribed to; the message type is dvs_msgs and the content consists of pixel coordinates, polarity and timestamp. These three items are read and stored in a text file in the sequential format (timestamp, pixel coordinates, polarity) for subsequent processing.
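One way to dump the events to such a text file with the rosbag Python API is sketched below; the field layout of the message, i.e. `msg.events` entries with `x`, `y`, `ts` and `polarity`, is assumed from the dvs_msgs convention:

```python
import rosbag

def dump_events(bag_path, out_path, topic="/dvs/events"):
    """Read dvs events from a rosbag and write one 'timestamp x y polarity' line per event."""
    with rosbag.Bag(bag_path) as bag, open(out_path, "w") as out:
        for _, msg, _ in bag.read_messages(topics=[topic]):
            for e in msg.events:
                out.write("%.9f %d %d %d\n" % (e.ts.to_sec(), e.x, e.y, int(e.polarity)))

# dump_events("recording.bag", "events.txt")
```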
Fifth step: the encoding process of the second step is repeated, and the encoded 5-channel tensor is input into the trained intensity image reconstruction network; it passes through the head layer, the main body layer and the prediction layer in sequence and outputs a 3-channel tensor, which is converted into an image, i.e. the result of the intensity image reconstruction, as shown in fig. 5.
Sixth step: motion structure recovery and dense three-dimensional reconstruction are then carried out; the specific flow is shown in fig. 3. Feature extraction and matching are performed on the reconstructed intensity images using SIFT, and a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} together with their associated feature correspondences M_ab ∈ F_a × F_b is output.
Seventh step: the fundamental matrix F of each successfully matched image pair is computed using the epipolar geometry, and the internal parameters and poses of the camera are computed; outlier filtering is then performed using the RANSAC method, and the geometrically verified image pairs C̄ ⊆ C together with their associated feature correspondences M̄_ab are finally output.
Eighth step: image pairs with sufficient matching features and sufficient baseline distances are selected from the geometrically validated image pairs for triangulation and nonlinear optimization (Bundle Adjustment), estimating the transformation matrix of the camera and coordinates of the 3D points in space.
Ninth step: the remaining image pairs are added to the nonlinear optimization as input in order of feature-matching quality from good to poor, and the eighth step is repeated until all image pairs have been optimized; the final output is the set of camera transformation matrices P = {P_c ∈ SE(3) | c = 1,2,...,N_P} and the set of spatial 3D coordinate points X = {X_k ∈ R³ | k = 1,2,...,N_X}, completing the sparse reconstruction.
Tenth step: and taking the transformation matrix output by the motion recovery structure, 3D point coordinates in the space and corresponding images as inputs, carrying out dense point cloud reconstruction by using a Patch-Match Stereo method, and outputting dense point clouds and corresponding images thereof.
Eleventh step: taking the dense point cloud and its corresponding images from the tenth step as input, a rough surface reconstruction of the point cloud is performed using a surface reconstruction method based on binary segmentation, and the rough surface reconstruction is output.
Twelfth step: taking the rough surface reconstruction in the eleventh step as input, optimizing the details of the grid surface based on photometric consistency, and outputting a fine surface reconstruction.
Thirteenth step: texture mapping is carried out on the fine surface reconstruction generated in the twelfth step, and a final dense three-dimensional reconstruction result is output, as shown in fig. 6.

Claims (4)

1. The dense three-dimensional reconstruction method based on the event camera is characterized by comprising the following steps of: the event camera is a dynamic vision sensor, and the three-dimensional reconstruction step comprises three parts of intensity image reconstruction, motion structure recovery and dense three-dimensional reconstruction, and the specific process is as follows:
1. intensity image reconstruction:
step 1.1: establishing an intensity image reconstruction neural network based on UNet, wherein the neural network comprises a head layer H, a main body layer B and a prediction layer P; the main body layer B comprises three cyclic convolution modules R1, R2 and R3, five grouping convolution modules G1, G2, G3, G4 and G5 with identical parameters, and three sub-pixel convolution modules U1, U2 and U3;
step 1.2: encoding the events into a channel tensor that is input to the neural network; the tensor passes through a convolution layer and a ReLU activation function in the head layer H to obtain a tensor with 32 output channels, wherein the convolution kernel size is 3×3;
step 1.3: the output of the head layer H is sent to the three cyclic convolution modules R1, R2 and R3 for downsampling; after the input passes through each cyclic convolution block, the number of output channels is doubled and the height and width of the channel tensor are halved;
each cyclic convolution block comprises a CBR and a ConvLSTM module, and the ConvLSTM module retains previous state information, which is combined with the current input to update the current state;
the CBR consists of a convolution layer with a convolution kernel size of 5×5, a batch normalization (BatchNorm) layer and a ReLU activation function;
the convolution kernel size of the ConvLSTM module is 3×3;
step 1.4: the output data that has passed through the three cyclic convolution blocks is input into two grouping convolution blocks G1 and G2, which group the feature maps of the input layer and then convolve each group separately;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.5: the outputs of the three cyclic convolution modules are also each input into a branch, and each branch consists of one grouping convolution block, namely G3, G4 and G5;
the grouping convolution module first sends the channel tensor with N input channels into a convolution layer with a kernel size of 1×1, halving the number of channels; it then passes through a grouped convolution layer with a kernel size of 3×3, in which the channels are divided into groups of 4 tensors, to obtain an output with N/2 channels; finally the output is sent into a convolution layer with a kernel size of 1×1, doubling the number of channels, to obtain an output with N channels;
step 1.6: the output with N channels is up-sampled by three consecutive sub-pixel convolution modules U1, U2 and U3;
each sub-pixel convolution module performs a sub-pixel rearrangement of its input, i.e. an input of size a×b×c becomes an output of size 2a×2b×c/4 during up-sampling, and then passes through a convolution layer with a kernel size of 3×3, giving the output of the whole main body layer B;
wherein: after each sub-pixel convolution block the number of output channels is halved and the height and width of the channel tensor are doubled; the input of each sub-pixel convolution module is the output of the corresponding cyclic convolution module, processed by the grouping convolution branch G3, G4 or G5, together with the output of the previous sub-pixel convolution module;
step 1.7: the output of the first cyclic convolution module R1 is input into a sub-pixel convolution module U0 for up-sampling, and the up-sampling result together with the output of the whole main body layer B is fed into the prediction layer; the prediction layer sends the input to a convolution layer for convolution, then to a batch normalization layer, and finally obtains the output through a Sigmoid activation function, the output being a predicted value between 0 and 1 for each pixel;
2. motion structure recovery:
step 2.1: taking the intensity images reconstructed by the neural network as the input image set Ψ = {I_i | i = 1,2,...,N_I}, the features and descriptors of each input image are detected using the SIFT feature extraction algorithm; the feature coordinates are denoted x and the feature descriptors f, and all detected features and descriptors of image I_i form the set F_i = {(x_j, f_j) | j = 1,2,...,N_Fi};
step 2.2: using F_i as the appearance description of an image, SIFT feature matching searches, for each feature of image I_a, the feature descriptor with the highest matching degree in image I_b, establishing correspondences between features and discovering images of the same scene;
through feature matching, a set of potentially matching image pairs C = {{I_a, I_b} | I_a, I_b ∈ Ψ, a < b} and their associated feature correspondences M_ab ∈ F_a × F_b are output;
Step 2.3: computing basis matrices F for each image pair using epipolar geometry for potential matching successful image pairs, computing camera internal parameters and poses, if one valid camera internal parameter and pose maps features between more than 30 images, considering that geometry verification is passed, then outlier filtering the geometry verified image pairs using RANSAC method, and outputting the final output as geometry verified image pairs
Figure FDA0004015348200000031
Corresponding to their relevant features->
Figure FDA0004015348200000032
Step 2.4: selecting an image pair with the most matching features in the image pair with the baseline distance being more than 10cm from the image pair subjected to geometric verification, performing triangulation and nonlinear optimization, and estimating a transformation matrix of a camera and coordinates of 3D points in space;
step 2.5: adding the rest image pairs into nonlinear optimization by taking the order of the feature matching quality from more to less as input, repeating the process of step 2.4 until all the image pairs are optimized, and finally obtaining a transformation matrix set P= { P of the camera as the output c ∈SE(3)|c=1,2...N p Sum space 3D coordinate point set x= { X k ∈R 3 |k=1,2...N X Sparse reconstruction is completed;
3. dense three-dimensional reconstruction:
step 3.1: taking the transformation matrices output by the motion recovery structure, the 3D point coordinates in space and the corresponding images as inputs, dense point cloud reconstruction is carried out using the Patch-Match Stereo method, and the dense point cloud and its corresponding images are output;
step 3.2: taking the dense point cloud and its corresponding images as input, a rough surface reconstruction is performed on the dense point cloud using a surface reconstruction method based on binary segmentation; the dense point cloud is divided into an inside class and an outside class, the surface between the inside and the outside being the surface of the object, and the output is a rough surface reconstruction;
step 3.3: taking the rough surface reconstruction as input, the details of the mesh surface are refined based on photometric consistency; photometric consistency is recomputed over the multi-view images, and the output is a fine surface reconstruction;
step 3.4: and carrying out texture mapping on the fine surface reconstruction, and outputting a final dense three-dimensional reconstruction result.
2. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein: the number of groups of the grouping convolution modules of steps 1.4 and 1.5 is η = N/8.
3. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein the nonlinear optimization formula is:
E = Σ_j ρ_j( ‖ π(P_c, X_k) − x_j ‖² )
wherein P_c is the transformation matrix of the camera, X_k is the coordinate of a 3D point in space, x_j is a feature coordinate in the image, π is the projection function of a 3D point in space onto the camera plane, and ρ_j is a weight determined by the number of feature matches of the image pair.
4. The dense three-dimensional reconstruction method based on event cameras according to claim 1, wherein: the intensity image reconstructed by the event camera is a gray image, and the result of the dense three-dimensional reconstruction is gray.
CN202211668022.6A 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera Pending CN116309774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211668022.6A CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211668022.6A CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Publications (1)

Publication Number Publication Date
CN116309774A true CN116309774A (en) 2023-06-23

Family

ID=86812015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211668022.6A Pending CN116309774A (en) 2022-12-23 2022-12-23 Dense three-dimensional reconstruction method based on event camera

Country Status (1)

Country Link
CN (1) CN116309774A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117197472A (en) * 2023-11-07 2023-12-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis
CN117197472B (en) * 2023-11-07 2024-03-08 四川农业大学 Efficient teacher and student semi-supervised segmentation method and device based on endoscopic images of epistaxis
CN118247418A (en) * 2024-05-28 2024-06-25 长春师范大学 Method for reconstructing nerve radiation field by using small quantity of blurred images
CN118247418B (en) * 2024-05-28 2024-07-16 长春师范大学 Method for reconstructing nerve radiation field by using small quantity of blurred images

Similar Documents

Publication Publication Date Title
US10593021B1 (en) Motion deblurring using neural network architectures
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
Chen et al. Cross parallax attention network for stereo image super-resolution
CN105741252B (en) Video image grade reconstruction method based on rarefaction representation and dictionary learning
CN111489287A (en) Image conversion method, image conversion device, computer equipment and storage medium
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
CN116309774A (en) Dense three-dimensional reconstruction method based on event camera
CN110443842A (en) Depth map prediction technique based on visual angle fusion
CN110348330A (en) Human face posture virtual view generation method based on VAE-ACGAN
CN110728219A (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN111325851A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN110570377A (en) group normalization-based rapid image style migration method
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
Thasarathan et al. Automatic temporally coherent video colorization
CN115187638A (en) Unsupervised monocular depth estimation method based on optical flow mask
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN111950477A (en) Single-image three-dimensional face reconstruction method based on video surveillance
CN113962858A (en) Multi-view depth acquisition method
CN110599411A (en) Image restoration method and system based on condition generation countermeasure network
CN108924528B (en) Binocular stylized real-time rendering method based on deep learning
CN112906675B (en) Method and system for detecting non-supervision human body key points in fixed scene
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN114996814A (en) Furniture design system based on deep learning and three-dimensional reconstruction
CN111640172A (en) Attitude migration method based on generation of countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination