CN111652966B - Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle - Google Patents

Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle

Info

Publication number
CN111652966B
CN111652966B
Authority
CN
China
Prior art keywords
dimensional
depth
layer
map
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010393797.1A
Other languages
Chinese (zh)
Other versions
CN111652966A (en)
Inventor
曹先彬
罗晓燕
杜文博
张旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHECC Data Co Ltd
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010393797.1A priority Critical patent/CN111652966B/en
Publication of CN111652966A publication Critical patent/CN111652966A/en
Application granted granted Critical
Publication of CN111652966B publication Critical patent/CN111652966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds

Abstract

The invention discloses a three-dimensional reconstruction method and device based on multiple visual angles of an unmanned aerial vehicle, belonging to the technical field of computer image processing. The method comprises the following steps: multi-view two-dimensional images of a scene captured by unmanned aerial vehicle aerial photography are input into a three-dimensional reconstruction model, an optimized depth map is obtained for each view angle, and the optimized depth maps from all view angles are fused into a three-dimensional point cloud of the scene. The three-dimensional reconstruction model extracts a feature map from each image, performs homography transformation and constructs a cost matrix, generates a depth probability distribution map, regresses it into an initial depth map, fuses the initial depth map with the reference image, and feeds the result into a depth residual learning network to optimize the depth map. The device comprises a processor and a memory; the memory stores a computer program implementing the three-dimensional reconstruction method, and the processor executes the computer program to carry out scene three-dimensional reconstruction. The method and the device reduce the time consumption and resource occupation of three-dimensional scene reconstruction and achieve faster and more accurate reconstruction.

Description

Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
Technical Field
The invention relates to the technical field of computer image processing, in particular to a method and a device for reconstructing a three-dimensional scene.
Background
With the development of information technology and the demand for constructing real-world three-dimensional scenes, three-dimensional reconstruction technology has been widely applied in fields such as military exploration, urban planning and virtual reality. At present, considering factors such as flexibility, cost and convenience, restoring a three-dimensional scene from two-dimensional images captured by visual sensors such as cameras has become the mainstream approach in academia and industry. Unmanned aerial vehicles are receiving increasing attention and application; they offer advantages such as large-scale coverage and wide aerial viewing angles, but the images they capture are only two-dimensional and contain no depth information, so they can hardly be used directly to recover a three-dimensional scene.
In recent years, convolutional neural networks have shown strong capability in computer vision tasks such as two-dimensional feature extraction, and more and more researchers have applied neural networks to three-dimensional reconstruction tasks with some success. SurfaceNet, proposed at ICCV 2017, performs voxel-based three-dimensional reconstruction: it first divides the three-dimensional scene into a spatial grid and then estimates whether each voxel belongs to the surface of the scene, thereby reconstructing the whole scene. A point-cloud-based three-dimensional reconstruction method was proposed in TPAMI 2010; it operates directly on points in three-dimensional space and relies on an update strategy to progressively densify the point set and reconstruct the scene, but the reconstruction proceeds sequentially with dependencies between steps and is difficult to parallelize, so the whole reconstruction process is too time-consuming.
Disclosure of Invention
In order to reduce the time consumption of current three-dimensional scene reconstruction and the high memory occupation of the associated computation, the invention provides a three-dimensional reconstruction method and device based on multiple visual angles of an unmanned aerial vehicle, which combine unmanned aerial vehicle aerial photography with convolutional-neural-network-based three-dimensional reconstruction to reconstruct three-dimensional scenes faster and more accurately.
The invention discloses a three-dimensional reconstruction method based on multiple visual angles of an unmanned aerial vehicle, which comprises the following steps:
acquiring multi-view two-dimensional images of the scene to be three-dimensionally reconstructed by unmanned aerial vehicle aerial photography, and selecting one of the images as a reference image;
processing the multi-view two-dimensional images as the input of a three-dimensional reconstruction model; firstly, extracting a two-dimensional feature map from each two-dimensional image through a two-dimensional convolutional neural network; transforming each feature map through homography onto a plane parallel to the reference image, and constructing a cost matrix from all the feature maps after homography transformation; secondly, generating a depth probability distribution map from the cost matrix through a three-dimensional convolutional neural network with a multi-scale structure, and regressing the depth probability distribution map into an initial depth map through an entropy operation; then fusing the initial depth map with the reference image, inputting the result into a depth residual learning network, and outputting an optimized depth map;
training the three-dimensional reconstruction model to optimize the neural networks within it; the loss function used during training is the sum of the first-order norms between the calibrated real depth map and, respectively, the initial depth map and the optimized depth map; each training sample is a set of multi-view two-dimensional images, and the label is the real depth map of the scene;
after the three-dimensional reconstruction model is trained, taking the two-dimensional images at different view angles in turn as the reference image and inputting the multi-view two-dimensional images into the three-dimensional reconstruction model to obtain the optimized depth map at the view angle corresponding to each reference image; and finally fusing the optimized depth maps at all view angles to obtain the final three-dimensional point cloud of the scene.
The three-dimensional reconstruction model extracts features from the input two-dimensional images with an eight-layer convolutional neural network: after every three layers the translation step of the filter changes from 1 to 2, and every layer except the last is followed by batch normalization and a ReLU activation function. The feature map after the eight-layer convolutional neural network is one quarter of the input image size, corresponding to a downsampling scale of 4. Although the features are downsampled during extraction, the context information of the original input image is preserved in the convolutional neural network.
In the three-dimensional reconstruction model, a homography transformation maps one plane onto another; this operation serves as the bridge from two dimensions to three dimensions. After the feature maps of the input images at different view angles have been homography-transformed, they are combined into a cost matrix by a variance operation.
In the three-dimensional reconstruction model, the three-dimensional convolutional neural network with a multi-scale structure uses an encoder-decoder-like architecture in which each level scales and fuses the feature map, finally transforming the cost matrix into the probability distribution of the depth map, namely the depth probability distribution map.
In the three-dimensional reconstruction model, when regressing the initial depth map, the depth probability distribution of each pixel is read from the depth probability distribution map, the four depth values closest to the peak are selected for the entropy calculation, and each depth value is multiplied by its corresponding probability and the products are summed to obtain the depth of the pixel in the initial depth map.
In the three-dimensional reconstruction model, because the obtained initial depth map is over-smoothed, a reference image is introduced: the initial depth map and the reference image are fused into a 4-channel input, which is fed into a depth residual learning network that outputs the optimized depth map. The depth residual learning network consists of a two-dimensional convolutional neural network with three 32-channel layers and one 1-channel layer; in order to learn negative residual values, the last layer does not include a batch normalization layer or a ReLU layer.
The invention also relates to an unmanned-aerial-vehicle-based multi-view three-dimensional reconstruction device comprising a processor and a memory; a computer program implementing the unmanned aerial vehicle multi-view three-dimensional reconstruction method is stored in the memory, and the processor executes the computer program stored in the memory to perform the three-dimensional reconstruction of the scene.
Compared with the prior art, the three-dimensional reconstruction method and the three-dimensional reconstruction device have the following advantages and positive effects:
(1) The input images for three-dimensional reconstruction are more flexible: they are not limited to the two view angles of a binocular camera, and unlike earlier three-dimensional reconstruction algorithms the method does not require a fixed number of input images, so aerial images from any number of arbitrary view angles can be used as input for three-dimensional scene reconstruction.
(2) The method converts the three-dimensional reconstruction task into estimating a depth map for each view angle of the unmanned aerial vehicle and then fusing the depth maps into the final three-dimensional point cloud, which reduces the amount of computation and makes the whole scene reconstruction process more efficient. At the same time, the number of parameters during model training is greatly reduced, training is faster, and a trained three-dimensional reconstruction model can be obtained quickly.
(3) The invention provides a more refined and effective three-dimensional reconstruction model structure. It exploits the geometric relationship between the multi-view images and the corresponding cameras, extracts features with dense matching and neural networks, and introduces a three-dimensional convolutional neural network with a new encoding-decoding structure, so that global semantic information is incorporated when reconstructing the scene, stereo matching is stronger, and both the speed and the accuracy of three-dimensional scene reconstruction are improved.
Drawings
FIG. 1 is a flow chart of a three-dimensional reconstruction method of the present invention;
FIG. 2 is a block diagram of a three-dimensional convolutional neural network of the multi-scale structure of the present invention;
FIG. 3 is a two-dimensional convolutional neural network structure of the present invention;
fig. 4 is a schematic structural diagram of a system for three-dimensional reconstruction of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the three-dimensional reconstruction method based on multiple perspectives of an unmanned aerial vehicle according to the embodiment of the present invention is divided into the following seven steps for explanation.
Step 1, acquiring multi-view two-dimensional images of the scene of interest by unmanned aerial vehicle aerial photography.
For a scene to be three-dimensionally reconstructed, the invention acquires two-dimensional images from multiple view angles of the scene by unmanned aerial vehicle aerial photography. In the embodiment of the invention, the captured images are scene images taken from 7 different angles; 1 of them is selected as the reference image, and the three-dimensional scene is reconstructed with respect to the shooting angle of the reference image. The captured images are sampled and cropped so that the image size becomes 640 × 512 pixels.
Step 2, inputting the captured two-dimensional images into a two-dimensional convolutional neural network to extract feature information.
As shown in fig. 3, the embodiment of the invention uses an eight-layer convolutional neural network to extract features from each image. Each layer has 32 channels and a 3 × 3 filter; except for the last layer, a BN (batch normalization) layer and a ReLU layer follow each layer. After every three layers the filter stride changes from 1 to 2 and the feature map size is halved, so the final feature map is one quarter of the original image, corresponding to a downsampling scale of 4. Meanwhile, all input images in a group share the network parameters during back propagation. Although the features are downsampled during extraction, the context information of the original input image is preserved in the convolutional neural network.
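To make the feature extractor concrete, the following is a minimal PyTorch sketch of the eight-layer network described above. PyTorch itself, the 3-channel input, and placing the two stride-2 convolutions at layers 3 and 6 are assumptions made for illustration, not details stated in the patent.

```python
# Minimal sketch of the eight-layer 2D feature extractor (assumptions noted above).
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FeatureNet(nn.Module):
    """Eight 3x3 conv layers, 32 channels each, BN + ReLU on every layer except
    the last; two stride-2 layers bring the output to 1/4 of the input size."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            conv_bn_relu(3, 32, 1),      # layer 1
            conv_bn_relu(32, 32, 1),     # layer 2
            conv_bn_relu(32, 32, 2),     # layer 3: stride 2 -> 1/2 resolution
            conv_bn_relu(32, 32, 1),     # layer 4
            conv_bn_relu(32, 32, 1),     # layer 5
            conv_bn_relu(32, 32, 2),     # layer 6: stride 2 -> 1/4 resolution
            conv_bn_relu(32, 32, 1),     # layer 7
            nn.Conv2d(32, 32, 3, 1, 1),  # layer 8: no BN / ReLU
        )

    def forward(self, x):                # x: (B, 3, H, W)
        return self.layers(x)            # (B, 32, H/4, W/4)

# Example: a 640 x 512 aerial image yields a 160 x 128 feature map.
feat = FeatureNet()(torch.randn(1, 3, 512, 640))
print(feat.shape)  # torch.Size([1, 32, 128, 160])
```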
Step 3, performing homography transformation on the extracted two-dimensional feature maps and constructing a cost matrix.
The homography transformation nonlinearly interpolates the extracted planar features using operations involving the camera parameters, rotation, inversion and the like, mapping one plane onto another. The homography transformation is the intermediate bridge connecting two dimensions to three dimensions; in addition, it is differentiable, which facilitates end-to-end training.
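The patent does not write the homography out explicitly. A common plane-sweep form consistent with this description (an assumption borrowed from the multi-view stereo literature), with reference camera parameters $\{K_1, R_1, t_1\}$ and the i-th camera $\{K_i, R_i, t_i\}$, warps the i-th feature map onto the fronto-parallel plane at depth $d$ of the reference view via

$$H_i(d) = K_i \, R_i \left( I - \frac{(t_1 - t_i)\, n_1^{\mathsf{T}}}{d} \right) R_1^{\mathsf{T}} \, K_1^{-1},$$

where $n_1$ is the principal axis of the reference camera; for the reference view itself, $H_1(d)$ reduces to the identity.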
In this step, the feature maps of the input images at different view angles are transformed by homography onto a plane parallel to the reference image; the transformed map size is (W/4) · (H/4) · D · C, where W and H are the width and height of the input image, D is the depth (the number of sampled depth values), and C is the number of feature channels. The homography-transformed maps from the different view angles are then combined into a cost matrix by the variance operation below; compared with a mean operation, the variance operation better incorporates the difference information between the images, making the final reconstruction more accurate.
The feature maps after homography transformation are combined into the cost matrix E as

$$E = \frac{1}{N}\sum_{i=1}^{N}\left(V_i - \bar{V}\right)^{2}$$

where each map corresponds to a feature matrix, N is the number of combined feature maps, $V_i$ is the i-th feature map after homography transformation, and $\bar{V}$ is the average matrix of the N feature maps.
In this step, the feature maps of the input images at different view angles are homography-transformed and assembled into a cost matrix; this process is essentially an implementation of dense matching.
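A minimal sketch of the variance aggregation above, assuming the N homography-warped feature maps have already been stacked into a single tensor (the warping itself is omitted here):

```python
# Variance-based cost matrix over N warped feature volumes.
import torch

def variance_cost_volume(warped_features: torch.Tensor) -> torch.Tensor:
    """warped_features: (N, C, D, H, W) -> cost matrix E: (C, D, H, W).

    E = (1/N) * sum_i (V_i - mean(V))^2, computed element-wise, so the cost
    encodes how much the N views disagree at each depth hypothesis.
    """
    mean = warped_features.mean(dim=0, keepdim=True)
    return ((warped_features - mean) ** 2).mean(dim=0)

# Example with 7 views, 32 feature channels, 48 depth hypotheses, 128 x 160 features.
E = variance_cost_volume(torch.randn(7, 32, 48, 128, 160))
print(E.shape)  # torch.Size([32, 48, 128, 160])
```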
Step 4, generating a depth probability distribution map from the cost matrix obtained in step 3 through a three-dimensional convolutional neural network with a multi-scale structure.
Because the cost matrix contains much noise, it needs to be optimized with a three-dimensional convolutional neural network. As shown in fig. 2, the three-dimensional convolutional neural network with a multi-scale structure uses an encoder-decoder-like architecture in which each level scales and fuses the feature map, finally transforming the cost matrix into the probability distribution of the depth map, from which the depth map is then generated.
The structure of the multi-scale three-dimensional convolutional neural network is as follows. The encoder has 4 levels in total: the first level consists of three convolutional layers with 32 channels; the second level is reduced to 8 channels and from three convolutional layers to two; the third and fourth levels likewise keep two layers of 8-channel convolution. The filter stride between levels is 2, so the feature map size is halved after each level. The decoder also has 4 levels and can be regarded as the inverse of the encoder: the first to third levels are two-layer 8-channel convolutions and the last level is a three-layer 32-channel convolution; a hole convolution operation is used between levels, so the feature map size doubles after each level. The feature map sizes of corresponding encoder and decoder levels therefore stay consistent, which facilitates the subsequent inter-level information fusion.
At each level of the decoder, the output of the previous decoding level is fused with the corresponding encoding level. The inter-level information fusion of the multi-scale three-dimensional convolutional neural network proceeds as follows, from top to bottom. At the top level, encoding and decoding are connected by a convolution operation. The second encoding level first passes through an 8-channel neural network layer before reaching its decoding layer, and that decoding layer fuses the output from the level above with the output from its left. The third encoding level passes through two 8-channel neural network layers before reaching its decoding layer; similarly, the second 8-channel layer of the third level fuses the 8-channel layer of the second level with the output of the first 8-channel layer on its left, and the decoding layer of the third level fuses the output of the second-level decoding layer with the output of the second 8-channel layer on its left. For the fourth level there are three 8-channel neural network layers between the encoding and decoding layers, and likewise the two middle 8-channel layers and the decoding layer of the fourth level fuse the outputs from the level above and from their left. The output of the last convolutional layer has 1 channel and is converted into the depth probability distribution map by a softmax operation. The depth probability distribution map records, for every pixel, the probability of each depth value; the higher the probability, the more likely the pixel lies at that depth.
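The following is a deliberately simplified PyTorch sketch of such an encoder-decoder over the cost matrix. The channel counts (32 at the top level, 8 below) follow the text, but only one convolution per level is used, the per-level fusion wiring described above is approximated by additive skip connections, and transposed convolutions are used for upsampling in place of the hole-convolution operation mentioned above; all of these are simplifying assumptions.

```python
# Simplified multi-scale 3D CNN: encoder-decoder over the cost matrix with
# additive skips and a softmax over the depth dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3d(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class CostRegularization(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc0 = conv3d(32, 32)               # top level, 32 channels
        self.enc1 = conv3d(32, 8, stride=2)      # downsample, 8 channels
        self.enc2 = conv3d(8, 8, stride=2)
        self.enc3 = conv3d(8, 8, stride=2)
        self.dec2 = nn.ConvTranspose3d(8, 8, 3, stride=2, padding=1, output_padding=1)
        self.dec1 = nn.ConvTranspose3d(8, 8, 3, stride=2, padding=1, output_padding=1)
        self.dec0 = nn.ConvTranspose3d(8, 32, 3, stride=2, padding=1, output_padding=1)
        self.prob = nn.Conv3d(32, 1, 3, padding=1)   # final 1-channel output

    def forward(self, cost):                     # cost: (B, 32, D, H, W)
        e0 = self.enc0(cost)
        e1 = self.enc1(e0)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d2 = F.relu(self.dec2(e3) + e2)          # fuse decoder output with encoder level
        d1 = F.relu(self.dec1(d2) + e1)
        d0 = F.relu(self.dec0(d1) + e0)
        logits = self.prob(d0).squeeze(1)        # (B, D, H, W)
        return F.softmax(logits, dim=1)          # depth probability distribution map

prob = CostRegularization()(torch.randn(1, 32, 48, 32, 40))
print(prob.shape)  # torch.Size([1, 48, 32, 40])
```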
Step 5, regressing the depth probability distribution map into an initial depth map.
In the embodiment of the invention, the depth probability distribution map is recovered into a depth map by an entropy operation. Although the traditional winner-take-all algorithm could simply take the depth at the maximum probability, that operation is not differentiable and needs some improvement; instead, for each pixel, the depth values are multiplied by their corresponding probabilities and summed, with the specific formula:
$$F = \sum_{d = d_{\min}}^{d_{\max}} d \cdot P(d)$$

where F is the depth value of the pixel in the initial depth map recovered from the probability map, d is a candidate depth value of the pixel, P(d) is the probability corresponding to depth value d, and $d_{\min}$ and $d_{\max}$ are the minimum and maximum depth values in the probability map, respectively.
However, if the entropy calculation were applied directly over all depth values, the depth probability distribution of incorrectly matched pixels would not be concentrated around a single peak; therefore the method takes, for each pixel, the four nearest depth values and applies the above formula to obtain the pixel's depth in the initial depth map. The four nearest depth values are the four depth values closest to the depth at the peak (maximum probability).
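A minimal sketch of this restricted regression, assuming a uniform grid of candidate depths, a window of four depth bins centred on the per-pixel peak, and renormalisation of their probabilities (the patent only states that the four depth values nearest the peak enter the formula, so the exact windowing and the renormalisation are assumptions):

```python
# Regress the initial depth map from the probability volume using only the
# four depth hypotheses nearest the per-pixel peak.
import torch

def regress_initial_depth(prob: torch.Tensor, depth_values: torch.Tensor) -> torch.Tensor:
    """prob: (B, D, H, W) depth probability distribution; depth_values: (D,)."""
    B, D, H, W = prob.shape
    peak = prob.argmax(dim=1, keepdim=True)                      # (B, 1, H, W)
    # indices of four bins around the peak, clamped to the valid range (assumption)
    offsets = torch.tensor([-1, 0, 1, 2], device=prob.device).view(1, 4, 1, 1)
    idx = (peak + offsets).clamp(0, D - 1)                       # (B, 4, H, W)
    p = prob.gather(1, idx)                                      # (B, 4, H, W)
    d = depth_values.view(1, D, 1, 1).expand(B, D, H, W).gather(1, idx)
    p = p / p.sum(dim=1, keepdim=True).clamp_min(1e-8)           # renormalise (assumption)
    return (p * d).sum(dim=1)                                    # (B, H, W) initial depth map

depths = torch.linspace(2.0, 50.0, 48)                           # d_min .. d_max, illustrative
depth_map = regress_initial_depth(torch.softmax(torch.randn(1, 48, 32, 40), dim=1), depths)
print(depth_map.shape)  # torch.Size([1, 32, 40])
```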
Step 6, optimizing the depth map obtained in step 5 and outputting the optimized depth map.
Because the operation in step 5 makes the obtained depth map too smooth, the invention introduces a reference image, fuses it with the result of step 5 into a 4-channel input, and appends a depth residual learning network. The depth residual learning network consists of a two-dimensional convolutional neural network with three 32-channel layers and one 1-channel layer; to allow negative residual values to be learned, the last layer does not include a BN layer or a ReLU layer.
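A minimal PyTorch sketch of this refinement step; adding the 1-channel output back onto the initial depth map as a residual is an assumption consistent with the phrase "negative residual values", and the use of BN + ReLU on the three 32-channel layers is likewise assumed:

```python
# Depth residual learning network: 4-channel input (initial depth + RGB reference),
# three 32-channel conv layers, one 1-channel layer without BN/ReLU.
import torch
import torch.nn as nn

class DepthRefineNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
        self.body = nn.Sequential(
            block(4, 32), block(32, 32), block(32, 32),
            nn.Conv2d(32, 1, 3, padding=1),   # last layer: no BN, no ReLU
        )

    def forward(self, init_depth, ref_img):   # (B, 1, H, W), (B, 3, H, W)
        residual = self.body(torch.cat([init_depth, ref_img], dim=1))
        return init_depth + residual           # optimized depth map (assumed residual add)

refined = DepthRefineNet()(torch.randn(1, 1, 128, 160), torch.randn(1, 3, 128, 160))
print(refined.shape)  # torch.Size([1, 1, 128, 160])
```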
Steps 2-6 together form the three-dimensional reconstruction model, in which the two-dimensional convolutional neural network, the multi-scale three-dimensional convolutional neural network and the depth residual learning network all need to be optimized.
Step 7, training the three-dimensional reconstruction model. The first-order norms between the calibrated real depth map and, respectively, the initial depth map and the optimized depth map are computed and summed to give the training loss function, with the specific formula:
$$L = \sum_{p \in P} \left\lVert d(p) - \hat{d}_{i}(p) \right\rVert_{1} + \left\lVert d(p) - \hat{d}_{r}(p) \right\rVert_{1}$$

where L is the loss function, P is the set of valid pixels in the image, p is a pixel in P, d(p) is the depth value of pixel p in the real depth map, $\hat{d}_{i}(p)$ is the depth value of pixel p in the initial depth map, $\hat{d}_{r}(p)$ is the depth value of pixel p in the optimized depth map, and $\lVert\cdot\rVert_{1}$ denotes the first-order norm.
The invention trains the two-dimensional convolutional neural network, the three-dimensional convolutional neural network and the depth residual learning network and optimizes their parameters. During model training, the real depth map d(p) of the scene is used as the label; however, in most cases point cloud data of the scene is easier to obtain, so the point cloud is first converted into a mesh using the Screened Poisson Surface Reconstruction (SPSR) algorithm proposed by Kazhdan et al. and then rendered into real depth maps of the scene for each view angle. During training, the loss function L is used as the optimization objective to guide training, the smaller its value the better, and the model parameters are continuously updated by a gradient descent algorithm until the loss function reaches its minimum.
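A minimal sketch of the loss above; the validity-mask convention (pixels with positive ground-truth depth) and the choice of optimizer are assumptions:

```python
# Sum of L1 (first-order norm) errors of the initial and optimized depth maps
# against the ground-truth depth map, over valid pixels only.
import torch

def reconstruction_loss(gt_depth, init_depth, refined_depth):
    """All tensors: (B, H, W). Returns the scalar loss L."""
    valid = gt_depth > 0                                   # set P of valid pixels (assumption)
    l_init = (init_depth - gt_depth).abs()[valid].sum()
    l_refined = (refined_depth - gt_depth).abs()[valid].sum()
    return l_init + l_refined

# Typical training step (Adam over all three sub-networks, an assumption):
# loss = reconstruction_loss(gt, init_d, refined_d)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```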
After the three-dimensional reconstruction model has been trained, in the embodiment of the invention the unmanned aerial vehicle captures 7 images at different view angles by aerial photography; steps 1-6 are executed with each image taken in turn as the reference image, yielding a depth map for the view angle of each reference image. The 7 depth maps thus obtained are fused and converted into the three-dimensional point cloud data of the finally reconstructed scene.
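As a rough illustration of this final fusion step, each optimized depth map can be back-projected through its camera and the per-view points concatenated; the pinhole camera convention used here (x_cam = R · x_world + t) and the omission of any cross-view consistency filtering are simplifying assumptions, since the patent does not specify the fusion procedure in detail:

```python
# Back-project a per-view depth map into world-space 3D points.
import numpy as np

def depth_to_points(depth, K, R, t):
    """depth: (H, W); K: (3, 3) intrinsics; R, t: world-to-camera pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # (3, H*W) homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                 # camera-space coordinates
    world = R.T @ (cam - t.reshape(3, 1))                               # world-space coordinates
    return world.T                                                      # (H*W, 3)

# Fuse the seven views by concatenating their back-projected points:
# cloud = np.concatenate([depth_to_points(d, K[i], R[i], t[i]) for i, d in enumerate(depth_maps)])
```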
As shown in fig. 4, a three-dimensional reconstruction apparatus 40 according to an embodiment of the present invention includes: a processor 41 and a memory 42.
A memory 42 for storing computer programs, computer instructions, and the like; the computer program comprises a program that can perform the method shown in fig. 1 and will not be described in detail here.
The computer programs, computer instructions, etc. described above may be stored in one or more memories 42 in partitions. And the above-mentioned computer program, computer instructions, data, etc. can be called by the processor 41.
A processor 41 for executing the computer program stored in the memory 42 to implement the steps of the three-dimensional reconstruction method according to the above embodiments.
The processor 41 and the memory 42 may be separate structures or may be integrated structures integrated together. When the processor 41 and the memory 42 are separate structures, the memory 42 and the processor 41 may be coupled by a bus 43.
The functional modules for three-dimensional reconstruction of a scene implemented by the computer program stored in the memory 42 include:
the image input module is used for inputting multi-view two-dimensional images obtained by aerial photography of the unmanned aerial vehicle and selecting one of the multi-view two-dimensional images as a reference image;
inputting the multi-view two-dimensional images into the three-dimensional reconstruction model, in which a two-dimensional convolutional neural network extracts a feature map from each image; the output feature maps undergo homography transformation, each feature map being transformed onto a plane parallel to the reference image according to parameters such as the camera view frustum, and the homography-transformed feature maps are merged into a cost matrix; secondly, a depth probability distribution map is generated from the cost matrix by the three-dimensional convolutional neural network with a multi-scale structure and then regressed into an initial depth map; the initial depth map is fused with the reference image, input into the depth residual learning network, and the optimized depth map is output;
sequentially taking two-dimensional images shot by the unmanned aerial vehicle at different visual angles as reference images, and outputting optimized depth maps at corresponding visual angles by the three-dimensional reconstruction model;
and the three-dimensional scene output module is used for fusing the optimized depth maps under all the visual angles and outputting the final three-dimensional point cloud of the reconstructed scene.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A three-dimensional reconstruction method based on multiple visual angles of an unmanned aerial vehicle is characterized by comprising the following steps:
step 1, acquiring multi-view two-dimensional images under a scene to be three-dimensionally reconstructed through unmanned aerial vehicle aerial photography, and selecting one of the multi-view two-dimensional images as a reference image;
step 2, extracting a two-dimensional characteristic map from each two-dimensional image through a two-dimensional convolution neural network;
step 3, performing homography transformation on the extracted feature graph to convert the feature graph into a plane parallel to the reference image, and constructing a cost matrix by using the feature graph after the homography transformation;
step 4, generating a depth probability distribution map by using the cost matrix through a three-dimensional convolution neural network with a multi-scale structure;
step 5, returning the depth probability distribution map to an initial depth map by utilizing entropy operation;
in the step 5, the depth probability distribution of each pixel is obtained through the depth probability distribution map, four depth values closest to the peak value are selected to carry out entropy calculation, the depth values are multiplied by the corresponding depth value probabilities and summed, and the depth of the pixel in the initial depth map is obtained;
step 6, fusing the initial depth map and the reference image, inputting the fused initial depth map and the reference image into a depth residual error learning network, and outputting an optimized depth map;
step 7, training the two-dimensional convolutional neural network, the three-dimensional convolutional neural network and the depth residual learning network, and optimizing the network parameters; the loss function during training is the sum of the first-order norms between the calibrated real depth map and, respectively, the initial depth map and the optimized depth map; each training sample is a set of multi-view two-dimensional images, and the label is a real depth map of the scene; after the networks are trained, the two-dimensional images at the different view angles in step 1 are taken in turn as the reference image and steps 2-6 are executed to obtain the optimized depth map at each corresponding view angle, and finally the optimized depth maps at all view angles are fused to obtain the final three-dimensional point cloud of the scene.
2. The method according to claim 1, wherein in the step 2, the feature extraction is performed on the two-dimensional image by using a convolutional neural network with eight layers, the translation step size of the filter is changed from 1 to 2 after every three layers, and batch normalization processing and a ReLU activation function are added after other layers except the last layer; the feature size after eight layers of convolutional neural networks becomes one quarter of the input two-dimensional image.
3. The method according to claim 1, wherein in step 3, the feature maps corresponding to the two-dimensional images under different viewing angles are subjected to homography transformation and then merged into a cost matrix by using variance operation.
4. The method according to claim 1, wherein in the step 4, the three-dimensional convolutional neural network of the multi-scale structure comprises an encoding and decoding structure, each layer performs scale transformation and fusion on the feature map, and the cost matrix is transformed into the depth probability distribution map.
5. The method according to claim 1 or 4, wherein in step 4 the three-dimensional convolutional neural network with a multi-scale structure comprises: an encoding part and a decoding part, each with 4 levels from bottom to top; the first level consists of three convolutional layers with 32 channels, and the second to fourth levels each consist of two convolutional layers with 8 channels; a hole convolution operation is adopted between levels, doubling the feature map size after each level; the decoding part is regarded as the inverse process of the encoding, and the feature map sizes of corresponding encoding and decoding levels stay consistent; at each level of the decoding part, the output of the previous decoding layer is fused with the corresponding encoding layer; and the output of the last decoding layer is converted into the depth probability distribution map by a softmax operation.
6. The method according to claim 1, wherein in step 6, the deep residual learning network is formed by a two-dimensional convolutional neural network of 3-layer 32 channels and 1-layer 1 channels, and the last layer of the deep residual learning network does not include a batch normalization processing layer and a ReLU layer.
7. A device for the unmanned aerial vehicle multi-view three-dimensional reconstruction method, characterized by comprising a processor and a memory; a computer program implementing the unmanned aerial vehicle multi-view three-dimensional reconstruction method is stored in the memory; and the processor executes the computer program stored in the memory to perform the three-dimensional reconstruction of the scene.
CN202010393797.1A 2020-05-11 2020-05-11 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle Active CN111652966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010393797.1A CN111652966B (en) 2020-05-11 2020-05-11 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010393797.1A CN111652966B (en) 2020-05-11 2020-05-11 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN111652966A CN111652966A (en) 2020-09-11
CN111652966B true CN111652966B (en) 2021-06-04

Family

ID=72343695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010393797.1A Active CN111652966B (en) 2020-05-11 2020-05-11 Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle

Country Status (1)

Country Link
CN (1) CN111652966B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112233228B (en) * 2020-10-28 2024-02-20 五邑大学 Unmanned aerial vehicle-based urban three-dimensional reconstruction method, device and storage medium
CN112697044B (en) * 2020-12-17 2021-11-26 北京航空航天大学 Static rigid object vision measurement method based on unmanned aerial vehicle platform
CN112750201B (en) * 2021-01-15 2024-03-29 浙江商汤科技开发有限公司 Three-dimensional reconstruction method, related device and equipment
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN112907463A (en) * 2021-01-28 2021-06-04 华南理工大学 Depth image error point removing method combining image semantics and three-dimensional information
CN112861747B (en) * 2021-02-22 2022-06-07 深圳大学 Cross-view image optimization method and device, computer equipment and readable storage medium
CN112950786A (en) * 2021-03-01 2021-06-11 哈尔滨理工大学 Vehicle three-dimensional reconstruction method based on neural network
CN113066165B (en) * 2021-03-19 2022-06-21 北京邮电大学 Three-dimensional reconstruction method and device for multi-stage unsupervised learning and electronic equipment
CN113393582A (en) * 2021-05-24 2021-09-14 电子科技大学 Three-dimensional object reconstruction algorithm based on deep learning
CN113284251B (en) * 2021-06-11 2022-06-03 清华大学深圳国际研究生院 Cascade network three-dimensional reconstruction method and system with self-adaptive view angle
CN115661810A (en) * 2021-08-27 2023-01-31 同方威视技术股份有限公司 Security check CT target object identification method and device
CN113962858B (en) * 2021-10-22 2024-03-26 沈阳工业大学 Multi-view depth acquisition method
CN115239915B (en) * 2022-09-21 2022-12-09 季华实验室 VR scene real-time reconstruction method and device, electronic equipment and storage medium
CN115601498A (en) * 2022-09-27 2023-01-13 Inner Mongolia University of Technology (CN) Single image three-dimensional reconstruction method based on RealPoin3D

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389671A (en) * 2018-09-25 2019-02-26 南京大学 A kind of single image three-dimensional rebuilding method based on multistage neural network
CN110211061A (en) * 2019-05-20 2019-09-06 清华大学 List depth camera depth map real time enhancing method and device neural network based

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2978855B1 (en) * 2011-08-04 2013-09-27 Commissariat Energie Atomique METHOD AND DEVICE FOR CALCULATING A DEPTH CARD FROM A SINGLE IMAGE
CN104318569B (en) * 2014-10-27 2017-02-22 北京工业大学 Space salient region extraction method based on depth variation model
CN106485192B (en) * 2015-09-02 2019-12-06 富士通株式会社 Training method and device of neural network for image recognition
CN108416840B (en) * 2018-03-14 2020-02-18 大连理工大学 Three-dimensional scene dense reconstruction method based on monocular camera
CN109242959B (en) * 2018-08-29 2020-07-21 清华大学 Three-dimensional scene reconstruction method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389671A (en) * 2018-09-25 2019-02-26 南京大学 A kind of single image three-dimensional rebuilding method based on multistage neural network
CN110211061A (en) * 2019-05-20 2019-09-06 清华大学 List depth camera depth map real time enhancing method and device neural network based

Also Published As

Publication number Publication date
CN111652966A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652966B (en) Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle
Li et al. A multi-scale guided cascade hourglass network for depth completion
CN110443842B (en) Depth map prediction method based on visual angle fusion
CN109255831B (en) Single-view face three-dimensional reconstruction and texture generation method based on multi-task learning
Xu et al. Structured attention guided convolutional neural fields for monocular depth estimation
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN113066168B (en) Multi-view stereo network three-dimensional reconstruction method and system
CN112396645B (en) Monocular image depth estimation method and system based on convolution residual learning
WO2021018163A1 (en) Neural network search method and apparatus
CN107358576A (en) Depth map super resolution ratio reconstruction method based on convolutional neural networks
CN113345082B (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
CN113538243B (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN111127538A (en) Multi-view image three-dimensional reconstruction method based on convolution cyclic coding-decoding structure
CN111951195A (en) Image enhancement method and device
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN114418030A (en) Image classification method, and training method and device of image classification model
CN115984494A (en) Deep learning-based three-dimensional terrain reconstruction method for lunar navigation image
CN114757862B (en) Image enhancement progressive fusion method for infrared light field device
CN113096239B (en) Three-dimensional point cloud reconstruction method based on deep learning
CN112115786A (en) Monocular vision odometer method based on attention U-net
CN116342675A (en) Real-time monocular depth estimation method, system, electronic equipment and storage medium
CN112116646B (en) Depth estimation method for light field image based on depth convolution neural network
CN115565039A (en) Monocular input dynamic scene new view synthesis method based on self-attention mechanism
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20211229

Address after: 908, block a, floor 8, No. 116, Zizhuyuan Road, Haidian District, Beijing 100089

Patentee after: ZHONGZI DATA CO.,LTD.

Address before: 100191 No. 37, Haidian District, Beijing, Xueyuan Road

Patentee before: BEIHANG University

TR01 Transfer of patent right