CN115205489A - Three-dimensional reconstruction method, system and device in large scene - Google Patents


Info

Publication number
CN115205489A
Authority
CN
China
Prior art keywords
image
matching
reconstruction
dimensional
cloud
Prior art date
Legal status
Pending
Application number
CN202210630432.5A
Other languages
Chinese (zh)
Inventor
梁凌宇
邹朝军
Current Assignee
Guangzhou Zhongsi Artificial Intelligence Technology Co ltd
Original Assignee
Guangzhou Zhongsi Artificial Intelligence Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Zhongsi Artificial Intelligence Technology Co., Ltd.
Priority to CN202210630432.5A
Publication of CN115205489A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20Finite element generation, e.g. wire-frame surface description, tesselation
    • G06T17/205Re-meshing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]


Abstract

The invention discloses a three-dimensional reconstruction method under a large scene, which comprises the following steps: acquiring image data of the reconstruction target through RGB image acquisition equipment and preprocessing the image data; retrieving and matching the images, computing the feature points of each image, and matching the feature points; computing the camera pose corresponding to each image; obtaining a dense point cloud intermediate model of the scene from the images and the corresponding camera poses; and post-processing the three-dimensional point cloud model to finally obtain a three-dimensional reconstruction mesh model. The method solves the problems of overlong image data matching time, low precision in some scenes, and insufficient completeness under large scenes. In addition, the invention also provides a three-dimensional reconstruction system and a device, wherein the system is used for realizing the reconstruction method and the device is used for deploying the system. The method, the system and the device can realize three-dimensional reconstruction in a large scene, separate image acquisition from the three-dimensional reconstruction computation, and have broad application prospects.

Description

Three-dimensional reconstruction method, system and device in large scene
Technical Field
The invention relates to the technical field of image processing, in particular to a three-dimensional reconstruction method, a three-dimensional reconstruction system and a three-dimensional reconstruction device in a large scene.
Background
Three-dimensional reconstruction techniques refer to the reconstruction of real-world scenes or objects into computer-expressible and processable data models using computer technology. Three-dimensional reconstruction based on RGB images is increasingly widely applied due to the low requirements on acquisition equipment and the low cost of the reconstruction process. At present, the three-dimensional reconstruction technology based on the RGB image can be mainly divided into two steps, including sparse reconstruction based on a motion recovery structure technology and dense reconstruction based on a multi-view stereo technology.
The motion recovery structure technology is used for recovering the camera poses. Feature point extraction and matching are its key steps; at present the industry mainly adopts SIFT features to extract and describe feature points. SIFT features have illumination and scale invariance and perform well in richly textured scenes; however, for solid-colour scenes such as wall surfaces and floors, features are difficult to extract and the reconstruction effect is poor.
The multi-view stereo reconstruction technology is to recover dense point cloud of a scene, and at present, the technology can be divided into multi-view stereo reconstruction algorithms based on point cloud diffusion, voxel and depth map. The point cloud diffusion method limits the parallel capability of calculation in the propagation process, and the speed is very low; voxel reconstruction occupies a large amount of memory space, and memory consumption is unacceptable under the requirement of large-scene application; the point cloud reconstruction method based on the depth map estimates the depth map of each picture, then performs depth fusion, and decouples the MVS task into the depth estimation task of each view, so that the method is very suitable for large-scale scene reconstruction. In order to ensure the feasibility and efficiency of large scene reconstruction, the reconstruction method is also based on the depth map. The multi-view stereo reconstruction technology has poor performance in some scenes including wall surfaces, glass and the like, and the reconstruction precision and integrity are insufficient, which is a problem to be solved urgently.
In a large scene, tens of thousands of RGB images are needed for reconstruction, and the efficiency of the reconstruction process is obviously reduced. How to improve reconstruction efficiency under the requirement of large scene reconstruction is another problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, provides a three-dimensional reconstruction method, a three-dimensional reconstruction system and a three-dimensional reconstruction device in a large scene, and solves the problems of overlong image data matching time, low precision of partial scenes and insufficient integrity in the large scene.
In order to achieve the purpose, the invention provides a three-dimensional reconstruction method under a large scene, which comprises the following steps:
s1, acquiring image data of a reconstructed target through RGB image acquisition equipment, and preprocessing the image data;
s2, retrieving and matching the images, calculating the characteristic points of each image, and matching the characteristic points;
s3, calculating the corresponding camera pose of each image;
s4, obtaining a dense point cloud intermediate model of the scene according to the image and the corresponding camera pose;
and S5, post-processing the three-dimensional point cloud model to finally obtain a three-dimensional reconstruction grid model.
Preferably, the three-dimensional reconstruction method includes the steps of:
(1) RGB image data of the scene are acquired by an image acquisition device; the data may be pictures or videos in various formats. For video data, the video is adaptively sampled according to its length and quality to obtain scene pictures. An image set of the scene I = {I_i | i = 1, 2, ..., N} is obtained.
(2) Using an image retrieval technique based on the deep neural network NetVLAD, a vector G_i is extracted from each picture I_i of the image set I as the global descriptor of the picture. Pictures of similar viewing angles are rapidly matched according to the global descriptors, yielding the set of matched image pairs C = {{I_a, I_b} | I_a, I_b ∈ I, a < b}.
(3) Feature points are extracted and described with SuperPoint, a deep neural network, giving the feature point representation F = (x, d), where x is the keypoint and d is the descriptor.
First, a VGG-based convolutional neural network is used as the encoder to downsample and encode an image of size W × H into a feature map of size W/8 × H/8. The keypoint representation and the descriptor representation of the feature points are then obtained by a keypoint decoder and a descriptor decoder running in parallel. The keypoint decoder convolves the feature map into 65 channels, corresponding to 64 region channels and one 'no keypoint' channel; a W × H × 1 keypoint map is finally obtained through Softmax and Reshape. The descriptor decoder adopts a UCN-like fully convolutional structure to extract more accurate geometric and semantic information, and then a W × H × D description matrix is obtained through bicubic interpolation and L2 normalization, corresponding to the D-dimensional descriptor d of each keypoint x.
(4) Feature point matching is performed with SuperGlue, based on a graph convolutional neural network. On a matched image pair C_{a,b} = {I_a, I_b}, the set of feature point matches A_{a,b} = {A(F_i, F_j) | F_i ∈ I_a, F_j ∈ I_b} is obtained.
For a feature point F_i = (x_i, d_i), a multilayer perceptron (MLP) merges the keypoint position and the description information:
y_i = d_i + MLP(x_i)
All subsequent processing uses y_i as the representation of the feature point.
An attention-based graph convolutional neural network then aggregates feature points within and between images. The nodes of the graph are the feature points y_i; the edges are of two types: edges ε_s connecting feature points within an image and edges ε_c connecting feature points between images. ε_s reflects the neighbourhood information of a feature point, while ε_c describes the similarity of feature points across images. A multilayer GNN aggregates the neighbour features of each feature point and the cross-image similarity information, finally producing the matching description z_i of feature point y_i.
Using the matching descriptions z, an assignment matrix is constructed over all feature points of the two matched images C_{a,b}. The assignment matrix is iteratively optimized with the Sinkhorn algorithm, realizing the feature point matching A_{a,b} = {A(F_i, F_j) | F_i ∈ I_a, F_j ∈ I_b} between the two images C_{a,b} = {I_a, I_b}.
(5) Camera pose estimation and sparse point cloud reconstruction of the scene are performed with the motion recovery structure (structure-from-motion) technique. Incremental SfM carries out initialization, image registration, triangulation, and bundle adjustment optimization, giving the camera pose corresponding to each picture and a sparse point cloud of the scene. The steps are as follows:
and (5) initializing. According to matching feature points
Figure BDA0003679363630000041
Number and distribution selection in images of initially matching image pairs C init And selecting the image pair with the most matching points and the most uniform distribution as much as possible. The relative pose of the camera is calculated using epipolar geometric constraints, including the rotation matrix R and the translation vector t. Obtained mainly by solving the following epipolar geometric constraint equation:
Figure BDA0003679363630000042
and (5) triangularization. Known camera (relative) poses R, t and matching point position x i 、x j The following relationship holds:
Z i x j ×Rx i +x i ×t=0
solving the equation by matching the coordinates of the characteristic point pairs to obtain the characteristic points x i Depth Z of i And obtaining the position of the three-dimensional space. After initialization and triangularization, an initial model M is obtained containing only two images init
Image registration. The remaining images are registered to the initial model M_init using the PnP algorithm. The coordinates of the three-dimensional points in the camera coordinate system are obtained through spatially similar geometric relations; from the coordinates {P_k | k = 1, 2, ..., n} of the n three-dimensional points in the world coordinate system and their coordinates {P_k' | k = 1, 2, ..., n} in the camera coordinate system, the camera pose is finally solved with the iterative closest point (ICP) algorithm, i.e. by optimizing
min_{R,t} Σ_{k=1}^{n} || P_k' − (R P_k + t) ||^2
Bundle adjustment optimization (BA). The positions of the three-dimensional points and the camera parameters are adjusted so that the error of reprojecting the three-dimensional points into the images is minimized, i.e. the following cost function is optimized:
E(Φ, P) = Σ || h(Φ, P) − p ||^2
where Φ(K, R, t) are the camera parameters, P is the world coordinate of a three-dimensional point in space, h(·) is the projection function, and p is the pixel coordinate of the three-dimensional point in the image. By adjusting Φ and P, the above reprojection cost is minimized.
(6) Using the camera parameters Φ(K, R, t) obtained in the motion recovery structure step, a depth map D_i is estimated for each image with a deep-learning-based MVS algorithm, and the depth maps {D_i | i = 1, 2, ..., N} are fused to obtain the dense point cloud model M_dense of the large scene.
During image feature extraction, a convolutional neural network is used. The receptive field and aggregation weights of the CNN are adjusted adaptively according to the texture richness of the reconstructed target surface: deformable convolution adjusts the receptive field, and adaptive weight modulation adjusts the aggregation weights, as follows:
F'(p) = Σ_k w_k · F(p + o_k + Δo_k) · Δw_k
The offsets Δo_k, which vary with the current features, adaptively adjust the size and position of the convolution kernel, and the modulations Δw_k adaptively adjust the aggregation weights, so that depth features more favourable to the subsequent stereo matching are obtained.
After the features are acquired, randomly sampled depth hypotheses {d_k | k = 1, 2, ..., N_d} are generated for each feature pixel. Using a differentiable homography, the reference feature map F_ref is warped onto its N_s neighbouring source feature maps {F_i | i = 1, 2, ..., N_s}: a feature pixel p in F_ref, under depth hypothesis d_k, is projected into the source feature image F_i by
p_i(p, d_k) = K_i ( R_i R_ref^{-1} ( d_k K_ref^{-1} p − t_ref ) + t_i )
followed by dehomogenization.
A 3D cost volume is obtained from the distance between features: for source image F_i, under depth hypothesis d_k, the cost is generated with the two-norm of the feature difference:
C_i(p, d_k) = || F_i[p_i(p, d_k)] − F_ref(p) ||_2
Adaptive weights w(C_i(d_k)) are used to aggregate the 3D cost volumes of the N_s different views:
C(p, d_k) = Σ_{i=1}^{N_s} w(C_i(d_k)) · C_i(p, d_k)
Softmax is performed on the cost volume to generate a probability volume:
P = softmax(C)
The probability volume is used to take a weighted average of the depth hypotheses, giving the final depth map:
D(p) = Σ_{k=1}^{N_d} d_k · P(p, d_k)
and constructing a scene dense point cloud model through the depth map. For a certain pixel P in an image, obtaining a coordinate P of a 3D point in a real space according to a camera parameter t (R, t), K and a depth map D:
P=D(p)Τ -1 K -1 p
all 3D points jointly form a dense point cloud model M dense
(7) Post-processing: surface reconstruction, surface optimization, and texture mapping are performed on the dense point cloud M_dense of the scene to obtain a three-dimensional model M with a good visual effect.
The invention also provides a three-dimensional reconstruction system for realizing the three-dimensional reconstruction method under the large scene, which comprises the following steps:
the terminal equipment comprises an image acquisition module; the image acquisition module acquires RGB pictures or video data of a scene through image acquisition equipment;
the cloud device comprises an image preprocessing module; the image preprocessing module acquires scene RGB (red, green and blue) pictures or video data uploaded to the cloud equipment through the communication module and preprocesses the data;
the cloud device further comprises an image retrieval and matching module; the image retrieval and matching module is connected with the image preprocessing module and is used for rapidly matching the large-scale images by using the computing function of the cloud equipment;
the image feature extraction and matching module is connected with the image retrieval and matching module, and is used for extracting feature points from the image by using the computing function of the cloud equipment so as to realize rapid and accurate matching of the feature points;
the reconstruction module is connected with the image feature extraction and matching module, and carries out camera pose estimation and dense point cloud reconstruction on the image subjected to feature point extraction and matching by using the computing function of the cloud equipment;
the post-processing module is connected with the reconstruction module, and surface reconstruction and optimization and texture mapping are carried out on the reconstructed dense point cloud by using the computing function of the cloud equipment, so that the visual effect of the model is optimized; obtaining a reconstruction model with a good visual effect;
and the communication module is used for communication between the terminal equipment and the cloud equipment.
The invention also provides a three-dimensional reconstruction device for realizing the three-dimensional reconstruction system in the large scene, which comprises the following components: the system comprises terminal equipment and cloud equipment;
the terminal device is used for acquiring and storing image data of a reconstructed scene, and comprises a terminal device memory and a communication module, wherein the terminal device memory is used for storing acquired images, and the communication module is used for communicating with cloud equipment;
the cloud equipment is used for performing three-dimensional reconstruction according to the image data; the cloud device at least comprises a cloud device memory, a processor and a communication module;
the cloud device storage is used for storing a computer program for realizing the three-dimensional reconstruction method in the large scene according to any one of claims 1 to 8 and intermediate data; the processor comprises a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU);
the processor is used for calling computer programs and data in a cloud device memory to realize the three-dimensional reconstruction method;
the communication module is used for being in communication connection with the terminal equipment.
Compared with the prior art, the invention has the beneficial effects that:
the invention preprocesses the image data of the obtained reconstruction target, thereby realizing the reconstruction of input pictures and videos with various formats; by adopting the image retrieval technology, the scene picture set can be effectively and globally described, similar scene images can be quickly found and matched according to the global descriptor, and a matched image pair is obtained. In the application of large-scene three-dimensional reconstruction, the image retrieval and matching technology provided by the embodiment of the application can effectively reduce the matching time while ensuring the accuracy; when the matching pair of the images is established, the quick and accurate matching pair generation is realized by using the image retrieval technology from the viewpoint of the visual similarity, so that the matching modes such as detailed matching, space matching and the like are avoided, the matching process does not depend on additional space information data such as POS (point of sale) and the like, and the scene image acquired from the image acquisition equipment without positioning and navigation can still be quickly and accurately matched and reconstructed; the problems of overlong image data matching time, low precision of partial scenes and insufficient integrity in large scenes are solved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic step diagram of a three-dimensional reconstruction method in a large scene provided by the present invention;
fig. 2 is a flowchart illustrating a three-dimensional reconstruction method in a large scene according to an embodiment of the present invention:
FIG. 3 is a schematic diagram of a three-dimensional reconstruction system in a large scene according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a three-dimensional reconstruction apparatus in a large scene according to an embodiment of the present invention.
The figure comprises the following components:
20. a terminal device; 201. an image acquisition module; 303. an image acquisition device; 21. cloud equipment; 202. an image preprocessing module; 207. a communication module; 203. an image retrieval and matching module; 204. an image feature extraction and matching module; 205. a reconstruction module; 206. a post-processing module; 304. a terminal device memory; 307. a cloud device memory; 306. a processor.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are one embodiment of the present invention, and not all embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without any creative work belong to the protection scope of the present invention.
Example one
Referring to fig. 1 and fig. 2, a three-dimensional reconstruction method in a large scene is provided in an embodiment of the present invention.
Fig. 2 shows a three-dimensional reconstruction method in a large scene according to an embodiment. The method comprises the following specific steps:
s101: and acquiring image data of the reconstructed target scene by the RGB image acquisition equipment. The captured image data supports a variety of formats including pictures or video.
The image acquisition equipment is any equipment for acquiring RGB images; including but not limited to smart phones, cameras, drones, etc. with image capture capabilities.
When the image acquisition device performs the acquisition task for the target scene, the images should cover all visible parts of the scene to be reconstructed; scene parts for which no RGB image data are acquired cannot be three-dimensionally reconstructed. During acquisition, the viewing angles should be as diverse as possible, and the scenes in different images should overlap to a certain extent. When changing the viewing angle, the change should as far as possible be made by translation, avoiding pure rotation. Images containing only a single solid-colour surface should be avoided; adjusting the angle or zoom distance so that richly textured parts of the scene are included, or adding posters, objects and the like to the solid-colour surface, can significantly enhance the reconstruction effect. Finally, the visual effect of the texture mapping depends on the resolution of the acquired image data: the higher the resolution, the better the visual effect.
S102: the images are preprocessed according to the acquired image format. If the acquired data are a video, the video is sampled: a threshold [N_min, N_max] on the number of sampled pictures is set, and the sampling rate r_s is adjusted according to the length of the video so that the number of frames after sampling falls within the threshold range. The scene picture set I = {I_i | i = 1, 2, ..., N} is obtained.
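As an illustration of the adaptive sampling in S102, the following sketch (assuming OpenCV for decoding; the bounds N_MIN and N_MAX are illustrative values, not prescribed by the embodiment) chooses a sampling step so that the number of kept frames falls inside [N_min, N_max]:

```python
# Minimal sketch of adaptive video sampling; N_MIN / N_MAX are assumed bounds.
import cv2

N_MIN, N_MAX = 200, 2000

def sample_video(path):
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Choose a sampling step r_s so that the frame count lands in [N_MIN, N_MAX].
    step = max(1, total // N_MAX)
    if total // step < N_MIN:
        step = max(1, total // N_MIN)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames  # the scene picture set I = {I_i}
```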
S103, retrieving and matching all the images to obtain image matching pairs. The image retrieval uses a deep neural network-NetVLAD to carry out scene recognition, and matches the images of similar scenes to obtain a plurality of one-to-one image matching pairs. The image retrieval and matching method employed in this example is as follows:
By adopting the image retrieval technique, the scene picture set I can be described globally and effectively by the global descriptors G = {G_i | I_i ∈ I}. Images of similar scenes are found and matched according to the global descriptors, yielding the matched image pairs C = {{I_a, I_b} | I_a, I_b ∈ I, a < b}. In large-scene three-dimensional reconstruction, the image retrieval and matching technique provided by the embodiments of the application can effectively reduce the matching time while ensuring accuracy.
In this example, a NetVLAD-based method is used to extract the global descriptor of each picture, as follows: features of the scene image I_i are first extracted by VGG16, and a NetVLAD layer clusters the features to obtain a VLAD vector, which serves as the global descriptor G_i of the image and thus as a scene representation of the image. Images whose descriptors cluster together are matched, giving the one-to-one image matching pairs C. It is worth noting that the cluster centres also participate in the network training, so they can capture semantics, giving a better effect than traditional clustering methods.
Compared with the traditional retrieval method, the image retrieval method based on deep learning has better performance in some environments.
For a large number of unordered images, the time advantage of image retrieval over traditional exhaustive matching is significant. Compared with faster spatial matching methods, this approach does not require spatial POS information for the input image data, so it can serve reconstruction tasks using common image acquisition devices such as mobile phones. Meanwhile, this step is also the core of reducing the image matching time.
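A minimal sketch of this retrieval-based pair generation is given below; it assumes `netvlad` is a pretrained network (VGG16 backbone with a NetVLAD pooling layer) that maps an image tensor to a global descriptor, and the top-k value is an illustrative choice rather than a parameter fixed by the embodiment:

```python
# Sketch of global-descriptor retrieval and pair generation (assumptions noted above).
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_match_pairs(images, netvlad, k=20):
    # Global descriptors G_i, L2-normalised so the inner product is cosine similarity.
    G = F.normalize(torch.stack([netvlad(img) for img in images]), dim=1)
    sim = G @ G.t()                                   # pairwise view similarity
    pairs = set()
    for a in range(len(images)):
        topk = torch.topk(sim[a], k + 1).indices.tolist()
        for b in topk:
            if b != a:
                pairs.add((min(a, b), max(a, b)))     # C = {{I_a, I_b} | a < b}
    return sorted(pairs)
```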
S104: feature points F = (x, d) are extracted for the scene picture set I = {I_i | i = 1, 2, ..., N}, where x is the keypoint and d is the descriptor. The deep neural network SuperPoint is used for feature point extraction and description. The feature point extraction and description method of this example includes:
A VGG-based convolutional neural network is used as the encoder to downsample and encode the W × H image into a feature map of size W/8 × H/8, which reduces the amount of computation and saves computation time.
Two decoders run in parallel: the keypoint decoder and the descriptor decoder obtain the keypoint representation and the descriptor representation of the feature points. The keypoint decoder first convolves the current feature map into a 65-channel feature map, corresponding to 64 region channels and one 'no keypoint' channel; a W × H × 1 keypoint map is finally obtained through Softmax and Reshape. The descriptor decoder adopts a UCN-like fully convolutional structure to extract more accurate geometric and semantic information, and then a W × H × D description matrix is obtained through bicubic interpolation and L2 normalization, corresponding to the D-dimensional descriptor d of each keypoint x.
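The two decoder heads can be sketched as follows; the shapes follow the description above (65 = 64 cell channels + 1 'no keypoint' channel at W/8 × H/8 encoder resolution), but the exact layer configuration of the published SuperPoint network is not reproduced here:

```python
# Sketch of the keypoint and descriptor heads, assuming the encoder output shapes above.
import torch
import torch.nn.functional as F

def keypoint_head(semi):
    # semi: (B, 65, H/8, W/8) -> 64 cell channels + 1 "no keypoint" dustbin channel
    prob = F.softmax(semi, dim=1)[:, :-1]                    # drop the dustbin channel
    b, _, hc, wc = prob.shape
    prob = prob.permute(0, 2, 3, 1).reshape(b, hc, wc, 8, 8)  # cells back to 8x8 pixels
    prob = prob.permute(0, 1, 3, 2, 4).reshape(b, hc * 8, wc * 8)
    return prob                                               # W x H keypoint heatmap

def descriptor_head(coarse_desc, h, w):
    # coarse_desc: (B, D, H/8, W/8) -> dense W x H x D descriptors
    desc = F.interpolate(coarse_desc, size=(h, w), mode='bicubic', align_corners=False)
    return F.normalize(desc, p=2, dim=1)                      # L2 normalisation per pixel
```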
S105: based on the matched image pairs C obtained in S103, feature point matching is performed within each image pair C_{a,b} = {I_a, I_b}, where the feature points F are those obtained in S104. The feature point matching algorithm adopted in this embodiment is SuperGlue, based on a graph convolutional neural network, which realizes the one-to-one matching A_{a,b} = {A(F_i, F_j) | F_i ∈ I_a, F_j ∈ I_b} of feature points between the image pair. The feature point matching method of this embodiment mainly includes:
Merging the feature point information. For a feature point (x_i, d_i), a multilayer perceptron merges the keypoint position and the description information:
y_i = d_i + MLP(x_i)
y_i is then used as the representation of the feature point.
An attention-based graph convolutional neural network aggregates feature points within and between images. The nodes of the graph are the feature points y_i; the edges are of two types: edges ε_s connecting feature points within an image and edges ε_c connecting feature points between images. ε_s reflects the neighbourhood information of a feature point, while ε_c describes the similarity of feature points across images. A multilayer GNN aggregates the neighbour features of each feature point and the cross-image similarity information, finally producing the matching description z_i of feature point y_i.
The optimal matching layer. The matching problem is converted into an optimal assignment problem, and feature point matching is solved by solving for the optimal assignment matrix. The similarity scores S are obtained from the inner products of the matching description vectors z_i across the two images, an assignment matrix is constructed, and the optimal assignment matrix is optimized with the Sinkhorn algorithm, realizing the feature point matching A_{a,b} = {A(F_i, F_j) | F_i ∈ I_a, F_j ∈ I_b} between the two images C_{a,b} = {I_a, I_b}.
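A sketch of this optimal matching step is given below: the score matrix built from the inner products of the matching descriptions z_i is normalized into a soft assignment matrix by Sinkhorn iterations (in log space for numerical stability); the dustbin rows and columns of the full SuperGlue formulation are omitted for brevity:

```python
# Sketch of Sinkhorn-based assignment over the similarity scores S (dustbins omitted).
import torch

def sinkhorn(scores, iters=50):
    # scores: (M, N) inner products of the matching descriptions z_i, z_j
    log_p = scores
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalise
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalise
    return log_p.exp()            # approximately doubly-stochastic assignment matrix

def mutual_matches(assign, thresh=0.2):
    # Keep pairs that are each other's best assignment and exceed a confidence threshold.
    row_best = assign.argmax(dim=1)
    col_best = assign.argmax(dim=0)
    matches = []
    for i, j in enumerate(row_best.tolist()):
        if col_best[j].item() == i and assign[i, j] > thresh:
            matches.append((i, j))
    return matches
```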
S106: based on the matched images and the feature points solved in S105, the sparse point cloud model M_sparse of the three-dimensional scene is reconstructed through structure from motion (SfM), and the camera parameters Φ(K, R, t) of each picture are obtained at the same time.
The motion recovery structure technology adopted in this embodiment is incremental SfM, which mainly includes initialization, image registration, triangulation, and bundle adjustment optimization. The specific steps are as follows:
(1) Initialization. The initial matching image pair C_init is selected according to the number and distribution of the matching feature points A_{a,b} in the images. Initialization greatly affects the final reconstruction effect and completeness, so when selecting the initial image pair, the number of feature point matches obtained in step S105 and the uniformity of their distribution in the image are considered together, and the image pair with more matches and a more uniform distribution is selected for sparse reconstruction.
The relative camera pose, consisting of the rotation matrix R and the translation vector t, is computed from the epipolar geometric constraint, mainly by solving the epipolar constraint equation
x_j^T [t]_× R x_i = 0
(2) Triangulation. From the camera parameters Φ(K, R, t) and the feature point matching pairs A(F_i, F_j), the positions of the three-dimensional points in space are computed and the three-dimensional points are reconstructed. Given the camera pose R, t and the matched point positions x_i, x_j, the following relation holds:
Z_i (x_j × R x_i) + x_j × t = 0
Solving this equation with the coordinates of the matched feature points yields the depth Z_i of feature point x_i and hence its position in three-dimensional space.
Through initialization and triangulation, an initial model M_init containing only the information of two images is obtained.
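For illustration, the triangulation step can be sketched with OpenCV's DLT-based solver, which stands in for directly solving the constraint above; normalized image coordinates and the relative pose (R, t) from initialization are assumed:

```python
# Sketch of two-view triangulation under the assumptions stated above.
import cv2
import numpy as np

def triangulate(R, t, x_i, x_j):
    # x_i, x_j: (N, 2) matched, normalised feature coordinates in the two views
    P_i = np.hstack([np.eye(3), np.zeros((3, 1))])    # first camera at the origin
    P_j = np.hstack([R, t.reshape(3, 1)])             # second camera [R | t]
    X_h = cv2.triangulatePoints(P_i, P_j,
                                x_i.T.astype(np.float64),
                                x_j.T.astype(np.float64))   # 4 x N homogeneous points
    X = (X_h[:3] / X_h[3]).T                          # 3D points; depth Z_i = X[:, 2]
    return X
```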
(3) Image registration. After initialization, the remaining images are registered to the initial model M_init in a loop. In a given registration step, let the current model be M_now. The image that observes the most 3D points already reconstructed in M_now is chosen as the next image to register, I_next. The camera pose (R, t) of I_next is computed with PnP: spatially similar geometric relations give the coordinates of the three-dimensional points in the camera coordinate system, and from the coordinates {P_k | k = 1, 2, ..., n} of the n three-dimensional points in the world coordinate system and their coordinates {P_k' | k = 1, 2, ..., n} in the camera coordinate system, the camera pose is solved with the iterative closest point (ICP) algorithm, i.e. by optimizing:
min_{R,t} Σ_{k=1}^{n} || P_k' − (R P_k + t) ||^2
The feature points in the image are then converted into 3D coordinates by triangulation again and added to the scene model M_now, completing the registration of image I_next.
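As a sketch of registering the next image, OpenCV's RANSAC PnP solver can serve as an off-the-shelf stand-in for the PnP-plus-ICP procedure described above; the 2D-3D correspondences come from feature matches against points already in M_now:

```python
# Sketch of image registration via RANSAC PnP (a stand-in, not the exact PnP+ICP scheme).
import cv2
import numpy as np

def register_image(points_3d, points_2d, K):
    # points_3d: (n, 3) reconstructed points already in the model M_now
    # points_2d: (n, 2) pixel observations of those points in I_next
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64), points_2d.astype(np.float64),
        K, None, reprojectionError=4.0)
    if not ok:
        raise RuntimeError("registration failed")
    R, _ = cv2.Rodrigues(rvec)            # rotation matrix of the new camera
    return R, tvec, inliers
```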
(4) Bundle adjustment optimization (BA). The incremental reconstruction process accumulates errors, which causes drift of the scene. Therefore, during image registration, after a certain number of loop iterations, the reprojection error is reduced through bundle adjustment optimization; the camera poses and the three-dimensional scene points are optimized, improving the accuracy of the result.
The positions of the 3D points in the current model M_now and the camera parameters are adjusted so that the error of reprojecting the three-dimensional points into the images is minimized, i.e. the following cost function is optimized:
E(Φ, P) = Σ || h(Φ, P) − p ||^2
where Φ(K, R, t) are the camera parameters, P is the world coordinate of a three-dimensional point in model M_now, h(·) is the projection function, and p is the pixel coordinate in the image corresponding to the 3D point P. By adjusting the camera parameters Φ and the spatial points P, the above reprojection cost is minimized.
Finally, the sparse point cloud model M_sparse and the camera parameters Φ of each picture are obtained.
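The reprojection cost minimized by bundle adjustment can be illustrated as below; for brevity only one camera pose is refined against fixed 3D points with SciPy's least_squares, whereas a full BA solver optimizes all poses and points jointly:

```python
# Sketch of the reprojection cost h(Phi, P) - p, refining a single camera pose only.
import numpy as np
from scipy.optimize import least_squares

def project(params, P, K):
    # params = (rx, ry, rz, tx, ty, tz): axis-angle rotation + translation
    rvec, t = params[:3], params[3:]
    theta = np.linalg.norm(rvec) + 1e-12
    k = rvec / theta
    Kx = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    R = np.eye(3) + np.sin(theta) * Kx + (1 - np.cos(theta)) * (Kx @ Kx)  # Rodrigues
    Pc = (R @ P.T).T + t                  # world -> camera coordinates
    uv = (K @ Pc.T).T
    return uv[:, :2] / uv[:, 2:3]         # pixel coordinates h(Phi, P)

def refine_pose(params0, P, p_obs, K):
    residual = lambda params: (project(params, P, K) - p_obs).ravel()
    return least_squares(residual, params0).x
```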
S107: the dense point cloud model M_dense of the scene is reconstructed based on the image set I and the corresponding camera parameters Φ obtained from S106. The dense point cloud reconstruction method adopted in this embodiment is a deep-learning-based MVS network. Similar to the classic learning-based MVS network MVSNet, this example is likewise divided into four modules: feature extraction, cost volume construction, cost volume regularization, and depth regression.
During image feature extraction, a deformable convolution strategy is adopted: the CNN receptive field is enlarged in low-texture regions and reduced in texture-rich regions, which improves the reconstruction completeness of low-texture scene parts such as walls and solid-colour floors. When constructing the cost volume, the costs corresponding to images from different viewing angles need to be aggregated; because of illumination problems caused by occlusion and non-Lambertian surfaces, this embodiment performs weighted aggregation with learned adaptive weights instead of simple averaging during cost aggregation. Through these strategies, the reconstruction of low-texture, occluded, and non-Lambertian regions is improved. The specific process is as follows:
the input image is 1 reference image I ref And N s Sheet source image I source ={I i ∈Ω ref |i=1,2,...,N s }. Wherein Ω is ref Representing the spatial and reference images I ref A set of images captured by cameras whose corresponding cameras have a spatial neighboring relationship, which is determined from the camera pose (R, t) obtained in S106.
All N s Feature map of +1 images
Figure BDA0003679363630000151
Extracted by an encoder (convolutional neural network) with shared weights. In the process of extracting image features, in order to enhance the reconstruction effect in the low-texture region, by using deformable convolution, the receptive field of 2DCNN can be adaptively increased in the low-texture region, and the weighted aggregation process is adjusted by using weight offset, which is expressed as follows:
Figure BDA0003679363630000152
change of Δ o according to characteristics k By adaptively adjusting the size and position of the convolution kernel, change Δ w k The weighting weight is adaptively adjusted, so that depth features which are more beneficial to subsequent stereo matching are obtained.
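A sketch of this adaptive feature extraction is given below, using torchvision's modulated deformable convolution: one small convolution predicts the offsets Δo_k and another the modulation weights Δw_k; channel sizes and layer counts are illustrative rather than those of the embodiment:

```python
# Sketch of adaptive (modulated deformable) convolution for feature extraction.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class AdaptiveConv(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)  # w_k
        self.offset = nn.Conv2d(c_in, 2 * k * k, 3, padding=1)             # Delta o_k
        self.modul = nn.Conv2d(c_in, k * k, 3, padding=1)                  # Delta w_k
        self.k = k

    def forward(self, x):
        off = self.offset(x)                     # shifts the sampling locations
        m = torch.sigmoid(self.modul(x))         # re-weights each sampled value
        return deform_conv2d(x, off, self.weight, padding=self.k // 2, mask=m)
```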
After the features are acquired, randomly sampled depth hypotheses {d_k | k = 1, 2, ..., N_d} are generated for each feature pixel as N_d hypothesized depth planes. Using a differentiable homography, the reference feature map F_ref is warped onto its N_s neighbouring source feature maps {F_i | i = 1, 2, ..., N_s}: a feature pixel p in F_ref, on depth hypothesis plane d_k, is projected into the source feature image F_i by
p_i(p, d_k) = K_i ( R_i R_ref^{-1} ( d_k K_ref^{-1} p − t_ref ) + t_i )
followed by dehomogenization.
through the above transformation, the source characteristic image I source Warping to reference profile I ref To construct 3D cost body C = { C i |i=1,2,..,N s }. According to the similarity of the features, for the source image F i At depth hypothesis d k Generating a cost body by using a two-norm of the characteristic difference:
C i (p,d k )=||F i [p i (p,d k )]-F ref (p)|| 2
finally, N is required to be s Aggregation of 3D cost bodies by using adaptive weight w (C) i (d k ) Cost aggregation of different views):
Figure BDA0003679363630000162
performing Softmax on the cost body to generate a probability body:
P=softmax(C)
and carrying out weighted average on the depth hypothesis by using a probability body to obtain a final depth map:
Figure BDA0003679363630000163
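The cost aggregation and depth regression can be sketched as follows; `warped` is assumed to already hold the source features warped to the reference view for every depth hypothesis (the differentiable homography step above), and the softmax is taken over the negated cost so that a low matching cost corresponds to a high probability:

```python
# Sketch of cost aggregation and soft depth regression under the assumptions above.
import torch
import torch.nn.functional as F

def regress_depth(warped, feat_ref, depth_hyps, view_weights):
    # warped:       (N_s, N_d, C, H, W) source features warped to the reference view
    # feat_ref:     (C, H, W) reference feature map F_ref
    # depth_hyps:   (N_d,) depth hypotheses d_k
    # view_weights: (N_s, 1, H, W) per-view adaptive weights w(C_i)
    diff = warped - feat_ref[None, None]               # feature difference per view/depth
    cost_i = diff.norm(dim=2)                          # (N_s, N_d, H, W), two-norm cost
    w = F.softmax(view_weights, dim=0)                 # normalised view weights
    cost = (w * cost_i).sum(dim=0)                     # aggregated cost volume (N_d, H, W)
    prob = F.softmax(-cost, dim=0)                     # probability volume
    depth = (prob * depth_hyps.view(-1, 1, 1)).sum(dim=0)   # expected depth D(p)
    return depth, prob
```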
and constructing a scene dense point cloud model through the depth map. For a certain pixel P in an image, obtaining a coordinate P of a three-dimensional point in a real space according to a camera parameter Φ (t, K) and a depth map D:
P=D(p)Τ -1 K -1 p
and K is camera internal reference, and T is a camera pose and comprises a rotation matrix R and a translation vector T.
After the depth maps of all images are obtained through the deep-learning-based MVS of this embodiment, all depth maps are filtered and fused to obtain the dense point cloud M_dense of the scene.
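Lifting one depth map back to 3D points according to P = D(p)·T^{-1}K^{-1}p can be sketched as follows; the geometric-consistency filtering across neighbouring views is omitted:

```python
# Sketch of back-projecting a depth map to world-space 3D points (filtering omitted).
import numpy as np

def depth_to_points(depth, K, R, t):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T                  # K^-1 p for every pixel
    cam_pts = rays * depth.reshape(-1, 1)                # scale each ray by its depth D(p)
    world_pts = (R.T @ (cam_pts - t.reshape(1, 3)).T).T  # apply T^-1 (camera -> world)
    return world_pts[depth.reshape(-1) > 0]              # keep valid depths only
```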
S108: post-processing is performed on the dense point cloud model obtained in step S107, specifically comprising surface reconstruction, surface optimization and texture mapping, finally obtaining a three-dimensional reconstruction model with a good visual effect. The steps are as follows:
calculating normal parameters of the point cloud, and performing surface triangular mesh reconstruction to obtain a triangular mesh intermediate model;
carrying out mesh optimization based on the triangular mesh model to obtain a refined mesh model;
based on the grid model and the collected images, an optimal visual angle image is selected for each grid, pixels of the images are filled on the surface of the grid, and a reconstruction model with a good visual effect is obtained.
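A sketch of this post-processing chain using Open3D as one possible toolchain is given below (the embodiment does not prescribe a specific library); texture mapping from the selected best-view images would follow with a dedicated texturing tool:

```python
# Sketch of surface reconstruction and mesh optimisation with Open3D (an assumed toolchain).
import open3d as o3d

def postprocess(ply_path):
    pcd = o3d.io.read_point_cloud(ply_path)              # dense point cloud M_dense
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=10)
    mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=500000)
    mesh = mesh.filter_smooth_taubin(number_of_iterations=10)   # mesh optimisation
    o3d.io.write_triangle_mesh("mesh.ply", mesh)
    return mesh
```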
In the embodiment of the disclosure, the image retrieval technology is mainly adopted to realize the rapid pairing of images aiming at a large scene with more reconstructed image data; based on the matching method of the SuperPoint characteristic points and the SuperGlue characteristic points, the speed and the precision of characteristic point matching are improved, so that the precision of camera pose calculation is effectively improved; then, the integrity of reconstruction is improved through an MVS algorithm based on deep learning; and finally, optimizing the point cloud model through a series of post-processing to obtain a final three-dimensional reconstruction model.
According to the invention, the combination of SuperPoint and SuperGlue is adopted in the process of extracting the image feature points and matching, compared with the mainstream SIFT feature and nearest neighbor matching at present, the method can provide more effective matching in some environments, and greatly improves the accuracy of camera pose calculation and the integrity of final reconstruction.
In the dense reconstruction process, the depth map estimation is carried out by adopting the MVS network based on the deep learning, and because the convolutional neural network is used when the characteristics are extracted, the extracted characteristics have global property to a certain extent, compared with the traditional method, the reconstruction deficiency in the weak texture area can be reduced to a certain extent.
Example two
Referring to fig. 3, a second embodiment of the present invention provides a three-dimensional reconstruction system using the three-dimensional reconstruction method in the first embodiment in a large scene.
Referring to fig. 3, the three-dimensional reconstruction system includes: terminal device 20 and cloud device 21.
The terminal device 20, the terminal device 20 includes an image acquisition module 201; the image capture module 201 captures RGB picture or video data of a scene via an image capture device 303.
The cloud device 21 comprises an image preprocessing module 202, an image retrieving and matching module 203, an image feature extracting and matching module 204, a reconstruction module 205 and a post-processing module 206.
The image preprocessing module 202 acquires scene RGB pictures or video data uploaded to the cloud device 21 through the communication module 207, and preprocesses the data; specifically, the image preprocessing module 202 is configured to execute the image preprocessing algorithm of step S102 in the first embodiment.
The image retrieval and matching module 203 is connected with the image preprocessing module 202, and performs fast matching on the large-scale image by using the calculation function of the cloud device 21; specifically, the image retrieving and matching module 203 is used for executing the image retrieving and matching algorithm of step S103 in the first embodiment, and the module accelerates the operation by using the GPU.
The image feature extraction and matching module 204 is connected with the image retrieval and matching module 203, and uses the computing function of the cloud device 21 to extract feature points from the image, so as to realize rapid and accurate matching of the feature points; specifically, the image feature extraction and matching module 204 is configured to execute the feature point extraction and matching algorithm of step S104 and step S105 in the first embodiment, and the module accelerates the operation by using the GPU.
The reconstruction module 205 is connected with the image feature extraction and matching module 204, and performs camera pose estimation and dense point cloud reconstruction on the image subjected to feature point extraction and matching by using the calculation function of the cloud device 21; specifically, the reconstruction module 205 is configured to perform the camera pose calculation and the dense point cloud reconstruction algorithm in steps S106 and S107 in the first embodiment, and the module accelerates the operation by using the GPU.
The post-processing module 206 is connected with the reconstruction module 205, and performs surface reconstruction and optimization and texture mapping on the reconstructed dense point cloud by using the computing function of the cloud device 21, so as to optimize the visual effect of the model; obtaining a reconstruction model with a good visual effect; specifically, the post-processing module 206 is configured to perform the surface reconstruction, surface optimization, and texture mapping algorithms of step S108 in the first embodiment, and the module accelerates the operation by using the GPU.
The communication module 207 is used for communication between the terminal device 20 and the cloud device 21. The module performs data transmission between the terminal device 20 and the cloud device 21 based on an internet communication protocol. The terminal device 20 uploads image data to the cloud device 21 through the communication module 207, and the cloud device 21 provides a downloading service for reconstructing the three-dimensional model to the terminal device 20 through the communication module 207.
Since the three-dimensional reconstruction system and the three-dimensional reconstruction method in the large scene correspond to each other, the embodiments of this section will not be specifically explained in comparison with the embodiments of the method section.
EXAMPLE III
Referring to fig. 4, a third embodiment of the present invention provides a three-dimensional reconstruction apparatus using the three-dimensional reconstruction system in the large scene according to the second embodiment.
Referring to fig. 4, the three-dimensional reconstruction apparatus includes: terminal device 20 and cloud device 21.
The terminal device 20 is configured to obtain and store an RGB image of a scene and store a generated three-dimensional reconstruction model, where the terminal device 20 includes a terminal device memory 304 and a communication module 207, where the image acquisition device 303 is configured to acquire scene image data; the terminal device memory 304 is used for storing image and model data; the communication module 207 is configured to transmit image data to the cloud device 21 through the internet and receive the three-dimensional reconstruction model generated by the cloud device 21.
The cloud device 21 is configured to reconstruct a three-dimensional model according to image data; the cloud device 21 at least includes a cloud device memory 307, a processor 306 and a communication module 207; the cloud device storage 307 is configured to store a computer program for implementing the three-dimensional reconstruction method in the large scene according to the first embodiment and intermediate data; the processor 306 comprises a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) with sufficient performance; the method for three-dimensional reconstruction under the large scene is supported to be realized; the processor 306 is configured to call the computer program and data in the cloud device memory 307 to implement the three-dimensional reconstruction method.
The communication module 207 is used for being in communication connection with the terminal device 20; specifically, the image data transmitted by the terminal device 20 is received through the internet and the generated three-dimensional reconstruction model is transmitted to the terminal device 20.
The invention also provides a system and a device for realizing the three-dimensional reconstruction in the large scene, wherein the system is used for realizing the reconstruction method, and the device is used for deploying the system.
The three-dimensional reconstruction method, the three-dimensional reconstruction system and the three-dimensional reconstruction device in the large scene can realize the separation of acquisition and calculation, the acquisition is realized by the terminal, and the cloud end carries out rapid calculation by using high-performance computing equipment. The method reduces the requirement on the computing performance of the terminal and enhances the possibility of wide application.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A three-dimensional reconstruction method under a large scene is characterized in that: the method comprises the following steps:
the method comprises the following steps of S1, acquiring image data of a reconstructed target through RGB image acquisition equipment, and preprocessing the image data;
s2, retrieving and matching the images, calculating the characteristic points of each image, and matching the characteristic points;
s3, calculating the corresponding camera pose of each image;
s4, obtaining a dense point cloud intermediate model of the scene according to the image and the corresponding camera pose;
and S5, post-processing the three-dimensional point cloud model to finally obtain a three-dimensional reconstruction grid model.
2. The method according to claim 1, characterized in that in step S1: the image data are multi-angle two-dimensional pictures or videos of the scene; preprocessing the image data comprises adaptively sampling the video and adjusting the sampling rate r_s according to the number and quality of the video frames so that the number of reconstruction pictures is limited to the range [N_min, N_max], obtaining a set of video frame pictures suitable for reconstruction I = {I_i | i = 1, 2, ..., N}; the acquisition device is any device for RGB image acquisition.
3. The three-dimensional reconstruction method under the large scene according to claim 2, characterized in that in step S2: all collected images are described and matched with a deep-learning-based image retrieval technique; for each image of the image set I, adaptive clustering and scene description are performed using the image's features, giving a global descriptor G_i; images with highly similar scene descriptions are found according to the similarity of the descriptors and form the matched image pairs C = {{I_a, I_b} | I_a, I_b ∈ I, a < b}.
4. The method according to claim 3, characterized in that: the feature points are extracted and described through a SuperPoint feature point extraction network based on a deep neural network, giving the feature point representation F = (x, d), where x is the keypoint and d is the descriptor.
5. The three-dimensional reconstruction method under the large scene according to claim 4, characterized in that: on the matched image pair C_{a,b} = {I_a, I_b}, matching of the feature points is completed and the feature point matching pairs A_{a,b} = {A(F_i, F_j) | F_i ∈ I_a, F_j ∈ I_b} are constructed; the matching uses SuperGlue based on a graph convolutional neural network, which aggregates feature point information within and between images based on an attention mechanism, enhancing the stability and accuracy of the matching.
6. The method for three-dimensional reconstruction in large scene according to claim 5, wherein: obtaining feature point matching pairs
Figure FDA0003679363620000022
Then, performing sparse point cloud M by using incremental SfM sparse The reconstruction of (a) and the estimation of camera parameters, the camera parameters including an external parameter T (R, T) and an internal parameter K; the method comprises the following steps:
(1) Initialization: according to matching feature points
Figure FDA0003679363620000023
Number and distribution selection in images of initially matching image pairs C init Selecting the image pair with the most matching points and the most uniform distribution according to the distribution scores; calculating the relative pose of the camera by using epipolar geometric constraint, wherein the relative pose comprises a rotation matrix R and a translation vector t;
(2) Triangularization: relative motion pose R, t through camera and matching point x i 、x j Calculating the position of the three-dimensional point in the space, and reconstructing the three-dimensional point:
Z i x j ×Rx i +x i ×t=0
by matching feature point coordinates x 1 、x 2 And pose R and t, solving the equation to obtain depth Z i Obtaining the position of the three-dimensional space; obtaining an initial model M containing only two images through an initialization and triangularization process init
(3) Image registration: registering the remaining images to the initial model M by PnP init Performing the following steps; the coordinates of the three-dimensional points under the camera coordinate system are obtained by utilizing the space similar geometric relation, and the coordinates { P ] of the n three-dimensional points under the space coordinate system are obtained k I k =1,2,. N } and coordinates { P ] in the camera coordinate system k ' | k =1,2,., n }, and solving the pose of the camera by using an iterative closest point algorithm ICP;
(4) Bundle adjustment optimization BA: the positions of the three-dimensional points and the camera parameters are adjusted so that the error of the reconstructed 3D points in the images is minimized, i.e. the following cost function is optimized:

E(Φ, P) = Σ || h(Φ, P) − p ||²

where Φ denotes the camera parameters, P the world coordinates of a three-dimensional point in space, h(·) the projection function, and p the pixel coordinates in the image corresponding to the three-dimensional point; the above reprojection error is minimized by adjusting Φ and P.
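A minimal sketch of the two-view initialization, triangulation and reprojection error of claim 6, using OpenCV as one possible implementation; the RANSAC parameters are illustrative assumptions, not values from the patent.

```python
import cv2
import numpy as np

def initialize_and_triangulate(pts_a, pts_b, K):
    """Two-view initialization of the sparse model M_init (steps 1-2 of claim 6).

    pts_a, pts_b: (M, 2) float arrays of matched pixel coordinates from the pair C_init.
    K: 3x3 intrinsic matrix.
    """
    E, _ = cv2.findEssentialMat(pts_a, pts_b, K, cv2.RANSAC, 0.999, 1.0)  # epipolar constraint
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)                      # relative pose R, t

    P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])    # first camera at the origin
    P_b = K @ np.hstack([R, t])
    X_h = cv2.triangulatePoints(P_a, P_b, pts_a.T, pts_b.T)
    return R, t, (X_h[:3] / X_h[3]).T                      # (M, 3) triangulated 3D points

def reprojection_error(X, pts, rvec, tvec, K):
    """Residual minimized by bundle adjustment: || h(Phi, P) - p || per point."""
    proj, _ = cv2.projectPoints(X, rvec, tvec, K, None)
    return np.linalg.norm(proj.reshape(-1, 2) - pts, axis=1)
```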
7. The three-dimensional reconstruction method under the large scene according to claim 6, characterized in that: a depth map D of the image I is estimated by an MVS algorithm based on deep learning; the method further comprises:

in the image feature extraction process, deformable convolution is adopted to adaptively enlarge the receptive field of the 2D CNN in low-texture areas, and a modulation weight is used to adjust the weighted aggregation process, expressed as:

F_out(p) = Σ_k w_k · Δw_k · F_in(p + o_k + Δo_k)

The offsets Δo_k change with the features to adaptively adjust the size and position of the convolution sampling locations, and the modulation terms Δw_k adaptively adjust the aggregation weights, so that deep features more beneficial to the subsequent stereo matching are obtained and the completeness of the reconstruction is improved;
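A minimal sketch of this adaptive feature extraction step using modulated deformable convolution; torchvision's DeformConv2d is one possible building block, and the channel sizes, the 3x3 kernel and the sigmoid modulation are illustrative assumptions rather than the patent's design.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AdaptiveFeatureConv(nn.Module):
    """Deformable 2D convolution with learned offsets Δo_k and modulation weights Δw_k."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, 3, padding=1)   # per-sample offsets Δo_k
        self.modulation = nn.Conv2d(in_ch, k * k, 3, padding=1)   # per-sample modulation Δw_k
        self.conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        offset = self.offset(x)                                   # adjust sampling positions
        mask = torch.sigmoid(self.modulation(x))                  # keep modulation in (0, 1)
        return self.conv(x, offset, mask)
```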
After the feature maps {F_i | i = 1, 2, ..., N} are obtained, randomly sampled depth hypotheses {d_k | k = 1, 2, ..., N_d} are generated for each feature pixel; using a differentiable homography, the reference feature map F_ref is warped onto its N_s neighboring source feature maps {F_i | i = 1, 2, ..., N_s}: a feature pixel p in F_ref, under the depth hypothesis d_k, is projected into the source feature map F_i at the position

p_i(p, d_k) ∝ K_i · (R_i · d_k · K_ref⁻¹ · p + t_i)

where K_ref, K_i are the intrinsics of the reference and source views and R_i, t_i the relative pose from the reference view to source view i.

A 3D cost volume is obtained from the feature similarity; for the source image F_i at depth hypothesis d_k, the matching cost is expressed as the two-norm of the feature difference:

C_i(p, d_k) = || F_i[p_i(p, d_k)] − F_ref(p) ||_2

Adaptive weights w(C_i(d_k)) are used to aggregate the 3D cost volumes of the N_s different views:

C(p, d_k) = Σ_{i=1}^{N_s} w(C_i(d_k)) · C_i(p, d_k) / Σ_{i=1}^{N_s} w(C_i(d_k))
Softmax is applied to the cost volume along the depth dimension to generate a probability volume:

P = softmax(C)

and the depth hypotheses are averaged, weighted by the probability volume, to obtain the final depth map estimate:

D(p) = Σ_{k=1}^{N_d} d_k · P(p, d_k)
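A minimal sketch of the cost aggregation, probability volume and depth regression just described. The tensor shapes and the sign convention (negating the cost before the softmax so that low-cost hypotheses receive high probability, a common practice) are assumptions for illustration; the warping step is assumed already done.

```python
import torch

def regress_depth(warped_feats, ref_feat, view_weights, depth_hyps):
    """Aggregate per-view matching costs into a probability volume and regress depth.

    warped_feats: (N_s, C, N_d, H, W) source features warped to the reference view
                  at every depth hypothesis.
    ref_feat:     (C, H, W) reference feature map F_ref.
    view_weights: (N_s, N_d, H, W) adaptive per-view weights w(C_i(d_k)).
    depth_hyps:   (N_d,) sampled depth hypotheses d_k.
    """
    # Per-view matching cost: two-norm of the feature difference.
    cost_i = torch.linalg.vector_norm(
        warped_feats - ref_feat[None, :, None], dim=1)            # (N_s, N_d, H, W)

    # Weighted aggregation over the N_s source views.
    cost = (view_weights * cost_i).sum(0) / view_weights.sum(0).clamp(min=1e-6)  # (N_d, H, W)

    # Softmax over the depth dimension; negate so that a low cost gets high probability.
    prob = torch.softmax(-cost, dim=0)                            # probability volume P

    # Expected depth: hypotheses averaged with the probability volume as weights.
    return (prob * depth_hyps[:, None, None]).sum(0)              # (H, W) depth map D
```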
A dense point cloud model is constructed from the depth map; for a pixel p in the image, the coordinates P of the corresponding 3D point in real space are obtained from the camera parameters T(R, t), K and the depth map D:

P = D(p) · T⁻¹ · K⁻¹ · p

All the 3D points together form the dense point cloud model M_dense.
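A minimal sketch of this back-projection, assuming a pinhole model with the extrinsics given as a world-to-camera rotation R and translation t; pixels with zero or invalid depth are skipped.

```python
import numpy as np

def depth_to_points(depth, K, R, t):
    """Back-project a depth map D into world-space 3D points of M_dense.

    depth: (H, W) depth map; K: 3x3 intrinsics; R, t: world-to-camera extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    d = depth.reshape(-1)
    valid = d > 0

    cam = (np.linalg.inv(K) @ pix[valid].T) * d[valid]   # D(p) K^-1 p, camera coordinates
    world = R.T @ (cam - t.reshape(3, 1))                # invert the rigid transform T(R, t)
    return world.T                                       # (M, 3) world-space points
```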
8. The three-dimensional reconstruction method under the large scene according to claim 7, characterized in that: after the three-dimensional point cloud model M_dense of the scene is obtained, the point cloud is post-processed to obtain the reconstruction model M, including:

performing surface triangular mesh reconstruction on the point cloud to obtain an intermediate triangular mesh model M_mesh;

based on the mesh model and the collected images, selecting an optimal-view image for each mesh face and filling its pixels onto the mesh surface, obtaining a reconstruction model M with a good visual effect.
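A minimal sketch of the surface mesh reconstruction step, using Open3D's Poisson reconstruction as one possible implementation; the claim only requires a surface triangular mesh, so the algorithm choice and its parameters here are assumptions.

```python
import open3d as o3d

def mesh_from_point_cloud(points):
    """Build a triangular surface mesh M_mesh from the dense point cloud.

    points: (N, 3) numpy array of 3D points from M_dense.
    The normal-estimation radius and the Poisson octree depth are illustrative values.
    """
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    return mesh
```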
9. A three-dimensional reconstruction system for implementing the three-dimensional reconstruction method in a large scene according to any one of claims 1 to 8, comprising:
a terminal device (20), the terminal device (20) comprising an image acquisition module (201); the image acquisition module (201) acquires RGB pictures or video data of the scene through an image acquisition device (303);

a cloud device (21), the cloud device (21) comprising an image preprocessing module (202); the image preprocessing module (202) receives the scene RGB pictures or video data uploaded to the cloud device (21) through the communication module (207) and preprocesses the data;

the cloud device (21) further comprises an image retrieval and matching module (203); the image retrieval and matching module (203) is connected with the image preprocessing module (202) and uses the computing capability of the cloud device (21) to rapidly match large-scale image sets;

an image feature extraction and matching module (204), connected with the image retrieval and matching module (203), which uses the computing capability of the cloud device (21) to extract feature points from the images and achieve rapid and accurate feature point matching;

a reconstruction module (205), connected with the image feature extraction and matching module (204), which uses the computing capability of the cloud device (21) to perform camera pose estimation and dense point cloud reconstruction on the images whose feature points have been extracted and matched;

a post-processing module (206), connected with the reconstruction module (205), which uses the computing capability of the cloud device (21) to perform surface reconstruction, optimization and texture mapping on the reconstructed dense point cloud, optimizing the visual effect of the model;

and a communication module (207), used for the communication between the terminal device (20) and the cloud device (21).
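A minimal sketch of how the cloud-side modules (202)-(206) of this system could be wired in sequence; the callables are hypothetical stand-ins for the modules, not an interface defined by the patent.

```python
from typing import Callable, Sequence

def cloud_reconstruction_pipeline(
    images: Sequence,
    preprocess: Callable,         # image preprocessing module (202)
    retrieve_pairs: Callable,     # image retrieval and matching module (203)
    extract_and_match: Callable,  # image feature extraction and matching module (204)
    reconstruct: Callable,        # reconstruction module (205): SfM + MVS
    postprocess: Callable,        # post-processing module (206): meshing + texturing
):
    """Run the cloud-side stages in the order described by claim 9."""
    images = preprocess(images)
    pairs = retrieve_pairs(images)
    matches = extract_and_match(images, pairs)
    cameras, dense_cloud = reconstruct(images, matches)
    return postprocess(dense_cloud, images, cameras)
```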
10. A three-dimensional reconstruction apparatus for implementing the three-dimensional reconstruction system in a large scene as claimed in claim 9, comprising: the system comprises a terminal device (20) and a cloud device (21);
the terminal device (20) is used for acquiring and storing image data of a reconstructed scene, the terminal device (20) comprises a terminal device memory (304) and a communication module (207), the terminal device memory (304) is used for storing acquired images, and the communication module (207) is used for communicating with a cloud device (21);
the cloud device (21) is used for performing three-dimensional reconstruction according to image data; the cloud device (21) at least comprises a cloud device memory (307), a processor (306) and a communication module (207);
the cloud device memory (307) is used for storing a computer program implementing the three-dimensional reconstruction method in a large scene according to any one of claims 1 to 8, together with intermediate data; the processor (306) comprises a central processing unit (CPU) and a graphics processing unit (GPU);
the processor (306) is used for calling the computer program and the data in the cloud device memory (307) to implement the three-dimensional reconstruction method;
the communication module (207) is used for being in communication connection with the terminal equipment (20).
CN202210630432.5A 2022-06-06 2022-06-06 Three-dimensional reconstruction method, system and device in large scene Pending CN115205489A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210630432.5A CN115205489A (en) 2022-06-06 2022-06-06 Three-dimensional reconstruction method, system and device in large scene

Publications (1)

Publication Number Publication Date
CN115205489A true CN115205489A (en) 2022-10-18

Family

ID=83576831

Country Status (1)

Country Link
CN (1) CN115205489A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115937546A (en) * 2022-11-30 2023-04-07 北京百度网讯科技有限公司 Image matching method, three-dimensional image reconstruction method, image matching device, three-dimensional image reconstruction device, electronic apparatus, and medium
CN115578539A (en) * 2022-12-07 2023-01-06 深圳大学 Indoor space high-precision visual position positioning method, terminal and storage medium
CN115578539B (en) * 2022-12-07 2023-09-19 深圳大学 Indoor space high-precision visual position positioning method, terminal and storage medium
CN116704111A (en) * 2022-12-08 2023-09-05 荣耀终端有限公司 Image processing method and apparatus
CN115861546B (en) * 2022-12-23 2023-08-08 四川农业大学 Crop geometric perception and three-dimensional phenotype reconstruction method based on nerve volume rendering
CN115861546A (en) * 2022-12-23 2023-03-28 四川农业大学 Crop geometric perception and three-dimensional phenotype reconstruction method based on nerve body rendering
CN115719407A (en) * 2023-01-05 2023-02-28 安徽大学 Distributed multi-view stereo reconstruction method for large-scale aerial images
CN116258817A (en) * 2023-02-16 2023-06-13 浙江大学 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN116258817B (en) * 2023-02-16 2024-01-30 浙江大学 Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN116524111B (en) * 2023-02-21 2023-11-07 中国航天员科研训练中心 On-orbit lightweight scene reconstruction method and system for supporting on-demand lightweight scene of astronaut
CN116310105A (en) * 2023-03-09 2023-06-23 广州沃佳科技有限公司 Object three-dimensional reconstruction method, device, equipment and storage medium based on multiple views
CN116310105B (en) * 2023-03-09 2023-12-05 广州沃佳科技有限公司 Object three-dimensional reconstruction method, device, equipment and storage medium based on multiple views
CN116503551A (en) * 2023-04-14 2023-07-28 海尔数字科技(上海)有限公司 Three-dimensional reconstruction method and device
CN116934829A (en) * 2023-09-15 2023-10-24 天津云圣智能科技有限责任公司 Unmanned aerial vehicle target depth estimation method and device, storage medium and electronic equipment
CN116934829B (en) * 2023-09-15 2023-12-12 天津云圣智能科技有限责任公司 Unmanned aerial vehicle target depth estimation method and device, storage medium and electronic equipment
CN117252996A (en) * 2023-11-20 2023-12-19 中国船舶集团有限公司第七〇七研究所 Data expansion system and method for special vehicle in cabin environment
CN117252996B (en) * 2023-11-20 2024-05-10 中国船舶集团有限公司第七〇七研究所 Data expansion system and method for special vehicle in cabin environment
CN117456130A (en) * 2023-12-22 2024-01-26 山东街景智能制造科技股份有限公司 Scene model construction method
CN117456130B (en) * 2023-12-22 2024-03-01 山东街景智能制造科技股份有限公司 Scene model construction method

Similar Documents

Publication Publication Date Title
CN115205489A (en) Three-dimensional reconstruction method, system and device in large scene
Tateno et al. Distortion-aware convolutional filters for dense prediction in panoramic images
CN115690324A (en) Neural radiation field reconstruction optimization method and device based on point cloud
CN109377530B (en) Binocular depth estimation method based on depth neural network
Wang et al. 360sd-net: 360 stereo depth estimation with learnable cost volume
WO2020001168A1 (en) Three-dimensional reconstruction method, apparatus, and device, and storage medium
CN106228507B (en) A kind of depth image processing method based on light field
CN108010123B (en) Three-dimensional point cloud obtaining method capable of retaining topology information
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
CN108876814B (en) Method for generating attitude flow image
CN110070598B (en) Mobile terminal for 3D scanning reconstruction and 3D scanning reconstruction method thereof
TW202117611A (en) Computer vision training system and method for training computer vision system
CN110580720B (en) Panorama-based camera pose estimation method
CN111553845B (en) Quick image stitching method based on optimized three-dimensional reconstruction
CN111951368B (en) Deep learning method for point cloud, voxel and multi-view fusion
CN111899295B (en) Monocular scene depth prediction method based on deep learning
CN113192179A (en) Three-dimensional reconstruction method based on binocular stereo vision
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN114119739A (en) Binocular vision-based hand key point space coordinate acquisition method
WO2021035627A1 (en) Depth map acquisition method and device, and computer storage medium
CN111402412A (en) Data acquisition method and device, equipment and storage medium
CN114782628A (en) Indoor real-time three-dimensional reconstruction method based on depth camera
CN116958437A (en) Multi-view reconstruction method and system integrating attention mechanism
WO2021142843A1 (en) Image scanning method and device, apparatus, and storage medium
CN116912405A (en) Three-dimensional reconstruction method and system based on improved MVSNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination