CN114519772A - Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation - Google Patents

Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation

Info

Publication number
CN114519772A
Authority
CN
China
Prior art keywords
cost
sparse point
depth
point cloud
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210090256.0A
Other languages
Chinese (zh)
Inventor
陶文兵
齐雨航
刘李漫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Tuke Intelligent Technology Co ltd
Original Assignee
Wuhan Tuke Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Tuke Intelligent Technology Co ltd filed Critical Wuhan Tuke Intelligent Technology Co ltd
Priority to CN202210090256.0A priority Critical patent/CN114519772A/en
Publication of CN114519772A publication Critical patent/CN114519772A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/529Depth or shape recovery from texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation, wherein the method comprises the following steps: acquiring a multi-view image and a plurality of corresponding sparse point clouds, and preprocessing the sparse point clouds to obtain depth maps at a plurality of viewing angles; extracting features from the multi-view image, constructing one or more cost volumes, and modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes; and recovering a depth map from each probability volume and fusing it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model. Through a sparse-point-guided strategy, the method modulates the constructed cost volume with the sparse prior, and uses regularization and related means to improve the accuracy of the cost volume when estimating the depth of weak-texture regions and fine structures, thereby improving the reconstruction quality of the three-dimensional point cloud model.

Description

Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a three-dimensional reconstruction method and a three-dimensional reconstruction system based on sparse point cloud and cost aggregation.
Background
Image-based three-dimensional reconstruction, which aims to recover three-dimensional geometry from multiple input images, is an important and challenging problem in the field of computer vision. Compared with active three-dimensional reconstruction based on laser radar, image-based three-dimensional reconstruction has the advantages of low cost, strong generality and the like.
Traditional multi-view three-dimensional reconstruction methods perform cross-view similarity search based on hand-crafted features and can obtain good reconstruction results in ideal Lambertian scenes; however, in weak-texture regions and regions with specular reflection, image features are difficult to extract and the reconstruction quality is unsatisfactory. In recent years, deep neural networks have found widespread use in the field of computer vision. Deep-learning methods automatically learn the features of an input image through a deep neural network trained on large amounts of labeled data. Compared with hand-crafted features, the features extracted by a deep neural network contain richer semantic information.
In 2020, Xu and Tao of Huazhong University of Science and Technology replaced the variance-based cost metric with Average Group-wise Correlation, reducing GPU memory consumption without lowering the reconstruction quality of the model. They also modeled the multi-view depth estimation problem as an inverse depth regression problem, so that the model performs better in scenes with a large depth range.
Although deep-learning-based methods have made great progress, they do not fully exploit the result of sparse reconstruction: only the camera pose information is used, while the sparse point cloud information is ignored or under-utilized.
Disclosure of Invention
In order to fully utilize the sparse point cloud information, improve the accuracy of depth estimation and thereby improve the reconstruction quality of the three-dimensional point cloud model, and in particular to address the difficulty of extracting image features in weak-texture regions and regions with specular reflection, the invention provides, in a first aspect, a three-dimensional reconstruction method based on sparse point cloud and cost aggregation, which comprises the following steps: acquiring a multi-view image and a plurality of corresponding sparse point clouds, and preprocessing the sparse point clouds to obtain depth maps at a plurality of viewing angles; extracting features from the multi-view image, constructing one or more cost volumes, and modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes; and recovering a depth map from each probability volume and fusing it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
In some embodiments of the present invention, the extracting features from the multi-view image and constructing one or more cost volumes comprises: extracting features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps; selecting one of the feature maps as a reference feature map, taking the remaining feature maps as source feature maps, and computing a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views; and aggregating the feature volumes of the plurality of views into a cost volume.
Further, the aggregation of the feature volumes of the multiple views into a cost volume is realized as follows:

C = M(V_1, \ldots, V_N) = \frac{1}{N} \sum_{i=1}^{N} \left( V_i - \bar{V} \right)^2

where C denotes the cost volume, M denotes the element-wise variance operation, V_i denotes the i-th feature volume, N denotes the total number of feature volumes, and \bar{V} denotes the mean of all the feature volumes.
In some embodiments of the present invention, the modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes comprises: constructing a Gaussian modulation function based on the depth maps at the multiple viewing angles; modulating each cost volume according to the Gaussian modulation function; and regularizing each cost volume using a 3D segmentation network to obtain a filtered probability volume.
Further, the regularization is implemented as:

\tilde{C}(v_0) = \sum_{k} \omega_k \cdot C(v_0 + v_k + \Delta v_k)

where C(v_0) is the cost of each voxel v_0 in the cost volume, \tilde{C}(v_0) is the cost of voxel v_0 after the regularization operation, \omega_k is the weight at the k-th sampling position, v_k is the fixed offset of the convolution receptive field, and \Delta v_k is the offset learned during adaptive cost aggregation.
In the above embodiments, the preprocessing of the plurality of sparse point clouds to obtain depth maps at a plurality of viewing angles comprises: obtaining the three-dimensional points corresponding to all key points in each view, and filtering out the invisible three-dimensional points; and obtaining the depth value of each filtered three-dimensional point in the image coordinate system through projection and coordinate transformation according to the camera extrinsic parameters of the current view.
In a second aspect of the present invention, a three-dimensional reconstruction system based on sparse point cloud and cost aggregation is provided, comprising: an acquisition module, configured to acquire a multi-view image and a plurality of corresponding sparse point clouds and to preprocess the sparse point clouds to obtain depth maps at a plurality of viewing angles; a construction module, configured to extract features from the multi-view image, construct one or more cost volumes, and modulate and regularize each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes; and a reconstruction module, configured to recover a depth map from each probability volume and fuse it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
Further, the construction module comprises: an extraction unit, configured to extract features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps; a computation unit, configured to select one of the feature maps as a reference feature map, take the remaining feature maps as source feature maps, and compute a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views; and an aggregation unit, configured to aggregate the feature volumes of the multiple views into a cost volume.
In a third aspect of the present invention, there is provided an electronic device comprising: one or more processors; a storage device, configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sparse point cloud and cost aggregation based three-dimensional reconstruction method provided by the present invention in the first aspect.
In a fourth aspect of the present invention, a computer-readable medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the sparse point cloud and cost aggregation based three-dimensional reconstruction method provided in the first aspect of the present invention.
The invention has the beneficial effects that:
1. The method makes full use of the result of sparse reconstruction and incorporates the sparse point cloud obtained from sparse reconstruction into the cost volume as prior information. The sparse depth map obtained by projecting the sparse point cloud serves as a geometric prior of the scene, and the accuracy of depth estimation is improved by enhancing depth hypotheses near the sparse prior and suppressing depth hypotheses far from the prior depth value, especially at fine structures and depth discontinuities;
2. An adaptive cost aggregation module is introduced in the cost volume regularization stage, giving the network model scene-awareness capability; the offsets are learned adaptively in a data-driven manner, yielding more accurate depth estimates in weak-texture regions and at object boundaries.
Drawings
Fig. 1 is a basic flow diagram of a three-dimensional reconstruction method based on sparse point cloud and cost aggregation in some embodiments of the present invention;
FIG. 2 is a detailed flow chart of the three-dimensional reconstruction method based on sparse point cloud and cost aggregation in some embodiments of the invention;
FIG. 3 is a schematic diagram of a convolutional neural network for image feature extraction in some embodiments of the present invention;
FIG. 4 is a schematic diagram of sparse point cloud data preprocessing in some embodiments of the present invention;
FIG. 5 is a schematic diagram of the operating principle of a Gaussian modulation function in some embodiments of the invention;
FIG. 6 is a schematic diagram of a 3D U-Net network structure for adaptive cost aggregation in some embodiments of the present invention;
FIG. 7 is a schematic structural diagram of a three-dimensional reconstruction system based on sparse point cloud and cost aggregation in some embodiments of the present invention;
fig. 8 is a schematic structural diagram of an electronic device in some embodiments of the invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1 and fig. 2, in a first aspect of the present invention, a three-dimensional reconstruction method based on sparse point cloud and cost aggregation is provided, comprising: S100, acquiring a multi-view image and a plurality of corresponding sparse point clouds, and preprocessing the sparse point clouds to obtain depth maps at a plurality of viewing angles; S200, extracting features from the multi-view image, constructing one or more cost volumes, and modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes; S300, recovering a depth map from each probability volume, and fusing it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
It can be understood that although deep-learning methods have made great progress, they do not fully exploit the result of sparse reconstruction: only the camera pose information is used, while the sparse point cloud information is ignored or under-utilized. The present method overcomes this problem and makes full use of the sparse point cloud information, thereby improving the accuracy of the three-dimensional reconstruction model. A multi-view image refers to multiple images taken from different viewing angles (views) of the same scene, i.e., the multiple images originate from the same scene.
In step S200 of some embodiments of the present invention, the extracting features from the multi-view image and constructing one or more cost volumes comprises: S201, extracting features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps; S202, selecting one of the feature maps as a reference feature map, taking the remaining feature maps as source feature maps, and computing a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views; S203, aggregating the feature volumes of the multiple views into a cost volume.
Specifically, in step S201, the convolutional neural network shown in fig. 3 performs feature extraction on the N input images {I_i}_{i=1}^{N} to obtain the corresponding feature maps {F_i}_{i=1}^{N}. The convolutional neural network contains 8 convolutional layers; except for the last convolutional layer, each convolutional layer is followed by a BN layer and a ReLU activation function. The image feature extraction module thus maps each 3 × H × W input image to a C-channel feature map, where H and W are the height and width of the input image, respectively, and C is the channel dimension of the feature map. The extracted feature maps are used for the subsequent cost volume construction.
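By way of illustration only, and not as a limitation of the claimed method, a minimal PyTorch sketch of such an 8-layer feature extraction network is given below; the channel widths, strides and the resulting 1/4 spatial downsampling are assumptions, since the description only fixes the layer count and the BN/ReLU pattern.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Illustrative 8-layer feature extractor: every conv except the last
    is followed by BatchNorm and ReLU (channel widths and strides are assumed)."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        def block(cin, cout, stride=1):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.layers = nn.Sequential(
            block(3, 8), block(8, 8),
            block(8, 16, stride=2), block(16, 16), block(16, 16),
            block(16, 32, stride=2), block(32, 32),
            nn.Conv2d(32, out_channels, 3, padding=1),  # last conv: no BN/ReLU
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> feature map: (B, C, H/4, W/4) under the assumed strides
        return self.layers(image)
```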
Next, in step S202, F_1 is the feature map extracted from the reference image for which depth estimation is required, and {F_i}_{i=2}^{N} are the feature maps of the source images to be matched against the reference image; the camera intrinsic matrix K_i, rotation matrix R_i and translation vector t_i corresponding to each view are obtained from the sparse reconstruction. Taking the reference image feature F_1 as the reference, for a pixel p on the reference feature map and a depth hypothesis d_j, the corresponding pixel p_{i,j} on the source image feature F_i is computed as:

p_{i,j} \simeq K_i \left( R_i R_{ref}^{-1} \left( d_j K_{ref}^{-1} \tilde{p} - t_{ref} \right) + t_i \right)

where d_j is the j-th depth hypothesis, j ∈ {1, 2, …, N_d}, N_d is the number of depth hypotheses, \tilde{p} denotes the homogeneous coordinates of p, and the subscript ref denotes the reference view. The feature map of each image is warped by this projective transformation to obtain the corresponding feature volumes {V_i}_{i=1}^{N}. In order to flexibly handle an arbitrary number of input views, the invention aggregates the feature volumes of the multiple views {V_i}_{i=1}^{N} into the cost volume C using a variance-based metric:

C = M(V_1, \ldots, V_N) = \frac{1}{N} \sum_{i=1}^{N} \left( V_i - \bar{V} \right)^2

where \bar{V} is the mean of all the feature volumes, M denotes the element-wise variance operation, V_i denotes the i-th feature volume, and N denotes the total number of feature volumes.
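For illustration only, the variance-based aggregation of the warped feature volumes into the cost volume C can be sketched in PyTorch as follows; the (B, N, C, D, H, W) tensor layout, and the assumption that the per-view feature volumes have already been produced by the warping above, are not fixed by the description.

```python
import torch

def variance_cost_volume(feature_volumes: torch.Tensor) -> torch.Tensor:
    """Aggregate N warped feature volumes into one cost volume using the
    element-wise variance metric C = (1/N) * sum_i (V_i - mean(V))^2.

    feature_volumes: (B, N, C, D, H, W), one volume per view (layout assumed).
    returns:         (B, C, D, H, W)
    """
    mean = feature_volumes.mean(dim=1, keepdim=True)        # V-bar
    cost = ((feature_volumes - mean) ** 2).mean(dim=1)      # element-wise variance
    return cost

# Example usage (shapes are illustrative): 5 views, 32 feature channels,
# 192 depth hypotheses, quarter-resolution feature maps.
# V = torch.randn(1, 5, 32, 192, 120, 160)
# C = variance_cost_volume(V)   # -> (1, 32, 192, 120, 160)
```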
In step S200 of some embodiments of the present invention, the modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes comprises: S204, constructing a Gaussian modulation function based on the depth maps at the multiple viewing angles; S205, modulating each cost volume according to the Gaussian modulation function; S206, regularizing each cost volume using a 3D segmentation network to obtain a filtered probability volume.
Referring to fig. 5, schematically, steps S204 and S205 comprise: constructing a Gaussian modulation function centered on the sparse depth value d' with the depth hypothesis d as the independent variable. The Gaussian modulation function is used for feature enhancement of the cost volume: the responses of depth hypotheses near d' are enhanced, while depth hypotheses far from d' are suppressed. The Gaussian modulation function is:

G(d) = k \cdot \exp\left( -\frac{(d - d')^2}{2 c^2} \right)

where d is the depth hypothesis value, d' is the sparse depth value, k is the amplitude of the Gaussian function and c is the bandwidth. Since the variance-based metric is used, a smaller cost means a higher confidence in the depth hypothesis, so the modulation applied to the cost volume is rewritten as:
\tilde{C}(p, d) = C(p, d) \left( 1 + k - k \exp\left( -\frac{(d - d'_p)^2}{2 c^2} \right) \right), \quad p \in \Omega_{sparse}

where \Omega_{sparse} is the set of pixels at which a prior depth exists; the cost at pixels without a prior depth is left unchanged.
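As a purely illustrative sketch, and not the claimed implementation, such a sparse-prior modulation of a variance-based cost volume might be realized as follows; the multiplicative form, the tensor layout and the default values of k and c are assumptions.

```python
import torch

def gaussian_modulate(cost: torch.Tensor,
                      sparse_depth: torch.Tensor,
                      depth_hyps: torch.Tensor,
                      k: float = 10.0,
                      c: float = 1.0) -> torch.Tensor:
    """Modulate a variance-based cost volume with a sparse depth prior.

    cost:         (B, C, D, H, W) cost volume (lower cost = higher confidence)
    sparse_depth: (B, H, W) sparse depth map, 0 where no prior exists
    depth_hyps:   (D,) depth hypothesis values d_j
    Near the prior depth d' the cost is left unchanged; far from it the cost is
    scaled up by up to (1 + k), which suppresses those hypotheses.
    """
    d = depth_hyps.view(1, 1, -1, 1, 1)                     # (1, 1, D, 1, 1)
    d_prior = sparse_depth.unsqueeze(1).unsqueeze(2)        # (B, 1, 1, H, W)
    gauss = torch.exp(-(d - d_prior) ** 2 / (2 * c ** 2))   # Gaussian around d'
    factor = 1.0 + k * (1.0 - gauss)                        # 1 near d', 1 + k far away
    has_prior = (sparse_depth > 0).float().unsqueeze(1).unsqueeze(2)
    factor = has_prior * factor + (1.0 - has_prior)         # untouched where no prior
    return cost * factor
```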
Referring to fig. 6, schematically, step S206 comprises: performing the regularization operation on the cost volume using a 3D U-Net to obtain a filtered probability volume. The method introduces an adaptive cost aggregation module in the cost volume regularization step; the adaptive cost aggregation module can learn the offsets adaptively, so that more accurate reconstruction is obtained at depth discontinuities. Optionally, the regularization of the cost volume may be implemented with a variant of the U-Net family or another 3D image segmentation network such as SegNet.
Specifically, at the end of step S206, probability volume normalization is performed: a 3D convolution converts the C-channel cost volume into a single-channel volume, and a soft-max normalization operation is applied to it along the depth direction to obtain the probability volume.
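For illustration only, this conversion of the regularized C-channel cost volume into a probability volume can be sketched as follows; the single 3D convolution stands in for the output layer of the full 3D U-Net, and the channel count is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostToProbability(nn.Module):
    """Minimal sketch: collapse the (regularized) C-channel cost volume to a single
    channel with a 3D convolution, then apply soft-max along the depth dimension."""
    def __init__(self, in_channels: int = 32):
        super().__init__()
        self.to_single = nn.Conv3d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        # cost: (B, C, D, H, W) -> probability volume: (B, D, H, W)
        score = self.to_single(cost).squeeze(1)
        return F.softmax(score, dim=1)   # normalize over the depth hypotheses
```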
In step S206, the regularization is implemented as:

\tilde{C}(v_0) = \sum_{k} \omega_k \cdot C(v_0 + v_k + \Delta v_k)

where C(v_0) is the cost of each voxel v_0 in the cost volume, \tilde{C}(v_0) is the cost of voxel v_0 after the regularization operation, \omega_k is the weight at the k-th sampling position, v_k is the fixed offset of the convolution receptive field, and \Delta v_k is the offset learned during adaptive cost aggregation.
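By way of illustration, the adaptive aggregation formula above can be sketched for a single-channel cost volume as follows; in the actual network the offsets Δv_k and weights ω_k would be predicted by convolution layers, whereas here they are passed in as inputs, and all tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_cost_aggregation(cost: torch.Tensor,
                              fixed_offsets: torch.Tensor,
                              learned_offsets: torch.Tensor,
                              weights: torch.Tensor) -> torch.Tensor:
    """Toy illustration of C~(v0) = sum_k w_k * C(v0 + v_k + dv_k) with trilinear
    sampling at the (generally fractional) positions v0 + v_k + dv_k.

    cost:            (B, 1, D, H, W) cost volume
    fixed_offsets:   (K, 3)  fixed offsets v_k of the convolution footprint, order (dz, dy, dx)
    learned_offsets: (B, K, D, H, W, 3) learned offsets dv_k, same order, in voxel units
    weights:         (K,)    aggregation weights w_k
    """
    B, _, D, H, W = cost.shape
    device = cost.device
    # voxel coordinates of every v0, ordered (z, y, x)
    z, y, x = torch.meshgrid(torch.arange(D, device=device),
                             torch.arange(H, device=device),
                             torch.arange(W, device=device), indexing="ij")
    base = torch.stack((z, y, x), dim=-1).float()                         # (D, H, W, 3)
    size = torch.tensor([D - 1, H - 1, W - 1], device=device).clamp(min=1).float()

    out = torch.zeros_like(cost)
    for k in range(fixed_offsets.shape[0]):
        pos = base + fixed_offsets[k].to(device) + learned_offsets[:, k]  # (B, D, H, W, 3)
        # convert (z, y, x) voxel coordinates to grid_sample's normalized (x, y, z) in [-1, 1]
        grid = (2.0 * pos / size - 1.0).flip(dims=(-1,))
        sampled = F.grid_sample(cost, grid, mode="bilinear",
                                padding_mode="border", align_corners=True)
        out = out + weights[k] * sampled
    return out
```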
It is understood that voxelization is the conversion of a geometric representation of an object into the voxel representation closest to that object, resulting in a volume data set that contains not only the surface information of the model but also its internal properties. Voxels are the three-dimensional counterpart of pixels in an image and allow objects in three-dimensional space to be represented on a regular grid of spatial volume elements.
Typical voxel representations include the SDF and the TSDF: the SDF (Signed Distance Field) models the object surface by assigning a signed distance value to each voxel. An SDF value greater than 0 indicates that the voxel lies in front of the surface, a value less than 0 indicates that it lies behind the surface, and the closer the SDF is to 0, the closer the voxel is to the real surface of the scene. However, this representation occupies a large amount of resources, which is why the TSDF was proposed;
the TSDF (Truncated Signed Distance Field) reduces the resource consumption of the voxel representation. The TSDF represents three-dimensional space with a grid of cubes; each grid cell stores the truncated distance between the cell and the object surface, the positive and negative signs indicate respectively that the cell is in front of (visible from) or behind (occluded by) the surface, and points on the surface correspond to the zero crossing.
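As background illustration only (the fusion in this invention is point-based rather than TSDF-based), a minimal NumPy sketch of computing a TSDF from one depth map is given below; the function name, truncation distance and nearest-pixel lookup are assumptions.

```python
import numpy as np

def tsdf_from_depth(voxel_centers: np.ndarray, depth: np.ndarray,
                    K: np.ndarray, R: np.ndarray, t: np.ndarray,
                    trunc: float = 0.05) -> np.ndarray:
    """Illustrative TSDF computation from a single depth map.

    voxel_centers: (M, 3) voxel centres in world coordinates
    depth:         (H, W) depth map of the current view
    K, R, t:       camera parameters, following P_cam = R @ P_world + t
    Returns the truncated signed distance of every voxel (NaN where unobserved):
    positive in front of the observed surface, negative behind it.
    """
    H, W = depth.shape
    p_cam = (R @ voxel_centers.T + t.reshape(3, 1)).T        # (M, 3)
    uvw = (K @ p_cam.T).T                                    # rows are d * (u, v, 1)
    front = uvw[:, 2] > 1e-6
    u = np.zeros(len(p_cam), dtype=int)
    v = np.zeros(len(p_cam), dtype=int)
    u[front] = np.round(uvw[front, 0] / uvw[front, 2]).astype(int)
    v[front] = np.round(uvw[front, 1] / uvw[front, 2]).astype(int)

    valid = front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    d_obs = np.zeros(len(p_cam))
    d_obs[valid] = depth[v[valid], u[valid]]
    valid &= d_obs > 0                                       # pixels with a measured depth

    tsdf = np.full(len(p_cam), np.nan)
    z = p_cam[:, 2]                                          # voxel depth along the optical axis
    tsdf[valid] = np.clip((d_obs[valid] - z[valid]) / trunc, -1.0, 1.0)
    return tsdf
```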
In step S300 of some embodiments of the present invention, the recovering a depth map from each probability volume and fusing it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model comprises the following steps:
S301, depth map regression: to achieve depth estimation accuracy at the sub-pixel level, the invention uses the weighted average of all depth hypotheses as the final depth output (the soft-argmin operation), the weight of each term being the probability value of that hypothesis. The pixel-wise depth estimate is computed as:

\hat{d} = \sum_{j=1}^{N_d} d_j \cdot P(d_j)

where P(d_j) is the probability value corresponding to the depth hypothesis d_j;
S302, photometric confidence map computation: the photometric confidence map measures the quality of the multi-view photometric consistency matching; the photometric confidence is obtained by summing the probabilities of the four depth hypotheses nearest to the estimated depth value. The photometric confidence map is used for the depth map filtering of the above embodiments;
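For illustration only, the soft-argmin regression of step S301 and the four-hypothesis photometric confidence of step S302 might be sketched together as follows; the tensor layout, the assumption of uniformly spaced depth hypotheses and the particular choice of the four neighbouring hypotheses are assumptions.

```python
import torch
import torch.nn.functional as F

def regress_depth(prob: torch.Tensor, depth_hyps: torch.Tensor):
    """Soft-argmin depth regression plus a simple photometric confidence.

    prob:       (B, D, H, W) probability volume, already soft-max normalised over D
    depth_hyps: (D,) depth hypothesis values d_j (assumed uniformly spaced)
    Returns (depth, confidence), where the confidence is the sum of the
    probabilities of the 4 hypotheses nearest the regressed depth.
    """
    d = depth_hyps.view(1, -1, 1, 1)
    depth = (prob * d).sum(dim=1)                            # expectation over hypotheses

    # index of the hypothesis nearest the regressed depth (uniform sampling assumed)
    step = depth_hyps[1] - depth_hyps[0]
    idx = ((depth - depth_hyps[0]) / step).round().long().clamp(0, prob.shape[1] - 1)

    # sum the probabilities of the 4 nearest hypotheses around that index
    prob_pad = F.pad(prob, (0, 0, 0, 0, 2, 2))               # pad depth dim by 2 on each side
    conf = sum(torch.gather(prob_pad, 1, (idx + 2 + o).unsqueeze(1)).squeeze(1)
               for o in (-1, 0, 1, 2))
    return depth, conf
```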
S303, depth map filtering: depth map filtering uses photometric consistency and geometric consistency for robust filtering of the depth maps. The invention uses the probability map to measure the quality of the depth estimates; points whose probability value is lower than τ_0 are filtered out as outliers. Geometric consistency measures the consistency of depth between multiple views: a reference image pixel p_1 with depth value d_1 is projected to a point p_i in a neighboring view, and the depth value d_i at p_i is then reprojected onto the reference image at the point p_reproj with depth value d_reproj. If |p_reproj - p_1| < τ_1 and |d_reproj - d_1| / d_1 < τ_2 are satisfied, the depth estimate d_1 of pixel p_1 is said to be consistent between the two views. In the invention, to ensure the cross-view consistency of the depth estimation, only depth estimates that are consistent in at least n_τ views are retained. The filtered pixels are back-projected into three-dimensional space to obtain a dense three-dimensional point cloud model.
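By way of illustration, the two-view geometric consistency test described above might be sketched as follows; the nearest-neighbour depth lookup and the parameter names are assumptions, and in the full method this per-pixel test would be repeated over all neighbouring views and combined with the n_τ-view threshold.

```python
import numpy as np

def geometric_consistency(depth_ref, depth_src,
                          K_ref, R_ref, t_ref, K_src, R_src, t_src,
                          tau1: float = 1.0, tau2: float = 0.01) -> np.ndarray:
    """Two-view geometric consistency check. Extrinsics follow P_cam = R @ P_world + t.
    Returns an (H, W) boolean mask of reference pixels consistent with the source view."""
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    d1 = depth_ref.reshape(-1)

    # lift reference pixels to world space and project them into the source view
    P_world = R_ref.T @ (np.linalg.inv(K_ref) @ pix * d1 - t_ref.reshape(3, 1))
    p_src = K_src @ (R_src @ P_world + t_src.reshape(3, 1))
    z_src = np.where(np.abs(p_src[2]) > 1e-8, p_src[2], 1e-8)
    us, vs = p_src[0] / z_src, p_src[1] / z_src
    inside = (p_src[2] > 0) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)

    # read the source depth at the projected location (nearest neighbour)
    ui = np.clip(np.round(us).astype(int), 0, W - 1)
    vi = np.clip(np.round(vs).astype(int), 0, H - 1)
    d_src = np.where(inside, depth_src[vi, ui], 0.0)

    # reproject the source depth back into the reference view
    P_back = R_src.T @ (np.linalg.inv(K_src) @ np.stack([us, vs, np.ones_like(us)]) * d_src
                        - t_src.reshape(3, 1))
    p_re = K_ref @ (R_ref @ P_back + t_ref.reshape(3, 1))
    d_re = p_re[2]
    z_re = np.where(np.abs(d_re) > 1e-8, d_re, 1e-8)
    err_pix = np.hypot(p_re[0] / z_re - pix[0], p_re[1] / z_re - pix[1])
    err_depth = np.abs(d_re - d1) / np.maximum(d1, 1e-8)

    return (inside & (d_src > 0) & (err_pix < tau1) & (err_depth < tau2)).reshape(H, W)
```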
In step S100 of the foregoing embodiments, the preprocessing of the plurality of sparse point clouds to obtain depth maps at a plurality of viewing angles comprises: S101, obtaining the three-dimensional points corresponding to all key points in each view, and filtering out the invisible three-dimensional points; and S102, obtaining the depth value of each three-dimensional point in the image coordinate system through projection and coordinate transformation according to the camera extrinsic parameters of the current view.
Schematically, step S101 may refer to fig. 4: the three-dimensional points corresponding to all key points in each view are obtained, the invisible three-dimensional points are filtered out, and the three-dimensional points visible in the view are denoted P_world. The three-dimensional points in the world coordinate system are converted into the camera coordinate system according to the camera extrinsic parameters of the current view, giving the points P_cam in the camera coordinate system:

P_{cam} = R \cdot P_{world} + t

where R is the rotation matrix and t is the translation vector. The projected point and the depth value of each three-dimensional point in the image coordinate system are then obtained from the projection relation:

d \cdot (u, v, 1)^T = K \cdot P_{cam}

where (u, v, 1)^T are the homogeneous coordinates of the image pixel, d is the depth value of the pixel, and K is the camera intrinsic matrix.
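For illustration only, the projection of the visible sparse points of one view into a sparse depth map (steps S101 and S102) might be sketched as follows; the handling of several points falling on the same pixel (keeping the nearest) is an assumption.

```python
import numpy as np

def sparse_depth_map(points_world: np.ndarray, K: np.ndarray, R: np.ndarray,
                     t: np.ndarray, H: int, W: int) -> np.ndarray:
    """Project visible sparse 3D points of one view into a sparse depth map,
    following P_cam = R @ P_world + t and d * (u, v, 1)^T = K @ P_cam.
    points_world: (M, 3); returns an (H, W) depth map, 0 where no point projects."""
    P_cam = R @ points_world.T + t.reshape(3, 1)             # (3, M)
    uvd = K @ P_cam                                          # rows are d * (u, v, 1)
    d = uvd[2]
    keep = d > 1e-6
    u = np.round(uvd[0, keep] / d[keep]).astype(int)
    v = np.round(uvd[1, keep] / d[keep]).astype(int)
    d = d[keep]
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, d = u[inside], v[inside], d[inside]

    depth = np.zeros((H, W))
    order = np.argsort(-d)                                   # write nearest points last
    depth[v[order], u[order]] = d[order]
    return depth
```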
It should be understood that the parameters used in cost volume construction, cost volume modulation, depth map filtering and point cloud fusion need to be set before the method is carried out. In practical applications, the depth maps regressed from cost volumes with different numbers of depth hypotheses differ; Gaussian modulation functions with different amplitudes and bandwidths impose different constraints; and different depth map fusion parameters yield three-dimensional point cloud models with different visual effects. The parameters used in the invention are: number of depth hypothesis planes N_d = 192, k = 10, c = 2Δd with Δd = d_{j+1} - d_j, τ_0 = 0.8, τ_1 = 1, τ_2 = 0.01, and n_τ = 3.
Example 2
Referring to fig. 7, in a second aspect of the present invention, a three-dimensional reconstruction system 1 based on sparse point cloud and cost aggregation is provided, comprising: an acquisition module 11, configured to acquire a multi-view image and a plurality of corresponding sparse point clouds and to preprocess the sparse point clouds to obtain depth maps at a plurality of viewing angles; a construction module 12, configured to extract features from the multi-view image, construct one or more cost volumes, and modulate and regularize each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes; and a reconstruction module 13, configured to recover a depth map from each probability volume and fuse it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
Further, the construction module 12 comprises: an extraction unit, configured to extract features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps; a computation unit, configured to select one of the feature maps as a reference feature map, take the remaining feature maps as source feature maps, and compute a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views; and an aggregation unit, configured to aggregate the feature volumes of the multiple views into a cost volume.
Example 3
Referring to fig. 8, in a third aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method of the invention in the first aspect.
The electronic device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following devices may be connected to the I/O interface 505 in general: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage device 508 including, for example, a hard disk; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 8 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more computer programs which, when executed by the electronic device, cause the electronic device to:
computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, Python, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A three-dimensional reconstruction method based on sparse point cloud and cost aggregation, characterized by comprising the following steps:
acquiring a multi-view image and a plurality of corresponding sparse point clouds, and preprocessing the sparse point clouds to obtain depth maps at a plurality of viewing angles;
extracting features from the multi-view image, constructing one or more cost volumes, and modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes;
and recovering a depth map from each probability volume, and fusing it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
2. The sparse point cloud and cost aggregation-based three-dimensional reconstruction method of claim 1, wherein the extracting features from the multi-view image and constructing one or more cost volumes comprises:
extracting features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps;
selecting one of the feature maps as a reference feature map, taking the remaining feature maps as source feature maps, and computing a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views;
and aggregating the feature volumes of the plurality of views into a cost volume.
3. The sparse point cloud and cost aggregation-based three-dimensional reconstruction method of claim 2, wherein the aggregation of the feature volumes of the plurality of views into the cost volume is realized as:

C = M(V_1, \ldots, V_N) = \frac{1}{N} \sum_{i=1}^{N} \left( V_i - \bar{V} \right)^2

where C denotes the cost volume, M denotes the element-wise variance operation, V_i denotes the i-th feature volume, N denotes the total number of feature volumes, and \bar{V} denotes the mean of all the feature volumes.
4. The sparse point cloud and cost aggregation-based three-dimensional reconstruction method of claim 1, wherein the modulating and regularizing each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes comprises:
constructing a Gaussian modulation function based on the depth maps at the plurality of viewing angles;
modulating each cost volume according to the Gaussian modulation function;
and regularizing each cost volume using a 3D segmentation network to obtain a filtered probability volume.
5. The sparse point cloud and cost aggregation-based three-dimensional reconstruction method of claim 4, wherein the regularization is implemented as:

\tilde{C}(v_0) = \sum_{k} \omega_k \cdot C(v_0 + v_k + \Delta v_k)

where C(v_0) is the cost of each voxel v_0 in the cost volume, \tilde{C}(v_0) is the cost of voxel v_0 after the regularization operation, \omega_k is the weight at the k-th sampling position, v_k is the fixed offset of the convolution receptive field, and \Delta v_k is the offset learned during adaptive cost aggregation.
6. The sparse point cloud and cost aggregation-based three-dimensional reconstruction method according to any one of claims 1 to 5, wherein the preprocessing of the plurality of sparse point clouds to obtain depth maps at a plurality of viewing angles comprises:
obtaining the three-dimensional points corresponding to all key points in each view, and filtering out the invisible three-dimensional points;
and obtaining the depth value of each filtered three-dimensional point in the image coordinate system through projection and coordinate transformation according to the camera extrinsic parameters of the current view.
7. A three-dimensional reconstruction system based on sparse point cloud and cost aggregation, comprising:
an acquisition module, configured to acquire a multi-view image and a plurality of corresponding sparse point clouds, and to preprocess the sparse point clouds to obtain depth maps at a plurality of viewing angles;
a construction module, configured to extract features from the multi-view image, construct one or more cost volumes, and modulate and regularize each cost volume using the plurality of sparse point clouds to obtain a plurality of probability volumes;
and a reconstruction module, configured to recover a depth map from each probability volume and fuse it with the filtered depth maps at the multiple viewing angles to obtain a reconstructed point cloud model.
8. The sparse point cloud and cost aggregation-based three-dimensional reconstruction system of claim 7, wherein the construction module comprises:
an extraction unit, configured to extract features from each image of the multi-view image using a convolutional neural network to obtain a plurality of feature maps;
a computation unit, configured to select one of the feature maps as a reference feature map, take the remaining feature maps as source feature maps, and compute a feature volume of each source feature map with respect to the reference feature map to obtain feature volumes of a plurality of views;
and an aggregation unit, configured to aggregate the feature volumes of the multiple views into a cost volume.
9. An electronic device, comprising: one or more processors; a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the sparse point cloud and cost aggregation based three-dimensional reconstruction method of any one of claims 1 to 6.
10. A computer readable medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the sparse point cloud and cost aggregation based three-dimensional reconstruction method of any one of claims 1 to 6.
CN202210090256.0A 2022-01-25 2022-01-25 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation Pending CN114519772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210090256.0A CN114519772A (en) 2022-01-25 2022-01-25 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210090256.0A CN114519772A (en) 2022-01-25 2022-01-25 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation

Publications (1)

Publication Number Publication Date
CN114519772A true CN114519772A (en) 2022-05-20

Family

ID=81597577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210090256.0A Pending CN114519772A (en) 2022-01-25 2022-01-25 Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation

Country Status (1)

Country Link
CN (1) CN114519772A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114820755A (en) * 2022-06-24 2022-07-29 武汉图科智能科技有限公司 Depth map estimation method and system
CN115861401A (en) * 2023-02-27 2023-03-28 之江实验室 Binocular and point cloud fusion depth recovery method, device and medium
CN117671163A (en) * 2024-02-02 2024-03-08 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN117671163B (en) * 2024-02-02 2024-04-26 苏州立创致恒电子科技有限公司 Multi-view three-dimensional reconstruction method and system
CN118365805A (en) * 2024-06-19 2024-07-19 淘宝(中国)软件有限公司 Three-dimensional scene reconstruction method and electronic equipment

Similar Documents

Publication Publication Date Title
CN106651938B (en) A kind of depth map Enhancement Method merging high-resolution colour picture
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN111325796B (en) Method and apparatus for determining pose of vision equipment
CN106910242B (en) Method and system for carrying out indoor complete scene three-dimensional reconstruction based on depth camera
CN105721853B (en) Generate method, system and the computer readable storage devices of image capture instruction
CN114519772A (en) Three-dimensional reconstruction method and system based on sparse point cloud and cost aggregation
WO2018127007A1 (en) Depth image acquisition method and system
WO2019011249A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
US20110274343A1 (en) System and method for extraction of features from a 3-d point cloud
Chen et al. Transforming a 3-d lidar point cloud into a 2-d dense depth map through a parameter self-adaptive framework
Jaegle et al. Fast, robust, continuous monocular egomotion computation
CN113129352B (en) Sparse light field reconstruction method and device
CN114998406B (en) Self-supervision multi-view depth estimation method and device
CN116563493A (en) Model training method based on three-dimensional reconstruction, three-dimensional reconstruction method and device
JP2024507727A (en) Rendering a new image of a scene using a geometric shape recognition neural network conditioned on latent variables
CN115082540B (en) Multi-view depth estimation method and device suitable for unmanned aerial vehicle platform
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN110827341A (en) Picture depth estimation method and device and storage medium
Biasutti et al. Visibility estimation in point clouds with variable density
Seo et al. An efficient detection of vanishing points using inverted coordinates image space
Tanner et al. DENSER cities: A system for dense efficient reconstructions of cities
CN109816726A (en) A kind of visual odometry map updating method and system based on depth filter
Hu et al. 3D map reconstruction using a monocular camera for smart cities

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: No. 548, 5th Floor, Building 10, No. 28 Linping Avenue, Donghu Street, Linping District, Hangzhou City, Zhejiang Province

Applicant after: Hangzhou Tuke Intelligent Information Technology Co.,Ltd.

Address before: 430000 B033, No. 05, 4th floor, building 2, international enterprise center, No. 1, Guanggu Avenue, Donghu New Technology Development Zone, Wuhan, Hubei (Wuhan area of free trade zone)

Applicant before: Wuhan Tuke Intelligent Technology Co.,Ltd.

Country or region before: China

CB02 Change of applicant information