CN115170746A - Multi-view three-dimensional reconstruction method, system and equipment based on deep learning - Google Patents


Info

Publication number
CN115170746A
CN115170746A
Authority
CN
China
Prior art keywords
point cloud
scales
scale
semantic
dimensional reconstruction
Prior art date
Legal status
Granted
Application number
CN202211087276.9A
Other languages
Chinese (zh)
Other versions
CN115170746B (en)
Inventor
Ren Shengbing (任胜兵)
Peng Zewen (彭泽文)
Chen Xuyang (陈旭洋)
Current Assignee
Central South University
Original Assignee
Central South University
Priority date
Filing date
Publication date
Application filed by Central South University
Priority to CN202211087276.9A
Publication of CN115170746A
Application granted
Publication of CN115170746B
Status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention discloses a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning, wherein a plurality of multi-view images are obtained, multi-scale semantic feature extraction is carried out on the multi-view images, and feature maps of various scales are obtained; performing multi-scale semantic segmentation on the feature maps of various scales to obtain semantic segmentation sets of various scales; reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map; obtaining depth maps of various scales based on the semantic segmentation sets and the initial depth maps of various scales; constructing point cloud sets with various scales; optimizing the point cloud sets of various scales by adopting different radius filtering to obtain optimized point cloud sets; reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales; and splicing and fusing the three-dimensional reconstruction results of each scale. The invention can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.

Description

Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a system and equipment for multi-view three-dimensional reconstruction based on deep learning.
Background
The three-dimensional reconstruction method based on deep learning builds a neural network on a computer, trains it with a large amount of image data and three-dimensional model data, and learns the mapping relation from an image to a three-dimensional model, thereby realizing three-dimensional reconstruction of a new image target. Compared with traditional methods such as the 3D Morphable Model (3DMM) method and Structure from Motion (SfM), the deep learning three-dimensional reconstruction method can introduce learned global semantic information into image reconstruction, thereby overcoming to a certain extent the limitation that traditional reconstruction methods reconstruct poorly in weak-illumination and weak-texture areas.
Existing deep learning three-dimensional reconstruction methods are mostly based on a single scale; that is, objects of different sizes in an image are reconstructed in the same way. Single-scale reconstruction maintains good reconstruction accuracy and speed in environments with low scene complexity and few fine objects. However, in environments with complex scenes and many objects of various scales, the reconstruction accuracy of small-scale objects is often insufficient. Moreover, such methods use only high-level features and do not fully exploit the low-level detail information of the image.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a multi-view three-dimensional reconstruction method, a multi-view three-dimensional reconstruction system and multi-view three-dimensional reconstruction equipment based on deep learning, which can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a first aspect, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing the multiple multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
constructing a point cloud set with multiple scales based on the depth maps with multiple scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the first aspect of the invention has the following beneficial effects:
the method can extract the features of different scales by extracting the multi-scale semantic features of a plurality of multi-view images, can obtain the feature maps of various scales, can perform multi-scale semantic segmentation on the feature maps of various scales, and can aggregate the semantic information of each scale, thereby enriching the semantic information of each scale; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; the method comprises the steps of constructing a point cloud set with various scales by using the obtained depth maps with various scales, optimizing by adopting different radius filtering according to the scales of the point cloud set, using the optimized point cloud set for reconstruction with different scales, and fusing three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method can fully utilize semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
According to some embodiments of the present invention, the performing multi-scale semantic feature extraction on a plurality of the multi-view images to obtain feature maps of multiple scales includes:
performing multilayer feature extraction on the multiple multi-view images through a ResNet network to obtain original feature maps with multiple scales;
and respectively connecting the original feature map of each scale with channel attention so as to carry out importance weighting on the original feature map of each scale through a channel attention mechanism and obtain feature maps of various scales.
According to some embodiments of the present invention, the importance weighting is performed on the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, including:
compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature vector corresponding to the original feature map of each scale;
inputting the one-dimensional feature vector into a fully connected layer through an excitation network to perform importance prediction and obtain the importance of each channel;
and applying the importance of each channel to the original feature map of each scale through an excitation function to obtain the feature maps of multiple scales.
According to some embodiments of the present invention, the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain a semantic segmentation set of multiple scales includes:
clustering the feature maps of multiple scales through non-negative matrix factorization to obtain semantic segmentation sets of multiple scales; wherein the non-negative matrix factorization solves:

min over P ≥ 0, Q ≥ 0 of ‖V − PQ‖_F²

wherein V represents the matrix of HW rows and C columns obtained by mapping, concatenating and reshaping the feature maps of the various scales; P represents the coefficient matrix of HW rows and K columns; Q represents the basis matrix of K rows and C columns; H and W represent the height and width of the feature maps; K represents the non-negative matrix factorization factor, i.e. the number of semantic clusters; C represents the dimension of each pixel; and F denotes the Frobenius norm.
According to some embodiments of the present invention, the obtaining the depth maps of the plurality of scales based on the semantic segmentation sets of the plurality of scales and the initial depth map comprises:
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth image;
the number of reference points is chosen by the following formula:
N_j = (HW / t) · ( K_j / Σ_i K_i )

wherein N_j represents the number of reference points selected for the j-th segmentation set; H represents the height of the multi-view image; W represents the width of the multi-view image; HW represents the number of pixel points of the multi-view image; t represents a constant parameter; K_j represents the number of semantic categories contained in the j-th semantic segmentation set; and K_i represents the number of semantic categories contained in the i-th semantic segmentation set, the sum running over all segmentation sets;
based on each reference point, obtaining the matching point of each reference point on the graph to be matched through the following formula:
p̂_i = K · T · ( d_i · K⁻¹ · p_i )

wherein p̂_i represents the matching point of the i-th reference point P_i on the image to be matched; K represents the camera intrinsics; T represents the camera extrinsics; and d_i represents the depth value of the reference point P_i on the initial depth map;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, wherein the semantic loss function L_sem is calculated as follows:

L_sem = (1/N) · Σ_{i=1..N} M_i · Δs_i

wherein Δs_i represents the difference between the semantic information of the i-th reference point and the semantic information of the i-th matching point; M_i represents a mask; and N represents the number of reference points.
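To make the semantic guidance concrete, here is an illustrative NumPy sketch (the camera matrix, pose and labels are made-up values, not taken from the patent): a reference pixel is back-projected with its depth, re-projected into the view to be matched, and a masked semantic disagreement is averaged over the reference points:

```python
import numpy as np

# Assumed pinhole setup (illustrative values only).
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])   # extrinsics T = [R | t]

def reproject(p, depth):
    """Map reference pixel p = (u, v) with its depth into the matching view:
    p_hat ~ K (R X + t), where X = depth * K^-1 [u, v, 1]^T."""
    X = depth * (np.linalg.inv(K) @ np.array([p[0], p[1], 1.0]))
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def semantic_loss(ref_labels, match_labels, mask):
    """Mean masked disagreement between reference and matched semantics."""
    diff = (ref_labels != match_labels).astype(float)
    return (mask * diff).sum() / len(ref_labels)

p_hat = reproject((320., 240.), depth=2.0)     # matching point in the other view
loss = semantic_loss(np.array([1, 2, 3]),      # reference-point categories
                     np.array([1, 2, 0]),      # matched-point categories
                     np.ones(3))               # all points unmasked
```

With the toy pose above the center pixel at depth 2 lands at (345, 240), and one of three label pairs disagreeing gives a loss of 1/3.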
According to some embodiments of the invention, the constructing a point cloud set of multiple scales based on the depth maps of multiple scales comprises:
constructing a point cloud set of each scale by using the depth map of each scale according to the following expression:
z = d(u, v),  x = u · z / f_x,  y = v · z / f_y

wherein u represents the abscissa of the depth map; v represents the ordinate of the depth map; f_x and f_y represent the camera focal lengths obtained from the camera parameters; and x, y and z represent the coordinates of the transformed point cloud.
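A hedged NumPy sketch of this back-projection (the principal point (cx, cy) is my assumption, since the patent's wherein-clause lists only the focal lengths):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a point cloud with the pinhole model:
    z = depth(v, u), x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, shape (h, w)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat 2x2 depth map at depth 2.0 becomes 4 points on the plane z = 2.
pts = depth_to_points(np.full((2, 2), 2.0), fx=500., fy=500., cx=0.5, cy=0.5)
```

Running one such back-projection per scale yields the point cloud set of each scale.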
According to some embodiments of the present invention, the optimizing the point cloud sets of multiple scales by using different radius filtering according to the scales of the point cloud sets to obtain an optimized point cloud set includes:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the corresponding radius of the point cloud in the point cloud set by adopting the following formula according to the scale of the point cloud set:
r_l = α · t^l

wherein r_l represents the radius corresponding to the point clouds in the point cloud sets of different scales; α represents a constant parameter; t represents a constant parameter; and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain an optimized point cloud set.
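An illustrative brute-force sketch of radius outlier filtering (NumPy; the scale-to-radius mapping r = alpha * t**level is an assumed form, since the patent's formula is an image placeholder):

```python
import numpy as np

def radius_filter(points, radius, min_neighbors):
    """Keep points having at least `min_neighbors` other points within
    `radius` (brute force; a real pipeline would use a KD-tree)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    counts = (d <= radius).sum(axis=1) - 1     # exclude the point itself
    return points[counts >= min_neighbors]

# Assumed scale-dependent radius: coarser clouds get a larger radius.
alpha, t = 0.05, 2.0
radius = alpha * t ** 1                        # scale grade l = 1 -> r = 0.1

cloud = np.array([[0.00, 0.00, 0.0],
                  [0.01, 0.00, 0.0],
                  [0.00, 0.01, 0.0],
                  [5.00, 5.00, 5.0]])          # last point is an isolated outlier
kept = radius_filter(cloud, radius, min_neighbors=1)
```

The three clustered points survive while the isolated point is discarded, which is the optimization step applied per scale above.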
In a second aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction system, where the deep learning-based multi-view three-dimensional reconstruction system includes:
the feature map acquisition unit is used for acquiring a plurality of multi-view images and performing multi-scale semantic feature extraction on the multi-view images to obtain feature maps of multiple scales;
the semantic segmentation set acquisition unit is used for carrying out multi-scale semantic segmentation on the feature maps with various scales to acquire a semantic segmentation set with various scales;
the initial depth map acquisition unit is used for reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
the depth map acquisition unit is used for acquiring depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
the point cloud set acquisition unit is used for constructing point cloud sets of multiple scales on the basis of the depth maps of the multiple scales;
the radius filtering unit is used for optimizing the point cloud sets with various scales by adopting different radius filtering according to the scales of the point cloud sets to obtain the optimized point cloud sets;
a reconstruction result obtaining unit, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and the reconstruction result fusion unit is used for splicing and fusing the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Compared with the prior art, the second aspect of the invention has the following beneficial effects:
the feature map acquisition unit of the system can extract deep features by performing multi-scale semantic feature extraction on a plurality of multi-view images, can acquire feature maps of various scales, performs multi-scale semantic segmentation on the feature maps of various scales by the semantic segmentation set acquisition unit, aggregates semantic information of various scales, and enriches the semantic information of various scales; the depth map acquisition unit is used for respectively carrying out semantic guidance on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and the accurate depth map of multiple scales is obtained; a point cloud set acquisition unit of the system constructs a point cloud set with multiple scales by using the acquired depth maps with multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set through a radius filtering unit, reconstruction with different scales is carried out on the basis of the optimized point cloud set through a reconstruction result acquisition unit, and then a three-dimensional reconstruction result is fused through a reconstruction result fusion unit to obtain a more accurate three-dimensional reconstruction result. Therefore, the system can make full use of semantic information of each scale and improve the accuracy of three-dimensional reconstruction.
In a third aspect, an embodiment of the present invention further provides a deep learning-based multi-view three-dimensional reconstruction apparatus, including at least one control processor and a memory, which is in communication connection with the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform a method of deep learning based multi-view three-dimensional reconstruction as described above.
In a fourth aspect, the embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to enable a computer to execute a method for deep learning based multi-view three-dimensional reconstruction as described above.
It is to be understood that the advantageous effects of the third aspect to the fourth aspect compared to the related art are the same as the advantageous effects of the first aspect compared to the related art, and reference may be made to the related description in the first aspect, which is not repeated herein.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a deep learning-based multi-view three-dimensional reconstruction method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a depth residual network in accordance with one embodiment of the present invention;
FIG. 3 is a schematic diagram of a non-negative matrix factorization of an embodiment of the present invention;
FIG. 4 is a block diagram of multi-scale semantic segmentation in accordance with an embodiment of the present invention;
fig. 5 is a structural diagram of a deep learning-based multi-view three-dimensional reconstruction system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, if there are first, second, etc. described, it is only for the purpose of distinguishing technical features, and it is not understood that relative importance is indicated or implied or that the number of indicated technical features is implicitly indicated or that the precedence of the indicated technical features is implicitly indicated.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to, for example, the upper, lower, etc., is indicated based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that unless otherwise explicitly defined, terms such as setup, installation, connection, etc. should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention by combining the detailed contents of the technical solutions.
For the convenience of understanding of those skilled in the art, the terms in the present embodiment are explained:
the deep learning three-dimensional reconstruction method comprises the following steps: the three-dimensional reconstruction method for deep learning is that a neural network is built by a computer, training is carried out through a large amount of image data and three-dimensional model data, and the mapping relation between an image and a three-dimensional model is learned, so that three-dimensional reconstruction of a new image target is realized. Compared with the traditional method for reconstructing three-dimensional information such as 3DMM and the method for reconstructing three-dimensional information by SFM, the three-dimensional reconstruction method for deep learning can introduce some global semantic information into image reconstruction, thereby overcoming the limitation that the traditional reconstruction method is poor in reconstruction in weak illumination and weak texture areas to a certain extent, wherein the SFM algorithm is an off-line algorithm for three-dimensional reconstruction based on various collected disordered pictures; the 3DMM, a three-dimensional deformable face model, is a general three-dimensional face model, and represents a face by using fixed points.
Current deep learning three-dimensional reconstruction methods can be mainly classified into supervised three-dimensional reconstruction methods (for example, MVSNet, CVP-MVSNet and PatchmatchNet in the prior art) and self-supervised three-dimensional reconstruction methods (for example, JDACS-MS in the prior art). Supervised three-dimensional reconstruction methods require ground-truth values for training and achieve high accuracy, but are difficult to apply in scenes where ground-truth values are hard to acquire. Self-supervised three-dimensional reconstruction methods need no ground-truth training data and have a wide application range, but their accuracy is relatively low.
Semantic segmentation: semantic segmentation is a classification at the pixel level, and pixels belonging to the same class are classified into one class, so that semantic segmentation is used for understanding an image from the pixel level, for example, pixels having different semantics are marked with different colors. Pixels belonging to animals are classified into the same class. The segmented semantic information can guide image reconstruction, and reconstruction accuracy is improved. And performing semantic segmentation by adopting a clustering mode, and clustering pixels belonging to the same class into the same class.
Depth map: a distance image is an image in which the distance (depth) value from an image capture device to each point in a scene is defined as a pixel value.
Point cloud: the point data set of the object appearance surface is point cloud, contains information such as three-dimensional coordinate information and color of the object, and can realize image reconstruction through the point cloud data.
Non-negative Matrix Factorization (NMF): a matrix factorization method under the constraint that all matrix elements are non-negative. Many analysis methods solve practical problems through matrix factorization, such as PCA (principal component analysis), ICA (independent component analysis), SVD (singular value decomposition) and VQ (vector quantization). All of these methods approximately decompose an original large matrix V into a low-rank form V = WH. Their common feature is that the elements of the factors W and H may be positive or negative; even if all elements of the input matrix are positive, traditional rank-reduction algorithms cannot guarantee the non-negativity of the factors. Mathematically, negative values in the decomposition results are perfectly valid from a computational point of view, but negative elements are often meaningless in practical problems.
The three-dimensional reconstruction method based on deep learning builds a neural network on a computer, trains it with a large amount of image data and three-dimensional model data, and learns the mapping relation from an image to a three-dimensional model, thereby realizing three-dimensional reconstruction of a new image target. Compared with traditional methods such as the 3DMM method and the SfM method, the deep learning three-dimensional reconstruction method can introduce learned global semantic information into image reconstruction, thereby overcoming to a certain extent the limitation that traditional reconstruction methods reconstruct poorly in weak-illumination and weak-texture areas.
Existing deep learning three-dimensional reconstruction methods are mostly based on a single scale; that is, objects of different sizes in an image are reconstructed in the same way. Single-scale reconstruction maintains good reconstruction accuracy and speed in environments with low scene complexity and few fine objects. However, in environments with complex scenes and many objects of various scales, the reconstruction accuracy of small-scale objects is often insufficient. Moreover, such methods use only high-level features and do not fully exploit the low-level detail information of the image.
In order to solve the problems, the multi-scale semantic feature extraction is carried out on a plurality of multi-view images, features of different scales can be extracted, feature maps of various scales can be obtained, multi-scale semantic segmentation is carried out on the feature maps of various scales, semantic information of various scales is aggregated, and the semantic information of various scales is enriched; semantic guidance is respectively carried out on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained; the method and the device construct the point cloud sets of various scales by using the obtained depth maps of various scales, optimize by adopting different radius filtering according to the scales of the point cloud sets, use the optimized point cloud sets for reconstruction of different scales, and fuse three-dimensional reconstruction results to obtain more accurate three-dimensional reconstruction results. Therefore, the method and the device can make full use of semantic information of all scales, and can improve the accuracy of three-dimensional reconstruction.
Referring to fig. 1, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction method, where the deep learning-based multi-view three-dimensional reconstruction method includes:
s100, acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales.
Specifically, a plurality of multi-view images are acquired, and the object to be recognized can be subjected to image acquisition at various angles in all directions through image acquisition equipment such as a camera and an image scanner, so that the plurality of multi-view images are obtained. For example, when multi-scale semantic feature extraction needs to be performed on multiple multi-view images, multiple multi-view images can be obtained by using an image acquisition device such as a camera.
In the embodiment, multilayer feature extraction is performed on a plurality of multi-view images through a ResNet network to obtain original feature maps with various scales;
respectively connecting the original feature map of each scale with channel attention, and weighting the importance of the original feature map of each scale through a channel attention mechanism to obtain feature maps of multiple scales, specifically:
compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature map corresponding to the original feature map of each scale;
inputting the one-dimensional feature map into a fully connected layer through an excitation network to predict the importance of each channel;
and applying the importance of each channel to the one-dimensional feature map of the original feature map of each scale through an excitation function to obtain feature maps of multiple scales.
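The three steps above follow the squeeze-and-excitation pattern of channel attention; they can be sketched as follows. This is a minimal NumPy illustration rather than the embodiment's actual network, and the reduction ratio r, the random weights and the layer sizes are assumptions:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Weight the channels of a (H, W, C) feature map by predicted importance.

    feat: original feature map of one scale, shape (H, W, C)
    w1:   (C, C//r) weights of the first (squeeze) fully connected layer
    w2:   (C//r, C) weights of the second (excitation) fully connected layer
    """
    # Compression network: global average pooling squeezes (H, W, C) to a
    # one-dimensional feature of length C.
    z = feat.mean(axis=(0, 1))                    # (C,)
    # Excitation network: fully connected layers predict channel importance.
    h = np.maximum(z @ w1, 0.0)                   # ReLU
    s = 1.0 / (1.0 + np.exp(-(h @ w2)))           # Sigmoid importance in (0, 1)
    # Apply the per-channel importance back onto the original feature map.
    return feat * s                               # broadcasts over (H, W, C)

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4                          # r: assumed reduction ratio
feat = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
out = channel_attention(feat, w1, w2)
print(out.shape)  # (8, 8, 16)
```

In a trained network w1 and w2 would be learned jointly with the feature extractor; here they are random stand-ins so the data flow of squeeze, excite and re-weight is visible.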
In this embodiment, a ResNet network is adopted to extract image features. In theory, the deeper a deep learning network is, the stronger its expressive power; however, after a plain CNN reaches a certain depth, making it deeper degrades classification performance, slows network convergence and reduces accuracy, and even enlarging the data set to alleviate overfitting does not improve classification performance or accuracy. The ResNet network adopts a residual learning method. Referring to FIG. 2, when the input is x, the learned feature is denoted H(x). The network is instead made to learn the residual

    F(x) = H(x) − x,

so that the actual original learned feature is

    H(x) = F(x) + x.

This is done because learning the residual is easier than learning the original feature directly. When the residual is 0, the stacked layers perform only an identity mapping, so at the very least network performance does not degrade; in practice the residual is not 0, so the stacked layers can learn new features on top of the input features and achieve better performance. The residual function is easier to optimize, and the number of network layers can be greatly deepened, so deeper semantic information can be extracted. The performance of ResNet in terms of efficiency, resource consumption and deep semantic feature extraction is clearly superior to that of networks such as VGG.
After multi-layer feature extraction is carried out on a plurality of multi-view images through a ResNet network to obtain original feature maps of various scales, the original feature maps of each scale are respectively connected with channel attention, and importance weighting is carried out on the original feature maps of each scale through a channel attention mechanism to obtain the feature maps of various scales. The channel attention mechanism mainly comprises a compression network and an excitation network, and comprises the following specific processes:
let the dimension of the original feature map be H × W × C, where H is Height (Height), W is width (width), and C is channel number (channel). The compression network does the same by compressing H x W C to 1 x 1C, which is equivalent to compressing H x W to one-dimensional features, by global averaging pooling. After H W is compressed into one dimension, the corresponding one-dimensional parameters obtain the previous H W global view, and the sensing area is wider. And transmitting the one-dimensional characteristics obtained by the compression network to an excitation network, transmitting the one-dimensional characteristics to a full connection layer by the excitation network, predicting the importance of each channel to obtain the importance of different channels, and exciting the importance of different channels to the channels corresponding to the previous characteristic diagrams by a Sigmoid excitation function. The channel attention mechanism enables the network to pay attention to more effective semantic features, the weight of the semantic features is improved in an iterative mode, the feature extraction network extracts rich semantic features, and the importance of different semantic features to semantic segmentation is different. The introduction of the channel attention mechanism can enable the network to pay attention to more effective features, inhibit inefficient features and improve the effectiveness of feature extraction.
In the prior art, feature extraction relies on convolutional neural networks such as the VGG network, which are limited by their number of layers: their deep-level feature extraction capability is insufficient and the extracted features are not highly effective. As the number of convolution layers increases, problems such as slow network convergence and low accuracy appear; moreover, the extracted features differ in importance for image reconstruction, and extracting highly effective features is hard to guarantee. Therefore, in this embodiment, deep features can be extracted by performing multi-scale semantic feature extraction on the multiple multi-view images, and feature maps of multiple scales can be obtained. Through the introduction of a channel attention mechanism, the network can focus on more effective features, suppress inefficient features and improve the effectiveness of feature extraction.
And S200, performing multi-scale semantic segmentation on the feature maps of various scales to obtain a semantic segmentation set of various scales.
Specifically, the feature maps of multiple scales are clustered through non-negative matrix factorization (NMF) to obtain semantic segmentation sets of multiple scales, where the NMF objective is:

    min_{P ≥ 0, Q ≥ 0} ‖V − PQ‖_F²

where V represents the matrix of HW rows and C columns formed by mapping, concatenating and reshaping the feature maps of the various scales, P represents a matrix of HW rows and K columns (the coefficient matrix), Q represents a matrix of K rows and C columns (the basis matrix), K represents the non-negative matrix factorization factor giving the number of semantic clusters, C represents the feature dimension of each pixel, and ‖·‖_F denotes the Frobenius (entrywise, non-induced) norm.
A typical matrix factorization decomposes a large matrix into smaller matrices whose elements may be positive or negative. In the real world, negative entries in matrices formed from images, texts and the like are meaningless, so a decomposition into all non-negative elements is meaningful. NMF requires the original matrix V to be non-negative; the matrix V can then be decomposed into the product of two smaller non-negative matrices, with one and only one such decomposition satisfying existence and uniqueness. For example, given a non-negative matrix V of size m × n, NMF seeks a non-negative matrix W of size m × k and a non-negative matrix H of size k × n such that

    V ≈ WH.

The decomposition can be understood as follows: each column vector of the original matrix V is a weighted sum of the column vectors of the left matrix W, with the weighting coefficients given by the elements of the corresponding column of the right matrix H. W is therefore called the basis matrix and H the coefficient matrix.
Referring to fig. 3, the N multi-scale feature maps are first concatenated and reshaped into an (HW, C) matrix V. The NMF is solved with the multiplicative update rules

    P ← P ⊙ (V Qᵀ) ⊘ (P Q Qᵀ),   Q ← Q ⊙ (Pᵀ V) ⊘ (Pᵀ P Q),

where ⊙ and ⊘ denote element-wise multiplication and division, so that V is decomposed by NMF (i.e., non-negative matrix factorization) into an (HW, K) matrix P and a (K, C) matrix Q, where K is the NMF factor representing the number of semantic clusters. Because of the orthogonality constraint of the NMF (QQᵀ = I), each row of the (K, C) matrix Q may be regarded as a C-dimensional cluster center, corresponding to one of several objects in the view. The rows of the (HW, K) matrix P correspond to the positions of all pixels from the N multi-scale feature maps. In general, the matrix factorization forces the product between each row of P and each column of Q to better approximate the C-dimensional feature of each pixel in V. Thus, the semantic category of each position in the image is obtained from the P matrix.
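The multiplicative update rules above can be sketched in NumPy as follows. This is a standard Lee–Seung style implementation for the Frobenius objective; the iteration count, the stabilizing epsilon and the random initialization are assumptions, not values from the embodiment:

```python
import numpy as np

def nmf(V, K, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative (HW, C) matrix V into P (HW, K) and Q (K, C)."""
    rng = np.random.default_rng(seed)
    HW, C = V.shape
    P = rng.random((HW, K)) + eps
    Q = rng.random((K, C)) + eps
    for _ in range(iters):
        # Multiplicative updates keep P and Q non-negative at every step.
        P *= (V @ Q.T) / (P @ Q @ Q.T + eps)
        Q *= (P.T @ V) / (P.T @ P @ Q + eps)
    return P, Q

# Feature maps would be concatenated and reshaped into V; here V is
# synthetic non-negative data standing in for the (HW, C) matrix.
rng = np.random.default_rng(1)
V = rng.random((64, 12))            # HW = 64 pixel positions, C = 12 channels
P, Q = nmf(V, K=3)                  # K = number of semantic clusters
labels = P.argmax(axis=1)           # semantic category of each pixel position
err = np.linalg.norm(V - P @ Q)     # Frobenius reconstruction error
```

Reading the cluster assignment from the row-wise argmax of P mirrors the statement that the semantic category of each position is obtained from the P matrix.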
Referring to FIG. 4, assume the extracted feature maps are F = {F_1, F_2, F_3}. Each feature map is semantically segmented by clustering (i.e., NMF non-negative matrix factorization), decomposing F into the segmentation sets S = {S_1, S_2, S_3}. Because the receptive field of a high-level feature layer is large, its features are more abstract and attend more to the global view, while a low-level feature layer has a small receptive field and attends more to details. Thus the segmentation sets obtained by multi-scale semantic segmentation comprise multiple layers from coarse to fine, and the segmentation sets S1 to S3 in fig. 4 contain increasingly detailed information. Each segmentation set S contains the semantic segmentation results of an input set of images (the reference image and the images to be matched); for example, different colors represent different semantic categories, and a segmentation set containing more detailed information (e.g., segmentation set S3) contains more semantic categories.
In this embodiment, note that most current deep learning three-dimensional reconstruction methods are based on a single scale, so objects of different sizes in the image are reconstructed in the same manner. Single-scale reconstruction can keep good reconstruction accuracy and speed in environments with low scene complexity and few small objects, but in complex scenes containing many objects of various scales it easily suffers from insufficient reconstruction accuracy for small-scale objects; moreover, only the high-level features are used, and the low-level detail information of the image is not fully exploited. Therefore, in this embodiment, the feature maps of multiple scales are subjected to multi-scale semantic segmentation and the semantic information of each scale is aggregated, which enriches the semantic information of each scale and makes full use of the detail information of the low-level feature layers.
And S300, reconstructing a plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map.
Specifically, in the embodiment, a plurality of multi-view images are reconstructed by a supervised three-dimensional reconstruction method, so as to obtain an initial depth map.
In this embodiment, the initial depth map is obtained through a supervised three-dimensional reconstruction method, which can improve reconstruction precision. However, although the supervised three-dimensional reconstruction method has high precision, it needs a large amount of ground-truth training data, and in certain specific scenes (for example, underwater) such ground truth is difficult to acquire, making the method difficult to apply. Therefore, step S400 is required to perform semantic guidance on the initial depth map of this embodiment, converting the supervised three-dimensional reconstruction method into an unsupervised one and realizing self-supervised three-dimensional reconstruction, thereby overcoming the inherent defects of the supervised method.
The supervised three-dimensional reconstruction method in this embodiment is any supervised three-dimensional reconstruction method in the prior art, for example, MVSNet (MVSNet: Depth Inference for Unstructured Multi-view Stereo), CVP-MVSNet (Cost Volume Pyramid Based Depth Inference for Multi-View Stereo), or PatchmatchNet (PatchmatchNet: Learned Multi-View Patchmatch Stereo); a detailed description thereof is omitted.
And S400, obtaining the depth maps of various scales based on the semantic segmentation sets and the initial depth map of various scales.
Specifically, in this embodiment, semantic information is used as a supervision signal to combine with a supervised three-dimensional reconstruction method, and the image reconstruction is guided to obtain a depth map, which specifically includes the following processes:
acquiring a plurality of multi-view images through image acquisition equipment, and taking the plurality of multi-view images as input to obtain an initial depth map through a supervised three-dimensional reconstruction method;
selecting any one of the multiple multi-view images as a reference image, and taking the other images as images to be matched;
selecting a reference point from the reference image, acquiring a semantic category corresponding to the reference point in the semantic segmentation set, and acquiring a depth value corresponding to the reference point on the initial depth map;
the number of reference points is chosen by the following formula:

    N_j = (HW / t) · K_j / Σ_i K_i

where N_j represents the number of reference points selected for the jth segmentation set, H represents the height of the multi-view image, W represents the width of the multi-view image, HW represents the number of pixels of the multi-view image, t represents a constant parameter, K_j represents the number of semantic categories contained in the jth semantic segmentation set, and K_i represents the number of semantic categories contained in the ith semantic segmentation set (the sum running over all segmentation sets);
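Under a proportional-allocation reading of this formula (an assumption, since the patent's equation image is not reproduced here), the reference-point budget HW/t is split across the segmentation sets in proportion to their category counts:

```python
def num_reference_points(hw, t, categories):
    """categories[j] = number of semantic categories K_j in the jth segmentation set.

    Returns N_j = (HW / t) * K_j / sum_i K_i for each set, rounded down,
    so sets with more categories receive finer guidance via more points.
    """
    total = sum(categories)
    return [int(hw / t * k / total) for k in categories]

# Example: a 64x80 image (HW = 5120), assumed constant t = 10, and three
# segmentation sets with 2, 4 and 8 semantic categories (coarse to fine).
counts = num_reference_points(64 * 80, 10, [2, 4, 8])
print(counts)  # [73, 146, 292]
```

The finest set receives the most reference points, matching the statement that sets with more categories need finer guidance.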
based on each reference point, the matching point of each reference point on the image to be matched is acquired through the following formula:

    p̂_i = K · T · (D(P_i) · K⁻¹ · P_i)

where p̂_i represents the matching point of the ith reference point on the image to be matched, K represents the camera intrinsics, T represents the camera extrinsics, and D(P_i) represents the depth value of the reference point P_i of the reference image on the initial depth map;
the semantic category corresponding to each matching point is obtained, and the multi-view image of each scale is corrected by minimizing a semantic loss function to obtain depth maps of multiple scales; the semantic loss function L_sem is calculated as follows:

    L_sem = (1/N) · Σ_{i=1}^{N} M_i · ‖S(P_i) − S(p̂_i)‖

where S(P_i) − S(p̂_i) represents the difference between the semantic information of the ith reference point and the semantic information of the ith matching point, M_i represents the mask, and N represents the number of reference points. This embodiment is illustrated by the following example:
First, multiple multi-view images of the same object under different view angles are acquired through an image acquisition device; with these multi-view images as input, an initial depth map is obtained by a supervised three-dimensional reconstruction method. One of the input multi-view images is selected as the reference image and the remaining images serve as images to be matched. A reference point P_i is taken on the reference image, together with its corresponding semantic category S_i in the segmentation set S and its corresponding depth value on the depth map.
For segmentation sets of different levels, the numbers of semantic categories differ: a segmentation set with more categories needs finer guidance and therefore more reference points, whose number is selected according to the formula

    N_j = (HW / t) · K_j / Σ_i K_i.

The matching point p̂_i corresponding to each reference point on the image to be matched is obtained through the homography formula

    p̂_i = K · T · (D(P_i) · K⁻¹ · P_i),

and the semantic category Ŝ_i of the matching point p̂_i is taken. If the depth map is accurate (i.e., the depth value at the corresponding position is correct), the semantic category of the matching point computed from the reference point should be the same as the semantic category of the reference point, so the semantic loss function

    L_sem = (1/N) · Σ_{i=1}^{N} M_i · ‖S(P_i) − S(p̂_i)‖

is calculated and minimized. By minimizing the semantic loss function, the initial depth map is continuously corrected, and an accurate depth map is finally obtained. The semantic information can replace a ground-truth value for guidance, converting the supervised three-dimensional reconstruction method into an unsupervised one and realizing self-supervised three-dimensional reconstruction, thereby overcoming the inherent defects of supervised methods.
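The guidance step above can be sketched as follows. This is a minimal NumPy illustration: the identity extrinsics, the synthetic semantic map and the nearest-pixel category lookup are all assumptions rather than the embodiment's exact procedure:

```python
import numpy as np

def reproject(pts_uv, depth, Kmat, T):
    """Map reference-view pixels to matching-view pixels via the depth map.

    pts_uv: (N, 2) integer pixel coordinates of reference points
    depth:  (H, W) initial depth map of the reference view
    Kmat:   (3, 3) camera intrinsics; T: (4, 4) extrinsics (ref -> match)
    """
    Kinv = np.linalg.inv(Kmat)
    out = []
    for u, v in pts_uv:
        d = depth[v, u]
        p = np.array([u, v, 1.0])
        X = d * (Kinv @ p)                  # back-project to a 3-D point
        Xm = (T @ np.append(X, 1.0))[:3]    # transform into the matching view
        pm = Kmat @ Xm                      # project into the matching image
        out.append(pm[:2] / pm[2])          # perspective divide
    return np.array(out)

def semantic_loss(sem_ref, sem_match, pts_uv, matches, mask):
    """Masked fraction of point pairs whose semantic categories disagree."""
    diffs = []
    for (u, v), (mu, mv), m in zip(pts_uv, np.rint(matches).astype(int), mask):
        diffs.append(m * float(sem_ref[v, u] != sem_match[mv, mu]))
    return sum(diffs) / len(diffs)

# With identity extrinsics and identical semantic maps the loss vanishes,
# i.e. a correct depth map incurs no semantic penalty.
H, W = 16, 16
Kmat = np.array([[20.0, 0, 8.0], [0, 20.0, 8.0], [0, 0, 1.0]])
depth = np.full((H, W), 2.0)
sem = np.arange(H * W).reshape(H, W) % 5    # synthetic semantic categories
pts = np.array([[4, 4], [8, 8], [12, 6]])
matches = reproject(pts, depth, Kmat, np.eye(4))
loss = semantic_loss(sem, sem, pts, matches, mask=np.ones(len(pts)))
print(loss)  # 0.0
```

A wrong depth value would shift the reprojected point onto a pixel with a different category, raising the loss; gradient-based correction of the depth map drives this loss down.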
In this embodiment, note that the semantics of an image can be divided into three layers: a visual layer, an object layer and a concept layer. The semantics of the visual layer include colors, lines, contours and the like; the semantics of the object layer include the various objects; and the semantics of the concept layer concern the understanding of the scene. In the prior art, some three-dimensional reconstruction methods also use semantic guidance, but single-scale high-level abstract semantic information (the object layer) gives good precision only on reconstruction tasks for large-scale objects; on small-scale reconstruction tasks it is relatively coarse and the reconstruction precision is poor.
Therefore, in the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on semantic segmentation sets and initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained.
And S500, constructing a point cloud set with various scales based on the depth maps with various scales.
Specifically, for the depth map of each scale, the point cloud set of that scale is constructed by the following expressions:

    x = u · z / f_x,   y = v · z / f_y,   z = D(u, v)

where u represents the abscissa of the depth map, v represents the ordinate of the depth map, f_x and f_y represent the camera focal lengths obtained from the camera parameters, D(u, v) is the depth value at (u, v), and x, y and z represent the point cloud coordinates produced by the transformation.
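The depth-map-to-point-cloud conversion can be sketched as follows. This is a minimal NumPy version of pinhole back-projection; treating the principal point (cx, cy) as an extra parameter defaulting to the origin is an assumption made to match the symbols listed above:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx=0.0, cy=0.0):
    """Convert an (H, W) depth map to an (H*W, 3) point cloud.

    u, v are the depth-map abscissa and ordinate; fx, fy are the camera
    focal lengths obtained from the camera parameters; cx, cy (principal
    point) default to 0 so only the symbols named in the text are active.
    """
    Hd, Wd = depth.shape
    v, u = np.mgrid[0:Hd, 0:Wd]          # pixel coordinate grids
    z = depth                            # z equals the depth value D(u, v)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)             # flat synthetic depth map at z = 2
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0)
print(cloud.shape)  # (16, 3)
```

Each pixel becomes one 3-D point, so a depth map per scale yields one point cloud per scale, ready for the scale-dependent radius filtering of step S600.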
And S600, according to the scale of the point cloud set, optimizing the point cloud sets with various scales by adopting different radius filtering to obtain the optimized point cloud set.
Specifically, a point cloud set with multiple scales is obtained, and the point cloud in the point cloud set with each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in each point cloud set according to the scale of the point cloud set by the following formula:

    r_l = a · t^l

where r_l represents the radius corresponding to the point clouds in point cloud sets of different scales, a represents a constant parameter, t represents a constant parameter, and l represents the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius size corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud sets.
In this embodiment, for the point cloud sets of different scales, radius filtering is required after the depth map conversion to filter out noise points and optimize the point cloud data. Because the aggregation degree of the point clouds differs between scales, different filtering radii are adopted for point cloud sets of different scales. In radius filtering, the radius corresponding to each point cloud and a preset number of neighboring points are first obtained; only points that have a sufficient number of neighboring points within the radius are retained, and the remaining points are filtered out. For the multi-scale point cloud sets of this embodiment, the semantic category of each point in the segmentation set also needs to be considered: a point is retained only if it has n neighboring points of the same semantic category within the radius.
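The semantic-aware radius filter described above can be sketched as follows. This brute-force NumPy version, including the Euclidean distance metric and the example thresholds, is an illustration rather than the embodiment's exact implementation:

```python
import numpy as np

def radius_filter(points, labels, radius, n_min):
    """Keep points having at least n_min neighbors of the SAME semantic
    category within the given radius; filter the rest out as noise.

    points: (N, 3) point cloud coordinates
    labels: (N,) semantic category of each point
    """
    keep = np.zeros(len(points), dtype=bool)
    for i, (p, s) in enumerate(zip(points, labels)):
        d = np.linalg.norm(points - p, axis=1)
        # Neighbors within the radius, excluding the point itself,
        # restricted to the same semantic category.
        same = (d > 0) & (d <= radius) & (labels == s)
        keep[i] = same.sum() >= n_min
    return points[keep], labels[keep]

# A tight cluster of one category survives; an isolated point is removed.
cluster = np.array([[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0.1, 0.1, 0]])
outlier = np.array([[5.0, 5.0, 5.0]])
pts = np.vstack([cluster, outlier])
lab = np.array([1, 1, 1, 1, 1])
kept, kept_lab = radius_filter(pts, lab, radius=0.5, n_min=2)
print(len(kept))  # 4: the outlier is filtered out
```

In practice a spatial index (e.g. a k-d tree) would replace the O(N²) distance loop, and the radius would come from the scale-dependent formula of this step.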
And S700, reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales.
Specifically, in step S600, point cloud sets of different scales are optimized to obtain point cloud sets optimized in different scales, and the point cloud sets optimized in each scale are reconstructed to obtain three-dimensional reconstruction results in different scales.
And step S800, splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
Specifically, the three-dimensional reconstruction results of each scale are spliced and fused to obtain the final three-dimensional reconstruction result. In this embodiment, through the step S700, the reconstruction of different scales is performed based on the optimized point cloud set, and the optimized point cloud set is more accurate, so that the final three-dimensional reconstruction result obtained in this embodiment is also more accurate.
In the embodiment, a plurality of multi-view images are obtained, and multi-scale semantic feature extraction is performed on the plurality of multi-view images to obtain feature maps of various scales; performing multi-scale semantic segmentation on the feature maps of various scales to obtain semantic segmentation sets of various scales; in the embodiment, the deep-level features can be extracted by performing multi-scale semantic feature extraction on a plurality of multi-view images, and feature maps of various scales can be obtained. And multi-scale semantic segmentation is carried out on the feature maps of various scales, and semantic information of each scale is aggregated, so that the semantic information of each scale is enriched. In the embodiment, a plurality of multi-view images are used as input, and an initial depth map is obtained through a supervised three-dimensional reconstruction method; obtaining depth maps of various scales based on the semantic segmentation sets and the initial depth maps of various scales; in the embodiment, semantic guidance is respectively performed on the initial depth map by utilizing semantic information of each scale in a semantic segmentation set of multiple scales, so that the initial depth map is continuously corrected, and an accurate depth map of multiple scales is obtained. The method comprises the steps of constructing a point cloud set with various scales based on depth maps with various scales; according to the scale of the point cloud set, optimizing the point cloud sets of various scales by adopting different radius filtering to obtain the optimized point cloud set; reconstructing at different scales based on the optimized point cloud set to obtain reconstruction results at different scales; and splicing and fusing the reconstruction results of each scale to obtain a final reconstruction result. 
In this embodiment, the obtained depth maps of multiple scales are used to construct a point cloud set of multiple scales, different radius filtering is adopted for optimization according to the scales of the point cloud set, the optimized point cloud set is used for reconstruction of different scales, and then the reconstruction results are fused to obtain a more accurate reconstruction result. According to the embodiment, semantic information of each scale can be fully utilized, and the accuracy of three-dimensional reconstruction can be improved.
Referring to fig. 5, an embodiment of the present invention provides a deep learning-based multi-view three-dimensional reconstruction system, which includes a feature map obtaining unit 100, a semantic segmentation set obtaining unit 200, an initial depth map obtaining unit 300, a depth map obtaining unit 400, a point cloud set obtaining unit 500, a radius filtering unit 600, a reconstruction result obtaining unit 700, and a reconstruction result fusion unit 800, where:
the feature map acquiring unit 100 is configured to acquire a multi-view image, perform multi-scale semantic feature extraction on the multi-view image, and acquire feature maps of multiple scales;
a semantic division set obtaining unit 200, configured to perform multi-scale semantic division on feature maps of multiple scales to obtain a semantic division set of multiple scales;
an initial depth map obtaining unit 300, configured to reconstruct the multiple multi-view images by using a supervised three-dimensional reconstruction method, so as to obtain an initial depth map;
a depth map obtaining unit 400, configured to obtain depth maps of multiple scales based on the multiple-scale semantic segmentation sets and the initial depth map;
a point cloud set obtaining unit 500, configured to construct a point cloud set of multiple scales based on depth maps of multiple scales;
the radius filtering unit 600 is configured to optimize point cloud sets of multiple scales by using different radius filtering according to the scale of the point cloud set, so as to obtain an optimized point cloud set;
a reconstruction result obtaining unit 700, configured to perform reconstruction of different scales based on the optimized point cloud set, so as to obtain three-dimensional reconstruction results of different scales;
and a reconstruction result fusion unit 800, configured to splice and fuse the reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
It should be noted that, since the multi-view three-dimensional reconstruction system based on deep learning in the present embodiment is based on the same inventive concept as the above-mentioned multi-view three-dimensional reconstruction method based on deep learning, the corresponding contents in the method embodiments are also applicable to the present system embodiment, and are not described in detail herein.
The embodiment of the invention also provides a multi-view three-dimensional reconstruction device based on deep learning, which comprises: at least one control processor and a memory for communicative connection with the at least one control processor.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer-executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software program and instructions required to implement a deep learning based multi-view three-dimensional reconstruction method of the above embodiments are stored in a memory, and when executed by a processor, perform the deep learning based multi-view three-dimensional reconstruction method of the above embodiments, for example, perform the above-described method steps S100 to S800 in fig. 1.
The above described system embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when executed by one or more control processors, cause the one or more control processors to perform the deep learning-based multi-view three-dimensional reconstruction method of the above method embodiment, for example, to perform the functions of the above-described method steps S100 to S800 in fig. 1.
Through the above description of the embodiments, those skilled in the art will clearly understand that the embodiments may be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims (10)

1. A multi-view three-dimensional reconstruction method based on deep learning is characterized by comprising the following steps:
acquiring a plurality of multi-view images, and performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of various scales;
performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
reconstructing the multiple multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
obtaining depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
constructing a point cloud set with various scales based on the depth maps with various scales;
according to the scale of the point cloud set, different radius filtering is adopted for the point cloud sets with various scales to carry out optimization, and the optimized point cloud set is obtained;
reconstructing at different scales based on the optimized point cloud set to obtain three-dimensional reconstruction results at different scales;
and splicing and fusing the three-dimensional reconstruction results of each scale to obtain a final three-dimensional reconstruction result.
2. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein performing multi-scale semantic feature extraction on the plurality of multi-view images to obtain the feature maps of multiple scales comprises:
performing multilayer feature extraction on the plurality of multi-view images through a ResNet network to obtain original feature maps of multiple scales;
and connecting a channel attention module to the original feature map of each scale, so as to weight the importance of the original feature map of each scale through a channel attention mechanism and obtain the feature maps of multiple scales.
3. The deep learning-based multi-view three-dimensional reconstruction method according to claim 2, wherein weighting the importance of the original feature map of each scale through the channel attention mechanism to obtain the feature maps of multiple scales comprises:
compressing the original feature map of each scale through a compression network to obtain a one-dimensional feature map corresponding to the original feature map of each scale;
inputting the one-dimensional feature map into a fully connected layer through an excitation network for importance prediction, so as to obtain the importance of each channel;
and applying, through an excitation function, the importance of each channel to the one-dimensional feature map of the original feature map of each scale to obtain the feature maps of multiple scales.
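The compression/excitation steps above follow the well-known squeeze-and-excitation pattern. A minimal NumPy sketch, in which the weight matrices `w1` and `w2`, the reduction ratio, and the sigmoid gating are illustrative assumptions rather than details fixed by the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite(feat, w1, w2):
    """Weight the channels of one scale's feature map (C, H, W) by importance.
    w1: (C//r, C) squeeze weights; w2: (C, C//r) excite weights (r = reduction)."""
    c = feat.shape[0]
    squeezed = feat.reshape(c, -1).mean(axis=1)   # "compression": global average pool
    hidden = np.maximum(w1 @ squeezed, 0.0)       # fully connected layer + ReLU
    importance = sigmoid(w2 @ hidden)             # per-channel importance in (0, 1)
    return feat * importance[:, None, None]       # excitation: rescale each channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))             # toy 8-channel feature map
w1 = 0.1 * rng.standard_normal((2, 8))
w2 = 0.1 * rng.standard_normal((8, 2))
out = squeeze_excite(feat, w1, w2)
```

Each channel is multiplied by a single learned scalar in (0, 1), so the feature map's shape is unchanged while informative channels are emphasized.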
4. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein the performing multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales comprises:
clustering the feature maps of multiple scales through non-negative matrix factorization to obtain the semantic segmentation sets of multiple scales; wherein the non-negative matrix factorization is expressed as:

min_{P≥0, Q≥0} ||V − PQ||_F²

wherein the feature maps of multiple scales are mapped, concatenated and reshaped into a matrix V with HW rows and C columns; P denotes the coefficient matrix with HW rows and K columns, and Q denotes the basis matrix with K rows and C columns; H and W denote the height and width of the feature map, so that HW is the number of pixels; K denotes the rank of the factorization, equal to the number of semantic clusters; C denotes the feature dimension of each pixel; and the subscript F denotes that the Frobenius (non-induced) norm is adopted.
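A factorization of this form can be computed, for illustration, with Lee-Seung multiplicative updates; assigning each pixel (row of V) to the cluster with the largest coefficient in P then yields a semantic grouping. The update scheme and cluster-assignment rule are assumptions of this sketch, not details fixed by the claim:

```python
import numpy as np

def nmf_segment(V, K, iters=200, eps=1e-9):
    """Approximate V (HW x C) ~= P (HW x K) @ Q (K x C) with P, Q >= 0,
    by Lee-Seung multiplicative updates minimizing ||V - PQ||_F^2."""
    hw, c = V.shape
    rng = np.random.default_rng(0)
    P = rng.random((hw, K)) + eps
    Q = rng.random((K, c)) + eps
    for _ in range(iters):
        Q *= (P.T @ V) / (P.T @ P @ Q + eps)   # update basis rows
        P *= (V @ Q.T) / (P @ Q @ Q.T + eps)   # update per-pixel coefficients
    labels = P.argmax(axis=1)                  # semantic cluster of each pixel
    return P, Q, labels

# toy "feature map": 6 pixels with 4-dimensional non-negative features, 2 clusters
V = np.abs(np.random.default_rng(1).standard_normal((6, 4)))
P, Q, labels = nmf_segment(V, K=2)
```

The multiplicative updates keep P and Q non-negative by construction, which is why NMF is a natural fit for clustering pixels into semantic groups.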
5. The method according to claim 1, wherein obtaining the depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map comprises:
selecting any one of the plurality of multi-view images as a reference image, and taking the remaining images as images to be matched;
selecting reference points from the reference image, acquiring the semantic category corresponding to each reference point in the semantic segmentation set, and acquiring the depth value corresponding to each reference point on the initial depth map;
the number of reference points is chosen by the following formula:

N_j = (HW / t) · K_j / Σ_i K_i

wherein N_j denotes the number of reference points selected for the j-th segmentation set; H denotes the height and W the width of the multi-view image, so that HW is its number of pixels; t denotes a constant parameter; K_j denotes the number of semantic categories contained in the j-th semantic segmentation set; and K_i denotes the number of semantic categories contained in the i-th semantic segmentation set;
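Read as allocating a fixed pixel budget across segmentation sets in proportion to their category counts, the selection rule can be sketched as follows; the proportional form and the default constant `t=4` are assumptions of this sketch:

```python
def num_reference_points(category_counts, j, hw, t=4):
    """Hypothetical budget: N_j = (hw / t) * K_j / sum_i K_i,
    i.e. segmentation sets with more semantic categories receive
    proportionally more reference points out of the hw/t total."""
    return int(hw / t * category_counts[j] / sum(category_counts))

# e.g. three segmentation sets with 2, 3 and 5 categories on a 1000-pixel image
n1 = num_reference_points([2, 3, 5], j=1, hw=1000)  # set 1 gets 3/10 of the budget
```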
based on each reference point, the matching point of each reference point on the image to be matched is obtained through the following formula:

p'_i = K · T · ( D(P_i) · K⁻¹ · P_i )

wherein p'_i denotes the matching point of the i-th reference point on the image to be matched; K denotes the intrinsic parameters of the camera; T denotes the extrinsic parameters of the camera; and D(P_i) denotes the depth value, on the initial depth map, corresponding to the reference point P_i in the reference image;
obtaining the semantic category corresponding to each matching point, and correcting the multi-view image of each scale by minimizing a semantic loss function to obtain the depth maps of multiple scales, wherein the semantic loss function L_sem is calculated as follows:

L_sem = (1/N) · Σ_{i=1}^{N} M_i · E_i

wherein E_i denotes the difference between the semantic information of the i-th reference point and that of the i-th matching point; M_i denotes a mask; and N denotes the number of reference points.
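The projection of a reference pixel into the image to be matched, and the masked average that such a semantic loss takes, can be sketched as follows. Treating T as a 4×4 reference-to-source transform and E_i as a precomputed per-point semantic difference are assumptions of this sketch:

```python
import numpy as np

def reproject(p_ref, depth, K_cam, T):
    """Map a reference pixel (u, v) with known depth into the other view:
    back-project with K^-1, move to the other camera frame with T, project with K."""
    uv1 = np.array([p_ref[0], p_ref[1], 1.0])
    X_ref = depth * (np.linalg.inv(K_cam) @ uv1)   # 3D point in reference frame
    X_src = T[:3, :3] @ X_ref + T[:3, 3]           # 3D point in source frame
    uvw = K_cam @ X_src
    return uvw[:2] / uvw[2]                        # pixel in the image to be matched

def semantic_loss(E, M):
    """Masked mean of per-point semantic differences: (1/N) * sum_i M_i * E_i."""
    return float((M * E).sum() / len(E))

K_cam = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
p = reproject((100.0, 80.0), 2.5, K_cam, np.eye(4))  # identity motion: same pixel
```

With an identity extrinsic the reprojected pixel coincides with the reference pixel, which is a quick sanity check on the geometry.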
6. The method for multi-view three-dimensional reconstruction based on deep learning according to claim 5, wherein the constructing a multi-scale point cloud set based on the multi-scale depth maps comprises:
constructing the point cloud set of each scale from the depth map of that scale through the following expressions:

x = (u − c_x) · z / f_x,  y = (v − c_y) · z / f_y,  z = D(u, v)

wherein u denotes the abscissa and v the ordinate of the depth map; f_x and f_y denote the camera focal lengths obtained from the camera parameters; (c_x, c_y) denotes the principal point obtained from the camera parameters; D(u, v) denotes the depth value at pixel (u, v); and x, y and z denote the point cloud coordinates of the point cloud transformation.
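The pinhole back-projection of a depth map into a point cloud can be sketched as follows; the explicit principal point (c_x, c_y), taken here from the camera intrinsics, is an assumption of this sketch:

```python
import numpy as np

def depth_to_cloud(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) into a point cloud (H*W, 3) with the pinhole
    model: x = (u - cx) * z / fx, y = (v - cy) * z / fy, z = depth(v, u)."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # v: row (ordinate), u: column (abscissa)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# unit-depth 2x3 map with a trivial camera (fx = fy = 1, principal point at 0)
cloud = depth_to_cloud(np.ones((2, 3)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```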
7. The deep learning-based multi-view three-dimensional reconstruction method according to claim 1, wherein optimizing the point cloud sets of multiple scales with radius filtering of different radii according to the scales of the point cloud sets to obtain the optimized point cloud sets comprises:
acquiring the point cloud sets of multiple scales, wherein the point cloud in the point cloud set of each scale has a corresponding radius and a preset number of adjacent points;
calculating the radius corresponding to the point clouds in each point cloud set, according to the scale of the point cloud set, by the following formula:

r_l = α · t^l

wherein r_l denotes the radius corresponding to the point clouds in the point cloud sets of different scales; α denotes a constant parameter; t denotes a constant parameter; and l denotes the preset scale grade of each point cloud set;
and optimizing the point cloud sets with various scales according to the radius corresponding to each point cloud and the preset number of adjacent points to obtain the optimized point cloud set.
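A brute-force sketch of scale-dependent radius outlier removal: the radius grows with the preset scale grade, and a point is kept only if enough neighbours fall inside it. The exponential form r = α·t^l, the default constants, and the O(n²) distance computation are assumptions of this sketch:

```python
import numpy as np

def radius_filter(points, level, alpha=0.5, t=2.0, min_neighbors=3):
    """Keep only points with >= min_neighbors other points within
    radius r = alpha * t**level (coarser scales use larger radii)."""
    r = alpha * t ** level
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances
    counts = (dist < r).sum(axis=1) - 1              # exclude the point itself
    return points[counts >= min_neighbors]

rng = np.random.default_rng(2)
cluster = rng.random((10, 3)) * 0.1                  # tight cluster near the origin
outlier = np.array([[10.0, 10.0, 10.0]])             # isolated point, no neighbours
kept = radius_filter(np.vstack([cluster, outlier]), level=0)
```

For large clouds the pairwise-distance matrix would be replaced by a spatial index (e.g. a k-d tree), but the filtering criterion is the same.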
8. A deep learning-based multi-view three-dimensional reconstruction system, characterized in that the deep learning-based multi-view three-dimensional reconstruction system comprises:
a feature map acquisition unit, configured to acquire a plurality of multi-view images and perform multi-scale semantic feature extraction on the plurality of multi-view images to obtain feature maps of multiple scales;
a semantic segmentation set acquisition unit, configured to perform multi-scale semantic segmentation on the feature maps of multiple scales to obtain semantic segmentation sets of multiple scales;
an initial depth map acquisition unit, configured to reconstruct the plurality of multi-view images by a supervised three-dimensional reconstruction method to obtain an initial depth map;
a depth map acquisition unit, configured to obtain depth maps of multiple scales based on the semantic segmentation sets of multiple scales and the initial depth map;
a point cloud set acquisition unit, configured to construct point cloud sets of multiple scales based on the depth maps of multiple scales;
a radius filtering unit, configured to optimize the point cloud sets of multiple scales with radius filtering of different radii according to the scales of the point cloud sets, to obtain optimized point cloud sets;
a reconstruction result acquisition unit, configured to perform reconstruction at different scales based on the optimized point cloud sets to obtain three-dimensional reconstruction results at different scales;
and a reconstruction result fusion unit, configured to splice and fuse the reconstruction results of the respective scales to obtain a final three-dimensional reconstruction result.
9. A deep learning-based multi-view three-dimensional reconstruction device, comprising at least one control processor and a memory communicatively connected to the at least one control processor; wherein the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the deep learning-based multi-view three-dimensional reconstruction method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the method of deep learning based multi-view three-dimensional reconstruction according to any one of claims 1 to 7.
CN202211087276.9A 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning Active CN115170746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211087276.9A CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211087276.9A CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Publications (2)

Publication Number Publication Date
CN115170746A true CN115170746A (en) 2022-10-11
CN115170746B CN115170746B (en) 2022-11-22

Family

ID=83481918

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211087276.9A Active CN115170746B (en) 2022-09-07 2022-09-07 Multi-view three-dimensional reconstruction method, system and equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN115170746B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457101A (en) * 2022-11-10 2022-12-09 武汉图科智能科技有限公司 Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715504A (en) * 2015-02-12 2015-06-17 四川大学 Robust large-scene dense three-dimensional reconstruction method
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
CN108388639A (en) * 2018-02-26 2018-08-10 武汉科技大学 A kind of cross-media retrieval method based on sub-space learning Yu semi-supervised regularization
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
CN111340186A (en) * 2020-02-17 2020-06-26 之江实验室 Compressed representation learning method based on tensor decomposition
US20200364876A1 (en) * 2019-05-17 2020-11-19 Magic Leap, Inc. Methods and apparatuses for corner detection using neural network and corner detector
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
US20210150726A1 (en) * 2019-11-14 2021-05-20 Samsung Electronics Co., Ltd. Image processing apparatus and method
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN113673400A (en) * 2021-08-12 2021-11-19 土豆数据科技集团有限公司 Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium
CN114677479A (en) * 2022-04-13 2022-06-28 温州大学大数据与信息技术研究院 Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN114881867A (en) * 2022-03-24 2022-08-09 山西三友和智慧信息技术股份有限公司 Image denoising method based on deep learning

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104715504A (en) * 2015-02-12 2015-06-17 四川大学 Robust large-scene dense three-dimensional reconstruction method
CN106157307A (en) * 2016-06-27 2016-11-23 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
WO2018000752A1 (en) * 2016-06-27 2018-01-04 浙江工商大学 Monocular image depth estimation method based on multi-scale cnn and continuous crf
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
CN108388639A (en) * 2018-02-26 2018-08-10 武汉科技大学 A kind of cross-media retrieval method based on sub-space learning Yu semi-supervised regularization
US20200364876A1 (en) * 2019-05-17 2020-11-19 Magic Leap, Inc. Methods and apparatuses for corner detection using neural network and corner detector
US20210150726A1 (en) * 2019-11-14 2021-05-20 Samsung Electronics Co., Ltd. Image processing apparatus and method
CN111340186A (en) * 2020-02-17 2020-06-26 之江实验室 Compressed representation learning method based on tensor decomposition
CN112734915A (en) * 2021-01-19 2021-04-30 北京工业大学 Multi-view stereoscopic vision three-dimensional scene reconstruction method based on deep learning
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system
CN113673400A (en) * 2021-08-12 2021-11-19 土豆数据科技集团有限公司 Real scene three-dimensional semantic reconstruction method and device based on deep learning and storage medium
CN114881867A (en) * 2022-03-24 2022-08-09 山西三友和智慧信息技术股份有限公司 Image denoising method based on deep learning
CN114677479A (en) * 2022-04-13 2022-06-28 温州大学大数据与信息技术研究院 Natural landscape multi-view three-dimensional reconstruction method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HITRJJ: "PointMVS: a point-cloud reconstruction model for multi-view scenes", 《HTTPS://BLOG.CSDN.NET/U014636245/ARTICLE/DETAILS/104354289》 *
LIU F等: "Learning depth from single monocular images using deep convolutional neural fields", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS&MACHINE INTELLIGENCE》 *
LIAO Xuan et al.: "Multi-image object semantic segmentation fused with segmentation priors", Journal of Image and Graphics *
WANG Quande et al.: "Monocular image depth estimation based on multi-scale feature fusion", Journal of Huazhong University of Science and Technology (Natural Science Edition) *


Also Published As

Publication number Publication date
CN115170746B (en) 2022-11-22

Similar Documents

Publication Publication Date Title
WO2022088676A1 (en) Three-dimensional point cloud semantic segmentation method and apparatus, and device and medium
CN110188795B (en) Image classification method, data processing method and device
CN111340738B (en) Image rain removing method based on multi-scale progressive fusion
CN114255238A (en) Three-dimensional point cloud scene segmentation method and system fusing image features
CN109993707B (en) Image denoising method and device
CN110473137A (en) Image processing method and device
CN110222718B (en) Image processing method and device
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
US11875424B2 (en) Point cloud data processing method and device, computer device, and storage medium
CN114418030A (en) Image classification method, and training method and device of image classification model
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN115082885A (en) Point cloud target detection method, device, equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN114219855A (en) Point cloud normal vector estimation method and device, computer equipment and storage medium
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
CN115170746B (en) Multi-view three-dimensional reconstruction method, system and equipment based on deep learning
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN113066018A (en) Image enhancement method and related device
CN110705564B (en) Image recognition method and device
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN117237623B (en) Semantic segmentation method and system for remote sensing image of unmanned aerial vehicle
CN113313176A (en) Point cloud analysis method based on dynamic graph convolution neural network
CN111667495A (en) Image scene analysis method and device
WO2021057091A1 (en) Viewpoint image processing method and related device
CN112270701A (en) Packet distance network-based parallax prediction method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant