CN117078851A - Single-view three-dimensional point cloud reconstruction method - Google Patents
- Publication number
- CN117078851A (application number CN202311030758.5A)
- Authority
- CN
- China
- Prior art keywords
- depth
- point cloud
- features
- dimensional point
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention discloses a single-view three-dimensional point cloud reconstruction method, which comprises the following steps: constructing a depth map acquisition module with the same network structure as a high-quality monocular depth estimation network, which performs depth estimation on the input color image I to obtain the corresponding depth map I_D; constructing a depth image feature learning module, which takes the depth map estimated from the input color image I as input and mines depth information for assisting the three-dimensional point cloud reconstruction process; designing a color-depth information fusion module, which takes the M levels of image features {R_m}_{m=1}^{M} extracted by the color image feature learning module and the M levels of depth features {D_m}_{m=1}^{M} extracted by the depth feature learning module as input, and fuses depth features and image features at each level with a shallow-to-deep multi-stage fusion strategy; and constructing a three-dimensional point cloud reconstruction module, which takes the learned M-th-level RGB-D fusion feature F̂_M as input and realizes the reconstruction of the three-dimensional point cloud by mapping the feature to 3D space.
Description
Technical Field
The invention relates to the field of deep learning and three-dimensional point cloud reconstruction, in particular to a single-view three-dimensional point cloud reconstruction method.
Background
With the rapid development of three-dimensional vision, generative artificial intelligence, and related fields, three-dimensional point cloud reconstruction technology has been widely applied in scenarios such as autonomous driving, virtual reality, and robot navigation. As an important branch of three-dimensional point cloud reconstruction research, single-view three-dimensional point cloud reconstruction aims to infer the three-dimensional geometry and structure of an object from a single two-dimensional view, and is receiving more and more attention from researchers.
In recent years, owing to the strong feature learning capability of deep neural networks, single-view three-dimensional point cloud reconstruction based on deep learning has become the mainstream research direction in this field. Fan et al. proposed PSGN (point set generation network), the first network to reconstruct a three-dimensional point cloud with a deep learning method; it reconstructs with an encoder-decoder that introduces an hourglass structure, has a high degree of flexibility, combines global and local information well, and excels at reconstructing objects with complex structures. Mandikal et al. proposed a reconstruction network combining image encoding and point cloud encoding, which first trains a point cloud autoencoder to learn a latent representation space of the point cloud, then maps the two-dimensional image into that latent space with an image encoder, and finally achieves single-view point cloud reconstruction through the image encoder and the point cloud decoder. Mandikal et al. also proposed a location-aware segmentation loss to better constrain the network to generate the three-dimensional point cloud. Further, Jiang et al. proposed a geometric adversarial loss that regularizes the reconstructed point cloud as a whole by maintaining geometric consistency between the predicted and real point clouds across different viewpoints.
The above methods have made certain research progress in the field of single-view three-dimensional point cloud reconstruction. However, a single view lacks an expression of object depth, so a three-dimensional point cloud reconstructed from a single view alone is often insufficiently accurate. Therefore, how to acquire depth information to assist the reconstruction process and further improve the reconstruction quality of the three-dimensional point cloud is of significant research interest.
Disclosure of Invention
In order to alleviate the problem that a single view expresses object depth information insufficiently, the invention provides a single-view three-dimensional point cloud reconstruction method, which takes a depth map estimated from a single color image as an aid to supplement the necessary depth information in the reconstruction process, thereby realizing higher-quality three-dimensional point cloud reconstruction, as described in detail below:
a single view three-dimensional point cloud reconstruction method, the method comprising:
constructing a color image feature learning module, which takes a single color image I as input and mines the geometric features and semantic features of the input view;
constructing a depth map acquisition module with the same network structure as a high-quality monocular depth estimation network, which performs depth estimation on the input color image I to obtain the corresponding depth map I_D;
Constructing a depth image feature learning module, taking a depth image estimated from an input color image I as input, and mining depth information for assisting a three-dimensional point cloud reconstruction process;
designing a color-depth information fusion module, which takes the M levels of image features {R_m}_{m=1}^{M} extracted by the color image feature learning module and the M levels of depth features {D_m}_{m=1}^{M} extracted by the depth feature learning module as input, and fuses depth features and image features at each level with a shallow-to-deep multi-stage fusion strategy;

constructing a three-dimensional point cloud reconstruction module, which takes the learned M-th-level RGB-D fusion feature F̂_M as input and realizes the reconstruction of the three-dimensional point cloud by mapping the feature to 3D space.
The depth map acquisition module is as follows:
mapping from three-dimensional space to depth space by using the real three-dimensional point cloud data and camera parameters in the ShapeNet dataset, and converting the three-dimensional point cloud into a depth map so as to generate depth map pseudo-labels;

the input color images and the corresponding depth map pseudo-labels are input in pairs into the depth map acquisition module for supervised fine-tuning training;

the input color image I is sent into the module for depth estimation to obtain the depth map I_D corresponding to I.
Wherein, the color-depth information fusion module is:
R_m, D_m, and the RGB-D feature F̂_{m-1} obtained by feature fusion at the (m-1)-th level are sent into a concatenation layer to be concatenated along the channel dimension, and the concatenated features are adaptively fused through a convolution layer to obtain the preliminarily fused m-th-level RGB-D feature F_m:

F_m = Conv([R_m, D_m, F̂_{m-1}])

wherein [·] represents the concatenation operation and Conv(·) represents a convolution operation with a 3×3 kernel;

F_m is sent into an attention unit for feature selection, the attention unit consisting of a global pooling layer, a fully connected layer, and a Sigmoid layer; the global pooling layer is employed to aggregate the global spatial information of F_m into a feature descriptor:

z_m = (1 / (h_m · w_m)) Σ_{i=1}^{h_m} Σ_{j=1}^{w_m} F_m(i, j)

wherein z_m represents the aggregated feature, and h_m, w_m, and c_m respectively denote the height, width, and number of channels of the feature F_m;

based on z_m, the fully connected layer is adopted to capture the interdependence among feature channels and obtain the attention map of F_m, and the channel values of the obtained attention map are normalized to (0, 1) by the Sigmoid layer; the attention map A_m finally learned by the attention unit is expressed as:

A_m = Sigmoid(FC(z_m))

wherein FC(·) represents the mapping of the fully connected layer and Sigmoid(·) represents the normalization realized by the Sigmoid function;

based on the obtained attention map A_m, the channel dimension of F_m is weighted, and the attention-enhanced RGB-D fusion feature F'_m is output:

F'_m = f_scale(F_m, A_m)

wherein f_scale(·) denotes the channel-wise multiplication between the attention map and the original feature;

an adaptive block composed of several convolution layers is adopted for adaptive learning, finally obtaining the m-th-level RGB-D fusion feature F̂_m:

F̂_m = Conv_n(F'_m)

the color-depth information fusion module performs M levels of feature fusion from shallow to deep and finally takes the M-th-level RGB-D fusion feature F̂_M as output; this feature simultaneously contains image geometric information, semantic information, and depth information.
The technical scheme provided by the invention has the beneficial effects that:
1. The invention provides a single-view three-dimensional point cloud reconstruction method, which uses a depth map estimated from the input color image as an aid to obtain the necessary depth information in the reconstruction process, effectively alleviating the problem that a single view expresses object depth information insufficiently and improving the quality of single-view three-dimensional point cloud reconstruction;

2. The invention constructs a depth map acquisition module, which generates depth map pseudo-labels by mapping real three-dimensional point cloud data, so as to constrain the fine-tuning training of the depth estimation process, thereby estimating a more accurate depth map from the input color image and further mining more effective depth information;

3. The invention designs a color-depth information fusion module, which adopts a shallow-to-deep multi-stage fusion strategy to fully aggregate the color information learned from the input view and the depth information mined from the depth map, thereby capturing RGB-D fusion features more beneficial to the reconstruction process.
Drawings
Fig. 1 is a flowchart of a single view three-dimensional point cloud reconstruction method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The following describes a specific embodiment of a single-view three-dimensional point cloud reconstruction method according to the present invention by way of example.
1. Building color image feature learning module
First, a color image feature learning module is constructed. It takes a single color image I as input and aims to mine the geometric features and semantic features of the input view. Specifically, the module extracts image features at M levels in total, {R_m}_{m=1}^{M}, where R_m represents the learned image feature of the m-th level; M is set to 5.

The feature extraction process is formulated as:

{R_m}_{m=1}^{M} = E_C(I)

wherein E_C(·) represents the color image feature learning module, which consists of the M convolution blocks and corresponding pooling layers of the VGG16 network.
The VGG16 network is well known to those skilled in the art and is not described in detail in the embodiments of the present invention.
2. Building depth map acquisition module
The depth map acquisition module is constructed with the same network structure as DenseDepth (a high-quality monocular depth estimation network) and obtains the depth map I_D corresponding to the color image I by performing depth estimation on the input color image I. Specifically, the constructed depth map acquisition module is first pre-trained on the autonomous driving scene dataset KITTI and then fine-tuned on the three-dimensional model dataset ShapeNet to achieve more accurate depth estimation. However, the fine-tuning process requires real depth maps as supervision signals, while the ShapeNet dataset does not provide ground-truth depth map labels corresponding to the color images. To solve this problem, the real three-dimensional point cloud data and camera parameters in the ShapeNet dataset are used to perform a mapping from three-dimensional space to depth space, converting the three-dimensional point cloud into a depth map so as to generate depth map pseudo-labels. Then, the input color images and the corresponding depth map pseudo-labels are fed in pairs into the depth map acquisition module for supervised fine-tuning. After the fine-tuning of the depth map acquisition module is completed, the input color image I is sent into the module for depth estimation to obtain the depth map I_D corresponding to I, which is then used for mining depth information.
The DenseDepth network, the KITTI dataset, and the ShapeNet dataset are all known to those skilled in the art and are not described in detail in the embodiments of the present invention.
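The pseudo-label generation described above — mapping the ground-truth point cloud into depth space with the camera parameters — amounts to a perspective projection with z-buffering. A minimal sketch, in which the intrinsics K, the extrinsics (R, t), and the image size are illustrative assumptions rather than the patent's actual values:

```python
import numpy as np

def points_to_depth_map(points, K, R, t, height, width):
    """points: (N, 3) world coords -> (height, width) depth map, 0 = no hit."""
    cam = points @ R.T + t                 # world -> camera coordinates
    z = cam[:, 2]
    valid = z > 1e-6                       # keep points in front of the camera
    cam, z = cam[valid], z[valid]
    uv = cam @ K.T                         # perspective projection
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    depth = np.zeros((height, width), dtype=np.float64)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        if depth[vi, ui] == 0 or zi < depth[vi, ui]:
            depth[vi, ui] = zi             # z-buffer: nearest surface wins
    return depth

# Toy example: two points on the optical axis; the nearer one should win.
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
depth = points_to_depth_map(np.array([[0.0, 0, 2.0], [0.0, 0, 1.0]]),
                            K, np.eye(3), np.zeros(3), 64, 64)
print(depth[32, 32])  # 1.0 (the nearer point)
```

Real pseudo-label code would also densify or hole-fill the sparse projected map; the patent does not specify that step, so it is omitted here.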
3. Building depth image feature learning module
The depth image feature learning module takes the depth map estimated from the input color image I as input to mine depth information for assisting the three-dimensional point cloud reconstruction process. Specifically, the constructed depth image feature learning module first encodes the estimated single-channel depth map I_D into a three-channel representation and feeds it into a depth feature extractor to mine the depth features of I_D. The three-channel representation comprises: the horizontal disparity, the height above ground, and the angle between the local surface normal at each pixel and the inferred gravity direction.
The feature extractor likewise learns depth features at M levels in total, {D_m}_{m=1}^{M}, where D_m represents the learned depth feature of the m-th level; M is set to 5. The feature extraction process is formulated as:

{D_m}_{m=1}^{M} = E_D(I_D)

wherein E_D(·) represents the depth feature extractor, a VGG16 network retaining 5 convolution blocks and the corresponding pooling layers.
4. Design color-depth information fusion module
In order to more effectively utilize the color image features and the depth features to reconstruct a higher-quality three-dimensional point cloud, a color-depth information fusion module is designed. Specifically, taking the M levels of image features {R_m}_{m=1}^{M} extracted by the color image feature learning module and the M levels of depth features {D_m}_{m=1}^{M} extracted by the depth feature learning module as input, the color-depth information fusion module adopts a shallow-to-deep multi-stage fusion strategy and fuses the depth features and image features at each level.
Taking the m-th-level feature fusion process as an example, R_m, D_m, and the RGB-D feature F̂_{m-1} obtained by feature fusion at the (m-1)-th level are first sent into a concatenation layer to be concatenated along the channel dimension, and the concatenated features are adaptively fused through a convolution layer to obtain the preliminarily fused m-th-level RGB-D feature F_m:

F_m = Conv([R_m, D_m, F̂_{m-1}])

wherein [·] represents the concatenation operation and Conv(·) represents a convolution operation with a 3×3 kernel. In particular, when m = 1, no previous fusion result exists, and F_1 = Conv([R_1, D_1]).
In order to make the fusion process focus on information useful for reconstruction and discard redundant information useless for it, F_m is sent into an attention unit for feature selection. Specifically, the attention unit consists of a global pooling layer, a fully connected layer, and a Sigmoid layer. First, the global pooling layer is employed to aggregate the global spatial information of F_m into a feature descriptor:

z_m = (1 / (h_m · w_m)) Σ_{i=1}^{h_m} Σ_{j=1}^{w_m} F_m(i, j)

wherein z_m represents the aggregated feature, and h_m, w_m, and c_m respectively denote the height, width, and number of channels of the feature F_m.
Second, based on z_m, the fully connected layer is adopted to capture the interdependence among feature channels and obtain the attention map of F_m. Finally, the channel values of the resulting attention map are normalized to (0, 1) by the Sigmoid layer. The attention map A_m finally learned by the attention unit is expressed as:

A_m = Sigmoid(FC(z_m))

wherein FC(·) represents the mapping of the fully connected layer and Sigmoid(·) represents the normalization realized by the Sigmoid function.
Thereafter, based on the obtained attention map A_m, the channel dimension of F_m is weighted, and the attention-enhanced RGB-D fusion feature F'_m is output. This process is formulated as:

F'_m = f_scale(F_m, A_m)

wherein f_scale(·) denotes the channel-wise multiplication between the attention map and the original feature.
After obtaining F'_m, an adaptive block composed of several convolution layers is adopted for adaptive learning, finally yielding the m-th-level RGB-D fusion feature F̂_m:

F̂_m = Conv_n(F'_m)

wherein Conv_n(·) represents the convolution operation of n stacked convolution layers. Specifically, for the 1st-level to the M-th-level fusion, the number of convolution layers n in the adaptive block is set to (2, 3, 3, 3, 1), respectively.
The color-depth information fusion module performs M levels of feature fusion from shallow to deep and finally takes the M-th-level RGB-D fusion feature F̂_M as output; this feature simultaneously contains image geometric information, semantic information, and depth information.
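One fusion level as described above (concatenation, 3×3 convolution, channel attention, adaptive block) can be sketched as follows. The channel widths and the bare convolutions in the adaptive block are simplifying assumptions; the patent does not specify layer widths.

```python
import torch
import torch.nn as nn

class FusionLevel(nn.Module):
    """One level: F_m = Conv([R_m, D_m, F̂_{m-1}]); A_m = Sigmoid(FC(pool(F_m)));
    F'_m = A_m ⊗ F_m; F̂_m = Conv_n(F'_m)."""
    def __init__(self, channels: int, n_adaptive: int):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)
        self.attention = nn.Sequential(          # global pooling + FC + Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels),
            nn.Sigmoid(),
        )
        self.adaptive = nn.Sequential(*[
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(n_adaptive)
        ])

    def forward(self, r_m, d_m, fused_prev):
        f_m = self.fuse(torch.cat([r_m, d_m, fused_prev], dim=1))  # F_m
        a_m = self.attention(f_m)                                  # A_m in (0, 1)
        f_att = f_m * a_m[:, :, None, None]                        # channel re-weighting
        return self.adaptive(f_att)                                # F̂_m

level = FusionLevel(channels=64, n_adaptive=2)
out = level(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56),
            torch.randn(1, 64, 56, 56))
print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Stacking five such levels with n = (2, 3, 3, 3, 1) and feeding each level's output (suitably resized) as the next level's fused_prev mirrors the shallow-to-deep strategy.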
5. Building three-dimensional point cloud reconstruction module
Finally, the three-dimensional point cloud reconstruction module is constructed. Taking the learned M-th-level RGB-D fusion feature F̂_M as input, the module realizes the reconstruction of the three-dimensional point cloud by mapping the feature to 3D space. The reconstruction process is formulated as:

P = REC(F̂_M)

wherein P ∈ R^{N×3} represents the reconstructed three-dimensional point cloud, N denotes the number of spatial points contained in the reconstructed point cloud and is set to 1024, and REC(·) represents the three-dimensional point cloud reconstruction module, which adopts the same structure as the predictor in the classical single-view point cloud reconstruction network PSGN.
The PSGN network is known to those skilled in the art and is not described in detail in the embodiments of the present invention.
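A hedged sketch of the mapping REC(·): flatten the final fused feature and regress the N = 1024 spatial points with fully connected layers, in the spirit of PSGN's point-set predictor. The hidden-layer size and input feature shape are assumptions.

```python
import torch
import torch.nn as nn

class PointCloudPredictor(nn.Module):
    """Maps the fused feature F̂_M to a point set P ∈ R^{N×3}."""
    def __init__(self, feat_dim: int, num_points: int = 1024):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, num_points * 3),
        )

    def forward(self, fused_feature: torch.Tensor) -> torch.Tensor:
        pts = self.mlp(fused_feature)
        return pts.view(-1, self.num_points, 3)   # one N×3 cloud per sample

predictor = PointCloudPredictor(feat_dim=512 * 7 * 7)
cloud = predictor(torch.randn(2, 512, 7, 7))
print(cloud.shape)  # torch.Size([2, 1024, 3])
```

PSGN's actual predictor also has a convolutional branch; this fully connected variant only illustrates the feature-to-point-set mapping the patent describes.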
In order to optimize the reconstruction process of the three-dimensional point cloud, the single-view point cloud reconstruction method provided by the embodiment of the invention adopts the Chamfer Distance (CD) loss to constrain the distance between the reconstructed point cloud and the real point cloud. The CD loss is expressed as:

L_CD = Σ_{p_i ∈ P_g} min_{p_j ∈ P_t} ‖p_i − p_j‖₂² + Σ_{p_j ∈ P_t} min_{p_i ∈ P_g} ‖p_j − p_i‖₂²

wherein p_i represents any spatial point in the reconstructed point cloud P_g, and p_j represents any spatial point in the real point cloud P_t. The smaller the value of the CD loss, the smaller the distance between the reconstructed point cloud and the real point cloud, indicating a higher accuracy of the reconstruction process.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (3)
1. A single view three-dimensional point cloud reconstruction method, the method comprising:
constructing a color image feature learning module, which takes a single color image I as input and mines the geometric features and semantic features of the input view;
constructing a depth map acquisition module with the same network structure as a high-quality monocular depth estimation network, which performs depth estimation on the input color image I to obtain the corresponding depth map I_D;
Constructing a depth image feature learning module, taking a depth image estimated from an input color image I as input, and mining depth information for assisting a three-dimensional point cloud reconstruction process;
designing a color-depth information fusion module, which takes the M levels of image features {R_m}_{m=1}^{M} extracted by the color image feature learning module and the M levels of depth features {D_m}_{m=1}^{M} extracted by the depth feature learning module as input, and fuses depth features and image features at each level with a shallow-to-deep multi-stage fusion strategy; and

constructing a three-dimensional point cloud reconstruction module, which takes the learned M-th-level RGB-D fusion feature F̂_M as input and realizes the reconstruction of the three-dimensional point cloud by mapping the feature to 3D space.
2. The single-view three-dimensional point cloud reconstruction method according to claim 1, wherein the depth map acquisition module is:
mapping from three-dimensional space to depth space by using the real three-dimensional point cloud data and camera parameters in the ShapeNet dataset, and converting the three-dimensional point cloud into a depth map so as to generate depth map pseudo-labels;

inputting the color images and the corresponding depth map pseudo-labels in pairs into the depth map acquisition module for supervised fine-tuning training; and

sending the input color image I into the module for depth estimation to obtain the depth map I_D corresponding to I.
3. The method for reconstructing a single-view three-dimensional point cloud according to claim 1, wherein the color-depth information fusion module is:
R_m, D_m, and the RGB-D feature F̂_{m-1} obtained by feature fusion at the (m-1)-th level are sent into a concatenation layer to be concatenated along the channel dimension, and the concatenated features are adaptively fused through a convolution layer to obtain the preliminarily fused m-th-level RGB-D feature F_m:

F_m = Conv([R_m, D_m, F̂_{m-1}])

wherein [·] represents the concatenation operation and Conv(·) represents a convolution operation with a 3×3 kernel;

F_m is sent into an attention unit for feature selection, the attention unit consisting of a global pooling layer, a fully connected layer, and a Sigmoid layer; the global pooling layer is employed to aggregate the global spatial information of F_m into a feature descriptor:

z_m = (1 / (h_m · w_m)) Σ_{i=1}^{h_m} Σ_{j=1}^{w_m} F_m(i, j)

wherein z_m represents the aggregated feature, and h_m, w_m, and c_m respectively denote the height, width, and number of channels of the feature F_m;

based on z_m, the fully connected layer is adopted to capture the interdependence among feature channels and obtain the attention map of F_m, and the channel values of the obtained attention map are normalized to (0, 1) by the Sigmoid layer; the attention map A_m finally learned by the attention unit is expressed as:

A_m = Sigmoid(FC(z_m))

wherein FC(·) represents the mapping of the fully connected layer and Sigmoid(·) represents the normalization realized by the Sigmoid function;

based on the obtained attention map A_m, the channel dimension of F_m is weighted, and the attention-enhanced RGB-D fusion feature F'_m is output:

F'_m = f_scale(F_m, A_m)

wherein f_scale(·) denotes the channel-wise multiplication between the attention map and the original feature;

an adaptive block composed of several convolution layers is adopted for adaptive learning, finally obtaining the m-th-level RGB-D fusion feature F̂_m;

the color-depth information fusion module performs M levels of feature fusion from shallow to deep and finally takes the M-th-level RGB-D fusion feature F̂_M as output; this feature simultaneously contains image geometric information, semantic information, and depth information.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311030758.5A | 2023-08-15 | 2023-08-15 | Single-view three-dimensional point cloud reconstruction method |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN117078851A | 2023-11-17 |
Family
ID=88707319
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311030758.5A | Single-view three-dimensional point cloud reconstruction method | 2023-08-15 | 2023-08-15 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN117078851A (en) |

- 2023-08-15: Application CN202311030758.5A filed in CN; published as CN117078851A, status Pending
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |