CN114998507A - Photometric stereo three-dimensional reconstruction method based on self-supervised learning - Google Patents

Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Info

Publication number
CN114998507A
Authority
CN
China
Prior art keywords
image
map
model
result
shadow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210634582.3A
Other languages
Chinese (zh)
Inventor
冯伟
王英铭
张乾
万亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210634582.3A priority Critical patent/CN114998507A/en
Publication of CN114998507A publication Critical patent/CN114998507A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G06T 15/60 Shadow generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G06T 15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G06T 15/506 Illumination models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/60 Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2215/00 Indexing scheme for image rendering
    • G06T 2215/12 Shadow map, environment map

Abstract

The invention relates to a photometric stereo three-dimensional reconstruction method based on self-supervised learning, which comprises the following steps: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K; input the image set into a photometric stereo model to obtain a coarse normal map recovery result and the illumination condition of each image; input the image set into a reflection map estimation model and a shadow estimation model to obtain the reflection map of the scene and the shadow map of each image; restore the input images from the coarse normal map recovery result, the illumination conditions, the reflection map and the shadow maps to obtain a restored image set; train the photometric stereo model in a self-supervised manner according to the similarity between the input image set and the restored image set, updating the parameters of the photometric stereo model; input the image set into the photometric stereo model with updated parameters to obtain an optimized normal map recovery result and the illumination condition of each image; and perform three-dimensional reconstruction of the target scene.

Description

Photometric stereo three-dimensional reconstruction method based on self-supervised learning
Technical Field
The invention belongs to the field of artificial intelligence and computer vision, relates to photometric stereo three-dimensional reconstruction technology, and in particular relates to a photometric stereo three-dimensional reconstruction method based on self-supervised learning.
Background
Three-dimensional reconstruction technology aims to acquire the three-dimensional structure of a real object's surface and express it in a data format that is easy for a computer to store and process. It plays an important role in computer vision problems such as autonomous driving, virtual reality, digital museums and cultural heritage protection.
At present, many solutions to the three-dimensional reconstruction problem exist. Laser scanners and structured-light three-dimensional scanners, which are widely used in practice, are active three-dimensional reconstruction methods. Although active methods can accurately reconstruct the three-dimensional structure of a scene, their working time is long: three or four hours are often required to reconstruct a single scene. In addition, active methods must project laser light onto the target scene, which inevitably damages the structure of fragile objects such as cultural relics.
The photometric stereo technique aims to recover the normal information of a scene surface from multiple images taken under different illumination conditions. The algorithm does not need to interact with the target scene, and acquiring two-dimensional images is far more convenient than operating the data acquisition equipment of active methods. However, existing photometric stereo algorithms have an obvious shortcoming: when applied to open, real-world scenes with complex materials and illumination conditions, their performance drops sharply.
Disclosure of Invention
The invention provides a photometric stereo three-dimensional reconstruction method based on self-supervised learning. A self-supervised training scheme is used to optimize the photometric stereo model and improve its adaptability to complex and variable scene illumination and surface materials. During training of the photometric stereo model, a reflection map estimation model and a shadow estimation model are constructed to predict the reflection map of the scene and the shadow distribution of each image; the input images are restored according to a Lambertian rendering model, and a reconstruction loss function is constructed to optimize the photometric stereo model in a self-supervised manner, which supports reliable and efficient operation in open environments and improves the accuracy of the three-dimensional reconstruction. The invention is realized by the following technical scheme:
A photometric stereo three-dimensional reconstruction method based on self-supervised learning, characterized by comprising the following steps:
Step 1: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K, where K is the number of input images;
Step 2: input the image set {I_k}_{k=1}^K into a photometric stereo model to obtain a coarse normal map recovery result N ∈ R^{3×P} and the illumination condition l_k ∈ R^3 of each image I_k, where P is the number of pixels in one image and 1 ≤ k ≤ K;
wherein the photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k separately, and the K features are fused into a fixed-size global feature by a max-pooling operation; the illumination condition of each image is estimated from the global feature together with that image's own feature, and the global feature is then fed into a decoder, which regresses the coarse normal map recovery result of the scene;
Step 3: input the image set {I_k}_{k=1}^K into a reflection map estimation model and a shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image, as follows:
(1) concatenate all images of the set {I_k}_{k=1}^K along the channel dimension and feed the result into the reflection map estimation model, whose output is the estimate of the scene reflection map A;
(2) concatenate the coarse normal map recovery result N with the illumination condition l_k along the channel dimension, and record the concatenation result as tensor B_k;
(3) feed each image I_k of the set {I_k}_{k=1}^K together with the corresponding tensor B_k into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k; the shadow estimation model works as follows:
the shadow estimation model is structured as an encoder-decoder neural network comprising two encoders, each of which consists of four convolutional layers with 3 × 3 kernels and 64, 128, 256 and 256 kernels per layer, respectively; the input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features; a feature fusion module follows the two encoders: it concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function to output a channel weight matrix; the channel weight matrix is multiplied with the concatenated depth features, fusing the two depth features into the final image depth feature; the final depth feature is fed into a decoder to obtain the estimate of the shadow map S_k;
Step 4: restore the input images from the coarse normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, obtaining a restored image set {Î_k}_{k=1}^K;
Step 5: train the photometric stereo model in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K, and update the parameters of the photometric stereo model;
Step 6: input the image set {I_k}_{k=1}^K into the photometric stereo model with updated parameters to obtain the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k;
Step 7: perform three-dimensional reconstruction of the target scene from the final normal map recovery result N*.
Furthermore, in Step 3, the reflection map estimation model is structured as an encoder-decoder neural network: the encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network.
Further, the specific method of Step 4 is as follows:
according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
each input image I_k is restored using the normal map recovery result N, the illumination condition l_k, the reflection map A and the shadow map S_k, and the restored image set {Î_k}_{k=1}^K is constructed.
Further, the specific method of Step 5 is as follows:
(1) the mean square error (MSE) is selected as the measure of similarity between images;
(2) the photometric stereo model is trained in a self-supervised manner with the reconstruction loss function
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, respectively, and i indexes the pixels of an image;
(3) the photometric stereo model is trained until convergence, yielding the optimized photometric stereo model.
The technical scheme provided by the invention has the following beneficial effects:
1. During three-dimensional reconstruction of the target scene, shadow information in the scene is actively identified and understood, which improves the robustness of the photometric stereo model to scene shadows and the accuracy of the three-dimensional reconstruction result.
2. During three-dimensional reconstruction of the target scene, self-supervised optimization training improves the adaptability of the photometric stereo model to open application scenes with complex and variable illumination conditions and surface materials.
Drawings
FIG. 1 is a flow chart of a photometric stereo three-dimensional reconstruction method based on self-supervised learning;
FIG. 2 is a flow chart of the self-supervised photometric stereo model;
FIG. 3 is a flow chart of the shadow estimation model of FIG. 2 according to the present invention;
FIG. 4 is a quantitative comparison of the method of the present invention with eight state-of-the-art photometric stereo three-dimensional reconstruction methods;
FIG. 5 is a visual comparison of the method of the present invention with the best existing photometric stereo three-dimensional reconstruction method.
Detailed Description
The technical scheme of the invention is clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art without creative efforts based on the technical solutions of the present invention belong to the protection scope of the present invention.
(I) First, the target scene is photographed multiple times under different illumination conditions to obtain the input image set {I_k}_{k=1}^K. The specific method is as follows:
(1) K photographs of the target scene are acquired under K different illumination conditions, yielding the image set {I_k}_{k=1}^K.
Description 1: the image set {I_k}_{k=1}^K and the TJU-Synth-AA dataset
Typically, K = 13: a light source (e.g., a flat panel light) actively illuminates the scene from each of the 1 o'clock to 12 o'clock directions and from the front, and an image of the scene is acquired for each direction. The requirements on the light source directions are not strict; it is only necessary to ensure that the images are taken under multi-angle illumination.
To support the training of the neural networks used in the invention, a rendering engine is used to construct a synthetic virtual-scene dataset with ground-truth values for all image attributes, named the TJU-Synth-AA dataset. The dataset contains 1136 virtual scenes and provides, for each scene, the normal map, reflection map and shadow map, 100 imaging results of the scene under different illumination conditions, and the corresponding illumination information.
Three-dimensional models provided by the Sculpture 3D model dataset are adopted; models with incomplete three-dimensional structure or material information are removed, and 142 three-dimensional models are finally retained. The screened 3D models are rendered with the Unity3D rendering engine to obtain scene images. Each 3D model is observed from 8 different camera viewpoints, giving 8 different scenes per model, and in each scene the imaging results under 100 different directional lights are acquired. In this way 142 × 8 = 1136 scenes are obtained, each containing 100 images. The GetComponent interface provided by the Unity3D rendering engine can directly return the scene normal map, the reflection map and the illumination information corresponding to each image, but the shadow information of the scene surface cannot be obtained directly in Unity3D. To obtain the ground-truth shadow map of a scene, the following method is adopted: after the imaging result of a scene under light from one direction has been obtained, all object surface materials in the current scene are set to a white diffuse material, and the Cast Shadow option in Unity3D is set to on and then off, giving two imaging results of the current scene under the same illumination, with and without cast shadows. The shadow map S of the scene is then obtained by pixel-by-pixel division of the two imaging results, as sketched below.
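A minimal sketch of this pixel-wise division (the clipping threshold and function name are illustrative assumptions, not part of the patent):

```python
import numpy as np

def shadow_map_from_renders(img_with_shadow: np.ndarray,
                            img_without_shadow: np.ndarray,
                            eps: float = 1e-6) -> np.ndarray:
    """Ground-truth shadow map as the pixel-wise ratio of the two renders."""
    s = img_with_shadow / np.maximum(img_without_shadow, eps)  # avoid division by zero
    return np.clip(s, 0.0, 1.0)  # shadowed pixels fall below 1, lit pixels stay near 1
```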
(II) Obtaining the coarse normal map recovery result N and the illumination condition l_k of each image I_k.
The image set {I_k}_{k=1}^K is fed into the photometric stereo model to obtain the coarse normal map recovery result N and the illumination condition l_k of each image I_k. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are fed into the photometric stereo model, and the model outputs the coarse normal map recovery result N and the illumination condition l_k of each image I_k.
Description 2: construction and training of the photometric stereo model
The photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k, and the K features are fused into a fixed-size global feature by a max-pooling operation. The network first estimates the illumination condition of each image from the global feature together with that image's own feature, then feeds the global feature into a decoder, which regresses the normal map prediction of the scene.
The photometric stereo model is trained in a supervised manner on the Blobby and Sculpture photometric stereo datasets.
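A minimal PyTorch sketch of this twin architecture; only the structure described above (shared per-image encoder, max-pooling fusion, per-image lighting regression, normal decoder) is taken from the text, and all layer widths and head designs are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhotometricStereoNet(nn.Module):
    def __init__(self, feat_ch=128):
        super().__init__()
        # Shared per-image encoder (the same weights are applied to every input image).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Lighting regressor: consumes the fused global feature plus one per-image feature.
        self.light_head = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 3),
        )
        # Decoder that regresses per-pixel surface normals from the global feature.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, images):                   # images: (K, 3, H, W)
        feats = [self.encoder(img.unsqueeze(0)) for img in images]
        feats = torch.cat(feats, dim=0)          # (K, C, H, W)
        global_feat = feats.max(dim=0, keepdim=True).values   # max-pooling fusion over K
        lights = torch.stack([
            self.light_head(torch.cat([global_feat, f.unsqueeze(0)], dim=1)).squeeze(0)
            for f in feats
        ])                                       # (K, 3) lighting directions
        lights = F.normalize(lights, dim=1)
        normals = F.normalize(self.decoder(global_feat), dim=1)  # (1, 3, H, W) unit normals
        return normals.squeeze(0), lights
```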
(III) Obtaining the reflection map A of the scene and the shadow map S_k of each image.
The input image set {I_k}_{k=1}^K is fed into the reflection map estimation model and the shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are concatenated along the channel dimension, and the concatenated result is fed into the reflection map estimation model, whose output is the estimate of the scene reflection map A.
Description 3: structure and training of the reflection map estimation model
The reflection map estimation model is structured as an encoder-decoder neural network. The encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; in addition, a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network (see the sketch below).
The reflection map estimation model is trained in a supervised manner on the TJU-Synth-AA reflection map prediction dataset.
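A minimal sketch of such an encoder-decoder with skip connections; only the 6-layer encoder / 4-layer decoder with skip connections comes from the text, while channel widths, strides and the number of stacked input images are assumptions:

```python
import torch
import torch.nn as nn

class ReflectanceNet(nn.Module):
    """Encoder (6 conv layers) and decoder (4 layers) with skip connections."""
    def __init__(self, in_ch=3 * 13):                        # e.g. 13 RGB images stacked channel-wise
        super().__init__()
        chans = [64, 64, 128, 128, 256, 256]
        self.enc = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(chans):
            stride = 2 if i in (1, 3, 5) else 1               # downsample three times
            self.enc.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=stride, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.dec = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(128, 3, 3, padding=1), nn.Sigmoid()),   # reflection map A
        ])

    def forward(self, x):                                     # x: (1, in_ch, H, W), H and W divisible by 8
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        x = self.dec[0](x)                                    # H/8 -> H/4
        x = self.dec[1](torch.cat([x, skips[3]], dim=1))      # skip from the H/4 encoder feature
        x = self.dec[2](torch.cat([x, skips[1]], dim=1))      # skip from the H/2 encoder feature
        return self.dec[3](torch.cat([x, skips[0]], dim=1))   # full-resolution reflection map
```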
(2) The normal map N and the illumination condition l_k are concatenated along the channel dimension, and the concatenation result is recorded as tensor B_k.
Description 4: concatenation of the normal map N and the illumination condition l_k
l_k is a 3 × 1 vector and N is a 3 × h × w tensor (h and w are the image height and width, respectively). l_k is first replicated and expanded to the same dimensions as N, and then concatenated with N to form the tensor B_k (of dimension 6 × h × w), as sketched below.
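A minimal sketch of this replication and concatenation (the function name is illustrative):

```python
import torch

def build_B(normal_map: torch.Tensor, light: torch.Tensor) -> torch.Tensor:
    """normal_map: (3, h, w); light: (3,) -> tensor B_k of shape (6, h, w)."""
    _, h, w = normal_map.shape
    light_map = light.view(3, 1, 1).expand(3, h, w)    # replicate l_k to every pixel
    return torch.cat([normal_map, light_map], dim=0)   # channel-wise concatenation
```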
(3) Each image I_k of the set {I_k}_{k=1}^K and the corresponding tensor B_k are fed into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k.
Description 5: structure and training of the shadow estimation model
The shadow estimation model is structured as an encoder-decoder neural network. It contains two encoders, each consisting of four convolutional layers with 3 × 3 kernels and a stride of 2; the number of kernels per layer is 64, 128, 256 and 256, respectively. The input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features. A feature fusion module follows the two encoders: it first concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function, outputting a channel weight matrix. The role of this weight matrix is to learn, during training, the differing importance of the feature channels and to assign each channel a fusion weight accordingly. The feature fusion module then multiplies the channel weight matrix with the concatenated depth features, fusing the two depth features into the final image feature. The resulting depth feature is fed into the decoder, which consists of 4 deconvolution layers and one convolutional layer, with a skip connection after each deconvolution layer. A sketch of this fusion is given below.
The shadow estimation model is trained in a supervised manner on the TJU-Synth-AA shadow estimation dataset.
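A minimal PyTorch sketch of the shadow estimator with its channel-attention feature fusion; the encoder widths and strides follow the text, while the decoder here omits the skip connections for brevity and everything not stated above is an assumption:

```python
import torch
import torch.nn as nn

def make_encoder(in_ch):
    chans = [64, 128, 256, 256]                       # kernel counts from the text, stride 2 each
    layers, prev = [], in_ch
    for c in chans:
        layers += [nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        prev = c
    return nn.Sequential(*layers)

class ShadowEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_img = make_encoder(3)                # encoder for the image I_k
        self.enc_geo = make_encoder(6)                # encoder for the tensor B_k
        self.attn = nn.Sequential(                    # global pooling + 1x1 conv + Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(512, 512, 1),
            nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(                 # simplified decoder stand-in (no skips)
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, B):                      # image: (1,3,H,W), B: (1,6,H,W), H,W divisible by 16
        f = torch.cat([self.enc_img(image), self.enc_geo(B)], dim=1)   # concatenated depth features
        fused = f * self.attn(f)                      # reweight channels with the channel weight matrix
        return self.decoder(fused)                    # shadow map S_k in [0, 1]
```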
(IV) Generation of the restored image set {Î_k}_{k=1}^K.
The input images are restored from the normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, yielding the restored image set {Î_k}_{k=1}^K. The specific method is as follows:
(1) after the normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k have been obtained, each input image I_k is synthesized and restored according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
thereby constructing the restored image set {Î_k}_{k=1}^K.
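The exact rendering formula appears only as an image in the patent; the form used above and in the sketch below (shadow times albedo times the clamped dot product of normal and light) is an assumption consistent with the surrounding text:

```python
import numpy as np

def render_lambertian(normals, light, albedo, shadow):
    """normals: (3, h, w) unit normals; light: (3,); albedo, shadow: (h, w) or (3, h, w)."""
    shading = np.maximum(0.0, np.einsum("chw,c->hw", normals, light))  # n . l, clamped at zero
    return shadow * albedo * shading                                   # restored image Î_k
```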
(V) Updating the parameters of the photometric stereo model.
The photometric stereo model is trained in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K. The specific method for updating the model parameters is as follows:
(1) the photometric stereo model is trained with a self-supervised strategy by minimizing the reconstruction loss function, and the network parameters are updated.
Description 6: form of the reconstruction loss function
The widely used mean square error (MSE) is adopted as the measure of image similarity, so the reconstruction loss can be written as
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, and i indexes the pixels.
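A minimal sketch of this self-supervised update; the optimizer, learning rate and the variable names in the usage comments are assumptions, not part of the patent:

```python
import torch

def reconstruction_loss(inputs: torch.Tensor, restored: torch.Tensor) -> torch.Tensor:
    """inputs, restored: (K, 3, H, W). Mean square error over all images and pixels."""
    return torch.mean((inputs - restored) ** 2)

# Hypothetical usage (names are placeholders):
# optimizer = torch.optim.Adam(photometric_stereo_net.parameters(), lr=1e-4)
# loss = reconstruction_loss(I, I_hat)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```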
(VI) Obtaining the final normal map recovery result.
The image set {I_k}_{k=1}^K is fed into the photometric stereo model with updated parameters to obtain the final normal map recovery result N*. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are fed into the photometric stereo model after self-supervised training; the model outputs the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k.
(VII) Reconstructing the surface of the target scene from the final normal map recovery result.
The specific method for three-dimensional reconstruction of the target scene surface from the final normal map recovery result N* is as follows:
(1) the gradient field is computed from the normal map recovery result N*, and the target surface is reconstructed under an integrability constraint. The reconstruction objective function is defined as
min_Z ∬_Ω || ∇Z(u, v) − g(u, v) ||² du dv,
where (u, v) are the pixel coordinates of a point in the image, Ω is the integration region, Z is the depth of the object surface, and g is the gradient field of the target scene. This objective is usually solved with the Poisson equation.
Description 7: obtaining the depth Z of the object surface
Let the surface of the target scene be represented by the depth equation Z = f(x, y), and let (p, q) denote the gradient field at a point of the scene; the normal vector at that point can then be written as
n = (−p, −q, 1) / √(p² + q² + 1).
On this basis, the surface depth equation Z = f(x, y) can be solved with the Poisson equation
∇²Z = ∂p/∂x + ∂q/∂y.
(2) A visualization program written with NumPy, an extension library of the Python programming language, is used to display the reconstructed surface. A sketch of the normal-to-depth integration is given below.
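A minimal sketch of one standard way to perform this integration, solving the Poisson equation in the Fourier domain (Frankot-Chellappa); this generic solver is consistent with the description above but is not necessarily the authors' exact implementation:

```python
import numpy as np

def normals_to_depth(normals: np.ndarray) -> np.ndarray:
    """normals: (3, h, w) unit normals (nx, ny, nz) -> depth map Z of shape (h, w)."""
    nx, ny, nz = normals
    nz = np.where(np.abs(nz) < 1e-6, 1e-6, nz)       # avoid division by zero
    p, q = -nx / nz, -ny / nz                        # gradient field dZ/dx, dZ/dy
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2.0 * np.pi             # frequency grids
    wy = np.fft.fftfreq(h) * 2.0 * np.pi
    u, v = np.meshgrid(wx, wy)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                                # avoid 0/0 at the DC term
    Z_hat = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    Z_hat[0, 0] = 0.0                                # depth is recovered up to an offset
    return np.real(np.fft.ifft2(Z_hat))
```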
The following examples are presented to demonstrate the feasibility of the method of the present invention, as described in detail below:
the method of the invention was used for validation on the DiLiGenT dataset and the dunhuang mugrotto dataset. The DiLiGenT data lump contains 10 test scenes, each containing 96 pictures imaged under different lighting conditions. In addition, the DiLiGenT dataset also provides normal Truth (Ground Truth) and ray targeting information for each scene surface. The mongao cave dataset of dunhuang contains 15 scenes, each scene containing 13 pictures imaged under different lighting conditions, the test set does not contain normal Truth (Ground Truth) and ray calibration information for the scene surface. The experiments used the Mean Angle Error (MAE) function to quantitatively evaluate the three-dimensional reconstruction of DiLiGenT datasets. For the Dunhuang Mogao Grottoes dataset, the experiment only visually shows it, and does not quantitatively evaluate it, since the dataset does not contain truth values.
The results of testing on the DiLiGenT dataset with the present method and with the best existing photometric stereo three-dimensional reconstruction methods, shown in FIG. 4, indicate that the reconstruction results of the present method achieve a lower mean angular error and thus higher reconstruction accuracy, which demonstrates the effectiveness of the method. The visual reconstruction results in FIG. 5 show that, compared with the best existing method SDPS-Net, the present method reconstructs the surface of cultural relics in finer detail and therefore has higher practical value.

Claims (4)

1. A photometric stereo three-dimensional reconstruction method based on self-supervised learning, characterized by comprising the following steps:
Step 1: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K, where K is the number of input images;
Step 2: input the image set {I_k}_{k=1}^K into a photometric stereo model to obtain a coarse normal map recovery result N ∈ R^{3×P} and the illumination condition l_k ∈ R^3 of each image I_k, where P is the number of pixels in one image and 1 ≤ k ≤ K;
wherein the photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k separately, and the K features are fused into a fixed-size global feature by a max-pooling operation; the illumination condition of each image is estimated from the global feature together with that image's own feature, and the global feature is then fed into a decoder, which regresses the coarse normal map recovery result of the scene;
Step 3: input the image set {I_k}_{k=1}^K into a reflection map estimation model and a shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image, as follows:
(1) concatenate all images of the set {I_k}_{k=1}^K along the channel dimension and feed the result into the reflection map estimation model, whose output is the estimate of the scene reflection map A;
(2) concatenate the coarse normal map recovery result N with the illumination condition l_k along the channel dimension, and record the concatenation result as tensor B_k;
(3) feed each image I_k of the set {I_k}_{k=1}^K together with the corresponding tensor B_k into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k; the shadow estimation model works as follows:
the shadow estimation model is structured as an encoder-decoder neural network comprising two encoders, each of which consists of four convolutional layers with 3 × 3 kernels and 64, 128, 256 and 256 kernels per layer, respectively; the input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features; a feature fusion module follows the two encoders: it concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function to output a channel weight matrix; the channel weight matrix is multiplied with the concatenated depth features, fusing the two depth features into the final image depth feature; the final depth feature is fed into a decoder to obtain the estimate of the shadow map S_k;
Step 4: restore the input images from the coarse normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, obtaining a restored image set {Î_k}_{k=1}^K;
Step 5: train the photometric stereo model in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K, and update the parameters of the photometric stereo model;
Step 6: input the image set {I_k}_{k=1}^K into the photometric stereo model with updated parameters to obtain the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k;
Step 7: perform three-dimensional reconstruction of the target scene from the final normal map recovery result N*.
2. The method as claimed in claim 1, wherein the reflection map estimation model is structured as an encoder-decoder neural network: the encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network.
3. The photometric stereo three-dimensional reconstruction method based on self-supervised learning as claimed in claim 1, wherein the specific method of Step 4 is as follows:
according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
each input image I_k is restored using the normal map recovery result N, the illumination condition l_k, the reflection map A and the shadow map S_k, and the restored image set {Î_k}_{k=1}^K is constructed.
4. The photometric stereo three-dimensional reconstruction method based on self-supervised learning as claimed in claim 1, wherein the specific method of Step 5 is as follows:
(1) the mean square error (MSE) is selected as the measure of similarity between images;
(2) the photometric stereo model is trained in a self-supervised manner with the reconstruction loss function
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, respectively, and i indexes the pixels of an image;
(3) the photometric stereo model is trained until convergence, yielding the optimized photometric stereo model.
CN202210634582.3A 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning Pending CN114998507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210634582.3A CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210634582.3A CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Publications (1)

Publication Number Publication Date
CN114998507A true CN114998507A (en) 2022-09-02

Family

ID=83033836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210634582.3A Pending CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Country Status (1)

Country Link
CN (1) CN114998507A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116105632A (en) * 2023-04-12 2023-05-12 四川大学 Self-supervision phase unwrapping method and device for structured light three-dimensional imaging
CN116883578A (en) * 2023-09-06 2023-10-13 腾讯科技(深圳)有限公司 Image processing method, device and related equipment
CN116883578B (en) * 2023-09-06 2023-12-19 腾讯科技(深圳)有限公司 Image processing method, device and related equipment

Similar Documents

Publication Publication Date Title
US11257272B2 (en) Generating synthetic image data for machine learning
CN112258390B (en) High-precision microscopic virtual learning resource generation method
US11334762B1 (en) Method for image analysis
Mayer et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation
CN114998507A (en) Photometric stereo three-dimensional reconstruction method based on self-supervised learning
CN110910437B (en) Depth prediction method for complex indoor scene
CN113112504A (en) Plant point cloud data segmentation method and system
CN113572962A (en) Outdoor natural scene illumination estimation method and device
US11875583B2 (en) Dataset generation method for self-supervised learning scene point cloud completion based on panoramas
CN110633628A (en) RGB image scene three-dimensional model reconstruction method based on artificial neural network
US20230419600A1 (en) Volumetric performance capture with neural rendering
CN114514561A (en) Neural light transmission
CN115457188A (en) 3D rendering display method and system based on fixation point
CN115082254A (en) Lean control digital twin system of transformer substation
Yeh et al. Photoscene: Photorealistic material and lighting transfer for indoor scenes
Shinohara et al. Point2color: 3d point cloud colorization using a conditional generative network and differentiable rendering for airborne lidar
CN113763231A (en) Model generation method, image perspective determination device, image perspective determination equipment and medium
CN115272599A (en) Three-dimensional semantic map construction method oriented to city information model
CN114332355A (en) Weak light multi-view geometric reconstruction method based on deep learning
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Schambach et al. A multispectral light field dataset and framework for light field deep learning
CN116433822B (en) Neural radiation field training method, device, equipment and medium
CN116311218A (en) Noise plant point cloud semantic segmentation method and system based on self-attention feature fusion
CN116071278A (en) Unmanned aerial vehicle aerial image synthesis method, system, computer equipment and storage medium
CN115953447A (en) Point cloud consistency constraint monocular depth estimation method for 3D target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination