CN114998507A - Photometric stereo three-dimensional reconstruction method based on self-supervised learning - Google Patents

Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Info

Publication number
CN114998507A
Authority
CN
China
Prior art keywords
image
map
model
result
shadow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210634582.3A
Other languages
Chinese (zh)
Inventor
冯伟
王英铭
张乾
万亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210634582.3A priority Critical patent/CN114998507A/en
Publication of CN114998507A publication Critical patent/CN114998507A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G06T 15/60 Shadow generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G06T 15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G06T 15/506 Illumination models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/60 Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2215/00 Indexing scheme for image rendering
    • G06T 2215/12 Shadow map, environment map

Abstract

The invention relates to a photometric stereo three-dimensional reconstruction method based on self-supervised learning, which comprises the following steps: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K; input the image set into a photometric stereo model to obtain a coarse normal map recovery result and the illumination condition of each image; input the image set into a reflection map estimation model and a shadow estimation model to obtain the reflection map of the scene and the shadow map of each image; restore the input images from the coarse normal map recovery result, the illumination conditions, the reflection map and the shadow maps to obtain a restored image set; train the photometric stereo model in a self-supervised manner according to the similarity between the input image set and the restored image set, updating the parameters of the photometric stereo model; input the image set into the photometric stereo model with updated parameters to obtain an optimized normal map recovery result and the illumination condition of each image; and perform three-dimensional reconstruction of the target scene.

Description

Photometric stereo three-dimensional reconstruction method based on self-supervised learning
Technical Field
The invention belongs to the field of artificial intelligence and computer vision, relates to photometric stereo three-dimensional reconstruction technology, and in particular relates to a photometric stereo three-dimensional reconstruction method based on self-supervised learning.
Background
Three-dimensional reconstruction technology aims to acquire the three-dimensional structure of a real object's surface and express it in a data format that is easy for a computer to store and process. It plays an important role in computer vision problems such as autonomous driving, virtual reality, digital museums and cultural heritage protection.
At present, many solutions to the three-dimensional reconstruction problem exist. Laser scanners and structured-light three-dimensional scanners, which are widely used in practice, are active three-dimensional reconstruction methods. Although active methods can accurately reconstruct the three-dimensional structure of a scene, their working time is long: three or four hours are often required to reconstruct a single scene. In addition, active methods must project laser light onto the target scene, which inevitably damages the structure of fragile objects such as cultural relics.
The photometric stereo technique aims to recover the normal information of a scene surface from multiple images taken under different illumination conditions. The algorithm does not need to interact with the target scene, and acquiring two-dimensional images is far more convenient than operating the data acquisition equipment of active methods. However, existing photometric stereo algorithms have an obvious shortcoming: when applied to open, real-world scenes with complex materials and illumination conditions, their performance drops sharply.
Disclosure of Invention
The invention provides a photometric stereo three-dimensional reconstruction method based on self-supervised learning. A self-supervised training scheme is used to optimize the photometric stereo model and improve its adaptability to complex and variable scene illumination and surface materials. During training of the photometric stereo model, a reflection map estimation model and a shadow estimation model are constructed to predict the reflection map of the scene and the shadow distribution of each image; the input images are restored according to a Lambertian rendering model, and a reconstruction loss function is constructed to optimize the photometric stereo model in a self-supervised manner, which supports reliable and efficient operation in open environments and improves the accuracy of the three-dimensional reconstruction. The invention is realized by the following technical scheme:
A photometric stereo three-dimensional reconstruction method based on self-supervised learning, characterized by comprising the following steps:
Step 1: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K, where K is the number of input images;
Step 2: input the image set {I_k}_{k=1}^K into a photometric stereo model to obtain a coarse normal map recovery result N ∈ R^{3×P} and the illumination condition l_k ∈ R^3 of each image I_k, where P is the number of pixels in one image and 1 ≤ k ≤ K;
wherein the photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k separately, and the K features are fused into a fixed-size global feature by a max-pooling operation; the illumination condition of each image is estimated from the global feature together with that image's own feature, and the global feature is then fed into a decoder, which regresses the coarse normal map recovery result of the scene;
Step 3: input the image set {I_k}_{k=1}^K into a reflection map estimation model and a shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image, as follows:
(1) concatenate all images of the set {I_k}_{k=1}^K along the channel dimension and feed the result into the reflection map estimation model, whose output is the estimate of the scene reflection map A;
(2) concatenate the coarse normal map recovery result N with the illumination condition l_k along the channel dimension, and record the concatenation result as tensor B_k;
(3) feed each image I_k of the set {I_k}_{k=1}^K together with the corresponding tensor B_k into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k; the shadow estimation model works as follows:
the shadow estimation model is structured as an encoder-decoder neural network comprising two encoders, each of which consists of four convolutional layers with 3 × 3 kernels and 64, 128, 256 and 256 kernels per layer, respectively; the input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features; a feature fusion module follows the two encoders: it concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function to output a channel weight matrix; the channel weight matrix is multiplied with the concatenated depth features, fusing the two depth features into the final image depth feature; the final depth feature is fed into a decoder to obtain the estimate of the shadow map S_k;
Step 4: restore the input images from the coarse normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, obtaining a restored image set {Î_k}_{k=1}^K;
Step 5: train the photometric stereo model in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K, and update the parameters of the photometric stereo model;
Step 6: input the image set {I_k}_{k=1}^K into the photometric stereo model with updated parameters to obtain the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k;
Step 7: perform three-dimensional reconstruction of the target scene from the final normal map recovery result N*.
Furthermore, in Step 3, the reflection map estimation model is structured as an encoder-decoder neural network: the encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network.
Further, the specific method of Step 4 is as follows:
according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
each input image I_k is restored using the normal map recovery result N, the illumination condition l_k, the reflection map A and the shadow map S_k, and the restored image set {Î_k}_{k=1}^K is constructed.
Further, the specific method of Step 5 is as follows:
(1) the mean square error (MSE) is selected as the measure of similarity between images;
(2) the photometric stereo model is trained in a self-supervised manner with the reconstruction loss function
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, respectively, and i indexes the pixels of an image;
(3) the photometric stereo model is trained until convergence, yielding the optimized photometric stereo model.
The technical scheme provided by the invention has the following beneficial effects:
1. During three-dimensional reconstruction of the target scene, shadow information in the scene is actively identified and understood, which improves the robustness of the photometric stereo model to scene shadows and the accuracy of the three-dimensional reconstruction result.
2. During three-dimensional reconstruction of the target scene, self-supervised optimization training improves the adaptability of the photometric stereo model to open application scenes with complex and variable illumination conditions and surface materials.
Drawings
FIG. 1 is a flow chart of a photometric stereo three-dimensional reconstruction method based on self-supervised learning;
FIG. 2 is a flow chart of the self-supervised photometric stereo model;
FIG. 3 is a flow chart of the shadow estimation model of FIG. 2 according to the present invention;
FIG. 4 is a quantitative comparison of the method of the present invention with eight state-of-the-art photometric stereo three-dimensional reconstruction methods;
FIG. 5 is a visual comparison of the method of the present invention with the best existing photometric stereo three-dimensional reconstruction method.
Detailed Description
The technical scheme of the invention is clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art without creative efforts based on the technical solutions of the present invention belong to the protection scope of the present invention.
(I) First, the target scene is photographed multiple times under different illumination conditions to obtain the input image set {I_k}_{k=1}^K. The specific method is as follows:
(1) K photographs of the target scene are acquired under K different illumination conditions, yielding the image set {I_k}_{k=1}^K.
Description 1: the image set {I_k}_{k=1}^K and the TJU-Synth-AA dataset
Typically, K = 13: a light source (e.g., a flat panel light) actively illuminates the scene from each of the 1 o'clock to 12 o'clock directions and from the front, and an image of the scene is acquired for each direction. The requirements on the light source directions are not strict; it is only necessary to ensure that the images are taken under multi-angle illumination.
To support the training of the neural networks used in the invention, a rendering engine is used to construct a synthetic virtual-scene dataset with ground-truth values for all image attributes, named the TJU-Synth-AA dataset. The dataset contains 1136 virtual scenes and provides, for each scene, the normal map, reflection map and shadow map, 100 imaging results of the scene under different illumination conditions, and the corresponding illumination information.
Three-dimensional models provided by the Sculpture 3D model dataset are adopted; models with incomplete three-dimensional structure or material information are removed, and 142 three-dimensional models are finally retained. The screened 3D models are rendered with the Unity3D rendering engine to obtain scene images. Each 3D model is observed from 8 different camera viewpoints, giving 8 different scenes per model, and in each scene the imaging results under 100 different directional lights are acquired. In this way 142 × 8 = 1136 scenes are obtained, each containing 100 images. The GetComponent interface provided by the Unity3D rendering engine can directly return the scene normal map, the reflection map and the illumination information corresponding to each image, but the shadow information of the scene surface cannot be obtained directly in Unity3D. To obtain the ground-truth shadow map of a scene, the following method is adopted: after the imaging result of a scene under light from one direction has been obtained, all object surface materials in the current scene are set to a white diffuse material, and the Cast Shadow option in Unity3D is set to on and then off, giving two imaging results of the current scene under the same illumination, with and without cast shadows. The shadow map S of the scene is then obtained by pixel-by-pixel division of the two imaging results, as sketched below.
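A minimal sketch of this pixel-wise division (the clipping threshold and function name are illustrative assumptions, not part of the patent):

```python
import numpy as np

def shadow_map_from_renders(img_with_shadow: np.ndarray,
                            img_without_shadow: np.ndarray,
                            eps: float = 1e-6) -> np.ndarray:
    """Ground-truth shadow map as the pixel-wise ratio of the two renders."""
    s = img_with_shadow / np.maximum(img_without_shadow, eps)  # avoid division by zero
    return np.clip(s, 0.0, 1.0)  # shadowed pixels fall below 1, lit pixels stay near 1
```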
(II) Obtaining the coarse normal map recovery result N and the illumination condition l_k of each image I_k.
The image set {I_k}_{k=1}^K is fed into the photometric stereo model to obtain the coarse normal map recovery result N and the illumination condition l_k of each image I_k. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are fed into the photometric stereo model, and the model outputs the coarse normal map recovery result N and the illumination condition l_k of each image I_k.
Description 2: construction and training of the photometric stereo model
The photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k, and the K features are fused into a fixed-size global feature by a max-pooling operation. The network first estimates the illumination condition of each image from the global feature together with that image's own feature, then feeds the global feature into a decoder, which regresses the normal map prediction of the scene.
The photometric stereo model is trained in a supervised manner on the Blobby and Sculpture photometric stereo datasets.
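A minimal PyTorch sketch of this twin architecture; only the structure described above (shared per-image encoder, max-pooling fusion, per-image lighting regression, normal decoder) is taken from the text, and all layer widths and head designs are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhotometricStereoNet(nn.Module):
    def __init__(self, feat_ch=128):
        super().__init__()
        # Shared per-image encoder (the same weights are applied to every input image).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Lighting regressor: consumes the fused global feature plus one per-image feature.
        self.light_head = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 3),
        )
        # Decoder that regresses per-pixel surface normals from the global feature.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, images):                   # images: (K, 3, H, W)
        feats = [self.encoder(img.unsqueeze(0)) for img in images]
        feats = torch.cat(feats, dim=0)          # (K, C, H, W)
        global_feat = feats.max(dim=0, keepdim=True).values   # max-pooling fusion over K
        lights = torch.stack([
            self.light_head(torch.cat([global_feat, f.unsqueeze(0)], dim=1)).squeeze(0)
            for f in feats
        ])                                       # (K, 3) lighting directions
        lights = F.normalize(lights, dim=1)
        normals = F.normalize(self.decoder(global_feat), dim=1)  # (1, 3, H, W) unit normals
        return normals.squeeze(0), lights
```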
(III) Obtaining the reflection map A of the scene and the shadow map S_k of each image.
The input image set {I_k}_{k=1}^K is fed into the reflection map estimation model and the shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are concatenated along the channel dimension, and the concatenated result is fed into the reflection map estimation model, whose output is the estimate of the scene reflection map A.
Description 3: structure and training of the reflection map estimation model
The reflection map estimation model is structured as an encoder-decoder neural network. The encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; in addition, a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network (see the sketch below).
The reflection map estimation model is trained in a supervised manner on the TJU-Synth-AA reflection map prediction dataset.
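A minimal sketch of such an encoder-decoder with skip connections; only the 6-layer encoder / 4-layer decoder with skip connections comes from the text, while channel widths, strides and the number of stacked input images are assumptions:

```python
import torch
import torch.nn as nn

class ReflectanceNet(nn.Module):
    """Encoder (6 conv layers) and decoder (4 layers) with skip connections."""
    def __init__(self, in_ch=3 * 13):                        # e.g. 13 RGB images stacked channel-wise
        super().__init__()
        chans = [64, 64, 128, 128, 256, 256]
        self.enc = nn.ModuleList()
        prev = in_ch
        for i, c in enumerate(chans):
            stride = 2 if i in (1, 3, 5) else 1               # downsample three times
            self.enc.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=stride, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.dec = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(256, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(128, 3, 3, padding=1), nn.Sigmoid()),   # reflection map A
        ])

    def forward(self, x):                                     # x: (1, in_ch, H, W), H and W divisible by 8
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        x = self.dec[0](x)                                    # H/8 -> H/4
        x = self.dec[1](torch.cat([x, skips[3]], dim=1))      # skip from the H/4 encoder feature
        x = self.dec[2](torch.cat([x, skips[1]], dim=1))      # skip from the H/2 encoder feature
        return self.dec[3](torch.cat([x, skips[0]], dim=1))   # full-resolution reflection map
```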
(2) The normal map N and the illumination condition l_k are concatenated along the channel dimension, and the concatenation result is recorded as tensor B_k.
Description 4: concatenation of the normal map N and the illumination condition l_k
l_k is a 3 × 1 vector and N is a 3 × h × w tensor (h and w are the image height and width, respectively). l_k is first replicated and expanded to the same dimensions as N, and then concatenated with N to form the tensor B_k (of dimension 6 × h × w), as sketched below.
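A minimal sketch of this replication and concatenation (the function name is illustrative):

```python
import torch

def build_B(normal_map: torch.Tensor, light: torch.Tensor) -> torch.Tensor:
    """normal_map: (3, h, w); light: (3,) -> tensor B_k of shape (6, h, w)."""
    _, h, w = normal_map.shape
    light_map = light.view(3, 1, 1).expand(3, h, w)    # replicate l_k to every pixel
    return torch.cat([normal_map, light_map], dim=0)   # channel-wise concatenation
```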
(3) Each image I_k of the set {I_k}_{k=1}^K and the corresponding tensor B_k are fed into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k.
Description 5: structure and training of the shadow estimation model
The shadow estimation model is structured as an encoder-decoder neural network. It contains two encoders, each consisting of four convolutional layers with 3 × 3 kernels and a stride of 2; the number of kernels per layer is 64, 128, 256 and 256, respectively. The input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features. A feature fusion module follows the two encoders: it first concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function, outputting a channel weight matrix. The role of this weight matrix is to learn, during training, the differing importance of the feature channels and to assign each channel a fusion weight accordingly. The feature fusion module then multiplies the channel weight matrix with the concatenated depth features, fusing the two depth features into the final image feature. The resulting depth feature is fed into the decoder, which consists of 4 deconvolution layers and one convolutional layer, with a skip connection after each deconvolution layer. A sketch of this fusion is given below.
The shadow estimation model is trained in a supervised manner on the TJU-Synth-AA shadow estimation dataset.
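A minimal PyTorch sketch of the shadow estimator with its channel-attention feature fusion; the encoder widths and strides follow the text, while the decoder here omits the skip connections for brevity and everything not stated above is an assumption:

```python
import torch
import torch.nn as nn

def make_encoder(in_ch):
    chans = [64, 128, 256, 256]                       # kernel counts from the text, stride 2 each
    layers, prev = [], in_ch
    for c in chans:
        layers += [nn.Conv2d(prev, c, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        prev = c
    return nn.Sequential(*layers)

class ShadowEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_img = make_encoder(3)                # encoder for the image I_k
        self.enc_geo = make_encoder(6)                # encoder for the tensor B_k
        self.attn = nn.Sequential(                    # global pooling + 1x1 conv + Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(512, 512, 1),
            nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(                 # simplified decoder stand-in (no skips)
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, image, B):                      # image: (1,3,H,W), B: (1,6,H,W), H,W divisible by 16
        f = torch.cat([self.enc_img(image), self.enc_geo(B)], dim=1)   # concatenated depth features
        fused = f * self.attn(f)                      # reweight channels with the channel weight matrix
        return self.decoder(fused)                    # shadow map S_k in [0, 1]
```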
(IV) Generation of the restored image set {Î_k}_{k=1}^K.
The input images are restored from the normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, yielding the restored image set {Î_k}_{k=1}^K. The specific method is as follows:
(1) after the normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k have been obtained, each input image I_k is synthesized and restored according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
thereby constructing the restored image set {Î_k}_{k=1}^K.
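The exact rendering formula appears only as an image in the patent; the form used above and in the sketch below (shadow times albedo times the clamped dot product of normal and light) is an assumption consistent with the surrounding text:

```python
import numpy as np

def render_lambertian(normals, light, albedo, shadow):
    """normals: (3, h, w) unit normals; light: (3,); albedo, shadow: (h, w) or (3, h, w)."""
    shading = np.maximum(0.0, np.einsum("chw,c->hw", normals, light))  # n . l, clamped at zero
    return shadow * albedo * shading                                   # restored image Î_k
```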
(V) Updating the parameters of the photometric stereo model.
The photometric stereo model is trained in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K. The specific method for updating the model parameters is as follows:
(1) the photometric stereo model is trained with a self-supervised strategy by minimizing the reconstruction loss function, and the network parameters are updated.
Description 6: form of the reconstruction loss function
The widely used mean square error (MSE) is adopted as the measure of image similarity, so the reconstruction loss can be written as
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, and i indexes the pixels.
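A minimal sketch of this self-supervised update; the optimizer, learning rate and the variable names in the usage comments are assumptions, not part of the patent:

```python
import torch

def reconstruction_loss(inputs: torch.Tensor, restored: torch.Tensor) -> torch.Tensor:
    """inputs, restored: (K, 3, H, W). Mean square error over all images and pixels."""
    return torch.mean((inputs - restored) ** 2)

# Hypothetical usage (names are placeholders):
# optimizer = torch.optim.Adam(photometric_stereo_net.parameters(), lr=1e-4)
# loss = reconstruction_loss(I, I_hat)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```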
(VI) Obtaining the final normal map recovery result.
The image set {I_k}_{k=1}^K is fed into the photometric stereo model with updated parameters to obtain the final normal map recovery result N*. The specific method is as follows:
(1) all images of the set {I_k}_{k=1}^K are fed into the photometric stereo model after self-supervised training; the model outputs the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k.
(VII) Reconstructing the surface of the target scene from the final normal map recovery result.
The specific method for three-dimensional reconstruction of the target scene surface from the final normal map recovery result N* is as follows:
(1) the gradient field is computed from the normal map recovery result N*, and the target surface is reconstructed under an integrability constraint. The reconstruction objective function is defined as
min_Z ∬_Ω || ∇Z(u, v) − g(u, v) ||² du dv,
where (u, v) are the pixel coordinates of a point in the image, Ω is the integration region, Z is the depth of the object surface, and g is the gradient field of the target scene. This objective is usually solved with the Poisson equation.
Description 7: obtaining the depth Z of the object surface
Let the surface of the target scene be represented by the depth equation Z = f(x, y), and let (p, q) denote the gradient field at a point of the scene; the normal vector at that point can then be written as
n = (−p, −q, 1) / √(p² + q² + 1).
On this basis, the surface depth equation Z = f(x, y) can be solved with the Poisson equation
∇²Z = ∂p/∂x + ∂q/∂y.
(2) A visualization program written with NumPy, an extension library of the Python programming language, is used to display the reconstructed surface. A sketch of the normal-to-depth integration is given below.
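A minimal sketch of one standard way to perform this integration, solving the Poisson equation in the Fourier domain (Frankot-Chellappa); this generic solver is consistent with the description above but is not necessarily the authors' exact implementation:

```python
import numpy as np

def normals_to_depth(normals: np.ndarray) -> np.ndarray:
    """normals: (3, h, w) unit normals (nx, ny, nz) -> depth map Z of shape (h, w)."""
    nx, ny, nz = normals
    nz = np.where(np.abs(nz) < 1e-6, 1e-6, nz)       # avoid division by zero
    p, q = -nx / nz, -ny / nz                        # gradient field dZ/dx, dZ/dy
    h, w = p.shape
    wx = np.fft.fftfreq(w) * 2.0 * np.pi             # frequency grids
    wy = np.fft.fftfreq(h) * 2.0 * np.pi
    u, v = np.meshgrid(wx, wy)
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                                # avoid 0/0 at the DC term
    Z_hat = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    Z_hat[0, 0] = 0.0                                # depth is recovered up to an offset
    return np.real(np.fft.ifft2(Z_hat))
```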
The following examples are presented to demonstrate the feasibility of the method of the present invention, as described in detail below:
the method of the invention was used for validation on the DiLiGenT dataset and the dunhuang mugrotto dataset. The DiLiGenT data lump contains 10 test scenes, each containing 96 pictures imaged under different lighting conditions. In addition, the DiLiGenT dataset also provides normal Truth (Ground Truth) and ray targeting information for each scene surface. The mongao cave dataset of dunhuang contains 15 scenes, each scene containing 13 pictures imaged under different lighting conditions, the test set does not contain normal Truth (Ground Truth) and ray calibration information for the scene surface. The experiments used the Mean Angle Error (MAE) function to quantitatively evaluate the three-dimensional reconstruction of DiLiGenT datasets. For the Dunhuang Mogao Grottoes dataset, the experiment only visually shows it, and does not quantitatively evaluate it, since the dataset does not contain truth values.
The results of testing on the DiLiGenT dataset with the present method and with the best existing photometric stereo three-dimensional reconstruction methods, shown in FIG. 4, indicate that the reconstruction results of the present method achieve a lower mean angular error and thus higher reconstruction accuracy, which demonstrates the effectiveness of the method. The visual reconstruction results in FIG. 5 show that, compared with the best existing method SDPS-Net, the present method reconstructs the surface of cultural relics in finer detail and therefore has higher practical value.

Claims (4)

1. A photometric stereo three-dimensional reconstruction method based on self-supervised learning, characterized by comprising the following steps:
Step 1: photograph a target scene multiple times under different illumination conditions to obtain an input image set {I_k}_{k=1}^K, where K is the number of input images;
Step 2: input the image set {I_k}_{k=1}^K into a photometric stereo model to obtain a coarse normal map recovery result N ∈ R^{3×P} and the illumination condition l_k ∈ R^3 of each image I_k, where P is the number of pixels in one image and 1 ≤ k ≤ K;
wherein the photometric stereo model is structured as a twin (shared-weight) neural network: features are extracted from each input image I_k separately, and the K features are fused into a fixed-size global feature by a max-pooling operation; the illumination condition of each image is estimated from the global feature together with that image's own feature, and the global feature is then fed into a decoder, which regresses the coarse normal map recovery result of the scene;
Step 3: input the image set {I_k}_{k=1}^K into a reflection map estimation model and a shadow estimation model to obtain the reflection map A of the scene and the shadow map S_k of each image, as follows:
(1) concatenate all images of the set {I_k}_{k=1}^K along the channel dimension and feed the result into the reflection map estimation model, whose output is the estimate of the scene reflection map A;
(2) concatenate the coarse normal map recovery result N with the illumination condition l_k along the channel dimension, and record the concatenation result as tensor B_k;
(3) feed each image I_k of the set {I_k}_{k=1}^K together with the corresponding tensor B_k into the shadow estimation model, whose output is the estimate of the shadow map S_k of image I_k; the shadow estimation model works as follows:
the shadow estimation model is structured as an encoder-decoder neural network comprising two encoders, each of which consists of four convolutional layers with 3 × 3 kernels and 64, 128, 256 and 256 kernels per layer, respectively; the input image I_k and the tensor B_k are fed into the two encoders, which extract two depth features; a feature fusion module follows the two encoders: it concatenates the two depth features along the channel dimension, passes the concatenated features through a global pooling layer and a convolutional layer with 1 × 1 kernels, and finally normalizes the result with a Sigmoid function to output a channel weight matrix; the channel weight matrix is multiplied with the concatenated depth features, fusing the two depth features into the final image depth feature; the final depth feature is fed into a decoder to obtain the estimate of the shadow map S_k;
Step 4: restore the input images from the coarse normal map recovery result N, the illumination conditions l_k, the reflection map A and the shadow maps S_k, obtaining a restored image set {Î_k}_{k=1}^K;
Step 5: train the photometric stereo model in a self-supervised manner according to the similarity between the input image set {I_k}_{k=1}^K and the restored image set {Î_k}_{k=1}^K, and update the parameters of the photometric stereo model;
Step 6: input the image set {I_k}_{k=1}^K into the photometric stereo model with updated parameters to obtain the optimized normal map recovery result N* and the illumination condition l_k* of each image I_k;
Step 7: perform three-dimensional reconstruction of the target scene from the final normal map recovery result N*.
2. The method as claimed in claim 1, wherein the reflection map estimation model is structured as an encoder-decoder neural network: the encoder consists of 6 convolutional layers and the decoder of 4 convolutional layers; a skip connection links each convolutional layer of the decoder to the corresponding shallow layer of the network.
3. The photometric stereo three-dimensional reconstruction method based on self-supervised learning as claimed in claim 1, wherein the specific method of Step 4 is as follows:
according to the Lambertian rendering formula
Î_k = S_k ⊙ A ⊙ max(0, Nᵀ l_k),
each input image I_k is restored using the normal map recovery result N, the illumination condition l_k, the reflection map A and the shadow map S_k, and the restored image set {Î_k}_{k=1}^K is constructed.
4. The photometric stereo three-dimensional reconstruction method based on self-supervised learning as claimed in claim 1, wherein the specific method of Step 5 is as follows:
(1) the mean square error (MSE) is selected as the measure of similarity between images;
(2) the photometric stereo model is trained in a self-supervised manner with the reconstruction loss function
L_recon = (1 / (K·P)) Σ_{k=1}^{K} Σ_{i=1}^{P} || I_k^i − Î_k^i ||²,
where K and P are the number of input images and the number of pixels in each image, respectively, and i indexes the pixels of an image;
(3) the photometric stereo model is trained until convergence, yielding the optimized photometric stereo model.
CN202210634582.3A 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning Pending CN114998507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210634582.3A CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210634582.3A CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Publications (1)

Publication Number Publication Date
CN114998507A true CN114998507A (en) 2022-09-02

Family

ID=83033836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210634582.3A Pending CN114998507A (en) 2022-06-07 2022-06-07 Photometric stereo three-dimensional reconstruction method based on self-supervised learning

Country Status (1)

Country Link
CN (1) CN114998507A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116105632A (en) * 2023-04-12 2023-05-12 四川大学 Self-supervision phase unwrapping method and device for structured light three-dimensional imaging
CN116883578A (en) * 2023-09-06 2023-10-13 腾讯科技(深圳)有限公司 Image processing method, device and related equipment
CN116883578B (en) * 2023-09-06 2023-12-19 腾讯科技(深圳)有限公司 Image processing method, device and related equipment

Similar Documents

Publication Publication Date Title
US11257272B2 (en) Generating synthetic image data for machine learning
CN112258390B (en) High-precision microscopic virtual learning resource generation method
US11334762B1 (en) Method for image analysis
Mayer et al. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation
CN114998507A (en) Photometric stereo three-dimensional reconstruction method based on self-supervised learning
CN110910437B (en) Depth prediction method for complex indoor scene
CN113112504A (en) Plant point cloud data segmentation method and system
CN113572962A (en) Outdoor natural scene illumination estimation method and device
US11875583B2 (en) Dataset generation method for self-supervised learning scene point cloud completion based on panoramas
CN110633628A (en) RGB image scene three-dimensional model reconstruction method based on artificial neural network
US20230419600A1 (en) Volumetric performance capture with neural rendering
CN114514561A (en) Neural light transmission
CN115457188A (en) 3D rendering display method and system based on fixation point
CN115082254A (en) Lean control digital twin system of transformer substation
Yeh et al. Photoscene: Photorealistic material and lighting transfer for indoor scenes
Shinohara et al. Point2color: 3d point cloud colorization using a conditional generative network and differentiable rendering for airborne lidar
CN113763231A (en) Model generation method, image perspective determination device, image perspective determination equipment and medium
CN115272599A (en) Three-dimensional semantic map construction method oriented to city information model
CN114332355A (en) Weak light multi-view geometric reconstruction method based on deep learning
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
Schambach et al. A multispectral light field dataset and framework for light field deep learning
CN116433822B (en) Neural radiation field training method, device, equipment and medium
CN116311218A (en) Noise plant point cloud semantic segmentation method and system based on self-attention feature fusion
CN116071278A (en) Unmanned aerial vehicle aerial image synthesis method, system, computer equipment and storage medium
CN115953447A (en) Point cloud consistency constraint monocular depth estimation method for 3D target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination