CN113592913B - Method for eliminating uncertainty of self-supervised three-dimensional reconstruction


Info

Publication number
CN113592913B
Authority
CN
China
Prior art keywords
image
uncertainty
optical flow
view
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110907900.4A
Other languages
Chinese (zh)
Other versions
CN113592913A (en)
Inventor
许鸿斌
周志鹏
乔宇
康文雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110907900.4A priority Critical patent/CN113592913B/en
Publication of CN113592913A publication Critical patent/CN113592913A/en
Application granted granted Critical
Publication of CN113592913B publication Critical patent/CN113592913B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/269Analysis of motion using gradient-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75Determining position or orientation of objects or cameras using feature-based methods involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30244Camera pose


Abstract

The invention discloses a method for eliminating uncertainty in self-supervised three-dimensional reconstruction. The method comprises the following steps: pre-training a deep-learning three-dimensional reconstruction model with a set first loss function as the target, wherein the model takes an image pair consisting of a reference view and a source view as input, and the first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, the latter characterizing the pseudo optical flow formed by pixels of the source view and their matching points in the reference view; then training the pre-trained model with a set second loss function as the optimization target, the second loss function being constructed by estimating an uncertainty mask for the pre-training stage, the uncertainty mask characterizing the valid regions of the input image. The invention requires no annotated data, overcomes the uncertainty problems in image reconstruction, and improves the accuracy and generalization ability of the model.

Description

Method for eliminating uncertainty of self-supervised three-dimensional reconstruction
Technical Field
The invention relates to the technical field of three-dimensional image reconstruction, and in particular to a method for eliminating uncertainty in self-supervised three-dimensional reconstruction.
Background
Multi-view stereo (MVS) aims to recover the three-dimensional structure of a scene from multi-view images and camera poses. Over the past decades, traditional multi-view stereo methods have made great progress, but their hand-crafted feature descriptors lack robustness when estimating the matching relationship between image pairs and are easily disturbed by factors such as noise or illumination.
In recent years, researchers have begun to introduce deep learning into the MVS pipeline and have achieved significant performance gains with methods such as MVSNet and R-MVSNet. These methods integrate the image-matching process into an end-to-end network that takes a series of multi-view images and camera parameters as input and directly outputs dense depth maps; the three-dimensional information of the whole scene is then recovered by fusing the depth maps of all views. In practice, however, these learning-based MVS methods have a significant drawback: they require large-scale datasets for training, and the high cost of acquiring three-dimensional annotations limits their wide application. To break free of this limitation, researchers have turned to unsupervised or self-supervised MVS. Existing self-supervised MVS methods mainly train the network through a proxy task based on image reconstruction: under the photometric stereo consistency assumption, an image of one view reconstructed from the predicted depth map and the images of other views should be consistent with the original image.
However, existing self-supervised MVS methods lack effective countermeasures against uncertainty factors such as color changes and object occlusion, which degrades the quality of the reconstruction.
Disclosure of Invention
The object of the present invention is to overcome the above drawbacks of the prior art and to provide a method for eliminating uncertainty in self-supervised three-dimensional reconstruction, the method comprising the following steps:
Step S1: pre-train a deep-learning three-dimensional reconstruction model with a set first loss function as the target. The model takes an image pair consisting of a reference view and a source view as input and extracts the corresponding depth map for three-dimensional image reconstruction. The first loss function is constructed from a photometric stereo consistency loss and a depth-optical flow consistency loss, where the photometric stereo consistency loss characterizes the difference between the reconstructed image and the reference image, and the depth-optical flow consistency loss characterizes the pseudo optical flow formed by pixels of the source view and their matching points in the reference view;
Step S2: train the pre-trained deep-learning three-dimensional reconstruction model with a set second loss function as the optimization target to obtain the optimized three-dimensional reconstruction model. The second loss function is constructed by estimating an uncertainty mask for the pre-training stage, the uncertainty mask characterizing the valid regions of the input image.
Compared with the prior art, the invention has the following advantages. To address the uncertainty caused by ambiguous foreground supervision, a cross-view optical flow-depth consistency constraint introduces additional matching information that strengthens the self-supervision signal. To address the uncertainty caused by invalid background interference, the uncertainty mask estimated during self-supervision is combined with the pseudo labels, which effectively filters out regions that could introduce erroneous supervision signals and improves the quality of the reconstruction.
Other features of the present invention and its advantages will become apparent from the following detailed description of exemplary embodiments of the invention, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 illustrates the differences and the uncertainty between fully supervised training and self-supervised training in existing MVS methods;
FIG. 2 is a schematic diagram of a visual comparison of uncertainty of full supervision and self-supervision signals in MVS in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of a method of eliminating self-supervising three-dimensional reconstruction uncertainty according to one embodiment of the present invention;
FIG. 4 is a process schematic diagram of a method of eliminating self-supervised three dimensional reconstruction uncertainty, according to one embodiment of the present invention;
FIG. 5 is a diagram illustrating a relative conversion relationship between depth information and cross-view optical flow according to one embodiment of the present invention;
FIG. 6 is a schematic diagram of a visual analysis of the self-supervised pre-training effect of optical flow signal guidance, according to one embodiment of the invention;
FIG. 7 is a visual analysis diagram of the uncertainty-mask-guided self-supervised post-training effect, according to one embodiment of the present invention;
FIG. 8 is a schematic diagram of an application process of a three-dimensional reconstruction model according to one embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
For a clear understanding of the invention, the uncertainty problems in existing self-supervised three-dimensional reconstruction are analyzed first. Referring to FIG. 1, FIG. 1(a) is a schematic diagram of the fully supervised training process, FIG. 1(b) is a schematic diagram of the self-supervised training process, and FIG. 1(c) illustrates the degree of uncertainty in the supervision signals of fully supervised and self-supervised training. Briefly, self-supervised MVS methods build a self-supervision signal through a proxy task based on image reconstruction, replacing the depth labels used in supervised methods. The intuitive justification is that, if the depth values estimated by the network are correct, then the image of one view reconstructed from the image of another view via the homography determined by those depth values should be consistent with the original image. While the effectiveness of such self-supervision signals has been demonstrated, the prior art explains their utility only intuitively and lacks a direct, concrete explanation, for example of where in the image the self-supervision signal is effective and where it is not. To investigate these questions, the epistemic uncertainty in self-supervised training is visualized with the Monte-Carlo Dropout (MC Dropout) method to provide an intuitive interpretation. As can be seen in FIG. 1(c), the background and border regions of the image carry more uncertainty than in fully supervised training.
To further analyze the causes of the uncertainty in self-supervised training, FIG. 2 visually compares the uncertainty of fully supervised and self-supervised (or unsupervised) signals in MVS. From this analysis, the uncertainties in existing self-supervised methods can be summarized as follows.
1) Uncertainty from ambiguous foreground supervision. As shown in FIG. 2(a), under the influence of additional factors such as color change and object occlusion (i.e., the circled portions), the image-reconstruction-based proxy supervision signal fails to satisfy photometric stereo consistency in MVS, so the self-supervision signal no longer carries correct depth information.
2) Uncertainty from invalid background interference. As shown in FIG. 2(b), the textureless regions of the image (i.e., the delineated portions) do not contain any valid matching cues and are typically discarded directly in fully supervised training. In self-supervised training, however, the entire image enters the proxy loss for image reconstruction, so textureless and other invalid regions are included as well; this introduces additional noise and invalid supervision signals and further degrades the trained depth estimates.
In view of the uncertainty problems in self-supervised methods, the invention provides a method for eliminating uncertainty in self-supervised three-dimensional reconstruction; with reference to FIG. 3 and FIG. 4, the method comprises the following steps.
Step S310: construct a deep-learning three-dimensional reconstruction model.
The basic flow of self-supervised three-dimensional reconstruction with deep learning is as follows: the multi-view images are fed into a depth estimation network; the extracted feature maps are projected onto the reference view through homography warping and a matching cost volume across views is built over a set of depth hypotheses, from which the depth map of the reference view is regressed; the depth maps of all views are fused to reconstruct the three-dimensional information of the whole scene; the difference between the reconstructed image and the original image is then measured by the self-supervised loss, and the network is trained until convergence.
The deep-learning three-dimensional reconstruction model may adopt various network types, such as MVSNet or R-MVSNet. In one embodiment, as shown in FIG. 4, the backbone network is MVSNet: the input is the images of N source views, which are projected onto the reference view according to the camera extrinsics. The variance of the feature maps across views is computed to build a cost volume, and a 3D convolutional network is then used to extract features. Several Monte-Carlo Dropout layers are embedded in the bottleneck of the 3D convolutional network. By default these Monte-Carlo Dropout layers are frozen and are activated only when an uncertainty mask needs to be estimated; the processing upon activation is described below.
It should be understood that any network modified on the basis of MVSNet may replace the backbone network, and that other types of three-dimensional reconstruction models may be employed, and that the invention is not limited to the type and specific structure of the three-dimensional reconstruction model. In addition, the number of source view images may be set as desired, e.g., to 2-8, etc., as the present invention is not limited thereto.
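As an illustration of this design, the following is a minimal PyTorch-style sketch (module and layer names such as MCDropout3d and CostRegNet are illustrative assumptions, not taken from the patent) of how Monte-Carlo Dropout layers can be embedded in the bottleneck of a 3D cost-volume regularization network and kept frozen until uncertainty estimation is needed:

```python
import torch
import torch.nn as nn

class MCDropout3d(nn.Module):
    """Dropout that can be switched on for Monte-Carlo sampling even in eval mode."""
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p
        self.mc_active = False  # frozen by default, as in the pre-training stage

    def forward(self, x):
        # When frozen, the layer is an identity map; when activated, dropout is applied
        # regardless of train/eval mode, so repeated forward passes give different
        # stochastic predictions.
        if not self.mc_active:
            return x
        return nn.functional.dropout3d(x, p=self.p, training=True)

class CostRegNet(nn.Module):
    """Toy 3D cost-volume regularization block with MC-Dropout in the bottleneck."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv3d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(
            nn.Conv3d(16, 16, 3, padding=1), nn.ReLU(),
            MCDropout3d(p=0.5),                    # embedded MC-Dropout layer
        )
        self.dec = nn.Conv3d(16, 1, 3, padding=1)  # regressed probability volume

    def forward(self, cost_volume):
        return self.dec(self.bottleneck(self.enc(cost_volume)))

def set_mc_dropout(model, active: bool):
    # Activate or freeze every MC-Dropout layer in the network.
    for m in model.modules():
        if isinstance(m, MCDropout3d):
            m.mc_active = active
```

During pre-training the layers stay inert (identity mappings); the post-training stage calls set_mc_dropout(model, True) and runs repeated forward passes to obtain stochastic depth samples.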
Step S320: pre-train the deep-learning three-dimensional reconstruction model in a self-supervised manner with the set loss function as the target, the loss function comprising a photometric stereo consistency loss and a depth-optical flow consistency loss.
Still referring to FIG. 4, the self-supervised pre-training stage logically contains two branches. The upper branch takes reference/source view pairs as input and produces the corresponding depth map. The lower branch, on the one hand, estimates a forward optical flow (from the reference-view image to the source-view image) and a backward optical flow (from the source-view image to the reference-view image) for each view pair formed by the reference view and a source view; on the other hand, it derives virtual (pseudo) optical flow from the depth map predicted by the upper branch. The pre-trained model is evaluated by fusing the matching information from these two sources.
Specifically, in the self-supervised pre-training stage, to address the uncertainty caused by ambiguous foreground supervision, a depth-optical flow consistency loss is added in addition to the basic photometric stereo consistency loss; the additional dense matching prior provided by the cross-view optical flow strengthens the robustness of the self-supervision signal.
1) Photometric stereo consistency loss
Let i = 1 denote the reference view and j (2 ≤ j ≤ V) denote a source view, V being the total number of views. Consider a one-to-many set of view images (I_1, I_j) together with their camera intrinsic and extrinsic parameters ([K_1, T_1], [K_j, T_j]); the output is the depth map D_1 of the reference view. The pixel p̂_i in the reference view corresponding to a pixel p_i of source view j can then be computed as
p̂_i = K_1·T_1·T_j^{-1}·K_j^{-1}·(D_j(p_i)·p_i)   (1)
where i (1 ≤ i ≤ HW) denotes the position index of the pixel in the image, H and W are the height and width of the image, and D_j denotes the depth map corresponding to source view j. The result of the homography projection of formula (1) is normalized to obtain the coordinates in the corresponding image:
Norm([x, y, z]^T) = [x/z, y/z, 1]^T   (2)
The image Î_j^1 is reconstructed from source view j by a differentiable bilinear interpolation operation. Since only part of the pixels have a valid mapping during image reconstruction, a binary mask M_j is also obtained to indicate the valid region of the reconstructed image. In one embodiment, the photometric stereo consistency loss compares the difference between the reconstructed image and the reference image as
L_pc = Σ_{j=2..V} (1/||M_j||_1) · ( ||(I_1 − Î_j^1) ⊙ M_j||_2 + ||(∇I_1 − ∇Î_j^1) ⊙ M_j||_2 )   (3)
where ∇ denotes the image gradient in the x and y directions, ⊙ denotes the point-wise product, I_1 denotes the reference image, and Î_j^1 denotes the image reconstructed from the image of source view j.
Self-supervised pre-training with the photometric stereo consistency loss keeps the photometric difference (e.g., in gray values) between the reconstructed image and the reference image as small as possible.
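For concreteness, the following sketch shows one common way to realize the differentiable warping and the masked photometric comparison described above, assuming PyTorch and the usual inverse-warping convention (each reference pixel is projected into the source view with the predicted reference depth and the source image is bilinearly sampled there); function names and the exact normalization are illustrative assumptions rather than the patent's literal formulas:

```python
import torch
import torch.nn.functional as F

def warp_src_to_ref(src_img, depth_ref, K_ref, T_ref, K_src, T_src):
    """Reconstruct the reference image from one source image (single sample).

    src_img:   (3, H, W) source-view image
    depth_ref: (H, W)    predicted depth map of the reference view
    K_*: (3, 3) intrinsics, T_*: (4, 4) world-to-camera extrinsics
    Returns the reconstructed image and the validity mask.
    """
    H, W = depth_ref.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float().reshape(3, -1)

    # Lift reference pixels to 3D with the predicted depth, move to the source
    # camera, and project: K_src * T_src * T_ref^{-1} * K_ref^{-1} * (d * p).
    cam_ref = torch.inverse(K_ref) @ pix * depth_ref.reshape(1, -1)
    cam_ref_h = torch.cat([cam_ref, torch.ones(1, H * W)], dim=0)
    cam_src = (T_src @ torch.inverse(T_ref) @ cam_ref_h)[:3]
    proj = K_src @ cam_src
    proj = proj[:2] / proj[2:3].clamp(min=1e-6)          # normalize by z, as in Norm(.)

    # Bilinear sampling grid in [-1, 1]; pixels that project outside the image are masked.
    gx = 2.0 * proj[0] / (W - 1) - 1.0
    gy = 2.0 * proj[1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1).reshape(1, H, W, 2)
    mask = ((gx.abs() <= 1) & (gy.abs() <= 1)).float().reshape(1, 1, H, W)
    recon = F.grid_sample(src_img.unsqueeze(0), grid, align_corners=True)
    return recon.squeeze(0), mask.squeeze(0)

def photometric_loss(ref_img, recon, mask):
    """Masked image and gradient differences between reference and reconstruction."""
    def grads(img):
        return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
    l_img = ((ref_img - recon).abs() * mask).sum() / mask.sum().clamp(min=1.0)
    gxr, gyr = grads(ref_img)
    gxw, gyw = grads(recon)
    l_grad = ((gxr - gxw).abs() * mask[..., :, 1:]).sum() / mask[..., :, 1:].sum().clamp(min=1.0) \
           + ((gyr - gyw).abs() * mask[..., 1:, :]).sum() / mask[..., 1:, :].sum().clamp(min=1.0)
    return l_img + l_grad
```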
2) Cross-view optical flow-depth consistency loss
To address foreground supervision ambiguity in self-supervised MVS, a new cross-view optical flow-depth consistency loss (i.e., the depth-optical flow consistency loss) is further proposed. Still referring to FIG. 4, in one embodiment the calculation of this loss involves two sub-modules: an image-to-optical-flow module and a depth-to-optical-flow module. The depth-to-optical-flow module is fully differentiable; it converts the matching information contained in the depth map into an optical flow map between the reference view and any other view, and can be embedded in any network. The image-to-optical-flow module estimates optical flow directly from the original images with an unsupervised method, including, for example, forward optical flow (e.g., I_1 -> I_2, I_1 -> I_3) and backward optical flow (e.g., I_2 -> I_1, I_3 -> I_1). When computing the optical flow-depth consistency loss, the flow maps output by the two sub-modules are compared and required to be as similar as possible.
Specifically, a schematic diagram of the depth-to-optical-flow module is shown in FIG. 5. In an MVS setup, images of different views are acquired by moving the camera, and depth information is recovered from the pixel matching relationships between the multiple views. Conversely, one may imagine a virtual camera that does not move while the scene undergoes the equivalent relative motion; in such a relative-motion scenario, the matching information contained in the depth map can be converted into the matching information of a pseudo optical flow map. The detailed derivation is as follows.
Define the virtual optical flow described above as F̂_1j(p_i) = p̂_i^j − p_i, where p̂_i^j denotes the pixel of source view j that matches the point p_i of the reference view and the difference is the resulting optical flow. According to the homography projection formula, this gives
F̂_1j(p_i) = Norm( K_j·T_j·T_1^{-1}·K_1^{-1}·D_1(p_i)·p_i ) − p_i   (4)
The matching information of the depth map can thus be represented in the form of an optical flow map according to formula (4), and the whole process is fully differentiable.
For the image-to-optical-flow module, the reference view and the remaining source views of the current multi-view dataset are grouped into view pairs. With these multi-view pairs, an optical flow learning network (e.g., PWC-Net) is pre-trained on the constructed dataset without supervision, using a self-supervised method. The image-to-optical-flow module takes a view pair consisting of any reference view and source view as input and outputs the forward flow map F_1j (reference view -> source view) and the backward flow map F_j1 (source view -> reference view) between the two views.
3) Calculation of the cross-view optical flow-depth consistency loss function.
In the depth-to-optical-flow module, the predicted depth map D_1 is converted into the virtual cross-view optical flow F̂_1j. In the image-to-optical-flow module, the outputs are the forward flow F_1j and the backward flow F_j1. The matching information of F_1j and F̂_1j should be consistent.
First, for pixels that are not occluded, the forward flow F_1j and the backward flow F_j1 should have opposite values. To avoid interference from occluded regions when computing the loss, an occlusion mask O_1j is computed from the forward flow F_1j and the backward flow F_j1, expressed as:
O_1j = { |F_1j + F_j1| > ε }   (5)
where ε is a threshold that can be set according to the required accuracy, e.g., ε = 0.5.
The optical flow-depth consistency loss can then be calculated, expressed as:
L_fc = min_j [ (1/Σ_i (1 − O_1j(p_i))) · Σ_{i=1..HW} (1 − O_1j(p_i)) · | F_1j(p_i) − F̂_1j(p_i) | ]   (6)
where F_1j(p_i) denotes the flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̂_1j(p_i) denotes the flow value of pixel p_i in the virtual flow map from the reference view to source view j.
In this embodiment, the minimum operation selects, among all reference/source view pairs, the pair with the smallest error for the loss calculation; taking the noise of the optical flow maps themselves into account, this reduces the influence of optical flow noise.
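A minimal sketch of the occlusion mask of formula (5) and the minimum-error flow-depth consistency loss of formula (6), assuming per-view lists of forward flows, backward flows, and virtual flows shaped (2, H, W); for brevity the backward flow is compared at the same pixel rather than resampled at the forward-warped location, which is a simplification of the forward-backward check:

```python
import torch

def occlusion_mask(flow_fwd, flow_bwd, eps=0.5):
    """Pixels where the forward and backward flows do not roughly cancel are occluded."""
    return (flow_fwd + flow_bwd).norm(dim=0) > eps            # True = occluded

def flow_depth_consistency_loss(flows_fwd, flows_bwd, virtual_flows, eps=0.5):
    """Compare estimated forward flow with the depth-derived virtual flow, ignore
    occluded pixels, and keep the best-matching (minimum-error) source view."""
    losses = []
    for f_fwd, f_bwd, f_virt in zip(flows_fwd, flows_bwd, virtual_flows):
        valid = (~occlusion_mask(f_fwd, f_bwd, eps)).float()
        err = (f_fwd - f_virt).abs().sum(dim=0)               # per-pixel L1 flow error
        losses.append((err * valid).sum() / valid.sum().clamp(min=1.0))
    return torch.stack(losses).min()                          # minimum over source views
```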
4) Calculation of the total loss in the self-supervised pre-training stage
In the self-supervised pre-training stage, the photometric stereo consistency loss and the optical flow-depth consistency loss are balanced to improve training accuracy and generalization; the two losses are combined into an overall loss, for example expressed as:
L_ssp = L_pc + λ·L_fc   (7)
where L_pc denotes the photometric stereo consistency loss, L_fc denotes the optical flow-depth consistency loss, and λ is a constant set to balance the scales of the two losses, e.g., λ = 0.1.
FIG. 6 is a visual analysis of the optical-flow-guided self-supervised pre-training effect, with the result without optical flow guidance on the left and the result with optical flow guidance on the right. It can be seen that computing the occlusion mask from the forward and backward optical flows and using it in the depth-optical flow consistency loss makes the interference of occluded regions perceptible. Introducing the additional matching relationships provided by optical flow strengthens the constraint of the self-supervision signal and enlarges its effective area.
Step S330: further train the pre-trained model by estimating the uncertainty mask of the self-supervision process and combining it with pseudo labels, so as to filter out regions that introduce erroneous supervision signals.
After the pre-trained model is obtained, further training, referred to herein as the pseudo-label post-training stage, may be performed to improve the accuracy and generalization ability of the model. In this stage, the uncertainty of the self-supervision process is first estimated, and the estimated uncertainty is then introduced into a self-training consistency loss to guide the training of the final model. Multi-view image pairs with random data augmentation may further be employed for training.
Specifically, to address background noise interference, invalid regions such as textureless regions are filtered out by an uncertainty mask during the pseudo-label post-training stage, since these invalid regions contain no information useful to the self-supervision signal. The uncertainty mask of the self-supervision process is estimated, for example, by activating Monte-Carlo Dropout, and is then incorporated into the loss to filter out the invalid regions.
1) Uncertainty estimation
In essence, uncertainty describes the degree of doubt about the model output. Monte-Carlo Dropout is added to the bottleneck of the 3D convolutional network of the model and, to avoid over-fitting, the loss of the self-supervised pre-training stage is preferably modified to introduce an uncertainty regularization term, expressed as:
L'_ssp = L_ssp / σ² + log σ²   (8)
where σ² is the aleatoric uncertainty, representing the noise contained in the data itself.
In one embodiment, a 6-layer CNN (convolutional neural network) predicts a pixel-wise aleatoric uncertainty map. The loss function of the self-supervised pre-training stage is then modified as in formula (8) above to support the training of the uncertainty estimation.
Random Monte-Carlo Dropout is in effect equivalent to sampling different model weights: W_t ~ q_θ(W, t), where q_θ(W, t) denotes the distribution induced by Dropout. Let the model weights of the t-th sample be W_t and the corresponding predicted depth map be D_{1,t}. The uncertainty of the model can then be estimated as
Û = (1/T)·Σ_{t=1..T} D_{1,t}² − ( (1/T)·Σ_{t=1..T} D_{1,t} )² + (1/T)·Σ_{t=1..T} σ_t²   (9)
where D_{1,t} is the result of the t-th sampling, σ_t denotes the aleatoric uncertainty map of the t-th sample, and T is the number of samples. In one embodiment, the average of these T samples is used as the pseudo label:
D̂_1 = (1/T)·Σ_{t=1..T} D_{1,t}   (10)
Compared with a full Bayesian neural network, approximating it by embedding Monte-Carlo Dropout layers, as in this embodiment, greatly reduces the computational cost. In one practical application, the dropout rate may be set to 0.5; the more samples are drawn, the closer the estimate is to the ideal, with, e.g., 20 samples by default.
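A sketch of the Monte-Carlo Dropout sampling used to obtain the pseudo label and the uncertainty map (formulas (9)-(10)), assuming the model returns a depth map together with a pixel-wise aleatoric map σ_t² per forward pass and reusing the set_mc_dropout helper from the earlier sketch; all names are illustrative:

```python
import torch

@torch.no_grad()
def estimate_pseudo_label_and_uncertainty(model, batch, n_samples=20):
    """Approximate Bayesian inference by Monte-Carlo Dropout.

    The pre-trained model is run T times with its MC-Dropout layers activated; the
    mean depth over the samples serves as the pseudo label, and the predictive
    uncertainty combines the variance of the sampled depths (epistemic part) with
    the mean of the predicted aleatoric maps.
    """
    set_mc_dropout(model, True)                 # activate the embedded Dropout layers
    depths, sigma2s = [], []
    for _ in range(n_samples):
        depth, sigma2 = model(batch)            # stochastic forward pass (sampled weights)
        depths.append(depth)
        sigma2s.append(sigma2)
    set_mc_dropout(model, False)                # freeze them again afterwards

    depths = torch.stack(depths, dim=0)         # (T, H, W)
    sigma2s = torch.stack(sigma2s, dim=0)
    pseudo_label = depths.mean(dim=0)           # pseudo label: mean of the T samples
    epistemic = depths.pow(2).mean(dim=0) - depths.mean(dim=0).pow(2)
    uncertainty = epistemic + sigma2s.mean(dim=0)
    return pseudo_label, uncertainty
```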
2) Uncertainty-aware self-training consistency loss
To mitigate the interference of regions with higher uncertainty, an uncertainty-aware self-training consistency loss is constructed from the generated pseudo label and the uncertainty mask.
First, a binary mask M_U is computed from the learned uncertainty, expressed as:
M_U(p_i) = { Û(p_i) < ξ }   (11)
where ξ is a set threshold, which may be chosen according to the accuracy requirement of the uncertainty estimate, e.g., ξ = 0.3.
Next, the self-training consistency loss is calculated, expressed as:
L_uc = (1/Σ_i M_U(p_i)) · Σ_{i=1..HW} M_U(p_i) · | D_{1,τ}(p_i) − D̂_1(p_i) |   (12)
where D_{1,τ} denotes the depth map predicted by the network from the randomly augmented input. In the embodiment of the invention, the data augmentation contains no positional transformation, only strategies such as random illumination change, color jitter, and occlusion masks.
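A sketch of the uncertainty-aware self-training consistency loss of formulas (11)-(12), assuming the pseudo label and uncertainty map produced above and a depth map predicted from the randomly augmented input; the threshold ξ and the variable names are illustrative:

```python
import torch

def self_training_consistency_loss(depth_aug, pseudo_label, uncertainty, xi=0.3):
    """Uncertainty-aware self-training loss.

    depth_aug:    depth predicted from the randomly augmented input (D_{1,tau})
    pseudo_label: mean depth from the MC-Dropout samples
    uncertainty:  estimated per-pixel uncertainty map
    Only pixels whose uncertainty is below the threshold xi contribute, so regions
    likely to carry erroneous supervision are filtered out.
    """
    mask = (uncertainty < xi).float()                       # binary uncertainty mask
    diff = (depth_aug - pseudo_label).abs()
    return (diff * mask).sum() / mask.sum().clamp(min=1.0)
```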
Verification shows that estimating the uncertainty mask of the self-supervision process with Monte-Carlo Dropout and combining it with the pseudo label effectively suppresses noisy supervision signals that may exist in the pseudo label. FIG. 7 is a visual analysis of the uncertainty-mask-guided self-supervised post-training effect, with the result without uncertainty guidance on the left and the result with uncertainty guidance on the right. In contrast to training directly on pseudo labels that contain uncertain results, the uncertainty-mask-guided post-training of the present invention effectively filters out regions that could introduce erroneous supervision signals.
In summary, the self-supervised MVS framework provided by the invention is divided into a self-supervised pre-training stage and a pseudo-label post-training stage. The pre-training stage is trained with L_ssp; since Monte-Carlo Dropout and uncertainty estimation are needed in the subsequent stage, L_ssp may also be modified into L'_ssp. In the pseudo-label post-training stage, the pseudo label and the uncertainty mask are first estimated with Monte-Carlo sampling and the pre-trained model, and the self-training loss L_uc is then computed to obtain the final model.
Compared with the prior art, the method adapts better to natural scenes. Intuitively, the uncertainty estimated by the invention naturally covers the various kinds of noise, occlusion changes, and textureless background regions found in natural scenes; during self-supervised training, the influence of these uncertainty factors on the supervision process is effectively suppressed, ensuring a better training result. Experimentally, the deep-learning three-dimensional reconstruction model trained by the method achieves leading results on the public natural-scene three-dimensional reconstruction benchmark Tanks and Temples without any fine-tuning. In Table 1 below, the last row is the result of the invention and the other rows are prior-art results. The second column indicates whether real three-dimensional labels are used for training; the third column is the overall three-dimensional reconstruction score on real scenes, provided by the online evaluation server of the dataset (higher is better); the fourth through eleventh columns give the reconstruction scores on eight different real scenes.
Table 1: data set effect contrast
In addition, compared with supervised methods that require annotated datasets, the method is trained in a completely unsupervised manner: the whole training process uses only the original multi-view images and camera parameters and requires no three-dimensional annotation. By combining optical flow matching information with uncertainty estimation to guide training, the resulting optimized model achieves reconstruction performance that is, in some scenes, no weaker than that of supervised methods. The final model can be used for three-dimensional reconstruction in various scenes, for example by embedding it in an electronic device, including but not limited to a mobile phone, a terminal, a wearable device, or a computer device. Referring to FIG. 8, the basic procedure on a terminal is: the user opens an application on the terminal device and records and uploads a video; the terminal device splits the video into a number of frames to build multi-view image pairs; the camera poses are solved from the camera intrinsics and the multi-view image pairs (bundle adjustment); depth estimation is performed on the multi-view images by the trained deep-learning three-dimensional reconstruction model; the depth information of the multiple views is fused to obtain the three-dimensional information of the scene; and the terminal device displays the three-dimensional model to the user.
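As an illustration of the on-device flow in FIG. 8, the following orchestration sketch shows the sequence of steps; the three callables (pose solver, depth model, depth fusion) are placeholders for whatever components a deployment actually uses, not APIs defined by the patent:

```python
from typing import Callable, Sequence

def reconstruct_scene(frames: Sequence,                      # frames sampled from the recorded video
                      intrinsics,                            # camera intrinsic matrix
                      solve_poses: Callable,                 # e.g. a bundle-adjustment routine
                      depth_model: Callable,                 # trained 3D-reconstruction model
                      fuse_depths: Callable):                # multi-view depth fusion
    """Illustrative orchestration of the terminal-side reconstruction pipeline."""
    poses = solve_poses(frames, intrinsics)                  # camera poses via bundle adjustment

    depth_maps = []
    for ref_idx in range(len(frames)):
        # Build a reference/source multi-view pair for each frame and estimate its depth.
        src_idx = [j for j in range(len(frames)) if j != ref_idx]
        depth_maps.append(depth_model(frames, poses, intrinsics, ref_idx, src_idx))

    # Fuse the per-view depth maps into a single 3D model of the scene.
    return fuse_depths(depth_maps, poses, intrinsics)
```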
In summary, the invention focuses on the uncertainty problem in self-supervised MVS, which directly corresponds to the interference signals contained in the proxy supervision target in natural scenes, i.e., uncertainty in the foreground and the background such as occlusion, illumination change, and textureless regions. Conventional training cannot cope effectively with these two kinds of uncertainty, because it feeds the regions containing erroneous supervision signals directly into the training process, which inevitably affects the final result. With its enhanced generalization ability, the method can also be applied in cross-dataset scenarios.
It will be appreciated by those skilled in the art that changes may be made to the embodiments described above without departing from the spirit and scope of the invention. For example, besides the Monte-Carlo Dropout method, a Bayesian network could be applied to estimate the uncertainty of the self-supervision process; however, the training cost of a Bayesian network is high in practice, it is difficult to embed in the network of the present framework, and the resulting model is too large to fit on a common graphics processing unit (e.g., 1080/2080 Ti) for training. The invention therefore preferably approximates the Bayesian sampling process by embedding Monte-Carlo Dropout, which reduces the model size and the computational cost of uncertainty estimation. As another example, the overall loss of the pre-training stage may be weighted in other ways, such as exponentially.
The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static Random Access Memory (SRAM), portable compact disk read-only memory (CD-ROM), digital Versatile Disks (DVD), memory sticks, floppy disks, mechanical coding devices, punch cards or in-groove structures such as punch cards or grooves having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, c++, python, and the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information for computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (9)

1. A method of eliminating self-supervised three dimensional reconstruction uncertainty, comprising the steps of:
constructing a multi-view image pair using the photographed image;
solving the pose of the camera according to the camera internal parameters and the multi-view image pairs;
inputting the multi-view image pair into an optimized three-dimensional reconstruction model to perform depth estimation on the multi-view image;
the depth information under multiple visual angles is fused to obtain three-dimensional information of a scene, and then an image three-dimensional model is obtained;
wherein the optimized three-dimensional reconstruction model is obtained according to the following steps:
step S1: the method comprises the steps of pre-training a deep learning three-dimensional reconstruction model by taking a set first loss function as a target, wherein the deep learning three-dimensional reconstruction model takes an image pair consisting of a reference view angle and a source view angle as input, extracting a corresponding depth image for three-dimensional image reconstruction, constructing the first loss function based on a luminosity three-dimensional consistency loss and a depth optical flow consistency loss, wherein the luminosity three-dimensional consistency loss represents the difference between a reconstructed image and the reference image, and the depth optical flow consistency loss represents pseudo optical flow information formed by pixels of the source view angle and matching points of the pixels of the source view angle under the reference view angle;
step S2: training the pre-trained deep learning three-dimensional reconstruction model by taking the set second loss function as an optimization target to obtain the optimized three-dimensional reconstruction model, wherein the second loss function is constructed by estimating an uncertainty mask of a pre-training stage, and the uncertainty mask is used for representing an effective area in an input image.
2. The method of claim 1, wherein the first loss function is set to:
L_ssp = L_pc + λ·L_fc
wherein L_pc denotes the photometric stereo consistency loss, L_fc denotes the depth-optical flow consistency loss, and λ is a set constant.
3. The method of claim 2, wherein the photometric stereo consistency loss is calculated according to the following steps:
calculating the pixel p̂_i in the reference view corresponding to a pixel p_i of source view j, expressed as:
p̂_i = K_1·T_1·T_j^{-1}·K_j^{-1}·(D_j(p_i)·p_i)
normalizing the result to obtain the coordinates in the corresponding image:
Norm([x, y, z]^T) = [x/z, y/z, 1]^T
obtaining the corresponding reconstructed image Î_j^1 from the image of source view j by a differentiable bilinear interpolation operation, and obtaining, during the image reconstruction, a binary mask M_j for representing the valid region of the reconstructed image;
based on the obtained binary mask M_j, calculating the photometric stereo consistency loss as:
L_pc = Σ_{j=2..V} (1/||M_j||_1) · ( ||(I_1 − Î_j^1) ⊙ M_j||_2 + ||(∇I_1 − ∇Î_j^1) ⊙ M_j||_2 )
wherein ∇ denotes the gradient of the image in the x and y directions, ⊙ denotes the point-wise operation, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, V is the total number of views, [K_1, T_1], [K_j, T_j] are the camera intrinsic and extrinsic parameters corresponding to the one-to-many view images (I_1, I_j), D_j denotes the depth map corresponding to source view j, I_1 denotes the reference view image, and Î_j^1 denotes the image reconstructed from the image of source view j.
4. The method of claim 2, wherein the depth-optical flow consistency loss is calculated according to the following steps:
pre-training an optical flow learning network with the dataset, the input being a view pair consisting of the reference view and a source view, and the output being the forward flow map F_1j and the backward flow map F_j1 between the reference view and the source view;
converting the predicted depth map D_1 into the virtual cross-view optical flow F̂_1j;
calculating an occlusion mask O_1j from the forward flow F_1j and the backward flow F_j1, expressed as:
O_1j = { |F_1j + F_j1| > ε }
calculating the depth-optical flow consistency loss as:
L_fc = min_j [ (1/Σ_i (1 − O_1j(p_i))) · Σ_{i=1..HW} (1 − O_1j(p_i)) · | F_1j(p_i) − F̂_1j(p_i) | ]
wherein ε is a set threshold, i (1 ≤ i ≤ HW) denotes the position index of a pixel in the image, H and W denote the height and width of the image respectively, p_i is a pixel in source view j, D_1 is the depth map at the predicted reference view corresponding to the one-to-many view images, F_1j(p_i) denotes the flow value of pixel p_i in the forward flow map from the reference view to source view j, and F̂_1j(p_i) denotes the flow value of pixel p_i in the virtual flow map from the reference view to source view j.
5. The method of claim 1, wherein a Monte-Carlo Dropout layer is provided in the bottleneck layer of the deep-learning three-dimensional reconstruction model for estimating the uncertainty of the pre-training process by multiple sampling.
6. The method of claim 5, wherein the second loss function is calculated according to the following steps:
estimating the uncertainty of the model by sampling different pre-trained model weights, expressed as:
Û = (1/T)·Σ_{t=1..T} D_{1,t}² − ( (1/T)·Σ_{t=1..T} D_{1,t} )² + (1/T)·Σ_{t=1..T} σ_t²
wherein Û is the estimated uncertainty, T is the number of samples, and D_{1,t} is the depth map predicted at the t-th sampling;
using the mean of the T samples as the pseudo label, expressed as:
D̂_1 = (1/T)·Σ_{t=1..T} D_{1,t}
calculating a binary mask M_U based on the estimated uncertainty;
constructing the second loss function with the generated pseudo label and the uncertainty binary mask, expressed as:
L_uc = (1/Σ_i M_U(p_i)) · Σ_{i=1..HW} M_U(p_i) · | D_{1,τ}(p_i) − D̂_1(p_i) |
wherein M_U(p_i) = { Û(p_i) < ξ }, ξ denotes the set threshold, D_{1,τ} denotes the depth map predicted in step S2, and σ_t denotes the aleatoric uncertainty map corresponding to the t-th sample.
7. The method of claim 2, wherein the first loss function is modified to:
L'_ssp = L_ssp / σ² + log σ²
wherein σ² is the aleatoric uncertainty.
8. A computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
9. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the program.
CN202110907900.4A 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction Active CN113592913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110907900.4A CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Publications (2)

Publication Number Publication Date
CN113592913A CN113592913A (en) 2021-11-02
CN113592913B true CN113592913B (en) 2023-12-26

Family

ID=78256351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110907900.4A Active CN113592913B (en) 2021-08-09 2021-08-09 Method for eliminating uncertainty of self-supervision three-dimensional reconstruction

Country Status (1)

Country Link
CN (1) CN113592913B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782911B (en) * 2022-06-20 2022-09-16 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle
CN114820755B (en) * 2022-06-24 2022-10-04 武汉图科智能科技有限公司 Depth map estimation method and system
CN117218715A (en) * 2023-08-04 2023-12-12 广西壮族自治区通信产业服务有限公司技术服务分公司 Method, system, equipment and storage medium for identifying few-sample key nodes
CN116912148B (en) * 2023-09-12 2024-01-05 深圳思谋信息科技有限公司 Image enhancement method, device, computer equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110246212A (en) * 2019-05-05 2019-09-17 上海工程技术大学 A kind of target three-dimensional rebuilding method based on self-supervisory study
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288818B2 (en) * 2019-02-19 2022-03-29 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
US11315274B2 (en) * 2019-09-20 2022-04-26 Google Llc Depth determination for images captured with a moving camera and representing moving features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163246A (en) * 2019-04-08 2019-08-23 杭州电子科技大学 The unsupervised depth estimation method of monocular light field image based on convolutional neural networks
CN110246212A (en) * 2019-05-05 2019-09-17 上海工程技术大学 A kind of target three-dimensional rebuilding method based on self-supervisory study
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN113066168A (en) * 2021-04-08 2021-07-02 云南大学 Multi-view stereo network three-dimensional reconstruction method and system

Also Published As

Publication number Publication date
CN113592913A (en) 2021-11-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant