CN115496787A - Monocular depth estimation method for visual augmented reality of wearable intelligent equipment - Google Patents

Monocular depth estimation method for visual augmented reality of wearable intelligent equipment

Info

Publication number
CN115496787A
Authority
CN
China
Prior art keywords
network
depth
image
loss
order
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211171962.4A
Other languages
Chinese (zh)
Inventor
程德强
韩成功
寇旗旗
刘海
徐飞翔
王晓艺
刘敬敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202211171962.4A priority Critical patent/CN115496787A/en
Publication of CN115496787A publication Critical patent/CN115496787A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention relates to a monocular depth estimation method for visual augmented reality of wearable intelligent equipment, belonging to the technical field of image processing. First, image features are segmented with an image activity measure: based on the multidirectional distribution of image contours, the input image is segmented into high-order and low-order features, which are fused and fed to a common decoder to improve the network's perception of different features. Secondly, a recursive-network-based depth similarity loss is introduced: the depth map estimated by the network is reconstructed and fed, in a recursive manner, to a depth estimation network of identical structure; the similarity of the input images of the first-order and second-order networks is constrained by the photometric loss, and the similarity of their output depth maps is constrained by the depth similarity loss, compensating for the weak constraint of the photometric loss in low-texture regions. The depth consistency loss measures similarity with the cosine similarity of style matrices.

Description

Monocular depth estimation method for visual augmented reality of wearable intelligent equipment
Technical Field
The invention relates to an image processing technology, in particular to a monocular depth estimation method for visual augmented reality of wearable intelligent equipment.
Background
Monocular depth estimation is widely used in augmented reality, smart devices, robot navigation and autonomous driving. Compared with conventional Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM) algorithms, monocular depth estimation can obtain the relative depth of a scene without ground truth and thereby determine the relative positional relationship of the objects in an image. Ground truth is typically acquired with expensive lidar or rendered by computer simulation engines; however, lidar is poorly suited to backgrounds with frequent scene changes, and simulation engines generalize poorly to real scenes. Self-supervised monocular depth estimation unifies these tasks in a single framework, takes monocular video as input, and uses a view-synthesis self-supervision constraint, yielding a depth estimation algorithm that is easy to deploy and simple to configure. With the continuous growth of computing power, the information-mining capability of data-driven deep learning algorithms keeps strengthening, making it possible to obtain depth information from a single image. Depth estimation from a single image amounts to establishing a mapping relationship and is an ill-posed problem: unlike a depth sensor, it cannot recover the absolute depth relationship between objects and can only obtain the relative depth of each object in the field of view. In practical applications, obtaining the relative depth between objects allows the relative positional relationship of objects in a scene to be calculated, meeting the needs of three-dimensional reconstruction tasks.
Depth estimation algorithms based on a single image fall into supervised and self-supervised approaches. Supervised algorithms require ground-truth assistance and are therefore hard to deploy. Self-supervised algorithms establish the loss from photometric consistency alone and need no ground truth; since a large amount of real-world three-dimensional data has no ground truth, the self-supervised setting better matches the actual situation, and self-supervised methods have gradually become the main research direction in depth estimation. With the development of multi-task frameworks and auxiliary supervision, the metrics and performance of self-supervised algorithms have surpassed older supervised algorithms and are gradually approaching the latest supervised ones. However, self-supervised depth estimation still has shortcomings: the photometric loss is small for pixels in textureless regions and thus provides a weak constraint there, which can introduce errors into the depth estimation of the whole network.
Monocular depth estimation is a long-standing problem in computer vision. In 2016, Laina proposed FCRN, which replaced the fully connected layers with up-convolution-like layers, reducing the parameter count and accommodating input pictures of all sizes. The main research direction of recent supervised monocular depth estimation has shifted from improving accuracy to simplifying models and reducing network parameters. Self-supervised depth estimation has also been extensively studied; Godard and Zhou were among the first self-supervised monocular depth estimation methods that train a depth network together with an independent pose network (Godard C, Aodha O M, Brostow G J. Unsupervised Monocular Depth Estimation with Left-Right Consistency [C]. Computer Vision & Pattern Recognition, 2017.) (Zhou J, Wang Y, Qin K, et al. Unsupervised High-Resolution Depth Learning From Videos With Dual Networks [C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.). Recent methods have achieved good results in directions such as modeling the scene flow of moving objects and warping moving objects with individual projections. With auxiliary supervision and multi-task construction, the modeling of moving objects has become increasingly accurate, the occlusion problem has been steadily optimized, and the performance of self-supervised monocular depth estimation networks keeps improving.
The existing monocular depth estimation methods have the following defects:
First, monocular depth estimation uses the warped re-projection of adjacent frames to establish the self-supervised constraint. Even when the depth is estimated well, errors introduced by the warped re-projection can still produce an erroneous constraint, which is especially evident in regions with complex texture: at tree edges and where human bodies overlap, depth estimation performance drops markedly.
Secondly, in many scenes the existing photometric constraint cannot provide an effective constraint relationship; because interpolation-based re-projection depends strongly on the original image, the photometric constraint degrades markedly when the positional relationships in the original image are blurred.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a monocular depth estimation method for visual augmented reality of wearable intelligent equipment.
Firstly, an Image Activity Measure (IAM) is adopted to process complex-texture regions separately, enhancing the network's feature extraction capability in those regions.
Secondly, a recursive network is adopted to apply a secondary constraint, a depth consistency loss is proposed, and similarity is measured with the cosine similarity of style matrices.
The technical scheme of the invention is as follows: first, image features are segmented with an image activity measure; based on the multidirectional distribution of image contours, the input image is segmented into high-order and low-order features, which are fused and fed to a common decoder to improve the network's perception of different features.
Secondly, a recursive-network-based depth similarity loss is introduced: the depth map estimated by the network is reconstructed and fed, in a recursive manner, to a depth estimation network of identical structure; the similarity of the input images of the first-order and second-order networks is constrained by the photometric loss, and the similarity of their output depth maps is constrained by the depth similarity loss, compensating for the weak constraint of the photometric loss in low-texture regions. The depth consistency loss measures similarity with the cosine similarity of style matrices.
Further, the image activity is measured with an image variance method (IAM).
Further, the image gradient method calculates the variance of the image block along four directions: the horizontal direction, the vertical direction, the lower-left diagonal direction and the lower-right diagonal direction.
Further, the calculation formula of the image variance method is as follows, where α ∈ [0,1] is a weight factor representing the proportion of each of the two variance sums, and α is set to 0.5 here; V1 is the sum of the variances about the center pixel along the lower-left and lower-right diagonals, V2 is the sum of the variances about the center pixel along the horizontal and vertical directions, M and N are the length and width of the image block, and b_{i,j} is the pixel at position [i, j] in the image block.
[The formulas defining V1, V2 and the combined IAM are given as images in the original publication.]
Further, the depth similarity loss based on the recursive network is specifically defined as follows:
L_dc = α1·||D_t, D′_t|| + α2·||I′_t, I″_t|| + α3·||I_t, I″_t||
where α1, α2, α3 are consistency weights and ||·,·|| denotes the similarity loss calculation; I_t is the input picture, I′_t is the reconstructed image, D′_t is the depth map estimated from image I′_t, I″_t is the reconstructed image recovered using the pose parameters estimated by the second-order pose network, D_t is the output of the depth estimation network, and L_dc is the depth consistency loss.
Further, the total loss L of the network is obtained jointly from the photometric loss L_pe, the smoothness loss L_s and the depth consistency loss L_dc:
L = L_pe + L_s + L_dc
has the advantages that:
firstly, the invention uses Image Activity Measure (IAM) to segment Image characteristics, based on the multi-directional distribution of Image contour, the input Image is segmented into high-order characteristics and low-order characteristics, the high-order characteristics and the low-order characteristics are subjected to characteristic fusion and sent to a common decoder to improve the network experience intensity for different characteristics, and the texture review area depth estimation performance is optimized.
Secondly, the invention provides a depth similarity loss based on a recursive network, aiming at a depth map estimated by the network, the depth map is sent to the depth estimation network with the same structure after being reconstructed in a recursive network mode, the similarity of the input images of the first-order network and the second-order network is constrained through luminosity loss, and the similarity of the output depth maps of the first-order network and the second-order network is constrained through the depth similarity loss, so that the defect that the fuzzy luminosity constraint is poor when the luminosity loss faces a low-texture area is overcome.
Drawings
FIG. 1 is an IAM feature fusion graph;
FIG. 2 is a schematic diagram of a recursive network based deep consistency loss process;
FIG. 3 is a comparison graph of KITTI data set results;
FIG. 4 is a comparison graph of the results on the Cityscapes dataset.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
This work proposes a recursive depth estimation network algorithm based on image activity. First, an Image Activity Measure (IAM) is adopted to segment image features: based on the multidirectional distribution of image contours, the input image is segmented into high-order and low-order features, which are fused and fed to a common decoder to improve the network's perception of different features. Secondly, a recursive-network-based depth similarity loss is proposed: the depth map estimated by the network is reconstructed and fed, in a recursive manner, to a depth estimation network of identical structure; the similarity of the input images of the first-order and second-order networks is constrained by the photometric loss, and the similarity of their output depth maps is constrained by the depth similarity loss, compensating for the weak constraint of the photometric loss in low-texture regions. The depth consistency loss measures similarity with the cosine similarity of style matrices.
1. Image activity
Image activity is an index used in image compression coding to measure image content, and to some extent it can also distinguish texture regions within an image. Active regions generally refer to regions with strong edges and strong texture: in pixel regions with strong texture or clear edges, image activity is generally high, and depending on the measurement formula, it can effectively separate the high- and low-texture regions of different pictures.
Common image activity measures include the local variance method IAM1, the edge operator method IAM2, the wavelet transform method IAM3, and the image gradient method IAM4.
1. Local variance method IAM1: the local variance method is the simplest measure of regional activity and assumes the image pixels are stationary within the measurement region. For an M × N image block it is measured by equation (1). Since the size of the region directly affects the IAM value, the local variance method is not a good measure of image activity.
[Equation (1): the block-variance definition of IAM1; given as an image in the original publication.]
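As a hedged illustration (not the patent's exact formula), the following Python sketch computes a block-wise local variance in the spirit of IAM1; since equation (1) is only available as an image, the plain variance of the block's pixel values is used as an assumption.

```python
import numpy as np

def iam_local_variance(block: np.ndarray) -> float:
    """Local-variance activity of an M x N image block.

    Assumption: equation (1) is taken to be the plain variance of the
    block's pixel values; the original publication gives the formula
    only as an image.
    """
    block = block.astype(np.float64)
    return float(np.mean((block - block.mean()) ** 2))

# Example: a flat block has zero activity, a noisy block has high activity.
flat = np.full((8, 8), 128.0)
noisy = np.random.default_rng(0).integers(0, 256, size=(8, 8))
print(iam_local_variance(flat), iam_local_variance(noisy))
```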
2. Edge operator method IAM2: a more common approach is to measure the IAM with feature extraction methods such as edge extraction. To detect boundaries where pixel values change rapidly or discontinuously, edge extraction operators such as the Sobel, Prewitt, Roberts, Canny and Marr-Hildreth operators are often adopted; these operators usually return a binary image of the same size as the original, with 1 at edge locations and 0 elsewhere. After edge detection, the edge operator method computes the IAM from the normalized edge amplitude.
[Equation (2): the edge-amplitude definition of IAM2; given as an image in the original publication.]
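A minimal sketch of the edge-operator variant, assuming a Sobel operator and taking the mean of the edge magnitude normalized to [0, 1] as the activity score; the precise form of equation (2) is only available as an image.

```python
import numpy as np
from scipy import ndimage

def iam_edge_operator(img: np.ndarray) -> float:
    """Edge-operator activity: mean of the normalized Sobel edge magnitude.

    Assumption: the magnitude is normalized to [0, 1] before averaging.
    """
    img = img.astype(np.float64)
    gx = ndimage.sobel(img, axis=1)  # horizontal gradient
    gy = ndimage.sobel(img, axis=0)  # vertical gradient
    mag = np.hypot(gx, gy)
    if mag.max() > 0:
        mag /= mag.max()
    return float(mag.mean())
```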
3. Wavelet transform method IAM3: a two-dimensional wavelet transform decomposes the image into four sub-bands, LL, LH, HL and HH, according to frequency characteristics.
The IAM is calculated from the LH, HL and HH sub-bands.
[Equation (3): the sub-band definition of IAM3; given as an image in the original publication.]
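A sketch of the wavelet-transform variant using PyWavelets; combining the LH, HL and HH sub-bands into a single number via the mean absolute detail coefficient is an assumption, since equation (3) is only available as an image.

```python
import numpy as np
import pywt

def iam_wavelet(img: np.ndarray, wavelet: str = "haar") -> float:
    """Wavelet activity computed from the LH, HL and HH detail sub-bands."""
    img = img.astype(np.float64)
    _, (lh, hl, hh) = pywt.dwt2(img, wavelet)  # the LL approximation is discarded
    details = np.concatenate([lh.ravel(), hl.ravel(), hh.ravel()])
    return float(np.mean(np.abs(details)))
```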
4. Image gradient method IAM4: as shown in fig. 1, the image gradient method calculates the IAM by applying a logarithmic or square-root function to the horizontal or vertical gradient of the image.
[Equation (4): the gradient definition of IAM4; given as an image in the original publication.]
To correctly distinguish non-texture regions from strongly textured regions through the activity of different image regions, the IAM should be calculated from multiple edge directions. This study therefore calculates the variance of the image block along four directions: horizontal, vertical, lower-left diagonal and lower-right diagonal. Owing to the smoothness of the intermediate pixels, the sum V2 of the variances in the horizontal and vertical directions is much larger than the sum V1 of the variances in the lower-left and lower-right diagonal directions. The formula is shown below, where α ∈ [0,1] is a weight factor representing the proportion of each of the two variance sums; α is set to 0.5 here.
[The formulas defining V1, V2 and the combined IAM are given as images in the original publication.]
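A hedged Python sketch of the four-direction measure described above: V1 collects variances along the two diagonal directions, V2 along the horizontal and vertical directions, and the two sums are blended with the weight α = 0.5. Because the exact formulas are only available as images, difference-based variances and the linear combination rule below are assumptions.

```python
import numpy as np

def iam_four_direction(block: np.ndarray, alpha: float = 0.5) -> float:
    """Four-direction activity of an image block (assumed form of the formulas).

    V1: sum of the variances of pixel differences along the lower-left and
        lower-right diagonals.
    V2: sum of the variances of pixel differences along the horizontal and
        vertical directions.
    IAM = alpha * V1 + (1 - alpha) * V2   (assumed combination rule)
    """
    b = block.astype(np.float64)
    diag_right = b[1:, 1:] - b[:-1, :-1]   # lower-right diagonal differences
    diag_left = b[1:, :-1] - b[:-1, 1:]    # lower-left diagonal differences
    horiz = b[:, 1:] - b[:, :-1]           # horizontal differences
    vert = b[1:, :] - b[:-1, :]            # vertical differences

    v1 = diag_right.var() + diag_left.var()
    v2 = horiz.var() + vert.var()
    return float(alpha * v1 + (1.0 - alpha) * v2)
```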
The input picture is segmented into regions by the IAM formula to obtain a texture-rich high-order feature map and a weak-texture low-order feature map. The two maps are fed into the same encoder for feature extraction, the extracted high-order and low-order features are fused, and the fused features are sent to the depth decoder to recover depth information. Processing the high-order and low-order features separately preserves the texture information of the high-order features and prevents the low-order features from being weakened during encoding.
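The following PyTorch sketch illustrates the described pipeline under stated assumptions: a block-wise activity score splits the input into a high-activity (texture-rich) map and a low-activity map, both pass through a shared encoder, the features are fused by addition, and a small decoder recovers a depth-like map. The block size, threshold, fusion rule and the toy encoder/decoder are illustrative placeholders, not the patent's actual network.

```python
import torch
import torch.nn as nn

def split_by_activity(img, block=16, thresh=0.05):
    """Mask the image into high- and low-activity parts using a block-wise
    variance as a stand-in for the IAM score (threshold is an assumption)."""
    b, c, h, w = img.shape
    blocks = img.unfold(2, block, block).unfold(3, block, block)  # B,C,H/b,W/b,b,b
    act = blocks.flatten(-2).var(dim=-1).mean(dim=1, keepdim=True)  # per-block activity
    mask = (act > thresh).float()
    mask = mask.repeat_interleave(block, 2).repeat_interleave(block, 3)
    return img * mask, img * (1.0 - mask)      # high-order and low-order inputs

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                        nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid())

img = torch.rand(1, 3, 192, 640)
high, low = split_by_activity(img)
fused = encoder(high) + encoder(low)   # shared encoder, additive fusion (assumption)
depth = decoder(fused)                 # depth-like map in (0, 1)
```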
2. Recursive-network-based depth consistency loss
In daily life, humans can infer their own motion and the three-dimensional structure of a surrounding scene in a short time from accumulated experience. Everyday wide-ranging movement and careful observation of surrounding scenes give our brains a rich structural understanding of the world, a principle summarized as Structure from Motion (SfM). Following SfM, Zhou established the initial self-supervised monocular depth estimation framework SfMLearner, adopting the weak supervision of view synthesis as the constraint: the depth map D_t estimated by DepthNet and the ego-motion parameters estimated by PoseNet are used to recover a reconstruction I′_t of the input picture I_t, and the whole depth estimation network is constrained by comparing the similarity of the input picture I_t and its reconstruction I′_t, thereby realizing a self-supervised monocular depth estimation framework.
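A hedged sketch of the view-synthesis constraint just described (not the patent's implementation): predicted depth and relative pose are used to warp a source frame into the target view with differentiable sampling. Tensor layouts, the function name and the intrinsics handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def view_synthesis(src, depth, K, K_inv, T):
    """Warp a source frame into the target view from predicted depth and
    relative pose (SfMLearner-style view synthesis; illustrative layout).

    src:      (B, 3, H, W) source image, e.g. the adjacent frame
    depth:    (B, 1, H, W) predicted depth D_t of the target frame
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse
    T:        (B, 3, 4) relative pose [R | t] from target to source camera
    """
    b, _, h, w = depth.shape
    dev = depth.device
    ys, xs = torch.meshgrid(torch.arange(h, device=dev),
                            torch.arange(w, device=dev), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(b, -1, -1)             # homogeneous pixel grid

    cam = (K_inv @ pix) * depth.view(b, 1, -1)             # back-project to 3-D
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w, device=dev)], dim=1)
    proj = K @ (T @ cam_h)                                 # rigid transform + project
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)

    grid = torch.stack([2 * px / (w - 1) - 1,              # normalize to [-1, 1]
                        2 * py / (h - 1) - 1], dim=-1).view(b, h, w, 2)
    return F.grid_sample(src, grid, padding_mode="border", align_corners=True)
```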
However, the early network framework was too large and complex. To address this, Godard proposed the Monodepth2 framework. Structurally, Monodepth2 uses a standard fully convolutional U-Net as the depth prediction network and an independent pose network to predict inter-frame motion. Algorithmically, Monodepth2 introduces full-resolution multi-scale supervision, upsampling intermediate depth maps to the input resolution to reduce texture-copy artifacts. For the loss function, Monodepth2 adopts per-pixel minimum reprojection and an auto-masking of dynamic pixels: the former avoids averaging away good reprojections, while the latter masks out dynamic objects in the picture to reduce their reprojection error. Monodepth2 established the basic framework for subsequent monocular depth estimation, achieving high depth estimation accuracy with a simple and practical network. Its reconstruction loss is computed jointly from an L1 norm, the photometric loss L_pe, and the smoothness loss L_s.
L_pe(I_t, I′_t) = λ1·SSIM(I_t, I′_t) + λ2·L1    (8)
[Equation (9): the smoothness loss L_s; given as an image in the original publication.]
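A hedged sketch of the photometric loss of equation (8): a weighted combination of an SSIM term and an L1 term between the input frame and its reconstruction. The 3 x 3 average-pooling SSIM and the use of the (1 - SSIM)/2 dissimilarity form (so that the quantity is minimized) follow common Monodepth-style practice and are assumptions; the weights λ1 and λ2 come from the text.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM map with 3x3 average pooling (Monodepth-style)."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(i_t, i_rec, lam1=0.90, lam2=0.05):
    """L_pe(I_t, I'_t) combining an SSIM term and an L1 term, per equation (8).

    Assumption: the SSIM term is used in its (1 - SSIM) / 2 dissimilarity form
    so that smaller values mean better reconstructions.
    """
    l1 = (i_t - i_rec).abs().mean()
    ssim_term = ((1.0 - ssim(i_t, i_rec)) / 2.0).mean()
    return lam1 * ssim_term + lam2 * l1
```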
The traditional photometric loss depends heavily on the reconstruction I′_t of the input picture I_t. When weak-texture or textureless regions appear in the scene, the output D_t of the depth estimation network develops hole-like artifacts, so the recovered reconstruction I′_t is no longer representative, and recomputing the photometric loss on it reduces network performance. To address this problem, this study proposes a recursive-network-based depth consistency loss to strengthen the constraint.
As shown in fig. 2, the reconstruction I′_t recovered by the conventional network is fed into a second-order depth estimation network of identical structure, while I′_t and the adjacent frame I_{t-1} of the input picture are fed together into a second-order pose estimation network of identical structure; the depth map D′_t estimated from image I′_t and the pose parameters estimated by the second-order pose network are used to recover a reconstruction I″_t. When the network has fully converged, the input picture I_t and its reconstruction I′_t from the first-order network should be identical, the estimated first-order depth D_t and second-order depth D′_t should be identical, and the reconstruction I″_t recovered by the second-order network should be identical to the reconstruction I′_t of the input picture I_t. The recursive-network-based depth consistency loss L_dc is defined as follows.
L_dc = α1·||D_t, D′_t|| + α2·||I′_t, I″_t|| + α3·||I_t, I″_t||    (10)
where α1, α2, α3 are consistency weights and ||·,·|| denotes the similarity loss calculation.
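A hedged sketch of the second-pass consistency terms of equation (10). The disclosure states that the depth consistency loss measures similarity with the cosine similarity of style matrices; the Gram-matrix construction over image patches, the default weights of 1, and the function names below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gram_cosine_distance(a, b, patch=8):
    """1 - cosine similarity between the Gram ("style") matrices of two maps.

    Assumption: each map is unfolded into non-overlapping patches so that even
    single-channel depth maps yield a non-trivial Gram matrix.
    """
    def gram(x):
        f = F.unfold(x, kernel_size=patch, stride=patch)  # (B, C*patch*patch, L)
        return torch.bmm(f, f.transpose(1, 2)) / f.shape[-1]
    ga, gb = gram(a).flatten(1), gram(b).flatten(1)
    return (1.0 - F.cosine_similarity(ga, gb, dim=1)).mean()

def depth_consistency_loss(d_t, d2_t, i_rec, i_rec2, i_t, a1=1.0, a2=1.0, a3=1.0):
    """L_dc = a1*||D_t, D'_t|| + a2*||I'_t, I''_t|| + a3*||I_t, I''_t||  (eq. 10).

    d_t, d2_t : first- and second-order depth maps
    i_rec     : reconstruction I'_t from the first-order pass
    i_rec2    : reconstruction I''_t from the second-order pass
    i_t       : original input frame
    The weights a1, a2, a3 are not specified in the text and default to 1 here.
    """
    return (a1 * gram_cosine_distance(d_t, d2_t)
            + a2 * gram_cosine_distance(i_rec, i_rec2)
            + a3 * gram_cosine_distance(i_t, i_rec2))
```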
Finally, the total loss of the depth estimation network is obtained jointly from the photometric loss L_pe, the smoothness loss L_s and the depth consistency loss L_dc.
L = L_pe + L_s + L_dc    (11)
3. Dataset evaluation
The algorithm in this paper is trained and tested on the KITTI dataset using the Eigen [1] data split. All input images undergo the same transformation: the camera principal point is set to the image center and the focal length is set to the average of all focal lengths in KITTI. The test set contains 697 image pairs at a resolution of 1242 × 375, covering 29 scenes in total; 39,806 pictures are used for training and the remaining 4,424 for validation. The network framework is implemented in PyTorch, and images are resized to 640 × 192 during training. The network is optimized with the Adam optimizer with parameters β1 = 0.9, β2 = 0.999. The loss weights are λ1 = 0.90, λ2 = 0.05, λ3 = 0.05; the initial learning rate is set to 0.0001, training runs for 20 epochs, and the learning rate is reduced to one tenth of its value every 15 epochs.
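The training configuration described above might be set up as in the following sketch; the model object and data pipeline are placeholders, while the Adam parameters, learning-rate schedule and epoch count follow the text.

```python
import torch

# Placeholder model: the actual depth/pose networks and the KITTI data loader
# are not specified here; any Monodepth-style implementation could be used.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Learning rate drops to one tenth every 15 epochs, 20 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(20):
    # ... iterate over 640x192 training images, compute L = L_pe + L_s + L_dc,
    # backpropagate and call optimizer.step() here ...
    scheduler.step()
```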
Table 1 compares our approach with several related works. The models were trained on different types of data, namely monocular video (M), stereo pairs (S) and monocular plus stereo video (MS), and all were tested with a single image as input. The best results are marked in bold and the second-best are underlined. In the monocular training mode, comparing all objective indices, our model is more accurate than the other algorithms: the Abs Rel index improves from 0.115 to 0.111 compared with the second-best Monodepth2, and for Sq Rel, which penalizes large errors more heavily, our model improves from 0.837 to 0.816 relative to the second-best DualNet. In addition, our model achieves a good result of 0.884 on the most important accuracy index of target depth estimation. In the stereo training mode, the MonoDisp model achieves the best Sq Rel and RMSE, its RMSE even being the best among all compared algorithms. MonoDisp achieves better results on these indices because it is trained specifically on stereo data: it uses a bilateral cyclic relationship between the left and right disparities, which yields higher index accuracy under stereo training. MonoResM fuses a traditional stereo matching method into the deep learning algorithm, using semi-global matching in place of ground-truth values, which is why MonoResM gives the best results on RMSE log and the last two accuracy indices; inserting proxy supervision in place of ground truth strengthens the depth estimation indices during training. Although our model uses the monocular training mode, it still surpasses MonoDisp and MonoResM on Abs Rel and improves to 0.874 on the most important accuracy index. In binocular (MS) training, the proposed model achieves the best performance on Abs Rel, RMSE log and accuracy, and the second-best performance on the other indices. The results show that the proposed algorithm is clearly superior to other unsupervised monocular depth estimation algorithms.
Currently, the most common quantitative indices for evaluating monocular depth estimation are the absolute relative error (Abs Rel), the squared relative error (Sq Rel), the root mean square error (RMSE) and the logarithmic root mean square error (RMSE log). Their calculation formulas are given below, where d_i is the predicted depth, d_i* is the ground-truth depth of pixel i, and N is the number of evaluated pixels.
Abs Rel = (1/N) · Σ_i |d_i − d_i*| / d_i*
Sq Rel = (1/N) · Σ_i (d_i − d_i*)² / d_i*
RMSE = sqrt( (1/N) · Σ_i (d_i − d_i*)² )
RMSE log = sqrt( (1/N) · Σ_i (log d_i − log d_i*)² )
The accuracy of monocular depth estimation is the percentage of pixels satisfying the following condition:
max(d_i / d_i*, d_i* / d_i) = δ < threshold
with thresholds 1.25, 1.25² and 1.25³.
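The standard metrics listed above can be computed as in this NumPy sketch, a straightforward implementation of the usual definitions; the median-scaling step often used in self-supervised evaluation is omitted.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth metrics over valid ground-truth pixels."""
    mask = gt > 0                       # evaluate only where ground truth exists
    pred, gt = pred[mask], gt[mask]

    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))

    ratio = np.maximum(pred / gt, gt / pred)
    acc = {f"delta<{1.25 ** k:.4g}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel,
            "rmse": rmse, "rmse_log": rmse_log, **acc}
```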
TABLE 1 test results on KITTI data set
Where M represents monocular training, S represents stereo pair training, and MS represents monocular plus stereo pair training.
[Table 1 is given as images in the original publication.]
[1] Eigen D, Fergus R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture [J]. IEEE, 2014.
[2] Casser V, Pirk S, Mahjourian R, et al. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos [J], 2018.
[3] Godard C, Aodha O M, Brostow G J. Unsupervised Monocular Depth Estimation with Left-Right Consistency [C]. Computer Vision & Pattern Recognition, 2017.
[4] Gordon A, Li H, Jonschkowski R, et al. Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras [C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[5] Zhou J, Wang Y, Qin K, et al. Unsupervised High-Resolution Depth Learning From Videos With Dual Networks [C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[6] Klingner M, Termöhlen J A, Mikolajczyk J, et al. Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance [J], 2020.
[7] Godard C, Aodha O M, Firman M, et al. Digging Into Self-Supervised Monocular Depth Estimation [C]. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[8] Garg R, Bg V K, Carneiro G, et al. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue [C]. European Conference on Computer Vision, 2016.
[9] Mehta I, Sakurikar P, Narayanan P J. Structured Adversarial Training for Unsupervised Monocular Depth Estimation [C]. 2018 International Conference on 3D Vision (3DV), 2018.
[10] Poggi M, Tosi F, Mattoccia S. Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions [C]. 2018 International Conference on 3D Vision (3DV), 2018: 324-333.
[11] Li R, Wang S, Long Z, et al. UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning [J], 2017.
[12] Luo C, Yang Z, Peng W, et al. Every Pixel Counts++: Joint Learning of Geometry and Motion with 3D Holistic Understanding [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
Fig. 3 shows depth estimation results on the KITTI dataset. Depth map results are compared for five algorithms: Struct2 [2], Mono [9], SGD [6], Mono2 [7] and the algorithm of this paper, with the notable regions highlighted by rectangular boxes. Compared with the other methods, when handling small, distant objects such as people, vehicles, billboards and tree trunks, the depth maps obtained by our algorithm have clearer overall contours, more distinct edges and richer detail; for nearby objects such as roadside guardrails, the depth maps obtained by our model are more continuous, smooth, sharp and clear.
To assess the generalization of the model, it was applied directly to the Cityscapes dataset and the results were compared with other unsupervised monocular depth estimation methods, as shown in fig. 4, with the notable regions highlighted by rectangular boxes. Compared with the other methods, the depth maps obtained by our model are more discriminative and recognizable in challenging regions such as overlapping people and complex textures. The model therefore generalizes well.

Claims (6)

1. A monocular depth estimation method for visual augmented reality of wearable intelligent equipment, characterized in that: first, image features are segmented with an image activity measure; based on the multidirectional distribution of image contours, the input picture is segmented into high-order and low-order features, which are fused and fed to a common decoder to improve the network's perception of different features;
secondly, a recursive-network-based depth similarity loss is applied: the depth map estimated by the network is reconstructed and fed, in a recursive manner, to a depth estimation network of identical structure; the similarity of the input images of the first-order and second-order networks is constrained by the photometric loss, and the similarity of their output depth maps is constrained by the depth similarity loss, compensating for the weak constraint of the photometric loss in low-texture regions; the depth consistency loss measures similarity with the cosine similarity of style matrices.
2. The monocular depth estimation method for visual augmented reality of wearable intelligent equipment according to claim 1, characterized in that the image activity is measured with an image variance method (IAM).
3. The monocular depth estimation method for visual augmented reality of wearable intelligent equipment according to claim 2, characterized in that the image gradient method calculates the variance of an image block along four directions, namely the horizontal direction, the vertical direction, the lower-left diagonal direction and the lower-right diagonal direction.
4. The monocular depth estimation method for visual augmented reality of wearable intelligent equipment according to claim 3, characterized in that the calculation formula of the image gradient method is as follows, where α ∈ [0,1] is a weight factor representing the proportion of each of the two variance sums and α is set to 0.5; V1 is the sum of the variances about the center pixel along the lower-left and lower-right diagonals; V2 is the sum of the variances about the center pixel along the horizontal and vertical directions; M and N are the length and width of the image block, and b_{i,j} is the pixel at position [i, j] in the image block,
[The formulas defining V1, V2 and the combined IAM are given as images in the original publication.]
5. The monocular depth estimation method for visual augmented reality of wearable intelligent equipment according to claim 1, characterized in that the recursive-network-based depth similarity loss is specifically defined as follows:
L_dc = α1·||D_t, D′_t|| + α2·||I′_t, I″_t|| + α3·||I_t, I″_t||
where α1, α2, α3 are consistency weights and ||·,·|| denotes the similarity loss calculation; I_t is the input picture, I′_t is the reconstructed image, D′_t is the depth map estimated from image I′_t, I″_t is the reconstructed image recovered using the pose parameters estimated by the second-order pose network, D_t is the output of the depth estimation network, and L_dc is the depth consistency loss.
6. The monocular depth estimation method for visual augmented reality of wearable intelligent equipment according to claim 1, characterized in that the total loss L of the depth estimation network is obtained jointly from the photometric loss L_pe, the smoothness loss L_s and the depth consistency loss L_dc:
L = L_pe + L_s + L_dc
CN202211171962.4A 2022-09-26 2022-09-26 Monocular depth estimation method for visual augmented reality of wearable intelligent equipment Pending CN115496787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211171962.4A CN115496787A (en) 2022-09-26 2022-09-26 Monocular depth estimation method for visual augmented reality of wearable intelligent equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211171962.4A CN115496787A (en) 2022-09-26 2022-09-26 Monocular depth estimation method for visual augmented reality of wearable intelligent equipment

Publications (1)

Publication Number Publication Date
CN115496787A true CN115496787A (en) 2022-12-20

Family

ID=84469742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211171962.4A Pending CN115496787A (en) 2022-09-26 2022-09-26 Monocular depth estimation method for visual augmented reality of wearable intelligent equipment

Country Status (1)

Country Link
CN (1) CN115496787A (en)

Similar Documents

Publication Publication Date Title
Ming et al. Deep learning for monocular depth estimation: A review
US20200211196A1 (en) Method and Apparatus for Performing Segmentation of an Image
US8582866B2 (en) Method and apparatus for disparity computation in stereo images
Lee et al. Local disparity estimation with three-moded cross census and advanced support weight
Gu et al. DenseLiDAR: A real-time pseudo dense depth guided depth completion network
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
Holzmann et al. Semantically aware urban 3d reconstruction with plane-based regularization
Tian et al. Quality assessment of DIBR-synthesized views: An overview
Wang et al. Background extraction based on joint gaussian conditional random fields
Zhang et al. Robust stereo matching with surface normal prediction
Abd Manap et al. Disparity refinement based on depth image layers separation for stereo matching algorithms
Hu et al. Seasondepth: Cross-season monocular depth prediction dataset and benchmark under multiple environments
CN117274515A (en) Visual SLAM method and system based on ORB and NeRF mapping
Wang et al. Recurrent neural network for learning densedepth and ego-motion from video
Babu et al. An efficient image dahazing using Googlenet based convolution neural networks
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN113421210A (en) Surface point cloud reconstruction method based on binocular stereo vision
CN110992342B (en) SPCP infrared small target detection method based on 3DATV constraint
Tseng et al. Semi-supervised image depth prediction with deep learning and binocular algorithms
CN109064444B (en) Track slab disease detection method based on significance analysis
CN115496787A (en) Monocular depth estimation method for visual augmented reality of wearable intelligent equipment
Zováthi et al. ST-DepthNet: A spatio-temporal deep network for depth completion using a single non-repetitive circular scanning Lidar
Chen et al. Improving neural radiance fields with depth-aware optimization for novel view synthesis
Wang et al. Rgb-guided depth map recovery by two-stage coarse-to-fine dense crf models
CN115272450A (en) Target positioning method based on panoramic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination