WO2019223382A1 - Method for estimating monocular depth, apparatus and device therefor, and storage medium - Google Patents

Method for estimating monocular depth, apparatus and device therefor, and storage medium Download PDF

Info

Publication number
WO2019223382A1
Authority
WO
WIPO (PCT)
Prior art keywords
network model
binocular
depth
trained
disparity map
Prior art date
Application number
PCT/CN2019/076247
Other languages
French (fr)
Chinese (zh)
Inventor
郭晓阳
李鸿升
伊帅
任思捷
王晓刚
Original Assignee
深圳市商汤科技有限公司
Application filed by 深圳市商汤科技有限公司 filed Critical 深圳市商汤科技有限公司
Priority to SG11202008787UA priority Critical patent/SG11202008787UA/en
Priority to JP2020546428A priority patent/JP7106665B2/en
Publication of WO2019223382A1 publication Critical patent/WO2019223382A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Definitions

  • Embodiments of the present application relate to the field of artificial intelligence, and in particular, to a monocular depth estimation method and a device, device, and storage medium thereof.
  • Monocular depth estimation is an important issue in computer vision.
  • the specific task of monocular depth estimation is to predict the depth of each pixel in a picture.
  • a picture composed of the depth value of each pixel is also called a depth map.
  • Monocular depth estimation is of great significance for obstacle detection, three-dimensional scene reconstruction, and three-dimensional scene analysis in autonomous driving.
  • monocular depth estimation can indirectly improve the performance of other computer vision tasks, such as object detection, target tracking and target recognition.
  • the current problem is that training neural networks for monocular depth estimation requires a large amount of labeled data, but obtaining labeled data is costly.
  • In outdoor environments, labeled data can be obtained by lidar, but such labeled data is very sparse.
  • A monocular depth estimation network trained with such sparse labeled data produces depth maps without clear edges and cannot capture the correct depth of small objects.
  • the embodiments of the present application provide a monocular depth estimation method, an apparatus, a device and a storage medium thereof.
  • An embodiment of the present application provides a monocular depth estimation method.
  • The method includes: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed.
  • An embodiment of the present application provides a monocular depth estimation device.
  • The device includes: an acquisition module, an execution module, and an output module, wherein: the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
  • An embodiment of the present application provides a monocular depth estimation device, including a memory and a processor.
  • The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
  • In the embodiments of the present application, an image to be processed is acquired; the image to be processed is input into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the analysis result of the image to be processed is output.
  • In this way, the monocular depth estimation network can be trained with little or no data annotated with depth maps, and a more effective method for unsupervised fine-tuning of the binocular disparity network is proposed, which indirectly improves the effect of monocular depth estimation.
  • FIG. 1A is a first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application
  • FIG. 1B is a schematic diagram of a single picture depth estimation according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of training a monocular depth estimation network model according to an embodiment of the present application.
  • FIG. 1E is a schematic diagram of relevant pictures of a loss function according to an embodiment of the present application.
  • FIG. 2A is a second schematic diagram of an implementation process of a monocular depth estimation method according to an embodiment of the present application.
  • FIG. 2B is a schematic diagram of an effect of a loss function according to an embodiment of the present application.
  • FIG. 2C is a schematic diagram of a visualized depth estimation result according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a monocular depth estimation device according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware entity of a monocular depth estimation device according to an embodiment of the present application.
  • In the embodiments of the present application, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • the monocular depth estimation method proposed in the embodiment of the present application is obtained by using neural network training.
  • the training data comes from the disparity map data output by binocular matching, without the need for expensive depth acquisition equipment such as lidar.
  • the binocular matching algorithm that provides training data is also implemented by a neural network.
  • The binocular matching network can achieve good results by pre-training on a large number of virtual binocular image pairs rendered by a rendering engine.
  • Fine-tuning training can then be performed on real data to achieve better results.
  • FIG. 1A is the first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 1A, the method includes the following steps:
  • Step S101 Acquire an image to be processed
  • an image to be processed may be acquired by a mobile terminal, and the image to be processed may include a picture of an arbitrary scene.
  • During implementation, the mobile terminal may be any of various types of devices with information processing capabilities.
  • For example, the mobile terminal may include a mobile phone, a Personal Digital Assistant (PDA), a navigator, a digital phone, a video phone, a smart watch, a smart bracelet, a wearable device, a tablet computer, and the like.
  • The method may also be performed by another computing device with information processing capabilities, for example a mobile terminal such as a mobile phone, a tablet computer, or a notebook computer, or a fixed terminal such as a personal computer or a server cluster.
  • Step S102: Input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model.
  • The monocular depth estimation network model is mainly obtained through the following three steps: the first step is to pre-train a binocular matching neural network using synthetic binocular data rendered by a rendering engine; the second step is to use real-world data to fine-tune the binocular matching neural network obtained in the first step; the third step is to use the binocular matching neural network obtained in the second step to provide supervision for the monocular depth estimation network, thereby training the monocular depth estimation network. A structural sketch of this pipeline is given below.
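  • The following is a minimal structural sketch of the three-step pipeline described above, not the application's actual code; the function names and signatures are assumptions used purely for illustration, and the training loops themselves are omitted.

```python
def pretrain_stereo_net(synthetic_pairs):
    """Step 1: train the (second) binocular matching network on rendered
    synthetic binocular pairs with ground-truth disparity and occlusion."""
    stereo_net = ...  # supervised training on synthetic data
    return stereo_net

def finetune_stereo_net(stereo_net, real_pairs, depth_labels=None):
    """Step 2: adapt the pre-trained network to real data, producing the
    (first) binocular matching network. With depth labels this is supervised
    fine-tuning; without them, the unsupervised fine-tuning loss described
    later in this text is used."""
    ...
    return stereo_net

def train_monocular_net(real_pairs, stereo_net):
    """Step 3: use disparity maps predicted by the fine-tuned stereo network
    as supervision for a monocular depth estimation network that only sees
    the left (or right) image of each pair."""
    mono_net = ...
    return mono_net
```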
  • monocular depth estimation generally uses a large amount of real labeled data for training, or uses an unsupervised method to train a monocular depth estimation network. However, the acquisition cost of a large amount of real labeled data is very high.
  • the sample data of the monocular depth estimation network model described in this application comes from the disparity map output by the first binocular matching neural network model, that is, this application uses binocular disparity to guide the prediction of the monocular depth. Therefore, the method in the present application does not require a large amount of labeled data, and can obtain better training results.
  • Step S103 Output the analysis result of the image to be processed.
  • the analysis result of the image to be processed refers to a depth map corresponding to the image to be processed.
  • The image to be processed is input into the trained monocular depth estimation network model, and the monocular depth estimation network model generally outputs a disparity map corresponding to the image to be processed rather than a depth map. Therefore, it is also necessary to determine the depth map corresponding to the image to be processed according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the image to be processed, and the lens focal length of that camera.
  • FIG. 1B is a schematic diagram of the depth estimation of a single picture in the embodiment of the present application.
  • The picture labeled 11 is the image to be processed,
  • and the picture labeled 12 is the depth map corresponding to the picture labeled 11.
  • In some embodiments, the depth map corresponding to the image to be processed may be determined as the ratio of the product of the lens baseline distance and the lens focal length to the output disparity map corresponding to the image to be processed, as illustrated in the sketch below.
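  • The following is a small illustrative sketch (not code from the application) of the disparity-to-depth conversion just described; the baseline and focal-length values in the example are made up for demonstration.

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map using the relation
    described above: depth = (baseline * focal_length) / disparity.
    The baseline is the distance between the two camera lenses, the focal
    length is expressed in pixels, and eps guards against division by zero."""
    disparity = np.asarray(disparity, dtype=np.float64)
    return (baseline * focal_length) / np.maximum(disparity, eps)

# Example usage with a 2x2 disparity map and illustrative camera parameters.
depth = disparity_to_depth(np.array([[30.0, 15.0], [10.0, 5.0]]), baseline=0.54, focal_length=721.0)
```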
  • an embodiment of the present application further provides a monocular depth estimation method, which includes:
  • Step S111 Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image;
  • In some embodiments, the method further includes: step S11, constructing a virtual 3D scene through a rendering engine; step S12, mapping the 3D scene into a binocular picture through two virtual cameras; step S13, obtaining the depth data of the synthesized binocular picture according to the position and direction used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and step S14, marking the binocular picture according to the depth data to obtain the synthesized binocular picture. A sketch of how such rendered depth data relates to the disparity labels is given below.
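  • The helper below is an illustrative assumption rather than part of the application: it simply applies the inverse of the depth relation given earlier (depth = baseline × focal length / disparity) to turn the rendered ground-truth depth into ground-truth disparity for labeling the synthetic binocular pair.

```python
import numpy as np

def synthetic_disparity_from_depth(depth_map, baseline, focal_length, eps=1e-6):
    """Given the ground-truth depth rendered for the virtual left camera,
    return the corresponding ground-truth disparity (in pixels):
    disparity = (baseline * focal_length) / depth."""
    depth_map = np.asarray(depth_map, dtype=np.float64)
    return (baseline * focal_length) / np.maximum(depth_map, eps)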
  • Step S112 Train a second binocular matching neural network model according to the obtained synthetic sample data
  • Step S112 may be implemented by the following step: step S1121, training a second binocular matching neural network model according to the synthesized binocular picture to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map;
  • the disparity map describes the parallax distance between each pixel in the left image and the corresponding pixel in the right image, where the parallax distance is measured in pixels;
  • and the occlusion map describes whether the pixel in the right image corresponding to each pixel in the left image is blocked by an object.
  • FIG. 1C is a schematic diagram of training the second binocular matching neural network model according to an embodiment of the present application. As shown in FIG. 1C, the picture labeled 11 is the left view of a synthesized binocular picture, and the picture labeled 12 is the right view of the synthesized binocular picture.
  • I_L denotes the pixel values of all the pixels contained in the left picture labeled 11, and I_R denotes the pixel values of all the pixels contained in the right picture labeled 12;
  • the picture labeled 13 is the occlusion map output by the trained second binocular matching neural network model,
  • the picture labeled 14 is the disparity map output by the trained second binocular matching neural network model,
  • and the item labeled 15 is the second binocular matching neural network model itself.
  • Step S113 Adjust the parameters of the trained second binocular matching neural network model according to the obtained real sample data to obtain a first binocular matching neural network model
  • Step S113 can be implemented in two ways. The first implementation is carried out according to the following step: step S1131a, performing supervised training on the trained second binocular matching neural network model according to the obtained real binocular data with depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • That is, real binocular data with depth markers is acquired.
  • The real binocular data with depth markers can be used directly to supervise the training of the second binocular matching neural network trained in step S112, so as to adjust the weights of the trained second binocular matching neural network model and further improve its effect, thereby obtaining the first binocular matching neural network model.
  • the binocular parallax network needs to adapt to the real data.
  • The second implementation is carried out according to the following step: step S1131b, performing unsupervised training on the trained second binocular matching neural network model according to the obtained real binocular data without depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • Here, unsupervised training refers to training using only binocular data without depth annotation, and this process can be implemented using an unsupervised fine-tuning method.
  • Step S114 Supervise the monocular depth estimation network model through the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model;
  • step S114 is implemented in two ways, wherein the first implementation way is implemented according to the following steps: step S1141a, acquiring the left or right image in the real binocular data with depth mark as a training sample,
  • the depth-labeled real binocular data includes a left image and a right image;
  • Step S1142a, training a monocular depth estimation network model according to the left or right image in the depth-labeled real binocular data.
  • As noted above, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • The monocular depth estimation network model may therefore be trained according to the left or right image in the depth-labeled real binocular data, where the depth-labeled real binocular data is the real binocular data with depth markers used in step S1131a.
  • The second implementation is carried out according to the following steps: step S1141b, inputting the real binocular data without depth markers into the first binocular matching neural network model to obtain a corresponding disparity map, wherein the real binocular data without depth markers includes a left image and a right image;
  • step S1142b, determining the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth markers, and the lens focal length of that camera;
  • step S1143b, using the left or right image in the real binocular data without depth markers as sample data, and supervising the monocular depth estimation network model according to the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  • As noted above, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • The left or right image in the real binocular data without depth markers used in step S1131b, or the left or right image in the real binocular data without depth markers used in step S1141b, can be taken as the sample data,
  • and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map output in step S1141b, so that the monocular depth estimation network model is trained and a trained monocular depth estimation network model is obtained.
  • FIG. 1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application.
  • Figure (a) in FIG. 1D shows inputting real binocular data without depth markers into the first binocular matching neural network model.
  • The real binocular data without depth markers includes a left picture labeled 11 and a right picture labeled 12, and the item labeled 15 is the first binocular matching neural network model.
  • Figure (b) in FIG. 1D shows that the left or right image in the real binocular data without depth markers is used as the sample data, and the depth map corresponding to the disparity map labeled 13 is used to supervise the monocular depth estimation network model, thereby training the monocular depth estimation network model,
  • wherein the output of the sample data after passing through the monocular depth estimation network model is the disparity map labeled 14, and the item labeled 16 is the monocular depth estimation network model.
  • Step S115 Obtain an image to be processed
  • After the above training, the trained monocular depth estimation network model can be used; that is, a depth map corresponding to the image to be processed is obtained using this monocular depth estimation network model.
  • Step S116: Input the image to be processed into the trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model;
  • Step S117 Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model;
  • Step S118: Determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • Step S119 Output a depth map corresponding to the disparity map.
  • an embodiment of the present application further provides a monocular depth estimation method, which includes:
  • Step S121 Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image.
  • Step S122 Train a second binocular matching neural network model according to the obtained synthetic sample data
  • In formula (1) of step S123, L_abs and L_rel are regularization terms.
  • Formula (1) in step S123 can be further refined by the formulas in the following steps; that is, the method further includes: step S1231, determining the reconstruction error using formula (2) or formula (3). In these formulas, N is the number of pixels in the picture, and the quantities involved are: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth markers; the pixel values of the picture synthesized after sampling the right picture, that is, the reconstructed left picture; the pixel values of the picture synthesized after sampling the left picture, that is, the reconstructed right picture; and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers.
  • Step S1232, using formula (4) or formula (5) to constrain the disparity map output by the first binocular matching network model to differ only slightly from the disparity map output by the trained second binocular matching network model.
  • In these formulas, N is the number of pixels in the picture; the quantities involved are the pixel values of the occlusion map output by the trained second binocular matching network model and the pixel values of the disparity map output by the trained second binocular matching network for the left image (or the right image) in the sample data; ij represents the pixel coordinates of a pixel, and the superscript old denotes the output of the trained second binocular matching network.
  • Step S1233, using formula (6) or formula (7) to constrain the output gradient of the first binocular matching network model to be consistent with the output gradient of the trained second binocular matching network model. In these formulas, N is the number of pixels in the picture; the quantities involved are the gradients of the disparity maps output by the first binocular matching network for the left image and for the right image in the real binocular data without depth markers, and the gradients of the disparity maps output by the trained second binocular matching network for the left image and for the right image in the sample data; the superscript old denotes the output of the trained second binocular matching network model, R denotes the right picture or data related to the right picture, and L denotes the left picture or data related to the left picture.
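  • Since the formula images for formulas (1)–(7) are not shown above, the following is only a plausible sketch of the structure those descriptions suggest for the left-image terms (the right-image versions are symmetric); the exact masks, norms, and weights used in the application may differ, and the symbols below are assumptions chosen to match the variable descriptions.

```latex
% Hypothetical sketch only, not the application's own formulas.
% \hat{O}^{L,old}: occlusion map from the trained (old) network, 1 = visible, 0 = occluded
% I^{L}, \tilde{I}^{L}: real left image and left image reconstructed by warping the right image
% \hat{D}^{L}, \hat{D}^{L,old}: disparity predicted during fine-tuning and by the trained (old) network
\begin{align}
L_{\mathrm{recon}} &= \frac{1}{N}\sum_{ij} \hat{O}^{L,old}_{ij}\,\bigl|I^{L}_{ij} - \tilde{I}^{L}_{ij}\bigr| \\
L_{\mathrm{abs}}   &= \frac{1}{N}\sum_{ij} \bigl(1-\hat{O}^{L,old}_{ij}\bigr)\,\bigl|\hat{D}^{L}_{ij} - \hat{D}^{L,old}_{ij}\bigr| \\
L_{\mathrm{rel}}   &= \frac{1}{N}\sum_{ij} \bigl|\nabla\hat{D}^{L}_{ij} - \nabla\hat{D}^{L,old}_{ij}\bigr| \\
L &= L_{\mathrm{recon}} + \lambda_{1}\,L_{\mathrm{abs}} + \lambda_{2}\,L_{\mathrm{rel}}
\end{align}
```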
  • Step S124 Use a loss function (Loss) to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without the depth marker to adjust the trained second binocular Match the weights of the neural network model to get the first binocular matching neural network model.
  • FIG. 1E is a schematic diagram of pictures related to the loss function according to an embodiment of the present application. As shown in FIG. 1E, figure (a) is the left image of real binocular data without depth markers, and figure (b) is the right image of the real binocular data without depth markers.
  • Figure (c) in FIG. 1E is the disparity map output by the trained second binocular matching neural network model when the real binocular image without depth markers composed of figures (a) and (b) is input to it.
  • Figure (d) in FIG. 1E is the left picture reconstructed by sampling the right picture shown in figure (b) and combining it with the disparity map shown in figure (c).
  • Figure (e) in FIG. 1E is the picture obtained by taking the difference between the pixels in the left image shown in figure (a) and the corresponding pixels in the reconstructed left image shown in figure (d), that is, the reconstruction error map of the left image.
  • FIG. 1E also includes the occlusion map obtained by inputting the real binocular image without depth markers composed of figures (a) and (b) into the trained second binocular matching neural network model.
  • All the red boxes labeled 11 in figure (d) indicate the parts where the reconstructed left picture differs from the real left picture identified in figure (a), and all the red boxes labeled 12 in figure (e) show where the reconstruction error map contains errors, that is, the parts that are occluded.
  • The occlusion map is used to remove this part of the erroneous training signal, so as to improve the effect of unsupervised fine-tuning training.
  • Step S125 Supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
  • the sample picture of the monocular depth estimation network model may be a left image in real binocular data without a depth marker, or a right image in real binocular data without a depth marker.
  • If the left picture is used as the sample picture, the loss function is determined by formula (1), formula (2), formula (4), and formula (6); if the right picture is used as the sample picture, the loss function is determined by formula (1), formula (3), formula (5), and formula (7).
  • The monocular depth estimation network model is supervised by using the disparity map output by the first binocular matching neural network model, so as to train the monocular depth estimation network model.
  • In other words, the depth map corresponding to the disparity map output by the first binocular matching neural network model supervises the monocular depth estimation network model; that is, it provides the supervision information with which the monocular depth estimation network model is trained.
  • Step S126 Acquire the image to be processed
  • Step S127: Input the image to be processed into the trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model;
  • Step S128 Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model.
  • Step S129: Determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • Step S130 Output a depth map corresponding to the disparity map.
  • the trained monocular depth estimation network model may be used to predict the depth of the street view picture.
  • FIG. 2A is the second schematic flowchart of the monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 2A, the method includes the following steps:
  • Step S201 Use the synthetic data rendered by the rendering engine to train a binocular matching network to obtain a disparity map of the binocular picture;
  • the input of the binocular matching network is: a pair of binocular pictures (including the left and right pictures)
  • The output of the binocular matching network is a disparity map and an occlusion map; that is, the binocular matching network takes binocular pictures as input and outputs disparity and occlusion maps.
  • The disparity map is used to describe the disparity distance, in pixels, between each pixel in the left picture and the corresponding pixel in the right picture; the occlusion map is used to describe whether the pixel in the right picture corresponding to each pixel in the left picture is occluded by other objects. Due to the change in viewpoint, some areas visible in the left image will be occluded by other objects in the right image.
  • The occlusion map is therefore used to mark whether each pixel in the left image is occluded in the right image.
  • The binocular matching network is trained using synthetic data generated by a computer rendering engine. First, some virtual 3D scenes are constructed by the rendering engine, and then the 3D scenes are projected into binocular pictures by two virtual cameras to obtain the synthetic data. The correct depth data, camera focal length, and other such data can also be obtained from the rendering engine, so the binocular matching network can be trained directly with supervision from this labeled data.
  • Step S202 Use the loss function to fine-tune the binocular matching network obtained in step S201 on the real binocular image data through an unsupervised fine-tuning method;
  • the binocular parallax network needs to adapt to the real data. That is, the binocular disparity network is trained unsupervisedly using real binocular data without depth marking.
  • unsupervised training refers to training using only binocular data without deep data marking.
  • the embodiment of the present application proposes a new unsupervised fine-tuning method, which uses the loss function in the above embodiment to perform unsupervised fine-tuning.
  • the main purpose of the loss function proposed in the embodiment of the present application is to hope to fine-tune the binocular disparity network on real binocular data without reducing the pre-training effect.
  • The pre-trained binocular disparity network obtained in step S201 is used during the fine-tuning.
  • FIG. 2B is a schematic diagram of the effect of the loss function in the embodiment of the present application.
  • The picture labeled 21 is the disparity map obtained when using the loss function of the prior art, and the picture labeled 22 is the disparity map obtained when using the loss function proposed in the embodiment of the present application.
  • The loss function of the prior art does not treat the occluded area separately, and the image reconstruction error of the occluded area is also optimized toward zero, which causes erroneous disparity predictions in the occluded area and blurred edges in the disparity map.
  • The loss function in the present application uses the occlusion map to remove the erroneous training signal in this part, so as to improve the effect of unsupervised fine-tuning training.
  • Step S203 Use the binocular matching network obtained in step S202 to supervise the monocular depth estimation on the real data, and finally obtain the monocular depth estimation network.
  • the input of the monocular depth estimation network is: a single monocular picture
  • the output of the monocular depth estimation network is: a depth map.
  • At this point, the binocular disparity network fine-tuned on the real data has been obtained. For each pair of binocular pictures, the binocular disparity network predicts a disparity map; given the disparity map D, the baseline distance b of the binocular lenses, and the lens focal length f, the corresponding depth map can be determined as described above.
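  • In formula form, this restates the relation already given earlier in this text (the product of baseline and focal length divided by the disparity), not a formula image from the application:

```latex
\mathrm{depth}_{ij} = \frac{b \cdot f}{D_{ij}}
```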
  • the monocular depth estimation method in the embodiments of the present application can be trained to obtain a depth estimation module for unmanned driving, thereby performing three-dimensional reconstruction or obstacle detection on the scene. And the unsupervised fine-tuning method proposed in the embodiment of the present application improves the performance of the binocular disparity network.
  • A supervised monocular depth estimation method is very limited because accurate labeled data is difficult to obtain.
  • the performance of unsupervised methods based on reconstruction errors is usually limited by the pixel matching ambiguity.
  • a new monocular depth estimation method is proposed in the embodiment of the present application, which solves the limitations of the supervised and unsupervised depth estimation methods in the prior art.
  • the method in the embodiment of the present application is to use a binocular matching network to train on cross-modal synthetic data, and to supervise the monocular depth estimation network.
  • the binocular matching network obtains disparity based on the pixel matching relationship between the left and right images, rather than extracting from the semantic features. Therefore, the binocular matching network can well generalize from synthetic data to real data.
  • the method in the embodiment of the present application mainly includes three steps.
  • the binocular matching network is trained with synthetic data to predict occlusion maps and disparity maps from binocular pictures.
  • In the second step, the trained binocular matching network is fine-tuned on real data, in either a supervised or an unsupervised manner.
  • the monocular depth estimation network is trained under the supervision of the binocular matching network fine-tuned with the real data obtained in the second step. In this way, the binocular matching network can be used indirectly to make the monocular depth estimation make better use of synthetic data to improve performance.
  • the first step is to use the synthetic data to train the binocular matching network, including:
  • the graphics rendering engine can generate many synthetic images containing depth information.
  • the performance of training the monocular depth estimation network by directly combining these synthetic image data with real data is usually poor, because the monocular depth estimation is very sensitive to the semantic information of the input scene.
  • the huge modal gap between synthetic and real data makes using synthetic data to aid training useless.
  • the binocular matching network has better generalization ability, and the binocular matching network trained with synthetic data can also get better disparity map output on real data. Therefore, the embodiment of the present application uses binocular matching network training as a bridge between synthetic data and real data to improve the performance of monocular deep training.
  • the binocular matching network in the embodiment estimates a multi-scale occlusion map based on the disparity map.
  • The occlusion map indicates whether the pixel in the right image corresponding to each pixel in the left image is occluded by other objects in the right image.
  • the unsupervised fine-tuning method will use the occlusion map to avoid false estimation.
  • A left-right disparity consistency check can be used to obtain a correctly labeled occlusion map from the correctly labeled disparity maps by using formula (9).
  • In formula (9), the subscript i represents the i-th row in the image,
  • and the subscript j represents the j-th column in the image.
  • D*_{L/R} represents the correctly labeled disparity map of the left or right image,
  • and D*_{wR} is the disparity map of the left view reconstructed (warped) from the right image.
  • The consistency check threshold is set to 1.
  • the occlusion map is 0 in the occluded area and 1 in the non-occluded area. Therefore, this embodiment uses the following formula (10) to calculate the loss (Loss) of training the binocular matching network using synthetic data.
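  • Formula (9) itself is not reproduced above; the sketch below implements a common form of the left-right consistency check under the conventions just stated (threshold 1, occlusion map 0 in occluded areas and 1 in non-occluded areas), and is an illustrative assumption rather than the application's exact formula.

```python
import numpy as np

def occlusion_from_lr_consistency(disp_left, disp_right, threshold=1.0):
    """Left-right disparity consistency check. Each left-image pixel (i, j)
    with disparity d is matched to pixel (i, j - d) in the right image; if
    the two disparities disagree by more than the threshold, the pixel is
    marked as occluded. Returns 1 for visible pixels, 0 for occluded ones."""
    h, w = disp_left.shape
    occlusion = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            jr = int(round(j - disp_left[i, j]))   # corresponding column in the right image
            if 0 <= jr < w and abs(disp_left[i, j] - disp_right[i, jr]) <= threshold:
                occlusion[i, j] = 1.0              # consistent -> visible
    return occlusion
```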
  • The loss function L_stereo is composed of two parts, namely the disparity map estimation error L_disp and the occlusion map estimation error L_occ.
  • The multi-scale intermediate layers of the binocular disparity network also generate disparity and occlusion predictions, and loss weights w_m are applied directly to these multi-scale predictions; the disparity map estimation error and the occlusion map estimation error of each layer are accumulated over the layers, where m denotes the m-th layer.
  • the L1 loss function is used to avoid the influence of outliers, making the training process more robust.
  • Formula (11) represents the occlusion map estimation error L_occ; the binary cross-entropy loss is used, treating the occlusion map prediction as a classification task.
  • In formula (11), N is the total number of pixels in the image, and the formula compares the correctly labeled occlusion map with the occlusion map predicted by the binocular matching network being trained.
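  • Since formulas (10) and (11) are not reproduced above, the sketch below shows one plausible realization of the multi-scale loss as described: an L1 term on the disparity maps plus a binary cross-entropy term on the occlusion maps, summed over scales with weights w_m. It is an illustrative assumption, not the application's code.

```python
import numpy as np

def stereo_training_loss(pred_disps, pred_occs, gt_disp, gt_occ, scale_weights):
    """Multi-scale supervised loss: pred_disps and pred_occs are lists of
    per-scale predictions already resized to the ground-truth resolution."""
    eps = 1e-7
    total = 0.0
    for disp_m, occ_m, w_m in zip(pred_disps, pred_occs, scale_weights):
        l_disp = np.mean(np.abs(disp_m - gt_disp))            # L1 disparity estimation error
        occ_m = np.clip(occ_m, eps, 1.0 - eps)
        l_occ = -np.mean(gt_occ * np.log(occ_m)                # binary cross-entropy on occlusion
                         + (1.0 - gt_occ) * np.log(1.0 - occ_m))
        total += w_m * (l_disp + l_occ)
    return total
```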
  • The second step is to fine-tune, on real data, the trained binocular matching network obtained in the first step, using either a supervised or an unsupervised fine-tuning method; that is, the embodiment of the present application fine-tunes the trained binocular matching network in two ways.
  • The supervised fine-tuning method only uses the multi-scale L1 regression loss function L_stereo-supft, that is, the disparity map estimation error L_disp, to correct the previous pixel matching prediction errors; see formula (12).
  • the results show that with a small amount of supervised data, such as 100 pictures, the binocular matching network can also adapt from synthetic modal data to real modal data.
  • The unsupervised fine-tuning method, by contrast, fine-tunes the trained binocular matching network obtained in the first step using only real binocular pictures without depth markers.
  • the disparity map obtained by the unsupervised fine-tuning method in the prior art is blurred and the performance is poor, as shown in picture 21 in FIG. 2B.
  • This is due to the limitations of unsupervised loss and the ambiguity of matching pixels with only RGB values. Therefore, the embodiment of the present application introduces additional regular term constraints to improve performance.
  • The corresponding occlusion map and disparity map are obtained from the trained binocular matching network (the quantities denoted old in the formulas above), and these two outputs are used to help regularize the training process.
  • For the unsupervised fine-tuning loss function proposed in the embodiment of the present application, that is, the loss function L_stereo-unsupft, refer to the description in the foregoing embodiments.
  • the third step is to train the monocular depth estimation network, including: so far, we have conducted cross-modal training on the binocular matching network with a large amount of synthetic data, and fine-tuned using real data.
  • the embodiment of the present application uses the disparity map predicted by the trained binocular matching network to provide training data.
  • The loss L_mono of the monocular depth estimation is given by formula (13).
  • In formula (13), N is the total number of pixels; one quantity refers to the disparity map output by the monocular depth estimation network, and the other refers to the disparity map output by the trained binocular matching network or, if the trained binocular matching network has been fine-tuned, to the disparity map output by the fine-tuned network.
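  • Formula (13) is not reproduced above; the sketch below shows one plausible form consistent with that description, namely a per-pixel L1 penalty that regresses the monocular prediction toward the binocular network's disparity. It is an illustrative assumption, not the application's code.

```python
import numpy as np

def monocular_supervision_loss(mono_disp, stereo_disp):
    """L_mono sketch: average absolute difference between the disparity map
    predicted by the monocular network and the disparity map predicted by the
    (fine-tuned) binocular matching network, averaged over the N pixels."""
    mono_disp = np.asarray(mono_disp, dtype=np.float64)
    stereo_disp = np.asarray(stereo_disp, dtype=np.float64)
    return np.mean(np.abs(mono_disp - stereo_disp))
```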
  • FIG. 2C is a schematic diagram of visualized depth estimation results according to an embodiment of the present application. As shown in FIG. 2C:
  • the first row shows the inputs of the monocular depth estimation network, that is, three different street-scene pictures;
  • the second row shows the depth data obtained by interpolating the sparse lidar depth map using the nearest-neighbor algorithm, and the following rows show the depth maps corresponding to the three input pictures obtained by three different prior-art monocular depth estimation methods;
  • also shown are the depth maps corresponding to the three input pictures obtained by a monocular depth network supervised directly by the binocular matching network trained with synthetic data in the first step of the embodiment of the present application, namely the pictures labeled 21, 22, and 23;
  • and the depth maps corresponding to the three input pictures obtained by a monocular depth network whose training data is the disparity maps output by the binocular matching network after it has been fine-tuned using the unsupervised loss function proposed in the embodiment of the present application, namely the pictures labeled 24, 25, and 26.
  • The model obtained by the monocular depth estimation method of the embodiment of the present application can capture a more detailed scene structure.
  • FIG. 3 is a schematic structural diagram of a monocular depth estimation apparatus according to an embodiment of the present application.
  • The apparatus 300 includes: an acquisition module 301, an execution module 302, and an output module 303, wherein:
  • the acquisition module 301 is configured to acquire an image to be processed
  • The execution module 302 is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model;
  • the output module 303 is configured to output an analysis result of the image to be processed.
  • the apparatus further includes a third training module configured to supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
  • The apparatus further includes: a first training module configured to train a second binocular matching neural network model based on the obtained synthetic sample data; and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on the obtained real sample data to obtain the first binocular matching neural network model.
  • the apparatus further includes: a first obtaining module configured to obtain a synthesized binocular picture with a depth mark as the synthesized sample data, wherein the synthesized binocular picture includes a synthesized left image And synthetic right image.
  • the first training module includes: a first training unit configured to train a second binocular matching neural network model according to the synthesized binocular picture to obtain a trained second binocular The matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, and the disparity map describes each pixel in the left image and the right image
  • the parallax distance of the corresponding pixel point, the parallax distance is in pixels; the occlusion map describes whether the corresponding pixel point of each pixel point in the left image in the right image is blocked by an object.
  • The apparatus further includes: a construction module configured to construct a virtual 3D scene through a rendering engine; a mapping module configured to map the 3D scene into a binocular picture through two virtual cameras; a second acquisition module configured to acquire depth data of the synthetic binocular picture according to the position and direction used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and a third acquisition module configured to mark the binocular picture according to the depth data to obtain the synthesized binocular picture.
  • the second training module includes: a second training unit configured to perform supervised training on the trained second binocular matching neural network model according to the obtained real binocular data with depth markers, so that The weight of the trained second binocular matching neural network model is adjusted to obtain a first binocular matching neural network model.
  • the second training unit in the second training module is further configured to perform unsupervised training of the second binocular matching neural network model according to the obtained real binocular data without a depth marker. Training to adjust the weight of the trained second binocular matching neural network model to obtain a first binocular matching neural network model.
  • The second training unit in the second training module includes a second training component configured to use a loss function to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • L_rel indicates that the output gradient of the first binocular matching network model is constrained to be consistent with the output gradient of the trained second binocular matching network model.
  • λ1 and λ2 represent intensity coefficients.
  • The apparatus further includes a second determining module configured to determine the reconstruction error by using formula (15) or formula (16). In these formulas, N is the number of pixels in the picture, ij represents the pixel coordinates of a pixel, and the quantities involved are: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth markers; the pixel values of the picture synthesized after sampling the right image (the reconstructed left picture) and of the picture synthesized after sampling the left image (the reconstructed right picture); and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers.
  • The apparatus further includes a third determining module configured to determine, using formula (17) or formula (18), that the disparity map output by the first binocular matching network model differs only slightly from the disparity map output by the trained second binocular matching network model. The quantities involved are the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image in the sample data, and λ3 represents an intensity coefficient.
  • The apparatus further includes a fourth determining module configured to determine, using formula (19) or formula (20), that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model. The quantities involved are the gradients of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers, and the gradients of the disparity maps output by the trained second binocular matching network model for the left image and for the right image in the sample data.
  • the depth-labeled real binocular data includes a left image and a right image.
  • The third training module includes: a first acquisition unit configured to acquire the left or right image in the depth-labeled real binocular data as a training sample;
  • and a first training unit configured to train the monocular depth estimation network model according to the left or right image in the depth-labeled real binocular data.
  • the true binocular data without a depth mark includes a left image and a right image.
  • The third training module further includes: a second acquisition unit configured to input the real binocular data without depth markers into the first binocular matching neural network model to obtain a corresponding disparity map;
  • and a first determining unit configured to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth markers, and the lens focal length of that camera.
  • The left or right image in the real binocular data without depth markers is used as sample data, and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  • the analysis result of the to-be-processed image includes a disparity map output by the monocular depth estimation network model.
  • The device further includes: a fifth determining module configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • and a first output module configured to output the depth map corresponding to the disparity map.
  • The computer software product is stored in a storage medium and includes several instructions for causing a computing device to execute all or part of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program codes, such as a U disk, a mobile hard disk, a ROM (Read Only Memory, read only memory), a magnetic disk, or an optical disk.
  • An embodiment of the present application provides a monocular depth estimation device.
  • The device includes a memory and a processor.
  • The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method are implemented.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the monocular depth estimation method are implemented.
  • the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and has similar beneficial effects as the method embodiments.
  • For technical details not disclosed in the storage medium and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
  • FIG. 4 is a schematic diagram of a hardware entity of the monocular depth estimation device according to the embodiment of the present application.
  • The hardware entity of the monocular depth estimation device 400 includes: a memory 401, a communication bus 402, and a processor 403.
  • the communication bus 402 may enable the monocular depth estimation device 400 to communicate with other terminals or servers through a network, and may also implement connection and communication between the processor 403 and the memory 401.
  • the processor 403 generally controls the overall operation of the monocular depth estimation apparatus 400.
  • the methods in the above embodiments can be implemented by means of software plus a necessary universal hardware platform, and of course, also by hardware, but in many cases the former is better.
  • Based on such an understanding, the part of the technical solution of the present application that is essential or that contributes to the existing technology can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.


Abstract

Provided in the embodiments of the present application is a method for estimating a monocular depth. The method comprises: acquiring an image to be processed; inputting the image to be processed into a monocular depth estimation network model obtained by means of training, and obtaining an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained by means of a disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed. Further provided in the embodiments of the present application are an apparatus and device for estimating a monocular depth, and a storage medium.

Description

单目深度估计方法及其装置、设备和存储介质Monocular depth estimation method, device, equipment and storage medium thereof
相关申请的交叉引用Cross-reference to related applications
本申请基于申请号为201810496541.6、申请日为2018年05月22日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此以全文引入的方式引入本申请。This application is based on a Chinese patent application with an application number of 201810496541.6 and an application date of May 22, 2018, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated by reference in its entirety. .
技术领域Technical field
本申请实施例涉及人工智能领域,尤其涉及一种单目深度估计方法及其装置、设备和存储介质。Embodiments of the present application relate to the field of artificial intelligence, and in particular, to a monocular depth estimation method and a device, device, and storage medium thereof.
背景技术Background technique
单目深度估计是计算机视觉中的重要问题,单目深度估计的具体任务指的是预测一张图片中每个像素点的深度。其中,由每个像素点的深度值组成的图片又称为深度图。单目深度估计对于自动驾驶中的障碍物检测、三维场景重建,场景立体分析有着重要的意义。另外单目深度估计可以间接地提高其他计算机视觉任务的性能,比如物体检测、目标跟踪与目标识别。Monocular depth estimation is an important issue in computer vision. The specific task of monocular depth estimation is to predict the depth of each pixel in a picture. Among them, a picture composed of the depth value of each pixel is also called a depth map. Monocular depth estimation is of great significance for obstacle detection, three-dimensional scene reconstruction, and three-dimensional scene analysis in autonomous driving. In addition, monocular depth estimation can indirectly improve the performance of other computer vision tasks, such as object detection, target tracking and target recognition.
目前存在的问题是训练用于单目深度估计的神经网络需要大量标记的数据,但是获取标记数据成本很大。在室外环境下标记数据可以通过激光雷达获取,但是获取的标记数据是非常稀疏的,用这样的标记数据训练得到的单目深度估计网络没有清晰的边缘以及不能捕捉细小物体的正确深度信息。The current problem is that training neural networks for monocular depth estimation requires a large amount of labeled data, but obtaining labeled data is costly. In the outdoor environment, the marker data can be obtained by lidar, but the obtained marker data is very sparse. The monocular depth estimation network trained with such marker data has no clear edges and cannot capture the correct depth information of small objects.
发明内容Summary of the Invention
本申请实施例提供一种单目深度估计方法及其装置、设备和存储介质。The embodiments of the present application provide a monocular depth estimation method, an apparatus, a device and a storage medium thereof.
本申请实施例的技术方案是这样实现的:The technical solution of the embodiment of the present application is implemented as follows:
本申请实施例提供一种单目深度估计方法,所述方法包括:获取待处理图像;将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;输出所述待处理图像的分析结果。An embodiment of the present application provides a monocular depth estimation method. The method includes: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis of the image to be processed; As a result, the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model; and the analysis result of the image to be processed is output.
An embodiment of the present application provides a monocular depth estimation apparatus. The apparatus includes an acquisition module, an execution module, and an output module, wherein the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
An embodiment of the present application provides a monocular depth estimation device, including a memory and a processor. The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
本申请实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现本申请实施例提供的单目深度估计方法中的步骤。An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps in the monocular depth estimation method provided by the embodiment of the present application are implemented.
本申请实施例中,通过获取待处理图像;将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;输出所述待处理图像的分析结果;从而能够使用更少或者不使用有深度图标记的数据训练单目深度估计网络,并且提出了一种更有效的无监督微调双目视差网络的方法,从而间接提高了单目深度估计的效果。In the embodiment of the present application, the image to be processed is obtained; the image to be processed is input to a trained monocular depth estimation network model to obtain the analysis result of the image to be processed, wherein the monocular depth estimation network The model is supervised and trained through the disparity map output by the first binocular matching neural network model; the analysis results of the to-be-processed images are output; thus, the monocular depth estimation network can be trained with less or no data marked with a depth map And, a more effective method of unsupervised fine-tuning binocular disparity network is proposed, which indirectly improves the effect of monocular depth estimation.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1A为本申请实施例单目深度估计方法的实现流程示意图一;FIG. 1A is a first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application; FIG.
图1B为本申请实施例单个图片深度估计示意图;FIG. 1B is a schematic diagram of a single picture depth estimation according to an embodiment of the present application; FIG.
图1C为本申请实施例训练第二双目匹配神经网络模型示意图;FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application; FIG.
图1D为本申请实施例训练单目深度估计网络模型示意图;1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application;
图1E为本申请实施例损失函数相关图片示意图;FIG. 1E is a schematic diagram of relevant pictures of a loss function according to an embodiment of the present application; FIG.
图2A为本申请实施例单目深度估计方法的实现流程示意图二;FIG. 2A is a second schematic diagram of an implementation process of a monocular depth estimation method according to an embodiment of the present application; FIG.
图2B为本申请实施例损失函数效果示意图;FIG. 2B is a schematic diagram of an effect of a loss function according to an embodiment of the present application; FIG.
图2C为本申请实施例可视化深度估计结果示意图;2C is a schematic diagram of a visualization depth estimation result according to an embodiment of the present application;
图3为本申请实施例单目深度估计装置的组成结构示意图;3 is a schematic structural diagram of a monocular depth estimation device according to an embodiment of the present application;
图4为本申请实施例单目深度估计设备的一种硬件实体示意图。FIG. 4 is a schematic diagram of a hardware entity of a monocular depth estimation device according to an embodiment of the present application.
具体实施方式Detailed ways
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对申请的具体技术方案做进一步详细描述。以下实施例用于说明本申请,但不用来限制本申请的范围。To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the specific technical solutions of the application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are used to illustrate the present application, but are not intended to limit the scope of the present application.
在后续的描述中,使用用于表示元件的诸如“模块”、“部件”或“单元”的后缀仅为了有利于本申请的说明,其本身没有特定的意义。因此,“模块”、“部件”或“单元”可以混合地使用。In the following description, the use of suffixes such as "module", "component", or "unit" for indicating elements is merely for the benefit of the description of the present application, and it does not have a specific meaning itself. Therefore, "modules," "components," or "units" can be used in combination.
一般地,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。本申请实施例提出的单目深度估计方法使用神经网络训练得到,训练数据来自双目匹配输出的视差图数据,而不需要昂贵的深度采集设备如激光雷达。提供训练数据的双目匹配算法也是通过神经网络实现,该网络通过渲染引擎渲染的大量虚拟双目图片对进行预训练即可达到很好的效果,另外可以在真实数据上再进行微调训练以达到更好的效果。Generally, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. The monocular depth estimation method proposed in the embodiment of the present application is obtained by using neural network training. The training data comes from the disparity map data output by binocular matching, without the need for expensive depth acquisition equipment such as lidar. The binocular matching algorithm that provides training data is also implemented by a neural network. The network can achieve good results by pre-training a large number of virtual binocular image pairs rendered by the rendering engine. In addition, fine-tuning training can be performed on real data to achieve Better results.
下面结合附图和实施例对本申请的技术方案进一步详细阐述。The technical solution of the present application is further described in detail below with reference to the accompanying drawings and embodiments.
本申请实施例提供一种单目深度估计方法,该方法应用于计算设备,该方法所实现的功能可以通过服务器中的处理器调用程序代码来实现,当然程序代码可以保存在计算 机存储介质中,可见,该服务器至少包括处理器和存储介质。图1A为本申请实施例单目深度估计方法的实现流程示意图一,如图1A所示,该方法包括:An embodiment of the present application provides a monocular depth estimation method. The method is applied to a computing device. The functions implemented by the method can be implemented by a processor in a server calling program code. Of course, the program code can be stored in a computer storage medium. It can be seen that the server includes at least a processor and a storage medium. FIG. 1A is a schematic flowchart 1 of a method for implementing a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 1A, the method includes:
步骤S101、获取待处理图像;Step S101: Acquire an image to be processed;
这里,可以由移动终端来获取待处理图像,所述待处理图像,可以包含任意场景的图片。一般来说,移动终端在实施的过程中可以为各种类型的具有信息处理能力的设备,例如所述移动终端可以包括手机、个人数字助理(Personal Digital Assistant,PDA)、导航仪、数字电话、视频电话、智能手表、智能手环、可穿戴设备、平板电脑等。服务器在实现的过程中可以是移动终端如手机、平板电脑、笔记本电脑,固定终端如个人计算机和服务器集群等具有信息处理能力的计算设备。Here, an image to be processed may be acquired by a mobile terminal, and the image to be processed may include a picture of an arbitrary scene. Generally speaking, a mobile terminal may be various types of devices with information processing capabilities during the implementation process. For example, the mobile terminal may include a mobile phone, a Personal Digital Assistant (PDA), a navigator, a digital phone, Video phones, smart watches, smart bracelets, wearables, tablets, etc. The server may be a computing device with information processing capabilities such as a mobile terminal, such as a mobile phone, a tablet computer, a notebook computer, and a fixed terminal such as a personal computer and a server cluster.
步骤S102、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S102: Input the to-be-processed image to a trained monocular depth estimation network model to obtain an analysis result of the to-be-processed image, wherein the monocular depth estimation network model is matched by a first binocular matching nerve The disparity map output by the network model is used for supervised training;
本申请实施例中,所述单目深度估计网络模型主要是通过以下三个步骤获取的:第一步是使用渲染引擎渲染的合成双目数据预训练一个双目匹配神经网络;第二步是使用真实场景的数据对第一步得到的双目匹配神经网络进行微调训练;第三步是使用第二步得到的双目匹配神经网络对单目深度估计网络提供监督,从而训练得到单目深度估计网络。现有技术中,单目深度估计一般使用大量的真实标记数据进行训练,或者使用无监督的方法训练单目深度估计网络。但是,大量的真实标记数据获取成本很高,直接用无监督的方法训练单目深度估计网络又无法处理遮挡区域的深度估计,得到的效果较差。而本申请中所述单目深度估计网络模型的样本数据来自第一双目匹配神经网络模型输出的视差图,也就是说,本申请利用了双目视差来指导单目深度的预测。因此,本申请中的方法无需大量的标记数据,并且可以得到较好的训练效果。In the embodiment of the present application, the monocular depth estimation network model is mainly obtained through the following three steps: the first step is to pre-train a binocular matching neural network using synthetic binocular data rendered by the rendering engine; the second step is Use the real-world data to fine-tune the binocular matching neural network obtained in the first step; the third step is to use the binocular matching neural network obtained in the second step to provide supervision on the monocular depth estimation network, thereby training to obtain the monocular depth Estimate the network. In the prior art, monocular depth estimation generally uses a large amount of real labeled data for training, or uses an unsupervised method to train a monocular depth estimation network. However, the acquisition cost of a large amount of real labeled data is very high. Training the monocular depth estimation network directly using an unsupervised method cannot process the depth estimation of the occluded area, and the obtained result is poor. The sample data of the monocular depth estimation network model described in this application comes from the disparity map output by the first binocular matching neural network model, that is, this application uses binocular disparity to guide the prediction of the monocular depth. Therefore, the method in the present application does not require a large amount of labeled data, and can obtain better training results.
步骤S103、输出所述待处理图像的分析结果。这里,所述待处理图像的分析结果,指的是所述待处理图像对应的深度图。获取待处理图像后,将所述待处理图像输入至经过训练得到的单目深度估计网络模型,所述单目深度估计网络模型一般输出的是所述待处理图像对应的视差图,而不是深度图;因此,还需要根据所述单目深度估计网络模型输出的视差图、拍摄待处理图像的摄像机的镜头基线距离和拍摄待处理图像的摄像机的镜头焦距,确定所述待处理图像对应的深度图。Step S103: Output the analysis result of the image to be processed. Here, the analysis result of the image to be processed refers to a depth map corresponding to the image to be processed. After obtaining the image to be processed, the image to be processed is input to a trained monocular depth estimation network model, and the monocular depth estimation network model generally outputs a disparity map corresponding to the image to be processed instead of depth. Therefore, it is also necessary to determine the depth corresponding to the image to be processed according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captures the image to be processed, and the lens focal length of the camera that captures the image to be processed. Illustration.
图1B为本申请实施例单个图片深度估计示意图,如图1B所示,标号为11的图片11为待处理图像,标号为12的图片12为标号为11的图片11对应的深度图。FIG. 1B is a schematic diagram of the depth estimation of a single picture in the embodiment of the present application. As shown in FIG. 1B, the picture 11 with the number 11 is the image to be processed, the picture with the number 12 is the depth map corresponding to the picture 11 with the number 11.
在实际应用中,可以将所述镜头基线距离和所述镜头焦距的乘积,与所述输出的待处理图像对应的视差图的比值,确定为所述待处理图像对应的深度图。In practical applications, the product of the baseline distance of the lens and the focal length of the lens, and the ratio of the disparity map corresponding to the output image to be processed may be determined as the depth map corresponding to the image to be processed.
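As a concrete illustration of this conversion, the sketch below (a minimal example assuming the disparity map is given in pixels and the camera baseline and focal length are known; the function name and the numbers are illustrative, not taken from the application) computes depth as baseline × focal length / disparity:

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px, eps=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres).

    depth = baseline * focal_length / disparity, applied per pixel.
    eps guards against division by zero where disparity is missing.
    """
    disparity = np.asarray(disparity, dtype=np.float32)
    return baseline_m * focal_px / np.maximum(disparity, eps)

# Example with made-up numbers: a 2x2 disparity map, 0.5 m baseline, 700 px focal length.
disp = np.array([[35.0, 17.5], [7.0, 70.0]])
print(disparity_to_depth(disp, baseline_m=0.5, focal_px=700.0))
# [[10. 20.] [50.  5.]]
```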
基于上述方法实施例,本申请实施例再提供一种单目深度估计方法,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method, which includes:
步骤S111、获取有深度标记的合成的双目图片作为合成样本数据,其中,所述合成的双目图片包括合成的左图和合成的右图;Step S111: Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image;
在一些实施例中,所述方法还包括:步骤S11、通过渲染引擎构造虚拟3D场景; 步骤S12、通过两个虚拟的摄像机将所述3D场景映射成双目图片;步骤S13、根据构造所述虚拟3D场景时的位置、构造所述虚拟3D场景时的方向和所述虚拟的摄像机的镜头焦距获取所述合成双目图片的深度数据;步骤S14、根据所述深度数据标记所述双目图片,得到所述合成的双目图片。In some embodiments, the method further includes: step S11, constructing a virtual 3D scene through a rendering engine; step S12, mapping the 3D scene into a binocular picture through two virtual cameras; step S13, according to constructing the Obtain the depth data of the synthesized binocular picture by the position during the virtual 3D scene, the direction when constructing the virtual 3D scene, and the lens focal length of the virtual camera; step S14, marking the binocular picture according to the depth data To obtain the synthesized binocular picture.
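As an illustration of step S13, the depth of a scene point seen by a virtual camera is its z-coordinate in that camera's frame; below is a minimal sketch (assuming a pinhole camera with a known world-to-camera rotation; all names and values are illustrative and not part of the application):

```python
import numpy as np

def depth_from_virtual_camera(points_world, cam_position, cam_rotation):
    """Per-point depth seen by a virtual camera.

    points_world: (N, 3) 3D points of the virtual scene.
    cam_position: (3,) camera centre in world coordinates.
    cam_rotation: (3, 3) world-to-camera rotation matrix.
    Depth is the z-coordinate of each point in the camera frame.
    """
    points_cam = (np.asarray(points_world) - np.asarray(cam_position)) @ np.asarray(cam_rotation).T
    return points_cam[:, 2]

# Illustrative scene: three points in front of a camera at the origin looking along +z.
pts = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 7.5], [-2.0, 1.0, 12.0]])
print(depth_from_virtual_camera(pts, cam_position=np.zeros(3), cam_rotation=np.eye(3)))
# [ 4.   7.5 12. ]
```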
步骤S112、根据获取的合成样本数据训练第二双目匹配神经网络模型;Step S112: Train a second binocular matching neural network model according to the obtained synthetic sample data;
Here, in actual application, step S112 may be implemented by the following step: step S1121, training a second binocular matching neural network model according to the synthesized binocular pictures to obtain a trained second binocular matching neural network model, wherein the outputs of the trained second binocular matching neural network model are a disparity map and an occlusion map; the disparity map describes the disparity distance, in pixels, between each pixel in the left image and its corresponding pixel in the right image; and the occlusion map describes whether the pixel in the right image corresponding to each pixel in the left image is occluded by an object.
图1C为本申请实施例训练第二双目匹配神经网络模型示意图,如图1C所示,标号为11的图片11为合成的双目图片的左图,标号为12的图片12为合成的双目图片的右图,I L为标号为11的左图图片11中包含的所有像素点的像素值,I R为标号为12的右图图片12中包含的所有像素点的像素值;标号为13的图片13为第二双目匹配神经网络模型经过训练后输出的遮挡图,标号为14的图片14为第二双目匹配神经网络模型经过训练后输出的视差图,标号为15的图片15为第二双目匹配神经网络模型。 FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application. As shown in FIG. 1C, a picture labeled 11 is a left view of a synthesized binocular picture, and a picture labeled 12 is a synthesized binocular picture. In the right picture of the target picture, I L is the pixel value of all the pixels contained in picture 11 on the left picture labeled 11 and I R is the pixel value of all pixels contained in picture 12 on the right picture labeled 12; Picture 13 is the occlusion map of the second binocular matching neural network model after training, picture 14 is the picture 14 is the disparity map of the second binocular matching neural network model after training, picture 15 is the picture 15 Match the neural network model for the second binocular.
步骤S113、根据获取的真实样本数据对训练后的第二双目匹配神经网络模型的参数进行调整,得到第一双目匹配神经网络模型;Step S113: Adjust the parameters of the trained second binocular matching neural network model according to the obtained real sample data to obtain a first binocular matching neural network model;
这里,所述步骤S113可以通过两种方式实现,其中,第一种实现方式按照以下步骤实现:步骤S1131a、根据获取的带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。这里,获取的是带有深度标记的真实双目数据,这样,就可以直接用带有深度标记的真实双目数据,对步骤S112中训练后的第二双目匹配神经网络进行监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,进一步提高训练后的第二双目匹配神经网络模型的效果,得到第一双目匹配神经网络模型。在这一部分中,双目视差网络需要对真实数据进行适配。可以使用真实的带有深度标记的双目数据,通过有监督的训练对双目视差网络直接进行微调训练调整网络权值。第二种实现方式按照以下步骤实现:步骤S1131b、根据获取的不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。本申请实施例中,还可以使用不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。这里无监督训练指的是在没有深度数据标记的情况下,仅仅使用双目数据进行训练,可以使用无监督微调方法对此过程进行实现。Here, the step S113 can be implemented in two ways, wherein the first implementation method is implemented according to the following steps: step S1131a, the trained second binocular matching neural network is obtained according to the obtained real binocular data with depth markers The model undergoes supervised training to adjust the weight of the trained second binocular matching neural network model to obtain a first binocular matching neural network model. Here, the real binocular data with the depth marker is acquired. In this way, the real binocular data with the depth marker can be directly used to supervise the training of the second binocular matching neural network trained in step S112 to The weight of the trained second binocular matching neural network model is adjusted to further improve the effect of the trained second binocular matching neural network model to obtain a first binocular matching neural network model. In this part, the binocular parallax network needs to adapt to the real data. You can use real binocular data with depth markers to directly fine-tune the binocular disparity network through supervised training to adjust the network weights. The second implementation manner is implemented according to the following steps: step S1131b, performing unsupervised training on the trained second binocular matching neural network model according to the obtained real binocular data without depth marking, so as to adjust the trained first binocular matching neural network model. The two binocular matching neural network models are weighted to obtain the first binocular matching neural network model. In the embodiment of the present application, it is also possible to perform unsupervised training on the trained second binocular matching neural network model using real binocular data without depth marking, so as to adjust the trained second binocular matching neural network model. To get the first binocular matching neural network model. Here, unsupervised training refers to training using only binocular data without deep data marking, and this process can be implemented using unsupervised fine-tuning methods.
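For the supervised route of step S113 (step S1131a), a minimal sketch of the fine-tuning loss is given below, assuming PyTorch and assuming that the real depth labels are sparse (for example, projected lidar points converted to disparity) so that only labelled pixels contribute; all names are illustrative:

```python
import torch

def supervised_finetune_loss(pred_disp, gt_disp, valid_mask):
    """Masked L1 loss for supervised fine-tuning on real labelled pairs.

    gt_disp may be sparse; valid_mask is 1 where a label exists, 0 elsewhere.
    """
    diff = (pred_disp - gt_disp).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)

pred = torch.rand(1, 1, 8, 16) * 50
gt = torch.rand(1, 1, 8, 16) * 50
mask = (torch.rand(1, 1, 8, 16) > 0.9).float()   # sparse labels
print(supervised_finetune_loss(pred, gt, mask).item())
```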
步骤S114、通过所述第一双目匹配神经网络模型输出的视差图对单目深度估计网络 模型进行监督,从而训练所述单目深度估计网络模型;Step S114: Supervise the monocular depth estimation network model through the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model;
这里,所述步骤S114以通过两种方式实现,其中,第一种实现方式按照以下步骤实现:步骤S1141a、获取所述带深度标记的真实双目数据中的左图或右图作为训练样本,其中,所述带深度标记的真实双目数据包括左图和右图;步骤S1142a、根据所述带深度标记的真实双目数据中的左图或右图对单目深度估计网络模型进行训练。这里,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。因此,可以根据所述带深度标记的真实双目数据中的左图或右图对单目深度估计网络模型进行训练,其中,所述带深度标记的真实双目数据为步骤S1131a中使用的带深度标记的真实双目数据。第二种实现方式按照以下步骤实现:步骤S1141b、所述不带深度标记的真实双目数据输入到所述第一双目匹配神经网络模型,得到对应的视差图,其中,所述不带深度标记的真实双目数据包括左图和右图;步骤S1142b、根据所述对应的视差图、拍摄所述不带深度标记的真实双目数据的摄像机的镜头基线距离和拍摄所述不带深度标记的真实双目数据的摄像机的镜头焦距,确定所述视差图对应的深度图;步骤S1143b、所述不带深度标记的真实双目数据中的左图或右图作为样本数据,根据所述视差图对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型。这里,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。因此,可以根据步骤S1131b中使用的不带深度标记的真实双目数据中的左图或右图作为样本数据,也是步骤S1141b中使用的不带深度标记的真实双目数据中的左图或右图作为样本数据,根据步骤S1141b中输出的视差图对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,得到训练后的单目深度估计网络模型。Here, step S114 is implemented in two ways, wherein the first implementation way is implemented according to the following steps: step S1141a, acquiring the left or right image in the real binocular data with depth mark as a training sample, The depth-labeled real binocular data includes a left image and a right image; step S1142a, a monocular depth estimation network model is trained according to the left or right image in the depth-labeled real binocular data. Here, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. Therefore, the monocular depth estimation network model may be trained according to the left or right image in the depth-labeled real binocular data, where the depth-labeled real binocular data is the band used in step S1131a. Deeply marked true binocular data. The second implementation manner is implemented according to the following steps: Step S1141b, the true binocular data without depth marking is input to the first binocular matching neural network model to obtain a corresponding disparity map, wherein the without disparity map The labeled true binocular data includes left and right images; step S1142b, according to the corresponding disparity map, a lens baseline distance of a camera that captures the true binocular data without a depth marker, and captures the image without the depth marker. The focal length of the camera of the real binocular data to determine the depth map corresponding to the parallax map; step S1143b, the left or right image in the real binocular data without the depth mark is used as sample data, and according to the parallax The depth map corresponding to the graph supervises the monocular depth estimation network model, thereby training the monocular depth estimation network model. Here, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. Therefore, the left image or the right image in the real binocular data without the depth mark used in step S1131b can be taken as the sample data, or the left image or the right in the real binocular data without the depth mark used in step S1141b. The map is used as sample data, and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map output in step S1141b, so that the monocular depth estimation network model is trained, and the trained monocular depth estimation network model is obtained.
图1D为本申请实施例训练单目深度估计网络模型示意图,如图1D所示,图(a)表示了将不带深度标记的真实双目数据输入到所述第一双目匹配神经网络模型,得到对应的标号为13的视差图图片13,其中,所述不带深度标记的真实双目数据包括标号为11的左图图片11和标号为12的右图图片12,标号为15的图片15为第一双目匹配神经网络模型。图1D中的图(b)表示了将所述不带深度标记的真实双目数据中的左图或右图作为样本数据,根据所述标号为13的视差图图片13对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,其中所述样本数据经过所述单目深度估计网络模型的输出为标号为14的视差图图片14,标号为16的图片16为单目深度估计网络模型。FIG. 1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application. As shown in FIG. 1D, FIG. 1A shows inputting real binocular data without a depth marker to the first binocular matching neural network model. To obtain the corresponding parallax map picture 13 labeled 13, where the true binocular data without the depth mark includes a left picture 11 labeled 11 and a right picture 12 labeled 12 and a picture 15 labeled 15 is the first binocular matching neural network model. The figure (b) in FIG. 1D shows that the left or right image in the real binocular data without the depth mark is used as the sample data, and the depth map corresponding to the disparity map picture 13 labeled 13 is compared with the single image. The mesh depth estimation network model is supervised, thereby training the monocular depth estimation network model, wherein the output of the sample data after passing through the monocular depth estimation network model is a parallax map picture 14 labeled 14 and a picture labeled 16 16 is a monocular depth estimation network model.
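To make step S114 concrete, here is a minimal sketch (assuming PyTorch; the tiny network and random tensors are illustrative stand-ins, not the architecture of the application) of one training step in which the monocular network is supervised by the disparity predicted by the binocular matching network:

```python
import torch
import torch.nn as nn

class TinyMonoDepthNet(nn.Module):
    """Stand-in monocular network: image -> per-pixel disparity (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # keep disparity positive
        )
    def forward(self, x):
        return self.net(x)

def distillation_step(mono_net, optimizer, left_image, stereo_disparity):
    """One training step: the stereo network's disparity acts as the supervision signal."""
    optimizer.zero_grad()
    pred = mono_net(left_image)
    loss = torch.mean(torch.abs(pred - stereo_disparity))  # L1 against the stereo prediction
    loss.backward()
    optimizer.step()
    return loss.item()

mono = TinyMonoDepthNet()
opt = torch.optim.Adam(mono.parameters(), lr=1e-4)
left = torch.rand(2, 3, 32, 64)               # batch of left images
stereo_disp = torch.rand(2, 1, 32, 64) * 40   # disparity predicted by the stereo network
print(distillation_step(mono, opt, left, stereo_disp))
```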
步骤S115、获取待处理图像;Step S115: Obtain an image to be processed;
这里,在得到训练后的单目深度估计网络模型后,就可以使用此单目深度估计网络模型。即利用此单目深度估计网络模型,获取待处理图像对应的深度图。Here, after obtaining the trained monocular depth estimation network model, this monocular depth estimation network model can be used. That is, using this monocular depth estimation network model, a depth map corresponding to the image to be processed is obtained.
步骤S116、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S116: The image to be processed is input to a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is matched by a first binocular matching neural network. The disparity map output by the network model is used for supervised training;
步骤S117、输出所述待处理图像的分析结果,其中,所述待处理图像的分析结果包括所述单目深度估计网络模型输出的视差图;Step S117: Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model;
步骤S118、根据所述单目深度估计网络模型输出的视差图、拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头基线距离和拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头焦距,确定所述视差图对应的深度图;Step S118: According to the disparity map output by the monocular depth estimation network model, a lens baseline distance of a camera that takes a picture of the monocular depth estimation network model and a camera that takes a picture of the monocular depth estimation network model The focal length of the lens to determine the depth map corresponding to the parallax map;
步骤S119、输出所述视差图对应的深度图。Step S119: Output a depth map corresponding to the disparity map.
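Steps S115 to S119 can be summarized by a short inference sketch (assuming PyTorch, a trained model that outputs disparity in pixels, and known camera parameters; the stand-in network and numbers are illustrative):

```python
import torch

def estimate_depth(mono_net, image_bchw, baseline_m, focal_px, eps=1e-6):
    """Steps S115-S119 in miniature: image -> disparity -> depth."""
    mono_net.eval()
    with torch.no_grad():
        disparity = mono_net(image_bchw)                    # analysis result: disparity map
    depth = baseline_m * focal_px / torch.clamp(disparity, min=eps)
    return disparity, depth

# Illustrative stand-in network so the sketch runs end to end.
demo_net = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1), torch.nn.Softplus())
disp, depth = estimate_depth(demo_net, torch.rand(1, 3, 32, 64), baseline_m=0.5, focal_px=700.0)
print(disp.shape, depth.shape)
```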
基于上述方法实施例,本申请实施例再提供一种单目深度估计方法,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method, which includes:
步骤S121、获取有深度标记的合成的双目图片作为合成样本数据,其中,所述合成的双目图片包括合成的左图和合成的右图。Step S121: Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image.
步骤S122、根据获取的合成样本数据训练第二双目匹配神经网络模型;Step S122: Train a second binocular matching neural network model according to the obtained synthetic sample data;
这里,使用合成数据用于训练第二双目匹配神经网络模型具有更好的泛化能力。Here, using synthetic data for training the second binocular matching neural network model has better generalization ability.
Step S123: Determine the loss function by using formula (1): L_stereo-unsupft = L_photo + γ_1·L_abs + γ_2·L_rel (1); where L_stereo-unsupft denotes the loss function proposed in the embodiment of the present application; L_photo denotes the reconstruction error; L_abs constrains the disparity map output by the first binocular matching network model to deviate only slightly from the disparity map output by the trained second binocular matching network model; L_rel constrains the output gradient of the first binocular matching network model to be consistent with the output gradient of the trained second binocular matching network model; and γ_1 and γ_2 are intensity coefficients. Here, L_abs and L_rel are regularization terms.
在一些实施例中,步骤S123中的公式(1)还可以通过以下步骤中的公式进行细化,即所述方法还包括:步骤S1231、利用公式(2)或公式(3)确定所述重建误差:
Figure PCTCN2019076247-appb-000001
其中,所述N表示图片中像素的个数;所述
Figure PCTCN2019076247-appb-000002
表示所述训练后的第二双目匹配网络模型输出的遮挡图的像素值;所述
Figure PCTCN2019076247-appb-000003
表示不带深度标记的真实双目数据中的左图的像素值;所述
Figure PCTCN2019076247-appb-000004
表示不带深度标记的真实双目数据中的右图的像素值;所述
Figure PCTCN2019076247-appb-000005
表示将右图采样后合成的图片的像素值,即重建的左图;所述
Figure PCTCN2019076247-appb-000006
表示将左图采样后合成的图片的像素值,即重建的右图;所述
Figure PCTCN2019076247-appb-000007
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络模型输出的视差图的像素值;所述
Figure PCTCN2019076247-appb-000008
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络模型输出的视差图的像素值;所述ij表示像素点的像素坐标;所述old表示训练后的第二双目匹配网络模型的输出;所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据;所述I表示图片像素点的RGB(Red Green Blue,红色、绿色和蓝色)值。步骤S1232、利用公式(4)或公式(5)确定所述第一双目匹配网络模型输出的视差图与所述训练后的第二双目匹配网络模型输出的视差图相比偏离较小:
Figure PCTCN2019076247-appb-000009
其中,所述N表示图片中像素的个数,所述
Figure PCTCN2019076247-appb-000010
表示所述训练后的第二双目匹配网络模型输出的遮挡图的像素值,所述
Figure PCTCN2019076247-appb-000011
表示样本数据中的左图经训练后的第二双目匹配网络输出的视差图的像素值,所述
Figure PCTCN2019076247-appb-000012
表示样本数据中的右图经训练后的第二双目匹配网络输出的视差图 的像素值,所述
Figure PCTCN2019076247-appb-000013
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络输出的视差图的像素值,所述
Figure PCTCN2019076247-appb-000014
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络输出的视差图的像素值,所述ij表示像素点的像素坐标,所述old表示训练后的第二双目匹配网络模型的输出,所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据,所述γ 3表示强度系数。步骤S1233、利用公式(6)或公式(7)确定所述第一双目匹配网络模型的输出梯度与所述第二双目匹配网络模型的输出梯度一致:
Figure PCTCN2019076247-appb-000015
其中,所述N表示图片中像素的个数,所述
Figure PCTCN2019076247-appb-000016
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000017
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000018
表示样本数据中的左图经训练后的第二双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000019
表示样本数据中的右图经训练后的第二双目匹配网络输出的视差图的梯度,所述old表示训练后的第二双目匹配网络模型的输出,所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据。
In some embodiments, the formula (1) in step S123 can also be refined by the formula in the following step, that is, the method further includes: step S1231, determining the reconstruction using formula (2) or formula (3) error:
Figure PCTCN2019076247-appb-000001
Where N is the number of pixels in the picture;
Figure PCTCN2019076247-appb-000002
Pixel values of the occlusion map output by the trained second binocular matching network model;
Figure PCTCN2019076247-appb-000003
Represents the pixel value of the left image in true binocular data without a depth marker; said
Figure PCTCN2019076247-appb-000004
Represents the pixel value of the right image in true binocular data without a depth marker; said
Figure PCTCN2019076247-appb-000005
Represents the pixel value of the picture synthesized after sampling the right picture, that is, the reconstructed left picture; said
Figure PCTCN2019076247-appb-000006
Represents the pixel value of the picture synthesized after sampling the left picture, that is, the reconstructed right picture; said
Figure PCTCN2019076247-appb-000007
Represents the pixel value of the disparity map output by the first binocular matching network model of the left image in the real binocular data without the depth mark; said
Figure PCTCN2019076247-appb-000008
Represents the pixel value of the disparity map output by the first binocular matching network model from the right image in the real binocular data without depth mark; the ij represents the pixel coordinates of the pixel; the old represents the second bin after training The output of the target matching network model; the R represents the relevant data of the right or right picture, the L represents the relevant data of the left or left picture; and the I represents RGB (Red Green Blue, red, green) And blue) values. Step S1232, using formula (4) or formula (5) to determine that the disparity map output by the first binocular matching network model is smaller than the disparity map output by the trained second binocular matching network model:
Figure PCTCN2019076247-appb-000009
Where N is the number of pixels in the picture, and
Figure PCTCN2019076247-appb-000010
Pixel values of the occlusion map output by the trained second binocular matching network model, said
Figure PCTCN2019076247-appb-000011
Represents the pixel values of the disparity map output by the second binocular matching network after training on the left image in the sample data, said
Figure PCTCN2019076247-appb-000012
Represents the pixel values of the disparity map output by the second binocular matching network after training on the right in the sample data, said
Figure PCTCN2019076247-appb-000013
Represents the pixel values of the disparity map output by the left image via the first binocular matching network in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000014
Represents the pixel value of the disparity map output by the first binocular matching network from the right image in the real binocular data without depth mark, the ij represents the pixel coordinates of the pixel, and the old represents the second binocular after training The output of the matching network model, where R represents the data on the right or right, L represents the data on the left or left, and γ 3 represents the intensity coefficient. Step S1233: Use formula (6) or formula (7) to determine that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model:
Figure PCTCN2019076247-appb-000015
Where N is the number of pixels in the picture, and
Figure PCTCN2019076247-appb-000016
Represents the gradient of the disparity map output by the left image via the first binocular matching network in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000017
Represents the gradient of the disparity map output by the first binocular matching network from the right image in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000018
Represents the gradient of the disparity map output by the second binocular matching network after training on the left image in the sample data, said
Figure PCTCN2019076247-appb-000019
Represents the gradient of the disparity map output by the trained second binocular matching network from the right image in the sample data, the old represents the output of the trained second binocular matching network model, and R represents the right or right The relevant data, where L represents the left picture or the relevant data of the left picture.
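Following the reconstruction of formulas (1) to (7) above, the sketch below shows how the unsupervised fine-tuning loss could be assembled (a minimal sketch assuming PyTorch, that the reconstructed left image and the pre-trained "old" disparity and occlusion maps are already available, and that the γ_3 weighting follows the form given above; all names and coefficient values are illustrative, not fixed by the application):

```python
import torch

def unsup_finetune_loss(left, left_recon, disp_new, disp_old, occ_old,
                        gamma1=0.5, gamma2=1.0, gamma3=0.1):
    """L_stereo-unsupft = L_photo + gamma1 * L_abs + gamma2 * L_rel (left-image variant).

    left, left_recon  : (B,3,H,W) original left image and left image warped from the right
    disp_new, disp_old: (B,1,H,W) disparity from the network being tuned / the pre-trained one
    occ_old           : (B,1,H,W) pre-trained occlusion map, 1 = visible, 0 = occluded
    """
    # (2) photometric term, counted only where the pixel is visible in the right image
    l_photo = (occ_old * (left - left_recon).abs()).mean()

    # (4) anchor to the old disparity: full weight on occluded pixels, gamma3 elsewhere
    w_abs = (1.0 - occ_old) + gamma3 * occ_old
    l_abs = (w_abs * (disp_new - disp_old).abs()).mean()

    # (6) keep the disparity gradients close to the old prediction
    def grad_xy(d):
        return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]
    gx_n, gy_n = grad_xy(disp_new)
    gx_o, gy_o = grad_xy(disp_old)
    l_rel = (gx_n - gx_o).abs().mean() + (gy_n - gy_o).abs().mean()

    return l_photo + gamma1 * l_abs + gamma2 * l_rel

loss = unsup_finetune_loss(torch.rand(1, 3, 16, 32), torch.rand(1, 3, 16, 32),
                           torch.rand(1, 1, 16, 32), torch.rand(1, 1, 16, 32),
                           (torch.rand(1, 1, 16, 32) > 0.2).float())
print(loss.item())
```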
步骤S124、使用损失函数(Loss),根据所述不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。Step S124: Use a loss function (Loss) to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without the depth marker to adjust the trained second binocular Match the weights of the neural network model to get the first binocular matching neural network model.
这里,所述损失函数(Loss)利用了步骤S122中训练后的第二双目匹配神经网络的输出对微调训练进行正则化,避免了现有技术中的无监督微调普遍存在的预测变模糊的问题,提高了微调得到的第一双目匹配网络的效果,从而间接提高了第一双目匹配网络监督得到的单目深度网络的效果。图1E为本申请实施例损失函数相关图片示意图,如图1E所示,图(a)为不带深度标记的真实双目数据的左图;图1E中的图(b)为不带深度标记的真实双目数据的右图;图1E中的图(c)为将图(a)和图(b)组成的不带深度标记的真实双目图片输入至经过训练后的第二双目匹配神经网络模型输出的视差图;图1E中的图(d)为将图(b)表示的右图进行采样后,结合图(c)表示的视差图,对左图进行重建后的图片;图1E中的图(e)为将图(a)表示的左图中的像素与图(d)表示的重建后的左图中的对应像素做差得到的图片,即左图的重建误差图;图1E中的图(f)为将图(a)和图(b)组成的不带深度标记的真实双目图片输入至经过训练后的第二双目匹配神经网络模型输出的遮挡图。其中,图(d)中所有的红框11表示所述重建后的左图与图(a)标识的真实左图有差异的部分,图(e)中所有的红框12表示所述重建误差图中有误差的部分,即被遮挡的部分。这里,实现步骤S124中描述的用无监督微调训练双目视差网络时,需要使用右图对左图进行重建,但是有遮挡区域是无法重建正确的,因此,用遮挡图来清理这一部分的错误训练信号来提高无监督微调训练的效果。Here, the loss function (Loss) uses the output of the second binocular matching neural network after training in step S122 to regularize the fine-tuning training, avoiding the unpredictable, ubiquitous predictions commonly found in unsupervised fine-tuning in the prior art. The problem improves the effect of the first binocular matching network obtained by fine-tuning, thereby indirectly improving the effect of the monocular deep network obtained by the supervision of the first binocular matching network. FIG. 1E is a schematic diagram of a related picture of a loss function according to an embodiment of the present application. As shown in FIG. 1E, FIG. 1A is a left image of real binocular data without a depth marker; FIG. 1E is a graph without a depth marker. The right image of the real binocular data; Figure (c) in Figure 1E is the real binocular image without depth mark composed of Figures (a) and (b) input to the trained second binocular match Parallax map output by the neural network model; Figure (d) in Figure 1E is a picture after reconstructing the left picture after sampling the right picture shown in Figure (b) and combining the parallax map shown in Figure (c); Figure (e) in 1E is a picture obtained by making a difference between the pixels in the left image shown in (a) and the corresponding pixels in the reconstructed left image shown in (d), that is, the reconstruction error map of the left; Figure (f) in FIG. 1E is an occlusion map inputting a real binocular image without depth mark composed of the figures (a) and (b) to the output of a trained second binocular matching neural network model. Among them, all the red boxes 11 in the figure (d) indicate the parts where the reconstructed left picture is different from the real left picture identified in the figure (a), and all the red boxes 12 in the figure (e) show the reconstruction errors. There is an error in the picture, that is, the part that is blocked. Here, when training the binocular disparity network with unsupervised fine-tuning described in step S124, the left image needs to be reconstructed using the right image, but the occluded area cannot be reconstructed correctly. Therefore, the occlusion image is used to clear this part of the error Training signals to improve the effect of unsupervised fine-tuning training.
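The reconstruction of the left image from the right image, whose error the occlusion map then masks, can be sketched as follows (a minimal sketch assuming PyTorch's grid_sample, a rectified stereo pair, and disparity expressed in pixels; names are illustrative):

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp_left):
    """Reconstruct the left image by sampling the right image at x - d(x).

    right: (B,3,H,W) right image; disp_left: (B,1,H,W) left disparity in pixels.
    Occluded regions cannot be reconstructed correctly, which is why the occlusion
    map is used to remove their reconstruction error from the training signal.
    """
    _, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.unsqueeze(0) - disp_left[:, 0]          # shift each pixel left by its disparity
    ys = ys.unsqueeze(0).expand_as(xs)
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(right, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)

recon_left = warp_right_to_left(torch.rand(1, 3, 16, 32), torch.full((1, 1, 16, 32), 2.0))
print(recon_left.shape)  # torch.Size([1, 3, 16, 32])
```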
步骤S125、通过所述第一双目匹配神经网络模型输出的视差图对所述单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型。Step S125: Supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
这里,所述单目深度估计网络模型的样本图片,可以是不带深度标记的真实双目数据中的左图,也可以是不带深度标记的真实双目数据中的右图。其中,如果使用左图作为样本图片,则通过公式(1)、公式(2)、公式(4)和公式(6)来确定损失函数;如果使用右图作为样本图片,则通过公式(1)、公式(3)、公式(5)和公式(7)来确定损失函数。Here, the sample picture of the monocular depth estimation network model may be a left image in real binocular data without a depth marker, or a right image in real binocular data without a depth marker. Among them, if the left picture is used as a sample picture, the loss function is determined by formula (1), formula (2), formula (4), and formula (6); if the right picture is used as a sample picture, then formula (1) , Formula (3), formula (5) and formula (7) to determine the loss function.
本申请实施例中,所述通过所述第一双目匹配神经网络模型输出的视差图对所述单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,指的是通过所述第一双目匹配神经网络模型输出的视差图对应的深度图对所述单目深度估计网络模型进行监督,也即使提供监督信息,从而训练所述单目深度估计网络模型。In the embodiment of the present application, supervising the monocular depth estimation network model by using a disparity map output by the first binocular matching neural network model, so as to train the monocular depth estimation network model. The depth map corresponding to the disparity map output by the first binocular matching neural network model supervises the monocular depth estimation network model, and even if supervising information is provided, the monocular depth estimation network model is trained.
步骤S126、获取待处理图像;Step S126: Acquire the image to be processed;
步骤S127、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S127: The image to be processed is input to a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is matched by a first binocular matching neural network. The disparity map output by the network model is used for supervised training;
步骤S128、输出所述待处理图像的分析结果,其中,所述待处理图像的分析结果包括所述单目深度估计网络模型输出的视差图。Step S128: Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model.
步骤S129、根据所述单目深度估计网络模型输出的视差图、拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头基线距离和拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头焦距,确定所述视差图对应的深度图;Step S129: According to the disparity map output by the monocular depth estimation network model, a lens baseline distance of a camera that takes a picture of the monocular depth estimation network model and a camera that takes a picture of the monocular depth estimation network model The focal length of the lens to determine the depth map corresponding to the parallax map;
步骤S130、输出所述视差图对应的深度图。Step S130: Output a depth map corresponding to the disparity map.
本申请实施例中,当所述待处理图像为街景图片时,就可以使用所述训练后的单目深度估计网络模型预测所述街景图片的深度。In the embodiment of the present application, when the to-be-processed image is a street view picture, the trained monocular depth estimation network model may be used to predict the depth of the street view picture.
基于上述的方法实施例,本申请实施例再提供一种单目深度估计方法,图2A为本申请实施例单目深度估计方法的实现流程示意图二,如图2A所示,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method. FIG. 2A is a second schematic diagram of the implementation process of the monocular depth estimation method according to the embodiment of the present application. As shown in FIG.
步骤S201、使用渲染引擎渲染的合成数据训练双目匹配网络,得到双目图片的视差图;Step S201: Use the synthetic data rendered by the rendering engine to train a binocular matching network to obtain a disparity map of the binocular picture;
这里,所述双目匹配网络的输入为:一对双目图片(包含左图和右图),所述双目匹配网络的输出为:视差图、遮挡图,即双目匹配网络使用双目图片作为输入,输出视差图和遮挡图。其中,视差图用于描述左图中每个像素点与右图中对应的像素点的视差距离,以像素为单位;遮挡图用于描述左图每个像素在右图中对应的像素点是否被其他物体遮挡。由于视角的变化,左图中的一些区域在右图中会被其他物体遮挡,遮挡图则是用于标记左图中的像素是否在右图中被遮挡。这一部分,双目匹配网络使用计算机渲染引擎产生的合成数据进行训练,首先通过渲染引擎构造一些虚拟3D场景,然后通过两个虚拟的摄像机将3D场景映射成双目图片,从而获得合成数据,同时正确的深度数据和相机焦距等数据也可以从渲染引擎中得到,所以双目匹配网络可以直接通过这些标记数据进行监督训练。Here, the input of the binocular matching network is: a pair of binocular pictures (including the left and right pictures), and the output of the binocular matching network is: a disparity map and an occlusion map, that is, the binocular matching network uses binocular Pictures are used as input, and disparity and occlusion maps are output. The disparity map is used to describe the disparity distance of each pixel in the left picture and the corresponding pixel point in the right picture, in pixels; the occlusion map is used to describe whether each pixel in the left picture corresponds to the pixel in the right picture. Obscured by other objects. Due to changes in perspective, some areas in the left image will be blocked by other objects in the right image. The occlusion image is used to mark whether the pixels in the left image are blocked in the right image. In this part, the binocular matching network is trained using the synthetic data generated by the computer rendering engine. First, some virtual 3D scenes are constructed by the rendering engine, and then the 3D scenes are mapped into binocular pictures by two virtual cameras to obtain synthetic data. Data such as the correct depth data and camera focal length can also be obtained from the rendering engine, so the binocular matching network can directly supervise training through these labeled data.
步骤S202、利用损失函数,通过无监督微调方法在真实双目图片数据上对步骤S201 得到的双目匹配网络进行微调;Step S202: Use the loss function to fine-tune the binocular matching network obtained in step S201 on the real binocular image data through an unsupervised fine-tuning method;
在这一部分中,双目视差网络需要对真实数据进行适配。即使用不带深度标记的真实双目数据对双目视差网络进行无监督训练。这里无监督训练指的是在没有深度数据标记的情况下,仅仅使用双目数据进行训练。本申请实施例提出了一种新的无监督微调方法,即使用上述实施例中的损失函数进行无监督微调。本申请实施例提出的损失函数的主要目的是希望在不降低预训练效果的情况下在真实双目数据上对双目视差网络进行微调,微调过程中借助了步骤S201得到的预训练双目视差网络的初步输出进行指导和正则化。图2B为本申请实施例损失函数效果示意图,如图2B所示,标号为21的图片21为使用现有技术中的损失函数时得到的视差图,标号为22的图片22为使用本申请实施例提出的损失函数时得到的视差图。现有技术的损失函数没有单独考虑遮挡区域,会将遮挡区域的图像重建误差也优化为零,这样会导致遮挡区域的预测视差错误,视差图的边缘也会模糊,而本申请中的损失函数用遮挡图来清理这一部分的错误训练信号来提高无监督微调训练的效果。In this part, the binocular parallax network needs to adapt to the real data. That is, the binocular disparity network is trained unsupervisedly using real binocular data without depth marking. Here, unsupervised training refers to training using only binocular data without deep data marking. The embodiment of the present application proposes a new unsupervised fine-tuning method, which uses the loss function in the above embodiment to perform unsupervised fine-tuning. The main purpose of the loss function proposed in the embodiment of the present application is to hope to fine-tune the binocular disparity network on real binocular data without reducing the pre-training effect. During the fine-tuning process, the pre-trained binocular disparity obtained in step S201 is used during the fine-tuning. The initial output of the network is guided and regularized. FIG. 2B is a schematic diagram of the effect of the loss function in the embodiment of the present application. As shown in FIG. 2B, the picture 21 is a disparity diagram obtained when using the loss function in the prior art, and the picture 22 is implemented using the present application. The disparity map obtained when the proposed loss function is exemplified. The loss function of the prior art does not consider the occlusion area separately, and the image reconstruction error of the occlusion area is also optimized to zero, which will cause the prediction parallax error of the occlusion area, and the edges of the disparity map will be blurred. The loss function in this application Use the occlusion map to clean up the erroneous training signals in this part to improve the effect of unsupervised fine-tuning training.
步骤S203、使用步骤S202得到的双目匹配网络在真实数据上对单目深度估计进行监督,最终得到单目深度估计网络。这里,所述单目深度估计网络的输入为:单张单目图片,所述单目深度估计网络的输出为:深度图。在步骤S202中得到了在真实数据上微调过的双目视差网络,对于每一对双目图片,双目视差网络预测得到视差图,通过视差图D、双目镜头基线距离b以及镜头焦距f,可以计算得到视差图对应的深度图,即通过公式(8),可以计算得到视差图对应的深度图d:d=bf/D(8);为了训练单目深度网络预测得到深度图,可以使用双目图片对中的左图作为单目深度网路的输入,然后使用双目视差网络输出计算得到的深度图进行监督,从而训练单目深度网路,得到最终结果。在实际应用中,可以本申请实施例中的单目深度估计方法训练得到用于无人驾驶的深度估计模块,从而对场景进行三维重建或者障碍物检测。且本申请实施例提出的无监督微调方法提高了双目视差网络的性能。Step S203: Use the binocular matching network obtained in step S202 to supervise the monocular depth estimation on the real data, and finally obtain the monocular depth estimation network. Here, the input of the monocular depth estimation network is: a single monocular picture, and the output of the monocular depth estimation network is: a depth map. In step S202, the binocular disparity network fine-tuned on the real data is obtained. For each pair of binocular pictures, the binocular disparity network predicts a disparity map, and the disparity map D, the baseline distance b of the binocular lens, and the lens focal length f are obtained. , The depth map corresponding to the disparity map can be calculated, that is, the depth map corresponding to the disparity map can be calculated by formula (8) d: d = bf / D (8); in order to train the monocular depth network prediction to obtain the depth map, you can The left image in the binocular image pair is used as the input of the monocular deep network, and then the depth map calculated by the binocular disparity network output is used to supervise, thereby training the monocular deep network to obtain the final result. In practical applications, the monocular depth estimation method in the embodiments of the present application can be trained to obtain a depth estimation module for unmanned driving, thereby performing three-dimensional reconstruction or obstacle detection on the scene. And the unsupervised fine-tuning method proposed in the embodiment of the present application improves the performance of the binocular disparity network.
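As a quick worked example of formula (8), with made-up camera values that are not taken from the application:

```python
# d = b * f / D with illustrative values: baseline b = 0.54 m, focal length f = 720 px
b, f = 0.54, 720.0
for D in (10.0, 20.0, 40.0):      # disparity in pixels
    print(D, b * f / D)           # 38.88 m, 19.44 m, 9.72 m
```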
现有技术中,有监督的单目深度估计方法,获取准确的标记数据是非常有限也是非常难的。基于重建误差的无监督方法的性能通常受到像素匹配歧义的限制。为了解决这些问题,本申请实施例提出了一种新的单目深度估计方法,解决了现有技术中监督和无监督深度估计方法存在的局限性。本申请实施例中的方法是使用一个双目匹配网络在跨模态合成数据上训练,并用来监督单目深度估计网络。所述双目匹配网络是基于左右图的像素匹配关系来获得视差,而不是从语义特征中提取,因此,双目匹配网络可以很好地从合成数据泛化到真实数据。本申请实施例的方法主要包括三个步骤。第一,用合成数据对双目匹配网络进行训练,从双目图片中预测遮挡图和视差图。第二,根据可用的真实数据,在有监督或者无监督的情况下,对训练后的双目匹配网络有选择性地进行调整。第三,在第二步得到的用真实数据微调训练后的双目匹配网络的监督下,训练单目深度估计网络。这样可以间接利用双目匹配网络来使单目深度估计更好地利用合成数据来提高性能。In the prior art, a supervised monocular depth estimation method is very limited and difficult to obtain accurate labeled data. The performance of unsupervised methods based on reconstruction errors is usually limited by the pixel matching ambiguity. In order to solve these problems, a new monocular depth estimation method is proposed in the embodiment of the present application, which solves the limitations of the supervised and unsupervised depth estimation methods in the prior art. The method in the embodiment of the present application is to use a binocular matching network to train on cross-modal synthetic data, and to supervise the monocular depth estimation network. The binocular matching network obtains disparity based on the pixel matching relationship between the left and right images, rather than extracting from the semantic features. Therefore, the binocular matching network can well generalize from synthetic data to real data. The method in the embodiment of the present application mainly includes three steps. First, the binocular matching network is trained with synthetic data to predict occlusion maps and disparity maps from binocular pictures. Second, according to the available real data, with or without supervision, the trained binocular matching network is selectively adjusted. Third, the monocular depth estimation network is trained under the supervision of the binocular matching network fine-tuned with the real data obtained in the second step. In this way, the binocular matching network can be used indirectly to make the monocular depth estimation make better use of synthetic data to improve performance.
第一步、利用合成数据对双目匹配网络进行训练,包括:目前由图形渲染引擎可以生成很多的包含深度信息的合成图像。但是,直接将这些合成图像数据与真实数据合并来训练单目深度估计网络得到的性能通常较差,因为单目深度估计对输入场景的语义信息非常敏感。合成数据和真实数据之间的巨大模态差距使得使用合成数据辅助训练变得毫无用处。然而,双目匹配网络有更好的泛化能力,使用合成数据训练的双目匹配网络在真实数据上也能得到较好的视差图输出。因此,本申请实施例将双目匹配网络训练作为在合成数据和真实数据之间的桥梁来提高单目深度训练的性能。首先利用大量的合成双目数据对双目匹配网络进行预训练。与传统的结构不同,实施例中的双目匹配网络在视差图的基础上,还估计了多尺度遮挡图。其中,遮挡图表示在正确的图像中,左侧图像像素的在右图中的对应像素点是否被其他物体遮挡。在接下来的步骤中,无监督的微调方法会使用到所述遮挡图,以避免错误的估计。其中,可以使用左右视差一致性检验方法,利用公式(9)从正确标记的视差图中得到有正确标记的遮挡图
Figure PCTCN2019076247-appb-000020
The first step is to use synthetic data to train the binocular matching network. Current graphics rendering engines can generate many synthetic images with ground-truth depth information. However, training the monocular depth estimation network by directly mixing such synthetic images with real data usually performs poorly, because monocular depth estimation is very sensitive to the semantic content of the input scene, and the large modal gap between synthetic and real data makes synthetic data of little direct help. The binocular matching network, by contrast, generalizes better: a binocular matching network trained on synthetic data can still produce reasonable disparity maps on real data. The embodiment of the present application therefore uses binocular matching network training as a bridge between synthetic and real data to improve the performance of monocular depth training. First, a large amount of synthetic binocular data is used to pre-train the binocular matching network. Unlike a conventional structure, the binocular matching network in this embodiment also estimates a multi-scale occlusion map in addition to the disparity map. The occlusion map indicates, for each pixel of the left image, whether its corresponding pixel in the right image is occluded by other objects. In the subsequent unsupervised fine-tuning step, the occlusion map is used to avoid false estimates. A left-right disparity consistency check, formula (9) (given as equation images PCTCN2019076247-appb-000020 and PCTCN2019076247-appb-000021), derives a correctly labeled occlusion map from the correctly labeled disparity map.
Here, the subscript i denotes the i-th row of the image and the subscript j denotes the j-th column. D*L/R denotes the disparity maps of the left and right images, and D*wR is the disparity map of the left image reconstructed from the right image. For non-occluded regions, the left disparity map and the disparity map of the left image reconstructed from the right image are consistent; the consistency-check threshold is set to 1. The occlusion map is 0 in occluded regions and 1 in non-occluded regions. This embodiment then uses formula (10) (equation image PCTCN2019076247-appb-000024) to compute the loss for training the binocular matching network on synthetic data. At this stage, the loss function L_stereo consists of two parts: the disparity map estimation error L_disp and the occlusion map estimation error L_occ (their per-layer terms are given as images PCTCN2019076247-appb-000022 and PCTCN2019076247-appb-000023, where m denotes the m-th layer). The multi-scale intermediate layers of the binocular disparity network also produce disparity and occlusion predictions, and a loss weight w_m is applied to the prediction of each scale.
To train the disparity map, an L1 loss function is used so that outliers have limited influence and the training process is more robust. To train the occlusion map, formula (11) (equation image PCTCN2019076247-appb-000025) expresses the occlusion map estimation error L_occ as a binary cross-entropy loss, treating occlusion prediction as a classification task. Here, N is the total number of pixels in the image, the correctly labeled (ground-truth) occlusion map is given as image PCTCN2019076247-appb-000026, and the occlusion map output by the trained binocular matching network is given as image PCTCN2019076247-appb-000027. A minimal sketch of this stage is given below.
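The following is a minimal sketch of the stage-one training loss, assuming a PyTorch-style setup. The tensor shapes, the interpolation of the ground-truth maps to each scale, the `scale_weights` (playing the role of w_m), and the assumption that the occlusion predictions are already sigmoid probabilities are illustrative choices, not details taken from the equation images.

```python
# Hypothetical sketch of formulas (9)-(11): consistency-check occlusion labels,
# multi-scale L1 disparity loss, and binary cross-entropy occlusion loss.
import torch
import torch.nn.functional as F

def occlusion_from_disparity(disp_left, disp_left_warped_from_right, threshold=1.0):
    """Left-right consistency check (formula (9)): 1 = non-occluded, 0 = occluded."""
    return (torch.abs(disp_left - disp_left_warped_from_right) < threshold).float()

def stereo_synthetic_loss(pred_disps, pred_occs, gt_disp, gt_occ, scale_weights):
    """Multi-scale loss (formula (10)): per-scale L1 disparity term + BCE occlusion term."""
    total = 0.0
    for w_m, d_m, o_m in zip(scale_weights, pred_disps, pred_occs):
        # resize ground truth to the prediction's resolution
        # (value rescaling of disparity across scales omitted for brevity)
        gt_d = F.interpolate(gt_disp, size=d_m.shape[-2:], mode="nearest")
        gt_o = F.interpolate(gt_occ, size=o_m.shape[-2:], mode="nearest")
        l_disp = F.l1_loss(d_m, gt_d)              # L1 keeps outliers from dominating
        l_occ = F.binary_cross_entropy(o_m, gt_o)  # formula (11); o_m assumed in (0, 1)
        total = total + w_m * (l_disp + l_occ)
    return total
```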
The second step is to train, on real data, the trained binocular matching network obtained in the first step, using either a supervised or an unsupervised fine-tuning method. The embodiment of the present application fine-tunes the trained binocular matching network in two ways. The supervised fine-tuning method uses only the multi-scale L1 regression loss function L_stereo-supft, i.e. the disparity map estimation error L_disp, to correct the earlier pixel-matching prediction errors; see formula (12) (equation image PCTCN2019076247-appb-000028). The results show that even with a small amount of supervised data, for example 100 pictures, the binocular matching network can adapt from the synthetic modality to the real modality.
The unsupervised fine-tuning method is as follows. For unsupervised network tuning, the disparity maps obtained with prior-art unsupervised fine-tuning methods are blurry and perform poorly, as shown in picture 21 of FIG. 2B. This is caused by the limitations of the unsupervised loss and the ambiguity of matching pixels using RGB values alone. The embodiment of the present application therefore introduces additional regularization-term constraints to improve performance. Using real data, the corresponding occlusion map and disparity map are obtained from the trained binocular matching network that has not yet been fine-tuned, and are denoted by the symbols given as images PCTCN2019076247-appb-000029 and PCTCN2019076247-appb-000030, respectively. These two quantities are used to help regularize the training process. Further, the unsupervised fine-tuning loss function proposed in the embodiment of the present application, i.e. the loss function L_stereo-unsupft, is obtained as described in the foregoing embodiments; a rough sketch is given below.
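As a rough illustration of formula (14), the sketch below combines an occlusion-masked photometric reconstruction term, a term that keeps the fine-tuned disparity close to the reference (pre-fine-tuning) prediction, and a gradient-consistency term. The concrete forms of L_photo, L_abs and L_rel appear in the patent only as equation images, so the implementations here are assumptions consistent with the surrounding description (for instance, the actual terms may also weight by the occlusion map); the helper `warp_right_to_left` and all names are illustrative.

```python
# Hypothetical sketch of L_stereo_unsupft = L_photo + gamma1 * L_abs + gamma2 * L_rel.
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left image by sampling the right image with the left disparity."""
    b, _, h, w = img_right.shape
    xs = torch.linspace(-1, 1, w, device=img_right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=img_right.device).view(1, h, 1).expand(b, h, w)
    # shift x coordinates left by the disparity, normalized to [-1, 1] coordinates
    xs = xs - 2.0 * disp_left.squeeze(1) / max(w - 1, 1)
    grid = torch.stack((xs, ys), dim=-1)
    return F.grid_sample(img_right, grid, align_corners=True)

def unsup_finetune_loss(img_l, img_r, disp_pred, disp_ref, occ_ref, gamma1, gamma2):
    recon_l = warp_right_to_left(img_r, disp_pred)
    l_photo = (occ_ref * (img_l - recon_l).abs()).mean()      # occlusion-masked reconstruction
    l_abs = (disp_pred - disp_ref).abs().mean()               # stay close to reference disparity
    grad = lambda d: (d[..., :, 1:] - d[..., :, :-1]).abs()   # simple horizontal gradient
    l_rel = (grad(disp_pred) - grad(disp_ref)).abs().mean()   # gradient consistency
    return l_photo + gamma1 * l_abs + gamma2 * l_rel
```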
The third step is to train the monocular depth estimation network. So far, the binocular matching network has been trained across modalities with a large amount of synthetic data and fine-tuned with real data. To train the final monocular depth estimation network, the embodiment of the present application uses the disparity maps predicted by the trained binocular matching network to provide the training data. The monocular depth estimation loss L_mono is given by formula (13) (equation image PCTCN2019076247-appb-000031). Here, N is the total number of pixels; the disparity map output by the monocular depth estimation network is given as image PCTCN2019076247-appb-000032, and the disparity map output by the trained binocular matching network (or, if the trained binocular matching network is fine-tuned, the disparity map output by the fine-tuned network) is given as image PCTCN2019076247-appb-000033. It should be pointed out that formulas (9) to (13) are described using, as an example, the case where the monocular depth estimation network takes the left image of the real data as its training sample. A minimal sketch of this supervision is given after the experiment description below.
Experiments: because the monocular depth estimation network is sensitive to viewpoint changes, no cropping or scaling is applied to the training data. Both the input of the monocular depth estimation network and the disparity maps used to supervise it come from the trained binocular matching network. FIG. 2C is a schematic diagram of visualized depth estimation results according to an embodiment of the present application. FIG. 2C shows the depth maps obtained for three different street-scene pictures using prior-art methods and the monocular depth estimation method of the embodiment of the present application. The first row is the input to the monocular depth estimation network, i.e. three different street-scene pictures; the second row is depth data obtained by interpolating sparse lidar depth maps with a nearest-neighbor algorithm; the third to fifth rows are the depth maps obtained for the three input pictures by three different prior-art monocular depth estimation methods. The results of the present application are shown in the last three rows. Directly supervising the monocular depth estimation network with the binocular matching network trained on synthetic data in the first step yields the depth maps for the three input pictures labeled 21, 22 and 23. Fine-tuning the trained binocular matching network with the unsupervised loss function proposed in the embodiment of the present application and using the disparity maps output by the fine-tuned network as training data for the monocular depth estimation network yields the depth maps labeled 24, 25 and 26. Performing supervised fine-tuning of the trained binocular matching network and using the disparity maps output by the fine-tuned network as training data yields the depth maps labeled 27, 28 and 29. As can be seen from pictures 21 through 29, the model obtained with the monocular depth estimation method of the embodiments of the present application captures more detailed scene structure.
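A minimal sketch of the final stage follows, assuming the stereo network's disparity map is used directly as a dense pseudo-label with an L1 penalty; the exact form of L_mono (formula (13)) is given only as an equation image, so this is illustrative, and `mono_net`, `stereo_net` and the training loop are assumed names.

```python
# Hypothetical sketch of stage 3: supervise the monocular network with the
# (optionally fine-tuned) stereo network's disparity predictions.
import torch
import torch.nn.functional as F

def mono_depth_loss(mono_disp, stereo_disp):
    """L1 supervision of the monocular prediction by the stereo pseudo-label."""
    return F.l1_loss(mono_disp, stereo_disp.detach())

def train_step(mono_net, stereo_net, left, right, optimizer):
    with torch.no_grad():
        stereo_disp = stereo_net(left, right)   # pseudo-label disparity from the stereo network
    mono_disp = mono_net(left)                  # the monocular network sees only the left image
    loss = mono_depth_loss(mono_disp, stereo_disp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```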
An embodiment of the present application provides a monocular depth estimation apparatus. FIG. 3 is a schematic structural diagram of a monocular depth estimation apparatus according to an embodiment of the present application. As shown in FIG. 3, the apparatus 300 includes an acquisition module 301, an execution module 302 and an output module 303, wherein:
The acquisition module 301 is configured to acquire an image to be processed.
The execution module 302 is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model.
The output module 303 is configured to output the analysis result of the image to be processed.
In some embodiments, the apparatus further includes: a third training module configured to supervise the monocular depth estimation network model with the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
In some embodiments, the apparatus further includes: a first training module configured to train a second binocular matching neural network model according to acquired synthetic sample data; and a second training module configured to adjust parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
In some embodiments, the apparatus further includes: a first acquisition module configured to acquire depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture includes a synthetic left image and a synthetic right image.
In some embodiments, the first training module includes: a first training unit configured to train the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
In some embodiments, the apparatus further includes: a construction module configured to construct a virtual 3D scene through a rendering engine; a mapping module configured to map the 3D scene into binocular pictures through two virtual cameras; a second acquisition module configured to acquire depth data of the synthetic binocular pictures according to the position and orientation used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and a third acquisition module configured to label the binocular pictures according to the depth data to obtain the synthetic binocular pictures.
In some embodiments, the second training module includes: a second training unit configured to perform supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the second training unit in the second training module is further configured to perform unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the second training unit in the second training module includes: a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the apparatus further includes: a first determining module configured to determine the loss function by formula (14): L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, where L_stereo-unsupft denotes the loss function, L_photo denotes the reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
In some embodiments, the apparatus further includes: a second determining module configured to determine the reconstruction error using formula (15) or formula (16) (given as equation images PCTCN2019076247-appb-000034 and PCTCN2019076247-appb-000035). In these formulas, N denotes the number of pixels in the picture, and the symbols given as images PCTCN2019076247-appb-000036 to PCTCN2019076247-appb-000042 denote, respectively: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth labels; the pixel values of the pictures synthesized by sampling the right image and by sampling the left image; and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels. The subscript ij denotes the pixel coordinates of a pixel.
In some embodiments, the apparatus further includes: a third determining module configured to determine, using formula (17) or formula (18) (given as equation images PCTCN2019076247-appb-000043 and PCTCN2019076247-appb-000044), that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model. In these formulas, the symbols given as images PCTCN2019076247-appb-000045 and PCTCN2019076247-appb-000046 denote the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image of the sample data, respectively, and γ3 denotes an intensity coefficient.
In some embodiments, the apparatus further includes: a fourth determining module configured to determine, using formula (19) or formula (20) (given as equation image PCTCN2019076247-appb-000047), that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model. In these formulas, the symbols given as images PCTCN2019076247-appb-000048 to PCTCN2019076247-appb-000051 denote, respectively: the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels; the gradient of the disparity map output by the first binocular matching network model for the right image; the gradient of the disparity map output by the trained second binocular matching network model for the left image of the sample data; and the gradient of the disparity map output by the trained second binocular matching network model for the right image of the sample data.
In some embodiments, the depth-labeled real binocular data includes a left image and a right image. Correspondingly, the third training module includes: a first acquisition unit configured to acquire the left image or the right image of the depth-labeled real binocular data as a training sample; and a first training unit configured to train the monocular depth estimation network model according to the left image or the right image of the depth-labeled real binocular data.
In some embodiments, the real binocular data without depth labels includes a left image and a right image. Correspondingly, the third training module further includes: a second acquisition unit configured to input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; a first determining unit configured to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and a second training unit configured to take the left image or the right image of the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
In some embodiments, the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model. Correspondingly, the apparatus further includes: a fifth determining module configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and a first output module configured to output the depth map corresponding to the disparity map. A minimal sketch of this conversion is given below.
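As referenced above for the fifth determining module, converting a predicted disparity map to a depth map uses the camera's lens baseline distance and lens focal length. The sketch below assumes the standard rectified-stereo relation depth = focal_length × baseline / disparity; the function and parameter names are illustrative.

```python
# Hypothetical sketch of the disparity-to-depth conversion used to output a depth map.
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m, eps=1e-6):
    """Depth map (meters) from a disparity map (pixels): depth = f * b / d."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    return focal_length_px * baseline_m / np.maximum(disparity_px, eps)
```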
It should be noted here that the description of the above apparatus embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application. In the embodiments of the present application, if the above monocular depth estimation method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computing device to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM (Read Only Memory), a magnetic disk or an optical disc. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software. Correspondingly, an embodiment of the present application provides a monocular depth estimation device, which includes a memory and a processor. The memory stores a computer program that can be run on the processor, and the processor implements the steps of the monocular depth estimation method when executing the program. Correspondingly, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the monocular depth estimation method are implemented. It should be noted here that the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that FIG. 4 is a schematic diagram of a hardware entity of the monocular depth estimation device according to an embodiment of the present application. As shown in FIG. 4, the hardware entity of the monocular depth estimation device 400 includes: a memory 401, a communication bus 402 and a processor 403. The memory 401 is configured to store instructions and applications executable by the processor 403, and may also buffer data to be processed or already processed by the processor 403 and by the modules of the monocular depth estimation device 400; it may be implemented by FLASH (flash memory) or RAM (Random Access Memory). The communication bus 402 enables the monocular depth estimation device 400 to communicate with other terminals or servers through a network, and also provides the connection and communication between the processor 403 and the memory 401. The processor 403 generally controls the overall operation of the monocular depth estimation device 400.
It should be noted that, in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present application.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (apparatuses) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (30)

1. A monocular depth estimation method, wherein the method comprises: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed.
2. The method according to claim 1, wherein the training process of the first binocular matching neural network model comprises: training a second binocular matching neural network model according to acquired synthetic sample data; and adjusting parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
3. The method according to claim 2, wherein the method further comprises: acquiring depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture comprises a synthetic left image and a synthetic right image.
4. The method according to claim 3, wherein training the second binocular matching neural network model according to the acquired synthetic sample data comprises: training the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
5. The method according to claim 2, wherein adjusting the parameters of the trained second binocular matching neural network model according to the acquired real sample data to obtain the first binocular matching neural network model comprises: performing supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
6. The method according to claim 2, wherein adjusting the parameters of the trained second binocular matching neural network model according to the acquired real sample data to obtain the first binocular matching neural network model further comprises: performing unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
7. The method according to claim 6, wherein performing unsupervised training of the trained second binocular matching neural network model according to the acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model, comprises: using a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
8. The method according to claim 7, wherein the method further comprises: determining the loss function by the formula L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, wherein L_stereo-unsupft denotes the loss function, L_photo denotes a reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
9. The method according to claim 8, wherein the method further comprises: determining the reconstruction error using the formula given as equation image PCTCN2019076247-appb-100001 or the formula given as equation image PCTCN2019076247-appb-100002, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100003 to PCTCN2019076247-appb-100009 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the left image and of the right image in the real binocular data without depth labels, the pixel values of the pictures synthesized by sampling the right image and by sampling the left image, and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels; and ij denotes the pixel coordinates of a pixel.
10. The method according to claim 8, wherein the method further comprises: determining, using the formula given as equation image PCTCN2019076247-appb-100010 or the formula given as equation image PCTCN2019076247-appb-100011, that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100012 to PCTCN2019076247-appb-100016 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image; ij denotes the pixel coordinates of a pixel; and γ3 denotes an intensity coefficient.
11. The method according to claim 8, wherein the method further comprises: determining, using the formula given as equation image PCTCN2019076247-appb-100017 or the formula given as equation image PCTCN2019076247-appb-100018, that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100019 to PCTCN2019076247-appb-100022 denote, respectively, the gradient of the disparity map output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the gradient of the disparity map output by the trained second binocular matching network model for the left image and for the right image; and ij denotes the pixel coordinates of a pixel.
12. The method according to claim 5, wherein the depth-labeled real binocular data includes a left image and a right image, and correspondingly, the training process of the monocular depth estimation network model comprises: acquiring the left image or the right image of the depth-labeled real binocular data as a training sample; and training the monocular depth estimation network model according to the left image or the right image of the depth-labeled real binocular data.
13. The method according to any one of claims 6 to 11, wherein the real binocular data without depth labels includes a left image and a right image, and correspondingly, the training process of the monocular depth estimation network model comprises: inputting the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; determining the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and taking the left image or the right image of the real binocular data without depth labels as sample data and supervising the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
14. The method according to claim 12 or 13, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly, the method further comprises: determining the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and outputting the depth map corresponding to the disparity map.
15. A monocular depth estimation apparatus, wherein the apparatus comprises an acquisition module, an execution module and an output module, wherein: the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
16. The apparatus according to claim 15, wherein the apparatus further comprises: a first training module configured to train a second binocular matching neural network model according to acquired synthetic sample data; and a second training module configured to adjust parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
17. The apparatus according to claim 16, wherein the apparatus further comprises: a first acquisition module configured to acquire depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture comprises a synthetic left image and a synthetic right image.
18. The apparatus according to claim 17, wherein the first training module comprises: a first training unit configured to train the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
19. The apparatus according to claim 16, wherein the second training module comprises: a second training unit configured to perform supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
20. The apparatus according to claim 16, wherein the second training unit is further configured to perform unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
21. The apparatus according to claim 20, wherein the second training unit comprises: a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
22. The apparatus according to claim 21, wherein the apparatus further comprises: a first determining module configured to determine the loss function by the formula L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, wherein L_stereo-unsupft denotes the loss function, L_photo denotes a reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
23. The apparatus according to claim 22, wherein the apparatus further comprises: a second determining module configured to determine the reconstruction error using the formula given as equation image PCTCN2019076247-appb-100023 or the formula given as equation image PCTCN2019076247-appb-100024, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100025 to PCTCN2019076247-appb-100031 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the left image and of the right image in the real binocular data without depth labels, the pixel values of the pictures synthesized by sampling the right image and by sampling the left image, and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels; and ij denotes the pixel coordinates of a pixel.
24. The apparatus according to claim 22, wherein the apparatus further comprises: a third determining module configured to determine, using the formula given as equation image PCTCN2019076247-appb-100032 or the formula given as equation image PCTCN2019076247-appb-100033, that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100034 to PCTCN2019076247-appb-100038 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image; ij denotes the pixel coordinates of a pixel; and γ3 denotes an intensity coefficient.
25. The apparatus according to claim 22, wherein the apparatus further comprises: a fourth determining module configured to determine, using the formula given as equation image PCTCN2019076247-appb-100039 or the formula given as equation image PCTCN2019076247-appb-100040, that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100041 to PCTCN2019076247-appb-100044 denote, respectively, the gradient of the disparity map output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the gradient of the disparity map output by the trained second binocular matching network model for the left image and for the right image; and ij denotes the pixel coordinates of a pixel.
  26. The apparatus according to claim 19, wherein the real binocular data with depth labels includes a left image and a right image, and correspondingly, the apparatus further comprises: a third training module, configured to obtain the left image or the right image in the real binocular data with depth labels as a training sample, and to train the monocular depth estimation network model according to the left image or the right image in the real binocular data with depth labels.
  27. The apparatus according to any one of claims 20 to 25, wherein the real binocular data without depth labels includes a left image and a right image, and correspondingly, the apparatus further comprises: a third training module, configured to input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and to take the left image or the right image in the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  28. The apparatus according to claim 26 or 27, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly, the apparatus further comprises: a fifth determination module, configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and a first output module, configured to output the depth map corresponding to the disparity map. (An illustrative sketch of this disparity-to-depth conversion follows the claims.)
  29. A monocular depth estimation device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein, when the processor executes the program, the steps in the monocular depth estimation method according to any one of claims 1 to 14 are implemented.
  30. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps in the monocular depth estimation method according to any one of claims 1 to 14 are implemented.
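
    The consistency terms recited in claims 24 and 25 are published only as formula image references (Figure PCTCN2019076247-appb-100032 through -100044), so their exact definitions cannot be reproduced here. Purely as an illustration of how an occlusion-weighted disparity consistency term and a gradient consistency term of this kind could look, the following Python/NumPy sketch is offered; the function name, the choice of an L1 penalty, the horizontal-gradient approximation, and the gamma_3 weighting are assumptions of this sketch, not the patented formulas.

    import numpy as np

    def consistency_terms(occlusion, disp_first, disp_second, gamma_3=1.0):
        # Hypothetical sketch only: the exact formulas of claims 24 and 25
        # appear in the publication as image references and are not
        # reproduced here.
        #
        # occlusion   : occlusion map from the trained second binocular
        #               matching network model (1 = visible, 0 = occluded), HxW
        # disp_first  : disparity map from the first binocular matching
        #               network model for the same view, HxW
        # disp_second : disparity map from the trained second binocular
        #               matching network model for the same view, HxW
        # gamma_3     : intensity coefficient weighting the disparity term
        n = disp_first.size  # N: number of pixels in the picture

        # Assumed form of the claim-24 term: occlusion-weighted mean
        # absolute deviation between the two disparity maps.
        disparity_term = gamma_3 * np.sum(occlusion * np.abs(disp_first - disp_second)) / n

        # Assumed form of the claim-25 term: agreement of the horizontal
        # gradients of the two disparity maps.
        grad_first = np.diff(disp_first, axis=1)
        grad_second = np.diff(disp_second, axis=1)
        gradient_term = np.sum(np.abs(grad_first - grad_second)) / grad_first.size

        return disparity_term, gradient_term

    In the training scheme of claims 20 to 25, terms of this kind would keep the first binocular matching network model close, on unlabeled binocular data, to the disparity output of the trained second binocular matching network model.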
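
    Claims 27 and 28 convert a disparity map into a depth map using the lens baseline distance and the lens focal length of the camera. For a rectified stereo pair this is the standard relation depth = focal_length x baseline / disparity; the sketch below illustrates it, with the function name and the small-epsilon guard against zero disparity being assumptions of this sketch rather than details taken from the publication.

    import numpy as np

    def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
        # Standard rectified-stereo relation: depth = focal_length * baseline / disparity.
        # disparity    : disparity map in pixels, HxW
        # baseline     : lens baseline distance of the binocular camera (e.g. metres)
        # focal_length : lens focal length expressed in pixels
        # eps          : guard against division by zero (assumption of this sketch)
        return (focal_length * baseline) / np.maximum(disparity, eps)

    The resulting depth map is then usable, per claim 27, as the supervision signal for training the monocular depth estimation network model, or, per claim 28, as the depth output derived from the disparity map predicted by that model.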
PCT/CN2019/076247 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium WO2019223382A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
SG11202008787UA SG11202008787UA (en) 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium
JP2020546428A JP7106665B2 (en) 2018-05-22 2019-02-27 MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810496541.6A CN108961327B (en) 2018-05-22 2018-05-22 Monocular depth estimation method and device, equipment and storage medium thereof
CN201810496541.6 2018-05-22

Publications (1)

Publication Number Publication Date
WO2019223382A1 true WO2019223382A1 (en) 2019-11-28

Family

ID=64499439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076247 WO2019223382A1 (en) 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium

Country Status (4)

Country Link
JP (1) JP7106665B2 (en)
CN (1) CN108961327B (en)
SG (1) SG11202008787UA (en)
WO (1) WO2019223382A1 (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105432A (en) * 2019-12-24 2020-05-05 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111310859A (en) * 2020-03-26 2020-06-19 上海景和国际展览有限公司 Rapid artificial intelligence data training system used in multimedia display
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111354030A (en) * 2020-02-29 2020-06-30 同济大学 Method for generating unsupervised monocular image depth map embedded into SENET unit
CN111428859A (en) * 2020-03-05 2020-07-17 北京三快在线科技有限公司 Depth estimation network training method and device for automatic driving scene and autonomous vehicle
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111784757A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111833390A (en) * 2020-06-23 2020-10-27 杭州电子科技大学 Light field depth estimation method based on unsupervised depth learning
CN111932584A (en) * 2020-07-13 2020-11-13 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112465888A (en) * 2020-11-16 2021-03-09 电子科技大学 Monocular vision-based unsupervised depth estimation method
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN112712017A (en) * 2020-12-29 2021-04-27 上海智蕙林医疗科技有限公司 Robot, monocular depth estimation method and system and storage medium
CN112819875A (en) * 2021-02-03 2021-05-18 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN112862877A (en) * 2021-04-09 2021-05-28 北京百度网讯科技有限公司 Method and apparatus for training image processing network and image processing
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium
CN113014899A (en) * 2019-12-20 2021-06-22 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113160298A (en) * 2021-03-31 2021-07-23 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
CN114051128A (en) * 2021-11-11 2022-02-15 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114132594A (en) * 2020-09-03 2022-03-04 细美事有限公司 Article storage device and control method of article storage device
CN116703813A (en) * 2022-12-27 2023-09-05 荣耀终端有限公司 Image processing method and apparatus
CN117156113A (en) * 2023-10-30 2023-12-01 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961327B (en) * 2018-05-22 2021-03-30 深圳市商汤科技有限公司 Monocular depth estimation method and device, equipment and storage medium thereof
CN111354032B (en) * 2018-12-24 2023-10-20 杭州海康威视数字技术股份有限公司 Method and device for generating disparity map
CN111444744A (en) * 2018-12-29 2020-07-24 北京市商汤科技开发有限公司 Living body detection method, living body detection device, and storage medium
CN109741388B (en) * 2019-01-29 2020-02-28 北京字节跳动网络技术有限公司 Method and apparatus for generating a binocular depth estimation model
CN111508010B (en) * 2019-01-31 2023-08-08 北京地平线机器人技术研发有限公司 Method and device for estimating depth of two-dimensional image and electronic equipment
CN109887019B (en) * 2019-02-19 2022-05-24 北京市商汤科技开发有限公司 Binocular matching method and device, equipment and storage medium
CN111723926B (en) * 2019-03-22 2023-09-12 北京地平线机器人技术研发有限公司 Training method and training device for neural network model for determining image parallax
CN110009674B (en) * 2019-04-01 2021-04-13 厦门大学 Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110148179A (en) * 2019-04-19 2019-08-20 北京地平线机器人技术研发有限公司 A kind of training is used to estimate the neural net model method, device and medium of image parallactic figure
CN113808062A (en) * 2019-04-28 2021-12-17 深圳市商汤科技有限公司 Image processing method and device
CN110335245A (en) * 2019-05-21 2019-10-15 青岛科技大学 Cage netting damage monitoring method and system based on monocular space and time continuous image
CN112149458A (en) * 2019-06-27 2020-12-29 商汤集团有限公司 Obstacle detection method, intelligent driving control method, device, medium, and apparatus
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110782412B (en) * 2019-10-28 2022-01-28 深圳市商汤科技有限公司 Image processing method and device, processor, electronic device and storage medium
CN111105451B (en) * 2019-10-31 2022-08-05 武汉大学 Driving scene binocular depth estimation method for overcoming occlusion effect
CN111126478B (en) * 2019-12-19 2023-07-07 北京迈格威科技有限公司 Convolutional neural network training method, device and electronic system
CN111325786B (en) * 2020-02-18 2022-06-28 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112150531B (en) * 2020-09-29 2022-12-09 西北工业大学 Robust self-supervised learning single-frame image depth estimation method
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN115294375B (en) * 2022-10-10 2022-12-13 南昌虚拟现实研究院股份有限公司 Speckle depth estimation method and system, electronic device and storage medium
CN115909446B (en) * 2022-11-14 2023-07-18 华南理工大学 Binocular face living body discriminating method, device and storage medium
CN116165646B (en) * 2023-02-22 2023-08-11 哈尔滨工业大学 False alarm controllable radar target detection method based on segmentation network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279022A1 (en) * 2014-03-31 2015-10-01 Empire Technology Development Llc Visualization of Spatial and Other Relationships
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327A (en) * 2018-05-22 2018-12-07 深圳市商汤科技有限公司 A kind of monocular depth estimation method and its device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102903096B (en) * 2012-07-04 2015-06-17 北京航空航天大学 Monocular video based object depth extraction method
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
GB2553782B (en) * 2016-09-12 2021-10-20 Niantic Inc Predicting depth from image data using a statistical model
EP4131172A1 (en) * 2016-09-12 2023-02-08 Dassault Systèmes Deep convolutional neural network for 3d reconstruction of a real object
CN107909150B (en) * 2017-11-29 2020-08-18 华中科技大学 Method and system for on-line training CNN based on block-by-block random gradient descent method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279022A1 (en) * 2014-03-31 2015-10-01 Empire Technology Development Llc Visualization of Spatial and Other Relationships
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327A (en) * 2018-05-22 2018-12-07 深圳市商汤科技有限公司 A kind of monocular depth estimation method and its device, equipment and storage medium

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014899A (en) * 2019-12-20 2021-06-22 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN113014899B (en) * 2019-12-20 2023-02-03 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN111105432B (en) * 2019-12-24 2023-04-07 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111105432A (en) * 2019-12-24 2020-05-05 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111340864B (en) * 2020-02-26 2023-12-12 浙江大华技术股份有限公司 Three-dimensional scene fusion method and device based on monocular estimation
CN111445476B (en) * 2020-02-27 2023-05-26 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111354030B (en) * 2020-02-29 2023-08-04 同济大学 Method for generating unsupervised monocular image depth map embedded into SENet unit
CN111354030A (en) * 2020-02-29 2020-06-30 同济大学 Method for generating unsupervised monocular image depth map embedded into SENET unit
CN111428859A (en) * 2020-03-05 2020-07-17 北京三快在线科技有限公司 Depth estimation network training method and device for automatic driving scene and autonomous vehicle
CN111310859A (en) * 2020-03-26 2020-06-19 上海景和国际展览有限公司 Rapid artificial intelligence data training system used in multimedia display
CN111833390A (en) * 2020-06-23 2020-10-27 杭州电子科技大学 Light field depth estimation method based on unsupervised depth learning
CN111784757A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111784757B (en) * 2020-06-30 2024-01-23 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111932584A (en) * 2020-07-13 2020-11-13 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN111932584B (en) * 2020-07-13 2023-11-07 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN114132594A (en) * 2020-09-03 2022-03-04 细美事有限公司 Article storage device and control method of article storage device
CN112465888A (en) * 2020-11-16 2021-03-09 电子科技大学 Monocular vision-based unsupervised depth estimation method
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112446328B (en) * 2020-11-27 2023-11-17 汇纳科技股份有限公司 Monocular depth estimation system, method, apparatus, and computer-readable storage medium
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN112712017A (en) * 2020-12-29 2021-04-27 上海智蕙林医疗科技有限公司 Robot, monocular depth estimation method and system and storage medium
CN112819875B (en) * 2021-02-03 2023-12-19 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN112819875A (en) * 2021-02-03 2021-05-18 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN113160298B (en) * 2021-03-31 2024-03-08 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN113160298A (en) * 2021-03-31 2021-07-23 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN112862877B (en) * 2021-04-09 2024-05-17 北京百度网讯科技有限公司 Method and apparatus for training an image processing network and image processing
CN112862877A (en) * 2021-04-09 2021-05-28 北京百度网讯科技有限公司 Method and apparatus for training image processing network and image processing
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114051128B (en) * 2021-11-11 2023-09-05 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN114051128A (en) * 2021-11-11 2022-02-15 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN116703813A (en) * 2022-12-27 2023-09-05 荣耀终端有限公司 Image processing method and apparatus
CN116703813B (en) * 2022-12-27 2024-04-26 荣耀终端有限公司 Image processing method and apparatus
CN117156113A (en) * 2023-10-30 2023-12-01 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device
CN117156113B (en) * 2023-10-30 2024-02-23 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device

Also Published As

Publication number Publication date
JP2021515939A (en) 2021-06-24
CN108961327B (en) 2021-03-30
CN108961327A (en) 2018-12-07
JP7106665B2 (en) 2022-07-26
SG11202008787UA (en) 2020-10-29

Similar Documents

Publication Publication Date Title
WO2019223382A1 (en) Method for estimating monocular depth, apparatus and device therefor, and storage medium
Ming et al. Deep learning for monocular depth estimation: A review
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
Hambarde et al. UW-GAN: Single-image depth estimation and image enhancement for underwater images
US11443445B2 (en) Method and apparatus for depth estimation of monocular image, and storage medium
US12031842B2 (en) Method and apparatus for binocular ranging
Hu et al. Deep depth completion from extremely sparse data: A survey
CN109300151B (en) Image processing method and device and electronic equipment
CN113362444A (en) Point cloud data generation method and device, electronic equipment and storage medium
EP3872760A2 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
Zhang et al. Exploring event-driven dynamic context for accident scene segmentation
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN114677422A (en) Depth information generation method, image blurring method and video blurring method
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
Arampatzakis et al. Monocular depth estimation: A thorough review
Zhao et al. Distance transform pooling neural network for lidar depth completion
Haji-Esmaeili et al. Large-scale monocular depth estimation in the wild
KR20240012426A (en) Unconstrained image stabilization
CN115375742A (en) Method and system for generating depth image
Wang et al. Surface and underwater human pose recognition based on temporal 3D point cloud deep learning
Yang et al. Towards generic 3d tracking in RGBD videos: Benchmark and baseline
CN113537359A (en) Training data generation method and device, computer readable medium and electronic equipment
US10896333B2 (en) Method and device for aiding the navigation of a vehicle
CN116597097B (en) Three-dimensional scene reconstruction method for autopilot, electronic device, and storage medium
Tian Effective image enhancement and fast object detection for improved UAV applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19806515

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020546428

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19806515

Country of ref document: EP

Kind code of ref document: A1