WO2019223382A1 - Method for estimating monocular depth, apparatus and device therefor, and storage medium - Google Patents

Method for estimating monocular depth, apparatus and device therefor, and storage medium Download PDF

Info

Publication number
WO2019223382A1
Authority
WO
WIPO (PCT)
Prior art keywords
network model
binocular
depth
trained
disparity map
Prior art date
Application number
PCT/CN2019/076247
Other languages
French (fr)
Chinese (zh)
Inventor
郭晓阳
李鸿升
伊帅
任思捷
王晓刚
Original Assignee
深圳市商汤科技有限公司
Application filed by 深圳市商汤科技有限公司 filed Critical 深圳市商汤科技有限公司
Priority to SG11202008787UA priority Critical patent/SG11202008787UA/en
Priority to JP2020546428A priority patent/JP7106665B2/en
Publication of WO2019223382A1 publication Critical patent/WO2019223382A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Definitions

  • Embodiments of the present application relate to the field of artificial intelligence, and in particular, to a monocular depth estimation method and a device, device, and storage medium thereof.
  • Monocular depth estimation is an important issue in computer vision.
  • the specific task of monocular depth estimation is to predict the depth of each pixel in a picture.
  • a picture composed of the depth value of each pixel is also called a depth map.
  • Monocular depth estimation is of great significance for obstacle detection, three-dimensional scene reconstruction, and three-dimensional scene analysis in autonomous driving.
  • monocular depth estimation can indirectly improve the performance of other computer vision tasks, such as object detection, target tracking and target recognition.
  • the current problem is that training neural networks for monocular depth estimation requires a large amount of labeled data, but obtaining labeled data is costly.
  • In outdoor environments, labeled data can be obtained by lidar, but such labeled data is very sparse.
  • A monocular depth estimation network trained with such sparse labeled data produces depth maps without clear edges and cannot capture the correct depth of small objects.
  • the embodiments of the present application provide a monocular depth estimation method, an apparatus, a device and a storage medium thereof.
  • An embodiment of the present application provides a monocular depth estimation method.
  • The method includes: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed.
  • An embodiment of the present application provides a monocular depth estimation device.
  • The device includes: an acquisition module, an execution module, and an output module, wherein: the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
  • An embodiment of the present application provides a monocular depth estimation device, including a memory and a processor.
  • The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
  • In the embodiments of the present application, an image to be processed is acquired; the image to be processed is input into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the analysis result of the image to be processed is output.
  • In this way, the monocular depth estimation network can be trained with little or no data annotated with depth maps, and a more effective method for unsupervised fine-tuning of the binocular disparity network is proposed, which indirectly improves the effect of monocular depth estimation.
  • FIG. 1A is a first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application
  • FIG. 1B is a schematic diagram of a single picture depth estimation according to an embodiment of the present application.
  • FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application.
  • FIG. 1D is a schematic diagram of training a monocular depth estimation network model according to an embodiment of the present application.
  • FIG. 1E is a schematic diagram of relevant pictures of a loss function according to an embodiment of the present application.
  • FIG. 2A is a second schematic diagram of an implementation process of a monocular depth estimation method according to an embodiment of the present application.
  • FIG. 2B is a schematic diagram of an effect of a loss function according to an embodiment of the present application.
  • FIG. 2C is a schematic diagram of a visualized depth estimation result according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a monocular depth estimation device according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hardware entity of a monocular depth estimation device according to an embodiment of the present application.
  • In the embodiments of the present application, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • the monocular depth estimation method proposed in the embodiment of the present application is obtained by using neural network training.
  • the training data comes from the disparity map data output by binocular matching, without the need for expensive depth acquisition equipment such as lidar.
  • the binocular matching algorithm that provides training data is also implemented by a neural network.
  • The binocular matching network can achieve good results by pre-training on a large number of virtual binocular image pairs rendered by a rendering engine.
  • Fine-tuning training can then be performed on real data to achieve better results.
  • FIG. 1A is the first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 1A, the method includes the following steps:
  • Step S101 Acquire an image to be processed
  • an image to be processed may be acquired by a mobile terminal, and the image to be processed may include a picture of an arbitrary scene.
  • During implementation, the mobile terminal may be any of various types of devices with information processing capabilities.
  • For example, the mobile terminal may include a mobile phone, a Personal Digital Assistant (PDA), a navigator, a digital phone, a video phone, a smart watch, a smart bracelet, a wearable device, a tablet computer, and the like.
  • The method may also be performed by another computing device with information processing capabilities, for example a mobile terminal such as a mobile phone, a tablet computer, or a notebook computer, or a fixed terminal such as a personal computer or a server cluster.
  • Step S102: Input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model.
  • The monocular depth estimation network model is mainly obtained through the following three steps: the first step is to pre-train a binocular matching neural network using synthetic binocular data rendered by a rendering engine; the second step is to use real-world data to fine-tune the binocular matching neural network obtained in the first step; the third step is to use the binocular matching neural network obtained in the second step to provide supervision for the monocular depth estimation network, thereby training the monocular depth estimation network. A structural sketch of this pipeline is given below.
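  • The following is a minimal structural sketch of the three-step pipeline described above, not the application's actual code; the function names and signatures are assumptions used purely for illustration, and the training loops themselves are omitted.

```python
def pretrain_stereo_net(synthetic_pairs):
    """Step 1: train the (second) binocular matching network on rendered
    synthetic binocular pairs with ground-truth disparity and occlusion."""
    stereo_net = ...  # supervised training on synthetic data
    return stereo_net

def finetune_stereo_net(stereo_net, real_pairs, depth_labels=None):
    """Step 2: adapt the pre-trained network to real data, producing the
    (first) binocular matching network. With depth labels this is supervised
    fine-tuning; without them, the unsupervised fine-tuning loss described
    later in this text is used."""
    ...
    return stereo_net

def train_monocular_net(real_pairs, stereo_net):
    """Step 3: use disparity maps predicted by the fine-tuned stereo network
    as supervision for a monocular depth estimation network that only sees
    the left (or right) image of each pair."""
    mono_net = ...
    return mono_net
```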
  • monocular depth estimation generally uses a large amount of real labeled data for training, or uses an unsupervised method to train a monocular depth estimation network. However, the acquisition cost of a large amount of real labeled data is very high.
  • the sample data of the monocular depth estimation network model described in this application comes from the disparity map output by the first binocular matching neural network model, that is, this application uses binocular disparity to guide the prediction of the monocular depth. Therefore, the method in the present application does not require a large amount of labeled data, and can obtain better training results.
  • Step S103 Output the analysis result of the image to be processed.
  • the analysis result of the image to be processed refers to a depth map corresponding to the image to be processed.
  • The image to be processed is input into the trained monocular depth estimation network model, and the monocular depth estimation network model generally outputs a disparity map corresponding to the image to be processed rather than a depth map. Therefore, it is also necessary to determine the depth map corresponding to the image to be processed according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the image to be processed, and the lens focal length of that camera.
  • FIG. 1B is a schematic diagram of the depth estimation of a single picture in the embodiment of the present application.
  • The picture labeled 11 is the image to be processed,
  • and the picture labeled 12 is the depth map corresponding to the picture labeled 11.
  • In some embodiments, the depth map corresponding to the image to be processed may be determined as the ratio of the product of the lens baseline distance and the lens focal length to the output disparity map corresponding to the image to be processed, as illustrated in the sketch below.
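  • The following is a small illustrative sketch (not code from the application) of the disparity-to-depth conversion just described; the baseline and focal-length values in the example are made up for demonstration.

```python
import numpy as np

def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
    """Convert a disparity map (in pixels) to a depth map using the relation
    described above: depth = (baseline * focal_length) / disparity.
    The baseline is the distance between the two camera lenses, the focal
    length is expressed in pixels, and eps guards against division by zero."""
    disparity = np.asarray(disparity, dtype=np.float64)
    return (baseline * focal_length) / np.maximum(disparity, eps)

# Example usage with a 2x2 disparity map and illustrative camera parameters.
depth = disparity_to_depth(np.array([[30.0, 15.0], [10.0, 5.0]]), baseline=0.54, focal_length=721.0)
```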
  • an embodiment of the present application further provides a monocular depth estimation method, which includes:
  • Step S111 Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image;
  • In some embodiments, the method further includes: step S11, constructing a virtual 3D scene through a rendering engine; step S12, mapping the 3D scene into a binocular picture through two virtual cameras; step S13, obtaining the depth data of the synthesized binocular picture according to the position and direction used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and step S14, marking the binocular picture according to the depth data to obtain the synthesized binocular picture. A sketch of how such rendered depth data relates to the disparity labels is given below.
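  • The helper below is an illustrative assumption rather than part of the application: it simply applies the inverse of the depth relation given earlier (depth = baseline × focal length / disparity) to turn the rendered ground-truth depth into ground-truth disparity for labeling the synthetic binocular pair.

```python
import numpy as np

def synthetic_disparity_from_depth(depth_map, baseline, focal_length, eps=1e-6):
    """Given the ground-truth depth rendered for the virtual left camera,
    return the corresponding ground-truth disparity (in pixels):
    disparity = (baseline * focal_length) / depth."""
    depth_map = np.asarray(depth_map, dtype=np.float64)
    return (baseline * focal_length) / np.maximum(depth_map, eps)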
  • Step S112 Train a second binocular matching neural network model according to the obtained synthetic sample data
  • Step S112 may be implemented by the following step: step S1121, training a second binocular matching neural network model according to the synthesized binocular picture to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map;
  • the disparity map describes the parallax distance between each pixel in the left image and the corresponding pixel in the right image, where the parallax distance is measured in pixels;
  • and the occlusion map describes whether the pixel in the right image corresponding to each pixel in the left image is blocked by an object.
  • FIG. 1C is a schematic diagram of training the second binocular matching neural network model according to an embodiment of the present application. As shown in FIG. 1C, the picture labeled 11 is the left view of a synthesized binocular picture, and the picture labeled 12 is the right view of the synthesized binocular picture.
  • I_L denotes the pixel values of all the pixels contained in the left picture labeled 11, and I_R denotes the pixel values of all the pixels contained in the right picture labeled 12;
  • the picture labeled 13 is the occlusion map output by the trained second binocular matching neural network model,
  • the picture labeled 14 is the disparity map output by the trained second binocular matching neural network model,
  • and the item labeled 15 is the second binocular matching neural network model itself.
  • Step S113 Adjust the parameters of the trained second binocular matching neural network model according to the obtained real sample data to obtain a first binocular matching neural network model
  • Step S113 can be implemented in two ways. The first implementation is carried out according to the following step: step S1131a, performing supervised training on the trained second binocular matching neural network model according to the obtained real binocular data with depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • That is, real binocular data with depth markers is acquired.
  • The real binocular data with depth markers can be used directly to supervise the training of the second binocular matching neural network trained in step S112, so as to adjust the weights of the trained second binocular matching neural network model and further improve its effect, thereby obtaining the first binocular matching neural network model.
  • the binocular parallax network needs to adapt to the real data.
  • The second implementation is carried out according to the following step: step S1131b, performing unsupervised training on the trained second binocular matching neural network model according to the obtained real binocular data without depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • Here, unsupervised training refers to training using only binocular data without depth annotation, and this process can be implemented using an unsupervised fine-tuning method.
  • Step S114 Supervise the monocular depth estimation network model through the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model;
  • step S114 is implemented in two ways, wherein the first implementation way is implemented according to the following steps: step S1141a, acquiring the left or right image in the real binocular data with depth mark as a training sample,
  • the depth-labeled real binocular data includes a left image and a right image;
  • Step S1142a, training a monocular depth estimation network model according to the left or right image in the depth-labeled real binocular data.
  • As noted above, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • The monocular depth estimation network model may therefore be trained according to the left or right image in the depth-labeled real binocular data, where the depth-labeled real binocular data is the real binocular data with depth markers used in step S1131a.
  • The second implementation is carried out according to the following steps: step S1141b, inputting the real binocular data without depth markers into the first binocular matching neural network model to obtain a corresponding disparity map, wherein the real binocular data without depth markers includes a left image and a right image;
  • step S1142b, determining the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth markers, and the lens focal length of that camera;
  • step S1143b, using the left or right image in the real binocular data without depth markers as sample data, and supervising the monocular depth estimation network model according to the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  • As noted above, a deep neural network is used to predict the depth map of a single picture: only one picture is needed to model the three-dimensional structure of the corresponding scene and obtain the depth of each pixel.
  • The left or right image in the real binocular data without depth markers used in step S1131b, or the left or right image in the real binocular data without depth markers used in step S1141b, can be taken as the sample data,
  • and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map output in step S1141b, so that the monocular depth estimation network model is trained and a trained monocular depth estimation network model is obtained.
  • FIG. 1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application.
  • Figure (a) in FIG. 1D shows inputting real binocular data without depth markers into the first binocular matching neural network model.
  • The real binocular data without depth markers includes a left picture labeled 11 and a right picture labeled 12, and the item labeled 15 is the first binocular matching neural network model.
  • Figure (b) in FIG. 1D shows that the left or right image in the real binocular data without depth markers is used as the sample data, and the depth map corresponding to the disparity map labeled 13 is used to supervise the monocular depth estimation network model, thereby training the monocular depth estimation network model,
  • wherein the output of the sample data after passing through the monocular depth estimation network model is the disparity map labeled 14, and the item labeled 16 is the monocular depth estimation network model.
  • Step S115 Obtain an image to be processed
  • After the above training, the trained monocular depth estimation network model can be used; that is, a depth map corresponding to the image to be processed is obtained using this monocular depth estimation network model.
  • Step S116: Input the image to be processed into the trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model;
  • Step S117 Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model;
  • Step S118: Determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • Step S119 Output a depth map corresponding to the disparity map.
  • an embodiment of the present application further provides a monocular depth estimation method, which includes:
  • Step S121 Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image.
  • Step S122 Train a second binocular matching neural network model according to the obtained synthetic sample data
  • In formula (1) of step S123, L_abs and L_rel are regularization terms.
  • Formula (1) in step S123 can be further refined by the formulas in the following steps; that is, the method further includes: step S1231, determining the reconstruction error using formula (2) or formula (3). In these formulas, N is the number of pixels in the picture, and the quantities involved are: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth markers; the pixel values of the picture synthesized after sampling the right picture, that is, the reconstructed left picture; the pixel values of the picture synthesized after sampling the left picture, that is, the reconstructed right picture; and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers.
  • Step S1232, using formula (4) or formula (5) to constrain the disparity map output by the first binocular matching network model to differ only slightly from the disparity map output by the trained second binocular matching network model.
  • In these formulas, N is the number of pixels in the picture; the quantities involved are the pixel values of the occlusion map output by the trained second binocular matching network model and the pixel values of the disparity map output by the trained second binocular matching network for the left image (or the right image) in the sample data; ij represents the pixel coordinates of a pixel, and the superscript old denotes the output of the trained second binocular matching network.
  • Step S1233, using formula (6) or formula (7) to constrain the output gradient of the first binocular matching network model to be consistent with the output gradient of the trained second binocular matching network model. In these formulas, N is the number of pixels in the picture; the quantities involved are the gradients of the disparity maps output by the first binocular matching network for the left image and for the right image in the real binocular data without depth markers, and the gradients of the disparity maps output by the trained second binocular matching network for the left image and for the right image in the sample data; the superscript old denotes the output of the trained second binocular matching network model, R denotes the right picture or data related to the right picture, and L denotes the left picture or data related to the left picture.
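  • Since the formula images for formulas (1)–(7) are not shown above, the following is only a plausible sketch of the structure those descriptions suggest for the left-image terms (the right-image versions are symmetric); the exact masks, norms, and weights used in the application may differ, and the symbols below are assumptions chosen to match the variable descriptions.

```latex
% Hypothetical sketch only, not the application's own formulas.
% \hat{O}^{L,old}: occlusion map from the trained (old) network, 1 = visible, 0 = occluded
% I^{L}, \tilde{I}^{L}: real left image and left image reconstructed by warping the right image
% \hat{D}^{L}, \hat{D}^{L,old}: disparity predicted during fine-tuning and by the trained (old) network
\begin{align}
L_{\mathrm{recon}} &= \frac{1}{N}\sum_{ij} \hat{O}^{L,old}_{ij}\,\bigl|I^{L}_{ij} - \tilde{I}^{L}_{ij}\bigr| \\
L_{\mathrm{abs}}   &= \frac{1}{N}\sum_{ij} \bigl(1-\hat{O}^{L,old}_{ij}\bigr)\,\bigl|\hat{D}^{L}_{ij} - \hat{D}^{L,old}_{ij}\bigr| \\
L_{\mathrm{rel}}   &= \frac{1}{N}\sum_{ij} \bigl|\nabla\hat{D}^{L}_{ij} - \nabla\hat{D}^{L,old}_{ij}\bigr| \\
L &= L_{\mathrm{recon}} + \lambda_{1}\,L_{\mathrm{abs}} + \lambda_{2}\,L_{\mathrm{rel}}
\end{align}
```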
  • Step S124 Use a loss function (Loss) to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without the depth marker to adjust the trained second binocular Match the weights of the neural network model to get the first binocular matching neural network model.
  • FIG. 1E is a schematic diagram of pictures related to the loss function according to an embodiment of the present application. As shown in FIG. 1E, figure (a) is the left image of real binocular data without depth markers, and figure (b) is the right image of the real binocular data without depth markers.
  • Figure (c) in FIG. 1E is the disparity map output by the trained second binocular matching neural network model when the real binocular image without depth markers composed of figures (a) and (b) is input to it.
  • Figure (d) in FIG. 1E is the left picture reconstructed by sampling the right picture shown in figure (b) and combining it with the disparity map shown in figure (c).
  • Figure (e) in FIG. 1E is the picture obtained by taking the difference between the pixels in the left image shown in figure (a) and the corresponding pixels in the reconstructed left image shown in figure (d), that is, the reconstruction error map of the left image.
  • FIG. 1E also includes the occlusion map obtained by inputting the real binocular image without depth markers composed of figures (a) and (b) into the trained second binocular matching neural network model.
  • All the red boxes labeled 11 in figure (d) indicate the parts where the reconstructed left picture differs from the real left picture identified in figure (a), and all the red boxes labeled 12 in figure (e) show where the reconstruction error map contains errors, that is, the parts that are occluded.
  • The occlusion map is used to remove this part of the erroneous training signal, so as to improve the effect of unsupervised fine-tuning training.
  • Step S125 Supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
  • the sample picture of the monocular depth estimation network model may be a left image in real binocular data without a depth marker, or a right image in real binocular data without a depth marker.
  • If the left picture is used as the sample picture, the loss function is determined by formula (1), formula (2), formula (4), and formula (6); if the right picture is used as the sample picture, the loss function is determined by formula (1), formula (3), formula (5), and formula (7).
  • The monocular depth estimation network model is supervised by using the disparity map output by the first binocular matching neural network model, so as to train the monocular depth estimation network model.
  • In other words, the depth map corresponding to the disparity map output by the first binocular matching neural network model supervises the monocular depth estimation network model; that is, it provides the supervision information with which the monocular depth estimation network model is trained.
  • Step S126 Acquire the image to be processed
  • Step S127: Input the image to be processed into the trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model;
  • Step S128 Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model.
  • Step S129: Determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • Step S130 Output a depth map corresponding to the disparity map.
  • the trained monocular depth estimation network model may be used to predict the depth of the street view picture.
  • FIG. 2A is the second schematic flowchart of the monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 2A, the method includes the following steps:
  • Step S201 Use the synthetic data rendered by the rendering engine to train a binocular matching network to obtain a disparity map of the binocular picture;
  • the input of the binocular matching network is: a pair of binocular pictures (including the left and right pictures)
  • The output of the binocular matching network is a disparity map and an occlusion map; that is, the binocular matching network takes binocular pictures as input and outputs disparity and occlusion maps.
  • The disparity map is used to describe the disparity distance, in pixels, between each pixel in the left picture and the corresponding pixel in the right picture; the occlusion map is used to describe whether the pixel in the right picture corresponding to each pixel in the left picture is occluded by other objects. Due to the change in viewpoint, some areas visible in the left image will be occluded by other objects in the right image.
  • The occlusion map is therefore used to mark whether each pixel in the left image is occluded in the right image.
  • The binocular matching network is trained using synthetic data generated by a computer rendering engine. First, some virtual 3D scenes are constructed by the rendering engine, and then the 3D scenes are projected into binocular pictures by two virtual cameras to obtain the synthetic data. The correct depth data, camera focal length, and other such data can also be obtained from the rendering engine, so the binocular matching network can be trained directly with supervision from this labeled data.
  • Step S202 Use the loss function to fine-tune the binocular matching network obtained in step S201 on the real binocular image data through an unsupervised fine-tuning method;
  • the binocular parallax network needs to adapt to the real data. That is, the binocular disparity network is trained unsupervisedly using real binocular data without depth marking.
  • unsupervised training refers to training using only binocular data without deep data marking.
  • the embodiment of the present application proposes a new unsupervised fine-tuning method, which uses the loss function in the above embodiment to perform unsupervised fine-tuning.
  • the main purpose of the loss function proposed in the embodiment of the present application is to hope to fine-tune the binocular disparity network on real binocular data without reducing the pre-training effect.
  • The pre-trained binocular disparity network obtained in step S201 is used during the fine-tuning.
  • FIG. 2B is a schematic diagram of the effect of the loss function in the embodiment of the present application.
  • The picture labeled 21 is the disparity map obtained when using the loss function of the prior art, and the picture labeled 22 is the disparity map obtained when using the loss function proposed in the embodiment of the present application.
  • The loss function of the prior art does not treat the occluded area separately, and the image reconstruction error of the occluded area is also optimized toward zero, which causes erroneous disparity predictions in the occluded area and blurred edges in the disparity map.
  • The loss function in the present application uses the occlusion map to remove the erroneous training signal in this part, so as to improve the effect of unsupervised fine-tuning training.
  • Step S203 Use the binocular matching network obtained in step S202 to supervise the monocular depth estimation on the real data, and finally obtain the monocular depth estimation network.
  • the input of the monocular depth estimation network is: a single monocular picture
  • the output of the monocular depth estimation network is: a depth map.
  • At this point, the binocular disparity network fine-tuned on the real data has been obtained. For each pair of binocular pictures, the binocular disparity network predicts a disparity map; given the disparity map D, the baseline distance b of the binocular lenses, and the lens focal length f, the corresponding depth map can be determined as described above.
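  • In formula form, this restates the relation already given earlier in this text (the product of baseline and focal length divided by the disparity), not a formula image from the application:

```latex
\mathrm{depth}_{ij} = \frac{b \cdot f}{D_{ij}}
```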
  • the monocular depth estimation method in the embodiments of the present application can be trained to obtain a depth estimation module for unmanned driving, thereby performing three-dimensional reconstruction or obstacle detection on the scene. And the unsupervised fine-tuning method proposed in the embodiment of the present application improves the performance of the binocular disparity network.
  • A supervised monocular depth estimation method is very limited because accurate labeled data is difficult to obtain.
  • the performance of unsupervised methods based on reconstruction errors is usually limited by the pixel matching ambiguity.
  • a new monocular depth estimation method is proposed in the embodiment of the present application, which solves the limitations of the supervised and unsupervised depth estimation methods in the prior art.
  • the method in the embodiment of the present application is to use a binocular matching network to train on cross-modal synthetic data, and to supervise the monocular depth estimation network.
  • the binocular matching network obtains disparity based on the pixel matching relationship between the left and right images, rather than extracting from the semantic features. Therefore, the binocular matching network can well generalize from synthetic data to real data.
  • the method in the embodiment of the present application mainly includes three steps.
  • the binocular matching network is trained with synthetic data to predict occlusion maps and disparity maps from binocular pictures.
  • In the second step, the trained binocular matching network is fine-tuned on real data, in either a supervised or an unsupervised manner.
  • the monocular depth estimation network is trained under the supervision of the binocular matching network fine-tuned with the real data obtained in the second step. In this way, the binocular matching network can be used indirectly to make the monocular depth estimation make better use of synthetic data to improve performance.
  • the first step is to use the synthetic data to train the binocular matching network, including:
  • the graphics rendering engine can generate many synthetic images containing depth information.
  • the performance of training the monocular depth estimation network by directly combining these synthetic image data with real data is usually poor, because the monocular depth estimation is very sensitive to the semantic information of the input scene.
  • the huge modal gap between synthetic and real data makes using synthetic data to aid training useless.
  • the binocular matching network has better generalization ability, and the binocular matching network trained with synthetic data can also get better disparity map output on real data. Therefore, the embodiment of the present application uses binocular matching network training as a bridge between synthetic data and real data to improve the performance of monocular deep training.
  • the binocular matching network in the embodiment estimates a multi-scale occlusion map based on the disparity map.
  • The occlusion map indicates whether the pixel in the right image corresponding to each pixel in the left image is occluded by other objects in the right image.
  • the unsupervised fine-tuning method will use the occlusion map to avoid false estimation.
  • A left-right disparity consistency check can be used to obtain a correctly labeled occlusion map from the correctly labeled disparity maps by using formula (9).
  • In formula (9), the subscript i represents the i-th row in the image,
  • and the subscript j represents the j-th column in the image.
  • D*_{L/R} represents the correctly labeled disparity map of the left or right image,
  • and D*_{wR} is the disparity map of the left view reconstructed (warped) from the right image.
  • The consistency check threshold is set to 1.
  • the occlusion map is 0 in the occluded area and 1 in the non-occluded area. Therefore, this embodiment uses the following formula (10) to calculate the loss (Loss) of training the binocular matching network using synthetic data.
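  • Formula (9) itself is not reproduced above; the sketch below implements a common form of the left-right consistency check under the conventions just stated (threshold 1, occlusion map 0 in occluded areas and 1 in non-occluded areas), and is an illustrative assumption rather than the application's exact formula.

```python
import numpy as np

def occlusion_from_lr_consistency(disp_left, disp_right, threshold=1.0):
    """Left-right disparity consistency check. Each left-image pixel (i, j)
    with disparity d is matched to pixel (i, j - d) in the right image; if
    the two disparities disagree by more than the threshold, the pixel is
    marked as occluded. Returns 1 for visible pixels, 0 for occluded ones."""
    h, w = disp_left.shape
    occlusion = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        for j in range(w):
            jr = int(round(j - disp_left[i, j]))   # corresponding column in the right image
            if 0 <= jr < w and abs(disp_left[i, j] - disp_right[i, jr]) <= threshold:
                occlusion[i, j] = 1.0              # consistent -> visible
    return occlusion
```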
  • The loss function L_stereo is composed of two parts, namely the disparity map estimation error L_disp and the occlusion map estimation error L_occ.
  • The multi-scale intermediate layers of the binocular disparity network also generate disparity and occlusion predictions, and loss weights w_m are applied directly to these multi-scale predictions; the disparity map estimation error and the occlusion map estimation error of each layer are accumulated over the layers, where m denotes the m-th layer.
  • the L1 loss function is used to avoid the influence of outliers, making the training process more robust.
  • Formula (11) represents the occlusion map estimation error L_occ; the binary cross-entropy loss is used, treating the occlusion map prediction as a classification task.
  • In formula (11), N is the total number of pixels in the image, and the formula compares the correctly labeled occlusion map with the occlusion map predicted by the binocular matching network being trained.
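  • Since formulas (10) and (11) are not reproduced above, the sketch below shows one plausible realization of the multi-scale loss as described: an L1 term on the disparity maps plus a binary cross-entropy term on the occlusion maps, summed over scales with weights w_m. It is an illustrative assumption, not the application's code.

```python
import numpy as np

def stereo_training_loss(pred_disps, pred_occs, gt_disp, gt_occ, scale_weights):
    """Multi-scale supervised loss: pred_disps and pred_occs are lists of
    per-scale predictions already resized to the ground-truth resolution."""
    eps = 1e-7
    total = 0.0
    for disp_m, occ_m, w_m in zip(pred_disps, pred_occs, scale_weights):
        l_disp = np.mean(np.abs(disp_m - gt_disp))            # L1 disparity estimation error
        occ_m = np.clip(occ_m, eps, 1.0 - eps)
        l_occ = -np.mean(gt_occ * np.log(occ_m)                # binary cross-entropy on occlusion
                         + (1.0 - gt_occ) * np.log(1.0 - occ_m))
        total += w_m * (l_disp + l_occ)
    return total
```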
  • The second step is to fine-tune, on real data, the trained binocular matching network obtained in the first step, using either a supervised or an unsupervised fine-tuning method; that is, the embodiment of the present application fine-tunes the trained binocular matching network in two ways.
  • The supervised fine-tuning method only uses the multi-scale L1 regression loss function L_stereo-supft, that is, the disparity map estimation error L_disp, to correct the previous pixel matching prediction errors; see formula (12).
  • the results show that with a small amount of supervised data, such as 100 pictures, the binocular matching network can also adapt from synthetic modal data to real modal data.
  • The unsupervised fine-tuning method, by contrast, fine-tunes the trained binocular matching network obtained in the first step using only real binocular pictures without depth markers.
  • the disparity map obtained by the unsupervised fine-tuning method in the prior art is blurred and the performance is poor, as shown in picture 21 in FIG. 2B.
  • This is due to the limitations of unsupervised loss and the ambiguity of matching pixels with only RGB values. Therefore, the embodiment of the present application introduces additional regular term constraints to improve performance.
  • The corresponding occlusion map and disparity map are obtained from the trained binocular matching network (the quantities denoted old in the formulas above), and these two outputs are used to help regularize the training process.
  • For the unsupervised fine-tuning loss function proposed in the embodiment of the present application, that is, the loss function L_stereo-unsupft, refer to the description in the foregoing embodiments.
  • the third step is to train the monocular depth estimation network, including: so far, we have conducted cross-modal training on the binocular matching network with a large amount of synthetic data, and fine-tuned using real data.
  • the embodiment of the present application uses the disparity map predicted by the trained binocular matching network to provide training data.
  • The loss L_mono of the monocular depth estimation is given by formula (13).
  • In formula (13), N is the total number of pixels; one quantity refers to the disparity map output by the monocular depth estimation network, and the other refers to the disparity map output by the trained binocular matching network or, if the trained binocular matching network has been fine-tuned, to the disparity map output by the fine-tuned network.
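  • Formula (13) is not reproduced above; the sketch below shows one plausible form consistent with that description, namely a per-pixel L1 penalty that regresses the monocular prediction toward the binocular network's disparity. It is an illustrative assumption, not the application's code.

```python
import numpy as np

def monocular_supervision_loss(mono_disp, stereo_disp):
    """L_mono sketch: average absolute difference between the disparity map
    predicted by the monocular network and the disparity map predicted by the
    (fine-tuned) binocular matching network, averaged over the N pixels."""
    mono_disp = np.asarray(mono_disp, dtype=np.float64)
    stereo_disp = np.asarray(stereo_disp, dtype=np.float64)
    return np.mean(np.abs(mono_disp - stereo_disp))
```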
  • FIG. 2C is a schematic diagram of visualized depth estimation results according to an embodiment of the present application. As shown in FIG. 2C:
  • the first row shows the inputs of the monocular depth estimation network, that is, three different street-scene pictures;
  • the second row shows the depth data obtained by interpolating the sparse lidar depth map using the nearest-neighbor algorithm, and the following rows show the depth maps corresponding to the three input pictures obtained by three different prior-art monocular depth estimation methods;
  • also shown are the depth maps corresponding to the three input pictures obtained by a monocular depth network supervised directly by the binocular matching network trained with synthetic data in the first step of the embodiment of the present application, namely the pictures labeled 21, 22, and 23;
  • and the depth maps corresponding to the three input pictures obtained by a monocular depth network whose training data is the disparity maps output by the binocular matching network after it has been fine-tuned using the unsupervised loss function proposed in the embodiment of the present application, namely the pictures labeled 24, 25, and 26.
  • The model obtained by the monocular depth estimation method of the embodiment of the present application can capture a more detailed scene structure.
  • FIG. 3 is a schematic structural diagram of a monocular depth estimation apparatus according to an embodiment of the present application.
  • The apparatus 300 includes: an acquisition module 301, an execution module 302, and an output module 303, wherein:
  • the acquisition module 301 is configured to acquire an image to be processed
  • The execution module 302 is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model;
  • the output module 303 is configured to output an analysis result of the image to be processed.
  • the apparatus further includes a third training module configured to supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
  • The apparatus further includes: a first training module configured to train a second binocular matching neural network model based on the obtained synthetic sample data; and a second training module configured to adjust the parameters of the trained second binocular matching neural network model based on the obtained real sample data to obtain the first binocular matching neural network model.
  • the apparatus further includes: a first obtaining module configured to obtain a synthesized binocular picture with a depth mark as the synthesized sample data, wherein the synthesized binocular picture includes a synthesized left image And synthetic right image.
  • the first training module includes: a first training unit configured to train a second binocular matching neural network model according to the synthesized binocular picture to obtain a trained second binocular The matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, and the disparity map describes each pixel in the left image and the right image
  • the parallax distance of the corresponding pixel point, the parallax distance is in pixels; the occlusion map describes whether the corresponding pixel point of each pixel point in the left image in the right image is blocked by an object.
  • The apparatus further includes: a construction module configured to construct a virtual 3D scene through a rendering engine; a mapping module configured to map the 3D scene into a binocular picture through two virtual cameras; a second acquisition module configured to acquire depth data of the synthetic binocular picture according to the position and direction used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and a third acquisition module configured to mark the binocular picture according to the depth data to obtain the synthesized binocular picture.
  • the second training module includes: a second training unit configured to perform supervised training on the trained second binocular matching neural network model according to the obtained real binocular data with depth markers, so that The weight of the trained second binocular matching neural network model is adjusted to obtain a first binocular matching neural network model.
  • the second training unit in the second training module is further configured to perform unsupervised training of the second binocular matching neural network model according to the obtained real binocular data without a depth marker. Training to adjust the weight of the trained second binocular matching neural network model to obtain a first binocular matching neural network model.
  • The second training unit in the second training module includes a second training component configured to use a loss function to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without depth markers, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
  • L_rel indicates that the output gradient of the first binocular matching network model is constrained to be consistent with the output gradient of the trained second binocular matching network model.
  • λ1 and λ2 represent intensity coefficients.
  • The apparatus further includes a second determining module configured to determine the reconstruction error by using formula (15) or formula (16). In these formulas, N is the number of pixels in the picture, ij represents the pixel coordinates of a pixel, and the quantities involved are: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth markers; the pixel values of the picture synthesized after sampling the right image (the reconstructed left picture) and of the picture synthesized after sampling the left image (the reconstructed right picture); and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers.
  • The apparatus further includes a third determining module configured to determine, using formula (17) or formula (18), that the disparity map output by the first binocular matching network model differs only slightly from the disparity map output by the trained second binocular matching network model. The quantities involved are the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image in the sample data, and λ3 represents an intensity coefficient.
  • The apparatus further includes a fourth determining module configured to determine, using formula (19) or formula (20), that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model. The quantities involved are the gradients of the disparity maps output by the first binocular matching network model for the left image and for the right image in the real binocular data without depth markers, and the gradients of the disparity maps output by the trained second binocular matching network model for the left image and for the right image in the sample data.
  • the depth-labeled real binocular data includes a left image and a right image.
  • The third training module includes: a first acquisition unit configured to acquire the left or right image in the depth-labeled real binocular data as a training sample;
  • and a first training unit configured to train the monocular depth estimation network model according to the left or right image in the depth-labeled real binocular data.
  • the true binocular data without a depth mark includes a left image and a right image.
  • The third training module further includes: a second acquisition unit configured to input the real binocular data without depth markers into the first binocular matching neural network model to obtain a corresponding disparity map;
  • and a first determining unit configured to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth markers, and the lens focal length of that camera.
  • The left or right image in the real binocular data without depth markers is used as sample data, and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  • the analysis result of the to-be-processed image includes a disparity map output by the monocular depth estimation network model.
  • The device further includes: a fifth determining module configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input into the monocular depth estimation network model, and the lens focal length of that camera;
  • and a first output module configured to output the depth map corresponding to the disparity map.
  • The computer software product is stored in a storage medium and includes several instructions for causing a computing device to execute all or part of the methods described in the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program codes, such as a U disk, a mobile hard disk, a ROM (Read Only Memory, read only memory), a magnetic disk, or an optical disk.
  • An embodiment of the present application provides a monocular depth estimation device.
  • The device includes a memory and a processor.
  • The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method are implemented.
  • an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a processor, the steps in the monocular depth estimation method are implemented.
  • the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and has similar beneficial effects as the method embodiments.
  • For technical details not disclosed in the storage medium and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
  • FIG. 4 is a schematic diagram of a hardware entity of the monocular depth estimation device according to the embodiment of the present application.
  • The hardware entity of the monocular depth estimation device 400 includes: a memory 401, a communication bus 402, and a processor 403.
  • the communication bus 402 may enable the monocular depth estimation device 400 to communicate with other terminals or servers through a network, and may also implement connection and communication between the processor 403 and the memory 401.
  • the processor 403 generally controls the overall operation of the monocular depth estimation apparatus 400.
  • the methods in the above embodiments can be implemented by means of software plus a necessary universal hardware platform, and of course, also by hardware, but in many cases the former is better.
  • Based on such an understanding, the part of the technical solution of the present application that is essential or that contributes to the existing technology can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present application.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.


Abstract

Provided in the embodiments of the present application is a method for estimating a monocular depth. The method comprises: acquiring an image to be processed; inputting the image to be processed into a monocular depth estimation network model obtained by means of training, and obtaining an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained by means of a disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed. Further provided in the embodiments of the present application are an apparatus and device for estimating a monocular depth, and a storage medium.

Description

单目深度估计方法及其装置、设备和存储介质Monocular depth estimation method, device, equipment and storage medium thereof
相关申请的交叉引用Cross-reference to related applications
本申请基于申请号为201810496541.6、申请日为2018年05月22日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此以全文引入的方式引入本申请。This application is based on a Chinese patent application with an application number of 201810496541.6 and an application date of May 22, 2018, and claims the priority of the Chinese patent application. The entire content of the Chinese patent application is hereby incorporated by reference in its entirety. .
技术领域Technical field
本申请实施例涉及人工智能领域,尤其涉及一种单目深度估计方法及其装置、设备和存储介质。Embodiments of the present application relate to the field of artificial intelligence, and in particular, to a monocular depth estimation method and a device, device, and storage medium thereof.
背景技术Background technique
单目深度估计是计算机视觉中的重要问题,单目深度估计的具体任务指的是预测一张图片中每个像素点的深度。其中,由每个像素点的深度值组成的图片又称为深度图。单目深度估计对于自动驾驶中的障碍物检测、三维场景重建,场景立体分析有着重要的意义。另外单目深度估计可以间接地提高其他计算机视觉任务的性能,比如物体检测、目标跟踪与目标识别。Monocular depth estimation is an important issue in computer vision. The specific task of monocular depth estimation is to predict the depth of each pixel in a picture. Among them, a picture composed of the depth value of each pixel is also called a depth map. Monocular depth estimation is of great significance for obstacle detection, three-dimensional scene reconstruction, and three-dimensional scene analysis in autonomous driving. In addition, monocular depth estimation can indirectly improve the performance of other computer vision tasks, such as object detection, target tracking and target recognition.
目前存在的问题是训练用于单目深度估计的神经网络需要大量标记的数据,但是获取标记数据成本很大。在室外环境下标记数据可以通过激光雷达获取,但是获取的标记数据是非常稀疏的,用这样的标记数据训练得到的单目深度估计网络没有清晰的边缘以及不能捕捉细小物体的正确深度信息。The current problem is that training neural networks for monocular depth estimation requires a large amount of labeled data, but obtaining labeled data is costly. In the outdoor environment, the marker data can be obtained by lidar, but the obtained marker data is very sparse. The monocular depth estimation network trained with such marker data has no clear edges and cannot capture the correct depth information of small objects.
发明内容Summary of the Invention
本申请实施例提供一种单目深度估计方法及其装置、设备和存储介质。The embodiments of the present application provide a monocular depth estimation method, an apparatus, a device and a storage medium thereof.
本申请实施例的技术方案是这样实现的:The technical solution of the embodiment of the present application is implemented as follows:
本申请实施例提供一种单目深度估计方法,所述方法包括:获取待处理图像;将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;输出所述待处理图像的分析结果。An embodiment of the present application provides a monocular depth estimation method. The method includes: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis of the image to be processed; As a result, the monocular depth estimation network model is supervised and trained through the disparity map output by the first binocular matching neural network model; and the analysis result of the image to be processed is output.
An embodiment of the present application provides a monocular depth estimation apparatus. The apparatus includes an acquisition module, an execution module, and an output module, wherein the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained through the disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
An embodiment of the present application provides a monocular depth estimation device, including a memory and a processor. The memory stores a computer program that can be run on the processor, and when the processor executes the program, the steps in the monocular depth estimation method provided by the embodiments of the present application are implemented.
本申请实施例提供一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被处理器执行时实现本申请实施例提供的单目深度估计方法中的步骤。An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps in the monocular depth estimation method provided by the embodiment of the present application are implemented.
本申请实施例中,通过获取待处理图像;将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;输出所述待处理图像的分析结果;从而能够使用更少或者不使用有深度图标记的数据训练单目深度估计网络,并且提出了一种更有效的无监督微调双目视差网络的方法,从而间接提高了单目深度估计的效果。In the embodiment of the present application, the image to be processed is obtained; the image to be processed is input to a trained monocular depth estimation network model to obtain the analysis result of the image to be processed, wherein the monocular depth estimation network The model is supervised and trained through the disparity map output by the first binocular matching neural network model; the analysis results of the to-be-processed images are output; thus, the monocular depth estimation network can be trained with less or no data marked with a depth map And, a more effective method of unsupervised fine-tuning binocular disparity network is proposed, which indirectly improves the effect of monocular depth estimation.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1A为本申请实施例单目深度估计方法的实现流程示意图一;FIG. 1A is a first schematic flowchart of a monocular depth estimation method according to an embodiment of the present application; FIG.
图1B为本申请实施例单个图片深度估计示意图;FIG. 1B is a schematic diagram of a single picture depth estimation according to an embodiment of the present application; FIG.
图1C为本申请实施例训练第二双目匹配神经网络模型示意图;FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application; FIG.
图1D为本申请实施例训练单目深度估计网络模型示意图;1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application;
图1E为本申请实施例损失函数相关图片示意图;FIG. 1E is a schematic diagram of relevant pictures of a loss function according to an embodiment of the present application; FIG.
图2A为本申请实施例单目深度估计方法的实现流程示意图二;FIG. 2A is a second schematic diagram of an implementation process of a monocular depth estimation method according to an embodiment of the present application; FIG.
图2B为本申请实施例损失函数效果示意图;FIG. 2B is a schematic diagram of an effect of a loss function according to an embodiment of the present application; FIG.
图2C为本申请实施例可视化深度估计结果示意图;2C is a schematic diagram of a visualization depth estimation result according to an embodiment of the present application;
图3为本申请实施例单目深度估计装置的组成结构示意图;3 is a schematic structural diagram of a monocular depth estimation device according to an embodiment of the present application;
图4为本申请实施例单目深度估计设备的一种硬件实体示意图。FIG. 4 is a schematic diagram of a hardware entity of a monocular depth estimation device according to an embodiment of the present application.
具体实施方式Detailed ways
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对申请的具体技术方案做进一步详细描述。以下实施例用于说明本申请,但不用来限制本申请的范围。To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the specific technical solutions of the application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are used to illustrate the present application, but are not intended to limit the scope of the present application.
在后续的描述中,使用用于表示元件的诸如“模块”、“部件”或“单元”的后缀仅为了有利于本申请的说明,其本身没有特定的意义。因此,“模块”、“部件”或“单元”可以混合地使用。In the following description, the use of suffixes such as "module", "component", or "unit" for indicating elements is merely for the benefit of the description of the present application, and it does not have a specific meaning itself. Therefore, "modules," "components," or "units" can be used in combination.
一般地,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。本申请实施例提出的单目深度估计方法使用神经网络训练得到,训练数据来自双目匹配输出的视差图数据,而不需要昂贵的深度采集设备如激光雷达。提供训练数据的双目匹配算法也是通过神经网络实现,该网络通过渲染引擎渲染的大量虚拟双目图片对进行预训练即可达到很好的效果,另外可以在真实数据上再进行微调训练以达到更好的效果。Generally, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. The monocular depth estimation method proposed in the embodiment of the present application is obtained by using neural network training. The training data comes from the disparity map data output by binocular matching, without the need for expensive depth acquisition equipment such as lidar. The binocular matching algorithm that provides training data is also implemented by a neural network. The network can achieve good results by pre-training a large number of virtual binocular image pairs rendered by the rendering engine. In addition, fine-tuning training can be performed on real data to achieve Better results.
下面结合附图和实施例对本申请的技术方案进一步详细阐述。The technical solution of the present application is further described in detail below with reference to the accompanying drawings and embodiments.
本申请实施例提供一种单目深度估计方法,该方法应用于计算设备,该方法所实现的功能可以通过服务器中的处理器调用程序代码来实现,当然程序代码可以保存在计算 机存储介质中,可见,该服务器至少包括处理器和存储介质。图1A为本申请实施例单目深度估计方法的实现流程示意图一,如图1A所示,该方法包括:An embodiment of the present application provides a monocular depth estimation method. The method is applied to a computing device. The functions implemented by the method can be implemented by a processor in a server calling program code. Of course, the program code can be stored in a computer storage medium. It can be seen that the server includes at least a processor and a storage medium. FIG. 1A is a schematic flowchart 1 of a method for implementing a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 1A, the method includes:
步骤S101、获取待处理图像;Step S101: Acquire an image to be processed;
这里,可以由移动终端来获取待处理图像,所述待处理图像,可以包含任意场景的图片。一般来说,移动终端在实施的过程中可以为各种类型的具有信息处理能力的设备,例如所述移动终端可以包括手机、个人数字助理(Personal Digital Assistant,PDA)、导航仪、数字电话、视频电话、智能手表、智能手环、可穿戴设备、平板电脑等。服务器在实现的过程中可以是移动终端如手机、平板电脑、笔记本电脑,固定终端如个人计算机和服务器集群等具有信息处理能力的计算设备。Here, an image to be processed may be acquired by a mobile terminal, and the image to be processed may include a picture of an arbitrary scene. Generally speaking, a mobile terminal may be various types of devices with information processing capabilities during the implementation process. For example, the mobile terminal may include a mobile phone, a Personal Digital Assistant (PDA), a navigator, a digital phone, Video phones, smart watches, smart bracelets, wearables, tablets, etc. The server may be a computing device with information processing capabilities such as a mobile terminal, such as a mobile phone, a tablet computer, a notebook computer, and a fixed terminal such as a personal computer and a server cluster.
步骤S102、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S102: Input the to-be-processed image to a trained monocular depth estimation network model to obtain an analysis result of the to-be-processed image, wherein the monocular depth estimation network model is matched by a first binocular matching nerve The disparity map output by the network model is used for supervised training;
本申请实施例中,所述单目深度估计网络模型主要是通过以下三个步骤获取的:第一步是使用渲染引擎渲染的合成双目数据预训练一个双目匹配神经网络;第二步是使用真实场景的数据对第一步得到的双目匹配神经网络进行微调训练;第三步是使用第二步得到的双目匹配神经网络对单目深度估计网络提供监督,从而训练得到单目深度估计网络。现有技术中,单目深度估计一般使用大量的真实标记数据进行训练,或者使用无监督的方法训练单目深度估计网络。但是,大量的真实标记数据获取成本很高,直接用无监督的方法训练单目深度估计网络又无法处理遮挡区域的深度估计,得到的效果较差。而本申请中所述单目深度估计网络模型的样本数据来自第一双目匹配神经网络模型输出的视差图,也就是说,本申请利用了双目视差来指导单目深度的预测。因此,本申请中的方法无需大量的标记数据,并且可以得到较好的训练效果。In the embodiment of the present application, the monocular depth estimation network model is mainly obtained through the following three steps: the first step is to pre-train a binocular matching neural network using synthetic binocular data rendered by the rendering engine; the second step is Use the real-world data to fine-tune the binocular matching neural network obtained in the first step; the third step is to use the binocular matching neural network obtained in the second step to provide supervision on the monocular depth estimation network, thereby training to obtain the monocular depth Estimate the network. In the prior art, monocular depth estimation generally uses a large amount of real labeled data for training, or uses an unsupervised method to train a monocular depth estimation network. However, the acquisition cost of a large amount of real labeled data is very high. Training the monocular depth estimation network directly using an unsupervised method cannot process the depth estimation of the occluded area, and the obtained result is poor. The sample data of the monocular depth estimation network model described in this application comes from the disparity map output by the first binocular matching neural network model, that is, this application uses binocular disparity to guide the prediction of the monocular depth. Therefore, the method in the present application does not require a large amount of labeled data, and can obtain better training results.
步骤S103、输出所述待处理图像的分析结果。这里,所述待处理图像的分析结果,指的是所述待处理图像对应的深度图。获取待处理图像后,将所述待处理图像输入至经过训练得到的单目深度估计网络模型,所述单目深度估计网络模型一般输出的是所述待处理图像对应的视差图,而不是深度图;因此,还需要根据所述单目深度估计网络模型输出的视差图、拍摄待处理图像的摄像机的镜头基线距离和拍摄待处理图像的摄像机的镜头焦距,确定所述待处理图像对应的深度图。Step S103: Output the analysis result of the image to be processed. Here, the analysis result of the image to be processed refers to a depth map corresponding to the image to be processed. After obtaining the image to be processed, the image to be processed is input to a trained monocular depth estimation network model, and the monocular depth estimation network model generally outputs a disparity map corresponding to the image to be processed instead of depth. Therefore, it is also necessary to determine the depth corresponding to the image to be processed according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captures the image to be processed, and the lens focal length of the camera that captures the image to be processed. Illustration.
图1B为本申请实施例单个图片深度估计示意图,如图1B所示,标号为11的图片11为待处理图像,标号为12的图片12为标号为11的图片11对应的深度图。FIG. 1B is a schematic diagram of the depth estimation of a single picture in the embodiment of the present application. As shown in FIG. 1B, the picture 11 with the number 11 is the image to be processed, the picture with the number 12 is the depth map corresponding to the picture 11 with the number 11.
在实际应用中,可以将所述镜头基线距离和所述镜头焦距的乘积,与所述输出的待处理图像对应的视差图的比值,确定为所述待处理图像对应的深度图。In practical applications, the product of the baseline distance of the lens and the focal length of the lens, and the ratio of the disparity map corresponding to the output image to be processed may be determined as the depth map corresponding to the image to be processed.
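As a concrete illustration of this conversion, the sketch below (a minimal example assuming the disparity map is given in pixels and the camera baseline and focal length are known; the function name and the numbers are illustrative, not taken from the application) computes depth as baseline × focal length / disparity:

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px, eps=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres).

    depth = baseline * focal_length / disparity, applied per pixel.
    eps guards against division by zero where disparity is missing.
    """
    disparity = np.asarray(disparity, dtype=np.float32)
    return baseline_m * focal_px / np.maximum(disparity, eps)

# Example with made-up numbers: a 2x2 disparity map, 0.5 m baseline, 700 px focal length.
disp = np.array([[35.0, 17.5], [7.0, 70.0]])
print(disparity_to_depth(disp, baseline_m=0.5, focal_px=700.0))
# [[10. 20.] [50.  5.]]
```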
基于上述方法实施例,本申请实施例再提供一种单目深度估计方法,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method, which includes:
步骤S111、获取有深度标记的合成的双目图片作为合成样本数据,其中,所述合成的双目图片包括合成的左图和合成的右图;Step S111: Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image;
在一些实施例中,所述方法还包括:步骤S11、通过渲染引擎构造虚拟3D场景; 步骤S12、通过两个虚拟的摄像机将所述3D场景映射成双目图片;步骤S13、根据构造所述虚拟3D场景时的位置、构造所述虚拟3D场景时的方向和所述虚拟的摄像机的镜头焦距获取所述合成双目图片的深度数据;步骤S14、根据所述深度数据标记所述双目图片,得到所述合成的双目图片。In some embodiments, the method further includes: step S11, constructing a virtual 3D scene through a rendering engine; step S12, mapping the 3D scene into a binocular picture through two virtual cameras; step S13, according to constructing the Obtain the depth data of the synthesized binocular picture by the position during the virtual 3D scene, the direction when constructing the virtual 3D scene, and the lens focal length of the virtual camera; step S14, marking the binocular picture according to the depth data To obtain the synthesized binocular picture.
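As an illustration of step S13, the depth of a scene point seen by a virtual camera is its z-coordinate in that camera's frame; below is a minimal sketch (assuming a pinhole camera with a known world-to-camera rotation; all names and values are illustrative and not part of the application):

```python
import numpy as np

def depth_from_virtual_camera(points_world, cam_position, cam_rotation):
    """Per-point depth seen by a virtual camera.

    points_world: (N, 3) 3D points of the virtual scene.
    cam_position: (3,) camera centre in world coordinates.
    cam_rotation: (3, 3) world-to-camera rotation matrix.
    Depth is the z-coordinate of each point in the camera frame.
    """
    points_cam = (np.asarray(points_world) - np.asarray(cam_position)) @ np.asarray(cam_rotation).T
    return points_cam[:, 2]

# Illustrative scene: three points in front of a camera at the origin looking along +z.
pts = np.array([[0.0, 0.0, 4.0], [1.0, -0.5, 7.5], [-2.0, 1.0, 12.0]])
print(depth_from_virtual_camera(pts, cam_position=np.zeros(3), cam_rotation=np.eye(3)))
# [ 4.   7.5 12. ]
```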
步骤S112、根据获取的合成样本数据训练第二双目匹配神经网络模型;Step S112: Train a second binocular matching neural network model according to the obtained synthetic sample data;
Here, in actual application, step S112 may be implemented by the following step: step S1121, training a second binocular matching neural network model according to the synthesized binocular pictures to obtain a trained second binocular matching neural network model, wherein the outputs of the trained second binocular matching neural network model are a disparity map and an occlusion map; the disparity map describes the disparity distance, in pixels, between each pixel in the left image and its corresponding pixel in the right image; and the occlusion map describes whether the pixel in the right image corresponding to each pixel in the left image is occluded by an object.
图1C为本申请实施例训练第二双目匹配神经网络模型示意图,如图1C所示,标号为11的图片11为合成的双目图片的左图,标号为12的图片12为合成的双目图片的右图,I L为标号为11的左图图片11中包含的所有像素点的像素值,I R为标号为12的右图图片12中包含的所有像素点的像素值;标号为13的图片13为第二双目匹配神经网络模型经过训练后输出的遮挡图,标号为14的图片14为第二双目匹配神经网络模型经过训练后输出的视差图,标号为15的图片15为第二双目匹配神经网络模型。 FIG. 1C is a schematic diagram of training a second binocular matching neural network model according to an embodiment of the present application. As shown in FIG. 1C, a picture labeled 11 is a left view of a synthesized binocular picture, and a picture labeled 12 is a synthesized binocular picture. In the right picture of the target picture, I L is the pixel value of all the pixels contained in picture 11 on the left picture labeled 11 and I R is the pixel value of all pixels contained in picture 12 on the right picture labeled 12; Picture 13 is the occlusion map of the second binocular matching neural network model after training, picture 14 is the picture 14 is the disparity map of the second binocular matching neural network model after training, picture 15 is the picture 15 Match the neural network model for the second binocular.
步骤S113、根据获取的真实样本数据对训练后的第二双目匹配神经网络模型的参数进行调整,得到第一双目匹配神经网络模型;Step S113: Adjust the parameters of the trained second binocular matching neural network model according to the obtained real sample data to obtain a first binocular matching neural network model;
这里,所述步骤S113可以通过两种方式实现,其中,第一种实现方式按照以下步骤实现:步骤S1131a、根据获取的带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。这里,获取的是带有深度标记的真实双目数据,这样,就可以直接用带有深度标记的真实双目数据,对步骤S112中训练后的第二双目匹配神经网络进行监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,进一步提高训练后的第二双目匹配神经网络模型的效果,得到第一双目匹配神经网络模型。在这一部分中,双目视差网络需要对真实数据进行适配。可以使用真实的带有深度标记的双目数据,通过有监督的训练对双目视差网络直接进行微调训练调整网络权值。第二种实现方式按照以下步骤实现:步骤S1131b、根据获取的不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。本申请实施例中,还可以使用不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。这里无监督训练指的是在没有深度数据标记的情况下,仅仅使用双目数据进行训练,可以使用无监督微调方法对此过程进行实现。Here, the step S113 can be implemented in two ways, wherein the first implementation method is implemented according to the following steps: step S1131a, the trained second binocular matching neural network is obtained according to the obtained real binocular data with depth markers The model undergoes supervised training to adjust the weight of the trained second binocular matching neural network model to obtain a first binocular matching neural network model. Here, the real binocular data with the depth marker is acquired. In this way, the real binocular data with the depth marker can be directly used to supervise the training of the second binocular matching neural network trained in step S112 to The weight of the trained second binocular matching neural network model is adjusted to further improve the effect of the trained second binocular matching neural network model to obtain a first binocular matching neural network model. In this part, the binocular parallax network needs to adapt to the real data. You can use real binocular data with depth markers to directly fine-tune the binocular disparity network through supervised training to adjust the network weights. The second implementation manner is implemented according to the following steps: step S1131b, performing unsupervised training on the trained second binocular matching neural network model according to the obtained real binocular data without depth marking, so as to adjust the trained first binocular matching neural network model. The two binocular matching neural network models are weighted to obtain the first binocular matching neural network model. In the embodiment of the present application, it is also possible to perform unsupervised training on the trained second binocular matching neural network model using real binocular data without depth marking, so as to adjust the trained second binocular matching neural network model. To get the first binocular matching neural network model. Here, unsupervised training refers to training using only binocular data without deep data marking, and this process can be implemented using unsupervised fine-tuning methods.
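For the supervised route of step S113 (step S1131a), a minimal sketch of the fine-tuning loss is given below, assuming PyTorch and assuming that the real depth labels are sparse (for example, projected lidar points converted to disparity) so that only labelled pixels contribute; all names are illustrative:

```python
import torch

def supervised_finetune_loss(pred_disp, gt_disp, valid_mask):
    """Masked L1 loss for supervised fine-tuning on real labelled pairs.

    gt_disp may be sparse; valid_mask is 1 where a label exists, 0 elsewhere.
    """
    diff = (pred_disp - gt_disp).abs() * valid_mask
    return diff.sum() / valid_mask.sum().clamp(min=1.0)

pred = torch.rand(1, 1, 8, 16) * 50
gt = torch.rand(1, 1, 8, 16) * 50
mask = (torch.rand(1, 1, 8, 16) > 0.9).float()   # sparse labels
print(supervised_finetune_loss(pred, gt, mask).item())
```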
步骤S114、通过所述第一双目匹配神经网络模型输出的视差图对单目深度估计网络 模型进行监督,从而训练所述单目深度估计网络模型;Step S114: Supervise the monocular depth estimation network model through the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model;
这里,所述步骤S114以通过两种方式实现,其中,第一种实现方式按照以下步骤实现:步骤S1141a、获取所述带深度标记的真实双目数据中的左图或右图作为训练样本,其中,所述带深度标记的真实双目数据包括左图和右图;步骤S1142a、根据所述带深度标记的真实双目数据中的左图或右图对单目深度估计网络模型进行训练。这里,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。因此,可以根据所述带深度标记的真实双目数据中的左图或右图对单目深度估计网络模型进行训练,其中,所述带深度标记的真实双目数据为步骤S1131a中使用的带深度标记的真实双目数据。第二种实现方式按照以下步骤实现:步骤S1141b、所述不带深度标记的真实双目数据输入到所述第一双目匹配神经网络模型,得到对应的视差图,其中,所述不带深度标记的真实双目数据包括左图和右图;步骤S1142b、根据所述对应的视差图、拍摄所述不带深度标记的真实双目数据的摄像机的镜头基线距离和拍摄所述不带深度标记的真实双目数据的摄像机的镜头焦距,确定所述视差图对应的深度图;步骤S1143b、所述不带深度标记的真实双目数据中的左图或右图作为样本数据,根据所述视差图对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型。这里,利用深度神经网络来预测单张图片的深度图,仅需要一张图片即可以对图片对应的场景进行三维建模,得到每个像素点的深度。因此,可以根据步骤S1131b中使用的不带深度标记的真实双目数据中的左图或右图作为样本数据,也是步骤S1141b中使用的不带深度标记的真实双目数据中的左图或右图作为样本数据,根据步骤S1141b中输出的视差图对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,得到训练后的单目深度估计网络模型。Here, step S114 is implemented in two ways, wherein the first implementation way is implemented according to the following steps: step S1141a, acquiring the left or right image in the real binocular data with depth mark as a training sample, The depth-labeled real binocular data includes a left image and a right image; step S1142a, a monocular depth estimation network model is trained according to the left or right image in the depth-labeled real binocular data. Here, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. Therefore, the monocular depth estimation network model may be trained according to the left or right image in the depth-labeled real binocular data, where the depth-labeled real binocular data is the band used in step S1131a. Deeply marked true binocular data. The second implementation manner is implemented according to the following steps: Step S1141b, the true binocular data without depth marking is input to the first binocular matching neural network model to obtain a corresponding disparity map, wherein the without disparity map The labeled true binocular data includes left and right images; step S1142b, according to the corresponding disparity map, a lens baseline distance of a camera that captures the true binocular data without a depth marker, and captures the image without the depth marker. The focal length of the camera of the real binocular data to determine the depth map corresponding to the parallax map; step S1143b, the left or right image in the real binocular data without the depth mark is used as sample data, and according to the parallax The depth map corresponding to the graph supervises the monocular depth estimation network model, thereby training the monocular depth estimation network model. Here, a deep neural network is used to predict the depth map of a single picture. Only one picture is needed to 3D model the scene corresponding to the picture to obtain the depth of each pixel. Therefore, the left image or the right image in the real binocular data without the depth mark used in step S1131b can be taken as the sample data, or the left image or the right in the real binocular data without the depth mark used in step S1141b. The map is used as sample data, and the monocular depth estimation network model is supervised according to the depth map corresponding to the disparity map output in step S1141b, so that the monocular depth estimation network model is trained, and the trained monocular depth estimation network model is obtained.
图1D为本申请实施例训练单目深度估计网络模型示意图,如图1D所示,图(a)表示了将不带深度标记的真实双目数据输入到所述第一双目匹配神经网络模型,得到对应的标号为13的视差图图片13,其中,所述不带深度标记的真实双目数据包括标号为11的左图图片11和标号为12的右图图片12,标号为15的图片15为第一双目匹配神经网络模型。图1D中的图(b)表示了将所述不带深度标记的真实双目数据中的左图或右图作为样本数据,根据所述标号为13的视差图图片13对应的深度图对单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,其中所述样本数据经过所述单目深度估计网络模型的输出为标号为14的视差图图片14,标号为16的图片16为单目深度估计网络模型。FIG. 1D is a schematic diagram of a training monocular depth estimation network model according to an embodiment of the present application. As shown in FIG. 1D, FIG. 1A shows inputting real binocular data without a depth marker to the first binocular matching neural network model. To obtain the corresponding parallax map picture 13 labeled 13, where the true binocular data without the depth mark includes a left picture 11 labeled 11 and a right picture 12 labeled 12 and a picture 15 labeled 15 is the first binocular matching neural network model. The figure (b) in FIG. 1D shows that the left or right image in the real binocular data without the depth mark is used as the sample data, and the depth map corresponding to the disparity map picture 13 labeled 13 is compared with the single image. The mesh depth estimation network model is supervised, thereby training the monocular depth estimation network model, wherein the output of the sample data after passing through the monocular depth estimation network model is a parallax map picture 14 labeled 14 and a picture labeled 16 16 is a monocular depth estimation network model.
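To make step S114 concrete, here is a minimal sketch (assuming PyTorch; the tiny network and random tensors are illustrative stand-ins, not the architecture of the application) of one training step in which the monocular network is supervised by the disparity predicted by the binocular matching network:

```python
import torch
import torch.nn as nn

class TinyMonoDepthNet(nn.Module):
    """Stand-in monocular network: image -> per-pixel disparity (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # keep disparity positive
        )
    def forward(self, x):
        return self.net(x)

def distillation_step(mono_net, optimizer, left_image, stereo_disparity):
    """One training step: the stereo network's disparity acts as the supervision signal."""
    optimizer.zero_grad()
    pred = mono_net(left_image)
    loss = torch.mean(torch.abs(pred - stereo_disparity))  # L1 against the stereo prediction
    loss.backward()
    optimizer.step()
    return loss.item()

mono = TinyMonoDepthNet()
opt = torch.optim.Adam(mono.parameters(), lr=1e-4)
left = torch.rand(2, 3, 32, 64)               # batch of left images
stereo_disp = torch.rand(2, 1, 32, 64) * 40   # disparity predicted by the stereo network
print(distillation_step(mono, opt, left, stereo_disp))
```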
步骤S115、获取待处理图像;Step S115: Obtain an image to be processed;
这里,在得到训练后的单目深度估计网络模型后,就可以使用此单目深度估计网络模型。即利用此单目深度估计网络模型,获取待处理图像对应的深度图。Here, after obtaining the trained monocular depth estimation network model, this monocular depth estimation network model can be used. That is, using this monocular depth estimation network model, a depth map corresponding to the image to be processed is obtained.
步骤S116、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S116: The image to be processed is input to a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is matched by a first binocular matching neural network. The disparity map output by the network model is used for supervised training;
步骤S117、输出所述待处理图像的分析结果,其中,所述待处理图像的分析结果包括所述单目深度估计网络模型输出的视差图;Step S117: Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model;
步骤S118、根据所述单目深度估计网络模型输出的视差图、拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头基线距离和拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头焦距,确定所述视差图对应的深度图;Step S118: According to the disparity map output by the monocular depth estimation network model, a lens baseline distance of a camera that takes a picture of the monocular depth estimation network model and a camera that takes a picture of the monocular depth estimation network model The focal length of the lens to determine the depth map corresponding to the parallax map;
步骤S119、输出所述视差图对应的深度图。Step S119: Output a depth map corresponding to the disparity map.
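Steps S115 to S119 can be summarized by a short inference sketch (assuming PyTorch, a trained model that outputs disparity in pixels, and known camera parameters; the stand-in network and numbers are illustrative):

```python
import torch

def estimate_depth(mono_net, image_bchw, baseline_m, focal_px, eps=1e-6):
    """Steps S115-S119 in miniature: image -> disparity -> depth."""
    mono_net.eval()
    with torch.no_grad():
        disparity = mono_net(image_bchw)                    # analysis result: disparity map
    depth = baseline_m * focal_px / torch.clamp(disparity, min=eps)
    return disparity, depth

# Illustrative stand-in network so the sketch runs end to end.
demo_net = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1), torch.nn.Softplus())
disp, depth = estimate_depth(demo_net, torch.rand(1, 3, 32, 64), baseline_m=0.5, focal_px=700.0)
print(disp.shape, depth.shape)
```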
基于上述方法实施例,本申请实施例再提供一种单目深度估计方法,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method, which includes:
步骤S121、获取有深度标记的合成的双目图片作为合成样本数据,其中,所述合成的双目图片包括合成的左图和合成的右图。Step S121: Obtain a synthesized binocular picture with a depth mark as synthesized sample data, where the synthesized binocular picture includes a synthesized left image and a synthesized right image.
步骤S122、根据获取的合成样本数据训练第二双目匹配神经网络模型;Step S122: Train a second binocular matching neural network model according to the obtained synthetic sample data;
这里,使用合成数据用于训练第二双目匹配神经网络模型具有更好的泛化能力。Here, using synthetic data for training the second binocular matching neural network model has better generalization ability.
Step S123: Determine the loss function by using formula (1): L_stereo-unsupft = L_photo + γ_1·L_abs + γ_2·L_rel (1); where L_stereo-unsupft denotes the loss function proposed in the embodiment of the present application; L_photo denotes the reconstruction error; L_abs constrains the disparity map output by the first binocular matching network model to deviate only slightly from the disparity map output by the trained second binocular matching network model; L_rel constrains the output gradient of the first binocular matching network model to be consistent with the output gradient of the trained second binocular matching network model; and γ_1 and γ_2 are intensity coefficients. Here, L_abs and L_rel are regularization terms.
在一些实施例中,步骤S123中的公式(1)还可以通过以下步骤中的公式进行细化,即所述方法还包括:步骤S1231、利用公式(2)或公式(3)确定所述重建误差:
Figure PCTCN2019076247-appb-000001
其中,所述N表示图片中像素的个数;所述
Figure PCTCN2019076247-appb-000002
表示所述训练后的第二双目匹配网络模型输出的遮挡图的像素值;所述
Figure PCTCN2019076247-appb-000003
表示不带深度标记的真实双目数据中的左图的像素值;所述
Figure PCTCN2019076247-appb-000004
表示不带深度标记的真实双目数据中的右图的像素值;所述
Figure PCTCN2019076247-appb-000005
表示将右图采样后合成的图片的像素值,即重建的左图;所述
Figure PCTCN2019076247-appb-000006
表示将左图采样后合成的图片的像素值,即重建的右图;所述
Figure PCTCN2019076247-appb-000007
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络模型输出的视差图的像素值;所述
Figure PCTCN2019076247-appb-000008
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络模型输出的视差图的像素值;所述ij表示像素点的像素坐标;所述old表示训练后的第二双目匹配网络模型的输出;所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据;所述I表示图片像素点的RGB(Red Green Blue,红色、绿色和蓝色)值。步骤S1232、利用公式(4)或公式(5)确定所述第一双目匹配网络模型输出的视差图与所述训练后的第二双目匹配网络模型输出的视差图相比偏离较小:
Figure PCTCN2019076247-appb-000009
其中,所述N表示图片中像素的个数,所述
Figure PCTCN2019076247-appb-000010
表示所述训练后的第二双目匹配网络模型输出的遮挡图的像素值,所述
Figure PCTCN2019076247-appb-000011
表示样本数据中的左图经训练后的第二双目匹配网络输出的视差图的像素值,所述
Figure PCTCN2019076247-appb-000012
表示样本数据中的右图经训练后的第二双目匹配网络输出的视差图 的像素值,所述
Figure PCTCN2019076247-appb-000013
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络输出的视差图的像素值,所述
Figure PCTCN2019076247-appb-000014
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络输出的视差图的像素值,所述ij表示像素点的像素坐标,所述old表示训练后的第二双目匹配网络模型的输出,所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据,所述γ 3表示强度系数。步骤S1233、利用公式(6)或公式(7)确定所述第一双目匹配网络模型的输出梯度与所述第二双目匹配网络模型的输出梯度一致:
Figure PCTCN2019076247-appb-000015
其中,所述N表示图片中像素的个数,所述
Figure PCTCN2019076247-appb-000016
表示不带深度标记的真实双目数据中的左图经第一双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000017
表示不带深度标记的真实双目数据中的右图经第一双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000018
表示样本数据中的左图经训练后的第二双目匹配网络输出的视差图的梯度,所述
Figure PCTCN2019076247-appb-000019
表示样本数据中的右图经训练后的第二双目匹配网络输出的视差图的梯度,所述old表示训练后的第二双目匹配网络模型的输出,所述R表示右图或右图的相关数据,所述L表示左图或左图的相关数据。
In some embodiments, the formula (1) in step S123 can also be refined by the formula in the following step, that is, the method further includes: step S1231, determining the reconstruction using formula (2) or formula (3) error:
Figure PCTCN2019076247-appb-000001
Where N is the number of pixels in the picture;
Figure PCTCN2019076247-appb-000002
Pixel values of the occlusion map output by the trained second binocular matching network model;
Figure PCTCN2019076247-appb-000003
Represents the pixel value of the left image in true binocular data without a depth marker; said
Figure PCTCN2019076247-appb-000004
Represents the pixel value of the right image in true binocular data without a depth marker; said
Figure PCTCN2019076247-appb-000005
Represents the pixel value of the picture synthesized after sampling the right picture, that is, the reconstructed left picture; said
Figure PCTCN2019076247-appb-000006
Represents the pixel value of the picture synthesized after sampling the left picture, that is, the reconstructed right picture; said
Figure PCTCN2019076247-appb-000007
Represents the pixel value of the disparity map output by the first binocular matching network model of the left image in the real binocular data without the depth mark; said
Figure PCTCN2019076247-appb-000008
Represents the pixel value of the disparity map output by the first binocular matching network model from the right image in the real binocular data without depth mark; the ij represents the pixel coordinates of the pixel; the old represents the second bin after training The output of the target matching network model; the R represents the relevant data of the right or right picture, the L represents the relevant data of the left or left picture; and the I represents RGB (Red Green Blue, red, green) And blue) values. Step S1232, using formula (4) or formula (5) to determine that the disparity map output by the first binocular matching network model is smaller than the disparity map output by the trained second binocular matching network model:
Figure PCTCN2019076247-appb-000009
Where N is the number of pixels in the picture, and
Figure PCTCN2019076247-appb-000010
Pixel values of the occlusion map output by the trained second binocular matching network model, said
Figure PCTCN2019076247-appb-000011
Represents the pixel values of the disparity map output by the second binocular matching network after training on the left image in the sample data, said
Figure PCTCN2019076247-appb-000012
Represents the pixel values of the disparity map output by the second binocular matching network after training on the right in the sample data, said
Figure PCTCN2019076247-appb-000013
Represents the pixel values of the disparity map output by the left image via the first binocular matching network in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000014
Represents the pixel value of the disparity map output by the first binocular matching network from the right image in the real binocular data without depth mark, the ij represents the pixel coordinates of the pixel, and the old represents the second binocular after training The output of the matching network model, where R represents the data on the right or right, L represents the data on the left or left, and γ 3 represents the intensity coefficient. Step S1233: Use formula (6) or formula (7) to determine that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model:
Figure PCTCN2019076247-appb-000015
Where N is the number of pixels in the picture, and
Figure PCTCN2019076247-appb-000016
Represents the gradient of the disparity map output by the left image via the first binocular matching network in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000017
Represents the gradient of the disparity map output by the first binocular matching network from the right image in the real binocular data without a depth marker, said
Figure PCTCN2019076247-appb-000018
Represents the gradient of the disparity map output by the second binocular matching network after training on the left image in the sample data, said
Figure PCTCN2019076247-appb-000019
Represents the gradient of the disparity map output by the trained second binocular matching network from the right image in the sample data, the old represents the output of the trained second binocular matching network model, and R represents the right or right The relevant data, where L represents the left picture or the relevant data of the left picture.
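Following the reconstruction of formulas (1) to (7) above, the sketch below shows how the unsupervised fine-tuning loss could be assembled (a minimal sketch assuming PyTorch, that the reconstructed left image and the pre-trained "old" disparity and occlusion maps are already available, and that the γ_3 weighting follows the form given above; all names and coefficient values are illustrative, not fixed by the application):

```python
import torch

def unsup_finetune_loss(left, left_recon, disp_new, disp_old, occ_old,
                        gamma1=0.5, gamma2=1.0, gamma3=0.1):
    """L_stereo-unsupft = L_photo + gamma1 * L_abs + gamma2 * L_rel (left-image variant).

    left, left_recon  : (B,3,H,W) original left image and left image warped from the right
    disp_new, disp_old: (B,1,H,W) disparity from the network being tuned / the pre-trained one
    occ_old           : (B,1,H,W) pre-trained occlusion map, 1 = visible, 0 = occluded
    """
    # (2) photometric term, counted only where the pixel is visible in the right image
    l_photo = (occ_old * (left - left_recon).abs()).mean()

    # (4) anchor to the old disparity: full weight on occluded pixels, gamma3 elsewhere
    w_abs = (1.0 - occ_old) + gamma3 * occ_old
    l_abs = (w_abs * (disp_new - disp_old).abs()).mean()

    # (6) keep the disparity gradients close to the old prediction
    def grad_xy(d):
        return d[..., :, 1:] - d[..., :, :-1], d[..., 1:, :] - d[..., :-1, :]
    gx_n, gy_n = grad_xy(disp_new)
    gx_o, gy_o = grad_xy(disp_old)
    l_rel = (gx_n - gx_o).abs().mean() + (gy_n - gy_o).abs().mean()

    return l_photo + gamma1 * l_abs + gamma2 * l_rel

loss = unsup_finetune_loss(torch.rand(1, 3, 16, 32), torch.rand(1, 3, 16, 32),
                           torch.rand(1, 1, 16, 32), torch.rand(1, 1, 16, 32),
                           (torch.rand(1, 1, 16, 32) > 0.2).float())
print(loss.item())
```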
步骤S124、使用损失函数(Loss),根据所述不带深度标记的真实双目数据对训练后的第二双目匹配神经网络模型进行无监督训练,以调整所述训练后的第二双目匹配神经网络模型的权值,得到第一双目匹配神经网络模型。Step S124: Use a loss function (Loss) to perform unsupervised training on the trained second binocular matching neural network model according to the real binocular data without the depth marker to adjust the trained second binocular Match the weights of the neural network model to get the first binocular matching neural network model.
这里,所述损失函数(Loss)利用了步骤S122中训练后的第二双目匹配神经网络的输出对微调训练进行正则化,避免了现有技术中的无监督微调普遍存在的预测变模糊的问题,提高了微调得到的第一双目匹配网络的效果,从而间接提高了第一双目匹配网络监督得到的单目深度网络的效果。图1E为本申请实施例损失函数相关图片示意图,如图1E所示,图(a)为不带深度标记的真实双目数据的左图;图1E中的图(b)为不带深度标记的真实双目数据的右图;图1E中的图(c)为将图(a)和图(b)组成的不带深度标记的真实双目图片输入至经过训练后的第二双目匹配神经网络模型输出的视差图;图1E中的图(d)为将图(b)表示的右图进行采样后,结合图(c)表示的视差图,对左图进行重建后的图片;图1E中的图(e)为将图(a)表示的左图中的像素与图(d)表示的重建后的左图中的对应像素做差得到的图片,即左图的重建误差图;图1E中的图(f)为将图(a)和图(b)组成的不带深度标记的真实双目图片输入至经过训练后的第二双目匹配神经网络模型输出的遮挡图。其中,图(d)中所有的红框11表示所述重建后的左图与图(a)标识的真实左图有差异的部分,图(e)中所有的红框12表示所述重建误差图中有误差的部分,即被遮挡的部分。这里,实现步骤S124中描述的用无监督微调训练双目视差网络时,需要使用右图对左图进行重建,但是有遮挡区域是无法重建正确的,因此,用遮挡图来清理这一部分的错误训练信号来提高无监督微调训练的效果。Here, the loss function (Loss) uses the output of the second binocular matching neural network after training in step S122 to regularize the fine-tuning training, avoiding the unpredictable, ubiquitous predictions commonly found in unsupervised fine-tuning in the prior art. The problem improves the effect of the first binocular matching network obtained by fine-tuning, thereby indirectly improving the effect of the monocular deep network obtained by the supervision of the first binocular matching network. FIG. 1E is a schematic diagram of a related picture of a loss function according to an embodiment of the present application. As shown in FIG. 1E, FIG. 1A is a left image of real binocular data without a depth marker; FIG. 1E is a graph without a depth marker. The right image of the real binocular data; Figure (c) in Figure 1E is the real binocular image without depth mark composed of Figures (a) and (b) input to the trained second binocular match Parallax map output by the neural network model; Figure (d) in Figure 1E is a picture after reconstructing the left picture after sampling the right picture shown in Figure (b) and combining the parallax map shown in Figure (c); Figure (e) in 1E is a picture obtained by making a difference between the pixels in the left image shown in (a) and the corresponding pixels in the reconstructed left image shown in (d), that is, the reconstruction error map of the left; Figure (f) in FIG. 1E is an occlusion map inputting a real binocular image without depth mark composed of the figures (a) and (b) to the output of a trained second binocular matching neural network model. Among them, all the red boxes 11 in the figure (d) indicate the parts where the reconstructed left picture is different from the real left picture identified in the figure (a), and all the red boxes 12 in the figure (e) show the reconstruction errors. There is an error in the picture, that is, the part that is blocked. Here, when training the binocular disparity network with unsupervised fine-tuning described in step S124, the left image needs to be reconstructed using the right image, but the occluded area cannot be reconstructed correctly. Therefore, the occlusion image is used to clear this part of the error Training signals to improve the effect of unsupervised fine-tuning training.
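The reconstruction of the left image from the right image, whose error the occlusion map then masks, can be sketched as follows (a minimal sketch assuming PyTorch's grid_sample, a rectified stereo pair, and disparity expressed in pixels; names are illustrative):

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp_left):
    """Reconstruct the left image by sampling the right image at x - d(x).

    right: (B,3,H,W) right image; disp_left: (B,1,H,W) left disparity in pixels.
    Occluded regions cannot be reconstructed correctly, which is why the occlusion
    map is used to remove their reconstruction error from the training signal.
    """
    _, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.unsqueeze(0) - disp_left[:, 0]          # shift each pixel left by its disparity
    ys = ys.unsqueeze(0).expand_as(xs)
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(right, grid, mode="bilinear", padding_mode="border",
                         align_corners=True)

recon_left = warp_right_to_left(torch.rand(1, 3, 16, 32), torch.full((1, 1, 16, 32), 2.0))
print(recon_left.shape)  # torch.Size([1, 3, 16, 32])
```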
步骤S125、通过所述第一双目匹配神经网络模型输出的视差图对所述单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型。Step S125: Supervise the monocular depth estimation network model through a disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
这里,所述单目深度估计网络模型的样本图片,可以是不带深度标记的真实双目数据中的左图,也可以是不带深度标记的真实双目数据中的右图。其中,如果使用左图作为样本图片,则通过公式(1)、公式(2)、公式(4)和公式(6)来确定损失函数;如果使用右图作为样本图片,则通过公式(1)、公式(3)、公式(5)和公式(7)来确定损失函数。Here, the sample picture of the monocular depth estimation network model may be a left image in real binocular data without a depth marker, or a right image in real binocular data without a depth marker. Among them, if the left picture is used as a sample picture, the loss function is determined by formula (1), formula (2), formula (4), and formula (6); if the right picture is used as a sample picture, then formula (1) , Formula (3), formula (5) and formula (7) to determine the loss function.
本申请实施例中,所述通过所述第一双目匹配神经网络模型输出的视差图对所述单目深度估计网络模型进行监督,从而训练所述单目深度估计网络模型,指的是通过所述第一双目匹配神经网络模型输出的视差图对应的深度图对所述单目深度估计网络模型进行监督,也即使提供监督信息,从而训练所述单目深度估计网络模型。In the embodiment of the present application, supervising the monocular depth estimation network model by using a disparity map output by the first binocular matching neural network model, so as to train the monocular depth estimation network model. The depth map corresponding to the disparity map output by the first binocular matching neural network model supervises the monocular depth estimation network model, and even if supervising information is provided, the monocular depth estimation network model is trained.
步骤S126、获取待处理图像;Step S126: Acquire the image to be processed;
步骤S127、将所述待处理图像输入至经过训练得到的单目深度估计网络模型,得到所述待处理图像的分析结果,其中,所述单目深度估计网络模型是通过第一双目匹配神经网络模型输出的视差图进行监督训练的;Step S127: The image to be processed is input to a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is matched by a first binocular matching neural network. The disparity map output by the network model is used for supervised training;
步骤S128、输出所述待处理图像的分析结果,其中,所述待处理图像的分析结果包括所述单目深度估计网络模型输出的视差图。Step S128: Output the analysis result of the image to be processed, where the analysis result of the image to be processed includes a disparity map output by the monocular depth estimation network model.
步骤S129、根据所述单目深度估计网络模型输出的视差图、拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头基线距离和拍摄输入所述单目深度估计网络模型的图片的摄像机的镜头焦距,确定所述视差图对应的深度图;Step S129: According to the disparity map output by the monocular depth estimation network model, a lens baseline distance of a camera that takes a picture of the monocular depth estimation network model and a camera that takes a picture of the monocular depth estimation network model The focal length of the lens to determine the depth map corresponding to the parallax map;
步骤S130、输出所述视差图对应的深度图。Step S130: Output a depth map corresponding to the disparity map.
本申请实施例中,当所述待处理图像为街景图片时,就可以使用所述训练后的单目深度估计网络模型预测所述街景图片的深度。In the embodiment of the present application, when the to-be-processed image is a street view picture, the trained monocular depth estimation network model may be used to predict the depth of the street view picture.
基于上述的方法实施例,本申请实施例再提供一种单目深度估计方法,图2A为本申请实施例单目深度估计方法的实现流程示意图二,如图2A所示,该方法包括:Based on the foregoing method embodiments, an embodiment of the present application further provides a monocular depth estimation method. FIG. 2A is a second schematic diagram of the implementation process of the monocular depth estimation method according to the embodiment of the present application. As shown in FIG.
步骤S201、使用渲染引擎渲染的合成数据训练双目匹配网络,得到双目图片的视差图;Step S201: Use the synthetic data rendered by the rendering engine to train a binocular matching network to obtain a disparity map of the binocular picture;
这里,所述双目匹配网络的输入为:一对双目图片(包含左图和右图),所述双目匹配网络的输出为:视差图、遮挡图,即双目匹配网络使用双目图片作为输入,输出视差图和遮挡图。其中,视差图用于描述左图中每个像素点与右图中对应的像素点的视差距离,以像素为单位;遮挡图用于描述左图每个像素在右图中对应的像素点是否被其他物体遮挡。由于视角的变化,左图中的一些区域在右图中会被其他物体遮挡,遮挡图则是用于标记左图中的像素是否在右图中被遮挡。这一部分,双目匹配网络使用计算机渲染引擎产生的合成数据进行训练,首先通过渲染引擎构造一些虚拟3D场景,然后通过两个虚拟的摄像机将3D场景映射成双目图片,从而获得合成数据,同时正确的深度数据和相机焦距等数据也可以从渲染引擎中得到,所以双目匹配网络可以直接通过这些标记数据进行监督训练。Here, the input of the binocular matching network is: a pair of binocular pictures (including the left and right pictures), and the output of the binocular matching network is: a disparity map and an occlusion map, that is, the binocular matching network uses binocular Pictures are used as input, and disparity and occlusion maps are output. The disparity map is used to describe the disparity distance of each pixel in the left picture and the corresponding pixel point in the right picture, in pixels; the occlusion map is used to describe whether each pixel in the left picture corresponds to the pixel in the right picture. Obscured by other objects. Due to changes in perspective, some areas in the left image will be blocked by other objects in the right image. The occlusion image is used to mark whether the pixels in the left image are blocked in the right image. In this part, the binocular matching network is trained using the synthetic data generated by the computer rendering engine. First, some virtual 3D scenes are constructed by the rendering engine, and then the 3D scenes are mapped into binocular pictures by two virtual cameras to obtain synthetic data. Data such as the correct depth data and camera focal length can also be obtained from the rendering engine, so the binocular matching network can directly supervise training through these labeled data.
步骤S202、利用损失函数,通过无监督微调方法在真实双目图片数据上对步骤S201 得到的双目匹配网络进行微调;Step S202: Use the loss function to fine-tune the binocular matching network obtained in step S201 on the real binocular image data through an unsupervised fine-tuning method;
在这一部分中,双目视差网络需要对真实数据进行适配。即使用不带深度标记的真实双目数据对双目视差网络进行无监督训练。这里无监督训练指的是在没有深度数据标记的情况下,仅仅使用双目数据进行训练。本申请实施例提出了一种新的无监督微调方法,即使用上述实施例中的损失函数进行无监督微调。本申请实施例提出的损失函数的主要目的是希望在不降低预训练效果的情况下在真实双目数据上对双目视差网络进行微调,微调过程中借助了步骤S201得到的预训练双目视差网络的初步输出进行指导和正则化。图2B为本申请实施例损失函数效果示意图,如图2B所示,标号为21的图片21为使用现有技术中的损失函数时得到的视差图,标号为22的图片22为使用本申请实施例提出的损失函数时得到的视差图。现有技术的损失函数没有单独考虑遮挡区域,会将遮挡区域的图像重建误差也优化为零,这样会导致遮挡区域的预测视差错误,视差图的边缘也会模糊,而本申请中的损失函数用遮挡图来清理这一部分的错误训练信号来提高无监督微调训练的效果。In this part, the binocular parallax network needs to adapt to the real data. That is, the binocular disparity network is trained unsupervisedly using real binocular data without depth marking. Here, unsupervised training refers to training using only binocular data without deep data marking. The embodiment of the present application proposes a new unsupervised fine-tuning method, which uses the loss function in the above embodiment to perform unsupervised fine-tuning. The main purpose of the loss function proposed in the embodiment of the present application is to hope to fine-tune the binocular disparity network on real binocular data without reducing the pre-training effect. During the fine-tuning process, the pre-trained binocular disparity obtained in step S201 is used during the fine-tuning. The initial output of the network is guided and regularized. FIG. 2B is a schematic diagram of the effect of the loss function in the embodiment of the present application. As shown in FIG. 2B, the picture 21 is a disparity diagram obtained when using the loss function in the prior art, and the picture 22 is implemented using the present application. The disparity map obtained when the proposed loss function is exemplified. The loss function of the prior art does not consider the occlusion area separately, and the image reconstruction error of the occlusion area is also optimized to zero, which will cause the prediction parallax error of the occlusion area, and the edges of the disparity map will be blurred. The loss function in this application Use the occlusion map to clean up the erroneous training signals in this part to improve the effect of unsupervised fine-tuning training.
步骤S203、使用步骤S202得到的双目匹配网络在真实数据上对单目深度估计进行监督,最终得到单目深度估计网络。这里,所述单目深度估计网络的输入为:单张单目图片,所述单目深度估计网络的输出为:深度图。在步骤S202中得到了在真实数据上微调过的双目视差网络,对于每一对双目图片,双目视差网络预测得到视差图,通过视差图D、双目镜头基线距离b以及镜头焦距f,可以计算得到视差图对应的深度图,即通过公式(8),可以计算得到视差图对应的深度图d:d=bf/D(8);为了训练单目深度网络预测得到深度图,可以使用双目图片对中的左图作为单目深度网路的输入,然后使用双目视差网络输出计算得到的深度图进行监督,从而训练单目深度网路,得到最终结果。在实际应用中,可以本申请实施例中的单目深度估计方法训练得到用于无人驾驶的深度估计模块,从而对场景进行三维重建或者障碍物检测。且本申请实施例提出的无监督微调方法提高了双目视差网络的性能。Step S203: Use the binocular matching network obtained in step S202 to supervise the monocular depth estimation on the real data, and finally obtain the monocular depth estimation network. Here, the input of the monocular depth estimation network is: a single monocular picture, and the output of the monocular depth estimation network is: a depth map. In step S202, the binocular disparity network fine-tuned on the real data is obtained. For each pair of binocular pictures, the binocular disparity network predicts a disparity map, and the disparity map D, the baseline distance b of the binocular lens, and the lens focal length f are obtained. , The depth map corresponding to the disparity map can be calculated, that is, the depth map corresponding to the disparity map can be calculated by formula (8) d: d = bf / D (8); in order to train the monocular depth network prediction to obtain the depth map, you can The left image in the binocular image pair is used as the input of the monocular deep network, and then the depth map calculated by the binocular disparity network output is used to supervise, thereby training the monocular deep network to obtain the final result. In practical applications, the monocular depth estimation method in the embodiments of the present application can be trained to obtain a depth estimation module for unmanned driving, thereby performing three-dimensional reconstruction or obstacle detection on the scene. And the unsupervised fine-tuning method proposed in the embodiment of the present application improves the performance of the binocular disparity network.
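As a quick worked example of formula (8), with made-up camera values that are not taken from the application:

```python
# d = b * f / D with illustrative values: baseline b = 0.54 m, focal length f = 720 px
b, f = 0.54, 720.0
for D in (10.0, 20.0, 40.0):      # disparity in pixels
    print(D, b * f / D)           # 38.88 m, 19.44 m, 9.72 m
```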
现有技术中,有监督的单目深度估计方法,获取准确的标记数据是非常有限也是非常难的。基于重建误差的无监督方法的性能通常受到像素匹配歧义的限制。为了解决这些问题,本申请实施例提出了一种新的单目深度估计方法,解决了现有技术中监督和无监督深度估计方法存在的局限性。本申请实施例中的方法是使用一个双目匹配网络在跨模态合成数据上训练,并用来监督单目深度估计网络。所述双目匹配网络是基于左右图的像素匹配关系来获得视差,而不是从语义特征中提取,因此,双目匹配网络可以很好地从合成数据泛化到真实数据。本申请实施例的方法主要包括三个步骤。第一,用合成数据对双目匹配网络进行训练,从双目图片中预测遮挡图和视差图。第二,根据可用的真实数据,在有监督或者无监督的情况下,对训练后的双目匹配网络有选择性地进行调整。第三,在第二步得到的用真实数据微调训练后的双目匹配网络的监督下,训练单目深度估计网络。这样可以间接利用双目匹配网络来使单目深度估计更好地利用合成数据来提高性能。In the prior art, a supervised monocular depth estimation method is very limited and difficult to obtain accurate labeled data. The performance of unsupervised methods based on reconstruction errors is usually limited by the pixel matching ambiguity. In order to solve these problems, a new monocular depth estimation method is proposed in the embodiment of the present application, which solves the limitations of the supervised and unsupervised depth estimation methods in the prior art. The method in the embodiment of the present application is to use a binocular matching network to train on cross-modal synthetic data, and to supervise the monocular depth estimation network. The binocular matching network obtains disparity based on the pixel matching relationship between the left and right images, rather than extracting from the semantic features. Therefore, the binocular matching network can well generalize from synthetic data to real data. The method in the embodiment of the present application mainly includes three steps. First, the binocular matching network is trained with synthetic data to predict occlusion maps and disparity maps from binocular pictures. Second, according to the available real data, with or without supervision, the trained binocular matching network is selectively adjusted. Third, the monocular depth estimation network is trained under the supervision of the binocular matching network fine-tuned with the real data obtained in the second step. In this way, the binocular matching network can be used indirectly to make the monocular depth estimation make better use of synthetic data to improve performance.
第一步、利用合成数据对双目匹配网络进行训练,包括:目前由图形渲染引擎可以生成很多的包含深度信息的合成图像。但是,直接将这些合成图像数据与真实数据合并来训练单目深度估计网络得到的性能通常较差,因为单目深度估计对输入场景的语义信息非常敏感。合成数据和真实数据之间的巨大模态差距使得使用合成数据辅助训练变得毫无用处。然而,双目匹配网络有更好的泛化能力,使用合成数据训练的双目匹配网络在真实数据上也能得到较好的视差图输出。因此,本申请实施例将双目匹配网络训练作为在合成数据和真实数据之间的桥梁来提高单目深度训练的性能。首先利用大量的合成双目数据对双目匹配网络进行预训练。与传统的结构不同,实施例中的双目匹配网络在视差图的基础上,还估计了多尺度遮挡图。其中,遮挡图表示在正确的图像中,左侧图像像素的在右图中的对应像素点是否被其他物体遮挡。在接下来的步骤中,无监督的微调方法会使用到所述遮挡图,以避免错误的估计。其中,可以使用左右视差一致性检验方法,利用公式(9)从正确标记的视差图中得到有正确标记的遮挡图
Figure PCTCN2019076247-appb-000020
The first step is to use synthetic data to train the binocular matching network. Current graphics rendering engines can generate many synthetic images with ground-truth depth information. However, training the monocular depth estimation network by directly mixing such synthetic images with real data usually performs poorly, because monocular depth estimation is very sensitive to the semantic content of the input scene, and the large modal gap between synthetic and real data makes synthetic data of little direct help. The binocular matching network, by contrast, generalizes better: a binocular matching network trained on synthetic data can still produce reasonable disparity maps on real data. The embodiment of the present application therefore uses binocular matching network training as a bridge between synthetic and real data to improve the performance of monocular depth training. First, a large amount of synthetic binocular data is used to pre-train the binocular matching network. Unlike a conventional structure, the binocular matching network in this embodiment also estimates a multi-scale occlusion map in addition to the disparity map. The occlusion map indicates, for each pixel of the left image, whether its corresponding pixel in the right image is occluded by other objects. In the subsequent unsupervised fine-tuning step, the occlusion map is used to avoid false estimates. A left-right disparity consistency check, formula (9) (given as equation images PCTCN2019076247-appb-000020 and PCTCN2019076247-appb-000021), derives a correctly labeled occlusion map from the correctly labeled disparity map.
Here, the subscript i denotes the i-th row of the image and the subscript j denotes the j-th column. D*L/R denotes the disparity maps of the left and right images, and D*wR is the disparity map of the left image reconstructed from the right image. For non-occluded regions, the left disparity map and the disparity map of the left image reconstructed from the right image are consistent; the consistency-check threshold is set to 1. The occlusion map is 0 in occluded regions and 1 in non-occluded regions. This embodiment then uses formula (10) (equation image PCTCN2019076247-appb-000024) to compute the loss for training the binocular matching network on synthetic data. At this stage, the loss function L_stereo consists of two parts: the disparity map estimation error L_disp and the occlusion map estimation error L_occ (their per-layer terms are given as images PCTCN2019076247-appb-000022 and PCTCN2019076247-appb-000023, where m denotes the m-th layer). The multi-scale intermediate layers of the binocular disparity network also produce disparity and occlusion predictions, and a loss weight w_m is applied to the prediction of each scale.
To train the disparity map, an L1 loss function is used so that outliers have limited influence and the training process is more robust. To train the occlusion map, formula (11) (equation image PCTCN2019076247-appb-000025) expresses the occlusion map estimation error L_occ as a binary cross-entropy loss, treating occlusion prediction as a classification task. Here, N is the total number of pixels in the image, the correctly labeled (ground-truth) occlusion map is given as image PCTCN2019076247-appb-000026, and the occlusion map output by the trained binocular matching network is given as image PCTCN2019076247-appb-000027. A minimal sketch of this stage is given below.
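The following is a minimal sketch of the stage-one training loss, assuming a PyTorch-style setup. The tensor shapes, the interpolation of the ground-truth maps to each scale, the `scale_weights` (playing the role of w_m), and the assumption that the occlusion predictions are already sigmoid probabilities are illustrative choices, not details taken from the equation images.

```python
# Hypothetical sketch of formulas (9)-(11): consistency-check occlusion labels,
# multi-scale L1 disparity loss, and binary cross-entropy occlusion loss.
import torch
import torch.nn.functional as F

def occlusion_from_disparity(disp_left, disp_left_warped_from_right, threshold=1.0):
    """Left-right consistency check (formula (9)): 1 = non-occluded, 0 = occluded."""
    return (torch.abs(disp_left - disp_left_warped_from_right) < threshold).float()

def stereo_synthetic_loss(pred_disps, pred_occs, gt_disp, gt_occ, scale_weights):
    """Multi-scale loss (formula (10)): per-scale L1 disparity term + BCE occlusion term."""
    total = 0.0
    for w_m, d_m, o_m in zip(scale_weights, pred_disps, pred_occs):
        # resize ground truth to the prediction's resolution
        # (value rescaling of disparity across scales omitted for brevity)
        gt_d = F.interpolate(gt_disp, size=d_m.shape[-2:], mode="nearest")
        gt_o = F.interpolate(gt_occ, size=o_m.shape[-2:], mode="nearest")
        l_disp = F.l1_loss(d_m, gt_d)              # L1 keeps outliers from dominating
        l_occ = F.binary_cross_entropy(o_m, gt_o)  # formula (11); o_m assumed in (0, 1)
        total = total + w_m * (l_disp + l_occ)
    return total
```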
The second step is to train, on real data, the trained binocular matching network obtained in the first step, using either a supervised or an unsupervised fine-tuning method. The embodiment of the present application fine-tunes the trained binocular matching network in two ways. The supervised fine-tuning method uses only the multi-scale L1 regression loss function L_stereo-supft, i.e. the disparity map estimation error L_disp, to correct the earlier pixel-matching prediction errors; see formula (12) (equation image PCTCN2019076247-appb-000028). The results show that even with a small amount of supervised data, for example 100 pictures, the binocular matching network can adapt from the synthetic modality to the real modality.
The unsupervised fine-tuning method is as follows. For unsupervised network tuning, the disparity maps obtained with prior-art unsupervised fine-tuning methods are blurry and perform poorly, as shown in picture 21 of FIG. 2B. This is caused by the limitations of the unsupervised loss and the ambiguity of matching pixels using RGB values alone. The embodiment of the present application therefore introduces additional regularization-term constraints to improve performance. Using real data, the corresponding occlusion map and disparity map are obtained from the trained binocular matching network that has not yet been fine-tuned, and are denoted by the symbols given as images PCTCN2019076247-appb-000029 and PCTCN2019076247-appb-000030, respectively. These two quantities are used to help regularize the training process. Further, the unsupervised fine-tuning loss function proposed in the embodiment of the present application, i.e. the loss function L_stereo-unsupft, is obtained as described in the foregoing embodiments; a rough sketch is given below.
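As a rough illustration of formula (14), the sketch below combines an occlusion-masked photometric reconstruction term, a term that keeps the fine-tuned disparity close to the reference (pre-fine-tuning) prediction, and a gradient-consistency term. The concrete forms of L_photo, L_abs and L_rel appear in the patent only as equation images, so the implementations here are assumptions consistent with the surrounding description (for instance, the actual terms may also weight by the occlusion map); the helper `warp_right_to_left` and all names are illustrative.

```python
# Hypothetical sketch of L_stereo_unsupft = L_photo + gamma1 * L_abs + gamma2 * L_rel.
import torch
import torch.nn.functional as F

def warp_right_to_left(img_right, disp_left):
    """Reconstruct the left image by sampling the right image with the left disparity."""
    b, _, h, w = img_right.shape
    xs = torch.linspace(-1, 1, w, device=img_right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=img_right.device).view(1, h, 1).expand(b, h, w)
    # shift x coordinates left by the disparity, normalized to [-1, 1] coordinates
    xs = xs - 2.0 * disp_left.squeeze(1) / max(w - 1, 1)
    grid = torch.stack((xs, ys), dim=-1)
    return F.grid_sample(img_right, grid, align_corners=True)

def unsup_finetune_loss(img_l, img_r, disp_pred, disp_ref, occ_ref, gamma1, gamma2):
    recon_l = warp_right_to_left(img_r, disp_pred)
    l_photo = (occ_ref * (img_l - recon_l).abs()).mean()      # occlusion-masked reconstruction
    l_abs = (disp_pred - disp_ref).abs().mean()               # stay close to reference disparity
    grad = lambda d: (d[..., :, 1:] - d[..., :, :-1]).abs()   # simple horizontal gradient
    l_rel = (grad(disp_pred) - grad(disp_ref)).abs().mean()   # gradient consistency
    return l_photo + gamma1 * l_abs + gamma2 * l_rel
```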
The third step is to train the monocular depth estimation network. So far, the binocular matching network has been trained across modalities with a large amount of synthetic data and fine-tuned with real data. To train the final monocular depth estimation network, the embodiment of the present application uses the disparity maps predicted by the trained binocular matching network to provide the training data. The monocular depth estimation loss L_mono is given by formula (13) (equation image PCTCN2019076247-appb-000031). Here, N is the total number of pixels; the disparity map output by the monocular depth estimation network is given as image PCTCN2019076247-appb-000032, and the disparity map output by the trained binocular matching network (or, if the trained binocular matching network is fine-tuned, the disparity map output by the fine-tuned network) is given as image PCTCN2019076247-appb-000033. It should be pointed out that formulas (9) to (13) are described using, as an example, the case where the monocular depth estimation network takes the left image of the real data as its training sample. A minimal sketch of this supervision is given after the experiment description below.
Experiments: because the monocular depth estimation network is sensitive to viewpoint changes, no cropping or scaling is applied to the training data. Both the input of the monocular depth estimation network and the disparity maps used to supervise it come from the trained binocular matching network. FIG. 2C is a schematic diagram of visualized depth estimation results according to an embodiment of the present application. FIG. 2C shows the depth maps obtained for three different street-scene pictures using prior-art methods and the monocular depth estimation method of the embodiment of the present application. The first row is the input to the monocular depth estimation network, i.e. three different street-scene pictures; the second row is depth data obtained by interpolating sparse lidar depth maps with a nearest-neighbor algorithm; the third to fifth rows are the depth maps obtained for the three input pictures by three different prior-art monocular depth estimation methods. The results of the present application are shown in the last three rows. Directly supervising the monocular depth estimation network with the binocular matching network trained on synthetic data in the first step yields the depth maps for the three input pictures labeled 21, 22 and 23. Fine-tuning the trained binocular matching network with the unsupervised loss function proposed in the embodiment of the present application and using the disparity maps output by the fine-tuned network as training data for the monocular depth estimation network yields the depth maps labeled 24, 25 and 26. Performing supervised fine-tuning of the trained binocular matching network and using the disparity maps output by the fine-tuned network as training data yields the depth maps labeled 27, 28 and 29. As can be seen from pictures 21 through 29, the model obtained with the monocular depth estimation method of the embodiments of the present application captures more detailed scene structure.
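A minimal sketch of the final stage follows, assuming the stereo network's disparity map is used directly as a dense pseudo-label with an L1 penalty; the exact form of L_mono (formula (13)) is given only as an equation image, so this is illustrative, and `mono_net`, `stereo_net` and the training loop are assumed names.

```python
# Hypothetical sketch of stage 3: supervise the monocular network with the
# (optionally fine-tuned) stereo network's disparity predictions.
import torch
import torch.nn.functional as F

def mono_depth_loss(mono_disp, stereo_disp):
    """L1 supervision of the monocular prediction by the stereo pseudo-label."""
    return F.l1_loss(mono_disp, stereo_disp.detach())

def train_step(mono_net, stereo_net, left, right, optimizer):
    with torch.no_grad():
        stereo_disp = stereo_net(left, right)   # pseudo-label disparity from the stereo network
    mono_disp = mono_net(left)                  # the monocular network sees only the left image
    loss = mono_depth_loss(mono_disp, stereo_disp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```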
An embodiment of the present application provides a monocular depth estimation apparatus. FIG. 3 is a schematic structural diagram of a monocular depth estimation apparatus according to an embodiment of the present application. As shown in FIG. 3, the apparatus 300 includes an acquisition module 301, an execution module 302 and an output module 303, wherein:
The acquisition module 301 is configured to acquire an image to be processed.
The execution module 302 is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model.
The output module 303 is configured to output the analysis result of the image to be processed.
In some embodiments, the apparatus further includes: a third training module configured to supervise the monocular depth estimation network model with the disparity map output by the first binocular matching neural network model, thereby training the monocular depth estimation network model.
In some embodiments, the apparatus further includes: a first training module configured to train a second binocular matching neural network model according to acquired synthetic sample data; and a second training module configured to adjust parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
In some embodiments, the apparatus further includes: a first acquisition module configured to acquire depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture includes a synthetic left image and a synthetic right image.
In some embodiments, the first training module includes: a first training unit configured to train the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
In some embodiments, the apparatus further includes: a construction module configured to construct a virtual 3D scene through a rendering engine; a mapping module configured to map the 3D scene into binocular pictures through two virtual cameras; a second acquisition module configured to acquire depth data of the synthetic binocular pictures according to the position and orientation used when constructing the virtual 3D scene and the lens focal length of the virtual cameras; and a third acquisition module configured to label the binocular pictures according to the depth data to obtain the synthetic binocular pictures.
In some embodiments, the second training module includes: a second training unit configured to perform supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the second training unit in the second training module is further configured to perform unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the second training unit in the second training module includes: a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
In some embodiments, the apparatus further includes: a first determining module configured to determine the loss function by formula (14): L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, where L_stereo-unsupft denotes the loss function, L_photo denotes the reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
In some embodiments, the apparatus further includes: a second determining module configured to determine the reconstruction error using formula (15) or formula (16) (given as equation images PCTCN2019076247-appb-000034 and PCTCN2019076247-appb-000035). In these formulas, N denotes the number of pixels in the picture, and the symbols given as images PCTCN2019076247-appb-000036 to PCTCN2019076247-appb-000042 denote, respectively: the pixel values of the occlusion map output by the trained second binocular matching network model; the pixel values of the left image and of the right image in the real binocular data without depth labels; the pixel values of the pictures synthesized by sampling the right image and by sampling the left image; and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels. The subscript ij denotes the pixel coordinates of a pixel.
In some embodiments, the apparatus further includes: a third determining module configured to determine, using formula (17) or formula (18) (given as equation images PCTCN2019076247-appb-000043 and PCTCN2019076247-appb-000044), that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model. In these formulas, the symbols given as images PCTCN2019076247-appb-000045 and PCTCN2019076247-appb-000046 denote the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image of the sample data, respectively, and γ3 denotes an intensity coefficient.
In some embodiments, the apparatus further includes: a fourth determining module configured to determine, using formula (19) or formula (20) (given as equation image PCTCN2019076247-appb-000047), that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model. In these formulas, the symbols given as images PCTCN2019076247-appb-000048 to PCTCN2019076247-appb-000051 denote, respectively: the gradient of the disparity map output by the first binocular matching network model for the left image of the real binocular data without depth labels; the gradient of the disparity map output by the first binocular matching network model for the right image; the gradient of the disparity map output by the trained second binocular matching network model for the left image of the sample data; and the gradient of the disparity map output by the trained second binocular matching network model for the right image of the sample data.
In some embodiments, the depth-labeled real binocular data includes a left image and a right image. Correspondingly, the third training module includes: a first acquisition unit configured to acquire the left image or the right image of the depth-labeled real binocular data as a training sample; and a first training unit configured to train the monocular depth estimation network model according to the left image or the right image of the depth-labeled real binocular data.
In some embodiments, the real binocular data without depth labels includes a left image and a right image. Correspondingly, the third training module further includes: a second acquisition unit configured to input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; a first determining unit configured to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and a second training unit configured to take the left image or the right image of the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
In some embodiments, the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model. Correspondingly, the apparatus further includes: a fifth determining module configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and a first output module configured to output the depth map corresponding to the disparity map. A minimal sketch of this conversion is given below.
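As referenced above for the fifth determining module, converting a predicted disparity map to a depth map uses the camera's lens baseline distance and lens focal length. The sketch below assumes the standard rectified-stereo relation depth = focal_length × baseline / disparity; the function and parameter names are illustrative.

```python
# Hypothetical sketch of the disparity-to-depth conversion used to output a depth map.
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m, eps=1e-6):
    """Depth map (meters) from a disparity map (pixels): depth = f * b / d."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    return focal_length_px * baseline_m / np.maximum(disparity_px, eps)
```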
It should be noted here that the description of the above apparatus embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, please refer to the description of the method embodiments of the present application. In the embodiments of the present application, if the above monocular depth estimation method is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computing device to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a ROM (Read Only Memory), a magnetic disk or an optical disc. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software. Correspondingly, an embodiment of the present application provides a monocular depth estimation device, which includes a memory and a processor. The memory stores a computer program that can be run on the processor, and the processor implements the steps of the monocular depth estimation method when executing the program. Correspondingly, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the monocular depth estimation method are implemented. It should be noted here that the description of the above storage medium and device embodiments is similar to the description of the above method embodiments, and has beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present application, please refer to the description of the method embodiments of the present application.
It should be noted that FIG. 4 is a schematic diagram of a hardware entity of the monocular depth estimation device according to an embodiment of the present application. As shown in FIG. 4, the hardware entity of the monocular depth estimation device 400 includes: a memory 401, a communication bus 402 and a processor 403. The memory 401 is configured to store instructions and applications executable by the processor 403, and may also buffer data to be processed or already processed by the processor 403 and by the modules of the monocular depth estimation device 400; it may be implemented by FLASH (flash memory) or RAM (Random Access Memory). The communication bus 402 enables the monocular depth estimation device 400 to communicate with other terminals or servers through a network, and also provides the connection and communication between the processor 403 and the memory 401. The processor 403 generally controls the overall operation of the monocular depth estimation device 400.
It should be noted that, in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to execute the methods described in the embodiments of the present application.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (apparatuses) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (30)

1. A monocular depth estimation method, wherein the method comprises: acquiring an image to be processed; inputting the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model; and outputting the analysis result of the image to be processed.
2. The method according to claim 1, wherein the training process of the first binocular matching neural network model comprises: training a second binocular matching neural network model according to acquired synthetic sample data; and adjusting parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
3. The method according to claim 2, wherein the method further comprises: acquiring depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture comprises a synthetic left image and a synthetic right image.
4. The method according to claim 3, wherein training the second binocular matching neural network model according to the acquired synthetic sample data comprises: training the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
5. The method according to claim 2, wherein adjusting the parameters of the trained second binocular matching neural network model according to the acquired real sample data to obtain the first binocular matching neural network model comprises: performing supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
6. The method according to claim 2, wherein adjusting the parameters of the trained second binocular matching neural network model according to the acquired real sample data to obtain the first binocular matching neural network model further comprises: performing unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
7. The method according to claim 6, wherein performing unsupervised training of the trained second binocular matching neural network model according to the acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model, comprises: using a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
8. The method according to claim 7, wherein the method further comprises: determining the loss function by the formula L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, wherein L_stereo-unsupft denotes the loss function, L_photo denotes a reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
9. The method according to claim 8, wherein the method further comprises: determining the reconstruction error using the formula given as equation image PCTCN2019076247-appb-100001 or the formula given as equation image PCTCN2019076247-appb-100002, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100003 to PCTCN2019076247-appb-100009 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the left image and of the right image in the real binocular data without depth labels, the pixel values of the pictures synthesized by sampling the right image and by sampling the left image, and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels; and ij denotes the pixel coordinates of a pixel.
10. The method according to claim 8, wherein the method further comprises: determining, using the formula given as equation image PCTCN2019076247-appb-100010 or the formula given as equation image PCTCN2019076247-appb-100011, that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100012 to PCTCN2019076247-appb-100016 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image; ij denotes the pixel coordinates of a pixel; and γ3 denotes an intensity coefficient.
11. The method according to claim 8, wherein the method further comprises: determining, using the formula given as equation image PCTCN2019076247-appb-100017 or the formula given as equation image PCTCN2019076247-appb-100018, that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100019 to PCTCN2019076247-appb-100022 denote, respectively, the gradient of the disparity map output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the gradient of the disparity map output by the trained second binocular matching network model for the left image and for the right image; and ij denotes the pixel coordinates of a pixel.
12. The method according to claim 5, wherein the depth-labeled real binocular data includes a left image and a right image, and correspondingly, the training process of the monocular depth estimation network model comprises: acquiring the left image or the right image of the depth-labeled real binocular data as a training sample; and training the monocular depth estimation network model according to the left image or the right image of the depth-labeled real binocular data.
13. The method according to any one of claims 6 to 11, wherein the real binocular data without depth labels includes a left image and a right image, and correspondingly, the training process of the monocular depth estimation network model comprises: inputting the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; determining the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and taking the left image or the right image of the real binocular data without depth labels as sample data and supervising the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
14. The method according to claim 12 or 13, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly, the method further comprises: determining the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and outputting the depth map corresponding to the disparity map.
15. A monocular depth estimation apparatus, wherein the apparatus comprises an acquisition module, an execution module and an output module, wherein: the acquisition module is configured to acquire an image to be processed; the execution module is configured to input the image to be processed into a trained monocular depth estimation network model to obtain an analysis result of the image to be processed, wherein the monocular depth estimation network model is supervised and trained using a disparity map output by a first binocular matching neural network model; and the output module is configured to output the analysis result of the image to be processed.
16. The apparatus according to claim 15, wherein the apparatus further comprises: a first training module configured to train a second binocular matching neural network model according to acquired synthetic sample data; and a second training module configured to adjust parameters of the trained second binocular matching neural network model according to acquired real sample data to obtain the first binocular matching neural network model.
17. The apparatus according to claim 16, wherein the apparatus further comprises: a first acquisition module configured to acquire depth-labeled synthetic binocular pictures as the synthetic sample data, wherein each synthetic binocular picture comprises a synthetic left image and a synthetic right image.
18. The apparatus according to claim 17, wherein the first training module comprises: a first training unit configured to train the second binocular matching neural network model according to the synthetic binocular pictures to obtain a trained second binocular matching neural network model, wherein the output of the trained second binocular matching neural network model is a disparity map and an occlusion map, the disparity map describes the disparity distance, in pixels, between each pixel of the left image and the corresponding pixel of the right image, and the occlusion map describes whether the pixel of the right image corresponding to each pixel of the left image is occluded by an object.
19. The apparatus according to claim 16, wherein the second training module comprises: a second training unit configured to perform supervised training of the trained second binocular matching neural network model according to acquired depth-labeled real binocular data, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
20. The apparatus according to claim 16, wherein the second training unit is further configured to perform unsupervised training of the trained second binocular matching neural network model according to acquired real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
21. The apparatus according to claim 20, wherein the second training unit comprises: a second training component configured to use a loss function to perform unsupervised training of the trained second binocular matching neural network model according to the real binocular data without depth labels, so as to adjust the weights of the trained second binocular matching neural network model and obtain the first binocular matching neural network model.
22. The apparatus according to claim 21, wherein the apparatus further comprises: a first determining module configured to determine the loss function by the formula L_stereo-unsupft = L_photo + γ1·L_abs + γ2·L_rel, wherein L_stereo-unsupft denotes the loss function, L_photo denotes a reconstruction error, L_abs denotes that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, L_rel denotes a constraint that the output gradient of the first binocular matching network model is consistent with the output gradient of the trained second binocular matching network model, and γ1 and γ2 denote intensity coefficients.
23. The apparatus according to claim 22, wherein the apparatus further comprises: a second determining module configured to determine the reconstruction error using the formula given as equation image PCTCN2019076247-appb-100023 or the formula given as equation image PCTCN2019076247-appb-100024, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100025 to PCTCN2019076247-appb-100031 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the left image and of the right image in the real binocular data without depth labels, the pixel values of the pictures synthesized by sampling the right image and by sampling the left image, and the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels; and ij denotes the pixel coordinates of a pixel.
24. The apparatus according to claim 22, wherein the apparatus further comprises: a third determining module configured to determine, using the formula given as equation image PCTCN2019076247-appb-100032 or the formula given as equation image PCTCN2019076247-appb-100033, that the disparity map output by the first binocular matching network model deviates only slightly from the disparity map output by the trained second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100034 to PCTCN2019076247-appb-100038 denote, respectively, the pixel values of the occlusion map output by the trained second binocular matching network model, the pixel values of the disparity maps output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the pixel values of the disparity maps output by the trained second binocular matching network model for the left image and for the right image; ij denotes the pixel coordinates of a pixel; and γ3 denotes an intensity coefficient.
25. The apparatus according to claim 22, wherein the apparatus further comprises: a fourth determining module configured to determine, using the formula given as equation image PCTCN2019076247-appb-100039 or the formula given as equation image PCTCN2019076247-appb-100040, that the output gradient of the first binocular matching network model is consistent with the output gradient of the second binocular matching network model, wherein N denotes the number of pixels in the picture; the symbols given as images PCTCN2019076247-appb-100041 to PCTCN2019076247-appb-100044 denote, respectively, the gradient of the disparity map output by the first binocular matching network model for the left image and for the right image of the real binocular data without depth labels, and the gradient of the disparity map output by the trained second binocular matching network model for the left image and for the right image; and ij denotes the pixel coordinates of a pixel.
  26. The apparatus according to claim 19, wherein the real binocular data with depth labels includes a left image and a right image, and correspondingly, the apparatus further comprises: a third training module, configured to obtain the left image or the right image in the real binocular data with depth labels as a training sample, and to train the monocular depth estimation network model according to the left image or the right image in the real binocular data with depth labels.
  27. The apparatus according to any one of claims 20 to 25, wherein the real binocular data without depth labels includes a left image and a right image, and correspondingly, the apparatus further comprises: a third training module, configured to input the real binocular data without depth labels into the first binocular matching neural network model to obtain a corresponding disparity map; to determine the depth map corresponding to the disparity map according to the corresponding disparity map, the lens baseline distance of the camera that captured the real binocular data without depth labels, and the lens focal length of that camera; and to take the left image or the right image in the real binocular data without depth labels as sample data and supervise the monocular depth estimation network model with the depth map corresponding to the disparity map, thereby training the monocular depth estimation network model.
  28. The apparatus according to claim 26 or 27, wherein the analysis result of the image to be processed includes the disparity map output by the monocular depth estimation network model, and correspondingly, the apparatus further comprises: a fifth determination module, configured to determine the depth map corresponding to the disparity map according to the disparity map output by the monocular depth estimation network model, the lens baseline distance of the camera that captured the picture input to the monocular depth estimation network model, and the lens focal length of that camera; and a first output module, configured to output the depth map corresponding to the disparity map. (An illustrative sketch of this disparity-to-depth conversion follows the claims.)
  29. A monocular depth estimation device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein, when the processor executes the program, the steps in the monocular depth estimation method according to any one of claims 1 to 14 are implemented.
  30. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is executed by a processor, the steps in the monocular depth estimation method according to any one of claims 1 to 14 are implemented.
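
    The consistency terms recited in claims 24 and 25 are published only as formula image references (Figure PCTCN2019076247-appb-100032 through -100044), so their exact definitions cannot be reproduced here. Purely as an illustration of how an occlusion-weighted disparity consistency term and a gradient consistency term of this kind could look, the following Python/NumPy sketch is offered; the function name, the choice of an L1 penalty, the horizontal-gradient approximation, and the gamma_3 weighting are assumptions of this sketch, not the patented formulas.

    import numpy as np

    def consistency_terms(occlusion, disp_first, disp_second, gamma_3=1.0):
        # Hypothetical sketch only: the exact formulas of claims 24 and 25
        # appear in the publication as image references and are not
        # reproduced here.
        #
        # occlusion   : occlusion map from the trained second binocular
        #               matching network model (1 = visible, 0 = occluded), HxW
        # disp_first  : disparity map from the first binocular matching
        #               network model for the same view, HxW
        # disp_second : disparity map from the trained second binocular
        #               matching network model for the same view, HxW
        # gamma_3     : intensity coefficient weighting the disparity term
        n = disp_first.size  # N: number of pixels in the picture

        # Assumed form of the claim-24 term: occlusion-weighted mean
        # absolute deviation between the two disparity maps.
        disparity_term = gamma_3 * np.sum(occlusion * np.abs(disp_first - disp_second)) / n

        # Assumed form of the claim-25 term: agreement of the horizontal
        # gradients of the two disparity maps.
        grad_first = np.diff(disp_first, axis=1)
        grad_second = np.diff(disp_second, axis=1)
        gradient_term = np.sum(np.abs(grad_first - grad_second)) / grad_first.size

        return disparity_term, gradient_term

    In the training scheme of claims 20 to 25, terms of this kind would keep the first binocular matching network model close, on unlabeled binocular data, to the disparity output of the trained second binocular matching network model.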
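
    Claims 27 and 28 convert a disparity map into a depth map using the lens baseline distance and the lens focal length of the camera. For a rectified stereo pair this is the standard relation depth = focal_length x baseline / disparity; the sketch below illustrates it, with the function name and the small-epsilon guard against zero disparity being assumptions of this sketch rather than details taken from the publication.

    import numpy as np

    def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
        # Standard rectified-stereo relation: depth = focal_length * baseline / disparity.
        # disparity    : disparity map in pixels, HxW
        # baseline     : lens baseline distance of the binocular camera (e.g. metres)
        # focal_length : lens focal length expressed in pixels
        # eps          : guard against division by zero (assumption of this sketch)
        return (focal_length * baseline) / np.maximum(disparity, eps)

    The resulting depth map is then usable, per claim 27, as the supervision signal for training the monocular depth estimation network model, or, per claim 28, as the depth output derived from the disparity map predicted by that model.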
PCT/CN2019/076247 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium WO2019223382A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
SG11202008787UA SG11202008787UA (en) 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium
JP2020546428A JP7106665B2 (en) 2018-05-22 2019-02-27 MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810496541.6A CN108961327B (en) 2018-05-22 2018-05-22 Monocular depth estimation method and device, equipment and storage medium thereof
CN201810496541.6 2018-05-22

Publications (1)

Publication Number Publication Date
WO2019223382A1 true WO2019223382A1 (en) 2019-11-28

Family

ID=64499439

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076247 WO2019223382A1 (en) 2018-05-22 2019-02-27 Method for estimating monocular depth, apparatus and device therefor, and storage medium

Country Status (4)

Country Link
JP (1) JP7106665B2 (en)
CN (1) CN108961327B (en)
SG (1) SG11202008787UA (en)
WO (1) WO2019223382A1 (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105432A (en) * 2019-12-24 2020-05-05 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111310859A (en) * 2020-03-26 2020-06-19 上海景和国际展览有限公司 Rapid artificial intelligence data training system used in multimedia display
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111354030A (en) * 2020-02-29 2020-06-30 同济大学 Method for generating unsupervised monocular image depth map embedded into SENET unit
CN111428859A (en) * 2020-03-05 2020-07-17 北京三快在线科技有限公司 Depth estimation network training method and device for automatic driving scene and autonomous vehicle
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111784757A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111833390A (en) * 2020-06-23 2020-10-27 杭州电子科技大学 Light field depth estimation method based on unsupervised depth learning
CN111932584A (en) * 2020-07-13 2020-11-13 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112465888A (en) * 2020-11-16 2021-03-09 电子科技大学 Monocular vision-based unsupervised depth estimation method
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN112712017A (en) * 2020-12-29 2021-04-27 上海智蕙林医疗科技有限公司 Robot, monocular depth estimation method and system and storage medium
CN112819875A (en) * 2021-02-03 2021-05-18 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN112862877A (en) * 2021-04-09 2021-05-28 北京百度网讯科技有限公司 Method and apparatus for training image processing network and image processing
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium
CN113014899A (en) * 2019-12-20 2021-06-22 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113160298A (en) * 2021-03-31 2021-07-23 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
CN114051128A (en) * 2021-11-11 2022-02-15 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114132594A (en) * 2020-09-03 2022-03-04 细美事有限公司 Article storage device and control method of article storage device
CN116703813A (en) * 2022-12-27 2023-09-05 荣耀终端有限公司 Image processing method and apparatus
CN117156113A (en) * 2023-10-30 2023-12-01 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108961327B (en) * 2018-05-22 2021-03-30 深圳市商汤科技有限公司 Monocular depth estimation method and device, equipment and storage medium thereof
CN111354032B (en) * 2018-12-24 2023-10-20 杭州海康威视数字技术股份有限公司 Method and device for generating disparity map
CN111444744A (en) * 2018-12-29 2020-07-24 北京市商汤科技开发有限公司 Living body detection method, living body detection device, and storage medium
CN109741388B (en) * 2019-01-29 2020-02-28 北京字节跳动网络技术有限公司 Method and apparatus for generating a binocular depth estimation model
CN111508010B (en) * 2019-01-31 2023-08-08 北京地平线机器人技术研发有限公司 Method and device for estimating depth of two-dimensional image and electronic equipment
CN109887019B (en) * 2019-02-19 2022-05-24 北京市商汤科技开发有限公司 Binocular matching method and device, equipment and storage medium
CN111723926B (en) * 2019-03-22 2023-09-12 北京地平线机器人技术研发有限公司 Training method and training device for neural network model for determining image parallax
CN110009674B (en) * 2019-04-01 2021-04-13 厦门大学 Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN110163246B (en) * 2019-04-08 2021-03-30 杭州电子科技大学 Monocular light field image unsupervised depth estimation method based on convolutional neural network
CN110148179A (en) * 2019-04-19 2019-08-20 北京地平线机器人技术研发有限公司 A kind of training is used to estimate the neural net model method, device and medium of image parallactic figure
CN113808062A (en) * 2019-04-28 2021-12-17 深圳市商汤科技有限公司 Image processing method and device
CN110335245A (en) * 2019-05-21 2019-10-15 青岛科技大学 Cage netting damage monitoring method and system based on monocular space and time continuous image
CN112149458A (en) * 2019-06-27 2020-12-29 商汤集团有限公司 Obstacle detection method, intelligent driving control method, device, medium, and apparatus
CN110310317A (en) * 2019-06-28 2019-10-08 西北工业大学 A method of the monocular vision scene depth estimation based on deep learning
CN110782412B (en) * 2019-10-28 2022-01-28 深圳市商汤科技有限公司 Image processing method and device, processor, electronic device and storage medium
CN111105451B (en) * 2019-10-31 2022-08-05 武汉大学 Driving scene binocular depth estimation method for overcoming occlusion effect
CN111126478B (en) * 2019-12-19 2023-07-07 北京迈格威科技有限公司 Convolutional neural network training method, device and electronic system
CN111325786B (en) * 2020-02-18 2022-06-28 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112150531B (en) * 2020-09-29 2022-12-09 西北工业大学 Robust self-supervised learning single-frame image depth estimation method
CN113705432A (en) * 2021-08-26 2021-11-26 京东鲲鹏(江苏)科技有限公司 Model training and three-dimensional target detection method, device, equipment and medium
CN115294375B (en) * 2022-10-10 2022-12-13 南昌虚拟现实研究院股份有限公司 Speckle depth estimation method and system, electronic device and storage medium
CN115909446B (en) * 2022-11-14 2023-07-18 华南理工大学 Binocular face living body discriminating method, device and storage medium
CN116165646B (en) * 2023-02-22 2023-08-11 哈尔滨工业大学 False alarm controllable radar target detection method based on segmentation network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279022A1 (en) * 2014-03-31 2015-10-01 Empire Technology Development Llc Visualization of Spatial and Other Relationships
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327A (en) * 2018-05-22 2018-12-07 深圳市商汤科技有限公司 A kind of monocular depth estimation method and its device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102903096B (en) * 2012-07-04 2015-06-17 北京航空航天大学 Monocular video based object depth extraction method
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
GB2553782B (en) * 2016-09-12 2021-10-20 Niantic Inc Predicting depth from image data using a statistical model
EP4131172A1 (en) * 2016-09-12 2023-02-08 Dassault Systèmes Deep convolutional neural network for 3d reconstruction of a real object
CN107909150B (en) * 2017-11-29 2020-08-18 华中科技大学 Method and system for on-line training CNN based on block-by-block random gradient descent method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150279022A1 (en) * 2014-03-31 2015-10-01 Empire Technology Development Llc Visualization of Spatial and Other Relationships
CN106600650A (en) * 2016-12-12 2017-04-26 杭州蓝芯科技有限公司 Binocular visual sense depth information obtaining method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN108961327A (en) * 2018-05-22 2018-12-07 深圳市商汤科技有限公司 A kind of monocular depth estimation method and its device, equipment and storage medium

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113014899A (en) * 2019-12-20 2021-06-22 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN113014899B (en) * 2019-12-20 2023-02-03 杭州海康威视数字技术股份有限公司 Binocular image parallax determination method, device and system
CN111105432B (en) * 2019-12-24 2023-04-07 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111105432A (en) * 2019-12-24 2020-05-05 中国科学技术大学 Unsupervised end-to-end driving environment perception method based on deep learning
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111340864B (en) * 2020-02-26 2023-12-12 浙江大华技术股份有限公司 Three-dimensional scene fusion method and device based on monocular estimation
CN111445476B (en) * 2020-02-27 2023-05-26 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111445476A (en) * 2020-02-27 2020-07-24 上海交通大学 Monocular depth estimation method based on multi-mode unsupervised image content decoupling
CN111354030B (en) * 2020-02-29 2023-08-04 同济大学 Method for generating unsupervised monocular image depth map embedded into SENet unit
CN111354030A (en) * 2020-02-29 2020-06-30 同济大学 Method for generating unsupervised monocular image depth map embedded into SENET unit
CN111428859A (en) * 2020-03-05 2020-07-17 北京三快在线科技有限公司 Depth estimation network training method and device for automatic driving scene and autonomous vehicle
CN111310859A (en) * 2020-03-26 2020-06-19 上海景和国际展览有限公司 Rapid artificial intelligence data training system used in multimedia display
CN111833390A (en) * 2020-06-23 2020-10-27 杭州电子科技大学 Light field depth estimation method based on unsupervised depth learning
CN111784757A (en) * 2020-06-30 2020-10-16 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111784757B (en) * 2020-06-30 2024-01-23 北京百度网讯科技有限公司 Training method of depth estimation model, depth estimation method, device and equipment
CN111932584A (en) * 2020-07-13 2020-11-13 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN111932584B (en) * 2020-07-13 2023-11-07 浙江大华技术股份有限公司 Method and device for determining moving object in image
CN114132594A (en) * 2020-09-03 2022-03-04 细美事有限公司 Article storage device and control method of article storage device
CN112465888A (en) * 2020-11-16 2021-03-09 电子科技大学 Monocular vision-based unsupervised depth estimation method
CN112446328A (en) * 2020-11-27 2021-03-05 汇纳科技股份有限公司 Monocular depth estimation system, method, device and computer-readable storage medium
CN112446328B (en) * 2020-11-27 2023-11-17 汇纳科技股份有限公司 Monocular depth estimation system, method, apparatus, and computer-readable storage medium
CN112561947A (en) * 2020-12-10 2021-03-26 中国科学院深圳先进技术研究院 Image self-adaptive motion estimation method and application
CN112712017A (en) * 2020-12-29 2021-04-27 上海智蕙林医疗科技有限公司 Robot, monocular depth estimation method and system and storage medium
CN112819875B (en) * 2021-02-03 2023-12-19 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN112819875A (en) * 2021-02-03 2021-05-18 苏州挚途科技有限公司 Monocular depth estimation method and device and electronic equipment
CN113160298B (en) * 2021-03-31 2024-03-08 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN113160298A (en) * 2021-03-31 2021-07-23 奥比中光科技集团股份有限公司 Depth truth value acquisition method, device and system and depth camera
CN112862877B (en) * 2021-04-09 2024-05-17 北京百度网讯科技有限公司 Method and apparatus for training an image processing network and image processing
CN112862877A (en) * 2021-04-09 2021-05-28 北京百度网讯科技有限公司 Method and apparatus for training image processing network and image processing
CN112991416A (en) * 2021-04-13 2021-06-18 Oppo广东移动通信有限公司 Depth estimation method, model training method, device, equipment and storage medium
CN113140011A (en) * 2021-05-18 2021-07-20 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related assembly
CN113140011B (en) * 2021-05-18 2022-09-06 烟台艾睿光电科技有限公司 Infrared thermal imaging monocular vision distance measurement method and related components
CN113570658A (en) * 2021-06-10 2021-10-29 西安电子科技大学 Monocular video depth estimation method based on depth convolutional network
CN114119698A (en) * 2021-06-18 2022-03-01 湖南大学 Unsupervised monocular depth estimation method based on attention mechanism
CN114051128B (en) * 2021-11-11 2023-09-05 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN114051128A (en) * 2021-11-11 2022-02-15 北京奇艺世纪科技有限公司 Method, device, equipment and medium for converting 2D video into 3D video
CN116703813A (en) * 2022-12-27 2023-09-05 荣耀终端有限公司 Image processing method and apparatus
CN116703813B (en) * 2022-12-27 2024-04-26 荣耀终端有限公司 Image processing method and apparatus
CN117156113A (en) * 2023-10-30 2023-12-01 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device
CN117156113B (en) * 2023-10-30 2024-02-23 南昌虚拟现实研究院股份有限公司 Deep learning speckle camera-based image correction method and device

Also Published As

Publication number Publication date
JP2021515939A (en) 2021-06-24
CN108961327B (en) 2021-03-30
CN108961327A (en) 2018-12-07
JP7106665B2 (en) 2022-07-26
SG11202008787UA (en) 2020-10-29

Similar Documents

Publication Publication Date Title
WO2019223382A1 (en) Method for estimating monocular depth, apparatus and device therefor, and storage medium
Ming et al. Deep learning for monocular depth estimation: A review
Shivakumar et al. Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion
Hambarde et al. UW-GAN: Single-image depth estimation and image enhancement for underwater images
US11443445B2 (en) Method and apparatus for depth estimation of monocular image, and storage medium
US12031842B2 (en) Method and apparatus for binocular ranging
Hu et al. Deep depth completion from extremely sparse data: A survey
CN109300151B (en) Image processing method and device and electronic equipment
CN113362444A (en) Point cloud data generation method and device, electronic equipment and storage medium
EP3872760A2 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
Zhang et al. Exploring event-driven dynamic context for accident scene segmentation
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN114677422A (en) Depth information generation method, image blurring method and video blurring method
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
Arampatzakis et al. Monocular depth estimation: A thorough review
Zhao et al. Distance transform pooling neural network for lidar depth completion
Haji-Esmaeili et al. Large-scale monocular depth estimation in the wild
KR20240012426A (en) Unconstrained image stabilization
CN115375742A (en) Method and system for generating depth image
Wang et al. Surface and underwater human pose recognition based on temporal 3D point cloud deep learning
Yang et al. Towards generic 3d tracking in RGBD videos: Benchmark and baseline
CN113537359A (en) Training data generation method and device, computer readable medium and electronic equipment
US10896333B2 (en) Method and device for aiding the navigation of a vehicle
CN116597097B (en) Three-dimensional scene reconstruction method for autopilot, electronic device, and storage medium
Tian Effective image enhancement and fast object detection for improved UAV applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19806515

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020546428

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 19.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19806515

Country of ref document: EP

Kind code of ref document: A1