GB2605621A - Monocular depth estimation - Google Patents
- Publication number
- GB2605621A GB2605621A GB2104949.9A GB202104949A GB2605621A GB 2605621 A GB2605621 A GB 2605621A GB 202104949 A GB202104949 A GB 202104949A GB 2605621 A GB2605621 A GB 2605621A
- Authority
- GB
- United Kingdom
- Prior art keywords
- depth
- calibration
- estimation model
- image
- scene
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image includes receiving first and second training images respectively of first and second views of a scene, processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function. The loss function comprises a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, and further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
Description
MONOCULAR DEPTH ESTIMATION
Technical Field
The present invention relates to training and calibrating a depth estimation model for inferring depth information from a monocular image. The invention has particular, but not exclusive, relevance to training and calibrating a depth estimation model for an advanced driver assistance system (ADAS) or an automated driving system (ADS).
Background
Depth estimation or ranging involves estimating distances to objects within a scene or environment. Such information may be provided as input data to an advanced driver assistance system (ADAS) or an automated driving system (ADS) for an autonomous vehicle. Depth information estimated in near-real time can be used, for example, to inform decisions on collision avoidance actions such as emergency braking and pedestrian crash avoidance mitigation, as well as for generating driver alerts such as proximity warnings in the case of ADAS.

Various methods are known for depth estimation, for example stereo matching, lidar, radar, structured light methods and photometric stereo methods. Many of these methods require specialist equipment and/or perform poorly in real-world settings such as those encountered in the context of ADAS or ADS. Alongside these methods, deep learning approaches have been developed for estimating depth information from individual images captured using a standard monocular camera. In some examples, supervised learning is used to train a deep learning model using ground truth depth information determined using one of the methods described above. In other examples, depth estimation is recast as an unsupervised learning problem. In such methods, a first training image representing a first view of a scene is processed using a deep learning model to generate a candidate depth map. The candidate depth map is then used to project a second training image representing a second view of the scene to a frame of reference of the first training image, to generate a candidate reconstruction of the first training image. A photometric difference between the first training image and the candidate reconstruction of the first training image is then used as a training signal to update the deep learning model. The images may be captured simultaneously using a stereo camera rig, or at different times using a moving camera. In the latter case, the relationship between reference frames of the first and second views of the environment may not be known, in which case a pose estimation model is trained alongside the depth estimation model, using the same training signal, for predicting this relationship.
Once trained, a depth estimation model as described above is able to determine depth information based on individual images captured by a standard monocular camera. However, there is a need to improve the performance of such models when deployed in real life environments, particularly those typically encountered in the context of an ADAS or ADS, where for example reflections from a wet road surface can lead to erroneous depth information. Furthermore, a depth estimation model trained using a pose estimation model as mentioned above is not scale-aware, and therefore for practical applications the output of the depth estimation model must be calibrated for a given camera setup in order to determine a scaling for the determined depth information.
Summary
According to a first aspect of the present disclosure, there is provided a computer-implemented method of training a depth estimation model to generate a depth map from a monocular image. The method includes receiving first and second training images respectively of first and second views of a scene, and processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene. The method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
The inventors have identified that reflections typically lead to noisier and consequently less smooth regions of the candidate depth map. By penalising the second derivative in the horizontal direction, the reflection mitigation term encourages the depth estimation model to minimise variation of the depth estimates in the horizontal direction for regions in which reflections are present, leading to depth estimates which are more consistent with a surface without any reflections. The reflection mitigation term therefore discourages the depth estimation model from inferring depth from regions exhibiting reflections, and instead encourages the depth estimation model to infer the depth for these regions from surrounding areas, leading to a depth map which is closer to the ground truth.
According to a second aspect, there is provided a further computer-implemented method of training a depth estimation model to generate a depth map from a monocular image. The method includes receiving first and second training images respectively of first and second views of a scene, processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene, and processing at least one of the first training image and the second training image to determine a binary mask indicative of distant regions of the scene. The method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between corresponding portions of the first training image and the candidate reconstruction of the first training image, wherein the corresponding portions exclude the distant regions of the scene as indicated by the binary mask.
By excluding contributions to the loss function from distant regions of the scene, the loss function is more strongly affected by regions of the scene in which depth variation is present and from which depth information can be validly inferred. This results in a higher rate of convergence of the depth estimation model during training, allowing the training to be performed using less training data, the collection of which may be time- and/or resource-demanding. In addition to the improved convergence properties, the accuracy of the resultant depth estimation model is found to be improved.
In examples, determining the binary mask includes generating a difference image corresponding to a difference of the first training image and the second training image, and binarizing the generated difference image. Determining the binary mask in this way is reliable, and the implementation is straightforward and significantly less demanding of processing resources than alternative approaches such as a semantic segmentation model. The binarization may be performed using a binarization threshold depending on pixel values of the first training image and/or the second training image.
In this way, the binarization threshold can be made to account automatically for variations in properties of the training images, such as brightness of the training images.

The training images used for training the depth estimation model may be images captured by a camera. Alternatively, the method may further include generating the scene virtually using scene synthesis and then capturing a virtual image of the generated scene. Using synthetic scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data.

In examples, the method further includes processing the first and second training images using a pose estimation model to determine a candidate relative pose relating the first and second views of the scene, and the projecting of the second training image further uses the candidate relative pose. The depth estimation model and the pose estimation model may respectively be first and second neural network models sharing one or more neural network layers. Sharing one or more neural network layers, for example convolutional layers, between the depth estimation model and the pose estimation model may alleviate some of the computational burden of the training process and lead to faster convergence of the models.
According to a third aspect, there is provided a computer-implemented method of determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image of a calibration scene comprising a calibration object, receiving data indicative of dimensions of the calibration object, and processing the calibration image to determine an orientation of the calibration object. The method further includes determining a calibrated depth difference between two predetermined points on the calibration object in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, processing the calibration image using the depth estimation model to determine a calibration depth map, determining an uncalibrated depth difference between the two predetermined points on the calibration object from the calibration depth map, and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.

Once the scaling factor has been determined, depth values determined by the depth estimation model can be multiplied by the scaling factor to give calibrated depth values, resulting in a scale-aware depth estimation model.
According to a fourth aspect, there is provided a further computer-implemented method of determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane, and using the depth estimation model to process the calibration image to determine a calibration depth map comprising an array of calibration depth values, the array defining horizontal and vertical directions. The method further includes processing the calibration depth map to generate a point cloud representation of the calibration scene, generating a histogram of vertical positions of points within the point cloud representation of the calibration scene, identifying points within a modal bin of the generated histogram as ground plane points, performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane, and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
The above calibration method does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. This method is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup. Using a histogram to identify ground plane points is more straightforward to implement, more robust, and less demanding of processing resources than alternative methods of identifying a ground plane, for example using a semantic segmentation model.
In examples, the generated histogram is restricted to points lying below an optical axis of the camera, and/or to points having a restricted range of depth values. An upper limit of the restricted range may for example be given by a predetermined proportion of a median depth value of points within the point cloud representation. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.

According to a fifth aspect, there is provided a system including a mobile camera arranged to capture first and second training images respectively of first and second views of a scene as the mobile camera moves with respect to the scene, and a computing system arranged to process the first and second training images to train a depth estimation model, in accordance with any of the training methods described above.
According to a sixth aspect, there is provided a system comprising memory circuitry, processing circuitry, and a camera. The memory circuitry holds a depth estimation model and machine-readable instructions which, when executed by the processing circuitry, cause the camera to capture a calibration image of a calibration scene, and the system to determine a scaling factor for calibrating the depth estimation model using the captured calibration image, in accordance with any of the calibration methods described above. The system may further include an advanced driver assistance system (ADAS) and/or an automated driving system (ADS) for a vehicle, and may be arranged to process images captured by the camera using the depth estimation model to generate input data for said ADAS and/or ADS.
According to a seventh aspect, there is provided a computer program product holding machine-readable instructions which, when executed by a computing system, cause the computing system to perform any of the computer-implemented methods described above.
Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.
Brief Description of the Drawings
Figure 1 schematically shows an example of apparatus for training a depth estimation model;
Figure 2 schematically shows a first example of a method for training a depth estimation model;
Figure 3 schematically shows a second example of a method for training a depth estimation model;
Figure 4 illustrates depth estimation from a monocular image using a model trained with and without a modified loss function;
Figure 5 illustrates a processing of two images of respective different views of a scene to generate a binary mask;
Figure 6 shows a first example of a method for calibrating a depth estimation model;
Figure 7A shows a system arranged to perform a second example of a method for calibrating a depth estimation model;
Figure 7B illustrates a processing of an image captured using the apparatus of Figure 7A to generate a point cloud; and
Figure 7C shows a histogram of vertical positions of points within the point cloud of Figure 7B.
Detailed Description
Figure 1 shows an example of apparatus for training a depth estimation model. The apparatus includes a camera 102 mounted on a moveable platform 104 arranged to move in a direction A along a path 106. The moveable platform 104 may for example be a trolley arranged to move along a rail or a track, or may be a vehicle such as a car or van arranged to drive along a road. The camera 102 in this example is a monocular camera arranged to capture a series of monocular images as the camera 102 and platform 104 move along the path 106. The camera 102 captures images when the camera 102 passes the dashed lines at times t1, t2 and t3, which in this example are equally separated both in position and time, though in other examples the camera 102 may capture images at times or positions which are not equally separated. The three images captured at times t1, t2 and t3 represent different views of a scene, meaning that the images contain overlapping portions of the environment, viewed from different perspectives. The frequency at which the images are captured may be several images every second or every tenth or hundredth of a second. The images may for example be frames of a video captured by the camera 102. Although in this example the camera 102 is shown as being forward-facing with respect to the direction of travel A, in other examples the camera 102 could be rear-facing, sideways-facing, or oblique to the direction of travel A. The images captured at times t1, t2 and t3 are referred to as a set of training images, as their function is to be processed together to train a depth estimation model as will be explained in more detail hereafter. In general, a set of training images in the context of the present disclosure includes two or more training images. Although in this example the set of training images is captured using a single mobile camera 102, in other examples a set of training images of different views of a scene may instead be captured simultaneously using multiple cameras, for example using a stereoscopic camera rig which includes a pair of cameras mounted side-by-side.
The apparatus of Figure 1 further includes a computer system 108 arranged to process training images captured by the camera 102 to train a depth estimation model. The computer system 108 may receive the training images in real-time as the camera 102 captures the training images, or may receive the training images in a batch fashion after several sets of training images have been captured, for example via wired or wireless means or via a removable storage device.
The computer system 108 may be a general-purpose computer such as a desktop computer, laptop, server or network-based data processing system, or may be a standalone device or module, for example an integral component of a vehicle. The computer system 108 includes memory circuitry 110 and processing circuitry 112. The processing circuitry 112 may include general purpose processing units and/or specialist processing units such as a graphics processing unit (GPU) or a neural processing unit (NPU). Additionally, or alternatively, the processing circuitry 112 may include an application-specific standard product (ASSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other type of integrated circuit suitable for carrying out the methods described hereinafter.
The memory circuitry 110 stores sets of training images, along with model data defining the depth estimation model and machine-readable instructions for a training routine used to train the depth estimation model as described below. In this example, the training images are captured by the camera 102 as described above, but in other examples the computer system 108 could receive training images from elsewhere, for example from a database in the case of historic data.

The depth estimation model may be any suitable type of machine learning model arranged to take a colour or greyscale monocular image as input and generate a depth map as an output. The generated depth map comprises an array of direct depth values or inverse depth values (disparity values), with each entry of the array corresponding to an in-plane position within the monocular image. A depth value for a given point in the scene is a distance from the camera to that point in a direction parallel to the optical axis of the camera (conventionally referred to as the z direction). The array may have the same dimensions as the input image or different dimensions. In some examples, the depth estimation model is a deep neural network model with convolutional layers and having an encoder-decoder architecture, for example as described in the article "U-Net: Convolutional Networks for Biomedical Image Segmentation" by Ronneberger et al, arXiv:1505.04597, 2015. Those skilled in the art will appreciate that many different neural network architectures can be used to process an input image to generate an output array. The model data stored in the memory 110 includes trainable parameters of the depth estimation model, for example kernel values, connection weights and bias values in the case of a deep neural network model. The model data further includes data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and the like.
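Purely as an illustration of the kind of encoder-decoder model described above, a minimal depth network might be sketched as follows in PyTorch. The layer widths, the sigmoid disparity output and the assumption that the input dimensions are divisible by eight are choices made for this sketch only and are not details taken from the patent.

```python
# Minimal sketch of an encoder-decoder depth estimation network (hypothetical
# architecture). Maps a 3-channel monocular image to a 1-channel disparity
# (inverse depth) map at the input resolution.
import torch
import torch.nn as nn

class SimpleDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: progressively downsample and widen the feature maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to the input resolution (assumes H and W divisible by 8).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, image):
        # Sigmoid keeps the disparity in (0, 1); depth is obtained from its reciprocal.
        return torch.sigmoid(self.decoder(self.encoder(image)))
```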
Figure 2 shows schematically a first example of a method performed by the computer system 108 in which a set of training images comprising a first training image 202 and a second training image 204 is processed to train the depth estimation model held in the memory circuitry 110. The first and second training images 202, 204 are respectively of first and second views of a scene, captured at different times by the camera 102. The solid arrows in Figure 2 represent flow of data during a forward pass, and the dashed arrows represent backward propagation of errors following the forward pass, during a single training iteration.
The first training image 202 is processed using the depth estimation model 206 to generate a candidate depth map 208. The candidate depth map 208 is input to a projection model 210 which is arranged to project the second training image 204 into a frame of reference corresponding to that of the first training image 202, to generate a candidate reconstruction 212 of the first training image. The projection model 210 determines a projected pixel location for each pixel location in the candidate reconstruction 212, using the candidate depth map along with a relative pose relating the first and second views of the scene, and an intrinsic matrix characterising the camera 102. Pixel values are then sampled from the second training image 204, for example based on the nearest pixel to the projected pixel location or an interpolation between the nearest four pixels to the projected pixel location. The candidate reconstruction 212 is generated by applying this procedure throughout the region for which the projected pixel locations lie within the second training image 204. Pixels of the candidate reconstruction 212 for which the projected pixel location lies outside of the second training image 204 may be zeroed or otherwise padded, or alternatively the first training image 202 and the candidate reconstruction 212 may both be cropped to exclude border pixels, ensuring the resulting images can be compared as discussed hereafter.
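As an illustration of the projection step just described, the following PyTorch sketch back-projects each pixel of the first view using the candidate depth map, transforms the resulting 3D points with the relative pose, re-projects them with the camera intrinsic matrix, and bilinearly samples the second training image. The tensor shapes, the convention that the pose maps first-view camera coordinates into the second-view frame, and the use of grid_sample for the sampling are assumptions of this sketch rather than details prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def reconstruct_first_image(second_image, depth, K, T_2_from_1):
    """Warp the second training image into the frame of reference of the first.

    second_image: (B, 3, H, W) image of the second view.
    depth:        (B, 1, H, W) candidate depth map for the first view.
    K:            (B, 3, 3) camera intrinsic matrix.
    T_2_from_1:   (B, 4, 4) relative pose mapping first-view camera coordinates
                  into the second-view camera frame (assumed convention).
    Returns a (B, 3, H, W) candidate reconstruction of the first image.
    """
    B, _, H, W = depth.shape
    device, dtype = depth.device, depth.dtype

    # Homogeneous pixel grid of the first view, shape (B, 3, H*W).
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                            torch.arange(W, device=device, dtype=dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D points in the first camera frame.
    cam_points = (torch.linalg.inv(K) @ pix) * depth.reshape(B, 1, -1)
    cam_points_h = torch.cat(
        [cam_points, torch.ones(B, 1, H * W, device=device, dtype=dtype)], dim=1)

    # Transform into the second camera frame and project with the intrinsics.
    proj = K @ (T_2_from_1 @ cam_points_h)[:, :3, :]
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)

    # Normalise projected pixel locations to [-1, 1] and bilinearly sample.
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(second_image, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Here the zero padding mode plays the role of the zeroing of border pixels mentioned above; cropping both images instead would be an equally valid choice.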
The relative pose used by the projection model 210 for generating the candidate reconstruction 212 of the first training image can have up to six degrees of freedom, corresponding to a translation in three dimensions along with rotation angles about three axes, and may be represented as a 4x4 matrix. In the present example, the relative pose is assumed to be known, for example based on a known velocity of the camera 102 and a time elapsed between the capturing of the first and second training images 202, 204. The relative pose may be measured or configured in examples where the training images are captured simultaneously using a multi-camera rig. An example in which the relative pose is not known is described hereinafter with reference to Figure 3.
The first training image 202 and the candidate reconstruction 212 of the first training image are compared using a loss function 214 comprising a photometric difference term 215 penalising a photometric difference between the first training image 202 and the candidate reconstruction 212 of the first training image. The photometric difference may be calculated as an L1 difference, an L2 difference, a structural similarity index (SSIM), or a combination of these metrics or any other suitable difference metrics, for example applied to the pixel intensities of the first training image 202 and the candidate reconstruction 212 of the first training image. The idea is that a more accurate candidate depth map 208 leads to a more accurate candidate reconstruction 212 of the first training image and a lower value of the photometric difference term 215.
The photometric difference term 215 therefore plays the role of a training signal for training the depth estimation model 206.
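A hedged sketch of one possible photometric difference term follows, combining an L1 difference with a pooled SSIM approximation; the 3x3 averaging window, the constants C1 and C2, and the 0.85 weighting are assumed values chosen for illustration only, not values specified in the patent.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified structural similarity computed with 3x3 average pooling."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    # Convert similarity into a dissimilarity in [0, 1].
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_difference(first_image, reconstruction, alpha=0.85):
    """Weighted combination of SSIM and L1 differences, one value per pixel."""
    l1 = torch.abs(first_image - reconstruction).mean(dim=1, keepdim=True)
    ssim_term = ssim(first_image, reconstruction).mean(dim=1, keepdim=True)
    return alpha * ssim_term + (1 - alpha) * l1
```

Returning a per-pixel map rather than a scalar makes it straightforward to exclude masked regions before averaging, as discussed later.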
Further to the photometric difference term 215, the loss function 214 may include additional terms to induce particular properties in the candidate depth map 208 generated by the depth estimation model 206. In particular, the loss function 214 may include a regularisation term (not shown) which induces smoothness in the candidate depth map 208. In an example, the loss function 214 includes an edge-aware regularisation term which penalises a gradient of the depth values of the candidate depth map 208 in the horizontal and vertical directions (inducing smoothness), weighted by a gradient of the pixel values of the first training image 202 (emphasising the effect of edges on the regularisation term). In order to mitigate the effect of erroneous depth estimations caused by reflections from a horizontal surface, in the present example the loss function 214 further includes a reflection mitigation term 217, as will be described in more detail hereinafter.
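The edge-aware regularisation term mentioned above might be sketched as follows; normalising the disparity by its mean and weighting the disparity gradients by an exponential of the image gradients are assumed implementation choices for this sketch.

```python
import torch

def edge_aware_smoothness(disparity, image):
    """Penalise first derivatives of the (mean-normalised) disparity, down-weighted
    where the image itself has strong gradients (edges)."""
    disp = disparity / (disparity.mean(dim=(2, 3), keepdim=True) + 1e-7)
    grad_disp_x = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    grad_disp_y = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    grad_img_x = torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]).mean(1, keepdim=True)
    grad_img_y = torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]).mean(1, keepdim=True)
    # Reduce the penalty across image edges so genuine depth discontinuities survive.
    grad_disp_x = grad_disp_x * torch.exp(-grad_img_x)
    grad_disp_y = grad_disp_y * torch.exp(-grad_img_y)
    return grad_disp_x.mean() + grad_disp_y.mean()
```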
The rate of convergence during training of the depth estimation model and the accuracy of the resulting depth estimation model can be improved if the application of one or more terms of the loss function 214 is restricted to a subregion of the scene in which very distant regions are excluded. In the present example, the first training image 202 and the second training image 204 are processed to determine a binary mask 211 indicative of distant regions of the scene, for example those corresponding to the sky, and the loss function 214 is filtered using the binary mask 211 to exclude very distant regions of the scene. The implementation and effect of the binary mask 211 are described in more detail hereinafter.

For a single training iteration, the loss function 214 may be applied to one or more pairs of training images in order to determine a loss value 216. For example, the loss function may be applied to multiple pairs of images in a set (for example, in the example of Figure 2, the loss function may be applied to the pair of images captured at t1 and t2, then to the pair of images captured at t2 and t3). Furthermore, in order to reduce bias, the loss function 214 may be applied to a batch comprising multiple randomly-selected sets of training images. Alternatively, or additionally, the loss function 214 may be applied to training images at multiple scales, which may result in a more robust depth estimation model 206. In another example, a composite candidate reconstruction of the first training image may be generated by projecting multiple further training images using the projection model 210, for example by taking an average of the resulting projections, in which case the loss function may be applied to the first training image 202 and the composite candidate reconstruction of the first training image.
The gradient of the resulting loss value 216 is backpropagated as indicated by the dashed arrows in Figure 2 to determine a gradient of the loss value 216 with respect to the trainable parameters of the depth estimation model. The trainable parameters of the depth estimation model are updated using stochastic gradient descent or a variant thereof. Over multiple such iterations, the depth estimation model 206 is trained to generate accurate depth maps from monocular images. Although in the present example the loss function 214 outputs a loss value 216 and the training aims to minimise the loss value 216, in other examples a loss function may be arranged to output a value which rewards photometric similarity between the first and second training images 202, 204, in which case the training aims to maximise this value, for example using gradient ascent or a variant thereof.

In the example of Figure 2, it is assumed that the relative pose relating the first and second views of the scene is known. Figure 3 shows schematically a second example in which the relative pose is not assumed to be known a priori. It is likely that for some cases in which training images are captured at different times using a mobile camera, for example a camera mounted on a vehicle, the relative pose will not be known to a high degree of precision, in which case the method of Figure 3 may be preferable. Items in Figures 2 and 3 sharing the same last two digits are functionally equivalent.
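Before turning to the differences introduced in Figure 3, the forward pass, loss evaluation and parameter update described above for the known-pose configuration of Figure 2 can be tied together as in the sketch below, which reuses the functions from the earlier sketches. The Adam optimiser, the loss weights and the disparity-to-depth mapping are assumptions made for illustration; the reflection mitigation term and the binary mask described elsewhere would be added to the loss in the same way.

```python
import torch

# Hypothetical single training iteration for the known-pose setup of Figure 2,
# reusing SimpleDepthNet, reconstruct_first_image, photometric_difference and
# edge_aware_smoothness from the sketches above.
depth_net = SimpleDepthNet()
optimiser = torch.optim.Adam(depth_net.parameters(), lr=1e-4)

def training_step(first_image, second_image, K, relative_pose, smoothness_weight=1e-3):
    disparity = depth_net(first_image)
    depth = 1.0 / (disparity + 1e-3)          # assumed disparity-to-depth mapping
    reconstruction = reconstruct_first_image(second_image, depth, K, relative_pose)
    loss = (photometric_difference(first_image, reconstruction).mean()
            + smoothness_weight * edge_aware_smoothness(disparity, first_image))
    optimiser.zero_grad()
    loss.backward()                           # backpropagate the loss gradient
    optimiser.step()                          # stochastic-gradient-style parameter update
    return loss.item()
```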
The method of Figure 3 differs from the method of Figure 2 in that the first training image 302 and the second training image 304 are processed together using a pose estimation model 309 to generate a candidate relative pose relating the first and second views of the scene. The pose estimation model 309 may be any suitable type of machine learning model arranged to take two colour or greyscale images as input and generate data indicative of a relative pose, for example a vector with six components indicative of the six degrees of freedom mentioned above. The resulting vector can then be converted into a 4x4 transformation matrix. For example, the pose estimation model 309 may be a deep neural network model in which at least some of the layers are convolutional layers. Those skilled in the art will appreciate that many different neural network architectures can be used to process input images to generate an output vector. The pose estimation model 309 is defined by trainable parameters, for example kernel values, connection weights and bias values in the case of a deep neural network model, along with further data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and the like. In some examples, the depth estimation model 306 and the pose estimation model 309 may share one or more neural network layers, for example an initial one or more convolutional layers which may be applied to the first training image 302 for the depth estimation model and further applied to the second training image 304 for the pose estimation model. This may alleviate the computational burden of the training process and also lead to faster convergence of the models.
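One common way of converting the six-component output of a pose estimation model into a 4x4 transformation matrix, assumed here purely for illustration, is to treat the first three components as an axis-angle rotation and apply Rodrigues' formula:

```python
import torch

def pose_vector_to_matrix(pose_vec):
    """Convert a (B, 6) pose vector (3 axis-angle rotation components followed by
    3 translation components) into a (B, 4, 4) homogeneous transformation matrix."""
    B = pose_vec.shape[0]
    axis_angle, translation = pose_vec[:, :3], pose_vec[:, 3:]
    angle = torch.linalg.norm(axis_angle, dim=1, keepdim=True).clamp(min=1e-8)
    axis = axis_angle / angle
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zero = torch.zeros_like(x)
    # Skew-symmetric cross-product matrix of the rotation axis.
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero], dim=1).view(B, 3, 3)
    eye = torch.eye(3, device=pose_vec.device, dtype=pose_vec.dtype).expand(B, 3, 3)
    sin = torch.sin(angle).view(B, 1, 1)
    cos = torch.cos(angle).view(B, 1, 1)
    # Rodrigues' rotation formula: R = I + sin(theta) K + (1 - cos(theta)) K^2.
    R = eye + sin * K + (1 - cos) * (K @ K)
    T = torch.zeros(B, 4, 4, device=pose_vec.device, dtype=pose_vec.dtype)
    T[:, :3, :3] = R
    T[:, :3, 3] = translation
    T[:, 3, 3] = 1.0
    return T
```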
In the example of Figure 3, the candidate relative pose determined using the pose estimation model 309 is used in the projection model 310 in place of the known relative pose. The gradient of the resulting loss value 316 is backpropagated as indicated by the dashed arrows to determine a gradient of the loss value 316 with respect to the trainable parameters of the depth estimation model 306 and the trainable parameters of the pose estimation model 309, and the trainable parameters of the depth estimation model 306 and the pose estimation model 309 are updated using stochastic gradient descent or a variant thereof.

As explained above, the loss functions used to train the depth estimation model may include a reflection mitigation term to mitigate the effect of erroneous depth estimations caused by reflections from a horizontal surface. Frame 402 of Figure 4 is an example of an image showing, inter alia, a road 404 with two wet patches 406, 408 on its surface, and a vehicle 410 driving on the road 404. Frame 412 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 without the reflection mitigation term. Frame 414 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 with the reflection mitigation term.
It is observed in frame 412 that the wet patches 406, 408 on the surface of the road 404 lead to erroneous depth estimates in the corresponding regions of the estimated depth map. In particular, the wet patch 406 shows a reflection of the sky, and is therefore estimated to have much greater depth values than those of the surrounding portions of the road 404. The wet patch 408 also shows a reflection of the sky, along with part of the vehicle 410, and different regions of the wet patch 408 are therefore estimated to have greatly differing depth values, all of which are greatly different from the depth values of the surrounding portions of the road 404. The erroneous depth estimates for the wet patches 406, 408 lead to an unrealistic representation of the scene, as can be seen by comparing frames 412 and 414. Such unrealistic representations may have detrimental or dangerous consequences, for example when used as input data for an ADAS or ADS. It will be appreciated that erroneous depth estimations caused by reflective surfaces would be even more problematic when the scene has more reflective surfaces, for example a road surface after heavy rain. Although in the example of Figure 4 the effects of the reflections are predictable based on ray tracing considerations, in examples where the reflective surface is not smooth (for example due to ripples on the surface of a puddle), the reflections will typically lead to noisy and unpredictable erroneous depth estimates.
In order to reduce the detrimental effects of reflections, the reflection mitigation term penalises a second derivative in the horizontal direction of at least a portion of the candidate depth map. This is in contrast with the regularisation term mentioned above, which instead penalises first order derivatives. The inventors have identified that, compared with other regions of a scene, reflections typically lead to noisier and consequently less smooth regions of the candidate depth map. By penalising the second derivative in the horizontal direction, the reflection mitigation term encourages reduced variation of the depth estimates in the horizontal direction for regions in which reflections are present, as would be expected for example for a surface (such as a horizontal road surface) without reflections. The reflection mitigation term therefore discourages the depth estimation model from inferring depth from regions exhibiting reflections, and instead encourages the depth estimation model to infer the depth for these regions from surrounding areas, leading to a depth map which is closer to the ground truth.
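A minimal sketch of a reflection mitigation term of this kind, approximating the horizontal second derivative with a central finite difference, is given below; the optional mask argument anticipates the restriction to a portion of the depth map discussed in the next paragraph, and the unweighted mean reduction is an assumed choice.

```python
import torch

def reflection_mitigation(depth, mask=None):
    """Penalise the second derivative of the depth map in the horizontal direction,
    approximated by the central finite difference d[i, j-1] - 2*d[i, j] + d[i, j+1].
    An optional binary mask restricts the term to a chosen portion of the depth map."""
    second_deriv = torch.abs(
        depth[:, :, :, :-2] - 2 * depth[:, :, :, 1:-1] + depth[:, :, :, 2:])
    if mask is not None:
        second_deriv = second_deriv * mask[:, :, :, 1:-1]
    return second_deriv.mean()
```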
The reflection mitigation term may be applied to the entirety of the candidate depth map, or only to a portion of the candidate depth map. For example, the reflection mitigation term may be applied to regions identified using one or more binary masks. In one example, a semantic segmentation model is used to generate a binary road mask identifying a region of the input image as a road surface, and the reflection mitigation term is only applied to the region identified as the road surface. Limiting the application of the reflection mitigation term to a particular region or subregion of the candidate depth map may be advantageous in that the term will not interfere with the performance of the depth estimation model on regions not resulting from reflections. However, the inventors have found that the reflection mitigation term as described herein has little effect on the performance of the depth estimation model on such regions, and is capable of reducing erroneous depth estimations caused by reflections without significantly affecting the performance of the depth estimation model elsewhere. Therefore, the reflection mitigation term can be applied without the additional complication and processing demands of implementing a semantic segmentation model.
As mentioned above, the rate of convergence during training of the depth estimation model (and the pose estimation model if used), and the accuracy of the resulting depth estimation model (and pose estimation model if used) can be improved if the application of one or more terms of the loss function is restricted to a subregion of the scene in which very distant regions of the scene are excluded. In order to restrict the application of the loss function in this way, the first training image and/or the second training image are processed to determine a binary mask indicative of distant regions of the scene, for example those corresponding to the sky. The binary mask may be determined for example using a semantic segmentation model, but a more computationally efficient method is to generate a difference image or delta image by subtracting the first training image from the second training image or vice versa. The first and second training images may optionally be downsampled or converted to greyscale before the subtraction is performed. The pixel values of the difference image are then binarized on the basis of a binarization threshold. The binarization threshold may be predetermined or alternatively may be determined in dependence on pixel values of the first training image and/or the second training image. For example, the binarization threshold may depend on pixel values of the difference image, for example such that a predetermined proportion of the pixels are excluded from the binary mask. Alternatively, the binarization threshold may be set to a value where a sharp drop in histogram frequency of the difference pixel values is observed, since it is expected that pixels of very distant regions will occur with a significantly higher frequency than pixels corresponding to any other narrow distance range. Alternatively, or additionally, the binarization threshold may depend directly on pixel values of the first and/or second training image. In this way, the binarization threshold can automatically account for variations in brightness of the training images. In any case, the binarization threshold is determined such that the binarized difference image will only be zero when the corresponding pixels of the first and second training images are almost identical. Very distant regions of the scene are expected to appear almost identically in the first and second training images, and will therefore lead to a binarized difference value of zero, whereas other regions of the scene are likely to have binarized pixel values of one.
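The difference-image approach might be sketched as follows; converting to greyscale by channel averaging and tying the binarization threshold to the mean image intensity are assumed choices for illustration, and any of the alternative thresholding strategies described above could be substituted.

```python
import torch

def distant_region_mask(first_image, second_image, threshold_fraction=0.05):
    """Binary mask that is 0 for pixels that are almost identical in the two views
    (assumed to correspond to very distant regions such as sky) and 1 elsewhere.
    The threshold here is a hypothetical choice: a fixed fraction of the mean
    image intensity."""
    # Convert to greyscale by averaging channels before differencing.
    grey_1 = first_image.mean(dim=1, keepdim=True)
    grey_2 = second_image.mean(dim=1, keepdim=True)
    difference = torch.abs(grey_1 - grey_2)
    threshold = threshold_fraction * 0.5 * (grey_1.mean() + grey_2.mean())
    return (difference > threshold).to(first_image.dtype)
```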
Figure 5 shows an example of a first training image 502 and a second training image 504 respectively of first and second views of a scene, along with a binary mask represented by black regions of the frame 506, as determined using the method described above. It is observed that the binary mask omits regions of the scene that contain sky in both the first and second training images 502, 504. It is further observed that in this example, reflections of the sky in the wet patches on the road are not omitted from the binary mask, as the noisy appearance of these regions result in different pixel values in the different training images.
As mentioned above, the reflection mitigation term is effective when applied to the entirety of the candidate depth map or only a subregion of the candidate depth map. In some examples, the reflection mitigation term is only applied to a region corresponding to a binary mask determined as described above. The binary mask may be applied to some or all of the terms in the loss function, irrespective of whether the loss function includes a reflection mitigation term.
In examples where the relative pose relating views of a scene is not known, such as when a pose estimation model is trained alongside the depth estimation model, the depth maps output by the trained depth estimation model are not scale-aware. In other words, depth values estimated by the model are only known up to a multiplicative scaling factor. For many practical applications, such as when the depth estimation model is used to generate input data for an ADS or ADAS, the scaling factor must be determined. Although the trained model may be valid for use with a range of cameras for example having different lenses and intrinsic parameters, the scaling factor may be different for different cameras. An objective of the present disclosure is to provide an automated calibration method for determining the scaling factor. The calibration procedure may be applied shortly after training the depth estimation model and pose estimation model, or alternatively may be applied at a later time and/or in a different location, for example when the trained depth estimation model is deployed to perform inference. The latter may be advantageous if the depth estimation model is to be calibrated for a different camera to the one used to capture the training images, for example where the trained depth estimation model is provided as software for several different models of vehicle.
Figure 6 illustrates a first example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image 602 of a calibration scene comprising a calibration object 604. The calibration object 604 has known dimensions which are provided to the computer system performing the calibration method. In this example, the calibration object has a planar surface decorated with a 4x4 black and white checkerboard pattern. A checkerboard pattern is advantageous because the high contrast between adjacent squares allows the interior corners of the pattern to be detected easily using known image processing techniques. Nevertheless, other calibration objects may be used, namely any object with known dimensions which has a detectable pattern such as one or more squares, rectangles, circles or other shapes. The calibration object in this example is positioned such that the plane of the checkerboard pattern is substantially parallel to the optical axis of the camera used to capture the calibration image. This is not essential, however, and a calibration object may be oriented in any way provided that there is some depth variation between points of the calibration object visible in the calibration image.
The calibration image is processed to determine an orientation of the calibration object. The orientation has three degrees of freedom corresponding to rotations around three axes. In the present example, corner detection methods are used to detect the positions of three or more of the corners of the checkerboard pattern (for example, the three interior corners indicated by crosses in frame 606 of Figure 6, in which the calibration object 604 is shown in line drawing style for clarity). Having detected the three corners of the calibration object 604, the orientation of the calibration object is calculated directly from the positions of these corners in the calibration image, using geometric considerations. In other examples, other points on a calibration object may be detected using any suitable image processing method and used for determining the orientation of the calibration object. Three points are sufficient for determining the orientation, irrespective of whether the points lie within a planar surface of the calibration object. However, those skilled in the art will appreciate that using more than three points to determine the orientation, for example taking an average of several orientations calculated using different sets of points, may result in improved accuracy.

When the orientation of the calibration object 604 has been determined, the method includes determining a calibrated real-world depth difference between two predetermined points on the calibration object, in dependence on the determined orientation of the calibration object and the known dimensions of the calibration object. The predetermined points should not lie on a line perpendicular to the optical axis of the camera (which would give a depth difference of zero), and the accuracy of the calibration method is improved by choosing predetermined points that have a relatively large depth difference. In the example of Figure 6, the depth difference between the leftmost crosses in frame 606 is determined to be Δd.
The method includes processing the calibration image 602 using the depth estimation model to generate an uncalibrated depth difference between the two predetermined points in the calibration image. As explained above, the depth estimation model is arranged to generate a depth map associating respective depth values with an array of points in the calibration image, but these depth values are uncalibrated, i.e. only known up to a multiplicative constant. The uncalibrated depth difference can therefore be determined by selecting points within the depth map corresponding to the predetermined points within the calibration image, and subtracting the uncalibrated depth values of these points. Frame 608 shows the calibration image with iso-depth contours determined using the depth estimation model. The uncalibrated depth difference between the corners corresponding to the leftmost crosses in frame 606 is determined to be Δz.
The scaling factor is calculated as a ratio of the calibrated depth difference to the uncalibrated depth difference. In the example of Figure 6, the scaling factor is given by Δd/Δz. Optionally, the calibration method can be performed multiple times with the calibration object placed at different locations and depths within a scene, or in different scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values.
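For illustration only, one possible realisation of this calibration procedure is sketched below using OpenCV corner detection and solvePnP to recover the orientation of the checkerboard; the choice of OpenCV, the selection of two diagonally opposite interior corners as the predetermined points, and the function and parameter names are assumptions of this sketch rather than the method prescribed above.

```python
import cv2
import numpy as np

def scaling_factor_from_checkerboard(calibration_image_grey, depth_map, K,
                                     square_size_m, pattern_size=(3, 3)):
    """Hypothetical sketch: scaling factor from a checkerboard calibration object.
    `square_size_m` is the known square edge length, `pattern_size` the number of
    interior corners (3x3 for a 4x4 checkerboard), `K` the 3x3 intrinsic matrix."""
    found, corners = cv2.findChessboardCorners(calibration_image_grey, pattern_size)
    assert found, "checkerboard not detected"

    # Known 3D corner positions in the object's own coordinate frame (z = 0 plane).
    obj_points = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    obj_points[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2)
    obj_points *= square_size_m

    # Recover the object's orientation and position relative to the camera.
    ok, rvec, tvec = cv2.solvePnP(obj_points, corners, K, None)
    R, _ = cv2.Rodrigues(rvec)
    cam_points = (R @ obj_points.T + tvec).T       # corners in camera coordinates

    # Calibrated (metric) depth difference between two predetermined corners.
    i, j = 0, pattern_size[0] * pattern_size[1] - 1
    calibrated_dd = abs(cam_points[i, 2] - cam_points[j, 2])

    # Uncalibrated depth difference read from the depth map at the same pixels.
    (u_i, v_i), (u_j, v_j) = corners[i, 0], corners[j, 0]
    uncalibrated_dd = abs(depth_map[int(v_i), int(u_i)] - depth_map[int(v_j), int(u_j)])

    return calibrated_dd / uncalibrated_dd
```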
The method of Figure 6 provides an accurate, automated method of calibrating a depth estimation model. However, in some circumstances it may be inconvenient or impracticable to obtain a suitable calibration object and/or to perform the steps of capturing calibration images containing a calibration object. In particular, this is unlikely to be a convenient solution where the depth estimation model is provided as software for a vehicle with a camera setup not available when the model is trained.
Figures 7A-C illustrate a second example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model. Unlike the method of Figure 6, the method of Figures 7A-C does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. The method of Figures 7A-C is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup.
Figure 7A shows a vehicle 700 with a camera 702 and a computer system 703 configured as an ADS/ADAS. The computer system 703 is arranged to process images captured by the camera 702 using a depth estimation model to generate input data for the ADS/ADAS. The computer system 703 is further arranged to cause the camera 702 to capture a calibration image of a calibration scene, and to determine a scaling factor for calibrating the depth estimation model using the captured calibration image as described hereafter. The camera 702 is mounted at a height h above a ground plane 704, which in this example is a road surface, where h is a value known to the computer system 703.

Frame 706 of Figure 7B shows a calibration image of a calibration scene captured using the camera 702, along with iso-depth contours representing a calibration depth map determined using the depth estimation model. The calibration method includes processing the calibration depth map to generate a point cloud representation of the calibration scene. Whereas the calibration depth map associates depth values with an array of points, the point cloud representation indicates three-dimensional co-ordinates of a set of points representing objects in the scene. The co-ordinates in the point cloud representation are uncalibrated because the calibration depth map is uncalibrated. According to the convention used in the present example, the x and y co-ordinates are horizontal and vertical co-ordinates in the plane of the calibration image, and the z co-ordinate represents depth in the direction of the optical axis of the camera 702. Those skilled in the art are aware of methods of converting a depth map to a point cloud representation. Frame 708 in Figure 7B shows a set of points of the point cloud representation as crosses. It is observed that more points are associated with the ground plane 704 than with any other object appearing in the calibration scene.
The calibration method includes generating a histogram of y co-ordinates of points within the point cloud representation. The bin widths of the histogram may be predetermined or may be determined in dependence on the co-ordinates of the points in the point cloud, for example to ensure a predetermined number of bins with frequency values above a threshold value. Figure 7C shows a histogram of y co-ordinates for the point cloud shown in frame 708 of Figure 7B. It is observed that the modal bin 710, representing points with y values between -30 and -20, has a significantly greater frequency than the other bins. In accordance with the present method, points falling within the modal bin 710 are identified as ground plane points, because as noted above it is expected that more points will be associated with the ground plane 704 than with any other object in the calibration scene. Using a histogram to identify ground plane points is simpler to implement, more robust, and less demanding of processing resources than alternative methods of determining a ground plane, for example using a semantic segmentation model.
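The ground-plane identification step might be sketched as follows in NumPy, using an image-style convention in which y increases downwards so that ground points have positive y co-ordinates; the bin count, the below-axis restriction and the median-depth cut-off (both restrictions are discussed further in the next paragraph) are assumed or simplified choices for this sketch.

```python
import numpy as np

def ground_plane_points(depth_map, K, num_bins=50):
    """Back-project an uncalibrated depth map into a point cloud, histogram the
    vertical (y) co-ordinates and return the points falling in the modal bin,
    which are taken to be ground-plane points."""
    H, W = depth_map.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([us.ravel(), vs.ravel(), np.ones(H * W)], axis=0)
    points = (np.linalg.inv(K) @ pixels) * depth_map.ravel()   # (3, H*W), camera frame

    # Keep only points below the optical axis (positive y in this convention)
    # and with moderate depth values, as distant points are less reliable.
    y, z = points[1], points[2]
    keep = (y > 0) & (z < np.median(z))
    y_kept = y[keep]

    counts, edges = np.histogram(y_kept, bins=num_bins)
    modal = np.argmax(counts)
    in_modal_bin = (y_kept >= edges[modal]) & (y_kept < edges[modal + 1])
    return points[:, keep][:, in_modal_bin]
```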
The histogram used to identify ground plane points may be restricted to include only points which lie below the optical axis of the camera 702. Provided the angle between the optical axis and the ground plane 704 is not too great, points belonging to the ground plane 704 will lie beneath the axis of the camera 702. The histogram may additionally, or alternatively, be restricted to include only points having a restricted range of depth values (z co-ordinates). The restricted range of depth values may be bounded from above and/or from below by a threshold depth value or values. The threshold depth value(s) may be predetermined or may depend on the points within the point cloud. In one example, an upper threshold depth value is given by the median or mean depth value of points within the point cloud, or a predetermined proportion thereof. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.
The calibration method proceeds with performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera 702 above the ground plane 704. The uncalibrated height is given by minus the intercept of the ground plane 704 with a vertical y-axis having its origin at the camera position. In one example, it is assumed that the ground plane 704 is parallel to the optical axis of the camera 702. The statistical analysis may then involve determining a mean or median value of the y co-ordinates of the identified ground plane points, and taking this value as the y-intercept of the ground plane. Alternatively, if it is not assumed that the ground plane 704 is parallel to the optical axis of the camera (for example because the optical axis is inclined or declined with respect to the ground plane 704 or vice versa), then the statistical analysis may involve determining coefficients for an equation of a plane of best fit for the points identified as ground plane points, for example using a regression matrix approach. The y-intercept of the ground plane is then taken as the y-intercept of the plane of best fit.
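Continuing the sketch above, the uncalibrated camera height and the resulting scaling factor might be obtained as follows; the least-squares plane fit is one possible form of the statistical analysis described above, and the sign convention again assumes y increasing downwards so that the camera height is the plane's positive y-intercept.

```python
import numpy as np

def scaling_factor_from_ground_plane(ground_points, known_camera_height_m,
                                     assume_parallel=True):
    """Estimate the uncalibrated camera height from the identified ground-plane
    points (a (3, M) array of x, y, z co-ordinates) and return the scaling factor."""
    x, y, z = ground_points
    if assume_parallel:
        # Optical axis assumed parallel to the ground: height is the median y value.
        uncalibrated_height = np.median(y)
    else:
        # Fit a plane y = a*x + b*z + c by least squares; c is the y-intercept.
        A = np.stack([x, z, np.ones_like(x)], axis=1)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        uncalibrated_height = coeffs[2]
    return known_camera_height_m / uncalibrated_height
```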
The scaling factor is calculated as a ratio of the actual height of the camera 702 above the ground plane 704 to the uncalibrated height of the camera 702 above the ground plane 704 determined as discussed above. Optionally, the calibration method can be performed multiple times with differing scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values. In some examples, the calibration method may be performed repeatedly during the time that the depth estimation model is in use. For example, the calibration method may be performed for each image or video frame processed by the depth estimation model, or a subset thereof, to ensure that calibration of the model remains up to date. This is facilitated by the efficient method of identifying the ground plane described herein. In cases where the calibration method is performed repeatedly, there may be certain images or frames for which a scaling factor cannot be determined, for example in the case of a vehicle driving over the brow of a hill or about to drive up a hill. In these cases, the scaling factor may be determined in dependence on previous instances of the calibration method, for example as a most recent value or an accumulated or rolling average of the previously determined scaling factors.
The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, although in the examples of training methods described above the training images are real images captured by one or more cameras, in other examples a set of training images may be views of a virtual scene. Using virtual scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data. It may be advantageous to train the depth estimation model using a combination of synthetic training images and real training images. In some examples, a synthetic three-dimensional scene is generated using scene synthesis, for example using a deep neural network approach and/or using a video game engine, and a set of training images is generated virtually as views of the synthetic scene from respective virtual camera locations.

It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.
Claims (26)
- 1. A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image, the method comprising: receiving first and second training images respectively of first and second views of a scene; processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array representing horizontal and vertical directions within the scene; projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image; and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
- 2. A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image, the method comprising: receiving first and second training images respectively of first and second views of a scene; processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array representing horizontal and vertical directions within the scene; processing at least one of the first training image and the second training image to determine a binary mask indicative of distant regions of the scene; projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image; and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between corresponding portions of the first training image and the candidate reconstruction of the first training image, wherein the corresponding portions exclude the distant regions of the scene as indicated by the binary mask.
- 3. The method of claim 2, wherein determining the binary mask comprises: generating a difference image corresponding to a difference between the first training image and the second training image; and binarizing the generated difference image.
- 4. The method of claim 3, wherein the binarizing is based on a binarization threshold determined in dependence on pixel values of the first training image and/or the second training image.
- 5. The method of any of claims 2 to 4, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
- 6. The method of claim 5, wherein said at least a portion of the candidate depth map excludes depth values corresponding to the distant regions of the scene as indicated by the binary mask.
- 7. The method of claim 1 or 5, further comprising identifying a road surface in the first training image using a segmentation model, wherein said at least a portion of the candidate depth map corresponds to the identified road surface.
- 8. The method of any preceding claim, further comprising: receiving one or more further training images of respective further views of the scene; and projecting each of the one or more further training images using the candidate depth map to generate a respective further candidate reconstruction of the first training image, wherein the photometric difference term of the loss function further penalises a photometric difference between the first training image and each of the respective projected further training images.
- 9. The method of any of claims 1 to 7, wherein: the candidate reconstruction of the first training image is a composite candidate reconstruction of the first training image; the method further comprises receiving one or more further training images of respective further views of the scene; and generating the composite candidate reconstruction of the first training image further comprises projecting each of the further training images using the candidate depth map.
- 10. The method of any preceding claim, further comprising generating the scene using scene synthesis.
- 11. The method of any preceding claim, further comprising processing the first and second training images using a pose estimation model to determine a candidate relative pose relating the first and second views of the scene, wherein: the projecting of the second training image further uses the candidate relative pose; and the method further includes updating the pose estimation model so as to reduce the value of the loss function.
- 12. The method of claim 11, wherein: the depth estimation model is a first neural network model; the pose estimation model is a second neural network model; and the first neural network model and the second neural network model share one or more neural network layers.
- 13. The method of any preceding claim, further comprising determining a scaling factor for calibrating the trained depth estimation model, the method comprising: receiving a calibration image of a calibration scene comprising a calibration object; receiving data indicative of dimensions of the calibration object; processing the calibration image to determine an orientation of the calibration object; determining, in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, a calibrated depth difference between two predetermined points on the calibration object; processing the calibration image, using the depth estimation model, to determine an uncalibrated depth difference between the two predetermined points on the calibration object; and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.
- 14. A computer-implemented method of determining a scaling factor for calibrating a depth estimation model, the method comprising: receiving a calibration image of a calibration scene comprising a calibration object; receiving data indicative of dimensions of the calibration object; processing the calibration image to determine an orientation of the calibration object; determining, in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, a calibrated depth difference between two predetermined points on the calibration object; processing the calibration image, using the depth estimation model, to determine an uncalibrated depth difference between the two predetermined points on the calibration object; and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.
- 15. The method of claim 13 or 14, wherein: a surface of the calibration object is decorated with a checkerboard pattern; and determining the orientation of the calibration object comprises determining locations within the calibration image of a plurality of vertices of the checkerboard pattern, and analysing the determined locations of the plurality of vertices.
- 16. The method of any of claims 1 to 12, further comprising determining a scaling factor for calibrating the trained depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using the trained depth estimation model, to determine a calibration depth map comprising an array of calibration depth values; processing the calibration depth map to generate a point cloud representation of the calibration scene; generating a histogram of vertical positions of points within the point cloud representation of the calibration scene; identifying points within a modal bin of the generated histogram as ground plane points; performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
- 17. A computer-implemented method of determining a scaling factor for calibrating a depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using the depth estimation model, to determine a calibration depth map comprising an array of calibration depth values, the array defining horizontal and vertical directions; processing the calibration depth map to generate a point cloud representation of the calibration scene; generating a histogram of vertical positions of points within the point cloud representation of the calibration scene; identifying points within a modal bin of the generated histogram as ground plane points; performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
- 18. The method of claim 16 or 17, wherein the statistical analysis comprises determining a mean or median of the vertical positions of the points identified as ground plane points.
- 19. The method of claim 16 or 17, wherein the statistical analysis comprises determining coefficients for an equation of a plane of best fit for the points identified as ground plane points, and determining the uncalibrated height of the camera above the ground plane using the determined coefficients.
- 20. The method of any of claims 16 to 19, wherein the generated histogram is restricted to points lying below an optical axis of the camera.
- 21. The method of any of claims 16 to 20, wherein the generated histogram is restricted to points having a restricted range of depth values.
- 22. The method of claim 21, wherein an upper limit of the restricted range is given by a predetermined proportion of a median depth value of points within the point cloud representation.
- 23. A system comprising: a mobile camera arranged to capture first and second training images respectively of first and second views of a scene as the mobile camera moves with respect to the scene; and a computing system arranged to process the first and second training images to train a depth estimation model, in accordance with the method of any of claims 1 to 12.
- 24. A system comprising memory circuitry, processing circuitry, and a camera, wherein the memory circuitry holds a depth estimation model and machine-readable instructions which, when executed by the processing circuitry, cause the camera to capture a calibration image of a calibration scene, and the system to determine a scaling factor for calibrating the depth estimation model using the captured calibration image, in accordance with the method of claim 14 or 17.
- 25. The system of claim 24, further comprising an advanced driver assistance system (ADAS) and/or an automated driving system (ADS) for a vehicle, wherein the system is arranged to process images captured by the camera using the depth estimation model to generate input data for said ADAS and/or ADS.
- 26. A computer program product holding machine-readable instructions which, when executed by a computing system, cause the computing system to perform the method of any of claims 1 to 22.