WO2022214821A2 - Monocular depth estimation - Google Patents

Monocular depth estimation

Info

Publication number
WO2022214821A2
Authority
WO
WIPO (PCT)
Prior art keywords
estimation model, depth, calibration, ground plane, scene
Prior art date
Application number
PCT/GB2022/050881
Other languages
French (fr)
Other versions
WO2022214821A3 (en)
Inventor
Jostyn Biebele FUBARA
Liangchuan GU
Original Assignee
Robok Limited
Priority date
Filing date
Publication date
Application filed by Robok Limited filed Critical Robok Limited
Publication of WO2022214821A2 publication Critical patent/WO2022214821A2/en
Publication of WO2022214821A3 publication Critical patent/WO2022214821A3/en


Classifications

    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle

Definitions

  • the present invention relates to training and calibrating a depth estimation model for inferring depth information from a monocular image.
  • the invention has particular, but not exclusive, relevance to training and calibrating a depth estimation model for an advanced driver assistance system (ADAS) or an automated driving system (ADS).
  • Depth estimation or ranging involves estimating distances to objects within a scene or environment. Such information may be provided as input data to an advanced driver assistance system (ADAS) or an automated driving system (ADS) for an autonomous vehicle. Depth information estimated in near-real time can be used, for example, to inform decisions on collision avoidance actions such as emergency braking and pedestrian crash avoidance mitigation, as well as for generating driver alerts such as proximity warnings in the case of ADAS.
  • Various methods are known for depth estimation, for example stereo matching, lidar, radar, structured light methods and photometric stereo methods. Many of these methods require specialist equipment and/or perform poorly in real world settings such as those encountered in the context of ADAS or ADS.
  • deep learning approaches have been developed for estimating depth information from individual images captured using a standard monocular camera.
  • supervised learning is used to train a deep learning model using ground truth depth information determined using one of the methods described above.
  • in other approaches, estimation of depth information is recast as an unsupervised learning problem.
  • a first training image representing a first view of a scene is processed using a deep learning model to generate a candidate depth map.
  • the candidate depth map is then used to project a second training image representing a second view of the scene to a frame of reference of the first training image, to generate a candidate reconstruction of the first training image.
  • a photometric difference between the first training image and the candidate reconstruction of the first training image is then used as a training signal to update the deep learning model.
  • the images may be captured simultaneously using a stereo camera rig, or at different times using a moving camera. In the latter case, the relationship between reference frames of the first and second views of the environment may not be known, in which case a pose estimation model is trained alongside the depth estimation model, using the same training signal, for predicting this relationship.
  • a depth estimation model as described above is able to determine depth information based on individual images captured by a standard monocular camera.
  • a depth estimation model trained using a pose estimation model as mentioned above is not scale-aware, and therefore for practical applications the output of the depth estimation model must be calibrated for a given camera setup in order to determine a scaling for the determined depth information.
  • a computer-implemented method of training a depth estimation model to generate a depth map from a monocular image includes receiving first and second training images respectively of first and second views of a scene, and processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene.
  • the method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
  • the inventors have identified that reflections typically lead to noisier and consequently less smooth regions of the candidate depth map.
  • the reflection mitigation term encourages the depth estimation model to minimise variation of the depth estimates in the horizontal direction for regions in which reflections are present, leading to depth estimates which are more consistent with a surface without any reflections.
  • a further computer-implemented method of training a depth estimation model to generate a depth map from a monocular image includes receiving first and second training images respectively of first and second views of a scene, processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene, and processing at least one of the first training image and the second training image to determine a binary mask indicative of distant regions of the scene.
  • the method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between corresponding portions of the first training image and the candidate reconstruction of the first training image, wherein the corresponding portions exclude the distant regions of the scene as indicated by the binary mask.
  • the loss function is more strongly affected by regions of the scene in which depth variation is present and from which depth information can be validly inferred. This results in a higher rate of convergence of the depth estimation model during training, allowing the training to be performed using less training data, the collection of which may be time- and/or resource-demanding. In addition to the improved convergence properties, the accuracy of the resultant depth estimation model is found to be improved.
  • determining the binary mask includes generating a difference image corresponding to a difference of the first training image and the second training image, and binarizing the generated difference image. Determining the binary mask in this way is reliable, and the implementation is straightforward and significantly less demanding of processing resources than alternative approaches such as a semantic segmentation model.
  • the binarization may be performed using a binarization threshold depending on pixel values of the first training image and/or the second training image. In this way, the binarization threshold can be made to automatically account for variations in properties of the training images, such as brightness of the training images.
  • the training images used for training the depth estimation model may be images captured by a camera.
  • the method may further include generating the scene virtually using scene synthesis and then capturing a virtual image of the generated scene.
  • Using synthetic scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data.
  • the method further includes processing the first and second training images using a pose estimation model to determine a candidate relative pose relating the first and second views of the scene, and the projecting of the second training image further uses the candidate relative pose.
  • the depth estimation model and the pose estimation model may respectively be first and second neural network models sharing one or more neural network layers. Sharing one or more neural network layers, for example convolutional layers, between the depth estimation model and the pose estimation model may alleviate some of the computational burden of the training process and lead to faster convergence of the models.
  • a computer-implemented method of determining a scaling factor for calibrating a depth estimation model includes receiving a calibration image of a calibration scene comprising a calibration object, receiving data indicative of dimensions of the calibration object, and processing the calibration image to determine an orientation of the calibration object.
  • the method further includes determining a calibrated depth difference between two predetermined points on the calibration object in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, processing the calibration image using the depth estimation model to generate a calibration depth map, determining an uncalibrated depth difference between the two predetermined points on the calibration object from the calibration depth map, and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.
  • a further computer-implemented method of determining a scaling factor for calibrating a depth estimation model includes receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane and using the depth estimation model to process the calibration image to determine a calibration depth map comprising an array of calibration depth values, the array defining horizontal and vertical directions.
  • the method further includes processing the calibration depth map to generate a point cloud representation of the calibration scene, generating a histogram of vertical positions of points within the point cloud representation of the calibration scene, identifying points within a modal bin of the generated histogram as ground plane points, performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane, and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
  • the above calibration method does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. This method is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup.
  • Using a histogram to identify ground plane points is more straightforward to implement, more robust, and less demanding of processing resources than alternative methods of identifying a ground plane, for example using a semantic segmentation model.
  • the generated histogram is restricted to points lying below an optical axis of the camera, and/or to points having a restricted range of depth values.
  • An upper limit of the restricted range may for example be given by a predetermined proportion of a median depth value of points within the point cloud representation. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.
  • a method of training a ground plane estimation model includes obtaining a first training image comprising a view of a scene, processing the first training image to generate a point cloud representation of at least part of the scene, processing the first training image using the ground plane estimation model to determine plane parameter values for an estimated ground plane within the scene, and updating the ground plane estimation model so as to reduce a value of a loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
  • generating the point cloud representation includes processing the first training image using a depth estimation model to generate a depth map of the scene, and converting at least part of the depth map to the point cloud representation.
  • the method may then further include training the depth estimation model, the training comprising updating the depth estimation model jointly with the ground plane estimation model so as to reduce the value of the loss function.
  • the trained ground plane estimation model may then be used to determine a scale factor for calibrating the depth estimation model.
  • This calibration method does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. This method is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup.
  • using a ground plane estimation model involves significantly lower processing demands than alternative methods of identifying a ground plane, which is desirable for example if the calibration needs to be performed frequently and/or using low-power devices such as embedded computing systems.
  • a system including a mobile camera arranged to capture first and second training images respectively of first and second views of a scene as the mobile camera moves with respect to the scene, and a computing system arranged to process the first and second training images to train a depth estimation model, in accordance with any of the training methods described above.
  • a system comprising memory circuitry, processing circuitry, and a camera.
  • the memory circuitry holds a depth estimation model and machine-readable instructions which, when executed by the processing circuitry, cause the camera to capture a calibration image of a calibration scene, and the system to determine a scaling factor for calibrating the depth estimation model using the captured calibration image, in accordance with any of the calibration methods described above.
  • the system may further include an advanced driver assistance system (ADAS) and/or an automated driving system (ADS) for a vehicle, and may be arranged to process images captured by the camera using the depth estimation model to generate input data for said ADAS and/or ADS.
  • a computer program product holding machine-readable instructions which, when executed by a computing system, cause the computing system to perform any of the computer-implemented methods described above.
  • Figure 1 schematically shows an example of apparatus for training a depth estimation model
  • Figure 2 schematically shows a first example of a method for training a depth estimation model
  • Figure 3 schematically shows a second example of a method for training a depth estimation model
  • Figure 4 illustrates depth estimation from a monocular image using a model trained with and without a modified loss function
  • Figure 5 illustrates a processing of two images of respective different views of a scene to generate a binary mask
  • Figure 6 shows a first example of a method for calibrating a depth estimation model
  • Figure 7A shows a system arranged to perform a second example of a method for calibrating a depth estimation model
  • Figure 7B illustrates a processing of an image captured using the apparatus of Figure 7A to generate a point cloud
  • Figure 7C shows a histogram of vertical positions of points within the point cloud of Figure 7B.
  • Figure 8 schematically shows an example of a method of training a ground plane estimation model alongside a depth estimation model.
  • Figure 1 shows an example of apparatus for training a depth estimation model.
  • the apparatus includes a camera 102 mounted on a moveable platform 104 arranged to move in a direction A along a path 106.
  • the moveable platform 104 may for example be a trolley arranged to move along a rail or a track, or may be a vehicle such as a car or van arranged to drive along a road.
  • the camera 102 in this example is a monocular camera arranged to capture a series of monocular images as the camera 102 and platform 104 move along the path 106.
  • the camera 102 captures images when the camera 102 passes the dashed lines at times t1, t2 and t3, which in this example are equally separated both in position and time, though in other examples the camera 102 may capture images at times or positions which are not equally separated.
  • the three images captured at times t1, t2 and t3 represent different views of a scene, meaning that the images contain overlapping portions of the environment, viewed from different perspectives.
  • the frequency at which the images are captured may be several images every second or every tenth or hundredth of a second.
  • the images may for example be frames of a video captured by the camera 102.
  • a set of training images includes two or more training images.
  • a set of training images is captured using a single mobile camera 102
  • a set of training images of different views of a scene may instead be captured simultaneously using multiple cameras, for example using a stereoscopic camera rig which includes a pair of cameras mounted side-by-side.
  • the apparatus of Figure 1 further includes a computer system 108 arranged to process training images captured by the camera 102 to train a depth estimation model.
  • the computer system 108 may receive the training images in real-time as the camera 102 captures the training images, or may receive the training images in a batch fashion after several sets of training images have been captured, for example via wired or wireless means or via a removable storage device.
  • the computer system 108 may be a general-purpose computer such as a desktop computer, laptop, server or network-based data processing system, or may be a standalone device or module, for example an integral component of a vehicle.
  • the computer system 108 includes memory circuitry 110 and processing circuitry 112.
  • the processing circuitry 112 may include general purpose processing units and/or specialist processing units such as a graphics processing unit (GPU) or a neural processing unit (NPU). Additionally, or alternatively, the processing circuitry 112 may include an application-specific standard product (ASSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other type of integrated circuit suitable for carrying out the methods described hereinafter.
  • the memory circuitry 110 stores sets of training images, along with model data defining the depth estimation model and machine-readable instructions for a training routine used to train the depth estimation model as described below.
  • the training images are captured by the camera 102 as described above, but in other examples the computer system 108 could receive training images from elsewhere, for example from a database in the case of historic data.
  • the depth estimation model may be any suitable type of machine learning model arranged to take a colour or greyscale monocular image as input and generate a depth map as an output.
  • the generated depth map comprises an array of direct depth values or inverse depth values (disparity values), with each entry of the array corresponding to an in-plane position within the monocular image.
  • a depth value for a given point in the scene is a distance from the camera to that point in a direction parallel to the optical axis of the camera (conventionally referred to as the z direction).
  • the array may have the same or different dimensions to the input image.
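For illustration, a disparity-style output can be mapped to depth as in the sketch below (PyTorch), assuming hypothetical min_depth and max_depth bounds; the disclosure only requires that the model outputs direct or inverse depth values and does not prescribe this parameterisation.

```python
import torch

def disparity_to_depth(sigmoid_output: torch.Tensor,
                       min_depth: float = 0.1,
                       max_depth: float = 100.0) -> torch.Tensor:
    """Map a network output in [0, 1] to a depth map.

    The depth bounds are illustrative assumptions; the model may equally be
    arranged to output direct depth values per pixel.
    """
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    disparity = min_disp + (max_disp - min_disp) * sigmoid_output
    return 1.0 / disparity
```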
  • the depth estimation model is a deep neural network model with convolutional layers and having an encoder-decoder architecture, for example as described in the article “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al, arXiv: 1505.04597, 2015.
  • the model data stored in the memory 110 includes trainable parameters of the depth estimation model, for example kernel values, connection weights and bias values in the case of a deep neural network model.
  • the model data further includes data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and the like.
  • Figure 2 shows schematically a first example of a method performed by the computer system 108 in which a set of training images comprising a first training image 202 and a second training image 204 is processed to train the depth estimation model held in the memory circuitry 110.
  • the first and second training images 202, 204 are respectively of first and second views of a scene, captured at different times by the camera 102.
  • the solid arrows in Figure 2 represent flow of data during a forward pass, and the dashed arrows represent backward propagation of errors following the forward pass, during a single training iteration.
  • the first training image 202 is processed using the depth estimation model 206 to generate a candidate depth map 208.
  • the candidate depth map 208 is input to a projection model 210 which is arranged to project the second training image 204 into a frame of reference corresponding to that of the first training image 202, to generate a candidate reconstruction 212 of the first training image.
  • the projection model 210 determines a projected pixel location for each pixel location in the candidate reconstruction 212, using the candidate depth map along with a relative pose relating the first and second views of the scene, and an intrinsic matrix characterising the camera 102. Pixel values are then sampled from the second training image 204, for example based on the nearest pixel to the projected pixel location or an interpolation between the nearest four pixels to the projected pixel location.
  • the candidate reconstruction 212 is generated by applying this procedure throughout the region for which the projected pixel locations lie within the second training image 204. Pixels of the candidate reconstruction 212 for which the projected pixel location lies outside of the second training image 204 may be zeroed or otherwise padded, or alternatively the first training image 202 and the candidate reconstruction 212 may both be cropped to exclude border pixels, ensuring the resulting images can be compared as discussed hereafter.
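A minimal sketch of how such a projection step might be realised in PyTorch is given below; the function name, tensor shapes and the use of bilinear sampling with zero padding are assumptions rather than details taken from the disclosure.

```python
import torch
import torch.nn.functional as F

def reconstruct_first_image(second_image, depth_map, K, K_inv, T_1_to_2):
    """Warp the second training image into the frame of the first training image.

    second_image: (B, 3, H, W), depth_map: (B, 1, H, W) candidate depth map,
    K, K_inv: (B, 3, 3) camera intrinsic matrix and its inverse,
    T_1_to_2: (B, 4, 4) relative pose from the first view to the second view.
    """
    B, _, H, W = depth_map.shape
    device = depth_map.device

    # Homogeneous pixel grid of the first view.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D using the candidate depth values, then apply the relative pose.
    cam_points = depth_map.view(B, 1, -1) * (K_inv @ pix)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_points_2 = (T_1_to_2 @ cam_points)[:, :3]

    # Re-project into the second view and normalise to [-1, 1] for grid_sample.
    proj = K @ cam_points_2
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([2.0 * uv[:, 0] / (W - 1) - 1.0,
                        2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1).view(B, H, W, 2)

    # Bilinear sampling; projected locations outside the second image are zero-padded.
    return F.grid_sample(second_image, grid, padding_mode="zeros", align_corners=True)
```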
  • the relative pose used by the projection model 210 for generating the candidate reconstruction 212 of the first training image can have up to six degrees of freedom, corresponding to a translation in three dimensions along with rotation angles about three axes, and may be represented as a 4x4 matrix.
  • the relative pose is assumed to be known, for example based on a known velocity of the camera 102 and a time elapsed between the capturing of the first and second training images 202, 204.
  • the relative pose may be measured or configured in examples where the training images are captured simultaneously using a multi-camera rig. An example in which the relative pose is not known is described hereinafter with reference to Figure 3.
  • the first training image 202 and the candidate reconstruction 212 of the first training image are compared using a loss function 214 comprising a photometric difference term 215 penalising a photometric difference between the first training image 202 and the candidate reconstruction 212 of the first training image.
  • the photometric difference may be calculated as an L1 difference, L2 difference, structured similarity index (SSIM), or a combination of these metrics or any other suitable difference metrics, for example applied to the pixel intensities of the first training image 202 and the candidate reconstruction 212 of the first training image.
  • the idea is that a more accurate candidate depth map 208 leads to a more accurate candidate reconstruction 212 of the first training image and a lower value of the photometric difference term 215.
  • the photometric difference term 215 therefore plays the role of a training signal for training the depth estimation model 206.
  • the loss function 214 may include additional terms to induce particular properties in the candidate depth map 208 generated by the depth estimation model 206.
  • the loss function 214 may include a regularisation term (not shown) which induces smoothness in the candidate depth map 208.
  • the loss function 214 includes an edge-aware regularisation term which penalises a gradient of the depth values of the candidate depth map 208 in the horizontal and vertical directions (inducing smoothness), weighted by a gradient of the pixel values of the first training image 202 (emphasising the effect of edges on the regularisation term).
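The photometric difference term and the edge-aware regularisation term could be computed roughly as follows; the SSIM/L1 mix, the 3x3 averaging window, the exponential edge weighting and the constant alpha are illustrative assumptions rather than values given in the disclosure.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural similarity over a 3x3 averaging window (assumed)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_term(target, reconstruction, alpha=0.85):
    """Weighted combination of SSIM and L1 differences (weighting is assumed)."""
    l1 = (target - reconstruction).abs().mean(dim=1, keepdim=True)
    ssim_dist = (1.0 - ssim(target, reconstruction).mean(dim=1, keepdim=True)) / 2.0
    return (alpha * ssim_dist + (1.0 - alpha) * l1).mean()

def edge_aware_smoothness(depth, image):
    """Penalise depth gradients, down-weighted where the image has strong edges."""
    d_dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    d_dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    i_dx = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    i_dy = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()
```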
  • the loss function 214 further includes a reflection mitigation term 217, as will be described in more detail hereinafter.
  • the rate of convergence during training of the depth estimation model and the accuracy of the resulting depth estimation model can be improved if the application of one or more terms of the loss function 214 is restricted to a subregion of the scene in which very distant regions are excluded.
  • the first training image 202 and the second training image 204 are processed to determine a binary mask 211 indicative of distant regions of the scene, for example those corresponding to the sky, and the loss function 214 is filtered using the binary mask 211 to exclude very distant regions of the scene.
  • the implementation and effect of the binary mask 211 are described in more detail hereinafter
  • the loss function 214 may be applied to one or more pairs of training images in order to determine a loss value 216.
  • the loss function may be applied to multiple pairs of images in a set (for example, in the example of Figure 2, the loss function may be applied to the pair of images captured at t1 and t2, then to the pair of images captured at t2 and t3).
  • the loss function 214 may be applied to a batch comprising multiple randomly-selected sets of training images.
  • the loss function 214 may be applied to training images at multiple scales, which may result in a more robust depth estimation model 206.
  • a composite candidate reconstruction of the first training image may be generated by projecting multiple further training images using the projection model 210, for example by taking an average of the resulting projections, in which case the loss function may be applied to the first training image 202 and the composite candidate reconstruction of the first training image.
  • the gradient of the resulting loss value 216 is backpropagated as indicated by the dashed arrows in Figure 2 to determine a gradient of the loss value 216 with respect to the trainable parameters of the depth estimation model.
  • the trainable parameters of the depth estimation model are updated using stochastic gradient descent or a variant thereof. Over multiple such iterations, the depth estimation model 206 is trained to generate accurate depth maps from monocular images.
  • the loss function 214 outputs a loss value 216 and the training aims to minimise the loss value 216
  • a loss function may be arranged to output a value which rewards photometric similarity between the first and second training images 202, 204, in which case the training aims to maximise this value, for example using gradient ascent or a variant thereof.
  • Figure 3 shows schematically a second example in which the relative pose is not assumed to be known a priori. It is likely that for some cases in which training images are captured at different times using a mobile camera, for example a camera mounted on a vehicle, the relative pose will not be known to a high degree of precision, in which case the method of Figure 3 may be preferable. Items in Figures 2 and 3 sharing the same last two digits are functionally equivalent.
  • the method of Figure 3 differs from the method of Figure 2 in that the first training image 302 and the second training image 304 are processed together using a pose estimation model 309 to generate a candidate relative pose relating the first and second views of the scene.
  • the pose estimation model 309 may be any suitable type of machine learning model arranged to take two colour or greyscale images as input and generate data indicative of a relative pose, for example a vector with six components indicative of the six degrees of freedom mentioned above. The resulting vector can then be converted into a 4x4 transformation matrix.
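One plausible way of converting the six-component output of the pose estimation model into a 4x4 transformation matrix is sketched below; the axis-angle parameterisation of the rotation is an assumption, since the disclosure does not fix a particular representation.

```python
import torch

def pose_vector_to_matrix(pose_vec: torch.Tensor) -> torch.Tensor:
    """Convert (B, 6) pose vectors [tx, ty, tz, rx, ry, rz] into (B, 4, 4) matrices."""
    B = pose_vec.shape[0]
    t = pose_vec[:, :3]
    axis_angle = pose_vec[:, 3:]
    angle = axis_angle.norm(dim=1, keepdim=True).clamp(min=1e-8)
    axis = axis_angle / angle
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    zero = torch.zeros_like(x)

    # Skew-symmetric cross-product matrix of the rotation axis, per batch element.
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero], dim=1).view(B, 3, 3)
    eye = torch.eye(3, device=pose_vec.device).expand(B, 3, 3)
    angle = angle.view(B, 1, 1)

    # Rodrigues' rotation formula.
    R = eye + torch.sin(angle) * K + (1.0 - torch.cos(angle)) * (K @ K)

    # Assemble the 4x4 rigid-body transformation.
    top = torch.cat([R, t.unsqueeze(-1)], dim=2)                       # (B, 3, 4)
    bottom = torch.tensor([0.0, 0.0, 0.0, 1.0],
                          device=pose_vec.device).view(1, 1, 4).expand(B, 1, 4)
    return torch.cat([top, bottom], dim=1)                             # (B, 4, 4)
```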
  • the pose estimation model 309 may be a deep neural network model in which at least some of the layers are convolutional layers. Those skilled in the art will appreciate that many different neural network architectures can be used to process input images to generate an output vector.
  • the pose estimation model 309 is defined by trainable parameters, for example kernel values, connection weights and bias values in the case of a deep neural network model, along with further data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and the like.
  • the depth estimation model 306 and the pose estimation model 309 may share one or more neural network layers, for example an initial one or more convolutional layers which may be applied to the first training image 302 for the depth estimation model and further applied to the second training image 304 for the pose estimation model. This may alleviate the computational burden of the training process and also lead to faster convergence of the models.
  • the candidate relative pose determined using the pose estimation model 309 is used in the projection model 310 in place of the known relative pose.
  • the gradient of the resulting loss value 316 is backpropagated as indicated by the dashed arrows to determine a gradient of the loss value 316 with respect to the trainable parameters of the depth estimation model 306 and the trainable parameters of the pose estimation model 309, and the trainable parameters of the depth estimation model 306 and the pose estimation model 309 are updated using stochastic gradient descent or a variant thereof.
  • the loss functions used to train the depth estimation model may include a reflection mitigation term to mitigate the effect of erroneous depth estimations caused by reflections from a horizontal surface.
  • Frame 402 of Figure 4 is an example of an image showing, inter alia, a road 404 with two wet patches 406, 408 on its surface, and a vehicle 410 driving on the road 404.
  • Frame 412 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 without the reflection mitigation term.
  • Frame 414 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 with the reflection mitigation term.
  • the wet patches 406, 408 on the surface of the road 404 lead to erroneous depth estimates in the corresponding regions of the estimated depth map.
  • the wet patch 406 shows a reflection of the sky, and is therefore estimated to have much greater depth values than those of the surrounding portions of the road 404.
  • the wet patch 408 also shows a reflection of the sky, along with part of the vehicle 410, and different regions of the wet patch 408 are therefore estimated to have greatly differing depth values, all of which are greatly different from the depth values of the surrounding portions of the road 404.
  • the erroneous depth estimates for the wet patches 406, 408 lead to an unrealistic representation of the scene, as can be seen by comparing frames 412 and 414.
  • the reflection mitigation term penalises a second derivative in the horizontal direction of at least a portion of the candidate depth map. This is in contrast with the regularisation term mentioned above, which instead penalises first order derivatives.
  • the inventors have identified that, compared with other regions of a scene, reflections typically lead to noisier and consequently less smooth regions of the candidate depth map.
  • the reflection mitigation term encourages reduced variation of the depth estimates in the horizontal direction for regions in which reflections are present, as would be expected for example for a surface (such as a horizontal road surface) without reflections.
  • the reflection mitigation term therefore discourages the depth estimation model from inferring depth from regions exhibiting reflections, and instead encourages the depth estimation model to infer the depth for these regions from surrounding areas, leading to a depth map which is closer to the ground truth.
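A minimal sketch of such a term is shown below: the mean absolute second difference of the candidate depth map along the horizontal axis, optionally restricted by a binary mask. The discrete second-difference stencil and the mean reduction are assumptions.

```python
import torch

def reflection_mitigation_term(depth, mask=None):
    """Mean absolute second derivative of the depth map in the horizontal direction.

    depth: (B, 1, H, W) candidate depth map.
    mask:  optional (B, 1, H, W) binary mask restricting the term to a subregion,
           for example a road mask or the distant-region mask described in the text.
    """
    # Discrete second difference along the width (horizontal) axis.
    d2_dx2 = depth[..., :, 2:] - 2.0 * depth[..., :, 1:-1] + depth[..., :, :-2]
    penalty = d2_dx2.abs()
    if mask is not None:
        penalty = penalty * mask[..., :, 1:-1]
    return penalty.mean()
```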
  • the reflection mitigation term may be applied to the entirety of the candidate depth map, or only to a portion of the candidate depth map.
  • the reflection mitigation term may be applied to regions identified using one or more binary masks.
  • a semantic segmentation model is used to generate a binary road mask identifying a region of the input image as a road surface, and the reflection mitigation term is only applied to the region identified as the road surface. Limiting the application of the reflection mitigation term to a particular region or subregion of the candidate depth map may be advantageous in that the term will not interfere with the performance of the depth estimation model on regions not resulting from reflections.
  • the reflection mitigation term as described herein has little effect on the performance of the depth estimation model on such regions, and is capable of reducing erroneous depth estimations caused by reflections without significantly affecting the performance of the depth estimation model elsewhere. Therefore, the reflection mitigation term can be applied without the additional complication and processing demands of implementing a semantic segmentation model.
  • the rate of convergence during training of the depth estimation model (and the pose estimation model if used), and the accuracy of the resulting depth estimation model (and pose estimation model if used) can be improved if the application of one or more terms of the loss function is restricted to a subregion of the scene in which very distant regions of the scene are excluded.
  • the first training image and/or the second training image are processed to determine a binary mask indicative of distant regions of the scene, for example those corresponding to the sky.
  • the binary mask may be determined for example using a semantic segmentation model, but a more computationally efficient method is to generate a difference image or delta image by subtracting the first training image from the second training image or vice versa.
  • the first and second training images may optionally be downsampled or converted to greyscale before the subtraction is performed.
  • the pixel values of the difference image are then binarized on the basis of a binarization threshold.
  • the binarization threshold may be predetermined or alternatively may be determined in dependence on pixel values of the first training image and/or the second training image.
  • the binarization threshold may depend on pixel values of the difference image, for example such that a predetermined proportion of the pixels are excluded from the binary mask.
  • the binarization threshold may be set to a value where a sharp drop in histogram frequency of the difference pixel values is observed, since it is expected that pixels of very distant regions will occur with a significantly higher frequency than pixels corresponding to any other narrow distance range.
  • the binarization threshold may depend directly on pixel values of the first and/or second training image. In this way, the binarization threshold can automatically account for variations in brightness of the training images. In any case, the binarization threshold is determined such that the binarized difference image will only be zero when the corresponding pixels of the first and second training images are almost identical. Very distant regions of the scene are expected to appear almost identically in the first and second training images, and will therefore lead to a binarized difference value of zero, whereas other regions of the scene are likely to have binarized pixel values of one.
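A rough sketch of the difference-image mask is given below; the brightness-proportional default threshold (the factor 0.02) is an illustrative assumption, and a fixed threshold or one derived from the histogram of difference values could equally be used, as described above.

```python
import numpy as np

def distant_region_mask(first_image, second_image, threshold=None):
    """Binary mask that is 0 over (near-identical) distant regions such as sky,
    and 1 elsewhere. Inputs are greyscale arrays of shape (H, W).
    """
    diff = np.abs(first_image.astype(np.float32) - second_image.astype(np.float32))
    if threshold is None:
        # Scale the threshold with mean image brightness so it adapts
        # automatically to darker or brighter training images (assumed heuristic).
        threshold = 0.02 * 0.5 * (first_image.mean() + second_image.mean())
    return (diff > threshold).astype(np.uint8)
```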
  • Figure 5 shows an example of a first training image 502 and a second training image 504 respectively of first and second views of a scene, along with a binary mask represented by black regions of the frame 506, as determined using the method described above. It is observed that the binary mask omits regions of the scene that contain sky in both the first and second training images 502, 504. It is further observed that in this example, reflections of the sky in the wet patches on the road are not omitted from the binary mask, as the noisy appearance of these regions results in different pixel values in the different training images.
  • the reflection mitigation term is effective when applied to the entirety of the candidate depth map or only a subregion of the candidate depth map.
  • the reflection mitigation term is only applied to a region corresponding to a binary mask determined as described above.
  • the binary mask may be applied to some or all of the terms in the loss function, irrespective of whether the loss function includes a reflection mitigation term.
  • the depth maps output by the trained depth estimation model are not scale-aware.
  • depth values estimated by the model are only known up to a multiplicative scaling factor.
  • the scaling factor must be determined.
  • although the trained model may be valid for use with a range of cameras, for example cameras having different lenses and intrinsic parameters, the scaling factor may be different for different cameras.
  • An objective of the present disclosure is to provide an automated calibration method for determining the scaling factor.
  • the calibration procedure may be applied shortly after training the depth estimation model and pose estimation model, or alternatively may be applied at a later time and/or in a different location, for example when the trained depth estimation model is deployed to perform inference.
  • the latter may be advantageous if the depth estimation model is to be calibrated for a different camera to the one used to capture the training images, for example where the trained depth estimation model is provided as software for several different models of vehicle.
  • Figure 6 illustrates a first example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model.
  • the method includes receiving a calibration image 602 of a calibration scene comprising a calibration object 604.
  • the calibration object 604 has known dimensions which are provided to the computer system performing the calibration method.
  • the calibration object has a planar surface decorated with a 4x4 black and white checkerboard pattern.
  • a checkerboard pattern is advantageous because the high contrast between adjacent squares allows the interior corners of the pattern to be detected easily using known image processing techniques.
  • other calibration objects may be used, namely any object with known dimensions which has a detectable pattern such as one or more squares, rectangles, circles or other shapes.
  • the calibration object in this example is positioned such that the plane of the checkerboard pattern is substantially parallel to the optical axis of the camera used to capture the calibration image. This is not essential, however, and a calibration object may be oriented in any way provided that there is some depth variation between points of the calibration object visible in the calibration image.
  • the calibration image is processed to determine an orientation of the calibration object.
  • the orientation has three degrees of freedom corresponding to rotations around three axes.
  • corner detection methods are used to detect the positions of three or more of the corners of the checkerboard pattern (for example, the three interior corners indicated by crosses in frame 606 of Figure 6, in which the calibration object 604 is shown in line drawing style for clarity). Having detected the three corners of the calibration object 604, the orientation of the calibration object is calculated directly from the positions of these corners in the calibration image, using geometric considerations.
  • other points on a calibration object may be detected using any suitable image processing method and used for determining the orientation of the calibration object. Three points are sufficient for determining the orientation, irrespective of whether the points lie within a planar surface of the calibration object.
  • using more than three points to determine the orientation, for example by taking an average of several orientations calculated using different sets of points, may result in improved accuracy.
  • the method includes determining a calibrated real-world depth difference between two predetermined points on the calibration object, in dependence on the determined orientation of the calibration object and the known dimensions of the calibration object.
  • the predetermined points should not lie on a line perpendicular to the optical axis of the camera (which would give a depth difference of zero), and the accuracy of the calibration method is improved by choosing predetermined points that have a relatively large depth difference.
  • the depth difference between the leftmost crosses in frame 606 is determined to be Δd.
  • the method includes processing the calibration image 602 using the depth estimation model to generate an uncalibrated depth difference between the two predetermined points in the calibration image.
  • the depth estimation model is arranged to generate a depth map associating respective depth values with an array of points in the calibration image, but these depth values are uncalibrated, i.e. only known up to a multiplicative constant.
  • the uncalibrated depth difference can therefore be determined by selecting points within the depth map corresponding to the predetermined points within the calibration image, and subtracting the uncalibrated depth values of these points.
  • Frame 608 shows the calibration image with iso-depth contours determined using the depth estimation model.
  • the uncalibrated depth difference between the corners corresponding to the leftmost crosses in frame 606 is determined to be Δz.
  • the scaling factor is calculated as a ratio of the calibrated depth difference to the uncalibrated depth difference.
  • the scaling factor is given by Δd/Δz.
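The sketch below illustrates the overall calculation; it substitutes OpenCV chessboard detection and solvePnP for the geometric orientation calculation described above (a swapped-in technique, not the patent's method), and assumes that the camera intrinsic matrix and the checkerboard square size are known.

```python
import cv2
import numpy as np

def checkerboard_scale_factor(calibration_image, depth_map, camera_matrix,
                              square_size, pattern_size=(3, 3)):
    """Estimate the scaling factor from a calibration image containing a
    4x4 checkerboard (3x3 interior corners). Returns None if no pattern is found.
    """
    gray = cv2.cvtColor(calibration_image, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if not found:
        return None

    # Board-frame coordinates of the interior corners (z = 0 on the board plane).
    objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

    ok, rvec, tvec = cv2.solvePnP(objp, corners, camera_matrix, None)
    if not ok:
        return None

    # Calibrated (metric) depths of two predetermined corners: the first and the last.
    R, _ = cv2.Rodrigues(rvec)
    cam_points = (R @ objp.T + tvec).T            # corners in camera coordinates
    delta_d = abs(cam_points[-1, 2] - cam_points[0, 2])

    # Uncalibrated depths read from the depth map at the same pixel locations
    # (assumes the depth map has the same resolution as the calibration image).
    u0, v0 = corners[0, 0].astype(int)
    u1, v1 = corners[-1, 0].astype(int)
    delta_z = abs(float(depth_map[v1, u1]) - float(depth_map[v0, u0]))
    return delta_d / delta_z if delta_z > 0 else None
```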
  • the calibration method can be performed multiple times with the calibration object placed at different locations and depths within a scene, or in different scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values.
  • the method of Figure 6 provides an accurate, automated method of calibrating a depth estimation model. However, in some circumstances it may be inconvenient or impracticable to obtain a suitable calibration object and/or to perform the steps of capturing calibration images containing a calibration object. In particular, this is unlikely to be a convenient solution where the depth estimation model is provided as software for a vehicle with a camera setup not available when the model is trained.
  • Figures 7A-C illustrate a second example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model. Unlike the method of Figure 6, the method of Figures 7A-C does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. The method of Figures 7A-C is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup.
  • Figure 7A shows a vehicle 700 with a camera 702 and a computer system 703 configured as an ADS/ADAS.
  • the computer system 703 is arranged to process images captured by the camera 702 using a depth estimation model to generate input data for the ADS/ADAS.
  • the computer system 703 is further arranged to cause the camera 702 to capture a calibration image of a calibration scene, and to determine a scaling factor for calibrating the depth estimation model using the captured calibration scene as described hereafter.
  • the camera 702 is mounted a height h above a ground plane 704, which in this example is a road surface, where h is a value known to the computer system 703.
  • Frame 706 of Figure 7B shows a calibration image of a calibration scene captured using the camera 702, along with iso-depth contours representing a calibration depth map determined using the depth estimation model.
  • the calibration method includes processing the calibration depth map to generate a point cloud representation of the calibration scene. Whereas the calibration depth map associates depth values with an array of points, the point cloud representation indicates three-dimensional co-ordinates of a set of points representing objects in the scene. The co-ordinates in the point cloud representation are uncalibrated because the calibration depth map is uncalibrated. According to the convention used in the present example, the x and y co-ordinates are horizontal and vertical co-ordinates in the plane of the calibration image, and the z co-ordinate represents depth in the direction of the optical axis of the camera 702.
  • Frame 708 in Figure 7B shows a set of points of the point cloud representation as crosses. It is observed that more points are associated with the ground plane 704 than with any other object appearing in the calibration scene.
  • the calibration method includes generating a histogram of y co-ordinates of points within the point cloud representation.
  • the bin widths of the histogram may be predetermined or may be determined in dependence on the co-ordinates of the points in the point cloud, for example to ensure a predetermined number of bins with frequency values above a threshold value.
  • Figure 7C shows a histogram of y co-ordinates for the point cloud shown in frame 708 of Figure 7B. It is observed that the modal bin 710, representing points for which −30 < y < −20, has a significantly greater frequency than the other bins.
  • points falling within the modal bin 710 are identified as ground plane points, because as noted above it is expected that more points will be associated with the ground plane 704 than with any other object in the calibration scene.
  • Using a histogram to identify ground plane points is simpler to implement, more robust, and less demanding of processing resources than alternative methods of determining a ground plane, for example using a semantic segmentation model.
  • the histogram used to identify ground plane points may be restricted to include only points which lie below the optical axis of the camera 702. Provided the angle between the optical axis and the ground plane 704 is not too great, points belonging to the ground plane 704 will lie beneath the axis of the camera 702.
  • the histogram may additionally, or alternatively, be restricted to include only points having a restricted range of depth values (z coordinates).
  • the restricted range of depth values may be bounded from above and/or from below by a threshold depth value or values.
  • the threshold depth value(s) may be predetermined or may depend on the points within the point cloud. In one example, an upper threshold depth value is given by the median or mean depth value of points within the point cloud, or a predetermined proportion thereof. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.
  • the calibration method proceeds with performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera 702 above the ground plane 704.
  • the uncalibrated height is given by minus the intercept of the ground plane 704 with a vertical y-axis with an origin at the camera position.
  • the statistical analysis may then involve determining a mean or median value of the y co-ordinates of the identified ground plane points, and taking this value as the y-intercept of the ground plane.
  • the statistical analysis may involve determining coefficients for an equation of a plane of best fit for the points identified as ground plane points, for example using a regression matrix approach.
  • the y-intercept of the ground plane is then taken as the y-intercept of the plane of best fit.
  • the scaling factor is calculated as a ratio of the actual height of the camera above the ground plane 704 to the uncalibrated height of the camera 702 above the ground plane 704 determined as discussed above.
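Putting these steps together, a hedged sketch of the histogram-based calibration is shown below; the bin count, the depth restriction relative to the median depth, and the use of the median of the modal bin as the statistical analysis are assumptions consistent with, but not prescribed by, the description above.

```python
import numpy as np

def ground_plane_scale_factor(depth_map, camera_matrix, true_camera_height,
                              num_bins=50, max_depth_fraction=1.0):
    """Estimate the scaling factor from a single uncalibrated depth map.

    depth_map: (H, W) uncalibrated depth values, camera_matrix: 3x3 intrinsics,
    true_camera_height: known metric height of the camera above the ground plane.
    """
    H, W = depth_map.shape
    fy, cy = camera_matrix[1, 1], camera_matrix[1, 2]

    # Back-project each pixel's vertical position into the (uncalibrated) camera
    # frame. Image y runs downwards, so negate to obtain an upward-pointing axis.
    us, vs = np.meshgrid(np.arange(W), np.arange(H))
    z = depth_map.astype(np.float64)
    y = -(vs - cy) * z / fy

    # Keep only points below the optical axis (y < 0) and within a restricted
    # depth range relative to the median depth of the point cloud.
    keep = (y < 0) & (z < max_depth_fraction * np.median(z))
    y_kept = y[keep]

    # Histogram of vertical positions; points in the modal bin are treated as
    # belonging to the ground plane.
    counts, edges = np.histogram(y_kept, bins=num_bins)
    modal = np.argmax(counts)
    ground = y_kept[(y_kept >= edges[modal]) & (y_kept < edges[modal + 1])]

    uncalibrated_height = -np.median(ground)
    return true_camera_height / uncalibrated_height
```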
  • the calibration method can be performed multiple times with differing scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values.
  • the calibration method may be performed repeatedly during the time that the depth estimation model is in use. For example, the calibration method may be performed for each image or video frame processed by the depth estimation model, or a subset thereof, to ensure that calibration of the model remains up to date.
  • performing the calibration method frequently in this way is facilitated by the efficient method of identifying the ground plane described herein.
  • the scaling factor may be determined in dependence on previous instances of the calibration method, for example as a most recent value or an accumulated or rolling average of the previously determined scaling factors.
  • the calibration method described with reference to Figures 7A-7C determines an uncalibrated height of a camera above a ground plane, based on statistical analysis of a point cloud derived from a trained depth estimation model.
  • An alternative approach involves determining the uncalibrated height of the camera using a ground plane estimation model trained alongside the depth estimation model. The inventors have found that this alternative approach can be less demanding on processing resources and power, which is highly desirable where the method is implemented using low power hardware, for example embedded systems onboard a vehicle or in a robotics setting.
  • a ground plane estimation model is a machine learning model which can be used to predict plane parameters associated with a ground region of a scene appearing within an image. Ground truth data is not typically available for such plane parameters, and therefore training a ground plane estimation model poses a challenge.
  • Figure 8 shows schematically a computer-implemented method of training a ground plane estimation model jointly with a depth estimation model.
  • Figure 8 includes features corresponding to the features of Figures 2 and 3 (with blocks sharing the same last two digits being functionally equivalent), though for the sake of clarity the optional binary mask 811 and the optional pose estimation model 809 are not shown in Figure 8.
  • the method of Figure 8 differs from the methods of Figures 2 and 3 in that the first training image 802 is additionally processed by a ground plane estimation model 818, which may be any suitable type of machine learning model arranged to take an image as an input and generate data indicative of a set of plane parameters.
  • the ground plane estimation model 818 may for example be a deep neural network model in which at least some of the layers are convolutional layers.
  • the depth estimation model 806 and the ground plane estimation model 818 (and, optionally, the pose estimation model 809) may share one or more neural network layers, for example an initial one or more convolutional layers. Sharing network layers in this way may alleviate the computational burden of the training process and also lead to faster convergence of the models.
  • the ground plane estimation model 818 may for example comprise one or more fully connected layers.
  • point cloud conversion 820 is performed on the candidate depth map 808, resulting in at least some of the depth values of the depth map 808 being converted into a point cloud 822.
  • all of the depth values of the depth map 808 may be converted into points of the point cloud 822.
  • alternatively, depth values of a predetermined portion of the depth map 808 (for example, a lower half or lower third of the depth map) are converted to points of the point cloud 822.
  • only depth map values which are estimated to correspond to a ground region of the scene are converted into points of the point cloud 822.
  • semantic labels may be obtained for pixels of the first training image 802, for example by processing the first training image 802 using a semantic segmentation model either during or prior to the present method being performed, or based on manual labelling.
  • the semantic labels may be multi-class labels, or may simply be binary labels indicating whether or not a given pixel is associated with the ground region of the scene.
  • the loss function 814 is augmented to include a ground plane loss 823 penalising distances between at least some points of the point cloud 822 and a ground plane estimated using the ground plane estimation model 818.
  • the inventors have found that, at least for many types of scene, it is a reasonable assumption that the ground region of scene contributes the most depth values, and accordingly the most points of the point cloud 822, of any part of the scene.
  • the ground plane loss 823 encourages the ground plane estimation model 818 to predict a plane which contains or nearly contains as many of the points as possible, which is highly likely to correspond to the ground region of the scene.
  • the ground plane loss 823 may be based upon all of the points of the point cloud 822. Alternatively, the ground plane loss 823 may be restricted to a subset, such as a random sample, of the points in the point cloud 822, for example in order to reduce the processing demands of the training iteration. As a further example, semantic labels may be used to restrict the ground plane loss 823 to points corresponding to the ground region of the scene.
  • the ground plane loss 823 may for example be a least squares loss, a (smoothed) L1 loss, an L2 loss, or any other suitable loss term whose value increases with increased distance of the points from the estimated ground plane (a code sketch of such a loss is given after this list).
  • the gradient of the loss value 816 resulting from the loss function 814 may be backpropagated as indicated by the dashed arrows to determine a gradient of the loss value 816 with respect to the trainable parameters of the depth estimation model 806 and of the ground plane estimation model 818 (and, optionally, of the pose estimation model 809).
  • the trainable parameters of the depth estimation model 806 and of the ground plane estimation model 818 may then be updated using stochastic gradient descent or a variant thereof, thereby to jointly train the depth estimation model 806 and the ground plane estimation model 818.
  • this backpropagation route can optionally be omitted such that the depth estimation model 806 is updated independently of the ground plane loss 823, with the ground plane loss being used solely to train the ground plane estimation model 818.
  • the loss function 814 may be further augmented to include a ground plane regularisation term (not shown) penalising deviations of the estimated ground plane from a horizontal orientation.
  • where the estimated ground plane is parameterised as ax + by + cz + d = 0 with the y-axis vertical, the ground plane regularisation term may penalise deviations of the quantity b/√(a² + b² + c²) from one.
  • the ground plane regularisation term may penalise an angle of the normal from the vertical.
  • the trained ground plane estimation model 818 may be supplied alongside the trained depth estimation model 806 for use in determining a scale factor for calibrating the trained depth estimation model 806.
  • the calibration method follows the same steps as the method described with reference to Figure 7, but with the trained ground plane estimation model 818 being used in the place of statistical analysis to determine the uncalibrated height of the camera.
  • the trained ground plane estimation model 818 may thus be used to calibrate a depth estimation model for which depth values are known up to a scale factor (e.g. where a pose estimation model has been used for training the depth estimation model). Even in cases where the calibration is known in principle (e.g. where the relative pose is known when training the depth estimation model), running the depth estimation model 806 on images having different resolutions, or captured using different camera setups etc., could alter this calibration, in which case the trained ground plane estimation model 818 may be used to recalibrate the depth estimation model 806.
  • the ground plane estimation model may be trained independently of the training of a depth estimation model.
  • the depth estimation model may for example be trained prior to the training of the ground plane estimation model, in which case the trained depth estimation model may be used to generate point clouds for use in training the ground plane estimation model.
  • the trained ground plane estimation model may be used for calibration of the trained depth estimation model.
  • an objective of the present disclosure is to provide a computer-implemented method of training a ground plane estimation model.
  • the method includes obtaining a training image comprising a view of a scene, processing the training image to generate a point cloud representation of the scene (for example using a depth map as an intermediate step, or directly), processing the training image using the ground plane estimation model to determine plane parameters for an estimated ground plane in the scene, and updating the ground plane estimation model so as to reduce a value of a loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
  • the training images are real images captured by one or more cameras
  • a set of training images may be views of a virtual scene.
  • Using virtual scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data. It may be advantageous to train the depth estimation model using a combination of synthetic training images and real training images.
  • a synthetic three-dimensional scene is generated using scene synthesis, for example using a deep neural network approach and/or using a video game engine, and a set of training images is generated virtually as views of the synthetic scene from respective virtual camera locations.
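As referenced in the list above, the following sketch illustrates one possible form of the ground plane loss 823 together with the optional regularisation term penalising deviation of the estimated plane from horizontal. It is written in Python using PyTorch purely for illustration; the tensor names, the smooth L1 choice and the regularisation weight are assumptions rather than values prescribed by the present disclosure.

```python
import torch

def ground_plane_loss(plane_params, points, reg_weight=0.1):
    """Sketch of a ground plane loss of the kind described above.

    plane_params: tensor [4] (a, b, c, d) predicted by the ground plane
                  estimation model, defining the plane a*x + b*y + c*z + d = 0.
    points:       tensor [N, 3] of (x, y, z) point cloud coordinates derived
                  from the candidate depth map (optionally a random subset, or
                  only points labelled as ground).
    """
    a, b, c, d = plane_params
    normal_norm = torch.sqrt(a ** 2 + b ** 2 + c ** 2) + 1e-8

    # Perpendicular distance of each point from the estimated plane.
    distances = torch.abs(points @ plane_params[:3] + d) / normal_norm

    # Smooth L1 penalty on the point-to-plane distances (any loss that
    # increases with distance, e.g. least squares, could be substituted).
    plane_loss = torch.nn.functional.smooth_l1_loss(
        distances, torch.zeros_like(distances))

    # Regularisation penalising deviation of the plane from horizontal:
    # with the y-axis vertical, b / ||(a, b, c)|| should be close to one.
    horizontal_reg = (1.0 - torch.abs(b) / normal_norm) ** 2

    return plane_loss + reg_weight * horizontal_reg
```

In a joint training iteration, the value returned by such a function would simply be added to the other terms of the loss function 814, such as the photometric difference term.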

Abstract

A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image includes receiving first and second training images respectively of first and second views of a scene, processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image. The loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.

Description

MONOCULAR DEPTH ESTIMATION
Technical Field
The present invention relates to training and calibrating a depth estimation model for inferring depth information from a monocular image. The invention has particular, but not exclusive, relevance to training and calibrating a depth estimation model for an advanced driver assistance system (ADAS) or an automated driving system (ADS).
Background
Depth estimation or ranging involves estimating distances to objects within a scene or environment. Such information may be provided as input data to an advanced driver assistance system (ADAS) or an automated driving system (ADS) for an autonomous vehicle. Depth information estimated in near-real time can be used, for example, to inform decisions on collision avoidance actions such as emergency braking and pedestrian crash avoidance mitigation, as well as for generating driver alerts such as proximity warnings in the case of ADAS.
Various methods are known for depth estimation, for example stereo matching, lidar, radar, structured light methods and photometric stereo methods. Many of these methods require specialist equipment and/or perform poorly in real world settings such as those encountered in the context of ADAS or ADS. Alongside these methods, deep learning approaches have been developed for estimating depth information from individual images captured using a standard monocular camera. In some examples, supervised learning is used to train a deep learning model using ground truth depth information determined using one of the methods described above. In other examples, depth information is recast as an unsupervised learning problem. In such methods, a first training image representing a first view of a scene is processed using a deep learning model to generate a candidate depth map. The candidate depth map is then used to project a second training image representing a second view of the scene to a frame of reference of the first training image, to generate a candidate reconstruction of the first training image. A photometric difference between the first training image and the candidate reconstruction of the first training image is then used as a training signal to update the deep learning model. The images may be captured simultaneously using a stereo camera rig, or at different times using a moving camera. In the latter case, the relationship between reference frames of the first and second views of the environment may not be known, in which case a pose estimation model is trained alongside the depth estimation model, using the same training signal, for predicting this relationship.
Once trained, a depth estimation model as described above is able to determine depth information based on individual images captured by a standard monocular camera. However, there is a need to improve the performance of such models when deployed in real life environments, particularly those typically encountered in the context of an ADAS or ADS, where for example reflections from a wet road surface can lead to erroneous depth information. Furthermore, a depth estimation model trained using a pose estimation model as mentioned above is not scale-aware, and therefore for practical applications the output of the depth estimation model must be calibrated for a given camera setup in order to determine a scaling for the determined depth information.
Summary
According to a first aspect of the present disclosure, there is provided a computer-implemented method of training a depth estimation model to generate a depth map from a monocular image. The method includes receiving first and second training images respectively of first and second views of a scene, and processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene. The method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image, and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map. The inventors have identified that reflections typically lead to noisier and consequently less smooth regions of the candidate depth map. By penalising the second derivative in the horizontal direction, the reflection mitigation term encourages the depth estimation model to minimise variation of the depth estimates in the horizontal direction for regions in which reflections are present, leading to depth estimates which are more consistent with a surface without any reflections. The reflection mitigation term therefore discourages the depth estimation model from inferring depth from regions exhibiting reflections, and instead encourages the depth estimation model to infer the depth for these regions from surrounding areas, leading to a depth map which is closer to the ground truth.
According to a second aspect, there is provided a further computer-implemented method of training a depth estimation model to generate a depth map from a monocular image. The method includes receiving first and second training images respectively of first and second views of a scene, processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array defining horizontal and vertical directions within the scene, and processing at least one of the first training image and the second training image to determine a binary mask indicative of distant regions of the scene. The method further includes projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between corresponding portions of the first training image and the candidate reconstruction of the first training image, wherein the corresponding portions exclude the distant regions of the scene as indicated by the binary mask.
By excluding contributions to the loss function from distant regions of the scene, the loss function is more strongly affected by regions of the scene in which depth variation is present and from which depth information can be validly inferred. This results in a higher rate of convergence of the depth estimation model during training, allowing the training to be performed using less training data, the collection of which may be time- and/or resource-demanding. In addition to the improved convergence properties, the accuracy of the resultant depth estimation model is found to be improved.
In examples, determining the binary mask includes generating a difference image corresponding to a difference of the first training image and the second training image, and binarizing the generated difference image. Determining the binary mask in this way is reliable, and the implementation is straightforward and significantly less demanding of processing resources than alternative approaches such as a semantic segmentation model. The binarization may be performed using a binarization threshold depending on pixel values of the first training image and/or the second training image. In this way, the binarization threshold can be made to automatically account for variations in properties of the training images, such as brightness of the training images.
The training images used for training the depth estimation model may be images captured by a camera. Alternatively, the method may further include generating the scene virtually using scene synthesis and then capturing a virtual image of the generated scene. Using synthetic scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data.
In examples, the method further includes processing the first and second training images using a pose estimation model to determine a candidate relative pose relating the first and second views of the scene, and the projecting of the second training image further uses the candidate relative pose. The depth estimation model and the pose estimation may respectively be first and second neural network models sharing one or more neural network layers. Sharing one or more neural network layers, for example convolutional layers, between the depth estimation model and the pose estimation model may alleviate some of the computational burden of the training process and lead to faster convergence of the models.
According to a third aspect, there is provided a computer-implemented method of determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image of a calibration scene comprising a calibration object, receiving data indicative of dimensions of the calibration object, and processing the calibration image to determine an orientation of the calibration object. The method further includes determining a calibrated depth difference between two predetermined points on the calibration object in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, processing the calibration image using the depth estimation model to generate a calibration depth map and determining an uncalibrated depth difference between the two predetermined points on the calibration object from the calibration depth map, and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference. Once the scaling factor has been determined, depth values determined by the depth estimation model can be multiplied by the scaling factor to give calibrated depth values, resulting in a scale-aware depth estimation model.
According to a fourth aspect, there is provided a further computer-implemented method of determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane and using the depth estimation model to process the calibration image model to determine a calibration depth map comprising an array of calibration depth values, the array defining horizontal and vertical directions. The method further includes processing the calibration depth map to generate a point cloud representation of the calibration scene, generating a histogram of vertical positions of points within the point cloud representation of the calibration scene, identifying points within a modal bin of the generated histogram as ground plane points, performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane, and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
The above calibration method does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. This method is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup. Using a histogram to identify ground plane points is more straightforward to implement, more robust, and less demanding of processing resources than alternative methods of identifying a ground plane, for example using a semantic segmentation model.
In examples, the generated histogram is restricted to points lying below an optical axis of the camera, and/or to points having a restricted range of depth values. An upper limit of the restricted range may for example be given by a predetermined proportion of a median depth value of points within the point cloud representation. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.
According to a fifth aspect, there is provided a method of training a ground plane estimation model. The method includes obtaining a first training image comprising a view of a scene, processing the first training image to generate a point cloud representation of at least part of the scene, processing the first training image using the ground plane estimation model to determine plane parameter values for an estimated ground plane within the scene, and updating the ground plane estimation model so as to reduce a value of a loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
In some examples, generating the point cloud representation includes processing the first training image using a depth estimation model to generate a depth map of the scene, and converting at least part of the depth map to the point cloud representation. The method may then further include training the depth estimation model, the training comprising updating the depth estimation model jointly with the ground plane estimation model so as to reduce the value of the loss function. The trained ground plane estimation model may then be used to determine a scale factor for calibrating the depth estimation model. This calibration method does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. This method is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup. Furthermore, using a ground plane estimation model involves significantly lower processing demands than alternative methods of identifying a ground plane, which is desirable for example if the calibration needs to be performed frequently and/or using low-power devices such as embedded computing systems.
According to a sixth aspect, there is provided a system including a mobile camera arranged to capture first and second training images respectively of first and second views of a scene as the mobile camera moves with respect to the scene, and a computing system arranged to process the first and second training images to train a depth estimation model, in accordance with any of the training methods described above.
According to a seventh aspect, there is provided a system comprising memory circuitry, processing circuitry, and a camera. The memory circuitry holds a depth estimation model and machine-readable instructions which, when executed by the processing circuitry, cause the camera to capture a calibration image of a calibration scene, and the system to determine a scaling factor for calibrating the depth estimation model using the captured calibration scene, in accordance with any of the calibration methods described above. The system may further include an advanced driver assistance system (ADAS) and/or an automated driving system (ADS) for a vehicle, and may be arranged to process images captured by the camera using the depth estimation model to generate input data for said ADAS and/or ADS.
According to an eighth aspect, there is provided a computer program product holding machine-readable instructions which, when executed by a computing system, cause the computing system to perform any of the computer-implemented methods described above.
Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.
Brief Description of the Drawings
Figure 1 schematically shows an example of apparatus for training a depth estimation model;
Figure 2 schematically shows a first example of a method for training a depth estimation model;
Figure 3 schematically shows a second example of a method for training a depth estimation model;
Figure 4 illustrates depth estimation from a monocular image using a model trained with and without a modified loss function;
Figure 5 illustrates a processing of two images of respective different views of a scene to generate a binary mask;
Figure 6 shows a first example of a method for calibrating a depth estimation model;
Figure 7A shows a system arranged to perform a second example of a method for calibrating a depth estimation model;
Figure 7B illustrates a processing of an image captured using the apparatus of Figure 7A to generate a point cloud;
Figure 7C shows a histogram of vertical positions of points within the point cloud of Figure 7B.
Figure 8 schematically shows an example of a method of training a ground plane estimation model alongside a depth estimation model.
Detailed Description
Figure 1 shows an example of apparatus for training a depth estimation model. The apparatus includes a camera 102 mounted on a moveable platform 104 arranged to move in a direction A along a path 106. The moveable platform 104 may for example be a trolley arranged to move along a rail or a track, or may be a vehicle such as a car or van arranged to drive along a road. The camera 102 in this example is a monocular camera arranged to capture a series of monocular images as the camera 102 and platform 104 move along the path 106. The camera 102 captures images when the camera 102 passes the dashed lines at times t1, t2 and t3, which in this example are equally separated both in position and time, though in other examples the camera 102 may capture images at times or positions which are not equally separated. The three images captured at times t1, t2 and t3 represent different views of a scene, meaning that the images contain overlapping portions of the environment, viewed from different perspectives. The frequency at which the images are captured may be several images every second or every tenth or hundredth of a second. The images may for example be frames of a video captured by the camera 102. Although in this example the camera 102 is shown as being forward-facing with respect to the direction of travel A, in other examples the camera 102 could be rear-facing, sideways-facing, or oblique to the direction of travel A. The images captured at times t1, t2 and t3 are referred to as a set of training images, as their function is to be processed together to train a depth estimation model as will be explained in more detail hereafter. In general, a set of training images in the context of the present disclosure includes two or more training images. Although in this example the set of training images is captured using a single mobile camera 102, in other examples a set of training images of different views of a scene may instead be captured simultaneously using multiple cameras, for example using a stereoscopic camera rig which includes a pair of cameras mounted side-by-side.
The apparatus of Figure 1 further includes a computer system 108 arranged to process training images captured by the camera 102 to train a depth estimation model. The computer system 108 may receive the training images in real-time as the camera 102 captures the training images, or may receive the training images in a batch fashion after several sets of training images have been captured, for example via wired or wireless means or via a removable storage device.
The computer system 108 may be a general-purpose computer such as a desktop computer, laptop, server or network-based data processing system, or may be a standalone device or module, for example an integral component of a vehicle. The computer system 108 includes memory circuitry 110 and processing circuitry 112. The processing circuitry 112 may include general purpose processing units and/or specialist processing units such as a graphics processing unit (GPU) or a neural processing unit (NPU). Additionally, or alternatively, the processing circuitry 112 may include an application-specific standard product (ASSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or any other type of integrated circuit suitable for carrying out the methods described hereinafter.
The memory circuitry 110 stores sets of training images, along with model data defining the depth estimation model and machine-readable instructions for a training routine used to train the depth estimation model as described below. In this example, the training images are captured by the camera 102 as described above, but in other examples the computer system 108 could receive training images from elsewhere, for example from a database in the case of historic data. The depth estimation model may be any suitable type of machine learning model arranged to take a colour or greyscale monocular image as input and generate a depth map as an output. The generated depth map comprises an array of direct depth values or inverse depth values (disparity values), with each entry of the array corresponding to an in-plane position within the monocular image. A depth value for a given point in the scene is a distance from the camera to that point in a direction parallel to the optical axis of the camera (conventionally referred to as the z direction). The array may have the same or different dimensions to the input image. In some examples, the depth estimation model is a deep neural network model with convolutional layers and having an encoder-decoder architecture, for example as described in the article “U-Net: Convolutional Networks for Biomedical Image Segmentation” by Ronneberger et al, arXiv: 1505.04597, 2015. Those skilled in the art will appreciate that many different neural network architectures can be used to process an input image to generate an output array. The model data stored in the memory 110 includes trainable parameters of the depth estimation model, for example kernel values, connection weights and bias values in the case of a deep neural network model. The model data further includes data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and alike.
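By way of illustration only, the following minimal sketch (Python with PyTorch) shows the kind of input/output contract described above for a depth estimation model: a colour image in, a dense depth (or disparity) map out. It is not the architecture of the depth estimation model described herein; the layer sizes, the sigmoid disparity output and the depth range used to rescale it are assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder sketch, illustrating only the interface of a
    depth estimation model: image [B, 3, H, W] in, depth map [B, 1, H, W] out
    (assuming H and W are divisible by 4)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # disparity in (0, 1)
        )

    def forward(self, image):
        disparity = self.decoder(self.encoder(image))
        # Map the bounded disparity to a depth range (bounds are assumptions).
        min_depth, max_depth = 0.1, 100.0
        depth = 1.0 / (1.0 / max_depth + (1.0 / min_depth - 1.0 / max_depth) * disparity)
        return depth
```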
Figure 2 shows schematically a first example of a method performed by the computer system 108 in which a set of training images comprising a first training image 202 and a second training image 204 is processed to train the depth estimation model held in the memory circuitry 110. The first and second training images 202, 204 are respectively of first and second views of a scene, captured at different times by the camera 102. The solid arrows in Figure 2 represent flow of data during a forward pass, and the dashed arrows represent backward propagation of errors following the forward pass, during a single training iteration.
The first training image 202 is processed using the depth estimation model 206 to generate a candidate depth map 208. The candidate depth map 208 is input to a projection model 210 which is arranged to project the second training image 204 into a frame of reference corresponding to that of the first training image 202, to generate a candidate reconstruction 212 of the first training image. The projection model 210 determines a projected pixel location for each pixel location in the candidate reconstruction 212, using the candidate depth map along with a relative pose relating the first and second views of the scene, and an intrinsic matrix characterising the camera 102. Pixel values are then sampled from the second training image 204, for example based on the nearest pixel to the projected pixel location or an interpolation between the nearest four pixels to the projected pixel location. The candidate reconstruction 212 is generated by applying this procedure throughout the region for which the projected pixel locations lie within the second training image 204. Pixels of the candidate reconstruction 212 for which the projected pixel location lies outside of the second training image 204 may be zeroed or otherwise padded, or alternatively the first training image 202 and the candidate reconstruction 212 may both be cropped to exclude border pixels, ensuring the resulting images can be compared as discussed hereafter.
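A sketch of the projection step performed by the projection model 210 is given below, assuming a PyTorch implementation. The function back-projects each pixel using the candidate depth map, transforms the resulting 3D points with the relative pose, projects them through the intrinsic matrix, and bilinearly samples the second training image; the border padding mode stands in for the zeroing or cropping of out-of-view pixels discussed above. All variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def reconstruct_first_image(second_image, depth_first, K, T_first_to_second):
    """Sketch of the projection model 210.

    second_image:       [B, 3, H, W] second training image
    depth_first:        [B, 1, H, W] candidate depth map for the first image
    K:                  [B, 3, 3] camera intrinsic matrix
    T_first_to_second:  [B, 4, 4] relative pose (first camera -> second camera)
    """
    B, _, H, W = second_image.shape
    device = second_image.device

    # Pixel grid of the first image in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project using the candidate depth map, transform with the relative
    # pose, then project into the second image.
    cam_points = torch.linalg.inv(K) @ pix * depth_first.view(B, 1, -1)
    cam_points = torch.cat([cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)
    cam_points_2 = (T_first_to_second @ cam_points)[:, :3]
    proj = K @ cam_points_2
    proj = proj[:, :2] / (proj[:, 2:3] + 1e-7)

    # Normalise projected locations to [-1, 1] and sample the second image.
    proj_x = 2.0 * proj[:, 0] / (W - 1) - 1.0
    proj_y = 2.0 * proj[:, 1] / (H - 1) - 1.0
    grid = torch.stack([proj_x, proj_y], dim=-1).view(B, H, W, 2)
    return F.grid_sample(second_image, grid, padding_mode="border", align_corners=True)
```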
The relative pose used by the projection model 210 for generating the candidate reconstruction 212 of the first training image can have up to six degrees of freedom, corresponding to a translation in three dimensions along with rotation angles about three axes, and may be represented as a 4x4 matrix. In the present example, the relative pose is assumed to be known, for example based on a known velocity of the camera 102 and a time elapsed between the capturing of the first and second training images 202, 204. The relative pose may be measured or configured in examples where the training images are captured simultaneously using a multi-camera rig. An example in which the relative pose is not known is described hereinafter with reference to Figure 3.
The first training image 202 and the candidate reconstruction 212 of the first training image are compared using a loss function 214 comprising a photometric difference term 215 penalising a photometric difference between the first training image 202 and the candidate reconstruction 212 of the first training image. The photometric difference may be calculated as an LI difference, L2 difference, structured similarity index (SSIM), or a combination of these metrics or any other suitable difference metrics, for example applied to the pixel intensities of the first training image 202 and the candidate reconstruction 212 of the first training image. The idea is that a more accurate candidate depth map 208 leads to a more accurate candidate reconstruction 212 of the first training image and a lower value of the photometric difference term 215. The photometric difference term 215 therefore plays the role of a training signal for training the depth estimation model 206.
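The photometric difference term 215 can be realised in several ways; the sketch below uses a weighted combination of a simplified SSIM term and an L1 difference, which is one common choice rather than a formulation mandated by the present disclosure. The window size and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def photometric_difference(img_a, img_b, alpha=0.85):
    """Per-pixel photometric difference between two images [B, 3, H, W],
    combining a simplified SSIM term with an L1 term (weights are assumed)."""
    l1 = torch.abs(img_a - img_b).mean(1, keepdim=True)

    # Local statistics over 3x3 windows via average pooling.
    mu_a = F.avg_pool2d(img_a, 3, 1, 1)
    mu_b = F.avg_pool2d(img_b, 3, 1, 1)
    sigma_a = F.avg_pool2d(img_a ** 2, 3, 1, 1) - mu_a ** 2
    sigma_b = F.avg_pool2d(img_b ** 2, 3, 1, 1) - mu_b ** 2
    sigma_ab = F.avg_pool2d(img_a * img_b, 3, 1, 1) - mu_a * mu_b
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_a * mu_b + c1) * (2 * sigma_ab + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (sigma_a + sigma_b + c2))
    ssim_term = torch.clamp((1 - ssim) / 2, 0, 1).mean(1, keepdim=True)

    return alpha * ssim_term + (1 - alpha) * l1
```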
Further to the photometric difference term 215, the loss function 214 may include additional terms to induce particular properties in the candidate depth map 208 generated by the depth estimation model 206. In particular, the loss function 214 may include a regularisation term (not shown) which induces smoothness in the candidate depth map 208. In an example, the loss function 214 includes an edge-aware regularisation term which penalises a gradient of the depth values of the candidate depth map 208 in the horizontal and vertical directions (inducing smoothness), weighted by a gradient of the pixel values of the first training image 202 (emphasising the effect of edges on the regularisation term). In order to mitigate the effect of erroneous depth estimations caused by reflections from a horizontal surface, in the present example the loss function 214 further includes a reflection mitigation term 217, as will be described in more detail hereinafter.
The rate of convergence during training of the depth estimation model and the accuracy of the resulting depth estimation model can be improved if the application of one or more terms of the loss function 214 is restricted to a subregion of the scene in which very distant regions are excluded. In the present example, the first training image 202 and the second training image 204 are processed to determine a binary mask 211 indicative of distant regions of the scene, for example those corresponding to the sky, and the loss function 214 is filtered using the binary mask 211 to exclude very distant regions of the scene. The implementation and effect of the binary mask 211 are described in more detail hereinafter
For a single training iteration, the loss function 214 may be applied to one or more pairs of training images in order to determine a loss value 216. For example, the loss function may be applied to multiple pairs of images in a set (for example, in the example of Figure 2, the loss function may be applied to the pair of images captured at t1 and t2, then to the pair of images captured at t2 and t3). Furthermore, in order to reduce bias, the loss function 214 may be applied to a batch comprising multiple randomly-selected sets of training images. Alternatively, or additionally, the loss function 214 may be applied to training images at multiple scales, which may result in a more robust depth estimation model 206. In another example, a composite candidate reconstruction of the first training image may be generated by projecting multiple further training images using the projection model 210, for example by taking an average of the resulting projections, in which case the loss function may be applied to the first training image 202 and the composite candidate reconstruction of the first training image.
The gradient of the resulting loss value 216 is backpropagated as indicated by the dashed arrows in Figure 2 to determine a gradient of the loss value 216 with respect to the trainable parameters of the depth estimation model. The trainable parameters of the depth estimation model are updated using stochastic gradient descent or a variant thereof. Over multiple such iterations, the depth estimation model 206 is trained to generate accurate depth maps from monocular images. Although in the present example the loss function 214 outputs a loss value 216 and the training aims to minimise the loss value 216, in other examples a loss function may be arranged to output a value which rewards photometric similarity between the first and second training images 202, 204, in which case the training aims to maximise this value, for example using gradient ascent or a variant thereof.
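Tying the above together, a single training iteration might look like the following skeleton, which reuses the sketches given earlier in this description. The optimiser choice, learning rate and the hypothetical training_pairs loader are assumptions; in practice the loss would also include the regularisation, reflection mitigation and masking terms described herein.

```python
import torch

# Illustrative training skeleton (all names are assumptions). TinyDepthNet,
# reconstruct_first_image and photometric_difference are the sketches above,
# standing in for the depth estimation model 206 and projection model 210.
depth_net = TinyDepthNet()
optimiser = torch.optim.Adam(depth_net.parameters(), lr=1e-4)

for first_img, second_img, K, relative_pose in training_pairs:  # hypothetical loader
    depth = depth_net(first_img)
    reconstruction = reconstruct_first_image(second_img, depth, K, relative_pose)
    loss = photometric_difference(first_img, reconstruction).mean()
    optimiser.zero_grad()
    loss.backward()   # backpropagation of the loss gradient
    optimiser.step()  # stochastic gradient descent variant (Adam here)
```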
In the example of Figure 2, it is assumed that the relative pose relating the first and second views of the scene is known. Figure 3 shows schematically a second example in which the relative pose is not assumed to be known a priori. It is likely that for some cases in which training images are captured at different times using a mobile camera, for example a camera mounted on a vehicle, the relative pose will not be known to a high degree of precision, in which case the method of Figure 3 may be preferable. Items in Figures 2 and 3 sharing the same last two digits are functionally equivalent.
The method of Figure 3 differs from the method of Figure 2 in that the first training image 302 and the second training image 304 are processed together using a pose estimation model 309 to generate a candidate relative pose relating the first and second views of the scene. The pose estimation model 309 may be any suitable type of machine learning model arranged to take two colour or greyscale images as input and generate data indicative of a relative pose, for example a vector with six components indicative of the six degrees of freedom mentioned above. The resulting vector can then be converted into a 4x4 transformation matrix. For example, the pose estimation model 309 may be a deep neural network model in which at least some of the layers are convolutional layers. Those skilled in the art will appreciate that many different neural network architectures can be used to process input images to generate an output vector. The pose estimation model 309 is defined by trainable parameters for example kernel values, connection weights and bias values in the case of a deep neural network model, along with further data which is not updated during training, for example data indicative of network architecture, along with hyperparameter values for controlling activation functions, optimiser preferences, and alike. In some examples, the depth estimation model 306 and the pose estimation model 309 may share one or more neural network layers, for example an initial one or more convolutional layers which may be applied to the first training image 302 for the depth estimation model and further applied to the second training image 304 for the pose estimation model. This may alleviate the computational burden of the training process and also lead to faster convergence of the models.
In the example of Figure 3, the candidate relative pose determined using the pose estimation model 309 is used in the projection model 310 in place of the known relative pose. The gradient of the resulting loss value 316 is backpropagated as indicated by the dashed arrows to determine a gradient of the loss value 316 with respect to the trainable parameters of the depth estimation model 306 and the trainable parameters of the pose estimation model 309, and the trainable parameters of the depth estimation model 306 and the pose estimation model 309 are updated using stochastic gradient descent or a variant thereof.
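For illustration, the conversion of the pose estimation model's six-component output into a 4x4 transformation matrix may be implemented as below. The assumed ordering (three translation components followed by an axis-angle rotation) and the use of Rodrigues' formula are implementation choices, not details prescribed by the present disclosure.

```python
import torch

def pose_vector_to_matrix(pose_vec):
    """Convert a 6-component pose vector (translation then axis-angle rotation,
    an assumed ordering) into a 4x4 transformation matrix.

    pose_vec: [B, 6] -> returns [B, 4, 4]
    """
    B = pose_vec.shape[0]
    t = pose_vec[:, :3]
    axis_angle = pose_vec[:, 3:]

    angle = torch.linalg.norm(axis_angle, dim=1, keepdim=True).clamp(min=1e-8)
    axis = axis_angle / angle
    x, y, z = axis[:, 0], axis[:, 1], axis[:, 2]
    cos, sin = torch.cos(angle)[:, 0], torch.sin(angle)[:, 0]
    C = 1 - cos

    # Rodrigues' rotation formula, assembled row by row.
    R = torch.stack([
        torch.stack([cos + x * x * C, x * y * C - z * sin, x * z * C + y * sin], dim=1),
        torch.stack([y * x * C + z * sin, cos + y * y * C, y * z * C - x * sin], dim=1),
        torch.stack([z * x * C - y * sin, z * y * C + x * sin, cos + z * z * C], dim=1),
    ], dim=1)  # [B, 3, 3]

    T = torch.eye(4, device=pose_vec.device).repeat(B, 1, 1)
    T[:, :3, :3] = R
    T[:, :3, 3] = t
    return T
```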
As explained above, the loss functions used to train the depth estimation model may include a reflection mitigation term to mitigate the effect of erroneous depth estimations caused by reflections from a horizontal surface. Frame 402 of Figure 4 is an example of an image showing, inter alia, a road 404 with two wet patches 406, 408 on its surface, and a vehicle 410 driving on the road 404. Frame 412 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 without the reflection mitigation term. Frame 414 shows the same image with dashed lines representing iso-depth contours as determined using a depth estimation model trained using the method of Figure 2 or 3 with the reflection mitigation term.
It is observed in frame 412 that the wet patches 406, 408 on the surface of the road 404 lead to erroneous depth estimates in the corresponding regions of the estimated depth map. In particular, the wet patch 406 shows a reflection of the sky, and is therefore estimated to have much greater depth values than those of the surrounding portions of the road 404. The wet patch 408 also shows a reflection of the sky, along with part of the vehicle 410, and different regions of the wet patch 408 are therefore estimated to have greatly differing depth values, all of which are greatly different from the depth values of the surrounding portions of the road 404. The erroneous depth estimates for the wet patches 406, 408 lead to an unrealistic representation of the scene, as can be seen by comparing frames 412 and 414. Such unrealistic representations may have detrimental or dangerous consequences, for example when used as input data for an ADAS or ADS. It will be appreciated that erroneous depth estimations caused by reflective surfaces would be even more problematic when the scene has more reflective surfaces, for example a road surface after heavy rain. Although in the example of Figure 4 the effects of the reflections are predictable based on ray tracing considerations, in examples where the reflective surface is not smooth (for example due to ripples on the surface of a puddle), the reflections will typically lead to noisy and unpredictable erroneous depth estimates.
In order to reduce the detrimental effects of reflections, the reflection mitigation term penalises a second derivative in the horizontal direction of at least a portion of the candidate depth map. This is in contrast with the regularisation term mentioned above, which instead penalises first order derivatives. The inventors have identified that, compared with other regions of a scene, reflections typically lead to noisier and consequently less smooth regions of the candidate depth map. By penalising the second derivative in the horizontal direction, the reflection mitigation term encourages reduced variation of the depth estimates in the horizontal direction for regions in which reflections are present, as would be expected for example for a surface (such as a horizontal road surface) without reflections. The reflection mitigation term therefore discourages the depth estimation model from inferring depth from regions exhibiting reflections, and instead encourages the depth estimation model to infer the depth for these regions from surrounding areas, leading to a depth map which is closer to the ground truth.
The reflection mitigation term may be applied to the entirety of the candidate depth map, or only to a portion of the candidate depth map. For example, the reflection mitigation term may be applied to regions identified using one or more binary masks. In one example, a semantic segmentation model is used to generate a binary road mask identifying a region of the input image as a road surface, and the reflection mitigation term is only applied to the region identified as the road surface. Limiting the application of the reflection mitigation term to a particular region or subregion of the candidate depth map may be advantageous in that the term will not interfere with the performance of the depth estimation model on regions not resulting from reflections. However, the inventors have found that the reflection mitigation term as described herein has little effect on the performance of the depth estimation model on such regions, and is capable of reducing erroneous depth estimations caused by reflections without significantly affecting the performance of the depth estimation model elsewhere. Therefore, the reflection mitigation term can be applied without the additional complication and processing demands of implementing a semantic segmentation model.
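A sketch of the reflection mitigation term is given below: a finite-difference approximation of the second derivative of the candidate depth map in the horizontal direction, optionally restricted by a binary mask as discussed above. The use of an absolute-value penalty is an assumption; other increasing functions of the second derivative could be substituted.

```python
import torch

def reflection_mitigation_term(depth_map, mask=None):
    """Penalise the second derivative of the depth map along the horizontal
    (width) axis.

    depth_map: [B, 1, H, W] candidate depth map.
    mask:      optional [B, 1, H, W] binary mask restricting the term to a
               region of interest (e.g. a road region); if omitted the term
               is applied to the whole depth map.
    """
    # Second-order finite difference along the horizontal direction.
    second_deriv = (depth_map[:, :, :, 2:]
                    - 2 * depth_map[:, :, :, 1:-1]
                    + depth_map[:, :, :, :-2])
    penalty = torch.abs(second_deriv)
    if mask is not None:
        penalty = penalty * mask[:, :, :, 1:-1]
    return penalty.mean()
```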
As mentioned above, the rate of convergence during training of the depth estimation model (and the pose estimation model if used), and the accuracy of the resulting depth estimation model (and pose estimation model if used) can be improved if the application of one or more terms of the loss function is restricted to a subregion of the scene in which very distant regions of the scene are excluded. In order to restrict the application of the loss function in this way, the first training image and/or the second training image are processed to determine a binary mask indicative of distant regions of the scene, for example those corresponding to the sky. The binary mask may be determined for example using a semantic segmentation model, but a more computationally efficient method is to generate a difference image or delta image by subtracting the first training image from the second training image or vice versa. The first and second training images may optionally be downsampled or converted to greyscale before the subtraction is performed. The pixel values of the difference image are then binarized on the basis of a binarization threshold. The binarization threshold may be predetermined or alternatively may be determined in dependence on pixel values of the first training image and/or the second training image. For example, the binarization threshold may depend on pixel values of the difference image, for example such that a predetermined proportion of the pixels are excluded from the binary mask. Alternatively, the binarization threshold may be set to a value where a sharp drop in histogram frequency of the difference pixel values is observed, since it is expected that pixels of very distant regions will occur with a significantly higher frequency than pixels corresponding to any other narrow distance range. Alternatively, or additionally, the binarization threshold may depend directly on pixel values of the first and/or second training image. In this way, the binarization threshold can automatically account for variations in brightness of the training images. In any case, the binarization threshold is determined such that the binarized difference image will only be zero when the corresponding pixels of the first and second training images are almost identical. Very distant regions of the scene are expected to appear almost identically in the first and second training images, and will therefore lead to a binarized difference value of zero, whereas other regions of the scene are likely to have binarized pixel values of one.
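The difference-image approach to the binary mask can be sketched as follows. The quantile-based threshold shown here implements only one of the thresholding strategies described above (excluding a predetermined proportion of pixels); the proportion itself is an illustrative assumption.

```python
import torch

def distant_region_mask(first_image, second_image, proportion=0.25):
    """Binary mask from a binarized difference image.

    first_image, second_image: [B, 3, H, W] with values in [0, 1].
    Returns a mask [B, 1, H, W] which is 0 for (nearly) unchanged pixels,
    assumed to correspond to very distant regions such as sky.
    """
    # Greyscale difference image.
    diff = torch.abs(first_image - second_image).mean(dim=1, keepdim=True)

    # Per-image threshold at the requested quantile of the difference values,
    # so that roughly the given proportion of pixels is excluded.
    flat = diff.flatten(1)
    threshold = torch.quantile(flat, proportion, dim=1).view(-1, 1, 1, 1)

    return (diff > threshold).float()
```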
Figure 5 shows an example of a first training image 502 and a second training image 504 respectively of first and second views of a scene, along with a binary mask represented by black regions of the frame 506, as determined using the method described above. It is observed that the binary mask omits regions of the scene that contain sky in both the first and second training images 502, 504. It is further observed that in this example, reflections of the sky in the wet patches on the road are not omitted from the binary mask, as the noisy appearance of these regions result in different pixel values in the different training images.
As mentioned above, the reflection mitigation term is effective when applied to the entirety of the candidate depth map or only a subregion of the candidate depth map. In some examples, the reflection mitigation term is only applied to a region corresponding to a binary mask determined as described above. The binary mask may be applied to some or all of the terms in the loss function, irrespective of whether the loss function includes a reflection mitigation term.
In examples where the relative pose relating views of a scene is not known, such as when a pose estimation model is trained alongside the depth estimation model, the depth maps output by the trained depth estimation model are not scale-aware. In other words, depth values estimated by the model are only known up to a multiplicative scaling factor. For many practical applications, such as when the depth estimation model is used to generate input data for an ADS or ADAS, the scaling factor must be determined. Although the trained model may be valid for use with a range of cameras for example having different lenses and intrinsic parameters, the scaling factor may be different for different cameras. An objective of the present disclosure is to provide an automated calibration method for determining the scaling factor. The calibration procedure may be applied shortly after training the depth estimation model and pose estimation model, or alternatively may be applied at a later time and/or in a different location, for example when the trained depth estimation model is deployed to perform inference. The latter may be advantageous if the depth estimation model is to be calibrated for a different camera to the one used to capture the training images, for example where the trained depth estimation model is provided as software for several different models of vehicle.
Figure 6 illustrates a first example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model. The method includes receiving a calibration image 602 of a calibration scene comprising a calibration object 604. The calibration object 604 has known dimensions which are provided to the computer system performing the calibration method. In this example, the calibration object has a planar surface decorated with a 4x4 black and white checkerboard pattern. A checkerboard pattern is advantageous because the high contrast between adjacent squares allows the interior corners of the pattern to be detected easily using known image processing techniques. Nevertheless, other calibration objects may be used, namely any object with known dimensions which has a detectable pattern such as one or more squares, rectangles, circles or other shapes. The calibration object in this example is positioned such that the plane of the checkerboard pattern is substantially parallel to the optical axis of the camera used to capture the calibration image. This is not essential, however, and a calibration object may be oriented in any way provided that there is some depth variation between points of the calibration object visible in the calibration image.
The calibration image is processed to determine an orientation of the calibration object. The orientation has three degrees of freedom corresponding to rotations around three axes. In the present example, corner detection methods are used to detect the positions of three or more of the corners of the checkerboard pattern (for example, the three interior corners indicated by crosses in frame 606 of Figure 6, in which the calibration object 604 is shown in line drawing style for clarity). Having detected the three corners of the calibration object 604, the orientation of the calibration object is calculated directly from the positions of these corners in the calibration image, using geometric considerations. In other examples, other points on a calibration object may be detected using any suitable image processing method and used for determining the orientation of the calibration object. Three points is sufficient for determining the orientation, irrespective of whether the points lie within a planar surface of the calibration object. However, those skilled in the art will appreciate that using more than three points to determine the orientation, for example taking an average of several orientations calculated using different sets of points, may result in improved accuracy.
When the orientation of the calibration object 604 has been determined, the method includes determining a calibrated real-world depth difference between two predetermined points on the calibration object, in dependence on the determined orientation of the calibration object and the known dimensions of the calibration object. The predetermined points should not lie on a line perpendicular to the optical axis of the camera (which would have a depth difference of zero), and the accuracy of the calibration method is improved by choosing predetermined points that have a relatively large depth difference. In the example of Figure 6, the depth difference between the leftmost crosses in frame 606 is determined to be Δd.
The method includes processing the calibration image 602 using the depth estimation model to generate an uncalibrated depth difference between the two predetermined points in the calibration image. As explained above, the depth estimation model is arranged to generate a depth map associating respective depth values with an array of points in the calibration image, but these depth values are uncalibrated, i.e. only known up to a multiplicative constant. The uncalibrated depth difference can therefore be determined by selecting points within the depth map corresponding to the predetermined points within the calibration image, and subtracting the uncalibrated depth values of these points. Frame 608 shows the calibration image with iso-depth contours determined using the depth estimation model. The uncalibrated depth difference between the corners corresponding to the leftmost crosses in frame 606 is determined to be Δz.
The scaling factor is calculated as a ratio of the calibrated depth difference to the uncalibrated depth difference. In the example of Figure 6, the scaling factor is given by Δd/Δz. Optionally, the calibration method can be performed multiple times with the calibration object placed at different locations and depths within a scene, or in different scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values.
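A possible implementation of the Figure 6 calibration is sketched below using OpenCV. Note that it recovers the orientation of the calibration object with cv2.solvePnP rather than the direct three-corner geometric construction described above, and it assumes a 4x4 checkerboard with a known square size and a depth map at the same resolution as the calibration image; the chosen corner indices and square size are illustrative.

```python
import cv2
import numpy as np

def scaling_factor_from_checkerboard(calibration_image, uncalibrated_depth,
                                     camera_matrix, square_size_m=0.1):
    """Sketch of the Figure 6 calibration using checkerboard corner detection.

    calibration_image:  BGR image containing the checkerboard.
    uncalibrated_depth: HxW depth map from the depth estimation model.
    camera_matrix:      3x3 intrinsic matrix.
    """
    grey = cv2.cvtColor(calibration_image, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(grey, (3, 3))  # 3x3 interior corners
    if not found:
        raise RuntimeError("calibration pattern not detected")

    # Known 3D corner positions on the planar calibration object.
    object_points = np.zeros((9, 3), np.float32)
    object_points[:, :2] = np.mgrid[0:3, 0:3].T.reshape(-1, 2) * square_size_m

    _, rvec, tvec = cv2.solvePnP(object_points, corners, camera_matrix, None)
    R, _ = cv2.Rodrigues(rvec)

    # Calibrated depth difference between two predetermined corners (indices
    # chosen for illustration; in practice pick corners with a large depth
    # separation). The difference depends only on R and the known dimensions.
    p0 = R @ object_points[0] + tvec.ravel()
    p1 = R @ object_points[2] + tvec.ravel()
    calibrated_dz = abs(p1[2] - p0[2])

    # Uncalibrated depth difference between the same corners, read from the
    # depth map produced by the depth estimation model.
    u0, v0 = corners[0].ravel().astype(int)
    u1, v1 = corners[2].ravel().astype(int)
    uncalibrated_dz = abs(uncalibrated_depth[v1, u1] - uncalibrated_depth[v0, u0])

    return calibrated_dz / uncalibrated_dz
```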
The method of Figure 6 provides an accurate, automated method of calibrating a depth estimation model. However, in some circumstances it may be inconvenient or impracticable to obtain a suitable calibration object and/or to perform the steps of capturing calibration images containing a calibration object. In particular, this is unlikely to be a convenient solution where the depth estimation model is provided as software for a vehicle with a camera setup not available when the model is trained. Figures 7A-C illustrate a second example of a computer-implemented method for determining a scaling factor for calibrating a depth estimation model. Unlike the method of Figure 6, the method of Figures 7A-C does not require a calibration object, and can be applied at run-time to calibrate or re-calibrate the depth estimation model for a particular camera setup. The method of Figures 7A-C is therefore particularly suitable when the depth estimation model is provided as software to be used in conjunction with an unknown camera setup.
Figure 7A shows a vehicle 700 with a camera 702 and a computer system 703 configured as an ADS/ADAS. The computer system 703 is arranged to process images captured by the camera 702 using a depth estimation model to generate input data for the ADS/ADAS. The computer system 703 is further arranged to cause the camera 702 to capture a calibration image of a calibration scene, and to determine a scaling factor for calibrating the depth estimation model using the captured calibration scene as described hereafter. The camera 702 is mounted a height h above a ground plane 704, which in this example is a road surface, where h is a value known to the computer system 703. Frame 706 of Figure 7B shows a calibration image of a calibration scene captured using the camera 702, along with iso-depth contours representing a calibration depth map determined using the depth estimation model. The calibration method includes processing the calibration depth map to generate a point cloud representation of the calibration scene. Whereas the calibration depth map associates depth values with an array of points, the point cloud representation indicates three-dimensional co-ordinates of a set of points representing objects in the scene. The co-ordinates in the point cloud representation are uncalibrated because the calibration depth map is uncalibrated. According to the convention used in the present example, the x and y co-ordinates are horizontal and vertical co-ordinates in the plane of the calibration image, and the z co-ordinate represents depth in the direction of the optical axis of the camera 702. Those skilled in the art are aware of methods of converting a depth map to a point cloud representation. Frame 708 in Figure 7B shows a set of points of the point cloud representation as crosses. It is observed that more points are associated with the ground plane 704 than with any other object appearing in the calibration scene.
The calibration method includes generating a histogram of y co-ordinates of points within the point cloud representation. The bin widths of the histogram may be predetermined or may be determined in dependence on the co-ordinates of the points in the point cloud, for example to ensure a predetermined number of bins with frequency values above a threshold value. Figure 7C shows a histogram of y co-ordinates for the point cloud shown in frame 708 of Figure 7B. It is observed that the modal bin 710 representing points for which -30 < y < -20 has a significantly greater frequency than the other bins. In accordance with the present method, points falling within the modal bin 710 are identified as ground plane points, because as noted above it is expected that more points will be associated with the ground plane 704 than with any other object in the calibration scene. Using a histogram to identify ground plane points is simpler to implement, more robust, and less demanding of processing resources than alternative methods of determining a ground plane, for example using a semantic segmentation model.
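A minimal sketch of the histogram-based selection is given below (Python/NumPy); the bin width is an illustrative choice rather than a value taken from the disclosure.

```python
import numpy as np

def ground_plane_points(points, bin_width=5.0):
    """Identify ground plane points as those in the modal bin of a histogram of
    y co-ordinates of an (N, 3) point cloud."""
    y = points[:, 1]
    n_bins = max(1, int(np.ceil((y.max() - y.min()) / bin_width)))
    counts, edges = np.histogram(y, bins=n_bins)
    k = int(np.argmax(counts))                          # modal bin
    in_modal_bin = (y >= edges[k]) & (y <= edges[k + 1])
    return points[in_modal_bin]

# Hypothetical usage with a random point cloud:
ground = ground_plane_points(np.random.randn(5000, 3) * 10.0)
```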
The histogram used to identify ground plane points may be restricted to include only points which lie below the optical axis of the camera 702. Provided the angle between the optical axis and the ground plane 704 is not too great, points belonging to the ground plane 704 will lie beneath the axis of the camera 702. The histogram may additionally, or alternatively, be restricted to include only points having a restricted range of depth values (z coordinates). The restricted range of depth values may be bounded from above and/or from below by a threshold depth value or values. The threshold depth value(s) may be predetermined or may depend on the points within the point cloud. In one example, an upper threshold depth value is given by the median or mean depth value of points within the point cloud, or a predetermined proportion thereof. In this way, distant points, for which the co-ordinates of the points are likely to be less accurately determined, are omitted, increasing the accuracy of the calibration method whilst also reducing processing demands.
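The following short sketch (Python/NumPy, illustrative only) applies the two optional restrictions described above before the histogram is built: keeping only points below the optical axis and points whose depth lies below a threshold derived from the median depth.

```python
import numpy as np

def restrict_points(points, depth_fraction=1.0):
    """Keep only points below the optical axis (y < 0) and with depth below a
    cap given by a proportion of the median depth (depth_fraction is illustrative)."""
    y, z = points[:, 1], points[:, 2]
    depth_cap = depth_fraction * np.median(z)
    return points[(y < 0.0) & (z < depth_cap)]
```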
The calibration method proceeds with performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera 702 above the ground plane 704. The uncalibrated height is given by minus the intercept of the ground plane 704 with a vertical y-axis having its origin at the camera position. In one example, it is assumed that the ground plane 704 is parallel to the optical axis of the camera 702. The statistical analysis may then involve determining a mean or median value of the y co-ordinates of the identified ground plane points, and taking this value as the y-intercept of the ground plane. Alternatively, if it is not assumed that the ground plane 704 is parallel to the optical axis of the camera (for example because the optical axis is inclined or declined with respect to the ground plane 704 or vice versa), then the statistical analysis may involve determining coefficients for an equation of a plane of best fit for the points identified as ground plane points, for example using a regression matrix approach. The y-intercept of the ground plane is then taken as the y-intercept of the plane of best fit.
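A possible implementation of this statistical analysis is sketched below (Python/NumPy). The parallel-axis case takes the median y value as the intercept; the general case fits a plane y = a·x + b·z + c by least squares and uses its intercept c. Function and parameter names are illustrative.

```python
import numpy as np

def uncalibrated_camera_height(ground_points, assume_parallel=True):
    """Estimate the (uncalibrated) camera height as minus the y-intercept of the
    ground plane, either from the median y value or from a plane of best fit."""
    x, y, z = ground_points[:, 0], ground_points[:, 1], ground_points[:, 2]
    if assume_parallel:
        intercept = float(np.median(y))                    # ground parallel to optical axis
    else:
        design = np.column_stack([x, z, np.ones_like(x)])  # fit y = a*x + b*z + c
        (a, b, c), *_ = np.linalg.lstsq(design, y, rcond=None)
        intercept = float(c)
    return -intercept
```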
The scaling factor is calculated as a ratio of the actual height of the camera above the ground plane 704 to the uncalibrated height of the camera 702 above the ground plane 704 determined as discussed above. Optionally, the calibration method can be performed multiple times with differing scenes, and the final scaling factor can be calculated from the resulting scaling factors at each instance, for example as a mean or median value. Once the final scaling factor has been determined, all depth values determined by the depth estimation model are multiplied by the scaling factor to give calibrated depth values. In some examples, the calibration method may be performed repeatedly during the time that the depth estimation model is in use. For example, the calibration method may be performed for each image or video frame processed by the depth estimation model, or a subset thereof, to ensure that calibration of the model remains up to date. This is facilitated by the efficient method of identifying the ground plane described herein. In cases where the calibration method is performed repeatedly, there may be certain images or frames for which a scaling factor cannot be determined, for example in the case of a vehicle driving over the brow of a hill or about to drive up a hill. In these cases, the scaling factor may be determined in dependence on previous instances of the calibration method, for example as a most recent value or an accumulated or rolling average of the previously determined scaling factors.
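One way this per-frame updating might look is sketched below (plain Python, illustrative only). The class keeps an exponential rolling average of the scaling factor and reuses the last value when a frame yields no usable ground plane; the mounting height and momentum value are hypothetical.

```python
class RollingScaleEstimator:
    """Maintain a rolling scaling factor across frames; when a frame yields no
    usable ground plane (e.g. cresting a hill), the last value is reused."""

    def __init__(self, true_height, momentum=0.9):
        self.true_height = true_height   # known mounting height h, e.g. in metres
        self.momentum = momentum
        self.scale = None

    def update(self, uncalibrated_height):
        if uncalibrated_height is None or uncalibrated_height <= 0.0:
            return self.scale            # no valid estimate for this frame
        s = self.true_height / uncalibrated_height
        self.scale = s if self.scale is None else (
            self.momentum * self.scale + (1.0 - self.momentum) * s)
        return self.scale

# Hypothetical usage: a camera mounted 1.5 m above the road.
estimator = RollingScaleEstimator(true_height=1.5)
for h_uncal in [14.8, 15.1, None, 14.9]:     # per-frame uncalibrated heights
    scale = estimator.update(h_uncal)
```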
The calibration method described with reference to Figures 7A-7C determines an uncalibrated height of a camera above a ground plane, based on statistical analysis of a point cloud derived from a trained depth estimation model. An alternative approach involves determining the uncalibrated height of the camera using a ground plane estimation model trained alongside the depth estimation model. The inventors have found that this alternative approach can be less demanding on processing resources and power, which is highly desirable where the method is implemented using low power hardware, for example embedded systems onboard a vehicle or in a robotics setting. In the context of the present disclosure, a ground plane estimation model is a machine learning model which can be used to predict plane parameters associated with a ground region of a scene appearing within an image. Ground truth data is not typically available for such plane parameters, and therefore training a ground plane estimation model poses a challenge.
Figure 8 shows schematically a computer-implemented method of training a ground plane estimation model jointly with a depth estimation model. Figure 8 includes features corresponding to the features of Figures 2 and 3 (with blocks sharing the same last two digits being functionally equivalent), though for the sake of clarity the optional binary mask 811 and the optional pose estimation model 809 are not shown in Figure 8. The method of Figure 8 differs from the methods of Figures 2 and 3 in that the first training image 802 is additionally processed by a ground plane estimation model 818, which may be any suitable type of machine learning model arranged to take an image as an input and generate data indicative of a set of plane parameters. A plane in three dimensions may be represented in a number of ways, for example by a Cartesian equation ax + by + cz + d = 0, where the four scalars a, b, c, d are plane parameters and the vector (a, b, c)^T is normal to the plane. The ground plane estimation model 818 may for example be a deep neural network model in which at least some of the layers are convolutional layers. In some examples, the depth estimation model 806 and the ground plane estimation model 818 (and, optionally, the pose estimation model 809) may share one or more neural network layers, for example an initial one or more convolutional layers. Sharing network layers in this way may alleviate the computational burden of the training process and also lead to faster convergence of the models. Subsequent to the shared initial layers, the ground plane estimation model 818 may for example comprise one or more fully connected layers.
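A toy illustration of this shared-encoder arrangement is given below in Python/PyTorch. It is a deliberately small sketch rather than the architecture of the disclosure: the layer sizes are arbitrary, the depth head outputs a reduced-resolution depth map, and the plane head outputs four plane parameters a, b, c, d via fully connected layers after the shared convolutional layers.

```python
import torch
import torch.nn as nn

class SharedEncoderHeads(nn.Module):
    """Toy depth + ground-plane network with shared convolutional layers."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # shared initial layers
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.depth_head = nn.Sequential(           # dense (reduced-resolution) depth
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())
        self.plane_head = nn.Sequential(           # four plane parameters a, b, c, d
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 4))

    def forward(self, image):
        features = self.encoder(image)
        return self.depth_head(features), self.plane_head(features)

# Hypothetical usage on a single 192x640 RGB image:
model = SharedEncoderHeads()
depth, plane = model(torch.randn(1, 3, 192, 640))   # depth: (1, 1, 48, 160), plane: (1, 4)
```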
In the method of Figure 8, point cloud conversion 820 is performed on the candidate depth map 808, resulting in at least some of the depth values of the depth map 808 being converted into a point cloud 822. In a first example, all of the depth values of the depth map 808 are converted into points of the point cloud 822. In a second example, depth values of a predetermined portion of the depth map 808 (for example, a lower half or lower third of the depth map) are converted to points of the point cloud 822. In a third example, only depth map values which are estimated to correspond to a ground region of the scene are converted into points of the point cloud 822. In this case, semantic labels may be obtained for pixels of the first training image 802, for example by processing the first training image 802 using a semantic segmentation model either during or prior to the present method being performed, or based on manual labelling. The semantic labels may be multi-class labels, or may simply be binary labels indicating whether or not a given pixel is associated with the ground region of the scene.
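The second example above, selecting only a lower portion of the depth map for conversion, might be implemented as in the following sketch (Python/PyTorch, illustrative only); a semantic ground mask, where available, could be used in place of the fixed lower-third mask.

```python
import torch

def lower_third_mask(depth):
    """Boolean mask over a (B, 1, H, W) depth map selecting its lower third, a
    simple stand-in for a semantic ground mask when no labels are available."""
    h = depth.shape[-2]
    mask = torch.zeros_like(depth, dtype=torch.bool)
    mask[..., h - h // 3:, :] = True
    return mask

# e.g. keep only the lower-third depth values for point cloud conversion:
depth = torch.rand(1, 1, 48, 160)
selected = depth[lower_third_mask(depth)]   # 1-D tensor of candidate depth values
```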
In order to train the ground plane estimation model 818, the loss function 814 is augmented to include a ground plane loss 823 penalising distances between at least some points of the point cloud 822 and a ground plane estimated using the ground plane estimation model 818. The inventors have found that, at least for many types of scene, it is a reasonable assumption that the ground region of the scene contributes the most depth values, and accordingly the most points of the point cloud 822, of any part of the scene. By penalising distances between the points and the estimated ground plane, the ground plane loss 823 encourages the ground plane estimation model 818 to predict a plane which contains or nearly contains as many of the points as possible, which is highly likely to correspond to the ground region of the scene. The ground plane loss 823 may be based upon all of the points of the point cloud 822. Alternatively, the ground plane loss 823 may be restricted to a subset, such as a random sample, of the points in the point cloud 822, for example in order to reduce the processing demands of the training iteration. As a further example, semantic labels may be used to restrict the ground plane loss 823 to points corresponding to the ground region of the scene. The ground plane loss 823 may for example be a least squares loss, a (smoothed) L1 loss, an L2 loss, or any other suitable loss term whose value increases with increased distance of the points from the estimated ground plane.
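As a non-limiting sketch of such a loss term, the following Python/PyTorch function applies a smoothed-L1 penalty to point-to-plane distances; the batch size and number of points in the usage line are placeholders.

```python
import torch
import torch.nn.functional as F

def ground_plane_loss(points, plane_params, eps=1e-8):
    """Smoothed-L1 penalty on point-to-plane distances.

    points: (B, N, 3) points converted from the candidate depth map.
    plane_params: (B, 4) predicted (a, b, c, d) with ax + by + cz + d = 0.
    """
    normal, d = plane_params[:, :3], plane_params[:, 3]
    signed = torch.einsum('bnc,bc->bn', points, normal) + d.unsqueeze(1)
    distances = signed.abs() / (normal.norm(dim=1, keepdim=True) + eps)
    return F.smooth_l1_loss(distances, torch.zeros_like(distances))

# Hypothetical shapes: batch of 2 images, 4096 points per image.
loss = ground_plane_loss(torch.randn(2, 4096, 3), torch.randn(2, 4))
```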
The loss value 816 resulting from the loss function 814 may be backpropagated as indicated by the dashed arrows to determine a gradient of the loss value 816 with respect to the trainable parameters of the depth estimation model 806 and of the ground plane estimation model 818 (and, optionally, of the pose estimation model 809). The trainable parameters of the depth estimation model 806 and of the ground plane estimation model 818 may then be updated using stochastic gradient descent or a variant thereof, thereby to jointly train the depth estimation model 806 and the ground plane estimation model 818. Although in Figure 8 the backpropagation is shown as passing to the depth estimation model 806 via the point cloud 822, point cloud conversion 820 and candidate depth map 808, this backpropagation route can optionally be omitted such that the depth estimation model 806 is updated independently of the ground plane loss 823, with the ground plane loss being used solely to train the ground plane estimation model 818.
In many real-life scenarios, such as driving scenarios, it is reasonable to assume that the ground will be approximately horizontal with respect to the camera. This assumption can be used to regularise the training of the ground plane estimation model 818. In order to do so, the loss function 814 may be further augmented to include a ground plane regularisation term (not shown) penalising deviations of the estimated ground plane from a horizontal orientation. For example, assuming x defines the transverse horizontal axis, y the vertical axis, and z the horizontal depth axis, and that the ground plane estimation model generates plane parameters a, b, c, d, with (a, b, c)^T normal to the plane, the ground plane regularisation term may penalise deviations of the quantity b/√(a² + b² + c²) from one. Alternatively, the ground plane regularisation term may penalise an angle of the normal from the vertical.
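A minimal sketch of such a regularisation term is given below (Python/PyTorch). It assumes, as in the passage above, that y is the vertical axis and that the predicted normal is oriented with a positive y component; the example plane parameters are hypothetical.

```python
import torch

def horizontal_regulariser(plane_params, eps=1e-8):
    """Penalise tilt of the predicted ground plane: b / sqrt(a^2 + b^2 + c^2)
    should be close to one for a horizontal plane with an upward-facing normal."""
    a, b, c = plane_params[:, 0], plane_params[:, 1], plane_params[:, 2]
    cos_tilt = b / (torch.sqrt(a ** 2 + b ** 2 + c ** 2) + eps)
    return ((cos_tilt - 1.0) ** 2).mean()

reg = horizontal_regulariser(torch.tensor([[0.02, 0.99, -0.05, -14.7]]))
```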
The trained ground plane estimation model 818 may be supplied alongside the trained depth estimation model 806 for use in determining a scaling factor for calibrating the trained depth estimation model 806. The calibration method follows the same steps as the method described with reference to Figures 7A-C, but with the trained ground plane estimation model 818 being used in place of the statistical analysis to determine the uncalibrated height of the camera. The trained ground plane estimation model 818 may thus be used to calibrate a depth estimation model for which depth values are known up to a scale factor (e.g. where a pose estimation model has been used for training the depth estimation model). Even in cases where the calibration is known in principle (e.g. where the relative pose is known when training the depth estimation model), running the depth estimation model 806 on images having different resolutions, or captured using different camera setups etc., could alter this calibration, in which case the trained ground plane estimation model 818 may be used to recalibrate the depth estimation model 806.
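By way of a non-limiting sketch (Python/NumPy), the scaling factor can then be recovered directly from the predicted plane parameters: the uncalibrated camera height is the distance from the camera at the origin to the plane. The example parameter values and mounting height below are hypothetical.

```python
import numpy as np

def scale_from_plane_params(plane_params, true_height):
    """Scaling factor from predicted plane parameters (a, b, c, d): the
    uncalibrated camera height is |d| / ||(a, b, c)||, the distance from the
    camera at the origin to the plane ax + by + cz + d = 0."""
    a, b, c, d = plane_params
    uncalibrated_height = abs(d) / np.linalg.norm([a, b, c])
    return true_height / uncalibrated_height

# Hypothetical model output for a camera mounted 1.5 m above the road:
scale = scale_from_plane_params((0.02, 0.98, -0.05, -14.7), true_height=1.5)
```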
Although in the example of Figure 8 the training of a ground plane estimation model is performed alongside the self-supervised training of a depth estimation model, in other examples the ground plane estimation model may be trained independently of the training of a depth estimation model. The depth estimation model may for example be trained prior to the training of the ground plane estimation model, in which case the trained depth estimation model may be used to generate point clouds for use in training the ground plane estimation model. Following this process, the trained ground plane estimation model may be used for calibration of the trained depth estimation model. Although the training of a ground plane estimation model has been described above in the context of calibrating a depth estimation model, there are also scenarios, for example in robotics and autonomous driving, where a ground plane estimation model may be useful in its own right. Accordingly, an objective of the present disclosure is to provide a computer-implemented method of training a ground plane estimation model. The method includes obtaining a training image comprising a view of a scene, processing the training image to generate a point cloud representation of the scene (for example using a depth map as an intermediate step, or directly), processing the training image using the ground plane estimation model to determine plane parameters for an estimated ground plane in the scene, and updating the ground plane estimation model so as to reduce a value of a loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
The above embodiments are to be understood as illustrative examples of the invention. Further embodiments of the invention are envisaged. For example, although in the examples of training methods described above the training images are real images captured by one or more cameras, in other examples a set of training images may be views of a virtual scene. Using virtual scenes for at least part of the training may reduce the time- and resource-consuming process of capturing real images of an environment, allowing the training process to be performed more quickly and with more training data. It may be advantageous to train the depth estimation model using a combination of synthetic training images and real training images. In some examples, a synthetic three-dimensional scene is generated using scene synthesis, for example using a deep neural network approach and/or using a video game engine, and a set of training images is generated virtually as views of the synthetic scene from respective virtual camera locations. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image, the method comprising: receiving first and second training images respectively of first and second views of a scene; processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array representing horizontal and vertical directions within the scene; projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image; and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between the first training image and the candidate reconstruction of the first training image, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
2. A computer-implemented method of training a depth estimation model to generate a depth map from a monocular image, the method comprising: receiving first and second training images respectively of first and second views of a scene; processing the first training image using the depth estimation model to generate a candidate depth map comprising an array of candidate depth values, the array representing horizontal and vertical directions within the scene; processing at least one of the first training image and the second training image to determine a binary mask indicative of distant regions of the scene; projecting the second training image using the candidate depth map to generate a candidate reconstruction of the first training image; and updating the depth estimation model so as to reduce a value of a loss function comprising a photometric difference term penalising a photometric difference between corresponding portions of the first training image and the candidate reconstruction of the first training image, wherein the corresponding portions exclude the distant regions of the scene as indicated by the binary mask.
3. The method of claim 2, wherein determining the binary mask comprises: generating a difference image corresponding to a difference between the first training image and the second training image; and binarizing the generated difference image.
4. The method of claim 3, wherein the binarizing is based on a binarization threshold determined in dependence on pixel values of the first training image and/or the second training image.
5. The method of any of claims 2 to 4, wherein the loss function further comprises a reflection mitigation term penalising a second derivative in the horizontal direction of at least a portion of the candidate depth map.
6. The method of claim 5, wherein said at least portion of the candidate depth map excludes depth values corresponding to the distant regions of the scene as indicated by the binary mask.
7. The method of claim 1 or 5, further comprising identifying a road surface in the first training image using a segmentation model, wherein said at least portion of the candidate depth map corresponds to the identified road surface.
8. The method of any preceding claim, further comprising: receiving one or more further training images of respective further views of the scene; and projecting each of the one or more further training images using the candidate depth map to generate a respective further candidate reconstruction of the first training image, wherein the photometric difference term of the loss function further penalises a photometric difference between the first training image and each of the respective projected further training images.
9. The method of any of claims 1 to 7, wherein: the candidate reconstruction of the first training image is a composite candidate reconstruction of the first training image; the method further comprises receiving one or more further training images of respective further views of the scene; and generating the composite candidate reconstruction of the first training image further comprises projecting each of the further training images using the candidate depth map.
10. The method of any preceding claim, further comprising generating the scene using scene synthesis.
11. The method of any preceding claim, further comprising processing the first and second training images using a pose estimation model to determine a candidate relative pose relating the first and second views of the scene, wherein: the projecting of the second training image further uses the candidate relative pose; and the method includes updating the pose estimation model jointly with the depth estimation model so as to reduce the value of the loss function.
12. The method of claim 11, wherein: the depth estimation model is a first neural network model; the pose estimation model is a second neural network model; and the first neural network model and the second neural network model share one or more neural network layers.
13. The method of any preceding claim, further comprising determining a scaling factor for calibrating the trained depth estimation model, the method comprising: receiving a calibration image of a calibration scene comprising a calibration object; receiving data indicative of dimensions of the calibration object; processing the calibration image to determine an orientation of the calibration object; determining, in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, a calibrated depth difference between two predetermined points on the calibration object; processing the calibration image, using the depth estimation model, to determine an uncalibrated depth difference between the two predetermined points on the calibration object; and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.
14. A computer-implemented method of determining a scaling factor for calibrating a depth estimation model, the method comprising: receiving a calibration image of a calibration scene comprising a calibration object; receiving data indicative of dimensions of the calibration object; processing the calibration image to determine an orientation of the calibration object; determining, in dependence on the determined orientation of the calibration object and the data indicative of the dimensions of the calibration object, a calibrated depth difference between two predetermined points on the calibration object; processing the calibration image, using the depth estimation model, to determine an uncalibrated depth difference between the two predetermined points on the calibration object; and determining the scaling factor as a ratio of the calibrated depth difference to the uncalibrated depth difference.
15. The method of claim 13 or 14, wherein: a surface of the calibration object is decorated with a checkerboard pattern; and determining the orientation of the calibration object comprises determining locations within the calibration image of a plurality of vertices of the checkerboard pattern, and analysing the determined locations of the plurality of vertices.
16. The method of any of claims 1 to 12, further comprising determining a scaling factor for calibrating the trained depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using the trained depth estimation model, to determine a calibration depth map comprising an array of calibration depth values; processing the calibration depth map to generate a point cloud representation of the calibration scene; generating a histogram of vertical positions of points within the point cloud representation of the calibration scene; identifying points within a modal bin of the generated histogram as ground plane points; performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
17. A computer-implemented method of determining a scaling factor for calibrating a depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using the depth estimation model, to determine a calibration depth map comprising an array of calibration depth values, the array defining horizontal and vertical directions; processing the calibration depth map to generate a point cloud representation of the calibration scene; generating a histogram of vertical positions of points within the point cloud representation of the calibration scene; identifying points within a modal bin of the generated histogram as ground plane points; performing a statistical analysis of the points identified as ground plane points to determine an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
18. The method of claim 16 or 17, wherein the statistical analysis comprises determining a mean or median of the vertical positions of the points identified as ground plane points.
19. The method of claim 16 or 17, wherein the statistical analysis comprises determining coefficients for an equation of a plane of best fit for the points identified as ground plane points, and determining the uncalibrated height of the camera above the ground plane using the determined coefficients.
20. The method of any of claims 16 to 19, wherein the generated histogram is restricted to points lying below an optical axis of the camera.
21. The method of any of claims 16 to 20, wherein the generated histogram is restricted to points having a restricted range of depth values.
22. The method of claim 21, wherein an upper limit of the restricted range is given by a predetermined proportion of a median depth value of points within the point cloud representation.
23. A computer-implemented method of training a ground plane estimation model, the method comprising: obtaining a first training image comprising a view of a scene; processing the first training image to generate a point cloud representation of at least part of the scene; processing the first training image using the ground plane estimation model to determine plane parameter values for an estimated ground plane within the scene; and updating the ground plane estimation model so as to reduce a value of a loss function, the loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
24. The method of claim 23, wherein processing the first training image to generate the point cloud representation comprises: processing the first training image using a depth estimation model to generate a depth map of the scene; and converting at least part of the depth map to the point cloud representation.
25. The method of claim 24, further comprising training the depth estimation model, the training comprising updating the depth estimation model jointly with the ground plane estimation model so as to reduce the value of the loss function.
26. The method of any of claims 1 to 12, further comprising training a ground plane estimation model by at least: processing at least part of the candidate depth map to generate a point cloud representation of at least part of the scene; processing the first training image using the ground plane estimation model to determine plane parameter values for an estimated ground plane within the scene; and updating the ground plane estimation model jointly with the depth estimation model so as to reduce the value of the loss function, the loss function comprising a ground plane loss term penalising distances between the estimated ground plane and at least some points of the point cloud representation.
27. The method of any of claims 24 to 26, wherein the depth estimation model and the ground plane estimation model are neural network models sharing one or more neural network layers.
28. The method of any of claims 24 to 27, further comprising determining a scaling factor for calibrating the trained depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using the trained ground plane estimation model, to determine plane parameter values for an estimated ground plane within the calibration scene; determining, from the determined plane parameter values for the estimated ground plane within the calibration scene, an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
29. The method of any of claims 23 to 28, further comprising: obtaining semantic labels for pixels of the first training image; and selecting said at least part of the scene or said at least some points of the point cloud representation based on the obtained semantic labels for corresponding pixels of the first training image.
30. The method of claim 29, wherein obtaining the semantic labels comprises processing the first training image using a semantic segmentation model.
31. The method of any of claims 23 to 30, wherein the loss function comprises a ground plane regularisation term penalising deviations of the estimated ground plane from a horizontal orientation.
32. A computer-implemented method of determining a scaling factor for calibrating a depth estimation model, the method comprising: receiving a calibration image of a calibration scene captured by a camera positioned a given height above a ground plane; processing the calibration image, using a ground plane estimation model trained jointly with the depth estimation model, to determine plane parameter values for an estimated ground plane within the calibration scene; determining, from the determined plane parameter values for the estimated ground plane within the calibration scene, an uncalibrated height of the camera above the ground plane; and determining the scaling factor as a ratio of the given height of the camera above the ground plane to the uncalibrated height of the camera above the ground plane.
33. A system comprising: a mobile camera arranged to capture first and second training images respectively of first and second views of a scene as the mobile camera moves with respect to the scene; and a computing system arranged to process the first and second training images to train a depth estimation model, in accordance with the method of any of claims 1 to 12 or claim 26.
34. A system comprising memory circuitry, processing circuitry, and a camera, wherein the memory circuitry holds a depth estimation model and machine-readable instructions which, when executed by the processing circuitry, cause the camera to capture a calibration image of a calibration scene, and the system to determine a scaling factor for calibrating the depth estimation model using the captured calibration scene, in accordance with the method of claim 14, 17 or 32.
35. The system of claim 34, further comprising an advanced driver assistance system (ADAS) and/or an automated driving system (ADS) for a vehicle, wherein the system is arranged to process images captured by the camera using the depth estimation model to generate input data for said ADAS and/or ADS.
36. A computer program product comprising machine-readable instructions which, when executed by a computing system, cause the computing system to perform the method of any of claims 1 to 32.
PCT/GB2022/050881 2021-04-07 2022-04-07 Monocular depth estimation WO2022214821A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2104949.9A GB2605621A (en) 2021-04-07 2021-04-07 Monocular depth estimation
GB2104949.9 2021-04-07

Publications (2)

Publication Number Publication Date
WO2022214821A2 true WO2022214821A2 (en) 2022-10-13
WO2022214821A3 WO2022214821A3 (en) 2022-11-17

Family

ID=75883606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2022/050881 WO2022214821A2 (en) 2021-04-07 2022-04-07 Monocular depth estimation

Country Status (2)

Country Link
GB (1) GB2605621A (en)
WO (1) WO2022214821A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274939A (en) * 2020-01-19 2020-06-12 北京中交创新投资发展有限公司 Monocular camera-based automatic extraction method for road surface pothole damage

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337066B (en) 2013-05-27 2016-05-18 清华大学 3D obtains the calibration steps of system
EP3040941B1 (en) 2014-12-29 2017-08-02 Dassault Systèmes Method for calibrating a depth camera
US10154176B1 (en) 2017-05-30 2018-12-11 Intel Corporation Calibrating depth cameras using natural objects with expected shapes
CN109559349B (en) 2017-09-27 2021-11-09 虹软科技股份有限公司 Method and device for calibration
WO2019099684A1 (en) * 2017-11-15 2019-05-23 Google Llc Unsupervised learning of image depth and ego-motion prediction neural networks
US10628968B1 (en) 2018-12-05 2020-04-21 Toyota Research Institute, Inc. Systems and methods of calibrating a depth-IR image offset
CN209541744U (en) 2019-04-26 2019-10-25 昆明理工大学 A kind of caliberating device that photogrammetric post-processing auxiliary scale restores and orients

Also Published As

Publication number Publication date
GB202104949D0 (en) 2021-05-19
GB2605621A (en) 2022-10-12
WO2022214821A3 (en) 2022-11-17

Similar Documents

Publication Publication Date Title
US11816991B2 (en) Vehicle environment modeling with a camera
KR102109941B1 (en) Method and Apparatus for Vehicle Detection Using Lidar Sensor and Camera
EP2671384B1 (en) Mobile camera localization using depth maps
US10288418B2 (en) Information processing apparatus, information processing method, and storage medium
WO2020094033A1 (en) Method and system for converting point cloud data for use with 2d convolutional neural networks
JP5023186B2 (en) Object motion detection system based on combination of 3D warping technique and proper object motion (POM) detection
JP2021523443A (en) Association of lidar data and image data
CN114144809A (en) Vehicle environment modeling by camera
KR20140027468A (en) Depth measurement quality enhancement
JP2013537661A (en) Automatic detection of moving objects using stereo vision technology
Munoz-Banon et al. Targetless camera-lidar calibration in unstructured environments
JP2016009487A (en) Sensor system for determining distance information on the basis of stereoscopic image
KR20210090384A (en) Method and Apparatus for Detecting 3D Object Using Camera and Lidar Sensor
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
JP2019139420A (en) Three-dimensional object recognition device, imaging device, and vehicle
CN112683228A (en) Monocular camera ranging method and device
US20220171975A1 (en) Method for Determining a Semantic Free Space
WO2022214821A2 (en) Monocular depth estimation
Wang et al. 3D-LIDAR based branch estimation and intersection location for autonomous vehicles
US20230376106A1 (en) Depth information based pose determination for mobile platforms, and associated systems and methods
CN112733678A (en) Ranging method, ranging device, computer equipment and storage medium
Loktev et al. Image Blur Simulation for the Estimation of the Behavior of Real Objects by Monitoring Systems.
CN111553342A (en) Visual positioning method and device, computer equipment and storage medium
US20220270282A1 (en) Information processing device, data generation method, and non-transitory computer-readable medium storing program
Corneliu et al. Real-time pedestrian classification exploiting 2D and 3D information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22722331

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE