EP4107699A1 - A method for generating a dataset, a method for generating a neural network, and a method for constructing a model of a scene - Google Patents
- Publication number
- EP4107699A1 (application EP21710811.7A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- image
- images
- scene
- focal length
- depth
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/579—Depth or shape recovery from multiple images from motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30248—Vehicle exterior or interior
- G06T2207/30252—Vehicle exterior; Vicinity of vehicle
Definitions
- the present invention relates, in general, to a method for generating a dataset for training an image depth estimation neural network, a method for generating an image depth estimation neural network, and a method for constructing a three-dimensional model of a scene.
- the 3D model may be constructed from images that image the scene from different viewpoints, using a structure from motion algorithm.
- the structure from motion algorithm may find correspondences between images, i.e. find points that occur in several images, and analyze these correspondences to form the 3D model.
- a 3D reconstruction algorithm that uses depth information from a depth sensor to improve the accuracy of the 3D model is described in [3D Mapping with an RGB-D Camera].
- if a camera does not comprise a depth sensor, it may still be possible to extract depth information of the points in the images of the scene using a neural network.
- the neural network may e.g. be trained using the Camera-Aware Multi-scale Convolutions (CAM-Convs) method, as described in [Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11826-11835]
- the CAM-Convs method may concatenate camera internal parameters to the feature maps, and hence allow the network to learn the dependence of depth on these parameters.
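The idea of concatenating intrinsics-derived maps to feature maps can be illustrated with a minimal sketch. The channel set below (centred pixel coordinates and per-pixel viewing angles) is a simplified assumption, not the exact set used by CAM-Convs:

```python
import numpy as np

def intrinsics_channels(h, w, fx, fy, cx, cy):
    """Per-pixel maps derived from camera internal parameters, in the
    spirit of CAM-Convs: centred pixel coordinates and per-pixel viewing
    angles. The exact channel choice is a simplified assumption."""
    ccx = np.broadcast_to(np.arange(w, dtype=float) - cx, (h, w))
    ccy = np.broadcast_to((np.arange(h, dtype=float) - cy)[:, None], (h, w))
    fov_x = np.arctan(ccx / fx)  # horizontal viewing angle of each pixel
    fov_y = np.arctan(ccy / fy)  # vertical viewing angle of each pixel
    # (4, h, w); would be resized to a feature map's spatial size and
    # concatenated along the channel axis before a convolution
    return np.stack([ccx, ccy, fov_x, fov_y])
```

In a network, these channels would let convolutions condition on where each pixel sits relative to the principal point, rather than having to infer the intrinsics from image content.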
- the invention stems from the realization that not all cameras comprise a depth sensor. Thus, there is room for improvement of the accuracy of 3D models of scenes when depth sensor data is unavailable or corrupt.
- depth information may be estimated from the images themselves, even from single images, using a neural network that is suitably trained. We herein call such a neural network an image depth estimation neural network.
- a neural network may be generated using training data derived from a suitable dataset. According to the inventive concept a method for generating a suitable dataset and a method for generating a suitably trained image depth estimation neural network are provided together with a method for generating a 3D model.
- Each of these aspects of the invention may provide better 3D models of scenes. This in turn may e.g. result in autonomous vehicles navigating in a safer and more accurate manner.
- a method for generating a dataset for training an image depth estimation neural network comprising: receiving an image series comprising a plurality of images of a scene, each image being an image acquired by a camera, the camera having a position at the time of the acquisition of the image; receiving, for each image of the image series, a measured camera position, the measured camera position being the position of the camera measured at the acquisition of the image; forming a three-dimensional (3D) reconstruction of the scene, the 3D reconstruction comprising coordinates of a 3D model of the scene and a reconstructed camera position for each image, the reconstructed camera position being an estimate of the position of the camera relative to the 3D model at the acquisition of the image, wherein the 3D reconstruction of the scene is formed by running a structure from motion algorithm on the image series, wherein the structure from motion algorithm is configured to align the 3D reconstruction to the measured camera positions; calculating at least one depth measure of at least one image of the image series based on the 3D reconstruction; and entering pairs of image data and depth data into the dataset.
- An advantage of the dataset is that it may provide accurate depth measures of the images in the dataset.
- the scale of the 3D model may become accurate.
- the 3D model may have a scale where one meter in the model corresponds to one meter in the real-world scene.
- the dataset may enable accurate future 3D modelling of a scene that is not represented by the dataset.
- position measurements may not be available and/or only few images may be available.
- the ability to accurately interpret depth from images, acquired by training the image depth estimation neural network on the dataset which was generated from images where measured camera positions were available, may improve the accuracy of the future 3D model.
- the dataset comprising pairs of image data and depth data may be used in the training of an image depth estimation neural network.
- Image data from the dataset may be used as input data in the training while depth data from the dataset may be used as output data in the training.
- the image depth estimation neural network may learn to accurately estimate depth from images.
- Said image depth estimation neural network may, at a later time, be used in 3D modelling of a scene by providing estimates of image depth, e.g. from single images, which improve the accuracy of the 3D model in a similar fashion to how RGB-D camera depth measures improve the accuracy.
- the requirements on the cameras contributing to the images of the dataset may be low.
- the cameras may e.g. not comprise a depth sensor.
- the cameras may e.g. be smartphone cameras, dashboard cameras, action cameras etc.
- the cameras may also be a variety of the previously mentioned, and/or other, camera types.
- the cameras may e.g. comprise an image sensor for acquiring the image and a global positioning system (GPS) sensor for providing the measured camera position. Greater requirements may not be needed. This may contribute to a low cost of creating the dataset, since images taken for other purposes, not necessarily with the sole purpose of generating the dataset, may be used.
- the dataset may enable cost-effective 3D modelling of a scene.
- the cost of future 3D modelling may be low if the image depth estimation neural network is trained using a low-cost dataset.
- the cost of future 3D modelling may be low if it can be done using a variety of cameras. This may be facilitated by the use of training data derived from a dataset according to the first aspect.
- a dataset that may be generated from images from a variety of cameras may improve the ability to handle a variety of cameras in the future 3D modelling.
- the image series may comprise images wherein each image overlaps at least partially with another image of the image series.
- the image series may depict a scene from a plurality of viewpoints or a plurality of angles.
- the scene may herein comprise one or more objects.
- the image series may e.g. be a series of street view images acquired from a vehicle as the vehicle moves along the street.
- the measured camera position may be a position measured with a positioning system, e.g. a GPS positioning system, a Wi-Fi positioning system, a cell ID positioning system, or a motion capture system.
- the measured position may be a position measured with a combination of positioning systems.
- the positioning system may be associated with the camera that acquired the image, e.g. integrated in the camera or in communication with the camera.
- the estimate of the position of the camera relative to the 3D model may at first be an estimate based on the image series without accounting for the measured camera position.
- the structure from motion algorithm may firstly form the 3D model and the estimate of the positions of the cameras relative to the 3D model based on the image series.
- the structure from motion algorithm may secondly align the 3D reconstruction to the measured positions by adjusting the coordinates of the 3D model and the coordinates of the estimated positions of the cameras.
- a scale may be imposed on the model.
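The alignment-and-scale step described above can be illustrated by a least-squares similarity (Umeyama-style) alignment of reconstructed camera positions to measured ones. This is a sketch under the assumption that positions are given as N×3 arrays, not the patent's specific algorithm:

```python
import numpy as np

def align_to_measured_positions(reconstructed, measured):
    """Estimate a similarity transform (scale s, rotation R, translation t)
    mapping reconstructed camera positions onto measured (e.g. GPS)
    positions in the least-squares sense (Umeyama alignment).

    reconstructed, measured: (N, 3) arrays of corresponding positions.
    Returns s, R, t such that s * R @ x + t approximates the measurement,
    so applying the transform to the whole model imposes a metric scale.
    """
    mu_r = reconstructed.mean(axis=0)
    mu_m = measured.mean(axis=0)
    xr = reconstructed - mu_r
    xm = measured - mu_m
    cov = xm.T @ xr / len(reconstructed)          # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # reflection guard
        S[2, 2] = -1
    R = U @ S @ Vt
    var_r = (xr ** 2).sum() / len(reconstructed)  # spread of reconstruction
    s = np.trace(np.diag(D) @ S) / var_r          # metric scale factor
    t = mu_m - s * R @ mu_r
    return s, R, t
```

Applying the recovered transform to the 3D model coordinates and reconstructed camera positions yields a model in which one meter corresponds to one meter in the real-world scene, as noted above.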
- the 3D reconstruction may, in addition to coordinates of a 3D model of the scene and a reconstructed camera position for each image, also comprise a reconstructed camera orientation for each image.
- the reconstructed camera position and the reconstructed camera orientation of an image may form a reconstructed camera pose for the image.
- the method may include in the dataset only images that fulfill certain requirements.
- Such requirements may e.g. be: only including images where an estimated focal length of the camera is available, or only including images where the 3D reconstruction is made using at least 2 neighboring images taken within a radius of 10 meters.
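Such inclusion requirements could be sketched as a filter over hypothetical per-image metadata (the dict field names are assumptions, not from the source):

```python
def keep_image(meta, min_neighbors=2, radius_m=10.0):
    """Keep an image only if an estimated focal length is available and at
    least `min_neighbors` other images were taken within `radius_m` of its
    position. `meta` is a hypothetical metadata dict with keys
    'focal_length' and 'neighbor_distances_m'."""
    if meta.get("focal_length") is None:
        return False  # no estimated focal length available
    near = [d for d in meta["neighbor_distances_m"] if d <= radius_m]
    return len(near) >= min_neighbors
```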
- calculating a depth measure of an image may be done by a multi-view stereo algorithm, e.g. by a patch based multi-view stereo algorithm.
- a distance between a viewpoint from which the image was taken and a point in the scene may be the distance between the optical center of the camera and the point in the scene.
- the optical center may e.g. be the aperture or lens of the camera.
- a representation of said distance may e.g. be said distance itself, i.e. what is sometimes simply called distance.
- a representation of said distance may be the shortest distance between the point in the scene and a camera plane, wherein the camera plane is orthogonal to the optical axis of the camera and comprises the optical center of the camera, i.e. what is sometimes simply called depth.
- the optical axis may herein be the viewing direction of the camera.
- the viewpoint may be essentially the same as the position of the camera at the time of the acquisition of the image.
- At least one depth measure of an image may be an entry in a depth map of the image.
- the depth map may comprise an array of depth measures wherein each depth measure corresponds to the depth measure of a certain pixel, or group of pixels, in the image.
- entries of a depth map may be undefined, e.g. marked by “not a number”. It may not be possible to calculate depth measures for all pixels in an image.
- depth measures may be validated before entered in the dataset. A depth measure of a point in a scene in an image may only be entered into the dataset if it agrees with another depth measure of the same point in the scene in another image.
- the dataset may comprise pairs of image data and depth data in the form of one image and one depth map of the image in each pair.
- the structure from motion algorithm may be configured to align the 3D reconstruction to the measured camera positions through an adjustment of the coordinates of the 3D reconstruction, wherein the adjustment of coordinates penalizes reconstructed camera positions that deviate from the corresponding measured camera positions.
- the adjustment of the coordinates of the 3D reconstruction may be bundle adjustment. Adjustment of the coordinates of the 3D reconstruction, such as bundle adjustment, may be an effective way to obtain an accurate 3D reconstruction.
- a rough 3D reconstruction may first be formed, after which the coordinates may be adjusted. This may be more effective than aligning the 3D reconstruction to the measured positions already in an initial phase of the formation of the 3D reconstruction.
- the term adjustment may be construed as an adjustment of an initial 3D reconstruction, wherein the initial 3D reconstruction did not account for the measured camera positions.
- the adjustment of coordinates may penalize reconstructed camera positions that deviate from the corresponding measured camera positions by imposing a cost for each reconstructed camera position that depends on a distance between said reconstructed camera position and the corresponding measured camera position.
- the cost may be proportional to a function of the distance between the reconstructed camera position and the corresponding measured camera position, e.g. the Euclidean distance, the squared distance, or robust functions such as the Cauchy loss.
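The cost alternatives listed above can be sketched as follows, as a term that could be added to a bundle-adjustment objective (the Cauchy scale parameter `c` is an assumption):

```python
import numpy as np

def position_cost(reconstructed, measured, loss="cauchy", c=1.0):
    """Cost penalizing a reconstructed camera position that deviates from
    its measured position. The scale parameter c of the Cauchy loss is an
    illustrative assumption."""
    d = np.linalg.norm(np.asarray(reconstructed, float) - np.asarray(measured, float))
    if loss == "euclidean":
        return d
    if loss == "squared":
        return d ** 2
    if loss == "cauchy":
        # robust: grows only logarithmically for large deviations,
        # limiting the influence of position-measurement outliers
        return (c ** 2 / 2.0) * np.log1p((d / c) ** 2)
    raise ValueError(loss)
```

The robust variant is what makes the alignment tolerant of occasional bad GPS fixes: a single outlier contributes a bounded gradient instead of dominating the adjustment.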
- a method for generating an image depth estimation neural network, the image depth estimation neural network being a neural network that estimates at least one depth measure of an image of a scene, wherein a depth measure of the image is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image, the camera plane being a plane comprising the optical center of the camera and orthogonal to the optical axis of the camera, the method comprising: receiving a first set of images, the first set of images being a plurality of images of a scene taken by one or more cameras; receiving, for each image in the first set of images, an associated focal length, the associated focal length being an estimate of a focal length of the camera that took the image; transforming the first set of images into a set of normalized training images, the set of normalized training images representing how images of the first set of images would appear if the images of the set had a joint focal length, wherein transforming an image of the first set of images into a normalized training image comprises rescaling the image to represent a change in the associated focal length of the image such that it approaches the joint focal length; and training the image depth estimation neural network using a training dataset comprising at least one pair of input data and output data derived from the set of normalized training images.
- a distance between a viewpoint from which the image was taken and a point in the scene may be the distance between the optical center of the camera and the point in the scene.
- a representation of said distance may e.g. be said distance itself or e.g. the shortest distance between the point in the scene and the camera plane.
- the method according to the second aspect may enable training of the image depth estimation neural network using images from a variety of cameras with different focal lengths, as the images are transformed into normalized training images before training.
- different focal lengths of the cameras used for the training data could have degraded the performance of the neural network.
- a camera with a large focal length may provide a zoomed-in version of an image with a small focal length.
- Training the neural network without accounting for the different focal lengths may therefore result in a neural network that does not provide reliable depth estimations.
- a hypothesis is that, without normalization of the focal length, the neural network must accurately predict the focal length itself, a task that may be difficult. Put differently, by providing normalized data, less data is needed to train the neural network adequately.
- An advantage of the method according to the second aspect is that it may generate an accurate image depth estimation neural network.
- since the image depth estimation neural network may be trained using normalized training images that are derived from images from a variety of cameras, the training dataset may be large.
- the training dataset may be much larger than if only cameras with depth sensors, or only cameras with a specific focal length, could be used.
- a large training dataset may improve the accuracy.
- the image depth estimation neural network may become more adapted to handle images from any kind of camera when the network is deployed in 3D modelling of a scene in a situation after training. For example, image distortion may be linked to the focal length.
- the method according to the second aspect may have several advantages compared to the CAM-Convs method.
- the method according to the second aspect may not rely on concatenating camera internal parameters to feature maps.
- the method according to the second aspect may not rely on informing the neural network about e.g. the viewing angles of the pixels in the images. Instead, these angles may be intrinsically learned by the neural network.
- the rescaling of the images may ensure that every pixel in the normalized training images always corresponds to the same viewing angle during training.
- the method according to the second aspect may be implemented on a large range of neural network architectures. It may even be implemented on any type of neural network architecture. In contrast, methods relying on concatenated parameters may need to be implemented on special types of architectures, such as u-net architectures.
- the method according to the second aspect may be a computationally less demanding way of generating an image depth estimation neural network than the CAM-Convs method. Tests have indicated that the generated image depth estimation neural network has at least a comparable performance, regardless of the method. Under some deployment conditions, a neural network generated according to the method of the second aspect outperforms neural networks generated according to the CAM-Convs method.
- the method according to the second aspect may be configured to generate an encoder-decoder neural network.
- the encoder-decoder neural network may have an architecture with skip connections. However, as mentioned other architectures are also possible.
- the method according to the second aspect may be configured to produce a feature map for each normalized training image, wherein the feature map is smaller than the normalized training image, e.g. 16 times smaller.
- each image has an associated focal length, i.e. an estimate of a focal length of the camera that took the image.
- the associated focal length may be an estimate of a focal length that comes from metadata associated with the image.
- the associated focal length may alternatively be an estimate of a focal length acquired by running a camera model characterizing structure from motion algorithm on a plurality of images taken by cameras of the same model as the camera that took the image in the first set of images.
- the camera model characterizing structure from motion algorithm may be configured to iteratively refine both a 3D model and a focal length estimate of images taken by the same camera model.
- the camera model characterizing structure from motion algorithm may additionally refine distortion parameters of the camera model.
- when an image of the first set of images is transformed into a normalized training image, the image is rescaled. It should be understood that the image may be rescaled around a central point in the image, e.g. around a central pixel. It should also be understood that all images of the first set of images may be rescaled to the same size.
- Transforming an image of the first set of images may, in addition to rescaling, comprise a distortion reduction, wherein image distortions are decreased.
- the distortion reduction may be based on distortion parameters extracted by the camera model characterizing structure from motion algorithm. The distortion reduction may be done before the rescaling.
- the rescaling of an image may be done such that the associated focal length of the image approaches the joint focal length.
- the set of normalized training images may all have the same joint focal length.
- the associated focal length of the normalized training images may be within ±10% of the joint focal length or within ±20% of the joint focal length.
- the precision of the associated focal length of the normalized training images may correspond to a precision in the image depth estimates of the image depth estimation neural network. In some applications a reduced precision may be acceptable in the depth estimates and then the rescaling may not need to be done such that the associated focal lengths of the normalized training images all are the same.
- the at least one pair of input data and output data in the training dataset may comprise a normalized training image as input data and at least one depth measure of said normalized training image as output data.
- the neural network may be trained in converting input data into output data.
- the neural network may be trained in converting an image which has a focal length corresponding to the joint focal length into at least one depth measure.
- the at least one depth measure of the normalized training image in the output data may be a calculated depth measure, calculated based on an image series.
- the at least one depth measure of the normalized training image may be a directly measured depth measure, e.g. measured using a RGB-D camera.
- the depth measures of the images of the set of normalized training images may comprise only calculated depth measures, only directly measured depth measures or a mix of calculated depth measures and directly measured depth measures.
- the normalized training image of the input data and the at least one depth measure of the output data may be derived from one of the pairs of image data and depth data of a dataset that is generated according to the method of the first aspect of the invention.
- the depth measures of the images of the set of normalized training images may comprise calculated depth measures according to the first aspect. It should be understood that when an image from the dataset according to the first aspect of the invention is rescaled and entered as input data in the training dataset, the representation of the at least one depth measure of the corresponding image may be transformed accordingly before being entered as output data in the training dataset. For example, consider a dataset that is generated according to the first aspect of the invention wherein the dataset comprises one image and one depth map of the image in each pair. When the image of a pair is rescaled, the depth map may also be rescaled accordingly. The rescaled image and the rescaled depth map may then be entered into the training dataset as a pair of input data and output data.
- the rescaling of an image of the first set of images may comprise scaling the image by a factor, the factor being inversely proportional to the focal length associated with the image.
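The paired rescaling of an image and its depth map might look like the following sketch, assuming grayscale images and using nearest-neighbour resampling as an assumed interpolation scheme:

```python
import numpy as np

def rescale_pair(image, depth_map, f_image, f_joint):
    """Rescale a grayscale image and its depth map by the same factor so
    the pair stays consistent when entered into the training dataset.
    Nearest-neighbour resampling; the factor is inversely proportional to
    the focal length associated with the image."""
    factor = f_joint / f_image
    h, w = image.shape[:2]
    nh, nw = int(round(h * factor)), int(round(w * factor))
    # map each output pixel back to its source pixel (nearest neighbour)
    rows = np.clip((np.arange(nh) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / factor).astype(int), 0, w - 1)
    return image[np.ix_(rows, cols)], depth_map[np.ix_(rows, cols)]
```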
- the factor may be the joint focal length of the set of normalized training images divided by the focal length associated with the image.
- each image of the set of normalized training images may have an associated focal length that is the joint focal length.
- for example, the joint focal length may be a factor 2 larger than f, the focal length associated with the image, giving a scale factor of 2.
- Such a rescaling may correspond to moving pixel information in the image a factor 2 away from the center pixel in both the x- and y-direction, thereby mimicking a factor 2 change in focal length.
- the first set of images may comprise images associated with a plurality of focal lengths.
- the neural network may be trained to handle images coming from a variety of cameras with a variety of focal lengths.
- the rescaling of an image may comprise: cropping the image if the focal length associated with the image is smaller than the joint focal length; padding the image if the focal length associated with the image is larger than the joint focal length.
- Cropping and padding may be image processing tasks that require only small computational resources. Thus, the rescaling may be performed efficiently.
- Cropping the image may comprise removing outer parts of the image. For example, images from the first set of images that have a focal length smaller than the joint focal length may be rescaled in a manner corresponding to zooming in, and pixel information ending up outside the image boundaries may be cropped.
- Padding the image may comprise introducing new pixels around the outer parts of the image, e.g. pixels comprising no image information or pixels comprising a fixed value such as zero. For example, images from the first set of images that have a focal length larger than the joint focal length may be rescaled in a manner corresponding to zooming out, and zero-valued pixels may be introduced along the image boundaries.
- the pixel information of the images may also be redistributed in conjunction with cropping or padding such that all the images in the set of normalized training images have the same number of pixels and aspect ratio.
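Combining the zoom with centre cropping or zero padding to a fixed output size, so that all normalized images share the same pixel count and aspect ratio, might be sketched as follows (grayscale images, nearest-neighbour resampling, and the output size are assumptions):

```python
import numpy as np

def normalize_focal_length(image, f_image, f_joint, out_h, out_w):
    """Zoom a grayscale image by f_joint / f_image around the centre, then
    centre-crop (if the result is larger) or zero-pad (if smaller) to a
    fixed (out_h, out_w) canvas."""
    factor = f_joint / f_image
    h, w = image.shape[:2]
    nh, nw = int(round(h * factor)), int(round(w * factor))
    rows = np.clip((np.arange(nh) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / factor).astype(int), 0, w - 1)
    zoomed = image[np.ix_(rows, cols)]            # nearest-neighbour zoom
    out = np.zeros((out_h, out_w), dtype=image.dtype)  # zero padding
    # centre-crop the zoomed image or centre-place it on the canvas
    r0 = max((nh - out_h) // 2, 0)
    c0 = max((nw - out_w) // 2, 0)
    R0 = max((out_h - nh) // 2, 0)
    C0 = max((out_w - nw) // 2, 0)
    ch, cw = min(nh, out_h), min(nw, out_w)
    out[R0:R0 + ch, C0:C0 + cw] = zoomed[r0:r0 + ch, c0:c0 + cw]
    return out
```

A focal length smaller than the joint focal length gives a zoom factor above 1 and hence a crop; a larger focal length gives a factor below 1 and hence zero padding, matching the cropping and padding cases described above.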
- a method for constructing a three-dimensional (3D) model of a scene comprising: receiving a set of scene images, the set of scene images being images of the scene taken from a plurality of viewpoints; receiving, for each of the scene images, an associated focal length, the associated focal length being an estimate of the focal length of the camera that took the scene image; transforming the set of scene images into a set of normalized images, the set of normalized images representing how the set of scene images would appear if the images of the set had a joint focal length, wherein transforming a scene image into a normalized image comprises rescaling the image to represent a change in the associated focal length of the image such that it approaches the joint focal length; obtaining at least one estimate of a depth measure of at least one image of the set of normalized images using an image depth estimation neural network, wherein a depth measure of an image is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image; and constructing the 3D model of the scene based on the set of normalized images and the at least one estimate of the depth measure.
- a distance between a viewpoint from which the image was taken and a point in the scene may be the distance between the optical center of the camera and the point in the scene.
- a representation of said distance may e.g. be said distance itself or e.g. the shortest distance between the point in the scene and the camera plane.
- An advantage of the method according to the third aspect is that it may provide an accurate 3D model of the scene.
- since the method according to the third aspect transforms the set of scene images into a set of normalized images, the method may utilize an image depth estimation neural network that is trained on normalized images, e.g. an image depth estimation neural network generated according to the second aspect.
- accurate depth estimates may be outputted.
- Accurate depth measures may by extension translate into an accurate 3D model.
- Another advantage of the method according to the third aspect is that it may be versatile.
- the accuracy of the 3D model may be indifferent to, or depend only weakly on, which camera has generated the images of the set of scene images.
- the set of scene images may e.g. comprise images taken by different cameras with different focal lengths.
- the method may avoid introducing inaccuracies in the 3D model due to the cameras having different camera focal lengths or different distortion properties.
- Another advantage of the method according to the third aspect is that it may be cost-effective as it may be based on an image depth estimation neural network that may be generated at low cost from images that are collected not solely for the purpose of training the image depth estimation neural network.
- each scene image has an associated focal length, i.e. an estimate of a focal length of the camera that took the image.
- the associated focal length may be an estimate of a focal length that comes from metadata associated with the image.
- the associated focal length may alternatively be an estimate of a focal length acquired by running a camera model characterizing structure from motion algorithm on a plurality of images taken by cameras of the same model as the camera that took the image in the first set of images.
- the camera model characterizing structure from motion algorithm may be configured to iteratively refine both a 3D model and a focal length estimate of images taken by the same camera model.
- the camera model characterizing structure from motion algorithm may additionally refine distortion parameters of the camera model.
- Transforming an image of the set of scene images may, in addition to rescaling, comprise a distortion reduction, wherein image distortions are decreased.
- the distortion reduction may be based on distortion parameters extracted by the camera model characterizing structure from motion algorithm. The distortion reduction may be done before the rescaling.
- the rescaling of an image may be done such that the associated focal length of the image approaches the joint focal length of the method according to the third aspect.
- the set of normalized images may all have the same joint focal length.
- the associated focal length of the normalized images may be within ±10% of the joint focal length or within ±20% of the joint focal length.
- the precision of the associated focal length of the normalized images may correspond to an accuracy in the 3D model constructed by the method according to the third aspect. In some applications a reduced accuracy may be acceptable in the 3D model and then the rescaling may not need to be done such that the associated focal lengths of the normalized images all are the same.
- a depth measure of a normalized image may be obtained by inputting the normalized image into the image depth estimation neural network such that the image depth estimation neural network outputs at least one estimate of a depth measure of said normalized image.
- the method may, in addition to constructing the 3D model of the scene, also form a reconstructed camera pose for at least one image of the set of scene images, or at least two images of the set of scene images. In some instances, the method may also form a reconstructed camera pose for all images of the set of scene images, wherein the reconstructed camera pose comprises a camera position and a camera orientation.
- a camera orientation may herein be an orientation of the camera relative to the 3D axes containing the 3D model. The camera orientation may define the pan, tilt, and roll of the camera.
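By way of illustration, a camera orientation expressed as pan, tilt, and roll can be composed into a rotation matrix relative to the 3D axes. The sketch below is a hedged example: the axis conventions and composition order are assumptions chosen for illustration, not something the text prescribes.

```python
import math

def rotation_matrix(pan, tilt, roll):
    """Compose a 3x3 camera rotation matrix from pan, tilt, and roll
    angles (radians). The axis conventions and composition order are
    assumptions for illustration; a real pipeline must match the
    coordinate frame of the 3D model."""
    cp, sp = math.cos(pan), math.sin(pan)
    ct, st = math.cos(tilt), math.sin(tilt)
    cr, sr = math.cos(roll), math.sin(roll)
    rz = [[cp, -sp, 0.0], [sp, cp, 0.0], [0.0, 0.0, 1.0]]   # pan about z
    rx = [[1.0, 0.0, 0.0], [0.0, ct, -st], [0.0, st, ct]]   # tilt about x
    ry = [[cr, 0.0, sr], [0.0, 1.0, 0.0], [-sr, 0.0, cr]]   # roll about y
    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
                for i in range(3)]
    return matmul(rz, matmul(rx, ry))
```

With all angles zero the matrix is the identity, i.e. the camera frame coincides with the 3D model's axes.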
- the image depth estimation neural network may be a neural network that estimates a depth measure of an image based on said image alone, thereby providing a single-image-depth-estimate.
- the image depth estimation neural network may be a neural network generated according to the method of the second aspect of the invention.
- the method for constructing a 3D model according to the third aspect may thereby make use of the advantages that a neural network generated according to the method of the second aspect provides.
- the rescaling of the images in the set of scene images in the method for constructing a 3D model according to the third aspect of the invention and the rescaling of the images in the first set of images in the method for generating the image depth estimation neural network according to the second aspect of the invention may share a common joint focal length.
- the method for generating an image depth estimation neural network may rescale images to appear as if they had a joint focal length fc,training, and then use these images to train the image depth estimation neural network.
- the method for constructing a 3D model of a scene may rescale scene images to appear as if they had a joint focal length fc,modelling, and then input the rescaled images into said image depth estimation neural network.
- Processing images associated with the set of scene images together with the at least one estimate of a depth measure may comprise: reconstructing a camera position and a camera orientation for each image using a 3D reconstruction pipeline; detecting an object using an object detection pipeline, wherein the object appears in at least two images of the set of scene images; extracting an object depth measure for the object for each of the at least two images from the at least one estimate of a depth measure, wherein an object depth measure is derived from at least a depth measure for at least one point on the object; forming an object position in the 3D model, wherein the object position comprises coordinates of the object, by either: finding an estimated object position from each of the at least two images, the estimated object position of an image being based on the reconstructed camera position, the reconstructed camera orientation, and the object depth measure of the image, and forming the object position from the estimated object positions; or finding a triangulated object position, the triangulated object position being based on triangulation using the reconstructed camera positions and the reconstructed camera orientations.
- the 3D reconstruction pipeline may be a structure from motion pipeline or a simultaneous localization and mapping pipeline.
- the object detection pipeline may e.g. be implemented as described in [Seamless Scene Segmentation, Porzi et al., The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8277-8286].
- a server for generating a dataset for training an image depth estimation neural network comprising a processor configured to: receive an image series comprising a plurality of images of a scene, each image being an image acquired by a camera, the camera having a position at the time of the acquisition of the image; receive, for each image of the image series, a measured camera position, the measured camera position being the position of the camera measured at the acquisition of the image; form a three-dimensional (3D) reconstruction of the scene, the 3D reconstruction comprising coordinates of a 3D model of the scene and a reconstructed camera position for each image, the reconstructed camera position being an estimate of the position of the camera relative to the 3D model at the acquisition of the image, wherein the 3D reconstruction of the scene is formed by running a structure from motion algorithm on the image series, wherein the structure from motion algorithm is configured to align the 3D reconstruction to the measured camera positions; calculate at least one depth measure of at least one image of the image series based on the 3D reconstruction; and form the dataset out of pairs of image data and depth data, wherein each pair comprises an image of the image series as image data and the at least one depth measure of the corresponding image as depth data.
- the server may be a single physical server or a distributed server, a distributed server comprising a plurality of physical servers, possibly distributed over a number of locations, acting together.
- the server may comprise a memory and at least one receiver.
- the memory may store computer readable instructions for the processor to perform.
- the memory may also store the formed dataset.
- the at least one receiver may be configured to receive the image series and the measured camera positions.
- a server may allow images to be collected from a plurality of cameras and combined into a dataset, which by extension may improve 3D modelling of scenes.
- a server for generating an image depth estimation neural network, the image depth estimation neural network being a neural network that estimates at least one depth measure of an image of a scene, wherein a depth measure of the image is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image.
- the server comprising a processor configured to: receive a first set of images, the first set of images being a plurality of images of a scene taken by one or more cameras; receive, for each image in the first set of images, an associated focal length, the associated focal length being an estimate of a focal length of the camera that took the image; transform the first set of images into a set of normalized training images, the set of normalized training images representing how images of the first set of images would appear if the images of the set had a joint focal length, wherein transforming an image of the first set of images into a normalized training image comprises rescaling the image, the rescaling representing a change in the associated focal length of the image such that it approaches the joint focal length; and train the neural network to predict at least one depth measure of an image using a training dataset whose input data comprises the set of normalized training images.
- the server may be a single physical server or a distributed server, a distributed server comprising a plurality of physical servers, possibly distributed over a number of locations, acting together.
- the server may comprise a memory and at least one receiver.
- the memory may store computer readable instructions for the processor to perform.
- the memory may also store the generated image depth estimation neural network.
- the at least one receiver may be configured to receive the first set of images and the associated focal lengths.
- a server may allow training the image depth estimation neural network with images collected from a plurality of cameras, which by extension may improve 3D modelling of scenes.
- a server for constructing a three-dimensional (3D) model of a scene comprising a processor configured to: receive a set of scene images, the set of scene images being images of the scene taken from a plurality of viewpoints; receive, for each of the scene images, an associated focal length, the associated focal length being an estimate of the focal length of the camera that took the scene image; transform the set of scene images into a set of normalized images, the set of normalized images representing how the set of scene images would appear if the images of the set had a joint focal length, wherein transforming a scene image into a normalized image comprises rescaling the image to represent a change in the associated focal length of the image such that it approaches the joint focal length; obtain at least one estimate of a depth measure of at least one image of the set of normalized images using an image depth estimation neural network, wherein a depth measure of an image is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image; and process images associated with the set of scene images together with the at least one estimate of a depth measure to construct the 3D model of the scene.
- the server may be a single physical server or a distributed server, a distributed server comprising a plurality of physical servers, possibly distributed over a number of locations, acting together.
- the server may comprise a memory and at least one receiver.
- the memory may store computer readable instructions for the processor to perform.
- the memory may also store the image depth estimation neural network and/or the constructed 3D model.
- the at least one receiver may be configured to receive the set of scene images and the associated focal lengths.
- a server according to the sixth aspect may allow improved 3D modelling based on images collected from a plurality of cameras.
- the servers of the fourth, fifth, and sixth aspect may be different servers. Alternatively, two or more servers of the fourth, fifth, and sixth aspect may be implemented on the same server.
- FIG. 1 illustrates a flowchart of a method for generating a dataset for training an image depth estimation neural network.
- FIG. 2 illustrates an example of a flow of data during implementation of a method for generating a dataset for training an image depth estimation neural network.
- FIG. 3 illustrates a flowchart of a method for generating an image depth estimation neural network.
- FIG. 4 illustrates an example of a flow of data during implementation of a method for generating an image depth estimation neural network.
- FIG. 5 illustrates a flowchart of a method for constructing a 3D model of a scene.
- Fig. 6 illustrates an example of a flow of data during implementation of a method for constructing a 3D model of a scene.
- Fig. 1 illustrates a flowchart of a method 100 for generating a dataset for training an image depth estimation neural network 50. It should be understood that the steps of the method 100 do not necessarily need to be performed in the order depicted in Fig. 1.
- the method 100 will hereinafter be described, by way of example, using the flow of data illustrated in Fig. 2. However, it should be understood that other implementations of the method 100 may also be possible.
- the method 100 comprises receiving 102 an image series 10 comprising a plurality of images 11 of a scene, each image 11 being an image acquired by a camera, the camera having a position at the time of the acquisition of the image. As illustrated in Fig. 2, the images 11 of the image series 10 may be associated with other types of data than image data.
- Each image may e.g. be associated with an indication 12 of a measured camera position, and/or an indication 14 of a camera model, and/or an indication 16 of an associated focal length.
- the mentioned indications 12, 14, 16 may be metadata associated with the image 11. It should be understood that Fig. 2 is a schematic illustration of the images 11. The indications 12, 14, 16 may not be part of the image information that depicts the scene.
- the method 100 further comprises receiving 104, for each image of the image series, a measured camera position, the measured camera position being the position of the camera measured at the acquisition of the image.
- the measured camera position may e.g. be received 104 as an indication 12 of a measured camera position from metadata associated with the image 11.
- the indication 12 of a measured camera position may e.g. be a position measured with a positioning system, e.g. a GPS, Wi-Fi, or cell-ID positioning system, at the time the image 11 was acquired.
- the method 100 further comprises forming 106 a 3D reconstruction of the scene, the 3D reconstruction comprising coordinates of a 3D model of the scene and a reconstructed camera position for each image, the reconstructed camera position being an estimate of the position of the camera relative to the 3D model at the acquisition of the image, wherein the 3D reconstruction of the scene is formed by running a structure from motion algorithm on the image series, wherein the structure from motion algorithm is configured to align the 3D reconstruction to the measured positions.
- the structure from motion algorithm may herein be the algorithm implemented in OpenSfM [https://github.com/mapillary/OpenSfM].
- Aligning the 3D reconstruction to the measured positions may be done through bundle adjustment.
- an initial 3D reconstruction may be improved by the bundle adjustment, where the coordinates of a 3D model of the scene and the coordinates of the reconstructed camera positions are simultaneously refined according to one or more cost functions that penalize 3D reconstructions that deviate from what is believed to be the ground truth.
- a cost function that is proportional to the squared distance between the reconstructed camera position and the corresponding measured camera position may be used. For example, such a cost function may be added to the alignment step in OpenSfM. This may impose a scale on the 3D reconstruction, wherein the scale matches the scale given by the measured camera positions.
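The squared-distance term above can be sketched as a position prior added to the bundle-adjustment objective. The helper below is a hedged illustration with hypothetical names; OpenSfM's actual alignment code is structured differently.

```python
def position_prior_cost(reconstructed_positions, measured_positions, weight=1.0):
    """Sum of weighted squared distances between reconstructed and
    measured camera positions. Adding this term to the bundle-adjustment
    objective pulls the reconstruction onto the scale given by the
    measured (e.g. GPS) positions. Hypothetical helper for illustration."""
    total = 0.0
    for rec, meas in zip(reconstructed_positions, measured_positions):
        total += weight * sum((r - m) ** 2 for r, m in zip(rec, meas))
    return total
```

Because the term is proportional to squared distance, a reconstruction whose cameras sit exactly on the measured positions contributes zero cost, and any residual offset is penalized quadratically.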
- the method 100 further comprises calculating 108 at least one depth measure of at least one image 11 of the image series based on the 3D reconstruction, wherein a depth measure of an image 11 is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image.
- Depth measures may be calculated e.g. through a Patch-Match based multi-view stereo algorithm such as [S. Shen, IEEE Transactions on Image Processing, 22(5):1901-1914, May 2013]. This may be seen as a simple winner-takes-all stereo algorithm. Different depth and normal values may be tested for each pixel and the one that gives the best normalized cross correlation score with the neighboring views may be kept. The result may be a dense but noisy depth map. Most of the noise in the depth maps may be removed in a post-processing step that checks the consistency between the depth maps of neighboring images. Depth values that are not consistent with at least two neighboring views may be removed. This may reduce the number of pixels for which a depth value is produced.
- Depth maps for which the number of pixels with a depth value is below a threshold value, e.g. below 5% of the total number of pixels in the image 11, may be discarded together with the corresponding image 11.
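The coverage check above can be sketched as a simple filter. This is a minimal example under the assumption that pixels without a depth value are marked with 0 or None; the 5% threshold follows the example in the text.

```python
def keep_depth_map(depth_map, min_valid_fraction=0.05):
    """Return True if enough pixels of the depth map carry a depth
    value to keep the depth map (and its image) in the dataset.
    depth_map: 2D list where 0 or None marks pixels with no depth."""
    total = sum(len(row) for row in depth_map)
    valid = sum(1 for row in depth_map for d in row if d)
    return total > 0 and valid / total >= min_valid_fraction
```

A depth map with only a handful of surviving depth values would add little supervision signal, so discarding it (together with its image) keeps the dataset informative.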
- the method 100 may thus calculate 108 at least one depth measure of at least one image 11 in the form of a depth map 23 of the at least one image 11 , wherein the at least one depth measure may correspond to the pixels of the depth map having a depth value.
- the at least one depth measure may also be represented in other ways, e.g. as a single value or as a vector of values.
- the method 100 further comprises forming 110 the dataset 20 out of pairs 21 of image data and depth data, wherein each pair 21 comprises an image 11 of the image series 10 as image data and the at least one depth measure of the corresponding image as depth data.
- the pairs 21 of image data and depth data are pairs 21 of images 11 and their corresponding depth maps 23. As illustrated in Fig. 2 there may be fewer images 11 in the dataset 20 than in the original image series 10.
- Fig. 3 illustrates a flowchart of a method 200 for generating an image depth estimation neural network 50. It should be understood that the steps of the method 200 do not necessarily need to be performed in the order depicted in Fig. 3.
- the method 200 comprises receiving 202 a first set of images, the first set of images being a plurality of images of a scene taken by one or more cameras.
- the method 200 further comprises receiving 204, for each image in the first set of images, an associated focal length, the associated focal length being an estimate of a focal length of the camera that took the image.
- the associated focal length may be received in several different ways.
- an image of the first set of images may be associated with an indication 14 of a camera model, e.g. GoPro Hero 8.
- the indication 14 of the camera model may be stored in metadata associated with the image. Images from other cameras of the same camera model may have been analyzed on a prior occasion and an estimate of the focal length may have been extracted and stored in a database.
- given an indication 14 of a camera model, the associated focal length may be received from the database.
- the associated focal length may also be stored directly in metadata of the image, e.g. as an indication 16 of the associated focal length. Said indication 16 may e.g. be provided by the manufacturer of the camera or added as metadata in a prior analyzing step of the image.
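The two sources described above (a focal-length tag in the image metadata, falling back to a per-camera-model database) can be sketched as a lookup chain. The metadata keys and database entries below are hypothetical, chosen only for illustration.

```python
# Hypothetical camera-model -> focal-length database, populated on a
# prior occasion by analyzing images from cameras of the same model.
FOCAL_DB = {"GoPro Hero 8": 3.0}

def associated_focal_length(metadata):
    """Resolve an image's associated focal length: prefer an explicit
    focal-length entry in the image metadata, then fall back to a
    database lookup keyed on the camera model. Returns None when
    neither source is available."""
    if "focal_length" in metadata:
        return metadata["focal_length"]
    model = metadata.get("camera_model")
    return FOCAL_DB.get(model)
```

In practice the metadata would come from e.g. EXIF tags, and the database from a prior structure-from-motion analysis as described earlier in the text.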
- the method 200 further comprises transforming 206 the first set of images into a set 30 of normalized training images, the set 30 of normalized training images representing how images of the first set of images would appear if the images of the set had a joint focal length, wherein transforming 206 an image of the first set of images into a normalized training image comprises rescaling the image, the rescaling representing a change in the associated focal length of the image such that it approaches the joint focal length.
- the joint focal length of the method 200 may be a pre-determined focal length; it need not necessarily be one of the associated focal lengths of the images of the first set of images.
- Transforming 206 an image such that the associated focal length of the image approaches the joint focal length may mean rescaling the image such that the associated focal length of the rescaled image is equal to the joint focal length. However, it may also mean that the associated focal length of the rescaled image is similar to the joint focal length.
- the rescaling of an image of the first set of images may comprise scaling the image by a factor.
- the factor may be the joint focal length of the set of normalized training images divided by the focal length associated with the image.
- the associated focal length of the rescaled image may be equal to the joint focal length.
- the rescaling of an image of the first set of images may further comprise: cropping the image if the focal length associated with the image is smaller than the joint focal length, or padding the image if the focal length associated with the image is larger than the joint focal length. If the factor is larger than one, the fraction of the image corresponding to the fraction of the factor above one may be cropped. Padding may be implemented analogously.
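The scaling factor and the crop/pad rule above can be sketched as follows. This is a hedged example: it only computes the factor and the required correction, under the assumption that focal lengths are expressed in consistent units (e.g. pixels); the actual image resampling is omitted.

```python
def normalize_to_joint_focal(width, height, focal, joint_focal):
    """Compute the rescale factor (joint focal length divided by the
    image's associated focal length) and the correction needed to
    restore the original image dimensions: crop when the image is
    upscaled (factor > 1), pad when it is downscaled (factor < 1)."""
    factor = joint_focal / focal
    new_w, new_h = round(width * factor), round(height * factor)
    if factor > 1:
        action = "crop"   # associated focal length smaller than joint
    elif factor < 1:
        action = "pad"    # associated focal length larger than joint
    else:
        action = "none"
    return factor, new_w, new_h, action
```

For instance, an image whose associated focal length is half the joint focal length is scaled up by a factor of two and then cropped back to its original dimensions.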
- the method 200 further comprises training 208 the neural network to predict at least one depth measure of an image, wherein training the neural network comprises providing the neural network with a training dataset of pairs of input data 54 and output data 56, wherein the input data 54 of the training dataset comprises the set 30 of normalized training images.
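Since the calculated depth maps are sparse (many pixels were filtered out in the consistency check), a training objective must ignore pixels without a target depth. The sketch below shows one such masked loss in plain Python; the loss choice and data layout are assumptions, and a real setup would use a tensor framework.

```python
def masked_depth_loss(predicted, target):
    """Mean squared error over pixels that carry a ground-truth depth.
    Pixels whose target depth is 0 or None (no depth value survived
    the filtering) do not contribute to the training loss."""
    errors = []
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            if t:  # only supervise pixels with a valid depth value
                errors.append((p - t) ** 2)
    return sum(errors) / len(errors) if errors else 0.0
```

The same masking idea applies whatever the concrete loss (L1, scale-invariant, etc.); the essential point is that unsupervised pixels are excluded from the average.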
- Fig. 4 illustrates an example of a flow of data leading to providing the neural network with a training dataset of pairs of input data 54 and output data 56, such that it may be trained to form an image depth estimation neural network 50.
- However, it should be understood that other implementations of the method 200 may also be possible.
- In Fig. 4 the image depth estimation neural network 50 is trained using data derived from one or more datasets 20 generated as described in conjunction with Figs. 1 and 2.
- the images 11 of the dataset 20 herein form the first set of images.
- while the images 11 of the dataset 20 may have been derived using an image series 10 of images 11 that were interrelated in some way, e.g. depicting the same scene from different viewpoints or angles, having some image overlap etc., this may not necessarily be the case for the images of the first set of images.
- a first image series 10 of interrelated images with corresponding measured camera positions may be used to generate a first dataset 20 and a second image series 10 of interrelated images with corresponding measured camera positions may be used to generate a second dataset 20.
- the first set of images with associated depth maps may be formed from one image/depth map from the first image series 10, one image/depth map from the second image series 10, and so forth.
- when training the image depth estimation neural network 50, there may not be any requirement on the images being interrelated, being images of the same scene, or being associated with measured camera positions.
- the only requirement on the images may be that their depth maps are accurate. Each depth map may in turn be accurate due to it being formed from an image series 10 of interrelated images which are associated with measured camera positions. However, at the point of training the image depth estimation neural network 50, such requirement may no longer be needed.
- the images 11 of the dataset 20 are transformed 206 into a set 30 of normalized training images.
- the normalized training images are annotated 11' in Fig. 4 to indicate that they are images derived from images 11.
- the set 30 of normalized training images then forms the input data 54 for training the neural network.
- each image 11 is associated with a depth map representing at least one depth measure of the image.
- the depth maps 23 may be transformed 220 into a transformed depth map 23’, wherein the transformation mimics the transformation of the corresponding image 11.
- each pixel in the transformed depth map 23' may be linked to the same image information as before the transformation. If the position of e.g. a stop sign changes during the transformation of the image, the depth measure of the stop sign may move accordingly within the depth map.
- the transformed depth map 23’ may subsequently be used as output data 56 for training the neural network. With said input data 54 and output data 56 the neural network may be trained to form the image depth estimation neural network 50.
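Applying the same rescale factor to the depth map as to its image can be sketched with a nearest-neighbour lookup, which keeps each depth value linked to the same image content. This is a minimal illustration; it deliberately avoids interpolating between sparse depth values, since averaging a valid depth with a "no depth" marker would corrupt the supervision.

```python
def rescale_depth_map(depth_map, factor):
    """Rescale a 2D depth map by the same factor as its image, using
    nearest-neighbour lookup so that each output pixel carries the
    depth value of the source pixel it came from."""
    h, w = len(depth_map), len(depth_map[0])
    new_h, new_w = round(h * factor), round(w * factor)
    return [[depth_map[min(int(i / factor), h - 1)][min(int(j / factor), w - 1)]
             for j in range(new_w)]
            for i in range(new_h)]
```

Upscaling by a factor of two, for example, turns each depth pixel into a 2x2 block at the position where the corresponding image content ends up.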
- Fig. 5 illustrates a flowchart of method 300 for constructing a 3D model of a scene. It should be understood that the steps of the method 300 do not necessarily need to be performed in the order depicted in Fig. 5. The method 300 will hereinafter be described, by way of example, using the flow of data illustrated in Fig. 6. However, it should be understood that other implementations of the method 300 may also be possible.
- the method 300 comprises receiving 302 a set 60 of scene images 61, the set 60 of scene images 61 being images of the scene taken from a plurality of viewpoints. Each image of the set 60 of scene images 61 may at least partially overlap with another image of the set 60 of scene images 61.
- the method 300 further comprises receiving 304, for each of the scene images 61, an associated focal length, the associated focal length being an estimate of the focal length of the camera that took the scene image.
- the associated focal length may be received in several different ways, as previously described.
- an image 61 of the set 60 of scene images may be associated with an indication 14 of a camera model, e.g. an indication 14 stored in metadata associated with the image.
- the associated focal length may be received from a database.
- the associated focal length may also be stored directly in metadata of the image, e.g. as an indication 16 of the associated focal length.
- the method 300 further comprises transforming 306 the set 60 of scene images 61 into a set 70 of normalized images 61', the set 70 of normalized images 61' representing how the set 60 of scene images 61 would appear if the images of the set had a joint focal length, wherein transforming a scene image 61 into a normalized image 61' comprises rescaling the image to represent a change in the associated focal length of the image such that it approaches the joint focal length.
- the joint focal length of the method 300 may be a pre-determined focal length.
- the rescaling of an image of the set 60 of scene images 61 may be performed analogously to the rescaling of an image of the first set of images.
- the rescaling of an image of the set 60 of scene images 61 may comprise scaling the image by a factor.
- the factor may be the joint focal length of the set of normalized images divided by the focal length associated with the image.
- the associated focal length of the rescaled image may be equal to the joint focal length.
- the rescaling of an image of the set 60 of scene images 61 may further comprise: cropping the image if the focal length associated with the image is smaller than the joint focal length, or padding the image if the focal length associated with the image is larger than the joint focal length. If the factor is larger than one, the fraction of the image corresponding to the fraction of the factor above one may be cropped. Padding may be implemented analogously.
- the method 300 further comprises obtaining 308 at least one estimate of a depth measure of at least one image of the set of normalized images using an image depth estimation neural network 50, wherein a depth measure of an image is a representation of a distance between a viewpoint from which the image was taken and a point in the scene of the image, wherein the image depth estimation neural network is a neural network that estimates a depth measure of an image.
- the image depth estimation neural network 50 may be a neural network generated according to the method 200 for generating an image depth estimation neural network 50. Furthermore, the joint focal length of the method 300 for constructing a 3D model of a scene may be equal to the joint focal length of the method 200 for generating the image depth estimation neural network 50. Thus, images may be provided to the image depth estimation neural network 50 in the same format, i.e. with the same joint focal length, as the image depth estimation neural network 50 has been trained to handle.
- the at least one estimate of a depth measure may e.g. be an estimated depth map 83 of an image 61' of the set 70 of normalized images 61'.
- the method 300 further comprises processing images associated with the set 60 of scene images 61 together with the at least one estimate of a depth measure using a structure from motion algorithm 90 to construct the 3D model of the scene.
- the image depth estimation neural network 50 may produce an estimated depth map 83.
- the structure from motion algorithm 90 may then process each normalized image 61' with its corresponding estimated depth map 83 to construct the 3D model of the scene.
- the structure from motion algorithm 90 may also process another image associated with the scene image 61, e.g. the scene image 61 itself or another transformation of the scene image 61.
- the estimated depth map 83 is transformed correspondingly such that each pixel in the transformed estimated depth map 83 is linked to the correct image information in the transformation of the scene image 61.
- the structure from motion algorithm may reconstruct a camera position and camera orientation for each scene image 61 using a structure from motion pipeline or a simultaneous localization and mapping (SLAM) pipeline.
- An object may then be detected, using an object detection pipeline, in at least two images of the set of scene images. For each of the at least two images an estimate of a depth measure for at least one point on the object may be found. For each image of the at least two images an object depth measure may then be derived from these estimates of depth measures of the object.
- in this example the object is a stop sign. There may exist a plurality of depth measures for the object in each image, e.g. one depth measure for each point on the object for which a depth estimate exists.
- the plurality of depth measures of the object may form one object depth measure.
- the object depth measure may thus represent a distance between the viewpoint of the camera and the object, in this case the stop sign, at the time of the acquisition of the image.
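Collapsing the plurality of depth measures on an object into one object depth measure can be sketched as taking the median over the object's pixels. The use of the median here is an assumption chosen for illustration (it is robust to outlier depth estimates); the text does not prescribe a particular aggregation.

```python
def object_depth_measure(depth_map, object_mask):
    """Derive one object depth measure from the depth values at the
    pixels covered by a detected object (e.g. a stop sign), via the
    median. object_mask is a list of (row, col) pixel coordinates;
    pixels without a depth value (0/None) are ignored."""
    depths = sorted(depth_map[i][j] for (i, j) in object_mask
                    if depth_map[i][j])
    if not depths:
        return None  # no valid depth on the object in this image
    mid = len(depths) // 2
    return depths[mid] if len(depths) % 2 else (depths[mid - 1] + depths[mid]) / 2
```

The resulting value represents the distance between the camera viewpoint and the object at the time the image was acquired.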
- An object position may then be formed in the 3D model by utilizing the object depth measure. Two examples of this are given below.
- each of the at least two images gives an estimated object position.
- the estimated object position of an image may be calculated from the reconstructed camera position, the reconstructed camera orientation, and the object depth measure.
- the estimated object position may be the reconstructed camera position translated by a distance given by the object depth measure in a direction given by the camera orientation and the object’s relation to the center pixel in the image.
- the estimated object positions, one from each of the at least two images, may then together form the object position.
- the estimated object positions may be clustered to form the object position.
- other calculations based on the estimated object positions may give the object position.
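Translating the reconstructed camera position by the object depth along the viewing ray can be sketched as below. The representation of the orientation as a 3x3 world-from-camera rotation matrix, and of the viewing direction as a unit ray in camera coordinates (derived from the object's offset from the image centre and the joint focal length), are assumptions for illustration.

```python
def estimated_object_position(cam_position, cam_rotation, pixel_ray, object_depth):
    """Estimate an object's 3D position from one image: start at the
    reconstructed camera position and move by object_depth along the
    object's viewing ray rotated into world coordinates.
    cam_rotation: 3x3 world-from-camera rotation matrix.
    pixel_ray: unit ray towards the object in camera coordinates."""
    world_ray = [sum(cam_rotation[i][k] * pixel_ray[k] for k in range(3))
                 for i in range(3)]
    return [cam_position[i] + object_depth * world_ray[i] for i in range(3)]
```

Repeating this for each of the at least two images yields one estimated object position per image; these can then be averaged or clustered to form the object position.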
- a triangulated object position may be found.
- a vector may be calculated from the reconstructed camera position of the image in a direction given by the camera orientation and the object’s relation to the center pixel in the image.
- the vectors may then cross to give a triangulated object position.
- Said triangulated object position may then be compared to an estimated object position of one of the at least two images. If the triangulated object position and the estimated object position match, the triangulated object position may form the object position.
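One common way to realize the triangulation above is the midpoint method: find the closest points on the two viewing rays and return their midpoint. The sketch below is an illustration of that choice, which the text does not prescribe; p1/p2 are reconstructed camera positions and d1/d2 the viewing-ray directions derived from the camera orientations and the object's pixel offset.

```python
def triangulate_midpoint(p1, d1, p2, d2):
    """Triangulate an object position from two camera rays by taking
    the midpoint of the closest points on the two rays. Returns None
    when the rays are (near) parallel and triangulation is unreliable."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    w0 = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel rays never converge on a point
    t1 = (b * e - c * d) / denom   # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom   # parameter of closest point on ray 2
    q1 = [p + t1 * v for p, v in zip(p1, d1)]
    q2 = [p + t2 * v for p, v in zip(p2, d2)]
    return [(u + v) / 2 for u, v in zip(q1, q2)]
```

When the two rays actually intersect, the two closest points coincide and the midpoint is the intersection itself.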
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE2050179 | 2020-02-17 | ||
PCT/US2021/018254 WO2021167910A1 (en) | 2020-02-17 | 2021-02-16 | A method for generating a dataset, a method for generating a neural network, and a method for constructing a model of a scene |
Publications (1)
Publication Number | Publication Date |
---|---|
EP4107699A1 true EP4107699A1 (en) | 2022-12-28 |
Family
ID=74860553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP21710811.7A Withdrawn EP4107699A1 (en) | 2020-02-17 | 2021-02-16 | A method for generating a dataset, a method for generating a neural network, and a method for constructing a model of a scene |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4107699A1 (en) |
CN (1) | CN115053260A (en) |
WO (1) | WO2021167910A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116152323B (en) * | 2023-04-18 | 2023-09-08 | 荣耀终端有限公司 | Depth estimation method, monocular depth estimation model generation method and electronic equipment |
CN117690095B (en) * | 2024-02-03 | 2024-05-03 | 成都坤舆空间科技有限公司 | Intelligent community management system based on three-dimensional scene |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
SG11202100949YA (en) * | 2018-08-08 | 2021-02-25 | Abyssal S A | System and method of operation for remotely operated vehicles for simultaneous localization and mapping |
- 2021
- 2021-02-16 EP EP21710811.7A patent/EP4107699A1/en not_active Withdrawn
- 2021-02-16 WO PCT/US2021/018254 patent/WO2021167910A1/en unknown
- 2021-02-16 CN CN202180011373.7A patent/CN115053260A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN115053260A (en) | 2022-09-13 |
WO2021167910A1 (en) | 2021-08-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112785702B (en) | SLAM method based on tight coupling of 2D laser radar and binocular camera | |
US10334168B2 (en) | Threshold determination in a RANSAC algorithm | |
CA2326816C (en) | Face recognition from video images | |
KR101791590B1 (en) | Object pose recognition apparatus and method using the same | |
US10909395B2 (en) | Object detection apparatus | |
US9959625B2 (en) | Method for fast camera pose refinement for wide area motion imagery | |
CN106447601B (en) | Unmanned aerial vehicle remote sensing image splicing method based on projection-similarity transformation | |
CN104077760A (en) | Rapid splicing system for aerial photogrammetry and implementing method thereof | |
AliAkbarpour et al. | Fast structure from motion for sequential and wide area motion imagery | |
WO2020221443A1 (en) | Scale-aware monocular localization and mapping | |
CN111882655B (en) | Method, device, system, computer equipment and storage medium for three-dimensional reconstruction | |
EP4107699A1 (en) | A method for generating a dataset, a method for generating a neural network, and a method for constructing a model of a scene | |
AliAkbarpour et al. | Parallax-tolerant aerial image georegistration and efficient camera pose refinement—without piecewise homographies | |
CN114627491A (en) | Single three-dimensional attitude estimation method based on polar line convergence | |
Hallquist et al. | Single view pose estimation of mobile devices in urban environments | |
Bethmann et al. | Object-based multi-image semi-global matching–concept and first results | |
CN111325828A (en) | Three-dimensional face acquisition method and device based on three-eye camera | |
CN117456114B (en) | Multi-view-based three-dimensional image reconstruction method and system | |
Tsaregorodtsev et al. | Extrinsic camera calibration with semantic segmentation | |
CN112270748B (en) | Three-dimensional reconstruction method and device based on image | |
CN113808103A (en) | Automatic road surface depression detection method and device based on image processing and storage medium | |
EP1580684B1 (en) | Face recognition from video images | |
KR102522923B1 (en) | Apparatus and method for estimating self-location of a vehicle | |
CN117011481A (en) | Method and device for constructing three-dimensional map, electronic equipment and storage medium | |
Wang et al. | Fast and accurate satellite multi-view stereo using edge-aware interpolation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20220819 |
| | AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
| | 18W | Application withdrawn | Effective date: 20240109 |