WO2024099593A1

WO2024099593A1 - Localization based on neural networks

Info

Publication number: WO2024099593A1
Application number: PCT/EP2023/058331
Authority: WO
Inventors: Arthur MOREAU; Moussab BENNEHAR; Nathan PIASCO; Dzmitry Tsishkou
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2022-11-09
Filing date: 2023-03-30
Publication date: 2024-05-16

Abstract

The present disclosure relates to the localization of a moving device, for example, a partially or fully autonomous vehicle, by means of neural networks. It is provided a method of determining a position of a device comprising a sensor device, comprising the steps of obtaining by the sensor device sensor data representing an environment of the device, generating a first descriptor map based on the sensor data by a first neural network, inputting input data based on the sensor data into a second neural network different from the first neural network, outputting by the second neural network descriptors based on the input data, volumetric rendering the descriptors to obtain a second descriptor map, matching the first descriptor map with the second descriptor map and determining a pose of the sensor device based on the matching.

Description

LOCALIZATION BASED ON NEURAL NETWORKS

TECHNICAL FIELD

The present disclosure relates to the localization of a moveable device, for example, an autonomous vehicle, comprising a sensor device, for example, a camera device or LIDAR- camera device, based on neural networks configured for processing sensor data provided by the sensor device.

BACKGROUND

Localization is an important task for the operation of vehicles as automobiles, Automated Guided Vehicles (AGV) and autonomous mobile robots or other mobile devices as smartphones. For example, LIDAR-camera sensing systems comprising one or more Light Detection and Ranging, LIDAR, device configured for obtaining a temporal sequence of 3D point cloud data sets for sensed objects and one or more camera devices configured for capturing a temporal sequence of 2D images of the objects are employed in automotive applications. In the automotive context, the LIDAR-camera sensing systems can be comprised by Advanced Driver Assistant Systems (ADAS).

Structure based and learning based methods of visual localization are known. Visual structure based localization relies on a database of reference images collected for an environment of navigation. A three-dimensional map of triangulated key points with corresponding descriptors is reconstructed from the reference images, for example, by means of Structure from Motion, SfM, algorithms. Localization algorithms are used to compute the actual position of a device comprising a camera in the three-dimensional map in real time from a query image captured by the camera. Feature vectors with dimensions given by a number of key points are extracted from the query image and matched with reference feature vectors extracted from the reference images and represented by the three-dimensional map in order to obtain estimates of the camera poses needed for the localization.

Such visual structure based localization techniques provide relatively accurate camera poses but suffer from high computational costs and memory demands.

Learning based methods of visual localization make use of deep neural networks trained based on reference images and reconstructed three-dimensional reference maps. Local descriptors can be used for the mapping of similar key point patches to clusters in feature space and the local descriptors can be generic and not predefined but learned by a deep neural network (see M. Jahrer et al., “Learned local descriptors for recognition and matching”, Computer Vision Winter Workshop 2008, Moravske Toplice, Slovenia, February 4-6, P. Napoletano, “Visual descriptors for content-based retrieval of remote sensing images”, International Journal of Remote Sensing, 2018, , 39:5, pages 1343-1376 and A. Moreau et al., “ImPosing: Implicit Pose Encoding for Efficient Visual Localization”, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), Jan 2023, Waikoloa Village, United States, pages 2893- 2902).

However, despite the recent engineering progress learning based methods of visual localization still seem to suffer from insufficient accurateness of camera pose estimations, for example, for query images captured from viewing directions that significantly differ from the viewing directions of reference images used for the training.

SUMMARY

In view of the above, it is an objective underlying the present application to provide a technique for accurate localization of a device based on sensor data provided by a sensor device that can be suitably implemented in embedded computational systems with limited computation resources.

The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect, it is provided a method of determining a position of a device comprising a sensor device, comprising the steps of obtaining by the sensor device sensor data representing an environment of the device, generating a first descriptor map based on the sensor data by a first neural network, inputting input data based on the sensor data into a second neural network different from the first neural network, outputting by the second neural network descriptors based on the input data, volumetric rendering the descriptors to obtain a second descriptor map, matching (comparing in order to find matches) the first descriptor map with the second descriptor map and determining a pose of the sensor device based on the matching. It goes without saying that herein the term “neural network” refers to an artificial neural network. The device may be a vehicle, for example, a fully or partially autonomous automobile, an autonomous mobile robot or an Automated Guided Vehicle (AGV). The sensor device may be a camera device or a Light Detection and Ranging, LIDAR, device. The camera device may, for example, be a Time-of-Flight camera, depth camera, etc., and the LIDAR device may, for example, be a Micro-Electro-Mechanical System, MEMS, LIDAR device, solid state LIDAR device, etc.

The first neural network is trained for descriptor/feature extraction from query sensor data obtained by the sensor device, for example, during movement of the device in the environment. The second neural network is trained for processing data based on the sensor data to obtain local descriptors, for example, local descriptors for each pixel of an image captured by a camera or each point of a three-dimensional point cloud captured by a LIDAR device.

The first neural network may be a (deep) convolutional neural network (CNN). It may be based on one of the neural network architectures used for learned feature extraction known in the art (see exampled given by M. Jahrer et al., “Learned local descriptors for recognition and matching”, Computer Vision Winter Workshop 2008, Moravske Toplice, Slovenia, February 4-6, P. Napoletano, “Visual descriptors for content-based retrieval of remote sensing images”, International Journal of Remote Sensing, 2018, , 39:5, pages 1343-1376 and A. Moreau et al., “ImPosing: Implicit Pose Encoding for Efficient Visual Localization”, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), Jan 2023, Waikoloa Village, United States, pages 2893-2902).

The second neural network may comprise a (deep) Multilayer Perceptron, MLP, (fully connected feedforward) neural network. Particularly the second neural network may be trained based on the Neural Radiance Field (NeRF) technique introduced by B. Mildenhall et al. in a paper, entitled “Nerf: Representing scenes as neural radiance fields for view synthesis” in “Computer Vision - ECCV 2020”, 16^th European Conference, Glasgow, UK, August 23-28, 2020, Springer, Cham, 2020, or any advancement thereof that nowadays have become a favorite tool for view synthesis.

Input data for the visual NeRF neural network represents 3D locations (x, y, z) and viewing directions/angles (0, (p) of the camera device and the NeRF trained neural network outputs a neural field comprising view dependent color values (for example RGB) and volumetric density values c. Thus, the MLP realizes F©: (x, y, z, 9, (p) -> (R, G, B, c) with optimized weights 0 obtained during the training process. The neural field can be queried at multiple locations along rays for volume rendering (see detailed description below). The neural network representation of the environment captured by the sensor devices is given by the neural field used for the subsequently performed volumetric rendering that results in a rendered image or point cloud, for example. Differences between the rendered image and the mapped sensor data are to be minimized in the training process. If 3D point could data is processed rather than images, the NeRF neural network can be used for rendering point clouds. While in synthetic image generation applications the camera poses for a plurality of images input into the NeRF neural network are known, in localization applications the camera device (or LIDAR device) poses are to be determined starting from a first guess (pose prior) iteratively (see detailed description below).

Such a NeRF trained neural network or variations thereof can be used for or comprised by the second neural network used in the method according to the first aspect, however, by additionally including descriptors in the implicit function (see also detailed description below). The input data based on the sensor data, in this case, comprises three-dimensional locations and viewing directions (together resulting in poses) and the output data comprises descriptors and volumetric densities.

For example, the second neural network may be trained based on the highly evolved NeRF-W technique (appearance code) that reliably takes into account illumination dynamics (see R. Martin-Brualla et al., “NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections”, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021). Input data for the second neural network in an implementation in that such a NeRF-W trained neural network is used includes appearance embeddings (and optionally transient embeddings); see also detailed description below.

The method of determining a position of a device by means of the first and second neural networks according to the first aspect allows for very accurate localization of the device without the need of powerful expensive computational and memory resources. Contrarily to structure based visual localization techniques of the art, no matching with three-dimensional maps generated from reference sensor data that need much memory space is needed but rather matching processes are based on outputs of the trained neural networks that can be implemented with relatively low memory demands. According to an implementation of the method of the first aspect, the descriptors are independent of a viewing direction (or viewpoint) of the sensor device. This property is similar to the one of the volumetric density and unlike to colors output by NeRF trained neural networks. Due to the independence of the descriptors on the viewing direction localization failures caused by texture-less content or reference data obtained from viewing directions that significantly differ from the one of query sensor data obtained during the localization process can be considerably reduced.

According to an implementation, the descriptors represent local content of the input data and three-dimensional positions of data points of the input data (pixels of images or points of point clouds). This implementation is not based on features comprising key points and associated descriptors but rather each of the points of the input data can be used which may result in an enhanced accuracy of the matching results.

According to an implementation, the method according to the first aspect or any implementation thereof further comprises obtaining a depth map by the second neural network and the matching of the first descriptor map with the second descriptor map is based on the obtained depth map. The depth map is obtained by volumetric rendering of values of volumetric densities output by the second neural network. The information of the depth map can be used for avoiding matching of descriptor structures of the descriptor maps that are geometrically far from each other (by at least some pre-determined distance threshold), i.e., it is ensured that the descriptor structures of one of the descriptor maps that are geometrically far from descriptor structures of the other one of the descriptor maps are considered being dissimilar to each other.

According to an implementation, the pose of the sensor device is iteratively determined starting from a pose prior (first guess of the pose). The iteration can be performed based on a Perspective-N-Points (PnP) method combined with a Random Sample Consensus (RANSAC) algorithm in order to get a robust estimate for the pose by discarding outliers matches. The pose prior may be obtained may be obtained by matching a global image descriptor against an image retrieval database or an implicit map as known in the art.

According to an implementation, the first neural network is trained conjointly with the second neural network for the environment based on matching first training descriptor maps obtained by the first neural network and second training descriptor maps obtained by volumetric rendering of training descriptors output by the second neural network. When the neural networks are conjointly trained in this manner, it can be guaranteed that in the matching procedure during actual localization corresponding descriptor structures are identified as matching with each other.

Particularly, according to such an implementation, the descriptor/feature extraction is learned for the environment/scenes in which localization is to be performed. Rather than using an off- the shelf feature extractor the first neural network is trained scene-specifically in this implementation which might even further increase accuracy of the pose estimates and, thus, localization results.

Implementations of the method of the first aspect may comprise training of the first and second neural networks. According to an implementation, the sensor device is the camera device and the method further comprises conjointly training the first and second neural networks for the environment comprising obtaining training image data by a training camera device for different training poses of the training camera device, inputting training input data based on the training image data into the first neural network and inputting training pose data according to the different training poses of the training camera device based on the training image data into the second neural network, outputting by the first neural network first training descriptor maps based on the training input data, outputting by the second neural network training color data, training volumetric density data, and training descriptor data, and rendering the training color data, training volumetric density data, and training descriptor data to obtain rendered training images, rendered training depth maps and rendered second training descriptor maps, respectively. Further, the method according to this implementation, comprises minimizing a first objective function representing differences between the rendered training images and corresponding pre-stored reference images or maximizing a first objective function representing similarities between the rendered training images and pre-stored reference images and minimizing a second objective function representing differences between the first training descriptor maps and corresponding rendered second training descriptor maps or maximizing a second objective function representing similarities between the first descriptor maps and corresponding rendered second training descriptor maps.

This procedure allows for an efficient conjoint training of the first and second neural networks for accurate localization based on poses of the camera device estimated by the matching of the descriptor maps with each other (i.e., based on descriptor maps that match with each other).

A similar training procedure can be performed for the case that the sensor device is a LIDAR device. In this implementation, the method comprises conjointly training the first and second neural networks for the environment comprising obtaining training three-dimensional point cloud data by a training LIDAR device for different training poses of the training LIDAR device, inputting training input data based on the training three-dimensional point cloud data into the first neural network and inputting training pose data according to the different training poses of the training LIDAR device based on the training three-dimensional point cloud data into the second neural network, outputting by the first neural network first training descriptor maps based on the training input data, outputting by the second neural network training volumetric density data and training descriptor data, and rendering the training volumetric density data and training descriptor data to obtain rendered training depth maps and rendered second training descriptor maps, respectively. Further, the method according to this embodiment comprises minimizing a first objective function representing differences between the rendered training depth maps and corresponding pre-stored reference depth maps or maximizing a first objective function representing similarities between the rendered training depths maps and pre-stored reference depth maps and minimizing a second objective function representing differences between the first training descriptor maps and corresponding rendered second training descriptor maps or maximizing a second objective function representing similarities between the first descriptor maps and corresponding rendered second training descriptor maps.

According to another implementation, the training process described above further comprise applying a loss function based on the rendered training depth maps to suppress minimization or maximization of the second objective function for data points of a first training descriptor map of the first training descriptor maps that are geometrically distant from data points of a corresponding rendered second training descriptor map of the rendered second training descriptor maps by more than a pre-determined threshold. Taking into account this loss function allows for avoiding comparing distinct descriptor structures of the descriptor maps with each other that actually do not represent common features of the environment.

At least one of the steps of the method according to the first aspect and implementations thereof may be performed at the device site with even limited computational resources of the embedded computational system or at a remote site that is provided with the data needed for processing/localization of the device.

According to a second aspect, a computer program product comprising computer readable instructions for, when run on a computer performing or controlling the steps of the method according to the first aspect or any implementation thereof is provided. The computer may be installed in the device, for example, a vehicle.

According to a third aspect, a localization device is provided. The method according to the first aspect and any implementations thereof may be implemented in the localization device according to the third aspect. The localization device according to the third aspect and any implementation thereof can provide the same advantages as described above.

The localization device according to the third aspect comprises a sensor device (for example, a camera device or a Light Detection and Ranging, LIDAR, device) configured for obtaining sensor data representing an environment of the sensor device, a first neural network configured for generating a first descriptor map based on the sensor data, and a second neural network different from the first neural network and configured for outputting descriptors based on input data that is based on the sensor data, The localization device according to the third aspect further comprises a processing unit configured for volumetric rendering the descriptors to obtain a second descriptor map, matching the first descriptor map with the second descriptor map and determining a pose of the sensor device based on the matching.

According to an implementation, wherein the descriptors are independent of a viewing direction of the sensor device. The descriptors may represent local content of the input data and three- dimensional positions of data points of the input data.

According to an implementation, the second neural network is further configured for obtaining a depth map and the processing unit is further configured for matching the first descriptor map with the second descriptor map based on the depth map.

According to an implementation, the processing unit is further configured for iteratively determining the pose of the sensor device starting from a pose prior.

According to an implementation, the first neural network is trained conjointly with the second neural network for the environment based on matching first training descriptor maps obtained by the first neural network and second training descriptor maps obtained by volumetric rendering of training descriptors output by the second neural network.

According to a fourth aspect, a vehicle comprising the localization system according to the third aspect or any implementation thereof is provided. The vehicle is for example, an (in particular, fully or partially autonomous) automobile, autonomous mobile robot or Automated Guided Vehicle (AGV).

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:

Figure 1 illustrates a technique of localization of a vehicle equipped with a sensor device according to an embodiment.

Figure 2 illustrates a neural network architecture for rendering descriptor maps comprised in a localization device according to an embodiment.

Figure 3 illustrates a technique of training a neural network suitable for usage by a localization device according to an embodiment.

Figure 4 illustrates a method of localizing a device using a neural network trained, for example, in accordance with the technique illustrated in Figure 3.

Figure 5 is a flow chart illustrating a method of localizing a device equipped with a sensor device according to an embodiment.

Figure 6 illustrates a localization device according to an embodiment.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Herein, it is provided a method of localizing a device, for example, a vehicle, equipped with a sensor device, for example, a camera device or a LIDAR device. Particularly, the method may be based on (Neural Radiance Field) NeRF scene representation. Whereas the following description of embodiments refers to the NeRF techniques other techniques for implicit representation of environments/scenes based on neural fields and volumetric rendering might be suitably used in alternative embodiments. High accuracy of localization results at relatively low computational costs can be achieved.

Figure 1 illustrates localization of a vehicle according to an embodiment. The vehicle is navigating 11 in a known environment. The vehicle is equipped with a sensor device, for example, a camera device or a LIDAR device, and captures sensor data, for example, images or 3D point clouds, representing the environment. Particularly, a query sensor data set, for example, a query image or query 3D point cloud, is captured 12 and used for localizing the vehicle in the known environment. The query sensor data set is input in a neural network, for example, a deep convolutional neural network, CCN, trained for descriptor extraction and a query descriptor map is generated 13 based on the query sensor data set and the extracted descriptors. The descriptors do not depend on viewing directions. Due to the independence of the descriptors on the viewing direction localization failures caused by texture-less content or reference data obtained from viewing directions that significantly differ from the one of query sensor data obtained during the localization process can be considerably reduced.

In the art, extracted descriptors or features are compared with a three-dimensional reference descriptor map generated from training sensor data sets. Employment of such three-dimensional reference descriptor maps suffers from high memory demands. Contrary to the art, according to the embodiment illustrated in Figure 1 another neural network is used 14 for providing another rendered descriptor map for matching with the query descriptor map provided by the neural network trained for descriptor extraction. Input data based on the query sensor data captured by the sensor device of the vehicle is input into the other neural network. For example, estimated three-dimensional location data and viewing direction data of the sensor device related to the captured query image or 3D point cloud is input into the other neural network that is configured to output descriptors at local positions (features). Theses descriptors are part of a neural field that may also comprise colors and volumetric density values. The neural field comprising local descriptors may be called a Neural Positional Features Field.

According to a particular implementation, the other neural network comprises or consists of one or more (deep) multilayer perceptrons (MLPs), i.e. it comprises or is a fully connected feedforward neural network, that is trained based on the Neural Radiance Field (NeRF) technique as proposed by B. Mildenhall et al. in a paper, entitled “Nerf: Representing scenes as neural radiance fields for view synthesis” in “Computer Vision - ECCV 2020”, 16^th European Conference, Glasgow, UK, August 23-28, 2020, Springer, Cham, 2020, or further developments thereof, for example, Nerf-W; see R. Martin-Brualla et al., “NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections”, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021. NeRF, as originally introduced by B. Mildenhall et al. allows for obtaining a neural network representation of the environment based on color values and spatially-dependent volumetric density values (representing the neural field).

Input data for the neural network represent 3D locations and viewing directions (0, (p) and the NeRF trained neural network outputs view dependent color values (for example RGB) and volumetric density values c. Thus, the MLP realizes F©: (x, y, z, 9, (p) -> (R, G, B, c) with optimized weights 0 obtained during the training process.

The volumetric rendering is based on rays passing through the scene (cast from all pixels of images). The volumetric density c(x, y, z) can be interpreted as the differential probability of a ray terminating at an infinitesimal particle at (x, y, z). By gathering all the volumetric density values along the ray direction, the accumulated transmittance T(s) along the ray direction can be computed T(s)

The accumulated transmittance T(s) along the ray from its origin 0 to s represents the probability that the ray travels its path to s without hitting any particle.

The implicit representation (neural field) is queried at multiple locations along the rays and then the resulting samples are composed into an image.

According to the embodiment illustrated in Figure 1 such kinds of neural networks can be employed. However, according to the embodiment they are not only trained for obtaining volumetric density values and color values (if a camera device is used as the sensor device) but also descriptors (modified NeRF trained neural network). Volumetric rendering of the descriptors output by the neural network results in the rendered (reference) descriptor map that is to be matched (compared) with the query descriptor map. Employment of the other neural network rather than a reference three-dimensional map as used in the art saves memory space (typically by a factor of about 1000) and, nevertheless, allows for very accurate pose estimates and, thus, localization results.

Query and reference descriptor maps are matched 15 with cosine similarity, for example. For example, two descriptors are a match (i.e., they are similar to each other), if the similarity is higher than a predetermined threshold and if they represent the best candidates in both descriptor maps in both direction (mutual matching). As shown in Figure 1, from the correspondences of the query descriptor map and the rendered descriptor map generated by volumetric rendering of the descriptors output by the other (modified NeRF trained) neural network poses (positions and viewing angles) of the sensor device can be calculated as known the art. For example, starting from a sensor device pose prior (first guess) the actual pose for which the query sensor data is obtained 16 can be iteratively estimated based on a Perspective-N-Points (PnP) method combined with a Random Sample Consensus (RANSAC) algorithm. A view observed from the pose prior should have an overlapping content with the query sensor data to make the matching process feasible. The pose prior may be obtained by matching a global image descriptor against an image retrieval database or an implicit map as known in the art. Similar to the procedure described by A. Moreau et al., “ImPosing: Implicit Pose Encoding for Efficient Visual Localization”, IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2023), Jan 2023, Waikoloa Village, United States, pages 2893-2902, an obtained sensor device pose estimate can be used as a new pose prior and iteration of the matching and PnP combined with RANSAC processing results in refinement of the sensor device pose estimate and, thus, the localization result. While classical 3D reference models only have access to a finite set of reference descriptors, reference descriptors can be computed from any camera pose by the other neural network which may result in increased an increased accuracy of the localization result.

Figure 2 illustrates a neural network architecture for rendering descriptor maps of a localization device equipped with a camera device according to a particular embodiment. The neural network illustrated is Figure is trained based on NeRF-W (R. Martin-Brualla et al., “NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections”, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, June 19-25, 2021, pages 7210- 7219, Computer Vision Foundation, IEEE, 2021). This kind of NeRF neural network is adapted to dynamic (illumination) changes in outdoor scenes thanks to appearance embeddings included in the implicit function. As shown in Figure 2, similar to the originally introduced NeRF neural network the NeRF-W neural network is fed by input data based on sensor data in form of 3D locations (x, y, z) 21 and viewing directions d. The NeRF-W neural network may be considered comprising a first Multilayer Perceptron (MLP) 22 and a second MLP comprising 23 a first logical part 23a and a second logical part 23b. The first MLP 22 receives the locations (x, y, z) as input data and outputs data comprising information on the locations (x, y, z) that is input into the second MLP 23. Further, the first MLP 22 outputs volumetric density values c that do not depend on viewing directions d. The first part 23a of the second MLP 23 is trained for outputting color values RGB for locations (x, y, z) and for viewing directions d input into the second MLP 23. The first part 23a of the second MLP 23 also receives appearance embeddings app to account for dynamic (illumination) changes of the captured scenes. The first part 23a of the second MLP 23 may or may not also receive transient embeddings.

Based on the (RGB) color values and volumetric density values c images can be generated by volumetric rendering. According to the embodiment illustrated in Figure 2 and different from the teaching by R. Martin-Brualla et al., “NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections”, in IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, June 19-25, 2021, pages 7210-7219, Computer Vision Foundation, IEEE, 2021, the second part 23b of the second MLP 23 is trained to output local descriptors for locations (x, y, z) that do not depend on viewing directions (angles) nor the appearance embeddings. Rendered (reference) descriptor maps can be generated by volumetric rendering of the local descriptors and provided for matching with query descriptor maps (confer description of Figure 1 above) for localization purposes.

Figure 3 illustrates a common training process for the neural network (feature extractor) 31 used for descriptor/feature extraction from images and the neural network (neural Tenderer) 32 used for rendering a (reference) descriptor map. The first neural network 31 may be a fully convolutional neural network with 8 layers, ReLU activations and max pooling layers and the second neural network 32 may be the neural network illustrated in Figure 2, for example. Both neural networks are conjointly trained in a self- supervised manner by defining an optimization objective (total loss function) which leverages the scene geometry. Thereby, descriptors specialized on the target scene are obtained which describe not only the visual content but also the 3D locations of the observed points, enabling to discriminate better than generic descriptors provided by off-the-shelf feature extractors. The resulting descriptors do not depend on viewing directions or appearance embeddings.

Training (for example, RGB) images I captured by a camera device are input in the neural network 31 that is configured to output first descriptor maps DesMl based on the input training images. For the same training images input data based on the training images is input in the neural network 32. In the embodiment shown in Figure 3, the input data are the poses (confer 3D location (x, y, z) and direction d describe with reference to Figure 2) of the camera device at which the training images I are captured. The poses may be obtained from the training images I by Structure from Motion, SfM, techniques, for example. Appearance embeddings may also be used by the neural network 32. The neural network 32 outputs training descriptors, volumetric density values o (i.e., depth information) and colors. Volumetric rendering results in second (rendered training) descriptor maps DesM2, rendered training depth maps DepM and a rendered training images IM, respectively. The volumetric rendering comprises aggregation of the descriptors, volumetric densities and color values along virtual camera rays as known in the art.

Photometric losses (mean squared error losses) LMSE are applied on the rendered training images IM supervised by the training images I to train the (radiance field) neural network 32 as known in the art. Further, the structural dissimilarity loss LSSIM may be minimized. Descriptor losses Lpos, Lneg are applied on the first and second training descriptor maps DesMl and DesM2 to train conjointly the two neural network 31 and 32. L_pos may be applied for maximizing similarities of the first and second training descriptor maps DesMl and DesM2 and L_neg may be applied to ensure that pixel pairs (pl, p2) of pixels pl of the first training descriptor maps DesMl and of pixels p2 of the second training descriptor maps DesM2 that have large geometric 3D distances from each other have dissimilar descriptors.

Regularization losses LTV may be applied on the rendered training depth maps DepM to further improve the quality of the geometry learned by the second neural network 32 by smoothing and limiting artifacts. Minimization of the structural dissimilarity loss LSSIM and the regularization losses LTV may improve localization accuracy.

These losses contribute to a total objective function that is to be optimized in the training process. Further details on the losses are given below.

In the embodiments illustrated in Figures 2 and 3 the sensor device used for the localization and training process, respectively, is a camera device. In alternative embodiments, the sensor device is a LIDAR device providing 3D point cloud data rather than images and the configurations shown in Figures 2 and 3 are to be modified in a straightforward manner.

Figure 4 illustrates an embodiment of a method of localizing a device comprising a sensor device by means of a first neural network configured for descriptor/feature extraction and a second neural network 42 configured for providing a neural field comprising descriptors. The second neural network 42 may be, for example, the neural network illustrated in Figure 2. For example, the first neural network 41 and the second neural network 42 are conjointly trained similar to the neural networks 31 and 32 illustrated in Figure 3.

During actual navigation of a moving device equipped with a sensor device, for example, a vehicle equipped with a camera device or a LIDAR device, a query sensor data set SD, for example, a query image or query 3D point cloud, is input in the first neural network 41 for descriptor extraction in order to generate a first descriptor map DesMl. Starting from a pose prior (first guess for the pose), i.e., location and viewing direction, the second neural network outputs descriptors, volumetric density values and color values for the generation of a second descriptor map DesM2, a depth map DepM and an (RGB) image IM, respectively. The two 2D descriptor maps DesMl and DesM2 are matched with each other to establish 2D-2D local correspondences. Using the depth map for the second descriptor map 2D-3D correspondences can be established by the depth information in the third direction. Based on matching descriptor structures a pose estimate can be computed 43 by means of a Perspective-N-Points (PnP) method combined with a Random Sample Consensus (RANSAC) algorithm as known in the art. The estimated sensor pose can be used as a new pose prior and, subsequently, based on the new pose prior a new descriptor map can be rendered for matching with the first descriptor map DesM2 and the process of pose estimation can be re-iterated multiple times in order to increase accuracy of the estimated pose and, thus, accuracy of the localization of the device.

An embodiment of a method 50 of determining a position of a device comprising a sensor device is illustrated in the flow chart of Figure 5. The method 50 comprises obtaining S51 by the sensor device sensor data representing an environment of the device. For example, the sensor device is a camera capturing 2D images of the environment or a LIDAR device capturing 3D point clouds representing the environment. The method 50 further comprises generating S52 a first descriptor map based on the sensor data by a first neural network. The first neural network may be one of the above-described examples of first neural networks configured for descriptor/feature extraction, for example, the neural network 31 shown in Figure 3 or the neural network 41 shown in Figure 4.

The method 50 further comprises inputting S53 input data based on the sensor data into a second neural network different from the first neural network and outputting S54 by the second neural network descriptors (that, in particular, may be independent on viewing directions) based on the input data. For example, the input data represents a first guess or a developed estimate of a pose of the sensor device obtained based on the sensor data. The second neural network may be one of the above-described examples of second neural networks, for example, the neural network 32 shown in Figure 3 or the neural network 42 shown in Figure 4 and it may comprise the two MLPs 32a and 32b shown in Figure 2. The first and second neural networks may be conjointly trained based on descriptor losses applied on training descriptor maps provided by the first neural network and training descriptor maps rendered based on the descriptors output of the second neural network for maximizing similarities of the respective training descriptor maps.

The method 50 further comprises volumetric rendering S55 of the descriptors to obtain a second descriptor map and matching S56 the first descriptor map with the second descriptor map. A pose of the sensor device is determined S57 based on the matching S56 (descriptor maps that matches each other to some predetermined degree). Details of the particular steps of the method 50 can be similar to the one described above.

A localization device 60 according to an embodiment is illustrated in Figure 6. The localization device 60 may be installed on a vehicle, for example, an automobile. The localization device 60 may be configured to carry out the method steps of the method 50 described above with reference to Figure 5. The localization device comprises a sensor device 61, for example, a camera capturing 2D images of the environment of the localization device or a LIDAR device capturing 3D point clouds representing the environment. The sensor data captured by the sensor device 61 is processed by a first neural network 62 and a second neural network 63 different from the first neural network 62. The first neural network is configured for generating a first descriptor map (that, in particular, comprise descriptors that may be independent on viewing directions) based on the sensor data. The first neural network may be one of the above-described examples of first neural networks configured for descriptor/feature extraction, for example, the neural network 31 shown in Figure 3 or the neural network 41 shown in Figure 4.

The second neural network is configured for outputting descriptors (that may be independent on viewing directions) based on input data (for example, a pose prior or developed pose estimate) that is based on the sensor data. The second neural network may be one of the abovedescribed examples of second neural networks, for example, the neural network 32 shown in Figure 3 or the neural network 42 shown in Figure 4 and it may comprise the two MLPs 32a and 32b shown in Figure 2. The first and second neural networks 62 and 63 may be conjointly trained based on descriptor losses applied on training descriptor maps provided by the first neural network 62 and training descriptor maps rendered based on the descriptors output of the second neural network 63 for maximizing similarities of the respective training descriptor maps.

Furthermore, the localization device comprises a processing unit 64 for processing the outputs provided by the first neural network 62 and the second neural network 63. The processing unit 64 is configured for volumetric rendering the descriptors to obtain a second descriptor map, matching the first descriptor map with the second descriptor map and determining a pose of the sensor device based on the matching (descriptor maps that matches each other to some predetermined degree).

It is noted that in the embodiments illustrated in Figures 1 to 6 data processing can be entirely carried out on the device (for example, vehicle) that is to be localized. Thereby, user data privacy can be fully ensured. In case of limited computational capability, data processing can be performed partially or fully at an external processing unit (server). For example, a query sensor data set is transmitted to the external processing unit, the external processing unit performs the localization and informs the device about its position. In this case, server side data privacy is ensured since the device is only provided by information on its position. According to another embodiment, extraction of features/descriptors from the query sensor data can be performed on the device site and the extracted features/descriptors are transmitted to an external processing unit for performing the remaining steps of the localization procedure.

The embodiments of methods and apparatuses described above can be suitably integrated in vehicles as automobiles, Automated Guided Vehicles (AGV) and autonomous mobile robots to facilitate navigation, localization and obstacle avoidance. In the automotive context, the embodiments of methods and apparatuses described above can be comprised by ADAS. Further embodiments of methods and apparatuses described above can be suitably implemented in augmented reality applications.

Description of further details of the particular embodiments described above

Camera re-localization methods estimate the position and orientation from which an image has been captured in a given environment. Solving this problem consists in matching the image against a pre-computed map built from data previously collected in the target area.

This disclosure the following embodiment is described: representing the reference map is represented by a neural field, able to render consistent local descriptors with 3D coordinates from any viewpoint. This yields many advantages over classical sparse 3D models: dense features matching can be performed, the pose estimate can be iteratively refined and the storage requirement is reduced. The proposed system learns local features specialized on the scene in a self-supervised way, performs better than related methods and enables to establish accurate matches even when the pose prior is far away from the actual camera pose. Visual localization, i.e. the problem of camera pose estimation in a known environment, enables to build positioning systems using cameras for various applications such as autonomous driving, robotics, or augmented reality. The goal is to predict the 6-DoF camera pose (translation and orientation) from 2D camera sensors measurements, which is a projective geometry problem. Best performing methods in this area, known as structure-based methods, operate by matching image features with a pre-computed 3D model of the environment that represents the map. Usually, these maps are modelled by the outcome of Structure-from-Motion (SfM): a sparse 3D point cloud built from reference images on which keypoints have been triangulated. Such 3D models present a high storage requirement while the information they store remains sparse.

Recently, Neural Radiance Fields (NeRF) have emerged as a new way to implicitly represent a scene. Instead of an explicit representation such as point clouds, meshes or voxel grids, the scene is represented implicitly by a neural network, which learns the mapping from 3D points coordinates to density and radiance. NeRF is trained with a sparse set of posed images representing a given scene where it learns the underlying 3D geometry without supervision. The resulting model is continuous, i.e. the radiance of all 3D points in the scene can be computed, which enables to render photorealistic views from any viewpoint. Following works have shown that additional modalities, such as semantics, can be incorporated in a radiance field and rendered accurately. By using a Neural Field, one can store dense information about a scene in a compact way: neural networks weights represent only few megabytes.

We propose to introduce local descriptors in the NeRF implicit formulation and to use the resulting model as the localization map. We train simultaneously a CNN feature extractor and a neural Tenderer to provide consistent scene-specific descriptors in a self-supervised way. Unlike radiance, the rendering of our descriptors does not depend on the viewing direction nor on the image appearance. This formulation has the advantage of learning repeatable features which exhibit accurate matches under extreme viewpoint changes. By leveraging different state-of-the-art neural rendering techniques, we make the model computationally tractable thanks to the multi-resolution hash encoding from Instant-NGP and adapted to dynamic outdoor scenes thanks to appearance embeddings from Nerf-W. During training, we leverage the 3D information learned by the radiance field in a metric learning optimization objective which does not require supervised pixel correspondences on image pairs or a pre-computed 3D model. Finally, we show that these features can be used to solve the visual relocalization task with simple structure-based methods based on sparse features matching and/or dense features alignment. The commonly used sparse 3D model obtained from Structure-from-Motion is replaced by the Neural Field from which dense reference features from any camera pose can be queried while presenting a very compact storage requirement compared to prior art.

Camera Relocalization: Estimating the 6-DoF camera pose of a query image in a known environment has been addressed in the literature. Structure-based methods compare local image features to an existing 3D model. The classical pipeline consists primarily in 2 steps: first, the closest reference images from the query image are retrieved using global image descriptors. Then, these priors are refined with geometrical reasoning based on extracted local features. This pose refinement can be either based on sparse features matching, direct features alignment or relative pose regression. Structure-based relocalization methods are the most accurate but require to store and exploit extensive map information, which represent a high computational cost and memory footprint. Even though compression methods have been developed, storing dense maps can still be a challenging task. An efficient alternative is absolute pose regression, which connects the query image and the associated camera pose in a single neural network forward pass but yield low accuracy. Scene coordinate regression learns the mapping between image pixels and 3D coordinates, enabling to compute an accurate pose with Perspective-N- Points, but scales poorly to large environments. Our proposal refines camera pose priors in a structure-based method but replaces the traditional 3D model by a compact implicit representation.

Localization with Neural Scenes Representations: NeRF and related models have recently been used in multiple ways to improve localization methods. iNeRF iteratively optimizes the camera pose by minimizing NeRF photometric error. LENS improves the accuracy of absolute pose regression methods by using NeRF-rendered novel views uniformly distributed across the map as additional training data. Neural implicit maps for the RGB-D SLAM problem are used by iMAP and NICE-SLAM to achieve competitive results compared to state-of-the-art methods. ImPosing addresses the kilometers-scale localization problem by measuring similarity between a global image descriptor and an implicit camera pose representation. Related to our work, Features Query Network learns descriptors in an implicit function for relocalization. Their model is trained only on a pre-computed sparse 3D point cloud with an off-the-shelf features extractor as supervision and learns how descriptors vary w.r.t. viewpoint, enabling iterative pose refinement. We take the opposite direction: we let the radiance field simultaneously learn both the 3D geometry and the viewpoint invariant descriptors. The latter are learned in a self- supervised manner and do not require any additional training data. To the best of our knowledge, learning visual localization descriptors in a neural radiance field without supervision has not been proposed before.

Learning-based description of local features: Local descriptors provide useful descriptions of regions of interest that enable to establish accurate correspondences between pairs of images describing the same scene. While hand-crafted descriptors such as SIFT and SURF have shown great success, the focus has shifted in recent years to learn features extraction from large amounts of visual data. Many learning-based formulations rely on Siamese convolutional networks trained with pairs or triplets of images/patches supervised with correspondences. Features extractors can be trained without annotated correspondences by augmenting 2 versions of a same image. SuperPoint uses homographies while Novotny et al. leverage image warps. Our proposed method follows a different path to learn repeatable descriptors: we constrain the feature extractor to provide the same descriptors map as a volumetric neural Tenderer. This approach allows us to learn dense scene-specific descriptors without annotated correspondences since the neural Tenderer is geometrically consistent by design thanks to its ray marching formulation.

Our method estimates the 6-DoF camera pose (i.e. 3D translation and 3D orientation) of a query image in an already visited environment. We first train our modules in an offline step, using a set of reference images with corresponding poses, captured beforehand in the area of interest. A 3D model of the scene is not a pre-requisite because we learn the scene geometry during the training process.

1. Neural rendering of local descriptors:

NeRF is capable of rendering a view from any camera pose in a given scene while being trained only with a sparse set of observations. Given a camera pose with known intrinsics, 2D pixels are back-projected in the 3D scene through ray marching. The density <J and RGB color c of each point p=(x,y,z) along the ray are evaluated by a MLP. The final pixel color of a pixel is computed with volumetric rendering along the ray, which is differentiable and enables to train the whole system by minimizing the photometric error of rendered images. NeRF makes the assumption that illumination in the scene remain constant over time, which does not hold for many real world scenes. NeRF-W overcomes this limitation by modeling appearance with an appearance embedding that control the appearance of each rendered view (see also Figure 2). Another limitation of such neural scenes representation is the computation time: rendering an image requires H x W x N evaluations of the 8 layers MLP, where N is the number of points sampled per ray. Instant-NGP proposes to use multiresolution hash encoding to accelerate the process by storing local features in hash tables, which are then processed by much smaller MLPs compared to NeRF resulting in significant improvement of both training and inference times.

Descriptor-NeRF: Our neural Tenderer combines the 3 aforementioned techniques to efficiently render dynamic scenes. However, our main objective is not photorealistic rendering but, rather, features matching with new observations. While it is possible to align a query image with a NeRF model by minimizing the photometric error, such approach lacks robustness w.r.t. variations in illumination. Instead, we propose to add local descriptors, i.e. D-dimensional latent vectors which describe the visual content of a region of interest in the scene, as an additional output of the radiance field function. In contrast with the rendered color, we model these descriptors as invariant to viewing direction d and appearance vector and verify below that it makes the matching process more robust. Similar to color, the 2D descriptor of a camera ray is aggregated by the usual volumetric rendering formula applied on descriptors of each points along the ray. The neural Tenderer architecture is illustrated in Figure 2 and the training process is explained in the next section.

2. Training: Self-Supervised features extraction in the Neural Field

Motivation: While the previously described Neural Renderer represents the map of our relocalization method, we also need to extract features from the query image. A simple solution, proposed by FQN, is to use an off-the-shelf pretrained feature extractor such as SuperPoint or D2-Net, and train the neural renderer to memorize observed descriptors depending on the viewing direction. Instead, we propose to train jointly the feature extractor with the neural renderer by defining an optimization objective which leverages the scene geometry. We obtain descriptors specialized on the target scene which describe not only the visual content but also the 3D location of the observed point, enabling to discriminate better than generic descriptors. A training procedure is illustrated in in Figure 3. One training sample is a reference image with its corresponding camera pose. From one side, the image is processed by the features extractor to obtain the descriptors map F_{I}. On the other side, we sample points along rays for each pixel using camera intrinsics, compute density, color and descriptor of each 3D point, and finally perform volumetric rendering to obtain a RGB view C_{R}, a descriptors map F_{R} and a depth map D_{R}.

Features extraction: Our features extractor is a simple fully convolutional neural network with 8 layers, ReLU activations and max poolings. The input is a RGB image I of size H x W and produces a dense descriptors map F_{I} of size H/4 x W/4 x D.

Learning the Radiance Field: Similar to NeRF, we use the mean squared error loss LMSE between C_{R} and the real image to learn the radiance field. As we render entire (downscaled) images in a single training step, we can leverage the local 2D image structure and optimize the SSIM (noted LSSIM loss), we observe that it produces sharper images and better results. Depth maps are used by the localization process to compute the camera pose, and then better depth results in more accurate poses. NeRF models trained with limited training views can yield incorrect depths, due to the shape-radiance ambiguity. We add a regularization loss LTV which minimizes depth total variation of randomly sampled 5x5 image patches to encourage smoothness and limit artefacts on the rendered depth maps. LSSIM and LTV improve the localization accuracy and image reconstruction quality.

Learning Descriptors: We aim to match together descriptors from the features extractor and the neural Tenderer. Our self-supervised objective encourages both models to produce identical features for a given pixel while preventing high matching scores between points far from each other in the 3D scene. We define a loss function with two terms Lpos and LNEG , applied on a pair of descriptors maps, each containing n pixels. We use the cosine similarity, noted 0 to measure similarity between descriptors.

The first term maximizes similarity between descriptors map F_{I} and F_{R} from both models:

The second term samples random pairs of pixels and ensures that pixel pairs with large 3D distances have dissimilar descriptors.

xyz(i) is the 3D coordinate of the point represented by the i^th pixel in the descriptor map. We compute it from the camera parameters of the rendered view and predicted depth. It should be noted that we do not backpropagate the gradient of this loss to the depth map. X is an hyperparameter which controls the linear relationship between descriptors similarity and 3D distance. P are random permutations of pixel indices from 1 to n.

The proposed self-supervised objective is close to a classical triplet loss function, but we show in sec 4.3 that injecting 3D coordinates in the formulation is crucial to learn meaningful descriptors.

Finally, we optimize the following loss function at each training step:

3. Visual Localization by iterative dense features matching

This section describes the localization method used to estimate the camera pose of a query image from our learned modules. An overview of this procedure is shown in Figure 4. The proposed solution combines simple and commonly used techniques as a novel approach. We demonstrate that the high quality of our learned features enables to reach a precise localization accuracy while using simple features matching and pose estimation strategies.

I. Localization prior: Similar to related features matching methods, we assume to have access to a localization prior, i.e. a camera pose relatively close to the query pose. A view observed from the prior should have an overlapping visual content with the query image to make the matching process feasible. Such priors can be obtained by matching a global image descriptor against an image retrieval database or an implicit map.

II. Features extraction: First, we extract dense descriptors from the query image through the CNN and descriptors and depth from the localization prior with the neural Tenderer.

III. Dense features matching: Query and reference descriptors are matched with cosine similarity. We consider that 2 descriptors are a match if the similarity is higher than a threshold theta and if it represent the best candidate in the other map in both direction (mutual matching). We then compute the predicted 3D coordinate of rendered pixels which have been matched (thanks to camera parameters and depth) and obtain a set of 2D-3D matches.

IV. Camera Pose Estimation: We use the Perspective-N-Points method combined with RANSAC, in order to get a robust estimate by discarding outliers matches.

V. Iterative Pose Refinement: While classical 3D models only have access to a finite set of reference descriptors, our neural Tenderer can compute them from any camera pose. Similar to FQN and ImPosing, we can then consider the camera pose estimate as a new localization prior and iterate the previously mentioned steps multiple times to refine the camera pose.

We compare our results to state-of-the-art visual relocalization methods. We use the topi reference pose retrieved by DenseVLADas localization prior (whereas other use the top 10 reference images).

Table 1. Results on 7scenes dataset

Table 2. Results on Cambridge Landmarks dataset. We proposed to use Neural Fields as a way to represent a visual localization map. This enables to represent scenes densely with a small memory footprint, to learn local features specialized for the target area without supervision, and to perform better than related visual localization methods. The proposed pipeline should be compatible with future improvements in the neural rendering field enabling to scale these models to larger scenes. All previously discussed embodiments are not intended as limitations but serve as examples illustrating features and advantages of the invention. It is to be understood that some or all of the above-described features can also be combined in different ways.

Claims

1. A method (50) of determining a position of a device comprising a sensor device (61), comprising the steps of obtaining (S51) by the sensor device (61) sensor data representing an environment of the device; generating (S52) a first descriptor map based on the sensor data by a first neural network (31, 41); inputting (S53) input data based on the sensor data into a second neural network (32, 42, 63) different from the first neural network (31, 41, 62); outputting (S54) by the second neural network (32, 42, 63) descriptors based on the input data; volumetric (S55) rendering the descriptors to obtain a second descriptor map; matching (S56) the first descriptor map with the second descriptor map; and determining (S57) a pose of the sensor device (61) based on the matching.

2. The method (50) according to claim 1, wherein the descriptors are independent of a viewing direction of the sensor device (61).

3. The method (50) according to any of the preceding claims, wherein the descriptors represent local content of the input data and three-dimensional positions of data points of the input data.

4. The method (50) according to any of the preceding claims, further comprising obtaining a depth map by the second neural network (32, 42, 63) and wherein the matching of the first descriptor map with the second descriptor map is based on the obtained depth map.

5. The method (50) according to any of the preceding claims, wherein the pose of the sensor device (61) is iteratively determined starting from a pose prior.

6. The method (50) according to any of the preceding claims, wherein the first neural network (31, 41, 62) is trained conjointly with the second neural network (32, 42, 63) for the environment based on matching first training descriptor maps obtained by the first neural network (31, 41, 62) and second training descriptor maps obtained by volumetric rendering of training descriptors output by the second neural network. The method (50) according to any of the preceding claims, wherein the device is a vehicle and the sensor device (61) is a camera device or a Light Detection and Ranging, LIDAR, device. The method (50) according to claim 7 wherein the sensor device (61) is the camera device and further comprising conjointly training the first and second neural networks (31, 41, 62, 32, 42, 63) for the environment comprising obtaining training image data by a training camera device for different training poses of the training camera device; inputting training input data based on the training image data into the first neural network (31, 41, 62) and inputting training pose data according to the different training poses of the training camera device based on the training image data into the second neural network; outputting by the first neural network (31, 41, 62) first training descriptor maps based on the training input data; outputting by the second neural network (32, 42, 63) training color data, training volumetric density data, and training descriptor data; rendering the training color data, training volumetric density data, and training descriptor data to obtain rendered training images, rendered training depth maps and rendered second training descriptor maps, respectively; minimizing a first objective function representing differences between the rendered training images and corresponding pre-stored reference images or maximizing a first objective function representing similarities between the rendered training images and pre-stored reference images; and minimizing a second objective function representing differences between the first training descriptor maps and corresponding rendered second training descriptor maps or maximizing a second objective function representing similarities between the first descriptor maps and corresponding rendered second training descriptor maps. The method (50) according to claim 7, wherein the sensor device (61) is the LIDAR device and further comprising conjointly training the first and second neural networks for the environment comprising obtaining training three-dimensional point cloud data by a training LIDAR device for different training poses of the training LIDAR device; inputting training input data based on the training three-dimensional point cloud data into the first neural network and inputting training pose data according to the different training poses of the training LIDAR device based on the training three-dimensional point cloud data into the second neural network; outputting by the first neural network first training descriptor maps based on the training input data; outputting by the second neural network training volumetric density data and training descriptor data; rendering the training volumetric density data and training descriptor data to obtain rendered training depth maps and rendered second training descriptor maps, respectively; minimizing a first objective function representing differences between the rendered training depth maps and corresponding pre-stored reference depth maps or maximizing a first objective function representing similarities between the rendered training depths maps and pre-stored reference depth maps; and minimizing a second objective function representing differences between the first training descriptor maps and corresponding rendered second training descriptor maps or maximizing a second objective function representing similarities between the first descriptor maps and corresponding rendered second training descriptor maps. The method (50) according to claim 8 or 9, further comprising applying a loss function based on the rendered training depth maps to suppress minimization or maximization of the second objective function for data points of a first training descriptor map of the first training descriptor maps that are geometrically distant from data points of a corresponding rendered second training descriptor map of the rendered second training descriptor maps by more than a pre-determined threshold. A computer program product comprising computer readable instructions for, when run on a computer, performing or controlling the steps of the method (50) according to any of the preceding claims. A localization device (60), comprising a sensor device (61) configured for obtaining sensor data representing an environment of the sensor device (61); a first neural network (31, 41, 62) configured for generating a first descriptor map based on the sensor data; a second neural network (32, 42, 63) different from the first neural network (31, 41, 62) and configured for outputting descriptors based on input data that is based on the sensor data; and a processing unit (64) configured for volumetric rendering the descriptors to obtain a second descriptor map; matching the first descriptor map with the second descriptor map; and determining a pose of the sensor device (61) based on the matching. The localization device (60) according to claim 12, wherein the descriptors are independent of a viewing direction of the sensor device (61). The localization device (60) according to claim 12 or 13, wherein the descriptors represent local content of the input data and three-dimensional positions of data points of the input data. The localization device (60) according to any of the claims 12 to 14, wherein the second neural network (32, 42, 63) is further configured for obtaining a depth map and the processing unit (64) is further configured for matching the first descriptor map with the second descriptor map based on the depth map.

16. The localization device (60) according to any of the claims 12 to 15, wherein the processing unit (64) is further configured for iteratively determining the pose of the sensor device (61) starting from a pose prior.

17. The localization device (60) according to any of the claims 12 to 16, wherein the first neural network (31, 41, 62) is trained conjointly with the second neural network

(32, 42, 63) for the environment based on matching first training descriptor maps obtained by the first neural network (31, 41, 62) and second training descriptor maps obtained by volumetric rendering of training descriptors output by the second neural network. 18. The localization device (60) according to any of the claims 12 to 17, wherein the sensor device (61) is a camera device or a Light Detection and Ranging, LIDAR, device

19. Vehicle comprising the localization device (60) according to any of the claims 12 to 18.