WO2023165718A1 - Apparatus and methods for visual localization with compact implicit map representation - Google Patents

Apparatus and methods for visual localization with compact implicit map representation

Info

Publication number
WO2023165718A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
anchors
map
image
camera
Prior art date
Application number
PCT/EP2022/058974
Other languages
English (en)
Inventor
Arthur MOREAU
Nathan PIASCO
Dzmitry Tsishkou
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023165718A1

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00
    • G01C21/26 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00 specially adapted for navigation in a road network
    • G01C21/28 - Navigation; Navigational instruments not provided for in groups G01C1/00-G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • G01C21/30 - Map- or contour-matching

Definitions

  • the present disclosure relates to a method of localizing a mobile apparatus in an area of interest and a corresponding mobile apparatus.
  • the disclosure addresses the relocalization problem of a mobile platform in a known environment using images, i.e. recovering the precise 6 or 3 Degrees of Freedom (DoF) pose of a mobile platform within a map from an image taken by its visual sensor. It is widely used in mobile robotics, advanced driver assistance systems (ADAS), autonomous driving and augmented reality systems.
  • DoF: Degrees of Freedom
  • Visual relocalization systems can use different types of deep learning based algorithms.
  • One approach consists in storing dense representations of the environment content, enabling camera pose estimation with geometric reasoning at the cost of high computational complexity and a heavy memory footprint.
  • Other approaches bypass this problem by directly regressing the camera pose, resulting in lower accuracy.
  • a method of localizing a mobile apparatus in an area of interest comprises the steps of capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image; determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose; performing iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with the highest similarity scores; wherein an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and estimating the current camera pose based on the pose anchors identified in the iterations.
  • the current camera pose is unknown.
  • the current camera pose is to be determined by the method according to the present disclosure.
  • the camera pose may include coordinate values and one or more orientation/angle values.
  • the present disclosure involves an implicit map representation that makes it possible to compress map-specific content into a lightweight representation, such that localization in large environments can be performed in an efficient way. The accuracy of the method is not bounded by the density of reference poses.
  • the terms localization/localizing and re-localization are used synonymously in the present disclosure.
  • the iterations may be performed a number of times until the further step of estimating the current camera pose is performed.
  • This number of iterations may be predefined or predetermined, and may be based on a precision criterion for the camera pose, for example, or the number of iterations may be determined during the iterations based on a convergence criterion of the pose anchors in the sequence of iterations.
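  • As an illustration, such a stopping rule could be sketched as follows, where the spread of the identified pose anchors serves as the convergence criterion (the function name, threshold and iteration budget are assumptions for illustration, not values from the disclosure):

```python
import numpy as np

def should_stop(identified_anchors, iteration, max_iters=8, tol=0.05):
    """Stop after a fixed iteration budget, or earlier once the identified
    pose anchors have converged (their spread falls below a tolerance)."""
    spread = np.linalg.norm(identified_anchors.std(axis=0))
    return iteration >= max_iters or spread < tol
```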
  • the pre-trained image encoder may comprise a set of predetermined parameters and the map representation may comprise a set of predetermined parameters.
  • the predetermined parameters may include weights of a neural network and are optionally provided in the form of respective parameter vectors.
  • the region of the map representation used to select new pose anchors, based on the anchors identified in the previous iteration, may be decreased; in particular, in each iteration new anchors closest to the anchors identified in the previous iteration may be selected to refine the pose estimate.
  • the similarity score may be based on a measure of similarity using the image signature and the respective generated map signature.
  • the method may comprise an initial step of receiving, from a server, the map representation of the area of interest.
  • the method may comprise a further step of receiving, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into that further area.
  • each map representation may have been previously obtained by performing the steps of obtaining training data in the area of interest using respective cameras of one or more mobile devices moving in the area of interest, the training data comprising image data and camera pose data; transmitting the obtained training data to a remote computing device, such as the server or a cloud computing device; and using the training data to train the map representation.
  • the image encoder of the mobile apparatus may be pretrained once by performing the steps of providing reference images and corresponding reference camera poses; and training the image encoder by feeding the image encoder with the reference images and adjusting parameters of the image encoder by comparing an output of the image encoder with the reference camera poses.
  • training the image encoder and training the map representation may be performed jointly, in particular at least partially using the same images and camera poses.
  • the step of estimating the current pose of the camera based on the pose anchors identified in the iterations may comprise selecting the pose anchor with the maximum score, or computing an average or a weighted average of the identified pose anchors.
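  • A minimal NumPy sketch of this iterative coarse-to-fine localization loop is given below, assuming a 3-DoF pose (x, y, yaw); encode_image and map_signature are placeholders for the trained image encoder and implicit map representation, cosine similarity stands in for the matching module, and all names and constants are illustrative:

```python
import numpy as np

def localize(image, encode_image, map_signature, bounds,
             n_anchors=256, top_k=16, n_iters=4, shrink=0.5):
    """Hierarchical pose search: score pose anchors against the image
    signature, keep the best ones, resample around them in a shrinking
    neighbourhood, and finally average the survivors."""
    rng = np.random.default_rng(0)
    img_sig = encode_image(image)                       # (D,) image signature

    # Initial predefined pool: uniform samples over the whole area.
    low, high = bounds                                  # each of shape (3,)
    anchors = rng.uniform(low, high, size=(n_anchors, 3))
    radius = (high - low) / 2.0

    for _ in range(n_iters):
        # Map signatures for every candidate camera pose.
        map_sigs = np.stack([map_signature(a) for a in anchors])  # (N, D)

        # Similarity scores (cosine similarity as a hand-crafted heuristic).
        scores = map_sigs @ img_sig / (
            np.linalg.norm(map_sigs, axis=1) * np.linalg.norm(img_sig) + 1e-8)

        # Keep the top-k anchors, then resample new anchors around them
        # in a shrinking region (coarse-to-fine refinement).
        top = np.argsort(scores)[-top_k:]
        best, best_scores = anchors[top], scores[top]
        radius = radius * shrink
        anchors = np.concatenate([
            best + rng.uniform(-radius, radius, size=(top_k, 3))
            for _ in range(n_anchors // top_k)])

    # Final estimate: score-weighted average of the best anchors
    # (yaw is averaged naively here for brevity).
    w = np.exp(best_scores)
    return (w[:, None] * best).sum(axis=0) / w.sum()
```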
  • a mobile apparatus comprises a camera for capturing an image, the camera having a current camera pose when capturing the image; a pre-trained image encoder for determining an image signature based on the image, the image signature being a representation of the current camera pose; a memory for storing a map representation of an area of interest; and processing circuitry configured to perform iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from the map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with the highest similarity scores; wherein the processing circuitry is further configured to perform an initial iteration based on an initial predefined pool of pose anchors and, in each subsequent iteration, to select the pool of pose anchors based on the pose anchors identified in the previous iteration, and to estimate the current camera pose based on the pose anchors identified in the iterations.
  • the iterations may be performed a number of times until the further step of estimating the current camera pose is performed.
  • the pre-trained image encoder may comprise a set of predetermined parameters and the map representation may comprise a set of predetermined parameters.
  • the predetermined parameters may include weights of a neural network and may be provided in the form of respective parameter vectors.
  • the region of the map representation used to select new pose anchors, based on the anchors identified in the previous iteration, may be decreased; in particular, in each iteration new anchors closest to the anchors identified in the previous iteration may be selected to refine the pose estimate.
  • the similarity score may be based on a measure of similarity using the image signature and the respective generated map signature.
  • the mobile apparatus may comprise a receiver configured to receive, from a server, the map representation of the area of interest.
  • the receiver may be further configured to receive, from the server, a further map representation of a further area of interest when the mobile apparatus moves towards or into that further area.
  • the processing circuitry may be configured to estimate the current pose of the camera based on the pose anchors identified in the iterations by selecting the pose with a maximum score or by computing an average or a weighted average of the pose anchors.
  • a system comprises one or more mobile devices, each having a camera for capturing images in an area of interest; a localization device for obtaining respective camera poses corresponding to the captured images; a transmitter for transmitting training data comprising image data of the captured images and camera pose data of the obtained camera poses; and a remote computing device, such as a server or a cloud computing device, for receiving the transmitted training data and for training a map representation of the area of interest using the training data.
  • the remote computing device may be configured to transmit the map representation of the area of interest to a mobile apparatus.
  • a computer program comprises instructions which, when the program is executed by a computer, cause the computer to carry out the method according to the first aspect or any implementation thereof.
  • a computer-readable medium comprises instructions which, when executed by a computer, cause the computer to carry out the method according to the first aspect or any implementation thereof.
  • a compact learned representation of the environment enables real-time localization with high accuracy.
  • Figure 1 illustrates the localization solution for mobile platforms.
  • Figure 2 illustrates a localization process.
  • Figure 3 illustrates a training process.
  • Figure 4 illustrates a computational workflow.
  • Figure 5 illustrates localization on multiple maps.
  • Figure 6 illustrates localization on a new map (map adaptation).
  • Figure 7 illustrates a computational workflow for multi-maps and map adaptation.
  • Figure 8 illustrates discrete and continuous implicit map representation.
  • Figure 9 illustrates a general method of localizing a mobile apparatus in an area of interest.
  • Figure 10 illustrates a mobile apparatus according to the present disclosure.
  • the relocalization solution for a mobile apparatus consists in running a learning-based visual localization algorithm on an embedded computing device, as described in Figure 1.
  • a map is built and used to train a deep learning based system which is able to relocalize accurately and efficiently into the map.
  • This is achieved by an implicit map representation that replaces traditional point clouds or image databases as the environment representation.
  • This new formulation enables fast computation, low memory footprint, and the ability to deploy on multiple areas with minimal scene-specific training.
  • Localization systems for autonomous driving need to be deployed at city scale or country scale. The best algorithms that solve the camera pose estimation problem store a large amount of information about the 3D environment of the target area in memory. In the context of very large environments, this prevents real-time deployment of the algorithm on embedded devices of the prior art.
  • the present disclosure stores a very compact representation of the surrounding environment in memory, enabling large scale deployment in multiple areas on embedded devices and real-time processing and localization by the mobile apparatus.
  • the present disclosure uses camera-only devices to perform localization, which makes the method cheaper and more scalable than LiDAR-based localization solutions.
  • the camera pose prediction is obtained by iteratively comparing the image representation with representations of pose candidates, which are sampled in a hierarchical process.
  • Multi-map system & new map adaptation: this solution can be deployed in multiple target areas with a single neural network, and new maps can be integrated into a deployed system in a fast process.
  • crowdsourced data obtained from the system users can be used to track temporal modifications of the environment and continuously improve the localization accuracy.
  • Embodiment 1: localization on a map
  • Visual data in the area of interest must be recorded and stored, in order to build the map and train the localization algorithm. This can be done by a fleet of vehicles deployed for the purpose or by gathering crowd-sourced data. During deployment, crowd-sourced images from system users can be collected for tracking modifications in the map and improving the localization accuracy.
  • the relocalization algorithm takes an RGB image as input and outputs a camera pose with 6 or 3 degrees of freedom (3 translations and 3 rotations in SE(3) for 3D, or 2 translations and 1 rotation in SE(2) for 2D). It is trained with the image database captured in the area of interest and labeled with camera poses computed during the mapping step.
  • the main processing steps and computing modules are described below (see Figure 2 and Table 1):
  • the input image is encoded by a neural network named the image encoder.
  • From the image encoder, a compact intermediate representation of the image is obtained, named the image signature.
  • image encoder can be pre-trained on a larger database of images.
  • map signatures are computed, which are representations of camera poses in the map of interest. These map signatures are produced by the implicit map/scene representation, i.e. a module with learnable parameters that provides higher-dimensional representations of poses in the target area.
  • pose anchors can be chosen at random, uniformly distributed among all the training poses, or sampled on a predefined regular grid.
  • the Matching module is defined as a computing unit that predicts a similarity score between image and map signatures. It can either be a learnable module or be based on hand-crafted heuristics.
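  • Both options can be sketched as follows, assuming PyTorch tensors for the signatures (one image signature of dimension D against N map signatures); the module and function names are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableMatcher(nn.Module):
    """Learnable option: a small MLP scores each pair formed by the
    image signature and one map signature."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_sig, map_sigs):          # (D,), (N, D) -> (N,)
        pairs = torch.cat([img_sig.expand_as(map_sigs), map_sigs], dim=-1)
        return self.mlp(pairs).squeeze(-1)

def heuristic_matcher(img_sig, map_sigs):
    """Hand-crafted option: cosine similarity between signatures."""
    return F.cosine_similarity(img_sig.unsqueeze(0), map_sigs, dim=-1)
```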
  • a candidates proposer selects a new pool of pose anchors that will be evaluated as described in steps 2 and 3.
  • To encode a pose anchor, a simple multi-layer perceptron ensures fast computation, whereas approaches based on feature aggregation along camera rays could ensure the 3D consistency of the learned signatures.
  • Multi-dimensional pose embeddings, such as positional encoding, are also considered in order to better capture small variations in the pose space.
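  • A minimal PyTorch sketch of such a continuous implicit map representation, combining a Fourier-style positional encoding of the pose with a multi-layer perceptron, could look as follows (pose dimension, layer sizes and signature dimension are assumptions, not values from the disclosure):

```python
import torch
import torch.nn as nn

def positional_encoding(pose, n_freqs=8):
    """Map each pose coordinate to sin/cos features at several frequencies,
    so that small variations in the pose space become distinguishable."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi     # (F,)
    angles = pose.unsqueeze(-1) * freqs                 # (..., P, F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class ImplicitMap(nn.Module):
    """Continuous formulation: a small MLP maps an encoded camera pose
    to a higher-dimensional map signature."""
    def __init__(self, pose_dim=3, n_freqs=8, sig_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pose_dim * 2 * n_freqs, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, sig_dim))

    def forward(self, pose):                            # (..., pose_dim)
        return self.mlp(positional_encoding(pose))
```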
  • the learned map representation is randomly initialized and optimized to reduce the localization error. The idea is to learn a mapping between camera poses in the target area and the visual content observable from each viewpoint.
  • the optimized representation is loaded in the localization module.
  • Training procedure: the trainable modules of the method are shown in Figure 3. Camera poses are used as the only source of supervision; the reference poses are obtained in the offline mapping process. The image encoder and the implicit map representation are trained jointly. For a given image with its corresponding camera pose, an ideal target score is computed that corresponds to an ideal output of the localization pipeline. Target scores are defined using the distance between pose candidates and the reference pose, and the system learns to minimize score errors on training samples. For instance, the ideal target score can be designed as a 6D Laplacian kernel centered at the camera pose. During training, the loss between the ideal score and the similarity score output by the localization pipeline described earlier is computed. A loss is computed at each refinement level, with anchors manually selected close to the target to speed up the training.
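  • The target-score supervision can be sketched as follows; a single isotropic pose distance replaces the 6D kernel for brevity, mean squared error is used as one possible score-error loss, and all names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def target_scores(anchor_poses, ref_pose, bandwidth=1.0):
    """Ideal scores: a Laplacian kernel of the distance between each pose
    candidate and the reference pose from the offline mapping process."""
    dist = torch.linalg.vector_norm(anchor_poses - ref_pose, dim=-1)
    return torch.exp(-dist / bandwidth)

def matching_loss(pred_scores, anchor_poses, ref_pose):
    """Jointly train the image encoder and the implicit map representation
    by regressing predicted similarity scores onto the ideal target scores."""
    return F.mse_loss(pred_scores, target_scores(anchor_poses, ref_pose))
```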
  • mapping vehicles are deployed in the target area to record data. The data are stored internally or transferred to a remote server or a cloud.
  • neural network weights are transferred to the computing device through the cloud. Images coming in real time from cameras are processed by the localization algorithm on the embedded device, providing camera pose estimates at a high frame rate.
  • the implicit map representation makes it possible to compress map-specific content into a lightweight representation, enabling relocalization in large environments in an efficient way.
  • the accuracy of the present methods is not bounded by the density of reference poses, and the continual growth of the reference images database improves the accuracy while keeping a fixed-size memory footprint.
  • Embodiment 2: multi-map and map adaptation
  • Multi-map training: the present localization system can be trained simultaneously on multiple areas of interest.
  • the image encoder is shared between all maps, whereas each area of interest is attached to a specific compact learned map representation (see Figure 5).
  • Another important perspective for scaling up map-based autonomous systems is the deployment time on a new area.
  • a system operating in an environment which is continuously growing needs to be able to adapt fast to new environments.
  • a technology able to operate autonomously in an area of interest a few minutes after data collection would facilitate large-scale deployment.
  • new maps can be integrated into the framework in a small fraction of the time required for the entire training.
  • New map adaptation: after data collection and mapping in the target environment, the new learned map representation can be trained directly to fit an already trained multi-map localization algorithm.
  • the image encoder is not optimized during the new-map adaptation training process. As a result, learning only the small number of parameters of the learned map representation is a very fast process (see Figure 6).
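  • A sketch of this adaptation step, reusing the illustrative modules from the earlier snippets (encoder, map representation, matcher, matching loss) and a hypothetical data loader; only the parameters of the new map representation receive gradients:

```python
import torch

def adapt_new_map(encoder, new_map, matcher, matching_loss, loader, lr=1e-3):
    """New map adaptation: freeze the shared image encoder and optimize
    only the compact, map-specific representation of the new area."""
    for p in encoder.parameters():
        p.requires_grad_(False)            # the encoder is not optimized
    optimizer = torch.optim.Adam(new_map.parameters(), lr=lr)

    for image, ref_pose, anchor_poses in loader:
        img_sig = encoder(image)           # frozen, shared features
        map_sigs = new_map(anchor_poses)   # trainable, map-specific
        loss = matching_loss(matcher(img_sig, map_sigs),
                             anchor_poses, ref_pose)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return new_map
```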
  • Multi-map and new-map adaptation mechanisms enable city- or country-scale deployment of the localization service thanks to the compactness of the map-specific content, which enables fast transfer to and from the cloud during mobile platform navigation.
  • Using a multi-map system instead of several independent single-map systems reduces the computational cost of the training step and improves accuracy thanks to transfer learning.
  • the core of the present disclosure is the implicit map representation module. It is defined as a map-specific learnable module that connects a camera pose in the area of interest to a map signature (i.e. a higher-dimensional latent vector).
  • the implicit map representation can be described as a learnable neural network that outputs a map signature for every continuous input pose.
  • Another formulation of such an implicit learned map representation is an array of spatially arranged learnable vectors.
  • the map is discretized across its dimensions into a finite number of map cells, to each of which a signature is attached, see Figure 8.
  • the signatures are directly learned with backpropagation and stored in memory.
  • the main benefits are a very compact representation and signatures that can be accessed without additional computation.
  • the precision is limited by the resolution of the discretization, which has to be kept coarse in order to maintain a compact representation.
  • Discrete vectors could be interpolated to obtain representations at an arbitrary resolution.
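  • A minimal sketch of this discrete formulation, assuming a 2D discretization with illustrative grid size and extent; signatures at arbitrary poses are obtained by bilinear interpolation of the four surrounding cells:

```python
import torch
import torch.nn as nn

class GridMap(nn.Module):
    """Discrete formulation: a 2D array of spatially arranged learnable
    signature vectors, interpolated to an arbitrary resolution."""
    def __init__(self, nx=64, ny=64, sig_dim=128, extent=100.0):
        super().__init__()
        self.cells = nn.Parameter(torch.randn(nx, ny, sig_dim) * 0.01)
        self.nx, self.ny, self.extent = nx, ny, extent

    def forward(self, xy):                 # (..., 2), coordinates in [0, extent]
        # Continuous cell coordinates and the neighbouring cell indices.
        u = xy[..., 0] / self.extent * (self.nx - 1)
        v = xy[..., 1] / self.extent * (self.ny - 1)
        u0 = u.floor().long().clamp(0, self.nx - 2)
        v0 = v.floor().long().clamp(0, self.ny - 2)
        du, dv = (u - u0).unsqueeze(-1), (v - v0).unsqueeze(-1)
        c = self.cells
        # Bilinear interpolation of the learnable signature vectors.
        return ((c[u0, v0] * (1 - du) + c[u0 + 1, v0] * du) * (1 - dv)
                + (c[u0, v0 + 1] * (1 - du) + c[u0 + 1, v0 + 1] * du) * dv)
```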
  • Figure 9 illustrates a general method of localizing a mobile apparatus in an area of interest according to the present disclosure, covering the embodiments as described above.
  • the general method comprises the steps:
  • 910 capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image;
  • 920 determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose;
  • 930 performing iterations comprising: 931 selecting a pool of pose anchors from a map representation of the area of interest; 932 generating a map signature for each pose anchor; 933 comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and 934 identifying a number of pose anchors with the highest similarity scores;
  • wherein an initial iteration is performed based on an initial predefined pool of pose anchors, and in each subsequent iteration the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and
  • 950 estimating the current camera pose based on the pose anchors identified in the iterations.
  • Step 934 may comprise storing of at least a part of the pose anchors with the highest similarity scores for the final pose estimation in step 950.
  • the iterations may be performed a number of times until the further step of estimating the current camera pose is performed.
  • This number of iterations can be predefined or predetermined based on a precision criterion for the camera pose, or the number may be determined during the iterations based on a convergence criterion of the pose anchors in the sequence of iterations.
  • FIG. 10 illustrates a mobile apparatus 1000 according to the present disclosure.
  • the mobile apparatus 1000 comprises a camera 1010 for capturing an image, the camera 1010 having a current camera pose when capturing the image; a pre-trained image encoder 1020 for determining an image signature based on the image, the image signature being a representation of the current camera pose; a memory 1030 for storing a map representation of the area of interest; and processing circuitry 1040 configured to perform iterations comprising the steps (i) - (iv) as follows: (i) selecting a pool of pose anchors from the map representation of the area of interest, each pose anchor corresponding to a candidate camera pose; (ii) generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose; (iii) comparing the image signature with the generated map signatures by determining a similarity score for each comparison; and (iv) identifying a number of pose anchors with the highest similarity scores; wherein the processing circuitry 1040 is further configured to perform an initial iteration based on an initial predefined pool of pose anchors and, in each subsequent iteration, to select the pool of pose anchors based on the pose anchors identified in the previous iteration, and to estimate the current camera pose based on the pose anchors identified in the iterations.
  • the mobile apparatus 1000 is configured to perform the method as described in Figure 9.
  • the presently disclosed system mainly targets autonomous driving applications. Vehicles are equipped with a computing device and cameras and make use of the localization service to ensure precise and safe navigation. The system can first be deployed in a limited area, which can be continuously enlarged by collecting data in new areas. Data recorded in user vehicles is used to improve the system's accuracy over time.
  • Autonomous mobile robots can be equipped with our system in order to navigate in their environments. Applications include transport of goods in warehouses, charging robots operating in parking areas, or domestic robots.
  • Augmented reality systems can benefit from the presently disclosed system because they need a precise real-time localization ability.
  • Applications include assistance systems for staff who perform maintenance and repair of complex equipment, the tourism industry, and public safety (software that provides instructions in emergency situations).

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a method of localizing a mobile apparatus in an area of interest, comprising the steps of: capturing an image using a camera of the mobile apparatus, the camera having a current camera pose when capturing the image; determining an image signature based on the image using a pre-trained image encoder of the mobile apparatus, the image signature being a representation of the current camera pose; performing iterations comprising the steps of selecting a pool of pose anchors from a map representation of the area of interest, each pose anchor corresponding to a candidate camera pose, generating a map signature for each pose anchor, each map signature being a representation of the corresponding candidate camera pose, comparing the image signature with the generated map signatures by determining a similarity score for each comparison, and identifying a number of pose anchors with the highest similarity scores, wherein an initial iteration is performed based on an initial predefined pool of pose anchors and, in each subsequent iteration, the step of selecting the pool of pose anchors is based on the pose anchors identified in the previous iteration; and estimating the current camera pose based on the pose anchors identified in the iterations. The present disclosure further relates to a corresponding mobile apparatus.
PCT/EP2022/058974 2022-03-04 2022-04-05 Apparatus and methods for visual localization with compact implicit map representation WO2023165718A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP PCT/EP2022/055529 2022-03-04
EP2022055529 2022-03-04

Publications (1)

Publication Number Publication Date
WO2023165718A1 (fr)

Family

ID=81579442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/058974 WO2023165718A1 (fr) 2022-03-04 2022-04-05 Apparatus and methods for visual localization with compact implicit map representation

Country Status (1)

Country Link
WO (1) WO2023165718A1 (fr)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180304891A1 (en) * 2015-07-29 2018-10-25 Volkswagen Aktiengesellschaft Determining arrangement information for a vehicle

Similar Documents

Publication Publication Date Title
Maggio et al. Loc-nerf: Monte carlo localization using neural radiance fields
Parkhiya et al. Constructing category-specific models for monocular object-slam
CN111325797A (zh) A pose estimation method based on self-supervised learning
CN110717927A (zh) Indoor robot motion estimation method based on deep learning and visual-inertial fusion
Cozman et al. Outdoor visual position estimation for planetary rovers
WO2013117940A2 (fr) Method for locating a sensor and related apparatus
Kanani et al. Vision based navigation for debris removal missions
Dudek et al. Vision-based robot localization without explicit object models
Tomono 3-D localization and mapping using a single camera based on structure-from-motion with automatic baseline selection
CN116229519A (zh) A two-dimensional human pose estimation method based on knowledge distillation
CN117392488A (zh) A data processing method, neural network and related device
WO2023165718A1 (fr) Apparatus and methods for visual localization with compact implicit map representation
CN115659836A (zh) A visual self-localization method for unmanned systems based on an end-to-end feature optimization model
Spampinato et al. Deep learning localization with 2D range scanner
CN116343191A (zh) Three-dimensional object detection method, electronic device and storage medium
CN115457529A (zh) Entity interaction detection method, and method and apparatus for building an entity interaction detection model
CN111724438B (zh) A data processing method and apparatus
Wei et al. Multi-objective deep cnn for outdoor auto-navigation
Tomono Monocular slam using a rao-blackwellised particle filter with exhaustive pose space search
Grelsson Vision-based localization and attitude estimation methods in natural environments
Chen et al. Remote Sensing Image Registration based on Attention and Residual Network
CN117671022B (zh) A visual localization system and method for a mobile robot in weakly textured indoor environments
Pal et al. Evolution of simultaneous localization and mapping framework for autonomous robotics—a comprehensive review
WO2024099593A1 (fr) Localization based on neural networks
Ghasemieh et al. Towards explainable artificial intelligence in deep vision-based odometry

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22720699

Country of ref document: EP

Kind code of ref document: A1