WO2021244604A1

WO2021244604A1 - Visual-based relocalization method, and electronic device

Info

Publication number: WO2021244604A1
Application number: PCT/CN2021/098096
Authority: WO
Inventors: Yuan Tian; Xiang Li; Yi Xu
Original assignee: Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date: 2020-06-03
Filing date: 2021-06-03
Publication date: 2021-12-09
Also published as: CN115516524A

Abstract

A visual-based relocalization method is executed in an electronic device. The visual-based relocalization method comprising sequence-based pose refinement is proposed to improve the relocalization precision. The device selects a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames. The sequence of the input frames are obtained from different view angles. The device refines estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames. The external poses are obtained from external odometry.

Description

VISUAL-BASED RELOCALIZATION METHOD, AND ELECTRONIC DEVICE

Technical Field

The present disclosure relates to the field of augmented reality (AR) systems, and more particularly, to a visual-based relocalization method.

Background Art

In augmented reality (AR) applications, visual-based relocalization is a crucial part to support AR object persistence and multiple user registration. Persistence is the ability to persist virtual objects in the same physical location and orientation as they are previously positioned in real-world space during an AR session or across different AR sessions. For example, during a first AR session, a user places a virtual sofa in a room using an AR application (app) . After a period, the user enters another AR session using the same app which can show the virtual sofa at the same location and in the same orientation. The procedure of AR object persistence is also referred to as relocalization, which includes re-estimation of device poses with respect to a previously stored “map” representation. For multiple user interaction in an AR session, one user device can set up a reference, or known as “anchors” , which can be some reference points or objects in real-world space. Other user devices can relocalize themselves by matching some sensory data with the “anchors” . Relocalization can utilize different sensory data, among which visual-based relocalization is the most popular.

Visual-based relocalization usually utilizes digital images from cameras as input and computes a six degrees of freedom (6 DoF) camera pose regarding a predefined coordinate system as output. Thus, after relocalization, the device can be tracked in the same coordinate system as a previous AR session or a different user’s AR session.

Technical Problem

Enormous research works have been published for visual-based relocalization, where many of them are implemented together with a simultaneous localization and mapping (SLAM) process. The techniques are widely developed and integrated into current AR software products, such as ARKit and ARcore, and current AR hardware products, such as AR glasses. Relocalization typically needs a sparse or dense map representation of the environment. Then, the visual appearance of the map is utilized to provide the initial pose estimation followed by a pose refinement stage depending on applications. Most of the methods use red green blue (RGB) images for relocalization.

Technical Solution

An object of the present disclosure is to propose a visual-based relocalization method, and an electronic device.

In a first aspect, an embodiment of the invention provides a visual-based relocalization method executable in an electronic device, comprising:

selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are obtained from different view angles; and

refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.

In a second aspect, an embodiment of the invention provides an electronic device comprising a camera, a depth camera, an inertial measurement unit (IMU) , and a processor. The camera is configured to capture a sequence of input frames. Each of the input frames comprises a color space image. The depth camera is configured to capture a depth image that is associated with the color space image. The IMU is configured to provide external odometry that is associated with the color space image. The processor configured to execute:

selecting a sequence of query frames from the sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are obtained from different view angles; and

The disclosed method may be implemented in a chip. The chip may include a processor, configured to call and run a computer program stored in a memory, to cause a device in which the chip is installed to execute the disclosed method.

The disclosed method may be programmed as computer executable instructions stored in non-transitory computer readable medium. The non-transitory computer readable medium, when loaded to a computer, directs a processor of the computer to execute the disclosed method.

The non-transitory computer readable medium may comprise at least one from a group consisting of: a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a Read Only Memory, a Programmable Read Only Memory, an Erasable Programmable Read Only Memory, EPROM, an Electrically Erasable Programmable Read Only Memory and a Flash memory.

The disclosed method may be programmed as computer program product, that causes a computer to execute the disclosed method.

The disclosed method may be programmed as computer program, that causes a computer to execute the disclosed method.

Advantageous Effects

To overcome these challenges, the invention utilizes both RGB/monochrome camera and depth camera. Unlike other RGB and depth (RGBD) relocalization, the invention also uses external visual-inertial odometry (VIO) output that is available on most AR devices. The VIO output comprises poses of the devices. VIO is the process of determining a position and an orientation of a device by analyzing an associated image and inertial measurement unit (IMU) data. The invention provides both mapping and relocalization enhanced with VIO and is efficient, decoupled with the SLAM procedure, very flexible to deploy, and requires no training process. VIO uses both RGB/monochrome camera and IMU that provides external odometry. In other words, the invention ultimately uses data from an RGB/monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, the proposed method can increase the precision of relocalization. Furthermore, the invention utilizes a sequence of images as input and can provide long-term persistence. For example, n frames of sensory data are utilized for relocalization. If visual change of the environment happens after the mapping procedure to a small fraction of frames, the disclosed method can still pick the unchanged frames from the sequence of n frames to perform the relocalization. Comparing with single frame-based relocalization, the proposed relocalization method is sequence-based and can have more robust performance when visual change exists in long-term persistence.

Description of Drawings

In order to more clearly illustrate the embodiments of the present disclosure or related art, the following figures will be described in the embodiments are briefly introduced. It is obvious that the drawings are merely some embodiments of the present disclosure, a person having ordinary skill in this field can obtain other figures according to these figures without paying the premise.

FIG. 1 illustrates a schematic view showing relocalization of a virtual object.

FIG. 2 illustrates a schematic view showing a system including mobile devices that execute a relocalization method according to an embodiment of the present disclosure.

FIG. 3 illustrates a schematic view showing three types of visual-based relocalization methods.

FIG. 4 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method.

FIG. 5 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.

FIG. 6 illustrates a schematic view showing a relocalization pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.

FIG. 7 is a block diagram of a system for wireless communication according to an embodiment of the present disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

Embodiments of the disclosure are described in detail with the technical matters, structural features, achieved objects, and effects with reference to the accompanying drawings as follows. Specifically, the terminologies in the embodiments of the present disclosure are merely for describing the purpose of the certain embodiment, but not to limit the disclosure.

With reference to FIG. 1, for example, during a first AR session A, a user places a virtual object 220, such as an avatar, in a room with a desk 221 using an AR application executed by an electronic device 10. After a period, the user enters another AR session B using the same app which can show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 even if the device is moved to another location. Another electronic device 10c of another user may show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 in AR session C.

As shown in FIG. 1, visual-based relocalization can both help with persistence and multi-user registration. Recently, depth cameras have been increasingly equipped on commodity mobile devices, such as mobile phones and AR glasses. Depth information captured from a depth camera adds geometric details on top of the RGB appearance, and can be used to improve precision and robustness of relocalization.

With reference to FIG. 2, a system including

mobile devices

10a and 10b, a base station (BS) 200a, and a network entity device 300 executes the disclosed method according to an embodiment of the present disclosure. The

mobile devices

10a and 10b may be mobile phones, AR glasses, or other AR processing devices. FIG. 1 is shown for illustrative not limiting, and the system may comprise more mobile devices, BSs, and CN entities. Connections between devices and device components are shown as lines and arrows in the FIGs. The mobile device 10a may include a processor 11a, a memory 12a, a transceiver 13a, a camera 14a, a depth camera 15a, and an inertial measurement unit (IMU) 16a. The mobile device 10b may include a processor 11b, a memory 12b, a transceiver 13b, a camera 14b, a depth camera 15b, and an inertial measurement unit (IMU) 16b. Each of the

cameras

14a and 14b captures and generates color space images from a scene. Each of the

depth cameras

15a and 15b captures and generates depth images from a scene. The IMU 16a measures and generates external odometry of the device 10a. The IMU 16b measures and generates external odometry of the device 10b. Odometry of a device is an estimation that uses data from motion sensors to estimate the position change of the device over time. A color space image camera, such as

camera

14a or 14b, is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image. A depth camera, such as

depth camera

15a or 15b, is configured to capture a depth image that is associated with the color space image in each frame. An IMU, such as

IMU

16a or 16b, is configured to provide external odometry that is associated with the color space image in each frame.

The base station 200a may include a processor 201a, a memory 202a, and a transceiver 203a. The network entity device 300 may include a processor 301, a memory 302, and a transceiver 303. Each of the

processors

11a, 11b, 201a, and 301 may be configured to implement proposed functions, procedures and/or methods described in the description. Layers of radio interface protocol may be implemented in the

processors

11a, 11b, 201a, and 301. Each of the

memory

12a, 12b, 202a, and 302 operatively stores a variety of programs and information to operate a connected processor. Each of the

transceivers

13a, 13b, 203a, and 303 is operatively coupled with a connected processor, transmits and/or receives radio signals or wireline signals. The base station 200a may be an eNB, a gNB, or one of other types of radio nodes, and may configure radio resources for the mobile device 10a and mobile device 10b.

Each of the

processors

11a, 11b, 201a, and 301 may include an application-specific integrated circuit (ASIC) , other chipsets, logic circuits and/or data processing devices. Each of the

memory

12a, 12b, 202a, and 302 may include read-only memory (ROM) , a random access memory (RAM) , a flash memory, a memory card, a storage medium and/or other storage devices. Each of the

transceivers

13a, 13b, 203a, and 303 may include baseband circuitry and radio frequency (RF) circuitry to process radio frequency signals. When the embodiments are implemented in software, the techniques described herein can be implemented with modules, procedures, functions, entities and so on, that perform the functions described herein. The modules can be stored in a memory and executed by the processors. The memory can be implemented within a processor or external to the processor, in which those can be communicatively coupled to the processor via various means are known in the art.

An example of the electronic device 10 in the description may include one of the mobile device 10a or mobile device 10b.

With reference to FIG. 3, three popular pipelines for the visual-based relocalization include pipelines for realizing a direct regression method, a match &refine method, and a match regression method. An image 310 is input to the pipelines. An electronic device may execute the methods to implement the pipelines.

A direct regression pipeline 320 realizing the direct regression method uses end-to-end methods which utilize a deep neural network (DNN) to regress pose 350 directly. A pose may be defined as 6 degrees-of-freedom (6DoF) translation and orientation of user’s camera refer to a coordinate space. A 6DoF pose of a three-dimensional (3D) object represents localization of a position and an orientation of the 3D object. A pose is defined in ARCore as: “Pose represents an immutable rigid transformation from one coordinate space to another. As provided from all ARCore APIs, Poses always describe the transformation from object's local coordinate space to the world coordinate space…The transformation is defined using a quaternion rotation about the origin followed by a translation. ” Poses from ARCore APIs can be thought of as equivalent to OpenGL model matrices.

A match regression pipeline 340 realizing the match regression method extracts features from an image, then finds a match between the extracted features and a stored map, and finally computes the pose through the matching. A map can be a virtually reconstructed environment. A map is generated by sensors such as RGB camera, depth camera or Lidar sensor. A map can be obtained locally or downloaded from the server. A match and refine pipeline 330 realizing the match and regression method obtains sparse or dense features of a frame (block 331) , regresses the match between features and map directly (block 332) , then computes a pose based on the match (block 333) , and outputs the computed pose (block 350) .

Visual-based relocalization also needs a mapping procedure to generate a representation for real-world space. Such mapping methods are typically designed corresponding to the specific relocalization method being used. For example, the direct regression method in FIG. 3 requires a DNN training step in mapping. The match regression method also utilizes a learning process in mapping, which is not limited to DNNs. The match and refine mapping pipeline 330 usually uses a keyframe-based method. Popular keyframe methods include enhanced hierarchical bag-of-word library (DBoW2) and randomized ferns. A mapping procedure is shown in FIG. 4. An electronic device may execute the mapping procedure. When mapping begins, for example, a frame 20 with one image 21 and one pose 22 are pre-processed (block 401) to extract sparse or dense features. Then a keyframe check is performed (block 402) to check whether the current frame 20 is eligible to become a new keyframe. If the current frame 20 is eligible to become a new keyframe, the frame 20 is added to and indexed in a keyframe database 30 (block 403) . The keyframe database is used in a subsequent relocalization procedure to retrieve a most similar keyframe based on an input frame. If the current frame 20 is not eligible to become a new keyframe, the frame 20 is dropped (block 404) .

Although many proposed relocalization methods have been developed, many of them have lots of challenges in AR applications. The first challenge is long-term persistence, which means the virtual objects should persist for a long period of time. In indoor scenes, the environment could be always changing. For example, chairs could be moved, a cup could be left at different places, and a bedsheet could be changed from time to time. Outdoor scenes suffer from lighting, occlusion, and seasonal changes. A naive solution may be to keep on updating the map, which is infeasible in most cases. The second challenge is the limited computing power of most AR mobile devices that necessitates an efficient relocalization solution. The third challenge is that multi-user AR applications, especially in indoor scenes, require high relocalization precision for good user experiences.

To overcome these challenges, the invention utilizes both RGB/monochrome camera and depth camera. Unlike other RGB and depth (RGBD) relocalization, the invention also uses external visual-inertial odometry (VIO) output that is available on most AR devices. The VIO output comprises poses of the devices. VIO is the process of determining a position and an orientation of a device by analyzing an associated image and inertial measurement unit (IMU) data. The invention provides both mapping and relocalization enhanced with VIO and is efficient, decoupled with the SLAM procedure, very flexible to deploy, and requires no training process. VIO uses both RGB/monochrome camera and IMU that provides external odometry. In other words, the invention ultimately uses data from an RGB/monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, the proposed method can increase the precision of relocalization. Furthermore, the invention utilizes a sequence of images as input and can provide long-term persistence. For example, n frames of sensory data are utilized for relocalization. If a visual change of the environment happens after the mapping procedure to a small fraction of frames, the disclosed method can still pick the unchanged frames from the sequence of n frames to perform the relocalization. Comparing with single frame-based relocalization, the proposed relocalization method is sequence-based and can have more robust performance when visual change exists in long-term persistence.

The invention requires RGB/monochrome image, depth image, and external odometry data for each frame and combines a data sequence of query frames as input. Note that the invention provides an embodiment of the match and refine method, and does not rely on any specific keyframe selection and retrieval model. FIG. 5 shows a mapping pipeline of the disclosed method. Any current RGB/monochrome keyframe selection method can be used in the invention. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding" of IEEE transactions on visualization and computer graphics 21, no. 5 (2014) : 571-583 and Gálvez-López, D. Another keyframe selection method is disclosed by J.D. Tardós in an article titled "DBoW2: Enhanced hierarchical bag-of-word library for C++" . A keyframe is a frame that can represent significant information in the mapping. As shown in FIG. 4 and FIG. 5, each frame is checked as to whether the frame is qualified to be a keyframe or not. If the frame is qualified to be a keyframe, the keyframe is stored in the keyframe database. A query frame is a special keyframe during relocalization, selection criteria of which is quite different from a keyframe in the mapping procedure.

If the current frame 20 is eligible to become a new keyframe, the frame 20 is added to and indexed in a keyframe database 30. In addition to the keyframes, a 3D point cloud 23 is also recorded as a depth image for a keyframe (block 403’ ) , and thus each keyframe has a 3D point cloud 23 recorded as a depth image of the keyframe. A point cloud may be generated from a depth camera. Therefore, a sequence of the 3D point cloud is constructed, and may be combined as one 3D map point cloud.

A relocalization procedure may be executed in a later AR session on the same device or on a different user’s device. For example, the visual-based relocalization method of the disclosure is executed by the device 10. The visual-based relocalization method comprises selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames. The sequence of the input frames is obtained from different view angles. Each input frame in the sequence of the input frames comprises a color space image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated a plurality of keyframes in a three dimensional (3D) map. The plurality of keyframes comprises k nearest keyframes relative to the current frame, where k is a positive integer. The point cloud registration of a current frame may comprise iterative closest point (ICP) algorithm applied to the current frame. The device refines estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames. The external poses are obtained from external odometry.

An embodiment of the relocalization method of the disclosure includes a brief pre-processing, and two stages to estimate the 6DoF pose. The two stages comprise a first stage 620 for sequence generation and a second stage 630 for pose refinement. FIG. 6 shows an entire relocalization pipeline. The device 10 may execute the disclosed visual-based relocalization method to realize the relocalization pipeline.

For example, a frame 20 comprises a color space image 21, a depth image 23, and an odometry pose 24. The color space image may comprise an RGB or monochrome image obtained from a camera. The depth image 23 may be obtained from a depth camera. The odometry pose may be obtained from external odometry. The frame 20 is processed as a current frame for preprocessing, the first stage for sequence generation, and the second stage for pose refinement. The invention introduces a new pipeline that incorporates color space images, depth images, and external odometry to estimate the relocalization. Additionally, the invention proposes a method to generate a multi-modal sequence to reduce false relocalization. Further, the visual-based relocalization method with sequence-based pose refinement is proposed to improve the relocalization precision.

As shown in FIG. 6, the device 10 obtains one or more frames for the disclosed relocalization method. Among the one or more frames, one frame is selected as the current frame 20 and comprises the color space image 21, depth image 23, and one 6 DoF pose 24 from external odometry. All of the color space images, depth images, and 6 DoF poses are synchronized. In pre-processing of the current frame 20 (block 610) , the color space image 21, depth image 23, and odometry pose 24 are registered to the same reference frame of an RGB/monochrome camera, such as one of the

camera

14a or 14b shown in FIG. 2, using extrinsic parameters that can be obtained via a calibration process. The extrinsic parameters refer to a transformation matrix between a monochrome/RGB camera and a depth camera. For example, pinhole camera parameters are represented in a 4-by-3 matrix called the camera matrix. This matrix maps the 3-D world scene into an image plane. The calibration algorithm calculates the camera matrix using the extrinsic and intrinsic parameters. The extrinsic parameters represent the location of the camera in the 3-D scene. The intrinsic parameters represent the optical center and focal length of the camera. Pre-processing the one or more frames outputs a sequence of frames including images with depth information and poses, and are passed to the first stage 620 for sequence generation.

The 1 ^st stage for sequence generation:

The 1 ^st stage for sequence generation is a sequence generation stage configured to select and store a sequence of frames that are different frames captured from different view angles. Each of the selected frames has a high probability for estimating poses and generate a correct pose. Note that frames selected from a plurality of input frames in this stage are not the same as the keyframes that are stored for mapping and retrieval because the frames input to the stage are captured at a different time or from a different device. A selected frame in the stage is named a query frame. A query frame needs to have a different view angle from all other query frames in the stored sequence and has the potential to estimate the correct pose. The first stage has four main steps as shown in FIG. 6.

Pose check:

The first step in the stage is the pose check (block 621) . This step makes sure a new query frame is from a different view angle from previous query frames already added in the sequence. The device compares a pose of the current frame 20 with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames is not empty and has another query frame other than the current frame. If no query frame is in the sequence, this step of pose check is omitted. The device 10 uses a pose from external odometry associated with the current frame 20 to check whether the current frame 20 has enough view angle difference from previous query frames. The pose of current frame 20 is compared with one or more last query frames in the sequence. In comparing a pose of current frame 20 with a pose of one stored query frame in the sequence, if the Euclidean distance between the two compared poses is larger than a threshold δ _trans or angle difference between the two compared poses is larger than a threshold δ _rot, the current frame 20 is selected for further processed in the next step. If the Euclidean distance between two compared poses is not larger than a threshold δ _trans or angle difference is not larger than a threshold δ _rot, the device 10 determines the current frame is not a qualified query frame, and the current frame 20 is disregarded (block 625) .

Single frame relocalization:

This second step is relocalization using a single frame (block 622) . The device 10 performs the evaluation of depth-image-based single frame relocalization on the current frame 20. Specifically, (1) feature extraction for the current frame 20 is performed depending on what keyframe selection method has been used during mapping. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding" of IEEE transactions on visualization and computer graphics 21, no. 5 (2014) : 571-583. Another keyframe selection method is disclosed by Gálvez-López, D., and J. D. Tardós in an article titled "DBoW2: Enhanced hierarchical bag-of-word library for C++" (2012) .

Then the device 10 searches k nearest keyframes using k nearest neighbors (kNN) from the keyframe database, where k is a positive integer. Distance measurement for kNN is defined based on the feature as well. For example, if randomized ferns are used as features of frames, then a distance is computed as a Hamming distance between ferns of the current frame 20 and one of the k nearest frames. An ORB-based feature extraction for frames is disclosed by Rublee, Ethan, Vincent Rabaud, Kurt Konolige, and Gary Bradski in an article titled "ORB: An efficient alternative to SIFT or SURF" in 2011 IEEE International conference on computer vision, pp. 2564-2571. If sparse feature such as ORB is used as features of frames, the distance can be computed as a Hamming distance of an ORB descriptor of the current frame 20 and an ORB descriptor of one of the k nearest frames.

(2) The k nearest keyframes provide k initial poses for the current frame. The k poses are associated with the k nearest keyframes and are prestored in the keyframe database during the mapping procedure. The device 10 then performs an iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and a 3D point cloud associated with each of the nearest keyframes to refine the k poses. Thus, k refined poses associated with the k nearest keyframes are generated.

(3) Among all the k refined poses, the one with the least inlier RMSE (Root Mean Square Error) and largest inlier percentage is selected as an estimated pose of the current frame 20 for the next stage. The device 10 computes an inlier RMSE inlier_rmse of a specific pose among the k poses for a specific keyframe among the k keyframes associated with the specific pose as:

represents a 3D point in a point cloud of the current frame;

represents a 3D point in a point cloud of the specific keyframe; and the operation

represents an operation that outputs Euclidean norm of

and

An inlier percentage of the specific poses is a percentage of one or more inlier points in all 3D points in the current frame 20. The one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP. The k refined poses are associated with k inlier RMSEs and k inlier percentages. The device 10 selects one of the k refined poses with a least inlier root mean square error (RMSE) and a largest inlier percentage to form an estimated pose of the current frame.

ICP metric check:

The third step is ICP metric check (block 623) . In the single frame relocalization, ICP is utilized to transform points. In the ICP metric check, ICP is utilized to double check points. ICP metric is a combination of inlier RMSE and inlier percentage. ICP metric check uses the inlier percentage and inlier RMSE to determine that whether a frame can be qualified as a query frame. In the ICP metric check, if the current frame has a selected pose with the inlier RMSE below a threshold δ _rmse, and the inlier percentage higher than a certain threshold δ _per, the current frame 20 becomes a query frame and is added into the sequence of query frames (block 624) . Otherwise, the current frame is disregarded (block 625) , and the process continues to the next frame.

Two main conditions may lead to high inlier RMSE:

1) the current frame 20 includes a region that has not been mapped by the mapping procedure;

2) the current frame 20 includes a region that has been mapped, but keyframes retrieval fails to find a good match.

In this case, the initial pose for ICP may be too far away from the truth, or a ground truth pose. The first condition should be avoided. If the current frame includes a region that has not been mapped, then the relocalization shall not be performed at all. The current frame including the region is referred to as an out-of-map frame. Unless an out-of-map frame has a similar appearance and similar geometry with some keyframe in the map, the inlier RMSE can be high. The thresholds δ _rmse and δ _per can be set empirically, but may be different depending on depth camera parameters and mapping scenario. A process that finds the optimal threshold for δ _rmse and δ _per can be performed after a mapping procedure. The device 10 can use keyframes in the map as input to perform single frame relocalization. Single frame relocalization is a process that determines a pose of the frame with regard to the map. In the keyframe database, each keyframe is stored with a camera pose. Such pose is computed in the mapping stage, and can be called a “ground truth pose” . A mapping process selects a set of keyframes and computes poses of the selected keyframes. These poses are considered to be ground truth in this step. Since the ground truth pose is known for each keyframe, the result of relocalization can be decided. Since the relocalization is successfully completed when the estimated pose has translation and rotation error smaller than thresholds, the query frame selection can be regarded as a classification problem using ICP metric as features. ICP metric may comprise parameters of inlier RMSE and inlier percentage related measurement. Then, such ICP metric parameters can be processed with machine learning, such as simple decision tree training, to avoid most negative cases.

The device 10 selects and adds the current frame 20 as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame 20 is below an RMSE threshold δ _rmse, and the inlier percentage of the selected refined pose of the current frame 20 is higher than a certain percentage threshold δ _per. The estimated pose of the selected current frame 20 is obtained as one of the estimated poses of the query frames as a consequence of the selecting of the current frame 20. When the query frame is added to the sequence, the device 10 also stores a corresponding point cloud from the depth image associated with the query frame. The point cloud might be downsampled for efficiency. The device 10 may use the point cloud for pose refinement. The process may be repeated for each of the input frames to generate a plurality of query frames and the estimated poses of the query frames.

The 2 ^nd stage for pose refinement:

The pose refinement stage is to use a refined subset of frames in the query frame sequence to refine the estimated poses of the query frames (block 631) . This stage starts when the number of query frames is larger than a threshold N _seq. Although all query frames meet the ICP metric in the first stage, not all of them are used for final pose refinement due to errors in pose estimation or in ICP. For example, since a desktop in a room has a plain surface that can be similar to a plain surface of a ground, a point cloud of the desktop may match that of the ground. The goal of the second stage is to select enough inlier frames from query frames. Note that here an inlier means a frame instead of a point during ICP. A random sample consensus (RANSAC) -like method may be used to select inliers. The algorithm for the second stage is shown in Table 1:

Table 1

In this pose refinement procedure, the input of the 2 ^nd stage is all query frames in the sequence with external poses from odometry, and estimated poses regarding the map from the sequence generation stage. External poses are generated from external odometry. Estimated poses are generated from the relocalization process. As shown in the lines 1 and 2 of the Algorithm 1, the device 10 transforms all point clouds from all of the query frames to a reference coordinate frame of the 3D map using estimated poses of the query frames. Any map has an origin and directions of x, y, and z axes. A coordinate system of a map is referred to as a reference coordinate frame. A reference coordinate frame is not a frame in the sequence.

As shown in the line 3 of the Algorithm 1, the device 10 computes Euclidean RMSE between each of the transformed point clouds of the query frames frame and the point clouds of the reference coordinate frame in the 3D map. As shown in the line 4 of the Algorithm 1, the device 10 determines the computed Euclidean RMSEs associated with the query frames to generate a plurality of inlier frames, wherein a frame i in the sequence of query frames is determined as an inlier frame when a computed Euclidean RMSE of the frame i is smaller than a threshold δ _rmse. The device 10 combines point clouds from all inlier frames and refine estimated poses of the inlier frames to generate refined estimated poses using ICP. The device 10 may use the refined estimated poses to improve visual-based relocalization. For example, the device 10 may use the refined estimated poses to better relocate an AR session, an AR content, or a virtual object. After relocalization is done, a virtual object can be placed into the scene. With reference to FIG. 6, in the 2 ^nd stage, from all the estimated poses, the device 10 selects a frame i with an estimated pose

good enough. To do this, for each estimated pose

the device 10 transforms all point clouds from all query frames to the reference coordinate frame of the map using the estimated poses

as shown in line 2 of the Algorithm 1. The frame i has an estimated pose

and an external pose

Apoint cloud PC _j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by transformation

where (j=1.. n) means that all the frames in the sequence. Basically, the algorithm process each frame (as shown in line 1 for i=0: n) , during each frame, uses the current frame i as reference and warp all the frames in the sequence using the criteria in line 2.

Then, Euclidean RMSE is computed between points in a point cloud PC _seq of all the frames in the sequence using the pose of the frame i and points in a point cloud PC _map of the map. If the Euclidean RMSE is smaller than a threshold δ _rmse, then the frame i is treated as an inlier. When the number of inliers is large enough, such as a number greater than n/2, all the inlier frames are saved as elements in the refined subset. In one embodiment, once such an inlier frame is found, the device 10 returns the inlier frame and the transformation applied to the inlier as the output of the 2 ^nd stage. Each inlier frame in the output of the 2 ^nd stage is associated with the estimated pose

an external pose

and the transformation

for all j in (1.. n) in the sequence. The variable i is a selected frame index for pose initialization, and variable j is one frame index from 1 to n.

is the reverse rotation of

This early return strategy reduces the computational cost of Algorithm 1. In an alternative embodiment, after all the query frames with estimated poses are evaluated, a frame that has the largest number of inliers is selected and saved as an element in the refined subset. For example, smaller RMSE breaks a tie. In other words, if two frames have the same number of inliers, an embodiment of the disclosed method prefers one frame with smaller RMSE in the frame selection for the refined subset. The device 10 combines point clouds from all inlier frames and refines the estimated pose using ICP, and outputs the refined estimated pose P _final as a portion of the output of the 2 ^nd stage.

The device 10 determines whether the pose refinement is successful (block 632) . When pose refinement is successful with an estimated pose with smallest mean RMSE, the estimated pose with smallest mean RMSE and inliers associated with the estimated pose are also stored as the refined estimated pose P _final (block 634) . After processing all frames, if the device 10 cannot find an estimated pose with enough inliers, the device 10 removes the frames that are outliers of the estimated pose with smallest mean RMSE, and repeats the 1 ^st stage and the 2 ^nd stage for other input frames. The device 10 processes a new frame as a current frame in the 1 ^st stage and the 2 ^nd stage until the refined subset has enough frames.

Removal of outliers happens when no frame that satisfies the criteria can be obtained by the 2 ^nd stage after processing the sequence with N frames. Outlier removal is to trim the N sequences a little bit. Then the sequence is shortened, and the 2 ^nd stage waits for the sequence to have N frames again.

The proposed method utilizes RGB/monochrome images, depth images and external odometry as input to realize visual-based relocalization. The method adopts a traditional pipeline. The computation is fast and suitable for mobile AR devices. Sequence-based relocalization can achieve higher precision than the single frame method. This method is also robust against visual change in the environment since a sequence is taken as the input instead of a single frame.

FIG. 7 is a block diagram of an example system 700 for the disclosed visual-based relocalization method according to an embodiment of the present disclosure. Embodiments described herein may be implemented into the system using any suitably configured hardware and/or software. FIG. 7 illustrates the system 700 including a radio frequency (RF) circuitry 710, a baseband circuitry 720, a processing unit 730, a memory/storage 740, a display 750, a camera module 760, a sensor 770, and an input/output (I/O) interface 780, coupled with each other as illustrated.

The processing unit 730 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processors may include any combinations of general-purpose processors and dedicated processors, such as graphics processors and application processors. The processors may be coupled with the memory/storage and configured to execute instructions stored in the memory/storage to enable various applications and/or operating systems running on the system.

The baseband circuitry 720 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processors may include a baseband processor. The baseband circuitry may handle various radio control functions that enable communication with one or more radio networks via the RF circuitry. The radio control functions may include, but are not limited to, signal modulation, encoding, decoding, radio frequency shifting, etc. In some embodiments, the baseband circuitry may provide for communication compatible with one or more radio technologies. For example, in some embodiments, the baseband circuitry may support communication with 5G NR, LTE, an evolved universal terrestrial radio access network (EUTRAN) and/or other wireless metropolitan area networks (WMAN) , a wireless local area network (WLAN) , a wireless personal area network (WPAN) . Embodiments in which the baseband circuitry is configured to support radio communications of more than one wireless protocol may be referred to as multi-mode baseband circuitry. In various embodiments, the baseband circuitry 720 may include circuitry to operate with signals that are not strictly considered as being in a baseband frequency. For example, in some embodiments, baseband circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.

The RF circuitry 710 may enable communication with wireless networks using modulated electromagnetic radiation through a non-solid medium. In various embodiments, the RF circuitry may include switches, filters, amplifiers, etc. to facilitate communication with the wireless network. In various embodiments, the RF circuitry 710 may include circuitry to operate with signals that are not strictly considered as being in a radio frequency. For example, in some embodiments, RF circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.

In various embodiments, the transmitter circuitry, control circuitry, or receiver circuitry discussed above with respect to the UE, eNB, or gNB may be embodied in whole or in part in one or more of the RF circuitries, the baseband circuitry, and/or the processing unit. As used herein, “circuitry” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC) , an electronic circuit, a processor (shared, dedicated, or group) , and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality. In some embodiments, the electronic device circuitry may be implemented in, or functions associated with the circuitry may be implemented by, one or more software or firmware modules. In some embodiments, some or all of the constituent components of the baseband circuitry, the processing unit, and/or the memory/storage may be implemented together on a system on a chip (SOC) .

The memory/storage 740 may be used to load and store data and/or instructions, for example, for the system. The memory/storage for one embodiment may include any combination of suitable volatile memory, such as dynamic random access memory (DRAM) , and/or non-volatile memory, such as flash memory. In various embodiments, the I/O interface 780 may include one or more user interfaces designed to enable user interaction with the system and/or peripheral component interfaces designed to enable peripheral component interaction with the system. User interfaces may include, but are not limited to a physical keyboard or keypad, a touchpad, a speaker, a microphone, etc. Peripheral component interfaces may include, but are not limited to, a non-volatile memory port, a universal serial bus (USB) port, an audio jack, and a power supply interface.

The camera module 760 may comprise a color space image camera and a depth camera, such as the

depth camera

15a or 15b. The color space image camera is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image. The depth camera is configured to capture a depth image that is associated with the color space image in each frame.

The sensor 770 is configured to provide external odometry that is associated with the color space image in each frame. In various embodiments, the sensor 770 may include one or more sensing devices to determine environmental conditions and/or location information related to the system. In some embodiments, the sensors may include, but are not limited to, an IMU, a gyro sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of, or interact with, the baseband circuitry and/or RF circuitry to communicate with components of a positioning network, e.g., a global positioning system (GPS) satellite. In various embodiments, the display 750 may include a display, such as a liquid crystal display and a touch screen display. In various embodiments, the system 700 may be a mobile computing device such as, but not limited to, a laptop computing device, a tablet computing device, a netbook, an ultrabook, a smartphone, etc. In various embodiments, the system may have more or less components, and/or different architectures. Where appropriate, the methods described herein may be implemented as a computer program. The computer program may be stored on a storage medium, such as a non-transitory storage medium.

The embodiment of the present disclosure is a combination of techniques/processes that can be adopted to create an end product.

A person having ordinary skill in the art understands that each of the units, algorithm, and steps described and disclosed in the embodiments of the present disclosure are realized using electronic hardware or combinations of software for computers and electronic hardware. Whether the functions run in hardware or software depends on the condition of the application and design requirement for a technical plan. A person having ordinary skill in the art can use different ways to realize the function for each specific application while such realizations should not go beyond the scope of the present disclosure. It is understood by a person having ordinary skill in the art that he/she can refer to the working processes of the system, device, and unit in the above-mentioned embodiment since the working processes of the above-mentioned system, device, and unit are basically the same. For easy description and simplicity, these working processes will not be detailed.

It is understood that the disclosed system, device, and method in the embodiments of the present disclosure can be realized in other ways. The above-mentioned embodiments are exemplary only. The division of the units is merely based on logical functions while other divisions exist in realization. It is possible that a plurality of units or components are combined or integrated into another system. It is also possible that some characteristics are omitted or skipped. On the other hand, the displayed or discussed mutual coupling, direct coupling, or communicative coupling operate through some ports, devices, or units whether indirectly or communicatively by ways of electrical, mechanical, or other kinds of forms.

The units as separating components for explanation are or are not physically separated. The units for display are or are not physical units, that is, located in one place or distributed on a plurality of network units. Some or all of the units are used according to the purposes of the embodiments. Moreover, each of the functional units in each of the embodiments can be integrated into one processing unit, physically independent, or integrated into one processing unit with two or more than two units.

If the software function unit is realized and used and sold as a product, it can be stored in a readable storage medium in a computer. Based on this understanding, the technical plan proposed by the present disclosure can be essentially or partially realized as the form of a software product. Or, one part of the technical plan beneficial to the conventional technology can be realized as the form of a software product. The software product in the computer is stored in a storage medium, including a plurality of commands for a computational device (such as a personal computer, a server, or a network device) to run all or some of the steps disclosed by the embodiments of the present disclosure. The storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM) , a random access memory (RAM) , a floppy disk, or other kinds of media capable of storing program codes.

The proposed solution adopts a match and refine pipeline and includes two-stage processing to refine the pose. The first stage selects the query frames into the sequence. The second stage selects the inlier frames from the sequence. Finally, the inlier frames are used to refine the pose. The disclosed method achieves high relocalization precision while maintaining efficiency with low computation resources. Because of the sequence inlier selection, the invention can avoid the drawbacks of keyframe-based method, including bad initialization and bad ICP caused by insufficient geometric details. Furthermore, the sequence takes inlier frames with good geometric fitting. When the sequence is long enough to cover static portions of a scene with no visual changes, the disclosed method can process scenes with visual changes.

While the present disclosure has been described in connection with what is considered the most practical and preferred embodiments, it is understood that the present disclosure is not limited to the disclosed embodiments but is intended to cover various arrangements made without departing from the scope of the broadest interpretation of the appended claims.

Claims

A visual-based relocalization method executable in an electronic device, comprising:

selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are different frames obtained from different view angles; and

refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
The visual-based relocalization method of claim 1, further comprising:

comparing a pose of a current frame with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames has another query frame other than the current frame.
The visual-based relocalization method of claim 2, further comprising:

determining that the current frame represents a view angle sufficiently different than the stored query frame when a Euclidean distance between the pose of a current frame and the pose of at least one stored query frame is greater than a threshold; and

performing the evaluation of depth-image-based single frame relocalization on the current frame representing a view angle sufficiently different than the stored query frame.
The visual-based relocalization method of claim 1, wherein each input frame in the sequence of the input frames comprises a RGB image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated a plurality of keyframes in a three dimensional (3D) map.
The visual-based relocalization method of claim 4, wherein the plurality of keyframes comprises k nearest keyframes relative to the current frame, k is a positive integer, the point cloud registration of a current frame comprises iterative closest point (ICP) algorithm applied to the current frame, and the method further comprises:

providing k poses associated with the k nearest keyframes for the current frame; and

performing iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes.
The visual-based relocalization method of claim 5, wherein an inlier RMSE of a specific pose among the k poses is computed for a specific keyframe among the k keyframes associated with the specific pose,

an inlier percentage of the specific poses is a percentage of one or more inlier points in all 3D points in the current frame, the one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP, and the k refined poses are associated with k inlier RMSEs and k inlier percentages; and

the method further comprises:

selecting one of the k refined poses with a least inlier root mean square error (RMSE) and a largest inlier percentage to form an estimated pose of the current frame.
The visual-based relocalization method of claim 6, wherein the method further comprises:

selecting and adding the current frame as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame is below an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a certain percentage threshold, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.
The visual-based relocalization method of claim 7, wherein the method further comprises:

storing the depth image associated with the current frame that is added to the sequence of query frames.
The visual-based relocalization method of claim 7, wherein the method further comprises:

transforming all point clouds from all of the query frames to a reference coordinate frame of the 3D map using the estimated poses of the query frames;

computing Euclidean RMSE between each of the transformed point clouds of the query frames frame and the points of the reference coordinate frame in the 3D map;

determining computed Euclidean RMSEs of the query frames to generate a plurality of inlier frames, wherein an i-th frame in the sequence of query frames is determined as an inlier frame when a computed Euclidean RMSE of the i-th frame is smaller than a threshold δ _rmse; and

combining point clouds from all inlier frames and refine the estimated poses of the inlier frames to generate refined estimated poses using ICP.
The visual-based relocalization method of claim 9, wherein i-th frame has an estimated pose
and an external pose
apoint cloud PC _j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by transformation
An electronic device comprising:

a camera configured to capture a sequence of input frames, wherein each of the input frames comprises an RGB image;

a depth camera configured to capture a depth image that is associated with the RGB image;

an inertial measurement unit configured to provide external odometry that is associated with the RGB image; and

a processor configured to execute:

selecting a sequence of query frames from the sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are different frames obtained from different view angles; and

refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
The electronic device of claim 11, wherein the processor is further configured to execute:

comparing a pose of a current frame with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames has another query frame other than the current frame.
The electronic device of claim 12, wherein the processor is further configured to execute:

determining that the current frame represents a view angle sufficiently different than the stored query frame when a Euclidean distance between the pose of a current frame and the pose of at least one stored query frame is greater than a threshold; and

performing the evaluation of depth-image-based single frame relocalization on the current frame representing a view angle sufficiently different than the stored query frame.
The electronic device of claim 11, wherein each input frame in the sequence of the input frames comprises an RGB image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated a plurality of keyframes in a three dimensional (3D) map.
The electronic device of claim 14, wherein the plurality of keyframes comprises k nearest keyframes relative to the current frame, k is a positive integer, the point cloud registration of a current frame comprises iterative closest point (ICP) algorithm applied to the current frame, and the processor is further configured to execute:

providing k poses associated with the k nearest keyframes for the current frame; and

performing iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes.
The electronic device of claim 15, wherein an inlier RMSE of a specific pose among the k poses is computed for a specific keyframe among the k keyframes associated with the specific pose,

an inlier percentage of the specific poses is a percentage of one or more inlier points in all 3D points in the current frame, the one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP, and the k refined poses are associated with k inlier RMSEs and k inlier percentages; and

the processor is further configured to execute:

selecting one of the k refined poses with a least inlier root mean square error (RMSE) and a largest inlier percentage to form an estimated pose of the current frame.
The electronic device of claim 16, wherein the processor is further configured to execute:

selecting and adding the current frame as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame is below an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a certain percentage threshold, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.
The electronic device of claim 17, wherein the processor is further configured to execute:

storing the depth image associated with the current frame that is added to the sequence of query frames.
The electronic device of claim 17, wherein the processor is further configured to execute:

transforming all point clouds from all of the query frames to a reference coordinate frame of the 3D map using the estimated poses of the query frames;

computing Euclidean RMSE between each of the transformed point clouds of the query frames frame and the points of the reference coordinate frame in the 3D map;

determining computed Euclidean RMSEs of the query frames to generate a plurality of inlier frames, wherein an i-th frame in the sequence of query frames is determined as an inlier frame when a computed Euclidean RMSE of the i-th frame is smaller than a threshold δ _rmse; and

combining point clouds from all inlier frames and refine the estimated poses of the inlier frames to generate refined estimated poses using ICP.
The electronic device of claim 19, wherein i-th frame has an estimated pose
and an external pose
a point cloud PC _j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by transformation
A chip, comprising:

a processor, configured to call and run a computer program stored in a memory, to cause a device in which the chip is installed to execute any of the methods of claims 1 to 10.
A computer readable storage medium, in which a computer program is stored, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.
A computer program product, comprising a computer program, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.
A computer program, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.