WO2021244604A1 - Visual-based relocalization method, and electronic device

Info

Publication number
WO2021244604A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame, frames, sequence, query, current frame
Application number
PCT/CN2021/098096
Other languages
French (fr)
Inventor
Yuan Tian
Xiang Li
Yi Xu
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180032534.0A (published as CN115516524A)
Publication of WO2021244604A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content

Definitions

  • the present disclosure relates to the field of augmented reality (AR) systems, and more particularly, to a visual-based relocalization method.
  • Persistence is the ability to persist virtual objects in the same physical location and orientation as they are previously positioned in real-world space during an AR session or across different AR sessions. For example, during a first AR session, a user places a virtual sofa in a room using an AR application (app) . After a period, the user enters another AR session using the same app which can show the virtual sofa at the same location and in the same orientation.
  • the procedure of AR object persistence is also referred to as relocalization, which includes re-estimation of device poses with respect to a previously stored “map” representation.
  • one user device can set up references, also known as “anchors”, which can be some reference points or objects in real-world space.
  • Other user devices can relocalize themselves by matching some sensory data with the “anchors” .
  • Relocalization can utilize different sensory data, among which visual-based relocalization is the most popular.
  • Visual-based relocalization usually utilizes digital images from cameras as input and computes a six degrees of freedom (6 DoF) camera pose regarding a predefined coordinate system as output.
  • the device can be tracked in the same coordinate system as a previous AR session or a different user’s AR session.
  • An object of the present disclosure is to propose a visual-based relocalization method, and an electronic device.
  • an embodiment of the invention provides a visual-based relocalization method executable in an electronic device, comprising:
  • an embodiment of the invention provides an electronic device comprising a camera, a depth camera, an inertial measurement unit (IMU) , and a processor.
  • the camera is configured to capture a sequence of input frames. Each of the input frames comprises a color space image.
  • the depth camera is configured to capture a depth image that is associated with the color space image.
  • the IMU is configured to provide external odometry that is associated with the color space image.
  • the processor is configured to execute:
  • the disclosed method may be implemented in a chip.
  • the chip may include a processor, configured to call and run a computer program stored in a memory, to cause a device in which the chip is installed to execute the disclosed method.
  • the disclosed method may be programmed as computer executable instructions stored in non-transitory computer readable medium.
  • the non-transitory computer readable medium, when loaded to a computer, directs a processor of the computer to execute the disclosed method.
  • the non-transitory computer readable medium may comprise at least one from a group consisting of: a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a Read Only Memory, a Programmable Read Only Memory, an Erasable Programmable Read Only Memory, EPROM, an Electrically Erasable Programmable Read Only Memory and a Flash memory.
  • the disclosed method may be programmed as a computer program product that causes a computer to execute the disclosed method.
  • the disclosed method may be programmed as a computer program that causes a computer to execute the disclosed method.
  • the invention utilizes both RGB/monochrome camera and depth camera. Unlike other RGB and depth (RGBD) relocalization, the invention also uses external visual-inertial odometry (VIO) output that is available on most AR devices.
  • the VIO output comprises poses of the devices.
  • VIO is the process of determining a position and an orientation of a device by analyzing an associated image and inertial measurement unit (IMU) data.
  • IMU inertial measurement unit
  • the invention provides both mapping and relocalization enhanced with VIO and is efficient, decoupled from the SLAM procedure, very flexible to deploy, and requires no training process.
  • VIO uses both RGB/monochrome camera and IMU that provides external odometry.
  • the invention ultimately uses data from an RGB/monochrome camera, an IMU, and a depth camera.
  • the proposed method can increase the precision of relocalization.
  • the invention utilizes a sequence of images as input and can provide long-term persistence. For example, n frames of sensory data are utilized for relocalization. If a visual change of the environment after the mapping procedure affects only a small fraction of frames, the disclosed method can still pick the unchanged frames from the sequence of n frames to perform the relocalization. Compared with single frame-based relocalization, the proposed relocalization method is sequence-based and can have more robust performance when visual change exists in long-term persistence.
  • FIG. 1 illustrates a schematic view showing relocalization of a virtual object.
  • FIG. 2 illustrates a schematic view showing a system including mobile devices that execute a relocalization method according to an embodiment of the present disclosure.
  • FIG. 3 illustrates a schematic view showing three types of visual-based relocalization methods.
  • FIG. 4 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method.
  • FIG. 5 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.
  • FIG. 6 illustrates a schematic view showing a relocalization pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.
  • FIG. 7 is a block diagram of a system for wireless communication according to an embodiment of the present disclosure.
  • a user places a virtual object 220, such as an avatar, in a room with a desk 221 using an AR application executed by an electronic device 10.
  • the user enters another AR session B using the same app which can show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 even if the device is moved to another location.
  • Another electronic device 10c of another user may show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 in AR session C.
  • visual-based relocalization can both help with persistence and multi-user registration.
  • depth cameras have been increasingly equipped on commodity mobile devices, such as mobile phones and AR glasses. Depth information captured from a depth camera adds geometric details on top of the RGB appearance, and can be used to improve precision and robustness of relocalization.
  • a system including mobile devices 10a and 10b, a base station (BS) 200a, and a network entity device 300 executes the disclosed method according to an embodiment of the present disclosure.
  • the mobile devices 10a and 10b may be mobile phones, AR glasses, or other AR processing devices.
  • FIG. 1 is shown for illustration and not limitation, and the system may comprise more mobile devices, BSs, and CN entities. Connections between devices and device components are shown as lines and arrows in the FIGs.
  • the mobile device 10a may include a processor 11a, a memory 12a, a transceiver 13a, a camera 14a, a depth camera 15a, and an inertial measurement unit (IMU) 16a.
  • the mobile device 10b may include a processor 11b, a memory 12b, a transceiver 13b, a camera 14b, a depth camera 15b, and an inertial measurement unit (IMU) 16b.
  • Each of the cameras 14a and 14b captures and generates color space images from a scene.
  • Each of the depth cameras 15a and 15b captures and generates depth images from a scene.
  • the IMU 16a measures and generates external odometry of the device 10a.
  • the IMU 16b measures and generates external odometry of the device 10b.
  • Odometry of a device is an estimate of the change in the device's position over time, computed from motion sensor data.
  • a color space image camera such as camera 14a or 14b, is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image.
  • a depth camera such as depth camera 15a or 15b, is configured to capture a depth image that is associated with the color space image in each frame.
  • An IMU such as IMU 16a or 16b, is configured to provide external odometry that is associated with the color space image in each frame.
  • the base station 200a may include a processor 201a, a memory 202a, and a transceiver 203a.
  • the network entity device 300 may include a processor 301, a memory 302, and a transceiver 303.
  • Each of the processors 11a, 11b, 201a, and 301 may be configured to implement proposed functions, procedures and/or methods described in the description. Layers of radio interface protocol may be implemented in the processors 11a, 11b, 201a, and 301.
  • Each of the memory 12a, 12b, 202a, and 302 operatively stores a variety of programs and information to operate a connected processor.
  • Each of the transceivers 13a, 13b, 203a, and 303 is operatively coupled with a connected processor, transmits and/or receives radio signals or wireline signals.
  • the base station 200a may be an eNB, a gNB, or one of other types of radio nodes, and may configure radio resources for the mobile device 10a and mobile device 10b.
  • Each of the processors 11a, 11b, 201a, and 301 may include an application-specific integrated circuit (ASIC) , other chipsets, logic circuits and/or data processing devices.
  • Each of the memory 12a, 12b, 202a, and 302 may include read-only memory (ROM) , a random access memory (RAM) , a flash memory, a memory card, a storage medium and/or other storage devices.
  • Each of the transceivers 13a, 13b, 203a, and 303 may include baseband circuitry and radio frequency (RF) circuitry to process radio frequency signals.
  • An example of the electronic device 10 in the description may include one of the mobile device 10a or mobile device 10b.
  • three popular pipelines for the visual-based relocalization include pipelines for realizing a direct regression method, a match & refine method, and a match regression method.
  • An image 310 is input to the pipelines.
  • An electronic device may execute the methods to implement the pipelines.
  • a direct regression pipeline 320 realizing the direct regression method uses end-to-end methods which utilize a deep neural network (DNN) to regress pose 350 directly.
  • a pose may be defined as a 6 degrees-of-freedom (6DoF) translation and orientation of the user’s camera with reference to a coordinate space.
  • 6DoF pose of a three-dimensional (3D) object represents localization of a position and an orientation of the 3D object.
  • a pose is defined in ARCore as: “Pose represents an immutable rigid transformation from one coordinate space to another. As provided from all ARCore APIs, Poses always describe the transformation from object's local coordinate space to the world coordinate space...The transformation is defined using a quaternion rotation about the origin followed by a translation. ” Poses from ARCore APIs can be thought of as equivalent to OpenGL model matrices.
  • a match regression pipeline 340 realizing the match regression method extracts features from an image, then finds a match between the extracted features and a stored map, and finally computes the pose through the matching.
  • a map can be a virtually reconstructed environment.
  • a map is generated by sensors such as RGB camera, depth camera or Lidar sensor.
  • a map can be obtained locally or downloaded from the server.
  • a match and refine pipeline 330 realizing the match and refine method obtains sparse or dense features of a frame (block 331), regresses the match between features and map directly (block 332), then computes a pose based on the match (block 333), and outputs the computed pose (block 350).
  • mapping methods are typically designed corresponding to the specific relocalization method being used.
  • the direct regression method in FIG. 3 requires a DNN training step in mapping.
  • the match regression method also utilizes a learning process in mapping, which is not limited to DNNs.
  • the match and refine mapping pipeline 330 usually uses a keyframe-based method.
  • Popular keyframe methods include enhanced hierarchical bag-of-word library (DBoW2) and randomized ferns.
  • A mapping procedure is shown in FIG. 4.
  • An electronic device may execute the mapping procedure. When mapping begins, for example, a frame 20 with one image 21 and one pose 22 is pre-processed (block 401) to extract sparse or dense features.
  • a keyframe check is performed (block 402) to check whether the current frame 20 is eligible to become a new keyframe. If the current frame 20 is eligible to become a new keyframe, the frame 20 is added to and indexed in a keyframe database 30 (block 403) . The keyframe database is used in a subsequent relocalization procedure to retrieve a most similar keyframe based on an input frame. If the current frame 20 is not eligible to become a new keyframe, the frame 20 is dropped (block 404) .
  • the first challenge is long-term persistence, which means the virtual objects should persist for a long period of time.
  • the environment could always be changing. For example, chairs could be moved, a cup could be left at different places, and a bedsheet could be changed from time to time. Outdoor scenes suffer from lighting, occlusion, and seasonal changes. A naive solution may be to keep on updating the map, which is infeasible in most cases.
  • the second challenge is the limited computing power of most AR mobile devices that necessitates an efficient relocalization solution.
  • the third challenge is that multi-user AR applications, especially in indoor scenes, require high relocalization precision for good user experiences.
  • the invention requires RGB/monochrome image, depth image, and external odometry data for each frame and combines a data sequence of query frames as input. Note that the invention provides an embodiment of the match and refine method, and does not rely on any specific keyframe selection and retrieval model.
  • FIG. 5 shows a mapping pipeline of the disclosed method. Any current RGB/monochrome keyframe selection method can be used in the invention. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding" of IEEE transactions on visualization and computer graphics 21, no. 5 (2014): 571-583.
  • a keyframe is a frame that can represent significant information in the mapping. As shown in FIG. 4 and FIG. 5, each frame is checked as to whether the frame is qualified to be a keyframe or not. If the frame is qualified to be a keyframe, the keyframe is stored in the keyframe database.
  • a query frame is a special keyframe during relocalization, whose selection criteria are quite different from those of a keyframe in the mapping procedure.
  • a 3D point cloud 23 is also recorded, in the form of a depth image, for a keyframe (block 403’), and thus each keyframe has a 3D point cloud 23 recorded as the depth image of the keyframe.
  • a point cloud may be generated from a depth camera. Therefore, a sequence of 3D point clouds is constructed, and may be combined into one 3D map point cloud.
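As a rough illustration of the mapping loop described above, the following Python sketch stores a keyframe only when it is sufficiently dissimilar from the keyframes already in the database, together with its pose and a downsampled point cloud from the depth image. The descriptor, the dissimilarity threshold HAMMING_MIN, and the voxel size are illustrative assumptions; any existing RGB/monochrome keyframe selection method could supply the descriptor and the check.

```python
import numpy as np

HAMMING_MIN = 0.2   # assumed minimum dissimilarity for accepting a new keyframe
VOXEL_SIZE = 0.05   # assumed downsampling resolution (meters)

keyframe_db = []    # each entry: {"code", "pose", "point_cloud"}

def hamming(a, b):
    """Normalized Hamming distance between two binary descriptors."""
    return np.count_nonzero(a != b) / a.size

def downsample(points, voxel=VOXEL_SIZE):
    """Rough voxel-grid downsampling of an (N, 3) point cloud."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

def try_add_keyframe(code, pose, points):
    """Keyframe check (block 402): keep the frame only if it is dissimilar enough
    from every stored keyframe; store its pose and depth-image point cloud (block 403')."""
    if keyframe_db and min(hamming(code, kf["code"]) for kf in keyframe_db) < HAMMING_MIN:
        return False                                   # drop the frame (block 404)
    keyframe_db.append({"code": code, "pose": pose,
                        "point_cloud": downsample(points)})
    return True
```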
  • a relocalization procedure may be executed in a later AR session on the same device or on a different user’s device.
  • the visual-based relocalization method of the disclosure is executed by the device 10.
  • the visual-based relocalization method comprises selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames.
  • the sequence of the input frames is obtained from different view angles.
  • Each input frame in the sequence of the input frames comprises a color space image associated with a depth image.
  • the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated with a plurality of keyframes in a three dimensional (3D) map.
  • the plurality of keyframes comprises k nearest keyframes relative to the current frame, where k is a positive integer.
  • the point cloud registration of a current frame may comprise an iterative closest point (ICP) algorithm applied to the current frame.
  • the device refines estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames. The external poses are obtained from external odometry.
  • An embodiment of the relocalization method of the disclosure includes a brief pre-processing, and two stages to estimate the 6DoF pose.
  • the two stages comprise a first stage 620 for sequence generation and a second stage 630 for pose refinement.
  • FIG. 6 shows an entire relocalization pipeline.
  • the device 10 may execute the disclosed visual-based relocalization method to realize the relocalization pipeline.
  • a frame 20 comprises a color space image 21, a depth image 23, and an odometry pose 24.
  • the color space image may comprise an RGB or monochrome image obtained from a camera.
  • the depth image 23 may be obtained from a depth camera.
  • the odometry pose may be obtained from external odometry.
  • the frame 20 is processed as a current frame for preprocessing, the first stage for sequence generation, and the second stage for pose refinement.
  • the invention introduces a new pipeline that incorporates color space images, depth images, and external odometry to estimate the relocalization. Additionally, the invention proposes a method to generate a multi-modal sequence to reduce false relocalization. Further, the visual-based relocalization method with sequence-based pose refinement is proposed to improve the relocalization precision.
  • the device 10 obtains one or more frames for the disclosed relocalization method.
  • one frame is selected as the current frame 20 and comprises the color space image 21, depth image 23, and one 6 DoF pose 24 from external odometry. All of the color space images, depth images, and 6 DoF poses are synchronized.
  • the color space image 21, depth image 23, and odometry pose 24 are registered to the same reference frame of an RGB/monochrome camera, such as one of the camera 14a or 14b shown in FIG. 2, using extrinsic parameters that can be obtained via a calibration process.
  • the extrinsic parameters refer to a transformation matrix between a monochrome/RGB camera and a depth camera.
  • pinhole camera parameters are represented in a 4-by-3 matrix called the camera matrix. This matrix maps the 3-D world scene into an image plane.
  • the calibration algorithm calculates the camera matrix using the extrinsic and intrinsic parameters.
  • the extrinsic parameters represent the location of the camera in the 3-D scene.
  • the intrinsic parameters represent the optical center and focal length of the camera.
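A minimal sketch of this pre-processing step is given below. It assumes pinhole intrinsics (fx, fy, cx, cy) for the depth camera and a 4x4 extrinsic matrix T_rgb_from_depth obtained from calibration; these names are illustrative and not taken from the patent.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into 3D points in the depth camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]            # drop pixels with no valid depth

def register_to_rgb(points_depth, T_rgb_from_depth):
    """Register depth-camera points to the RGB/monochrome camera frame via extrinsics."""
    pts_h = np.concatenate([points_depth, np.ones((len(points_depth), 1))], axis=1)
    return (T_rgb_from_depth @ pts_h.T).T[:, :3]
```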
  • Pre-processing the one or more frames outputs a sequence of frames including images with depth information and poses, which are passed to the first stage 620 for sequence generation.
  • the 1st stage for sequence generation is configured to select and store a sequence of frames that are captured from different view angles. Each of the selected frames has a high probability of yielding a correct pose estimate. Note that frames selected from a plurality of input frames in this stage are not the same as the keyframes that are stored for mapping and retrieval, because the frames input to this stage are captured at a different time or from a different device.
  • a selected frame in the stage is named a query frame.
  • a query frame needs to have a different view angle from all other query frames in the stored sequence and has the potential to estimate the correct pose.
  • the first stage has four main steps as shown in FIG. 6.
  • the first step in the stage is the pose check (block 621) .
  • This step makes sure a new query frame is from a different view angle from previous query frames already added in the sequence.
  • the device compares a pose of the current frame 20 with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames is not empty and has another query frame other than the current frame. If no query frame is in the sequence, this step of pose check is omitted.
  • the device 10 uses a pose from external odometry associated with the current frame 20 to check whether the current frame 20 has enough view angle difference from previous query frames.
  • the pose of current frame 20 is compared with one or more last query frames in the sequence.
  • the current frame 20 is selected for further processing in the next step when the pose check passes. If the Euclidean distance between the two compared poses is not larger than a threshold θ_trans or the angle difference is not larger than a threshold θ_rot, the device 10 determines the current frame is not a qualified query frame, and the current frame 20 is disregarded (block 625).
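A small sketch of the pose check (block 621) could look as follows, assuming 4x4 pose matrices and illustrative threshold values. Because the accept/reject combination of the two thresholds is ambiguous in this text, the sketch rejects the current frame only when it is close to a stored query frame in both translation and rotation; treat that choice as an assumption.

```python
import numpy as np

THETA_TRANS = 0.3             # assumed translation threshold (meters)
THETA_ROT = np.deg2rad(15.0)  # assumed rotation threshold (radians)

def rotation_angle(R_a, R_b):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def passes_pose_check(current_pose, stored_query_poses):
    """Block 621: keep the current frame only if it views the scene from a new angle."""
    if not stored_query_poses:
        return True                      # empty sequence: the pose check is omitted
    for q in stored_query_poses:         # compare with the stored query frames
        d_trans = np.linalg.norm(current_pose[:3, 3] - q[:3, 3])
        d_rot = rotation_angle(current_pose[:3, :3], q[:3, :3])
        if d_trans <= THETA_TRANS and d_rot <= THETA_ROT:
            return False                 # too similar to an existing query frame (block 625)
    return True
```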
  • The second step is relocalization using a single frame (block 622).
  • the device 10 performs the evaluation of depth-image-based single frame relocalization on the current frame 20. Specifically, (1) feature extraction for the current frame 20 is performed depending on what keyframe selection method has been used during mapping. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding" of IEEE transactions on visualization and computer graphics 21, no. 5 (2014) : 571-583. Another keyframe selection method is disclosed by Gálvez-López, D., and J. D. Tardós in an article titled "DBoW2: Enhanced hierarchical bag-of-word library for C++” (2012) .
  • the device 10 searches k nearest keyframes using k nearest neighbors (kNN) from the keyframe database, where k is a positive integer.
  • Distance measurement for kNN is defined based on the feature as well. For example, if randomized ferns are used as features of frames, then a distance is computed as a Hamming distance between ferns of the current frame 20 and one of the k nearest frames.
  • An ORB-based feature extraction for frames is disclosed by Rublee, Ethan, Vincent Rabaud, Kurt Konolige, and Gary Bradski in an article titled "ORB: An efficient alternative to SIFT or SURF" in 2011 IEEE International conference on computer vision, pp. 2564-2571. If sparse feature such as ORB is used as features of frames, the distance can be computed as a Hamming distance of an ORB descriptor of the current frame 20 and an ORB descriptor of one of the k nearest frames.
  • the k nearest keyframes provide k initial poses for the current frame.
  • the k poses are associated with the k nearest keyframes and are prestored in the keyframe database during the mapping procedure.
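The retrieval step might be sketched as below, assuming each keyframe stores a binary descriptor (for example a randomized-fern or ORB-style code) and the pose computed during mapping; the dictionary keys and the value of k are illustrative.

```python
import numpy as np

def knn_keyframes(current_code, keyframe_db, k=5):
    """Retrieve the k nearest keyframes by Hamming distance between binary codes."""
    dists = np.array([np.count_nonzero(current_code != kf["code"]) for kf in keyframe_db])
    order = np.argsort(dists)[:k]
    # Each retrieved keyframe carries the pose stored during mapping, which serves
    # as one of the k initial poses for the current frame.
    return [keyframe_db[i] for i in order]
```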
  • the device 10 then performs an iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and a 3D point cloud associated with each of the nearest keyframes to refine the k poses.
  • Among the k refined poses, the one with the least inlier RMSE (root mean square error) and the largest inlier percentage is selected as an estimated pose of the current frame 20 for the next stage.
  • the device 10 computes an inlier RMSE inlier_rmse of a specific pose among the k poses for a specific keyframe among the k keyframes associated with the specific pose as: inlier_rmse = sqrt((1/N) · Σ ‖T·p - q‖²), where the sum runs over the N inlier correspondences (p, q), p represents a 3D point in the point cloud of the current frame 20, q represents the corresponding 3D point in the point cloud of the specific keyframe, T is the specific pose, and ‖·‖ represents an operation that outputs the Euclidean norm of T·p - q.
  • An inlier percentage of the specific pose is the percentage of inlier points among all 3D points in the current frame 20.
  • the one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP.
  • the k refined poses are associated with k inlier RMSEs and k inlier percentages.
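As one concrete reading of the inlier RMSE and inlier percentage above, the sketch below computes both from ICP correspondences; the array layout and the correspondence format are assumptions made for illustration.

```python
import numpy as np

def icp_metrics(src_points, dst_points, T, inlier_pairs):
    """Compute inlier RMSE and inlier percentage for a candidate pose T (4x4).

    src_points: (N, 3) points of the current frame; dst_points: (M, 3) points of
    the keyframe; inlier_pairs: list of (i, j) correspondences found during ICP.
    """
    if not inlier_pairs:
        return float("inf"), 0.0
    src_h = np.concatenate([src_points, np.ones((len(src_points), 1))], axis=1)
    src_t = (T @ src_h.T).T[:, :3]                 # current-frame points under pose T
    residuals = np.array([np.linalg.norm(src_t[i] - dst_points[j])
                          for i, j in inlier_pairs])
    inlier_rmse = float(np.sqrt(np.mean(residuals ** 2)))
    inlier_pct = len(inlier_pairs) / len(src_points)   # fraction of current-frame points matched
    return inlier_rmse, inlier_pct
```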
  • the device 10 selects one of the k refined poses with the least inlier root mean square error (RMSE) and the largest inlier percentage to form an estimated pose of the current frame.
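The refine-and-select step could be implemented along the lines of the following sketch, which uses Open3D's point-to-point ICP as one possible registration backend. The keyframe dictionary layout, the correspondence distance, and the assumption that keyframe point clouds are stored in map coordinates (so the stored keyframe pose can seed the ICP) are illustrative choices, not details taken from the patent.

```python
import numpy as np
import open3d as o3d

def refine_and_select(current_points, nearest_keyframes, max_corr_dist=0.05):
    """Refine the k initial poses with ICP and keep the one with the least
    inlier RMSE and the largest inlier percentage (block 622)."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(current_points)
    best, best_score = None, None
    for kf in nearest_keyframes:
        tgt = o3d.geometry.PointCloud()
        tgt.points = o3d.utility.Vector3dVector(kf["point_cloud"])
        result = o3d.pipelines.registration.registration_icp(
            src, tgt, max_corr_dist, kf["pose"],
            o3d.pipelines.registration.TransformationEstimationPointToPoint())
        # result.inlier_rmse: RMSE over matched correspondences;
        # result.fitness: fraction of source points with a match (inlier percentage).
        score = (result.inlier_rmse, -result.fitness)  # one way to combine the two criteria
        if best_score is None or score < best_score:
            best, best_score = result, score
    return best.transformation, best.inlier_rmse, best.fitness
```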
  • the third step is ICP metric check (block 623) .
  • ICP is utilized to transform points.
  • ICP is utilized to double check points.
  • ICP metric is a combination of inlier RMSE and inlier percentage.
  • the ICP metric check uses the inlier percentage and inlier RMSE to determine whether a frame qualifies as a query frame.
  • If the current frame 20 has a selected pose with the inlier RMSE below a threshold θ_rmse and the inlier percentage higher than a threshold θ_per, the current frame 20 becomes a query frame and is added into the sequence of query frames (block 624). Otherwise, the current frame is disregarded (block 625), and the process continues to the next frame.
  • A frame may fail the ICP metric check under several conditions: the current frame 20 includes a region that has not been mapped by the mapping procedure;
  • the current frame 20 includes a region that has been mapped, but keyframe retrieval fails to find a good match; or
  • the initial pose for ICP may be too far away from the truth, that is, from a ground truth pose.
  • the first condition should be avoided. If the current frame includes a region that has not been mapped, then the relocalization shall not be performed at all.
  • the current frame including the region is referred to as an out-of-map frame. Unless an out-of-map frame has a similar appearance and similar geometry to some keyframe in the map, the inlier RMSE can be high.
  • the thresholds θ_rmse and θ_per can be set empirically, but may be different depending on depth camera parameters and mapping scenario. A process that finds the optimal thresholds for θ_rmse and θ_per can be performed after a mapping procedure.
  • the device 10 can use keyframes in the map as input to perform single frame relocalization.
  • Single frame relocalization is a process that determines a pose of the frame with regard to the map.
  • each keyframe is stored with a camera pose.
  • Such pose is computed in the mapping stage, and can be called a “ground truth pose” .
  • a mapping process selects a set of keyframes and computes poses of the selected keyframes. These poses are considered to be ground truth in this step. Since the ground truth pose is known for each keyframe, the result of relocalization can be decided. Since the relocalization is successfully completed when the estimated pose has translation and rotation error smaller than thresholds, the query frame selection can be regarded as a classification problem using ICP metric as features.
  • the ICP metric may comprise the inlier RMSE and inlier percentage related measurements as parameters. Then, such ICP metric parameters can be processed with machine learning, such as simple decision tree training, to avoid most negative cases.
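For example, if keyframes with known mapping poses are replayed through single-frame relocalization, a small classifier can stand in for hand-tuned thresholds. The sketch below uses scikit-learn's DecisionTreeClassifier on [inlier RMSE, inlier percentage] features; the toy training arrays are purely illustrative placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training set: for each replayed keyframe, [inlier_rmse, inlier_percentage]
# and whether the estimated pose was within the translation/rotation error thresholds.
X = np.array([[0.012, 0.83], [0.045, 0.31], [0.018, 0.76], [0.060, 0.22]])
y = np.array([1, 0, 1, 0])   # 1 = relocalization succeeded, 0 = failed

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

def accept_query_frame(inlier_rmse, inlier_pct):
    """ICP metric check (block 623) using the learned tree instead of fixed thresholds."""
    return bool(clf.predict([[inlier_rmse, inlier_pct]])[0])
```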
  • the device 10 selects and adds the current frame 20 as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame 20 is below an RMSE threshold θ_rmse, and the inlier percentage of the selected refined pose of the current frame 20 is higher than a percentage threshold θ_per.
  • the estimated pose of the selected current frame 20 is obtained as one of the estimated poses of the query frames as a consequence of the selecting of the current frame 20.
  • the device 10 also stores a corresponding point cloud from the depth image associated with the query frame.
  • the point cloud might be downsampled for efficiency.
  • the device 10 may use the point cloud for pose refinement. The process may be repeated for each of the input frames to generate a plurality of query frames and the estimated poses of the query frames.
  • the pose refinement stage is to use a refined subset of frames in the query frame sequence to refine the estimated poses of the query frames (block 631) .
  • This stage starts when the number of query frames is larger than a threshold N_seq.
  • Although all query frames meet the ICP metric in the first stage, not all of them are used for final pose refinement due to errors in pose estimation or in ICP.
  • the goal of the second stage is to select enough inlier frames from query frames. Note that here an inlier means a frame instead of a point during ICP.
  • a random sample consensus (RANSAC) -like method may be used to select inliers.
  • the algorithm for the second stage is shown in Table 1:
  • the input of the 2nd stage is all query frames in the sequence with external poses from odometry, and estimated poses regarding the map from the sequence generation stage.
  • External poses are generated from external odometry.
  • Estimated poses are generated from the relocalization process.
  • the device 10 transforms all point clouds from all of the query frames to a reference coordinate frame of the 3D map using estimated poses of the query frames.
  • Any map has an origin and directions of x, y, and z axes.
  • a coordinate system of a map is referred to as a reference coordinate frame.
  • a reference coordinate frame is not a frame in the sequence.
  • the device 10 computes a Euclidean RMSE between each of the transformed point clouds of the query frames and the point clouds of the reference coordinate frame in the 3D map. As shown in line 4 of the Algorithm 1, the device 10 evaluates the computed Euclidean RMSEs associated with the query frames to generate a plurality of inlier frames, wherein a frame i in the sequence of query frames is determined as an inlier frame when the computed Euclidean RMSE of the frame i is smaller than a threshold θ_rmse. The device 10 combines point clouds from all inlier frames and refines the estimated poses of the inlier frames to generate refined estimated poses using ICP.
  • the device 10 may use the refined estimated poses to improve visual-based relocalization. For example, the device 10 may use the refined estimated poses to better relocate an AR session, an AR content, or a virtual object. After relocalization is done, a virtual object can be placed into the scene. With reference to FIG. 6, in the 2nd stage, from all the estimated poses, the device 10 selects a frame i with a sufficiently good estimated pose. To do this, for each estimated pose the device 10 transforms all point clouds from all query frames to the reference coordinate frame of the map using the estimated poses, as shown in line 2 of the Algorithm 1.
  • A Euclidean RMSE is computed between points in a point cloud PC_seq of all the frames in the sequence using the pose of the frame i and points in a point cloud PC_map of the map. If the Euclidean RMSE is smaller than a threshold θ_rmse, then the frame i is treated as an inlier. When the number of inliers is large enough, such as a number greater than n/2, all the inlier frames are saved as elements in the refined subset. In one embodiment, once such an inlier frame is found, the device 10 returns the inlier frame and the transformation applied to the inlier as the output of the 2nd stage.
  • Each inlier frame in the output of the 2nd stage is associated with the estimated pose of the selected frame i, an external pose, and the transformation computed for all j in (1..n) in the sequence.
  • the variable i is a selected frame index for pose initialization, and the variable j is one frame index from 1 to n. The transformation for a frame j composes the estimated pose of the frame i with the relative external odometry between the frames i and j, which uses the reverse rotation of the external pose of the frame i. This early return strategy reduces the computational cost of Algorithm 1.
  • a frame that has the largest number of inliers is selected and saved as an element in the refined subset. For example, smaller RMSE breaks a tie. In other words, if two frames have the same number of inliers, an embodiment of the disclosed method prefers one frame with smaller RMSE in the frame selection for the refined subset.
  • the device 10 combines point clouds from all inlier frames and refines the estimated pose using ICP, and outputs the refined estimated pose P_final as a portion of the output of the 2nd stage.
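Since Table 1 itself is not reproduced in this text, the following is a hedged sketch of the second stage as described above. The frame dictionary layout, the RMSE threshold, the brute-force nearest-neighbor search, and the exact composition of estimated and external poses are assumptions made for illustration.

```python
import numpy as np

THETA_RMSE = 0.05   # assumed Euclidean RMSE threshold for an inlier frame (meters)

def transform(points, T):
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (T @ pts_h.T).T[:, :3]

def nn_rmse(points, map_points):
    """Euclidean RMSE between frame points and their nearest map points
    (brute force here; a KD-tree would be used in practice)."""
    d = np.linalg.norm(points[:, None, :] - map_points[None, :, :], axis=-1).min(axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

def select_inlier_frames(query_frames, map_points):
    """RANSAC-like inlier frame selection. Each query frame carries 'points' (from its
    depth image), an 'estimated_pose' from stage 1 and an 'external_pose' from odometry
    (both 4x4). Returns the inlier frame indices and the winning map transformation."""
    n = len(query_frames)
    best_inliers, best_T, best_mean = [], None, np.inf
    for fi in query_frames:                           # hypothesize frame i's estimated pose
        # Propagate frame i's estimated pose to every frame j through relative odometry.
        T_map_from_odom = fi["estimated_pose"] @ np.linalg.inv(fi["external_pose"])
        rmses = [nn_rmse(transform(fj["points"], T_map_from_odom @ fj["external_pose"]),
                         map_points)
                 for fj in query_frames]
        inliers = [j for j, r in enumerate(rmses) if r < THETA_RMSE]
        if len(inliers) > n / 2:                      # early return once a hypothesis is good enough
            return inliers, T_map_from_odom
        if (len(inliers), -np.mean(rmses)) > (len(best_inliers), -best_mean):
            best_inliers, best_T, best_mean = inliers, T_map_from_odom, float(np.mean(rmses))
    # The caller combines the point clouds of the inlier frames and refines the pose with ICP.
    return best_inliers, best_T
```

The early return mirrors the strategy described above: once a hypothesized pose explains more than half of the query frames, the search stops and the pose is refined with ICP over the combined inlier point clouds.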
  • the device 10 determines whether the pose refinement is successful (block 632) .
  • pose refinement is successful with an estimated pose with smallest mean RMSE
  • the estimated pose with the smallest mean RMSE and inliers associated with the estimated pose are also stored as the refined estimated pose P_final (block 634).
  • the device 10 removes the frames that are outliers of the estimated pose with the smallest mean RMSE, and repeats the 1st stage and the 2nd stage for other input frames.
  • the device 10 processes a new frame as a current frame in the 1st stage and the 2nd stage until the refined subset has enough frames.
  • the proposed method utilizes RGB/monochrome images, depth images and external odometry as input to realize visual-based relocalization.
  • the method adopts a traditional pipeline.
  • the computation is fast and suitable for mobile AR devices.
  • Sequence-based relocalization can achieve higher precision than the single frame method.
  • This method is also robust against visual change in the environment since a sequence is taken as the input instead of a single frame.
  • FIG. 7 is a block diagram of an example system 700 for the disclosed visual-based relocalization method according to an embodiment of the present disclosure. Embodiments described herein may be implemented into the system using any suitably configured hardware and/or software.
  • FIG. 7 illustrates the system 700 including a radio frequency (RF) circuitry 710, a baseband circuitry 720, a processing unit 730, a memory/storage 740, a display 750, a camera module 760, a sensor 770, and an input/output (I/O) interface 780, coupled with each other as illustrated.
  • the processing unit 730 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors.
  • the processors may include any combinations of general-purpose processors and dedicated processors, such as graphics processors and application processors.
  • the processors may be coupled with the memory/storage and configured to execute instructions stored in the memory/storage to enable various applications and/or operating systems running on the system.
  • the baseband circuitry 720 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors.
  • the processors may include a baseband processor.
  • the baseband circuitry may handle various radio control functions that enable communication with one or more radio networks via the RF circuitry.
  • the radio control functions may include, but are not limited to, signal modulation, encoding, decoding, radio frequency shifting, etc.
  • the baseband circuitry may provide for communication compatible with one or more radio technologies.
  • the baseband circuitry may support communication with 5G NR, LTE, an evolved universal terrestrial radio access network (EUTRAN) and/or other wireless metropolitan area networks (WMAN) , a wireless local area network (WLAN) , a wireless personal area network (WPAN) .
  • the baseband circuitry 720 may include circuitry to operate with signals that are not strictly considered as being in a baseband frequency.
  • baseband circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.
  • the RF circuitry 710 may enable communication with wireless networks using modulated electromagnetic radiation through a non-solid medium.
  • the RF circuitry may include switches, filters, amplifiers, etc. to facilitate communication with the wireless network.
  • the RF circuitry 710 may include circuitry to operate with signals that are not strictly considered as being in a radio frequency.
  • RF circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.
  • the transmitter circuitry, control circuitry, or receiver circuitry discussed above with respect to the UE, eNB, or gNB may be embodied in whole or in part in one or more of the RF circuitries, the baseband circuitry, and/or the processing unit.
  • “circuitry” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC) , an electronic circuit, a processor (shared, dedicated, or group) , and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality.
  • the electronic device circuitry may be implemented in, or functions associated with the circuitry may be implemented by, one or more software or firmware modules.
  • some or all of the constituent components of the baseband circuitry, the processing unit, and/or the memory/storage may be implemented together on a system on a chip (SOC) .
  • the memory/storage 740 may be used to load and store data and/or instructions, for example, for the system.
  • the memory/storage for one embodiment may include any combination of suitable volatile memory, such as dynamic random access memory (DRAM) , and/or non-volatile memory, such as flash memory.
  • the I/O interface 780 may include one or more user interfaces designed to enable user interaction with the system and/or peripheral component interfaces designed to enable peripheral component interaction with the system.
  • User interfaces may include, but are not limited to a physical keyboard or keypad, a touchpad, a speaker, a microphone, etc.
  • Peripheral component interfaces may include, but are not limited to, a non-volatile memory port, a universal serial bus (USB) port, an audio jack, and a power supply interface.
  • the camera module 760 may comprise a color space image camera and a depth camera, such as the depth camera 15a or 15b.
  • the color space image camera is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image.
  • the depth camera is configured to capture a depth image that is associated with the color space image in each frame.
  • the sensor 770 is configured to provide external odometry that is associated with the color space image in each frame.
  • the sensor 770 may include one or more sensing devices to determine environmental conditions and/or location information related to the system.
  • the sensors may include, but are not limited to, an IMU, a gyro sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit.
  • the positioning unit may also be part of, or interact with, the baseband circuitry and/or RF circuitry to communicate with components of a positioning network, e.g., a global positioning system (GPS) satellite.
  • the display 750 may include a display, such as a liquid crystal display and a touch screen display.
  • the system 700 may be a mobile computing device such as, but not limited to, a laptop computing device, a tablet computing device, a netbook, an ultrabook, a smartphone, etc.
  • the system may have more or fewer components, and/or different architectures.
  • the methods described herein may be implemented as a computer program.
  • the computer program may be stored on a storage medium, such as a non-transitory storage medium.
  • the embodiment of the present disclosure is a combination of techniques/processes that can be adopted to create an end product.
  • the units described as separate components for explanation may or may not be physically separated.
  • the units shown for display may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the units are used according to the purposes of the embodiments.
  • each of the functional units in each of the embodiments can be integrated into one processing unit, physically independent, or integrated into one processing unit with two or more than two units.
  • When the software function unit is realized, used, and sold as a product, it can be stored in a readable storage medium in a computer.
  • the technical solution proposed by the present disclosure can be essentially or partially realized in the form of a software product.
  • one part of the technical solution beneficial to the conventional technology can be realized in the form of a software product.
  • the software product in the computer is stored in a storage medium, including a plurality of commands for a computational device (such as a personal computer, a server, or a network device) to run all or some of the steps disclosed by the embodiments of the present disclosure.
  • the storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM) , a random access memory (RAM) , a floppy disk, or other kinds of media capable of storing program codes.
  • the proposed solution adopts a match and refine pipeline and includes two-stage processing to refine the pose.
  • the first stage selects the query frames into the sequence.
  • the second stage selects the inlier frames from the sequence.
  • the inlier frames are used to refine the pose.
  • the disclosed method achieves high relocalization precision while maintaining efficiency with low computation resources. Because of the sequence inlier selection, the invention can avoid the drawbacks of keyframe-based method, including bad initialization and bad ICP caused by insufficient geometric details. Furthermore, the sequence takes inlier frames with good geometric fitting. When the sequence is long enough to cover static portions of a scene with no visual changes, the disclosed method can process scenes with visual changes.

Abstract

A visual-based relocalization method is executed in an electronic device. The visual-based relocalization method comprising sequence-based pose refinement is proposed to improve the relocalization precision. The device selects a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames. The sequence of the input frames are obtained from different view angles. The device refines estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames. The external poses are obtained from external odometry.

Description

VISUAL-BASED RELOCALIZATION METHOD, AND ELECTRONIC DEVICE Technical Field
The present disclosure relates to the field of augmented reality (AR) systems, and more particularly, to a visual-based relocalization method.
Background Art
In augmented reality (AR) applications, visual-based relocalization is a crucial part to support AR object persistence and multiple user registration. Persistence is the ability to persist virtual objects in the same physical location and orientation as they are previously positioned in real-world space during an AR session or across different AR sessions. For example, during a first AR session, a user places a virtual sofa in a room using an AR application (app) . After a period, the user enters another AR session using the same app which can show the virtual sofa at the same location and in the same orientation. The procedure of AR object persistence is also referred to as relocalization, which includes re-estimation of device poses with respect to a previously stored “map” representation. For multiple user interaction in an AR session, one user device can set up a reference, or known as “anchors” , which can be some reference points or objects in real-world space. Other user devices can relocalize themselves by matching some sensory data with the “anchors” . Relocalization can utilize different sensory data, among which visual-based relocalization is the most popular.
Visual-based relocalization usually utilizes digital images from cameras as input and computes a six degrees of freedom (6 DoF) camera pose regarding a predefined coordinate system as output. Thus, after relocalization, the device can be tracked in the same coordinate system as a previous AR session or a different user’s AR session.
Technical Problem
Numerous research works have been published for visual-based relocalization, many of which are implemented together with a simultaneous localization and mapping (SLAM) process. The techniques are widely developed and integrated into current AR software products, such as ARKit and ARCore, and current AR hardware products, such as AR glasses. Relocalization typically needs a sparse or dense map representation of the environment. Then, the visual appearance of the map is utilized to provide the initial pose estimation followed by a pose refinement stage depending on applications. Most of the methods use red green blue (RGB) images for relocalization.
Technical Solution
An object of the present disclosure is to propose a visual-based relocalization method, and an electronic device.
In a first aspect, an embodiment of the invention provides a visual-based relocalization method executable in an electronic device, comprising:
selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are obtained from different view angles; and
refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
In a second aspect, an embodiment of the invention provides an electronic device comprising a camera, a depth camera, an inertial measurement unit (IMU), and a processor. The camera is configured to capture a sequence of input frames. Each of the input frames comprises a color space image. The depth camera is configured to capture a depth image that is associated with the color space image. The IMU is configured to provide external odometry that is associated with the color space image. The processor is configured to execute:
selecting a sequence of query frames from the sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are obtained from different view angles; and
refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
The disclosed method may be implemented in a chip. The chip may include a processor, configured to call and run a computer program stored in a memory, to cause a device in which the chip is installed to execute the disclosed method.
The disclosed method may be programmed as computer executable instructions stored in non-transitory computer readable medium. The non-transitory computer readable medium, when loaded to a computer, directs a processor of the computer to execute the disclosed method.
The non-transitory computer readable medium may comprise at least one from a group consisting of: a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a Read Only Memory, a Programmable Read Only Memory, an Erasable Programmable Read Only Memory, EPROM, an Electrically Erasable Programmable Read Only Memory and a Flash memory.
The disclosed method may be programmed as a computer program product that causes a computer to execute the disclosed method.
The disclosed method may be programmed as a computer program that causes a computer to execute the disclosed method.
Advantageous Effects
To overcome these challenges, the invention utilizes both RGB/monochrome camera and depth camera. Unlike other RGB and depth (RGBD) relocalization, the invention also uses external visual-inertial odometry (VIO) output that is available on most AR devices. The VIO output comprises poses of the devices. VIO is the process of determining a position and an orientation of a device by analyzing an associated image and inertial measurement unit (IMU) data. The invention provides both mapping and relocalization enhanced with VIO and is efficient, decoupled from the SLAM procedure, very flexible to deploy, and requires no training process. VIO uses both RGB/monochrome camera and IMU that provides external odometry. In other words, the invention ultimately uses data from an RGB/monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, the proposed method can increase the precision of relocalization. Furthermore, the invention utilizes a sequence of images as input and can provide long-term persistence. For example, n frames of sensory data are utilized for relocalization. If a visual change of the environment after the mapping procedure affects only a small fraction of frames, the disclosed method can still pick the unchanged frames from the sequence of n frames to perform the relocalization. Compared with single frame-based relocalization, the proposed relocalization method is sequence-based and can have more robust performance when visual change exists in long-term persistence.
Description of Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the related art, the figures to be described in the embodiments are briefly introduced below. It is obvious that the drawings are merely some embodiments of the present disclosure; a person having ordinary skill in this field can obtain other figures according to these figures without creative effort.
FIG. 1 illustrates a schematic view showing relocalization of a virtual object.
FIG. 2 illustrates a schematic view showing a system including mobile devices that execute a relocalization method according to an embodiment of the present disclosure.
FIG. 3 illustrates a schematic view showing three types of visual-based relocalization methods.
FIG. 4 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method.
FIG. 5 illustrates a schematic view showing a mapping pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.
FIG. 6 illustrates a schematic view showing a relocalization pipeline for a visual-based relocalization method according to an embodiment of the present disclosure.
FIG. 7 is a block diagram of a system for wireless communication according to an embodiment of the present disclosure.
DETAILED DESCRIPTION OF EMBODIMENTS
Embodiments of the disclosure are described in detail with the technical matters, structural features, achieved objects, and effects with reference to the accompanying drawings as follows. Specifically, the terminologies in the embodiments of the present disclosure are merely for describing the purpose of the certain embodiment, but not to limit the disclosure.
With reference to FIG. 1, for example, during a first AR session A, a user places a virtual object 220, such as an avatar, in a room with a desk 221 using an AR application executed by an electronic device 10. After a period, the user enters another AR session B using the same app which can show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 even if the device is moved to another location. Another electronic device 10c of another user may show the virtual object 220 at the same location and in the same orientation with respect to the desk 221 in AR session C.
As shown in FIG. 1, visual-based relocalization can both help with persistence and multi-user registration. Recently, depth cameras have been increasingly equipped on commodity mobile devices, such as mobile phones and AR glasses. Depth information captured from a depth camera adds geometric details on top of the RGB appearance, and can be used to improve precision and robustness of relocalization.
With reference to FIG. 2, a system including  mobile devices  10a and 10b, a base station (BS) 200a, and a network entity device 300 executes the disclosed method according to an embodiment of the present disclosure. The  mobile devices  10a and 10b may be mobile phones, AR glasses, or other AR processing devices. FIG. 1 is shown for illustrative not limiting, and the system may comprise more mobile devices, BSs, and CN entities. Connections between devices and device components are shown as lines and arrows in the FIGs. The mobile device 10a may include a processor 11a, a memory 12a, a transceiver 13a, a camera 14a, a depth camera 15a, and an inertial measurement unit (IMU) 16a. The mobile device 10b may include a processor 11b, a memory 12b, a transceiver 13b, a camera 14b, a depth camera 15b, and an inertial measurement unit (IMU) 16b. Each of the  cameras  14a and 14b captures and generates color space images from a scene. Each of the  depth cameras  15a and 15b captures and generates depth images from a scene. The IMU 16a measures and generates external odometry of the device 10a. The IMU 16b measures and generates external odometry of the device 10b. Odometry of a device is an estimation that uses data from motion sensors to estimate the position change of the device over time. A color space image camera, such as  camera  14a or 14b, is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image. A depth camera, such as  depth camera  15a or 15b, is configured to capture a depth image that is  associated with the color space image in each frame. An IMU, such as  IMU  16a or 16b, is configured to provide external odometry that is associated with the color space image in each frame.
The base station 200a may include a processor 201a, a memory 202a, and a transceiver 203a. The network entity device 300 may include a processor 301, a memory 302, and a transceiver 303. Each of the processors 11a, 11b, 201a, and 301 may be configured to implement the proposed functions, procedures and/or methods described in the description. Layers of a radio interface protocol may be implemented in the processors 11a, 11b, 201a, and 301. Each of the memories 12a, 12b, 202a, and 302 operatively stores a variety of programs and information to operate a connected processor. Each of the transceivers 13a, 13b, 203a, and 303 is operatively coupled with a connected processor and transmits and/or receives radio signals or wireline signals. The base station 200a may be an eNB, a gNB, or one of other types of radio nodes, and may configure radio resources for the mobile device 10a and the mobile device 10b.
Each of the processors 11a, 11b, 201a, and 301 may include an application-specific integrated circuit (ASIC), other chipsets, logic circuits and/or data processing devices. Each of the memories 12a, 12b, 202a, and 302 may include read-only memory (ROM), random access memory (RAM), flash memory, a memory card, a storage medium and/or other storage devices. Each of the transceivers 13a, 13b, 203a, and 303 may include baseband circuitry and radio frequency (RF) circuitry to process radio frequency signals. When the embodiments are implemented in software, the techniques described herein can be implemented with modules, procedures, functions, entities and so on, that perform the functions described herein. The modules can be stored in a memory and executed by the processors. The memory can be implemented within a processor or external to the processor, in which case it can be communicatively coupled to the processor via various means known in the art.
An example of the electronic device 10 in the description may include one of the mobile device 10a or mobile device 10b.
With reference to FIG. 3, three popular pipelines for visual-based relocalization include pipelines for realizing a direct regression method, a match and refine method, and a match regression method. An image 310 is input to the pipelines. An electronic device may execute the methods to implement the pipelines.
A direct regression pipeline 320 realizing the direct regression method uses end-to-end methods which utilize a deep neural network (DNN) to regress a pose 350 directly. A pose may be defined as a 6 degrees-of-freedom (6DoF) translation and orientation of the user’s camera with respect to a coordinate space. A 6DoF pose of a three-dimensional (3D) object represents localization of a position and an orientation of the 3D object. A pose is defined in ARCore as: “Pose represents an immutable rigid transformation from one coordinate space to another. As provided from all ARCore APIs, Poses always describe the transformation from object's local coordinate space to the world coordinate space…The transformation is defined using a quaternion rotation about the origin followed by a translation.” Poses from ARCore APIs can be thought of as equivalent to OpenGL model matrices.
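For illustration only, such a 6DoF pose can be represented as a 4-by-4 rigid transformation built from a quaternion rotation followed by a translation, consistent with the definition quoted above. The following sketch assumes NumPy and SciPy are available; the function name, the quaternion convention (x, y, z, w), and the example values are illustrative assumptions rather than part of ARCore or this disclosure.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(quat_xyzw, translation):
    """Build a 4x4 rigid transform (local -> world) from a quaternion and a translation."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_quat(quat_xyzw).as_matrix()  # quaternion rotation about the origin
    T[:3, 3] = translation                                  # followed by a translation
    return T

# Example: a pose rotated 90 degrees about the z-axis and shifted 1 m along x.
pose = pose_to_matrix([0.0, 0.0, np.sqrt(0.5), np.sqrt(0.5)], [1.0, 0.0, 0.0])
point_local = np.array([0.2, 0.0, 0.0, 1.0])  # homogeneous point in local coordinates
point_world = pose @ point_local              # the same point in world coordinates
```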
A match regression pipeline 340 realizing the match regression method extracts features from an image, then finds a match between the extracted features and a stored map, and finally computes the pose through the matching. A map can be a virtually reconstructed environment. A map is generated by sensors such as an RGB camera, a depth camera, or a Lidar sensor. A map can be obtained locally or downloaded from a server. A match and refine pipeline 330 realizing the match and refine method obtains sparse or dense features of a frame (block 331), regresses the match between the features and the map directly (block 332), then computes a pose based on the match (block 333), and outputs the computed pose (block 350).
Visual-based relocalization also needs a mapping procedure to generate a representation of real-world space. Such mapping methods are typically designed corresponding to the specific relocalization method being used. For example, the direct regression method in FIG. 3 requires a DNN training step in mapping. The match regression method also utilizes a learning process in mapping, which is not limited to DNNs. Mapping for the match and refine pipeline 330 usually uses a keyframe-based method. Popular keyframe methods include the enhanced hierarchical bag-of-word library (DBoW2) and randomized ferns. A mapping procedure is shown in FIG. 4. An electronic device may execute the mapping procedure. When mapping begins, for example, a frame 20 with one image 21 and one pose 22 is pre-processed (block 401) to extract sparse or dense features. Then a keyframe check is performed (block 402) to check whether the current frame 20 is eligible to become a new keyframe. If the current frame 20 is eligible to become a new keyframe, the frame 20 is added to and indexed in a keyframe database 30 (block 403). The keyframe database is used in a subsequent relocalization procedure to retrieve a most similar keyframe based on an input frame. If the current frame 20 is not eligible to become a new keyframe, the frame 20 is dropped (block 404).
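As an illustrative, non-limiting sketch of the mapping loop of FIG. 4, the following Python code pre-processes each incoming frame, performs the keyframe check, and either indexes the frame in the keyframe database or drops it. The helpers extract_features and is_new_keyframe, and the KeyframeDatabase layout, are hypothetical placeholders standing in for whichever keyframe selection method (e.g., randomized ferns or DBoW2) is actually used.

```python
class KeyframeDatabase:
    """Minimal in-memory keyframe store; a real system would also build a retrieval index."""
    def __init__(self):
        self.keyframes = []

    def add(self, features, pose, point_cloud):
        self.keyframes.append({"features": features, "pose": pose, "cloud": point_cloud})

def map_frames(frames, extract_features, is_new_keyframe):
    """frames: iterable of (image, pose, depth_point_cloud) tuples from the mapping session."""
    db = KeyframeDatabase()
    for image, pose, cloud in frames:
        features = extract_features(image)            # block 401: pre-processing
        if is_new_keyframe(features, db.keyframes):   # block 402: keyframe check
            db.add(features, pose, cloud)             # block 403: add and index, with its point cloud
        # otherwise the frame is dropped (block 404)
    return db
```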
Although many relocalization methods have been proposed, many of them face significant challenges in AR applications. The first challenge is long-term persistence, which means the virtual objects should persist for a long period of time. In indoor scenes, the environment can constantly change. For example, chairs could be moved, a cup could be left at different places, and a bedsheet could be changed from time to time. Outdoor scenes suffer from lighting, occlusion, and seasonal changes. A naive solution may be to keep updating the map, which is infeasible in most cases. The second challenge is the limited computing power of most AR mobile devices, which necessitates an efficient relocalization solution. The third challenge is that multi-user AR applications, especially in indoor scenes, require high relocalization precision for good user experiences.
To overcome these challenges, the invention utilizes both an RGB/monochrome camera and a depth camera. Unlike other RGB and depth (RGBD) relocalization methods, the invention also uses external visual-inertial odometry (VIO) output that is available on most AR devices. The VIO output comprises poses of the devices. VIO is the process of determining a position and an orientation of a device by analyzing an associated image and inertial measurement unit (IMU) data. The invention provides both mapping and relocalization enhanced with VIO and is efficient, decoupled from the SLAM procedure, very flexible to deploy, and requires no training process. VIO uses both an RGB/monochrome camera and an IMU that provides external odometry. In other words, the invention ultimately uses data from an RGB/monochrome camera, an IMU, and a depth camera. By using heterogeneous sensor data as input, the proposed method can increase the precision of relocalization. Furthermore, the invention utilizes a sequence of images as input and can provide long-term persistence. For example, n frames of sensory data are utilized for relocalization. If a visual change of the environment happens after the mapping procedure and affects only a small fraction of the frames, the disclosed method can still pick the unchanged frames from the sequence of n frames to perform the relocalization. Compared with single frame-based relocalization, the proposed relocalization method is sequence-based and can have more robust performance when visual change exists in long-term persistence.
The invention requires an RGB/monochrome image, a depth image, and external odometry data for each frame and combines a data sequence of query frames as input. Note that the invention provides an embodiment of the match and refine method, and does not rely on any specific keyframe selection and retrieval model. FIG. 5 shows a mapping pipeline of the disclosed method. Any current RGB/monochrome keyframe selection method can be used in the invention. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding", IEEE transactions on visualization and computer graphics 21, no. 5 (2014): 571-583. Another keyframe selection method is disclosed by Gálvez-López, D., and J. D. Tardós in an article titled "DBoW2: Enhanced hierarchical bag-of-word library for C++". A keyframe is a frame that can represent significant information in the mapping. As shown in FIG. 4 and FIG. 5, each frame is checked as to whether the frame is qualified to be a keyframe or not. If the frame is qualified to be a keyframe, the keyframe is stored in the keyframe database. A query frame is a special keyframe during relocalization, whose selection criteria are quite different from those of a keyframe in the mapping procedure.
If the current frame 20 is eligible to become a new keyframe, the frame 20 is added to and indexed in the keyframe database 30. In addition to the keyframes, a 3D point cloud 23 is also recorded as a depth image for a keyframe (block 403'), and thus each keyframe has a 3D point cloud 23 recorded as a depth image of the keyframe. A point cloud may be generated from a depth camera. Therefore, a sequence of 3D point clouds is constructed, and may be combined into one 3D map point cloud.
A relocalization procedure may be executed in a later AR session on the same device or on a different user’s device. For example, the visual-based relocalization method of the disclosure is executed by the device 10. The visual-based relocalization method comprises selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames. The sequence of the input frames is obtained from different view angles. Each input frame in the sequence of the input frames comprises a color space image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated with a plurality of keyframes in a three dimensional (3D) map. The plurality of keyframes comprises k nearest keyframes relative to the current frame, where k is a positive integer. The point cloud registration of the current frame may comprise an iterative closest point (ICP) algorithm applied to the current frame. The device refines estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames. The external poses are obtained from external odometry.
An embodiment of the relocalization method of the disclosure includes a brief pre-processing, and two stages to estimate the 6DoF pose. The two stages comprise a first stage 620 for sequence generation and a second stage 630 for pose refinement. FIG. 6 shows an entire relocalization pipeline. The device 10 may execute the disclosed visual-based relocalization method to realize the relocalization pipeline.
For example, a frame 20 comprises a color space image 21, a depth image 23, and an odometry pose 24. The color space image may comprise an RGB or monochrome image obtained from a camera. The depth image 23 may be obtained from a depth camera. The odometry pose may be obtained from external odometry. The frame 20 is processed as a current frame for preprocessing, the first stage for sequence generation, and the second stage for pose refinement. The invention introduces a new pipeline that incorporates color space images, depth images, and external odometry to estimate the relocalization. Additionally, the invention proposes a method to generate a multi-modal sequence to reduce false relocalization. Further, the visual-based relocalization method with sequence-based pose refinement is proposed to improve the relocalization precision.
As shown in FIG. 6, the device 10 obtains one or more frames for the disclosed relocalization method. Among the one or more frames, one frame is selected as the current frame 20 and comprises the color space image 21, the depth image 23, and one 6DoF pose 24 from external odometry. All of the color space images, depth images, and 6DoF poses are synchronized. In pre-processing of the current frame 20 (block 610), the color space image 21, depth image 23, and odometry pose 24 are registered to the same reference frame of an RGB/monochrome camera, such as the camera 14a or 14b shown in FIG. 2, using extrinsic parameters that can be obtained via a calibration process. The extrinsic parameters refer to a transformation matrix between a monochrome/RGB camera and a depth camera. For example, pinhole camera parameters are represented in a 4-by-3 matrix called the camera matrix. This matrix maps the 3-D world scene into an image plane. The calibration algorithm calculates the camera matrix using the extrinsic and intrinsic parameters. The extrinsic parameters represent the location of the camera in the 3-D scene. The intrinsic parameters represent the optical center and focal length of the camera. Pre-processing the one or more frames outputs a sequence of frames including images with depth information and poses, which are passed to the first stage 620 for sequence generation.
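One possible reading of this pre-processing step is sketched below: a depth image is back-projected into a 3D point cloud using the depth camera intrinsics and then mapped into the RGB/monochrome camera frame using the depth-to-color extrinsic matrix. The intrinsic values and the extrinsic matrix shown are illustrative assumptions, not calibration data from this disclosure.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (metres) into 3D points in the depth camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # keep valid depth readings only

def register_to_color(points_depth_frame, T_color_from_depth):
    """Apply the extrinsic transform so depth points live in the color camera frame."""
    homog = np.hstack([points_depth_frame, np.ones((len(points_depth_frame), 1))])
    return (T_color_from_depth @ homog.T).T[:, :3]

# Illustrative use with assumed intrinsics and an assumed extrinsic calibration matrix.
depth_image = np.full((480, 640), 1.5)  # synthetic flat depth for the example
points = depth_to_points(depth_image, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
T_color_from_depth = np.eye(4)
T_color_from_depth[0, 3] = 0.025        # e.g., an assumed 25 mm depth-to-color baseline
points_in_color = register_to_color(points, T_color_from_depth)
```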
The 1st stage for sequence generation:
The 1st stage for sequence generation is a sequence generation stage configured to select and store a sequence of frames that are different frames captured from different view angles. Each of the selected frames has a high probability of producing a correct pose estimate. Note that frames selected from a plurality of input frames in this stage are not the same as the keyframes that are stored for mapping and retrieval, because the frames input to this stage are captured at a different time or from a different device. A selected frame in this stage is named a query frame. A query frame needs to have a different view angle from all other query frames in the stored sequence and has the potential to produce the correct pose estimate. The first stage has four main steps as shown in FIG. 6.
Pose check:
The first step in the stage is the pose check (block 621). This step makes sure a new query frame is from a different view angle than previous query frames already added in the sequence. The device compares a pose of the current frame 20 with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different from the stored query frame when the sequence of query frames is not empty and has another query frame other than the current frame. If no query frame is in the sequence, this pose check step is omitted. The device 10 uses a pose from external odometry associated with the current frame 20 to check whether the current frame 20 has enough view angle difference from previous query frames. The pose of the current frame 20 is compared with one or more last query frames in the sequence. When comparing a pose of the current frame 20 with a pose of one stored query frame in the sequence, if the Euclidean distance between the two compared poses is larger than a threshold δ_trans or the angle difference between the two compared poses is larger than a threshold δ_rot, the current frame 20 is selected for further processing in the next step. If the Euclidean distance between the two compared poses is not larger than the threshold δ_trans and the angle difference is not larger than the threshold δ_rot, the device 10 determines the current frame is not a qualified query frame, and the current frame 20 is disregarded (block 625).
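A minimal sketch of this pose check is shown below, assuming 4-by-4 pose matrices from the external odometry; the threshold values used for δ_trans and δ_rot are illustrative assumptions, as no specific values are prescribed here.

```python
import numpy as np

def pose_check(current_pose, stored_query_poses, delta_trans=0.10, delta_rot_deg=10.0):
    """Return True when the current frame's view angle differs enough from the stored query frames."""
    if not stored_query_poses:  # empty sequence: the pose check is omitted
        return True
    for stored_pose in stored_query_poses:
        translation_dist = np.linalg.norm(current_pose[:3, 3] - stored_pose[:3, 3])
        # Relative rotation angle between the two poses.
        R_rel = stored_pose[:3, :3].T @ current_pose[:3, :3]
        angle = np.degrees(np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)))
        if translation_dist <= delta_trans and angle <= delta_rot_deg:
            return False  # too similar to an existing query frame; disregard (block 625)
    return True
```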
Single frame relocalization:
This second step is relocalization using a single frame (block 622) . The device 10 performs the evaluation of depth-image-based single frame relocalization on the current frame 20. Specifically, (1) feature extraction for the current frame 20 is performed depending on what keyframe selection method has been used during mapping. For example, a keyframe selection method is disclosed by Glocker, Ben, Jamie Shotton, Antonio Criminisi, and Shahram Izadi in an article titled "Real-time RGB-D camera relocalization via randomized ferns for keyframe encoding" of IEEE transactions on visualization and computer graphics 21, no. 5 (2014) : 571-583. Another keyframe selection method is disclosed by Gálvez-López, D., and J. D. Tardós in an article titled "DBoW2: Enhanced hierarchical bag-of-word library for C++" (2012) .
Then the device 10 searches for the k nearest keyframes using k nearest neighbors (kNN) from the keyframe database, where k is a positive integer. The distance measurement for kNN is defined based on the feature as well. For example, if randomized ferns are used as features of frames, then a distance is computed as a Hamming distance between ferns of the current frame 20 and one of the k nearest frames. An ORB-based feature extraction for frames is disclosed by Rublee, Ethan, Vincent Rabaud, Kurt Konolige, and Gary Bradski in an article titled "ORB: An efficient alternative to SIFT or SURF" in 2011 IEEE International Conference on Computer Vision, pp. 2564-2571. If a sparse feature such as ORB is used as the feature of frames, the distance can be computed as a Hamming distance between an ORB descriptor of the current frame 20 and an ORB descriptor of one of the k nearest frames.
(2) The k nearest keyframes provide k initial poses for the current frame. The k poses are associated with the k nearest keyframes and are prestored in the keyframe database during the mapping procedure. The device 10 then performs an iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and a 3D point cloud associated with each of the nearest keyframes to refine the k poses. Thus, k refined poses associated with the k nearest keyframes are generated.
(3) Among all the k refined poses, the one with the least inlier RMSE (Root Mean Square Error) and largest inlier percentage is selected as an estimated pose of the current frame 20 for the next stage. The device 10 computes an inlier RMSE inlier_rmse of a specific pose among the k poses for a specific keyframe among the k keyframes associated with the specific pose as:
inlier_rmse = sqrt( (1 / |I|) · Σ_{(p, q) ∈ I} ‖p − q‖² ),
where p represents a 3D point in a point cloud of the current frame, q represents the corresponding 3D point in a point cloud of the specific keyframe, I represents the set of inlier point correspondences established by the ICP, and ‖p − q‖ represents an operation that outputs the Euclidean norm of the difference between p and q.
An inlier percentage of the specific pose is the percentage of one or more inlier points among all 3D points in the current frame 20. The one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP. The k refined poses are associated with k inlier RMSEs and k inlier percentages. The device 10 selects one of the k refined poses with the least inlier root mean square error (RMSE) and the largest inlier percentage to form an estimated pose of the current frame.
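The retrieval and scoring logic of this single frame relocalization step may be sketched as follows. The Hamming-distance retrieval, the run_icp helper, and the keyframe record layout are illustrative assumptions standing in for the actual fern/ORB matching and ICP implementation.

```python
import numpy as np

def hamming_distance(a, b):
    """Bit-level Hamming distance between two binary descriptors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

def relocalize_single_frame(frame_desc, frame_cloud, keyframe_db, run_icp, k=5):
    """Return (refined_pose, inlier_rmse, inlier_pct) for one frame, or None if the database is empty.

    keyframe_db: list of dicts with 'desc', 'pose' (4x4), 'cloud' (Nx3), built during mapping.
    run_icp(source_cloud, target_cloud, init_pose) -> (refined_pose, inlier_rmse, inlier_pct):
        a placeholder for any point-to-point ICP routine.
    """
    if not keyframe_db:
        return None
    # (1) k nearest keyframes under descriptor Hamming distance.
    dists = [hamming_distance(frame_desc, kf["desc"]) for kf in keyframe_db]
    nearest = [keyframe_db[i] for i in np.argsort(dists)[:k]]
    # (2) refine each retrieved keyframe's stored pose with ICP against the current frame's cloud.
    candidates = [run_icp(frame_cloud, kf["cloud"], kf["pose"]) for kf in nearest]
    # (3) keep the refined pose with the smallest inlier RMSE, breaking ties by inlier percentage.
    best = min(candidates, key=lambda c: (c[1], -c[2]))
    return best
```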
ICP metric check:
The third step is the ICP metric check (block 623). In the single frame relocalization, ICP is utilized to transform points; in the ICP metric check, the ICP results are used to double-check those points. The ICP metric is a combination of the inlier RMSE and the inlier percentage. The ICP metric check uses the inlier percentage and the inlier RMSE to determine whether a frame qualifies as a query frame. In the ICP metric check, if the current frame has a selected pose with an inlier RMSE below a threshold δ_rmse and an inlier percentage higher than a certain threshold δ_per, the current frame 20 becomes a query frame and is added into the sequence of query frames (block 624). Otherwise, the current frame is disregarded (block 625), and the process continues to the next frame.
Two main conditions may lead to high inlier RMSE:
1) the current frame 20 includes a region that has not been mapped by the mapping procedure;
2) the current frame 20 includes a region that has been mapped, but keyframes retrieval fails to find a good match.
In this case, the initial pose for ICP may be too far away from the ground truth pose. The first condition should be avoided. If the current frame includes a region that has not been mapped, then relocalization shall not be performed at all. The current frame including such a region is referred to as an out-of-map frame. Unless an out-of-map frame happens to have a similar appearance and similar geometry to some keyframe in the map, its inlier RMSE will be high. The thresholds δ_rmse and δ_per can be set empirically, but may be different depending on the depth camera parameters and the mapping scenario. A process that finds the optimal thresholds δ_rmse and δ_per can be performed after a mapping procedure. The device 10 can use keyframes in the map as input to perform single frame relocalization. Single frame relocalization is a process that determines a pose of the frame with regard to the map. In the keyframe database, each keyframe is stored with a camera pose. Such a pose is computed in the mapping stage, and can be called a “ground truth pose”. A mapping process selects a set of keyframes and computes poses of the selected keyframes. These poses are considered to be ground truth in this step. Since the ground truth pose is known for each keyframe, the result of relocalization can be decided. Since the relocalization is successfully completed when the estimated pose has translation and rotation errors smaller than thresholds, the query frame selection can be regarded as a classification problem using the ICP metric as features. The ICP metric may comprise the inlier RMSE and inlier percentage related measurements. Then, such ICP metric parameters can be processed with machine learning, such as simple decision tree training, to avoid most negative cases.
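Treating query frame selection as a classification problem over the ICP metric, as described above, may for example be sketched with a shallow decision tree trained on (inlier RMSE, inlier percentage) pairs collected by relocalizing known keyframes against the map; the training data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row is (inlier_rmse, inlier_percentage) measured when relocalizing a mapped keyframe,
# and the label is 1 when the estimated pose fell within the translation/rotation error bounds.
X = np.array([[0.01, 0.92], [0.02, 0.85], [0.08, 0.40], [0.12, 0.25],
              [0.015, 0.88], [0.09, 0.35], [0.03, 0.70], [0.11, 0.20]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)  # a simple tree in place of hand-tuned thresholds

def passes_icp_metric_check(inlier_rmse, inlier_pct):
    """Predict whether a frame with this ICP metric is likely to be a usable query frame."""
    return bool(clf.predict([[inlier_rmse, inlier_pct]])[0])
```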
The device 10 selects and adds the current frame 20 as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame 20 is below an RMSE threshold δ_rmse, and the inlier percentage of the selected refined pose of the current frame 20 is higher than a certain percentage threshold δ_per. The estimated pose of the selected current frame 20 is obtained as one of the estimated poses of the query frames as a consequence of the selecting of the current frame 20. When the query frame is added to the sequence, the device 10 also stores a corresponding point cloud from the depth image associated with the query frame. The point cloud might be downsampled for efficiency. The device 10 may use the point cloud for pose refinement. The process may be repeated for each of the input frames to generate a plurality of query frames and the estimated poses of the query frames.
The 2nd stage for pose refinement:
The pose refinement stage uses a refined subset of frames in the query frame sequence to refine the estimated poses of the query frames (block 631). This stage starts when the number of query frames is larger than a threshold N_seq. Although all query frames meet the ICP metric in the first stage, not all of them are used for final pose refinement due to errors in pose estimation or in ICP. For example, since a desktop in a room has a planar surface that can be similar to the planar surface of the ground, a point cloud of the desktop may incorrectly match that of the ground. The goal of the second stage is to select enough inlier frames from the query frames. Note that here an inlier means a frame, rather than a point as in ICP. A random sample consensus (RANSAC)-like method may be used to select inliers. The algorithm for the second stage is shown in Table 1:
Table 1
[Algorithm 1, the sequence-based pose refinement algorithm, is reproduced as an image in the original publication; its lines 1 to 4 are described in the following paragraphs.]
In this pose refinement procedure, the input of the 2nd stage is all query frames in the sequence, with external poses from odometry and estimated poses with regard to the map from the sequence generation stage. External poses are generated from external odometry. Estimated poses are generated from the relocalization process. As shown in lines 1 and 2 of Algorithm 1, the device 10 transforms all point clouds from all of the query frames to a reference coordinate frame of the 3D map using the estimated poses of the query frames. Any map has an origin and directions of x, y, and z axes. The coordinate system of a map is referred to as a reference coordinate frame. A reference coordinate frame is not a frame in the sequence.
As shown in line 3 of Algorithm 1, the device 10 computes a Euclidean RMSE between each of the transformed point clouds of the query frames and the point clouds of the reference coordinate frame in the 3D map. As shown in line 4 of Algorithm 1, the device 10 evaluates the computed Euclidean RMSEs associated with the query frames to generate a plurality of inlier frames, wherein a frame i in the sequence of query frames is determined as an inlier frame when the computed Euclidean RMSE of the frame i is smaller than a threshold δ_rmse. The device 10 combines point clouds from all inlier frames and refines the estimated poses of the inlier frames to generate refined estimated poses using ICP. The device 10 may use the refined estimated poses to improve visual-based relocalization. For example, the device 10 may use the refined estimated poses to better relocate an AR session, AR content, or a virtual object. After relocalization is done, a virtual object can be placed into the scene. With reference to FIG. 6, in the 2nd stage, from all the estimated poses, the device 10 selects a frame i whose estimated pose P_i^est is good enough. To do this, for each estimated pose P_i^est, the device 10 transforms all point clouds from all query frames to the reference coordinate frame of the map using the estimated pose P_i^est, as shown in line 2 of Algorithm 1. The frame i has an estimated pose P_i^est and an external pose P_i^ext. A point cloud PC_j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by the transformation T_j = P_i^est · (P_i^ext)^(-1) · P_j^ext, where j = 1..n runs over all the frames in the sequence. Basically, the algorithm processes each frame (as shown in line 1, for i = 0..n); for each frame, it uses the current frame i as the reference and warps all the frames in the sequence using the criterion in line 2.
Then, a Euclidean RMSE is computed between points in a point cloud PC_seq, formed from all the frames in the sequence warped using the pose of the frame i, and points in a point cloud PC_map of the map. If the Euclidean RMSE is smaller than a threshold δ_rmse, the frame i is treated as an inlier. When the number of inliers is large enough, such as a number greater than n/2, all the inlier frames are saved as elements in the refined subset. In one embodiment, once such an inlier frame is found, the device 10 returns the inlier frame and the transformation applied to the inlier as the output of the 2nd stage. Each inlier frame in the output of the 2nd stage is associated with the estimated pose P_i^est, an external pose P_i^ext, and the transformation T_j = P_i^est · (P_i^ext)^(-1) · P_j^ext for all j in 1..n in the sequence. The variable i is a selected frame index for pose initialization, and the variable j is a frame index running from 1 to n. (P_i^ext)^(-1) is the reverse (inverse rotation and translation) of P_i^ext. This early return strategy reduces the computational cost of Algorithm 1. In an alternative embodiment, after all the query frames with estimated poses are evaluated, a frame that has the largest number of inliers is selected and saved as an element in the refined subset. A smaller RMSE breaks a tie: if two frames have the same number of inliers, an embodiment of the disclosed method prefers the frame with the smaller RMSE when selecting frames for the refined subset. The device 10 combines point clouds from all inlier frames, refines the estimated pose using ICP, and outputs the refined estimated pose P_final as a portion of the output of the 2nd stage.
The device 10 determines whether the pose refinement is successful (block 632). When the pose refinement is successful, the estimated pose with the smallest mean RMSE and the inliers associated with that estimated pose are also stored as the refined estimated pose P_final (block 634). After processing all frames, if the device 10 cannot find an estimated pose with enough inliers, the device 10 removes the frames that are outliers of the estimated pose with the smallest mean RMSE, and repeats the 1st stage and the 2nd stage for other input frames. The device 10 processes a new frame as a current frame in the 1st stage and the 2nd stage until the refined subset has enough frames.
Removal of outliers happens when no frame that satisfies the criteria can be obtained by the 2nd stage after processing the sequence of N frames. Outlier removal trims the sequence of N frames slightly. The sequence is thereby shortened, and the 2nd stage waits for the sequence to grow back to N frames again.
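A rough, non-authoritative sketch of the second stage (Algorithm 1) follows: each query frame i is tried in turn as the reference, all point clouds are warped into the map frame through the estimated pose of frame i and the relative external odometry, frames whose warped clouds fit the map within δ_rmse are counted as inliers, and the stage returns early once enough inlier frames are found. The transform composition and the point_cloud_rmse helper are written under assumption from the description above, and the result would be further refined with ICP in practice.

```python
import numpy as np

def point_cloud_rmse(cloud_a, cloud_b):
    """Placeholder Euclidean RMSE between cloud_a and its nearest neighbours in cloud_b."""
    # Brute-force nearest-neighbour search; a KD-tree would be used in practice.
    d = np.linalg.norm(cloud_a[:, None, :] - cloud_b[None, :, :], axis=-1).min(axis=1)
    return float(np.sqrt((d ** 2).mean()))

def refine_pose(query_frames, map_cloud, delta_rmse=0.05):
    """query_frames: list of dicts with 'cloud' (Nx3), 'est_pose' and 'ext_pose' (4x4 matrices)."""
    n = len(query_frames)
    for i, ref in enumerate(query_frames):          # line 1: try each frame i as the reference
        warp_i = ref["est_pose"] @ np.linalg.inv(ref["ext_pose"])
        inliers = []
        for j, frame in enumerate(query_frames):    # line 2: warp every frame j into the map frame
            T_j = warp_i @ frame["ext_pose"]
            homog = np.hstack([frame["cloud"], np.ones((len(frame["cloud"]), 1))])
            warped = (T_j @ homog.T).T[:, :3]
            if point_cloud_rmse(warped, map_cloud) < delta_rmse:  # lines 3-4: RMSE inlier test
                inliers.append(j)
        if len(inliers) > n // 2:                   # enough inlier frames: early return
            return ref["est_pose"], inliers          # refined further with ICP in practice
    return None, []                                  # no reference frame had enough inliers
```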
The proposed method utilizes RGB/monochrome images, depth images and external odometry as input to realize visual-based relocalization. The method adopts a traditional pipeline. The computation is fast and suitable for mobile AR devices. Sequence-based relocalization can achieve higher precision than the single frame method. This method is also robust against visual change in the environment since a sequence is taken as the input instead of a single frame.
FIG. 7 is a block diagram of an example system 700 for the disclosed visual-based relocalization method according to an embodiment of the present disclosure. Embodiments described herein may be implemented into the system using any suitably configured hardware and/or software. FIG. 7 illustrates the system 700 including a radio frequency (RF) circuitry 710, a baseband circuitry 720, a processing unit 730, a memory/storage 740, a display 750, a camera module 760, a sensor 770, and an input/output (I/O) interface 780, coupled with each other as illustrated.
The processing unit 730 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processors may include any combinations of general-purpose processors and dedicated processors, such as graphics processors and application processors. The processors may be coupled with the memory/storage and configured to execute instructions stored in the memory/storage to enable various applications and/or operating systems running on the system.
The baseband circuitry 720 may include circuitry, such as, but not limited to, one or more single-core or multi-core processors. The processors may include a baseband processor. The baseband circuitry may handle various radio control functions that enable communication with one or more radio networks via the RF circuitry. The radio control functions may include, but are not limited to, signal modulation, encoding, decoding, radio frequency shifting, etc. In some embodiments, the baseband circuitry may provide for communication compatible with one or more radio technologies. For example, in some embodiments, the baseband circuitry may support communication with 5G NR, LTE, an evolved universal terrestrial radio access network (EUTRAN) and/or other wireless metropolitan area networks (WMAN) , a wireless local area network (WLAN) , a wireless personal area network (WPAN) . Embodiments in which the baseband circuitry is configured to support radio communications of more than one wireless protocol may be referred to as multi-mode baseband circuitry. In various embodiments, the baseband circuitry 720 may include circuitry to operate with signals that are not strictly considered as being in a baseband frequency. For example, in  some embodiments, baseband circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.
The RF circuitry 710 may enable communication with wireless networks using modulated electromagnetic radiation through a non-solid medium. In various embodiments, the RF circuitry may include switches, filters, amplifiers, etc. to facilitate communication with the wireless network. In various embodiments, the RF circuitry 710 may include circuitry to operate with signals that are not strictly considered as being in a radio frequency. For example, in some embodiments, RF circuitry may include circuitry to operate with signals having an intermediate frequency, which is between a baseband frequency and a radio frequency.
In various embodiments, the transmitter circuitry, control circuitry, or receiver circuitry discussed above with respect to the UE, eNB, or gNB may be embodied in whole or in part in one or more of the RF circuitries, the baseband circuitry, and/or the processing unit. As used herein, “circuitry” may refer to, be part of, or include an Application Specific Integrated Circuit (ASIC) , an electronic circuit, a processor (shared, dedicated, or group) , and/or memory (shared, dedicated, or group) that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality. In some embodiments, the electronic device circuitry may be implemented in, or functions associated with the circuitry may be implemented by, one or more software or firmware modules. In some embodiments, some or all of the constituent components of the baseband circuitry, the processing unit, and/or the memory/storage may be implemented together on a system on a chip (SOC) .
The memory/storage 740 may be used to load and store data and/or instructions, for example, for the system. The memory/storage for one embodiment may include any combination of suitable volatile memory, such as dynamic random access memory (DRAM) , and/or non-volatile memory, such as flash memory. In various embodiments, the I/O interface 780 may include one or more user interfaces designed to enable user interaction with the system and/or peripheral component interfaces designed to enable peripheral component interaction with the system. User interfaces may include, but are not limited to a physical keyboard or keypad, a touchpad, a speaker, a microphone, etc. Peripheral component interfaces may include, but are not limited to, a non-volatile memory port, a universal serial bus (USB) port, an audio jack, and a power supply interface.
The camera module 760 may comprise a color space image camera and a depth camera, such as the  depth camera  15a or 15b. The color space image camera is configured to capture a sequence of input frames, wherein each of the input frames comprises a color space image. The depth camera is configured to capture a depth image that is associated with the color space image in each frame.
The sensor 770 is configured to provide external odometry that is associated with the color space image in each frame. In various embodiments, the sensor 770 may include one or more sensing devices to determine environmental conditions and/or location information related to the system. In some embodiments, the sensors may include, but are not limited to, an IMU, a gyro sensor, an accelerometer, a proximity sensor, an ambient light sensor, and a positioning unit. The positioning unit may also be part of, or interact with, the baseband circuitry and/or RF circuitry to communicate with components of a positioning network, e.g., a global positioning system (GPS) satellite. In various embodiments, the display 750 may include a display, such as a liquid crystal display and a touch screen display. In various embodiments, the system 700 may be a mobile computing device such as, but not limited to, a laptop computing device, a tablet computing device, a netbook, an ultrabook, a smartphone, etc. In various embodiments, the system may have more or less components, and/or different architectures. Where appropriate, the  methods described herein may be implemented as a computer program. The computer program may be stored on a storage medium, such as a non-transitory storage medium.
The embodiment of the present disclosure is a combination of techniques/processes that can be adopted to create an end product.
A person having ordinary skill in the art understands that each of the units, algorithm, and steps described and disclosed in the embodiments of the present disclosure are realized using electronic hardware or combinations of software for computers and electronic hardware. Whether the functions run in hardware or software depends on the condition of the application and design requirement for a technical plan. A person having ordinary skill in the art can use different ways to realize the function for each specific application while such realizations should not go beyond the scope of the present disclosure. It is understood by a person having ordinary skill in the art that he/she can refer to the working processes of the system, device, and unit in the above-mentioned embodiment since the working processes of the above-mentioned system, device, and unit are basically the same. For easy description and simplicity, these working processes will not be detailed.
It is understood that the disclosed system, device, and method in the embodiments of the present disclosure can be realized in other ways. The above-mentioned embodiments are exemplary only. The division of the units is merely based on logical functions while other divisions exist in realization. It is possible that a plurality of units or components are combined or integrated into another system. It is also possible that some characteristics are omitted or skipped. On the other hand, the displayed or discussed mutual coupling, direct coupling, or communicative coupling operate through some ports, devices, or units whether indirectly or communicatively by ways of electrical, mechanical, or other kinds of forms.
The units as separating components for explanation are or are not physically separated. The units for display are or are not physical units, that is, located in one place or distributed on a plurality of network units. Some or all of the units are used according to the purposes of the embodiments. Moreover, each of the functional units in each of the embodiments can be integrated into one processing unit, physically independent, or integrated into one processing unit with two or more than two units.
If the software function unit is realized and used and sold as a product, it can be stored in a readable storage medium in a computer. Based on this understanding, the technical plan proposed by the present disclosure can be essentially or partially realized as the form of a software product. Or, one part of the technical plan beneficial to the conventional technology can be realized as the form of a software product. The software product in the computer is stored in a storage medium, including a plurality of commands for a computational device (such as a personal computer, a server, or a network device) to run all or some of the steps disclosed by the embodiments of the present disclosure. The storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM) , a random access memory (RAM) , a floppy disk, or other kinds of media capable of storing program codes.
The proposed solution adopts a match and refine pipeline and includes two-stage processing to refine the pose. The first stage selects the query frames into the sequence. The second stage selects the inlier frames from the sequence. Finally, the inlier frames are used to refine the pose. The disclosed method achieves high relocalization precision while maintaining efficiency with low computation resources. Because of the sequence inlier selection, the invention can avoid the drawbacks of keyframe-based methods, including bad initialization and bad ICP caused by insufficient geometric details. Furthermore, the sequence takes inlier frames with good geometric fitting. When the sequence is long enough to cover static portions of a scene with no visual changes, the disclosed method can process scenes with visual changes.
While the present disclosure has been described in connection with what is considered the most practical and preferred embodiments, it is understood that the present disclosure is not limited to the disclosed embodiments but is intended to cover various arrangements made without departing from the scope of the broadest interpretation of the appended claims.

Claims (24)

  1. A visual-based relocalization method executable in an electronic device, comprising:
    selecting a sequence of query frames from a sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are different frames obtained from different view angles; and
    refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
  2. The visual-based relocalization method of claim 1, further comprising:
    comparing a pose of a current frame with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames has another query frame other than the current frame.
  3. The visual-based relocalization method of claim 2, further comprising:
    determining that the current frame represents a view angle sufficiently different than the stored query frame when a Euclidean distance between the pose of a current frame and the pose of at least one stored query frame is greater than a threshold; and
    performing the evaluation of depth-image-based single frame relocalization on the current frame representing a view angle sufficiently different than the stored query frame.
  4. The visual-based relocalization method of claim 1, wherein each input frame in the sequence of the input frames comprises an RGB image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated with a plurality of keyframes in a three dimensional (3D) map.
  5. The visual-based relocalization method of claim 4, wherein the plurality of keyframes comprises k nearest keyframes relative to the current frame, k is a positive integer, the point cloud registration of a current frame comprises an iterative closest point (ICP) algorithm applied to the current frame, and the method further comprises:
    providing k poses associated with the k nearest keyframes for the current frame; and
    performing an iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and a 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes.
  6. The visual-based relocalization method of claim 5, wherein an inlier RMSE of a specific pose among the k poses is computed for a specific keyframe among the k keyframes associated with the specific pose,
    an inlier percentage of the specific pose is a percentage of one or more inlier points in all 3D points in the current frame, the one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP, and the k refined poses are associated with k inlier RMSEs and k inlier percentages; and
    the method further comprises:
    selecting one of the k refined poses with a least inlier root mean square error (RMSE) and a largest inlier percentage to form an estimated pose of the current frame.
  7. The visual-based relocalization method of claim 6, wherein the method further comprises:
    selecting and adding the current frame as a query frame into the sequence of query frames when the inlier RMSE  of the selected refined pose of the current frame is below an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a certain percentage threshold, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.
  8. The visual-based relocalization method of claim 7, wherein the method further comprises:
    storing the depth image associated with the current frame that is added to the sequence of query frames.
  9. The visual-based relocalization method of claim 7, wherein the method further comprises:
    transforming all point clouds from all of the query frames to a reference coordinate frame of the 3D map using the estimated poses of the query frames;
    computing a Euclidean RMSE between each of the transformed point clouds of the query frames and the points of the reference coordinate frame in the 3D map;
    determining the computed Euclidean RMSEs of the query frames to generate a plurality of inlier frames, wherein an i-th frame in the sequence of query frames is determined as an inlier frame when a computed Euclidean RMSE of the i-th frame is smaller than a threshold δ_rmse; and
    combining point clouds from all inlier frames and refining the estimated poses of the inlier frames to generate refined estimated poses using ICP.
  10. The visual-based relocalization method of claim 9, wherein the i-th frame has an estimated pose P_i^est and an external pose P_i^ext, and a point cloud PC_j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by the transformation T_j = P_i^est · (P_i^ext)^(-1) · P_j^ext.
  11. An electronic device comprising:
    a camera configured to capture a sequence of input frames, wherein each of the input frames comprises an RGB image;
    a depth camera configured to capture a depth image that is associated with the RGB image;
    an inertial measurement unit configured to provide external odometry that is associated with the RGB image; and
    a processor configured to execute:
    selecting a sequence of query frames from the sequence of input frames based on evaluation of depth-image-based single frame relocalization associated with the sequence of the input frames, wherein the sequence of the input frames are different frames obtained from different view angles; and
    refining estimated poses associated with the sequence of query frames for visual-based relocalization using external poses associated with the sequence of query frames, wherein the external poses are obtained from external odometry.
  12. The electronic device of claim 11, wherein the processor is further configured to execute:
    comparing a pose of a current frame with a pose of at least one stored query frame in the sequence of query frames to determine whether the current frame represents a view angle sufficiently different than the stored query frame when the sequence of query frames has another query frame other than the current frame.
  13. The electronic device of claim 12, wherein the processor is further configured to execute:
    determining that the current frame represents a view angle sufficiently different than the stored query frame when a Euclidean distance between the pose of a current frame and the pose of at least one stored query frame is greater than a threshold; and
    performing the evaluation of depth-image-based single frame relocalization on the current frame representing a view angle sufficiently different than the stored query frame.
  14. The electronic device of claim 11, wherein each input frame in the sequence of the input frames comprises an RGB image associated with a depth image, and the evaluation of the depth-image-based single frame relocalization comprises evaluation of point cloud registration of a current frame in the sequence of the input frames using depth information of a depth image associated with the current frame and depth information of depth images associated with a plurality of keyframes in a three dimensional (3D) map.
  15. The electronic device of claim 14, wherein the plurality of keyframes comprises k nearest keyframes relative to the current frame, k is a positive integer, the point cloud registration of a current frame comprises an iterative closest point (ICP) algorithm applied to the current frame, and the processor is further configured to execute:
    providing k poses associated with the k nearest keyframes for the current frame; and
    performing an iterative closest point (ICP) algorithm between a 3D point cloud from the depth image associated with the current frame and a 3D point cloud associated with each of the k nearest keyframes to refine the k poses associated with the k nearest keyframes.
  16. The electronic device of claim 15, wherein an inlier RMSE of a specific pose among the k poses is computed for a specific keyframe among the k keyframes associated with the specific pose,
    an inlier percentage of the specific pose is a percentage of one or more inlier points in all 3D points in the current frame, the one or more inlier points are defined as those points of the current frame that are mapped to points of the specific keyframe in the 3D map during the ICP, and the k refined poses are associated with k inlier RMSEs and k inlier percentages; and
    the processor is further configured to execute:
    selecting one of the k refined poses with a least inlier root mean square error (RMSE) and a largest inlier percentage to form an estimated pose of the current frame.
  17. The electronic device of claim 16, wherein the processor is further configured to execute:
    selecting and adding the current frame as a query frame into the sequence of query frames when the inlier RMSE of the selected refined pose of the current frame is below an RMSE threshold, and the inlier percentage of the selected refined pose of the current frame is higher than a certain percentage threshold, wherein the estimated pose of the selected current frame is obtained as one of the estimated poses of the query frames.
  18. The electronic device of claim 17, wherein the processor is further configured to execute:
    storing the depth image associated with the current frame that is added to the sequence of query frames.
  19. The electronic device of claim 17, wherein the processor is further configured to execute:
    transforming all point clouds from all of the query frames to a reference coordinate frame of the 3D map using the estimated poses of the query frames;
    computing a Euclidean RMSE between each of the transformed point clouds of the query frames and the points of the reference coordinate frame in the 3D map;
    determining the computed Euclidean RMSEs of the query frames to generate a plurality of inlier frames, wherein an i-th frame in the sequence of query frames is determined as an inlier frame when a computed Euclidean RMSE of the i-th frame is smaller than a threshold δ_rmse; and
    combining point clouds from all inlier frames and refining the estimated poses of the inlier frames to generate refined estimated poses using ICP.
  20. The electronic device of claim 19, wherein the i-th frame has an estimated pose P_i^est and an external pose P_i^ext, and a point cloud PC_j of the j-th frame in the sequence of query frames is transformed to the reference coordinate frame by the transformation T_j = P_i^est · (P_i^ext)^(-1) · P_j^ext.
  21. A chip, comprising:
    a processor, configured to call and run a computer program stored in a memory, to cause a device in which the chip is installed to execute any of the methods of claims 1 to 10.
  22. A computer readable storage medium, in which a computer program is stored, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.
  23. A computer program product, comprising a computer program, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.
  24. A computer program, wherein the computer program causes a computer to execute any of the methods of claims 1 to 10.
PCT/CN2021/098096 2020-06-03 2021-06-03 Visual-based relocalization method, and electronic device WO2021244604A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180032534.0A CN115516524A (en) 2020-06-03 2021-06-03 Vision-based repositioning method and electronic equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063034270P 2020-06-03 2020-06-03
US63/034,270 2020-06-03

Publications (1)

Publication Number Publication Date
WO2021244604A1 true WO2021244604A1 (en) 2021-12-09

Family

ID=78830125

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098096 WO2021244604A1 (en) 2020-06-03 2021-06-03 Visual-based relocalization method, and electronic device

Country Status (2)

Country Link
CN (1) CN115516524A (en)
WO (1) WO2021244604A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017164479A1 (en) * 2016-03-25 2017-09-28 Samsung Electronics Co., Ltd. A device and method for determining a pose of a camera
US20200042278A1 (en) * 2017-03-30 2020-02-06 Microsoft Technology Licensing, Llc Sharing neighboring map data across devices
US20190080190A1 (en) * 2017-09-14 2019-03-14 Ncku Research And Development Foundation System and method of selecting a keyframe for iterative closest point
WO2020005635A1 (en) * 2018-06-25 2020-01-02 Microsoft Technology Licensing, Llc Object-based localization
US20200051328A1 (en) * 2018-08-13 2020-02-13 Magic Leap, Inc. Cross reality system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GLOCKER, BEN ET AL.: "Real-Time RGB-D Camera Relocalization via Randomized Ferns for Keyframe Encoding", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 21, no. 5, 31 May 2015 (2015-05-31), XP011576671, DOI: 10.1109/TVCG.2014.2360403 *

Also Published As

Publication number Publication date
CN115516524A (en) 2022-12-23

Similar Documents

Publication Publication Date Title
US9406137B2 (en) Robust tracking using point and line features
US11102398B2 (en) Distributing processing for imaging processing
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN110458805B (en) Plane detection method, computing device and circuit system
JP6858650B2 (en) Image registration method and system
US9727775B2 (en) Method and system of curved object recognition using image matching for image processing
TWI808987B (en) Apparatus and method of five dimensional (5d) video stabilization with camera and gyroscope fusion
BR102018075714A2 (en) Recurring Semantic Segmentation Method and System for Image Processing
US20240112035A1 (en) 3d object recognition using 3d convolutional neural network with depth based multi-scale filters
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
US10217221B2 (en) Place recognition algorithm
US20140146136A1 (en) Image depth perception device
US11526704B2 (en) Method and system of neural network object recognition for image processing
AU2013237718A1 (en) Method, apparatus and system for selecting a frame
US11527014B2 (en) Methods and systems for calibrating surface data capture devices
WO2021147113A1 (en) Plane semantic category identification method and image data processing apparatus
CN113450392A (en) Robust surface registration based on parametric perspective of image templates
Wang et al. Salient video object detection using a virtual border and guided filter
CN116129228B (en) Training method of image matching model, image matching method and device thereof
WO2021244604A1 (en) Visual-based relocalization method, and electronic device
WO2022016803A1 (en) Visual positioning method and apparatus, electronic device, and computer readable storage medium
CN116630355B (en) Video segmentation method, electronic device, storage medium and program product
KR102605451B1 (en) Electronic device and method for providing multiple services respectively corresponding to multiple external objects included in image
WO2021114871A1 (en) Parallax determination method, electronic device, and computer-readable storage medium
Wang et al. Sequence-Based Indoor Relocalization for Mobile Augmented Reality

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21817094; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21817094; Country of ref document: EP; Kind code of ref document: A1)