WO2023195982A1 - Keyframe downsampling for memory usage reduction in SLAM - Google Patents
- Publication number
- WO2023195982A1 (PCT/US2022/023667)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Definitions
- This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for selectively storing keyframes for localizing a camera and mapping an environment in an extended reality application.
- Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation.
- high frequency pose estimation is enabled by sensor fusion.
- Asynchronous time warping is often applied with SLAM in an AR system to warp an image before it is sent to a display, correcting for head movement that occurs after the image is rendered.
- relevant image data and inertial sensor data are synchronized and used for estimating and predicting camera poses.
- many SLAM systems suppress an accumulated error of the inertial sensor data by detecting corner points and extracting image descriptors on a keyframe image.
- Each keyframe image is associated with a descriptor derived using a Bag of Visual Words data structure, and a corresponding descriptor of a new image is compared with those of existing keyframe images to localize the new image.
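For illustration only, the following C++ sketch shows the kind of descriptor comparison described above: a new image's Bag-of-Visual-Words histogram is scanned against stored keyframe histograms to find the best match. The cosine similarity measure, the container layout, and the function names are assumptions of this sketch, not details from this application.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using BowDescriptor = std::vector<float>;  // one weight per visual word

// Cosine similarity between two visual-word histograms (illustrative choice).
float cosineSimilarity(const BowDescriptor& a, const BowDescriptor& b) {
  float dot = 0.f, na = 0.f, nb = 0.f;
  for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
    dot += a[i] * b[i];
    na  += a[i] * a[i];
    nb  += b[i] * b[i];
  }
  return (na > 0.f && nb > 0.f) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.f;
}

// Linear scan over stored keyframe descriptors; the cost grows with the number
// of keyframes, which is what motivates downsampling them.
int bestMatchingKeyframe(const BowDescriptor& query,
                         const std::vector<BowDescriptor>& keyframeDescriptors) {
  int best = -1;
  float bestScore = -1.f;
  for (std::size_t i = 0; i < keyframeDescriptors.size(); ++i) {
    const float score = cosineSimilarity(query, keyframeDescriptors[i]);
    if (score > bestScore) {
      bestScore = score;
      best = static_cast<int>(i);
    }
  }
  return best;  // index of the most similar keyframe, or -1 if none stored
}
```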
- Computation cost increases with the number of existing keyframe images and becomes prohibitive when the number of existing keyframe images reaches a huge or global-scale number, e.g., 1 billion.
- Conventional solutions have used the length of optical flows between two sequential images to reduce the number of keyframe images. When the average length of different optical flows is small, the conventional solutions wait for new images to find more optical-flow lengths, and the next keyframe is set only when enough flow has accumulated.
- Various embodiments of this application are directed to SLAM techniques that map a virtual space with descriptors.
- a camera is located and configured to capture images in a scene, and the scene is mapped to a virtual space including a plurality of landmarks.
- Each image (also called a keyframe) corresponds to a camera pose (i.e., a camera position and a camera orientation) of the camera and records a set of landmarks in the scene.
- Each keyframe is processed to provide a set of descriptors for the set of landmarks of the scene, and the set of descriptors are associated with the camera pose.
- the respective landmark is recorded in one or more keyframes that provide one or more descriptors associated with one or more respective camera poses to describe the respective landmark.
- the number of keyframes or the number of associated landmarks in each keyframe are downsampled to reduce the number of descriptors that need to be stored for SLAM. For instance, when a camera pose has multiple descriptors, a portion of these descriptors are selected to map the landmarks because other unselected descriptors are substantially similar to descriptors of other keyframes.
- Such a keyframe downsampling mechanism can be applied in various SLAM-based products, e.g., AR glasses, robotic system, autonomous driving, drone, or mobile devices implementing AR applications.
- a method is implemented at an electronic system having a camera.
- the method includes obtaining image data including a plurality of images captured by the camera in a scene, and each of the plurality of images includes a first landmark.
- the method further includes generating a plurality of landmark descriptors of the first landmark from the image data and identifying a plurality of camera poses for the plurality of landmark descriptors.
- Each landmark descriptor is generated from a distinct image that includes the first landmark and is captured at a distinct camera pose.
- the method further includes, in accordance with a determination that two of the plurality of camera poses satisfy a descriptor elimination criterion, selecting a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses to map the first landmark in the scene. In some embodiments, the method further includes, in accordance with the determination that the two of the plurality of camera poses satisfy the descriptor elimination criterion, deselecting a second distinct landmark descriptor corresponding to a second one of the two of the plurality of camera poses from mapping the first landmark in the scene. In some embodiments, the method further includes mapping the first landmark with a subset of the plurality of landmark descriptors.
- the subset of the plurality of landmark descriptors correspond to a subset of distinct camera poses where the subset of the plurality of images are captured.
- the subset of the plurality of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses do not satisfy the descriptor elimination criterion.
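A minimal sketch of how such a per-landmark selection could be implemented is shown below. Observations are visited in capture order, and a descriptor is kept only if its camera pose does not satisfy the elimination criterion with any already-kept pose. The Observation type, the predicate, and the function name are illustrative assumptions, not part of this application.

```cpp
#include <functional>
#include <vector>

// One landmark descriptor together with the keyframe/camera pose that produced it.
struct Observation {
  int keyframeId;                  // keyframe (image) that produced this descriptor
  int poseId;                      // camera pose at which the keyframe was captured
  std::vector<float> descriptor;   // landmark descriptor extracted from the keyframe
};

// Greedily select descriptors for a single landmark so that no two selected
// camera poses satisfy the descriptor elimination criterion. Observations are
// assumed to be sorted by capture time, so the earlier keyframe wins ties.
std::vector<Observation> selectDescriptors(
    const std::vector<Observation>& observations,
    const std::function<bool(int poseA, int poseB)>& satisfiesEliminationCriterion) {
  std::vector<Observation> kept;
  for (const Observation& candidate : observations) {
    bool redundant = false;
    for (const Observation& selected : kept) {
      if (satisfiesEliminationCriterion(candidate.poseId, selected.poseId)) {
        redundant = true;  // too close to an already-selected pose; drop it
        break;
      }
    }
    if (!redundant) {
      kept.push_back(candidate);  // keep this descriptor for mapping the landmark
    }
  }
  return kept;  // no two kept camera poses satisfy the criterion
}
```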
- some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
- some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
- Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
- Figure 2 is a block diagram illustrating an electronic system, in accordance with some embodiments.
- Figure 3 is a flowchart of a process for processing inertial sensor data and image data of an electronic system (e.g., a server, a client device, or a combination of both) using a SLAM module, in accordance with some embodiments.
- Figures 4A-4C are three simplified diagrams of a virtual space having a plurality of landmarks that are captured by different keyframes, in accordance with some embodiments.
- Figure 5A is a diagram of a virtual space that is mapped with a plurality of landmarks associated with a first set of keyframes, in accordance with some embodiments.
- Figure 5B is another diagram of the virtual space that is mapped with the plurality of landmarks associated with a second set of keyframes, in accordance with some embodiments.
- Figure 6 is a flowchart of a method for simultaneous localization and mapping (SLAM), in accordance with some embodiments.
- This application is directed to localizing a camera and mapping a scene for rendering extended reality content (e.g., virtual, augmented, or mixed reality content) on an electronic device.
- a current image is compared with existing keyframes to identify a camera location, which can be extremely inefficient as a huge number of keyframes are created to map the scene.
- the number of keyframes or the number of associated landmarks in each keyframe are downsampled to reduce the number of descriptors that need to be stored for SLAM. For instance, when a camera pose has multiple descriptors, a portion of these descriptors are selected to map the landmarks because other unselected descriptors are substantially similar to descriptors of other keyframes. When the selected portion of the descriptors is small, the contribution of this camera pose to the overall accuracy of the mapping data of the virtual space is limited, and the camera pose and corresponding keyframe are entirely disabled from providing any descriptors, including the selected ones.
- Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
- the one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone).
- the one or more client devices 104 include a head-mounted display 104D configured to render extended reality content.
- Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
- the collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102.
- the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
- the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
- storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
- the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
- the client devices 104 include a game console (e.g., formed by the head-mounted display 104D) that executes an interactive online gaming application.
- the game console receives a user instruction and sends it to a game server 102 with user data.
- the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
- the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
- the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
- the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
- a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
- the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
- At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
- the head-mounted display 104D (also called AR glasses 104D) includes one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
- the camera(s) and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
- the camera captures hand gestures of a user wearing the AR glasses 104D.
- the microphone records ambient sound, including user’s voice commands.
- both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations).
- the video, static image, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
- the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
- the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render virtual objects with high fidelity or interact with user-selectable display items on the user interface.
- SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the AR glasses 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the AR glasses 104D is located is mapped and updated.
- the SLAM techniques are optionally implemented by AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
- FIG 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments.
- the electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof.
- the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
- the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit, or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
- the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
- the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
- the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
- the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
- the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
- Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
- Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices.
- Memory 206 optionally, includes one or more storage devices remotely located from one or more processing units 202.
- Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium.
- memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
- Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
- Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
- User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
- Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction
- Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
- One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
- Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
- Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
- Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and in some embodiments, the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image or IMU sensor data;
- Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 based on a camera pose of the camera 260;
- the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
- the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
- more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
- Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
- the above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
- memory 206 optionally, stores a subset of the modules and data structures identified above.
- memory 206 optionally, stores additional modules and data structures not described above.
- FIG. 3 is a flowchart of a process 300 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a visual-inertial SLAM module 232, in accordance with some embodiments.
- the process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocation 306, and global pose graph optimization 308.
- an RGB camera 260 captures image data of a scene at an image rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data.
- An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (312) to provide data of a variation of device poses 340.
- the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (314).
- vision-only structure from motion (SfM) techniques 314 are applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
- a sliding window 318 and associated states from a loop closure 320 are used to optimize (322) a VIO.
- the VIO corresponds (324) to a keyframe of a smooth video transition, and a corresponding loop is detected (326)
- features are retrieved (328) and used to generate the associated states from the loop closure 320.
- In the global pose graph optimization 308, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (330) based on the states from the loop closure 320, and a keyframe database 332 is updated with the keyframe associated with the VIO.
- each keyframe includes a set of landmarks and is processed to generate landmark descriptors to map the landmarks.
- the descriptors are optionally compressed according to a descriptor elimination criterion, and the respective keyframe is optionally eliminated from mapping the landmarks according to an image elimination criterion.
- the features that are detected and tracked (310) are used to monitor (334) motion of an object in the image data and estimate image-based poses 336, e.g., according to the image rate.
- the inertial sensor data that are pre-integrated (312) may be propagated (338) based on the motion of the object and used to estimate inertial-based poses 340, e.g., according to a sampling frequency of the IMU 280.
- the image-based poses 336 and the inertial-based poses 340 are stored in the database 240 and used by the module 230 to estimate and predict poses that are used by the real time video rendering system 234.
- the module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 336 to estimate and predict more poses 340 that are further used by the real time video rendering system 234.
- high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors (e.g., the RGB camera 260, a LIDAR scanner) and the IMU 280.
- the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., <0.1 millisecond).
- Asynchronous time warping is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered.
- ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing images.
- relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction.
- the image and inertial sensor data are stored in one of multiple Standard Template Library (STL) containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use.
- the image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
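As an illustration of the timestamp-keyed storage described above, the following sketch buffers inertial samples in an STL container and retrieves the samples that fall between two image timestamps, e.g., for pre-integration. The types, field names, and the binary-search lookup are assumptions made for this sketch rather than details of the application.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

// One inertial sample with its capture timestamp in nanoseconds.
struct ImuSample {
  int64_t timestampNs;
  float gyro[3];
  float accel[3];
};

class ImuBuffer {
 public:
  // Samples are assumed to arrive with monotonically increasing timestamps.
  void push(const ImuSample& sample) { samples_.push_back(sample); }

  // Return all samples captured between two image timestamps [t0, t1).
  std::vector<ImuSample> between(int64_t t0Ns, int64_t t1Ns) const {
    auto byTime = [](const ImuSample& s, int64_t t) { return s.timestampNs < t; };
    auto lo = std::lower_bound(samples_.begin(), samples_.end(), t0Ns, byTime);
    auto hi = std::lower_bound(samples_.begin(), samples_.end(), t1Ns, byTime);
    return std::vector<ImuSample>(lo, hi);
  }

 private:
  std::deque<ImuSample> samples_;  // timestamps support search, insertion, organization
};
```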
- FIGs 4A-4C are three simplified diagrams of a virtual space 400 having a plurality of landmarks 402 that are captured by different keyframes 404, in accordance with some embodiments.
- An electronic device has a camera 260 and is positioned in a scene.
- the camera 260 captures a plurality of images (also called keyframes) 404, and each keyframe 404 is captured when the camera 260 has a camera pose 406 (i.e., is positioned at a camera position and oriented with a camera orientation).
- the scene includes a plurality of objects that are associated with the plurality of landmarks 402.
- Each keyframe 404 records a portion of the scene including a subset of the objects, and therefore, includes a subset of landmarks 402.
- the respective keyframe 404 is processed by a data processing model (e.g., a convolutional neural network (CNN)) to extract a landmark descriptor for the respective landmark 402.
- a data processing model e.g., a convolutional neural network (CNN)
- CNN convolutional neural network
- a first keyframe 404A includes three landmarks 402A-402C and is processed to extract three respective landmark descriptors for the three landmarks 402A-402C.
- each of a second keyframe 404B and a third keyframe 404C includes the three landmarks 402A-402C and is processed to extract three respective landmark descriptors for the three landmarks 402A-402C.
- the keyframes 404A-404C provide nine landmark descriptors associated with the three landmarks 402A-402C.
- the first landmark 402A is associated with at least three landmark descriptors determined from the first, second, and third keyframes 404 A, 404B, and 404C.
- the first landmark 402A exists in each of the keyframes 404A-404C captured at a distinct one of the camera poses 406A-406C.
- a descriptor elimination criterion is applied to determine whether each of the landmark descriptors determined from the keyframes 404A-404C is selected to map the first landmark 402A.
- a first landmark descriptor determined by the first keyframe 404A is selected over a second landmark descriptor determined by the second keyframe 404B to map the first landmark 402A in the scene.
- the first camera pose 406A associated with the first keyframe 404A includes a first camera position and a first camera orientation
- the second camera pose 406B associated with the second keyframe 404B includes a second camera position and a second camera orientation.
- a first image ray 408A connects the first camera position to the first landmark 402A
- a second image ray 408B connects the second camera position to the first landmark 402A.
- the first and second image rays 408A and 408B form a ray angle 410 connecting the first and second camera positions to the first landmark 402A.
- the descriptor elimination criterion defines a ray angle threshold (e.g., 5-10 degrees). If the ray angle 410 is less than the ray angle threshold, the first and second image rays 408A and 408B and the first and second camera positions satisfy the descriptor elimination criterion. That is, in accordance with the descriptor elimination criterion, one of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the first landmark 402A in the scene.
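A compact sketch of this ray-angle test is shown below, using Eigen for the vector math; the library choice, the default threshold, and the function name are assumptions of this sketch, not requirements of the application.

```cpp
#include <Eigen/Core>
#include <algorithm>
#include <cmath>

// Returns true when two camera positions observing the same landmark satisfy
// the descriptor elimination criterion, i.e., the angle between the two image
// rays to that landmark is below the ray angle threshold.
bool satisfiesDescriptorEliminationCriterion(const Eigen::Vector3d& cameraPos1,
                                             const Eigen::Vector3d& cameraPos2,
                                             const Eigen::Vector3d& landmark,
                                             double rayAngleThresholdDeg = 7.0) {
  const Eigen::Vector3d ray1 = (landmark - cameraPos1).normalized();
  const Eigen::Vector3d ray2 = (landmark - cameraPos2).normalized();
  const double cosAngle = std::clamp(ray1.dot(ray2), -1.0, 1.0);
  const double angleDeg = std::acos(cosAngle) * 180.0 / 3.14159265358979323846;
  return angleDeg < rayAngleThresholdDeg;  // small angle => one descriptor is redundant
}
```

When this predicate returns true for a pair of poses, only one of the two descriptors (e.g., the one from the earlier keyframe, as described next) is retained for the landmark.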
- the first keyframe 404A is captured temporally prior to the second keyframe 404B.
- When one of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the first landmark 402A based on the descriptor elimination criterion, the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time is selected, and the second landmark descriptor associated with the second keyframe 404B that is captured later in time is disabled from mapping the first landmark 402A.
- the second landmark descriptor associated with the second keyframe 404B that is captured later in time has a priority of being eliminated over the first landmark descriptor based on the descriptor elimination criterion.
- Alternatively, in some embodiments, the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time has a priority of being eliminated over the second landmark descriptor based on the descriptor elimination criterion.
- the second landmark descriptor associated with the second keyframe 404B that is captured later in time is selected, and the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time is disabled from landmark mapping.
- a third image ray 408C connects the second camera position to the second landmark 402B
- a fourth image ray 408D connects the first camera position to the second landmark 402B
- the third and fourth image rays 408C and 408D form a ray angle 412 connecting the second and first camera positions to the second landmark 402B.
- the ray angle 412 is less than the ray angle threshold, and therefore, the third and fourth image rays 408C and 408D and the first and second camera positions satisfy the descriptor elimination criterion.
- One of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the second landmark 402B in the scene.
- a landmark descriptor associated with the first keyframe 404 A that is captured earlier in time is selected to map the second landmark 402B in the scene, and another landmark descriptor associated with the second keyframe 404B that is captured later in time is disabled from landmark mapping.
- Alternatively, the landmark descriptor associated with the second keyframe 404B that is captured later in time is selected for mapping the landmark 402B in the scene, and the landmark descriptor associated with the first keyframe 404A that is captured earlier in time is disabled from landmark mapping.
- the first, second, and third landmarks 402A-402C are mapped with a set of landmark descriptors determined from a set of keyframes 404 including the first, second, and third keyframes 404A, 404B, and 404C.
- This set of landmark descriptors includes three landmark descriptors provided by the first keyframe 404A, three landmark descriptors provided by the third keyframe 404C, and only one landmark descriptor provided by the second keyframe 404B. No two landmark descriptors of the same landmark 402 satisfy the descriptor elimination criterion.
- this set of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses do not satisfy the descriptor elimination criterion.
- any two image rays connected to the same landmark 402 form a ray angle that is greater than a ray angle threshold defined by the descriptor elimination criterion.
- the descriptor elimination criterion eliminates one of two landmark descriptors that are determined from two keyframes 404 for the same landmark 402 based on a distance between the landmark 402 and each of two camera positions corresponding to these two keyframes 404.
- the second landmark descriptor determined from the second keyframe 404B is eliminated in accordance with a determination that a first distance between the first landmark 402A and the first camera position of the first keyframe 404A is greater than a second distance between the first landmark 402A and the second camera position of the second keyframe 404B. Conversely, in some embodiments, in accordance with a determination that the ray angle 410 is less than the ray angle threshold and that the first distance is less than the second distance, the first landmark descriptor determined from the first keyframe 404A is eliminated, and the second landmark descriptor determined from the second keyframe 404B is selected.
- a target keyframe 404 (e.g., the second keyframe 404B) satisfies an image elimination criterion and is entirely eliminated from mapping the plurality of landmarks 402 in the scene.
- the target keyframe 404 is captured to include a first number of landmarks 402 (e.g., 100 landmarks) corresponding to the first number of landmark descriptors.
- a second number of landmark descriptors corresponding to a subset of landmarks are disabled from mapping the subset of landmarks.
- the target keyframe 404 is eliminated entirely, and the first number of landmark descriptors associated with this target keyframe 404 is eliminated from mapping the plurality of landmarks 402 in the scene.
- the image elimination criterion requires that a ratio of the second number to the first number exceeds a predetermined threshold (e.g., 90%). This implies that if a large portion of the landmark descriptors provided by the target keyframe 404 are eliminated, the target keyframe 404 is not efficiently utilized for mapping the scene and needs to be eliminated, leaving space to store information of more efficiently utilized keyframes 404.
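The following sketch shows one way the image elimination criterion could be evaluated per keyframe after the descriptor elimination pass has marked individual descriptors as disabled; the KeyframeUsage type, field names, and the 90% default are illustrative assumptions.

```cpp
// Per-keyframe bookkeeping after the descriptor elimination pass.
struct KeyframeUsage {
  int totalDescriptors = 0;     // first number: descriptors extracted from this keyframe
  int disabledDescriptors = 0;  // second number: descriptors disabled from landmark mapping
};

// True when the keyframe should be eliminated entirely, i.e., the fraction of
// its descriptors already disabled exceeds the predetermined threshold.
bool shouldEliminateKeyframe(const KeyframeUsage& usage, double ratioThreshold = 0.9) {
  if (usage.totalDescriptors == 0) {
    return true;  // nothing useful to keep
  }
  const double ratio =
      static_cast<double>(usage.disabledDescriptors) / usage.totalDescriptors;
  return ratio > ratioThreshold;  // e.g., >90% of descriptors unused => drop keyframe
}
```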
- the target keyframe 404 is the second keyframe 404B processed to provide three landmark descriptors to the three landmarks 402A-402C.
- Two of the three landmark descriptors determined from the second keyframe 404B are eliminated based on the descriptor elimination criterion, and only one of these three landmark descriptors is left in association with the third landmark 402C.
- the second keyframe 404B is entirely eliminated from mapping the scene.
- Although one of the three landmark descriptors is not eliminated by the descriptor elimination criterion, that landmark descriptor is still eliminated with the second keyframe 404B from mapping the third landmark 402C.
- Figure 5A is a diagram of a virtual space 500 that is mapped with a plurality of landmarks 402 associated with a first set of keyframes 520, in accordance with some embodiments.
- Figure 5B is another diagram of the virtual space 500 that is mapped with the plurality of landmarks 402 associated with a second set of keyframes 540, in accordance with some embodiments.
- the second set of keyframes 540 are simplified from the first set of keyframes 520, and the first set of keyframes 520 include the second set of keyframes 540.
- the first set of keyframes 520 has one or more additional keyframes besides the second set of keyframes 540.
- the one or more additional keyframes are eliminated, not selected, or deselected based on a combination of a descriptor elimination criterion and an image elimination criterion.
- landmark descriptors determined from the second set of keyframes 540 can sufficiently map the scene and identify a current camera pose of a current frame 504C, while requiring a smaller storage space to store information of the second set of keyframes 540 including the landmark descriptors determined therefrom.
- a camera 260 is placed in a scene and captures a plurality of images. The plurality of images are applied as keyframes 404 to map the scene to the virtual space 500.
- the scene includes a plurality of objects 502 that are associated with a plurality of landmarks 402.
- each landmark 402 is associated with a corner or an edge of a respective object 502.
- Each keyframe 404 records a portion of the scene including a subset of the objects 502, and therefore, includes a subset of landmarks 402.
- Each keyframe 404 is captured when the camera 260 has a camera pose 406 (i.e., is positioned at a camera position and oriented with a camera orientation).
- the subset of landmarks 402 in each keyframe 404 are determined based on the camera pose 406 corresponding to the respective keyframe 404.
- the respective keyframe 404 is processed to extract a landmark descriptor for the respective landmark 402. From a perspective of each landmark 402, the respective landmark 402 is recorded in one or more keyframes 404 each of which provides a landmark descriptor associated with a respective camera pose 406 to map the respective landmark in the virtual space 500.
- the first set of keyframes 520 provide a plurality of landmark descriptors.
- Each keyframe 404 in the first set of keyframes 520 provides one or more respective landmark descriptors to map one or more respective landmarks 402.
- the descriptor elimination criterion is applied to eliminate a subset of the plurality of landmark descriptors.
- a first subset of keyframes 404-1 are entirely eliminated, as all of the one or more respective landmark descriptors provided by each keyframe 404-1 are disabled from landmark mapping based on the descriptor elimination criterion.
- a second subset of keyframes 404-2 are not eliminated by the descriptor elimination criterion.
- Each keyframe 404-2 provides a plurality of respective landmark descriptors, and at least one of the respective landmark descriptors provided by each keyframe 404-2 is disabled from landmark mapping based on the descriptor elimination criterion.
- a third subset of keyframes 404-3 are not impacted by the descriptor elimination criterion, and all of the one or more respective landmark descriptors provided by each keyframe 404-3 are selected to map corresponding landmarks 402.
- the image elimination criterion is further applied to determine whether to eliminate each keyframe 404 in the second subset of keyframes 404-2.
- the image elimination criterion does not impact the third subset of keyframes 404-3, which do not correspond to any landmark descriptor that is eliminated by the descriptor elimination criterion.
- Each of the second subset of keyframes 404-2 provides a respective first number of landmark descriptors to the respective first number of landmarks 402, and a respective second number of landmark descriptors are not selected due to the descriptor elimination criterion.
- the second subset of keyframes 404-2 includes one or more keyframes 404-2A.
- the respective first or second number satisfies the image elimination criterion (e.g., which requires that a ratio of the respective second number to the first number exceeds a predetermined threshold), and the respective keyframe 404-2A is eliminated and not shown in the second set of keyframes 540 in Figure 5B.
- the second subset of keyframes 404-2 includes one or more keyframes 404-2B.
- the respective first or second number does not satisfy the image elimination criterion, and the respective keyframe 404-2B is not eliminated and thereby shown in the second set of keyframes 540 in Figure 5B.
- the second set of keyframes 540 include at least one keyframe 404-2B and at least one keyframe 404-3. Alternatively, in some embodiments not shown, the second set of keyframes 540 does not include any keyframe 404-2B, and any keyframe 404-2 that has unselected landmark descriptors is eliminated after the image elimination criterion is applied. Alternatively, in some embodiments not shown, the second set of keyframes 540 does not include any keyframe 404-3, and all keyframes 404 are impacted by the descriptor elimination criterion. Based on the descriptor or image elimination criterion, the first set of keyframes 520 are simplified to the second set of keyframes 540.
- the one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses do not satisfy the descriptor elimination criterion.
- Mapping data of the scene is generated based on information of the second set of the keyframes 540 including one or more landmark descriptors and one or more associated camera poses 406 corresponding to each landmark 402 in the scene.
- Each landmark descriptor corresponds to a respective keyframe in the second set of keyframes 540 and a respective camera pose 406.
- a current frame 504C is used for SLAM, i.e., to identify a current camera pose of the camera 260 and update mapping of the scene in a synchronous manner.
- the electronic device (e.g., AR glasses 104D) extracts a plurality of feature points 506 from the current frame, e.g., based on a CNN.
- Each of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C.
- the image descriptor is compared with the mapping data to identify a matching landmark 402 based on the second set of keyframes 540.
- the current camera pose where the current frame 504C is captured is determined based on the respective camera pose corresponding to the matching landmark 402.
- the current camera pose is interpolated from two or more camera poses corresponding to two or more keyframes 404 in the second set of keyframes 540.
- the plurality of feature points 506 of the current frame 504C include a first feature point 506A.
- the electronic device determines that a first image descriptor of the first feature point 506A is a combination of two landmark descriptors of the matching landmark 402.
- the two landmark descriptors of the matching landmark 402 are determined from two distinct keyframes 404.
- the current camera pose of the current frame 504C is determined based on two camera poses corresponding to the two distinct keyframes 404 from which the two landmark descriptors are determined.
- the first image descriptor of the first feature point 506 A in the current frame 504C is a combination of two landmark descriptors of the first landmark 402A.
- the two landmark descriptors of the first landmark 402A are determined by two keyframes 508 and 510.
- the current camera pose of the current frame 504C is determined based on the two camera poses corresponding to the two keyframes 508 and 510.
- the current camera pose of the current frame 504C is equal to a weighted average of the two camera poses of the two keyframes 508 and 510.
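A sketch of such an interpolation is given below, again using Eigen: positions are blended linearly and orientations are combined with spherical linear interpolation. The weighting scheme (e.g., deriving the weight from how strongly each landmark descriptor contributes to the feature's image descriptor) is an assumption of this sketch.

```cpp
#include <Eigen/Geometry>

// A camera pose as a position plus an orientation quaternion.
struct CameraPose {
  Eigen::Vector3d position;
  Eigen::Quaterniond orientation;
};

// Interpolate between the poses of the two keyframes whose landmark descriptors
// combine into the current feature descriptor; weightB in [0, 1] is the weight
// given to the second keyframe's pose.
CameraPose interpolatePose(const CameraPose& a, const CameraPose& b, double weightB) {
  CameraPose out;
  out.position = (1.0 - weightB) * a.position + weightB * b.position;          // weighted average of positions
  out.orientation = a.orientation.slerp(weightB, b.orientation).normalized();  // slerp of orientations
  return out;
}
```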
- a second feature point 506B is extracted from the current frame 504C, and corresponds to a second image descriptor determined from the current frame 504C.
- the current frame 504C is not included in and cannot be determined from a combination of the second set of keyframes 540.
- the mapping data is updated by associating a second landmark 402B with the second image descriptor of the second feature point 506B and the current camera pose associated with the current frame 504C.
- the mapping data is not updated with the information related to the second image descriptor.
- FIG. 6 is a flowchart of a method 600 for simultaneous localization and mapping (SLAM), in accordance with some embodiments.
- the method is applied in the AR glasses 104D, robotic systems, or autonomous vehicles.
- the method 600 is described as being implemented by an electronic system 200 (e.g., a client device 104).
- the method 600 is applied to determine and predict poses, map a scene, and render both virtual and real content concurrently in extended reality (e.g., VR, AR).
- Method 600 is, optionally, governed by instructions that are stored in a non- transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
- Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the electronic system 200 in Figure 2).
- the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
- the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
- the electronic system has a camera 260, and obtains (602) image data including a plurality of images (also called keyframes) 404 captured by the camera 260 in a scene.
- Each of the plurality of images 404 includes a first landmark 402A.
- the electronic system generates (604) a plurality of landmark descriptors of the first landmark 402 A from the image data, and identifies (606) a plurality of camera poses 406 for the plurality of landmark descriptors.
- Each landmark descriptor is generated (608) from a distinct image 404 that includes the first landmark 402A and is captured at a distinct camera pose 406.
- the electronic system selects (610) a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses 406 to map the first landmark 402A in the scene.
- the electronic system does not select (i.e., disables) (612) a second distinct landmark descriptor corresponding to a second one of the two of the plurality of camera poses 406 from mapping the first landmark 402A in the scene.
- the first landmark descriptor is generated (614) from a first image 404A that is captured earlier than a second image 404B from which the second distinct landmark descriptor is generated, and the second distinct landmark descriptor has a priority of being eliminated over the first landmark descriptor based on the descriptor elimination criterion.
- the electronic system maps the first landmark 402A with a subset of the plurality of landmark descriptors.
- the subset of the plurality of landmark descriptors correspond to a subset of distinct camera poses 406 where the subset of the plurality of images 404 are captured.
- the subset of the plurality of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses 406 do not satisfy the descriptor elimination criterion.
- the plurality of images includes (616) a target image having a first number of landmarks corresponding to the first number of landmark descriptors.
- the electronic system determines (618) that a second number of landmark descriptors corresponding to a subset of landmarks are disabled from mapping the subset of landmarks.
- the electronic system eliminates (i.e., aborts selecting) (620) the first number of landmark descriptors associated with the target image from mapping the plurality of landmarks in the scene.
- the image elimination criterion requires (622) that a ratio of the second number to the first number exceeds a predetermined threshold.
- the target keyframe corresponds to the first one of the two of the plurality of camera poses, and the subset of landmarks is distinct from the first landmark.
- the first number of landmark descriptors are deselected from mapping the plurality of landmarks by aborting selection of the first landmark descriptor for mapping the first landmark 402A. In an example, if 80% or more descriptors of the target image are eliminated and disabled from mapping the landmarks in the scene, then all descriptors related to the target image are eliminated.
- the target image, corresponding camera pose, and landmark descriptors are not stored for SLAM.
- each distinct camera pose includes a respective camera position and a respective camera orientation.
- For each landmark descriptor of the first landmark 402A, the electronic system identifies a respective image ray 408 connecting the respective camera position to the first landmark 402A, identifies a ray angle (e.g., 410 or 412 in Figure 4B) formed by the respective image ray and the image ray of another landmark descriptor of the first landmark 402A, and determines whether the two corresponding camera poses satisfy the descriptor elimination criterion based on whether the ray angle is less than a ray angle threshold.
- the ray angle threshold is in a range of [5°-10°].
- mapping data of the scene are generated (626).
- the mapping data includes one or more landmark descriptors and one or more associated camera poses 406 corresponding to each landmark in the scene.
- Each landmark descriptor corresponds to a respective image and a respective camera pose.
- the one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses 406 do not satisfy the descriptor elimination criterion.
- the electronic system obtains a current frame 504C and extracts a plurality of feature points 506 from the current frame.
- Each of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C.
- the electronic system compares the image descriptor with the mapping data to identify a matching landmark 402 from the plurality of landmarks 402 and determines a current camera pose where the current frame is captured based on the respective camera pose corresponding to the matching landmark.
- the plurality of feature points 506 includes a first feature point 506A.
- the electronic device determines that a first image descriptor of the first feature point 506A is a combination of two landmark descriptors of the matching landmark 402, and determines the current camera pose of the current frame based on two camera poses 406 corresponding to the two images (e.g., 508 and 510 in Figure 5B) from which the two landmark descriptors are determined.
- the electronic system extracts a second feature point 506B from the current frame 504C.
- the second feature point 506B corresponds to a second image descriptor determined from the current frame 504C, and, in accordance with a determination that the second image descriptor does not match the landmark descriptors in the mapping data, the electronic system updates the mapping data by associating a second landmark 402B with the second image descriptor and the current camera pose.
- This application is directed to a landmark descriptor downsampling method.
- Camera poses 406 where keyframes 404 are captured are connected with landmarks 402 recognized in the keyframes 404 using image rays 408.
- Ray angles are formed at each landmark 402 among the image rays connecting the respective landmark 402 to multiple camera poses 406.
- For each small ray angle (e.g., the ray angles 410 and 412), one of the two image rays forming the respective small ray angle is disconnected, and one of the two camera poses 406 connected to form the two image rays 408 is disabled from providing a landmark descriptor to map the respective landmark 402.
- the other one of the two camera poses 406 connected to form the two image rays is selected to provide a landmark descriptor to map the respective landmark 402 in the scene.
- the scene is further partitioned into a plurality of map tiles, and the landmark descriptors are further archived according to the plurality of map tiles. More details on organizing mapping data with a map-tile data structure are discussed with reference to International Application No. PCT/CN 2021/076578, titled “Methods for Localization, Electronic Device and Storage Medium”, filed February 10, 2021, which is incorporated by reference in its entirety. Therefore, this application is directed to down-sampling keyframe-related information (e.g., camera poses, descriptors) archived in the map-tile mapping data structure.
- a subset of camera poses (e.g., the second camera pose 406B in Figure 4B) have a small number of indexes which point to specific landmarks 402 to use information of image rays and descriptors. These camera poses are treated as insignificant for improving the 3D depth accuracy of the landmarks 402 and as replaceable by other keyframes.
- An image elimination criterion is further applied to remove camera poses that lose too many image rays due to the descriptor elimination criterion.
- the number of keyframes or the number of landmark descriptors stored for mapping the scene is down-sampled based on the amount of camera motion.
- the amount of camera motion corresponds to a camera motion threshold.
- the descriptor elimination criterion requires that one of two camera poses 406 is eliminated if the distance between the two camera poses 406 is less than the camera motion threshold.
- the camera motion threshold varies with the distance between a corresponding landmark 402 and the two camera poses 406, e.g., the shorter the distance between the landmark 402 and the middle point of the two camera poses 406, the smaller the camera motion threshold.
- the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
- the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
- stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Processing Or Creating Images (AREA)
Abstract
This application is directed to simultaneous localization and mapping (SLAM) in extended reality. An electronic system has a camera and obtains image data including a plurality of images captured by the camera in a scene. Each of the plurality of images includes a first landmark. The electronic system generates a plurality of landmark descriptors of the first landmark from the image data and identifies a plurality of camera poses for the plurality of landmark descriptors. Each landmark descriptor is generated from a distinct image that includes the first landmark and is captured at a distinct camera pose. In accordance with a determination that two of the plurality of camera poses satisfy a descriptor elimination criterion, the electronic system selects a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses to map the first landmark in the scene.
Description
Keyframe Downsampling for Memory Usage Reduction in SLAM
TECHNICAL FIELD
[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for selectively storing keyframes for localizing a camera and mapping an environment in an extended reality application.
BACKGROUND
[0002] Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, high frequency pose estimation is enabled by sensor fusion. Asynchronous time warping (ATW) is often applied with SLAM in an AR system to warp an image before it is sent to a display to correct head movement that occurs after the image is rendered. In both SLAM and ATW, relevant image data and inertial sensor data are synchronized and used for estimating and predicting camera poses. For localization, many SLAM systems suppress an accumulated error of the inertial sensor data by detecting corner points and extracting image descriptors on a keyframe image. Each keyframe image is associated with a descriptor derived using a Bag of Visual Words data structure, and a corresponding descriptor of a new image is compared with those of existing keyframe images to localize the new image. Computation cost increases with the number of existing keyframe images and becomes unmanageable when the number of existing keyframe images reaches a huge or global-scale number, e.g., 1 billion.
[0003] Conventional solutions have used the length of optical flows between two sequential images to reduce the number of keyframe images. When the average length of the optical flows is small, the conventional solutions wait for new images to accumulate more optical flow, and the next keyframe is set when the accumulated optical flow is sufficient. Additionally, when a camera has significant rotation motion relative to translation motion, the significant rotation motion does not allow triangulation for 3D calculation, and the associated keyframe is invalid and removed. Small six degree-of-freedom (6DOF) motion is applied to identify close objects, but does not facilitate downsampling of keyframes. It would be beneficial to have a more efficient SLAM mechanism using fewer keyframes than the current practice.
SUMMARY
[0004] Various embodiments of this application are directed to SLAM techniques that map a virtual space with descriptors. A camera is located in a scene and configured to capture images of the scene, and the scene is mapped to a virtual space including a plurality of landmarks. Each image (also called keyframe) corresponds to a camera pose (i.e., a camera position and a camera orientation) of the camera, and records a set of landmarks in the scene. Each keyframe is processed to provide a set of descriptors for the set of landmarks of the scene, and the set of descriptors are associated with the camera pose. From a perspective of each landmark, the respective landmark is recorded in one or more keyframes that provide one or more descriptors associated with one or more respective camera poses to describe the respective landmark. As the number of keyframes captured from the camera increases, memory usage also increases. In some embodiments, the number of keyframes or the number of associated landmarks in each keyframe is downsampled to reduce the number of descriptors that need to be stored for SLAM. For instance, when a camera pose has multiple descriptors, a portion of these descriptors are selected to map the landmarks because other unselected descriptors are substantially similar to descriptors of other keyframes. When the selected portion of the descriptors is small, the contribution of this camera pose to an overall accuracy of mapping data of the virtual space is limited, and the camera pose and the corresponding keyframe are entirely disabled from providing any descriptors, including the selected portion, to map the set of landmarks. Such a keyframe downsampling mechanism can be applied in various SLAM-based products, e.g., AR glasses, robotic systems, autonomous driving, drones, or mobile devices implementing AR applications.
[0005] In one aspect, a method is implemented at an electronic system having a camera. The method includes obtaining image data including a plurality of images captured by the camera in a scene, and each of the plurality of images includes a first landmark. The method further includes generating a plurality of landmark descriptors of the first landmark from the image data and identifying a plurality of camera poses for the plurality of landmark descriptors. Each landmark descriptor is generated from a distinct image that includes the first landmark and is captured at a distinct camera pose. The method further includes, in accordance with a determination that two of the plurality of camera poses satisfy a descriptor elimination criterion, selecting a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses to map the first landmark in the scene.
[0006] In some embodiments, the method further includes in accordance with the determination that the two of the plurality of camera poses satisfy the descriptor elimination criterion, deselecting a second distinct landmark descriptor corresponding to a second one of the two of the plurality of camera poses from mapping the first landmark in the scene. In some embodiments, the method further includes mapping the first landmark with a subset of the plurality of landmark descriptors. The subset of the plurality of landmark descriptors correspond to a subset of distinct camera poses where the subset of the plurality of images are captured. The subset of the plurality of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses do not satisfy the descriptor elimination criterion.
[0007] In another aspect, some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
[0009] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.
[0011] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
[0012] Figure 2 is a block diagram illustrating an electronic system, in accordance with some embodiments.
[0013] Figure 3 is a flowchart of a process for processing inertial sensor data and image data of an electronic system (e.g., a server, a client device, or a combination of both) using a SLAM module, in accordance with some embodiments.
[0014] Figures 4A-4C are three simplified diagrams of a virtual space having a plurality of landmarks that are captured by different keyframes, in accordance with some embodiments.
[0015] Figure 5A is a diagram of a virtual space that is mapped with a plurality of landmarks associated with a first set of keyframes, in accordance with some embodiments, and Figure 5B is another diagram of the virtual space that is mapped with the plurality of landmarks associated with a second set of keyframes, in accordance with some embodiments.
[0016] Figure 6 is a flowchart of a method for simultaneous localization and mapping (SLAM), in accordance with some embodiments.
[0017] Like reference numerals refer to corresponding parts throughout the several views of the drawings.
DETAILED DESCRIPTION
[0018] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic systems with digital video capabilities.
[0019] This application is directed to localizing a camera and mapping a scene for rendering extended reality content (e.g., virtual, augmented, or mixed reality content) on an electronic device. In prior art, a current image is compared with existing keyframes to identify a camera location, which can be extremely inefficient as a huge number of keyframes are created to map the scene. In various embodiments of this application, the number of keyframes or the number of associated landmarks in each keyframe are downsampled to reduce the number of descriptors that need to be stored for SLAM. For instance, when a camera pose has multiple descriptors, a portion of these descriptors are selected to map the landmarks because other unselected descriptors are substantially similar to descriptors of other keyframes. When the select portion of the descriptors is small, contribution of this
camera pose to an overall accuracy of mapping data of the virtual space is limited, and the camera pose and corresponding keyframe is entirely disabled from providing any descriptors including the selected portion to map the set of landmarks.
[0020] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, laptop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). In some implementations, the one or more client devices 104 include a head-mounted display 104D configured to render extended reality content. Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
[0021] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., formed by the head-mounted display 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
[0022] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more
communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other electronic systems that route data and messages.
[0023] The head-mounted display 104D (also called AR glasses 104D) include one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera(s) and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses (i.e., device positions and orientations). The video, static image, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render
virtual objects with high fidelity or interact with user-selectable display items on the user interface.
[0024] In some embodiments, SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the AR glasses 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the AR glasses 104D is located is mapped and updated. The SLAM techniques are optionally implemented by AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
[0025] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 , typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit, or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
[0026] Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104. Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
[0027] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally,
includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
* Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
* Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
* User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
* Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
* Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
* One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
* Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;
* Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;
* Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and in some embodiments, the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image or IMU sensor data;
* Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 based on a camera pose of the camera 260; and
* One or more databases 240 for storing at least data including one or more of:
o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 248 for training one or more data processing models 250;
o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques;
o Pose data database 252 for storing pose data of the camera 260, where in some embodiments, descriptors and associated camera poses are compressed according to a descriptor or image elimination criterion and stored in association with landmarks in a scene; and
o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200 , respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on client device 104, and include the candidate images.
[0028] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.
[0029] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.
[0030] Figure 3 is a flowchart of a process 300 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a visual-inertial SLAM module 232, in accordance with some embodiments. The process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocation 306, and global pose graph optimization 308. In measurement preprocessing 302, an RGB camera 260 captures image data of a scene at an image rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data. An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (312) to provide data of a variation of device poses 340. In initialization 304, the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (314). A vision-only structure from motion (SfM) technique is applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
[0031] After initialization 304 and during relocation 306, a sliding window 318 and associated states from a loop closure 320 are used to optimize (322) a VIO. When the VIO corresponds (324) to a keyframe of a smooth video transition and a corresponding loop is detected (326), features are retrieved (328) and used to generate the associated states from the loop closure 320. In global pose graph optimization 308, a multi-degree-of-freedom (multiDOF) pose graph is optimized (330) based on the states from the loop closure 320, and a keyframe database 332 is updated with the keyframe associated with the VIO. Specifically, in some embodiments, each keyframe includes a set of landmarks and is processed to generate landmark descriptors to map the landmarks. The descriptors are optionally compressed according to a descriptor elimination criterion, and the respective keyframe is optionally eliminated from mapping the landmarks according to an image elimination criterion.
[0032] Additionally, the features that are detected and tracked (310) are used to monitor (334) motion of an object in the image data and estimate image-based poses 336, e.g., according to the image rate. In some embodiments, the inertial sensor data that are pre-integrated (234) may be propagated (338) based on the motion of the object and used to estimate inertial-based poses 340, e.g., according to a sampling frequency of the IMU 280. The image-based poses 336 and the inertial-based poses 340 are stored in the database 240 and used by the module 230 to estimate and predict poses that are used by the real time video rendering system 234. Alternatively, in some embodiments, the module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 336 to estimate and predict more poses 340 that are further used by the real time video rendering system 234.
[0033] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., the RGB camera 260, a LIDAR scanner) provide image data desirable for pose estimation, and oftentimes operate at a lower frequency (e.g., 30 frames per second) and with a larger latency (e.g., 30 milliseconds) than the IMU 280. Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., <0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing images. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in
one of multiple Standard Template Library (STL) containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use. The image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
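For illustration only, a minimal C++ sketch of such a timestamped buffer is provided below. The container layout, the TimedSample type, and the findNearest helper are editorial assumptions and are not part of this application; the sketch merely shows how a timestamp-sorted STL container supports the timestamp-based data search described above.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iterator>
#include <vector>

// One inertial or image sample tagged with its capture timestamp (microseconds).
struct TimedSample {
    int64_t timestamp_us;
    std::array<double, 6> data;  // e.g., 3-axis gyroscope + 3-axis accelerometer
};

// Samples are appended in arrival order, so the vector stays sorted by timestamp
// and a binary search can locate the sample closest to a query time.
const TimedSample* findNearest(const std::vector<TimedSample>& buffer, int64_t query_us) {
    if (buffer.empty()) return nullptr;
    auto it = std::lower_bound(buffer.begin(), buffer.end(), query_us,
        [](const TimedSample& s, int64_t t) { return s.timestamp_us < t; });
    if (it == buffer.begin()) return &*it;
    if (it == buffer.end()) return &buffer.back();
    // Pick whichever neighbor is closer to the query timestamp.
    auto prev = std::prev(it);
    return (query_us - prev->timestamp_us) <= (it->timestamp_us - query_us) ? &*prev : &*it;
}
```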
[0034] Figures 4A-4C are three simplified diagrams of a virtual space 400 having a plurality of landmarks 402 that are captured by different keyframes 404, in accordance with some embodiments. An electronic device has a camera 260 and is positioned in a scene. The camera 260 captures a plurality of images (also called keyframes) 404, and each keyframe 404 is captured when the camera 260 has a camera pose 406 (i.e., is positioned at a camera position and oriented with a camera orientation). The scene includes a plurality of objects that are associated with the plurality of landmarks 402. Each keyframe 404 records a portion of the scene including a subset of the objects, and therefore, includes a subset of landmarks 402. For each of the subset of landmarks 402, the respective keyframe 404 is processed by a data processing model (e.g., a convolutional neural network (CNN)) to extract a landmark descriptor for the respective landmark 402. For example, referring to Figure 4A, a first keyframe 404A includes three landmarks 402A-402C and is processed to extract three respective landmark descriptors for the three landmarks 402A-402C. Similarly, each of a second keyframe 404B and a third keyframe 404C includes the three landmarks 402A-402C and is processed to extract three respective landmark descriptors for the three landmarks 402A-402C. As such, the keyframes 404A-404C provide 9 landmark descriptors associated with the three landmarks 402A-402C.
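For illustration only, one possible organization of this per-landmark bookkeeping is sketched below in C++. The CameraPose, LandmarkObservation, and MappingData names, the quaternion orientation, and the fixed 256-bit descriptor size are editorial assumptions rather than requirements of this application.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CameraPose {
    std::array<double, 3> position;     // camera position in the scene
    std::array<double, 4> orientation;  // camera orientation as a unit quaternion (w, x, y, z)
};

// One observation of a landmark: the descriptor extracted from a keyframe plus
// the identifier of the keyframe (and thus of the camera pose 406) that produced it.
struct LandmarkObservation {
    std::array<uint8_t, 32> descriptor;  // e.g., a 256-bit binary descriptor
    int keyframe_id;                     // index of the keyframe / camera pose
};

// Mapping data indexed by landmark id: each landmark 402 keeps the list of
// descriptors contributed by the keyframes 404 that observed it.
using MappingData = std::unordered_map<int, std::vector<LandmarkObservation>>;
```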
[0035] The first landmark 402A is associated with at least three landmark descriptors determined from the first, second, and third keyframes 404A, 404B, and 404C. The first landmark 402A exists in each of the keyframes 404A-404C captured at a distinct one of the camera poses 406A-406C. A descriptor elimination criterion is applied to determine whether each of the landmark descriptors determined from the keyframes 404A-404C is selected to map the first landmark 402A. For example, referring to Figure 4B in accordance with the descriptor elimination criterion, a first landmark descriptor determined by the first keyframe 404A is selected over a second landmark descriptor determined by the second keyframe 404B to map the first landmark 402A in the scene. The first camera pose 406A associated with the first keyframe 404A includes a first camera position and a first camera orientation, and the second camera pose 406B associated with the second keyframe 404B includes a second camera position and a second camera orientation. For the first landmark 402A, a first image ray 408A
connects the first camera position to the first landmark 402A, and a second image ray 408B connects the second camera position to the first landmark 402A. The first and second image rays 408A and 408B form a ray angle 410 connecting the first and second camera positions to the first landmark 402A. The descriptor elimination criterion defines a ray angle threshold (e.g., 5-10 degrees). If the ray angle 410 is less than the ray angle threshold, the first and second image rays 408A and 408B and the first and second camera positions satisfy the descriptor elimination criterion. That said, in accordance with the descriptor elimination criterion, one of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the first landmark 402A in the scene.
[0036] Assume that the first keyframe 404A is captured temporally prior to the second keyframe 404B. In some embodiments shown in Figure 4B, when one of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the first landmark 402A based on the descriptor elimination criterion, the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time is selected, and the second landmark descriptor associated with the second keyframe 404B that is captured later in time is disabled from mapping the first landmark 402A. Stated another way, the second landmark descriptor associated with the second keyframe 404B that is captured later in time has a priority of being eliminated over the first landmark descriptor based on the descriptor elimination criterion. Conversely, in some embodiments not shown in Figure 4B, the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time has a priority of being eliminated over the second landmark descriptor based on the descriptor elimination criterion. The second landmark descriptor associated with the second keyframe 404B that is captured later in time is selected, and the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time is disabled from landmark mapping.
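For illustration only, the ray-angle test described above reduces to comparing the angle between the two image rays formed at the landmark against the ray angle threshold. The following C++ sketch, with an assumed default threshold of 10 degrees and illustrative function names, is an editorial example and not a limiting implementation.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Angle, in degrees, between the two image rays that connect two camera
// positions to the same landmark position.
double rayAngleDeg(const std::array<double, 3>& cam_a,
                   const std::array<double, 3>& cam_b,
                   const std::array<double, 3>& landmark) {
    const double kPi = 3.14159265358979323846;
    std::array<double, 3> ra, rb;
    for (int i = 0; i < 3; ++i) {
        ra[i] = landmark[i] - cam_a[i];
        rb[i] = landmark[i] - cam_b[i];
    }
    double dot = ra[0] * rb[0] + ra[1] * rb[1] + ra[2] * rb[2];
    double na = std::sqrt(ra[0] * ra[0] + ra[1] * ra[1] + ra[2] * ra[2]);
    double nb = std::sqrt(rb[0] * rb[0] + rb[1] * rb[1] + rb[2] * rb[2]);
    double c = dot / (na * nb);
    c = std::max(-1.0, std::min(1.0, c));  // guard against rounding outside [-1, 1]
    return std::acos(c) * 180.0 / kPi;
}

// Two camera poses satisfy the descriptor elimination criterion when the ray
// angle they form at the landmark is below the threshold (e.g., 5-10 degrees).
bool satisfiesDescriptorEliminationCriterion(const std::array<double, 3>& cam_a,
                                             const std::array<double, 3>& cam_b,
                                             const std::array<double, 3>& landmark,
                                             double ray_angle_threshold_deg = 10.0) {
    return rayAngleDeg(cam_a, cam_b, landmark) < ray_angle_threshold_deg;
}
```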
[0037] Similarly, for the second landmark 402B, a third image ray 408C connects the second camera position to the second landmark 402B, and a fourth image ray 408D connects the first camera position to the second landmark 402B. The third and fourth image rays 408C and 408D form a ray angle 412 connecting the second and first camera positions to the second landmark 402B. The ray angle 412 is less than the ray angle threshold, and therefore, the third and fourth image rays 408C and 408D and the first and second camera positions satisfy the descriptor elimination criterion. One of the landmark descriptors associated with the first and second keyframes 404A and 404B needs to be eliminated from mapping the second landmark 402B in the scene. In some embodiments shown in Figure 4B, a landmark
descriptor associated with the first keyframe 404A that is captured earlier in time is selected to map the second landmark 402B in the scene, and another landmark descriptor associated with the second keyframe 404B that is captured later in time is disabled from landmark mapping. Conversely, in some embodiments not shown in Figure 4B, the landmark descriptor associated with the second keyframe 404B that is captured later in time is selected for mapping the landmark 402B in the scene, and the first landmark descriptor associated with the first keyframe 404A that is captured earlier in time is disabled from landmark mapping.
[0038] In some embodiments, after the descriptor elimination criterion is applied, the first, second, and third landmarks 402A-402C are mapped with a set of landmark descriptors determined from a set of keyframes 404 including the first, second, and third keyframes 404A, 404B, and 404C. This set of landmark descriptors includes three landmark descriptors provided by the first keyframe 404A, three landmark descriptors provided by the third keyframe 404C, but only one landmark descriptor provided by the second keyframe 404B. For each landmark 402, no two of the selected landmark descriptors satisfy the descriptor elimination criterion. Stated another way, this set of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses do not satisfy the descriptor elimination criterion. In an example, any two image rays connected to the same landmark 402 form a ray angle that is greater than a ray angle threshold defined by the descriptor elimination criterion.
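For illustration only, the per-landmark selection described above can be sketched as a greedy pass over the landmark's observations in capture order, which keeps an observation only when its image ray forms a sufficiently large angle with the rays of every observation already kept. The sketch below reuses the assumed types from the earlier examples and reflects the embodiment in which earlier keyframes take priority over later ones.

```cpp
#include <array>
#include <vector>
// Assumes CameraPose, LandmarkObservation, and rayAngleDeg from the earlier sketches.

// Greedily selects, in capture order, the observations whose image rays form an
// angle of at least ray_angle_threshold_deg with every previously kept ray, so
// that no pair of selected camera poses satisfies the descriptor elimination
// criterion. Observations from later keyframes are eliminated first.
std::vector<LandmarkObservation> selectDescriptors(
        const std::vector<LandmarkObservation>& observations,
        const std::vector<CameraPose>& keyframe_poses,
        const std::array<double, 3>& landmark_position,
        double ray_angle_threshold_deg) {
    std::vector<LandmarkObservation> kept;
    for (const LandmarkObservation& obs : observations) {
        bool redundant = false;
        for (const LandmarkObservation& k : kept) {
            double angle = rayAngleDeg(keyframe_poses[obs.keyframe_id].position,
                                       keyframe_poses[k.keyframe_id].position,
                                       landmark_position);
            if (angle < ray_angle_threshold_deg) { redundant = true; break; }
        }
        if (!redundant) kept.push_back(obs);
    }
    return kept;
}
```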
[0039] In some embodiments, the descriptor elimination criterion eliminates one of two landmark descriptors that are determined from two keyframes 404 for the same landmark 402 based on a distance between the landmark 402 and each of two camera positions corresponding to these two keyframes 404. For example, for the first landmark 402A, if the ray angle 410 is less than the ray angle threshold, the second landmark descriptor determined from the second keyframe 404B is eliminated in accordance with a determination that a first distance between the first landmark 402A and the first camera position of the first keyframe 404A is greater than a second distance between the first landmark 402A and the second camera position of the second keyframe 404B. Conversely, in some embodiments, in accordance with a determination that the ray angle 410 is less than the ray angle threshold and that the second distance is less than the first distance, the first landmark descriptor determined from the first keyframe 404A is eliminated, and the second landmark descriptor determined from the second keyframe 404B is selected.
[0040] Referring to Figure 4C, in some embodiments, a target keyframe 404 (e.g., the second keyframe 404B) satisfies an image elimination criterion and is entirely eliminated
from mapping the plurality of landmarks 402 in the scene. The target keyframe 404 is captured to include a first number of landmarks 402 (e.g., 100 landmarks) corresponding to the first number of landmark descriptors. After the descriptor elimination criterion is applied, a second number of landmark descriptors corresponding to a subset of landmarks are disabled from mapping the subset of landmarks. In accordance with a determination that the first or second number satisfies the image elimination criterion, the target keyframe 404 is eliminated entirely, and the first number of landmark descriptors associated with this target keyframe 404 is eliminated from mapping the plurality of landmarks 402 in the scene. Further, in some embodiments, the image elimination criterion requires that a ratio of the second number to the first number exceeds a predetermined threshold (e.g., 90%). This implies that if a large portion of the landmark descriptors provided by the target keyframe 404 are eliminated, the target keyframe 404 is not efficiently utilized for mapping the scene and needs to be eliminated, leaving space to store information of more efficiently utilized keyframes 404.
[0041] Additionally, upon elimination of the target keyframe 404, if a landmark descriptor associated with the target keyframe has already been selected to map one of the plurality of landmarks based on the descriptor elimination criterion, selection of the selected landmark descriptor is aborted for mapping the one of the plurality of landmarks.
[0042] In an example, the target keyframe 404 is the second keyframe 404B processed to provide three landmark descriptors to the three landmarks 402A-402C. Two of the three landmark descriptors determined from the second keyframe 404B are eliminated based on the descriptor elimination criterion, and only one of these three landmark descriptors is left in association with the third landmark 402C. Given that two out of the three landmark descriptors provided by the keyframe 404B are eliminated, the second keyframe 404B is entirely eliminated from mapping the scene. Although one of the three landmark descriptors is not eliminated by the descriptor elimination criterion, the one of the three landmark descriptors is still eliminated with the second keyframe 404B from mapping the third landmark 402C.
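For illustration only, the image elimination criterion can be sketched as a pass that drops every keyframe whose eliminated-descriptor ratio reaches a threshold and then removes the surviving descriptors of those keyframes from the mapping data, as in the following C++ example. The per-keyframe counters and the function name are editorial assumptions.

```cpp
#include <algorithm>
#include <unordered_map>
#include <unordered_set>
// Assumes MappingData and LandmarkObservation from the earlier sketches.

// Applies the image elimination criterion: a keyframe whose ratio of eliminated
// descriptors (second number) to provided descriptors (first number) meets or
// exceeds the threshold (e.g., 0.9) is dropped entirely, and any of its
// descriptors that had survived the descriptor elimination criterion are also
// removed from the mapping data.
void applyImageEliminationCriterion(MappingData& mapping_data,
                                    const std::unordered_map<int, int>& provided_count,
                                    const std::unordered_map<int, int>& eliminated_count,
                                    double ratio_threshold) {
    std::unordered_set<int> keyframes_to_drop;
    for (const auto& [keyframe_id, provided] : provided_count) {
        auto it = eliminated_count.find(keyframe_id);
        int eliminated = (it == eliminated_count.end()) ? 0 : it->second;
        if (provided > 0 &&
            static_cast<double>(eliminated) / provided >= ratio_threshold) {
            keyframes_to_drop.insert(keyframe_id);
        }
    }
    for (auto& [landmark_id, observations] : mapping_data) {
        observations.erase(
            std::remove_if(observations.begin(), observations.end(),
                           [&](const LandmarkObservation& obs) {
                               return keyframes_to_drop.count(obs.keyframe_id) > 0;
                           }),
            observations.end());
    }
}
```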
[0043] Figure 5A is a diagram of a virtual space 500 that is mapped with a plurality of landmarks 402 associated with a first set of keyframes 520, in accordance with some embodiments, and Figure 5B is another diagram of the virtual space 500 that is mapped with the plurality of landmarks 402 associated with a second set of keyframes 540, in accordance with some embodiments. The second set of keyframes 540 are simplified from the first set of keyframes 520, and the first set of keyframes 520 include the second set of keyframes 540. The first set of keyframes 520 has one or more additional keyframes besides the second set of
keyframes 540. The one or more additional keyframes are eliminated, not selected, or deselected based on a combination of a descriptor elimination criterion and an image elimination criterion. By these means, landmark descriptors determined from the second set of keyframes 540 can sufficiently map the scene and identify a current camera pose of a current frame 504C, while requiring a smaller storage space to store information of the second set of keyframes 540 including the landmark descriptors determined therefrom.
[0044] A camera 260 is placed in a scene and captures a plurality of images. The plurality of images are applied as keyframes 404 to map the scene to the virtual space 500. The scene includes a plurality of objects 502 that are associated with a plurality of landmarks 402. For example, each landmark 402 is associated with a corner or an edge of a respective object 502. Each keyframe 404 records a portion of the scene including a subset of the objects 502, and therefore, includes a subset of landmarks 402. Each keyframe 404 is captured when the camera 260 has a camera pose 406 (i.e., is positioned at a camera position and oriented with a camera orientation). The subset of landmarks 402 in each keyframe 404 are determined based on the camera pose 406 corresponding to the respective keyframe 404. For each of the subset of landmarks 402, the respective keyframe 404 is processed to extract a landmark descriptor for the respective landmark 402. From a perspective of each landmark 402, the respective landmark 402 is recorded in one or more keyframes 404 each of which provides a landmark descriptor associated with a respective camera pose 406 to map the respective landmark in the virtual space 500.
[0045] The first set of keyframes 520 provide a plurality of landmark descriptors. Each of the first set of keyframes 520 provides one or more respective landmark descriptors to map one or more respective landmarks 402. The descriptor elimination criterion is applied to eliminate a subset of the plurality of landmark descriptors. In some embodiments, a first subset of keyframes 404-1 are entirely eliminated as all of the one or more respective landmark descriptors provided by each keyframe 404-1 are disabled from landmark mapping based on the descriptor elimination criterion. In some embodiments, a second subset of keyframes 404-2 are not eliminated by the descriptor elimination criterion. Each keyframe 404-2 provides a plurality of respective landmark descriptors, and at least one of the respective landmark descriptors provided by each keyframe 404-2 is disabled from landmark mapping based on the descriptor elimination criterion. In some embodiments, a third subset of keyframes 404-3 are not impacted by the descriptor elimination criterion, and all of the one or more respective landmark descriptors provided by each keyframe 404-3 are selected to map corresponding landmarks 402.
[0046] In some embodiments, the image elimination criterion is further applied to determine whether to eliminate each keyframe 404 in the second subset of keyframes 404-2. The image elimination criterion does not impact the third subset of keyframes 404-3, which do not correspond to any landmark descriptor that is eliminated by the descriptor elimination criterion. Each of the second subset of keyframes 404-2 provides a respective first number of landmark descriptors to the respective first number of landmarks 402, and a respective second number of landmark descriptors are not selected due to the descriptor elimination criterion. In some embodiments, the second subset of keyframes 404-2 includes one or more keyframes 404-2A. For each keyframe 404-2A, the respective first or second number satisfies the image elimination criterion (e.g., which requires that a ratio of the respective second number to the respective first number exceeds a predetermined threshold), and the respective keyframe 404-2A is eliminated and not shown in the second set of keyframes 540 in Figure 5B. In some embodiments, the second subset of keyframes 404-2 includes one or more keyframes 404-2B. For each keyframe 404-2B, the respective first or second number does not satisfy the image elimination criterion, and the respective keyframe 404-2B is not eliminated and thereby shown in the second set of keyframes 540 in Figure 5B.
[0047] In some embodiments, the second set of keyframes 540 include at least one keyframe 404-2B and at least one keyframe 404-3. Alternatively, in some embodiments not shown, the second set of keyframes 540 does not include any keyframe 404-2B, and any keyframe 404-2 that has unselected landmark descriptors is eliminated after the image elimination criterion is applied. Alternatively, in some embodiments not shown, the second set of keyframes 540 does not include any keyframe 404-3, and all keyframes 404 are impacted by the descriptor elimination criterion. Based on the descriptor or image elimination criterion, the first set of keyframes 520 are simplified to the second set of keyframes 540. For each landmark 402 in the scene, the one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses do not satisfy the descriptor elimination criterion. Mapping data of the scene is generated based on information of the second set of the keyframes 540 including one or more landmark descriptors and one or more associated camera poses 406 corresponding to each landmark 402 in the scene. Each landmark descriptor corresponds to a respective keyframe in the second set of keyframes 540 and a respective camera pose 406. The first set of keyframes 520 is compressed to the second set of keyframes 540, and the second set of keyframes 540 includes a smaller number of keyframes than that of the first set of keyframes 520, thereby conserving storage space required for saving keyframe-related information.
[0048] Referring to Figure 5B, after creating the mapping data of the scene, a current frame 504C is used for SLAM, i.e., to identify a current camera pose of the camera 260 and update mapping of the scene in a synchronous manner. After an electronic device (e.g., AR glasses 104D) obtains the current frame 504C, the electronic device extracts a plurality of feature points 506 from the current frame, e.g., based on a CNN. Each of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C. For each of the plurality of feature points 506, the image descriptor is compared with the mapping data to identify a matching landmark 402 based on the second set of keyframes 540. The current camera pose where the current frame 504C is captured is determined based on the respective camera pose corresponding to the matching landmark 402.
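For illustration only, the descriptor comparison can be sketched as a nearest-neighbor search over the stored landmark descriptors. The C++ example below assumes binary descriptors compared by Hamming distance, non-negative landmark ids, and an illustrative distance threshold; none of these choices is mandated by this application.

```cpp
#include <array>
#include <bitset>
#include <climits>
#include <cstdint>
// Assumes MappingData and LandmarkObservation from the earlier sketches.

// Hamming distance between two 256-bit binary descriptors.
int hammingDistance(const std::array<uint8_t, 32>& a, const std::array<uint8_t, 32>& b) {
    int d = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        d += static_cast<int>(std::bitset<8>(static_cast<uint8_t>(a[i] ^ b[i])).count());
    }
    return d;
}

// Returns the id of the landmark whose stored descriptor is closest to the image
// descriptor of a feature point, or -1 when no stored descriptor is close enough.
int matchLandmark(const MappingData& mapping_data,
                  const std::array<uint8_t, 32>& image_descriptor,
                  int max_distance) {
    int best_landmark = -1;
    int best_distance = INT_MAX;
    for (const auto& [landmark_id, observations] : mapping_data) {
        for (const LandmarkObservation& obs : observations) {
            int d = hammingDistance(obs.descriptor, image_descriptor);
            if (d < best_distance) {
                best_distance = d;
                best_landmark = landmark_id;
            }
        }
    }
    return best_distance <= max_distance ? best_landmark : -1;
}
```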
[0049] In some embodiments, the current camera pose is interpolated from two or more camera poses corresponding to two or more keyframes 404 in the second set of keyframes 540. The plurality of feature points 506 of the current frame 504C include a first feature point 506A. The electronic device determines that a first image descriptor of the first feature point 506A is a combination of two landmark descriptors of the matching landmark 402. The two landmark descriptors of the matching landmark 402 are determined from two distinct keyframes 404. The current camera pose of the current frame 504C is determined based on two camera poses corresponding to the two distinct keyframes 404 from which the two landmark descriptors are determined. For example, the first image descriptor of the first feature point 506A in the current frame 504C is a combination of two landmark descriptors of the first landmark 402A. The two landmark descriptors of the first landmark 402A are determined by two keyframes 508 and 510. The current camera pose of the current frame 504C is determined based on the two camera poses corresponding to the two keyframes 508 and 510. For example, the current camera pose of the current frame 504C is equal to a weighted average of the two camera poses of the two keyframes 508 and 510.
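For illustration only, the weighted average of two camera poses mentioned above can be computed by blending the camera positions linearly and the camera orientations by normalized quaternion interpolation, as in the following C++ sketch, which assumes unit quaternions and an illustrative weight parameter.

```cpp
#include <array>
#include <cmath>
// Assumes CameraPose from the earlier sketches.

// Weighted blend of two camera poses: linear interpolation of the positions and
// normalized linear interpolation (nlerp) of the unit quaternions. The weight w
// is the contribution of pose b, e.g., w = 0.5 for an unweighted average.
CameraPose interpolatePose(const CameraPose& a, const CameraPose& b, double w) {
    CameraPose out;
    for (int i = 0; i < 3; ++i) {
        out.position[i] = (1.0 - w) * a.position[i] + w * b.position[i];
    }
    // Flip one quaternion if needed so both lie on the same hemisphere.
    double dot = 0.0;
    for (int i = 0; i < 4; ++i) dot += a.orientation[i] * b.orientation[i];
    double sign = (dot < 0.0) ? -1.0 : 1.0;
    double norm = 0.0;
    for (int i = 0; i < 4; ++i) {
        out.orientation[i] = (1.0 - w) * a.orientation[i] + w * sign * b.orientation[i];
        norm += out.orientation[i] * out.orientation[i];
    }
    norm = std::sqrt(norm);
    for (int i = 0; i < 4; ++i) out.orientation[i] /= norm;
    return out;
}
```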
[0050] Additionally, in some embodiments, a second feature point 506B is extracted from the current frame 504C, and corresponds to a second image descriptor determined from the current frame 504C. In accordance with a determination that the second image descriptor does not match the landmark descriptors in the mapping data, the current frame 504C is not included in and cannot be determined from a combination of the second set of keyframes 540. The mapping data is updated by associating a second landmark 402B with the second image descriptor of the second feature point 506B and the current camera pose associated with the current frame 504C. Conversely, in accordance with a determination that the second image
descriptor matches a subset of the landmark descriptors in the mapping data, the mapping data is not updated with the information related to the second image descriptor.
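For illustration only, the match-or-update behavior described above can be sketched as follows in C++, reusing the assumed types and the matchLandmark helper from the earlier examples; the next_landmark_id counter and parameter names are editorial assumptions.

```cpp
#include <array>
#include <cstdint>
#include <vector>
// Assumes MappingData, LandmarkObservation, CameraPose, and matchLandmark from
// the earlier sketches. Keyframe poses live in a parallel vector so that each
// observation can refer to its camera pose by index.

void updateMapWithFeature(MappingData& mapping_data,
                          std::vector<CameraPose>& keyframe_poses,
                          const std::array<uint8_t, 32>& image_descriptor,
                          const CameraPose& current_pose,
                          int max_match_distance,
                          int& next_landmark_id) {
    // A matched descriptor is already represented in the mapping data; only an
    // unmatched descriptor adds a new landmark associated with the current pose.
    if (matchLandmark(mapping_data, image_descriptor, max_match_distance) >= 0) return;
    keyframe_poses.push_back(current_pose);
    LandmarkObservation obs{image_descriptor, static_cast<int>(keyframe_poses.size()) - 1};
    mapping_data[next_landmark_id++].push_back(obs);
}
```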
[0051] Figure 6 is a flowchart of a method 600 for simultaneous localization and mapping (SLAM), in accordance with some embodiments. In some embodiments, the method is applied in the AR glasses 104D, robotic systems, or autonomous vehicles. For convenience, the method 600 is described as being implemented by an electronic system 200 (e.g., a client device 104). In an example, the method 600 is applied to determine and predict poses, map a scene, and render both virtual and real content concurrently in extended reality (e.g., VR, AR). Method 600 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system. Each of the operations shown in Figure 6 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the electronic system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 600 may be combined and/or the order of some operations may be changed.
[0052] The electronic system has a camera 260, and obtains (602) image data including a plurality of images (also called keyframes) 404 captured by the camera 260 in a scene. Each of the plurality of images 404 includes a first landmark 402A. The electronic system generates (604) a plurality of landmark descriptors of the first landmark 402A from the image data, and identifies (606) a plurality of camera poses 406 for the plurality of landmark descriptors. Each landmark descriptor is generated (608) from a distinct image 404 that includes the first landmark 402A and is captured at a distinct camera pose 406. In accordance with a determination that two of the plurality of camera poses 406 satisfy a descriptor elimination criterion (e.g., are substantially close to each other), the electronic system selects (610) a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses 406 to map the first landmark 402A in the scene.
[0053] In some embodiments, in accordance with the determination that the two of the plurality of camera poses 406 satisfy the descriptor elimination criterion, the electronic system does not select (i.e., disables) (612) a second distinct landmark descriptor corresponding to a second one of the two of the plurality of camera poses 406 from mapping
the first landmark 402A in the scene. Further, in some embodiments, the first landmark descriptor is generated (614) from a first image 404A that is captured earlier than a second image 404B from which the second distinct landmark descriptor is generated, and the second distinct landmark descriptor has a priority of being eliminated over the first landmark descriptor based on the descriptor elimination criterion.
[0054] In some embodiments, the electronic system maps the first landmark 402A with a subset of the plurality of landmark descriptors. The subset of the plurality of landmark descriptors correspond to a subset of distinct camera poses 406 where the subset of the plurality of images 404 are captured. The subset of the plurality of landmark descriptors are selected in accordance with a determination that any two of the subset of distinct camera poses 406 do not satisfy the descriptor elimination criterion.
[0055] In some embodiments, the plurality of images includes (616) a target image having a first number of landmarks corresponding to a first number of landmark descriptors. For the target image, based on the descriptor elimination criterion, the electronic system determines (618) that a second number of landmark descriptors corresponding to a subset of landmarks are disabled from mapping the subset of landmarks. In accordance with a determination that the first or second number satisfies an image elimination criterion, the electronic system eliminates (i.e., aborts selecting) (620) the first number of landmark descriptors associated with the target image from mapping the plurality of landmarks in the scene. Further, in some embodiments, the image elimination criterion requires (622) that a ratio of the second number to the first number exceeds a predetermined threshold. In some embodiments, the target image corresponds to the first one of the two of the plurality of camera poses, and the subset of landmarks is distinct from the first landmark. The first number of landmark descriptors are deselected from mapping the plurality of landmarks by aborting selection of the first landmark descriptor for mapping the first landmark 402A. In an example, if 80% or more of the descriptors of the target image are eliminated and disabled from mapping the landmarks in the scene, then all descriptors related to the target image are eliminated. The target image, its camera pose, and its landmark descriptors are not stored for SLAM.
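A minimal sketch of the image elimination check is given below, assuming the ratio-based criterion and echoing the 80% example above; the function name and the default threshold are illustrative assumptions, not values taken from the claims.

```python
def should_eliminate_keyframe(total_descriptors: int,
                              disabled_descriptors: int,
                              ratio_threshold: float = 0.8) -> bool:
    """Return True when a keyframe has lost enough descriptors to be dropped.

    total_descriptors is the first number (all landmark descriptors of the
    target image); disabled_descriptors is the second number (those already
    disabled by the descriptor elimination criterion).
    """
    if total_descriptors == 0:
        return True  # nothing useful remains in this keyframe
    return disabled_descriptors / total_descriptors >= ratio_threshold
```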
[0056] In some embodiments, each distinct camera pose includes a respective camera position and a respective camera orientation. For each landmark descriptor of the first landmark 402A, the electronic system identifies a respective image ray 408 connecting the respective camera position to the first landmark 402A, identifies a ray angle (e.g., 410 or 412 in Figure 4B) between two image rays connecting two camera positions of the plurality of camera poses 406 to the first landmark 402A, and determines that the ray angle is less than a ray angle threshold. In accordance with a determination that the ray angle is less than the ray angle threshold, the two of the plurality of camera poses 406 are determined to satisfy the descriptor elimination criterion. Further, in some embodiments, the ray angle threshold is in a range of [5°-10°].
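As an illustration only, the ray-angle form of the descriptor elimination criterion might be expressed as follows; the function signature and the 5° default are assumptions chosen to match the example range above.

```python
import numpy as np

def satisfies_descriptor_elimination(cam_pos_a: np.ndarray,
                                     cam_pos_b: np.ndarray,
                                     landmark: np.ndarray,
                                     angle_threshold_deg: float = 5.0) -> bool:
    """Check whether two camera positions view a landmark from nearly the same direction.

    Each image ray connects a camera position to the landmark; if the angle
    between the two rays is below the threshold, the two poses are considered
    redundant for this landmark.
    """
    ray_a = landmark - cam_pos_a
    ray_b = landmark - cam_pos_b
    cos_angle = np.dot(ray_a, ray_b) / (np.linalg.norm(ray_a) * np.linalg.norm(ray_b))
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return angle_deg < angle_threshold_deg
```

A predicate of this form could serve as the `satisfies_elimination` argument of the greedy selection sketch shown earlier.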
[0057] In some embodiments, mapping data of the scene is generated (626). The mapping data includes one or more landmark descriptors and one or more associated camera poses 406 corresponding to each landmark in the scene. Each landmark descriptor corresponds to a respective image and a respective camera pose. Further, in some embodiments, for each landmark in the scene, the one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses 406 do not satisfy the descriptor elimination criterion.
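One plausible shape for such mapping data is sketched below; the `LandmarkEntry` type and its fields are hypothetical and simply group, per landmark, the surviving descriptors with their associated camera poses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LandmarkEntry:
    position: Tuple[float, float, float]                      # 3D location of the landmark
    descriptors: List[list] = field(default_factory=list)     # one descriptor per kept keyframe
    camera_poses: List[tuple] = field(default_factory=list)   # camera pose associated with each descriptor

# Mapping data: landmark id -> its surviving descriptors and associated camera poses.
MappingData = Dict[int, LandmarkEntry]
```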
[0058] Referring to Figure 5B, in some embodiments, the electronic system obtains a current frame 504C and extracts a plurality of feature points 506 from the current frame. Each of the plurality of feature points 506 corresponds to an image descriptor determined from the current frame 504C. For each of the plurality of feature points 506, the electronic system compares the image descriptor with the mapping data to identify a matching landmark 402 from the plurality of landmarks 402 and determines a current camera pose where the current frame is captured based on the respective camera pose corresponding to the matching landmark.
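A simplified sketch of the descriptor-to-map matching step is shown below, assuming the mapping data maps landmark ids to entries shaped like the hypothetical `LandmarkEntry` above; real systems typically refine the camera pose with a PnP solver over many such 2D-3D matches, which is omitted here. The function name and distance bound are illustrative assumptions.

```python
from typing import Optional

import numpy as np

def match_feature_to_map(image_descriptor: np.ndarray,
                         mapping_data: dict,
                         max_distance: float = 0.7) -> Optional[int]:
    """Return the id of the landmark whose stored descriptor is closest, or None.

    Every surviving landmark descriptor in the map is compared against the
    descriptor extracted from the current frame; the nearest one within a
    distance bound is taken as the matching landmark.
    """
    best_id, best_dist = None, max_distance
    for landmark_id, entry in mapping_data.items():
        for stored in entry.descriptors:
            dist = float(np.linalg.norm(image_descriptor - np.asarray(stored)))
            if dist < best_dist:
                best_id, best_dist = landmark_id, dist
    return best_id
```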
[0059] Additionally, in some embodiments, the plurality of feature points 506 includes a first feature point 506A. The electronic system determines that a first image descriptor of the first feature point 506A is a combination of two landmark descriptors of the matching landmark 402, and determines the current camera pose of the current frame based on two camera poses 406 corresponding to the two images (e.g., 508 and 510 in Figure 5B) from which the two landmark descriptors are determined.
[0060] Further, in some embodiments, the electronic system extracts a second feature point 506B from the current frame 504C. The second feature point 506B corresponds to a second image descriptor determined from the current frame 504C. In accordance with a determination that the second image descriptor does not match the landmark descriptors in the mapping data, the electronic system updates the mapping data by associating a second landmark 402B with the second image descriptor and the current camera pose.
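The map update for an unmatched descriptor might look like the following sketch, which reuses the hypothetical `LandmarkEntry` type from the earlier mapping-data sketch; the landmark position argument stands in for whatever triangulated estimate the system produces and is an assumption of this illustration.

```python
def add_unmatched_feature(mapping_data: dict,
                          new_landmark_id: int,
                          landmark_position: tuple,
                          image_descriptor: list,
                          current_camera_pose: tuple) -> None:
    """Associate an unmatched image descriptor with a new landmark entry.

    Called when a descriptor from the current frame matches nothing in the
    map: a fresh landmark is recorded together with the descriptor and the
    camera pose of the frame that observed it.
    """
    entry = LandmarkEntry(position=landmark_position)
    entry.descriptors.append(image_descriptor)
    entry.camera_poses.append(current_camera_pose)
    mapping_data[new_landmark_id] = entry
```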
[0061] It should be understood that the particular order in which the operations in Figure 6 have been described is merely exemplary and is not intended to indicate that the
described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to use descriptors for SLAM and image rendering as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 4 and 5A-5B are also applicable in an analogous manner to method 600 described above with respect to Figure 6. For brevity, these details are not repeated here.
[0062] This application is directed to a landmark descriptor downsampling method. Camera poses 406 where keyframes 404 are captured are connected with landmarks 402 recognized in the keyframes 404 using image rays 408. Ray angles are formed at each landmark 402 among the image rays connecting the respective landmark 402 to multiple camera poses 406. For each small ray angle (e.g., the ray angles 410 and 412), one of the two image rays forming the respective small ray angle is disconnected, and one of the two camera poses 406 connected to form the two image rays 408 is disabled from providing a landmark descriptor to map the respective landmark 402. The other one of the two camera poses 406 connected to form the two image rays is selected to provide a landmark descriptor to map the respective landmark 402 in the scene. In some embodiments, the scene is further partitioned into a plurality of map tiles, and the landmark descriptors are further archived according to the plurality of map tiles. More details on organizing mapping data with a map-tile data structure are discussed with reference to International Application No. PCT/CN2021/076578, titled "Methods for Localization, Electronic Device and Storage Medium", filed February 10, 2021, which is incorporated by reference in its entirety. Therefore, this application is directed to down-sampling keyframe-related information (e.g., camera poses, descriptors) archived in the map-tile mapping data structure.
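Purely as an illustration of the map-tile archiving mentioned above (the referenced application, not this sketch, defines the actual tile structure), surviving landmark entries could be bucketed by a horizontal tile index; the 10-meter tile size and the function names are assumed values for this example.

```python
import math
from collections import defaultdict

def tile_index(position: tuple, tile_size: float = 10.0) -> tuple:
    """Map a 3D landmark position to the (x, y) index of a horizontal map tile."""
    x, y, _z = position
    return (math.floor(x / tile_size), math.floor(y / tile_size))

def archive_by_tile(mapping_data: dict, tile_size: float = 10.0) -> dict:
    """Group surviving landmark entries into map tiles for archival."""
    tiles = defaultdict(dict)
    for landmark_id, entry in mapping_data.items():
        tiles[tile_index(entry.position, tile_size)][landmark_id] = entry
    return dict(tiles)
```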
[0063] In some embodiments, after image rays and descriptors are eliminated based on the descriptor elimination criterion, a subset of camera poses (e.g., the second camera pose 406B in Figure 4B) retain only a small number of indexes pointing to specific landmarks 402 whose image rays and descriptors remain in use. These camera poses contribute little to improving the 3D depth accuracy of the landmarks 402 and are replaceable with other keyframes. An image elimination criterion is further applied to remove camera poses that have lost too many image rays due to the descriptor elimination criterion. By these means, application of both the descriptor and image elimination criteria helps reduce the number of keyframes and the number of landmark descriptors stored, while still mapping the scene accurately.
[0064] As an alternative, the number of keyframes or the number of landmark descriptors stored for mapping the scene is down-sampled based on the amount of camera motion. The amount of camera motion is compared with a camera motion threshold. The descriptor elimination criterion requires that one of two camera poses 406 is eliminated if the distance between the two camera poses 406 is less than the camera motion threshold. In some embodiments, the camera motion threshold varies with the distance between a corresponding landmark 402 and the two camera poses 406, e.g., the closer the landmark 402 is to a middle point of the two camera poses 406, the smaller the camera motion threshold.
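A hedged sketch of this motion-based alternative is given below; the base threshold and scale factor are invented for illustration, and the exact dependence of the threshold on landmark distance is implementation specific.

```python
import numpy as np

def satisfies_motion_elimination(cam_pos_a: np.ndarray,
                                 cam_pos_b: np.ndarray,
                                 landmark: np.ndarray,
                                 base_threshold: float = 0.5,
                                 scale: float = 0.05) -> bool:
    """Alternative criterion: two camera poses are redundant if they are too close together.

    The motion threshold shrinks as the landmark gets closer to the midpoint
    of the two camera positions, so nearby landmarks tolerate less camera
    motion before one of the two descriptors is considered redundant.
    """
    midpoint = (cam_pos_a + cam_pos_b) / 2.0
    landmark_distance = float(np.linalg.norm(landmark - midpoint))
    motion_threshold = min(base_threshold, scale * landmark_distance)
    camera_motion = float(np.linalg.norm(cam_pos_a - cam_pos_b))
    return camera_motion < motion_threshold
```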
[0065] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.
[0066] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
[0067] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments
were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.
[0068] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.
Claims
1. A method, implemented at an electronic system having a camera, comprising: obtaining image data including a plurality of images captured by the camera in a scene, each of the plurality of images including a first landmark; generating a plurality of landmark descriptors of the first landmark from the image data; identifying a plurality of camera poses for the plurality of landmark descriptors, wherein each landmark descriptor is generated from a distinct image that includes the first landmark and is captured at a distinct camera pose; and in accordance with a determination that two of the plurality of camera poses satisfy a descriptor elimination criterion, selecting a first landmark descriptor corresponding to a first one of the two of the plurality of camera poses to map the first landmark in the scene.
2. The method of claim 1, further comprising: in accordance with the determination that the two of the plurality of camera poses satisfy the descriptor elimination criterion, disabling a second distinct landmark descriptor corresponding to a second one of the two of the plurality of camera poses from mapping the first landmark in the scene.
3. The method of claim 2, wherein the first landmark descriptor is generated from a first image that is captured earlier than a second image from which the second distinct landmark descriptor is generated, and the second distinct landmark descriptor has a priority of being eliminated over the first landmark descriptor based on the descriptor elimination criterion.
4. The method of any of the preceding claims, further comprising: mapping the first landmark with a subset of the plurality of landmark descriptors; wherein the subset of the plurality of landmark descriptors corresponds to a subset of distinct camera poses where a corresponding subset of the plurality of images is captured; and wherein the subset of the plurality of landmark descriptors is selected in accordance with a determination that any two of the subset of distinct camera poses do not satisfy the descriptor elimination criterion.
5. The method of any of the preceding claims, wherein the plurality of images includes a target image having a first number of landmarks corresponding to a first number of landmark descriptors, the method further comprising, for the target image:
based on the descriptor elimination criterion, determining that a second number of landmark descriptors corresponding to a subset of landmarks are disabled from mapping the subset of landmarks; and in accordance with a determination that the first or second number satisfies an image elimination criterion, eliminating the first number of landmark descriptors associated with the target image from mapping the plurality of landmarks in the scene.
6. The method of claim 5, wherein the image elimination criterion requires that a ratio of the second number to the first number exceeds a predetermined threshold.
7. The method of claim 6, wherein the target image corresponds to the first one of the two of the plurality of camera poses, and the subset of landmarks is distinct from the first landmark, and wherein eliminating the first number of landmark descriptors from mapping the plurality of landmarks further comprises: aborting selection of the first landmark descriptor for mapping the first landmark.
8. The method of any of the preceding claims, each distinct camera pose including a respective camera position and a respective camera orientation, further comprising: for each landmark descriptor of the first landmark, identifying a respective image ray connecting the respective camera position to the first landmark; identifying a ray angle between two image rays connecting two camera positions of the plurality of camera poses to the first landmark; and determining that the ray angle is less than a ray angle threshold, wherein in accordance with a determination that the ray angle is less than the ray angle threshold, the two of the plurality of camera poses are determined to satisfy the descriptor elimination criterion.
9. The method of claim 8, wherein the ray angle threshold is in a range of [5°-10°].
10. The method of any of the preceding claims, further comprising: identifying in the scene a plurality of landmarks including the first landmark; and generating mapping data of the scene, the mapping data including one or more landmark descriptors and one or more associated camera poses corresponding to each landmark in the scene, wherein each landmark descriptor corresponds to a respective image and a respective camera pose.
11. The method of claim 10, wherein for each landmark in the scene, the one or more landmark descriptors are selected in accordance with a determination that any two of the one or more associated camera poses do not satisfy the descriptor elimination criterion.
12. The method of claim 10, further comprising: obtaining a current frame; extracting a plurality of feature points from the current frame, wherein each of the plurality of feature points corresponds to an image descriptor determined from the current frame; for each of the plurality of feature points, comparing the image descriptor with the mapping data to identify a matching landmark from the plurality of landmarks; and determining a current camera pose where the current frame is captured based on the respective camera pose corresponding to the matching landmark.
13. The method of claim 12, the plurality of feature points including a first feature point, the method further comprising: determining that a first image descriptor of the first feature point is a combination of two landmark descriptors of the matching landmark; and determining the current camera pose of the current frame based on two camera poses corresponding to the two images from which the two landmark descriptors are determined.
14. The method of claim 12, further comprising: extracting a second feature point from the current frame, wherein the second feature point corresponds to a second image descriptor determined from the current frame; and in accordance with a determination that the second image descriptor does not match the landmark descriptors in the mapping data, updating the mapping data by associating a second landmark with the second image descriptor of the second landmark and the current camera pose of the current frame.
15. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.
16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-14.