WO2023101662A1 - Methods and systems for implementing visual-inertial odometry based on parallel SIMD processing

Methods and systems for implementing visual-inertial odometry based on parallel SIMD processing

Info

Publication number
WO2023101662A1
Authority
WO
WIPO (PCT)
Prior art keywords
elements
visual
feature points
matrix
data
Prior art date
Application number
PCT/US2021/061282
Other languages
English (en)
Inventor
Jun Liu
Youjie XIA
Chieh CHOU
Fan DENG
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2021/061282
Publication of WO2023101662A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012 Head tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03 Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/0304 Detection arrangements using opto-electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/30105 Register structure
    • G06F9/30109 Register structure having multiple operands in a single register
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for implementing a visual-inertial odometry (VIO) using single instruction, multiple data (SIMD) parallel processing.
  • BACKGROUND
  • Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, high frequency pose estimation is enabled by sensor fusion.
  • Asynchronous time warping is often applied with SLAM in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered.
  • relevant image data and inertial sensor data are synchronized and used for estimating and predicting camera poses.
  • the same image data are also used as background images to render virtual objects according to the camera poses.
  • Such a camera pose prediction system is called a visual inertial odometer (VIO).
  • Current VIO systems are implemented based on filter-based approaches or optimization-based approaches.
  • A multi-state constraint Kalman filter (MSCKF) is used in one of the earliest filter-based approaches.
  • Optimization-based approaches are applied in open keyframe-based visual-inertial SLAM (OKVIS), keyframe- and feature-based monocular SLAM (e.g., ORB-SLAM), monocular visual-inertial systems (VINS MONO), and bundle adjustment for visual-inertial SLAM (e.g., ICE-BA).
  • OKVIS introduced a keyframe-based optimization method.
  • VINS MONO proposed a complete solution including loop closure and map generation.
  • ICE-BA adopts an incremental bundle adjustment to achieve fast computation.
  • ORB-SLAM uses ORB features to build point associations between consecutive image frames and adopts an optimization framework similar to that of VINS MONO to achieve accurate pose estimation.
  • Various embodiments of this application are directed to utilizing a parallel single instruction, multiple data (SIMD) processor to perform simultaneous localization and mapping (SLAM) and enable a visual inertial odometer (VIO) for extended reality applications (e.g., AR, VR, and mixed reality (MR)).
  • The parallel SIMD processor uses 1024-bit registers with each single instruction and implements parallel operations on 32 32-bit data items in response to the single instruction.
  • The VIO demands real-time determination of a large number of visual factors and corresponding Schur complements, which often becomes a bottleneck that keeps SLAM from being implemented at a mobile device.
  • SIMD instructions are applied to expedite real-time determination of the visual factors and Schur complements during SLAM. SIMD reduces CPU usage and power consumption, thereby allowing an extended reality application involving SLAM to be implemented in a mobile device.
  • A method is performed by an electronic system to implement a SIMD-based VIO.
  • The method includes obtaining motion data (e.g., measured by an inertial measurement unit (IMU)) and image data (e.g., captured by a camera).
  • The image data have a plurality of feature points.
  • The method further includes, for each feature point, determining a first visual factor from the motion data and image data.
  • The first visual factor includes a plurality of elements arranged in a vector or matrix, and the elements include a first element located at a first position in the vector or matrix.
  • The method further includes grouping the first element of each feature point of the plurality of feature points into a plurality of first element groups, where each first element group includes a predefined number of first elements corresponding to the first visual factors of a subset of the plurality of feature points.
  • The method further includes, for each first element group, storing the predefined number of first elements in a first memory block.
  • The method further includes, for each first element group and in response to a single instruction, simultaneously and in parallel, extracting the predefined number of first elements from the first memory block and converting each of the predefined number of first elements to an alternative element of a second visual factor for each of the subset of feature points.
  • the second visual factor includes a Jacobian matrix.
  • some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating an electronic system for processing data, in accordance with some embodiments.
  • Figure 3 is a flowchart of a process for processing inertial sensor data and image data of an electronic system (e.g., a server, a client device, or a combination of both) using a SLAM module, in accordance with some embodiments.
  • Figure 4 is a temporal diagram illustrating a plurality of parallel temporal threads of inertial sensor data, depth images, confidence maps, and visual images, in accordance with some embodiments.
  • Figure 5 is a block diagram of a SLAM module configured to determine a device pose of an electronic device in a scene, in accordance with some embodiments.
  • Figure 6 is a block diagram of a VIO module configured to determine a Jacobian matrix using a visual factor, in accordance with some embodiments.
  • Figure 7A is a data structure of position vectors Pdestine_est of feature points of an image, in accordance with some embodiments, and Figure 7B is a data structure of elements of the position vectors stored in a SIMD register, in accordance with some embodiments.
  • Figure 8A is a data structure of matrices M of feature points of an image, in accordance with some embodiments
  • Figure 8B is a data structure of elements of the matrices M stored in a SIMD register, in accordance with some embodiments.
  • Figure 9 is a flow diagram of a SIMD-based visual-inertial odometry method that is implemented at an electronic system having a SIMD register, in accordance with some embodiments.
  • Like reference numerals refer to corresponding parts throughout the several views of the drawings.
  • DETAILED DESCRIPTION
  • Various embodiments of this application are directed to implementing SLAM and enabling a VIO in extended reality applications by one or more processors that execute parallel SIMD instructions with one or more SIMD registers. SLAM computation with each image frame or inertial sensor data is converted based on structures of the SIMD register and instructions.
  • Each SIMD register has 1024 bits configured to store 32 single floating point data items, thereby allowing 32 single floating point arithmetic operations to be implemented on the 32 data items in response to a single instruction.
  • Each SIMD register uses one data item space to store one scalar variable of an element of a visual factor, and the entire register space is used to store the same scalar variables of the visual factors of multiple feature points, rather than holding each small matrix of 3×3 or 6×6 as a whole. Matrix-based calculation in SLAM is converted into scalar-variable-based calculation.
  • Each matrix-based single calculation loop associated with a visual factor in SLAM is converted to multiple calculation loops (e.g., N×N) on individual scalar variables of the visual factor, where N is a parallel degree.
  • For a 1024-bit parallel ARM Scalable Vector Extension (SVE), the parallel degree N is equal to 32, as illustrated in the sketch below.
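  • For illustration only, the following C++ sketch shows this conversion under assumptions not stated in the patent: a parallel degree N of 32, a per-feature 3×3 matrix as the visual factor, and plain loops standing in for the actual SVE instructions. The names (e.g., Matrix3x3GroupSoA, regroup, scaleAll) are illustrative, not the patent's.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kParallelDegree = 32;  // lanes per 1024-bit register (32 x 32-bit floats)

// Array-of-structures form: each feature point carries its own 3x3 matrix.
struct Matrix3x3 { float m[3][3]; };

// Structure-of-arrays form: element (r, c) of all feature points is stored contiguously,
// so one scalar variable of the visual factor maps onto one SIMD memory block.
struct Matrix3x3GroupSoA {
    std::array<std::array<std::vector<float>, 3>, 3> elem;  // elem[r][c][k] for feature point k
};

Matrix3x3GroupSoA regroup(const std::vector<Matrix3x3>& aos) {
    Matrix3x3GroupSoA soa;
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) {
            soa.elem[r][c].resize(aos.size());
            for (std::size_t k = 0; k < aos.size(); ++k)
                soa.elem[r][c][k] = aos[k].m[r][c];
        }
    return soa;
}

// One matrix-based loop (here: scale every matrix) becomes 3x3 scalar-variable loops, each of
// which maps to SIMD multiplies over 32 lanes (written as a plain inner loop in this sketch).
void scaleAll(Matrix3x3GroupSoA& soa, float s) {
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) {
            std::vector<float>& lanes = soa.elem[r][c];
            for (std::size_t k = 0; k < lanes.size(); k += kParallelDegree)
                for (std::size_t lane = 0; lane < kParallelDegree && k + lane < lanes.size(); ++lane)
                    lanes[k + lane] *= s;  // 32 lanes per (conceptual) single instruction
        }
}
```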
  • FIG. 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a depth camera, a visible light camera).
  • the one or more client devices 104 include a head-mounted display 104D configured to render extended reality content and including a depth camera for SLAM.
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface.
  • The collected data or user inputs can be processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., the head-mounted display 104D) that executes an interactive online gaming application.
  • The game console receives a user instruction and sends it to a game server 102 with user data.
  • The game server 102 generates a stream of video data based on the user instruction and user data, and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • The one or more communication networks 108 can represent the Internet as a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • In some embodiments, the one or more client devices 104 include a pair of augmented reality (AR) glasses 104D (also called a head-mounted display).
  • the AR glasses 104D includes one or more cameras (e.g., a visible light camera, a depth camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera(s) and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • In some embodiments, both depth data (e.g., a depth map and a confidence map) and the inertial sensor data are applied to determine and predict device poses.
  • the depth and inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render virtual objects with high fidelity or interact with user selectable display items on the user interface.
  • SLAM techniques are applied in the data processing environment 100 to process video data, static image data, or depth data captured by the AR glasses 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the AR glasses 104D is located is mapped and updated. The SLAM techniques are optionally implemented by AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • Figure 2 is a block diagram illustrating an electronic system 200 for processing data, in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • the electronic system 200 typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • The client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for capturing images used for SLAM.
  • the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
  • IMU inertial measurement unit
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202.
  • Memory 206 includes a non-transitory computer readable storage medium.
  • memory 206 or the non- transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212;
  • The SLAM module 232 further includes an IMU data preintegration module 234, a visual frontend feature management module 236, a marginalization module 238, a VIO module 240, and a pose optimization module 242. More details on each of the modules 234-242 are explained below with reference to Figure 5.
  • the one or more databases 250 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 250 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 260 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206 optionally, stores additional modules and data structures not described above.
  • FIG. 3 is a flowchart of a process 300 for processing inertial sensor data (e.g., 406 in Figure 4) and image data (e.g., 408 in Figure 4) of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module 232, in accordance with some embodiments.
  • the process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocation 306, and global pose graph optimization 308.
  • VIO local visual-inertial odometry
  • an RGB camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data.
  • An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (312) to provide data of a variation of device poses 340.
  • the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (314).
  • Vision-only structure-from-motion (SfM) techniques 314 are applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
  • a sliding window 318 and associated states from a loop closure 320 are used to optimize (322) a VIO.
  • When the VIO corresponds (324) to a keyframe of a smooth video transition and a corresponding loop is detected (326), features are retrieved (328) and used to generate the associated states from the loop closure 320.
  • In the global pose graph optimization 308, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (330) based on the states from the loop closure 320, and a keyframe database 332 is updated with the keyframe associated with the VIO.
  • the features that are detected and tracked (310) are used to monitor (334) motion of an object in the image data and estimate image-based poses 336, e.g., according to the image frame rate.
  • the inertial sensor data that are pre-integrated (312) may be propagated (338) based on the motion of the object and used to estimate inertial-based poses 340, e.g., according to a sampling frequency of the IMU 280.
  • the image-based poses 336 and the inertial-based poses 340 are stored in the pose data buffer 246 and used by the module 230 to estimate and predict poses that are used by the pose-based rendering module 234.
  • the module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 336 to estimate and predict more poses 340 that are further used by the pose-based rendering module 234.
  • High frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between the imaging sensors (e.g., the RGB camera 260, a LiDAR scanner) and the IMU 280.
  • The IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., ≈0.1 millisecond).
  • Asynchronous time warping is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered.
  • ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames.
  • Relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction.
  • The image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use.
  • the image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
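  • As a minimal sketch of such a timestamped container (the ImuSample type and the function names are illustrative assumptions, not the patent's code), samples can be kept sorted by their time stamps and located with a binary search:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

// Hypothetical timestamped inertial sample; the patent only states that data are stored in
// STL containers together with their timestamps.
struct ImuSample {
    int64_t timestamp_ns;  // sensor time stamp
    float accel[3];
    float gyro[3];
};

// Insert while keeping the container sorted by timestamp (data usually arrive in order,
// so this degenerates to an append in the common case).
void insertSample(std::deque<ImuSample>& buffer, const ImuSample& s) {
    auto it = std::lower_bound(buffer.begin(), buffer.end(), s,
        [](const ImuSample& a, const ImuSample& b) { return a.timestamp_ns < b.timestamp_ns; });
    buffer.insert(it, s);
}

// Find the first sample measured at or after a query time (e.g., an image capture time),
// so that image data and inertial sensor data can be synchronized.
const ImuSample* findAtOrAfter(const std::deque<ImuSample>& buffer, int64_t query_ns) {
    auto it = std::lower_bound(buffer.begin(), buffer.end(), query_ns,
        [](const ImuSample& a, int64_t t) { return a.timestamp_ns < t; });
    return it == buffer.end() ? nullptr : &*it;
}
```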
  • Figure 4 is a temporal diagram illustrating a plurality of parallel temporal threads 400 of inertial sensor data and image data, in accordance with some embodiments.
  • the plurality of parallel temporal threads 400 include a first temporal thread 402 of inertial sensor data and a second temporal thread 404 of image data.
  • the first temporal thread 402 includes a temporally-ordered sequence of inertial sensor data 406 measured by the IMU 280 at a sampling frequency.
  • the second temporal thread 404 of image data includes a temporally-ordered sequence of visual images 408 captured by the camera 260 at an image frame rate.
  • a three-dimensional (3D) map 410 of the scene is created based on the sequence of inertial sensor data 406 and the sequence of images 408, and used to localize the camera 260 and determine device poses.
  • extended reality content is rendered on the visual images 408 or another distinct sequence of images captured by a distinct camera.
  • the extended reality content is rendered at a display refresh rate and based on the device poses determined from the 3D maps and inertial sensor data 406.
  • the display refresh rate of the extended reality content is optionally equal to or distinct from the image frame rate of the visual images 408.
  • a virtual object is rendered on the one or more visual images 408, or a virtual environment is mixed with the one or more visual images 408.
  • Each inertial sensor data 406 has a sampling frequency (e.g., 1000 Hz) and a sensor latency (e.g., 0.1 ms) from being captured by the IMU 280 to being available to be used by a pose determination and prediction module 230.
  • Each image 408 has an image frame rate (e.g., 30 Hz) and a first image latency (e.g., 20 ms) from being captured by the camera 260 to being available to be used by the pose determination and prediction module 230.
  • the sampling frequency of the inertial sensor data 406 is greater than the image frame rate, and the sensor latency is less than the first image latency.
  • a temporal position of each of the inertial sensor data 406 and visual images 408 on the temporal threads 402 and 404 corresponds to a time when the corresponding inertial sensor data 406 or visual image 408 is available to be used for camera pose determination or image rendering.
  • Each visual image 408 is obtained with an image time stamp indicating an image capture time when the visual image 408 is captured.
  • Each item of motion sensor data is obtained with a sensor time stamp indicating a sensor time when the motion sensor data is measured.
  • The camera 260 has a camera pose at a pose time. If the camera pose (i.e., an image-based device pose) is determined based on a visual image, the pose time corresponds to the image capture time of that visual image.
  • If the camera pose is determined based on the motion sensor data, the pose time corresponds to a corresponding sensor time.
  • An earliest visual image 408 is used to identify a plurality of scene feature points of the scene where the camera 260 is located. As each subsequent visual image 408 is captured and processed, more and more scene feature points are identified to confirm the previously identified scene feature points or add new scene feature points of the scene, thereby developing a feature point map 410 of the scene.
  • each of the sequence of 3D maps 410 of the scene is updated with the visual image 408, and includes a plurality of scene feature points (e.g., a table corner, a hand, a tree tip, corners of a window frame, corners of a door).
  • One or more first feature points are identified in a first visual image 408A captured at a first capture time t 10.
  • One or more second feature points are identified in a second visual image 408B captured at a second capture time.
  • the one or more first feature points are compared with the one or more second feature points to determine at least one of the plurality of device poses corresponding to the first or second capture time, e.g., based on a difference of the one or more first feature points and the one or more second feature points.
  • the first or second feature points are optionally used to update the corresponding 3D map 410.
  • Every two consecutive visual images 408 are temporally separated by an image separation ΔT, and each visual image 408 is available the first image latency T L (covering image transfer and processing) after being captured by the depth camera 270.
  • a first visual image 408A is captured at a prior time t 10 (also called a first capture time), and made available at a first time t 11 .
  • a prior visual image 408P that precedes the first visual image 408A has been available and used with motion data 406 to determine a device pose of the camera 260.
  • When the first visual image 408A is available at t 11, the first visual image 408A is used to generate an image-based device pose associated with the prior time t 10, and update the device pose data at the instant t i that is between t 10 and t 11 retroactively.
  • the first visual image 408A is also used to determine subsequent device pose data at any instant t j that follows the first time t 11 and precedes a subsequent time t 21 , in real time or predictively, before a subsequent visual image 408B is available at the subsequent time t 21 .
  • The image-based device pose associated with the prior time t 10 is determined based on the first visual image 408A, and can be used to determine the device pose within a temporal range that lasts for a combined duration 412 of the image separation ΔT and the first image latency T L, i.e., from the prior time t 10 to the subsequent time t 21.
  • the combined duration 412 corresponds to a sliding window enclosing each image 408, and the camera pose is determined within the combined duration 412.
  • an Extended Kalman Filter (EKF) or Error State Kalman Filter (ESKF) is applied to obtain poses at least between the prior time t 10 of capturing the first visual image 408A and the subsequent time t 21 of obtaining the subsequent visual image 408B.
  • the first visual image 408A is available at the first time t 11 , and the inertial sensor data 406 (e.g., acceleration, angular velocity, and corresponding bias and noise) captured between the prior and first times t 10 and t 11 have already been available.
  • the inertial sensor data 406 captured between the prior and current times t 10 and t i are optionally processed using an integration operation to provide a variation of device poses between t10 and ti, and the device pose is updated using the image- based device pose that is retroactively generated for the prior time t 10 and the variation of device poses between t 10 and t i .
  • the inertial sensor data 406 captured between t 11 and tj are also made available, and the device pose is estimated and determined using the image-based device pose that is retroactively generated for the prior time t 10 and a variation of device poses between t 10 and t j , which is integrated from the inertial sensor data 406 between t 10 and t j based on the integration operation.
  • the visual images 408 and the inertial sensor data 406 are applied to predict device poses.
  • When the first visual image 408A is available at t 11, the first visual image 408A is used to generate an image-based device pose associated with the prior time t 10 retroactively.
  • Each inertial sensor data 406 captured between t 10 and t 21 corresponds to a relative device pose variation from an immediately preceding inertial sensor data, and can be combined with the image-based device pose associated with the first visual image 408A to determine a device pose at each specific time between t 10 and t 21 .
  • The at least one device pose is used to predict an upcoming device pose that has not occurred or for which its corresponding inertial sensor data 406 have not been available yet. For example, after a current device pose is determined retroactively for the instant t i based on the inertial sensor data between t 10 and t i, the current device pose associated with t i and the image-based device pose associated with the prior time t 10 are applied to derive a subsequent device pose at the subsequent instant t j between t 11 and t 21, e.g., by linear extrapolation.
  • a series of device poses are determined for a series of times between t 10 and t 11 based on the image-based device pose associated with the prior time t 10 and the inertial sensor data 406.
  • the series of device poses are applied to predict a subsequent device pose at the subsequent instant t j , e.g., by linear extrapolation.
  • a plurality of camera poses are determined based on the inertial sensor data 406 and visual images 408.
  • the camera poses optionally have the same sampling frequency as the inertial sensor data 406, and can be applied to derive more intermediate device poses by interpolation if needed.
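  • For illustration, the following Eigen-based sketch interpolates (or, with a parameter greater than 1, linearly extrapolates) a device pose between two timestamped poses; the Pose type and function name are assumptions for the sketch, not the patent's formulation.

```cpp
#include <Eigen/Geometry>

struct Pose {
    Eigen::Vector3d t;     // translation
    Eigen::Quaterniond q;  // rotation (unit quaternion)
};

// Interpolate (0 <= s <= 1) or linearly extrapolate (s > 1) between two timestamped poses.
// Translation is interpolated linearly; rotation uses spherical linear interpolation, which is
// documented for s in [0, 1] but also extends smoothly for modest extrapolation.
Pose interpolatePose(const Pose& p0, double t0, const Pose& p1, double t1, double t_query) {
    const double s = (t_query - t0) / (t1 - t0);
    Pose out;
    out.t = p0.t + s * (p1.t - p0.t);
    out.q = p0.q.slerp(s, p1.q).normalized();
    return out;
}
```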
  • each visual image 408 is also used to render the extended reality content based on the camera poses of the camera 260.
  • the visual image 408A or 408B is rendered with one or more virtual objects overlaid on top of the visual image 408A or 408B, respectively.
  • the virtual objects are rendered on the visual image 408A from a perspective of the camera 260 that has captured the visual images 408A and 408B.
  • the visual image 408A corresponds to an instantaneous camera pose of the camera 260 at an instant t j between the times t 11 and t 21 when the first visual image 408A and subsequent visual image 408B are available.
  • the instantaneous camera pose of the RGB camera 260 corresponding to the visual image 408A is determined from a device pose of the camera 260 corresponding to the instant t j , while the device pose of the camera 260 corresponding to the instant t j is determined in real time or extrapolated from the image- based device pose corresponding to the first visual image 408A and the corresponding inertial sensor data 406 between t 10 and t j .
  • The device pose of the depth camera 270 corresponding to the subsequent instant t j is retroactively updated after the subsequent visual image 408B of the camera 260 is available at the subsequent time t 21, and the instantaneous camera pose of the RGB camera 260 corresponding to the visual image 408A is updated retroactively as well.
  • the extended reality content is being currently displayed with the first image 408A.
  • the overlying first image 408A is updated to the second image 408B, and the rendered virtual objects are immediately updated based on the update of the instantaneous camera pose corresponding to the visual image 408B.
  • the overlying first image 408A is updated immediately, while the virtual objects are rendered with a latency based on the update of the instantaneous camera pose corresponding to the visual image 408B. Additionally and alternatively, in some situations, when the second image 408B is available at t 21 , the overlying first image 408A is not updated immediately, regardless of the update of the instantaneous camera pose corresponding to the visual image 408B.
  • FIG. 5 is a block diagram of a SLAM module 232 (also called a VIO system) configured to determine a camera pose of an electronic device (e.g., AR glasses 104D) having a camera 260 in a scene, in accordance with some embodiments.
  • the SLAM module 232 is configured to determine camera poses 502 based on the inertial sensor data 406 and image data 408 as shown in Figure 4.
  • the SLAM module 232 includes an IMU data preintegration module 234 and a visual frontend feature management module 236.
  • the IMU data preintegration module 234 is configured to estimate a pose change between two consecutive image frames (e.g., between t 10 and t i , between t 10 and t j , between t 11 and t 21 when the images 408A and 408B are available) using integration of the inertial sensor data 406.
  • the visual frontend feature management module 236 is configured to detect 2D features points in each image 408 and track these 2D feature points among successive image frames 408. As such, the IMU data preintegration module 234 and visual frontend feature management module 236 are configured to generate a plurality of IMU factors 504 and a plurality of input visual factors 506, respectively.
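  • As a rough sketch of the integration idea only (biases, noise terms, and gravity compensation are omitted, and this is not the patent's exact preintegration), the relative pose change between two image frames can be accumulated from the inertial readings as follows:

```cpp
#include <Eigen/Geometry>
#include <vector>

struct ImuReading {
    double dt;             // seconds since the previous reading
    Eigen::Vector3d accel; // m/s^2, body frame
    Eigen::Vector3d gyro;  // rad/s, body frame
};

struct PoseDelta {
    Eigen::Quaterniond dq = Eigen::Quaterniond::Identity();  // accumulated rotation
    Eigen::Vector3d dv = Eigen::Vector3d::Zero();            // accumulated velocity change
    Eigen::Vector3d dp = Eigen::Vector3d::Zero();            // accumulated position change
};

// Accumulate the pose change between two consecutive image frames from the IMU readings
// measured in between.
PoseDelta integrate(const std::vector<ImuReading>& readings) {
    PoseDelta d;
    for (const ImuReading& r : readings) {
        const Eigen::Vector3d a = d.dq * r.accel;  // rotate acceleration into the start frame
        d.dp += d.dv * r.dt + 0.5 * a * r.dt * r.dt;
        d.dv += a * r.dt;
        const Eigen::Vector3d h = 0.5 * r.gyro * r.dt;  // small-angle quaternion update
        d.dq = (d.dq * Eigen::Quaterniond(1.0, h.x(), h.y(), h.z())).normalized();
    }
    return d;
}
```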
  • the SLAM module 232 further includes a marginalization module 238 configured to enable a sliding window based approach implemented based on a marginalization process.
  • the sliding window based approach is adopted to optimize determination of camera poses associated with a plurality of image frames 408 simultaneously, thereby improving accuracy of pose estimation and reducing discontinuity among camera poses estimated from successive images 408.
  • When a new image frame 408 comes into the sliding window 412, one of the old images 408 exits the sliding window 412. This process is also called marginalization.
  • the marginalization process includes a Gaussian variable elimination procedure that eliminates variables from the sliding window 412.
  • the marginalization module 238 is configured to output a linear constraint equation 508 and a current state vector 510C.
  • the linear constraint equation 508 is determined based on a previous state vector 510P, and sets constraints on the current state vector 510C.
  • the linear constraint equation 508 and state vectors 510 are collectively defined as prior factors 512 in the SLAM module 232.
  • The SLAM module 232 further includes a VIO module 240 configured to receive the IMU factors 504, input visual factors 506, and prior factors 512, and generate a sparse Jacobian matrix 514 and residual vectors.
  • the Jacobian matrix 514 includes three Jacobian parts.
  • The first Jacobian part 514A is associated with the input visual factors 506; for example, each visual factor 506 corresponds to a 2D observation's projection error in an image 408 in the sliding window 412.
  • The first Jacobian part 514A derives a plurality of intermediate visual factors (e.g., >1000 visual factors) from the input visual factors 506.
  • The second Jacobian part 514B is associated with the IMU factors 504.
  • Each IMU factor 504 corresponds to a pose change between two image-based camera poses associated with two distinct image frames 408, and the pose change is obtained by integrating the inertial sensor data 406.
  • each inertial sensor data 406 corresponds to an IMU factor 504, and the number of IMU factors 504 is equal to a total number of inertial sensor data 406 measured in the sliding window 412.
  • the third Jacobian part 514C is associated with the prior factors 512.
  • The linear constraint equations 508 from marginalization are factorized using Cholesky decomposition to obtain the corresponding Jacobian matrix 514. After the sparse Jacobian matrix 514 is determined, the electronic system determines a Hessian matrix 516 using a Schur complement module 518.
  • the third Jacobian part 514C includes a quasi-diagonal matrix. Each subblock of the quasi-diagonal matrix includes a column corresponding to a specific visual feature in the 3D map of the scene.
  • the Schur complement module 518 is configured to implement an accumulation procedure for each feature point and result in the Hessian matrix 516.
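  • For context, the following Eigen sketch shows the usual Schur complement reduction onto the pose variables, assuming the Hessian is partitioned into pose blocks and feature (landmark) blocks; the partitioning and names are illustrative and not taken from the patent.

```cpp
#include <Eigen/Dense>

// Reduce a Hessian [Hpp Hpl; Hpl^T Hll] and gradient [bp; bl] onto the pose variables by
// eliminating the feature variables: S = Hpp - Hpl * Hll^{-1} * Hpl^T, g = bp - Hpl * Hll^{-1} * bl.
void schurComplement(const Eigen::MatrixXd& Hpp, const Eigen::MatrixXd& Hpl,
                     const Eigen::MatrixXd& Hll, const Eigen::VectorXd& bp,
                     const Eigen::VectorXd& bl, Eigen::MatrixXd& S, Eigen::VectorXd& g) {
    // In practice Hll is block-diagonal (one small block per feature point), so the elimination
    // can be accumulated feature by feature; a dense solve is used here for brevity.
    const auto Hll_ldlt = Hll.ldlt();
    S = Hpp - Hpl * Hll_ldlt.solve(Hpl.transpose());
    g = bp - Hpl * Hll_ldlt.solve(bl);
}
```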
  • the SLAM module 232 further includes a pose optimization module 242 configured to determine the camera poses 502 of the camera of the electronic device iteratively.
  • the pose optimization module 242 searches a multi-dimensional space to determine the camera poses 502 by optimizing a cost function (e.g., minimizing the cost function, reducing the cost function below a threshold cost level).
  • a trust region method is applied in the pose optimization module 242.
  • A trust region method starts from an initial guess, and a search radius is adaptively adjusted to keep the linear approximation of the cost function valid.
  • A nonlinear cost function is approximated by a linear cost function, which can be solved to obtain a next step of a searching procedure of the trust region method.
  • A dogleg solver is applied to implement a dogleg step. Specifically, a Gauss-Newton step and a gradient descent step are calculated, and the dogleg solver linearly combines the Gauss-Newton step and the gradient descent step at a Cauchy point.
  • A solution of the Gauss-Newton step is resolved from the Hessian matrix 516, and a Cholesky decomposition or preconditioned conjugate gradient (PCG) solver is applied.
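  • A minimal Eigen sketch of such a dogleg step, in a generic textbook formulation rather than the patent's specific solver (H is assumed positive definite and g is the gradient):

```cpp
#include <cmath>
#include <Eigen/Dense>

// Classic dogleg step: blend the gradient descent (Cauchy) step and the Gauss-Newton step so
// that the combined step stays within the trust-region radius.
Eigen::VectorXd doglegStep(const Eigen::MatrixXd& H, const Eigen::VectorXd& g, double radius) {
    const Eigen::VectorXd p_gn = -H.ldlt().solve(g);              // Gauss-Newton step
    const Eigen::VectorXd p_sd = -(g.dot(g) / g.dot(H * g)) * g;  // Cauchy (steepest descent) step
    if (p_gn.norm() <= radius) return p_gn;                       // full Gauss-Newton step fits
    if (p_sd.norm() >= radius) return (radius / p_sd.norm()) * p_sd;  // truncated gradient step
    // Otherwise walk from the Cauchy point toward the Gauss-Newton point until the radius is hit.
    const Eigen::VectorXd d = p_gn - p_sd;
    const double a = d.squaredNorm();
    const double b = 2.0 * p_sd.dot(d);
    const double c = p_sd.squaredNorm() - radius * radius;
    const double tau = (-b + std::sqrt(b * b - 4.0 * a * c)) / (2.0 * a);
    return p_sd + tau * d;
}
```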
  • FIG. 6 is a block diagram of a VIO module 240 configured to determine a Jacobian matrix 514, in accordance with some embodiments.
  • An electronic system includes an electronic device (e.g., AR glasses 104D) having a camera 260 and an IMU 280, and the electronic device is disposed in a scene where the IMU 280 measures motion data 406 and the camera 260 captures image data 408 (e.g., a sequence of visual images 408).
  • Each visual image 408 captured by the camera has a plurality of feature points that are mapped in a virtual 3D space associated with a scene where the camera 260 is disposed.
  • Each feature point is identified by a feature identification (i.e., feature ID) 602.
  • the VIO module 240 identifies the feature point based on the feature ID 602, and determines a plurality of visual factors 604, which are further converted to the Jacobian matrix 514 according to a series of equations (e.g., equations (1)-(16) below).
  • the plurality of visual factors 604 include one or more feature middle variables 606M that are determined for the feature ID 602 based on the motion data 406 and image data 408.
  • The one or more feature middle variables 606M include the IMU factor 504, input visual factors 506, and prior factor 512.
  • Each of the plurality of visual factors 604 has a plurality of elements.
  • the plurality of visual factors 604 includes a first visual factor.
  • the plurality of elements of the first visual factor 604A are arranged in a vector or matrix.
  • In an example, the first visual factor 604A includes a translation of a source camera pose, T source, which is a vector having three elements.
  • In another example, the first visual factor is a rotation of the source camera pose, Q source, which is a 3×3 matrix.
  • The elements of the first visual factor 604A include a first element located at a first position in the vector or matrix.
  • The three elements of the translation of a source camera pose T source correspond to the x, y, and z axes, respectively, and are arranged in an ordered sequence.
  • The 3×3 matrix of the rotation of the source camera pose Q source has 9 elements arranged in an ordered array having 3 rows and 3 columns, and each element defines a roll, pitch, or yaw portion of the rotation of the source camera pose Q source.
  • The first element of each feature point of the plurality of feature points is grouped into a plurality of first element groups. Each first element group includes a predefined number of first elements corresponding to the first visual factors of a subset of feature points.
  • For example, an image 408 has 64 feature points, and each feature point has the first visual factor 604A (e.g., the translation of a source camera pose, T source).
  • The image 408 corresponds to 64 first visual factors 604A of the 64 feature points.
  • A parallel level of SIMD is 32, and the SIMD-based VIO module 240 applies each operation associated with a SIMD instruction to 32 data items in parallel. Every 32 first elements of the 64 first visual factors 604A of the 64 feature points are grouped into a first element group to be processed in parallel by the SIMD-based VIO module 240.
  • The predefined number of first elements are stored in a first memory block of a SIMD register 248, thereby facilitating parallel processing by the SIMD-based VIO module 240.
  • the 64 first visual factors 604A of the 64 feature points have 64 first elements, and the 64 first elements are grouped into two first element groups.
  • For each first element group, the corresponding 32 first elements are stored in a respective first memory block.
  • the SIMD register 248 includes two first memory blocks for storing all 64 first elements of the 64 first visual factors 604A of the 64 feature points. These two first memory blocks are distinct from each other and do not overlap each other.
  • the SIMD register 248 further includes a plurality of second memory blocks for storing second elements of the 64 first visual factors 604A of the 64 feature points, and each of the second elements is located at a second position in the vector or matrix of the first visual factor 604A. The second position is distinct from the first position.
  • the plurality of second memory blocks are distinct from and do not overlap one another.
  • Each of the plurality of second memory blocks is distinct from and does not overlap any of the two first memory blocks.
  • A device pose is determined based on the first visual factor 604A of each feature point.
  • the electronic system For each first element group, in response to a single instruction, simultaneously and in parallel, the electronic system extracts the predefined number of first elements from the first memory block and converts each of the predefined number of first elements to an alternative element of a second visual factor for each of the subset of feature points.
  • 32 first elements of 32 translation vectors of a source camera pose, T source, are stored in the same first memory block, and extracted jointly to be processed in parallel and converted to the same elements of 32 second visual factors (e.g., P W in equation (2), R CB in equation (7), R CC in equation (8)).
  • The first element is an element of the matrix R CW and is converted to an element of the second visual factor J source_rot, which is part of the Jacobian matrix 514.
  • Each feature point corresponds to a source camera pose and a destination camera pose. Rotation of the source camera pose is a 3×3 matrix Q source, and translation of the source camera pose is T source. Observation in a source camera coordinate frame is P source. Rotation of the destination camera pose is a 3×3 matrix Q destine, and translation of the destination camera pose is T destine. Observation in a destination camera coordinate frame is P destine. D is an inverse depth of the feature point at the source frame.
  • Q ic and T ic are constants associated with rotation and translation transformations from a camera coordinate system to an IMU coordinate system, respectively.
  • A normalization function is applied to divide a length-3 vector by its Z component and transform the divided vector to a homogeneous coordinate system.
  • A SKEW function is applied to generate an anti-symmetric skew matrix from a length-3 vector (i.e., the Lie algebra so(3)).
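  • A minimal Eigen sketch of these two helpers follows; the function names mirror the description and are not the patent's code.

```cpp
#include <Eigen/Dense>

// Divide a length-3 vector by its Z component, i.e., project it onto the normalized
// (homogeneous) image plane.
Eigen::Vector3d normalizeByZ(const Eigen::Vector3d& v) {
    return v / v.z();
}

// Build the 3x3 anti-symmetric (skew) matrix of a length-3 vector a, so that
// skew(a) * b equals the cross product a x b.
Eigen::Matrix3d skew(const Eigen::Vector3d& a) {
    Eigen::Matrix3d S;
    S <<    0.0, -a.z(),  a.y(),
         a.z(),    0.0, -a.x(),
        -a.y(),  a.x(),    0.0;
    return S;
}
```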
  • Inputs include D ∈ R, Q ic, T ic, P source ∈ R 3, Q source, T source ∈ R 3, P destine ∈ R 3, Q destine, T destine ∈ R 3, and variables include P bi, P W, P bj, P destine_est ∈ R 3.
  • The VIO module 240 processes the inputs and variables according to equations (1)-(16). In these equations, Q ic and T ic are global constants.
  • Q source and T source depend on a source frame time stamp.
  • Q destine, T destine, and R CW depend on a destination frame time stamp.
  • R CB and R CC depend on both the source and destination frame time stamps.
  • The above variables are independent of the feature observation.
  • The source time stamp is unique, while a destination image corresponds to multiple observations; P W and skew(P bi) vary with the feature points and are optionally determined before a subset of the visual factors 604 (e.g., P destine_est, J D, J source_rot, J source_translation, or J destine_translation) are determined.
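  • For orientation, the following Eigen sketch shows a standard visual-inertial reprojection chain that is consistent with the variables defined above; it is an assumed textbook-style formulation, not necessarily identical to the patent's equations (1)-(16).

```cpp
#include <Eigen/Dense>

// Transform a source-frame observation P_source with inverse depth D through the source pose
// (Q_source, T_source), the camera-to-IMU extrinsics (Q_ic, T_ic), and the destination pose
// (Q_destine, T_destine) to obtain the estimated destination observation P_destine_est.
Eigen::Vector3d projectToDestination(
    double D,
    const Eigen::Matrix3d& Q_ic, const Eigen::Vector3d& T_ic,
    const Eigen::Vector3d& P_source,
    const Eigen::Matrix3d& Q_source, const Eigen::Vector3d& T_source,
    const Eigen::Matrix3d& Q_destine, const Eigen::Vector3d& T_destine) {
    const Eigen::Vector3d P_bi = Q_ic * (P_source / D) + T_ic;               // source camera -> source IMU
    const Eigen::Vector3d P_W  = Q_source * P_bi + T_source;                 // source IMU -> world
    const Eigen::Vector3d P_bj = Q_destine.transpose() * (P_W - T_destine);  // world -> destination IMU
    return Q_ic.transpose() * (P_bj - T_ic);                                 // destination IMU -> destination camera
}
```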
  • Each visual factor 604 (e.g., including the IMU factor 504, input visual factor 506, or prior factor 512) is a vector or matrix, and the vector or matrix includes a plurality of elements. Each element of the visual factor 604 is a single floating point number.
  • Alternatively, each element of the visual factor 604 (e.g., including the IMU factor 504, input visual factor 506, or prior factor 512) is a fixed-point number.
  • The SLAM module 232 is optionally implemented by a fixed-point DSP with SIMD functions. Each single floating-point scalar variable is represented in a fixed-point format. A shift operation is applied to avoid overflow and underflow, e.g., in accumulators and divisors processing a depth vector. A numerically stable algorithm is applied in a linear solver step.
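  • As a small illustration of the fixed-point idea (the Q16.16 format chosen here is an assumption for the sketch and is not specified by the patent):

```cpp
#include <cstdint>

// Q16.16 fixed-point representation: 16 integer bits, 16 fractional bits.
constexpr int kFracBits = 16;

int32_t toFixed(float x)   { return static_cast<int32_t>(x * (1 << kFracBits)); }
float   toFloat(int32_t x) { return static_cast<float>(x) / (1 << kFracBits); }

// Multiply in a wider accumulator, then shift back down so the product does not overflow.
int32_t fixedMul(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> kFracBits);
}

// Divide by pre-shifting the numerator up so the quotient keeps its fractional bits.
int32_t fixedDiv(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) << kFracBits) / b);
}
```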
  • the VIO module 240 includes a destine pose generator 606 to provide a subset or all of the visual factors 604. The visual factors 604 are applied to generate the Jacobian matrix 514.
  • FIG. 7A is a data structure 700 of position vectors P destine_est of feature points of an image 408, in accordance with some embodiments
  • Figure 7B is a data structure 750 of elements of the position vectors stored in a SIMD register 248, in accordance with some embodiments.
  • The position vector P destine_est is one of the visual factors 604 determined for each of a plurality of feature points of the image 408.
  • Each feature point corresponds to a position vector P destine_est having three elements corresponding to x, y, and z axes.
  • the position vector P destine_est (702) has three elements P destine_est (x)(1), P destine_est (y)(1), and P destine_est (z)(1)
  • the position vector P destine_est (704) has three elements P destine_est (x)(2), P destine_est (y)(2), and P destine_est (z)(2)
  • the position vector P destine_est (706) has three elements P destine_est (x)(N), P destine_est (y)(N), and P destine_est (z)(N).
  • The position vectors P destine_est of the plurality of feature points of the image 408 are stored in memory (e.g., 206 in Figure 2). Elements of each position vector P destine_est are stored together in the same memory block of the memory. For example, the three elements P destine_est (x)(1), P destine_est (y)(1), and P destine_est (z)(1) of the position vector P destine_est are stored in a first set of adjacent memory addresses 702 of the memory.
  • the three elements P destine_est (x)(2), P destine_est (y)(2), and P destine_est (z)(2) of the position vector P destine_est are stored in a second set of adjacent memory addresses 704 of the memory.
  • the three elements P destine_est (x)(N), P destine_est (y)(N), and P destine_est (z)(N) of the position vector P destine_est are stored in an N-th set of adjacent memory addresses 706 of the memory.
  • elements of the position vectors P destine_est of the plurality of feature points of the image 408 are stored in a SIMD register 248 according to positions of the elements.
  • the first elements of the position vectors, Pdestine_est(x)() are grouped to first element groups and stored in first memory blocks 708. For example, in SIMD, 32 data items are processed in parallel, and every 32 first elements of the position vectors P destine_est are grouped and stored together.
  • the first elements of the position vectors, Pdestine_est(x)(1) to Pdestine_est(x)(32), are stored in a first memory block 708A
  • the first elements of the position vectors, P destine_est (x)(33) to P destine_est (x)(64) are stored in a distinct first memory block 708B that is separate from and does not overlap the first memory block 708A.
  • the electronic system groups the second elements of the position vectors, P destine_est (y)(), into a plurality of second element groups.
  • Each second element group corresponds to a first element group and includes the predefined number of second elements of the position vectors, P destine_est (y)().
  • the predefined number of second elements of each second element group are stored in a second memory block 710 that is separate from and does not overlap any of the first memory blocks 708.
  • the second elements of the position vectors, P destine_est (y)(1) to P destine_est (y)(32), are stored in a second memory block 710A
  • the second elements of the position vectors, P destine_est (y)(33) to P destine_est (y)(64) are stored in a distinct second memory block 710B that is separate from and does not overlap the second memory block 710A.
  • the electronic system also groups the third elements of the position vectors, P destine_est (z)(), into a plurality of third element groups. Each third element group corresponds to a first element group and includes the predefined number of third elements of the position vectors, P destine_est (z)().
  • the predefined number of third elements of each third element group are stored in a third memory block 712 that is separate from and does not overlap any of the first and second memory blocks 708 and 710.
  • the third elements of the position vectors, P destine_est (z)(1) to P destine_est (z)(32), are stored in a third memory block 712A
  • the third elements of the position vectors, P destine_est (z)(33) to P destine_est (z)(64), are stored in a distinct third memory block 712B that is separate from and does not overlap the third memory block 712A.
  • the electronic system, for each first element group corresponding to the subset of feature points (e.g., the first 32 feature points), simultaneously and in parallel implements one of an addition operation and a multiplication operation on the first elements of the position vectors, P destine_est (x)(), in the first element group in response to a single instruction (a brief SIMD code sketch of this parallel operation appears after this list).
  • Each of the predefined number of first elements, P destine_est (x)(), is converted to an alternative element of a second visual factor.
  • Figure 8A is a data structure 800 of matrices M of feature points of an image 408, in accordance with some embodiments
  • Figure 8B is a data structure 850 of elements of the matrices M stored in a SIMD register 248, in accordance with some embodiments.
  • the matrix M is one of the visual factors 604 and is determined for a plurality of feature points of the image 408.
  • Each feature point corresponds to a matrix M having six elements.
  • the matrix M (802) has six elements M (1,1)(1), M (1, 2)(1), M (1, 3)(1), M (2, 1)(1), M (2, 2)(1), and M (2, 3)(1)
  • the matrix M (804) has six elements M (1,1)(2), M (1, 2)(2), M (1, 3)(2), M (2, 1)(2), M (2, 2)(2), and M (2, 3)(2).
  • the matrix M (806) has six elements M (1,1)(N), M (1, 2)( N), M (1, 3)( N), M (2, 1)( N), M (2, 2)( N), and M (2, 3)( N).
  • the matrices M of the plurality of feature points of the image 408 are stored in memory (e.g., 206 in Figure 2). Elements of each matrix M are stored together in the same memory block of the memory. For example, the six elements M (1,1)(1), M (1, 2)(1), M (1, 3)(1), M (2, 1)(1), M (2, 2)(1), and M (2, 3)(1) of a matrix M of a first feature point are stored in a first set of adjacent memory addresses 802 of the memory.
  • the six elements M (1,1)(2), M (1, 2)(2), M (1, 3)(2), M (2, 1)(2), M (2, 2)(2), and M (2, 3)(2) of a matrix M of a second feature point are stored in a second set of adjacent memory addresses 804 of the memory.
  • the six elements M (1,1)(N), M (1, 2)( N), M (1, 3)( N), M (2, 1)( N), M (2, 2)( N), and M (2, 3)( N) of the matrix M of an N-th feature point are stored in an N-th set of adjacent memory addresses 806 of the memory.
  • elements of the matrices M of the plurality of feature points of the image 408 are stored in a SIMD register according to positions of the elements.
  • the first elements of the matrices, M (1,1)(), are grouped into first element groups and stored in first memory blocks 808.
  • In SIMD, 32 data items are processed in parallel, and every 32 first elements of the matrices, M (1,1)(), are grouped and stored together.
  • the first elements of the matrices, M (1,1)(1) to M (1,1)(32), are stored in a first memory block 808A, and the first elements of the matrices, M (1,1)(33) to M (1,1)(64), are stored in a distinct first memory block 808B that is separate from and does not overlap the first memory block 808A.
  • the second, third, fourth, fifth, and sixth elements of the matrices M are grouped and stored together in second, third, fourth, fifth, and sixth memory blocks 810 according to the parallel level of 32, respectively.
  • the electronic system implements one of an addition operation and a multiplication operation on the first elements of the matrices, M (1, 1)(), of the first element group 808 in response to a single instruction.
  • Each of the predefined number of first elements is converted to an alternative element of a second visual factor.
  • the first elements of 32 matrices, M (1, 1)(1) to M (1, 1)(32), are extracted and converted in parallel to alternative elements of the matrices, J D , J source_rot , J source_translation , or J destine_translation .
  • one of the six elements of the matrices is equal to a predefined value for all of the feature points of the image 408.
  • the predefined value is equal to 0.
  • the second elements of the matrices, M (1, 2)(), and the fourth elements of the matrices, M (2, 1)() are fixed and equal to 0.
  • Each lane of a SIMD instruction holds data (i.e., an element) from a visual factor 604 of a feature point.
  • a parallel degree of the SIMD instruction is N (e.g., 32), and N visual factors are processed simultaneously in response to the SIMD instruction. Stated another way, the same visual factor of N feature points is processed in parallel in response to the SIMD instruction.
  • a visual factor includes a 3×3 matrix that is processed by multiplication, addition, and subtraction operations, and each of these operations is performed by a SIMD instruction. Conversely, a division operation is applied to determine an inverse depth and is not implemented by any SIMD instruction.
  • non-SIMD processing of the visual factors 604 is transformed to SIMD-based parallel processing. Matrix or vector operations are converted to scalar operations. Each element in a vector or matrix of a visual factor 506 is represented by a scalar variable with a subscript x, y, or z, i.e., as a sub-component associated with an x, y, or z axis. Scalar variables are extended into vectors with length N.
  • Each element in the vector or matrix stands for a scalar variable corresponding to the visual factor 506.
  • a memory layout of a SIMD register 248 is transformed to store intermediate results.
  • the same first elements for N visual factors of N feature points are stored in a first memory block 808, and the same second elements for the N visual factors of the N feature points are stored in a second memory block 810 that does not overlap and is separate from the first memory block 808 (a brief sketch of this regrouping appears after this list).
  • Each scalar operation (e.g., an addition operation) is enabled by a corresponding SIMD instruction that controls the elements of the N visual factors of the N feature points to be processed in parallel by scalar operations.
  • Neon is an example SIMD architecture extension for ARM Cortex-A and ARM Cortex-R series of processors.
  • M (k)(i,j) represents the element located at the i-th row and j-th column of a matrix M (k).
  • the matrix M (k) is associated with a matrix J source_rot (k) associated with the k-th visual factor 604. It is assumed that there are K factors in total.
  • In memory (e.g., a SIMD register 248) storing the matrix, all (i, j) elements of the same matrix across the visual factors 604 are stored together.
  • all (1, 1) elements of matrices of 32 visual factors of 32 feature points are stored in a first memory block.
  • All (1, 2) elements of the 32 visual factors of the 32 feature points are stored in a second memory block that does not overlap and is separate from the first memory block, so are each (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3) element of the 32 visual factors.
  • SIMD based parallel processing is applied in multiplication and addition operations to calculate elements of the Jacobian matrices 514 corresponding to N visual factors (e.g., 32 visual factors corresponding to 32 feature points) simultaneously.
  • a memory space for N visual factors is reserved and reused for every N visual factors, thereby reducing a memory usage during the course of processing the matrix M to the Jacobian matrix 514.
  • SIMD based parallel processing is not limited to processing matrices of limited sizes, and can be broadly applied to process vectors and matrices of different visual factors 604 (e.g., including IMU factors 504, visual factors 506, and prior factors 512).
  • a SIMD based SLAM module 232 performs in an efficient and scalable manner compared with many existing solutions (e.g., ICE-BA).
  • FIG. 9 is a flow diagram of a SIMD-based VIO method 900 that is implemented at an electronic system (e.g., system 200) having a SIMD register 248, in accordance with some embodiments.
  • the electronic system includes an electronic device (e.g., a client device 104).
  • An example of the client device 104 is a head-mounted display 104D or a mobile phone 104C.
  • the method 900 is applied to determine and predict poses, map a scene, and render both virtual and real content concurrently in extended reality (e.g., VR, AR).
  • Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., a VIO module 240 stored in memory 206 of the system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic system obtains (902) motion data 406 measured by an inertial motion unit (IMU) 280 and image data 408 that is captured by a camera 260 and has a plurality of feature points. For each feature point, a first visual factor 604A is determined (904) from the motion data and image data.
  • the first visual factor 604A includes (906) a plurality of elements arranged in a vector or matrix, and the elements include (908) a first element located at a first position in the vector or matrix.
  • the electronic system groups (910) the first element of each feature point of the plurality of feature points into a plurality of first element groups (e.g., stored in first memory blocks 708A and 708B).
  • Each first element group includes (912) a predefined number of first elements corresponding to first visual factors of a subset of the plurality of feature points.
  • the predefined number of first elements are stored (914) in a first memory block of the SIMD register 248.
  • a device pose is determined (916) based on the first visual factor of each feature point.
  • the electronic device extracts (918) the predefined number of first elements from the first memory block and converts (920) each of the predefined number of first elements to an alternative element of a second visual factor for each of the subset of feature points.
  • the second visual factor includes a Jacobian matrix.
  • the first visual factor 604A (e.g., P destine_est ) includes a second element and a third element, and corresponds to a three-dimensional (3D) vector made of the first element, the second element, and the third element.
  • the electronic system implements one of an addition operation and a multiplication operation on the 3D vector of each of the subset of feature points, at least by converting, in response to a single instruction, each of the predefined number of first elements to an alternative element of a second visual factor (e.g., M).
  • the second visual factor is a matrix (e.g., M), and the alternative element is located at a second position in the matrix of the second visual factor.
  • the electronic system groups the second element of each feature point of the plurality of feature points into a plurality of second element groups.
  • Each second element group corresponds to a first element group and includes the predefined number of second elements corresponding to each first visual factor of the subset of feature points of the first element group.
  • the electronic system groups the third element of each feature point of the plurality of feature points into a plurality of third element groups.
  • Each third element group corresponds to a first element group and includes the predefined number of third elements corresponding to each first visual factor of the subset of feature points of the first element group.
  • the predefined number of second elements of each second element group are stored in a second memory block.
  • the predefined number of third elements of each third element group are stored in a third memory block.
  • the first visual factor corresponds to a matrix (e.g., M) of m1 × n1 elements including the first element that is located in the first position of the matrix and stored in one of the plurality of first element groups.
  • the first visual factor corresponds to a matrix M having 6 elements in 2 rows and 3 columns.
  • the electronic system groups the remaining element of each feature point of the plurality of feature points into a plurality of remaining element groups. Each remaining element group corresponds to a first element group and includes the predefined number of remaining elements corresponding to each first visual factor of the subset of feature points of the first element group.
  • the electronic system implements one of an addition operation and a multiplication operation on the matrix of each feature point, at least by converting each of the predefined number of first elements to an alternative element of a second visual factor (e.g., J D , J source_rot , J source_translation , or J destine_translation ).
  • the matrix of m1 × n1 elements includes a subset of elements (e.g., M(1, 2) and M(2, 1) of a matrix M), and each of the subset of elements is equal to a predefined value in all of the plurality of feature points. In an example, the predefined value is equal to zero.
  • the electronic system determines (922) the predefined number (e.g., 32) based on a single instruction, multiple data (SIMD) processing capability of the electronic system.
  • the electronic system is configured to perform the same operation in parallel on the predefined number of first elements simultaneously, and the first element of each feature point of the plurality of feature points is grouped into the plurality of first element groups in accordance with the SIMD processing capability of the electronic system.
  • first elements of the position vectors P destine_est of feature points are grouped into first element groups, and each first element group has 32 first elements of the position vectors P destine_est of 32 feature points.
  • each of the first elements includes a single floating-point number, and the SIMD processing capability requires that each of the first elements be stored and processed as a fixed point number. Each of the first elements is converted from the single floating-point number to the fixed point number.
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
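The structure-of-arrays layout and single-instruction parallel operation described above for the position vectors (Figures 7B and 8B) can be illustrated with a minimal C sketch using ARM Neon intrinsics. This is an illustrative sketch rather than the claimed implementation: the names pos_x, pos_y, pos_z, N_FEATURES, and scale_and_add are assumptions, and a 128-bit Neon register holds four single-precision floats per instruction rather than the 32-wide parallel degree used as an example in the description.

```c
/* Sketch: x, y, z elements of per-feature position vectors kept in separate
 * contiguous blocks so one SIMD instruction processes the same element of
 * several feature points at once. Illustrative names only. */
#include <arm_neon.h>
#include <stddef.h>

#define N_FEATURES 64            /* number of feature points (multiple of 4) */

static float pos_x[N_FEATURES];  /* "first memory block":  all x elements    */
static float pos_y[N_FEATURES];  /* "second memory block": all y elements    */
static float pos_z[N_FEATURES];  /* "third memory block":  all z elements    */

/* Parallel operation: out[i] = y[i] + x[i] * scale, four feature points per
 * Neon instruction. */
void scale_and_add(const float *x, const float *y, float *out,
                   size_t n, float scale)
{
    float32x4_t s = vdupq_n_f32(scale);          /* broadcast the scalar     */
    for (size_t i = 0; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);       /* load 4 x elements        */
        float32x4_t vy = vld1q_f32(y + i);       /* load 4 y elements        */
        float32x4_t r  = vmlaq_f32(vy, vx, s);   /* r = vy + vx * s          */
        vst1q_f32(out + i, r);                   /* store 4 results          */
    }
}
```

With four lanes per instruction, a group of 32 first elements would be consumed in eight such iterations; architectures with wider registers simply use fewer iterations.
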
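The regrouping of per-feature-point matrices into per-position memory blocks (the transformation from the layout of Figure 8A to that of Figure 8B) amounts to an array-of-structures to structure-of-arrays conversion. The following minimal sketch shows one way such a regrouping could be written; the type and function names (Mat2x3, aos_to_soa) are assumptions and are not taken from the application.

```c
/* Sketch: regroup 2x3 matrices stored per feature point so that the (i, j)
 * element of every matrix lands in its own contiguous block, ready for the
 * SIMD loop above. Illustrative names only. */
#include <stddef.h>

typedef struct { float m[2][3]; } Mat2x3;   /* matrix M of one feature point */

/* soa must point to 6 blocks of n floats each; soa[r*3 + c][k] receives
 * element M(r+1, c+1) of the k-th feature point. */
void aos_to_soa(const Mat2x3 *aos, float *soa[6], size_t n)
{
    for (size_t k = 0; k < n; ++k)
        for (int r = 0; r < 2; ++r)
            for (int c = 0; c < 3; ++c)
                soa[r * 3 + c][k] = aos[k].m[r][c];
}
```
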
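Where the description mentions a fixed-point DSP representation with shift operations to avoid overflow and underflow, the idea can be sketched as follows. The Q16.16 format, the 64-bit accumulator, and the helper names (fix_mul, fix_dot) are assumptions chosen for illustration; the application does not specify a particular fixed-point format or algorithm.

```c
/* Sketch: Q16.16 fixed-point scalars with shifts that keep products and
 * accumulated sums from overflowing 32 bits. Illustrative format only. */
#include <stdint.h>

#define FRAC_BITS 16

typedef int32_t fix32;                               /* Q16.16 value */

static inline fix32 float_to_fix(float f) { return (fix32)(f * (1 << FRAC_BITS)); }
static inline float fix_to_float(fix32 v) { return (float)v / (1 << FRAC_BITS); }

/* Multiply in a 64-bit intermediate, then shift back down so the product
 * stays in range for in-range inputs. */
static inline fix32 fix_mul(fix32 a, fix32 b)
{
    return (fix32)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Accumulate n products in a 64-bit accumulator and shift once at the end;
 * pre-shifting inputs right trades precision for extra headroom when n is
 * large. */
static fix32 fix_dot(const fix32 *a, const fix32 *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int64_t)a[i] * (int64_t)b[i];
    return (fix32)(acc >> FRAC_BITS);
}
```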

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Visual-inertial odometry is implemented based on single instruction, multiple data (SIMD) parallel processing. An electronic system includes a SIMD register and obtains motion data and image data corresponding to a plurality of feature points. For each feature point, a visual factor is determined from the motion data and the image data, and includes a vector or matrix that includes a first element located at a first position in the vector or matrix. The electronic system groups the first element of each feature point of the plurality of feature points into a plurality of first element groups. Each of the first element groups includes a number of first elements (e.g., 32) corresponding to visual factors of a subset of the plurality of feature points. For each of the first element groups, the number of first elements is stored in a first memory block of the SIMD register.
PCT/US2021/061282 2021-11-30 2021-11-30 Procédés et systèmes pour mettre en œuvre une odométrie visuelle-inertielle sur la base d'un traitement simd parallèle WO2023101662A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2021/061282 WO2023101662A1 (fr) 2021-11-30 2021-11-30 Procédés et systèmes pour mettre en œuvre une odométrie visuelle-inertielle sur la base d'un traitement simd parallèle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2021/061282 WO2023101662A1 (fr) 2021-11-30 2021-11-30 Procédés et systèmes pour mettre en œuvre une odométrie visuelle-inertielle sur la base d'un traitement simd parallèle

Publications (1)

Publication Number Publication Date
WO2023101662A1 true WO2023101662A1 (fr) 2023-06-08

Family

ID=86612902

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/061282 WO2023101662A1 (fr) 2021-11-30 2021-11-30 Procédés et systèmes pour mettre en œuvre une odométrie visuelle-inertielle sur la base d'un traitement simd parallèle

Country Status (1)

Country Link
WO (1) WO2023101662A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116908810A (zh) * 2023-09-12 2023-10-20 天津大学四川创新研究院 Method and system for measuring building earthwork using a UAV-mounted lidar

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015013418A2 (fr) * 2013-07-23 2015-01-29 The Regents Of The University Of California Method for processing feature measurements in vision-aided inertial navigation
WO2018128668A1 (fr) * 2017-01-04 2018-07-12 Qualcomm Incorporated Systems and methods for using a global positioning system velocity in visual-inertial odometry
US20190301871A1 (en) * 2018-03-27 2019-10-03 Artisense Corporation Direct Sparse Visual-Inertial Odometry Using Dynamic Marginalization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015013418A2 (fr) * 2013-07-23 2015-01-29 The Regents Of The University Of California Method for processing feature measurements in vision-aided inertial navigation
WO2018128668A1 (fr) * 2017-01-04 2018-07-12 Qualcomm Incorporated Systems and methods for using a global positioning system velocity in visual-inertial odometry
US20190301871A1 (en) * 2018-03-27 2019-10-03 Artisense Corporation Direct Sparse Visual-Inertial Odometry Using Dynamic Marginalization

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116908810A (zh) * 2023-09-12 2023-10-20 天津大学四川创新研究院 Method and system for measuring building earthwork using a UAV-mounted lidar
CN116908810B (zh) * 2023-09-12 2023-12-12 天津大学四川创新研究院 Method and system for measuring building earthwork using a UAV-mounted lidar

Similar Documents

Publication Publication Date Title
EP3621034B1 Method and apparatus for calibrating relative parameters of a collector, and storage medium
CN108447097B Depth camera calibration method and apparatus, electronic device, and storage medium
CN110073313B Interacting with an environment using a parent device and at least one companion device
JP6745328B2 Method and apparatus for recovering point cloud data
US10989540B2 Binocular vision localization method, device and system
US11935187B2 Single-pass object scanning
EP2833322A1 Stereo-motion method of extracting three-dimensional (3D) structure information from video for fusion with 3D point cloud data
JP7369847B2 Data processing method and apparatus for an autonomous driving vehicle, electronic device, storage medium, computer program, and autonomous driving vehicle
WO2021178980A1 Data synchronization and pose prediction in extended reality
CN112183506A Human body pose generation method and system
CN112509047A Image-based pose determination method and apparatus, storage medium, and electronic device
CN109035303B SLAM system camera tracking method and apparatus, and computer-readable storage medium
US11188787B1 End-to-end room layout estimation
WO2024212821A1 Loop closure optimization method and apparatus for map construction, device, and storage medium
WO2023101662A1 Methods and systems for implementing visual-inertial odometry based on parallel SIMD processing
KR20180112374A Real-time camera pose estimation method based on image feature points, and apparatus therefor
KR20210050997A Pose estimation method and apparatus, computer-readable recording medium, and computer program
JP7272381B2 Learning processing device and learning processing program
CN115731406A Visual difference detection method, apparatus, and device based on page graphs
WO2019047607A1 Data processing method and device for an end-to-end autonomous driving system
WO2023091131A1 Methods and systems for retrieving images based on semantic plane features
CN110211239B Augmented reality method, apparatus, device, and medium based on markerless recognition
CN107993247A Tracking and positioning method, system, medium, and computing device
WO2023277877A1 3D semantic plane detection and reconstruction
WO2023195982A1 Keyframe downsampling for memory usage reduction in SLAM

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21966547

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE