WO2024097469A1 - Hybrid system for feature detection and descriptor generation - Google Patents

Hybrid system for feature detection and descriptor generation

Info

Publication number
WO2024097469A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
characteristic
data
input data
Prior art date
Application number
PCT/US2023/074077
Other languages
English (en)
Inventor
Ahmed Kamel Sadek
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US18/449,461 (published as US20240153245A1)
Application filed by Qualcomm Incorporated
Publication of WO2024097469A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/60: Extraction of image or video features relating to illumination properties, e.g. using a reflectance or lighting model
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/94: Hardware or software architectures specially adapted for image or video understanding
    • G06V20/64: Three-dimensional objects

Definitions

  • the present disclosure generally relates to processing sensor data (e.g., images, radar data, light detection and ranging (LIDAR) data, etc.).
  • aspects of the present disclosure are related to a hybrid system for performing feature (e.g., keypoint) detection and descriptor generation.
  • a camera or a device including a camera can capture a sequence of frames of a scene (e.g., a video of a scene).
  • the sequence of frames can be processed for performing one or more functions, can be output for display, can be output for processing and/or consumption by other devices, among other uses.
  • Degrees of freedom refer to the number of basic ways a rigid object can move through three-dimensional (3D) space.
  • six different DoF of an object can be tracked, including three translational DoF and three rotational DoF.
  • Certain devices can track some or all of these degrees of freedom.
  • a device or system can perform feature analysis (e.g., extraction, tracking, etc.) and other complex functions.
  • a method for processing image data including: obtaining input data; processing, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and determining, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
  • an apparatus for processing image data includes at least one memory and at least one processor (e.g., implemented in circuitry) coupled to the at least one memory.
  • the at least one processor is configured to: obtain input data; process, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and determine, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
  • a non-transitory computer-readable medium has stored thereon instructions that, when executed by one or more processors, cause the one or more processors to: obtain input data; process, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and determine, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
  • an apparatus for processing image data includes: means for obtaining input data; means for processing, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and means for determining, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
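  • As an illustrative, non-limiting sketch of the flow summarized above, the example below obtains input image data, uses a classical (non-machine learning) corner detector to determine feature points, and then applies a small neural network to a local patch around each point to determine a respective feature descriptor. OpenCV's Shi-Tomasi detector and the untrained toy network are stand-ins chosen for illustration only; the patch size and descriptor dimension are assumptions, not taken from the disclosure.

```python
# Hybrid sketch: non-ML feature detection + ML descriptor generation (illustrative only).
import cv2
import numpy as np
import torch
import torch.nn as nn

PATCH = 32  # assumed size of the local patch extracted around each feature point


class PatchDescriptorNet(nn.Module):
    """Toy stand-in for a machine learning based descriptor generator."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, patches):  # patches: (N, 1, PATCH, PATCH)
        return nn.functional.normalize(self.net(patches), dim=-1)


def detect_and_describe(gray: np.ndarray, max_points: int = 200):
    # 1) Non-machine-learning feature detection (Shi-Tomasi corners).
    pts = cv2.goodFeaturesToTrack(gray, max_points, qualityLevel=0.01, minDistance=8)
    pts = [] if pts is None else [tuple(p.ravel().astype(int)) for p in pts]
    # 2) Extract a local patch around each detected feature point.
    h, w = gray.shape
    half = PATCH // 2
    keep, patches = [], []
    for x, y in pts:
        if half <= x < w - half and half <= y < h - half:
            keep.append((x, y))
            patches.append(gray[y - half:y + half, x - half:x + half])
    if not keep:
        return [], np.empty((0, 128), np.float32)
    # 3) Machine-learning descriptor for each feature point.
    batch = torch.from_numpy(np.stack(patches)).float().unsqueeze(1) / 255.0
    with torch.no_grad():
        descriptors = PatchDescriptorNet()(batch).numpy()
    return keep, descriptors
```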
  • one or more of the apparatuses described herein is or is part of a camera, a mobile device (e.g., a mobile telephone or so-called “smart phone” or other mobile device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, or other device.
  • an apparatus includes a camera or multiple cameras for capturing one or more images.
  • the apparatus further includes a display for displaying one or more images, notifications, and/or other displayable data.
  • the apparatus can include one or more sensors, which can be used for determining a location and/or pose of the apparatus, a state of the apparatus, and/or for other purposes.
  • FIG. 1 is a block diagram illustrating an architecture of an image capture and processing device, in accordance with some examples
  • FIG. 2 is a block diagram illustrating an architecture of an example extended reality (XR) system, in accordance with some examples
  • FIG. 3 is a block diagram illustrating an architecture of a simultaneous localization and mapping (SLAM) device, in accordance with some examples
  • FIG. 4 is an example frame captured by a SLAM system, in accordance with some aspects
  • FIG. 5 is a diagram illustrating an example of a hybrid system 500 for detecting features (e.g., keypoints or feature points) and generating descriptors for the detected features, in accordance with some aspects;
  • FIG. 6 is a flow diagram illustrating an example of a process for processing image data, in accordance with some examples of the present disclosure.
  • FIG. 7 is a block diagram illustrating an example of a computing system for implementing certain aspects described herein.
  • a device and system can determine or capture characteristics of a scene based on sensor data associated with the scene.
  • the sensor data can include images (or frames) of a scene, video data (including multiple frames) of the scene, radar data, LIDAR data, any combination thereof and/or other data.
  • An image capture device typically includes at least one lens that receives light from a scene and bends the light toward an image sensor of the image capture device.
  • the light received by the lens passes through an aperture controlled by one or more control mechanisms and is received by the image sensor.
  • the one or more control mechanisms can control exposure, focus, and/or zoom based on information from the image sensor and/or based on information from an image processor (e.g., a host or application processor and/or an image signal processor).
  • the one or more control mechanisms include a motor or other control mechanism that moves a lens of an image capture device to a target lens position.
  • Degrees of freedom refer to the number of basic ways a rigid object can move through three-dimensional (3D) space.
  • the six DoF can include three translational DoF corresponding to translational movement along three perpendicular axes, which can be referred to as x, y, and z axes.
  • the six DoF can also include three rotational DoF corresponding to rotational movement around the three axes, which can be referred to as pitch, yaw, and roll.
  • Some devices can track some or all of these degrees of freedom.
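  • The six degrees of freedom discussed above can be pictured with a minimal data structure; the field names below are illustrative and are not taken from the disclosure.

```python
# Minimal illustration of the six degrees of freedom (three translational, three rotational).
from dataclasses import dataclass


@dataclass
class Pose6DoF:
    # three translational DoF: movement along three perpendicular axes
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    # three rotational DoF: rotation about those axes
    pitch: float = 0.0
    yaw: float = 0.0
    roll: float = 0.0


# A 3DoF tracker would update only (pitch, yaw, roll); a 6DoF tracker updates all six fields.
head_pose = Pose6DoF(yaw=0.35, pitch=-0.10)  # e.g., head turned right and tilted down
```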
  • tracking can be used to perform localization and mapping functions. Mapping can include a process of building or generating a map of a particular environment.
  • Localization can include a process of determining a location of an object (e.g., a vehicle, an XR device, a robotics device, a mobile handset, etc.) within a map (e.g., the map generated using the mapping process).
  • An example of a technique for localization and mapping is visual simultaneous localization and mapping (VSLAM).
  • VSLAM is a computational geometry technique used in devices with cameras, such as vehicles or systems of vehicles (e.g., autonomous driving systems), XR devices (e.g., head-mounted displays (HMDs), AR headsets, etc.), robotics devices or systems, mobile handsets, etc.
  • a device can construct and update a map of an unknown environment based on frames captured by the device’s camera.
  • the device can keep track of the device’s pose (e.g., a pose of an image sensor of the device, such as a camera pose, which may be determined using 6DoF tracking) within the environment (e.g., location and/or orientation) as the device updates the map.
  • the device can be activated in a particular room of a building and can move throughout the interior of the building, capturing image frames.
  • the device can map the environment, and keep track of its location in the environment, based on tracking where different objects in the environment appear in different image frames.
  • Sensor data other than image frames, such as radar and/or LIDAR data, may also be used for VSLAM.
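  • The VSLAM behavior described above can be summarized with a high-level sketch: each incoming frame is used both to localize the device against the map (via feature correspondences) and to extend the map with newly triangulated landmarks. All helper names below are hypothetical placeholders, not components of the disclosed system.

```python
# High-level sketch of a VSLAM update loop (helper callables are injected placeholders).
def vslam_loop(frames, detect_features, match_to_map, estimate_pose, triangulate):
    slam_map = []   # 3D landmarks (feature points plus descriptors)
    pose = None     # current camera pose estimate (e.g., 6DoF)
    for frame in frames:
        feats = detect_features(frame)            # keypoints and descriptors for this frame
        matches = match_to_map(feats, slam_map)   # 2D-3D correspondences against the map
        if matches:
            pose = estimate_pose(matches, pose)   # localization step
        slam_map.extend(triangulate(feats, matches, pose))  # mapping step: add new landmarks
        yield pose, slam_map
```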
  • degrees of freedom can refer to which of the six degrees of freedom the system is capable of tracking.
  • 3DoF tracking systems generally track the three rotational DoF (e.g., pitch, yaw, and roll).
  • A 3DoF headset, for instance, can track the user of the headset turning their head left or right, tilting their head up or down, and/or tilting their head to the left or right.
  • 6DoF systems can track the three translational DoF as well as the three rotational DoF.
  • A 6DoF headset, for instance, can track the user moving forward, backward, laterally, and/or vertically in addition to tracking the three rotational DoF.
  • a device (e.g., an XR device, a mobile device, etc.) can perform feature analysis (e.g., extraction, tracking, etc.) to detect keypoint features (also referred to as feature points or keypoints) in captured images.
  • Descriptors for the keypoint features can also be generated in order to provide semantic meaning for the keypoint features.
  • the keypoint features are features in the images that do not change in different conditions (e.g., different lighting and/or illuminations, different views, different weather conditions, etc.), such as points associated with corners of objects in the images, distinctive features of the objects, etc.
  • the keypoint features can be used as non-semantic features to improve localization and mapping robustness.
  • Keypoint features can include distinctive features extracted from one or more images, such as points associated with a corner of a table, an edge of a street sign, etc.
  • Generating stable keypoint features and descriptors is important so that localization and mapping using such features can be accurately performed.
  • a descriptor generated for a feature associated with a traffic sign detected in a first image captured in the daylight may be different than a descriptor generated for the same feature associated with the traffic sign detected in a second image captured in the dark (e.g., at night time).
  • a descriptor generated for a feature associated with a distinctive part of a building detected in a first image captured in clear conditions may be different than a descriptor generated for the same feature associated with the same part of the building detected in a second image captured in cloudy or rainy conditions.
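  • The matching problem in the preceding examples can be illustrated numerically: if the descriptor computed for the same physical feature drifts between capture conditions, its similarity to the stored reference drops and the correspondence can be missed. The vectors below are made-up values used only for illustration.

```python
# Illustration of descriptor drift across capture conditions (made-up descriptor values).
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


reference_day = np.array([0.90, 0.10, 0.40, 0.10])    # descriptor from a daytime image
observed_night = np.array([0.20, 0.80, 0.10, 0.50])   # same feature, descriptor drifted at night
observed_stable = np.array([0.85, 0.15, 0.35, 0.10])  # condition-invariant descriptor

print(cosine_similarity(reference_day, observed_night))   # low similarity: match likely rejected
print(cosine_similarity(reference_day, observed_stable))  # high similarity: feature re-identified
```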
  • machine learning based systems can be used to detect keypoint features (e.g., keypoints or feature points) for localization and mapping and to generate descriptors for the detected features.
  • training such machine learning based systems, however, generally requires ground truth data and annotations or labels; for example, a machine learning based keypoint feature (e.g., keypoint or feature point) detector may require manually annotated features for training.
  • a benefit of using a machine learning based system to generate descriptors is that manual (e.g., by a human) annotation of features is not needed.
  • a human may still be needed to generate descriptors that will be used as ground truth data (e.g., labeled data) for training the machine learning based system.
  • systems, apparatuses, processes (also referred to as methods), and computer-readable media are described herein for providing a hybrid system for performing feature detection (e.g., to detect keypoints, also referred to as feature points) and descriptor generation.
  • the hybrid system can include a non-machine learning based feature detector and a machine learning based descriptor generator.
  • the non-machine learning based feature detector can detect or generate feature points (or keypoints) from input sensor data (e.g., one or more input images, LIDAR data, radar data, etc.).
  • the machine learning based descriptor generator can include or can be a machine learning system (e.g., a deep learning neural network) that can generate descriptors for the feature points (or keypoints) detected by the non-machine learning based feature detector.
  • the machine learning based descriptor generator can generate the descriptors (also referred to as feature descriptors) at least in part by generating a description of a feature as detected or depicted in input sensor data (e.g., a local image patch extracted around the feature in an image) by the non-machine learning based feature detector.
  • a feature descriptor can describe a feature as a feature vector or as a collection of feature vectors.
  • the machine learning system used for descriptor generation can include a transformer neural network architecture.
  • the transformer based neural network can use transformer cross-attention (e.g., cross-view attention) to determine a unique signature (which can be used as a feature descriptor) across different types of input data (e.g., images, radar data, and/or LIDAR data captured during the day, images, radar data, and/or LIDAR data captured during nighttime, images, radar data, and/or LIDAR data captured when rain is present, images, radar data, and/or LIDAR data captured when fog is present, etc.), providing robustness to varying input data. Generating a common or unique descriptor across such varying input data is more difficult to do manually.
  • Such a hybrid system allows the feature detection and descriptor generation to be performed using machine learning in an unsupervised manner (in which case no labeling is required for training). Further, the transformer-based solution described above (e.g., using cross-attention to generate a unique signature for different types of input data) can scale with more data, and can be trained using unsupervised learning (thus requiring no labeled data).
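  • The sketch below illustrates one way a transformer-style cross-attention block could fuse observations of the same feature captured under different conditions (e.g., day, night, rain, fog) into a single unit-norm signature. The architecture, layer sizes, and usage shown are assumptions for illustration and are not the patented design.

```python
# Cross-view attention sketch: fuse multiple views of a feature into one descriptor (illustrative).
import torch
import torch.nn as nn


class CrossViewDescriptor(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4, patch: int = 32):
        super().__init__()
        self.embed = nn.Linear(patch * patch, dim)                  # patch -> token embedding
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, anchor_patch, other_view_patches):
        # anchor_patch: (B, 32, 32); other_view_patches: (B, V, 32, 32)
        q = self.embed(anchor_patch.flatten(1)).unsqueeze(1)    # query token, (B, 1, dim)
        kv = self.embed(other_view_patches.flatten(2))          # key/value tokens, (B, V, dim)
        fused, _ = self.cross_attn(q, kv, kv)                   # cross-view attention
        descriptor = self.head(fused.squeeze(1))                # (B, dim)
        return nn.functional.normalize(descriptor, dim=-1)      # unit-norm signature


# Usage sketch: descriptors for the same feature observed under different conditions can be
# pulled together with an unsupervised (e.g., contrastive or consistency) objective, without labels.
model = CrossViewDescriptor()
day_patches = torch.rand(8, 32, 32)        # anchor patches, one per feature point
other_patches = torch.rand(8, 3, 32, 32)   # same features under night / rain / fog
signatures = model(day_patches, other_patches)  # (8, 128)
```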
  • FIG. 1 is a block diagram illustrating an architecture of an image capture and processing system 100.
  • the image capture and processing system 100 includes various components that are used to capture and process images of scenes (e.g., an image of a scene 110).
  • the image capture and processing system 100 can capture standalone images (or photographs) and/or can capture videos that include multiple images (or video frames) in a particular sequence.
  • a lens 115 of the image capture and processing system 100 faces a scene 110 and receives light from the scene 110.
  • the lens 115 bends the light toward the image sensor 130.
  • the light received by the lens 115 passes through an aperture controlled by one or more control mechanisms 120 and is received by an image sensor 130.
  • the one or more control mechanisms 120 may control exposure, focus, and/or zoom based on information from the image sensor 130 and/or based on information from the image processor 150.
  • the one or more control mechanisms 120 may include multiple mechanisms and components; for instance, the one or more control mechanisms 120 may include one or more exposure control mechanisms 125A, one or more focus control mechanisms 125B, and/or one or more zoom control mechanisms 125C.
  • the one or more control mechanisms 120 may also include additional control mechanisms besides those that are illustrated, such as control mechanisms controlling analog gain, flash, HDR, depth of field, and/or other image capture properties.
  • the one or more focus control mechanisms 125B of the one or more control mechanisms 120 can obtain a focus setting.
  • the one or more focus control mechanisms 125B store the focus setting in a memory register.
  • the one or more focus control mechanisms 125B can adjust the position of the lens 115 relative to the position of the image sensor 130. For example, based on the focus setting, the one or more focus control mechanisms 125B can move the lens 115 closer to the image sensor 130 or farther from the image sensor 130 by actuating a motor or servo (or other lens mechanism), thereby adjusting focus.
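  • As general optics background (not quoted from this excerpt), the lens-to-sensor distance needed to bring a subject into focus follows the thin-lens relation, which is why the focus control mechanism moves the lens closer to or farther from the image sensor as the subject distance changes:

```latex
% f: focal length, d_o: lens-to-subject distance, d_i: lens-to-sensor distance
\[
  \frac{1}{f} = \frac{1}{d_o} + \frac{1}{d_i}
  \quad\Longrightarrow\quad
  d_i = \frac{f\, d_o}{d_o - f}
\]
```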
  • additional lenses may be included in the image capture and processing system 100, such as one or more microlenses over each photodiode of the image sensor 130, which each bend the light received from the lens 115 toward the corresponding photodiode before the light reaches the photodiode.
  • the focus setting may be determined via contrast detection autofocus (CDAF), phase detection autofocus (PDAF), hybrid autofocus (HAF), or some combination thereof.
  • the focus setting may be determined using the one or more control mechanisms 120, the image sensor 130, and/or the image processor 150.
  • the focus setting may be referred to as an image capture setting and/or an image processing setting.
  • the one or more exposure control mechanisms 125A of the one or more control mechanisms 120 can obtain an exposure setting.
  • the one or more exposure control mechanisms 125A store the exposure setting in a memory register. Based on this exposure setting, the one or more exposure control mechanisms 125A can control a size of the aperture (e.g., aperture size or f/stop), a duration of time for which the aperture is open (e.g., exposure time or shutter speed), a sensitivity of the image sensor 130 (e.g., ISO speed or film speed), analog gain applied by the image sensor 130, or any combination thereof.
  • the exposure setting may be referred to as an image capture setting and/or an image processing setting.
  • the one or more zoom control mechanisms 125C of the one or more control mechanisms 120 can obtain a zoom setting.
  • the one or more zoom control mechanisms 125C stores the zoom setting in a memory register.
  • the one or more zoom control mechanisms 125C can control a focal length of an assembly of lens elements (lens assembly) that includes the lens 115 and one or more additional lenses.
  • the one or more zoom control mechanisms 125C can control the focal length of the lens assembly by actuating one or more motors or servos (or other lens mechanism) to move one or more of the lenses relative to one another.
  • the zoom setting may be referred to as an image capture setting and/or an image processing setting.
  • the lens assembly may include a parfocal zoom lens or a varifocal zoom lens.
  • the lens assembly may include a focusing lens (which can be lens 115 in some cases) that receives the light from the scene 110 first, with the light then passing through an afocal zoom system between the focusing lens (e.g., lens 115) and the image sensor 130 before the light reaches the image sensor 130.
  • the afocal zoom system may, in some cases, include two positive (e.g., converging, convex) lenses of equal or similar focal length (e.g., within a threshold difference of one another) with a negative (e.g., diverging, concave) lens between them.
  • the one or more zoom control mechanisms 125C moves one or more of the lenses in the afocal zoom system, such as the negative lens and one or both of the positive lenses.
  • the image sensor 130 includes one or more arrays of photodiodes or other photosensitive elements. Each photodiode measures an amount of light that eventually corresponds to a particular pixel in the image produced by the image sensor 130. In some cases, different photodiodes may be covered by different color filters, and may thus measure light matching the color of the filter covering the photodiode. For instance, Bayer color filters include red color filters, blue color filters, and green color filters, with each pixel of the image generated based on red light data from at least one photodiode covered in a red color filter, blue light data from at least one photodiode covered in a blue color filter, and green light data from at least one photodiode covered in a green color filter.
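  • The Bayer color-filter mosaic described above can be pictured with a short sketch that separates a raw single-channel capture into its per-filter planes before demosaicing. An RGGB tile layout is assumed here for illustration; actual sensor layouts may differ.

```python
# Split a raw Bayer mosaic into per-filter planes, assuming an RGGB tile layout (illustrative).
import numpy as np


def split_bayer_rggb(raw: np.ndarray):
    r = raw[0::2, 0::2]    # photodiodes under red filters
    g1 = raw[0::2, 1::2]   # green filters sharing rows with red
    g2 = raw[1::2, 0::2]   # green filters sharing rows with blue
    b = raw[1::2, 1::2]    # photodiodes under blue filters
    return r, (g1, g2), b


raw = np.random.randint(0, 1024, size=(8, 8), dtype=np.uint16)  # fake 10-bit mosaic
red, (green_a, green_b), blue = split_bayer_rggb(raw)
```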
  • color filters may use yellow, magenta, and/or cyan (also referred to as “emerald”) color filters instead of or in addition to red, blue, and/or green color filters.
  • Monochrome image sensors may also lack color filters and therefore lack color depth.
  • the image sensor 130 may alternately or additionally include opaque and/or reflective masks that block light from reaching certain photodiodes, or portions of certain photodiodes, at certain times and/or from certain angles, which may be used for phase detection autofocus (PDAF).
  • the image sensor 130 may also include an analog gain amplifier to amplify the analog signals output by the photodiodes and/or an analog to digital converter (ADC) to convert the analog signals output of the photodiodes (and/or amplified by the analog gain amplifier) into digital signals.
  • certain components or functions discussed with respect to one or more of the one or more control mechanisms 120 may be included instead or additionally in the image sensor 130.
  • the image sensor 130 may be a charge-coupled device (CCD) sensor, an electron-multiplying CCD (EMCCD) sensor, an active-pixel sensor (APS), a complementary metal-oxide semiconductor (CMOS), an N-type metal-oxide semiconductor (NMOS), a hybrid CCD/CMOS sensor (e.g., sCMOS), or some other combination thereof.
  • the image processor 150 may include one or more processors, such as one or more image signal processors (ISPs) (including ISP 154), one or more host processors (including host processor 152), and/or one or more of any other type of processor 1610 discussed with respect to the computing system 1600.
  • the host processor 152 can be a digital signal processor (DSP) and/or other type of processor.
  • the image processor 150 is a single integrated circuit or chip (e.g., referred to as a system-on-chip or SoC) that includes the host processor 152 and the ISP 154.
  • the chip can also include one or more input/output ports (e.g., input/output (I/O) ports 156), central processing units (CPUs), graphics processing units (GPUs), broadband modems (e.g., 3G, 4G or LTE, 5G, etc.), memory, connectivity components (e.g., Bluetooth™, Global Positioning System (GPS), etc.), any combination thereof, and/or other components.
  • the I/O ports 156 can include any suitable input/output ports or interface according to one or more protocols or specifications, such as an Inter-Integrated Circuit 2 (I2C) interface, an Inter-Integrated Circuit 3 (I3C) interface, a Serial Peripheral Interface (SPI) interface, a serial General Purpose Input/Output (GPIO) interface, a Mobile Industry Processor Interface (MIPI) (such as a MIPI CSI-2 physical (PHY) layer port or interface), an Advanced High-performance Bus (AHB) bus, any combination thereof, and/or other input/output port.
  • the host processor 152 can communicate with the image sensor 130 using an I2C port, and the ISP 154 can communicate with the image sensor 130 using an MIPI port.
  • the image processor 150 may perform a number of tasks, such as de-mosaicing, color space conversion, image frame downsampling, pixel interpolation, automatic exposure (AE) control, automatic gain control (AGC), CDAF, PDAF, automatic white balance, merging of image frames to form an HDR image, image recognition, object recognition, feature recognition, receipt of inputs, managing outputs, managing memory, or some combination thereof.
  • the image processor 150 may store image frames and/or processed images in random access memory (RAM) 140/1620, read-only memory (ROM) 145/1625, a cache, a memory unit, another storage device, or some combination thereof.
  • I/O devices 160 may be connected to the image processor 150.
  • the I/O devices 160 can include a display screen, a keyboard, a keypad, a touchscreen, a trackpad, a touch-sensitive surface, a printer, any other output devices 1635, any other input devices 1645, or some combination thereof.
  • a caption may be input into the image processing device 105B through a physical keyboard or keypad of the I/O devices 160, or through a virtual keyboard or keypad of a touchscreen of the I/O devices 160.
  • the I/O ports 156 may include one or more ports, jacks, or other connectors that enable a wired connection between the image capture and processing system 100 and one or more peripheral devices, over which the image capture and processing system 100 may receive data from the one or more peripheral devices and/or transmit data to the one or more peripheral devices.
  • the I/O ports 156 may include one or more wireless transceivers that enable a wireless connection between the image capture and processing system 100 and one or more peripheral devices, over which the image capture and processing system 100 may receive data from the one or more peripheral devices and/or transmit data to the one or more peripheral devices.
  • the peripheral devices may include any of the previously-discussed types of I/O devices 160 and may themselves be considered I/O devices 160 once they are coupled to the ports, jacks, wireless transceivers, or other wired and/or wireless connectors.
  • the image capture and processing system 100 may be a single device. In some cases, the image capture and processing system 100 may be two or more separate devices, including an image capture device 105A (e.g., a camera) and an image processing device 105B (e.g., a computing device coupled to the camera).
  • the image capture device 105A and the image processing device 105B may be coupled together, for example via one or more wires, cables, or other electrical connectors, and/or wirelessly via one or more wireless transceivers. In some implementations, the image capture device 105A and the image processing device 105B may be disconnected from one another.
  • a vertical dashed line divides the image capture and processing system 100 of FIG. 1 into two portions that represent the image capture device 105A and the image processing device 105B, respectively.
  • the image capture device 105A includes the lens 115, the one or more control mechanisms 120, and the image sensor 130.
  • the image processing device 105B includes the image processor 150 (including the ISP 154 and the host processor 152), the RAM 140, the ROM 145, and the I/O 160.
  • certain components illustrated in the image processing device 105B, such as the ISP 154 and/or the host processor 152, may be included in the image capture device 105A.
  • the image capture and processing system 100 can include an electronic device, such as a mobile or stationary telephone handset (e.g., smartphone, cellular telephone, or the like), a desktop computer, a laptop or notebook computer, a tablet computer, a set-top box, a television, a camera, a display device, a digital media player, a video gaming console, a video streaming device, an Internet Protocol (IP) camera, or any other suitable electronic device.
  • the image capture and processing system 100 can include one or more wireless transceivers for wireless communications, such as cellular network communications, 802.11 Wi-Fi communications, wireless local area network (WLAN) communications, or some combination thereof.
  • the image capture device 105A and the image processing device 105B can be different devices.
  • the image capture device 105A can include a camera device and the image processing device 105B can include a computing device, such as a mobile handset, a desktop computer, or other computing device.
  • the image capture and processing system 100 can include more components than those shown in FIG. 1.
  • the components of the image capture and processing system 100 can include software, hardware, or one or more combinations of software and hardware.
  • the components of the image capture and processing system 100 can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, GPUs, DSPs, CPUs, and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.
  • the software and/or firmware can include one or more instructions stored on a computer-readable storage medium and executable by one or more processors of the electronic device implementing the image capture and processing system 100.
  • FIG. 2 is a diagram illustrating an architecture of an example system 200, in accordance with some aspects of the disclosure.
  • the system 200 can be an XR system (e.g., running (or executing) XR applications and/or implementing XR operations), a system of a vehicle, a robotics system, or other type of system.
  • the system 200 can perform tracking and localization, mapping of an environment in the physical world (e.g., a scene), and/or positioning and rendering of content on a display 209 (e.g., positioning and rendering of virtual content on a screen, visible plane/region, and/or other display as part of an XR experience).
  • the system 200 can generate a map (e.g., a three-dimensional (3D) map) of an environment in the physical world, track a pose (e.g., location and position) of the system 200 relative to the environment (e.g., relative to the 3D map of the environment), and/or determine a position and/or anchor point in a specific location(s) on the map of the environment.
  • the system 200 can position and/or anchor virtual content in the specific location(s) on the map of the environment and can render virtual content on the display 209 such that the virtual content appears to be at a location in the environment corresponding to the specific location on the map of the scene where the virtual content is positioned and/or anchored.
  • the display 209 can include a monitor, a glass, a screen, a lens, a projector, and/or other display mechanism.
  • the display 209 can allow a user to see the real-world environment and also allow XR content to be overlaid, overlapped, blended with, or otherwise displayed thereon.
  • the system 200 includes one or more image sensors 202, an accelerometer 204, a gyroscope 206, storage 207, compute components 210, a pose engine 220, an image processing engine 224, and a rendering engine 226.
  • the components 202-226 shown in FIG. 2 are non-limiting examples provided for illustrative and explanation purposes, and other examples can include more, fewer, or different components than those shown in FIG. 2.
  • the system 200 can include one or more other sensors (e.g., one or more inertial measurement units (IMUs), radars, light detection and ranging (LIDAR) sensors, radio detection and ranging (RADAR) sensors, sound detection and ranging (SODAR) sensors, sound navigation and ranging (SONAR) sensors, audio sensors, etc.), one or more display devices, one or more other processing engines, one or more other hardware components, and/or one or more other software and/or hardware components that are not shown in FIG. 2. While various components of the system 200, such as the image sensor 202, may be referenced in the singular form herein, it should be understood that the system 200 may include multiple of any component discussed herein (e.g., multiple image sensors 202).
  • the system 200 includes or is in communication with (wired or wirelessly) an input device 208.
  • the input device 208 can include any suitable input device, such as a touchscreen, a pen or other pointer device, a keyboard, a mouse, a button or key, a microphone for receiving voice commands, a gesture input device for receiving gesture commands, a video game controller, a steering wheel, a joystick, a set of buttons, a trackball, a remote control, any other input device 1645 discussed herein, or any combination thereof.
  • the image sensor 202 can capture images that can be processed for interpreting gesture commands.
  • the one or more image sensors 202, the accelerometer 204, the gyroscope 206, storage 207, compute components 210, pose engine 220, image processing engine 224, and rendering engine 226 can be part of the same computing device.
  • the one or more image sensors 202, the accelerometer 204, the gyroscope 206, storage 207, compute components 210, pose engine 220, image processing engine 224, and rendering engine 226 can be integrated into a device or system, such as an HMD, XR glasses (e.g., AR glasses), a vehicle or system of a vehicle, smartphone, laptop, tablet computer, gaming system, and/or any other computing device.
  • the one or more image sensors 202, the accelerometer 204, the gyroscope 206, storage 207, compute components 210, pose engine 220, image processing engine 224, and rendering engine 226 can be part of two or more separate computing devices.
  • some of the components 202-226 can be part of, or implemented by, one computing device and the remaining components can be part of, or implemented by, one or more other computing devices.
  • the storage 207 can be any storage device(s) for storing data.
  • the storage 207 can store data from any of the components of the system 200.
  • the storage 207 can store data from the image sensor 202 (e.g., image or video data), data from the accelerometer 204 (e.g., measurements), data from the gyroscope 206 (e.g., measurements), data from the compute components 210 (e.g., processing parameters, preferences, virtual content, rendering content, scene maps, tracking and localization data, object detection data, privacy data, XR application data, face recognition data, occlusion data, etc.), data from the pose engine 220, data from the image processing engine 224, and/or data from the rendering engine 226 (e.g., output frames).
  • the storage 207 can include a buffer for storing frames for processing by the compute components 210.
  • the one or more compute components 210 can include a central processing unit (CPU) 212, a graphics processing unit (GPU) 214, a digital signal processor (DSP) 216, an image signal processor (ISP) 218, and/or other processor (e.g., a neural processing unit (NPU) implementing one or more trained neural networks).
  • the compute components 210 can perform various operations such as image enhancement, computer vision, graphics rendering, tracking, localization, pose estimation, mapping, content anchoring, content rendering, image and/or video processing, sensor processing, recognition (e.g., text recognition, facial recognition, object recognition, feature recognition, tracking or pattern recognition, scene recognition, occlusion detection, etc.), trained machine learning operations, filtering, and/or any of the various operations described herein.
  • the compute components 210 can implement (e.g., control, operate, etc.) the pose engine 220, the image processing engine 224, and the rendering engine 226. In other examples, the compute components 210 can also implement one or more other processing engines.
  • the image sensor 202 can include any image and/or video sensors or capturing devices.
  • the image sensor 202 can be part of a multiple-camera assembly, such as a dual-camera assembly.
  • the image sensor 202 can capture image and/or video content (e.g., raw image and/or video data), which can then be processed by the compute components 210, the pose engine 220, the image processing engine 224, and/or the rendering engine 226 as described herein.
  • the image sensors 202 may include an image capture and processing system 100, an image capture device 105A, an image processing device 105B, or a combination thereof.
  • the image sensor 202 can capture image data and can generate images (also referred to as frames) based on the image data and/or can provide the image data or frames to the pose engine 220, the image processing engine 224, and/or the rendering engine 226 for processing.
  • An image or frame can include a video frame of a video sequence or a still image.
  • An image or frame can include a pixel array representing a scene.
  • an image can be a red-green-blue (RGB) image having red, green, and blue color components per pixel; a luma, chroma-red, chroma-blue (YCbCr) image having a luma component and two chroma (color) components (chroma-red and chroma-blue) per pixel; or any other suitable type of color or monochrome image.
  • the image sensor 202 (and/or other camera of the system 200) can be configured to also capture depth information.
  • the image sensor 202 can include an RGB-depth (RGB-D) camera.
  • the system 200 can include one or more depth sensors (not shown) that are separate from the image sensor 202 (and/or other camera) and that can capture depth information.
  • a depth sensor can obtain depth information independently from the image sensor 202.
  • a depth sensor can be physically installed in the same general location as the image sensor 202, but may operate at a different frequency or frame rate from the image sensor 202.
  • a depth sensor can take the form of a light source that can project a structured or textured light pattern, which may include one or more narrow bands of light, onto one or more objects in a scene. Depth information can then be obtained by exploiting geometrical distortions of the projected pattern caused by the surface shape of the object. In one example, depth information may be obtained from stereo sensors such as a combination of an infra-red structured light projector and an infra-red camera registered to a camera (e.g., an RGB camera).
  • the system 200 can also include other sensors in its one or more sensors.
  • the one or more sensors can include one or more accelerometers (e.g., accelerometer 204), one or more gyroscopes (e.g., gyroscope 206), and/or other sensors.
  • the one or more sensors can provide velocity, orientation, and/or other position-related information to the compute components 210.
  • the accelerometer 204 can detect acceleration by the system 200 and can generate acceleration measurements based on the detected acceleration.
  • the accelerometer 204 can provide one or more translational vectors (e.g., up/down, left/right, forward/back) that can be used for determining a position or pose of the system 200.
  • the gyroscope 206 can detect and measure the orientation and angular velocity of the system 200.
  • the gyroscope 206 can be used to measure the pitch, roll, and yaw of the system 200.
  • the gyroscope 206 can provide one or more rotational vectors (e.g., pitch, yaw, roll).
  • the image sensor 202 and/or the pose engine 220 can use measurements obtained by the accelerometer 204 (e.g., one or more translational vectors) and/or the gyroscope 206 (e.g., one or more rotational vectors) to calculate the pose of the system 200.
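  • One common way (not necessarily the approach used by the system 200) to combine the accelerometer and gyroscope measurements described above is a complementary filter: the gyroscope's angular rate is integrated for responsiveness, and the resulting drift is corrected with the gravity direction measured by the accelerometer. Axis conventions in the sketch below are assumed for illustration.

```python
# Complementary filter sketch for pitch/roll from gyroscope + accelerometer (illustrative).
import math


def complementary_filter(pitch, roll, gyro, accel, dt, alpha=0.98):
    """gyro = (gx, gy, gz) in rad/s, accel = (ax, ay, az) in m/s^2; axis conventions assumed."""
    gx, gy, _gz = gyro
    ax, ay, az = accel
    # propagate orientation with the gyroscope (responsive, but drifts over time)
    pitch_gyro = pitch + gy * dt
    roll_gyro = roll + gx * dt
    # absolute but noisy tilt estimate from the measured gravity direction
    pitch_accel = math.atan2(-ax, math.sqrt(ay * ay + az * az))
    roll_accel = math.atan2(ay, az)
    # blend: trust the gyroscope short-term and the accelerometer long-term
    return (alpha * pitch_gyro + (1 - alpha) * pitch_accel,
            alpha * roll_gyro + (1 - alpha) * roll_accel)
```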
  • the system 200 can also include other sensors, such as an inertial measurement unit (IMU), a magnetometer, a gaze and/or eye tracking sensor, a machine vision sensor, a smart scene sensor, a speech recognition sensor, an impact sensor, a shock sensor, a position sensor, a tilt sensor, etc.
  • the one or more sensors can include at least one IMU.
  • An IMU is an electronic device that measures the specific force, angular rate, and/or the orientation of the system 200, using a combination of one or more accelerometers, one or more gyroscopes, and/or one or more magnetometers.
  • the one or more sensors can output measured information associated with the capture of an image captured by the image sensor 202 (and/or other camera of the system 200) and/or depth information obtained using one or more depth sensors of the system 200.
  • the output of one or more sensors can be used by the pose engine 220 to determine a pose of the system 200 (also referred to as the head pose) and/or the pose of the image sensor 202 (or other camera of the system 200).
  • the pose of the system 200 and the pose of the image sensor 202 (or other camera) can be the same.
  • the pose of image sensor 202 refers to the position and orientation of the image sensor 202 relative to a frame of reference (e.g., with respect to the object).
  • the camera pose can be determined for 6-Degrees Of Freedom (6DoF), which refers to three translational components (e.g., which can be given by X (horizontal), Y (vertical), and Z (depth) coordinates relative to a frame of reference, such as the image plane) and three angular components (e.g., roll, pitch, and yaw relative to the same frame of reference).
  • a device tracker can use the measurements from the one or more sensors and image data from the image sensor 202 to track a pose (e.g., a 6DoF pose) of the system 200.
  • the device tracker can fuse visual data (e.g., using a visual tracking solution) from the image data with inertial data from the measurements to determine a position and motion of the system 200 relative to the physical world (e.g., the scene) and a map of the physical world.
  • the device tracker when tracking the pose of the system 200, can generate a three-dimensional (3D) map of the scene (e.g., the real world) and/or generate updates for a 3D map of the scene.
  • the 3D map updates can include, for example and without limitation, new or updated features and/or feature or landmark points associated with the scene and/or the 3D map of the scene, localization updates identifying or updating a position of the system 200 within the scene and the 3D map of the scene, etc.
  • the 3D map can provide a digital representation of a scene in the real/physical world.
  • the 3D map can anchor location-based objects and/or content to real-world coordinates and/or objects.
  • the system 200 can use a mapped scene (e.g., a scene in the physical world represented by, and/or associated with, a 3D map) to merge the physical and virtual worlds and/or merge virtual content or objects with the physical environment.
  • the pose (also referred to as a camera pose) of image sensor 202 and/or the system 200 as a whole can be determined and/or tracked by the compute components 210 using a visual tracking solution based on images captured by the image sensor 202 (and/or other camera of the system 200).
  • the compute components 210 can perform tracking using computer vision-based tracking, model-based tracking, and/or simultaneous localization and mapping (SLAM) techniques.
  • the compute components 210 can perform SLAM or can be in communication (wired or wireless) with a SLAM system (not shown in FIG. 2), such as the SLAM system 300 of FIG. 3.
  • SLAM refers to a class of techniques where a map of an environment (e.g., a map of an environment being modeled by system 200) is created while simultaneously tracking the pose of a camera (e.g., image sensor 202) and/or the system 200 relative to that map.
  • the map can be referred to as a SLAM map, and can be three-dimensional (3D).
  • the SLAM techniques can be performed using color or grayscale image data captured by the image sensor 202 (and/or other camera of the system 200), and can be used to generate estimates of 6DoF pose measurements of the image sensor 202 and/or the system 200.
  • Such a SLAM technique configured to perform 6DoF tracking can be referred to as 6DoF SLAM.
  • the output of the one or more sensors can be used to estimate, correct, and/or otherwise adjust the estimated pose.
  • 6DoF SLAM can use feature point associations from an input image (or other sensor data, such as a radar sensor, LIDAR sensor, etc.) to determine the pose (position and orientation) of the image sensor 202 and/or system 200 for the input image. 6DoF mapping can also be performed to update the SLAM map.
  • the SLAM map maintained using the 6DoF SLAM can contain 3D feature points (e.g., keypoints) triangulated from two or more images.
  • keyframes can be selected from input images or a video stream to represent an observed scene.
  • a respective 6DoF camera pose associated with the image can be determined.
  • the pose of the image sensor 202 and/or the system 200 can be determined by projecting features (e.g., feature points or keypoints) from the 3D SLAM map into an image or video frame and updating the camera pose from verified 2D-3D correspondences.
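  • The projection step described above can be sketched with a pinhole camera model: a 3D feature point from the SLAM map is mapped into the current frame using the camera pose (rotation R and translation t) and an intrinsic matrix K, and the projected pixel can then be compared against the detected 2D feature to verify the correspondence. The intrinsic values below are made up for illustration.

```python
# Project a 3D map point into the image with a pinhole model (illustrative values).
import numpy as np


def project_point(point_world, R, t, K):
    """Return the (u, v) pixel where a 3D world point appears, or None if it is behind the camera."""
    p_cam = R @ point_world + t      # world coordinates -> camera coordinates
    if p_cam[2] <= 0:
        return None                  # point lies behind the image plane
    uvw = K @ p_cam                  # camera coordinates -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]


K = np.array([[500.0, 0.0, 320.0],   # fx, skew, cx (illustrative intrinsics)
              [0.0, 500.0, 240.0],   # fy, cy
              [0.0, 0.0, 1.0]])
R = np.eye(3)                        # camera aligned with the world frame
t = np.zeros(3)
print(project_point(np.array([0.2, -0.1, 2.0]), R, t, K))  # pixel roughly near the image center
```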
  • the compute components 210 can extract feature points (e.g., keypoints) from certain input images (e.g., every input image, a subset of the input images, etc.) or from each keyframe.
  • features extracted from a captured image can represent distinct feature points along three-dimensional space (e.g., coordinates on X, Y, and Z-axes), and every feature point can have an associated feature location.
  • Feature detection can be used to detect the feature points.
  • Feature detection can include an image processing operation used to examine one or more pixels of an image to determine whether a feature exists at a particular pixel.
  • Feature detection can be used to process an entire captured image or certain portions of an image. For each image or keyframe, once features have been detected, a local image patch around the feature can be extracted.
  • Feature detection and/or extraction can use any suitable technique, such as Scale Invariant Feature Transform (SIFT) (which localizes features and generates their descriptions), Learned Invariant Feature Transform (LIFT), Speeded Up Robust Features (SURF), Gradient Location-Orientation Histogram (GLOH), Oriented FAST and Rotated BRIEF (ORB), Binary Robust Invariant Scalable Keypoints (BRISK), Fast Retina Keypoint (FREAK), KAZE, Accelerated KAZE (AKAZE), Normalized Cross Correlation (NCC), descriptor matching, another suitable technique, or a combination thereof.
  • the system 200 can also track the hand and/or fingers of the user to allow the user to interact with and/or control virtual content in a virtual environment.
  • the system 200 can track a pose and/or movement of the hand and/or fingertips of the user to identify or translate user interactions with the virtual environment.
  • the user interactions can include, for example and without limitation, moving an item of virtual content, resizing the item of virtual content, selecting an input interface element in a virtual user interface (e.g., a virtual representation of a mobile phone, a virtual keyboard, and/or other virtual interface), providing an input through a virtual user interface, etc.
  • FIG. 3 is a block diagram illustrating an architecture of a simultaneous localization and mapping (SLAM) system 300.
  • the SLAM system 300 can be, can include, or can be a part of the system 200 of FIG. 2.
  • the SLAM system 300 can be, can include, or can be a part of an XR device, an autonomous vehicle, a vehicle, a computing system of a vehicle, a wireless communication device, a mobile device or handset (e.g., a mobile telephone or so-called “smart phone” or other mobile device), a wearable device (e.g., a network-connected watch), a personal computer, a laptop computer, a server computer, a portable video game console, a portable media player, a camera device, a manned or unmanned ground vehicle, a manned or unmanned aerial vehicle, a manned or unmanned aquatic vehicle, a manned or unmanned underwater vehicle, a manned or unmanned vehicle, a robot, another device, or any combination thereof.
  • the SLAM system 300 of FIG. 3 includes, or is coupled to, each of one or more sensors 305.
  • the one or more sensors 305 can include one or more cameras 310.
  • Each of the one or more cameras 310 may include an image capture device 105 A, an image processing device 105B, an image capture and processing system 100, another type of camera, or a combination thereof.
  • Each of the one or more cameras 310 may be responsive to light from a particular spectrum of light.
  • the spectrum of light may be a subset of the electromagnetic (EM) spectrum.
  • each of the one or more cameras 310 may be a visible light (VL) camera responsive to a VL spectrum, an infrared (IR) camera responsive to an IR spectrum, an ultraviolet (UV) camera responsive to a UV spectrum, a camera responsive to light from another spectrum of light from another portion of the electromagnetic spectrum, or some combination thereof.
  • the one or more sensors 305 can include one or more other types of sensors other than cameras 310, such as one or more of each of: accelerometers, gyroscopes, magnetometers, inertial measurement units (IMUs), altimeters, barometers, thermometers, radio detection and ranging (RADAR) sensors, light detection and ranging (LIDAR) sensors, sound navigation and ranging (SONAR) sensors, sound detection and ranging (SODAR) sensors, global navigation satellite system (GNSS) receivers, global positioning system (GPS) receivers, BeiDou navigation satellite system (BDS) receivers, Galileo receivers, Globalnaya Navigazionnaya Sputnikovaya Sistema (GLONASS) receivers, Navigation Indian Constellation (NavIC) receivers, Quasi-Zenith Satellite System (QZSS) receivers, Wi-Fi positioning system (WPS) receivers, cellular network positioning system receivers, Bluetooth® beacon positioning receivers, short-range wireless beacon positioning receivers, personal
  • the SLAM system 300 of FIG. 3 includes a visual-inertial odometry (VIO) tracker 315.
  • the term visual-inertial odometry may also be referred to herein as visual odometry.
  • the VIO tracker 315 receives sensor data 365 from the one or more sensors 305.
  • the sensor data 365 can include one or more images captured by the one or more cameras 310.
  • the sensor data 365 can include other types of sensor data from the one or more sensors 305, such as data from any of the types of sensors 305 listed herein.
  • the sensor data 365 can include inertial measurement unit (IMU) data from one or more IMUs of the one or more sensors 305.
  • Upon receipt of the sensor data 365 from the one or more sensors 305, the VIO tracker 315 performs feature detection, extraction, and/or tracking using a feature tracking engine 320 of the VIO tracker 315. For instance, where the sensor data 365 includes one or more images captured by the one or more cameras 310 of the SLAM system 300, the VIO tracker 315 can identify, detect, and/or extract features in each image. Features may include visually distinctive points in an image, such as portions of the image depicting edges and/or corners.
  • the VIO tracker 315 can receive sensor data 365 periodically and/or continually from the one or more sensors 305, for instance by continuing to receive more images from the one or more cameras 310 as the one or more cameras 310 capture a video, where the images are video frames of the video.
  • the VIO tracker 315 can generate descriptors for the features. Feature descriptors can be generated at least in part by generating a description of the feature as depicted in a local image patch extracted around the feature. In some examples, a feature descriptor can describe a feature as a collection of one or more feature vectors. In some cases, the VIO tracker 315 can be implemented using the hybrid system 500 discussed below with respect to FIG. 5.
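As an illustration only (not part of the patent text), the following minimal Python sketch shows one way a descriptor could be formed from a local image patch extracted around a feature, as described above. The patch size and the simple normalized-intensity "feature vector" are assumptions chosen for brevity, not the system's actual descriptor.

```python
import numpy as np

def extract_patch(image: np.ndarray, x: int, y: int, half: int = 8) -> np.ndarray:
    """Extract a local image patch centered on a detected feature point (x, y)."""
    h, w = image.shape[:2]
    x0, x1 = max(x - half, 0), min(x + half, w)
    y0, y1 = max(y - half, 0), min(y + half, h)
    return image[y0:y1, x0:x1]

def simple_descriptor(patch: np.ndarray) -> np.ndarray:
    """Describe the patch as a single normalized feature vector (a toy descriptor)."""
    vec = patch.astype(np.float32).ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```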
  • the VIO tracker 315 in some cases with the mapping engine 330 and/or the relocalization engine 355, can associate the plurality of features with a map of the environment based on such feature descriptors.
  • the feature tracking engine 320 of the VIO tracker 315 can perform feature tracking by recognizing features in each image that the VIO tracker 315 already previously recognized in one or more previous images, in some cases based on identifying features with matching feature descriptors in different images.
  • the feature tracking engine 320 can track changes in one or more positions at which the feature is depicted in each of the different images.
  • the feature extraction engine can detect a particular corner of a room depicted in a left side of a first image captured by a first camera of the cameras 310.
  • the feature extraction engine can detect the same feature (e.g., the same particular corner of the same room) depicted in a right side of a second image captured by the first camera.
  • the feature tracking engine 320 can recognize that the features detected in the first image and the second image are two depictions of the same feature (e.g., the same particular corner of the same room), and that the feature appears in two different positions in the two images.
  • the VIO tracker 315 can determine, based on the same feature appearing on the left side of the first image and on the right side of the second image, that the first camera has moved, for example if the feature (e.g., the particular corner of the room) depicts a static portion of the environment.
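A hedged sketch of the feature tracking idea described above: descriptors computed in two frames are matched by mutual nearest neighbour, and the change in image position of each matched feature is recorded. The matching rule, the distance threshold, and the array layout are illustrative assumptions, not the patent's algorithm.

```python
import numpy as np

def match_and_track(desc_a, desc_b, pts_a, pts_b, max_dist=0.5):
    """Match features between two frames by descriptor distance and report position shifts.

    desc_a, desc_b: (N, D) and (M, D) numpy arrays of descriptors; pts_a, pts_b: lists of (x, y).
    """
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    tracks = []
    for i in range(dists.shape[0]):
        j = int(np.argmin(dists[i]))
        # accept only mutual nearest neighbours below a distance threshold
        if dists[i, j] < max_dist and int(np.argmin(dists[:, j])) == i:
            shift = np.asarray(pts_b[j], dtype=float) - np.asarray(pts_a[i], dtype=float)
            tracks.append((i, j, shift))  # same feature seen at a new image position
    return tracks
```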
  • the VIO tracker 315 can include a sensor integration engine 325.
  • the sensor integration engine 325 can use sensor data from other types of sensors 305 (other than the cameras 310) to determine information that can be used by the feature tracking engine 320 when performing the feature tracking.
  • the sensor integration engine 325 can receive IMU data (e.g., which can be included as part of the sensor data 365) from an IMU of the one or more sensors 305.
  • the sensor integration engine 325 can determine, based on the IMU data in the sensor data 365, that the SLAM system 300 has rotated 15 degrees in a clockwise direction between acquisition or capture of a first image and acquisition or capture of a second image by a first camera of the cameras 310.
  • the sensor integration engine 325 can identify that a feature depicted at a first position in the first image is expected to appear at a second position in the second image, and that the second position is expected to be located to the left of the first position by a predetermined distance (e.g., a predetermined number of pixels, inches, centimeters, millimeters, or another distance metric).
  • the feature tracking engine 320 can take this expectation into consideration in tracking features between the first image and the second image.
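To make the sensor-integration step above concrete, here is a toy sketch, assuming a pinhole camera model and a known focal length in pixels, of how an IMU-reported yaw rotation could be turned into an expected horizontal shift of a feature between two images. The sign convention and focal length value are assumptions for illustration.

```python
import math

def predicted_position(x_px: float, y_px: float, yaw_deg: float, focal_px: float):
    """Predict where a feature should appear after the camera yaws by yaw_deg (clockwise positive)."""
    # Under a pinhole model, a pure yaw rotation shifts image content horizontally
    # by roughly focal_px * tan(yaw); the sign convention here is an assumption.
    dx = focal_px * math.tan(math.radians(yaw_deg))
    return (x_px - dx, y_px)

# Example: a 15-degree clockwise rotation with an assumed 800-pixel focal length
print(predicted_position(640.0, 360.0, yaw_deg=15.0, focal_px=800.0))
```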
  • the VIO tracker 315 can determine 3D feature positions 372 of a particular feature.
  • the 3D feature positions 372 can include one or more 3D feature positions and can also be referred to as 3D feature points.
  • the 3D feature positions 372 can be a set of coordinates along three different axes that are perpendicular to one another, such as an X coordinate along an X axis (e.g., in a horizontal direction), a Y coordinate along a Y axis (e.g., in a vertical direction) that is perpendicular to the X axis, and a Z coordinate along a Z axis (e.g., in a depth direction) that is perpendicular to both the X axis and the Y axis.
  • the VIO tracker 315 can also determine one or more keyframes 370 (referred to hereinafter as keyframes 370) corresponding to the particular feature.
  • a keyframe (from one or more keyframes 370) corresponding to a particular feature may be an image in which the particular feature is clearly depicted.
  • a keyframe corresponding to a particular feature may be an image that reduces uncertainty in the 3D feature positions 372 of the particular feature when considered by the feature tracking engine 320 and/or the sensor integration engine 325 for determination of the 3D feature positions 372.
  • a keyframe corresponding to a particular feature also includes data about the pose 385 of the SLAM system 300 and/or the camera(s) 310 during capture of the keyframe.
  • the VIO tracker 315 can send 3D feature positions 372 and/or keyframes 370 corresponding to one or more features to the mapping engine 330. In some examples, the VIO tracker 315 can receive map slices 375 from the mapping engine 330. The VIO tracker 315 can use feature information within the map slices 375 for feature tracking using the feature tracking engine 320.
  • the VIO tracker 315 can determine a pose 385 of the SLAM system 300 and/or of the cameras 310 during capture of each of the images in the sensor data 365.
  • the pose 385 can include a location of the SLAM system 300 and/or of the cameras 310 in 3D space, such as a set of coordinates along three different axes that are perpendicular to one another (e.g., an X coordinate, a Y coordinate, and a Z coordinate).
  • the pose 385 can include an orientation of the SLAM system 300 and/or of the cameras 310 in 3D space, such as pitch, roll, yaw, or some combination thereof.
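For illustration only, the pose 385 described above (a 3D location plus an orientation) could be held in a simple structure such as the following; the class and field names are assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    # 3D location along three perpendicular axes
    x: float
    y: float
    z: float
    # orientation in 3D space
    pitch: float
    roll: float
    yaw: float
```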
  • the VIO tracker 315 can send the pose 385 to the relocalization engine 355.
  • the VIO tracker 315 can receive the pose 385 from the relocalization engine 355.
  • the SLAM system 300 also includes a mapping engine 330.
  • the mapping engine 330 can generate a 3D map of the environment based on the 3D feature positions 372 and/or the keyframes 370 received from the VIO tracker 315.
  • the mapping engine 330 can include a map densification engine 335, a keyframe remover 340, a bundle adjuster 345, and/or a loop closure detector 350.
  • the map densification engine 335 can perform map densification, which in some examples can increase the quantity and/or density of 3D coordinates describing the map geometry.
  • the keyframe remover 340 can remove keyframes, and/or in some cases add keyframes.
  • the keyframe remover 340 can remove keyframes 370 corresponding to a region of the map that is to be updated and/or whose corresponding confidence values are low.
  • the bundle adjuster 345 can, in some examples, refine the 3D coordinates describing the scene geometry, parameters of relative motion, and/or optical characteristics of the image sensor used to generate the frames, according to an optimality criterion involving the corresponding image projections of all points.
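As a hedged illustration of the optimality criterion mentioned for the bundle adjuster 345, the sketch below computes reprojection residuals (observed 2D feature positions minus the projections of the estimated 3D points) that a least-squares optimizer would minimize. The pinhole projection and the parameter layout are assumptions; a real bundle adjuster jointly refines many poses and points.

```python
import numpy as np

def reprojection_residuals(points_3d, observations, rotation, translation, focal, cx, cy):
    """Residuals between observed 2D feature positions and projections of 3D points."""
    cam = points_3d @ rotation.T + translation          # world -> camera coordinates
    u = focal * cam[:, 0] / cam[:, 2] + cx              # pinhole projection (x)
    v = focal * cam[:, 1] / cam[:, 2] + cy              # pinhole projection (y)
    projected = np.stack([u, v], axis=1)
    return (projected - observations).ravel()           # feed to e.g. a least-squares solver
```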
  • the loop closure detector 350 can recognize when the SLAM system 300 has returned to a previously mapped region, and can use such information to update a map slice and/or reduce the uncertainty in certain 3D feature points or other points in the map geometry.
  • the mapping engine 330 can output map slices 375 to the VIO tracker 315.
  • the map slices 375 can represent 3D portions or subsets of the map.
  • the map slices 375 can include map slices 375 that represent new, previously-unmapped areas of the map.
  • the map slices 375 can include map slices 375 that represent updates (or modifications or revisions) to previously- mapped areas of the map.
  • the mapping engine 330 can output map information 380 to the relocalization engine 355.
  • the map information 380 can include at least a portion of the map generated by the mapping engine 330.
  • the map information 380 can include one or more 3D points making up the geometry of the map, such as one or more 3D feature positions 372.
  • the map information 380 can include one or more keyframes 370 corresponding to certain features and certain 3D feature positions 372.
  • the SLAM system 300 also includes a relocalization engine 355.
  • the relocalization engine 355 can perform relocalization, for instance when the VIO tracker 315 fails to recognize more than a threshold number of features in an image, and/or the VIO tracker 315 loses track of the pose 385 of the SLAM system 300 within the map generated by the mapping engine 330.
  • the relocalization engine 355 can perform relocalization by performing extraction and matching using an extraction and matching engine 360.
  • the extraction and matching engine 360 can extract features from an image captured by the cameras 310 of the SLAM system 300 while the SLAM system 300 is at a current pose 385, and can match the extracted features to features depicted in different keyframes 370, identified by 3D feature positions 372, and/or identified in the map information 380.
  • the relocalization engine 355 can identify that the pose 385 of the SLAM system 300 is a pose 385 at which the previously-identified features are visible to the cameras 310 of the SLAM system 300, and is therefore similar to one or more previous poses 385 at which the previously-identified features were visible to the cameras 310.
  • the relocalization engine 355 can perform relocalization based on wide baseline mapping, or a distance between a current camera position and the camera position at which a feature was originally captured.
  • the relocalization engine 355 can receive information for the pose 385 from the VIO tracker 315, for instance regarding one or more recent poses of the SLAM system 300 and/or cameras 310, which the relocalization engine 355 can base its relocalization determination on.
  • the relocalization engine 355 can output the pose 385 to the VIO tracker 315.
  • FIG. 4 illustrates an example frame 400 of a scene.
  • Frame 400 provides illustrative examples of feature information that can be captured and/or processed by a system (e.g., the system 200 shown in FIG. 2) during tracking and/or mapping.
  • example features 402 are illustrated as circles of differing diameters.
  • the center of each of the features 402 can be referred to as a feature center location.
  • the diameter of the circles can represent a feature scale (also referred to as a blob size) associated with each of the example features 402.
  • Each of the features 402 can also include a dominant orientation vector 403 illustrated as a radial segment.
  • the dominant orientation vector 403 (also referred to as a dominant orientation herein) can be determined based on pixel gradients within a patch (also referred to as a blob or region). For instance, the dominant orientation vector 403 can be determined based on the orientation of edge features in a neighborhood (e.g., a patch of nearby pixels) around the center of the feature.
  • Another example feature 404 is shown with a dominant orientation 406.
  • a feature can have multiple dominant orientations. For example, if no single orientation is clearly dominant, then a feature can have two or more dominant orientations associated with the most prominent orientations.
  • Another example feature 408 is illustrated with two dominant orientation vectors 410 and 412.
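Following the dominant-orientation description above, a short sketch can build a magnitude-weighted histogram of gradient orientations over a patch and return every bin whose height is close to the peak, so a feature may receive one or several dominant orientations. The bin count and the 80% relative threshold are assumptions for illustration, loosely in the spirit of classical keypoint orientation assignment.

```python
import numpy as np

def dominant_orientations(patch: np.ndarray, bins: int = 36, rel_thresh: float = 0.8):
    """Return the dominant orientation(s), in degrees, of a grayscale patch."""
    gy, gx = np.gradient(patch.astype(np.float32))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0
    hist, edges = np.histogram(angle, bins=bins, range=(0.0, 360.0), weights=magnitude)
    peak = hist.max()
    centers = 0.5 * (edges[:-1] + edges[1:])
    # keep every bin near the peak, so strongly bimodal patches yield two orientations
    return [float(c) for c, h in zip(centers, hist) if peak > 0 and h >= rel_thresh * peak]
```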
  • each of the features 402, 404, 408 can also be associated with a descriptor that can be used to associate the features between different frames. For example, if the pose of the camera that captured frame 400 changes, the x-y coordinate of each of the feature center locations for each of the features 402, 404, 408 can also change, and the descriptor assigned to each feature can be used to match the features between the two different frames.
  • the tracking and mapping operations of an XR system can utilize different types of descriptors for the features 402, 404, 408. Examples of descriptors for the features 402, 404, 408 can include SIFT, FREAK, and/or other descriptors.
  • a tracker can operate on image patches directly or can operate on the descriptors (e.g., SIFT descriptors, FREAK descriptors, etc.).
  • In some cases, machine learning based systems (e.g., using a deep learning neural network) can be used to detect features (e.g., keypoints or feature points) and to generate descriptors for the detected features.
  • However, obtaining ground truth and annotations (or labels) for training a machine learning based feature (e.g., keypoint or feature point) detector and descriptor generator can be difficult.
  • the hybrid system can include a non-machine learning based feature detector (e.g., a feature point detector based on, for example, computer vision algorithms) for detecting feature points and a machine learning based descriptor generator (e.g., a deep learning neural network) for generating descriptors (e.g., feature descriptors) for the detected feature points (or keypoints).
  • the machine learning based descriptor generator can generate a descriptor at least in part by generating a description of a feature as detected or depicted in input sensor data.
  • FIG. 5 is a diagram illustrating an example of a hybrid system 500 for detecting features (e.g., keypoints or feature points) and generating descriptors (e.g., feature descriptors) for the detected features.
  • the hybrid system 500 includes a feature point detector 504 and a machine learning (ML) based descriptor generator 506.
  • the feature point detector 504 is a non-machine learning based feature detector.
  • the feature point detector 504 can detect feature points (or keypoints) from input data 502 using, for example, one or more computer vision algorithms.
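Purely as an example of a non-machine-learning feature point detector, the snippet below uses a classical corner detector from OpenCV; the choice of Shi-Tomasi corners and the parameter values are assumed stand-ins, since the patent does not specify which computer vision algorithm the feature point detector 504 uses.

```python
import cv2
import numpy as np

def detect_keypoints(gray_image: np.ndarray, max_points: int = 500):
    """Detect feature points (keypoints) with a classical, non-ML corner detector."""
    corners = cv2.goodFeaturesToTrack(gray_image, maxCorners=max_points,
                                      qualityLevel=0.01, minDistance=8)
    if corners is None:
        return []
    # corners has shape (N, 1, 2); return plain (x, y) tuples
    return [(float(x), float(y)) for [[x, y]] in corners]
```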
  • the input data 502 can include image data, radar data (e.g., a radar image), LIDAR data (e.g., a LIDAR point cloud), and/or other sensor data.
  • As shown in FIG. 5, the input data 502 can include an image 503 of a scene in a first illumination condition (e.g., in the daytime), an image 505 of the same scene in a second illumination condition (e.g., at nighttime), and an image 507 of the same scene with a particular weather condition present (e.g., fog, rain, etc.). Multiple sets of images with similar differences can also be included in the input data.
  • the input data 502 can include other images of the same scene but from different angles during the first illumination condition (e.g., during the daytime), during the second illumination condition (e.g., at nighttime), and with the same or different weather conditions. Additionally or alternatively, in some illustrative examples, the input data 502 can include images of different scenes during the day, at night, and with the same or different weather conditions.
  • the feature point detector 504 can detect keypoints in the images 503, 505, 507.
  • the feature point detector 504 can output a patch 509 around a feature point (or keypoint) detected in the image 503, a patch 511 around the same feature point (or keypoint) detected in the image 505, and a patch 513 around the same feature point (or keypoint) detected in the image 507. Similar patches can be generated for other feature points detected in the images 503, 505, 507 and in other images and/or sensor data.
  • the patches 509, 511, and 513 can be output to the ML-based descriptor generator 506 for descriptor generation.
  • the ML-based descriptor generator 506 can process the patches 509, 511, and 513 to generate feature descriptors for the features in the respective patches 509, 511, and 513.
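A minimal sketch, assuming a small PyTorch network, of how an ML-based descriptor generator might map fixed-size patches (such as patches 509, 511, and 513) to unit-length descriptor vectors. The architecture, layer sizes, and patch dimensions are illustrative assumptions, not the patented design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDescriptor(nn.Module):
    def __init__(self, patch_size: int = 32, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(patch_size * patch_size, 256),
            nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, 1, patch_size, patch_size) grayscale crops around feature points
        return F.normalize(self.net(patches), dim=-1)  # unit-length descriptors

# e.g., descriptors for the day / night / fog patches of the same feature
descriptors = PatchDescriptor()(torch.rand(3, 1, 32, 32))
```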
  • Each feature descriptor can describe a respective feature as a feature vector or as a collection of feature vectors, as described above with respect to FIG. 3.
  • the ML based descriptor generator 506 can include a transformer-based neural network having a transformer neural network architecture.
  • the transformer-based neural network can use transformer cross-attention (e.g., cross-view attention across sensor data from different views or perspectives of a common feature) to determine a unique signature across the different types of input data 502.
  • the unique signature can then be used as a feature descriptor.
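A hedged sketch of transformer cross-attention across patch embeddings of the same feature captured under different conditions: one condition's embedding queries the others, and the attended output plays the role of the unique signature described above. The embedding size, head count, and the use of torch.nn.MultiheadAttention are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

# embeddings of the same feature from three conditions (e.g., day, night, fog)
day = torch.rand(1, 1, dim)
night = torch.rand(1, 1, dim)
fog = torch.rand(1, 1, dim)

context = torch.cat([night, fog], dim=1)               # views attended over
signature, _ = cross_attn(day, context, context)       # cross-view attention
signature = F.normalize(signature.squeeze(1), dim=-1)  # condition-invariant signature / descriptor
```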
  • a loss function can be used to train the ML based descriptor generator 506 (e.g., by backpropagating gradients determined based on a loss determined by the loss function).
  • the loss function can enforce the same descriptor across different characteristics of the input data 502 (e.g., across all illumination conditions, weather, etc.). In such cases, labels are not needed and thus the ML model of the ML based descriptor generator 506 can be trained in an unsupervised manner.
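One way to realize such a loss, sketched under the assumption of a simple mean-squared consistency term: descriptors of the same feature under different characteristics are pulled toward each other, with no labels required. A practical system would likely add a term that prevents all descriptors from collapsing to a constant; that detail is omitted here.

```python
import torch
import torch.nn.functional as F

def consistency_loss(desc_day: torch.Tensor, desc_night: torch.Tensor, desc_fog: torch.Tensor):
    """Encourage the same descriptor across different characteristics of the input data."""
    return (F.mse_loss(desc_day, desc_night)
            + F.mse_loss(desc_day, desc_fog)
            + F.mse_loss(desc_night, desc_fog))
```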
  • the hybrid system 500 provides advantages over traditional systems for detecting features and generating descriptors. For example, as noted previously, it can be difficult to generate a common descriptor for a same feature detected in different sensor data (e.g., images, radar data, LIDAR data, etc.) when the sensor data is captured in different conditions (e.g., different lighting and/or illuminations, different views, different weather conditions, etc.).
  • a descriptor generated for a feature associated with an edge of a traffic sign detected in a first image captured in the daylight may be different than a descriptor generated for the same feature associated with the edge of the traffic sign detected in a second image captured in the dark (e.g., at night time).
  • during the day, the background behind the traffic sign may be blue (e.g., corresponding to a blue sky), whereas during the night the background behind the traffic sign may be black (e.g., corresponding to a dark sky).
  • Traditional computer-vision algorithms take as input neighboring pixels surrounding an edge or other distinctive portion of an object (e.g., the traffic sign), and encode information associated with the neighboring pixels (e.g., based on a gradient, how the color changes, etc.) to generate a descriptor.
  • because the neighboring pixels are a different color and/or brightness during the day (e.g., blue and high illuminance) than during the night (e.g., black and low illuminance), the descriptors generated for the same feature may be different.
  • if a device or system (e.g., a vehicle, an XR device, a robotics device, etc.) compares the two different descriptors, the device or system will determine that the two descriptors correspond to different locations, whereas the feature (e.g., the edge of the traffic sign) actually corresponds to the same location. Using such a technique will thus cause the device or system to perform inaccurate localization.
  • the hybrid system 500 allows the feature detection to be performed using non-ML techniques and the descriptor generation to be performed using machine learning in an unsupervised manner (in which case no manual labeling is required for training).
  • the ability of the ML-based descriptor generator 506 (e.g., using cross-attention based on the transformer-based architecture) to generate a unique signature for features detected in different types of input data can scale with more data and requires no labeled data.
  • FIG. 6 is a flowchart illustrating an example of a process 600 for processing image and/or video data.
  • the process 600 can be performed by a computing device (or apparatus), or a component or system (e.g., a chipset) of the computing device.
  • the computing device (or component or system thereof) can include or can be the hybrid system 500 of FIG. 5.
  • the operations of the process 600 may be implemented as software components that are executed and run on one or more processors (e.g., the processor 710 of FIG. 7 or other processor(s)).
  • the transmission and reception of signals by the computing device in the process 600 may be enabled, for example, by one or more antennas and/or one or more transceivers such as wireless transceiver(s).
  • the computing device can obtain input data (e.g., input data 502).
  • the input data includes one or more images, radar data, light detection and ranging (LIDAR) data, any combination thereof, and/or other data.
  • the input data includes a first image of a scene with a first characteristic, a second image of a scene with a second characteristic, and a third image of a scene with a third characteristic.
  • the first characteristic can be a daytime characteristic
  • the second characteristic can be a nighttime characteristic
  • the third characteristic can be a weather condition, as shown in the illustrative example of FIG. 5.
  • the computing device can process, using a non-machine learning based feature detector (e.g., feature point detector 504), the input data to determine one or more feature points in the input data.
  • the non-machine learning based feature detector is configured to determine the one or more feature points based on a computer vision algorithm, as described herein (e.g., with respect to FIG. 5).
  • the computing device can determine, using a machine learning system (e.g., the ML-based descriptor generator 506), a respective feature descriptor for each respective feature point of the one or more feature points.
  • the machine learning system is a neural network.
  • the neural network may be or may include a transformer neural network.
  • the transformer neural network is configured to perform cross-attention.
  • the computing device or component or system thereof can utilize the transformer neural network to perform transformer cross-attention (e.g., cross-view attention) to determine a unique signature across the obtained input data (e.g., the different types of input data 502).
  • the unique signature can be used as a feature descriptor.
  • the respective feature descriptor for each respective feature point can be based on the unique signature.
  • the computing device (or component or system thereof) can apply a loss function to train the transformer neural network (e.g., by backpropagating gradients determined based on a loss determined by the loss function).
  • the loss function can enforce the same descriptor across different characteristics of the input data 502 (e.g., across all illumination conditions, weather, etc.), providing robustness to different inputs with varying characteristics.
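Tying the steps of process 600 together, the following sketch reuses the illustrative helpers defined in the earlier sketches (detect_keypoints, extract_patch, simple_descriptor — all assumptions, not components named in the patent): obtain input data, detect feature points with a non-ML detector, then compute a descriptor per feature point. In the described system, the per-point description step would be performed by the ML-based descriptor generator rather than the toy descriptor used here.

```python
def process_600(images):
    """Illustrative end-to-end flow: non-ML detection followed by per-point description."""
    results = []
    for image in images:
        keypoints = detect_keypoints(image)                               # non-ML feature detection
        patches = [extract_patch(image, int(x), int(y)) for x, y in keypoints]
        descriptors = [simple_descriptor(p) for p in patches]             # stand-in for the ML step
        results.append(list(zip(keypoints, descriptors)))
    return results
```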
  • the computing device can include any suitable device, such as a mobile device (e.g., a mobile phone), a desktop computing device, a tablet computing device, a wearable device (e.g., a VR headset, an AR headset, AR glasses, a network-connected watch or smartwatch, or other wearable device), a server computer, a vehicle (e.g., an autonomous vehicle or semi-autonomous vehicle) or computing device or system of the vehicle, a robotic device, a laptop computer, a smart television, a camera, and/or any other computing device with the resource capabilities to perform the processes described herein, including the process 600 and/or any other process described herein.
  • the computing device or apparatus may include various components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and/or other component(s) that are configured to carry out the steps of processes described herein.
  • the computing device may include a display, a network interface configured to communicate and/or receive the data, any combination thereof, and/or other component(s).
  • the network interface may be configured to communicate and/or receive Internet Protocol (IP) based data or other type of data.
  • the components can include and/or can be implemented using electronic circuits or other electronic hardware, which can include one or more programmable electronic circuits (e.g., microprocessors, graphics processing units (GPUs), digital signal processors (DSPs), central processing units (CPUs), and/or other suitable electronic circuits), and/or can include and/or be implemented using computer software, firmware, or any combination thereof, to perform the various operations described herein.
  • the process 600 is illustrated as a logical flow diagram, the operation of which represents a sequence of operations that can be implemented in hardware, computer instructions, or a combination thereof.
  • the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types.
  • the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations can be combined in any order and/or in parallel to implement the processes.
  • the process 600 and/or any other process described herein may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executing collectively on one or more processors, by hardware, or combinations thereof.
  • the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors.
  • the computer-readable or machine-readable storage medium may be non-transitory.
  • FIG. 7 illustrates an example computing device architecture 700 of an example computing device which can implement the various techniques described herein.
  • the computing device can include a mobile device, a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a video server, a vehicle (or computing device of a vehicle), or other device.
  • the computing device architecture 700 can implement the system 500 of FIG. 5.
  • the components of computing device architecture 700 are shown in electrical communication with each other using connection 705, such as a bus.
  • the example computing device architecture 700 includes a processing unit (CPU or processor) 710 and computing device connection 705 that couples various computing device components including computing device memory 715, such as read only memory (ROM) 720 and random-access memory (RAM) 725, to processor 710.
  • Computing device architecture 700 can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 710. Computing device architecture 700 can copy data from memory 715 and/or the storage device 730 to cache 712 for quick access by processor 710. In this way, the cache can provide a performance boost that avoids processor 710 delays while waiting for data. These and other engines can control or be configured to control processor 710 to perform various actions. Other computing device memory 715 may be available for use as well. Memory 715 can include multiple different types of memory with different performance characteristics.
  • Processor 710 can include any general-purpose processor and a hardware or software service, such as service 1 732, service 2 734, and service 3 736 stored in storage device 730, configured to control processor 710, as well as a special-purpose processor where software instructions are incorporated into the processor design.
  • Processor 710 may be a self-contained system, containing multiple cores or processors, a bus, memory controller, cache, etc.
  • a multi-core processor may be symmetric or asymmetric.
  • input device 745 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, speech, and so forth.
  • Output device 735 can also be one or more of a number of output mechanisms known to those of skill in the art, such as a display, projector, television, speaker device, etc.
  • multimodal computing devices can enable a user to provide multiple types of input to communicate with computing device architecture 700.
  • Communication interface 740 can generally govern and manage the user input and computing device output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
  • Storage device 730 is a non-volatile memory and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs) 725, read only memory (ROM) 720, and hybrids thereof.
  • Storage device 730 can include services 732, 734, 736 for controlling processor 710.
  • Other hardware or software modules or engines are contemplated.
  • Storage device 730 can be connected to the computing device connection 705.
  • a hardware module that performs a particular function can include the software component stored in a computer- readable medium in connection with the necessary hardware components, such as processor 710, connection 705, output device 735, and so forth, to carry out the function.
  • aspects of the present disclosure are applicable to any suitable electronic device (such as security systems, smartphones, tablets, laptop computers, vehicles, drones, or other devices) including or coupled to one or more active depth sensing systems. While described below with respect to a device having or coupled to one light projector, aspects of the present disclosure are applicable to devices having any number of light projectors and are therefore not limited to specific devices.
  • the term “device” is not limited to one or a specific number of physical objects (such as one smartphone, one controller, one processing system and so on).
  • a device may be any electronic device with one or more parts that may implement at least some portions of this disclosure. While the below description and examples use the term “device” to describe various aspects of this disclosure, the term “device” is not limited to a specific configuration, type, or number of objects.
  • the term “system” is not limited to multiple components or specific embodiments. For example, a system may be implemented on one or more printed circuit boards or other substrates and may have movable or static components. While the below description and examples use the term “system” to describe various aspects of this disclosure, the term “system” is not limited to a specific configuration, type, or number of objects.
  • Individual embodiments may be described above as a process or method which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
  • Processes and methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media.
  • Such instructions can include, for example, instructions and data which cause or otherwise configure a general-purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network.
  • the computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, source code, etc.
  • computer-readable medium includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data.
  • a computer-readable medium may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media, flash memory, memory or memory devices, magnetic or optical disks, USB devices provided with non-volatile memory, networked storage devices, compact disk (CD) or digital versatile disk (DVD), any suitable combination thereof, among others.
  • a computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, an engine, a software package, a class, or any combination of instructions, data structures, or program statements.
  • a code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents.
  • Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.
  • the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like.
  • non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
  • Devices implementing processes and methods according to these disclosures can include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and can take any of a variety of form factors.
  • the program code or code segments to perform the necessary tasks may be stored in a computer-readable or machine-readable medium.
  • a processor(s) may perform the necessary tasks.
  • form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on.
  • Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
  • the instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.
  • The term "coupled to" refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.
  • Claim language or other language reciting "at least one of" a set and/or "one or more" of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim.
  • claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B.
  • claim language reciting "at least one of A, B, and C" or "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, and so on), or any other ordering, duplication, or combination of A, B, and C.
  • the language "at least one of" a set and/or "one or more" of a set does not limit the set to the items listed in the set.
  • claim language reciting “at least one of A and B” or “at least one of A or B” may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.
  • the phrases “at least one” and “one or more” are used interchangeably herein.
  • Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” “one or more processors configured to,” “one or more processors being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s).
  • claim language reciting "at least one processor configured to: X, Y, and Z" means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z.
  • claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.
  • one element may perform all functions, or more than one element may collectively perform the functions.
  • each function need not be performed by each of those elements (e.g., different functions may be performed by different elements) and/or each function need not be performed in whole by only one element (e.g., different elements may perform different sub-functions of a function).
  • one element may be configured to cause the other element to perform all functions, or more than one element may collectively be configured to cause the other element to perform the functions.
  • where an entity (e.g., any entity or device described herein) is described as being configured to perform certain functions, the entity may be configured to cause one or more elements (individually or collectively) to perform the functions.
  • the one or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more (or all) of the functions, and/or any combination thereof.
  • the entity may be configured to cause one component to perform all functions, or to cause more than one component to collectively perform the functions.
  • each function need not be performed by each of those components (e.g., different functions may be performed by different components) and/or each function need not be performed in whole by only one component (e.g., different components may perform different sub-functions of a function).
  • the techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods described above.
  • the computer-readable data storage medium may form part of a computer program product, which may include packaging materials.
  • the computer-readable medium may comprise memory or data storage media, such as random-access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read-only memory (ROM), non-volatile random-access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like.
  • the techniques additionally, or alternatively, may be realized at least in part by a computer- readable communication medium that carries or communicates program code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer, such as propagated signals or waves.
  • the program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • a general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine.
  • a processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term "processor," as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.
  • Illustrative aspects of the disclosure include:
  • Aspect 1 An apparatus for processing image data comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: obtain input data; process, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and determine, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
  • Aspect 2 The apparatus of Aspect 1, wherein the input data includes at least one of one or more images, radar data, or light detection and ranging (LIDAR) data.
  • Aspect 3 The apparatus of any one of Aspects 1 or 2, wherein the input data includes a first image of a scene with a first characteristic, a second image of a scene with a second characteristic, and a third image of a scene with a third characteristic.
  • Aspect 4 The apparatus of Aspect 3, wherein the first characteristic is a daytime characteristic, the second characteristic is a nighttime characteristic, and the third characteristic is a weather condition.
  • Aspect 5 The apparatus of any one of Aspects 1 to 4, wherein the non-machine learning based feature detector is configured to determine the one or more feature points based on a computer vision algorithm.
  • Aspect 6 The apparatus of any one of Aspects 1 to 5, wherein the machine learning system is a neural network.
  • Aspect 7 The apparatus of Aspect 6, wherein the neural network is a transformer neural network.
  • Aspect 8 A method for processing image data comprising: obtaining input data; processing, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data; and determining, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
  • Aspect 9 The method of Aspect 8, wherein the input data includes at least one of one or more images, radar data, or light detection and ranging (LIDAR) data.
  • Aspect 10 The method of any one of Aspects 8 or 9, wherein the input data includes a first image of a scene with a first characteristic, a second image of a scene with a second characteristic, and a third image of a scene with a third characteristic.
  • Aspect 11 The method of Aspect 10, wherein the first characteristic is a daytime characteristic, the second characteristic is a nighttime characteristic, and the third characteristic is a weather condition.
  • Aspect 12 The method of any one of Aspects 8 to 11, further comprising determining, using the non-machine learning based feature detector, the one or more feature points based on a computer vision algorithm.
  • Aspect 13 The method of any one of Aspects 8 to 12, wherein the machine learning system is a neural network.
  • Aspect 14 The method of Aspect 13, wherein the neural network is a transformer neural network.
  • Aspect 15 A non-transitory computer-readable storage medium having stored thereon instructions which, when executed by one or more processors, cause the one or more processors to perform any of the operations of any of Aspects 8 to 14.
  • Aspect 16 An apparatus comprising means for performing any of the operations of any of Aspects 8 to 14.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)

Abstract

Systems and techniques for processing sensor data are described. For example, a method can include obtaining input data and processing, using a non-machine learning based feature detector, the input data to determine one or more feature points in the input data. The method can further include determining, using a machine learning system, a respective feature descriptor for each respective feature point of the one or more feature points.
PCT/US2023/074077 2022-11-04 2023-09-13 Système hybride pour la détection de caractéristiques et la génération de descripteurs WO2024097469A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263382458P 2022-11-04 2022-11-04
US63/382,458 2022-11-04
US18/449,461 2023-08-14
US18/449,461 US20240153245A1 (en) 2022-11-04 2023-08-14 Hybrid system for feature detection and descriptor generation

Publications (1)

Publication Number Publication Date
WO2024097469A1 true WO2024097469A1 (fr) 2024-05-10

Family

ID=88297075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/074077 WO2024097469A1 (fr) 2022-11-04 2023-09-13 Système hybride pour la détection de caractéristiques et la génération de descripteurs

Country Status (1)

Country Link
WO (1) WO2024097469A1 (fr)

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ETHAN RUBLEE ET AL: "ORB: An efficient alternative to SIFT or SURF", COMPUTER VISION (ICCV), 2011 IEEE INTERNATIONAL CONFERENCE ON, IEEE, 6 November 2011 (2011-11-06), pages 2564 - 2571, XP032093933, ISBN: 978-1-4577-1101-5, DOI: 10.1109/ICCV.2011.6126544 *
LI SHUO ET AL: "A Research of ORB Feature Matching Algorithm Based on Fusion Descriptor", 2020 IEEE 5TH INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC), IEEE, 12 June 2020 (2020-06-12), pages 417 - 420, XP033794618, DOI: 10.1109/ITOEC49072.2020.9141770 *
MA TAIYUAN ET AL: "ASD-SLAM: A Novel Adaptive-Scale Descriptor Learning for Visual SLAM", 2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), IEEE, 19 October 2020 (2020-10-19), pages 809 - 816, XP033873148, DOI: 10.1109/IV47402.2020.9304626 *
SUN JIAMING ET AL: "LoFTR: Detector-Free Local Feature Matching with Transformers", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 20 June 2021 (2021-06-20), pages 8918 - 8927, XP034006635, DOI: 10.1109/CVPR46437.2021.00881 *
VIBHAKAR VEMULAPATI ET AL: "ORB-based SLAM accelerator on SoC FPGA", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 July 2022 (2022-07-18), XP091273864 *
