WO2023211435A1 - Depth estimation for SLAM systems using monocular cameras - Google Patents

Depth estimation for SLAM systems using monocular cameras

Info

Publication number
WO2023211435A1
Authority
WO
WIPO (PCT)
Prior art keywords
current
image
pixel
camera
candidate
Prior art date
Application number
PCT/US2022/026561
Other languages
English (en)
Inventor
Jun Liu
Fan DENG
Original Assignee
Innopeak Technology, Inc.
Priority date
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Priority to PCT/US2022/026561
Publication of WO2023211435A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/579 Depth or shape recovery from multiple images from motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose

Definitions

  • This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for determining depth information from image data that are used in an extended reality application for simultaneous localization and mapping (SLAM).
  • Depth image estimation from monocular moving cameras is necessary for augmented reality (AR) applications that need to create occlusion and collision effects for virtual objects in a physical environment and generate a dense mesh map of the environment.
  • Some existing solutions are based on polar rectification and stereo matching, and use pose information from SLAM algorithms.
  • A pixel-based triangulation algorithm is applied to produce depth information; however, this approach does not work in arbitrary camera pose configurations and can introduce extra computation burden and noise into the depth estimation.
  • Some other solutions use sparse feature points to make dense depth estimation for real-time rendering purposes. Despite a fast computation speed, such dense depth estimation offers a limited resolution for the resulting depth image, particularly when the number of sparse points is not sufficiently large.
  • Some of the aforementioned solutions intend to replace a SLAM-based depth estimation module with an end-to-end neural network; however, the SLAM-based module still offers better depth estimates than a neural network that requires training on a large test data set. It would be beneficial to have a depth estimation mechanism that determines depth information based on image data that are used in an extended reality application (e.g., an AR user application) in a fast, accurate, and efficient manner.
  • Various embodiments of this application are directed to determining a depth image or map corresponding to an image that is captured for SLAM in an extended reality application.
  • Each current pixel on a current image corresponds to a reference pixel on a corresponding reference image.
  • Epipolar lines of the current image are determined based on current pixel locations and a reference camera location that are captured on the current image.
  • Epipolar lines of the reference image are determined based on corresponding reference pixel locations and a current camera location that are captured on the reference image.
  • a respective epipolar line of the reference image is searched to identify a corresponding reference pixel thereon, thereby avoiding polar rectification.
  • Information of the respective epipolar line of the reference image (e.g., information of a respective tilting angle) is pre-determined and stored in a lookup table. Such information is applied to determine a disparity of a current epipolar distance of the current pixel in the current image and a reference epipolar distance of the reference pixel of the reference image, and the disparity is further converted to a depth value at each current pixel of the current image, thereby creating a depth map corresponding to the entire current image. As the pixel-based lookup table is applied, the depth map of the current image is determined at a faster rate and with a better accuracy level, while not requiring polar rectification or deep learning techniques.
  • a method is implemented at an electronic device having one or more processors and memory. The method includes obtaining a current image captured by a camera having a current camera pose and a reference image captured by the camera having a reference camera pose. Each current pixel of at least a subset of the current image has a current pixel position in the current image and corresponds to a respective reference pixel of the reference image.
  • the method further includes for each current pixel of at least the subset of the current image, obtaining a plurality of pixel conversion parameters of the current pixel; determining a disparity of a current distance of the current pixel and a current epipole of the current image and a reference distance of the respective reference pixel and a reference epipole of the reference image; and in accordance with the plurality of pixel conversion parameters, determining a depth value of the current pixel from the disparity of the current distance and the reference distance.
  • the method further includes creating a depth map corresponding to the current image, the depth map including the depth value of each current pixel of the subset of the current image.
  • the method includes creating a lookup table correlating a plurality of pixel positions to the plurality of pixel conversion parameters based on the reference camera pose, the current camera pose, a camera intrinsic matrix, and an epipole position in the current image.
  • Obtaining the plurality of pixel conversion parameters of the current pixel of the current image further includes checking the lookup table to identify the plurality of pixel conversion parameters corresponding to the current pixel.
  • the lookup table includes one or more of: the current distance of the current pixel and the current epipole of the current image, information of a tilting angle of an epipolar line in the reference image, a disparity range of the disparity of the current and reference distances, and the plurality of pixel conversion parameters.
  • some embodiments include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.
  • some embodiments include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.
  • a neural network named DeepVideoMVS is proposed to estimate a depth map from video captured by a monocular camera.
  • This neural network applies a plane sweep method to construct a 3D cost volume to train a U-net to estimate a depth map directly.
  • A lookup-table-based method is used to replace a plane sweep cost volume construction step.
  • the lookup table includes epipolar geometry information, and the cost volume's inverse depth dimension is replaced by a disparity dimension, which directly reflects an accurate pixel position in the image and avoids requiring the neural network to learn the epipolar geometry information already given in the lookup table. This replacement can improve the accuracy of methods using the neural network named DeepVideoMVS.
  • Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.
  • Figure 2 is a block diagram illustrating an electronic system, in accordance with some embodiments.
  • FIG. 3 is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.
  • Figure 4A is an example stereo vision environment for forming a three- dimensional (3D) image including a current image and a corresponding depth map, in accordance with some embodiments.
  • Figure 4B is a flow diagram of a process of generating a depth map from a current image, in accordance with some embodiments.
  • Figure 5 is an example list of equations applied to convert image data to depth information, in accordance with some embodiments.
  • Figure 6 is a flow diagram of an example process of converting image data (e.g., a current image) to depth information (e.g., a depth image or map), in accordance with some embodiments.
  • Figure 7 illustrates a feature extraction scheme to extract an image feature from an image pixel, in accordance with some embodiments.
  • Figure 8 is a flow diagram of a noise filtering process that reduces noise in an inverse depth map, in accordance with some embodiments.
  • Figure 9 is a flow diagram of a depth mapping method, in accordance with some embodiments.
  • Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments.
  • the one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a depth camera, a visible light camera).
  • the one or more client devices 104 include a head-mounted display 104D configured to render extended reality content.
  • Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102.
  • the one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104.
  • the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.
  • storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data.
  • the one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102.
  • the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104.
  • the client devices 104 include a game console (e.g., formed by the head-mounted display 104D) that executes an interactive online gaming application.
  • the game console receives a user instruction and sends it to a game server 102 with user data.
  • the game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.
  • the one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100.
  • the one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof.
  • the one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol.
  • a connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof.
  • the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another.
  • At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other electronic systems that route data and messages.
  • the head-mounted display 104D (also called AR glasses 104D) includes one or more cameras (e.g., a visible light camera), a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display.
  • the camera(s) and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data.
  • the camera captures hand gestures of a user wearing the AR glasses 104D.
  • the microphone records ambient sound, including user’s voice commands.
  • both video or static visual data captured by the visible light camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses.
  • the video, static image, audio, or inertial sensor data captured by the AR glasses 104D are processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses.
  • the device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D.
  • the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render virtual objects with high fidelity or interact with user selectable display items on the user interface.
  • SLAM techniques are applied in the data processing environment 100 to process video data or static image data captured by the AR glasses 104D with inertial sensor data. Device poses are recognized and predicted, and a scene in which the AR glasses 104D is located is mapped and updated.
  • the SLAM techniques are optionally implemented by AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.
  • video data or static image data captured by a client device are also applied to determine depth images or maps.
  • An electronic device obtains a current image captured by a camera having a current camera pose and a reference image captured by the camera having a reference camera pose.
  • Each current pixel of at least a subset of the current image has a current pixel position in the current image and corresponds to a respective reference pixel of the reference image.
  • a plurality of pixel conversion parameters are determined in advance based on the current pixel position of the current pixel, and stored (e.g., in a lookup table 248) in the electronic device.
  • the plurality of pre-determined pixel conversion parameters are extracted directly for each current pixel of the subset of the current image.
  • the electronic device determines a disparity of a current distance of the current pixel and a current epipole of the current image and a reference distance of the respective reference pixel and a reference epipole of the reference image, and converts the disparity of the current distance and the reference distance to a depth value of the current pixel based on the pre-determined pixel conversion parameters.
  • a depth map of the current image is created to include the depth value of each current pixel of the subset of the current image.
  • FIG. 2 is a block diagram illustrating an electronic system 200, in accordance with some embodiments.
  • the electronic system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof.
  • An example of the electronic system 200 includes a mobile phone 104C or the AR glasses 104D.
  • the electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset).
  • the electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice- command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls.
  • the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard.
  • the client device 104 includes one or more cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices.
  • the electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays.
  • the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.
  • the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space.
  • Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.
  • Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:
  • Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;
  • Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);
  • Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;
  • Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;
  • One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, extended reality applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);
  • Pose determination and prediction module 226 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), where in some embodiments, the pose determination and prediction module 226 includes a SLAM module 228 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data;
  • Pose-based rendering module 230 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 based on a camera pose of the camera 260;
  • Depth image module 232 for determining pixel conversion parameters of each pixel of a subset of an input image and creating a depth map of the input image based on the pixel conversion parameters of the subset of the input image;
  • the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200.
  • more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 262 are stored at the server 102 and storage 106, respectively.
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, and various subsets of these modules may be combined or otherwise rearranged in various embodiments.
  • memory 206 optionally, stores a subset of the modules and data structures identified above.
  • memory 206, optionally, stores additional modules and data structures not described above.
  • FIG. 3 is a flowchart of a process 300 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a visual-inertial SLAM module 228, in accordance with some embodiments.
  • the process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocation 306, and global pose graph optimization 308.
  • an RGB camera 260 captures image data of a scene at an image rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data.
  • An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (312) to provide data of a variation of device poses 340.
  • the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (314).
  • vision-only structure from motion (SfM) techniques 314 are applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
  • a sliding window 318 and associated states from a loop closure 320 are used to optimize (322) a VIO.
  • the VIO corresponds (324) to a keyframe of a smooth video transition, and a corresponding loop is detected (326).
  • features are retrieved (328) and used to generate the associated states from the loop closure 320.
  • During global pose graph optimization 308, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (330) based on the states from the loop closure 320, and a keyframe database 332 is updated with the keyframe associated with the VIO.
  • the features that are detected and tracked (310) are used to monitor (334) motion of an object in the image data and estimate image-based poses 336, e.g., according to the image rate.
  • the inertial sensor data that are pre-integrated (312) may be propagated (338) based on the motion of the object and used to estimate inertial-based poses 340, e.g., according to a sampling frequency of the IMU 280.
  • the image-based poses 336 and the inertial-based poses 340 are stored in the database 240 and used by the pose determination and prediction module 226 to estimate and predict poses that are used by a real time video rendering module 230.
  • the SLAM module 228 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 336 to estimate and predict more poses 340 that are further used by the pose-based rendering module 230.
  • SLAM high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors (e.g., the RGB camera 260, a LiDAR scanner) and the IMU 280.
  • the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond).
  • Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occur after the image is rendered.
  • ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing images.
  • relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction.
  • the image and inertial sensor data are stored in one of multiple Standard Template Library (STL) containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use.
  • the image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
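  • As an illustration of this timestamp-keyed buffering, the sketch below shows one possible way to store IMU samples and retrieve those that fall between two image timestamps for synchronization; the class and method names are hypothetical, and only standard-library behavior (bisect on a sorted timestamp list) is assumed.

```python
import bisect
from collections import deque

class ImuBuffer:
    """Hypothetical container that keeps IMU samples ordered by timestamp."""

    def __init__(self):
        self.timestamps = []    # monotonically increasing timestamps (seconds)
        self.samples = deque()  # (timestamp, gyro, accel) tuples

    def insert(self, timestamp, gyro, accel):
        # Samples arrive in time order, so appending keeps both containers sorted.
        self.timestamps.append(timestamp)
        self.samples.append((timestamp, gyro, accel))

    def between(self, t_start, t_end):
        """Return IMU samples with t_start <= timestamp <= t_end, e.g., the
        samples recorded between two consecutive image frames."""
        lo = bisect.bisect_left(self.timestamps, t_start)
        hi = bisect.bisect_right(self.timestamps, t_end)
        return list(self.samples)[lo:hi]
```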
  • Figure 4A is an example stereo vision environment 400 for forming a three- dimensional (3D) image including a current image 402 and a corresponding depth map 414, in accordance with some embodiments.
  • the depth map 414 is generated based on a comparison between the current image 402 and a reference image 404.
  • Both of the current and reference images 402 and 404 are captured by a camera 260.
  • the current image 402 corresponds to a current camera pose of the camera 260, e.g., is captured when the camera 260 is located at a current camera position O 1 (406C).
  • the reference image corresponds to a reference camera pose of the camera 260, e.g., is captured when the camera 260 is located at a reference camera position O 2 (406R).
  • At least a subset of the current image 402 records the same content as a subset of the reference image 404. That is, each current pixel 408C of at least the subset of the current image 402 has a current pixel position in the current image 402 and corresponds to a respective reference pixel 408R of the reference image 404.
  • an object point is located at an object location P (4080) in the stereo vision environment 400 and captured in both of the current image 402 and reference image 404.
  • the object point 406 corresponds to a current pixel p (408C) in the current image 402 and a reference pixel p' (408R) in the reference image 404.
  • the reference image 404 and the current image 402 are captured at two distinct instants of time by the camera 260, which is moving.
  • the reference image 404 is optionally captured prior to or after the current image 402.
  • the camera 260 has a positional shift between the current camera position O 1 (406C) and reference camera position O 2 (406R).
  • the current camera position O 1 (406C) is located in a field of view of the camera 260 at the reference camera position O 2 (406R), and is projected onto the reference image 404 at a reference epipole e' (410R).
  • the reference camera position O 2 (406R) is located in a field of view of the camera 260 at the current camera position O 1 (406C), and is projected onto the current image 402 at a current epipole e (410C).
  • a current epipolar line 412C connects the current pixel p (408C) to the current epipole e (410C) and has a current distance p 0 .
  • a reference epipolar line 412R connects the reference pixel p' (408R) to the reference epipole e' (410R) and has a reference distance p 1 .
  • the current distance p 0 and reference distance p 1 have a disparity (Disparity).
  • the object point located at the object location P (408O) corresponds to a depth D in the field of view of the camera 260 located at the current camera position O 1 (406C).
  • the depth D is correlated with the disparity (Disparity) of the current distance p 0 and reference distance p 1 according to equation (1.1), where C 1 , C 2 , C 3 , and C 4 are a set of pixel conversion parameters 416 that are associated with the current pixel 408C corresponding to the object point 408O.
  • the pixel conversion parameters 416 are distinct for each individual pixel in the current image 402.
  • the set of pixel conversion parameters 416 are pre-determined and stored in a lookup table 248 in association with each pixel (e.g., the current pixel 408C) of the current image 402.
  • In some embodiments, the corresponding depth D is correlated with the disparity (Disparity) according to equation (1.2).
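  • Equations (1.1) and (1.2) are not reproduced in this text; the description only implies a per-pixel mapping between depth and disparity governed by the four conversion parameters C 1 -C 4 . The sketch below assumes, purely for illustration, a rational form D = (C 1 + C 2 ·Disparity) / (C 3 + C 4 ·Disparity); the actual functional form is the one given by the patent's equations.

```python
def disparity_to_depth(disparity, c1, c2, c3, c4):
    """Assumed rational mapping from disparity to depth for one pixel.
    The true form is the patent's equation (1.1); this is only an
    illustrative stand-in using the same four per-pixel parameters."""
    return (c1 + c2 * disparity) / (c3 + c4 * disparity)

def depth_to_disparity(depth, c1, c2, c3, c4):
    """Inverse of the assumed mapping (analog of equation (1.2))."""
    return (c1 - c3 * depth) / (c4 * depth - c2)
```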
  • Figure 4B is a flow diagram of a process 450 of generating a depth map 414 from a current image 402, in accordance with some embodiments.
  • the current image 402 and reference image 404 are captured by a moving camera 260 at two distinct locations of the stereo vision environment 400 including the current camera position O 1 (406C) and the reference camera position O 2 (406R), respectively.
  • A plurality of pixel conversion parameters 416 (e.g., C 1 -C 4 ) are determined and stored for each pixel in the current image 402.
  • the pixel conversion parameters 416 are determined based on the reference camera pose, the current camera pose, a camera intrinsic matrix, and an epipole position in the current image 402, e.g., as described in equations (2)-(10) in Figure 5.
  • the plurality of pixel conversion parameters 416 are stored with the current distance p 0 of the current pixel 408C and epipole e (410C) of the current image 402, information of a tilting angle of an epipolar line 412C in the reference image 404, and a disparity range of the disparity of the current and reference distances p 0 and p 1 in a lookup table 248.
  • the disparity of the current and reference distances p 0 and p 1 is determined (418), e.g., according to one or more of: the current distance p 0 in the current image 402, information of the tilting angle of an epipolar line 412C in the reference image 404, and the disparity range of the disparity of the current and reference distances p 0 and p 1 . Further, in accordance with the pixel conversion parameters 416 (e.g., C 1 -C 4 ), a depth value 420 of each current pixel 408C in a subset of the current image 402 is determined from the disparity of the current and reference distances p 0 and p 1 based on equation (1.1) or (1.2).
  • an inverse depth value of the current pixel 408C is determined from the disparity of the current and reference distances p 0 and p 1 and further converted to the depth value 420 of the current pixel 408C.
  • the depth values 420 of the subset of the current image 402 are applied to create a depth map 414 corresponding to the current image 402.
  • the reference image 404 is selected from a plurality of candidate images 430 based on a reference selection criterion 250.
  • the reference selection criterion 250 requires at least one condition of: an angle between two camera bearing directions associated with the current image 402 and the reference image 404 satisfying a bearing direction angle requirement, the current and reference images 402 and 404 sharing at least a threshold number of common feature points, and an error of a projected camera pose and the current camera pose being less than a pose error threshold.
  • the projected camera pose is determined for the current image based on the reference camera pose and common feature points of the reference image 404 and current image 402.
  • an angle 422 is determined between a current camera bearing direction 424C associated with the current image 402 and a candidate camera bearing direction associated with the candidate image 430 (e.g., a reference camera bearing direction 424R associated with the reference image 404). If the angle corresponding to one of the candidate images 430 satisfies the bearing direction angle requirement (e.g., the angle falls into a predefined bearing direction angle range), the one of the candidate images 430 is selected as the reference image 404. In an example, the angle corresponding to the one of the candidate images 430 is closer to a predefined bearing angle than an angle corresponding to any other candidate image 430, and therefore, the one of the candidate images 430 is selected as the reference image.
  • the bearing direction angle requirement e.g., the angle falls into a predefined bearing direction angle range
  • a number of common feature points exist in both the candidate image 430 and the current image 402. If the number of common feature points determined based on one of the candidate images 430 is within the feature point number range, the one of the candidate images 430 is selected as the reference image.
  • An example of the feature point number range is a feature point number threshold and above, e.g., ≥ 500. In some situations, the one of the candidate images 430 is selected as the reference image, because the number of common feature points determined based on the one of the candidate images 430 is greater than a number of common feature points determined based on any other candidate image 430.
  • a plurality of current feature points are identified in the current image 402.
  • a plurality of candidate feature points are identified in the candidate image 430, and a candidate camera pose corresponding to the current image 402 is estimated by comparing the candidate feature points and the current feature points.
  • a camera pose error is derived between the candidate and current camera poses of the current image 402. If the camera pose error determined based on the candidate feature points of one of the candidate images 430 is less than the pose error threshold, the one of the candidate images 430 is selected as the reference image.
  • the camera pose error determined based on the candidate feature points of one of the candidate images 430 is smallest among the camera pose errors determined based on all of the candidate images 430, and the one of the candidate images 430 is selected as the reference image 404.
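  • A minimal sketch of the reference selection criterion 250 described above; the thresholds and the per-candidate metrics (bearing angle, number of common feature points, pose error) are assumed to have been computed elsewhere, and their names and values here are purely illustrative.

```python
def select_reference(candidates,
                     max_bearing_angle_deg=15.0,
                     min_common_features=500,
                     max_pose_error=2.0):
    """Pick the candidate image whose precomputed metrics best satisfy the
    reference selection criterion 250. Each candidate is a dict with keys
    'bearing_angle_deg', 'common_features', and 'pose_error'."""
    best, best_error = None, float("inf")
    for cand in candidates:
        if cand["bearing_angle_deg"] > max_bearing_angle_deg:
            continue  # overlap between the two views is too small
        if cand["common_features"] < min_common_features:
            continue  # not enough shared sparse feature points
        if cand["pose_error"] > max_pose_error:
            continue  # relative pose is too uncertain
        if cand["pose_error"] < best_error:
            best, best_error = cand, cand["pose_error"]  # keep the most reliable candidate
    return best

# Example usage with made-up metrics:
# reference = select_reference([
#     {"bearing_angle_deg": 8.0, "common_features": 620, "pose_error": 0.7},
#     {"bearing_angle_deg": 25.0, "common_features": 900, "pose_error": 0.3},
# ])
```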
  • Figure 5 is an example list of equations 500 applied to convert image data to depth information, in accordance with some embodiments.
  • a depth map 414 is generated based on a comparison between a current image 402 and a reference image 404. Both of the current and reference images 402 and 404 are captured by a camera 260.
  • the current image 402 corresponds to a current camera pose of the camera 260
  • the reference image 404 corresponds to a reference camera pose of the camera 260.
  • the current camera pose includes a current translational position (i.e., a current camera position T 0 ) and a current rotational position (i.e., a current camera orientation R 0 ).
  • the reference camera pose includes a reference translational position (i.e., a reference camera position T 1 ) and a reference rotational position (i.e., a reference camera orientation R 1 ).
  • the camera 260 has a camera intrinsic matrix K.
  • a current epipole e (410C) in the current image 402 is represented as (eu 0 , ev 0 ).
  • a current pixel 408C is located at (u, v) of the current image 402, and two vectors are defined for the current pixel 408C of the current image 402 according to equation (2).
  • the current pixel 408C of the current image 402 corresponds to a reference pixel 408R in the reference image 404, and a location of the reference pixel 408R in the reference image 404 is represented in equation (3).
  • a reference epipole 410R of the reference image 404 is represented as
  • a projection-to-epipole difference (Δu, Δv) corresponds to a reference epipolar line 412R connecting the reference pixel 408R and the reference epipole 410R of the reference image 404, and is represented by equations (4.1) and (4.2).
  • a tilting angle of the reference epipolar line 412R is θ, as described by cos(θ) and sin(θ) in equation (5), which is established based on a constant N 1 .
  • the constant N 1 is defined in equation (6) by elements of the two vectors in equation (2).
  • a current distance p 0 of the current pixel 408C and current epipole 410C of the current image 402 is represented by equation (7).
  • the reference distance p 1 of the reference pixel 408R and reference epipole 410R of the reference image 404 is represented by equation (8).
  • a disparity of the current distance p 0 and the reference distance p 1 is further determined in equation (10).
  • the plurality of pixel conversion parameters 416 (e.g., C 1 -C 4 ) in equation (10) are determined from the quantities defined in equations (2)-(9).
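  • The per-pixel quantities that equations (2)-(10) feed into the lookup table 248 can be precomputed from the two camera poses and the camera intrinsic matrix K. The sketch below illustrates one such precomputation (epipole position, distance of each current pixel to the epipole, and the tilting angle of the induced epipolar line) under an assumed world-to-camera convention x_cam = R·X + t; it is not a transcription of the patent's equations.

```python
import numpy as np

def skew(t):
    """Cross-product (skew-symmetric) matrix of a 3-vector."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def epipolar_lookup(K, R_cur, t_cur, R_ref, t_ref, width, height):
    """Per-pixel distance to the current epipole and cos/sin of the tilting
    angle of the epipolar line induced in the reference image."""
    # Relative pose from the current camera to the reference camera.
    R_rel = R_ref @ R_cur.T
    t_rel = t_ref - R_rel @ t_cur
    K_inv = np.linalg.inv(K)
    F = K_inv.T @ skew(t_rel) @ R_rel @ K_inv  # fundamental matrix

    # Current epipole e: projection of the reference camera center into the current image.
    C_ref = -R_ref.T @ t_ref
    e = K @ (R_cur @ C_ref + t_cur)
    e = e[:2] / e[2]

    u, v = np.meshgrid(np.arange(width), np.arange(height))
    dist_to_epipole = np.hypot(u - e[0], v - e[1])  # current distance p0 per pixel

    # Epipolar line l' = F @ p in the reference image, one line per current pixel.
    ones = np.ones_like(u, dtype=float)
    lines = np.einsum("ij,jhw->ihw", F, np.stack([u, v, ones]))
    a, b = lines[0], lines[1]
    tilt = np.arctan2(a, -b)  # tilting angle of the line p'e'
    return e, dist_to_epipole, np.cos(tilt), np.sin(tilt)
```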
  • FIG. 6 is a flow diagram of an example process 600 of converting image data (e.g., a current image 402) to depth information (e.g., a depth image or map 414), in accordance with some embodiments.
  • the process 600 is optionally implemented jointly by a depth image module 232 and an SLAM module 228 of an electronic system 200, and includes one or more of: reference selection 602, image feature extraction 604, lookup table generation 606, prior depth estimation 608, depth-to-disparity conversion 610, cost volume construction 612, semi-global matching 614, disparity-to-depth conversion 616, and depth image denoising 618.
  • This process 600 uses multiple image frames to construct a cost volume that is further processed by a semi-global matching algorithm to estimate the depth information that can be further processed to remove its noise and improve its image quality. As such, the process 600 does not need polar rectification.
  • pixel information of the current and reference images 402 and 404 is directly stored in a lookup table 248 during lookup table generation 606.
  • Examples of such directly-stored pixel information include, but are not limited to, a current distance p 0 , a reference distance p 1 , a tilting angle of an epipolar line 412C or 412R, and a disparity range of the current and reference distances p 0 and p 1 .
  • pixel information of the current and reference images 402 and 404 is correlated according to one or more correlation equations (e.g., equations (1.1)-(10)), and coefficients of such equations (e.g., pixel conversion parameters 416 (e.g., C 1 -C 4 )) are stored in the lookup table 248 during lookup table generation 606.
  • the pixel information of the current and reference images 402 and 404 is extracted from the lookup table 248, and applied in depth-to- disparity conversion 610 and disparity-to-depth conversion 616.
  • the current image 402 is applied jointly with the reference image 404 to enable stereo imaging (i.e., measure a parallax and determine the depth map 414), and in reference selection 602, the reference image 404 is selected from a plurality of candidate images 430 based on a reference selection criterion 250.
  • the reference image 404 is optionally captured prior to or after the current image 402, and corresponds to a reference camera pose that is different from a current camera pose associated with the current image 402.
  • the reference selection criterion 250 requires at least one condition of: an angle between two camera bearing directions associated with the current image 402 and the reference image 404 satisfying a bearing direction angle requirement, the current and reference images 402 and 404 sharing at least a threshold number of common feature points, and an error of a projected camera pose and the current camera pose being less than a pose error threshold.
  • the projected camera pose is determined for the current image based on the reference camera pose and common feature points of the reference and current images 404 and 402.
  • sufficient overlapping regions are formed between the current image 402 and the reference image 404. This is measured by an angle 422 of camera bearing directions 424C and 424R of the current and reference images 402 and 404. For example, the angle 422 of the camera bearing directions 424C and 424R is less than a bearing direction angle threshold.
  • the translation from the current image 402 to the reference image 404 is in a reasonable range that is neither too small nor too large, i.e., a distance between the camera positions 406C and 406R is in a range between two distance thresholds.
  • the current image 402 and reference image 404 share a number of common sparse feature points that is greater than the threshold number of common feature points. Further, in some embodiments, a relative pose error of the reference image 404 to the current image 402 is measured by a reprojection error of common sparse feature points, and is less than the pose error threshold in accordance with the reference selection criterion 250.
  • image features are extracted from the current image 402 and reference image 404.
  • a pixel-level image feature is determined from a gray level or gradient of each image pixel of the current or reference image 402 or 404.
  • Census transformation is applied to extract image features of the images 402 and 404.
  • a corresponding image feature is stored in a 32-bit descriptor generated from a 7x7 region that includes 48 pixels surrounding the respective pixel. The region is optionally centered or not centered at the image pixel.
  • a gray level or gradient of the image pixel is compared with gray levels or gradients of 32 selected adjacent pixels in the region to determine 32 bits in the 32-bit descriptor, respectively.
  • the 7×7 region includes 7 rows and 7 columns.
  • the image pixel is marked with “o”, and the 32 adjacent pixels are selected from the 48 remaining pixels and marked with “x”.
  • if the gray level of the image pixel is greater than the gray level of the adjacent pixel, a respective bit in the 32-bit descriptor is equal to a first value (e.g., “1”), and conversely, if the gray level of the image pixel is equal to or less than the gray level of the adjacent pixel, the respective bit in the 32-bit descriptor is equal to a second value (e.g., “0”).
  • each image pixel of the current or reference image corresponds to an image feature, which is a 32- bit integer descriptor generated from a respective region that includes the image pixel and has more than 32 pixels.
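  • A minimal census transform and Hamming-distance sketch in the spirit of the description above. For simplicity it compares the center pixel against all 48 neighbors of the 7x7 window (producing a 48-bit descriptor); the patent's 32-pixel selection pattern is not specified here, so that detail is omitted.

```python
import numpy as np

def census_transform(gray, radius=3):
    """Census transform over a (2*radius+1) x (2*radius+1) window (7x7 for radius=3).
    Each pixel receives an integer descriptor whose bits record whether the
    center pixel is brighter than each neighbor."""
    h, w = gray.shape
    padded = np.pad(gray.astype(np.int32), radius, mode="edge")
    desc = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            desc = (desc << np.uint64(1)) | (gray > neighbor).astype(np.uint64)
    return desc

def hamming(a, b):
    """Hamming distance between two census descriptors (one cost element)."""
    return bin(int(a) ^ int(b)).count("1")
```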
  • a current camera position O 1 (406C) and a reference camera position O 2 (406R) correspond to optical centers of the current image 402 and the reference image 404, respectively.
  • a current epipole e (410C) and a reference epipole e' (410R) are located on the current image 402 and the reference image 404, respectively.
  • a current epipolar line pe (412C) connects the current pixel p (408C) to the current epipole e (410C) and has a current distance p 0 .
  • a reference epipolar line p'e' (412R) connects the reference pixel p' (408R) to the reference epipole e' (410R) and has a reference distance p 1 .
  • the epipolar lines pe and p'e' both lie in the epipolar plane PO 1 O 2 , and a tilting angle of the epipolar line p'e' depends on the position of the current pixel p.
  • the disparity (Disparity) between the epipolar lines pe and p'e' is defined as follows:
  • Disparity = Distance(p, e) − Distance(p', e') + offset, (11) where offset is a constant that ensures the disparity (Disparity) is positive.
  • the disparity can be converted to a depth value of the object point P according to equations (1.1) and (1.2).
  • intermediate results of the epipolar lines pe and p'e' are stored in a lookup table 248, which expedites cost volume construction 612, disparity-to-depth conversion 616, and/or depth-to-disparity conversion 610.
  • the lookup table 248 is an 8-channel matrix and has the same size as the current image 402.
  • Each element of the matrix stores information of a respective pixel of the current or reference image, including but not limited to one or more of: a current distance p 0 from a current pixel 408C to the current epipole e (410C) in the current image 402, cosine and sine values of a tilting angle of the epipolar line p'e' (412R) in the reference frame 404, a disparity range (e.g., a maximal disparity, a minimal disparity) of the current and reference distances p 0 and p 1 , and pixel conversion parameters 416 (e.g., C 1 , C 2 , C 3 , and C 4 ).
  • a 3D object point P in space is conveniently mapped to the reference pixel p' (408R), which is located on the epipolar line p'e' (412R) in the reference image 404.
  • an initial depth guess is made to narrow down a disparity range prior to stereo matching.
  • the initial depth guess is optionally estimated (608) on an image basis or on a pixel basis for the current image 402. If the initial depth guess is estimated (608) on an image basis, a minimal or maximal depth value of the current image 402 is estimated based on sparse feature points identified in the current image 402. Depth values are determined for the identified sparse feature points, and applied to determine associated disparities of these identified sparse feature points based on equations (1.1) and (1.2). The disparities of these identified sparse feature points are further applied to determine the disparity range for pixels in the current image 402.
  • prior depth estimation 608 is implemented to provide an initial coarse depth value.
  • the initial coarse depth value is estimated based on the sparse feature points obtained from SLAM.
  • Visual based SLAM results in feature points that are applied to estimate an associated camera pose. Depth values of those sparse feature points are estimated after SLAM optimization. Delaunay triangulation is applied to generate a mesh having vertices and triangular faces that are defined by the sparse feature points.
  • a depth value of each pixel located on the triangular face is linearly interpolated from the depth values of the vertices of the triangular face (i.e., from the depth values of corresponding sparse feature points).
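  • A sketch of this prior depth interpolation using SciPy, which is an assumed dependency here: scipy.interpolate.LinearNDInterpolator builds a Delaunay triangulation of the sparse feature points internally and interpolates linearly inside each triangular face, matching the per-face interpolation described above.

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def prior_depth_from_sparse(points_uv, depths, width, height):
    """Interpolate sparse SLAM feature-point depths over the whole image.
    points_uv: (N, 2) pixel coordinates of sparse feature points.
    depths:    (N,) depth values estimated for those points after SLAM optimization.
    Pixels outside the triangulated hull are returned as NaN (no prior)."""
    interpolator = LinearNDInterpolator(points_uv, depths)  # Delaunay + barycentric weights
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    return interpolator(u, v)  # (height, width) coarse prior depth map
```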
  • a previously estimated depth image 414 is reprojected into a current image 402.
  • a point cloud is generated from the previously estimated depth image 414. After the current camera pose associated with the current image 402 is known, the point cloud is projected into the current image 402 to obtain a raw depth map 414 for the current image 402.
  • the maximal depth and minimal depth of each pixel are estimated from local neighbors. For example, a current pixel 408C is located at a center of a region of the current image 402. Maximal and minimal depths of pixels in the region are applied as the current pixel's maximal and minimal depth estimates. Alternatively, the greatest 10 percent of raw depth estimates are used to generate the maximal depth estimate, and the smallest 10 percent of raw depth estimates are used to generate the minimal depth estimate.
  • the minimal and maximal disparity is determined from the maximal and minimal depth estimates based on equations (1.1) and (1.2), and applied to determine a disparity range, which can be applied to expedite stereo matching and reduce noise following stereo matching.
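  • A sketch of the per-pixel depth-bound estimation from a raw (reprojected) depth map, using a local window and the 10th/90th percentiles as robust minimum/maximum estimates; the window size and percentile choices are illustrative assumptions. The resulting bounds would then be converted to a disparity range via equations (1.1)/(1.2).

```python
import numpy as np

def depth_bounds(raw_depth, window=15, low_pct=10, high_pct=90):
    """Per-pixel minimal/maximal depth estimates from a local neighborhood.
    raw_depth contains NaN where no reprojected depth is available."""
    h, w = raw_depth.shape
    half = window // 2
    d_min = np.full((h, w), np.nan)
    d_max = np.full((h, w), np.nan)
    for y in range(h):
        for x in range(w):
            patch = raw_depth[max(0, y - half):y + half + 1,
                              max(0, x - half):x + half + 1]
            valid = patch[np.isfinite(patch)]
            if valid.size:
                d_min[y, x] = np.percentile(valid, low_pct)   # robust minimal depth
                d_max[y, x] = np.percentile(valid, high_pct)  # robust maximal depth
    return d_min, d_max
```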
  • During cost volume construction 612, a three-dimensional (3D) cost volume is built with three dimensions, i.e., image width, image height, and disparity.
  • Each cost element is a Hamming distance between a current pixel's census transformation in the current image 402 and a candidate pixel's census transformation in the reference image 404.
  • the candidate pixel (including a corresponding reference pixel 408R) of the reference image 404 lies on the epipolar line p'e' which has a tilting angle associated with the current pixel 408C in the lookup table 248.
  • the minimal and maximal disparity values are also extracted from the lookup table 248.
  • Candidate pixels lie on a segment of the epipolar line p'e' (412R) determined by the range of disparity defined by the minimal and maximal disparity values.
  • the lookup table 248 enables an efficient cost volume construction process to locate these candidate pixels, obtain each cost element, and fill into a 3D cost volume.
  • the reference pixel 408R corresponding to each current pixel 408C of the current image 402 is identified among the candidate pixels lying on the segment of the epipolar line p'e' (412R), as is the disparity of the current and reference distances p 0 and p 1 corresponding to the identified reference pixel 408R, which is used for depth estimation.
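  • A sketch of the lookup-table-driven cost volume construction: for each current pixel, candidate reference pixels are sampled along the epipolar line direction over that pixel's disparity range, and each cost element is the Hamming distance between census descriptors. The layout of the lookup dictionary (fields p0, cos, sin, d_min, d_max, offset, e_ref) is an illustrative assumption, not the patent's 8-channel matrix layout.

```python
import numpy as np

def build_cost_volume(census_cur, census_ref, lookup, max_disp):
    """Fill an (H, W, max_disp) cost volume with census Hamming distances."""
    h, w = census_cur.shape
    cost = np.full((h, w, max_disp), 255, dtype=np.uint16)  # large default cost
    for y in range(h):
        for x in range(w):
            p0 = lookup["p0"][y, x]                     # current distance to epipole e
            cos_t, sin_t = lookup["cos"][y, x], lookup["sin"][y, x]
            d_lo, d_hi = int(lookup["d_min"][y, x]), int(lookup["d_max"][y, x])
            for d in range(max(0, d_lo), min(max_disp, d_hi)):
                # Candidate reference pixel at reference distance p1 = p0 - (d - offset),
                # measured along the epipolar line p'e' from the reference epipole e_ref.
                p1 = p0 - (d - lookup["offset"])
                u = int(round(lookup["e_ref"][0] + p1 * cos_t))
                v = int(round(lookup["e_ref"][1] + p1 * sin_t))
                if 0 <= u < w and 0 <= v < h:
                    cost[y, x, d] = bin(int(census_cur[y, x]) ^ int(census_ref[v, u])).count("1")
    return cost
```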
  • semi-global matching 614 is applied to estimate an optimal disparity (Disparity) from the 3D cost volume.
  • a semi-global matching method 614 adopts 8-way aggregation and chooses the optimal disparity from the summed costs of the 8-way aggregation.
  • the minimal and maximal disparity for each candidate pixel on the epipolar line p'e' is applied in the aggregation procedure to set the range of the optimal disparity.
  • Cost aggregation is applied according to equation (12), where Cr(x, l) is the aggregated cost of pixel x having a disparity l in a neighboring direction r, and r ∈ Nr, a set of neighboring directions; an eight-neighborhood is used, and P 1 and P 2 are penalty values.
  • C(x, l) is a cost value calculated in the previous cost volume construction step.
  • Each cost value applied in equation (12) corresponds to an image feature of a pixel of a current or reference image.
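  • Equation (12) itself is not legible in this text; the aggregation it describes (directional aggregated costs with P 1 and P 2 penalties, summed over eight directions) is conventionally written as the standard semi-global matching recurrence below.

```latex
C_r(\mathbf{x}, l) = C(\mathbf{x}, l)
  + \min\!\Big( C_r(\mathbf{x}-\mathbf{r}, l),\;
                C_r(\mathbf{x}-\mathbf{r}, l-1) + P_1,\;
                C_r(\mathbf{x}-\mathbf{r}, l+1) + P_1,\;
                \min_{k} C_r(\mathbf{x}-\mathbf{r}, k) + P_2 \Big)
  - \min_{k} C_r(\mathbf{x}-\mathbf{r}, k),
\qquad
S(\mathbf{x}, l) = \sum_{\mathbf{r} \in N_r} C_r(\mathbf{x}, l)
```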
  • the optimal disparity 620 of a current distance p 0 and a reference distance p 1 is identified for each current pixel 408C, and corresponds to the reference pixel 408R corresponding to the current pixel 408C.
  • the reference pixel 408R is the candidate pixel on the epipolar line p'e' that provides the optimal disparity 620 between the minimal and maximal disparity. More details on image feature extraction 604 are described below with reference to Figure 7.
  • after the disparity 620 (Disparity) of the current distance p 0 and reference distance p 1 is identified for the current pixel 408C, the disparity 620 is converted (616) to an inverse depth value of the current pixel 408C, which is applied to create an inverse depth map 622 of the current image 402.
  • One or more guided image filters are applied to reduce a noise level of the inverse depth map of the current image 402.
  • an uncertainty level is measured in semi-global matching 614, and converted into a confidence map.
  • the confidence map is multiplied by the inverse depth map 622 to create a weighted inverse depth map, and fast global image smoothing is applied, e.g., using a guided image filter, to generate a smoothed weighted inverse depth map.
  • A guided filter is applied to the confidence map to generate a smoothed confidence map.
  • the smoothed weighted inverse depth map is divided by the smoothed confidence map to generate a smoothed inverse depth map 624.
  • the smoothed inverse depth map 624 is further converted to a depth map 414 corresponding to the current image 402.
  • the process 600 is implemented to provide a depth image 414 by a mobile device having a single moving monocular camera 260.
  • the mobile device further includes an augmented reality (AR) software development kit (SDK) that provides more information for developers to implement AR applications.
  • the process can be implemented in a smartphone 104C to estimate depth images in less than 100 milliseconds and satisfy a real-time requirement of the AR applications. Such a depth estimation capability is necessary to provide close-to-reality collision and occlusion effects in the AR applications.
  • an AR application needs to create a dense 3D mesh for an environment.
  • a sparse point cloud resulting from SLAM is applied to create the dense 3D mesh.
  • time-of-flight (TOF) sensors are applied to measure a depth from the smartphone to the environment, which consumes a lot of battery power. Conversely, in some embodiments, the TOF sensors are not applied, in order to conserve battery power, and the single moving monocular camera 260 is applied to estimate a depth image based on the process 600.
  • the process 600 enables close to real AR effects in mobile devices and dramatically reduces computational and power resources by estimating depth maps from monocular image sequences.
  • the process 600 is applied jointly with SLAM to add more feature points from descriptor matching processing. These SLAM feature points enable prior depth estimation 608 to be more accurate.
  • the process 600 is applied based on a neural network model. If the neural network model is trained in an unsupervised manner, a reprojection loss is applied by warping the current image 402 to the reference image 404. More importantly, the process 600 considers epipolar geometry and determines an inverse depth map based on equations (1.1) and (1.2), which avoids duplicate and missing matches and enables an accurate and efficient solution for depth estimation.
  • Figure 7 illustrates a feature extraction scheme 700 to extract an image feature from an image pixel 702, in accordance with some embodiments.
  • a gray level or gradient of the image pixel 702 is compared with a first number of adjacent pixels 704 (e.g., 16 or 32 adjacent pixels) on the current image 402 to determine the first feature value having the first number of bits (e.g., 16 or 32 bits).
  • a gray level or gradient of the corresponding candidate pixel is compared with the first number of adjacent pixels on the reference image 404 to determine the second feature value having the first number of bits.
  • the first number of adjacent pixels 704 is selected in a region 706 including the image pixel 702, and the adjacent pixels 704 are distributed evenly in the region 706.
  • the first number is 32.
  • a corresponding image feature is stored in a 32-bit descriptor generated from the region 706, which has more than the first number of pixels including the respective pixel 702.
  • the region 706 optionally has a square, rectangular, or diamond shape. Alternatively, the region 706 may have an irregular shape.
  • the region 706 is centered at the respective pixel 702.
  • the region is not centered at the respective pixel 702.
  • the pixel 702 is separated from the closest edge of the current image 402 by less than 3 pixels when the region includes 7x7 pixels.
  • the pixel 702 may be separated from two opposite edges of the region 706 by 2 pixels and 4 pixels, respectively.
  • the region 706 has 7 rows and 7 columns and includes 48 pixels surrounding the respective pixel 702.
  • a gray level of the image pixel 702 is compared with 32 selected adjacent pixels 704 in the region 706 to determine 32 bits in the 32-bit descriptor.
  • the image pixel 702 is marked with “o”, and the 32 adjacent pixels 704 are selected from 48 remaining pixels and marked with “x”.
  • if the gray level of the pixel 702 is greater than the gray level of a respective adjacent pixel 704, a respective bit in the 32-bit descriptor is equal to a first value (e.g., “1”), and conversely, if the gray level of the pixel 702 is equal to or less than the gray level of the respective adjacent pixel 704, the respective bit in the 32-bit descriptor is equal to a second value (e.g., “0”).
  • each image pixel 702 in the current image 402 corresponds to a 32-bit image feature (i.e., the 32-bit descriptor) generated from the respective region 706, and each corresponding reference pixel in the reference image 404 likewise corresponds to a 32-bit reference feature.
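  • The descriptor construction above can be sketched as follows, assuming a 7x7 region and a fixed pattern of 32 of the 48 surrounding pixels; the exact, evenly distributed sampling pattern used in the embodiment is not specified, so the pattern below is an illustrative assumption.

```python
import numpy as np

# Illustrative offset pattern: 32 of the 48 neighbors in a 7x7 window around the
# pixel. Any fixed selection of 32 offsets works the same way.
OFFSETS = [(dy, dx)
           for dy in range(-3, 4)
           for dx in range(-3, 4)
           if (dy, dx) != (0, 0)][:32]

def census_descriptor_32(gray, y, x):
    """32-bit census-like descriptor: bit i is 1 if the center pixel is brighter
    than the i-th selected neighbor, and 0 otherwise."""
    desc = 0
    center = int(gray[y, x])
    for i, (dy, dx) in enumerate(OFFSETS):
        if center > int(gray[y + dy, x + dx]):
            desc |= 1 << i
    return desc

def hamming32(d0, d1):
    """Hamming distance between two 32-bit descriptors."""
    return bin(d0 ^ d1).count("1")

# Toy usage on a synthetic image (the pixel must be at least 3 pixels from the
# border for a centered region).
gray = np.random.default_rng(0).integers(0, 256, size=(32, 32), dtype=np.uint8)
print(f"descriptor = {census_descriptor_32(gray, 10, 10):032b}")
```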
  • Figure 8 is a flow diagram of a noise filtering process 618 that reduces noise in an inverse depth map 622, in accordance with some embodiments.
  • a disparity 620 between a current distance p0 along a current epipolar line 412C and a reference distance p1 along a reference epipolar line 412R is identified for a current pixel 408C of a current image 402.
  • the disparity 620 is converted to an inverse depth value of the current pixel 408C
  • the inverse depth value of each current pixel 408C is applied to create an inverse depth map 622 of the current image 402.
  • Guided image filters 802 are applied to reduce a noise level of the inverse depth map 622 of the current image 402.
  • an uncertainty level is measured in semi-global matching 614, and converted into a confidence map 804.
  • the confidence map 804 is multiplied by the inverse depth map 622 to create a weighted inverse depth map 806, and a first guided image filter 802A (e.g., for fast global image smoothing) is applied to generate a smoothed weighted inverse depth map 808.
  • a second guided filter 802B is applied to the confidence map 804 to generate a smoothed confidence map 810.
  • the smoothed weighted inverse depth map 808 is divided by the smoothed confidence map 810 to generate a smoothed inverse depth map 624.
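  • The confidence-weighted smoothing of Figure 8 can be sketched as below; a plain box filter stands in for the guided image filters 802A and 802B (or a fast global smoother), and only the weighting and normalization steps of the flow are shown.

```python
import numpy as np

def box_filter(img, radius=2):
    """Simple box filter, used here as a stand-in for a guided image filter."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def smooth_inverse_depth(inv_depth, confidence, eps=1e-6):
    """Confidence-weighted smoothing: filter(conf * inv_depth) / filter(conf)."""
    weighted = confidence * inv_depth            # weighted inverse depth map 806
    smoothed_weighted = box_filter(weighted)     # smoothed weighted inverse depth map 808
    smoothed_conf = box_filter(confidence)       # smoothed confidence map 810
    return smoothed_weighted / np.maximum(smoothed_conf, eps)  # smoothed inverse depth map 624

# Toy usage.
rng = np.random.default_rng(0)
inv_depth = rng.uniform(0.1, 1.0, size=(16, 16))
confidence = rng.uniform(0.0, 1.0, size=(16, 16))
print(smooth_inverse_depth(inv_depth, confidence).shape)
```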
  • Figure 9 is a flow diagram of a depth mapping method 900, in accordance with some embodiments.
  • the method is applied in the AR glasses 104D, robotic systems, vehicles, or mobile phones.
  • the method 900 is described as being implemented by an electronic device (e.g., a depth image module 232 of a client device 104).
  • An example of the client device 104 is a head-mount display 104D or a mobile phone 104C.
  • The method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system.
  • Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the electronic system 200 in Figure 2).
  • the computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices.
  • the instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors.
  • the electronic device obtains (902) a current image 402 captured by a camera having a current camera pose and a reference image 404 captured by the camera having a reference camera pose.
  • Each current pixel 408C of at least a subset of the current image 402 has a current pixel position in the current image 402 and corresponds to a respective reference pixel 408R of the reference image 404.
  • the electronic device obtains (906) a plurality of pixel conversion parameters 416 of the current pixel 408C, determines (908) a disparity between a current distance p0 (from the current pixel 408C to a current epipole 410C of the current image 402) and a reference distance p1 (from the respective reference pixel 408R to a reference epipole 410R of the reference image 404), and, in accordance with the plurality of pixel conversion parameters 416, determines (910) a depth value of the current pixel 408C from the disparity between the current distance and the reference distance.
  • the electronic device creates (912) a depth map 414 corresponding to the current image 402, and the depth map 414 includes the depth value 420 of each current pixel 408C of the subset of the current image 402.
  • an inverse depth value of the current pixel 408C is determined (920) from the disparity 620 of the current and reference distances p0 and p1, and converted (922) to the depth value 420 of the current pixel 408C.
  • the electronic device creates (914) a lookup table 248 correlating a plurality of pixel positions to the plurality of pixel conversion parameters 416 based on the reference camera pose, the current camera pose, a camera intrinsic matrix K, and an epipole position in the current image 402.
  • the electronic device obtains the plurality of pixel conversion parameters 416 for the current pixel 408C of the current image 402 by checking the lookup table 248 to identify the plurality of pixel conversion parameters 416 corresponding to the current pixel 408C.
  • the lookup table 248 includes (916) one or more of: the current distance p0 between the current pixel 408C and the current epipole 410C of the current image 402, information of a tilting angle of an epipolar line 412R in the reference image 404, a disparity range of the disparity of the current and reference distances p0 and p1, and the plurality of pixel conversion parameters 416.
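  • To make the lookup-table entries concrete, the sketch below precomputes, for one current pixel, the epipole distance p0 and the tilt angle of the corresponding epipolar line 412R in the reference image; the epipolar line is recovered by projecting the pixel's viewing ray at two hypothetical depths rather than via a fundamental-matrix formula. The pose convention (X_ref = R_ref_cur @ X_cur + t_ref_cur), the helper names, and the toy values are assumptions, and the remaining conversion parameters 416 are left as a placeholder.

```python
import numpy as np

def project(K, X):
    """Pinhole projection of a 3-D point X (camera coordinates) with intrinsics K."""
    x = K @ X
    return x[:2] / x[2]

def lookup_entry(u, v, K, R_ref_cur, t_ref_cur, e_cur,
                 disparity_range=(0, 64), d_near=0.5, d_far=10.0):
    """Build one lookup-table entry for the current pixel (u, v)."""
    # Distance from the current pixel to the current epipole 410C.
    p0 = float(np.linalg.norm(np.array([u, v], dtype=float) - e_cur))

    # Epipolar line 412R: project the pixel's ray at two depths into the reference image.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    q_near = project(K, R_ref_cur @ (ray * d_near) + t_ref_cur)
    q_far = project(K, R_ref_cur @ (ray * d_far) + t_ref_cur)
    direction = q_far - q_near
    tilt = float(np.arctan2(direction[1], direction[0]))

    return {"p0": p0, "tilt": tilt, "disparity_range": disparity_range,
            "conversion_params": None}  # placeholder for the parameters 416

# Toy usage: identity rotation, small x-translation, simple intrinsics and epipole.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(lookup_entry(400, 260, K, np.eye(3), np.array([0.1, 0.0, 0.0]),
                   e_cur=np.array([320.0, 240.0])))
```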
  • the electronic device selects (918) the reference image 404 from a plurality of candidate images 430 in accordance with a reference selection criterion 250.
  • the reference selection criterion 250 requires at least one condition of: an angle between two camera bearing directions associated with the current image 402 and the reference image 404 being less than a bearing direction angle threshold; the current and reference images 402 and 404 sharing at least a threshold number of common feature points; and an error between a projected camera pose and the current camera pose being less than a pose error threshold.
  • the projected camera pose is determined for the current image 402 based on the reference camera pose and common feature points of the reference and current images 404 and 402.
  • the reference selection criterion 250 defines a bearing direction angle threshold. For each of a subset of candidate images 430, the electronic device determines an angle between a current camera bearing direction associated with the current image 402 and a candidate camera bearing direction associated with the candidate image 430. In accordance with a determination that the angle corresponding to one of the candidate images 430 is less than the bearing direction angle threshold, the electronic device selects the one of the candidate images 430 as the reference image 404.
  • the reference selection criterion 250 defines a feature point number range. For each of a subset of candidate images, the electronic device determines a number of common feature points that exist in both the candidate image 430 and the current image 402. In accordance with a determination that the number of common feature points determined based on one of the candidate images 430 is within the feature point number range, the electronic device selects the one of the candidate images 430 as the reference image 404.
  • the reference selection criterion 250 defines a pose error threshold.
  • the electronic device identifies a plurality of current feature points in the current image 402. For each of a subset of candidate images 430, the electronic device identifies a plurality of candidate feature points in the candidate image, estimates a candidate camera pose corresponding to the current image 402 by comparing the candidate feature points and the current feature points, determines a camera pose error between the candidate and current camera poses of the current image 402, and in accordance with a determination that the camera pose error determined based on one of the candidate images is less than the pose error threshold, selects the one of the candidate images 430 as the reference image 404.
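  • One possible combination of these conditions is sketched below; each candidate is assumed to carry a unit camera bearing direction, a set of feature-point identifiers, and a precomputed camera pose error, and the thresholds and field names are illustrative. The criterion requires at least one of the conditions, but the sketch checks all three for simplicity.

```python
import numpy as np

def select_reference_image(current, candidates, max_bearing_angle_deg=15.0,
                           min_common_points=50, max_pose_error=0.05):
    """Return the first candidate satisfying the (illustrative) reference selection criterion.

    `current` and each candidate are dicts with assumed fields:
      "bearing"   : unit camera bearing direction (3-vector)
      "points"    : set of feature-point ids visible in the image
      "pose_error": error between projected and current camera poses (candidates only)
    """
    for cand in candidates:
        cos_angle = float(np.clip(np.dot(current["bearing"], cand["bearing"]), -1.0, 1.0))
        angle_deg = np.degrees(np.arccos(cos_angle))
        common = len(current["points"] & cand["points"])
        if (angle_deg < max_bearing_angle_deg
                and common >= min_common_points
                and cand["pose_error"] < max_pose_error):
            return cand
    return None

# Toy usage: the second candidate meets all three conditions.
current = {"bearing": np.array([0.0, 0.0, 1.0]), "points": set(range(100))}
candidates = [
    {"bearing": np.array([0.0, 0.5, 0.866]), "points": set(range(40)), "pose_error": 0.01},
    {"bearing": np.array([0.0, 0.1, 0.995]), "points": set(range(80)), "pose_error": 0.01},
]
print(select_reference_image(current, candidates) is candidates[1])
```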
  • the electronic device determines an epipolar line p'e' (412R) passing the reference pixel 408R on the reference image 404, e.g., based on the reference camera pose, the current camera pose, a camera intrinsic matrix K, the current pixel position, and the reference pixel position.
  • An epipolar segment is selected on the epipolar line p'e' (412R) based on a disparity range.
  • the electronic device creates a cost volume for candidate pixels that lie on the epipolar segment, and each cost element indicates a Hamming distance between a first feature value of the current pixel 408C in the current image 402 and a second feature value of a corresponding candidate pixel on the epipolar segment in the reference image 404.
  • the disparity of the current and reference distances p0 and p1 is determined based on the cost volume, e.g., using semi-global matching, and corresponds to one of the candidate pixels, which is thereby identified as the reference pixel 408R associated with the current pixel 408C.
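  • The matching step can be sketched as follows: Hamming costs between the current pixel's 32-bit descriptor and the descriptors of candidate pixels sampled on the epipolar segment, followed by a simple winner-take-all choice. The embodiment aggregates the full cost volume with semi-global matching 614 before selecting the disparity; winner-take-all is used here only to keep the sketch short, and the descriptors are assumed to be precomputed (e.g., with the census-like scheme sketched after Figure 7 above).

```python
import numpy as np

def hamming32(d0, d1):
    """Hamming distance between two 32-bit descriptors."""
    return bin(d0 ^ d1).count("1")

def sample_epipolar_segment(p_start, p_end, num_candidates):
    """Evenly sample candidate pixel positions on the epipolar segment."""
    t = np.linspace(0.0, 1.0, num_candidates)[:, None]
    return (1.0 - t) * p_start + t * p_end

def match_on_segment(desc_cur, candidate_descs):
    """Cost slice for one current pixel: Hamming distance to each candidate descriptor.
    Winner-take-all stands in for the semi-global matching aggregation."""
    costs = np.array([hamming32(desc_cur, d) for d in candidate_descs])
    return costs, int(np.argmin(costs))

# Toy usage: four candidate positions on the segment; the third descriptor matches exactly.
positions = sample_epipolar_segment(np.array([100.0, 50.0]), np.array([130.0, 50.0]), 4)
desc_cur = 0b1011_0010_1110_0001_0101_1100_0011_1010
candidate_descs = [desc_cur ^ 0xFFFF, desc_cur ^ 0x0F0F, desc_cur, desc_cur ^ 0x3]
costs, best = match_on_segment(desc_cur, candidate_descs)
print(positions[best], costs, best)   # best == 2 identifies the reference pixel
```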
  • the electronic device compares a gray level or gradient of the current pixel 408C with a first number of adjacent pixels on the current image 402 to determine the first feature value having the first number of bits (e.g., 8, 16, 32 bits).
  • the electronic device compares a gray level or gradient of the corresponding candidate pixel with the first number of adjacent pixels on the reference image 404 to determine the second feature value having the first number of bits.
  • the electronic device selects the first number of adjacent pixels 704 in a region 706 centered at the current pixel 408C of the current image 402 or at each candidate pixel on the epipolar segment of the reference image 404.
  • the number of adjacent pixels 704 are distributed evenly in the region 706.
  • the disparity range of the current image 402 is estimated by identifying a plurality of sparse feature points in the current image 402 using simultaneous localization and mapping (SLAM), determining a first depth value and a second depth value defining a depth range of the plurality of feature points, and estimating the disparity range based on the first and second depth values.
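  • A minimal sketch of this disparity-range estimation is given below: the sparse SLAM feature depths bound the scene depth, and the bound is mapped to a disparity interval. The mapping disparity = scale / depth and the margin are placeholder assumptions standing in for the conversion defined by the pixel conversion parameters.

```python
import numpy as np

def disparity_range_from_sparse_depths(sparse_depths, scale=100.0, margin=0.1):
    """Estimate a disparity search range from sparse SLAM feature depths.

    `scale` is a placeholder for the depth-to-disparity mapping; `margin` widens
    the depth bounds to tolerate noise in the sparse points.
    """
    depths = np.asarray(sparse_depths, dtype=float)
    d_min = depths.min() * (1.0 - margin)   # first depth value (near bound)
    d_max = depths.max() * (1.0 + margin)   # second depth value (far bound)
    disp_max = scale / max(d_min, 1e-6)     # nearer points map to larger disparities
    disp_min = scale / d_max
    return disp_min, disp_max

# Toy usage with depths (in meters) of a few sparse feature points.
print(disparity_range_from_sparse_depths([1.2, 2.5, 4.0, 7.5]))
```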
  • the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context.
  • the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
  • stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

An electronic device obtains a current image and a reference image that are both captured by a camera. Each current pixel of at least a subset of the current image has a current pixel position in the current image and corresponds to a respective reference pixel of the reference image. For each current pixel of the at least one subset of the current image, the electronic device obtains pixel conversion parameters of the current pixel and determines a disparity between a current distance of the current pixel and a current epipole of the current image and a reference distance of the respective reference pixel and a reference epipole of the reference image. In accordance with the pixel conversion parameters, the electronic device determines a depth value of each current pixel from the disparity between the current distance and the reference distance and creates a depth map corresponding to the current image.
PCT/US2022/026561 2022-04-27 2022-04-27 Estimation de profondeur pour systèmes slam à l'aide de caméras monoculaires WO2023211435A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2022/026561 WO2023211435A1 (fr) 2022-04-27 2022-04-27 Estimation de profondeur pour systèmes slam à l'aide de caméras monoculaires

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2022/026561 WO2023211435A1 (fr) 2022-04-27 2022-04-27 Estimation de profondeur pour systèmes slam à l'aide de caméras monoculaires

Publications (1)

Publication Number Publication Date
WO2023211435A1 true WO2023211435A1 (fr) 2023-11-02

Family

ID=88519468

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/026561 WO2023211435A1 (fr) 2022-04-27 2022-04-27 Estimation de profondeur pour systèmes slam à l'aide de caméras monoculaires

Country Status (1)

Country Link
WO (1) WO2023211435A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150009133A1 (en) * 2008-07-09 2015-01-08 Primesense Ltd. Integrated processor for 3d mapping
US20170094243A1 (en) * 2013-03-13 2017-03-30 Pelican Imaging Corporation Systems and Methods for Synthesizing Images from Image Data Captured by an Array Camera Using Restricted Depth of Field Depth Maps in which Depth Estimation Precision Varies
WO2015043872A1 (fr) * 2013-09-25 2015-04-02 Technische Universität München Localisation et cartographie simultanées semi-denses
US20190236797A1 (en) * 2019-04-12 2019-08-01 Intel Corporation Accommodating depth noise in visual slam using map-point consensus

Similar Documents

Publication Publication Date Title
KR102319177B1 (ko) 이미지 내의 객체 자세를 결정하는 방법 및 장치, 장비, 및 저장 매체
EP3786890B1 (fr) Procédé et appareil de détermination de pose de dispositif de capture d'image, et support d'enregistrement correspondant
US20220012495A1 (en) Visual feature tagging in multi-view interactive digital media representations
CN111325796B (zh) 用于确定视觉设备的位姿的方法和装置
US10750161B2 (en) Multi-view interactive digital media representation lock screen
JP6125100B2 (ja) 点特徴と線特徴とを使用する堅牢な追跡
US11776142B2 (en) Structuring visual data
KR100560464B1 (ko) 관찰자의 시점에 적응적인 다시점 영상 디스플레이 시스템을 구성하는 방법
KR20200031019A (ko) 깊이 데이터를 최적화하기 위해 깊이 맵에 이미지 등록을 수행할 수 있는 깊이 데이터 처리 시스템
CN113711276A (zh) 尺度感知单目定位和地图构建
CN113643342A (zh) 一种图像处理方法、装置、电子设备及存储介质
WO2023082822A1 (fr) Procédé et appareil de traitement de données d'image
WO2023088127A1 (fr) Procédé de navigation en intérieur, serveur, appareil et terminal
KR20210050997A (ko) 포즈 추정 방법 및 장치, 컴퓨터 판독 가능한 기록 매체 및 컴퓨터 프로그램
WO2023086398A1 (fr) Réseaux de rendu 3d basés sur des champs de radiance neurale de réfraction
WO2023211435A1 (fr) Estimation de profondeur pour systèmes slam à l'aide de caméras monoculaires
WO2023091131A1 (fr) Procédés et systèmes pour récupérer des images sur la base de caractéristiques de plan sémantique
KR102299902B1 (ko) 증강현실을 제공하기 위한 장치 및 이를 위한 방법
WO2023101662A1 (fr) Procédés et systèmes pour mettre en œuvre une odométrie visuelle-inertielle sur la base d'un traitement simd parallèle
CN107993247A (zh) 追踪定位方法、系统、介质和计算设备
WO2023277877A1 (fr) Détection et reconstruction de plan sémantique 3d
WO2023063937A1 (fr) Procédés et systèmes de détection de régions planes à l'aide d'une profondeur prédite
WO2024123343A1 (fr) Mise en correspondance stéréo pour une estimation de profondeur à l'aide de paires d'images avec des configurations de pose relative arbitraires
WO2023023162A1 (fr) Détection et reconstruction de plan sémantique 3d à partir d'images stéréo multi-vues (mvs)
WO2023091129A1 (fr) Localisation de caméra sur la base d'un plan

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22940446

Country of ref document: EP

Kind code of ref document: A1