WO2022271499A1 - Methods and systems for depth estimation using a fisheye camera - Google Patents

Methods and systems for depth estimation using a fisheye camera

Info

Publication number
WO2022271499A1
Authority
WO
WIPO (PCT)
Prior art keywords
images
training
depth map
fisheye
depth
Prior art date
Application number
PCT/US2022/033563
Other languages
English (en)
Other versions
WO2022271499A8 (fr)
Inventor
Pan JI
Qingan Yan
Yi Xu
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2022271499A1 publication Critical patent/WO2022271499A1/fr
Publication of WO2022271499A8 publication Critical patent/WO2022271499A8/fr


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/2224Studio circuitry; Studio devices; Studio equipment related to virtual studio applications
    • H04N5/2226Determination of depth image, e.g. for foreground/background separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • H04N17/002Diagnosis, testing or measuring for television systems or their details for television cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/698Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80Camera processing pipelines; Components thereof
    • H04N23/81Camera processing pipelines; Components thereof for suppressing or minimising disturbance in the image signal generation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present invention is directed to extended reality systems and methods thereof.
  • extended reality (XR) devices, including both augmented reality (AR) devices and virtual reality (VR) devices, have become increasingly popular.
  • Important design considerations and challenges for XR devices include performance, cost, and power consumption. Due to various limitations, including but not limited to depth determination, existing XR devices have been inadequate, for reasons further explained below.
  • the present invention is directed to extended reality systems and methods thereof. According to a specific embodiment, images captured using a single fisheye lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. There are other embodiments as well.
  • a system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions.
  • One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.
  • One general aspect includes a method for generating a pretrained model for a fisheye lens. The method includes capturing training images using one or more fisheye lenses, the training images having a fisheye distortion associated with the one or more fisheye lenses.
  • the method also includes storing a training data model in a memory.
  • the method also includes determining a geometrical difference associated with the fisheye distortion using the training images.
  • the method also includes calculating a photometric loss value using at least the geometric difference.
  • the method also includes generating a training depth map using at least the photometric loss value.
  • the method also includes obtaining a reference data model.
  • the method also includes generating a reference depth map using at least the reference data model.
  • the method also includes calculating a ranking loss value using at least the training depth map and the reference depth map.
  • the method also includes updating the training data model using at least the photometric loss value and the ranking loss value.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
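The sequence of training steps above (a photometric loss from the geometric difference, a ranking loss against a reference depth map, then a model update) can be sketched roughly as follows. This is a minimal illustration, not the patent's actual formulation: the L1 photometric difference and the pairwise hinge form of the ranking loss are common choices in the self-supervised depth literature, and every function name here is hypothetical.

```python
import numpy as np

def photometric_loss(target, reconstructed):
    # Mean absolute photometric difference between a training image and
    # its reconstruction warped through the fisheye geometry.
    return float(np.mean(np.abs(target - reconstructed)))

def ranking_loss(train_depth, ref_depth, n_pairs=256, margin=0.0, seed=0):
    # Penalize pixel pairs whose depth ordering in the training depth map
    # disagrees with the ordering in the reference depth map.
    rng = np.random.default_rng(seed)
    d = train_depth.ravel()
    r = ref_depth.ravel()
    i = rng.integers(0, d.size, size=n_pairs)
    j = rng.integers(0, d.size, size=n_pairs)
    order = np.sign(r[i] - r[j])      # ordering dictated by the reference
    hinge = np.maximum(0.0, margin - order * (d[i] - d[j]))
    return float(np.mean(hinge))
```

A training step would then combine both terms (plus, per the optional features below, a scale-invariant term) into a cumulative loss and update the training data model's parameters with it.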
  • the method may include determining a depth value of training images using a parallax between two fisheye lenses.
  • the method may include determining a depth value of the training images based on a change of image position over a predetermined time interval.
  • the method may include projecting the training images into a rectilinear space.
  • the method may include unprojecting a uniform pixel grid into a fisheye space.
  • the method may include: calculating a scale invariant loss value using at least the training depth map and the reference depth map, and updating the training data model using the scale invariant loss value.
  • the method may include normalizing the training depth map and the reference depth map.
  • the method may include calculating a cumulative loss value using the photometric loss value, ranking loss value, and the scale invariant loss value.
  • the method may include calculating a radial distortion polynomial associated with the fisheye distortion.
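The radial distortion polynomial mentioned above is not specified in this summary; one common convention, used here purely as an assumption, is the equidistant fisheye model with OpenCV-style coefficients k1..k4. The gap between the fisheye image radius and the ideal pinhole radius of the same ray is one way to quantify the geometric difference the training steps work with.

```python
import numpy as np

def fisheye_radius(theta, f, k=(0.0, 0.0, 0.0, 0.0)):
    # Equidistant fisheye projection with a radial distortion polynomial:
    # r = f * theta * (1 + k1*theta^2 + k2*theta^4 + k3*theta^6 + k4*theta^8)
    k1, k2, k3, k4 = k
    theta_d = theta * (1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8)
    return f * theta_d

def rectilinear_radius(theta, f):
    # Ideal pinhole projection of the same ray: r = f * tan(theta).
    return f * np.tan(theta)
```

The divergence between the two radii grows with the ray angle, which is why a fisheye image keeps its wide periphery where a pinhole projection would blow up.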
  • One general aspect is directed to an extended reality apparatus, which includes a housing that includes a front side and a rear side.
  • the apparatus also includes a camera module that is mounted on the housing and may include a fisheye lens and a sensor.
  • the fisheye lens is positioned on the front side of the housing and characterized by distortion characteristics.
  • the camera module is configured to capture a plurality of images, the plurality of images being non-rectilinear.
  • the apparatus also includes a display that is positioned on the rear side of the housing.
  • the apparatus also includes a storage configured to store a pretrained model, which is based on at least the distortion characteristics of the fisheye lens.
  • the apparatus also includes a memory that is coupled to the sensor and configured to store the plurality of images.
  • the apparatus also includes a processor coupled to the memory and configured to generate a depth map using the plurality of images and the pretrained model.
  • Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
  • the processor may be further configured to generate an object image.
  • the display is configured to display a composite image including the object image overlaying the plurality of images.
  • the pretrained model may include weight values and bias values determined using a photometric loss attributed to the distortion characteristics.
  • the processor is further configured to perform rectilinear correction on the plurality of images.
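The rectilinear correction mentioned above can be sketched as an inverse remap: for each output pixel, compute the ray angle a pinhole camera would assign to it, project that ray through an assumed equidistant fisheye model with an OpenCV-style distortion polynomial, and sample the source image there. Nearest-neighbour sampling, a square single-channel image, and a centred principal point are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

def rectify(fisheye_img, f, k=(0.0, 0.0, 0.0, 0.0)):
    # Resample a fisheye image onto a pinhole (rectilinear) grid. For each
    # output pixel, find the incoming ray angle, push it through the radial
    # distortion polynomial, and look up the corresponding fisheye pixel.
    h, w = fisheye_img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    r_rect = np.hypot(xs - cx, ys - cy)
    theta = np.arctan2(r_rect, f)          # ray angle for a pinhole camera
    k1, k2, k3, k4 = k
    theta_d = theta * (1 + k1*theta**2 + k2*theta**4 + k3*theta**6 + k4*theta**8)
    r_fish = f * theta_d                   # radius in the fisheye image
    scale = np.ones_like(r_rect)
    nz = r_rect > 0
    scale[nz] = r_fish[nz] / r_rect[nz]
    u = np.clip(np.round(cx + (xs - cx) * scale).astype(int), 0, w - 1)
    v = np.clip(np.round(cy + (ys - cy) * scale).astype(int), 0, h - 1)
    return fisheye_img[v, u]
```

Note that the same mapping run in reverse is what "unprojecting a uniform pixel grid into a fisheye space" refers to in the training method.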
  • One general aspect includes a method for operating an extended reality device.
  • the method includes capturing a plurality of images using a fisheye lens, which is characterized by distortion characteristics.
  • the plurality of images is non-rectilinear.
  • the method also includes storing the plurality of images in a memory.
  • the method also includes accessing a pretrained model stored in a storage, the pretrained model being based on at least the distortion characteristics of the fisheye lens.
  • the method also includes generating a depth map using the pretrained model and the plurality of images.
  • the method also includes generating a plurality of rectilinear images using the plurality of images.
  • the method also includes displaying the plurality of rectilinear images.
  • Implementations may include one or more of the following features.
  • the method may include projecting the depth map to a rectilinear space.
  • the method may include identifying a region using the depth map, generating an object image, and generating an augmented reality image by overlaying the object image over the region.
  • the method may include displaying the augmented reality image in a rectilinear space.
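The region-identification and overlay steps above might look like the following sketch. Thresholding the depth map is only one illustrative way to identify a region, and all names here are hypothetical.

```python
import numpy as np

def overlay_via_depth(frame, object_img, depth_map, max_depth):
    # Identify a region from the depth map (here: everything closer than
    # max_depth) and overlay the virtual object image on that region to
    # form the augmented reality composite.
    region = depth_map < max_depth
    out = frame.copy()
    out[region] = object_img[region]
    return out
```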
  • depth estimation techniques according to the present invention allow for accurate and efficient depth estimation in various scenarios including real-time applications.
  • Depth estimation techniques according to various embodiments of the present invention can be used in a wide variety of systems, including XR devices (e.g., commonly head-mounted displays) that are equipped with a single fisheye lens or camera.
  • the depth estimation methods of the present invention can be implemented with various XR devices to generate optimal depth maps from a single monocular image input to provide users with an immersive AR/VR experience.
  • Embodiments of the present invention present a cost-effective approach for depth estimation that is able to preserve accurate scene geometry under various conditions (e.g., low-light, over-exposure, texture-less regions, etc.) and can be adopted into existing XR systems via software or firmware update. There are other benefits as well.
  • Figure 1A is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention.
  • Figure 1B is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention.
  • Figure 2 is a simplified flow diagram illustrating an operation of an extended reality device according to embodiments of the present invention.
  • Figure 3 is a simplified block diagram illustrating a system for generating a depth estimation model according to embodiments of the present invention.
  • Figure 4 is a simplified block diagram illustrating a machine learning method according to embodiments of the present invention.
  • Figure 5 is a simplified flow diagram illustrating a process for generating a depth estimation model according to embodiments of the present invention.
  • the present invention is directed to extended reality systems and methods thereof. According to a specific embodiment, images captured using a single fisheye lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. There are other embodiments as well.
  • Embodiments of the present invention provide a complete depth estimation system for XR devices (e.g., AR-glass).
  • stereo fisheye cameras with low-power sensors are used with fisheye lenses to capture a large field of view (FoV), which is advantageous to provide an immersive AR/VR experience.
  • various embodiments of the present invention provide a complete system-wide solution that covers various stages of the depth estimation (e.g., training and inference), which may involve features such as real-time operation on edge devices (e.g., mobile phones, embedded devices), to enable depth estimation based on fisheye images (e.g., monocular and/or stereoscopic) in various situations (e.g., low-light/over-exposure conditions, large homogeneous regions, etc.).
  • any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6.
  • FIG. 1 A is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • XR apparatus 115 as shown can be configured as VR, AR, or others.
  • XR apparatus 115 may include a small housing for AR applications or a relatively larger housing for VR applications.
  • XR apparatus 115 may include a pair of stereo cameras (e.g., 180A and 180B) that are configured on the front side of XR apparatus 115.
  • Cameras 180A and 180B are respectively mounted on the left and right side of the XR apparatus 115 and are configured to capture a pair of stereo images.
  • additional cameras may be configured below cameras 180A and 180B to provide an additional field of view and range estimation accuracy.
  • cameras 180A and 180B both include ultrawide angle or fisheye lenses that offer large fields of view, and they share a common field of view.
  • cameras 180A and 180B may be configured with different mounting angles.
  • a depth value of the image captured by either camera may be associated with and can be determined using a parallax between the two camera lenses.
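For context, the parallax-based depth value mentioned above follows the classic stereo relation depth = focal length x baseline / disparity. The sketch below assumes the stereo pair has already been mapped to a common rectified (epipolar) geometry, which raw fisheye images are not; the function name and parameters are illustrative.

```python
def depth_from_parallax(focal_px, baseline_m, disparity_px):
    # Classic stereo relation: a point seen with disparity d pixels by two
    # rectified cameras with focal length f (pixels) and baseline B (metres)
    # lies at depth Z = f * B / d.
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px
```

Nearer objects produce larger disparities, so depth falls as disparity grows.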
  • XR apparatus 115 may include multiple cameras, embodiments of the present invention afford depth determination using images captured by a single fisheye camera, whose fisheye distortion characteristics are used, among other things, in depth determination.
  • XR apparatus 115 includes a single camera 180C that comprises a fisheye lens, which may be characterized by distortion characteristics.
  • Camera 180C may further include a sensor (not shown) that facilitates image capture.
  • the sensors of the front cameras may be low-resolution monochrome sensors, which are not only energy-efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost.
  • a single fisheye camera 180C is configured to capture one or more monocular images at a predetermined resolution (e.g., 640 x 400).
  • the monocular image captured by the single fisheye camera 180C may have a fisheye distortion that is associated with the distortion characteristics of the fisheye camera 180C.
  • multiple fisheye cameras may be implemented, but only a single fisheye camera and its images are used for depth estimation.
  • the distorted/non-rectilinear monocular image may be fed into a depth estimation system of XR apparatus 115 (described in further detail below) to generate a depth map without performing image rectification on the input image.
  • the depth map can identify the relative object depth of the images.
  • Display 185 is configured on the backside of XR apparatus 115.
  • display 185 may be a semitransparent display that overlays information on an optical lens in AR applications.
  • display 185 may include a non-transparent display.
  • FIG. 1B is a simplified block diagram illustrating components of extended reality apparatus 115 according to embodiments of the present invention.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • an XR apparatus (e.g., an XR headset) includes a computing system 115n as shown, which might include, without limitation, at least one of memory 140, processor 150, storage 160, communication interface 170, a camera module 180, a display 185, and/or peripheral devices 190, and/or the like.
  • the elements of computing system 115n can be configured together to perform a depth estimation process on an input image/video to produce an output image/video augmented with virtual contents.
  • the processor 150 might communicatively be coupled (e.g., via a bus, via wired connectors, or via electrical pathways (e.g., traces and/or pads, etc.) of printed circuit boards ("PCBs") or integrated circuits ("ICs"), and/or the like) to each of one or more of memory 140, storage 160, communication interface 170, camera module 180, display 185, and/or peripheral devices 190, and/or the like.
  • camera module 180 is configured to capture images and/or videos of the surrounding environment and is mounted on the housing of the XR apparatus 115.
  • Camera module 180 includes one or more cameras that include their respective lenses (e.g., a fisheye lens 181) and sensors (e.g., a sensor 182) used to capture images and/or video of an area in front of the XR apparatus 115.
  • camera module 180 includes cameras 180A and 180B as shown in Figure 1A, and they are configured respectively on the left and right sides of the housing. In other embodiments, camera module 180 includes a single camera (e.g., camera 180C in Figure 1A), which is equipped with fisheye lens 181 and/or sensor 182.
  • Fisheye lens 181 may be positioned on the front side of the housing.
  • fisheye lens 181 is characterized by distortion characteristics and a field-of-view. It is to be appreciated that the distortion characteristics of the fisheye lens 181 allow camera module 180 to capture one or more images characterized by fisheye distortion (i.e., non-rectilinear images) that provide a large field-of-view.
  • the sensor 182 of the camera module 180 may be a low-resolution monochrome sensor, which is not only energy-efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost.
  • memory 140 is coupled to sensor 182 and configured to store the images and/or videos captured by the camera module 180.
  • camera module 180 can capture one or more stereoscopic images pairs, which can be temporarily stored at memory 140 for further processing.
  • camera module 180 includes a single camera (e.g., 180C) that has a fisheye lens (e.g., fisheye lens 181)
  • camera module 180 can capture one or more monocular images and/or videos, which can be stored at memory 140 and further processed by processor 150 for depth estimation.
  • memory 140 is coupled to processor 150 and configured to store instructions executable by the processor 150.
  • Memory 140 may include, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory (“RAM”), a read-only memory (“ROM”), and/or non-volatile memory, which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • memory 140 may be implemented as a part of the processor 150 in a system-on- chip (SoC) arrangement.
  • storage 160 is configured to store a pretrained module (e.g., a depth estimation module as described in further detail below) that is based on at least the distortion characteristics of the fisheye lens.
  • the pretrained module is a depth estimation module that accepts the distorted and/or non-rectilinear image inputs and outputs a depth map in response to executable instructions from processor 150.
  • the pretrained module may comprise weight values and bias values associated with the distortion characteristics of the fisheye lens 181.
  • the storage 160 is incorporated within a computer system, such as the computing system 115n.
  • the storage 160 might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that storage 160 can be used to program, configure, and/or adapt a general-purpose computer with the instructions/code stored thereon.
  • processor 150 is configured to perform various executable instructions for image processing. For example, processor 150 is configured to generate a depth map using one or more images/videos stored in memory 140. In some cases, processor 150 may further retrieve and/or refine the pretrained model stored in storage 160 to identify relative object depth for depth map generation. Depending on the implementations, processor 150 may perform a rectilinear correction on the input images to assist the depth estimation process.
  • Processor 150 might include, without limitation, a central processing unit (CPU) 151, a graphical processing unit (GPU) 152, and/or a neural processing unit (NPU) 153.
  • CPU 151 handles various types of system functions, such as moving input images/videos to memory 140, and retrieving and executing executable instructions (e.g., rectilinear correction, depth feature extraction, pose feature extraction, loss term calculation, etc.).
  • GPU 152 may be specially designed to manipulate graphic creation and image processing, which is advantageous to process input images/videos for depth estimation and produce output images/videos augmented with virtual contents.
  • the output image/video may be sent by GPU 152 to display 185 to provide an immersive AR/VR experience to a user 120n.
  • NPU 153 is specifically configured to utilize at least the input images/videos and a depth estimation model (e.g., the pretrained model) to perform a depth estimation process, through which a depth map can be obtained, and the parameters of the depth estimation model may be adjusted/updated to further improve the estimation accuracy.
  • processor module 120 may be configured as a multi-core processor with one or more processing units, each of which can read and execute program instructions separately for efficient processing and overall power consumption reduction.
  • the field of view of camera module 180 overlaps with a field of view of an eye of the user 120n.
  • the display screen(s) and/or projector(s) 185 may be used to display or project the generated image overlays on images or video of the actual area.
  • the communication interface 170 provides wired or wireless communication with other devices and/or networks.
  • communication interface 170 may be connected to a computer for tether operations, where the computer provides the processing power needed for graphic- intensive applications.
  • computing system 115n further includes one or more peripheral devices 190 configured to improve user interaction in various aspects.
  • peripheral devices 190 may include, without limitation, at least one of speaker(s) or earpiece(s), eye-tracking sensor(s), light source(s), audio sensor(s) or microphone(s), touch screen(s), keyboard, mouse, and/or other input/output devices.
  • Figure 2 is a simplified flow diagram illustrating an operation of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • a plurality of images is captured using a fisheye lens.
  • the fisheye lens is characterized by distortion characteristics such that the plurality of images captured by the fisheye lens are non-rectilinear.
  • the images may be captured in the form of stereoscopic image pair (e.g., left image and right image).
  • images may be captured in the form of a plurality of monocular images and/or videos (e.g., a plurality of sequential monocular image frames).
  • the camera may be characterized by a plurality of extrinsic parameters associated with its position and orientation (e.g., camera pose or headset pose) and a plurality of intrinsic parameters associated with its geometric properties (e.g., focal length, aperture, resolution, principal point, field-of-view, fisheye distortion characteristics, etc.). It is to be appreciated that the images captured by the fisheye lens offer a large field-of-view that provides more coverage of the surrounding environment.
  • the plurality of images is stored in a memory.
  • a buffer memory may be used to store the plurality of images.
  • the captured images are stored temporarily for the purpose of processing, and may first be stored in volatile memory and later transferred to non-volatile memory.
  • a pretrained model stored in a storage is accessed and/or retrieved by, for example, processor 150 as shown in Figure 1B.
  • the pretrained model is a depth estimation model that is trained with a plurality of stereoscopic image pairs and/or a plurality of temporal monocular images.
  • the pretrained estimation model may be trained by a depth estimation system, which will be described in further detail below.
  • the pretrained model is based on at least the distortion characteristics of the fisheye lens.
  • the pretrained model can utilize the distortion characteristics to generate a depth map.
  • the pretrained model may include weight values and bias values associated with the distortion characteristics.
  • the distorted input images may be fed directly into the pretrained model for depth estimation without going through an image rectification process, which not only preserves the wide field-of-view of the image, but also significantly saves computational resources and time, such that real-time applications based on fisheye depth maps may be realized.
  • a depth map is generated.
  • the depth map is generated using at least the pretrained model and the plurality of images.
  • the pretrained model receives the plurality of input images and their metadata, including the intrinsic and/or extrinsic parameters of the camera.
  • the pretrained model may then output a depth map corresponding to the image of the scene based on the parameters obtained through a depth estimation training process (described in further detail below).
  • the pretrained model may be configured to calculate a camera/headset pose. For example, a parallax between two fisheye lenses may be determined by the pretrained model based on at least the extrinsic and/or intrinsic parameters of the cameras to provide depth information.
  • the pretrained model may calculate a change of image position between two or more monocular images during a predetermined time interval to obtain depth information.
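The change-of-position cue above is motion parallax: for a purely sideways camera translation it mirrors the stereo relation, with the translation accumulated over the time interval playing the role of the baseline. The sketch below assumes a static scene and a known translation; the patent's actual computation is not specified here, and the names are illustrative.

```python
def depth_from_motion(focal_px, translation_m, pixel_shift_px):
    # Motion-parallax analogue of the stereo relation: if the camera
    # translates sideways by translation_m between two frames, a static
    # point that shifts by pixel_shift_px in the image lies at depth
    # Z ~= focal * translation / shift.
    if pixel_shift_px <= 0:
        raise ValueError("pixel shift must be positive")
    return focal_px * translation_m / pixel_shift_px
```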
  • the generated depth map reflects the relative distance of the surrounding objects from the XR apparatus and helps to improve situational awareness and/or virtual content generation, and/or the like.
  • the parameters of the pretrained model may be updated to refine the pretrained model via, for example, an NPU to improve depth estimation accuracy and efficiency.
  • a plurality of rectilinear images is generated.
  • the plurality of rectilinear images is generated using the plurality of images by projecting the depth map to a rectilinear space.
  • the depth map may be configured to identify a region of interest for virtual content generation.
  • the depth map may be used (e.g., by processor 150 as shown in Figure 1B) to generate an object image, which may overlay the identified region to generate a composite image (e.g., an augmented reality image).
  • the augmented reality image may be displayed in a rectilinear space. It is to be appreciated that the rectilinear images provide various benefits such as increased image resolution for improving user experience.
  • the plurality of rectilinear images is displayed via, for example, display 185 as shown in Figure 1B.
  • the display provides a user with a constant feed of augmented video/images that includes changes in the scenes as the user/headset moves.
  • the computing system 115n of the XR apparatus 115 may constantly receive fisheye images as the user/headset changes position; the pretrained model thus iteratively generates depth maps to update the output image, including the virtual content (e.g., the object image) shown on the display.
  • FIG. 3 is a simplified block diagram illustrating a system for generating a depth estimation model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • a depth estimation training system 300 includes, without limitation, a processor module 305, a CPU 310, a GPU 315, an NPU 320, a communication interface 325, a storage 330, a memory 335, a data bus 340, and/or the like.
  • the elements of depth estimation system 300 can be configured together to perform a depth estimation training process and generate a depth estimation model, as further described below.
  • processor module 305 may communicatively be coupled to communication interface 325, storage 330, and memory 335 via data bus 340.
  • the communication interface 325 provides wired or wireless communication with other devices and/or networks.
  • system 300 may access a plurality of training images captured by one or more fisheye lenses from network 350 and/or a reference data model via communication interface 325 for depth estimation training.
  • Storage 330 may include a dynamic random-access memory (DRAM) and/or non-volatile memory.
  • the plurality of training images may be stored in storage 330 to train a training data model.
  • memory 335 is configured to store one or more sequences of instructions executable by processor module 305.
  • the training data model may be stored in the memory and is continually updated in response to processor 305 executing a sequence of instructions (e.g., depth estimation algorithms, pose extraction algorithms, loss term calculation algorithms, and/or the like).
  • the training images may be processed by GPU 315 to extract depth and/or pose features from the image during the training process.
  • CPU 310 and/or NPU 320 may iteratively perform a depth estimation training algorithm to refine the training data model.
  • FIG. 4 is a simplified block diagram illustrating a machine learning method according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
  • a depth estimation training process 400 is implemented with a computing system (e.g., depth estimation training system 300).
  • the depth estimation training process 400 first starts with receiving a plurality of training images 405.
  • the plurality of training images 405 is captured by one or more fisheye lenses.
  • the plurality of training images 405 may be captured by a pair of stereo cameras respectively equipped with a fisheye lens and/or a sensor.
  • the plurality of training images 405 may therefore be acquired in the form of a plurality of stereoscopic image pairs.
  • the plurality of training images 405 may be captured by a single monocular camera equipped with a fisheye lens and/or a sensor.
  • the plurality of training images 405 may then be acquired in the form of a plurality of temporal monocular images.
  • the training images 405 may be a plurality of sequential frames from a monocular video.
  • the camera(s) may be characterized by a plurality of extrinsic parameters associated with its position and orientation (e.g., camera pose) and/or a plurality of intrinsic parameters associated with its geometric properties (e.g., focal length, aperture, resolution, field-of-view, principal point, fisheye distortion characteristics, etc.).
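  • As an illustration of how these parameters might be grouped together, the sketch below collects the intrinsic parameters (focal length, principal point, fisheye distortion coefficients) and an extrinsic pose into one container. The field names and sample values are hypothetical, not taken from the specification.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FisheyeCamera:
    """Illustrative container for the camera parameters described above."""
    fx: float                # focal length (pixels), x
    fy: float                # focal length (pixels), y
    cx: float                # principal point, x
    cy: float                # principal point, y
    dist_coeffs: np.ndarray  # polynomial fisheye distortion coefficients k1..k4
    pose: np.ndarray = field(default_factory=lambda: np.eye(4))  # extrinsic 4x4 pose

    @property
    def K(self) -> np.ndarray:
        """3x3 intrinsic matrix assembled from the scalar parameters."""
        return np.array([[self.fx, 0.0, self.cx],
                         [0.0, self.fy, self.cy],
                         [0.0, 0.0, 1.0]])

# Example instantiation with placeholder values
cam = FisheyeCamera(fx=285.0, fy=285.0, cx=320.0, cy=200.0,
                    dist_coeffs=np.array([0.08, -0.01, 0.002, -0.0001]))
```

Grouping the parameters this way makes it straightforward to pass them to the training system along with the training images, as the preceding bullet describes.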
  • the fisheye lens is characterized by distortion characteristics, which result in a fisheye distortion of the captured training images.
  • the plurality of training images 405 may be sent to the depth estimation training system 300 via communication interface 325 (as shown in Figure 3).
  • the plurality of extrinsic and/or intrinsic parameters of the camera(s) may be provided to the depth estimation training system 300 along with the plurality of training images 405.
  • the depth estimation training system 300 includes a pose network 410 configured to estimate a geometrical difference between the training images 405.
  • pose network 410 may be trained, via a deep learning process, to predict the geometrical difference (e.g., camera pose difference) between a first monocular image characterized by a first timestamp and a second image characterized by a second timestamp.
  • the geometrical difference may be associated with a change of image position over a predetermined time interval (e.g., from the first timestamp to the second timestamp).
  • pose network 410 estimates a geometrical difference between the pair of stereoscopic images.
  • the geometrical difference may be determined using one or more extrinsic and/or intrinsic parameters (e.g., fisheye distortion characteristics).
  • the pose features extracted by the pose network 410 help to predict a transformation matrix 415 between the training images 405 (e.g., the first monocular image and the second monocular image).
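  • The pose features are commonly regressed as a 6-DoF vector (an axis-angle rotation plus a translation) and then assembled into a 4×4 transformation matrix. The sketch below shows that conversion via Rodrigues' formula; the 6-DoF parameterization is an assumption about the network output for illustration, not a detail stated in this description.

```python
import numpy as np

def pose_vec_to_matrix(axis_angle: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Convert an axis-angle rotation and a translation into a 4x4 transformation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        R = np.eye(3)                                  # near-zero rotation
    else:
        k = axis_angle / theta                         # unit rotation axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])             # cross-product matrix
        # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = translation
    return T

# Example: 90-degree rotation about z plus a small translation
T = pose_vec_to_matrix(np.array([0.0, 0.0, np.pi / 2]), np.array([0.1, 0.0, 0.0]))
```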
  • the pose difference may be fixed.
  • the pose network 410 allows the depth estimation training system 300 to learn the pose relationships among the training images 405, which may be used in later processes such as view synthesis and/or loss term calculation.
  • the depth estimation training system 300 includes a depth network 420 configured to generate a training depth map 425.
  • the depth network 420 takes the training images 405 (e.g., a plurality of distorted monocular images) as input and performs a deep learning process to predict a dense depth map as the training depth map 425.
  • a convolutional neural network is used to extract depth features and generate the training depth map 425 at a predetermined resolution. It is to be appreciated that a depth value of the training images 405 may be determined in various ways.
  • a depth value of the training images 405 is determined using a parallax between two fisheye lenses that are used to capture the training images.
  • the parallax may be associated with a plurality of extrinsic and/or intrinsic parameters of the fisheye lenses.
  • a depth value of a training image is determined based on a change of image position over a predetermined time interval.
  • the transformation matrix 415 generated by pose network 410 and the training depth map 425 produced by depth network 420 may be utilized to calculate a photometric loss 460 via a combination of unprojection and projection operations.
  • the unprojection and/or projection operations may be performed using a radial distortion polynomial associated with the fisheye distortion.
  • a view synthesis module 445 is configured to perform a view synthesis process to generate a synthesis image 450 from two neighboring fisheye images (e.g., the first monocular image and the second monocular image).
  • the view synthesis process learns a depth and pose relationship from the training images 405 using at least the transformation 415 and the training depth map 425. For example, with a pair of stereoscopic training images, the synthesis image 450 may be generated by projecting the right image onto the left image or vice versa. Similarly, with two temporal monocular images, the synthesis image 450 may be produced by projecting the first monocular image onto the second monocular image or vice versa.
  • the view synthesis module 445 performs an unprojection operation to unproject a uniform pixel grid into a fisheye space.
  • the uniform pixel grid may be characterized by a resolution that is substantially the same as the input training images (e.g., around 640 x 400).
  • an intermediate rectilinear depth map 440 is generated first using training depth map 425 to transform 3D points into camera coordinates.
  • the intermediate rectilinear depth map 440 may be obtained by projecting the training images 405 into a rectilinear space. Once the intermediate rectilinear depth map 440 is obtained, the unprojection operation is then performed by applying a radial distortion polynomial function.
  • the view synthesis module 445 is configured to further perform a projection operation using the intermediate rectilinear depth map 440 and the transformation matrix 415 to generate the distorted synthesis image 450.
  • the training data model 490 is trained in a self-supervised manner without requiring large amounts of ground truth data.
  • the view synthesis module 445 operates to synthesize virtual target images from neighboring views.
  • a projection function π is introduced to map 3D points P_i in 3D space to image coordinates p_i.
  • the corresponding unprojection function π⁻¹ is also used to convert image pixels, based on the training depth map D, into 3D space in order to acquire color information from other views.
  • mapping equation (Eqn. 1): p = π(P) = ( f·(ρ(θ)/r)·X + c_x, f·(ρ(θ)/r)·Y + c_y ), with r = sqrt(X² + Y²), where θ = atan2(r, Z) is the angle of incidence, ρ(θ) = θ + k₁θ³ + k₂θ⁵ + k₃θ⁷ + k₄θ⁹ is the polynomial radial distortion model mapping the incidence angle to the image radius, f and (c_x, c_y) stand for the focal length and the principal point derived from the intrinsic matrix K, and {k₁, k₂, k₃, k₄} denote the set of fisheye distortion coefficients.
  • an intermediate rectified depth map is generated first from the training depth map D to transform a pixel p_i into camera coordinates, by warping a pixel grid according to Eqn. 1.
  • the rectified depth map is then used to unproject the grid into 3D by applying the unprojection function π⁻¹ (Eqn. 2).
  • the view synthesis process includes: (1) unprojecting a uniform pixel grid, which has the same resolution as the input frames, through π⁻¹ (see Eqn. 2 above); and (2) projecting those 3D points by π (see Eqn. 1 above), together with the associated pose information from pose network 410, to obtain distorted synthesis images 450.
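  • The two-step process above (unproject through Eqn. 2, then project through Eqn. 1) can be sketched as follows. The odd-order polynomial ρ(θ) and the Newton iteration used to invert it are common fisheye-model conventions assumed here for illustration, and depth is taken along the viewing ray; the exact functions of the specification may differ.

```python
import numpy as np

def rho(theta, k):
    """rho(theta) = theta + k1*theta^3 + k2*theta^5 + ...  (incidence angle -> image radius)."""
    out, t = theta.copy(), theta.copy()
    for ki in k:
        t = t * theta * theta
        out = out + ki * t
    return out

def rho_prime(theta, k):
    """Analytic derivative of rho, used by the Newton inversion in unproject()."""
    out, t = np.ones_like(theta), np.ones_like(theta)
    for i, ki in enumerate(k):
        t = t * theta * theta
        out = out + (2 * i + 3) * ki * t
    return out

def project(P, f, cx, cy, k):
    """Eqn. 1 sketch: map 3D points (N, 3) to distorted fisheye pixels (N, 2)."""
    x, y, z = P[:, 0], P[:, 1], P[:, 2]
    r = np.sqrt(x * x + y * y)
    theta = np.arctan2(r, z)                                 # angle of incidence
    scale = np.where(r > 1e-12, rho(theta, k) / np.maximum(r, 1e-12), 0.0)
    return np.stack([f * scale * x + cx, f * scale * y + cy], axis=1)

def unproject(uv, depth, f, cx, cy, k, iters=20):
    """Eqn. 2 sketch: lift fisheye pixels (N, 2) with per-ray depth (N,) back to 3D."""
    u, v = (uv[:, 0] - cx) / f, (uv[:, 1] - cy) / f
    rad = np.sqrt(u * u + v * v)
    theta = rad.copy()                                       # Newton: solve rho(theta) = rad
    for _ in range(iters):
        theta = theta - (rho(theta, k) - rad) / rho_prime(theta, k)
    s = np.where(rad > 1e-12, np.sin(theta) / np.maximum(rad, 1e-12), 0.0)
    ray = np.stack([s * u, s * v, np.cos(theta)], axis=1)    # unit ray direction
    return ray * depth[:, None]

# Step (1) starts from a uniform pixel grid at the input resolution (around 640 x 400)
H, W = 400, 640
grid = np.stack(np.meshgrid(np.arange(W), np.arange(H)), axis=-1).reshape(-1, 2).astype(float)
```

A round trip (project, then unproject with the known ray lengths) recovers the original 3D points, which is the property the view synthesis relies on.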
  • the distorted synthesis image 450 is utilized to calculate a photometric loss 460 between two neighboring fisheye images.
  • photometric loss 460 is calculated by comparing the synthesis image 450 and one image of the training pair using at least the geometric difference.
  • Such a process may be performed iteratively via a deep learning algorithm to minimize the photometric loss and tune parameters for a training data model 490.
  • the 3D geometry can be preserved by incorporating photometric loss 460 in training the depth estimation model and/or updating the training depth map 425.
  • an edge-aware depth smoothness loss may be introduced by a smoothing module (not shown) to account for color similarity and/or irregularity between neighboring pixels.
  • an auto-masking mechanism may be adopted to mask out static pixels when computing the photometric loss 460. For example, one or more static regions may be identified, and a mask may be applied to discount these regions when calculating the photometric loss 460.
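  • A minimal sketch of such an auto-masking mechanism, assuming a Monodepth2-style criterion (keep a pixel only when warping the source view lowers its photometric error below that of the unwarped source); this is one common instantiation, not necessarily the one used here.

```python
import numpy as np

def auto_mask(err_warped, err_identity):
    """Binary mask: 1 where warping helps, 0 for (likely static) pixels it does not."""
    return (err_warped < err_identity).astype(float)

def masked_mean(loss_map, mask):
    """Average a per-pixel loss over the non-masked pixels only."""
    denom = mask.sum()
    return (loss_map * mask).sum() / denom if denom > 0 else 0.0
```

Applying `masked_mean` to the photometric error map discounts static regions when the loss is calculated, as described above.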
  • a target fisheye image I_t and a source fisheye image I_s are used to calculate the photometric loss.
  • the depth network 420 and the pose network 410 are jointly trained to predict a dense depth map D_t and a relative transformation matrix T_{t→s}.
  • the per-pixel minimum photometric loss can be calculated as follows: L_p = min_s pe(I_t, I_{s→t}), with I_{s→t} = I_s⟨proj(D_t, T_{t→s})⟩, where pe(·, ·) denotes a weighted combination of the L1 and Structured SIMilarity (SSIM) losses, and ⟨·⟩ is a bilinear sampling operator.
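  • The weighted L1 + SSIM combination can be sketched as below. For brevity, this sketch uses a single global SSIM window and takes the minimum per image rather than per pixel; both are simplifications of the loss described above, and the weight alpha is a commonly used placeholder value.

```python
import numpy as np

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over a single global window (the full loss uses local windows)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_error(a, b, alpha=0.85):
    """pe = alpha/2 * (1 - SSIM) + (1 - alpha) * mean |a - b|  (weighted L1 + SSIM)."""
    return alpha / 2 * (1 - ssim(a, b)) + (1 - alpha) * np.abs(a - b).mean()

def min_reprojection_loss(target, warped_sources):
    """Minimum over warped source views (per image here; the text uses per pixel)."""
    return min(photometric_error(target, w) for w in warped_sources)
```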
  • the edge-aware smoothness loss is defined as follows: L_s = |∂_x d*|·e^(−|∂_x I_t|) + |∂_y d*|·e^(−|∂_y I_t|), where d* is a mean-normalized inverse depth.
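  • A sketch of the edge-aware smoothness term, assuming a grayscale image for simplicity (a color image would average its gradients over channels): disparity gradients are penalized, but the penalty is damped across strong image edges, where depth discontinuities are expected.

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Edge-aware smoothness: disparity gradients gated by image gradients."""
    d = disp / (disp.mean() + 1e-7)            # mean-normalized inverse depth d*
    dx_d = np.abs(np.diff(d, axis=1))          # horizontal disparity gradient
    dy_d = np.abs(np.diff(d, axis=0))          # vertical disparity gradient
    dx_i = np.abs(np.diff(img, axis=1))        # image gradients damp the penalty
    dy_i = np.abs(np.diff(img, axis=0))
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```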
  • a reference data model 430 is introduced to train the training data model 490.
  • the reference data model 430 that is diversely trained across various datasets might be obtained from the network 350 via a communication interface 325 (as shown in Figure 3).
  • the reference data model 430 is well suited for, among other things, predicting relative depth relationships among pixels (as it is trained on large and diverse datasets), which promotes generality and cross-dataset learning in training the depth estimation model.
  • reference data model 430 receives the training images 405 and generates a reference depth map 435, which includes the depth ordering information (e.g., at per-pixel level).
  • a ranking loss 480 is calculated using the training depth map 425 and the reference depth map 435.
  • the ranking loss 480 may be calculated by comparing each pixel to its neighboring pixel (e.g., its left horizontal neighbor, and/or its top vertical neighbor, and/or the like).
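  • One plausible instantiation of this pixel-to-neighbor comparison is sketched below: ordinal labels are derived from the reference depth map by comparing each pixel with its left horizontal neighbor, and predicted depths are pushed to agree with that ordering (the top-vertical-neighbor version is analogous). The specific label rule and gap parameter are assumptions for illustration, not values from the specification.

```python
import numpy as np

def ranking_loss(pred, ref, alpha=0.02):
    """Ordinal ranking loss between each pixel and its left horizontal neighbor.

    Labels come from the reference depth map: +1 / -1 when the reference ratio
    falls outside the gap (1 + alpha), 0 when the two depths are roughly equal.
    """
    p0, p1 = pred[:, :-1], pred[:, 1:]             # pixel / left-neighbor pairs
    r0, r1 = ref[:, :-1], ref[:, 1:]
    ratio = r0 / (r1 + 1e-7)
    label = np.where(ratio > 1 + alpha, 1.0,
                     np.where(ratio < 1 / (1 + alpha), -1.0, 0.0))
    ordered = np.log1p(np.exp(-label * (p0 - p1)))  # penalize wrong ordering
    equal = (p0 - p1) ** 2                          # pull together when depths match
    return np.where(label != 0.0, ordered, equal).mean()
```

Predictions whose ordering agrees with the reference incur a smaller loss than predictions with the ordering reversed, which is how the depth ordering relationships are transferred from the reference data model.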
  • Ranking loss 480 allows the training data model 490 to learn the depth ordering relationships among pixels from the reference data model 430.
  • a scale invariant loss 470 is calculated using at least the training depth map 425 and the reference depth map 435.
  • the scale invariant loss 470 is configured to strengthen the supervision in texture-less regions (e.g., low-light regions, overexposure regions, homogenous regions in indoor environments, and/or the like).
  • the reference data model 430 may be greater than the training data model in terms of size, complexity, and/or the like.
  • the training depth map 425 and the reference depth map 435 are normalized to improve scale awareness of the depth estimation model.
  • a cumulative loss 485 is calculated using two or more of the loss terms to refine the training data model 490. For example, the cumulative loss value is calculated using the photometric loss 460, ranking loss 480, and scale invariant loss 470.
  • the training depth map is denoted as D and the reference depth map is denoted as D_MiDaS.
  • the ranking loss is calculated from each pixel to its left horizontal neighbor as follows: an ordinal label for each pixel pair is derived from the reference depth map (e.g., +1 or −1 when the ratio D_MiDaS(p_i)/D_MiDaS(p_j) falls outside the gap [1/(1 + α), 1 + α], and 0 otherwise), and the loss penalizes predicted depth pairs whose ordering disagrees with that label, where α and β are hyper-parameters controlling the ranking gap and can be set to predetermined values.
  • the ranking loss can also be calculated from each pixel to its top vertical neighbor.
  • the final ranking loss may be calculated as the sum of the horizontal and vertical ranking losses.
  • the training depth map 425 and the reference depth map 435 may first be normalized to generate normalized depth maps, respectively.
  • the scale invariant loss is defined as follows: L_si = (1/N) Σ_p |D̄(p) − D̄_MiDaS(p)|, where D̄ and D̄_MiDaS are the normalized training and reference depth maps and N is the number of pixels.
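  • A sketch of the normalization and scale-invariant comparison described above, assuming median normalization and a mean absolute difference (the exact normalization scheme is an assumption for illustration): after normalization, a global rescaling of either depth map no longer changes the loss.

```python
import numpy as np

def normalize_depth(d):
    """Median normalization removes the unknown global scale of a depth map."""
    return d / (np.median(d) + 1e-7)

def scale_invariant_loss(pred, ref):
    """Mean absolute difference between scale-normalized depth maps."""
    return np.abs(normalize_depth(pred) - normalize_depth(ref)).mean()
```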
  • the cumulative loss 485 is a weighted combination of the photometric loss, the edge-aware depth smoothness loss, the ranking loss, and the scale invariant loss, i.e., L_total = L_p + λ_s·L_s + λ_rank·L_rank + λ_si·L_si, where the λ coefficients are weighting factors.
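  • The weighted combination of the four loss terms can be sketched as below; the weight values are illustrative placeholders, not values from the specification.

```python
def cumulative_loss(photo, smooth, rank, scale_inv,
                    w_smooth=1e-3, w_rank=0.1, w_scale=0.1):
    """Cumulative loss: weighted sum of photometric, smoothness, ranking,
    and scale-invariant terms (weights are hypothetical)."""
    return photo + w_smooth * smooth + w_rank * rank + w_scale * scale_inv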
  • FIG. 5 is a simplified flow diagram illustrating a process for generating a depth estimation model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.
  • a plurality of training images is captured.
  • the plurality of training images is captured by one or more fisheye lenses and have a fisheye distortion associated with the one or more fisheye lenses.
  • the training images are a plurality of stereoscopic image pairs.
  • the training images are a plurality of monocular images or a plurality of temporal monocular image frames from a video.
  • a training data model is stored in a memory.
  • the training data model is first temporarily stored in volatile memory for further processing and can be constantly refined during a training process. The training data model may later be transferred to non-volatile memory once the training process completes.
  • a geometrical difference associated with the fisheye distortion is determined using the training images. For example, one or more extrinsic and/or intrinsic parameters of the fisheye lens may be used in determining the geometrical difference.
  • the geometrical difference may reflect the pose difference between the training images and can be used in the later process such as view synthesis, which generates a supervision signal for training the depth estimation model.
  • a photometric loss value is calculated using at least the geometric difference.
  • the photometric loss value may be calculated by comparing a synthesis image generated by view synthesis and at least one of the original training images.
  • the photometric loss value is associated with the 3D geometry of the scene embodied in one or more training images.
  • a training depth map is generated using at least the photometric loss value.
  • the training depth map, generated by a neural network, can be used to calculate one or more loss terms (e.g., photometric loss, and/or the like), which are configured to update the training data model.
  • a reference data model is obtained.
  • the reference data model is a depth estimation model that has been trained on a large amount of diverse data and can predict the depth ordering relationships among neighboring pixels.
  • the reference data model is larger than the training data model.
  • a reference depth map is generated using at least the reference data model.
  • the reference data model takes the training images as input and outputs the reference depth map based on the training images.
  • the reference depth map generated by the reference data model may estimate the relative depth relationship among neighboring pixels.
  • a ranking loss value is calculated using at least the training depth map and the reference depth map.
  • the ranking loss value may be calculated by comparing each pixel to its neighboring pixel (e.g., its left horizontal neighbor, and/or its top vertical neighbor, and/or the like).
  • the training data model gradually learns the depth ordering relationships from the reference data model. Since the training data model is smaller than the reference data model, it is able to predict similar depth ordering relationships with less computational time and resources.
  • the training data model is updated using at least the photometric loss value and the ranking loss value.
  • the photometric loss value and the ranking loss value may be continually minimized as the training process progresses, and the training data model may thus be constantly updated.
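  • The effect of this continual minimization can be illustrated with a toy stand-in for the training loop: subgradient descent on a single global scale parameter of a predicted depth map under an L1-style loss. The data, step size, and one-parameter model are hypothetical simplifications.

```python
import numpy as np

# Toy iterative refinement: each pass mirrors one training step that computes
# the loss, takes a (sub)gradient, and updates the model parameter, so the
# recorded loss shrinks as training progresses.
target = np.array([1.0, 2.0, 4.0])        # stand-in "ground signal"
base = np.array([0.5, 1.0, 2.0])          # stand-in unscaled prediction
s = 0.1                                   # learnable global scale
history = []
for _ in range(200):
    err = s * base - target
    s -= 0.05 * (np.sign(err) * base).mean()   # subgradient step on mean |err|
    history.append(np.abs(err).mean())
```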
  • additional loss terms (e.g., edge-aware depth smoothness loss, scale invariant loss, and/or the like) may also be used to update the training data model.
  • This diagram is merely an example, which should not unduly limit the scope of the claims.
  • One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

Extended reality systems and associated methods are provided. According to a specific embodiment, images captured using a single extreme wide-angle ("fisheye") lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. Other embodiments also exist.
PCT/US2022/033563 2021-06-25 2022-06-15 Procédés et systèmes d'estimation de profondeur à l'aide d'une caméra dite « fisheye » WO2022271499A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163215397P 2021-06-25 2021-06-25
US63/215,397 2021-06-25

Publications (2)

Publication Number Publication Date
WO2022271499A1 true WO2022271499A1 (fr) 2022-12-29
WO2022271499A8 WO2022271499A8 (fr) 2023-05-11

Family

ID=84544790

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/033563 WO2022271499A1 (fr) 2021-06-25 2022-06-15 Procédés et systèmes d'estimation de profondeur à l'aide d'une caméra dite « fisheye »

Country Status (1)

Country Link
WO (1) WO2022271499A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117201705A (zh) * 2023-11-07 2023-12-08 天津云圣智能科技有限责任公司 一种全景图像的获取方法、装置、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100097526A1 (en) * 2007-02-14 2010-04-22 Photint Venture Group Inc. Banana codec
US20140316665A1 (en) * 2012-03-29 2014-10-23 Harnischfeger Technologies, Inc. Collision detection and mitigation systems and methods for a shovel
US20160163110A1 (en) * 2014-12-04 2016-06-09 Htc Corporation Virtual reality system and method for controlling operation modes of virtual reality system
US20160180507A1 (en) * 2014-09-24 2016-06-23 Chung-Ang University Industry-Academy Cooperation Foundation Non-dyadic lens distortion correction method and apparatus
US20170091535A1 (en) * 2015-09-29 2017-03-30 BinaryVR, Inc. Head-mounted display with facial expression detecting capability

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100097526A1 (en) * 2007-02-14 2010-04-22 Photint Venture Group Inc. Banana codec
US20140316665A1 (en) * 2012-03-29 2014-10-23 Harnischfeger Technologies, Inc. Collision detection and mitigation systems and methods for a shovel
US20160180507A1 (en) * 2014-09-24 2016-06-23 Chung-Ang University Industry-Academy Cooperation Foundation Non-dyadic lens distortion correction method and apparatus
US20160163110A1 (en) * 2014-12-04 2016-06-09 Htc Corporation Virtual reality system and method for controlling operation modes of virtual reality system
US20170091535A1 (en) * 2015-09-29 2017-03-30 BinaryVR, Inc. Head-mounted display with facial expression detecting capability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHEN CHEN; GEORGIADIS ANTHIMOS: "Parameterized Synthetic Image Data Set for Fisheye Lens", 2018 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE), IEEE, 20 July 2018 (2018-07-20), pages 370 - 374, XP033501686, DOI: 10.1109/ICISCE.2018.00084 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117201705A (zh) * 2023-11-07 2023-12-08 天津云圣智能科技有限责任公司 一种全景图像的获取方法、装置、电子设备及存储介质
CN117201705B (zh) * 2023-11-07 2024-02-02 天津云圣智能科技有限责任公司 一种全景图像的获取方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
WO2022271499A8 (fr) 2023-05-11

Similar Documents

Publication Publication Date Title
US11756223B2 (en) Depth-aware photo editing
US20200051269A1 (en) Hybrid depth sensing pipeline
CN110574025B (zh) 用于合并交错通道数据的卷积引擎
CN109565551B (zh) 对齐于参考帧合成图像
US11663691B2 (en) Method and apparatus for restoring image
US10547822B2 (en) Image processing apparatus and method to generate high-definition viewpoint interpolation image
US20130335535A1 (en) Digital 3d camera using periodic illumination
US11656722B1 (en) Method and apparatus for creating an adaptive bayer pattern
WO2022076116A1 (fr) Segmentation pour effets d'image
EP3135033B1 (fr) Stéréo structurée
JP6882868B2 (ja) 画像処理装置、画像処理方法、システム
WO2013074561A1 (fr) Modification du point de vue d'une image numérique
EP3886044B1 (fr) Enregistrement robuste de surface sur la base d'une perspective paramétrée de modèles d'images
US11734877B2 (en) Method and device for restoring image obtained from array camera
WO2022271499A1 (fr) Procédés et systèmes d'estimation de profondeur à l'aide d'une caméra dite « fisheye »
US11770551B2 (en) Object pose estimation and tracking using machine learning
WO2021164000A1 (fr) Procédé, appareil et dispositif de traitement d'image, et support
US20230005213A1 (en) Imaging apparatus, imaging method, and program
CN114241127A (zh) 全景图像生成方法、装置、电子设备和介质
KR20210133472A (ko) 이미지 병합 방법 및 이를 수행하는 데이터 처리 장치
JP2012173858A (ja) 全方位画像生成方法、画像生成装置およびプログラム
WO2022241328A1 (fr) Procédé de détection de mouvement de la main et système à étalonnage de la forme de la main
WO2024072722A1 (fr) Zoom continu sans à-coups dans un système à caméras multiples grâce à des caractéristiques visuelles basées sur des images et des étalonnages géométriques optimisés
WO2023191888A1 (fr) Anti-mystification d'objet à base de corrélation pour caméras à double pixel
CN118096614A (zh) 一种双目图像同时去畸变及对齐方法、系统、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22829030

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22829030

Country of ref document: EP

Kind code of ref document: A1