WO2021240056A1 - Method and system for supporting 3D perception from monocular images


Info

Publication number
WO2021240056A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
scene
illumination condition
mfp
stack
Prior art date
Application number
PCT/FI2021/050370
Other languages
French (fr)
Inventor
Seppo Valli
Original Assignee
Teknologian Tutkimuskeskus Vtt Oy
Priority date
Filing date
Publication date
Application filed by Teknologian Tutkimuskeskus Vtt Oy filed Critical Teknologian Tutkimuskeskus Vtt Oy
Publication of WO2021240056A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/586Depth or shape recovery from multiple images from multiple light sources, e.g. photometric stereo
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01BMEASURING LENGTH, THICKNESS OR SIMILAR LINEAR DIMENSIONS; MEASURING ANGLES; MEASURING AREAS; MEASURING IRREGULARITIES OF SURFACES OR CONTOURS
    • G01B11/00Measuring arrangements characterised by the use of optical techniques
    • G01B11/24Measuring arrangements characterised by the use of optical techniques for measuring contours or curvatures
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C11/00Photogrammetry or videogrammetry, e.g. stereogrammetry; Photographic surveying
    • G01C11/04Interpretation of pictures
    • G01C11/06Interpretation of pictures by comparison of two or more pictures of the same area
    • GPHYSICS
    • G02OPTICS
    • G02BOPTICAL ELEMENTS, SYSTEMS OR APPARATUS
    • G02B30/00Optical systems or apparatus for producing three-dimensional [3D] effects, e.g. stereoscopic images
    • G02B30/50Optical systems or apparatus for producing three-dimensional [3D] effects, e.g. stereoscopic images the image being built up from image elements distributed over a 3D volume, e.g. voxels
    • G02B30/52Optical systems or apparatus for producing three-dimensional [3D] effects, e.g. stereoscopic images the image being built up from image elements distributed over a 3D volume, e.g. voxels the 3D volume being constructed from a stack or sequence of 2D planes, e.g. depth sampling systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/204Image signal generators using stereoscopic image cameras
    • H04N13/207Image signal generators using stereoscopic image cameras using a single 2D image sensor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • H04N13/264Image signal generators with monoscopic-to-stereoscopic image conversion using the relative movement of objects in two video frames or fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10141Special mode during image acquisition
    • G06T2207/10152Varying illumination

Abstract

A method and apparatus for deriving depth data for a scene from monocular images by: receiving monocular image data containing at least a pair of images: a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition. Processing a difference of the first image and second image to produce a difference image and determining an intensity of scattered and/or reflected light from portions of the scene based on the difference image. Then deriving a depth map comprising depth level information for the portions of the scene based on the determined intensity of scattered and/or reflected light from said portions of the scene.

Description

METHOD AND SYSTEM FOR SUPPORTING 3D PERCEPTION FROM MONOCULAR IMAGES
BACKGROUND
[0001] Stereoscopic rendering based content and services are now available on television sets, in computer games, and via Augmented Reality (AR) or Virtual Reality (VR) glasses. Even modern mobile phones provide for certain types of stereoscopic rendering.
[0002] While the applications employing stereoscopic rendering have increased, there are still many hurdles involved in such offerings. They require complex stereoscopic scene capture, content production and delivery. Further, current stereoscopic rendering techniques may result in discomfort when viewing stereoscopic (S3D) displays.
[0003] This discomfort when viewing S3D content is due at least in part to the fact that present displays force a viewer to accommodate, or focus, on one display surface while, at the same time, the viewer’s eyes converge on image details at varying vergence. In other words, current displays force a user to assume varying degrees of eye crossing. In contrast, within real-world viewing, focus and vergence are strongly related.
[0004] Multiple focal plane (MFP) displays avoid this vergence-accommodation conflict (VAC) by rendering each scene detail on a focal plane close to its real distance/depth. However, in addition to requiring challenging optical arrangements for display, an MFP display requires capturing depth corresponding to each image pixel. That is, MFP displays must know the depth properties of a captured scene.
[0005] Therefore, in order to avoid the problems associated with S3D displays and employ MFP displays, a depth map must be formed. However, such depth maps are typically created using stereoscopic cameras or complicated methods involving neural networks. Stereoscopic cameras tend to make content production more complicated, and typically make assumptions on the orientation when viewing the content. Neural networks again increase complication, typically needing a large training set with known or measured depth properties.
[0006] Further, capturing depth by stereovision is relatively inaccurate due to inaccuracies in deducing the true disparity of stereo views. Red Green Blue - Depth (RGB-D) approaches are relatively expensive, consume high power, and require accurate and potentially complex calibration between cameras and depth sensors. Moreover, although RGB-D approaches may capture depth accurately, they tend to result in low spatial resolution. Depth of focus approaches are also rather complex and, for good accuracy, may require multiple scene captures with different focal lengths. On the other hand, motion parallax methods may be inaccurate and, in particular, unable to capture 3D data from a single or few image captures.
SUMMARY OF THE INVENTION
[0007] Instead of deriving depth data by capturing stereoscopic disparity from stereoscopic images, for example by applying motion estimation methods, the present invention allows for the derivation of depth data from monoscopic images using differences in shading from images captured under differing lighting. Embodiments of the invention allow for stereoscopic rendering without the need for stereoscopic image capture. Certain embodiments of the invention even provide for the derivation of depth data from images captured without the use of Red Green Blue - Depth (RGB-D) sensors, for example via the sensors of a mobile phone, small consumer camera or other small consumer device.
[0008] Embodiments of the invention described herein provide for a much simpler method of deriving depth data and are able to derive scene depth for both stereoscopic 3D (S3D) and accommodative displays. Further, embodiments of the present invention are well suited for use in situations with an arbitrary stereo baseline, that is, head tilt when viewing 3D content.
[0009] The invention is defined by the features of the independent claims. Some specific embodiments are defined in the dependent claims.
[0010] According to a first aspect of the present invention, there is provided a method for deriving depth data for a scene from monocular images via shape-from-shading (SfS), the method comprising: - receiving monocular image data containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition, - processing a difference of the first image and second image to produce a difference image,
- determining an intensity of scattered and/or reflected light from portions of the scene based at least partly on the difference image,
- deriving a depth map comprising depth level information for the portions of the scene based at least partly on the determined intensity of scattered and/or reflected light from said portions of the scene.
[0011] According to a second aspect of the present invention, there is provided an apparatus comprising at least one processing core, at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processing core, cause the apparatus at least to:
- receive monocular image data containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition, - process a difference of the first image and second image to produce a difference image,
- determine an intensity of scattered and/or reflected light from portions of the scene based at least partly on the difference image,
- derive a depth map comprising depth level information for the portions of the scene based at least partly on the determined intensity of scattered and/or reflected light from said portions of the scene.
BRIEF DESCRIPTION OF THE DRAWINGS
[0012] FIGURES 1A and 1B illustrate processes employed by at least some embodiments of the present invention;
[0013] FIGURE 2 shows a process in accordance with certain embodiments of the present invention;
[0014] FIGURE 3 illustrates a process according to some embodiments of the present invention employing a serial stream of images;
[0015] FIGURE 4 illustrates an example apparatus capable of supporting at least some embodiments of the present invention, and
[0016] FIGURES 5A - 5F display images as processed by certain embodiments of the present invention.
EMBODIMENTS
[0017] Embodiments of the present invention provide a novel approach for capturing depth from monocular images using shape-from-shading (SfS) techniques. The approach is much simpler than existing methods while still being capable of providing scene depth information for both stereoscopic 3D (S3D) and accommodative displays. Furthermore, using monocular image capture of scenes as per certain embodiments of the present invention allows an arbitrary stereo baseline to be provided, accommodating head tilt when viewing content created via such embodiments. At least some embodiments of the present invention may be realized with simple consumer devices, for example a mobile phone or a consumer still or video camera.
[0018] According to certain embodiments, 3D information is captured from monocular images. In such embodiments, a scene is actively illuminated, and captures of the backscattered and/or reflected light are used to estimate distances of corresponding surfaces and pixels. For example, by processing a difference of images with and without flash, an estimate of the intensity of the scattered and reflected light is obtained. At least some embodiments of the present invention serve to limit or eliminate the effect of ambient lighting when deriving 3D information.
[0019] FIGURES 1A and 1B illustrate, broadly, processes employed by at least some embodiments of the present invention. Within Figure 1A, at least two images 130A and 130B are captured of a scene 100 comprising a three dimensional surface 110. Images 130A and 130B are captured in differing illumination conditions, in this instance by employing a flash of the camera 120A and 120B. As illustrated, a single camera in the same position, differing only in the state of its flash, can be employed to capture the two images 130A and 130B. From these two images 130A and 130B a reconstruction of the scene 140 may be performed via methods as will be outlined below. In essence, the process can be used to derive a surface which would result in the two images 130A and 130B when illuminated in a similar fashion to the illumination present during capture of the images 130A and 130B. FIGURE 1B shows how incident light 150 is reflected from a surface 190 as both diffuse reflections 160 and a specular reflection 170.
[0020] FIGURE 2 shows a process in accordance with certain embodiments of the present invention. As seen, a method for deriving depth data for a scene from monocular images via shape-from-shading (SfS) is outlined, the method comprising at least four steps. Within step 210 monocular image data is received containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition. As will be outlined below, these illumination conditions can be actively varied or naturally variable. Within step 220, a difference of the first image and second image is processed to produce a difference image. The method continues within step 230, wherein an intensity of scattered and/or reflected light from portions of the scene is determined based at least partly on the difference image. Then within step 240 a depth map is derived comprising depth level information for the portions of the scene based at least partly on the determined intensity of scattered and/or reflected light from said portions of the scene.
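As a rough illustration of these four steps, the following is a minimal sketch in Python/NumPy. It is not the reference implementation of the method; the square-root mapping, the reduction of color images to a single channel, and the normalization details are assumptions based on the processing described later for FIGURES 3 and 5B.

```python
import numpy as np

def derive_depth_map(img_first, img_second, eps=1e-6):
    """Minimal sketch of the four-step method of FIGURE 2 (hypothetical helper).

    img_first  -- image under the first (weaker) illumination condition, float array
    img_second -- image of the same scene under the second (stronger) illumination
    Returns an 8-bit depth map where brighter values correspond to nearer surfaces.
    """
    # Step 220: difference image; illumination common to both captures cancels out.
    diff = np.clip(img_second.astype(np.float64) - img_first.astype(np.float64), 0.0, None)

    # Step 230: per-pixel estimate of the intensity of light scattered/reflected
    # back from each portion of the scene (reduce color images to one channel).
    intensity = diff.mean(axis=-1) if diff.ndim == 3 else diff

    # Step 240: reflected intensity falls off roughly with the squared distance, so
    # the square root of the intensity behaves like an inverse-depth (nearness) value.
    nearness = np.sqrt(intensity)

    # Normalize to the 0..255 range mentioned for depth maps in the description.
    nearness = (nearness - nearness.min()) / (nearness.max() - nearness.min() + eps)
    return np.round(nearness * 255).astype(np.uint8)
```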
[0021] According to certain embodiments of the present invention the depth map comprises an image of the scene containing information relating to the distance of surfaces of the scene from the capture point. In some embodiments the depth map may be referred to as, or is analogous to, a depth view, depth buffer or z-buffer. Within at least some embodiments the depth map contains values from 0 to 255. Such values may be encoded on a pixel-by-pixel basis or on an aggregate basis, for example. Within embodiments employing video coding schemes a slightly limited range may be beneficial, leaving some headroom and footroom outside the values reserved for the digital signal. Certain depth maps use more than 8 bits per pixel, for example depth maps derived from RGB-D sensors. In such a case, the applied video coding method may use correspondingly more bits per pixel and thus provide a greater dynamic range.
[0022] Within at least some embodiments depth level information comprises information on distances of portions of a scene from a fixed point, for example from the capture point of the image or some other reference point. As discussed, this depth level information may be provided in the form of a depth map for a portion of a scene, for example for surfaces of the scene. Certain embodiments provide for compensation for motion between an instant of capture for the first image and an instant of capture for the second image. This may be accomplished, for example, by spatially compensating for a possible mismatch in captured images prior to processing the images to derive depth information.
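Where headroom and footroom are desired for video coding, a depth map normalized to [0, 1] can for example be mapped into a limited 8-bit range. The 16..235 limits below follow common limited-range video practice and are an assumption; the description only requires that some headroom and footroom be left.

```python
import numpy as np

def quantize_depth_limited_range(depth, lo=16, hi=235):
    """Map a normalized depth map in [0, 1] to 8-bit values with headroom/footroom.

    lo/hi default to the common limited-range video levels (assumed here, not
    prescribed by the description).
    """
    d = np.clip(np.asarray(depth, dtype=np.float64), 0.0, 1.0)
    return np.round(lo + d * (hi - lo)).astype(np.uint8)
```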
[0023] The difference in illumination conditions as discussed above may be arrived at in a variety of ways. For example, within a first illumination condition there may be no active illumination, while in a second illumination condition there may be active illumination provided by, for example, a flash of the capturing device, such as an infrared flash. In certain embodiments this active illumination is provided within the visible light spectrum. In some embodiments active illumination is provided in the infrared spectrum or a combination of spectra.
[0024] Some embodiments rely on active illumination provided at two separate intensity levels to provide for different illumination conditions. This varying of active illumination is less noticeable by a user as, in certain embodiments, the flash appears to be on consistently and the variance in active illumination can be adjusted such that it is less perceptible during image capture. For example, a constant fill-flash light may be employed.
[0025] Illumination may be provided without active participation of the capture device, for example if the natural illumination of the scene to be captured varies then no active illumination may be required. However, even in cases when active illumination is desired but there is natural illumination, the effect of natural or non-actively provided illumination is effectively neutralized as it should impact image pairs equally and as such would cancel out during the processes described herein.
[0026] Within certain embodiments of the present invention, the exact location of the illumination source is not necessary to derive depth data from the scene. For example, when illumination not actively provided by the capture source is utilized the exact location of other illumination sources, such as the sun, electric light, fire light or other, is not needed to derive depth data. Even in embodiments wherein an illumination source of the capture device is utilized, the exact location of the illumination source relative to the capture point may not be necessary. Within certain embodiments the location of the illumination source is not provided but may be derived from captured image data.
[0027] Within embodiments having greater illumination in the second illumination condition than in the first illumination condition, processing a difference of the first image and second image may comprise subtracting the first image from the second image. Such a difference image may be formed based at least in part on the difference in reflections from points of the scene, for example diffuse and specular reflection.
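As a brief illustration of why this subtraction isolates the actively provided light, assume (as a simplification) that the ambient contribution A and the scene are unchanged between the two captures, and let F denote the contribution of the active illumination:

```latex
\[
  I_{2}(x,y) = A(x,y) + F(x,y), \qquad
  I_{1}(x,y) = A(x,y)
  \quad\Longrightarrow\quad
  \Delta I(x,y) = I_{2}(x,y) - I_{1}(x,y) = F(x,y).
\]
```

The ambient term cancels in the difference image, which is the cancellation effect described in paragraph [0025] above.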
[0028] Certain embodiments of the present invention provide for the generation of at least one of a stereoscopic image pair and a Multiple Focal Plane (MFP) stack based at least in part on the derived depth map. Such MFP stacks may also be used to create stereoscopic image pairs. Some embodiments provide for production of an image pair or MFP stack which uses a viewer’s head orientation or tilt and/or motion data to support stereoscopic disparity and motion parallax. Certain embodiments generate an MFP stack and then shift the MFP stack to obtain a stereoscopic MFP stack.
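One possible way to realize the shifting of an MFP stack into a stereoscopic MFP stack is sketched below; the plane ordering, shift profile and wrap-around handling are illustrative assumptions rather than a prescribed method.

```python
import numpy as np

def shift_plane(plane, shift_px):
    """Shift a focal-plane image horizontally by an integer number of pixels.

    np.roll wraps at the image border; a real implementation would pad instead.
    """
    return np.roll(plane, shift_px, axis=1)

def stereoscopic_mfp(mfp_stack, max_disparity_px=8):
    """Sketch: turn a monoscopic MFP stack into left/right MFP stacks by shifting planes.

    mfp_stack -- list of focal-plane images ordered from nearest to farthest.
    Nearer planes receive larger opposite shifts, which creates the synthetic
    stereoscopic disparity; the exact shift profile is an assumption here.
    """
    n = len(mfp_stack)
    left, right = [], []
    for i, plane in enumerate(mfp_stack):
        # Disparity decreases with plane index (near planes move the most).
        d = int(round(max_disparity_px * (n - 1 - i) / max(n - 1, 1)))
        left.append(shift_plane(plane, +d // 2))
        right.append(shift_plane(plane, -d // 2))
    return left, right

# Since the focal planes sum to the original image (see FIGURES 5D and 5E),
# a stereo pair can be formed by summing each shifted stack:
# left_view = sum(left); right_view = sum(right)
```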
[0029] Within some embodiments processing of the difference images comprises a square root operation. Certain embodiments further employ a step of leveling at least one of the captured, difference or processed images such that the images are adjusted to account for overly bright or dark areas. Some embodiments employ mapping operations, for example scaling operations, and/or filtering operations at some point during the process.
[0030] Some embodiments of the present invention employ a transmitter and receiver topology whereby an image capture device captures a series of images under varying illumination conditions, potentially multiplexes said images and then transmits them to a receiver. Within the receiver the images are then processed according to methods disclosed herein to arrive at 3D data for the captured images. Such a transmitter and receiver combination could be, for example, a mobile phone running an application and a server configured to receive information from the application.
[0031] FIGURE 3 illustrates a process according to certain embodiments of the present invention which employs a serial stream of images with alternating illumination conditions, for example, as could be provided by a flash of a capture device which is activated every other frame of capture. Within some embodiments image data, for example a video, is captured while a flash is intermittently fired, for example, such that every other frame is illuminated.
[0032] In certain embodiments of the present invention, first a serial stream of images with alternating illumination conditions is received 310. Optionally, this serial stream of images is demultiplexed 320 and motion during capture is estimated and compensated for 330. Then a difference between two images in the stream is formed 340 and processed into a difference map 350, for example by taking the square root. Leveling, by adjusting bright levels backwards, may be performed 360 prior to filtering for smoothing of the depth map 370. Finally, a depth map approximation is forwarded to a next stage of depth map generation 380, for example by taking into account another image pair from the stream.
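A compact sketch of this pipeline might look as follows. The frame ordering (flash-lit frames at even indices), the stubbed-out motion compensation and leveling steps (330, 360), and the Gaussian smoothing used for step 370 are assumptions for illustration only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def process_stream(frames, smooth_sigma=3.0):
    """Sketch of the FIGURE 3 pipeline for a stream with alternating flash on/off frames.

    frames -- sequence of grayscale float images where even indices are flash-lit
              and odd indices are unlit (an assumption about the multiplexing order).
    """
    depth_maps = []
    frames = list(frames)
    for i in range(0, len(frames) - 1, 2):
        lit, unlit = frames[i], frames[i + 1]          # 320: demultiplex the pair
        # 330: estimate and compensate motion between the two captures (stub).
        # unlit = motion_compensate(unlit, lit)
        diff = np.clip(lit - unlit, 0.0, None)          # 340: difference of the pair
        depth = np.sqrt(diff)                           # 350: square-root mapping
        # 360: leveling (adjusting overly bright/dark areas) would go here.
        depth = gaussian_filter(depth, smooth_sigma)    # 370: smooth the depth map
        depth_maps.append(depth)                        # 380: forward to next stage
    return depth_maps
```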
[0033] FIGURE 4 illustrates a device capable of supporting at least certain embodiments of the claimed invention, for example a portable device such as a mobile phone, tablet or camera. Device 400 may comprise memory 420. Memory 420 may comprise random-access memory and/or permanent memory. Memory 420 may comprise at least one RAM chip. Memory 420 may comprise solid-state, magnetic, optical and/or holographic memory, for example. Memory 420 may be at least in part accessible to processor 410. Memory 420 may be at least in part comprised in processor 410. Memory 420 may be means for storing information. Memory 420 may comprise computer instructions that processor 410 is configured to execute. When computer instructions configured to cause processor 410 to perform certain actions are stored in memory 420, and device 400 overall is configured to run under the direction of processor 410 using computer instructions from memory 420, processor 410 and/or its at least one processing core may be considered to be configured to perform said certain actions. Memory 420 may be at least in part comprised in processor 410. Memory 420 may be at least in part external to device 400 but accessible to device 400.
[0034] Device 400 may comprise a transmitter 430. Device 400 may comprise a receiver 440. Transmitter 430 and receiver 440 may be configured to transmit and receive, respectively, information in accordance with at least one cellular or non-cellular standard. Transmitter 430 may comprise more than one transmitter. Receiver 440 may comprise more than one receiver. Transmitter 430 and/or receiver 440 may be configured to operate in accordance with global system for mobile communication, GSM, wideband code division multiple access, WCDMA, 5G, long term evolution, LTE, IS-95, wireless local area network, WLAN, Ethernet and/or worldwide interoperability for microwave access, WiMAX, standards, for example.
[0035] Device 400 may comprise user interface, UI, 460. UI 460 may comprise at least one of a display, a keyboard, a touchscreen, a vibrator arranged to signal to a user by causing device 400 to vibrate, a speaker and a microphone. A user may be able to operate device 400 via UI 460, for example to accept incoming telephone calls, to originate telephone calls or video calls, to browse the Internet, to manage digital files stored in memory 420 or on a cloud accessible via transmitter 430 and receiver 440, or via NFC transceiver 450, and/or to play games.
[0036] Processor 410 may be furnished with a transmitter arranged to output information from processor 410, via electrical leads internal to device 400, to other devices comprised in device 400. Such a transmitter may comprise a serial bus transmitter arranged to, for example, output information via at least one electrical lead to memory 420 for storage therein. Alternatively to a serial bus, the transmitter may comprise a parallel bus transmitter. Likewise, processor 410 may comprise a receiver arranged to receive information in processor 410, via electrical leads internal to device 400, from other devices comprised in device 400. Such a receiver may comprise a serial bus receiver arranged to, for example, receive information via at least one electrical lead from receiver 440 for processing in processor 410. Alternatively to a serial bus, the receiver may comprise a parallel bus receiver.
[0037] Device 400 may comprise further devices not illustrated in FIGURE 4. For example, where device 400 comprises a smartphone, it may comprise at least one digital camera. Some devices 400 may comprise a back-facing camera and a front-facing camera, wherein the back-facing camera may be intended for digital photography and the front-facing camera for video telephony. Device 400 may comprise a fingerprint sensor arranged to authenticate, at least in part, a user of device 400. In some embodiments, device 400 lacks at least one device described above. For example, some devices 400 may lack an NFC transceiver 450 and/or user identity module 470.
[0038] Processor 410, memory 420, transmitter 430, receiver 440 and/or UI 460 may be interconnected by electrical leads internal to device 400 in a multitude of different ways. For example, each of the aforementioned devices may be separately connected to a master bus internal to device 400, to allow for the devices to exchange information. However, as the skilled person will appreciate, this is only one example and depending on the embodiment various ways of interconnecting at least two of the aforementioned devices may be selected without departing from the scope of the present invention.
[0039] As shown at least certain embodiments of the present invention comprise at least one processing core 410, at least one memory 420 including computer program code, the at least one memory 420 and the computer program code being configured to, with the at least one processing core 410, cause the apparatus at least to:
- receive monocular image data containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition, - process a difference of the first image and second image to produce a difference image,
- determine an intensity of scattered and/or reflected light from portions of the scene based at least partly on the difference image,
- derive a depth map comprising depth level information for the portions of the scene based at least partly on the determined intensity of scattered and/or reflected light from said portions of the scene.
[0040] Certain embodiments further comprise an image sensor 450 and illuminator 470, such as the image sensor and flash of a mobile phone, the apparatus being further configured to capture the monocular image data while varying illumination conditions of the scene. In some embodiments the scene is actively illuminated in the second illumination condition by the illuminator 470 of the apparatus. Certain apparatuses are configured such that within the first illumination condition, the scene is actively illuminated by a flash or the illuminator 470 at a lower intensity level than it is illuminated by a flash or illuminator 470 in the second illumination condition.
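For illustration only, a capture loop alternating two active illumination levels could be structured as below; `camera` and `flash` are hypothetical device handles, not a real mobile-platform API, and the specific intensity levels are arbitrary.

```python
def capture_flash_pairs(camera, flash, n_pairs=1):
    """Sketch of capturing image pairs under two illumination conditions.

    `camera.capture()` and `flash.set_level()` are assumed, hypothetical interfaces;
    real platform APIs differ. The point is only the alternation of a weaker and a
    stronger active illumination level between consecutive captures.
    """
    pairs = []
    for _ in range(n_pairs):
        flash.set_level(0.2)          # first illumination condition (weaker fill flash)
        img_low = camera.capture()
        flash.set_level(1.0)          # second illumination condition (stronger flash)
        img_high = camera.capture()
        pairs.append((img_low, img_high))
    return pairs
```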
[0041] Apparatuses according to the current invention may be configured such that: there is greater illumination in the second illumination condition than the first illumination condition and processing a difference of the first image and second image comprises subtracting the first image from the second image. They may be further configured to generate at least one of a stereoscopic image pair and a Multiple Focal Plane (MFP) stack based at least in part on the derived depth map. Still further, they may be configured to generate a Multiple Focal Plane (MFP) stack and then shift the MFP stack to obtain a stereoscopic MFP stack.
[0042] FIGURES 5A - 5F display images as processed by certain embodiments of the present invention, including virtual viewpoints shown as printed cross-eyed stereograms. FIGURE 5A illustrates a scene captured under two different illumination conditions. One monoscopic camera is used to capture the images at slightly different points in time to produce the image pair as shown. As discussed above, there are a variety of ways in which to achieve the differences in illumination. As the images are captured at different times there may have been motion of the capture means between image captures; as such, certain embodiments correct for this motion prior to further processing.
[0043] Within FIGURE 5B the differences of the images of Figure 5A are processed to first form a difference image (a). In certain embodiments the difference image (a) is further processed by scaling, for example based on a histogram, and remapped to arrive at image (b). As the intensity of reflections decays in inverse proportion to the squared distance, this scaling may be accomplished by a square root function. This image (b) may be further processed by filtering to reduce steps or coarseness to arrive at image (c).
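The reasoning behind the square-root mapping can be summarized as follows, assuming a simple inverse-square falloff of the actively provided light and a per-pixel reflectance factor (this model is an editorial gloss, not an equation from the description):

```latex
\[
  \Delta I(x,y) \;\propto\; \frac{\rho(x,y)}{d(x,y)^{2}}
  \quad\Longrightarrow\quad
  \sqrt{\Delta I(x,y)} \;\propto\; \frac{\sqrt{\rho(x,y)}}{d(x,y)},
\]
```

so the square root of the difference image behaves like an inverse-depth (nearness) value. The residual dependence on the reflectance factor is what the leveling step described in the next paragraph compensates for.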
[0044] Surface properties generally bias reflected intensities so that dark surfaces, corresponding to more diffuse and smaller reflections, show up further away than light surfaces, corresponding to less diffuse and stronger reflections. Correspondingly, a mechanism (here denoted leveling) may be employed for preventing biasing of light and dark areas to different distances when they are in fact at about the same depth. Figure 5C shows the effects of this leveling process, as can be seen by comparing, for example, the shade differences at the doll’s hair, sleeves, and chest within Figure 5C to those of image (c) of Figure 5B. In at least some embodiments this leveling is a recursive process whereby a depth map is used for the first leveling round, which in turn produces an improved depth map, used for the second leveling round, and so on, until the result converges or some threshold is met.
[0045] FIGURE 5D shows the depth map of FIGURE 5B image (c) as broken down into three weight maps (a), (b) and (c), for example by using linear depth blending. The number of weight maps here is for illustrative purposes and corresponds to the number of focal planes desired; this number may be increased for higher quality outputs. In certain embodiments the weight maps sum to a value of one and can therefore be used to create focal planes which sum to the original image, the original image as illustrated herein being a normally lit doll picture. FIGURE 5E shows the three focal planes (a), (b), (c) derived from the weight maps in Figure 5D. FIGURE 5F shows a cross-eyed stereo pair formed using the three focal planes shown in Figure 5E.
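Linear depth blending as used for FIGURES 5D and 5E can be sketched as follows. The tent-shaped weight functions are one common choice and an assumption here; the essential property is that the weights sum to one per pixel, so the resulting focal planes sum back to the original image.

```python
import numpy as np

def linear_depth_blending(image, depth, plane_depths):
    """Sketch: split an image into focal planes via tent-shaped weight maps.

    image        -- H x W (x C) image under normal lighting
    depth        -- H x W depth map on the same scale as plane_depths (e.g. 0..255)
    plane_depths -- strictly increasing depths of the focal planes, e.g. [0, 128, 255]
    Returns (weights, planes); the per-pixel weights sum to one.
    """
    depth = np.asarray(depth, dtype=np.float64)
    weights = []
    for i, d in enumerate(plane_depths):
        lower = plane_depths[i - 1] if i > 0 else d
        upper = plane_depths[i + 1] if i < len(plane_depths) - 1 else d
        w = np.zeros_like(depth)
        if i > 0:  # rising edge of the tent between the previous plane and this one
            rising = (depth >= lower) & (depth <= d)
            w[rising] = (depth[rising] - lower) / (d - lower)
        if i < len(plane_depths) - 1:  # falling edge towards the next plane
            falling = (depth > d) & (depth <= upper)
            w[falling] = (upper - depth[falling]) / (upper - d)
        if i == 0:
            w[depth <= d] = 1.0   # everything nearer than the first plane
        if i == len(plane_depths) - 1:
            w[depth >= d] = 1.0   # everything farther than the last plane
        weights.append(w)
    planes = [image * w[..., None] if image.ndim == 3 else image * w for w in weights]
    return weights, planes
```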
[0046] At least some embodiments of the present invention capture orientation information during image capture, for example via a sensor within the image capture device that senses the direction of gravitational pull. In such embodiments a “gravity stabilized” capture mode can be implemented whereby a user need not be concerned with the orientation of the image capture device during capture. This orientation information can then be used during generation of an MFP stack or stereoscopic image pair. For example, when synthesizing disparity for an S3D stereo video or a stereoscopic MFP video, the orientation of the baseline may be decided and supported freely in the receiver. In certain instances this may be done based on the viewer’s head tilt. Further embodiments may support synthetic motion parallax in viewing. Certain embodiments of the present invention provide for a 3D video service which allows shooting and viewing of stable S3D or accommodative MFP videos with less restrictive assumptions on both capture and viewing orientations.
[0047] Within some embodiments the capture device captures orientation information, such as a gravity vector, and includes it, potentially in multiplexed and/or encoded image data, during capture of images as per the methods discussed above. That orientation information is then used to stabilize or adjust the captured image data so that it has a consistent orientation. This stabilization may occur at any point in the processing of the image as per the methods described herein. For example, the stabilization may occur within the capture device prior to processing or sending the captured information. In at least some embodiments this stabilization occurs after the formation of depth map information.

[0048] Certain embodiments of the present invention employ both orientation derived during image capture and orientation derived during display in order to match the orientation of a captured scene with the head tilt of a viewer of a displayed scene, potentially a scene derived from a stereoscopic image pair or MFP stack created as per the methods described herein.
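As a hedged illustration of gravity-based stabilization (not the disclosed implementation; the sign convention and the names gravity_stabilize and gravity_xy are assumptions of this sketch), a captured frame could be rotated by the roll angle estimated from the device's gravity vector:

```python
import numpy as np
import cv2

def gravity_stabilize(frame, gravity_xy):
    """frame: HxW(x3) image; gravity_xy: (gx, gy) components of the gravity
    vector projected into the image plane, e.g. from the device accelerometer."""
    gx, gy = gravity_xy
    roll_deg = np.degrees(np.arctan2(gx, gy))          # 0 when gravity points straight down in the image
    h, w = frame.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), roll_deg, 1.0)
    return cv2.warpAffine(frame, rot, (w, h))          # rotate to a consistent orientation
```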
[0049] For purposes of display, embodiments of the present invention may employ a conventional auto-stereoscopic display, a shutter-glasses based S3D display, or a stereoscopic near-eye display, for example as can be implemented via a mobile phone or tablet.

[0050] Certain embodiments of the present invention provide the ability for a user to adjust the brightness of a rendered 3D image between the levels of illumination obtained during image capture.
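One simple way to realize the brightness adjustment of paragraph [0050] would be to blend the two captured exposures, as in the following sketch; the linear blend and the variable names are illustrative assumptions only:

```python
import numpy as np

def blend_brightness(img_dark, img_flash, t):
    """t = 0 reproduces the ambient-lit capture, t = 1 the flash-lit capture."""
    return ((1.0 - t) * img_dark.astype(np.float32)
            + t * img_flash.astype(np.float32)).astype(np.uint8)
```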
[0051] At least some embodiments of the present invention employ a calibration pattern prior to capturing images. The known properties of the calibration pattern can then be used to improve the derivation of 3D information.
[0052] Certain embodiments of the present invention employ a step of displaying the generated stereoscopic image pair and/or MFP stack. Within some embodiments, data is received regarding the orientation of a viewer’s head, and said orientation data is used when generating at least one of a stereoscopic image pair and a Multiple Focal Plane (MFP) stack.
[0053] Within some embodiments the image data is received as multiplexed data containing alternating images under alternating illumination conditions.
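A minimal sketch of demultiplexing such a stream is shown below; the assumption that even-indexed frames are ambient-lit and odd-indexed frames are flash-lit is illustrative only:

```python
def demux_alternating(frames):
    """frames: sequence of images captured with the flash fired on every other frame.
    Returns (ambient_frames, flash_lit_frames) as equal-length paired lists."""
    ambient = frames[0::2]      # assumed: even indices captured without flash
    flash_lit = frames[1::2]    # assumed: odd indices captured with flash
    n = min(len(ambient), len(flash_lit))
    return ambient[:n], flash_lit[:n]
```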
[0054] At least some embodiments filter high-frequency textures from the depth map. Certain embodiments employ a step of processing the difference image prior to the derivation of the depth map, for example by scaling, linearization and/or filtering of the difference image.
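By way of example only, high-frequency texture could be suppressed with a median filter over the depth map (OpenCV; the kernel size is an arbitrary illustrative choice):

```python
import cv2

def smooth_depth(depth_u8, ksize=11):
    """depth_u8: single-channel uint8 depth map. Returns a texture-suppressed map."""
    return cv2.medianBlur(depth_u8, ksize)   # ksize must be odd
```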
[0055] It is to be understood that the embodiments of the invention disclosed are not limited to the particular structures, process steps, or materials disclosed herein, but are extended to equivalents thereof as would be recognized by those ordinarily skilled in the relevant arts. It should also be understood that terminology employed herein is used for the purpose of describing particular embodiments only and is not intended to be limiting.
[0056] Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment.
[0057] As used herein, a plurality of items, structural elements, compositional elements, and/or materials may be presented in a common list for convenience. However, these lists should be construed as though each member of the list is individually identified as a separate and unique member. Thus, no individual member of such a list should be construed as a de facto equivalent of any other member of the same list solely based on their presentation in a common group without indications to the contrary. In addition, various embodiments and examples of the present invention may be referred to herein along with alternatives for the various components thereof. It is understood that such embodiments, examples, and alternatives are not to be construed as de facto equivalents of one another, but are to be considered as separate and autonomous representations of the present invention.
[0058] Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of lengths, widths, shapes, etc., to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
[0059] While the foregoing examples are illustrative of the principles of the present invention in one or more particular applications, it will be apparent to those of ordinary skill in the art that numerous modifications in form, usage and details of implementation can be made without the exercise of inventive faculty, and without departing from the principles and concepts of the invention. Accordingly, it is not intended that the invention be limited, except as by the claims set forth below.
[0060] The verbs “to comprise” and “to include” are used in this document as open limitations that neither exclude nor require the existence of also un-recited features. The features recited in dependent claims are mutually freely combinable unless otherwise explicitly stated. Furthermore, it is to be understood that the use of "a" or "an", i.e. a singular form, throughout this document does not exclude a plurality.

Claims

CLAIMS:
1. A method for deriving depth data for a scene from monocular images via shape-from- shading (SfS), the method comprising:
- receiving monocular image data containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition,
- processing a difference of the first image and second image to produce a difference image,
- determining an intensity of scattered and/or reflected light from portions of the scene based at least partly on the difference image, and
- deriving a depth map comprising information on distances of the portions of the scene from a fixed point, characterized in that the depth map is derived based at least partly on the determined intensity of scattered and/or reflected light from said portions of the scene.
2. The method of claim 1, wherein in the second illumination condition, the scene is actively illuminated, for example by a flash, such as an infrared flash.
3. The method of any preceding claim, wherein in the first illumination condition there is no active illumination.
4. The method of any preceding claim, wherein there is greater illumination in the second illumination condition than the first illumination condition and processing a difference of the first image and second image comprises subtracting the first image from the second image.
5. The method of any preceding claim, further comprising the step of generating at least one of a stereoscopic image pair and a Multiple Focal Plane (MFP) stack based at least in part on the derived depth map.
6. The method of any preceding claim, further comprising the step of generating a Multiple Focal Plane (MFP) stack and then shifting the MFP stack to obtain a stereoscopic MFP stack.
7. The method of any preceding claim, wherein the image data further comprises orientation information captured during image capture, such as orientation information relative to the gravity direction.
8. The method of claim 7, further comprising the step of displaying the stereoscopic image pair or MFP stack wherein the image data further comprises orientation information captured during said display, such as orientation information relative to the gravity direction, the orientation information being used during generation of the stereoscopic image pair or MFP stack such that an orientation of a captured scene is matched with a head tilt of a viewer of a displayed scene.
9. The method of any preceding claim wherein the image data consists of image data captured while a flash is intermittently fired, for example, such that every other frame is illuminated.
10. The method of any preceding claim, further comprising the step of compensating for motion between an instant of capture for the first image and an instant of capture for the second image.
11. The method of any preceding claim, wherein in the first illumination condition, the scene is actively illuminated by a flash at a lower intensity level than it is illuminated by a flash in the second illumination condition.
12. An apparatus comprising at least one processing core, at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processing core, cause the apparatus at least to:
- receive monocular image data containing at least a pair of images comprising a first image of a scene in a first illumination condition and a second image of the scene in a second illumination condition different from the first illumination condition,
- process a difference of the first image and second image to produce a difference image,
- determine an intensity of scattered and/or reflected light from portions of the scene based on the difference image, and
- derive a depth map comprising information on distances of the portions of the scene from a fixed point, characterized in that the depth map is derived based on the determined intensity of scattered and/or reflected light from said portions of the scene.
13. The apparatus of claim 12, further comprising an image sensor and illuminator, such as the image sensor and flash of a mobile phone, the apparatus being further configured to capture the monocular image data while varying illumination conditions of the scene.
14. The apparatus of claim 12 or 13 wherein in the second illumination condition, the scene is actively illuminated.
15. The apparatus of any of claims 12 - 14, wherein there is greater illumination in the second illumination condition than the first illumination condition and processing a difference of the first image and second image comprises subtracting the first image from the second image.
16. The apparatus of any of claims 12 - 15, wherein the apparatus is further configured to generate at least one of a stereoscopic image pair and a Multiple Focal Plane (MFP) stack based at least in part on the derived depth map.
17. The apparatus of any of claims 12 - 16, wherein the apparatus is further configured to generate a Multiple Focal Plane (MFP) stack and then shift the MFP stack to obtain a stereoscopic MFP stack.
18. The apparatus of any of claims 12 - 17, wherein the image data further comprises orientation information captured during image capture, such as orientation information relative to the gravity direction.
19. The apparatus of claim 18, wherein the apparatus is further configured to display the stereoscopic image pair or MFP stack, wherein the image data further comprises orientation information captured during said display, such as orientation information relative to the gravity direction, the orientation information being used during generation of the stereoscopic image pair or MFP stack such that an orientation of a captured scene is matched with a head tilt of a viewer of a displayed scene.
20. The apparatus of any of claims 12 - 19, wherein in the first illumination condition, the scene is actively illuminated by a flash at a lower intensity level than it is illuminated by a flash in the second illumination condition.
21. A non-transitory computer readable medium having stored thereon a set of computer readable instructions that, when executed by at least one processor, cause an apparatus to at least perform a method in accordance with at least one of claims 1 - 11.
PCT/FI2021/050370 2020-05-25 2021-05-21 Method and system for supporting 3d perception from monocular images WO2021240056A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205528 2020-05-25
FI20205528A FI20205528A1 (en) 2020-05-25 2020-05-25 Method and system for supporting 3D perception from monocular images

Publications (1)

Publication Number Publication Date
WO2021240056A1 true WO2021240056A1 (en) 2021-12-02

Family

ID=76217875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2021/050370 WO2021240056A1 (en) 2020-05-25 2021-05-21 Method and system for supporting 3d perception from monocular images

Country Status (2)

Country Link
FI (1) FI20205528A1 (en)
WO (1) WO2021240056A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1400924B1 (en) * 2002-09-20 2008-12-31 Nippon Telegraph and Telephone Corporation Pseudo three dimensional image generating apparatus
US20090033753A1 (en) * 2007-07-31 2009-02-05 Daisuke Sato Imaging apparatus, imaging method, storage medium, and integrated circuit
EP3151190A1 (en) * 2015-09-30 2017-04-05 Thomson Licensing Method, apparatus and system for determining normal and reflectance parameters using sequential illumination
WO2019183211A1 (en) * 2018-03-23 2019-09-26 Pcms Holdings, Inc. Multifocal plane based method to produce stereoscopic viewpoints in a dibr system (mfp-dibr)

Also Published As

Publication number Publication date
FI20205528A1 (en) 2021-11-26


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21729342

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21729342

Country of ref document: EP

Kind code of ref document: A1