US9414048B2 - Automatic 2D-to-stereoscopic video conversion - Google Patents
- Publication number
- US9414048B2 (application US13/315,488)
- Authority
- US
- United States
- Prior art keywords
- input image
- view
- per
- depth
- images
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
- H04N13/026
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/20—Image signal generators
- H04N13/261—Image signal generators with monoscopic-to-stereoscopic image conversion
Definitions
- a “Stereoscopic Video Converter” provides various techniques for automatically converting arbitrary 2D video sequences into perceptually plausible stereoscopic or “3D” versions based on estimations of dense depth maps for every frame of the input video sequence.
- 3D videos have become increasingly popular.
- 3D implies the presentation of stereoscopic video that displays separate corresponding images to the left and right eyes to convey the sense of depth to the viewer.
- various techniques have been developed for converting monoscopic (2D) videos into 3D.
- in semi-automatic methods, a user generally guides the conversion process, e.g., by annotating depth on various frames, drawing rough occlusion boundaries, or iteratively refining automatic estimates.
- Semi-automatic methods are interesting because they attempt to improve automatic results with some user intervention; however, most of these methods focus on interfaces for utilizing these methods rather than developing the methods themselves.
- SfM Structure from Motion
- a method for capturing robust 3D positions of points in images can be used when multiple camera views of a static scene are available.
- SfM output is generally sparse, but methods exist for obtaining dense point-based reconstructions, and SfM methods have been used to obtain super-resolution stereoscopic videos.
- various techniques have been used to synthesize dense surface (mesh) reconstructions.
- methods exist for obtaining temporally consistent depth-maps rather than explicitly reconstructing the 3D scene. It has also been shown that graph-based depth inference from multiple views under second order smoothness priors is tractable and leads to plausible results.
- other techniques estimate rough planar geometry of a scene for reconstruction and depth estimation, given multiple photographs. With active learning, similar methods have been applied to single images.
- Depth from motion parallax is another method for obtaining dense disparity. These methods typically use optical flow techniques to estimate motion parallax, which in turn can be used to hypothesize per-pixel depth. Depth from motion parallax methods can work for dynamic scenes, but are prone to tracking failures e.g., due to noise, textureless surfaces, and sharp motion.
- depth-based techniques require filling holes of unknown color that are guaranteed to appear during the depth image based rendering (DIBR) process of synthesizing a new view. Holes are most commonly filled using linear interpolation or Poisson blending (note that such techniques are sometimes based on solving Laplace's equation to minimize gradients in the hole regions). Inpainting has also been used for hole filling. Further, when sufficient views exist, occlusion information may be found in other frames for use in filling holes. Unfortunately, most of these methods tend to produce unnatural artifacts near occlusion boundaries, thus degrading the appearance of the resulting 3D image or video sequence.
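To make the hole problem concrete, the following is a minimal sketch of naive DIBR on a single grayscale scanline, followed by the linear-interpolation hole filling mentioned above. The function names, the integer disparity rounding, and the convention that pixels shift left to form the right view are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def forward_warp_row(colors, disparity):
    """Shift each pixel of one scanline left by its (rounded) disparity.

    Returns the warped row and a boolean mask marking holes, i.e. target
    positions that no source pixel landed on. Assumed convention: larger
    disparity means a nearer surface, which wins when two pixels collide.
    """
    width = colors.shape[0]
    out = np.zeros(width, dtype=float)
    best = np.full(width, -np.inf)  # largest disparity written at each target
    for x in range(width):
        target = x - int(round(float(disparity[x])))
        if 0 <= target < width and disparity[x] > best[target]:
            out[target] = colors[x]
            best[target] = disparity[x]
    holes = np.isinf(best)
    return out, holes

def fill_holes_linear(row, holes):
    """Fill hole pixels by linear interpolation between valid neighbors."""
    if holes.all():
        return row.copy()  # nothing to interpolate from
    xs = np.arange(row.shape[0])
    filled = row.copy()
    filled[holes] = np.interp(xs[holes], xs[~holes], row[~holes])
    return filled
```

In this sketch a foreground object with nonzero disparity leaves a gap on its right side after warping, and interpolation smears background and foreground colors across that gap, which is the kind of artifact near occlusion boundaries that the text describes.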
- DIBR depth image based rendering
- a “Stereoscopic Video Converter,” as described herein, provides various techniques for automatically converting arbitrary 2D video sequences into perceptually plausible stereoscopic or 3D versions based on estimations of dense depth maps for every frame of the video sequence.
- the Stereoscopic Video Converter is capable of performing the 2D-to-3D conversion process without requiring user inputs such as manual markups of the video sequence to specify depth, and without requiring an assumption that video scenes are sufficiently static scenes to allow conventional structure from motion and stereo techniques to be applied.
- the automated 2D-to-3D conversion process provided by the SVC includes a process for automated depth estimation via label transfer.
- This automated depth estimation process begins by providing a database of images and videos having known ground truth depths. Then, given a new image frame (i.e., each individual frame of an input video sequence to be converted from 2D-to-3D), the SVC matches features extracted from each image frame of the input video sequence with features from the images and videos in the database. The SVC then transfers the depths from matched features in the database to each corresponding image frame of the input video sequence as initial depth estimates for each of those frames. These initial estimates are then refined via an iterative process, with the final estimates then being used for view synthesis.
- the SVC automatically generates the “right” view of a corresponding stereoscopic image for each frame (assuming that each original input frame represents the “left” view of the stereoscopic image). It should be noted that the SVC could alternately generate the “left” view frame by assuming that each original input frame is the “right” view of the stereoscopic image. Thus, for purposes of explanation, the following discussion will simply refer to the original image frame as the “left view” and the automatically generated view as the “right view” of the resulting stereoscopic image. However, in various embodiments, the SVC generates both the left and right views from the input frame and depth information inferred for the input frame.
- the SVC generates the right view for each image frame by using an image saliency process that selectively stretches textures of the features extracted from each image frame when warping the left view to create the right view given the estimated depths of those features.
- parts of the image that are less salient (i.e., less important because they are less textured)
- the SVC generates the right view for each image frame by using image priors to guide an automated reconstruction of the right view.
- this process reconstructs the right view such that, locally, the reconstructed image looks very similar to some portions of the input left views.
- the left views used for this reconstruction process do not all need to be on the same timeframe.
- this alternate embodiment for reconstruction of the right view can use portions of different image frames (i.e., left views) from the input video sequence.
- this reconstruction process is extended by coupling depth estimation and view synthesis, so that depth estimation seeds view synthesis while a computed measure of appropriateness of the synthesized view (measured by fitting image priors) guides depth extraction.
- FIG. 1 provides an exemplary architectural flow diagram that illustrates program modules for implementing various embodiments of the Stereoscopic Video Converter (SVC), as described herein.
- SVC Stereoscopic Video Converter
- FIG. 2 provides an exemplary architectural flow diagram that expands upon the depth estimation module 125 of FIG. 1 and that further illustrates program modules for implementing various embodiments of the SVC, as described herein.
- FIG. 3 provides an exemplary architectural flow diagram that expands upon the view synthesis module 140 of FIG. 1 and that further illustrates program modules for implementing various embodiments of the SVC, as described herein.
- FIG. 4 provides an exemplary architectural flow diagram that expands upon an alternative implementation of the view synthesis module 140 of FIG. 1 and that further illustrates program modules for implementing various embodiments of the SVC, as described herein.
- FIG. 5 is a general system diagram depicting a simplified general-purpose computing device having simplified computing and I/O capabilities for use in implementing various embodiments of the SVC, as described herein.
- the Stereoscopic Video Converter operates to automatically synthesize a stereo video sequence from a monoscopic video.
- the SVC provides a video conversion solution that does not require multiple viewpoints or non-moving scene objects, and is even applicable to single images.
- the SVC also provides a temporally coherent image-warping technique for rendering new viewpoints given an image and corresponding depth (i.e., depth image based rendering (DIBR)) that preserves high-frequency and highly salient video content and maintains perceptually consistent depth between successive image frames.
- the SVC provides various techniques for automatically converting arbitrary 2D video sequences into perceptually plausible stereoscopic or 3D versions based on estimations of dense depth maps for every frame of the video sequence.
- the processes summarized above are illustrated by the general system diagram of FIG. 1 .
- the system diagram of FIG. 1 illustrates the interrelationships between program modules for implementing various embodiments of the SVC, as described herein.
- the system diagram of FIG. 1 illustrates a high-level view of various embodiments of the SVC
- FIG. 1 is not intended to provide an exhaustive or complete illustration of every possible embodiment of the SVC as described throughout this document.
- any boxes and interconnections between boxes that may be represented by broken or dashed lines in FIG. 1 represent alternate embodiments of the SVC described herein, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
- the processes enabled by the SVC begin operation by using an image input module 100 to receive an input image or video 105 .
- the input image or video 105 is either pre-recorded, or is recorded or captured using a conventional image or video capture device 110 .
- a user interface module 115 can be used to select from a library or list of the images or videos 105 or to select an input image from the image or video capture device 110 .
- the image input module then provides an input frame 120 (or sequential input frames in the case of a video) to a depth estimation module 125 .
- the depth estimation module 125 estimates or infers depth in the input frame 120 in combination with a database prior that represents an average per-pixel depth across some or all of the images in the set of images and/or videos with depth information 130 .
- the resulting inferred depth 135 represents a plausible, though not exact, estimate of the depth of the pixels of the input frame 120 . See FIG. 2 and the corresponding discussion below for a more detailed discussion of how the inferred depth 135 is produced by the depth estimation module 125 .
- both the inferred depth and the input frame are provided to a view synthesis module 140 .
- the view synthesis module 140 synthesizes a right view 145 (assuming that the input frame 120 is the left view 150 ) from the input frame and the inferred depth 135 for the input frame.
- the SVC reduces visible artifacts by synthesizing both the right view 145 and the left view 150 from the input frame 120 and the corresponding inferred depth 135 .
- the depth estimation process performed by the depth estimation module 125 is coupled with the view synthesis process performed by the view synthesis module 140 .
- This coupling enables an iterative process wherein, following the initial depth estimation to generate the inferred depth 135 , the output of the view synthesis module 140 is fed back to the depth estimation module 125 for use in the generation of a new inferred depth 135 that in turn seeds a new view synthesis iteration.
- This looping process between depth estimation and view synthesis then repeats until either a fixed or user adjustable number of iterations have been performed, or until the depth estimation module 125 converges onto an inferred depth 135 that changes less than some predetermined threshold.
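The feedback loop between depth estimation and view synthesis can be sketched as follows. This is only an outline of the control flow: `estimate_depth` and `synthesize_view` are hypothetical callables standing in for modules 125 and 140, and the default iteration cap and threshold are illustrative values, not figures from the patent.

```python
import numpy as np

def refine_depth(frame, estimate_depth, synthesize_view, max_iters=10, tol=1e-3):
    """Alternate depth estimation and view synthesis until convergence.

    estimate_depth(frame, view) -> depth map (view is None on the first call)
    synthesize_view(frame, depth) -> synthesized view
    Stops after max_iters iterations or once the largest per-pixel depth
    change falls below tol. Returns the depth and the iterations used.
    """
    depth = estimate_depth(frame, None)  # initial inferred depth
    iteration = 0
    for iteration in range(1, max_iters + 1):
        view = synthesize_view(frame, depth)     # depth seeds view synthesis
        new_depth = estimate_depth(frame, view)  # view guides the next depth
        change = float(np.max(np.abs(new_depth - depth)))
        depth = new_depth
        if change < tol:
            break
    return depth, iteration
```

Both stopping rules named in the text appear here: the fixed (or user-adjustable) iteration count is `max_iters`, and the convergence test is the comparison against `tol`.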
- the resulting right view 145 and left view 150 are then provided to a stereoscopic view generation module 155 that constructs a visually plausible “3D” or stereoscopic view for each input frame.
- typical formats for stereoscopic views include anaglyphs 160 and various interlaced stereo formats.
- the stereoscopic view generation module 155 can be implemented to output any conventional stereoscopic format.
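As one example of an output format, a red-cyan anaglyph can be composed from the two views by taking the red channel from the left view and the green and blue channels from the right view. The sketch below assumes both views are RGB arrays of identical shape with channels in the last axis; the function name is illustrative.

```python
import numpy as np

def make_anaglyph(left_rgb, right_rgb):
    """Compose a red-cyan anaglyph from left and right RGB views.

    Red comes from the left view; green and blue come from the right view.
    """
    if left_rgb.shape != right_rgb.shape:
        raise ValueError("left and right views must have the same shape")
    anaglyph = right_rgb.copy()           # keeps green and blue of the right view
    anaglyph[..., 0] = left_rgb[..., 0]   # overwrite red with the left view
    return anaglyph
```

Other formats, such as interlaced stereo, would instead interleave rows or columns of the two views.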
- FIG. 2 expands upon the depth estimation module 125 of FIG. 1 .
- the system diagram of FIG. 2 illustrates the interrelationships between program modules for implementing the depth estimation module and further illustrates various embodiments of the SVC, as described herein.
- while the system diagram of FIG. 2 illustrates a high-level view of various embodiments of the SVC, FIG. 2 is not intended to provide an exhaustive or complete illustration of every possible embodiment of the SVC as described throughout this document.
- the depth estimation module 125 receives inputs that include the input frame 120 and the set of images and/or videos with depth information 130 . These inputs are then processed to generate the inferred depth 135 for the input frame 120 as discussed in specific detail herein in Section 2.2 and Section 2.3.
- the depth estimation module 125 first processes the set of images and/or videos with depth information 130 using a depth prior estimation module 205 that creates a database prior 210 (also referred to below as “$E_{\text{prior}}$”) that represents a per-pixel average or mean depth across all images in the set of images and/or videos with depth information. Note also that subsets of images in the set of images and/or videos with depth information 130 can be used to create the database prior 210.
- similar scenes such as, for example, images or videos of houses surrounded by trees can be used to create a specific database prior that will then be used in combination with similar input frames 120 to generate the corresponding inferred depth 135 .
- the depth prior estimation module 205 creates one or more database priors 210 from the set of images and/or videos with depth information 130 as a pre-processing step. Those database priors 210 are then stored and made available for use as an initial estimate or prior for generating the inferred depth 135 , as discussed in further detail below.
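Computing a database prior as a per-pixel mean is straightforward once the depth maps share a resolution. The sketch below assumes the depth maps have already been resized to a common shape; the function name is illustrative.

```python
import numpy as np

def compute_database_prior(depth_maps):
    """Return the per-pixel mean depth over a collection of depth maps.

    depth_maps: iterable of 2D arrays, all with the same shape. Passing a
    subset of similar scenes yields a scene-specific prior.
    """
    stack = np.stack([np.asarray(d, dtype=float) for d in depth_maps], axis=0)
    return stack.mean(axis=0)
```

Because the prior depends only on the database, it can be computed once as a pre-processing step and reused for every input frame.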
- the depth transfer approach provided by the SVC operates in three stages, as illustrated by FIG. 2 .
- the SVC uses a similarity query module 220 to find a candidate set 225 of images that are similar to the input frame. Note that a maximum number of candidates can be either pre-specified or set by the user.
- This candidate set 225, also referred to herein as “candidate images,” represents images in the database that are “similar” to the current input frame being processed.
- a warping module 230 applies a warping procedure (e.g., SIFT Flow, etc.) to the candidate images and corresponding depths to align each of the candidates with the input frame 120 to create a set of warped candidates 235 .
- a depth optimization module 215 receives as input the database prior 210, the input frame 120, and the set of warped candidates 235. Given these inputs, the depth optimization module 215 then performs an iterative optimization process that intelligently interpolates and smooths the depth values of the warped candidates 235 to produce the inferred depth 135 for the pixels of the input image. Note also that in the case of video inputs with successive image frames that are closely related, a video extension module 240 is used in combination with the depth optimization module 215 to ensure perceptual continuity of the inferred depth information between successive image frames of the video sequence. Specific details of the video extension process for estimating depth in video sequences are provided in Section 2.2.4.
- the following sections provide a detailed discussion of the operation of various embodiments of the SVC, and of exemplary methods for implementing the program modules described in Section 1 with respect to FIG. 1 and FIG. 2 .
- the following sections provide examples and operational details of various embodiments of the SVC, including: an operational overview of the SVC; non-parametric depth estimation; depth inference implementation details; automatic stereoscopic view synthesis; and various extensions and additional embodiments of the SVC.
- the SVC-based processes described herein provide various techniques for automatically converting arbitrary 2D video sequences into perceptually plausible stereoscopic or 3D versions based on estimations of dense depth maps for every frame of the video sequence.
- these processes infer a stereoscopic video from a monocular video.
- left views could be inferred from right views as input, if desired.
- the SVC uses an algorithm that poses each stage of the process as a continuous optimization problem.
- the posterior P(D|L) is formally defined in combination with a discussion of how to maximize this probability (i.e., depth inference) using a non-parametric, label transfer approach, with a discussion of the implementation details for this process provided in Section 2.3.
- maximization of P(R|D, L) (i.e., right view synthesis assuming a left view input) is then described in Section 2.4.
- the SVC provides various techniques for using non-parametric learning techniques to infer depth from a video via a process that imposes no requirements on the video, such as motion parallax or sequence length, and can even be applied to a single image.
- Section 2.2.1 first describes depth estimation as it applies to single images, followed by a discussion of how these depth estimation techniques extend to videos (see Section 2.2.4) in a manner that allows the SVC to provide coherent depth estimation over time for the input video sequence.
- the basic pipeline for estimating depth from a single image begins with finding matching candidates for the input image frame from the set or database of images and/or videos with depth information (see Section 2.2.1). The candidates in this set are then warped to match the structure of the input image (see Section 2.2.2). The SVC then uses a global optimization procedure to interpolate the warped candidates (Equation 2), producing per-pixel depth estimates for the input image. An extension to this process for handling video sequences is then described in Section 2.2.4.
- arbitrary scenes can be semantically labeled using a non-parametric learning paradigm.
- for example, given an unlabeled input image and a database with known per-pixel labels (e.g., sky, car, tree, window, etc.), conventional labeling techniques operate to select similar scenes from the database and to intelligently transfer the known labels to the input image based on SIFT features.
- SIFT scale-invariant feature transforms
- the SVC leverages such techniques to provide the stereoscopic rendering techniques described herein by showing how label transfer can be applicable in domains other than semantic labeling.
- the SVC uses a database of existing images having depth information.
- the pixel information for images in the database is discussed herein using notation such as “RGBD”, where the notation RGBD indicates an image with 4 channels: three standard color channels of the RGB space, as well as depth (D).
- D depth
- the SVC is not limited to the use of the RGB color space, and any desired color space can be used given the techniques described herein. A wide variety of techniques can be used to construct this database.
- consumer-grade image capture devices or range scanners such as the Microsoft® Kinect™ were used to capture images and video sequences with integral depth information for creating the database of images with depth information.
- many existing databases of images with depth information are currently available from a variety of sources.
- the depth transfer approach provided by the SVC operates in three stages, as illustrated by FIG. 2 , as discussed above.
- the SVC finds “candidate images” in the database, which are “similar” to the input image (from the input video sequence).
- a matching procedure (e.g., SIFT Flow, etc.) is then applied to align each candidate image and its depth with the input image.
- an optimization procedure is used to intelligently interpolate and smooth the warped candidate depth values, which results in the inferred depth for the pixels of the input image.
- the SVC operates on the assumption that scenes with similar semantics and depths will have roughly similar per-pixel depth values when densely aligned. Of course, not all of these initial estimates will be correct. Consequently, as noted above, the SVC finds multiple candidate images for each input image, and then uses these multiple candidates to refine and interpolate the initial estimates using a global optimization process that infers a plausible per-pixel depth estimate for the input frame.
- the SVC computes high level image features for each image or frame of video in the database as well as the input image.
- the SVC uses GIST and optical flow based techniques for computing these high-level features.
- GIST techniques provide a collection of image features that summarize the most important characteristics of an input image.
- the SVC selects the top K matching frames from the database.
- the SVC ensures that each video in the database contributes no more than one matching frame for any particular input frame. This embodiment forces matching images to be from differing viewpoints, allowing for greater variety among matches.
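Candidate selection with the one-frame-per-video constraint can be sketched as a greedy pass over the database frames sorted by feature distance. The Euclidean distance, the feature layout (one row per database frame), and the function name are assumptions for illustration; the patent specifies only that GIST and optical-flow-based features are compared.

```python
import numpy as np

def select_candidates(query_feat, db_feats, video_ids, k):
    """Pick up to k nearest database frames, at most one per source video.

    query_feat: 1D feature vector of the input frame
    db_feats:   2D array with one feature vector per database frame
    video_ids:  sequence giving the source video of each database frame
    Returns the selected database indices, nearest first.
    """
    dists = np.linalg.norm(np.asarray(db_feats) - np.asarray(query_feat), axis=1)
    chosen, seen_videos = [], set()
    for idx in np.argsort(dists):
        if video_ids[idx] in seen_videos:
            continue  # this video already contributed a closer frame
        chosen.append(int(idx))
        seen_videos.add(video_ids[idx])
        if len(chosen) == k:
            break
    return chosen
```

Skipping additional frames from an already-used video is what pushes the candidate set toward differing viewpoints.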
- these matching images from the database are referred to as candidate images, and their corresponding depths are referred to as candidate depths.
- because the candidate images from the database are selected to match the input image closely in feature space (which includes GIST features), it is assumed that the overall semantics of the scene are roughly similar, and that the distribution of depth is comparable among the input and candidates. However, ideally, a direct pixel-to-pixel correspondence between the input and all candidates is desired, so as to limit the search space when inferring depth from the candidates.
- This pixel-to-pixel correspondence is achieved through SIFT flow, which matches per-pixel SIFT features to estimate dense scene alignment.
- using SIFT flow, the SVC estimates warping functions $\psi_i$, $i \in \{1, \ldots, K\}$, for each candidate image.
- these warping functions map pixel locations from a given candidate's domain to pixel locations in the input's domain. Note that the warping functions can be multivalued (one-to-many), and are not necessarily surjective.
- SIFT flow warping is performed by first calculating SIFT features for each pixel in two input images. These features are matched in a one-to-many fashion. This matching defines an operator $\psi$, which when applied to the first input image, maps pixel locations to their corresponding matching locations in the second input image. This can be used, for example, to achieve dense scene alignment.
- each warped candidate depth is assumed to provide a rough approximation to the depth of the input.
- semantically similar candidate images and dense scene alignment can be obtained through SIFT flow.
- although the warped depth could be used as the final inferred depth, if desired, the warped candidates may still contain inaccuracies and are often not spatially smooth. Consequently, more accurate depth estimates are achieved by the SVC by employing an optimization process that considers all of the warped candidates, and then uses this information to synthesize the most likely depth for the input image.
- the SVC acts to minimize the negative log-posterior, −log P(D|L).
- the data term is defined as:
- $w_i^{(j)}$ is a confidence measure of the accuracy of the $j$-th candidate's warped depth at pixel $i$ (see Section 2.3 for a further discussion of this point) and $K$ is the total number of candidates.
- the SVC measures not only absolute differences, but also relative depth changes, i.e., gradient depth.
- $E_{\text{prior}}(D_i) = \phi(D_i - \mathcal{P}_i)$ Equation (5), where the prior, $\mathcal{P}$, is computed by averaging all depth images in the database (or sets of similar image frames in the database).
- the above-described optimization has three parameters, $\alpha$, $\beta$, and $\gamma$, which control the weightings of spatial smoothness, prior, and gradient depth, respectively.
- the number of candidate images is controlled by K.
- the depth estimation framework can also infer temporally consistent depth for sequences of frames (videos) by modifying the matching and optimization procedures discussed in Section 2.2.
- the SVC performs the same candidate matching and warping scheme as in the single image case for each frame of the video sequence, and then augments the original objective function (Equation 2) with temporal information to improve depth estimation and to maintain coherence throughout the sequence.
- the objective function becomes:
- $E_{\text{video}}(D) = E(D) + \sum_{i \in \text{pixels}} \nu E_{\text{coherence}}(D_i) + \eta E_{\text{motion}}(D_i)$ Equation (6)
- E coherence achieves temporal smoothness
- E motion uses motion cues to better estimate the depth of moving objects.
- Temporal coherence is modeled by first computing per-pixel optical flow for each pair of consecutive frames in the video using conventional optical flow-based techniques.
- optical flow defines a mapping from one frame to the next, indicating where pixels map to in the subsequent frame.
- the SVC uses a motion segmentation technique (as discussed in further detail in Section 2.3) to detect moving objects in the video, and constrains the depth of these objects to be consistent with the point at which the moving objects contact the floor (or ground, road, etc.).
- m a binary motion segmentation mask
- the SVC minimizes the objective function using iteratively reweighted least squares, which allows to be recomputed prior to each iteration.
- any depth information estimated or inferred for input image frames is user editable or user adjustable.
- the SVC uses a combination of GIST features and features derived from optical flow.
- Equation (3) is used to ensure that the inferred depth is similar to each of the K warped candidate depths. However, some of the candidate depth values will be more reliable than others, and this reliability is modeled with a confidence weighting for each pixel in each candidate image (e.g., w i (j) is the weight of the i th pixel from the i th candidate image). These weights are computed by comparing per-pixel SIFT descriptors, obtained during the SIFT flow computation, of both the input image and the candidate images:
- $w_i^{(j)} = 1 - \frac{1}{1 + e^{-10\left(\left\| S_i - \psi_j\left(S_i^{(j)}\right) \right\| - 0.5\right)}}$ Equation (10)
- $S_i$ and $S_i^{(j)}$ are the SIFT feature vectors at pixel i in the input image and in candidate image j, respectively. Note that the candidate image's SIFT features are computed first, and then warped using the warping function ($\psi_j$) calculated with SIFT flow.
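A minimal sketch of the per-pixel confidence weight of Equation (10) follows. It assumes the two descriptors are already available as plain sequences of floats and that the candidate descriptor has already been warped; the function name `confidence_weight` is hypothetical.

```python
import math

def confidence_weight(sift_input, sift_candidate_warped):
    """Equation (10)-style weight: near 1 when descriptors agree, near 0 when they differ."""
    # Euclidean distance between the two descriptors at one pixel.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sift_input, sift_candidate_warped)))
    # 1 - 1 / (1 + exp(-10 * (dist - 0.5)))
    return 1.0 - 1.0 / (1.0 + math.exp(-10.0 * (dist - 0.5)))

w_same = confidence_weight([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # identical descriptors
w_mid = confidence_weight([0.0, 0.0], [0.3, 0.4])             # distance of 0.5
w_far = confidence_weight([0.0, 0.0], [3.0, 4.0])             # distance of 5.0
```

The weight is a soft threshold on descriptor distance: it falls from about 0.99 for identical descriptors, through 0.5 at a distance of 0.5, toward zero for poorly matched pixels.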
- the SVC uses a spatial regularization term for optimization.
- the smoothness is not applied uniformly to the inferred depth, as there is some relation between image appearance and depth. Therefore, it is assumed that regions in the image with similar texture are likely to have similar, smooth depth transitions, and that discontinuities in the image are likely to correspond to discontinuities in depth. This assumption is enforced with a per-pixel weighting of the spatial regularization term such that this weight is large where the image gradients are small, and vice-versa, as in Equation (4).
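The sketch below illustrates one way such gradient-dependent weights could be computed with a sigmoidal soft threshold. The threshold value of 0.05, the forward-difference gradient, and the function name `smoothness_weights` are assumptions for illustration only; the exact constants used by the SVC are not reproduced here.

```python
import numpy as np

def smoothness_weights(gray, threshold=0.05, sharpness=100.0):
    """Per-pixel weights that are large in flat regions and small across image edges."""
    # Forward differences; the last column/row is padded so shapes match the image.
    gx = np.abs(np.diff(gray, axis=1, append=gray[:, -1:]))
    gy = np.abs(np.diff(gray, axis=0, append=gray[-1:, :]))
    # Soft threshold: close to 1 below the threshold, close to 0 above it.
    sx = 1.0 - 1.0 / (1.0 + np.exp(-sharpness * (gx - threshold)))
    sy = 1.0 - 1.0 / (1.0 + np.exp(-sharpness * (gy - threshold)))
    return sx, sy

# Toy image in [0, 1] with a single vertical edge between columns 2 and 3.
gray = np.zeros((4, 6))
gray[:, 3:] = 1.0
sx, sy = smoothness_weights(gray)
```

In this toy image the horizontal weight collapses to nearly zero at the edge, so a depth discontinuity there is barely penalized, while flat regions keep a weight near one.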
- the SVC is generally concerned with videos that come from a static viewpoint, and leverages this fact to detect and segment moving objects in the scene.
- the SVC finds the darkest (in terms of intensity) image in the sequence, and performs image-histogram equalization on all other frames in the video. The darkest image is used as the reference because using a brighter reference would enhance pixel noise in the dark images.
- the SVC computes the optical flow for each pair of consecutive frames, and estimates the background image (B i ) by taking a weighted average (based on flow) of the input sequence, as illustrated by Equation (11):
- $L_{i,k}$ and $f_{i,k}$ are, respectively, the i-th pixel of the k-th frame of the input video and the corresponding flow weight.
- other background detection techniques such as median filtering can be used to detect the background. However, it has been observed that median filtering produces results that are not quite as good.
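The following sketch contrasts a flow-weighted average with a per-pixel median for background estimation. The normalized weighted average is an assumed reading of "weighted average (based on flow)", and the particular weight used in the toy data, `exp(-|flow|^2)`, is purely hypothetical.

```python
import numpy as np

def estimate_background(frames, flow_weights, eps=1e-8):
    """Per-pixel weighted average over time; frames and weights have shape (K, H, W)."""
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(flow_weights, dtype=float)
    return (weights * frames).sum(axis=0) / (weights.sum(axis=0) + eps)

def median_background(frames):
    """Alternative mentioned in the text: per-pixel temporal median."""
    return np.median(np.asarray(frames, dtype=float), axis=0)

# Toy sequence: constant background of 0.2 with a bright object moving along row 0.
K, H, W = 5, 3, 5
frames = np.full((K, H, W), 0.2)
flow_mag = np.zeros((K, H, W))
for k in range(K):
    frames[k, 0, k] = 1.0     # moving object
    flow_mag[k, 0, k] = 3.0   # large flow where the object is
weights = np.exp(-flow_mag ** 2)   # hypothetical: down-weight pixels with large flow
background = estimate_background(frames, weights)
median_bg = median_background(frames)
```

Both estimators recover the static background in this simple case; they can differ when an object covers a pixel for a large share of the sequence.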
- the SVC computes the probability that a given pixel is “in motion” by testing its relative difference from the background, weighted by the magnitude of the flow, and thresholding the following probability:
- the resulting segmentation mask is used in Equation (8) to improve depth estimates for moving objects in the above-described optimization.
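A hedged sketch of the motion test described above: the relative difference from the background is scaled by the flow magnitude and thresholded. The product form of the score and the function name `motion_mask` are assumptions; only the ingredients (relative difference, flow magnitude, a threshold of about 0.01) come from the text.

```python
import numpy as np

def motion_mask(frame, background, flow_mag, t=0.01, eps=1e-8):
    """Binary mask that is True where a pixel is judged to be in motion."""
    # Relative difference avoids biasing the test toward brighter pixels.
    relative_diff = np.abs(frame - background) / (background + eps)
    score = flow_mag * relative_diff   # assumed combination of the two cues
    return score > t

background = np.full((3, 3), 0.2)
frame = background.copy()
flow_mag = np.zeros((3, 3))
frame[1, 1] = 1.0      # differs from background and is moving
flow_mag[1, 1] = 3.0
frame[0, 0] = 0.8      # differs from background but has no flow
mask = motion_mask(frame, background, flow_mag)
```

Under this formulation a pixel is flagged only when it both differs from the background and has non-negligible flow, which is why the stationary bright pixel in the toy data is left out of the mask.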
- the SVC uses iteratively reweighted least squares (IRLS).
- IRLS works by approximating the objective by a linear function of the parameters, and solving the system by minimizing the squared residual (e.g., with least squares).
- the size of this system can be very large, although this system will be sparse because of the limited number of pairwise interactions in the optimization. Still, given modern hardware limitations, these types of systems are not directly solvable. Consequently, the SVC uses an iterative method to solve the least squares system at each iteration of the IRLS procedure. In various tested embodiments, a preconditioned conjugate gradient (PCG) process was used for this purpose.
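To make the IRLS-with-PCG loop concrete, the sketch below applies it to a small one-dimensional problem: a robust data term that keeps the result close to a candidate depth signal, plus a robust smoothness term on neighboring differences. The robust function `sqrt(x^2 + eps)`, the Jacobi preconditioner, the 1-D setting, and all function names are assumptions for this example and are not taken from the description.

```python
import numpy as np

def pcg(apply_A, b, diag, x0, iters=200, tol=1e-10):
    """Jacobi-preconditioned conjugate gradient for a symmetric positive-definite system."""
    x = x0.copy()
    r = b - apply_A(x)
    z = r / diag
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        if np.sqrt(r @ r) < tol:
            break
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def energy(d, candidate, lam=1.0, eps=1e-6):
    """Robust objective: data term plus smoothness term (assumed form)."""
    data = np.sum(np.sqrt((d - candidate) ** 2 + eps))
    smooth = np.sum(np.sqrt(np.diff(d) ** 2 + eps))
    return data + lam * smooth

def irls_depth(candidate, lam=1.0, eps=1e-6, outer_iters=20):
    d = candidate.copy()
    for _ in range(outer_iters):
        # Reweighting step: each robust term becomes a weighted quadratic.
        w_data = 1.0 / np.sqrt((d - candidate) ** 2 + eps)
        w_smooth = 1.0 / np.sqrt(np.diff(d) ** 2 + eps)

        def apply_A(x):
            # (W_data + lam * G^T W_smooth G) x, applied without forming the matrix.
            out = w_data * x
            g = lam * w_smooth * np.diff(x)
            out[:-1] -= g
            out[1:] += g
            return out

        diag = w_data.copy()
        diag[:-1] += lam * w_smooth
        diag[1:] += lam * w_smooth
        # Inner solve of the sparse least squares system with PCG.
        d = pcg(apply_A, w_data * candidate, diag, d)
    return d

candidate = np.array([1, 1, 1, 5, 1, 1, 2, 2, 2, 2], dtype=float)
result = irls_depth(candidate)
```

The operator is applied matrix-free, which mirrors the point made above: the system is sparse because each unknown interacts only with its neighbors, so an iterative solver never needs the full matrix.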
- because the SVC uses iterative optimization, starting from a good initial estimate is helpful for quick convergence with fewer iterations. It has been observed that initializing with some function of the warped candidate depths provides a reasonable starting point. Therefore, in various tested embodiments, the median value (per-pixel) of all candidate depths was used for initialization purposes, though it should be understood that other initializations may also be used without departing from the intended scope of the ideas described herein.
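The per-pixel median initialization can be sketched in a few lines, assuming the K warped candidate depths are stacked into an array of shape (K, H, W). The function name `initialize_depth` is hypothetical.

```python
import numpy as np

def initialize_depth(warped_candidates):
    """Per-pixel median over the K warped candidate depth maps."""
    return np.median(np.asarray(warped_candidates, dtype=float), axis=0)

candidates = [
    [[1, 2], [3, 4]],
    [[2, 2], [9, 4]],
    [[3, 8], [3, 0]],
]
initial_depth = initialize_depth(candidates)
```

The median is a natural choice here because a single badly warped candidate at a pixel does not pull the starting value far from the other candidates.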
- the SVC can perform depth image based rendering (DIBR) to synthesize a new view for each frame for creating a stereoscopic display for each frame.
- a typical strategy for DIBR is to simply reproject pixels based on depth values to a new, synthetic camera view.
- Such methods are susceptible to large “holes” at disocclusions. Much work has been done to fill these holes using conventional techniques, but visual artifacts still remain in the case of general scenes.
- the SVC uses a new extension of a recent DIBR technique that was previously applicable only to single images; the technique uses image warping to overcome problems such as disocclusions and hole filling.
- This conventional DIBR technique is based on the general idea that people are less perceptive to errors in low saliency regions, and thus disocclusions are covered by “stretching” the input image where people are less likely to notice artifacts.
- this conventional DIBR technique receives a single image and per-pixel disparity values as input, and intelligently warps the input image based on the disparity such that highly salient regions remain unmodified.
- FIG. 3 provides an exemplary architectural flow diagram that expands upon the view synthesis module 140 of FIG. 1 and that further illustrates program modules for implementing various embodiments of the SVC, as described herein.
- the inferred depth 135 is provided to a depth inversion module 320 component of the view synthesis module 140 .
- the depth inversion module inverts the inferred depth 135 to produce a disparity 325 for the input frame 120 .
- the input frame 120 is provided to a saliency/edge detection module 310 component of the view synthesis module 140 .
- the saliency/edge detection module 310 processes the input frame 120 to compute saliency-preserving smoothness weights 315 . These weights enable the SVC to maintain spatial and temporal coherence while also ensuring that highly salient regions remain perceptually intact during the view warping performed by a view synthesis optimization module 330 component of the view synthesis module 140 .
- video sequences processed by the view synthesis module 140 maintain spatial and temporal coherence in the synthesized right view 145 and left view 150 .
- the input frame 120 can be used as the left view 150 .
- the SVC first inverts the depth to convert it to disparity, and then scales the disparity by the maximum disparity value:
- $W_0 = \frac{W_{max}}{D + \epsilon}$ Equation (13)
- W max is an adjustable parameter that modulates how much objects “pop-out” from the screen when viewed with a stereoscopic device. Increasing the value of W max enhances the “3D” effect, but can also cause eye strain or problems with fusing the stereo images if set too high.
- the SVC minimizes the second term of the log posterior, modeled as:
- the Qsmooth term contains the same terms as in the spatial and temporal smoothness functions provided above (see Equations (4) and (7), respectively), and λ and μ control the weighting of these smoothness terms in the optimization. Note that in tested embodiments, both λ and μ were set to a value of 10, although these values can be adjusted, if desired. Given this formulation, the SVC ensures spatial and temporal coherence while also ensuring that highly salient regions remain perceptually intact during view warping.
- the SVC divides the disparities by two (i.e.,
- the SVC uses these halved values to render the input frame(s) into two new views (corresponding to the stereo left and right views), an approach that generally produces a larger number of smaller artifacts.
- the SVC can also use other methods, such as only rendering one new frame with larger disparities while using the input frame as the other half of the stereo pair; however, such techniques tend to produce relatively fewer, but larger, artifacts. In general, people tend to be less perceptive of many small artifacts when compared with few large artifacts.
- the SVC uses a conventional anisotropic pixel splatting method that “splats” input pixels into the new view (based on W) as weighted, anisotropic Gaussian blobs.
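The idea of rendering two views from halved disparities can be illustrated with a much simpler renderer than anisotropic Gaussian splatting. The sketch below substitutes a nearest-pixel forward warp, ignores occlusion ordering, and assumes a sign convention (left view shifted by +W/2, right view by -W/2); all names are hypothetical.

```python
import numpy as np

def forward_warp(image, shift):
    """Move each pixel horizontally by its rounded shift; unfilled pixels stay zero."""
    h, w = shift.shape
    out = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    cols = np.arange(w)
    for y in range(h):
        target = np.clip(np.rint(cols + shift[y]).astype(int), 0, w - 1)
        out[y, target] = image[y, cols]
        filled[y, target] = True
    return out, filled

def render_stereo_pair(image, disparity):
    """Split the disparity evenly between two synthesized views."""
    half = disparity / 2.0
    left, left_filled = forward_warp(image, half)     # assumed sign convention
    right, right_filled = forward_warp(image, -half)
    return left, right, left_filled, right_filled

image = np.array([[10.0, 20.0, 30.0, 40.0, 50.0]])
disparity = np.full((1, 5), 2.0)
left, right, left_filled, right_filled = render_stereo_pair(image, disparity)
```

The `filled` masks show where holes open up at the image borders or at disocclusions. Because each view moves by only half the disparity, those holes are smaller than they would be if a single view carried the full shift.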
- the SVC can convert to any desired 3D viewing format, including, but not limited to, anaglyph or interlaced stereo formats.
- the SVC used the anaglyph format since cyan/red anaglyph glasses are currently more widespread than polarized/autostereoscopic displays (used with interlaced 3D images).
- the SVC also shifts the left and the right images such that the nearest object has zero disparity, making the nearest object appear at the display surface, and all other objects appear behind the display. This is known as the “window” metaphor.
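A short sketch of the two steps just described: offsetting disparities so the nearest object sits at the display surface, and composing a red/cyan anaglyph. It assumes that larger disparity means nearer, that images are stored in RGB channel order, and that the red channel comes from the left view; the function names are hypothetical.

```python
import numpy as np

def window_disparity(disparity):
    """Offset disparities so the nearest point (largest value, by assumption) is at zero."""
    return disparity - disparity.max()

def make_anaglyph(left_rgb, right_rgb):
    """Red channel from the left view, green and blue channels from the right view."""
    out = right_rgb.copy()
    out[..., 0] = left_rgb[..., 0]
    return out

disparity = np.array([[1.0, 4.0], [2.0, 3.0]])
shifted = window_disparity(disparity)

left = np.zeros((2, 2, 3), dtype=np.uint8)
right = np.zeros((2, 2, 3), dtype=np.uint8)
left[...] = [200, 10, 20]
right[...] = [30, 120, 220]
anaglyph = make_anaglyph(left, right)
```

After the offset every disparity is zero or negative, which corresponds to all content appearing at or behind the display under the stated convention.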
- saliency-preserving smoothness weights are used to maintain spatial and temporal coherence while also ensuring that highly salient regions remain perceptually intact during the view warping.
- the SVC instead uses texture patches as “image priors” that are extracted from the input frame or frames of the input video sequence to maintain coherence between frames during view synthesis.
- this additional embodiment joins together the depth estimation and view synthesis stages described above, and jointly maximizes the probabilities in Equation (1) rather than computing these probabilities separately. More specifically, this embodiment uses texture priors extracted from each frame of the input video sequence to ensure that the synthesized views do not contain regions that differ significantly from regions that are observed in the input video. Note that either or both a texture patch size and a maximum number of texture patches can be either pre-specified or set by the user.
- FIG. 4 provides an exemplary architectural flow diagram that expands upon the view synthesis module 140 of FIG. 1 and that further illustrates program modules for implementing various embodiments of the SVC, as described herein.
- each input frame 120 of the input video sequence is processed by a texture patch extraction module 410 component of the view synthesis module 140 to generate a database of texture priors, referred to herein as a database of image texture patches 415 .
- This database is created by sampling small regions of each input frame 120 .
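One simple way to build such a patch database is uniform random sampling of fixed-size windows, sketched below. Random sampling, the default values, and the function name `extract_texture_patches` are assumptions; the text only states that small regions are sampled and that the patch size and patch count are configurable.

```python
import numpy as np

def extract_texture_patches(frame, patch_size=8, max_patches=100, seed=0):
    """Sample square patches at random positions from one frame."""
    rng = np.random.default_rng(seed)
    h, w = frame.shape[:2]
    # Top-left corners chosen so every patch lies fully inside the frame.
    ys = rng.integers(0, h - patch_size + 1, size=max_patches)
    xs = rng.integers(0, w - patch_size + 1, size=max_patches)
    return [frame[y:y + patch_size, x:x + patch_size].copy() for y, x in zip(ys, xs)]

frame = np.random.default_rng(1).random((20, 30, 3))
patches = extract_texture_patches(frame, patch_size=4, max_patches=5)
```

Calling this once per input frame and concatenating the results would give a per-video database of texture priors.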
- the inferred depth 135 is provided to the depth inversion module 320 component of the view synthesis module 140 .
- the depth inversion module inverts the inferred depth 135 to produce the disparity 325 for the input frame 120 .
- the database of image texture patches 415 and the input frame 120 are provided to a texture-based view synthesis optimization module 430 component of the view synthesis module 140 .
- video sequences processed by the view synthesis module 140 maintain coherence in the synthesized right view 145 and left view 150 relative to the original input frames.
- the SVC uses structure from motion (SfM) to compute robust (albeit sparse) depth estimates.
- SfM depths are then constrained to be unchanged during the optimization. Knowing even a few accurate depth measurements prior to optimization can greatly improve the entirety of the inferred depth.
- the SVC uses global motion estimation to improve motion segmentation for providing depth estimates for moving objects.
- Global motion estimation can be done using point tracking and homography estimation (e.g., using RANdom SAmple Consensus, or RANSAC).
- the SVC assumed that depth discontinuities occur at image edges, and thus the smoothness weights in the depth optimization (i.e., Equation (4), s x ,s y ) are functions of image edges. However, it is more likely that depth discontinuities occur along occlusion boundaries. Consequently, in various embodiments, the SVC incorporates conventional estimates of occlusion boundaries in the smoothness weights.
- FIG. 5 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of the SVC, as described herein, may be implemented. It should be noted that any boxes that are represented by broken or dashed lines in FIG. 5 represent alternate embodiments of the simplified computing device, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
- FIG. 5 shows a general system diagram showing a simplified computing device such as computer 500 .
- Such computing devices can typically be found in devices having at least some minimum computational capability, including, but not limited to, personal computers, server computers, hand-held computing devices, laptop or mobile computers, communications devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, etc.
- the device should have a sufficient computational capability and system memory to enable basic computational operations.
- the computational capability is generally illustrated by one or more processing unit(s) 510 , and may also include one or more GPUs 515 , either or both in communication with system memory 520 .
- the processing unit(s) 510 of the general computing device may be specialized microprocessors, such as a DSP, a VLIW, or other micro-controller, or can be conventional CPUs having one or more processing cores, including specialized GPU-based cores in a multi-core CPU.
- the simplified computing device of FIG. 5 may also include other components, such as, for example, a communications interface 530 .
- the simplified computing device of FIG. 5 may also include one or more conventional computer input devices 540 (e.g., pointing devices, keyboards, audio input devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, etc.).
- the simplified computing device of FIG. 5 may also include other optional components, such as, for example, one or more conventional computer output devices 550 (e.g., display device(s) 555 , audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, etc.).
- typical communications interfaces 530 , input devices 540 , output devices 550 , and storage devices 560 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.
- the simplified computing device of FIG. 5 may also include a variety of computer readable media.
- Computer readable media can be any available media that can be accessed by computer 500 via storage devices 560 and includes both volatile and nonvolatile media, whether removable 570 or non-removable 580, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data.
- Computer readable media may comprise computer storage media and communication media.
- Computer storage media includes, but is not limited to, computer or machine readable media or storage devices such as DVDs, CDs, floppy disks, tape drives, hard drives, optical drives, solid state memory devices, RAM, ROM, EEPROM, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage, or other magnetic storage devices, or any other device which can be used to store the desired information and which can be accessed by one or more computing devices.
- the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
- software, programs, and/or computer program products embodying some or all of the various embodiments of the SVC described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
- the SVC may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device.
- program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
- the embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks.
- program modules may be located in both local and remote computer storage media including media storage devices.
- the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
- Processing Or Creating Images (AREA)
Abstract
Description
P(R,D|L)=P(D|L)P(R|D,L) Equation (1)
which allows the SVC to model the distribution of depth given the left images (P(D|L)), and also the distribution of right images given the left image sequence and corresponding depths (P(R|D, L)). In summary, the SVC first estimates depth from the input sequence, and subsequently uses the input video and estimated depth to synthesize a stereoscopic video. An overview of this process is illustrated by
where Z is the normalization constant of the probability, and α and β are variable parameters. Note that in various tested embodiments, values on the order of α=10 and β=0.5 were observed to provide acceptable results. However, these values may be adjusted, if desired. The objective contains three terms: data (Etransfer), spatial smoothness (Esmooth), and database prior (Eprior).
where wi (j) is a confidence measure of the accuracy of the jth candidate's warped depth at pixel i (see Section 2.3 for a further discussion of this point) and K is the total number of candidates. The SVC measures not only absolute differences, but also relative depth changes, i.e., gradient depth. The second and third terms of Equation (3) model the gradient depth differences with the inferred depth gradients, weighted by γ (where a value of approximately γ=0.5 was used, though this value can be adjusted, if desired). Note that these terms allow for more accurate intra-object depth inference.
$E_{smooth}(D_i) = s_{x,i}\,\phi(\nabla_x D_i) + s_{y,i}\,\phi(\nabla_y D_i)$ Equation (4)
Soft thresholds of the image gradients (using a sigmoidal function) are used, defining sx,i=1−1/(1+e−100(∥∇
E prior(D i)=φ(D i − i) Equation (5)
where the prior, , is computed by averaging all depth images in the database (or sets of similar image frames in the database).
where Ecoherence achieves temporal smoothness, and Emotion uses motion cues to better estimate the depth of moving objects. The weights ν and η are used to balance the relative influence of each term. Note that in tested embodiments, values of approximately ν=10 and η=5 were used, though these values can be adjusted, if desired.
$E_{coherence}(D_i) = s_{t,i}\,\phi(\nabla_{flow} D_i)$ Equation (7)
E motion(D i)=m iφ(D i − i) Equation (8)
$(1-\omega)\,\|G_1 - G_2\| + \omega\,\|F_1 - F_2\|$ Equation (9)
where ω=0.5 in various tested embodiments of the SVC, though this value can be adjusted, if desired.
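The distance of Equation (9) can be used to rank database frames against a query frame, as sketched below. Reading G as a GIST feature vector and F as an optical-flow-derived feature vector is an assumption based on the surrounding text, and the helper names and the toy database layout are hypothetical.

```python
import numpy as np

def frame_distance(gist_a, flow_a, gist_b, flow_b, omega=0.5):
    """Equation (9)-style blend of GIST distance and flow-feature distance."""
    gist_term = np.linalg.norm(np.asarray(gist_a, float) - np.asarray(gist_b, float))
    flow_term = np.linalg.norm(np.asarray(flow_a, float) - np.asarray(flow_b, float))
    return (1.0 - omega) * gist_term + omega * flow_term

def k_nearest(query_gist, query_flow, database, k):
    """Indices of the k database entries closest to the query frame."""
    dists = [frame_distance(query_gist, query_flow, g, f) for g, f in database]
    return [int(i) for i in np.argsort(dists)[:k]]

# Each database entry is a (gist_features, flow_features) pair.
database = [
    ([0.0, 0.0], [0.0]),
    ([1.0, 1.0], [1.0]),
    ([5.0, 5.0], [5.0]),
]
nearest = k_nearest([1.0, 1.0], [1.0], database, k=2)
```

With ω = 0.5 both cues contribute equally; lowering ω would favor appearance similarity over motion similarity when selecting candidates.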
where Si and Si (j) are the SIFT feature vectors at pixel i in the input image and in candidate image j, respectively. Note that the candidate image's SIFT features are computed first, and then warped using the warping function (ψj) calculated with SIFT flow.
where t is an adjustable threshold. Note a value of approximately t=0.01 was used for various embodiments of the SVC, though this value can be adjusted, if desired. Relative differences (i.e., division by the background pixels) are used so as to not bias the estimates with brighter pixels. This “segmentation mask” is used in Equation (8) to improve depth estimates for moving objects in the above-described optimization.
where W0={W1, . . . , Wn} and D={D1, . . . , Dn} are the initial disparity and depth, respectively, for each of the n frames of the input, and Wmax is an adjustable parameter that modulates how much objects “pop-out” from the screen when viewed with a stereoscopic device. Increasing the value of Wmax enhances the “3D” effect, but can also cause eye strain or problems with fusing the stereo images if set too high.
where li is a weight based on image saliency and initial disparity values that constrains disparity values corresponding to highly salient regions and very close objects to remain unchanged, and is set to
The Qsmooth term contains the same terms as in the spatial and temporal smoothness functions provided above (see Equations (4) and (7), respectively), and λ and μ control the weighting of these smoothness terms in the optimization. Note that in tested embodiments, both λ and μ were set to a value of 10, although these values can be adjusted, if desired. Given this formulation, the SVC ensures spatial and temporal coherence while also ensuring that highly salient regions remain perceptually intact during view warping.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/315,488 US9414048B2 (en) | 2011-12-09 | 2011-12-09 | Automatic 2D-to-stereoscopic video conversion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/315,488 US9414048B2 (en) | 2011-12-09 | 2011-12-09 | Automatic 2D-to-stereoscopic video conversion |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130147911A1 (en) | 2013-06-13 |
US9414048B2 (en) | 2016-08-09 |
Family
ID=48571620
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/315,488 (Active, expires 2035-06-09) US9414048B2 (en) | 2011-12-09 | 2011-12-09 | Automatic 2D-to-stereoscopic video conversion |
Country Status (1)
Country | Link |
---|---|
US (1) | US9414048B2 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140078260A1 (en) * | 2012-09-20 | 2014-03-20 | Brown University | Method for generating an array of 3-d points |
US20160269709A1 (en) * | 2015-03-13 | 2016-09-15 | Eys3D Microelectronics, Co. | Image process apparatus and image process method |
CN108564620A (en) * | 2018-03-27 | 2018-09-21 | 中国人民解放军国防科技大学 | Scene depth estimation method for light field array camera |
CN110443257A (en) * | 2019-07-08 | 2019-11-12 | 大连理工大学 | A kind of conspicuousness detection method based on Active Learning |
US20200082541A1 (en) * | 2018-09-11 | 2020-03-12 | Apple Inc. | Robust Use of Semantic Segmentation for Depth and Disparity Estimation |
US10679373B2 (en) * | 2016-04-21 | 2020-06-09 | Ultra-D Coöperatief U.A. | Dual mode depth estimator |
DE102018221625A1 (en) | 2018-12-13 | 2020-06-18 | Robert Bosch Gmbh | Transfer of additional information between camera systems |
Families Citing this family (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012253713A (en) * | 2011-06-07 | 2012-12-20 | Sony Corp | Image processing device, method for controlling image processing device, and program for causing computer to execute the method |
US9471988B2 (en) | 2011-11-02 | 2016-10-18 | Google Inc. | Depth-map generation for an input image using an example approximate depth-map associated with an example similar image |
US9661307B1 (en) | 2011-11-15 | 2017-05-23 | Google Inc. | Depth map generation using motion cues for conversion of monoscopic visual content to stereoscopic 3D |
US20130202191A1 (en) * | 2012-02-02 | 2013-08-08 | Himax Technologies Limited | Multi-view image generating method and apparatus using the same |
US9111350B1 (en) | 2012-02-10 | 2015-08-18 | Google Inc. | Conversion of monoscopic visual content to stereoscopic 3D |
JP2013172190A (en) * | 2012-02-17 | 2013-09-02 | Sony Corp | Image processing device and image processing method and program |
EP2670146A1 (en) * | 2012-06-01 | 2013-12-04 | Alcatel Lucent | Method and apparatus for encoding and decoding a multiview video stream |
US9241146B2 (en) * | 2012-11-02 | 2016-01-19 | Nvidia Corporation | Interleaved approach to depth-image-based rendering of stereoscopic images |
US9674498B1 (en) | 2013-03-15 | 2017-06-06 | Google Inc. | Detecting suitability for converting monoscopic visual content to stereoscopic 3D |
JP2015095779A (en) * | 2013-11-12 | 2015-05-18 | ソニー株式会社 | Image processing apparatus, image processing method, and electronic equipment |
US10074182B2 (en) | 2013-11-14 | 2018-09-11 | Microsoft Technology Licensing, Llc | Presenting markup in a scene using depth fading |
US9131209B1 (en) * | 2014-10-27 | 2015-09-08 | Can Demirbağ | Method for automated realtime conversion of 2D RGB images and video to red-cyan stereoscopic anaglyph 3D |
US10200666B2 (en) * | 2015-03-04 | 2019-02-05 | Dolby Laboratories Licensing Corporation | Coherent motion estimation for stereoscopic video |
KR102286572B1 (en) * | 2015-03-04 | 2021-08-06 | 한국전자통신연구원 | Device and Method for new 3D Video Representation from 2D Video |
CN104899558A (en) * | 2015-05-25 | 2015-09-09 | 东华大学 | Scene recognition and colorization processing method for vehicle-mounted infrared image |
BR112018002224A8 (en) | 2015-08-03 | 2020-09-08 | Kiana Ali Asghar Calagari | 2d to 3d video frame converter |
US11463676B2 (en) | 2015-08-07 | 2022-10-04 | Medicaltek Co. Ltd. | Stereoscopic visualization system and method for endoscope using shape-from-shading algorithm |
US20170035268A1 (en) * | 2015-08-07 | 2017-02-09 | Ming Shi CO., LTD. | Stereo display system and method for endoscope using shape-from-shading algorithm |
EP3156942A1 (en) * | 2015-10-16 | 2017-04-19 | Thomson Licensing | Scene labeling of rgb-d data with interactive option |
US10554956B2 (en) * | 2015-10-29 | 2020-02-04 | Dell Products, Lp | Depth masks for image segmentation for depth-based computational photography |
FR3054347B1 (en) * | 2016-07-19 | 2019-08-23 | Safran | METHOD AND DEVICE FOR AIDING NAVIGATION OF A VEHICLE |
CN108064448A (en) * | 2016-09-14 | 2018-05-22 | 深圳市柔宇科技有限公司 | A kind of playback equipment and its playback method |
US20180322689A1 (en) * | 2017-05-05 | 2018-11-08 | University Of Maryland, College Park | Visualization and rendering of images to enhance depth perception |
US10572761B1 (en) * | 2017-06-05 | 2020-02-25 | Google Llc | Virtual reality system using super-resolution |
US10735707B2 (en) * | 2017-08-15 | 2020-08-04 | International Business Machines Corporation | Generating three-dimensional imagery |
CN109688397B (en) * | 2017-10-18 | 2021-10-22 | 上海质尊文化传媒发展有限公司 | Method for converting 2D (two-dimensional) video into 3D video |
US11194994B2 (en) * | 2017-12-20 | 2021-12-07 | X Development Llc | Semantic zone separation for map generation |
US10991150B2 (en) | 2018-05-09 | 2021-04-27 | Massachusetts Institute Of Technology | View generation from a single image using fully convolutional neural networks |
US20220368881A1 (en) * | 2019-01-25 | 2022-11-17 | Bitanimate, Inc. | Detection and ranging based on a single monoscopic frame |
CN110084742B (en) * | 2019-05-08 | 2024-01-26 | 北京奇艺世纪科技有限公司 | Parallax map prediction method and device and electronic equipment |
US11764941B2 (en) * | 2020-04-30 | 2023-09-19 | International Business Machines Corporation | Decision tree-based inference on homomorphically-encrypted data without bootstrapping |
WO2021229455A1 (en) * | 2020-05-11 | 2021-11-18 | Niantic, Inc. | Generating stereo image data from monocular images |
US11328172B2 (en) * | 2020-08-24 | 2022-05-10 | Huawei Technologies Co. Ltd. | Method for fine-grained sketch-based scene image retrieval |
CN112991495B (en) * | 2021-03-09 | 2023-10-27 | 大连海事大学 | Interactive iterative virtual shoe print image generation method based on sketch |
US20220413433A1 (en) * | 2021-06-28 | 2022-12-29 | Meta Platforms Technologies, Llc | Holographic Calling for Artificial Reality |
CN113506217B (en) * | 2021-07-09 | 2022-08-16 | 天津大学 | Three-dimensional image super-resolution reconstruction method based on cyclic interaction |
CN116205788B (en) * | 2023-04-27 | 2023-08-11 | 粤港澳大湾区数字经济研究院(福田) | Three-dimensional feature map acquisition method, image processing method and related device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7027054B1 (en) * | 2002-08-14 | 2006-04-11 | Avaworks, Incorporated | Do-it-yourself photo realistic talking head creation system and method |
US20080031327A1 (en) * | 2006-08-01 | 2008-02-07 | Haohong Wang | Real-time capturing and generating stereo images and videos with a monoscopic low power mobile device |
US20080150945A1 (en) | 2006-12-22 | 2008-06-26 | Haohong Wang | Complexity-adaptive 2d-to-3d video sequence conversion |
US20090116732A1 (en) * | 2006-06-23 | 2009-05-07 | Samuel Zhou | Methods and systems for converting 2d motion pictures for stereoscopic 3d exhibition |
US20100026784A1 (en) | 2006-12-19 | 2010-02-04 | Koninklijke Philips Electronics N.V. | Method and system to convert 2d video into 3d video |
US20100111417A1 (en) | 2008-11-03 | 2010-05-06 | Microsoft Corporation | Converting 2d video into stereo video |
US20110096832A1 (en) | 2009-10-23 | 2011-04-28 | Qualcomm Incorporated | Depth map generation techniques for conversion of 2d video data to 3d video data |
US20110109720A1 (en) | 2009-11-11 | 2011-05-12 | Disney Enterprises, Inc. | Stereoscopic editing for video production, post-production and display adaptation |
US7999844B2 (en) | 1995-12-22 | 2011-08-16 | Dynamic Digital Depth Research Pty Ltd. | Image conversion and encoding techniques |
US20130034337A1 (en) * | 2011-08-03 | 2013-02-07 | Qatar Foundation | Copy detection |
Non-Patent Citations (30)
Title |
---|
Ashutosh Saxena, Sung H Chung, and Andrew Y Ng. Learning Depth from Single Monocular Images. In NIPS 18. MIT Press, 2005. |
Aude Oliva and Antonio Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision, 42:145-175, 2001. |
Beyang Liu, S. Gould, and D. Koller. Single Image Depth Estimation from Predicted Semantic Labels. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 1253-1260, Jun. 2010. |
C Wu, J-M Frahm, and M Pollefeys. Repetition-based Dense Single-View Reconstruction. CVPR, pp. 1-8, Mar. 2011. |
Ce Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing: Label transfer via dense scene alignment. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 0:1972-1979, 2009. |
Ce Liu, J. Yuen, and A. Torralba. SIFT Flow: Dense Correspondence across Scenes and its Applications. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5):978-994, May 2011. |
Chao-Chung Cheng, Chung-Te Li, and Liang-Gee Chen. A Novel 2D-to-3D Conversion System using Edge Information. Consumer Electronics, IEEE Transactions on, 56(3):1739-1745, 2010. |
Chenglei Wu, Guihua Er, Xudong Xie, Tao Li, Xun Cao, and Qionghai Dai. A Novel Method for Semi-automatic 2D to 3D Video Conversion. In 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, 2008, pp. 65-68, 2008. |
E Rotem, K Wolowelsky, and D Pelz. Automatic video to stereoscopic video conversion. Proc. SPIE, 5664(198):1-9, Mar. 2005. |
E. Delage, Honglak Lee, and A.Y. Ng. A Dynamic Bayesian Network Model for Autonomous 3D Reconstruction from a Single Indoor Image. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, pp. 2418-2428, 2006. |
Feng Han and Song-Chun Zhu. Bayesian Reconstruction of 3D Shapes and Scenes from a Single Image. In Proceedings of the First IEEE International Workshop on Higher-Level Knowledge in 3D Modeling and Motion Analysis, pp. 12-, Washington, DC, USA, 2003. IEEE Computer Society. |
A. Fitzgibbon, Y. Wexler, and A. Zisserman. Image-based Rendering using Image-based Priors. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, vol. 2, pp. 1176-1183, Oct. 2003. |
Guttmann, et al., "Semi-automatic Stereo Extraction from Video Footage", IEEE 12th International Conference on Computer Vision (ICCV), Sep. 29-Oct. 2, 2009, pp. 136-142. |
Hoiem, et al., "Automatic Photo Pop-up", ACM SIGGRAPH, Journal of ACM Transactions on Graphics (TOG), vol. 24, Issue 3, Jul. 2005, pp. 577-584. |
Ianir Ideses, Leonid Yaroslavsky, and Barak Fishbain. Real-time 2D to 3D Video Conversion. Journal of Real-Time Image Processing, 2:3-9, 2007. 10.1007/s11554-007-0038-9. |
K. Moustakas, D. Tzovaras, and M.G. Strintzis. Stereoscopic Video Generation based on Efficient Layered Structure and Motion Estimation from a Monoscopic Image Sequence. Circuits and Systems for Video Technology, IEEE Transactions on, 15(8):1065-1073, 2005. |
Kim, et al., "A Stereoscopic Video Generation Method Using Stereoscopic Display Characterization and Motion Analysis", IEEE Transactions on Broadcasting, vol. 54, No. 2, Jun. 2008, pp. 188-197. |
Ko, et al., "2D-To-3D Stereoscopic Conversion: Depth-Map Estimation in a 2D Single-View Image", SPIE, vol. 6696, 66962A, Aug. 2007, pp. 9. |
A. Kowdle, Y.J. Chang, A. Gallagher, and T.H. Chen. Active Learning for Piecewise Planar 3D Reconstruction. In CVPR11, pp. 929-936, 2011. |
L Zhang, C. Vazquez, and S. Knorr. 3D-TV Content Creation: Automatic 2D-to-3D Video Conversion. Broadcasting, IEEE Transactions on, PP(99):1-12, 2011. |
Lai-Man Po, Xuyuan Xu, Yuesheng Zhu, Shihang Zhang, Kwok-Wai Cheung, and Chi-Wang Ting. Automatic 2D-to-3D Video Conversion Technique based on Depth-from-Motion and Color Segmentation. In Signal Processing (ICSP), 2010 IEEE 10th International Conference on, pp. 1000-1003, 2010. |
M Kim, S Park, H Kim, and I Artem. Automatic Conversion of Two-dimensional Video into Stereoscopic Video. In Three-Dimensional TV, Video, and Display IV. SPIE, Dec. 2005. |
Oliver Wang, Manuel Lang, Matthias Frei, Alexander Hornung, Aljoscha Smolic, and Markus Gross. Stereobrush: Interactive 2D to 3D Conversion using Discontinuous Warps. In Proceedings of the Eighth Sketch-Based Interfaces and Modeling Symposium, SBIM '11, 2011. |
Saxena, et al., "3-D Depth Reconstruction from a Single Still Image", International Journal of Computer Vision, vol. 76, 2007, pp. 16. |
A. Saxena, Min Sun, and A.Y. Ng. Make3D: Learning 3D Scene Structure from a Single Still Image. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):824-840, 2009. |
Sebastian Knorr, Matthias Kunter, and Thomas Sikora. Stereoscopic 3D from 2D video with super-resolution capability. Image Commun., 23:665-676, Oct. 2008. |
Tal Hassner and Ronen Basri. Example Based 3D Reconstruction from Single 2D Images. In CVPR workshop on Beyond Patches, 2006. |
Tao Li, Qionghai Dai, and Xudong Xie. An Efficient Method for Automatic Stereoscopic Conversion. In Visual Information Engineering, 2008. VIE 2008. 5th International Conference on, pp. 256-260, 2008. |
Ward, et al., "Depth Director: A System for Adding Depth to Movies", IEEE Computer Graphics and Applications, vol. 31, No. 1, Jan./Feb. 2011, pp. 36-48. |
Zhang, et al., "Stereoscopic Video Synthesis from a Monocular Video", IEEE Transactions on Visualization and Computer Graphics, vol. 13, No. 4, Jul./Aug. 2007, pp. 686-696. |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140078260A1 (en) * | 2012-09-20 | 2014-03-20 | Brown University | Method for generating an array of 3-d points |
US10008007B2 (en) * | 2012-09-20 | 2018-06-26 | Brown University | Method for generating an array of 3-D points |
US20160269709A1 (en) * | 2015-03-13 | 2016-09-15 | Eys3D Microelectronics, Co. | Image process apparatus and image process method |
US10148934B2 (en) * | 2015-03-13 | 2018-12-04 | Eys3D Microelectronics, Co. | Image process apparatus and image process method |
US10679373B2 (en) * | 2016-04-21 | 2020-06-09 | Ultra-D Coöperatief U.A. | Dual mode depth estimator |
CN108564620A (en) * | 2018-03-27 | 2018-09-21 | 中国人民解放军国防科技大学 | Scene depth estimation method for light field array camera |
US20200082541A1 (en) * | 2018-09-11 | 2020-03-12 | Apple Inc. | Robust Use of Semantic Segmentation for Depth and Disparity Estimation |
US11526995B2 (en) * | 2018-09-11 | 2022-12-13 | Apple Inc. | Robust use of semantic segmentation for depth and disparity estimation |
DE102018221625A1 (en) | 2018-12-13 | 2020-06-18 | Robert Bosch Gmbh | Transfer of additional information between camera systems |
WO2020119996A1 (en) | 2018-12-13 | 2020-06-18 | Robert Bosch Gmbh | Transfer of additional information between camera systems |
CN110443257A (en) * | 2019-07-08 | 2019-11-12 | 大连理工大学 | A kind of conspicuousness detection method based on Active Learning |
CN110443257B (en) * | 2019-07-08 | 2022-04-12 | 大连理工大学 | Significance detection method based on active learning |
Also Published As
Publication number | Publication date |
---|---|
US20130147911A1 (en) | 2013-06-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9414048B2 (en) | Automatic 2D-to-stereoscopic video conversion | |
CN111325794B (en) | Visual simultaneous localization and map construction method based on depth convolution self-encoder | |
Dhamo et al. | Peeking behind objects: Layered depth prediction from a single image | |
JP7403528B2 (en) | Method and system for reconstructing color and depth information of a scene | |
Karsch et al. | Depth transfer: Depth extraction from video using non-parametric sampling | |
US9390515B2 (en) | Keyframe selection for robust video-based structure from motion | |
US9237330B2 (en) | Forming a stereoscopic video | |
US9041819B2 (en) | Method for stabilizing a digital video | |
Karsch et al. | Depth extraction from video using non-parametric sampling | |
US8885941B2 (en) | System and method for estimating spatially varying defocus blur in a digital image | |
KR101370718B1 (en) | Method and apparatus for 2d to 3d conversion using panorama image | |
US20130127988A1 (en) | Modifying the viewpoint of a digital image | |
US20130129192A1 (en) | Range map determination for a video frame | |
Choi et al. | Space-time hole filling with random walks in view extrapolation for 3D video | |
US20130129193A1 (en) | Forming a steroscopic image using range map | |
US9317928B2 (en) | Detecting and tracking point features with primary colors | |
WO2008152607A1 (en) | Method, apparatus, system and computer program product for depth-related information propagation | |
Jain et al. | Enhanced stable view synthesis | |
Orozco et al. | HDR multiview image sequence generation: Toward 3D HDR video | |
Woodford et al. | Efficient new-view synthesis using pairwise dictionary priors | |
Patil et al. | Review on 2D-to-3D image and video conversion methods | |
Ko et al. | Disparity Map estimation using semi-global matching based on image segmentation | |
Zhang et al. | Superpixel-based image inpainting with simple user guidance | |
Babahajiani | Geometric computer vision: Omnidirectional visual and remotely sensed data analysis | |
US12039657B2 (en) | View synthesis of a dynamic scene |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: MICROSOFT CORPORATION, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: KARSCH, KEVIN ROBERT; LIU, CE; KANG, SING BING; SIGNING DATES FROM 20111207 TO 20111208; REEL/FRAME: 027779/0392 |
AS | Assignment | Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: MICROSOFT CORPORATION; REEL/FRAME: 034544/0541. Effective date: 20141014 |
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |