WO2013109252A1 - Generating an image for another view - Google Patents

Generating an image for another view

Info

Publication number
WO2013109252A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
depth
particular image
disparity
values
Prior art date
Application number
PCT/US2012/021590
Other languages
French (fr)
Inventor
Gowri Somanath
Shan He
Izzat Hekmat Izzat
Original Assignee
Thomson Licensing
Priority date
Filing date
Publication date
Application filed by Thomson Licensing filed Critical Thomson Licensing
Priority to PCT/US2012/021590 priority Critical patent/WO2013109252A1/en
Publication of WO2013109252A1 publication Critical patent/WO2013109252A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/20Image signal generators
    • H04N13/261Image signal generators with monoscopic-to-stereoscopic image conversion
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/111Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N13/117Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106Processing image signals
    • H04N13/128Adjusting depth or disparity

Definitions

  • Implementations are described that relate to image content. Various particular implementations relate to generating a stereoscopic image pair.
  • a particular image from a first view is accessed.
  • Disparity values are determined for multiple pixels of the particular image using a processor-based algorithm.
  • the particular image is warped to a second view based on the disparity values, to produce a warped image from the second view.
  • the particular image and the warped image are provided as a three-dimensional stereo pair of images.
  • implementations may be configured or embodied in various manners.
  • an implementation may be performed as a method, or embodied as an apparatus, such as, for example, an apparatus configured to perform a set of operations or an apparatus storing instructions for performing a set of operations, or embodied in a signal.
  • FIG. 1 is a pictorial representation of an actual depth value for parallel cameras.
  • FIG. 2 is a pictorial representation of a disparity value.
  • FIG. 3 is a pictorial representation of the relationship between apparent depth and disparity.
  • FIG. 4 is a pictorial representation of convergent cameras.
  • FIG. 5 is a pictorial representation of occlusion in stereoscopic video image pairs.
  • FIG. 6 is a block/flow diagram depicting an implementation of an image conversion system.
  • FIG. 7 is a block/flow diagram depicting an implementation of an image conversion process.
  • FIG. 8 includes two block/flow diagrams depicting a first implementation and a second implementation of the image conversion process of FIG. 7.
  • FIG. 9 includes two block/flow diagrams depicting a third implementation and a fourth implementation of the image conversion process of FIG. 7.
  • FIG. 10 includes two block/flow diagrams depicting a fifth implementation and a sixth implementation of the image conversion process of FIG. 7.
  • FIG. 11 is a block/flow diagram depicting a first implementation of the image conversion system of FIG. 6.
  • FIG. 12 is a block/flow diagram depicting a second implementation of the image conversion system of FIG. 6.
  • FIG. 13 is a block/flow diagram depicting a third implementation of the image conversion system of FIG. 6.
  • FIG. 14 is a block/flow diagram depicting another implementation of an image conversion process.
  • FIG. 15 is a block/flow diagram depicting a first implementation of a communications system.
  • FIG. 16 is a block/flow diagram depicting a second implementation of a communications system.
  • At least one implementation provides an automated process (a "pipeline") for generating a stereoscopic image pair based on a 2D image.
  • the pipeline of this implementation estimates disparity, and warps the 2D image to create a second image.
  • the two images are provided as the stereoscopic image pair for use in, for example, providing three-dimensional (“3D”) content to a viewer.
  • 3D three-dimensional
  • we estimate a depth map for each frame in the sequence. The depth map is typically used to obtain a disparity map for generating a stereo pair.
  • the matching is, in various implementations, either sparse or dense.
  • the depth map is then typically converted to a disparity map using one or more of the mapping methods discussed.
  • the mapping methods are used, for example, for the purpose of range adjustment, enhancement of the depth perception, or alteration for better viewer experience.
  • the disparity map is then typically used to generate a stereo pair for each frame using one of the warping techniques ranging from, for example, a shifting operation to function based surface warps.
  • the framework is implemented, in various implementations, as a fully automatic pipeline or a semiautomatic scheme with some manual interaction.
  • the mapping from depth to disparity is, in various implementations, either linear or non-linear.
  • the range of disparity is usually varied to suit the viewing screen dimension and the distance of the viewer to the screen.
  • various implementations generate the stereo pair.
  • the stereo pair is generated using, for example, a warping technique.
  • 3D stereo pair
  • several implementations use, for example, the temporal relation between frames in terms of camera/object motion or any other useful feature.
  • Several implementations perform a depth estimation step that is based on analyzing the entire image or a sub-sample of the image. Such implementations do not restrict the examination to, for example, only the top and bottom regions of an image. Also, several implementations are not restricted to contiguous region/point examination.
  • Further, various implementations do not restrict the depth to a predefined set of depth models. This provides the advantage of being able to estimate depth of the actual scene more closely, as compared to the use of depth models.
  • when object segments are available through, for example, a manual process or an automatic tracking process, such segments are used for correction and refinement of the depth map.
  • Several implementations fill occlusions by repeating pixel colors. However, other implementations use more general inpainting and interpolation techniques.
  • such a scheme is, in various implementations, further refined through various parameters and/or manual determination, towards generation of a depth map and second view for stereo/3D viewing.
  • the disparity map is, in various implementations, sparse or dense and is, in various implementations, computed using stereo or structure from motion. The disparity map is then warped to generate the second view.
  • FIGS. 1 -5 provide a more detailed discussion of various features.
  • depth, disparity, and occlusions as these terms relate to various implementations are discussed.
  • FIG. 1 illustrates the concept of depth in a video image.
  • FIG. 1 shows a right camera 105 with a sensor 107, and a left camera 110 with a sensor 112. Both cameras 105, 110 are capturing images of an object 115.
  • the object 115 is a physical cross, having an arbitrary detail 116 located on the right side of the cross (see FIG. 2).
  • the right camera 105 has a capture angle 120.
  • the left camera 110 has a capture angle 125.
  • the two capture angles 120, 125 overlap in a 3D stereo area 130.
  • because the object 115 is in the 3D stereo area 130, the object 115 is visible to both cameras 105, 110, and therefore the object 115 is capable of being perceived as having a depth.
  • the object 115 has an actual depth 135.
  • the actual depth 135 is generally referred to as the distance from the object 115 to the cameras 105, 110. More specifically, the actual depth 135 may be referred to as the distance from the object 115 to a stereo camera baseline 140, which is the plane defined by the entrance pupil plane of both cameras 105, 110.
  • the entrance pupil plane of a camera is typically inside a zoom lens and, therefore, is not typically physically accessible.
  • the cameras 105, 110 are also shown having a focal length 145.
  • the focal length 145 is the distance from the exit pupil plane to the sensors 107, 112.
  • the entrance pupil plane and the exit pupil plane are shown as coincident, when in most instances they are slightly separated.
  • the cameras 105, 110 are shown as having a baseline length 150.
  • the baseline length 150 is the distance between the centers of the entrance pupils of the cameras 105, 110, and therefore is measured at the stereo camera baseline 140.
  • the object 115 is imaged by each of the cameras 105 and 110 as real images on each of the sensors 107 and 112. These real images include a real image 117 of the detail 116 on the sensor 107, and a real image 118 of the detail 116 on the sensor 112. As shown in FIG. 1, the real images are flipped, as is known in the art.
  • FIG. 2 shows a left image 205 captured from the camera 110, and a right image 210 captured from the camera 105. Both images 205, 210 include a representation of the object 115 with the detail 116.
  • the image 210 includes a detail image 217 of the detail 116, and the image 205 includes a detail image 218 of the detail 116.
  • the far right point of the detail 116 is captured in a pixel 220 in the detail image 218 in the left image 205, and is captured in a pixel 225 in the detail image 217 in the right image 210.
  • the horizontal difference between the locations of the pixel 220 and the pixel 225 is the disparity 230.
  • the object images 217, 218 are assumed to be registered vertically so that the images of the detail 116 have the same vertical positioning in both the images 205, 210.
  • the disparity 230 provides a perception of depth to the object 215 when the left and right images 205, 210 are viewed by the left and right eyes, respectively, of a viewer.
  • FIG. 3 shows the relationship between disparity and perceived depth.
  • Three observers 305, 307, 309 are shown viewing a stereoscopic image pair for an object on respective screens 310, 320, 330.
  • the first observer 305 views a left view 315 of the object and a right view 317 of the object that have a positive disparity.
  • the positive disparity reflects the fact that the left view 315 of the object is to the left of the right view 317 of the object on the screen 310.
  • the positive disparity results in a perceived, or virtual, object 319 appearing to be behind the plane of the screen 310.
  • the second observer 307 views a left view 325 of the object and a right view 327 of the object that have zero disparity.
  • the zero disparity reflects the fact that the left view 325 of the object is at the same horizontal position as the right view 327 of the object on the screen 320.
  • the zero disparity results in a perceived, or virtual, object 329 appearing to be at the same depth as the screen 320.
  • the third observer 309 views a left view 335 of the object and a right view 337 of the object that have a negative disparity.
  • the negative disparity reflects the fact that the left view 335 of the object is to the right of the right view 337 of the object on the screen 330.
  • the negative disparity results in a perceived, or virtual, object 339 appearing to be in front of the plane of the screen 330.
  • Depth is related to disparity according to Equation 1: D = (f * b) / d, where:
  • D describes depth (135 in FIG. 1),
  • b is the baseline length (150 in FIG. 1) between two stereo-image cameras,
  • f is the focal length for each camera (145 in FIG. 1), and
  • d is the disparity for two corresponding feature points (230 in FIG. 2).
  • Equation 1 above is valid for parallel cameras with the same focal length. More complicated formulas can be defined for other scenarios, but in most cases Equation 1 can be used as an approximation. Additionally, however, Equation 2 below is valid for at least various arrangements of converging cameras, as is known by those of ordinary skill in the art: D = (f * b) / (d - d∞)   (2)
  • d∞ is the value of disparity for an object at infinity. d∞ depends on the convergence angle and the focal length, and is expressed in meters (for example).
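  • The following Python sketch works through Equations 1 and 2 for illustration. The relationships are taken from the text above; the numeric focal length, baseline, and disparity values are hypothetical, and the reconstructed form of Equation 2 (with d∞ in the denominator) is an assumption consistent with d∞ being the disparity of an object at infinity.

        def depth_parallel(f, b, d):
            # Equation 1: depth for parallel cameras with the same focal length.
            return (f * b) / d

        def depth_converging(f, b, d, d_inf):
            # Equation 2: depth for converging cameras, where d_inf is the
            # disparity of an object at infinity (it depends on the convergence
            # angle and the focal length).
            return (f * b) / (d - d_inf)

        # Hypothetical values, for illustration only (all in meters).
        f, b = 0.05, 0.065
        print(depth_parallel(f, b, d=0.0005))                 # 6.5
        print(depth_converging(f, b, d=0.0, d_inf=-0.0005))   # 6.5 (object at the convergence distance)
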
  • FIG. 4 includes the camera 105 and the camera 110 positioned in a converging configuration rather than the parallel configuration of FIG. 1.
  • An angle 410 shows the lines of sight of the cameras 105, 110 converging, and the angle 410 may be referred to as the convergence angle.
  • Disparity maps are used to provide, for example, disparity information for a video image.
  • a disparity map generally refers to a set of disparity values with a geometry corresponding to the pixels in the associated video image.
  • Disparity maps may be used for a variety of processing operations. Such operations include, for example, view interpolation (rendering) for adjusting the 3D effect on a consumer device.
  • the 3D effect is softened (reduced) based on a user preference.
  • a new view is interpolated using the disparity and video images. For example, the new view is positioned at a location between the existing left view and right view, and the new view replaces one of the left view and the right view.
  • the new stereoscopic image pair has a smaller baseline length and will have a reduced disparity, and therefore a reduced 3D effect.
  • extrapolation, rather than interpolation, is performed to exaggerate the apparent depth and thereby increase the 3D effect.
  • a new view is extrapolated corresponding to a virtual camera having an increased baseline length relative to one of the original left and right views.
  • a dense disparity map is preferred over a down-sampled disparity map or other sparse disparity maps.
  • disparity information on a per-pixel basis is generally preferred.
  • the per-pixel basis disparity information generally allows better results to be achieved, because using a sparse disparity map (for example, a down-sampled disparity map) may degrade the quality of synthesized views.
  • FIG. 5 shows a left view 510 and a right view 520 that combine, in a viewer's brain, to produce a 3D scene 530.
  • the left view 510, the right view 520, and the 3D scene 530 each contain three objects, which include a wide cylinder 532, an oval 534, and a thin cylinder 536.
  • two of the three objects 532, 534, 536 are in different relative locations in each of the views 510, 520 and the 3D scene 530.
  • Those two objects are the wide cylinder 532 and the thin cylinder 536.
  • the oval 534 is in the same relative location in each of the views 510, 520 and the 3D scene 530.
  • the different relative locations produce occlusions, as explained by the following simplified discussion.
  • the left view 510 is shown in a left image 540 that also reveals occluded areas 545 and 548.
  • the occluded areas 545 and 548 are only visible in the left view 510 and not in the right view 520. This is because (i) the area in the right view 520 that corresponds to the occluded area 545 is covered by the wide cylinder 532, and (ii) the area in right view 520 that corresponds to the occluded area 548 is covered by the narrow cylinder 536.
  • the right view 520 is shown in a right image 550 that also reveals two occluded areas 555 and 558.
  • the occluded areas 555, 558 are only visible in the right view 520 and not in the left view 510. This is because (i) the area in the left view 510 that corresponds to the occluded area 555 is covered by the wide cylinder 532, and (ii) the area in the left view 510 that corresponds to the occluded area 558 is covered by the narrow cylinder 536. Given that occlusions may exist in a stereoscopic image pair, it is useful to provide two disparity maps for a stereoscopic image pair. In one such implementation, a left disparity map is provided for a left video image, and a right disparity map is provided for a right video image.
  • Known algorithms may be used to assign disparity values to pixel locations of each image for which disparity values cannot be determined using the standard disparity vector approach. Occlusion areas can then be determined by comparing the left and right disparity values.
  • a pixel L is located in row N and has a horizontal coordinate x_L, and a disparity value d_L is determined for the pixel L.
  • a pixel R is located in row N of the corresponding right-eye image and has a horizontal coordinate nearest x_L + d_L. If the pixel R is determined to have a disparity value d_R of about -d_L, then, with a high degree of confidence, there is no occlusion at L or R because the disparities correspond to each other. That is, the pixels L and R both point to each other, generally, with their determined disparities.
  • if d_R is not substantially the same as -d_L, then there may be an occlusion.
  • if the two disparity values are substantially different, after accounting for the sign, then there is generally a high degree of confidence that there is an occlusion. Substantial difference is indicated, in one implementation, by a difference that exceeds a threshold.
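  • As an illustration of the left/right cross-check described above, the following Python sketch marks pixels of the left image whose disparity is not confirmed by the corresponding pixel of the right image. It assumes dense left and right disparity maps as NumPy arrays, with the convention that the match for a left pixel at column x lies near column x + d_L in the right image; the tolerance value is a hypothetical parameter.

        import numpy as np

        def occlusion_mask(disp_left, disp_right, tol=1.0):
            h, w = disp_left.shape
            occluded = np.zeros((h, w), dtype=bool)
            for y in range(h):
                for x in range(w):
                    d_l = disp_left[y, x]
                    x_r = int(round(x + d_l))        # nearest right-image pixel pointed to by d_L
                    if x_r < 0 or x_r >= w:
                        occluded[y, x] = True
                        continue
                    d_r = disp_right[y, x_r]
                    # No occlusion when d_R is about -d_L (the pixels point to each other).
                    if abs(d_r + d_l) > tol:
                        occluded[y, x] = True
            return occluded
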
  • FIG. 6 shows an overview of a system 600 used for one or more implementations of such a 2D-3D conversion.
  • the system 600 includes an input source 610.
  • the input source 610 stores, in various implementations, one or more images and/or an input video.
  • the input source 610 is a means for providing input images, such as a browser or other user interface that allows an image to be selected, received from a storage device, and provided to another component in the system 600.
  • the input source 610 includes a browser for selecting and receiving images from the internet, or an operating system user interface for selecting and retrieving images from a local network.
  • the system 600 further includes a pipeline 620 for generating a stereo pair for one or more images received from the input source 610.
  • the system 600 further includes a viewing medium 630 used, for example, for receiving the generated stereo pair from the pipeline 620, and for displaying the generated stereo pair for viewing by a user.
  • the system 600 additionally includes a user 640 that potentially interfaces with, for example, each of the input source 610, the pipeline 620, and the viewing medium 630.
  • the user 640 interfaces with the input source 610, in various implementations, to select and/or view an input image or input video.
  • the user 640 interfaces with the pipeline 620, in various implementations, to provide input to, and receive selection information from, the pipeline 620.
  • Various forms of input to, and information from, the pipeline 620 are described with respect to particular implementations elsewhere in this application.
  • the user 640 interfaces with the viewing medium 630, in various implementations, to view the input 2D image, a rendered 2D image, a 3D image pair, and/or selection information from the pipeline 620.
  • the user 640 performs communication to/from the other components 610, 620, and 630 using one or more input and/or output devices (not shown).
  • Such input and/or output devices include, for example, a mouse, a touch screen for receiving finger commands, a pen for use with a touch screen, a microphone for receiving voice commands, a speaker for receiving information audibly, and/or a display screen for receiving information visually.
  • the input source 610 is, in various implementations, a storage medium
  • the pipeline is implemented, in various implementations, on a single computer or suitably coded to operate on a cluster or distributed system.
  • the viewing medium 630 is, in various implementations, separate from or integrated with a processing system that executes the pipeline.
  • the viewing medium 630 is a standard monitor and the rendering is an anaglyph.
  • the viewing medium 630 is a 3D-capable TV or a projector-screen combination, and a suitable form of stereo pair is rendered.
  • the stereo pair is, in particular implementations, an interleaved or time-shutter based stereo pair.
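  • As a simple illustration of rendering for a standard monitor, the following Python sketch composes a red/cyan anaglyph from a stereo pair. It assumes left and right RGB images of the same size as NumPy arrays; this is only one of several possible anaglyph schemes, and 3D-capable displays would instead use an interleaved or time-shutter format as noted above.

        import numpy as np

        def red_cyan_anaglyph(left, right):
            anaglyph = np.empty_like(left)
            anaglyph[..., 0] = left[..., 0]     # red channel from the left view
            anaglyph[..., 1:] = right[..., 1:]  # green and blue channels from the right view
            return anaglyph
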
  • Certain implementations of the system 600 include a storage device in addition to the input source 610.
  • Various implementations of the storage device perform the task of storing, either permanently or transiently, for example, an input image and/or a generated stereo pair.
  • the storage device is, in various
  • FIG. 7 shows a process 700 that is performed by a variety of implementations of the pipeline 620. Overviews of various implementations of the pipeline 620 were described earlier. The process 700 can be used with many of these implementations.
  • the process 700 includes accessing an input video 710.
  • the process 700 further includes matching (720), estimating camera parameters (730), and obtaining a depth map (740).
  • the operations 720-740 are described further, for particular implementations, in sections 2.1. through 2.2. below.
  • the process 700 also includes correcting a depth map (750).
  • the operation 750 is described for various implementations with respect to section 2.3. below.
  • the process 700 also includes refining a depth map (760).
  • the operation 760 is described for various implementations with respect to section 2.4. below.
  • the process 700 also includes rescaling and/or remapping a depth map to form a disparity map (770). More generally, the operation 770 relates to forming a disparity map. The operation 770 is described for various implementations with respect to section 2.5. below. The process 700 also includes warping to produce a stereo pair (780). The operation 780 is described for various implementations with respect to section 2.6. below.
  • the process 700 also includes rendering a stereo pair (790).
  • the operation 790 is described for various implementations with respect to section 2.7. below.
  • the matching operation 720 includes a variation of stereo matching, and a depth map is determined (740) from the stereo matching.
  • Stereo matching is used herein to broadly refer to the application of stereo matching techniques to two images, whether the two images are true stereo images or not.
  • the stereo matching produces a disparity map, but the disparity map is converted, in various implementations, to a depth map. Accordingly, various implementations use the disparity map generated from the stereo matching, and do not convert the disparity map to a depth map. In such implementations that use the disparity map directly, the operation of estimating camera parameters 730 need not be performed. Referring to FIG. 8, a process 800a that uses stereo matching is provided. The process 800a is described in more detail further below.
  • stereo matching implementations often do perform the operation of estimating camera parameters 730. This is done, in various implementations, in order to determine a depth map that corresponds to the disparity map. Such depth maps are useful, for example, because they are valid for any display size.
  • consecutive or time spaced frames are treated as stereo pairs.
  • One of the frames is designated as the reference image, and disparity is recovered with respect to the reference image.
  • the process is repeated for each frame as a reference image, or the disparity map is transferred to neighboring frames.
  • the disparity map is transferred in situations, for example, in which there is no change between two frames (due, for example, to the absence of camera or object motion), and we can therefore essentially use the same disparity map. Also, if the change between two frames is very minimal, the majority of the disparity map is reused in various implementations (for example, if an object moves but the camera is not moving, then the background will have the same depth/disparity in the two frames).
  • These implementations use stereo matching to perform the matching 720 of the process 700. An implementation of stereo matching is now described in more detail.
  • Stereo matching
  • Given the rectified images, I1r and I2r, various implementations perform stereo matching to obtain the disparity/depth map.
  • a plethora of stereo algorithms have been proposed in the literature.
  • Methods based on block matching include, for example, Sum of Square Distance (SSD), Normalized Cross Correlation (NCC), and Sum of Absolute Differences (SAD).
  • SSD Sum of Square Distance
  • NCC Normalized Cross Correlation
  • SAD Sum of Absolute Differences
  • Such block matching methods are implemented for real-time use in various implementations.
  • Graph based methods are based on belief propagation or max-flow min-cut algorithms. Graph based methods typically provide dense disparity maps, which are converted in various implementations into dense depth maps, but also typically have large memory and convergence time requirements.
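  • As an illustration of block matching, the following Python sketch computes a disparity map with the Sum of Absolute Differences (SAD). It assumes rectified grayscale images as NumPy float arrays, with the left image as the reference and matches searched to the left in the right image (a common convention for rectified pairs); the block radius and maximum disparity are hypothetical, and practical implementations add faster search strategies, sub-pixel refinement, and occlusion handling.

        import numpy as np

        def sad_disparity(left, right, max_disp=64, radius=4):
            h, w = left.shape
            disp = np.zeros((h, w), dtype=np.float32)
            for y in range(radius, h - radius):
                for x in range(radius, w - radius):
                    block_l = left[y - radius:y + radius + 1, x - radius:x + radius + 1]
                    best_cost, best_d = np.inf, 0
                    for d in range(0, min(max_disp, x - radius) + 1):
                        block_r = right[y - radius:y + radius + 1,
                                        x - d - radius:x - d + radius + 1]
                        cost = np.abs(block_l - block_r).sum()
                        if cost < best_cost:
                            best_cost, best_d = cost, d
                    disp[y, x] = best_d
            return disp
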
  • the stereo matching process provides a disparity map, which is the horizontal parallax for each pixel or block of pixels. The disparity map is used directly in various implementations.
  • the corresponding depth map is determined using the relation z = (f * B) / d, where:
  • z is the depth,
  • f is the focal length of the camera,
  • B is the baseline, or separation between the camera positions when the two images were taken, and
  • d is the disparity. Additional discussion of the relation between depth and disparity is provided with respect to FIGS. 1-5.
  • the camera parameters are known or estimated using other techniques. If the camera parameters are unknown, a predefined value is used in various implementations for the product f * B. Note that disparity is measured in pixels, which can range up to plus or minus the image width. Those disparity values can be too large in particular applications, and so various implementations determine the corresponding depth values.
  • the rectification process transforms the image.
  • various implementations apply an inverse transform to the reference image and depth map obtained.
  • the depth map is inverse transformed as well, to allow the depth map to be used with respect to the original video frame (from the input).
  • the original video frame was rectified, and the disparity map was generated with respect to that rectified video image, and the depth map was based on the disparity map.
  • both the video image and the depth map are "unrectified" (inverse transformed).
  • temporal motion provides depth information because, for example, higher motion often indicates that an object is closer to the camera. Conversely, lower motion often indicates further distance. For example, if a camera moves, then the two frames corresponding to the two positions of the camera are treated, in various implementations, as if the two frames were captured from two separate cameras as described for the traditional stereo camera setup.
  • the matching operation 720 includes feature-based matching or flow-based matching techniques.
  • the operation of estimating the camera parameters 730 includes estimating a projection matrix for the relevant cameras. Based on the feature-based or flow-based matching, and the estimated camera projection matrices, a depth map is determined 740. Referring to FIG. 8, a process 800b is provided that includes feature matching and camera parameter estimation. The process 800b is described in more detail further below.
  • This implementation uses multiple images, which are typically temporal (time-spaced) images. Additionally, implementations typically involve only a single camera. However, multiple camera projection matrices are obtained when, for example, the camera changes position. That is, a projection matrix includes camera position information (see R, T below) which changes as the camera moves. When the camera is not calibrated, the parameters are estimated for a given sequence up to a certain factor. Hence, even with the same camera, two sequences can be processed to obtain two equivalent yet different sets of camera parameters. This is particularly true when the sequences are non-overlapping. But if the camera is calibrated beforehand, then some of the parameters are known in absolute terms and apply to all sequences.
  • depth is estimated if the camera parameters - intrinsic and extrinsic - are known and the projection of a scene point is known in two or more images of the video sequence.
  • the camera projection matrix is P = K [R | T], where K is a 3x3 intrinsic matrix containing the image center, the focal length, and the skew of the camera.
  • "R" and "T" form the extrinsic parameters of rotation and translation of the camera.
  • [R | T] is a 3x4 matrix formed by concatenating/appending the two separate matrices R and T.
  • This is known as 3D triangulation, and can be understood as intersecting two rays originating from the camera centers and passing through the images at respective image points x_i, and intersecting at a 3D point "X". This generally assumes that there are, for example, separate cameras pointing at a common scene. The position X is estimated from the intersection of two or more such rays.
  • we estimate the camera matrices P_i and the projections x_i of each scene point.
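  • As an illustration of building a projection matrix and triangulating a 3D point, the following Python sketch uses the standard linear (DLT) formulation. It assumes two 3x4 camera matrices of the form P = K [R | T] and a pair of matched image points; in the pipeline these camera matrices would come from the camera tracking / structure-from-motion step, and the function names are hypothetical.

        import numpy as np

        def camera_matrix(K, R, T):
            # P = K [R | T], with K 3x3, R 3x3, and T a 3-vector.
            return K @ np.hstack([R, T.reshape(3, 1)])

        def triangulate(P1, P2, pt1, pt2):
            # Estimate the 3D point X whose projections are pt1 in camera 1 and
            # pt2 in camera 2, by solving A X = 0 in the least-squares sense.
            x1, y1 = pt1
            x2, y2 = pt2
            A = np.vstack([
                x1 * P1[2] - P1[0],
                y1 * P1[2] - P1[1],
                x2 * P2[2] - P2[0],
                y2 * P2[2] - P2[1],
            ])
            _, _, vt = np.linalg.svd(A)
            X = vt[-1]
            return X[:3] / X[3]   # inhomogeneous 3D coordinates
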
  • One or all of the camera parameters can be estimated using a camera tracking or structure from motion (“SFM”) technique.
  • SFM camera tracking or structure from motion
  • a multitude of schemes for camera tracking and/or SFM have been proposed and implemented in commercial and open source products.
  • One or more implementations use any of a variety of these known techniques. Certain techniques are discussed in, for example, Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo Tourism: Exploring photo collections in 3D", ACM Transactions on Graphics (Proceedings of SIGGRAPH 2006).
  • This process of obtaining the correspondence/matching can be performed on a sparse set of features or for each pixel of the image (that is, a dense set of features).
  • a large number of feature detectors have been proposed based on scale-space analysis, edge detectors, and pyramid-based image filtering.
  • the features can be tracked using any tracking scheme, or comparing the feature vectors (descriptors) using either the L1 or the L2 norm.
  • the various detectors and descriptors vary in terms of invariance to image transformations (such as scale change), type of image features selected, and dimension of the descriptor.
  • One or more implementations use any of a variety of these known techniques.
  • other implementations obtain a dense correspondence, that is, a match for each pixel.
  • the dense correspondence is obtained, in various implementations, using, for example, the block-matching techniques discussed below.
  • SSD, SAD, and NCC are known as block-matching techniques, and they work essentially on all/any patch/block of an image.
  • block-matching techniques are, in various implementations, commonly aimed at getting as dense a disparity map as possible.
  • various implementations perform stereo matching without aiming to get a dense disparity map.
  • Feature-matching techniques commonly determine a set of features or salient-points and generally match only patches around these detected points.
  • a sparse set of 3D points is typically obtained.
  • the 3D location of each pixel is typically estimated.
  • the selected salient points are not necessarily distributed uniformly over the image.
  • a sparse depth map is converted, using for example, triangulation or interpolation, to get a dense depth map or at least to get a more uniformly sampled depth map. Such a converted depth map will not always be robust because, for example, the sparse depth values are not necessarily distributed in the image.
  • the dense matching algorithms can provide dense correspondence. However, the algorithms are not always accurate. Additionally, the algorithms often increase the computation time for triangulation because there are typically a large number of pixels (for example, on the order of two million pixels for high definition ("HD") frames). A lack of accuracy is caused, in various implementations, because such algorithms often have a trade-off between density and accuracy. Typically, some sort of "smoothness" constraint is used. As a result, for example, patches that are inherently ambiguous to match will often not be accurate. To address these concerns, at least in part, various implementations use a hybrid approach. In the hybrid approach, a dense flow is calculated for the image, such that matching (for example, flow or feature-based) is performed for all pixels. However, only a uniform subset of pixels (for example, a regular grid, or the superpixel centers discussed below) is used for the subsequent triangulation.
  • the superpixels provide a quasi-uniform sampling, while keeping in mind image edges. That is, an image edge will typically not occur inside a super-pixel.
  • the center of the superpixels is used in certain implementations to obtain a sparse depth map from triangulation. Accordingly, the disparity (and depth) is calculated with respect to, for example, the center pixel location in the superpixels.
  • the number of superpixels is varied in different implementations based on image size and scene. For example, 5000 superpixels are used for an HD frame in one implementation.
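  • As an illustration of superpixel-based sampling, the following Python sketch computes superpixels with scikit-image's SLIC implementation and returns the centroid of each superpixel as a sampling location. The image array, the number of segments (5000 for an HD frame in the text), and the compactness value are inputs or illustrative assumptions; other superpixel algorithms could be substituted.

        import numpy as np
        from skimage.segmentation import slic

        def superpixel_centers(image, n_segments=5000):
            labels = slic(image, n_segments=n_segments, compactness=10)
            centers = []
            for label in np.unique(labels):
                ys, xs = np.nonzero(labels == label)
                # Use the centroid of each superpixel as the location at which
                # disparity/depth is computed.
                centers.append((int(ys.mean()), int(xs.mean())))
            return centers
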
  • 2.3. Depth map correction
  • the operation of correcting the depth map (750) includes, in various implementations, one or more of the aspects described below.
  • the depth map is obtained, in various implementations, from any of the above processes, particularly the processes described with respect to sections 2.1 and 2.2.
  • This correction operation 750 is applied, in various implementations, to a depth map obtained by converting a disparity map from a stereo-matching process (section 2.1). This correction operation need not be applied in all implementations.
  • Histogram filtering
  • the depth map obtained may contain some spurious values.
  • errors are removed or reduced by histogram filtering.
  • the depth values are binned into a histogram and values within bins having a "low" population are discarded.
  • the number of bins and the threshold for minimum bin occupancy is set based on the scene and the nature of the error. For example, in one implementation, if a scene contains many depth layers and many objects with different depth values, then more bins are used, as compared to a scene which has fewer depth layers (for example, a scene that has a far background and a simple object in the foreground).
  • for a given image size, the total number of pixels to be binned is fixed. Using more bins typically means that each bin will have a lower percentage of the total pixels and, thus, that the bin counts will generally be lower. As a result, as the number of bins increases, the expected bin count generally decreases, and hence the minimum required bin occupancy is set to a lower value.
  • the filter is applied (i) on the entire depth map, (ii) on smaller size blocks using distinct/overlapping sliding windows, or (iii) on segments obtained from color based segmentation of the image.
  • "windows" refers to image patches, which are generally square or rectangular. Patches are distinct if the patches do not overlap in pixels, such as, for example, a patch from pixel (1, 1) to pixel (10, 10) and a patch from pixel (1, 11) to pixel (11, 20).
  • the segment-based filtering is done inside the segments, and the filter is applied to the whole segment.
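  • As an illustration of histogram filtering, the following Python sketch bins the depth values and discards values that fall in sparsely populated bins. It assumes a NumPy depth map with NaN marking pixels that have no depth value; the number of bins and the minimum occupancy are hypothetical and, as noted above, would be tuned to the scene. The same function could be applied per block or per segment instead of to the whole map.

        import numpy as np

        def histogram_filter(depth, n_bins=32, min_count=50):
            filtered = depth.copy()
            valid = ~np.isnan(depth)
            counts, edges = np.histogram(depth[valid], bins=n_bins)
            # Bin index of every depth value (NaNs are excluded via `valid` below).
            bin_idx = np.clip(np.digitize(depth, edges) - 1, 0, n_bins - 1)
            low_population = counts < min_count
            # Discard values belonging to sparsely populated bins.
            filtered[valid & low_population[bin_idx]] = np.nan
            return filtered
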
  • median filtering is applied on small blocks of the image using a sliding window operation.
  • the mean or the mode are used instead of the median.
  • the use of the mean may result in a "smudging" effect.
  • a larger window has the potential to cover two depth layers within the window (at boundaries or even within an object).
  • the mean would typically result in creating a depth which lies somewhere in between - which may not be visually appealing or even a valid depth layer inside the scene.
  • a mode or a median would instead typically choose to place the depth at the level of the larger object or contributor within that window.
  • consider, for example, a scene with a person in the foreground against a far background, which means essentially that there are two main depth layers (the person's depth in the foreground, and the far background).
  • the window contains, for example, part of the person's hand and part of the background.
  • An in-between depth layer, generated using the mean, would generally result in the appearance that the hand was connected to the far background. This will typically look awkward to a viewer of the scene.
  • a mode or median, in contrast, would generally result in part of the background appearing to stick to the hand. Note that this is avoided, in various implementations, by using super-pixels that respect image edges and, hopefully, do not create a segment that contains parts of both the person and background.
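  • As an illustration of sliding-window filtering, the following Python sketch applies a median filter to a depth map using SciPy, with a box (mean) filter shown for comparison; as discussed above, the mean tends to create in-between depth values at layer boundaries, while the median keeps the depth of the dominant layer in the window. The window size is a hypothetical parameter.

        from scipy.ndimage import median_filter, uniform_filter

        def median_correct(depth, window=5):
            # Sliding-window median: preserves the dominant depth layer in each window.
            return median_filter(depth, size=window)

        def mean_correct(depth, window=5):
            # Sliding-window mean: tends to "smudge" two depth layers into an
            # in-between value, which is usually undesirable.
            return uniform_filter(depth, size=window)
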
  • errors are also removed by using thresholds derived from the mean and variance statistics of the values within the block or segment. For example, values outside of the mean +/- N*sigma (N = 1, 2, 3, ...) are regarded as outliers and discarded in certain implementations.
  • Masks and rotoscopes are well-known in the art and typically provide information about the shape and/or location of an object.
  • the terms “mask” and “rotoscope” are used interchangeably in this discussion.
  • various implementations use the masks/rotoscopes for one or more of the objects to ensure consistency in disparity and/or depth within the object.
  • masks are available from object segmentation and tracking in certain implementations.
  • Given the object boundary, which is provided by the mask or rotoscope, various implementations apply statistical or histogram filters within the segment, as determined by the object boundary, to remove noise. The refinement stage then fills the missing disparity values.
  • Depth values are also modified manually in certain implementations.
  • Modifications include, for example, deleting or replacing depth values.
  • Manual modifications are made, in various implementations, within selected regions of the image. Examples of such regions include rectangular blocks, image segments, a collection of superpixels, or known object segments. In various implementations, a semi-automatic process is used, and a user selects only particular regions in which to perform manual corrections.
  • the correction operation 750 results in the removal of depth values at certain pixels.
  • the refining operation (760) produces, in typical implementations, a dense depth map.
  • the initial depth map is interpolated keeping in mind object boundaries, and attempting to avoid interpolating across object boundaries. Two conditions are typically assumed.
  • the first condition is that object boundaries appear as edges in the image.
  • the second condition is the planar assumption.
  • the planar assumption assumes that the surface that a segment represents is locally flat or planar, thus allowing implementations to use some form of linear interpolation in order to interpolate within a segment.
  • Segmentation of the image is performed, in various implementations, using any high dimensional clustering algorithm based on graph theory or non-parametric methods such as, for example, mean shift.
  • Each image segment is interpolated using, for example, bilinear or cubic interpolation.
  • the filters used for the initial depth map can be used if required to remove interpolation errors. Further smoothing of the depth map can be performed using a sliding window median filter.
  • the refinement operation 760 includes, in various implementations, several operations, including, for example, segmentation, interpolation, and filtering. Additionally, the segmentation of previous operations is, in various implementations, reused.
  • the operation of, generally speaking, producing a disparity map (770) includes, in various implementations, remapping and/or rescaling the depth map.
  • the depth map that is used to generate a disparity map is, in various implementations, the refined dense depth map described above.
  • Other implementations use, for example, a sparse depth map to generate a disparity map.
  • a corresponding stereo pair can be formed by using the depth map to obtain a disparity map, and then using the disparity map to form the stereo image for the reference image.
  • the disparity map indicates the parallax/ horizontal shift for one or more pixels, and a dense disparity map typically indicates the parallax / horizontal shift for each pixel of the reference image.
  • the number of disparity levels refers to the total number of distinct disparities. For example, if a scene has four depth layers, then typical implementations use at least four disparity levels to clearly correspond/mark/map those four depths. The number of distinct disparity levels and the suitable range depend on, for example, the screen size and the viewer distance.
  • a variety of functions can be used to remap the depth map into a disparity map. Before describing several such functions, as examples, we define the following notation:
  • D Depth/disparity map with respect to a certain (reference) frame/image.
  • D is composed of constituent values D(i) for each location "i" in D that has a depth value (for example, in many implementations each pixel location in a depth map would have a value for D(i), with "i" set to the individual pixel location).
  • OldDmax The maximum value of D.
  • OldDmin The minimum value of D.
  • WarpDmax The maximum value of the target range.
  • the target range is the range of disparity values allowable for the disparity map that is being generated.
  • the maximum value is the largest value, which may be positive, zero, or negative. For example, if the target range is [-100 to -50], then the maximum value is -50 and the minimum value is -100.
  • WarpDmin The minimum value of the target range.
  • WarpD Disparity map. Note that although this application generally speaks of the image being warped, and the depth map being remapped or rescaled or converted to produce a disparity map, the term "warping" is also used in the literature (and here) for the process of converting depth to disparity.
  • WarpD is composed of constituent values WarpD(i) for each location "i" in WarpD that has a disparity value (for example, in many implementations each pixel location in the disparity map would have a value for WarpD(i), with "i" set to the individual pixel location).
  • Various implementations use one or more of the following functions to remap the depth map into a disparity map, WarpD. Note that the mappings below are typically performed for each value of depth, D(i), for all possible values of i.
  • Linear Mapping
  • NewD(i) = (WarpDmax - WarpDmin) * (D(i) - OldDmin) / (OldDmax - OldDmin) + WarpDmin
  • the change is, for example, intended to soften or, conversely, exaggerate the depth.
  • a scale factor f is applied to the mapped values. f can be any value, including less than 1 (which will scale up the perceived depth) or greater than 1 (to scale down the perceived depth).
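  • As an illustration of the linear mapping, the following Python sketch remaps a depth map to a target disparity range and applies a scale factor f. The linear formula follows the equation above; the way f enters (dividing the mapped disparities, so that f < 1 enlarges the range and f > 1 shrinks it) is an assumption consistent with the description but not spelled out in the text.

        import numpy as np

        def linear_remap(D, warp_dmin, warp_dmax):
            old_dmin, old_dmax = np.nanmin(D), np.nanmax(D)
            return (warp_dmax - warp_dmin) * (D - old_dmin) / (old_dmax - old_dmin) + warp_dmin

        def scaled_remap(D, warp_dmin, warp_dmax, f=1.0):
            # f < 1 scales up the perceived depth; f > 1 scales it down (assumed form).
            return linear_remap(D, warp_dmin, warp_dmax) / f
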
  • a polynomial function is fit using only some of the depth layers.
  • the depth layers refer to the different "apparent" distances or depths in the scene. For example, if there is a scene with a far background and a near foreground - then it can be roughly said to have two depth layers. As discussed below, certain layers can often be said to be dominant in a particular scene. This provides more control over separation of chosen layers.
  • values that contribute more than a threshold number of pixels are selected (for example, more than 30% of the pixels, or more than the mean of the histogram).
  • the selected layers/bins are represented, for example, by the bin centers of the histogram or by one of the bounds. We designate "x" as a vector that includes these representative values, and a polynomial p is fit to map "x" to a vector "y" of chosen target disparity values.
  • the degree of the polynomial p is, in various implementations, fixed or decided based on the number of unique values in x. For example, the degree can be up to (number of unique values in x) - 1.
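  • As an illustration of the polynomial mapping, the following Python sketch selects dominant depth layers from a histogram of D, assigns each selected layer a target disparity, fits a polynomial to those pairs, and applies it to the whole depth map. The dominance criterion (bins at or above the mean of the histogram), the bin count, and the even spread of target disparities over the target range are illustrative assumptions.

        import numpy as np

        def polynomial_remap(D, warp_dmin, warp_dmax, n_bins=16):
            counts, edges = np.histogram(D, bins=n_bins)
            centers = 0.5 * (edges[:-1] + edges[1:])
            x = centers[counts >= counts.mean()]           # representative values "x" of dominant layers
            y = np.linspace(warp_dmin, warp_dmax, len(x))  # one chosen target disparity per layer
            degree = len(np.unique(x)) - 1                 # up to (unique values in x) - 1
            p = np.polyfit(x, y, degree)
            return np.polyval(p, D)
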
  • Certain implementations also provide a combination that is often suitable for a TV-sized medium.
  • the final disparity is obtained by a linear mapping of the resulting map to a range of -30 to +30.
  • a linear mapping is performed. This is done, for example, because the polynomial mapping of "D" may have modified the "target range” that had been selected when mapping "x" (the selected layers) to "y". That is, the polynomial fit can have the effect of changing the relative separation.
  • Linear mapping then takes those modified values and maps to a final range that is desired based on one or more of a variety of factors, such as, for example, the size of the medium (for example, a TV, a projector, or a movie screen).
  • warping to produce a stereo pair (780) is performed, in various implementations, using the reference image and a suitably remapped disparity map (sparse or dense). Certain implementations use, for example, the reference image itself as one view of the stereo pair, and warp the reference image to obtain the other view.
  • Other implementations, however, obtain the stereo pair by applying warping to the reference view to obtain two new views that are then used as a stereo pair. The latter process of obtaining two new views is performed, in various implementations, using two warping steps.
  • One warping step uses the disparity map (for example, the remapped disparity map of section 2.5.).
  • Another warping step uses a sign-changed disparity map.
  • a sign-changed disparity map is a disparity map obtained by changing the sign (multiplying by negative one) of every disparity value in the map. This is similar to treating the monocular reference as a center view and generating two views on either side for stereo viewing.
  • Warping is performed in various different ways in different implementations.
  • the following implementations present several such examples.
  • a dense disparity map typically defines a horizontal displacement for each pixel.
  • Forward warping: in various implementations, for every pixel with known disparity, we determine the position in the other view as R(x + d, y) = L(x, y), where:
  • L is the reference/left view
  • R is the generated right/other view
  • x and y give the pixel location, and
  • d is the disparity at the position indicated by x and y.
  • the undetermined or occlusion pixels can be filled using, for example, the interpolation and/or in-painting schemes discussed below.
  • Backward warping: in various implementations, for each location in the target new image we determine a source pixel in the reference image, interpolating as required. For either forward or backward warping, interpolation and/or in-painting schemes, for example, are used in various implementations to fill undetermined or occluded pixels.
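  • As an illustration of forward warping with a simple hole fill, the following Python sketch shifts each reference pixel by its disparity (using the convention R(x + d, y) = L(x, y) from above) and then fills undetermined pixels by repeating the last known color along the row. The convention, the rounding, and the row-wise fill are illustrative; practical implementations also handle depth ordering and use interpolation or in-painting for the holes.

        import numpy as np

        def forward_warp(left, disp):
            # left: HxWx3 reference image; disp: HxW disparity map (in pixels).
            h, w = disp.shape
            right = np.zeros_like(left)
            filled = np.zeros((h, w), dtype=bool)
            for y in range(h):
                for x in range(w):
                    xr = int(round(x + disp[y, x]))
                    if 0 <= xr < w:
                        right[y, xr] = left[y, x]
                        filled[y, xr] = True
                # Fill holes on this row by repeating the last known pixel color.
                last = None
                for x in range(w):
                    if filled[y, x]:
                        last = right[y, x].copy()
                    elif last is not None:
                        right[y, x] = last
            return right
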
  • TPS thin-plate splines
  • TPS have been used for producing smooth warping functions in the literature.
  • a sparse set of control point locations in the original and target (warped) image is determined, a warping function is estimated, and interpolation is used to derive the locations for all pixels.
  • thin-plate splines can often be understood as bending a continuous, flexible, and non-brittle material based on the control points. The surface bend defines the warping function.
  • discontinuities or sharp surface deviations are not captured effectively using TPS due to constraints on surface smoothness/bending.
  • One advantage of TPS is that large gaps are not generated in the warped image.
  • TPS typically is limited in its ability to effect sharp depth discontinuities.
  • TPS is applied in various implementations when, for example, a scene contains smooth depth variations and/or small depth discontinuities.
  • Certain implementations use an automatic scheme to determine the applicability of TPS over other warping methods.
  • Various such implementations generate one or more measures using the gradient of the disparity map, and base the decision of whether to use TPS on the value of the measure(s).
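  • As an illustration of a thin-plate-spline style warp, the following Python sketch fits a smooth function to the displacements at a sparse set of control points and evaluates it at every pixel, using SciPy's radial basis function interpolator with the thin-plate-spline kernel. The control points and their displacements would come from the disparity map; the function name and grid layout are illustrative, and large images may call for the interpolator's neighbor-limited mode.

        import numpy as np
        from scipy.interpolate import RBFInterpolator

        def tps_displacement_field(src_pts, disp_at_pts, height, width):
            # src_pts: N x 2 control point locations (row, col); disp_at_pts: N displacements.
            tps = RBFInterpolator(src_pts, disp_at_pts, kernel='thin_plate_spline')
            ys, xs = np.mgrid[0:height, 0:width]
            grid = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
            # Dense, smooth horizontal displacement for every pixel.
            return tps(grid).reshape(height, width)
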
  • n th degree polynomials can be fit to map control points from the original image to the target image, as is known in the literature.
  • the fit polynomial is used, in various implementations, to determine the locations for all pixels.
  • Piecewise linear or spatially varying multiple polynomials are used in various implementations.
  • 2.7. Stereo rendering
  • the stereo pair is provided for viewing/display, in various implementations, by rendering the stereo pair produced in section 2.6. above.
  • Other implementations make further adjustments prior to providing the stereo pair for viewing/display.
  • certain implementations change the disparity by shifting the images relative to each other. This is used, in various applications, to adapt to personal viewing preferences, size of display medium, or distance to display.
  • a scene has four depth layers, which are represented as four disparity layers.
  • Those four values of disparity/depth can be chosen in different ways, and different implementations use different values.
  • a first implementation uses [-1, -2, -3, -4] as the disparity values to make the corresponding objects pop out of the screen.
  • a second implementation uses [1, 2, 3, 4] as the disparity values to make the corresponding objects appear "inside".
  • a third implementation uses [2, 4, 6, 8] as the disparity values to exaggerate the relative separation.
  • Referring to FIGS. 8-10, various implementations are displayed. These implementations provide a pipeline within a conversion system.
  • FIG. 8 provides the process 800a and the process 800b. Each is discussed in turn.
  • the process 800a uses stereo matching, and various implementations are described in section 2.1. above.
  • the process 800a does not specifically recite any operations for estimating camera parameters, correcting a depth map, refining a depth map, or rescaling/remapping a depth map.
  • Certain implementations produce an adequate disparity map directly from stereo matching, and are able to avoid these operations.
  • Various implementations, however, do include one or more of these operations.
  • the process 800a includes receiving input video (810a).
  • the operation 810a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800a performs stereo matching (820a).
  • the operation 820a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800a obtains a depth map (840a).
  • the operation 840a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Various implementations of stereo matching (820a) and obtaining a depth map (840a) are described in section 2.1. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
  • the operation 840a is performed, in various implementations, by obtaining a disparity map rather than a depth map. Indeed, such a disparity map is obtained, in particular implementations, directly from the stereo matching of the operation 820a.
  • the process 800a warps the original image to obtain a stereo pair (880a).
  • the operation 880a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Note that the operation 880a is performed, in various implementations, by using a disparity map obtained in the operation 840a. Indeed, in particular implementations, the disparity map obtained from stereo matching is directly applied to the original image to warp the image and create a new image.
  • the process 800a renders a stereo pair (890a).
  • the operation 890a is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800b uses feature matching and camera parameter estimation, and various implementations are described in section 2.2. above.
  • the process 800b does not specifically recite any operations for correcting or refining a depth map. Certain implementations produce an adequate depth map directly from feature matching, and are able to avoid these operations. Various implementations, however, do include one or more of these operations.
  • the process 800b includes receiving input video (810b).
  • the operation 810b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800b performs dense feature matching (820b).
  • the operation 820b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Other implementations of the process 800b use sparse feature matching in place of, or in addition to, dense feature matching.
  • the process 800b estimates camera parameters (830b).
  • the operation 830b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800b obtains a depth map (840b).
  • the operation 840b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2.
  • Various implementations of dense (820b) and/or sparse feature matching, estimating camera parameters (830b), and obtaining a depth map (840b) are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
  • the process 800b performs depth map rescaling/remapping to obtain a disparity map (870b).
  • Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map.
  • the operation 870b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800b warps the original image to obtain a stereo pair (880b).
  • the operation 880b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 800b renders a stereo pair (890b).
  • the operation 890b is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • FIG. 9 provides two processes that include depth refinement.
  • FIG. 9 provides a process 900a showing the use of sparse feature matching with depth refinement.
  • FIG. 9 also provides a process 900b showing the use of dense feature matching with depth refinement. Neither the process 900a nor the process 900b specifically recites an operation for correcting a depth map, although various implementations include such an operation.
  • the process 900a includes receiving input video (910a).
  • the operation 910a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900a performs sparse feature matching (920a).
  • the operation 920a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Other implementations of the process 900a use stereo matching in place of, or in addition to, sparse feature matching (920a).
  • the process 900a estimates camera parameters (930a).
  • the operation 930a is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900a obtains a depth map (940a).
  • the operation 940a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Various implementations of sparse feature matching (920a), estimating camera parameters (930a), and obtaining a depth map (940a), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
  • the process 900a refines a depth map (960a).
  • the operation 960a is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900a performs depth map rescaling/remapping to obtain a disparity map (970a).
  • Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map.
  • the operation 970a is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900a warps the original image to obtain a stereo pair (980a).
  • the operation 980a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900a renders a stereo pair (990a).
  • the operation 990a is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
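  • As a hedged illustration (the specific filter choices are assumptions, not the refinement scheme recited for the operation 960a), a depth map could be refined with simple statistical filters, for example by clipping outliers to robust percentile bounds and applying a median filter:

```python
# A minimal sketch of statistical depth-map refinement: clip outliers to robust
# percentile bounds, then median-filter to suppress isolated spikes while
# roughly preserving depth edges.
import cv2
import numpy as np

def refine_depth(depth_map, low_pct=2.0, high_pct=98.0, ksize=5):
    depth = depth_map.astype(np.float32)
    lo, hi = np.percentile(depth, [low_pct, high_pct])
    depth = np.clip(depth, lo, hi)          # remove statistical outliers
    return cv2.medianBlur(depth, ksize)     # ksize 3 or 5 supports float input
```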
  • the process 900b includes receiving input video (910b).
  • the operation 910b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2.
  • the process 900b performs dense feature matching (920b).
  • the operation 920b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other
  • implementations of the process 900b use stereo matching in place of, or in addition to, dense feature matching (920b).
  • the process 900b estimates camera parameters (930b).
  • the operation 930b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900b obtains a depth map (940b).
  • the operation 940b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2.
  • Various implementations of dense (920b) feature matching, estimating camera parameters (930b), and obtaining a depth map (940b), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
  • the process 900b refines a depth map (960b).
  • the operation 960b is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900b performs depth map rescaling/remapping to obtain a disparity map (970b).
  • Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map.
  • the operation 970b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900b warps the original image to obtain a stereo pair (980b).
  • the operation 980b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 900b renders a stereo pair (990b).
  • the operation 990b is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • FIG. 10 provides two processes that include depth correction.
  • FIG. 10 provides a process 1000a showing the use of sparse feature matching with depth correction.
  • FIG. 10 also provides a process 1000b showing the use of dense feature matching with depth correction.
  • Various implementations use stereo matching instead of feature matching in one or more of the processes 1000a and 1000b.
  • the process 1000a includes receiving input video (1010a).
  • the operation 1010a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a performs sparse feature matching (1020a).
  • the operation 1020a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Other implementations of the process 1000a use stereo matching in place of, or in addition to, sparse feature matching (1020a).
  • the process 1000a estimates camera parameters (1030a).
  • the operation 1030a is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2.
  • the process 1000a obtains a depth map (1040a).
  • the operation 1040a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a corrects a depth map (1050a).
  • the operation 1050a is performed, for various implementations, as described (i) with respect to the operation 750 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a refines a depth map (1060a).
  • the operation 1060a is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a performs depth map rescaling/remapping to obtain a disparity map (1070a).
  • Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map.
  • the operation 1070a is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a warps the original image to obtain a stereo pair (1080a).
  • the operation 1080a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000a renders a stereo pair (1090a).
  • the operation 1090a is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2.
  • the process 1000b includes receiving input video (1010b).
  • the operation 1010b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000b performs dense feature matching (1020b).
  • the operation 1020b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • Other implementations of the process 1000b use stereo matching in place of, or in addition to, dense feature matching (1020b).
  • the process 1000b estimates camera parameters (1030b).
  • the operation 1030b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2.
  • the process 1000b obtains a depth map (1040b).
  • the operation 1040b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2.
  • Various implementations of dense (1020b) feature matching, estimating camera parameters (1030b), and obtaining a depth map (1040b) are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
  • the process 1000b corrects a depth map (1050b).
  • the operation 1050b is performed, for various implementations, as described (i) with respect to the operation 750 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000b refines a depth map (1060b).
  • the operation 1060b is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2.
  • the process 1000b performs depth map rescaling/remapping to obtain a disparity map (1070b).
  • Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map.
  • the operation 1070b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000b warps the original image to obtain a stereo pair (1080b).
  • the operation 1080b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • the process 1000b renders a stereo pair (1090b).
  • the operation 1090b is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
  • FIGS. 11-13 illustrate various types of user interaction with the system 600 of FIG. 6, according to particular implementations.
  • FIG. 11 shows examples of user interaction at the input level.
  • FIG. 12 shows examples of user interaction with the different modules of a pipeline.
  • FIG. 13 shows examples of user interaction with a viewing medium.
  • the system 1100 includes an input source 1110, a pipeline 1120, a viewing medium 1130, and a user 1140.
  • the system 1100 is the same as the system 600 of FIG. 6.
  • FIG. 11 also shows examples of user interactions with the input source 1110.
  • the examples of user interactions are indicated as tasks 1141-1149.
  • the system 1100 allows the user 1140 to perform none, some, or all of the tasks 1141-1149.
  • FIG. 11 illustrates the following tasks: - (i) a task 1141 of changing a video format for an input image,
  • the system 1200 includes an input source 1210, a pipeline 1220, a viewing medium 1230, and a user 1240.
  • the system 1200 is the same as the system 600 of FIG. 6 and/or the same as the system 1100 of FIG. 11.
  • FIG. 12 also shows examples of user interactions with different modules of the pipeline 1220.
  • FIG. 12 shows the pipeline 1220 including a variety of modules, as described elsewhere in this application.
  • the modules include:
  • a matching module 1221 as described, for example, with respect to the operation 720 and/or section 2.1. or section 2.2. above,
  • FIG. 12 also shows examples of user interactions with the pipeline 1220.
  • the user 1240 selects, in various implementations, which modules of the pipeline 1220 to use.
  • the user 1240 interacts, in various implementations, with none, some, or all of the modules 1221-1227.
  • the user 1240 performs, in various implementations, none, some, or all of the possible interactions.
  • a common additional interaction in various
  • implementations is to simply inspect the current parameters for one or more of the modules 1221-1227.
  • the system 1200 includes examples of user interactions, referred to herein as tasks 1241a-1247, associated with respective modules 1221-1227 of the pipeline 1220.
  • the system 1200 includes the following tasks, details of which are provided for certain implementations in the discussion of those implementations elsewhere in this application:
  • a task 1241a of selecting a matching scheme such as, for example, stereo matching, sparse feature matching, or dense feature matching,
  • - (iii) Associated with the depth refinement module 1224, three tasks are shown, including: - (a) a task 1244a of modifying segmentation, such as, for example, object segmentation,
  • a task 1245a of selecting functions such as, for example, linear, log, exponential, or polynomial mapping functions
  • a task 1246a of selecting a warping method such as, for example, shifting, TPS, or polynomial warps, and
  • a single task 1247 is shown.
  • the task 1247 is adjusting the stereo pair to a particular viewing medium.
  • the system 1300 includes an input source 1310, a pipeline 1320, a viewing medium 1330, and a user 1340.
  • the system 1300 is the same as the system 600 of FIG. 6, the system 1100 of FIG. 11, and/or the system 1200 of FIG. 12.
  • FIG. 13 also shows examples of user interactions with the viewing medium 1330.
  • the system 1300 includes the following tasks, details of which are provided for certain implementations in the discussion of those implementations elsewhere in this application:
  • a task 1342 is shown for selecting a 3D viewing method.
  • the user 1340 selects, in particular implementations, a suitable 3D viewing format.
  • Such formats include, for example, anaglyph or interleaved.
  • a task 1344 is shown for selecting one or more parts of a converted video to view and/or inspect.
  • the video that is being inspected is, for example, a warped image such as that created in the operation 780 of the process 700.
  • the user 1340 can choose to view only parts of the result (for example, the warped image) and interact again with the input video or the pipeline to enhance the results until the results are satisfactory.
  • the system 1300 forms a loop in which the user 1340 can perform multiple iterative passes over the different units of the system 1300 in order to achieve satisfactory results.
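  • As a hedged illustration of the 3D viewing formats mentioned above for the task 1342 (anaglyph or interleaved), a stereo pair could be composited for display as follows; the function names and the assumption of RGB channel ordering are hypothetical:

```python
# Minimal sketches of two simple 3D viewing formats for a left/right image pair.
import numpy as np

def render_anaglyph(left_rgb, right_rgb):
    # Red channel from the left view, green/blue channels from the right view.
    anaglyph = right_rgb.copy()
    anaglyph[..., 0] = left_rgb[..., 0]
    return anaglyph

def render_row_interleaved(left_rgb, right_rgb):
    # Even rows from the left view, odd rows from the right view.
    out = left_rgb.copy()
    out[1::2] = right_rgb[1::2]
    return out
```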
  • a process 1400 is shown for providing a stereoscopic image pair.
  • Various implementations of the process 1400 include, for example, the processes 700, 800a, 800b, 900a, 900b, 1000a, and 1000b.
  • the process 1400 includes accessing a particular image from a first view (1410).
  • the process 1400 includes determining disparity values for multiple pixels of the particular image (1420).
  • Various implementations determine the disparity values using a processor-based algorithm.
  • a processor-based algorithm includes any algorithm operating on, or suited to be operated on, a processor. Such algorithms include, for example, fully automated algorithms and will generally include semi-automated algorithms. Processor-based algorithms permit user input to be received.
  • the process 1400 includes warping the particular image to a second view based on the disparity values, to produce a warped image from the second view (1430).
  • warping as used in this application is intended to be a broad term that includes any mechanism to convert an image from a first view to a second view.
  • the process 1400 includes providing the particular image and the warped image as a three-dimensional stereoscopic pair of images (1440).
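  • As a hedged sketch of the warping operation 1430 (one simple possibility, not a prescribed method), the second view could be produced by shifting each pixel of the particular image horizontally by its disparity; the helper name warp_to_second_view is hypothetical, and occlusion holes left by the shift would typically be filled afterwards:

```python
# A minimal forward-warping sketch: each pixel is shifted horizontally by its
# per-pixel disparity (in pixels) to form the second view. Collisions are
# resolved arbitrarily here; a depth-ordered warp would resolve them by nearness.
import numpy as np

def warp_to_second_view(image, disparity):
    h, w = image.shape[:2]
    warped = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)   # marks pixels that received a value
    xs = np.arange(w)
    for y in range(h):
        new_x = np.clip(xs + np.round(disparity[y]).astype(int), 0, w - 1)
        warped[y, new_x] = image[y, xs]
        filled[y, new_x] = True
    return warped, filled
```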
  • a system or apparatus 1500 is shown, to which the features and principles described above may be applied.
  • the system or apparatus 1500 may be, for example, a system for transmitting a signal using any of a variety of media, such as, for example, satellite, cable, telephone-line, terrestrial broadcast, infra-red, or radio frequency.
  • the system or apparatus 1500 also, or
  • the system or apparatus 1500 is capable of generating and delivering, for example, video content and other content, for use in, for example, providing a 2D or 3D video presentation. It should also be clear that the blocks of FIG. 15 provide a flow diagram of a process, in addition to providing a block diagram of a system or apparatus.
  • the system or apparatus 1500 receives an input video sequence from a processor 1501.
  • the processor 1501 is part of the system or apparatus 1500.
  • the input video sequence is, in various implementations, (i) an original input video sequence as described, for example, with respect to the input source 610, and/or (ii) a sequence of 3D stereoscopic image pairs as described, for example, with respect to the output of the pipeline 620.
  • the processor 1501 is configured, in various implementations, to perform one or more of the methods described in this application.
  • the processor 1501 is configured for performing one or more of the process 700, the process 800a, the process 800b, the process 900a, the process 900b, the process 1000a, the process 1000b, or the process 1400.
  • the system or apparatus 1500 includes an encoder 1502 and a transmitter/receiver 1504.
  • the encoder 1502 receives, for example, one or more input images from the processor 1501.
  • the encoder 1502 generates an encoded signal(s) based on the input signal and, in certain implementations, metadata information.
  • the encoder 1502 may be, for example, an AVC encoder.
  • the AVC encoder may be applied to both video and other information.
  • the encoder 1502 may include sub-modules, including for example an assembly unit for receiving and assembling various pieces of information into a structured format for storage or transmission.
  • the various pieces of information may include, for example, coded or uncoded video, and coded or uncoded elements such as, for example, motion vectors, coding mode indicators, and syntax elements.
  • the encoder 1502 includes, in various implementations, the processor 1501 and therefore performs the operations of the processor 1501.
  • the transmitter/receiver 1504 receives the encoded signal(s) from the encoder 1502 and transmits the encoded signal(s) in one or more output signals.
  • Typical transmitters perform functions such as, for example, one or more of providing error-correction coding, interleaving the data in the signal, randomizing the energy in the signal, and modulating the signal onto one or more carriers using a modulator/demodulator 1506.
  • the transmitter/receiver 1504 may include, or interface with, an antenna (not shown). Further, implementations of the transmitter/receiver 1504 may be limited to the modulator/demodulator 1506.
  • the system or apparatus 1500 is also communicatively coupled to a storage unit 1508.
  • the storage unit 1508 is coupled to the encoder 1502, and stores an encoded bitstream from the encoder 1502.
  • the storage unit 1508 is coupled to the transmitter/receiver 1504, and stores a bitstream from the transmitter/receiver 1504.
  • the bitstream from the transmitter/receiver 1504 may include, for example, one or more encoded bitstreams that have been further processed by the transmitter/receiver 1504.
  • the storage unit 1508 is, in different implementations, one or more of a standard DVD, a Blu-Ray disc, a hard drive, or some other storage device.
  • the system or apparatus 1500 is also communicatively coupled to a presentation device 1509, such as, for example, a television, a computer, a laptop, a tablet, or a cell phone.
  • Various implementations provide the presentation device 1509 and the processor 1501 in a single integrated unit, such as, for example, a tablet or a laptop.
  • the processor 1501 provides an input to the presentation device 1509.
  • the input includes, for example, a video sequence intended for processing with a 2D-to-3D conversion algorithm.
  • the presentation device 1509 is, in various implementations, the viewing medium 630.
  • the input includes, as another example, a stereoscopic video sequence prepared using, in part, a conversion process described in this application.
  • Referring to FIG. 16, a system or apparatus 1600 is shown to which the features and principles described above may be applied.
  • the system or apparatus 1600 may be configured to receive signals over a variety of media, such as, for example, satellite, cable, telephone-line, terrestrial broadcast, infra-red, or radio frequency.
  • the signals may be received, for example, over the Internet or some other network, or by line-of-sight.
  • the blocks of FIG. 16 provide a flow diagram of a process, in addition to providing a block diagram of a system or apparatus.
  • the system or apparatus 1600 may be, for example, a cell-phone, a computer, a tablet, a set-top box, a television, a gateway, a router, or other device that, for example, receives encoded video content and provides decoded video content for processing.
  • the system or apparatus 1600 is capable of receiving and processing content information, and the content information may include, for example, video images and/or metadata.
  • the system or apparatus 1600 includes a transmitter/receiver 1602 for receiving an encoded signal, such as, for example, the signals described in the implementations of this application.
  • the transmitter/receiver 1602 receives, in various implementations, for example, a signal providing one or more of a signal output from the system 1500 of FIG. 15, or a signal providing a transmission of a video sequence such as, for example, a 2D or 3D video sequence intended for display on the viewing medium 630.
  • Typical receivers perform functions such as, for example, one or more of receiving a modulated and encoded data signal, demodulating the data signal from one or more carriers using a modulator/demodulator 1604, de-randomizing the energy in the signal, de-interleaving the data in the signal, and error-correction decoding the signal.
  • the transmitter/receiver 1602 may include, or interface with, an antenna (not shown). Implementations of the transmitter/receiver 1602 may be limited to the modulator/demodulator 1604.
  • the system or apparatus 1600 includes a decoder 1606.
  • the transmitter/receiver 1602 provides a received signal to the decoder 1606.
  • the signal provided to the decoder 1606 by the transmitter/receiver 1602 may include one or more encoded bitstreams.
  • the decoder 1606 outputs a decoded signal, such as, for example, a decoded display plane.
  • the decoder 1606 is, in various implementations, for example, an AVC decoder.
  • the system or apparatus 1600 is also communicatively coupled to a storage unit 1607.
  • the storage unit 1607 is coupled to the decoder 1606, and the decoder 1606 accesses a bitstream from the storage unit 1607.
  • the bitstream accessed from the storage unit 1607 includes, in different implementations, one or more encoded bitstreams.
  • the storage unit 1607 is, in different implementations, one or more of a standard DVD, a Blu-Ray disc, a hard drive, or some other storage device.
  • the output video from the decoder 1606 is provided, in one implementation, to a processor 1608.
  • the processor 1608 is, in one implementation, a processor configured for performing, for example, all or part of the process 700, the process 800a, the process 800b, the process 900a, the process 900b, the process 1000a, the process 1000b, or the process 1400. In another implementation, the processor 1608 is configured for performing one or more other post-processing operations.
  • the decoder 1606, in various implementations, includes the processor 1608 and therefore performs the operations of the processor 1608.
  • the processor 1608 is part of a downstream device such as, for example, a set-top box, a tablet, a router, or a television. More generally, the processor 1608 and/or the system or apparatus 1600 are, in various implementations, part of a gateway, a router, a set-top box, a tablet, a television, or a computer.
  • the processor 1608 is also communicatively coupled to a presentation device 1609, such as, for example, a television, a computer, a laptop, a tablet, or a cell phone.
  • Various implementations provide the presentation device 1609 and the processor 1608 in a single integrated unit, such as, for example, a tablet or a laptop.
  • the processor 1608 provides an input to the presentation device 1609.
  • the input includes, for example, a video sequence intended for processing with a 2D-to-3D conversion algorithm.
  • the presentation device 1609 is, in various implementations, the viewing medium 630.
  • the input includes, as another example, a stereoscopic video sequence prepared using, in part, a conversion process described in this application.
  • the system or apparatus 1600 is also configured to receive input from a user or other input source.
  • the input is received, in typical implementations, by the processor 1608 using a mechanism not explicitly shown in FIG. 16.
  • the input mechanism includes, in various implementations, a mouse or a microphone.
  • the input is received through the presentation device 1609, such as, for example, when the presentation device is a touch screen.
  • the input includes user input as described, for example, with respect to FIGS. 11-13.
  • the system or apparatus 1600 is also configured to provide a signal that includes data, such as, for example, a video sequence to a remote device.
  • the signal is, for example, modulated using the modulator/demodulator 1604 and transmitted using the transmitter/receiver 1602.
  • the system or apparatus 1500 is further configured to receive input, such as, for example, a video sequence.
  • the input is received by the transmitter/receiver 1506, and provided to the processor 1501.
  • the processor 1501 performs a 2D-to-3D conversion process on the input.
  • the operations performed by the pipeline 620 are, in various implementations, performed by a single processor. In other implementations, the operations are performed by multiple processors working in a collective manner to provide an output result.
  • Various implementations provide one or more of the following advantages and/or features:
  • Depth is explicitly calculated/determined, or subsequently derived/estimated, for each image/scene (and, for certain implementations, each pixel), rather than, for example, using a restricted set of depth models.
  • For example, one such depth model assumes that the lower half of an image has depth such that it is always closer to the viewer than the top half of the image.
  • Another depth model uses a box structure in which the central part of the image is placed at a greater depth to the viewer than the other parts of the image.
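  • Purely to illustrate the kind of restricted depth model that such implementations avoid (these models are not part of the described pipeline), the two models mentioned above could be written as follows:

```python
# Illustrative restricted depth models (avoided by the described implementations):
# a vertical ramp that always places the bottom of the image nearer than the top,
# and a "box" model that places the image centre at a greater depth than the borders.
import numpy as np

def ramp_depth_model(h, w):
    # Depth 1.0 (far) at the top row, decreasing to 0.0 (near) at the bottom row.
    return np.tile(np.linspace(1.0, 0.0, h)[:, None], (1, w))

def box_depth_model(h, w):
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized distance from the nearest border; the centre gets the largest depth.
    border = np.minimum(np.minimum(ys, h - 1 - ys), np.minimum(xs, w - 1 - xs))
    return border / max(border.max(), 1)
```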
  • a fully automatic and near-real time system is provided for converting a 2D image to a 3D stereoscopic image pair. Such implementations avoid the need to have a human user/expert who can mark the objects and assign depth values to them.
  • a semi-automatic system allowing user input is provided.
  • Such implementations typically have the ability to increase accuracy in certain images or parts of images.
  • changes are often desired in the resultant 3D content for special effects (such as, for example, to bring a certain part of an image into greater user attention).
  • the depth map is processed as a 2D image and converted to a suitable disparity map for warping, rather than, for example, explicitly reconstructing the 3D points and rendering from differing viewpoints.
  • Various such implementations avoid the difficulties that are often encountered in reconstructing a sparse set of 3D points corresponding to a set of scene points which consistently occur over a certain duration in a given image sequence.
  • the sparse set of 3D points provides knowledge of depth of these specific points but not of other pixels in the image.
  • the distribution of these pixels is often non-regular and highly sparse. Accordingly, interpolating depth to the other pixels frequently does not result in a depth map that closely matches the scene.
  • Such a depth map typically leads to less than the desired quality in a generated stereo pair or in a rendering of the complete image.
  • Various implementations rescale or remap depth/disparity to change the relative placement of depth/disparity layers in a scene. This is useful, for example, in applications in which one or more of the following is desired: (a) accommodating a user preference for depth range, (b) allowing content rendering based on viewing medium (size of medium, distance from medium to viewer, or other aspects), and/or (c) content modification for special effects such as, for example, viewer attention and focus. It is noted that some implementations have particular advantages, or disadvantages, when compared with other implementations.
  • Various implementations generate or process signals and/or signal structures. Such signals are formed, in certain implementations, using pseudo-code or syntax. Signals are produced, in various implementations, at the outputs of (i) the stereo rendering operations 790, 890a, 890b, 990a, 990b, 1090a, 1090b, or 1440, (ii) any of the processors 1501 and 1608, (iii) the encoder 1502, (iv) any of the transmitter/receivers 1504 and 1602, or (v) the decoder 1606.
  • the signal and/or the signal structure is transmitted and/or stored (for example, on a processor-readable medium) in various implementations.
  • This application provides multiple block/flow diagrams, including the block/flow diagrams of FIGS. 6-16. It should be clear that the block/flow diagrams of this application present both a flow diagram describing a process, and a block diagram describing functional blocks of an apparatus, device, or system.
  • FIGS. 1-5 present a visual representation of a feature or concept.
  • FIGS. 1-5 also present a visual representation of a device and/or process related to the feature or concept that is depicted. Additionally, many of the operations, blocks, inputs, or outputs of the
  • AVC refers to the existing International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group-4 (MPEG-4) Part 10 Advanced Video Coding (AVC) standard/International Telecommunication Union, Telecommunication Sector (ITU-T) H.264 Recommendation (referred to in this application as the “H.264/MPEG-4 AVC Standard” or variations thereof, such as the “AVC standard”, the “H.264 standard”, “H.264/AVC”, or simply “AVC” or “H.264”).
  • these implementations and features may be used in the context of another standard (existing or future), or in a context that does not involve a standard.
  • the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
  • Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • This application or its claims may refer to "providing" information from, for example, a first device (or location) to a second device (or location). This application or its claims may also, or alternatively, refer, for example, to “receiving” the information.
  • Such “providing” or “receiving” is understood to include, at least, direct and indirect connections.
  • intermediaries between the first and second devices (or locations) are contemplated and within the scope of the terms “providing” and “receiving”. For example, if the information is provided from the first location to an intermediary location, and then provided from the intermediary location to the second location, then the information has been provided from the first location to the second location. Similarly, if the information is received at an intermediary location from the first location, and then received at the second location from the intermediary location, then the information has been received from the first location at the second location.
  • Receiving is, as with “accessing”, intended to be a broad term.
  • Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
  • receiving is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • “image” and/or “picture” are used interchangeably throughout this document, and are intended to be broad terms.
  • An “image” or a “picture” may be, for example, all or part of a frame or of a field.
  • video refers to a sequence of images (or pictures).
  • An image or a picture may include, for example, any of various video components or their combinations.
  • Such components include, for example, luminance, chrominance, Y (of YUV or YCbCr or YPbPr), U (of YUV), V (of YUV), Cb (of YCbCr), Cr (of YCbCr), Pb (of YPbPr), or Pr (of YPbPr).
  • An "image” or a “picture” may also, or alternatively, refer to various different types of content, including, for example, typical two-dimensional video, a disparity map for a 2D video picture, a depth map that corresponds to a 2D video picture, or an edge map.
  • a “mask”, or similar terms, is also intended to be a broad term.
  • a mask generally refers, for example, to a picture that includes a particular type of information.
  • a mask may include other types of information not indicated by its name.
  • a background mask, or a foreground mask typically includes information indicating whether pixels are part of the foreground and/or background.
  • such a mask may also include other information, such as, for example, layer information if there are multiple foreground layers and/or background layers.
  • masks may provide the information in various formats, including, for example, bit flags and/or integer values.
  • a "map” (for example, a “depth map”, a “disparity map”, or an “edge map”), or similar terms, are also intended to be broad terms.
  • a map generally refers, for example, to a picture that includes a particular type of information.
  • a map may include other types of information not indicated by its name.
  • a depth map typically includes depth information, but may also include other information such as, for example, video or edge information.
  • maps may provide the information in various formats, including, for example, bit flags and/or integer values.
  • in the case of phrasing such as “A, B, and/or C” or “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
  • implementations may be implemented in one or more of an encoder (for example, the encoder 1502), a decoder (for example, the decoder 1606), a post-processor (for example, the processor 1608) processing output from a decoder, or a pre-processor (for example, the processor 1501 ) providing input to an encoder.
  • the processors discussed in this application do, in various implementations, include multiple processors (sub-processors) that are collectively configured to perform, for example, a process, a function, or an operation.
  • the processor 1501 and the processor 1608 are each, in various implementations, composed of multiple sub-processors that are collectively configured to perform the operations of the respective processors 1501 and 1608.
  • other implementations are contemplated by this disclosure.
  • the implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal.
  • An apparatus may be implemented in, for example, appropriate hardware, software, and firmware.
  • the methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a set-top box, a gateway, a router, a microprocessor, an integrated circuit, or a programmable logic device.
  • processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), tablets, laptops, and other devices that facilitate communication of information between end-users.
  • a processor may also include multiple processors that are collectively configured to perform, for example, a process, a function, or an operation.
  • the collective configuration and performance may be achieved using any of a variety of techniques known in the art, such as, for example, use of dedicated sub-processors for particular tasks, or use of parallel processing.
  • Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with inpainting, background estimation, rendering additional views, 2D-to-3D conversion, data encoding, data decoding, and other processing of images or other content.
  • equipment include a processor, an encoder, a decoder, a post-processor processing output from a decoder, a pre-processor providing input to an encoder, a video coder, a video decoder, a video codec, a web server, a set-top box, a laptop, a personal computer, a tablet, a router, a cell phone, a PDA, and other communication devices.
  • the equipment may be mobile and even installed in a mobile vehicle. Additionally, the methods may be implemented by instructions being performed by a processor (or by multiple processors collectively configured to perform such instructions), and such instructions (and/or data values produced by an implementation) may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette (“CD”), an optical disc (such as, for example, a DVD, often referred to as a digital versatile disc or a digital video disc), a random access memory (“RAM”), or a read-only memory (“ROM”).
  • the instructions may form an application program tangibly embodied on a processor-readable medium.
  • Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two.
  • a processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a processor-readable medium (such as a storage device) having instructions for carrying out a process. Further, a processor-readable medium may store, in addition to or in lieu of instructions, data values produced by an implementation.
  • implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted.
  • the information may include, for example, instructions for performing a method, or data produced by one of the described implementations.
  • a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
  • the formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
  • the information that the signal carries may be, for example, analog or digital information.
  • the signal may be stored on a processor-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

Various implementations provide a pipeline for 2D-to-3D conversion. Particular implementations use the pipeline to produce stereoscopic image pairs from 2D images in a video sequence. According to a general aspect, a particular image from a first view is accessed. Disparity values are determined for multiple pixels of the particular image using a processor-based algorithm. The particular image is warped to a second view based on the disparity values, to produce a warped image from the second view. The particular image and the warped image are provided as a three-dimensional stereo pair of images.

Description

GENERATING AN IMAGE FOR ANOTHER VIEW
TECHNICAL FIELD
Implementations are described that relate to image content. Various particular implementations relate to generating a stereoscopic image pair.
BACKGROUND
It is often desirable to create a stereoscopic image pair from a two-dimensional ("2D") image. Processes for creating the stereoscopic image pair suffer from a variety of drawbacks, however.
SUMMARY
According to a general aspect, a particular image from a first view is accessed. Disparity values are determined for multiple pixels of the particular image using a processor-based algorithm. The particular image is warped to a second view based on the disparity values, to produce a warped image from the second view. The particular image and the warped image are provided as a three-dimensional stereo pair of images.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Even if described in one particular manner, it should be clear that implementations may be configured or embodied in various manners. For example, an implementation may be performed as a method, or embodied as an apparatus, such as, for example, an apparatus configured to perform a set of operations or an apparatus storing instructions for performing a set of operations, or embodied in a signal. Other aspects and features will become apparent from the following detailed description considered in
conjunction with the accompanying drawings and the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a pictorial representation of an actual depth value for parallel cameras.
FIG. 2 is a pictorial representation of a disparity value.
FIG. 3 is a pictorial representation of the relationship between apparent depth and disparity.
FIG. 4 is a pictorial representation of convergent cameras.
FIG. 5 is a pictorial representation of occlusion in stereoscopic video image pairs.
FIG. 6 is a block/flow diagram depicting an implementation of an image conversion system.
FIG. 7 is a block/flow diagram depicting an implementation of an image conversion process.
FIG. 8 includes two block/flow diagrams depicting a first implementation and a second implementation of the image conversion process of FIG. 7.
FIG. 9 includes two block/flow diagrams depicting a third implementation and a fourth implementation of the image conversion process of FIG. 7.
FIG. 10 includes two block/flow diagrams depicting a fifth implementation and a sixth implementation of the image conversion process of FIG. 7.
FIG. 11 is a block/flow diagram depicting a first implementation of the image conversion system of FIG. 6.
FIG. 12 is a block/flow diagram depicting a second implementation of the image conversion system of FIG. 6.
FIG. 13 is a block/flow diagram depicting a third implementation of the image conversion system of FIG. 6.
FIG. 14 is a block/flow diagram depicting another implementation of an image conversion process.
FIG. 15 is a block/flow diagram depicting a first implementation of a
communications system. FIG. 16 is a block/flow diagram depicting a second implementation of a communications system.
DETAILED DESCRIPTION
We begin with a preview of various implementations and features. This is followed by a more detailed discussion of relevant features.
At least one implementation provides an automated process (a "pipeline") for generating a stereoscopic image pair based on a 2D image. The pipeline of this implementation estimates disparity, and warps the 2D image to create a second image. The two images are provided as the stereoscopic image pair for use in, for example, providing three-dimensional ("3D") content to a viewer.
In various implementations, we provide a framework to convert monocular (2D) video sequences to stereo (3D) video. In these implementations, we estimate a depth map for each frame in the sequence. The depth map is typically used to obtain a disparity map for generating a stereo pair. We discuss at least two schemes to obtain the initial depth map, using stereo or structure from motion techniques. The matching is, in various implementations, either sparse or dense. We present schemes to correct and refine the depth maps, including both automatic and semi-automatic schemes. Several such schemes use, for example, image information and statistical filters. The depth map is then typically converted to a disparity map using one or more of the mapping methods discussed. The mapping methods are used, for example, for the purpose of range adjustment, enhancement of the depth perception, or alteration for better viewer experience. The disparity map is then typically used to generate a stereo pair for each frame using one of the warping techniques ranging from, for example, a shifting operation to function based surface warps. The framework is implemented, in various implementations, as a fully automatic pipeline or a semiautomatic scheme with some manual interaction.
Given a sequence of frames from a monocular (2D) video, one typical goal is to generate a stereo pair for each frame. Depth is generally perceived in a stereo pair due to horizontal/lateral parallax between corresponding points in the two images (as described in more detail below). The parallax is generally
proportional to the depth of the point in the scene. Hence, in order to generate the second view with the appropriate parallax, in various implementations we use (for example, access and/or generate) knowledge about the relative depth of points in the scene. In other words, we use knowledge about the structure of the scene. This knowledge is, in various implementations, represented as a depth map or a disparity map that has, typically, the same size as the 2D image, with each pixel value typically indicating the depth of the point with respect to the current camera view. Various implementations convert the depth map to a disparity map that defines the horizontal parallax or shift for the pixel. The relation between the depth map and the disparity map is, in various
implementations, either linear or non-linear. The range of disparity is usually varied to suit the viewing screen dimension and the distance of the viewer to the screen. After the disparity map is obtained, various implementations generate the stereo pair. The stereo pair is generated using, for example, a warping technique.
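As a hedged sketch of one possible depth-to-disparity conversion (the mapping choices and the target range are assumptions, not a prescribed mapping), the depth map could be normalized and then remapped, linearly or logarithmically, into a disparity range chosen to suit the viewing screen and viewing distance:

```python
# A minimal depth-to-disparity sketch: normalize depth, optionally compress it with
# a log curve, and map it into a target disparity range [d_min, d_max] (in pixels).
import numpy as np

def depth_to_disparity(depth, d_min=-20.0, d_max=20.0, mode="linear"):
    d = depth.astype(np.float64)
    d = (d - d.min()) / max(d.max() - d.min(), 1e-9)   # normalize to [0, 1], 0 = nearest
    if mode == "log":
        d = np.log1p(9.0 * d) / np.log(10.0)           # non-linear remapping example
    # Nearer points receive the larger disparity, consistent with the inverse
    # relation between depth and disparity.
    return d_max - d * (d_max - d_min)
```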
In various implementations, we generate a stereo pair (3D) for each frame of a monocular (2D) video sequence. In doing so, several implementations use, for example, the temporal relation between frames in terms of camera/object motion or any other feature useful. Several implementations perform a depth estimation step that is based on analyzing the entire image or a sub-sample of the image. Such implementations do not restrict the examination to, for example, only the top and bottom regions of an image. Also, several implementations are not restricted to contiguous region/point examination. Further, various
implementations do not restrict the depth to a predefined set of depth models. This provides the advantage of being able to estimate depth of the actual scene more closely, as compared to the use of depth models.
In various implementations, we do not require an individual to manually segment the "objects" in a scene, nor to manually assign depth values to each object. Of course, in various implementations, if object segments are available through, for example, a manual process or an automatic tracking process, such segments are used for correction and refinement of the depth map. Several implementations fill occlusions by repeating pixel colors. However, other implementations use more general inpainting and interpolation techniques.
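As a hedged sketch of the simplest of these options (repeating pixel colors; the helper name and the boolean hole mask are assumptions), occlusion holes in a warped view could be filled from the nearest valid pixel in the same row, with a general-purpose routine such as cv2.inpaint as one alternative:

```python
# A minimal hole-filling sketch: copy the color of the nearest filled pixel to the
# left along each row; cv2.inpaint offers a more general inpainting alternative.
import numpy as np

def fill_holes_by_repetition(warped, filled):
    out = warped.copy()
    h, w = filled.shape
    for y in range(h):
        last = None
        for x in range(w):
            if filled[y, x]:
                last = out[y, x].copy()
            elif last is not None:
                out[y, x] = last
    return out
```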
Various implementations perform depth estimation without performing "object" or layer based depth estimation or assignment. Several such implementations thus operate without explicit identification of objects in a selected image and without creating a model for depth or objects. These implementations do not identify the objects in each scene nor determine the object characteristics, nor separate objects or layers.
We provide particular implementations that do not follow fully automatic schemes using a fixed restricted set of depth models. Such schemes and models are often not suitable for scenes and are often not applicable for various categories of images. For example, we provide at least one implementation that does not assume that the lower half of the image has depth such that it is always closer to the viewer than the top half. As another example, we provide at least one implementation that does not use a box structure in which the central part of the image is placed at greater depth to the viewer than the other parts.
Additionally, we provide particular implementations that do not use completely manual processes in which the user determines the depth of each image pixel or groups of pixels. Such manual processes are typically expensive in terms of time and cost. For example, we provide at least one implementation that provides a semi-automatic to fully-automatic scheme that can better determine the true structure in the given scene (as opposed to using a predetermined, possibly incorrect model). Additionally, such a scheme is, in various implementations, further refined through various parameters and/or manual determination, towards generation of a depth map and second view for stereo/3D viewing. Thus the trade-off between different degrees of human intervention allows for time, effort, and cost balancing. In at least one implementation, we propose a system for generating a second view based on a disparity map. The disparity map is, in various implementations, sparse or dense and is, in various implementations, computed using stereo or structure from motion. The disparity map is then warped to generate the second view.
Stepping back from the above preview, FIGS. 1 -5 provide a more detailed discussion of various features. In particular, depth, disparity, and occlusions, as these terms relate to various implementations are discussed.
FIG. 1 illustrates the concept of depth in a video image. FIG. 1 shows a right camera 105 with a sensor 107, and a left camera 110 with a sensor 112. Both cameras 105, 110 are capturing images of an object 115. For the purposes of illustration, object 115 is a physical cross, having an arbitrary detail 116 located on the right side of the cross (see FIG. 2). The right camera 105 has a capture angle 120, and the left camera 110 has a capture angle 125. The two capture angles 120, 125 overlap in a 3D stereo area 130.
Because the object 115 is in the 3D stereo area 130, the object 115 is visible to both cameras 105, 110, and therefore the object 115 is capable of being perceived as having a depth. The object 115 has an actual depth 135. The actual depth 135 is generally referred to as the distance from the object 115 to the cameras 105, 110. More specifically, the actual depth 135 may be referred to as the distance from the object 115 to a stereo camera baseline 140, which is the plane defined by the entrance pupil plane of both cameras 105, 110. The entrance pupil plane of a camera is typically inside a zoom lens and, therefore, is not typically physically accessible. The cameras 105, 110 are also shown having a focal length 145. The focal length 145 is the distance from the exit pupil plane to the sensors 107, 112. For the purposes of illustration, the entrance pupil plane and the exit pupil plane are shown as coincident, when in most instances they are slightly separated.
Additionally, the cameras 105, 110 are shown as having a baseline length 150. The baseline length 150 is the distance between the centers of the entrance pupils of the cameras 105, 110, and therefore is measured at the stereo camera baseline 140.
The object 115 is imaged by each of the cameras 105 and 110 as real images on each of the sensors 107 and 112. These real images include a real image 117 of the detail 116 on the sensor 107, and a real image 118 of the detail 116 on the sensor 112. As shown in FIG. 1, the real images are flipped, as is known in the art.
Depth is closely related to disparity. FIG. 2 shows a left image 205 captured from the camera 110, and a right image 210 captured from the camera 105. Both images 205, 210 include a representation of the object 115 with detail 116. The image 210 includes a detail image 217 of the detail 116, and the image 205 includes a detail image 218 of the detail 116. The far right point of the detail 116 is captured in a pixel 220 in the detail image 218 in the left image 205, and is captured in a pixel 225 in the detail image 217 in the right image 210. The horizontal difference between the locations of the pixel 220 and the pixel 225 is the disparity 230. The object images 217, 218 are assumed to be registered vertically so that the images of detail 116 have the same vertical positioning in both the images 205, 210. The disparity 230 provides a perception of depth to the object 215 when the left and right images 205, 210 are viewed by the left and right eyes, respectively, of a viewer.
FIG. 3 shows the relationship between disparity and perceived depth. Three observers 305, 307, 309 are shown viewing a stereoscopic image pair for an object on respective screens 310, 320, 330.
The first observer 305 views a left view 315 of the object and a right view 317 of the object that have a positive disparity. The positive disparity reflects the fact that the left view 315 of the object is to the left of the right view 317 of the object on the screen 310. The positive disparity results in a perceived, or virtual, object 319 appearing to be behind the plane of the screen 310.
The second observer 307 views a left view 325 of the object and a right view 327 of the object that have zero disparity. The zero disparity reflects the fact that the left view 325 of the object is at the same horizontal position as the right view 327 of the object on the screen 320. The zero disparity results in a perceived, or virtual, object 329 appearing to be at the same depth as the screen 320.
The third observer 309 views a left view 335 of the object and a right view 337 of the object that have a negative disparity. The negative disparity reflects the fact that the left view 335 of the object is to the right of the right view 337 of the object on the screen 330. The negative disparity results in a perceived, or virtual, object 339 appearing to be in front of the plane of the screen 330.
It is worth noting at this point that disparity and depth can be used interchangeably in implementations unless otherwise indicated or required by context. Using Equation 1, we know that disparity is inversely proportional to scene depth:
D = (f * b) / d    (1)
where "D" describes depth (135 in FIG. 1 ), "b" is the baseline length (150 in FIG. 1 ) between two stereo-image cameras, "f is the focal length for each camera (145 in FIG. 1 ), and "d" is the disparity for two corresponding feature points (230 in FIG. 2).
Equation 1 above is valid for parallel cameras with the same focal length. More complicated formulas can be defined for other scenarios, but in most cases Equation 1 can be used as an approximation. Additionally, however, Equation 2 below is valid for at least various arrangements of converging cameras, as is known by those of ordinary skill in the art:

D = (f * b) / (d∞ - d)    (2)

where "d∞" is the value of disparity for an object at infinity. d∞ depends on the convergence angle and the focal length, and is expressed in meters (for example) rather than in the number of pixels. Focal length was discussed earlier with respect to FIG. 1 and the focal length 145. Convergence angle is shown in FIG. 4.
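For illustration, the following is a minimal sketch of how Equations 1 and 2 might be evaluated with NumPy. It is not part of the described implementations; the function and variable names are hypothetical, and consistent units for the focal length, baseline, and disparity are assumed.

```python
import numpy as np

def depth_parallel(disparity, focal_length, baseline):
    """Equation 1 (parallel cameras): D = (f * b) / d."""
    d = np.asarray(disparity, dtype=np.float64)
    # Guard against division by zero (objects at infinite depth).
    d = np.where(np.abs(d) < 1e-9, np.nan, d)
    return (focal_length * baseline) / d

def depth_converging(disparity, focal_length, baseline, d_inf):
    """Equation 2 (converging cameras): D = (f * b) / (d_inf - d)."""
    denom = d_inf - np.asarray(disparity, dtype=np.float64)
    denom = np.where(np.abs(denom) < 1e-9, np.nan, denom)
    return (focal_length * baseline) / denom

# Example: a 50 mm focal length, 0.5 m baseline, and 10 mm disparity
# (all in meters) give a depth of 2.5 m under Equation 1.
print(depth_parallel(0.010, 0.050, 0.5))  # 2.5
```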
FIG. 4 includes the camera 105 and the camera 110 positioned in a converging configuration rather than the parallel configuration of FIG. 1. An angle 410 shows the lines of sight of the cameras 105, 110 converging, and the angle 410 may be referred to as the convergence angle.
Disparity maps are used to provide, for example, disparity information for a video image. A disparity map generally refers to a set of disparity values with a geometry corresponding to the pixels in the associated video image.
Disparity maps, or more generally, disparity information, may be used for a variety of processing operations. Such operations include, for example, view interpolation (rendering) for adjusting the 3D effect on a consumer device.
In one implementation, the 3D effect is softened (reduced) based on a user preference. To reduce the 3D effect (reduce the absolute value of the disparity), a new view is interpolated using the disparity and video images. For example, the new view is positioned at a location between the existing left view and right view, and the new view replaces one of the left view and the right view. Thus, the new stereoscopic image pair has a smaller baseline length and will have a reduced disparity, and therefore a reduced 3D effect.
In another implementation, extrapolation, rather than interpolation, is performed to exaggerate the apparent depth and thereby increase the 3D effect. In this implementation, a new view is extrapolated corresponding to a virtual camera having an increased baseline length relative to one of the original left and right views.
For many 3D processing operations, a dense disparity map is preferred over a down-sampled disparity map or other sparse disparity maps. For example, when a disparity map is used to enable user-controllable 3D-effects, disparity information on a per-pixel basis is generally preferred. The per-pixel basis disparity information generally allows better results to be achieved, because using a sparse disparity map (for example, a down-sampled disparity map) may degrade the quality of synthesized views.
Disparity, and the related depth variations, produce occlusions between different views of a scene. FIG. 5 shows a left view 510 and a right view 520 that combine, in a viewer's brain, to produce a 3D scene 530. The left view 510, the right view 520, and the 3D scene 530 each contain three objects, which include a wide cylinder 532, an oval 534, and a thin cylinder 536. However, as shown in FIG. 5, two of the three objects 532, 534, 536 are in different relative locations in each of the views 510, 520 and the 3D scene 530. Those two objects are the wide cylinder 532 and the thin cylinder 536. The oval 534 is in the same relative location in each of the views 510, 520 and the 3D scene 530.
The different relative locations produce occlusions, as explained by the following simplified discussion. The left view 510 is shown in a left image 540 that also reveals occluded areas 545 and 548. The occluded areas 545 and 548 are only visible in the left view 510 and not in the right view 520. This is because (i) the area in the right view 520 that corresponds to the occluded area 545 is covered by the wide cylinder 532, and (ii) the area in right view 520 that corresponds to the occluded area 548 is covered by the narrow cylinder 536.
Similarly, the right view 520 is shown in a right image 550 that also reveals two occluded areas 555 and 558. The occluded areas 555, 558 are only visible in the right view 520 and not in the left view 510. This is because (i) the area in the left view 510 that corresponds to the occluded area 555 is covered by the wide cylinder 532, and (ii) the area in the left view 510 that corresponds to the occluded area 558 is covered by the narrow cylinder 536. Given that occlusions may exist in a stereoscopic image pair, it is useful to provide two disparity maps for a stereoscopic image pair. In one such
implementation, a left disparity map is provided for a left video image, and a right disparity map is provided for a right video image. Known algorithms may be used to assign disparity values to pixel locations of each image for which disparity values cannot be determined using the standard disparity vector approach. Occlusion areas can then be determined by comparing the left and right disparity values.
As an example of comparing left and right disparity values, consider a left-eye image and a corresponding right-eye image. A pixel L is located in row N and has a horizontal coordinate xL in the left-eye image. Pixel L is determined to have a disparity value dL. A pixel R is located in row N of the corresponding right-eye image and has a horizontal coordinate nearest xL + dL. The pixel R is determined to have a disparity value dR of about "-dL". Then, with a high degree of confidence, there is no occlusion at L or R because the disparities correspond to each other. That is, the pixels L and R both point to each other, generally, with their determined disparities.
However, if dR is not substantially the same as -dL, then there may be an occlusion. For example, if the two disparity values are substantially different, after accounting for the sign, then there is generally a high degree of confidence that there is an occlusion. Substantial difference is indicated, in one implementation, by |dL + dR| > 1. Additionally, if one of the disparity values (either dR or dL) is unavailable, then there is generally a high degree of confidence that there is an occlusion. A disparity value may be unavailable because, for example, the disparity value cannot be determined. The occlusion generally relates to one of the two images. For example, the portion of the scene shown by the pixel associated with the disparity having the smaller magnitude, or shown by the pixel corresponding to the unavailable disparity value, is generally considered to be occluded in the other image.
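The check described above can be sketched as follows. This is an illustrative rendering rather than the implementations' own code; the array layout (disparities in pixels, NaN marking unavailable values) and the one-pixel threshold are assumptions taken from the discussion.

```python
import numpy as np

def occlusion_mask(disp_left, disp_right, threshold=1.0):
    """Flag left-image pixels whose disparity is not mirrored in the right image.

    disp_left[y, x] maps left pixel (x, y) to right pixel (x + dL, y);
    disp_right is expected to hold roughly -dL at that location.
    """
    h, w = disp_left.shape
    occluded = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            dL = disp_left[y, x]
            if np.isnan(dL):
                occluded[y, x] = True   # unavailable disparity -> likely occlusion
                continue
            xr = int(round(x + dL))     # nearest column in the right image
            if xr < 0 or xr >= w or np.isnan(disp_right[y, xr]):
                occluded[y, x] = True
                continue
            dR = disp_right[y, xr]
            # The pixels "point at each other" when dR is about -dL.
            occluded[y, x] = abs(dL + dR) > threshold
    return occluded
```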
1. Overview of one or more implementations

As discussed earlier, various implementations discuss methods and systems for obtaining a stereo pair (3D) for frames from a monocular (2D) video sequence. FIG. 6 shows an overview of a system 600 used for one or more implementations of such a 2D-3D conversion.
The system 600 includes an input source 610. The input source 610 stores, in various implementations, one or more images and/or an input video. In other implementations, the input source 610 is a means for providing input images, such as a browser or other user interface that allows an image to be selected, received from a storage device, and provided to another component in the system 600. In certain implementations, the input source 610 includes a browser for selecting and receiving images from the internet, or an operating system user interface for selecting and retrieving images from a local network.
The system 600 further includes a pipeline 620 for generating a stereo pair for one or more images received from the input source 610. The system 600 further includes a viewing medium 630 used, for example, for receiving the generated stereo pair from the pipeline 620, and for displaying the generated stereo pair for viewing by a user. The system 600 additionally includes a user 640 that potentially interfaces with, for example, each of the input source 610, the pipeline 620, and the viewing medium 630.
The user 640 interfaces with the input source 610, in various implementations, to select and/or view an input image or input video. The user 640 interfaces with the pipeline 620, in various implementations, to provide input to, and receive selection information from, the pipeline 620. Various forms of input to, and information from, the pipeline 620 are described with respect to particular implementations elsewhere in this application. The user 640 interfaces with the viewing medium 630, in various implementations, to view the input 2D image, a rendered 2D image, a 3D image pair, and/or selection information from the pipeline 620. The user 640 performs communication to/from the other components 610, 620, and 630 using one or more input and/or output devices (not shown). Such input and/or output devices include, for example, a mouse, a touch screen for receiving finger commands, a pen for use with a touch screen, a microphone for receiving voice commands, a speaker for receiving information audibly, and/or a display screen for receiving information visually.
The input source 610 is, in various implementations, a storage medium
accessible to or integrated with a computer system where the pipeline 620 is implemented. The pipeline is implemented, in various implementations, on a single computer or suitably coded to operate on a cluster or distributed system. The viewing medium 630 is, in various implementations, separate from or integrated with a processing system that executes the pipeline. In at least one implementation, the viewing medium 630 is a standard monitor and the rendering is an anaglyph. In at least one other implementation, the viewing medium 630 is a 3D-capable TV or a projector-screen combination, and a suitable form of stereo pair is rendered. The stereo pair is, in particular implementations, an interleaved or time-shutter based stereo pair.
Certain implementations of the system 600 include a storage device in addition to the input source 610. Various implementations of the storage device perform the task of storing, either permanently or transiently, for example, an input image and/or a generated stereo pair. The storage device is, in various
implementations, separate from or integrated with the input source 610.
2. Pipeline for conversion according to one or more implementations
FIG. 7 shows a process 700 that is performed by a variety of implementations of the pipeline 620. Overviews of various implementations of the pipeline 620 were described earlier. The process 700 can be used with many of these
implementations. The elements of the process 700 are described below. In any given implementation, various elements of the process 700 are optional.
The process 700 includes accessing an input video 710. The process 700 further includes matching (720), estimating camera parameters (730), and obtaining a depth map (740). The operations 720-740 are described further, for particular implementations, in sections 2.1. through 2.2. below.
The process 700 also includes correcting a depth map (750). The operation 750 is described for various implementations with respect to section 2.3. below.
The process 700 also includes refining a depth map (760). The operation 760 is described for various implementations with respect to section 2.4. below.
The process 700 also includes rescaling and/or remapping a depth map to form a disparity map (770). More generally, the operation 770 relates to forming a disparity map. The operation 770 is described for various implementations with respect to section 2.5. below. The process 700 also includes warping to produce a stereo pair (780). The operation 780 is described for various implementations with respect to section 2.6. below.
The process 700 also includes rendering a stereo pair (790). The operation 790 is described for various implementations with respect to section 2.7. below.
2.1. Depth estimation from stereo
In various implementations of the process 700, the matching operation 720 includes a variation of stereo matching, and a depth map is determined (740) from the stereo matching. Stereo matching is used herein to broadly refer to the application of stereo matching techniques to two images, whether the two images are true stereo images or not.
Note that the stereo matching produces a disparity map, but the disparity map is converted, in various implementations, to a depth map. Accordingly, various implementations use the disparity map generated from the stereo matching, and do not convert the disparity map to a depth map. In such implementations that use the disparity map directly, the operation of estimating camera parameters 730 need not be performed. Referring to FIG. 8, a process 800a that uses stereo matching is provided. The process 800a is described in more detail further below.
Note, however, that stereo matching implementations often do perform the operation of estimating camera parameters 730. This is done, in various implementations, in order to determine a depth map that corresponds to the disparity map. Such depth maps are useful, for example, because they are valid for any display size.
In various implementations, consecutive or time spaced frames are treated as stereo pairs. One of the frames is designated as the reference image, and disparity is recovered with respect to the reference image. In various
implementations, the process is repeated for each frame as a reference image, or the disparity map is transferred to neighboring frames. The disparity map is transferred in situations, for example, in which there is no change between two frames (due, for example, to the absence of camera or object motion), and we can therefore essentially use the same disparity map. Also, if the change between two frames is very minimal, the majority of the disparity map is reused in various implementations (for example, if an object moves but the camera is not moving, then the background will have the same depth/disparity in the two frames). These implementations use stereo matching to perform the matching 720 of the process 700. An implementation of stereo matching is now described in more detail.
Given two time spaced frames, I1 and I2, various implementations rectify the images such that the epipolar lines are horizontal. This corresponds to a canonical stereo camera setup. A variety of techniques have been proposed for rectification in the computer vision literature, including for example, A. Fusiello and L. Irsara, "Quasi-euclidean Uncalibrated Epipolar Rectification", International Conference on Pattern Recognition (ICPR), 2008, Tampa, FL, which is hereby incorporated by reference in its entirety for all purposes. One or more
implementations use any of a variety of these known techniques.
Given the rectified images, I1r and I2r, various implementations perform stereo matching to obtain the disparity/depth map. A plethora of stereo algorithms have been proposed in the literature. Methods based on block matching include, for example, Sum of Square Distance (SSD), Normalized Cross Correlation (NCC), and Sum of Absolute Differences (SAD). Such block matching methods are implemented for real-time use in various implementations. Graph based methods are based on belief propagation or max-flow min-cut algorithms. Graph based methods typically provide dense disparity maps, which are converted in various implementations into dense depth maps, but also typically have large memory and convergence time requirements. The stereo matching process provides a disparity map, which is the horizontal parallax for each pixel or block of pixels. The disparity map is used directly in various implementations.
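As one illustrative option for this block-matching step, the sketch below uses OpenCV's basic block matcher. This is merely one off-the-shelf SAD-based matcher, not necessarily the matcher a given implementation uses, and the file names and parameter values are hypothetical.

```python
import cv2
import numpy as np

# I1r and I2r stand for the rectified frames discussed above, loaded here
# from hypothetical files as grayscale images.
I1r = cv2.imread("frame_t0_rectified.png", cv2.IMREAD_GRAYSCALE)
I2r = cv2.imread("frame_t1_rectified.png", cv2.IMREAD_GRAYSCALE)

# A basic block matcher; numDisparities must be a multiple of 16.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)

# StereoBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(I1r, I2r).astype(np.float32) / 16.0

# Negative values mark pixels for which no match was found.
disparity[disparity < 0] = np.nan
```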
In other implementations, the corresponding depth map is determined. The depth is inversely proportional to the disparity, by the following relation: z = f*B/d. Here, "z" is the depth, "f" is the focal length of the camera, "B" is the baseline or separation between the camera positions when the two images were taken, and "d" is the disparity. Additional discussion of the relation between depth and disparity is provided with respect to FIGS. 1-5. In various implementations, the camera parameters are known or estimated using other techniques. If the camera parameters are unknown, a predefined value is used in various implementations for the product f*B. Note that disparity is measured in pixels, which can range up to plus or minus the image width. Those disparity values can be too large in particular applications, and so various implementations determine the corresponding depth values.
The rectification process transforms the image. Hence, in order to obtain the depth map with respect to the original image, various implementations apply an inverse transform to the reference image and depth map obtained. The depth map is inverse transformed as well, to allow the depth map to be used with respect to the original video frame (from the input). For matching, the original video frame was rectified, and the disparity map was generated with respect to that rectified video image, and the depth map was based on the disparity map. Now both the video image and the depth map are "unrectified" (inverse transformed).
Intuitively, temporal motion provides depth information because, for example, higher motion often indicates that an object is closer to the camera. Conversely, lower motion often indicates further distance. For example, if a camera moves, then the two frames corresponding to the two positions of the camera are treated, in various implementations, as if the two frames were captured from two separate cameras as described for the traditional stereo camera setup.
2.2. Depth estimation from matching and camera tracking
In various implementations of the process 700, the matching operation 720 includes feature-based matching or flow-based matching techniques.
Additionally, in these implementations, the operation of estimating the camera parameters 730 includes estimating a projection matrix for the relevant cameras. Based on the feature-based or flow-based matching, and the estimated camera projection matrices, a depth map is determined 740. Referring to FIG. 8, a process 800b is provided that includes feature matching and camera parameter estimation. The process 800b is described in more detail further below.
This implementation uses multiple images, which are typically temporal (time-spaced) images. Additionally, implementations typically involve only a single camera. However, multiple camera projection matrices are obtained when, for example, the camera changes position. That is, a projection matrix includes camera position information (see R, T below) which changes as the camera moves. When the camera is not calibrated, the parameters are estimated for a given sequence up to a certain factor. Hence, even with the same camera, two sequences can be processed to obtain two equivalent yet different sets of camera parameters. This is particularly true when the sequences are non-overlapping. But if the camera is calibrated beforehand, then some of the parameters are known in absolute terms and apply to all sequences.
Again, note that simply having a disparity map is sufficient to generate a stereo pair which will work for providing a 3D effect. However, many implementations strive to get the depth map.
In various implementations, depth is estimated if the camera parameters - intrinsic and extrinsic - are known and the projection of a scene point is known in two or more images of the video sequence. The 3D world co-ordinates of a point X are related to the point's image projection (location in image) x as follows, x=PX, where "P" is the 3 x 4 projection matrix of the camera. The projection matrix P is formed from the intrinsic and extrinsic camera parameters as follows, P=K * [ R I T ]. Here "K" is a 3x3 intrinsic matrix containing the image center, the focal length, and the skew of the camera. "R" and "T" form the extrinsic parameters of rotation and translation of the camera. In at least one
implementation, R|T is a 3x4 matrix formed by concatenating/appending the two separate matrices of R and T. These parameters (R, T, K, and P) and the associated details are well known in the art. See, for example, Multiple View Geometry in Computer Vision, Second Edition, by Richard Hartley and Andrew Zisserman, Cambridge University Press, March 2004, a textbook which is hereby incorporated by reference in its entirety for all purposes. We can solve for X, if we know the projection of the point in two or more images. That is, if we have two or more equations of the form xi = PiX. This is known as 3D triangulation, and can be understood as intersecting two rays originating from the camera centers and passing through the image at respective local points xi, and intersecting at a 3D point "X". This generally assumes that there are, for example, separate cameras pointing at a common scene. The position X is estimated from the intersection of two or more such rays. Hence, in order to convert a given monocular video to 3D, in various implementations we estimate the camera matrices Pi and the projections xi of each scene point.
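The triangulation described here can be sketched as a standard linear (direct linear transform) solution of the equations xi = PiX. The sketch below is illustrative rather than a definitive implementation, and it assumes the projection matrices and the matched pixel coordinates are already available.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear triangulation of one scene point X from two views.

    P1, P2: 3x4 camera projection matrices (P = K * [R | T]).
    x1, x2: (x, y) pixel coordinates of the point's projections.
    Returns the 3D point, up to the unknown scale discussed in the text.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```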
A variety of techniques are known in the art for performing the above matching and camera tracking. One or more implementations use any of a variety of these known techniques.
2.2.1. Camera parameters
One or all of the camera parameters can be estimated using a camera tracking or structure from motion ("SFM") technique. A multitude of schemes for camera tracking and/or SFM have been proposed and implemented in commercial and open source products. One or more implementations use any of a variety of these known techniques. Certain techniques are discussed in, for example, Noah Snavely, Steven M. Seitz, Richard Szeliski, "Photo Tourism: Exploring image collections in 3D", ACM Transactions on Graphics (Proceedings of
SIGGRAPH 2006), 2006, incorporated herein by reference in its entirety for all purposes.
2.2.2. Matching
To obtain the 3D location of a scene point, we use the projection of the point in two or more images taken with different camera positions. This process of obtaining the correspondence/matching can be performed on a sparse set of features or for each pixel of the image (that is, a dense set of features).
In sparse feature matching, a set of salient features (or interest points) is tracked/matched robustly across images. A variety of features and
matching/tracking schemes can be used. A large number of feature detectors have been proposed based on scale-space analysis, edge detectors, and pyramid-based image filtering. The features can be tracked using any tracking scheme, or comparing the feature vectors (descriptors) using either the L1 or the L2 norm. The various detectors and descriptors vary in terms of invariance to image transformations (such as scale change), type of image features selected, and dimension of the descriptor. One or more implementations use any of a variety of these known techniques.
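As one possible realization of sparse feature matching, the sketch below uses ORB features with brute-force Hamming matching from OpenCV. The detector/descriptor choice and the file names are assumptions made for illustration, since the text leaves the particular scheme open.

```python
import cv2

img1 = cv2.imread("frame_t0.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

# Detect interest points and compute binary descriptors.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with cross-checking for robustness.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Matched pixel coordinates, usable as the projections xi of section 2.2 above.
pts1 = [kp1[m.queryIdx].pt for m in matches]
pts2 = [kp2[m.trainIdx].pt for m in matches]
```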
In dense feature matching, a dense correspondence, that is, a match for each pixel, is obtained. The dense correspondence is obtained, in various
implementations, using, for example, flow-based techniques or dense feature-point detection and matching. A variety of dense and sparse optic-flow schemes have been proposed in the literature. Methods have also been suggested to apply known descriptors on a dense sampling of the image followed by matching of these descriptors, similar to their sparse counterparts, or using techniques like graph-cut or belief propagation. One or more implementations use any of a variety of these known techniques. See, for example, the following references which are hereby incorporated by reference in their entirety for all purposes: (i) B. D. Lucas and T. Kanade (1981), "An iterative image registration technique with an application to stereo vision", Proceedings of Imaging Understanding Workshop, pages 121-130, (ii) Vladimir Kolmogorov and Ramin Zabih, "Computing Visual Correspondence with Occlusions using Graph Cuts", International Conference on Computer Vision, July 2001, and (iii) Pedro F. Felzenszwalb and Daniel P. Huttenlocher, "Efficient Belief Propagation for Early Vision", International Journal of Computer Vision, Vol. 70, No. 1, October 2006. Note that implementations are described that use one or more of various types of feature matching and/or stereo matching. For example, the methods SSD, SAD, and NCC are known as block-matching techniques, and they work essentially on all/any patch/block of an image. Such block-matching techniques are, in various implementations, commonly aimed at getting as dense a disparity map as possible. However, various implementations perform stereo matching without aiming to get a dense disparity map. Feature-matching techniques commonly determine a set of features or salient-points and generally match only patches around these detected points.
2.2.3. 3D triangulation and initial depth map
Given the projection matrices and correspondences, various implementations estimate the 3D location of the point in the co-ordinate system established by the camera tracking scheme and up to an unknown scale. That is, if the camera is not calibrated, the parameters are estimated up to a factor. Accordingly, the real 3D depth and the estimated 3D point X differ by a scale factor. This can be represented as (real X) = (scale-factor)*(estimated X). This scale is presumed to be unknown without calibration of the camera.
For implementations that use sparse feature tracking, a sparse set of 3D points is typically obtained. For implementations that use dense matching, the 3D location of each pixel is typically estimated.
In various implementations that use sparse feature tracking, the selected salient points are not necessarily distributed uniformly over the image. In various implementations, a sparse depth map is converted, using for example, triangulation or interpolation, to get a dense depth map or at least to get a more uniformly sampled depth map. Such a converted depth map will not always be robust because, for example, the sparse depth values are not necessarily distributed in the image.
The dense matching algorithms can provide dense correspondence. However, the algorithms are not always accurate. Additionally, the algorithms often increase the computation time for triangulation because there are typically a large number of pixels (for example, on the order of two million pixels for high definition ("HD") frames). A lack of accuracy arises, in various implementations, because there is often a trade-off between density and accuracy. Typically, some sort of "smoothness" constraint is used. As a result, for example, patches that are inherently ambiguous to match will often not be accurate. To address these concerns, at least in part, various implementations use a hybrid approach. In the hybrid approach, a dense flow is calculated for the image, such that matching (for example, flow or feature-based) is performed for all pixels. However, only a uniform set of pixels (for example, a regular grid) is triangulated, using, for example, the described process of determining X from xi and Pi, to obtain a sparse depth map which is refined as described in the later sections. An implementation of the hybrid approach is described below for super-pixels.
Because most of the matching errors typically occur at object boundaries, occlusions, or depth discontinuities, various implementations achieve better results by attempting to avoid such locations and by using "interior" pixels for depth estimation. Such implementations do not typically use a fixed or regular grid-sampling of the image. Rather, such implementations typically first divide the image into a fixed number of "super-pixels" as described in, for example, the following two references, both of which are hereby incorporated by reference in their entirety for all purposes: (i) X. Ren and J. Malik, "Learning a classification model for segmentation", Proc. 9th Int. Conf. Computer Vision, volume 1 , pages 10-17, 2003, and (ii) Alex Levinshtein, Adrian Stere, Kiriakos N. Kutulakos, David J. Fleet, Sven J. Dickinson, Kaleem Siddiqi, "TurboPixels: Fast Superpixels Using Geometric Flows", IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2290-2297, December 2009.
The superpixels provide a quasi-uniform sampling, while keeping in mind image edges. That is, an image edge will typically not occur inside a super-pixel. The center of the superpixels is used in certain implementations to obtain a sparse depth map from triangulation. Accordingly, the disparity (and depth) is calculated with respect to, for example, the center pixel location in the superpixels. The number of superpixels is varied in different implementations based on image size and scene. For example, 5000 superpixels are used for a HD frame in one implementation.

2.3. Depth map correction
The operation of correcting the depth map 750 includes, in various
implementations, applying one or more of the following corrections/filters to a depth map. The depth map is obtained, in various implementations, from any of the above processes, particularly the processes described with respect to sections 2.1 and 2.2. This correction operation 750 is applied, in various implementations, to a depth map obtained by converting a disparity map from a stereo-matching process (section 2.1). This correction operation need not be applied in all implementations.

• Histogram filtering
Due to errors in matching or camera parameters, the depth map obtained may contain some spurious values. In various implementations, such errors are removed or reduced by histogram filtering. In certain implementations, the depth values are binned into a histogram and values within bins having a "low" population are discarded. The number of bins and the threshold for minimum bin occupancy is set based on the scene and the nature of the error. For example, in one implementation, if a scene contains many depth layers and many objects with different depth values, then more bins are used, as compared to a scene which has fewer depth layers (for example, a scene that has a far background and a simple object in the foreground). Additionally, for a given image size (the total number of pixels is fixed), if there are many depth layers, then more bins are generally used. Using more bins typically means that each bin will have a lower percentage of the total pixels and, thus, that the bin counts will generally be lower. As a result, as the number of bins increases, the expected bin count generally decreases, and hence the minimum required bin occupancy is set to a lower value.
In different implementations, the filter is applied (i) on the entire depth map, (ii) on smaller size blocks using distinct/overlapping sliding windows, or (iii) on segments obtained from color based segmentation of the image. With respect to item (ii), windows refer to image patches, which are generally square or rectangular. Patches are distinct if the patches do not overlap in pixels, such as, for example, a patch from pixel (1, 1) to pixel (10, 10) and a patch from pixel (1, 11) to pixel (11, 20). With respect to item (iii), in typical implementations, the segment-based filtering is done inside the segments, and the filter is applied to the whole segment.
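A minimal sketch of such a histogram filter follows. It is illustrative only; the bin count and minimum occupancy fraction are placeholder parameters that, as discussed above, would be set based on the scene and the nature of the error, and the filter can equally be applied to windows or segments rather than the whole map.

```python
import numpy as np

def histogram_filter(depth, num_bins=32, min_occupancy=0.01):
    """Discard depth values that fall into sparsely populated histogram bins.

    depth: 2D array with NaN for unknown values.
    min_occupancy: minimum fraction of the valid pixels a bin must hold.
    Returns a copy of the map with spurious values replaced by NaN.
    """
    out = depth.copy()
    valid = ~np.isnan(out)
    values = out[valid]
    if values.size == 0:
        return out
    counts, edges = np.histogram(values, bins=num_bins)
    # Index of the bin each valid value falls into (clip the right edge).
    bin_idx = np.clip(np.digitize(values, edges) - 1, 0, num_bins - 1)
    keep = counts[bin_idx] >= min_occupancy * values.size
    out[valid] = np.where(keep, values, np.nan)
    return out
```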
• Statistical Filters
Similar to histogram filtering, in various implementations median filtering is applied on small blocks of the image using a sliding window operation.
Depending on the window size, the mean or the mode is used instead of the median.
If the window size is large, then the use of the mean may result in a "smudging" effect. In general, a larger window has the potential to cover two depth layers within the window (at boundaries or even within an object). In such cases, the mean would typically result in creating a depth which lies somewhere in between - which may not be visually appealing or even a valid depth layer inside the scene. A mode or a median would instead typically choose to place the depth at the level of the larger object or contributor within that window.
For example, consider a scene with a person standing in front of a far
background - which means essentially that there are two main depth layers (the person's depth in the foreground, and the far background). At the border of the person and the background, assume that the window contains, for example, part of the person's hand and part of the background. An in-between depth layer, generated using the mean, would generally result in the appearance that the hand was connected to the far background. This will typically look awkward to a viewer of the scene. A mode or median, in contrast, would generally result in part of the background appearing to stick to the hand. Note that this is avoided, in various implementations, by using super-pixels that respect image edges and, hopefully, do not create a segment that contains parts of both the person and background. In various implementations, for each sliding window or image segment, errors are also removed by using thresholds derived from the mean and variance statistics of the values within the block or segment. For example, values outside of the mean +/- N*sigma (N = 1, 2, 3, ...) are regarded as outliers and discarded in certain implementations.
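The statistical filtering described above might be sketched as follows, assuming SciPy is available. The window size and the N of the N*sigma test are illustrative, and global statistics are used here where a given implementation might instead use per-window or per-segment statistics.

```python
import numpy as np
from scipy.ndimage import median_filter

def statistical_filter(depth, window=5, n_sigma=2.0):
    """Median-filter the depth map and discard values far from the statistics.

    A median (rather than a mean) is used so that a window straddling two
    depth layers snaps to the dominant layer instead of smudging in between.
    """
    # Temporarily fill unknown values so the sliding-window median is defined.
    filled = np.where(np.isnan(depth), np.nanmedian(depth), depth)
    smoothed = median_filter(filled, size=window)

    # Outlier rejection against mean/variance statistics (mean +/- N*sigma).
    mean, sigma = np.nanmean(depth), np.nanstd(depth)
    outliers = np.abs(depth - mean) > n_sigma * sigma
    return np.where(outliers, np.nan, smoothed)
```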
• Object segment based correction
Masks and rotoscopes are well-known in the art and typically provide information about the shape and/or location of an object. The terms "mask" and "rotoscope" are used interchangeably in this discussion. If the masks/rotoscopes are available, various implementations use the masks/rotoscopes for one or more of the objects to ensure consistency in disparity and/or depth within the object. For example, masks are available from object segmentation and tracking in certain implementations. Given the object boundary, which is provided by the mask or rotoscope, various implementations apply statistical or histogram filters within the segment, as determined by the object boundary, to remove noise. The refinement stage then fills the missing disparity values.
• Manual correction
Depth values are also modified manually in certain implementations.
Modifications include, for example, deleting or replacing depth values. Manual modifications are made, in various implementations, within selected regions of the image. Examples of such regions include rectangular blocks, image segments, a collection of superpixels, or known object segments. In various implementations, a semi-automatic process is used, and a user selects only particular regions in which to perform manual corrections.
2.4. Depth map refinement
The operation of refining the depth map (760) is applied, in various
implementations, to the corrected and sparse depth map from the previous stage, or the sparse-dense map of the hybrid approach. Note also that in various implementations, the correction operation 750 results in the removal of depth values at certain pixels. The refining operation (760) produces, in typical implementations, a dense depth map.
In various implementations, the initial depth map is interpolated keeping in mind object boundaries, and attempting to avoid interpolating across object
boundaries. For example, particular implementations use a color segmentation of the reference image and interpolate within each segment. Performance of these implementations will generally be improved if two conditions are met. The first condition is that object boundaries appear as edges in the image. The second condition is the planar assumption. The planar assumption assumes that the surface that a segment represents is locally flat or planar, thus allowing implementations to use some form of linear interpolation in order to interpolate within a segment.
Segmentation of the image is performed, in various implementations, using any high dimensional clustering algorithm based on graph theory or non-parametric methods such as, for example, mean shift. Each image segment is interpolated using, for example, bilinear or cubic interpolation. The filters used for the initial depth map can be used if required to remove interpolation errors. Further smoothing of the depth map can be performed using a sliding window median filter. Thus, the refinement operation 760 includes, in various implementations, several operations, including, for example, segmentation, interpolation, and filtering. Additionally, the segmentation of previous operations is, in various implementations, reused.
Note that prior to interpolation, statistics on known depth values within the segment are used, in various implementations, to filter "bad" depth values. This filtering is performed, for example, in the correcting operation 750 discussed elsewhere. For example, certain implementations mask out pixels whose depth values are beyond a known range or outside of some standard deviation.
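A possible sketch of the segment-wise interpolation follows. It assumes the segmentation is available as an integer label image (for example, from a mean-shift segmentation) and uses SciPy's griddata for the within-segment linear interpolation, which reflects the planar assumption discussed above; the names are hypothetical.

```python
import numpy as np
from scipy.interpolate import griddata

def refine_depth(sparse_depth, segments):
    """Interpolate known depth values independently within each image segment.

    sparse_depth: 2D array with NaN where depth is unknown.
    segments: 2D integer label image from a color-based segmentation.
    """
    dense = sparse_depth.copy()
    for label in np.unique(segments):
        mask = segments == label
        known = mask & ~np.isnan(sparse_depth)
        missing = mask & np.isnan(sparse_depth)
        if known.sum() < 3 or missing.sum() == 0:
            continue  # too few samples to interpolate within this segment
        pts = np.argwhere(known)
        vals = sparse_depth[known]
        targets = np.argwhere(missing)
        # Linear interpolation inside the segment (planar assumption);
        # nearest-neighbour fills points the linear step cannot cover.
        try:
            interp = griddata(pts, vals, targets, method='linear')
        except Exception:
            interp = np.full(len(targets), np.nan)
        nearest = griddata(pts, vals, targets, method='nearest')
        dense[missing] = np.where(np.isnan(interp), nearest, interp)
    return dense
```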
2.5. Depth map remapping/rescaling
The operation of, generally speaking, producing a disparity map (770) includes, in various implementations, remapping and/or rescaling the depth map. The depth map that is used to generate a disparity map is, in various implementations, the refined dense depth map described above. Other implementations use, for example, a sparse depth map to generate a disparity map.
As is known, given the dense depth map and the reference image, the
corresponding stereo pair can be formed by using the depth map to obtain a disparity map, and then using the disparity map to form the stereo image for the reference image. The disparity map indicates the parallax/horizontal shift for one or more pixels, and a dense disparity map typically indicates the parallax/horizontal shift for each pixel of the reference image.
Note that the number of disparity levels refers to the total number of distinct disparities. For example, if a scene has four depth layers, then typical implementations use at least four disparity levels to clearly correspond/mark/map those four depths. The number of distinct disparity levels and the suitable range depend on, for example, the screen size and the viewer distance. A variety of functions can be used to remap the depth map into a disparity map. Before describing several such functions, as examples, we define the following notation:
• D: Depth/disparity map with respect to a certain (reference) frame/image.
Note that this discussion provides a general conversion from one depth or disparity value to another depth or disparity value. D is composed of constituent values D(i) for each location "i" in D that has a depth value (for example, in many implementations each pixel location in a depth map would have a value for D(i), with "i" set to the individual pixel location).
• OldDmax: The maximum value of D.

• OldDmin: The minimum value of D.
• WarpDmax: The maximum value of the target range. The target range is the range of disparity values allowable for the disparity map that is being generated. The maximum value is the largest value, which may be positive, zero, or negative. For example, if the target range is [-100 to -50], then the maximum value is -50 and the minimum value is -100.

• WarpDmin: The minimum value of the target range.
• sc: Scale factor for log mapping
• f: Factor for linear mapping
• WarpD: Disparity map. Note that although this application generally speaks of the image being warped, and the depth map being remapped or rescaled or converted to produce a disparity map, the term "warping" is also used in the literature (and here) for the process of converting depth to disparity.
WarpD is composed of constituent values WarpD(i) for each location "i" in WarpD that has a disparity value (for example, in many implementations each pixel location in the disparity map would have a value for WarpD(i), with "i" set to the individual pixel location).
Various implementations use one or more of the following functions to remap the depth map into a disparity map, WarpD. Note that the mappings below are typically performed for each value of depth, D(i), for all possible values of i. A short code sketch of several of these mappings follows the list.

• Linear Mapping
- NewD(i) = (WarpDmax-WarpDmin)*(D-OldDmin)/(OldDmax-OldDmin) + WarpDmin
- WarpD(i) = NewD(i)/f. Note that the scale factor "f" is used in various
implementations to assist in changing the depth perception. The change is, for example, intended to soften or, conversely, exaggerate the depth. The change is based on, for example, viewing medium or user preference. "f" can be any value, including less than 1 (which will scale up the perceived depth) or greater than 1 (to scale down the perceived depth).
• Log mapping
- Get Linear map NewD(i)
- WarpD(i) = log(1 + sc*NewD(i)). As with "f", the scale factor "sc" allows scaling of the perceived depth. Also, if the values in D are too small, various implementations scale them up before taking the log. "sc", as well as "f", are determined, in various implementations, using trial and error, or a heuristic-based approach on initial disparity and expected target and effect.
• Exponential mapping
- Get Linear map NewD(i)
- WarpD(i) = exp(1/(sc*NewD(i)))

• Polynomial fitting
In various implementations, the above functions are applied to D as is.
However, in other implementations, a polynomial function is fit using only some of the depth layers. The depth layers refer to the different "apparent" distances or depths in the scene. For example, if there is a scene with a far background and a near foreground - then it can be roughly said to have two depth layers. As discussed below, certain layers can often be said to be dominant in a particular scene. This provides more control over separation of chosen layers. Some particular implementations, which are examples for performing this, follow. In various implementations:
- We obtain a histogram from the depth map, D.
- We select dominant layers/bins from the histogram. For example, in certain implementations values that contribute more than a threshold number of pixels are selected (for example, more than 30% of the pixels, or more than the mean of the histogram).
- The selected layers/bins are represented, for example, by the bin centers of the histogram or by one of the bounds. We designate "x" as a vector that includes these representative values.
- We designate "y" as the mapped disparity, obtained by applying one or more of the functions described above to the vector "x".
- We then approximate a polynomial y=P(x). The degree of P is, in various implementations, fixed or decided based on the number of unique values in x. For example, the degree can be up to a value of (number of unique values in x) - 1.
- P(x) is then used to map the other depth values, from the depth map D, to disparity values.

• Combination
Certain implementations also provide a combination that is often suitable for a TV-sized medium. In these implementations, we map the depth using polynomial fitting after dominant layer selection. The mapped value, y, used for polynomial fitting, is obtained, for example, from exponential mapping with sc=500. The final disparity is obtained by a linear mapping of the resulting map to a range of -30 to +30. Accordingly, after polynomial fitting the entire depth map "D", a linear mapping is performed. This is done, for example, because the polynomial mapping of "D" may have modified the "target range" that had been selected when mapping "x" (the selected layers) to "y". That is, the polynomial fit can have the effect of changing the relative separation. Linear mapping then takes those modified values and maps to a final range that is desired based on one or more of a variety of factors, such as, for example, the size of the medium (for example, a TV, a projector, or a movie screen).
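As noted above the list, a short sketch of the linear, log, and exponential mappings follows. OldDmin and OldDmax are taken from the map itself, the function names are hypothetical, and, per the caveat above, the log and exponential variants assume the linearly mapped values are suitable (positive, nonzero) or have been scaled beforehand.

```python
import numpy as np

def linear_map(D, warp_min, warp_max, f=1.0):
    """Linear mapping of a depth map D into [warp_min, warp_max], scaled by f."""
    old_min, old_max = np.nanmin(D), np.nanmax(D)
    new_d = (warp_max - warp_min) * (D - old_min) / (old_max - old_min) + warp_min
    return new_d / f

def log_map(D, warp_min, warp_max, sc=1.0):
    """Log mapping applied on top of the linear map."""
    new_d = linear_map(D, warp_min, warp_max)
    return np.log(1.0 + sc * new_d)

def exp_map(D, warp_min, warp_max, sc=1.0):
    """Exponential mapping applied on top of the linear map (NewD nonzero)."""
    new_d = linear_map(D, warp_min, warp_max)
    return np.exp(1.0 / (sc * new_d))

# Example: remap a depth map to the [-30, +30] pixel range mentioned for a
# TV-sized medium, softening the effect with f = 2:
#   warp_d = linear_map(depth_map, -30.0, 30.0, f=2.0)
```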
2.6 Warping
The operation of warping to produce a stereo pair (780) is performed, in various implementations, using the reference image and a suitably remapped disparity map (sparse or dense). Certain implementations use, for example, the
remapped disparity map produced in the previous section 2.5.
In various implementations, we obtain the stereo pair by using the reference view as one view/eye (for example, the left-eye view), and generating the second view/eye (for example, the right-eye view) by the process of warping. Other implementations, however, obtain the stereo pair by applying warping to the reference view to obtain two new views that are then used as a stereo pair. The latter process of obtaining two new views is performed, in various
implementations, by performing two warping steps. One warping step uses the disparity map (for example, the remapped disparity map of section 2.5.). Another warping step uses a sign-changed disparity map. A sign-changed disparity map is a disparity map obtained by changing the sign (multiplying by negative one) of every disparity value in the map. This is similar to treating the monocular reference as a center view and generating two views on either side for stereo viewing.
Warping is performed in various different ways in different implementations. The following implementations present several such examples.
2.6.1 Shifting
A dense disparity map typically defines a horizontal displacement for each pixel. We can thus create the second view by depth sensitive shifting of pixels through forward or backward warping. In the case of forward warping, in various implementations, for every pixel with known disparity, we determine the position in the other view as follows:
R(x+d, y)=L(x, y) where:
L is the reference/left view, R is the generated right/other view, (x, y) is the pixel location, and d is the disparity at the position indicated by x and y.
This generally results in undetermined pixels in the generated image due to occlusions. The undetermined or occlusion pixels can be filled using
interpolation or in-painting schemes.
In the case of backward warping, in various implementations, for each location in the target new image we determine a source pixel in the reference image, interpolating as required. For either forward or backward warping, interpolation and/or in-painting schemes, for example, are used in various implementations to fill undetermined or occluded pixels.
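Forward warping by pixel shifting might be sketched as follows. The simple row-wise hole filling stands in for the interpolation or in-painting schemes mentioned above, and the disparity map is assumed to be dense, expressed in pixels, with NaN marking unknown values.

```python
import numpy as np

def forward_warp(left, disparity):
    """Generate the other view by shifting each pixel horizontally by its disparity.

    left: H x W x 3 reference image; disparity: H x W map (NaN where unknown).
    Implements R(x + d, y) = L(x, y); holes are filled from the nearest
    already-filled pixel to the left in the same row.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            if np.isnan(d):
                continue
            xt = int(round(x + d))
            if 0 <= xt < w:
                right[y, xt] = left[y, x]
                filled[y, xt] = True
        # Simple hole filling for undetermined (occlusion) pixels.
        last = None
        for x in range(w):
            if filled[y, x]:
                last = right[y, x].copy()
            elif last is not None:
                right[y, x] = last
    return right
```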
2.6.2 Thin-Plate Splines
Various implementations use thin-plate splines ("TPS"). TPS have been used for producing smooth warping functions in the literature. In various implementations that apply TPS, a sparse set of control point locations in the original and target (warped) image is determined, a warping function is estimated, and interpolation is used to derive the locations for all pixels. Intuitively, thin-plate splines can often be understood as bending a continuous, flexible, and non-brittle material based on the control points. The surface bend defines the warping function. In general, discontinuities or sharp surface deviations are not captured effectively using TPS due to constraints on surface smoothness/bending. One advantage of TPS is that large gaps are not generated in the warped image. However, TPS typically is limited in its ability to effect sharp depth discontinuities. TPS is applied in various implementations when, for example, a scene contains smooth depth variations and/or small depth discontinuities.
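One way to realize a TPS warping function is through a radial basis function with a thin-plate kernel, as sketched below with SciPy. This is an illustrative choice rather than the implementations' own method; the control-point arrays are assumed to hold sparse (x, y) locations and the disparities at those locations.

```python
import numpy as np
from scipy.interpolate import Rbf

def tps_disparity_surface(ctrl_x, ctrl_y, ctrl_disp, width, height):
    """Fit a thin-plate-spline surface through sparse control-point disparities.

    Returns a dense, smooth horizontal-shift field that can be used to warp
    the reference image, consistent with the smoothness behavior noted above.
    """
    tps = Rbf(ctrl_x, ctrl_y, ctrl_disp, function='thin_plate')
    grid_x, grid_y = np.meshgrid(np.arange(width), np.arange(height))
    return tps(grid_x, grid_y)
```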
Certain implementations use an automatic scheme to determine the applicability of TPS over other warping methods. Various such implementations generate one or more measures using the gradient of the disparity map, and base the decision of whether to use TPS on the value of the measure(s).
2.6.3 Polynomial warps
Various implementations use polynomial warps to warp the depth. General nth degree polynomials can be fit to map control points from the original image to the target image, as is known in the literature. The fit polynomial is used, in various implementations, to determine the locations for all pixels. A degree of n=1 simulates a linear warp, which is similar to a uniform translation or shift.
Piecewise linear or spatially varying multiple polynomials are used in various implementations.

2.7 Stereo rendering
The operation of rendering a stereo pair (790) is performed, in various
implementations, by rendering the stereo pair produced in section 2.6. above. Other implementations, however, make further adjustments prior to providing the stereo pair for viewing/display. For example, certain implementations change the disparity by shifting the images relative to each other. This is used, in various applications, to adapt to personal viewing preferences, size of display medium, or distance to display.
For example, in various implementations, a scene has four depth layers, which are represented as four disparity layers. Those four values of disparity/depth can be chosen in different ways, and different implementations use different values. For example, a first implementation uses [-1, -2, -3, -4] as the disparity values to make the corresponding objects pop out of the screen. A second implementation uses [1, 2, 3, 4] as the disparity values to make the corresponding objects appear "inside". A third implementation uses [2, 4, 6, 8] as the disparity values to exaggerate the relative separation.
3. Examples of implementations of a pipeline
Referring to FIGS. 8-10, various implementations are displayed. These implementations provide a pipeline within a conversion system.
FIG. 8 provides the process 800a and the process 800b. Each is discussed in turn.
The process 800a uses stereo matching, and various implementations are described in section 2.1. above. The process 800a does not specifically recite any operations for estimating camera parameters, correcting a depth map, refining a depth map, or rescaling/remapping a depth map. Certain
implementations produce an adequate disparity map directly from stereo matching, and are able to avoid these operations. Various implementations, however, do include one or more of these operations.
The process 800a includes receiving input video (810a). The operation 810a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800a performs stereo matching (820a). The operation 820a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800a obtains a depth map (840a). The operation 840a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various implementations of stereo matching (820a) and obtaining a depth map (840a), are described in section 2.1. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline). Note that the operation 840a is performed, in various implementations, by obtaining a disparity map rather than a depth map. Indeed, such a disparity map is obtained, in particular
implementations, directly from the stereo matching operation 820a.
The process 800a warps the original image to obtain a stereo pair (880a). The operation 880a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Note that the operation 880a is performed, in various implementations, by using a disparity map obtained in the operation 840a. Indeed, in particular
implementations, the disparity map obtained from stereo matching is directly applied to the original image to warp the image and create a new image.
The process 800a renders a stereo pair (890a). The operation 890a is
performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800b uses feature matching and camera parameter estimation, and various implementations are described in section 2.2. above. The process 800b does not specifically recite any operations for correcting or refining a depth map. Certain implementations produce an adequate depth map directly from feature matching, and are able to avoid these operations. Various implementations, however, do include one or more of these operations.
The process 800b includes receiving input video (810b). The operation 810b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800b performs dense feature matching (820b). The operation 820b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other
implementations of the process 800b use sparse feature matching in place of, or in addition to, dense feature matching.
The process 800b estimates camera parameters (830b). The operation 830b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800b obtains a depth map (840b). The operation 840b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various implementations of dense (820b) and/or sparse feature matching, estimating camera parameters (830b), and obtaining a depth map (840b), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
The process 800b performs depth map rescaling/remapping to obtain a disparity map (870b). Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map. The operation 870b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800b warps the original image to obtain a stereo pair (880b). The operation 880b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 800b renders a stereo pair (890b). The operation 890b is
performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
FIG. 9 provides two processes that include depth refinement. FIG. 9 provides a process 900a showing the use of sparse feature matching with depth refinement. FIG. 9 also provides a process 900b showing the use of dense feature matching with depth refinement. Neither the process 900a nor the process 900b
specifically recites the use of depth correction. Certain implementations produce an adequate depth map directly from feature matching, and are able to avoid a depth correcting operation. Various implementations, however, do include depth correction. Further, various implementations use stereo matching instead of feature matching in one or more of the processes 900a and 900b.
The process 900a includes receiving input video (910a). The operation 910a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900a performs sparse feature matching (920a). The operation 920a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other
implementations of the process 900a use stereo matching in place of, or in addition to, sparse feature matching (920a). The process 900a estimates camera parameters (930a). The operation 930a is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900a obtains a depth map (940a). The operation 940a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various implementations of sparse feature matching (920a), estimating camera parameters (930a), and obtaining a depth map (940a), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
The process 900a refines a depth map (960a). The operation 960a is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
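As one possible realization of the refinement operation 960a, the sketch below densifies a partial depth map (valid_mask marks the pixels with known depth) by nearest-neighbour interpolation and then smooths it with a median filter. Filter and interpolation choices vary across implementations; the helper name, the use of SciPy/OpenCV, and the kernel size are assumptions made for illustration.

import numpy as np
import cv2
from scipy.interpolate import griddata

def refine_depth(sparse_depth, valid_mask):
    # Interpolate a sparse/partial depth map to all pixels, then smooth it.
    h, w = sparse_depth.shape
    ys, xs = np.nonzero(valid_mask)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    dense = griddata((ys, xs), sparse_depth[ys, xs],
                     (grid_y, grid_x), method="nearest")
    dense = cv2.medianBlur(dense.astype(np.float32), 5)
    return dense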
The process 900a performs depth map rescaling/remapping to obtain a disparity map (970a). Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map. The operation 970a is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900a warps the original image to obtain a stereo pair (980a). The operation 980a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900a renders a stereo pair (990a). The operation 990a is
performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 900b includes receiving input video (910b). The operation 910b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900b performs dense feature matching (920b). The operation 920b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other
implementations of the process 900b use stereo matching in place of, or in addition to, dense feature matching (920b).
The process 900b estimates camera parameters (930b). The operation 930b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900b obtains a depth map (940b). The operation 940b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various implementations of dense (920b) feature matching, estimating camera parameters (930b), and obtaining a depth map (940b), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
The process 900b refines a depth map (960b). The operation 960b is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900b performs depth map rescaling/remapping to obtain a disparity map (970b). Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map. The operation 970b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900b warps the original image to obtain a stereo pair (980b). The operation 980b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 900b renders a stereo pair (990b). The operation 990b is
performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
FIG. 10 provides two processes that include depth correction. FIG. 10 provides a process 1000a showing the use of sparse feature matching with depth correction. FIG. 10 also provides a process 1000b showing the use of dense feature matching with depth correction. Various implementations use stereo matching instead of feature matching in one or more of the processes 1000a and 1000b.
The process 1000a includes receiving input video (1010a). The operation 1010a is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000a performs sparse feature matching (1020a). The operation 1020a is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other implementations of the process 1000a use stereo matching in place of, or in addition to, sparse feature matching (1020a).
The process 1000a estimates camera parameters (1030a). The operation 1030a is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 1000a obtains a depth map (1040a). The operation 1040a is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various
implementations of sparse feature matching (1020a), estimating camera parameters (1030a), and obtaining a depth map (1040a), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
The process 1000a corrects a depth map (1050a). The operation 1050a is performed, for various implementations, as described (i) with respect to the operation 750 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000a refines a depth map (1060a). The operation 1060a is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000a performs depth map rescaling/remapping to obtain a disparity map (1070a). Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map. The operation 1070a is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000a warps the original image to obtain a stereo pair (1080a). The operation 1080a is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000a renders a stereo pair (1090a). The operation 1090a is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 1000b includes receiving input video (1010b). The operation 1010b is performed, for various implementations, as described (i) with respect to the operation 710 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 1000b performs dense feature matching (1020b). The operation 1020b is performed, for various implementations, as described (i) with respect to the operation 720 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Other implementations of the process 1000b use stereo matching in place of, or in addition to, dense feature matching (1020b).
The process 1000b estimates camera parameters (1030b). The operation 1030b is performed, for various implementations, as described (i) with respect to the operation 730 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 1000b obtains a depth map (1040b). The operation 1040b is performed, for various implementations, as described (i) with respect to the operation 740 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. Various implementations of dense (1020b) feature matching, estimating camera parameters (1030b), and obtaining a depth map (1040b), are described in section 2.2. above (and in general with respect to the entirety of section 2. above describing various aspects of an example of a pipeline).
The process 1000b corrects a depth map (1050b). The operation 1050b is performed, for various implementations, as described (i) with respect to the operation 750 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000b refines a depth map (1060b). The operation 1060b is performed, for various implementations, as described (i) with respect to the operation 760 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline. The process 1000b performs depth map rescaling/remapping to obtain a disparity map (1070b). Various implementations provide a disparity map using techniques other than rescaling/remapping a depth map. The operation 1070b is performed, for various implementations, as described (i) with respect to the operation 770 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000b warps the original image to obtain a stereo pair (1080b). The operation 1080b is performed, for various implementations, as described (i) with respect to the operation 780 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
The process 1000b renders a stereo pair (1090b). The operation 1090b is performed, for various implementations, as described (i) with respect to the operation 790 in the process 700, and/or (ii) in general with respect to section 2. above describing various aspects of an example of a pipeline.
4. Examples of system implementations
Various implementations of a pipeline are used in a 2D-3D conversion system. Referring to FIGS. 11-13, several implementations of such a conversion system are shown. Additionally, FIGS. 11-13 illustrate various types of user interaction with the system 600 of FIG. 6, according to particular implementations. FIG. 11 shows examples of user interaction at the input level. FIG. 12 shows examples of user interaction with the different modules of a pipeline. FIG. 13 shows examples of user interaction with a viewing medium.
Referring again to FIG. 11, a system 1100 is provided. The system 1100 includes an input source 1110, a pipeline 1120, a viewing medium 1130, and a user 1140. In various implementations, the system 1100 is the same as the system 600 of FIG. 6.
FIG. 11 also shows examples of user interactions with the input source 1110. The examples of user interactions are indicated as tasks 1141-1149. In various implementations, the system 1100 allows the user 1140 to perform none, some, or all of the tasks 1141-1149. FIG. 11 illustrates the following tasks (a brief input-preparation sketch follows the list):
- (i) a task 1141 of changing a video format for an input image,
- (ii) a task 1143 of cropping or modifying a resolution of an input image,
- (iii) a task 1145 of selecting video fragments to convert,
- (iv) a task 1147 of selecting frames to convert, and
- (v) a task 1149 of selecting sequences to convert.
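A minimal sketch of the input-level interactions of tasks 1143, 1145, and 1147 (resizing frames and selecting a frame range before conversion) is given below. The function and parameter names are hypothetical, and OpenCV is used only as a convenient reader; the sketch is not a specification of the input source 1110.

import cv2

def prepare_input(path, start_frame=0, end_frame=None, size=(1280, 720)):
    # User-directed input preparation: select a frame range and resize
    # frames before conversion.
    cap = cv2.VideoCapture(path)
    frames = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok or (end_frame is not None and idx > end_frame):
            break
        if idx >= start_frame:
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames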
Referring again to FIG. 12, a system 1200 is provided. The system 1200 includes an input source 1210, a pipeline 1220, a viewing medium 1230, and a user 1240. In various implementations, the system 1200 is the same as the system 600 of FIG. 6 and/or the same as the system 1100 of FIG. 11. FIG. 12 also shows examples of user interactions with different modules of the pipeline 1220.
FIG. 12 shows the pipeline 1220 including a variety of modules, as described elsewhere in this application. The modules include:
- (i) a matching module 1221, as described, for example, with respect to the operation 720 and/or section 2.1. or section 2.2. above,
- (ii) a camera module 1222, as described, for example, with respect to the operation 730 and/or section 2.2. above,
- (iii) a depth correction module 1223, as described, for example, with respect to the operation 750 and/or section 2.3. above,
- (iv) a depth refinement module 1224, as described, for example, with respect to the operation 760 and/or section 2.4. above,
- (v) a depth remapping module 1225, as described, for example, with respect to the operation 770 and/or section 2.5. above,
- (vi) a warping module 1226, as described, for example, with respect to the operation 780 and/or section 2.6. above, and
- (vii) a rendering module 1227, as described, for example, with respect to the operation 790 and/or section 2.7. above.

FIG. 12 also shows examples of user interactions with the pipeline 1220. At a broad level, the user 1240 selects, in various implementations, which modules of the pipeline 1220 to use. The user 1240 interacts, in various implementations, with none, some, or all of the modules 1221-1227. For each module 1221-1227, the user 1240 performs, in various implementations, none, some, or all of the possible interactions. A common additional interaction, in various implementations, is to simply inspect the current parameters for one or more of the modules 1221-1227.
The system 1200 includes examples of user interactions, referred to herein as tasks 1241a-1247, associated with respective modules 1221-1227 of the pipeline 1220. The system 1200 includes the following tasks, details of which are provided for certain implementations in the discussion of those implementations elsewhere in this application (a configuration sketch follows the list):
- (i) Associated with the matching module 1221 and the camera module 1222, three tasks are shown, including:
- (a) a task 1241a of selecting a matching scheme, for example, stereo matching, sparse feature matching, or dense feature matching,
- (b) a task 1241b of selecting parameters for a matching scheme, and
- (c) a task 1241c of inputting and/or changing some (or all) camera parameters.
- (ii) Associated with the depth correction module 1223, three tasks are shown, including:
- (a) a task 1243a of providing object segmentation,
- (b) a task 1243b of providing a partial or complete depth map, and
- (c) a task 1243c of providing one or more manual corrections to a depth map.
- (iii) Associated with the depth refinement module 1224, three tasks are shown, including:
- (a) a task 1244a of modifying segmentation, such as, for example, object segmentation,
- (b) a task 1244b of choosing one or more filters to apply, and
- (c) a task 1244c of choosing an interpolation scheme.
- (iv) Associated with the depth remapping module 1225, two tasks are shown, including:
- (a) a task 1245a of selecting functions, such as, for example, linear, log, exponential, or polynomial mapping functions, and
- (b) a task 1245b of selecting one or more function parameters, such as, for example, the factor "f" for linear mapping and/or the scale factor "sc" for log mapping.
- (v) Associated with the warping module 1226, two tasks are shown, including:
- (a) a task 1246a of selecting a warping method, such as, for example, shifting, TPS, or polynomial warps, and
- (b) a task 1246b of selecting one or more parameters for a warping method.
- (vi) Associated with the rendering module 1227, a single task 1247 is shown. The task 1247 is adjusting the stereo pair to a particular viewing medium.
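The user-selectable options of the tasks 1241a-1247 can be collected into a single configuration that drives the pipeline 1220, as in the sketch below. The keys, values, and defaults are hypothetical examples of how such selections might be recorded; they are not a specification of the modules themselves. A pipeline driver would then dispatch each module with its corresponding entry, falling back to defaults for any module the user chooses not to configure.

# Illustrative configuration gathering the user-selectable options of FIG. 12;
# all names and default values are hypothetical.
pipeline_config = {
    "matching":   {"scheme": "sparse",                               # task 1241a
                   "params": {"max_features": 5000}},                # task 1241b
    "camera":     {"known_params": None},                            # task 1241c
    "correction": {"segmentation": None, "manual_edits": []},        # tasks 1243a-c
    "refinement": {"filters": ["median"], "interpolation": "nearest"},  # tasks 1244a-c
    "remapping":  {"function": "log", "f": 1.0, "sc": 10.0},         # tasks 1245a-b
    "warping":    {"method": "shift", "params": {}},                 # tasks 1246a-b
    "rendering":  {"viewing_medium": "anaglyph"},                    # task 1247
}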
Referring again to FIG. 13, a system 1300 is provided. The system 1300 includes an input source 1310, a pipeline 1320, a viewing medium 1330, and a user 1340. In various implementations, the system 1300 is the same as the system 600 of FIG. 6, the system 1100 of FIG. 11, and/or the system 1200 of FIG. 12.
FIG. 13 also shows examples of user interactions with the viewing medium 1330. The system 1300 includes the following tasks, details of which are provided for certain implementations in the discussion of those implementations elsewhere in this application:
- (i) A task 1342 is shown for selecting a 3D viewing method. For example, the user 1340 selects, in particular implementations, a suitable 3D viewing format. Such formats include, for example, anaglyph or interleaved (an anaglyph sketch is provided after this list).
- (ii) A task 1344 is shown for selecting one or more parts of a converted video to view and/or inspect. The video that is being inspected is, for example, a warped image such as that created in the operation 780 of the process 700. In particular implementations, for example, the user 1340 can choose to view only parts of the result (for example, the warped image) and interact again with the input video or the pipeline to enhance the results until the results are satisfactory. Thus, the system 1300 forms a loop in which the user 1340 can perform multiple iterative passes over the different units of the system 1300 in order to achieve satisfactory results.
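For the task 1342, a red/cyan anaglyph is one simple viewing format that can be composed directly from the stereo pair. The sketch below assumes 8-bit BGR channel order and is illustrative only; the function name is not part of the described implementations.

import numpy as np

def render_anaglyph(left_bgr, right_bgr):
    # Red/cyan anaglyph composition of a stereo pair.
    anaglyph = np.empty_like(left_bgr)
    anaglyph[..., 2] = left_bgr[..., 2]     # red channel from the left view
    anaglyph[..., 1] = right_bgr[..., 1]    # green channel from the right view
    anaglyph[..., 0] = right_bgr[..., 0]    # blue channel from the right view
    return anaglyph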
5. Some additional implementations
Referring to FIG. 14, a process 1400 is shown for providing a stereoscopic image pair. Various implementations of the process 1400 include, for example, the processes 700, 800a, 800b, 900a, 900b, 1000a, and 1000b.
The process 1400 includes accessing a particular image from a first view (1410). The process 1400 includes determining disparity values for multiple pixels of the particular image (1420). Various implementations determine the disparity values using a processor-based algorithm. A processor-based algorithm includes any algorithm operating on, or suited to be operated on, a processor. Such algorithms include, for example, fully automated algorithms and will generally include semi-automated algorithms. Processor-based algorithms permit user input to be received. The process 1400 includes warping the particular image to a second view based on the disparity values, to produce a warped image from the second view (1430). The term "warping", as used in this application, is intended to be a broad term that includes any mechanism to convert an image from a first view to a second view.
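At the level of the process 1400, the operations 1410-1440 (including the providing operation 1440 described next) can be read as the following composition, in which the two callables stand in for whichever disparity-determination and warping methods a given implementation selects; the names are placeholders, not part of the described implementations.

def process_1400(particular_image, determine_disparity, warp_to_second_view):
    # 1410: the particular image from the first view has been accessed.
    # 1420: determine disparity values for multiple pixels of the image.
    disparity = determine_disparity(particular_image)
    # 1430: warp the image to the second view based on the disparity values.
    warped_image = warp_to_second_view(particular_image, disparity)
    # 1440: provide both images as a three-dimensional stereo pair.
    return particular_image, warped_image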
The process 1400 includes providing the particular image and the warped image as a three-dimensional stereoscopic pair of images (1440).

Referring to FIG. 15, a system or apparatus 1500 is shown, to which the features and principles described above may be applied. The system or apparatus 1500 may be, for example, a system for transmitting a signal using any of a variety of media, such as, for example, satellite, cable, telephone-line, terrestrial broadcast, infra-red, or radio frequency. The system or apparatus 1500 also, or
alternatively, may be used, for example, to provide a signal for storage. The transmission may be provided, for example, over the Internet or some other network, or line of sight. The system or apparatus 1500 is capable of generating and delivering, for example, video content and other content, for use in, for example, providing a 2D or 3D video presentation. It should also be clear that the blocks of FIG. 15 provide a flow diagram of a process, in addition to providing a block diagram of a system or apparatus.
The system or apparatus 1500 receives an input video sequence from a processor 1501. In one implementation, the processor 1501 is part of the system or apparatus 1500. The input video sequence is, in various implementations, (i) an original input video sequence as described, for example, with respect to the input source 610, and/or (ii) a sequence of 3D stereoscopic image pairs as described, for example, with respect to the output of the pipeline 620. Thus, the processor 1501 is configured, in various implementations, to perform one or more of the methods described in this application. In various implementations, the processor 1501 is configured for performing one or more of the process 700, the process 800a, the process 800b, the process 900a, the process 900b, the process 1000a, the process 1000b, or the process 1400.
The system or apparatus 1500 includes an encoder 1502 and a
transmitter/receiver 1504 capable of transmitting the encoded signal. The encoder 1502 receives, for example, one or more input images from the processor 1501. The encoder 1502 generates an encoded signal(s) based on the input signal and, in certain implementations, metadata information. The encoder 1502 may be, for example, an AVC encoder. The AVC encoder may be applied to both video and other information. The encoder 1502 may include sub-modules, including for example an assembly unit for receiving and assembling various pieces of information into a structured format for storage or transmission. The various pieces of information may include, for example, coded or uncoded video, and coded or uncoded elements such as, for example, motion vectors, coding mode indicators, and syntax elements. In some implementations, the encoder 1502 includes the processor 1501 and therefore performs the operations of the processor 1501.
The transmitter/receiver 1504 receives the encoded signal(s) from the encoder
1502 and transmits the encoded signal(s) in one or more output signals. Typical transmitters perform functions such as, for example, one or more of providing error-correction coding, interleaving the data in the signal, randomizing the energy in the signal, and modulating the signal onto one or more carriers using a modulator/demodulator 1506. The transmitter/receiver 1504 may include, or interface with, an antenna (not shown). Further, implementations of the transmitter/receiver 1504 may be limited to the modulator/demodulator 1506.
The system or apparatus 1500 is also communicatively coupled to a storage unit 1508. In one implementation, the storage unit 1508 is coupled to the encoder 1502, and the storage unit 1508 stores an encoded bitstream from the encoder 1502. In another implementation, the storage unit 1508 is coupled to the transmitter/receiver 1504, and stores a bitstream from the transmitter/receiver 1504. The bitstream from the transmitter/receiver 1504 may include, for example, one or more encoded bitstreams that have been further processed by the transmitter/receiver 1504. The storage unit 1508 is, in different
implementations, one or more of a standard DVD, a Blu-Ray disc, a hard drive, or some other storage device.
The system or apparatus 1500 is also communicatively coupled to a presentation device 1509, such as, for example, a television, a computer, a laptop, a tablet, or a cell phone. Various implementations provide the presentation device 1509 and the processor 1501 in a single integrated unit, such as, for example, a tablet or a laptop. The processor 1501 provides an input to the presentation device 1509. The input includes, for example, a video sequence intended for processing with a 2D-to-3D conversion algorithm. Thus, the presentation device 1509 is, in various implementations, the viewing medium 630. The input includes, as another example, a stereoscopic video sequence prepared using, in part, a conversion process described in this application. Referring to FIG. 16, a system or apparatus 1600 is shown to which the features and principles described above may be applied. The system or apparatus 1600 may be configured to receive signals over a variety of media, such as, for example, satellite, cable, telephone-line, terrestrial broadcast, infra-red, or radio frequency. The signals may be received, for example, over the Internet or some other network, or by line-of-sight. It should also be clear that the blocks of FIG. 16 provide a flow diagram of a process, in addition to providing a block diagram of a system or apparatus.
The system or apparatus 1600 may be, for example, a cell-phone, a computer, a tablet, a set-top box, a television, a gateway, a router, or other device that, for example, receives encoded video content and provides decoded video content for processing.
The system or apparatus 1600 is capable of receiving and processing content information, and the content information may include, for example, video images and/or metadata. The system or apparatus 1600 includes a transmitter/receiver 1602 for receiving an encoded signal, such as, for example, the signals described in the implementations of this application. The transmitter/receiver 1602 receives, in various implementations, for example, a signal providing one or more of a signal output from the system 1500 of FIG. 15, or a signal providing a transmission of a video sequence such as, for example, a 2D or 3D video sequence intended for display on the viewing medium 630.
Typical receivers perform functions such as, for example, one or more of receiving a modulated and encoded data signal, demodulating the data signal from one or more carriers using a modulator/demodulator 1604, de-randomizing the energy in the signal, de-interleaving the data in the signal, and error-correction decoding the signal. The transmitter/receiver 1602 may include, or interface with, an antenna (not shown). Implementations of the transmitter/receiver 1602 may be limited to the modulator/demodulator 1604.
The system or apparatus 1600 includes a decoder 1606. The
transmitter/receiver 1602 provides a received signal to the decoder 1606. The signal provided to the decoder 1606 by the transmitter/receiver 1602 may include one or more encoded bitstreams. The decoder 1606 outputs a decoded signal, such as, for example, a decoded display plane. The decoder 1606 is, in various implementations, for example, an AVC decoder.
The system or apparatus 1600 is also communicatively coupled to a storage unit 1607. In one implementation, the storage unit 1607 is coupled to the
transmitter/receiver 1602, and the transmitter/receiver 1602 accesses a bitstream from the storage unit 1607. In another implementation, the storage unit 1607 is coupled to the decoder 1606, and the decoder 1606 accesses a bitstream from the storage unit 1607. The bitstream accessed from the storage unit 1607 includes, in different implementations, one or more encoded bitstreams. The storage unit 1607 is, in different implementations, one or more of a standard DVD, a Blu-Ray disc, a hard drive, or some other storage device.
The output video from the decoder 1606 is provided, in one implementation, to a processor 1608. The processor 1608 is, in one implementation, a processor configured for performing, for example, all or part of the process 700, the process 800a, the process 800b, the process 900a, the process 900b, the process 1000a, the process 1000b, or the process 1400. In another implementation, the processor 1608 is configured for performing one or more other post-processing operations.
In some implementations, the decoder 1606 includes the processor 1608 and therefore performs the operations of the processor 1608. In other
implementations, the processor 1608 is part of a downstream device such as, for example, a set-top box, a tablet, a router, or a television. More generally, the processor 1608 and/or the system or apparatus 1600 are, in various implementations, part of a gateway, a router, a set-top box, a tablet, a television, or a computer.
The processor 1608 is also communicatively coupled to a presentation device 1609, such as, for example, a television, a computer, a laptop, a tablet, or a cell phone. Various implementations provide the presentation device 1609 and the processor 1608 in a single integrated unit, such as, for example, a tablet or a laptop. The processor 1608 provides an input to the presentation device 1609. The input includes, for example, a video sequence intended for processing with a 2D-to-3D conversion algorithm. Thus, the presentation device 1609 is, in various implementations, the viewing medium 630. The input includes, as another example, a stereoscopic video sequence prepared using, in part, a conversion process described in this application.
The system or apparatus 1600 is also configured to receive input from a user or other input source. The input is received, in typical implementations, by the processor 1608 using a mechanism not explicitly shown in FIG. 16. The input mechanism includes, in various implementations, a mouse or a microphone. In various implementations, however, the input is received through the presentation device 1609, such as, for example, when the presentation device is a touch screen. In at least one implementation, the input includes user input as described, for example, with respect to FIGS. 1 1 -13.
The system or apparatus 1600 is also configured to provide a signal that includes data, such as, for example, a video sequence to a remote device. The signal is, for example, modulated using the modulator/demodulator 1604 and transmitted using the transmitter/receiver 1602.
Referring again to FIG. 15, the system or apparatus 1500 is further configured to receive input, such as, for example, a video sequence. The input is received by the transmitter/receiver 1504, and provided to the processor 1501. In various implementations, the processor 1501 performs a 2D-to-3D conversion process on the input.

Referring again to FIG. 6, the operations performed by the pipeline 620 are, in various implementations, performed by a single processor. In other
implementations, the operations are performed by multiple processors working in a collective manner to provide an output result. Various implementations provide one or more of the following advantages and/or features:
- (i) Depth is explicitly calculated/determined, or subsequently derived/estimated, for each image/scene (and, for certain implementations, each pixel), rather than, for example, using a restricted set of depth models. These implementations avoid the use of, for example, a fixed restricted set of depth models. For example, one such depth model assumes that the lower half of an image has depth such that it is always closer to the viewer than the top half of the image. Another depth model uses a box structure in which the central part of the image is placed at a greater depth from the viewer than the other parts of the image.
- (ii) A fully automatic and near-real time system is provided for converting a 2D image to a 3D stereoscopic image pair. Such implementations avoid the need to have a human user/expert who can mark the objects, and assign
depth/disparity/shifts to each image pixel or object. The time, effort, and cost associated with using a human user/expert increase with the number of pixels, objects, and the general "complexity" of the scene in the image.
- (iii) A semi-automatic system allowing user input is provided. Such implementations typically have the ability to increase accuracy in certain images or parts of images. In addition, changes are often desired in the resultant 3D content for special effects (such as, for example, to bring a certain part of an image into greater user attention).
- (iv) The depth map is processed as a 2D image and converted to a suitable disparity map for warping, rather than, for example, explicitly reconstructing the 3D points and rendering from differing viewpoints. Various such implementations avoid the difficulties that are often encountered in reconstructing a sparse set of 3D points corresponding to a set of scene points which consistently occur over a certain duration in a given image sequence. The sparse set of 3D points provides knowledge of depth of these specific points but not of other pixels in the image. The distribution of these pixels is often non-regular and highly sparse. Accordingly, interpolation of depth to the other pixels frequently fails to produce a depth map that closely matches the scene. Such a depth map typically leads to less than the desired quality in a generated stereo pair or in a rendering of the complete image.
- (v) The explicit detection of possible errors in an estimated depth map and the removal of errors using various degrees of automated and user-guided steps allows generation of higher quality 3D content. One of the methods for error removal uses general image segments, in which the segment boundaries may or may not correspond to object boundaries. Various implementations do not need to have the segments correspond closely to objects in the scene. However, various implementations take advantage of object boundary knowledge, when such knowledge is available. The absence of object boundary knowledge does not necessarily lower the quality of the generated 3D content in various implementations.
- (vi) Various implementations also provide for explicit remapping of
depth/disparity to change relative placement of depth/disparity layers in a scene. This is useful, for example, in applications in which one or more of the following is desired: (a) accommodating a user preference for depth range, (b) allowing content rendering based on viewing medium (size of medium, distance from medium to viewer, or other aspects), and/or (c) content modification for special effects such as, for example, viewer attention and focus. It is noted that some implementations have particular advantages, or
disadvantages. However, a discussion of the disadvantages of an
implementation does not eliminate the advantages of that implementation, nor indicate that the implementation is not a viable and even recommended implementation. Various implementations generate or process signals and/or signal structures. Such signals are formed, in certain implementations, using pseudo-code or syntax. Signals are produced, in various implementations, at the outputs of (i) the stereo rendering operations 790, 890a, 890b, 990a, 990b, 1090a, 1090b, or 1440, (ii) any of the processors 1501 and 1608, (iii) the encoder 1502, (iv) any of the transmitter/receivers 1504 and 1602, or (v) the decoder 1606. The signal and/or the signal structure is transmitted and/or stored (for example, on a processor-readable medium) in various implementations.
This application provides multiple block/flow diagrams, including the block/flow diagrams of FIGS. 6-16. It should be clear that the block/flow diagrams of this application present both a flow diagram describing a process, and a block diagram describing functional blocks of an apparatus, device, or system.
Further, the block/flow diagrams illustrate relationships among the components and outputs of the components. Additionally, this application provides multiple pictorial representations, including the pictorial representations of FIGS. 1-5. It should be clear that the pictorial representations of FIGS. 1-5 present a visual representation of a feature or concept. Additionally, FIGS. 1-5 also present a visual representation of a device and/or process related to the feature or concept that is depicted. Additionally, many of the operations, blocks, inputs, or outputs of the
implementations described in this application are optional, even if not explicitly stated in the descriptions and discussions of these implementations. For example, many of the operations discussed with respect to FIG. 7 can be omitted in various implementations. The mere recitation of a feature in a particular implementation does not indicate that the feature is mandatory for all
implementations. Indeed, the opposite conclusion should generally be the default, and all features are considered optional unless such a feature is stated to be required. Even if a feature is stated to be required, that requirement is intended to apply only to that specific implementation, and other implementations are assumed to be free from such a requirement. We thus provide one or more implementations having particular features and aspects. In particular, we provide several implementations related to converting a 2D image to a 3D stereoscopic image pair. Such conversions, as described in various implementations in this application, can be used in a variety of environments, including, for example, creating another view in a 2D-to-3D conversion process, and rendering additional views for 2D applications.
Additional variations of these implementations and additional applications are contemplated and within our disclosure, and features and aspects of described implementations may be adapted for other implementations. Several of the implementations and features described in this application may be used in the context of the AVC Standard, and/or AVC with the MVC extension (Annex H), and/or AVC with the SVC extension (Annex G). AVC refers to the existing International Organization for Standardization/International
Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group-4 (MPEG-4) Part 10 Advanced Video Coding (AVC) standard/International
Telecommunication Union, Telecommunication Sector (ITU-T) H.264
Recommendation (referred to in this application as the "H.264/MPEG-4 AVC Standard" or variations thereof, such as the "AVC standard", the "H.264 standard", "H.264/AVC", or simply "AVC" or "H.264"). Additionally, these implementations and features may be used in the context of another standard (existing or future), or in a context that does not involve a standard.
Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment. Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, evaluating the information, predicting the information, or retrieving the information from memory.
Further, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the
information, evaluating the information, or estimating the information.
This application or its claims may refer to "providing" information from, for example, a first device (or location) to a second device (or location). This application or its claims may also, or alternatively, refer, for example, to
"receiving" information from the second device (or location) at the first device (or location). Such "providing" or "receiving" is understood to include, at least, direct and indirect connections. Thus, intermediaries between the first and second devices (or locations) are contemplated and within the scope of the terms "providing" and "receiving". For example, if the information is provided from the first location to an intermediary location, and then provided from the intermediary location to the second location, then the information has been provided from the first location to the second location. Similarly, if the information is received at an intermediary location from the first location, and then received at the second location from the intermediary location, then the information has been received from the first location at the second location. Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
Various implementations refer to "images" and/or "pictures". The terms "image" and "picture" are used interchangeably throughout this document, and are intended to be broad terms. An "image" or a "picture" may be, for example, all or part of a frame or of a field. The term "video" refers to a sequence of images (or pictures). An image, or a picture, may include, for example, any of various video components or their combinations. Such components, or their combinations, include, for example, luminance, chrominance, Y (of YUV or YCbCr or YPbPr), U (of YUV), V (of YUV), Cb (of YCbCr), Cr (of YCbCr), Pb (of YPbPr), Pr (of
YPbPr), red (of RGB), green (of RGB), blue (of RGB), S-Video, and negatives or positives of any of these components. An "image" or a "picture" may also, or alternatively, refer to various different types of content, including, for example, typical two-dimensional video, a disparity map for a 2D video picture, a depth map that corresponds to a 2D video picture, or an edge map.
Further, many implementations may refer to a "frame". However, such
implementations are assumed to be equally applicable to a "picture" or "image".
A "mask", or similar terms, is also intended to be a broad term. A mask generally refers, for example, to a picture that includes a particular type of information. However, a mask may include other types of information not indicated by its name. For example, a background mask, or a foreground mask, typically includes information indicating whether pixels are part of the foreground and/or background. However, such a mask may also include other information, such as, for example, layer information if there are multiple foreground layers and/or background layers. Additionally, masks may provide the information in various formats, including, for example, bit flags and/or integer values.
Similarly, a "map" (for example, a "depth map", a "disparity map", or an "edge map"), or similar terms, are also intended to be broad terms. A map generally refers, for example, to a picture that includes a particular type of information. However, a map may include other types of information not indicated by its name. For example, a depth map typically includes depth information, but may also include other information such as, for example, video or edge information. Additionally, maps may provide the information in various formats, including, for example, bit flags and/or integer values.
It is to be appreciated that the use of any of the following "/", "and/or", and "at least one of", for example, in the cases of "A/B", "A and/or B" and "at least one of A and B", is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of "A, B, and/or C" and "at least one of A, B, and C" and "at least one of A, B, or C", such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.
Additionally, many implementations may be implemented in one or more of an encoder (for example, the encoder 1502), a decoder (for example, the decoder 1606), a post-processor (for example, the processor 1608) processing output from a decoder, or a pre-processor (for example, the processor 1501) providing input to an encoder.
The processors discussed in this application do, in various implementations, include multiple processors (sub-processors) that are collectively configured to perform, for example, a process, a function, or an operation. For example, the processor 1501 and the processor 1608 are each, in various implementations, composed of multiple sub-processors that are collectively configured to perform the operations of the respective processors 1501 and 1608. Further, other implementations are contemplated by this disclosure. The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a set-top box, a gateway, a router, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), tablets, laptops, and other devices that facilitate communication of information between end-users. A processor may also include multiple processors that are collectively configured to perform, for example, a process, a function, or an operation. The collective configuration and performance may be achieved using any of a variety of techniques known in the art, such as, for example, use of dedicated sub-processors for particular tasks, or use of parallel processing.
Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with inpainting, background estimation, rendering additional views, 2D-to-3D conversion, data encoding, data decoding, and other processing of images or other content. Examples of such equipment include a processor, an encoder, a decoder, a post-processor processing output from a decoder, a pre-processor providing input to an encoder, a video coder, a video decoder, a video codec, a web server, a set-top box, a laptop, a personal computer, a tablet, a router, a cell phone, a PDA, and other communication devices. As should be clear, the equipment may be mobile and even installed in a mobile vehicle. Additionally, the methods may be implemented by instructions being performed by a processor (or by multiple processors collectively configured to perform such instructions), and such instructions (and/or data values produced by an implementation) may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette ("CD"), an optical disc (such as, for example, a DVD, often referred to as a digital versatile disc or a digital video disc), a random access memory ("RAM"), or a read-only memory ("ROM"). The instructions may form an application program tangibly embodied on a processor-readable medium. Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two. A processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a processor-readable medium (such as a storage device) having instructions for carrying out a process. Further, a processor-readable medium may store, in addition to or in lieu of instructions, data values produced by an implementation.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be
transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application.

Claims

1. A method comprising:
accessing a particular image from a first view;
determining disparity values for multiple pixels of the particular image using a processor-based algorithm;
warping the particular image to a second view based on the disparity values, to produce a warped image from the second view; and
providing the particular image and the warped image as a three-dimensional stereo pair of images.
2. The method of claim 1 wherein:
determining disparity comprises using temporal information from a time-spaced image in a sequence of images that includes the particular image.
3. The method of claim 1 wherein:
determining disparity values comprises calculating the disparity values for the multiple pixels on a per-pixel basis.
4. The method of claim 1 wherein:
determining disparity values comprises performing stereo matching between the particular image and a time-spaced image in a sequence of images that includes the particular image.
5. The method of claim 4 wherein:
the disparity values are determined directly from the stereo matching.
6. The method of claim 1 wherein:
determining disparity comprises determining depth values and converting the depth values to disparity values.
7. The method of claim 6 wherein:
determining depth values comprises using temporal information from a time-spaced image in a sequence of images that includes the particular image.
8. The method of claim 6 wherein determining depth values comprises using one or more of camera information or motion information from one or more time-spaced images in a sequence of images that includes the particular image.
9. The method of claim 6 wherein:
determining depth values comprises (i) performing stereo matching between the particular image and a time-spaced image in a sequence of images that includes the particular image to produce initial disparity values, and (ii) converting the initial disparity values to the depth values.
10. The method of claim 9 further comprising:
rectifying the particular image prior to performing the stereo matching; and applying an inverse rectification to a depth map that includes the depth values.
11. The method of claim 6 wherein:
determining depth values comprises calculating the depth values for the multiple pixels on a per-pixel basis.
12. The method of claim 6 wherein:
determining depth values is based on multiple projections for a feature point and on camera parameters for the multiple projections.
13. The method of claim 12 wherein:
the multiple projections for the feature point are based on feature matching between the particular image and a time-spaced image in the sequence of images.
14. The method of claim 12 wherein:
the camera parameters are estimated.
15. The method of claim 6 wherein determining depth values comprises: sampling the particular image in a manner that favors selection of non-edge pixels as sample pixels; and
determining depth values for the sample pixels.
16. The method of claim 15 further comprising:
modifying the determined depth values to reduce
17. The method of claim 15 further comprising:
interpolating the determined depth values to generate depth values for pixels in the particular image other than the sample pixels.
18. The method of claim 1 wherein:
determining disparity values comprises performing block matching or feature matching without performing object segmentation and/or identification.
19. The method of claim 1 wherein:
determining disparity values using the processor-based algorithm is performed in an automated manner without user intervention.
20. The method of claim 1 further comprising:
identifying the multiple pixels using the processor-based algorithm.
21. The method of claim 20 wherein:
determining disparity values and identifying the multiple pixels, using the processor-based algorithm, are performed in an automated manner without user intervention.
22. The method of claim 1 further comprising:
receiving user input selecting a parameter for use in one or more of (i) determining the disparity values, (ii) warping the particular image, or (iii) providing the particular image and the warped image.
23. The method of claim 1 wherein determining the disparity values comprises changing one or more of: (i) a perceived depth of one or more objects in the particular image, (ii) a perceived depth of a depth layer in the particular image, or (iii) a relative separation among depth layers in the particular image.
24. The method of claim 1 further comprising:
performing a matching operation between the particular image and one or more time-spaced images in a sequence of images that includes the particular image;
obtaining depth values for a depth map, the obtained depth values being for the multiple pixels based on the matching operation;
correcting one or more of the depth values in the depth map to produce one or more corrected depth values in the depth map; and
producing additional depth values for the depth map based on depth values, obtained depth values or corrected depth values, already in the depth map, the additional depth values being for additional pixels in the particular image beyond the multiple pixels,
wherein determining the disparity values is based on depth values, obtained depth values or corrected depth values or additional depth values, in the depth map.
25. The method of claim 24 further comprising:
estimating camera parameters for multiple camera positions represented by the one or more time-spaced images and the particular image.
26. The method of claim 24 further comprising:
receiving user input selecting a parameter for use in one or more of (i) performing the matching operation, (ii) obtaining the depth values, (iii) correcting one or more of the depth values, (iv) producing the additional depth values, (v) determining the disparity values, (vi) warping the particular image, or (vii) providing the particular image and the warped image.
27. An apparatus comprising:
means for accessing a particular image from a first view;
means for determining disparity values for multiple pixels of the particular image using a processor-based algorithm;
means for warping the particular image to a second view based on the disparity values, to produce a warped image from the second view; and
means for providing the particular image and the warped image as a three-dimensional stereo pair of images.
28. An apparatus comprising one or more processors collectively configured to perform at least the following:
accessing a particular image from a first view;
determining disparity values for multiple pixels of the particular image using a processor-based algorithm;
warping the particular image to a second view based on the disparity values, to produce a warped image from the second view; and
providing the particular image and the warped image as a three-dimensional stereo pair of images.
29. The apparatus of claim 28 further comprising:
a modulator for modulating a signal with data indicating the particular image and the warped image, for transmission of the three-dimensional stereo pair of images.
30. The apparatus of claim 28 further comprising:
a demodulator for demodulating a signal that includes data indicating the particular image.
31. The apparatus of claim 28 wherein the apparatus comprises one or more of an encoder, a decoder, a modulator, a demodulator, a receiver, a set-top box, a gateway, a router, a tablet, or a laptop.
32. A processor readable medium having stored thereon instructions that when executed cause one or more devices to collectively perform at least the following:
accessing a particular image from a first view;
determining disparity values for multiple pixels of the particular image using a processor-based algorithm;
warping the particular image to a second view based on the disparity values, to produce a warped image from the second view; and
providing the particular image and the warped image as a three-dimensional stereo pair of images.
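
As one possible realization of the warping recited in claim 1 (and mirrored in the apparatus and medium claims 27, 28, and 32), the Python sketch below forward-warps the particular image by its per-pixel horizontal disparities to produce the image for the second view. The z-buffer test and the returned hole mask are assumed implementation details, not elements taken from the claims.

```python
import numpy as np

def warp_to_second_view(image, disparity):
    """Forward-warp `image` (H x W x 3) by per-pixel horizontal `disparity` (H x W, pixels)."""
    h, w = disparity.shape
    warped = np.zeros_like(image)
    best = np.full((h, w), -np.inf)      # z-buffer: the larger (nearer) disparity wins
    for y in range(h):
        for x in range(w):
            xt = int(round(x + disparity[y, x]))
            if 0 <= xt < w and disparity[y, x] > best[y, xt]:
                warped[y, xt] = image[y, x]
                best[y, xt] = disparity[y, x]
    holes = ~np.isfinite(best)           # target positions that received no source pixel
    return warped, holes

def make_stereo_pair(image, disparity):
    warped, _holes = warp_to_second_view(image, disparity)
    return image, warped                 # provide original + warped as the stereo pair
```

Keeping the larger disparity at collisions approximates occlusion handling; the hole mask marks disoccluded regions that would normally be in-painted before the pair is provided for display.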
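Claims 4 and 9 recite stereo matching between the particular image and a time-spaced image from the same sequence. A minimal sketch, assuming roughly horizontal camera motion, approximately rectified frames, an OpenCV semi-global matcher with illustrative parameters, and an estimated focal length and baseline; none of these choices are mandated by the claims.

```python
import numpy as np
import cv2

def temporal_stereo_depth(frame_t, frame_t_minus_k, focal_px, baseline):
    """Match the particular image against a time-spaced frame and convert the
    resulting disparities to depth via Z = f * B / d (claim 9); the raw
    disparities could also be used directly (claims 4 and 5)."""
    gray_l = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(frame_t_minus_k, cv2.COLOR_BGR2GRAY)
    matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
    disparity = matcher.compute(gray_l, gray_r).astype(np.float32) / 16.0  # SGBM output is fixed-point
    disparity[disparity <= 0] = np.nan       # unmatched or invalid pixels
    depth = focal_px * baseline / disparity  # baseline = estimated camera translation between frames
    return disparity, depth
```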
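Claim 6 recites converting depth values to disparity values. For a parallel (rectified) virtual camera pair the usual relation is d = f·B/Z; the sketch below assumes that geometry, and the gain and offset arguments are one assumed way of realizing the depth remapping of claim 23.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline, gain=1.0, offset=0.0, eps=1e-6):
    """Convert a depth map Z (same units as `baseline`) to pixel disparities d = f*B/Z."""
    depth = np.maximum(np.asarray(depth, dtype=np.float64), eps)  # guard against Z = 0
    return gain * (focal_px * baseline / depth) + offset          # optional perceived-depth remapping
```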
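Claims 15 and 17 recite sampling that favors non-edge pixels and interpolating the sampled depths to the remaining pixels. The sketch below is one assumed realization using a Canny edge mask, a regular sampling grid, and scattered-data interpolation; the estimate_depth_at callback, the grid step, and the edge thresholds are hypothetical illustrations rather than elements of the application.

```python
import numpy as np
import cv2
from scipy.interpolate import griddata

def sparse_depth_then_densify(image_bgr, estimate_depth_at, step=8):
    """Sample depth at non-edge grid pixels, then interpolate a dense depth map.
    `estimate_depth_at(y, x)` is a hypothetical per-pixel depth estimator."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200) > 0          # illustrative thresholds
    h, w = gray.shape

    points, values = [], []
    for y in range(0, h, step):
        for x in range(0, w, step):
            if not edges[y, x]:                    # favor non-edge pixels as samples
                points.append((y, x))
                values.append(estimate_depth_at(y, x))

    points = np.asarray(points, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    grid_y, grid_x = np.mgrid[0:h, 0:w]

    dense = griddata(points, values, (grid_y, grid_x), method='linear')
    nearest = griddata(points, values, (grid_y, grid_x), method='nearest')
    return np.where(np.isnan(dense), nearest, dense)  # fill gaps linear interpolation misses
```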

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2012/021590 WO2013109252A1 (en) 2012-01-17 2012-01-17 Generating an image for another view

Publications (1)

Publication Number Publication Date
WO2013109252A1 true WO2013109252A1 (en) 2013-07-25

Family

ID=45558417

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2012/021590 WO2013109252A1 (en) 2012-01-17 2012-01-17 Generating an image for another view

Country Status (1)

Country Link
WO (1) WO2013109252A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1999003068A1 (en) * 1997-07-07 1999-01-21 Reveo, Inc. Method and apparatus for monoscopic to stereoscopic image conversion
US20070024614A1 (en) * 2005-07-26 2007-02-01 Tam Wa J Generating a depth map from a two-dimensional source image for stereoscopic and multiview imaging
US20110096832A1 (en) * 2009-10-23 2011-04-28 Qualcomm Incorporated Depth map generation techniques for conversion of 2d video data to 3d video data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAUFF ET AL: "Depth map creation and image-based rendering for advanced 3DTV services providing interoperability and scalability", SIGNAL PROCESSING. IMAGE COMMUNICATION, ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, NL, vol. 22, no. 2, 16 March 2007 (2007-03-16), pages 217 - 234, XP005938670, ISSN: 0923-5965, DOI: 10.1016/J.IMAGE.2006.11.013 *
MATSUMOTO Y ET AL: "CONVERSION SYSTEM OF MONOCULAR IMAGE SEQUENCE TO STEREO USING MOTION PARALLAX", SPIE PROCEEDINGS, THE INTERNATIONAL SOCIETY FOR OPTICAL ENGINEERING - SPIE, BELLINGHAM, WASHINGTON, USA, vol. 3012, 11 February 1997 (1997-02-11), pages 108 - 115, XP008000606, ISSN: 0277-786X, DOI: 10.1117/12.274446 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9762905B2 (en) 2013-03-22 2017-09-12 Qualcomm Incorporated Disparity vector refinement in video coding
US9967546B2 (en) 2013-10-29 2018-05-08 Vefxi Corporation Method and apparatus for converting 2D-images and videos to 3D for consumer, commercial and professional applications
US10250864B2 (en) 2013-10-30 2019-04-02 Vefxi Corporation Method and apparatus for generating enhanced 3D-effects for real-time and offline applications
TWI497444B (en) * 2013-11-27 2015-08-21 Au Optronics Corp Method and apparatus for converting 2d image to 3d image
CN104766275B (en) * 2014-01-02 2017-09-08 株式会社理光 Sparse disparities figure denseization method and apparatus
CN104766275A (en) * 2014-01-02 2015-07-08 株式会社理光 Method and device for making sparse disparity map dense
US10158847B2 (en) 2014-06-19 2018-12-18 Vefxi Corporation Real—time stereo 3D and autostereoscopic 3D video and image editing
WO2016092533A1 (en) * 2014-12-09 2016-06-16 Inuitive Ltd. A method for obtaining and merging multi-resolution data
US10397540B2 (en) 2014-12-09 2019-08-27 Inuitive Ltd. Method for obtaining and merging multi-resolution data
WO2017051407A1 (en) * 2015-09-21 2017-03-30 Inuitive Ltd. Storing data retrieved from different sensors for generating a 3-d image
US10349040B2 (en) 2015-09-21 2019-07-09 Inuitive Ltd. Storing data retrieved from different sensors for generating a 3-D image
US10225542B2 (en) 2015-11-13 2019-03-05 Vefxi Corporation 3D system including rendering with angular compensation
US10284837B2 (en) 2015-11-13 2019-05-07 Vefxi Corporation 3D system including lens modeling
US10148932B2 (en) 2015-11-13 2018-12-04 Vefxi Corporation 3D system including object separation
US11652973B2 (en) 2015-11-13 2023-05-16 Vefxi Corporation 3D system
US10121280B2 (en) 2015-11-13 2018-11-06 Vefxi Corporation 3D system including rendering with three dimensional transformation
US10122987B2 (en) 2015-11-13 2018-11-06 Vefxi Corporation 3D system including additional 2D to 3D conversion
US10242448B2 (en) 2015-11-13 2019-03-26 Vefxi Corporation 3D system including queue management
US20170142396A1 (en) * 2015-11-13 2017-05-18 Craig Peterson 3d system including object separation
US10277877B2 (en) 2015-11-13 2019-04-30 Vefxi Corporation 3D system including a neural network
US10277879B2 (en) 2015-11-13 2019-04-30 Vefxi Corporation 3D system including rendering with eye displacement
US10277880B2 (en) 2015-11-13 2019-04-30 Vefxi Corporation 3D system including rendering with variable displacement
US10148933B2 (en) 2015-11-13 2018-12-04 Vefxi Corporation 3D system including rendering with shifted compensation
US20170142395A1 (en) * 2015-11-13 2017-05-18 Craig Peterson 3d system including pop out adjustment
US11070783B2 (en) 2015-11-13 2021-07-20 Vefxi Corporation 3D system
WO2017083509A1 (en) * 2015-11-13 2017-05-18 Craig Peterson Three dimensional system
US10721452B2 (en) 2015-11-13 2020-07-21 Vefxi Corporation 3D system
US10715782B2 (en) 2016-02-18 2020-07-14 Vefxi Corporation 3D system including a marker mode
US10375372B2 (en) 2016-02-18 2019-08-06 Vefxi Corporation 3D system including a marker mode
US10154244B2 (en) 2016-02-18 2018-12-11 Vefxi Corporation 3D system including a marker mode
US11024037B2 (en) 2018-11-15 2021-06-01 Samsung Electronics Co., Ltd. Foreground-background-aware atrous multiscale network for disparity estimation
US11720798B2 (en) 2018-11-15 2023-08-08 Samsung Electronics Co., Ltd. Foreground-background-aware atrous multiscale network for disparity estimation
CN110852995A (en) * 2019-10-22 2020-02-28 广东弓叶科技有限公司 Discrimination method of robot sorting system
CN115118949A (en) * 2021-03-22 2022-09-27 宏碁股份有限公司 Stereoscopic image generation method and electronic device using same

Similar Documents

Publication Publication Date Title
WO2013109252A1 (en) Generating an image for another view
EP2382791B1 (en) Depth and video co-processing
US9525858B2 (en) Depth or disparity map upscaling
US8508580B2 (en) Methods, systems, and computer-readable storage media for creating three-dimensional (3D) images of a scene
US9445072B2 (en) Synthesizing views based on image domain warping
US20140327736A1 (en) External depth map transformation method for conversion of two-dimensional images to stereoscopic images
US20110080466A1 (en) Automated processing of aligned and non-aligned images for creating two-view and multi-view stereoscopic 3d images
Lee et al. Discontinuity-adaptive depth map filtering for 3D view generation
Farre et al. Automatic content creation for multiview autostereoscopic displays using image domain warping
Schmeing et al. Depth image based rendering: A faithful approach for the disocclusion problem
Pourazad et al. An H.264-based scheme for 2D to 3D video conversion
US9019344B2 (en) Apparatus and method for adjusting the perceived depth of 3D visual content
Schmeing et al. Edge-aware depth image filtering using color segmentation
JP2015087851A (en) Image processor and image processing program
EP2680224B1 (en) Method and device for determining a depth image
Rodrigues et al. Blind quality assessment of 3-D synthesized views based on hybrid feature classes
Jung et al. 2D to 3D conversion with motion-type adaptive depth estimation
US9787980B2 (en) Auxiliary information map upsampling
EP2557537B1 (en) Method and image processing device for processing disparity
Gunnewiek et al. Coherent spatial and temporal occlusion generation
Zarb et al. Depth-based image processing for 3d video rendering applications
Wei et al. Iterative depth recovery for multi-view video synthesis from stereo videos
Wang et al. Image domain warping for stereoscopic 3D applications
Farid et al. No-reference quality metric for hevc compression distortion estimation in depth maps
Shao et al. Depth map compression and depth-aided view rendering for a three-dimensional video system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 12701814; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 12701814; Country of ref document: EP; Kind code of ref document: A1