SYSTEMS AND METHODS FOR THREE-DIMENSIONAL VIDEO
GENERATION
DESCRIPTION
Related Applications
[0001] This application is based upon and claims the benefit of priority from U.S. Provisional Patent Application No. 61/231,285, filed August 4, 2009, the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to systems and methods for three-dimensional video generation.
Background
[0003] Three-dimensional (3D) TV has been anticipated as part of a next wave of promising technologies for consumer electronics. In principle, 3D technologies incorporate a third dimension of depth into an image, which may provide a stereoscopic perception to a viewer of the image.
[0004] Currently, there are limited 3D video content sources in the market. Therefore, different methods to generate 3D video content have been studied and developed. One of the methods is to convert two-dimensional (2D) video content to 3D video content, which may fully employ existing 2D video content sources. However, some disclosed conversion techniques may not be ready for use due to their high computational complexity or unsatisfactory quality.
SUMMARY
[0005] According to a first aspect of the present disclosure, there is provided a device for generating a three-dimensional (3D) video based on a two-dimensional (2D) image sequence including at least one 2D image, comprising: an object segmentation and tracking module configured to segment out objects in the 2D image and track the objects in the 2D image sequence; an object
classification module coupled to the object segmentation and tracking module and configured to classify the objects in the 2D image; an object orientation estimation module coupled to the object classification module and configured to estimate orientations of the objects and generate depth information for the 2D image; a scene structure estimation module coupled to the object classification module and configured to estimate a scene structure in the 2D image based on the object classification; an object topological relationship determination module coupled to the scene structure estimation module and the object orientation estimation module, and configured to determine a topological relationship of the objects in the 2D image based on the estimated scene structure and the estimated orientations of the objects; and a depth-image based rendering module coupled to the object topological relationship determination module and configured to generate the 3D video, based on the depth information for the 2D image and the topological relationship of the objects.
[0006] According to a second aspect of the present disclosure, there is provided a computer-implemented method for generating a three-dimensional (3D) video based on a two-dimensional (2D) image sequence including at least one 2D image, comprising: segmenting out objects in the 2D image and tracking the objects in the 2D image sequence; classifying the objects in the 2D image; estimating orientations of the objects and generating depth information for the 2D image; estimating a scene structure in the 2D image based on the classification of the objects; determining a topological relationship of the objects in the 2D image based on the estimated scene structure and the estimated orientations of the objects; and generating the 3D video based on the depth information for the 2D image and the topological relationship of the objects.
[0007] According to a third aspect of the present disclosure, there is provided a computer-readable medium including instructions, executable by a processor of a three-dimensional (3D) video generating system, for performing a method for generating a 3D video based on a two-dimensional (2D) image
sequence including at least one 2D image, the method comprising: segmenting out objects in the 2D image and tracking the objects in the 2D image sequence; classifying the objects in the 2D image; estimating orientations of the objects and generating depth information for the 2D image; estimating a scene structure in the 2D image based on the classification of the objects; determining a topological relationship of the objects in the 2D image based on the estimated scene structure and the estimated orientations of the objects; and generating the 3D video based on the depth information for the 2D image and the topological relationship of the objects.
[0008] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
[0010] Fig. 1 illustrates a block diagram of a system for generating a 3D video, according to an exemplary embodiment.
[0011] Fig. 2 illustrates a block diagram of a 3D video generator, according to an exemplary embodiment.
[0012] Fig. 3A shows an exemplary 2D image received by a 3D video generator, according to an exemplary embodiment.
[0013] Fig. 3B shows exemplary segmentation results for a 2D image, according to an exemplary embodiment.
[0014] Fig. 3C shows exemplary classification results for a 2D image, according to an exemplary embodiment.
[0015] Fig. 3D illustrates a method for performing object orientation estimation, according to an exemplary embodiment.
[0016] Fig. 4 illustrates a method for performing object orientation estimation, according to an exemplary embodiment.
[0017] Fig. 5 illustrates a flowchart of a method for generating a 3D video, according to an exemplary embodiment.
DESCRIPTION OF THE EMBODIMENTS
[0018] Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of systems and methods
consistent with aspects related to the invention as recited in the appended claims.
[0019] Fig. 1 illustrates a block diagram of a system 100 for generating a three-dimensional (3D) video, according to an exemplary embodiment. The system 100 may include or be connectable to a two-dimensional (2D) video content source, such as a video storage medium 102 or a media server 104 connected to a network 106. The system 100 may further include a video device 108, a 3D video generator 110, and a display device 112.
[0020] In exemplary embodiments, the video storage medium 102 may be any medium for storing video content. For example, the video storage medium 102 may be provided as a compact disc (CD), a digital video disc (DVD), a hard disk, a magnetic tape, a flash memory card/drive, a volatile or non-volatile memory, a holographic data storage device, or any other storage medium. The video storage medium 102 may be located within the video device 108, local to the video device 108, or remote from the video device 108.
[0021] In exemplary embodiments, the media server 104 may be a computer server that receives a request for 2D video content from the video device 108, processes the request, and provides 2D video content to the video
device 108 through the network 106. For example, the media server 104 may be a web server, an enterprise server, or any other type of computer server. The media server 104 is configured to accept requests from the video device 108 based on, e.g., a hypertext transfer protocol (HTTP) or other protocols that may initiate a video session, and to serve the video device 108 with 2D video content.
[0022] In exemplary embodiments, the network 106 may include a wide area network (WAN), a local area network (LAN), a wireless network suitable for packet-type communications, such as Internet communications, a broadcast network, or any combination thereof. The network 106 is configured to distribute digital or non-digital video content.
[0023] In exemplary embodiments, the video device 108 is a hardware device such as a computer, a personal digital assistant (PDA), a mobile phone, a laptop, a desktop, a videocassette recorder (VCR), a laserdisc player, a DVD player, a Blu-ray disc player, or any electronic device configured to output a 2D video, i.e., a 2D image sequence. The video device 108 may include software applications that allow the video device 108 to communicate with and receive 2D video content from, e.g., the video storage medium 102 or the media server 104. In addition, the video device 108 may, by means of included software
applications, transform the received 2D video content into digital format, if not already in digital format.
[0024] In exemplary embodiments, the 3D video generator 110 is configured to generate a 3D video based on the 2D image sequence outputted by the video device 108. The 3D video generator 110 may be implemented as a hardware device that is either stand-alone or incorporated into the video device 108, as software applications installed on the video device 108, or as a combination thereof. In addition, the 3D video generator 110 may include a processor, and a user interface to receive inputs from a user for facilitating the 3D video generating process, as described below. For example, the user interface may be
implemented with a hardware device, such as a keyboard or a mouse, to receive
the inputs from the user, and/or a software application configured to process the inputs received from the user. By involving user interaction in the 3D video generating process, efficiency and accuracy may be improved for the 3D video generator 110. Further, the 3D video generator 110 may store the generated 3D video in a storage device for later playing.
[0025] In exemplary embodiments, the display device 112 is configured to display images in the 2D image sequence and to present the generated 3D video. For example, the display device 112 may be provided as a monitor, a projector, or any other video display device. The display device 112 may also be a part of the video device 108. The user may view the images in the 2D image sequence on the display device 112 to provide inputs to the 3D video generator 110. The user may also watch the generated 3D video on the display device 112. It is to be understood that devices shown in Fig. 1 are for illustrative purposes. Certain devices may be removed or combined, and additional devices may be added.
[0026] Fig. 2 illustrates a block diagram of a 3D video generator 200, according to an exemplary embodiment. For example, the 3D video generator 200 may be the 3D video generator 110 (Fig. 1). The 3D video generator 200 may include an object segmentation and tracking module 202, an object classification module 204, an object orientation estimation module 206, a scene structure estimation module 208, an object topological relationship determination module 210, and a depth-image based rendering (DIBR) module 212. The 3D video generator 200 may further include user input interfaces, e.g., an object segmentation and labeling interface 214 and an object orientation and thickness specification interface 216.
[0027] In exemplary embodiments, the object segmentation and tracking module 202 is a hardware device or software configured to receive a 2D image sequence including at least one 2D image to process. The object segmentation and tracking module 202 may separate scene content in the 2D image into one or more constituent parts, i.e., semantic objects, referred to hereafter as objects.
For example, an object corresponds to a region or shape in the 2D image that represents an entity with a particular semantic meaning, such as a tree, a lake, a house, etc. The object segmentation and tracking module 202 detects the objects in the 2D image and segments out the detected semantic objects for further processing.
[0028] For example, in performing segmentation, the object segmentation and tracking module 202 may group pixels of the 2D image into different regions based on a homogeneous low-level feature, such as color, motion, or texture, each of the regions representing one of the objects. As a result, the object segmentation and tracking module 202 detects and segments out the objects in the 2D image.
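As an illustration only, the following Python sketch groups pixels into regions by color similarity using a simple k-means loop. The use of k-means over RGB values, the number of regions, and the function names are assumptions made for this example and are not prescribed by the disclosure; a practical implementation may also use motion or texture features, as noted above.

```python
import numpy as np

def segment_by_color(image, k=5, iters=20, seed=0):
    """Group pixels into k regions by color similarity.

    A minimal stand-in for the low-level-feature grouping described in
    paragraph [0028]; real systems may also use motion or texture cues.
    image: H x W x 3 float array with values in [0, 1].
    Returns an H x W integer label map (one label per region).
    """
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3)
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to the nearest color center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean color of its member pixels.
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels.reshape(h, w)

if __name__ == "__main__":
    demo = np.random.default_rng(1).random((120, 160, 3))
    print(segment_by_color(demo, k=4).shape)  # (120, 160)
```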
[0029] Fig. 3A shows an exemplary 2D image 300 received by the 3D video generator 200 (Fig. 2), according to an exemplary embodiment. The image 300 includes a sky object 302-1, a tree object 302-2, a road object 302-3, a grass object 302-4, and a human body object 302-5. Referring to Figs. 2 and 3A, the object segmentation and tracking module 202 segments out objects in the image 300 by grouping pixels of the image 300 into different regions.
[0030] Fig. 3B shows exemplary segmentation results 320 for the image 300 (Fig. 3A), according to an exemplary embodiment. Referring to Fig. 3B, the segmentation results 320 include a plurality of regions 322-1, 322-2, ..., 322-5, each of the regions 322-1, 322-2, ..., 322-5 having a different grayscale range and representing one of the objects in the image 300 (Fig. 3A).
[0031] Referring back to Fig. 2, in exemplary embodiments, the object segmentation and tracking module 202 may also operate together with the object segmentation and labeling interface 214 to perform user interactive segmentation. For example, a user may view the 2D image from a display device, such as the display device 112 (Fig. 1), to determine object boundaries in the 2D image, and input into the 3D video generator 200 object boundary information through the object segmentation and labeling interface 214. User input may be valuable in
facilitating the segmentation process and in providing accurate results more easily and reliably, according to the user's visual perception. The object segmentation and labeling interface 214 then provides the object boundary information to the object segmentation and tracking module 202.
[0032] In exemplary embodiments, the object segmentation and tracking module 202 is configured to perform user interactive segmentation for key frames of the 2D image sequence. For each of the key frames, the object segmentation and tracking module 202 operates together with the object segmentation and labeling interface 214 to perform user interactive segmentation, similar to the above description.
[0033] In exemplary embodiments, the object segmentation and tracking module 202 may further use an object tracking algorithm to track the objects in the 2D image sequence. For example, the object segmentation and tracking module 202 may detect how portions/sizes of the objects change with time in the 2D image sequence. In this manner, the object segmentation and tracking module 202 may detect for each of the objects an object evolving path in the 2D image sequence in the time domain.
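The disclosure does not prescribe a particular tracking algorithm. As one minimal, hypothetical illustration, segmented label maps of consecutive frames may be matched by region overlap, which approximates an object evolving path when frame-to-frame motion is small:

```python
import numpy as np

def track_by_overlap(labels_prev, labels_curr):
    """Match segmented regions across consecutive frames by area overlap.

    A simple illustration of the object-tracking step in paragraph [0033];
    the actual tracking algorithm is not specified in the disclosure.
    labels_prev, labels_curr: integer label maps of the same shape.
    Returns a dict mapping each current label to the previous label with
    which it overlaps most (its likely continuation over time).
    """
    matches = {}
    for curr_id in np.unique(labels_curr):
        mask = labels_curr == curr_id
        overlapping, counts = np.unique(labels_prev[mask], return_counts=True)
        matches[int(curr_id)] = int(overlapping[counts.argmax()])
    return matches

prev = np.array([[0, 0, 1], [0, 1, 1]])
curr = np.array([[0, 1, 1], [0, 1, 1]])
print(track_by_overlap(prev, curr))  # {0: 0, 1: 1}
```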
[0034] In exemplary embodiments, the 2D image sequence may be taken by a camera/camcorder in a zoom-in or zoom-out mode. As a result, scenes in the 2D image sequence may have a depth range changing with time. Sizes and positions of the objects in the scene may also change with time. Under such situations, user interactive segmentation may be utilized to provide correct object tracking and, thus, make subsequent generation of object depth information consistent for the 2D image sequence.
[0035] In exemplary embodiments, the object classification module 204 is a hardware device or software configured to receive object segmentation results from the object segmentation and tracking module 202, and to perform object classification based on a plurality of training data sets in an object database (not shown). Each of the training data sets includes a group of training images
representing an object category, such as a building category, a grass category, a tree category, a cow category, a sky category, a face category, a car category, a bicycle category, etc. Based on a group of training images for an object category, the object classification module 204 may classify or identify the objects in the 2D image, and assign category labels for the classified objects.
[0036] In exemplary embodiments, a training data set may initially include no training images or a relatively small number of training images. In such a situation, the object classification module 204 operates together with the object segmentation and labeling interface 214 to perform user interactive object classification. For example, when the object classification module 204 receives the object segmentation results from the object segmentation and tracking module 202, the user may view the 2D image from the display device to determine an object category for an object in the 2D image, and label that object with the determined object category through the object segmentation and labeling interface 214. As a result, the object classification module 204 learns the object category determined by the user, and stores the 2D image as a training image in the object database. Alternatively or additionally, training images may be preloaded to the object database.
[0037] Once the object classification module 204 learns the object category through user interaction, the object classification module 204 may further extract information regarding the learned object category. For example, mathematically, training images for the learned object category may be convolved with an array of band-pass filters, i.e., a filter bank, to generate a set of filter responses. The filter responses may then be clustered over the whole training data set to generate feature vectors in a high-dimensional space, referred to herein as textons, which may then be used for future object classification.
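As a non-limiting sketch of the texton idea, the following example convolves grayscale training images with a small bank of Gaussian-derivative and Laplacian-of-Gaussian filters and clusters the pooled responses. The specific filters, scales, and cluster count are assumptions made for illustration, not the filter bank of the disclosure.

```python
import numpy as np
from scipy import ndimage
from scipy.cluster.vq import kmeans2

def filter_responses(gray):
    """Apply a small band-pass-style filter bank to a grayscale image and
    stack the per-pixel responses (paragraph [0037], illustrative only)."""
    responses = []
    for sigma in (1.0, 2.0):
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(0, 1)))  # x-derivative of Gaussian
        responses.append(ndimage.gaussian_filter(gray, sigma, order=(1, 0)))  # y-derivative of Gaussian
        responses.append(ndimage.gaussian_laplace(gray, sigma))               # blob-like structure
    return np.stack(responses, axis=-1).reshape(-1, len(responses))

def learn_textons(training_images, n_textons=16):
    """Cluster filter responses pooled over a training set into textons,
    i.e., cluster centers in the filter-response space."""
    pooled = np.vstack([filter_responses(img) for img in training_images])
    textons, _ = kmeans2(pooled, n_textons, minit="points")
    return textons

demo_set = [np.random.default_rng(i).random((32, 32)) for i in range(3)]
print(learn_textons(demo_set, n_textons=8).shape)  # (8, 6)
```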
[0038] As the object classification module 204 learns additional object categories, the training data sets in the object database will be enlarged. In addition, features of the object categories may be further refined, eventually reducing the frequency of user interaction and thus making the object classification process more automatic.
[0039] Once object categories are identified for the 2D image, topological relationships among the objects may be analyzed from a semantic point of view. Further, automatic object classification may reduce the frequency of user interaction required for labeling. User interaction may also correct object classification errors and help build and refine the object database for future applications.
[0040] Fig. 3C shows exemplary object classification results for the image 300 (Fig. 3A), according to an exemplary embodiment. Referring to Fig. 3C, the object classification module 204 (Fig. 2) classifies the object corresponding to the region 322-1 as a sky object, classifies the object corresponding to the region 322-2 as a tree object, classifies the object corresponding to the region 322-3 as a road object, classifies the object corresponding to the region 322-4 as a grass object, and classifies the object corresponding to the region 322-5 as a human body object.
[0041] Referring back to Fig. 2, in exemplary embodiments, the scene structure estimation module 208 is a hardware device or software configured to analyze the 2D image to obtain a scene structure for the 2D image. Obtaining the scene structure may improve accuracy for depth information generation for the 2D image. For example, the scene structure estimation module 208 may perform the analysis based on linear perspective properties of the 2D image, by detecting a vanishing point and vanishing lines in the 2D image. The vanishing point represents a farthest point in a scene shown in the 2D image, and the vanishing lines each represent a direction in which depth increases. The vanishing lines converge at the vanishing point in the 2D image.
[0042] By detecting the vanishing point and the vanishing lines, the scene structure estimation module 208 obtains a projected far-to-near direction in the 2D image and, thus, obtains the estimated scene structure. In addition, the
scene structure estimation module 208 may take advantage of object
classification results from the object classification module 204 to more accurately estimate the scene structure in the 2D image.
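By way of illustration only, a vanishing point may be estimated as the least-squares intersection of candidate vanishing lines. The sketch below assumes line segments are already available from some line detector, which is outside its scope; the function names are hypothetical.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line coefficients (a, b, c) with a*x + b*y + c = 0
    through two points, normalized so that a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    c = -(a * x1 + b * y1)
    norm = np.hypot(a, b)
    return np.array([a, b, c]) / norm

def estimate_vanishing_point(segments):
    """Least-squares intersection of candidate vanishing lines
    (one possible realization of paragraph [0041]).

    segments: list of ((x1, y1), (x2, y2)) line segments.
    Returns (x, y) minimizing the summed squared distance to all lines.
    """
    lines = np.array([line_through(p, q) for p, q in segments])
    A, c = lines[:, :2], -lines[:, 2]
    vp, *_ = np.linalg.lstsq(A, c, rcond=None)
    return tuple(vp)

# Two converging "road edge" segments meeting near (100, 50):
print(estimate_vanishing_point([((0, 200), (100, 50)), ((200, 200), (100, 50))]))
```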
[0043] In exemplary embodiments, the object orientation estimation module 206 is configured to estimate depths, thicknesses, and orientations of the objects in the 2D image. Typically, each pixel of the objects in the 2D image corresponds to a depth. A depth of a pixel of an object represents a distance between a viewer and a part of the object corresponding to that pixel. A thickness of an object may be defined as a depth difference between a first pixel of the object corresponding to a maximum depth of the object and a second pixel of the object corresponding to a minimum depth of the object. The objects in the 2D image may be classified into first, second, and third estimation categories. The first estimation category includes the objects that are relatively far in the scene shown in the 2D image, such as the sky object 302-1 and the tree object 302-2 in the image 300 (Fig. 3A). The second estimation category includes the objects that are relatively near in the scene and have a relatively large thickness, such as the road object 302-3 and the grass object 302-4 in the image 300 (Fig. 3A). The third estimation category includes the objects that are relatively near in the scene and have a relatively small thickness, such as the human body object 302-5 in the image 300 (Fig. 3A). The user may view the 2D image from the display device to determine estimation categories for the objects in the 2D image.
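Given a per-pixel depth map and an object mask, the thickness definition above reduces to a one-line computation, shown here only as a clarifying sketch:

```python
import numpy as np

def object_thickness(depth_map, mask):
    """Thickness of an object as defined in paragraph [0043]: the difference
    between its maximum and minimum per-pixel depth.
    depth_map: H x W array of per-pixel depths; mask: boolean H x W object mask.
    """
    values = depth_map[mask]
    return float(values.max() - values.min())

depth = np.array([[3.0, 5.0], [2.0, 9.0]])
mask = np.array([[True, True], [False, True]])
print(object_thickness(depth, mask))  # 6.0
```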
[0044] In exemplary embodiments, the object orientation estimation module 206 may set depths for the first estimation category of objects equal to a default large value, and set thicknesses for the first estimation category of objects equal to zero. Alternatively, the user may specify the depths for the first estimation category of objects to be the default large value and specify the thicknesses for the first estimation category of objects to be zero, through the object orientation and thickness specification interface 216.
[0045] In exemplary embodiments, the object orientation estimation module 206 may estimate depths for the second and third estimation categories of objects. For example, the object orientation estimation module 206 may detect the vanishing point and the vanishing lines in the 2D image, and generate a depth map of the 2D image accordingly. More particularly, the object orientation estimation module 206 may generate different depth gradient planes relative to the vanishing point and the vanishing lines. The object orientation estimation module 206 then assigns depth levels to pixels of the 2D image according to the depth gradient planes. The object orientation estimation module 206 may additionally perform calibrations, and finally derive the depth map. In addition, the user may specify thicknesses for the second estimation category of objects through the object orientation and thickness specification interface 216. For example, the object orientation and thickness specification interface 216 may be implemented with a keyboard and a mouse. The user may determine a thickness of an object in the second estimation category according to visual perception, and input the determined thickness of the object through the keyboard.
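One simple, assumed interpretation of the depth gradient planes is to scale depth with distance from the vanishing point, as in the following illustrative sketch; an actual implementation may use per-plane gradients and the calibrations described above.

```python
import numpy as np

def depth_map_from_vanishing_point(height, width, vp, levels=255):
    """Assign per-pixel depth levels relative to a vanishing point.

    One simple reading of the depth-gradient-plane idea in paragraph [0045]:
    pixels closer to the vanishing point are treated as farther away
    (larger depth), scaled to integer levels 0..levels.
    vp: (x, y) vanishing point in pixel coordinates.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.hypot(xs - vp[0], ys - vp[1])
    # Far pixels (small distance to the vanishing point) get large depth values.
    depth = 1.0 - dist / dist.max()
    return np.round(depth * levels).astype(np.uint8)

dm = depth_map_from_vanishing_point(240, 320, vp=(160, 60))
print(dm.min(), dm.max())  # 0 255
```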
[0046] Because orientation generally matters for objects with relatively large thicknesses, the object orientation estimation module 206 may further estimate orientations for the second estimation category of objects. Fig. 3D illustrates a method 360 for performing object orientation estimation for the image 300 (Fig. 3A), according to an exemplary embodiment. Referring to Figs. 2, 3A, and 3D, in the exemplary embodiment, pixels of an object, such as the road object 302-3 or the grass object 302-4, aligned in the same horizontal line are approximately at the same depth in the image 300. Therefore, the object orientation estimation module 206 may estimate an orientation of the object by linking the farthest point of the object with the nearest point of the object, as shown by an arrow 362 for the road object 302-3 and an arrow 364 for the grass object 302-4. Alternatively, the user may specify, through the object orientation
and thickness specification interface 216, the orientation of the object by linking the farthest point of the object with the nearest point of the object. As described above, the object orientation and thickness specification interface 216 may be implemented with a keyboard and a mouse. The user may view the image 300 on the display device, and use the mouse to link the farthest point of the object with the nearest point of the object.
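For ground-like objects of the second estimation category, the farthest-to-nearest link may be approximated from the object mask alone, as in the following hypothetical sketch:

```python
import numpy as np

def estimate_ground_orientation(mask):
    """Link the farthest and nearest points of a ground-like object.

    Assumes, as in paragraph [0046], that pixels on the same horizontal
    line are at approximately the same depth, so the top row of the object
    is its farthest extent and the bottom row its nearest extent.
    mask: boolean H x W array marking the object's pixels.
    Returns ((x_far, y_far), (x_near, y_near)) defining the orientation arrow.
    """
    ys, xs = np.nonzero(mask)
    far_y, near_y = ys.min(), ys.max()
    far_x = int(xs[ys == far_y].mean())    # centroid of the farthest row
    near_x = int(xs[ys == near_y].mean())  # centroid of the nearest row
    return (far_x, int(far_y)), (near_x, int(near_y))

mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
print(estimate_ground_orientation(mask))  # ((2, 1), (2, 3))
```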
[0047] Fig. 4 illustrates a method 400 for performing object orientation estimation for an image 402, according to an exemplary embodiment. The image 402 shows a scene including a person object 404 and a wall object 406. In the exemplary embodiment, because the object 404 has a thickness, pixels of the object 404 aligned in the same horizontal line, e.g., a horizontal line 408, are not at the same depth in the image 402.
[0048] Referring to Figs. 2 and 4, the user may determine a near surface boundary 410 and a far surface boundary 412 for the object 404 through the object orientation and thickness specification interface 216. The user may then specify a number of main orientation directions, such as orientation directions 414-1, 414-2, and 414-3, from the far surface boundary 412 to the near surface boundary 410. The object orientation estimation module 206 may then interpolate additional orientation directions from the far surface boundary 412 to the near surface boundary 410, based on the main orientation directions specified by the user. As a result, the object orientation estimation module 206 determines an orientation for the object 404.
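The interpolation of additional orientation directions may, for illustration, be a simple linear blend between the user-specified main directions; the even spacing along the far surface boundary is an assumption of this sketch rather than a feature of the disclosure.

```python
import numpy as np

def interpolate_orientations(main_directions, n_samples):
    """Linearly interpolate orientation directions between user-specified
    main directions (paragraph [0048]). Positions along the far surface
    boundary are assumed to be evenly spaced; a real system would use the
    actual boundary geometry.
    main_directions: (k, 2) array of unit vectors from far to near surface.
    Returns an (n_samples, 2) array of unit direction vectors.
    """
    main = np.asarray(main_directions, dtype=float)
    k = len(main)
    t = np.linspace(0.0, k - 1.0, n_samples)
    x = np.interp(t, np.arange(k), main[:, 0])
    y = np.interp(t, np.arange(k), main[:, 1])
    dirs = np.stack([x, y], axis=1)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

print(interpolate_orientations([(1, 0), (0, 1)], 3))
```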
[0049] Referring back to Fig. 2, in exemplary embodiments, the object topological relationship determination module 210 is configured to determine a topological relationship among, e.g., relative positions of, the objects in the 2D image and maintain consistency of the topological relationship in the 2D image sequence, based on the orientations of the objects estimated by the object orientation estimation module 206 and the scene structure estimated by the scene structure estimation module 208. For example, the object topological relationship
determination module 210 may determine that an average depth of the human body object 302-5 (Fig. 3A) is substantially the same as an average depth of the nearest surface of the grass object 302-4 (Fig. 3A). Also for example, the object topological relationship determination module 210 may determine that an average depth of the wall object 406 (Fig. 4) is larger than an average depth of the far surface of the object 404 corresponding to the far surface boundary 412 (Fig. 4). The object topological relationship determination module 210 may determine the topological relationship among the objects in the 2D image based on the depth information for the 2D image.
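As a clarifying sketch only, average object depths may be computed directly from a depth map and a segmentation label map and then ordered from far to near; the consistency maintenance across the image sequence described above is not shown.

```python
import numpy as np

def average_object_depths(depth_map, labels):
    """Compute each segmented object's average depth and order the objects
    from far to near, one simple way to express the topological relationship
    discussed in paragraph [0049].
    depth_map: H x W array, larger value = farther (as in the sketches above).
    labels: H x W integer label map from segmentation.
    Returns a list of (label, mean_depth) pairs sorted far-to-near.
    """
    pairs = [(int(obj), float(depth_map[labels == obj].mean()))
             for obj in np.unique(labels)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

labels = np.array([[0, 0], [1, 1]])
depth = np.array([[200, 200], [40, 60]])
print(average_object_depths(depth, labels))  # [(0, 200.0), (1, 50.0)]
```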
[0050] In exemplary embodiments, the DIBR module 212 is a hardware device or software configured to apply DIBR algorithms to generate the 3D video for display. The DIBR algorithms may produce a 3D representation based on the topological relationship of the objects in the 2D image and the depth information for the 2D image. To achieve a better 3D effect, the DIBR module 212 may utilize depth information of one or more neighboring images in the 2D image sequence.
[0051] In exemplary embodiments, the DIBR algorithms may include 3D image warping. 3D image warping changes a view direction and a viewpoint of an object, and transforms pixels in a reference image of the object to a destination view in a 3D environment based on depth levels of the pixels. A function may be used to map pixels from the reference image to the destination view. The DIBR module 212 may adjust and reconstruct the destination view to achieve a better effect.
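A minimal form of such warping shifts each pixel horizontally by a depth-dependent disparity; the linear disparity model and the absence of hole filling in the sketch below are simplifying assumptions, not the mapping function of the disclosure.

```python
import numpy as np

def warp_to_virtual_view(image, depth, max_disparity=16):
    """Shift pixels horizontally by a depth-dependent disparity to form one
    virtual view, a minimal form of the 3D image warping in paragraph [0051].
    Near pixels (small depth value here) shift more than far pixels.
    image: H x W x 3 array; depth: H x W array with larger = farther.
    """
    h, w = depth.shape
    out = np.zeros_like(image)
    norm_depth = depth / max(depth.max(), 1e-6)
    disparity = np.round((1.0 - norm_depth) * max_disparity).astype(int)
    cols = np.arange(w)
    for y in range(h):
        new_cols = np.clip(cols + disparity[y], 0, w - 1)
        # Later writes win where shifts collide; real DIBR handles occlusion
        # ordering and fills the holes left behind.
        out[y, new_cols] = image[y, cols]
    return out

img = np.random.default_rng(0).random((4, 8, 3))
dm = np.tile(np.arange(8), (4, 1))  # depth increases toward the right
print(warp_to_virtual_view(img, dm).shape)  # (4, 8, 3)
```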
[0052] In exemplary embodiments, the DIBR algorithms may also include plenoptic image modeling. Plenoptic image modeling provides 3D scene information of an image visible from arbitrary viewpoints. The 3D scene information may be obtained by a function based on a set of reference images with depth information. These reference images are warped and combined to form 3D representations of the scene from a particular viewpoint. For an
improved effect, the DIBR module 212 may adjust and reconstruct the 3D scene information. Based on the 3D scene information, the DIBR module 212 may generate multi-view video frames for 3D displaying.
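As an assumed, highly simplified stand-in for this combination step, two warped reference views may be blended into intermediate viewpoints to form multi-view frames; real plenoptic modeling weights contributions by visibility and depth.

```python
import numpy as np

def blend_reference_views(left_view, right_view, alpha):
    """Combine two warped reference images into one intermediate viewpoint,
    a simplified stand-in for the view combination in paragraph [0052].
    alpha: 0.0 gives the left reference view, 1.0 the right.
    """
    return ((1.0 - alpha) * left_view + alpha * right_view).astype(left_view.dtype)

def multiview_frames(left_view, right_view, n_views=5):
    """Generate a set of intermediate views for multi-view 3D display."""
    return [blend_reference_views(left_view, right_view, a)
            for a in np.linspace(0.0, 1.0, n_views)]

left = np.zeros((2, 2, 3))
right = np.ones((2, 2, 3))
print(len(multiview_frames(left, right, n_views=5)))  # 5
```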
[0053] Fig. 5 illustrates a flowchart of a method 500 for generating a 3D video based on a 2D image sequence including at least one 2D image, according to an exemplary embodiment. Referring to Fig. 5, objects are segmented out in the 2D image and tracked in the 2D image sequence (502); the segmenting may be facilitated by user interaction. Based on the segmentation, the objects are further classified and identified, which may also be facilitated by user interaction (504). Orientations of the objects are then estimated based on the object classification results, and depth information of the 2D image is generated (506); this orientation estimation and depth information generation may likewise be facilitated by user interaction.
[0054] In addition, based on the object classification results, a scene structure is estimated for the 2D image (508). A topological relationship among the objects is then determined (510), based on the estimated scene structure and the estimated orientations of the objects. Depth-image based rendering is further applied to produce a 3D representation based on the depth information of the 2D image and the topological relationship of the objects (512). As the above process is repeated for each image in the 2D image sequence, the 3D video is generated.
[0055] The method disclosed herein may be implemented as a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., a machine-readable storage device, for execution by a data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. The computer program may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program, module, subroutine, or other unit
suitable for use in a computing environment. The computer program may be deployed to be executed on one computer, or on multiple computers.
[0056] In exemplary embodiments, there is also provided a computer-readable medium including instructions, executable by a processor in a 3D video generating system, for performing the above-described method for generating a 3D video based on a 2D image sequence.
[0057] A portion or all of the method disclosed herein may also be implemented by an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), a printed circuit board (PCB), a digital signal processor (DSP), a combination of programmable logic components and programmable interconnects, a single central processing unit (CPU) chip, or a CPU chip combined on a motherboard.
[0058] Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. The scope of the invention is intended to cover any variations, uses, or adaptations of the invention following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
[0059] It will be appreciated that the present invention is not limited to the exact construction that has been described above and illustrated in the
accompanying drawings, and that various modifications and changes can be made without departing from the scope thereof. It is intended that the scope of the invention only be limited by the appended claims.