US20130321575A1 - High definition bubbles for rendering free viewpoint video - Google Patents

High definition bubbles for rendering free viewpoint video

Info

Publication number
US20130321575A1
Authority
US
United States
Prior art keywords
fvv
high definition
regions
sub
computer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/598,747
Inventor
Adam Kirk
Neil Fishman
Don Gillett
Patrick Sweeney
Kanchan Mitra
David Eraker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US201261653983P
Application filed by Microsoft Corp filed Critical Microsoft Corp
Priority to US13/598,747
Assigned to MICROSOFT CORPORATION. Assignment of assignors interest (see document for details). Assignors: ERAKER, DAVID; FISHMAN, NEIL; GILLET, DON; KIRK, ADAM; SWEENEY, PATRICK; MITRA, KANCHAN
Publication of US20130321575A1
Assigned to MICROSOFT TECHNOLOGY LICENSING, LLC. Assignment of assignors interest (see document for details). Assignors: MICROSOFT CORPORATION
Application status is Abandoned


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00: 3D [Three Dimensional] image rendering
    • G06T 15/04: Texture mapping
    • G06T 15/08: Volume rendering
    • G06T 15/10: Geometric effects
    • G06T 15/20: Perspective computation
    • G06T 15/205: Image-based rendering
    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 2210/00: Indexing scheme for image generation or computer graphics
    • G06T 2210/56: Particle system, point based geometry or rendering
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00: Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10: Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106: Processing image signals
    • H04N 13/111: Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117: Transformation of image signals corresponding to virtual viewpoints, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • H04N 13/194: Transmission of image signals
    • H04N 13/20: Image signal generators
    • H04N 13/204: Image signal generators using stereoscopic image cameras
    • H04N 13/239: Image signal generators using stereoscopic image cameras using two 2D image sensors having a relative position equal to or related to the interocular distance
    • H04N 13/243: Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H04N 13/246: Calibration of cameras
    • H04N 13/257: Colour aspects
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working
    • H04N 7/141: Systems for two-way working between two video terminals, e.g. videophone
    • H04N 7/142: Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
    • H04N 7/15: Conference systems
    • H04N 7/157: Conference systems defining a virtual conference space and using avatars or agents
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2227/00: Details of public address [PA] systems covered by H04R27/00 but not provided for in any of its subgroups
    • H04R 2227/005: Audio distribution systems for home, i.e. multi-room use
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction

Abstract

A “Dynamic High Definition Bubble Framework” allows local clients to display and navigate FVV of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV. Generally, the FVV is presented to the user as a broad view of the overall area from some distance away. Then, as the user zooms in or changes viewpoints, one or more areas of the overall scene are provided in higher definition or fidelity. Therefore, rather than capturing and providing high definition everywhere (at high computational and bandwidth costs), the Dynamic High Definition Bubble Framework captures one or more “bubbles” or volumetric regions in higher definition in locations where it is believed that the user will be most interested. This information is then provided to individual clients, allowing them to navigate and zoom different regions of the FVV during playback without losing fidelity or resolution in the zoomed areas.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit under Title 35, U.S. Code, Section 119(e), of a previously filed U.S. Provisional Patent Application, Ser. No. 61/653,983 filed on May 31, 2012, by Simonnet, et al., and entitled “INTERACTIVE SPATIAL VIDEO,” the subject matter of which is incorporated herein by reference.
  • BACKGROUND
  • In general, in free-viewpoint video (FVV), multiple video streams are used to re-render a time-varying scene from arbitrary viewpoints. The creation and playback of a FVV is typically accomplished using a substantial amount of data. In particular, in FVV, scenes are generally simultaneously recorded from many different perspectives using sensors such as RGB cameras. This recorded data is then generally processed to extract 3D geometric information in the form of geometric proxies or models using various 3D reconstruction (3DR) algorithms. The original RGB data and geometric proxies are then recombined during rendering, using various image based rendering (IBR) algorithms, to generate multiple synthetic viewpoints.
  • Unfortunately, when a complex FVV such as a football game is recorded or otherwise captured, rendering the entire volume of the overall capture area to generate the FVV generally uses a very large dataset and a correspondingly large computational overhead for rendering the various viewpoints of the FVV for viewing on local clients.
  • SUMMARY
  • This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Further, while certain disadvantages of prior technologies may be noted or discussed herein, the claimed subject matter is not intended to be limited to implementations that may solve or address any or all of the disadvantages of those prior technologies.
  • In general, a “Dynamic High Definition Bubble Framework” as described herein provides various techniques that allow local clients to display free viewpoint video (FVV) of complex 3D scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV. These techniques allow the client to perform spatial navigation through the FVV, while changing viewpoints and/or zooming into one or more higher definition regions or areas (specifically defined and referred to herein as “high definition bubbles”) within the overall area or scene of the FVV.
  • More specifically, the Dynamic High Definition Bubble Framework enables local rendering of FVV by providing a lower fidelity geometric proxy of an overall scene or viewing area in combination with one or more higher fidelity geometric proxies of the scene corresponding to regions of interest (e.g., areas of action in the scene that the user may wish to view in expanded detail and from one or more different viewpoints). This allows the user to view the entire volume of the scene as FVV, with interesting features or regions of the scene being provided in higher detail and optionally from a plurality of user-selectable viewpoints, while reducing the amount of data that is transmitted to the client for local rendering of the FVV. Note that the high definition bubbles may have differing resolution or fidelity levels as well as differing numbers of viewpoints. Further, some of these viewpoints may be available at different resolutions or fidelity levels even within the same high definition bubble.
  • The Dynamic High Definition Bubble Framework enables these capabilities by providing multiple areas or sub-regions of higher definition video capture within the overall viewing area or scene. One implementation of this concept is to use multiple cameras (e.g., a camera array or the like) surrounding the scene to capture the scene or event holistically, in whatever resolution is desired. Concurrently, a set of cameras (e.g., a camera array or the like) that zoom in on particular regions of interest within the overall scene are used to create higher definition geometric proxies that enable a higher quality viewing experience of “bubbles” associated with the zoomed regions of the scene.
  • For example, various embodiments of the Dynamic High Definition Bubble Framework are enabled by using captured image or video data to create a 3D representation (or other visual representation of the “real” world) of the overall space of a scene. One or more sub-regions (i.e., high definition bubbles) of the larger space of the overall scene are then transferred to the client as high definition geometric proxies while the remaining areas of the overall scene are transferred to the client using lower resolution geometric proxies. Advantageously, the sub-regions represented by the high definition bubbles can be in fixed or predefined positions (e.g., the end zone of a football field) or can move within the larger area of the overall scene (e.g., camera arrays following a ball or a particular player in a soccer game). These high definition bubbles are enabled by using any desired combination of fixed and moving camera arrays to capture high-resolution image data within one or more regions of interest relative to the area of the overall scene.
  • Captured image data is then used to generate geometric proxies or 3D models of the scene for local rendering of the FVV from any available viewpoint and at any desired resolution corresponding to the selected viewpoint. Note also that the FVV can be pre-rendered and sent to the client as a viewable and navigable FVV.
  • In particular, when used to stream 3D geometric proxies or models and corresponding RGB data to the client to locally render the FVV, the techniques enabled by the Dynamic High Definition Bubble Framework serve to reduce the amount of data used to render a specific viewpoint and resolution selected by the user when viewing or navigating the FVV. This approach is also applicable to server side rendering performance, when a video frame is generated on the server and transmitted to the client. In the server side example, using lower fidelity representations of areas that are far away from a region of interest (i.e., the desired viewpoint) in combination with using higher fidelity representations of the regions of interest reduces the time and computational overhead needed for generating video frames prior to transmission to the client.
  • In other words, in various embodiments, the Dynamic High Definition Bubble Framework creates a navigable FVV that presents a general or remote view (e.g., relatively far back from the action) of an overall volumetric space and then chooses an optimal dataset to use to render various portions of the FVV at the desired resolutions/fidelity. This allows the Dynamic High Definition Bubble Framework to seamlessly support varying resolutions for different regions while optimally choosing the appropriate dataset to process for the desired output. Advantageously, rendering regions within the high definition bubbles using higher resolutions allows the user to zoom into those regions without creating pixelization artifacts or other zoom-based viewing problems. In other words, even though the user is zooming into particular areas or regions, the FVV displayed to the user does not lose fidelity or resolution in those zoomed areas.
  • In view of the above summary, it is clear that the Dynamic High Definition Bubble Framework described herein provides various techniques that allow local clients to display and navigate FVV of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV. In addition to the just described benefits, other advantages of the Dynamic High Definition Bubble Framework will become apparent from the detailed description that follows hereinafter when taken in conjunction with the accompanying drawing figures.
  • DESCRIPTION OF THE DRAWINGS
  • The specific features, aspects, and advantages of the claimed subject matter will become better understood with regard to the following description, appended claims, and accompanying drawings where:
  • FIG. 1 provides an exemplary architectural flow diagram that illustrates program modules for using a “Dynamic High Definition Bubble Framework” for creating and navigating free viewpoint videos (FVV) of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV to clients, as described herein.
  • FIG. 2 provides an illustration of high definition bubbles within an overall viewing area or scene, as described herein.
  • FIG. 3 provides an illustration of the use of separate camera arrays to capture a high definition bubble and an overall viewing area, as described herein.
  • FIG. 4 provides a general system flow diagram that illustrates exemplary methods for implementing various embodiments of the Dynamic High Definition Bubble Framework for creating and navigating FVV's having high definition bubbles, as described herein.
  • FIG. 5 is a general system diagram depicting a simplified general-purpose computing device having simplified computing and I/O capabilities for use in implementing various embodiments of the Dynamic High Definition Bubble Framework, as described herein.
  • DETAILED DESCRIPTION OF THE EMBODIMENTS
  • In the following description of the embodiments of the claimed subject matter, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific embodiments in which the claimed subject matter may be practiced. It should be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the presently claimed subject matter.
  • 1.0 Introduction:
  • Note that some or all of the concepts described herein are intended to be understood in view of the overall context of the discussion of “Interactive Spatial Video” provided in U.S. Provisional Patent Application, Ser. No. 61/653,983 filed on May 31, 2012, by Simonnet, et al., and entitled “INTERACTIVE SPATIAL VIDEO,” the subject matter of which is incorporated herein by reference.
  • Note that various examples discussed in the following paragraphs refer to football games and football stadiums for purposes of explanation. However, it should be understood that the techniques described herein are not limited to any particular location, any particular activities, any particular size of volumetric space, or any particular number of scenes or objects.
  • In general, when a complex free-viewpoint video (FVV) of 3D scenes is recorded, one or more overall capture areas typically surround the “action”, which is confined to one or more smaller volumetric areas or sub-regions within the overall capture area. For example, in a football game, the size of the field is relatively large, but at any given time, the interesting action is generally centered on the ball and one or more players or athletes around the ball. While it is technically feasible to capture and render the entire capture volume at full fidelity, this would typically result in the generation of very large datasets to be sent from the server to the client for local rendering.
  • Advantageously, a “Dynamic High Definition Bubble Framework,” as described herein, provides various techniques that specifically address such concerns by providing the client with one or more lower fidelity geometric proxies of an overall viewing area or volumetric space. Concurrently, the Dynamic High Definition Bubble Framework provides one or more sub-regions of the overall viewing area as higher fidelity representations. Local clients then use this information to view and navigate through the overall FVV while providing the user with the capability to zoom into areas of higher fidelity. In other words, the Dynamic High Definition Bubble Framework provides various techniques that allow local clients to display and navigate FVV of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV. Advantageously, rendering regions within the high definition bubbles using higher resolutions allows the user to zoom into those regions without creating pixelization artifacts or other zoom-based viewing problems. In other words, even though the user is zooming into particular areas or regions, the FVV displayed to the user does not lose fidelity or resolution in those zoomed areas.
  • More specifically, the Dynamic High Definition Bubble Framework enables local rendering of image frames of the FVV by providing a lower fidelity geometric proxy of an overall scene in combination with one or more higher fidelity geometric proxies of the scene corresponding to regions of interest (e.g., areas of action in the scene that the user may wish to view in expanded detail). This allows the user to view the entire volume of the scene as FVV, with interesting features or regions of the scene being provided in higher detail in the event that the user zooms into such regions, while reducing the amount of data that is transmitted to the client for local rendering of the FVV.
  • One implementation of this concept is to use multiple cameras (e.g., camera arrays or the like) surrounding the scene to capture the scene or event holistically, in whatever resolution is desired. Concurrently, a set of cameras that zoom in on particular regions of interest within the overall scene (such as the “action” in a football game where a player is carrying the ball) are used to capture data for creating higher definition geometric proxies that enable a higher quality viewing experience of “bubbles” associated with the zoomed regions of the scene. These bubbles are specifically defined and referred to herein as “high definition bubbles.” Further, depending upon the available camera data, multiple viewpoints of potentially varying resolution or fidelity may be available within each bubble.
  • For any given scenario (e.g., sporting events, movie scenes, concerts, etc.), the Dynamic High Definition Bubble Framework typically presents a broad view of the overall viewing area or volumetric space from some distance away. Then, as the user zooms in or changes viewpoints, one or more areas of the overall scene or viewing area are provided in higher definition or fidelity. Therefore, rather than providing high definition everywhere (at high computational and bandwidth costs), the Dynamic High Definition Bubble Framework captures one or more bubbles in higher definition in locations or regions where it is believed that the user will be most interested. In other words, an author of the FVV will use the Dynamic High Definition Bubble Framework to capture bubbles in places where it is believed that users may want more detail, or where the author wants users to be able to explore the FVV in greater detail.
  • Bubbles can be presented to the user in various ways. For example, in displaying the FVV to the user, the user is provided with the capability to zoom and/or change viewpoints (e.g., pans, tilts, rotations, etc.). In the case that the user zooms into a region corresponding to a high definition bubble, the user will be presented with higher resolution image frames during the zoom. As such, there is no need to demarcate explicit regions of the FVV that contain high definition bubbles.
  • In other words, the user is presented with the entire scene and as they scroll through it, more data is available in areas (i.e., bubbles) where there is higher detail. For example, by zooming into a high definition bubble around a football, the user will see that there is more detail available to them, while if they zoom into the grass near the edge of a field where there is less action, the user will see less detail (assuming that there is no corresponding high definition bubble there). Therefore, by placing bubbles in areas where the user is expected to look for higher detail (such as a tight view in and around the ball when it is fumbled), more detail is available to the user where it is most wanted, while it is unlikely the user will zoom into an area off to one side of the field distant from the play. Consequently, when the user does zoom into the area around the ball, it creates the illusion that the user can zoom in anywhere.
  • In alternate embodiments of the Dynamic High Definition Bubble Framework, the FVV is presented with thumbnails or highlighting within or near the overall scene to alert the user as to locations, regions or bubbles (and optionally available viewpoints) of higher definition. For example, the Dynamic High Definition Bubble Framework can provide a FVV of a boxing match where the overall ring is in low definition, but the two fighters are within a high definition bubble. In this case, the FVV may include indications of either or both the existence of the high definition bubble around the fighters and various available viewpoints within that bubble such as a view of the opponent from either boxer's perspective.
  • Advantageously, the Dynamic High Definition Bubble Framework allows different users to have completely different viewing experiences. For example, in the case of a football game, one user can be zoomed into a bubble around the ball, while another user is zoomed into a bubble around cheerleaders on the edge of the football field, while yet another user is zoomed out to see the overall action on the entire field. Further, the same user can watch the FVV multiple times using any of a number of available zooms into one or more high definition bubbles and from any of a number of available viewpoints relative to any of those high definition bubbles.
  • 1.1 System Overview:
  • As noted above, the “Dynamic High Definition Bubble Framework,” provides various techniques that allow local clients to display and navigate FVV of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV. The processes summarized above are illustrated by the general system diagram of FIG. 1. In particular, the system diagram of FIG. 1 illustrates the interrelationships between program modules for implementing various embodiments of the Dynamic High Definition Bubble Framework, as described herein. Furthermore, while the system diagram of FIG. 1 illustrates a high-level view of various embodiments of the Dynamic High Definition Bubble Framework, FIG. 1 is not intended to provide an exhaustive or complete illustration of every possible embodiment of the Dynamic High Definition Bubble Framework as described throughout this document.
  • In addition, it should be noted that any boxes and interconnections between boxes that may be represented by broken or dashed lines in FIG. 1 represent alternate embodiments of the Dynamic High Definition Bubble Framework described herein, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
  • In general, as illustrated by FIG. 1, the processes enabled by the Dynamic High Definition Bubble Framework begin operation by using a data capture module 100 that uses multiple cameras or arrays to capture and generate 3D scene data 120 (e.g., geometric proxies, 3D models, RGB or other color space data, textures, etc.) for an overall viewing area and one or more viewpoints for one or more high definition bubbles within the overall viewing area.
  • In various embodiments, a user input module 110 is used for various purposes, including, but not limited to, defining and configuring one or more cameras and/or camera arrays for capturing an overall viewing area and one or more high definition bubbles. The user input module 110 is also used in various embodiments to define or specify one or more high definition bubbles, one or more viewpoints or view frustums, resolution or level of detail for one or more of the bubbles and one or more of the viewpoints, etc.
  • Typically, local clients will render video frames of the FVV from 3D scene data 120. However, in various embodiments, a pre-rendering module 130 uses the 3D scene data 120 to pre-render one or more FVV's that are then provided to one or more clients for viewing and navigation. In either case, a data transmission module 140 transmits either the pre-rendered FVV or 3D scene data 120 to one or more clients. The Dynamic High Definition Bubble Framework conserves bandwidth when transmitting to the client by only sending sufficient 3D scene data 120 for the level of detail desired to render image frames corresponding to an initial virtual navigation viewpoint or viewing frustum or one selected by the client. Following receipt of the 3D scene data 120, local clients use a local rendering module 150 to render one or more FVV's 160 or image frames of the FVV.
  • Finally, a FVV playback module 170 provides user-navigable interactive playback of the FVV in response to user navigation and zoom commands. In general, the FVV playback module 170 allows the user to pan, zoom, or otherwise navigate through the FVV. Further, user pan, tilt, rotation and zoom information is provided back to the local rendering module 150 or to the data transmission module for use in retrieving the 3D scene data 120 needed to render subsequent image frames of the FVV corresponding to user interaction and navigation through the FVV.
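  • For illustration only, the following Python sketch (not part of the original disclosure; the Proxy, Bubble, and SceneData structures, the thresholds, and the function names are all hypothetical) outlines one way the module flow described above might be organized, with the data transmission step sending high definition bubble data only when the client's current viewpoint and zoom level call for it.

        from dataclasses import dataclass, field

        @dataclass
        class Proxy:
            """Geometric proxy plus RGB/texture data at a given fidelity level."""
            fidelity: int              # e.g., 0 = coarse overall scene, higher = finer
            payload_bytes: bytes = b""

        @dataclass
        class Bubble:
            """A high definition sub-region of the overall viewing area."""
            bubble_id: str
            center: tuple              # (x, y, z) in scene coordinates
            radius: float
            proxies: list = field(default_factory=list)   # one Proxy per fidelity/viewpoint

        @dataclass
        class SceneData:
            """Sketch of the 3D scene data: coarse overall proxy plus bubbles."""
            overall_proxy: Proxy
            bubbles: list

        def _distance(a, b) -> float:
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

        def transmit_for_view(scene: SceneData, viewpoint, zoom: float) -> list:
            """Transmission sketch: send only the data needed for the client's
            current virtual viewpoint and zoom level."""
            selected = [scene.overall_proxy]           # always send the coarse scene
            for bubble in scene.bubbles:
                if zoom > 1.5 and _distance(viewpoint, bubble.center) < bubble.radius * 4:
                    selected.extend(bubble.proxies)    # add high definition data on demand
            return selected

        # Usage: the playback module feeds navigation back into transmission/rendering.
        scene = SceneData(Proxy(fidelity=0),
                          [Bubble("end_zone", (0.0, 0.0, 0.0), 10.0, [Proxy(2)])])
        frame_data = transmit_for_view(scene, viewpoint=(5.0, 0.0, 2.0), zoom=2.0)
        print(len(frame_data), "proxies sent for this frame")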
  • 2.0 Operational Details:
  • The above-described program modules are employed for implementing various embodiments of the Dynamic High Definition Bubble Framework. As summarized above, the Dynamic High Definition Bubble Framework provides various techniques that allow local clients to display FVV of complex scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV.
  • The following sections provide a detailed discussion of the operation of various embodiments of the Dynamic High Definition Bubble Framework, and of exemplary methods for implementing the program modules described in Section 1 with respect to FIG. 1. In particular, the following sections provide examples and operational details of various embodiments of the Dynamic High Definition Bubble Framework, including: an operational overview of the Dynamic High Definition Bubble Framework; exemplary FVV scenarios enabled by the Dynamic High Definition Bubble Framework; and data capture scenarios and FVV generation.
  • 2.1 Operational Overview:
  • As noted above, the Dynamic High Definition Bubble Framework-based processes described herein provide various techniques that allow local clients to display and navigate FVV of complex multi-resolution and multi-viewpoint scenes while reducing computational overhead and bandwidth for rendering and/or transmitting the FVV.
  • FIG. 2 illustrates various high definition bubbles within an overall viewing area 200, scene, or volumetric space. The Dynamic High Definition Bubble Framework generally uses various cameras or camera arrays to capture the overall viewing area 200 at some desired resolution level. One or more high definition bubbles within the overall viewing area 200 are then captured using various cameras or camera arrays at higher resolution or fidelity levels. As illustrated by FIG. 2, these high definition bubbles (e.g., 210, 220, 230, 240, 250 and 260) can have arbitrary shapes, sizes and volumes. Further, high definition bubbles (e.g., 210, 220, 230) can be in fixed positions to capture particular regions of the overall scene that may be of interest (e.g., end zones in a football game). The high definition bubbles (e.g., 240, 250 and 260) may also represent dynamic regions that move to follow action along arbitrary paths (e.g., 240) or along fixed paths (e.g., 250 to 260). Note also that moving high definition bubbles may sometimes extend outside the overall viewing area 200 (e.g., 260), though this may result in FVV image frames in which only the content of that high definition bubble is visible. One or more high definition bubbles may also overlap (e.g., 230).
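  • As a hypothetical sketch only (not part of the original disclosure; the spherical shape and all names are assumptions made for the example), the following Python fragment shows one way fixed and moving high definition bubbles such as those of FIG. 2 might be represented, including a simple containment test and overlap test.

        import numpy as np

        class HDBubble:
            """Sketch of a volumetric high definition region; a sphere is used
            here for simplicity, although arbitrary shapes and volumes are possible."""
            def __init__(self, center, radius, path=None):
                self.center0 = np.asarray(center, dtype=float)
                self.radius = float(radius)
                self.path = path        # optional callable t -> displacement (moving bubble)

            def center(self, t=0.0):
                if self.path is None:   # fixed bubble (e.g., an end zone)
                    return self.center0
                return self.center0 + np.asarray(self.path(t))   # bubble following the action

            def contains(self, point, t=0.0):
                return np.linalg.norm(np.asarray(point, dtype=float) - self.center(t)) <= self.radius

            def overlaps(self, other, t=0.0):
                d = np.linalg.norm(self.center(t) - other.center(t))
                return d <= self.radius + other.radius

        # Example: a fixed end-zone bubble and a bubble that tracks the ball along a path.
        end_zone = HDBubble(center=(0, 0, 0), radius=10.0)
        ball = HDBubble(center=(20, 0, 0), radius=3.0, path=lambda t: (-2.0 * t, 0.0, 0.0))
        print(end_zone.contains((1, 2, 0)), end_zone.overlaps(ball, t=5.0))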
  • FIG. 3 illustrates the use of separate camera arrays to capture a high definition bubble 330 using a camera array (e.g., cameras 335, 340, 345 and 350) within an overall viewing area 300 that is in turn captured by a set of cameras (e.g., 305, 310, and 315) at a lower fidelity level than that of the high definition bubble.
  • Various embodiments of the Dynamic High Definition Bubble Framework are enabled by using captured image or video data to create a 3D representation (or other visual representation of the “real” world) of the overall space of a scene. One or more sub-regions (i.e., high definition bubbles) of the larger space of the overall scene are then transferred to the client as high definition geometric proxies or 3D models while the remaining areas of the overall scene are transferred to the client using lower definition geometric proxies or 3D models. Advantageously, as noted above, the sub-regions represented by the high definition bubbles can be in fixed or predefined positions (e.g., the end zone of a football field) or can move within the larger area of the overall scene (e.g., following a ball or a particular player in a soccer game). These high definition bubbles are enabled by using any desired combination of fixed and moving camera arrays to capture high-resolution image data within one or more regions of interest relative to the area or volume of the overall scene.
  • Consequently, when used to stream both 3D geometric and RGB data from the server to the client, the FVV processing techniques enabled by the Dynamic High Definition Bubble Framework serve to reduce the amount of data used to render a specific viewpoint selected by the user when viewing a FVV. This approach is also applicable to server side rendering performance, when a video frame is generated on the server and transmitted to the client. In the server side example, using lower fidelity representations of areas that are far away from a region of interest (i.e., the desired viewpoint) in combination with using higher fidelity representations of the regions of interest reduces the time and computational overhead needed for generating video frames prior to transmission to the client.
  • 2.2 Exemplary FVV Scenarios:
  • The Dynamic High Definition Bubble Framework enables a wide variety of viewing scenarios for clients or users. As noted above, since the user is provided with the opportunity to navigate and zoom the FVV during playback, the viewing experience can be substantially different for individual viewers of the same FVV.
  • For example, considering a football game in a typical stadium, the Dynamic High Definition Bubble Framework uses a number of cameras or camera arrays to capture sufficient views to create an overall 3D view of the stadium at low to medium definition or fidelity (i.e., any desired fidelity level). In addition, the Dynamic High Definition Bubble Framework will also capture one or more specific locations or “bubbles” at a higher definition or fidelity and with a plurality of available viewpoints. Note that these bubbles are captured using fixed or movable cameras or camera arrays. For example, again considering the football game, the Dynamic High Definition Bubble Framework may have fixed cameras or camera arrays around the end zone to capture high definition images in these regions at all times. Further, one or more sets of moving cameras or camera arrays can follow the ball or particular players around the field to capture images of the ball or players from multiple viewpoints.
  • Generally, in the case of a football field, it would be difficult to capture every part of the entire field and all of the action in high definition without using very large amounts of data. Consequently, the Dynamic High Definition Bubble Framework captures and provides an overall view of the field by using some number of cameras capturing the overall field. Then, the Dynamic High Definition Bubble Framework uses one or more sets of cameras that capture the regions around the ball, specific players, etc., so that the overall low definition general background of the football field can be augmented by user navigable high definition views of what is going on in 3D in the “bubbles.” In other words, in various embodiments, the Dynamic High Definition Bubble Framework generally presents a general or remote view (e.g., relatively far back from the action) of an overall volumetric space and then layers or combines navigable high definition bubbles with the overall volumetric space based on a determination of the proper geometric registration or alignment of those high definition bubbles within the overall volumetric space.
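  • To make the idea of geometric registration concrete, the following Python fragment is a minimal sketch (not part of the original disclosure; the rigid-transform formulation and the names are assumptions) of how a high definition bubble's geometric proxy might be placed into the coordinate frame of the overall volumetric space once a rotation and translation have been determined.

        import numpy as np

        def register_bubble_proxy(vertices, R, t):
            """Sketch: align a bubble proxy's vertices with the overall scene
            using a rigid transform (rotation R, translation t) obtained from
            geometric registration/alignment against the environment model."""
            vertices = np.asarray(vertices, dtype=float)       # (N, 3) proxy vertices
            return vertices @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)

        # Example with an identity rotation and a translation onto the field.
        proxy_vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
        aligned = register_bubble_proxy(proxy_vertices, R=np.eye(3), t=[50.0, 20.0, 0.0])
        print(aligned)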
  • In the case of a movie or the like, the Dynamic High Definition Bubble Framework enables the creation of movies where the user is provided with the capability to move around within a particular scene (i.e., change viewpoints) and to view particular parts of the scene, which are within bubbles, in higher definition while the movie is playing.
  • 2.3 Exemplary Data Capture Scenarios and FVV Generation:
  • The following paragraphs describe various examples of scenarios involving the physical placement and geometric configuration of various cameras and camera arrays within a football stadium to capture multiple high definition bubbles and virtual viewpoints for navigation of FVV's of a football game with associated close-ups and zooms corresponding to the high definition bubbles and virtual viewpoints. It should be understood that the following examples are provided only for purposes of explanation and are not intended to limit the scope or use of the Dynamic High Definition Bubble Framework to the examples presented, to the particular camera array configurations or geometries discussed, or to the positioning or use of particular high definition bubbles or virtual viewpoints.
  • In general, understanding where cameras or camera arrays will be deployed and the geometry associated with those cameras determines how the resulting 3D scene data will be processed in an interactive Spatial Video (SV) and subsequently rendered to create the FVV for the user or client. In the case of a typical professional football game, it is assumed that all cameras and related technology for capturing images, following action scenes or the ball, cutting to particular locations or persons, etc., exists inside or above the stadium. In some cases, the cameras will record elements before the game. In other cases, the cameras will be used in the live broadcast of the game. In this example, there are several primary configurations, including, but not necessarily limited to the following:
      • Asset Arrays—Camera arrays referred to as “asset arrays” are used to capture 3D image data of players, cheerleaders, coaches, referees, and any other items or people which may appear on the field before the game. Following processing of the raw image data, the output of these asset arrays is both an image intensive photorealistic rendering and a high fidelity geometric proxy similar to a CGI asset for any imaged items or people. This information can then be used in subsequent rendering of the FVV.
      • Environment Model—Mobile SLR cameras, mobile video cameras, laser range scanners, etc., are used to build an image-based geometric proxy for the stadium environment before the game from 3D image data captured by one or more camera arrays. This 3D image data is then generally used to generate a geometric proxy or 3D model of the overall environment. Further, this geometric proxy or 3D model can be edited or modified to suit particular purposes (e.g., modified to allow dynamic placement of advertising messages along a stadium wall or other location during playback of the resulting FVV).
      • Fixed Arrays—Fixed camera arrays are used to capture 3D image data of various game elements or features for insertion into the FVV. These elements include, but are not limited to announcers, ‘talking heads’, player interviews, intra-game fixed physical locations around the field, etc.
      • Moving Arrays—Mobile camera arrays are used to capture 3D image data of intra-game action on the field. Note that these are the same types of mobile cameras that are currently used to record action in professional football games, though additional numbers of cameras may be used to capture 3D image data of the intra-game action. Note that image or video data captured by fans viewing the game from inside the stadium using cell phones or other cameras can also be used by the Dynamic High Definition Bubble Framework to record intra-game action on the field.
  • 2.3.1 Asset Arrays:
  • In general, “asset arrays” are dense, fixed camera arrays optimized for creating a static (or moving) geometric proxy of an asset. Assets include any object or person who will be on the field such as players, cheerleaders, referees, footballs, or other equipment. The camera geometry of the asset arrays is optimized for the creation of high fidelity geometric proxies, which requires a ‘full 360’ arrangement of sensors so that all aspects of the asset can be recorded and modeled; additional sensors may be placed above or below the assets. Note that in some cases, ‘full 360’ coverage may not be possible (e.g., views partially obstructed along some range of viewing directions), and that in such cases, user selection of viewpoints in the resulting FVV will be limited to whatever viewpoints can be rendered from the captured data. In addition to RGB (or other color space) cameras in the asset array, other sensor combinations such as active IR based stereo (also used in Kinect® or time of flight type applications) can be used to assist in 3D reconstruction. Additional techniques such as the use of green screen backgrounds can further assist in segmentation of the assets for use in creating high fidelity geometric proxies of those assets.
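  • Purely as an illustration (not part of the original disclosure; the circular layout and names are assumptions), the following Python fragment computes evenly spaced camera positions and viewing directions for a simple ‘full 360’ ring around an asset.

        import math

        def ring_camera_poses(num_cameras, radius, target=(0.0, 0.0, 1.0)):
            """Sketch: place cameras evenly on a circle around an asset so all
            sides of the asset are covered; each pose is a position plus a unit
            viewing direction toward the asset center."""
            poses = []
            for i in range(num_cameras):
                angle = 2.0 * math.pi * i / num_cameras
                position = (radius * math.cos(angle), radius * math.sin(angle), target[2])
                direction = tuple((t - p) / radius for p, t in zip(position, target))
                poses.append((position, direction))
            return poses

        for position, direction in ring_camera_poses(num_cameras=8, radius=3.0):
            print(position, direction)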
  • Asset arrays are generally utilized prior to the game and focus on static representations of the assets. Once recorded, these assets can be used as SV content for creating FVV's in two different ways, depending on the degree of geometry employed in their representation using image-based rendering (IBR).
  • Firstly, a low-geometry IBR method, including, but not limited to, view interpolation, can be used to place the asset (players or cheerleaders) online using technology including, but not limited to, browser-based 2D or 3D rendering engines. This also allows users to view single assets with a web browser or the like to navigate around a coordinate system that allows them to zoom in to the players (or other assets) from any angle, thus providing the user or viewer with high levels of photorealism with respect to those assets. Again, rendering regions within the high definition bubbles using higher resolutions allows the user to zoom into those regions without losing fidelity or resolution in the zoomed areas, or otherwise creating pixelization artifacts or other zoom-based viewing problems. In other implementations, video can be used to highlight different player/cheerleader promotional activities such as a throw, catch, block, cheer, etc. Note that various examples of view interpolation and view morphing for such purposes are discussed in the aforementioned U.S. Provisional Patent Application, the subject matter of which is incorporated herein by reference.
  • Secondly, a high fidelity geometry proxy of the players (or other persons such as cheerleaders, referees, coaches, announcers, etc.) is created and combined with view dependent texture mapping (VDTM) for use in close up FVV scenarios. To use these geometric proxies in FVV, a kinematic model for a human is used as a baseline for possible motions and further articulated based on RGB data from live-action video camera arrays. Multi-angle video data is then used to realistically articulate the geometric proxies for all players or a subset of players on the field. Advantageously, 6 degrees of freedom (6-DOF) movement of the user's viewpoint during playback of FVV is possible due to the explicit use of 3D geometry in representing the assets. Again, various techniques for rendering and viewing the 3D content of the FVV are discussed in the aforementioned U.S. Provisional Patent Application, the subject matter of which is incorporated herein by reference.
  • 2.3.2 Environment Model:
  • A model of the environment is useful to the FVV of the football game in a number of different ways, such as providing a calibration framework for live-action moving cameras, creating interstitial effects when transitioning between known real camera feeds, determining the accurate placement (i.e., registration or alignment) of various geometric proxies (generated from the high definition bubbles) for FVV, improving segmentation results with background data, accurately representing the background of the scene using image-based-rendering methods in different FVV use cases, etc.
  • As is well known to those skilled in the art, a number of conventional techniques exist for modeling the environment using RGB (or other color space) photos with a sparse geometric representation of the scene. For example, in the case of Photosynth®, sparse geometry means that only enough geometry is extracted to enable the alignment of multiple photographs into a cohesive montage. However, in any scenario, such as the football game scenario, the Dynamic High Definition Bubble Framework provides richer 3D rendering by using much more geometry. More specifically, geometric proxies corresponding to each high definition bubble are registered or aligned to the geometry of the environment model. Once properly positioned, the various geometric proxies are then used to render the frames of the FVV.
  • Traditional environment models are often created using a variety of sensors such as moving video cameras, fixed cameras for high resolution static images, and laser based range scanning devices. RGB data from video cameras and fixed camera data can be processed using conventional 3D reconstruction methods to identify features and their location; point clouds of the stadium can be created from these features. Additional geometry, also in the form of point clouds, can be extracted using range scanning devices for additional accuracy. Finally, the point cloud data can be merged together, meshed, and textured into a cohesive geometric model. This geometry can also be used as an infrastructure to organize RGB data for use in other IBR approaches for backgrounds useful for FVV functionality.
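  • The following Python fragment is a minimal sketch only (not part of the original disclosure; the voxel-grid reduction step and all names are assumptions) of how point clouds from different sensors might be merged and thinned before being meshed and textured into a cohesive geometric model.

        import numpy as np

        def merge_and_downsample(point_clouds, voxel_size=0.05):
            """Sketch: merge point clouds from video frames, fixed cameras, and
            range scanners into one cloud, then keep one representative point
            per voxel; meshing and texturing are not shown."""
            merged = np.vstack([np.asarray(pc, dtype=float) for pc in point_clouds])
            keys = np.floor(merged / voxel_size).astype(np.int64)
            _, first_index = np.unique(keys, axis=0, return_index=True)
            return merged[np.sort(first_index)]

        clouds = [np.random.rand(1000, 3), np.random.rand(500, 3) + 0.5]
        print(merge_and_downsample(clouds).shape)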
  • Similar to the use of asset arrays, an environment model is created and processed before being used in any live-action footage provided by the FVV. Various methods associated with FVV live action, as discussed below, are made possible by the creation of an environment model including interstitials, moving camera calibration, and geometry-articulation.
  • In the simplest use of background models, interstitial movements between real camera positions are enabled, allowing users to more clearly understand where various camera feeds are located. In any SV scenario involving FVV, real camera feeds will have the highest degree of photorealism and will be widely utilized. When a viewer elects to change real camera views—instead of immediately switching to the next video feed—a smooth and sweeping camera movement is optionally enabled by rendering a virtual transition from the viewpoint of one camera view to the other to provide additional spatial information about the location of the cameras relative to the scene.
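  • One plausible way to produce such a sweeping interstitial move, sketched here for illustration only (not part of the original disclosure; the quaternion representation and names are assumptions), is to interpolate position linearly and orientation spherically between the two real camera poses.

        import numpy as np

        def slerp(q0, q1, s):
            """Spherical linear interpolation between two unit quaternions."""
            q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
            dot = np.dot(q0, q1)
            if dot < 0.0:                    # take the short way around
                q1, dot = -q1, -dot
            if dot > 0.9995:                 # nearly identical: fall back to lerp
                q = q0 + s * (q1 - q0)
                return q / np.linalg.norm(q)
            theta = np.arccos(dot)
            return (np.sin((1 - s) * theta) * q0 + np.sin(s * theta) * q1) / np.sin(theta)

        def interstitial_path(pos0, quat0, pos1, quat1, steps=30):
            """Sketch: a smooth virtual camera move between two real camera
            poses (position + orientation quaternion) used when switching feeds."""
            for i in range(steps + 1):
                s = i / steps
                position = (1 - s) * np.asarray(pos0, float) + s * np.asarray(pos1, float)
                yield position, slerp(quat0, quat1, s)

        for position, orientation in interstitial_path([0, 0, 10], [1, 0, 0, 0],
                                                       [20, 5, 10], [0.7071, 0, 0.7071, 0],
                                                       steps=5):
            print(position, orientation)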
  • Additional FVV scenarios make advantageous use of the environment model by using both fixed and moving camera arrays to enable FVV functionality. In the case of moving cameras, these are used to provide close-ups of action on the field (i.e., by registering or positioning geometric proxies generated from the high definition bubbles with the environment model). To use moving cameras for FVV, individual video frames are continuously calibrated based on their orientation and optical focus, as discussed in the aforementioned U.S. Provisional Patent Application, the subject matter of which is incorporated herein by reference.
  • In general, the Dynamic High Definition Bubble Framework uses structure from motion (SFM) based approaches, as discussed in the aforementioned U.S. Provisional Patent Application, the subject matter of which is incorporated herein by reference, to calibrate the moving cameras based on high resolution static RGB images captured during the environment modeling stage. Finally, for close up FVV functionality, the Dynamic High Definition Bubble Framework relies upon the aforementioned articulation of the high-fidelity geometric proxies for the assets (players) using data from both fixed and moving camera arrays. These proxies are then positioned (i.e., registered or aligned) in the correct location on the field by determining where these assets are located relative to the environment model, as discussed in the aforementioned U.S. Provisional Patent Application, the subject matter of which is incorporated herein by reference.
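  • As a rough illustration only (not part of the original disclosure), the following Python fragment uses OpenCV's standard solvePnP routine as a stand-in for the per-frame calibration described above: given 2D features in a moving-camera frame matched to 3D points from the environment model, it recovers that frame's pose. The function name and the zero-distortion assumption are hypothetical.

        import numpy as np
        import cv2  # OpenCV, used here only to illustrate a standard PnP solve

        def calibrate_moving_frame(points_3d, points_2d, camera_matrix):
            """Sketch: recover a moving-camera frame's rotation and translation
            from 2D-3D correspondences against the environment model; at least
            four correspondences are needed."""
            dist_coeffs = np.zeros(5)                 # assume negligible lens distortion
            ok, rvec, tvec = cv2.solvePnP(
                np.asarray(points_3d, dtype=np.float32),
                np.asarray(points_2d, dtype=np.float32),
                camera_matrix, dist_coeffs)
            if not ok:
                raise RuntimeError("pose could not be recovered for this frame")
            R, _ = cv2.Rodrigues(rvec)                # rotation vector -> 3x3 matrix
            return R, tvec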
  • 2.3.3 Fixed Arrays:
  • Fixed camera arrays are used in various scenarios associated with the football game, including intra-game focused footage as well as collateral footage. The defining characteristic of the fixed arrays is that the cameras do not move relative to the scene.
  • For example, consider the use of FVV functionality for non-game collateral footage—this could include interviews with players or announcers. Further, consider an announcers' stage having a medium density array of fixed RGB video cameras arranged in a 180-degree camera geometry pointing towards the stage for capturing 3D scene data of persons and assets on the stage. In this case, the views being considered generally include close-up views of humans, focused on the face, with limited need for full 6-DOF spatial navigation. Consequently, an IBR approach such as view interpolation, view morphing, or view warping would use a less explicit geometric proxy for the scene, which would therefore emphasize photorealism at the expense of viewpoint navigation.
  • One use of this FVV functionality is that viewers (or producers) can enable real-time smooth pans between the different announcers as they comment and react. Another application of these ideas is to change views between the announcers and a top down map of the play presented next to the announcers. Another example scenario includes zooming in on a specific cheerleader doing a cheer, assuming that the fixed array is positioned on the field in an appropriate location for such views. In these scenarios, FVV navigation would be primarily limited to synthetic viewpoints between real camera positions or the axis of the camera geometry. However, by using the available 3D scene data for rendering the image frames, the results would be almost indistinguishable from real camera viewpoints.
  • The intra-game functionality discussed below highlights various benefits and advantages to the user when using the FVV technology described herein. For example, consider two classes of fixed arrays, one sparse array positioned with whole or partial views of the field from high vantage points within the stadium and another where denser fixed cameras are positioned around the actual field such as in the end zone to capture a high definition bubble of the end zone.
  • In the case of high vantage point sparse arrays, this video data can be used to enable both far and medium FVV viewpoint control both during the game and during playback. This is considered a sparse array because the relative volume of the stadium is rather large and the distance between sensors is high. In this case, image-based rendering methods such as billboards and articulated billboards may be used to provide two-dimensional representations of the players on the field. These billboards are created using segmentation approaches, which are enabled partially by the environment model. These billboards maintain the photorealistic look of the players, but they do not include the explicit geometry of the players (such as when the players are represented as high fidelity geometric proxies). However, it should be understood that, in general, navigation in the FVV is independent of the representation used.
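  • To make the billboard idea concrete, the following Python fragment is a minimal sketch (not part of the original disclosure; the dimensions and names are assumptions) of a textured quad centered on a player's field position and rotated so it always faces the virtual camera; the segmented player image would then be mapped onto this quad.

        import numpy as np

        def billboard_quad(player_pos, camera_pos, width=1.0, height=2.0, up=(0.0, 0.0, 1.0)):
            """Sketch: return the four corners of a camera-facing quad anchored
            at the player's feet (bottom-left, bottom-right, top-right, top-left)."""
            player_pos = np.asarray(player_pos, float)
            to_camera = np.asarray(camera_pos, float) - player_pos
            to_camera /= np.linalg.norm(to_camera)
            right = np.cross(np.asarray(up, float), to_camera)
            right /= np.linalg.norm(right)
            up_vec = np.cross(to_camera, right)
            half_w = 0.5 * width
            return np.array([player_pos - half_w * right,
                             player_pos + half_w * right,
                             player_pos + half_w * right + height * up_vec,
                             player_pos - half_w * right + height * up_vec])

        print(billboard_quad(player_pos=(10, 5, 0), camera_pos=(0, -40, 20)))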
  • Next, denser fixed arrays on the field such as around the end zone for capturing high definition bubbles allow for highly photorealistic viewpoints during both live action and replay. Similar to the announcers' stage discussed above, viewpoint navigation would be largely constrained by the camera axis using image-based-rendering methods similar to those described for the announcers' stage. For the most part, these types of viewpoints are specifically enabled when camera density is at an appropriate level and therefore are not generally enabled for all locations within the stadium. In other words, dense camera arrays are used for capturing sub-regions of the overall stadium as high definition bubbles for inclusion in the FVV. In general, these methods are unsuitable for medium and sparse configurations of sensors.
  • 2.3.4 Moving Arrays:
  • Typical intra-game football coverage comes from moving cameras for both live action coverage and for replays. The preceding discussion regarding camera arrays generally focused on creating high fidelity geometric proxies of players and assets, how an environment model of the stadium can be leveraged to enhance the FVV, and the use of intra-game fixed camera arrays in both sparse and dense configurations. The Dynamic High Definition Bubble Framework ties these elements together with sparse moving camera arrays to enable additional FVV functionality: medium shots that use billboards, and close-up shots that leverage full 6-DOF spatial navigation with high fidelity geometric proxies of players or other assets or persons, all captured using conventional game cameras and camera operators. In other words, moving camera arrays are used to capture high definition bubbles used in generating FVV's.
  • Moving cameras in the array are continuously calibrated using SFM approaches leveraging the environment model. The optical zoom functionality of these moving cameras is also used to capture image data within high definition bubbles, including the use of prior frames to help further refine or identify a zoomed-in camera geometry. Once the individual frames of the moving cameras have been registered to the geometry of the environment model (i.e., correctly positioned within the stadium), additional image-based-rendering methods are enabled for different FVV functionality based on the contributing camera geometries, including RGB articulated geometric proxies with maximal spatial navigation, and billboard methods that emphasize photorealism with less spatial navigation.
  • For example, to enable close up replays with full 6-DOF viewpoint control during playback, the Dynamic High Definition Bubble Framework uses image data from the asset arrays, fixed arrays, and moving arrays. First, the relative position of the players is tracked on the field using one or more fixed arrays. In this way, the approximate location of any player on the field is known. This allows the Dynamic High Definition Bubble Framework to determine which players are in a zoomed in moving camera field of view. Next, based on the identification of the players in the zoomed in fields of view, the Dynamic High Definition Bubble Framework selects the appropriate high-fidelity geometric proxies for each player that were created earlier using the asset arrays.
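  • The following Python fragment is a minimal sketch only (not part of the original disclosure; the angular field-of-view test and all names are assumptions) of how tracked player positions might be tested against a zoomed-in moving camera's field of view so that the matching high fidelity proxies can be selected.

        import numpy as np

        def players_in_view(player_positions, camera_pos, camera_dir, fov_degrees):
            """Sketch: return the tracked players whose positions fall inside a
            camera's (conical) field of view."""
            camera_pos = np.asarray(camera_pos, float)
            camera_dir = np.asarray(camera_dir, float)
            camera_dir /= np.linalg.norm(camera_dir)
            half_fov = np.radians(fov_degrees) / 2.0
            visible = []
            for player_id, position in player_positions.items():
                to_player = np.asarray(position, float) - camera_pos
                distance = np.linalg.norm(to_player)
                if distance == 0.0:
                    continue
                angle = np.arccos(np.clip(np.dot(to_player / distance, camera_dir), -1.0, 1.0))
                if angle <= half_fov:
                    visible.append(player_id)
            return visible

        tracked = {"QB_12": (30, 10, 0), "WR_88": (60, -5, 0), "K_3": (-40, 0, 0)}
        print(players_in_view(tracked, camera_pos=(0, -50, 15),
                              camera_dir=(0.5, 1.0, -0.2), fov_degrees=20))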
  • Finally, using a kinematic model for known human motion as well as conventional object recognition techniques applied to RGB video (from both fixed and moving cameras), the Dynamic High Definition Bubble Framework determines the spatial orientation of specific players on the field and articulates their geometric proxies as realistically as possible. Note that this also helps in filling in occluded areas (using various hole-filling techniques) when there were insufficient numbers or placements of cameras to capture a view. When the geometric proxies are mapped to their correct location on the field in both space and time, the Dynamic High Definition Bubble Framework then derives a full 6-DOF FVV replay experience for the user. In this way, users or clients can literally view a play from any potential position including close-up shots as well as intra-field camera positions. Advantageously, the net effect here is to enable interactive replays similar to what is possible with various Xbox® football games such as the “Madden NFL” series of electronic games by Electronic Arts Inc, although with real data.
  • In addition, multiple moving cameras focused on the same physical location on the field can enable medium and close-up views that use IBR methods with less explicit geometry, such as billboard methodologies. These cameras can be combined with data from both the environment model and the fixed arrays to create additional FVV viewpoints within the stadium.
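A billboard in this context is a viewpoint-facing quad placed at a tracked player position and textured with that player's silhouette cut from a nearby real camera. The sketch below computes such a quad; the dimensions, the upright-billboard constraint, and the field coordinate convention (y up) are assumptions made for illustration.

```python
import numpy as np


def player_billboard(player_pos, viewpoint_pos, height=2.0, width=1.0):
    """Return the four corners (field coordinates, y up) of an upright quad
    centered on a player and rotated to face the current virtual viewpoint."""
    p = np.asarray(player_pos, dtype=float)
    up = np.array([0.0, 1.0, 0.0])
    view_dir = np.asarray(viewpoint_pos, dtype=float) - p
    view_dir[1] = 0.0                                    # keep the billboard vertical
    view_dir /= np.linalg.norm(view_dir)
    right = np.cross(up, view_dir)                       # horizontal axis of the quad
    right /= np.linalg.norm(right)
    half_w = width / 2.0
    return np.array([p - right * half_w,                 # bottom-left
                     p + right * half_w,                 # bottom-right
                     p + right * half_w + up * height,   # top-right
                     p - right * half_w + up * height])  # top-left
```

Texturing the quad from the real camera whose viewpoint is angularly closest to the virtual viewpoint is one common billboard heuristic, trading spatial navigation range for photorealism as described above.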
  • 3.0 Operational Summary:
  • The processes described above with respect to FIG. 1 through FIG. 3, and further described in Sections 1 and 2, are illustrated by the general operational flow diagram of FIG. 4. In particular, FIG. 4 provides an exemplary operational flow diagram that summarizes the operation of some of the various embodiments of the Dynamic High Definition Bubble Framework. Note that FIG. 4 is not intended to be an exhaustive representation of all of the various embodiments of the Dynamic High Definition Bubble Framework described herein, and that the embodiments represented in FIG. 4 are provided only for purposes of explanation.
  • Further, it should be noted that any boxes and interconnections between boxes that are represented by broken or dashed lines in FIG. 4 represent optional or alternate embodiments of the Dynamic High Definition Bubble Framework described herein, and that any or all of these optional or alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
  • In general, as illustrated by FIG. 4, the Dynamic High Definition Bubble Framework begins operation by capturing (410) 3D image data for an overall viewing area and for one or more high definition bubbles within the overall viewing area. The Dynamic High Definition Bubble Framework then uses the captured data to generate (420) one or more 3D geometric proxies or models for use in generating a Free Viewpoint Video (FVV). For each FVV, a view frustum for an initial or user-selected virtual navigation viewpoint is then selected (430). The Dynamic High Definition Bubble Framework then selects (440) an appropriate level of detail for regions in the view frustum based on distance from the viewpoint. Further, as discussed herein, the Dynamic High Definition Bubble Framework uses higher fidelity geometric proxies for regions corresponding to high definition bubbles and lower fidelity geometric proxies for other regions of the overall viewing area.
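Step (440) can be illustrated with a simple per-region level-of-detail choice that combines distance from the viewpoint with high definition bubble membership. This is a sketch only: the tier names, distance thresholds, and region dictionary layout are assumptions, not terms used in the patent.

```python
import math


def select_proxy_detail(regions, viewpoint, near=10.0, far=60.0):
    """Choose a proxy level of detail for each region already inside the view
    frustum: high definition bubble regions get higher-fidelity proxies, and
    every region is stepped down as its distance from the viewpoint grows."""
    selections = {}
    for region in regions:                     # e.g. {"id", "center", "in_hd_bubble"}
        dx, dy, dz = (region["center"][i] - viewpoint[i] for i in range(3))
        distance = math.sqrt(dx * dx + dy * dy + dz * dz)
        if region["in_hd_bubble"] and distance < near:
            lod = "full_geometric_proxy"       # articulated, high-resolution texture
        elif region["in_hd_bubble"]:
            lod = "reduced_geometric_proxy"
        elif distance < far:
            lod = "billboard"
        else:
            lod = "environment_model_only"     # coarse geometry of the viewing area
        selections[region["id"]] = lod
    return selections
```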
  • The Dynamic High Definition Bubble Framework then provides (450) one or more clients with 3D geometric proxies corresponding to the view frustum, with those geometric proxies having a level of detail sufficient to render the scene (or other objects or people within the current viewpoint) from a viewing frustum corresponding to a user-selected virtual navigation viewpoint. Given this data, the FVV is rendered and presented to the user for viewing, with the user then navigating (460) the FVV by selecting zoom levels and virtual navigation viewpoints (e.g., pans, tilts, rotations, etc.), which are in turn used to select the view frustum for generating subsequent frames of the FVV.
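Steps (450) and (460) together form a cull-and-transmit loop: for each navigation update, the server culls the scene against the client's view frustum and sends only the proxy data needed to render that viewpoint locally. The sketch below approximates the frustum with a view cone for brevity and reuses select_proxy_detail from the previous sketch; all data structures and thresholds are assumed for illustration.

```python
import math


def cull_to_frustum(regions, viewpoint, look_dir, fov_deg=60.0, far=120.0):
    """Keep only regions whose centers fall inside a conical approximation of the
    client's view frustum (an angle test plus a far-distance test)."""
    half_angle = math.radians(fov_deg) / 2.0
    norm = math.sqrt(sum(c * c for c in look_dir))
    look = [c / norm for c in look_dir]
    visible = []
    for region in regions:
        v = [region["center"][i] - viewpoint[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in v))
        if dist == 0.0 or dist > far:
            continue
        cos_angle = sum(v[i] * look[i] for i in range(3)) / dist
        if cos_angle >= math.cos(half_angle):
            visible.append(region)
    return visible


def frame_payload(regions, viewpoint, look_dir):
    """Data a server might send for one FVV frame: the frustum-culled regions
    paired with the level of detail chosen for each (see select_proxy_detail)."""
    visible = cull_to_frustum(regions, viewpoint, look_dir)
    return {r["id"]: select_proxy_detail([r], viewpoint)[r["id"]] for r in visible}
```

Each new zoom level or virtual viewpoint chosen by the user simply changes the viewpoint and look direction passed to this loop for the next frame.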
  • 4.0 Exemplary Operating Environments:
  • The Dynamic High Definition Bubble Framework described herein is operational within numerous types of general purpose or special purpose computing system environments or configurations. FIG. 5 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of the Dynamic High Definition Bubble Framework, as described herein, may be implemented. It should be noted that any boxes that are represented by broken or dashed lines in FIG. 5 represent alternate embodiments of the simplified computing device, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments that are described throughout this document.
  • For example, FIG. 5 shows a general system diagram showing a simplified computing device such as computer 500. Such computing devices can typically be found in devices having at least some minimum computational capability, including, but not limited to, personal computers, server computers, hand-held computing devices, laptop or mobile computers, communications devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, audio or video media players, etc.
  • To allow a device to implement the Dynamic High Definition Bubble Framework, the device should have sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated by FIG. 5, the computational capability is generally illustrated by one or more processing unit(s) 510, and may also include one or more GPUs 515, either or both in communication with system memory 520. Note that the processing unit(s) 510 of the general computing device may be specialized microprocessors, such as a DSP, a VLIW, or other micro-controller, or can be conventional CPUs having one or more processing cores, including specialized GPU-based cores in a multi-core CPU.
  • In addition, the simplified computing device of FIG. 5 may also include other components, such as, for example, a communications interface 530. The simplified computing device of FIG. 5 may also include one or more conventional computer input devices 540 (e.g., pointing devices, keyboards, audio input devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, etc.). The simplified computing device of FIG. 5 may also include other optional components, such as, for example, one or more conventional computer output devices 550 (e.g., display device(s) 555, audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, etc.). Note that typical communications interfaces 530, input devices 540, output devices 550, and storage devices 560 for general-purpose computers are well known to those skilled in the art, and will not be described in detail herein.
  • The simplified computing device of FIG. 5 may also include a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 500 via storage devices 560 and includes both volatile and nonvolatile media that is either removable 570 and/or non-removable 580, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes, but is not limited to, computer or machine readable media or storage devices such as DVD's, CD's, floppy disks, tape drives, hard drives, optical drives, solid state memory devices, RAM, ROM, EEPROM, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage, or other magnetic storage devices, or any other device which can be used to store the desired information and which can be accessed by one or more computing devices.
  • Storage of information such as computer-readable or computer-executable instructions, data structures, program modules, etc., can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms “modulated data signal” or “carrier wave” generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media includes wired media such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
  • Further, software, programs, and/or computer program products embodying some or all of the various embodiments of the Dynamic High Definition Bubble Framework described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer or machine readable media or storage devices and communication media in the form of computer executable instructions or other data structures.
  • Finally, the Dynamic High Definition Bubble Framework described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
  • The foregoing description of the Dynamic High Definition Bubble Framework has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the claimed subject matter to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. Further, it should be noted that any or all of the aforementioned alternate embodiments may be used in any combination desired to form additional hybrid embodiments of the Dynamic High Definition Bubble Framework. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

Claims (20)

What is claimed is:
1. A computer-implemented process for generating navigable free viewpoint video (FVV), comprising using a computer to perform process actions for:
generating a geometric proxy from 3D image data of an overall volumetric space;
generating one or more geometric proxies for each of one or more sub-regions of the overall volumetric space;
registering one or more of the geometric proxies of the sub-regions with the geometric proxy of the overall volumetric space; and
rendering a multi-resolution user-navigable FVV from the registered geometric proxies and the geometric proxy of the overall volumetric space, wherein portions of the FVV corresponding to the sub-regions are rendered with a higher resolution than other regions of the FVV.
2. The computer-implemented process of claim 1 wherein each sub-region is captured at a resolution greater than a resolution used to capture the overall volumetric space.
3. The computer-implemented process of claim 1 wherein one or more of the sub-regions are captured using one or more moving camera arrays.
4. The computer-implemented process of claim 1 wherein one or more of the sub-regions are captured using one or more fixed camera arrays.
5. The computer-implemented process of claim 1 wherein rendering the multi-resolution user-navigable FVV further comprises process actions for:
determining a current view frustum corresponding to a current client viewpoint for viewing the FVV; and
transmitting appropriate geometric proxies within the current view frustum to the client for local rendering of video frames of the FVV.
6. The computer-implemented process of claim 1 wherein one or more of the sub-regions move relative to the overall volumetric space during capture of the 3D image data for those sub-regions.
7. The computer-implemented process of claim 1 wherein one or more of the sub-regions overlap within the overall volumetric space.
8. A method for generating a navigable 3D representation of a volumetric space, comprising:
capturing 3D image data of an overall volumetric space and using this 3D image data to construct an environment model comprising a geometric proxy of the overall volumetric space;
capturing 3D image data for one or more sub-regions of the overall volumetric space and generating one or more geometric proxies of each sub-region;
registering one or more of the geometric proxies of each sub-region to the environment model;
determining a view frustum relative to the environment model; and
rendering frames of a multi-resolution user-navigable FVV from portions of the registered geometric proxies and environment model corresponding to the view frustum, wherein portions of the FVV corresponding to the sub-regions are rendered with a higher resolution than other regions of the FVV.
9. The method of claim 8 wherein the view frustum is determined from a current viewpoint of a client viewing the FVV, and wherein the rendering is performed by the client from portions of the registered geometric proxies and environment model corresponding to the view frustum transmitted to the client.
10. The method of claim 8 wherein zooming into portions of the FVV rendered with a higher resolution provides greater detail than when zooming into other regions of the FVV.
11. The method of claim 8 wherein each sub-region is captured at a resolution greater than a resolution used to capture the overall volumetric space.
12. The method of claim 8 wherein the sub-regions are captured using any combination of one or more moving camera arrays and one or more fixed camera arrays.
13. The method of claim 8 wherein one or more of the sub-regions move relative to the overall volumetric space during capture of the 3D image data for those sub-regions.
14. A computer-readable medium having computer executable instructions stored therein for generating a user navigable free viewpoint video (FVV), said instructions causing a computing device to execute a method comprising:
capturing 3D image data for an overall viewing area;
capturing 3D image data for one or more high definition bubbles within the overall viewing area;
generating a geometric proxy from the 3D image data of the overall viewing area;
generating one or more geometric proxies from the 3D image data of one or more of the high definition bubbles;
aligning one or more of the geometric proxies of the high definition bubbles with the geometric proxy of the overall viewing area; and
transmitting portions of any of the aligned geometric proxies corresponding to a current client viewpoint to a client for local client-based rendering of a multi-resolution user-navigable FVV, wherein portions of the FVV corresponding to the high definition bubbles are rendered with a higher resolution than other regions of the FVV.
15. The computer-readable medium of claim 14 wherein each high definition bubble is captured at a resolution greater than a resolution used to capture the overall viewing area.
16. The computer-readable medium of claim 14 wherein one or more of the high definition bubbles are captured using one or more moving camera arrays.
17. The computer-readable medium of claim 14 wherein one or more of the high definition bubbles are captured using one or more fixed camera arrays.
18. The computer-readable medium of claim 14 wherein rendering the multi-resolution user-navigable FVV further comprises:
determining a current view frustum corresponding to a current client viewpoint for viewing the FVV; and
using portions of the aligned geometric proxies within the current view frustum for local rendering of video frames of the FVV.
19. The computer-readable medium of claim 14 wherein one or more of the high definition bubbles move relative to the overall viewing area during capture of the 3D image data for those high definition bubbles.
20. The computer-readable medium of claim 14 wherein one or more of the high definition bubbles overlap within the overall viewing area.
US13/598,747 2012-05-31 2012-08-30 High definition bubbles for rendering free viewpoint video Abandoned US20130321575A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US201261653983P true 2012-05-31 2012-05-31
US13/598,747 US20130321575A1 (en) 2012-05-31 2012-08-30 High definition bubbles for rendering free viewpoint video

Publications (1)

Publication Number Publication Date
US20130321575A1 true US20130321575A1 (en) 2013-12-05

Family

ID=49669652

Family Applications (10)

Application Number Title Priority Date Filing Date
US13/566,877 Active 2034-02-16 US9846960B2 (en) 2012-05-31 2012-08-03 Automated camera array calibration
US13/588,917 Abandoned US20130321586A1 (en) 2012-05-31 2012-08-17 Cloud based free viewpoint video streaming
US13/598,536 Abandoned US20130321593A1 (en) 2012-05-31 2012-08-29 View frustum culling for free viewpoint video (fvv)
US13/598,747 Abandoned US20130321575A1 (en) 2012-05-31 2012-08-30 High definition bubbles for rendering free viewpoint video
US13/599,678 Abandoned US20130321566A1 (en) 2012-05-31 2012-08-30 Audio source positioning using a camera
US13/599,170 Abandoned US20130321396A1 (en) 2012-05-31 2012-08-30 Multi-input free viewpoint video processing pipeline
US13/599,263 Active 2033-02-25 US8917270B2 (en) 2012-05-31 2012-08-30 Video generation using three-dimensional hulls
US13/599,436 Active 2034-05-03 US9251623B2 (en) 2012-05-31 2012-08-30 Glancing angle exclusion
US13/614,852 Active 2033-10-29 US9256980B2 (en) 2012-05-31 2012-09-13 Interpolating oriented disks in 3D space for constructing high fidelity geometric proxies from point clouds
US13/790,158 Abandoned US20130321413A1 (en) 2012-05-31 2013-03-08 Video generation using convict hulls

Country Status (1)

Country Link
US (10) US9846960B2 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6327381B1 (en) * 1994-12-29 2001-12-04 Worldscape, Llc Image transformation and synthesis methods
US20060267977A1 (en) * 2005-05-19 2006-11-30 Helmut Barfuss Method for expanding the display of a volume image of an object region
US20080095465A1 (en) * 2006-10-18 2008-04-24 General Electric Company Image registration system and method
US20090016641A1 (en) * 2007-06-19 2009-01-15 Gianluca Paladini Method and apparatus for efficient client-server visualization of multi-dimensional data
US20090128568A1 (en) * 2007-11-16 2009-05-21 Sportvision, Inc. Virtual viewpoint animation
US20110142321A1 (en) * 2008-08-29 2011-06-16 Koninklijke Philips Electronics N.V. Dynamic transfer of three-dimensional image data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9191643B2 (en) 2013-04-15 2015-11-17 Microsoft Technology Licensing, Llc Mixing infrared and color component data point clouds
US20170013283A1 (en) * 2015-07-10 2017-01-12 Futurewei Technologies, Inc. Multi-view video streaming with fast and smooth view switch
US9848212B2 (en) * 2015-07-10 2017-12-19 Futurewei Technologies, Inc. Multi-view video streaming with fast and smooth view switch
EP3388119A3 (en) * 2017-04-14 2018-11-28 Fujitsu Limited Method, apparatus, and non-transitory computer-readable storage medium for view point selection assistance in free viewpoint video generation

Also Published As

Publication number Publication date
US20130321413A1 (en) 2013-12-05
US20130321590A1 (en) 2013-12-05
US20130321418A1 (en) 2013-12-05
US9256980B2 (en) 2016-02-09
US20130321589A1 (en) 2013-12-05
US20130321396A1 (en) 2013-12-05
US9846960B2 (en) 2017-12-19
US20130321586A1 (en) 2013-12-05
US8917270B2 (en) 2014-12-23
US20130321410A1 (en) 2013-12-05
US9251623B2 (en) 2016-02-02
US20130321566A1 (en) 2013-12-05
US20130321593A1 (en) 2013-12-05

Similar Documents

Publication Publication Date Title
Matsuyama et al. Real-time 3D shape reconstruction, dynamic 3D mesh deformation, and high fidelity visualization for 3D video
Zhang et al. A survey on image-based rendering—representation, sampling and compression
CN1717064B (en) Interactive viewpoint video system and process
US7823058B2 (en) Methods and apparatus for interactive point-of-view authoring of digital video content
Carranza et al. Free-viewpoint video of human actors
Chan et al. Image-based rendering and synthesis
Rander et al. Virtualized reality: Constructing time-varying virtual worlds from real world events
Smolic et al. Interactive 3-D video representation and coding technologies
CA2640834C (en) Method and system for producing a video synopsis
US8645832B2 (en) Methods and apparatus for interactive map-based analysis of digital video content
Uyttendaele et al. Image-based interactive exploration of real-world environments
Wagner et al. Real-time panoramic mapping and tracking on mobile phones
Shum et al. Survey of image-based representations and compression techniques
EP2481023B1 (en) 2d to 3d video conversion
US20130073981A1 (en) Methods and apparatus for interactive network sharing of digital video content
Kopf et al. Street slide: browsing street level imagery
US20100053164A1 (en) Spatially correlated rendering of three-dimensional content on display components having arbitrary positions
Kubota et al. Multiview imaging and 3DTV
US20050283730A1 (en) System and process for viewing and navigating through an interactive video tour
CA2587644C (en) Method for inter-scene transitions
US5850352A (en) Immersive video, including video hypermosaicing to generate from multiple video views of a scene a three-dimensional video mosaic from which diverse virtual video scene images are synthesized, including panoramic, scene interactive and stereoscopic images
Kanade et al. Virtualized reality: Concepts and early results
US8345961B2 (en) Image stitching method and apparatus
EP1854282B1 (en) Method and system for spatio-temporal video warping
Prince et al. 3d live: Real time captured content for mixed reality

Legal Events

Date Code Title Description
AS Assignment

Owner name: MICROSOFT CORPORATION, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KIRK, ADAM;FISHMAN, NEIL;GILLET, DON;AND OTHERS;SIGNING DATES FROM 20120827 TO 20120829;REEL/FRAME:028880/0101

AS Assignment

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034544/0541

Effective date: 20141014