WO2015027105A1

WO2015027105A1 - Virtual reality content stitching and awareness

Info

Publication number: WO2015027105A1
Application number: PCT/US2014/052168
Authority: WO
Inventors: Arthur Van Hoff; Thomas M. Annau; Jens Christensen; Daniel KOPEINIGG
Original assignee: Jaunt Inc.
Priority date: 2013-08-21
Filing date: 2014-08-21
Publication date: 2015-02-26
Also published as: US11032490B2; US9451162B2; US10425570B2; US20150055937A1; JP2016538790A; WO2015026632A1; JP6289641B2; US20150055929A1; US20150058102A1; US20170236162A1; US10666921B2; US10708568B2; US20150054913A1; US9930238B2; US10334220B2; JP2016537903A; US20200007743A1; US20170236149A1; US20160373640A1; US11128812B2

Abstract

The disclosure includes a system and method for aggregating image frames and audio data to generate virtual reality content. The system includes a processor and a memory storing instructions that, when executed, cause the system to: receive video data describing image frames from a camera array; receive audio data from a microphone array; aggregate the image frames to generate a stream of three- dimensional (3D) video data, the stream of 3D video data including a stream of left panoramic images and a stream of right panoramic images; generate a stream of 3D audio data from the audio data; and generate virtual reality content that includes the stream of 3D video data and the stream of 3D audio data.

Description

VIRTUAL REALITY CONTENT STITCHING AND AWARENESS

FIELD

The implementations discussed herein are related to a virtual presence system and method. More particularly, the implementations discussed herein relate to aggregating image frames from a camera array and audio data from a microphone array to generate virtual reality (VR) content.

BACKGROUND

As technology improves, people become more isolated from human-to- human interaction. Instead of interacting with people in the physical world, people become more interested in the changes occurring on their phones and other mobile devices. This can result in loneliness and a sense of being disconnected.

One way to reduce the feelings of isolation comes from using virtual reality systems. In a virtual reality system, users interact with visual displays generated by software to experience a new location, activity, etc. For example, the user may play a game and interact with other characters in the game. In another example, the government is currently using virtual reality systems to train pilots. Current systems, however, fail to completely remedy feelings of isolation because the VR systems are insufficiently realistic.

Some VR goggles are released to the market. These goggles may combine a screen, gyroscopic sensors, and accelerometers to create a VR viewing system with a wide field of view and responsive head-tracking. Many of these VR goggles are initially aimed at the gaming market, and early reactions indicate they will be popular.

The subject matter claimed herein is not limited to implementations that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one example technology area where some implementations described herein may be practiced. SUMMARY

According to one innovative aspect of the subject matter described in this disclosure, a system for aggregating image frames and audio data to generate virtual reality content includes one or more processors and one or more non-transitory tangible computer- readable mediums communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations including: receiving video data describing image frames from camera modules; receiving audio data from a microphone array; aggregating the stream of 3D video data including a stream of left panoramic images and a stream of right panoramic images; generating a stream of 3D audio data from the audio data; and generating virtual reality content that includes the stream of 3D video data and the stream of 3D audio data.

According to another innovative aspect of the subject matter described in this disclosure, a system for stitching image frames to generate a left panoramic image and a right panoramic image includes one or more processors and one or more non- transitory tangible computer-readable mediums communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations including: receiving image frames that are captured by two or more camera modules of a camera array at a particular time; interpolating a first virtual camera between a first set of camera modules from the two or more camera modules; determining a first set of disparity maps between the first set of camera modules; generating a first virtual camera image associated with the particular time for the first virtual camera from a first set of image frames that are captured by the first set of camera modules at the particular time, the first virtual camera image being generated based on the first set of disparity maps; and constructing a left panoramic image and a right panoramic image associated with the particular time from the image frames captured by the two or more camera modules and the first virtual camera image of the first virtual camera.

According to yet another innovative aspect of the subject matter described in this disclosure, a system includes one or more processors and one or more non- transitory tangible computer-readable mediums communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations including: generating virtual reality content that includes a compressed stream of three-dimensional video data and a stream of three-dimensional audio data with a processor-based computing device programmed to perform the generating, providing the virtual reality content to a user, detecting a location of the user's gaze at the virtual reality content, and suggesting a first advertisement based on the location of the user's gaze. According to yet another innovative aspect of the subject matter described in this disclosure, a system includes one or more processors and one or more non- transitory tangible computer-readable mediums communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations including: receiving virtual reality content that includes a stream of three-dimensional video data and a stream of three- dimensional audio data to a first user with a processor-based computing device programmed to perform the receiving, generating a social network for the first user, and generating a social graph that includes user interactions with the virtual reality content.

According to yet another innovative aspect of the subject matter described in this disclosure, a system includes one or more processors and one or more non- transitory tangible computer-readable mediums communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations including: providing virtual reality content that includes a compressed stream of three-dimensional video data and a stream of three-dimensional audio data with a processor-based computing device programmed to perform the providing, determining locations of user gaze of the virtual reality content, and generating a heat map that includes different colors based on a number of user gazes for each location.

Other aspects include corresponding methods, systems, apparatus, and computer program products for these and other innovative aspects.

These and other implementations may each optionally include one or more of the following operations and features. For instance, the features include: identifying first matching camera modules for left panoramic images based on a left camera map; identifying second matching camera modules for right panoramic images based on a right camera map; stitching first image frames captured by the first matching camera modules at a particular time to form a corresponding left panoramic image in the stream of left panoramic images; stitching second image frames captured by the second matching camera modules at a particular time to form a corresponding right panoramic image in the stream of right panoramic images; for a pixel with a yaw value and a pitch value in a panorama: the left camera map identifying a first matching camera module for the pixel in the panorama and matching the pixel in the panorama to a pixel in an image plane of the first matching camera module, and the right camera map identifying a second matching camera module for the pixel in the panorama and matching the pixel in the panorama to a pixel in an image plane of the second matching camera module; the left camera map associating a pixel location in left panoramic images to a corresponding first matching camera module, the pixel location corresponding to a point of a panorama in a left viewing direction; the corresponding first matching camera module having a field of view that includes a viewing direction to the point of the panorama; the viewing direction of the corresponding first matching camera module being closer to the left viewing direction than other viewing directions associated with other camera modules; determining a current viewing direction associated with a user; generating the stream of left panoramic images and the stream of right panoramic images based on the current viewing direction; the left panoramic images having a higher resolution in the current viewing direction of the user than a second viewing direction opposite to the current viewing direction; the right panoramic images having a higher resolution in the current viewing direction of the user than the second viewing direction opposite to the current viewing direction.

For instance, the operations include: correcting color deficiencies in the left panoramic images and the right panoramic images; and correcting stitching errors in the left panoramic images and the right panoramic images.

For instance, the operations include: determining a cost for displaying advertisements based on the location of the user's gaze, the cost being based on a length of time that the user gazes at the location; providing a graphical object as part of the virtual reality content that is linked to a second advertisement; generating graphics for displaying a bottom portion and a top portion that include at least some of the virtual reality content and providing an advertisement that is part of at least the bottom portion or the top portion.

For instance, the operations include: suggesting a connection between the first user and a second user based on the virtual reality content; suggesting a group associated with the social network based on the virtual reality content; comprising automatically generating social network updates based on the first user interacting with the virtual reality content; determining subject matter associated with the virtual reality content, determining other users that are interested in the subject matter, and wherein the other users receive the social network updates based on the first user interacting with the virtual reality content; generating privacy settings for the first user for determining whether to publish social network updates based on a type of activity; and storing information in a social graph about the first user's gaze at advertisements displayed as part of the virtual reality content.

The features may optionally include determining other users that are interested in the subject matter based on the other users expressly indicating that they are interested in the subject matter.

For instance, the operations include: generating a playlist of virtual reality experiences. The features may include the playlist being based on most user views of virtual reality content, a geographical location, or the playlist is generated by a user that is an expert in subject matter and the playlist is based on the subject matter.

The object and advantages of the implementations will be realized and achieved at least by the elements, features, and combinations particularly pointed out in the claims.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the disclosure, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

Example implementations will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

Figure 1A illustrates a block diagram of some implementations of an example system that collects and aggregates image frames and audio data to generate VR content;

Figure IB illustrates a block diagram of some implementations of an example system for generating content for a virtual reality system;

Figure 2 illustrates a block diagram of some implementations of a computing device that includes an example aggregation system;

Figure 3 illustrates an example method for aggregating image frames and audio data to generate VR content according to some implementations;

Figures 4A-4C illustrate another example method for aggregating image frames and audio data to generate VR content according to some implementations; Figure 5 illustrates an example process of generating a left panoramic image and a right panoramic image from multiple image frames that are captured by multiple camera modules at a particular time;

Figure 6A is a graphic representation that illustrates an example panoramic image;

Figure 6B is a graphic representation that illustrates an example camera map;

Figures 7A and 7B are graphic representations that illustrate example processes of selecting a first camera module for a pixel in a left panoramic image to construct a left camera map and selecting a second camera module for the pixel in a right panoramic image to construct a right camera map;

Figure 8 is a graphic representation that illustrates an example process of blending pixels on a border of two camera sections;

Figures 9A and 9B are graphic representations that illustrate an example panoramic image with improved representation;

Figures 1 OA- IOC are graphic representations that illustrate a relationship between an increasing density of cameras and a reduction of stitching errors according to some implementations;

Figure 11 illustrates a block diagram of some implementations of a computing device that includes an example aggregation system;

Figures 12A and 12B illustrate an example method for stitching image frames captured at a particular time to generate a left and a right panoramic images according to some implementations;

Figures 13A and 13B illustrate an example method for generating a virtual camera image for a virtual camera located between two neighboring camera modules according to some implementations;

Figures 14A and 14B illustrate an example method for estimating a disparity map that maps disparity of pixels from a first sub-image of a first neighboring camera module to a second sub-image of a second neighboring camera module according to some implementations;

Figures 15 A and 15B illustrate an example method for determining similarity scores for pixels along an epipolar line that connects projection centers of two neighboring camera modules according to some implementations; Figure 16A illustrates an example process of generating a left panoramic image and a right panoramic image associated with a particular time according to some implementations;

Figure 16B is a graphic representation that illustrates an example panoramic image according to some implementations;

Figure 16C is a graphic representation that illustrates an example camera map according to some implementations;

Figures 17A-17C are graphic representations that illustrate selection of matching cameras for a point in a panorama for construction of a left and a right camera maps according to some implementations;

Figure 18 is a graphic representation that illustrates example disparity along an epipolar line according to some implementations;

Figure 19 is a graphic representation that illustrates interpolation of virtual cameras between real cameras and virtual cameras according to some implementations;

Figure 20 illustrates a block diagram of some implementations of a computing device that includes an example content system;

Figure 21 A illustrates different panels where virtual reality content may be displayed;

Figure 2 IB illustrates an example advertisement that is displayed as part of the virtual reality content;

Figure 21C illustrates example social network content;

Figure 2 ID illustrates example virtual reality content that includes a link to additional virtual reality content;

Figures 22A-22C illustrate an example method for aggregating image frames and audio data to generate VR content according to some implementations;

Figure 23 illustrates an example method for generating advertisements in a virtual reality system;

Figure 24 illustrates an example method for generating a social network based on virtual reality content; and

Figure 25 illustrates an example method for analyzing virtual reality content. DETAILED DESCRIPTION OF SOME EXAMPLE IMPLEMENTATIONS A VR experience may include one that creates a realistic sense of being in another place. Creating such an experience may involve reproducing three- dimensional ("3-D") video and optionally 3-D audio for a scene. For example, imagine a user is standing in a forest with a canopy of tree limbs overhead. The user may see trees, rocks, and other objects in various directions. As the user rotates his or her head from side to side and/or up and down, disparity (e.g., shifts in position) of the objects provides the user with depth perception, e.g., the ability to generally perceive the distance to an object in the field of view and/or the distance between objects in the field of view. The user may sense that there is a creek or river behind him or her because the user may hear running water. As the user tilts his or her head to the side, the user's view of the creek or river changes and the sound of the water changes. The creek or river may be easier to see and/or the sound of the water may become more distinct and clearer, and the user has a better sense of how far the water is from the user and how fast the water is flowing. In the canopy of tree limbs above the user, a bird is singing. When the user tilts his or her head upward, the user's senses detect changes in the surrounding environment: the user may see the canopy; the user may see a bluebird singing; the user may have a sense of how far away the bird is based on disparity; and the user may hear the bird's singing more distinctly and loudly since the user is now facing the bird. The user tilts his or her head back to a forward-facing position and now the user may be facing a deer that is standing just 10 feet away from the user. The deer starts to run toward the user and the user's depth perception indicates that the deer is getting closer to the user. Based on the user's depth perception and the relative position of objects around the deer, the user may tell that the deer is running toward him or her at a fast pace.

Current VR solutions may fail to realistically recreate the scene described in the preceding paragraph from the video produced by multiple spatially-separated cameras. For example, 3D video is needed to have depth perception that indicates the deer is running toward the user and running at a certain pace. 3D audio may augment the 3D video. For example, 3D audio may allow the user to hear a change in the water as the user tilts his or her head from side to side, or to hear the bird's song differently as the user tilts his or her head upward. Since existing solutions do not create 3D video as described herein and/or do not combine 3D video with 3D audio, they are unable to realistically recreate the scene described in the preceding paragraph.

The present disclosure relates to creating a realistic sense of being in another place by providing an immersive 3D viewing experience that may optionally be combined with immersive 3D audio listening experience.

In some implementations, a system described herein may include a camera array, a microphone array, an aggregation system, a viewing system, and other devices, systems, or servers. The system is applicable for recording and presenting any event including, but not limited to, a concert, a sports game, a wedding, a press conference, a movie, a promotion event, a video conference, or other event or scene that may be recorded by the camera array and the microphone array. The recording of the event or scene may be viewed through a VR display (e.g., a pair of VR goggles) during occurrence of the event or thereafter.

Camera modules included in the camera array may have lenses mounted around a spherical housing and oriented in different directions with a sufficient diameter and field of view, so that sufficient view disparity may be captured by the camera array for rendering stereoscopic images. The camera array may output raw video data describing image frames with different viewing directions to the aggregation system.

The microphone array is capable of capturing sounds from various directions. The microphone array may output the captured sounds and related directionalities to the aggregation system, which allows the aggregation system to reconstruct sounds from any arbitrary direction.

The aggregation system may aggregate raw video data outputted from the camera array and raw audio data outputted from the microphone array for processing and storage. In some implementations, the aggregation system may include a set of Gigabit Ethernet switches for collecting the raw video data and an audio interface for collecting the raw audio data. Both of the raw video data and audio data may be fed into a client device or a server with a storage device for storing the raw video data and audio data.

The aggregation system may include code and routines stored on a non- transitory memory for processing the raw video data and audio data received across multiple recording devices and for converting the raw video data and audio data into a single compressed stream of 3D video and audio data. For example, the aggregation system may include code and routines that, when executed by a processor, stitch the image frames from multiple camera modules into two panoramic 3D video streams for left and right eye viewing, such as a stream of left panoramic images for left eye viewing (also referred to as a left stream of panoramic images) and a stream of right panoramic images for right eye viewing (also referred to as a right stream of panoramic images). The streams of left and right panoramic images are configured to create a time-varying panorama viewed by a user using the viewing system.

In some implementations, the aggregation system may construct a stereoscopic panorama using image frames from multiple views each in a different direction. For example, the camera array includes multiple camera modules arranged around all 360 degrees of a sphere. The camera modules each have a lens pointing in a different direction. Because the camera modules are arranged around 360 degrees of a sphere and taking images of the scene from multiple viewpoints, the images captured by the camera modules at a particular time include multiple views of the scene from different directions. The resulting left or right panoramic image for the particular time includes a spherical representation of the scene at the particular time. Each pixel in the left or right panoramic image may represent a view of the scene in a slightly different direction relative to neighboring pixels.

In some implementations, the aggregation system generates, based on a left camera map, the stream of left panoramic images for left eye viewing from image frames captured by the camera array. The left camera map identifies a corresponding matching camera module for each pixel in a left panoramic image. A pixel in a panoramic image may correspond to a point in a panoramic scene, and a matching camera module for the pixel in the panoramic image may be a camera module that has a lens with a better view for the point than other camera modules. The left camera map may map pixels in a left panoramic image to corresponding matching camera modules. Similarly, the aggregation system generates, based on a right camera map, the stream of right panoramic images for right eye viewing from image frames captured by the camera array. The right camera map identifies a corresponding matching camera module for each pixel in a right panoramic image. The right camera map may map pixels in a right panoramic image to corresponding matching camera modules.

The aggregation system may also include code and routines that, when executed by a processor, correct camera calibration errors, exposure or color deficiencies, stitching artifacts, and other errors on the left and right panoramic images.

The aggregation system may also add four-channel ambisonic audio tracks to the 3D video streams, and may encode and compress the 3D video and audio streams using a standard moving picture experts group (MPEG) format or other suitable encoding/compression format.

In some implementations, the aggregation system includes code and routines configured to filter the 3D video data to improve its quality. The aggregation system may also include code and routines for intentionally changing the appearance of the video with a video effect. In some implementations, the aggregation system includes code and routines configured to determine an area of interest in a video for a user and to enhance the audio corresponding to the area of interest in the video.

The viewing system decodes and renders the 3D video and audio streams received from the aggregation system on a VR display device (e.g., Oculus Rift VR display or other suitable VR display) and audio reproduction devices (e.g., headphones or other suitable speakers). The VR display device may display left and right panoramic images for the user to provide a 3D immersive viewing experience. The viewing system may include the VR display device that tracks the movement of a user's head. The viewing system may also include code and routines for processing and adjusting the 3D video data and audio data based on the user's head movement to present the user with a 3D immersive viewing experience, which allows the user to view the event or scene in any direction. Optionally, 3D audio may also be provided to augment the 3D viewing experience.

In some implementations, the system described herein includes a content system. The content system may provide functionality similar to that of the aggregation system. The content system may also provide other functionality as described below in more detail.

Once the virtual reality content is generated, there are many applications for the virtual reality content. In one embodiment, the content system generates advertisements within the virtual reality. For example, the advertisements are displayed in areas that are unobtrusive, such as above the user or below the user. The virtual reality system may be able to determine how to charge for the advertisements based on a location of the user's gaze. In another embodiment, the content system communicates with a social network application to identify users for using the virtual reality content together, for generating virtual reality updates for the user's friends on the social network, for suggesting content to the user based on the user's interest in certain virtual reality subject matter, etc. In yet another embodiment, the virtual reality system determines overall usage information such as a heat map of user gazes and a playlist of virtual reality experiences.

The present disclosure also relates to stitching images to form a panoramic image.

Image stitching errors may be resulted from one or more sources that include, but are not limited to: a first source that includes errors in measurement of physical properties of cameras (e.g., errors in spatial positions, rotations, focus, and focal lengths of the cameras); a second source that includes mismatch between image measurement properties of the cameras (e.g., mismatch in brightness, contrast, and color); and a third source that includes disparity in viewing angles of close-by objects from different cameras.

The stitching errors caused by the first and second sources may be removed through camera calibration. For example, objects with known colors, brightness, contrast, spatial orientations, and positions may be used to characterize each camera and adjust camera parameters (e.g., focus, sensor gain, white balance) prior to using the cameras to capture image frames. Alternatively or additionally, overlapping images between cameras may be analyzed, and image post-processing techniques may be used to adjust camera model parameters to reduce difference between the overlapping images.

The stitching errors caused by the third source may be reduced or eliminated by increasing the number of camera modules (also referred to as real cameras) in a camera array to approach an ideal of a single, continuous, and spherical image sensor. This mechanism may reduce the viewing angle discrepancy between neighboring cameras and may thus reduce the stitching artifacts. In some implementations, rather than adding more real cameras into the camera array, an increasing camera density may be achieved by interpolating virtual cameras between real cameras in the camera array. This approach may be achieved by interpolating images from real cameras based at least in part on an estimation of the spatial proximity or depth of each image pixel (e.g., a depth map) to generate virtual camera images for the virtual cameras. For example, to approximate shifting a camera view to the left, pixels in the image may shift to the right based on the pixels' estimated proximity to the camera. A first pixel that is closer to the camera than a second pixel may shift a longer distance to the right than the second pixel in order to simulate parallax. The virtual camera image generated from the pixel shifting may be improved by combining shifted views from all nearby cameras.

In some implementations, a depth map may be computed using standard stereoscopic algorithms or obtained with a depth sensor such as the PrimeSense depth sensor. The depth map does not need to be entirely accurate as long as the errors produce no visible difference in the interpolated views. For example, a featureless background may present as an identical image regardless of viewing positions or angles to the background. Errors in the background's depth estimation may not affect image interpolation since the featureless background is invariant to pixel shifting.

In some implementations, the aggregation system described herein may interpolate virtual cameras between camera modules in the camera array to simulate an increasing camera density. A virtual camera may be a camera whose view is not directly observed. For example, a virtual camera may be a camera whose view may be estimated from image data collected from real camera sensors or virtual camera image data of other virtual cameras. A virtual camera may represent a simulated camera that locates between two or more neighboring camera modules. A position, orientation, field of view, depth of field, focal length, exposure, and white balance, etc., of the virtual camera may be different from the two or more neighboring camera modules that the virtual camera is based on.

The virtual camera may have a virtual camera image estimated from two or more image frames captured by the two or more neighboring camera modules. In some implementations, the virtual camera may be located in a particular position between the two or more neighboring camera modules, and the virtual camera image of the virtual camera may represent an estimated camera view from the particular position located between the neighboring camera modules. For example, the camera array with multiple camera modules may be housed around a spherical case. A virtual camera may be determined for an arbitrary angular position around the spherical case and its virtual camera image may also be estimated for the arbitrary angular position, which simulates a continuous rotation of point of view around the sphere even though the camera array may only capture discrete view points by the discrete camera modules. In some implementations, a virtual camera may also be estimated by interpolating between two real cameras. A real camera may refer to a camera module in the camera array. Alternatively, a virtual camera may also be interpolated between a real camera and another virtual camera. Alternatively, a virtual camera may be interpolated between two other virtual cameras.

In some implementations, the aggregation system may estimate a virtual camera image for a virtual camera located between a first camera and a second camera by: estimating disparity maps between the first and second cameras; determining image frames of the first and second cameras; and generating the virtual camera image by shifting and combining the image frames of the first and second cameras based on the disparity maps. The first camera may be a real camera or a virtual camera. The second camera may also be a real camera or a virtual camera.

In some implementations, the aggregation system may receive video data describing image frames captured by camera modules in the camera array and may process the video data to generate a stream of 3D video data. For example, the aggregation system may determine virtual cameras interpolated in the camera array, estimate virtual camera images for the virtual cameras, stitch the image frames and the virtual camera images into two panoramic 3D video streams for left and right eye viewing, such as a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing. The stream of 3D video data includes the streams of left and right panoramic images.

Implementations of the present disclosure will be explained with reference to the accompanying drawings.

Figure 1A illustrates a block diagram of some implementations of an example system 100 that collects and aggregates image frames and audio data to generate VR content, arranged in accordance with at least some implementations described herein. The illustrated system 100 includes a camera array 101, a connection hub 123, a microphone array 107, a client device 127, and a viewing system 133. In some implementations, the system 100 additionally includes a server 129 and a second server 198. The client device 127, the viewing system 133, the server 129, and the second server 198 may be communicatively coupled via a network 105.

The separation of various components and servers in the implementations described herein should not be understood as requiring such separation in all implementations, and it should be understood that the described components and servers may generally be integrated together in a single component or server. Additions, modifications, or omissions may be made to the illustrated implementation without departing from the scope of the present disclosure, as will be appreciated in view of the present disclosure.

While Figure 1A illustrates one camera array 101, one connection hub 123, one microphone array 107, one client device 127, one server 129, and one second server 198, the present disclosure applies to a system architecture having one or more camera arrays 101, one or more connection hubs 123, one or more microphone arrays 107, one or more client devices 127, one or more servers 129, one or more second servers 198, and one or more viewing systems 133. Furthermore, although Figure 1A illustrates one network 105 coupled to the entities of the system 100, in practice one or more networks 105 may be connected to these entities and the one or more networks 105 may be of various and different types.

The camera array 101 may be a modular camera system configured to capture raw video data that includes image frames. In the illustrated implementation shown in Figure 1A, the camera array 101 includes camera modules 103 a, 103b... l03n (also referred to individually and collectively herein as camera module 103). While three camera modules 103a, 103b, 103n are illustrated in Figure 1A, the camera array 101 may include any number of camera modules 103. The camera array 101 may be constructed using individual cameras with each camera module 103 including one individual camera. In some implementations, the camera array 101 may also include various sensors including, but not limited to, a depth sensor, a motion sensor (e.g., a global positioning system (GPS), an accelerometer, a gyroscope, etc.), a sensor for sensing a position of the camera array 101, and other types of sensors. The camera array 101 may be constructed using various configurations. For example, the camera modules 103a, 103b... l03n in the camera array 101 may be configured in different geometries (e.g., a sphere, a line, a cylinder, a cone, and a cubic, etc.) with the corresponding lenses in the camera modules 103a, 103b... l03n facing toward different directions. The camera array 101 has a flexible structure so that a particular camera module 103 may be removed from the camera array 101 and new camera modules 103 may be added to the camera array 101. For example, the camera modules 103 are positioned within The camera array 101 in a honeycomb pattern where each of the compartments forms an aperture where a camera module 103 may be inserted. In another example, the camera array 101 includes multiple lenses along a horizontal axis and a smaller number of lenses on a vertical axis.

In some implementations, the camera modules 103a, 103b... l03n in the camera array 101 may be oriented around a sphere in different directions with sufficient diameter and field of view to capture sufficient view disparity to render stereoscopic images. For example, the camera array 101 may include 32 Point Grey Blackfly Gigabit Ethernet cameras distributed around a 20-centimeter diameter sphere. In another example, the camera array 101 may comprise HER03+ GoPro® cameras that are distributed around a sphere. Camera models that are different from the Point Grey Blackfly camera model may be included in the camera array 101. For example, in some implementations the camera array lOlmay include a sphere whose exterior surface is covered in one or more optical sensors configured to render 3D images or video. The optical sensors may be communicatively coupled to a controller. The entire exterior surface of the sphere may be covered in optical sensors configured to render 3D images or video.

In some implementations, the camera modules 103 in the camera array 101 are configured to have a sufficient field-of-view overlap so that all objects can be seen from more than one view point. For example, the horizontal field of view for each camera module 103 included in the camera array 101 is 70 degrees. In some implementations, having the camera array 101 configured in such a way that an object may be viewed by more than one camera module 103 is beneficial for correcting exposure or color deficiencies in the images captured by the camera array 101. The camera modules 103 in the camera array 101 may or may not include built-in batteries. The camera modules 103 may obtain power from a battery coupled to the connection hub 123. In some implementations, the external cases of the camera modules 103 may be made of heat-transferring materials such as metal so that the heat in the camera modules 103 may be dissipated more quickly than using other materials. In some implementations, each camera module 103 may include a heat dissipation element. Examples of heat dissipation elements include, but are not limited to, heat sinks, fans, and heat-dissipating putty.

Each of the camera modules 103 may include one or more processors, one or more memory devices (e.g., a secure digital (SD) memory card, a secure digital high capacity (SDHC) memory card, a secure digital extra capacity (SDXC) memory card, and a compact flash (CF) memory card, etc.), an optical sensor (e.g., semiconductor charge-coupled devices (CCD), active pixel sensors in complementary metal-oxide-semiconductor (CMOS), and N-type metal-oxide- semiconductor (NMOS, Live MOS), etc.), a depth sensor (e.g., PrimeSense depth sensor), a lens (e.g., a camera lens), and other suitable components.

In some implementations, the camera modules 103a, 103b... l03n in the camera array 101 may form a daisy chain in which the camera modules 103 a, 103b...103n are connected in sequence. The camera modules 103a, 103b...103n in the camera array 101 may be synchronized through the daisy chain. One camera module (e.g., the camera module 103a) in the daisy chain may be configured as a master camera module that controls clock signals for other camera modules in the camera array 101. The clock signals may be used to synchronize operations (e.g., start operations, stop operations) of the camera modules 103 in the camera array 101. Through the synchronized start and stop operations of the camera modules 103, the image frames in the respective video data captured by the respective camera modules 103a, 103b... l03n are also synchronized.

Example implementations of the camera array 101 and the camera modules 103 are described in U.S. Application No. 14/444,938, titled "Camera Array Including Camera Modules", filed July 28, 2014, which is herein incorporated in its entirety by reference.

The camera modules 103 may be coupled to the connection hub 123. For example, the camera module 103 a is communicatively coupled to the connection hub 123 via a signal line 102a, the camera module 103b is communicatively coupled to the connection hub 123 via a signal line 102b, and the camera module 103n is communicatively coupled to the connection hub 123 via a signal line 102n. In some implementations, a signal line in the disclosure may represent a wired connection or any combination of wired connections such as connections using Ethernet cables, high-definition multimedia interface (HDMI) cables, universal serial bus (USB) cables, RCA cables, Firewire, CameraLink, or any other signal line suitable for transmitting video data and audio data. Alternatively, a signal line in the disclosure may represent a wireless connection such as a wireless fidelity (Wi-Fi) connection or a BLUETOOTH® connection.

The microphone array 107 may include one or more microphones configured to capture sounds from different directions in an environment. In some implementations, the microphone array 107 may include one or more processors and one or more memories. The microphone array 107 may include a heat dissipation element. In the illustrated implementation, the microphone array 107 is coupled to the connection hub 123 via a signal line 104. Alternatively or additionally, the microphone array 107 may be directly coupled to other entities of the system 100 such as the client device 127.

The microphone array 107 may capture sound from various directions. The sound may be stored as raw audio data on a non-transitory memory communicatively coupled to the microphone array 107. The microphone array 107 may detect directionality of the sound. The directionality of the sound may be encoded and stored as part of the raw audio data.

In some implementations, the microphone array 107 may include a Core Sound Tetramic soundfield tetrahedral microphone array following the principles of ambisonics, enabling reconstruction of sound from any arbitrary direction. For example, the microphone array 107 may include an ambisonics microphone mounted on top of the camera array 101 and used to record sound and sonic directionality. In some implementations, the microphone array 107 includes a Joseph Grado HMP-1 recording system, or any other microphone system configured according to the same or similar acoustical principles. In some implementations, the microphone array 107 includes the Eigenmike, which advantageously includes a greater number of microphones and, as a result, can perform higher-order (i.e. more spatially accurate) ambisonics. The microphone may be mounted to the top of the camera array 101, be positioned between camera modules 103, or be positioned within the body of the camera array 101.

In some implementations, the camera modules 103 may be mounted around a camera housing (e.g., a spherical housing or a housing with another suitable shape). The microphone array 107 may include multiple microphones mounted around the same camera housing, with each microphone located in a different position. The camera housing may act as a proxy for the head-shadow sound-blocking properties of a human head. As described below with reference to Figure 2, during playback of the recorded audio data, an audio module 212 may select an audio track for a user's ear from a microphone that has a closest orientation to the user's ear. Alternatively, the audio track for the user's ear may be interpolated from audio tracks recorded by microphones that are closest to the user's ear.

The connection hub 123 may receive the raw audio data recorded by the microphone array 107 and forward the raw audio data to the client device 127 for processing and storage. The connection hub 123 may also receive and aggregate streams of raw video data describing image frames captured by the respective camera modules 103. The connection hub 123 may then transfer the raw video data to the client device 127 for processing and storage. The connection hub 123 is communicatively coupled to the client device 127 via a signal line 106. In some examples, the connection hub 123 may be a USB hub. In some implementations, the connection hub 123 includes one or more batteries 125 for supplying power to the camera modules 103 in the camera array 101. Alternatively or additionally, one or more batteries 125 may be coupled to the connection hub 123 for providing power to the camera modules 103.

The client device 127 may be a processor-based computing device. For example, the client device 127 may be a personal computer, laptop, tablet computing device, smartphone, set top box, network-enabled television, or any other processor based computing device. In some implementations, the client device 127 includes network functionality and is communicatively coupled to the network 105 via a signal line 108. The client device 127 may be configured to transmit data to the server 129 or to receive data from the server 129 via the network 105. The client device 127 may receive raw video data and raw audio data from the connection hub 123. In some implementations, the client device 127 may store the raw video data and raw audio data locally in a storage device associated with the client device 127. Alternatively, the client device 127 may send the raw video data and raw audio data to the server 129 via the network 105 and may store the raw video data and the audio data on a storage device associated with the server 129. In some implementations, the client device 127 includes an aggregation system 131 for aggregating raw video data captured by the camera modules 103 to form 3D video data and aggregating raw audio data captured by the microphone array 107 to form 3D audio data. Alternatively or additionally, the aggregation system 131 may be operable on the server 129.

The aggregation system 131 may include a system configured to aggregate raw video data and raw audio data to generate a stream of 3D video data and a stream of 3D audio data, respectively. The aggregation system 131 may be stored on a single device or a combination of devices of Figure 1A. In some implementations, the aggregation system 131 can be implemented using hardware including a field-programmable gate array ("FPGA") or an application-specific integrated circuit ("ASIC"). In some other implementations, the aggregation system 131 may be implemented using a combination of hardware and software. The aggregation system 131 is described below in more detail with reference to Figures 2-5 and 11-15B.

The viewing system 133 may include or use a computing device to decode and render a stream of 3D video data on a VR display device (e.g., Oculus Rift VR display) or other suitable display devices that include, but are not limited to: augmented reality glasses; televisions, smartphones, tablets, or other devices with 3D displays and/or position tracking sensors; and display devices with a viewing position control, etc. The viewing system 133 may also decode and render a stream of 3D audio data on an audio reproduction device (e.g., a headphone or other suitable speaker devices). The viewing system 133 may include the VR display configured to render the 3D video data and the audio reproduction device configured to render the 3D audio data. The viewing system 133 may be coupled to the client device 127 via a signal line 110 and the network 105 via a signal line 112. A user 134 may interact with the viewing system 133. In some implementations, the viewing system 133 may receive VR content from the client device 127. Alternatively or additionally, the viewing system 133 may receive the VR content from the server 129. The viewing system 133 may also be coupled to the aggregation system 131 and may receive the VR content from the aggregation system 131. The VR content may include one or more of a stream of 3D video data, a stream of 3D audio data, a compressed stream of 3D video data, a compressed stream of 3D audio data, and other suitable content.

The viewing system 133 may track a head orientation of a user. For example, the viewing system 133 may include one or more accelerometers or gyroscopes used to detect a change in the user's head orientation. The viewing system 133 may decode and render the stream of 3D video data on a VR display device and the stream of 3D audio data on a speaker system based on the head orientation of the user. As the user changes his or her head orientation, the viewing system 133 may adjust the rendering of the 3D video data and 3D audio data based on the changes of the user's head orientation.

The viewing system 133 may provide an immersive viewing experience to the user 134. For example, the viewing system 133 may include a VR display device that has a wide field of view so that the user 134 viewing the VR content feels like he or she is surrounded by the VR content in a manner similar to in a real- life environment. A complete 360-degree view of the scene is provided to the user 134, and the user 134 may view the scene in any direction. As the user 134 moves his or her head, the view is modified to match what the user 134 would see as if he or she was moving his or her head in the real world. By providing a different view to each eye (e.g., a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing), which simulates what the left and right eyes may see in the real world, the viewing system 133 may give the user 134 a 3D view of the scene. Additionally, 3D surrounding sound may be provided to the user 134 based on the user's head orientation to augment the immersive 3D viewing experience. For example, if a character in an immersive movie is currently behind the user 134, the character's voice may appear to be emanating from behind the user 134.

In some implementations, the viewing system 133 may allow the user 134 to adjust the left panoramic images and the right panoramic images to conform to the user's interpupillary distance. The left panoramic images and the right panoramic images may move further apart for users with larger interpupillary distances or may move closer for users with smaller interpupillary distances.

In some implementations, the viewing system 133 includes a peripheral device such as a microphone, camera, mouse, or keyboard that is configured to enable the user 134 to provide an input to one or more components of the system 100. For example, the user 134 may interact with the peripheral device to provide a status update to the social network service provided by the social network server 135. In some implementations, the peripheral device includes a motion sensor such as the Microsoft® Kinect or another similar device, which allows the user 134 to provide gesture inputs to the viewing system 133 or other entities of the system 100.

In some implementations, the viewing system 133 includes peripheral devices for making physical contact with the user to make the virtual reality experience more realistic. The viewing system 133 may include gloves for providing the user 134 with tactile sensations that correspond to virtual reality content. For example, the virtual reality content may include images of another user and when the user 134 reaches out to touch the other user the viewing system 133 provides pressure and vibrations that make it feel like the user 134 is making physical contact with the other user. In some implementations, the viewing system 133 may include peripheral devices for other parts of the body.

In some implementations, multiple viewing systems 133 may receive and consume the VR content streamed by the aggregation system 131. In other words, two or more viewing systems 133 may be communicatively coupled to the aggregation system 131 and configured to simultaneously or contemporaneously receive and consume the VR content generated by the aggregation system 131.

The network 105 may be a conventional type, wired or wireless, and may have numerous different configurations including a star configuration, token ring configuration, or other configurations. Furthermore, the network 105 may include a local area network (LAN), a wide area network (WAN) (e.g., the Internet), or other interconnected data paths across which multiple devices may communicate. In some implementations, the network 105 may be a peer-to-peer network. The network 105 may also be coupled to or include portions of a telecommunications network for sending data in a variety of different communication protocols. In some implementations, the network 105 may include BLUETOOTH® communication networks or a cellular communication network for sending and receiving data including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, e-mail, etc.

The server 129 may be a hardware server that includes a processor, a memory, and network communication capabilities. In the illustrated implementation, the server 129 is coupled to the network 105 via a signal line 120. The server 129 sends and receives data to and from one or more of the other entities of the system 100 via the network 105. For example, the server 129 receives VR content including a stream of 3D video data (or compressed 3D video data) and a stream of 3D audio data (or compressed 3D audio data) from the client device 127 and stores the VR content on a storage device associated with the server 129. Alternatively, the server 129 includes the aggregation system 131 that receives raw video data and raw audio data from the client device 127 and aggregates the raw video data and raw audio data to generate the VR content. The viewing system 133 may access the VR content from the server 129 or the client device 127.

The second server 198 may be a hardware server that includes a processor, a memory, and network communication capabilities. In the illustrated implementation, the second server 198 is coupled to the network 105 via a signal line 197. The second server 198 sends and receives data to and from one or more of the other entities of the system 100 via the network 105. The second server 198 may provide computer-generated imagery to the aggregation system 131 for insertion into the stream so that live and computer-generated images may be combined. In other implementations, the second server 198 provides audio tracks that may be provided to the aggregation system 131 for insertion into the stream so that live content includes an audio track. For example, the audio track is a soundtrack.

In some implementations, the second server 198 includes functionality to modify the video or audio provided to the aggregation system 131. For example, the second server 198 includes code and routines executed by a processor and configured to provide noise cancellation of audio, reverberation effects for audio, insertion of video effects, etc. Accordingly, the second server 198 may be configured to enhance or transform video and audio associated with the aggregation system 131.

In some implementations, the system 100 includes two or more camera arrays 101 and two or more microphone arrays 107, and a user may switch between two or more viewpoints of the two or more camera arrays 101. For example, the system 100 may be used to record a live event such as a baseball game. The user may use the viewing system 133 to watch the baseball game from a first view point associated with a first camera array 101. A play is developing on the field and the user may want to switch viewpoints to have a better vantage of the play. The user provides an input to the aggregation system 131 via the viewing system 133, and the aggregation system 131 may switch to a second camera array 101 which provides a better vantage of the play. The second camera array 101 may be associated with a different microphone array 107 which provides different sound to the user specific to the user's new vantage point.

Figure IB illustrates a block diagram of some implementations of an example system 199 that collects and aggregates image frames and audio data to generate virtual reality content, arranged in accordance with at least some implementations described herein. The illustrated system 199 includes the camera array 101, the connection hub 123, the microphone array 107, the client device 127, and the viewing system 133. In some implementations, the system 199 additionally includes the server 129, a social network server 135, a content server 139, an advertisement (ad) server 141, and the second server 198. The client device 127, the viewing system 133, the server 129, the social network server 135, the content server 139, the second server 198, and the ad server 141 may be communicatively coupled via the network 105.

While Figure IB illustrates one camera array 101, one connection hub 123, one microphone array 107, one client device 127, one server 129, one social network server 135, one content server 139, one ad server 141, one second server 198, and one viewing system 133, the disclosure applies to a system architecture having one or more camera arrays 101, one or more connection hubs 123, one or more microphone arrays 107, one or more client devices 127, one or more servers 129, one or more social network servers 135, one or more content servers 139, one or more ad servers 141, one or more second servers 198, and one or more viewing systems 133. Furthermore, although Figure IB illustrates one network 105 coupled to the entities of the system 199, in practice one or more networks 105 may be connected to these entities and the one or more networks 105 may be of various and different types.

A content system 171 may be operable on the client device 127, the server 129, or another entity of the system 199. The viewing system 133 may also be coupled to the content system 171 and may receive the virtual reality content from the content system 171. The second server 198 may be configured to enhance or transform video and audio associated with the content system 171.

The ad server 141 may be a hardware server that includes a processor, a memory, and network communication capabilities. In the illustrated embodiment, the ad server 141 is coupled to the network 105 via a signal line 114. The ad server 141 sends and receives data to and from one or more of the other entities of the system 199 via the network 105. In some implementations, the ad server 141 is an advertisement repository for advertisements that are requested by the content system 171 for display as part of the virtual reality content.

In some implementations, the ad server 141 includes rules for targeting advertisements to specific users, for targeting advertisements to be displayed in conjunction with various types of content (e.g., content served by the content server 139, virtual reality content served by the client device 127 or the server 129), for targeting advertisements to specific locations or Internet Protocol (IP) addresses associated with the client device 127, the viewing system 133, or the user 134. The ad server 141 may include other rules for selecting and/or targeting advertisements.

In some implementations, the ad server 141 receives metadata associated with virtual reality content displayed by the viewing system 133 and selects advertisements for presentation in conjunction with the virtual reality content based on the metadata. For example, the ad server 141 selects stored advertisements based on keywords associated with the virtual reality content. Other methods are possible for providing targeted advertisements to users, which may alternatively or additionally be implemented in the implementations described herein.

The content server 139 may be a hardware server that includes a processor, a memory, and network communication capabilities. In the illustrated embodiment, the content server 139 is coupled to the network 105 via a signal line 116. The content server 139 sends and receives data to and from one or more of the other entities of the system 199 via the network 105. The content provided by the content server 139 may include any content that is configured to be rendered as 3D video data and/or 3D audio data. In some implementations, the content provided by the content server 139 may be videos of events such as sporting events, weddings, press conferences or other events, movies, television shows, music videos, interactive maps such as Google® Street View maps, and any other virtual reality content. In some implementations, the content includes a video game. In other implementations, the content includes a picture such as a family photo that has been configured to be experienced as virtual reality content.

In some implementations, the content server 139 provides content responsive to a request from the content system 171, the client device 127, or the viewing system 133. For example, the content server 139 is searchable using keywords. The client device 127, the viewing system 133, or the content system 171 provides a keyword search to the content server 139 and selects content to be viewed on the viewing system 133. In some implementations, the content server 139 enables a user to browse content associated with the content server 139. For example, the content includes a virtual store including items for purchase and the user may navigate the store in 3D. In some implementations, the content server 139 may provide the user with content recommendations. For example, the content server 139 recommends items for purchase inside the 3D store.

The social network server 135 may be a hardware server that includes a processor, a memory, and network communication capabilities. In the illustrated embodiment, the social network server 135 is coupled to the network 105 via a signal line 118. The social network server 135 sends and receives data to and from one or more of the other entities of the system 199 via the network 105. The social network server 135 includes a social network application 137. A social network may be a type of social structure where the users may be connected by a common feature. The common feature includes relationships/connections, e.g., friendship, family, work, an interest, etc. Common features do not have to be explicit. For example, the common feature may include users who are watching the same live event (e.g., football game, concert, etc.), playing the same video game, etc. In some implementations, the users are watching the event using the functionality provided by the content system 171 and the viewing systems 133. The common features may be provided by one or more social networking systems including explicitly defined relationships and relationships implied by social connections with other online users, where the relationships form a social graph. In some examples, the social graph may reflect a mapping of these users and how they may be related.

Although only one social network server 135 with one social network application 137 is illustrated, there may be multiple social networks coupled to the network 105, each having its own server, application, and social graph. For example, a first social network may be more directed to business networking, a second may be more directed to or centered on academics, a third may be more directed to local business, a fourth may be directed to dating, and others may be of general interest or a specific focus. In another embodiment, the social network application 137 may be part of the content system 171.

In some implementations, the social network includes a service that provides a social feed describing one or more social activities of a user. For example, the social feed includes one or more status updates for the user describing the user's actions, expressed thoughts, expressed opinions, etc. In some implementations, the service provided by the social network application 137 is referred to as a "social network service." Other implementations may be possible.

In some implementations, the social network server 135 communicates with one or more of the camera array 101, the microphone array 107, the content system 171, the server 129, the viewing system 133, and the client device 127 to incorporate data from a social graph of a user in a virtual reality experience for the user.

Referring now to Figure 2, an example of the aggregation system 131 is illustrated in accordance with at least some implementations described herein. Figure 2 is a block diagram of a computing device 200 that includes the aggregation system 131 , a memory 237, a processor 235, a storage device 241, and a communication unit 245. In the illustrated implementation, the components of the computing device 200 are communicatively coupled by a bus 220. In some implementations, the computing device 200 may be a personal computer, smartphone, tablet computer, set top box, or any other processor-based computing device. The computing device 200 may be one of the client device 127, the server 129, and another device in the system 100 of Figure 1A. The processor 235 may include an arithmetic logic unit, a microprocessor, a general-purpose controller, or some other processor array to perform computations and provide electronic display signals to a display device. The processor 235 is coupled to the bus 220 for communication with the other components via a signal line 238. The processor 235 may process data signals and may include various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although Figure 2 includes a single processor 235, multiple processors may be included. Other processors, operating systems, sensors, displays, and physical configurations may be possible.

The memory 237 includes a non-transitory memory that stores data for providing the functionality described herein. The memory 237 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the memory 237 also includes a non- volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The memory 237 may store the code, routines, and data for the aggregation system 131 to provide its functionality. The memory 237 is coupled to the bus 220 via a signal line 244.

The communication unit 245 may transmit data to any of the entities of the system 100 depicted in Figure 1A. Similarly, the communication unit 245 may receive data from any of the entities of the system 100 depicted in Figure 1A. The communication unit 245 may include one or more Ethernet switches for receiving the raw video data and the raw audio data from the connection hub 123. The communication unit 245 is coupled to the bus 220 via a signal line 246. In some implementations, the communication unit 245 includes a port for direct physical connection to a network, such as the network 105 of Figure 1A, or to another communication channel. For example, the communication unit 245 may include a port such as a USB, SD, RJ45, or similar port for wired communication with another computing device. In some implementations, the communication unit 245 includes a wireless transceiver for exchanging data with another computing device or other communication channels using one or more wireless communication methods, including IEEE 802.11, IEEE 802.16, BLUETOOTH®, or another suitable wireless communication method.

In some implementations, the communication unit 245 includes a cellular communications transceiver for sending and receiving data over a cellular communications network including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, e-mail, or another suitable type of electronic communication. In some implementations, the communication unit 245 includes a wired port and a wireless transceiver. The communication unit 245 also provides other conventional connections to a network for distribution of data using standard network protocols including TCP/IP, HTTP, HTTPS, and SMTP, etc.

The storage device 241 may be a non-transitory storage medium that stores data for providing the functionality described herein. The storage device 241 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the storage device 241 also includes a non-volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The storage device 241 is communicatively coupled to the bus 220 via a signal line 242.

In the implementation illustrated in Figure 2, the aggregation system 131 includes a communication module 202, a calibration module 204, a camera mapping module 206, a video module 208, a correction module 210, the audio module 212, and a stream combination module 214. These modules of the aggregation system 131 are communicatively coupled to each other via the bus 220.

In some implementations, each module of the aggregation system 131 (e.g., modules 202, 204, 206, 208, 210, 212, or 214) may include a respective set of instructions executable by the processor 235 to provide its respective functionality described below. In some implementations, each module of the aggregation system 131 may be stored in the memory 237 of the computing device 200 and may be accessible and executable by the processor 235. Each module of the aggregation system 131 may be adapted for cooperation and communication with the processor 235 and other components of the computing device 200.

The communication module 202 may be software including routines for handling communications between the aggregation system 131 and other components of the computing device 200. The communication module 202 may be communicatively coupled to the bus 220 via a signal line 222. The communication module 202 sends and receives data, via the communication unit 245, to and from one or more of the entities of the system 100 depicted in Figure 1A. For example, the communication module 202 may receive raw video data from the connection hub 123 via the communication unit 245 and may forward the raw video data to the video module 208. In another example, the communication module 202 may receive VR content from the stream combination module 214 and may send the VR content to the viewing system 133 via the communication unit 245.

In some implementations, the communication module 202 receives data from components of the aggregation system 131 and stores the data in the memory 237 or the storage device 241. For example, the communication module 202 receives VR content from the stream combination module 214 and stores the VR content in the memory 237 or the storage device 241. In some implementations, the communication module 202 retrieves data from the memory 237 or the storage device 241 and sends the data to one or more appropriate components of the aggregation system 131. Alternatively or additionally, the communication module 202 may also handle communications between components of the aggregation system 131.

The calibration module 204 may be software including routines for calibrating the camera modules 103 in the camera array 101. The calibration module 204 may be adapted for cooperation and communication with the processor 235 and other components of the computing device 200 via a signal line 224.

In some implementations, lenses included in the camera modules 103 may have some amount of spherical distortion. Images captured with the camera modules 103 may have a barrel distortion or a pin-cushion distortion that needs to be corrected during creation of panoramic images from the distorted images. The barrel distortion may be referred to as a "fish eye effect." For each camera module 103, the calibration module 204 calibrates a lens in the corresponding camera module 103 to determine associated distortion caused by the lens. For example, a snapshot of a test pattern that has known geometries placed in a known location (e.g., a checkerboard in a known location) may be captured by the camera module 103. The calibration module 204 may determine properties of a lens included in the camera module 103 from the snapshot of the test pattern. Properties of a lens may include, but are not limited to, distortion parameters, an optical center, and other optical properties associated with the lens.

The calibration module 204 stores data describing the properties of each lens in a configuration file. The configuration file may include data describing properties of all lenses of all the camera modules 103 in the camera array 101. For example, the configuration file includes data describing distortion parameters, an optical center, and other optical properties for each lens in the camera array 101.

Alternatively or additionally, the calibration module 204 may perform multi- camera geometric calibration on the camera array 101 to determine variations in the physical properties of the camera array 101. For example, the calibration module 204 may determine slight variations in camera orientation for each lens in the camera array 101, where the slight variations in the camera orientation may be caused by human errors occurring during an installation or manufacture process of the camera array 101. In another example, the calibration module 204 may estimate errors in the predicted roll, pitch, and yaw of a corresponding lens in each camera module 103. The calibration module 204 may determine a position and a rotational offset for the corresponding lens in each camera module 103 and may store the position and the rotational offset for the corresponding lens in the configuration file. As a result, the relative position of each two lenses in the camera array 101 may be determined based on the positions and rotational offsets of the two corresponding lenses. For example, spatial transformation between each two lenses may be determined based on the positions and rotational offsets of the two corresponding lenses.

The camera mapping module 206 may be software including routines for constructing a left camera map and a right camera map. The camera mapping module 206 may be adapted for cooperation and communication with the processor 235 and other components of the computing device 200 via a signal line 226. A two-dimensional (2D) spherical panoramic image may be used to represent a panorama of an entire scene. As described below with reference to the video module 208, two stereoscopic panorama images may be generated for two eyes to provide a stereoscopic view of the entire scene. For example, a left panoramic image may be generated for the left eye viewing and a right panoramic image may be generated for the right eye viewing. An example panoramic image is illustrated in Figure 6A.

A pixel in a panoramic image may be presented by a yaw value and a pitch value. Yaw represents rotation around the center and may be represented on the horizontal x-axis as:

yaw = 3600xx/width. (1)

Yaw has a value between 00 and 3600. Pitch represents up or down rotation and may be represented on the vertical y-axis as:

pitch = 900x(height/2-y)/(height/2). (2)

Pitch has a value between -900 and 900.

The panoramic images may give a sense of real depth by exploiting a human brain's capacity to transform disparity (e.g., shifts in pixel positions) into depth. For example, a nearby object may have a larger disparity than a far-away object. Disparity may represent pixel shifts in positions between two images. Disparity may be caused by an interocular distance which represents a distance between two eyes. Each eye may receive a slightly different image, which creates a sense of depth.

Typical stereoscopic systems (e.g., 3D movies) may respectively show two different planar images to two eyes to create a sense of depth. In each planar image, all pixels in the image represent a single eye viewing position. For example, all pixels in the planar image may represent a view into the same viewing direction. However, in the panoramic image described herein (the left or right panoramic image), each pixel in the panoramic image may represent a view into a slightly different direction. For example, a pixel at a position with yaw^e ^ ^ and pitch=00 in a left panoramic image may represent an eye viewing position of the left eye as the head is rotated to the position indicated by the yaw value and the pitch value. Similarly, a pixel at the position with yaw ^e ^ ^ and pitch=00 in a right panoramic image represents an eye viewing position of the right eye as the head is rotated to the position indicated by the yaw value and the pitch value. For pitch=00

(e.g., no up and down rotations), as the head is rotated from yaw⁼ 0 to yaw⁼ 360 _? a blended panorama for eye viewing positions with all 360-degree head rotations in the horizontal axis may be produced.

In some implementations, the blended panorama is effective for head rotations along the horizontal axis (e.g., yaw) but not for the vertical axis (e.g., pitch). As a user tilts his or her head upwards or downwards (e.g., pitch≠ 0 ), the dominant orientation of the user's eyes with respect to points on the sphere may become less well defined compared to pitch= 0 . For example, when the user looks directly upward with pitch= 90 , the orientation of the user's eyes with respect to the north pole point of the sphere may be completely ambiguous since the user's eyes may view the north pole point of the sphere from any yaw. Stereo vision may not be supported in the upward and downward directions using left/right eye spheres that are supported in the horizontal orientation. As a result, binocularity may be phased out by diminishing the interocular distance with an adjustment function f(pitch). An output of the adjustment function f(pitch) may decline from 1 to 0 as the pitch increases from 0 to 90 _or decreases from 0 to ^~ 90 p_or example, the adjustment function f(pitch) may include cos(pitch). The interocular distance may be adjusted based on the adjustment function f(pitch). For example, the interocular distance associated with the pitch may be adjusted as:

interocular distance = max(interocular distance) xf(pitch), (3)

where max(interocular distance) represents the maximum value of the interocular distance (e.g., the interocular distance is at its maximum when pitch=00). If f(pitch)=cos(pitch), then the interocular distance may be expressed as:

interocular distance = max(interocular distance)^xcos(pitch). (4)

In some examples, the maximum value of the interocular distance may be about 60 millimeters. In other examples, the maximum value of the interocular distance may have a value greater than 60 millimeters or less than 60 millimeters.

The camera mapping module 206 may construct a left camera map that identifies a corresponding matching camera module 103 for each pixel in a left panoramic image. For example, for a pixel in a left panoramic image that represents a point in a panorama, the left camera map may identify a matching camera module 103 that has a best view for the point in the panorama compared to other camera modules 103. Thus, the left camera map may map pixels in a left panoramic image to matching camera modules 103 that have best views for the corresponding pixels. Determination of a matching camera module 103 for a pixel is described below in more detail.

An example camera map is illustrated with reference to Figure 6B. A camera map may include a left camera map or a right camera map. A camera map may use (yaw, pitch) as an input and may generate an output of (an identifier of a matching camera module, x, y), indicating a pixel (yaw, pitch) in a panoramic image may be obtained as a pixel (x, y) in an image plane of the identified matching camera module. The camera map may store the output (an identifier of a matching camera module, x, y) in a map entry related to the input (yaw, pitch). Pixels in an image plane of a camera module may be determined by using a camera model (e.g., a pinhole camera model or more complex lens model) to map points in 3D space onto pixels in the image plane of the camera module, where the points in the 3D space are assumed to be at a particular distance from the camera module. For example, referring to Figure 7A, a distance for a point 716 may refer to a distance from the point 716 to a center of the camera array 101. The distance may be set at a fixed radius or varied as a function of pitch and yaw. The distance may be determined by: (1) measuring the scene; (2) manual adjustment by a human operator; (3) using a depth sensor to measure depths of the points in the 3D space; or (4) determining the depths using stereo disparity algorithms.

For each pixel in a left panoramic image that represents a point in a panorama, the camera mapping module 206 may determine a yaw, a pitch, and an interocular distance using the above mathematical expressions (1), (2), and (3), respectively. The camera mapping module 206 may use the yaw and pitch to construct a vector representing a viewing direction of the left eye (e.g., a left viewing direction) to the corresponding point in the panorama.

In some implementations, a matching camera module 103 for a pixel in a left panoramic image that has a better view of the pixel may have a viewing direction to a point in a panorama that corresponds to the pixel in the left panoramic image. The viewing direction of the matching camera module 103 is closer to the left viewing direction than other viewing directions of other camera modules 103 to the same point in the panorama. For example, referring to Figure 7 A, the viewing direction 714 of the matching camera module 103 a is more parallel to a left viewing direction 704 than other viewing directions of other camera modules 103. In other words, for each pixel in the left panoramic image, the left camera map may identify a corresponding matching camera module 103 that has a viewing direction most parallel to the left viewing direction than other viewing directions of other camera modules 103. Illustrations of a matching camera module 103 with a more parallel viewing direction to a left viewing direction are illustrated with reference to Figures 7 A and 7B.

Similarly, the camera mapping module 206 may construct a right camera map that identifies a corresponding matching camera module 103 for each pixel in a right panoramic image. For example, for a pixel in a right panoramic image that represents a point in a panorama, the right camera map may identify a matching camera module 103 that has a better view for the point in the panorama than other camera modules 103. Thus, the right camera map may map pixels in a right panoramic image to matching camera modules 103 that have better views for the corresponding pixels.

For each pixel in a right panoramic image that represents a point in a panorama, the camera mapping module 206 may determine a yaw, a pitch, and an interocular distance using the above mathematical expressions (1), (2), and (3), respectively. The camera mapping module 206 may use the yaw and pitch to construct a vector representing a viewing direction of the right eye (e.g., a right viewing direction) to the corresponding point in the panorama.

In some implementations, a matching camera module 103 for a pixel in a right panoramic image that has a better view of the pixel may have a viewing direction to a point in a panorama that corresponds to the pixel in the right panoramic image. The viewing direction of the matching camera module 103 is closer to the right viewing direction than other viewing directions of other camera modules 103 to the same point in the panorama. For example, the viewing direction of the matching camera module 103 is more parallel to the right viewing direction than other viewing directions of other camera modules 103. In other words, for each pixel in the right panoramic image, the right camera map may identify a corresponding matching camera module 103 that has a viewing direction most parallel to the right viewing direction than other viewing directions of other camera modules 103.

Since the physical configuration of the camera array 101 is fixed, the left and right camera maps are the same for different left panoramic images and right panoramic images, respectively. The left and right camera maps may be pre- computed and stored to achieve a faster processing speed compared to an on-the-fly computation.

The video module 208 may be software including routines for generating a stream of 3D video data configured to render 3D video when played back on a VR display device. The video module 208 may be adapted for cooperation and communication with the processor 235 and other components of the computing device 200 via a signal line 280. The stream of 3D video data may describe a stereoscopic panorama of a scene that may vary over time. The stream of 3D video data may include a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing.

In some implementations, the video module 208 receives raw video data describing image frames from the various camera modules 103 in the camera array 101. The video module 208 identifies a location and timing associated with each of the camera modules 103 and synchronizes the image frames based on locations and timings of the camera modules 103. The video module 208 synchronizes corresponding image frames that are captured by different camera modules 103 at the same time.

For example, the video module 208 receives a first stream of image frames from a first camera module 103 and a second stream of image frames from a second camera module 103. The video module 208 identifies that the first camera module 103 is located at a position with yaw = 00 and pitch = 00 and the second camera module 103 is located at a position with yaw = 300 and pitch = 00. The video module 208 synchronizes the first stream of image frames with the second stream of image frames by associating a first image frame from the first stream captured at a first particular time T=T0 with a second image frame from the second stream captured at the same particular time T=T0, a third image frame from the first stream captured at a second particular time T=T1 with a fourth image frame from the second stream captured at the same particular time T=T1, and so on and so forth. In some implementations, the video module 208 sends the synchronized image frames to the correction module 210 so that the correction module 210 may correct calibration errors in the synchronized image frames. For example, the correction module 210 may correct lens distortion, orientation errors, and rotation errors, etc., in the image frames. The correction module 210 may send the image frames back to the video module 208 after correcting the calibration errors.

The video module 208 may receive a left camera map and a right camera map from the camera mapping module 206. Alternatively, the video module 208 may retrieve the left and right camera maps from the storage device 241 or the memory 237. The video module 208 may construct a stream of left panoramic images from the image frames based on the left camera map. For example, the video module 208 identifies matching camera modules 103 listed in the left camera map. The video module 208 constructs a first left panoramic image PIL,0 by stitching image frames that are captured by the matching camera modules 103 at a first particular time T=T0. The video module 208 constructs a second left panoramic image PIL, 1 by stitching image frames that are captured by the matching camera modules 103 at a second particular time T=T1, and so on and so forth. The video module 208 constructs the stream of left panoramic images to include the first left panoramic image PIL,0, the second left panoramic image PIL,1, and other constructed left panoramic images.

Specifically, for a pixel in a left panoramic image PIL,i at a particular time T=Ti (i = 0, 1, 2, ...), the video module 208: (1) identifies a matching camera module 103 from the left camera map; and (2) configures the pixel in the left panoramic image PIL,i to be a corresponding pixel from an image frame that is captured by the matching camera module 103 at the particular time T=Ti. The pixel in the left panoramic image PIL,i and the corresponding pixel in the image frame of the matching camera module 103 may correspond to the same point in the panorama. For example, for a pixel location in the left panoramic image PIL,i that corresponds to a point in the panorama, the video module 208: (1) retrieves a pixel that also corresponds to the same point in the panorama from the image frame that is captured by the matching camera module 103 at the particular time T=Ti; and (2) places the pixel from the image frame of the matching camera module 103 into the pixel location of the left panoramic image PIL,i. Similarly, the video module 208 constructs a stream of right panoramic images from the image frames based on the right camera map by performing operations similar to those described above with reference to the construction of the stream of left panoramic images. For example, the video module 208 identifies matching camera modules 103 listed in the right camera map. The video module 208 constructs a first right panoramic image PIR,0 by stitching image frames that are captured by the matching camera modules 103 at a first particular time T=T0. The video module 208 constructs a second right panoramic image PIR,1 by stitching image frames that are captured by the matching camera modules 103 at a second particular time T=T1, and so on and so forth. The video module 208 constructs the stream of right panoramic images to include the first right panoramic image PIR,0, the second right panoramic image PIR,1, and other constructed right panoramic images.

Specifically, for a pixel in a right panoramic image PIR,i at a particular time T=Ti (i = 0, 1, 2, ...), the video module 208: (1) identifies a matching camera module 103 from the right camera map; and (2) configures the pixel in the right panoramic image PIR,i to be a corresponding pixel from an image frame that is captured by the matching camera module 103 at the particular time T=Ti. The pixel in the right panoramic image PIR,i and the corresponding pixel in the image frame of the matching camera module 103 may correspond to the same point in the panorama.

In some implementations, the video module 208 may construct pixels in a left or right panoramic image by blending pixels from image frames of multiple camera modules 103 according to weights associated with the multiple camera modules 103. An example pixel blending process is described below in more detail with reference to Figure 8.

In some implementations, the left and right panoramic images may be optimized for stereoscopic viewing in a horizontal plane (e.g., yaw^e [00, 3600] and pitch = 00). Alternatively or additionally, the left and right panoramic images may be optimized based on a user's viewing direction. For example, the video module 208 may adaptively construct the streams of left panoramic images and right panoramic images based on the user's current viewing direction. A panorama provided by the streams of left and right panoramic images may have a high- resolution in the user's current viewing direction and a low-resolution in a reverse viewing direction. This panorama may be referred to as a directional panorama. As the user rotates his or her head to view the panorama in a new viewing direction, the directional panorama may be adjusted to have a high resolution in the new viewing direction and a low resolution in a viewing direction opposite to the new viewing direction. Since only a directional panorama is constructed, bandwidth and other resources may be saved compared to constructing a full high-resolution panorama. However, quality of the 3D viewing experience is not affected if the user does not change viewing directions rapidly.

In some implementations, a constructed left or right panoramic image may have color deficiencies. For example, since the lenses in the camera modules 103 may point to different directions, light and color conditions may vary for the different lenses. Some image frames taken by some camera modules 103 may be over-exposed while some other image frames taken by other camera modules 103 may be under-exposed. The exposure or color deficiencies between image frames from different camera modules 103 may be corrected by the correction module 210 during a construction process of the left or right panoramic image.

Additionally or alternatively, due to the disparity between neighboring camera modules 103, a constructed left or right panoramic image may have stitching artifacts (or, stitching errors) where the viewpoint switches from a camera module 103 to a neighboring camera module 103. Objects that are far away from the camera modules 103 may have negligible disparity and there may be no stitching errors for the far-away objects. However, objects that are near the camera modules 103 may have noticeable disparity and there may be stitching errors for the nearby objects. Correction of the stitching errors is described below in more detail with reference to the correction module 210.

The correction module 210 may be software including routines for correcting aberrations in image frames or panoramic images. The correction module 210 is communicatively coupled to the bus 220 via a signal line 228. The aberrations may include calibration errors, exposure or color deficiencies, stitching artifacts, and other types of aberrations. The stitching artifacts may include errors made by the video module 208 when stitching image frames from various camera modules 103 to form a left or right panoramic image. The correction module 210 may analyze the image frames or the panoramic images to identify the aberrations. The correction module 210 may process the image frames or panoramic images to mask or correct the aberrations. The correction module 210 may automatically correct the aberrations or provide an administrator of the aggregation system 131 with tools or resources to manually correct the aberrations.

In some implementations, the correction module 210 receives image frames captured by a camera module 103 and corrects calibration errors on the image frames. For example, the correction module 210 may correct lens distortion (e.g., barrel or pin-cushion distortion) and camera orientation errors in the image frames based on lens distortion parameters, a position, and a rotational offset associated with the camera module 103.

In another example, the correction module 210 may analyze the image frames captured by the camera module 103, determine the calibration errors present in the image frames, and determine calibration factors used to calibrate the camera module 103. The calibration factors may include data used to automatically modify the image frames captured by the camera module 103 so that the image frames include fewer errors. In some implementations, the calibration factors are applied to the image frames by the correction module 210 so that the image frames include no errors that are detectable during user consumption of the VR content. For example, the correction module 210 may detect the deficiencies in the image frames caused by the calibration errors. The correction module 210 may determine one or more pixels associated with the deficiencies. The correction module 210 may determine the pixel values associated with these pixels and then modify the pixel values using the calibration factors so that the deficiencies are corrected. In some implementations, the calibration factors may also be provided to an administrator of the camera array 101 who uses the calibration factors to manually correct the calibration deficiencies of the camera array 101.

In some implementations, the correction module 210 may detect and correct exposure or color deficiencies in the image frames captured by the camera array 101. For example, the correction module 210 may determine one or more pixels associated with the exposure or color deficiencies. The correction module 210 may determine the pixel values associated with these pixels and then modify the pixel values so that the exposure or color deficiencies are not detectable by the user 134 during consumption of the VR content using the viewing system 133. In some implementations, the camera modules 103 of the camera array 101 have overlapping fields of view, and exposure or color deficiencies in the image frames captured by the camera array 101 may be corrected or auto -corrected using this overlap. In other implementations, exposure or color deficiencies in the image frames captured by the camera array 101 may be corrected using calibration based on color charts of known values.

In some implementations, the correction module 210 may correct stitching errors caused by close-by objects. For example, the closer an object is to the camera array 101, the greater the difference of a viewing angle from each camera module 103 to the object. Close-by objects that cross a stitching boundary may abruptly transition between viewing angles and may thus produce an obvious visual discontinuity. This may be referred to herein as the "close object problem." Stitching artifacts may be incurred for close-by objects. One example mechanism to reduce the stitching errors may include increasing the number of camera modules 103 distributed throughout a spherical housing case of the camera array 101 to approach an ideal of a single, continuous, and spherical image sensor. The mechanism may reduce the viewing angle discrepancy between neighboring cameras and may thus reduce the stitching artifacts. Alternatively, virtual cameras may be interpolated between real cameras to simulate an increasing camera density so that stitching artifacts may be reduced. Image stitching using virtual cameras is described in more detail in U.S. Application No. , titled "Image Stitching" and filed August 21, 2014, which is incorporated herein in its entirety by reference.

The audio module 212 may be software including routines for generating a stream of 3D audio data configured to render 3D audio when played back on an audio reproduction device. The audio module 212 is communicatively coupled to the bus 220 via a signal line 230. The audio module 212 may generate the 3D audio data based on the raw audio data received from the microphone array 107. In some implementations, the audio module 212 may process the raw audio data to generate four-channel ambisonic audio tracks corresponding to the 3D video data generated by the video module 208. The four-channel ambisonic audio tracks may provide a compelling 3D 360-degree audio experience to the user 134. In some implementations, the four-channel audio tracks may be recorded in an "A" format by the microphone array 107 such as a Tetramic microphone. The audio module 212 may transform the "A" format four-channel audio tracks to a "B" format that includes four signals: W, X, Y, and Z. The W signal may represent a pressure signal that corresponds to an omnidirectional microphone, and the X, Y, Z signals may correspond to directional sounds in front-back, left-right, and up-down directions, respectively. In some implementations, the "B" format signals may be played back in a number of modes including, but not limited to, mono, stereo, binaural, surround sound including four or more speakers, and any other modes. In some examples, an audio reproduction device may include a pair of headphones, and the binaural playback mode may be used for the sound playback in the pair of headphones. The audio module 212 may convolve the "B" format channels with Head Related Transfer Functions (HRTFs) to produce binaural audio with a compelling 3D listening experience for the user 134.

In some implementations, the audio module 212 generates 3D audio data that is configured to provide sound localization to be consistent with the user's head rotation. For example, if a sound is emanating from the user's right-hand side and the user rotates to face the sound, the audio reproduced during consumption of the VR content sounds as if it is coming from in front of the user.

In some implementations, the raw audio data is encoded with the directionality data that describes the directionality of the recorded sounds. The audio module 212 may analyze the directionality data to produce 3D audio data that changes the sound reproduced during playback based on the rotation of the user's head orientation. For example, the directionality of the sound may be rotated to match the angle of the user's head position. Assume that the VR content depicts a forest with a canopy of tree limbs overhead. The audio for the VR content includes the sound of a river. The directionality data indicates that the river is behind the user 134, and so the 3D audio data generated by the audio module 212 is configured to reproduce audio during playback that makes the river sound as if it is located behind the user 134. This is an example of the 3D audio data being configured to reproduce directionality. Upon hearing the audio for the river, the user 134 may sense that the river is behind him or her. The 3D audio data is configured so that as the user 134 tilts his or her head to the side, the sound of the water changes. As the angle of the tilt approaches 180 degrees relative to the starting point, the river sounds as though it is in front of the user 134. This is an example of the 3D audio data being configured to reproduce directionality based on the angle of the user's 134 head position. The 3D audio data may be configured so that the sound of the river becomes more distinct and clearer, and the user 134 has a better sense of how far the water is from the user 134 and how fast the water is flowing.

The stream combination module 214 may be software including routines for combining a stream of 3D video data and a stream of 3D audio data to generate VR content. The stream combination module 214 is communicatively coupled to the bus 220 via a signal line 229. The stream of 3D video data includes a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing. Redundancy exists between the stream of left panoramic images and the stream of right panoramic images.

The stream combination module 214 may compress the stream of left panoramic images and the stream of right panoramic images to generate a stream of compressed 3D video data using video compression techniques. In some implementations, within each stream of the left or right panoramic images, the stream combination module 214 may use redundant information from one frame to a next frame to reduce the size of the corresponding stream. For example, with reference to a first image frame (e.g., a reference frame), redundant information in the next image frames may be removed to reduce the size of the next image frames. This compression may be referred to as temporal or inter-frame compression within the same stream of left or right panoramic images.

Alternatively or additionally, the stream combination module 214 may use one stream (either the stream of left panoramic images or the stream of right panoramic images) as a reference stream and may compress the other stream based on the reference stream. This compression may be referred to as inter-stream compression. For example, the stream combination module 214 may use each left panoramic image as a reference frame for a corresponding right panoramic image and may compress the corresponding right panoramic image based on the referenced left panoramic image.

In some implementations, the stream combination module 214 may encode the stream of 3D video data (or compressed 3D video data) and 3D audio data to form a stream of VR content. For example, the stream combination module 214 may compress the stream of 3D video data using H.264 and the stream of 3D audio data using advanced audio coding (AAC). In another example, the stream combination module 214 may compress the stream of 3D video data and the stream of 3D audio data using a standard MPEG format. The VR content may be constructed by the stream combination module 214 using any combination of the stream of 3D video data (or the stream of compressed 3D video data), the stream of 3D audio data (or the stream of compressed 3D audio data), content data from the content server 139, advertisement data from the ad server 141, social data from the social network server 135, and any other suitable VR content.

In some implementations, the VR content may be packaged in a container format such as MP4, WebM, VP8, and any other suitable format. The VR content may be stored as a file on the client device 127 or the server 129 and may be streamed to the viewing system 133 for the user 134 from the client device 127 or the server 129. Alternatively, the VR content may be stored on a digital versatile disc (DVD), a flash memory, or another type of storage devices.

Referring now to Figure 3, an example method 300 for aggregating image frames and audio data to generate VR content is described in accordance with at least some implementations described herein. The method 300 is described with respect to Figures 1 and 2. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

In the illustrated implementation of Figure 3, the method 300 may include the communication module 202 receiving 302 raw video data. The raw video data may describe image frames from the camera modules 103. The communication module 202 receives 304 raw audio data from the microphone array 107. The video module 208 aggregates 306 the image frames to generate a stream of 3D video data. The stream of 3D video data includes a stream of left panoramic images and a stream of right panoramic images. The audio module 212 generates 310 a stream of 3D audio data from the raw audio data. The stream combination module 214 generates 312 VR content that includes the stream of 3D video data and the stream of 3D audio data. One skilled in the art will appreciate that, for this and other processes and methods disclosed herein, the functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are only provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed implementations .

Figures 4A-4C illustrate another example method 400 for aggregating image frames and audio data to generate VR content according to some implementations. The method 400 is described with respect to Figures 1 and 2. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

Referring to Figure 4A, the calibration module 204 calibrates 402 the camera modules 103 in the camera array 101. The communication module 202 receives 404 raw video data describing image frames from the camera modules 103. The communication module 202 receives 406 raw audio data from the microphone array 107. The video module 208 identifies 408 a location and timing associated with each of the camera modules 103. The video module 208 synchronizes 410 the images frames based on locations and timings associated with the camera modules 103. The camera mapping module 206 constructs 412 a left camera map and a right camera map. The left camera map identifies matching camera modules 103 for pixels in a left panoramic image. For example, for a pixel in a left panoramic image that represents a point in a panorama, the left camera map identifies a matching camera module 103 that has a better view to the point than other camera modules 103. Similarly, the right camera map identifies matching camera modules 103 for pixels in a right panoramic image.

Referring to Figure 4B, the video module 208 generates 414, based on the left camera map, a stream of left panoramic images from the image frames. For example, the video module 208 identifies matching camera modules 103 for pixels in left panoramic images based on the left camera map. The video module 208 stitches image frames that are captured by the corresponding matching camera modules 103 at a particular time to form a corresponding left panoramic image. The correction module 210 corrects 416 color deficiencies in the left panoramic images. The correction module 210 corrects 418 stitching errors in the left panoramic images.

The video module 208 generates 420, based on the right camera map, a stream of right panoramic images from the image frames. For example, the video module 208 identifies matching camera modules 103 for pixels in right panoramic images based on the right camera map. The video module 108 stitches image frames that are captured by the corresponding matching camera modules 103 at a particular time to form a corresponding right panoramic image. The correction module 210 corrects 422 color deficiencies in the right panoramic images. The correction module 210 corrects 424 stitching errors in the right panoramic images.

Referring to Figure 4C, the stream combination module 214 compresses 426 the stream of left panoramic images and the stream of right panoramic images to generate a compressed stream of 3D video data. The audio module 212 generates 428 a stream of 3D audio data from the raw audio data. The stream combination module 214 generates 430 VR content that includes the compressed stream of 3D video data and the stream of 3D audio data. In some implementations, the stream combination module 214 may also compress the stream of 3D audio data to form a compressed stream of 3D audio data, and the VR content may include the compressed stream of 3D video data and the compressed stream of 3D audio data.

Figure 5 illustrates an example process 500 of generating a left panoramic image and a right panoramic image from multiple image frames that are captured by multiple camera modules 103a, 103b... l03n at a particular time, arranged in accordance with at least some implementations described herein. At the particular time T=Ti (i = 0, 1, 2, ...), the camera module 103a captures an image frame 502a, the camera module 103b captures an image frame 502b, and the camera module 103n captures an image frame 502n. The video module 208 receives the image frames 502a, 502b, and 502n. The video module 208 aggregates the image frames 502a, 502b, and 502n to generate a left panoramic image 508 based on a left camera map 504 and a right panoramic image 510 based on a right camera map 506. The left panoramic image 508 and the right panoramic image 510 are associated with the particular time T=Ti.

Figure 6A is a graphic representation 600 that illustrates an example panoramic image, arranged in accordance with at least some implementations described herein. The panoramic image has a first axis "yaw" which represents rotation in a horizontal plane and a second axis "pitch" which represents up and down rotation in a vertical direction. The panoramic image covers an entire 360- degree sphere of a scene panorama. A pixel at a position [yaw, pitch] in the panoramic image represents a point in a panorama viewed with a head rotation having a "yaw" value and a "pitch" value. Thus, the panoramic image includes a blended view from various head rotations rather than a single view of the scene from a single head position.

Figure 6B is a graphic representation 650 that illustrates an example camera map, arranged in accordance with at least some implementations described herein. The example camera map matches first pixels in camera sections 652a and 652b of a panoramic image to a first matching camera module 103, second pixels in a camera section 654 to a second matching camera module 103, and third pixels in camera sections 656a and 656b to a third matching camera module 103. For the first pixels of the panoramic image within the camera sections 652a and 652b, values for the first pixels may be configured to be corresponding pixel values in a first image frame captured by the first matching camera module 103. Similarly, for the second pixels of the panoramic image within the camera section 654, values for the second pixels may be configured to be corresponding pixel values in a second image frame captured by the second matching camera module 103. For the third pixels of the panoramic image within the camera sections 656a and 656b, values for the third pixels may be configured to be corresponding pixel values in a third image frame captured by the third matching camera module 103. In this example, the panoramic image is stitched using part of the first image frame from the first matching camera module 103, part of the second image frame from the second matching camera module 103, part of the third image frame from the third matching camera module 103, and part of other image frames from other matching camera modules 103.

Figures 7 A and 7B are graphic representations 700 and 730 that illustrate example processes of selecting matching camera modules 103 for a pixel in a left and a right panoramic images, arranged in accordance with at least some implementations described herein. Referring to Figure 7A, the camera array 101 includes camera modules 103a, 103b, 103c, 103d and other camera modules mounted on a spherical housing. Assume that a point 716 corresponds to a head rotation position with yaw=900 and pitch =00. An interocular distance 712 is illustrated between a left eye position 718 and a right eye position 720. Since pitch =00, the interocular distance 712 is at its maximum value. The left eye position 718 and the right eye position 720 may be determined by: (1) drawing a first line from the point 716 to a center of the camera array 101; (2) determining an interocular distance based on a current pitch value; (3) drawing a second line that is perpendicular to the first line and also parallel to a plane with yaw=[00, 3600] and pitch=00, where the second line has a length equal to the determined interocular distance and is centered at the center of the camera array 101; and (4) configuring a left end point of the second line as the left eye position 718 and a right end point of the second line as the right eye position 720.

A left viewing direction 704 from the left eye position 718 to the point 716 and a right viewing direction 708 from the right eye position 720 to the point 716 are illustrated in Figure 7A. The camera modules 103a, 103b, and 103c have viewing directions 714, 722, 710 to the point 716, respectively.

Since the viewing direction 714 of the camera module 103 a is more parallel to the left viewing direction 704 compared to other viewing directions 722 and 710 (e.g., an angle between the viewing direction 714 and the left viewing direction 704 is smaller than angles between the left viewing direction 704 and other viewing directions 722 and 710), the camera module 103a is selected as a matching camera module that has a better view for the point 716 than other camera modules in a left camera map. Since the viewing direction 710 of the camera module 103c is more parallel to the right viewing direction 708 compared to other viewing directions 722 and 714, the camera module 103c is selected as a matching camera module that has a better view for the point 716 than other camera modules in a right camera map.

Referring to Figure 7B, assume that a point 736 in a panorama corresponds to a head rotation position with yaw=800 and pitch =00. An interocular distance 742 is illustrated between a left eye position 748 and a right eye position 749. A left viewing direction 734 from the left eye position 748 to the point 736 and a right viewing direction 740 from the right eye position 749 to the point 736 are illustrated in Figure 7B. The camera modules 103a, 103b, 103c, and 103d have viewing directions 732, 738, 744, 731 to the point 736, respectively. Since the viewing direction 732 of the camera module 103a is more parallel to the left viewing direction 734 compared to other viewing directions 738, 744, 731, the camera module 103 a is selected as a matching camera module that has a better view for the point 736 in a left camera map. Since the viewing direction 738 of the camera module 103b is more parallel to the right viewing direction 740 compared to other viewing directions 731, 734, 744, the camera module 103b is selected as a matching camera module that has a better view for the point 736 in a right camera map.

In some implementations, operations to determine a matching camera module for the point 736 in a left panoramic image for left eye viewing may be summarized as following: (1) determining a set of camera modules that have the point 736 in their respective fields of view; (2) determining the left viewing direction 734 from the left eye position 748 to the point 736; (3) determining a set of viewing directions to the point 736 for the set of camera modules; (4) selecting the viewing direction 732 from the set of viewing directions, where the viewing direction 732 forms a smallest angle with the left viewing direction 734 compared to angles formed between the left viewing direction 734 and other viewing directions in the set (in other words, the viewing direction 732 being more parallel to the left viewing direction 734 than the other viewing directions); and (5) configuring a matching camera module for the point 736 as the camera module 103 a that has the viewing direction 732. Some other cost functions for determining the matching camera module for the point 736 in the left panoramic image are possible as long as the cost functions may define some notion of best approximation to the view from the left eye position 748.

Similarly, operations to determine a matching camera module for the point 736 in a right panoramic image for right eye viewing may be summarized as following: (1) determining the set of camera modules that have the point 736 in their respective fields of view; (2) determining the right viewing direction 740 from the right eye position 749 to the point 736; (3) determining the set of viewing directions to the point 736 for the set of camera modules; (4) selecting the viewing direction 738 from the set of viewing directions, where the viewing direction 738 forms a smallest angle with the right viewing direction 740 compared to angles formed between the right viewing direction 740 and other viewing directions in the set; and (5) configuring a matching camera module for the point 736 as the camera module 103b that has the viewing direction 738. Some other cost functions for determining the matching camera module for the point 736 in the right panoramic image are possible as long as the cost functions may define some notion of best approximation to the view from the right eye position 749.

Figure 8 is a graphic representation 800 that illustrates an example process of blending pixels on a border of two camera sections, arranged in accordance with at least some implementations described herein. By way of example, the following description refers to blending pixels on a border 802 of two camera sections 804 and 806. More generally, the description also applies to blending pixels on borders of other camera sections.

Referring to Figure 8, an example camera map 810 maps pixels in camera sections 804 and 806 to a first matching camera module 103 and a second matching camera module 103, respectively. In other words, the first matching camera module 103 has a better view for first pixels in the camera section 804 than other camera modules, and the second camera module 103 has a better view for second pixels in the camera section 806 than other camera modules.

For pixels of a panoramic image located inside the camera section 804, values for the pixels may be configured to be corresponding pixel values captured by the first matching camera module 103. Similarly, for pixels of a panoramic image inside the camera section 806, values for the pixels may be configured to be corresponding pixel values captured by the second matching camera module 103. However, for pixels of a panoramic image on the border 802, first pixel values captured by the first matching camera module 103 may be blended with second pixel values captured by the second matching camera module 103 to form pixel values of the panoramic image on the border 802 so that visible seams caused by slight color or lighting mismatches between camera modules may be reduced or eliminated on the border 802.

For example, the first pixel values captured by the first camera module 103 may be separated into a first high-frequency part and a first low-frequency part, and the second pixel values captured by the first camera module 103 may be separated into a second high-frequency part and a second low-frequency part. The first low- frequency part and the second low-frequency part may be combined to form a blended low-frequency part using weights associated with the corresponding camera modules. One of the first high-frequency part and the second high-frequency part may be selected and may be combined with the blended low-frequency part to form pixel values for the blended pixels on the border 802. For example, the blended pixels may be obtained as:

(values of blended pixels) = (high-frequency part associated with a selected camera module) +

(low-frequency part of camera module i)^xWi,

where M represents a total number of camera modules (or matching camera modules) that capture the pixels on the border 802, and Wi represents a weight for the corresponding camera module i.

The weight Wi for the low-frequency part of the camera module i may decline as a viewing point of a user moves toward a field of view boundary of the camera module i. For example, as the user rotates his or her head and the user's viewing point moves from the field of view of the camera module i to a field of view of a camera module i+1, the weight Wi for the low-frequency part of the camera module i may decline to zero and a weight Wi+1 for the low- frequency part of the camera module i+1 may increase from zero to a non-zero value.

In some implementations, the weights for the low-frequency parts of the camera modules may be stored in a camera map. As described above, a camera map may store an entry "(an identifier of a matching camera module, x, y)" in a map entry related to an input (yaw, pitch), where the input (yaw, pitch) may represent a pixel (yaw, pitch) in a panoramic image and (x, y) may represent a pixel at the position (x, y) in an image plane of the identified matching camera module. The camera map may also store a respective weight for a low-frequency part of each identified matching camera module. For example, the camera map may store an entry "(an identifier of a matching camera module, x, y, a weight for a low- frequency part of the matching camera module)" in the map entry related to the input (yaw, pitch).

Figures 9A and 9B are graphic representations 900 and 920 that illustrate an example panoramic image (e.g., a left or right panoramic image) with improved representation, arranged in accordance with at least some implementations described herein. Referring to Figure 9A, an example panoramic image 901 may include an equator region 902 (3600^900), a north pole region 904 (e.g., a 3600^450 ceiling region), and a south pole region 906 (e.g., a 3600x450 floor region). The equator region 902 may include an area with less distortion than the north pole region 904 and the south pole region 906.

Rather than constructing a panorama using the panoramic image 901 that includes the regions 902, 904, and 906, the panorama may be constructed using the equator region 902, a square north pole part 924 (900^χ900, with the north pole in the center of the north pole part 924), and a square south pole part 926 (900x900, with the south pole in the center of the south pole part 926). In other words, the north pole part 924 and the south pole part 926 may replace the north pole region 904 and the south pole region 906 to construct the panorama, respectively. For example, the panorama may be constructed by pasting the equator region 902 into a middle section of a sphere, the square north pole part 924 into a top section of the sphere, and the square south pole part 926 into a bottom section of the sphere. The north pole part 924 has a circumference of 900^χ4=3600, which matches a top edge of the equator region 902. Similarly, the south pole part 926 has a circumference of 900^χ4=3600, which matches a bottom edge of the equator region 902.

Compared to the panorama constructed using the regions 902, 904, and 906 of Figure 9A, the panorama constructed using the equator region 902, the north pole part 924, and the south pole part 926 has less pixels (e.g., 25% less pixels) and less distortion in the polar regions. The resolution for the parts 924 and 926 may be lower than the resolution for the equator region 902, which further improves efficiency of representing the panorama. The equator region 902, the north pole part 924, and the south pole part 926 may be arranged as a rectangular image as illustrated in Figure 9B and transmitted to the viewing system 133.

Figures 1 OA- IOC are graphic representations 1000, 1040, and 1060 that illustrate a relationship between an increasing density of camera modules 103 and a reduction of stitching errors in panoramic images according to some implementations. Referring to Figure 10A, four camera modules (e.g., cameras 1, 2, 3, and 4) are located inside two concentric walls, with a circular inner wall 1002 closer to the cameras 1, 2, 3, and 4 and shorter than a circular outer wall 1004. A panoramic view of the inner wall 1002 and the outer wall 1004 may be split into four equal quadrants as illustrated using solid lines 1006a, 1006b, 1006c, and 1006d since there are four cameras capturing the panoramic view. Each camera may need a wide field of view to capture a corresponding portion of the panoramic view of the inner wall 1002. For example, a wide field of view of camera 2 is illustrated using dashed lines 1008a and 1008b.

Since centers of cameras 1, 2, 3, and 4 are not co-located in the center of the inner wall 1002 and the outer wall 1004, each camera may have a view of the inner wall 1002 that has less overlap between camera quadrants than a view of the outer wall 1004. For example, a view of the inner wall 1002 and the outer wall 1004 from camera 2 is illustrated in a left graph of Figure 10B, with shaded areas illustrating the overlaps. To stitch images from the different cameras, a boundary of each quadrant image may need to be a straight line. However, there may be no straight line that can eliminate the overlap for both the inner wall 1002 and the outer wall 1004. For example, straight lines 1042 A and 1042B that eliminate overlap for the outer wall 1004 cut off part of the inner wall 1002. Thus, part of the inner wall 1002 may disappear in the panorama. Straight lines 1044A and 1044B that eliminate overlap for the inner wall 1002 leave overlap of the outer wall 1004 in the panorama. Thus, part of the outer wall 1004 may be replicated in the panorama.

After removing the overlaps between camera quadrants, the view of the inner wall 1002 and the outer wall 1004 is illustrated in a middle graph of Figure 10B. However, since the inner wall 1002 is closer to camera 2 than the outer wall 1004, the view of the inner wall 1002 is larger in size than the view of the outer wall 1004 as illustrated in the middle graph of Figure 10B. Stitching errors may occur if the views of the inner wall 1002 and the outer wall 1004 from different cameras are stitched together without adjusting the views of the inner wall 1002. In other words, to avoid visually detectable stitching errors, the view of the inner wall 1002 may be adjusted to be consistent with the view of the outer wall 1004, as illustrated in a right graph of Figure 10B.

In scenarios where various objects are located in various locations in a scene, it may be a challenge to adjust views of closer objects to fit to views of far-away objects. However, if more cameras are added to capture the scene (e.g., a density of cameras is increased), each camera may use a narrower field of view to capture the scene and viewing angles of each camera for the inner wall 1002 and the outer wall 1004 may converge. As a result, stitching errors incurred from aggregating images from different cameras may be reduced or eliminated. By way of examples, camera 2 and viewing angles of camera 2 are illustrated in Figure IOC. If a narrower field of view of camera 2 is used, viewing angles of the inner wall 1002 and the outer wall 1004 from camera 2 may converge. An example mechanism to increase a camera density in the camera array 101 may include adding virtual cameras to the camera array 101, which is described below in more detail with reference to Figures 11-15B.

Referring now to Figure 11, another example of the aggregation system 131 is illustrated in accordance with at least some implementations described herein. Figure 11 is a block diagram of a computing device 1100 that includes the aggregation system 131, a memory 1137, a processor 1135, a storage device 1141, and a communication unit 1145. In the illustrated implementation, the components of the computing device 1100 are communicatively coupled by a bus 1120. In some implementations, the computing device 1100 may be a personal computer, smartphone, tablet computer, set top box, or any other processor-based computing device. The computing device 1100 may be one of the client device 127, the server 129, and another device in the system 100 of Figure 1A.

The processor 1135 may include an arithmetic logic unit, a microprocessor, a general-purpose controller, or some other processor array to perform computations and provide electronic display signals to a display device. The processor 1135 is coupled to the bus 1120 for communication with the other components via a signal line 1138. The processor 1135 may process data signals and may include various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although Figure 11 includes a single processor 1135, multiple processors may be included. Other processors, operating systems, sensors, displays, and physical configurations may be possible.

The memory 1137 includes a non-transitory memory that stores data for providing the functionality described herein. The memory 1137 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the memory 1137 also includes a non- volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The memory 1137 may store the code, routines, and data for the aggregation system 131 to provide its functionality. The memory 1137 is coupled to the bus 1120 via a signal line 1144.

The communication unit 1145 may transmit data to any of the entities of the system 100 depicted in Figure 1A. Similarly, the communication unit 1 145 may receive data from any of the entities of the system 100 depicted in Figure 1A. The communication unit 1145 may include one or more Ethernet switches for receiving the raw video data and the raw audio data from the connection hub 123. The communication unit 1145 is coupled to the bus 1120 via a signal line 1146. In some implementations, the communication unit 1145 includes a port for direct physical connection to a network, such as a network 105 of Figure 1A, or to another communication channel. For example, the communication unit 1145 may include a port such as a USB, SD, RJ45, or similar port for wired communication with another computing device. In some implementations, the communication unit 1145 includes a wireless transceiver for exchanging data with another computing device or other communication channels using one or more wireless communication methods, including IEEE 802.11, IEEE 802.16, BLUETOOTH®, or another suitable wireless communication method.

In some implementations, the communication unit 1145 includes a cellular communications transceiver for sending and receiving data over a cellular communications network including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, e-mail, or another suitable type of electronic communication. In some implementations, the communication unit 1145 includes a wired port and a wireless transceiver. The communication unit 1145 also provides other conventional connections to a network for distribution of data using standard network protocols including TCP/IP, HTTP, HTTPS, and SMTP, etc.

The storage device 1141 may be a non-transitory storage medium that stores data for providing the functionality described herein. The storage device 1141 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the storage device 1 141 also includes a non- volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The storage device 1141 is communicatively coupled to the bus 1120 via a signal line 1142.

In the implementation illustrated in Figure 11 , the aggregation system 131 includes a communication module 1102, a disparity module 1104, a virtual camera module 1106, a similarity score module 1108, a camera mapping module 1110, a video module 1112, an audio module 1114, and a stream combination module 1116. These modules of the aggregation system 131 are communicatively coupled to each other via the bus 1120.

In some implementations, each module of the aggregation system 131 (e.g., module 1102, 1104, 1106, 1108, 1110, 1112, 1114, or 1116) may include a respective set of instructions executable by the processor 1135 to provide its respective functionality described below. In some implementations, each module of the aggregation system 131 may be stored in the memory 1137 of the computing device 1100 and may be accessible and executable by the processor 1135. Each module of the aggregation system 131 may be adapted for cooperation and communication with the processor 1135 and other components of the computing device 1100.

The communication module 1102 may be software including routines for handling communications between the aggregation system 131 and other components of the computing device 1100. The communication module 1102 may be communicatively coupled to the bus 1120 via a signal line 1122. The communication module 1102 sends and receives data, via the communication unit 1145, to and from one or more of the entities of the system 100 depicted in Figure 1A. For example, the communication module 1102 may receive raw video data from the connection hub 123 via the communication unit 1145 and may forward the raw video data to the video module 1112. In another example, the communication module 1102 may receive VR content from the stream combination module 1116 and may send the VR content to the viewing system 133 via the communication unit 1145. In some implementations, the communication module 1102 receives data from components of the aggregation system 131 and stores the data in the memory 1137 or the storage device 1141. For example, the communication module 1102 receives VR content from the stream combination module 1116 and stores the VR content in the memory 1 137 or the storage device 1141. In some implementations, the communication module 1102 retrieves data from the memory 1137 or the storage device 1141 and sends the data to one or more appropriate components of the aggregation system 131. Alternatively or additionally, the communication module 1102 may also handle communications between components of the aggregation system 131.

The disparity module 1104 may be software including routines for estimating disparity maps between two or more camera modules 103. The disparity module 1104 may be communicatively coupled to the bus 1120 via a signal line 1124. In some implementations, the two or more camera modules 103 may be two or more neighboring camera modules 103. Two or more neighboring camera modules 103 may refer to two or more camera modules 103 in the camera array 101 that locate in proximity to each other and have overlapping fields of view. Alternatively, the two or more camera modules 103 may not be neighboring camera modules. The two or more camera modules 103 may have an overlapping field of view. For simplicity and convenience of discussion, estimation of disparity maps is described below with reference to a first neighboring camera module 103 (also referred to as "Camera A") and a second neighboring camera module 103 (also referred to as "Camera B"). The description also applies to estimation of disparity maps between more than two neighboring camera modules 103.

Camera A and Camera B may have an overlapping field of view. Objects within this overlapping field of view may be visible to both cameras, and appearance of these objects in image frames captured by the cameras may be determined based on the point of view of the corresponding camera. For example, Camera A may capture a first image for a scene and Camera B may capture a second image for the scene at a particular time. The first image may have a first sub-image that overlaps with a second sub-image from the second image in the overlapping field of view. The first sub-image may represent a portion of the first image that overlaps with Camera B's field of view in an area of the overlapping field of view. The second sub-image may represent a portion of the second image that overlaps with Camera A's field of view in the area of the overlapping field of view. For convenience of discussion, the first sub-image may be referred to as "Image AB" and the second sub-image may be referred to as "Image BA." Image AB and Image BA overlap with each other in the overlapping field of view of Camera A and Camera B.

If image planes of Camera A and Camera B are not coplanar, a transformation such as image rotation may be applied to create coplanar images. If the image planes of the Camera A and Camera B are coplanar and projection centers of the two cameras are closer to each other compared to objects in the scene, the appearances of objects in the first and second images may differ primarily in their displacement along an epipolar line that connects the projection centers of the two cameras. The different appearances of the objects in the first and second images may be referred to as parallax, and the difference in the object positions in the first and second images may be referred to as disparity. An illustration of disparity is illustrated in Figure 18.

A disparity map may represent a two-dimensional (2D) map that specifies disparity within an overlapping field of view between two cameras at a level of individual pixels. For example, a first disparity map from Camera A to Camera B may map disparity of pixels from Image AB to Image BA and may be referred to as Disparity(AB^→'BA). A second disparity map from Camera B to Camera A may map disparity of pixels from Image BA to Image AB and may be referred to as Disparity(BA^→AB). The first disparity map "Disparity(AB^→ BA)" and the second disparity map "Disparity(BA^→AB)" may be substantially symmetric and may differ at points of occlusion. Points of occlusion may refer to pixels that are visible to one camera and invisible to another camera because the view from the other camera may be blocked by other objects.

For example, assume that Camera A is horizontally displaced to the left of Camera B so that all epipolar lines are horizontal, or along an x-axis. Both Image AB and Image BA have a size of 100 pixels x 100 pixels, respectively. The first disparity map "Disparity(AB^→ BA)" and the second disparity map "Disparity(BA ^~~ ^AB)" may each have a size of 100 x 100 since each of the first and second disparity maps covers the entire overlapping field of view. Assume that a map entry at a position (8,4) in the first disparity map "Disparity(AB^→ BA)" has a disparity value of "-5," which means that a pixel of Image AB at the position (8,4) corresponds to a pixel of Image BA at a position (3,4) (e.g., the x coordinate value 3=8-5). The disparity value "-5" may represent a disparity of "5" in a direction opposite to an epipolar direction along an epipolar line that connects a projection center of Camera A to a projection center of Camera B. Symmetrically, a map entry at the position (3,4) of the second disparity map "Disparity(BA^→AB)" may have a disparity value of "5," which means that the pixel of Image BA at the position (3,4) corresponds to the pixel of Image A at the position (8,4) (e.g., the x coordinate value 8=3+5). The disparity value "5" may represent a disparity of "5" in the epipolar direction along the epipolar line.

As a result, given Image AB and the first disparity map "Disparity(AB^→ BA)," an estimate of Image BA may be determined except at points that are visible to Camera B and invisible to Camera A. Similarly, given Image BA and the second disparity map "Disparity(BA^→A )," an estimate of Image AB may be determined except at points that are visible to Camera A and invisible to Camera B.

The disparity module 1104 may estimate the first disparity map "Disparity(AB^→BA)" by comparing pixels of Image AB and pixels of Image BA. If exposure, gain, white balance, focus, and other properties of Camera A and Camera B are not identical, Image AB and Image BA may be adjusted to match the brightness, color, and sharpness between the two images. For a pixel (x,y) in Image AB, a set of disparity values are selected and a set of similarity scores corresponding to the set of disparity values for the pixel (x,y) is determined by the similarity score module 1108 as described below in more detail. A map entry at the position (x,y) of the first disparity map "Disparity(AB^→ BA)" may have a value equal to a disparity value that has a highest similarity score from the set of similarity scores. An example method of estimating a disparity map is described with reference to Figures 14A and 14B.

For example, assume Image AB and Image BA have horizontal disparity. For a pixel (3,5) of Image AB, a first disparity value "0" is selected. Thus, a pixel (3,5) of Image BA is compared to the pixel (3,5) of Image AB to determine a first similarity score, since the pixel (3,5) of Image BA has a "0" disparity to the pixel (3,5) of Image AB. Next, a second disparity value is selected and a pixel (2,5) of Image BA is compared to the pixel (3,5) of Image AB to determine a second similarity score, since the pixel (3,5) of Image AB has a disparity to the pixel (2,5) of Image BA. Similarly, other disparity values may be selected and corresponding similarity scores may be determined for the pixel (3,5) of Image AB. A map entry at the position (3,5) of the first disparity map "Disparity(AB^→ BA)" may be configured to have a disparity value that corresponds to the highest similarity score from the determined similarity scores.

A disparity value may include an integer value (e.g., 0, -1, 1, -2, 2, ...) or a non-integer value. Non-integer disparity values may be used to determine similarity scores using pixel interpolation. A maximum absolute value for the disparity value may be determined based on how close the objects in the scene are expected to get to the cameras.

Similarly, the disparity module 1104 may estimate the second disparity map "Disparity(BA^→AB)" by performing operations similar to those described above. Alternatively, the disparity module 1104 may estimate the second disparity map "Disparity(BA→AB)" from the first disparity map "Disparity(AB^→BA)." For example, if a map entry at a position (x,y) of the first disparity map has a disparity value of "d," a map entry at a position (x+d,y) of the second disparity map has a disparity value of "-d."

In some implementations, one or more pixels in Image AB may not have corresponding pixels in Image BA and vice versa, since foreground objects may occlude background objects in the scene. The disparity module 1104 may detect pixel occlusion by configuring a similarity score threshold. For example, if a highest similarity score for a pixel is below the similarity score threshold, a map entry that corresponds to the pixel in a disparity map may be configured to be blank to indicate a pixel occlusion.

In some implementations, the disparity module 1104 may detect disparity collisions. Since each pixel's disparity may be determined independently, collisions may occur in the disparity map. A collision may indicate that two or more pixels in a first image may map to a common pixel in a second image, and the two or more pixels may be referred to as collision pixels. The disparity module 1104 may select a collision pixel with a higher similarity score from the collision pixels, and may configure a corresponding map entry in the disparity map that maps the collision pixel with the higher similarity score to the common pixel in the second image. For other collision pixels with lower similarity scores, the disparity module 1104 may leave associated map entries blank in the disparity map to indicate pixel occlusion.

For example, during computation of the first disparity map "Disparity(AB^→ BA)," both pixels (10,13) and (7,13) in Image AB may correspond to a common pixel (6,13) in Image BA with disparity values of "-4" and and similarity scores of "10" and "8," respectively. In this example, a disparity collision occurs for the pixels (10,13) and (7,13) in Image AB. The disparity module 1104 may configure a map entry at the position (10,13) with a disparity value "-4" and a map entry at the position (7,13) to be blank to indicate pixel occlusion, since the pixel (10,13) has a higher similarity score than the pixel (7,13).

In some implementations, the disparity module 1104 may estimate a disparity value for an occluded pixel. For example, the disparity module 1104 may determine two non-occluded pixels along the epipolar line that are closest to the occluded pixel, with the two non-occluded pixels each on one side of the occluded pixel. The two non-occluded pixels may have two disparity values, respectively. The disparity module 1 104 may select a smaller disparity value from the two disparity value as a disparity value for the occluded pixel. For example, assume that a disparity map along the epipolar line includes map entries with disparity values "2," "3," "4," "occluded," "occluded," "7," "7," "8," respectively. The disparity module 1104 may estimate the disparity values for the map entries to be "2," "3," ^«4 ^{» «}4 ^{» «}4 ^{» «}7 ^{» «}7 ^» " ," respectively, where the occluded map entries may be estimated to have disparity values of "4" and "4."

Alternatively, the disparity module 1104 may model a trend of disparity to capture trending features such as a wall slanting toward the camera. For example, assume that a disparity map along the epipolar line includes map entries with disparity values "1," "2," "3," "occluded," "occluded," "9," "9," "10," respectively. The disparity module 1104 may estimate the disparity values for the map entries to be "1," "2," "3," "4," "5," "9," "9," "10," respectively. In this example, the disparity values "1," "2," and "3" may indicate an increasing trend and the occluded map entries may be estimated to have disparity values "4" and "5" following the increasing trend.

In some implementations, more than two cameras may overlap in the same overlapping field of view, and disparity information from different cameras may be combined to improve the disparity estimation. For example, assume projection centers of a first camera, a second camera, and a third camera are located along a horizontal epipolar line. The first camera and the second camera may form a first pair of a left-eye viewing and a right-eye viewing to observe objects in the scene. The second camera and the third camera may form a second pair of the left-eye viewing and the right-eye viewing to observe objects in the scene. If the projection centers of three cameras are spaced at equal distances along the horizontal epipolar line, ideally both the first pair and the second pair may have the same disparity measurement for the same object in the scene. However, since disparity measurements may have noise, a first disparity measurement of the first pair may be different from a second disparity measurement of the second pair. The first disparity measurement and the second disparity measurement may be used to check for agreement and may be combined to generate a disparity measurement to improve measurement accuracy. In some implementations, the disparity map may be noisy, and the disparity module 1104 may apply edge-preserving filters such as median filters to smooth the disparity map.

The virtual camera module 1106 may be software including routines for determining virtual cameras and virtual camera images for the virtual cameras. The virtual camera module 1 106 may be coupled to the bus 1120 via a signal line 1126. In some implementations, the virtual camera module 1106 may interpolate one or more virtual cameras between neighboring camera modules 103 in the camera array 101. For example, the virtual camera module 1106 may interpolate one or more virtual cameras between Camera A and Camera B and may determine one or more positions for the one or more virtual cameras relative to positions of Camera A and Camera B. The virtual camera module 1106 may also interpolate other virtual cameras between other neighboring camera modules 103 in the camera array 101.

For each virtual camera between Camera A and Camera B, the virtual camera module 1106 may estimate a virtual camera image based on the first disparity map "Disparity(AB^→BA)," the second disparity map "Disparity(BA^→ AB)," and a position of the virtual camera relative to positions of Camera A and Camera B. A position of the virtual camera relative to positions of Camera A and Camera B may be determined by a scalar a with a value between 0 and 1, where a=0 indicates that the virtual camera co-locates with Camera A and a=l indicates that the virtual camera co-locates with Camera B. The virtual camera image for the virtual camera may be estimated from Image AB of Camera A and Image BA of Camera B.

For example, the virtual camera module 1106 may scale disparity values stored in map entries of the first disparity map "Disparity(AB^→ BA)" by the scalar a, and may shift respective pixels in Image AB by the respective scaled disparity values to generate a first shifted image from Image AB. The virtual camera module 1106 may scale disparity values stored in map entries of the second disparity map "Disparity(BA^→AB)" by a scalar 1-a, and may shift respective pixels in Image BA by the respective scaled disparity values to generate a second shifted image from Image BA. The virtual camera module 1106 may combine the first shifted image and the second shifted image to generate the virtual camera image for the virtual camera. For example, for each pixel defined in both the first shifted image and the second shifted image, the virtual camera module 1106 may make an average over or take a maximum value from the corresponding pixel values of the two shifted images. The virtual camera module 1106 may use a linear or non-linear filter and temporal information from previous or future image frames to fill in missing pixels in the virtual camera image. An example non-linear filter includes a median filter. An example method of estimating a virtual camera image for a virtual camera is described below with reference to Figures 13A and 13B.

The similarity score module 1108 may be software including routines for determining similarity scores between a first pixel in a first image (e.g., Image AB) and second pixels in a second image (e.g., Image BA). The second pixels in the second image may have different disparities to the first pixel in the first image. The similarity score module 1108 may be coupled to the bus 1120 via a signal line 1180.

For a particular disparity value, the similarity score module 1108 generates metric values for pixels of Image AB along the epipolar line. A metric value may include one of a sum of absolute difference (SAD), a sum of squared difference (SSD), a correlation-based value, or other suitable metrics. The metric value may be determined across all red, green, blue (RGB) color channels or in some other color space such as YUV, luminance, etc. For example, two pixels (1,5) and (2,5) of Image AB are along the epipolar line. For a disparity value "3," the similarity score module 1108 determines: (1) a first metric value for the pixel (1,5) of Image AB by comparing the pixel (1,5) of Image AB to a pixel (4,5) of Image BA; and (2) a second metric value for the pixel (2,5) of Image AB by comparing the pixel (2,5) of Image AB to a pixel (5,5) of Image BA. For a disparity value "4," the similarity score module 1108 determines: (1) a first metric value for the pixel (1,5) of Image AB by comparing the pixel (1,5) of Image AB to a pixel (5,5) of Image BA; and (2) a second metric value for the pixel (2,5) of Image AB by comparing the pixel (2,5) of Image AB to a pixel (6,5) of Image BA.

A metric value may also be referred to as a distance metric score. The metric value may measure how similar two pixels are by calculating a distance between the two pixels. A zero-value metric value may indicate that the two pixels are identical with a zero distance. A larger metric value may represent more dissimilarity between two pixels than a smaller metric value.

In some implementations, the similarity score module 1108 may initially filter or process Image AB and Image BA to reduce noise that may affect the pixel matching measurements. The similarity score module 1108 may perform a search along a direction that is perpendicular to the epipolar line for pixels with a better match to counteract slight misalignments in the direction perpendicular to the epipolar line.

The similarity score module 1108 may determine a metric threshold that may be used to define runs of adjacent pixels along the epipolar line. A run may include a contiguous group of pixels with metric values below the determined metric threshold. The similarity score module 1108 may determine runs for pixels along the epipolar line based on metric values associated with the pixels and the metric threshold. For example, a particular pixel along the epipolar line that participates in a run calculation may have a run value equal to the calculated run. The similarity score module 1108 may determine preliminary scores for pixels along the epipolar line based on runs of the pixels and the metric threshold. For example, a preliminary score for each pixel along the epipolar line may be equal to the run of the corresponding pixel divided by the metric threshold. Next, the similarity score module 1108 may vary the metric threshold and determine different preliminary scores for the pixels along the epipolar line for the different metric thresholds. The metric threshold may be varied in a range between zero and a maximum threshold. The maximum threshold may be determined based on how much difference a user may visually tolerate before determining two images are images with different objects. If a metric value exceeds the maximum threshold, the two images used to calculate the metric value may not be treated as images capturing the same object. The similarity score module 1108 may determine a similarity score for a pixel along the epipolar line as a highest preliminary score of the pixel. A similarity score may indicate a degree of similarity between two pixels. A higher similarity score for two pixels may indicate more similarity between the two pixels than a smaller similarity score. A method of determining similarity scores is described below with reference to Figures 15A and 15B.

For example, SAD metric values for pixels along the epipolar line for a particular disparity value may include: 3, 4, 2, 3, 1, 6, 8, 3, 1. If the similarity score module 1108 determines a metric threshold to be 5, runs of adjacent pixels that are not above the metric threshold may include: 5, 0, 2, where the first five metric values "3, 4, 2, 3, 1" are below the metric threshold and thus a run of "5" is generated, the next two metric values "6, 8" are above the metric threshold and thus a run of "0" is generated, and the last two metric values "3, 1" are below the metric threshold and thus a run of "2" is generated. Thus, the first five pixels with metric values "3, 4, 2, 3, 1" may each have a run of "5," the next two pixels with metric values "6, 8" may each have a run of "0," and the last two pixels with metric values of "3, 1" may each have a run of "2." As a result, runs for the pixels along the epipolar line include: 5, 5, 5, 5, 5, 0, 0, 2, 2. Preliminary scores for the pixels along the epipolar line may be equal to the corresponding runs divided by the metric threshold "5" and may include: 1, 1, 1, 1, 1, 0, 0, 2/5, 2/5. Next, the metric threshold may be modified to be 6. Runs for the pixels may include: 6, 6, 6, 6, 6, 6, 0, 0, 2, 2. Another set of preliminary scores for the pixels along the epipolar line for the metric threshold "6" may include: 1, 1, 1, 1, 1, 1, 0, 2/6, 2/6. The similarity score module 1108 may select different metric thresholds and determine different preliminary scores associated with the different metric thresholds for the pixels. The similarity score module 1108 may determine a similarity score for a particular pixel along the epipolar line as a highest preliminary score of the particular pixel.

More generally, the mechanisms described herein to estimate disparity maps and to determine similarity scores are provided by way of example. There may be numerous other ways to estimate the disparity maps and the similarity scores. The camera mapping module 1110 may be software including routines for constructing a left camera map and a right camera map. The camera mapping module 1110 may be adapted for cooperation and communication with the processor 1135 and other components of the computing device 1100 via a signal line 1128.

A camera map may include a left camera map or a right camera map. A camera map may use (yaw, pitch) as an input and may generate an output of (an identifier of a matching camera, x, y), indicating a pixel (yaw, pitch) in a panoramic image may be obtained as a pixel (x, y) in an image plane of the identified matching camera. The camera map may store the output (an identifier of a matching camera, x, y) in a map entry related to the input (yaw, pitch). Pixels in an image plane of a camera module may be determined by using a camera model (e.g., a pinhole camera model or more complex lens model) to map points in 3D space onto pixels in the image plane of the camera module, where the points in the 3D space are assumed to be at a particular distance from the camera module.

A two-dimensional (2D) spherical panoramic image may be used to represent a panorama of a scene. As described below with reference to the video module 1112, two stereoscopic panorama images may be generated for two eyes to provide a stereoscopic view of the entire scene. For example, a left panoramic image may be generated for the left eye viewing and a right panoramic image may be generated for the right eye viewing. An example panoramic image is illustrated in Figure 16B.

The camera mapping module 1110 may construct a left camera map that identifies a respective matching camera for each pixel in a left panoramic image. For example, for a pixel in a left panoramic image that represents a point in a panorama, the left camera map may identify a matching camera module 103 or a matching virtual camera that has a better view for the point in the panorama than other camera modules 103 and other virtual cameras. A matching camera may include a matching camera module 103 (e.g., a real camera) or a matching virtual camera. Thus, the left camera map may map pixels in a left panoramic image to matching cameras that have better views for the corresponding pixels. Determination of a matching camera for a pixel is described below in more detail. An example camera map is illustrated with reference to Figure 16C. For a pixel in a left panoramic image that represents a point in a panorama, the camera mapping module 1110 may determine a yaw, a pitch, and an interocular distance using the above mathematical expressions (1), (2), and (3), respectively. The camera mapping module 1110 may use the yaw and pitch to construct a vector representing a viewing direction of the left eye (e.g., a left viewing direction) to the point in the panorama.

In some implementations, a matching camera for a pixel in a left panoramic image has a viewing direction to a point that corresponds to the pixel. The viewing direction of the matching camera is closer to the left viewing direction than other viewing directions of other camera modules 103 and virtual cameras to the same point in the panorama. For example, the viewing direction of the matching camera is more parallel to the left viewing direction than other viewing directions of other camera modules 103 and virtual cameras. Illustrations of a matching camera are illustrated with reference to Figures 17A-17C.

Similarly, the camera mapping module 1110 may construct a right camera map that identifies a corresponding matching camera for each pixel in a right panoramic image. For example, for a pixel in a right panoramic image that represents a point in a panorama, the right camera map may identify a matching camera that has a better view for the point in the panorama than other camera modules 103 and other virtual cameras. Thus, the right camera map may map pixels in a right panoramic image to matching cameras that have better views for the corresponding pixels.

In some implementations, the left and right camera maps may be pre- computed and stored to achieve a faster processing speed compared to an on-the-fly computation.

The video module 1112 may be software including routines for generating a stream of 3D video data configured to render 3D video when played back on a VR display device. The video module 1112 may be adapted for cooperation and communication with the processor 1135 and other components of the computing device 1100 via a signal line 1130. The stream of 3D video data may describe a stereoscopic panorama of a scene that may vary over time. The stream of 3D video data may include a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing. In some implementations, the video module 1112 receives raw video data describing image frames from the various camera modules 103 in the camera array 101. The video module 1112 identifies a location and timing associated with each of the camera modules 103 and synchronizes the image frames based on locations and timings of the camera modules 103. The video module 1112 synchronizes corresponding images frames that are captured by different camera modules 103 at the same time. In some implementations, the video module 1112 or another module in the aggregation system 131 may correct calibration errors in the synchronized image frames.

The video module 1112 may receive a left camera map and a right camera map from the camera mapping module 1110. Alternatively, the video module 1112 may retrieve the left and right camera maps from the storage device 1141 or the memory 1137. The video module 1112 may construct a stream of left panoramic images from the image frames captured by the camera modules 103 and virtual camera images of virtual cameras based on the left camera map. For example, the video module 1112 identifies matching cameras from the left camera map. The matching cameras may include matching camera modules 103 and matching virtual cameras. The video module 1112 constructs a first left panoramic image PIL,0 associated with a first particular time T=T0 by stitching together: (1) image frames that are captured by the matching camera modules 103 at the first particular time T=T0; and (2) virtual camera images of the matching virtual cameras associated with the first particular time T=T0. The video module 1112 constructs a second left panoramic image PIL,1 associated with a second particular time T=T1 by stitching: (1) image frames captured by the matching camera modules 103 at the second particular time T=T1; and (2) virtual camera images of the matching virtual cameras associated with the second particular time T=T1, and so on and so forth. The video module 1112 constructs the stream of left panoramic images to include the first left panoramic image PIL,0 associated with the first particular time T=T0, the second left panoramic image PIL,1 associated with the second particular time T=T1, and other left panoramic images.

Specifically, for a pixel in a left panoramic image PIL,i associated with a particular time T=Ti (i = 0, 1, 2, ...), the video module 1112: (1) identifies a matching camera from the left camera map (the matching camera including a matching camera module 103 or a matching virtual camera); and (2) configures the pixel in the left panoramic image PIL,i to be a corresponding pixel from an image of the matching camera associated with the particular time T=Ti (e.g., the image being an image frame captured by the matching camera module 103 at the particular time T=Ti or a virtual camera image of the matching virtual camera associated with the particular time T=Ti). The pixel in the left panoramic image PIL,i and the corresponding pixel in the image of the matching camera may correspond to the same point in the panorama. For example, for a pixel location in the left panoramic image PIL,i that corresponds to a point in the panorama, the video module 1112: (1) retrieves a pixel that also corresponds to the same point in the panorama from the image of the matching camera associated with the particular time T=Ti; and (2) places the pixel from the image of the matching camera into the pixel location of the left panoramic image PIL,i.

Similarly, the video module 1112 constructs a stream of right panoramic images from the image frames captured by the camera modules 103 and virtual camera images of virtual cameras based on the right camera map by performing operations similar to those described above with reference to the construction of the stream of left panoramic images. The description will not be repeated here.

The audio module 1114 may be software including routines for generating a stream of 3D audio data configured to render 3D audio when played back on an audio reproduction device. The audio module 1114 is communicatively coupled to the bus 1120 via a signal line 1113. The audio module 1114 may generate the 3D audio data based on the raw audio data received from the microphone array 107. In some implementations, the audio module 1114 may process the raw audio data to generate four-channel ambisonic audio tracks corresponding to the 3D video data generated by the video module 1112. The four-channel ambisonic audio tracks may provide a compelling 3D 360-degree audio experience to the user 134.

In some implementations, the four-channel audio tracks may be recorded in an "A" format by the microphone array 107 such as a Tetramic microphone. The audio module 1114 may transform the "A" format four-channel audio tracks to a "B" format that includes four signals: W, X, Y, and Z. The W signal may represent a pressure signal that corresponds to an omnidirectional microphone, and the X, Y, Z signals may correspond to directional sounds in front-back, left-right, and up- down directions, respectively. In some implementations, the "B" format signals may be played back in a number of modes including, but not limited to, mono, stereo, binaural, surround sound including 4 or more speakers, and any other modes. In some examples, an audio reproduction device may include a pair of headphones, and the binaural playback mode may be used for the sound playback in the pair of headphones. The audio module 1114 may convolve the "B" format channels with Head Related Transfer Functions (HRTFs) to produce binaural audio with a compelling 3D listening experience for the user 134.

In some implementations, the audio module 1114 generates 3D audio data that is configured to provide sound localization to be consistent with the user's head rotation. For example, the raw audio data is encoded with the directionality data that describes the directionality of the recorded sounds. The audio module 1114 may analyze the directionality data to produce 3D audio data that changes the sound reproduced during playback based on the rotation of the user's head orientation.

The stream combination module 1116 may be software including routines for combining a stream of 3D video data and a stream of 3D audio data to generate VR content. The stream combination module 1116 is communicatively coupled to the bus 1120 via a signal line 1131. The stream of 3D video data includes a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing.

The stream combination module 1116 may compress the stream of left panoramic images and the stream of right panoramic images to generate a stream of compressed 3D video data using video compression techniques. In some implementations, within each stream of the left or right panoramic images, the stream combination module 1116 may use redundant information from one frame to a next frame to reduce the size of the corresponding stream. For example, with reference to a first image frame (e.g., a reference frame), redundant information in the next image frames may be removed to reduce the size of the next image frames. This compression may be referred to as temporal or inter-frame compression within the same stream of left or right panoramic images.

Alternatively or additionally, the stream combination module 1116 may use one stream (either the stream of left panoramic images or the stream of right panoramic images) as a reference stream and may compress the other stream based on the reference stream. This compression may be referred to as inter-stream compression. For example, the stream combination module 1116 may use each left panoramic image as a reference frame for a corresponding right panoramic image and may compress the corresponding right panoramic image based on the referenced left panoramic image.

In some implementations, the stream combination module 1116 may encode the stream of 3D video data (or, compressed 3D video data) and 3D audio data to form a stream of VR content. For example, the stream combination module 1116 may compress the stream of 3D video data using h.264 and the stream of 3D audio data using advanced audio coding (AAC) to form a stream of VR content. In another example, the stream combination module 1116 may compress the stream of 3D video data and the stream of 3D audio data using a standard MPEG format to form a stream of VR content.

Figures 12A and 12B illustrate an example method 1200 for stitching image frames captured at a particular time to generate a left panoramic image and a right panoramic image according to some implementations. The method 1200 is described with respect to Figures 1 A and 11. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

Referring to Figure 12 A, the method 1200 may include the communication module 1102 receiving 1202 image frames that are captured by the camera modules 103 of the camera array 101 at a particular time. The communication module 1102 receives 1204 data describing configuration of the camera modules 103 in the camera array 101. For example, the communication module 1102 receives data describing positions and orientations of the camera modules 103. The virtual camera module 1106 determines different sets of neighboring camera modules 103 in the camera array 101. Each set of neighboring camera modules 103 may include two or more camera modules 103 that locate in proximity to each other in the camera array 101 and have an overlapping field of view.

For each set of neighboring camera modules, the disparity module 1104 determines 1206 a set of disparity maps related to the corresponding set of neighboring camera modules. The disparity module 1104 generates different sets of disparity maps for different sets of neighboring camera modules. For each set of neighboring camera modules, the virtual camera module 1106 determines 1208 one or more virtual cameras interpolated between neighboring camera modules from the corresponding set. The virtual camera module 1106 determines different virtual cameras for different sets of neighboring camera modules. For a virtual camera interpolated between a set of neighboring camera modules, the virtual camera module 1106 generates 1210 a virtual camera image for the virtual camera associated with the particular time by: interpolating image frames captured by the neighboring camera modules at the particular time based on (1) a set of disparity maps associated with the set of neighboring camera modules and (2) a position of the virtual camera. Similarly, the virtual camera module 1106 generates virtual camera images associated with the particular time for all the virtual cameras. An example method for generating a virtual camera image associated with a particular time for a virtual camera is described below with reference to Figures 13 A and 13B.

Referring to Figure 12B, the camera mapping module 1110 constructs 1212 a left camera map and a right camera map based on configurations of the camera modules 103 in the camera array 101 and positions of the virtual cameras. The video module 1112 constructs 1214, based on the left camera map, a left panoramic image associated with the particular time from (1) the image frames captured by the camera modules 103 at the particular time and (2) the virtual camera images of the virtual cameras associated with the particular time. The video module 1112 constructs 1216, based on the right camera map, a right panoramic image associated with the particular time from (1) the image frames captured by the camera modules 103 at the particular time and (2) the virtual camera images of the virtual cameras associated with the particular time.

One skilled in the art will appreciate that, for this and other processes and methods disclosed herein, the functions performed in the processes and methods may be implemented in differing order. Furthermore, the outlined steps and operations are only provided as examples, and some of the steps and operations may be optional, combined into fewer steps and operations, or expanded into additional steps and operations without detracting from the essence of the disclosed implementations.

Figures 13A and 13B illustrate an example method 1300 for generating a virtual camera image associated with a particular time for a virtual camera located between two neighboring camera modules according to some implementations. The method 1300 is described with respect to Figures 1A and 11. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

Referring to Figure 13 A, the disparity module 1104 determines 1302 an overlapping field of view between a first neighboring camera module and a second neighboring camera module. The disparity module 1104 determines 1304 a first image frame captured by the first neighboring camera module and a second image frame captured by the second neighboring camera module at the particular time. The disparity module 1104 determines 1306 a first sub-image (e.g., Image AB) from the first image and a second sub-image (e.g., Image BA) from the second image, with the first sub-image and the second sub-image overlapping with each other in the overlapping field of view. The disparity module 1104 generates 1308 a first disparity map that maps disparity of pixels from the first sub-image to the second sub-image. The disparity module 1104 generates 1310 a second disparity map that maps disparity of pixels from the second sub-image to the first sub-image. An example method for generating a disparity map is described below with reference to Figures 14A and 14B.

Referring to Figure 13B, the virtual camera module 1106 determines 1312 a position of a virtual camera located between the first neighboring camera module and the second neighboring camera module. The virtual camera module 1106 generates 1314 a first shifted sub-image from the first sub-image based on the first disparity map and the position of the virtual camera. The virtual camera module 1106 generates 1316 a second shifted sub-image from the second sub-image based on the second disparity map and the position of the virtual camera. The virtual camera module 1106 combines 1318 the first shifted sub-image and the second shifted sub-image to generate a virtual camera image associated with the particular time for the virtual camera.

Figures 14A and 14B illustrate an example method 1400 for estimating a disparity map that maps disparity of pixels from a first sub-image of a first neighboring camera module to a second sub-image of a second neighboring camera module according to some implementations. The method 1400 is described with respect to Figures 1A and 11. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

Referring to Figure 14 A, the disparity module 1104 selects 1402 a pixel location in an overlapping field of view between the first neighboring camera module and the second neighboring camera module. The disparity module 1104 selects 1403 a disparity value. The similarity score module 1108 determines 1404 a similarity score between (1) a pixel of the first sub-image at the selected pixel location and (2) a second pixel of the second sub-image at a second pixel location, where the second pixel location has a distance to the selected pixel location by the selected disparity value. The determined similarity score is associated with the selected disparity value. An example method for determining similarity scores is described below with reference to Figures 15A and 15B.

The disparity module 1104 determines 1406 whether there is at least an additional disparity value to select. If there is at least an additional disparity value to select, the method 1400 moves to block 1403. Otherwise, the method 1400 moves to block 1408. As a result, different similarity scores associated with different disparity values are generated for the selected pixel location. The disparity module 1104 determines 1408 a highest similarity score from the similarity scores that correspond to different disparity values. The disparity module 1104 determines 1410 a disparity value associated with the highest similarity score.

Referring to Figure 14B, the disparity module 1104 assigns 1412 the selected pixel location with the disparity value associated with the highest similarity score. The disparity module 1104 determines 1416 whether there is at least an additional pixel location in the overlapping field of view to process. If there is at least an additional pixel location to process, the method 1400 moves to block 1402. Otherwise, the method 1400 moves to block 1418. At block 1418, the disparity module 1104 generates a disparity map that includes disparity values associated with corresponding highest similarity scores for pixel locations in the disparity map.

Figures 15A and 15B illustrate an example method 1500 for determining similarity scores associated with a disparity value for pixels along an epipolar line that connects projection centers of two neighboring camera modules according to some implementations. The method 1500 is described with respect to Figures 1A and 11. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired implementation.

Referring to Figure 15 A, the similarity score module 1108 generates 1502 metric values for first pixel locations along an epipolar line by comparing (1) first pixels at the first pixel locations from a first sub-image to (2) corresponding second pixels at second pixel locations along the epipolar line from a second sub-image, respectively. A corresponding second pixel location has a distance to a corresponding first pixel location by the disparity value, and a corresponding metric value is generated for the pair of the corresponding first pixel location and the corresponding second pixel location. The first sub-image is captured by a first neighboring camera module, the second sub-image is captured by a second neighboring camera module, and the first sub-image overlaps with the second sub- image in an overlapping field of view of the first and second neighboring camera modules.

Initially, the similarity score module 1108 sets 1504 similarity scores for the first pixel locations to be zeros. The similarity score module 1108 selects 1506 a metric threshold that is used to determine runs of metric values. The similarity score module 1108 determines 1508 runs for the first pixel locations based on the metric threshold and the metric values. The similarity score module 1108 determines 1510 preliminary scores for the first pixel locations based on corresponding runs of the first pixel locations and the metric threshold. For example, a preliminary score for a corresponding first pixel location may be equal to a corresponding run of the corresponding first pixel location divided by the metric threshold.

Referring to Figure 15B, the similarity score module 1108 determines 1512 whether there are one or more third pixel locations from the first pixel locations having one or more similarity scores lower than one or more corresponding preliminary scores. If the one or more third pixel locations have one or more similarity scores lower than corresponding preliminary scores, the similarity score module 1108 configures 1514 the one or more similarity scores for the one or more third pixel locations to be the one or more corresponding preliminary scores. Otherwise, the method 1500 moves to block 1516.

At block 1516, the similarity score module 1108 determines whether there is at least an additional metric threshold to select. If there is at least an additional metric threshold to select, the method moves to block 1506. Otherwise, the method 1500 moves to block 1518. At block 1518, the similarity score module 1108 outputs similarity scores for the first pixels at the first pixel locations along the epipolar line.

Figure 16A illustrates an example process 1600 of generating a left panoramic image and a right panoramic image from (1) multiple image frames that are captured by multiple camera modules at a particular time and (2) virtual camera images of virtual cameras associated with the particular time according to some implementations. At a particular time T=Ti (i = 0, 1, 2, ...), the camera module 103 a captures an image frame 1602a, the camera module 103b captures an image frame 1602b, and the camera module 103n captures an image frame 1602n. A virtual camera image 1603 of a virtual camera located between the camera module 103a and the camera module 103b is generated. The virtual camera image 1603 is associated with the particular time T=Ti. Other virtual camera images of other virtual cameras associated with the particular time T=Ti may also be generated. The video module 1112 receives the image frames 1602a, 1602b... l602n. The video module 1112 stitches the image frames 1602a, 1602b... l602n, the virtual camera image 1603, and other virtual camera images to generate (1) a left panoramic image 1608 associated with the particular time T=Ti based on a left camera map 1604 and (2) a right panoramic image 1610 associated with the particular time T=Ti based on a right camera map 1606.

Figure 16B is a graphic representation 1630 that illustrates an example panoramic image according to some implementations. The panoramic image has a first axis "yaw" which represents rotation in a horizontal plane and a second axis "pitch" which represents up and down rotation in a vertical direction. The panoramic image covers an entire 360-degree sphere of a scene panorama. A pixel at a position [yaw, pitch] in the panoramic image represents a point in a panorama viewed at a head rotation with a "yaw" value and a "pitch" value. Thus, the panoramic image includes a blended view from various head rotations rather than a single view of the scene from a single head position.

Figure 16C is a graphic representation 1650 that illustrates an example camera map according to some implementations. The example camera map maps first pixels in camera sections 1652a and 1652b of a panoramic image to a first camera module 103, second pixels in a camera section 1654 to a virtual camera, and third pixels in a camera section 1655 to a second camera module 103. The first and second camera modules 103 are neighbors in the camera array 101, and the virtual camera is interpolated between the first camera module 103 and the second camera module 103.

For the first pixels of the panoramic image within the camera sections 1652a and 1652b, values for the first pixels may be configured to be corresponding pixel values in a first image frame captured by the first camera module 103. Similarly, for the second pixels of the panoramic image within the camera section 1654, values for the second pixels may be configured to be corresponding pixel values in a virtual camera image of the virtual camera. The virtual camera image may be generated based on the first image frame of the first camera module 103 and a second image frame of the second camera module 103. For the third pixels of the panoramic image within the camera section 1655, values for the third pixels may be configured to be corresponding pixel values in the second image frame captured by the second camera module 103. In this example, the panoramic image is stitched using part of the first image frame from the first camera module 103, part of the virtual camera image of the virtual camera, part of the second image frame from the second camera module 103, and part of other images from other camera modules 103 or virtual cameras.

Figures 17A-17C are graphic representations 1700, 1730, and 1760 that illustrate selection of matching cameras for a point in a panorama for construction of a left and a right camera maps according to some implementations. Referring to Figure 17A, the camera array 101 includes camera modules 103a, 103b, 103c and other camera modules mounted on a spherical housing. No virtual cameras are interpolated between the camera modules. Assume that a point 1703 corresponds to a head rotation position with yaw=650 and pitch =00. An interocular distance 1702 is illustrated between a left eye position 1704 and a right eye position 1706. Since pitch =00, the interocular distance 1702 is at its maximum value.

A left viewing direction 1712 from the left eye position 1704 to the point 1703 and a right viewing direction 1714 from the right eye position 1706 to the point 1703 are illustrated in Figure 17A. The camera modules 103a and 103b have viewing directions 1710 and 1716 to the point 1703, respectively.

Since the viewing direction 1710 of the camera module 103 a is more parallel to the left viewing direction 1712 compared to the viewing direction 1716 and other viewing directions (e.g., an angle between the viewing direction 1710 and the left viewing direction 1712 is smaller than angles between the left viewing direction 1712 and other viewing directions), the camera module 103a may be selected as a matching camera that has a better view for the point 1703 than other camera modules for constructing a left camera map. Thus, a pixel of a left panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in an image frame captured by the camera module 103 a.

Since the viewing direction 1716 of the camera module 103b is more parallel to the right viewing direction 1714 compared to the viewing direction 1710 and other viewing directions, the camera module 103b may be selected as a matching camera that has a better view for the point 1703 than other camera modules for constructing a right camera map. Thus, a pixel of a right panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in an image frame captured by the camera module 103b.

Referring to Figure 17B, virtual cameras 1742 and 1744 are interpolated between the camera modules 103 a and 103b. The virtual camera 1742 has a viewing direction 1749 to the point 1703, and the virtual camera 1744 has a viewing direction 1746 to the point 1703.

Since the viewing direction 1749 of the virtual camera 1742 is more parallel to the left viewing direction 1712 compared to the viewing directions 1710, 1746, 1716 and other viewing directions, the virtual camera 1742 may be selected as a matching camera that has a better view for the point 1703 than other camera modules or virtual cameras for constructing the left camera map. Thus, a pixel of a left panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in a virtual camera image of the virtual camera 1742.

Since the viewing direction 1746 of the virtual camera 1744 is more parallel to the right viewing direction 1714 compared to the viewing directions 1710, 1749, 1716 and other viewing directions, the virtual camera 1744 may be selected as a matching camera that has a better view for the point 1703 than other camera modules or virtual cameras for constructing the right camera map. Thus, a pixel of a right panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in a virtual camera image of the virtual camera 1744.

Referring to Figure 17C, assume that there are numerous virtual cameras interpolated between neighboring camera modules, which simulates to capture a panorama with continuous viewpoints. A virtual camera 1762 that has the same viewing direction as the left viewing direction 1712 may be selected as a matching camera for the point 1703 for constructing the left camera map. Thus, a pixel of a left panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in a virtual camera image of the virtual camera 1762. A virtual camera 1764 that has the same viewing direction as the right viewing direction 1714 may be selected as a matching camera for the point 1703 for constructing the right camera map. Thus, a pixel of a right panoramic image that corresponds to the point 1703 may have a pixel value equal to that of a corresponding pixel in a virtual camera image of the virtual camera 1764.

Figure 18 is a graphic representation 1800 that illustrates example disparity along an epipolar direction of an epipolar line according to some implementations. Camera A has a first image plane and Camera B has a second image plane. The first image plane and the second image plane are coplanar image planes. If the first and second image planes are not coplanar, images of Camera A and Camera B may be transformed to be coplanar. A pinhole location 1802 for Camera A and a pinhole location 1806 for Camera B are illustrated in Figure 18. A point 1804 in a panorama is captured by Camera A as a point 1808 in its image plane with respect to the pinhole location 1802. The point 1804 is also captured by Camera B as a point 1814 in its image plane with respect to the pinhole location 1806. The point 1814 is shifted to the right with respect to the pinhole location 1806. A virtual camera is added at a center point between Camera A and Camera B. The virtual camera is associated with a pinhole location 1803 which is directly above the center of the virtual camera's image plane. The pinhole location 1802 is also directly above the center of Camera A's image plane, and the pinhole location 1806 is directly above the center of Camera B's image plane. The point 1804 may be captured by the virtual camera as a point 1809 in its image plane. Since (1) the pinhole location 1803 of the virtual camera is halfway between Camera A's pinhole location 1802 and Camera B's pinhole location 1806 and (2) the image plane of the virtual camera is also halfway between Camera A's image plane and Camera B's image plane, a position of the point 1809 is halfway between positions of the points 1808 and 1814.

Figure 19 is a graphic representation 1900 that illustrates interpolation of virtual cameras between real cameras and virtual cameras according to some implementations. Three real cameras (Camera 1, Camera 2, Camera 3) are illustrated in a left graph of Figure 19. Views of a scene along lines 1902, 1904, and 1906 may be interpolated by two of the real cameras, respectively. The lines 1902, 1904, and 1906 each connect two of the three real cameras. Virtual cameras may be interpolated along the lines 1902, 1904, and 1906. For example, a virtual camera 4 may be interpolated along the line 1902 as illustrated in a right graph in Figure 19. Furthermore, the virtual camera 4 may also be used to determine other virtual cameras inside a triangle formed by the three real cameras. For example, referring to the right graph in Figure 19, a virtual camera 5 may be interpolated along a line 1908 between the virtual camera 4 and Camera 3. Similarly, other virtual cameras may be interpolated between two real cameras, between a real camera and a virtual camera, or between two virtual cameras.

Referring now to Figure 20, an example of the content system 171 is illustrated in accordance with at least some implementations described herein. Figure 20 is a block diagram of a computing device 2000 that includes the content system 171, a memory 2037, a processor 2035, a storage device 2041, and a communication unit 2045. In the illustrated embodiment, the components of the computing device 2000 are communicatively coupled by a bus 2020. In some implementations, the computing device 2000 may be a personal computer, smartphone, tablet computer, set top box, or any other processor-based computing device. The computing device 2000 may be one of the client device 127, the server 129, and another device in the system 199 of Figure IB.

The processor 2035 may include an arithmetic logic unit, a microprocessor, a general-purpose controller, or some other processor array to perform computations and provide electronic display signals to a display device. The processor 2035 is coupled to the bus 2020 for communication with the other components via a signal line 2038. The processor 2035 may process data signals and may include various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although Figure 20 includes a single processor 2035, multiple processors may be included. Other processors, operating systems, sensors, displays, and physical configurations may be possible.

The memory 2037 includes a non-transitory memory that stores data for providing the functionality described herein. The memory 2037 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the memory 2037 also includes a non- volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The memory 2037 may store the code, routines, and data for the content system 171 to provide its functionality. The memory 2037 is coupled to the bus 2020 via a signal line 2044.

The communication unit 2045 may transmit data to any of the entities of the system 199 depicted in Figure IB. Similarly, the communication unit 2045 may receive data from any of the entities of the system 199 depicted in Figure IB. The communication unit 2045 may include one or more Ethernet switches for receiving the raw video data and the raw audio data from the connection hub 123. The communication unit 2045 is coupled to the bus 2020 via a signal line 2046. In some implementations, the communication unit 2045 includes a port for direct physical connection to a network, such as the network 105 of Figure IB, or to another communication channel. For example, the communication unit 2045 may include a port such as a USB, SD, RJ45, or similar port for wired communication with another computing device. In some implementations, the communication unit 2045 includes a wireless transceiver for exchanging data with another computing device or other communication channels using one or more wireless communication methods, including IEEE 802.11 , IEEE 802.16, BLUETOOTH®, or another suitable wireless communication method.

In some implementations, the communication unit 2045 includes a cellular communications transceiver for sending and receiving data over a cellular communications network including via short messaging service (SMS), multimedia messaging service (MMS), hypertext transfer protocol (HTTP), direct data connection, WAP, e-mail, or another suitable type of electronic communication. In some implementations, the communication unit 2045 includes a wired port and a wireless transceiver. The communication unit 2045 also provides other conventional connections to a network for distribution of data using standard network protocols including TCP/IP, HTTP, HTTPS, and SMTP, etc.

The storage device 2041 may be a non-transitory storage medium that stores data for providing the functionality described herein. The storage device 2041 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory, or some other memory devices. In some implementations, the storage device 2041 also includes a non- volatile memory or similar permanent storage device and media including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device for storing information on a more permanent basis. The storage device 2041 is communicatively coupled to the bus 2020 via a signal line 2042.

In the embodiment illustrated in Figure 20, the content system 171 includes a communication module 2002, a calibration module 2004, a camera mapping module 2006, a video module 2008, a correction module 2010, an audio module 2012, a stream combination module 2014, an advertising module 2016, a social module 2018, and a content module 2056. These modules of the content system 171 are communicatively coupled to each other via the bus 2020.

In some implementations, each module of the content system 171 (e.g., modules 2002, 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018, or 2056) may include a respective set of instructions executable by the processor 2035 to provide its respective functionality described below. In some implementations, each module of the content system 171 may be stored in the memory 2037 of the computing device 2000 and may be accessible and executable by the processor 2035. Each module of the content system 171 may be adapted for cooperation and communication with the processor 2035 and other components of the computing device 2000.

The communication module 2002 may be software including routines for handling communications between the content system 171 and other components of the computing device 2000. The communication module 2002 may be communicatively coupled to the bus 2020 via a signal line 2022. The communication module 2002 sends and receives data, via the communication unit 2045, to and from one or more of the entities of the system 199 depicted in Figure IB. For example, the communication module 2002 may receive raw video data from the connection hub 123 via the communication unit 2045 and may forward the raw video data to the video module 2008. In another example, the communication module 2002 may receive virtual reality content from the stream combination module 2014 and may send the virtual reality content to the viewing system 133 via the communication unit 2045.

In some implementations, the communication module 2002 receives data from components of the content system 171 and stores the data in the memory 2037 or the storage device 2041. For example, the communication module 2002 receives virtual reality content from the stream combination module 2014 and stores the virtual reality content in the memory 2037 or the storage device 2041. In some implementations, the communication module 2002 retrieves data from the memory 2037 or the storage device 2041 and sends the data to one or more appropriate components of the content system 171. Alternatively or additionally, the communication module 2002 may also handle communications between components of the content system 171.

The calibration module 2004 may be software including routines for calibrating the camera modules 103 in the camera array 101. The calibration module 2004 may be adapted for cooperation and communication with the processor 2035 and other components of the computing device 2000 via a signal line 2024. The calibration module 2004 may perform operations and provide functionality similar to those of the calibration module 204 of Figure 2, and the description will not be repeated herein.

The camera mapping module 2006 may be software including routines for constructing a left camera map and a right camera map. The camera mapping module 2006 may be adapted for cooperation and communication with the processor 2035 and other components of the computing device 2000 via a signal line 2026. The camera mapping module 2006 may perform operations and provide functionality similar to those of the camera mapping module 206 of Figure 2 or the camera mapping module 1110 of Figure 11, and the description will not be repeated herein.

The video module 2008 may be software including routines for generating a stream of 3D video data configured to render 3D video when played back on a virtual reality display device. The video module 2008 may be adapted for cooperation and communication with the processor 2035 and other components of the computing device 2000 via a signal line 2080. The stream of 3D video data may describe a stereoscopic panorama of a scene that may vary over time. The stream of 3D video data may include a stream of left panoramic images for left eye viewing and a stream of right panoramic images for right eye viewing. The video module 2008 may perform operations and provide functionality similar to those of the video module 208 of Figure 2 or the video module 1112 of Figure 11, and the description will not be repeated herein.

The correction module 2010 may be software including routines for correcting aberrations in image frames or panoramic images. The correction module 2010 is communicatively coupled to the bus 2020 via a signal line 2028. The correction module 2010 may perform operations and provide functionality similar to those of the correction module 210 of Figure 2, and the description will not be repeated herein.

The audio module 2012 may be software including routines for generating a stream of 3D audio data configured to render 3D audio when played back on an audio reproduction device. The audio module 2012 is communicatively coupled to the bus 2020 via a signal line 2030. The audio module 2012 may perform operations and provide functionality similar to those of the audio module 212 of Figure 2, and the description will not be repeated herein.

The stream combination module 2014 may be software including routines for combining a stream of 3D video data and a stream of 3D audio data to generate virtual reality content. The stream combination module 2014 is communicatively coupled to the bus 2020 via a signal line 2029. The stream combination module 2014 may perform operations and provide functionality similar to those of the stream combination module 214 of Figure 2, and the description will not be repeated herein.

The virtual reality content may be constructed by the stream combination module 2014 using any combination of the stream of 3D video data (or the stream of compressed 3D video data), the stream of 3D audio data (or the stream of compressed 3D audio data), content data from the content server 139, advertisement data from the ad server 141, social data from the social network server 135, and any other suitable virtual reality content.

In some implementations, the virtual reality content may be packaged in a container format such as MP4, WebM, VP8, and any other suitable format. The virtual reality content may be stored as a file on the client device 127 or the server 129 and may be streamed to the viewing system 133 for the user 134 from the client device 127 or the server 129. Alternatively, the virtual reality content may be stored on a digital versatile disc (DVD), a flash memory, or another type of storage devices.

The advertising module 2016 may be software including routines for adding advertisements to the virtual reality content generated by the content system 171. For example, the ad server 141 stores and transmits advertisement data that describes one or more advertisements. The advertising module 2016 incorporates the advertisements in the virtual reality content. For example, the advertisement includes an image, audio track, or video, and the advertising module 2016 incorporates the advertisement into the virtual reality content. The advertisement may be a video that is stitched in the virtual reality content. In some implementations, the advertisement includes an overlay that the advertising module 2016 incorporates in the virtual reality content. For example, the overlay includes a watermark. The watermark may be an advertisement for a product or service. In some implementations, the advertisements from the ad server 141 include ad data describing a location for displaying the advertisement. In this case, the advertising module 2016 may display the advertisements according to the ad data. The advertising module 2016 may be communicatively coupled to the bus 2020 via a signal line 2017.

In some implementations, the advertisement data includes data describing how the advertisement may be incorporated in the virtual reality content. For example, the advertisement data describes where the advertisement may be included in the virtual reality content. The advertising module 2016 may analyze the advertisement data and incorporate the advertisement in the virtual reality content according to the advertisement data. In other implementations, the user 134 provides user input to the advertising module 2016 and the user input specifies a user preference describing how the advertisement may be incorporated in the virtual reality content. The advertising module 2016 may analyze the user input and incorporate the advertisement based at least in part on the user input.

The advertisement may take many forms. For example, the advertisement may be a logo for a company or product placement of a graphical object that a user can interact with. In some implementations, the advertising module 2016 inserts the advertisement into an area where users commonly look. In other implementations, the advertising module 2016 inserts the advertisement into less commonly used areas. For example, Figure 21 A is an illustration 2100 of a user 2101 with virtual reality content displayed in a top panel 2102 and a bottom panel 2103. The user 2101 is able to view the virtual reality content in the top panel 2102 by moving his or her head upwards. The user 2101 is able to view the virtual reality content in the bottom panel 2103 by moving his or her head downwards.

Figure 21B is an illustration 2105 of how the advertising module 2016 might incorporate an advertisement into one of the panels in Figure 21 A. For example, Figure 21B illustrates the top panel 2102. The virtual reality content is of a forest. The user can see the edges of trees 2106 in the top panel 2102. In the center of the top panel 2102, the advertising module 2016 inserts an advertisement 2108 for San Francisco tours. The top panel 2102 may be a good place for advertisements because the advertisement does not overlap with virtual reality content that may be important for the user to view. In some implementations, the virtual reality content includes a stitching aberration caused by errors in generating the virtual reality content. An element of the content system 171 such as the correction module 2010 analyzes the virtual reality content to identify the stitching aberration. The correction module 2010 transmits data that describes the location of the stitching aberration in the virtual reality content to the advertising module 2016. The advertising module 2016 incorporates an advertisement at the location of the stitching aberration in the virtual reality content so that the stitching aberration is not visible to the user 134 upon playback of the virtual reality content.

In one embodiment, the advertising module 2016 determines where to place advertisements based on user gaze. For example, the advertising module 2016 receives information about how the user interacts with the virtual reality content from the viewing system 133. The information may include a location within each frame or a series of frames where the user looks. For example, the user spends five seconds staring at a table within the virtual reality content. In another example, the advertising module 2016 determines the location of user gaze based on where users typically look. For example, users may spend 80% of the time looking straight ahead.

The advertising module 2016 may determine a cost associated with advertisements to charge advertisers. For example, the advertising module 2016 charges advertisers for displaying advertisements from the ad server 141. In some implementations, the advertising module 2016 determines a cost for displaying the advertisement based on the location of the advertisement. The cost may be based on where the users look within a particular piece of virtual reality content, where users general look at virtual reality content, personalized for each user, etc. In some implementations, the advertising module 2016 determines a cost associated with interacting with advertisements. For example, similar to an online magazine that charges more money when a user clicks on an advertisement (click through), the advertising module 2016 may charge more money when the advertising module 2016 determines based on user gaze that a user looked at a particular advertisement. In some implementations, the advertising module 2016 generates links for the advertisements such that a user may select the advertisement to be able to view virtual reality content about the advertisement, such as a virtual reality rendering of the advertiser's webpage. The advertising module 2016 may charge more for this action since it is also similar to a click through.

The advertising module 2016 may generate a profile for users based on the user's gaze at different advertisements. The advertising module 2016 may identify a category for the advertisement and determine that the user is interested in the category or the advertiser associated with the category. In some implementations, the advertising module 2016 determines a cost for displaying advertisements based on the user profile. For example, the advertising module 2016 determines that a user is interested in potato chips, or is interested in a specific manufacturer of potato chips. As a result, the advertising module 2016 charges more for displaying advertisements for potato chips to that user than other users without a demonstrated interest in potato chips.

The social module 2018 may be software including routines for enabling the viewing system 133 or the content system 171 to interact with the social network application 137. For example, the social module 2018 may generate social data describing the user's 134 interaction with the viewing system 133 or the content system 171. The interaction may be a status update for the user 134. The user 134 may approve the social data so that social data describing the user 134 will not be published without the user's 134 approval. In one embodiment, the social module 2018 transmits the social data to the communication unit 2045 and the communication unit 2045 transmits the social data to the social network application 137. In another embodiment, the social module 2018 and social network application 137 are the same. The social module 2018 may be communicatively coupled to the bus 2020 via a signal line 2019.

In some implementations, the social network application 137 generates a social graph that connects users based on common features. For example, users are connected based on a friendship, an interest in a common subject, one user follows posts published by another user, etc. In one embodiment, the social module 2018 includes routines for enabling the user 134 and his or her connections via the social graph to consume virtual reality content contemporaneously. For example, the content system 171 is communicatively coupled to two or more viewing systems 133. A first user 134 is connected to a second user 134 in a social graph. The first user 134 interacts with a first viewing system 133 and the second user 134 interacts with a second viewing system 133. The first user 134 and the second user 134 may consume virtual reality content provided by the content system 171 using their respective viewing systems 133 simultaneously. In some implementations, the consumption of the virtual reality content may be integrated with the social network.

In some implementations, the social module 2018 transmits information about user interactions with virtual reality content to the social network application 137. For example, the social module 2018 may determine subject matter for an entire video or frames within the video. The social module 2018 may transmit information about the subject matter and the user to the social network application 137, which generates a social graph based on shared subject matter. In some implementations, the social module 2018 receives information about how the user reacts to advertisements from the viewing system 133 and transmits the information to the social network application 137 for incorporation into the social graph. For example, the viewing system 133 transmits information about how the user's gaze indicates that the user is interested in advertisements about home decorating. The social module 2018 transmits the user's interest in home decorating to the social network application 137, which updates the user's profile with the interest and identifies other users that the user could connect with that are also interested in home decorating. In another embodiment, the social network application 137 uses the information about advertisements to provide advertisements within the social network to the user.

In some implementations, the social network application 137 suggests connections between users based on their interactions with the virtual reality content. For example, if both a first user and a second user access the same virtual reality content, the social network application 137 may suggest that they become friends. In another example, if two users access virtual reality content with the same subject matter, the social network application 137 suggests that they become connected. In yet another example, where a first user on the social network is an expert in a type of subject matter and the second user views a threshold number of pieces of virtual reality content with the same subject matter, the social network application 137 suggests that the second user follow the first user in the social network. The social network application 137 may suggest that the user join groups in the social network based on the user's consumption of virtual reality content. For example, for a user that views virtual reality content that involves science fiction adventures with other users, the social network application 137 suggests a group in the social network about science fiction roleplaying.

In some implementations, the social module 2018 transmits information about user interactions to the social network application 137 that the social network application 137 uses for posting updates about the user. The update may include information about the type of user interaction that is occurring, information about the virtual reality content, and a way to access the virtual reality content. The social network application 137 may post an update about a first user to other users in the social network that are connected to the first user via a social graph. For example, the update is viewed by friends of the first user, friends of friends of the first user, or a subset of connections of the first user including only close friends. The social network application 137 may also post the update to other users that have viewed the same virtual reality content or demonstrated an interest in subject matter that is part of the virtual reality content.

In some implementations, the social network application 137 determines subject matter associated with the virtual reality content and determines other users that are interested in the subject matter. For example, the social network application 137 determines that users are expressly interested in the subject matter because it is listed as part of a user profile that they created during registration. In another example, the social network application 137 uses implicit activities to determine interest in the subject matter, such as a user that watches a predetermined number of videos with the subject matter or posts a predetermined number of articles about the subject matter. The social network application 137 may limit social network updates about user interactions with virtual reality content to other users that are interested in the same subject matter.

In some implementations, the social network application 137 posts updates as long as the virtual reality content is not sensitive. For example, the social network application 137 does not post updates where the virtual reality content is pornography. In some implementations, the social network application 137 provides users with user preferences about the type of subject matter that cannot be part of the updates. For example, where the social network is for business connections, the social network application 137 does not post updates about users consuming virtual reality content involving celebrities.

Figure 21C is an example illustration 2110 of a profile page of social network updates for a user. In this example the user Jane Doe has had user interactions with virtual reality content for Virtual World I and Treasure Palace. A first update 2111 of virtual reality content includes identification of the user, a join button 2112, and links for approving, disapproving, or commenting on the update. If a user clicks the join button 2112, in one embodiment, the social network application 137 instructs the content system 171 to launch. The approval link can take many forms, such as like, +1, thumbs up, etc. The disapproval link can take many forms, such as -1, dislike, thumbs down, etc.

A second update 2113 of virtual reality content includes a status update about the user's progress in a virtual reality game. In this example, another user may be able to provide in-game rewards to the user by selecting a reward Jane button 2114. For example, selecting the button could cause the social network application 137 to instruct the content system 171 to provide the user with an additional life, credits for purchasing objects in the same, time, etc.

The content module 2056 may be software including routines for enabling the content system 171 to receive content from the content server 139 and, in some implementations, provide analysis of virtual reality content. For example, the content server 139 stores content such as videos, images, music, video games, or any other VR content suitable for playback by the viewing system 133. The content module 2056 may be communicatively coupled to the bus 2020 via a signal line 2057.

In some implementations, the content server 139 is communicatively coupled to a memory that stores content data. The content data includes video data or audio data. For example, since a video may include a series of images synchronized with a corresponding audio track, a video stored on the content sever 139 has a video element and an audio element. The video element is described by the video data and the audio element is described by the audio data. In this example, the video data and the audio data are included in the content data transmitted by the content server 139. The content system 171 receives the content data and proceeds to generate virtual reality content for the viewing system 133 based at least in part on the video data and the audio data included in the content data. Similar examples are possible for images, music, video games, or any other content hosted by the content server 139.

In some implementations, the content module 2056 provides analysis of the virtual reality content. For example, the content module 2056 may receive information about the location of all users' gazes and aggregate the information. The content module 2056 may generate a heat map where different colors correspond to a number of users that looked at a particular location in the image. For example, where the image is a room in a kitchen, the heat map illustrates that most users looked at the kitchen table and appliances and fewer users looked at the wall.

The content module 2056 may use analytics data to generate playlists of virtual reality content. For example, the content module 2056 may determine the popularity of different virtual reality experiences. The popularity may be based on a number of users that access the virtual reality content, user ratings after a user has interacted with the virtual reality content, etc. The playlists may be topic based, such as the 10 best depictions of Paris, France or may be based on overall popularity, such as the 50 best virtual reality content available. In some implementations, the content module 2056 generates playlists from people that are experts in subject matter. For example, the content module 2056 generates a playlist of the best cooking videos as rated (or created) by well-known chefs. In another example, the content module 2056 accepts playlists created by experts, such as the best virtual reality content about technology that was submitted by an owner of a billion-dollar technology company.

In some implementations, the content module 2056 manages gamification of the virtual reality content. The content module 2056 may track user interactions with the virtual reality content and provides rewards for achieving a threshold amount of user interactions. For example, the content module 2056 rewards a user with new virtual reality content when the user identifies five objects in the game. The content module 2056 may also provide clues for how to find the objects.

In some implementations, the content module 2056 generates links within virtual reality content for users to move from one virtual reality experience to another. Figure 21D illustrates 2120 an example of virtual reality content 2121 where the user is experiencing walking around and approaching a road with cars 2122 on the road. In the upper right-hand corner is a linked image 2123 that the user could select to access virtual reality content of a house. The user may use a peripheral device, such as a glove to reach out and touch the linked image 2123. In some implementations, the content system 171 recognizes a particular motion for accessing the linked image 2123, such as making a tap with an index finger similar to using a mouse to click on an object. The user may also access a peripheral device, such as a mouse, to position a cursor on the screen to select the linked image 2123.

Figures 22A-22C illustrate an example method 2200 for aggregating image frames and audio data to generate virtual reality content according to some implementations. The method 2200 is described with respect to Figures IB and 20. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired embodiment.

Referring to Figure 22A, the calibration module 2004 calibrates 2202 the camera modules 103 in the camera array 101. The communication module 2002 receives 2204 raw video data describing image frames from the camera modules 103. The communication module 2002 receives 2206 raw audio data from the microphone array 107. The video module 2008 identifies 2208 a location and timing associated with each of the camera modules 103. The video module 2008 synchronizes 2210 the images frames based on locations and timings associated with the camera modules 103. The camera mapping module 2006 constructs 2212 a left camera map and a right camera map. The left camera map identifies matching camera modules 103 for pixels in a left panoramic image. For example, for a pixel in a left panoramic image that represents a point in a panorama, the left camera map identifies a matching camera module 103 that has a better view to the point than other camera modules 103. Similarly, the right camera map identifies matching camera modules 103 for pixels in a right panoramic image.

Referring to Figure 22B, the video module 2008 generates 2214, based on the left camera map, a stream of left panoramic images from the image frames. For example, the video module 2008 identifies matching camera modules 103 for pixels in left panoramic images based on the left camera map. For a particular time, the video module 2008 stitches image frames synchronized at the particular time from the corresponding matching camera modules 103 to form a left panoramic image for the particular time frame. The correction module 2010 corrects 2216 color deficiencies in the left panoramic images. The correction module 2010 corrects 2218 stitching errors in the left panoramic images.

The video module 2008 generates 2220, based on the right camera map, a stream of right panoramic images from the image frames. For example, the video module 2008 identifies matching camera modules 103 for pixels in right panoramic images based on the right camera map. For a particular time, the video module 108 stitches image frames synchronized at the particular time from the corresponding matching camera modules 103 to form a right panoramic image for the particular time. The correction module 2010 corrects 2222 color deficiencies in the right panoramic images. The correction module 2010 corrects 2224 stitching errors in the right panoramic images.

Referring to Figure 22C, the stream combination module 2014 compresses

2226 the stream of left panoramic images and the stream of right panoramic images to generate a compressed stream of 3D video data. The audio module 2012 generates 2228 a stream of 3D audio data from the raw audio data. The stream combination module 2014 generates 2230 VR content that includes the compressed stream of 3D video data and the stream of 3D audio data. In some implementations, the stream combination module 2014 may also compress the stream of 3D audio data to form a compressed stream of 3D audio data, and the VR content may include the compressed stream of 3D video data and the compressed stream of 3D audio data.

Figure 23 illustrates an example method 2300 for generating advertisements in a virtual reality system. The method 2300 is described with respect to Figures IB and 20. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired embodiment.

The stream combination module 2014 generates 2302 virtual reality content that includes a stream of three-dimensional video data and a stream of three- dimensional audio data. The stream combination module 2014 provides 2304 the virtual reality content to a user. The advertising module 2016 detects 2306 a location of the user's gaze at the virtual reality content. For example, the advertising module 2016 receives user gaze information from the viewing system 133 or the advertising module 2016 uses statistical information about where users typically look in virtual reality content. The advertising module 2016 suggests 2308 a first advertisement based on the location of the user's gaze. For example, the advertising module 2016 determines that an advertisement should be placed where the user most commonly looks. In another example, the advertising module 2016 determines that the advertisement should be placed in a location where the user looks less frequently so that the advertisement does not interfere with the virtual reality content.

In some implementations, the advertising module 2016 determines 2310 a cost for displaying advertisements based on the location of the user's gaze. For example, the cost will be higher for regions where the user more commonly looks. In some implementations, the advertising module 2016 provides 2312 a graphical object as part of the virtual reality content that is linked to a second advertisement. For example, the graphical object includes a soda can that the user can touch to access a second advertisement. The second advertisement may be displayed as part of the virtual reality content, such as a pop up window that appears above the object, or the second advertisement may be part of another application that is activated by the ad server 141 responsive to the user touching the graphical object.

Figure 24 illustrates an example method 2400 for generating a social network based on virtual reality content. The method 2400 is described with respect to Figures IB and 20. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired embodiment. Although the method 2400 is described as a social network application 137 performing steps on a social network server 135 that is separate from the content system 171, the method steps may also be performed by the social module 2018 that is part of the content system 171.

The social network application 137 receives 2402 virtual reality content that includes a stream of three-dimensional video data and a stream of three-dimensional audio data for a first user from a content system 171. The social network application 137 generates 2404 a social network for the first user. For example, the social network connects users based on a shared attribute. The social network application 137 generates 2406 a social graph that includes user interactions with the virtual reality content. For example, users are connected in the social network based on users that message each other within the virtual reality world. In some implementations, the social network application 137 generates links for connected users to view the same virtual reality content. For example, the users may click on buttons in the social network to launch the content system 171.

In one embodiment, the social network application 137 suggests 2408 a connection between the first user and a second user based on the virtual reality content. For example, the social network application 137 makes a suggestion where the first and second users view the same virtual reality content. In another embodiment, the social network application 137 suggests 2410 a group associated with the social network based on the virtual reality content. For example, the social network application 137 suggests a group about subject matter that is also part of virtual reality content that the user views. The social network application 137 may automatically generate 2412 social network updates based on the first user interacting with the virtual reality content. For example, the social network application 137 generates an update when a user begins a stream, achieves a goal, etc. The social network application 137 may store 2414 information in a social graph about the first user's gaze at advertisements displayed as part of the virtual reality content. In some implementations, the social network application 137 uses the information to display advertisements within the social network.

Figure 25 illustrates an example method 2500 for analyzing virtual reality content. The method 2500 is described with respect to Figures IB and 20. Although illustrated as discrete blocks, various blocks may be divided into additional blocks, combined into fewer blocks, or eliminated, depending on the desired embodiment.

The stream combination module 2014 provides 2502 virtual reality content that includes a stream of three-dimensional video data and a stream of three- dimensional audio data to users. The content module 2056 information about user gaze, for example, from the viewing system 133 after the stream is displayed. The content module 2056 determines 2504 locations of user gaze of the virtual reality content and generates 2506 a heat map that includes different colors based on a number of user gazes for each location. For example, the heat map uses red to illustrate the most commonly viewed area, orange for less commonly viewed, and yellow for least commonly viewed. In one embodiment, the content module 2056 generates 2508 a playlist of virtual reality experiences. For example, the playlist includes most viewed virtual reality content, most highly rated virtual reality content, a playlist for a particular region, or a playlist from an expert in certain subject matter.

The implementations described herein may include the use of a special- purpose or general-purpose computer including various computer hardware or software modules, as discussed in greater detail below.

Implementations described herein may be implemented using computer- readable media for carrying or having computer-executable instructions or data structures stored thereon. Such computer-readable media may be any available media that may be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media may include tangible computer-readable storage media including Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Readonly Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory devices (e.g., solid state memory devices), or any other storage medium which may be used to carry or store desired program code in the form of computer - executable instructions or data structures and which may be accessed by a general- purpose or special-purpose computer. Combinations of the above may also be included within the scope of computer-readable media.

Computer-executable instructions comprise, for example, instructions and data which cause a general-purpose computer, special-purpose computer, or special- purpose processing device (e.g., one or more processors) to perform a certain function or group of functions. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

As used herein, the terms "module" or "component" may refer to specific hardware implementations configured to perform the operations of the module or component and/or software objects or software routines that may be stored on and/or executed by general-purpose hardware (e.g., computer-readable media, processing devices, etc.) of the computing system. In some implementations, the different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing system (e.g., as separate threads). While some of the system and methods described herein are generally described as being implemented in software (stored on and/or executed by general- purpose hardware), specific hardware implementations or a combination of software and specific hardware implementations are also possible and contemplated. In this description, a "computing entity" may be any computing system as previously defined herein, or any module or combination of modulates running on a computing system.

All examples and conditional language recited herein are intended for pedagogical objects to aid the reader in understanding the disclosure and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Although implementations of the present disclosures have been described in detail, it may be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the disclosure.

Claims

What is claimed is:

1. A computer-implemented method comprising:

receiving video data describing image frames from camera modules;

receiving audio data from a microphone array;

aggregating, by a processor-based computing device programmed to perform the aggregating, the image frames to generate a stream of three- dimensional (3D) video data, the stream of 3D video data including a stream of left panoramic images and a stream of right panoramic images;

generating a stream of 3D audio data from the audio data; and

generating virtual reality content that includes the stream of 3D video data and the stream of 3D audio data.

2. The method of claim 1, wherein aggregating the image frames to generate the stream of 3D video data comprises:

identifying first matching camera modules for left panoramic images based on a left camera map;

identifying second matching camera modules for right panoramic images based on a right camera map;

stitching first image frames captured by the first matching camera modules at a particular time to form a corresponding left panoramic image in the stream of left panoramic images; and

stitching second image frames captured by the second matching camera

modules at a particular time to form a corresponding right panoramic image in the stream of right panoramic images.

3. The method of claim 2, wherein:

for a pixel with a yaw value and a pitch value in a panorama:

the left camera map identifies a first matching camera module for the pixel in the panorama and matches the pixel in the panorama to a pixel in an image plane of the first matching camera module; and the right camera map identifies a second matching camera module for the pixel in the panorama and matches the pixel in the panorama to a pixel in an image plane of the second matching camera module.

4. The method of claim 2, wherein:

the left camera map associates a pixel location in left panoramic images to a corresponding first matching camera module, wherein the pixel location corresponds to a point of a panorama in a left viewing direction;

the corresponding first matching camera module has a field of view that includes a viewing direction to the point of the panorama; and the viewing direction of the corresponding first matching camera module is closer to the left viewing direction than other viewing directions associated with other camera modules.

5. The method of claim 1, wherein aggregating the image frames to generate the stream of 3D video data comprises:

determining a current viewing direction associated with a user; and generating the stream of left panoramic images and the stream of right

panoramic images based on the current viewing direction.

6. The method of claim 5, wherein:

the left panoramic images have a higher resolution in the current viewing direction of the user than a second viewing direction opposite to the current viewing direction; and

the right panoramic images have a higher resolution in the current viewing direction of the user than the second viewing direction opposite to the current viewing direction.

7. The method of claim 1, further comprising:

correcting color deficiencies in the left panoramic images and the right

panoramic images; and

correcting stitching errors in the left panoramic images and the right

panoramic images.

8. A system comprising:

one or more processors;

one or more non-transitory tangible computer-readable mediums

communicatively coupled to the one or more processors and storing executable instructions executable by the one or more processors to perform operations comprising:

receiving video data describing image frames from camera modules; receiving audio data from a microphone array;

aggregating the stream of 3D video data including a stream of left panoramic images and a stream of right panoramic images; generating a stream of 3D audio data from the audio data; and generating virtual reality content that includes the stream of 3D video data and the stream of 3D audio data.

9. The system of claim 8, wherein the instructions executable by the one or more processors aggregate the image frames to generate the stream of 3-D video data by:

stitching first image frames captured by the first matching camera modules at a particular time to form a corresponding left panoramic image in the steam of left panoramic images; and

stitching second image frames captured by the second matching camera

10. The system of claim 9, wherein:

for a pixel with a yaw value and a pitch value in a panorama:

11. The system of claim 9, wherein:

12. The system of claim 8, wherein the instructions executable by the one or more processors aggregates the image frames to generate the stream of 3D video data by:

panoramic images based on the current viewing direction.

13. The system of claim 12, wherein:

14. The system of claim 8, wherein the instructions executable by the one or more processors perform operations further comprising:

correcting color deficiencies in the left panoramic images and the right

panoramic images; and correcting stitching errors in the left panoramic images and the right panoramic images.

15. A computer program product comprising a non-transitory computer- usable medium including a computer-readable program, wherein the computer- readable program when executed on a computer causes the computer to:

receive video data describing image frames from camera modules;

receive audio data from a microphone array;

aggregate the image frames to generate a stream of three-dimensional (3D) video data, the stream of 3D video data including a stream of left panoramic images and a stream of right panoramic images;

generate a stream of 3D audio data from the audio data; and

generate virtual reality content that includes the stream of 3D video data and the stream of 3D audio data.

16. The computer program product of claim 15, wherein aggregating the image frames to generate the stream of 3D video data comprises:

stitching second image frames captured by the second matching camera

17. The computer program product of claim 16, wherein:

for a pixel with a yaw value and a pitch value in a panorama:

18. The computer program product of claim 16, wherein:

19. The computer program product of claim 15, wherein aggregating the frames to generate the stream of 3D video data comprises:

panoramic images based on the current viewing direction.

20. The computer program product of claim 19, wherein:

21. A method comprising :

receiving image frames that are captured by two or more camera modules of a camera array at a particular time;

interpolating a first virtual camera between a first set of camera modules from the two or more camera modules; determining a first set of disparity maps between the first set of camera modules;

generating, by a processor-based computing device programmed to perform the generating, a first virtual camera image associated with the particular time for the first virtual camera from a first set of image frames that are captured by the first set of camera modules at the particular time, the first virtual camera image being generated based on the first set of disparity maps; and

constructing a left panoramic image and a right panoramic image associated with the particular time from the image frames captured by the two or more camera modules and the first virtual camera image of the first virtual camera.

22. The method of claim 21 , wherein:

the first set of camera modules includes a first camera module and a second camera module that have an overlapping field of view; the first set of image frames captured by the first set of camera modules includes a first image captured by the first camera module and a second image captured by the second camera module; the first image includes a first sub-image that overlaps with a second sub- image of the second image on the overlapping field of view; and determining the first set of disparity maps between the first set of camera modules comprises:

determining a first disparity map that maps disparity of pixels from the first sub-image to the second sub-image; and

determining a second disparity map that maps disparity of pixels from the second sub-image to the first sub-image.

23. The method of claim 22, wherein generating the first virtual camera image for the first virtual camera comprises:

determining a position of the first virtual camera relative to the first and second camera modules;

generating a first shifted sub-image from the first sub-image based on the first disparity map and the position of the first virtual camera; generating a second shifted sub-image from the second sub-image based on the second disparity map and the position of the first virtual camera; and

combining the first shifted sub-image and the second shifted sub-image to generate the first virtual camera image for the first virtual camera.

24. The method of claim 22, wherein determining the first disparity map that maps disparity of pixels from the first sub-image to the second sub-image comprises:

for each corresponding pixel location in the overlapping field of view: determining similarity scores between a first pixel of the first sub- image at the pixel location and second pixels of the second sub-image at second pixel locations, wherein a corresponding distance between the pixel location and each of the second pixel locations is equal to a different disparity value;

determining a highest similarity score from the similarity scores; determining a disparity value associated with the highest similarity score; and

assigning the disparity value associated with the highest similarity score to the pixel location in the first disparity map.

25. The method of claim 22, wherein determining the second disparity map that maps disparity of pixels from the second sub-image to the first sub-image comprises:

for each corresponding pixel location in the overlapping field of view: determining similarity scores between a first pixel of the second sub- image at the pixel location and second pixels of the first sub- image at second pixel locations, wherein a corresponding distance between the pixel location and each of the second pixel locations is equal to a different disparity value;

determining a highest similarity score from the similarity scores; determining a disparity value associated with the highest similarity score; and assigning the disparity value associated with the highest similarity score to the pixel location in the second disparity map.

26. The method of claim 21, further comprising:

interpolating a second virtual camera between the first virtual camera and a first camera module from the first set of camera modules;

determining a second set of disparity maps associated with the first virtual camera and the first camera module;

generating, based on the second set of disparity maps, a second virtual

camera image associated with the particular time for the second virtual camera from an image frame of the first camera module and the first virtual camera image of the first virtual camera module; and wherein the left panoramic image and the right panoramic image associated with the particular time are constructed from the image frames captured by the two or more camera modules, the first virtual camera image of the first virtual camera, and the second virtual camera image of the second virtual camera.

27. The method of claim 21, further comprising:

interpolating a third virtual camera between a second set of camera modules from the two or more camera modules;

determining a third set of disparity maps associated with the second set of camera modules;

generating, based on the third set of disparity maps, a third virtual camera image associated with the particular time for the third virtual camera from a second set of image frames that are captured by the second set of camera modules at the particular time; and

wherein the left panoramic image and the right panoramic image associated with the particular time are constructed from the image frames captured by the two or more camera modules, the first virtual camera image of the first virtual camera, and the third virtual camera image of the third virtual camera.

28. The method of claim 21, further comprising: constructing a left camera map and a right camera map based on

configurations of the two or more camera modules and the first virtual camera;

wherein the left panoramic image is constructed from the image frames and the first virtual camera image based on the left camera map; and wherein the right panoramic image is constructed from the image frames and the first virtual camera image based on the right camera map.

29. A system comprising:

one or more processors;

one or more non-transitory tangible computer-readable mediums

receiving image frames that are captured by two or more camera modules of a camera array at a particular time; interpolating a first virtual camera between a first set of camera modules from the two or more camera modules; determining a first set of disparity maps between the first set of camera modules;

generating, based on the first set of disparity maps, a first virtual camera image associated with the particular time for the first virtual camera from a first set of image frames that are captured by the first set of camera modules at the particular time; and

30. The system of claim 29, wherein:

the first set of camera modules includes a first camera module and a second camera module that have an overlapping field of view; the first set of image frames captured by the first set of camera modules includes a first image captured by the first camera module and a second image captured by the second camera module; the first image includes a first sub-image that overlaps with a second sub- image of the second image on the overlapping field of view; and the instructions executable by the one or more processors determine the first set of disparity maps between the first set of camera modules by: determining a first disparity map that maps disparity of pixels from the first sub-image to the second sub-image; and

31. The system of claim 30, wherein the instructions executable by the one or more processors generate the first virtual camera image for the first virtual camera by:

generating a first shifted sub-image from the first sub-image based on the first disparity map and the position of the first virtual camera;

generating a second shifted sub-image from the second sub-image based on the second disparity map and the position of the first virtual camera; and

32. The system of claim 30, wherein the instructions executable by the one or more processors determine the first disparity map that maps disparity of pixels from the first sub-image to the second sub-image by:

for each corresponding pixel location in the overlapping field of view: determining similarity scores between a first pixel of the first sub- image at the pixel location and second pixels of the second sub-image at second pixel locations, wherein a corresponding distance between the pixel location and each of the second pixel locations is equal to a different disparity value; determining a highest similarity score from the similarity scores; determining a disparity value associated with the highest similarity score; and

33. The system of claim 30, wherein the instructions executable by the one or more processors determine the second disparity map that maps disparity of pixels from the second sub-image to the first sub-image by:

assigning the disparity value associated with the highest similarity score to the pixel location in the second disparity map.

34. The system of claim 29, wherein the instructions executable by the one or more processors perform operations further comprising:

generating, based on the second set of disparity maps, a second virtual

35. The system of claim 29, wherein the instructions executable by the one or more processors perform operations further comprising:

generating, based on the third set of disparity maps, a third virtual camera image associated with the particular time for the third virtual camera from a second set of image frames captured by the second set of camera modules at the particular time; and

36. A computer program product comprising a non-transitory computer- usable medium including a computer-readable program, wherein the computer- readable program, when executed on a computer, causes the computer to:

receive image frames that are captured by two or more camera modules of a camera array at a particular time;

interpolate a first virtual camera between a first set of camera modules from the two or more camera modules;

determine a first set of disparity maps between the first set of camera

modules;

generate, based on the first set of disparity maps, a first virtual camera image associated with the particular time for the first virtual camera from a first set of image frames that are captured by the first set of camera modules at the particular time; and

construct a left panoramic image and a right panoramic image associated with the particular time from the image frames captured by the two or

- I l l - more camera modules and the first virtual camera image of the first virtual camera.

37. The computer program product of claim 36, wherein:

38. The computer program product of claim 37, wherein generating the first virtual camera image for the first virtual camera comprises:

39. The computer program product of claim 37, wherein determining the first disparity map that maps disparity of pixels from the first sub-image to the second sub-image comprises:

40. The computer program product of claim 37, wherein determining the second disparity map that maps disparity of pixels from the second sub-image to the first sub-image comprises:

41. The computer program product of claim 36, wherein the computer- readable program, when executed on the computer, further causes the computer to: interpolate a second virtual camera between the first virtual camera and a first camera module from the first set of camera modules;

determine a second set of disparity maps associated with the first virtual camera and the first camera module;

generate, based on the second set of disparity maps, a second virtual camera image associated with the particular time for the second virtual camera from an image frame of the first camera module and the first virtual camera image of the first virtual camera module; and wherein the left panoramic image and the right panoramic image associated with the particular time are constructed from the image frames captured by the two or more camera modules, the first virtual camera image of the first virtual camera, and the second virtual camera image of the second virtual camera.

42. The computer program product of claim 36, wherein the computer- readable program, when executed on the computer, further causes the computer to: interpolate a third virtual camera between a second set of camera modules from the two or more camera modules;

determine a third set of disparity maps associated with the second set of camera modules;

generate, based on the third set of disparity maps, a third virtual camera image associated with the particular time for the third virtual camera from a second set of image frames that are captured by the second set of camera modules at the particular time; and

43. A method comprising :

generating virtual reality content that includes a compressed stream of three- dimensional video data and a stream of three-dimensional audio data with a processor-based computing device programmed to perform the generating;

providing the virtual reality content to a user;

detecting a location of the user's gaze at the virtual reality content; and suggesting a first advertisement based on the location of the user's gaze.

44. The method of claim 43, further comprising determining a cost for displaying advertisements based on the location of the user's gaze.

45. The method of claim 44, wherein determining the cost for displaying advertisements is further based on a length of time that the user gazes at the location.

46. The method of claim 43, further comprising providing a graphical object as part of the virtual reality content that is linked to a second advertisement.

47. The method of claim 43, further comprising:

generating graphics for displaying a bottom portion and a top portion that include at least some of the virtual reality content; and providing an advertisement that is part of at least the bottom portion or the top portion.

48. A method comprising:

receiving virtual reality content that includes a stream of three-dimensional video data and a stream of three-dimensional audio data to a first user with a processor-based computing device programmed to perform the receiving;

generating a social network for the first user; and

generating a social graph that includes user interactions with the virtual

reality content.

49. The method of claim 48, further comprising suggesting a connection between the first user and a second user based on the virtual reality content.

50. The method of claim 48, further comprising suggesting a group associated with the social network based on the virtual reality content.

51. The method of claim 48, further comprising automatically generating social network updates based on the first user interacting with the virtual reality content.

52. The method of claim 51 , further comprising:

determining subject matter associated with the virtual reality content;

determining other users that are interested in the subject matter; and wherein the other users receive the social network updates based on the first user interacting with the virtual reality content.

53. The method of claim 52, wherein determining other users that are interested in the subject matter is based on the other users expressly indicating that they are interested in the subject matter.

54. The method of claim 48, further comprising generating privacy settings for the first user for determining whether to publish social network updates based on a type of activity.

55. The method of claim 48, further comprising storing information in a social graph about the first user's gaze at advertisements displayed as part of the virtual reality content.

56. The method of claim 48, further comprising transmitting instructions to physical hardware to vibrate to provide physical stimulation.

57. A method comprising:

providing virtual reality content that includes a compressed stream of three- dimensional video data and a stream of three-dimensional audio data with a processor-based computing device programmed to perform the providing;

determining locations of user gaze of the virtual reality content; and generating a heat map that includes different colors based on a number of user gazes for each location.

58. The method of claim 57, further comprising generating a playlist of virtual reality experiences.

59. The method of claim 58, wherein the playlist is based on most user views of virtual reality content.

60. The method of claim 58, wherein the playlist is based on a geographical location.

61. The method of claim 58, wherein the playlist is generated by a user that is an expert in subject matter and the playlist is based on the subject matter.

62. The method of claim 58, wherein the playlist is automatically generating based on trends among two or more user profiles of a social graph.