WO2018027067A1

WO2018027067A1 - Methods and systems for panoramic video with collaborative live streaming

Info

Publication number: WO2018027067A1
Application number: PCT/US2017/045358
Authority: WO
Inventors: Pasi Sakari Ojala; Samian Kaur
Original assignee: Pcms Holdings, Inc.
Priority date: 2016-08-05
Filing date: 2017-08-03
Publication date: 2018-02-08

Abstract

Systems and methods are described for presenting embedded live streams within a first live stream. In one embodiment, live video streams are received at a server, each associated with a position. The server receives, from a first client, a request for a first live video stream, and communicates the first live video stream to the first client. A subset of the live video streams having an associated position within a view associated with the first live video stream is determined. The server sends, to the first client, at least information regarding the positions associated with each of the live video streams in the subset.

Description

METHODS AND SYSTEMS FOR PANORAMIC VIDEO

WITH COLLABORATIVE LIVE STREAMING

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application is a non-provisional filing of, and claims benefit under 35 U.S.C. §119(e) from, U.S. Provisional Patent Application Serial No. 62/371,698, entitled "METHODS AND SYSTEMS FOR PANORAMIC VIDEO AND IMMERSIVE AUDIO PRESENTATION WITH COLLABORATIVE LIVE STREAMING," filed August 5, 2016, the entirety of which is incorporated herein by reference.

BACKGROUND

[0002] Live video streaming services are currently available for access on user's mobile devices. Recently available services include Meerkat, Periscope, and Facebook Live. Such services enable video content capturing, sharing, and representation with an application running on a smart phone or a tablet. The live video is shared via a service provider that distributes the access information and the actual video stream to consuming applications. The consumers that are subscribed to the service may then pick up any of the available ongoing, or even recorded streams, and start watching. Basically, live video streaming users are sharing what they see at that moment through their camera lens. Consumers, on the other hand, may post comments which are visible on the viewfinder of the recording device as well as on the stream all the consumers receive.

[0003] Off-line video editing tools, such as those from Kolor, allow the combination of multi camera video streams into 360-degree presentations. They also have tools for combining still images into a video stream. Such tools are able to stitch together panorama images with video streams.

[0004] Camera manufactures such as Graava offer intelligent video recording that stores only contextually relevant events. The user may keep the camera on constantly and the application creates an automatic composition on interesting events. The application may also connect several Graava cameras together to create an automatically edited presentation. Livit has a live video streaming service with a mobile application that can be connected to an external camera. Livit allows for 360-degree video streaming and can also be applied to connect live webcam views of scenic places.

[0005] It is expected that in the near future, when wearable devices such as smart glasses become more popular, people will stream live videos more frequently, or even continuously. Hence, live content will often be available from many public places. Live streaming services are likely to have multiple video streams from the same location at the same time. These streams may also have surround sound, as multi-channel and binaural recording is possible with wearable devices containing microphone arrays.

[0006] Within live streaming services, there is no mechanism to seamlessly add multiple audio streams into a composition or to explore a view from different viewpoints and live streams originating from different devices by different users. Current streaming services are even missing 3D sound. Even if the receiver is able to explore the 360-degree video and select the viewpoint, the corresponding audio presentation does not reflect the viewpoint.

SUMMARY

[0007] Described herein are systems and methods related to managing and representing panorama views in a live video streaming service to enable nearly 360-degree immersive experience using existing smart phone camera hardware. A video streaming user may engage live video capturing after which a server may search for available panoramic 360-degree content using the device location and orientation sensor information. The panoramic content is available as additional information that is streamed together with the live video stream of a selected target provided by a selected user. Alternatively, a service may apply third party live or prerecorded panoramic content. A video streaming consumer user can view embedded visual markers overlaid on top of a panoramic image or live video stream being currently viewed.

[0008] In further embodiments, described herein are systems and methods related to managing multiple immersive 3D audio streams captured simultaneously in a collaborative live video streaming service. A service may bundle incoming bit streams together into a single HTTP stream which share the same context and geographical location. An immersive 3D audio of each live stream may be rendered in correct locations and directions relative to the visual streams. Spatial filtering may be used when rendering sound sources that are available also in other streams of the bundle. The sound sources that appear in the direction of the other video may be tuned down as the corresponding stream already contains the audio source as the main target. Similarly, the overall composition of a 3D audio image may be built by prioritizing different segments from different streams based on the quality and contextual information.

[0009] In one embodiment, there are systems and methods for presenting embedded live streams within a first live stream. A method may comprise receiving a plurality of live video streams, each associated with a position. The method may also comprise receiving, from a first client, a request for a first live video stream of the plurality of live video streams. The method may also comprise communicating the first live video stream to the first client. The method may also comprise determining a subset of the plurality of live video streams for which their associated position is within a view associated with the first live video stream. The method may also comprise sending, to the first client, at least information regarding the positions associated with each of the live video streams in the subset.

[0010] In one embodiment, a method may comprise communicating to a server from a user device a user request for a first live video stream associated with a specified location. The method may also comprise receiving at the user device from the server the first live video stream and at least a second live video stream proximal to the specified location. The method may also comprise rendering a 360-degree view of the specified location. The method may also comprise aligning the first live video stream and at least the second live video stream to the rendered 360-degree view of the specified location. The method may also comprise overlaying user interface elements associated with each of the first live video stream and at least the second live video stream at positions within the 360-degree view corresponding to positions associated with each of the first live video stream and at least the second live video stream.

[0011] In an exemplary embodiment, a method is performed by a user device, such as a head- mounted display, phone, or tablet, to present a switchable display of live video streams to a user. The device receives a first live video stream and information indicating a first capture location at which the first live video stream is being captured, and the device displays the first live video stream to the user. The device further receives metadata identifying at least a second live video stream and a second capture location at which the second live video stream is being captured. The device then displays a selectable user interface element as an overlay on the first live video stream at a display position corresponding to the second capture location. In some embodiments, based on a viewing orientation selected by the user, the device determines a cone of view of the user and displays to the user only the portion of the first live video stream that is within the cone of view. In response to a user input selecting the selectable user interface, the user device responsively displaying the second live video stream to the user. The second live video stream may be displayed together with the first live video stream, or it may be displayed in place of the first live video stream.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] A more detailed understanding may be had from the following description, presented by way of example in conjunction with the accompanying drawings, wherein:

[0013] FIG. 1 illustrates one embodiment of a panoramic and live video stream methodology.

[0014] FIG. 2 illustrates an overall architecture and data flows for a live video streaming service with interactive panoramic images, according to one embodiment. [0015] FIG. 3 A illustrates a flow chart for one embodiment of creating the video and metadata content by a recording user for the streaming server.

[0016] FIG. 3B is a block diagram of an exemplary media presentation description data model.

[0017] FIG. 4 illustrates a process flow of one embodiment of rendering an interactive panoramic scene.

[0018] FIG. 5A is a process flow chart of one embodiment of a receiving application which reads metadata about video content context from the streaming server and renders the content accordingly.

[0019] FIG. 5B is a schematic plan view of an embodiment employing the method of FIG. 5A.

[0020] FIG. 6A is a message flow diagram of one embodiment of live video streaming.

[0021] FIG. 6B is a message flow diagram of one embodiment of composing a presentation comprising more than one video source with a panoramic background in a given location.

[0022] FIG. 7 illustrates one embodiment of a video stream relative to a dual camera still image.

[0023] FIG. 8 illustrates one embodiment of a live video stream delivered with a wide angle camera, with a panoramic image further extending the view.

[0024] FIG. 9 illustrates one embodiment of aligning a live video stream with a panorama.

[0025] FIG. 10 illustrates one embodiment of a user interface (UI) for viewing a live video stream with a panorama.

[0026] FIG. 11 is a schematic illustration of an exemplary embodiment of a 360-degree panoramic stream made available by another user or third party.

[0027] FIG. 12A is a schematic illustration of an exemplary embodiment of a live video of a tour guide describing an area in which a video is captured.

[0028] FIG. 12B is a schematic illustration of live video captured of a second live video user in the same area as FIG. 12 A.

[0029] FIG. 13 illustrates an exemplary embodiment of a composition of a live video stream and a panoramic still image.

[0030] FIG. 14 illustrates another exemplary embodiment of a composition of a live video stream and a panoramic still image.

[0031] FIG. 15 illustrates one embodiment wherein two recording devices in the same location capture different targets, in some cases with another sound source not visible.

[0032] FIGs. 16A-16B illustrate one embodiment of two individual live video streams with surrounding audio and panoramic background. [0033] FIG. 17 illustrates one embodiment of a 360-degree audio visual presentation composed of two individual streams.

[0034] FIG. 18 is a schematic plan view illustrating an exemplary embodiment of a live video stream experience.

[0035] FIG. 19 is a block diagram of one embodiment of decoding an audio bit stream.

[0036] FIG. 20 illustrates one embodiment of tuning curves of audio level and BCC parameterization for audio streams from different sources.

[0037] FIG. 21 illustrates one embodiment of spatial audio filtering for mono representation.

[0038] FIG. 22 is a block diagram of one embodiment of an audio processing chain in collaborative live streaming.

[0039] FIG. 23 is a schematic perspective view of one embodiment of rendering individual streams of live video together with a panoramic background, with each stream in its correct location.

[0040] FIG. 24 is a schematic plan view of one embodiment of a receiving user and a composition of two streams with video, surround sound, and a panoramic background.

[0041] FIG. 25 illustrates one embodiment of a composition 360-degree view with two live video and audio streams.

[0042] FIG. 26 illustrates a view of a live video stream as displayed on a user device in some embodiments.

[0043] FIG. 27 illustrates a view of a live video stream as displayed on a user device in some embodiments.

[0044] FIG. 28 illustrates a view of a live video stream as displayed on a user device in some embodiments.

[0045] FIG. 29 illustrates an exemplary wireless transmit/receive unit (WTRU) that may be employed in some embodiments, for example as a head-mounted display and/or as a user video recording device.

[0046] FIG. 30 illustrates an exemplary network entity that may be employed in some embodiments, for example as a live video streaming server.

DETAILED DESCRIPTION

[0047] A detailed description of illustrative embodiments will now be provided with reference to the various Figures. Although this description provides detailed examples of possible implementations, it should be noted that the provided details are intended to be by way of example and in no way limit the scope of the application. [0048] Note that various hardware elements of one or more of the described embodiments are referred to as "modules" that carry out (i.e., perform, execute, and the like) various functions that are described herein in connection with the respective modules. As used herein, a module includes hardware (e.g., one or more processors, one or more microprocessors, one or more microcontrollers, one or more microchips, one or more application-specific integrated circuits (ASICs), one or more field programmable gate arrays (FPGAs), one or more memory devices) deemed suitable by those of skill in the relevant art for a given implementation. Each described module may also include instructions executable for carrying out the one or more functions described as being carried out by the respective module, and it is noted that those instructions could take the form of or include hardware (i.e., hardwired) instructions, firmware instructions, software instructions, and/or the like, and may be stored in any suitable non-transitory computer- readable medium or media, such as commonly referred to as RAM, ROM, etc.

[0049] In systems described herein, where a composite video is generated from more than one video stream with immersive multi-channel 3D sound, the audio rendering in the receiving application may take into account the audio mixing. Although different video streams may cover different visual targets, the corresponding immersive audio typically contains audio sources from all different directions. Hence, two or more video streams may well contain the very same audio image with the same audio sources. When rendering a composition of such content, the sources from different streams may be prioritized. And the mixing may be adapted based on user equipment capabilities.

[0050] Presenting a whole 360-degree wide scene and especially providing a detailed view of the surrounding environment is difficult in a real-time video streaming application. Special multi- camera solutions, external camera equipment, or big fisheye type lenses may be used to capture the whole scene. Integrating such a camera hardware is not generally feasible for devices that users carry in their pockets. Not all live video content creators have 360-degree video equipment available at all times. 360-degree video streams are often created by professional content providers.

[0051] Dual camera systems, such as the system of U.S. Pat. App. 2016/0007008, do not provide any significant change in the view angle. Such dual camera systems are mainly intended for image quality enhancement and are not sufficient for covering a whole 360-degree environment. The user still needs to turn the camera if stream recipients want to view something outside the view angle.

[0052] One available solution in a real-time video streaming service for a smartphone user without extra hardware and external equipment to deliver an overview of the surrounding environment is simply to point in different directions. Typically, the view angle is relatively narrow even with a 135-degree lens, and hence the recipient may find it difficult to obtain an overall understanding of the scene and details. Also, the recipient does not have any means to return to an interesting viewing angle. Instead, a recipient may provide feedback with, e.g., short text messages that are presented on top of the video stream. Hence, a new service functionality may be desirable to deliver the whole 360-degree view and the information about the environment including behind the person capturing the video.

[0053] In one aspect of the present disclosure, there are systems and methods for a live streaming application which has a panorama-capturing functionality to capture the 360-degree view of a live stream and coordinates of the recording device. A server may have a selection of panoramic content tagged with location and orientation information that can be included in the selected live stream.

[0054] A consumer client rendering engine may render the user selected live video stream and overlay other proximal live video streams in proximity using visual markers on top.

[0055] There are various different embodiments for capturing panoramic live streaming video. In a first embodiment, the user is capturing the live video and the service checks independently to determine whether there is a suitable panoramic video or image available in the given location. A second embodiment offers an application program interface (API) for third parties to provide visual content for the service. In this case, the server may use external content for the panoramic images.

[0056] FIG. 1 illustrates one embodiment of a panorama and live video stream methodology as disclosed herein. FIG. 1 schematically illustrates a composite 360-degree video 100 generated using a 360-degree panoramic background image 102. A viewport of a currently-active viewer of the image is illustrated as viewport 104. A live video stream 106 is stitched into the 360-degree background 102. The position of the live video stream 106 within the background image 102 may be determined based on the location and orientation at which background image 102 was captured (e.g. camera location and orientation) as compared to the location and orientation at which the live video 106 is being captured (e.g. camera location and orientation). Matching of background features between live video 106 and background 102 may also be used in determining the position of video 106 within background 102.

[0057] A panorama mode of live video may provide new media consumption functionality, including but not limited to:

• Methods to capture and include panorama view around the user who is capturing live video, enabling interactive representation of the scenery. There is no need for the recording user to point at random things and create a confusing video stream with irregular movements. • A live streaming service may be enhanced with a selection of panoramic content that is available in different locations either by other recording users with 360-degree camera equipment or commercial content creators promoting a particular area, e.g., a touristic spot.

• Since the receiving application is aligning the panoramic image in the correct location and orientation relative to the on-going live stream, any new content in the stream will effectively update the view. The interesting detail or event happening in the scenery will be captured with the live video while the rest of the landscape is covered with additional panorama.

• Followers are able to study the environment in detail. The selected live stream offers content from the target of interest, which may be something the overall panorama does not contain. In addition, the receiving users may scroll and zoom their view around the panorama image and concentrate on interesting directions. There is no need to send textual feedback and request the capturing user to turn the camera in random directions. Every follower may study their favorite spot on the scenery. Only if there is a special need to update something in the scenery, the follower may send a request.

Overall architecture.

[0058] One embodiment of the present aspect of the disclosure is shown in FIG. 2, illustrating a streaming live video architecture, such as with one or more live video recording clients/devices 202, 204.

[0059] The function of a live video recording device is to record the live audio/video content and to transmit all the relevant data to the live streaming service 206. In some embodiments herein, the live video recording device may compose panoramic content and transmit the content to the service. The live stream bit stream transmitted to the service also contains continuous contextual sensor information flow about the real-time location and vertical/horizontal orientation of the recording device. The sampling rate of the location/orientation is selected to be high enough to capture the natural motion of the device.

[0060] A live video stream may be provided with metadata information including data about the capture location, the left and right borders (e.g. in compass angles), and upper and lower border angles of the captured view relative to the horizon. This information is applied later in the rendering of the presentation relative to the receiving user and is used to align the panorama correctly with the live stream.

[0061] In some embodiments, protocols such as, but not limited to, Real Time Streaming Protocol (RTSP) and MPEG ISO/IEC standard Dynamic Adaptive Streaming over HTTP (DASH) support real-time audio visual content and metadata transfer between the client and server. The recording clients may stream the audio visual content over RTSP to the live streaming server.

[0062] The server collects collaborating streams together by extracting the encoded bit streams from each incoming RTSP stream bundling them into a single DASH media presentation. For example, the server may reserve 360-degree video streams as panorama content for one or more users. Each individual stream from one or more different recording applications may be presented in the Media Presentation Description (MPD) manifestation file and corresponding encoded bit streams are included in the Media Presentation Data Model.

[0063] Receiving clients, such as clients 208, 210, 212, may request the live video stream after which the media composition is transmitted over HTTP in blocks of data comprising a short segment of each individual video and/or audio-visual stream. In some embodiments, the receiving device/client of a live video streaming service will request for available live streams in a given geographical area, or generally based on some predefined criteria. When a particular stream is selected, the application on the receiving device(s) may also request any available and/or related panorama image or video data.

[0064] The receiving client/device will then receive a live video stream bundle comprising one or more live video and/or audio-video streams with related location and orientation metadata using the DASH protocol. The latest and most relevant panoramic content with location and orientation details may be received within the protocol as an additional stream whenever a new composition is available. In some embodiments, the receiving device will receive the most relevant proximal video streams with location and orientation details for a rendering.

[0065] In some embodiments, a composed panorama may contain information about view size, e.g., the limits of the view in location as well as orientation. That is, the side information includes data about the view location, and the left and right borders for example in compass angles as well as upper and lower border angles for example relative to the horizon. This information may be applied later in the rendering process to align the content correctly with the live stream.

[0066] The receiving device of a live video streaming service may request information regarding available live streams in given geographical area or generally based on some predefined criteria. Based on the request, a service may respond with a list of available individual and composition streams. When an individual stream or a stream composition is selected, the application on the receiving devices may automatically request any available related panorama data.

Recording application. [0067] In some embodiments, each recording user is capturing a live video and transmitting it to the live video server, for example using the RTSP protocol. The live video may be either a conventional recording with a smartphone, a 360-degree stream with an external camera module, or the like.

[0068] In some embodiments, depending on the device capabilities, the recording application may capture multi-channel audio. Especially if the device is connected to a headset with binaural microphones or microphone array, the application may be configured to stream surround sound.

[0069] Audio parameters regarding the number of channels, dynamic range, and the like may be included with the live audio stream as side information. Typically, live streaming protocols accommodate the transmission of side information that includes details about the audio track.

[0070] Live video streaming protocols are media codec agnostic. Hence, the recording application may use any state-of-the-art standard to encode the audio visual content. For example, the video stream could be encoded with H.264/MPEG Advanced Video Coding (AVC) and Advanced Audio Coding (AAC), or the like. When the recording device supports multi-channel audio or binaural audio capture, the MPEG Spatial Audio Codec (SAC) could be used, or the like. It provides a backwards compatible way to transmit the audio to the receiver. The core bit stream may comprise standard mono audio, e.g., with AAC, and the spatial audio may be in a separate field.

[0071] In some embodiments, the recording application includes contextual information about the video content as metadata. Depending on the video type, the location information may contain different location and orientation information.

[0072] For example, a third party recording user providing 3D video with 360-degree coverage may utilize details about the alignment as well as the location of the view relative to the map. Polar coordinates from the motion API of the recording may be sufficient, or the like. The server may handle this type of visual stream as panorama that may be offered as panorama background for other video streams within the same area.

[0073] 2D video with "narrow" view uses information about the orientation of the camera view in the environment as well as the location on the map. Again, the motion API of the device will provide the data from the sensors (e.g., acceleration, gyro, compass, etc.).

Server functionality.

[0074] One task of the live video streaming server in some embodiments is to collect the live content from the recording applications, check the corresponding contextual information in order to combine collaborating applications and their content together, collect the incoming streams into a single bundle and build a MPD manifest to describe the available content. The DASH bundle is made available in the server for the receiving clients. When a receiving client is requesting a particular content stream, the server will transmit the data in short segments over the HTTP protocol using the DASH protocol.

[0075] The live video streaming service may also have a container for panoramic content related to the live stream. When the server publishes collection of incoming live streams for the consuming applications, e.g., returns the information about the available streams in service, it may also provide information about the availability of panoramic content.

[0076] The actual video content and corresponding metadata may be encapsulated using a Media Presentation Description (MPD) data model that is made available to the streaming clients from the live streaming server. One embodiment of a flow of actions is illustrated in FIG. 3A, depicting a flow chart for creating the video and metadata content by a recording user for the streaming server. As illustrated in FIG. 3A, an audio and video signal is captured in step 302. In the case of 360-degree video, contextual information such as polar coordinates and location of the camera is determined (step 304). In the case of 2D video, contextual information such as location and orientation of the camera view is determined (step 306). The video bit stream and appropriate contextual parameters are combined (step 308), and an MPD data model is generated (step 310) with the audio-video content and the contextual metadata.

[0077] In one embodiment, the stream from the recording application is presented in the MPD manifest file, and corresponding encoded bit streams are included in the MPD Model. The receiving application requests the live video stream, after which the media composition is transmitted over HTTP in blocks of data comprising a short segment of each individual audiovisual stream. The MPD contains information identifying the content (e.g. an address from which the content can be retrieved) and related metadata on content location, orientation, sampling rates, bit rates, etc.

[0078] An example of an MPD manifest is presented in Table 1. The MPD contains information about media content regarding sampling rates, bit rates, codec selection etc. In addition, the manifest contains metadata on content location metadata in polar coordinates. The stream information may also indicate if the stream is at a fixed location, in which case the coordinate's information is represented as fixed polar coordinates. Alternatively, the stream may be a moving stream, and the coordinate information may indicate the initial coordinates, final coordinates and trajectory information. The rendering client may use the stream meta-information to appropriately overlay the visual indicator of the streams on top of the rendered live video stream. In one method, the rendering client may first compute the angle of view of the client headset based on the orientation of the user's head. The rendering client may then use the computed angle of view to compute the visual indicators of relevance to the user's viewpoint.

Table 1. Exemplary MPD

<MPD xmlns:xsi="http://www. w3.org/2001/XM LSchema-instance" xmlns="urn:mpeg:dash:schema:mpd:2011" xsi:schemaLocation="urn:mpeg:dash:schema:mpd:2011

http://standards.iso. org/ittf/PubliclyAvailableStandards/M PEG-DASH_schema_files/DASH-MPD.xsd"

type="dynamic" availabilityStartTime="2016-04-07T17:33:32Z" publishTime="2016-04-07T17:33:32Z" timeShiftBufferDepth="PT3H" minimumUpdatePeriod="PT595H" maxSegmentDuration="PT5S"

minBufferTime="PTlS" profiles="urn:mpeg:dash:profile:isoff-live:2011_/urn:com:dashif:dash264">

Representation id="128kbps" bandwidth="128000" codecs="mp4a.40.2" audioSamplingRate="48000"> <SegmentTemplate duration="2" media="../dash/250k/bitcodin-$Number$.m4a" initialization="../dash/250k/bitcodin-init.m4a" startNumber="92352"

liveEdgeNumber="92352"/>

</Representation>

</AdaptationSet>

<SegmentTemplate duration="2" media="../dash/250k/bitcodin-$Number$.m4v" initialization="../dash/250k/bitcodin-init.m4v" startNumber="92352"

liveEdgeNumber="92352"/>

</Representation>

</AdaptationSet>

</Period>

</MPD>

[0079] In the example of Table 1, the source file "stream 1.txt" identifies textual content to be applied along with a first video stream, where the coordinates for rendering of the first video stream are provided with the attributes "int-coord" and "final-coord". The attributes "int-coord" and "final-coord" may each indicate the location of corners of a rectangular frame in which the first video stream is rendered, where the location of each corner may be provided in polar coordinates. In some embodiments, the rectangle in which a stream is positioned may change over time (e.g. over the course of a segment). In such embodiments, "int-coord" may be used to indicate the position of the frame at a starting time (e.g. at the start of a segment), and "final-coord" may be used to indicate the position of the frame at the final time (e.g. at the end of a segment). The attribute "trajectory" may be given the value "straight" in cases where the rectangle moves in a straight line between locations indicated by the initial coordinates and the final coordinates.

[0080] In another embodiment, the metadata for the neighboring streams may be sent at frame level. In one method, the metadata about the location coordinates about the neighboring streams may be coded as a supplemental enhancement information (SEI) message in the H.265. The client rendering application is provided this information from the decoder and uses this information to overlay a visual representation of the live neighboring streams on top of the rendered video stream.

[0081] In a further embodiment, the information may be split into the MPD and the coder. The MPD file may contain the coarse level information about the fixed coordinate streams, and the frame coding may carry the information about the moving streams.

[0082] At the server side, the live video stream is listed in the service API from which the application will fetch it as the target video content or as a panoramic background. Exemplary embodiments provide storage and access information for an interactive panorama of the scenery around the recording user. The relevant panorama may be made available by the service for the consuming users based on the location and orientation information of the live video.

[0083] An exemplary live video streaming service may have a container for panoramic content stream access details. When the server publishes the incoming live stream for the consuming applications (e.g., when the server returns the live stream details in response to an application requesting information identifying available streams), the server may also provide information about the availability of panoramic video.

Live streaming service functionality.

[0084] In an exemplary embodiment, the live video streaming server collects incoming streams and bundle streams in the same area or location covering the same content and targets into a single media presentation description data model. The corresponding manifest file will contain relevant information, e.g., a URL link pointing to the location of each content chunk. FIG. 3B illustrates an exemplary MPD data model. The media content streams are split into short periods for transmission over HTTP. As shown in FIG. 3B's media presentation description data model, each period of the media content contains a plurality of content streams from different recording applications, and thus the segment information contains access details for more than one simultaneous stream.

[0085] In this case, the media content may comprise a bundle of one or more audiovisual bit streams from different recording applications. Short periods may comprise several different adaptation sets of different modalities. In some cases, audio and video are in different sets. Each component contains presentations of different audio or video streams. Finally, the segment information of each presentation has the details used for streaming. For example, the URL for the presentation is found in the segment info. The segment information contains access information, and/or the like. The receiver may select one or more of the segments for streaming and presenting to the user. In this case, the receiver may render a composition coming from more than one recording application.

[0086] The rendering of an interactive panoramic scene is illustrated in a process flow in FIG. 4, for rendering a live video overlay on top of the panoramic view. In step 402, the receiving user selects the desired live stream from the desired user or target either from the map or directly from the list of streams. In case of a map selection, the UI lists available streams within a selected area on a map.

[0087] When the stream is selected, the application will request the live video bit stream and starts to render the panoramic video and audio, checking continuously in step 404 for availability of an appropriate panorama background.

[0088] When the application receives the contextual information about the proximal video stream, it will render the visual markers for the proximal video streams relative to the panoramic live video stream in step 406. The sensor information about the orientation, e.g., about the vertical and horizontal alignment of the recording device, is available in the stream. Hence, the position and projection of the video streams is calculated by the rendering unit relative to the panorama. At first, the live video is centered on the screen, and the panorama is rendered accordingly.

[0089] In step 408, user device movements or scrolling of screen are captured to interact with the panorama view. In step 410, live video streaming updates the live video and panorama orientation.

[0090] The receiving application will select relevant information from the MPD data model from the live streaming server. In some embodiments, the receiving application may collect the stream information from the data model of FIG. 3B. For example, the application may select proper bit rate content streams and formats. In some embodiments, the application may switch between different bit rate streams if such selection is available. Also, in some embodiments, the application may choose to limit the number of simultaneous streams even though the data model is supporting segment information for multiple simultaneous streams.

[0091] An application may also receive context information for rendering the visual content that is streamed from the server. The metadata in the MPD may contain location and orientation information of the content. The application may then compare the orientation information of the receiving device and determine the relative location, orientation and projection of the content to the user. The application is then able to render the content either in 3D or 2D in the correct location, direction and projection in a 360-degree view around the user.

[0092] FIGs. 5A-5B illustrate one embodiment of a flow of operations for the receiving application, where the receiving application reads the metadata about video content context from the streaming server and renders the content accordingly. In step 502, the application determines the orientation (and in some embodiments the location) of the receiving device through a motion API. An MPD file 504 available on the streaming server is received by the application. The application in step 506 reads video metadata on location and orientation coordinates from the MPD file received from the streaming server, comparing the location and orientation of the receiving device and the location and orientation conveyed in the metadata, the application in step 508 determines the user device position relative to the streamed content location and orientation, in step 510, the receiving application renders the visual content in correct location and orientation relative to user device. FIG. 5B schematically illustrates a video presentation resulting from the process of FIG. 5 A, in which a user 512 is able to view streaming video content 514 at an orientation appropriate to the orientation of the user and the orientation of the content as indicated in metadata.

[0093] FIG. 6A is a message flow diagram for one embodiment of live video streaming. A user initiates recording of a live video, and video recording proceeds in step 602. The video stream is provided to a streaming service through an API and is listed on the service in step 604. Location data for the live video is also provided to the live streaming service. A live video receiving user may issue, through a viewing application, a request for a live video. The viewing application receives the requested stream and renders it in step 606. Based on location data received for various streams, the live streaming service identifies neighboring video streams in step 608 and provides information on the neighboring video streams to the viewing application. In step 610, the application uses the location and orientation data to align video streams into a composite video. The user may explore the content in the composite video in step 612 by rotating the playback device and/or by scrolling the display screen of the device.

[0094] Identification of neighboring panoramic content or other related content may be made using explicit location information or other implicit features of the content, such as video stream analysis regarding lighting conditions and other possible contextual information (e.g., weather information, etc.).

[0095] A user who is consuming live video may start the application and select the video stream. If there is a panorama available, the application supports browsing of the content on the screen. The rendering of the video on top of the panorama (e.g., projecting and stitching the video to the other stream) may use standard rendering tools, such as the tools utilized in Autopano by Kolor, or similar tools as known to one of ordinary skill in the art. In some embodiments, the system operates to stich live video and panorama in cases where the panorama and live video were/are captured from different camera locations. The location and orientation information of each video stream and image may be determined and applied to improve the stitching result. When the camera location/orientation of each content stream is available, the rendering entity may create a proper projection of the stream/image so that they fit each other. Detection of visual cues may be used to help fine tuning of content matching. In some embodiments, the user may rotate the device, zoom, and scroll the 360-degree content on the screen.

[0096] In some embodiments, the live video streaming service may have a special service API dedicated for panoramic content in either live video or still images. Third parties may upload special panoramic content from selected locations in different contexts, such as time of the year, time of the day, different weather conditions, and/or the like, or simply continuously stream video with a 360-degree camera. If a recording user has not provided the panorama, the service may provide it from the third-party content. The receiving user may also have an option to accept or reject the panorama.

Creating the panorama in background.

[0097] In some embodiments, the interactive panoramic view around a recording user may be processed in the service back end. The service may collect the device location and orientation sensor information and select frames from the video stream when needed. A new frame is picked whenever the location or orientation is moved more than a predefined threshold, or the like. The service back end may then compose the panorama image and, in some instances, store it together with a corresponding time stamp as a side information to the given video stream.

[0098] Alternatively, the service back end may analyze the video stream and corresponding contextual sensor information on device location and orientation. The service may pick frames from the video stream autonomously and compose the panoramic view automatically without recording user interaction.

Applying dual camera module.

[0099] In some embodiments, a smartphone may have a dual camera system, as known in the art, and a content capture task may be divided between the camera modules. Typically, the second camera module is intended for wide angle usage. Therefore, in some cases, the second camera module may be suitable for capturing a panoramic still image.

[0100] The recording functionality of a real-time video streaming device and/or service may utilize the dual camera system by capturing the wide angle still images. The application may collect the wide angle images during the video capture when the camera is pointing in different directions and stitch together the panorama.

[0101] The still image that is transmitted as side information with the real-time video stream may be updated every time the user has turned the camera more than a predetermined angle, or in response to another factor. These images, in some cases only if captured within a reasonable time frame, may be combined together into a wide angle panorama either in the application or in the real-time video streaming server.

[0102] In some embodiments, a wait state may also be implemented such that an update is not transmitted too frequently from the same location and direction.

[0103] In some cases, an advantage of a dual camera module is that the application, service provider, or even user is able to switch between the main camera (narrow angle) and dual camera (wide angle) for the video stream, or the like. In some instances, if the main camera is selected, the dual camera may continue to support with the panoramic still image. In some alternative embodiments, when the dual camera with wide angle is selected for the real time video, the operation may more generally track as discussed above, where the panoramic still image is stitched from the same material as the video stream. A difference may be that a receiver may still view the stream in narrow angle mode, in which case the whole video stream and the panorama image extending the view to full 360-degrees can be explored by turning the device, or browsing the view, or the like.

[0104] FIG. 7 illustrates an exemplary embodiment in which a video stream 702 is displayed as an overlay on a background panorama 704 captured with a wide-angle dual camera.

[0105] FIG. 8 illustrates one embodiment of the functionality with wide angle video 802, narrow viewport angle 804, and panoramic background 806. The user is turning the device or browsing the stream, and hence, exploring the content accordingly. Alternatively, the user may consume the wide-angle video stream as whole in wide angle view and explore the surrounding environment offered by the panoramic image. In this case the receiver's view angle in FIG. 8 has the same size as the wide-angle video stream. Again, the user may explore the whole environment with the help of the panoramic still image.

[0106] In some particular embodiments, the panoramic content is made available in the live streaming service as side information. The receiving application may continuously check for the most relevant available content. When the panoramic stream is found, it may be rendered on background and the actual live video stream overlaid on top of it. FIG. 9 illustrates one embodiment of the 360-degree background 902 and the live stream 904 that is placed in the location and orientation corresponding to the actual sensor information. That is, the location and orientation information of the live stream is applied to align and project the panorama and the live stream. Thus, the live stream will blend with the background image. In some embodiments, if the scenery is relatively stable, the live stream view boundaries may fade in the background.

[0107] When the recording user is changing the orientation and pointing the camera in different direction, the receiving application may render the live video stream icon of FIG. 9 in a corresponding location relative to the panorama.

[0108] The receiving application may also update the background using the video stream. When the live video stream image moves, old frames from the video may be frozen on top of the panorama.

[0109] The receiving user may study the environment of the recording user by scrolling the image in a user interface (UI) view, or by rotating the device, or the like. FIG. 10 illustrates one embodiment of a UI for the receiving application. For example, the panoramic image and live video stream of FIG. 9 may be rendered on the screen of a smartphone 1002. In some embodiments, by default, the selected live video stream 1004 is centered on the screen. In some embodiments, the scrolling buttons 1006 (or swipe gestures, or the like) can be used to rotate and zoom the image. As the live video stream 1004 is aligned to a panorama 1008, the video 1004 is also moved when the screen is scrolled and zoomed. In some embodiments, the view can also be explored by rotating the device 1002. The application may read the relevant sensors (e.g., acceleration, gyroscope, compass, etc.) in order to detect device orientation changes and align the combined live video and panorama accordingly. In some embodiments, the view can be centered, such as by pushing the center area of a scrolling tool 1006.

[0110] In some embodiments, a user interface on device 1002 includes a map view 1010. The map view 1010 may display an indicator 1012 of a location at which the live video overlay 1012 was captured and may further display an indicator 1014 representing an orientation at which the live video overlay 1012 was captured.

[0111] An example of panorama rendering was set forth above in relation to FIG. 1. A receiving user is able to study the full 360-degree panorama 100 by virtually scrolling a wheel on which the image is rendered or providing other input to change the position of viewport 104. The live video stream 106 may be aligned and projected to the panoramic background 102 in the corresponding location. In some embodiments, the panorama may contain the complete sphere around the recording user. In such cases, the panorama view may be represented as a sphere rather than the cylinder of FIG. 1. In such embodiments, the receiving user may be able to scroll in both horizontal and vertical directions. [0112] An exemplary embodiment of a panorama, captured by another recording device or user, available in the area the recording user is capturing the live video is illustrated in FIG. 11.

For example, panorama 1102 of FIG. 11 may be available from the server, from another user or third party, or the like.

[0113] FIGS. 12A and 12B illustrate exemplary individual live streams 1202 and 1204 that are captured in the same general location but that represent video of different targets. For example, video 1202 may be a live video of a tour guide giving an introduction to the area where the live video is captured, and video 1204 may be a video of a tourist or other social media user in the area.

[0114] The live video streaming service may combine the live stream and the panorama. That is, the live video is stitched on the panorama stream. For example, the live stream 1202 of FIG. 12A may be stitched and projected onto the panorama 1102 (which may be a still image or a panoramic video) to generate composite video 1302 illustrated in FIG. 13. The full presentation comprises the composition of the panorama content 1102 streamed from the server along the selected live video and the live video stream 1202 stitched on top of the panorama.

[0115] The receiving user may consume the live video stream by listening to the tour guide speech and exploring the environment presented in the panorama. The receiving user may rotate the device or browse the screen to view different positions of the presentation.

[0116] The service may combine a live video stream by yet another user to the existing panorama when the stream is captured in the same area. For example, as illustrated in FIG. 14, the user of FIG. 12B is recording a live video 1204 from the same general location as the user of FIG. 12A. The service may detect that the live stream 1204 from the user of FIG. 12B fits with the panorama and combines the stream 1204 with the panorama background 1102 to generate composite video 1402 as in FIG. 14.

[0117] In some other embodiments, a composition may comprise more than one live video stream on the panorama content. For example, a receiving user may be simultaneously watching video from the users of both FIGS. 12A and 12B.

[0118] In some embodiments, the panorama of the live video stream of FIG. 1 1 may be provided by another user with a 360-degree camera, or, for example, by a third party (such as a local tourist office, or others) for the service. Thus, any or all live video streams from the given location may be enhanced with, for example, promotional material from the panorama provider (e.g., particular shops could be highlighted or otherwise noted in a panorama of a city square, etc.).

[0119] Another aspect of the systems and methods for a live streaming service disclosed herein may include composing a collaborative presentation from one or more content capturing clients. For example, combining the content streams from different recording applications may permit a multi-view streaming of the desired content. Generally, in some cases a composition may be a collaborative content network centered around a special point of interest, event, or target, or the like.

[0120] In one embodiment, a process of collaborative live streaming may include the following steps. A streaming service may pick up a plurality of live streams within the same geographical area and find potential content creation clients for the collaborating effort. As long as the collected content streams contain similar targets within the same area and have the same context, the service will bundle the streams together. The receiving application may then be enabled to render the composition of one or more collaborating streams together. The receiving client may open the HTTP stream (or the like), unbundle the multiple audio-visual content(s), and stitch the different video streams together. If one or more of the streams contain panoramic content, or 360-degree images and videos, the streams may be merged to cover as wide a viewing-angle as possible. The individual streams containing conventional video streams (below 360-degree view) are stitched in corresponding locations and orientations on top of the possible panoramic content. An audio stream of each live video stream may be rendered in the same direction as its corresponding video stream. In some embodiments, where the receiving client application user has binaural or stereo playback (or the like), any or all audio streams may be rendered simultaneously in their relative directions. In some embodiments, rendered audio sources are identified based on the direction of arrival and may be prioritized based on relevance to the video stream. An audio source in the direction of a video stream may be rendered using the live stream corresponding to the video. Hence, the audio streams corresponding to the video may be prioritized over the same surround audio content retrieved from a different video stream in a different direction. In some embodiments, the surround sound environment and sound sources not visible on live video streams may be prioritized, such as based on objective quality metrics (or the like). In some embodiments, best quality content is emphasized over low quality content. In some embodiments, spatial filtering may be applied to control the audio image and prioritize the different audio streams in different directions.

[0121] In some embodiments, such as when only mono playback is available, the application may play back a stream corresponding to the direction the user is viewing in the composition. In some embodiments, the audio playback level may be adjusted based on the distance from the observer location to the content location. For example, the audio may be faded away and a new stream added when the receiving user turns towards another video stream in the composition. [0122] From the service point of view, the multi-view coverage and possibility to switch between different viewing angles of a target from different contributing devices may improve the user experience, such as for the live video streaming service set forth above.

[0123] In some embodiments, a plurality of live video streams in the same area, possibly pointing in different directions, may be combined into a wide angle, possibly even 360-degree, video presentation. In some embodiments, when panoramic content from one or more sources is stitched with the videos and immersive 3D audio is mixed in the presentation, the user experience may approach or be a full 360-degree live video. In some embodiments, spatial filtering of the audio streams enables the best quality 3D sound representation. The best quality sound may be selected for each direction using contextual and objective criteria.

[0124] For the composition services, the overall architecture as discussed above in relation to FIG. 2 may be utilized.

[0125] For one embodiment, FIG. 6B is a message flow diagram of the timing for content capture by more than one recording user and a composition of video streams with panorama. A plurality of recording users may start live streaming (step 652). In some cases, if a user has connected a 360-degree video camera to the recording application and/or the recording device, the content may be a panoramic stream. In that case, the server may list the video as possible panoramic content for other live videos in the same location, as discussed above.

[0126] Live streams are transmitted from the recording application over RTSP to the server. The server may list those streams as being available (step 654). The streams may be further transmitted as a combination of one or more streams to the receiving application using the DASH protocol.

[0127] While the live video service issues the live streams available for the consuming users of the service, the service may bundle individual streams together based on the location, content and content similarity.

[0128] The receiving user will pick the desired live stream bundle, e.g. from the map or directly from the list of streams and requests the live stream bundle (step 656).

[0129] When the stream bundle is selected, the application starts to render the audio and video streams according to the location and orientation context information. The application may also receive panoramic content together with the selected live video. The service will carry the latest and most relevant panorama.

[0130] When the application receives the panorama, it renders the panorama (step 658) relative to the live video streams and maintains alignment (step 662) between the panorama and the live video streams. The sensor information about the orientation, e.g., about the vertical and horizontal alignment of the recording device, is available in each stream. Hence, the position is known relative to the panorama. The receiving user may explore the content in the composite video in step 660 by rotating the playback device and/or by scrolling the display screen of the device. Rendering the audio-visual composition.

[0131] In various embodiments, the content consuming application may receive a bundle of live video streams comprising a video stream, surround audio, and a panorama. All this is rendered on the screen and played back with the audio-visual equipment.

[0132] FIG. 15 depicts one embodiment of a live video recording scenario where two individual streams are captured from the same general location. Recorder 1502 is capturing a first target 1504 and recorder 1506 is capturing a second target 1508. The resulting streams may contain different content, but may share the same environment, the same audio image, and in some cases the same external panoramic content. Generally, but not necessarily, all sound sources around the recording devices are available in both streams. Another audio source 1510 may be present on the scene but may not be present in video captured by the recorders 1502 and 1506.

[0133] The live video streaming service identifies the streams coming from substantially the same location and containing similar context. Hence, they are bundled together into a single stream.

[0134] The receiving application may operate to render these two live video streams on the screen and to stitch the panorama on the background. FIGs. 16A-16B illustrate an embodiment with the two individual video streams on a 360-degree background, and immersive surround sound may create an audio image around a receiving user. One or both of the videos may actually comprise 360-degree content, but this is not necessarily the case.

[0135] As discussed above, FIGs. 16A-16B illustrate that both streams have basically identical audio environments, with the same natural audio sources as well as the same background panorama. FIGs. 16A and 16B schematically illustrate a panorama background 1602 on which live video streams 1604 and 1606 may be displayed as overlays. The live video stream displayed to a user 1608 may depend on the direction the user is facing. Audio is also provided to the user 1608 from a main audio source 1610 for live video stream 1604 and from a main audio source 1612 for live video stream 1606. A secondary audio source 1614 that does not appear within the video streams may also be rendered.

[0136] When the receiving application is rendering them together, the composition may be analyzed and identical sound sources handled properly. Generally, just adding two surround sound environments together is not sufficient. For example, simply mixing the two audio streams together will create two representations of a secondary audio source (not visible on either of the video streams). It would sound like having two almost identical speakers in the given direction.

[0137] The rendering application receiving streams illustrated in FIGs. 16A-16B may find all common sound sources. For example, in FIGs. 16A-16B, both streams contain three sound sources that are actually the same. The main audio source 1610 in live video stream 1604 is the same as the secondary audio source 1610 in live video stream 1606, etc.

[0138] The receiving application may prioritize the audio stream that has the best objective quality parameters that are included as side information. The surround sound stream with the best dynamic range, highest bit rate and best quality recording equipment is prioritized.

[0139] The 360-degree audio environment can be split in sectors around the narrow field live video stream. When the objective quality of the content is the same, the selection border is exactly between the video streams. Otherwise, the stream that has the best objective quality surround audio stream will get the widest sector. The audio corresponding to the target in the actual live video is always taken from the corresponding audio stream, but the rest of the environment is selected based on different criteria.

[0140] The two audio streams in FIGs. 16A-16B are combined together in FIG. 17. In this example, the live video stream 1604 may have slightly better objective quality audio stream, and therefore, it has wider sector, from selection border 1702 counterclockwise through selection border 1704. As a result, the secondary audio source 1610 of live video stream 1604 is selected while the corresponding audio in live video stream 1606 is filtered away. In this example, the background view has similar quality difference. Hence, the selection border for stitching the content is the same as that used for the audio image.

[0141] The audio processing may comprise spatial filtering. The surround sound from different live streams may be processed in such a way that the audio from selected directions is passed through. For example, the multi-channel surround audio may be analyzed with a Binaural Cue Coding (BCC) method. BCC parameterization is an efficient for analyzing the direction of arrival of different audio components in the multi-channel audio. BCC parameters that correspond to a selected direction of arrival are kept whereas parameters corresponding directions that may be tuned down are set to zero. When tuned coefficients are applied to recover the surround sound and combined with other streams, the result is a full surround audio with audio sources originating from different bit streams.

[0142] In some embodiments, similar sector analysis is performed for the background panorama as well if there are multiple panorama content streams available. Both video streams may include 360-degree content. In this case, the stitching that combines multiple panorama content streams together may follow the same criteria extracted from the audio. In addition, a similar objective quality criterion may be applied when combining content.

[0143] The analysis of the surround sound is a continuous process since the content is dynamic and locations of different live video streams may constantly be changing. In some embodiments, especially when the composition of different streams is evolving, the server is constantly searching for new connections with other streams and dropping others.

[0144] The width of the sector around the corresponding live video stream angle may depend on the selection criteria.

[0145] One embodiment of a user experience for a receiving user 1800 is illustrated in FIG. 18. The 360-degree environment may comprise one or more live videos of live video targets 1802 and 1804, surround sound streams (containing also the non-visible sound source 1806), and/or panoramas rendered, possibly seamlessly, around the receiving user. The user may explore the presentation by turning around or simply swiping and zooming the screen on the receiving device, or the like.

[0146] In some embodiments, such as if the receiving user is wearing a virtual reality headset, if the user turns their head, the application may detect the movement with onboard sensors and render the view accordingly. In some embodiments, the audio image is rotated identically.

Audio rendering.

[0147] In some embodiments, when the receiving user is wearing a headset with binaural equipment (e.g., with headphones or earbuds, or the like), or has high quality stereo or multichannel output from their device, the application may render the surround audio image. However, in other embodiments, when the receiving user has only a mono loudspeaker output, there is no means to represent the whole audio image as 3D. In this case, to help the user distinguish sources and follow a target, such as on the screen, or when the user turns their device to explore different directions of the 360-degree image, any audio sources disappearing from the image may be tuned down. Also, when a new audio source appears within the receiving user's view, the source in the corresponding direction may be tuned up.

[0148] The rendering of the full surround audio image may be different since the application is able to represent the whole 3D environment. Each individual surround audio stream may be analyzed with BCC parameterization in order to control the presentation, for example comprising a composition of two or more different spatial audio images. Controlling the BCC coefficients from different individual audio streams enables picking up separate image sectors and combining them into a full 3D image. FIG. 19 illustrates a block diagram of one embodiment of the 3D audio rendering of spatial audio bit stream. The spatial audio parameters may contain BCC type coefficients that represent the spatial image. These may be applied to expand the stereo or mono core audio layer encoded, for example, with an AAC codec (or the like) to spatial/3D audio. When the receiving application is receiving more than one audio stream, the spatial audio parameters may be tuned already before they are applied in 3D rendering.

[0149] One embodiment of the tuning curves of both mono and surround sound parameters based on the location of the live video streams is illustrated in FIG. 20. FIG. 20 illustrates a panoramic background 2002 along with live video streams 2004 and 2006 that may be displayed as overlays over the panoramic background 2002. An audio source 2008 is present in video stream 2004, and an audio source 2010 is present in video stream 2006. Another audio source 2012 may be provided outside of the field of view of the videos 2004 and 2006. As illustrated in FIG. 20, the audio levels from different sources may depend on the direction the receiving user is facing (or a direction otherwise selected through a user interface). Level line 2014 represents the level of the audio from source 2008 as a function of the viewing direction of the user. Level line 2016 represents the level of the audio from source 2010 as a function of the viewing direction of the user.

[0150] The tuning, and hence, the coverage of each bit stream may be controlled according to the objective quality criteria. Alternatively, in some embodiments, the tuning is controlled based on user preference(s). The application may track the orientation of the device and emphasize the directions in which the user is pointing. The spatial image may be tuned down everywhere else. When the user is turning away from a video stream, the corresponding mono audio level or surround sound source BCC parameterization may be controlled according to the tuning curves in the corresponding direction. This may help the user follow certain sound sources in the 360-degree view.

[0151] Tuning curves define the sectors that are picked up from different audio streams. In practice, BCC coefficients such as inter channel level and time difference cues corresponding to a certain direction of arrival are multiplied by the level tuning curve value in the corresponding direction. Thus, the level curve may act as a spatial filter for the surround sound.

[0152] The level tuning may affect audio sources which are not in the direction of any of the videos. For example, the sound source not visible on the videos in FIGS. 18 and 20 may be tuned with the curves according the source location. As shown in FIG. 20, audio source 2012 outside live video streams may be included in the presentation either from the stream 2004 or stream 2006, based on objective quality criteria.

[0153] In some embodiments, the audio tuning parameter overlap location in FIG. 20 may depend on the objective quality of different streams. For example, the higher the quality of a particular stream compared to the other streams, the wider coverage the corresponding audio image may receive. In an exemplary scenario, shown in the illustration of FIG. 20, the audio source 2010 covers a wider area because it has higher objective quality.

[0154] The selection criteria may also contain other than objective quality rules. For example, an audio image behind the recording user may be selected from another live stream that is better positioned for the task.

[0155] In some embodiments, the audio image selection process may be continuous as the user may change the point of interest at any point and as audio sources are changing their position.

[0156] In some embodiments, the receiving application may render the full audio image for the receiving user. In one embodiment, a scale from 0 to 360 degrees may be fixed to a compass. Thus, when the receiving user turns their device or swipes the image on the screen (or the like) the audio image may be rendered relative to the orientation of the user. In this case, the audio source 2010, rendered based on level tuning curves mainly from stream 2006, is coming from the compass direction of 200 degrees regardless of the direction the receiving user is looking.

[0157] In some embodiments, such as where the receiving user has a mono audio output, the audio representation may be further controlled with spatial tuning. The audio rendering may tune down an audio image outside the user's viewpoint sector. As shown in FIG. 21, the spatial tuning may handle the immersive audio image first. In one embodiment the application may also use the same level tuning for surround sound processing. In this case the receiving user may experience only audio coming from the direction the receiving user is looking. FIG. 21 illustrates user viewpoint sector 2012 and tuning level 2104 for mono output in addition to the components illustrated in FIG. 20.

[0158] A block diagram for one embodiment of a complete audio processing chain is shown in FIG. 22. Each individual recording application may capture audio with one or more microphones (step 2202). In some embodiments, multi-channel audio captures the spatial audio and distinguishes different audio sources in different directions around the recording device. Mono recording may have the disadvantage of mixing all sound sources together. In this case noise cancellation may filter out background noise. The captured audio signal is streamed to a live streaming service API using a streaming protocol such as RSTP (step 2204) along with contextual information regarding the stream (step 2206).

[0159] The live streaming service may bundle different live streams originating from the same location with similar context (step 2208). For example, DASH protocols support combining streams together. Alternatively, other techniques as known to one of ordinary skill in the art may be used. [0160] The receiving application may unbundle the stream(s) and decode each individual audio stream separately. As the live video stream includes information about the location and orientation of the recording device, the receiving device may align the video presentation and corresponding audio in the correct direction on the 360-degree background (step 2210).

[0161] The objective quality information regarding the audio stream (e.g., dynamic range, number of channels, bit rate, sampling rate, etc.) may be applied to prioritize each audio stream. Higher quality audio may be allocated with a wider sector in the 360-degree space compared to lower quality audio (step 2212).

[0162] Spatial filtering (step 2214) may be performed as follows. In situations where an audio stream contains mono audio, the application may allocate only a narrow sector covering mainly the corresponding video presentation frame. That is, the mono content is rendered in the direction of the corresponding video presentation on the 360-degree axis. Otherwise the audio image is filled with multi-channel content from other streams. Audio streams are rendered in their allocated directions relative to the video presentation (step 2216). It may be beneficial to apply efficient noise cancellation to the mono audio to filter out any background noise or sound sources that are not "present" in the video. The surround sound components from the other streams may handle the rest of the image.

[0163] In some embodiments, the 360-degree audio image may be split evenly between the streams when all streams have identical objective quality metrics.

[0164] In some alternative embodiments, the collaborative live streaming systems and methods set forth above may also operate without panorama content. For example, the 360-degree background outside the rendered video stream may be handled with a preselected background content available in the receiving application or in the service. The receiving device may then render the immersion or collaborative live stream with surround audio only. The user may switch between different live video streams by turning the device or swiping the screen, or the like. However, the surround audio composed from different audio sources may be rendered as a full 360-degree (3D) audio image.

[0165] In some embodiments, the 3D presentation may rely on a single video stream and stream several different audio images. The content may in this case comprise several dedicated sound sources in different directions captured with different devices by different users. In some cases, the receiving application or server may omit the video component and render only the sound sources in their corresponding directions.

[0166] As in the exemplary scenarios previously discussed in relation to FIGs. 12 to 14, there may be two (or more) individual live video stream, e.g., the streams shown in FIGs. 12A and 12B, captured in the same location but from different targets. These individual streams alone do not provide any additional information about the environment. If the stream contains surround sound (and panoramic background), does the receiving user experience a full 3D presentation.

[0167] Recording users or external parties may provide the visual panoramic content of the environment. And, as presented earlier, the application may stream this content to the server. The receiving application may collect the available panoramic content from the stream and apply it as a background for the live videos.

[0168] The live video streaming server may collect the streams in a single MPD manifest file and make the file available. The receiving application may pick the segment information from the server and stream different audio visual streams, such as by using the DASH protocol.

[0169] FIG. 23 illustrates one embodiment of the stitching of two (or more) individual live streams 2302, 2304 relative to each other and a composed 360-degree panorama 2306.

[0170] One embodiment of a full composition of multiple (narrow field) live videos, panorama composition, and surround sound around the receiving user is illustrated in FIG. 24. FIG. 24 illustrates 360-degree panorama 2306 including live video streams 2302, 2304 and corresponding respective live audio streams 2402, 2404.

[0171] FIG. 25 illustrates an exemplary scenario of a panorama representation with stitched videos. The audio sources are rendered in corresponding directions. A sound source in the middle of video stream 2502 is rendered from the audio 2506 of stream 2502, and the audio source in video 2504 is rendered from the audio 2508 of stream 2502. In addition, audio streams may carry sources that are not visible in live video streams. For example, in FIG. 25, both live video streams contain surround sound in addition to the main target on the live video. Both streams contain basically identical audio images with generally the same sound sources. They both also have the main sound source appearing on the video and two additional sources. The surround audio from different streams is allocated to the immersive presentation based on the objective criteria. In this exemplary scenario, the audio component from stream 2502 covers most of the environment. Secondary audio sources outside the video range in stream 2504 are tuned down.

[0172] A similar combination may also be applied for the background panorama. The view may be stitched together from two (or more) sources and the selection border may be based on the objective quality criteria.

[0173] Alternatively, the panoramic content from third party or external users may also carry surround sound. In this case, the immersive audio image is again prioritized based on the objective criteria. For example, in FIG. 25, the audio could be composed of three sources. Audio from stream 2502 may handle the sector on video presentation 2502 and audio from stream 2504 may cover only the video presentation 2504. The remaining image may be covered with audio from panorama content. In addition, there may be streams containing only dedicated sound sources in certain directions. For example, the server may have picked up only the mono audio covering the audio sources on the background.

[0174] Additional exemplary embodiments are described with respect to FIGs. 26-28. FIG. 26 illustrates a view of live streaming video 2602 that may be displayed on a user device, for example on the screen of a smartphone or tablet, or on a head-mounted display. Two individuals are seen in the field of view of the live streaming video of FIG. 26, and each of these individuals has a respective camera (or device equipped with camera) 2604, 2606, each of which are also capturing live video streams. A first selectable user interface element 2608 is displayed to indicate the position, within the currently-viewed live stream 2602, of the device capturing a first alternate live stream, and a second selectable user interface element 2610 is displayed to indicate the position, within the currently- vie wed live stream 2602, of the device capturing a second alternate live stream. A user may select one or more of the selectable user interface elements 2608, 2610, e.g. using a touch screen or other user input. While the selectable user interface elements are displayed as a highlighted border in FIG. 26, it should be noted that the selectable user interface elements may take other forms. In some embodiments, the size of the selectable user interface element is modified based on the distance of the respective devices capturing the alternate streams, with more-distant devices being indicated using a smaller user interface element.

[0175] For exemplary purposes, consider a case in which the user selects the first user interface element 2608, corresponding to the first alternate live stream being captured by device 2604. In response to this selection, the user device may display the first alternate live video stream 2702, as illustrated in FIG. 27. In some embodiments, the first alternate live video stream is displayed such that an orientation of the view of the first alternate live video stream substantially aligns with an orientation of the view of the originally-displayed live video stream. For example, if the display of the original live stream represents a view to the north, then the display of the selected alternate video stream may also be selected to provide a view to the north.

[0176] Embodiments as illustrated in FIG. 26-27 allow a viewer of live streams to select among different live streaming views in the same general location, for example to get a better view or to explore different perspectives. For example, the user viewing the original live stream 2602 of FIG. 26 may wish to see a better view of the group of buildings to the left. The user sees from the first selectable user interface element 2608 that a live stream is being captured from a location that is nearer to the group of buildings of interest. The user thus selects the first user interface element 2608 and obtains the live stream 2702 of FIG. 27 from more close-up position. In some embodiments, a "back" button user interface element 2704 may be provided to allow the user to return to viewing from a previous perspective.

[0177] In some embodiments, the selected alternate live stream is displayed as an overlay on the originally-viewed live stream and/or on a panoramic background. For example, the second alternate live stream may be a live stream of a person (e.g. a tour guide) speaking. For exemplary purposes, consider a case in which the user of FIG. 26 selects the second user interface element 2610, corresponding to the second alternate live stream, where the second alternate live stream 2802 is streaming video of a tour guide speaking. The user device displays a composite video 2804 that includes the video 2802 of the tour guide superimposed on the background stream 2602 at a location corresponding to the location at which the stream is being captured, as displayed in FIG. 28. In some embodiments, the user may further select the video overlay to switch to a full-screen view of the second alternate video stream.

[0178] Embodiments such as those illustrated in FIGs. 26-28, among other embodiments described herein, may operate using a streaming server that performs a method as follows. The streaming server receives a plurality of live video streams. Along with the video streams, the server receives information on the location of the device capturing the stream. The location may be determined by, e.g. GPS or A-GPS functionality of the respective devices. The server may further receive information on the orientation of the image-capturing devices, which may be captured using, for example, an on-board magnetic compass, the readings of which may be stabilized using gyroscopic sensors. Visual cues (e.g. positions of identifiable landmarks) may also be used in the determination of camera orientation. Image-capturing devices may also provide to the streaming server information on their respective fields of view. This information may be, for example information representing a vertical angle and a horizontal angle.

[0179] The streaming server receives from a user device a request for a first live video stream, and the streaming server responds by sending the first live video stream to the user device, which in turn displays the first live video stream to the user. The streaming server further operates based on the received position and orientation information to determine a field of view corresponding to the live stream that is being delivered to the user device. The streaming server further operates to identify one or more other live video streams being captured by devices that are within the determined field of view. Information representing the positions from which these one or more live streams are being captured is provided to the user device. This position information may be provided in absolute coordinates (e.g. GPS coordinates and compass heading) or in coordinates relative to the field of view (e.g. polar coordinates or pixel coordinates). The position information may be sent in a manifest, such as a DASH MPD. The user device receives this position information and, based on the position information, overlays a user interface element at the corresponding location on the display of the first live video stream.

[0180] One or more actions may be performed in response to user selection of a user interface element that corresponds to a second live stream. For example, the second live stream may be displayed as on overlay on the first live stream (see FIG. 28), or the second live stream may be displayed instead of the first live stream (see FIG. 27).

Exemplary hardware.

[0181] Exemplary embodiments disclosed herein are implemented using one or more wired and/or wireless network nodes, such as a wireless transmit/receive unit (WTRU) or other network entity.

[0182] FIG. 29 is a system diagram of an exemplary WTRU 102, which may be employed as a user device in various embodiments described herein. As shown in FIG. 29, the WTRU 102 may include a processor 118, a communication interface 119 including a transceiver 120, a transmit/receive element 122, a speaker/microphone 124, a keypad 126, a display/touchpad 128, a non-removable memory 130, a removable memory 132, a power source 134, a global positioning system (GPS) chipset 136, and sensors 138. It will be appreciated that the WTRU 102 may include any sub-combination of the foregoing elements while remaining consistent with an embodiment.

[0183] The processor 118 may be a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGAs) circuits, any other type of integrated circuit (IC), a state machine, and the like. The processor 1 18 may perform signal coding, data processing, power control, input/output processing, and/or any other functionality that enables the WTRU 102 to operate in a wireless environment. The processor 118 may be coupled to the transceiver 120, which may be coupled to the transmit/receive element 122. While FIG. 29 depicts the processor 118 and the transceiver 120 as separate components, it will be appreciated that the processor 118 and the transceiver 120 may be integrated together in an electronic package or chip.

[0184] The transmit/receive element 122 may be configured to transmit signals to, or receive signals from, a base station over the air interface 116. For example, in one embodiment, the transmit/receive element 122 may be an antenna configured to transmit and/or receive RF signals. In another embodiment, the transmit/receive element 122 may be an emitter/detector configured to transmit and/or receive IR, UV, or visible light signals, as examples. In yet another embodiment, the transmit/receive element 122 may be configured to transmit and receive both RF and light signals. It will be appreciated that the transmit/receive element 122 may be configured to transmit and/or receive any combination of wireless signals.

[0185] In addition, although the transmit/receive element 122 is depicted in FIG. 29 as a single element, the WTRU 102 may include any number of transmit/receive elements 122. More specifically, the WTRU 102 may employ MFMO technology. Thus, in one embodiment, the WTRU 102 may include two or more transmit/receive elements 122 (e.g., multiple antennas) for transmitting and receiving wireless signals over the air interface 116.

[0186] The transceiver 120 may be configured to modulate the signals that are to be transmitted by the transmit/receive element 122 and to demodulate the signals that are received by the transmit/receive element 122. As noted above, the WTRU 102 may have multi-mode capabilities. Thus, the transceiver 120 may include multiple transceivers for enabling the WTRU 102 to communicate via multiple RATs, such as UTRA and IEEE 802.11, as examples.

[0187] The processor 118 of the WTRU 102 may be coupled to, and may receive user input data from, the speaker/microphone 124, the keypad 126, and/or the display/touchpad 128 (e.g., a liquid crystal display (LCD) display unit or organic light-emitting diode (OLED) display unit). The processor 118 may also output user data to the speaker/microphone 124, the keypad 126, and/or the display/touchpad 128. In addition, the processor 118 may access information from, and store data in, any type of suitable memory, such as the non-removable memory 130 and/or the removable memory 132. The non-removable memory 130 may include random-access memory (RAM), read-only memory (ROM), a hard disk, or any other type of memory storage device. The removable memory 132 may include a subscriber identity module (SIM) card, a memory stick, a secure digital (SD) memory card, and the like. In other embodiments, the processor 118 may access information from, and store data in, memory that is not physically located on the WTRU 102, such as on a server or a home computer (not shown).

[0188] The processor 118 may receive power from the power source 134, and may be configured to distribute and/or control the power to the other components in the WTRU 102. The power source 134 may be any suitable device for powering the WTRU 102. As examples, the power source 134 may include one or more dry cell batteries (e.g., nickel-cadmium (NiCd), nickel- zinc (NiZn), nickel metal hydride (NiMH), lithium-ion (Li -ion), and the like), solar cells, fuel cells, and the like.

[0189] The processor 118 may also be coupled to the GPS chipset 136, which may be configured to provide location information (e.g., longitude and latitude) regarding the current location of the WTRU 102. In addition to, or in lieu of, the information from the GPS chipset 136, the WTRU 102 may receive location information over the air interface 1 16 from a base station and/or determine its location based on the timing of the signals being received from two or more nearby base stations. It will be appreciated that the WTRU 102 may acquire location information by way of any suitable location-determination method while remaining consistent with an embodiment.

[0190] The processor 118 may further be coupled to other peripherals 138, which may include one or more software and/or hardware modules that provide additional features, functionality and/or wired or wireless connectivity. For example, the peripherals 138 may include sensors such as an accelerometer, an e-compass, a satellite transceiver, a digital camera (for photographs or video), a universal serial bus (USB) port, a vibration device, a television transceiver, a hands free headset, a Bluetooth® module, a frequency modulated (FM) radio unit, a digital music player, a media player, a video game player module, an Internet browser, and the like.

[0191] FIG. 30 depicts an exemplary network entity 190 that may be used in embodiments of the present disclosure. As depicted in FIG. 30, network entity 190 includes a communication interface 192, a processor 194, and non-transitory data storage 196, all of which are communicatively linked by a bus, network, or other communication path 198.

[0192] Communication interface 192 may include one or more wired communication interfaces and/or one or more wireless-communication interfaces. With respect to wired communication, communication interface 192 may include one or more interfaces such as Ethernet interfaces, as an example. With respect to wireless communication, communication interface 192 may include components such as one or more antennae, one or more transceivers/chipsets designed and configured for one or more types of wireless (e.g., LTE) communication, and/or any other components deemed suitable by those of skill in the relevant art. And further with respect to wireless communication, communication interface 192 may be equipped at a scale and with a configuration appropriate for acting on the network side— as opposed to the client side— of wireless communications (e.g., LTE communications, Wi-Fi communications, and the like). Thus, communication interface 192 may include the appropriate equipment and circuitry (perhaps including multiple transceivers) for serving multiple mobile stations, UEs, or other access terminals in a coverage area.

[0193] Processor 194 may include one or more processors of any type deemed suitable by those of skill in the relevant art, some examples including a general-purpose microprocessor and a dedicated DSP.

[0194] Data storage 196 may take the form of any non-transitory computer-readable medium or combination of such media, some examples including flash memory, read-only memory (ROM), and random-access memory (RAM) to name but a few, as any one or more types of non- transitory data storage deemed suitable by those of skill in the relevant art could be used. As depicted in FIG. 30, data storage 196 contains program instructions 197 executable by processor 194 for carrying out various combinations of the various network-entity functions described herein.

[0195] Although features and elements are described above in particular combinations, one of ordinary skill in the art will appreciate that each feature or element can be used alone or in any combination with the other features and elements. In addition, the methods described herein may be implemented in a computer program, software, or firmware incorporated in a computer- readable medium for execution by a computer or processor. Examples of computer-readable storage media include, but are not limited to, a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD- ROM disks, and digital versatile disks (DVDs). A processor in association with software may be used to implement a radio frequency transceiver for use in a WTRU, UE, terminal, base station, RNC, or any host computer.

Claims

CLAIMS We claim:

1. A method, performed by a user device, for switchable display of live video streams to a user, the method comprising:

receiving a first live video stream and information indicating a first capture location at which the first live video stream is being captured;

displaying the first live video stream to the user;

receiving metadata identifying at least a second live video stream and a second capture location at which the second live video stream is being captured;

displaying a selectable user interface element as an overlay on the first live video stream at a display position corresponding to the second capture location.

2. The method of claim 1, further comprising:

based on a viewing orientation selected by the user, determining a cone of view of the user; wherein displaying the first live video stream comprises displaying only a portion of the first live video stream that is within the cone of view.

3. The method of claim 2, wherein the selectable user interface element is displayed only if the position corresponding to the second capture location is within the cone of view.

4. The method of claim 2, wherein the viewing orientation is determined based on a scrolling input from the user.

5. The method of claim 2, wherein the user device is a head-mounted display, and wherein the viewing orientation is determined based on an orientation of the head-mounted display.

6. The method of claim 5, wherein the orientation is determined based at least in part on a magnetic compass in the head mounted display.

7. The method of claim 1, further comprising:

receiving user input selecting the selectable user interface; and

responsively displaying the second live video stream to the user.

8. The method of claim 1, wherein the metadata identifying the second live video stream and the second capture location are received in a manifest file.

9. The method of claim 1, further comprising receiving information indicating a first orientation with which the first live video stream is being captured.

10. The method of claim 1, further comprising:

receiving a panoramic background image corresponding to the first capture location; generating a first composite video in which the first live video stream is stitched into the panoramic background image;

wherein displaying the first live video stream comprises displaying the first composite video.

11. The method of claim 10, further comprising:

receiving user input selecting the selectable user interface, and responsively:

generating a second composite video in which the second live video stream is stitched into the panoramic background image; and

displaying the second first composite video.

12. The method of claim 1, wherein the first live video stream, the information indicating the first capture location, and the metadata are received from a live video streaming service.

13. The method of claim 1, further comprising:

receiving, for each of a plurality of live video streams, metadata identifying the respective live video stream and a respective capture location at which the live video stream is being captured; for each of the plurality of live video streams, displaying respective selectable user interface elements as overlays on the first live video stream at respective display positions corresponding to the second capture location.

14. A system comprising a processor, a display, and a non-transitory computer-readable storage medium storing instructions operative to perform functions comprising:

displaying the first live video stream to a user;

15. The system of claim 14, implemented in a head-mounted display.