WO2018236715A1 - Predictive content buffering in streaming of immersive video - Google Patents

Predictive content buffering in streaming of immersive video Download PDF

Info

Publication number
WO2018236715A1
WO2018236715A1, PCT/US2018/038012, US2018038012W
Authority
WO
WIPO (PCT)
Prior art keywords
video content
immersive video
tiles
stream
profile
Prior art date
Application number
PCT/US2018/038012
Other languages
French (fr)
Inventor
Lukas KROEPFL
Wolfram HOFMEISTER
Mario Graf
Daniel Weinberger
Christopher Mueller
Reinhard GRANDL
Stefan Lederer
Original Assignee
Bitmovin, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bitmovin, Inc. filed Critical Bitmovin, Inc.
Publication of WO2018236715A1 publication Critical patent/WO2018236715A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/038Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/04815Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/21805Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/23439Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H04N21/2393Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/251Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/252Processing of multiple end-users' preferences to derive collaborative data
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/258Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
    • H04N21/25866Management of end-user data
    • H04N21/25891Management of end-user data being end-user preferences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/262Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
    • H04N21/26258Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/65Transmission of management data between client and server
    • H04N21/658Transmission by the client directed to the server
    • H04N21/6587Control parameters, e.g. trick play commands, viewpoint selection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/816Monomedia components thereof involving special video data, e.g. 3D video

Definitions

  • This disclosure generally relates to streaming and playback for immersive video content, and more particularly to an improved predictive streaming approach using artificial neural networks.
  • tile-based streaming For example, with reference to Fig. 1, using tile-based streaming, VR content is split into several smaller rectangular regions called "tiles," which are separately encoded for each Segment 104a,b and transmitted over the network in different qualities, such as high-quality tiles 101a-l, medium-quality tiles 102a-l, and low-quality tiles 103a-l.
  • tile-based streaming technology only the tiles 101e, 101f, 101g, 101i, 101j and 101k for the current viewport 105 have to be transmitted in higher quality, while the tiles for the currently invisible parts of the scene 106 (tiles 103a-103d, 103h, and 103l or tiles 102a-102d, 102h, and 102l, or a combination of these two lower quality sets) can be downloaded in a lower quality, leading to bandwidth savings.
  • This concept is illustrated in Fig. 1.
  • the quality of the downloaded tiles is chosen reactively: the playback software in the client 100 can only start downloading the needed tiles in a first frame 104a when the user actually turns her head and looks at them. This leads to a noticeable delay between the user looking at a specific part of the environment and the playback software adapting to the changed viewport 105. This delay is influenced by different factors like segment length, network round trip time ("RTT") and video buffer size on the client side.
  • RTT network round trip time
  • the playback software might not be able to adapt and download new tiles fast enough, if at all. Until those higher quality tiles are loaded, the user is presented with low quality tiles, leading to a degraded QoE. In some instances, the user may be presented low quality tiled content most of the time.
  • a method for streaming immersive video content for playback in a motion-controlled playback device includes providing a manifest file for an immersive video content.
  • the manifest file identifies two or more streams for the immersive video content, including a stream with a higher quality of video than another stream.
  • Each stream of the immersive video content comprises a set of video segments with a plurality of tiles corresponding to an immersive view of a scene in the video content.
  • the method also includes providing a prediction profile corresponding to the immersive video content.
  • the prediction profile may include one or more predicted views for at least one segment of the immersive video stream. Each predicted view identifies a subset of tiles likely to be viewed by a user.
  • the method includes receiving a request for a set of higher quality tiles from a segment in the first stream of the immersive video content.
  • the tiles requested correspond to a predicted view in the prediction profile for the segment requested.
  • the method further includes receiving another request for another set of lower quality tiles from the same segment in the lower quality stream of the immersive video content.
  • the method may also include receiving head movement data corresponding to a playback of the immersive video content, training a neural network based, at least in part, on the head movement data; and generating the prediction profile corresponding to the immersive video content based on the neural network.
  • the method can also include receiving eye-tracking data corresponding to a playback of the immersive video content.
  • training the neural network can also be based on the eye-tracking data, at least partially.
  • a method for streaming immersive video content for playback in a motion-controlled playback device includes receiving a manifest file for an immersive video content and receiving a prediction profile corresponding to the immersive video content.
  • the prediction profile may also include predicted views for at least a segment that identifies a subset of tiles in the segment that are likely to be viewed by a user.
  • the method also includes determining a set of tiles from a segment of the immersive video content based on the predicted view for that segment and another set of tiles from the same segment corresponding to sections of the scene less likely to be viewed by the user.
  • the method also includes sending requests for one set of tiles likely to be viewed from the high-quality stream identified in the manifest file and for the other set of tiles, less likely to be viewed, from the lower-quality stream identified in the manifest file. Then, the method includes playing back the segment of the immersive video content.
  • the prediction profile is an output of a neural network trained with movement data corresponding to playing back of the immersive video content.
  • the movement data used for training may, for example, include head movement data, eye movement data, or both.
  • the prediction profile may be refined based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the immersive video content.
  • the method may further include predicting a head movement associated with the segment of the immersive video content and sending actual head movement data for refining the prediction profile with a neural network.
  • each predicted view may be provided as predicted head movements associated with a segment of the immersive video content.
  • the immersive video content may be, for example, a 360° video.
  • the motion-controlled playback device may be a virtual reality headset.
  • the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream.
  • a method can also include receiving yet another request for a set of lower quality tiles from the same segment in the lowest-quality stream of the immersive video content identified in the manifest file.
  • the predicted views in the prediction profile can also include a probability of being viewed by a user.
  • the method may also include providing a content provider profile that provides a threshold probability of a predicted view for requesting tiles from the high-quality stream.
  • the method can also include selecting the prediction profile corresponding to the immersive video content.
  • the selection of the prediction profile may be based on similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device.
  • the selection may also be based on similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
  • FIG. 1 illustrates an exemplary tile-based streaming approach.
  • FIG. 2 is a block diagram illustrating exemplary steps, input and output involved in the training process according to one embodiment of the disclosure.
  • FIG. 3 is a block diagram illustrating an exemplary embodiment of a head movement prediction process according to one embodiment.
  • FIG. 4 is a block diagram illustrating a neural network refinement approach according to one embodiment.
  • FIG. 5 is a diagram illustrating a process for loading prediction data and different quality tiles according to one embodiment.
  • the state of the art approach for tile-based streaming has several drawbacks and disadvantages.
  • a proactive streaming approach is provided.
  • the playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them. Higher quality tiles are downloaded based on a prediction of where the user is likely to look at next. Thus, higher quality tiles for the scene the user is currently looking at have already been downloaded in advance, greatly improving the QoE.
  • an artificial neural network (“ANN”) is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look at next for any given input video.
  • the predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network (“CDN”) traffic, which in turn lessens the costs for the content provider.
  • CDN content delivery/distribution network
  • the network provides the video player with information on which parts of the immersive video content shall be buffered in advance. This allows the use of longer segment durations and a larger buffer on the client side, which helps to significantly improve QoE for tile-based streaming, while simultaneously reducing the amount of data that needs to be sent over the network and thus the costs for the content providers.
  • an ANN is trained to produce head movement predictions.
  • the training dataset for the ANN consists of a large number of training instances, which in turn consist of a video file and the corresponding head movement tracking data, as recorded by the video playback software.
  • Video files that are part of the training dataset preferably have substantially identical spatial dimensions. The video files can be scaled to a specific resolution in a preprocessing step.
  • an input 360° video 201 is input into a training system 200.
  • a head movement recording module 202 receives motion sensor data indicative of the movement of a user's head.
  • a motion-controlled playback device 210 such as a Samsung Gear VR, an Oculus Rift, an HTC Vive, a Google Daydream, or similar, includes motion sensors, such as gyroscopes, accelerometers, and the like, to detect user movement and control playback of the input video 201.
  • a display within the device 210 shows a different view of the 360° video.
  • the motion control data is provided to the head movement recording module 202.
  • the head movement recording module 202 creates a record 203 of the sensed head movement for the given input video 201.
  • head movement prediction accuracy may be improved for a given input video 201 by training the network with data that is preferably representative of all types of content and viewer groups.
  • the recorded tracking data 203 may be input to a preprocessing module 204 where it can be filtered prior to using it to train the network, and outliers can be detected and removed from the training set generating a set of training data instances 205 prior to training the ANN.
  • the training data instances 205 are used in a training module 206 to train the ANN.
  • the goal of the training module 206 is to create a prediction model 207 that is able to predict where the viewer is likely going to look at for any arbitrary input video, regardless of the type of content or length, as accurately as possible.
  • the accuracy of the learned model 207, and thus the prediction depends on the size of the training dataset. Accordingly, larger training sets are preferable.
  • the learned model of head movements, or prediction profiles 207 may be saved into separate files that are accessible to the ANN module for future streaming during playback sessions.
  • the ANN predicted head movements can then be referenced in the stream manifest for streaming during playback.
  • the head movement predictions may be referenced in the manifest file in similar fashion to how spatial relationship descriptors ("SRDs") are referenced within MPEG-DASH manifests.
  • SRDs spatial relationship descriptors
  • the predicted head movement data may be directly embedded into the video stream itself.
  • HLS HTTP Live Streaming
  • Microsoft Smooth Streaming or the like.
  • content-specific or viewer-type-specific models may be used.
  • additional metadata may be collected with respect to a given prediction model to improve the accuracy of the prediction model.
  • the metadata may include characteristics for the content itself, of the user, or both.
  • user profile information may be collected, including for example, demographic information such as sex, age, nationality, or the like, and/or personal interest information, such as preferred video content category (e.g., action, scenery, travel, and the like), and/or user biometric or physiological information, such as heartrate and blood pressure while consuming immersive content, and/or other metadata, such as manual user annotations.
  • demographic information such as sex, age, nationality, or the like
  • personal interest information such as preferred video content category (e.g., action, scenery, travel, and the like)
  • user biometric or physiological information such as heartrate and blood pressure while consuming immersive content
  • other metadata such as manual user annotations.
  • Content profile information may alternatively or in addition be used with metadata regarding the content, such as for example, category (action, adventure, etc.), maturity level/rating, actors, director, ratings, and the like.
  • the training system 200 extracts the metadata from the input video 201 and includes it in the model 207.
  • the metadata may be collected and used to classify head movement data into various model types. Classification may be done according to known classification techniques, including for example, clustering techniques, similarity metrics (e.g., cosine similarity), or the like.
  • the training data 205 can be segregated based on the metadata into the model types and used to generate different ANNs for each model type.
  • an ANN may be generated for male users under 20 with a preference for action video content.
  • Corresponding metadata collected from the playback device user is matched against the model metadata to select the appropriate model prior to playback. Matching of the user profile to the prediction model profile may be done according to known similarity metrics, such as for example, the cosine similarity approach.
  • content-related metadata is signaled to the playback device which guides the client in selecting the right tiles.
  • the metadata could be related to visual content, e.g., saliency maps or metadata added during the production process (such as forced perspective).
  • the metadata could alternatively or in addition be also related to audio content, e.g., 3D audio coming from "behind" the user which may trigger the user to look back. All this information is backed up by an ANN which provides information about how the user's behavior was in the past.
  • the prediction accuracy of the ANN can further be improved by taking into account eye-tracking data to get an even more accurate impression of where the user is looking for a given video, with respect to the playback time.
  • the head movement module 202 further includes a camera sensor module configured to detect and track the movement of the user's eyes during playback of the input video 201.
  • an additional eye-tracking module (not shown) may be provided.
  • training system 200 may be implemented in different configurations.
  • training system 200 is a distributed system implemented with a user base and learning through use of the system. For example, each time a user plays a 360° video, the playback device records the user's head movement and sends the recorded head movement data 203 to a remote training server.
  • the preprocessing module 204 filters and classifies the data and can aggregate data from multiple users for a given 360° video.
  • the resulting training set or sets 205 can be classified based on user profile types and types of content to generate corresponding models 207. As further described below, the models 207 can be continuously updated as more user data is received at the server.
  • the output of the training process is a learned prediction model 307.
  • This model 307 is then used to predict the head movement profile for any given input video 301.
  • the input video 301 for the prediction process preferably has the same spatial dimensions as the videos 201 of the training dataset, therefore the video may be scaled in a preprocessing module 302 before it is fed into the ANN module 303 for prediction.
  • the output of the prediction process is the predicted head movement profile 305 for the given input video 301, which represents the regions of interest where the viewer is likely to look at with respect to the playback time of the video.
  • the predicted head movement profile may be converted into a format that is supported by the playback software.
  • a neural network refinement approach 400 is provided according to one embodiment.
  • input video 401 is provided to an ANN module 403 that has been trained.
  • the ANN module 403 is used to predict head movement profiles 405 based on the applicable model 407.
  • playback software 408 can be used to record the actual head movements of the user for a given video file into a feedback record 409.
  • An ANN training module 406 receives the recorded actual head movements 409 for comparison with the predicted head movements 405. If, for any given scene, the recorded actual head movements 409 differ from the predicted movements 405, the newly recorded data 409 is used to refine the learned model 407, thus gradually improving the prediction accuracy of the ANN.
  • the ANN learns and optimizes to improve its predictive output.
  • the refinement approach may be implemented in a distributed fashion between the client and the server or may be fully implemented in the client, which provides the refined model back to the server to be used with other users playing the given video 401.
  • a process 500 for loading prediction data and different quality tiles is described according to one embodiment.
  • the client playback software 501 initially loads 510 the stream manifest 502 for the immersive video and preferably checks whether a head movement profile 507 is available for the given stream. If that is the case, the player 501 requests the file containing the prediction data 507 and parses it. Given the current viewing direction 505 of the viewer and the predicted head movement profile 506, the player uses a heuristic to determine which region of interest the viewer will probably look at next.
  • a prediction profile 506 may include several predictions with different confidence levels for where the user may look in a given segment, as illustrated for example in Fig. 5.
  • a primary high confidence prediction 506a identifies a first set of tiles that correspond to the most likely viewing direction 505 for that segment.
  • another set of tiles are identified by secondary predictions 506b and 506c, with lower confidence values.
  • the player requests 512a the tiles for segment 1 and the server streams 514a the corresponding tiles.
  • all the tiles that are part of a region of interest 505 corresponding to the highest confidence prediction 506a are then loaded in the highest quality while tiles for secondary viewing directions corresponding to lower confidence predictions 506b and 506c, are loaded at lower qualities.
  • the remaining tiles corresponding to parts of the scene predicted not to be viewed by the user may be loaded at yet a lower quality or not loaded at all, for example, depending on player settings or capabilities, or network conditions.
  • content provider profiles may also be used to control the streaming operations at the client device 501.
  • the player 501 may decide to request 512 different quality tiles for regions of interest 505 that are less likely to be looked at by the viewer, based on lower confidence predictions 506b and 506c.
  • the quality levels used for the different confidence level predictions 506 are determined based on the content provider profile.
  • when the content provider values a good user experience ("UX") more than bandwidth savings, it may allow the player to download the highest quality tiles for all regions 505 that have some likelihood of being looked at in the current video segment above a threshold.
  • tiles corresponding to predictions 506a and 506b, which in this embodiment are above the content provider threshold, may be obtained 512 at the highest quality available, while tiles for the predicted view corresponding to prediction 506c, with a confidence level below the content-provider threshold, are obtained 512 at a medium quality.
  • the set of tiles with probabilities above the threshold may include multiple tiles, depending on the type of the content.
  • the content provider may decide to let the player only load the tiles of the region with the highest likelihood of being looked at, in the highest quality for any segment, while loading all other tiles with lower qualities. This approach would minimize the content delivery network (“CDN”) traffic and costs.
  • CDN content delivery network
  • the UX could further be improved by combining the head movement prediction profiles with the scalable extension of HEVC (SHVC).
  • SHVC scalable extension of HEVC
  • the playback software could then load all tiles in the lowest quality first and keep loading enhancement layers for tiles that are likely to be looked at, as sketched after this list. This would also reduce the time needed to load an enhanced version of a lower-quality tile when the viewer starts looking at it.
  • the prediction model described herein can also be used for streaming technologies which do not rely on tiles. When no tiles are used, preferably, several versions of the same video exist. Each different version is encoded in a way that one or more certain parts of the video are encoded in higher quality than other parts.
  • the ANN-based prediction of regions of interest in a video can here be used to determine which parts of the different videos should be encoded in which quality.
  • the ANN-based prediction is also (or alternatively) used to decide which video segment should be loaded next to provide the best possible QoE to the user by providing high quality content for the parts of the video the user is looking at and will be most likely looking at in the future.
  • a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
  • Embodiments may also relate to an apparatus for performing the operations herein.
  • This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer.
  • a computer program may be stored in a non- transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus.
  • any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
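The layered (base plus enhancement) loading idea mentioned in the SHVC bullets above could look roughly like the following sketch; the two-pass request order and the layer labels are illustrative assumptions, not a protocol defined by this disclosure.

```python
# Rough sketch of SHVC-style layered loading: request the base layer for
# every tile first, then request enhancement layers only for tiles in the
# regions predicted to be looked at. Labels and ordering are assumptions.

def plan_layered_requests(all_tiles, likely_tiles):
    likely = set(likely_tiles)
    requests = [("base", t) for t in all_tiles]                          # pass 1: base layer everywhere
    requests += [("enhancement", t) for t in all_tiles if t in likely]   # pass 2: likely-viewed tiles only
    return requests

print(plan_layered_requests(list(range(12)), likely_tiles=[4, 5, 6, 8, 9, 10]))
```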

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Bandwidth requirements for immersive video content playback are reduced while keeping the QoE on a satisfactory level and minimizing playback disruptions with a tile-based adaptive streaming algorithm. A playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them based on a prediction of where the user is likely to look at next. An artificial neural network ("ANN") is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look at next for any given input video. The predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network ("CDN") traffic, which in turn lessens the costs for the content provider.

Description

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
International Patent Application
By
Lukas Kroepfl, Wolfram Hofmeister, Mario Graf, Daniel Weinberger, Christopher Mueller,
Reinhard Grandl, and Stefan Lederer
TITLE
Predictive Content Buffering in Streaming of Immersive Video
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application No. 62/521,858, filed on June 19, 2017, the contents of which are hereby incorporated by reference.
BACKGROUND
[0001] This disclosure generally relates to streaming and playback for immersive video content, and more particularly to an improved predictive streaming approach using artificial neural networks.
[0002] With the advent of immersive video applications using, for example, 360° High Definition, virtual reality (VR), augmented reality (AR), and other immersive video content and as immersive video capable playback devices, such as the Samsung Gear VR, Oculus Rift, HTC Vive or Google Daydream, become more affordable, the streaming of immersive video content over the Internet is becoming a necessity. In contrast to streaming non-immersive video content, the quality demands of immersive video are substantially higher. For example, where full HD resolution is sufficient for playing back conventional video content, VR content looks very poor when streamed in this resolution. Moreover, video data covering the full 360° panorama has to be streamed for a proper quality of experience (QoE), even though the viewport visible to the user at any one point in time covers only a small area of the full video frame. Therefore, much higher resolutions are needed for immersive video content to get the same perceptual video quality as for conventional videos.
[0003] As many playback devices 100 today are mobile devices, such as tablets,
smartphones, laptops, and the like, which are usually connected to a network server 110 over an unreliable wireless connection with widely variable network conditions, transmitting high-quality video over the network poses a great challenge. The bandwidth requirements needed for immersive video content playback are so high that a satisfactory playback quality can only rarely be achieved under real-life network conditions. To cope with this problem, a solution called tile-based streaming has been proposed. For example, with reference to Fig. 1, using tile-based streaming, VR content is split into several smaller rectangular regions called "tiles," which are separately encoded for each Segment 104a,b and transmitted over the network in different qualities, such as high-quality tiles 101a-l, medium-quality tiles 102a-l, and low-quality tiles 103a-l. Using the tile-based streaming technology, only the tiles 101e, 101f, 101g, 101i, 101j and 101k for the current viewport 105 have to be transmitted in higher quality, while the tiles for the currently invisible parts of the scene 106 (tiles 103a-103d, 103h, and 103l or tiles 102a-102d, 102h, and 102l, or a combination of these two lower quality sets) can be downloaded in a lower quality, leading to bandwidth savings. This concept is illustrated in Fig. 1.
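As a minimal sketch of the tile/viewport relationship described above (assuming a hypothetical 12-tile layout, which is not a layout defined by this disclosure), the quality choice per tile could look like this:

```python
# Minimal sketch: tiles covered by the current viewport come from the
# high-quality stream, the rest of the scene from the low-quality stream.
# The 12-tile grid and the quality labels are illustrative assumptions.

def assign_tile_qualities(num_tiles, viewport_tiles):
    """Return a quality label per tile index for one segment."""
    visible = set(viewport_tiles)
    return {t: ("high" if t in visible else "low") for t in range(num_tiles)}

if __name__ == "__main__":
    # Tiles 4-6 and 8-10 are inside the viewport, loosely mirroring tiles
    # 101e-101g and 101i-101k in Fig. 1.
    print(assign_tile_qualities(12, viewport_tiles=[4, 5, 6, 8, 9, 10]))
```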
[0004] In the prior art approach for tile-based streaming, the quality of the downloaded tiles is chosen reactively: the playback software in the client 100 can only start downloading the needed tiles in a first frame 104a when the user actually turns her head and looks at them. This leads to a noticeable delay between the user looking at a specific part of the environment and the playback software adapting to the changed viewport 105. This delay is influenced by different factors like segment length, network round trip time ("RTT") and video buffer size on the client side. When watching immersive video content that requires rapid head movement, like an action-heavy scene, the playback software might not be able to adapt and download new tiles fast enough, if at all. Until those higher quality tiles are loaded, the user is presented with low quality tiles, leading to a degraded QoE. In some instances, the user may be presented low quality tiled content most of the time.
[0005] There are several approaches proposed to reduce this delay problem, but they all come with new disadvantages. Some have proposed reducing the buffer size on the playback device. With the amount of video that is buffered on the client reduced, higher quality tiles can be loaded slightly earlier without the need to discard already buffered data. However, when reducing the buffer size on the client device, the video playback is more likely to stall and stutter when network conditions are not ideal or when they are fluctuating. Due to the intrinsic unpredictable nature of wireless networks, bandwidth conditions may change rapidly enough that keeping a steady, although small, buffer level is virtually impossible.
Furthermore, smaller segment sizes could reduce the end-to-end delay but would also increase the bitrates needed to encode the videos due to a smaller intra-frame period, given that there are more segments for a video of the same length and therefore more intra-frames. Thus, the period of time between intra-frames gets smaller and their frequency is higher, requiring increased bitrates. Keeping this in mind, it is most likely that current state-of-the-art technologies will fail to deliver a good QoE for immersive video content over wireless networks.
BRIEF SUMMARY
[0006] In one embodiment, a method for streaming immersive video content for playback in a motion-controlled playback device, includes providing a manifest file for an immersive video content. The manifest file identifies two or more streams for the immersive video content, including a stream with a higher quality of video than another stream. Each stream of the immersive video content comprises a set of video segments with a plurality of tiles corresponding to an immersive view of a scene in the video content. The method also includes providing a prediction profile corresponding to the immersive video content. The prediction profile may include one or more predicted views for at least one segment of the immersive video stream. Each predicted view identifies a subset of tiles likely to be viewed by a user. In this embodiment, the method includes receiving a request for a set of higher quality tiles from a segment in the first stream of the immersive video content. The tiles requested correspond to a predicted view in the prediction profile for the segment requested. The method further includes receiving another request for another set of lower quality tiles from the same segment in the lower quality stream of the immersive video content.
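One possible in-memory representation of such a prediction profile is sketched below; the field names (tiles, probability, views) are illustrative assumptions rather than terms defined by this disclosure.

```python
# Sketch of a prediction profile: for each segment, one or more predicted
# views, each naming the subset of tiles likely to be viewed. Field names
# are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PredictedView:
    tiles: List[int]       # subset of tile indices likely to be viewed
    probability: float     # confidence that the user looks at this view


@dataclass
class PredictionProfile:
    content_id: str
    views: Dict[int, List[PredictedView]] = field(default_factory=dict)  # segment index -> predicted views

    def predicted_tiles(self, segment: int) -> List[int]:
        """All tiles covered by predicted views of a segment, highest confidence first."""
        ordered, seen = [], set()
        for view in sorted(self.views.get(segment, []), key=lambda v: -v.probability):
            for t in view.tiles:
                if t not in seen:
                    seen.add(t)
                    ordered.append(t)
        return ordered


profile = PredictionProfile("vid-360", {1: [PredictedView([4, 5, 6], 0.8),
                                            PredictedView([8, 9], 0.4)]})
print(profile.predicted_tiles(1))   # -> [4, 5, 6, 8, 9]
```

A client would request the listed tiles from the higher-quality stream and the remaining tiles of the same segment from the lower-quality stream, as described in the paragraph above.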
[0007] According to one aspect of some embodiments, the method may also include receiving head movement data corresponding to a playback of the immersive video content, training a neural network based, at least in part, on the head movement data; and generating the prediction profile corresponding to the immersive video content based on the neural network. In addition, in some embodiments, the method can also include receiving eye-tracking data corresponding to a playback of the immersive video content. In such embodiments, training the neural network can also be based on the eye-tracking data, at least partially.
[0008] In another embodiment, a method for streaming immersive video content for playback in a motion-controlled playback device includes receiving a manifest file for an immersive video content and receiving a prediction profile corresponding to the immersive video content. In this embodiment, the prediction profile may also include predicted views for at least a segment that identifies a subset of tiles in the segment that are likely to be viewed by a user. In this embodiment, the method also includes determining a set of tiles from a segment of the immersive video content based on the predicted view for that segment and another set of tiles from the same segment corresponding to sections of the scene less likely to be viewed by the user. The method also includes sending requests for one set of tiles likely to be viewed from the high-quality stream identified in the manifest file and for the other set of tiles, less likely to be viewed, from the lower-quality stream identified in the manifest file. Then, the method includes playing back the segment of the immersive video content.
[0009] According to various embodiments, the prediction profile is an output of a neural network trained with movement data corresponding to playing back of the immersive video content. The movement data used for training may, for example, include head movement data, eye movement data, or both.
[0010] According to another aspect of some embodiments, the prediction profile may be refined based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the immersive video content. According to one embodiment, the method may further include predicting a head movement associated with the segment of the immersive video content and sending actual head movement data for refining the prediction profile with a neural network.
[0011] In some embodiments, each predicted view may be provided as predicted head movements associated with a segment of the immersive video content. The immersive video content may be, for example, a 360° video. According to another aspect of some
embodiments, the motion-controlled playback device may be a virtual reality headset.
[0012] In one embodiment, the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream. In this embodiment, a method can also include receiving yet another request for a set of lower quality tiles from the same segment in the lowest-quality stream of the immersive video content identified in the manifest file. [0013] According to another aspect of various embodiments, the predicted views in the prediction profile can also include a probability of being viewed by a user. In such embodiments, the method may also include providing a content provider profile that provides a threshold probability of a predicted view for requesting tiles from the high-quality stream.
[0014] According to another aspect of some embodiments, the method can also include selecting the prediction profile corresponding to the immersive video content. In such embodiments, the selection of the prediction profile may be based on similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device. In addition, or as an alternative, the selection may also be based on similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0015] FIG. 1 illustrates an exemplary tile-based streaming approach.
[0016] FIG. 2 is a block diagram illustrating exemplary steps, input and output involved in the training process according to one embodiment of the disclosure.
[0017] FIG. 3 is a block diagram illustrating an exemplary embodiment of a head movement prediction process according to one embodiment.
[0018] FIG. 4 is a block diagram illustrating a neural network refinement approach according to one embodiment.
[0019] FIG. 5 is a diagram illustrating a process for loading prediction data and different quality tiles according to one embodiment.
[0020] The figures depict various example embodiments of the present disclosure for purposes of illustration only. One of ordinary skill in the art will readily recognize from the following discussion that other example embodiments based on alternative structures and methods may be implemented without departing from the principles of this disclosure and which are encompassed within the scope of this disclosure.
DETAILED DESCRIPTION
[0021] The Figures and the following description describe certain embodiments by way of illustration only. One of ordinary skill in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made in detail to several embodiments, examples of which are illustrated in the
accompanying figures.
[0022] The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for streaming and playing back immersive video content.
[0023] As described before, the state of the art approach for tile-based streaming has several drawbacks and disadvantages. To overcome these problems, a proactive streaming approach is provided. According to one embodiment of the present invention, a mechanism is provided that reduces the bandwidth requirements for immersive video content playback while keeping the QoE on a satisfactory level, minimizing playback disruptions. According to one embodiment, the playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them. Higher quality tiles are downloaded based on a prediction of where the user is likely to look at next. Thus, higher quality tiles for the scene the user is currently looking at have already been downloaded in advance, greatly improving the QoE.
[0024] According to one embodiment, an artificial neural network ("ANN") is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look at next for any given input video. According to another aspect of embodiments of the invention, the predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network ("CDN") traffic, which in turn lessens the costs for the content provider. The network provides the video player with information on which parts of the immersive video content shall be buffered in advance. This allows the use of longer segment durations and a larger buffer on the client side, which helps to significantly improve QoE for tile-based streaming, while simultaneously reducing the amount of data that needs to be sent over the network and thus the costs for the content providers.
[0025] According to one embodiment, an ANN is trained to produce head movement predictions. In one embodiment, the training dataset for the ANN consists of a large number of training instances, which in turn consist of a video file and the corresponding head movement tracking data, as recorded by the video playback software. Video files that are part of the training dataset preferably have substantially identical spatial dimensions. The video files can be scaled to a specific resolution in a preprocessing step.
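As a stand-in for the ANN itself (which this disclosure does not specify in code), the idea of learning to predict the next viewing direction from such training instances can be sketched with a tiny linear predictor fitted by gradient descent; the model, the three-sample window, and the hyperparameters are illustrative assumptions.

```python
# Stand-in for the ANN training step: a tiny linear model learns to predict
# the next yaw sample of a recorded trace from the previous three, using
# plain stochastic gradient descent. Everything here is an illustrative
# simplification, not the disclosure's network.

def make_examples(trace, history=3):
    """Turn one recorded yaw trace into (input window, next value) pairs."""
    return [(trace[i:i + history], trace[i + history])
            for i in range(len(trace) - history)]

def train(examples, lr=1e-4, epochs=500):
    history = len(examples[0][0])
    w, b = [0.0] * history, 0.0
    for _ in range(epochs):
        for x, y in examples:
            pred = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

trace = [0, 2, 5, 9, 14, 20, 27, 35, 44]           # recorded yaw (degrees) over time
w, b = train(make_examples(trace))
next_yaw = sum(wi * xi for wi, xi in zip(w, trace[-3:])) + b
print(round(next_yaw, 1))                           # predicted next yaw angle
```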
[0026] Referring to FIG. 2, an overview of the training process in a training system 200 is presented. According to one embodiment, an input 360° video 201 is input into a training system 200. A head movement recording module 202 receives motion sensor data indicative of the movement of a user's head. For example, a motion-controlled playback device 210, such as a Samsung Gear VR, an Oculus Rift, an HTC Vive, a Google Daydream, or similar, includes motion sensors, such as gyroscopes, accelerometers, and the like, to detect user movement and control playback of the input video 201. For example, as the user turns his or her head to one side or another (left, right, up, or down) a display within the device 210 shows a different view of the 360° video. The motion control data is provided to the head movement recording module 202. The head movement recording module 202 creates a record 203 of the sensed head movement for the given input video 201. According to this embodiment, head movement prediction accuracy may be improved for a given input video 201 by training the network with data that is preferably representative of all types of content and viewer groups. Therefore, the recorded tracking data 203 may be input to a preprocessing module 204 where it can be filtered prior to using it to train the network, and outliers can be detected and removed from the training set, generating a set of training data instances 205 prior to training the ANN.
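The filtering and outlier removal performed by the preprocessing module 204 might look roughly like the following sketch; the smoothing window and the rotation-speed threshold are assumed values, not ones given in this disclosure.

```python
# Sketch of preprocessing recorded head-movement traces: smooth each yaw
# trace with a moving average and drop recordings with implausibly fast
# rotation between samples. Window size and threshold are assumptions.

def smooth(trace, window=3):
    half = window // 2
    out = []
    for i in range(len(trace)):
        chunk = trace[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def is_outlier(trace, max_step_deg=45.0):
    return any(abs(b - a) > max_step_deg for a, b in zip(trace, trace[1:]))

def build_training_instances(recordings):
    """recordings: list of (video_id, yaw_trace_in_degrees) tuples."""
    instances = []
    for video_id, yaw in recordings:
        if is_outlier(yaw):
            continue                      # discard implausible recordings
        instances.append((video_id, smooth(yaw)))
    return instances

print(build_training_instances([("vid-360", [0, 5, 12, 20, 200]),   # dropped as outlier
                                ("vid-360", [0, 5, 10, 14, 18])]))  # kept
```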
[0027] Once the preprocessing is finished, the training data instances 205 are used in a training module 206 to train the ANN. The goal of the training module 206 is to create a prediction model 207 that is able to predict where the viewer is likely going to look at for any arbitrary input video, regardless of the type of content or length, as accurately as possible. The accuracy of the learned model 207, and thus the prediction, depends on the size of the training dataset. Accordingly, larger training sets are preferable. According to one embodiment, the learned model of head movements, or prediction profiles 207, may be saved into separate files that are accessible to the ANN module for future streaming during playback sessions. In one embodiment, the ANN predicted head movements can then be referenced in the stream manifest for streaming during playback. For example, in one embodiment in which the video is streamed according to MPEG-DASH protocol (i.e., according to ISO/IEC 23009-1, incorporated herein by reference) the head movement predictions may be referenced in the manifest file in similar fashion to how spatial relationship descriptors ("SRDs") are referenced within MPEG-DASH manifests.
According to an alternative embodiment, the predicted head movement data may be directly embedded into the video stream itself. One of ordinary skill in the art will readily recognize that other approaches to providing the head prediction data may be used, including embodiments based on HTTP Live Streaming (HLS), Microsoft Smooth Streaming, or the like. [0028] According to alternative embodiments, content-specific or viewer-type-specific models may be used. For example, additional metadata may be collected with respect to a given prediction model to improve the accuracy of the prediction model. The metadata may include characteristics for the content itself, of the user, or both. For example, when generating learned models 207 in the training system 200, user profile information may be collected, including for example, demographic information such as sex, age, nationality, or the like, and/or personal interest information, such as preferred video content category (e.g., action, scenery, travel, and the like), and/or user biometric or physiological information, such as heartrate and blood pressure while consuming immersive content, and/or other metadata, such as manual user annotations.
[0029] Content profile information may alternatively or in addition be used with metadata regarding the content, such as for example, category (action, adventure, etc.), maturity level/rating, actors, director, ratings, and the like. In one embodiment, the training system 200 extracts the metadata from the input video 201 and includes it in the model 207. The metadata may be collected and used to classify head movement data into various model types. Classification may be done according to known classification techniques, including for example, clustering techniques, similarity metrics (e.g., cosine similarity), or the like. The training data 205 can be segregated based on the metadata into the model types and used to generate different ANNs for each model type. For example, an ANN may be generated for male users under 20 with a preference for action video content. Corresponding metadata collected from the playback device user is matched against the model metadata to select the appropriate model prior to playback. Matching of the user profile to the prediction model profile may be done according to known similarity metrics, such as for example, the cosine similarity approach. In one embodiment, content-related metadata is signaled to the playback device which guides the client in selecting the right tiles. For example, the metadata could be related to visual content, e.g., saliency maps or metadata added during the production process (such as forced perspective). The metadata could alternatively or in addition be also related to audio content, e.g., 3D audio coming from "behind" the user which may trigger the user to look back. All this information is backed up by an ANN which provides information about how the user's behavior was in the past.
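The cosine-similarity matching of a user profile to a model profile mentioned above could be sketched as follows; the feature encoding (sex, age group, interest weights) is an illustrative assumption.

```python
# Sketch: match a viewer profile to the closest model type using cosine
# similarity. The feature encoding of the profiles is an assumption.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_model(user_vector, model_profiles):
    """model_profiles: dict of model name -> feature vector."""
    return max(model_profiles,
               key=lambda name: cosine_similarity(user_vector, model_profiles[name]))

models = {
    "male_under20_action":   [1, 1, 1, 0],   # [male, under 20, action, scenery]
    "female_over40_scenery": [0, 0, 0, 1],
}
print(select_model([1, 1, 0.8, 0.1], models))    # -> male_under20_action
```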
[0030] According to another embodiment, the prediction accuracy of the ANN can be further improved by taking into account eye-tracking data to obtain an even more accurate impression of where the user is looking in a given video with respect to the playback time. In this embodiment, the head movement module 202 further includes a camera sensor module configured to detect and track the movement of the user's eyes during playback of the input video 201. Alternatively, an additional eye-tracking module (not shown) may be provided.
[0031] It should be noted that training system 200 may be implemented in different configurations. In one embodiment, training system 200 is a distributed system implemented across a user base, learning through use of the system. For example, each time a user plays a 360° video, the playback device records the user's head movement and sends the recorded head movement data 203 to a remote training server. At the server, the preprocessing module 204 filters and classifies the data and can aggregate data from multiple users for a given 360° video. The resulting training set or sets 205 can be classified based on user profile types and types of content to generate corresponding models 207. As further described below, the models 207 can be continuously updated as more user data is received at the server.
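By way of non-limiting illustration, the following sketch shows how a training server might bucket incoming head-movement traces by content and user metadata before generating per-model-type training sets. The record fields and bucketing keys are illustrative assumptions.

```python
# Sketch of server-side bucketing of recorded head-movement traces.
from collections import defaultdict

def bucket_traces(records):
    """records: iterable of dicts such as
    {"video_id": ..., "content_category": ..., "user_segment": ..., "trace": [...]}.
    Returns {(content_category, user_segment): [trace, ...]}."""
    buckets = defaultdict(list)
    for rec in records:
        key = (rec["content_category"], rec["user_segment"])
        buckets[key].append(rec["trace"])
    return buckets

# Each bucket would then be preprocessed (filtered, normalized) and used to
# train a separate ANN / prediction model for that model type.
```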
[0032] Referring now to FIG. 3, an exemplary embodiment of a prediction process 300 is provided. According to this embodiment, the output of the training process is a learned prediction model 307. This model 307 is then used to predict the head movement profile for any given input video 301. The input video 301 for the prediction process preferably has the same spatial dimensions as the videos 201 of the training dataset; therefore, the video may be scaled in a preprocessing module 302 before it is fed into the ANN module 303 for prediction. The output of the prediction process is the predicted head movement profile 305 for the given input video 301, which represents the regions of interest where the viewer is likely to look with respect to the playback time of the video. In one embodiment, a post-processing module 304 may convert the predicted head movement profile into a format that is supported by the playback software.
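By way of non-limiting illustration, the following sketch traces the prediction path: input frames are resampled to the training resolution, passed through the trained model, and post-processed into a per-segment profile. The frame representation, training resolution, and model interface are illustrative placeholders rather than a specific library API.

```python
# Conceptual sketch of the prediction path for a trained model.
import numpy as np

TRAIN_H, TRAIN_W = 180, 360  # assumed training resolution of equirectangular frames

def preprocess(frames):
    """Nearest-neighbour resize of (N, H, W, 3) frames to the training dimensions."""
    n, h, w, _ = frames.shape
    ys = np.arange(TRAIN_H) * h // TRAIN_H
    xs = np.arange(TRAIN_W) * w // TRAIN_W
    return frames[:, ys][:, :, xs]

def predict_head_movement(frames, model, segment_len=30):
    """Returns one (yaw, pitch) prediction per segment of frames."""
    frames = preprocess(frames)
    profile = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        yaw, pitch = model(segment)  # placeholder: trained ANN inference
        profile.append({"segment": start // segment_len, "yaw": yaw, "pitch": pitch})
    return profile
```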
[0033] Referring now to FIG. 4, a neural network refinement approach 400 is provided according to one embodiment. According to this embodiment, an input video 401 is provided to an ANN module 403 that has been trained. The ANN module 403 is used to predict head movement profiles 405 based on the applicable model 407. In this embodiment, playback software 408 can be used to record the actual head movements of the user for a given video file into a feedback record 409. An ANN training module 406 receives the recorded actual head movements 409 for comparison with the predicted head movements 405. If, for any given scene, the recorded actual head movements 409 differ from the predicted movements 405, the newly recorded data 409 is used to refine the learned model 407, thus gradually improving the prediction accuracy of the ANN. As additional users view the immersive content video 401, the ANN learns and optimizes to improve its predictive output. According to alternatives of this embodiment, the refinement approach may be implemented in a distributed fashion between the client and the server, or may be fully implemented in the client, with the refined model provided back to the server to be used with other users playing the given video 401.
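By way of non-limiting illustration, the following sketch shows the feedback comparison in the refinement loop: the per-segment angular error between predicted and recorded viewing directions is measured, and segments whose error exceeds a tolerance are queued for retraining. The tolerance value and record format are illustrative assumptions.

```python
# Sketch of selecting feedback samples that warrant model refinement.
def angular_error(pred, actual):
    """Absolute yaw difference in degrees, wrapped to [0, 180]."""
    diff = abs(pred["yaw"] - actual["yaw"]) % 360
    return min(diff, 360 - diff)

def collect_refinement_samples(predicted_profile, recorded_profile, tolerance_deg=30.0):
    samples = []
    for pred, actual in zip(predicted_profile, recorded_profile):
        if angular_error(pred, actual) > tolerance_deg:
            # Recorded data for this segment is fed back to the training module.
            samples.append(actual)
    return samples
```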
[0034] Referring now to FIG. 5, a process 500 for loading prediction data and different quality tiles is described according to one embodiment. According to this embodiment, when the client playback software 501 initially loads 510 the stream manifest 502 for the immersive video, it preferably checks whether a head movement profile 507 is available for the given stream. If that is the case, the player 501 requests the file containing the prediction data 507 and parses it. Given the current viewing direction 505 of the viewer and the predicted head movement profile 506, the player uses a heuristic to determine which region of interest the viewer will probably look at next. For example, a prediction profile 506 may include several predictions with different confidence levels for where the user may look in a given segment. As illustrated in FIG. 5, for Segment 1, a primary high-confidence prediction 506a identifies a first set of tiles that corresponds to the most likely viewing direction 505 for that segment. In addition, further sets of tiles are identified by secondary predictions 506b and 506c, with lower confidence values. Based on the predicted viewing directions, the player requests 512a the tiles for Segment 1 and the server streams 514a the corresponding tiles. For each segment, all the tiles that are part of a region of interest 505 corresponding to the highest confidence prediction 506a are loaded in the highest quality, while tiles for secondary viewing directions, corresponding to lower confidence predictions 506b and 506c, are loaded at lower qualities. The remaining tiles, corresponding to parts of the scene predicted not to be viewed by the user, may be loaded at yet a lower quality or not loaded at all, depending for example on player settings or capabilities, or on network conditions.
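By way of non-limiting illustration, the following sketch shows the per-segment tile selection described above: the tile set from the highest-confidence prediction is fetched at the top quality, secondary predictions at lower qualities, and everything else at the lowest quality or not at all. The quality labels and prediction format are illustrative assumptions.

```python
# Sketch of mapping confidence-ranked predictions onto tile qualities.
def plan_tile_requests(all_tiles, predictions, skip_unpredicted=False):
    """predictions: list of {"tiles": set, "confidence": float}."""
    ranked = sorted(predictions, key=lambda p: p["confidence"], reverse=True)
    qualities = ["high", "medium", "low"]
    plan = {}
    for rank, pred in enumerate(ranked):
        quality = qualities[min(rank, len(qualities) - 1)]
        for tile in pred["tiles"]:
            plan.setdefault(tile, quality)  # keep the best quality seen first
    for tile in all_tiles:
        if tile not in plan and not skip_unpredicted:
            plan[tile] = "lowest"
    return plan  # {tile_id: quality label}
```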
[0035] According to one embodiment, content provider profiles (not shown) may also be used to control the streaming operations at the client device 501. For example, as described above, the player 501 may decide to request 512 different quality tiles for regions of interest 505 that are less likely to be looked at by the viewer, based on the lower confidence predictions 506b and 506c. In this embodiment, the quality levels used for the different confidence-level predictions 506 are determined based on a content provider profile specifying the requirements of the content provider. If, for example, the content provider values a good user experience ("UX") more than bandwidth savings, it may allow the player to download the highest quality tiles for all regions 505 whose likelihood of being looked at in the current video segment exceeds a threshold. For example, tiles corresponding to predictions 506a and 506b, which in this embodiment are above the content provider threshold, may be obtained 512 at the highest quality available, while tiles for the predicted view corresponding to prediction 506c, with a confidence level below the content-provider threshold, are obtained 512 at a medium quality. The set of tiles with probabilities above the threshold may include multiple tiles, depending on the type of the content. For example, with VR content that contains mostly static scenes that do not require large head movements, the content provider may decide to let the player load only the tiles of the region with the highest likelihood of being looked at in the highest quality for any segment, while loading all other tiles at lower qualities. This approach would minimize content delivery network ("CDN") traffic and costs.
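By way of non-limiting illustration, the following sketch applies a content-provider threshold of the kind described above: tiles from any prediction whose confidence clears the threshold are fetched at the highest quality, and the rest at a medium quality. The threshold value, confidence figures, and quality names are illustrative assumptions.

```python
# Sketch of content-provider-profile-driven quality selection.
def apply_provider_profile(predictions, provider_threshold=0.5):
    plan = {}
    for pred in predictions:
        quality = "high" if pred["confidence"] >= provider_threshold else "medium"
        for tile in pred["tiles"]:
            # Keep the better quality if a tile appears in several predictions.
            if plan.get(tile) != "high":
                plan[tile] = quality
    return plan

# Example with made-up confidences in the spirit of FIG. 5:
# apply_provider_profile([{"tiles": {1, 2}, "confidence": 0.8},
#                         {"tiles": {3},    "confidence": 0.6},
#                         {"tiles": {4},    "confidence": 0.2}])
# -> tiles 1-3 at "high", tile 4 at "medium"
```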
[0036] According to another embodiment, the UX could be further improved by combining the head movement prediction profiles with the scalable extension of HEVC (SHVC). The playback software could then load all tiles in the lowest quality first and keep loading enhancement layers for tiles that are likely to be looked at. This would also reduce the time needed to load an enhanced version of a lower-quality tile when the viewer starts looking at it.
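By way of non-limiting illustration, the following sketch shows the layered loading order suggested above: base-layer tiles for the whole segment are queued first, then enhancement layers for tiles in order of predicted viewing likelihood. The layer indexing is illustrative and does not follow any particular SHVC API.

```python
# Sketch of a base-layer-first, likelihood-ordered enhancement download plan.
def layered_download_order(all_tiles, likelihood, enhancement_layers=2):
    """likelihood: {tile_id: probability of being viewed}."""
    order = [(tile, 0) for tile in all_tiles]  # base layer for every tile
    ranked = sorted(all_tiles, key=lambda t: likelihood.get(t, 0.0), reverse=True)
    for layer in range(1, enhancement_layers + 1):
        order += [(tile, layer) for tile in ranked if likelihood.get(tile, 0.0) > 0.0]
    return order  # list of (tile_id, layer_index) in download order
```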
[0037] According to another embodiment, the prediction model described herein can also be used for streaming technologies that do not rely on tiles. When no tiles are used, preferably several versions of the same video exist, each encoded such that one or more parts of the video are encoded at a higher quality than other parts. According to one embodiment, the ANN-based prediction of regions of interest in a video can be used here to determine which parts of the different versions should be encoded at which quality. According to one embodiment, the ANN-based prediction is also (or alternatively) used to decide which video segment should be loaded next, so as to provide the best possible QoE to the user by providing high quality content for the parts of the video the user is looking at and will most likely be looking at in the future.
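By way of non-limiting illustration, the following sketch shows per-segment version selection when no tiles are used, choosing the pre-encoded version whose high-quality region best covers the predicted viewing direction. The representation of regions as yaw ranges in degrees is an illustrative assumption.

```python
# Sketch of selecting among pre-encoded versions based on the predicted yaw.
def pick_version(versions, predicted_yaw):
    """versions: list of {"id": str, "hq_yaw_range": (start_deg, end_deg)}."""
    def covers(rng, yaw):
        start, end = rng
        yaw %= 360
        return start <= yaw <= end if start <= end else (yaw >= start or yaw <= end)
    for v in versions:
        if covers(v["hq_yaw_range"], predicted_yaw):
            return v["id"]
    return versions[0]["id"]  # fall back to a default version

# pick_version([{"id": "front_hq", "hq_yaw_range": (315, 45)},
#               {"id": "rear_hq",  "hq_yaw_range": (135, 225)}], predicted_yaw=350)
# -> "front_hq"
```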
[0038] The foregoing description of the embodiments has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the patent rights to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
[0039] Some portions of this description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
[0040] Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
[0041] Embodiments may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0042] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the patent rights be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the patent rights, which is set forth in the following.

Claims

WHAT IS CLAIMED IS:
1. A method for streaming immersive video content for playback in a motion-controlled playback device, the method comprising:
providing a manifest file for an immersive video content, the manifest file identifying two or more streams for the immersive video content, including a first stream with a higher quality of video than a second stream, wherein each stream of the immersive video content comprises a set of video segments, each video segment comprising a plurality of tiles corresponding to an immersive view of a scene in the video content;
providing a prediction profile corresponding to the immersive video content, the prediction profile including one or more predicted views for at least one segment, each predicted view identifying a subset of tiles of the plurality of tiles likely to be viewed by a user;
receiving a first request for a first set of higher quality tiles from a first segment in the first stream of the immersive video content, the first set of tiles corresponding to a first predicted view in the prediction profile for the first segment; and receiving a second request for a second set of lower quality tiles from the first
segment in the second stream of the immersive video content.
2. The method of claim 1 further comprising:
receiving head movement data corresponding to a playback of the immersive video content;
training a neural network based, at least in part, on the head movement data; and generating the prediction profile corresponding to the immersive video content based on the neural network.
3. The method of claim 2 further comprising receiving eye-tracking data corresponding to a playback of the immersive video content, and wherein the training of the neural network is further based, at least in part, on the eye-tracking data.
4. The method of claim 2 further comprising refining the prediction profile based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the first video content.
5. The method of claim 1 wherein each predicted view is provided as predicted head movements associated with a segment of the immersive video content.
6. The method of claim 1 wherein the immersive video content is a 360° video.
7. The method of claim 1 wherein the motion-controlled playback device is a virtual reality head set.
8. The method of claim 1 wherein the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream, and further comprising receiving a third request for a third set of lower quality tiles from the first segment in the third stream of the immersive video content.
9. The method of claim 1 wherein each predicted view in the prediction profile includes a probability of being viewed by a user and further comprising providing a content provider profile, the content provider profile providing a threshold probability of a predicted view for requesting tiles from the first stream.
10. The method of claim 1 further comprising selecting the prediction profile
corresponding to the immersive video content.
11. The method of claim 10 wherein the selecting is based on one or more of similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device and similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
12. A method for streaming immersive video content for playback in a motion-controlled playback device, the method comprising:
receiving a manifest file for an immersive video content, the manifest file identifying two or more streams for the immersive video content, including a first stream with a higher quality of video than a second stream, wherein each stream of the immersive video content comprises a set of video segments, each video segment comprising a plurality of tiles corresponding to an immersive view of a scene in the video content;
receiving a prediction profile corresponding to the immersive video content, the
prediction profile including one or more predicted views for at least one segment, each predicted view identifying a subset of tiles of the plurality of tiles likely to be viewed by a user;
determining a first set of tiles from a first segment of the immersive video content based on a first predicted view in the prediction profile for the first segment and a second set of tiles from the first segment of the immersive video content, the second set of tiles corresponding to sections of the scene less likely to be viewed by the user;
sending requests for the first set of tiles from the first stream identified in the manifest file and for the second set of tiles from the second stream identified in the manifest file; and
playing back the first segment of the immersive video content.
13. The method of claim 12 wherein the prediction profile is an output of a neural network trained with movement data corresponding to playing back of the immersive video content.
14. The method of claim 13 wherein the movement data includes one or more of head movement data and eye movement data.
15. The method of claim 12 wherein the determining comprises predicting a head movement associated with the first segment of the immersive video content and further comprising sending an actual head movement data for refining the prediction profile with a neural network.
16. The method of claim 12 wherein the immersive video content is a 360° video.
17. The method of claim 12 wherein the motion-controlled playback device is a virtual reality head set.
18. The method of claim 12 wherein the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream, and further comprising sending requests for a third set of lower quality tiles from the first segment in the third stream of the immersive video content.
19. The method of claim 12 wherein each predicted view in the prediction profile includes a probability of being viewed by a user and further comprising receiving a content provider profile, the content provider profile providing a threshold probability of a predicted view for requesting tiles from the first stream.
20. The method of claim 12 further comprising providing a user profile of a user of the motion-controlled playback device for selection of the prediction profile, the prediction profile selected at least in part based on a comparison between a user profile associated with the prediction profile and the user profile of the user of the motion-controlled playback device.
PCT/US2018/038012 2017-06-19 2018-06-18 Predictive content buffering in streaming of immersive video WO2018236715A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762521858P 2017-06-19 2017-06-19
US62/521,858 2017-06-19

Publications (1)

Publication Number Publication Date
WO2018236715A1 true WO2018236715A1 (en) 2018-12-27

Family

ID=64735847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/038012 WO2018236715A1 (en) 2017-06-19 2018-06-18 Predictive content buffering in streaming of immersive video

Country Status (1)

Country Link
WO (1) WO2018236715A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11729407B2 (en) * 2018-10-29 2023-08-15 University Of Washington Saliency-based video compression systems and methods


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098122A1 (en) * 2010-06-07 2017-04-06 Affectiva, Inc. Analysis of image content with associated manipulation of expression presentation
US20170075416A1 (en) * 2015-09-10 2017-03-16 Google Inc. Playing spherical video on a limited bandwidth connection
US20170084073A1 (en) * 2015-09-22 2017-03-23 Facebook, Inc. Systems and methods for content streaming


Similar Documents

Publication Publication Date Title
US20210168397A1 (en) Systems and Methods for Learning Video Encoders
US11601699B2 (en) Predictive content delivery for video streaming services
US20220030244A1 (en) Content adaptation for streaming
US9432702B2 (en) System and method for video program recognition
US11416546B2 (en) Content type detection in videos using multiple classifiers
US10530825B2 (en) Catching up to the live playhead in live streaming
US11120293B1 (en) Automated indexing of media content
US10476943B2 (en) Customizing manifest file for enhancing media streaming
US20180191801A1 (en) Adaptively updating content delivery network link in a manifest file
US20180191586A1 (en) Generating manifest file for enhancing media streaming
KR102255363B1 (en) Apparatus, method and computer program for processing video contents
US9877056B1 (en) Compressed media with still images selected from a video stream
US10091265B2 (en) Catching up to the live playhead in live streaming
WO2017210027A1 (en) Catching up to the live playhead in live streaming
Polakovič et al. Adaptive multimedia content delivery in 5G networks using DASH and saliency information
US20180191799A1 (en) Effectively fetch media content for enhancing media streaming
WO2018236715A1 (en) Predictive content buffering in streaming of immersive video
KR102574353B1 (en) Device Resource-based Adaptive Frame Extraction and Streaming Control System and method for Blocking Obscene Videos in Mobile devices
US10681105B2 (en) Decision engine for dynamically selecting media streams
US20240048807A1 (en) Leveraging insights from real-time media stream in delayed versions
US20230319327A1 (en) Methods, systems, and media for determining perceptual quality indicators of video content items
US11716454B2 (en) Systems and methods for improved delivery and display of 360-degree content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18821051

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18821051

Country of ref document: EP

Kind code of ref document: A1