WO2018236715A1 - Predictive content buffering in streaming of immersive video - Google Patents
- Publication number
- WO2018236715A1 (PCT/US2018/038012)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video content
- immersive video
- tiles
- stream
- profile
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/012—Head tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/013—Eye tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/03—Arrangements for converting the position or the displacement of a member into a coded form
- G06F3/033—Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
- G06F3/038—Control and interface arrangements therefor, e.g. drivers or device-embedded control circuitry
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0481—Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
- G06F3/04815—Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/21—Server components or server architectures
- H04N21/218—Source of audio or video content, e.g. local disk arrays
- H04N21/21805—Source of audio or video content, e.g. local disk arrays enabling multiple viewpoints, e.g. using a plurality of cameras
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/21—Server components or server architectures
- H04N21/218—Source of audio or video content, e.g. local disk arrays
- H04N21/2187—Live feed
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/23439—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements for generating different versions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/239—Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
- H04N21/2393—Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests involving handling client requests
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/251—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/252—Processing of multiple end-users' preferences to derive collaborative data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/258—Client or end-user data management, e.g. managing client capabilities, user preferences or demographics, processing of multiple end-users preferences to derive collaborative data
- H04N21/25866—Management of end-user data
- H04N21/25891—Management of end-user data being end-user preferences
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/262—Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists
- H04N21/26258—Content or additional data distribution scheduling, e.g. sending additional data at off-peak times, updating software modules, calculating the carousel transmission frequency, delaying a video stream transmission, generating play-lists for generating a list of items to be played back in a given order, e.g. playlist, or scheduling item distribution according to such list
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g 3D video
Definitions
- This disclosure generally relates to streaming and playback for immersive video content, and more particularly to an improved predictive streaming approach using artificial neural networks.
- For example, with reference to Fig. 1, using tile-based streaming, VR content is split into several smaller rectangular regions called "tiles," which are separately encoded for each segment 104a, 104b and transmitted over the network in different qualities, such as high-quality tiles 101a-101l, medium-quality tiles 102a-102l, and low-quality tiles 103a-103l.
- With tile-based streaming technology, only the tiles 101e, 101f, 101g, 101i, 101j and 101k for the current viewport 105 have to be transmitted in higher quality, while the tiles for the currently invisible parts of the scene 106 (tiles 103a-103d, 103h, and 103l, or tiles 102a-102d, 102h, and 102l, or a combination of these two lower-quality sets) can be downloaded in a lower quality, leading to bandwidth savings.
- This concept is illustrated in Fig. 1.
- the quality of the downloaded tiles is chosen reactively: the playback software in the client 100 can only start downloading the needed tiles in a first frame 104a when the user actually turns her head and looks at them. This leads to a noticeable delay between the user looking at a specific part of the environment and the playback software adapting to the changed viewport 105. This delay is influenced by different factors like segment length, network round trip time ("RTT") and video buffer size on the client side.
- the playback software might not be able to adapt and download new tiles fast enough, if at all. Until those higher quality tiles are loaded, the user is presented with low quality tiles, leading to a degraded QoE. In some instances, the user may be presented low quality tiled content most of the time.
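The delay factors named above can be combined into a rough back-of-the-envelope estimate. The formula and all numbers below are illustrative assumptions for a reactive client, not values from the disclosure:

```python
# Rough worst-case estimate of the viewport-adaptation delay in a purely
# reactive tile-based streaming client: already-buffered segments must be
# played out, the new tiles cost at least one RTT to fetch, and the quality
# change only takes effect at the next segment boundary.

def reactive_adaptation_delay(segment_len_s, rtt_s, buffered_segments):
    return buffered_segments * segment_len_s + rtt_s + segment_len_s

# Example: 2 s segments, 100 ms RTT, 3 segments buffered on the client.
delay = reactive_adaptation_delay(2.0, 0.1, 3)
print(f"worst-case adaptation delay: {delay:.1f} s")  # prints 8.1 s
```

Even with these modest assumptions the user can stare at low-quality tiles for several seconds, which is the QoE problem the predictive approach targets.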
- a method for streaming immersive video content for playback in a motion-controlled playback device includes providing a manifest file for an immersive video content.
- the manifest file identifies two or more streams for the immersive video content, including a stream with a higher quality of video than another stream.
- Each stream of the immersive video content comprises a set of video segments with a plurality of tiles corresponding to an immersive view of a scene in the video content.
- the method also includes providing a prediction profile corresponding to the immersive video content.
- the prediction profile may include one or more predicted views for at least one segment of the immersive video stream. Each predicted view identifies a subset of tiles likely to be viewed by a user.
- the method includes receiving a request for a set of higher quality tiles from a segment in the first stream of the immersive video content.
- the tiles requested correspond to a predicted view in the prediction profile for the segment requested.
- the method further includes receiving another request for another set of lower quality tiles from the same segment in the lower quality stream of the immersive video content.
- the method may also include receiving head movement data corresponding to a playback of the immersive video content, training a neural network based, at least in part, on the head movement data, and generating the prediction profile corresponding to the immersive video content based on the neural network.
- the method can also include receiving eye-tracking data corresponding to a playback of the immersive video content.
- training the neural network can also be based, at least in part, on the eye-tracking data.
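As a sketch of how such a prediction profile and its per-segment tile partitioning could be represented, consider the following; all class and field names are hypothetical, not taken from the disclosure:

```python
# Hypothetical in-memory shape of a prediction profile: per segment, a list
# of predicted views, each naming the tiles likely to be viewed and a
# confidence that the user actually looks there.

from dataclasses import dataclass, field

@dataclass
class PredictedView:
    tiles: frozenset       # tile IDs likely to be viewed in this segment
    probability: float     # confidence that the user looks here

@dataclass
class PredictionProfile:
    views: dict = field(default_factory=dict)  # segment index -> [PredictedView]

    def tiles_for_segment(self, segment, all_tiles):
        """Split a segment's tiles into (likely, unlikely) sets using the
        highest-confidence predicted view for that segment."""
        best = max(self.views.get(segment, []),
                   key=lambda v: v.probability, default=None)
        likely = best.tiles if best is not None else frozenset()
        return likely, frozenset(all_tiles) - likely

profile = PredictionProfile(views={
    0: [PredictedView(frozenset({"t5", "t6"}), 0.8),
        PredictedView(frozenset({"t1"}), 0.3)],
})
likely, unlikely = profile.tiles_for_segment(0, ["t1", "t5", "t6", "t9"])
print(sorted(likely), sorted(unlikely))  # ['t5', 't6'] ['t1', 't9']
```

The "likely" set maps to the request against the higher-quality stream, the "unlikely" set to the request against the lower-quality stream.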
- a method for streaming immersive video content for playback in a motion-controlled playback device includes receiving a manifest file for an immersive video content and receiving a prediction profile corresponding to the immersive video content.
- the prediction profile may also include predicted views for at least one segment, identifying a subset of tiles in the segment that are likely to be viewed by a user.
- the method also includes determining a set of tiles from a segment of the immersive video content based on the predicted view for that segment and another set of tiles from the same segment corresponding to sections of the scene less likely to be viewed by the user.
- the method also includes sending requests for one set of tiles likely to be viewed from the high-quality stream identified in the manifest file and for the other set of tiles, less likely to be viewed, from the lower-quality stream identified in the manifest file. Then, the method includes playing back the segment of the immersive video content.
- the prediction profile is an output of a neural network trained with movement data corresponding to playback of the immersive video content.
- the movement data used for training may, for example, include head movement data, eye movement data, or both.
- the prediction profile may be refined based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the immersive video content.
- the method may further include predicting a head movement associated with the segment of the immersive video content and sending actual head movement data for refining the prediction profile with a neural network.
- each predicted view may be provided as predicted head movements associated with a segment of the immersive video content.
- the immersive video content may be, for example, a 360° video.
- the motion-controlled playback device may be a virtual reality headset.
- the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream.
- a method can also include receiving yet another request for a set of lower quality tiles from the same segment in the lowest-quality stream of the immersive video content identified in the manifest file.
- the predicted views in the prediction profile can also include a probability of being viewed by a user.
- the method may also include providing a content provider profile that provides a threshold probability of a predicted view for requesting tiles from the high-quality stream.
- the method can also include selecting the prediction profile corresponding to the immersive video content.
- the selection of the prediction profile may be based on similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device.
- the selection may also be based on similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
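The request step of this client-side method can be sketched as follows; the manifest stream fields and the URL scheme are invented for illustration and not part of the disclosure:

```python
# Hypothetical client-side request builder: tiles likely to be viewed are
# fetched from the highest-bitrate stream in the manifest, the remaining
# tiles from the lowest-bitrate one.

def build_requests(streams, segment, likely, unlikely):
    hi = max(streams, key=lambda s: s["bitrate"])   # high-quality stream
    lo = min(streams, key=lambda s: s["bitrate"])   # low-quality stream
    reqs = [(f"{hi['base_url']}/seg{segment}/{t}", "high") for t in sorted(likely)]
    reqs += [(f"{lo['base_url']}/seg{segment}/{t}", "low") for t in sorted(unlikely)]
    return reqs

streams = [{"bitrate": 8000, "base_url": "https://cdn.example/hq"},
           {"bitrate": 1000, "base_url": "https://cdn.example/lq"}]
for url, quality in build_requests(streams, 3, {"t2", "t6"}, {"t1"}):
    print(quality, url)
```

After both request batches complete, the player has every tile of the segment available and can start playback, which is the buffering-in-advance behaviour the disclosure describes.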
- FIG. 1 illustrates an exemplary tile-based streaming approach.
- FIG. 2 is a block diagram illustrating exemplary steps, input and output involved in the training process according to one embodiment of the disclosure.
- FIG. 3 is a block diagram illustrating an exemplary embodiment of a head movement prediction process according to one embodiment.
- FIG. 4 is a block diagram illustrating a neural network refinement approach according to one embodiment.
- FIG. 5 is a diagram illustrating a process for loading prediction data and different quality tiles according to one embodiment.
- the state of the art approach for tile-based streaming has several drawbacks and disadvantages.
- a proactive streaming approach is provided.
- the playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them. Higher quality tiles are downloaded based on a prediction of where the user is likely to look next. Thus, higher quality tiles for the scene the user is currently looking at have already been downloaded in advance, greatly improving the QoE.
- an artificial neural network ("ANN") is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look next for any given input video.
- the predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network (“CDN”) traffic, which in turn lessens the costs for the content provider.
- the network provides the video player with information on which parts of the immersive video content shall be buffered in advance. This allows the use of longer segment durations and a larger buffer on the client side, which helps to significantly improve QoE for tile-based streaming, while simultaneously reducing the amount of data that needs to be sent over the network and thus the costs for the content providers.
- an ANN is trained to produce head movement predictions.
- the training dataset for the ANN consists of a large number of training instances, which in turn consist of a video file and the corresponding head movement tracking data, as recorded by the video playback software.
- Video files that are part of the training dataset preferably have substantially identical spatial dimensions. The video files can be scaled to a specific resolution in a preprocessing step.
- an input 360° video 201 is input into a training system 200.
- a head movement recording module 202 receives motion sensor data indicative of the movement of a user's head.
- a motion-controlled playback device 210 such as a Samsung Gear VR, an Oculus Rift, an HTC Vive, a Google Daydream, or similar, includes motion sensors, such as gyroscopes, accelerometers, and the like, to detect user movement and control playback of the input video 201.
- a display within the device 210 shows a different view of the 360° video.
- the motion control data is provided to the head movement recording module 202.
- the head movement recording module 202 creates a record 203 of the sensed head movement for the given input video 201.
- head movement prediction accuracy may be improved for a given input video 201 by training the network with data that is preferably representative of all types of content and viewer groups.
- the recorded tracking data 203 may be input to a preprocessing module 204 where it can be filtered prior to using it to train the network, and outliers can be detected and removed, generating a set of training data instances 205 prior to training the ANN.
- the training data instances 205 are used in a training module 206 to train the ANN.
- the goal of the training module 206 is to create a prediction model 207 that is able to predict, as accurately as possible, where the viewer is likely to look for any arbitrary input video, regardless of the type or length of the content.
- the accuracy of the learned model 207, and thus of the prediction, depends on the size of the training dataset. Accordingly, larger training sets are preferable.
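As a toy stand-in for this training step, the flow from recorded head-movement traces to a learned prediction parameter can be sketched with a one-parameter gradient-descent model. The disclosure uses an ANN; the simplification below, with invented trace data, only illustrates the training-data-to-model pipeline:

```python
# Toy training step: fit yaw[t+1] ~ yaw[t] + v, i.e. learn a typical
# angular velocity v from recorded head-movement traces by stochastic
# gradient descent on squared prediction error. A real system would
# train a proper neural network instead of a single parameter.

def train_velocity(traces, lr=0.1, epochs=100):
    v = 0.0
    for _ in range(epochs):
        for trace in traces:
            for x, y in zip(trace, trace[1:]):
                v -= lr * 2 * ((x + v) - y)  # gradient of ((x + v) - y)**2
    return v

# Invented yaw angles (degrees) from two playback sessions; both turn
# the head at a steady 1 degree per sample.
traces = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
v = train_velocity(traces)
print(round(v, 3))  # prints 1.0
```

The learned parameter plays the role of the prediction model 207: given a current viewing direction, it yields the expected next one.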
- the learned model of head movements, or prediction profiles 207 may be saved into separate files that are accessible to the ANN module for future streaming during playback sessions.
- the ANN predicted head movements can then be referenced in the stream manifest for streaming during playback.
- the head movement predictions may be referenced in the manifest file in similar fashion to how spatial relationship descriptors ("SRDs") are referenced within MPEG-DASH manifests.
- the predicted head movement data may be directly embedded into the video stream itself.
- Other streaming formats, such as HTTP Live Streaming ("HLS"), Microsoft Smooth Streaming, or the like, may be used in similar fashion.
- content-specific or viewer-type-specific models may be used.
- additional metadata may be collected with respect to a given prediction model to improve the accuracy of the prediction model.
- the metadata may include characteristics for the content itself, of the user, or both.
- user profile information may be collected, including for example, demographic information such as sex, age, nationality, or the like, and/or personal interest information, such as preferred video content category (e.g., action, scenery, travel, and the like), and/or user biometric or physiological information, such as heart rate and blood pressure while consuming immersive content, and/or other metadata, such as manual user annotations.
- Content profile information may alternatively or in addition be used with metadata regarding the content, such as for example, category (action, adventure, etc.), maturity level/rating, actors, director, ratings, and the like.
- the training system 200 extracts the metadata from the input video 201 and includes it in the model 207.
- the metadata may be collected and used to classify head movement data into various model-types. Classification may be done according to known classification techniques, including for example, clustering techniques, similarity metrics (e.g., cosine similarity), or the like.
- the training data 205 can be segregated based on the metadata into the model-types and used to generate different ANNs for each model type.
- an ANN may be generated for male users under 20 with a preference for action video content.
- Corresponding metadata collected from the playback device user is matched against the model metadata to select the appropriate model prior to playback. Matching of the user profile to the prediction model profile may be done according to known similarity metrics, such as for example, the cosine similarity approach.
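The cosine-similarity matching named above can be sketched as follows; the profile feature vectors and model names are invented for illustration:

```python
# Select the prediction model whose profile vector best matches the user's
# profile vector, using cosine similarity as the disclosure suggests.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_model(user_vec, models):
    return max(models, key=lambda m: cosine(user_vec, m["profile"]))

# Hypothetical feature vectors: [normalized age, likes action, likes scenery]
models = [
    {"name": "young_action", "profile": [0.2, 1.0, 0.0]},
    {"name": "older_scenery", "profile": [0.8, 0.0, 1.0]},
]
best = select_model([0.25, 0.9, 0.1], models)
print(best["name"])  # prints young_action
```

The same matching can be applied on the content side, comparing a content profile vector against the content metadata attached to each model.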
- content-related metadata is signaled to the playback device which guides the client in selecting the right tiles.
- the metadata could be related to visual content, e.g., saliency maps or metadata added during the production process (such as forced perspective).
- the metadata could alternatively or in addition also be related to audio content, e.g., 3D audio coming from "behind" the user, which may trigger the user to look back. All this information is backed by an ANN which provides information on how users have behaved in the past.
- the prediction accuracy of the ANN can further be improved by taking into account eye-tracking data to get an even more accurate impression of where the user is looking for a given video with respect to the playback time.
- the head movement module 202 further includes a camera sensor module configured to detect and track the movement of the user's eyes during playback of the input video 201.
- an additional eye-tracking module (not shown) may be provided.
- training system 200 may be implemented in different configurations.
- training system 200 is a distributed system implemented with a user base and learning through use of the system. For example, each time a user plays a 360° video, the playback device records the user's head movement and sends the recorded head movement data 203 to a remote training server.
- the preprocessing module 204 filters and classifies the data and can aggregate data from multiple users for a given 360° video.
- the resulting training set or sets 205 can be classified based on user profile types and types of content to generate corresponding models 207. As further described below, the models 207 can be continuously updated as more user data is received at the server.
- the output of the training process is a learned prediction model 307.
- This model 307 is then used to predict the head movement profile for any given input video 301.
- the input video 301 for the prediction process preferably has the same spatial dimensions as the videos 201 of the training dataset, therefore the video may be scaled in a preprocessing module 302 before it is fed into the ANN module 303 for prediction.
- the output of the prediction process is the predicted head movement profile 305 for the given input video 301, which represents the regions of interest where the viewer is likely to look with respect to the playback time of the video.
- the predicted head movement profile may be converted into a format that is supported by the playback software.
- a neural network refinement approach 400 is provided according to one embodiment.
- input video 401 is provided to an ANN module 403 that has been trained.
- the ANN module 403 is used to predict head movement profiles 405 based on the applicable model 407.
- playback software 408 can be used to record the actual head movements of the user for a given video file into a feedback record 409.
- An ANN training module 406 receives the recorded actual head movements 409 for comparison with the predicted head movements 405. If, for any given scene, the recorded actual head movements 409 differ from the predicted movements 405, the newly recorded data 409 is used to refine the learned model 407, thus gradually improving the prediction accuracy of the ANN.
- the ANN learns and optimizes to improve its predictive output.
- the refinement approach may be implemented in a distributed fashion between the client and the server, or may be fully implemented in the client, which provides the refined model back to the server to be used with other users playing the given video 401.
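The refinement loop can be sketched as a comparison of predicted against observed movement, updating the model only when the prediction misses. The single-parameter model, learning rate, and tolerance below are illustrative stand-ins for the ANN update:

```python
# Toy refinement step: nudge a learned angular-velocity parameter toward
# the observed head movement whenever the prediction error exceeds a
# tolerance, mirroring the compare-then-refine loop of Fig. 4.

def refine(model_v, actual_deltas, lr=0.05, tolerance=0.1):
    v = model_v
    for d in actual_deltas:
        if abs(d - v) > tolerance:   # prediction missed for this scene
            v += lr * (d - v)        # move the model toward reality
    return v

print(refine(1.0, [1.0, 1.05]))  # within tolerance, model unchanged: 1.0
print(refine(1.0, [2.0]))        # nudged toward the observed 2.0: 1.05
```

Run client-side, the updated parameter (or, in the real system, the updated network weights) is what gets reported back to the server for other viewers of the same video.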
- a process 500 for loading prediction data and different quality tiles is described according to one embodiment.
- the client playback software 501 initially loads 510 the stream manifest 502 for the immersive video and preferably checks whether a head movement profile 507 is available for the given stream. If that is the case, the player 501 requests the file containing the prediction data 507 and parses it. Given the current viewing direction 505 of the viewer and the predicted head movement profile 506, the player uses a heuristic to determine which region of interest the viewer will probably look at next.
- a prediction profile 506 may include several predictions with different confidence levels for where the user may look at in a given segment. As for example illustrated in Fig.
- a primary high confidence prediction 506a identifies a first set of tiles that correspond to the most likely viewing direction 505 for that segment.
- another set of tiles are identified by secondary predictions 506b and 506c, with lower confidence values.
- the player requests 512a the tiles for segment 1 and the server streams 514a the corresponding tiles.
- all the tiles that are part of a region of interest 505 corresponding to the highest confidence prediction 506a are then loaded in the highest quality while tiles for secondary viewing directions corresponding to lower confidence predictions 506b and 506c, are loaded at lower qualities.
- the remaining tiles corresponding to parts of the scene predicted not to be viewed by the user may be loaded a yet a lower quality or not loaded at all, for example, depending on player settings or capabilities, or network conditions.
- content provider profiles may also be used to control the streaming operations at the client device 501.
- the player 501 may decide to request 512 different quality tiles for regions of interest 505 that are less likely to be looked at by the viewer, based on lower confidence predictions 506b and 506c.
- the quality levels used for the different confidence level predictions 506 is determined based on the content provider profile providing the
- the content provider values a good user experience ("UX") more than bandwidth savings, it may allow the player to download the highest quality tiles for all regions 505 that have some likelihood of being looked at in the current video segment above a threshold.
- tiles corresponding to predictions 506a and 506b which in this embodiment are above the content provider threshold, may be obtained 512 at the highest quality available, while tiles for the predicted view corresponding to prediction 506c, with a confidence level below the content-provider threshold, are obtained 512 at a medium quality.
- the number of tiles with probabilities above the threshold may include multiple tiles, depending on the type of the content.
- the content provider may decide to let the player only load the tiles of the region with the highest likelihood of being looked at, in the highest quality for any segment, while loading all other tiles with lower qualities. This approach would minimize the content delivery network (“CDN”) traffic and costs.
- CDN content delivery network
- the UX could further be improved by combining the head movement prediction profiles with the scalable extension of FIEVC (SHVC).
- SHVC scalable extension of FIEVC
- the playback software could then load all tiles in the lowest quality first and keep loading enhancement layers for tiles that are likely to be looked at. This would also reduce the time needed to load an enhanced version of a lower-quality tile when the viewer starts looking at it.
Abstract
Bandwidth requirements for immersive video content playback are reduced, while keeping the QoE at a satisfactory level and minimizing playback disruptions, with a tile-based adaptive streaming algorithm. A playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them, based on a prediction of where the user is likely to look next. An artificial neural network ("ANN") is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look next for any given input video. The predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network ("CDN") traffic, which in turn lessens the costs for the content provider.
Description
IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
International Patent Application
By
Lukas Kroepfl, Wolfram Hofmeister, Mario Graf, Daniel Weinberger, Christopher Mueller,
Reinhard Grandl, and Stefan Lederer
TITLE
Predictive Content Buffering in Streaming of Immersive Video
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Patent Application No. 62/521,858, filed on June 19, 2017, the contents of which are hereby incorporated by reference.
BACKGROUND
[0001] This disclosure generally relates to streaming and playback for immersive video content, and more particularly to an improved predictive streaming approach using artificial neural networks.
[0002] With the advent of immersive video applications using, for example, 360° high definition, virtual reality (VR), augmented reality (AR), and other immersive video content, and as immersive-video-capable playback devices, such as the Samsung Gear VR, Oculus Rift, HTC Vive, or Google Daydream, become more affordable, the streaming of immersive video content over the Internet is becoming a necessity. In contrast to streaming non-immersive video content, the quality demands of immersive video are substantially higher. For example, where full HD resolution is sufficient for playing back conventional video content, VR content looks very poor when streamed at this resolution. In addition, video data covering the full 360° panorama has to be streamed for a proper quality of experience (QoE), even though the viewport visible to the user at any one point in time covers only a small area of the full video frame. Therefore, much higher resolutions are needed for immersive video content to achieve the same perceptual video quality as for conventional videos.
[0003] As many playback devices 100 today are mobile devices, such as tablets,
smartphones, laptops, and the like, which are usually connected to a network server 110 over an unreliable wireless connection with widely variable network conditions, transmitting high-quality video over the network poses a great challenge. The bandwidth requirements needed for immersive video content playback are so high that a satisfactory playback quality can only rarely be achieved under real-life network conditions. To cope with this problem, a solution called tile-based streaming has been proposed. For example, with reference to Fig. 1, using tile-based streaming, VR content is split into several smaller rectangular regions called "tiles," which are separately encoded for each segment 104a,b and transmitted over the network in different qualities, such as high-quality tiles 101a-l, medium-quality tiles 102a-l, and low-quality tiles 103a-l. Using the tile-based streaming technology, only the tiles 101e, 101f, 101g, 101i, 101j, and 101k for the current viewport 105 have to be transmitted in higher quality, while the tiles for the currently invisible parts of the scene 106 (tiles 103a-103d, 103h, and 103l, or tiles 102a-102d, 102h, and 102l, or a combination of these two lower-quality sets) can be downloaded in a lower quality, leading to bandwidth savings. This concept is illustrated in Fig. 1.
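For illustration, the viewport-driven quality selection described above can be sketched as follows. This is a minimal sketch, not part of the disclosure; the 3x4 tile grid, the quality labels, and the function name are assumptions made for the example.

```python
# Sketch of tile-quality selection for one segment: tiles inside the
# current viewport are fetched in high quality, all others in low quality.
# Grid size and quality labels are illustrative only.

def select_tile_qualities(grid_rows, grid_cols, viewport_tiles):
    """Map every (row, col) tile to a quality based on viewport membership."""
    qualities = {}
    for r in range(grid_rows):
        for c in range(grid_cols):
            qualities[(r, c)] = "high" if (r, c) in viewport_tiles else "low"
    return qualities

# A viewport covering six tiles, analogous to the Fig. 1 example.
viewport = {(0, 1), (0, 2), (1, 1), (1, 2), (2, 1), (2, 2)}
plan = select_tile_qualities(3, 4, viewport)
```

Only the six viewport tiles are assigned high quality; the remaining six tiles of the 3x4 grid are fetched in low quality, which is the source of the bandwidth savings.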
[0004] In the prior art approach for tile-based streaming, the quality of the downloaded tiles is chosen reactively: the playback software in the client 100 can only start downloading the needed tiles in a first segment 104a when the user actually turns her head and looks at them. This leads to a noticeable delay between the user looking at a specific part of the environment and the playback software adapting to the changed viewport 105. This delay is influenced by different factors such as segment length, network round-trip time ("RTT"), and video buffer size on the client side. When watching immersive video content that requires rapid head
movement, like an action-heavy scene, the playback software might not be able to adapt and download new tiles fast enough, if at all. Until those higher quality tiles are loaded, the user is presented with low-quality tiles, leading to a degraded QoE. In some instances, the user may be presented with low-quality tiled content most of the time.
[0005] There are several approaches proposed to reduce this delay problem, but they all come with new disadvantages. Some have proposed reducing the buffer size on the playback device. With the amount of video buffered on the client reduced, higher quality tiles can be loaded slightly earlier without the need to discard already buffered data. However, with a reduced buffer size on the client device, the video playback is more likely to stall and stutter when network conditions are not ideal or are fluctuating. Due to the intrinsically unpredictable nature of wireless networks, bandwidth conditions may change rapidly enough that keeping a steady, although small, buffer level is virtually impossible.
Furthermore, smaller segment sizes could reduce the end-to-end delay but would also increase the bitrates needed to encode the videos: for a video of the same length there are more segments, and therefore more intra-frames, and this higher intra-frame frequency requires increased bitrates. Keeping this in mind, it is most likely that current state-of-the-art technologies will fail to deliver a good QoE for immersive video content over wireless networks.
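The bitrate penalty of shorter segments can be made concrete with rough arithmetic. This is a sketch under stated assumptions; the base bitrate and per-intra-frame cost are illustrative numbers, not measured values from the disclosure.

```python
# Rough arithmetic behind the segment-size trade-off: every segment begins
# with an intra-coded frame, so halving the segment length doubles the
# intra-frame rate and raises the total bitrate. All numbers are assumed.

def total_bitrate_kbps(base_kbps, intra_extra_kbits, segment_seconds):
    """Base stream bitrate plus the per-second cost of one intra-frame per segment."""
    return base_kbps + intra_extra_kbits / segment_seconds

long_segments = total_bitrate_kbps(4000, 400, 4.0)   # one intra-frame every 4 s
short_segments = total_bitrate_kbps(4000, 400, 1.0)  # one intra-frame every 1 s
```

Under these assumed numbers, shrinking segments from 4 s to 1 s raises the required bitrate from 4100 kbps to 4400 kbps, illustrating why smaller segments alone are not a satisfactory fix.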
BRIEF SUMMARY
[0006] In one embodiment, a method for streaming immersive video content for playback in a motion-controlled playback device includes providing a manifest file for an immersive video content. The manifest file identifies two or more streams for the immersive video content, including a first stream with a higher quality of video than a second stream. Each stream of the immersive video content comprises a set of video segments with a plurality of tiles
corresponding to an immersive view of a scene in the video content. The method also includes providing a prediction profile corresponding to the immersive video content. The prediction profile may include one or more predicted views for at least one segment of the immersive video stream. Each predicted view identifies a subset of tiles likely to be viewed by a user. In this embodiment, the method includes receiving a request for a set of higher quality tiles from a segment in the first stream of the immersive video content. The tiles requested correspond to a predicted view in the prediction profile for the segment requested. The method further includes receiving another request for another set of lower quality tiles from the same segment in the lower quality stream of the immersive video content.
[0007] According to one aspect of some embodiments, the method may also include receiving head movement data corresponding to a playback of the immersive video content, training a neural network based, at least in part, on the head movement data, and generating the prediction profile corresponding to the immersive video content based on the neural network. In addition, in some embodiments, the method can also include receiving eye-tracking data corresponding to a playback of the immersive video content. In such embodiments, training the neural network can also be based, at least in part, on the eye-tracking data.
[0008] In another embodiment, a method for streaming immersive video content for playback in a motion-controlled playback device includes receiving a manifest file for an immersive video content and receiving a prediction profile corresponding to the immersive video content. In this embodiment, the prediction profile may also include predicted views for at least a segment that identifies a subset of tiles in the segment that are likely to be viewed by a user. In this embodiment, the method also includes determining a set of tiles from a segment of the immersive video content based on the predicted view for that segment and another set of tiles from the same segment corresponding to sections of the scene less likely to be viewed
by the user. The method also includes sending requests for one set of tiles likely to be viewed from the high-quality stream identified in the manifest file and for the other set of tiles, less likely to be viewed, from the lower-quality stream identified in the manifest file. Then, the method includes playing back the segment of the immersive video content.
[0009] According to various embodiments, the prediction profile is an output of a neural network trained with movement data corresponding to playback of the immersive video content. The movement data used for training may, for example, include head movement data, eye movement data, or both.
[0010] According to another aspect of some embodiments, the prediction profile may be refined based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the immersive video content. According to one embodiment, the method may further include predicting a head movement associated with the segment of the immersive video content and sending actual head movement data for refining the prediction profile with a neural network.
[0011] In some embodiments, each predicted view may be provided as predicted head movements associated with a segment of the immersive video content. The immersive video content may be, for example, a 360° video. According to another aspect of some
embodiments, the motion-controlled playback device may be a virtual reality headset.
[0012] In one embodiment, the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream. In this embodiment, a method can also include receiving yet another request for a set of lower quality tiles from the same segment in the lowest-quality stream of the immersive video content identified in the manifest file.
[0013] According to another aspect of various embodiments, the predicted views in the prediction profile can also include a probability of being viewed by a user. In such embodiments, the method may also include providing a content provider profile that provides a threshold probability of a predicted view for requesting tiles from the high-quality stream.
[0014] According to another aspect of some embodiments, the method can also include selecting the prediction profile corresponding to the immersive video content. In such embodiments, the selection of the prediction profile may be based on similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device. In addition, or as an alternative, the selection may also be based on similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
[0015] FIG. 1 illustrates an exemplary tile-based streaming approach.
[0016] FIG. 2 is a block diagram illustrating exemplary steps, input and output involved in the training process according to one embodiment of the disclosure.
[0017] FIG. 3 is a block diagram illustrating an exemplary embodiment of a head movement prediction process according to one embodiment.
[0018] FIG. 4 is a block diagram illustrating a neural network refinement approach according to one embodiment.
[0019] FIG. 5 is a diagram illustrating a process for loading prediction data and different quality tiles according to one embodiment.
[0020] The figures depict various example embodiments of the present disclosure for purposes of illustration only. One of ordinary skill in the art will readily recognize from the following discussion that other example embodiments based on alternative structures and
methods may be implemented without departing from the principles of this disclosure and which are encompassed within the scope of this disclosure.
DETAILED DESCRIPTION
[0021] The Figures and the following description describe certain embodiments by way of illustration only. One of ordinary skill in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein. Reference will now be made in detail to several embodiments, examples of which are illustrated in the
accompanying figures.
[0022] The above and other needs are met by the disclosed methods, a non-transitory computer-readable storage medium storing executable code, and systems for streaming and playing back immersive video content.
[0023] As described before, the state of the art approach for tile-based streaming has several drawbacks and disadvantages. To overcome these problems, a proactive streaming approach is provided. According to one embodiment of the present invention, a mechanism is provided that reduces the bandwidth requirements for immersive video content playback while keeping the QoE on a satisfactory level, minimizing playback disruptions. According to one embodiment, the playback device implements a tile-based streaming algorithm that downloads higher quality tiles before the user looks at them. Higher quality tiles are downloaded based on a prediction of where the user is likely to look at next. Thus, higher quality tiles for the scene the user is currently looking at have already been downloaded in advance, greatly improving the QoE.
[0024] According to one embodiment, an artificial neural network ("ANN") is trained with recorded viewing data for immersive video content. This ANN is then used to predict where the user is likely to look at next for any given input video. According to another aspect of
embodiments of the invention, the predictive streaming reduces bandwidth requirements for streaming immersive video and reduces content delivery/distribution network ("CDN") traffic, which in turn lessens the costs for the content provider. The network provides the video player with information on which parts of the immersive video content shall be buffered in advance. This allows the use of longer segment durations and a larger buffer on the client side, which helps to significantly improve QoE for tile-based streaming, while simultaneously reducing the amount of data that needs to be sent over the network and thus the costs for the content providers.
[0025] According to one embodiment, an ANN is trained to produce head movement predictions. In one embodiment, the training dataset for the ANN consists of a large number of training instances, which in turn consist of a video file and the corresponding head movement tracking data, as recorded by the video playback software. Video files that are part of the training dataset preferably have substantially identical spatial dimensions. The video files can be scaled to a specific resolution in a preprocessing step.
[0026] Referring to FIG. 2, an overview of the training process in a training system 200 is presented. According to one embodiment, an input 360° video 201 is input into a training system 200. A head movement recording module 202 receives motion sensor data indicative of the movement of a user's head. For example, a motion-controlled playback device 210, such as a Samsung Gear VR, an Oculus Rift, an HTC Vive, a Google Daydream, or similar, includes motion sensors, such as gyroscopes, accelerometers, and the like, to detect user movement and control playback of the input video 201. For example, as the user turns his or her head to one side or another (left, right, up, or down) a display within the device 210 shows a different view of the 360° video. The motion control data is provided to the head movement recording module 202. The head movement recording module 202 creates a record 203 of the sensed head movement for the given input video 201. According to this
embodiment, head movement prediction accuracy may be improved for a given input video 201 by training the network with data that is preferably representative of all types of content and viewer groups. Therefore, the recorded tracking data 203 may be input to a preprocessing module 204, where it can be filtered and outliers can be detected and removed, generating a set of training data instances 205 prior to training the ANN.
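The outlier-removal step of the preprocessing module 204 might be sketched as follows. This is a minimal sketch under assumptions: the head movement records are treated as per-sample yaw angles in degrees, and the plausibility bound of 120 degrees per sample is an assumed heuristic, not a value from the disclosure.

```python
# Sketch of training-set preprocessing: recorded yaw traces whose
# frame-to-frame jumps exceed a plausibility bound are dropped as
# outliers before training. The 120 deg/sample bound is assumed.

def filter_outlier_traces(traces, max_jump_deg=120.0):
    """Keep only head-yaw traces with no implausibly large jump between samples."""
    kept = []
    for trace in traces:
        jumps = [abs(b - a) for a, b in zip(trace, trace[1:])]
        if all(j <= max_jump_deg for j in jumps):
            kept.append(trace)
    return kept

traces = [
    [0.0, 5.0, 12.0, 20.0],    # smooth head turn: kept
    [0.0, 170.0, -10.0, 0.0],  # sensor glitch: dropped
]
clean = filter_outlier_traces(traces)
```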
[0027] Once the preprocessing is finished, the training data instances 205 are used in a training module 206 to train the ANN. The goal of the training module 206 is to create a prediction model 207 that is able to predict, as accurately as possible, where the viewer is likely to look for any arbitrary input video, regardless of the type of content or length. The accuracy of the learned model 207, and thus the prediction, depends on the size of the training dataset. Accordingly, larger training sets are preferable. According to one embodiment, the learned models of head movements, or prediction profiles 207, may be saved into separate files that are accessible to the ANN module for future streaming during playback sessions. In one embodiment, the ANN-predicted head movements can then be referenced in the stream manifest for streaming during playback. For example, in one embodiment in which the video is streamed according to the MPEG-DASH protocol (i.e., according to ISO/IEC 23009-1, incorporated herein by reference), the head movement predictions may be referenced in the manifest file in a similar fashion to how spatial relationship descriptors ("SRDs") are referenced within MPEG-DASH manifests.
According to an alternative embodiment, the predicted head movement data may be directly embedded into the video stream itself. One of ordinary skill in the art will readily recognize that other approaches to providing the head prediction data may be used, including embodiments based on HTTP Live Streaming (HLS), Microsoft Smooth Streaming, or the like.
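For illustration, a manifest reference of this kind might be located by the player as sketched below. This is a hypothetical example: the scheme URI "urn:example:prediction:2018" and the attribute layout are invented for the sketch and are not part of MPEG-DASH or of the disclosure; real SRD signaling uses different, standardized scheme URIs.

```python
# Sketch of locating a prediction-profile reference in a DASH-style
# manifest, loosely modeled on how SRDs use SupplementalProperty elements.
# The scheme URI and attribute layout are hypothetical.
import xml.etree.ElementTree as ET

MPD = """<MPD><Period><SupplementalProperty
  schemeIdUri="urn:example:prediction:2018"
  value="profiles/video123.predict"/></Period></MPD>"""

def find_prediction_profile(mpd_xml, scheme="urn:example:prediction:2018"):
    """Return the prediction-profile URL referenced in the manifest, if any."""
    root = ET.fromstring(mpd_xml)
    for prop in root.iter("SupplementalProperty"):
        if prop.get("schemeIdUri") == scheme:
            return prop.get("value")
    return None

profile_url = find_prediction_profile(MPD)
```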
[0028] According to alternative embodiments, content-specific or viewer-type-specific models may be used. For example, additional metadata may be collected with respect to a given prediction model to improve the accuracy of the prediction model. The metadata may include characteristics for the content itself, of the user, or both. For example, when generating learned models 207 in the training system 200, user profile information may be collected, including for example, demographic information such as sex, age, nationality, or the like, and/or personal interest information, such as preferred video content category (e.g., action, scenery, travel, and the like), and/or user biometric or physiological information, such as heartrate and blood pressure while consuming immersive content, and/or other metadata, such as manual user annotations.
[0029] Content profile information may alternatively or in addition be used with metadata regarding the content, such as, for example, category (action, adventure, etc.), maturity level/rating, actors, director, and the like. In one embodiment, the training system 200 extracts the metadata from the input video 201 and includes it in the model 207. The metadata may be collected and used to classify head movement data into various model-types. Classification may be done according to known classification techniques, including, for example, clustering techniques, similarity metrics (e.g., cosine similarity), or the like. The training data 205 can be segregated based on the metadata into the model-types and used to generate different ANNs for each model-type. For example, an ANN may be generated for male users under 20 with a preference for action video content. Corresponding metadata collected from the playback device user is matched against the model metadata to select the appropriate model prior to playback. Matching of the user profile to the prediction model profile may be done according to known similarity metrics, such as, for example, the cosine similarity approach. In one embodiment, content-related metadata is signaled to the playback device, which guides the client in selecting the right tiles. For example, the metadata could be
related to visual content, e.g., saliency maps or metadata added during the production process (such as forced perspective). The metadata could alternatively or additionally be related to audio content, e.g., 3D audio coming from "behind" the user, which may trigger the user to look back. All this information is backed up by an ANN, which provides information about the user's past behavior.
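The cosine-similarity matching mentioned above might be sketched as follows. This is a minimal sketch under assumptions: the three-feature profile encoding (age bucket, action preference, scenery preference) and the model names are illustrative, not drawn from the disclosure.

```python
# Sketch of matching a viewer to the closest model-type profile by
# cosine similarity. The feature encoding is an assumed example.
import math

def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_model(user_vec, model_profiles):
    """Pick the model-type whose profile vector is most similar to the user's."""
    return max(model_profiles,
               key=lambda name: cosine_similarity(user_vec, model_profiles[name]))

models = {
    "young_action": [1.0, 1.0, 0.0],
    "older_scenery": [0.0, 0.0, 1.0],
}
choice = best_model([0.9, 0.8, 0.1], models)
```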
[0030] According to another embodiment, the prediction accuracy of the ANN can further be improved by taking into account eye-tracking data to get an even more accurate impression of where the user is looking for a given video, with respect to the playback time. In this embodiment, the head movement module 202 further includes a camera sensor module configured to detect and track the movement of the user's eyes during playback of the input video 201. Alternatively, an additional eye-tracking module (not shown) may be provided.
[0031] It should be noted that training system 200 may be implemented in different configurations. In one embodiment, training system 200 is a distributed system implemented with a user base and learning through use of the system. For example, each time a user plays a 360° video, the playback device records the user's head movement and sends the recorded head movement data 203 to a remote training server. At the server, the preprocessing module 204 filters and classifies the data and can aggregate data from multiple users for a given 360° video. The resulting training set or sets 205 can be classified based on user profile types and types of content to generate corresponding models 207. As further described below, the models 207 can be continuously updated as more user data is received at the server.
[0032] Referring now to FIG. 3, an exemplary embodiment of a prediction process 300 is provided. According to this embodiment, the output of the training process is a learned prediction model 307. This model 307 is then used to predict the head movement profile for any given input video 301. The input video 301 for the prediction process preferably has the same spatial dimensions as the videos 201 of the training dataset; therefore, the video may be
scaled in a preprocessing module 302 before it is fed into the ANN module 303 for prediction. The output of the prediction process is the predicted head movement profile 305 for the given input video 301, which represents the regions of interest where the viewer is likely to look with respect to the playback time of the video. In one embodiment, in a post-processing module 304, the predicted head movement profile may be converted into a format that is supported by the playback software.
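The three stages of the prediction pipeline might be sketched as below. This is a structural sketch only: the model is a stand-in stub rather than a trained ANN, and the training resolution, per-frame yaw output, and segment grouping are all assumptions made for the example.

```python
# Structural sketch of the Fig. 3 pipeline: scale the input video to the
# training resolution, run the learned model, and convert the raw output
# into per-segment predictions for the player. The model is a stub.

TRAIN_W, TRAIN_H = 1024, 512  # assumed training resolution

def scale_frames(frames, w=TRAIN_W, h=TRAIN_H):
    """Stand-in for the preprocessing module 302: tag frames with the target size."""
    return [(w, h, f) for f in frames]

def stub_model(scaled_frames):
    """Stand-in for the ANN module 303: predicts a fixed yaw angle per frame."""
    return [30.0 for _ in scaled_frames]

def to_profile(yaws, segment_len=2):
    """Stand-in for post-processing 304: group per-frame yaws into segments."""
    return [yaws[i:i + segment_len] for i in range(0, len(yaws), segment_len)]

profile = to_profile(stub_model(scale_frames(["f0", "f1", "f2", "f3"])))
```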
[0033] Referring now to FIG. 4, a neural network refinement approach 400 is provided according to one embodiment. According to this embodiment, an input video 401 is provided to an ANN module 403 that has been trained. The ANN module 403 is used to predict head movement profiles 405 based on the applicable model 407. In this embodiment, playback software 408 can be used to record the actual head movements of the user for a given video file into a feedback record 409. An ANN training module 406 receives the recorded actual head movements 409 for comparison with the predicted head movements 405. If, for any given scene, the recorded actual head movements 409 differ from the predicted movements 405, the newly recorded data 409 is used to refine the learned model 407, thus gradually improving the prediction accuracy of the ANN. As additional users view the immersive content video 401, the ANN learns and optimizes to improve its predictive output. According to alternatives of this embodiment, the refinement approach may be implemented in a distributed fashion between the client and the server, or may be fully implemented in the client, which provides the refined model back to the server to be used with other users playing the given video 401.
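The feedback-selection step of this refinement loop might be sketched as follows. This is a minimal sketch under assumptions: predictions and recordings are reduced to one yaw angle per scene, and the 15-degree divergence threshold is an assumed heuristic, not a value from the disclosure.

```python
# Sketch of the Fig. 4 refinement step: per-scene recordings that diverge
# from the prediction are collected as new training data for the model.
# The divergence threshold is an assumed heuristic.

def collect_refinement_data(predicted, actual, threshold_deg=15.0):
    """Return the per-scene recordings that differ enough to trigger retraining."""
    feedback = []
    for scene, pred_yaw in predicted.items():
        act_yaw = actual.get(scene)
        if act_yaw is not None and abs(act_yaw - pred_yaw) > threshold_deg:
            feedback.append((scene, act_yaw))
    return feedback

predicted = {"scene1": 10.0, "scene2": 90.0}
actual = {"scene1": 12.0, "scene2": 140.0}  # scene2 diverges from prediction
retrain_set = collect_refinement_data(predicted, actual)
```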
[0034] Referring now to FIG. 5, a process 500 for loading prediction data and different quality tiles is described according to one embodiment. According to this embodiment, when the client playback software 501 initially loads 510 the stream manifest 502 for the immersive video, preferably it checks whether a head movement profile 507 is available for
the given stream. If that is the case, the player 501 requests the file containing the prediction data 507 and parses it. Given the current viewing direction 505 of the viewer and the predicted head movement profile 506, the player uses a heuristic to determine which region of interest the viewer will probably look at next. For example, a prediction profile 506 may include several predictions with different confidence levels for where the user may look in a given segment. As illustrated, for example, in Fig. 5, for Segment 1, a primary high-confidence prediction 506a identifies a first set of tiles that correspond to the most likely viewing direction 505 for that segment. In addition, another set of tiles is identified by secondary predictions 506b and 506c, with lower confidence values. Based on the predicted viewing directions, the player requests 512a the tiles for Segment 1 and the server streams 514a the corresponding tiles. For each segment, all the tiles that are part of a region of interest 505 corresponding to the highest confidence prediction 506a are then loaded in the highest quality, while tiles for secondary viewing directions, corresponding to lower confidence predictions 506b and 506c, are loaded at lower qualities. The remaining tiles, corresponding to parts of the scene predicted not to be viewed by the user, may be loaded at a yet lower quality or not loaded at all, for example, depending on player settings or capabilities, or network conditions.
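The per-segment loading heuristic just described might be sketched as follows. This is a minimal sketch under assumptions: confidence values, tile identifiers, and the three quality labels are illustrative, and overlap handling (higher-confidence predictions winning) is an assumption made for the example.

```python
# Sketch of the per-segment loading heuristic: tiles from the highest-
# confidence prediction get the best quality, tiles from secondary
# predictions get a medium quality, everything else gets the lowest.

def plan_segment(predictions, all_tiles):
    """predictions: list of (confidence, tile_set); returns tile -> quality."""
    plan = {t: "lowest" for t in all_tiles}
    ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
    for rank, (_, tiles) in enumerate(ordered):
        quality = "high" if rank == 0 else "medium"
        for t in tiles:
            if plan[t] == "lowest":  # keep the higher-confidence assignment on overlap
                plan[t] = quality
    return plan

# Illustrative predictions analogous to 506a (primary) and 506b/506c (secondary).
preds = [(0.8, {"t1", "t2"}), (0.15, {"t3"}), (0.05, {"t4"})]
plan = plan_segment(preds, {"t1", "t2", "t3", "t4", "t5"})
```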
[0035] According to one embodiment, content provider profiles (not shown) may also be used to control the streaming operations at the client device 501. For example, as described above, the player 501 may decide to request 512 different quality tiles for regions of interest 505 that are less likely to be looked at by the viewer, based on the lower confidence predictions 506b and 506c. In this embodiment, the quality levels used for the different confidence level predictions 506 are determined based on the content provider profile providing the
requirements of the content provider. If, for example, the content provider values a good user experience ("UX") more than bandwidth savings, it may allow the player to download the
highest quality tiles for all regions 505 whose likelihood of being looked at in the current video segment is above a threshold. For example, tiles corresponding to predictions 506a and 506b, which in this embodiment are above the content provider threshold, may be obtained 512 at the highest quality available, while tiles for the predicted view corresponding to prediction 506c, with a confidence level below the content provider threshold, are obtained 512 at a medium quality. Multiple tiles may have probabilities above the threshold, depending on the type of content. For example, with VR content that contains mostly static scenes that do not require large head movements, the content provider may decide to let the player load only the tiles of the region with the highest likelihood of being looked at, in the highest quality, for any segment, while loading all other tiles at lower qualities. This approach would minimize the content delivery network ("CDN") traffic and costs.
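The threshold decision driven by the content provider profile can be sketched as follows; the function name, the quality labels, and the 0.5 threshold are illustrative assumptions only.

```python
def qualities_by_provider_profile(confidences, threshold):
    """Per-prediction quality choice driven by a content provider profile:
    predictions at or above the threshold get the highest quality, all
    others a medium quality."""
    return {
        pred: ("highest" if conf >= threshold else "medium")
        for pred, conf in confidences.items()
    }

# 506a and 506b sit above an assumed provider threshold of 0.5; 506c below.
choice = qualities_by_provider_profile(
    {"506a": 0.8, "506b": 0.6, "506c": 0.2}, threshold=0.5)
```

A provider that values UX over bandwidth would set a low threshold (more predictions qualify for the highest quality); a cost-sensitive provider would raise it.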
[0036] According to another embodiment, the UX could further be improved by combining the head movement prediction profiles with the scalable extension of HEVC (SHVC). The playback software could then load all tiles in the lowest quality first and keep loading enhancement layers for tiles that are likely to be looked at. This would also reduce the time needed to load an enhanced version of a lower-quality tile when the viewer starts looking at it.
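A minimal sketch of the download ordering under such SHVC-style layering follows. The function and the layer count are illustrative assumptions; real SHVC enhancement-layer signaling is defined by the codec, not by this sketch.

```python
def schedule_layer_downloads(tiles, likely_tiles, max_layers=3):
    """Download order under SHVC-style layering: the base layer (layer 0)
    for every tile first, then enhancement layers only for tiles that
    are likely to be looked at."""
    schedule = [(t, 0) for t in tiles]                    # base layer for all
    for layer in range(1, max_layers):
        schedule += [(t, layer) for t in tiles if t in likely_tiles]
    return schedule

# Three tiles; only tile 2 is predicted to be looked at.
order = schedule_layer_downloads([1, 2, 3], likely_tiles={2})
```

Because every tile has at least its base layer, an unexpected head turn never reveals a missing tile, only a temporarily lower-quality one.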
[0037] According to another embodiment, the prediction model described herein can also be used for streaming technologies that do not rely on tiles. When no tiles are used, preferably, several versions of the same video exist. Each version is encoded such that one or more certain parts of the video are at a higher quality than other parts. According to one embodiment, the ANN-based prediction of regions of interest in a video can then be used to determine which parts of the different versions should be encoded at which quality. According to one embodiment, the ANN-based prediction is also (or alternatively)
used to decide which video segment should be loaded next to provide the best possible quality of experience ("QoE") to the user, by providing high quality content for the parts of the video the user is looking at and is most likely to look at in the future.
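For this non-tiled alternative, the selection among pre-encoded versions might be sketched as follows; the version names and scene-part labels are purely hypothetical.

```python
def pick_version(versions, predicted_region):
    """Choose the pre-encoded version whose high-quality parts overlap
    the predicted region of interest the most.

    versions: {version name: set of scene parts encoded at higher quality}.
    """
    return max(versions, key=lambda v: len(versions[v] & predicted_region))

versions = {
    "front_hq": {"front", "front_left"},   # illustrative version names
    "back_hq": {"back", "back_right"},
}
best = pick_version(versions, predicted_region={"front"})
```

The greatest-overlap rule is one plausible heuristic; a production player would likely also weigh prediction confidence and available bandwidth.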
[0038] The foregoing description of the embodiments has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the patent rights to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above disclosure.
[0039] Some portions of this description describe the embodiments in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are commonly used by those skilled in the data processing arts to convey the substance of their work effectively to others skilled in the art. These operations, while described functionally, computationally, or logically, are understood to be implemented by computer programs or equivalent electrical circuits, microcode, or the like. Furthermore, it has also proven convenient at times, to refer to these arrangements of operations as modules, without loss of generality. The described operations and their associated modules may be embodied in software, firmware, hardware, or any combinations thereof.
[0040] Any of the steps, operations, or processes described herein may be performed or implemented with one or more hardware or software modules, alone or in combination with other devices. In one embodiment, a software module is implemented with a computer program product comprising a computer-readable medium containing computer program code, which can be executed by a computer processor for performing any or all of the steps, operations, or processes described.
[0041] Embodiments may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, and/or it may comprise a general-purpose computing device selectively activated or reconfigured by a
computer program stored in the computer. Such a computer program may be stored in a non- transitory, tangible computer readable storage medium, or any type of media suitable for storing electronic instructions, which may be coupled to a computer system bus. Furthermore, any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
[0042] Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the patent rights be limited not by this detailed description, but rather by any claims that issue on an application based hereon. Accordingly, the disclosure of the embodiments is intended to be illustrative, but not limiting, of the scope of the patent rights, which is set forth in the following claims.
Claims
1. A method for streaming immersive video content for playback in a motion-controlled playback device, the method comprising:
providing a manifest file for an immersive video content, the manifest file identifying two or more streams for the immersive video content, including a first stream with a higher quality of video than a second stream, wherein each stream of the immersive video content comprises a set of video segments, each video segment comprising a plurality of tiles corresponding to an immersive view of a scene in the video content;
providing a prediction profile corresponding to the immersive video content, the prediction profile including one or more predicted views for at least one segment, each predicted view identifying a subset of tiles of the plurality of tiles likely to be viewed by a user;
receiving a first request for a first set of higher quality tiles from a first segment in the first stream of the immersive video content, the first set of tiles corresponding to a first predicted view in the prediction profile for the first segment; and receiving a second request for a second set of lower quality tiles from the first
segment in the second stream of the immersive video content.
2. The method of claim 1 further comprising:
receiving head movement data corresponding to a playback of the immersive video content;
training a neural network based, at least in part, on the head movement data; and generating the prediction profile corresponding to the immersive video content based on the neural network.
3. The method of claim 2 further comprising receiving eye-tracking data corresponding to a playback of the immersive video content, and wherein the training of the neural network is further based, at least in part, on the eye-tracking data.
4. The method of claim 2 further comprising refining the prediction profile based on a comparison of predicted head movements determined from the prediction profile with actual head movements detected during playback of the immersive video content.
5. The method of claim 1 wherein each predicted view is provided as predicted head movements associated with a segment of the immersive video content.
6. The method of claim 1 wherein the immersive video content is a 360° video.
7. The method of claim 1 wherein the motion-controlled playback device is a virtual reality headset.
8. The method of claim 1 wherein the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream, and further comprising receiving a third request for a third set of lower quality tiles from the first segment in the third stream of the immersive video content.
9. The method of claim 1 wherein each predicted view in the prediction profile includes a probability of being viewed by a user and further comprising providing a content provider profile, the content provider profile providing a threshold probability of a predicted view for requesting tiles from the first stream.
10. The method of claim 1 further comprising selecting the prediction profile
corresponding to the immersive video content.
11. The method of claim 10 wherein the selecting is based on one or more of similarities between a user profile associated with the prediction profile and a user profile for the user of the motion-controlled playback device and similarities between a content profile associated with the prediction profile and a content profile associated with the immersive video content.
12. A method for streaming immersive video content for playback in a motion-controlled playback device, the method comprising:
receiving a manifest file for an immersive video content, the manifest file identifying two or more streams for the immersive video content, including a first stream with a higher quality of video than a second stream, wherein each stream of the immersive video content comprises a set of video segments, each video segment comprising a plurality of tiles corresponding to an immersive view of a scene in the video content;
receiving a prediction profile corresponding to the immersive video content, the
prediction profile including one or more predicted views for at least one segment, each predicted view identifying a subset of tiles of the plurality of tiles likely to be viewed by a user;
determining a first set of tiles from a first segment of the immersive video content based on a first predicted view in the prediction profile for the first segment and a second set of tiles from the first segment of the immersive video content, the second set of tiles corresponding to sections of the scene less likely to be viewed by the user;
sending requests for the first set of tiles from the first stream identified in the manifest file and for the second set of tiles from the second stream identified in the manifest file; and
playing back the first segment of the immersive video content.
13. The method of claim 12 wherein the prediction profile is an output of a neural network trained with movement data corresponding to playing back of the immersive video content.
14. The method of claim 13 wherein the movement data includes one or more of head movement data and eye movement data.
15. The method of claim 12 wherein the determining comprises predicting a head movement associated with the first segment of the immersive video content and further comprising sending an actual head movement data for refining the prediction profile with a neural network.
16. The method of claim 12 wherein the immersive video content is a 360° video.
17. The method of claim 12 wherein the motion-controlled playback device is a virtual reality headset.
18. The method of claim 12 wherein the manifest file identifies at least three streams for the immersive video content, including a first stream with a higher quality of video than a second stream and a third stream with a lower quality of video than the second stream, and further comprising sending requests for a third set of lower quality tiles from the first segment in the third stream of the immersive video content.
19. The method of claim 12 wherein each predicted view in the prediction profile includes a probability of being viewed by a user and further comprising receiving a content provider profile, the content provider profile providing a threshold probability of a predicted view for requesting tiles from the first stream.
20. The method of claim 12 further comprising providing a user profile of a user of the motion-controlled playback device for selection of the prediction profile, the prediction
profile selected at least in part based on a comparison between a user profile associated with the prediction profile and the user profile of the user of the motion-controlled playback device.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762521858P | 2017-06-19 | 2017-06-19 | |
US62/521,858 | 2017-06-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2018236715A1 true WO2018236715A1 (en) | 2018-12-27 |
Family
ID=64735847
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/038012 WO2018236715A1 (en) | 2017-06-19 | 2018-06-18 | Predictive content buffering in streaming of immersive video |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2018236715A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11729407B2 (en) * | 2018-10-29 | 2023-08-15 | University Of Washington | Saliency-based video compression systems and methods |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170075416A1 (en) * | 2015-09-10 | 2017-03-16 | Google Inc. | Playing spherical video on a limited bandwidth connection |
US20170084073A1 (en) * | 2015-09-22 | 2017-03-23 | Facebook, Inc. | Systems and methods for content streaming |
US20170098122A1 (en) * | 2010-06-07 | 2017-04-06 | Affectiva, Inc. | Analysis of image content with associated manipulation of expression presentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210168397A1 (en) | Systems and Methods for Learning Video Encoders | |
US11601699B2 (en) | Predictive content delivery for video streaming services | |
US20220030244A1 (en) | Content adaptation for streaming | |
US9432702B2 (en) | System and method for video program recognition | |
US11416546B2 (en) | Content type detection in videos using multiple classifiers | |
US10530825B2 (en) | Catching up to the live playhead in live streaming | |
US11120293B1 (en) | Automated indexing of media content | |
US10476943B2 (en) | Customizing manifest file for enhancing media streaming | |
US20180191801A1 (en) | Adaptively updating content delivery network link in a manifest file | |
US20180191586A1 (en) | Generating manifest file for enhancing media streaming | |
KR102255363B1 (en) | Apparatus, method and computer program for processing video contents | |
US9877056B1 (en) | Compressed media with still images selected from a video stream | |
US10091265B2 (en) | Catching up to the live playhead in live streaming | |
WO2017210027A1 (en) | Catching up to the live playhead in live streaming | |
Polakovič et al. | Adaptive multimedia content delivery in 5G networks using DASH and saliency information | |
US20180191799A1 (en) | Effectively fetch media content for enhancing media streaming | |
WO2018236715A1 (en) | Predictive content buffering in streaming of immersive video | |
KR102574353B1 (en) | Device Resource-based Adaptive Frame Extraction and Streaming Control System and method for Blocking Obscene Videos in Mobile devices | |
US10681105B2 (en) | Decision engine for dynamically selecting media streams | |
US20240048807A1 (en) | Leveraging insights from real-time media stream in delayed versions | |
US20230319327A1 (en) | Methods, systems, and media for determining perceptual quality indicators of video content items | |
US11716454B2 (en) | Systems and methods for improved delivery and display of 360-degree content |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18821051 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18821051 Country of ref document: EP Kind code of ref document: A1 |