WO2020069976A1 - Concepts for improved head motion prediction and efficient encoding of immersive video - Google Patents

Concepts for improved head motion prediction and efficient encoding of immersive video

Info

Publication number
WO2020069976A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
video
prediction
location
viewport
Prior art date
Application number
PCT/EP2019/076069
Other languages
French (fr)
Inventor
Sebastian Bosse
Wojciech SAMEK
Thomas Schierl
Gabriel Curio
Thomas Wiegand
Markus Wenzel
Cornelius Hellge
Tamer AJAJ
Robert SKUPIN
Original Assignee
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. filed Critical Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Publication of WO2020069976A1 publication Critical patent/WO2020069976A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/012Head tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/015Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/03Arrangements for converting the position or the displacement of a member into a coded form
    • G06F3/033Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor
    • G06F3/0346Pointing devices displaced or positioned by the user, e.g. mice, trackballs, pens or joysticks; Accessories therefor with detection of the device orientation or free movement in a 3D space, e.g. 3D mice, 6-DOF [six degrees of freedom] pointers using gyroscopes, accelerometers or tilt-sensors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/04815Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object

Definitions

  • This application is concerned with a concept for head motion prediction usable, for instance, with a head mounted display, and a concept for efficient encoding of immersive video.
  • a head mounted display is an image or video display device which may be worn by a user on the head or as part of a helmet.
  • a typical HMD has one or two small displays, with lenses and semi-transparent mirrors embedded in eyeglasses, a visor, or a helmet.
  • the display units are miniaturized and may include cathode ray tubes (CRT), liquid crystal displays (LCDs), liquid crystal on silicon (LCoS), or organic light-emitting diodes (OLED).
  • CRT cathode ray tubes
  • LCDs liquid crystal displays
  • LCoS liquid crystal on silicon
  • OLED organic light-emitting diodes
  • Visual media content can be presented to the user on the HMD placed in front of the eyes, e.g., in virtual reality (VR) or augmented reality (AR) or 360°-video presentation systems.
  • AR is generated to superimpose a computer-generated imagery and live imagery from the physical world.
  • the user's head pose/orientation i.e., position and direction/angle relative to a spherical panorama around the user
  • the user's head pose/orientation can be tracked, which allows for adapting the visual content to the current head pose and thus the perspective of the user.
  • this adaptation is ideally immediate, because a temporal mismatch between head motion and visual content can be uncomfortable for the user. A temporal delay may even lead to symptoms similar to motion sickness due to the conflict between the inputs to the user's visual and vestibular systems.
  • practical restrictions can delay the adaptation, e.g., network or other transmission latencies, or computational constraints in rendered virtual or augmented environments.
  • the entire spherical panorama around the user can be rendered and/or transmitted, including areas outside of the present field of view of the user.
  • This approach makes it possible to select the section of the spherical panorama to be displayed locally, close to the HMD, and hence with very low delay after a head rotation.
  • the present application provides a concept for head motion prediction which results in improved forecasts of head motions, thereby enabling, for instance, bridging round trip times in immersive video streaming scenarios for sake of tailoring the encoding or available network throughput resources to a user’s viewport.
  • the present application provides a concept for improved video stream generation resulting in a higher quality appearance for the user in case of immersive video scenarios.
  • an improved forecast of head motions is possible by using feedback information on the actual location of the viewport portion the user looked at at the predetermined look-ahead time.
  • head motions may be forecast, i.e. the prediction of the location of the viewport portion may be done, with a lead (look-ahead) time between 0.05 and 1 seconds.
  • the feedback information on the actual location of the viewport portion is used for updating a parameterization of the predictor using which the prediction of the location is performed. Prediction and measurement for obtaining the feedback, thus, co-operate in that the measurement takes the look ahead time of the prediction into account.
  • a training of the prediction, which ends up in the update, is hence optimized for predicting the viewport portion’s location at the look-ahead time.
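  • As a purely illustrative sketch (the concrete predictor model and update rule are not prescribed by the description), the feedback loop of the first aspect can be pictured as follows: a prediction issued at time t for the look-ahead time t + Δt is later compared with the viewport location actually assumed at t + Δt, and the deviation drives a parameter update of the predictor.

```python
# Minimal sketch of look-ahead prediction with feedback-driven updates.
# The predictor interface and the simple linear drift model are assumptions
# for illustration, not the patent's concrete algorithm.
import numpy as np

class LookAheadPredictor:
    def __init__(self, look_ahead_s=0.2, lr=0.05):
        self.look_ahead_s = look_ahead_s           # e.g. between 0.05 s and 1 s
        self.lr = lr                               # update step size
        self.w = np.zeros(2)                       # learned drift correction

    def predict(self, viewport_xy, head_velocity_xy):
        """Predict viewport centre (yaw/pitch, deg) at t + look_ahead_s."""
        return viewport_xy + self.look_ahead_s * (head_velocity_xy + self.w)

    def feedback(self, predicted_xy, actual_xy):
        """Feedback on the location actually looked at at the look-ahead time."""
        error = actual_xy - predicted_xy
        self.w += self.lr * error / self.look_ahead_s   # reduce future error

# usage: predict, present, then update once the actual viewport is known
p = LookAheadPredictor(look_ahead_s=0.2)
pred = p.predict(np.array([10.0, 0.0]), np.array([30.0, 0.0]))   # deg, deg/s
p.feedback(pred, actual_xy=np.array([17.0, 0.5]))
```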
  • This concept of the first aspect is a perfect candidate for embodiments relating to a second aspect of the present application.
  • video stream generation may be made more efficient when used in connection with varying viewport locations such as in immersive video applications, when the video stream generation is rendered dependent on information on a predicted viewport portion of the video.
  • the dependency may relate to encoding resources. That is, the predicted location of the viewport portion may steer whereto available encoding resources are spent more than onto other portions of the video. This is feasible owing to the predictive nature, i.e. a look ahead, of the prediction of the viewport portion location.
  • the encoding resources may relate to bitrate and/or computational/coding power.
  • the dependency may relate to a packetization of an encoded representation of the video.
  • the packetization composes the video stream for the user’s client in a manner so that the video is represented by the video stream at the view port portion at higher quality than at a region outside thereof.
  • the packetization might be done, in this case, for the user’s client on the fly, or the focusing of the packetization is done by selecting the packetized version that is to become the data stream out of several pre-packetized versions of the encoded representation which differ in the locations of improved video quality.
  • rendering the video stream generation dependent on the predicted viewport location enables achieving more efficiency for immersive video streaming.
  • the prediction leading to the viewport location information may be obtained, for instance, based on user sensor data and/or based on an evaluation of the video material/content itself. According to the first aspect, it is feasible to improve the prediction by feedback information on actually assumed viewport portion locations.
  • the prediction may, in case user sensor data are at least partially exploited, result in a closed loop being formed between the video stream generation site, such as the immersive video server, and the user equipment for displaying an immersive video presentation, such as the client: the video is presented to the user while it is predicted, using user sensor data measuring the behavior of the user while watching the video or the current viewport, where the viewport portion will move to within the picture area of the video at a certain lead time, for instance; the video stream generation site is notified of the predicted location of the viewport portion so that the encoding resources, such as computational power and/or bitrate, and/or the packetization of the encoded representation, may be focused onto this predicted location of the viewport portion.
  • the video stream generation site such as the immersive video server
  • the user equipment for displaying an immersive video presentation such as the client
  • the focusing will end up, with a likelihood that is the higher the better the viewport location prediction is, in the user experiencing an improved video quality.
  • the user equipment sends the report of the predicted location, for instance, along with temporal information as to when the user is going to look at the viewport portion to the server.
  • the prediction may be improved with respect to subsequent predictions.
  • one or more of a multichannel EMG signal, a multichannel EEG signal, a multichannel EOG signal, an eye track signal, a multichannel skeletal tracker signal, and a signal based on a visual media content which is looked at by the user may be used as a basis, for instance.
  • the visual content itself may be analyzed in order to predict the viewport portion’s location in a statistical sense, i.e. in the sense that the predicted viewport portion coincides with typical user behaviors when the users are watching the video.
  • the second aspect’s approach renders possible an optimal spatial allocation of limited resources, e.g., of coded bits, of computing power and/or of available transmission bandwidth.
  • the prediction of the position of the viewport portion may, as stated, be calculated based on a specific user’s behavior such as past head, eye or body movements, or based on electrophysiological data, or may be calculated based on visual properties of the scene content of the video itself such as based on the optical flow (i.e., the pattern of apparent motion of objects, surface and edges in a visual scene caused by the relative motion between the user and a scene) or saliency (i.e., an item which is the state or quality by which it stands out from its neighbors), or combinations of these features.
  • the optical flow i.e., the pattern of apparent motion of objects, surface and edges in a visual scene caused by the relative motion between the user and a scene
  • saliency i.e., an item which is the state or quality by which it stands out from its neighbors
  • the prediction may comprise estimators of the head pose/orientation or derived measures such as motion direction, speed, acceleration, onset, occurrence, or absence at one or more time points or intervals in the future.
  • the function that maps the available features to the motion or viewport portion estimate may be learned offline from a previously recorded dataset and/or learned and adapted online during usage.
  • an apparatus for predicting a viewport portion of a video a user is going to look at is configured to perform a prediction of a location of the viewport portion within the video at a predetermined look-ahead time using a predictor, and to receive feedback information on an actual location of the viewport portion the user looked at at the predetermined look-ahead time.
  • the apparatus may update a parameterization of the predictor using the actual location.
  • the predictor comprises a plurality of sub-predictors, each configured to generate a respective prediction candidate, and the apparatus may derive the prediction from the generated prediction candidates.
  • the apparatus receives a multichannel EMG signal (i.e., a multichannel electromyography signal) from a user sensor (e.g., electrode attached to the user), and uses the multichannel EMG signal to forecast the change in the head orientation of the user.
  • the change in the head orientation of the user may be predicted in two steps, i.e., the first step as a first classification predicts whether the head is going to move in a future time interval or not, and, when the first classification predicts that the head is going to move, the second step as a second classification predicts into which direction (e.g., right or left) the head is going to move.
  • a regression technique may be used for prediction.
  • the head orientation, i.e., the head position/location and direction, is predicted, and a viewport portion of the video a user is predicted to look at, i.e., a field of view of the user, is predicted based on the predicted head orientation.
  • the display area the user is going to look at is predicted on the fly, i.e., while the head orientation of the user is changing.
  • a user equipment for displaying an immersive video presentation may be configured to retrieve a video from a server, predict a location of a viewport portion within the video, where a user is going to look at, at a predetermined look ahead time, and report the location along with temporal information as to when the user is going to look at the viewport portion to the server.
  • the look ahead time may relate to an absolute presentation time indication or a presentation time difference or a combination of absolute presentation time or presentation time difference with a mapping of presentation time to a universal time used for synchronization between server and client (e.g. UTC) or, in even other words, relate to a time basis somehow synchronized between the sender of the location information, i.e. the forecast apparatus, and the recipient thereof which, in accordance with an example, might be some universal time.
  • an apparatus for generating a video stream representing a video may be configured to obtain information on a predicted viewport portion of the video a user is predicted to look at, i.e., the ongoing field of view of the user, and to focus encoding resources for encoding the video into the data stream onto the predicted viewport portion and/or to focus the packetization of the video, given the available throughput of the network, onto the predicted viewport portion. That is, by obtaining the information indicating the predicted viewport portion, encoding resources, i.e., the available bitrate and/or computational power, are effectively used to improve the perceived quality of the visual content and avoid a temporal mismatch. Hence, it is possible to effectively avoid a mismatch between the user motion and the visual content caused by the rendering delay.
  • a system for presenting a video to a user comprising a detector configured to predict a viewport portion of the video which the user is going to look at; an apparatus for generating a video stream representing a video according to the present invention; and an interface configured to inform the video encoder of the predicted viewport portion; wherein the video encoder is configured to focus encoding resources for encoding the video into the data stream onto the predicted location of the viewport portion.
  • the detector receives a multichannel EMG signal from a user sensor and uses the multichannel EMG signal to forecast the change in the head orientation of the user for predicting the viewport of the user.
  • the encoder obtains information on the predicted viewport portion via the interface and focuses encoding resources for encoding the video into the data stream onto the predicted viewport portion. That is, the head orientation of the user is predicted in real time, i.e., the prediction is performed on the fly, and hence it is possible to improve the perceived quality of the visual content and avoid a temporal mismatch by using limited computational power.
  • Fig. 1 shows a schematic illustration of the actual field of view as a section of the full spherical panorama
  • Fig. 2 shows a schematic diagram of a system for presenting a video to a user according to embodiments of the present application
  • Fig. 3 shows a schematic diagram explaining the presentation of visual content to a user according to embodiments of the present application
  • Fig. 4 shows a block diagram of an apparatus for predicting a viewport portion of a video a user is going to look at as an example of the apparatus where prediction concept according to embodiments of the present application could be implemented;
  • Fig. 5 shows a block diagram of an apparatus for predicting a viewport portion of a video a user is going to look at as another example of the apparatus where prediction concept according to embodiments of the present application could be implemented;
  • Fig. 6 shows a block diagram of user equipment for displaying an immersive video presentation as an example of the user equipment where prediction concept according to embodiments of the present application could be implemented
  • Fig. 7 shows a block diagram of an apparatus for generating a video stream representing a video as an example of the apparatus where prediction concept according to embodiments of the present application could be implemented;
  • Fig. 8 shows a schematic diagram of EMG based head motion forecasting with classification according to embodiments of the present application
  • Fig. 9 shows a schematic diagram of EMG based head pose forecasting with regression according to embodiments of the present application.
  • Fig. 10 shows a schematic diagram of head motion anticipation according to embodiments of the present application.
  • Fig. 1 shows a user 2 wearing a HMD 4 via which the user 2 is presented a panoramic video 6.
  • the user sees only a limited area, i.e., the field of view (viewport) 8, of the panoramic video 6 at each time instant, while the remaining area 10 is out of sight for the user 2 while the user is looking at the viewport 8. That is, the user 2 looks at a scene or panoramic video 6 via the HMD 4 but only sees the viewport 8; therefore, a high resolution of the remaining area is not required, since the out-of-sight region 10 is not being looked at by the user 2.
  • the HMD 4 is connected to a computing device 12 which has a function to work as an apparatus of the present application, and/or a decoder.
  • the computing device 12 is connected via some network to a server 14 providing a video content to the user, e.g., the server could be a content server or a distribution server.
  • the server 14 represents an apparatus for video stream generation.
  • the computing device 12 includes an interface which informs the server 14 on a predicted location of the viewport portion.
  • the user 2 is equipped, for instance, with at least one of the following sensors to obtain signals which are used to predict a change in the head orientation of the user and, thus, the viewport’s 8 location.
  • the sensors include electrodes, cameras and/or other devices which track/record/sense/detect the head or eye motion of the user.
  • the apparatus for predicting a viewport portion of a video a user is going to look at may be implemented on the HMD 4, the computing device 12 or the server 14, or a part of them.
  • the HMD 4, the computing device 12 and/or the server 14 may have this functionality.
  • Fig. 4 shows the computing device 12 as comprising a main processor 20 and a predictor (e.g., modality) 22.
  • Fig. 5 shows another example of the computing device 12.
  • the computing device 12 comprises the main processor 20 and the predictor 22 with the predictor 22 being composed of a plurality of sub-predictors 22a, 22b and 22c.
  • three sub-predictors are exemplarily indicated, but the number of the sub-predictors is not limited.
  • Fig. 6 shows a user equipment as the HMD 4 according to the present invention.
  • the HMD 4 comprises a main processor 30 and a plurality of sensors 32a to 32c. However, again, the number of the sensors is not limited, and the HMD 4 may also include the predictor (sub-predictors).
  • Fig. 7 shows an example of the server 14 according to the present invention.
  • the server 14 comprises a video stream data generator 40 and an encoding core 42.
  • the encoding core 42 generates, using some encoding resources, an encoded representation which the generator 40 then subjects to packetization to yield the video stream output by the server 14 to the device 12 for viewport presentation to the user.
  • One or both of the encoding core 42 and the generator 40 may operate depending on a predicted viewport location. Although not indicated in the figure, the server 14 may include a memory or a plurality of memories to store the video content and other data.
  • the server 14 may comprise the predictor (or some or all sub-predictors).
  • a head tracking signal is acquired by the HMD, i.e., the HMD 4 worn by the user 2.
  • HMD i.e., the HMD 4 worn by the user 2.
  • same may include several sensors 32a to 32c as described in Fig. 6, for example.
  • An inertial measurement unit may be built into the HMD 4, for example.
  • optical sensors may track the user’s head motion (e.g., head tracker).
  • One of the sensors 32a to 32c may acquire an EMG (electromyography) signal. It may be acquired by electrodes placed on the skin of the user. The EMG of the user is measured by sensing the activity of one or more muscles with the electrodes. For instance, sixteen EMG electrodes can be placed on the neck surface in eight pairs of respectively two vertically aligned electrodes, which results in a 16-dimensional time series. One electrode pair is put on the left and another pair on the right sternocleidomastoid muscle. The remaining electrode pairs are arranged in between with equal spacing around the neck, forming an arc with a frontal opening.
  • EMG electromyography
  • One of the sensors 32a to 32c may acquire eye movements, such as by an EOG (electrooculography) signal which is acquired by two or more electrodes in contact with the skin on the user’s head, or by one or more (infrared) cameras built in the HMD.
  • EOG electrooculography
  • the EOG signal indicates the eye movements.
  • the controllers or skeletal trackers may have one or more of the sensors 32a to 32c so as to measure body movements of the user to acquire the body movement signal.
  • One of the sensors 32a to 32c may use electroencephalography (EEG) to capture the brain activity with two or more electrodes on scalps of the user.
  • EEG electroencephalography
  • Meta information of the visual signal relating to visual media content may also be used as input for the viewport location prediction. That is, one or more of the above signals may be used in the prediction process performed by the predictor 22.
  • the user 2 wears the HMD 4 which includes sensors or selected sensors, e.g., sensors 32a to 32c, and/or all or selected electrodes to track the movement of the user 2.
  • the user is presented the video via immersive streaming. That is, the user 2 sees the viewport 8 while assuming a current viewport location and reacts in order to move the viewport 8 according to his/her wish, which movement is measured by sensors 32a to 32c.
  • In step 102, the above signals or a selection thereof are acquired in synchrony and sent to, for example, the computing device 12 in Fig. 4 or 5, which forms an embodiment of an apparatus of the present application for predicting a location of the viewport 8.
  • the prediction is done in real-time during HMD usage.
  • the prediction task is exemplarily shown to be composed of a sequence of two steps, namely processing steps 104 and 106.
  • Step 104 is a kind of preparation step of the prediction.
  • the inbound signals of step 102 are subject to feature extraction and/or integration.
  • the head direction/orientation data e.g., a rotation matrix
  • the head position vector may be adjusted to the site of HMD usage, e.g., related to the room center.
  • Body tracker signals can be pre-processed similarly.
  • EMG, EOG, and EEG signals are amplified and analogue-to-digital converted - if applicable - and may be filtered in frequency (e.g., high-, low-, or band-pass filtering) and space (e.g., Laplace filtering, bipolar subtraction, selection of electrode locations).
  • Filter parameters that maximize the predictive performance are chosen according to the training data (explained in the later step 106), e.g., by searching over a grid of different frequency bands.
  • Laplacian spatial filters, which subtract the average signal of adjacent electrodes from each electrode, may increase the signal-to-noise ratio and compensate for inter-individual anatomical differences and for changes of the electrode locations.
  • Bipolar subtraction of EMG-electrodes on the left and right sides of the neck of the user may better capture the activity onset of the sternocleidomastoid muscles.
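  • A minimal sketch of such spatial pre-processing is given below; the neighbour assignments and electrode pairs are hypothetical placeholders for a 16-electrode neck montage.

```python
# Illustrative spatial filtering of one multichannel EMG/EEG window
# (channels x samples). Neighbour indices and electrode pairs are assumed.
import numpy as np

def laplacian_filter(x, neighbours):
    """Subtract the mean of adjacent electrodes from each electrode."""
    out = np.empty_like(x)
    for ch, nbs in neighbours.items():
        out[ch] = x[ch] - x[nbs].mean(axis=0)
    return out

def bipolar(x, pairs):
    """Bipolar subtraction, e.g. left vs. right sternocleidomastoid pairs."""
    return np.stack([x[a] - x[b] for a, b in pairs])

x = np.random.randn(16, 512)                    # 16 EMG channels, one window
neighbours = {ch: [(ch - 1) % 16, (ch + 1) % 16] for ch in range(16)}
x_lap = laplacian_filter(x, neighbours)
x_bip = bipolar(x, pairs=[(0, 15), (1, 14)])    # hypothetical left/right pairs
```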
  • the horizontal and vertical orientation of the eye(s) may be inferred from the eye image(s) recorded by the camera(s) (with image processing methods and an additional eye tracker calibration session with known eye orientations). Unprocessed eye images can be used as signal as well.
  • Multi-dimensional features e.g., EMG signals, eye movement recorded by the camera, the body movement signal and/or other signals
  • the acquired sensor signals e.g., one or more of a multichannel EMG signal from a user sensor, a multichannel EEG signal, a multichannel EOG signal, an eye track signal, a multichannel skeletal tracker signal, a signal based on a visual media content which is looked at by the user, and statistical user behavior models are sent to the computing device 12 from the HMD 4.
  • the obtained time series may be used directly and/or time points of motion onsets may be inferred from the pose data, e.g., with motion speed thresholds or according to the spectral flux.
  • sliding windows of the electrophysiological time series can be extracted and the multi-channel windows can be vectorized, after an optional down-sampling step.
  • Window lengths that are optimal with respect to the predictive performance can be estimated from the training data by the adaptive learning algorithm (details are explained below regarding the process of step 106).
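  • The windowing and vectorization step can be sketched as follows; window length, hop size, and down-sampling factor are illustrative values that would in practice be chosen according to the training data.

```python
# Sketch: cut a multichannel time series into sliding windows, optionally
# down-sample, and vectorise each window into one feature vector.
import numpy as np

def sliding_windows(x, win, hop, downsample=1):
    """x: (channels, samples) -> (n_windows, channels * win // downsample)."""
    n_ch, n_s = x.shape
    feats = []
    for start in range(0, n_s - win + 1, hop):
        w = x[:, start:start + win:downsample]
        feats.append(w.reshape(-1))               # vectorise the window
    return np.stack(feats)

emg = np.random.randn(16, 5000)                   # 16 channels, dummy signal
X = sliding_windows(emg, win=200, hop=50, downsample=2)
print(X.shape)                                    # (n_windows, 16 * 100)
```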
  • When eye movement signals are received, the time series may be used unprocessed and/or time points of eye saccade onsets may be determined.
  • Visual features in the media content may be detected and mapped, e.g., objects, faces, visual saliency, contextual priors, or optical flow (with image processing methods).
  • the unprocessed media content can serve as feature. If only pointers to the media content are provided as signal stream, the respective content is loaded for this purpose. This might be useful in case the video content is game content and a game character has priority or a specific movement which might catch the gaze of the user.
  • Synchronization and interpolation can correct for time lags between the different feature time series and for different sampling frequencies.
  • Whitening or other normalization procedures may further integrate the multi-dimensional feature time-series.
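  • A small sketch of such synchronization and normalization, assuming linear interpolation onto a common time base and a simple z-score normalization in place of full whitening:

```python
# Sketch: bring feature streams with different sampling rates and time lags
# onto a common time base by linear interpolation, then normalise them.
import numpy as np

def resample_to(t_target, t_src, x_src):
    """Interpolate each feature dimension of x_src (n, d) onto t_target."""
    return np.column_stack([np.interp(t_target, t_src, x_src[:, i])
                            for i in range(x_src.shape[1])])

t = np.arange(0.0, 10.0, 0.01)                               # 100 Hz common base
emg_t = np.arange(0.0, 10.0, 1 / 512); emg = np.random.randn(len(emg_t), 3)
eye_t = np.arange(0.05, 10.0, 1 / 60); eye = np.random.randn(len(eye_t), 2)

features = np.hstack([resample_to(t, emg_t, emg), resample_to(t, eye_t, eye)])
features = (features - features.mean(0)) / features.std(0)   # simple normalisation
```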
  • Features from previous HMD usage sessions of the user and/or of sessions of other users can be added to the training data set (in detail, see below the description part regarding step 106).
  • the actual prediction takes place.
  • An adaptive learning algorithm may be used to this end, i.e. for head motion forecasting, as follows.
  • a function may map the incoming features (i.e., signals from the sensors, see step 104) to an estimate of the head pose or motion in the future (see step 108 for more details about the properties of the output).
  • Methods from supervised machine learning calculate the mapping function based on a data set of features acquired in the past that are labelled with the actual head poses or motions that occurred at later time points than the respective features. These labels are extracted from the head-tracker-features of the very same data set.
  • the mapping function can be calculated with data from the current or previous HMD usage session(s) of the user and from other users, or with data from a calibration phase before the actual HMD usage.
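  • The construction of such labelled training pairs can be sketched as follows: the label of a feature vector acquired at time t is the head pose actually recorded Δt later in the same head-tracker time series. Sampling rate and Δt below are assumed values for illustration.

```python
# Sketch of building a supervised training set: features acquired at time t
# are labelled with the head pose observed at t + Δt, taken from the
# head-tracker recording of the very same session.
import numpy as np

def make_training_pairs(features, head_yaw, fs, delta_t):
    """features: (n_samples, n_feat); head_yaw: (n_samples,); fs in Hz."""
    shift = int(round(delta_t * fs))              # look-ahead in samples
    X = features[:-shift]                         # features at time t
    y = head_yaw[shift:]                          # pose at time t + Δt (label)
    return X, y

fs = 100.0                                        # assumed sampling rate
feats = np.random.randn(1000, 32)
yaw = np.cumsum(np.random.randn(1000)) * 0.1      # dummy head yaw trace
X, y = make_training_pairs(feats, yaw, fs, delta_t=0.3)
```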
  • the computing device 12 receives one or a plurality of signals from the HMD 4.
  • the computing device 12 retrieves video content information and a video stream to be displayed by the HMD 4. This video is then subject to video stream generation at the server 14.
  • a plurality of sub-predictors may act on these signals which might have been subject to the preprocessing at step 104.
  • Each sub predictor may generate a respective prediction candidate, i.e. a candidate for the prediction of the location of the viewport at some look ahead.
  • the sub-predictor 22a may receive the EMG signal and generate a prediction candidate based on the EMG signal.
  • the sub-predictor 22b may receive an eye tracking signal and generate a prediction candidate based on the eye tracking signal
  • the sub-predictor 22c may receive a signal based on a visual media content which is looked at by the user, i.e., retrieved video content information, and may generate a prediction candidate based on the retrieved video content information.
  • alternatively, the predictor is not composed of more than one sub-predictor; in that case, the predictor 22 either still generates a plurality of prediction candidates based on the received signals or just one prediction is done, and the merging or combination of candidates described as being performed in the processor 20 may be left off.
  • the prediction candidate is generated based on one received signal, however, a plurality of signals may be used for generating one prediction candidate.
  • the main processor 20 receives a plurality of generated prediction candidates and derives the final prediction from the generated prediction candidates. For example, a selection of one of the prediction candidates is performed such as the selection of the most appropriate prediction or the most likely best prediction from the plurality of the prediction candidates.
  • the selection of the prediction candidate may be done based on the statistical user behavior which is stored in a memory (not shown in Fig. 4 and 5).
  • the most appropriate (or reliable) prediction candidate is generated considering the statistical user behavior or the feedback information based on the plurality of the prediction candidates, such as deriving the final prediction by way of a weighted sum of the prediction candidates, with weights that mutually rank the prediction candidates according to their prediction quality. Exemplary details of deriving the prediction are explained below.
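  • A sketch of such candidate merging follows; the weights stand in for prediction-quality scores, and the concrete weighting or selection rule is an assumption, not prescribed by the description.

```python
# Sketch of merging prediction candidates from several sub-predictors,
# either by picking the most reliable one or by a quality-weighted sum.
# In practice the weights would be derived from past prediction errors.
import numpy as np

def combine(candidates, weights, mode="weighted"):
    """candidates: (n_pred, 2) yaw/pitch; weights: (n_pred,) quality scores."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    if mode == "select":
        return candidates[int(np.argmax(w))]      # most reliable candidate
    return (w[:, None] * candidates).sum(axis=0)  # weighted sum

cands = np.array([[12.0, 1.0],                    # e.g. EMG-based candidate
                  [15.0, 0.5],                    # eye-tracking candidate
                  [10.0, 2.0]])                   # content-based candidate
print(combine(cands, weights=[0.6, 0.3, 0.1]))
```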
  • the computing device 12 may generate a confidence value (explained in detail below) indicating the confidence of the prediction and the generated confidence value may be used to accompany the prediction.
  • the confidence value may be generated based on the stored variables, i.e., sensor signals, feedback information, content information, statistical user behavior information and so on.
  • if the server 14 includes the predictor or sub-predictors, or for the one or more sub-predictors residing in the server 14, the above-mentioned prediction scheme is processed at the server 14, which is the recipient of the viewport location prediction and uses it for tailoring the encoding/packetization.
  • the higher the confidence the more the tailoring/focusing may be localized to the predicted viewport location.
  • the lower the confidence, the less significant the increase of encoding resources and/or increase of quality of the packetized video stream in the predicted viewport location compared to the surrounding thereof might be.
  • step 106 relates to some non-zero look ahead time. This is contrary to, for example, “Hahne, J. (2016) Machine learning based myoelectric control (PhD thesis)” that infers the present movement type.
  • For forecasting the onset of a movement in the near future, i.e., on the scale of tenths or hundredths of a second, or for look-ahead times between 0.05 and 1 seconds, several movements of the user, sensor data, or a combination of the acquired user movement and sensor data, for example the EMG, may be used.
  • the forecast may be performed through classification or regression techniques. In both cases, the forecasting performance can be enhanced through signal decomposition.
  • Common spatial patterns (CSP) and Source Power Comodulation (SPoC) are statistical techniques for the supervised decomposition of multivariate signals which may be used.
  • CSP is described, for instance, in "(a) Fukunaga (1972). Introduction to statistical pattern recognition. Academic Press, San Diego, CA, USA; (b) Ramoser, H., Müller-Gerking, J., & Pfurtscheller, G. (2000). Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering, 8(4), 441-446."
  • CSP acts as a spatial filter that maximizes the discriminability between signals from two classes according to the signal variance.
  • the logarithm of the variance of the decomposed signals can be calculated and classified, e.g. with a classification function learned with linear discriminant analysis (LDA) or with a support vector machine.
  • LDA linear discriminant analysis
  • the disclosed methods train two or more combinations of CSP filters and a classifier (shown below in Fig. 8).
  • the first combination predicts whether the head moves in a future time interval or not. If so, the second combination forecasts whether the head is likely to move to the left or right.
  • this approach can be repeated on multiple levels.
  • the spherical panorama around the user can be divided into 2^N sections, which requires N classification stages. The output with the largest posterior probability will be considered.
  • SPoC is described in "Dähne, Meinecke, Haufe, Höhne, Tangermann, Müller, and Nikulin (2014). SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters. NeuroImage, 86, 111-122."
  • SPoC decomposes the multivariate signal under consideration of the amplitude of signal oscillations, too.
  • SPoC is guided by a continuous target variable instead of binary classes.
  • the disclosed methods use the future head pose as target variable for SPoC (at time t plus a time difference Δt).
  • band power values calculated from the decomposed signals are passed to a regressor (e.g., ridge regression) in order to estimate the future head pose (shown below in Fig. 9).
  • a regressor e.g., ridge regression
  • the continuous forecast values can be quantized or clustered.
  • Fig. 8 shows the EMG-based head motion forecasting with classification according to the embodiments of the present application.
  • the window is spatially filtered with filters that have been optimized with CSP beforehand according to the training data.
  • CSP calculates spatial filters that maximize the difference of the signal variance between two classes by solving a generalized Eigenvalue decomposition problem.
  • N can be any integer between two and the number of electrodes E.
  • Calculating the variances of all or selected components and the logarithm thereof results in a feature vector that a classifier maps to a decision.
  • the classification function is computed here with linear discriminant analysis, but other classifiers such as a support vector machine can also be employed. Two decisions are taken in separate stages. In the first stage, a classifier decides if the head is about to move or not within a period Δt in the immediate future. The selection of Δt depends on the application and is on the scale of tenths or hundredths of a second.
  • a second decision stage predicts whether the head is about to move to the left or right. Both stages comprise the same single steps as illustrated in the figure.
  • the parameters (window size, spatial filters, classification function) of the first and of the second decision stage can be different and have been optimized beforehand according to labelled training data from the past.
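  • A hedged sketch of this two-stage CSP/LDA cascade is given below; the channel count, window size and labels are dummy values, and the generalized eigenvalue formulation is one common way to compute CSP filters, used here as an illustrative assumption.

```python
# Illustrative CSP + log-variance + LDA cascade in the spirit of Fig. 8:
# stage 1 decides "move vs. no move" within Δt, stage 2 decides "left vs.
# right". All data below are random placeholders.
import numpy as np
from scipy.linalg import eigh
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def csp_filters(windows, labels, n_filters=4):
    """windows: (n_win, channels, samples); labels in {0, 1}."""
    covs = [np.mean([w @ w.T / w.shape[1] for w in windows[labels == c]], axis=0)
            for c in (0, 1)]
    # Generalised eigenvalue problem: maximise the variance ratio between classes.
    vals, vecs = eigh(covs[0], covs[0] + covs[1])
    order = np.argsort(vals)
    pick = np.r_[order[:n_filters // 2], order[-n_filters // 2:]]
    return vecs[:, pick].T                          # (n_filters, channels)

def log_var_features(windows, filters):
    proj = np.einsum('fc,wct->wft', filters, windows)
    return np.log(proj.var(axis=2))                 # (n_win, n_filters)

rng = np.random.default_rng(0)
Xw = rng.standard_normal((200, 16, 100))            # 200 dummy EMG windows
y_move = rng.integers(0, 2, 200)                    # stage 1 labels (move?)
y_dir = rng.integers(0, 2, 200)                     # stage 2 labels (left/right)

F1 = csp_filters(Xw, y_move)
clf1 = LinearDiscriminantAnalysis().fit(log_var_features(Xw, F1), y_move)
F2 = csp_filters(Xw, y_dir)
clf2 = LinearDiscriminantAnalysis().fit(log_var_features(Xw, F2), y_dir)

w_new = rng.standard_normal((1, 16, 100))
if clf1.predict(log_var_features(w_new, F1))[0] == 1:        # about to move?
    print("left" if clf2.predict(log_var_features(w_new, F2))[0] == 0 else "right")
```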
  • Fig. 9 shows the EMG-based head pose forecasting with regression.
  • the projection parameters are learned beforehand with SPoC from the training data (explained below).
  • the band power of the decomposed signal is calculated, resulting in a vector with N dimensions, which is mapped by a regression function to a continuous head pose forecast.
  • the parameters for decomposition and regression are calculated from the training data as follows.
  • SPoC computes a set of spatial filters by solving a generalized Eigenvalue decomposition problem guided by the values of a target variable.
  • the known head pose from a time point shifted by a period Δt in the relative future serves as target variable (with Δt in the range of a fraction of a second).
  • the projection with the spatial filters results in N signal components.
  • the components are segmented in windows of length T (with T time points sampled within the window). Computing the band power of each of the N components results in one vector (with N dimensions) per time window.
  • the set of vectors from all time segments and the corresponding target variable values serve for the parameter optimization of the regression function with supervised machine learning (e.g., with ridge regression or a different regression method).
  • the result of the learning phase is a set of filters for SPoC to decompose the EMG, and a regression function that maps the band- power-vector obtained from the decomposed EMG to a forecast of the head pose.
  • the continuous head pose forecast from the regression can be quantized to discrete categorical variables in an additional stage.
  • This can result in a classification output similar to the CSP-based method with three classes: "absence of a movement" versus "movement to the left" versus "movement to the right".
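  • A hedged sketch of the SPoC-style regression pipeline with subsequent quantization follows; the particular SPoC variant shown (target-weighted covariance against average covariance) and all dimensions are illustrative assumptions, not the exact procedure of the description.

```python
# Illustrative SPoC-style regression in the spirit of Fig. 9: supervised
# decomposition guided by a continuous target (the head pose Δt ahead),
# band power of the components, ridge regression, optional quantisation.
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Ridge

def spoc_filters(windows, z, n_components=4):
    """windows: (n_win, channels, samples); z: continuous target per window."""
    z = (z - z.mean()) / z.std()
    covs = np.stack([w @ w.T / w.shape[1] for w in windows])
    C = covs.mean(axis=0)
    Cz = (z[:, None, None] * covs).mean(axis=0)     # target-weighted covariance
    vals, vecs = eigh(Cz, C)                        # generalised eigenproblem
    pick = np.argsort(np.abs(vals))[::-1][:n_components]
    return vecs[:, pick].T

def band_power(windows, filters):
    proj = np.einsum('fc,wct->wft', filters, windows)
    return proj.var(axis=2)                         # (n_win, n_components)

rng = np.random.default_rng(1)
Xw = rng.standard_normal((300, 16, 100))            # dummy EMG windows
y_future = rng.standard_normal(300) * 30.0          # head yaw Δt ahead (deg)

W = spoc_filters(Xw, y_future)
reg = Ridge(alpha=1.0).fit(band_power(Xw, W), y_future)
yaw_hat = reg.predict(band_power(Xw[:1], W))[0]
label = "left" if yaw_hat < -5 else "right" if yaw_hat > 5 else "no movement"
print(yaw_hat, label)
```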
  • the SPoC algorithm is closely related to CSP, and CSP is obtained as a special case of SPoC by using a binary variable as target.
  • EEG head-, body-, and eye
  • a linear classifier e.g., regularized linear discriminant analysis or support vector machine
  • learns to classify whether a head movement occurs in a future time interval or not e.g., by exploiting the “readiness potential”.
  • a statistical learning machine e.g., a deep neural network or other regression/classification/statistical pattern recognition technique
  • the machine may learn that video viewers tend to turn their head to coordinates (x, y) at time t of a spherical video v with a probability p.
  • Methods suited for sequential data with a temporal structure, e.g. a Long Short-Term Memory recurrent neural net, may learn to anticipate the head motion from the features.
  • Such visual cues can be learned in a supervised fashion, e.g. by providing a convolutional neural network with the visual media content or derived visual maps (see, for example, corresponding description part to step 104) that are labeled with the future head pose. It should be noted that the given intuitive explanation of what the complex statistical machine actually does learn from the data is only tentative and serves as example.
  • the head pose/motion/orientation estimates of different feature-specific predictors can be combined by posterior probability summation or with a meta-classifier.
  • Multi-feature- integration can also be performed with a deep neural net, which - automatically by design - abstracts and combines the raw features of different types that serve as input on higher levels of its architecture (i.e., in deeper layers).
  • the mapping function can be adapted to the incoming data and re-calculated online during HMD usage (and can be initialized based on previous data or with random values).
  • the predictor 22 such as one or more of its sub-predictors and/or their usage to form the final prediction in the processor, which may be called mapping function, may be updated.
  • the mapping function may be updated iteratively and its predictive performance may be evaluated continuously by comparing the forecasts made with the subsequent actual head motions or, to be more precise, by adapting the mapping function so as to result in a better prediction, i.e. a predicted location closer to the actual location when using the same input signals.
  • Filter parameters and window lengths can be adapted in a similar fashion like the mapping function. This online mechanism allows for accounting for inter-individual physiological differences and for individual changes over time.
  • Step 108 of Fig. 3 shows output, i.e., the information regarding the predicted head orientation/pose.
  • the head motion forecast can comprise estimates of the head position vector, orientation rotation matrix, Euler angles or other derived measures such as the direction, speed, acceleration, onset, occurrence, or absence of a head motion, at one or more time points or intervals in the future. Probabilistic measures of the expected certainty or precision can be included in the estimates. The specific use case determines the actual measures and parameters (e.g., selected time points or intervals in the future).
  • the forecast or prediction of the viewport location is sent to the server 14 for use in generating the video stream in step 110, such as for focusing encoding resources thereonto and/or for performing, depending thereon, a packetization of the encoded representation so as to form, by the packetization, the video stream going to be retrieved from the server 14 by the device 12, in a manner having increased quality in the predicted viewport location compared to the surrounding of the predicted viewport location.
  • the focusing degree i.e. the increase in encoding resources and/or quality of the packetization result, in the predicted viewport location compared to the surrounding, may be set by the server 14 according to a confidence information accompanying the information on the predicted viewport location.
  • Another recipient of the prediction may, additionally, be the device 12 in its function of retrieving the immersive video, for instance in order to perform representation selection among several representations of the video available at the server 14 and/or pre-fetching in case a client-driven adaptive streaming protocol such as DASH is used.
  • the features and/or forecasts and/or sensor signals depending on which the prediction has been done may be buffered or are sent to, and stored in, a database that can be accessed by an adaptive learning algorithm in later or other HMD usage sessions (e.g., for transfer learning across subjects).
  • Such stored data may be used to optimize or update the predictor 22 in the manner outlined above, when the prediction device is provided with feedback information on the actually assumed location of the viewport 8 for a time for which the location had been predicted at the mentioned look ahead using this stored data.
  • the continuous performance evaluation (as described in the section regarding step 106) and the probabilistic output measures allow for deciding online (with a threshold) whether to start, to stop or to reactivate the exploitation of the forecast.
  • the eye orientation can be forecast instead of (or in addition to) the head pose (by simply replacing the target variable of the predictions in step 106).
  • the visual media content is generated there as step 110 accordingly.
  • the prediction may at least partially already be performed at the server 14.
  • the video stream generation ends up in the video stream which is then used for presentation of the video to the user at 112, i.e. the actually assumed viewport 8 is displayed at the HMD 4. If the viewport location prediction worked well, the user watches the video at improved quality compared to the case where the above-mentioned measures are not implemented.
  • the degree of focusing the data stream generation onto the predicted location of the viewport portion may be made dependent on the confidence of the prediction, so that, in case the confidence is low, the region experiencing more encoding resources and/or being of improved quality in the packetized version is wider, i.e. exceeds the predicted viewport location more than in case the confidence is high.
  • the result of the above described steps is a feedback loop:
  • the video as streamed is displayed at the HMD 4, the user behavior is detected by the sensors 32a to 32c and the detected signals are provided to the computing device 12, etc. That is, the sensors 32a to 32c detect the actual viewport portion also for that time for which a prediction of the viewport location has been made some time ago, and, optionally, this actual location may be provided to the computing device 12 as the feedback information. Then, the computing device 12 may update a parameterization of the predictor based on the feedback information (actual location).
  • a streaming server 14 may send packages of visual data that are optimized in terms of bit allocation depending on the forecast viewing perspectives (viewports). Additionally or alternatively, the server 14 or some device generating the encoded representation to be packetized, allocates limited computational resources with the objective to provide higher quality and/or computational power for forecast viewport locations (e.g., in computer graphics or gaming) or with the aim to save computational resources or energy (e.g., on a mobile device with low computing power or battery capacity).
  • the forecast can also be included in games, where player movements are anticipated in order to modify the gaming experience or resource allocation, e.g. bit rate, computing power, or (spatial) fidelity allocation per region (viewport versus non-viewport).
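  • As an illustration of such viewport-dependent resource allocation, a simple tile-based bitrate split could look as follows; the tile grid, bitrate budget and boost factor are assumed values, not taken from the description.

```python
# Sketch of viewport-dependent bit allocation: tiles covered by the predicted
# viewport get a larger share of the total bitrate budget.
def allocate_bitrate(total_kbps, n_tiles, viewport_tiles, boost=4.0):
    weights = [boost if t in viewport_tiles else 1.0 for t in range(n_tiles)]
    s = sum(weights)
    return [total_kbps * w / s for w in weights]

# 8x4 tile grid flattened to 32 tiles; tiles 10-13 lie in the predicted viewport.
rates = allocate_bitrate(total_kbps=20000, n_tiles=32, viewport_tiles={10, 11, 12, 13})
print(round(rates[11]), round(rates[0]))   # viewport tile vs. background tile
```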
  • Foveated rendering is a graphics rendering technique which uses an eye tracker integrated with a virtual reality headset, e.g., HMD, to reduce the rendering workload by greatly reducing the image quality in the peripheral vision (i.e., outside of the zone gazed at by the fovea: out of sight).
  • the above mentioned prediction scheme is not the only thinkable prediction scheme.
  • while the prediction scheme according to the present invention is capable of predicting an orientation change before any movement occurs, further methods exist, e.g. velocity- or acceleration-based prediction, which can accurately predict a future pose once movement has begun.
  • saliency-based prediction or object tracking schemes can be facilitated based on content characteristics or known user behavior characteristics to predict orientation changes.
  • Content based (saliency-based, object-based or user statistics- based) prediction can be performed on the sending entity (server/encoder/decoder).
  • orientation prediction can be performed by all mentioned device/equipment/modalities. Combination of the different modalities can happen on the sending or on the user side.
  • Another scenario is a conversational, cloud gaming, low latency virtual-reality service as shown in Fig. 10.
  • Streaming happens over the Real-Time Transport Protocol or over MPEG Media Transport or any other streaming protocol capable of low-latency streaming, such as with latencies lower than 2 seconds, where the server 14 decides on the encoding and the content to be sent to the receiver. That is, scenarios where the device 12 decides on a packetized version of the encoded representation by itself, with the capability to select a packetized version having the highest quality at the predicted viewport location, such as in DASH, may be discriminated from scenarios where the server 14 decides on the streamed version: in the former case, the dependency on the viewport location forecast may be used to focus the encoding resources yielding the encoded representation, for instance.
  • the latter may then be packetized to yield a set of available representations for download by the device or client 12. Owing to the focusing, the user will likely experience higher quality than he/she would if the focusing of the encoding resources had not been used.
  • the server 14 decides on the packetization, which then yields the stream retrieved by the device or client 12 and which is then used for presenting the actual viewport 8 to the user. Again, the user will likely experience higher quality than he/she would if the packetization were performed without depending on the viewport forecast.
  • the user 2 may send, e.g., periodically, a message with the forecast to the server 14, with the information of the position of the viewport of the receiver and information on Δt, i.e. the look-ahead time.
  • RTT round-trip-time
  • the message signals the ahead time Δt together with the orientation or position forecast.
  • The time Δt can be sent explicitly or may be sent by indexing one of preset values for Δt.
  • the ahead time may be indicated in terms of presentation time or presentation time difference or as a combination of presentation time or presentation time difference with a universal time used for synchronization.
  • the advance time Δt could be fixed by a system specification or could be known by default to the server, with this information thus not accompanying the locational information.
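  • A possible, purely illustrative shape of such a forecast report message, with Δt either sent explicitly or indexed into preset values, is sketched below; the field names and preset values are assumptions, not defined by the description.

```python
# Sketch of a client-to-server forecast report: predicted yaw/pitch plus the
# look-ahead time Δt, sent either explicitly (milliseconds) or as an index
# into hypothetical preset values, optionally with a confidence value.
import json

PRESET_DT_MS = [50, 100, 200, 500, 1000]          # hypothetical preset Δt values

def forecast_report(yaw_deg, pitch_deg, dt_ms=None, dt_index=None, confidence=None):
    msg = {"yaw": yaw_deg, "pitch": pitch_deg}
    if dt_index is not None:
        msg["dt_index"] = dt_index                # indexes PRESET_DT_MS
    else:
        msg["dt_ms"] = dt_ms                      # explicit look-ahead time
    if confidence is not None:
        msg["confidence"] = confidence            # 0..100 %
    return json.dumps(msg)

print(forecast_report(yaw_deg=35.0, pitch_deg=-5.0, dt_index=2, confidence=80))
```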
  • the first system design encompasses at server side a per-user video stream generation, in which, for each user, an encoder resource allocation is directly influenced by the received prediction of the user’s viewport. I.e., the encoder for that user creates content, i.e. an encoded representation of the video 6, individualized for this user in that the encoding resources such as bitrate for instance are focused onto the predicted viewport location of this user.
  • the encoded representation is then sent to the user.
  • the second design uses a set of independent encoders that create various variants of the content 6 or portions thereof. “Independent” indicates that these encoded representations may be generated irrespective of the predicted viewport location.
  • the ahead time Δt and orientation or position information is accompanied with a confidence value that expresses the degree of reliability of the ahead time Δt measurement.
  • Given a high value (in the range of 0 to 100%) of the confidence value, the server 14 can be very sure that the transmitted orientation or position will be reached in the indicated ahead time Δt, whereas a low confidence value would indicate that the prediction is of limited reliability.
  • the confidence value would hence influence server-side processing in the sense that certain server-side measures can be enacted, e.g., covering a wider (bigger) field of view (FoV) than the actual client-side equipment, to overprovision the FoV of the end user and mitigate the effect of a wrongful ahead-time prediction.
  • the ahead time Δt and the orientation or position information is accompanied with FoV information reflecting the confidence of the prediction.
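  • One simple, assumed way the server could map the reported confidence to an overprovisioned FoV is sketched below; the linear mapping and the margin value are illustrative choices only.

```python
# Sketch of confidence-driven overprovisioning: the lower the reported
# confidence, the wider the field of view encoded/packetised at high quality
# around the predicted viewport.
def provisioned_fov(client_fov_deg, confidence_pct, max_margin_deg=60.0):
    margin = max_margin_deg * (1.0 - confidence_pct / 100.0)
    return client_fov_deg + margin

print(provisioned_fov(client_fov_deg=90.0, confidence_pct=90))   # ~96 deg
print(provisioned_fov(client_fov_deg=90.0, confidence_pct=20))   # ~138 deg
```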
  • aspects have been described in the context of an apparatus or a system, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus and/or system.
  • Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
  • the inventive data stream can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
  • embodiments of the invention can be implemented in hardware or in software.
  • the implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
  • Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
  • embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer.
  • the program code may for example be stored on a machine readable carrier.
  • Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
  • an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
  • a further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
  • the data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
  • a further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein.
  • the data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
  • a further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
  • a processing means for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
  • a further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
  • a further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver.
  • the receiver may, for example, be a computer, a mobile device, a memory device or the like.
  • the apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
  • a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein.
  • a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein.
  • the methods are preferably performed by any hardware apparatus.
  • the apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
  • the apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
  • the methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
  • the methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Dermatology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Neurosurgery (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An apparatus performs a prediction of a location of the viewport portion within the video at a predetermined look ahead time using a predictor and receives feedback information on the actual location of the viewport portion the user looked at at the predetermined look ahead time. A viewport portion of the video a user is predicted to look at, i.e., a field of view of the user, is predicted based on the predicted head orientation.

Description

Concepts for improved Head Motion Prediction and Efficient Encoding of
Immersive Video
This application is concerned with a concept for head motion prediction usable, for instance, with a head mounted display, and a concept for efficient encoding of immersive video.
A head mounted display (HMD) is an image or video display device which may be worn by a user on the head or as part of a helmet. A typical HMD has one or two small displays, with lenses and semi-transparent mirrors embedded in eyeglasses, a visor, or a helmet. The display units are miniaturized and may include cathode ray tubes (CRT), liquid crystal displays (LCDs), liquid crystal on silicon (LCoS), or organic light-emitting diodes (OLED). Visual media content can be presented to the user on the HMD placed in front of the eyes, e.g., in virtual reality (VR), augmented reality (AR) or 360°-video presentation systems. AR superimposes computer-generated imagery on live imagery from the physical world. Presenting 360° video requires very high resolutions, such as 4K, 8K and higher. The user's head pose/orientation (i.e., position and direction/angle relative to a spherical panorama around the user) can be tracked, which allows for adapting the visual content to the current head pose and thus the perspective of the user. Ideally, this adaptation is immediate, because a temporal mismatch between head motion and visual content can be uncomfortable for the user. A temporal delay may even lead to symptoms similar to motion sickness due to the conflict between the inputs to the user's visual and vestibular systems. However, practical restrictions can delay the adaptation, e.g., network or other transmission latencies, or computational constraints in rendered virtual or augmented environments. As a countermeasure against delays, the entire spherical panorama around the user (see, for example, Fig. 1) can be rendered and/or transmitted, including areas outside of the present field of view of the user. This approach makes it possible to select the section of the spherical panorama to be displayed locally, close to the HMD, and hence with very low delay after a head rotation.
However, transmission bandwidth and computational power for rendering are limited in practice. Thus, providing visual content from a panorama section larger than the field of view (see, for example, Fig. 1) happens at the expense of the perceived quality of the visual content actually looked at. That is, the user never sees a whole 360° video, and streaming the 360° video in its full resolution would waste resources, including bandwidth, storage and computational power.
It is the object of the subject-matter of the present application to provide concepts which assist in, or enable, dealing with the above outlined dilemma. In particular, the present application provides a concept for head motion prediction which results in improved forecasts of head motions, thereby enabling, for instance, bridging round trip times in immersive video streaming scenarios for the sake of tailoring the encoding or the available network throughput resources to a user's viewport. Alternatively or additionally, the present application provides a concept for improved video stream generation resulting in a higher quality appearance for the user in case of immersive video scenarios.
This object is achieved by the subject-matter of the claims of the present application.
It is a basic idea underlying the subject-matter of the present application according to a first aspect that an improved forecast of head motions is possible by using feedback information on an actual location of the viewport portion the user looked at at the predetermined look ahead time. For instance, head motions may be forecast, i.e., the prediction of the location of the viewport portion may be done, with a lead (look ahead) time between 0.05 and 1 seconds. In accordance with an embodiment, the feedback information on the actual location of the viewport portion is used for updating a parameterization of the predictor using which the prediction of the location is performed. Prediction and measurement for obtaining the feedback thus co-operate in that the measurement takes the look ahead time of the prediction into account. A training of the prediction which ends up in the update is hence optimized for predicting the viewport portion's location at the look ahead time.
This concept of the first aspect is a perfect candidate for embodiments relating to a second aspect of the present application. According to the second aspect, it is an idea of the present application that video stream generation may be made more efficient when used in connection with varying viewport locations, such as in immersive video applications, when the video stream generation is rendered dependent on information on a predicted viewport portion of the video. The dependency may relate to encoding resources. That is, the predicted location of the viewport portion may steer whereto available encoding resources are spent more than onto other portions of the video. This is feasible owing to the predictive nature, i.e., a look ahead, of the prediction of the viewport portion location. The encoding resources may relate to bitrate and/or computational/coding power. Additionally or alternatively, the dependency may relate to a packetization of an encoded representation of the video. For instance, the packetization composes the video stream for the user's client in a manner so that the video is represented by the video stream at the viewport portion at higher quality than at a region outside thereof. The packetization might be done, in this case, for the user's client on the fly, or the focusing of the packetization is done by selecting the packetized version which is to become the data stream out of several pre-packetized versions of the encoded representation which differ in the locations of improved video quality. Hence, rendering the video stream generation dependent on the predicted viewport location enables achieving more efficiency for immersive video streaming.
Both aspects discussed above may be used together or may advantageously be used individually. In both aspects, the prediction leading to the viewport location information may be obtained, for instance, based on user sensor data and/or based on an evaluation of the video material/content itself. According to the first aspect, it is feasible to improve the prediction by feedback information on actually assumed viewport portion locations. In the second aspect, the prediction may, in case of at least partially exploiting user sensor data, result in a closed loop being formed between the video stream generation site, such as the immersive video server, and the user equipment for displaying an immersive video presentation, such as the client: the video is presented to the user, while it is predicted, using user sensor data measuring a behavior of the user while watching the video, or the current viewport, where the viewport portion will move to within the picture area of the video at a certain lead time, for instance; the video stream generation site is notified of the predicted location of the viewport portion so that the encoding resources, such as computational power and/or bitrate, and/or the packetization of the encoded representation, may be focused onto this predicted location of the viewport portion. Owing to the look ahead, the focusing will end up, with a likelihood that is the higher the better the viewport location prediction is, in the user experiencing an improved video quality. The user equipment sends the report of the predicted location, for instance, along with temporal information as to when the user is going to look at the viewport portion to the server. When, optionally, additionally using the first aspect's feeding back of information on the actual viewport portion's location, the prediction may be improved with respect to subsequent predictions.
As user sensor data, one or more of a multichannel EMG signal, a multichannel EEG signal, a multichannel EOG signal, an eye track signal, a multichannel skeletal tracker signal, and a signal based on a visual media content which is looked at by the user may be used as a basis, for instance. Additionally or alternatively, the visual content itself may be analyzed in order to predict the viewport portion's location in a statistical sense, i.e., in the sense that the predicted viewport portion coincides with typical user behaviors when the users are watching the video.
Hence, the second aspect's approach renders possible an optimal spatial allocation of limited resources, e.g., of coded bits, of computing power and/or of available transmission bandwidth. The prediction of the position of the viewport portion may, as stated, be calculated based on a specific user's behavior such as past head, eye or body movements, or based on electrophysiological data, or may be calculated based on visual properties of the scene content of the video itself, such as based on the optical flow (i.e., the pattern of apparent motion of objects, surfaces and edges in a visual scene caused by the relative motion between the user and a scene) or saliency (i.e., the state or quality by which an item stands out from its neighbors), or combinations of these features. The prediction may comprise estimators of the head pose/orientation or derived measures such as motion direction, speed, acceleration, onset, occurrence, or absence at one or more time points or intervals in the future. The function that maps the available features to the motion or viewport portion estimate may be learned offline from a previously recorded dataset and/or learned and adapted online during usage.
In accordance with the embodiments of the present application, an apparatus for predicting a viewport portion of a video a user is going to look at is configured to perform a prediction of a location of the viewport portion within the video at a predetermined look ahead time using a predictor, and to receive feedback information on an actual location of the viewport portion the user looked at at the predetermined look ahead time. The apparatus may update a parameterization of the predictor using the actual location. The predictor comprises a plurality of sub-predictors each configured to generate a respective prediction candidate, and the apparatus may derive the prediction from the generated prediction candidates. The apparatus, for example, receives a multichannel EMG signal (i.e., a multichannel electromyography signal) from a user sensor (e.g., electrodes attached to the user), and uses the multichannel EMG signal to forecast the change in the head orientation of the user. The change in the head orientation of the user may be predicted in two steps, i.e., the first step as a first classification predicts whether the head is going to move in a future time interval or not, and, when the first classification predicts that the head is going to move, the second step as a second classification predicts into which direction (e.g., right or left) the head is going to move. Alternatively, a regression technique may be used for prediction. Therefore, the head orientation, i.e., the head position/location and direction, is predicted, and a viewport portion of the video the user is predicted to look at, i.e., a field of view of the user, is predicted based on the predicted head orientation. In other words, it is possible to predict the display area the user is going to look at in an on-the-fly mode, i.e., the ongoing change of the head orientation of the user is predicted. Hence, it is possible to improve the visual content quality without changing the computational power for rendering.
In accordance with the embodiments of the present application, a user equipment for displaying an immersive video presentation may be configured to retrieve a video from a server, predict a location of a viewport portion within the video, where a user is going to look at, at a predetermined look ahead time, and report the location along with temporal information as to when the user is going to look at the viewport portion to the server. The look ahead time may relate to an absolute presentation time indication or a presentation time difference, or a combination of absolute presentation time or presentation time difference with a mapping of presentation time to a universal time used for synchronization between server and client (e.g., UTC) or, in other words, relate to a time basis synchronized between the sender of the location information, i.e., the forecast apparatus, and the recipient thereof, which, in accordance with an example, might be some universal time.
In accordance with the embodiments of the present application, an apparatus for generating a video stream representing a video may be configured to obtain information on a predicted viewport portion of the video a user is predicted to look at, i.e., the ongoing field of view of the user, and to focus encoding resources for encoding the video into the data stream onto the predicted viewport portion and/or to focus the packaging of the video to the available throughput of the network onto the predicted viewport portion. That is, by obtaining the information indicating the predicted viewport portion, the encoding resources, i.e., the available bitrate and/or computational power, are effectively used to improve the perceived quality of the visual content and avoid a temporal mismatch. Hence, it is possible to effectively avoid a mismatch between the user motion and the visual content caused by the rendering delay.
In accordance with the embodiments of the present application, a system for presenting a video to a user comprises a detector configured to predict a viewport portion of the video which the user is going to look at; an apparatus for generating a video stream representing a video according to the present invention; and an interface configured to inform the video encoder of the predicted viewport portion; wherein the video encoder is configured to focus encoding resources for encoding the video into the data stream onto the predicted location of the viewport portion. For example, the detector receives a multichannel EMG signal from a user sensor, and uses the multichannel EMG signal to forecast the change in the head orientation of the user for predicting the viewport of the user. Then, the encoder obtains information on the predicted viewport portion via the interface and focuses encoding resources for encoding the video into the data stream onto the predicted viewport portion. That is, the head orientation of the user is predicted in real time, i.e., the prediction is implemented in an on-the-fly mode, and hence it is possible to improve the perceived quality of the visual content and avoid a temporal mismatch by using limited computational power.
Further advantages are achieved by the apparatus, the user equipment and the system of the claims. Preferred embodiments of the present application are described below with respect to the figures, amongst which:
Fig. 1 shows a schematic illustration of the actual field of view as a section of the full spherical panorama;
Fig. 2 shows a schematic diagram of a system for presenting a video to a user according to embodiments of the present application;
Fig. 3 shows a schematic diagram explaining how visual content is presented to a user according to embodiments of the present application;
Fig. 4 shows a block diagram of an apparatus for predicting a viewport portion of a video a user is going to look at as an example of the apparatus where prediction concept according to embodiments of the present application could be implemented;
Fig. 5 shows a block diagram of an apparatus for predicting a viewport portion of a video a user is going to look at as another example of the apparatus where the prediction concept according to embodiments of the present application could be implemented;
Fig. 6 shows a block diagram of a user equipment for displaying an immersive video presentation as an example of the user equipment where the prediction concept according to embodiments of the present application could be implemented;
Fig. 7 shows a block diagram of an apparatus for generating a video stream representing a video as an example of the apparatus where the prediction concept according to embodiments of the present application could be implemented;
Fig. 8 shows a schematic diagram of EMG based head motion forecasting with classification according to embodiments of the present application;
Fig. 9 shows a schematic diagram of EMG based head pose forecasting with regression according to embodiments of the present application; and
Fig. 10 shows a schematic diagram of head motion anticipation according to embodiments of the present application.
The following description sets forth specific details such as particular embodiments, procedures, techniques, etc., for purposes of explanation and not limitation. It will be appreciated by those skilled in the art that other embodiments may be employed apart from these specific details. For example, although the following description is facilitated using non-limiting example applications, the technology may be employed with any type of video codec. In some instances, detailed descriptions of well-known methods, interfaces, circuits and devices are omitted so as to not obscure the description with unnecessary detail.
Equal or equivalent elements with equal or equivalent functionality are denoted in the following description by equal or equivalent reference signs.
The following description of the figures starts with a presentation of a description of a viewport/field of view of a user wearing a head mounted display (HMD). The viewport of the user is described with respect to Fig. 1.
Fig. 1 shows a user 2 wearing an HMD 4 via which the user 2 is presented a panoramic video 6. As shown in Fig. 1, the user sees a limited area, i.e., the field of view (viewport) 8, of the panoramic video 6 at each time instant, while the remaining area is out of sight 10 for the user 2 while the user is looking at the viewport 8. That is, the user 2 looks at a scene or panoramic video 6 via the HMD 4, but only sees the viewport 8, and therefore a high resolution of the remaining area is not required since the out-of-sight area 10 is not being looked at by the user 2. As shown in Fig. 2, the HMD 4 is connected to a computing device 12 which has a function to work as an apparatus of the present application, and/or a decoder. The computing device 12 is connected via some network to a server 14 providing a video content to the user, e.g., the server could be a content server or a distribution server. The server 14 represents an apparatus for video stream generation. The computing device 12 includes an interface which informs the server 14 on a predicted location of the viewport portion.
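For illustration, the following Python sketch shows one possible way a predicted head orientation could be mapped to the pixel region of the equirectangular panorama 6 that covers the viewport 8; the function name, the equirectangular projection and all numeric values (panorama size, field of view) are assumptions made for the example and are not fixed by the description.

```python
# Illustrative mapping from a predicted head orientation to a viewport region
# of an equirectangular panorama (all values are exemplary assumptions).
import numpy as np

def viewport_region(yaw_deg, pitch_deg, fov_h_deg=90.0, fov_v_deg=90.0,
                    pano_w=7680, pano_h=3840):
    """Return (x0, x1, y0, y1) pixel bounds of the viewport in the panorama.

    yaw_deg in [-180, 180), pitch_deg in [-90, 90]; equirectangular mapping.
    """
    # Horizontal extent, wrapped around the 360 degree panorama.
    x_center = (yaw_deg + 180.0) / 360.0 * pano_w
    half_w = fov_h_deg / 360.0 * pano_w / 2.0
    x0, x1 = (x_center - half_w) % pano_w, (x_center + half_w) % pano_w

    # Vertical extent, clipped at the poles.
    y_center = (90.0 - pitch_deg) / 180.0 * pano_h
    half_h = fov_v_deg / 180.0 * pano_h / 2.0
    y0 = int(np.clip(y_center - half_h, 0, pano_h))
    y1 = int(np.clip(y_center + half_h, 0, pano_h))
    return int(x0), int(x1), y0, y1

print(viewport_region(yaw_deg=30.0, pitch_deg=10.0))
```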
For the presentation of visual media content on the HMD 4 to the user 2 as shown in Fig. 1, several steps as indicated in Fig. 3 are involved. The user 2 is equipped, for instance, with at least one of the following sensors to obtain signals which are used to predict a change in the head orientation of the user and, thus, the location of the viewport 8. The sensors include electrodes, cameras and/or other devices which track/record/sense/detect the head or eye motion of the user.
The apparatus for predicting a viewport portion of a video a user is going to look at according to the present invention may be implemented on the HMD 4, the computing device 12 or the server 14, or a part of them. In other words, the HMD 4, the computing device 12 and/or the server 14 may have this functionality. For example, Fig. 4 shows the computing device 12 as comprising a main processor 20 and a predictor (e.g., modality) 22. Fig. 5 shows another example of the computing device 12. Here, the computing device 12 comprises the main processor 20 and the predictor 22 with the predictor 22 being composed of a plurality of sub-predictors 22a, 22b and 22c. In Fig. 5, three sub-predictors are exemplarily indicated, but the number of the sub-predictors is not limited.
Fig. 6 shows a user equipment as the HMD 4 according to the present invention. The HMD 4 comprises a main processor 30 and a plurality of sensors 32a to 32c. However, again, the number of sensors is not limited, and the HMD 4 may also include the predictor (sub-predictors).
Fig. 7 shows an example of the server 14 according to the present invention. The server 14 comprises a video stream data generator 40 and an encoding core 42. The encoding core 42 generates, using some encoding resources, an encoded representation which the generator 40 then subjects to packetization to yield the video stream output by the server 14 to the device 12 for viewport presentation to the user. One or both of the encoding core 42 and the generator 40 may operate depending on a predicted viewport location. Although not indicated in the figure, the server 14 may include a memory or a plurality of memories to store the video content and other data. The server 14 may comprise the predictor (or some or all sub-predictors).
For example, a head tracking signal is acquired by the HMD, i.e., the HMD 4 worn by the user 2. To this end, same may include several sensors 32a to 32c as described in Fig. 6, for example. An inertial measurement unit may be built into the HMD 4, for example. Additionally or alternatively, optical sensors may track the user's head motion (e.g., a head tracker).
One of the sensors 32a to 32c may acquire an EMG (electromyography) signal. It may be acquired by electrodes placed on the skin of the user. The EMG of the user is measured by sensing the activity of one or more muscles with the electrodes. For instance, sixteen EMG electrodes can be placed on the neck surface in eight pairs of respectively two vertically aligned electrodes, which results in a 16-dimensional time series. One electrode pair is put on the left and another pair on the right sternocleidomastoid muscle. The remaining electrode pairs are arranged in between with equal spacing around the neck, forming an arc with a frontal opening.
One of the sensors 32a to 32c may acquire eye movements, such as by an EOG (electrooculography) signal which is acquired by two or more electrodes in contact with the skin on the user’s head, or by one or more (infrared) cameras built in the HMD. The EOG signal indicates the eye movements.
In case the video content relates to a kind of game and the user interactively plays the game using controllers, the controllers or skeletal trackers may have one or more of the sensors 32a to 32c so as to measure body movements of the user to acquire the body movement signal.
One of the sensors 32a to 32c may use electroencephalography (EEG) to capture the brain activity with two or more electrodes on the scalp of the user.
Meta information of the visual signal relating to visual media content (or pointers referring to the content, e.g., name and current time of a video, or a Uniform Resource Locator) may also be used as input for the viewport location prediction. That is, one or more of the above signals may be used in the prediction process performed by the predictor 22.
As indicated in step 100 of Fig. 3, the user 2 wears the HMD 4 which includes sensors or selected sensors, e.g., sensors 32a to 32c, and/or all or selected electrodes to track the movement of the user 2. In particular, the user is presented the video via immersive streaming. That is, the user 2 sees the viewport 8 while assuming a current viewport location and reacts in order to move the viewport 8 according to his/her wish, which movement is measured by the sensors 32a to 32c.
In step 102, the above signals or a selection thereof are acquired in synchrony and sent to, for example, the computing device 12 in Fig. 4 or 5, which forms an embodiment of an apparatus of the present application for predicting a location of the viewport 8. The prediction is done in real-time during HMD usage. In Fig. 3, the prediction task is exemplarily shown to be composed of a sequence of two steps, namely processing steps 104 and 106.
Step 104 is a kind of preparation step of the prediction. Here, the inbound signals of step 102 are subject to feature extraction and/or integration.
As optional pre-processing of the head tracker signals, the head direction/orientation data, e.g., a rotation matrix, may be decomposed into Euler angles. The head position vector may be adjusted to the site of HMD usage, e.g., related to the room center. Body tracker signals can be pre-processed similarly. EMG, EOG, and EEG signals are amplified and analogue-to-digital converted - if applicable - and may be filtered in frequency (e.g., high-, low-, or band-pass filtering) and space (e.g., Laplace filtering, bipolar subtraction, selection of electrode locations). Filter parameters that maximize the predictive performance are chosen according to the training data (explained in the later step 106), e.g., by searching over a grid of different frequency bands. Laplacian spatial filters, which subtract the average signal of adjacent electrodes from each electrode, may increase the signal-to-noise ratio and compensate for inter-individual anatomical differences and for changes of the electrode locations. Bipolar subtraction of EMG electrodes on the left and right sides of the neck of the user may better capture the activity onset of the sternocleidomastoid muscles.
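A minimal Python sketch of such EMG pre-processing (band-pass filtering followed by bipolar subtraction of electrode pairs) is given below; the cut-off frequencies, the sampling rate and the channel pairing are illustrative assumptions rather than values prescribed by the description.

```python
# Illustrative pre-processing of a 16-channel neck EMG: band-pass filtering and
# bipolar subtraction of vertically aligned electrode pairs (exemplary values).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 1000.0                                      # sampling rate in Hz (exemplary)
B, A = butter(4, [20.0, 450.0], btype="bandpass", fs=FS)

def preprocess_emg(raw, pairs):
    """raw: (channels, samples); pairs: list of (upper, lower) electrode indices.

    Returns band-pass filtered bipolar channels of shape (len(pairs), samples).
    """
    filtered = filtfilt(B, A, raw, axis=1)
    return np.stack([filtered[u] - filtered[l] for u, l in pairs])

raw = np.random.randn(16, 5000)                  # stand-in for recorded EMG
bipolar = preprocess_emg(raw, pairs=[(2 * i, 2 * i + 1) for i in range(8)])
print(bipolar.shape)                             # (8, 5000)
```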
The horizontal and vertical orientation of the eye(s) may be inferred from the eye image(s) recorded by the camera(s) (with image processing methods and an additional eye tracker calibration session with known eye orientations). Unprocessed eye images can be used as signal as well.
As a further part of step 104, multi-dimensional features, e.g., EMG signals, eye movement recorded by the camera, the body movement signal and/or other signals, may be extracted from the signal streams acquired by the sensors and received by the apparatus 12. In detail, the acquired sensor signals, e.g., one or more of a multichannel EMG signal from a user sensor, a multichannel EEG signal, a multichannel EOG signal, an eye track signal, a multichannel skeletal tracker signal, a signal based on a visual media content which is looked at by the user, and statistical user behavior models, are sent to the computing device 12 from the HMD 4.
For example, in case head and body tracker signals are received, the obtained time series may be used directly and/or time points of motion onsets may be inferred from the pose data, e.g., with motion speed thresholds or according to the spectral flux.
In case EMG and EEG signals are received, sliding windows of the electrophysiological time series can be extracted and the multi-channel windows can be vectorized, after an optional down-sampling step. Window lengths that are optimal with respect to the predictive performance can be estimated according to the training data by the adaptive learning algorithm (details are explained below regarding the process of step 106).
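The sliding-window extraction and vectorization may, for instance, be sketched as follows; window length, step size and down-sampling factor are free parameters that, as stated above, would be chosen according to the training data.

```python
# Sketch of sliding-window extraction with optional down-sampling and
# vectorization of the multi-channel windows (parameter values are exemplary).
import numpy as np

def sliding_windows(signal, win_len=200, step=50, downsample=2):
    """signal: (channels, samples). Returns (n_windows, channels * win_len // downsample)."""
    n_channels, n_samples = signal.shape
    starts = range(0, n_samples - win_len + 1, step)
    windows = [signal[:, s:s + win_len:downsample].reshape(-1) for s in starts]
    return np.stack(windows)

X = sliding_windows(np.random.randn(16, 5000))
print(X.shape)
```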
In case eye movement signals are received, the time series may be used unprocessed and/or time points of eye saccade onsets may be determined. Visual features in the media content may be detected and mapped, e.g., objects, faces, visual saliency, contextual priors, or optical flow (with image processing methods). Alternatively, the unprocessed media content can serve as a feature. If only pointers to the media content are provided as signal stream, the respective content is loaded for this purpose. This might be useful in case the video content is a game content and a game character has priority or a specific movement which might catch the gaze of the user.
Synchronization and interpolation can correct for time lags between the different feature time series and for different sampling frequencies. Whitening or other normalization procedures may further integrate the multi-dimensional feature time series. Features from previous HMD usage sessions of the user and/or of sessions of other users can be added to the training data set (in detail, see below the description part regarding step 106).

In step 106, the actual prediction takes place. An adaptive learning algorithm may be used to this end, i.e., for head motion forecasting, as follows. A function may map the incoming features (i.e., signals from the sensors, see step 104) to an estimate of the head pose or motion in the future (see step 108 for more details about the properties of the output). Methods from supervised machine learning calculate the mapping function based on a data set of features acquired in the past that are labelled with the actual head poses or motions that occurred at later time points than the respective features. These labels are extracted from the head-tracker features of the very same data set. The mapping function can be calculated with data from the current or previous HMD usage session(s) of the user and from other users, or with data from a calibration phase before the actual HMD usage.
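One possible way to build such a labelled training set is sketched below: each feature vector observed at time t is paired with the head pose actually measured Δt later, so that the learned mapping predicts the pose at the look ahead time. The variable names and the 0.2 s look ahead are assumptions made for the example.

```python
# Sketch of constructing look-ahead labels from a recorded session: features at
# time t are labelled with the head yaw measured dt seconds later (exemplary values).
import numpy as np

FS = 100.0                         # head tracker rate in Hz (exemplary)
LOOK_AHEAD_S = 0.2                 # look ahead time, e.g., 200 ms

def make_training_pairs(features, head_yaw, fs=FS, dt=LOOK_AHEAD_S):
    """features: (n_steps, n_feat) aligned with head_yaw: (n_steps,)."""
    shift = int(round(dt * fs))
    X = features[:-shift]          # features observed at time t
    y = head_yaw[shift:]           # pose actually assumed at time t + dt
    return X, y

X, y = make_training_pairs(np.random.randn(1000, 64), np.random.randn(1000))
print(X.shape, y.shape)
```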
For example, the computing device 12 as indicated in Fig. 5 receives one or a plurality of signals from the HMD 4. In addition, the computing device 12 retrieves video content information and a video stream to be displayed by the HMD 4. This video is then subject to video stream generation at the server 14. A plurality of sub-predictors may act on these signals which might have been subject to the preprocessing at step 104. Each sub-predictor may generate a respective prediction candidate, i.e., a candidate for the prediction of the location of the viewport at some look ahead. E.g., the sub-predictor 22a may receive the EMG signal and generate a prediction candidate based on the EMG signal. In a similar manner, the sub-predictor 22b may receive an eye tracking signal and generate a prediction candidate based on the eye tracking signal, and the sub-predictor 22c may receive a signal based on a visual media content which is looked at by the user, i.e., retrieved video content information, and may generate a prediction candidate based on the retrieved video content information. In case the predictor is not composed of more than one sub-predictor, the predictor 22 generates a plurality of prediction candidates based on the received signals, or just one prediction is done and the merging or combination of candidates described as being performed in the processor 20 may be left off. In the above, the prediction candidate is generated based on one received signal; however, a plurality of signals may be used for generating one prediction candidate. The main processor 20 receives a plurality of generated prediction candidates and derives the final prediction from the generated prediction candidates. For example, a selection of one of the prediction candidates is performed, such as the selection of the most appropriate prediction or the most likely best prediction from the plurality of the prediction candidates. The selection of the prediction candidate may be done based on the statistical user behavior which is stored in a memory (not shown in Figs. 4 and 5). Alternatively, the most appropriate (or reliable) prediction candidate is generated considering the statistical user behavior or the feedback information based on the plurality of the prediction candidates, such as deriving the final prediction by way of a weighted sum of the prediction candidates, with weights that appropriately rank the prediction candidates according to their prediction quality. Exemplary details of deriving the prediction are explained below. In addition, the computing device 12 may generate a confidence value (explained in detail below) indicating the confidence of the prediction, and the generated confidence value may be used to accompany the prediction. The confidence value may be generated based on the stored variables, i.e., sensor signals, feedback information, content information, statistical user behavior information and so on. In case the server 14 includes the predictor/sub-predictors, or for the one or more sub-predictors residing in the server 14, the above-mentioned prediction scheme is processed at the server 14, which is the recipient of the viewport location prediction, in order to use same for tailoring the encoding/packetization thereonto. The higher the confidence, the more the tailoring/focusing may be localized to the predicted viewport location.
The lower the confidence, the less significant the increase of encoding resources and/or the increase of quality of the packetized video stream in the predicted viewport location compared to the surrounding thereof might be.
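As an illustration of how the main processor 20 might merge the candidates of several sub-predictors and derive a confidence value, the following sketch weights each candidate by its recent accuracy; the specific weighting rule and the confidence proxy are assumptions, the description only requires some ranking or merging of the candidates.

```python
# Illustrative merging of sub-predictor candidates via an accuracy-based
# weighted sum; the weighting rule is an assumption for the example.
import numpy as np

def combine_candidates(candidates, recent_errors, eps=1e-6):
    """candidates, recent_errors: one entry per sub-predictor."""
    weights = 1.0 / (np.asarray(recent_errors, dtype=float) + eps)  # lower error -> higher weight
    weights /= weights.sum()
    prediction = float(np.dot(weights, candidates))
    confidence = float(weights.max())                               # crude confidence proxy
    return prediction, confidence

pred, conf = combine_candidates([12.0, 18.0, 15.0], recent_errors=[2.0, 8.0, 4.0])
print(pred, conf)
```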
It should be noted that the prediction result of step 106 relates to some non-zero look ahead time. This is contrary to, for example, "Hahne, J. (2016), Machine learning based myoelectric control (PhD thesis)", which infers the present movement type. For forecasting the onset of a movement in the near future (on the scale of tenths or hundredths of a second, or for look ahead times between 0.05 and 1 seconds), several movements of the user or sensor data, or a combination of the acquired user movement and sensor data, such as the EMG, may be used. The forecast may be performed through classification or regression techniques. In both cases, the forecasting performance can be enhanced through signal decomposition. For example, Common spatial patterns (CSP) and Source Power Comodulation (SPoC) are statistical techniques for the supervised decomposition of multivariate signals which may be used.
CSP is described, for instance, in "(a) Fukunaga (1972). Introduction to statistical pattern recognition. Academic Press, San Diego, CA, USA; (b) Ramoser, H., Muller-Gerking, J., & Pfurtscheller, G. (2000). Optimal spatial filtering of single trial EEG during imagined hand movement. IEEE Transactions on Rehabilitation Engineering, 8(4), 441-446." CSP acts as a spatial filter that maximizes the discriminability between signals from two classes according to the signal variance. After filtering with CSP, the logarithm of the variance of the decomposed signals can be calculated and classified, e.g., with a classification function learned with linear discriminant analysis (LDA) or with a support vector machine. The disclosed methods train two or more combinations of CSP filters and a classifier (shown below in Fig. 8). The first combination predicts whether the head moves in a future time interval or not. If so, the second combination forecasts whether the head is likely to move to the left or right. As an option, this approach can be repeated on multiple levels. The spherical panorama around the user (see, for example, Fig. 1) can be divided into 2^N sections, which requires N classification stages. The output with the largest posterior probability will be considered.
SPoC is described in the document "Dahne, Meinecke, Haufe, Hohne, Tangermann, Muller, and Nikulin (2014). SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters. Neuroimage, 86, 111-122". SPoC decomposes the multivariate signal under consideration of the amplitude of signal oscillations, too. However, SPoC is guided by a continuous target variable instead of binary classes. The disclosed methods use the future head pose as target variable for SPoC (at time t plus a time difference Δt). After spatial filtering with SPoC, band power values calculated from the decomposed signals are passed to a regressor (e.g., ridge regression) in order to estimate the future head pose (shown below in Fig. 9). Optionally, the continuous forecast values can be quantized or clustered.
Fig. 8 shows the EMG-based head motion forecasting with classification according to the embodiments of the present application. From the ongoing multichannel EMG-signal 200, a sliding window is extracted. The window can be represented as a matrix with E rows and T columns, with E being the number of EMG-electrodes, and with T being the number of time points sampled within the window. Exemplary values of these variables are T=200 and E=16 for a sliding window of length 200 ms sampled with a frequency of 1000 Hz at sixteen electrodes. The window is spatially filtered with filters that have been optimized with CSP beforehand according to the training data. CSP calculates spatial filters that maximize the difference of the signal variance between two classes by solving a generalized Eigenvalue decomposition problem.
The spatial filtering step results in a set of N spatial components (exemplary value N=16; N can be any integer between two and the number of electrodes E). Calculating the variances of all or selected components and the logarithm thereof results in a feature vector that a classifier maps to a decision. The classification function is computed here with linear discriminant analysis, but other classifiers such as a support vector machine can be employed as well. Two decisions are taken in separate stages. In the first stage, a classifier decides if the head is about to move or not within a period Δt in the immediate future. The selection of Δt depends on the application and is on the scale of tenths or hundredths of a second. If the occurrence of a movement is considered as likely, a second decision stage predicts whether the head is about to move to the left or right. Both stages comprise the same individual steps as illustrated in the figure. The parameters (window size, spatial filters, classification function) of the first and of the second decision stage can be different and have been optimized beforehand according to labelled training data from the past.
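A simplified Python sketch of this CSP / log-variance / LDA pipeline is given below, with CSP implemented as a generalized eigenvalue decomposition; the data shapes and the number of retained filters are exemplary, and each of the two decision stages would be trained with its own labels (move vs. rest, then left vs. right).

```python
# Sketch of CSP spatial filtering, log-variance features and LDA classification
# for one decision stage (exemplary shapes; a library such as MNE offers an
# equivalent CSP implementation).
import numpy as np
from scipy.linalg import eigh
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_csp(epochs, labels, n_filters=6):
    """epochs: (n_epochs, channels, samples), labels in {0, 1}."""
    covs = [np.mean([x @ x.T / x.shape[1] for x in epochs[labels == c]], axis=0)
            for c in (0, 1)]
    _, vecs = eigh(covs[0], covs[0] + covs[1])        # generalized eigenproblem
    idx = np.r_[np.arange(n_filters // 2), np.arange(-(n_filters // 2), 0)]
    return vecs[:, idx].T                             # (n_filters, channels)

def log_var_features(epochs, filters):
    projected = np.einsum("fc,ecs->efs", filters, epochs)
    return np.log(projected.var(axis=2))

# One stage (e.g., move vs. rest) trained on labelled sliding windows.
epochs = np.random.randn(40, 16, 200)                 # 40 windows, 16 electrodes, 200 samples
labels = np.random.randint(0, 2, 40)
W = fit_csp(epochs, labels)
clf = LinearDiscriminantAnalysis().fit(log_var_features(epochs, W), labels)
print(clf.predict(log_var_features(epochs[:3], W)))
```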
Fig. 9 shows the EMG-based head pose forecasting with regression. From the EMG signal 200, a sliding window is extracted as explained above. The window is decomposed by projecting the data from the original signal space (with E dimensions; E is the number of electrodes; e.g., E=16 for sixteen electrodes) to a space with N dimensions (N can be any integer between two and E). The projection parameters are learned beforehand with SPoC from the training data (explained below). Then, the band power of the decomposed signal is calculated, resulting in a vector with N dimensions, which is mapped by a regression function to a continuous head pose forecast.
The parameters for decomposition and regression are calculated from the training data as follows. SPoC computes a set of spatial filters by solving a generalized Eigenvalue decomposition problem guided by the values of a target variable. Here, the known head pose from a time point shifted by a period Δt into the relative future serves as target variable (with Δt in the range of a fraction of a second). The projection with the spatial filters results in N signal components. The components are segmented in windows of length T (with T time points sampled within the window). Computing the band power of each of the N components results in one vector (with N dimensions) per time window. The set of vectors from all time segments and the corresponding target variable values serve for the parameter optimization of the regression function with supervised machine learning (e.g., with ridge regression or a different regression method). The result of the learning phase is a set of filters for SPoC to decompose the EMG, and a regression function that maps the band-power vector obtained from the decomposed EMG to a forecast of the head pose.
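The SPoC-based regression of Fig. 9 may be sketched as follows; the SPoC variant shown (a generalized eigenvalue problem guided by the z-scored target) is a simplified illustration, and the epoch shapes and the ridge parameter are assumptions.

```python
# Sketch of simplified SPoC spatial filtering, band-power features and ridge
# regression toward the future head pose (exemplary shapes and parameters).
import numpy as np
from scipy.linalg import eigh
from sklearn.linear_model import Ridge

def fit_spoc(epochs, target, n_filters=4):
    """epochs: (n_epochs, channels, samples), target: future head yaw per epoch."""
    z = (target - target.mean()) / target.std()
    covs = np.stack([x @ x.T / x.shape[1] for x in epochs])
    C, Cz = covs.mean(axis=0), np.mean(z[:, None, None] * covs, axis=0)
    _, vecs = eigh(Cz, C)                             # guided generalized eigenproblem
    return vecs[:, -n_filters:].T                     # filters most comodulated with target

def band_power(epochs, filters):
    proj = np.einsum("fc,ecs->efs", filters, epochs)
    return np.log(proj.var(axis=2))

epochs = np.random.randn(60, 16, 200)
future_yaw = np.random.randn(60)                      # head pose at t + look ahead
W = fit_spoc(epochs, future_yaw)
reg = Ridge(alpha=1.0).fit(band_power(epochs, W), future_yaw)
print(reg.predict(band_power(epochs[:3], W)))
```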
Optionally, the continuous head pose forecast from the regression can be quantized to discrete categorical variables in an additional stage. This can result in a classification output similar to the CSP-based method with three classes: "absence of a movement" versus "movement to the left" versus "movement to the right". The SPoC algorithm is closely related to CSP, and CSP is obtained as a special case of SPoC by using a binary variable as target.
Besides the EMG, features from the EEG and from head, body, and eye (EOG) trackers can also be exploited. For example, according to the labeled EEG features, a linear classifier (e.g., regularized linear discriminant analysis or support vector machine) learns to classify whether a head movement occurs in a future time interval or not (e.g., by exploiting the "readiness potential"). A statistical learning machine (e.g., a deep neural network or other regression/classification/statistical pattern recognition technique) is trained to forecast the future head pose or motion based on past head, body and eye movements of the respective user and of other users, and based on the visual media content looked at:
- The machine may learn that video viewers tend to turn their head to coordinates (x, y) at time t of a spherical video v with a probability p.
- Methods suited for sequential data with a temporal structure, e.g., a Long Short-Term Memory recurrent neural net, may learn to anticipate the head motion from the features.
- Humans tend to look at particular visual features, e.g. at faces or fast moving objects.
Therefore, the user will probably turn the head toward a face or fast moving object appearing at the border of the HMD. Such visual cues can be learned in a supervised fashion, e.g., by providing a convolutional neural network with the visual media content or derived visual maps (see, for example, the corresponding description part regarding step 104) that are labeled with the future head pose. It should be noted that the given intuitive explanation of what the complex statistical machine actually learns from the data is only tentative and serves as an example.
The head pose/motion/orientation estimates of different feature-specific predictors can be combined by posterior probability summation or with a meta-classifier. Multi-feature integration can also be performed with a deep neural net, which - automatically by design - abstracts and combines the raw features of different types that serve as input on higher levels of its architecture (i.e., in deeper layers). The mapping function can be adapted to the incoming data and re-calculated online during HMD usage (and can be initialized based on previous data or with random values). In other words, using feedback information from the HMD 4, for instance, regarding the actually assumed location of the viewport 8 at some time for which the location had been predicted using some look ahead time, the predictor 22, such as one or more of its sub-predictors and/or their usage to form the final prediction in the processor, which may be called the mapping function, may be updated. The mapping function may be updated iteratively and its predictive performance may be evaluated continuously by comparing the forecasts made with the subsequent actual head motions or, to be more precise, by adapting the mapping function so as to result in a better prediction, i.e., a predicted location closer to the actual location when using the same input signals, i.e., a stored version of the input of the respective sub-predictor in case of updating its prediction, and/or a stored version of prediction candidates in case of updating the selection/merging by the processor 20. Filter parameters and window lengths can be adapted in a similar fashion as the mapping function. This online mechanism allows for accounting for inter-individual physiological differences and for individual changes over time.
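As one concrete illustration of such an online update, the following sketch re-weights the sub-predictor candidates whenever the actual viewport location for a previously predicted time becomes known; the particular update rule is an assumption and not mandated by the description.

```python
# Illustrative online adaptation of the candidate-merging weights from feedback
# on the actual viewport location (update rule is an exemplary assumption).
import numpy as np

class OnlineCombiner:
    def __init__(self, n_predictors, lr=0.1):
        self.w = np.ones(n_predictors) / n_predictors
        self.lr = lr
        self.last_candidates = None

    def predict(self, candidates):
        self.last_candidates = np.asarray(candidates, dtype=float)
        return float(self.w @ self.last_candidates)

    def feedback(self, actual):
        # Smaller error -> larger weight; renormalize to keep a convex combination.
        errors = np.abs(self.last_candidates - actual)
        target = 1.0 / (errors + 1e-6)
        target /= target.sum()
        self.w = (1 - self.lr) * self.w + self.lr * target

c = OnlineCombiner(3)
print(c.predict([10.0, 20.0, 14.0]))
c.feedback(actual=13.0)            # actual viewport yaw observed at the look ahead time
print(c.w)
```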
Step 108 of Fig. 3 shows the output, i.e., the information regarding the predicted head orientation/pose. The head motion forecast can comprise estimates of the head position vector, orientation rotation matrix, Euler angles or other derived measures such as the direction, speed, acceleration, onset, occurrence, or absence of a head motion, at one or more time points or intervals in the future. Probabilistic measures of the expected certainty or precision can be included in the estimates. The specific use case determines the actual measures and parameters (e.g., selected time points or intervals in the future). The forecast or prediction of the viewport location is sent to the server 14 for use in generating the video stream in step 110, such as for focusing encoding resources thereonto and/or for performing, depending thereon, a packetization of the encoded representation so as to form, by the packetization, the video stream going to be retrieved from the server 14 by the device 12, in a manner having increased quality in the predicted viewport location compared to the surrounding of the predicted viewport location. Again, the focusing degree, i.e., the increase in encoding resources and/or quality of the packetization result in the predicted viewport location compared to the surrounding, may be set by the server 14 according to confidence information accompanying the information on the predicted viewport location. Another recipient may, additionally, be the device 12 in its function of retrieving the immersive video, for instance, namely in order to perform representation selection among several representations of the video being available at the server 14 and/or pre-fetching in case of using a client driven adaptive streaming protocol such as DASH.
Optionally, the features and/or forecasts and/or sensor signals depending on which the prediction has been done may be buffered or are sent to, and stored in, a database that can be accessed by an adaptive learning algorithm in later or other HMD usage sessions (e.g., for transfer learning across subjects). Such stored data may be used to optimize or update the predictor 22 in the manner outlined above, when the prediction device is provided with feedback information on the actually assumed location of the viewport 8 for a time for which the location had been predicted at the mentioned look ahead using this stored data. The continuous performance evaluation (as described in the section regarding step 106) and the probabilistic output measures allow for deciding online (with a threshold) whether to start, to stop or to reactivate the exploitation of the forecast. Furthermore, the eye orientation can be forecast instead of (or in addition to) the head pose (by simply replacing the target variable of the predictions in step 106).
Thus, as indicated in Fig. 3, after outputting the predicted head orientation/pose, i.e., after the derived prediction is output from the computing device 12 to the server 14, the visual media content is generated there as step 110 accordingly. Again, as already outlined above, the prediction may at least partially already be performed at the server 14. The video stream generation ends up in the video stream which is then used for presentation of the video to the user at 112, i.e., the actually assumed viewport 8 is displayed at the HMD 4. If the viewport location prediction worked well, the user watches the video at improved quality compared to the case where the above-mentioned measures are not implemented. As described above, the degree of focusing the data stream generation onto the predicted location of the viewport portion may be made dependent on the confidence of the prediction so that, in case the confidence is low, the region experiencing more encoding resources and/or being of improved quality in the packetized version is wider, i.e., exceeds the predicted viewport location more than in the case where the confidence is high.
The result of the above described steps is a feedback loop: the video stream is displayed at the HMD 4, the user behavior is detected by the sensors 32a to 32c and the detected signals are provided to the computing device 12, etc. That is, the sensors 32a to 32c detect the actual viewport portion also for that time for which a prediction of the viewport location has been made some time ago, and, optionally, this actual location may be provided to the computing device 12 as the feedback information. Then, the computing device 12 may update a parameterization of the predictor based on the feedback information (actual location).
Summarizing the above, a streaming server 14 may send packages of visual data that are optimized in terms of bit allocation depending on the forecast viewing perspectives (viewports). Additionally or alternatively, the server 14 or some device generating the encoded representation to be packetized allocates limited computational resources with the objective to provide higher quality and/or computational power for forecast viewport locations (e.g., in computer graphics or gaming) or with the aim to save computational resources or energy (e.g., on a mobile device with low computing power or battery capacity). The forecast can also be included in games, where player movements are anticipated in order to modify the gaming experience or resource allocation, e.g., bit rate, computing power, or (spatial) fidelity allocation per region (viewport versus non-viewport). Predicting the eye orientation (in addition to the head pose) may be exploited for foveated rendering or for creating saliency maps. Foveated rendering is a graphics rendering technique which uses an eye tracker integrated with a virtual reality headset, e.g., an HMD, to reduce the rendering workload by greatly reducing the image quality in the peripheral vision (i.e., outside of the zone gazed at by the fovea: out of sight).
The above mentioned prediction scheme is not the only conceivable prediction scheme. For instance, while the prediction scheme according to the present invention is capable of predicting an orientation change before any movement occurs, further methods exist, e.g., velocity- or acceleration-based prediction, which can accurately predict the future pose once a movement has begun. Furthermore, saliency-based prediction or object tracking schemes can be facilitated based on content characteristics or known user behavior characteristics to predict orientation changes. Content-based (saliency-based, object-based or user-statistics-based) prediction can be performed on the sending entity (server/encoder/decoder). On the user side, orientation prediction can be performed by all mentioned devices/equipment/modalities. Combination of the different modalities can happen on the sending or on the user side.
Another scenario is a conversational, cloud gaming, low latency virtual-reality service as shown in Fig. 10. Streaming happens over the Real-Time Transmission Protocol or over MPEG Media Transport or any other streaming protocol capable of low latency streaming, such as with latencies lower than 2 seconds, where the server 14 decides on the encoding and content to be sent to the receiver. That is, scenarios where the device 12 decides on a packetized version of the encoded representation by itself, with the capability to select a packetized version having the highest quality at the predicted viewport location, such as in DASH, may be discriminated from scenarios where the server 14 decides on the streamed version: In the former case, the dependency on the viewport location forecast may be used to focus the encoding resources yielding the encoded representation, for instance. The latter may then be packetized to yield a set of available representations for download by the device or client 12. Owing to the focusing, the user will likely experience higher quality than he/she would if the focusing of the encoding resources had not been used. In the latter scenario, the server 14 decides on the packetization which then yields the stream retrieved by the device or client 12 and which is then used for presenting the actual viewport 8 to the user. Again, the user will likely experience higher quality than he/she would if the packetization were performed without depending on the viewport forecast. The viewport of a user is predicted, e.g., from EEG, EMG, or extrapolation, e.g., Δt = 200 ms ahead. The computing device 12 in Fig. 2 may send, e.g., periodically, a message with the forecast to the server 14, with the information of the position of the viewport of the receiver and information on Δt, i.e., the look ahead time. Depending on the round-trip-time (RTT), this gives the server 14 more time to, e.g., prepare the content, i.e., better quality through better encoding or packetizing. The message signals the ahead time Δt together with the orientation or position forecast. The time Δt can be sent explicitly or may be sent by indexing one of several preset values for Δt. Again, the ahead time may be indicated in terms of a presentation time or a presentation time difference or as a combination of a presentation time or presentation time difference with a universal time used for synchronization. In an alternative example, the advance time Δt could be fixed by a system specification and could be known by default to the server, with this information thus not accompanying the locational information.
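A hedged sketch of such a periodic forecast message is given below; the field names and the JSON encoding are assumptions, the description only requires that the orientation/position, the ahead time Δt (explicitly or as an index into preset values) and, optionally, a confidence value are conveyed on a common time basis.

```python
# Illustrative forecast message from the computing device 12 to the server 14
# (field names, preset table and encoding are exemplary assumptions).
import json, time

PRESET_DT_MS = [100, 200, 500]          # example preset look-ahead times

def forecast_message(yaw_deg, pitch_deg, dt_index, confidence):
    return json.dumps({
        "yaw_deg": yaw_deg,
        "pitch_deg": pitch_deg,
        "dt_index": dt_index,            # indexes PRESET_DT_MS instead of an explicit value
        "confidence": confidence,        # 0..1, reliability of the forecast
        "wallclock_utc": time.time(),    # common time basis for synchronization
    })

print(forecast_message(yaw_deg=35.0, pitch_deg=-5.0, dt_index=1, confidence=0.8))
```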
Two possible system designs for such a system can be envisioned as follows. The first system design encompasses at the server side a per-user video stream generation, in which, for each user, an encoder resource allocation is directly influenced by the received prediction of the user's viewport. I.e., the encoder for that user creates content, i.e., an encoded representation of the video 6, individualized for this user in that the encoding resources, such as the bitrate for instance, are focused onto the predicted viewport location of this user. The encoded representation is then sent to the user. The second design uses a set of independent encoders that create various variants of the content 6 or portions thereof. "Independent" indicates that these encoded representations may be generated irrespective of the predicted viewport location. A packager at the server 14, such as for instance RTP, whose operation is driven by the received prediction, assembles suitable content from the multiple encoders for a given user. That is, according to the second design, the packetizing is done user-individually. The packetizing is done in a manner, for instance, so that the packetized data stream, as composed out of the available encoded variants, has increased quality at the predicted location. In both designs, the device or client 12 only retrieves the thus generated video stream and presents to the user that viewport portion where he actually looks at. If the prediction is good, all works fine. In case of the client being offered several representations of the video content differing in areas of increased quality in the panoramic scene, the first design option may be varied in the following manner: The encoded representation is offered to the client in tiles. Tiles at and near the predicted viewport location have been created with increased encoding resources. The device or client 12 retrieves those tiles which are necessary. The tiles do not necessarily cover the whole scene. Even here, the user most likely experiences a better quality in his/her viewport.
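The tiled variant of the first design could, for example, be realized by a selection rule like the following, in which tiles overlapping the predicted viewport are requested at high quality and the remaining tiles at low quality; the tile grid, the quality labels and the simplified handling of the horizontal wrap-around are assumptions for illustration.

```python
# Illustrative tile quality selection based on the predicted viewport bounds
# (grid size, quality labels and seam handling are exemplary simplifications).
def select_tiles(pred_x0, pred_x1, tiles_x=8, tiles_y=4, pano_w=7680):
    """Return {(col, row): quality} given horizontal viewport bounds in panorama pixels.

    Simplification: assumes the viewport does not wrap around the 360 degree seam.
    """
    tile_w = pano_w / tiles_x
    sample_xs = (pred_x0, (pred_x0 + pred_x1) / 2, pred_x1)
    cols_hq = {int(x // tile_w) % tiles_x for x in sample_xs}
    return {(c, r): ("high" if c in cols_hq else "low")
            for c in range(tiles_x) for r in range(tiles_y)}

plan = select_tiles(pred_x0=3000, pred_x1=4200)
print(sum(q == "high" for q in plan.values()), "high-quality tiles")
```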
As explained above, in one embodiment, the ahead time Δt and the orientation or position information are accompanied by a confidence value that expresses the degree of reliability of the prediction for the ahead time Δt. Given a high confidence value (in the range of 0 to 100%), the server 14 can be very sure that the transmitted orientation or position will be reached within the indicated ahead time Δt, whereas a low confidence value would indicate that the prediction is of limited reliability. The confidence value would hence influence server-side processing in the sense that certain server-side measures can be enacted, e.g., covering a wider (bigger) field of view (FoV) than the actual FoV of the client-side equipment so as to overprovision the FoV of the end user and mitigate the effect of a wrongful ahead time prediction. In another embodiment, the ahead time Δt and the orientation or position information are accompanied by FoV information reflecting the confidence of the prediction.
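A minimal sketch of such a confidence-driven overprovisioning rule follows; the linear scaling and the maximum widening of 40° are assumptions for illustration, not values taken from the embodiments above.

# Hypothetical sketch: the server widens the field of view it prepares when the reported
# confidence is low, so a wrong prediction is less likely to leave the viewport uncovered.
def overprovisioned_fov(client_fov_deg: float, confidence: float, max_extra_deg: float = 40.0) -> float:
    # confidence is expected in [0, 1]; full confidence keeps the client FoV,
    # zero confidence adds max_extra_deg on top of it.
    confidence = min(max(confidence, 0.0), 1.0)
    return client_fov_deg + (1.0 - confidence) * max_extra_deg

# Example: a 90° client FoV reported with 25% prediction confidence is served as a 120° region.
assert overprovisioned_fov(90.0, 0.25) == 120.0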
Although some aspects have been described in the context of an apparatus or a system, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus and/or system. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive data stream can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory. A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are preferably performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The apparatus described herein, or any components of the apparatus described herein, may be implemented at least partially in hardware and/or in software.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer. The methods described herein, or any components of the apparatus described herein, may be performed at least partially by hardware and/or by software.
The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.

Claims

1. Apparatus for predicting a viewport portion of a video a user is going to look at, configured to
perform a prediction of a location of the viewport portion within the video at a predetermined look ahead time using a predictor, and
receive feedback information on an actual location of the viewport portion the user looked at, at the predetermined look ahead time.
2. Apparatus according to claim 1 , configured to
update a parameterization of the predictor using the actual location.
3. Apparatus according to claim 1 or 2, wherein the predictor comprises a plurality of sub-predictors each configured to generate a respective prediction candidate,
wherein the apparatus is configured to
derive the prediction from the generated prediction candidates.
4. Apparatus according to any one of claims 1 to 3, wherein the predictor is configured to form one or more prediction candidates based on one or more signals including a multichannel EMG signal from a user sensor, a multichannel EEG signal, a multichannel EOG signal, an eye track signal, a multichannel skeletal tracker signal, a signal based on a visual media content which is looked at by the user, and statistical user behavior models.
5. Apparatus according to any one of claims 1 to 4, configured to
subject a multichannel signal to a first classification revealing whether a head orientation of the user is going to change, and
subject a multichannel signal to a second classification revealing into which direction a head orientation of the user is going to move.
6. Apparatus according to any one of claims 1 to 4, configured to
map a multichannel signal to a multi component prediction signal, and
subject the multi component prediction signal to a regression to obtain a head pose forecast.
7. Apparatus according to any one of claims 1 to 6, configured to
store variables involved in the prediction of the location of the viewport portion, and use the stored variables in updating the parameterization of the predictor using the actual location.
8. Apparatus according to any one of claims 1 to 7, configured to
generate a confidence value indicating a confidence of the prediction.
9. Apparatus according to any one of claims 1 to 8, wherein the look ahead time is between 0.05 and 1 second.
10. User equipment for displaying an immersive video presentation, configured to retrieve a video from a server,
predict a location of a viewport portion within the video, which a user is going to look at, at a predetermined look ahead time, and
report the location along with temporal information as to when the user is going to look at the viewport portion to the server.
11. User equipment according to claim 10, wherein the look ahead time relates to an absolute presentation time indication or a presentation time difference or a combination of absolute presentation time indication or a presentation time difference with a universal time used for server-client synchronization.
12. Method for predicting a viewport portion of a video a user is going to look at, comprising performing a prediction of a location of the viewport portion within the video at a predetermined look ahead time using a predictor, and
receiving feedback information on an actual location of the viewport portion the user looked at, at the predetermined look ahead time.
13. Method according to claim 12, comprising
updating a parameterization of the predictor using the actual location.
14. Method according to claim 12 or 13, wherein the predictor comprises a plurality of sub-predictors each configured to generate a respective prediction candidate,
wherein the method comprises
deriving the prediction from the generated prediction candidates.
15. Method according to any one of claims 12 to 14, comprising
forming one or more prediction candidates based on one or more signals including a multichannel EMG signal, an eye tracker signal, a multichannel EEG signal, a multichannel EOG signal, a multichannel skeletal tracker signal, and a visual media content which is looked at by the user.
16. Method according to any one of claims 12 to 15, comprising
subjecting a multichannel sensor signal to a first classification revealing whether the head orientation of the user is going to change, and
subjecting a multichannel sensor signal to a second classification revealing into which direction the head orientation of the user is going to move.
17. Method according to any one of claims 12 to 16, comprising
storing variables involved in the prediction of the location of the viewport portion, and using the stored variables in updating the parameterization of the predictor using the actual location.
18. Method according to any one of claims 12 to 17, comprising
generating a confidence value indicating a confidence of the prediction.
19. Method for displaying an immersive video presentation, comprising
retrieving a video from a server,
predicting a location of a viewport portion within the video, which a user is going to look at, at a predetermined look ahead time, and
reporting the location along with temporal information as to when the user is going to look at the viewport portion to the server.
20. Computer program having a program code for performing, when running on a computer, a method according to any one of claims 12 to 19.
21. Data stream generated by a method according to any one of claims 12 to 19.
22. Apparatus for generating a video stream representing a video, configured to obtain information on a predicted location of a viewport portion of the video a user is predicted to look at; and focus encoding resources for encoding the video into the video stream and/or a packetization of an encoded representation of the video onto the predicted location of the viewport portion.
23. Apparatus according to claim 22, wherein encoding resources are available bit rate and/or computational power and/or spatial fidelity per region.
24. Apparatus according to claim 22 or 23, configured to perform the encoding in real time.
25. Apparatus according to any one of claims 22 to 24, configured to obtain the information from a user equipment.
26. Apparatus according to any one of claims 22 to 25, configured to obtain the information from a user equipment in a manner so that the information indicates the predicted viewport portion for a time instant corresponding to, or temporally following, a currently encoded frame.
27. Apparatus according to any one of claims 22 to 26, configured to receive a prediction of an eye orientation of the user and/or a head position of the user, and
generate the information based on the prediction.
28. Apparatus according to claim 27, configured to
receive along with the prediction of the eye orientation of the user and/or the head position of the user a confidence value indicating the confidence of the prediction.
29. Apparatus according to claim 28, configured to
adapt a size of the predicted viewport portion depending on the confidence.
30. System for presenting a video to a user comprising
a detector configured to predict a viewport portion of the video which the user is going to look at;
an apparatus according to any one of claims 22 to 29; and
an interface configured to inform the apparatus of the predicted location of the viewport portion; wherein the apparatus is configured to focus encoding resources for encoding the video into the data stream onto the predicted location of the viewport portion.
31. System according to claim 30, wherein the detector comprises the apparatus according to any one of claims 1 to 11.
32. Method for encoding a video into a data stream, comprising
obtaining information on a predicted location of a viewport portion of the video a user is predicted to look at; and
focusing encoding resources for encoding the video into the data stream and/or a packetization of an encoded representation of the video onto the predicted location of the viewport portion.
33. Method for presenting a video to a user, comprising
predicting a location of a viewport portion of the video which the user is going to look at; obtaining information on a predicted location of the viewport portion of the video a user is predicted to look at;
focusing encoding resources for encoding the video into the data stream and/or a packetization of an encoded representation of the video onto the predicted location of the viewport portion; and
informing the video encoder of the predicted location of the viewport portion.
34. Computer program having a program code for performing, when running on a computer, a method according to claim 32 or 33.
35. Data stream generated by a method according to claim 32 or 33.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP18198054.1 2018-10-01
EP18198054 2018-10-01

Publications (1)

Publication Number Publication Date
WO2020069976A1 (en) 2020-04-09

Family

ID=63914799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/076069 WO2020069976A1 (en) 2018-10-01 2019-09-26 Concepts for improved head motion prediction and efficient encoding of immersive video

Country Status (1)

Country Link
WO (1) WO2020069976A1 (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140300532A1 (en) * 2013-04-08 2014-10-09 Nokia Corporation Apparatus, method and computer program for controlling a near-eye display
WO2017074745A1 (en) * 2015-10-26 2017-05-04 Microsoft Technology Licensing, Llc Remote rendering for virtual images
WO2018083211A1 (en) * 2016-11-04 2018-05-11 Koninklijke Kpn N.V. Streaming virtual reality video

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
AGUILAR M ET AL: "Using EMG to Anticipate Head Motion for Virtual-Environment Applications", IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, IEEE SERVICE CENTER, PISCATAWAY, NJ, USA, vol. 52, no. 6, 1 June 2005 (2005-06-01), pages 1078 - 1093, XP011132093, ISSN: 0018-9294, DOI: 10.1109/TBME.2005.848378 *
DAHNE; MEINECKE; HAUFE; HOHNE; TANGERMANN; MULLER; NIKULIN: "SPoC: a novel framework for relating the amplitude of neuronal oscillations to behaviorally relevant parameters", NEUROIMAGE, vol. 86, 2014, pages 111 - 122, XP028670649, doi:10.1016/j.neuroimage.2013.07.079
FENG QIAN ET AL: "Optimizing 360 video delivery over cellular networks", ALL THINGS CELLULAR, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 3 October 2016 (2016-10-03), pages 1 - 6, XP058280180, ISBN: 978-1-4503-4249-0, DOI: 10.1145/2980055.2980056 *
FUKUNAGA: "Introduction to statistical pattern recognition", 1972, ACADEMIC PRESS
HAHNE, J., MACHINE LEARNING BASED MYOELECTRIC CONTROL (PHD THESIS), 2016
RAMOSER, H.; MULLER-GERKING, J.; PFURTSCHELLER, G.: "Optimal spatial filtering of single trial EEG during imagined hand movement", IEEE TRANSACTIONS ON REHABILITATION ENGINEERING, vol. 8, no. 4, 2000, pages 441 - 446, XP011053745
SITZMANN VINCENT ET AL: "Saliency in VR: How Do People Explore Virtual Environments?", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 24, no. 4, 1 April 2018 (2018-04-01), pages 1633 - 1642, XP011679286, ISSN: 1077-2626, [retrieved on 20180316], DOI: 10.1109/TVCG.2018.2793599 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113329266A (en) * 2021-06-08 2021-08-31 合肥工业大学 Panoramic video self-adaptive transmission method based on limited user visual angle feedback
CN113365156A (en) * 2021-06-17 2021-09-07 合肥工业大学 Panoramic video multicast stream view angle prediction method based on limited view field feedback
CN114237393A (en) * 2021-12-13 2022-03-25 北京航空航天大学 VR (virtual reality) picture refreshing method and system based on head movement intention
WO2023206332A1 (en) * 2022-04-29 2023-11-02 Intel Corporation Enhanced latency-adaptive viewport prediction for viewport-dependent content streaming
CN115022546A (en) * 2022-05-31 2022-09-06 咪咕视讯科技有限公司 Panoramic video transmission method and device, terminal equipment and storage medium
CN115022546B (en) * 2022-05-31 2023-11-14 咪咕视讯科技有限公司 Panoramic video transmission method, device, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2020069976A1 (en) Concepts for improved head motion prediction and efficient encoding of immersive video
US10890968B2 (en) Electronic device with foveated display and gaze prediction
US11442539B2 (en) Event camera-based gaze tracking using neural networks
CN112400150B (en) Dynamic graphics rendering based on predicted glance landing sites
CN107111885B (en) Method, device and readable storage medium for determining the position of a portable device
US10088896B2 (en) Queasiness management for virtual reality systems
El Saddik et al. The potential of digital twins
JP2023171650A (en) Systems and methods for identifying persons and/or identifying and quantifying pain, fatigue, mood and intent with protection of privacy
JP2019522300A (en) Mobile and wearable video capture and feedback platform for the treatment of mental disorders
KR102055481B1 (en) Method and apparatus for quantitative evaluation assessment of vr content perceptual quality using deep running analysis of vr sickness factors
EP4042318A1 (en) System and method of generating a video dataset with varying fatigue levels by transfer learning
CN110856035B (en) Processing image data to perform object detection
JP5225870B2 (en) Emotion analyzer
US11717467B2 (en) Automated generation of control signals for sexual stimulation devices
KR20190069684A (en) Apparatus for sickness assessment of vr contents using deep learning based analysis of visual­vestibular mismatch and the method thereof
US11004257B1 (en) Method and apparatus for image conversion
Polakovič et al. Adaptive multimedia content delivery in 5G networks using DASH and saliency information
US20220409110A1 (en) Inferring cognitive load based on gait
US11779512B2 (en) Control of sexual stimulation devices using electroencephalography
Kim et al. Modern trends on quality of experience assessment and future work
CN114374832B (en) Control method and device for virtual reality experience, user equipment and network equipment
Palmero et al. Multi-rate sensor fusion for unconstrained near-eye gaze estimation
US11936839B1 (en) Systems and methods for predictive streaming of image data for spatial computing
Ambadkar et al. Deep reinforcement learning approach to predict head movement in 360 videos
US20240214537A1 (en) Natural and interactive 3d viewing on 2d displays

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19773123

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19773123

Country of ref document: EP

Kind code of ref document: A1