GB2568691A - A method, an apparatus and a computer program product for augmented/virtual reality - Google Patents

A method, an apparatus and a computer program product for augmented/virtual reality

Info

Publication number
GB2568691A
GB2568691A
Authority
GB
United Kingdom
Prior art keywords
tile
groups
tiles
processing
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1719426.7A
Other versions
GB201719426D0 (en)
Inventor
Pystynen Johannes
Roimela Kimmo
Cricri Francesco
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1719426.7A
Publication of GB201719426D0
Publication of GB2568691A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00Manipulating 3D models or images for computer graphics
    • G06T19/006Mixed reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/10Geometric effects
    • G06T15/30Clipping

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to a method, to an apparatus and to a computer program product for implementing the method. The method comprises receiving input data streams, e.g. high-resolution and low-resolution streams, separating the input data streams into individual tiles, and culling tiles of the individual tiles according to their location in relation to a predetermined area such as a view frustum. The tiles are then grouped into different tile groups according to their importance, possibly based on their position in the frustum. The tile groups are processed from the less important groups to the most important groups by obtaining quality parameters, possibly related to pose and/or gaze, for each tile in a tile group and then defining for each tile group a desired processing decision, such as the resolution input and/or image processing kernel to use. The processing of the most important tile groups is continued until a predefined stopping criterion has been reached, and the tile groups are then rendered with the defined processing decision.

Description

A METHOD, AN APPARATUS AND A COMPUTER PROGRAM PRODUCT FOR AUGMENTED/VIRTUAL REALITY
Technical Field
The present solution generally relates to augmented reality/virtual reality (AR/VR).
Background
Since the beginning of photography and cinematography, the most common type of image and video content has been captured by cameras with a relatively narrow field of view, and displayed as a rectangular scene on flat displays. Such content is referred to as “flat content”, or “flat image”, or “flat video” in this application. The cameras are mainly directional, whereby they capture only a limited angular field of view (the field of view towards which they are directed).
More recently, new image and video capture devices have become available. These devices are able to capture visual and audio content all around them, i.e. they can capture the whole angular field of view, sometimes referred to as a 360 degrees field of view. More precisely, they can capture a spherical field of view (i.e., 360 degrees in all spatial directions). Furthermore, new types of output technologies have been invented and produced, such as head-mounted displays. These devices allow a person to see visual content all around him/her, giving a feeling of being “immersed” into the scene captured by the 360 degrees camera. The new capture and display paradigm, where the field of view is spherical, is commonly referred to as virtual reality (VR) and is believed to be the common way people will experience media content in the future. Augmented reality (AR) is a term describing a real-world environment being augmented with computer-generated data, for example graphics, images or video, when the environment is viewed through a display device.
Summary
Now there has been invented an improved method and technical equipment implementing the method, for optimizing and improving quality, and reducing the latency, in streaming-based Augmented/Virtual Reality (AR/VR) content processing and rendering. Various aspects of the invention include a method, an apparatus, and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method comprising receiving input data streams; separating the input data streams to individual tiles; culling tiles of the individual tiles according to their location in relation to a predetermined area; grouping the tiles into different tile groups according to their importance; processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision; continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and rendering the tile groups with the defined processing decision.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: receive input data streams; separate the input data streams to individual tiles; cull tiles of the individual tiles according to their location in relation to a predetermined area; group the tiles into different tile groups according to their importance; process the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision; continue the processing of the most important tile groups until a predefined stopping criterion has been reached; and render the tile groups with the defined processing decision.
According to a third aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to receive input data streams; separate the input data streams to individual tiles; cull tiles of the individual tiles according to their location in relation to a predetermined area; group the tiles into different tile groups according to their importance; process the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision; continue the processing of the most important tile groups until a predefined stopping criterion has been reached; and render the tile groups with the defined processing decision.
According to a fourth aspect, there is provided an apparatus comprising means for receiving input data streams; means for separating the input data streams to individual tiles; means for culling tiles of the individual tiles according to their location in relation to a predetermined area; means for grouping the tiles into different tile groups according to their importance; means for processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision, and continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and means for rendering the tile groups with the defined processing decision.
According to an embodiment, the predetermined area is based on a view frustum.
According to an embodiment, the predetermined area is based on an importance or saliency of objects in the area.
According to an embodiment, the quality parameters relate to a pose and/or gaze.
According to an embodiment, the input data streams comprise a low-resolution data stream and a high-resolution data stream.
According to an embodiment, the tile groups are based on a distance of a tile from the most important area in the view frustum.
According to an embodiment, the processing decision relates to a resolution input and/or a type of image processing kernel being used.
Description of the Drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows a system according to an embodiment;
Fig. 2 shows an apparatus according to an embodiment;
Fig. 3 shows a viewing device according to an embodiment;
Fig. 4 shows an overview of a solution according to an embodiment;
Fig. 5 shows an example of a neural network; and
Fig. 6 is a flowchart illustrating a method according to an embodiment.
Description of Example Embodiments
In the following, several embodiments of the invention will be described in the context of augmented reality/virtual reality (AR/VR). In particular, the embodiments enable optimizing and improving quality, and reducing the latency, in streaming AR/VR content to a display device.
An example of a system and apparatus for virtual reality is shown in Figure 1. The task of the system is to capture sufficient visual and auditory information from a specific location so that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations, and optionally at a later time. Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels).
The system of Fig. 1 comprises three main parts: image sources, a server and a rendering device. A video source SRC1 comprises multiple cameras CAM1, CAM2, ..., CAMN with overlapping fields of view so that regions of the view around the video capture device are captured by at least two cameras. The multiple cameras CAM1, CAM2, ..., CAMN may be physically located in the same device, or may be in separate devices. The video source SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions. The video source SRC1 may comprise a high-resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras CAM1, CAM2, ..., CAMN can be detected and recorded. The video source SRC1 comprises or is functionally connected to each of the plurality of cameras CAM1, CAM2, ..., CAMN. The video source SRC1 comprises or is functionally connected to a computer processor and memory, the memory comprising computer program code for controlling the source and/or the plurality of cameras. It needs to be understood that although a video source comprising three cameras is described here as part of the system, another number of camera devices may be used instead. The image stream captured by the video source SRC1, i.e. the plurality of cameras, may be stored on a memory device MEM5 for use in another device, e.g. a viewing device, and/or transmitted to a server or the viewing device using a communication interface COMM1.
Alternatively or in addition to the source device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams they transmit. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic source device SRC2.
There may be a storage, processing and data stream serving network in addition to the capture device SRC1. For example, there may be a server SERVER or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewing devices VIEWER1 and VIEWER2 over the communication interface COMM3.
For viewing the captured or created video content, there may be one or more playback devices (also referred to as “viewing devices”) VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The viewing devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices. Alternatively, each viewing device (VIEWER1, VIEWER2) may include a processor, memory, and/or communications interface similar to the processor PROC4, memory MEM4, and communications interface COMM4. The viewing (playback) devices may comprise a data stream receiver for receiving one or more video data streams from a server and for decoding the video data streams. The data streams may be received over a network connection through the communications interface COMM4, or from a memory device MEM6 such as a memory card CARD2. The viewing devices may have a graphics processing unit for processing the data into a suitable format for viewing. The viewing device VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation sensor DET1 and stereo audio headphones.
According to an embodiment, the viewing device VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it. Alternatively, the viewing device VIEWER2 may comprise a 2D display, since the volumetric video rendering can be done in 2D by rendering the viewpoint from a single eye instead of a stereo eye pair.
Any of the devices (SRC1, SRC2, SERVER, RENDERER, VIEWER1, VIEWER2) may be a computer or a portable computing device, or be connected to such. Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.
Figure 2 shows an example of an apparatus, i.e. a computing system representing an aforementioned computer or portable computing device. The generalized structure of the computer system will be explained in accordance with the functional blocks of the system. Several functionalities can be carried out with a single physical device, e.g. all calculation procedures can be performed in a single processor if desired. A data processing system of an apparatus according to an example of Fig. 2 comprises a main processing unit 100, a memory 102, a storage device 104, an input device 106, an output device 108, and a graphics subsystem 110, which are all connected to each other via a data bus 112.
The main processing unit 100 is a processing unit comprising processor circuitry and arranged to process data within the data processing system. The memory 102, the storage device 104, the input device 106, and the output device 108 may include conventional components as recognized by those skilled in the art. The memory 102 and storage device 104 store data within the data processing system 100. Computer program code resides in the memory 102 for implementing, for example, a computer vision process. The input device 106 inputs data into the system while the output device 108 receives data from the data processing system and forwards the data, for example to a display, a data transmitter, or other output device. The data bus 112 is a conventional data bus and, while shown as a single line, it may be any combination of the following: a processor bus, a PCI bus, a graphics bus, an ISA bus. Accordingly, a skilled person readily recognizes that the apparatus may be any data processing device, such as a computer device, a personal computer, a server computer, a mobile phone, a smart phone or an Internet access device, for example an Internet tablet computer.
It needs to be understood that different embodiments allow different parts to be carried out in different elements. For example, various processes of the system may be carried out in one or more processing devices; for example, entirely in one computer device, or in one server device or across multiple user devices.
The viewing device can be a head-mounted display (HMD). However, it is appreciated that other suitable means for experiencing immersive content can be utilized instead of an HMD. Figure 3 shows an example of a head-mounted display (HMD) for viewing. The head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images to cover as much as possible of the eyes’ field of view. The device is attached to the head of the user so that it stays in place even when the user turns his or her head. The device may have an orientation detection module ORDET1 for determining the head movements, the direction of the head, and/or the gaze direction, and providing such information to other system(s). The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
Low motion-to-photon latency is important in AR/VR applications. Every component in a system that streams content to a head-mounted display (HMD) or other AR/VR viewing devices adds latency to the whole system, which reduces the user's feeling of immersion in the content. High latency can cause nausea and/or can drastically cripple the whole presentation pipeline with reduced performance. Reduced performance can be a by-product of a pipeline that compensates for latency by doing more work (on a per-pixel and per-tile basis, e.g. overly conservative culling).
Today, more immersive AR/VR experiences are sought by using high-resolution displays with a wide field of view. As a result, more data need to be processed and more pixels need to be output, thus increasing the frame processing times on the Graphics Processing Unit (GPU).
Some of the VR display vendor SDKs (Software Development Kits) comprise a built-in asynchronous time warp and/or asynchronous space warp. Asynchronous time warp aims to fix a low frame rate and/or high latency after the frame has been rendered by moving the image to match the final head orientation of the HMD. The asynchronous time warp is not able to take the final position into account because it is an orientation-only time warp. Asynchronous space warp is an extension to time warp that takes the final position into account as well. Both of these methods can create artefacts by trying to fix the output without knowing all the inputs and the content.
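For illustration only, the sketch below shows the basic idea behind an orientation-only time warp: under a pure rotation, the rendered image can be re-projected with the homography H = K * R_delta * K^(-1), built from the camera intrinsics K and the late head rotation R_delta. The NumPy code and all names are assumptions of this sketch, not part of the vendor SDKs or of the present method.

```python
# Minimal sketch (not the patent's method): an orientation-only "time warp"
# reprojects the rendered image with H = K * R_delta * K^-1, where R_delta is
# the head rotation that happened after rendering. All names are illustrative.
import numpy as np

def rotation_y(angle_rad: float) -> np.ndarray:
    """Rotation about the vertical axis (yaw), as a 3x3 matrix."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def timewarp_homography(K: np.ndarray, r_delta: np.ndarray) -> np.ndarray:
    """Homography mapping rendered-image pixels to display pixels for a pure rotation."""
    return K @ r_delta @ np.linalg.inv(K)

def warp_nearest(img: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Nearest-neighbour inverse warp; enough to illustrate the idea."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])   # homogeneous pixel coordinates
    src = H_inv @ pts
    src = (src[:2] / src[2]).round().astype(int)               # dehomogenise and round
    valid = (0 <= src[0]) & (src[0] < w) & (0 <= src[1]) & (src[1] < h)
    out[ys.ravel()[valid], xs.ravel()[valid]] = img[src[1][valid], src[0][valid]]
    return out

if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    frame = np.random.rand(480, 640)                           # stand-in for a rendered frame
    H = timewarp_homography(K, rotation_y(np.deg2rad(1.5)))    # 1.5 degrees of late yaw
    corrected = warp_nearest(frame, H)
```

Because only the rotation is corrected, any translation of the head between render time and display time is not compensated, which is exactly the limitation noted above.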
The present embodiments are targeted at automatically optimizing the color and depth input data streams for higher quality output and a better performance-per-watt ratio, with only minimal added latency. In addition, the scalable design can be used to automatically increase the rendering quality, at the target frame rate, in the most important areas (e.g., the center of the view, or the region gazed at by the viewer).
The present latency hiding techniques reduce the need for certain post processing techniques like asynchronous time warp and asynchronous space warp, which would further add latency to the pipeline and create an erroneous final output.
An overview of a solution according to an embodiment is shown in Figure 4. The solution may be implemented in a playback device, i.e. a viewing device, an example of which has been shown in Figure 3. The system receives a high-resolution data stream 401 (e.g. a 2D (two-dimensional) color stream) and a low-resolution data stream 402 (e.g. a 2D depth stream) as input. The data streams may be so-called tiled data streams, but that is not a requirement. The input data streams, or the tiles therein, are then decoded. After decoding, the data is tiled for processing and rendering purposes on the GPU. Tiling is a pre-processing step that happens in real time on the GPU. The tiling can take into consideration tiled input streams and their characteristics. However, the actual GPU processing tile size may differ from the input stream tile sizes.
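Purely for illustration, the sketch below shows the kind of tiling pre-processing described above: the decoded color and depth frames are covered with a regular grid of fixed-size processing tiles whose size is independent of the input stream's own tile size. The Tile record, the 16-pixel tile edge and the NumPy stand-in frames are assumptions of this sketch, not details taken from the description.

```python
# Minimal sketch, not the patent's implementation: split decoded color and depth
# frames into fixed-size processing tiles. The Tile record, the 16-pixel tile
# size and the NumPy arrays are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Tile:
    x: int          # top-left pixel column in the source frame
    y: int          # top-left pixel row in the source frame
    size: int       # tile edge length in pixels

def make_tiles(frame_w: int, frame_h: int, tile_size: int = 16) -> list[Tile]:
    """Cover the frame with a regular grid of tiles (edge tiles may be clipped later)."""
    return [Tile(x, y, tile_size)
            for y in range(0, frame_h, tile_size)
            for x in range(0, frame_w, tile_size)]

def tile_pixels(frame: np.ndarray, tile: Tile) -> np.ndarray:
    """View of the pixels covered by one tile."""
    return frame[tile.y:tile.y + tile.size, tile.x:tile.x + tile.size]

color = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in for a decoded color frame
depth = np.zeros((540, 960), dtype=np.uint16)       # stand-in for a decoded depth frame
tiles = make_tiles(1920, 1080)                       # GPU tile size need not match the stream's
```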
Then conservative tile culling is performed for the individual tiles, wherein the tile culling may be based on an area such that the tiles located outside the determined area can be ignored. The area can be a view frustum, i.e. a volume containing the objects that are seen by the viewing person, in which case tile culling means that tiles located outside the view frustum can be ignored. Instead of a view frustum, the area can be defined based on some other criterion. For example, the most important view areas for the tile culling may be determined based on additional information such as the importance or saliency of individual objects in the area. Such information may result from an automatic analysis or from an explicit assignment of priorities by the content author. Tile culling is conservative in the sense that a predefined or predicted safe region around the area is maintained if a manual time warp is applied later in the rendering pipeline. The safe region can be defined as a predefined distance to the borders of the view frustum or other area, or a predefined distance to the center of the view. Alternatively, the current velocity of the HMD over previous frames can be taken into account to predict the extra tiles needed to compensate for the orientation change within less than one frame of latency, i.e. if low-priority tiles are time warped within a frame or so.
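The sketch below illustrates one possible form of such conservative culling, assuming 360-degree (equirectangular) input so that each tile can be reduced to the viewing direction of its centre: a tile survives if that direction lies within the view frustum's half field of view plus a safe angular margin. The angular test, the margin value and all names are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch under stated assumptions: each tile is reduced to the viewing
# direction of its centre on the sphere, and it survives the cull if that
# direction is inside the frustum cone widened by a conservative safe margin.
import numpy as np

def tile_direction(u: float, v: float) -> np.ndarray:
    """Unit direction for normalised equirectangular coordinates u, v in [0, 1]."""
    lon = (u - 0.5) * 2.0 * np.pi          # longitude
    lat = (0.5 - v) * np.pi                # latitude
    return np.array([np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)])

def cull_tiles(tile_centers_uv, view_dir, half_fov_rad, safe_margin_rad=np.deg2rad(5.0)):
    """Keep the indices of tiles whose centre direction is inside the widened frustum cone."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    kept = []
    for i, (u, v) in enumerate(tile_centers_uv):
        d = tile_direction(u, v)
        angle = np.arccos(np.clip(d @ view_dir, -1.0, 1.0))
        if angle <= half_fov_rad + safe_margin_rad:     # conservative: keep a border of extra tiles
            kept.append(i)
    return kept

# Example: 16x8 grid of tile centres, viewer looking down +Z with a 110-degree FOV.
centers = [((x + 0.5) / 16, (y + 0.5) / 8) for y in range(8) for x in range(16)]
visible = cull_tiles(centers, np.array([0.0, 0.0, 1.0]), half_fov_rad=np.deg2rad(55.0))
```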
The tiles that are within the view frustum or within another important area (i.e. the tiles that remain after the culling) may then be grouped 403 according to their distance from the current pose, i.e. their distance from the center of the view (fovea). The pose can be determined based on HMD/gaze tracking 411. The processing decisions for individual tiles and tile groups may be based on the location of the tile compared to the fovea.
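A minimal sketch of such distance-based grouping is given below, again representing the culled tiles by their centre directions and using three groups with two fixed angular thresholds; the thresholds and the group count are illustrative assumptions, not values taken from the description.

```python
# Minimal sketch, not the patent's implementation: sort the surviving tiles into
# a small number of groups by their angular distance from the gaze/fovea
# direction. The two thresholds and the group count are illustrative assumptions.
import numpy as np

def group_tiles(tile_dirs, gaze_dir, thresholds_rad=(np.deg2rad(10.0), np.deg2rad(30.0))):
    """Return lists of tile indices: groups[0] is the foveal (most important) group,
    the last group is the least important one."""
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    groups = [[] for _ in range(len(thresholds_rad) + 1)]
    for i, d in enumerate(tile_dirs):
        angle = np.arccos(np.clip(np.asarray(d) @ gaze_dir, -1.0, 1.0))
        g = sum(angle > t for t in thresholds_rad)      # 0 = inside fovea, higher = further out
        groups[g].append(i)
    return groups
```

The groups would then be submitted in reverse order, least important first, so that the foveal group is processed with the freshest pose and gaze, as described next.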
There can be two or more tile groups. The tile groups that have the lowest priority are processed first, and the processing is then continued towards the tile groups having the highest priority, one tile group at a time. According to an embodiment, the priority may be defined based on the distance of the tile group from the fovea. Lists of tiles are then sent to rendering 408 in the order of the next (non-rendered) lowest-priority group. Tile “qualities” may be decided based on the latest pose/gaze that is visible to the shader program through a coherently mapped GPU buffer.
The pose of the HMD is updated from the HMD/gaze tracking 411 and fed to coherent GPU buffers iteratively in real time. The system takes the latest pose from the GPU buffer when it starts to process a tile group. In addition, for quality selection, individual tiles can use the latest pose, and it does not need to be the same for all the tiles in the group. Due to this, decisions for tiles, e.g. what resolution inputs and/or what type of image processing kernels are used, can be made at tile processing time, with the most important tiles receiving the best quality. The process ends when there are no more tile groups left or a predefined synchronization point is reached. For the highest quality, the method keeps incrementally updating the rendering of the most important areas until a safe synchronization point, just before the buffer swap for display 450, is reached. With the predefined synchronization point the quality of the most important tile groups and their tiles can be iteratively improved.
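The following sketch illustrates this per-tile decision loop under stated assumptions: a stand-in PoseBuffer plays the role of the coherently mapped GPU buffer, the quality choice is reduced to a high/low resolution input and kernel, and the predefined synchronization point is modelled as a wall-clock deadline. None of these names come from the patent.

```python
# Minimal sketch under stated assumptions: process tile groups from least to most
# important, re-reading the latest pose/gaze before each tile so that the most
# important tiles use the freshest data. PoseBuffer, the quality levels and the
# stop test are illustrative assumptions, not the patent's API.
import time
from dataclasses import dataclass

@dataclass
class PoseSample:
    gaze_angle_rad: float      # angular distance of the tile from the current gaze
    timestamp: float

class PoseBuffer:
    """Stand-in for a coherently mapped GPU buffer that always holds the newest pose."""
    def latest(self, tile_index: int) -> PoseSample:
        return PoseSample(gaze_angle_rad=0.1, timestamp=time.monotonic())

def decide_quality(pose: PoseSample) -> dict:
    """Pick the resolution input and processing kernel from the per-tile pose/gaze."""
    if pose.gaze_angle_rad < 0.17:                       # roughly inside the fovea
        return {"input": "high_res", "kernel": "high_quality"}
    return {"input": "low_res", "kernel": "fast"}

def process_groups(groups_low_to_high, pose_buffer: PoseBuffer, deadline: float):
    decisions = {}
    for group in groups_low_to_high:                     # least important group first
        for tile in group:
            if time.monotonic() >= deadline:             # predefined synchronization point
                return decisions
            decisions[tile] = decide_quality(pose_buffer.latest(tile))
    return decisions
```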
It is realized that the tile groups are processed from the less important groups to the most important group (e.g. from outside the fovea to inside the fovea within the view frustum). Every time the processing of a tile (of e.g. 8x8 or 16x16 pixel size) is started, pose/gaze-based quality parameters 412 may be loaded for the tile, and a desired processing decision is defined for it. The tiles are then rendered according to the defined processing decision.
The method according to an embodiment comprises
- processing decoded high-resolution and low-resolution data streams in real time, wherein the data streams may comprise color and depth streams respectively, the processing comprising
o separating the data in both data streams into individual tiles;
o culling the tiles conservatively, within a narrow margin around an area derived from the latest updated pose, and adding the tiles to a coherently mapped GPU buffer without adding pipeline latency, wherein the area is a view frustum or some other area determined according to a certain importance criterion;
o after the tile cull, putting the tiles into lists of different tile groups;
o processing the different tile groups from the less important groups to the most important groups (e.g. from outside the fovea to inside the fovea); after the first tile group has started processing, the pose-based tile projection may not be changed unless extra conservatism is added between tile groups (i.e. neighbor tiles are conservatively mapped to both groups);
o obtaining the latest pose, if it has been prepared for asynchronous time warp, and the latest gaze in all situations from the coherently mapped GPU buffer, and updating the fovea-based tile “quality” parameters per individual tile. For example, every time the processing of a tile of e.g. 8x8 or 16x16 pixel size is started, pose/gaze-based quality parameters are loaded first, and then only the desired input data per pixel, with the desired qualities, are loaded. With this approach the conservatively selected important areas have the best quality with close to zero extra latency;
o incrementally updating the rendering quality of the most important areas until a predefined stop criterion has been reached; and
o rendering the tile groups (a combined sketch of these steps is given after this list).
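As noted above, the self-contained sketch below ties the listed steps together in a deliberately simplified form: tiles are plain indices, importance is the distance from a "fovea" index, and the stop criterion is a wall-clock deadline. All names and numbers are illustrative assumptions, not the patent's implementation.

```python
# Minimal end-to-end sketch of the listed steps, under illustrative assumptions:
# cull, group by distance from the fovea, process least-important groups first
# while re-reading the "latest pose" per tile, and refine until the deadline.
import time

def run_frame(num_tiles: int, fovea_tile: int, visible: set[int], frame_budget_s: float = 0.008):
    deadline = time.monotonic() + frame_budget_s

    # 1) cull: keep only tiles inside the (conservatively widened) visible set
    tiles = [t for t in range(num_tiles) if t in visible]

    # 2) group: three groups by distance from the fovea tile (0 = most important)
    groups = [[], [], []]
    for t in tiles:
        d = abs(t - fovea_tile)
        groups[0 if d <= 2 else 1 if d <= 8 else 2].append(t)

    # 3) process groups from least to most important, re-reading the "latest pose"
    #    (here just the time) per tile, until the stop criterion is reached
    decisions = {}
    for group in reversed(groups):
        for t in group:
            if time.monotonic() >= deadline:
                return decisions
            latest_pose = time.monotonic()               # stand-in for the coherent buffer read
            decisions[t] = "high" if abs(t - fovea_tile) <= 2 else "low"

    # 4) with time left, keep refining the most important group until the deadline
    #    (in a real renderer each pass would raise the quality level further)
    while time.monotonic() < deadline:
        for t in groups[0]:
            decisions[t] = "high"
    return decisions

frame = run_frame(num_tiles=64, fovea_tile=30, visible=set(range(10, 50)))
```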
As mentioned, the tile grouping may be based on the distance of the tile from the most important area based on the most recent pose/gaze (e.g. the center of the view), but other suitable importance criteria may be considered. An example of this is to base the tile grouping on the distance of the tile from a salient area within the current field of view, the salient area being provided for example by a neural network. The term “saliency” relates to high-level features of an area, i.e. a highly semantic saliency detection result, which may for example comprise the final output of a deep neural network trained for salient object detection, instead of low-level features such as color, edges, texture and corners.
An example of a Convolutional Neural Network (CNN), which is a feature extractor used in deep learning techniques, is shown in Fig. 5. A CNN may be composed of one or more convolutional layers with fully connected layers on top. The final layer of a CNN may be referred to as a classification layer or a regression layer. In a regression network, the output value of the fully connected layer may be directly used. In the case of a classification layer, this output may be converted to a class index by an additional operation, such as for example argmax.
In Fig. 5, the input to the CNN is an image or a region of an image, but any other media content object, such as a video or audio file, could be used as well. Each layer of a CNN represents a certain abstraction (or semantic) level, and the CNN extracts multiple feature maps. The CNN in Fig. 5 has only three feature (or abstraction, or semantic) layers C1, C2, C3 for the sake of simplicity, but top-performing CNNs may have over 1000 feature layers.
The first convolution layer C1 of the CNN consists of extracting 4 feature maps from the first layer (i.e. from the input image). These maps may represent low-level features found in the input image, such as edges and corners. The second convolution layer C2 of the CNN, consisting of extracting 6 feature maps from the previous layer, increases the semantic level of the extracted features. Similarly, the third convolution layer C3 may represent more abstract concepts found in images, such as combinations of edges and corners, shapes, etc. The last layer of the CNN (a fully connected MLP) does not extract feature maps. Instead, it may use the feature maps from the last feature layer in order to predict (recognize) the object class. For example, it may predict that the object in the image is a house.
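For readers unfamiliar with the mechanics, the sketch below implements a feature extractor of the kind described above in plain NumPy: three stacked convolution layers (C1, C2, C3) followed by a fully connected classification layer. The filter counts, kernel size and random weights are assumptions chosen only to make the example run; they are not parameters taken from the description.

```python
# Minimal sketch of the three-layer feature extractor described above, written in
# plain NumPy for illustration; the filter counts (4, 6, 8), kernel size and the
# random weights are assumptions, not parameters taken from the description.
import numpy as np

def conv2d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Valid 2-D convolution of a (H, W, C_in) input with (K, K, C_in, C_out) kernels."""
    k, _, c_in, c_out = kernels.shape
    h, w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = x[i:i + k, j:j + k, :]                      # (K, K, C_in) window
            out[i, j] = np.tensordot(patch, kernels, axes=3)    # one value per output map
    return np.maximum(out, 0.0)                                  # ReLU non-linearity

rng = np.random.default_rng(0)
image = rng.random((32, 32, 1))                                  # single-channel input image

c1 = conv2d(image, rng.standard_normal((3, 3, 1, 4)) * 0.1)     # C1: 4 low-level feature maps
c2 = conv2d(c1,    rng.standard_normal((3, 3, 4, 6)) * 0.1)     # C2: 6 mid-level feature maps
c3 = conv2d(c2,    rng.standard_normal((3, 3, 6, 8)) * 0.1)     # C3: more abstract feature maps

# "Fully connected" classification layer on top of the last feature maps.
logits = c3.reshape(-1) @ rng.standard_normal((c3.size, 10)) * 0.01
predicted_class = int(np.argmax(logits))
```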
The system can target and prioritize different criteria. E.g. for a mobile application, the target can be the performance-per-watt ratio, whereas for a desktop application the target may be the highest pixel quality at the target frame rate. In a mobile application, it may be optimal to render in sub-frame time in order to reduce the GPU clock rate and thereby the power consumption. In a desktop application, it may be ideal to target the best quality within a desired frame time at the maximum GPU clock rate. With the best-quality option, the quality can be incrementally updated for the most important areas until a safe synchronization point just before the buffer swap for display is reached. This incremental updating can happen after the tile group queue is empty and there is still frame time left before the synchronization point.
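The two targets can be contrasted with a small sketch, assuming a simple two-mode policy: a "mobile" profile finishes in sub-frame time so the GPU clock can be lowered, whereas a "desktop" profile keeps refining the most important group until just before the buffer swap. The profile names, the submit() callback and the budget value are illustrative assumptions.

```python
# Minimal sketch, assuming a simple two-mode policy: "mobile" stops after one pass
# so the GPU clock can drop, "desktop" keeps refining the most important tiles
# until just before the buffer swap. submit() and the budget are assumptions.
import time
from typing import Callable

def render_frame(groups_low_to_high: list[list[int]],
                 submit: Callable[[int, int], None],
                 profile: str = "desktop",
                 frame_time_s: float = 1.0 / 90.0) -> None:
    sync_point = time.monotonic() + frame_time_s
    quality: dict[int, int] = {}

    # First pass: one processing decision per tile, least important group first.
    for group in groups_low_to_high:
        for tile in group:
            quality[tile] = 1
            submit(tile, quality[tile])

    if profile == "mobile":
        return                                   # finish in sub-frame time, let the clock drop

    # Desktop profile: spend the remaining frame time raising the quality of the
    # most important group until the synchronization point is reached.
    most_important = groups_low_to_high[-1]
    while time.monotonic() < sync_point:
        for tile in most_important:
            quality[tile] += 1
            submit(tile, quality[tile])
```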
Figure 6 is a flowchart illustrating a method according to an embodiment. A method comprises receiving input data streams 610; separating the input data streams to individual tiles 620; culling tiles of the individual tiles according to their location in relation to a predetermined area 630; grouping the tiles into different tile groups according to their importance 640; processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision 650; continuing the processing of the most important tile groups until a predefined stopping criterion has been reached 660; and rendering the tile groups with the defined processing decision 670.
An apparatus according to an embodiment comprises means for receiving input data streams; means for separating the input data streams to individual tiles; means for culling tiles of the individual tiles according to their location in relation to a predetermined area; means for grouping the tiles into different tile groups according to their importance; means for processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision; means for continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and means for rendering the tile groups with the defined processing decision. The means comprise a processor, such as the main processing unit 100 of Figure 2, a memory, such as the memory 102 of Figure 2, and computer program code residing in the memory.
The memory, such as the memory 102 of Figure 2, includes computer program code comprising one or more operational characteristics. Said operational characteristics are defined through configuration by said computer based on the type of said processor, wherein a system is connectable to said processor by a bus, wherein a programmable operational characteristic of the system comprises receiving input data streams; separating the input data streams to individual tiles; culling tiles of the individual tiles according to their location in relation to a predetermined area; grouping the tiles into different tile groups according to their importance; processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision; continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and rendering the tile groups with the defined processing decision.
The various embodiments may provide advantages. For example, the tile-based rendering system according to the embodiments is easily scalable and adds only minimal latency due to the incremental use of the latest pose and/or gaze. Therefore, the system can stay immersive even with latency spikes and/or a latency increase in the input stream/decode phase. Furthermore, with an adaptive and scalable tile-based design, a performance-per-watt and/or quality-per-frame-time target can be pursued. Tiling is a pre-processing step done on the GPU, and it can take into consideration tiled input streams and their characteristics. However, the tile size used on the GPU may differ from the input stream tile sizes. The system according to the present solution is able to separate the tiles conservatively into tile groups based on their distance from the center of the view (fovea). There can be two or more tile groups. The tiles that are farthest away from the fovea are processed first, and the processing is then continued towards the center of the fovea one tile group at a time. The pose of the HMD is updated and fed to coherent GPU buffers continuously. The playback system takes the latest pose from the GPU buffer when it starts to process an individual tile. With this, decisions for tiles, e.g. what resolution inputs and/or what type of image processing kernels are used, can be made at tile processing time, with the most important tiles receiving the most up-to-date viewing information.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with one another. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims (20)

Claims:
1. A method comprising:
- receiving input data streams;
- separating the input data streams to individual tiles;
- culling tiles of the individual tiles according to their location in relation to a predetermined area;
- grouping the tiles into different tile groups according to their importance;
- processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision;
- continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and
- rendering the tile groups with the defined processing decision.
2. The method according to claim 1, wherein the predetermined area is based on a view frustum.
3. The method according to claim 1, wherein the predetermined area is based on an importance or saliency of objects in the area.
4. The method according to claim 1 or 2 or 3, wherein the quality parameters relate to a pose and/or gaze.
5. The method according to any of the claims 1 to 4, wherein the input data streams comprise a low-resolution data stream and a high-resolution data stream.
6. The method according to any of the claims 1 to 5, wherein the tile groups are based on a distance of a tile from the predetermined area.
7. The method according to any of the claims 1 to 6, wherein the processing decision relates to a resolution input and/or a type of image processing kernel being used.
8. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- receive input data streams;
- separate the input data streams to individual tiles;
- cull tiles of the individual tiles according to their location in relation to a predetermined area;
- group the tiles into different tile groups according to their importance;
- process the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision;
- continue the processing of the most important tile groups until a predefined stopping criterion has been reached; and
- render the tile groups with the defined processing decision.
9. The apparatus according to claim 8, wherein the predetermined area is based on a view frustum.
10. The apparatus according to claim 8, wherein the predetermined area is based on an importance or saliency of objects in the area.
11. The apparatus according to claim 8 or 9 or 10, wherein the quality parameters relate to a pose and/or gaze.
12. The apparatus according to any of the claims 8 to 11, wherein the input data streams comprise a low-resolution data stream and a high-resolution data stream.
13. The apparatus according to any of the claims 8 to 12, wherein the tile groups are based on a distance of a tile from the predetermined area.
14. The apparatus according to any of the claims 8 to 13, wherein the processing decision relates to a resolution input and/or a type of image processing kernel being used.
15. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
- receive input data streams;
- separate the input data streams to individual tiles;
- cull tiles of the individual tiles according to their location in relation to a predetermined area;
- group the tiles into different tile groups according to their importance;
- process the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision;
- continue the processing of the most important tile groups until a predefined stopping criterion has been reached; and
- render the tile groups with the defined processing decision.
16. The computer program product according to claim 15, wherein the quality parameters relate to a pose and/or gaze.
17. The computer program product according to claim 15 or 16, wherein the input data streams comprise a low-resolution data stream and a high-resolution data stream.
18. The computer program product according to claim 15 or 16 or 17, wherein the tile groups are based on a distance of a tile from the predetermined area.
19. The computer program product according to any of the claims 15 to 18, wherein the processing decision relates to a resolution input and/or a type of image processing kernel being used.
20. An apparatus comprising
- means for receiving input data streams;
- means for separating the input data streams to individual tiles;
- means for culling tiles of the individual tiles according to their location in relation to a predetermined area;
- means for grouping the tiles into different tile groups according to their importance;
- means for processing the tile groups from a less important group to the most important groups by obtaining quality parameters for each tile in a tile group and then defining for each tile group a desired processing decision, and continuing the processing of the most important tile groups until a predefined stopping criterion has been reached; and means for rendering the tile groups with the defined processing decision.
GB1719426.7A 2017-11-23 2017-11-23 A method, an apparatus and a computer program product for augmented/virtual reality Withdrawn GB2568691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1719426.7A GB2568691A (en) 2017-11-23 2017-11-23 A method, an apparatus and a computer program product for augmented/virtual reality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1719426.7A GB2568691A (en) 2017-11-23 2017-11-23 A method, an apparatus and a computer program product for augmented/virtual reality

Publications (2)

Publication Number Publication Date
GB201719426D0 GB201719426D0 (en) 2018-01-10
GB2568691A true GB2568691A (en) 2019-05-29

Family

ID=60950775

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1719426.7A Withdrawn GB2568691A (en) 2017-11-23 2017-11-23 A method, an apparatus and a computer program product for augmented/virtual reality

Country Status (1)

Country Link
GB (1) GB2568691A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021030062A1 (en) * 2019-08-13 2021-02-18 Facebook Technologies, Llc Systems and methods for foveated rendering

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170038590A1 (en) * 2015-08-03 2017-02-09 Oculus Vr, Llc Enhanced Pixel Resolution through Non-Uniform Ocular Projection

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170038590A1 (en) * 2015-08-03 2017-02-09 Oculus Vr, Llc Enhanced Pixel Resolution through Non-Uniform Ocular Projection

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021030062A1 (en) * 2019-08-13 2021-02-18 Facebook Technologies, Llc Systems and methods for foveated rendering
US11100899B2 (en) 2019-08-13 2021-08-24 Facebook Technologies, Llc Systems and methods for foveated rendering
CN114223196A (en) * 2019-08-13 2022-03-22 脸谱科技有限责任公司 Systems and methods for foveal rendering

Also Published As

Publication number Publication date
GB201719426D0 (en) 2018-01-10


Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)