GB2556017A - Image compression method and technical equipment for the same - Google Patents

Image compression method and technical equipment for the same

Info

Publication number
GB2556017A
Authority
GB
United Kingdom
Prior art keywords
content
determined
image
segment
viewing direction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1610763.3A
Other versions
GB201610763D0 (en)
Inventor
Aflaki Beni Payman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB1610763.3A
Publication of GB201610763D0
Priority to PCT/FI2017/050389 (published as WO2017220851A1)
Publication of GB2556017A

Classifications

    All under H — Electricity; H04 — Electric communication technique; H04N — Pictorial communication, e.g. television:
    • H04N 13/204 — Image signal generators using stereoscopic image cameras
    • H04N 13/243 — Image signal generators using stereoscopic image cameras using three or more 2D image sensors
    • H04N 13/344 — Displays for viewing with the aid of special glasses or head-mounted displays [HMD] with head-mounted left-right displays
    • H04N 19/117 — Adaptive coding: filters, e.g. for pre-processing or post-processing
    • H04N 19/126 — Adaptive coding, quantisation: details of normalisation or weighting functions, e.g. normalisation matrices or variable uniform quantisers
    • H04N 19/132 — Adaptive coding: sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • H04N 19/162 — Adaptive coding controlled by user input
    • H04N 19/167 — Adaptive coding controlled by position within a video image, e.g. region of interest [ROI]
    • H04N 19/179 — Adaptive coding where the coding unit is a scene or a shot
    • H04N 19/182 — Adaptive coding where the coding unit is a pixel
    • H04N 19/186 — Adaptive coding where the coding unit is a colour or a chrominance component
    • H04N 19/597 — Predictive coding specially adapted for multi-view video sequence encoding
    • H04N 23/698 — Control of cameras or camera modules for achieving an enlarged field of view, e.g. panoramic image capture
    • H04N 2213/001 — Stereoscopic systems: constructional or mechanical details

Abstract

A most probable viewing direction in a content, such as a 360 degree image, is determined. The content is segmented based on distance between pixels of the content and the determined most probable viewing direction (MPVD). The segments are encoded with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction. Heavier encoding can mean greater compression, which results in lower quality in the less likely viewing directions. This reduces the bandwidth requirement. A multi-camera device (figures 4a-b) may be used to capture the content, and a head movement tracker and eye gaze tracker may be used to ascertain the MPVD. The segments may be a particular kind of shape based on how the content has been created.

Description

(54) Title of the Invention: Image compression method and technical equipment for the same
Abstract Title: Encoding based on most probable viewing direction
Fig. 6
[Drawing sheets 1/11 to 11/11: Figs. 1, 2a, 2b, 3, 4a-4b (camera directions DIR_CAM1, DIR_CAM2, ..., DIR_CAMN), 5, 6, 7a-7b, 8, 9 (rectangular and circular segment shapes), 10a-10b and 11.]
IMAGE COMPRESSION METHOD AND TECHNICAL EQUIPMENT FOR THE SAME
Technical Field
The present solution generally relates to processing media content. In particular, the solution relates to a method and technical equipment for reducing the bitrate of media content based on characteristics of the media content.
Background
Content provided for a 360 degree video experience requires a considerable amount of bandwidth to transmit the data from one device to another in real time. In cases where stereoscopic presentation is needed, the bandwidth load is an even greater issue. Hence, the bandwidth required to encode the 360 degree images/video needs to be decreased as much as possible without sacrificing the quality of experience for the users.
Summary
Now there has been invented an improved method and technical equipment implementing the method, for reducing the bitrate required to encode e.g. the 360 degree images. Various aspects of the invention include a method, an apparatus and a computer readable medium comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, there is provided a method comprising determining a most probable viewing direction in a content; segmenting the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and encoding the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
According to an embodiment the content is in the form of a 360 degree image or a two-dimensional image.
According to an embodiment the method further comprises defining a segment with a certain type of a shape.
According to an embodiment a segment is partially located on one side of the image, and partially located on another side of the image.
According to an embodiment the content is captured with a multi-camera device.
According to an embodiment the content is received through a communications network.
According to an embodiment the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
According to an embodiment the shape for a segment is selected based on how the content has been created.
According to a second aspect, there is provided an apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following: determine a most probable viewing direction in a content; segment the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and encode the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
According to an embodiment the content is in the form of a 360 degree image or a two-dimensional image.
According to an embodiment the apparatus is further caused to define a segment with a certain type of a shape.
According to an embodiment a segment is partially located on one side of the image, and partially located on another side of the image.
According to an embodiment the content is captured with a multi-camera device.
According to an embodiment the content is received through a communications network.
According to an embodiment the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
According to an embodiment the shape for a segment is selected based on how the content has been created.
According to a third aspect, there is provided an apparatus comprising: means for determining a most probable viewing direction in a content; means for segmenting the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and means for encoding the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
According to an embodiment the content is in the form of a 360 degree image or a two-dimensional image.
According to an embodiment the apparatus further comprises means for defining a segment with a certain type of a shape.
According to an embodiment a segment is partially located on one side of the image, and partially located on another side of the image.
According to an embodiment the content is captured with a multi-camera device.
According to an embodiment the content is received through a communications network.
According to an embodiment the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
According to an embodiment the shape for a segment is selected based on how the content has been created.
According to a fourth aspect, there is provided a computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to: determine a most probable viewing direction in a content; segment the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and encode the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
Description of the Drawings
In the following, various embodiments of the invention will be described in more detail with reference to the appended drawings, in which
Fig. 1 shows a system and apparatuses for stereo viewing;
Fig. 2a shows a camera device for stereo viewing;
Fig. 2b shows a head-mounted display for stereo viewing;
Fig. 3 shows a camera according to an embodiment;
Fig. 4a, b show examples of a multicamera device;
Fig. 5 shows examples of asymmetric stereoscopic coding of related technology;
Fig. 6 is a flowchart illustrating a method according to an embodiment;
Fig. 7a, b show presentations of an object in an image;
Fig. 8 is a flowchart illustrating a method according to another embodiment;
Fig. 9 shows different shapes used for segmenting an image;
Fig. 10a, b show presentations of oval and rectangular shapes in an image; and
Fig. 11 shows an image being segmented to different zones based on horizontal difference.
Description of Example Embodiments
In the following, several embodiments of the invention will be described in the context of transmitting 360 degree images and/or video in real time. It is to be noted, however, that the invention is not limited to this context only. In fact, the different embodiments have applications in any image/video processing environment where bitrate reduction is required.
Also, the present embodiments are discussed in relation to content captured with a multicamera device. A multicamera device has a view direction and comprises a plurality of cameras, at least one central camera and at least two peripheral cameras. Each said camera has a respective field of view, and each said field of view covers the view direction of the multicamera device. The cameras are positioned with respect to each other such that the central cameras and peripheral cameras form at least two stereo camera pairs with a natural disparity and a stereo field of view, each said stereo field of view covering the view direction of the multicamera device. The multicamera device has a central field of view, the central field of view comprising a combined stereo field of view of the stereo camera pairs, and a peripheral field of view comprising fields of view of the cameras at least partly outside the central field of view.
The multicamera device may comprise cameras at locations corresponding to at least some of the eye positions of a human head at normal anatomical posture, eye positions of the human head at maximum flexion anatomical posture, eye positions of the human head at maximum extension anatomical posture, and/or eye positions of the human head at maximum left and right rotation anatomical postures. The multicamera device may comprise at least three cameras, the cameras being disposed such that their optical axes in the direction of the respective camera's field of view fall within a hemispheric field of view, the multicamera device comprising no cameras having their optical axes outside the hemispheric field of view, and the multicamera device having a total field of view covering a full sphere.
The multicamera device described here may have cameras with wide-angle lenses. The multicamera device may be suitable for creating stereo viewing image data, comprising a plurality of video sequences for the plurality of cameras. The multicamera device may be such that any pair of the at least three cameras has a parallax corresponding to the parallax (disparity) of human eyes for creating a stereo image. At least three cameras may have overlapping fields of view such that an overlap region for which every part is captured by said at least three cameras is defined, and such an overlap region can be used in forming the image for stereo viewing.
Some embodiments of the current invention are also applicable to panoramic images captured with only one camera. Such a camera may include a fisheye lens or any other type of wide-angle lens.
Fig. 1 shows a system and apparatuses for stereo viewing, that is, for 3D video and 3D audio digital capture and playback. The task of the system is that of capturing sufficient visual and auditory information from a specific location such that a convincing reproduction of the experience, or presence, of being in that location can be achieved by one or more viewers physically located in different locations and optionally at a time later in the future. Such reproduction requires more information than can be captured by a single camera or microphone, in order that a viewer can determine the distance and location of objects within the scene using their eyes and their ears. To create a pair of images with disparity, two camera sources are used. In a similar manner, for the human auditory system to be able to sense the direction of sound, at least two microphones are used (the commonly known stereo sound is created by recording two audio channels). The human auditory system can detect cues, e.g. the timing difference of the audio signals, to detect the direction of sound.
The system of Fig. 1 may consist of three main parts: image sources, a server and a rendering device. A video capture device SRC1 comprises multiple (for example, 8) cameras CAM1, CAM2, ..., CAMN with overlapping fields of view so that regions of the view around the video capture device are captured from at least two cameras. The device SRC1 may comprise multiple microphones to capture the timing and phase differences of audio originating from different directions. The device SRC1 may comprise a high resolution orientation sensor so that the orientation (direction of view) of the plurality of cameras can be detected and recorded. The device SRC1 comprises or is functionally connected to a computer processor PROC1 and memory MEM1, the memory comprising computer program PROGR1 code for controlling the video capture device. The image stream captured by the video capture device may be stored on a memory device MEM2 for use in another device, e.g. a viewer, and/or transmitted to a server using a communication interface COMM1. It needs to be understood that although an 8-camera-cubical setup is described here as part of the system, another multicamera device may be used instead as part of the system.
Alternatively or in addition to the video capture device SRC1 creating an image stream, or a plurality of such, one or more sources SRC2 of synthetic images may be present in the system. Such sources of synthetic images may use a computer model of a virtual world to compute the various image streams they transmit. For example, the source SRC2 may compute N video streams corresponding to N virtual cameras located at a virtual viewing position. When such a synthetic set of video streams is used for viewing, the viewer may see a three-dimensional virtual world. The device SRC2 comprises or is functionally connected to a computer processor PROC2 and memory MEM2, the memory comprising computer program PROGR2 code for controlling the synthetic sources device SRC2. The image stream captured by the device may be stored on a memory device MEM5 (e.g. memory card CARD1) for use in another device, e.g. a viewer, or transmitted to a server or the viewer using a communication interface COMM2.
There may be a storage, processing and data stream serving network in addition to the capture device SRC1. For example, there may be a server SERVER or a plurality of servers storing the output from the capture device SRC1 or computation device SRC2. The device SERVER comprises or is functionally connected to a computer processor PROC3 and memory MEM3, the memory comprising computer program PROGR3 code for controlling the server. The device SERVER may be connected by a wired or wireless network connection, or both, to sources SRC1 and/or SRC2, as well as the viewer devices VIEWER1 and VIEWER2 over the communication interface COMM3.
For viewing the captured or created video content, there may be one or more viewer devices VIEWER1 and VIEWER2. These devices may have a rendering module and a display module, or these functionalities may be combined in a single device. The devices may comprise or be functionally connected to a computer processor PROC4 and memory MEM4, the memory comprising computer program PROG4 code for controlling the viewing devices. The viewer (playback) devices may consist of a data stream receiver for receiving a video data stream from a server and for decoding the video data stream. The data stream may be received over a network connection through communications interface COMM4, or from a memory device MEM6 like a memory card CARD2. The viewer devices may have a graphics processing unit for processing of the data to a suitable format for viewing. The viewer VIEWER1 comprises a high-resolution stereo-image head-mounted display for viewing the rendered stereo video sequence. The head-mounted display may have an orientation sensor DET1 and stereo audio headphones. The viewer VIEWER2 comprises a display enabled with 3D technology (for displaying stereo video), and the rendering device may have a head-orientation detector DET2 connected to it. Any of the devices (SRC1, SRC2, SERVER, RENDERER, VIEWER1, VIEWER2) may be a computer or a portable computing device, or be connected to such. Such rendering devices may have computer program code for carrying out methods according to various examples described in this text.
Fig. 2a shows a camera device for stereo viewing. The camera comprises three or more cameras that are configured into camera pairs for creating the left and right eye images, or that can be arranged into such pairs. The distances between cameras may correspond to the usual (or average) distance between the human eyes. The cameras may be arranged so that they have significant overlap in their field-of-view. For example, wide-angle lenses of 180 degrees or more may be used, and there may be 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, or 20 cameras. The cameras may be regularly or irregularly spaced across the whole sphere of view, or they may cover only part of the whole sphere. For example, there may be three cameras arranged in a triangle and having different directions of view towards one side of the triangle such that all three cameras cover an overlap area in the middle of the directions of view. As another example, eight cameras having wide-angle lenses may be arranged regularly at the corners of a virtual cube and cover the whole sphere such that the whole, or essentially the whole, sphere is covered in all directions by at least 3 or 4 cameras. In Fig. 2a three stereo camera pairs are shown.
Multicamera devices with other types of camera layouts may be used. For example, a camera device with all cameras in one hemisphere may be used. The number of cameras may be e.g., 3, 4, 6, 8, 12, or more. The cameras may be placed to create a central field of view where stereo images can be formed from image data of two or more cameras, and a peripheral (extreme) field of view where one camera covers the scene and only a normal non-stereo image can be formed.
Fig. 2b shows a head-mounted display for stereo viewing. The head-mounted display comprises two screen sections or two screens DISP1 and DISP2 for displaying the left and right eye images. The displays are close to the eyes, and therefore lenses are used to make the images easily viewable and to spread the images to cover as much as possible of the eyes' field of view. The device is attached to the head of the user so that it stays in place even when the user turns his head. The device may have an orientation detecting module ORDET1 for determining the head movements and direction of the head. The head-mounted display gives a three-dimensional (3D) perception of the recorded/streamed content to a user.
Fig. 3 illustrates a camera CAM1. The camera has a camera detector CAMDET1, comprising a plurality of sensor elements for sensing intensity of the light hitting the sensor element. The camera has a lens OBJ1 (or a lens arrangement of a plurality of lenses), the lens being positioned so that the light hitting the sensor elements travels through the lens to the sensor elements. The camera detector CAMDET1 has a nominal center point CP1 that is a middle point of the plurality of sensor elements, for example for a rectangular sensor the crossing point of the diagonals. The lens has a nominal center point PP1, as well, lying for example on the axis of symmetry of the lens. The direction of orientation of the camera is defined by the line passing through the center point CP1 of the camera sensor and the center point PP1 of the lens. The direction of the camera is a vector along this line pointing in the direction from the camera sensor to the lens. The optical axis of the camera is understood to be this line CP1-PP1.
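As a small illustration of the optical-axis definition above, the direction of the camera can be computed as the unit vector from the sensor center point CP1 to the lens center point PP1. The following is only a minimal sketch; the helper name and the example coordinates are assumptions, not values taken from the patent.

```python
import numpy as np

def camera_direction(cp1, pp1):
    """Unit vector along the optical axis, pointing from sensor center CP1 towards lens center PP1."""
    v = np.asarray(pp1, dtype=float) - np.asarray(cp1, dtype=float)
    return v / np.linalg.norm(v)

# Example with assumed coordinates (millimetres): lens 4.5 mm in front of the sensor center.
print(camera_direction(cp1=[0.0, 0.0, 0.0], pp1=[0.0, 0.0, 4.5]))  # -> [0. 0. 1.]
```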
The system described above may function as follows. Time-synchronized video, audio and orientation data is first recorded with the capture device. This can consist of multiple concurrent video and audio streams as described above. These are then transmitted immediately or later to the storage and processing network for processing and conversion into a format suitable for subsequent delivery to playback devices. The conversion can involve post-processing steps to the audio and video data in order to improve the quality and/or reduce the quantity of the data while preserving the quality at a desired level. Finally, each playback device receives a stream of the data from the network, and renders it into a stereo viewing reproduction of the original location which can be experienced by a user with the head-mounted display and headphones.
In the following a method for creating stereo images is described. With the method, the user may be able to turn their head in multiple directions, and the playback device is able to create a high-frequency (e.g. 60 frames per second) stereo video and audio view of the scene corresponding to that specific orientation as it would have appeared from the location of the original recording. Other methods of creating the stereo images for viewing from the camera data may be used as well.
Figs. 4a and 4b show an example of a camera device for being used as an image source. To create a full 360 degree stereo panorama every direction of view needs to be photographed from two locations, one for the left eye and one for the right eye. In case of video panorama, these images need to be shot simultaneously to keep the eyes in sync with each other. As one camera cannot physically cover the whole 360 degree view, at least without being obscured by another camera, there need to be multiple cameras to form the whole 360 degree panorama. Additional cameras however increase the cost and size of the system and add more data streams to be processed. This problem becomes even more significant when mounting cameras on a sphere or platonic solid shaped arrangement to get more vertical field of view. However, even by arranging multiple camera pairs on for example a sphere or platonic solid such as octahedron or dodecahedron, the camera pairs will not achieve free angle parallax between the eye views. The parallax between eyes is fixed to the positions of the individual cameras in a pair, that is, in the perpendicular direction to the camera pair, no parallax can be achieved. This is problematic when the stereo content is viewed with a head mounted display that allows free rotation of the viewing angle around z-axis as well.
The requirement that multiple cameras cover every point around the capture device twice would lead to a very large number of cameras in the capture device. In this technique, lenses are used with a field of view of 180 degrees (hemisphere) or greater, and the cameras are arranged with a carefully selected arrangement around the capture device. Such an arrangement is shown in Fig. 4a, where the cameras have been positioned at the corners of a virtual cube, having orientations DIR_CAM1, DIR_CAM2, ..., DIR_CAMN pointing away from the center point of the cube. Naturally, other shapes, e.g. the shape of a cuboctahedron, or other arrangements, even irregular ones, can be used.
Overlapping super wide field of view lenses may be used so that a camera can serve both as the left eye view of one camera pair and as the right eye view of another camera pair. This reduces the number of cameras needed to half. As a surprising advantage, reducing the number of cameras in this manner increases the stereo viewing quality, because it also allows the left eye and right eye cameras to be picked arbitrarily among all the cameras as long as they have enough overlapping view with each other. Using this technique with different numbers of cameras and different camera arrangements such as spheres and platonic solids enables picking the closest matching camera for each eye, also achieving vertical parallax between the eyes. This is beneficial especially when the content is viewed using a head mounted display. The described camera setup may allow creating stereo viewing with higher fidelity and a smaller expense for the camera device.
The present embodiments relate to a multicamera system, where part of the data captured by the multicamera system may be degraded in order to reduce the amount of data being broadcast without sacrificing the quality of experience. The multicamera view quality degradation is based on the most probable viewing direction (MPVD), which is determined e.g. by means of eye tracking. The method according to an embodiment comprises finding the location of the MPVD on all available views and segmenting the content into segments having a specific shape and size. The segments which better cover the MPVD are selected and encoded with a higher quality, while the rest of the parts of the image may be encoded with lower qualities.
In the related technology, video compression may be achieved by removing spatial and temporal redundancies. Different types of prediction and quantization of transform-domain prediction residuals are jointly used in many video coding standards to exploit both spatial and temporal redundancies. In addition, as coding schemes have a practical limit on the redundancy that can be removed, the spatial and temporal sampling frequency as well as the bit depth of samples can be selected in such a manner that the subjective quality is degraded as little as possible.
One branch of research for obtaining compression improvement in stereoscopic video is known as asymmetric stereoscopic video coding, in which there is a quality difference between the two coded views. This is attributed to the widely believed assumption of the binocular suppression theory that the Human Visual System (HVS) fuses the stereoscopic image pair such that the perceived quality is close to that of the higher quality view.
Asymmetry between the two views can be achieved by one or more of the following methods:
a) Mixed-resolution (MR) stereoscopic video coding, also referred to as resolution-asymmetric stereoscopic video coding. One of the views is low-pass filtered and hence has a smaller amount of spatial detail or a lower spatial resolution. Furthermore, the low-pass filtered view is usually sampled with a coarser sampling grid, i.e., represented by fewer pixels.
b) Mixed-resolution chroma sampling. The chroma pictures of one view are represented by fewer samples than the respective chroma pictures of the other view.
c) Asymmetric sample-domain quantization. The sample values of the two views are quantized with a different step size. For example, the luma samples of one view may be represented with the range of 0 to 255 (i.e., 8 bits per sample) while the range may be scaled to the range of 0 to 159 for the second view. Thanks to the fewer quantization steps, the second view can be compressed with a higher ratio compared to the first view. Different quantization step sizes may be used for luma and chroma samples. As a special case of asymmetric sample-domain quantization, one can refer to bit-depth-asymmetric stereoscopic video when the number of quantization steps in each view matches a power of two. (A code sketch illustrating this range scaling is given after this list.)
d) Asymmetric transform-domain quantization. The transform coefficients of the two views are quantized with a different step size. As a result, one of the views has a lower fidelity and may be subject to a greater amount of visible coding artifacts, such as blocking and ringing.
e) A combination of different encoding techniques above.
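The sketch below illustrates the sample-domain quantization of item (c): luma samples of the second view are rescaled from the full 8-bit range 0-255 to 0-159 before encoding, so fewer distinct values need to be coded. This is one possible interpretation for illustration only, not the patent's reference implementation; the function names and the round-trip mapping are assumptions.

```python
import numpy as np

def quantize_sample_domain(luma, new_max=159, old_max=255):
    """Asymmetric sample-domain quantization: map luma values in 0..old_max to 0..new_max."""
    luma = np.asarray(luma, dtype=np.float32)
    return np.round(luma * (new_max / old_max)).astype(np.uint8)

def dequantize_sample_domain(luma_q, new_max=159, old_max=255):
    """Approximate inverse mapping applied before display."""
    return np.round(np.asarray(luma_q, dtype=np.float32) * (old_max / new_max)).astype(np.uint8)

# The first view keeps the full range; the second view is range-reduced before encoding.
view2 = np.array([[0, 100, 255]], dtype=np.uint8)
print(quantize_sample_domain(view2))  # -> [[  0  62 159]]
```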
The aforementioned types of asymmetric stereoscopic video coding are illustrated in Figure 5. The first row presents the higher quality view 510, which is only transform-coded. The remaining rows present several encoding combinations which have been investigated to create the lower quality view using different steps, namely downsampling, sample-domain quantization, and transform-based coding. It can be observed from Figure 5 that downsampling or sample-domain quantization can be applied or skipped regardless of how the other steps in the processing chain are applied. Likewise, the quantization step in the transform-domain coding step can be selected independently of the other steps. Thus, practical realizations of asymmetric stereoscopic video coding may use appropriate techniques for achieving asymmetry in a combined manner, as illustrated in Figure 5 (item e).
In addition to the aforementioned types of asymmetric stereoscopic video coding, mixed temporal resolution (i.e. different picture rate) between views has been proposed.
When texture views are low-pass filtered (LPF), the target is to remove the high-frequency components (HFCs) while keeping the spatial resolution and general structure of the image untouched. This enables the compression of the same content with a reduced number of bits, since less detail (high frequency components) needs to be encoded. In the case where videos are presented on polarized displays, a downsampling with ratio 1/2 along the vertical direction is applied to the content. This is because the vertical spatial resolution of the display is divided between the left and right view and hence each one has half the vertical resolution. In such cases, depending on the display and content, a strong aliasing artifact might be introduced while perceiving the stereoscopic content. However, applying LPF reduces such artifacts considerably, since the high frequency components responsible for the creation of aliasing are removed in a pre-processing stage.
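As a rough illustration of this pre-processing, a view can be low-pass filtered (for example with a Gaussian blur) before encoding so that high-frequency components are removed while the resolution and overall structure are kept, and optionally downsampled vertically by 1/2 for row-interleaved polarized displays. The sketch below uses OpenCV's GaussianBlur purely as one possible filter; the kernel size and sigma are assumptions, not values from the patent.

```python
import cv2

def low_pass_filter_view(view_bgr, ksize=7, sigma=2.0):
    """Remove high-frequency components while keeping spatial resolution and structure."""
    return cv2.GaussianBlur(view_bgr, (ksize, ksize), sigma)

def halve_vertical_resolution(view_bgr):
    """Optional 1/2 vertical downsampling, e.g. for row-interleaved polarized displays."""
    h, w = view_bgr.shape[:2]
    return cv2.resize(view_bgr, (w, h // 2), interpolation=cv2.INTER_AREA)

# Example usage (file name is hypothetical):
# view = cv2.imread("right_view.png")
# prepared = halve_vertical_resolution(low_pass_filter_view(view))
```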
Currently, content provided for a 360 degree video experience needs quite a lot of bandwidth to transmit the information in real time. In cases where stereoscopic presentation is required, the bandwidth load is an even more significant issue. Thus, the bandwidth required to encode the 360 degree images/video needs to be decreased as much as possible without sacrificing the quality of experience for the users.
The images originating from a plurality of different cameras may be stitched together to form one 360° panorama image. Photo stitching includes the process of perspective warping of images to align them perfectly. Therefore, the perspective of the images is changed in order to blend them, targeting one aligned, seamless 360° panorama image. Several related image processing algorithms may be used in the process of image stitching, including template matching, keypoint matching, corner detection, feature transforms, utilization of image gradients, etc. The present embodiments are based on processing one 360 degree image so that the quality of different parts of the 360 degree image is changed. The present embodiments exploit the characteristics of the 360 degree image to continue the encoding from one side of the image to the other side of the image. The present embodiments are thus configured to tackle the problem for a 2D or 360 degree image by finding the regions of the image which can have lower quality. Moreover, the present embodiments are configured to change the compression strength of some zones of the image taking into consideration the characteristics of the image, even when a selected zone continues on the other side of the image. The quality degradation for parts of the image is based on the distance between the pixels and the most probable viewing direction (MPVD).
A method according to an embodiment is shown in Figure 6. The method comprises finding 610 an MPVD for content, e.g. in the form of a 360 degree image, captured with a multi-camera device; encoding 620 the area around the MPVD with the highest quality; and encoding 630 the rest of the image with lower qualities. According to embodiments, the quality degradation can be a function of the distance/degree difference from the MPVD. The MPVD can be found by a solution of related technology, e.g. based on the subjects' head movement/eye gaze direction (e.g. averaged over several subjects watching the same content); based on the amount of movement in the scene (the pixel with the highest spatial location movement over a specific period of time or a specific number of frames, e.g. one group of pictures (GOP)); based on the depth information and closeness of the pixels to the viewer; provided by the content provider alongside the content; or by any combination of the aforementioned. There are several different audio/video methods to direct the user to the MPVD while wearing the head-mounted display, e.g. making a sound in one specific direction, or adding augmented content to direct viewers towards the MPVD.
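One way to read "quality degradation as a function of the distance/degree difference from the MPVD" is to let the encoder's quantization parameter grow with angular distance. The sketch below is a hypothetical mapping with invented constants (base QP, zone width, step size), not a scheme mandated by the patent.

```python
def qp_for_angular_distance(delta_deg, base_qp=22, qp_step=2.0, max_qp=51, zone_width_deg=30.0):
    """Return a quantization parameter that increases with angular distance from the MPVD.

    delta_deg: absolute angular difference (degrees) between a pixel/segment and the MPVD.
    The result is capped at max_qp (e.g. the H.265/HEVC QP range ends at 51).
    """
    zone = int(delta_deg // zone_width_deg)          # 0 for the zone containing the MPVD
    return min(max_qp, int(round(base_qp + zone * qp_step)))

# Zones every 30 degrees: 0-30 deg -> QP 22, 30-60 deg -> QP 24, and so on.
print([qp_for_angular_distance(d) for d in (10, 45, 95, 170)])  # -> [22, 24, 28, 32]
```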
Eye gaze tracking is a process of measuring either the point of gaze (where one is looking) or the motion of an eye relative to the head. An eye gaze tracker is a device for measuring eye positions and eye movement, following the movement of the eye's pupil to determine exactly which point the user is looking at. Eye gaze trackers are used in research on visual systems and in subjective tests, enabling researchers to follow the eye movements of users across the different content presented. Eye gaze can be tracked, for example, by using a camera tracking the movement of the pupil in the user's eye. The process can be done in real time and with relatively little processing and few resources. This has been discussed by Qiang Ji et al. in "Real-time eye, gaze and face pose tracking for monitoring driver vigilance" (2002). The eye gaze can also be predicted based on the characteristics of the content, as discussed by Laurent Itti in "Automatic foveation for video compression using a neurobiological model of visual attention" (2004). That process requires a considerable number of operations per pixel and hence cannot be utilized in most handheld devices due to the extensive power consumption. The eye gaze is expected to define the direction towards the most probably viewed part of the scene. Therefore, it is assumed that the majority of viewers may choose to watch in this direction or in a very close vicinity of it.
Figures 7a-b illustrate two presentations of an object in a 360 degree image. In Figure 7a, the object 710 is in the top left part of the image 700, while in Figure 7b, the object 710 is in the bottom right part of the image 700. The object 710 represents the MPVD. If there are objects introduced as the important part of the image, then the MPVD is configured to point to the center of the presented objects. Such a center location may be defined as the pixel whose average distance to all pixel locations belonging to the said object is minimal.
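The center location described above, the object pixel whose average distance to all pixels of the object is smallest, can be found by brute force over an object mask. The following is a minimal sketch under the assumption that the object is given as a binary mask; for brevity it ignores the 360 degree horizontal wrap-around and uses a small O(N^2) pairwise-distance computation.

```python
import numpy as np

def object_center(mask):
    """Return (row, col) of the object pixel with minimal mean distance to all object pixels."""
    ys, xs = np.nonzero(mask)                       # coordinates of the object pixels
    pts = np.stack([ys, xs], axis=1).astype(float)
    # Pairwise distances from every object pixel to every other object pixel.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    best = d.mean(axis=1).argmin()
    return int(ys[best]), int(xs[best])

mask = np.zeros((6, 8), dtype=bool)
mask[1:4, 2:6] = True                               # a small rectangular "object"
print(object_center(mask))                          # -> a pixel near the middle of the object
```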
The method according to an embodiment is illustrated in Figure 8. The method comprises finding the MPVD 100. This can be achieved e.g. by a head movement and eye gaze tracker, where the same content is perceived by several subjects and their tracked head movement and eye gaze are averaged to find the MPVD.
Alternatively, the MPVD can be provided by the content provider (wherein the users are directed to the main events in the scene to ensure that the important part of the content is well perceived by the users); or indicated based on the amount of movement in the scene (the pixel with the highest spatial location movement over a specific period of time, or a specific number of frames, e.g. one GOP); or based on the depth information and closeness of the pixels to the viewer (e.g. if most parts of the image are relatively far from the user and only a small part is very close, the part that is close may be considered to represent the MPVD); or any combination of the aforementioned, or any other known method.
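A hedged sketch of the first option above: averaging the tracked viewing directions of several subjects to obtain one MPVD. Directions are converted to unit vectors before averaging, because averaging yaw angles directly breaks at the ±180° wrap of a 360 degree panorama. The function name and the (yaw, pitch) input format are assumptions made for this illustration.

```python
import numpy as np

def mpvd_from_tracked_directions(yaw_pitch_deg):
    """Average per-subject viewing directions (yaw, pitch in degrees) into a single MPVD."""
    yp = np.radians(np.asarray(yaw_pitch_deg, dtype=float))
    # Convert to unit vectors so the average is meaningful across the 360 degree wrap.
    x = np.cos(yp[:, 1]) * np.cos(yp[:, 0])
    y = np.cos(yp[:, 1]) * np.sin(yp[:, 0])
    z = np.sin(yp[:, 1])
    m = np.array([x.mean(), y.mean(), z.mean()])
    m /= np.linalg.norm(m)
    yaw = np.degrees(np.arctan2(m[1], m[0]))
    pitch = np.degrees(np.arcsin(m[2]))
    return yaw, pitch

# Three subjects looking roughly towards yaw +/-178 degrees; naive averaging of the raw
# yaw values would land far from the true common direction near the wrap-around.
print(mpvd_from_tracked_directions([(175, 5), (-178, 0), (179, -3)]))  # yaw near 180, small pitch
```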
In step 200 of the embodiment of the method, the MPVD is determined and selected. The MPVD shows the direction where the main event of the scene is expected to happen, and it is assumed that the users pay their attention mostly to this part of the image. Sometimes a user may miss the main event of the scene because the content is in the form of a 360 degree presentation. Therefore the user may be notified and directed towards the MPVD. Such a notification can be of several different types, including but not limited to the following: a user interface (UI) arrow or any other type of augmented direction indicator on the image directing the user towards the direction of the MPVD; an audio signal; or a vibration or any other type of indicator on the HMD or glasses being used by the user. As a result of such an indication, the users are expected to notice the presence of the MPVD and change their viewing direction towards it.
In step 300, characteristics of the 360 degree image are taken into account to be used in step 400. In step 400 the relative size and type of shapes (as depicted in Fig. 9) are defined. The characteristics of the shapes (e.g. their type and size) depend on the characteristics of the 360 degree image, e.g. how the 360 degree image has been created.
If the image is created using a few fisheye lenses, the spatial resolution at the corners of the image is already degraded to some extent. This may be taken into account when choosing the type of the shape, e.g. an oval shape rather than a rectangular shape, and also the strength of encoding, i.e. using relatively lower quality degradation in those areas. On the other hand, if the content has been created using several high quality cameras, the quality degradation shall be equal in all areas and hence a shape of rectangular type may work better. Examples of various types of shapes are illustrated in Figure 9. The size of the shape depends on the compression criteria.
The better the compression ratio required, the larger the size of the shapes should be.
The size of the different shapes may also depend on the scene content. For example, if the scene includes small objects (which may be determined using any type of object extraction algorithm), then the size of the shapes should be small. On the other hand, if the scene includes larger objects, then the size of the shapes should be large. Hence, in the case where an understanding of the size of objects in the scene is available, the size of the shapes can be adjusted according to the size of the objects in the scene.
In the case that the MPVD is obtained based on a target object, the size and shape of that object will be an indicator of the size and type of the respective selected shapes.
Now, the MPVD location is known from step 200, and the type and size of the shapes according to which the quality degradation of the 360 degree image is done are known based on step 400. Steps 500 and 600 may work based on the distance and direction of pixels in the 360 degree image compared to the MPVD. In an embodiment, this is addressed by introducing different types of shapes, different sizes of the shapes, and one or more layers of those shapes. It is to be noted that this can be addressed by any other method which results in the same expected output.
Based on the central location defined by the MPVD, in steps 500 and 600 the shapes are located on the 360 degree image. It is to be noted that since the image represents the full content of a 360 degree environment, if a shape hits the border on one side, exactly the same continuation of the shape should be followed on the opposite border to correctly take into account the structure of the 360 degree image. This is shown in Figures 10a and 10b, where Fig. 10a shows a presentation of oval shapes and Fig. 10b a presentation of rectangular shapes, the presentations having the MPVD 1010 in different locations on the 360 degree image 1000.
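The border continuation described above can be handled with modular arithmetic on the horizontal axis of an equirectangular image: the horizontal distance between a pixel and the MPVD is taken the shorter way around the 360 degree panorama. The sketch below builds a per-pixel quality-layer index for nested elliptical ("oval") layers centred on the MPVD; the layer radii and the helper name are invented for illustration and are not taken from the patent.

```python
import numpy as np

def layer_map_oval(height, width, mpvd_rc, radii_px):
    """Assign each pixel the index of the innermost elliptical layer (centred on the MPVD) it falls in.

    radii_px: list of (vertical_radius, horizontal_radius) pairs, innermost first.
    Pixels outside all layers get index len(radii_px), i.e. the lowest quality.
    """
    rows, cols = np.mgrid[0:height, 0:width]
    dr = rows - mpvd_rc[0]
    dc = np.abs(cols - mpvd_rc[1])
    dc = np.minimum(dc, width - dc)                 # 360 degree wrap-around on the horizontal axis
    layers = np.full((height, width), len(radii_px), dtype=np.uint8)
    for i, (rv, rh) in reversed(list(enumerate(radii_px))):
        inside = (dr / rv) ** 2 + (dc / rh) ** 2 <= 1.0
        layers[inside] = i                          # inner layers overwrite outer ones
    return layers

layers = layer_map_oval(90, 180, mpvd_rc=(45, 5), radii_px=[(15, 30), (30, 60), (45, 90)])
print(np.unique(layers))                            # -> [0 1 2 3]; layer 0 wraps across the left/right border
```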
According to an embodiment, N > 1 different sizes of the same shape are defined and used to map the surrounding area of the MPVD on the 360 degree image based on defining different layers of the shapes. The greater the number N is (i.e. the more shapes), the smoother the transition between the quality levels and the easier it is to reach a target bitrate. According to embodiments, a single shape may be used to define the respective area. According to embodiments, it is also possible to mix the shapes; this means, for example, that there is an oval shape in the center of the image, while the rest of the areas are defined with a circular shape.
As mentioned, Figures 10a, b illustrate two presentations of 360 degree image segmentation with oval and rectangular shapes, where the star 1010 represents the MPVD or object of interest. The differently hatched layers introduced in Figures 10a, b represent the different quality levels that should be applied to the 360 degree image 1000. This is explained in more detail with reference to step 600 (Figure 8). It is to be noted that if the shape is selected to be of the circular type, then the direct distance between the pixel and the MPVD is the deciding factor for the quality of that pixel. However, if the shape is of any other type, then the criteria for the quality of that pixel include both the direction and the distance of that pixel from the MPVD.
The introduction of several shapes on the same image enables better control of the quality degradation from the highest quality (the inner shape) towards the lower quality (the area outside the outer shape). This means introducing a smoother transition between the high and low quality parts of the 360 degree image. Moreover, depending on the target bitrate, it is easier to adjust the compression strength of the encoder for the different shapes to meet the said target bitrate.
It is also possible to select different zones/areas based on the horizontal degree difference between the MPVD and the respective pixels. This is shown in Figure 11, where an image 1100 is illustrated. Within the image 1100, a star 1110 represents the MPVD or object of interest. The image 1100 has been segmented into rectangular segments that are encoded differently from each other. In other words, in this embodiment only rectangular shapes with a height equal to the height of the image are used to define the different encoding areas on the 360 degree image 1100.
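For this variant, the zones can be full-height vertical strips whose index depends only on the horizontal angular difference to the MPVD, again measured the short way around the panorama. A minimal sketch, under the assumption of an equirectangular mapping of 360 degrees across the image width and an invented zone width:

```python
import numpy as np

def horizontal_zone_map(width, mpvd_col, zone_width_deg=45.0):
    """Zone index per column: 0 for columns nearest the MPVD, growing with horizontal degree difference."""
    deg_per_px = 360.0 / width
    dc = np.abs(np.arange(width) - mpvd_col)
    dc = np.minimum(dc, width - dc)                 # shorter way around the 360 degree image
    return (dc * deg_per_px // zone_width_deg).astype(int)

zones = horizontal_zone_map(width=360, mpvd_col=350, zone_width_deg=45.0)
print(zones[350], zones[10], zones[170])            # 0, 0 (wraps past the border), 4 (opposite side)
```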
In the method shown in Figure 8, step 600 comprises applying encoding to the segmented parts of the image defined in step 500. The general rule is that lighter compression (resulting in higher subjective quality) is applied to parts of the image that are closer to the MPVD, while stronger compression (resulting in lower subjective quality) is applied to pixels/layers farther from the MPVD. This is also shown in Figures 10 and 11, where different hatchings represent different compressions. It should be noted that the closeness of different locations to the MPVD is defined in a 360 degree scenario. This means that some parts of the scene which are close to the MPVD but are displayed on the other side of the image (a continuation of the scene at the border of the image) are encoded with the same high quality as the parts of the scene which are close to the MPVD and do not cross the border of the image (on the same side of the image).
The lower quality in the different mapping zones can be achieved by low-pass filtering prior to encoding, a larger quantization parameter, sample value quantization, a lower spatial resolution, or any combination of the aforementioned.
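These tools can be combined per zone. The sketch below picks, for each quality layer (for example a layer map such as the one built in the earlier elliptical-layer sketch), a quantization-parameter offset to hand to the encoder and an optional pre-encoding blur; all of the concrete values are assumptions chosen for illustration, not figures from the patent.

```python
import cv2
import numpy as np

# Per-layer settings: (QP offset to pass to the encoder, Gaussian sigma for pre-filtering; 0 = no filtering).
LAYER_SETTINGS = {0: (0, 0.0), 1: (4, 1.0), 2: (8, 2.0), 3: (12, 3.0)}

def degrade_by_layer(image_bgr, layer_map):
    """Apply stronger low-pass filtering to pixels in higher-numbered (less probable) layers."""
    out = image_bgr.copy()
    for layer, (_qp_offset, sigma) in LAYER_SETTINGS.items():
        if sigma <= 0:
            continue
        blurred = cv2.GaussianBlur(image_bgr, (0, 0), sigma)   # kernel size derived from sigma
        mask = layer_map == layer
        out[mask] = blurred[mask]
    return out
```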
It is to be noted that the same method can be applied to stereoscopic 360 degree images as well as to 2D images. In the stereoscopic views, the disparity between the images can also be taken into account along with the zone selection of the first view to find the respective zones in the second view. Alternatively, a similar technique can be applied to both views of the stereoscopic 360 degree images separately.
In the foregoing, a method according to an embodiment was discussed with various examples. An apparatus according to an embodiment comprises means for implementing the method, i.e. means for finding an MPVD for content, e.g. in the form of a 360 degree image, captured with a multi-camera device; means for encoding the area around the MPVD with the highest quality; and means for encoding the rest of the image with a lower quality. The apparatus also comprises means for determining the quality degradation as a function of the distance/degree difference from the MPVD. These means comprise a processor, a memory, and computer program code residing in the memory.
The various embodiments may provide advantages. For example, the embodiments considerably decrease the required bitrate for encoding the 360 degree images that are intended for broadcasting, streaming, and storage.
The various embodiments of the invention can be implemented with the help of computer program code that resides in a memory and causes the relevant apparatuses to carry out the invention. For example, a device may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the device to carry out the features of an embodiment. Yet further, a network device like a server may comprise circuitry and electronics for handling, receiving and transmitting data, computer program code in a memory, and a processor that, when running the computer program code, causes the network device to carry out the features of an embodiment.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions and embodiments may be optional or may be combined.
Although various aspects of the embodiments are set out in the independent claims, other aspects comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes example embodiments, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present disclosure as defined in the appended claims.

Claims (25)

Claims:
1. A method, comprising:
- determining a most probable viewing direction in a content;
- segmenting the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and
- encoding the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
2. The method according to claim 1, wherein the content is in the form of a 360 degree image or a two-dimensional image.
3. The method according to claim 1 or 2, further comprising defining a segment with a certain type of a shape.
4. The method according to claim 2 or 3, wherein a segment is partially located on one side of the image, and partially located on another side of the image.
5. The method according to any of the claims 1 to 4, wherein the content is captured with a multi-camera device.
6. The method according to any of the claims 1 to 5, wherein the content is received through a communications network.
7. The method according to any of the claims 1 to 6, wherein the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
8. The method according to any of the claims 3 to 7, wherein the shape for a segment is selected based on how the content has been created.
9. An apparatus comprising at least one processor, memory including computer program code, the memory and the computer program code configured to, with the at least one processor, cause the apparatus to perform at least the following:
- determine a most probable viewing direction in a content;
- segment the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and
- encode the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
10. The apparatus according to claim 9, wherein the content is in the form of a 360 degree image or a two-dimensional image.
11. The apparatus according to claim 9 or 10, being further caused to define a segment with a certain type of a shape.
12. The apparatus according to claim 10 or 11, wherein a segment is partially located on one side of the image, and partially located on another side of the image.
13. The apparatus according to any of the claims 9 to 12, wherein the content is captured with a multi-camera device.
14. The apparatus according to any of the claims 9 to 13, wherein the content is received through a communications network.
15. The apparatus according to any of the claims 9 to 14, wherein the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
16. The apparatus according to any of the claims 11 to 15, wherein the shape for a segment is selected based on how the content has been created.
17. An apparatus comprising:
- means for determining a most probable viewing direction in a content;
- means for segmenting the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and
- means for encoding the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
18. The apparatus according to claim 17, wherein the content is in the form of a 360 degree image or a two-dimensional image.
19. The apparatus according to claim 17 or 18, further comprising means for defining a segment with a certain type of a shape.
20. The apparatus according to claim 18 or 19, wherein a segment is partially located on one side of the image, and partially located on another side of the image.
21. The apparatus according to any of the claims 17 to 20, wherein the content is captured with a multi-camera device.
22. The apparatus according to any of the claims 17 to 21, wherein the content is received through a communications network.
23. The apparatus according to any of the claims 17 to 22, wherein the most probable viewing direction is obtained by one or more of the following: determined by a head movement tracker; determined by an eye gaze tracker; received from a content provider; determined by an indication of the amount of movement in the content; determined based on depth information and closeness of the pixels to the viewer.
24. The apparatus according to any of the claims 19 to 23, wherein the shape for a segment is selected based on how the content has been created.
25. A computer program product embodied on a non-transitory computer readable medium, comprising computer program code configured to, when executed on at least one processor, cause an apparatus or a system to:
- determine a most probable viewing direction in a content;
- segment the content into a plurality of segments based on distance between pixels of the content and the determined most probable viewing direction; and
- encode the plurality of segments with different qualities, wherein the lightest encoding is applied to a segment that is closest to the determined most probable viewing direction.
GB1610763.3A 2016-06-21 2016-06-21 Image compression method and technical equipment for the same Withdrawn GB2556017A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1610763.3A GB2556017A (en) 2016-06-21 2016-06-21 Image compression method and technical equipment for the same
PCT/FI2017/050389 WO2017220851A1 (en) 2016-06-21 2017-05-24 Image compression method and technical equipment for the same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1610763.3A GB2556017A (en) 2016-06-21 2016-06-21 Image compression method and technical equipment for the same

Publications (2)

Publication Number Publication Date
GB201610763D0 GB201610763D0 (en) 2016-08-03
GB2556017A true GB2556017A (en) 2018-05-23

Family

ID=56895209

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1610763.3A Withdrawn GB2556017A (en) 2016-06-21 2016-06-21 Image compression method and technical equipment for the same

Country Status (2)

Country Link
GB (1) GB2556017A (en)
WO (1) WO2017220851A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2575326A (en) * 2018-07-06 2020-01-08 Displaylink Uk Ltd Method and apparatus for determining whether an eye of a user of a head mounted display is directed at a fixed point

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6252989B1 (en) * 1997-01-07 2001-06-26 Board Of The Regents, The University Of Texas System Foveated image coding system and method for image bandwidth reduction
US20020141650A1 (en) * 2001-03-29 2002-10-03 Electronics For Imaging, Inc. Digital image compression with spatially varying quality levels determined by identifying areas of interest
US20070162922A1 (en) * 2003-11-03 2007-07-12 Gwang-Hoon Park Apparatus and method for processing video data using gaze detection
US20150279418A1 (en) * 2014-03-18 2015-10-01 Vixs Systems, Inc. Video system with fovea tracking and methods for use therewith

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345991B2 (en) * 2007-03-19 2013-01-01 General Electric Company Content-based image compression
US8184069B1 (en) * 2011-06-20 2012-05-22 Google Inc. Systems and methods for adaptive transmission of data
US10242462B2 (en) * 2013-04-02 2019-03-26 Nvidia Corporation Rate control bit allocation for video streaming based on an attention area of a gamer
GB2525170A (en) * 2014-04-07 2015-10-21 Nokia Technologies Oy Stereo viewing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6252989B1 (en) * 1997-01-07 2001-06-26 Board Of The Regents, The University Of Texas System Foveated image coding system and method for image bandwidth reduction
US20020141650A1 (en) * 2001-03-29 2002-10-03 Electronics For Imaging, Inc. Digital image compression with spatially varying quality levels determined by identifying areas of interest
US20070162922A1 (en) * 2003-11-03 2007-07-12 Gwang-Hoon Park Apparatus and method for processing video data using gaze detection
US20150279418A1 (en) * 2014-03-18 2015-10-01 Vixs Systems, Inc. Video system with fovea tracking and methods for use therewith

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2575326A (en) * 2018-07-06 2020-01-08 Displaylink Uk Ltd Method and apparatus for determining whether an eye of a user of a head mounted display is directed at a fixed point
US10962773B2 (en) 2018-07-06 2021-03-30 Displaylink (Uk) Limited Method and apparatus for determining whether an eye of a user of a head mounted display is directed at a fixed point
GB2575326B (en) * 2018-07-06 2022-06-01 Displaylink Uk Ltd Method and apparatus for determining whether an eye of a user of a head mounted display is directed at a fixed point

Also Published As

Publication number Publication date
WO2017220851A1 (en) 2017-12-28
GB201610763D0 (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CA2943856C (en) Stereo viewing
WO2019034807A1 (en) Sequential encoding and decoding of volymetric video
WO2019034808A1 (en) Encoding and decoding of volumetric video
WO2018172614A1 (en) A method and an apparatus and a computer program product for adaptive streaming
EP3698327A1 (en) An apparatus, a method and a computer program for volumetric video
US10631008B2 (en) Multi-camera image coding
CA3018600C (en) Method, apparatus and stream of formatting an immersive video for legacy and immersive rendering devices
EP3349444A1 (en) Method for processing media content and technical equipment for the same
US20190335153A1 (en) Method for multi-camera device
US11010923B2 (en) Image encoding method and technical equipment for the same
WO2017220851A1 (en) Image compression method and technical equipment for the same
WO2019008233A1 (en) A method and apparatus for encoding media content
WO2018011473A1 (en) Method for temporal inter-view prediction and technical equipment for the same

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)