WO2019068310A1 - Method and apparatus for improved encoding of immersive video - Google Patents

Method and apparatus for improved encoding of immersive video

Info

Publication number
WO2019068310A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
interest
region
segments
encoded
Prior art date
Application number
PCT/EP2017/075022
Other languages
French (fr)
Inventor
Olie BAUMANN
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2017/075022 priority Critical patent/WO2019068310A1/en
Publication of WO2019068310A1 publication Critical patent/WO2019068310A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/597Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding specially adapted for multi-view video sequence encoding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/115Selection of the code volume for a coding unit prior to coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/167Position within a video image, e.g. region of interest [ROI]

Definitions

  • Figure 9 shows an apparatus for transmitting video segments to a user device.
  • the audio played to the user over headphones 220 or earbuds 320 is derived from a three dimensional (3D) audio track.
  • the 3D audio track comprises sounds, or audio tracks, and their source locations in 3D space, relative to the video camera or viewer.
  • Most mammals (including humans) use binaural hearing to localize sound, by comparing the information received from each ear in a complex process that involves a significant amount of synthesis. It is possible to recreate a similar effect through stereo headphones, with greater sound separation possible over more complex multi-channel headphones, sometimes referred to as surround sound headphones.
  • Figure 5c illustrates a third arrangement with four quality levels.
  • a background quality level for background segments 581, and a high quality level for segments in the region of interest 574; but here there are two intermediate quality levels for two concentric rings of boundary zones between the region of interest 574 and the background segments 581.
  • a first boundary zone 576 comprising segments encoded at a first intermediate quality level, and outside of this there is a second boundary zone 578 comprising segments encoded at a second intermediate quality level.
  • the first intermediate quality level is lower quality than the high quality level of the region of interest.
  • the second intermediate quality is lower quality than the first intermediate quality.
  • the background quality level is lower quality than the second intermediate quality.
  • the intermediate quality levels used for the segments within the rings of curiosity 576, 578 allow a user's view to stray outside the region of interest and still receive an improved quality video experience over the quality level of the background segments. Additionally, the intermediate quality levels help disguise the difference in quality between the relatively high quality segments within the region of interest 574 and the relatively low quality segments of the background 581.
  • an audio event occurs at a location 670 shown overlaid on the grid of segments of the immersive video 680.
  • This audio event occurs at t0 and a region of interest 674 with a single boundary zone 676 is defined as shown in figure 5b above.
  • each segment comprises a 30 second chunk of video.
  • a chunk has the same quality level for its duration.
  • Figure 6a shows the quality level used for the preceding chunks in the vicinity of the upcoming audio event 670.
  • the segments that will be in the region of interest 674 comprise a precursor region of interest 672 having a quality level the same as the intermediate quality level of the boundary zone 676.
  • Precursor region of interest 672 can be thought of as a temporal boundary zone, whereas the previously introduced boundary zones are spatial boundary zones.
  • the transmission apparatus may be further arranged to select the versions of the video segments encoded at a lower quality for transmission to the user device when transmitting segments outside the region of interest.
  • Figure 10 illustrates a method 1000 for transmitting immersive video to a user device, the method comprising receiving 1010 a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded 1020 at a plurality of quality levels; identifying 1030 an audio source in the 3D audio stream to determine at least one region of interest, and when transmitting segments in the region of interest, selecting 1040 the versions of the video segments encoded at a higher quality for transmission to the user device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

A video processing apparatus arranged to receive a video stream and an accompanying 3D audio stream, and slice the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream. The video processing apparatus is further arranged to identify an audio source in the 3D audio stream to determine at least one region of interest, and encode each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.

Description

METHOD AND APPARATUS FOR IMPROVED
ENCODING OF IMMERSIVE VIDEO
Technical field
The present application relates to a video processing apparatus, a method in a video processing apparatus, a transmission apparatus, a method for transmitting immersive video to a user device, and a computer-readable medium.
Background
Virtual reality is a technology that is gaining interest. Current implementations include headsets based on the use of mobile phones with acceleration sensors strapped closely in front of the eyes. The image displayed on the screen changes in response to the acceleration signals and hence to the position of the user's head. This allows the user to look around a virtual environment, giving a sense of immersion. One challenge associated with this technology is how to get the video data to the display in such a way as to minimize latency. One way is to have the source content on a system remote to the display, a server or PC for example. The sensor data is then transmitted to the remote system, which responds by transmitting only the area of the image in the field of view to the display. This reduces the data bandwidth required for transmission of the immersive video, but requires a very low system latency for the delay between head movement and display update to be imperceptible to the user. An alternative method is to send the full 360° immersive video to the display device, where it can be buffered in the display memory and decoded and displayed when necessary, in response to the sensor signals. The latter method has the potential to respond very quickly to user movement and so give a good user experience, but it requires that the immersive video be transmitted to and stored on the device, which requires a larger data bandwidth and/or compression of the video stream.
In addition to immersive video, 3D audio systems allow an audio decoder to position an audio object anywhere in the user's audio environment. Whether delivered via stereo headphones, multi-channel headphones or multi-channel loudspeaker systems, the audio system can respond to the user's head position and thus anchor the audio stimulus to the video in the virtual environment. Immersive video describes a video of a scene, where the view in multiple directions is viewed or is at least viewable at the same time. Immersive video is sometimes described as recording the view in every direction, sometimes with a caveat excluding the camera support. Strictly interpreted, this is an unduly narrow definition, and in practice the term immersive video is applied to any video with a very wide field of view.
Immersive video is better described as video where a viewer is expected to watch only a portion of the video at any one time. For example, the IMAX® motion picture film format, developed by the IMAX Corporation provides very high resolution video to viewers on a large screen where it is normal that at any one time some portion of the screen is outside of the viewer's field of view. This is in contrast to a smartphone display or even a television, where usually a viewer can see the whole screen at once.
US 6,141,034 to Immersive Media describes a system for dodecahedral imaging, which is used for the creation of extremely wide-angle images. This document describes the geometry required to align camera images. Further, standard cropping mattes for dodecahedral images are given, and compressed storage methods are suggested for a more efficient distribution of dodecahedral images in a variety of media. US 3,757,040 to The Singer Company describes a wide-angle display for digitally generated information. In particular, the document describes how to display an image stored in planar form onto a non-planar display.
In an immersive video system it is efficient to encode only some of the tiles at high quality. However, it is hard to appropriately select which tiles to encode at higher quality.
PCT/EP2014/070936 (Ericsson ref: P43964) describes using the current location of the user's view to transmit particular tiles at higher quality. This also describes using a statistical measure of popularity to focus encoding effort on the more popular tiles.
EP1087618 uses view direction feedback in immersive imagery. The apparatus determines which portions of an immersive presentation to present to a viewer, in response to a time series of view direction information received from the viewer. The record of view direction is called a VDL (view direction log). This document goes on to describe varying degrees of compression, where those degrees of compression are in response to the view direction log, whether for an individual viewer or for a plurality of viewers. The prior art methods can reduce bandwidth consumption where immersive video is streamed to a user, but they require some test phase where at least initially all tiles are encoded with high quality versions available to a viewer in case the viewer should look in the direction of that tile. There is a need for an improved method for identifying the tiles of an immersive video that are more likely to be observed by a viewer of the immersive video, preferably before said video is encoded or viewed.
Summary
This document describes using the location of sounds within a 3D sound space to determine where a viewer is likely to look within an immersive video and therefore require high quality tiles in that field of view. A technology used in conjunction with immersive video is three dimensional (3D) sound localization, which allows a sound source to be located within 3D space. When presented with immersive video, a viewer can be expected to focus on sound sources to locate a portion of the immersive video to view. Indeed, sound cues can be used in certain immersive videos to encourage the viewer to turn to and view a particular area of the immersive video.
The 3D location of sounds in the soundtrack accompanying the immersive video are used to identify tiles of that immersive video that are more likely to be viewed and therefore should be encoded at high quality. The remaining tiles can be encoded in a low quality version only and yet still give a user an acceptable viewing experience.
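By way of a non-limiting illustration (this sketch does not appear in the original filing), the following Python fragment shows one possible way of labelling the tiles of an immersive video as high or low quality from the direction of a 3D audio source. The equirectangular 6 by 12 tile grid, the 40 degree angular radius and the quality labels are all assumed values, and the function names are purely illustrative.

```python
import math

def tile_centre(row, col, n_rows, n_cols):
    """(azimuth, elevation) in degrees of a tile centre on an equirectangular
    grid covering 360 degrees horizontally and 180 degrees vertically."""
    az = (col + 0.5) / n_cols * 360.0 - 180.0
    el = 90.0 - (row + 0.5) / n_rows * 180.0
    return az, el

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    cos_d = (math.sin(e1) * math.sin(e2)
             + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

def quality_map(audio_sources, n_rows=6, n_cols=12, roi_radius_deg=40.0):
    """Label each tile 'high' if it lies within roi_radius_deg of any audio
    source direction, otherwise 'background'."""
    labels = [['background'] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        for col in range(n_cols):
            az, el = tile_centre(row, col, n_rows, n_cols)
            for src_az, src_el in audio_sources:
                if angular_distance(az, el, src_az, src_el) <= roi_radius_deg:
                    labels[row][col] = 'high'
                    break
    return labels

# Example: a single sound source slightly to the viewer's right.
print(quality_map([(30.0, 0.0)]))
```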
Accordingly, there is provided a video processing apparatus arranged to receive a video stream and an accompanying 3D audio stream, and slice the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream. The video processing apparatus is further arranged to identify an audio source in the 3D audio stream to determine at least one region of interest, and encode each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
Sound sources in the 3D audio stream are used as a proxy indicator to identify regions of interest within the immersive video. Given that a user is unable to view the whole immersive video at once, efficiencies in encoding, storage and/ or transmission can be had by devoting greater compression effort to identified regions of interest.
A video segment may be considered as co-located with an audio source if they are located in the same direction in virtual space from the viewpoint of a viewer. A video segment may be considered as co-located with an audio source if they are located in the same field of view from the viewpoint of a viewer. The direction in virtual space or field of view from the viewpoint of the viewer define the region of interest. Where the audio source is a point location, the region of interest may comprise several video segments surrounding that point location. The video stream and the accompanying 3D audio stream may comprise an immersive media presentation.
The compression effort may determine the quality of the encoded video segment.
Therefore, the video segments within a region of interest are encoded at higher quality. The compression effort may be increased by the application of pre-smoothing, by reducing the quantization parameter (QP), or by allowing more bits for the encoded video segment.
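As an assumed, illustrative realisation of "compression effort" (the disclosure does not prescribe concrete numbers), the region-of-interest label of a segment could be translated into a QP offset and a bitrate share before encoding. The base QP, the offsets and the shares below are made-up values used only for the example.

```python
# Illustrative mapping from ROI label to encoder settings: a lower QP and a
# larger bitrate share mean more compression effort, hence higher quality.
QP_OFFSET = {'roi': -6, 'boundary': -3, 'background': 0}
BITRATE_SHARE = {'roi': 0.5, 'boundary': 0.3, 'background': 0.2}

def encoder_settings(label, base_qp=32, total_bitrate_kbps=20000, n_segments=1):
    """Return (qp, per-segment bitrate in kbps) for a segment with this label.
    Pre-smoothing would be a further, separate lever and is not modelled here."""
    qp = base_qp + QP_OFFSET[label]
    bitrate = total_bitrate_kbps * BITRATE_SHARE[label] / n_segments
    return qp, bitrate

print(encoder_settings('roi', n_segments=9))          # (26, ~1111 kbps)
print(encoder_settings('background', n_segments=50))  # (32, 80 kbps)
```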
A region of interest may be an area of a predetermined size created when an audio source is identified having a loudness and/or duration exceeding respective thresholds. Further, a region of interest may be an area of a predetermined size created when an audio source is identified as speech.
The region of interest may be surrounded by a boundary zone, such that video segments in the boundary zone are encoded with less encoding effort than those in the region of interest but more encoding effort than background video segments. A background video segment is one that is less likely to be viewed by a viewer. The background video segment is unrelated to background and foreground objects in a video scene.
There is further provided a method in a video processing apparatus, the method comprising receiving a video stream and an accompanying 3D audio stream, and slicing the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream. The method further comprises identifying an audio source in the 3D audio stream to determine at least one region of interest, and encoding each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
There is further provided a transmission apparatus for transmitting immersive video to a user device, the transmission apparatus arranged to receive a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels, and identify an audio source in the 3D audio stream to determine at least one region of interest. The transmission apparatus is further arranged to select the versions of the video segments encoded at a higher quality for transmission to the user device when transmitting segments in the region of interest.
Sound sources in the 3D audio stream are used as a proxy to identify regions of interest within the immersive video. Given that a user is unable to view the whole immersive video at once, efficiencies can be had by devoting more transmission resources to identified regions of interest within the immersive video.
The transmission apparatus may be further arranged to select the versions of the video segments encoded at a lower quality for transmission to the user device when transmitting segments outside the region of interest.
There is further provided a method for transmitting immersive video to a user device, the method comprising receiving a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels; identifying an audio source in the 3D audio stream to determine at least one region of interest, and when transmitting segments in the region of interest, selecting the versions of the video segments encoded at a higher quality for transmission to the user device.
There is further provided a computer-readable medium, carrying instructions, which, when executed by computer logic, causes said computer logic to carry out any of the methods defined herein. There is further provided a computer-readable storage medium, storing instructions, which, when executed by computer logic, causes said computer logic to carry out any of the methods defined herein. The computer program product may be in the form of a non-volatile memory or volatile memory, e.g. an EEPROM (Electrically Erasable Programmable Read-only Memory), a flash memory, a disk drive or a RAM (Random-access memory).
There is further provided an apparatus for processing immersive video comprising a processor and a memory, said memory containing instructions executable by said processor whereby said apparatus is operative to: receive a video stream and an accompanying 3D audio stream; slice the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream; identify an audio source in the 3D audio stream to determine at least one region of interest; and encode each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
There is further provided an apparatus for transmitting video segments to a user device, the apparatus comprising a processor and a memory, said memory containing instructions executable by said processor whereby said apparatus is operative to: receive a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels; identify an audio source in the 3D audio stream to determine at least one region of interest; and when transmitting segments in the region of interest, select the versions of the video segments encoded at a higher quality for transmission to the user device.
Brief description of the drawings
A method and apparatus for improved encoding of immersive video will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 illustrates a user terminal displaying a portion of an immersive video;
Figure 2 shows a man watching an immersive video on his smartphone and wearing headphones;
Figure 3 shows a woman watching an immersive video on a virtual reality headset and wearing earbuds;
Figure 4 illustrates an arrangement wherein the video segments made available to the user terminal each relate to a different field of view taken from a different location;
Figures 5a, 5b and 5c show different examples of regions of interest;
Figures 6a, 6b and 6c illustrate a time sequence showing how a region of interest may vary with time;
Figure 7 illustrates an apparatus for processing immersive video;
Figure 8 shows a method in a video processing apparatus;
Figure 9 shows an apparatus for transmitting video segments to a user device; and
Figure 10 illustrates a method for transmitting immersive video to a user device.
Detailed description
Figure 1 illustrates a user terminal 100 displaying a portion of an immersive video 180. The user terminal is shown as a smartphone and has a screen 110, which is shown displaying a selected portion 185 of immersive video 180. In this example, immersive video 180 is a panoramic or cylindrical view of a city skyline.
Smartphone 100 comprises gyroscope sensors to measure its orientation, and in response to changes in its orientation the smartphone 100 displays different sections of immersive video 180. For example, if the smartphone 100 were rotated to the left about its vertical axis, the portion 185 of video 180 that is selected would also move to the left and a different area of video 180 would be displayed. The user terminal 100 may comprise any kind of personal computer such as a television, a smart television, a set-top box, a games-console, a home-theatre personal computer, a tablet, a smartphone, a laptop, or even a desktop PC.
It is apparent from figure 1 that where the video 180 is stored at a location remote from the user terminal 100, transmitting the video 180 at high quality over its entirety to the user terminal, just for selected portion 185 to be displayed, is inefficient. This inefficiency is addressed by the system and apparatus described herein. As described herein, an immersive video, such as video 180, is separated into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream. Each video segment is separately encoded. The user terminal is arranged to select a subset of the available video segments, retrieve only the selected video segments, and to knit these together to form a knitted video image that is larger than a single video segment. Referring to the example of figure 1, the knitted video image comprises the selected portion 185 of the immersive video 180. With modern viewing methods using a handheld device like a smartphone or a virtual reality headset, only a portion of the video is displayed at any one time. Nevertheless, video recorded for playback must encompass the entire possible field of view which can be displayed in response to the sensor inputs. In order for the video to not appear pixelated over the relatively small portion of video shown, the total resolution of the immersive video must be high. For real time transmission of such video this high resolution presents a problem.
There is disclosed herein an arrangement that makes use of audio cues such as dialogue or sudden, loud stimuli to make assumptions about the region of the video most likely to be observed. Standard video encoding concepts such as pre-smoothing or adaptive quantization parameter (QP) can then be applied to provide more bits, and therefore greater perceived image quality in these regions of interest (ROIs). The segments outside the region of interest, the background segments, are still made available, but at lower quality as these are less likely to be viewed.
Figure 2 shows a man watching a video 280 on his smartphone 200 and wearing headphones 220. Smartphone 200 has a display 210 which displays area 285 of the video 280. Headphones 220 are connected to smartphone 200 by a wireless connection and play audio associated with the video 280. The video 280 is split into a plurality of segments 281. The segments 281 are illustrated in figure 2 as tiles of a sphere, representing the total area of the video 280 that is available for display by smartphone 200 as the user changes the orientation of this user terminal. Similarly the audio played through headphones 220 is derived from a 3D audio stream and the sound played in the headphones 220 is dependent upon the detected orientation of the smartphone 200. The displayed area 285 of video 280 spans six segments or tiles 281. In this embodiment, only the six segments 290 which are included in displayed area 285 are selected by the user terminal for retrieval. The set of video segments displayed by the user terminal is defined by a physical location and/ or orientation of the user terminal. This information is obtained from sensors in the user terminal, such as a magnetic sensor (or compass), and a gyroscope. Alternatively, the user terminal may have a camera and use this together with image processing software to determine a relative orientation of the user terminal. The segment selection may also be based on user input to the user terminal. For example, such a user input may be via a touch screen on the smartphone 200.
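A minimal sketch of this segment-selection step is given below, assuming an equirectangular grid of tiles and a rectangular viewport described by yaw, pitch and field-of-view angles. The grid size, the field-of-view values and the 15 degree guard band are illustrative assumptions, not values taken from the disclosure.

```python
def visible_tiles(yaw_deg, pitch_deg, hfov_deg=100.0, vfov_deg=60.0,
                  n_rows=6, n_cols=12):
    """Return the set of (row, col) tiles of an equirectangular grid needed to
    cover a viewport centred on (yaw, pitch).  A tile is selected if its centre
    falls inside the viewport rectangle padded by an assumed 15 degree guard band."""
    tiles = set()
    for row in range(n_rows):
        for col in range(n_cols):
            az = (col + 0.5) / n_cols * 360.0 - 180.0
            el = 90.0 - (row + 0.5) / n_rows * 180.0
            # wrap the azimuth difference into [-180, 180)
            d_az = (az - yaw_deg + 180.0) % 360.0 - 180.0
            d_el = el - pitch_deg
            if abs(d_az) <= hfov_deg / 2 + 15 and abs(d_el) <= vfov_deg / 2 + 15:
                tiles.add((row, col))
    return tiles

# Viewer looking straight ahead at the horizon:
print(sorted(visible_tiles(yaw_deg=0.0, pitch_deg=0.0)))
```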
Figure 3 shows a woman watching an immersive video 380 on a virtual reality headset 300. The virtual reality headset 300 comprises a display 310 and earbuds 320. The display 310 may comprise a screen, or a plurality of screens, or a virtual retina display that projects images onto the retina. The earbuds 320 play audio associated with the immersive video 380. Video 380 is segmented into individual segments 381. The segments 381 are again illustrated here as tiles of a sphere, representing which area of the video 380 may be selected for display by virtual reality headset 300 as the user changes the orientation of her head, and also the orientation of the headset strapped to her head. Similarly the audio played through earbuds 320 is derived from a 3D audio stream and the sound played in the earbuds 320 is dependent upon the detected orientation of the headset 300. The displayed area 385 of video 380 spans seven segments or tiles 381. In the examples of both figures 2 and 3 a subset of available tiles is selected for display to the viewer. In each example either all segments may be retrieved by the user device, or only a subset including those segments required for display are retrieved by the user device. The former is more likely in locally stored immersive video, the latter more likely for streamed immersive video.
The segments in figures 2 and 3 are illustrated as tiles of a sphere. Alternatively, the segments may comprise tiles on the surface of a cylinder. Where the segments relate to tiles of the surface of a cylinder, then the vertical extent of the immersive video is limited by the top and bottom edges of that cylinder. If the cylinder wraps fully around the user, then this may accurately be described as 360 degree video.
The selection of a subset of video segments by the user terminal is defined by a physical location and/ or orientation of the headset 300. This information is obtained from gyroscope and/ or magnetic sensors in the headset. The selection may also be based on user input to the user terminal. For example such a user input may be via a keyboard connected to the headset 300. Segments 281, 381 of the video 280, 380 relate to a different field of view taken from a common location in either the real or virtual worlds. That is, the video may be generated by a device having a plurality of lenses pointing in different directions to capture different fields of view. Alternatively, the video may be generated from a virtual world, using graphical rendering techniques in a computer. Such graphical rendering may comprise using at least one virtual camera to translate the information of the three dimensional virtual world into a two dimensional image for display on a screen. Further, video segments 281, 381 relating to adjacent fields of view may include a proportion of the view that is common to both segments. Such a proportion may be considered an overlap, or a field overlap. Such an overlap is not illustrated in the figures attached hereto for clarity.
The audio played to the user over headphones 220 or earbuds 320 is derived from a three dimensional (3D) audio track. The 3D audio track comprises sounds, or audio tracks, and their source locations in 3D space, relative to the video camera or viewer. Most mammals (including humans) use binaural hearing to localize sound, by comparing the information received from each ear in a complex process that involves a significant amount of synthesis. It is possible to recreate a similar effect through stereo headphones, with greater sound separation possible over more complex multi-channel headphones, sometimes referred to as surround sound headphones.
The sounds may be taken from a location common to the camera in either the real or virtual worlds. That is, the video may be generated by a real world recording device having a plurality of lenses pointing in different directions to capture different fields of view. Here, the recording device will include a plurality of directional microphones to record sound and its location. The location of an audio source is usually determined by the direction of the incoming sound waves (horizontal and vertical angles) and the distance between the source and sensors. The 3D audio track is created using the structural arrangement design of the sensors and signal processing techniques.
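The 3D audio track can be modelled as a list of timed audio objects, each carrying a direction, distance and level relative to the camera. The record below is an assumed, simplified representation for illustration only; the field names do not correspond to any particular 3D audio format.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One sound event in the 3D audio track, located relative to the camera."""
    start_s: float        # start time within the presentation, seconds
    duration_s: float     # how long the sound lasts, seconds
    azimuth_deg: float    # horizontal angle of arrival, degrees
    elevation_deg: float  # vertical angle of arrival, degrees
    distance_m: float     # distance from the camera / viewer, metres
    level_db: float       # loudness, dB relative to full scale
    is_speech: bool       # content-type flag, e.g. dialogue

# e.g. a door slamming to the viewer's right:
door_slam = AudioObject(start_s=42.0, duration_s=1.5,
                        azimuth_deg=90.0, elevation_deg=0.0,
                        distance_m=3.0, level_db=-6.0, is_speech=False)
```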
Alternatively, the video may be generated from a virtual world, using graphical rendering techniques in a computer. Such graphical rendering may comprise using at least one virtual camera to translate the information of the three dimensional virtual world into a two dimensional image for display on a screen. Here, audio events located within the virtual world are captured together with their position relative to the virtual camera.
Video segments 281, 381 relating to adjacent fields of view may include a proportion of view that is common to both segments. Such a proportion may be considered an overlap, or a field overlap. Such an overlap is not illustrated in the figures attached hereto for clarity.
This disclosure describes a compression system for 360° video recorded for virtual reality reproduction. Aspects of the video compression, specifically the adaptive QP algorithm, use input from the associated 3D audio to predict the region of interest and increase the quality (by using, for example, a lower QP or different mode decision) of that region and/or regions which must be traversed during the associated change in field of view. By doing so, the system will reduce the required data associated with the 360° video which will, in turn, reduce the storage and/or transmission requirements. One example would be in a virtual reality depiction of a film scene in which the user is immersed in the same room as two actors having a conversation. A portion of the 360° video would include the actors and audio objects would be associated with the actors' voices. The audio objects associated with the actors' voices will have a position in 3D space. This location will have an associated region of the 360° video which the compression algorithm will increase the quality of. Within the constraint of a constant bitrate encoding this will naturally require that other regions of the image will be encoded at a reduced quality. One such region might, for example, be the ceiling above the user where they are unlikely to be focused. On presentation of another audio cue, a door opening to the right of the user for example, the user will likely turn to look in the direction of the associated audio object, particularly if it is loud. The algorithm would then reduce the quality of video in the region of the actors and increase it in the region of video associated with the new audio cue. In some cases, the trajectory of field of view may be predictable. In the case described above, for example, it might be assumed that the head would be turned directly from the dialogue audio to the opening door audio cue. In this case, the video between the two audio objects could be treated in a particular way. For example, it is known that high frequencies are less important when the video is in motion; it is also true that they are hard to encode efficiently. Therefore, by smoothing the video data in the trajectory between the two audio object locations, an improved encoding efficiency can be achieved.
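A sketch of the trajectory idea is given below, under the simplifying assumption that the view turns along a straight interpolation in azimuth and elevation between the two audio object directions rather than a true great-circle path. The tiles returned would be the candidates for pre-smoothing; the grid size and step count are assumed values.

```python
def tiles_on_trajectory(src, dst, n_rows=6, n_cols=12, steps=20):
    """Tiles crossed when the view turns from direction `src` to `dst`
    (each an (azimuth_deg, elevation_deg) pair)."""
    (az0, el0), (az1, el1) = src, dst
    # take the short way round in azimuth
    d_az = (az1 - az0 + 180.0) % 360.0 - 180.0
    crossed = set()
    for i in range(steps + 1):
        t = i / steps
        az = (az0 + t * d_az + 180.0) % 360.0 - 180.0
        el = el0 + t * (el1 - el0)
        col = min(int((az + 180.0) / 360.0 * n_cols), n_cols - 1)
        row = min(int((90.0 - el) / 180.0 * n_rows), n_rows - 1)
        crossed.add((row, col))
    return crossed

# e.g. dialogue straight ahead, then a door opening to the right:
print(sorted(tiles_on_trajectory((0.0, 0.0), (90.0, 0.0))))
```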
The system will act upon an input comprising a baseband 360° video stream and an associated, time aligned, audio stream. The audio stream will be decoded as objects and the associated location of each object tracked. Of all the audio objects, one or more dominant audio objects will be identified. The identification of dominant objects may be based on a mixture of the content type, dialogue for example, and/or the level in the case of explosions, gunshots etc. The spatial region or regions of the 360° video which will be in the field of view when the user is looking in the direction of the audio object will be identified for each dominant object. The rate control algorithm controlling the QP and/or mode decisions in the video encoder is informed of these regions of interest via, for example, a quality demand map. The encoder can then make decisions which increase the encoded video quality in these regions relative to other areas.
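The following sketch illustrates, under assumed thresholds and an assumed tile grid, how dominant audio objects might be selected and turned into a per-tile quality demand map for the rate control algorithm. The level threshold, the linear fall-off with angular distance and the grid dimensions are illustrative assumptions only.

```python
import math

def dominant_objects(audio_objects, level_threshold_db=-12.0):
    """Pick the audio objects likely to attract the viewer's gaze:
    dialogue, or anything louder than an (assumed) level threshold."""
    return [o for o in audio_objects
            if o['is_speech'] or o['level_db'] >= level_threshold_db]

def quality_demand_map(audio_objects, n_rows=6, n_cols=12, roi_radius_deg=40.0):
    """Per-tile demand in [0, 1]; 1 asks the rate control for maximum quality."""
    demand = [[0.0] * n_cols for _ in range(n_rows)]
    for obj in dominant_objects(audio_objects):
        for row in range(n_rows):
            for col in range(n_cols):
                az = (col + 0.5) / n_cols * 360.0 - 180.0
                el = 90.0 - (row + 0.5) / n_rows * 180.0
                d_az = (az - obj['azimuth_deg'] + 180.0) % 360.0 - 180.0
                d_el = el - obj['elevation_deg']
                dist = math.hypot(d_az, d_el)   # crude angular distance
                # demand falls off linearly with distance from the source
                demand[row][col] = max(demand[row][col],
                                       max(0.0, 1.0 - dist / roi_radius_deg))
    return demand

objs = [{'is_speech': True, 'level_db': -20.0,
         'azimuth_deg': 0.0, 'elevation_deg': 0.0}]
print(quality_demand_map(objs)[3])   # demand across a row near eye level
```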
Figure 4 illustrates an alternative arrangement wherein the video segments made available to the user terminal each relate to a different field of view taken from a different location. In such an arrangement each segment relates to a different point of view. The different location may be in either the real or virtual worlds. A plan view of such an arrangement is illustrated in figure 4. A video 480 is segmented into a grid of segments 481, a plan view of this is illustrated. At a first viewing position 420 the viewer sees display area 485a and the four segments that are required to show that area. The viewing position then moves, and at the new position 425 a different field of view 485b is shown to the user representing a sideways translation, side-step, or strafing motion within the virtual world 450. Transitioning from one set of segments to another thus gives the impression of a camera moving within the world.
Two hardware examples are given above; figure 2 shows the user terminal as a smartphone 200, and figure 3 shows the user terminal as a virtual reality headset 300. In alternative embodiments the user terminal may comprise any one of a smartphone, tablet, television, set top box, or games console. Further, the above embodiments refer to the user terminal displaying a portion of an immersive video. It should be noted that the video image may be any large video image, such as a high resolution video, an immersive video, a "360 degree" video, or a wide-angled video. The term "360 degree" is sometimes used to refer to a total perspective view, but the term is apparently a misnomer with a true 360 degree view only giving a full perspective view within one plane. The plurality of video segments relating to the total available field of view, or total video area, are each encoded at different quality levels. In particular, during encoding of the immersive video, the 3D audio track is analyzed to identify regions of interest within the immersive video, then segments within that region of interest are encoded using more processing resources than segments outside a region of interest. The additional processing resources result in the segments in the region of interest being encoded at higher quality than segments outside of a region of interest. Further, this may result in the segments outside of a region of interest being encoded at a lower bitrate than those within a region of interest. Whether an immersive video is stored locally or transmitted in advance of viewing, the above scheme results in efficient distribution of bits such that segments within a region of interest take a disproportionate share of the bandwidth available for encoding the immersive video. It should be noted that this approach also has benefits in the case where a user terminal streams an immersive video and only retrieves segments required for display at any one time. In the unlikely event that a user looks outside a region of interest they will receive an acceptable but lower quality visual experience. However, the majority of users will focus on the region of interest and so receive a high quality visual experience of that section of the immersive media presentation. Although the bandwidth usage may not be significantly reduced in such an arrangement, fewer encoding resources will be required to generate the encoded immersive video. The quality level of an encoded video segment may be determined by the bit rate, the quantization parameter, or the pixel resolution. A lower quality segment should require fewer resources for encoding, transmission and processing. By assigning more compression resources and/or more bitrate to segments within a region of interest a good user experience can be provided with an efficient use of resources.
Figures 5a, 5b and 5c show different examples of regions of interest. Figure 5a is a simple arrangement of an immersive video 580 comprising an array of segments. The location of an audio source 570 is illustrated overlaid on the immersive video 580. This defines a region of interest 572 comprising a 3 by 3 array of segments. The region of interest 572 is defined here as the segment within which the sound source is located and all surrounding segments. The segments in the region of interest 572 are encoded at higher quality than the surrounding background segments 581.
Figure 5b illustrates a more complex arrangement where three quality levels are used: one for background segments 581, one for segments in the region of interest 574, and a third intermediate quality level for segments in a boundary zone between the region of interest and the background segments. This boundary zone is shown as an annular area 576. Here, the region of interest 574 comprises a 2 by 2 array of segments defined as the segments surrounding the segment vertex nearest the audio source 570. The boundary zone 576 is defined as a single segment thick ring around the region of interest 574. The intermediate quality level used for the segments within the boundary zone 576 allows a user's view to stray outside the region of interest and still receive an improved quality video experience over the quality level of the background segments. Additionally, the intermediate quality level helps disguise the difference in quality between the relatively high quality segments within the region of interest 574 and the relatively low quality segments of the background 581.
Figure 5c illustrates a third arrangement with four quality levels. As before, there is a background quality level for background segments 581, and a high quality level for segments in the region of interest 574; but here there are two intermediate quality levels for two concentric rings of boundary zones between the region of interest 574 and the background segments 581. There is a first boundary zone 576 comprising segments encoded at a first intermediate quality level, and outside of this there is a second boundary zone 578 comprising segments encoded at a second intermediate quality level. The first intermediate quality level is lower quality than the high quality level of the region of interest. The second intermediate quality is lower quality than the first intermediate quality. The background quality level is lower quality than the second intermediate quality.
Here, the region of interest 574 comprises a 2 by 2 array of segments defined as the segments surrounding the segment vertex nearest the audio source 570. The first boundary zone 576 is defined as a single segment thick ring around the region of interest 574. The second boundary zone 578 is defined as a single segment thick ring around the first boundary zone 576.
The intermediate quality levels used for the segments within the rings of curiosity 576, 578 allow a user's view to stray outside the region of interest and still receive an improved quality video experience over the quality level of the background segments. Additionally, the intermediate quality levels help disguise the difference in quality between the relatively high quality segments within the region of interest 574 and the relatively low quality segments of the background 581.
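The concentric boundary zones of figures 5a to 5c can be computed on the segment grid by repeatedly taking the neighbours of the current region, as sketched below. The 6 by 12 grid, the 8-neighbourhood, the horizontal wrap-around and the two-ring depth are assumptions made for the example.

```python
def expand_ring(region, n_rows, n_cols):
    """Tiles that border `region` (8-neighbourhood), excluding tiles in it."""
    ring = set()
    for (r, c) in region:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, (c + dc) % n_cols   # wrap horizontally
                if 0 <= rr < n_rows and (rr, cc) not in region:
                    ring.add((rr, cc))
    return ring

def quality_zones(roi, n_rows=6, n_cols=12, n_rings=2):
    """Assign 'roi', 'ring1', 'ring2', ... and 'background' labels per tile."""
    labels = {tile: 'roi' for tile in roi}
    covered = set(roi)
    for i in range(1, n_rings + 1):
        ring = expand_ring(covered, n_rows, n_cols)
        for tile in ring:
            labels[tile] = f'ring{i}'
        covered |= ring
    for r in range(n_rows):
        for c in range(n_cols):
            labels.setdefault((r, c), 'background')
    return labels

# 2 by 2 region of interest around the segment vertex nearest the audio source:
roi = {(2, 5), (2, 6), (3, 5), (3, 6)}
zones = quality_zones(roi)
print(sum(1 for v in zones.values() if v == 'ring1'))  # 12 tiles in the first ring
```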
Additional quality levels and intermediate rings of curiosity may be applied. The relative quality levels may be proportional or a sliding scale. And the size of the different quality areas may be varied, and are not limited to a thickness of just a single segment. More complex definitions of regions of interest may take into account a plurality of sound sources having distinct but close together locations. Furthermore, the temporal extent of the region of interest may be defined as starting at the time instant of the sound source and extending for 5 seconds thereafter. However, where the segments are encoded as video chunks, each lasting, say, 30 seconds, then the region of interest may be defined as extending for the length of a chunk that contains the audio event. Figures 6a, 6b and 6c illustrate a time sequence showing how a region of interest may vary with time. In figure 6b an audio event occurs at a location 670 shown overlaid on the grid of segments of the immersive video 680. This audio event occurs at t0 and a region of interest 674 with a single boundary zone 676 is defined as shown in figure 5b above. In this example each segment comprises a 30 second chunk of video. A chunk has the same quality level for its duration. Figure 6a shows the quality level used for the preceding chunks in the vicinity of the upcoming audio event 670. Here, the segments that will be in the region of interest 674 comprise a precursor region of interest 672 having a quality level the same as the intermediate quality level of the boundary zone 676. Precursor region of interest 672 can be thought of as a temporal boundary zone, whereas the previously introduced boundary zones are spatial boundary zones.
Figure 6c shows the quality level for the following chunks in the vicinity of the now passed audio event 670. Here, the segments that were in the region of interest 674 and those that were in the boundary zone 676 comprise a descendent region of interest 678 having a quality level the same as the intermediate quality level of the boundary zone 676.
A different intermediate quality level could be used for either or both of the precursor region of interest 672 and the descendent region of interest 678.
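The temporal behaviour of figures 6a to 6c can be expressed as a small rule mapping a segment's spatial zone and a chunk index to a quality level, as in the assumed sketch below. The fixed chunk length and the one-chunk extent of the precursor and descendent regions follow the example in the text; everything else is illustrative.

```python
def chunk_quality(segment_zone, chunk_index, event_chunk):
    """Quality label for one video chunk, given the spatial zone of the segment
    ('roi', 'boundary' or 'background') and the chunk containing the audio event.
    Chunks are assumed to be fixed length, e.g. 30 seconds each."""
    if segment_zone == 'background':
        return 'background'
    if chunk_index == event_chunk:
        return 'high' if segment_zone == 'roi' else 'intermediate'
    if chunk_index == event_chunk - 1 and segment_zone == 'roi':
        return 'intermediate'      # precursor region of interest (figure 6a)
    if chunk_index == event_chunk + 1:
        return 'intermediate'      # descendent region of interest (figure 6c)
    return 'background'

# quality of an ROI segment over four consecutive chunks, event in chunk 2:
for i in range(4):
    print(i, chunk_quality('roi', i, event_chunk=2))
```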
Figure 7 illustrates an apparatus 700 for processing immersive video. The apparatus 700 comprises a processor 720, a memory 725, a receiver 730 and an output 740. The memory 725 contains instructions executable by said processor 720. Said apparatus is operative to: receive a video stream and an accompanying 3D audio stream at the receiver 730; slice the video stream into a plurality of video segments in the processor 720, each video segment relating to a different area of a field of view of the received video stream; identify an audio source in the 3D audio stream to determine at least one region of interest; and encode each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest. The encoded segments are output via output 740.
Sound sources in the 3D audio stream are used as a proxy to identify regions of interest within the immersive video. Given that a user is unable to view the whole immersive video at once, efficiencies can be had by devoting greater compression effort to identified regions of interest.
A video segment may be considered as collocated with an audio source if they are located in the same direction in virtual space from the viewpoint of a viewer. A video segment may be considered as collocated with an audio source if they are located in the same field of view from the viewpoint of a viewer. The direction in virtual space or field of view from the viewpoint of the viewer define the region of interest. Where the audio source is a point location, the region of interest may comprise several video segments surrounding that point location.
The video stream and the accompanying 3D audio stream may comprise an immersive media presentation. The compression effort may determine the quality of the encoded video segment. Therefore, the video segments within a region of interest are encoded at higher quality.
The compression effort may be increased by the application of pre-smoothing, by reducing the quantization parameter (QP), or by allowing more bits for the encoded video segment. A region of interest may be an area of a predetermined size created when an audio source is identified having a loudness and/or duration exceeding respective thresholds. Further, a region of interest may be an area of a predetermined size created when an audio source is identified as speech. The region of interest may be surrounded by a boundary zone, such that video segments in the boundary zone are encoded with less encoding effort than those in the region of interest but more encoding effort than background video segments. A background video segment is one that is less likely to be viewed by a viewer. The background video segment is unrelated to background and foreground objects in a video scene.
Figure 8 shows a method in a video processing apparatus, the method comprising receiving 810 a video stream and an accompanying 3D audio stream, and slicing 820 the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream. The method further comprises identifying 830 an audio source in the 3D audio stream to determine at least one region of interest, and encoding 840 each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
Figure 9 shows an apparatus 900 for transmitting video segments to a user device. The apparatus comprises a processor 920, a memory 925, a receiver 930, a transmitter 940 and a storage component 950. The memory 925 contains instructions executable by said processor 920. The receiver 930 receives a plurality of encoded video segments and an accompanying 3D audio stream, wherein the processor 920 encodes each video segment at a plurality of quality levels, and stores these in storage 950. The processor 920 identifies an audio source in the 3D audio stream to determine at least one region of interest. When the transmitter 940 is transmitting segments in the region of interest, the processor 920 selects the versions of the video segments encoded at a higher quality for transmission to the user device.
There is further provided a transmission apparatus for transmitting immersive video to a user device, the transmission apparatus arranged to receive a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels, and identify an audio source in the 3D audio stream to determine at least one region of interest. The transmission apparatus is further arranged to select the versions of the video segments encoded at a higher quality for transmission to the user device when transmitting segments in the region of interest. Sound sources in the 3D audio stream are used as a proxy to identify regions of interest within the immersive video. Given that a user is unable to view the whole immersive video at once, efficiencies can be had by devoting more transmission resources to identified regions of interest within the immersive video. The transmission apparatus may be further arranged to select the versions of the video segments encoded at a lower quality for transmission to the user device when transmitting segments outside the region of interest.
Figure 10 illustrates a method 1000 for transmitting immersive video to a user device, the method comprising receiving 1010 a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded 1020 at a plurality of quality levels; identifying 1030 an audio source in the 3D audio stream to determine at least one region of interest, and when transmitting segments in the region of interest, selecting 1040 the versions of the video segments encoded at a higher quality for transmission to the user device.
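A minimal sketch of the server-side selection step is shown below, assuming that each segment has been encoded in exactly two versions and that the region of interest is given as a set of segment identifiers. The identifiers and file names are purely illustrative and are not part of the disclosure.

```python
def select_version(segment_id, roi_segments, available):
    """Pick which encoded version of a segment to transmit.
    `available` maps segment id -> {'high': location, 'low': location};
    segments inside the region of interest get the high-quality version."""
    quality = 'high' if segment_id in roi_segments else 'low'
    return available[segment_id][quality]

available = {
    ('r2', 'c5'): {'high': 'seg_r2c5_hq.mp4', 'low': 'seg_r2c5_lq.mp4'},
    ('r0', 'c0'): {'high': 'seg_r0c0_hq.mp4', 'low': 'seg_r0c0_lq.mp4'},
}
roi = {('r2', 'c5')}
print(select_version(('r2', 'c5'), roi, available))  # high-quality version
print(select_version(('r0', 'c0'), roi, available))  # low-quality version
```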
There is further provided a computer-readable medium, carrying instructions, which, when executed by computer logic, causes said computer logic to carry out any of the methods defined herein. There is further provided a computer-readable storage medium, storing instructions, which, when executed by computer logic, causes said computer logic to carry out any of the methods defined herein. The computer program product may be in the form of a non-volatile memory or volatile memory, e.g. an EEPROM (Electrically Erasable Programmable Read-only Memory), a flash memory, a disk drive or a RAM (Random-access memory).
In this detailed description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. It will be understood by those skilled in the art, however, that the embodiments of the invention may be practiced without these specific details. In other instances, well known methods, procedures, components and circuits have not been described in detail so as not to obscure the embodiments of the invention. It can be appreciated that the specific structural and functional details disclosed herein may be representative and do not necessarily limit the scope of the invention.
An embodiment of the invention may include functionality that may be implemented as software executed by a processor, hardware circuits or structures, or a combination of both. The processor may be a general-purpose or dedicated processor, such as a processor from the family of processors made by Intel Corporation, Motorola Incorporated, Sun Microsystems Incorporated and others. The software may comprise programming logic, instructions or data to implement certain functionality for an embodiment of the invention. The software may be stored in a medium accessible by a machine or computer-readable medium, such as read-only memory (ROM), random-access memory (RAM), magnetic disk (e.g., floppy disk and hard drive), optical disk (e.g., CD-ROM) or any other data storage medium. In one embodiment of the invention, the media may store programming instructions in a compressed and/or encrypted format, as well as instructions that may have to be compiled or installed by an installer before being executed by the processor.
Alternatively, an embodiment of the invention may be implemented as specific hardware components that contain hard-wired logic for performing the recited functionality, or by any combination of programmed general-purpose computer components and custom hardware components.
It will be apparent to the skilled person that the exact order and content of the actions carried out in the method described herein may be altered according to the requirements of a particular set of execution parameters. Accordingly, the order in which actions are described and/ or claimed is not to be construed as a strict limitation on order in which actions are to be performed.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.
The above described method may be carried out by suitably adapted hardware, such as an adapted form of the exemplary hardware implementation of a video encoder as shown in Fig. 7, where the adaptation involves providing analysis of a 3D audio track to identify at least one region of interest.
The method may also be embodied in a set of instructions, stored on a computer readable medium, which when loaded into a computer processor, Digital Signal Processor (DSP) or similar, causes the processor to carry out the hereinbefore described method for improved encoding of an immersive video. One exemplary hardware embodiment is that of a Field Programmable Gate Array (FPGA) programmed to carry out the described method, located on a daughterboard of a rack mounted video encoder, for use in, for example, a television studio or satellite or cable TV head end.
Another exemplary hardware embodiment of the present invention is that of an encoder comprising an Application Specific Integrated Circuit (ASIC).
The user device may be a user apparatus. The user device may be any kind of personal computer such as a television, a smart television, a set-top box, a games-console, a home- theatre personal computer, a tablet, a smartphone, a laptop, or even a desktop PC.

Claims

Claims
1. A video processing apparatus arranged to:
receive a video stream and an accompanying 3D audio stream;
slice the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream;
identify an audio source in the 3D audio stream to determine at least one region of interest; and
encode each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
2. The video processing apparatus of claim 1, wherein the video stream and the accompanying 3D audio stream comprise an immersive media presentation.
3. The video processing apparatus of claim 1 or 2, wherein the compression effort determines the quality of the encoded video segment.
4. The video processing apparatus of any preceding claim, wherein the compression effort is increased by at least one of:
application of pre-smoothing;
reducing the quantization parameter (QP); and
allowing more bits for the encoded video segment.
5. The video processing apparatus of any preceding claim, wherein a region of interest is an area of a predetermined size created when an audio source is identified having a loudness and/or duration exceeding respective thresholds.
6. The video processing apparatus of any preceding claim, wherein a region of interest is an area of a predetermined size created when an audio source is identified as speech.
7. The video processing apparatus of any preceding claim, wherein the region of interest is surrounded by a boundary zone, such that video segments in the boundary zone are encoded with less encoding effort than those in the region of interest but more encoding effort than background video segments.
8. A method in a video processing apparatus, the method comprising:
receiving a video stream and an accompanying 3D audio stream;
slicing the video stream into a plurality of video segments, each video segment relating to a different area of a field of view of the received video stream;
identifying an audio source in the 3D audio stream to determine at least one region of interest; and
encoding each video segment, wherein the video processing apparatus applies more compression effort to the video segments in the region of interest.
9. The method of claim 8, wherein the video stream and the accompanying 3D audio stream comprise an immersive media presentation.
10. The method of claim 8 or 9, wherein the compression effort determines the quality of the encoded video segment.
11. The method of any of claims 8 to 10, wherein the compression effort is increased by at least one of:
application of pre-smoothing;
reducing the quantization parameter (QP); and
allowing more bits for the encoded video segment.
12. The method of any of claims 8 to 11, wherein a region of interest is an area of a predetermined size created when an audio source is identified having a loudness and/or duration exceeding respective thresholds.
13. The method of any of claims 8 to 12, wherein a region of interest is an area of a predetermined size created when an audio source is identified as speech.
14. The method of any of claims 8 to 13, wherein the region of interest is surrounded by a boundary zone, such that video segments in the boundary zone are encoded with less encoding effort than those in the region of interest but more encoding effort than background video segments.
15. A transmission apparatus for transmitting immersive video to a user device, the transmission apparatus arranged to:
receive a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels;
identify an audio source in the 3D audio stream to determine at least one region of interest; and
when transmitting segments in the region of interest, select the versions of the video segments encoded at a higher quality for transmission to the user device.
16. The transmission apparatus of claim 15, further arranged to: when transmitting segments outside the region of interest, select the versions of the video segments encoded at a lower quality for transmission to the user device.
17. The transmission apparatus of claim 15 or 16, wherein the video stream and the accompanying 3D audio stream comprise an immersive media presentation.
18. The transmission apparatus of any of claims 15 to 17, wherein a region of interest is an area of a predetermined size created when an audio source is identified having a loudness and/or duration exceeding respective thresholds.
19. The transmission apparatus of any of claims 15 to 18, wherein a region of interest is an area of a predetermined size created when an audio source is identified as speech.
20. The transmission apparatus of any of claims 15 to 19, wherein the region of interest is surrounded by a boundary zone, such that, when transmitting segments in the boundary zone, the versions of the video segments selected for transmission are encoded at a quality lower than that used in the region of interest but greater than that used for background video segments.
21. A method for transmitting immersive video to a user device, the method comprising:
receiving a plurality of encoded video segments and an accompanying 3D audio stream, wherein each video segment is encoded at a plurality of quality levels;
identifying an audio source in the 3D audio stream to determine at least one region of interest; and
when transmitting segments in the region of interest, selecting the versions of the video segments encoded at a higher quality for transmission to the user device.
22. A computer-readable medium carrying instructions which, when executed by computer logic, cause said computer logic to carry out any of the methods defined by claims 8 to 14 and 21.
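By way of illustration only, the Python sketch below shows one possible way a transmission apparatus could select among pre-encoded quality levels per segment in the manner recited in claims 15 to 21: segments in the audio-derived region of interest are sent at the highest available quality, boundary-zone segments at an intermediate quality, and background segments at the lowest. The quality labels, segment identifiers, and function name are assumptions made for this example and do not describe any particular implementation.

def select_renditions(segment_ids, roi_segments, boundary_segments=frozenset(),
                      qualities=("low", "mid", "high")):
    """Pick one encoded quality level per video segment: the highest for
    segments in the region of interest, an intermediate level for the
    boundary zone, and the lowest for background segments."""
    low, mid, high = qualities
    selection = {}
    for seg in segment_ids:
        if seg in roi_segments:
            selection[seg] = high
        elif seg in boundary_segments:
            selection[seg] = mid
        else:
            selection[seg] = low
    return selection

# Example: twelve segments, with the region of interest covering segments 5
# and 6 and segments 4 and 7 forming the boundary zone.
chosen = select_renditions(range(12), roi_segments={5, 6},
                           boundary_segments={4, 7})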
PCT/EP2017/075022 2017-10-02 2017-10-02 Method and apparatus for improved encoding of immersive video WO2019068310A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/075022 WO2019068310A1 (en) 2017-10-02 2017-10-02 Method and apparatus for improved encoding of immersive video

Publications (1)

Publication Number Publication Date
WO2019068310A1 (en) 2019-04-11

Family

ID=60043174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/075022 WO2019068310A1 (en) 2017-10-02 2017-10-02 Method and apparatus for improved encoding of immersive video

Country Status (1)

Country Link
WO (1) WO2019068310A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3757040A (en) 1971-09-20 1973-09-04 Singer Co Wide angle display for digitally generated video information
US6141034A (en) 1995-12-15 2000-10-31 Immersive Media Co. Immersive imaging method and apparatus
EP1087618A2 (en) 1999-09-27 2001-03-28 Be Here Corporation Opinion feedback in presentation imagery
WO2016191702A1 (en) * 2015-05-27 2016-12-01 Google Inc. Method and apparatus to reduce spherical video bandwidth to user headset
US20170251204A1 (en) * 2016-02-26 2017-08-31 Qualcomm Incorporated Independent multi-resolution coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JONG-SEOK LEE ET AL: "Efficient video coding based on audio-visual focus of attention", JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, vol. 22, no. 8, 19 November 2010 (2010-11-19), pages 704 - 711, XP028310863, ISSN: 1047-3203, [retrieved on 20101119], DOI: 10.1016/J.JVCIR.2010.11.002 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4120237A1 (en) * 2021-07-16 2023-01-18 Sony Interactive Entertainment Inc. Video recording and playback systems and methods
GB2609013A (en) * 2021-07-16 2023-01-25 Sony Interactive Entertainment Inc Video recording and playback systems and methods
WO2023097996A1 (en) * 2021-11-30 2023-06-08 上海商汤智能科技有限公司 Target analysis method and apparatus, computer device, storage medium, computer program, and computer program product

Similar Documents

Publication Publication Date Title
JP7029562B2 (en) Equipment and methods for providing and displaying content
KR102545195B1 (en) Method and apparatus for delivering and playbacking content in virtual reality system
US10645369B2 (en) Stereo viewing
US11218773B2 (en) Video delivery
US20160198140A1 (en) System and method for preemptive and adaptive 360 degree immersive video streaming
US11055057B2 (en) Apparatus and associated methods in the field of virtual reality
JP7085816B2 (en) Information processing equipment, information providing equipment, control methods, and programs
US20160277772A1 (en) Reduced bit rate immersive video
CN111466122A (en) Audio delivery optimization for virtual reality applications
KR20190038664A (en) Splitting content-based streams of video data
KR20170015938A (en) Methods and apparatus for delivering content and/or playing back content
JP2017528947A (en) System and method for use in playback of panoramic video content
JPWO2016009864A1 (en) Information processing apparatus, display apparatus, information processing method, program, and information processing system
EP3316247B1 (en) Information processing device, information processing method, and program
US20170339469A1 (en) Efficient distribution of real-time and live streaming 360 spherical video
WO2019141899A1 (en) Associated spatial audio playback
CN114040318A (en) Method and equipment for playing spatial audio
US10331862B2 (en) Viewport decryption
JP6809463B2 (en) Information processing equipment, information processing methods, and programs
WO2019068310A1 (en) Method and apparatus for improved encoding of immersive video
US11128892B2 (en) Method for selecting at least one image portion to be downloaded anticipatorily in order to render an audiovisual stream
JP6934052B2 (en) Display control device, display control method and program
US11956295B2 (en) Client-end enhanced view prediction for multi-view video streaming exploiting pre-fetched data and side information
US11930290B2 (en) Panoramic picture in picture video
KR20220097888A (en) Signaling of audio effect metadata in the bitstream

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 17781064; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 17781064; Country of ref document: EP; Kind code of ref document: A1