GB2598751A - Spatial audio parameter encoding and associated decoding


Info

Publication number
GB2598751A
GB2598751A (application GB2014257.6A)
Authority
GB
United Kingdom
Prior art keywords
parameter value
direction parameter
view
quantization
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB2014257.6A
Other versions
GB202014257D0 (en)
Inventor
Miikka Tapani Vilermo
Mikko Olavi Heikkinen
Antti Johannes Eronen
Arto Juhani Lehtiniemi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB2014257.6A priority Critical patent/GB2598751A/en
Publication of GB202014257D0 publication Critical patent/GB202014257D0/en
Publication of GB2598751A publication Critical patent/GB2598751A/en
Current legal status: Withdrawn

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/40 Details of arrangements for obtaining desired directional characteristic by combining a number of identical transducers covered by H04R1/40 but not provided for in any of its subgroups

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Stereophonic System (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

An Immersive Voice and Audio Services (IVAS) parametric spatial audio coding system uses image data and Field of View (FoV) information 202 from a camera device to warp a quantization spatial resolution such that an encoded direction parameter points to the same audio source when played on a device having a different FoV 264. The viewing angle for both FoVs is determined, and the quantization grid is modified so that the playback device has larger angles between grid points in the centre of the FoV.

Description

SPATIAL AUDIO PARAMETER ENCODING AND ASSOCIATED DECODING
Field
The present application relates to apparatus and methods for sound-field related parameter encoding, but not exclusively for warping and selection of a quantization grid for time-frequency domain direction related parameter encoding for an audio encoder and decoder.
Background
Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of directional metadata parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in the synthesis of the spatial sound accordingly: binaurally for headphones, for loudspeakers, or for other formats, such as Ambisonics.
The directional metadata such as directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
A directional metadata parameter set consisting of one or more direction values for each frequency band and an energy ratio parameter associated with each direction value can also be utilized as spatial metadata (which may also include other parameters such as spread coherence, number of directions, distance, etc.) for an audio codec. The directional metadata parameter set may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio). For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo or a mono signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
As some codecs are expected to operate at various bit rates ranging from very low bit rates to relatively high bit rates, various strategies are needed for the compression of the spatial metadata to optimize the codec performance for each operating point. The raw bitrate of the encoded parameters (metadata) is relatively high, so especially at lower bitrates it is expected that only the most important parts of the metadata can be conveyed from the encoder to the decoder.
A decoder can decode the audio signals into PCM signals and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
The aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, video cameras, VR cameras, stand-alone microphone arrays). However, it may be desirable for such an encoder to have also other input types than microphone-array captured signals, for example, loudspeaker signals, audio object signals, or Ambisonics signals. An example parametric spatial audio codec is 3GPP IVAS (3rd Generation Partnership Project, Immersive Voice and Audio Services), which aims to bring spatial audio with directional audio metadata into mobile communications. FoV (Field of View) based processing is a field of current research with respect to audio codecs.
Wide angle lenses for video capture are becoming increasingly popular in mobile devices, and as such the FoV of the capture device may differ from the FoV of the playback device.
Summary
There is provided according to a first aspect an apparatus comprising means configured to: obtain at least one image and associated field-of-view information; obtain at least one audio signal; determine at least one direction parameter value for a time-frequency part of the at least one audio signal; encode the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
The means may be further configured to receive the further apparatus field-of-view information.
The means configured to encode the direction parameter value based on a quantization spatial resolution may be configured to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The means may be further configured to receive the further apparatus viewing angle information.
The means configured to encode the direction parameter value based on a quantization spatial resolution may be configured to modify the at least one direction parameter value based on an apparatus capture angle associated with the field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The means configured to encode the direction parameter value based on a quantization spatial resolution may be configured to generate a quantization grid based at least on the apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with larger angles between grid points towards the centre of the field-of-view.
The means configured to encode the direction parameter value based on a quantization spatial resolution may be configured to generate a quantization grid based at least on the further apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.

The means configured to encode the direction parameter value based on a quantization spatial resolution may be configured to generate a quantization grid based at least on the further apparatus viewing angle information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.
The means may be configured to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the means configured to determine a quantization spatial resolution based on the field-of-view information may be configured to: generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on the field-of-view information.
The means may be further configured to: encode the at least one image; encode the at least one audio signal; transmit to the further apparatus the encoded at least one image, the encoded at least one audio signal and the encoded direction parameter value.
According to a second aspect there is provided an apparatus comprising means configured to: obtain at least one image; obtain at least one field-of-view information associated with a display of the at least one image; obtain at least one audio signal; obtain at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
The means may be further configured to receive the further apparatus field-of-view information.
The means may be further configured to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The means may be further configured to obtain an apparatus viewing angle information.
The means may be further configured to modify the at least one direction parameter value based on an apparatus capture angle associated with the further apparatus field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The means configured to obtain at least one direction parameter value may be configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus field-of-view information.
The means configured to obtain at least one direction parameter value may be configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the further apparatus field-of-view information.
The means configured to obtain at least one direction parameter value may be configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus viewing angle information.
The means may be configured to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the means configured to obtain at least one direction parameter value may be configured to: obtain at least one encoded direction parameter value; generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on at least one of the apparatus field-of-view information and further apparatus field-of-view information.
According to a third aspect there is provided a method for an apparatus, the method comprising: obtaining at least one image and associated field-of-view information; obtaining at least one audio signal; determining at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
The method may further comprise receiving the further apparatus field-of-view information.
Encoding the direction parameter value based on a quantization spatial resolution may comprise modifying the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The method may further comprise receiving the further apparatus viewing angle information.
Encoding the direction parameter value based on a quantization spatial resolution may comprise modifying the at least one direction parameter value based on an apparatus capture angle associated with the field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
Encoding the direction parameter value based on a quantization spatial resolution may comprise generating a quantization grid based at least on the apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with larger angles between grid points towards the centre of the field-of-view.
Encoding the direction parameter value based on a quantization spatial resolution may comprise generating a quantization grid based at least on the further apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.
Encoding the direction parameter value based on a quantization spatial resolution may comprise generating a quantization grid based at least on the further apparatus viewing angle information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.
The method may further comprise obtaining at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and determining a quantization spatial resolution based on the field-of-view information may comprise: generating a first quantization grid based on the at least one energy ratio; and modifying the first quantization grid based on the field-of-view information.
The method may further comprise: encoding the at least one image; encoding the at least one audio signal; and transmitting to the further apparatus the encoded at least one image, the encoded at least one audio signal and the encoded direction parameter value.
According to a fourth aspect there is provided a method for an apparatus, the method comprising: obtaining at least one image; obtaining at least one field-of-view information associated with a display of the at least one image; obtaining at least one audio signal; obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
The method may further comprise receiving the further apparatus field-of-view information.
The method may further comprise modifying the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The method may further comprise obtaining an apparatus viewing angle information.
The method may further comprise modifying the at least one direction parameter value based on an apparatus capture angle associated with the further apparatus field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
Obtaining at least one direction parameter value may comprise: obtaining at least one encoded direction parameter value; determining a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus field-of-view information.
Obtaining at least one direction parameter value may comprise: obtaining at least one encoded direction parameter value; determining a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the further apparatus field-of-view information.
Obtaining at least one direction parameter value may comprise: obtaining at least one encoded direction parameter value; determining a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus viewing angle information.
The method may further comprise obtaining at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and obtaining at least one direction parameter value may comprise: obtaining at least one encoded direction parameter value; generating a first quantization grid based on the at least one energy ratio; and modifying the first quantization grid based on at least one of the apparatus field-of-view information and further apparatus field-of-view information.
According to a fifth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one image and associated field-of-view information; obtain at least one audio signal; determine at least one direction parameter value for a time-frequency part of the at least one audio signal; encode the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
The apparatus may be further caused to receive the further apparatus field-of-view information.
The apparatus caused to encode the direction parameter value based on a quantization spatial resolution may be caused to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The apparatus may be further caused to receive the further apparatus viewing angle information.
The apparatus caused to encode the direction parameter value based on a quantization spatial resolution may be caused to modify the at least one direction parameter value based on an apparatus capture angle associated with the field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.

The apparatus caused to encode the direction parameter value based on a quantization spatial resolution may be caused to generate a quantization grid based at least on the apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with larger angles between grid points towards the centre of the field-of-view.
The apparatus caused to encode the direction parameter value based on a quantization spatial resolution may be caused to generate a quantization grid based at least on the further apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.
The apparatus caused to encode the direction parameter value based on a quantization spatial resolution may be caused to generate a quantization grid based at least on the further apparatus viewing angle information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the view point of the further apparatus.
The apparatus may be further caused to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the apparatus caused to determine a quantization spatial resolution based on the field-of-view information may be caused to: generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on the field-of-view information.
The apparatus may be further caused to: encode the at least one image; encode the at least one audio signal; transmit to the further apparatus the encoded at least one image, the encoded at least one audio signal and the encoded direction parameter value.
According to a sixth aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least one image; obtain at least one field-of-view information associated with a display of the at least one image; obtain at least one audio signal; obtain at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
The apparatus may be further caused to receive the further apparatus field-of-view information.
The apparatus may be further caused to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The apparatus may be further caused to obtain an apparatus viewing angle information.

The apparatus may be further caused to modify the at least one direction parameter value based on an apparatus capture angle associated with the further apparatus field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
The apparatus caused to obtain at least one direction parameter value may be caused to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus field-of-view information.
The apparatus caused to obtain at least one direction parameter value may be caused to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the further apparatus field-of-view information.
The apparatus caused to obtain at least one direction parameter value may be caused to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus viewing angle information.
The apparatus may be further caused to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the apparatus caused to obtain at least one direction parameter value may be caused to: obtain at least one encoded direction parameter value; generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on at least one of the apparatus field-of-view information and further apparatus field-of-view information.
According to a seventh aspect there is provided an apparatus comprising: means for obtaining at least one image and associated field-of-view information; means for obtaining at least one audio signal; means for determining at least one direction parameter value for a time-frequency part of the at least one audio signal; means for encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
According to an eighth aspect there is provided an apparatus comprising: means for obtaining at least one image; means for obtaining at least one field-of-view information associated with a display of the at least one image; means for obtaining at least one audio signal; means for obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining at least one image and associated field-of-view information; obtaining at least one audio signal; determining at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtaining at least one image; obtaining at least one field-of-view information associated with a display of the at least one image; obtaining at least one audio signal; obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one image and associated field-of-view information; obtaining at least one audio signal; determining at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one image; obtaining at least one field-of-view information associated with a display of the at least one image; obtaining at least one audio signal; obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
According to a thirteenth aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain at least one image and associated field-ofview information; obtaining circuitry configured to obtain at least one audio signal; determining circuitry configured to determine at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding circuitry configured to encode the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
According to a fourteenth aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain at least one image; obtaining circuitry configured to obtain at least one field-of-view information associated with a display of the at least one image; obtaining circuitry configured to obtain at least one audio signal; obtaining circuitry configured to obtain at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one image and associated field-of-view information; obtaining at least one audio signal; determining at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtaining at least one image; obtaining at least one field-of-view information associated with a display of the at least one image; obtaining at least one audio signal; obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings, in which:

Figure 1 shows schematically an example Field of View difference between a playback device and capture device;
Figure 2 shows schematically an example device suitable for implementing some embodiments;
Figure 3 shows schematically an example Field of View compensation according to some embodiments;
Figure 4 shows schematically an example capture device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 3;
Figure 5 shows a flow diagram of the example capture device as shown in Figure 4 according to some embodiments;
Figure 6 shows schematically an example playback device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 3;
Figure 7 shows a flow diagram of the example playback device as shown in Figure 6 according to some embodiments;
Figure 8 shows schematically a Field of View based parameter encoding example according to some embodiments;
Figure 9 shows schematically an example capture device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 8;
Figure 10 shows a flow diagram of the example capture device as shown in Figure 4 according to some embodiments;
Figure 11 shows schematically an example playback device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 8;
Figure 12 shows a flow diagram of the example playback device as shown in Figure 11 according to some embodiments;
Figure 13 shows schematically a further Field of View based parameter encoding example according to some embodiments;
Figure 14 shows schematically an example capture device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 13;
Figure 15 shows a flow diagram of the example capture device as shown in Figure 14 according to some embodiments;
Figure 16 shows schematically an example playback device as shown in Figure 1 suitable for implementing some embodiments as shown in Figure 13;
Figure 17 shows a flow diagram of the example playback device as shown in Figure 16 according to some embodiments;
Figure 18 shows schematically a further Field of View based parameter encoding example according to some embodiments; and
Figure 19 shows schematically an example device suitable for implementing the device shown.
Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio streams with transport audio signals and spatial metadata.
As discussed above, the FoV (Field of View) of a capture device and of a playback device may be significantly different. In such circumstances, the sound source directions that are analysed and determined from microphone signals as azimuths in a circular coordinate system at the capture device may produce, at the playback device, sound sources with incorrect directions with respect to the displayed images, because the circular coordinate system of the audio differs from the essentially planar coordinate system inside the FoV of the video and the optics that capture it.
This can be experienced as a discrepancy between object directions in the visual and audio domains and also as suboptimal quantization of spatial audio metadata directions.
This can be seen within an example as shown in Figure 1. Figure 1 for example shows a capture device which has a defined capture FoV 101 with a range of 150 degrees, centred at point A. Furthermore, a playback device is shown with a defined playback FoV 103 of 60 degrees, centred at point B. In this example both the capture FoV 101 and the playback FoV 103 share a common central axis CAB; as such, the position of point C, directly central to both the capture FoV 101 and the playback FoV 103, is the same from the playback device as the experienced 'view' from the capture device. However, off the common axis the experienced distances and angles will differ. Basic trigonometry for this example gives the angles CAD = 60°, CAE = 75°, CBD = 15°, CBE = 30°.
Some selected (for numerical simplicity) distances between points A, B, C, D, and E are shown. For example, where the distance AC is defined as 1, the distance CD is the square root of three (in any suitable measure: metres, feet, inches, etc.), the distance CE is the square root of three plus two, and the distance BC is twice the square root of three plus three. In other words:

AC = 1, CD = √3, CE = √3 + 2, BC = 2√3 + 3

This example shows that if there are sound objects at locations D and E, a linear mapping cannot be used to move their locations for the playback view angle, since 60/75 ≠ 15/30.
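As a quick check, the geometry above can be reproduced numerically. The following is a minimal Python sketch (an illustration only, not part of the patent text) that recomputes the distances from the stated angles and shows why a linear angle mapping fails:

    import math

    AC = 1.0
    CD = AC * math.tan(math.radians(60))  # angle CAD = 60 deg -> CD = sqrt(3) ~ 1.732
    CE = AC * math.tan(math.radians(75))  # angle CAE = 75 deg -> CE = sqrt(3) + 2 ~ 3.732
    BC = CD / math.tan(math.radians(15))  # angle CBD = 15 deg -> BC = 2*sqrt(3) + 3 ~ 6.464

    # The angles seen from the playback position B agree with the text:
    print(math.degrees(math.atan(CD / BC)))  # 15.0 (angle CBD)
    print(math.degrees(math.atan(CE / BC)))  # 30.0 (angle CBE)

    # A linear mapping would require a constant capture/playback angle ratio:
    print(60 / 75, 15 / 30)  # 0.8 vs 0.5, so a linear mapping cannot work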
Therefore, a non-linear warping of the sound source directions inside the playback FoV is required to maintain the sound in the same direction as the displayed visual elements.
As such, in some embodiments the apparatus and methods are configured to correct sound directions in a non-linear manner when the sound source is inside the playback device (or image) FoV and the capture FoV and playback FoV are different.
In some embodiments the apparatus and methods are configured to improve on low bit rate coding of spatial direction metadata wherein grid points used in the quantization of the metadata are optimized by defining the grid points equally spaced inside the capture FoV region.
Furthermore in some embodiments the apparatus and methods are configured to improve low bit rate coding of spatial direction metadata wherein grid points used in the quantization of the metadata are optimized by making the grid points equally spaced inside the playback FoV region.
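To illustrate the playback-FoV-equalised grid, the following Python sketch (an illustration under stated assumptions; the function name and parameters are hypothetical) chooses grid points with equal angular spacing from the playback viewpoint and maps them back to capture-side angles by inverting the warp ρ = tan⁻¹((tan γ / tan α) · tan β) derived below. The resulting capture-side points are spaced more widely towards the centre of the capture FoV, consistent with the larger angles between grid points towards the centre described above.

    import math

    def capture_grid_uniform_in_playback(capture_half_fov_deg,
                                         playback_half_fov_deg, n_points):
        # Capture-side grid angles whose warped images are equally spaced
        # within the playback FoV (inverse of rho = atan(tan(g)*tan(b)/tan(a))).
        a = math.radians(capture_half_fov_deg)
        b = math.radians(playback_half_fov_deg)
        grid = []
        for i in range(n_points):
            rho = -b + 2.0 * b * i / (n_points - 1)  # uniform from playback view
            gamma = math.atan(math.tan(rho) * math.tan(a) / math.tan(b))
            grid.append(math.degrees(gamma))
        return grid

    # 150 deg capture FoV, 60 deg playback FoV: FoV edges map to FoV edges
    print(capture_grid_uniform_in_playback(75, 30, 5))
    # [-75.0, -60.0, 0.0, 60.0, 75.0]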
With respect to Figure 2 is shown a system of apparatus within which some embodiments may be implemented.
For example Figure 2 shows a capture device (or capture apparatus) 200. The capture device 200 is configured to encode audio signals and video image signals 208 which may be stored or transmitted as shown by the cloud 240.
Furthermore Figure 2 shows a playback device (or playback apparatus) 260. The playback device 260 is configured to receive, retrieve or otherwise obtain the encoded audio signals and video image signals 208 and present or render them to the user.
The capture device 200 is shown comprising a camera 201 configured to capture video images and at least two microphones 203 configured to capture audio signals.
The camera 201 is configured to generate video image or frame (still) image signals which may be processed and encoded in any suitable manner. The encoding of the video image signals is not described in detail hereafter. The camera 201 is further configured to generate camera FoV information signals 202 which may be passed to an encoder pre-processor 205. The capture device (camera) FoV information may be defined as the camera and lens combination FoV in the capture device and as such may be defined based on the intrinsic properties of the camera 201.
The (at least two) microphones 203 (the example shown in Figure 2 comprises three microphones) are configured to capture audio signals and pass the audio signals 204 to the encoder pre-processor 205.
The encoder pre-processor 205 may in some embodiments be configured to generate the transport audio signals from the (microphone) audio signals 204, in other words generate audio signals suitable for encoding according to a defined coding method. Additionally the encoder pre-processor 205 may be configured in some embodiments to generate at least one spatial parameter or (spatial) metadata and pass these to the encoder 207. In some embodiments the capture device captures audio in a form where audio directions can therefore be modified. That is easiest if the audio is captured in a form where there is metadata that directly gives the audio directions. Such forms are, for example: DirAC (Directional Audio Coding), US patents US9313599 and US9456289, and IVAS. Also, audio in Ambisonics or loudspeaker formats (such as 5.1, 22.2) can be modified with methods known in the prior art, although this typically does not give as high a quality.
Furthermore in some embodiments the processor is configured to receive the spatial parameters, the transport audio signals (and video signals) and encode these in a suitable manner as discussed in further detail herein.
The encoder 207 in some embodiments is configured to receive the playback device (display) FoV 264 from the playback apparatus. The encoded signals 208 can then be passed to the playback device 260 (via suitable storage or communication means 240).

The playback device 260 in some embodiments comprises a decoder 261, a display 263 and two or more loudspeakers 265. Although this example shows a display 263 and loudspeakers 265, any suitable display means (e.g. glasses mounted display, VR/AR headset display) suitable for rendering an image to the user and any suitable audio signal rendering means (e.g. headphones, earbuds, headset) suitable for rendering the audio signals to the user may be used. In some embodiments the display means and the speaker means may be external to the playback device.
The decoder 261 in some embodiments is configured to receive the encoded signals (and audio signals, spatial metadata and video signals) and further render the decoded audio signals and spatial metadata to generate suitable audio signals for output to the loudspeakers 265. The decoder 261 may further, as discussed in further detail herein, receive the capture device FoV information and/or the playback device FoV information and from this information perform metadata processing.
In some embodiments the playback device may be the same as or different from the capture device.
In the following examples of the embodiments, the processing can be implemented either in the capture device (the pre-processor) or in the playback device before the audio signal is rendered.
The playback device (display) FoV information 264 can be determined in any suitable manner. For example, in some embodiments the user may input the FoV information. The FoV information may in some embodiments be determined using a sensor (camera, 3D camera such as in the Microsoft Kinect, or ToF (Time of Flight) sensor) in order to determine the distance from the user to the playback display and the playback display size. In some embodiments the user may take an image of the playback display during playback, and the FoV from the user location may be determined from the image taking device FoV and how big the playback display appears in the taken image. In some embodiments the FoV may be pre-determined based on playback display parameters and playback device type. For example, mobile devices may be assumed to be held approximately 0.5 m in front of the user, which combined with the mobile device display size gives the assumed FoV. A television display may be assumed to be viewed from a distance of 3 m, and the size of the television display together with the distance gives the assumed FoV.
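For example, a minimal Python sketch of the display-size-and-distance estimate (the viewing distances are those assumed in the text; the display widths and the function name are hypothetical):

    import math

    def display_fov_deg(display_width_m, viewing_distance_m):
        # Horizontal FoV subtended by a display of the given width
        # when viewed from the given distance.
        return math.degrees(2 * math.atan(display_width_m / (2 * viewing_distance_m)))

    # Hypothetical display widths: a 0.07 m wide phone screen held 0.5 m away,
    # and a 1.2 m wide television viewed from 3 m.
    print(display_fov_deg(0.07, 0.5))  # ~8.0 degrees
    print(display_fov_deg(1.2, 3.0))   # ~22.6 degrees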
With respect to Figure 3 is shown an example correction of FoV operation which may be implemented in some embodiments. Figure 3 for example shows a capture device which has a defined capture FoV 301 centred at point A. Furthermore, a playback device is shown with a defined playback FoV 303 centred at point B. In this example both the capture FoV 301 and the playback FoV 303 share a common central axis CAB; as such, the position of point C, directly central to both the capture FoV 301 and the playback FoV 303, is the same from the playback device as the experienced 'view' from the capture device. In this example the angle α is half of the capture device FoV 301, β is half of the playback device FoV 303, γ is the original direction and ρ is the modified (warped) direction.
In words, the apparatus is configured to apply a modification or correction to the direction such that the sound source capture metadata directions are converted into points inside a plane with borders. The plane is orthogonal to the capture device optical axis. The borders are defined by the capture device FoV 301. The plane with borders is resized to be the same size as the playback display and the points move linearly with the resizing. The angles from the playback user location to the points in the resized plane (i.e. the playback display) are calculated, and the capture metadata directions are modified to be directed towards the points on the playback display.
From Figure 3 it can be seen that the needed modification for audio directions inside the FoV is:

ρ = tan⁻¹((tan γ / tan α) · tan β)

In some embodiments the metadata is modified with the above equation before rendering audio in the playback device. In such embodiments the capture device is configured to send the capture device FoV information, for example as metadata, to the playback device. Alternatively, in some embodiments the playback device may send the playback FoV information to the capture device and the capture device makes the modification to the directions.
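A minimal Python sketch of this warp (illustrative only) confirms the Figure 1 example: a source at 60° inside a 150° capture FoV maps to 15° inside a 60° playback FoV, and the FoV edge maps to the FoV edge:

    import math

    def warp_direction(gamma_deg, capture_half_fov_deg, playback_half_fov_deg):
        # rho = atan((tan(gamma) / tan(alpha)) * tan(beta)), valid for
        # directions inside the capture FoV (|gamma| < alpha < 90 deg).
        g = math.radians(gamma_deg)
        a = math.radians(capture_half_fov_deg)
        b = math.radians(playback_half_fov_deg)
        return math.degrees(math.atan(math.tan(g) / math.tan(a) * math.tan(b)))

    print(warp_direction(60.0, 75.0, 30.0))  # 15.0
    print(warp_direction(75.0, 75.0, 30.0))  # 30.0 (edge maps to edge)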
With respect to Figure 4 is shown an example capture device encoder preprocessor 205 and encoder 207 configured to implement some embodiments such as described with respect to Figure 3.
The example capture device encoder pre-processor 205 is configured to receive the audio signals 204 and the camera FoV signals (or camera FoV information).
The encoder pre-processor 205 in some embodiments comprises a time-frequency domain transformer 401.
In some embodiments the time-frequency domain transformer 401 is configured to receive the audio signals 204 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals. These time-frequency signals may be passed to a spatial analyser 403 and to a transport audio signal determiner 405.
Thus for example the time-frequency signals 402 may be represented in the time-frequency domain representation by s_i(b, n), where b is the frequency bin index, n is the time-frequency block (frame) index and i is the channel index. In another expression, n can be considered as a time index with a lower sampling rate than that of the original time-domain signals.
These frequency bins can be grouped into subbands that group one or more of the bins into a subband of band index k = 0, ..., K-1. Each subband k has a lowest bin b_k,low and a highest bin b_k,high, and the subband contains all bins from b_k,low to b_k,high. The widths of the subbands can approximate any suitable distribution, for example the equivalent rectangular bandwidth (ERB) scale or the Bark scale.
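A minimal sketch of the transform and the bin grouping (Python with numpy/scipy; the sampling rate, frame length and band edges are illustrative choices, not the codec's actual tables):

    import numpy as np
    from scipy.signal import stft

    fs = 48000
    audio = np.random.randn(2, fs)  # placeholder two-channel time-domain input

    # S[i, b, n]: channel i, frequency bin b, time-frequency block n
    _, _, S = stft(audio, fs=fs, nperseg=960)

    # Group bins into (up to) K subbands with roughly logarithmic widths,
    # approximating an ERB/Bark-like distribution
    K = 24
    num_bins = S.shape[1]
    edges = np.unique(np.geomspace(1, num_bins, K + 1).astype(int))
    subbands = [(lo, hi - 1) for lo, hi in zip(edges[:-1], edges[1:])]
    print(subbands[:4])  # (b_k_low, b_k_high) pairs for the lowest bands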
In some embodiments the encoder pre-processor 205 comprises a spatial analyser 403. The spatial analyser 403 may be configured to receive the time-frequency signals 402 and based on these signals estimate one or more direction parameters 406. The direction parameters may be determined based on any audio based 'direction' determination.
For example in some embodiments the spatial analyser 403 is configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a 'direction'; more complex processing may be performed with even more signals.
The spatial analyser 403 may thus be configured to identify at least one 'direction', dir, of audio arrival for each frequency band and based on this direction provide at least one azimuth and elevation for each frequency band and temporal time-frequency block within a frame of an audio signal. These may be denoted as Q directions Dir_1..Q, with azimuth φ_1..Q(k,n) and elevation θ_1..Q(k,n). The direction parameters 406 may also be passed to a direction encoder (index generator) 425 within the encoder 207.
The spatial analyser 403 may also be configured to determine one or more energy ratio parameters 410. The energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from the identified 'direction'. The direct-to-total energy ratio r_1..Q(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter. The energy ratio may be passed to an energy ratio encoder 427 within the encoder 207. The spatial analyser 403 may furthermore be configured to determine other parameters such as coherence parameters, which may include surround coherence (γ(k,n)) and spread coherence (ζ(k,n)), both analysed in the time-frequency domain.
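As one hedged illustration of a stability measure (an assumption about one possible implementation, not the patent's specific method), the length of the mean resultant vector of per-block azimuth estimates within a frame lies in [0, 1]: near 1 for a stable (direct) source and small for diffuse sound:

    import numpy as np

    def direct_to_total_ratio(azimuths_rad):
        # Mean resultant length of the azimuth estimates, a stability
        # measure in [0, 1] usable as a direct-to-total energy ratio proxy.
        vectors = np.exp(1j * np.asarray(azimuths_rad))
        return float(np.abs(vectors.mean()))

    print(direct_to_total_ratio([0.50, 0.52, 0.49, 0.51]))              # ~1.0, stable
    print(direct_to_total_ratio(np.random.uniform(-np.pi, np.pi, 64)))  # small, diffuse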
Therefore in summary the analysis processor is configured to receive time domain multichannel or other format such as microphone or ambisonic audio signals.
Following this the analysis processor may apply a time domain to frequency domain transform (e.g. STFT) to generate suitable time-frequency domain signals for analysis and then apply direction analysis to determine direction and energy ratio parameters.
The analysis processor may then be configured to output the determined parameters.
Although the parameters are here expressed for each time index n, in some embodiments the parameters may be combined over several time indices. The same applies to the frequency axis: as has been expressed, the direction of several frequency bins b could be expressed by one direction parameter in band k consisting of several frequency bins b.
Furthermore, as shown in Figure 4, in some embodiments the encoder pre-processor 205 comprises a transport audio signal determiner 405. The transport audio signal determiner 405 is configured to receive or obtain the time-frequency audio signals 402 and generate suitable transport audio signals for encoding. In some embodiments this may comprise a selection of and/or processing of one or more channels of the audio signals. For example in some embodiments the transport audio signal determiner 405 is configured to select a 'left' channel audio signal and a 'right' channel audio signal.
In some embodiments the encoder pre-processor 205 comprises a FoV analyser 403. The FoV analyser 403 may be configured to receive or obtain the camera FoV signals 202 or camera intrinsic and/or extrinsic parameters and based on these determine suitable FoV parameters 408 which can be passed to an FoV encoder 429 within the encoder 207.
In some embodiments the encoder 207 comprises a transport audio signal encoder 423. The transport audio signal encoder 423 can be any suitable encoder configured to generate encoded audio signals 424 (for example according to the IVAS standard).
In some embodiments the encoder comprises an energy ratio encoder 427 configured to receive the energy ratios 410 and generate suitable encoded energy ratios 428. In other words the energy ratio encoder 427 is configured to compress the energy ratios associated with each direction and for the sub-bands and the time-frequency blocks.
In such embodiments the quantized energy ratio values 428 are the same for all the TF blocks of a given sub-band.
The energy ratio encoder 427 may be further configured to pass the quantized (encoded) energy ratio values 428 to a combiner 431.
Furthermore in some embodiments the quantized energy ratio values 428 are passed to a direction encoder 425.
The encoder 207 may in some embodiments comprise a direction encoder 425 configured to receive the direction parameters 406 and encode them in a suitable manner. For example in some embodiments the encoding is implemented based at least on a quantization of the direction, where the quantization is implemented using a grid with a determined resolution. The determination of the grid configuration (and thus the determined resolution) can be a single or multistage process. In the following examples the determination of the grid points (locations and distances) is based on the energy ratios. This determination may be defined as the absolute grid point configuration. These determined grid points can then be modified (such that, for example, the grid is mapped from a linear grid to a nonlinear grid) based on the field-of-view information.
The determined resolution in this example may be based on the encoded energy ratio values 428. The determined quantization resolution may be any suitable quantization resolution arrangement or configuration such as those described within the patent applications PCT/FI2019/050675, GB1811071.8, and GB1913274.5. In some embodiments the quantization resolution is based on an arrangement of spheres forming a spherical grid, arranged in rings on a 'surface' sphere, given by a look-up table according to the determined quantization resolution. In other words the spherical grid uses the idea of covering a sphere with smaller spheres and considering the centres of the smaller spheres as points defining a grid of almost equidistant directions. The smaller spheres therefore define cones or solid angles about the centre point which can be indexed according to any suitable indexing algorithm. Although spherical quantization is described here, any suitable quantization, linear or non-linear, may be used.
In other words the direction encoder 425 is configured to receive the direction parameters 406 and, based on a quantization resolution determined from the encoded energy ratios 428 (and in some embodiments an expected bit allocation), generate a suitable encoded output. The encoded direction parameters 426 may then be passed to the combiner 431.
The encoder 207 may further comprise a FoV encoder 429. The FoV encoder is configured to receive the FoV parameters 408 and encode these in a suitable manner to generate encoded FoV parameters 430 which can be passed to the combiner 431.
The encoder 207 may comprise a combiner 431. The combiner 431 may be configured to receive the encoded FoV parameters 430, the encoded (or quantized/compressed) directional parameters, encoded energy ratio parameters and encoded audio signals and combine these to generate a suitable output (for example a metadata bit stream which is combined with the transport signal, or transmitted or stored separately from the transport signal).
With respect to Figure 5 is shown a flow diagram showing the operations of the encoder pre-processor and processor as shown in Figure 4.
An initial operation is obtaining the audio signals as shown in Figure 5 by step 501.
Furthermore the FoV information is obtained from the camera as shown in Figure 5 by step 502.
The FoV information may be analysed to determine the FoV parameters as shown in Figure 5 by step 504.
The FoV parameters may then be encoded to generate encoded FoV parameters as shown in Figure 5 by step 506.
The spatial parameters, for example the direction and energy ratios, may furthermore be determined, for example based on an analysis of the audio signals as shown in Figure 5 by step 503.
The energy ratios may then be encoded or quantized as shown in Figure 5 by step 507.
The directions may then be encoded or quantized based on an energy ratio based quantization resolution as shown in Figure 5 by step 511.
Furthermore the transport audio signals may be determined based on the audio signals as shown in Figure 5 by step 505.
The determined transport audio signals may then be encoded as shown in Figure 5 by step 509.
Then the encoded directional values, energy ratios and FoV parameters may be combined (together with the transport audio signals) as shown in Figure 5 by step 513.
With respect to Figure 6 is shown an example decoder (and renderer) 261. The decoder 261 in some embodiments comprises a demultiplexer 601 configured to receive the encoded (multiplexed) signals comprising the encoded metadata and encoded transport audio signals and extract from the encoded signals 208 the encoded energy ratios (index or otherwise encoded), the encoded direction parameters (index values or otherwise encoded), the encoded capture device FoV parameters and the encoded transport audio signals.
The decoder 261 in some embodiments comprises an energy ratio decoder 603 configured to convert the index (or otherwise encoded value) and generate energy ratios 604 based on an inverse process to the energy ratio encoding operations described above.
The decoder 261 in some embodiments comprises a direction decoder 605 configured to receive the decoded energy ratios and determine the resolution of the encoding used to encode the direction parameters. The direction decoder is further configured to receive the encoded directions (for example the direction index or otherwise encoded values), perform an inverse operation to the direction encoder as described previously using the spatial resolution as determined by the energy ratios, and output the decoded direction values.
Furthermore the decoder comprises a FoV parameter decoder 603 configured to receive the encoded FoV parameters and based on these determine FoV parameters associated with the capture device. The decoded FoV parameters associated with the capture device may be passed to the direction modifier 607.
The decoder 261 in some embodiments comprises a direction modifier 607. The direction modifier 607 is configured to receive the decoded FoV parameters associated with the capture device, the playback device FoV information or parameters 264, and the decoded directions. The direction modifier 607 is then configured to modify the decoded directions based on the decoded FoV parameters associated with the capture device and the playback device FoV information or parameters 264 in a manner as described above. The modified directions 612 can then be passed to the audio renderer 611.
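The modification itself is described earlier in the document; as an illustration only, the following minimal sketch maps a decoded azimuth from the capture FoV to the playback FoV so that a source keeps its relative horizontal position on the image, assuming a flat display and a tangent (pinhole) projection. The function name and the projection model are assumptions for this sketch, not the claimed method.

```python
import math

def modify_azimuth(azimuth_deg, capture_fov_deg, playback_fov_deg):
    """Remap an azimuth inside the capture FoV so the source stays at the
    same relative position on the playback display (FoVs < 180 degrees)."""
    half_c = math.radians(capture_fov_deg / 2.0)
    half_p = math.radians(playback_fov_deg / 2.0)
    az = math.radians(azimuth_deg)
    if abs(az) > half_c:
        return azimuth_deg  # outside the capture FoV: handled separately
    x = math.tan(az) / math.tan(half_c)  # normalized image position, -1..1
    return math.degrees(math.atan(x * math.tan(half_p)))

print(modify_azimuth(20.0, capture_fov_deg=90.0, playback_fov_deg=40.0))
```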
In some embodiments the decoder 261 comprises a (transport) audio signals decoder 609 configured to receive the encoded audio signals and decode them according to a suitable inverse of the encoding operation as described above. The decoded audio signals 610 can then be passed to the audio renderer 611.
The decoder 261 can in some embodiments comprise an audio renderer 611 configured to receive the audio signals and the decoded (and modified) metadata or spatial parameters and then perform audio signal rendering to generate at least two rendered audio signals according to any suitable spatial rendering method. The rendered audio signals may then be output to a suitable playback means (for example, as shown in Figure 2, at least two loudspeakers).
With respect to Figure 7 is shown a flow diagram showing the operations of the decoder as shown in Figure 6.
Thus the encoded signals or data is obtained and demultiplexed as shown in Figure 7 by step 701.
The energy ratio values from the demultiplex operation can then be decoded as shown in Figure 7 by step 703.
Having determined the decoded energy ratio values, the direction values can then be determined based on the index and the quantization resolution (the spatial resolution) as shown in Figure 7 by step 705.
Furthermore the capture device FoV information/parameters may be decoded as shown in Figure 7 by step 711.
The playback device FoV information/parameters may further be obtained as shown in Figure 7 by step 702.
Having obtained the capture device FoV information/parameters and the playback device FoV information/parameters the decoded direction parameters may then be modified based on the capture device FoV information/parameters and the playback device FoV information/parameters as shown in Figure 7 by step 713.
The decoded audio signals may then be rendered based on the spatial parameters (including the modified direction parameters) to generate suitable output audio signals.
In summary, the above encoder/decoder operation is one where the capture device captures audio with directional metadata and calculates its FoV. The capture device may then transmit the audio, metadata and FoV to a playback device. The playback device may then estimate its own FoV and modify the metadata. The playback device may then render audio signals using the modified metadata.
In the above example the direction parameter modification occurs within the playback device. However in some embodiments the playback device is configured to encode the FoV information or suitable FoV parameters and transmit or pass these to the capture device. The capture device in such embodiments may comprise a direction modifier, for example within the encoder pre-processor or within the encoder, which is configured to receive the capture device FoV information/parameters and the playback device FoV information/parameters, modify the determined directional parameters, encode the modified directional parameters, and transmit or store these encoded modified directional parameters for the playback device to obtain.
In some embodiments the direction modification is applied for the regions within the playback device FoV (the display FoV). Outside the FoV the direction modification can in some embodiments be any suitable mapping, for example mapping the remaining part of the audio directions (outside the capture FoV) equally onto the region outside the playback FoV, as in the sketch below.
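A minimal sketch of one such mapping follows, assuming the region outside the capture FoV is stretched or compressed linearly onto the region outside the playback FoV; the function name is hypothetical.

```python
def modify_azimuth_outside(azimuth_deg, capture_fov_deg, playback_fov_deg):
    """Map a direction outside the capture FoV equally onto the angular
    range outside the playback FoV, so the full 360-degree field stays
    covered with no gap or overlap at the FoV edges."""
    half_c = capture_fov_deg / 2.0
    half_p = playback_fov_deg / 2.0
    if abs(azimuth_deg) <= half_c:
        raise ValueError("inside the capture FoV: use the in-view mapping")
    sign = 1.0 if azimuth_deg > 0 else -1.0
    # fraction of the way from the capture FoV edge to the rear (180 deg)
    t = (abs(azimuth_deg) - half_c) / (180.0 - half_c)
    return sign * (half_p + t * (180.0 - half_p))
```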
In some embodiments the playback device (user) view angle or other extrinsic information may also be estimated, and this view angle or extrinsic information used to modify the direction metadata. As explained previously, different methods can be used to detect the user FoV. The same methods may be used to detect the presence of several users, in which case an average of the multiple user FoVs may be used. Also, the video or image that the user views may be presented in a window that is smaller than the display, in which case the window FoV is used.
In such embodiments the sound directions (the direction parameters) are thus better matched between the audio and video even when the capture device FoV differs from the playback device FoV.
In some embodiments a low bit rate coding of the spatial direction metadata may be implemented wherein the spatial resolution grid points employed in the encoding/quantization of the directions are optimized by making the grid points equally spaced inside the capture FOV.
In these embodiments, as described herein, when audio metadata direction angles (azimuth and possibly elevation) are quantized, a grid is chosen and the angles are quantized to the grid points. The grid points may have a low bitrate coding representation e.g. Huffman or vector quantization and the representation may be sent from the capture device to playback device alongside other coded data such as audio and video.
When (audio metadata) directions are quantized, the quantization grid points may be chosen so that the angle between the grid points is either constant or so that the grid is denser in the front in order to follow psychoacoustics: humans have more accurate direction hearing in the front.
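As an illustration, a minimal sketch of such a front-weighted grid follows; the power-law warp and its exponent are assumptions standing in for the psychoacoustically derived spacing, with exponent 1 giving the constant-angle grid.

```python
import numpy as np

def front_weighted_grid(n_points=36, exponent=1.5):
    """Azimuth grid (degrees) denser in the front (0 degrees) than at the
    sides and rear: a uniform parameter u in [-1, 1] is warped by
    |u| ** exponent so grid points crowd towards the front."""
    u = np.linspace(-1.0, 1.0, n_points)
    return 180.0 * np.sign(u) * np.abs(u) ** exponent
```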
However, this is not optimal for cases where people are not sitting directly in the centre of the acoustic spatial image when listening to the audio. Typically, the acoustic centre is the same as the playback display centre line. A more general optimization is presented in these embodiments.
For example Figure 8 shows a capture FoV 801 centred at point A and defined at a distance from the capture device by the span from point C to point I (via points D to H, with point F being directly in front of the capture device). Further there are defined angles: α, which is the angle EAF; β, which is the angle DAE; γ, which is the angle CAD; δ, which is the angle GAF; ε, which is the angle HAG; and ζ, which is the angle IAH.
The optimization as described herein may produce better results for all viewing positions. That is, all the distances between consecutive points C,...,I are the same. The points C to I are equally spaced (spread equally) on the image, but the angles α, β, γ, δ, ε, ζ are not equal; they are narrower near the display edges. In some embodiments the number of grid points (in other words the distribution of the quantization resolution) is selected based on the capture device FoV. For example in some embodiments the quantization resolution is determined such that there are grid points every 10 degrees (2 to 30 degrees are typical values).
In such embodiments the selection of quantization grid points produces a good result for different sizes of playback displays and listening positions. In some embodiments where the spatial analysis is configured to determine directions comprising height information (for example elevation information), this method may be employed using the capture device vertical FoV information to determine grid points in the vertical direction.
In some embodiments points outside the FoV can be selected using any known means, for example equally spaced but more sparsely than inside the FoV, as in the sketch below.
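The following minimal sketch illustrates the Figure 8 construction: grid points spread equally across the display (so the angles between them, seen from the capture point, are narrower near the display edges), with a sparser equally spaced grid appended outside the FoV. The function name, the display width normalization and the 30-degree outside step are assumptions for this sketch.

```python
import math

def display_equal_grid(capture_fov_deg, n_inside=7, outside_step_deg=30.0):
    """Azimuth grid (degrees): n_inside points equally spaced on a display
    of width 1, plus equally spaced but sparser points outside the FoV."""
    half = math.radians(capture_fov_deg / 2.0)
    d = 0.5 / math.tan(half)  # capture point distance to the display plane
    inside = [math.degrees(math.atan((i / (n_inside - 1) - 0.5) / d))
              for i in range(n_inside)]  # edge points land on +-FoV/2
    outside, a = [], capture_fov_deg / 2.0 + outside_step_deg
    while a < 180.0:
        outside += [a, -a]
        a += outside_step_deg
    return sorted(inside + outside)

print(display_equal_grid(90.0))
```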
Figure 9 for example shows a further example capture device encoder preprocessor 205 and encoder 207 configured to implement some embodiments such as described with respect to Figure 8.
The example capture device encoder pre-processor 205 is similar to the capture device encoder pre-processor 205 shown in Figure 4.
The example capture device encoder 207 differs from the encoder shown in Figure 4 in that the direction encoder 925 is configured to furthermore receive the encoded FoV parameters 430. The direction encoder 925 may then be configured to determine a quantization resolution and quantization grid for the direction parameters as shown in Figure 8 and described above.
With respect to Figure 10 is shown a flow diagram showing the operations of the encoder pre-processor and processor as shown in Figure 9.
An initial operation is obtaining the audio signals as shown in Figure 10 by step 501.
Furthermore the FoV information is obtained from the camera as shown in Figure 10 by step 502.
The FoV information may be analysed to determine the FoV parameters as shown in Figure 10 by step 504.
The FoV parameters may then be encoded to generate encoded FoV parameters as shown in Figure 10 by step 506.
The spatial parameters, for example the direction and energy ratios, may furthermore be determined, for example based on an analysis of the audio signals as shown in Figure 10 by step 503.
The energy ratios may then be encoded or quantized as shown in Figure 10 by step 507.
The directions may then be encoded or quantized based on a quantization resolution determined from the capture device FoV parameters (and energy ratios) as shown in Figure 10 by step 1011.
Furthermore the transport audio signals may be determined based on the audio signals as shown in Figure 10 by step 505.
The determined transport audio signals may then be encoded as shown in Figure 10 by step 509.
Then the encoded directional values, energy ratios and FoV parameters may be combined (together with the transport audio signals) as shown in Figure 10 by step 513.
With respect to Figure 11 is shown an example decoder (and renderer) 261 suitable for receiving the encoded signals generated by the encoder as shown in Figure 9.
The difference between the decoder 261 as shown in Figure 11 and the decoder as shown in Figure 6 is that the direction decoder 1005 is configured to receive the decoded FoV parameters associated with the capture device and determine the resolution of the encoding used to encode the direction parameters based on the FoV parameters (and in some embodiments the energy ratios also). The direction decoder is further configured to receive the encoded directions (for example the direction index or otherwise encoded values), perform an inverse operation to the direction encoder as described previously using this spatial resolution, and output the decoded direction values. In the example shown in Figure 11 the FoV parameters from the capture and playback devices may furthermore be used to modify the directions.
With respect to Figure 12 is shown a flow diagram showing the operations of the decoder as shown in Figure 11.
Thus the encoded signals or data is obtained and demultiplexed as shown in Figure 12 by step 701.
The energy ratio values from the demultiplex operation can then be decoded as shown in Figure 12 by step 703.
Furthermore the capture device FoV information/parameters may be decoded as shown in Figure 12 by step 1205.
The direction values can then be determined based on the index and the quantization resolution (the spatial resolution) determined from the decoded FoV information/parameters (and in some embodiments the energy ratios) as shown in Figure 12 by step 1211.
Having obtained the capture device FoV information/parameters and the playback device FoV information/parameters the decoded direction parameters may then be modified based on the capture device FoV information/parameters and the playback device FoV information/parameters as shown in Figure 12 by step 713.
The audio signals can then be rendered based on the spatial parameters, and the decoded audio signals as shown in Figure 12 by step 715.
With respect to Figures 13 and 18 is shown the concept behind some further embodiments wherein low bit rate coding of spatial direction metadata is implemented employing a spatial resolution (or grid points in quantization) of the metadata based on a resolution which is equally spaced inside the playback FoV.
In such embodiments the encoder in the capture device is configured to receive the playback device FoV information or parameters.
In some embodiments the capture device is in communication with the playback device. The devices determine their FoV parameters and the playback device communicates its FoV information to the capture device. In some embodiments the playback device is further configured to determine a current user view angle and further communicate this information to the capture device. In some embodiments the view angle is determined by detecting the user angle with respect to the playback display. A suitable sensor may be used to determine the user angle, for example a camera, ToF sensor, 3D sensor, user mobile phone location etc. In some embodiments the capture device is configured to define or determine a spatial resolution (quantization resolution) for audio direction parameters or metadata by selecting grid points so that they are angularly equally separated for the playback device FoV (and furthermore the user view angle). This is shown, for example, in Figures 13 and 18.
Figure 13 for example shows a capture device which has a defined capture FoV 1301 centred at point A. Furthermore is shown a playback device with a defined playback FoV 1303 centred at point B. In this example both the capture FoV 1301 and the playback FoV 1303 share a common central axis FAB; as such, the view from the playback device towards point F, directly central to both the capture FoV 1301 and the playback FoV 1303, is the same as the experienced 'view' from the capture device. However, off the common axis the experienced distances and angles will differ.
In this example there are shown capture angles α_C, β_C, γ_C, δ_C, ε_C, ζ_C which represent angles between quantization grid points for the capture device, and playback angles α_P, β_P, γ_P, δ_P, ε_P, ζ_P which represent how the grid point separation is perceived from the current user position with respect to the playback device display (the line from C to I). The playback angles α_P, β_P, γ_P, δ_P, ε_P, ζ_P in some embodiments are defined such that they are equal. From a trigonometric analysis implemented by either the capture or playback device the correct grid points can be determined for the capture device.
For example in some embodiments the quantization grid spatial resolution is defined such that from the viewpoint of the playback device the angle between quantization grid points is 10 degrees. However, in some embodiments the grid may have any suitable resolution, and furthermore the resolution may be determined such that the playback angles are perceptually defined and narrower towards the centre. In such embodiments the capture angles may be defined to be narrower near the FoV edges because of the user viewing perspective.
In some embodiments, where the playback FoV is very wide (for example greater than 80 to 130 degrees), more quantization grid points are needed, and where the playback FoV is narrow (for example less than 40 to 80 degrees) fewer quantization grid points are needed. In some embodiments the quantization resolution is defined such that there is a grid point every 10 degrees (although a suitable range of values may be from 2 to 30 degrees).
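A minimal sketch of this scaling, assuming the hypothetical one-point-per-step rule just mentioned:

```python
def n_grid_points(playback_fov_deg, step_deg=10.0):
    """Number of grid points inside the playback FoV, one roughly every
    step_deg degrees (2 to 30 degrees being typical step values)."""
    return max(2, int(round(playback_fov_deg / step_deg)) + 1)
```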
In some embodiments where the user view angle (or playback angle) is not in line with the capture device such as shown in Figure 18 then this information may furthermore be used to assist in the determination of the quantization grid points.
For example, as shown in Figure 18, the difference between the playback FoV 1803 and the playback FoV 1303 shown in Figure 13 is that the playback FoV is offset by an angle (the playback view angle) θ. If the playback device display width (the line segment from C to I) is a, then the distance of the user from the device (the line segment m) is

$$m=\frac{\sqrt{a^{2}\cos^{2}\theta+a^{2}\sin^{2}\theta+a^{2}\cos^{2}\beta\cos^{2}\theta-a^{2}\cos^{2}\beta\sin^{2}\theta+2a^{2}\cos\beta\cos\theta\sqrt{\cos^{2}\theta+\sin^{2}\theta-\cos^{2}\beta\sin^{2}\theta}}}{\sqrt{4\left(\cos^{4}\theta+\sin^{4}\theta-\cos^{2}\beta\cos^{4}\theta-\cos^{2}\beta\sin^{4}\theta+2\cos^{2}\theta\sin^{2}\theta-2\cos^{2}\beta\cos^{2}\theta\sin^{2}\theta\right)}}$$

where β is the playback FoV. The line segment from B to I, c, is

$$c=\sqrt{\left(\frac{a}{2}-m\sin\theta\right)^{2}+\left(m\cos\theta\right)^{2}}$$

and the angle CIB, γ, is

$$\gamma=\arcsin\left(\frac{m\cos\theta}{c}\right)$$

In such embodiments, where there are N points horizontally on the grid, the playback FoV (β) 1803 is divided into N equally large sectors. Thus the distance IH is

$$\frac{c\sin\frac{\beta}{N}}{\sin\left(180°-\gamma-\frac{\beta}{N}\right)}$$

the distance IG is

$$\frac{c\sin\frac{2\beta}{N}}{\sin\left(180°-\gamma-\frac{2\beta}{N}\right)}$$

the distance IF is

$$\frac{c\sin\frac{3\beta}{N}}{\sin\left(180°-\gamma-\frac{3\beta}{N}\right)}$$

and so forth. As such, the grid points for azimuth may therefore be calculated from these distances. The angle ζ_C is then

$$\zeta_{C}=\arctan\left(\frac{\frac{a}{2}-\frac{c\sin\frac{\beta}{N}}{\sin\left(180°-\gamma-\frac{\beta}{N}\right)}}{\frac{a/2}{\tan(\alpha/2)}}\right)$$

where the capture FoV is notated α. Similarly ζ_C + ε_C is

$$\zeta_{C}+\varepsilon_{C}=\arctan\left(\frac{\frac{a}{2}-\frac{c\sin\frac{2\beta}{N}}{\sin\left(180°-\gamma-\frac{2\beta}{N}\right)}}{\frac{a/2}{\tan(\alpha/2)}}\right)$$

and ζ_C + ε_C + δ_C is

$$\zeta_{C}+\varepsilon_{C}+\delta_{C}=\arctan\left(\frac{\frac{a}{2}-\frac{c\sin\frac{3\beta}{N}}{\sin\left(180°-\gamma-\frac{3\beta}{N}\right)}}{\frac{a/2}{\tan(\alpha/2)}}\right)$$

and so forth. In this way it may be possible to determine all the azimuth grid points for quantizing the capture audio directions so that the grid points are equally spaced as azimuth angles on the playback side for the current user FoV and view angle.
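As an illustration of this trigonometry, the following minimal sketch computes the capture-side azimuth grid points from the measured geometry (display width a, user distance m, view angle θ) and recovers the playback FoV β from the triangle directly rather than from the closed-form expression for m above; the function name is hypothetical, and with θ = 0 the sketch reduces to the on-axis case of Figure 13.

```python
import math

def capture_grid_azimuths(a, m, theta_deg, capture_fov_deg, n_points):
    """Capture-device azimuth grid (degrees, measured from the FoV centre)
    whose points appear equally spaced in angle from the user's position."""
    theta = math.radians(theta_deg)
    # user position B relative to display edges I and C (display width a)
    c = math.hypot(a / 2.0 - m * math.sin(theta), m * math.cos(theta))  # |BI|
    b = math.hypot(a / 2.0 + m * math.sin(theta), m * math.cos(theta))  # |BC|
    gamma = math.asin(m * math.cos(theta) / c)   # angle CIB
    beta = math.pi - gamma - math.asin(m * math.cos(theta) / b)  # playback FoV
    h = (a / 2.0) / math.tan(math.radians(capture_fov_deg) / 2.0)
    azimuths = []
    for k in range(n_points + 1):
        # distance from edge I to the k-th grid point (law of sines)
        d_k = c * math.sin(k * beta / n_points) / math.sin(
            math.pi - gamma - k * beta / n_points)
        azimuths.append(math.degrees(math.atan((a / 2.0 - d_k) / h)))
    return azimuths

print(capture_grid_azimuths(a=1.0, m=1.5, theta_deg=10.0,
                            capture_fov_deg=90.0, n_points=6))
```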
In some implementations, as discussed in earlier applications, the psychoacoustic properties of human hearing can be taken into account so that the quantization grid (playback angles) is narrower near the centre of the FoV and wider near the edge of the FoV. For example, angles α_P and δ_P could be narrower and angles β_P, γ_P, ε_P, and ζ_P could be wider. The playback angles can then be converted into capture grid points, for example using the equations above.
Such embodiments may be able to produce a good quality output when the capture device knows the playback display FoV and viewing angle for a single viewing position. Therefore, these embodiments may be usefully implemented for a single user viewing the content from the playback device.
Figure 14 for example shows a further example capture device encoder pre-processor 205 and encoder 207 configured to implement some embodiments such as described with respect to Figures 13 and 18.
The example capture device encoder pre-processor 205 is similar to the capture device encoder pre-processor 205 shown in Figure 4.
The example capture device encoder 207 differs from the encoder shown in Figure 4 in that the direction encoder 1425 is configured to furthermore receive the encoded (capture device) FoV parameters 430 and the playback FoV parameters 264. The direction encoder 1425 may then be configured to determine a quantization resolution and quantization grid for the direction parameters as shown in Figures 13 and 18 and described above.
With respect to Figure 15 is shown a flow diagram showing the operations of the encoder pre-processor and processor as shown in Figure 14.
An initial operation is obtaining the audio signals as shown in Figure 15 by step 501.
Furthermore the FoV information is obtained from the camera as shown in Figure 15 by step 502.
The FoV information may be analysed to determine the FoV parameters as shown in Figure 15 by step 504.
The FoV parameters may then be encoded to generate encoded FoV parameters as shown in Figure 15 by step 506.
The spatial parameters, for example the direction and energy ratios, may furthermore be determined, for example based on an analysis of the audio signals as shown in Figure 15 by step 503.
The energy ratios may then be encoded or quantized as shown in Figure 15 by step 507.
The playback device FoV (and direction) information may then be received or otherwise obtained as shown in Figure 15 by step 1501.
The directions may then be encoded or quantized based on a quantization resolution determined from the playback device FoV and direction information, and furthermore the capture device FoV information (and energy ratios), as shown in Figure 15 by step 1511.
Furthermore the transport audio signals may be determined based on the audio signals as shown in Figure 15 by step 505.
The determined transport audio signals may then be encoded as shown in Figure 15 by step 509.
Then the encoded directional values, energy ratios and FoV parameters may be combined (together with the transport audio signals) as shown in Figure 15 by step 513.
With respect to Figure 16 is shown an example decoder (and renderer) 261 suitable for receiving the encoded signals generated by the encoder as shown in Figure 14.
The difference between the decoder 261 as shown in Figure 16 and the decoders as shown in Figures 6 and 11 is that the direction decoder 1605 is configured to receive the FoV information (parameters), and in some embodiments the user playback angle, associated with the playback device, and determine the resolution of the encoding used to encode the direction parameters based on these FoV parameters. In some embodiments the decoded FoV parameters associated with the capture device and the energy ratios may also be used to determine this resolution. The quantization grid can thus be determined, which enables the determination of the directions when the direction decoder further receives the encoded directions. In the example shown in Figure 16 the FoV parameters from the capture and playback devices may furthermore be used to modify the directions.
With respect to Figure 17 is shown a flow diagram showing the operations of the decoder as shown in Figure 16.
Thus the encoded signals or data is obtained and demultiplexed as shown in Figure 17 by step 701.
The playback device FoV (and direction) information may then be obtained as shown in Figure 17 by step 702.
The energy ratio values from the demultiplex operation can then be decoded as shown in Figure 17 by step 703.
Furthermore the capture device FoV information/parameters may be decoded as shown in Figure 17 by step 705.
The direction values can then be determined based on the index and the quantization resolution (the spatial resolution) determined from the playback FoV information/parameters (and in some embodiments the capture FoV information and/or energy ratios) as shown in Figure 17 by step 1711.
Having obtained the capture device FoV information/parameters and the playback device FoV information/parameters the decoded direction parameters may then be modified based on the capture device FoV information/parameters and the playback device FoV information/parameters as shown in Figure 17 by step 713.
The audio signals can then be rendered based on the spatial parameters, and the decoded audio signals as shown in Figure 17 by step 715.
Although the examples shown above describe the determination of spatial resolution (the quantization grid) with respect to a horizontal direction and the modification of the direction in a horizontal direction it would be understood that a similar approach may be applied to vertical direction quantization determination and/or modification.
In some embodiments (since the horizontal direction is more important with regard to perception, and typical home theatre setups only have speakers in a horizontal plane) the determination of the spatial resolution and/or modification of the direction may be omitted for the vertical direction.
With respect to Figure 19 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronic device or apparatus. For example in some embodiments the device 1900 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1900 comprises at least one processor or central processing unit 1907. The processor 1907 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1900 comprises a memory 1911. In some embodiments the at least one processor 1907 is coupled to the memory 1911. The memory 1911 can be any suitable storage means. In some embodiments the memory 1911 comprises a program code section for storing program codes implementable upon the processor 1907. Furthermore in some embodiments the memory 1911 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1907 whenever needed via the memory-processor coupling.
In some embodiments the device 1900 comprises a user interface 1905. The user interface 1905 can be coupled in some embodiments to the processor 1907.
In some embodiments the processor 1907 can control the operation of the user interface 1905 and receive inputs from the user interface 1905. In some embodiments the user interface 1905 can enable a user to input commands to the device 1900, for example via a keypad. In some embodiments the user interface 1905 can enable the user to obtain information from the device 1900. For example the user interface 1905 may comprise a display configured to display information from the device 1900 to the user. The user interface 1905 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1900 and further displaying information to the user of the device 1900. In some embodiments the user interface 1905 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1900 comprises an input/output port 1909. The input/output port 1909 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1907 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IrDA).
The transceiver input/output port 1909 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1907 executing suitable code.
It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
As used in this application, the term "circuitry" may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation. This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
Computer software or program, also called program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The physical media is a non-transitory media.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.
The foregoing description has provided by way of non-limiting examples a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

Claims (21)

  1. An apparatus comprising means configured to: obtain at least one image and associated field-of-view information; obtain at least one audio signal; determine at least one direction parameter value for a time-frequency part of the at least one audio signal; encode the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
  2. The apparatus as claimed in claim 1, wherein the means is further configured to receive the further apparatus field-of-view information.
  3. The apparatus as claimed in claim 2, wherein the means configured to encode the direction parameter value based on a quantization spatial resolution is configured to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
  4. The apparatus as claimed in any of claims 1 to 3, wherein the means is further configured to receive the further apparatus viewing angle information.
  5. The apparatus as claimed in claim 4, wherein the means configured to encode the direction parameter value based on a quantization spatial resolution is configured to modify the at least one direction parameter value based on an apparatus capture angle associated with the field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
  6. The apparatus as claimed in any of claims 1 to 5, wherein the means configured to encode the direction parameter value based on a quantization spatial resolution is configured to generate a quantization grid based at least on the apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with larger angles between grid points towards the centre of the field-of-view.
  7. The apparatus as claimed in claim 2 or any claim dependent on claim 2, wherein the means configured to encode the direction parameter value based on a quantization spatial resolution is configured to generate a quantization grid based at least on the further apparatus field-of-view information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the viewpoint of the further apparatus.
  8. The apparatus as claimed in claim 4 or any claim dependent on claim 4, wherein the means configured to encode the direction parameter value based on a quantization spatial resolution is configured to generate a quantization grid based at least on the further apparatus viewing angle information such that within the apparatus field-of-view the quantization grid is configured with grid points which are equally spaced from each other from the viewpoint of the further apparatus.
  9. The apparatus as claimed in any of claims 1 to 8, wherein the means is configured to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the means configured to determine a quantization spatial resolution based on the field-of-view information is configured to: generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on the field-of-view information.
  10. The apparatus as claimed in any of claims 1 to 9, wherein the means is further configured to: encode the at least one image; encode the at least one audio signal; transmit to the further apparatus the encoded at least one image, the encoded at least one audio signal and the encoded direction parameter value.
  11. An apparatus comprising means configured to: obtain at least one image; obtain at least one field-of-view information associated with a display of the at least one image; obtain at least one audio signal; obtain at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
  12. The apparatus as claimed in claim 11, wherein the means is further configured to receive the further apparatus field-of-view information.
  13. The apparatus as claimed in claim 12, wherein the means is further configured to modify the at least one direction parameter value based on the apparatus field-of-view information and the further apparatus field-of-view information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
  14. The apparatus as claimed in any of claims 11 to 13, wherein the means is further configured to obtain an apparatus viewing angle information.
  15. The apparatus as claimed in claim 14, wherein the means is further configured to modify the at least one direction parameter value based on an apparatus capture angle associated with the further apparatus field-of-view information and the further apparatus viewing angle information such that the modified at least one direction parameter value is located relatively at the same position within the further apparatus field-of-view and the apparatus field-of-view.
  16. The apparatus as claimed in any of claims 11 to 15, wherein the means configured to obtain at least one direction parameter value is configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus field-of-view information.
  17. The apparatus as claimed in claim 12 or any of claims 13 to 15 when dependent on claim 12, wherein the means configured to obtain at least one direction parameter value is configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the further apparatus field-of-view information.
  18. The apparatus as claimed in claim 14 or 15, wherein the means configured to obtain at least one direction parameter value is configured to: obtain at least one encoded direction parameter value; determine a quantization spatial resolution used for the encoding of the encoded direction parameter value, wherein the quantization spatial resolution is based on the apparatus viewing angle information.
  19. The apparatus as claimed in any of claims 11 to 18, wherein the means is configured to obtain at least one energy ratio for the time-frequency part, wherein each energy ratio is associated with a respective direction parameter value, and the means configured to obtain at least one direction parameter value is configured to: obtain at least one encoded direction parameter value; generate a first quantization grid based on the at least one energy ratio; and modify the first quantization grid based on at least one of the apparatus field-of-view information and the further apparatus field-of-view information.
  20. A method for an apparatus, the method comprising: obtaining at least one image and associated field-of-view information; obtaining at least one audio signal; determining at least one direction parameter value for a time-frequency part of the at least one audio signal; encoding the direction parameter value based on a quantization spatial resolution such that the encoded direction parameter value is configured to be modified based on the apparatus field-of-view information and a further apparatus field-of-view information.
  21. A method for an apparatus, the method comprising: obtaining at least one image; obtaining at least one field-of-view information associated with a display of the at least one image; obtaining at least one audio signal; obtaining at least one direction parameter value associated with a time-frequency part of the at least one audio signal, wherein the direction parameter value is modified based on the apparatus field-of-view information and a further apparatus field-of-view information, wherein the further apparatus field-of-view information is associated with the capture of the at least one image.
GB2014257.6A 2020-09-10 2020-09-10 Spatial audio parameter encoding and associated decoding Withdrawn GB2598751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2014257.6A GB2598751A (en) 2020-09-10 2020-09-10 Spatial audio parameter encoding and associated decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2014257.6A GB2598751A (en) 2020-09-10 2020-09-10 Spatial audio parameter encoding and associated decoding

Publications (2)

Publication Number Publication Date
GB202014257D0 GB202014257D0 (en) 2020-10-28
GB2598751A true GB2598751A (en) 2022-03-16

Family

ID=73149643

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2014257.6A Withdrawn GB2598751A (en) 2020-09-10 2020-09-10 Spatial audio parameter encoding and associated decoding

Country Status (1)

Country Link
GB (1) GB2598751A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016057935A1 (en) * 2014-10-10 2016-04-14 Qualcomm Incorporated Screen related adaptation of hoa content
GB2554925A (en) * 2016-10-14 2018-04-18 Nokia Technologies Oy Display of visual data with a virtual reality headset
EP3751563A1 (en) * 2018-02-07 2020-12-16 Sony Corporation Transmission device, transmission method, processing device, and processing method

Also Published As

Publication number Publication date
GB202014257D0 (en) 2020-10-28

Similar Documents

Publication Publication Date Title
WO2019086757A1 (en) Determination of targeted spatial audio parameters and associated spatial audio playback
US11832078B2 (en) Signalling of spatial audio parameters
US20230197086A1 (en) The merging of spatial audio parameters
US20230047237A1 (en) Spatial audio parameter encoding and associated decoding
US20230402053A1 (en) Combining of spatial audio parameters
US20220366918A1 (en) Spatial audio parameter encoding and associated decoding
US11483669B2 (en) Spatial audio parameters
US20240046939A1 (en) Quantizing spatial audio parameters
KR20220047821A (en) Quantization of spatial audio direction parameters
US20230335143A1 (en) Quantizing spatial audio parameters
GB2598751A (en) Spatial audio parameter encoding and associated decoding
EP4162486A1 (en) The reduction of spatial audio parameters
WO2022223133A1 (en) Spatial audio parameter encoding and associated decoding
US20230410823A1 (en) Spatial audio parameter encoding and associated decoding
US20240079014A1 (en) Transforming spatial audio parameters
US20230197087A1 (en) Spatial audio parameter encoding and associated decoding
CA3237983A1 (en) Spatial audio parameter decoding
WO2020193865A1 (en) Determination of the significance of spatial audio parameters and associated encoding

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)