WO2022223133A1 - Spatial audio parameter encoding and associated decoding - Google Patents

Spatial audio parameter encoding and associated decoding

Info

Publication number
WO2022223133A1
WO2022223133A1 (PCT/EP2021/060754; EP2021060754W)
Authority
WO
WIPO (PCT)
Prior art keywords
encoded
parameter
directional
energy ratio
audio signal
Prior art date
Application number
PCT/EP2021/060754
Other languages
French (fr)
Inventor
Tapani PIHLAJAKUJA
Lasse Juhani Laaksonen
Adriana Vasilache
Mikko-Ville Laitinen
Anssi Sakari RÄMÖ
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy
Priority to PCT/EP2021/060754
Publication of WO2022223133A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/06: Speech or voice analysis techniques characterised by the extracted parameters being correlation coefficients
    • G10L25/21: Speech or voice analysis techniques characterised by the extracted parameters being power information

Definitions

  • the present application relates to apparatus and methods for sound-field related parameter encoding, but not exclusively for time-frequency domain direction related parameter encoding for an audio encoder and decoder.
  • the immersive voice and audio services (IVAS) codec is an extension of the 3GPP EVS (enhanced voice services) codec and intended for new immersive voice and audio services over 4G/5G.
  • Such immersive services include, e.g., immersive voice and audio for virtual reality (VR).
  • the multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is expected to support a variety of input formats, such as channel-based and scene-based inputs. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
  • Metadata-assisted spatial audio is one input format proposed for IVAS. It uses audio signal(s) together with corresponding spatial metadata.
  • the spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain for example, directions and direct-to-total energy ratios in frequency bands.
  • the MASA stream can, for example, be obtained by capturing spatial audio with microphones of a suitable capture device. For example a mobile device comprising multiple microphones may be configured to capture microphone signals where the set of spatial metadata can be estimated based on the captured microphone signals.
  • the MASA stream can be obtained also from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (for example, a 5.1 audio channel mix) or other content by means of a suitable format conversion.
  • an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal
  • the apparatus comprising means configured to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • the means configured to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be configured to: determine a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generate a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
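  • As an illustration only, the following sketch shows how such grid-based direction encoding might look; the bit-allocation table is hypothetical, invented for the example rather than taken from this application.

```python
# Hypothetical bit-allocation table: more bits when the energy ratio is
# high and the spread coherence is low (the direction is then perceptually
# more important). The values are invented for illustration.
AZIMUTH_BITS = [
    [11, 9, 7],  # high energy ratio
    [9, 7, 5],   # medium energy ratio
    [7, 5, 3],   # low energy ratio
]  # indexed as [ratio_index][spread_coherence_index]

def encode_azimuth(azimuth_deg, ratio_index, coherence_index):
    """Quantize an azimuth on a uniform grid whose resolution is selected
    from the encoded energy ratio and encoded spread coherence."""
    bits = AZIMUTH_BITS[ratio_index][coherence_index]
    step = 360.0 / (1 << bits)
    codeword = int(round((azimuth_deg % 360.0) / step)) % (1 << bits)
    return codeword, bits

def decode_azimuth(codeword, ratio_index, coherence_index):
    """The decoder derives the same grid, because it decodes the energy
    ratio and spread coherence before the direction."""
    bits = AZIMUTH_BITS[ratio_index][coherence_index]
    return codeword * 360.0 / (1 << bits)
```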
  • the means configured to generate the encoded spread coherence parameter vector element; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be configured to generate the encoded spread coherence parameter vector element and to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the means may be further configured to, based on the format of the at least one audio signal being a second format: encode the at least one directional parameter based on encoded at least one energy ratio; and generate the encoded spread coherence parameter based on an encoded at least one directional parameter.
  • the second format may be a metadata-assisted spatial audio format.
  • the means configured to generate the encoded spread coherence parameter based on an encoded at least one directional parameter may be configured to: select a codebook based on the encoded at least one directional parameter; and generate an encoded spread coherence parameter based on the selection of the codebook.
  • the means configured to encode the at least one directional parameter based on encoded at least one energy ratio may be configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
  • the means configured to select a codebook based on the encoded at least one directional parameter may be further configured to select a codebook based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • the means configured to generate an encoded spread coherence parameter may be configured to: discrete cosine transform a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generate an encoded spread coherence parameter vector element.
  • the means configured to discrete cosine transform the vector formed from the at least one spread coherence parameter may be further configured to generate at least one first order discrete cosine transformed spread coherence parameter vector element, and the means may be configured to encode the at least the first order discrete cosine transformed spread coherence parameter vector element.
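  • A minimal sketch of the DCT-based spread coherence encoding described above, assuming simple scalar quantization of the retained coefficients; the quantization step and the number of kept coefficients are assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct, idct

def encode_spread_coherence(spread_coh, n_kept=2, step=0.1):
    """DCT the per-subband spread coherence vector and keep only the
    lowest-order coefficients (zero order, optionally first order)."""
    coeffs = dct(np.asarray(spread_coh, dtype=float), norm='ortho')
    return np.round(coeffs[:n_kept] / step).astype(int)  # scalar quantization

def decode_spread_coherence(indices, n_bands, step=0.1):
    """Dequantize the kept coefficients, zero the rest, invert the DCT."""
    coeffs = np.zeros(n_bands)
    coeffs[:len(indices)] = np.asarray(indices, dtype=float) * step
    return np.clip(idct(coeffs, norm='ortho'), 0.0, 1.0)
```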
  • an apparatus for decoding encoded spatial audio signal parameters comprising means configured to: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • the means configured to determine at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may be configured to: determine a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generate at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
  • the means may be configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the means may be further configured to, based on the format of the at least one audio signal being a second format: determine at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio; select a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generate the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
  • the second format may be a metadata-assisted spatial audio format.
  • the means configured to decode the at least one directional parameter based on encoded at least one energy ratio may be configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
  • the means configured to select a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may be further configured to select a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • the means may be configured to determine at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
  • the means configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
  • a method for an apparatus configured to encode spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encoding the at least one energy ratio to generate an encoded at least one energy ratio; generating an encoded spread coherence parameter; and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • Encoding the at least one directional parameter based on the encoded at least one energy ratio and the encoded spread coherence parameter may comprise: determining a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generating a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
  • Generating the encoded spread coherence parameter vector element; and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may comprise generating the encoded spread coherence parameter vector element and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the method may further comprise: determining the format of the at least one audio signal being a second format; encoding the at least one directional parameter based on encoded at least one energy ratio based on determining the format of the at least one audio signal being the second format; and generating the encoded spread coherence parameter based on an encoded at least one directional parameter.
  • the second format may be a metadata-assisted spatial audio format.
  • Generating the encoded spread coherence parameter based on an encoded at least one directional parameter may comprise: selecting a codebook based on the encoded at least one directional parameter; and generating an encoded spread coherence parameter based on the selection of the codebook.
  • Encoding the at least one directional parameter based on encoded at least one energy ratio may comprise: determining a quantization grid arrangement based on the at least one energy ratio; and generating a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
  • Selecting a codebook based on the encoded at least one directional parameter may further comprise selecting a codebook based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • Generating an encoded spread coherence parameter may comprise: discrete cosine transforming a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generating an encoded spread coherence parameter vector element.
  • Discrete cosine transforming the vector formed from the at least one spread coherence parameter may further comprise generating at least one first order discrete cosine transformed spread coherence parameter vector element, and the method may further comprise encoding the at least the first order discrete cosine transformed spread coherence parameter vector element.
  • a method for an apparatus configured to decode encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determining at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • Determining at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may comprise: determining a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generating at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
  • Determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the method may further comprise: determining the format of the at least one audio signal being a second format; determining at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio, based on the determining of the format of the at least one audio signal being a second format; selecting a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generating the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
  • the second format may be a metadata-assisted spatial audio format.
  • Decoding the at least one directional parameter based on encoded at least one energy ratio may comprise: determining a quantization grid arrangement based on the at least one energy ratio; and generating a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
  • Selecting a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may further comprise selecting a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • the method may comprise determining at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
  • Determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may comprise determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
  • an apparatus configured to encode spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal
  • the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • the apparatus caused to encode the at least one directional parameter based on the encoded at least one energy ratio and the encoded spread coherence parameter may be caused to: determine a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generate a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
  • the apparatus caused to generate the encoded spread coherence parameter vector element; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be caused to generate the encoded spread coherence parameter vector element and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the apparatus may be further caused to: determine the format of the at least one audio signal being a second format; encode the at least one directional parameter based on encoded at least one energy ratio based on determining the format of the at least one audio signal being the second format; and generate the encoded spread coherence parameter based on an encoded at least one directional parameter.
  • the second format may be a metadata-assisted spatial audio format.
  • the apparatus caused to generate the encoded spread coherence parameter based on an encoded at least one directional parameter may be caused to: select a codebook based on the encoded at least one directional parameter; and generate an encoded spread coherence parameter based on the selection of the codebook.
  • the apparatus caused to encode the at least one directional parameter based on encoded at least one energy ratio may be caused to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
  • the apparatus caused to select a codebook based on the encoded at least one directional parameter may further be caused to select a codebook based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • the apparatus caused to generate an encoded spread coherence parameter may be caused to: discrete cosine transform a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generate an encoded spread coherence parameter vector element.
  • the apparatus caused to discrete cosine transform the vector formed from the at least one spread coherence parameter may further be caused to generate at least one first order discrete cosine transformed spread coherence parameter vector element, and the apparatus may further be caused to encode the at least the first order discrete cosine transformed spread coherence parameter vector element.
  • an apparatus configured to decode encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • the apparatus caused to determine at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may be caused to: determine a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generate at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
  • the apparatus caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be based on the at least one audio signal being a first format.
  • the first format may be a multi-channel input format.
  • the apparatus may be caused to: determine the format of the at least one audio signal being a second format; determine at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio, based on the determining of the format of the at least one audio signal being a second format; select a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generate the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
  • the second format may be a metadata-assisted spatial audio format.
  • the apparatus caused to decode the at least one directional parameter based on encoded at least one energy ratio may be caused to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
  • the apparatus caused to select a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may further be caused to select a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
  • the quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
  • the apparatus may be caused to determine at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
  • the apparatus caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
  • an apparatus for encoding spatial audio signal parameters comprising: means for obtaining the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; means for encoding the at least one energy ratio to generate an encoded at least one energy ratio; means for generating an encoded spread coherence parameter; and means for encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal comprising: means for obtaining the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and means for determining at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • an apparatus for encoding spatial audio signal parameters comprising: obtaining circuitry configured to obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encoding circuitry configured to encode the at least one energy ratio to generate an encoded at least one energy ratio; generating circuitry configured to generate an encoded spread coherence parameter; and encoding circuitry configured to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • an apparatus for decoding encoded spatial audio signal parameters comprising: obtaining circuitry configured to obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determining circuitry configured to determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • a computer readable medium comprising program instructions for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
  • a computer readable medium comprising program instructions for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
  • An apparatus comprising means for performing the actions of the method as described above.
  • An apparatus configured to perform the actions of the method as described above.
  • a computer program comprising program instructions for causing a computer to perform the method as described above.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures
  • Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments
  • Figure 2 shows schematically the metadata encoder according to some embodiments
  • Figure 3 shows a flow diagram of the operation of the example metadata encoder as shown in Figure 2 according to some embodiments
  • Figure 4 shows schematically an example coherence encoder as shown in figure 2 within the metadata encoder according to some embodiments
  • Figure 5 shows a flow diagram of the operation of the example coherence metadata encoder as shown in Figure 4 according to some embodiments;
  • Figure 6 shows schematically an example vector encoder according to some embodiments;
  • Figure 7 shows schematically a direction encoder as shown in Figure 2 according to some embodiments
  • Figure 8 shows a flow diagram of the operation of the example vector encoder as shown in Figure 6 and direction encoder as shown in Figure 7 according to some embodiments;
  • Figure 9 shows schematically an example decoder as shown in Figure 2 according to some embodiments.
  • Figures 10a and 10b show flow diagrams of the operation of the example decoder as shown in Figure 9 for different formats according to some embodiments.
  • Figure 11 shows schematically an example device suitable for implementing the apparatus shown.
  • Metadata-Assisted Spatial Audio is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
  • spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction a direct-to-total ratio, distance, etc.) per time-frequency tile.
  • the spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene.
  • a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe (and, associated with each direction, direct-to-total ratios, distance values, etc.).
  • bandwidth and/or storage limitations may require a codec not to send spatial metadata parameter values for each frequency band and temporal sub-frame.
  • parametric spatial metadata representation can use multiple concurrent spatial directions.
  • in MASA, the proposed maximum number of concurrent directions is two.
  • for each direction, parameters such as Direction index, Direct-to-total ratio, Spread coherence, and Distance are defined.
  • in addition, other parameters such as Diffuse-to-total energy ratio, Surround coherence, and Remainder-to-total energy ratio are defined.
  • the direction index may be encoded using a number of bits, for example 16, which defines a direction of arrival of the sound at a time-frequency parameter interval.
  • the encoding using a spherical representation with 16 bits enables a direction with about 1-degree accuracy where all directions are covered.
  • Direct-to-total ratios describe how much of energy comes from specific directions and may be calculated as energy in the direction against the total energy.
  • the Spread coherence represents a spread of energy associated with a direction index of a time-frequency tile (i.e., a measure of a ‘concentration of energy’ for a time-frequency subframe direction and defines whether the direction is to be reproduced as a point source or coherently around the direction).
  • a diffuse-to-total energy ratio defines an energy ratio of non-directional sound over surrounding directions and may be calculated as energy of non-directional sound against the total energy and describes how much of the energy does not come from any specific direction.
  • the surround coherence describes the coherence of the non-directional sound over the surrounding directions.
  • a remainder-to-total energy ratio defines the energy ratio of the remainder (such as microphone noise) sound energy and fulfils the requirement that the sum of the energy ratios is 1.
  • the Distance parameter defines the distance of the sound originating from the direction index. It may be defined in terms of time-frequency subframes and in meters on a logarithmic scale and may define a range of values, for example, 0 to 100 m.
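  • Because the energy ratios are defined to sum to 1, the remainder-to-total ratio follows from the others; a small sketch of that bookkeeping:

```python
def remainder_to_total(direct_ratios, diffuse_ratio):
    """The energy ratios partition the total energy, so the remainder
    (e.g. microphone noise) is whatever the other ratios leave over."""
    r = 1.0 - sum(direct_ratios) - diffuse_ratio
    return max(0.0, r)  # clamp against rounding error

# Two directions at 0.5 and 0.2 plus 0.25 diffuse leaves 0.05 remainder.
assert abs(remainder_to_total([0.5, 0.2], 0.25) - 0.05) < 1e-9
```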
  • MASA format may further comprise other parameters, such as:
  • Channel audio format which describes the following fields (and may be stored as two bytes):
  • Transport channel definition which describes the transport channels.
  • Source format which describes the original format from which the audio signals were created; and Source format description which may provide further description of the specific source format.
  • the raw bitrate of the MASA metadata is relatively high (about 310 kb/s for 1 direction and about 500 kb/s for 2 directions), so at lower bitrates it is expected that only the most important parts of the metadata will be conveyed from the encoder to the decoder. In practice, it is not possible to send parameter values for each frequency band, temporal sub-frame, and direction (at least for most practical bit rates).
  • the IVAS codec is configured to operate with a frame size of 20ms.
  • the MASA metadata update rate is typically once per 20ms. This allows for synchronization of the audio waveform and spatial metadata encoding.
  • the internal metadata structure for MASA allows for a 5-ms update rate by use of 4 temporal sub-frames.
  • MASA metadata contains a complex sound scene description model with multiple parameters.
  • the use of these parameters may vary from frame to frame and some practical systems creating the MASA format may not be able (or purposefully do not) analyse or form values for all parameters.
  • how the parameters are used in a valid MASA format input may vary significantly from frame to frame.
  • the MASA format contains multiple parameters. These parameters together perceptually represent the overall captured or generated sound field. Quantizing and encoding of the direction index has been defined in earlier approaches, for example a spherical-index-based system where a specific number of bits is allowed per unique direction parameter value. This number of bits is controlled by the direct-to-total ratio (or a similar diffuse-to-total ratio) in such a way that when the direct-to-total ratio is low, the accuracy of the direction can be lower as well. This is because a higher direct-to-total ratio is usually associated with perceptually more relevant direction parameters.
  • the spread coherence parameter in the MASA format affects the rendering of the spatial direction.
  • as the spread coherence value increases from 0, the reproduction becomes less and less point-like, until at 0.5 the sound is reproduced by the centre and the edges together, and at 1.0 only by the edges. In practice this means that when spread coherence is present, the perceived sound direction for the corresponding TF-tile is less point-like or even less accurate.
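  • One plausible interpolation of this behaviour for a centre loudspeaker and two edge loudspeakers is sketched below; the actual rendering gains are renderer specific, so this is illustrative only.

```python
def spread_coherence_gains(zeta):
    """Interpolate from a point source (zeta=0) through centre-plus-edges
    (zeta=0.5) to edges only (zeta=1.0); one plausible scheme, not the
    normative rendering."""
    if zeta <= 0.5:
        centre, edge = 1.0, 2.0 * zeta           # edges fade in up to 0.5
    else:
        centre, edge = 2.0 * (1.0 - zeta), 1.0   # centre fades out after 0.5
    norm = (centre ** 2 + 2.0 * edge ** 2) ** 0.5  # energy normalization
    return centre / norm, edge / norm            # (centre gain, per-edge gain)
```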
  • the embodiments described herein relate to an encoding of parametric spatial audio streams with transport audio signals and spatial metadata.
  • These embodiments feature a method which attempts to optimize the quantization of the spatial direction per TF-tile by analyzing the spread coherence parameter value in addition to or instead of the direct-to-total energy ratio parameter value (or similar energy ratio) and automatically selecting (an optimized) quantization accuracy and bit use for the corresponding direction parameter value.
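  • A toy policy capturing that selection is sketched below; the scaling constants and bit limits are assumptions for illustration, not values from this application.

```python
def direction_bits(direct_to_total, spread_coherence, max_bits=11, min_bits=3):
    """Spend fewer bits on the direction when the direct-to-total ratio is
    low, and fewer still when spread coherence is high, since the direction
    is then rendered less point-like anyway."""
    bits = min_bits + (max_bits - min_bits) * direct_to_total
    bits *= 1.0 - 0.5 * spread_coherence   # hypothetical reduction factor
    return max(min_bits, int(round(bits)))
```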
  • the system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131.
  • the ‘analysis’ part 121 is the part from receiving the multi-channel signals up to an encoding of the spatial metadata and transport signal, and the ‘synthesis’ part 131 is the part from a decoding of the encoded spatial metadata and transport signal to the presentation of the regenerated signal (for example in multi-channel loudspeaker form).
  • the ‘analysis’ part 121 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part. In other words in some embodiments the ‘analysis’ part 121 is an encoder comprising at least one of the transport signal generator or analysis processor as described hereafter.
  • the input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102.
  • the ‘analysis’ part 121 may comprise a transport signal generator 103, analysis processor 105, and encoder 107.
  • a microphone channel signal input is described, which can be two or more microphones integrated or connected onto a mobile device (e.g., a smartphone).
  • any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
  • suitable audio signal format inputs could be microphone arrays, e.g., B-format microphone, planar microphone array or Eigenmike, Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA), loudspeaker surround mix and/or objects, artificially created spatial mix, e.g., from audio or VR teleconference bridge, or combinations of the above.
  • the multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.
  • the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable audio signal format for encoding.
  • the transport signal generator 103 can for example generate a stereo or mono audio signal.
  • the transport audio signals generated by the transport signal generator can be any known format.
  • the transport signal generator 103 can be configured to select a left-right microphone pair, and apply any suitable processing to the audio signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization.
  • the transport signal generator can be configured to formulate directional beam signals towards left and right directions, such as two opposing cardioid signals. Additionally in some embodiments when the input is a loudspeaker surround mix and/or objects, then the transport signal generator 103 can be configured to generate a downmix signal that combines left side channels to a left downmix channel, combines right side channels to a right downmix channel and adds centre channels to both transport channels with a suitable gain.
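  • A sketch of such a downmix for a 5.1 input follows, with an assumed channel order and an assumed centre gain of about -3 dB; LFE handling is omitted for brevity.

```python
import numpy as np

def downmix_5_1_to_stereo(channels, centre_gain=0.7071):
    """Combine left-side channels into a left transport channel, right-side
    channels into a right one, and add the centre to both with a gain.
    Assumed channel order: [FL, FR, C, LFE, SL, SR]; LFE is ignored here."""
    fl, fr, c, _lfe, sl, sr = channels
    left = fl + sl + centre_gain * c
    right = fr + sr + centre_gain * c
    return np.stack([left, right])
```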
  • the transport signal generator is bypassed (or in other words is optional).
  • where the analysis and synthesis occur at the same device in a single processing step, without intermediate processing, there is no transport signal generation and the input audio signals are passed on unprocessed.
  • the number of transport channels generated can be any suitable number and not for example one or two channels.
  • the output of the transport signal generator 103 can be passed to an encoder 107.
  • the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce the spatial metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104.
  • the spatial metadata associated with the audio signals may be provided to the encoder as a separate bit-stream.
  • the multichannel signals 102 input comprises spatial metadata and this is passed directly to the encoder 107.
  • the analysis processor 105 may be configured to generate the spatial metadata parameters which may comprise, for each time-frequency analysis interval, at least one direction parameter 108 and at least one energy ratio parameter 110 (and in some embodiments other parameters such as described earlier and of which a non-exhaustive list includes number of directions, surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio, a spread coherence parameter, and distance parameter).
  • the direction parameter may be represented in any suitable manner, for example as spherical co-ordinates denoted as azimuth φ(k,n) and elevation θ(k,n).
  • the number of the spatial metadata parameters may differ from time-frequency tile to time-frequency tile.
  • for example, in band X all of the spatial metadata parameters are obtained (generated) and transmitted, whereas in band Y only one of the spatial metadata parameters is obtained and transmitted, and in band Z no parameters are obtained or transmitted.
  • the spatial metadata 106 may be passed to an encoder 107.
  • the analysis processor 105 is configured to apply a time-frequency transform to the input signals. Then, for example, when the input is a mobile phone microphone array, the analysis processor could be configured to estimate, in each time-frequency tile, delay values between microphone pairs that maximize the inter-microphone correlation. Based on these delay values the analysis processor may be configured to formulate a corresponding direction value for the spatial metadata. Furthermore, the analysis processor may be configured to formulate a direct-to-total ratio parameter based on the correlation value.
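  • A simplified time-domain sketch of that delay-based direction estimation for one microphone pair follows; practical implementations typically operate per frequency band, and the sign conventions here are assumptions.

```python
import numpy as np

def estimate_delay(x1, x2, max_lag):
    """Return the lag (in samples, max_lag >= 1) of x2 relative to x1 that
    maximizes the inter-microphone correlation."""
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.dot(x1[max_lag:-max_lag], np.roll(x2, -lag)[max_lag:-max_lag])
            for lag in lags]
    return int(lags[int(np.argmax(corr))])

def delay_to_azimuth(delay_samples, mic_distance_m, fs, c=343.0):
    """Map the delay to a direction of arrival for a two-microphone pair."""
    sin_az = np.clip(delay_samples * c / (fs * mic_distance_m), -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_az)))
```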
  • the analysis processor 105 can be configured to determine an intensity vector.
  • the analysis processor may then be configured to determine a direction parameter value for the spatial metadata based on the intensity vector.
  • a diffuse-to-total ratio can then be determined, from which a direct-to-total ratio parameter value for the spatial metadata can be determined.
  • This analysis method is known in the literature as Directional Audio Coding (DirAC).
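  • A sketch of such a DirAC-style analysis for one time-frequency tile of a FOA signal is given below; normalization and sign conventions vary between formulations, so this is illustrative rather than normative.

```python
import numpy as np

def dirac_analysis(W, X, Y, Z):
    """Estimate a direction and a direct-to-total ratio for one TF-tile of
    a FOA signal from the active intensity vector."""
    # Active intensity: real part of conjugate pressure times velocity.
    I = np.array([np.mean(np.real(np.conj(W) * X)),
                  np.mean(np.real(np.conj(W) * Y)),
                  np.mean(np.real(np.conj(W) * Z))])
    # Energy estimate combining pressure and velocity components.
    E = 0.5 * np.mean(np.abs(W) ** 2
                      + 0.5 * (np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2))
    azimuth = np.degrees(np.arctan2(I[1], I[0]))
    elevation = np.degrees(np.arctan2(I[2], np.hypot(I[0], I[1])))
    diffuseness = np.clip(1.0 - np.linalg.norm(I) / (E + 1e-12), 0.0, 1.0)
    return azimuth, elevation, 1.0 - diffuseness  # direct-to-total estimate
```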
  • the analysis processor 105 can be configured to divide the HOA signal into multiple sectors, in each of which the method above is utilized.
  • This sector-based method is known in the literature as higher order DirAC (HO-DirAC).
  • the analysis processor can be configured to convert the signal into a FOA/HOA signal(s) format and to obtain direction and direct-to-total ratio parameter values as above.
  • the encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport audio signals 104 and generate a suitable encoding of these audio signals.
  • the encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the audio encoding may be implemented using any suitable scheme.
  • the encoder 107 may furthermore comprise a spatial metadata encoder/quantizer 111 which is configured to receive the spatial metadata and output an encoded or compressed form of the information.
  • the encoder 107 may further interleave, multiplex to a single data stream or embed the spatial metadata within encoded downmix signals before transmission or storage, as shown in Figure 1 by the dashed line.
  • the multiplexing may be implemented using any suitable scheme.
  • the transport signal generator 103 and/or analysis processor 105 may be located on a separate device (or otherwise separate) from the encoder 107.
  • the spatial metadata (and associated non-spatial metadata) parameters associated with the audio signals may be provided to the encoder as a separate bit-stream.
  • the transport signal generator 103 and/or analysis processor 105 may be part of the encoder 107, i.e., located inside of the encoder and be on a same device.
  • synthesis part 131 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part.
  • the received or retrieved data may be received by a decoder/demultiplexer 133.
  • the decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport signal decoder 135 which is configured to decode the audio signals to obtain the transport audio signals.
  • the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded spatial metadata (for example a direction index representing a direction parameter value) and generate spatial metadata.
  • the decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the decoded metadata and transport audio signals may be passed to a synthesis processor 139.
  • the system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the transport audio signals and the spatial metadata and to re-create, in any suitable format, synthesized spatial audio in the form of multi-channel signals 110 (these may be in multichannel loudspeaker format or, in some embodiments, any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the transport signals and the spatial metadata.
  • the synthesis processor 139 thus creates the output audio signals, e.g., multichannel loudspeaker signals or binaural signals based on any suitable known method. This is not explained here in further detail.
  • the rendering can be performed for loudspeaker output according to any of the following methods; a sketch of these steps follows the list below.
  • the transport audio signals can be divided to direct and ambient streams based on the direct-to-total and diffuse-to-total energy ratios.
  • the direct stream can then be rendered based on the direction parameter(s) using amplitude panning.
  • the ambient stream can furthermore be rendered using decorrelation.
  • the direct and the ambient streams can then be combined.
  • the output signals can be reproduced using a multichannel loudspeaker setup or headphones which may be head-tracked.
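  • A sketch of the direct/ambient rendering steps listed above, for one mono transport tile; real renderers apply decorrelation filters to the ambient part, which is stubbed here.

```python
import numpy as np

def render_tile(transport, direct_to_total, pan_gains):
    """Energy-split one mono transport tile into direct and ambient parts
    by the direct-to-total ratio, amplitude-pan the direct part, spread the
    ambient part evenly (decorrelation stubbed), and sum."""
    n_out = len(pan_gains)
    direct = np.sqrt(direct_to_total) * transport
    ambient = np.sqrt(1.0 - direct_to_total) * transport
    out = np.outer(pan_gains, direct)                       # panned direct
    out += np.outer(np.full(n_out, 1.0 / np.sqrt(n_out)), ambient)
    return out  # shape: (n_out, n_samples)
```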
  • microphone signals from a mobile device are processed with a spatial audio capture system (containing the analysis processor and the transport signal generator), and the resulting spatial metadata and transport audio signals (e.g., in the form of a MASA stream) are forwarded to an encoder (e.g., an IVAS encoder), which contains the encoder.
  • in some embodiments there may be two input audio signals (e.g., 5.1 channel audio signals), where the first audio signal is processed by the apparatus shown in Figure 1 (resulting in data as an input for the encoder) and the second audio signal is directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder.
  • the audio input signals may then be encoded in the encoder independently or they may, e.g., be combined in the parametric domain according to what may be called, e.g., MASA mixing.
  • there may be a synthesis part which comprises separate decoder and synthesis processor entities or apparatus, or the synthesis part can comprise a single entity which comprises both the decoder and the synthesis processor.
  • the decoder block may process in parallel more than one incoming data stream.
  • the term synthesis processor may be interpreted as an internal or external renderer.
  • the system is configured to receive multi-channel audio signals. Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting some of the audio signal channels). The system is then configured to encode the transport audio signal for storage/transmission. After this the system may store/transmit the encoded transport audio signal and metadata. The system may retrieve/receive the encoded transport audio signal and metadata. Then the system is configured to extract the transport audio signal and metadata from the encoded transport audio signal and metadata parameters, for example by demultiplexing and decoding them.
  • the system (synthesis part) is configured to synthesize an output multi-channel audio signal based on the extracted transport audio signal and metadata.
  • the analysis processor 105 in some embodiments comprises a time-frequency domain transformer 201.
  • the time-frequency domain transformer 201 is configured to receive the multi-channel signals 102 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into suitable time-frequency signals.
  • These time-frequency signals may be passed to a spatial analyser 203.
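  • a minimal sketch of such a time to frequency domain transform, assuming a 50%-overlapped sine-windowed STFT (the actual transform, window, and frame length of the codec may differ):

```python
import numpy as np

def stft(x, win_len=512):
    """STFT of one channel; rows are time indices n, columns frequency bins b."""
    hop = win_len // 2
    win = np.sin(np.pi * (np.arange(win_len) + 0.5) / win_len)  # sine window
    frames = [win * x[i:i + win_len]
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.fft.rfft(np.asarray(frames), axis=1)
```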
  • time-frequency signals 202 may be represented in the time-frequency domain representation as s(b,n), where b denotes the frequency bin index and n the time index.
  • n can be considered as a time index with a lower sampling rate than that of the original time-domain signals.
  • Each subband k has a lowest bin b_k,low and a highest bin b_k,high, and the subband contains all bins from b_k,low to b_k,high.
  • the widths of the subbands can approximate any suitable distribution, for example the equivalent rectangular bandwidth (ERB) scale or the Bark scale.
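  • for example, bin edges approximating an ERB spacing could be derived as in the following sketch; the ERB-rate formula is the standard Glasberg-Moore approximation, and the codec's actual band edges may be tabulated differently:

```python
import numpy as np

def subband_edges(n_bins, fs, n_bands):
    """Return (b_k,low, b_k,high) bin pairs approximating an ERB spacing."""
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    erb = 21.4 * np.log10(1.0 + 0.00437 * freqs)         # ERB-rate of each bin centre
    targets = np.linspace(erb[0], erb[-1], n_bands + 1)  # equal steps on the ERB axis
    edges = np.searchsorted(erb, targets)
    edges[-1] = n_bins                                   # close the last band
    return [(int(edges[k]), int(edges[k + 1]) - 1) for k in range(n_bands)]
```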
  • the analysis processor 105 comprises a spatial analyser 203.
  • the spatial analyser 203 may be configured to receive the time- frequency signals 202 and based on these signals estimate direction parameters 108.
  • the direction parameters may be determined based on any audio based ‘direction’ determination.
  • the spatial analyser 203 is configured to estimate the direction with two or more signal inputs; this represents the simplest configuration to estimate a ‘direction’, and more complex processing may be performed with more signals.
  • the spatial analyser 203 may thus be configured to provide at least one azimuth and elevation for each frequency band and temporal time-frequency block within a frame of an audio signal, denoted as azimuth φ(k,n) and elevation θ(k,n).
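  • one possible (purely illustrative, not the codec's) two-channel direction estimate uses the cross-spectrum phase as a per-bin time difference of arrival; the microphone spacing, speed of sound, and plane-wave model are assumptions, and the estimate is only unambiguous below the spatial aliasing frequency:

```python
import numpy as np

def estimate_azimuth(S1, S2, fs, mic_dist=0.1, c=343.0):
    """Per-bin azimuth (degrees) from the cross-spectrum of two STFT channels."""
    n_bins = S1.shape[0]
    freqs = np.arange(n_bins) * fs / (2.0 * (n_bins - 1))   # rfft bin frequencies
    cross = S1 * np.conj(S2)
    tdoa = np.angle(cross) / (2.0 * np.pi * np.maximum(freqs, 1.0))
    sin_az = np.clip(tdoa * c / mic_dist, -1.0, 1.0)        # plane-wave model
    return np.degrees(np.arcsin(sin_az))
```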
  • the direction parameters 108 may also be passed to a direction index generator 205.
  • the spatial analyser 203 may also be configured to determine an energy ratio parameter 110.
  • the energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction.
  • the energy ratio may in some embodiments be a direct-to-total energy ratio.
  • the direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
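  • as one example of such a correlation measure, the magnitude-squared coherence of two channels within a TF tile can serve as a crude proxy for r(k,n); the following is a sketch under that assumption, not the codec's estimator:

```python
import numpy as np

def direct_to_total_ratio(S1, S2):
    """Crude r(k,n): magnitude-squared coherence of two channels over one tile."""
    num = np.abs(np.mean(S1 * np.conj(S2))) ** 2
    den = np.mean(np.abs(S1) ** 2) * np.mean(np.abs(S2) ** 2) + 1e-12
    return float(np.clip(num / den, 0.0, 1.0))
```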
  • the energy ratio may be passed to an energy ratio encoder 207.
  • the spatial analyser 203 may furthermore be configured to determine a number of coherence parameters, which may include surround coherence (γ(k,n)) 112 and spread coherence (ζ(k,n)) 114, both analysed in the time-frequency domain.
  • in summary, the analysis processor is therefore configured to receive time domain multichannel or other format (such as microphone or Ambisonics) audio signals.
  • the analysis processor may apply a time domain to frequency domain transform (e.g. STFT) to generate suitable time-frequency domain signals for analysis and then apply direction analysis to determine direction and energy ratio parameters.
  • the analysis processor may then be configured to output the determined parameters.
  • the parameters may be combined over several time indices. The same applies for the frequency axis: as has been expressed, the direction of several frequency bins b could be expressed by one direction parameter in band k consisting of several frequency bins b. The same applies for all of the spatial parameters discussed herein.
  • the directional data may be represented using 16 bits such that each azimuth parameter is represented on approximately 9 bits, and the elevation on 7 bits.
  • the energy ratio parameter may be represented on 8 bits.
  • the coherence data for each TF block may be a floating point representation between 0 and 1 and may be originally represented on 8 bits.
  • an example metadata encoder/quantizer 111 is shown according to some embodiments.
  • the metadata encoder/quantizer 111 may comprise a direction encoder 205.
  • the direction encoder 205 is configured to receive the direction parameters (such as the azimuth φ(k,n) and elevation θ(k,n)) 108, a coherence control 210 (and in some embodiments an expected bit allocation and/or an encoded (or quantized) energy ratio), and from these generate a suitable encoded output.
  • the expected bit allocation and/or quantized (or encoded) energy ratio is used to define an initial bit allocation (and thus a quantization accuracy) for each frequency band for coding the direction.
  • the encoding is based on an arrangement of spheres forming a spherical grid arranged in rings on a ‘surface’ sphere, defined by a look-up table according to the determined quantization resolution.
  • the spherical grid uses the idea of covering a sphere with smaller spheres and considering the centres of the smaller spheres as points defining a grid of almost equidistant directions. The smaller spheres therefore define cones or solid angles about the centre point which can be indexed according to any suitable indexing algorithm.
  • although spherical quantization is described here, any suitable quantization, linear or non-linear, may be used.
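  • a minimal sketch of such a ring-based spherical grid and nearest-point quantization follows; the ring spacing and per-ring point counts are illustrative assumptions rather than the codec's tabulated grid:

```python
import numpy as np

def spherical_grid(n_rings):
    """Near-equidistant (azimuth, elevation) grid arranged in elevation rings."""
    grid = []
    for el in np.linspace(-90.0, 90.0, n_rings):
        # fewer azimuth points towards the poles, proportional to cos(elevation)
        n_az = max(1, int(round(2 * n_rings * np.cos(np.radians(el)))))
        grid += [(360.0 * i / n_az - 180.0, el) for i in range(n_az)]
    return np.array(grid)

def quantize_direction(az, el, grid):
    """Index of the grid point nearest in great-circle distance."""
    a, e = np.radians(grid[:, 0]), np.radians(grid[:, 1])
    az, el = np.radians(az), np.radians(el)
    cos_d = np.sin(e) * np.sin(el) + np.cos(e) * np.cos(el) * np.cos(a - az)
    return int(np.argmax(cos_d))
```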
  • the direction encoder 205 is configured to determine a variance of the quantized azimuth parameter value and pass this to the coherence encoder 209.
  • the encoded direction parameters may then be passed to the combiner 211.
  • the metadata encoder/quantizer 111 may comprise an energy ratio encoder 207.
  • the energy ratio encoder 207 is configured to receive the energy ratios and determine a suitable encoding for compressing the energy ratios for the sub-bands and the time-frequency blocks. For example in some embodiments the energy ratio encoder 207 is configured to use 3 bits to encode each energy ratio parameter value.
  • only one weighted average value per sub band is transmitted or stored.
  • the average may be determined by taking into account the total energy of each time block, thus favouring the values of the sub-bands having more energy.
  • the quantized energy ratio value is the same for all the TF blocks of a given sub-band.
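  • a sketch of this energy-weighted averaging and a 3-bit quantization follows; the uniform quantizer is an assumption of the sketch (the codec may use a trained codebook):

```python
import numpy as np

def encode_subband_ratio(ratios, energies, bits=3):
    """One energy-weighted average ratio per sub-band, quantized to `bits` bits."""
    w = np.asarray(energies) / (np.sum(energies) + 1e-12)  # weights favour high energy
    avg = float(np.sum(w * np.asarray(ratios)))
    levels = (1 << bits) - 1
    index = int(round(avg * levels))       # uniform quantizer (assumption)
    return index, index / levels           # index and dequantized value
```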
  • the energy ratio encoder 207 is further configured to pass the quantized (encoded) energy ratio value to the direction encoder 205, the combiner 211 and to the coherence encoder 209.
  • the metadata encoder/quantizer 111 may comprise a coherence encoder 209.
  • the coherence encoder 209 is configured to receive the coherence values (such as the surround coherence 112 and spread coherence 114 values), the encoded energy ratio value and the quantized azimuth values and determine a suitable encoding for compressing the coherence values for the sub-bands and the time-frequency blocks.
  • a 3-bit precision value for the coherence parameter values has been shown to produce acceptable audio synthesis results, but even then this would require a total of 3x20 bits for the coherence data for all TF blocks (in the example, 8 sub-bands and 5 TF blocks per frame).
  • the encoding is implemented in the DCT domain and may be dependent on the current (quantized) energy ratio and variance of quantized azimuth values.
  • the coherence encoder 209 is configured to determine whether the input audio signals are a multi-channel format. Where the input audio signals are a multi-channel format, the coherence encoder 209 is configured to generate quantized or compressed coherence value options based on the encoded energy ratio values (for example by encoding in the DCT domain as indicated hereafter) and a coherence value indicator is then generated. These coherence values can then be used to assist in the compression or encoding of the directional values.
  • the coherence encoder 209 can be configured to reserve space in the bitstream for spread coherence codewords. Following the encoding of the directional values, these encoded directional values can then be used to assist in the selection of the quantized coherence value options.
  • the encoded coherence parameter values may then be passed to the combiner 211.
  • the metadata encoder/quantizer 111 may comprise a combiner 211.
  • the combiner is configured to receive the encoded (or quantized/compressed) directional parameters, energy ratio parameters and coherence parameters and combine these to generate a suitable output (for example a metadata bit stream which may be combined with the transport signal or be separately transmitted or stored from the transport signal).
  • with respect to Figure 3 there is shown an example operation of the metadata encoder/quantizer shown in Figure 2 according to some embodiments.
  • the initial operation is obtaining the metadata (such as azimuth values, elevation values, energy ratios, coherence values) as shown in Figure 3 by step 301.
  • the energy ratio values are compressed or encoded (for example by generating a weighted average per sub-band and then quantizing these as a 3 bit value) as shown in Figure 3 by step 303.
  • the next operation is determining whether the input audio signals are a multi-channel format as shown in Figure 3 by step 304.
  • where the input audio signals are a multi-channel format, quantized or compressed coherence value options are generated based on the encoded energy ratio values (for example by encoding in the DCT domain as indicated hereafter) and a coherence value indicator is generated as shown in Figure 3 by step 305.
  • the directional values may then be compressed or encoded (for example by applying a spherical quantization, or any suitable compression) based on the encoded ratio values and the coherence value indicator.
  • the generation of the compressed directional value is shown in Figure 3 by step 307.
  • the coherence encoder can be configured to reserve space in the bitstream for spread coherence codewords as shown in Figure 3 by step 306.
  • the quantized azimuth values are generated by encoding/quantizing the directional values as shown in Figure 3 by step 308.
  • the quantized coherence value is selected from the options based on the variance of the quantized azimuth value as shown in Figure 3 by step 309.
  • the encoded directional values, energy ratios, and coherence values are then combined to generate the encoded metadata which can be output as shown in Figure 3 by step 311.
  • the coherence encoder 209 comprises a coherence vector generator 401.
  • the coherence vector generator 401 is configured to receive the coherence values 112, which may be 8-bit floating point representations between 0 and 1.
  • the coherence vector generator 401 is configured for each sub-band to generate a vector of coherence values.
  • the coherence vector generator 401 is configured to generate an M dimensional vector of coherence data 402.
  • the coherence data vector 402 is output to the discrete cosine transformer 403.
  • the coherence encoder 209 comprises the discrete cosine transformer 403.
  • the discrete cosine transformer 403 may be configured to receive the M dimensional coherence data vector 402 and discrete cosine transform (DCT) the vector.
  • any suitable method for performing a DCT may be implemented.
  • for example, the vector may be a 4-dimensional vector of coherence values corresponding to a sub-band.
  • the DCT coherence vector 404 may then be output to the vector encoder 405.
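  • for illustration, an orthonormal DCT-II of a short (e.g. M = 4) coherence vector can be computed as below; `coeffs[0]` corresponds to the first DCT parameter (DCT0) discussed later:

```python
import numpy as np

def dct_ii(v):
    """Orthonormal DCT-II of a length-M coherence vector."""
    M = len(v)
    n = np.arange(M)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / M)  # rows: order k
    scale = np.full(M, np.sqrt(2.0 / M))
    scale[0] = np.sqrt(1.0 / M)
    return scale * (basis @ v)

# e.g. the 4 spread coherence values of one sub-band:
coeffs = dct_ii(np.array([0.2, 0.3, 0.25, 0.2]))
```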
  • the coherence encoder 209 comprises a vector encoder 405.
  • the vector encoder 405 is configured to receive the DCT coherence vector 404 and encode it by using a suitable codebook.
  • the vector encoder 405 is configured to receive or otherwise obtain the encoded/quantized energy ratio 412 and the variance of the quantized (encoded) azimuth 414 (which may be determined from the energy ratio encoder and the direction encoder as shown in Figure 2), and to obtain all possible first DCT parameter (DCT0) quantization options of the spread coherence (for multi-channel input audio format signals), or to encode the spread coherence (for other input formats such as the MASA input format), based on the quantized known direct-to-total energy ratio and/or the variance of the quantized (encoded) azimuth 414.
  • the encoding of the first DCT parameter is implemented in a manner different than the encoding of further DCT parameters. This is because the first and further DCT parameters have significantly different distributions.
  • the encoded coherence 406 values can then be output.
  • 3 bits are used to encode each energy ratio value and only one weighted average value per subband is generated and transmitted (and/or stored). This means that the quantized energy ratio value is the same for all the TF blocks of a given subband.
  • the variance of the encoded azimuth is configured to influence the selection for the first DCT parameter based on whether the variance of the quantized azimuth within the subband is very small (under a determined threshold) or larger than the threshold.
  • the sub-band limit l_N may, for example, be 3.
  • the sub-bands up to the selected sub-band limit l_N are encoded using a first number of secondary DCT parameters and the remaining sub-bands are encoded using a second number of secondary DCT parameters.
  • the first number in some embodiments is 1 and the second number is 2.
  • the first operation is obtaining the coherence parameter values as shown in Figure 5 by step 501.
  • Having obtained the coherence parameter values for the frame, the next operation is to generate M dimensional coherence vectors for each sub-band as shown in Figure 5 by step 503.
  • the M dimensional coherence vectors are then transformed, for example using a discrete cosine transform (DCT), as shown in Figure 5 by step 505.
  • the DCT representations are sorted into sub-bands below the determined sub-band selection value and above the value as shown in Figure 5 step 507; in other words, it is determined whether the current sub-band being processed is less than or equal to l_N or more than l_N.
  • the DCT representations for M dimensional coherence vectors for sub-bands less than or equal to l_N are then encoded by encoding the first 2 components of the DCT transformed vector as shown in Figure 5 step 509.
  • the DCT representations for M dimensional coherence vectors for sub-bands more than l_N are then encoded by encoding the first 3 components of the DCT transformed vector as shown in Figure 5 step 511.
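  • i.e., the number of retained DCT components per sub-band can be chosen as in this small sketch (l_N = 3 as suggested above; the counts follow the two steps of Figure 5):

```python
def coeffs_to_encode(dct_vec, subband, l_n=3):
    """Keep DCT0 plus 1 secondary coefficient up to l_N, plus 2 above it."""
    n_kept = 2 if subband <= l_n else 3
    return dct_vec[:n_kept]
```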
  • the M dimensional vector of coherence data is DCT transformed
  • the vector encoder 405 is shown receiving the DCT coherence vector 404 as an input.
  • the vector encoder in some embodiments comprises a surround coherence encoder 603.
  • the surround coherence encoder 603 is configured to receive the surround coherence parameters and from these encode the surround coherence parameter (and calculate the number of bits for surround coherence).
  • the surround coherence encoder 603 is configured to transmit one surround coherence value per sub-band. In a manner as described with respect to the encoding of the energy ratio, the value may be obtained in some embodiments as a weighted average of the time-frequency blocks of the sub-band, the weights being determined by the signal energies.
  • the averaged surround coherence values are scalar quantized with codebooks whose length (number of codewords) is dependent on the energy ratio index (2, 3, 4, 5, 6, 7, 8, and 8 codewords for the indexes 0, 1, 2, 3, 4, 5, 6, 7, respectively).
  • the indexes in some embodiments are encoded using a Golomb-Rice encoder on the mean-removed values, or by joint encoding taking into account the number of codewords used (in other words, selecting either entropy coding, such as GR coding, or joint coding based on which one encodes the value with fewer bits).
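  • a sketch of Golomb-Rice coding of mean-removed indexes follows; the zigzag mapping of the signed differences and the Rice parameter k = 1 are assumptions of this sketch:

```python
def golomb_rice(value, k=1):
    """Golomb-Rice codeword (bit string) for a non-negative integer."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")

def encode_mean_removed(indices, k=1):
    """Encode quantization indexes as a shared mean plus GR-coded residuals."""
    mean = round(sum(indices) / len(indices))
    bits = ""
    for idx in indices:
        d = idx - mean
        u = 2 * d if d >= 0 else -2 * d - 1   # zigzag: signed -> non-negative
        bits += golomb_rice(u, k)
    return mean, bits
```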
  • the surround coherence encoder 603 can then be configured to output the encoded surround coherence values to a coherence value combiner 613.
  • the vector encoder 405 in some embodiments further comprises a DCT order 1 (and further order, e.g. 2) spread coherence encoder 601.
  • the DCT order 1 (and further order) spread coherence encoder 601 is configured to receive the DCT coherence vector 404 and from this encode the DCT parameter of order 1 (and order 2 onwards for the sub-bands which encode further secondary parameters) for the spread coherence, using Golomb-Rice coding on the mean-removed quantization indexes.
  • the indexes in some embodiments are obtained from scalar quantization in codebooks dependent on the index of the sub-band.
  • the number of code-words can be the same for all sub-bands, for example 5 code-words; however, in some embodiments the number of code-words may differ from sub-band to sub-band.
  • the output encoded DCT order 1 (and 2 onwards) encoded spread coherence parameters can be passed to the coherence value combiner 613.
  • the vector encoder comprises a DCT order 0 (DCT0) quantization option determiner 605.
  • the DCT order 0 (DCT0) quantization option determiner 605 is configured to obtain the encoded/quantized energy ratio 412 values and all possible DCT0 quantization options of the spread coherence based on the quantized known direct-to-total ratio and all direction-based codebook alternatives.
  • the vector encoder comprises a DCT order 0 (DCT0) quantized spread coherence determiner 607.
  • the DCT order 0 (DCT0) quantized spread coherence determiner 607 is configured to obtain all possible quantized spread coherence values from the DCT coefficients based on all possible DCT0 quantization options of the spread coherence. These can be passed to a DCT0 coefficient selector 609 and a quantized spread coherence value indicator generator 611.
  • the vector encoder 405 comprises a DCT0 coefficient selector/indicator generator 609.
  • the DCT0 coefficient selector/indicator generator 609 is configured to obtain the (all possible) quantized spread coherence values from the DCT coefficients, an input format indicator 410, and in some embodiments a variance of encoded azimuth 414 value.
  • the DCT0 coefficient selector/indicator generator 609 is configured to, when the input format indicator 410 indicates that the audio signals are in the multi-channel input format, select the codebook of the DCT coefficient of order zero corresponding to the higher azimuth variance.
  • the DCT0 coefficient selector/indicator generator 609 employs a dedicated optimized codebook. The resulting quantized spread coherence values can then be used to determine the resolution used for the encoding of the directional parameters, azimuth and elevation.
  • the DCT0 coefficient selector/indicator generator 609 is configured to deterministically choose one of the options for use as the quantized spread coherence value indicator x'. In some embodiments this deterministic selection can be by selecting the minimum quantized spread coherence value.
  • the quantized spread coherence value indicator x' can then be output as a coherence control value to the direction encoder 205.
  • these resulting quantized spread coherence values can be output as coherence control (quantized spread coherence value indicators) 210.
  • the DCT0 coefficient selector/indicator generator 609 is configured to, when the input format indicator 410 indicates that the audio signals are in a format other than a multi-channel input format (for example a MASA input format), quantize the DCT coefficient of order zero with a codebook dependent on the variance of the quantized azimuth. The decision on which codebook to use can in such embodiments be implemented after the directional data is quantized. In this case the spread coherence does not influence the quantization resolution of the direction.
  • the selected DCT order 0 (DCT0) coefficient can then be passed to the coherence value combiner 613.
  • the vector encoder 405 comprises a coherence value combiner 613.
  • the coherence value combiner 613 is configured to receive the selected one of the precomputed DCT0 coefficients for the spread coherence value, the DCT order 1 (and further order) encoded parameters and the encoded surround coherence parameters, and to generate combined coherence values which can be output as the encoded coherence values 406.
  • the direction encoder 205 comprises a quantization grid determiner 701.
  • the quantization grid determiner 701 is configured to receive the coherence control, the quantized spread coherence value indicator, 210 and the encoded/quantized energy ratio 412.
  • the quantization grid determiner 701 is configured to select or determine a quantization grid based on the quantized spread coherence value indicator 210 and the encoded/quantized energy ratio 412.
  • the quantization grid which defines the quantization and encoding accuracy of the direction parameter can in some embodiments be reduced when the corresponding quantized direct-to-total ratio parameter value is low. This is because a low direct-to-total ratio signifies that the sound for the specific TF tile is not arriving from any single apparent direction; thus, the direction does not need to be quantized as accurately.
  • the quantization grid determination takes into account that a non-zero spread coherence parameter value signifies that a normally point-like sound source is intended to be coherently reproduced from an area surrounding the direction defined by the quantized direction parameter value, when the spread coherence value is around 0.5.
  • the quantization grid determination furthermore is configured so that a normally point-like sound source is intended to be coherently reproduced from the end points of the area surrounding the direction defined by the quantized direction parameter value when the spread coherence value is 1.0.
  • the grid can be determined so that sound is reproduced within a range of ±30° azimuth using one, three, or two loudspeakers.
  • a lower accuracy grid may be used, as the lower accuracy grid is able to encode the direction parameter within the correct area range.
  • the direction encoder is configured to take the spread coherence into account by comparing the quantized spread coherence value indicator 210 to a threshold a. If the quantized spread coherence value indicator 210 is larger than this threshold, then the quantization grid determiner 701 is configured to reduce the quantization accuracy of the direction parameter (e.g., by decreasing number of bits given to encode the direction parameter).
  • the quantized spread coherence value indicator 210 is compared to a further threshold b, and if the quantized spread coherence value indicator 210 is less than the further threshold then the quantization accuracy of the direction parameter can be increased, for example by increasing the number of bits given to encode the direction parameter.
  • suitable example values for a and b are 0.35 and 0.05 respectively, and the respective adjustments may be -2 and +1 bits (a sketch follows below).
  • this approach can be extended such that there are multiple thresholds, or a look-up table which adjusts the direction parameter quantization accuracy based on the quantized spread coherence value indicator 210.
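  • the threshold logic above, using the example values a = 0.35 and b = 0.05 and the -2/+1 bit adjustments from the text, can be sketched as:

```python
def adjust_direction_bits(bits, spread_indicator, a=0.35, b=0.05):
    """Adjust direction quantization accuracy from the spread coherence indicator."""
    if spread_indicator > a:
        return max(bits - 2, 0)   # wide source: coarser direction suffices
    if spread_indicator < b:
        return bits + 1           # point-like source: refine the direction
    return bits
```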
  • the determined quantization grid can be passed to the direction quantizer/encoder 703.
  • the direction encoder 205 comprises a direction quantizer/encoder 703.
  • the direction quantizer/encoder 703 is configured to receive the determined quantization grid from the quantization grid determiner 701 and the direction parameters 108. The direction quantizer/encoder 703 is then configured to quantize the direction parameters 108 based on the determined quantization grid. In some embodiments the direction quantizer/encoder 703 is configured to further compress or encode the quantized direction parameters. The encoded direction parameters 108 can then be output.
  • the direction encoder 205 further comprises an encoded direction variance determiner 705.
  • the encoded direction variance determiner 705 is configured to receive the encoded direction parameters and determine a variance of the encoded azimuth 414 value which is configured to be passed to the vector encoder 405.
  • with respect to Figure 8 there is shown a flow diagram of an example operation of the encoders, and specifically of the example vector encoder 405 and direction encoder 205.
  • the direction values are obtained as shown in Figure 8 by step 802.
  • the DCT vector values are obtained as shown in Figure 8 by step 801.
  • the surround coherence values are encoded as shown in Figure 8 by step
  • the DCT1 (and further orders such as DCT2 onwards where implemented) of the spread coherence values are then encoded as shown in Figure 8 by step 805.
  • the energy ratio values are then encoded as shown in Figure 8 by step 853.
  • when the input format is a MASA input format, space is reserved in the bitstream for spread coherence codewords, or when the input format is a multi-channel input format, the quantized coherence value is generated based on the higher azimuth variance option, as shown in Figure 8 by step 811.
  • a quantization grid is determined based on the encoded energy ratio values (and the selected quantized spread coherence value indicator when the input format is a multi-channel input format) as shown in Figure 8 by step 804.
  • the direction values can then be encoded based on the determined quantization grid (and any further suitable encoding) as shown in Figure 8 by step 806.
  • a variance of the encoded direction values is determined, except for multi-channel input formats, as shown in Figure 8 by step 808.
  • the variance of the encoded direction values, for example when the input format is a MASA input format, can then be used to select one of the codevectors from the corresponding codebook (the DCT0 coefficients for spread coherence value encoding) as shown in Figure 8 by step 813.
  • the encoded metadata can then be output as shown in Figure 8 by step 815.
  • for the multi-channel input format there is one codebook for encoding the DCT0 coefficients and the determined spread coherence value influences the accuracy of the direction quantization, whereas for the MASA input format there are two (or more in some embodiments) codebooks for the encoding of the DCT0 coefficient.
  • the codebook is selected based on the variance of the encoded or quantized azimuth; however, in the MASA input format the spread coherence does not influence the direction quantization resolution.
  • with respect to Figure 9 there is shown an example metadata extractor (or decoder) 137, as part of the decoder 133, from the viewpoint of extracting and decoding the direction parameters, energy ratio parameters and coherence parameter values according to some embodiments.
  • the metadata extractor 137 comprises a metadata demultiplexer 901.
  • the encoded datastream 212 is passed to the metadata demultiplexer 901.
  • the metadata demultiplexer 901 is configured to extract the encoded direction indices, energy ratio indices and coherence indices (and may also in some embodiments extract the other metadata and transport audio signals not shown).
  • the demultiplexed encoded energy ratios 902 can be passed to the energy ratio decoder 903, and also passed to the initial quantized grid determiner 923, and the DCT order 0 (DCT0) quantization option determiner 907.
  • the metadata extractor 137 comprises an initial quantized grid determiner 923. The initial quantized grid determiner 923 is configured to receive the encoded energy ratios 902 and generate the initial quantized grid information and pass this to the direction decoder 927.
  • the metadata extractor 137 in some embodiments comprises an energy ratio decoder 903.
  • the energy ratio decoder 903 is configured to obtain the demultiplexed encoded energy ratios 902 and decode the demultiplexed encoded energy ratios 902 to generate the decoded energy ratio parameters 952 for the frame by performing the inverse of the encoding of the energy ratios implemented by the energy ratio encoder.
  • the metadata extractor 137 comprises a coherence decoder 905.
  • the coherence (Surround and DCT0, DCT1 and DCT2) decoder 905 is configured to receive the demultiplexed encoded coherence parameters 904.
  • the coherence (Surround and DCT0, DCT1 and DCT2) decoder 905 is configured to decode the encoded surround coherence parameters in an inverse operation as that performed to encode the surround coherence parameters in the surround coherence encoder 603 as shown in the example vector encoder 405.
  • the decoded surround coherence parameters 956 furthermore are output.
  • the coherence decoder 905 furthermore is configured to decode the DCT0 and DCT1 (and further order spread coherence elements) in an operation which is the inverse of the DCT0 and DCT1 (and further order) spread coherence parameter encoding implemented within the vector encoder 405.
  • the metadata extractor 137 comprises a DCT order 0 (DCT0) quantization option determiner 907.
  • the DCT order 0 (DCT0) quantization option determiner 907 is configured to receive the encoded energy ratios 902 and precompute all possible decoded spread coherence quantization options based on the known quantized direct-to-total ratio value and all possible direction-based codebooks.
  • the possible decoded spread coherence quantization options can be passed to the (DCT0) Quantized spread coherence determiner 909.
  • the metadata extractor 137 comprises a (DCT0) Quantized spread coherence determiner 909.
  • the (DCT0) Quantized spread coherence determiner 909 is configured to obtain all possible decoded spread coherence quantization options based on the known quantized direct-to-total ratio value and all possible direction-based codebooks, together with the decoded DCT0 parameters, and to generate all possible decoded spread coherence values and pass these to a quantized spread coherence value indicator generator 911 and the decoded DCT0 coefficient selector 913.
  • the metadata extractor 137 comprises a DCT0 coefficient selector/indicator generator 913.
  • the DCT0 coefficient selector/indicator generator 913 is configured to obtain the (all possible) quantized spread coherence values from the DCT coefficients, an input format indicator 410, and in some embodiments a variance of encoded azimuth 914 value.
  • the DCT0 coefficient selector/indicator generator 913 is configured to, when the input format indicator 410 indicates that the audio signals are in the multi-channel input format, select the codebook of the DCT coefficient of order zero corresponding to the higher azimuth variance.
  • the DCT0 coefficient selector/indicator generator 913 employs a dedicated optimized codebook.
  • the DCT0 coefficient selector/indicator generator 913 is configured to, when the input format indicator 410 indicates that the audio signals are in a format other than a multi-channel input format (for example a MASA input format), quantize the DCT coefficient of order zero with a codebook dependent on the variance of the quantized azimuth.
  • the decision on which codebook to use can be implemented in such embodiments after the directional data is quantized. In this case the spread coherence does not influence the quantization resolution of the direction.
  • the selected DCT0 coefficients as well as the further DCT order coefficients are then passed to the IDCT 915.
  • the metadata extractor 137 comprises a direction decoder 927.
  • the direction decoder 927 is configured to receive the encoded direction parameters, the initial direction quantization grid information, and the quantized spread coherence value indicator 912 and decode the direction parameters based on the initial direction quantization grid information and the quantized spread coherence value indicator 912 (which can adjust the grid resolution).
  • the decoded direction parameters 950 can then be output.
  • the variance of the decoded azimuth 914 can then be output to the decoded DCT0 coefficient selector 913.
  • the metadata extractor 137 furthermore comprises an inverse discrete cosine transformer (IDCT) 915 which is configured to receive the DCT0 and further DCT order coefficients and to inverse discrete cosine transform them, using an inverse of the process used in the encoder.
  • the inverse discrete cosine transformed coefficients can then be passed to a vector decoder 917.
  • the metadata extractor 137 can furthermore be configured with a vector decoder 917 configured to receive the inverse discrete cosine transformed coefficients and decode these to generate the decoded spread coherence vector parameters 954.
  • the value of the spread coherence, or a suitable spread coherence value indicator 912 can be passed to the direction decoder 927 to modify the quantization grid of the direction and then dequantize the direction for the multichannel input data.
  • the resulting decoded quantized spread coherence values can then be used to determine the resolution used for decoding the directional parameters, azimuth and elevation.
  • the vector decoder 917 is configured to determine a quantized spread coherence value indicator x'. In some embodiments this determination can be by selecting the minimum quantized spread coherence value.
  • the quantized spread coherence value indicator x' can then be output as a coherence control value 912 to the direction decoder 927.
  • with respect to Figure 10a there is shown an example operation of the metadata extractor shown in Figure 9 when the input format of the audio signal is a MASA input format.
  • the initial operation is one of obtaining/demultiplexing the encoded metadata as shown in Figure 10a by step 1001.
  • the directional quantization grid can then be determined based on the energy ratios as shown in Figure 10a by step 1003.
  • the coherence (Surround and DCT1 and possibly DCT2) values are decoded (according to any known method) based on the energy ratios as shown in Figure 10a by step 1005.
  • the variance of the decoded azimuth values is determined as shown in Figure 10a by step 1015.
  • the variance of the decoded azimuth values is then used to select the decoded DCT0 coefficient from the multiple codebook options as shown in Figure 10a by step 1017.
  • decoded DCT0 and further order coefficients are then inverse DCT transformed as shown in Figure 10a by step 1019.
  • the initial operation is one of obtaining/demultiplexing the encoded metadata as shown in Figure 10b by step 1001.
  • the initial directional quantization grid can then be determined based on the energy ratios as shown in Figure 10b by step 1053.
  • the DCT order 0 (DCT0) values are then decoded/determined (and in some embodiments this may be based on the energy ratios) as shown in Figure 10b by step 1057.
  • the determination of the Surround, DCT0, DCT1 and DCT2 values may be obtained in a single operation, i.e. decoding DCT0 and DCT1 (and possibly DCT2) in a single step.
  • the quantized spread coherence value indicator is generated as shown in Figure 10b by step 1011.
  • the direction values can then be decoded based on the quantized spread coherence value indicator as shown in Figure 10b by step 1063.
  • decoded DCT0 and further order coefficients are then inverse DCT transformed as shown in Figure 10b by step 1019.
  • the coherence vector is decoded and then the metadata output as shown in Figure 10b by step 1021.
  • the direction parameter encoding/decoding is implemented based on the DCT-based spread coherence encoding.
  • the direction grid or accuracy adjustment is not based on the spread coherence (and is thus based only on the energy ratios).
  • encoding and decoding can be done separately and direction quantization accuracy can be adjusted directly.
  • the above embodiments furthermore feature using quantized spread coherence values (or indicators), but these values can be replaced with encoded and decoded indices as long as the information is present.
  • the above embodiments feature a direction quantization accuracy being dependent on both direct-to-total ratio value and spread coherence value.
  • in some embodiments the quantization accuracy is defined based only on the spread coherence value.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 may be a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • circuitry may refer to one or more or all of the following: hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); combinations of hardware circuits and software, such as a combination of analog and/or digital hardware circuits with software/firmware, or any portions of hardware processors with software that work together to cause an apparatus to perform various functions; and hardware circuits and/or processors that require software (e.g., firmware) for operation, but where the software may not be present when it is not needed for operation.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device, or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
  • the embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • Computer software or a program, also called a program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and comprises program instructions to perform particular tasks.
  • a computer program product may comprise one or more computer- executable components which, when the program is run, are configured to carry out embodiments.
  • the one or more computer-executable components may be at least one software code or portions of it.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVDs and the data variants thereof, or CDs.
  • the physical media are non-transitory media.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
  • Embodiments of the disclosure may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Abstract

An apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.

Description

SPATIAL AUDIO PARAMETER ENCODING AND ASSOCIATED DECODING
Field
The present application relates to apparatus and methods for sound-field related parameter encoding, but not exclusively for time-frequency domain direction related parameter encoding for an audio encoder and decoder.
Background
The immersive voice and audio services (IVAS) codec is an extension of the 3GPP EVS (enhanced voice services) codec and intended for new immersive voice and audio services over 4G/5G. Such immersive services include, e.g., immersive voice and audio for virtual reality (VR). The multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is expected to support a variety of input formats, such as channel-based and scene-based inputs. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Metadata-assisted spatial audio (MASA) is one input format proposed for IVAS. It uses audio signal(s) together with corresponding spatial metadata. The spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain for example, directions and direct-to-total energy ratios in frequency bands. The MASA stream can, for example, be obtained by capturing spatial audio with microphones of a suitable capture device. For example a mobile device comprising multiple microphones may be configured to capture microphone signals where the set of spatial metadata can be estimated based on the captured microphone signals. The MASA stream can be obtained also from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (for example, a 5.1 audio channel mix) or other content by means of a suitable format conversion.
One such conversion method is disclosed in Tdoc S4-191167 (Nokia Corporation: Description of the IVAS MASA C Reference Software; 3GPP TSG-SA4#106 meeting; 21-25 October 2019, Busan, Republic of Korea).
Summary
There is provided according to a first aspect an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
The means configured to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be configured to: determine a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generate a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
The means configured to generate the encoded spread coherence parameter vector element; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be configured to generate the encoded spread coherence parameter vector element and to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
The first format may be a multi-channel input format. The means may be further configured to, based on the format of the at least one audio signal being a second format: encode the at least one directional parameter based on encoded at least one energy ratio; and generate the encoded spread coherence parameter based on an encoded at least one directional parameter. The second format may be a metadata-assisted spatial audio format.
The means configured to generate the encoded spread coherence parameter based on an encoded at least one directional parameter may be configured to: select a codebook based on the encoded at least one directional parameter; and generate an encoded spread coherence parameter based on the selection of the codebook.
The means configured to encode the at least one directional parameter based on encoded at least one energy ratio may be configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
The means configured to select a codebook based on the encoded at least one directional parameter may be further configured to select a codebook based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
The means configured to generate an encoded spread coherence parameter may be configured to: discrete cosine transform a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generate an encoded spread coherence parameter vector element.
The means configured to discrete cosine transform the vector formed from the at least one spread coherence parameter may be further configured to generate at least one first order discrete cosine transformed spread coherence parameter vector element, and the means may be configured to encode the at least the first order discrete cosine transformed spread coherence parameter vector element.
According to a second aspect there is provided an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
The means configured to determine at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may be configured to: determine a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generate at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
The means may be configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter based on the at least one audio signal being a first format.
The first format may be a multi-channel input format.
The means may be further configured to, based on the format of the at least one audio signal being a second format: determine at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio; select a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generate the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
The second format may be a metadata-assisted spatial audio format.
The means configured to decode the at least one directional parameter based on encoded at least one energy ratio may be configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
The means configured to select a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may be further configured to select a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
The means may be configured to determine at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
The means configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
According to a third aspect there is provided a method for an apparatus configured to encode spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encoding the at least one energy ratio to generate an encoded at least one energy ratio; generating an encoded spread coherence parameter; and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
Encoding the at least one directional parameter based on the encoded at least one energy ratio and the encoded spread coherence parameter may comprise: determining a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generating a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
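By way of illustration only (this is not the IVAS or MASA reference implementation; the uniform grid and the function name are assumptions for this sketch), the grid-then-codeword step can be pictured in Python as follows, with the bit allocation supplied externally from the encoded energy ratio and spread coherence:

```python
def encode_azimuth(azimuth_deg: float, bits: int) -> int:
    """Quantize an azimuth value on a uniform grid of 2**bits points.

    A simplified stand-in for the quantization-grid step: the grid
    density (and hence accuracy) is set by the bit allocation, which
    in the described method depends on the encoded energy ratio and
    the encoded spread coherence parameter.
    """
    n_points = 2 ** bits
    step = 360.0 / n_points
    # Wrap the azimuth into [0, 360) and take the nearest grid point.
    index = int(round((azimuth_deg % 360.0) / step)) % n_points
    return index  # the codeword
```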
Generating the encoded spread coherence parameter vector element; and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may comprise generating the encoded spread coherence parameter vector element and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
The first format may be a multi-channel input format.
The method may further comprise: determining the format of the at least one audio signal being a second format; encoding the at least one directional parameter based on encoded at least one energy ratio based on determining the format of the at least one audio signal being the second format; and generating the encoded spread coherence parameter based on an encoded at least one directional parameter.
The second format may be a metadata-assisted spatial audio format.
Generating the encoded spread coherence parameter based on an encoded at least one directional parameter may comprise: selecting a codebook based on the encoded at least one directional parameter; and generating an encoded spread coherence parameter based on the selection of the codebook.
Encoding the at least one directional parameter based on encoded at least one energy ratio may comprise: determining a quantization grid arrangement based on the at least one energy ratio; and generating a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
Selecting a codebook based on the encoded at least one directional parameter may further comprise selecting a codebook based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
Generating an encoded spread coherence parameter may comprise: discrete cosine transforming a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generating an encoded spread coherence parameter vector element.

Discrete cosine transforming the vector formed from the at least one spread coherence parameter may further comprise generating at least one first order discrete cosine transformed spread coherence parameter vector element, and the method may further comprise encoding the at least the first order discrete cosine transformed spread coherence parameter vector element.
According to a fourth aspect there is provided a method for an apparatus configured to decode encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determining at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
Determining at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may comprise: determining a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generating at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
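A matching decoder-side sketch, under the same assumptions as the encoder sketch above: because the grid is reconstructed from parameters the decoder has already decoded (energy ratio and spread coherence), no additional bits are needed to signal the grid itself.

```python
def decode_azimuth(index: int, bits: int) -> float:
    """Inverse of encode_azimuth: map a codeword back to the grid value.

    The decoder derives the same grid (here: 2**bits uniform points)
    from already-decoded parameters, then looks up the grid point.
    """
    step = 360.0 / (2 ** bits)
    return (index * step) % 360.0
```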
Determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be based on the at least one audio signal being a first format.
The first format may be a multi-channel input format.
The method may further comprise: determining the format of the at least one audio signal being a second format; determining at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio, based on the determining of the format of the at least one audio signal being a second format; selecting a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generating the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
The second format may be a metadata-assisted spatial audio format.
Decoding the at least one directional parameter based on encoded at least one energy ratio may comprise: determining a quantization grid arrangement based on the at least one energy ratio; and generating a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
Selecting a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may further comprise selecting a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
The method may comprise determining at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
Determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may comprise determining the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
According to a fifth aspect there is provided an apparatus configured to encode spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
The apparatus caused to encode the at least one directional parameter based on the encoded at least one energy ratio and the encoded spread coherence parameter may be caused to: determine a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generate a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
The apparatus caused to generate the encoded spread coherence parameter vector element; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter may be caused to generate the encoded spread coherence parameter vector element and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
The first format may be a multi-channel input format.
The apparatus may be further caused to: determine the format of the at least one audio signal being a second format; encode the at least one directional parameter based on encoded at least one energy ratio based on determining the format of the at least one audio signal being the second format; and generate the encoded spread coherence parameter based on an encoded at least one directional parameter.
The second format may be a metadata-assisted spatial audio format.
The apparatus caused to generate the encoded spread coherence parameter based on an encoded at least one directional parameter may be caused to: select a codebook based on the encoded at least one directional parameter; and generate an encoded spread coherence parameter based on the selection of the codebook.

The apparatus caused to encode the at least one directional parameter based on encoded at least one energy ratio may be caused to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
The apparatus caused to select a codebook based on the encoded at least one directional parameter may further be caused to select a codebook based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
The apparatus caused to generate an encoded spread coherence parameter may be caused to: discrete cosine transform a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generate an encoded spread coherence parameter vector element.
The apparatus caused to discrete cosine transform the vector formed from the at least one spread coherence parameter may further be caused to generate at least one first order discrete cosine transformed spread coherence parameter vector element, and the apparatus may further be caused to encode the at least the first order discrete cosine transformed spread coherence parameter vector element.
According to a sixth aspect there is provided an apparatus configured to decode encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
The apparatus caused to determine at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter may be caused to: determine a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generate at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
The apparatus caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be based on the at least one audio signal being a first format.
The first format may be a multi-channel input format.
The apparatus may be caused to: determine the format of the at least one audio signal being a second format; determine at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio, based on the determining of the format of the at least one audio signal being a second format; select a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generate the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
The second format may be a metadata-assisted spatial audio format.
The apparatus caused to decode the at least one directional parameter based on encoded at least one energy ratio may be caused to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
The apparatus caused to select a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter may further be caused to select a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
The quantized at least one directional parameter may be a quantized azimuth of the at least one directional parameter.
The apparatus may be caused to determine at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
The apparatus caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter may be caused to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
According to a seventh aspect there is provided an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising: means for obtaining the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; means for encoding the at least one energy ratio to generate an encoded at least one energy ratio; means for generating an encoded spread coherence parameter; and means for encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
According to an eighth aspect there is provided an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising: means for obtaining the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and means for determining at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
According to a thirteenth aspect there is provided an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising: obtaining circuitry configured to obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encoding circuitry configured to encode the at least one energy ratio to generate an encoded at least one energy ratio; generating circuitry configured to generate an encoded spread coherence parameter; and encoding circuitry configured to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
According to a fourteenth aspect there is provided an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising: obtaining circuitry configured to obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determining circuitry configured to determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.

According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus to perform at least the following: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically the metadata encoder according to some embodiments;
Figure 3 shows a flow diagram of the operation of the example metadata encoder as shown in Figure 2 according to some embodiments;
Figure 4 shows schematically an example coherence encoder as shown in figure 2 within the metadata encoder according to some embodiments;
Figure 5 shows a flow diagram of the operation of the example coherence metadata encoder as shown in Figure 4 according to some embodiments;

Figure 6 shows schematically an example vector encoder as shown in Figure 4 according to some embodiments;
Figure 7 shows schematically a direction encoder as shown in Figure 2 according to some embodiments;
Figure 8 shows a flow diagram of the operation of the example vector encoder as shown in Figure 6 and direction encoder as shown in Figure 7 according to some embodiments;
Figure 9 shows schematically an example decoder as shown in Figure 2 according to some embodiments;
Figures 10a and 10b show flow diagrams of the operation of the example decoder as shown in Figure 9 for different formats according to some embodiments; and
Figure 11 shows schematically an example device suitable for implementing the apparatus shown.

Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio streams with transport audio signals and spatial metadata. As discussed above, Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
It can be considered an audio representation consisting of ‘N channels + spatial metadata’. It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound source directions. Sound energy that is not defined (described) by the directions is described as diffuse (coming from all directions).
As discussed above, spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and, associated with each direction, a direct-to-total ratio, distance, etc.) per time-frequency tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene. For example, a reasonable design choice which is able to produce a good quality output is one where one or more directions (and, associated with each direction, direct-to-total ratios, distance values, etc.) are determined for each time-frequency subframe. However, as also discussed above, bandwidth and/or storage limitations may require a codec not to send spatial metadata parameter values for each frequency band and temporal sub-frame.
As described above, parametric spatial metadata representation can use multiple concurrent spatial directions. With MASA, the proposed maximum number of concurrent directions is two. For each concurrent direction, there may be associated parameters such as: Direction index; Direct-to-total ratio; Spread coherence; and Distance. In some embodiments other parameters such as Diffuse-to-total energy ratio; Surround coherence; and Remainder-to-total energy ratio are defined.
The direction index may be encoded using a number of bits, for example 16, which defines a direction of arrival of the sound at a time-frequency parameter interval. In some embodiments the encoding using a spherical representation with 16 bits enables a direction with about 1-degree accuracy where all directions are covered. Direct-to-total ratios describe how much of the energy comes from specific directions and may be calculated as energy in the direction against the total energy. The spread coherence represents a spread of energy associated with a direction index of a time-frequency tile (i.e., a measure of a ‘concentration of energy’ for a time-frequency subframe direction, defining whether the direction is to be reproduced as a point source or coherently around the direction). A diffuse-to-total energy ratio defines an energy ratio of non-directional sound over surrounding directions and may be calculated as energy of non-directional sound against the total energy; it describes how much of the energy does not come from any specific direction. The direct-to-total energy ratio(s) and the diffuse-to-total energy ratio sum to one (if there is no remainder energy present). The surround coherence describes the coherence of the non-directional sound over the surrounding directions. A remainder-to-total energy ratio defines the energy ratio of the remainder (such as microphone noise) sound energy and fulfils the requirement that the sum of energy ratios is 1. The distance parameter defines the distance of the sound originating from the direction index. It may be defined in terms of time-frequency subframes and in meters on a logarithmic scale and may define a range of values, for example, 0 to 100 m.
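As a minimal sketch of the bookkeeping implied by these definitions (function and variable names are illustrative, not taken from the MASA specification):

```python
import numpy as np

def energy_ratios(direct_energies, remainder_energy, total_energy):
    """Illustrative ratio bookkeeping for one time-frequency tile.

    direct_energies: energy attributed to each analysed direction.
    The ratios are defined so that direct + diffuse + remainder = 1,
    matching the sum-to-one requirement described in the text.
    """
    direct_to_total = np.asarray(direct_energies, dtype=float) / total_energy
    remainder_to_total = remainder_energy / total_energy
    diffuse_to_total = 1.0 - direct_to_total.sum() - remainder_to_total
    return direct_to_total, diffuse_to_total, remainder_to_total
```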
However the MASA format may further comprise other parameters, such as:
Version which describes the incremental version number for the MASA metadata format.
Channel audio format which describes the following fields (and may be stored as two bytes):
Number of directions which indicates the number of directions in the metadata, where each direction is associated with a set of direction dependent spatial metadata;
Number of channels which indicates a number of transport channels in the format;
Transport channel definition which describes the transport channels.
Source format which describes the original format from which the audio signals was created; Source format description which may provide further description of the specific source format; and
Channel distance which describes the channel distance.
As the IVAS codec is expected to operate at various bit rates ranging from very low bit rates (13 kb/s) to relatively high bit rates (500 kb/s), various strategies are needed for compression of the spatial metadata. The raw bitrate of the MASA metadata is relatively high (about 310 kb/s for 1 direction and about 500 kb/s for 2 directions), so at lower bitrates it is expected that only the most important parts of the metadata will be conveyed from the encoder to the decoder. In practice, it is not possible to send parameter values for each frequency band, temporal sub-frame, and direction (at least for most practical bit rates). Instead, some values have to be merged (e.g., send only 1 direction instead of 2 directions and/or send the same direction(s) for multiple frequency bands and/or temporal sub-frames). At the absolute lowest bitrates, drastic reduction is needed as there are very few bits available for describing the metadata.
The IVAS codec is configured to operate with a frame size of 20ms. Similarly, the MASA metadata update rate is typically once per 20ms. This allows for synchronization of the audio waveform and spatial metadata encoding. However, in order to, e.g., react to fast events, the internal metadata structure for MASA allows for a 5-ms update rate by use of 4 temporal sub-frames.
For practical devices, such as mobile phones, the spatial analysis is more difficult than for well-understood dedicated spatial audio capture microphone arrays. Therefore, various fluctuations are often present in the original metadata. For this reason, longer analysis windows and smoothing is often used.
Moreover, MASA metadata contains a complex sound scene description model with multiple parameters. However, the use of these parameters may vary from frame to frame, and some practical systems creating the MASA format may not be able (or purposefully do not) analyse or form values for all parameters. Thus, how the parameters are used in a valid MASA format input frame may vary significantly from frame to frame.
As described above, the MASA format contains multiple parameters. These parameters together perceptually represent the overall captured or generated sound field. Quantizing and encoding of the direction index has been defined in earlier approaches, for example a spherical index based system where a specific number of bits is allowed for use per unique direction parameter value. This number of bits is controlled by the direct-to-total ratio (or similar diffuse-to-total ratio) in such a way that when the direct-to-total ratio is low, the accuracy of the direction can be lower as well. This is due to the fact that a higher direct-to-total ratio is usually associated with perceptually more relevant direction parameters.
However, the spread coherence parameter in the MASA format affects the rendering of the spatial direction. In practice, when the spread coherence increases from zero, the reproduction becomes less and less point-like until at 0.5 it is reproduced by the centre and edges together, and at 1.0 it is reproduced only by the edges. This means in practice that when spread coherence is present, the perceived sound direction for the corresponding TF-tile is less point-like or even less accurate.
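A hypothetical gain rule consistent with this description may help make it concrete; the piecewise-linear crossfade and the energy normalization below are assumptions for illustration, not the actual renderer:

```python
def spread_coherence_gains(spread: float):
    """Illustrative centre/edge gain crossfade for spread coherence.

    spread = 0.0 -> point source (centre only)
    spread = 0.5 -> centre and edges together
    spread = 1.0 -> edges only
    """
    if spread <= 0.5:
        centre = 1.0
        edge = 2.0 * spread            # edges fade in up to 0.5
    else:
        centre = 2.0 * (1.0 - spread)  # centre fades out above 0.5
        edge = 1.0
    # Normalize to preserve energy over centre plus two edge sources.
    norm = (centre ** 2 + 2.0 * edge ** 2) ** 0.5
    return centre / norm, edge / norm
```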
The concept as discussed further in the embodiments herein is to take this into account in direction encoding in order to obtain a more optimal bit allocation in parameter encoding and thus better overall quality.
Thus in summary the embodiments described herein relate to an encoding of parametric spatial audio streams with transport audio signals and spatial metadata. These embodiments feature a method which attempts to optimize the quantization of the spatial direction per TF-tile by analyzing the spread coherence parameter value in addition to or instead of the direct-to-total energy ratio parameter value (or similar energy ratio) and automatically selecting (an optimized) quantization accuracy and bit use for the corresponding direction parameter value.
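For example, a simple joint allocation rule of the kind the embodiments aim at could look as follows; the linear weighting and the bit limits are assumptions for this sketch only, not the codec's rule:

```python
def direction_bits(energy_ratio: float, spread_coherence: float,
                   max_bits: int = 11, min_bits: int = 3) -> int:
    """Illustrative joint bit allocation for one TF-tile direction.

    Fewer bits are spent on the direction when the direct-to-total
    ratio is low (direction less relevant) or the spread coherence
    is high (direction rendered less point-like).
    """
    relevance = energy_ratio * (1.0 - spread_coherence)
    return int(min_bits + round(relevance * (max_bits - min_bits)))
```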
With respect to Figure 1 an example apparatus and system for implementing embodiments of the application are shown. The system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131. The ‘analysis’ part 121 is the part from receiving the multi-channel signals up to an encoding of the spatial metadata and transport signal, and the ‘synthesis’ part 131 is the part from a decoding of the encoded spatial metadata and transport signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
In the following description the ‘analysis’ part 121 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part. In other words in some embodiments the ‘analysis’ part 121 is an encoder comprising at least one of the transport signal generator or analysis processor as described hereafter.
The input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102. The ‘analysis’ part 121 may comprise a transport signal generator 103, analysis processor 105, and encoder 107. In the following examples a microphone channel signal input is described, which can be two or more microphones integrated or connected onto a mobile device (e.g., a smartphone). However any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example other suitable audio signals format inputs could be microphone arrays, e.g., B-format microphone, planar microphone array or Eigenmike, Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA), loudspeaker surround mix and/or objects, artificially created spatial mix, e.g., from audio or VR teleconference bridge, or combinations of the above.
The multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.
In some embodiments the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable audio signal format for encoding. The transport signal generator 103 can for example generate a stereo or mono audio signal. The transport audio signals generated by the transport signal generator can be of any known format. For example, when the input audio signals are mobile phone microphone array audio signals, the transport signal generator 103 can be configured to select a left-right microphone pair, and apply any suitable processing to the audio signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization. In some embodiments when the input is a first order Ambisonic/higher order Ambisonic (FOA/HOA) signal, the transport signal generator can be configured to formulate directional beam signals towards left and right directions, such as two opposing cardioid signals. Additionally in some embodiments when the input is a loudspeaker surround mix and/or objects, the transport signal generator 103 can be configured to generate a downmix signal that combines left side channels to a left downmix channel, combines right side channels to a right downmix channel and adds centre channels to both transport channels with a suitable gain.
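As an illustration of the loudspeaker-input case (the channel ordering, the -3 dB centre gain, and the omission of the LFE channel are assumptions of this sketch):

```python
import numpy as np

def downmix_51_to_stereo(x: np.ndarray, centre_gain: float = 0.7071) -> np.ndarray:
    """Illustrative 5.1-to-stereo transport downmix.

    x: array of shape (6, n_samples), channel order assumed here to be
    [L, R, C, LFE, Ls, Rs]. Left side channels go to the left transport
    channel, right side to the right, centre to both with a gain.
    """
    L, R, C, LFE, Ls, Rs = x
    del LFE  # LFE handling omitted in this illustration
    left = L + Ls + centre_gain * C
    right = R + Rs + centre_gain * C
    return np.stack([left, right])
```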
In some embodiments the transport signal generator is bypassed (or in other words is optional). For example, in some situations where the analysis and synthesis occur at the same device at a single processing step without intermediate processing, there is no transport signal generation and the input audio signals are passed unprocessed. The number of transport channels generated can be any suitable number and is not limited to, for example, one or two channels.
The output of the transport signal generator 103 can be passed to an encoder 107.
In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce the spatial metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104. In some embodiments the spatial metadata associated with the audio signals may be provided to the encoder as a separate bit-stream. In some embodiments the multichannel signals 102 input comprises spatial metadata and this is passed directly to the encoder 107.
The analysis processor 105 may be configured to generate the spatial metadata parameters which may comprise, for each time-frequency analysis interval, at least one direction parameter 108 and at least one energy ratio parameter 110 (and in some embodiments other parameters such as described earlier and of which a non-exhaustive list includes number of directions, surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio, a spread coherence parameter, and distance parameter). The direction parameter may be represented in any suitable manner, for example as spherical co-ordinates denoted as azimuth φ(k,n) and elevation θ(k,n).
In some embodiments the number of the spatial metadata parameters may differ from time-frequency tile to time-frequency tile. Thus for example in band X all of the spatial metadata parameters are obtained (generated) and transmitted, whereas in band Y only one of the spatial metadata parameters is obtained and transmitted, and furthermore in band Z no parameters are obtained or transmitted. A practical example of this may be that for some time-frequency tiles corresponding to the highest frequency band some of the spatial metadata parameters are not required for perceptual reasons. The spatial metadata 106 may be passed to an encoder 107.
In some embodiments the analysis processor 105 is configured to apply a time-frequency transform for the input signals. Then, for example, in time-frequency tiles when the input is a mobile phone microphone array, the analysis processor could be configured to estimate delay-values between microphone pairs that maximize the inter-microphone correlation. Then based on these delay values the analysis processor may be configured to formulate a corresponding direction value for the spatial metadata. Furthermore the analysis processor may be configured to formulate a direct-to-total ratio parameter based on the correlation value.
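A minimal full-band sketch of such a delay-and-correlation analysis (the per-tile, frequency-domain details of an actual implementation are omitted, and the function name is illustrative):

```python
import numpy as np

def estimate_azimuth(sig_a, sig_b, mic_distance, fs, c=343.0):
    """Illustrative delay-based direction estimate for a microphone pair.

    Finds the inter-microphone lag that maximizes the correlation and
    maps it to an azimuth via sin(az) = delay * c / distance (far-field
    assumption). Both signals are assumed to have the same length.
    """
    n = len(sig_a)
    max_lag = int(np.ceil(mic_distance / c * fs))
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        corr = np.dot(sig_a[max_lag:n - max_lag],
                      sig_b[max_lag + lag:n - max_lag + lag])
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    sin_az = np.clip(best_lag / fs * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_az))
```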
In some embodiments, for example where the input is a FOA signal, the analysis processor 105 can be configured to determine an intensity vector. The analysis processor may then be configured to determine a direction parameter value for the spatial metadata based on the intensity vector. A diffuse-to-total ratio can then be determined, from which a direct-to-total ratio parameter value for the spatial metadata can be determined. This analysis method is known in the literature as Directional Audio Coding (DirAC).
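A sketch of the structure of such an analysis for one time-frequency tile; normalization constants and the FOA channel convention are deliberately ignored here, so this shows only the shape of the estimation, not a calibrated DirAC implementation:

```python
import numpy as np

def dirac_analysis(w, x, y, z):
    """Illustrative DirAC-style analysis of one time-frequency tile.

    w, x, y, z: complex STFT bins of an FOA signal within the tile.
    Returns azimuth, elevation and a direct-to-total ratio derived
    from a diffuseness estimate.
    """
    # Active intensity vector: Re{w* [x, y, z]}, summed over the tile.
    I = np.real(np.conj(w)[:, None] * np.stack([x, y, z], axis=1)).sum(axis=0)
    # The direction of arrival points opposite to the propagation direction.
    azimuth = np.degrees(np.arctan2(-I[1], -I[0]))
    elevation = np.degrees(np.arctan2(-I[2], np.hypot(I[0], I[1])))
    energy = 0.5 * np.sum(np.abs(w) ** 2 + np.abs(x) ** 2
                          + np.abs(y) ** 2 + np.abs(z) ** 2)
    # Diffuseness ~ 1 - |I| / E (scaling constants omitted in this sketch).
    diffuseness = np.clip(1.0 - np.linalg.norm(I) / max(energy, 1e-12), 0.0, 1.0)
    return azimuth, elevation, 1.0 - diffuseness
```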
In some examples, for example where the input is a HOA signal, the analysis processor 105 can be configured to divide the HOA signal into multiple sectors, in each of which the method above is utilized. This sector-based method is known in the literature as higher order DirAC (HO-DirAC). In these examples, there is more than one simultaneous direction parameter value per time-frequency tile corresponding to the multiple sectors.
Additionally in some embodiments where the input is a loudspeaker surround mix and/or audio object(s) based signal, the analysis processor can be configured to convert the signal into a FOA/HOA signal(s) format and to obtain direction and direct-to-total ratio parameter values as above.
The encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport audio signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The audio encoding may be implemented using any suitable scheme. The encoder 107 may furthermore comprise a spatial metadata encoder/quantizer 111 which is configured to receive the spatial metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the spatial metadata within encoded downmix signals before transmission or storage shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
In some embodiments the transport signal generator 103 and/or analysis processor 105 may be located on a separate device (or otherwise separate) from the encoder 107. For example in such embodiments the spatial metadata (and associated non-spatial metadata) parameters associated with the audio signals may be provided to the encoder as a separate bit-stream.
In some embodiments the transport signal generator 103 and/or analysis processor 105 may be part of the encoder 107, i.e., located inside of the encoder and be on a same device.
In the following description the ‘synthesis’ part 131 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part.
In the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport signal decoder 135 which is configured to decode the audio signals to obtain the transport audio signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded spatial metadata (for example a direction index representing a direction parameter value) and generate spatial metadata.
The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The decoded metadata and transport audio signals may be passed to a synthesis processor 139.
The system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the transport audio signal and the spatial metadata and re-create in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be a multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the transport signals and the spatial metadata.
The synthesis processor 139 thus creates the output audio signals, e.g., multichannel loudspeaker signals or binaural signals based on any suitable known method. This is not explained here in further detail. However, as a simplified example, the rendering can be performed for loudspeaker output according to any of the following methods. For example the transport audio signals can be divided to direct and ambient streams based on the direct-to-total and diffuse-to-total energy ratios. The direct stream can then be rendered based on the direction parameter(s) using amplitude panning. The ambient stream can furthermore be rendered using decorrelation. The direct and the ambient streams can then be combined.
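A compact sketch of this direct/ambient split for one tile; the decorrelator is only a placeholder and all names are illustrative:

```python
import numpy as np

def render_tile(transport, ratio, pan_gains, n_out):
    """Illustrative loudspeaker rendering of one TF-tile.

    transport: mono transport bins for the tile; ratio: direct-to-total
    energy ratio; pan_gains: amplitude-panning gains (length n_out) for
    the decoded direction.
    """
    direct = np.sqrt(ratio) * transport
    ambient = np.sqrt(max(1.0 - ratio, 0.0)) * transport
    out = np.outer(pan_gains, direct)            # panned direct stream
    for ch in range(n_out):                      # decorrelated ambience
        out[ch] += decorrelate(ambient, ch) / np.sqrt(n_out)
    return out

def decorrelate(sig, ch):
    # Placeholder: a real implementation applies per-channel
    # decorrelation filters; identity is used here for brevity.
    return sig
```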
The output signals can be reproduced using a multichannel loudspeaker setup or headphones which may be head-tracked.
It should be noted that the processing blocks of Figure 1 can be located in same or different processing entities. For example, in some embodiments, microphone signals from a mobile device are processed with a spatial audio capture system (containing the analysis processor and the transport signal generator), and the resulting spatial metadata and transport audio signals (e.g., in the form of a MASA stream) are forwarded to an encoder (e.g., an IVAS encoder), which contains the encoder. In other embodiments, input signals (e.g., 5.1 channel audio signals) are directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder.
In some embodiments there can be two (or more) input audio signals, where the first audio signal is processed by the apparatus shown in Figure 1 (resulting in data as an input for the encoder) and the second audio signal is directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder. The audio input signals may then be encoded in the encoder independently or they may, e.g., be combined in the parametric domain according to what may be called, e.g., MASA mixing. In some embodiments there may be a synthesis part which comprises separate decoder and synthesis processor entities or apparatus, or the synthesis part can comprise a single entity which comprises both the decoder and the synthesis processor. In some embodiments, the decoder block may process in parallel more than one incoming data stream. In the application the term synthesis processor may be interpreted as an internal or external renderer.
Therefore in summary first the system (analysis part) is configured to receive multi-channel audio signals. Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting some of the audio signal channels). The system is then configured to encode for storage/transmission the transport audio signal. After this the system may store/transmit the encoded transport audio signal and metadata. The system may retrieve/receive the encoded transport audio signal and metadata. Then the system is configured to extract the transport audio signal and metadata from encoded transport audio signal and metadata parameters, for example demultiplex and decode the encoded transport audio signal and metadata parameters.
The system (synthesis part) is configured to synthesize an output multi-channel audio signal based on the extracted transport audio signal and metadata.
With respect to Figure 2 an example analysis processor 105 and Metadata encoder/quantizer 111 (as shown in Figure 1) according to some embodiments is described in further detail.
The analysis processor 105 in some embodiments comprises a time-frequency domain transformer 201.
In some embodiments the time-frequency domain transformer 201 is configured to receive the multi-channel signals 102 and apply a suitable time to frequency domain transform such as a Short Time Fourier Transform (STFT) in order to convert the input time domain signals into a suitable time-frequency signals. These time-frequency signals may be passed to a spatial analyser 203.
Thus for example the time-frequency signals 202 may be represented in the time-frequency domain representation by s_i(b,n), where b is the frequency bin index, n is the time block (frame) index and i is the channel index. In another expression, n can be considered as a time index with a lower sampling rate than that of the original time-domain signals. These frequency bins can be grouped into subbands that group one or more of the bins into a subband of a band index k = 0, ..., K-1. Each subband k has a lowest bin b_k^low and a highest bin b_k^high, and the subband contains all bins from b_k^low to b_k^high. The widths of the subbands can approximate any suitable distribution, for example the equivalent rectangular bandwidth (ERB) scale or the Bark scale.
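For illustration, grouping bins into subbands and computing per-band energies might look like this; the band-edge list is left to the caller (e.g., ERB- or Bark-spaced) and the function name is an assumption:

```python
import numpy as np

def group_bins(spectrum, band_edges):
    """Illustrative grouping of STFT bins into subbands.

    band_edges[k] and band_edges[k+1] - 1 play the roles of the lowest
    and highest bin of subband k. Returns per-band energies.
    """
    return np.array([np.sum(np.abs(spectrum[band_edges[k]:band_edges[k + 1]]) ** 2)
                     for k in range(len(band_edges) - 1)])
```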
In some embodiments the analysis processor 105 comprises a spatial analyser 203. The spatial analyser 203 may be configured to receive the time- frequency signals 202 and based on these signals estimate direction parameters 108. The direction parameters may be determined based on any audio based ‘direction’ determination.
For example in some embodiments the spatial analyser 203 is configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a ‘direction’, more complex processing may be performed with even more signals.
The spatial analyser 203 may thus be configured to provide at least one azimuth and elevation for each frequency band and temporal time-frequency block within a frame of an audio signal, denoted as azimuth φ(k,n) and elevation θ(k,n). The direction parameters 108 may also be passed to a direction index generator 205.
The spatial analyser 203 may also be configured to determine an energy ratio parameter 110. The energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction. The energy ratio may in some embodiments be a direct-to-total energy ratio.
The direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter. The energy ratio may be passed to an energy ratio encoder 207.
The spatial analyser 203 may furthermore be configured to determine a number of coherence parameters which may include surround coherence γ(k,n) 112 and spread coherence ζ(k,n) 114, both analysed in the time-frequency domain. Therefore in summary the analysis processor is configured to receive time domain multichannel or other format such as microphone or ambisonic audio signals.
Following this the analysis processor may apply a time domain to frequency domain transform (e.g. STFT) to generate suitable time-frequency domain signals for analysis and then apply direction analysis to determine direction and energy ratio parameters.
The analysis processor may then be configured to output the determined parameters.
Although directions, energy ratios, and coherence parameters are here expressed for each time index n, in some embodiments the parameters may be combined over several time indices. Same applies for the frequency axis, as has been expressed, the direction of several frequency bins b could be expressed by one direction parameter in band k consisting of several frequency bins b. The same applies for all of the discussed spatial parameters herein.
In some embodiments the directional data may be represented using 16 bits such that each azimuth parameter is approximately represented on 9 bits, and the elevation on 7 bits. In such embodiments the energy ratio parameter may be represented on 8 bits. For each frame there may be N=5 subbands and M=4 time-frequency (TF) blocks. Thus in this example there are (16+8)×M×N bits needed to store the uncompressed direction and energy ratio metadata for each frame. The coherence data for each TF block may be a floating point representation between 0 and 1 and may be originally represented on 8 bits.
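As a quick check of this count, using the 20 ms frame length mentioned above:

```python
direction_bits = 16          # ~9 azimuth + ~7 elevation
ratio_bits = 8
N, M = 5, 4                  # subbands and TF blocks per frame
frame_s = 0.02               # 20 ms IVAS frame

bits_per_frame = (direction_bits + ratio_bits) * M * N   # 480 bits
print(bits_per_frame / frame_s / 1000, "kb/s")           # 24.0 kb/s
```

That is 480 bits per frame, or 24 kb/s, for the uncompressed direction and energy ratio data alone, before the coherence and other parameters are added.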
As also shown in Figure 2 an example metadata encoder/quantizer 111 is shown according to some embodiments.
The metadata encoder/quantizer 111 may comprise a direction encoder 205. The direction encoder 205 is configured to receive the direction parameters (such as the azimuth φ(k,n) and elevation θ(k,n)) 108, a coherence control 210 (and in some embodiments an expected bit allocation and/or encoded (or quantized) energy ratio) and from this generate a suitable encoded output. In some embodiments the expected bit allocation and/or quantized (or encoded) energy ratio is used to define an initial bit allocation (and thus a quantization accuracy) for each frequency band for coding the direction. In some embodiments the encoding is based on an arrangement of spheres forming a spherical grid arranged in rings on a ‘surface’ sphere, which is defined by a look-up table selected by the determined quantization resolution. In other words the spherical grid uses the idea of covering a sphere with smaller spheres and considering the centres of the smaller spheres as points defining a grid of almost equidistant directions. The smaller spheres therefore define cones or solid angles about the centre point which can be indexed according to any suitable indexing algorithm. Although spherical quantization is described here, any suitable quantization, linear or non-linear, may be used.
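One way to realize such an almost-equidistant ring grid, purely as an illustration (the ring spacing and per-ring point counts are assumptions, not the codec's look-up tables):

```python
import numpy as np

def spherical_grid(n_rings):
    """Illustrative almost-equidistant spherical grid built from rings.

    Elevation rings are spaced uniformly; each ring holds a number of
    azimuth points roughly proportional to its circumference, so the
    grid covers the sphere approximately evenly. An index table over
    such a grid is one way to realize the described look-up.
    """
    points = []
    for i in range(n_rings):
        elev = -90.0 + (i + 0.5) * 180.0 / n_rings
        n_az = max(1, int(round(2 * n_rings * np.cos(np.radians(elev)))))
        for j in range(n_az):
            points.append((j * 360.0 / n_az, elev))
    return points  # list of (azimuth, elevation) grid directions
```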
Furthermore in some embodiments the direction encoder 205 is configured to determine a variance of the quantized azimuth parameter value and pass this to the coherence encoder 209.
The encoded direction parameters may then be passed to the combiner 211.
The metadata encoder/quantizer 111 may comprise an energy ratio encoder 207. The energy ratio encoder 207 is configured to receive the energy ratios and determine a suitable encoding for compressing the energy ratios for the sub-bands and the time-frequency blocks. For example in some embodiments the energy ratio encoder 207 is configured to use 3 bits to encode each energy ratio parameter value.
Furthermore in some embodiments rather than transmitting or storing all energy ratio values for all TF blocks, only one weighted average value per sub-band is transmitted or stored. The average may be determined by taking into account the total energy of each time block, thus favouring the values of the sub-bands having more energy.
In such embodiments the quantized energy ratio value is the same for all the TF blocks of a given sub-band.
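A sketch of this weighted averaging (the 3-bit quantization step is omitted and the names are illustrative):

```python
import numpy as np

def subband_ratio(ratios, block_energies):
    """Energy-weighted average of the TF-block ratios of one sub-band.

    Blocks carrying more energy contribute more to the single value
    transmitted per sub-band, matching the weighting described above.
    """
    w = np.asarray(block_energies, dtype=float)
    return float(np.dot(ratios, w) / max(w.sum(), 1e-12))
```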
In some embodiments the energy ratio encoder 207 is further configured to pass the quantized (encoded) energy ratio value to the direction encoder 205, the combiner 211 and to the coherence encoder 209.
The metadata encoder/quantizer 111 may comprise a coherence encoder 209. The coherence encoder 209 is configured to receive the coherence values (such as the surround coherence 112 and spread coherence 114 values), the encoded energy ratio value and the quantized azimuth values and determine a suitable encoding for compressing the coherence values for the sub-bands and the time-frequency blocks. A 3-bit precision value for the coherence parameter values has been shown to produce acceptable audio synthesis results, but even then this would require a total of 3×20 bits for the coherence data for all TF blocks (in the example 8 sub-bands and 5 TF blocks per frame).
As described hereafter, in some embodiments the encoding is implemented in the DCT domain and may be dependent on the current (quantized) energy ratio and the variance of the quantized azimuth values. In some embodiments the coherence encoder 209 is configured to determine whether the input audio signals are in a multi-channel format. Where the input audio signals are in a multi-channel format then the coherence encoder 209 is configured to generate quantized or compressed coherence value options based on the encoded energy ratio values (for example by encoding in the DCT domain as indicated hereafter) and a coherence value indicator is then generated. These coherence values can then be used to assist in the compression or encoding of the directional values.
Where the input audio signals are not in a multi-channel format then the coherence encoder 209 can be configured to reserve space in the bitstream for spread coherence codewords. Following the encoding of the directional values, these encoded directional values can be used to assist in the selection of the quantized coherence value options.
The encoded coherence parameter values may then be passed to the combiner 211.
The metadata encoder/quantizer 111 may comprise a combiner 211. The combiner is configured to receive the encoded (or quantized/compressed) directional parameters, energy ratio parameters and coherence parameters and combine these to generate a suitable output (for example a metadata bit stream which may be combined with the transport signal or be separately transmitted or stored from the transport signal).
With respect to Figure 3 is shown an example operation of the metadata encoder/quantizer as shown in Figure 2 according to some embodiments.
The initial operation is obtaining the metadata (such as azimuth values, elevation values, energy ratios, coherence values) as shown in Figure 3 by step 301. The energy ratio values are compressed or encoded (for example by generating a weighted average per sub-band and then quantizing these as a 3 bit value) as shown in Figure 3 by step 303.
The next operation is determining whether the input audio signals are a multi-channel format as shown in Figure 3 by step 304.
Where the input audio signals are a multi-channel format then quantized or compressed coherence value options are generated based on the encoded energy ratio values (for example by encoding in the DCT domain as indicated hereafter) and a coherence value indicator is generated as shown in Figure 3 by step 305.
The directional values (elevation, azimuth) may then be compressed or encoded (for example by applying a spherical quantization, or any suitable compression) based on the encoded energy ratio values and the coherence value indicator. The generation of the compressed directional value is shown in Figure 3 by step 307.
Where the input audio signals are not a multi-channel format then the coherence encoder can be configured to reserve space in bitstream for spread coherence codewords as shown in Figure 3 by step 306.
Having reserved the space in the bitstream, the quantized azimuth values are generated by encoding/quantizing the directional values as shown in Figure 3 by step 308.
Then the quantized coherence value is selected from the options based on the variance of the quantized azimuth value as shown in Figure 3 by step 309.
The encoded directional values, energy ratios and coherence values are then combined to generate the encoded metadata which can be output as shown in Figure 3 by step 311.
With respect to Figure 4 is shown an example coherence encoder 209 as shown in Figure 2.
In some embodiments the coherence encoder 209 comprises a coherence vector generator 401. The coherence vector generator 401 is configured to receive the coherence values 112, which may be 8-bit floating point representations between 0 and 1.
The coherence vector generator 401 is configured for each sub-band to generate a vector of coherence values. Thus in the example where there are M time-frequency blocks, the coherence vector generator 401 is configured to generate an M-dimensional vector of coherence data 402.
The coherence data vector 402 is output to the discrete cosine transformer 403.
In some embodiments the coherence encoder 209 comprises the discrete cosine transformer 403. The discrete cosine transformer 403 may be configured to receive the M dimensional coherence data vector 402 and discrete cosine transform (DCT) the vector.
Any suitable method for performing a DCT may be implemented. For example, in some embodiments the vector comprises a 4-dimensional vector of coherences corresponding to a sub-band. Then for the vector x = (x1, x2, x3, x4)' the matrix multiplication with the DCT matrix of order 4 is equivalent to:

y1 = (a + b)/2
y2 = (cos(π/8) c + cos(3π/8) d)/√2
y3 = (a − b)/2
y4 = (cos(3π/8) c − cos(π/8) d)/√2

where a = x1 + x4, b = x2 + x3, c = x1 − x4, d = x2 − x3.

This reduces the number of operations for the DCT transform from 28 to 14. The DCT coherence vector 404 may then be output to the vector encoder 405.
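A minimal sketch of this fast order-4 DCT (assuming the orthonormal DCT-II normalization; the 1/√2 scaling is folded into the constants), verified against the direct 4x4 matrix multiplication:

    import numpy as np

    C1 = np.cos(np.pi / 8) / np.sqrt(2)
    C3 = np.cos(3 * np.pi / 8) / np.sqrt(2)

    def fast_dct4(x):
        x1, x2, x3, x4 = x
        a, b = x1 + x4, x2 + x3          # even part
        c, d = x1 - x4, x2 - x3          # odd part
        return np.array([(a + b) / 2, C1 * c + C3 * d,
                         (a - b) / 2, C3 * c - C1 * d])

    # Reference: orthonormal DCT-II matrix of order 4
    n, k = np.meshgrid(np.arange(4), np.arange(4))
    T = np.cos(np.pi * (2 * n + 1) * k / 8) / np.sqrt(2)
    T[0, :] = 0.5

    x = np.array([0.9, 0.7, 0.8, 0.6])
    assert np.allclose(fast_dct4(x), T @ x)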
In some embodiments the coherence encoder 209 comprises a vector encoder 405. The vector encoder 405 is configured to receive the DCT coherence vector 404 and encode it by using a suitable codebook. In some embodiments the vector encoder 405 is configured to receive or otherwise obtain the encoded/quantized energy ratio 412 and the variance of the quantized (encoded) azimuth 414 (which may be determined from the energy ratio encoder and the direction encoder as shown in Figure 2) and obtain all possible first DCT parameter (DCT0) quantization options of the spread coherence (for multi-channel input audio format signals) or encode the spread coherence (for other input formats such as the MASA input format) based on the known quantized direct-to-total energy ratio and/or the variance of the quantized (encoded) azimuth 414. Furthermore, even when the codebook is selected based on the azimuth, pre-computing the codevectors from all codebooks is not required because the number of bits needed is known and the space in the bitstream is reserved. Once the quantization of the azimuth and the variance of the quantized azimuth are obtained it is possible to find the codevector from the corresponding codebook.
In some embodiments the encoding of the first DCT parameter is implemented in a manner different than the encoding of further DCT parameters. This is because the first and further DCT parameters have significantly different distributions. The encoded coherence 406 values can then be output.
In some embodiments (and as discussed previously) 3 bits are used to encode each energy ratio value and only one weighted average value per sub-band is generated and transmitted (and/or stored). This means that the quantized energy ratio value is the same for all the TF blocks of a given sub-band.
The variance of the encoded azimuth influences the selection of the first DCT parameter based on whether the variance of the quantized azimuth within the sub-band is very small (under a determined threshold) or larger than the threshold.
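A minimal sketch of such a variance-based selection follows (the circular variance measure and the threshold value are assumptions for illustration; the text above fixes only that a threshold on the azimuth variance steers the choice):

    import numpy as np

    def azimuth_variance(azimuths_deg):
        # circular spread measure: azimuth wraps around at +/-180 degrees
        a = np.radians(np.asarray(azimuths_deg, dtype=float))
        r = np.hypot(np.mean(np.cos(a)), np.mean(np.sin(a)))  # mean resultant length
        return 1.0 - r                                        # 0 = no spread, 1 = maximal spread

    def select_dct0_codebook(azimuths_deg, threshold=0.1):    # threshold is illustrative
        small = azimuth_variance(azimuths_deg) < threshold
        return "codebook_low_variance" if small else "codebook_high_variance"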
In some embodiments furthermore a number of sub-bands I_N is selected. For example in some embodiments I_N = 3. In such embodiments the sub-bands up to the selected sub-band limit are encoded using a first number of secondary DCT parameters and the remaining sub-bands are encoded using a second number of secondary DCT parameters. The first number in some embodiments is 1 and the second number is 2. In other words in some embodiments the vector encoder is configured such that the sub-bands <= I_N encode only the first 2 components of the DCT transformed vector (one primary and one secondary) and the sub-bands > I_N encode only the first 3 components of the DCT transformed vector (one primary and two secondary). These two additional components can be encoded with a 2-dimensional vector quantizer or they could be added as extra dimensions to the N-dimensional vector quantizer of the second DCT parameters, using an N+2 dimensional vector quantizer for the encoding of all secondary parameters at once. The overview of the encoding of the coherence parameter is shown in a flow diagram, Figure 5.
The first operation is obtaining the coherence parameter values as shown in Figure 5 by step 501.
Having obtained the coherence parameter values for the frame, the next operation is to generate M-dimensional coherence vectors for each sub-band as shown in Figure 5 by step 503.
The M-dimensional coherence vectors are then transformed, for example using a discrete cosine transform (DCT), as shown in Figure 5 by step 505.
Then the DCT representations are sorted into sub-bands below or equal to the determined sub-band selection value and above that value as shown in Figure 5 step 507. In other words it is determined whether the current sub-band being processed is less than or equal to I_N or more than I_N.
The DCT representations for M-dimensional coherence vectors for sub-bands less than or equal to I_N are then encoded by encoding the first 2 components of the DCT transformed vector as shown in Figure 5 step 509.
The DCT representations for M-dimensional coherence vectors for sub-bands more than I_N are then encoded by encoding the first 3 components of the DCT transformed vector as shown in Figure 5 step 511.
This may for example be summarised in the following Python-style sketch (dct and encode stand for the operations described above):

    for i in range(1, N + 1):                  # for each sub-band i = 1..N
        y = dct(coherence_vectors[i - 1])      # DCT transform the M-dimensional vector
        if i <= I_N:
            encode(y[:2])                      # first 2 components: one primary, one secondary
        else:
            encode(y[:3])                      # first 3 components: one primary, two secondary
With respect to Figure 6 is shown in further detail the vector encoder 405. The vector encoder 405 is shown receiving the DCT coherence vector 404 as an input.
The vector encoder in some embodiments comprises a surround coherence encoder 603. The surround coherence encoder 603 is configured to receive the surround coherence parameters and from these encode the surround coherence parameter (and calculate the number of bits for surround coherence). In some embodiments the surround coherence encoder 603 is configured to transmit one surround coherence value per sub-band. In a manner similar to that described with respect to the encoding of the energy ratio, the value may be obtained in some embodiments as a weighted average over the time-frequency blocks of the sub-band, the weights being determined by the signal energies.
In some embodiments the averaged surround coherence values are scalar quantized with codebooks whose length (number of codewords) is dependent on the energy ratio index (2, 3, 4, 5, 6, 7, 8, 8 codewords for the indexes 0, 1, 2, 3, 4, 5, 6, 7 respectively). The indexes in some embodiments are encoded using a Golomb-Rice encoder on the mean-removed values or by joint encoding taking into account the number of codewords used (in other words selecting either entropy coding, such as GR coding, or joint coding based on which one encodes the values in fewer bits).
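A minimal sketch of Golomb-Rice coding of mean-removed indexes follows (the zigzag mapping of negative differences and the Rice parameter p are assumptions for illustration):

    def zigzag(v):                    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
        return (v << 1) if v >= 0 else ((-v << 1) - 1)

    def golomb_rice(v, p=1):
        u = zigzag(v)
        q, r = u >> p, u & ((1 << p) - 1)
        return "1" * q + "0" + format(r, "0{}b".format(p))  # unary quotient + p-bit remainder

    indexes = [3, 4, 4, 2, 5]
    mean = round(sum(indexes) / len(indexes))
    gr_bits = "".join(golomb_rice(i - mean) for i in indexes)
    # the encoder would compare len(gr_bits) with the joint-coding cost and
    # pick whichever representation needs fewer bits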
The surround coherence encoder 603 can then be configured to output the encoded surround coherence values to a coherence value combiner 613.
The vector encoder 405 in some embodiments further comprises a DCT order 1 (and further order, e.g. 2) spread coherence encoder 601. The DCT order 1 (and further order) spread coherence encoder 601 is configured to receive the DCT coherence vector 404 and from this encode the DCT parameter of order 1 (and order 2 onwards for the sub-bands which encode further secondary parameters) for spread coherence, using a Golomb-Rice coding for the mean-removed indexes of the quantized indexes. The indexes in some embodiments are obtained from scalar quantization in codebooks dependent on the index of the sub-band. The number of codewords can be the same for all sub-bands, for example 5 codewords. However in some embodiments the number of codewords may differ from sub-band to sub-band.
The output encoded DCT order 1 (and 2 onwards) encoded spread coherence parameters can be passed to the coherence value combiner 613.
In some embodiments the vector encoder comprises a DCT order 0 (DCT0) quantization option determiner 605. The DCT order 0 (DCT0) quantization option determiner 605 is configured to obtain the encoded/quantized energy ratio 412 values and all possible DCT0 quantization options of spread coherence based on the known quantized direct-to-total ratio and all direction-based codebook alternatives.
Furthermore in some embodiments the vector encoder comprises a DCT order 0 (DCT0) quantized spread coherence determiner 607. The DCT order 0 (DCT0) quantized spread coherence determiner 607 is configured to obtain all possible quantized spread coherence values from the DCT coefficients based on all possible DCT0 quantization options of spread coherence. These can be passed to a DCT0 coefficient selector 609 and a quantized spread coherence value indicator generator 611.
In some embodiments the vector encoder 405 comprises a DCT0 coefficient selector/indicator generator 609. The DCT0 coefficient selector/indicator generator 609 is configured to obtain the (all possible) quantized spread coherence values from the DCT coefficients, an input format indicator 410, and in some embodiments a variance of encoded azimuth 414 value. The DCT0 coefficient selector/indicator generator 609 is configured to, when the input format indicator 410 indicates that the audio signals are in the multi-channel input format, select the codebook of the DCT coefficient of order zero corresponding to the higher azimuth variance. In other words, the variance of the encoded azimuth values for the considered sub-band is calculated, and if it is higher than a threshold then one codebook is used for the DCT0, otherwise the other codebook is used. In some embodiments the DCT0 coefficient selector/indicator generator 609 employs a dedicated optimized codebook. The resulting quantized spread coherence values can then be used to determine the resolution used for the encoding of the directional parameters, azimuth and elevation. Thus the DCT0 coefficient selector/indicator generator 609 is configured to deterministically choose one of the options for use as the quantized spread coherence value indicator x'. In some embodiments this deterministic selection can be by selecting the minimum quantized spread coherence value. The quantized spread coherence value indicator x' can then be output as a coherence control value to the direction encoder 205. In other words these resulting quantized spread coherence values can be output as coherence control (quantized spread coherence value indicators) 210. In some embodiments the DCT0 coefficient selector/indicator generator 609 is configured to, when the input format indicator 410 indicates that the audio signals are in a format other than a multi-channel input format - for example a MASA input format - quantize the DCT coefficient of order zero with a codebook dependent on the variance of the quantized azimuth. The decision on which codebook to use can be implemented in such embodiments after the directional data is quantized. In this case the spread coherence does not influence the quantization resolution of the direction.
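A minimal sketch of the deterministic indicator choice described above (the minimum rule is the one named in the text; the option values are illustrative). Because encoder and decoder can recompute the same options from the encoded energy ratio, no extra bits need to be spent on the indicator itself:

    def spread_coherence_indicator(quantized_options):
        # deterministic: both ends derive the same x' from the same precomputed options
        return min(quantized_options)

    x_prime = spread_coherence_indicator([0.45, 0.30, 0.55])  # -> 0.30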
The selected DCT order 0 (DCTO) coefficient can then be passed to the coherence value combiner 613.
In some embodiments, for multi-channel input data that is encoded in a MASA representation, the vector encoder 405 comprises a coherence value combiner 613. The coherence value combiner 613 is configured to receive the selected one of the precomputed DCT0 coefficients for the spread coherence value, the DCT order 1 (and further order DCT) encoded parameters and the encoded surround coherence parameters and generate combined coherence values which can be output as the encoded coherence values 406.
Furthermore with respect to Figure 7 is shown a schematic view of an example direction encoder 205 according to some embodiments. In some embodiments the direction encoder 205 comprises a quantization grid determiner 701. The quantization grid determiner 701 is configured to receive the coherence control (the quantized spread coherence value indicator) 210 and the encoded/quantized energy ratio 412. The quantization grid determiner 701 is configured to select or determine a quantization grid based on the quantized spread coherence value indicator 210 and the encoded/quantized energy ratio 412. The quantization grid, which defines the quantization and encoding accuracy of the direction parameter, in some embodiments can be reduced when the corresponding quantized direct-to-total ratio parameter value is low. This is because a low direct-to-total ratio signifies that the sound for the specific TF tile is not coming from any single apparent direction. Thus the direction also does not need to be quantized as accurately.
Furthermore the quantization grid determination takes into account that a non-zero spread coherence parameter value signifies that a normally point-like sound source is intended to be coherently reproduced from an area surrounding the direction defined by the quantized direction parameter value when the spread coherence value is around 0.5, and from the end points of that surrounding area when the spread coherence value is 1.0.
In practical implementations with, for example, loudspeaker reproduction, the grid can be determined so that sound is reproduced within a range of ±30° azimuth using one, three, or two loudspeakers (for spread coherence values of 0, around 0.5, and 1.0 respectively). When the spread coherence is higher, a lower accuracy grid may be used, as the lower accuracy grid is still able to encode the direction parameter within the correct area range.
Thus in some embodiments the direction encoder is configured to take the spread coherence into account by comparing the quantized spread coherence value indicator 210 to a threshold a. If the quantized spread coherence value indicator 210 is larger than this threshold, then the quantization grid determiner 701 is configured to reduce the quantization accuracy of the direction parameter (e.g., by decreasing number of bits given to encode the direction parameter).
Furthermore in some embodiments the quantized spread coherence value indicator 210 is compared to a further threshold b, and if the quantized spread coherence value indicator 210 is less than the further threshold then the quantization accuracy of the direction parameter can be increased, for example by increasing the number of bits given to encode the direction parameter.
Suitable example values of a and b are 0.35 and 0.05 respectively, and the respective adjustments may be -2 and +1 bits.
In some embodiments this approach can be extended such that there are multiple thresholds or a look-up table which adjusts the direction parameter quantization accuracy based on the quantized spread coherence value indicator 210.
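A minimal sketch of this bit-allocation adjustment, using the example thresholds and adjustments above:

    def adjust_direction_bits(base_bits, indicator, a=0.35, b=0.05):
        if indicator > a:
            return base_bits - 2   # spread source: a coarser direction grid suffices
        if indicator < b:
            return base_bits + 1   # point-like source: spend more bits on direction
        return base_bits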
The determined quantization grid can be passed to the direction quantizer/encoder 703.
In some embodiments the direction encoder 205 comprises a direction quantizer/encoder 703. The direction quantizer/encoder 703 is configured to receive the determined quantization grid from the quantization grid determiner 701 and the direction parameters 108. The direction quantizer/encoder 703 is then configured to quantize the direction parameters 108 based on the determined quantization grid. In some embodiments the direction quantizer/encoder 703 is configured to further compress or encode the quantized direction parameters. The encoded direction parameters can then be output.
In some embodiments the direction encoder 205 further comprises an encoded direction variance determiner 705. The encoded direction variance determiner 705 is configured to receive the encoded direction parameters and determine a variance of the encoded azimuth 414 value which is configured to be passed to the vector encoder 405.
With respect to Figure 8 is shown a flow diagram showing an example operation of the encoders, and specifically the example vector encoder 405 and direction encoder 205.
Thus in some embodiments the direction values are obtained as shown in Figure 8 by step 802.
Additionally the energy ratio values are obtained as shown in Figure 8 by step 851.
Furthermore in some embodiments the DCT vector values are obtained as shown in Figure 8 by step 801.
The surround coherence values are encoded as shown in Figure 8 by step 803.
The DCT1 (and further orders such as DCT2 onwards where implemented) of the spread coherence values are then encoded as shown in Figure 8 by step 805.
Furthermore, having obtained the energy ratio values, the energy ratio values are then encoded as shown in Figure 8 by step 853.
As shown in Figure 8 by step 811, when the input format is a MASA input format space is reserved in the bitstream for spread coherence codewords, or when the input format is a multi-channel input format the quantized coherence value is generated based on the higher azimuth variance option.
Having obtained the direction values, a quantization grid is determined based on the encoded energy ratio values (and the selected quantized spread coherence value indicator when the input format is a multi-channel input format) as shown in Figure 8 by step 804.
The direction values can then be encoded based on the determined quantization grid (and any further suitable encoding) as shown in Figure 8 by step 806.
Having determined the encoded direction values, a variance of the encoded direction values is determined, except for multi-channel input formats, as shown in Figure 8 by step 808. The variance of the encoded direction values, for example when the input format is a MASA input format, can then be used to select one of the codevectors from the corresponding codebook (the DCT0 coefficients for spread coherence value encoding) as shown in Figure 8 by step 813.
The encoded metadata can then be output as shown in Figure 8 by step 815.
Thus for the multi-channel input format there is one codebook for encoding the DCT0 coefficients and the determined spread coherence value influences the accuracy of the direction quantization, whereas for the MASA input format there are two (or more in some embodiments) codebooks for the encoding of the DCT0 coefficient. The codebook is selected based on the variance of the encoded or quantized azimuth. However in the MASA input format the spread coherence does not influence the direction quantization resolution.
With respect to Figure 9 is shown an example metadata extractor (or decoder) 137, as part of the decoder 133, from the viewpoint of extracting and decoding the direction parameters, energy ratio parameters and coherence parameter values according to some embodiments.
In some embodiments the metadata extractor 137 comprises a metadata demultiplexer 901. The encoded datastream 212 is passed to the metadata demultiplexer 901. The metadata demultiplexer 901 is configured to extract the encoded direction indices, energy ratio indices and coherence indices (and may also in some embodiments extract the other metadata and transport audio signals, not shown). The demultiplexed encoded energy ratios 902 can be passed to the energy ratio decoder 903, and also passed to the initial quantized grid determiner 923 and the DCT order 0 (DCT0) quantization option determiner 907. In some embodiments the metadata extractor 137 comprises an initial quantized grid determiner 923. The initial quantized grid determiner 923 is configured to receive the encoded energy ratios 902 and generate the initial quantized grid information and pass this to the direction decoder 927.
The metadata extractor 137 in some embodiments comprises an energy ratio decoder 903. The energy ratio decoder 903 is configured to obtain the demultiplexed encoded energy ratios 902 and decode the demultiplexed encoded energy ratios 902 to generate the decoded energy ratio parameters 952 for the frame by performing the inverse of the encoding of the energy ratios implemented by the energy ratio encoder.
In some embodiments the metadata extractor 137 comprises a coherence decoder 905. The coherence (surround and DCT0, DCT1 and DCT2) decoder 905 is configured to receive the demultiplexed encoded coherence parameters 904. The coherence decoder 905 is configured to decode the encoded surround coherence parameters in an operation which is the inverse of that performed to encode the surround coherence parameters in the surround coherence encoder 603 as shown in the example vector encoder 405. The decoded surround coherence parameters 956 are furthermore output. The coherence decoder 905 is furthermore configured to decode the DCT0 and DCT1 (and further spread coherence order elements) in an operation which is the inverse of the DCT0 and DCT1 (and further order) spread coherence parameter encoding implemented within the vector encoder 405.
Furthermore in some embodiments the metadata extractor 137 comprises a DCT order 0 (DCT0) quantization option determiner 907. The DCT order 0 (DCT0) quantization option determiner 907 is configured to receive the encoded energy ratios 902 and precompute all possible decoded spread coherence quantization options based on the known quantized direct-to-total ratio value and all possible direction-based codebooks. The possible decoded spread coherence quantization options can be passed to the (DCT0) Quantized spread coherence determiner 909.
In some embodiments the metadata extractor 137 comprises a (DCT0) quantized spread coherence determiner 909. The (DCT0) quantized spread coherence determiner 909 is configured to obtain all possible decoded spread coherence quantization options based on the known quantized direct-to-total ratio value, all possible direction-based codebooks and the decoded DCT0 parameters, generate all possible decoded spread coherence values, and pass these to a quantized spread coherence value indicator generator 911 and the decoded DCT0 coefficient selector 913.
The metadata extractor 137, in some embodiments, comprises a DCT0 coefficient selector/indicator generator 913. The DCT0 coefficient selector/indicator generator 913 is configured to obtain the (all possible) quantized spread coherence values from the DCT coefficients, an input format indicator 410, and in some embodiments a variance of decoded azimuth 914 value. The DCT0 coefficient selector/indicator generator 913 is configured to, when the input format indicator 410 indicates that the audio signals are in the multi-channel input format, select the codebook of the DCT coefficient of order zero corresponding to the higher azimuth variance. In some embodiments the DCT0 coefficient selector/indicator generator 913 employs a dedicated optimized codebook.
In some embodiments the DCT0 coefficient selector/indicator generator 913 is configured to, when the input format indicator 410 indicates that the audio signals are in a format other than a multi-channel input format - for example a MASA input format - decode the DCT coefficient of order zero with a codebook dependent on the variance of the quantized azimuth. The decision on which codebook to use can be implemented in such embodiments after the directional data is decoded. In this case the spread coherence does not influence the quantization resolution of the direction. The selected DCT0 coefficients as well as the further DCT order coefficients are then passed to the IDCT 915.
In some embodiments the metadata extractor 137 comprises a direction decoder 927. The direction decoder 927 is configured to receive the encoded direction parameters, the initial direction quantization grid information, and the quantized spread coherence value indicator 912, and decode the direction parameters based on the initial direction quantization grid information and the quantized spread coherence value indicator 912 (which can adjust the grid resolution). The decoded direction parameters 950 can then be output. Furthermore the variance of the decoded azimuth 914 can then be output to the decoded DCT0 coefficient selector 913. In some embodiments the metadata extractor 137 furthermore comprises an inverse discrete cosine transformer (IDCT) 915 which is configured to receive the DCT0 and further DCT order coefficients and inverse discrete cosine transform them using the inverse of the process used in the encoder. The inverse discrete cosine transformed coefficients can then be passed to a vector decoder 917.
The metadata extractor 137 can furthermore be configured with a vector decoder 917 configured to receive the inverse discrete cosine transformed coefficients and decode these to generate the decoded spread coherence vector parameters 954. The value of the spread coherence, or a suitable spread coherence value indicator 912, can be passed to the direction decoder 927 to modify the quantization grid of the direction and then dequantize the direction for the multi-channel input data. In other words the resulting decoded quantized spread coherence values can then be used to determine the resolution used for the decoding of the directional parameters, azimuth and elevation. Thus the vector decoder 917 is configured to determine a quantized spread coherence value indicator x'. In some embodiments this determination can be by selecting the minimum quantized spread coherence value. The quantized spread coherence value indicator x' can then be output as a coherence control value 912 to the direction decoder 927.
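A minimal sketch of the decoder-side inverse transform (a sketch under the assumption of an orthonormal DCT-II, whose inverse is its transpose; components that were not transmitted for the sub-band are zero-padded):

    import numpy as np

    def idct4(kept_coeffs, M=4):
        n, k = np.meshgrid(np.arange(M), np.arange(M))
        T = np.cos(np.pi * (2 * n + 1) * k / (2 * M)) * np.sqrt(2.0 / M)
        T[0, :] = np.sqrt(1.0 / M)                 # orthonormal DCT-II matrix
        y = np.zeros(M)
        y[:len(kept_coeffs)] = kept_coeffs         # e.g. only DCT0..DCT2 were sent
        return T.T @ y                             # inverse transform = transpose

    spread = np.clip(idct4([1.5, 0.2, -0.1]), 0.0, 1.0)  # coherence stays within [0, 1]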
With respect to Figure 10a there is shown an example operation of the metadata extractor as shown in Figure 9 when the input format for the audio signal is a MASA input format.
The initial operation is one of obtaining/demultiplexing the encoded metadata, as shown in Figure 10a by step 1001.
Then the energy ratios can be decoded as shown in Figure 10a by step 1002.
The directional quantization grid can then be determined based on the energy ratios as shown in Figure 10a by step 1003.
Also the coherence (Surround and DCT1 and possibly DCT2) values are decoded (according to any known method) based on the energy ratios as shown in Figure 10a by step 1005.
The direction values can then be decoded as shown in Figure 10a by step 1013. Having decoded the direction values, the variance of the decoded azimuth values is determined as shown in Figure 10a by step 1015.
Thus when the input format is a MASA input format, the variance of the decoded azimuth values is then used to select the decoded DCT0 coefficient from the multiple codebook options as shown in Figure 10a by step 1017.
The decoded DCT0 and further order coefficients are then inverse DCT transformed as shown in Figure 10a by step 1019.
Then the coherence vector is decoded and the metadata output as shown in Figure 10a by step 1021.
With respect to Figure 10b there is shown an example operation of the metadata extractor as shown in Figure 9 when the input format for the audio signal is a multichannel input format.
The initial operation is one of obtaining/demultiplexing the encoded metadata, as shown in Figure 10b by step 1001.
Then the energy ratios can be decoded as shown in Figure 10b by step 1002.
The initial directional quantization grid can then be determined based on the energy ratios as shown in Figure 10b by step 1053.
Also the coherence (Surround and DCT1 and DCT2) values are decoded based on the energy ratios as shown in Figure 10b by step 1005.
The DCT order 0 (DCT0) values are then decoded/determined (and in some embodiments this may be based on the energy ratios) as shown in Figure 10b by step 1057. In some embodiments the determination of the Surround, DCT0, DCT1 and DCT2 values are obtained in a single operation i.e. decoding DCT0 and DCT1 (and possibly DCT2) in a single step.
Here for the multi-channel input format, the quantized spread coherence value indicator is generated as shown in Figure 10b by step 1011.
The direction values can then be decoded based on the quantized spread coherence value indicator as shown in Figure 10b by step 1063.
The decoded DCT0 and further order coefficients are then inverse DCT transformed as shown in Figure 10b by step 1019.
Then the coherence vector is decoded and the metadata output as shown in Figure 10b by step 1021.
In the above embodiments the direction parameter encoding/decoding is implemented based on the DCT-based spread coherence encoding. In some embodiments the direction grid or accuracy adjustment is not based on the spread coherence (and is thus based only on the energy ratios). In such embodiments, where the spread coherence encoding is not used to adjust the direction encoding, there is no dependency of the spread coherence coding on the direction element; encoding and decoding can then be done separately and the direction quantization accuracy can be adjusted directly.
The above embodiments furthermore feature using quantized spread coherence values (or indicators) but these values can be replaced with encoded and decoded indices as long as the information is present.
In some embodiments, instead of using a single indicator value for controlling the direction parameter encoding, it is also possible to take into account the possible quantized spread coherence value options in a more complex way. For example, if there are multiple possibilities, a mean and variance of the options could be computed and used for deciding how to encode the direction parameter.
The above embodiments feature a direction quantization accuracy being dependent on both direct-to-total ratio value and spread coherence value. However, in some embodiments the quantization accuracy is defined only based on spread coherence value.
With respect to Figure 11 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
(i) a combination of analog and/or digital hardware circuit(s) with software/firmware and
(ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.
The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or program, also called program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The physical media is a non-transitory media.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.
The foregoing description has provided by way of non-limiting examples a full and informative description of the exemplary embodiment of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. All such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

CLAIMS:
1. An apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, and at least one spread coherence parameter; encode the at least one energy ratio to generate an encoded at least one energy ratio; generate an encoded spread coherence parameter; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
2. The apparatus as claimed in claim 1, wherein the means configured to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter is configured to: determine a quantization grid arrangement based on the at least one energy ratio and the encoded spread coherence parameter; and generate a codeword as the encoded at least one directional parameter based on the determined quantization grid and the at least one directional parameter.
3. The apparatus as claimed in any of claims 1 or 2, wherein the means configured to generate the encoded spread coherence parameter vector element; and encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter is configured to generate the encoded spread coherence parameter vector element and to encode the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter based on the at least one audio signal being a first format.
4. The apparatus as claimed in claim 3, wherein the first format is a multi-channel input format.
5. The apparatus as claimed in any of claims 3 or 4, wherein the means is further configured to, based on the format of the at least one audio signal being a second format: encode the at least one directional parameter based on encoded at least one energy ratio; and generate the encoded spread coherence parameter based on an encoded at least one directional parameter.
6. The apparatus as claimed in claim 5, wherein the second format is a metadata-assisted spatial audio format.
7. The apparatus as claimed in any of claims 5 or 6, wherein the means configured to generate the encoded spread coherence parameter based on an encoded at least one directional parameter is configured to: select a codebook based on the encoded at least one directional parameter; and generate an encoded spread coherence parameter based on the selection of the codebook.
8. The apparatus as claimed in any of claims 5 to 7, wherein the means configured to encode the at least one directional parameter based on encoded at least one energy ratio is configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a codeword as the encoded at least one directional parameter based on a quantized at least one directional parameter, the quantized at least one directional parameter being formed by the application of the determined quantization grid to the at least one directional parameter.
9. The apparatus as claimed in claim 8, wherein the means configured to select a codebook based on the encoded at least one directional parameter is further configured to select a codebook based on a variance of the quantized at least one directional parameter.
10. The apparatus as claimed in claim 9, wherein the quantized at least one directional parameter is a quantized azimuth of the at least one directional parameter.
11. The apparatus as claimed in any of claims 1 to 10, wherein the means configured to generate an encoded spread coherence parameter is configured to: discrete cosine transform a vector formed from the at least one spread coherence parameter to generate at least one zero order discrete cosine transformed spread coherence parameter vector element; and generate an encoded spread coherence parameter vector element.
12. The apparatus as claimed in claim 11, wherein the means configured to discrete cosine transform the vector formed from the at least one spread coherence parameter is further configured to generate at least one first order discrete cosine transformed spread coherence parameter vector element, and the means is configured to encode the at least the first order discrete cosine transformed spread coherence parameter vector element.
13. An apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determine at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
14. The apparatus as claimed in claim 13, wherein the means configured to determine at least one decoded directional parameter based on the at least one encoded directional parameter, and the encoded at least one energy ratio and the encoded spread coherence parameter is configured to: determine a quantization grid arrangement based on the encoded at least one energy ratio and the at least one encoded spread coherence parameter; and generate at least one decoded directional parameter value for the at least one directional parameter from the encoded at least one directional parameter applied to the determined quantization grid.
15. The apparatus as claimed in any of claims 13 or 14, wherein the means is configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter based on the at least one audio signal being a first format.
16. The apparatus as claimed in claim 15, wherein the first format is a multi-channel input format.
17. The apparatus as claimed in any of claims 15 or 16, wherein the means is further configured to, based on the format of the at least one audio signal being a second format: determine at least one decoded directional parameter based on the at least one encoded directional parameter and the encoded at least one energy ratio; select a codebook for the encoded spread coherence parameter, the selection based on the decoded directional parameter; and generate the decoded zero order discrete cosine transformed spread coherence parameter vector element based on the selection of the codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element.
18. The apparatus as claimed in claim 17, wherein the second format is a metadata-assisted spatial audio format.
19. The apparatus as claimed in any of claims 17 or 18, wherein the means configured to decode the at least one directional parameter based on encoded at least one energy ratio is configured to: determine a quantization grid arrangement based on the at least one energy ratio; and generate a decoded quantized at least one directional parameter based on the application of the determined quantization grid to the at least one encoded directional parameter.
20. The apparatus as claimed in claim 17, wherein the means configured to select a codebook for the encoded zero order discrete cosine transformed spread coherence parameter vector element, the selection based on the decoded directional parameter, is further configured to select a codebook for the encoded spread coherence parameter based on a variance of the quantized at least one directional parameter.
21. The apparatus as claimed in claim 20, wherein the quantized at least one directional parameter is a quantized azimuth of the at least one directional parameter.
22. The apparatus as claimed in any of claims 13 to 21, wherein the means is configured to determine at least one encoded zero order discrete cosine transformed spread coherence parameter vector element from the at least one encoded spread coherence parameter based on the at least one encoded energy ratio.
23. The apparatus as claimed in claim 22 when dependent on any of claims 14 to 16, wherein the means configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter is configured to determine the at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the zero order discrete cosine transformed spread coherence parameter vector element.
24. A method for an apparatus for encoding spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the spatial audio signal parameters, the spatial audio signal parameters comprising at least one directional parameter, at least one energy ratio, at least one spread coherence parameter; encoding the at least one energy ratio to generate an encoded at least one energy ratio; generating an encoded spread coherence parameter; and encoding the at least one directional parameter based on encoded at least one energy ratio and the encoded spread coherence parameter.
25. A method for an apparatus for decoding encoded spatial audio signal parameters, the encoded spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining the encoded spatial audio signal parameters, the encoded spatial audio signal parameters comprising at least one encoded directional parameter, at least one encoded energy ratio, and at least one encoded spread coherence parameter; and determining at least one decoded directional parameter based on the at least one encoded directional parameter, the encoded at least one energy ratio and the at least one encoded spread coherence parameter.
PCT/EP2021/060754 2021-04-23 2021-04-23 Spatial audio parameter encoding and associated decoding WO2022223133A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/060754 WO2022223133A1 (en) 2021-04-23 2021-04-23 Spatial audio parameter encoding and associated decoding

Publications (1)

Publication Number Publication Date
WO2022223133A1 true WO2022223133A1 (en) 2022-10-27

Family

ID=75728828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/060754 WO2022223133A1 (en) 2021-04-23 2021-04-23 Spatial audio parameter encoding and associated decoding

Country Status (1)

Country Link
WO (1) WO2022223133A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200265851A1 (en) * 2017-11-17 2020-08-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and Method for encoding or Decoding Directional Audio Coding Parameters Using Quantization and Entropy Coding
WO2020008105A1 (en) * 2018-07-05 2020-01-09 Nokia Technologies Oy Determination of spatial audio parameter encoding and associated decoding
WO2020089510A1 (en) * 2018-10-31 2020-05-07 Nokia Technologies Oy Determination of spatial audio parameter encoding and associated decoding
WO2021048468A1 (en) * 2019-09-13 2021-03-18 Nokia Technologies Oy Determination of spatial audio parameter encoding and associated decoding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NOKIA CORPORATION: "Description of the IVAS MASA C Reference Software", 3GPP TSG-SA4#106 MEETING, 21 October 2019 (2019-10-21)
NOKIA CORPORATION: "Description of the IVAS MASA C Reference Software", vol. SA WG4, no. Busan, Republic of Korea; 20191021 - 20191025, 15 October 2019 (2019-10-15), XP051799447, Retrieved from the Internet <URL:https://ftp.3gpp.org/tsg_sa/WG4_CODEC/TSGS4_106_Busan/Docs/S4-191167.zip> [retrieved on 20191015] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2624869A (en) * 2022-11-29 2024-06-05 Nokia Technologies Oy Parametric spatial audio encoding

Similar Documents

Publication Publication Date Title
EP3874492B1 (en) Determination of spatial audio parameter encoding and associated decoding
GB2575305A (en) Determination of spatial audio parameter encoding and associated decoding
WO2021130404A1 (en) The merging of spatial audio parameters
WO2021130405A1 (en) Combining of spatial audio parameters
CN114945982A (en) Spatial audio parametric coding and associated decoding
WO2022223133A1 (en) Spatial audio parameter encoding and associated decoding
US20230197087A1 (en) Spatial audio parameter encoding and associated decoding
US20240046939A1 (en) Quantizing spatial audio parameters
US20240185869A1 (en) Combining spatial audio streams
US20230410823A1 (en) Spatial audio parameter encoding and associated decoding
US20230335143A1 (en) Quantizing spatial audio parameters
US20240127828A1 (en) Determination of spatial audio parameter encoding and associated decoding
CA3237983A1 (en) Spatial audio parameter decoding
EP4315324A1 (en) Combining spatial audio streams
WO2020193865A1 (en) Determination of the significance of spatial audio parameters and associated encoding
WO2024115051A1 (en) Parametric spatial audio encoding
CA3208666A1 (en) Transforming spatial audio parameters
WO2024115052A1 (en) Parametric spatial audio encoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21722169

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21722169

Country of ref document: EP

Kind code of ref document: A1