GB2626953A - Audio rendering of spatial audio
- Publication number: GB2626953A
- Application number: GB2301773.4 (GB202301773A)
- Authority: GB (United Kingdom)
- Legal status: Pending
Classifications
- G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
- H04S7/30: Control circuits for electronic adaptation of the sound field
- H04S7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
- H04S7/303: Tracking of listener position or orientation
- H04S7/304: For headphones
- H04S2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTFs] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
- H04S2420/03: Application of parametric coding in stereophonic audio systems
- H04S2420/11: Application of ambisonics in stereophonic audio systems
Abstract
A Metadata Assisted Spatial Audio (MASA) stream for an Immersive Voice and Audio Services (IVAS) codec is rendered by obtaining multichannel input audio from, e.g., a microphone array and associated property parameters (e.g., bitrate, codec format or configuration of source directions), determining a control parameter (e.g., regularization) and processing parameters (e.g., regularised covariance matrices), in order to finally render the audio signal spatially.
Description
AUDIO RENDERING OF SPATIAL AUDIO
Field
The present application relates to apparatus and methods for audio rendering of spatial audio and the application of regularization in the rendering, but not exclusively to configuring the regularization of a mixing solution for the rendering.
Background
There are many ways to capture spatial audio. One option is to capture the spatial audio using a microphone array, e.g., as part of a mobile device. Using the microphone signals, spatial analysis of the sound scene can be performed to determine spatial metadata in frequency bands. Moreover, transport audio signals can be determined using the microphone signals. The spatial metadata and the transport audio signals can be combined to form a spatial audio stream.
Metadata-assisted spatial audio (MASA) is one example of a spatial audio stream. It is one of the input formats the upcoming immersive voice and audio services (IVAS) codec will support. It uses audio signal(s) together with corresponding spatial metadata (containing, e.g., directions and direct-to-total energy ratios in frequency bands) and descriptive metadata (containing additional information relating to, e.g., the original capture and the (transport) audio signal(s)). The MASA stream can, e.g., be obtained by capturing spatial audio with the microphones of, e.g., a mobile device, where the set of spatial metadata is estimated based on the microphone signals. The MASA stream can also be obtained from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (e.g., a 5.1 mix) or other content by means of a suitable format conversion. It is also possible to use MASA tools inside a codec for the encoding of multichannel signals by converting the multichannel signals to a MASA stream and encoding that stream.
Summary
According to a first aspect there is provided a method for generating an audio signal, the method comprising: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
The input audio signal may be a spatial audio signal, the spatial audio signal further comprising at least one spatial parameter associated with the at least two audio channels.
Determining processing parameters based at least on the at least two audio channels may further comprise determining the processing parameters further based at least on the at least one spatial parameter.
The at least one input property parameter may comprise at least one of: a bitrate associated with the at least one input audio signal; a codec format indicating an origin of the at least one input audio signal; and a configuration of the at least one input audio signal.
The configuration of the at least one input audio signal may comprise at least one of: a source format indicating an original or input format from which the input audio signal was created; transport channel description; number of channels; channel distance; and channel angle.
The codec format indicating the origin of the at least one input audio signal may comprise at least one of: a metadata assisted spatial audio stream origin codec format value indicating the at least one input audio signal represents a metadata assisted spatial audio signal; a multichannel audio stream origin codec format value indicating the at least one input audio signal represents a multichannel audio signal; an audio object codec format value indicating the input audio signal represents audio objects; and an Ambisonic format value indicating the input audio signal represents an Ambisonic audio signal.
The at least one spatial parameter may comprise at least one of: information that describes an organization of sound in space with respect to the at least one input audio signal, the information comprising: a direction parameter configured to indicate from where the sound arrives; and a ratio parameter configured to indicate a portion of the sound that arrives from that direction; information that describes properties of an original multi-channel or multi-object sound scene, the information comprising at least one of: channel levels; object levels; inter-channel correlations; inter-object correlations; and object directions; processing coefficients related to obtaining spatial audio format signals based at least on the at least one input audio signal.
The codec format indicating the origin of the at least one input audio signal may comprise an indication that the origin is unknown or undefined.
Determining the at least one control parameter based at least on the at least one input property parameter may comprise: determining a first set of control parameter values based on a first one of the at least one input property parameter; and selecting one control parameter value from the first set of control parameter values based on a second one of the at least one input property parameter.
The first one of the at least one input property parameter may be the codec format and the second one of the at least one input property parameter may be the bitrate.
Determining processing parameters based at least on the at least two audio channels wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter may comprise: generating processing matrices based at least on the at least two audio channels wherein the generating of the processing matrices may be controlled based at least partially on the at least one control parameter.
Generating the processing matrices controlled based at least partially on the control parameter may comprise regularizing the generation of the processing matrices based at least on the control parameter.
Generating the audio signal based at least on the at least two audio channels and the processing parameters may comprise processing the at least two audio channels using the regularized processing matrices to generate the audio signal.
Generating the processing matrices controlled based at least partially on the at least one control parameter may comprise generating entries of a diagonal matrix based at least on at least two control parameter values.
Obtaining at least one input property parameter associated with the input audio signal may comprise deducing from the input audio signal the at least one input property parameter.
Obtaining at least one input property parameter associated with the input audio signal may comprise receiving the at least one input property parameter as part of the input audio signal or as configuration information.
Receiving the at least one input property parameter as part of the input audio signal may further comprise: receiving an encoded input property parameter as part of the input audio signal; and decoding the encoded input property parameter to obtain the input property parameter.
Generating the audio signal based at least on the at least two audio channels and the processing parameters may comprise generating at least one of: a binaural audio signal; and a multichannel audio signal.
According to a second aspect there is provided an apparatus for generating an audio signal, the apparatus comprising means configured to: obtain an input audio signal comprising at least two audio channels; obtain at least one input property parameter associated with the input audio signal; determine at least one control parameter based at least on the at least one input property parameter; determine processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generate the audio signal based at least on the at least two audio channels and the processing parameters.
The input audio signal may be a spatial audio signal, the spatial audio signal further comprising at least one spatial parameter associated with the at least two audio channels.
The means configured to determine processing parameters based at least on the at least two audio channels may further be configured to determine the processing parameters further based at least on the at least one spatial parameter.
The at least one input property parameter may comprise at least one of: a bitrate associated with the at least one input audio signal; a codec format indicating an origin of the at least one input audio signal; and a configuration of the at least one input audio signal.
The configuration of the at least one input audio signal may comprise at least one of: a source format indicating an original or input format from which the input audio signal was created; transport channel description; number of channels; channel distance; and channel angle.
The codec format indicating the origin of the at least one input audio signal may comprise at least one of: a metadata assisted spatial audio stream origin codec format value indicating the at least one input audio signal represents a metadata assisted spatial audio signal; a multichannel audio stream origin codec format value indicating the at least one input audio signal represents a multichannel audio signal; an audio object codec format value indicating the input audio signal represents audio objects; and an Ambisonic format value indicating the input audio signal represents an Ambisonic audio signal.
The at least one spatial parameter may comprise at least one of: information that describes an organization of sound in space with respect to the at least one input audio signal, the information comprising: a direction parameter configured to indicate from where the sound arrives; and a ratio parameter configured to indicate a portion of the sound that arrives from that direction; information that describes properties of an original multi-channel or multi-object sound scene, the information comprising at least one of: channel levels; object levels; inter-channel correlations; inter-object correlations; and object directions; processing coefficients related to obtaining spatial audio format signals based at least on the at least one input audio signal.
The codec format indicating the origin of the at least one input audio signal may comprise an indication that the origin is unknown or undefined.
The means configured to determine the at least one control parameter based at least on the at least one input property parameter may be further configured to: determine a first set of control parameter values based on a first one of the at least one input property parameter; and select one control parameter value from the first set of control parameter values based on a second one of the at least one input property parameter.
The first one of the at least one input property parameter may be the codec format and the second one of the at least one input property parameter may be the bitrate.
The means configured to determine processing parameters based at least on the at least two audio channels wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter may be further configured to: generate processing matrices based at least on the at least two audio channels wherein the generating of the processing matrices may be controlled based at least partially on the at least one control parameter.
The means configured to generate the processing matrices controlled based at least partially on the control parameter may be further configured to regularize the generation of the processing matrices based at least on the control parameter.
The means configured to generate the audio signal based at least on the at least two audio channels and the processing parameters may be further configured to process the at least two audio channels using the regularized processing matrices to generate the audio signal.
The means configured to generate the processing matrices controlled based at least partially on the at least one control parameter may be further configured to generate entries of a diagonal matrix based at least on at least two control parameter values.
The means configured to obtain at least one input property parameter associated with the input audio signal may be further configured to deduce from the input audio signal the at least one input property parameter.
The means configured to obtain at least one input property parameter associated with the input audio signal may be further configured to receive the at least one input property parameter as part of the input audio signal or as configuration information.
The means configured to receive the at least one input property parameter as part of the input audio signal may further be configured to: receive an encoded input property parameter as part of the input audio signal; and decode the encoded input property parameter to obtain the input property parameter.
The means configured to generate the audio signal based at least on the at least two audio channels and the processing parameters may be further configured to generate at least one of: a binaural audio signal; and a multichannel audio signal.
According to a third aspect there is provided an apparatus for generating an audio signal, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
The input audio signal may be a spatial audio signal, the spatial audio signal further comprising at least one spatial parameter associated with the at least two audio channels.
The apparatus caused to perform determining processing parameters based at least on the at least two audio channels may further be caused to perform determining the processing parameters further based at least on the at least one spatial parameter.
The at least one input property parameter may comprise at least one of: a bitrate associated with the at least one input audio signal; a codec format indicating an origin of the at least one input audio signal; and a configuration of the at least one input audio signal.
The configuration of the at least one input audio signal may comprise at least one of: a source format indicating an original or input format from which the input audio signal was created; transport channel description; number of channels; channel distance; and channel angle.
The codec format indicating the origin of the at least one input audio signal may comprise at least one of: a metadata assisted spatial audio stream origin codec format value indicating the at least one input audio signal represents a metadata assisted spatial audio signal; a multichannel audio stream origin codec format value indicating the at least one input audio signal represents a multichannel audio signal; an audio object codec format value indicating the input audio signal represents audio objects; and an Ambisonic format value indicating the input audio signal represents an Ambisonic audio signal.
The at least one spatial parameter may comprise at least one of: information that describes an organization of sound in space with respect to the at least one input audio signal, the information comprising: a direction parameter configured to indicate from where the sound arrives; and a ratio parameter configured to indicate a portion of the sound that arrives from that direction; information that describes properties of an original multi-channel or multi-object sound scene, the information comprising at least one of: channel levels; object levels; inter-channel correlations; inter-object correlations; and object directions; processing coefficients related to obtaining spatial audio format signals based at least on the at least one input audio signal.
The codec format indicating the origin of the at least one input audio signal may comprise an indication that the origin is unknown or undefined.
The apparatus caused to perform determining the at least one control parameter based at least on the at least one input property parameter may be further caused to perform: determining a first set of control parameter values based on a first one of the at least one input property parameter; and selecting one control parameter value from the first set of control parameter values based on a second one of the at least one input property parameter.
The first one of the at least one input property parameter may be the codec format and the second one of the at least one input property parameter may be the bitrate.
The apparatus caused to perform determining processing parameters based at least on the at least two audio channels wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter may be further caused to perform: generating processing matrices based at least on the at least two audio channels wherein the generating of the processing matrices may be controlled based at least partially on the at least one control parameter.
The apparatus caused to perform generating the processing matrices controlled based at least partially on the control parameter may be further caused to perform regularizing the generation of the processing matrices based at least on the control parameter.
The apparatus caused to perform generating the audio signal based at least on the at least two audio channels and the processing parameters may be further caused to perform processing the at least two audio channels using the regularized processing matrices to generate the audio signal.
The apparatus caused to perform generating the processing matrices controlled based at least partially on the at least one control parameter may be further caused to perform generating entries of a diagonal matrix based at least on at least two control parameter values.
The apparatus caused to perform obtaining at least one input property parameter associated with the input audio signal may be further caused to perform deducing from the input audio signal the at least one input property parameter.
The apparatus caused to perform obtaining at least one input property parameter associated with the input audio signal may be further caused to perform receiving the at least one input property parameter as part of the input audio signal or as configuration information.
The apparatus caused to perform receiving the at least one input property parameter as part of the input audio signal may further be caused to perform: receiving an encoded input property parameter as part of the input audio signal; and decoding the encoded input property parameter to obtain the input property parameter.
The apparatus caused to perform generating the audio signal based at least on the at least two audio channels and the processing parameters may be further caused to perform generating at least one of: a binaural audio signal; and a multichannel audio signal.
According to a fourth aspect there is provided an apparatus for generating an audio signal, the apparatus comprising: means for obtaining an input audio signal comprising at least two audio channels; means for obtaining at least one input property parameter associated with the input audio signal; means for determining at least one control parameter based at least on the at least one input property parameter; means for determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and means for generating the audio signal based at least on the at least two audio channels and the processing parameters.
According to a fifth aspect there is provided an apparatus for generating an audio signal, the apparatus comprising: obtaining circuitry configured to obtain an input audio signal comprising at least two audio channels; obtaining circuitry configured to obtain at least one input property parameter associated with the input audio signal; determining circuitry configured to determine at least one control parameter based at least on the at least one input property parameter; determining circuitry configured to determine processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating circuitry configured to generate the audio signal based at least on the at least two audio channels and the processing parameters.
According to a sixth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for generating an audio signal to perform at least the following: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
According to a seventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for generating an audio signal to perform at least the following: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
According to an eighth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for generating an output audio signal to perform at least the following: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which: Figures 1 to 3 show schematically example systems for capturing or otherwise obtaining spatial audio signals in the form of transport audio signals and spatial metadata; Figure 4 shows schematically an example system of encoding spatial audio signals in the form of transport audio signals and spatial metadata and playback of spatial audio signals suitable for implementing some embodiments; Figure 5 shows schematically an example system of multiple-operating-mode-based encoding of spatial audio signals in the form of transport audio signals and spatial metadata and playback of spatial audio signals suitable for implementing some embodiments; Figure 6 shows schematically an example playback apparatus suitable for implementing some embodiments; Figure 7 shows schematically an example decoder as shown in Figure 5 according to some embodiments; Figure 8 shows a flow diagram of the operation of the example decoder apparatus shown in Figure 7 according to some embodiments; Figure 9 shows schematically an example spatial synthesizer as shown in Figure 5 according to some embodiments; Figure 10 shows a flow diagram of the operation of the example spatial synthesizer shown in Figure 9 according to some embodiments; Figure 11 shows a flow diagram of the operation of the example regularization factor determiner as shown in Figure 9 according to some embodiments; and Figure 12 shows example processing outputs.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the rendering of suitable output audio signals from parametric spatial audio streams (or signals) from captured or otherwise obtained audio signals.
As discussed above Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
It can be considered an audio representation consisting of 'N channels + spatial metadata'. It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound directions and, e.g., energy ratios. Sound energy that is not defined (described) by the directions is described as diffuse (coming from all directions).
As discussed above spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction (or directional value) a direct-to-total ratio, spread coherence, distance, etc.) per time-frequency tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene. For example a reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency element (also known as a time-frequency portion or time-frequency tile). Furthermore associated with each direction the metadata comprises other parameters such as direct-to-total ratios, spread coherence, distance values.
As described above, parametric spatial metadata representation can use multiple concurrent spatial directions. With MASA, the proposed maximum number of concurrent directions is two. For each concurrent direction, there may be associated parameters such as: Direction index; Direct-to-total ratio; Spread coherence; and Distance. In some embodiments other parameters such as Diffuse-to-total energy ratio; Surround coherence; and Remainder-to-total energy ratio are defined.
The parametric spatial metadata values are available for each time-frequency tile (the MASA format defines that there are 24 frequency bands and 4 temporal sub-frames in each frame). The frame size in IVAS is 20 ms. Furthermore currently MASA supports 1 or 2 directions for each time-frequency tile.
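As an illustrative sketch (not the normative MASA bit-level layout), the per-tile spatial metadata described above could be organised as follows, assuming the 24-band by 4-sub-frame grid:

```python
from dataclasses import dataclass
from typing import List

N_BANDS = 24      # MASA frequency bands per frame
N_SUBFRAMES = 4   # temporal sub-frames per 20 ms frame

@dataclass
class DirectionalMetadata:
    """Parameters attached to one concurrent direction of a time-frequency tile."""
    direction_index: int          # quantised spherical direction (approx. 1-degree grid)
    direct_to_total_ratio: float
    spread_coherence: float
    distance: float

@dataclass
class TileMetadata:
    """Spatial metadata for one time-frequency tile (one band, one sub-frame)."""
    directions: List[DirectionalMetadata]   # currently 1 or 2 concurrent directions
    diffuse_to_total_ratio: float
    surround_coherence: float
    remainder_to_total_ratio: float

# One 20 ms MASA frame carries an N_BANDS x N_SUBFRAMES grid of such tiles,
# indexed as frame[band][subframe].
```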
Example metadata parameters can be: Format descriptor, this parameter defines the MASA format for IVAS. It may be stored as a 64-bit value comprising eight 8-bit ASCII characters: 01001001, 01010110, 01000001, 01010011, 01001101, 01000001, 01010011, 01000001, where the values are stored as 8 consecutive 8-bit unsigned integers; Channel audio format, this parameter combines the following fields and is stored in two bytes as a single 16-bit unsigned integer; Number of directions, this parameter defines the number of directions described by the spatial metadata. Each direction can be associated with a set of direction-dependent spatial metadata as described afterwards. The number of directions parameter may be one of a range of values. Currently the defined range of possible values is 1 or 2, but more than 2 directions may be employed in some embodiments; Number of channels, this parameter defines the number of transport channels employed by the format. This number of channels parameter may be one of a range of values. Currently the defined range of possible values is 1 or 2, but in some embodiments more than 2 channels can be employed; Source format, this parameter describes the original or input format from which the MASA output was created (for example, a 5.1 multichannel input format).
In some circumstances there can be further description or parameter fields based on the values of 'Number of channels' and 'Source format' parameter fields.
In some embodiments when all of the allocated bits are not used, zero padding can be applied.
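For illustration, the eight 8-bit values of the format descriptor listed above decode directly to ASCII text (the variable name is ours, not part of the format):

```python
# The eight 8-bit unsigned integers of the MASA format descriptor, as listed above.
descriptor_bytes = [0b01001001, 0b01010110, 0b01000001, 0b01010011,
                    0b01001101, 0b01000001, 0b01010011, 0b01000001]

# Interpreted as consecutive ASCII characters they spell out the format name.
print(bytes(descriptor_bytes).decode("ascii"))  # -> "IVASMASA"
```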
Examples of the MASA format spatial metadata parameters which are dependent on the number of directions can be: Direction index, this parameter defines a direction of arrival of the sound at a time-frequency parameter interval. Typically this direction of arrival is a spherical representation at about 1-degree accuracy; Direct-to-total energy ratio, this parameter defines an energy ratio for the direction index (in other words defining an energy ratio associated with the direction for the time-frequency subframe); Spread coherence, this parameter defines a spread of energy for the direction index (in other words defining a measure of the 'width' of the direction for the time-frequency subframe); Transport definition, this describes the configuration of the two transport channels; Channel angle, this describes symmetric angle positions for transport signals with directivity patterns; Channel distance, this describes the distance between the two transport channels; and Channel layout, when the source format is multichannel, this describes the channel layout of the multichannel source format.
Examples of MASA format spatial metadata parameters which are independent of the number of directions can be: Diffuse-to-total energy ratio, this parameter defines an energy ratio of non-directional sound over surrounding directions; Surround coherence, this parameter defines a coherence of the non-directional sound over the surrounding directions; Remainder-to-total energy ratio, this parameter defines an energy ratio of the remainder (such as microphone noise) sound energy to fulfil the requirement that the sum of the energy ratios is 1.
Furthermore, example spatial metadata frequency bands can be:

Band  LF (Hz)  HF (Hz)  BW (Hz)
1     0        400      400
2     400      800      400
3     800      1200     400
4     1200     1600     400
5     1600     2000     400
6     2000     2400     400
7     2400     2800     400
8     2800     3200     400
9     3200     3600     400
10    3600     4000     400
11    4000     4400     400
12    4400     4800     400
13    4800     5200     400
14    5200     5600     400
15    5600     6000     400
16    6000     6400     400
17    6400     6800     400
18    6800     7200     400
19    7200     7600     400
20    7600     8000     400
21    8000     10000    2000
22    10000    12000    2000
23    12000    16000    4000
24    16000    24000    8000

The MASA stream can be rendered to various outputs, such as multichannel loudspeaker signals (e.g., 5.1) or binaural signals.
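As a sketch, the band edges in the table above could be stored and used to map filter-bank bins to MASA bands (the function and variable names are illustrative, not taken from the IVAS specification):

```python
import numpy as np

# Band edges (Hz) from the table above: band k (1-based) covers [edges[k-1], edges[k]).
MASA_BAND_EDGES_HZ = np.array([
    0, 400, 800, 1200, 1600, 2000, 2400, 2800, 3200, 3600, 4000, 4400,
    4800, 5200, 5600, 6000, 6400, 6800, 7200, 7600, 8000, 10000, 12000,
    16000, 24000])

def freq_to_band(freq_hz: float) -> int:
    """Map a frequency in Hz to a 0-based MASA band index."""
    band = np.searchsorted(MASA_BAND_EDGES_HZ, freq_hz, side="right") - 1
    return int(np.clip(band, 0, len(MASA_BAND_EDGES_HZ) - 2))

print(freq_to_band(100.0), freq_to_band(5000.0), freq_to_band(23000.0))  # 0 12 23
```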
One example rendering method is described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. The rendering method is based on multi-channel mixing. The method processes the given audio signals in frequency bands so that a desired covariance matrix is obtained for the output signal in frequency bands. The covariance matrix contains the channel energies of all channels and inter-channel relationships between all channel pairs, namely the cross-correlation and the inter-channel phase differences. These features are known to convey the perceptually relevant spatial features of a multi-channel sound in various playback situations, such as binaurally for headphones, surround loudspeakers, Ambisonics, and cross-talk-cancelled stereo.
As discussed above, the rendering method in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411 attempts to process the input audio signals (e.g., decoded transport audio signals) such that the produced output audio signals (e.g., binaural audio signals) have a desired covariance matrix. This target covariance matrix is determined based on the energy of the input audio signals and the spatial metadata that the renderer receives.
In practice, this rendering method first attempts to achieve the target covariance matrix by mixing the input audio signals. However, to prevent amplifying the input audio signals with very large gain values, a regularization is applied in the processing. For example, if the input audio signals are highly correlated (in other words have highly similar signals), but the target covariance matrix is set to have low correlation, this would lead to processing gains that would amplify the incoherent signal portion excessively. If there was no regularization applied, this would lead to significant amplification of noises and other unwanted sounds in the input audio signals, causing bad audio quality.
As the operation of obtaining the mixing gains is regularized, the target covariance matrix is not always achieved by mixing alone. When there is regularization applied, some signal portions are not amplified as much as needed to reach the target, and therefore some of the expected or desired signal energy is missing. As a result, these sounds would effectively be attenuated in the reproduction, and the desired incoherence between the output channels may not be reached. A proposed solution to achieving the desired incoherence is to decorrelate the input signals to obtain incoherent signals and process these signals with mixing gains to obtain the covariance matrix of the missing signal portion. By these means the target covariance matrix is obtained for the output signals also when the regularization has limited the processing.
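As a rough illustration of the regularised mixing described above, the following sketch loosely follows the cited covariance-domain framework; it is our own simplified reduction (one frequency band, equal input and output channel counts assumed), not the actual IVAS renderer:

```python
import numpy as np

def regularized_mixing_matrix(cov_in, cov_target, prototype, reg_factor=0.2):
    """Find a mixing matrix M so that M @ cov_in @ M.conj().T approximates
    cov_target, while regularisation limits how much weak input components
    may be amplified. reg_factor in 0..1: larger value = more regularisation."""
    # Decompose the covariances as C = K K^H via eigen-decompositions.
    sx, ux = np.linalg.eigh(cov_in)
    sy, uy = np.linalg.eigh(cov_target)
    sx, sy = np.maximum(sx, 0.0), np.maximum(sy, 0.0)
    kx = ux @ np.diag(np.sqrt(sx))
    ky = uy @ np.diag(np.sqrt(sy))

    # Regularised inverse of the input decomposition: eigenvalues are floored
    # relative to the largest one, which caps the resulting mixing gains.
    sx_floor = np.maximum(sx, reg_factor * sx.max() + 1e-12)
    kx_reg_inv = np.diag(1.0 / np.sqrt(sx_floor)) @ ux.conj().T

    # Choose the remaining free (unitary) factor so that M stays close to the
    # prototype mapping from input to output channels.
    u, _, vh = np.linalg.svd(ky.conj().T @ prototype @ kx, full_matrices=False)
    return ky @ (u @ vh) @ kx_reg_inv

# Example: two highly correlated transport channels, low-correlation binaural target.
cov_in = np.array([[1.0, 0.9], [0.9, 1.0]])
cov_target = np.array([[1.0, 0.2], [0.2, 1.0]])
M = regularized_mixing_matrix(cov_in, cov_target, np.eye(2), reg_factor=0.3)
achieved = M @ cov_in @ M.conj().T
residual = cov_target - achieved  # portion to be covered with decorrelated energy
```

In this sketch the regularisation floor prevents the small eigenvalue of the correlated input from being amplified excessively, so `achieved` falls short of `cov_target` and the remainder would be synthesised from decorrelated signals, as described above.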
There are positives and negatives to both means, mixing and decorrelating. The decorrelation processing has a drawback that it affects the sound quality, especially when there is significant amount of decorrelated signal in the output.
When a sound is decorrelated its phase spectrum is modified at least to a degree, typically time-invariantly to preserve the tones. It is known that the perceived quality of certain sounds such as speech or applause will degrade in this process.
On the other hand, mixing with too large gains causes problems such as excessive amplification of small signal components that could be in some situations mostly noise.
The amount of regularization can be used to control how much decorrelated energy is mixed into the output. If only mild regularization is applied (i.e., the maximum allowed mixing gains are large), the output contains mostly a mixture of input signals without much decorrelated signal. Conversely, if a significant amount of regularization is applied (i.e., the maximum allowed mixing gains are small), the output contains more decorrelated signals, as large mixing gains are prevented.
Thus, the amount of regularization is a compromise between avoiding the problems of mixing and decorrelation. The optimal amount of regularization depends on the situation, and a few examples are discussed below.
The IVAS codec operates on a wide range of bitrates (from 13.2 kbps to 512 kbps). Especially at the lowest end, there are significant coding artefacts (such as musical and other noises and distortions) in the transport audio signals that the renderer receives. Thus, a significant amount of regularization is needed to avoid the amplification of the coding artefacts, which would make them even more prominent and significantly decrease the perceived audio quality. On the other hand, having a high amount of regularization (more decorrelation) would be a suboptimal choice at higher bitrates. There are significantly fewer of those coding artefacts in the transport audio signals, so this would lead to more decorrelation than needed, causing suboptimal audio quality.
Moreover, the IVAS codec supports multiple input formats. The MASA format often originates from mobile devices, which typically have inexpensive microphones, and thus there is typically perceivable microphone noise in the MASA input signals. Thus, a significant amount of regularization is needed in the renderer to avoid amplification of the microphone noise. On the other hand, IVAS also supports multichannel inputs (such as 5.1), which are typically recorded and produced in a studio with professional microphones having less noise. Thus, a fixed setting for regularization is not optimum in this respect either, as different formats would require different values to achieve a range of good experiences.
Known rendering methods may have a fixed regularization with a value set for situations where the signal-to-noise ratio (SNR) is high. A fixed regularization value based on a high-SNR situation may not produce acceptable results for low-SNR situations. For example, a spatial audio codec operating at low bit rates, such as 32 kbps or below, may not benefit from a regularization value set for a high-bit-rate, low-noise input signal.
As a result, a fixed regularization value produces a rendering which is optimal only for certain types of input signals and/or for certain bit rates at which the transported audio signals are encoded. For other types of input signals and/or bit rates, there can be either too much regularization (leading to too much decorrelation) or too little regularization (leading to amplification of noises), both of which lead to a deterioration of the perceived audio quality.
Too much decorrelation produces a sound which can be perceived as reverberant and non-engaging, and amplification of noises produces a reproduced sound with perceived artefacts.
The following embodiments, and the concept discussed in the application herein, relate to rendering spatial audio from a parametric spatial audio stream (audio signal(s) and associated spatial metadata) or, more generally, from at least one audio signal comprising two or more audio channels. In these embodiments a rendering method (and thus a renderer apparatus) is proposed that enables obtaining improved audio quality by controlling the regularization in the determination of rendering gains based on an input property (or input parameter) associated with the at least one audio signal (such as bitrate and/or codec format). The controlled regularization is aimed at minimizing artefacts due to decorrelation (added reverberance and loss of engagement) and excessive amplification of noises (perception of loud musical and other noises), thus improving the perceived audio quality by mitigating the perception of those artefacts.
In some embodiments this can be achieved by obtaining the (parametric spatial) audio stream; obtaining the input property; determining a regularization value based on the input property; determining regularized rendering gains based on the regularization value, the audio signal(s), and the associated spatial metadata; and rendering spatial audio signals (such as binaural audio signals) from the audio signal(s) using the regularized rendering gains.
The input property can mean various things in different embodiments. For example, the input property can be: the bitrate that was used for coding the parametric spatial audio stream (in case it was encoded/decoded before the rendering); the codec format, stating the origin of the stream, e.g., whether it is a MASA stream (i.e., typically from a mobile device) or whether it was created from multichannel signals (such as 5.1); or the configuration of the audio signal(s) the renderer receives (e.g., whether they are from cardioid or omnidirectional microphones).
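As a sketch of how such input properties could steer the regularisation, consider a two-step selection of the kind outlined above: a set of candidate values per codec format, from which one value is picked based on the bitrate. The table entries below are invented for illustration only and are not values from the IVAS codec:

```python
# Hypothetical tuning table: for each codec format, regularisation values at a few
# bitrate anchor points (bps). All values here are illustrative assumptions.
REG_TABLE = {
    "MASA":         [(13200, 0.5), (64000, 0.3), (256000, 0.15), (512000, 0.1)],
    "MULTICHANNEL": [(13200, 0.4), (64000, 0.2), (256000, 0.1),  (512000, 0.05)],
    "UNDEFINED":    [(13200, 0.45), (512000, 0.2)],
}

def regularization_factor(codec_format: str, bitrate_bps: int) -> float:
    """Pick a regularisation control value: more regularisation at low bitrates
    and for noisier source formats (e.g., mobile MASA capture)."""
    anchors = REG_TABLE.get(codec_format, REG_TABLE["UNDEFINED"])
    # Take the value of the highest anchor bitrate not exceeding the actual bitrate.
    chosen = anchors[0][1]
    for anchor_bps, value in anchors:
        if bitrate_bps >= anchor_bps:
            chosen = value
    return chosen

print(regularization_factor("MASA", 32000))           # low bitrate -> 0.5
print(regularization_factor("MULTICHANNEL", 256000))  # high bitrate -> 0.1
```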
As discussed above the embodiments discussed in further detail hereafter present systems utilizing a spatial sound renderer and controlling the regularization of the rendering gains based on at least one input property. In some embodiments, the renderer is implemented within a decoder and two example input properties are considered. The input properties discussed herein are Codec format and Bitrate. In some other embodiments, the renderer can be implemented elsewhere than in a decoder, and there can be other input properties.
First, examples the "Codec format" input property is described in further detail, and how signals of different codec formats can be obtained. Then, the "Bitrate" input property is described in further detail. After this, an implementation of some embodiments within a decoder is described in further detail.
A codec format input property can indicate the type of a parametric spatial audio signal being transmitted. As stated previously, a parametric spatial audio signal is a signal that comprises one or more audio signals and associated spatial metadata. The spatial metadata has information indicating how the sound is spatially organized, and it is typically provided at a substantially lower sampling rate than the audio signal.
Examples of spatial metadata in different "codec formats" include: information that describes the organization of the sound in space. For example, in frequency bands, a direction parameter may indicate from where the sound arrives, and a ratio parameter may indicate the portion of sound that arrives from that direction; information that describes the properties of an original multi-channel or multi-object sound scene. For example, channel or object levels and inter-channel or inter-object correlations, or object directions; processing coefficients related to obtaining certain spatial audio format signals (such as Ambisonic audio signals) based on the transmitted audio signals.
The examples shown above can be extended with other types of spatial metadata also used. The codec format input property can thus indicate the type of the spatial metadata being conveyed.
In some embodiments the codec format input property can also indicate (or alternatively indicate) the type or origin of the transport audio signals.
For example, the codec format input property can in some embodiments indicate that the transport audio signals are a downmix of multichannel signals, or that they have been captured with a microphone array, or that they are a decomposition of Ambisonic audio signals. In some embodiments the codec input property can have a defined value which could indicate that the origin of the transport audio signals is not known or undefined.
In some embodiments, two different codec formats may have mutually same type of spatial metadata or same type of transport audio signals.
Any of the abovementioned audio formats may be rendered to a multitude of output spatial audio formats, such as a multichannel loudspeaker output (e.g., 5.1), head-tracked or non-head-tracked binaural output, or cross-talk-cancelled stereo. The following examples describe the synthesis of binaural audio signals.
The examples described herein can be straightforwardly extended and are applicable to reproduction of other output types.
Although the following embodiments are described with respect to the application of regularization in a renderer where the input is a parametric spatial audio, it would be understood that some embodiments can be applied to any suitable renderer and rendering operation where regularization is applied to the input audio to be rendered and the regularization value is controlled based on input parameters such as the bitrate and codec format.
With respect to Figure 1 is shown an example of apparatus configured to determine a parametric spatial audio signal from microphone array signals. This apparatus can, for example, be implemented on a mobile phone comprising a microphone array (integrated within the mobile phone).
In such embodiments the microphone array signals 100 can be forwarded to a microphone array frontend 101. The microphone array frontend can, for example, be implemented using methods such as presented in US10873814, and is configured to output transport audio signals 102 and spatial metadata 104. The transport audio signals 102 and the spatial metadata 104 can, for example be in the form of a MASA stream.
In some embodiments the microphone array frontend 101 is configured to measure inter-microphone correlations at different delays between the microphones in frequency bands, find the delay that maximizes the correlation, and determine the direction of the arriving sound based on that delay. The microphone array frontend can in some embodiments also determine the direct-to-total energy ratio parameter in frequency bands based on the correlation value.
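A minimal sketch of that delay-and-correlation analysis for a single microphone pair (far-field assumption; the names, geometry and ratio heuristic are illustrative, not the method of any particular frontend):

```python
import numpy as np

def estimate_direction(mic_a, mic_b, fs, mic_distance_m, c=343.0):
    """Estimate the arrival angle of a band-limited signal pair by finding the
    inter-microphone delay that maximises the cross-correlation."""
    max_lag = int(np.ceil(mic_distance_m / c * fs))      # physically possible lags
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [float(np.dot(np.roll(mic_b, lag), mic_a)) for lag in lags]
    best_lag = lags[int(np.argmax(corr))]
    # Far-field geometry: delay = d * sin(angle) / c.
    sin_angle = np.clip(best_lag / fs * c / mic_distance_m, -1.0, 1.0)
    angle_deg = np.degrees(np.arcsin(sin_angle))
    # Normalised correlation as a crude proxy for a direct-to-total energy ratio.
    ratio = max(corr) / (np.linalg.norm(mic_a) * np.linalg.norm(mic_b) + 1e-12)
    return angle_deg, float(np.clip(ratio, 0.0, 1.0))
```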
In some embodiments the microphone array frontend can also provide the transport audio signals 102. The process of determining the transport audio signals 102 depends on what type or format microphone array signals are input. For example, if the microphone array signals 100 are from a mobile device, the microphone array frontend 101 can be configured to select a microphone signal from the left side of the device as the left transport signal and another one from the right side of the device as the right transport signal. The microphone array frontend 101 can furthermore in some embodiments also apply any suitable pre-processing steps, such as equalization, microphone noise suppression, wind noise suppression, automatic gain control, beamforming and other spatial filtering, ambient noise suppression, and limiter.
With respect to Figure 2 is shown a further example apparatus configured to determine a parametric spatial audio signal from microphone array signals. In some embodiments a microphone array providing microphone array signals could be a dedicated microphone array, for example a microphone array that provides first-order Ambisonic (FOA) signals at its output. Thus in some embodiments the apparatus comprises an ambisonic decomposer 201 configured to receive the ambisonics signals 200 and generate transport audio signals 102 and spatial metadata 104. In some embodiments, the ambisonic decomposer 201 is configured to determine spatial metadata. The spatial metadata could be determined, for example, using methods similar to Directional Audio Coding (DirAC) such as described in Pulkki, V. (2007). Spatial sound reproduction with directional audio coding. Journal of the Audio Engineering Society, 55(6), 503-516, and the transport audio signals 102 could be cardioid patterns towards left and right directions generated from the ambisonics (FOA) signals 200. Thus, it should be noted that even though the spatial metadata and the transport audio signals can be of similar types, the origin of the signals may be different.
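For illustration, left/right cardioid transport signals of the kind mentioned above can be formed from the horizontal FOA components; the sketch below assumes SN3D-normalised components (W, X, Y), which is an assumption on our part:

```python
import numpy as np

def foa_cardioid(w, x, y, azimuth_deg):
    """Virtual cardioid signal from horizontal first-order Ambisonic components,
    assuming SN3D normalisation: 0.5 * (W + cos(az) * X + sin(az) * Y)."""
    az = np.radians(azimuth_deg)
    return 0.5 * (w + np.cos(az) * x + np.sin(az) * y)

# Left and right transport signals as cardioids towards +90 and -90 degrees azimuth
# (for +/-90 degrees these reduce to 0.5 * (W + Y) and 0.5 * (W - Y) respectively).
# left_transport  = foa_cardioid(w, x, y, +90.0)
# right_transport = foa_cardioid(w, x, y, -90.0)
```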
In some embodiments the ambisonic decomposer 201 can be configured to receive Ambisonics signals 200, for example FOA or higher-order Ambisonics (HOA), and determine spatial metadata 104 such that it enables reconstructing the Ambisonic signals in the decoder based on the transport audio signals. For example, it could be that the Ambisonic signal is conveyed as one or more channel transport audio signals 102, where the first channel is the omnidirectional W component, and the remaining channels are residual signals. The ambisonic decomposer 201 determines these signals by determining prediction coefficients that enable predicting the remaining channels from the W component, and the residual signals are then the prediction error. For those channels for which the residuals are not transmitted, parameters can be estimated that enable processing the omnidirectional component to that residual signal by means of decorrelation. All these prediction and processing parameters may form the spatial metadata 104.
With respect to Figure 3 is shown a further example apparatus configured to determine a parametric spatial audio signal, in this case from channel-based audio signals. In some embodiments the downmixer and metadata determiner 301 is configured to receive the channel-based audio signals 300 and generate transport audio signals 102 and spatial metadata 104. Example channel-based audio signals 300 are surround 5.1 sound or audio object sounds. The downmixer and metadata determiner 301 is configured to generate the spatial metadata 104 by converting the audio object signals and/or audio channels to a FOA format, and then determining the spatial metadata using DirAC or similar means. The downmixer and metadata determiner 301 is configured to generate the transport audio signals 102 using amplitude panning so that the channels/objects beyond ±30 degrees (and the corresponding cone of confusion) are panned to the left and right channels fully, and the channels/objects in between ±30 degrees are panned to both channels depending on their direction.
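A sketch of the ±30-degree amplitude-panning downmix described above; the specific panning law here is a plausible choice of ours, not necessarily the one used:

```python
import numpy as np

def downmix_gains(azimuth_deg: float) -> tuple:
    """Left/right transport gains for a channel or object at a given azimuth.
    Beyond +/-30 degrees the signal goes fully to one transport channel;
    in between, a sine/cosine panning law splits it between both."""
    az = np.clip(azimuth_deg, -30.0, 30.0)
    pan = (az + 30.0) / 60.0                 # 0 = fully right, 1 = fully left
    g_left = np.sin(0.5 * np.pi * pan)
    g_right = np.cos(0.5 * np.pi * pan)
    return float(g_left), float(g_right)

# Example: a centre channel (0 deg) splits equally; a 110 deg surround goes fully left.
print(downmix_gains(0.0))    # approx (0.707, 0.707)
print(downmix_gains(110.0))  # approx (1.0, 0.0)
```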
The codec format input parameter can thus indicate various aspects described in the foregoing. Depending on the use case, it can indicate an entirely different signal format (e.g., stereo transports vs a main-residual signal definition), and/or different characteristics of the same signal format (e.g., stereo transport signals from a downmix vs stereo transport signals from microphones), and/or it can indicate the metadata format. For example, the codec format input parameter can be an index in a predefined list of options defining both the kind of transport audio signals and the spatial metadata being used.
A bitrate input parameter can be configured to indicate the number of bits used to transmit information within a specific time frame. In network transmission, in which audio codecs such as IVAS are also used, the usual convention is to measure bits used per second (shortened to bps). As mentioned above, IVAS is expected to operate between bitrates of 13.2 kbps and 512 kbps. The bitrate may also be constant or vary over time. Thus each transmission frame may have an equal number of bits or a variable number of bits. In IVAS, the transmission frame is expected to be 20 ms and the bitrate is expected to be constant in steady-state situations. Thus, the above bitrates would translate to 264 bits/frame and 10240 bits/frame. It should be noted that this is usually the total bitrate, which is then further distributed for signalling, audio channel coding, and, if present, metadata coding. This further division may again be constant or variable.
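For illustration only, the following Python sketch (the function name is hypothetical and not part of any codec specification) reproduces the bits-per-frame arithmetic above for a constant bitrate and a 20 ms transmission frame.

```python
# Illustrative helper for the bits-per-frame arithmetic above, assuming a
# constant bitrate and a 20 ms transmission frame.
def bits_per_frame(bitrate_bps: int, frame_duration_s: float = 0.02) -> int:
    """Number of bits available for one transmission frame."""
    return round(bitrate_bps * frame_duration_s)

print(bits_per_frame(13_200))   # 264 bits/frame
print(bits_per_frame(512_000))  # 10240 bits/frame
```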
In the following description the bitrate is indicative of the overall quality of the transmitted audio. In general, the overall quality of transmitted audio increases in lossy audio codecs, such as IVAS, when bitrate is increased. Depending on the codec format, this increase in quality may come from increased bits used for audio channel coding which directly decrease presence of coding artifacts, or it may come from increased bits in spatial metadata coding which increase the accuracy of the reproduced spatial scene. Thus, based on bitrate, there is provided an indication of the expected overall quality.
With respect to Figures 4 and 5 is shown an overview of example apparatus showing the signal flow following the generation of the transport audio signals 102 and spatial metadata 104.
Figure 4 shows an example apparatus which comprises an encoder 401 configured to receive the transport audio signals 102 and spatial metadata 104 and encode these to form a bitstream 402. Furthermore the apparatus comprises a decoder 403 which is configured to receive the bitstream 402 and output the spatial audio output 404. In other words the system can be considered to operate in a first mode where processing software external to the encoder 401 determines and provides the transport audio signals 102 and spatial metadata 104. For example, that software could be a microphone array frontend software optimized to process the microphone signals of a specific device.
Figure 5 shows a further example apparatus which comprises the encoder 401 configured to receive the transport audio signals 102 and spatial metadata 104 and encode these to form a bitstream 402. Furthermore the apparatus comprises a decoder 403 which is configured to receive the bitstream 402 and output the spatial audio output 404. Furthermore it is shown that prior to the encoder 401 is an encoder preprocessor 501. The encoder preprocessor 501 is configured to receive the audio signals 500 and generate the transport audio signals 102 and the spatial metadata 104. In other words the system can be considered to operate in a second mode where the encoding system 511 that performs the encoding also performs the encoder preprocessing to obtain the transport audio signals 102 and spatial metadata 104. For example, an IVAS encoder (which comprises both an encoder preprocessor 501 and encoder 401) could be configured to accept 5.1 or Ambisonics input to generate the transport audio signals 102 and spatial metadata 104 that are then encoded.
The encoder 401 thus forms a bitstream 402, which can be stored or transmitted. The bitstream contains the transport audio signals 102 and the spatial metadata 104 in an encoded form. The audio signals can, e.g., be encoded using an IVAS core codec, EVS, or AAC encoder (or any other suitable encoder), and the metadata can, e.g., be encoded using the methods presented in US20210295855, US20220343928, US20220036906, EP4091166 (and/or any other suitable methods). The encoder 401 can furthermore in some embodiments multiplex the encoded audio and encoded spatial metadata to form the bitstream 402.
In addition, the encoder 401 is configured to write or include within the bitstream 402 the input parameters such as codec format input parameter being used. For example, this could be signalled directly with signalling bits which define the codec format input parameter. For example, the codec format can be signalled based on two bits such that a value "00" is a multi-channel format, "01" is MASA format, and "10" is Ambisonics format (these are merely example values).
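A minimal sketch of such two-bit signalling is given below; the bit values and format names follow the example above, while the field layout and the function names are hypothetical and not the actual IVAS bitstream syntax.

```python
# Hypothetical two-bit codec format field, using the example bit values above.
CODEC_FORMATS = {0b00: "multi-channel", 0b01: "MASA", 0b10: "Ambisonics"}
FORMAT_BITS = {name: bits for bits, name in CODEC_FORMATS.items()}

def write_codec_format(fmt: str) -> int:
    """Return the two-bit field value signalling the given codec format."""
    return FORMAT_BITS[fmt]

def read_codec_format(field: int) -> str:
    """Recover the codec format from the two-bit field value."""
    return CODEC_FORMATS[field & 0b11]

assert read_codec_format(write_codec_format("MASA")) == "MASA"
```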
Furthermore, the encoder 401 may also write to the bit stream 402 other input parameters such as the bitrate input parameter. The bitrate input parameter may be constant or vary across time.
In some embodiments, the bitrate input parameter may be inferred in the decoder from the size of the received bitstream frame.
In some embodiments the input parameter information, such as the codec format input parameter and the bitrate input parameter, is provided through a communication channel different from the bitstream 402 that conveys the transport audio signal and spatial metadata.
The bitstream 402 can be forwarded to a decoder 403, which can, for example, be an IVAS decoder (or any other suitable decoder). The decoder 403 decodes the audio signals and the metadata, and renders the spatial audio output 404, which can, e.g., be binaural audio signals. The decoder 403 may be at a different or same device than the encoder 401.
With respect to Figure 6 is shown an example (decoder) apparatus for implementing some embodiments. In the example shown in Figure 6, there is shown a mobile phone 601 coupled via a wired or wireless connection 613 with headphones 619 worn by the user of the mobile phone 601. The wired or wireless connection 613 may enable audio signals 615 to be passed to the headphones 619 and a mono audio signal (or more than one audio signal) 617 to be passed from the headphones 619 (which can in some embodiments further include head orientation/position information metadata from the headphones).
In the following the example device or apparatus is a mobile phone as shown in Figure 6. However the example apparatus or device could also be any other suitable device, such as a tablet, a laptop, a computer, or any teleconference device. The apparatus or device could furthermore be the headphones themselves so that the operations of the exemplified mobile phone 601 are performed by the headphones.
In this example the mobile phone 601 comprises a processor 603. The processor 603 can be configured to execute various program codes such as the methods described herein. The processor 603 is configured to communicate with the headphones 619 using the wired or wireless headphone connection 613. In some embodiments the wired or wireless headphone connection 613 is a Bluetooth 5.3 or Bluetooth LE Audio connection. The connection 613 provides, from the processor 603, a (two-channel) audio signal 615 to be reproduced to the user with the headphones 619.
The headphones 619 could be over-ear headphones as shown in Figure 6, or any other suitable type such as in-ear, or bone-conducting headphones, or any other type of headphones. In some embodiments, the headphones 619 have a head orientation sensor providing head orientation information to the processor 603 via the connection 613. In some embodiments, a head-orientation sensor is separate from the headphones 619 and the data is provided to the processor 603 separately. In further embodiments, the head orientation is tracked by other means, such as using the device 601 camera and a machine-learning based face orientation analysis.
In some embodiments the processor 603 is coupled with a memory 605 having program code 607 providing processing instructions according to the following embodiments. The program code 607 has instructions to process the transport audio signals and spatial metadata received by the transceiver 611 or retrieved from the storage 609 to a rendered form suitable for effective output to the headphones.
The transceiver 611 can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
The apparatus of Figure 6 may in some embodiments be also configured to implement the functions of the microphone array frontend 101, ambisonic decomposer 201, downmixer and metadata determiner 301, encoder 401 and encoder pre-processor 501 as well, for example, when the use case is two-way spatial audio communication with a remote apparatus.
With respect to Figure 7 is shown a schematic view of an example decoder 403 as shown in Figure 5.
The input bitstream 402 is forwarded to a demultiplexer and decoder 701.
The demultiplexer and decoder 701 is configured to demultiplex and decode from the bitstream 402 the transport audio signals 706 and spatial metadata 708 from it. The decoding corresponds to the encoding applied in the encoder 401. It should be noted that the transport audio signals 706 and the spatial metadata 708 typically are not identical to the ones presented earlier as they have been encoded and decoded. Nevertheless, they are referred to using the same term (but with a different reference number) for simplicity.
Furthermore, the demultiplexer and decoder 701 is configured to determine the bitrate 702 and the codec format 704 parameters based on the information within the bitstream 402. In one example, it could be that these are indicated by dedicated bits in the bitstream 402. In other examples, one or both can be detected from the bitstream, or they can be communicated through other channels, for example, via RTP or session control.
The spatial metadata 708, transport audio signals 706, codec format 704, and bitrate 702 are then provided to the spatial synthesizer 703.
The spatial synthesiser 703 is configured to synthesize the spatial audio output 404 in the desired format based on the spatial metadata 708, transport audio signals 706, codec format 704, and bitrate 702. In some examples only one of the codec format 704 and bitrate 702 parameters is used, and in other examples both. This information is used for balancing between using decorrelation and signal mixing in performing the spatial synthesis. The output provided by the spatial synthesiser 703 may, for example, be binaural audio signals.
With respect to Figure 8 an example flow diagram shows the operations of the example decoder shown in Figure 7 according to some embodiments.
Thus the first operation can comprise as shown by 801, obtaining the (encoded spatial audio) bitstream.
Then as shown by 803, the (encoded spatial audio) bitstream is demultiplexed and decoded to generate transport audio signals, spatial metadata and input parameters such as codec format and bitrate.
Following this, as shown by 805, the spatial audio signals are synthesised from the transport audio signals based on the spatial metadata, and input parameters such as the codec format and bitrate. In other words, the transport audio signals are spatially processed based on the codec format, bitrate and spatial metadata to generate the spatial audio signals.
Then as shown by 807 the spatial audio signals are output (for example binaural audio signals are output to the headphones).
With respect to Figure 9 the spatial synthesiser 703 of Figure 7 is shown in further detail. In some embodiments the spatial synthesiser 703 is configured to receive the transport audio signals 706, the spatial metadata 708 and the input parameters as shown by the bitrate 702 and codec format 704.
In some embodiments the spatial synthesiser 703 comprises a forward-filter bank 901. As shown in Figure 9 the transport audio signals 706 are provided to the forward-filter bank 901, which transforms the signals to a time-frequency representation. Any filter bank suitable for audio processing may be utilized, such as the complex-modulated quadrature mirror filter (QMF) bank, or a low-delay variant thereof, or the short-time Fourier transform (STFT). Similarly the forward-filter bank 901 can be implemented by any suitable time-frequency transformer.
In the example described herein, the filter bank is configured to have 60 frequency bins, and sufficient stop-band attenuation to avoid significant aliasing occurring when the frequency bin signals are processed. In this configuration, all frequency bins can be processed independently from each other, except that some frequency bins share the same spatial metadata. For example, the spatial metadata may comprise spatial parameters in a limited number of frequency bands, for example 5, 12, or 24 bands, and each of these bands corresponds to a set of one or more frequency bins provided by the forward filter bank 901. Although this example shows specific numbers of bands, there can be any suitable number of bands, for example 5, 8, 12, 18, or 24 bands.

The output of the forward filter-bank 901 are time-frequency transport signals 906, which are provided to a decorrelator and mixer 907, a processing matrices determiner 905, and an input and target covariance matrix determiner 909. The time-frequency transport signals 906 S(b, t, i) can be denoted as the column vector

x(b, t) = [S(b, t, 1), S(b, t, 2)]^T

where b is the frequency bin index, t is the time-frequency signal temporal index, and i is the channel index. In this example, the transport audio signals have exactly two channels, but could be more than two in other examples.
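For illustration, the following Python sketch shows a forward transform and a bin-to-band grouping of the kind described above; it assumes an STFT stands in for the complex-modulated QMF bank, and the even division of the 60 bins into bands is a hypothetical simplification.

```python
import numpy as np
from scipy.signal import stft

def forward_filter_bank(transport, fs=48000, n_bins=60):
    """Transform (channels, samples) transport signals to a time-frequency
    representation of shape (channels, n_bins, time_indices).
    An STFT stands in for the complex-modulated QMF bank described above."""
    _, _, tf = stft(transport, fs=fs, nperseg=2 * n_bins, axis=-1)
    return tf[:, :n_bins, :]

def bins_to_bands(n_bins=60, n_bands=5):
    """Map each frequency bin to the spatial metadata band it belongs to
    (hypothetical even split of bins over bands)."""
    return np.minimum(np.arange(n_bins) * n_bands // n_bins, n_bands - 1)
```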
In some embodiments the spatial synthesiser 703 comprises an input and target covariance matrix determiner 909. The input and target covariance matrix determiner 909 is configured to receive the spatial metadata 708 and the time-frequency transport signals 906 and is configured to determine covariance matrices 906. The covariance matrices 906 comprise an input covariance matrix representing the time-frequency transport signals 906 and a target covariance matrix representing the desired time-frequency spatial audio signals (that are to be rendered). The input covariance matrix can be determined or measured from the time-frequency transport signals 906, denoted as a column vector x(b, t), where the row indicates the transport signal channel. This is achieved in some embodiments by:

Cx(b, n) = Σ_{t = t1(n)}^{t2(n)} x(b, t) x^H(b, t)

where the superscript H indicates a conjugate transpose and t1(n) and t2(n) are the first and last time-frequency signal temporal indices corresponding to frame n (or sub-frame n in some embodiments). In this example, there are four time indices t at each frame n. As said, the covariance matrix is determined for each bin. In other embodiments, the covariance matrix could also be averaged (or summed) over multiple frequency bins, in a resolution that approximates human hearing resolution, or in the resolution of the determined spatial metadata parameters, or any suitable resolution.
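A minimal sketch of this measurement is shown below, assuming the time-frequency transport signals are stored as an array of shape (channels, bins, time indices); the function name is illustrative only.

```python
import numpy as np

def input_covariance(tf_transport, t1, t2):
    """Measure Cx(b, n) for every bin b by summing x(b, t) x^H(b, t) over the
    time indices t1..t2 (inclusive) of frame n.
    tf_transport: complex array of shape (channels, bins, time_indices).
    Returns an array of shape (bins, channels, channels)."""
    x = tf_transport[:, :, t1:t2 + 1]
    return np.einsum('ibt,jbt->bij', x, np.conj(x))
```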
The target covariance matrix can be determined based on the spatial metadata and the overall signal energy. The overall signal energy E0(b, n) can be obtained as the mean of the diagonal values of Cx(b, n). Then, in some embodiments, the spatial metadata comprises the direction parameters azimuth θ(k, n) and elevation φ(k, n) and a direct-to-total ratio parameter r(k, n). Note that the band index k is the one where the bin b resides. Then, assuming the output is a binaural signal, the target covariance matrix is

Cy(b, n) = E0(b, n) r(k, n) h(b, θ(k, n), φ(k, n)) h^H(b, θ(k, n), φ(k, n)) + E0(b, n) (1 - r(k, n)) Cd(b)

where h(b, θ(k, n), φ(k, n)) is a head-related transfer function column vector for bin b, azimuth θ(k, n) and elevation φ(k, n); it is a column vector of length two with complex values, where the values correspond to the HRTF amplitude and phase for the left and right ears. At high frequencies, the HRTF values may also be real because phase differences are not needed for perceptual reasons at high frequencies. Obtaining HRTFs for a given direction and frequency is known and any suitable method may be applied to obtain them. Cd(b) is the diffuse field binaural covariance matrix, which can be determined for example in an offline stage by taking a spatially uniform set of HRTFs, formulating their covariance matrices independently, and averaging the result.
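A sketch of this target covariance construction is given below; the HRTF lookup is a hypothetical placeholder, the direction and ratio are assumed to be already mapped from band k to bin b, and Cd is assumed to be precomputed as described.

```python
import numpy as np

def target_covariance(Cx, azimuth, elevation, ratio, hrtf_lookup, Cd):
    """Build Cy(b, n) per bin from the direction and ratio parameters, a
    (hypothetical) HRTF lookup, and the diffuse field binaural covariance Cd.
    Cx, Cd: (bins, 2, 2) complex arrays; azimuth, elevation, ratio: scalars."""
    E0 = np.real(np.trace(Cx, axis1=1, axis2=2)) / Cx.shape[1]  # mean of diagonal
    Cy = np.empty_like(Cx)
    for b in range(Cx.shape[0]):
        h = hrtf_lookup(b, azimuth, elevation)   # length-2 complex HRTF vector
        Cy[b] = E0[b] * (ratio * np.outer(h, np.conj(h)) + (1.0 - ratio) * Cd[b])
    return Cy
```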
The input covariance matrix Cx(b,n) and the target covariance matrix Cy (b, n) are then output as covariance matrices 906 to the processing matrices determiner 905.
The above example considered only directions and ratios. The procedure for generating a target covariance matrix has been detailed more broadly in WO2019086757A1, where, in addition to the directions and ratios, the use of spatial coherence parameters was also described, and furthermore, output types other than binaural output were also covered.
In some embodiments the spatial synthesiser 703 comprises a regularization factor determiner 903. The regularization factor determiner 903 is configured to receive the input parameters, such as the codec format 704 and the bitrate 702 and determine a regularization factor R (n) 904. The regularization factor determiner 903 is configured to output the regularization factor R(n) 904 to the processing matrix determiner 905.
In some embodiments the spatial synthesiser 703 comprises a processing matrix determiner 905. The processing matrix determiner 905 is configured to receive the covariance matrices Cx(b, n) and Cy(b, n) 906 and the regularization factor R(n) 904 and determines processing matrices 908 M(b, n) and Mr(b, n). The determination of such processing matrices based on the covariance matrices is based on a method as shown in Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. This method determines mixing matrices for processing input audio signals having a measured covariance matrix Cx(b, n) such that the output audio signals (i.e., the processed input audio signals) attain a determined target covariance matrix Cy(b, n).
This method has been used in various situations, including the generation of binaural and surround loudspeaker signals. In formulating the processing matrices, the method further uses a prototype matrix, which is a matrix that informs the optimization procedure which kind of signals generally are meant for each output (with a constraint that the output must attain the target covariance matrix). When the transport audio signals are two channels (left and right), and the output is binaural signals, the prototype matrix Q can, for example, simply be the identity matrix [1, 0; 0, 1] or, for example, [1, 0.05; 0.05, 1]. When the processing is head-tracked binaural, then when the user is facing rear directions the transport audio signals may be processed (for example right after or before the forward filter-bank 901) so that the left and right channels are mutually swapped.
The embodiments herein implement regularization based on R(n) in determining the processing matrices M(b, n) and Mr(b, n) based on Cx(b, n), Cy(b, n) and Q(n). The derivation of the equations to determine the processing matrices is thoroughly explained in Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. Note that this implementation is an example implementation, and some operations can be performed in other ways to achieve the same or a similar result. In some embodiments the example implementation provided in the appendix of Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411, can form the basis of an implementation for determining the processing matrices, with the difference being that the regularization factor R(n) is employed in place of the fixed number 0.2 in program code line 21 (of the example implementation).
The use of the regularization in the determination of the mixing matrices is here explained for completeness. Note that the notation for the matrices below is not exactly the same as in Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. Also, the frequency bin and frame indices are omitted for brevity.
First, the input and target covariance matrices are decomposed to Cx = Kx Kx^H, and similarly for Cy, using a singular value decomposition (SVD) operation,

[Ux, Sx, Vx] = SVD(Cx)
[Uy, Sy, Vy] = SVD(Cy)

Then, Kx,y = Ux,y sqrt(Sx,y), where the square root is an entry-wise operation and the notation x,y indicates either the input or the target covariance matrix. As implied in Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411, if there were no regularization, a processing matrix that provides the target covariance matrix for the output signal, when applied to the input transport signal, is obtained by

M = Ky P Kx^-1

where the matrix inverse is not regularized, and where P is a unitary matrix that is formulated so that the similarity of the output (i.e. the input signal processed with M) and the prototype signal (i.e. the input signal processed with Q, with potential gain normalization) is maximized, where the constraint is that the output signal must attain the target covariance matrix, i.e., M Cx M^H = Cy. The formulation of matrix P is detailed in the above reference and is not the key focus point of the current embodiments. However, a focus point with respect to the embodiments herein is the matrix inverse Kx^-1, which is regularized. If the matrix is not regularized, there can potentially be infinitely large processing gains in the matrix M. In some embodiments regularization can be implemented in the decomposed domain Kx = Ux Sx,sq, where the diagonal matrix Sx,sq = sqrt(Sx) is normalized so that its diagonal values are bottom limited to a value that is R times the maximum value of Sx,sq, resulting in the diagonal matrix Sx,sq,reg. This means that when R approaches zero, then Sx,sq,reg approaches Sx,sq, and thus the regularization approaches a state of no regularization. Conversely, when R approaches 1, then Sx,sq,reg approaches a matrix with all diagonal values being the same as the largest diagonal value of Sx,sq. This is the maximum regularization. The regularized inverse is formulated by

Kx,reg^-1 = Sx,sq,reg^-1 Ux^H

Then, the regularized inverse is applied when formulating the mixing matrix

M = Ky P Kx,reg^-1

In this case, depending on whether the regularization is active (i.e., the bottom-limiting regularization caused an effect to the matrix Sx,sq), the condition M Cx M^H = Cy may not hold any more. There may be a non-zero missing covariance matrix

Cr = Cy - M Cx M^H

which may be called a residual covariance matrix. Thus, the same method as presented in the reference cited previously may be used to generate another processing matrix Mr that is used to process the decorrelated version of the transport audio signals to attain the properties of that missing portion, in other words, to attain the covariance matrix Cr. This may be implemented by setting Cr as the target covariance matrix, removing the non-diagonal elements of Cx (because the sound is decorrelated) and using that as the input covariance matrix, and otherwise performing the same operations as described to obtain Mr. The value of the regularization factor is not of significant importance at this stage, because the input is incoherent. Thus, for example, a fixed value of 0.2 can be used.
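The following Python sketch illustrates only the regularization step described above; for brevity it replaces the optimal unitary matrix P of the cited method with the identity matrix, so it is an outline rather than the full covariance-domain solution.

```python
import numpy as np

def regularized_mixing_matrices(Cx, Cy, R):
    """Form M (and the residual covariance Cr for the decorrelated path) from
    the input and target covariance matrices with regularization factor R.
    P is set to the identity here; the cited method optimizes P further."""
    Ux, Sx, _ = np.linalg.svd(Cx)
    Uy, Sy, _ = np.linalg.svd(Cy)
    Ky = Uy @ np.diag(np.sqrt(Sy))
    Sx_sq = np.sqrt(Sx)
    Sx_sq_reg = np.maximum(Sx_sq, R * Sx_sq.max())   # bottom-limit the diagonal
    Kx_inv_reg = np.diag(1.0 / Sx_sq_reg) @ Ux.conj().T
    P = np.eye(Cx.shape[0])                          # placeholder for optimal P
    M = Ky @ P @ Kx_inv_reg
    Cr = Cy - M @ Cx @ M.conj().T                    # residual for decorrelated path
    return M, Cr
```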
The processing matrices determiner 905 can then be configured to output the processing matrices M(b,n) and Mr(b,n) 908, which have been regularized based on the regularization factor R(n) 904.
In some embodiments the spatial synthesiser 703 comprises a decorrelator and mixer 907. The decorrelator and mixer 907 is configured to receive the time-frequency transport signals x(b, t) 906 and the processing matrices M(b, n) and Mr(b, n) 908. The decorrelator and mixer 907 is first configured to process the time-frequency transport signals 906 with decorrelators to generate decorrelated signals xD(b, t).
Then the decorrelator and mixer 907 is configured to apply the following mixing procedure to generate the time-frequency spatial audio signals y(b, t) 910 which then can be output by the decorrelator and mixer 907.
y(b, t) = M(b, n) x(b, t) + Mr(b, n) xD(b, t)

In the above processing, although not explicitly written in the equation, the processing matrices may be linearly interpolated between frames n such that at each temporal index of the time-frequency signal the matrices take a step from M(b, n - 1) towards M(b, n). The interpolation rate may be adjusted if an onset is detected (fast interpolation) or not (normal interpolation).
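A sketch of applying the above mixing equation for one frame is shown below; the random-phase decorrelator is a crude stand-in for illustration only, and the frame-wise interpolation of the matrices is omitted.

```python
import numpy as np

def decorrelate_and_mix(x_tf, M, Mr, rng=None):
    """Apply y(b, t) = M(b, n) x(b, t) + Mr(b, n) xD(b, t) for one frame.
    x_tf: (channels, bins, T) complex; M, Mr: (bins, out_channels, channels)."""
    rng = rng or np.random.default_rng(0)
    x_dec = x_tf * np.exp(2j * np.pi * rng.random(x_tf.shape))  # crude decorrelation
    y = np.einsum('boc,cbt->obt', M, x_tf) + np.einsum('boc,cbt->obt', Mr, x_dec)
    return y
```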
In some embodiments the spatial synthesiser 703 comprises an inverse filter-bank 911. The inverse filter-bank 911 applies an inverse transform corresponding to that used by the forward filter-bank 901 to convert the time-frequency spatial audio signals 910 to spatial audio output 404, which is the output of the system.
With respect to the Figure 10 an example flow diagram showing the operations of the spatial synthesizer shown in Figure 9 is shown according to some embodiments.
Thus the first operation can comprise as shown by 1001, obtaining transport audio signals, spatial metadata and the input parameters such as codec format and bitrate.
Then as shown by 1003, the transport audio signals are time-frequency transformed to generate time-frequency transport audio signals.
The time-frequency transport audio signals can then be used to generate the input covariance matrix. Having determined the input covariance matrix, this can be used to determine an overall energy. Then, using the overall energy and the spatial metadata, the target covariance matrix can be generated. The generation of the input and the target covariance matrices is shown by 1007.
Furthermore the regularization factor is determined based on the input parameters such as the codec format and the bitrate as shown by 1005.
Having determined the covariance matrices and the regularization factor the processing matrices are determined based on: the input covariance matrix; target covariance matrix; and regularization factor as shown by 1009.
Then the time-frequency transport signal is decorrelated and mixed, based on the processing matrices to generate a time-frequency spatial audio signals as shown by 1011.
Then the time-frequency spatial audio signals are inverse time-frequency transformed to generate spatial audio signals (for example using the inverse filter bank) as shown by 1013.
Then the spatial audio signals can be output as the spatial audio output as shown by 1015.
With respect to Figure 11 is shown operations of the example regularization factor determiner 903 as shown in Figure 9. In this example the regularization factor determiner is presented in the context of the above example spatial audio codec system. However, a similar process can be applied in any context where a regularization factor or a similar mixing or amplification limiting factor is used to control automatic mixing of input channels to output channels, and a constant value does not produce optimal quality.
As such the initial operations are those of obtaining the input parameters affecting the regularization factor. For example this can be obtaining the codec format, as shown by 1101, which can describe the spatial audio format of the codec that is passed to the renderer. For example, this could be MASA format, premixed multi-channel format, Ambisonics format, etc. Additionally in some embodiments this can also comprise, as shown by 1105, obtaining a bitrate input parameter which describes the bitrate that was used for encoding the bitstream.
After obtaining the parameters, the process of defining or determining a regularization factor starts.
The regularization factor itself can, in some embodiments, be defined as a decimal value in range [0,1] where value 1 restricts the rendering gains the most and value 0 does not restrict them at all.
In some embodiments the codec format parameter is used, as shown by 1103, to select an initial set of available values for the regularization factor to be used in the following operation steps.
For example this could be implemented in the form of a number of tables where each table corresponds to a specific codec format. For example, the table for the MASA format could be [1.0, 0.8, 0.5, 0.2] and a similar table for the Ambisonics format could be [1.0, 0.7, 0.4, 0.2]. With a pre-mixed multi-channel format, a similar example table can be [0.8, 0.5, 0.3, 0.2], as this content is usually professionally produced and the audio is usually less prone to compression artifacts. It should be noted that the use of tables is only one example implementation option and other equivalent or even more suitable ways to implement the selection can be created.
Furthermore as shown by 1107, the set of values is taken, and based on the obtained codec bitrate, an initial value for the regularization factor is selected from the initial set of values. For example, taking the aforementioned MASA format table, specific bitrates are associated with each table value. As there are, in this specific example, four values, an example rule set can be:
- If bitrate < 64 kbps, then use value 1.0
- If bitrate > 64 kbps and bitrate ≤ 96 kbps, then use value 0.8
- If bitrate > 96 kbps and bitrate ≤ 160 kbps, then use value 0.5
- If bitrate > 160 kbps, then use value 0.2
In some embodiments an alternative way of implementing this determination would be to have a defined value in the set of values for each possible bitrate.
Any suitable method can be employed which allows obtaining an initial regularization factor for a given bitrate and codec format combination. The general reasoning with example values presented herein is that when the bitrate increases, it can be expected that the overall audio signal quality also improves. This in turn supports the use of smaller regularization factors (in other words, the rendering gains may be larger) as artifacts are less prone to appear.
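A minimal sketch of this selection, using the example tables and bitrate breakpoints given above (all of which are illustrative tuning values rather than production values), could look as follows.

```python
# Example tables and breakpoints from the text above; purely illustrative.
FORMAT_TABLES = {
    "MASA":          [1.0, 0.8, 0.5, 0.2],
    "Ambisonics":    [1.0, 0.7, 0.4, 0.2],
    "multi-channel": [0.8, 0.5, 0.3, 0.2],
}
BITRATE_BREAKPOINTS_KBPS = [64, 96, 160]

def initial_regularization_factor(codec_format, bitrate_kbps):
    """Select the initial regularization factor from codec format and bitrate."""
    table = FORMAT_TABLES[codec_format]
    index = sum(bitrate_kbps > bp for bp in BITRATE_BREAKPOINTS_KBPS)
    return table[index]

assert initial_regularization_factor("MASA", 48) == 1.0
assert initial_regularization_factor("MASA", 256) == 0.2
```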
Then a further operation is one of outputting the regularization factor R(n) 904 from the regularization factor determiner 903 and passing the factor to the processing matrix determiner 905, which applies the regularization factor R(n) 904 as part of the mixing matrix solution.
It should be noted that all of the example values shown above are so called tuneable values for the system. In practice, there are many interactions between different parts of the codec and the renderer, and the final used values are usually obtained using a set of objective measures and/or subjective listening tests. The given example values are suitable, but they may not produce best quality in all situations and it is very much expected that various different sets of values can be used within the context of these embodiments.
In the above examples a single regularization factor is defined. When considering the matrix solution described in the example embodiments, the regularization is implemented by controlling the entries of a diagonal matrix based on this regularization factor. In some other embodiments, more fine-grained tuning of the regularization could be applied, for example, so that different regularization factors or different regularization schemes are used for different entries of that diagonal matrix. In other words, the regularization information may be not a single value but a more elaborate set of information.
The above description shows the invention in the context of binaural rendering. However, the method presented in Vilkamo, J., Backström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411 provides a general mixing framework and can be applied to any input-output mixing cases. This means that the target output could be loudspeaker, Ambisonics, or any other audio channels in addition to binaural target output. Likewise, the example embodiments describing adapting the regularization factor can be suitably adapted for any other input-output mixing cases. For example, mixing from two transport channels and metadata to 7.1.4 output can use the method described in the above reference and, depending on the bitrate, a different regularization factor should be used to obtain optimal quality. In addition, the output format may also be used to determine the selection of the regularization factor. For example, some artifacts may be more noticeable when using loudspeaker rendering than they are when using binaural rendering. This in turn suggests different regularization factor values for them.
In some alternative embodiments, additional adjustment steps may be performed on the regularization factor after an initial value has been selected based on the input parameters such as codec format and bitrate. These adjustments may change the regularization factor by absolute (e.g., +0.1 or -0.1) or relative (e.g., multiply by 0.9) steps. If such adjustments are done, then an additional step should also be introduced before outputting the final regularization factor. This step is limiting of the regularization factor to a specific range, e.g., 0.2-1.0, to avoid unintended values.
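A sketch of such an adjustment-and-limiting step is shown below; the adjustment amounts follow the examples above, the 0.2-1.0 range is the example limiting range, and the function name and its source of adjustment (e.g., a per-frame quality measure) are assumptions.

```python
def adjust_regularization_factor(factor, absolute_step=0.0, relative_step=1.0,
                                 lower=0.2, upper=1.0):
    """Apply a relative and/or absolute adjustment, then limit to [lower, upper]."""
    adjusted = factor * relative_step + absolute_step
    return min(max(adjusted, lower), upper)

adjust_regularization_factor(0.5, relative_step=0.9)   # ~0.45, within the limits
adjust_regularization_factor(0.15)                     # clamped up to 0.2
```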
The source information for these adjustment steps can have many options.
For example, the source information can be one or more of the following: descriptive metadata of the MASA format or any similar data. For example, the source format can distinguish between channel-based audio and microphone captures, which may use different tables as described above. In addition, the transport channel configuration may be used to perform adjustments as, for example, cardioid pattern transports may be more sensitive to artifacts than omnidirectional type transports.
If an objective measure of the signal quality can be obtained on a per-frame basis, this could be used to decrease or increase the regularization factor, if the quality is known to be high or low respectively.
In some embodiments, the order of the steps for selecting the regularization factor in the presented method could be rearranged with trivial effort, as the resulting regularization factor is simply a combination of multiple factors. Other practical implementations are also equally valid.
In some embodiments, the codec format can be signalled as a combination of bits indicating the kind of spatial metadata and/or transport audio signal used instead of directly signalling a format.
It should be noted that although the above description implies regularization to be a control between mixing original signals and adding decorrelated versions of them to achieve the target covariance matrix, it is not mandatory to have decorrelated signals present at all. The presented method can also be used in a situation where only the mixing of the original signals is used. In this case, the target covariance matrix is not always achieved.
In the above embodiments, the covariance-matrix-based rendering as presented in the reference above was used as an example. However, it should be noted that the presented methods can also be used with other kinds of renderers, as long as there is some kind of regularization for the gains. For example in an alternative embodiment, the "direct" and the "ambient" sound may be rendered separately. In that case, it may be possible that prototype signals are created separately for each of them (using, e.g., decorrelation for the ambient part), and the target energies are also created separately for the direct and the ambient parts. Then, by comparing the energies of the prototype signals and the target energies, gains are computed for the direct and the ambient part rendering. These gains typically need regularization to avoid excessive amplification of noise, as presented above for the main embodiment. These gains can be limited using the present invention. It should be noted that in this embodiment the regularization does not balance between mixing and decorrelation, but it balances between having artefacts from excessive amplification of noise and attenuation of some signals.
With respect to Figure 12 is shown a processing example according to some embodiments. The processing is with a system such as described herein, where an original 5.1 sound is conveyed over two transport audio channels, and at a decoder the audio is rendered to a binaural sound using the transmitted spatial metadata. The spatial metadata and the audio signals are encoded with two different bitrates, 48 kbps and 256 kbps, represented by the two columns 48 kbps 1200 and 256 kbps 1202.
The top row 1201 shows the binaural processed sound, while the middle 1203 and bottom 1205 rows show only the decorrelated path sound; in other words, after formulating M(b, n) and Mr(b, n), the first is set to zero prior to applying them to the signals. The middle row 1203 shows a prior art solution where the regularization factor is not adapted based on the bitrate, but instead is set to a default value 1.0, which is safer in the sense that it minimally amplifies the codec artefacts at rendering. Instead, it uses decorrelation to a larger degree.
The bottom row 1205 shows the processing according to some embodiments where the regularization is adapted based on the bitrate. When the bitrate is higher, the regularization factor drops to 0.2, which causes the system to generate the required incoherence more with the mixing approach, and thus uses less decorrelation. In this case, since the bit rate is high, the codec artefacts are of lower level and their amplification is not audible. On the other hand, better quality is provided, since less decorrelation is applied.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
As used in this application, the term "circuitry" may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry); (b) combinations of hardware circuits and software, such as (as applicable): (i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions; and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or a portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or another computing or network device. The term "non-transitory," as used herein, is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
As used herein, "at least one of the following: <a list of two or more elements>" and "at least one of <a list of two or more elements>" and similar wording, where the list of two or more elements are joined by "and" or "or", mean at least any one of the elements, or at least any two or more of the elements, or at least all the elements.

The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.
Claims (20)
- CLAIMS: 1. A method for generating an audio signal, the method comprising: obtaining an input audio signal comprising at least two audio channels; obtaining at least one input property parameter associated with the input audio signal; determining at least one control parameter based at least on the at least one input property parameter; determining processing parameters based at least on the at least two audio channels, wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter; and generating the audio signal based at least on the at least two audio channels and the processing parameters.
- 2. The method as claimed in claim 1, wherein the input audio signal is a spatial audio signal, the spatial audio signal further comprising at least one spatial parameter associated with the at least two audio channels.
- 3. The method as claimed in claim 2, wherein determining processing parameters based at least on the at least two audio channels further comprises determining the processing parameters further based at least on the at least one spatial parameter.
- 4. The method as claimed in any of claims 1 to 3, wherein the at least one input property parameter comprises at least one of: a bitrate associated with the at least one input audio signal; a codec format indicating an origin of the at least one input audio signal; and a configuration of the at least one input audio signal.
- 5. The method as claimed in claim 4, wherein the configuration of the at least one input audio signal comprises at least one of: a source format indicating an original or input format from which the input audio signal was created; transport channel description; number of channels; channel distance; and channel angle.
- 6. The method as claimed in any of claims 4 or 5, wherein the codec format indicating the origin of the at least one input audio signal comprises at least one of: a metadata assisted spatial audio stream origin codec format value indicating the at least one input audio signal represents a metadata assisted spatial audio signal; a multichannel audio stream origin codec format value indicating the at least one input audio signal represents a multichannel audio signal; an audio object codec format value indicating the input audio signal represents audio objects; and an Ambisonic format value indicating the input audio signal represents an Ambisonic audio signal.
- 7. The method as claimed in any of claims 4 to 6 when dependent on claim 2, wherein the at least one spatial parameter comprises at least one of: information that describes an organization of sound in space with respect to the at least one input audio signal, the information comprising: a direction parameter configured to indicate from where the sound arrives; and a ratio parameter configured to indicate a portion of the sound that arrives from that direction; information that describes properties of an original multi-channel or multi-object sound scene, the information comprising at least one of: channel levels; object levels; inter-channel correlations; inter-object correlations; and object directions; processing coefficients related to obtaining spatial audio format signals based at least on the at least one input audio signal.
- 8. The method as claimed in any of claims 4 to 7, wherein the codec format indicating the origin of the at least one input audio signal comprises an indication that the origin is unknown or undefined.
- 9. The method as claimed in any of claims 1 to 8, wherein determining the at least one control parameter based at least on the at least one input property parameter comprises: determining a first set of control parameter values based on a first one of the at least one input property parameter; and selecting one control parameter value from the first set of control parameter values based on a second one of the at least one input property parameter.
- 10. The method as claimed in claim 9 when dependent on claim 4, wherein the first one of the at least one input property parameter is the codec format and the second one of the at least one input property parameter is the bitrate.
- 11. The method as claimed in any of claims 1 to 10, wherein determining processing parameters based at least on the at least two audio channels wherein the determining of the processing parameters is controlled based at least partially on the at least one control parameter comprises: generating processing matrices based at least on the at least two audio channels wherein the generating of the processing matrices is controlled based at least partially on the at least one control parameter.
- 12. The method as claimed in claim 11, wherein generating the processing matrices controlled based at least partially on the control parameter comprises regularizing the generation of the processing matrices based at least on the control parameter.
- 13. The method as claimed in claim 12, wherein generating the audio signal based at least on the at least two audio channels and the processing parameters comprises processing the at least two audio channels using the regularized processing matrices to generate the audio signal.
- 14. The method as claimed in any of claims 12 or 13, wherein generating the processing matrices controlled based at least partially on the at least one control parameter comprises generating entries of a diagonal matrix based at least on at least two control parameter values.
- 15. The method as claimed in any of claims 1 to 14, wherein obtaining at least one input property parameter associated with the input audio signal comprises deducing from the input audio signal the at least one input property parameter.
- 16. The method as claimed in any of claims 1 to 14, wherein obtaining at least one input property parameter associated with the input audio signal comprises receiving the at least one input property parameter as part of the input audio signal or as configuration information.
- 17. The method as claimed in claim 16, wherein receiving the at least one input property parameter as part of the input audio signal further comprises: receiving an encoded input property parameter as part of the input audio signal; and decoding the encoded input property parameter to obtain the input property parameter.
- 18. The method as claimed in any of claims 1 to 17, wherein generating the audio signal based at least on the at least two audio channels and the processing parameters comprises generating at least one of: a binaural audio signal; and a multichannel audio signal.
- 19. An apparatus comprising means for performing the method of any of claims 1 to 18.
- 20. A computer program comprising instructions, which, when executed by an apparatus, cause the apparatus to perform the method of any of claims 1 to 18.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2301773.4A GB2626953A (en) | 2023-02-08 | 2023-02-08 | Audio rendering of spatial audio |
PCT/EP2024/050876 WO2024165271A1 (en) | 2023-02-08 | 2024-01-16 | Audio rendering of spatial audio |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2301773.4A GB2626953A (en) | 2023-02-08 | 2023-02-08 | Audio rendering of spatial audio |
Publications (1)
Publication Number | Publication Date |
---|---|
GB2626953A true GB2626953A (en) | 2024-08-14 |
Family
ID=89661084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2301773.4A Pending GB2626953A (en) | 2023-02-08 | 2023-02-08 | Audio rendering of spatial audio |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB2626953A (en) |
WO (1) | WO2024165271A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015011057A1 (en) * | 2013-07-22 | 2015-01-29 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | In an reduction of comb filter artifacts in multi-channel downmix with adaptive phase alignment |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2556093A (en) | 2016-11-18 | 2018-05-23 | Nokia Technologies Oy | Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices |
GB201718341D0 (en) | 2017-11-06 | 2017-12-20 | Nokia Technologies Oy | Determination of targeted spatial audio parameters and associated spatial audio playback |
GB2575305A (en) | 2018-07-05 | 2020-01-08 | Nokia Technologies Oy | Determination of spatial audio parameter encoding and associated decoding |
GB2577698A (en) | 2018-10-02 | 2020-04-08 | Nokia Technologies Oy | Selection of quantisation schemes for spatial audio parameter encoding |
GB2587196A (en) | 2019-09-13 | 2021-03-24 | Nokia Technologies Oy | Determination of spatial audio parameter encoding and associated decoding |
GB2592896A (en) | 2020-01-13 | 2021-09-15 | Nokia Technologies Oy | Spatial audio parameter encoding and associated decoding |
GB2595475A (en) * | 2020-05-27 | 2021-12-01 | Nokia Technologies Oy | Spatial audio representation and rendering |
GB2605190A (en) * | 2021-03-26 | 2022-09-28 | Nokia Technologies Oy | Interactive audio rendering of a spatial stream |
-
2023
- 2023-02-08 GB GB2301773.4A patent/GB2626953A/en active Pending
-
2024
- 2024-01-16 WO PCT/EP2024/050876 patent/WO2024165271A1/en unknown
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015011057A1 (en) * | 2013-07-22 | 2015-01-29 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | In an reduction of comb filter artifacts in multi-channel downmix with adaptive phase alignment |
Non-Patent Citations (1)
Title |
---|
VILKAMO JUHA ET AL, "Optimal Mixing Matrices and Usage of Decorrelators in Spatial Audio Processing", CONFERENCE: 45TH INTERNATIONAL CONFERENCE: APPLICATIONS OF TIME FREQUENCY PROCESSING IN AUDIO; MARCH 2012, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, (20120301), XP040574500 * |
Also Published As
Publication number | Publication date |
---|---|
WO2024165271A1 (en) | 2024-08-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112567763B (en) | Apparatus and method for audio signal processing | |
US20230199417A1 (en) | Spatial Audio Representation and Rendering | |
CN112219236A (en) | Spatial audio parameters and associated spatial audio playback | |
CN113597776B (en) | Wind noise reduction in parametric audio | |
CN112567765B (en) | Spatial audio capture, transmission and reproduction | |
CN114846542A (en) | Combination of spatial audio parameters | |
US20240089692A1 (en) | Spatial Audio Representation and Rendering | |
CN114556973A (en) | Spatial audio representation and rendering | |
GB2576769A (en) | Spatial parameter signalling | |
GB2584837A (en) | Sound field related rendering | |
GB2582748A (en) | Sound field related rendering | |
GB2626953A (en) | Audio rendering of spatial audio | |
CN112133316A (en) | Spatial audio representation and rendering | |
EP4358081A2 (en) | Generating parametric spatial audio representations | |
EP4312439A1 (en) | Pair direction selection based on dominant audio direction | |
WO2024115045A1 (en) | Binaural audio rendering of spatial audio | |
WO2023156176A1 (en) | Parametric spatial audio rendering | |
US20240274137A1 (en) | Parametric spatial audio rendering | |
KR20240152893A (en) | Parametric spatial audio rendering | |
US20240236611A9 (en) | Generating Parametric Spatial Audio Representations | |
GB2627482A (en) | Diffuse-preserving merging of MASA and ISM metadata | |
GB2620593A (en) | Transporting audio signals inside spatial audio signal | |
WO2023126573A1 (en) | Apparatus, methods and computer programs for enabling rendering of spatial audio | |
KR20240142538A (en) | Device, method, and computer program for enabling rendering of spatial audio |