EP3766262B1

EP3766262B1 - Spatial audio parameter smoothing

Info

Publication number: EP3766262B1
Application number: EP19767481.5A
Authority: EP
Inventors: Mikko-Ville Laitinen
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2018-03-13
Filing date: 2019-03-07
Publication date: 2022-11-23
Anticipated expiration: 2039-03-07
Also published as: GB201803993D0; EP3766262A1; GB2571949A; EP3766262A4; WO2019175472A1

Description

Field

The present application relates to apparatus and methods for temporal spatial audio parameter smoothing. This includes but is not exclusively for sound reproduction systems and sound reproduction methods producing multichannel audio channel outputs.

Background

Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratio parameters expressing relative energies of the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
The directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the proportion of the sound energy that is directional) can be also utilized as the spatial metadata for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata. The stereo signal could be encoded, for example, with an AAC encoder. A decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
Patent application publication US2009067634 discloses modifying spatial audio parameters associated with one or more audio objects of a stereo or multichannel audio signal to provide remixing capabilities.
Patent application publication WO2014162171 discloses a spatial audio analyser configured to determine an audio source with a location associated with a visual image element, and an audio processor arranged to change an audio characteristic of the audio source in response to a control input.
Patent application publication EP2942981 discloses an audio signal processing system for consistent acoustic scene reproduction based on informed spatial filtering.
Patent application publication US2013329922 discloses using vector base amplitude panning (VBAP) for playing back an object's audio and using the positioning of sound reproduction devices and the object' s location information to determine which sound reproduction devices are used for playing back the object's audio.
Patent application publication JP2015080119 discloses a method of improving the degree of freedom when calculating a panning coefficient for sound image localisation within a three-dimensional space.

Summary

There is provided according to a first aspect an apparatus for spatial audio signal processing, as set forth in independent claim 1.
According to a second aspect there is provided a method for spatial audio signal processing, as set forth in independent claim 7.
Preferred embodiments are set forth in the dependent claims.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures

For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:

Figure 1 shows schematically an example system utilizing embodiments described hereafter;
Figure 2 shows a flow diagram of the operation of the example system shown in Figure 1;
Figure 3 shows schematically an example analysis processor shown in Figure 1 according to some embodiments;
Figure 4 shows a flow diagram of the operation of the example analysis processor shown in Figure 3;
Figure 5 shows schematically an example synthesis processor shown in Figure 1 according to some embodiments;
Figure 6 shows a flow diagram of the operation of the example synthesis processor shown in Figure 5;
Figures 7a and 7b show schematically example spatial synthesizers shown in Figure 5 according to some embodiments;
Figures 8a and 8b show flow diagrams of the operation of the spatial synthesizers shown in Figures 7a and 7b;
Figure 9 shows schematically an example smoothing coefficients determiner shown in Figures 7a and 7b according to some embodiments;
Figure 10 shows a flow diagram of the operation of the smoothing coefficients determiner shown in Figure 9;
Figure 11 shows example graphs demonstrating the effect of implementing the embodiments;
Figure 12 shows an example implementation of the embodiments as shown in Figures 1 to 10;
Figure 13 shows a further example implementation of the embodiments as shown in Figures 1 to 10; and
Figure 14 shows schematically an example device suitable for implementing the embodiments shown.

Embodiments of the Application

The following describes in further detail suitable apparatus and possible mechanisms for the provision of adaptive parameter smoothing.
In the following embodiments and examples the spatial sound source is a microphone array. Alternatively the spatial sound source may be a 5.1 multichannel or other format multi-channel mix or Ambisonics signals.
As described above parametric spatial audio capture methods can be used to enable a perceptually accurate spatial sound reproduction. Parametric spatial audio capture refers to adaptive DSP-driven audio capture methods covering 1) analysing perceptually relevant parameters in frequency bands, for example, the directionality of the propagating sound at the recording position, and 2) reproducing spatial sound in a perceptual sense at the rendering side according to the estimated spatial parameters. The reproduction can be, for example, for headphones or multichannel loudspeaker setups. By estimating and reproducing the perceptually relevant spatial properties (parameters) of the sound field, a spatial perception similar to that which would occur in the original sound field can be reproduced. As the result, the listener can perceive the multitude of sources, their directions and distances, as well as properties of the surrounding physical space, among the other spatial sound features, as if the listener was in the position of the capture device.
Parametric spatial audio capture methods (SPAC) may employ these determined parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands to describe the perceptual spatial properties of the captured sound at the position of the microphone array and may use these parameters in synthesis of the spatial sound. As the spatial properties are estimated from the sound field, they can significantly fluctuate over time and frequency, e.g., due to the reverberation and/or multiple simultaneous sound sources. Hence, parametric spatial audio processing methods typically utilize smoothing in the synthesis, in order to avoid possible artefacts caused by rapidly fluctuating parameters (these artefacts are typically referred to as "musical noise").
Similar parametrization may also be used for the compression of spatial audio, e.g., from 5.1 multichannel signals. In this case, the parameters are estimated from the input loudspeaker signals. Nevertheless, the parameters typically fluctuate also in this case. Hence, the temporal smoothing is needed also with loudspeaker input.
Typically, the spatial parameters are determined in the time-frequency domain, i.e., each parameter value is associated with a certain frequency band and temporal frame. Examples of possible spatial parameters include (but are not limited to):

Direction and direct-to-total energy ratio
Direction and diffuseness
Inter-channel level difference, inter-channel phase difference, and inter-channel coherence

These parameters are determined in time-frequency domain. It should be noted that also other parametrizations may be used than those presented above. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). Moreover, it should be noted that, in such methods, the audio and the parameters may be processed and/or transmitted/stored in between the analysis and the synthesis.
The parametric spatial audio processing methods are often based on analysing the direction of arrival (and other spatial parameters) in frequency bands. If there would be a single sound source in anechoic condition, the direction would stably point to the sound source at all frequencies. However, in typical acoustic environments, the microphones capture also other sounds than just the sound source, such as reverberation and ambient sounds. Moreover, there may be multiple simultaneous sources. As a result, the estimated directions typically significantly fluctuate over time and the estimates are different at different frequency bands.
Parametric spatial audio processing methods such as employed in embodiments as described in further detail hereafter synthesize the spatial sound based on the analysed parameters (such as the aforementioned direction) and related audio signals (e.g., 2 captured microphone signals). In the case of loudspeaker rendering, vector base amplitude panning (VBAP) is a common method to position the audio to the analysed direction. VBAP computes gains for a subset of loudspeakers based on the direction, and the audio signal is multiplied with these gains and fed to these loudspeakers.
The concept as discussed hereafter proposes apparatus and methods to adapt the smoothing needed in the synthesis of spatial sound in parametric spatial audio processing in order to have quality audio output with different types of sound scenes.
Furthermore the embodiments as described hereafter relate to parametric spatial audio processing where a solution is provided to improve the temporal smoothing processing needed in the synthesis of spatial audio in the aforementioned parametric spatial audio processing and where the temporal smoothing is improved by analysing the required amount of smoothing adaptively. The analysis being related to the stability of the direction-related parameter(s) and producing a measure of directional stability and determining the time coefficients of the temporal smoothing based on the measure of directional stability.
The direction-related parameter may as described in further detail in the embodiments hereafter refer to a direction.
In some embodiments the amount of smoothing can be analysed using the direct-to-total energy ratio. The value of the energy ratio is monitored over time, and where it is constantly high, the time coefficient of the smoothing can be set smaller (less smoothing applied). Correspondingly, where the energy ratio is not constantly high, the time coefficient can be set to a default value (more smoothing applied).
A block diagram of an example system for implementing some embodiments is shown in Figure 1.
Figure 1 shows an example capture device 101. The capture device may be a VR capture device, a mobile phone or any other suitable electronic apparatus comprising one or more microphone arrays. The capture device 101 thus in some embodiments comprises microphones 100₁, 100_2. The microphone audio signals 102 captured by the microphones 100₁, 100₂ may be stored and later processed, or directly processed.
An analysis processor 103 may receive the microphone audio signals 102 from the capture device 101. The analysis processor 103 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). In some embodiments the capture device 101 and the analysis processor 103 are implemented on the same apparatus or device.
Based on the microphone-array signals, the analysis processor creates a data stream 104. The data stream may comprise transport audio signals and spatial metadata (e.g., directions and energy ratios in frequency bands). The data stream 104 may be transmitted or stored for example within some storage 105 such as memory, or alternatively directly processed in the same device.
A synthesis processor 107 may receive the data stream 104. The synthesis processor 107 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, FPGAs or ASICs. Based on the data stream (the transport audio signals and the metadata). The synthesis processor can be configured to produce output audio signals. For headphone listening, the output signals can be binaural signals 109. For loudspeaker rendering, the output signals can be multi-channel signals.
The headphones 111 or other playback apparatus may be configured to receive the output of the synthesis processor 107 and output the audio signals in a format suitable for listening.
With respect to Figure 2 is shown an example summary of the operations of the apparatus shown in Figure 1.
The initial operation is the capture (or otherwise input) of the audio signals as shown in Figure 2 by step 201.
Having captured the audio signals they are analysed to generate the data stream as shown in Figure 2 by step 203.
The data stream may then be transmitted and received (or stored and retrieved) as shown in Figure 2 by step 205.
Having received or retrieved the data stream, the output may be synthesized based at least on the data stream as shown in Figure 2 by step 207.
The synthesized audio signal output signals may then be output to a suitable output such as headphones as shown in Figure 2 by step 209.
With respect to Figure 3 an example analysis processor 103, such as shown in Figure 1, is presented. The input to the analysis processor 103 are the microphone array signals 102.
A transport audio signal generator 301 may be configured to receive the microphone array signals 102 and create the transport audio signals. In some embodiments the transport audio signals are selected from the microphone array signals. In some embodiments the microphone array signals may be downmixed to generate the transport audio signals. In some embodiments the transport audio signals may be obtained by processing the microphone array signals.
The transport audio signal generator 301 may be configured to generate any suitable number of transport audio signals (or channels), for example in some embodiments the transport audio signal generator 301 is configured to generate two transport audio signals. In some embodiments the transport audio signal generator 301 is further configured to compress the audio signals. For example in some embodiments the audio signals may be compressed using an advanced audio coding (AAC) or enhanced voice services (EVS) compression coding.
Furthermore the analysis processor 103 comprises a spatial analyser 303. The spatial analyser 303 is also configured to receive the microphone array signals 103 and generate metadata 304 based on a spatial analysis of the microphone array signals. The spatial analyser 303 may be configured to determine any suitable spatial metadata parameter. For example spatial metadata parameters determined include (but are not limited to): Direction and direct-to-total energy ratio; Direction and diffuseness; Inter-channel level difference, inter-channel phase difference, and inter-channel coherence. In some embodiments these parameters are determined in time-frequency domain. It should be noted that also other parametrizations may be used than those presented above. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). In the example shown in Figure 3 the metadata 304 comprises directions 306 and energy ratios 308. In some embodiments the metadata may be compressed and/or quantized. The analysis processor 103 may furthermore comprise a multiplexer or mux 305 which is configured to receive the metadata 304 and the transport audio signals 302 and generate a combined data stream 104. The combination may be any suitable combination.
It should be noted that in some embodiments the input to the analysis processor 103 can also be other types of audio signals, such as multichannel loudspeaker signals, audio objects, or Ambisonic signals. Furthermore, the exact implementation of the analysis processor may be any suitable implementation (as indicated above a computer running suitable software, a FPGAs or ASICs etc). caused to produce the transport audio signals and the spatial metadata in the time-frequency domain.
With respect to Figure 4 is shown an example summary of the operations of the analysis processor shown in Figure 3.
The initial operation is receiving the microphone array audio signals as shown in Figure 4 by step 401.
Having received the microphone audio signals they are analysed to generate the transport audio signals (for example selection, downmixing or other processing) as shown in Figure 4 by step 403.
Furthermore the microphone audio signals are spatially analysed to generate the metadata, for example the directions and energy ratios as shown in Figure 4 by step 405.
The metadata and the transport audio signals may then be combined to generate the data stream as shown in Figure 4 by step 407.
With respect to Figure 5 an example synthesis processor 107 (as shown in Figure 1) according to some embodiments is shown.
A demultiplexer, or demux, 501 is configured to receive the data stream 104 and caused to demultiplex the data stream into a transport audio signals 502 and metadata 504. In some embodiments, where the transport audio signals were compressed within the analysis processor, the demultiplexer is furthermore caused to decode the audio signals. The metadata in some embodiments is with the time-frequency domain, and comprises parameters such as directions θ(k,n) 506 and direct-to-total energy ratios r(k,n) 508, where k is the frequency band index and n the temporal frame index. In the embodiments where the metadata is compressed/dequantized then the demultiplexed data is furthermore decompressed/dequantized to attempt to regenerate the originally determined parameters.
A spatial synthesizer 503 is configured to receive the transport audio signals 502 and the metadata and caused to generate the multichannel output signals 510 such as the binaural output signals 109 shown in Figure 1.
With respect to Figure 6 is shown an example summary of the operations of the synthesis processor shown in Figure 5.
The initial operation is receiving the data stream as shown in Figure 6 by step 601.
Having received the data stream, it is demultiplexed and optionally decoded to generate the transport audio signals and the metadata as shown in Figure 6 by step 603.
The multichannel (binaural or otherwise) output signals may then be synthesized from the transport audio signals and the metadata as shown in Figure 6 by step 605.
The multichannel (binaural or otherwise) output signals may then be output as shown in Figure 6 by step 607.
With respect to Figures 7a and 7b example spatial synthesizers 503 (as shown in Figure 5) according to some embodiments is shown.
The input to the spatial synthesizer 503 is in some embodiments the transport audio signals 502 and furthermore the metadata 504 (which may include the energy ratios 508 and the directions 506).
In some embodiments the transport audio signals 502 are transformed to the time-frequency domain using a suitable transformer. For example as shown in Figure 7a and 7b a short-time Fourier transformer (STFT) 701 is configured to apply a short-time Fourier transform to the transport audio signals to generate suitable time-frequency domain audio signals S_i (k, n) 700. In some embodiments any suitable time-frequency transformer may be used, for example a quadrature mirror filterbank (QMF).
A divider 705 may receive the time-frequency domain audio signals S_i (k, n) 700 and the energy ratios 508 and divide the time-frequency domain audio signals S_i (k, n) 700 to ambient and direct parts using the energy ratio r(k, n) 508.
With respect to Figure 7a a smoothing coefficients determiner 703 may also receive the time-frequency domain audio signals S_i (k, n) 700 and the energy ratios 508 and determine suitable smoothing coefficients 706. Figure 7b differs with respect to the example spatial synthesizer shown in Figure 7a in that the smoothing coefficients determiner in Figure 7b is caused to receive the time-frequency domain audio signals S_i (k, n) 700 and the directions 506.
The smoothing coefficients determiner 703 may be configured to adaptively determine the smoothing coefficient(s) α(k, n).
A panning gain determiner 715 may be configured to receive the directions 506 and based on the output speaker/headphone configuration and the directions determine suitable panning gains 708. The amplitude panning gains may be computed using any suitable manner, for example vector base amplitude panning (VBAP) based on the received direction θ(k,n).
In some embodiments a panning gain smoother 717 is configured to receive the panning gains 708 and the smoothing coefficients 706 and based on these determine suitable smoothed panning gains 710. There are many ways to perform the smoothing. In some embodiments a first-order smoothing may be used. Thus for example the panning gain smoother 717 is configured to receive a current gain g(k, n), smoothing coefficients α(k, n) and also knowledge on the last smoothed gain g'(k, n - 1) and determine a smoothed gain by: $g' (k, n) = α (k, n) g (k, n) + (1 - α (k, n)) g' (k, n - 1)$
In other words the current gain is multiplied with the smoothing coefficient α and the previous smoothed gain is multiplied with (1 - α).
In other embodiments any suitable smoothing may be applied. The smoothing 'filter' may therefore be multiple order and similarly the smoothing coefficient α(k, n) may be a vector value. The actual value(s) of α may depend on the filterbank, and typically is frequency-dependent (values may include, e.g., 0.1). In general, the larger the value is, the less smoothing is applied.
A decorrelator 707 is configured to receive the ambient audio signal part 702 and process it to make it perceived as being surrounding, for example by decorrelating and spreading the ambient audio signal part 702 across the audio scene.
A positioner 709 is configured to receive the direct audio signal part 704 and the smoothed panning gains 710 and position the direct audio signal part 704 using a suitable positioning, for example using the smoothed panning gains and an amplitude panning operation.
A merger 711 or other suitable combiner is configured to receive the spread ambient signal part from the decorrelator 707 and the positioned direct audio signals part from the positioner 709 and combine or merge these resulting audio signals.
An inverse short-time Fourier transformer (Inverse STFT) 713 is configured to receive the combined audio signals and apply an inverse short-time Fourier transform (or other suitable frequency to time domain transform) to generate the multi-channel audio signals 510 which may be passed to a suitable output device such as the headphones or multi-channel loudspeaker setup.
In the examples and embodiments described in detail herein, for example as described with respect to the examples shown in Figures 7a and 7b the panning gains are determined directly from the direction metadata, and the "direct sound" is also positioned with these gains after smoothing.
In some embodiments there may be implementations where the panning gains are not directly determined from the direction metadata, but instead determined indirectly. Thus the smoothing of these gains as described above may be applied to any suitably generated gains.
Thus for example, in some embodiments the directions may be used (together with the energy ratios and transport audio signals) to determine a target energy distribution of the output multichannel signals. The target energy distribution may be compared to the energy distribution of the transport audio signals (or to the energy distribution of intermediate signals obtained from the transport audio signals by mixing). Panning gains (or any gains that position audio) may be obtained as a ratio of these values and the "Smoother" 717 may be applied to these gains.
In summary the method of generating the panning gains may be one of many optional methods which is then smoothed according to methods as described herein.
With respect to Figures 8a and 8b the operations of the spatial synthesizer 503 shown in Figures 7a and 7b according to some embodiments are described in further detail.
The spatial synthesizer in some embodiments is configured to receive the transport audio signals as shown in Figures 8a and 8b by step 801.
The spatial synthesizer in some embodiments is furthermore configured to receive the energy ratios as shown in Figures 8a and 8b by step 803.
The spatial synthesizer in some embodiments is also configured to receive the directions as shown in Figures 8a and 8b by step 805.
The received transport audio signals are in some embodiments converted into a time-frequency domain form, for example by applying a suitable time-frequency domain transform to the transport audio signals as shown in Figures 8a and 8b by step 807.
The time-frequency domain audio signals may then in some embodiments be divided into ambient and direct parts (based on the energy ratios) as shown in Figures 8a and 8b by step 813.
Furthermore the smoothing coefficients may be determined based on the energy ratios and the time-frequency domain audio signals as shown in Figure 8a by step 811. Alternatively the smoothing coefficients may be determined based on the directions and the time-frequency domain audio signals as shown in Figure 8b by step 851.
Panning gains may be determined based on the received directions as shown in Figures 8a and 8b by step 809.
A series of smoothed panning gains may be determined based on the determined panning gains and the smoothing coefficients as shown in Figures 8a and 8b by step 817.
The ambient audio signal part may be decorrelated as shown in Figures 8a and 8b by step 815.
The positional component of the audio signals may then be determined based on the smoothed panning gains and the direct audio signal part as shown in Figures 8a and 8b by step 819. In such embodiments a positional component of the audio signals or positioned audio signal can be a number of audio signals which are combined to produce a virtual sound source positioned in a three dimensional space.
The positional component of the audio signals and the decorrelated ambient audio signal may then be combined or merged as shown in Figures 8a and 8b by step 821.
Furthermore the combined audio signals may then be inverse time-frequency domain transformed to generate the multichannel audio signals in a suitable format to be output as shown in Figures 8a and 8b by step 823.
With respect to Figure 9 an example smoothing coefficients determiner 703 (such as shown in Figures 7a and 7b) according to some embodiments is shown. The smoothing coefficients determiner 703 is configured to generate values which may be used to smooth the panning gains in order to avoid "musical noise" artefacts.
The inputs to the smoothing coefficients determiner 703 are shown as the time-frequency domain audio signals 700 and the energy ratios 508.
An energy estimator 901 may be configured to receive the time-frequency domain audio signals 700 and determine the energy E(k, n) 902 of the audio signals. For example in some embodiments the energy estimator 901 is configured to generate the energy based on: $E (k, n) = {\sum_{i} S_{i} (k, n)}^{2}$
A direction smoothness estimator 903 is configured to estimate a direction smoothness ξ(k, n). In some embodiments, such as shown in the examples in Figures 7a and 8a, this direction smoothness may be estimated or determined from the energy ratios r(k, n) 508. For example the direction smoothness estimator may be configured to calculate the direction smoothness by the following: $ξ (k, n) = r {(k, n)}^{p}$
where p is a constant (e.g., p = 8).
In some embodiments, such as shown in the examples in Figures 7b and 8b, the direction smoothness value ξ(k, n) 904 can be estimated by using or calculating the fluctuation of the direction value. In such embodiments a circular variance of the directions θ(k, n) is determined and this is used as the basis of a direction smoothness. In other embodiments any suitable analysis of the temporal fluctuation of the directions may be used to determine the direction smoothness estimate.
An average direction smoothness estimator 905 is configured to receive the energy 902 and direction smoothness estimates 904 and determine an average over time (and in some embodiments over frequency). The average direction smoothness estimator may therefore be configured to perform a first-order smoothing based on a current estimate ξ(k, n) a previous average value ξ'(k, n - 1) and smoothing coefficient β to generate an averaged direction smoothness estimate ξ'(k, n) 906, for example by the following: $ξ' (k, n) = β ξ (k, n) + (1 - β) ξ' (k, n - 1)$
where β may be fixed, or it can be adaptively selected, e.g., by $β (k, n) = {\begin{matrix} α_{1}, & ξ (k, n) > ξ' (k, n - 1) \\ α_{2}, & ξ (k, n) \leq ξ' (k, n - 1) \end{matrix}$
where α ₁ may, e.g., be 0.001 and α ₂ may, e.g., be 0.5. This adaptive selection attempts to find whether the energy ratio is constantly large, and hence temporal smoothing can be safely made shorter without artefacts. Moreover, the direction smoothness estimates ξ may be weighted by the energy E while performing the temporal smoothing.
A direction smoothness estimates to smoothing coefficients converter may receive the averaged direction smoothness estimate ξ'(k, n) 906 and generate the smoothing coefficients α(k, n). For example in some embodiments the averaged direction smoothness estimates ξ'(k, n) are converted to the actual smoothing coefficients by the following $α (k, n) = ξ' (k, n) α_{fast} (k) + (1 - ξ' (k, n)) α_{slow} (k)$
The values of α_fast may, e.g., include 0.4, and the values of α_slow may, e.g., include 0.1. These fast and slow coefficients may depend on the actual implementation and may be frequency-dependent.
In some embodiments the smoothing coefficients may be a vector instead of a single value. This for example may occur when the smoothing is other than a first-order IIR smoothing. These embodiments may therefore may implement "fast settings" and "slow settings" which are interpolated based on the "averaged direction smoothness estimates". In such embodiments these "settings" may depend on the implementation, for example whether it is a single value or a vector of values.
The smoothing coefficients α(k, n)706 may then be output.
With respect to Figure 10 an example flow diagram showing the operation of the smoothing coefficients determiner according to some embodiments is shown.
The time-frequency domain audio signals are received as shown in Figure 10 by step 1001.
Furthermore the energy ratios are received as shown in Figure 10 by step 1003.
The estimate of the energy (of the audio signals) may be determined based on the time-frequency domain audio signals as shown in Figure 10 by step 1005.
Furthermore the estimate of the direction smoothness based on the energy ratios (or based on any other suitable parameter such as an analysis of the directions) is shown in Figure 10 by step 1007.
The estimate of the average direction smoothness is then determined based on the energy estimate and the direction smoothness estimates as shown in Figure 10 by step 1009.
Then the average direction smoothness estimate is converted to smoothness coefficients as shown in Figure 10 by step 1011.
The smoothness coefficients are then output as shown in Figure 10 by step 1013.
In the above examples the smoothness coefficients were determined based on the use of the spatial metadata (for example the metadata generated in the analysis processor such as found within a spatial audio capture (SPAC) which generates directions and direct-to-total energy ratios). It should be noted that the above methods can be modified without inventive skill to be used with any method utilizing similar parameters. For example in context of Direct Audio Coding (DirAC), the direction smoothness can be determined as $ξ (k, n) = {(1 - ψ (k, n))}^{p}$
where ψ(k, n) is diffuseness.
Some of the advantages of the proposed embodiments is that significant amount of smoothing can applied with typical sound scenes, and thus musical noise artefacts are avoided. Furthermore when the sound scene does not require so much smoothing, the amount of smoothing applied can be reduced and thus the reproduction can react faster to changes in the sound field.
The effect of the proposed embodiments can be seen in Figure 11 which shows three graph traces showing a reference audio signal, a reproduction using a fixed smoothing coefficient and a reproduction using an adaptive smoothing coefficient. In this example the sound scene contains two sources located in different directions in anechoic conditions. The sound was rendered to a multichannel setup.
The reference graph trace 1101 shows the signal of one output channel of the audio signal and shows the first source 1103 but not the other source.
In the fixed smoothing example graph trace 1111, excessive temporal smoothing causes the sound to be reproduced partially still from the first direction (shown from 1.4 seconds to 1.8 seconds) even though the sound source is not present anymore for that direction. As a result, the reproduction is perceived to slowly react to changes in the direction.
On the contrary, the adaptive smoothing example 1121 having analysed that the directions are stable, and there is not as much need for temporal smoothing is configured to set the smoothing to a faster mode, and the sound source is not reproduced from the wrong channel. In such a manner the reproduction is perceived to react fast to changes in the direction.
With respect to Figure 12 an example implementation of some further embodiments is shown. In these embodiments the implementation can be by software implemented, for example on a mobile phone (or a computer) 1200. The software running inside the mobile phone 1200 may be configured to receive an encoded bitstream (it may have been e.g., transmitted real-time or it may have been stored to the device). The bitstream can also be any other suitable bitstream. A demultiplexer 1203 (DEMUX) is configured to demultiplex the bitstream into an audio bitstream 1204 and to a spatial metadata bitstream 1206.
An enhanced voice standard (EVS) or encoded bitstream decoder 1205 is configured to extract the transport audio signals 1206 from the audio bitstream (or any decoder that corresponds to the utilized codec).
A metadata decoder 1207 is used to decompress the spatial metadata 1208, for example comprising the directions 1210 and energy ratios 1212.
The spatial synthesiser 1209 (similar to the spatial synthesizer in the embodiments above) is configured to receive transport audio signals 1206 and the metadata 1208 and output multichannel loudspeaker signals 1211 that may be reproduced using a multichannel loudspeaker setup. In some embodiments the spatial synthesizer 1209 is configured to generate binaural audio signals that may be reproduced using headphones.
With respect to Figure 13 a further example implementation is shown according to some further embodiments. In this example implementation a microphone array 1301, for example part of a mobile phone, is configured to capture audio signals 1302. The captured microphone array audio signals 1302 may be processed by software 1300 running inside the mobile phone. In the software 1300 may be an analysis processor configured to analyse the captured microphone array signals 1302 in a manner such as described with respect to Figures 1 and 3 and is configured to generate spatial metadata 1304 (comprising directions 1306 and energy ratios 1308). Furthermore there may comprise a synthesis processor 1305 which is configured to receive the spatial metadata 1304 from the analysis processor 1303 along with the captured microphone array audio signals 1302 (or alternatively a subset or a processed set of the microphone signals). The synthesis processor 1305 may operate in a manner similar to the synthesis processor 1305 as described with respect to Figures 1 and 5, 7a, 7b and 9. Depending on the configuration, the synthesis processor 1305 may be configured to output a multichannel audio signal (for example a binaural signal or a surround loudspeaker signal or Ambisonic signal). The multichannel audio signals 1307 can therefore be listened to directly (when fed to headphones or loudspeakers, or reproduced using an Ambisonic renderer), stored (with any suitable codec) and/or transmitted to a remote device.
Although the codec use implementation is described above it is noted that some embodiments may be used with any suitable codec that utilizes smoothing and can provide information on the smoothness of the direction-related parameters.
Similarly as depicted in the example implementation in Figure 13, the proposed method can also be applied in any kind of spatial audio processing which operates in time-frequency domain.
With respect to Figure 14 an example electronic device which may be used as the analysis or synthesis processor is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Claims

An apparatus for spatial audio signal processing, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
receive at least one audio signal;

determine an energy ratio being a spatial parameter associated with the at least one audio signal

determine a direction smoothness parameter by applying an exponent to the energy ratio;

convert the direction smoothness parameter to an adaptive smoothing parameter;

determine panning gains for applying to a first part of the at least one audio signal;

apply the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and

apply the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
The apparatus as claimed in claim 1, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
apply a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and

combine the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
The apparatus as claimed in any of claims 1 and 2, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
estimate an energy of the at least one audio signal; and

average the direction smoothness parameter based on the energy of the at least one audio signal, wherein the apparatus caused to convert the direction smoothness parameter to the adaptive smoothing parameter is caused to convert the averaged direction smoothness parameter to the adaptive smoothing parameter.
The apparatus as claimed in claim 3, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
determine an averaging parameter based on the energy of the at least one audio signal; and

apply the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to a previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
The apparatus as claimed in any of claims 1 to 4, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
receive the at least one audio signal from at least one microphone within a microphone array;

determine the at least one audio signal from multichannel loudspeaker audio signals; and

receive the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the spatial parameter.
The apparatus as claimed in any of claims 1 to 5, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to: analyse the at least one audio signal to determine the spatial parameter; and
receive the spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the spatial parameter.
A method for spatial audio signal processing comprising:
receiving at least one audio signal;

determining an energy ratio being a spatial parameter associated with the at least one audio signal

determining a direction smoothness parameter by applying an exponent to the energy ratio;

converting the direction smoothness parameter to an adaptive smoothing parameter;

determining panning gains for applying to a first part of the at least one audio signals;

applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and

applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
The method as claimed in Claim 7, further comprising:
applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and

combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
The method as claimed in Claims 7 and 8, wherein converting the direction smoothness parameter to an adaptive smoothing parameter comprises:
estimating an energy of the at least one audio signal;

averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein converting the direction smoothness parameter to the adaptive smoothing parameter comprises converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
The method as claimed in Claim 9, wherein averaging the direction smoothness parameter based on the energy of the at least one audio signal comprises:
determining an averaging parameter based on the energy of the at least one audio signal; and

applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to a previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.