WO2019175472A1 - Temporal spatial audio parameter smoothing - Google Patents

Temporal spatial audio parameter smoothing

Info

Publication number
WO2019175472A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
audio signal
spatial
direction smoothness
caused
Prior art date
Application number
PCT/FI2019/050178
Other languages
French (fr)
Inventor
Mikko-Ville Laitinen
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP19767481.5A priority Critical patent/EP3766262B1/en
Publication of WO2019175472A1 publication Critical patent/WO2019175472A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/008 Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 Pre-filtering or post-filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S1/007 Two-channel systems in which the audio signals are in digital form
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • H04S7/304 For headphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001 Codebooks
    • G10L2019/0012 Smoothing of parameters of the decoder interpolation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03 Application of parametric coding in stereophonic audio systems

Definitions

  • the present application relates to apparatus and methods for temporal spatial audio parameter smoothing. This includes, but is not limited to, sound reproduction systems and sound reproduction methods producing multichannel audio outputs.
  • Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters.
  • For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratio parameters expressing relative energies of the directional and non-directional parts of the captured sound in frequency bands.
  • These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array.
  • These parameters can be utilized in the synthesis of spatial sound accordingly: binaurally for headphones, for loudspeakers, or for other formats, such as Ambisonics.
  • the directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the proportion of the sound energy that is directional) can be also utilized as the spatial metadata for an audio codec.
  • these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • the stereo signal could be encoded, for example, with an AAC encoder.
  • a decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
  • an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive at least one audio signal; determine at least one spatial parameter associated with the at least one audio signal; generate an adaptive smoothing parameter based on the at least one spatial parameter; determine panning gains for applying to a first part of the at least one audio signal; apply the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and apply the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
  • the apparatus may be further caused to: apply a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combine the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
  • the apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter may be caused to: estimate a direction smoothness parameter based on the at least one spatial parameter; and convert the direction smoothness parameter to the adaptive smoothing parameter.
  • the apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter may be caused to: estimate an energy of the at least one audio signal; average the direction smoothness parameter based on the energy of the at least one audio signal, wherein the apparatus caused to convert the direction smoothness parameter to the adaptive smoothing parameter is caused to convert the averaged direction smoothness parameter to the adaptive smoothing parameter.
  • the apparatus caused to average the direction smoothness parameter based on the energy of the at least one audio signal may be caused to: determine an averaging parameter based on the energy of the at least one audio signal; and apply the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
  • the at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by applying an exponent to the energy ratio.
  • the at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by analysing a motion of the direction.
  • the at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
  • the apparatus caused to receive at least one audio signal may be caused to perform at least one of: receive the at least one audio signal from at least one microphone within a microphone array; determine the at least one audio signal from multichannel loudspeaker audio signals; and receive the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • the apparatus caused to determine at least one spatial parameter associated with the at least one audio signal may be caused to perform at least one of: analyse the at least one audio signal to determine the at least one spatial parameter; and receive the at least one spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • a method for spatial audio signal processing comprising: receiving at least one audio signal; determining at least one spatial parameter associated with the at least one audio signal; generating an adaptive smoothing parameter based on the at least one spatial parameter; determining panning gains for applying to a first part of the at least one audio signal; applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
  • the method may further comprise: applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
  • Generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise: estimating a direction smoothness parameter based on the at least one spatial parameter; and converting the direction smoothness parameter to the adaptive smoothing parameter.
  • Generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise: estimating an energy of the at least one audio signal; averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein converting the direction smoothness parameter to the adaptive smoothing parameter may comprise converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
  • Averaging the direction smoothness parameter based on the energy of the at least one audio signal may comprise: determining an averaging parameter based on the energy of the at least one audio signal; and applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
  • the at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by applying an exponent to the energy ratio.
  • the at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by analysing a motion of the direction.
  • the at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
  • Receiving at least one audio signal may comprise performing at least one of: receiving the at least one audio signal from at least one microphone within a microphone array; determining the at least one audio signal from multichannel loudspeaker audio signals; and receiving the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • Determining at least one spatial parameter associated with the at least one audio signal may comprise at least one of: analysing the at least one audio signal to determine the at least one spatial parameter; and receiving the at least one spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • an apparatus for spatial audio signal processing comprising means for: receiving at least one audio signal; determining at least one spatial parameter associated with the at least one audio signal; generating an adaptive smoothing parameter based on the at least one spatial parameter; determining panning gains for applying to a first part of the at least one audio signal; applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
  • the apparatus may further comprise means for: applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
  • the means for generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise means for: estimating a direction smoothness parameter based on the at least one spatial parameter; and converting the direction smoothness parameter to the adaptive smoothing parameter.
  • the means for generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise means for: estimating an energy of the at least one audio signal; averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein the means for converting the direction smoothness parameter to the adaptive smoothing parameter may comprise means for converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
  • the means for averaging the direction smoothness parameter based on the energy of the at least one audio signal may comprise means for: determining an averaging parameter based on the energy of the at least one audio signal; and applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
  • the at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by applying an exponent to the energy ratio.
  • the at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by analysing a motion of the direction.
  • the at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
  • the means for receiving at least one audio signal may comprise means for at least one of: receiving the at least one audio signal from at least one microphone within a microphone array; determining the at least one audio signal from multichannel loudspeaker audio signals; and receiving the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • the means for determining at least one spatial parameter associated with the at least one audio signal may comprise means for at least one of: analysing the at least one audio signal to determine the at least one spatial parameter; and receiving the at least one spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
  • An apparatus comprising means for performing the actions of the method as described above.
  • An apparatus configured to perform the actions of the method as described above.
  • a computer program comprising program instructions for causing a computer to perform the method as described above.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.

Summary of the Figures
  • Figure 1 shows schematically an example system utilizing embodiments described hereafter;
  • Figure 2 shows a flow diagram of the operation of the example system shown in Figure 1;
  • Figure 3 shows schematically an example analysis processor shown in Figure 1 according to some embodiments;
  • Figure 4 shows a flow diagram of the operation of the example analysis processor shown in Figure 3;
  • Figure 5 shows schematically an example synthesis processor shown in Figure 1 according to some embodiments;
  • Figure 6 shows a flow diagram of the operation of the example synthesis processor shown in Figure 5;
  • Figures 7a and 7b show schematically example spatial synthesizers shown in Figure 5 according to some embodiments;
  • Figures 8a and 8b show flow diagrams of the operation of the spatial synthesizers shown in Figures 7a and 7b;
  • Figure 9 shows schematically an example smoothing coefficients determiner shown in Figures 7a and 7b according to some embodiments;
  • Figure 10 shows a flow diagram of the operation of the smoothing coefficients determiner shown in Figure 9;
  • Figure 11 shows example graphs demonstrating the effect of implementing the embodiments;
  • Figure 12 shows an example implementation of the embodiments as shown in Figures 1 to 10;
  • Figure 13 shows a further example implementation of the embodiments as shown in Figures 1 to 10; and
  • Figure 14 shows schematically an example device suitable for implementing the embodiments shown.
  • the spatial sound source is a microphone array.
  • the spatial sound source may be a 5.1 multichannel or other format multi-channel mix or Ambisonics signals.
  • Parametric spatial audio capture refers to adaptive DSP-driven audio capture methods covering 1) analysing perceptually relevant parameters in frequency bands, for example, the directionality of the propagating sound at the recording position, and 2) reproducing spatial sound in a perceptual sense at the rendering side according to the estimated spatial parameters.
  • the reproduction can be, for example, for headphones or multichannel loudspeaker setups.
  • Parametric spatial audio capture methods may employ these determined parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands to describe the perceptual spatial properties of the captured sound at the position of the microphone array and may use these parameters in synthesis of the spatial sound.
  • As the spatial properties are estimated from the sound field, they can significantly fluctuate over time and frequency, e.g., due to the reverberation and/or multiple simultaneous sound sources.
  • parametric spatial audio processing methods typically utilize smoothing in the synthesis, in order to avoid possible artefacts caused by rapidly fluctuating parameters (these artefacts are typically referred to as "musical noise").
  • Similar parametrization may also be used for the compression of spatial audio, e.g., from 5.1 multichannel signals.
  • the parameters are estimated from the input loudspeaker signals. Nevertheless, the parameters typically fluctuate in this case as well. Hence, temporal smoothing is also needed with loudspeaker input.
  • the spatial parameters are determined in the time-frequency domain, i.e., each parameter value is associated with a certain frequency band and temporal frame.
  • Examples of possible spatial parameters include (but are not limited to): direction and direct-to-total energy ratio; direction and diffuseness; and inter-channel level difference, inter-channel phase difference, and inter-channel coherence.
  • the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels).
  • the audio and the parameters may be processed and/or transmitted/stored in between the analysis and the synthesis.
  • the parametric spatial audio processing methods are often based on analysing the direction of arrival (and other spatial parameters) in frequency bands. If there were a single sound source in anechoic conditions, the direction would point stably to the sound source at all frequencies. However, in typical acoustic environments, the microphones also capture sounds other than the sound source, such as reverberation and ambient sounds. Moreover, there may be multiple simultaneous sources. As a result, the estimated directions typically fluctuate significantly over time and the estimates are different at different frequency bands.
  • Parametric spatial audio processing methods such as employed in embodiments as described in further detail hereafter synthesize the spatial sound based on the analysed parameters (such as the aforementioned direction) and related audio signals (e.g., 2 captured microphone signals).
  • VBAP computes gains for a subset of loudspeakers based on the direction, and the audio signal is multiplied with these gains and fed to these loudspeakers.
  • the concept as discussed hereafter proposes apparatus and methods to adapt the smoothing needed in the synthesis of spatial sound in parametric spatial audio processing, in order to produce quality audio output with different types of sound scenes.
  • the embodiments as described hereafter relate to parametric spatial audio processing, and provide a solution that improves the temporal smoothing needed in the synthesis of spatial audio by adaptively analysing the required amount of smoothing.
  • the analysis relates to the stability of the direction-related parameter(s): it produces a measure of directional stability, and the time coefficients of the temporal smoothing are determined based on this measure.
  • the direction-related parameter may as described in further detail in the embodiments hereafter refer to a direction.
  • the amount of smoothing can be analysed using the direct-to-total energy ratio.
  • the value of the energy ratio is monitored over time, and where it is constantly high, the time coefficient of the smoothing can be set smaller (less smoothing applied).
  • otherwise, the time coefficient can be set to a default value (more smoothing applied).
  • A block diagram of an example system for implementing some embodiments is shown in Figure 1.
  • Figure 1 shows an example capture device 101.
  • the capture device may be a VR capture device, a mobile phone or any other suitable electronic apparatus comprising one or more microphone arrays.
  • the capture device 101 thus in some embodiments comprises microphones 100₁, 100₂.
  • the microphone audio signals 102 captured by the microphones 100₁, 100₂ may be stored and later processed, or directly processed.
  • An analysis processor 103 may receive the microphone audio signals 102 from the capture device 101.
  • the analysis processor 103 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs).
  • FPGAs field programmable gate arrays
  • ASICs application specific integrated circuits
  • the capture device 101 and the analysis processor 103 are implemented on the same apparatus or device.
  • Based on the microphone-array signals, the analysis processor creates a data stream 104.
  • the data stream may comprise transport audio signals and spatial metadata (e.g., directions and energy ratios in frequency bands).
  • the data stream 104 may be transmitted or stored for example within some storage 105 such as memory, or alternatively directly processed in the same device.
  • a synthesis processor 107 may receive the data stream 104.
  • the synthesis processor 107 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the synthesis processor can be configured to produce output audio signals.
  • the output signals can be binaural signals 109.
  • the output signals can be multi-channel signals.
  • the headphones 111 or other playback apparatus may be configured to receive the output of the synthesis processor 107 and output the audio signals in a format suitable for listening.
  • the initial operation is the capture (or otherwise input) of the audio signals as shown in Figure 2 by step 201.
  • the data stream may then be transmitted and received (or stored and retrieved) as shown in Figure 2 by step 205. Having received or retrieved the data stream, the output may be synthesized based at least on the data stream as shown in Figure 2 by step 207.
  • the synthesized audio signal output signals may then be output to a suitable output such as headphones as shown in Figure 2 by step 209.
  • Figure 3 shows an example analysis processor 103 such as that shown in Figure 1.
  • the input to the analysis processor 103 are the microphone array signals 102.
  • a transport audio signal generator 301 may be configured to receive the microphone array signals 102 and create the transport audio signals.
  • the transport audio signals are selected from the microphone array signals.
  • the microphone array signals may be downmixed to generate the transport audio signals.
  • the transport audio signals may be obtained by processing the microphone array signals.
  • the transport audio signal generator 301 may be configured to generate any suitable number of transport audio signals (or channels), for example in some embodiments the transport audio signal generator 301 is configured to generate two transport audio signals. In some embodiments the transport audio signal generator 301 is further configured to compress the audio signals. For example in some embodiments the audio signals may be compressed using advanced audio coding (AAC) or enhanced voice services (EVS) compression.
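  • By way of illustration only, a minimal sketch of such a transport audio signal generator follows; the channel grouping, the simple group-averaging downmix and the function name are assumptions for the example, not taken from the text above.

```python
import numpy as np

def make_transport_signals(mic_signals, n_transport=2):
    """Sketch of a transport audio signal generator: select the microphone
    channels as-is when there are few enough, otherwise downmix groups of
    channels by averaging. Any processing yielding the desired channel
    count would serve equally well."""
    mic = np.asarray(mic_signals)  # shape: (channels, samples)
    if mic.shape[0] <= n_transport:
        return mic  # selection: use the microphone signals directly
    # Simple downmix: split the channels into n_transport groups and average.
    groups = np.array_split(np.arange(mic.shape[0]), n_transport)
    return np.stack([mic[g].mean(axis=0) for g in groups])
```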
  • the analysis processor 103 comprises a spatial analyser 303.
  • the spatial analyser 303 is also configured to receive the microphone array signals 102 and generate metadata 304 based on a spatial analysis of the microphone array signals.
  • the spatial analyser 303 may be configured to determine any suitable spatial metadata parameter.
  • spatial metadata parameters determined include (but are not limited to): direction and direct-to-total energy ratio; direction and diffuseness; and inter-channel level difference, inter-channel phase difference, and inter-channel coherence. In some embodiments these parameters are determined in the time-frequency domain. It should be noted that parametrizations other than those presented above may also be used.
  • the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels).
  • the metadata 304 comprises directions 306 and energy ratios 308.
  • the metadata may be compressed and/or quantized.
  • the analysis processor 103 may furthermore comprise a multiplexer or mux 305 which is configured to receive the metadata 304 and the transport audio signals 302 and generate a combined data stream 104.
  • the combination may be any suitable combination.
  • the input to the analysis processor 103 can also be other types of audio signals, such as multichannel loudspeaker signals, audio objects, or Ambisonic signals.
  • the exact implementation of the analysis processor may be any suitable implementation (as indicated above, a computer running suitable software, an FPGA or ASIC, etc.) caused to produce the transport audio signals and the spatial metadata in the time-frequency domain.
  • the initial operation is receiving the microphone array audio signals as shown in Figure 4 by step 401.
  • the microphone audio signals are spatially analysed to generate the metadata, for example the directions and energy ratios, as shown in Figure 4 by step 405.
  • the metadata and the transport audio signals may then be combined to generate the data stream as shown in Figure 4 by step 407.
  • An example synthesis processor 107 (as shown in Figure 1) according to some embodiments is shown in Figure 5.
  • a demultiplexer, or demux, 501 is configured to receive the data stream 104 and caused to demultiplex the data stream into transport audio signals 502 and metadata 504.
  • the demultiplexer is furthermore caused to decode the audio signals.
  • the metadata in some embodiments is in the time-frequency domain, and comprises parameters such as directions θ(k, n) 506 and direct-to-total energy ratios r(k, n) 508, where k is the frequency band index and n the temporal frame index.
  • the demultiplexed data is furthermore decompressed/dequantized to attempt to regenerate the originally determined parameters.
  • a spatial synthesizer 503 is configured to receive the transport audio signals 502 and the metadata and caused to generate the multichannel output signals 510 such as the binaural output signals 109 shown in Figure 1 .
  • the initial operation is receiving the data stream as shown in Figure 6 by step 601.
  • Having received the data stream it is demultiplexed and optionally decoded to generate the transport audio signals and the metadata as shown in Figure 6 by step 603.
  • the multichannel (binaural or otherwise) output signals may then be synthesized from the transport audio signals and the metadata as shown in Figure 6 by step 605.
  • the multichannel (binaural or otherwise) output signals may then be output as shown in Figure 6 by step 607.
  • the input to the spatial synthesizer 503 is in some embodiments the transport audio signals 502 and furthermore the metadata 504 (which may include the energy ratios 508 and the directions 506).
  • the transport audio signals 502 are transformed to the time-frequency domain using a suitable transformer.
  • a short-time Fourier transformer (STFT) 701 is configured to apply a short-time Fourier transform to the transport audio signals to generate suitable time-frequency domain audio signals Sᵢ(k, n) 700.
  • any suitable time-frequency transformer may be used, for example a quadrature mirror filterbank (QMF).
  • a divider 705 may receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the energy ratios 508 and divide the time-frequency domain audio signals Sᵢ(k, n) 700 into ambient and direct parts using the energy ratio r(k, n) 508.
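  • A minimal sketch of one way such a divider can operate follows, weighting the signals by √r and √(1 − r) so that the energies of the direct and ambient parts sum to the energy of the input; this particular weighting is an assumption for the example, as the text above does not specify the exact division.

```python
import numpy as np

def divide_direct_ambient(S, r):
    """Split time-frequency signals S[i, k, n] into direct and ambient parts
    using the direct-to-total energy ratio r[k, n] in [0, 1].

    The sqrt weighting preserves energy per bin:
    |direct|^2 + |ambient|^2 == |S|^2 (assumed split, see lead-in)."""
    w_direct = np.sqrt(r)[np.newaxis, :, :]         # directional weight
    w_ambient = np.sqrt(1.0 - r)[np.newaxis, :, :]  # non-directional weight
    return w_direct * S, w_ambient * S
```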
  • a smoothing coefficients determiner 703 may also receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the energy ratios 508 and determine suitable smoothing coefficients 706.
  • Figure 7b differs with respect to the example spatial synthesizer shown in Figure 7a in that the smoothing coefficients determiner in Figure 7b is caused to receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the directions 506.
  • the smoothing coefficients determiner 703 may be configured to adaptively determine the smoothing coefficient(s) α(k, n).
  • a panning gain determiner 715 may be configured to receive the directions 506 and based on the output speaker/headphone configuration and the directions determine suitable panning gains 708.
  • the amplitude panning gains may be computed in any suitable manner, for example vector base amplitude panning (VBAP) based on the received direction θ(k, n).
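  • For illustration, a minimal two-dimensional (horizontal-plane) VBAP sketch follows; the function name and the pairwise search are a standard textbook formulation of VBAP (Pulkki, 1997) rather than the exact implementation referred to above.

```python
import numpy as np

def vbap_2d_gains(azimuth_deg, speaker_azimuths_deg):
    """Compute 2-D VBAP gains for a source at `azimuth_deg` given loudspeaker
    azimuths: find the loudspeaker pair whose arc contains the target
    direction, solve L g = p for the gains, and normalise for energy."""
    az = np.deg2rad(azimuth_deg)
    spk = np.deg2rad(np.asarray(speaker_azimuths_deg, dtype=float))
    p = np.array([np.cos(az), np.sin(az)])  # target unit vector
    gains = np.zeros(len(spk))
    order = np.argsort(spk)                 # adjacent pairs around the circle
    for a, b in zip(order, np.roll(order, -1)):
        L = np.array([[np.cos(spk[a]), np.cos(spk[b])],
                      [np.sin(spk[a]), np.sin(spk[b])]])
        try:
            g = np.linalg.solve(L, p)
        except np.linalg.LinAlgError:
            continue                        # collinear pair, skip
        if np.all(g >= -1e-9):              # target lies between this pair
            gains[[a, b]] = np.maximum(g, 0.0)
            break
    norm = np.linalg.norm(gains)
    return gains / norm if norm > 0 else gains
```

  • For example, vbap_2d_gains(30.0, [-30, 30, 110, -110]) yields unity gain for the loudspeaker at 30° and zero elsewhere, while intermediate directions spread the gain over the adjacent pair.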
  • a panning gain smoother 717 is configured to receive the panning gains 708 and the smoothing coefficients 706 and based on these determine suitable smoothed panning gains 710. There are many ways to perform the smoothing. In some embodiments a first-order smoothing may be used. Thus for example the panning gain smoother 717 is configured to receive a current gain g(k, n), smoothing coefficients α(k, n) and also knowledge of the last smoothed gain g'(k, n − 1) and determine a smoothed gain by:
  • g'(k, n) = α(k, n) g(k, n) + (1 − α(k, n)) g'(k, n − 1)
  • the current gain is multiplied with the smoothing coefficient α and the previous smoothed gain is multiplied with (1 − α).
  • any suitable smoothing may be applied.
  • the smoothing ‘filter’ may therefore be of multiple order and similarly the smoothing coefficient α(k, n) may be a vector value.
  • the actual value(s) of α may depend on the filterbank, and are typically frequency-dependent (values may include, e.g., 0.1). In general, the larger the value is, the less smoothing is applied.
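  • A minimal sketch of the first-order smoother described above, with assumed array shapes (gains per band and output channel, one coefficient per band):

```python
import numpy as np

def smooth_gains(g, alpha, g_prev):
    """First-order smoothing of panning gains per band k and frame n:
        g'(k, n) = alpha(k, n) * g(k, n) + (1 - alpha(k, n)) * g'(k, n - 1)
    g, g_prev: arrays of shape (bands, channels); alpha: shape (bands,).
    A larger alpha means less smoothing (faster reaction to new directions)."""
    a = alpha[:, np.newaxis]
    return a * g + (1.0 - a) * g_prev
```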
  • a decorrelator 707 is configured to receive the ambient audio signal part 702 and process it to make it perceived as being surrounding, for example by decorrelating and spreading the ambient audio signal part 702 across the audio scene.
  • a positioner 709 is configured to receive the direct audio signal part 704 and the smoothed panning gains 710 and position the direct audio signal part 704 using a suitable positioning, for example using the smoothed panning gains and an amplitude panning operation.
  • a merger 711 or other suitable combiner is configured to receive the spread ambient signal part from the decorrelator 707 and the positioned direct audio signals part from the positioner 709 and combine or merge these resulting audio signals.
  • An inverse short-time Fourier transformer (Inverse STFT) 713 is configured to receive the combined audio signals and apply an inverse short-time Fourier transform (or other suitable frequency to time domain transform) to generate the multi-channel audio signals 510 which may be passed to a suitable output device such as the headphones or multi-channel loudspeaker setup.
  • a suitable output device such as the headphones or multi-channel loudspeaker setup.
  • the panning gains are determined directly from the direction metadata, and the "direct sound" is also positioned with these gains after smoothing.
  • the panning gains are not directly determined from the direction metadata, but instead determined indirectly.
  • the smoothing of these gains as described above may be applied to any suitably generated gains.
  • the directions may be used (together with the energy ratios and transport audio signals) to determine a target energy distribution of the output multichannel signals.
  • the target energy distribution may be compared to the energy distribution of the transport audio signals (or to the energy distribution of intermediate signals obtained from the transport audio signals by mixing).
  • the "Smoother" 717 may be applied to these panning gains, or to any gains that position audio.
  • the method of generating the panning gains may thus be one of many optional methods, and the resulting gains are then smoothed according to the methods described herein.
  • the spatial synthesizer in some embodiments is configured to receive the transport audio signals as shown in Figures 8a and 8b by step 801.
  • the spatial synthesizer in some embodiments is furthermore configured to receive the energy ratios as shown in Figures 8a and 8b by step 803.
  • the spatial synthesizer in some embodiments is also configured to receive the directions as shown in Figures 8a and 8b by step 805.
  • the received transport audio signals are in some embodiments converted into a time-frequency domain form, for example by applying a suitable time-frequency domain transform to the transport audio signals as shown in Figures 8a and 8b by step 807.
  • the time-frequency domain audio signals may then in some embodiments be divided into ambient and direct parts (based on the energy ratios) as shown in Figures 8a and 8b by step 813.
  • smoothing coefficients may be determined based on the energy ratios and the time-frequency domain audio signals as shown in Figure 8a by step 811.
  • the smoothing coefficients may be determined based on the directions and the time-frequency domain audio signals as shown in Figure 8b by step 851.
  • Panning gains may be determined based on the received directions as shown in Figures 8a and 8b by step 809.
  • a series of smoothed panning gains may be determined based on the determined panning gains and the smoothing coefficients as shown in Figures 8a and 8b by step 817.
  • the ambient audio signal part may be decorrelated as shown in Figures 8a and 8b by step 815.
  • the positional component of the audio signals may then be determined based on the smoothed panning gains and the direct audio signal part as shown in Figures 8a and 8b by step 819.
  • a positional component of the audio signals or positioned audio signal can be a number of audio signals which are combined to produce a virtual sound source positioned in a three dimensional space.
  • the positional component of the audio signals and the decorrelated ambient audio signal may then be combined or merged as shown in Figures 8a and 8b by step 821.
  • the combined audio signals may then be inverse time-frequency domain transformed to generate the multichannel audio signals in a suitable format to be output as shown in Figures 8a and 8b by step 823.
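  • The flow of Figures 8a and 8b can be summarised by the following sketch, which reuses the helper sketches above (the sqrt-ratio division and vbap_2d_gains); the evenly spaced loudspeaker layout, the mono downmix of the direct part, the random-phase decorrelator and the fixed (non-adaptive) smoothing coefficient are all simplifying assumptions for the example.

```python
import numpy as np
from scipy.signal import stft, istft

def spatial_synthesis(transport, theta_deg, r, n_out, fs=48000, nperseg=1024):
    """Sketch of steps 801-823 with broadband (single-band) parameters per
    STFT frame. transport: (channels, samples); theta_deg, r: per-frame
    azimuth and direct-to-total ratio; n_out: number of loudspeakers."""
    spk = np.linspace(-180.0, 180.0, n_out, endpoint=False)  # assumed layout
    _, _, S = stft(transport, fs=fs, nperseg=nperseg)        # steps 801/807
    n_bins, n_frames = S.shape[1], S.shape[2]
    out = np.zeros((n_out, n_bins, n_frames), dtype=complex)
    rng = np.random.default_rng(0)
    g_prev = np.zeros(n_out)
    alpha = 0.1  # fixed here; determined adaptively by the Figure 9 block
    for n in range(min(n_frames, len(theta_deg))):
        mono = S[:, :, n].mean(axis=0)               # assumed mono downmix
        direct = np.sqrt(r[n]) * mono                # step 813: direct part
        ambient = np.sqrt(1.0 - r[n]) * mono         # step 813: ambient part
        g = vbap_2d_gains(theta_deg[n], spk)         # step 809: panning gains
        g_prev = alpha * g + (1.0 - alpha) * g_prev  # step 817: smoothing
        out[:, :, n] += g_prev[:, None] * direct     # step 819: positioning
        phases = np.exp(1j * rng.uniform(0, 2 * np.pi, (n_out, n_bins)))
        out[:, :, n] += phases * ambient / np.sqrt(n_out)  # steps 815/821
    _, y = istft(out, fs=fs, nperseg=nperseg)        # step 823
    return y
```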
  • the smoothing coefficients determiner 703 is configured to generate values which may be used to smooth the panning gains in order to avoid "musical noise" artefacts.
  • the inputs to the smoothing coefficients determiner 703 are shown as the time-frequency domain audio signals 700 and the energy ratios 508.
  • An energy estimator 901 may be configured to receive the time-frequency domain audio signals 700 and determine the energy E(k, n) 902 of the audio signals. For example in some embodiments the energy estimator 901 is configured to generate the energy based on:
  • E(k, n) = Σᵢ |Sᵢ(k, n)|²
  • where i is the channel index of the time-frequency domain audio signals.
  • a direction smoothness estimator 903 is configured to estimate a direction smoothness χ(k, n). In some embodiments, such as shown in the examples in Figures 7a and 8a, this direction smoothness may be estimated or determined from the energy ratios r(k, n) 508. For example the direction smoothness estimator may be configured to calculate the direction smoothness by the following:
  • χ(k, n) = r(k, n)^γ
  • where γ is an exponent applied to the energy ratio.
  • the direction smoothness value χ(k, n) 904 can be estimated by using or calculating the fluctuation of the direction value.
  • a circular variance of the directions θ(k, n) is determined and this is used as the basis of a direction smoothness.
  • any suitable analysis of the temporal fluctuation of the directions may be used to determine the direction smoothness estimate.
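  • As one concrete (assumed) instance of such an analysis, the circular statistics of recent azimuth estimates can be used: the mean resultant length R is near unity for stable directions and near zero for strongly fluctuating ones, so R itself can serve as the smoothness value, the circular variance being 1 − R.

```python
import numpy as np

def direction_smoothness_from_azimuths(azimuths_rad):
    """Direction smoothness from a window of azimuth estimates
    theta(k, n - N + 1) ... theta(k, n): the mean resultant length R of
    the corresponding unit vectors (circular variance equals 1 - R).
    The window length and this mapping are illustrative assumptions."""
    z = np.exp(1j * np.asarray(azimuths_rad))  # directions as unit phasors
    return np.abs(np.mean(z))                  # R in [0, 1]: 1 = stable
```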
  • An average direction smoothness estimator 905 is configured to receive the energy 902 and direction smoothness estimates 904 and determine an average over time (and in some embodiments over frequency).
  • the average direction smoothness estimator may therefore be configured to perform a first-order smoothing based on a current estimate χ(k, n), a previous average value χ'(k, n − 1) and a smoothing coefficient β to generate an averaged direction smoothness estimate χ'(k, n) 906, for example by the following:
  • χ'(k, n) = β χ(k, n) + (1 − β) χ'(k, n − 1)
  • β may be fixed, or it can be adaptively selected based on the energy of the audio signals, for example using constants a₁ and a₂.
  • a₁ may, e.g., be 0.001 and a₂ may, e.g., be 0.5.
  • This adaptive selection attempts to find whether the energy ratio is constantly large, and hence temporal smoothing can be safely made shorter without artefacts.
  • the direction smoothness estimates χ may be weighted by the energy E while performing the temporal smoothing.
  • a direction smoothness estimates to smoothing coefficients converter may receive the averaged direction smoothness estimate χ'(k, n) 906 and generate the smoothing coefficients α(k, n). For example in some embodiments the averaged direction smoothness estimates χ'(k, n) are converted to the actual smoothing coefficients by the following:
  • α(k, n) = χ'(k, n) α_fast(k) + (1 − χ'(k, n)) α_slow(k)
  • the values of α_fast may, e.g., include 0.4, and the values of α_slow may, e.g., include 0.1.
  • These fast and slow coefficients may depend on the actual implementation and may be frequency-dependent.
  • the smoothing coefficients may be a vector instead of a single value. This for example may occur when the smoothing is other than a first-order IIR smoothing.
  • These embodiments may therefore implement "fast settings" and "slow settings" which are interpolated based on the "averaged direction smoothness estimates". In such embodiments these "settings" may depend on the implementation, for example whether it is a single value or a vector of values.
  • the smoothing coefficients α(k, n) 706 may then be output.
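  • Pulling the blocks of Figure 9 together, a sketch of the complete smoothing coefficients determiner is given below, assuming the energy-ratio-based smoothness estimate. The mapping from the energy to the averaging parameter β (clipping a normalised energy to [a₁, a₂]) and the exponent value γ = 2 are illustrative assumptions; the example values a₁ = 0.001, a₂ = 0.5, α_fast = 0.4 and α_slow = 0.1 follow the text above.

```python
import numpy as np

class SmoothingCoefficientsDeterminer:
    """Sketch of Figure 9: energy estimator, direction smoothness estimator,
    energy-adaptive averaging, and conversion to smoothing coefficients."""

    def __init__(self, n_bands, gamma=2.0, a1=0.001, a2=0.5,
                 alpha_fast=0.4, alpha_slow=0.1):
        self.gamma = gamma                # exponent on the ratio (assumed)
        self.a1, self.a2 = a1, a2         # bounds for beta
        self.alpha_fast = np.full(n_bands, alpha_fast)
        self.alpha_slow = np.full(n_bands, alpha_slow)
        self.chi_avg = np.zeros(n_bands)  # chi'(k, n - 1)

    def process(self, S, r):
        """S: one time-frequency frame (channels, bands); r: ratios (bands,).
        Returns the smoothing coefficients alpha(k, n) for this frame."""
        E = np.sum(np.abs(S) ** 2, axis=0)  # E(k, n): energy per band
        chi = r ** self.gamma               # chi(k, n) = r(k, n)^gamma
        # Averaging parameter beta from the energy: louder bands update the
        # average faster (the normalisation by the mean is an assumption).
        beta = np.clip(E / (E.mean() + 1e-12), self.a1, self.a2)
        # chi'(k, n) = beta * chi(k, n) + (1 - beta) * chi'(k, n - 1)
        self.chi_avg = beta * chi + (1.0 - beta) * self.chi_avg
        # alpha(k, n) = chi' * alpha_fast + (1 - chi') * alpha_slow
        return (self.chi_avg * self.alpha_fast
                + (1.0 - self.chi_avg) * self.alpha_slow)
```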
  • the time-frequency domain audio signals are received as shown in Figure 10 by step 1001.
  • the estimate of the energy (of the audio signals) may be determined based on the time-frequency domain audio signals as shown in Figure 10 by step 1005.
  • Furthermore the estimate of the direction smoothness is determined based on the energy ratios (or based on any other suitable parameter such as an analysis of the directions) as shown in Figure 10 by step 1007.
  • the estimate of the average direction smoothness is then determined based on the energy estimate and the direction smoothness estimates as shown in Figure 10 by step 1009.
  • the average direction smoothness estimate is converted to smoothing coefficients as shown in Figure 10 by step 1011.
  • the smoothing coefficients were determined based on the use of the spatial metadata (for example the metadata generated in the analysis processor such as found within a spatial audio capture (SPAC) system which generates directions and direct-to-total energy ratios). It should be noted that the above methods can be modified without inventive skill to be used with any method utilizing similar parameters. For example in the context of Directional Audio Coding (DirAC), the direction smoothness can be determined as:
  • χ(k, n) = (1 − ψ(k, n))^γ
  • where ψ(k, n) is the diffuseness parameter.
  • Some of the advantages of the proposed embodiments are that a significant amount of smoothing can be applied with typical sound scenes, and thus musical noise artefacts are avoided. Furthermore, when the sound scene does not require as much smoothing, the amount of smoothing applied can be reduced and thus the reproduction can react faster to changes in the sound field.
  • Figure 11 shows three graph traces showing a reference audio signal, a reproduction using a fixed smoothing coefficient and a reproduction using an adaptive smoothing coefficient.
  • the sound scene contains two sources located in different directions in anechoic conditions. The sound was rendered to a multichannel setup.
  • the reference graph trace 1101 shows the signal of one output channel of the audio signal and shows the first source 1103 but not the other source.
  • the adaptive smoothing example 1121, having analysed that the directions are stable and there is not as much need for temporal smoothing, is configured to set the smoothing to a faster mode, and the sound source is not reproduced from the wrong channel. In such a manner the reproduction is perceived to react quickly to changes in the direction.
  • the implementation can be in software, for example running on a mobile phone (or a computer) 1200.
  • the software running inside the mobile phone 1200 may be configured to receive an encoded bitstream (it may have been e.g., transmitted real-time or it may have been stored to the device).
  • the bitstream can also be any other suitable bitstream.
  • a demultiplexer 1203 (DEMUX) is configured to demultiplex the bitstream into an audio bitstream 1204 and a spatial metadata bitstream 1206.
  • An enhanced voice services (EVS) or other encoded bitstream decoder 1205 is configured to extract the transport audio signals 1206 from the audio bitstream (or any decoder that corresponds to the utilized codec may be used).
  • a metadata decoder 1207 is used to decompress the spatial metadata 1208, for example comprising the directions 1210 and energy ratios 1212.
  • the spatial synthesiser 1209 (similar to the spatial synthesizer in the embodiments above) is configured to receive transport audio signals 1206 and the metadata 1208 and output multichannel loudspeaker signals 1211 that may be reproduced using a multichannel loudspeaker setup. In some embodiments the spatial synthesizer 1209 is configured to generate binaural audio signals that may be reproduced using headphones.
  • a microphone array 1301 for example part of a mobile phone, is configured to capture audio signals 1302.
  • the captured microphone array audio signals 1302 may be processed by software 1300 running inside the mobile phone.
  • the software 1300 may be an analysis processor configured to analyse the captured microphone array signals 1302 in a manner such as described with respect to Figures 1 and 3 and is configured to generate spatial metadata 1304 (comprising directions 1306 and energy ratios 1308).
  • a synthesis processor 1305 is configured to receive the spatial metadata 1304 from the analysis processor 1303 along with the captured microphone array audio signals 1302 (or alternatively a subset or a processed set of the microphone signals).
  • the synthesis processor 1305 may operate in a manner similar to the synthesis processor as described with respect to Figures 1, 5, 7a, 7b and 9. Depending on the configuration, the synthesis processor 1305 may be configured to output a multichannel audio signal (for example a binaural signal or a surround loudspeaker signal or Ambisonic signal).
  • the multichannel audio signals 1307 can therefore be listened to directly (when fed to headphones or loudspeakers, or reproduced using an Ambisonic renderer), stored (with any suitable codec) and/or transmitted to a remote device.
  • While a codec use implementation is described above, it is noted that some embodiments may be used with any suitable codec that utilizes smoothing and can provide information on the smoothness of the direction-related parameters.
  • the proposed method can also be applied in any kind of spatial audio processing which operates in time-frequency domain.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code.
  • the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

An apparatus for spatial audio signal processing, the apparatus comprising at least one processor and at least one memory including a computer program code. The at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to receive at least one audio signal (502), determine at least one spatial parameter (508) associated with the at least one audio signal (502), generate an adaptive smoothing parameter (706) based on the at least one spatial parameter (508), determine panning gains (708) for applying to a first part (704) of the at least one audio signal (502), apply the adaptive smoothing parameter (706) to the panning gains (708) to generate associated smoothed panning gains (710), and apply the smoothed panning gains (710) to the first part (704) of the at least one audio signal (502) to generate a positioned audio signal.

Description

TEMPORAL SPATIAL AUDIO PARAMETER SMOOTHING
Field
The present application relates to apparatus and methods for temporal spatial audio parameter smoothing. This includes, but is not exclusive to, sound reproduction systems and sound reproduction methods producing multichannel audio outputs.
Background
Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratio parameters expressing relative energies of the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to describe well the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized accordingly in the synthesis of the spatial sound: binaurally for headphones, for loudspeakers, or for other formats, such as Ambisonics.
The directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the proportion of the sound energy that is directional) can be also utilized as the spatial metadata for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata. The stereo signal could be encoded, for example, with an AAC encoder. A decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
Summary
There is provided according to a first aspect an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: receive at least one audio signal; determine at least one spatial parameter associated with the at least one audio signal; generate an adaptive smoothing parameter based on the at least one spatial parameter; determine panning gains for applying to a first part of the at least one audio signals; apply the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and apply the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
The apparatus may be further caused to: apply a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combine the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
The apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter may be caused to: estimate a direction smoothness parameter based on the at least one spatial parameter; and convert the direction smoothness parameter to the adaptive smoothing parameter.
The apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter may be caused to: estimate an energy of the at least one audio signal; average the direction smoothness parameter based on the energy of the at least one audio signal, wherein the apparatus caused to convert the direction smoothness parameter to the adaptive smoothing parameter is caused to convert the averaged direction smoothness parameter to the adaptive smoothing parameter. The apparatus caused to average the direction smoothness parameter based on the energy of the at least one audio signal may be caused to: determine an averaging parameter based on the energy of the at least one audio signal; and apply the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
The at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by applying an exponent to the energy ratio.
The at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by analysing a motion of the direction.
The at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter may be caused to determine the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
The apparatus caused to receive at least one audio signal may be caused to perform at least one of: receive the at least one audio signal from at least one microphone within a microphone array; determine the at least one audio signal from multichannel loudspeaker audio signals; and receive the at least one audio signal as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
The apparatus caused to determine at least one spatial parameter associated with the at least one audio signal may be caused to perform at least one of: analyse the at least one audio signal to determine the at least one spatial parameter; and receive the at least one spatial parameter as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
According to a second aspect there is provided a method for spatial audio signal processing comprising: receiving at least one audio signal; determining at least one spatial parameter associated with the at least one audio signal; generating an adaptive smoothing parameter based on the at least one spatial parameter; determining panning gains for applying to a first part of the at least one audio signals; applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
The method may further comprise: applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
Generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise: estimating a direction smoothness parameter based on the at least one spatial parameter; and converting the direction smoothness parameter to the adaptive smoothing parameter.
Generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise: estimating an energy of the at least one audio signal; averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein converting the direction smoothness parameter to the adaptive smoothing parameter may comprise converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
Averaging the direction smoothness parameter based on the energy of the at least one audio signal may comprise: determining an averaging parameter based on the energy of the at least one audio signal; and applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter. The at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by applying an exponent to the energy ratio.
The at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by analysing a motion of the direction.
The at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter may comprise determining the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
Receiving at least one audio signal may comprise performing at least one of: receiving the at least one audio signal from at least one microphone within a microphone array; determining the at least one audio signal from multichannel loudspeaker audio signals; and receiving the at least one audio signal as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
Determining at least one spatial parameter associated with the at least one audio signal may comprise at least one of: analysing the at least one audio signal to determine the at least one spatial parameter; and receiving the at least one spatial parameter as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
According to a third aspect there is provided an apparatus for spatial audio signal processing comprising means for: receiving at least one audio signal; determining at least one spatial parameter associated with the at least one audio signal; generating an adaptive smoothing parameter based on the at least one spatial parameter; determining panning gains for applying to a first part of the at least one audio signals; applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
The apparatus may further comprise means for: applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
The means for generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise means for: estimating a direction smoothness parameter based on the at least one spatial parameter; and converting the direction smoothness parameter to the adaptive smoothing parameter.
The means for generating an adaptive smoothing parameter based on the at least one spatial parameter may comprise means for: estimating an energy of the at least one audio signal; averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein the means for converting the direction smoothness parameter to the adaptive smoothing parameter may comprise means for converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
The means for averaging the direction smoothness parameter based on the energy of the at least one audio signal may comprise means for: determining an averaging parameter based on the energy of the at least one audio signal; and applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
The at least one spatial parameter may comprise an energy ratio associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by applying an exponent to the energy ratio.
The at least one spatial parameter may comprise a direction associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by analysing a motion of the direction.
The at least one spatial parameter may comprise a diffuseness associated with the at least one audio signal, wherein the means for estimating a direction smoothness parameter based on the at least one spatial parameter may comprise means for determining the direction smoothness parameter by applying an exponent to the difference between unity and the diffuseness.
The means for receiving at least one audio signal may comprise means for at least one of: receiving the at least one audio signal from at least one microphone within a microphone array; determining the at least one audio signal from multichannel loudspeaker audio signals; and receiving the at least one audio signal as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
The means for determining at least one spatial parameter associated with the at least one audio signal may comprise means for at least one of: analysing the at least one audio signal to determine the at least one spatial parameter; and receiving the at least one spatial parameter as part of a data stream comprising the at least one audio signals and metadata comprising the at least one spatial parameter.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example system utilizing embodiments described hereafter;
Figure 2 shows a flow diagram of the operation of the example system shown in Figure 1;
Figure 3 shows schematically an example analysis processor shown in Figure 1 according to some embodiments;
Figure 4 shows a flow diagram of the operation of the example analysis processor shown in Figure 3;
Figure 5 shows schematically an example synthesis processor shown in Figure 1 according to some embodiments;
Figure 6 shows a flow diagram of the operation of the example synthesis processor shown in Figure 5;
Figures 7a and 7b show schematically example spatial synthesizers shown in Figure 5 according to some embodiments;
Figures 8a and 8b show flow diagrams of the operation of the spatial synthesizers shown in Figures 7a and 7b;
Figure 9 shows schematically an example smoothing coefficients determiner shown in Figures 7a and 7b according to some embodiments;
Figure 10 shows a flow diagram of the operation of the smoothing coefficients determiner shown in Figure 9;
Figure 11 shows example graphs demonstrating the effect of implementing the embodiments;
Figure 12 shows an example implementation of the embodiments as shown in Figures 1 to 10;
Figure 13 shows a further example implementation of the embodiments as shown in Figures 1 to 10; and
Figure 14 shows schematically an example device suitable for implementing the embodiments shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of adaptive parameter smoothing.
In the following embodiments and examples the spatial sound source is a microphone array. Alternatively the spatial sound source may be a 5.1 or other format multichannel mix, or Ambisonics signals.
As described above parametric spatial audio capture methods can be used to enable a perceptually accurate spatial sound reproduction. Parametric spatial audio capture refers to adaptive DSP-driven audio capture methods covering 1) analysing perceptually relevant parameters in frequency bands, for example, the directionality of the propagating sound at the recording position, and 2) reproducing spatial sound in a perceptual sense at the rendering side according to the estimated spatial parameters. The reproduction can be, for example, for headphones or multichannel loudspeaker setups. By estimating and reproducing the perceptually relevant spatial properties (parameters) of the sound field, a spatial perception similar to that which would occur in the original sound field can be reproduced. As a result, the listener can perceive the multitude of sources, their directions and distances, as well as properties of the surrounding physical space, among other spatial sound features, as if the listener were in the position of the capture device.
Parametric spatial audio capture methods (SPAC) may employ these determined parameters, such as directions of the sound in frequency bands and the ratios between the directional and non-directional parts of the captured sound in frequency bands, to describe the perceptual spatial properties of the captured sound at the position of the microphone array, and may use these parameters in the synthesis of the spatial sound. As the spatial properties are estimated from the sound field, they can fluctuate significantly over time and frequency, e.g., due to reverberation and/or multiple simultaneous sound sources. Hence, parametric spatial audio processing methods typically utilize smoothing in the synthesis, in order to avoid possible artefacts caused by rapidly fluctuating parameters (these artefacts are typically referred to as “musical noise”). Similar parametrization may also be used for the compression of spatial audio, e.g., from 5.1 multichannel signals. In this case, the parameters are estimated from the input loudspeaker signals. Nevertheless, the parameters typically fluctuate in this case also. Hence, temporal smoothing is also needed with loudspeaker input.
Typically, the spatial parameters are determined in the time-frequency domain, i.e., each parameter value is associated with a certain frequency band and temporal frame. Examples of possible spatial parameters include (but are not limited to):
Direction and direct-to-total energy ratio
Direction and diffuseness
Inter-channel level difference, inter-channel phase difference, and inter-channel coherence
These parameters are determined in the time-frequency domain. It should be noted that other parametrizations than those presented above may also be used. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). Moreover, it should be noted that, in such methods, the audio and the parameters may be processed and/or transmitted/stored in between the analysis and the synthesis.
Parametric spatial audio processing methods are often based on analysing the direction of arrival (and other spatial parameters) in frequency bands. If there were a single sound source in anechoic conditions, the direction would stably point to the sound source at all frequencies. However, in typical acoustic environments, the microphones also capture sounds other than the sound source, such as reverberation and ambient sounds. Moreover, there may be multiple simultaneous sources. As a result, the estimated directions typically fluctuate significantly over time and the estimates differ between frequency bands.
Parametric spatial audio processing methods such as employed in embodiments as described in further detail hereafter synthesize the spatial sound based on the analysed parameters (such as the aforementioned direction) and related audio signals (e.g., 2 captured microphone signals). In the case of loudspeaker rendering, vector base amplitude panning (VBAP) is a common method to position the audio to the analysed direction. VBAP computes gains for a subset of loudspeakers based on the direction, and the audio signal is multiplied with these gains and fed to these loudspeakers.
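The patent does not tie the embodiments to any particular VBAP implementation; the following is a minimal sketch of horizontal-only (2-D) pairwise amplitude panning, where the function name, the tolerance and the example loudspeaker layout are illustrative assumptions rather than part of the described method.

    import numpy as np

    def vbap_2d_gains(azimuth_deg, speaker_azimuths_deg):
        # Unit vector pointing towards the analysed direction.
        target = np.array([np.cos(np.radians(azimuth_deg)),
                           np.sin(np.radians(azimuth_deg))])
        spk = np.radians(np.asarray(speaker_azimuths_deg, dtype=float))
        vecs = np.stack([np.cos(spk), np.sin(spk)], axis=1)
        gains = np.zeros(len(spk))
        # Loudspeakers are assumed to form a full ring, sorted by azimuth;
        # try each adjacent pair until the target lies between the pair.
        for i in range(len(spk)):
            j = (i + 1) % len(spk)
            base = np.column_stack([vecs[i], vecs[j]])  # 2x2 vector base
            g = np.linalg.solve(base, target)           # g[0]*v_i + g[1]*v_j = target
            if np.all(g >= -1e-9):
                gains[i], gains[j] = max(g[0], 0.0), max(g[1], 0.0)
                break
        return gains / np.linalg.norm(gains)            # energy normalisation

    # For example, a source at +10 degrees on a quadraphonic ring:
    # gains = vbap_2d_gains(10.0, [45, 135, -135, -45])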
The concept as discussed hereafter proposes apparatus and methods to adapt the smoothing needed in the synthesis of spatial sound in parametric spatial audio processing, in order to obtain high-quality audio output for different types of sound scenes.
Furthermore the embodiments as described hereafter relate to parametric spatial audio processing, and provide a solution that improves the temporal smoothing needed in the synthesis of spatial audio by adaptively analysing the required amount of smoothing. The analysis relates to the stability of the direction-related parameter(s): it produces a measure of directional stability, and the time coefficients of the temporal smoothing are determined based on that measure.
The direction-related parameter may as described in further detail in the embodiments hereafter refer to a direction.
In some embodiments the amount of smoothing can be analysed using the direct-to-total energy ratio. The value of the energy ratio is monitored over time, and where it is constantly high, the time coefficient of the smoothing can be set smaller (less smoothing applied). Correspondingly, where the energy ratio is not constantly high, the time coefficient can be set to a default value (more smoothing applied).
A block diagram of an example system for implementing some embodiments is shown in Figure 1 .
Figure 1 shows an example capture device 101. The capture device may be a VR capture device, a mobile phone or any other suitable electronic apparatus comprising one or more microphone arrays. The capture device 101 thus in some embodiments comprises microphones 100₁, 100₂. The microphone audio signals 102 captured by the microphones 100₁, 100₂ may be stored and later processed, or directly processed.
An analysis processor 103 may receive the microphone audio signals 102 from the capture device 101. The analysis processor 103 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). In some embodiments the capture device 101 and the analysis processor 103 are implemented on the same apparatus or device.
Based on the microphone-array signals, the analysis processor creates a data stream 104. The data stream may comprise transport audio signals and spatial metadata (e.g., directions and energy ratios in frequency bands). The data stream 104 may be transmitted or stored for example within some storage 105 such as memory, or alternatively directly processed in the same device.
A synthesis processor 107 may receive the data stream 104. The synthesis processor 107 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, FPGAs or ASICs. Based on the data stream (the transport audio signals and the metadata), the synthesis processor can be configured to produce output audio signals. For headphone listening, the output signals can be binaural signals 109. For loudspeaker rendering, the output signals can be multi-channel signals.
The headphones 111 or other playback apparatus may be configured to receive the output of the synthesis processor 107 and output the audio signals in a format suitable for listening.
With respect to Figure 2 is shown an example summary of the operations of the apparatus shown in Figure 1 .
The initial operation is the capture (or otherwise input) of the audio signals as shown in Figure 2 by step 201.
Having captured the audio signals they are analysed to generate the data stream as shown in Figure 2 by step 203.
The data stream may then be transmitted and received (or stored and retrieved) as shown in Figure 2 by step 205. Having received or retrieved the data stream, the output may be synthesized based at least on the data stream as shown in Figure 2 by step 207.
The synthesized output audio signals may then be output to a suitable output such as headphones as shown in Figure 2 by step 209.
With respect to Figure 3 an example analysis processor 103, such as shown in Figure 1, is presented. The inputs to the analysis processor 103 are the microphone array signals 102.
A transport audio signal generator 301 may be configured to receive the microphone array signals 102 and create the transport audio signals. In some embodiments the transport audio signals are selected from the microphone array signals. In some embodiments the microphone array signals may be downmixed to generate the transport audio signals. In some embodiments the transport audio signals may be obtained by processing the microphone array signals.
The transport audio signal generator 301 may be configured to generate any suitable number of transport audio signals (or channels), for example in some embodiments the transport audio signal generator 301 is configured to generate two transport audio signals. In some embodiments the transport audio signal generator 301 is further configured to compress the audio signals. For example in some embodiments the audio signals may be compressed using an advanced audio coding (AAC) or enhanced voice services (EVS) compression coding.
Furthermore the analysis processor 103 comprises a spatial analyser 303. The spatial analyser 303 is also configured to receive the microphone array signals 102 and generate metadata 304 based on a spatial analysis of the microphone array signals. The spatial analyser 303 may be configured to determine any suitable spatial metadata parameter. For example spatial metadata parameters determined include (but are not limited to): direction and direct-to-total energy ratio; direction and diffuseness; inter-channel level difference, inter-channel phase difference, and inter-channel coherence. In some embodiments these parameters are determined in the time-frequency domain. It should be noted that other parametrizations than those presented above may also be used. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). In the example shown in Figure 3 the metadata 304 comprises directions 306 and energy ratios 308. In some embodiments the metadata may be compressed and/or quantized. The analysis processor 103 may furthermore comprise a multiplexer or mux 305 which is configured to receive the metadata 304 and the transport audio signals 302 and generate a combined data stream 104. The combination may be any suitable combination.
It should be noted that in some embodiments the input to the analysis processor 103 can also be other types of audio signals, such as multichannel loudspeaker signals, audio objects, or Ambisonic signals. Furthermore, the exact implementation of the analysis processor may be any suitable implementation (as indicated above, a computer running suitable software, FPGAs or ASICs, etc.) caused to produce the transport audio signals and the spatial metadata in the time-frequency domain.
With respect to Figure 4 is shown an example summary of the operations of the analysis processor shown in Figure 3.
The initial operation is receiving the microphone array audio signals as shown in Figure 4 by step 401 .
Having received the microphone audio signals, they are analysed to generate the transport audio signals (for example by selection, downmixing or other processing) as shown in Figure 4 by step 403.
Furthermore the microphone audio signals are spatially analysed to generate the metadata, for example the directions and energy ratios as shown in Figure 4 by step 405.
The metadata and the transport audio signals may then be combined to generate the data stream as shown in Figure 4 by step 407.
With respect to Figure 5 an example synthesis processor 107 (as shown in Figure 1 ) according to some embodiments is shown.
A demultiplexer, or demux, 501 is configured to receive the data stream 104 and caused to demultiplex the data stream into transport audio signals 502 and metadata 504. In some embodiments, where the transport audio signals were compressed within the analysis processor, the demultiplexer is furthermore caused to decode the audio signals. The metadata in some embodiments is within the time-frequency domain, and comprises parameters such as directions θ(k, n) 506 and direct-to-total energy ratios r(k, n) 508, where k is the frequency band index and n the temporal frame index. In the embodiments where the metadata is compressed/quantized, the demultiplexed data is furthermore decompressed/dequantized to attempt to regenerate the originally determined parameters.
A spatial synthesizer 503 is configured to receive the transport audio signals 502 and the metadata and caused to generate the multichannel output signals 510 such as the binaural output signals 109 shown in Figure 1.
With respect to Figure 6 is shown an example summary of the operations of the synthesis processor shown in Figure 5.
The initial operation is receiving the data stream as shown in Figure 6 by step 601.
Having received the data stream, it is demultiplexed and optionally decoded to generate the transport audio signals and the metadata as shown in Figure 6 by step 603.
The multichannel (binaural or otherwise) output signals may then be synthesized from the transport audio signals and the metadata as shown in Figure 6 by step 605.
The multichannel (binaural or otherwise) output signals may then be output as shown in Figure 6 by step 607.
With respect to Figures 7a and 7b example spatial synthesizers 503 (as shown in Figure 5) according to some embodiments are shown.
The input to the spatial synthesizer 503 is in some embodiments the transport audio signals 502 and furthermore the metadata 504 (which may include the energy ratios 508 and the directions 506).
In some embodiments the transport audio signals 502 are transformed to the time-frequency domain using a suitable transformer. For example as shown in Figures 7a and 7b a short-time Fourier transformer (STFT) 701 is configured to apply a short-time Fourier transform to the transport audio signals to generate suitable time-frequency domain audio signals Sᵢ(k, n) 700. In some embodiments any suitable time-frequency transformer may be used, for example a quadrature mirror filterbank (QMF).
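As a sketch, the transform stage could be realised with an off-the-shelf STFT; the sampling rate, frame length and two-channel placeholder input below are arbitrary illustrative choices, not values taken from the patent.

    import numpy as np
    from scipy.signal import stft

    fs = 48000
    transport = np.zeros((2, fs))               # two transport channels (placeholder)
    # S has shape (channel i, frequency bin k, temporal frame n).
    _, _, S = stft(transport, fs=fs, nperseg=1024)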
A divider 705 may receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the energy ratios 508 and divide the time-frequency domain audio signals Sᵢ(k, n) 700 into ambient and direct parts using the energy ratio r(k, n) 508.
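The division rule is not spelled out at this point; a common energy-preserving choice, assumed in this sketch, is to scale the signals by the square roots of the ratios so that the direct and ambient energies sum to the input energy.

    import numpy as np

    def divide(S, r):
        # S: complex time-frequency signals; r: direct-to-total energy ratio
        # in [0, 1], broadcastable against S (e.g. one value per band and frame).
        direct = np.sqrt(r) * S
        ambient = np.sqrt(1.0 - r) * S
        return direct, ambient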
With respect to Figure 7a a smoothing coefficients determiner 703 may also receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the energy ratios 508 and determine suitable smoothing coefficients 706. Figure 7b differs from the example spatial synthesizer shown in Figure 7a in that the smoothing coefficients determiner in Figure 7b is caused to receive the time-frequency domain audio signals Sᵢ(k, n) 700 and the directions 506.
The smoothing coefficients determiner 703 may be configured to adaptively determine the smoothing coefficient(s) α(k, n).
A panning gain determiner 715 may be configured to receive the directions 506 and based on the output speaker/headphone configuration and the directions determine suitable panning gains 708. The amplitude panning gains may be computed in any suitable manner, for example by vector base amplitude panning (VBAP) based on the received direction θ(k, n).
In some embodiments a panning gain smoother 717 is configured to receive the panning gains 708 and the smoothing coefficients 706 and based on these determine suitable smoothed panning gains 710. There are many ways to perform the smoothing. In some embodiments a first-order smoothing may be used. Thus for example the panning gain smoother 717 is configured to receive a current gain g(k, n), smoothing coefficients α(k, n) and also knowledge of the last smoothed gain g′(k, n − 1) and determine a smoothed gain by:
g′(k, n) = α(k, n) g(k, n) + (1 − α(k, n)) g′(k, n − 1)
In other words the current gain is multiplied with the smoothing coefficient α and the previous smoothed gain is multiplied with (1 − α). In other embodiments any suitable smoothing may be applied. The smoothing ‘filter’ may therefore be of a higher order, and similarly the smoothing coefficient α(k, n) may be a vector value. The actual value(s) of α may depend on the filterbank, and are typically frequency-dependent (values may include, e.g., 0.1). In general, the larger the value is, the less smoothing is applied.
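A minimal sketch of this first-order smoother; the array shapes are assumptions (K frequency bands and L output channels, with one coefficient per band broadcast over the channels).

    import numpy as np

    def smooth_gains(g, alpha, g_prev):
        # g'(k, n) = alpha(k, n) * g(k, n) + (1 - alpha(k, n)) * g'(k, n - 1)
        return alpha * g + (1.0 - alpha) * g_prev

    # Per-frame usage, with gains of shape (K, L) and alpha of shape (K, 1):
    # g_smoothed = smooth_gains(gains, alpha, g_smoothed)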
A decorrelator 707 is configured to receive the ambient audio signal part 702 and process it to make it perceived as being surrounding, for example by decorrelating and spreading the ambient audio signal part 702 across the audio scene.
A positioner 709 is configured to receive the direct audio signal part 704 and the smoothed panning gains 710 and position the direct audio signal part 704 using a suitable positioning, for example using the smoothed panning gains and an amplitude panning operation.
A merger 711 or other suitable combiner is configured to receive the spread ambient signal part from the decorrelator 707 and the positioned direct audio signals part from the positioner 709 and combine or merge these resulting audio signals.
An inverse short-time Fourier transformer (Inverse STFT) 713 is configured to receive the combined audio signals and apply an inverse short-time Fourier transform (or other suitable frequency to time domain transform) to generate the multi-channel audio signals 510 which may be passed to a suitable output device such as the headphones or multi-channel loudspeaker setup.
In the examples and embodiments described in detail herein, for example as described with respect to the examples shown in Figures 7a and 7b, the panning gains are determined directly from the direction metadata, and the “direct sound” is also positioned with these gains after smoothing.
In some embodiments there may be implementations where the panning gains are not directly determined from the direction metadata, but instead determined indirectly. Thus the smoothing of these gains as described above may be applied to any suitably generated gains.
Thus for example, in some embodiments the directions may be used (together with the energy ratios and transport audio signals) to determine a target energy distribution of the output multichannel signals. The target energy distribution may be compared to the energy distribution of the transport audio signals (or to the energy distribution of intermediate signals obtained from the transport audio signals by mixing). Panning gains (or any gains that position audio) may be obtained as a ratio of these values and the “smoother” 717 may be applied to these gains.
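As a sketch of this indirect route, assuming the “ratio of these values” is taken on energies (so the amplitude gain is its square root), with the epsilon guard added purely for numerical safety:

    import numpy as np

    def gains_from_energies(target_energy, current_energy, eps=1e-12):
        # Amplitude gains that move the per-channel energies of the (mixed)
        # transport signals towards the target energy distribution.
        return np.sqrt(target_energy / np.maximum(current_energy, eps))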
In summary, the panning gains may be generated by one of many optional methods, and are then smoothed according to the methods described herein.
With respect to Figures 8a and 8b the operations of the spatial synthesizer 503 shown in Figures 7a and 7b according to some embodiments are described in further detail.
The spatial synthesizer in some embodiments is configured to receive the transport audio signals as shown in Figures 8a and 8b by step 801 .
The spatial synthesizer in some embodiments is furthermore configured to receive the energy ratios as shown in Figures 8a and 8b by step 803.
The spatial synthesizer in some embodiments is also configured to receive the directions as shown in Figures 8a and 8b by step 805.
The received transport audio signals are in some embodiments converted into a time-frequency domain form, for example by applying a suitable time- frequency domain transform to the transport audio signals as shown in Figures 8a and 8b by step 807.
The time-frequency domain audio signals may then in some embodiments be divided into ambient and direct parts (based on the energy ratios) as shown in Figures 8a and 8b by step 813.
Furthermore the smoothing coefficients may be determined based on the energy ratios and the time-frequency domain audio signals as shown in Figure 8a by step 811. Alternatively the smoothing coefficients may be determined based on the directions and the time-frequency domain audio signals as shown in Figure 8b by step 851.
Panning gains may be determined based on the received directions as shown in Figures 8a and 8b by step 809. A series of smoothed panning gains may be determined based on the determined panning gains and the smoothing coefficients as shown in Figures 8a and 8b by step 817.
The ambient audio signal part may be decorrelated as shown in Figures 8a and 8b by step 815.
The positional component of the audio signals may then be determined based on the smoothed panning gains and the direct audio signal part as shown in Figures 8a and 8b by step 819. In such embodiments a positional component of the audio signals, or positioned audio signal, can be a number of audio signals which are combined to produce a virtual sound source positioned in a three-dimensional space.
The positional component of the audio signals and the decorrelated ambient audio signal may then be combined or merged as shown in Figures 8a and 8b by step 821 .
Furthermore the combined audio signals may then be inverse time-frequency domain transformed to generate the multichannel audio signals in a suitable format to be output as shown in Figures 8a and 8b by step 823.
With respect to Figure 9 an example smoothing coefficients determiner 703 (such as shown in Figures 7a and 7b) according to some embodiments is shown. The smoothing coefficients determiner 703 is configured to generate values which may be used to smooth the panning gains in order to avoid “musical noise” artefacts.
The inputs to the smoothing coefficients determiner 703 are shown as the time-frequency domain audio signals 700 and the energy ratios 508.
An energy estimator 901 may be configured to receive the time-frequency domain audio signals 700 and determine the energy E(k, n) 902 of the audio signals. For example in some embodiments the energy estimator 901 is configured to generate the energy based on:
E(k, n) = Σᵢ |Sᵢ(k, n)|²
A direction smoothness estimator 903 is configured to estimate a direction smoothness χ(k, n). In some embodiments, such as shown in the examples in Figures 7a and 8a, this direction smoothness may be estimated or determined from the energy ratios r(k, n) 508. For example the direction smoothness estimator may be configured to calculate the direction smoothness by the following:
χ(k, n) = r(k, n)^p
where p is a constant (e.g., p = 8).
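A sketch of the two estimators above; the energy formula follows the conventional reading of the reconstructed equation (a sum of squared magnitudes over the transport channels i).

    import numpy as np

    def energy_and_smoothness(S, r, p=8.0):
        # E(k, n) = sum over channels i of |S_i(k, n)|^2.
        E = np.sum(np.abs(S) ** 2, axis=0)
        # chi(k, n) = r(k, n)**p: near 1 only when the energy ratio is
        # close to 1, dropping quickly towards 0 otherwise.
        chi = r ** p
        return E, chi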
In some embodiments, such as shown in the examples in Figures 7b and 8b, the direction smoothness value χ(k, n) 904 can be estimated by using or calculating the fluctuation of the direction value. In such embodiments a circular variance of the directions θ(k, n) is determined and this is used as the basis of a direction smoothness. In other embodiments any suitable analysis of the temporal fluctuation of the directions may be used to determine the direction smoothness estimate.
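A sketch of a circular-variance-based estimate; mapping the variance to a smoothness via one minus the variance is an assumption, since the text only states that the variance is used as the basis of the smoothness.

    import numpy as np

    def smoothness_from_directions(azimuths_rad):
        # Circular variance: 0 when the recent direction estimates agree,
        # approaching 1 when they are scattered around the circle.
        R = np.abs(np.mean(np.exp(1j * np.asarray(azimuths_rad))))
        variance = 1.0 - R
        return 1.0 - variance   # stable directions -> high smoothness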
An average direction smoothness estimator 905 is configured to receive the energy 902 and direction smoothness estimates 904 and determine an average over time (and in some embodiments over frequency). The average direction smoothness estimator may therefore be configured to perform a first-order smoothing based on a current estimate χ(k, n), a previous average value χ′(k, n − 1) and a smoothing coefficient β to generate an averaged direction smoothness estimate χ′(k, n) 906, for example by the following:
χ′(k, n) = β χ(k, n) + (1 − β) χ′(k, n − 1)
where β may be fixed, or it can be adaptively selected between two constants a₁ and a₂ [selection equation not legible in the source], where a₁ may, e.g., be 0.001 and a₂ may, e.g., be 0.5. This adaptive selection attempts to find whether the energy ratio is constantly large, and hence temporal smoothing can be safely made shorter without artefacts. Moreover, the direction smoothness estimates χ may be weighted by the energy E while performing the temporal smoothing.
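Because the selection equation for β is not legible in the source, the sketch below uses one reading consistent with the stated constants and the stated goal: χ′ rises slowly (via a₁) and falls quickly (via a₂), so it becomes large only when χ has been large for some time. Treat the selection rule, and the omitted energy weighting, as assumptions.

    import numpy as np

    def average_smoothness(chi, chi_prev, a1=0.001, a2=0.5):
        # chi'(k, n) = beta * chi(k, n) + (1 - beta) * chi'(k, n - 1),
        # with beta chosen per bin: slow rise, fast fall (assumed rule).
        beta = np.where(chi > chi_prev, a1, a2)
        return beta * chi + (1.0 - beta) * chi_prev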
A direction smoothness estimates to smoothing coefficients converter may receive the averaged direction smoothness estimate χ′(k, n) 906 and generate the smoothing coefficients α(k, n). For example in some embodiments the averaged direction smoothness estimates χ′(k, n) are converted to the actual smoothing coefficients by the following:
α(k, n) = χ′(k, n) α_fast(k) + (1 − χ′(k, n)) α_slow(k)
The values of α_fast may, e.g., include 0.4, and the values of α_slow may, e.g., include 0.1. These fast and slow coefficients may depend on the actual implementation and may be frequency-dependent.
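The conversion itself is a straight interpolation between the fast and slow coefficients; a minimal sketch using the example values from the text:

    import numpy as np

    def smoothing_coefficients(chi_avg, alpha_fast=0.4, alpha_slow=0.1):
        # alpha(k, n) = chi'(k, n)*alpha_fast(k) + (1 - chi'(k, n))*alpha_slow(k)
        return chi_avg * alpha_fast + (1.0 - chi_avg) * alpha_slow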
In some embodiments the smoothing coefficients may be a vector instead of a single value. This for example may occur when the smoothing is other than a first-order IIR smoothing. These embodiments may therefore implement “fast settings” and “slow settings” which are interpolated based on the “averaged direction smoothness estimates”. In such embodiments these “settings” may depend on the implementation, for example whether it is a single value or a vector of values.
The smoothing coefficients α(k, n) 706 may then be output.
With respect to Figure 10 an example flow diagram showing the operation of the smoothing coefficients determiner according to some embodiments is shown.
The time-frequency domain audio signals are received as shown in Figure 10 by step 1001.
Furthermore the energy ratios are received as shown in Figure 10 by step 1003.
The estimate of the energy (of the audio signals) may be determined based on the time-frequency domain audio signals as shown in Figure 10 by step 1005.
Furthermore the estimate of the direction smoothness is determined based on the energy ratios (or any other suitable parameter, such as an analysis of the directions) as shown in Figure 10 by step 1007.
The estimate of the average direction smoothness is then determined based on the energy estimate and the direction smoothness estimates as shown in Figure 10 by step 1009.
Then the average direction smoothness estimate is converted to smoothness coefficients as shown in Figure 10 by step 1011.
The smoothness coefficients are then output as shown in Figure 10 by step 1013.
In the above examples the smoothness coefficients were determined based on the use of the spatial metadata (for example the metadata generated in the analysis processor, such as found within a spatial audio capture (SPAC) system, which generates directions and direct-to-total energy ratios). It should be noted that the above methods can be modified without inventive skill to be used with any method utilizing similar parameters. For example, in the context of Directional Audio Coding (DirAC), the direction smoothness can be determined as
χ(k, n) = (1 − ψ(k, n))^p
where ψ(k, n) is the diffuseness.
Some of the advantages of the proposed embodiments are that a significant amount of smoothing can be applied with typical sound scenes, and thus musical noise artefacts are avoided. Furthermore, when the sound scene does not require as much smoothing, the amount of smoothing applied can be reduced and thus the reproduction can react faster to changes in the sound field.
The effect of the proposed embodiments can be seen in Figure 11, which shows three graph traces: a reference audio signal, a reproduction using a fixed smoothing coefficient and a reproduction using an adaptive smoothing coefficient. In this example the sound scene contains two sources located in different directions in anechoic conditions. The sound was rendered to a multichannel setup.
The reference graph trace 1101 shows the signal of one output channel of the audio signal and shows the first source 1103 but not the other source.
In the fixed smoothing example graph trace 1111, excessive temporal smoothing causes the sound to be reproduced partially still from the first direction (shown from 1.4 seconds to 1.8 seconds) even though the sound source is no longer present in that direction. As a result, the reproduction is perceived to react slowly to changes in the direction.
On the contrary, the adaptive smoothing example 1121, having analysed that the directions are stable and there is less need for temporal smoothing, is configured to set the smoothing to a faster mode, and the sound source is not reproduced from the wrong channel. In such a manner the reproduction is perceived to react quickly to changes in the direction.
With respect to Figure 12 an example implementation of some further embodiments is shown. In these embodiments the implementation can be in software, for example on a mobile phone (or a computer) 1200. The software running inside the mobile phone 1200 may be configured to receive an encoded bitstream (it may have been, e.g., transmitted in real time or stored on the device). The bitstream can also be any other suitable bitstream. A demultiplexer 1203 (DEMUX) is configured to demultiplex the bitstream into an audio bitstream 1204 and a spatial metadata bitstream 1206.
An enhanced voice services (EVS) decoder 1205, or any decoder that corresponds to the utilized codec, is configured to extract the transport audio signals 1206 from the audio bitstream.
A metadata decoder 1207 is used to decompress the spatial metadata 1208, for example comprising the directions 1210 and energy ratios 1212.
The spatial synthesizer 1209 (similar to the spatial synthesizer in the embodiments above) is configured to receive the transport audio signals 1206 and the metadata 1208 and output multichannel loudspeaker signals 1211 that may be reproduced using a multichannel loudspeaker setup. In some embodiments the spatial synthesizer 1209 is configured to generate binaural audio signals that may be reproduced using headphones.
With respect to Figure 13 a further example implementation is shown according to some further embodiments. In this example implementation a microphone array 1301, for example part of a mobile phone, is configured to capture audio signals 1302. The captured microphone array audio signals 1302 may be processed by software 1300 running inside the mobile phone. In the software 1300 there may be an analysis processor 1303 configured to analyse the captured microphone array signals 1302 in a manner such as described with respect to Figures 1 and 3, and configured to generate spatial metadata 1304 (comprising directions 1306 and energy ratios 1308). Furthermore there may be a synthesis processor 1305 which is configured to receive the spatial metadata 1304 from the analysis processor 1303 along with the captured microphone array audio signals 1302 (or alternatively a subset or a processed set of the microphone signals). The synthesis processor 1305 may operate in a manner similar to the synthesis processor as described with respect to Figures 1, 5, 7a, 7b and 9. Depending on the configuration, the synthesis processor 1305 may be configured to output a multichannel audio signal (for example a binaural signal, a surround loudspeaker signal or an Ambisonic signal). The multichannel audio signals 1307 can therefore be listened to directly (when fed to headphones or loudspeakers, or reproduced using an Ambisonic renderer), stored (with any suitable codec) and/or transmitted to a remote device.
Although a codec-based implementation is described above, it is noted that some embodiments may be used with any suitable codec that utilizes smoothing and can provide information on the smoothness of the direction-related parameters.
Similarly, as depicted in the example implementation in Figure 13, the proposed method can also be applied in any kind of spatial audio processing which operates in the time-frequency domain.
With respect to Figure 14 an example electronic device which may be used as the analysis or synthesis processor is shown. The device may be any suitable electronic device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes, such as the methods described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA). The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California, and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. Nevertheless, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

CLAIMS:
1. An apparatus for spatial audio signal processing, the apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
receive at least one audio signal;
determine at least one spatial parameter associated with the at least one audio signal;
generate an adaptive smoothing parameter based on the at least one spatial parameter;
determine panning gains for applying to a first part of the at least one audio signal;
apply the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and
apply the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
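By way of a non-limiting illustration of claim 1, the gain-smoothing chain may be sketched per time-frequency tile as below; the function names, the one-pole smoother form and the NumPy realisation are assumptions of this sketch, not recitations of the claim.

    import numpy as np

    def smooth_panning_gains(gains, prev_gains, alpha):
        # One-pole recursive smoothing of the panning gain vector; alpha in
        # [0, 1] is the adaptive smoothing parameter (alpha near 1 means
        # heavy smoothing, alpha near 0 lets the gains follow immediately).
        return alpha * np.asarray(prev_gains) + (1.0 - alpha) * np.asarray(gains)

    def position_direct_part(direct, gains, prev_gains, alpha):
        # Apply the smoothed gains to the first (direct) part of the audio
        # signal, spreading the single-channel part to the output channels.
        smoothed = smooth_panning_gains(gains, prev_gains, alpha)
        return np.outer(smoothed, direct), smoothed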
2. The apparatus as claimed in claim 1, wherein the apparatus is further caused to:
apply a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and
combine the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
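Claim 2 may be illustrated, again without limitation, by the sketch below, which reuses position_direct_part from the sketch above; the decorrelate argument stands in for any suitable decorrelator returning one decorrelated copy per output channel and is an assumption of this sketch.

    def render_multichannel(direct, ambient, gains, prev_gains, alpha, decorrelate):
        # Position the direct part with the smoothed gains, decorrelate the
        # second (ambient) part, and sum the two to a multichannel signal.
        positioned, smoothed = position_direct_part(direct, gains, prev_gains, alpha)
        ambience = decorrelate(ambient)   # assumed shape: (channels, samples)
        return positioned + ambience, smoothed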
3. The apparatus as claimed in any of claims 1 and 2, wherein the apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter is caused to:
estimate a direction smoothness parameter based on the at least one spatial parameter; and
convert the direction smoothness parameter to the adaptive smoothing parameter.
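The conversion step of claim 3 may be illustrated by the non-limiting sketch below; the direction of the mapping (a high direction smoothness giving light smoothing) and its end points are assumptions made for this sketch.

    def smoothness_to_smoothing(s, alpha_min=0.3, alpha_max=0.98):
        # Map the direction smoothness estimate s in [0, 1] to a one-pole
        # smoothing coefficient: a reliable, stable direction (s near 1)
        # gets light smoothing, an unreliable one (s near 0) gets heavy
        # smoothing. The end points are invented for this sketch.
        return alpha_max - (alpha_max - alpha_min) * s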
4. The apparatus as claimed in claim 3, wherein the apparatus caused to generate an adaptive smoothing parameter based on the at least one spatial parameter is caused to:
estimate an energy of the at least one audio signal; and
average the direction smoothness parameter based on the energy of the at least one audio signal, wherein the apparatus caused to convert the direction smoothness parameter to the adaptive smoothing parameter is caused to convert the averaged direction smoothness parameter to the adaptive smoothing parameter.
5. The apparatus as claimed in claim 4, wherein the apparatus caused to average the direction smoothness parameter based on the energy of the at least one audio signal is caused to:
determine an averaging parameter based on the energy of the at least one audio signal; and
apply the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
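The energy-based averaging of claims 4 and 5 may be illustrated by the non-limiting sketch below; the particular rule for deriving the averaging parameter from the energies is an assumption, while the final combination follows the wording of claim 5.

    def average_smoothness(s, prev_avg, energy, prev_energy, eps=1e-12):
        # Averaging parameter grows with the tile energy, so loud tiles
        # update the running estimate faster than quiet ones (assumed rule).
        a = energy / (energy + prev_energy + eps)
        # Claim 5: a times the smoothness plus (1 - a) times the previous
        # averaged direction smoothness.
        return a * s + (1.0 - a) * prev_avg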
6. The apparatus as claimed in any of claims 3 to 5, wherein the at least one spatial parameter comprises an energy ratio associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter is caused to determine the direction smoothness parameter by applying an exponent to the energy ratio.
7. The apparatus as claimed in any of claims 3 to 5, wherein the at least one spatial parameter comprises a direction associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter is caused to determine the direction smoothness parameter by analysing a motion of the direction.
8. The apparatus as claimed in any of claims 3 to 5, wherein the at least one spatial parameter comprises a diffuseness associated with the at least one audio signal, wherein the apparatus caused to estimate a direction smoothness parameter based on the at least one spatial parameter is caused to determine the direction smoothness parameter by applying an exponent to a difference between unity and the diffuseness.
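The three estimators of claims 6 to 8 may be illustrated by the non-limiting sketches below; the exponent value and the motion-to-smoothness mapping are assumptions made for these sketches.

    def smoothness_from_energy_ratio(ratio, p=2.0):
        # Claim 6: exponent applied to the direct-to-total energy ratio.
        return ratio ** p

    def smoothness_from_direction(azimuth_deg, prev_azimuth_deg):
        # Claim 7: analyse the motion of the direction; a rapidly moving
        # direction yields a low smoothness. Wrap-around is handled so that
        # 350 and 10 degrees are 20 degrees apart, not 340.
        delta = abs((azimuth_deg - prev_azimuth_deg + 180.0) % 360.0 - 180.0)
        return max(0.0, 1.0 - delta / 180.0)

    def smoothness_from_diffuseness(psi, p=2.0):
        # Claim 8: exponent applied to the difference between unity and the
        # diffuseness psi.
        return (1.0 - psi) ** p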
9. The apparatus as claimed in any of claims 1 to 8, wherein the apparatus caused to receive at least one audio signal is caused to perform at least one of:
receive the at least one audio signal from at least one microphone within a microphone array;
determine the at least one audio signal from multichannel loudspeaker audio signals; and
receive the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
10. The apparatus as claimed in any of claims 1 to 9, wherein the apparatus caused to determine at least one spatial parameter associated with the at least one audio signal is caused to perform at least one of:
analyse the at least one audio signal to determine the at least one spatial parameter; and
receive the at least one spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
11. A method for spatial audio signal processing comprising:
receiving at least one audio signal;
determining at least one spatial parameter associated with the at least one audio signal;
generating an adaptive smoothing parameter based on the at least one spatial parameter;
determining panning gains for applying to a first part of the at least one audio signal;
applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and
applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
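Tying the method steps of claim 11 together, a hypothetical per-band frame loop reusing the helpers (and the numpy import) sketched with the apparatus claims may read as below; every input name and initial value is invented for illustration.

    state_gains = np.ones(5) / np.sqrt(5.0)   # assumed 5-channel layout
    state_avg, state_energy = 0.5, 0.0
    for f in frames:                          # 'frames' is an assumed input
        s = smoothness_from_energy_ratio(f.energy_ratio)
        state_avg = average_smoothness(s, state_avg, f.energy, state_energy)
        alpha = smoothness_to_smoothing(state_avg)
        positioned, state_gains = position_direct_part(
            f.direct, f.panning_gains, state_gains, alpha)
        state_energy = f.energy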
12. The method as claimed in claim 11, further comprising:
applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and
combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
13. The method as claimed in any of claims 11 and 12, wherein generating an adaptive smoothing parameter based on the at least one spatial parameter comprises:
estimating a direction smoothness parameter based on the at least one spatial parameter; and
converting the direction smoothness parameter to the adaptive smoothing parameter.
14. The method as claimed in claim 13, wherein generating an adaptive smoothing parameter based on the at least one spatial parameter comprises:
estimating an energy of the at least one audio signal; and
averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein converting the direction smoothness parameter to the adaptive smoothing parameter comprises converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
15. The method as claimed in claim 14, wherein averaging the direction smoothness parameter based on the energy of the at least one audio signal comprises:
determining an averaging parameter based on the energy of the at least one audio signal; and
applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to the previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
16. The method as claimed in any of claims 13 to 15, wherein the at least one spatial parameter comprises an energy ratio associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter comprises determining the direction smoothness parameter by applying an exponent to the energy ratio.
17. The method as claimed in any of claims 13 to 15, wherein the at least one spatial parameter comprises a direction associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter comprises determining the direction smoothness parameter by analysing a motion of the direction.
18. The method as claimed in any of claims 13 to 15, wherein the at least one spatial parameter comprises a diffuseness associated with the at least one audio signal, wherein estimating a direction smoothness parameter based on the at least one spatial parameter comprises determining the direction smoothness parameter by applying an exponent to a difference between unity and the diffuseness.
19. The method as claimed in any of claims 11 to 18, wherein receiving at least one audio signal comprises performing at least one of:
receiving the at least one audio signal from at least one microphone within a microphone array;
determining the at least one audio signal from multichannel loudspeaker audio signals; and
receiving the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
20. The method as claimed in any of claims 11 to 19, wherein determining at least one spatial parameter associated with the at least one audio signal comprises at least one of:
analysing the at least one audio signal to determine the at least one spatial parameter; and
receiving the at least one spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the at least one spatial parameter.
PCT/FI2019/050178 2018-03-13 2019-03-07 Temporal spatial audio parameter smoothing WO2019175472A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP19767481.5A EP3766262B1 (en) 2018-03-13 2019-03-07 Spatial audio parameter smoothing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1803993.3 2018-03-13
GB1803993.3A GB2571949A (en) 2018-03-13 2018-03-13 Temporal spatial audio parameter smoothing

Publications (1)

Publication Number Publication Date
WO2019175472A1 true WO2019175472A1 (en) 2019-09-19

Family

ID=61972940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2019/050178 WO2019175472A1 (en) 2018-03-13 2019-03-07 Temporal spatial audio parameter smoothing

Country Status (3)

Country Link
EP (1) EP3766262B1 (en)
GB (1) GB2571949A (en)
WO (1) WO2019175472A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11545166B2 (en) 2019-07-02 2023-01-03 Dolby International Ab Using metadata to aggregate signal processing operations
EP4178231A1 (en) * 2021-11-09 2023-05-10 Nokia Technologies Oy Spatial audio reproduction by positioning at least part of a sound field

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2593419A (en) * 2019-10-11 2021-09-29 Nokia Technologies Oy Spatial audio representation and rendering
TW202123220A (en) * 2019-10-30 2021-06-16 美商杜拜研究特許公司 Multichannel audio encode and decode using directional metadata
JP2023549033A (en) * 2020-10-09 2023-11-22 フラウンホーファー-ゲゼルシャフト・ツール・フェルデルング・デル・アンゲヴァンテン・フォルシュング・アインゲトラーゲネル・フェライン Apparatus, method or computer program for processing encoded audio scenes using parametric smoothing

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090067634A1 (en) 2007-08-13 2009-03-12 Lg Electronics, Inc. Enhancing Audio With Remixing Capability
US20130329922A1 (en) 2012-05-31 2013-12-12 Dts Llc Object-based audio system using vector base amplitude panning
WO2014162171A1 (en) 2013-04-04 2014-10-09 Nokia Corporation Visual audio processing apparatus
JP2015080119A (en) 2013-10-17 2015-04-23 ヤマハ株式会社 Sound image localization device
EP2942981A1 (en) 2014-05-05 2015-11-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. System, apparatus and method for consistent acoustic scene reproduction based on adaptive functions
US9820072B2 (en) * 2012-08-31 2017-11-14 Helmut-Schmidt-Universität Universität der Bundeswehr Hamburg Producing a multichannel sound from stereo audio signals
WO2018213159A1 (en) * 2017-05-15 2018-11-22 Dolby Laboratories Licensing Corporation Methods, systems and apparatus for conversion of spatial audio format(s) to speaker signals

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983922B2 (en) * 2005-04-15 2011-07-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for generating multi-channel synthesizer control signal and apparatus and method for multi-channel synthesizing
AU2012279349B2 (en) * 2011-07-01 2016-02-18 Dolby Laboratories Licensing Corporation System and tools for enhanced 3D audio authoring and rendering
CN105336335B (en) * 2014-07-25 2020-12-08 杜比实验室特许公司 Audio object extraction with sub-band object probability estimation
US10045145B2 (en) * 2015-12-18 2018-08-07 Qualcomm Incorporated Temporal offset estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LEE, KANGEUN ET AL.: "Immersive Virtual Sound Beyond 5.1 Channel Audio", AES 128TH CONVENTION, 22 May 2010 (2010-05-22), London, UK, XP055726075, ISBN: 978-1-61738-773-9, Retrieved from the Internet <URL:http://www.aes.org/e-lib/browse.cfm?elib=15414> [retrieved on 20190429] *
See also references of EP3766262A4

Also Published As

Publication number Publication date
GB201803993D0 (en) 2018-04-25
EP3766262B1 (en) 2022-11-23
EP3766262A1 (en) 2021-01-20
GB2571949A (en) 2019-09-18
EP3766262A4 (en) 2021-11-10

Similar Documents

Publication Publication Date Title
US11343630B2 (en) Audio signal processing method and apparatus
US10469978B2 (en) Audio signal processing method and device
US10785589B2 (en) Two stage audio focus for spatial audio processing
US20240007814A1 (en) Determination Of Targeted Spatial Audio Parameters And Associated Spatial Audio Playback
EP3766262B1 (en) Spatial audio parameter smoothing
US9584235B2 (en) Multi-channel audio processing
CN112219236A (en) Spatial audio parameters and associated spatial audio playback
CN112567765B (en) Spatial audio capture, transmission and reproduction
US20220369061A1 (en) Spatial Audio Representation and Rendering
US20240089692A1 (en) Spatial Audio Representation and Rendering
US20220174443A1 (en) Sound Field Related Rendering
US11956615B2 (en) Spatial audio representation and rendering
WO2022258876A1 (en) Parametric spatial audio rendering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19767481

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019767481

Country of ref document: EP

Effective date: 20201013