EP3766262B1 - Spatial audio parameter smoothing - Google Patents

Spatial audio parameter smoothing

Info

Publication number
EP3766262B1
Authority
EP
European Patent Office
Prior art keywords
audio signal
parameter
spatial
generate
direction smoothness
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP19767481.5A
Other languages
German (de)
French (fr)
Other versions
EP3766262A1 (en)
EP3766262A4 (en)
Inventor
Mikko-Ville Laitinen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of EP3766262A1 publication Critical patent/EP3766262A1/en
Publication of EP3766262A4 publication Critical patent/EP3766262A4/en
Application granted granted Critical
Publication of EP3766262B1 publication Critical patent/EP3766262B1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S3/00Systems employing more than two channels, e.g. quadraphonic
    • H04S3/008Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26Pre-filtering or post-filtering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S1/00Two-channel systems
    • H04S1/007Two-channel systems in which the audio signals are in digital form
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • H04S7/302Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303Tracking of listener position or orientation
    • H04S7/304For headphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L2019/0001Codebooks
    • G10L2019/0012Smoothing of parameters of the decoder interpolation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15Aspects of sound capture and related signal processing for recording or reproduction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2420/00Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03Application of parametric coding in stereophonic audio systems

Definitions

  • The present application relates to apparatus and methods for temporal spatial audio parameter smoothing. This includes, but is not limited to, sound reproduction systems and sound reproduction methods producing multichannel audio outputs.
  • Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters.
  • parameters such as directions of the sound in frequency bands, and the ratio parameters expressing relative energies of the directional and non-directional parts of the captured sound in frequency bands.
  • These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array.
  • These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
  • the directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the proportion of the sound energy that is directional) can be also utilized as the spatial metadata for an audio codec.
  • these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • the stereo signal could be encoded, for example, with an AAC encoder.
  • a decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
  • Patent application publication US2009067634 discloses modifying spatial audio parameters associated with one or more audio objects of a stereo or multichannel audio signal to provide remixing capabilities.
  • Patent application publication WO2014162171 discloses a spatial audio analyser configured to determine an audio source with a location associated with a visual image element, and an audio processor arranged to change an audio characteristic of the audio source in response to a control input.
  • Patent application publication EP2942981 discloses an audio signal processing system for consistent acoustic scene reproduction based on informed spatial filtering.
  • Patent application publication US2013329922 discloses using vector base amplitude panning (VBAP) for playing back an object's audio and using the positioning of sound reproduction devices and the object's location information to determine which sound reproduction devices are used for playing back the object's audio.
  • Patent application publication JP2015080119 discloses a method of improving the degree of freedom when calculating a panning coefficient for sound image localisation within a three-dimensional space.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • the spatial sound source is a microphone array.
  • the spatial sound source may be a 5.1 multichannel or other format multi-channel mix or Ambisonics signals.
  • Parametric spatial audio capture refers to adaptive DSP-driven audio capture methods covering 1) analysing perceptually relevant parameters in frequency bands, for example, the directionality of the propagating sound at the recording position, and 2) reproducing spatial sound in a perceptual sense at the rendering side according to the estimated spatial parameters.
  • the reproduction can be, for example, for headphones or multichannel loudspeaker setups.
  • Parametric spatial audio capture (SPAC) methods may employ these determined parameters, such as directions of the sound in frequency bands and the ratios between the directional and non-directional parts of the captured sound in frequency bands, to describe the perceptual spatial properties of the captured sound at the position of the microphone array, and may use these parameters in the synthesis of the spatial sound.
  • As the spatial properties are estimated from the sound field, they can significantly fluctuate over time and frequency, e.g., due to reverberation and/or multiple simultaneous sound sources.
  • parametric spatial audio processing methods typically utilize smoothing in the synthesis, in order to avoid possible artefacts caused by rapidly fluctuating parameters (these artefacts are typically referred to as "musical noise").
  • Similar parametrization may also be used for the compression of spatial audio, e.g., from 5.1 multichannel signals.
  • The parameters are estimated from the input loudspeaker signals. Nevertheless, the parameters typically fluctuate in this case as well. Hence, temporal smoothing is also needed with loudspeaker input.
  • the spatial parameters are determined in the time-frequency domain, i.e., each parameter value is associated with a certain frequency band and temporal frame.
  • Examples of possible spatial parameters include (but are not limited to):
  • the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels).
  • the audio and the parameters may be processed and/or transmitted/stored in between the analysis and the synthesis.
  • The parametric spatial audio processing methods are often based on analysing the direction of arrival (and other spatial parameters) in frequency bands. If there were a single sound source in anechoic conditions, the direction would stably point to the sound source at all frequencies. However, in typical acoustic environments, the microphones also capture sounds other than the sound source, such as reverberation and ambient sounds. Moreover, there may be multiple simultaneous sources. As a result, the estimated directions typically fluctuate significantly over time and the estimates differ between frequency bands.
  • Parametric spatial audio processing methods such as employed in embodiments as described in further detail hereafter synthesize the spatial sound based on the analysed parameters (such as the aforementioned direction) and related audio signals (e.g., 2 captured microphone signals).
  • Vector base amplitude panning (VBAP) computes gains for a subset of loudspeakers based on the direction, and the audio signal is multiplied with these gains and fed to these loudspeakers.
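The gain computation above can be sketched for the two-dimensional (loudspeaker-pair) case as follows. This is an illustrative implementation, not the patent's own; the loudspeaker angles and the energy-preserving normalisation are assumptions:

```python
import math

def vbap_2d(source_deg, spk1_deg, spk2_deg):
    """Compute 2-D VBAP gains for a source between two loudspeakers.

    Solves g1*l1 + g2*l2 = p, where l1 and l2 are the unit vectors of
    the two loudspeakers and p is the unit vector of the source
    direction, then normalises so that g1^2 + g2^2 = 1.
    """
    def unit(deg):
        a = math.radians(deg)
        return (math.cos(a), math.sin(a))

    p = unit(source_deg)
    l1 = unit(spk1_deg)
    l2 = unit(spk2_deg)

    # Invert the 2x2 matrix whose columns are the loudspeaker vectors
    det = l1[0] * l2[1] - l2[0] * l1[1]
    g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
    g2 = (p[1] * l1[0] - p[0] * l1[1]) / det

    norm = math.hypot(g1, g2)
    return g1 / norm, g2 / norm
```

A source lying exactly in one loudspeaker's direction yields gains of approximately (1, 0), so the sound is fed to that loudspeaker only; a source midway between the pair yields equal gains.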
  • the concept as discussed hereafter proposes apparatus and methods to adapt the smoothing needed in the synthesis of spatial sound in parametric spatial audio processing in order to have quality audio output with different types of sound scenes.
  • the embodiments as described hereafter relate to parametric spatial audio processing where a solution is provided to improve the temporal smoothing processing needed in the synthesis of spatial audio in the aforementioned parametric spatial audio processing and where the temporal smoothing is improved by analysing the required amount of smoothing adaptively.
  • the analysis being related to the stability of the direction-related parameter(s) and producing a measure of directional stability and determining the time coefficients of the temporal smoothing based on the measure of directional stability.
  • the direction-related parameter may as described in further detail in the embodiments hereafter refer to a direction.
  • the amount of smoothing can be analysed using the direct-to-total energy ratio.
  • the value of the energy ratio is monitored over time, and where it is constantly high, the time coefficient of the smoothing can be set smaller (less smoothing applied).
  • the time coefficient can be set to a default value (more smoothing applied).
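A minimal sketch of this ratio-monitoring logic follows. The threshold and the coefficient values are illustrative assumptions, using the convention stated later in the text that a larger coefficient means less smoothing:

```python
def choose_smoothing_coefficient(recent_ratios, high_threshold=0.8,
                                 beta_fast=0.4, beta_default=0.1):
    """Select a smoothing coefficient for one frequency band.

    recent_ratios holds the direct-to-total energy ratios r(k, n) of
    the last few frames.  If the ratio has been constantly high, the
    directions are assumed stable, so a larger coefficient (less
    smoothing) is used; otherwise a default value (more smoothing).
    """
    if recent_ratios and all(r >= high_threshold for r in recent_ratios):
        return beta_fast
    return beta_default
```

With a run of consistently high ratios the fast coefficient is returned; a single low-ratio frame in the window falls back to the default.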
  • A block diagram of an example system for implementing some embodiments is shown in Figure 1.
  • Figure 1 shows an example capture device 101.
  • the capture device may be a VR capture device, a mobile phone or any other suitable electronic apparatus comprising one or more microphone arrays.
  • the capture device 101 thus in some embodiments comprises microphones 100₁, 100₂.
  • the microphone audio signals 102 captured by the microphones 100₁, 100₂ may be stored and later processed, or directly processed.
  • An analysis processor 103 may receive the microphone audio signals 102 from the capture device 101.
  • the analysis processor 103 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs).
  • the capture device 101 and the analysis processor 103 are implemented on the same apparatus or device.
  • Based on the microphone-array signals, the analysis processor creates a data stream 104.
  • the data stream may comprise transport audio signals and spatial metadata (e.g., directions and energy ratios in frequency bands).
  • the data stream 104 may be transmitted or stored for example within some storage 105 such as memory, or alternatively directly processed in the same device.
  • a synthesis processor 107 may receive the data stream 104.
  • the synthesis processor 107 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the synthesis processor can be configured to produce output audio signals.
  • the output signals can be binaural signals 109.
  • the output signals can be multi-channel signals.
  • the headphones 111 or other playback apparatus may be configured to receive the output of the synthesis processor 107 and output the audio signals in a format suitable for listening.
  • the initial operation is the capture (or otherwise input) of the audio signals as shown in Figure 2 by step 201.
  • Having captured the audio signals, they are analysed to generate the data stream as shown in Figure 2 by step 203.
  • the data stream may then be transmitted and received (or stored and retrieved) as shown in Figure 2 by step 205.
  • the output may be synthesized based at least on the data stream as shown in Figure 2 by step 207.
  • the synthesized audio signal output signals may then be output to a suitable output such as headphones as shown in Figure 2 by step 209.
  • Figure 3 shows an example analysis processor 103 such as shown in Figure 1.
  • the input to the analysis processor 103 are the microphone array signals 102.
  • a transport audio signal generator 301 may be configured to receive the microphone array signals 102 and create the transport audio signals.
  • the transport audio signals are selected from the microphone array signals.
  • the microphone array signals may be downmixed to generate the transport audio signals.
  • the transport audio signals may be obtained by processing the microphone array signals.
  • the transport audio signal generator 301 may be configured to generate any suitable number of transport audio signals (or channels), for example in some embodiments the transport audio signal generator 301 is configured to generate two transport audio signals. In some embodiments the transport audio signal generator 301 is further configured to compress the audio signals. For example in some embodiments the audio signals may be compressed using an advanced audio coding (AAC) or enhanced voice services (EVS) compression coding.
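The downmix step above can be sketched as follows. The mixing rule is a deliberately naive placeholder (averaging halves of the array), purely for illustration; a real capture pipeline would select or beamform channels before any AAC/EVS coding:

```python
def downmix_to_two_transport(mic_signals):
    """Downmix N microphone channels into two transport channels.

    mic_signals is a list of equal-length sample lists.  The left
    transport channel averages the first half of the microphones and
    the right channel averages the second half.
    """
    n = len(mic_signals)
    assert n >= 2, "need at least two microphone channels"
    half = n // 2
    length = len(mic_signals[0])
    left = [sum(ch[i] for ch in mic_signals[:half]) / half
            for i in range(length)]
    right = [sum(ch[i] for ch in mic_signals[half:]) / (n - half)
             for i in range(length)]
    return left, right
```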
  • the analysis processor 103 comprises a spatial analyser 303.
  • the spatial analyser 303 is also configured to receive the microphone array signals 102 and generate metadata 304 based on a spatial analysis of the microphone array signals.
  • the spatial analyser 303 may be configured to determine any suitable spatial metadata parameter.
  • spatial metadata parameters determined include (but are not limited to): direction and direct-to-total energy ratio; direction and diffuseness; inter-channel level difference, inter-channel phase difference, and inter-channel coherence. In some embodiments these parameters are determined in the time-frequency domain. It should be noted that parametrizations other than those presented above may also be used.
  • the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels).
  • the metadata 304 comprises directions 306 and energy ratios 308.
  • the metadata may be compressed and/or quantized.
  • the analysis processor 103 may furthermore comprise a multiplexer or mux 305 which is configured to receive the metadata 304 and the transport audio signals 302 and generate a combined data stream 104.
  • the combination may be any suitable combination.
  • the input to the analysis processor 103 can also be other types of audio signals, such as multichannel loudspeaker signals, audio objects, or Ambisonic signals.
  • the exact implementation of the analysis processor may be any suitable implementation (as indicated above, a computer running suitable software, FPGAs, ASICs, etc.) caused to produce the transport audio signals and the spatial metadata in the time-frequency domain.
  • the initial operation is receiving the microphone array audio signals as shown in Figure 4 by step 401.
  • Having received the microphone audio signals, they are analysed to generate the transport audio signals (for example by selection, downmixing or other processing) as shown in Figure 4 by step 403.
  • microphone audio signals are spatially analysed to generate the metadata, for example the directions and energy ratios as shown in Figure 4 by step 405.
  • the metadata and the transport audio signals may then be combined to generate the data stream as shown in Figure 4 by step 407.
  • With respect to Figure 5, an example synthesis processor 107 (as shown in Figure 1) according to some embodiments is shown.
  • a demultiplexer, or demux, 501 is configured to receive the data stream 104 and caused to demultiplex the data stream into transport audio signals 502 and metadata 504.
  • the demultiplexer is furthermore caused to decode the audio signals.
  • the metadata in some embodiments is in the time-frequency domain, and comprises parameters such as directions θ(k,n) 506 and direct-to-total energy ratios r(k,n) 508, where k is the frequency band index and n the temporal frame index.
  • the demultiplexed data is furthermore decompressed/dequantized to attempt to regenerate the originally determined parameters.
  • a spatial synthesizer 503 is configured to receive the transport audio signals 502 and the metadata and caused to generate the multichannel output signals 510 such as the binaural output signals 109 shown in Figure 1 .
  • the initial operation is receiving the data stream as shown in Figure 6 by step 601.
  • the multichannel (binaural or otherwise) output signals may then be synthesized from the transport audio signals and the metadata as shown in Figure 6 by step 605.
  • the multichannel (binaural or otherwise) output signals may then be output as shown in Figure 6 by step 607.
  • the input to the spatial synthesizer 503 is in some embodiments the transport audio signals 502 and furthermore the metadata 504 (which may include the energy ratios 508 and the directions 506).
  • the transport audio signals 502 are transformed to the time-frequency domain using a suitable transformer.
  • a short-time Fourier transformer (STFT) 701 is configured to apply a short-time Fourier transform to the transport audio signals to generate suitable time-frequency domain audio signals S_i(k,n) 700.
  • any suitable time-frequency transformer may be used, for example a quadrature mirror filterbank (QMF).
  • a divider 705 may receive the time-frequency domain audio signals S_i(k,n) 700 and the energy ratios 508, and divide the audio signals into ambient and direct parts using the energy ratio r(k,n) 508.
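A split along these lines can be sketched per time-frequency tile. The square-root weighting is a common energy-preserving choice, not a formula stated in the text:

```python
import math

def split_direct_ambient(S, r):
    """Split one time-frequency tile into direct and ambient parts.

    S is the signal value for band k, frame n (may be complex) and r
    is the direct-to-total energy ratio r(k, n) in [0, 1].  The sqrt
    factors preserve energy:
        |sqrt(r)*S|^2 + |sqrt(1-r)*S|^2 == |S|^2
    """
    direct = math.sqrt(r) * S
    ambient = math.sqrt(1.0 - r) * S
    return direct, ambient
```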
  • a smoothing coefficients determiner 703 may also receive the time-frequency domain audio signals S_i(k,n) 700 and the energy ratios 508 and determine suitable smoothing coefficients 706.
  • Figure 7b differs from the example spatial synthesizer shown in Figure 7a in that the smoothing coefficients determiner in Figure 7b is caused to receive the time-frequency domain audio signals S_i(k,n) 700 and the directions 506.
  • the smoothing coefficients determiner 703 may be configured to adaptively determine the smoothing coefficient(s) β(k,n).
  • a panning gain determiner 715 may be configured to receive the directions 506 and based on the output speaker/headphone configuration and the directions determine suitable panning gains 708.
  • the amplitude panning gains may be computed in any suitable manner, for example using vector base amplitude panning (VBAP) based on the received direction θ(k,n).
  • any suitable smoothing may be applied.
  • the smoothing 'filter' may therefore be of higher order and, similarly, the smoothing coefficient β(k,n) may be a vector value.
  • the actual value(s) of β may depend on the filterbank and are typically frequency-dependent (a typical value is, e.g., 0.1). In general, the larger the value, the less smoothing is applied.
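A first-order IIR smoothing of the panning gains, with a larger coefficient meaning less smoothing, can be sketched as follows (the symbol is written `beta` here; the gain sequence and values are illustrative):

```python
def smooth_gains(gains, beta):
    """First-order IIR smoothing of a per-frame panning-gain sequence:

        g_s(n) = beta * g(n) + (1 - beta) * g_s(n - 1)

    A larger beta tracks the input faster (less smoothing); a smaller
    beta smooths more.  The state is initialised to the first gain.
    """
    smoothed = []
    prev = gains[0]
    for g in gains:
        prev = beta * g + (1.0 - beta) * prev
        smoothed.append(prev)
    return smoothed
```

After a step change in the input gain, the smoothed gain moves only a fraction beta of the remaining distance each frame, which is what suppresses the rapid fluctuations behind "musical noise".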
  • a decorrelator 707 is configured to receive the ambient audio signal part 702 and process it to make it perceived as being surrounding, for example by decorrelating and spreading the ambient audio signal part 702 across the audio scene.
  • a positioner 709 is configured to receive the direct audio signal part 704 and the smoothed panning gains 710 and position the direct audio signal part 704 using a suitable positioning, for example using the smoothed panning gains and an amplitude panning operation.
  • a merger 711 or other suitable combiner is configured to receive the spread ambient signal part from the decorrelator 707 and the positioned direct audio signals part from the positioner 709 and combine or merge these resulting audio signals.
  • An inverse short-time Fourier transformer (Inverse STFT) 713 is configured to receive the combined audio signals and apply an inverse short-time Fourier transform (or other suitable frequency to time domain transform) to generate the multi-channel audio signals 510 which may be passed to a suitable output device such as the headphones or multi-channel loudspeaker setup.
  • the panning gains are determined directly from the direction metadata, and the "direct sound" is also positioned with these gains after smoothing.
  • the panning gains are not directly determined from the direction metadata, but instead determined indirectly.
  • the smoothing of these gains as described above may be applied to any suitably generated gains.
  • the directions may be used (together with the energy ratios and transport audio signals) to determine a target energy distribution of the output multichannel signals.
  • the target energy distribution may be compared to the energy distribution of the transport audio signals (or to the energy distribution of intermediate signals obtained from the transport audio signals by mixing).
  • Panning gains or any gains that position audio may be obtained as a ratio of these values and the "Smoother" 717 may be applied to these gains.
  • the panning gains may be generated by one of many optional methods and then smoothed according to the methods described herein.
  • the spatial synthesizer in some embodiments is configured to receive the transport audio signals as shown in Figures 8a and 8b by step 801.
  • the spatial synthesizer in some embodiments is furthermore configured to receive the energy ratios as shown in Figures 8a and 8b by step 803.
  • the spatial synthesizer in some embodiments is also configured to receive the directions as shown in Figures 8a and 8b by step 805.
  • the received transport audio signals are in some embodiments converted into a time-frequency domain form, for example by applying a suitable time-frequency domain transform to the transport audio signals as shown in Figures 8a and 8b by step 807.
  • the time-frequency domain audio signals may then in some embodiments be divided into ambient and direct parts (based on the energy ratios) as shown in Figures 8a and 8b by step 813.
  • smoothing coefficients may be determined based on the energy ratios and the time-frequency domain audio signals as shown in Figure 8a by step 811.
  • smoothing coefficients may be determined based on the directions and the time-frequency domain audio signals as shown in Figure 8b by step 851.
  • Panning gains may be determined based on the received directions as shown in Figures 8a and 8b by step 809.
  • a series of smoothed panning gains may be determined based on the determined panning gains and the smoothing coefficients as shown in Figures 8a and 8b by step 817.
  • the ambient audio signal part may be decorrelated as shown in Figures 8a and 8b by step 815.
  • the positional component of the audio signals may then be determined based on the smoothed panning gains and the direct audio signal part as shown in Figures 8a and 8b by step 819.
  • a positional component of the audio signals or positioned audio signal can be a number of audio signals which are combined to produce a virtual sound source positioned in a three dimensional space.
  • the positional component of the audio signals and the decorrelated ambient audio signal may then be combined or merged as shown in Figures 8a and 8b by step 821.
  • the combined audio signals may then be inverse time-frequency domain transformed to generate the multichannel audio signals in a suitable format to be output as shown in Figures 8a and 8b by step 823.
  • the smoothing coefficients determiner 703 is configured to generate values which may be used to smooth the panning gains in order to avoid "musical noise" artefacts.
  • the inputs to the smoothing coefficients determiner 703 are shown as the time-frequency domain audio signals 700 and the energy ratios 508.
  • a direction smoothness estimator 903 is configured to estimate a direction smoothness φ(k,n).
  • this direction smoothness may be estimated or determined from the energy ratios r(k,n) 508.
  • the direction smoothness value φ(k,n) 904 can be estimated by calculating the fluctuation of the direction values.
  • a circular variance of the directions θ(k,n) is determined and this is used as the basis of a direction smoothness.
  • any suitable analysis of the temporal fluctuation of the directions may be used to determine the direction smoothness estimate.
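One way to realise such a measure (a sketch; the window of recent frames and the use of azimuth-only directions are assumptions) is the mean resultant length of the recent direction estimates, whose complement is the circular variance:

```python
import math

def direction_smoothness(azimuths_deg):
    """Direction smoothness from the circular statistics of azimuths.

    The mean resultant length R of the unit vectors pointing to the
    analysed directions is 1 when all directions agree and tends to 0
    when they fluctuate; the circular variance is 1 - R, so R itself
    serves as a smoothness measure in [0, 1].
    """
    x = sum(math.cos(math.radians(a)) for a in azimuths_deg)
    y = sum(math.sin(math.radians(a)) for a in azimuths_deg)
    return math.hypot(x, y) / len(azimuths_deg)
```

Identical directions give a smoothness of 1, while directions scattered to opposite sides cancel and give a value near 0.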
  • An average direction smoothness estimator 905 is configured to receive the energy 902 and direction smoothness estimates 904 and determine an average over time (and in some embodiments over frequency).
  • a direction smoothness estimates to smoothing coefficients converter may receive the averaged direction smoothness estimate φ'(k,n) 906 and generate the smoothing coefficients β(k,n).
  • β_fast may be, e.g., 0.4, and β_slow may be, e.g., 0.1.
  • fast and slow coefficients may depend on the actual implementation and may be frequency-dependent.
  • the smoothing coefficients may be a vector instead of a single value. This for example may occur when the smoothing is other than a first-order IIR smoothing.
  • These embodiments may therefore implement "fast settings" and "slow settings" which are interpolated based on the averaged direction smoothness estimates. In such embodiments these "settings" may depend on the implementation, for example whether the coefficient is a single value or a vector of values.
  • the smoothing coefficients β(k,n) 706 may then be output.
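The conversion from the averaged smoothness estimate to a smoothing coefficient can be sketched as a linear interpolation between the fast and slow values. The linear form and the clamping are assumptions; 0.4 and 0.1 are the example values from the text:

```python
def smoothing_coefficient(avg_smoothness, beta_fast=0.4, beta_slow=0.1):
    """Map an averaged direction smoothness in [0, 1] to a coefficient.

    High smoothness (stable directions) selects the fast coefficient
    (less smoothing); low smoothness selects the slow coefficient
    (more smoothing).  Values in between are interpolated linearly.
    """
    phi = min(max(avg_smoothness, 0.0), 1.0)  # clamp to [0, 1]
    return phi * beta_fast + (1.0 - phi) * beta_slow
```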
  • the time-frequency domain audio signals are received as shown in Figure 10 by step 1001.
  • the estimate of the energy (of the audio signals) may be determined based on the time-frequency domain audio signals as shown in Figure 10 by step 1005.
  • Furthermore, the estimate of the direction smoothness is determined based on the energy ratios (or based on any other suitable parameter, such as an analysis of the directions) as shown in Figure 10 by step 1007.
  • the estimate of the average direction smoothness is then determined based on the energy estimate and the direction smoothness estimates as shown in Figure 10 by step 1009.
  • the smoothness coefficients are then output as shown in Figure 10 by step 1013.
  • the smoothness coefficients were determined based on the use of the spatial metadata (for example the metadata generated in the analysis processor such as found within a spatial audio capture (SPAC) which generates directions and direct-to-total energy ratios).
  • SPAC spatial audio capture
  • the above methods can be modified without inventive skill to be used with any method utilizing similar parameters.
  • Some of the advantages of the proposed embodiments are that a significant amount of smoothing can be applied with typical sound scenes, and thus musical noise artefacts are avoided. Furthermore, when the sound scene does not require as much smoothing, the amount of smoothing applied can be reduced and thus the reproduction can react faster to changes in the sound field.
  • Figure 11 shows three graph traces showing a reference audio signal, a reproduction using a fixed smoothing coefficient and a reproduction using an adaptive smoothing coefficient.
  • the sound scene contains two sources located in different directions in anechoic conditions. The sound was rendered to a multichannel setup.
  • the reference graph trace 1101 shows the signal of one output channel of the audio signal and shows the first source 1103 but not the other source.
  • the adaptive smoothing example 1121, having analysed that the directions are stable and there is not as much need for temporal smoothing, is configured to set the smoothing to a faster mode, and the sound source is not reproduced from the wrong channel. In such a manner the reproduction is perceived to react quickly to changes in the direction.
  • the implementation can be by software, for example running on a mobile phone (or a computer) 1200.
  • the software running inside the mobile phone 1200 may be configured to receive an encoded bitstream (it may have been e.g., transmitted real-time or it may have been stored to the device).
  • the bitstream can also be any other suitable bitstream.
  • a demultiplexer 1203 (DEMUX) is configured to demultiplex the bitstream into an audio bitstream 1204 and a spatial metadata bitstream 1206.
  • An enhanced voice services (EVS) or other encoded bitstream decoder 1205 (or any decoder that corresponds to the utilized codec) is configured to extract the transport audio signals 1206 from the audio bitstream.
  • a metadata decoder 1207 is used to decompress the spatial metadata 1208, for example comprising the directions 1210 and energy ratios 1212.
  • the spatial synthesiser 1209 (similar to the spatial synthesizer in the embodiments above) is configured to receive transport audio signals 1206 and the metadata 1208 and output multichannel loudspeaker signals 1211 that may be reproduced using a multichannel loudspeaker setup. In some embodiments the spatial synthesizer 1209 is configured to generate binaural audio signals that may be reproduced using headphones.
  • a microphone array 1301 for example part of a mobile phone, is configured to capture audio signals 1302.
  • the captured microphone array audio signals 1302 may be processed by software 1300 running inside the mobile phone.
  • the software 1300 may be an analysis processor configured to analyse the captured microphone array signals 1302 in a manner such as described with respect to Figures 1 and 3 and is configured to generate spatial metadata 1304 (comprising directions 1306 and energy ratios 1308).
  • a synthesis processor 1305 which is configured to receive the spatial metadata 1304 from the analysis processor 1303 along with the captured microphone array audio signals 1302 (or alternatively a subset or a processed set of the microphone signals).
  • the synthesis processor 1305 may operate in a manner similar to the synthesis processor as described with respect to Figures 1 and 5, 7a, 7b and 9.
  • the synthesis processor 1305 may be configured to output a multichannel audio signal (for example a binaural signal or a surround loudspeaker signal or Ambisonic signal).
  • the multichannel audio signals 1307 can therefore be listened to directly (when fed to headphones or loudspeakers, or reproduced using an Ambisonic renderer), stored (with any suitable codec) and/or transmitted to a remote device.
  • Although a codec use implementation is described above, it is noted that some embodiments may be used with any suitable codec that utilizes smoothing and can provide information on the smoothness of the direction-related parameters.
  • the proposed method can also be applied in any kind of spatial audio processing which operates in time-frequency domain.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • UMTS universal mobile telecommunications system
  • WLAN wireless local area network
  • IRDA infrared data communication pathway
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Description

    Field
  • The present application relates to apparatus and methods for temporal spatial audio parameter smoothing. This includes, but is not limited to, sound reproduction systems and sound reproduction methods producing multichannel audio outputs.
  • Background
  • Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters. For example, in parametric spatial audio capture from microphone arrays, it is a typical and an effective choice to estimate from the microphone array signals a set of parameters such as directions of the sound in frequency bands, and the ratio parameters expressing relative energies of the directional and non-directional parts of the captured sound in frequency bands. These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array. These parameters can be utilized in synthesis of the spatial sound accordingly, for headphones binaurally, for loudspeakers, or to other formats, such as Ambisonics.
  • The directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • A parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the proportion of the sound energy that is directional) can be also utilized as the spatial metadata for an audio codec. For example, these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata. The stereo signal could be encoded, for example, with an AAC encoder. A decoder can decode the audio signals into PCM signals, and process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
  • Patent application publication US2009067634 discloses modifying spatial audio parameters associated with one or more audio objects of a stereo or multichannel audio signal to provide remixing capabilities.
  • Patent application publication WO2014162171 discloses a spatial audio analyser configured to determine an audio source with a location associated with a visual image element, and an audio processor arranged to change an audio characteristic of the audio source in response to a control input.
  • Patent application publication EP2942981 discloses an audio signal processing system for consistent acoustic scene reproduction based on informed spatial filtering.
  • Patent application publication US2013329922 discloses using vector base amplitude panning (VBAP) for playing back an object's audio and using the positioning of sound reproduction devices and the object's location information to determine which sound reproduction devices are used for playing back the object's audio.
  • Patent application publication JP2015080119 discloses a method of improving the degree of freedom when calculating a panning coefficient for sound image localisation within a three-dimensional space.
  • Summary
  • There is provided according to a first aspect an apparatus for spatial audio signal processing, as set forth in independent claim 1.
  • According to a second aspect there is provided a method for spatial audio signal processing, as set forth in independent claim 7.
  • Preferred embodiments are set forth in the dependent claims.
  • A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • A chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Summary of the Figures
  • For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
    • Figure 1 shows schematically an example system utilizing embodiments described hereafter;
    • Figure 2 shows a flow diagram of the operation of the example system shown in Figure 1;
    • Figure 3 shows schematically an example analysis processor shown in Figure 1 according to some embodiments;
    • Figure 4 shows a flow diagram of the operation of the example analysis processor shown in Figure 3;
    • Figure 5 shows schematically an example synthesis processor shown in Figure 1 according to some embodiments;
    • Figure 6 shows a flow diagram of the operation of the example synthesis processor shown in Figure 5;
    • Figures 7a and 7b show schematically example spatial synthesizers shown in Figure 5 according to some embodiments;
    • Figures 8a and 8b show flow diagrams of the operation of the spatial synthesizers shown in Figures 7a and 7b;
    • Figure 9 shows schematically an example smoothing coefficients determiner shown in Figures 7a and 7b according to some embodiments;
    • Figure 10 shows a flow diagram of the operation of the smoothing coefficients determiner shown in Figure 9;
    • Figure 11 shows example graphs demonstrating the effect of implementing the embodiments;
    • Figure 12 shows an example implementation of the embodiments as shown in Figures 1 to 10;
    • Figure 13 shows a further example implementation of the embodiments as shown in Figures 1 to 10; and
    • Figure 14 shows schematically an example device suitable for implementing the embodiments shown.
    Embodiments of the Application
  • The following describes in further detail suitable apparatus and possible mechanisms for the provision of adaptive parameter smoothing.
  • In the following embodiments and examples the spatial sound source is a microphone array. Alternatively the spatial sound source may be a 5.1 multichannel or other format multi-channel mix or Ambisonics signals.
  • As described above parametric spatial audio capture methods can be used to enable a perceptually accurate spatial sound reproduction. Parametric spatial audio capture refers to adaptive DSP-driven audio capture methods covering 1) analysing perceptually relevant parameters in frequency bands, for example, the directionality of the propagating sound at the recording position, and 2) reproducing spatial sound in a perceptual sense at the rendering side according to the estimated spatial parameters. The reproduction can be, for example, for headphones or multichannel loudspeaker setups. By estimating and reproducing the perceptually relevant spatial properties (parameters) of the sound field, a spatial perception similar to that which would occur in the original sound field can be reproduced. As a result, the listener can perceive the multitude of sources, their directions and distances, as well as properties of the surrounding physical space, among the other spatial sound features, as if the listener was in the position of the capture device.
  • Parametric spatial audio capture methods (SPAC) may employ these determined parameters such as directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands to describe the perceptual spatial properties of the captured sound at the position of the microphone array and may use these parameters in synthesis of the spatial sound. As the spatial properties are estimated from the sound field, they can significantly fluctuate over time and frequency, e.g., due to the reverberation and/or multiple simultaneous sound sources. Hence, parametric spatial audio processing methods typically utilize smoothing in the synthesis, in order to avoid possible artefacts caused by rapidly fluctuating parameters (these artefacts are typically referred to as "musical noise").
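The temporal smoothing referred to above is, in its simplest form, a first-order IIR (recursive) averaging of each parameter over frames. The following is a minimal scalar sketch, assuming a first-order IIR smoother; the function name is illustrative, not the claimed method:

```python
def smooth_parameter(prev, new, alpha):
    """First-order IIR temporal smoothing of one spatial parameter.

    alpha is the time (smoothing) coefficient in (0, 1]: a larger alpha
    follows new estimates faster, while a smaller alpha smooths more
    heavily, suppressing the rapid fluctuations that cause "musical noise".
    """
    return alpha * new + (1.0 - alpha) * prev
```

Applied per frequency band and temporal frame, a small alpha (e.g. 0.1) heavily smooths a fluctuating estimate at the cost of reacting more slowly to genuine changes in the sound field, which is exactly the trade-off the adaptive coefficient described later addresses.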
  • Similar parametrization may also be used for the compression of spatial audio, e.g., from 5.1 multichannel signals. In this case, the parameters are estimated from the input loudspeaker signals. Nevertheless, the parameters typically fluctuate in this case as well. Hence, temporal smoothing is also needed with loudspeaker input.
  • Typically, the spatial parameters are determined in the time-frequency domain, i.e., each parameter value is associated with a certain frequency band and temporal frame. Examples of possible spatial parameters include (but are not limited to):
    • Direction and direct-to-total energy ratio
    • Direction and diffuseness
    • Inter-channel level difference, inter-channel phase difference, and inter-channel coherence
  • These parameters are determined in time-frequency domain. It should be noted that also other parametrizations may be used than those presented above. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). Moreover, it should be noted that, in such methods, the audio and the parameters may be processed and/or transmitted/stored in between the analysis and the synthesis.
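As an illustration of such a time-frequency parametrization, one band/frame entry of the spatial metadata could be represented as follows. The field names are assumptions for illustration; the description itself uses the direction θ(k,n) and the direct-to-total energy ratio r(k,n):

```python
from dataclasses import dataclass


@dataclass
class SpatialMetadata:
    """Spatial parameters for one frequency band k and temporal frame n."""
    band: int              # frequency band index k
    frame: int             # temporal frame index n
    direction_deg: float   # direction of arrival (e.g. azimuth, degrees)
    energy_ratio: float    # direct-to-total energy ratio, in [0, 1]


# One entry: a mostly-directional sound at 30 degrees in band 5, frame 12.
entry = SpatialMetadata(band=5, frame=12, direction_deg=30.0, energy_ratio=0.9)
```

A full metadata stream is then simply such an entry for every band and frame, which is what is processed, transmitted, or stored between analysis and synthesis.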
  • The parametric spatial audio processing methods are often based on analysing the direction of arrival (and other spatial parameters) in frequency bands. If there were a single sound source in anechoic conditions, the direction would stably point to the sound source at all frequencies. However, in typical acoustic environments, the microphones also capture other sounds than just the sound source, such as reverberation and ambient sounds. Moreover, there may be multiple simultaneous sources. As a result, the estimated directions typically fluctuate significantly over time and the estimates are different at different frequency bands.
  • Parametric spatial audio processing methods such as employed in embodiments as described in further detail hereafter synthesize the spatial sound based on the analysed parameters (such as the aforementioned direction) and related audio signals (e.g., 2 captured microphone signals). In the case of loudspeaker rendering, vector base amplitude panning (VBAP) is a common method to position the audio to the analysed direction. VBAP computes gains for a subset of loudspeakers based on the direction, and the audio signal is multiplied with these gains and fed to these loudspeakers.
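The VBAP gain computation mentioned above can be sketched for the two-dimensional case with a single loudspeaker pair. This is a minimal illustration under assumed conventions (azimuth angles in degrees, energy-preserving normalisation); real implementations first select the active pair or triplet from the full loudspeaker setup:

```python
import numpy as np


def vbap_gains_2d(direction_deg, spk_a_deg, spk_b_deg):
    """Compute 2-D VBAP gains for one loudspeaker pair.

    Solves p = g_a * l_a + g_b * l_b, where l_a and l_b are the unit
    vectors of the loudspeakers and p the unit vector of the target
    direction, then normalises so that g_a^2 + g_b^2 = 1.
    """
    def unit(deg):
        rad = np.deg2rad(deg)
        return np.array([np.cos(rad), np.sin(rad)])

    L = np.stack([unit(spk_a_deg), unit(spk_b_deg)])  # rows: speaker vectors
    g = np.linalg.solve(L.T, unit(direction_deg))
    return g / np.linalg.norm(g)  # energy-preserving normalisation
```

For a source straight ahead between speakers at plus and minus 30 degrees this yields equal gains; as the direction moves onto one speaker, that speaker's gain tends to 1 and the other's to 0, and the audio signal is multiplied with these gains and fed to the pair.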
  • The concept as discussed hereafter proposes apparatus and methods to adapt the smoothing needed in the synthesis of spatial sound in parametric spatial audio processing in order to have quality audio output with different types of sound scenes.
  • Furthermore the embodiments as described hereafter relate to parametric spatial audio processing, where a solution is provided to improve the temporal smoothing needed in the synthesis of spatial audio, and where the temporal smoothing is improved by adaptively analysing the required amount of smoothing. The analysis is related to the stability of the direction-related parameter(s) and produces a measure of directional stability, and the time coefficients of the temporal smoothing are determined based on that measure.
  • The direction-related parameter may as described in further detail in the embodiments hereafter refer to a direction.
  • In some embodiments the amount of smoothing can be analysed using the direct-to-total energy ratio. The value of the energy ratio is monitored over time, and where it is constantly high, the time coefficient of the smoothing can be set smaller (less smoothing applied). Correspondingly, where the energy ratio is not constantly high, the time coefficient can be set to a default value (more smoothing applied).
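The energy-ratio-based selection described above can be sketched as follows. The threshold, the monitored-frame handling, and the function name are illustrative assumptions; only the principle (constantly high ratio selects a faster coefficient, otherwise the default) comes from the text:

```python
def select_time_coefficient(ratio_history, threshold=0.8,
                            alpha_fast=0.4, alpha_default=0.1):
    """Pick the smoothing time coefficient from recent energy ratios.

    ratio_history holds the direct-to-total energy ratios of the most
    recent monitored frames. If the ratio has been constantly high,
    less smoothing is needed (faster coefficient); otherwise fall back
    to the default value (heavier smoothing).
    """
    if ratio_history and min(ratio_history) >= threshold:
        return alpha_fast
    return alpha_default
```

In practice this monitoring would be done per frequency band, and the selected coefficient then drives the temporal smoother described earlier.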
  • A block diagram of an example system for implementing some embodiments is shown in Figure 1.
  • Figure 1 shows an example capture device 101. The capture device may be a VR capture device, a mobile phone or any other suitable electronic apparatus comprising one or more microphone arrays. The capture device 101 thus in some embodiments comprises microphones 1001, 1002. The microphone audio signals 102 captured by the microphones 1001, 1002 may be stored and later processed, or directly processed.
  • An analysis processor 103 may receive the microphone audio signals 102 from the capture device 101. The analysis processor 103 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, field programmable gate arrays (FPGAs) or application specific integrated circuits (ASICs). In some embodiments the capture device 101 and the analysis processor 103 are implemented on the same apparatus or device.
  • Based on the microphone-array signals, the analysis processor creates a data stream 104. The data stream may comprise transport audio signals and spatial metadata (e.g., directions and energy ratios in frequency bands). The data stream 104 may be transmitted or stored for example within some storage 105 such as memory, or alternatively directly processed in the same device.
  • A synthesis processor 107 may receive the data stream 104. The synthesis processor 107 can, for example, be a computer or a mobile phone (running suitable software), or alternatively a specific device utilizing, for example, FPGAs or ASICs. Based on the data stream (the transport audio signals and the metadata), the synthesis processor can be configured to produce output audio signals. For headphone listening, the output signals can be binaural signals 109. For loudspeaker rendering, the output signals can be multi-channel signals.
  • The headphones 111 or other playback apparatus may be configured to receive the output of the synthesis processor 107 and output the audio signals in a format suitable for listening.
  • With respect to Figure 2 is shown an example summary of the operations of the apparatus shown in Figure 1.
  • The initial operation is the capture (or otherwise input) of the audio signals as shown in Figure 2 by step 201.
  • Having captured the audio signals they are analysed to generate the data stream as shown in Figure 2 by step 203.
  • The data stream may then be transmitted and received (or stored and retrieved) as shown in Figure 2 by step 205.
  • Having received or retrieved the data stream, the output may be synthesized based at least on the data stream as shown in Figure 2 by step 207.
  • The synthesized audio signal output signals may then be output to a suitable output such as headphones as shown in Figure 2 by step 209.
  • With respect to Figure 3 an example analysis processor 103, such as shown in Figure 1, is presented. The input to the analysis processor 103 are the microphone array signals 102.
  • A transport audio signal generator 301 may be configured to receive the microphone array signals 102 and create the transport audio signals. In some embodiments the transport audio signals are selected from the microphone array signals. In some embodiments the microphone array signals may be downmixed to generate the transport audio signals. In some embodiments the transport audio signals may be obtained by processing the microphone array signals.
  • The transport audio signal generator 301 may be configured to generate any suitable number of transport audio signals (or channels), for example in some embodiments the transport audio signal generator 301 is configured to generate two transport audio signals. In some embodiments the transport audio signal generator 301 is further configured to compress the audio signals. For example in some embodiments the audio signals may be compressed using an advanced audio coding (AAC) or enhanced voice services (EVS) compression coding.
  • Furthermore the analysis processor 103 comprises a spatial analyser 303. The spatial analyser 303 is also configured to receive the microphone array signals 102 and generate metadata 304 based on a spatial analysis of the microphone array signals. The spatial analyser 303 may be configured to determine any suitable spatial metadata parameter. For example, spatial metadata parameters determined may include (but are not limited to): direction and direct-to-total energy ratio; direction and diffuseness; inter-channel level difference, inter-channel phase difference, and inter-channel coherence. In some embodiments these parameters are determined in the time-frequency domain. It should be noted that other parametrizations may also be used than those presented above. In general, typically the spatial audio parametrizations describe how the sound is distributed in space, either generally (e.g., using directions) or relatively (e.g., as level differences between certain channels). In the example shown in Figure 3 the metadata 304 comprises directions 306 and energy ratios 308. In some embodiments the metadata may be compressed and/or quantized. The analysis processor 103 may furthermore comprise a multiplexer or mux 305 which is configured to receive the metadata 304 and the transport audio signals 302 and generate a combined data stream 104. The combination may be any suitable combination.
  • It should be noted that in some embodiments the input to the analysis processor 103 can also be other types of audio signals, such as multichannel loudspeaker signals, audio objects, or Ambisonic signals. Furthermore, the exact implementation of the analysis processor may be any suitable implementation (as indicated above, a computer running suitable software, FPGAs or ASICs, etc.) configured to produce the transport audio signals and the spatial metadata in the time-frequency domain.
  • With respect to Figure 4 is shown an example summary of the operations of the analysis processor shown in Figure 3.
  • The initial operation is receiving the microphone array audio signals as shown in Figure 4 by step 401.
  • Having received the microphone audio signals they are analysed to generate the transport audio signals (for example selection, downmixing or other processing) as shown in Figure 4 by step 403.
  • Furthermore the microphone audio signals are spatially analysed to generate the metadata, for example the directions and energy ratios as shown in Figure 4 by step 405.
  • The metadata and the transport audio signals may then be combined to generate the data stream as shown in Figure 4 by step 407.
  • With respect to Figure 5 an example synthesis processor 107 (as shown in Figure 1) according to some embodiments is shown.
  • A demultiplexer, or demux, 501 is configured to receive the data stream 104 and caused to demultiplex the data stream into transport audio signals 502 and metadata 504. In some embodiments, where the transport audio signals were compressed within the analysis processor, the demultiplexer is furthermore caused to decode the audio signals. The metadata in some embodiments is in the time-frequency domain, and comprises parameters such as directions θ(k,n) 506 and direct-to-total energy ratios r(k,n) 508, where k is the frequency band index and n the temporal frame index. In the embodiments where the metadata is compressed/quantized, the demultiplexed data is furthermore decompressed/dequantized to attempt to regenerate the originally determined parameters.
  • A spatial synthesizer 503 is configured to receive the transport audio signals 502 and the metadata and caused to generate the multichannel output signals 510 such as the binaural output signals 109 shown in Figure 1.
  • With respect to Figure 6 is shown an example summary of the operations of the synthesis processor shown in Figure 5.
  • The initial operation is receiving the data stream as shown in Figure 6 by step 601.
  • Having received the data stream, it is demultiplexed and optionally decoded to generate the transport audio signals and the metadata as shown in Figure 6 by step 603.
  • The multichannel (binaural or otherwise) output signals may then be synthesized from the transport audio signals and the metadata as shown in Figure 6 by step 605.
  • The multichannel (binaural or otherwise) output signals may then be output as shown in Figure 6 by step 607.
  • With respect to Figures 7a and 7b example spatial synthesizers 503 (as shown in Figure 5) according to some embodiments are shown.
  • The input to the spatial synthesizer 503 is in some embodiments the transport audio signals 502 and furthermore the metadata 504 (which may include the energy ratios 508 and the directions 506).
  • In some embodiments the transport audio signals 502 are transformed to the time-frequency domain using a suitable transformer. For example as shown in Figure 7a and 7b a short-time Fourier transformer (STFT) 701 is configured to apply a short-time Fourier transform to the transport audio signals to generate suitable time-frequency domain audio signals Si (k, n) 700. In some embodiments any suitable time-frequency transformer may be used, for example a quadrature mirror filterbank (QMF).
  • A divider 705 may receive the time-frequency domain audio signals Si (k, n) 700 and the energy ratios 508 and divide the time-frequency domain audio signals Si (k, n) 700 into ambient and direct parts using the energy ratio r(k, n) 508.
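A minimal Python sketch of such a division for a single time-frequency sample follows. The square-root weighting is an assumption made here so that the split is energy-preserving; the function name is illustrative and not taken from the specification.

```python
import math

def divide_bin(S, r):
    """Split one time-frequency sample S(k, n) into direct and ambient
    parts using the direct-to-total energy ratio r(k, n) in [0, 1].

    The sqrt weighting is an assumption (not given in the text): it
    makes the split energy-preserving, |direct|^2 + |ambient|^2 == |S|^2.
    """
    direct = math.sqrt(r) * S          # portion routed to the positioner
    ambient = math.sqrt(1.0 - r) * S   # portion routed to the decorrelator
    return direct, ambient
```

With r = 0.64, for example, the direct part carries 64% of the sample energy and the ambient part the remaining 36%.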
  • With respect to Figure 7a a smoothing coefficients determiner 703 may also receive the time-frequency domain audio signals Si (k, n) 700 and the energy ratios 508 and determine suitable smoothing coefficients 706. Figure 7b differs with respect to the example spatial synthesizer shown in Figure 7a in that the smoothing coefficients determiner in Figure 7b is caused to receive the time-frequency domain audio signals Si (k, n) 700 and the directions 506.
  • The smoothing coefficients determiner 703 may be configured to adaptively determine the smoothing coefficient(s) α(k, n).
  • A panning gain determiner 715 may be configured to receive the directions 506 and based on the output speaker/headphone configuration and the directions determine suitable panning gains 708. The amplitude panning gains may be computed using any suitable manner, for example vector base amplitude panning (VBAP) based on the received direction θ(k,n).
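The following Python sketch illustrates a pairwise 2-D VBAP gain computation of the kind referred to above: the adjacent loudspeaker pair whose arc contains the target direction is selected, the 2x2 system for the pair gains is solved, and the gains are normalised to unit energy. The layout handling, the function name `vbap_2d` and the unit-energy normalisation are assumptions for illustration, not mandated by the specification.

```python
import math

def vbap_2d(theta_deg, speaker_deg):
    """Sketch of 2-D vector base amplitude panning (VBAP)."""
    spk = sorted(speaker_deg)
    n = len(spk)
    # find the pair (i, j) whose counter-clockwise arc contains theta
    for i in range(n):
        j = (i + 1) % n
        span = (spk[j] - spk[i]) % 360.0
        off = (theta_deg - spk[i]) % 360.0
        if off <= span:
            break

    def unit(a_deg):
        a = math.radians(a_deg)
        return (math.cos(a), math.sin(a))

    (x1, y1), (x2, y2) = unit(spk[i]), unit(spk[j])
    xt, yt = unit(theta_deg)
    det = x1 * y2 - x2 * y1
    g1 = (xt * y2 - x2 * yt) / det      # Cramer's rule for L g = t
    g2 = (x1 * yt - xt * y1) / det
    g1, g2 = max(g1, 0.0), max(g2, 0.0)
    norm = math.hypot(g1, g2)           # unit-energy normalisation
    gains = [0.0] * n
    gains[i], gains[j] = g1 / norm, g2 / norm
    return spk, gains
```

For a source at 0° with loudspeakers at ±30° and ±110°, the pair at ±30° receives equal gains of about 0.707 and the rear pair receives zero.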
  • In some embodiments a panning gain smoother 717 is configured to receive the panning gains 708 and the smoothing coefficients 706 and based on these determine suitable smoothed panning gains 710. There are many ways to perform the smoothing. In some embodiments a first-order smoothing may be used. Thus for example the panning gain smoother 717 is configured to receive a current gain g(k, n), smoothing coefficients α(k, n) and also knowledge of the last smoothed gain g'(k, n - 1) and determine a smoothed gain by: g'(k, n) = α(k, n) g(k, n) + (1 - α(k, n)) g'(k, n - 1)
  • In other words the current gain is multiplied with the smoothing coefficient α and the previous smoothed gain is multiplied with (1 - α).
  • In other embodiments any suitable smoothing may be applied. The smoothing 'filter' may therefore be of higher order and similarly the smoothing coefficient α(k, n) may be a vector of values. The actual value(s) of α may depend on the filterbank, and typically are frequency-dependent (values may include, e.g., 0.1). In general, the larger the value is, the less smoothing is applied.
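A minimal sketch of the first-order gain smoothing above, applied along a sequence of frame gains (the function name and initial state are illustrative):

```python
def smooth_gains(gains, alpha, g_prev=0.0):
    """First-order smoothing along a sequence of per-frame gains:
    g'(n) = alpha * g(n) + (1 - alpha) * g'(n - 1).

    A larger alpha means less smoothing, i.e. a faster reaction to
    changes in the instantaneous panning gain.
    """
    out = []
    for g in gains:
        g_prev = alpha * g + (1.0 - alpha) * g_prev
        out.append(g_prev)
    return out
```

With alpha = 0.5 a step from 0 to 1 is approached geometrically (0.5, 0.75, 0.875, ...), while alpha = 1.0 passes the gains through unchanged.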
  • A decorrelator 707 is configured to receive the ambient audio signal part 702 and process it to make it perceived as being surrounding, for example by decorrelating and spreading the ambient audio signal part 702 across the audio scene.
  • A positioner 709 is configured to receive the direct audio signal part 704 and the smoothed panning gains 710 and position the direct audio signal part 704 using a suitable positioning, for example using the smoothed panning gains and an amplitude panning operation.
  • A merger 711 or other suitable combiner is configured to receive the spread ambient signal part from the decorrelator 707 and the positioned direct audio signals part from the positioner 709 and combine or merge these resulting audio signals.
  • An inverse short-time Fourier transformer (Inverse STFT) 713 is configured to receive the combined audio signals and apply an inverse short-time Fourier transform (or other suitable frequency to time domain transform) to generate the multi-channel audio signals 510 which may be passed to a suitable output device such as the headphones or multi-channel loudspeaker setup.
  • In the examples and embodiments described in detail herein, for example as described with respect to the examples shown in Figures 7a and 7b the panning gains are determined directly from the direction metadata, and the "direct sound" is also positioned with these gains after smoothing.
  • In some embodiments there may be implementations where the panning gains are not directly determined from the direction metadata, but instead determined indirectly. Thus the smoothing of these gains as described above may be applied to any suitably generated gains.
  • Thus for example, in some embodiments the directions may be used (together with the energy ratios and transport audio signals) to determine a target energy distribution of the output multichannel signals. The target energy distribution may be compared to the energy distribution of the transport audio signals (or to the energy distribution of intermediate signals obtained from the transport audio signals by mixing). Panning gains (or any gains that position audio) may be obtained as a ratio of these values and the "Smoother" 717 may be applied to these gains.
  • In summary the method of generating the panning gains may be one of many optional methods which is then smoothed according to methods as described herein.
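As one hedged illustration of such an indirect determination, a gain may be formed per output channel as a ratio of the target energy to the measured energy of the corresponding (mixed) transport signal. The square root in the sketch below is an assumption, so that an amplitude gain realises the target energy; the function name and the small regularisation term are illustrative only.

```python
import math

def indirect_gain(target_energy, current_energy, eps=1e-12):
    """Hypothetical indirect gain determination: the gain for an
    output channel is the square root of the ratio between a target
    energy (derived from directions, energy ratios and the transport
    signals) and the measured energy of the corresponding signal.

    eps guards against division by zero for silent channels.
    """
    return math.sqrt(target_energy / max(current_energy, eps))
```

Applying the returned gain in amplitude scales the channel energy from current_energy to target_energy, after which the gain can be smoothed exactly as the panning gains above.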
  • With respect to Figures 8a and 8b the operations of the spatial synthesizer 503 shown in Figures 7a and 7b according to some embodiments are described in further detail.
  • The spatial synthesizer in some embodiments is configured to receive the transport audio signals as shown in Figures 8a and 8b by step 801.
  • The spatial synthesizer in some embodiments is furthermore configured to receive the energy ratios as shown in Figures 8a and 8b by step 803.
  • The spatial synthesizer in some embodiments is also configured to receive the directions as shown in Figures 8a and 8b by step 805.
  • The received transport audio signals are in some embodiments converted into a time-frequency domain form, for example by applying a suitable time-frequency domain transform to the transport audio signals as shown in Figures 8a and 8b by step 807.
  • The time-frequency domain audio signals may then in some embodiments be divided into ambient and direct parts (based on the energy ratios) as shown in Figures 8a and 8b by step 813.
  • Furthermore the smoothing coefficients may be determined based on the energy ratios and the time-frequency domain audio signals as shown in Figure 8a by step 811. Alternatively the smoothing coefficients may be determined based on the directions and the time-frequency domain audio signals as shown in Figure 8b by step 851.
  • Panning gains may be determined based on the received directions as shown in Figures 8a and 8b by step 809.
  • A series of smoothed panning gains may be determined based on the determined panning gains and the smoothing coefficients as shown in Figures 8a and 8b by step 817.
  • The ambient audio signal part may be decorrelated as shown in Figures 8a and 8b by step 815.
  • The positional component of the audio signals may then be determined based on the smoothed panning gains and the direct audio signal part as shown in Figures 8a and 8b by step 819. In such embodiments a positional component of the audio signals or positioned audio signal can be a number of audio signals which are combined to produce a virtual sound source positioned in a three dimensional space.
  • The positional component of the audio signals and the decorrelated ambient audio signal may then be combined or merged as shown in Figures 8a and 8b by step 821.
  • Furthermore the combined audio signals may then be inverse time-frequency domain transformed to generate the multichannel audio signals in a suitable format to be output as shown in Figures 8a and 8b by step 823.
  • With respect to Figure 9 an example smoothing coefficients determiner 703 (such as shown in Figures 7a and 7b) according to some embodiments is shown. The smoothing coefficients determiner 703 is configured to generate values which may be used to smooth the panning gains in order to avoid "musical noise" artefacts.
  • The inputs to the smoothing coefficients determiner 703 are shown as the time-frequency domain audio signals 700 and the energy ratios 508.
  • An energy estimator 901 may be configured to receive the time-frequency domain audio signals 700 and determine the energy E(k, n) 902 of the audio signals. For example in some embodiments the energy estimator 901 is configured to generate the energy based on: E(k, n) = Σi |Si(k, n)|²
  • A direction smoothness estimator 903 is configured to estimate a direction smoothness ξ(k, n). In some embodiments, such as shown in the examples in Figures 7a and 8a, this direction smoothness may be estimated or determined from the energy ratios r(k, n) 508. For example the direction smoothness estimator may be configured to calculate the direction smoothness by the following: ξ(k, n) = r(k, n)^p
    where p is a constant (e.g., p = 8).
    In some embodiments, such as shown in the examples in Figures 7b and 8b, the direction smoothness value ξ(k, n) 904 can be estimated by using or calculating the fluctuation of the direction value. In such embodiments a circular variance of the directions θ(k, n) is determined and this is used as the basis of a direction smoothness. In other embodiments any suitable analysis of the temporal fluctuation of the directions may be used to determine the direction smoothness estimate.
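The two direction smoothness variants above may be sketched in Python as follows. The mapping of the directional fluctuation to a smoothness value (ξ = 1 - circular variance, i.e. the mean resultant length R) is an assumption, since the specification only states that the circular variance is used as the basis of the smoothness; the function names are illustrative.

```python
import cmath

def smoothness_from_ratio(r, p=8):
    """xi(k, n) = r(k, n)**p, with p a constant (e.g. p = 8)."""
    return r ** p

def smoothness_from_directions(thetas_rad):
    """Direction-fluctuation variant (a sketch): map recent direction
    values theta(k, n) to unit vectors and take the mean resultant
    length R = |mean(exp(j*theta))|.  The circular variance is 1 - R;
    returning xi = R (stable directions -> 1, scattered -> 0) is an
    assumed mapping, not one fixed by the text.
    """
    z = [cmath.exp(1j * t) for t in thetas_rad]
    return abs(sum(z) / len(z))
```

A constant direction history gives a smoothness near 1, while opposing directions (e.g. 0 and π) give a smoothness near 0.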
  • An average direction smoothness estimator 905 is configured to receive the energy 902 and direction smoothness estimates 904 and determine an average over time (and in some embodiments over frequency). The average direction smoothness estimator may therefore be configured to perform a first-order smoothing based on a current estimate ξ(k, n), a previous average value ξ'(k, n - 1) and a smoothing coefficient β to generate an averaged direction smoothness estimate ξ'(k, n) 906, for example by the following: ξ'(k, n) = β ξ(k, n) + (1 - β) ξ'(k, n - 1)
    where β may be fixed, or it can be adaptively selected, e.g., by: β(k, n) = α1 if ξ(k, n) > ξ'(k, n - 1), and β(k, n) = α2 if ξ(k, n) ≤ ξ'(k, n - 1),
    where α1 may, e.g., be 0.001 and α2 may, e.g., be 0.5. This adaptive selection attempts to find whether the energy ratio is constantly large, and hence whether the temporal smoothing can be safely made shorter without artefacts. Moreover, the direction smoothness estimates ξ may be weighted by the energy E while performing the temporal smoothing.
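A scalar Python sketch of this adaptive averaging, using the example values α1 = 0.001 and α2 = 0.5 from the text (function name illustrative; the optional energy weighting is omitted for brevity):

```python
def average_smoothness(xi, xi_prev, a1=0.001, a2=0.5):
    """Adaptive first-order averaging of the direction smoothness:
    xi'(k, n) = beta*xi(k, n) + (1 - beta)*xi'(k, n - 1),
    with beta = a1 (slow attack) when xi rises and beta = a2 (fast
    release) when it falls.  The slow attack means only a constantly
    large energy ratio yields a large averaged smoothness, so faster
    smoothing is enabled only when it is safe.
    """
    beta = a1 if xi > xi_prev else a2
    return beta * xi + (1.0 - beta) * xi_prev
```

A single high smoothness estimate barely raises the average (0.001 step), whereas a drop pulls it down quickly (0.5 step).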
  • A direction smoothness estimates to smoothing coefficients converter may receive the averaged direction smoothness estimate ξ'(k, n) 906 and generate the smoothing coefficients α(k, n). For example in some embodiments the averaged direction smoothness estimates ξ'(k, n) are converted to the actual smoothing coefficients by the following: α(k, n) = ξ'(k, n) αfast(k) + (1 - ξ'(k, n)) αslow(k)
  • The values of αfast may, e.g., include 0.4, and the values of αslow may, e.g., include 0.1. These fast and slow coefficients may depend on the actual implementation and may be frequency-dependent.
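A minimal sketch of this conversion, using the example values αfast = 0.4 and αslow = 0.1 given in the text (function name illustrative; in practice the settings may be frequency-dependent or vectors of values):

```python
def to_smoothing_coefficient(xi_avg, a_fast=0.4, a_slow=0.1):
    """Interpolate between the fast and slow settings:
    alpha(k, n) = xi'(k, n)*a_fast(k) + (1 - xi'(k, n))*a_slow(k).

    A high averaged direction smoothness selects the fast (less
    smoothed) setting; a low one selects the slow setting.
    """
    return xi_avg * a_fast + (1.0 - xi_avg) * a_slow
```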
  • In some embodiments the smoothing coefficients may be a vector instead of a single value. This may for example occur when the smoothing is other than a first-order IIR smoothing. These embodiments may therefore implement "fast settings" and "slow settings" which are interpolated based on the "averaged direction smoothness estimates". In such embodiments these "settings" may depend on the implementation, for example whether each is a single value or a vector of values.
  • The smoothing coefficients α(k, n) 706 may then be output.
  • With respect to Figure 10 an example flow diagram showing the operation of the smoothing coefficients determiner according to some embodiments is shown.
  • The time-frequency domain audio signals are received as shown in Figure 10 by step 1001.
  • Furthermore the energy ratios are received as shown in Figure 10 by step 1003.
  • The estimate of the energy (of the audio signals) may be determined based on the time-frequency domain audio signals as shown in Figure 10 by step 1005.
  • Furthermore the estimate of the direction smoothness is determined based on the energy ratios (or based on any other suitable parameter such as an analysis of the directions) as shown in Figure 10 by step 1007.
  • The estimate of the average direction smoothness is then determined based on the energy estimate and the direction smoothness estimates as shown in Figure 10 by step 1009.
  • Then the average direction smoothness estimate is converted to smoothing coefficients as shown in Figure 10 by step 1011.
  • The smoothing coefficients are then output as shown in Figure 10 by step 1013.
  • In the above examples the smoothing coefficients were determined based on the use of the spatial metadata (for example the metadata generated in the analysis processor such as found within a spatial audio capture (SPAC) system which generates directions and direct-to-total energy ratios). It should be noted that the above methods can be modified without inventive skill to be used with any method utilizing similar parameters. For example in the context of Directional Audio Coding (DirAC), the direction smoothness can be determined as: ξ(k, n) = (1 - ψ(k, n))^p
    where ψ(k, n) is the diffuseness.
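For completeness, the DirAC variant above as a one-line Python sketch (function name illustrative):

```python
def smoothness_from_diffuseness(psi, p=8):
    """DirAC variant: xi(k, n) = (1 - psi(k, n))**p, where psi(k, n)
    is the diffuseness (0 = fully directional, 1 = fully diffuse)."""
    return (1.0 - psi) ** p
```

A fully directional bin (ψ = 0) yields maximal smoothness 1, so fast smoothing settings are used, while a fully diffuse bin (ψ = 1) yields 0.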
  • One advantage of the proposed embodiments is that a significant amount of smoothing can be applied with typical sound scenes, and thus musical noise artefacts are avoided. Furthermore when the sound scene does not require as much smoothing, the amount of smoothing applied can be reduced and thus the reproduction can react faster to changes in the sound field.
  • The effect of the proposed embodiments can be seen in Figure 11 which shows three graph traces showing a reference audio signal, a reproduction using a fixed smoothing coefficient and a reproduction using an adaptive smoothing coefficient. In this example the sound scene contains two sources located in different directions in anechoic conditions. The sound was rendered to a multichannel setup.
  • The reference graph trace 1101 shows the signal of one output channel of the audio signal and shows the first source 1103 but not the other source.
  • In the fixed smoothing example graph trace 1111, excessive temporal smoothing causes the sound to still be reproduced partially from the first direction (shown from 1.4 seconds to 1.8 seconds) even though the sound source is no longer present in that direction. As a result, the reproduction is perceived to react slowly to changes in the direction.
  • By contrast, in the adaptive smoothing example 1121, having analysed that the directions are stable and that there is less need for temporal smoothing, the smoothing is set to a faster mode, and the sound source is not reproduced from the wrong channel. In such a manner the reproduction is perceived to react quickly to changes in the direction.
  • With respect to Figure 12 an example implementation of some further embodiments is shown. In these embodiments the implementation can be in software, for example on a mobile phone (or a computer) 1200. The software running inside the mobile phone 1200 may be configured to receive an encoded bitstream (it may have been, e.g., transmitted in real time or it may have been stored on the device). The bitstream can also be any other suitable bitstream. A demultiplexer 1203 (DEMUX) is configured to demultiplex the bitstream into an audio bitstream 1204 and a spatial metadata bitstream 1206.
  • An enhanced voice services (EVS) or other encoded bitstream decoder 1205 (or any decoder that corresponds to the utilized codec) is configured to extract the transport audio signals 1206 from the audio bitstream.
  • A metadata decoder 1207 is used to decompress the spatial metadata 1208, for example comprising the directions 1210 and energy ratios 1212.
  • The spatial synthesiser 1209 (similar to the spatial synthesizer in the embodiments above) is configured to receive transport audio signals 1206 and the metadata 1208 and output multichannel loudspeaker signals 1211 that may be reproduced using a multichannel loudspeaker setup. In some embodiments the spatial synthesizer 1209 is configured to generate binaural audio signals that may be reproduced using headphones.
  • With respect to Figure 13 a further example implementation is shown according to some further embodiments. In this example implementation a microphone array 1301, for example part of a mobile phone, is configured to capture audio signals 1302. The captured microphone array audio signals 1302 may be processed by software 1300 running inside the mobile phone. The software 1300 may comprise an analysis processor 1303 configured to analyse the captured microphone array signals 1302 in a manner such as described with respect to Figures 1 and 3 and to generate spatial metadata 1304 (comprising directions 1306 and energy ratios 1308). Furthermore there may be a synthesis processor 1305 which is configured to receive the spatial metadata 1304 from the analysis processor 1303 along with the captured microphone array audio signals 1302 (or alternatively a subset or a processed set of the microphone signals). The synthesis processor 1305 may operate in a manner similar to the synthesis processor as described with respect to Figures 1, 5, 7a, 7b and 9. Depending on the configuration, the synthesis processor 1305 may be configured to output a multichannel audio signal (for example a binaural signal, a surround loudspeaker signal or an Ambisonic signal). The multichannel audio signals 1307 can therefore be listened to directly (when fed to headphones or loudspeakers, or reproduced using an Ambisonic renderer), stored (with any suitable codec) and/or transmitted to a remote device.
  • Although a codec-based implementation is described above, it is noted that some embodiments may be used with any suitable codec that utilizes smoothing and can provide information on the smoothness of the direction-related parameters.
  • Similarly, as depicted in the example implementation in Figure 13, the proposed method can also be applied in any kind of spatial audio processing which operates in the time-frequency domain.
  • With respect to Figure 14 an example electronic device which may be used as the analysis or synthesis processor is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Claims (10)

  1. An apparatus for spatial audio signal processing, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
    receive at least one audio signal;
    determine an energy ratio being a spatial parameter associated with the at least one audio signal;
    determine a direction smoothness parameter by applying an exponent to the energy ratio;
    convert the direction smoothness parameter to an adaptive smoothing parameter;
    determine panning gains for applying to a first part of the at least one audio signal;
    apply the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and
    apply the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
  2. The apparatus as claimed in claim 1, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
    apply a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and
    combine the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
  3. The apparatus as claimed in any of claims 1 and 2, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
    estimate an energy of the at least one audio signal; and
    average the direction smoothness parameter based on the energy of the at least one audio signal, wherein the apparatus caused to convert the direction smoothness parameter to the adaptive smoothing parameter is caused to convert the averaged direction smoothness parameter to the adaptive smoothing parameter.
  4. The apparatus as claimed in claim 3, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
    determine an averaging parameter based on the energy of the at least one audio signal; and
    apply the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to a previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
  5. The apparatus as claimed in any of claims 1 to 4, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to:
    receive the at least one audio signal from at least one microphone within a microphone array;
    determine the at least one audio signal from multichannel loudspeaker audio signals; and
    receive the at least one audio signal as part of a data stream comprising the at least one audio signal and metadata comprising the spatial parameter.
  6. The apparatus as claimed in any of claims 1 to 5, wherein the at least one memory and the computer program code are configured to, with the at least one processor, further cause the apparatus to: analyse the at least one audio signal to determine the spatial parameter; and
    receive the spatial parameter as part of a data stream comprising the at least one audio signal and metadata comprising the spatial parameter.
  7. A method for spatial audio signal processing comprising:
    receiving at least one audio signal;
    determining an energy ratio being a spatial parameter associated with the at least one audio signal;
    determining a direction smoothness parameter by applying an exponent to the energy ratio;
    converting the direction smoothness parameter to an adaptive smoothing parameter;
    determining panning gains for applying to a first part of the at least one audio signal;
    applying the adaptive smoothing parameter to the panning gains to generate associated smoothed panning gains; and
    applying the smoothed panning gains to the first part of the at least one audio signal to generate a positioned audio signal.
  8. The method as claimed in Claim 7, further comprising:
    applying a decorrelation to a second part of the at least one audio signal to generate an ambient audio signal; and
    combining the positioned audio signal and the ambient audio signal to generate a multichannel audio signal.
  9. The method as claimed in Claims 7 and 8, wherein converting the direction smoothness parameter to an adaptive smoothing parameter comprises:
    estimating an energy of the at least one audio signal;
    averaging the direction smoothness parameter based on the energy of the at least one audio signal, wherein converting the direction smoothness parameter to the adaptive smoothing parameter comprises converting the averaged direction smoothness parameter to the adaptive smoothing parameter.
  10. The method as claimed in Claim 9, wherein averaging the direction smoothness parameter based on the energy of the at least one audio signal comprises:
    determining an averaging parameter based on the energy of the at least one audio signal; and
    applying the averaging parameter to the direction smoothness parameter and unity minus the averaging parameter to a previous averaged direction smoothness parameter to generate the averaged direction smoothness parameter.
EP19767481.5A 2018-03-13 2019-03-07 Spatial audio parameter smoothing Active EP3766262B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1803993.3A GB2571949A (en) 2018-03-13 2018-03-13 Temporal spatial audio parameter smoothing
PCT/FI2019/050178 WO2019175472A1 (en) 2018-03-13 2019-03-07 Temporal spatial audio parameter smoothing

Publications (3)

Publication Number Publication Date
EP3766262A1 EP3766262A1 (en) 2021-01-20
EP3766262A4 EP3766262A4 (en) 2021-11-10
EP3766262B1 true EP3766262B1 (en) 2022-11-23

Family

ID=61972940

Family Applications (1)

Application Number Title Priority Date Filing Date
EP19767481.5A Active EP3766262B1 (en) 2018-03-13 2019-03-07 Spatial audio parameter smoothing

Country Status (3)

Country Link
EP (1) EP3766262B1 (en)
GB (1) GB2571949A (en)
WO (1) WO2019175472A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11545166B2 (en) 2019-07-02 2023-01-03 Dolby International Ab Using metadata to aggregate signal processing operations
GB2593419A (en) * 2019-10-11 2021-09-29 Nokia Technologies Oy Spatial audio representation and rendering
TW202123220 (A) * 2019-10-30 Dolby Laboratories Licensing Corporation Multichannel audio encode and decode using directional metadata
AU2021357364B2 (en) * 2020-10-09 2024-06-27 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method, or computer program for processing an encoded audio scene using a parameter smoothing
EP4178231A1 (en) * 2021-11-09 2023-05-10 Nokia Technologies Oy Spatial audio reproduction by positioning at least part of a sound field

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7983922B2 (en) * 2005-04-15 2011-07-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for generating multi-channel synthesizer control signal and apparatus and method for multi-channel synthesizing
US8295494B2 (en) 2007-08-13 2012-10-23 Lg Electronics Inc. Enhancing audio with remixing capability
JP5798247B2 * 2011-07-01 2015-10-21 Dolby Laboratories Licensing Corporation Systems and tools for improved 3D audio creation and presentation
WO2013181272A2 (en) 2012-05-31 2013-12-05 Dts Llc Object-based audio system using vector base amplitude panning
DE102012017296B4 (en) * 2012-08-31 2014-07-03 Hamburg Innovation Gmbh Generation of multichannel sound from stereo audio signals
US10635383B2 (en) 2013-04-04 2020-04-28 Nokia Technologies Oy Visual audio processing apparatus
JP6187131B2 (en) 2013-10-17 2017-08-30 ヤマハ株式会社 Sound image localization device
EP2942981A1 (en) 2014-05-05 2015-11-11 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. System, apparatus and method for consistent acoustic scene reproduction based on adaptive functions
CN105336335B * 2014-07-25 2020-12-08 Dolby Laboratories Licensing Corporation Audio object extraction with sub-band object probability estimation
US10045145B2 (en) * 2015-12-18 2018-08-07 Qualcomm Incorporated Temporal offset estimation
WO2018213159A1 (en) * 2017-05-15 2018-11-22 Dolby Laboratories Licensing Corporation Methods, systems and apparatus for conversion of spatial audio format(s) to speaker signals

Also Published As

Publication number Publication date
GB201803993D0 (en) 2018-04-25
EP3766262A1 (en) 2021-01-20
GB2571949A (en) 2019-09-18
EP3766262A4 (en) 2021-11-10
WO2019175472A1 (en) 2019-09-19

Similar Documents

Publication Publication Date Title
US12114146B2 (en) Determination of targeted spatial audio parameters and associated spatial audio playback
US11343630B2 (en) Audio signal processing method and apparatus
EP3766262B1 (en) Spatial audio parameter smoothing
KR101480258B1 (en) Apparatus and method for decomposing an input signal using a pre-calculated reference curve
US20190394606A1 (en) Two stage audio focus for spatial audio processing
US9313599B2 (en) Apparatus and method for multi-channel signal playback
US20170188174A1 (en) Audio signal processing method and device
US20130195276A1 (en) Multi-Channel Audio Processing
US20160255452A1 (en) Method and apparatus for compressing and decompressing sound field data of an area
US20230071136A1 (en) Method and apparatus for adaptive control of decorrelation filters
US20220369061A1 (en) Spatial Audio Representation and Rendering
US20240089692A1 (en) Spatial Audio Representation and Rendering
CN112567765B (en) Spatial audio capture, transmission and reproduction
US20210099795A1 (en) Spatial Audio Capture
US20210319799A1 (en) Spatial parameter signalling
US11956615B2 (en) Spatial audio representation and rendering
US20240274137A1 (en) Parametric spatial audio rendering
US20240357304A1 (en) Sound Field Related Rendering

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20201013

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20211007

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 19/008 20130101ALI20211002BHEP

Ipc: H04R 1/32 20060101ALI20211002BHEP

Ipc: H04R 3/12 20060101ALI20211002BHEP

Ipc: H04R 5/04 20060101ALI20211002BHEP

Ipc: H04S 7/00 20060101AFI20211002BHEP

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20220707

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602019022274

Country of ref document: DE

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1533887

Country of ref document: AT

Kind code of ref document: T

Effective date: 20221215

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG9D

REG Reference to a national code

Ref country code: NL

Ref legal event code: MP

Effective date: 20221123

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1533887

Country of ref document: AT

Kind code of ref document: T

Effective date: 20221123

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20230323

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20230223

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20230323

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20230224

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230527

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602019022274

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

26N No opposition filed

Effective date: 20230824

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20230331

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20230307

REG Reference to a national code

Ref country code: IE

Ref legal event code: MM4A

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20230331

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20230307

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20230331

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20230331

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20240130

Year of fee payment: 6

Ref country code: GB

Payment date: 20240201

Year of fee payment: 6

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20221123

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20240213

Year of fee payment: 6