WO2019105575A1 - Determination of spatial audio parameter encoding and associated decoding - Google Patents

Determination of spatial audio parameter encoding and associated decoding

Info

Publication number
WO2019105575A1
Authority
WO
WIPO (PCT)
Prior art keywords
resolution
spatial audio
parameter
frequency
time
Prior art date
Application number
PCT/EP2017/081265
Other languages
English (en)
Inventor
Lasse Juhani Laaksonen
Anssi Sakari RÄMÖ
Adriana Vasilache
Mikko Tapio Tammi
Miikka Tapani Vilermo
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/EP2017/081265
Publication of WO2019105575A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/18 - Vocoders using multiple modes
    • G10L19/24 - Variable rate codecs, e.g. for generating different qualities using a scalable representation such as hierarchical encoding or layered encoding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S3/00 - Systems employing more than two channels, e.g. quadraphonic
    • H04S3/02 - Systems employing more than two channels, e.g. quadraphonic of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04S - STEREOPHONIC SYSTEMS
    • H04S2420/00 - Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/03 - Application of parametric coding in stereophonic audio systems

Definitions

  • the present application relates to apparatus and methods for sound-field related parameter encoding.
  • Parametric spatial audio processing is a field of audio signal processing where the spatial aspect of the sound is described using a set of parameters.
  • Such parameters include, for example, directions of the sound in frequency bands, and the ratios between the directional and non-directional parts of the captured sound in frequency bands.
  • These parameters are known to well describe the perceptual spatial properties of the captured sound at the position of the microphone array.
  • These parameters can be utilized in synthesis of the spatial sound accordingly: binaurally for headphones, for loudspeakers, or in other formats such as Ambisonics. These parameters can also be used to estimate the sound at positions within an environment captured by the microphone array.
  • the directions and direct-to-total energy ratios in frequency bands are thus a parameterization that is particularly effective for spatial audio capture.
  • a parameter set consisting of a direction parameter in frequency bands and an energy ratio parameter in frequency bands (indicating the directionality of the sound) can be also utilized as the spatial metadata for an audio codec.
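  • As a rough illustration of such a parameter set, the following sketch (Python; all names and the 24-sub-band layout are illustrative assumptions, not taken from the patent) shows per-tile spatial metadata of this kind:

```python
from dataclasses import dataclass

@dataclass
class SpatialTile:
    """Spatial metadata for one time-frequency tile (illustrative names)."""
    azimuth_deg: float       # direction parameter, azimuth component
    elevation_deg: float     # direction parameter, elevation component
    direct_to_total: float   # energy ratio in [0, 1]; 1.0 = fully directional

# One frame of metadata: one tile per sub-band (here a single 24-band sub-frame).
frame_metadata = [SpatialTile(azimuth_deg=30.0, elevation_deg=0.0,
                              direct_to_total=0.8) for _ in range(24)]
```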
  • these parameters can be estimated from microphone-array captured audio signals, and for example a stereo signal can be generated from the microphone array signals to be conveyed with the spatial metadata.
  • the stereo signal could be encoded, for example, with an AAC encoder or multiple instances of an EVS mono encoder.
  • a corresponding decoder(s) can decode the audio signals into PCM signals, and, e.g., a synthesis processing or a renderer can process the sound in frequency bands (using the spatial metadata) to obtain the spatial output, for example a binaural output.
  • The aforementioned solution is particularly suitable for encoding captured spatial sound from microphone arrays (e.g., in mobile phones, VR cameras, stand-alone microphone arrays).
  • A further possible input for the encoder is multi-channel loudspeaker input, such as 5.1 or 7.1 channel surround inputs.
  • an apparatus for spatial audio signal encoding comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and process the at least one resolution spatial audio parameter to be output and/or stored.
  • the apparatus caused to determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to determine at least one of: at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; at least one energy ratio parameter associated with the direction parameter; at least one diffuseness parameter associated with the direction parameter; and at least one coherence parameter associated with the direction parameter.
  • the apparatus caused to determine, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may be caused to: determine at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determine at least one second time- frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and select one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
  • the apparatus may be further caused to: window the two or more audio signals to generate at least one frame of the two or more audio signals; filter the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the apparatus caused to determine, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may be caused to analyse the at least two sub- band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein the apparatus caused to determine, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may be further caused to analyse the at least two sub- band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
  • the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
  • the apparatus may be further caused to generate a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to generate an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
  • the apparatus may be further caused to determine at least one suitability measure.
  • the apparatus caused to determine at least one suitability measure may be caused to determine the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
  • the apparatus caused to select at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may be further caused to select the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
  • the apparatus may be caused to determine the at least one time-frequency resolution based on the at least one suitability measure.
  • the apparatus may be further caused to downmix the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
  • the apparatus may be further caused to encode the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
  • the apparatus may be further caused to encode the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
  • a method for spatial audio signal encoding comprising: determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and processing the at least one resolution spatial audio parameter to be output and/or stored.
  • Determining, for two or more audio signals, and for at least one time- frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; determining at least one energy ratio parameter associated with the direction parameter; determining at least one diffuseness parameter associated with the direction parameter; and determining at least one coherence parameter associated with the direction parameter.
  • Determining, for the two or more audio signals, and for at least one time- frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
  • the method may further comprise: windowing the two or more audio signals to generate at least one frame of the two or more audio signals; filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and wherein determining, for the two or more audio signals, and for a second time- frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
  • Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
  • the method may further comprise generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
  • the method may further comprise determining at least one suitability measure.
  • Determining at least one suitability measure may further comprise determining the at least one suitability measure based on at least one of: analysing the two or more audio signals; analysing a downmix based on the two or more audio signals; analysing the at least one resolution spatial audio parameter for the at least one time-frequency resolution; and visually analysing an environment generating the at least two or more audio signals.
  • Selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
  • the method may comprise determining the at least one time-frequency resolution based on the at least one suitability measure.
  • the method may further comprise downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time- frequency resolution.
  • the method may further comprise encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time- frequency resolution.
  • the method may further comprise encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
  • an apparatus for spatial audio signal encoding comprising: means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction; and means for processing the at least one resolution spatial audio parameter to be output and/or stored.
  • the means for determining, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise at least one of: means for determining at least one spatial audio parameter comprising a direction parameter with an elevation and an azimuth component; means for determining at least one energy ratio parameter associated with the direction parameter; means for determining at least one diffuseness parameter associated with the direction parameter; and means for determining at least one coherence parameter associated with the direction parameter.
  • the means for determining, for the two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions at least one resolution spatial audio parameter for providing spatial audio reproduction may comprise: means for determining at least one first time-frequency resolution from a range of frequency resolutions at least one first resolution spatial audio parameter for providing spatial audio reproduction; means for determining at least one second time-frequency resolution from a range of frequency resolutions at least one second resolution spatial audio parameter for providing spatial audio reproduction; and means for selecting one of the first resolution spatial audio parameter and second resolution spatial audio parameter as the at least one time-frequency resolution from a range of frequency resolutions.
  • the apparatus may further comprise: means for windowing the two or more audio signals to generate at least one frame of the two or more audio signals; means for filtering the frame of the two or more audio signals to generate at least two sub-band representations of the at least one frame of the two or more audio signals, wherein the means for determining, for two or more audio signals, and for a first time-frequency resolution at least one first resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the first time-frequency resolution, and the means for determining, for the two or more audio signals, and for a second time-frequency resolution at least one second resolution spatial audio parameter for providing spatial audio reproduction may further comprise means for analysing the at least two sub-band representations of the at least one frame of the two or more audio signals using the second time-frequency resolution.
  • the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored on a frame-by-frame basis.
  • the apparatus may further comprise means for generating a data structure for the at least one resolution spatial audio parameter, wherein the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • the data structure may comprise at least one of: a switch field configured to indicate a selection of the at least one resolution from the range of frequency resolutions; a time-frequency description field configured to indicate the ordering of the selection of the at least one resolution from the range of frequency resolutions; and at least one parameter field configured to provide the at least one resolution spatial audio parameter.
  • the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for generating an embedded data structure for the selected one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter, wherein the embedded data structure may comprise at least one embedded field configured to indicate the selection of the at least one of the at least one first resolution spatial audio parameter and the at least one second resolution spatial audio parameter to be output and/or stored.
  • the apparatus may further comprise means for determining at least one suitability measure.
  • the means for determining at least one suitability measure may further comprise means for determining the at least one suitability measure based on at least one of: analysis of the two or more audio signals; analysis of a downmix based on the two or more audio signals; analysis of the at least one resolution spatial audio parameter for the at least one time-frequency resolution; and visual analysis of an environment generating the at least two or more audio signals.
  • the means for selecting at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter to be output and/or stored may further comprise means for selecting the at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter based on the at least one suitability measure.
  • the apparatus may comprise means for determining the at least one time- frequency resolution based on the at least one suitability measure.
  • the apparatus may further comprise means for downmixing the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
  • the apparatus may further comprise means for encoding the at least two audio signals based on a selected one of the first time-frequency resolution and the second time-frequency resolution.
  • the apparatus may further comprise means for encoding the selected at least one of the at least one first resolution spatial audio parameter and at least one second resolution spatial audio parameter.
  • An apparatus comprising means for performing the actions of the method as described above.
  • An apparatus configured to perform the actions of the method as described above.
  • a computer program comprising program instructions for causing a computer to perform the method as described above.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments
  • Figure 2 shows schematically the multi-resolution analysis processor as shown in Figure 1 in further detail according to some embodiments;
  • Figures 3a and 3b show schematically example metadata structures according to some embodiments;
  • Figure 4 shows schematically example multi-resolution modes according to some embodiments
  • Figure 5 shows schematically an example embedded multi-resolution mode format according to some embodiments
  • Figure 6 shows schematically a further example embedded multi-resolution mode format according to some embodiments
  • Figure 7 shows a flow diagram of the operation of the system as shown in Figure 1 according to some embodiments
  • Figure 8 shows a flow diagram of the operation of the multi-resolution analysis processor as shown in Figure 2 according to some embodiments
  • Figure 9 shows a flow diagram of the operation of analysing the audio signal for one of the multi-resolution analysis operations as shown in Figure 8 according to some embodiments.
  • Figure 10 shows schematically an example device suitable for implementing the apparatus shown.
  • An immersive system is one in which the encoding and decoding attempt to retain the characteristics of the audio scene (as captured by microphones or synthesized otherwise) and aim to produce an immersive effect when presented to the listener.
  • the input format may be any suitable input format, such as multi-channel loudspeaker, ambisonic (FOA/HOA) etc. It is understood that in some embodiments the channel location is based on a location of the microphone or is a virtual location or direction.
  • the output of the example system is a binaural headphone presentation.
  • a rendered audio signal output is passed to the listener as a pair of audio signals for a suitable headphone/earphone/headset signal.
  • the output may be any suitable rendering.
  • For example, the output may be rendered to a multi-channel loudspeaker arrangement, and the multi-channel loudspeaker signals may be generalised to be two or more playback audio signals.
  • Current speech and audio codecs, and in particular immersive audio codecs, support a multitude of operating points ranging from low bit rate operation to transparency.
  • An example of such a codec is the 3GPP IVAS codec, for which the standardization process began in 3GPP TSG-SA4 in October 2017. The completion of the standard is currently expected by the end of 2019.
  • the IVAS codec is an extension of the 3GPP EVS codec and intended for new immersive voice and audio services over 4G/5G.
  • Such immersive services include, e.g., stereo / binaural telephony, multichannel teleconferencing and immersive voice and audio for virtual reality (VR).
  • This multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio.
  • One input format of interest is a parametric immersive audio format.
  • Spatial metadata parameters such as direction and direct-to-total energy ratio (or diffuseness-ratio, absolute energies, or any suitable expression indicating the directionality/non-directionality of the sound at the given time-frequency interval) in frequency bands are particularly suitable for expressing the perceptual properties of natural and synthetic sound fields: natural sound fields such as captured microphone-generated sound fields, and synthetic sound scenes such as 5.1 loudspeaker mixes.
  • the spatial metadata parameters such as direction(s), energy ratio(s), diffuseness and coherence can be used to express the features of the sound field accurately.
  • One concept which is being investigated is the use of multi-resolution parametric immersive codecs. These are codecs which have more than a single time-frequency resolution. For example, in some such embodiments it is possible to have analysis and processing with a high frequency resolution, or with a high temporal resolution, or systems combining these, for example via a switched system.
  • These multi-resolution formats may for example be selected from or supported by internal processors and quantizers of an audio codec based on the input formats of the audio signals.
  • the codec may treat a parametric immersive audio format separately or it may pass it (or at least the waveform part) through a similar processing path as it uses for other waveform-based formats.
  • These other formats may include at least one of ambisonics (FOA/HOA), multi-channel (e.g., 2.0, 4.0, 5.1, 7.1, 7.1+4H, 22.2, and so on), and object-based audio.
  • the audio codec may support independent streams with directional metadata and/or individual streams that may have dependency metadata in addition to directional metadata.
  • Audio formats may also be combined. For example, there may be 'FOA + audio objects' or 'parametric immersive audio + individual streams' or any other combination that makes sense from the capture, content creation, or rendering point of view.
  • At least some of the formats may be immersive or spatially analysed inside the audio codec. This analysis may, as discussed above, determine parameters such as directions of sound sources (expressed, e.g., as a direction on a sphere or alternatively as an azimuth and elevation parameter per time-frequency tile). This processing may then be followed by an immersive downmix (in other words a downmix of the input audio signals suitable for encoding and producing a suitable immersive effect when later decoded and rendered to the listener based on the determined metadata) and metadata extraction.
  • a parametric representation for enabling efficient encoding and transmission of the immersive scene may be created inside the encoder prior to waveform coding.
  • The processing inside the codec substantially corresponds to a processing outside the codec. The concept is thus, in some embodiments, to support both high frequency resolution and high time resolution approaches for immersive audio coding.
  • the concept may thus be characterized by apparatus and methods which implement an immersive metadata format switching functionality that allows for changing time/frequency (T/F) resolution of the parameters on a frame-by-frame basis and use of different strategies in different subband/subframe ranges.
  • the immersive metadata of a parametric immersive audio format is defined in a way that it allows for different immersive capture analysis and processing approaches related to frequency and time resolution.
  • The system 100 is shown with an 'analysis' part 121 and a 'synthesis' part 131.
  • The 'analysis' part 121 is the part from receiving the multi-channel loudspeaker signals up to an encoding of the metadata and downmix signal, and the 'synthesis' part 131 is the part from a decoding of the encoded metadata and downmix signal to the presentation of the re-generated signal (for example in multi-channel loudspeaker form).
  • The input to the system 100 and the 'analysis' part 121 is the multi-channel signals 102.
  • a microphone channel signal input is described, however any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
  • the multi-channel signals are passed to a downmixer 103 and to an analysis processor 105.
  • the downmixer 103 is configured to receive the multi-channel signals and downmix the signals to a determined number of channels and output the downmix signals 104.
  • For example, the downmixer 103 may be configured to generate a two-channel audio downmix of the multi-channel signals.
  • the determined number of channels may be any suitable number of channels.
  • the downmixer 103 is optional and the multi-channel signals are passed unprocessed to an encoder 107 in the same manner as the downmix signals are in this example.
  • the downmixer 103 is configured to generate downmix signals 104 on a frame by frame basis. As such in some embodiments the downmixer 103 is configured to receive windowed and filtered audio signals provided by the multi-resolution analysis processor 105 rather than directly via the input.
  • the multi-resolution analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce metadata 106 associated with the multi-channel signals and thus associated with the downmix signals 104.
  • the analysis processor 105 may be configured to generate the metadata which may comprise, for each time-frequency analysis interval, a direction parameter 108, an energy ratio parameter 110, a coherence parameter 112, and a diffuseness parameter 114.
  • the direction, energy ratio and diffuseness parameters may in some embodiments be considered to be spatial audio parameters.
  • the spatial audio parameters comprise parameters which aim to characterize the sound-field created by the multi-channel signals (or two or more playback audio signals in general).
  • the coherence parameters may be considered to be signal relationship audio parameters which aim to characterize the relationship between the multi-channel signals.
  • the multi-resolution analysis processor is configured to generate multiple-resolution analysis of the audio signals. For example in some embodiments the multi-resolution analysis processor generates a first time- frequency resolution metadata parameter set and a second time-frequency resolution metadata parameter set. In some embodiments the multi-resolution analysis processor is configured to select one of the generated sets and pass this to the metadata encoder.
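  • A minimal sketch of this analyse-then-select structure follows (Python; the two resolution values follow the 20 ms/24-band and 5 ms/6-band example used later in the description, while analyse() and the selection callback are placeholders, not the patent's implementation):

```python
import numpy as np

# Two candidate T/F grids, as in the example resolutions described below.
RESOLUTIONS = [
    {"update_ms": 20, "subframes": 1, "subbands": 24},
    {"update_ms": 5,  "subframes": 4, "subbands": 6},
]

def analyse(tf_frame, res):
    """Placeholder spatial analysis: one (azimuth, elevation, ratio) triple
    per sub-frame and sub-band of the given resolution."""
    return np.zeros((res["subframes"], res["subbands"], 3))

def multi_resolution_analysis(tf_frame, select):
    """Analyse one frame at every supported resolution, then keep one set."""
    candidates = [analyse(tf_frame, res) for res in RESOLUTIONS]
    chosen = select(tf_frame, candidates)   # the switch decision (see below)
    return chosen, candidates[chosen]
```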
  • the parameters generated may differ from frequency band to frequency band.
  • For example, in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted.
  • a practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons.
  • the downmix signals 104 and the metadata 106 may be passed to an encoder 107.
  • the encoder 107 may comprise a core coder 109 which is configured to receive the downmix (or otherwise) signals 104 and generate a suitable encoding of these audio signals.
  • the encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the encoding may be implemented using any suitable scheme.
  • the encoder 107 may furthermore comprise a metadata encoder or quantizer 111 which is configured to receive the metadata and output an encoded or compressed form of the information.
  • The core encoder 109 can be implemented using various tools. For example, audio coding tools that are part of an IVAS codec (such as EVS tools) can be used. If the immersive downmix signal is a mono signal, a single-channel element (SCE) encoding can be utilized. This can be, in some embodiments, an encoding corresponding to the EVS standard. If the immersive downmix signal is a stereo signal (linear or binaural), a channel-pair element (CPE) encoding can be utilized. For example, dedicated stereo modes in IVAS can be utilized. If the immersive downmix signal is beyond a stereo representation, it is possible to use, e.g., various combinations of SCE and CPE encodings, or alternatively and in addition, a multichannel encoding. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream, or embed the metadata within the encoded downmix signals before transmission or storage, shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
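  • The mapping from downmix configuration to coding elements described above can be sketched as follows (Python; the split of a wider downmix into CPE/SCE combinations is one possible choice, not mandated by the text):

```python
def core_coding_elements(n_downmix_channels: int) -> list:
    """Pick coding elements for the immersive downmix:
    mono -> SCE, stereo -> CPE, wider -> combinations of CPE and SCE
    (a dedicated multichannel encoding could be used instead)."""
    if n_downmix_channels == 1:
        return ["SCE"]
    if n_downmix_channels == 2:
        return ["CPE"]
    pairs, leftover = divmod(n_downmix_channels, 2)
    return ["CPE"] * pairs + ["SCE"] * leftover
```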
  • the received or retrieved data may be received by a decoder/demultiplexer 133.
  • the decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a downmix extractor 135 which is configured to decode the audio signals to obtain the downmix signals.
  • the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded metadata and generate metadata.
  • the decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the decoded metadata and downmix audio signals may be passed to a synthesis processor 139.
  • The system 100 'synthesis' part 131 further shows a synthesis processor 139 configured to receive the downmix and the metadata and re-create in any suitable format a synthesized spatial audio in the form of multi-channel signals 110 (these may be in multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the downmix signals and the metadata.
  • First the system (analysis part) is configured to receive multi-channel audio signals as shown in Figure 7 by step 701.
  • The system (analysis part) is configured to analyse the signals over multiple resolutions to generate metadata such as direction parameters, energy ratio parameters, diffuseness parameters and coherence parameters. One of the multiple resolutions may then be selected for output. The generation of multiple resolution metadata and the selection of one of them is shown in Figure 7 by step 703.
  • the signal analysis which generates the metadata is performed based on a control signal or information or a decision from a controller or other part or function.
  • the control signal may be configured to control the analyser to perform only one resolution analysis which is different from frame to frame (in other words pre-analysis select the analysis resolution rather than post-analysis select the resolution).
  • The control signal for the analyser may be determined based on analysis of the audio signal characteristics, for example analysis of the input audio signals, and based on this only one resolution parameter set per frame is created.
  • the core codec that encodes the downmix may be configured to switch between short and long windows and this can be used as an indication to which resolution is used.
  • the resolution is determined from the audio characteristics of the core coded downmix signal.
  • the resolution does not need to be signalled in the metadata because it can be determined from the core coded downmix signal.
  • the system (analysis part) is configured to generate a downmix of the multi-channel signals based on the selected resolution as shown in Figure 7 by step 705.
  • the system is then configured to encode for storage/transmission the downmix signal and metadata as shown in Figure 7 by step 707.
  • the system may store/transmit the encoded downmix and metadata as shown in Figure 7 by step 709.
  • the system may retrieve/receive the encoded downmix and metadata as shown in Figure 7 by step 711.
  • the system is configured to extract the downmix and metadata from encoded downmix and metadata parameters, for example demultiplex and decode the encoded downmix and metadata parameters, as shown in Figure 7 by step 713.
  • the system (synthesis part) is configured to synthesize an output multi- channel audio signal based on extracted downmix of multi-channel audio signals and metadata as shown in Figure 7 by step 715.
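  • Gathered into code, the Figure 7 flow looks roughly as follows (Python; every helper function here is a hypothetical stand-in for the blocks described above, not an API defined by the patent):

```python
def encode(multichannel_audio):
    # Step 703: analyse over multiple resolutions and select one.
    resolution, metadata = analyse_and_select(multichannel_audio)
    # Step 705: generate the downmix based on the selected resolution.
    downmix = make_downmix(multichannel_audio, resolution)
    # Step 707: encode downmix and metadata for storage/transmission (step 709).
    return core_encode(downmix), encode_metadata(metadata, resolution)

def decode(coded_downmix, coded_metadata):
    # Step 713: extract the downmix and the metadata.
    downmix = core_decode(coded_downmix)
    metadata = decode_metadata(coded_metadata)
    # Step 715: synthesize the multi-channel output.
    return synthesize(downmix, metadata)
```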
  • the multi-resolution analysis processor 105 comprises a windower 201.
  • The windower 201 is configured to receive the input audio signals and generate a series of analysis periods or intervals of audio signal sample lengths. These can be passed to a filter bank 203. The windower may thus generate a series of frames from which multi-resolution sub-frames may be extracted.
  • the analysis processor 105 furthermore may comprise a filter bank 203.
  • The filter bank 203 in some embodiments is configured to apply a suitable time-to-frequency domain transform, such as a Short Time Fourier Transform (STFT), to the windowed (multi-channel) audio signals 102 in order to convert the input time domain signals into suitable time-frequency signals.
  • These time-frequency signals may then be filtered according to any suitable band or sub-band configuration and passed to a series of multi-resolution immersive signal analyser parts.
  • These time-frequency signals may be represented in the time-frequency domain by s(k, n), where k is the frequency band index and n can be considered as a time index with a lower sampling rate than that of the original time-domain signals.
  • The widths of the sub-bands can approximate any suitable distribution, for example the equivalent rectangular bandwidth (ERB) scale or the Bark scale.
  • the sub-band size may be chosen directly based on the filter bank used in the audio codec.
  • Such a filter bank may thus result in the use of a sub-band bandwidth that has, e.g., a minimum size of 400 Hz.
  • Although the STFT is used in the filter bank 203, any suitable implementation may be used.
  • the filter bank 203 sub-band widths can be selected based on perceptual properties of human hearing.
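  • A compact sketch of this windowing/STFT/band-grouping front end is given below (Python with NumPy; the FFT length, hop size and linear band edges are illustrative assumptions, where a real implementation would use ERB- or Bark-like edges as noted above):

```python
import numpy as np

def stft_bands(frame, n_fft=960, hop=480, n_bands=24):
    """Window a time-domain frame, take an STFT, and group the bins into
    analysis sub-bands, giving time-frequency tiles s(k, n)."""
    assert len(frame) >= n_fft          # sketch assumes a full-length frame
    window = np.hanning(n_fft)
    n_hops = 1 + (len(frame) - n_fft) // hop
    spectra = np.stack([np.fft.rfft(window * frame[i * hop:i * hop + n_fft])
                        for i in range(n_hops)])     # (time index n, bin)
    edges = np.linspace(0, spectra.shape[1], n_bands + 1).astype(int)
    # bands[k] holds the bins of sub-band k for every low-rate time index n.
    return [spectra[:, edges[k]:edges[k + 1]] for k in range(n_bands)]
```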
  • The multi-resolution analysis processor 105 comprises a series of different resolution immersive signal analysers. These are represented in Figure 2 by a first immersive signal analyser 205₁ and a second immersive signal analyser 205₂.
  • the immersive signal analysers may be configured to determine for a defined time and frequency (T/F) resolution a series of ‘immersive’ parameters for describing the audio signals.
  • the immersive signal analyser comprises a direction analyser configured to receive the time-frequency signals and, based on these signals, estimate direction parameters on a band-by-band (or groups of bands) basis.
  • the direction parameters may be determined based on any audio based ‘direction’ determination.
  • The direction analyser may be configured to estimate the direction with two or more signal inputs. This represents the simplest configuration to estimate a 'direction'; more complex processing may be performed with even more signals.
  • The direction analyser may thus be configured to provide an azimuth and elevation for each frequency band and temporal resolution, denoted as azimuth φ(k,n) and elevation θ(k,n).
  • the direction parameter 108 may also be used to perform further analysis of the signal.
  • the analyser is configured to determine an energy ratio parameter.
  • the energy ratio may be considered to be a determination of the energy of the audio signal which can be considered to arrive from a direction.
  • the direct-to-total energy ratio r(k,n) can be estimated, e.g., using a stability measure of the directional estimate, or using any correlation measure, or any other suitable method to obtain a ratio parameter.
  • the signal analyser 205 may be configured to produce further signal parameters such as the two parameters: coherence and diffuseness, both analysed in time-frequency domain.
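  • The text does not fix a particular estimator; as one well-known possibility (not necessarily the patent's method), a DirAC-style intensity analysis of first-order (B-format) signals yields a direction and a direct-to-total ratio per tile, sketched below (Python; the signal conventions and normalization are assumptions):

```python
import numpy as np

def direction_and_ratio(W, X, Y, Z):
    """One possible estimator (intensity-based, DirAC-style) for a single
    time-frequency tile; W, X, Y, Z are complex arrays over the tile's
    time indices n."""
    ix = np.real(W * np.conj(X)).mean()   # active intensity components
    iy = np.real(W * np.conj(Y)).mean()
    iz = np.real(W * np.conj(Z)).mean()
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    # Stability of the mean intensity vector against total energy gives a
    # direct-to-total energy ratio estimate r(k, n).
    energy = 0.5 * (np.abs(W) ** 2
                    + (np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2) / 3.0)
    ratio = np.linalg.norm([ix, iy, iz]) / (energy.mean() + 1e-12)
    return azimuth, elevation, float(np.clip(ratio, 0.0, 1.0))
```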
  • the immersive signal analysis can consist of at least two parts which differ at least in frequency and time resolution. However, the parts may further differ in terms of the parameters being analysed.
  • For example, the first immersive signal analyser 205₁ may be configured to generate analysis using a 20-ms update rate (in other words with a temporal resolution of 20 ms) and with 24 sub-bands (for example, where the bands are all the same size each band will have a frequency resolution of X/24, where X is the frequency range of the analysed audio signals, though it is understood that the bands may differ in size in some embodiments), and the second immersive signal analyser 205₂ may be configured to generate analysis using a 5-ms update rate with 6 sub-bands. It is to be appreciated that these are examples only and other resolutions are possible for both high frequency and high time resolution approaches. For example, a significantly higher time resolution such as a 1.25-ms update rate may be utilized in some implementations.
  • Although each of the multi-resolution parameter sets may be generated in parallel, it would be understood that in some embodiments the analysis operations are performed in series or in a hybrid of series and parallel.
  • the multi-resolution analysis processor 105 comprises a switch 207.
  • the switch 207 is configured to receive the analysis parameters from each of the immersive signal analysers 205 and output one (or more of these) to the immersive metadata extractor 209.
  • the switch 207 is configured to operate based on at least one suitability measure. In some embodiments this decision is performed differently for a subset of sub-frames or sub-bands. Thus at least one switching decision may be performed for each frame (or windowed period).
  • the at least one suitability measure can be any suitable measure.
  • The measure may be based on analysis of at least one of: the input audio signals, the parameters generated by the analysis of the audio signals, and the downmix audio signals.
  • The determination may be made based on detecting that the input audio signals comprise voice (for example by using a suitable voice activity determination) and that furthermore there is more than one voice source (for example based on a detection of at least two talkers in a scene). In such a situation where there are two simultaneous talkers, a faster time domain update cycle can provide better overall performance than a more accurate spectral resolution.
  • Other switching suitability measures may be used in various implementations of the invention. For example, if an analysis of the audio scene determines that the audio scene has a relatively flat spectrum and thus a noisy ambience, shorter windows may produce better results. Also, where it is determined that there are impulsive energy fluctuations between sub-frames, and thus the audio signals comprise transients, a faster update rate can be selected. On the other hand, where it is determined that there are stable tonal sounds (like classical musical instruments), a longer sub-frame window size and thus higher spectral accuracy may be selected.
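  • These heuristics can be combined into a simple suitability measure, sketched here (Python; the thresholds and the spectral-flatness test are illustrative only, not values from the source):

```python
import numpy as np

def pick_resolution(frame):
    """Return a T/F mode for one 20 ms frame: transients or a flat, noise-like
    spectrum favour the fast 5 ms update; stable tonal content favours the
    20 ms high-spectral-resolution mode."""
    sub = frame[:4 * (len(frame) // 4)].reshape(4, -1)  # four 5 ms sub-frames
    energies = (sub.astype(float) ** 2).sum(axis=1) + 1e-12
    transient = energies.max() / energies.min() > 8.0   # impulsive fluctuation
    spec = np.abs(np.fft.rfft(frame)) + 1e-12
    flatness = np.exp(np.log(spec).mean()) / spec.mean()  # near 1 for noise
    noisy = flatness > 0.5
    return "5ms/6sb" if (transient or noisy) else "20ms/24sb"
```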
  • Control of the post-analysis selection of the resolution from multiple resolutions or, as described earlier, a 'pre-analysis' selection of the resolution may relate to analysis of non-audio scene factors.
  • the selection is based on visual tracking and segmentation of a scene, where the number of and types of sound sources are determined and used to, at least in part, decide the analysis resolution selection.
  • the output of the switch 207 may be passed to the immersive metadata extractor 209.
  • the multi-resolution analysis processor 105 comprises an immersive metadata extractor 209.
  • The immersive metadata extractor 209 may be configured to receive the output of the switch 207 and pass it to a metadata compressor/encoder 111. Furthermore the immersive metadata extractor 209 may be configured to output the resolution of the extracted metadata to the core coder 109 and downmixer 103 such that the time/frequency resolution of the metadata may be matched in the downmixer 103 and/or core coder 109.
  • the multi-resolution analysis processor 105 comprises a metadata compressor/encoder 111.
  • The metadata compressor/encoder 111 is configured to receive the extracted metadata and generate a suitable data format to output to enable the extraction of the metadata at a suitable decoder/receiver. This, for example, requires the ability to generate suitable control information which enables the receiver/decoder to determine the information content in the encoded format.
  • the immersive audio metadata may be compressed depending on the bit rate restriction of the current coding mode. Thus in some embodiments some immersive fidelity will typically be lost, as is usually the case with lossy compression. However, the compression can utilize various perceptual techniques to minimize the impact.
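  • As an illustration of such lossy metadata compression, a uniform quantization of one tile's parameters could look as follows (Python; the bit allocations are hypothetical and would in practice depend on the operating point):

```python
def quantize_tile(azimuth_deg, elevation_deg, ratio,
                  az_bits=7, el_bits=5, r_bits=3):
    """Uniformly quantize one tile's direction and energy-ratio parameters;
    a real codec could additionally apply prediction and entropy coding."""
    q_az = round((azimuth_deg % 360.0) / 360.0 * ((1 << az_bits) - 1))
    q_el = round((elevation_deg + 90.0) / 180.0 * ((1 << el_bits) - 1))
    q_r = round(min(max(ratio, 0.0), 1.0) * ((1 << r_bits) - 1))
    return q_az, q_el, q_r
```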
  • Figure 3a shows a table structure where a switching is supported between the at least two modes with different time/frequency resolutions.
  • the table shows two columns, a metadata field 301 and a metadata value field(s) 303.
  • A switch field 302 indicates the mode selection; for two modes this switch field may be supported by a single bit (the 'Switch' field).
  • the two modes may be static for a given transmission (e.g., streaming or communications call), i.e., they may be known (e.g., only two resolutions are allowed by a specific implementation) or the resolutions may be communicated out of band.
  • the data format shown in Figure 3a furthermore shows the various parameters and their values grouped by parameter.
  • the 1st to nth sub-band values are shown in the first part 311,
  • the 1st to nth sub-band values are shown in the second part 313, and
  • the 1st to nth sub-band values are shown in the Yth part 315.
  • The metadata structure and size can in some embodiments remain constant, for example when utilizing the time/frequency (T/F) modes of the example analyser shown in Figure 2, which implemented a 20ms/24-sub-band and a 5ms/6-sub-band resolution.
  • the first mode may provide 1×24 sub-bands and thus 1×24 sub-band values per parameter.
  • Figure 3b shows a table structure similar to that shown in Figure 3a but with an additional field, a T/F description field 305 and associated value(s) 306.
  • This field may describe the T/F resolution in the current frame and/or the T/F resolutions overall supported by the encoder/decoder or used in the current transmission. For example, the switching bit or 'Switch' field value may be used to index the 'T/F description' field.
  • the 'T/F description' field in Figure 3b may furthermore include information on how the parameter values are ordered.
  • the structure as shown in Figure 3b may be adapted.
  • the orderings of 'Parameter 1' may be as follows: sub-bands 0, 1, 2, ..., 23 and 0, 1, 2, 3, 4, 5 for sub-frame 0, followed similarly by sub-frames 1, 2, 3, respectively, giving the same internal order as for the format shown in Figure 3a.
  • this additional field may allow the decoder to optimize the memory allocation or other aspects for the synthesis.
  • The receiver may derive from this information the possible configurations for analysis resolution within a specific transmission. This may allow for embodiments with faster adaptation.
  • the T/F description field and associated value may be used to convey information on a switching in a subset of the sub-frames or sub-bands.
  • The 'Switch' field may here have more values that correspond to mixtures of at least two switched modes within the current frame, for example if a third analysis mode is a 10ms 12-sub-band mode (in other words each 20ms window is divided into two 10ms sub-frames and each sub-frame has 12 sub-bands: 10ms/12sb).
  • the structure of the data format information may also be provided in different fields, as the metadata structures shown are only examples.
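  • A hedged sketch of serializing a Figure 3b-style frame follows (Python; the 'T/F description' field width and the bit-packing order are assumptions made for illustration):

```python
def int_to_bits(value, width):
    """Most-significant-bit-first integer packing."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def pack_frame(switch_bit, tf_description, parameters):
    """Emit: 'Switch' field (1 bit for two modes), a 'T/F description' value,
    then each parameter's per-sub-band values in the signalled order.
    parameters: list over Parameter 1..Y of (quantized value, bit width) lists."""
    bits = [switch_bit]
    bits += int_to_bits(tf_description, 4)      # field width is an assumption
    for per_subband_values in parameters:
        for value, width in per_subband_values:
            bits += int_to_bits(value, width)
    return bits
```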
  • In Figure 4 an example arrangement 401 of different T/F modes within a frame is shown.
  • the bottom line 404 summarizes the time resolution (T) in terms of ms, and the number of frequency sub-bands (F) for each T is indicated 402 over each column.
  • This example as shown in Figure 4 shows six different modes of operation: a first mode utilizes 1 update cycle of 20 ms 403, the second mode utilizes 2 update cycles of 10 ms each 405₁ and 405₂, the third mode utilizes 4 update cycles of 5 ms 407₁ to 407₄, and the remaining three modes utilize 3 update cycles where one of the cycles is 10 ms and the other two 5 ms each: 409₁ to 409₃, 411₁ to 411₃, and 413₁ to 413₃.
  • the first mode 403 has a single column of 24 sub-bands (0...23) representing one parameter value per sub-band for each of the 24 sub-bands for the 20ms frame length.
  • the second mode has two columns of 12 sub-bands (0..11, 0..11) representing one parameter value per sub-band for each of the 12 sub-bands for each of the 10ms sub-frames.
  • the third mode has four columns of 6 sub-bands (0..5, 0..5, 0..5, 0..5) representing one parameter value per sub-band for each of the 6 sub-bands for each of the 5ms sub-frames.
  • the fourth mode has three columns, one of 12 sub-bands and two of 6 sub-bands (0..11, 0..5, 0..5), representing one parameter value per sub-band for each of the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands for each of the two 5ms sub-frames.
  • the fifth mode has three columns, two of 6 sub-bands and one of 12 sub-bands (0..5, 0..5, 0..11), representing one parameter value per sub-band for each of the 6 sub-bands for each of the two 5ms sub-frames and the 12 sub-bands in the 10ms sub-frame.
  • the sixth mode has three columns, two of 6 sub-bands and one of 12 sub-bands (0..5, 0..11, 0..5), representing one parameter value per sub-band for each of the 6 sub-bands in the first 5ms sub-frame, the 12 sub-bands in the 10ms sub-frame and the 6 sub-bands in the last 5ms sub-frame.
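  • The six Figure 4 arrangements can be written down as data, which also makes the constant 20 ms frame length explicit (Python; the tuple encoding is just one convenient representation):

```python
# Each mode is a sequence of (sub-frame length in ms, number of sub-bands).
TF_MODES = [
    [(20, 24)],                        # first mode:  20ms / 24sb
    [(10, 12), (10, 12)],              # second mode: 2 x 10ms / 12sb
    [(5, 6), (5, 6), (5, 6), (5, 6)],  # third mode:  4 x 5ms / 6sb
    [(10, 12), (5, 6), (5, 6)],        # fourth mode
    [(5, 6), (5, 6), (10, 12)],        # fifth mode
    [(5, 6), (10, 12), (5, 6)],        # sixth mode
]

# Every mode spans exactly one 20 ms frame.
assert all(sum(ms for ms, _ in mode) == 20 for mode in TF_MODES)
```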
  • the metadata compressor/encoder may furthermore define an embedded structure of T/F resolution switching for the immersive metadata. In other words it may define a data structure wherein a different mode may be selected for each of the sub-bands.
  • the embedded structure could always be active. In other words the value 'Embedded bit 0' would not be used.
  • The "fixed" mode structure may be, e.g., similar to the data structure shown earlier with respect to Figure 3a.
  • the embedded mode differs from the structures shown previously in that it comprises an embedded switching bit before each sub-band or, in some embodiments, e.g., each group of sub-bands. The positions of the embedded bits are not fixed due to the different sizes of the sub-bands (different T/F resolutions).
  • Figure 5 shows the order of the parameters in this first example.
  • the data structure 501 shows an initial row 502 indicating the data structure is an embedded data structure.
  • the data structure 501 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1) and the order of the sub-bands for each of the modes is then defined in the fixed structure 504. It is understood that the rows 4 and 5 of Figure 5 may not be the same size. Although the example shown herein has two modes and thus is indicated by a single bit, in some embodiments there may be more than two modes, which are indicated by a flag value more than one bit in length.
  • the data structure 501 on rows 8 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting.
  • rows 8 and 12 show the identification 506 of the first element as being either the first mode first sub-band (mode_0 Subband_1) where the first embedded bit is 0 or the second mode first sub-band (mode_1 Subband_1) where the first embedded bit is 1.
  • rows 10 and 11 follow the selection of the first mode for the first value with the identification 508 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1.
  • there may be further embedded bit values for selecting further sub-bands. For example there may be a further set of embedded bit values, embedded bit 3, which selects whether the mode_0 or mode_1 sub-band 3 is selected, and so on.
  • rows 14 and 15 follow the selection of the second mode for the first value with the identification 510 of the following elements as being either the first mode second sub-band (mode_0 Subband_2) onwards where the second embedded bit is 0 or the second mode second sub-band (mode_1 Subband_2) onwards where the second embedded bit is 1. Similarly there may also be further embedded bit values for selecting further mode sub-bands. A minimal parsing sketch of this structure is given below.
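  • A minimal reading loop in the spirit of the Figure 5 structure might look as follows; this is a sketch under the assumption of one embedded bit per sub-band and with bit/parameter extraction abstracted away, not the actual decoder.

```python
# Sketch of embedded-mode parsing per Figure 5: one embedded bit before
# each sub-band selects mode_0 (bit 0) or mode_1 (bit 1) for it.
def read_embedded(next_bit, read_parameter, num_subbands):
    """next_bit() returns the next embedded bit; read_parameter(mode, sb)
    returns one parameter value for the given mode and sub-band."""
    values = []
    for sb in range(1, num_subbands + 1):
        mode = next_bit()                  # the bifurcation point
        values.append((f"mode_{mode}", sb, read_parameter(mode, sb)))
    return values

# Toy usage with canned embedded bits and dummy parameter values:
bits = iter([0, 1, 0])
print(read_embedded(lambda: next(bits), lambda m, sb: f"p{m}_{sb}", 3))
# [('mode_0', 1, 'p0_1'), ('mode_1', 2, 'p1_2'), ('mode_0', 3, 'p0_3')]
```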
  • the data structure is defined such that the positions of the embedded bits are known, and the size of the sub-bands is taken into account. This is shown for example with respect to the data structure shown in Figure 6.
  • Figure 6 thus for example shows a data structure 601 with an initial row 602 indicating the data structure is an embedded data structure.
  • the data structure 601 on rows 2 to 5 defines the fixed mode operation wherein the mode switch identifies or selects the first mode (mode_0) or the second mode (mode_1) and the order of the sub-bands for each of the modes is then defined in the fixed structure 604. It is understood that the rows 4 and 5 of Figure 6 may not be the same size.
  • the data structure 601 on rows 6 onwards then defines the embedded mode operation wherein the embedded bits switch or identify between the first mode (mode_0) or the second mode (mode_1) for each bifurcation or splitting.
  • rows 8 and 12 show the identification 606 of the first element as being either the first mode first sub-band (mode_0 Sb_1) where the first embedded bit is 0 or the second mode first sub-band to fourth sub-band (mode_1 Sb_1 to mode_1 Sb_4) where the first embedded bit is 1.
  • rows 10 and 11 follow the selection of the first mode for the first embedded bit value with the identification 608 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1.
  • rows 14 and 15 follow the selection of the second mode for the first value with the identification 610 of the following elements as being either the first mode second sub-band (mode_0 Sb_2) onwards where the second embedded bit is 0 or the second mode fifth sub-band (mode_1 Sb_5) onwards where the second embedded bit is 1. A sketch of this grouped parsing follows.
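  • The Figure 6 variant can be sketched in the same style, with the difference that the two modes have different sub-band sizes, so a single embedded bit covers either one mode_0 sub-band or the group of mode_1 sub-bands spanning the same frequency range. The fixed 1-to-4 grouping below is an illustrative assumption taken from the Sb_1-to-Sb_4 example, not a normative mapping.

```python
# Sketch of Figure 6 style parsing: one embedded bit selects either one
# mode_0 sub-band or the group of (here, four) mode_1 sub-bands covering
# the same frequency range, so that bit positions stay known while the
# different sub-band sizes are taken into account.
def read_embedded_grouped(next_bit, read_parameter, num_mode0_sb, group=4):
    values = []
    sb1 = 1                              # next unread mode_1 sub-band
    for sb0 in range(1, num_mode0_sb + 1):
        if next_bit() == 0:
            values.append(("mode_0", sb0, read_parameter(0, sb0)))
        else:
            for sb in range(sb1, sb1 + group):
                values.append(("mode_1", sb, read_parameter(1, sb)))
        sb1 += group                     # the covered mode_1 range is spent
    return values

bits = iter([1, 0])
print(read_embedded_grouped(lambda: next(bits), lambda m, sb: sb, 2))
# [('mode_1', 1, 1), ('mode_1', 2, 2), ('mode_1', 3, 3),
#  ('mode_1', 4, 4), ('mode_0', 2, 2)]
```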
  • the first operation is one of windowing the (time domain) multichannel audio signals (to generate frames from which the sub-frames can be generated) as shown in Figure 8 by step 801.
  • the final operation is one of outputting the determined parameters and generating a suitable metadata structure such as described herein, as shown in Figure 8 by step 809.
  • the immersive signal analyser is configured to determine the analysis sub-frame window and sub-bands to be analysed as shown in Figure 9 by step 901.
  • the following operation is one of performing a direction analysis to determine direction and energy ratio parameters for each resolution time and frequency as shown in Figure 9 by step 903.
  • the analysis is configured to determine coherence parameters and diffuseness parameters (and optionally to modify the energy ratios based on the determined coherence parameters) for each resolution time and frequency, as shown in Figure 9 by step 905. A high-level sketch of this analysis flow is shown below.
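  • The overall flow of Figures 8 and 9 can be summarised as the sketch below; the analysis callables are assumptions standing in for the actual direction and coherence analysers, and only the control flow is taken from the steps above.

```python
# Sketch of the per-frame analysis loop (Figure 8 step 801 to step 809,
# with the Figure 9 steps 901-905 inside): one parameter set is produced
# per (sub-frame, sub-band) tile of the chosen T/F resolution.
def analyse_frame(tf_layout, direction_analysis, coherence_analysis):
    """tf_layout: sequence of (sub_frame_index, num_subbands) pairs
    chosen in step 901."""
    metadata = []
    for sub_frame, num_subbands in tf_layout:
        for sub_band in range(num_subbands):
            direction, energy_ratio = direction_analysis(sub_frame, sub_band)
            coherence, diffuseness = coherence_analysis(sub_frame, sub_band)
            # Step 905: energy ratios may optionally be modified based
            # on the determined coherence parameters.
            metadata.append((sub_frame, sub_band, direction,
                             energy_ratio, coherence, diffuseness))
    return metadata   # packed into the metadata structure (step 809)

# Toy usage with the mixed 10ms/12sb + 2x 5ms/6sb mode:
md = analyse_frame([(0, 12), (1, 6), (2, 6)],
                   lambda sf, sb: (0.0, 1.0),
                   lambda sf, sb: (0.0, 0.0))
print(len(md))  # 24 parameter sets per frame
```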
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Stereophonic System (AREA)

Abstract

The present invention relates to a spatial audio signal encoding apparatus, the apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processor, cause the apparatus at least to: determine, for two or more audio signals, and for at least one time-frequency resolution from a range of frequency resolutions, at least one resolution spatial audio parameter for providing spatial audio reproduction; and process the at least one resolution spatial audio parameter so as to output and/or store it.
PCT/EP2017/081265 2017-12-01 2017-12-01 Détermination de codage de paramètre audio spatial et décodage associé WO2019105575A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/081265 WO2019105575A1 (fr) 2017-12-01 2017-12-01 Détermination de codage de paramètre audio spatial et décodage associé

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/081265 WO2019105575A1 (fr) 2017-12-01 2017-12-01 Détermination de codage de paramètre audio spatial et décodage associé

Publications (1)

Publication Number Publication Date
WO2019105575A1 true WO2019105575A1 (fr) 2019-06-06

Family

ID=60543561

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/081265 WO2019105575A1 (fr) 2017-12-01 2017-12-01 Détermination de codage de paramètre audio spatial et décodage associé

Country Status (1)

Country Link
WO (1) WO2019105575A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2384028A2 (fr) * 2008-07-31 2011-11-02 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Génération de signaux pour signaux binauraux
US20150213806A1 (en) * 2012-10-05 2015-07-30 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Encoder, decoder and methods for backward compatible multi-resolution spatial-audio-object-coding
US20160064006A1 (en) * 2013-05-13 2016-03-03 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Audio object separation from mixture signal using object-specific time/frequency resolutions

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11765536B2 (en) 2018-11-13 2023-09-19 Dolby Laboratories Licensing Corporation Representing spatial audio by means of an audio signal and associated metadata
WO2021053266A3 (fr) * 2019-09-17 2021-04-22 Nokia Technologies Oy Codage de paramètres audio spatiaux et décodage associé
CN114503610A (zh) * 2019-10-10 2022-05-13 诺基亚技术有限公司 用于沉浸式通信的增强定向信令
WO2021155460A1 (fr) * 2020-02-03 2021-08-12 Voiceage Corporation Commutation entre des modes de codage stéréo dans un codec sonore multicanal
EP4085661A4 (fr) * 2020-02-28 2023-01-25 Nokia Technologies Oy Représentation audio et rendu associé
JP2023516303A (ja) * 2020-02-28 2023-04-19 ノキア テクノロジーズ オサケユイチア オーディオ表現および関連するレンダリング
WO2023066456A1 (fr) * 2021-10-18 2023-04-27 Nokia Technologies Oy Génération de métadonnées dans un audio spatial
WO2024199802A1 (fr) 2023-03-24 2024-10-03 Nokia Technologies Oy Codage de métadonnées hors synchronisation au niveau de la trame
WO2024199873A1 (fr) 2023-03-24 2024-10-03 Nokia Technologies Oy Décodage de métadonnées hors synchronisation au niveau trame
WO2024199874A1 (fr) 2023-03-31 2024-10-03 Nokia Technologies Oy Harmonisation de direction de métadonnées spatiales

Similar Documents

Publication Publication Date Title
WO2019105575A1 (fr) Détermination de codage de paramètre audio spatial et décodage associé
JP5081838B2 (ja) オーディオ符号化及び復号
US8817992B2 (en) Multichannel audio coder and decoder
EP3707706B1 (fr) Détermination d'un codage de paramètre audio spatial et décodage associé
JPWO2006022190A1 (ja) オーディオエンコーダ
US20240185869A1 (en) Combining spatial audio streams
CN114365218A (zh) 空间音频参数编码和相关联的解码的确定
WO2019129350A1 (fr) Détermination de codage de paramètre audio spatial et décodage associé
WO2020016479A1 (fr) Quantification éparse de paramètres audio spatiaux
EP3776545B1 (fr) Quantification de paramètres audio spatiaux
AU2021288690A1 (en) Methods and devices for encoding and/or decoding spatial background noise within a multi-channel input signal
WO2019106221A1 (fr) Traitement de paramètres audio spatiaux
WO2022038307A1 (fr) Opération de transmission discontinue pour des paramètres audio spatiaux
EP4396814A1 (fr) Descripteur de silence utilisant des paramètres spatiaux
WO2022223133A1 (fr) Codage de paramètres spatiaux du son et décodage associé
JP6235725B2 (ja) マルチ・チャンネル・オーディオ信号分類器
WO2024175320A1 (fr) Valeurs de priorité aux fins d'un codage audio spatial paramétrique
CA3208666A1 (fr) Transformation de parametres audio spatiaux
WO2020201619A1 (fr) Représentation audio spatiale et rendu associé
JP2022517992A (ja) 高分解能オーディオコーディング

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17808080

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17808080

Country of ref document: EP

Kind code of ref document: A1