WO2023066456A1 - Metadata generation within spatial audio - Google Patents


Info

Publication number
WO2023066456A1
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
audio signal
information
parameters
resolution
Application number
PCT/EP2021/078853
Other languages
French (fr)
Inventor
Tapani PIHLAJAKUJA
Lasse Juhani Laaksonen
Miikka Tapani Vilermo
Arto Juhani Lehtiniemi
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy
Priority to PCT/EP2021/078853
Publication of WO2023066456A1


Classifications

    • G10L19/167: Audio streaming, i.e. formatting and decoding of an encoded audio signal representation into a data stream for transmission or storage purposes
    • G10L19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L19/22: Mode decision, i.e. based on audio signal content versus external parameters
    • G10L21/0208: Noise filtering

Definitions

  • The present application relates to apparatus and methods for capture-side metadata generation, but not exclusively to capture-side metadata generation for metadata-assisted spatial audio.
  • The immersive voice and audio services (IVAS) codec is an extension of the 3GPP EVS (enhanced voice services) codec and is intended for new immersive voice and audio services over 4G/5G.
  • Such immersive services include, e.g., immersive voice and audio for virtual reality (VR).
  • the multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is expected to support a variety of input formats, such as channel-based and scene-based inputs. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
  • Metadata-assisted spatial audio is one input format proposed for IVAS. It uses audio signal(s) together with corresponding spatial metadata.
  • the spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain for example, directions and direct-to-total energy ratios in frequency bands.
  • the MASA stream can, for example, be obtained by capturing spatial audio with microphones of a suitable capture device. For example a mobile device comprising multiple microphones may be configured to capture microphone signals where the set of spatial metadata can be estimated based on the captured microphone signals.
  • the MASA stream can be obtained also from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (for example, a 5.1 audio channel mix) or other content by means of a suitable format conversion.
  • an apparatus for generating spatial audio signal parameters comprising means configured to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
  • the metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
  • the means may be configured to encode the metadata parameters.
  • the means may be further configured to: generate at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encode the at least one transport audio signal.
  • the means may be further configured to: combine the encoded at least one transport audio signals and the encoded metadata parameters; and transmit to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
  • the apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
  • the status of a signal enhancement processing applied to the at least one audio signal may be a status of wind noise suppression or reduction processing.
  • the means configured to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may be configured to: determine a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determine a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
  • the means configured to generate metadata control information based on the apparatus information may be configured to generate a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
  • the means configured to generate metadata control information based on the apparatus information may be configured to determine an audio signal selection control, wherein the audio selection control may be configured to select at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
  • the metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
  • the means configured to generate metadata parameters based on the metadata control information may be configured to generate the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
  • the means configured to generate metadata control information based on the apparatus information may be configured to determine a direction dependent reliability control based on microphone location configuration on the apparatus.
  • the means configured to generate metadata parameters based on the metadata control information may be configured to generate at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second timeresolution window, the second time resolution window being longer than the first time-resolution window.
  • the means configured to generate metadata parameters based on the metadata control information may be configured to select one of the at least two sets of metadata parameters based on the direction dependent reliability control.
  • a method for an apparatus for generating spatial audio signal parameters comprising: obtaining apparatus information; generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generating metadata parameters based on the metadata control information.
  • the metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
  • the method may further comprise encoding the metadata parameters.
  • the method may further comprise: generating at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encoding the at least one transport audio signal.
  • the method may further comprise: combining the encoded at least one transport audio signals and the encoded metadata parameters; and transmitting to a further device and/or storing the combined encoded at least one transport audio signals and metadata parameters.
  • the apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
  • the status of a signal enhancement processing applied to the at least one audio signal may be a status of wind noise suppression or reduction processing.
  • generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may comprise: determining a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determining a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
  • Generating metadata control information based on the apparatus information may comprise generating a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
  • Generating metadata control information based on the apparatus information may comprise determining an audio signal selection control, wherein the audio selection control may comprise selecting at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
  • the metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
  • Generating metadata parameters based on the metadata control information may comprise generating the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
  • Generating metadata control information based on the apparatus information may comprise determining a direction dependent reliability control based on microphone location configuration on the apparatus.
  • Generating metadata parameters based on the metadata control information may comprise generating at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time resolution window being longer than the first time-resolution window.
  • Generating metadata parameters based on the metadata control information may comprise selecting one of the at least two sets of metadata parameters based on the direction dependent reliability control.
  • an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
  • the metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
  • the apparatus may be caused to encode the metadata parameters.
  • the apparatus may be further caused to: generate at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encode the at least one transport audio signal.
  • the apparatus may be further caused to: combine the encoded at least one transport audio signals and the encoded metadata parameters; and transmit to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
  • the apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
  • the status of a signal enhancement processing applied to the at least one audio signal may be a status of wind noise suppression or reduction processing.
  • the apparatus caused to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may be caused to: determine a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determine a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
  • the apparatus caused to generate metadata control information based on the apparatus information may be caused to generate a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
  • the apparatus caused to generate metadata control information based on the apparatus information may be caused to determine an audio signal selection control, wherein the audio selection control may be configured to select at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
  • the metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
  • the apparatus caused to generate metadata parameters based on the metadata control information may be caused to generate the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
  • the apparatus caused to generate metadata control information based on the apparatus information may be caused to determine a direction dependent reliability control based on microphone location configuration on the apparatus.
  • the apparatus caused to generate metadata parameters based on the metadata control information may be caused to generate at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time-resolution window being longer than the first time-resolution window.
  • the apparatus caused to generate metadata parameters based on the metadata control information may be caused to select one of the at least two sets of metadata parameters based on the direction dependent reliability control.
  • an apparatus comprising: means for obtaining apparatus information; means for generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and means for generating metadata parameters based on the metadata control information.
  • a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
  • an apparatus comprising: obtaining circuitry configured to obtain apparatus information; generating circuity configured to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generating circuitry configured to generate metadata parameters based on the metadata control information.
  • a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
  • An apparatus comprising means for performing the actions of the method as described above.
  • An apparatus configured to perform the actions of the method as described above.
  • a computer program comprising program instructions for causing a computer to perform the method as described above.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments
  • Figure 2 shows schematically the metadata generator and encoder according to some embodiments
  • Figure 3 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 according to some embodiments
  • Figure 4 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 with respect to wind noise reduction embodiments;
  • Figure 5 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 with respect to spatial capture embodiments;
  • Figure 6 shows schematically an example device suitable for implementing the apparatus shown.
  • Metadata-Assisted Spatial Audio is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
  • spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and associated with each direction a direct-to-total ratio, spread coherence, distance, etc.) per timefrequency tile.
  • the spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene.
  • A reasonable design choice which is able to produce a good quality output is one where the spatial metadata comprises one or more directions for each time-frequency subframe, with associated direct-to-total ratios, spread coherence, distance values etc. determined for each direction.
  • bandwidth and/or storage limitations may require a codec not to send spatial metadata parameter values for each frequency band and temporal sub-frame.
  • parametric spatial metadata representation can use multiple concurrent spatial directions.
  • For MASA, the proposed maximum number of concurrent directions is two.
  • For each direction, parameters such as direction index, direct-to-total ratio, spread coherence, and distance are defined.
  • Other, non-directional parameters such as diffuse-to-total energy ratio, surround coherence, and remainder-to-total energy ratio are also defined.
  • The MASA format may further comprise other parameters.
  • the IVAS codec is configured to operate with a frame size of 20 ms.
  • The MASA metadata update may be configured to synchronize with this frame size (20 ms), although the typical MASA metadata update interval may be shorter, e.g., 5 ms. This is achieved by combining multiple subframes into one synchronized frame.
  • MASA format is a generic versatile format that allows high-resolution format inputs (24 frequency bands and 4 time subframes with 5-ms time resolution).
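As a rough illustration only (not the normative MASA bitstream layout or quantisation), the per-tile parameters discussed above could be held at the native resolution of 24 frequency bands and 4 subframes as sketched below; the class, field and constant names are assumptions of this sketch.

```python
from dataclasses import dataclass, field
import numpy as np

N_BANDS = 24      # MASA native frequency bands
N_SUBFRAMES = 4   # 5 ms subframes per 20 ms frame

def _tile_grid() -> np.ndarray:
    # One value per time-frequency tile (subframe x band)
    return np.zeros((N_SUBFRAMES, N_BANDS))

@dataclass
class MasaFrameMetadata:
    """Illustrative container of MASA-style spatial metadata for one frame."""
    azimuth_deg: np.ndarray = field(default_factory=_tile_grid)
    elevation_deg: np.ndarray = field(default_factory=_tile_grid)
    direct_to_total: np.ndarray = field(default_factory=_tile_grid)
    spread_coherence: np.ndarray = field(default_factory=_tile_grid)
    surround_coherence: np.ndarray = field(default_factory=_tile_grid)
    diffuse_to_total: np.ndarray = field(default_factory=_tile_grid)
```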
  • MASA-format compression methods can analyze the metadata and, based on the metadata resolution of the input, select a coding mode. Specifically, the compression method can select between producing an encoding with better time resolution or better frequency resolution. In practice, better time resolution is preferred whenever it is available in the input, which effectively reduces the frequency resolution at all but the highest coding bitrates. This means that time resolution is typically deemed more important by the codec.
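A minimal sketch of the kind of decision described above (not the actual IVAS/MASA codec logic): the metadata is inspected to see whether it actually varies across the subframes of a frame, and a time-priority path is chosen only if it does. The function name and the tolerance are illustrative assumptions.

```python
import numpy as np

def prefers_time_resolution(azimuth_deg: np.ndarray, tol_deg: float = 1.0) -> bool:
    """Return True if the metadata varies across the subframes of a frame.

    azimuth_deg has shape (n_subframes, n_bands). If every band carries
    (almost) the same direction in all subframes, a codec of the kind
    described can spend its bits on frequency resolution instead; if the
    subframes differ, better time resolution is preferred.
    """
    # Per-band variation of the parameter over the subframes of the frame
    spread = azimuth_deg.max(axis=0) - azimuth_deg.min(axis=0)
    return bool(np.any(spread > tol_deg))
```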
  • the MASA format is generated often by capture systems (e.g., a mobile device) which have multiple microphones (often more than the number of transport signals) and complex algorithms to process the audio signals and produce the metadata.
  • the capture system can have more information on what parts of the captured audio scene are important for the listener than the encoder will have. For example, even though the encoder by default prefers a better time resolution, the capture system may know that a better frequency resolution provides better quality and is therefore more important.
  • the encoder is strictly limited by the standard specification and remains generic to support a multitude of different sources for the MASA format. Thus, it is not possible to implement complex capture-device-specific adaptations into the encoder. However, generic adaptations are present as mentioned above. The standard specification, including these generic adaptations, is known to the capture device.
  • the capture device cannot directly control the encoding process, even though doing so would be beneficial, as the process is completely autonomous and based only on the input audio signal and metadata.
  • the capture device should therefore be configured to modify the generated MASA format in such a way that the encoder is forced onto the optimal encoding path.
  • The desire for this kind of system is further emphasized once the IVAS standard is finalized, as all further development should then be implemented on the capture device in order to prevent the whole codec from having to be re-specified.
  • the concept as discussed herein in further detail with respect to the following embodiments is related to coding and generation of parametric spatial audio metadata, for example, MASA spatial audio metadata for use by 3GPP IVAS.
  • the quality of the coded MASA metadata is improved by having an automatic capture-side metadata generation optimization.
  • the automatic capture-side metadata generation is configured to obtain information about the expected bitrate, coding strategy of the codec, and all the other information available on the capture-side.
  • the capture-side metadata generation is configured to select preferences, for example, for: time and/or frequency resolution, presence of coherence parameters, and/or number of concurrent directions.
  • the selected preference is based on the additional information available on the capture side in addition to the codec coding strategy.
  • information available on the capture side can comprise: raw and/or processed captured audio signals, raw and/or processed spatial analysis results, estimates of spatial analysis performance, multiple different spatial analysis results, and/or signal enhancement algorithm (wind noise reduction, audio focus, etc.) status information.
  • the metadata generation is based on the capture side information and known codec specification to control an autonomously selected coding strategy within a codec such that the metadata generated is therefore more optimal for the current capture situation.
  • For example, the time-frequency resolution should be optimized; and in mobile-device audio capture, the reliability of the spatial analysis depends on the identified directions of the sound sources and the capture configuration. In such situations an improved or optimized time-frequency resolution should be based on the reliability of the capture.
  • the system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131.
  • the ‘analysis’ part 121 is the part from receiving the multi-channel signals up to an encoding of the spatial metadata and transport signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded spatial metadata and transport signal to the presentation of the regenerated signal (for example in multi-channel loudspeaker form).
  • the ‘analysis’ part 121 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part. In other words in some embodiments the ‘analysis’ part 121 is an encoder comprising at least one of the transport signal generator or analysis processor as described hereafter.
  • the input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102.
  • the ‘analysis’ part 121 may comprise a transport signal generator 103, analysis processor 105, and encoder 107.
  • a microphone channel signal input is described, which can be two or more microphones integrated or connected onto a mobile device (e.g., a smartphone).
  • any suitable input (or synthetic multi-channel) format may be implemented in other embodiments.
  • suitable audio signals format inputs could be microphone arrays, e.g., B-format microphone, planar microphone array or Eigenmike, Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA), loudspeaker surround mix and/or objects, artificially created spatial mix, e.g., from audio or VR teleconference bridge, or combinations of the above.
  • the multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.
  • the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable audio signal format for encoding.
  • the transport signal generator 103 can for example generate a stereo or mono audio signal.
  • the transport audio signals generated by the transport signal generator can be any known format.
  • the transport signal generator 103 can be configured to select a left-right microphone pair, and apply any suitable processing to the audio signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization.
  • the transport signal generator can be configured to formulate directional beam signals towards left and right directions, such as two opposing cardioid signals. Additionally in some embodiments when the input is a loudspeaker surround mix and/or objects, then the transport signal generator 103 can be configured to generate a downmix signal that combines left side channels to a left downmix channel, combines right side channels to a right downmix channel and adds centre channels to both transport channels with a suitable gain.
  • the transport signal generator is bypassed (or in other words is optional).
  • Where the analysis and synthesis occur at the same device in a single processing step, without intermediate processing, there is no transport signal generation and the input audio signals are passed unprocessed.
  • the number of transport channels generated can be any suitable number.
  • the output of the transport signal generator 103 can be passed to an encoder 107.
  • the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce the spatial metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104.
  • the spatial metadata associated with the audio signals may be provided to the encoder as a separate bit-stream.
  • the multichannel signals 102 input comprises spatial metadata and this is passed directly to the encoder 107.
  • the analysis processor 105 may be configured to generate the spatial metadata parameters which may comprise, for each time-frequency analysis interval, at least one direction parameter 108 and at least one energy ratio parameter 110 (and in some embodiments other parameters such as described earlier and of which a non-exhaustive list includes number of directions, surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio, a spread coherence parameter, and distance parameter).
  • the direction parameter may be represented in any suitable manner, for example as spherical co-ordinates denoted as azimuth φ(k,n) and elevation θ(k,n).
  • the number of the spatial metadata parameters may differ from time-frequency tile to time-frequency tile.
  • For example, in band X all of the spatial metadata parameters are obtained (generated) and transmitted, whereas in band Y only one of the spatial metadata parameters is obtained and transmitted, and in band Z no parameters are obtained or transmitted.
  • the spatial metadata 106 may be passed to an encoder 107.
  • the analysis processor 105 is configured to apply a time-frequency transform for the input signals. Then, for example, in time-frequency tiles when the input is a mobile phone microphone array, the analysis processor could be configured to estimate delay-values between microphone pairs that maximize the inter-microphone correlation. Then based on these delay values the analysis processor may be configured to formulate a corresponding direction value for the spatial metadata. Furthermore the analysis processor may be configured to formulate a direct-to-total ratio parameter based on the correlation value.
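A minimal sketch of the delay-search principle described above for a single microphone pair and one band-limited signal; the search range, the far-field delay-to-angle conversion and the use of the correlation value as a ratio proxy are assumptions of this sketch, not the analysis processor of any particular device.

```python
import numpy as np

def estimate_direction_from_pair(x1: np.ndarray, x2: np.ndarray,
                                 fs: float, mic_distance_m: float,
                                 max_delay_samples: int = 8) -> tuple[float, float]:
    """Estimate an arrival angle and a correlation-based ratio from one mic pair."""
    best_corr, best_delay = -1.0, 0
    # Search the inter-microphone delay that maximises normalised correlation
    for d in range(-max_delay_samples, max_delay_samples + 1):
        shifted = np.roll(x2, d)
        denom = np.linalg.norm(x1) * np.linalg.norm(shifted) + 1e-12
        corr = float(np.dot(x1, shifted)) / denom
        if corr > best_corr:
            best_corr, best_delay = corr, d
    c = 343.0  # speed of sound, m/s
    sin_theta = np.clip(best_delay / fs * c / mic_distance_m, -1.0, 1.0)
    angle_deg = float(np.degrees(np.arcsin(sin_theta)))
    direct_to_total = max(0.0, best_corr)  # crude ratio proxy from the correlation
    return angle_deg, direct_to_total
```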
  • the analysis processor 105 can be configured to determine an intensity vector.
  • the analysis processor may then be configured to determine a direction parameter value for the spatial metadata based on the intensity vector.
  • a diffuse-to-total ratio can then be determined, from which a direct-to-total ratio parameter value for the spatial metadata can be determined.
  • This analysis method is known in the literature as Directional Audio Coding (DirAC).
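A compact sketch of the DirAC-style estimation described above for one time-frequency tile of FOA signals; the normalisation and the sign convention of the intensity vector are simplified assumptions and depend on the ambisonic channel and DOA conventions used.

```python
import numpy as np

def dirac_parameters(W: np.ndarray, X: np.ndarray, Y: np.ndarray, Z: np.ndarray):
    """DirAC-style direction and ratio for one TF tile of complex FOA STFT bins."""
    # Active intensity vector (direction of energy flow; DOA sign convention varies)
    I = np.array([np.mean(np.real(np.conj(W) * X)),
                  np.mean(np.real(np.conj(W) * Y)),
                  np.mean(np.real(np.conj(W) * Z))])
    energy = 0.5 * np.mean(np.abs(W) ** 2 + np.abs(X) ** 2 + np.abs(Y) ** 2 + np.abs(Z) ** 2)
    azimuth = float(np.degrees(np.arctan2(I[1], I[0])))
    elevation = float(np.degrees(np.arctan2(I[2], np.linalg.norm(I[:2]) + 1e-12)))
    # Diffuseness from the ratio of net intensity to total energy (illustrative)
    diffuseness = 1.0 - np.linalg.norm(I) / (energy + 1e-12)
    direct_to_total = 1.0 - float(np.clip(diffuseness, 0.0, 1.0))
    return azimuth, elevation, direct_to_total
```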
  • the analysis processor 105 can be configured to divide the HOA signal into multiple sectors, in each of which the method above is utilized.
  • This sector-based method is known in the literature as higher order DirAC (HO-DirAC).
  • the analysis processor can be configured to convert the signal into a FOA/HOA signal(s) format and to obtain direction and direct-to-total ratio parameter values as above.
  • the analysis processor 105 can as described above be configured to generate metadata parameters for the MASA format stream.
  • the metadata parameters are typically generated in the time-frequency (TF) domain and produce parameters for each time-frequency tile.
  • The number of TF-tiles, i.e., the TF-resolution, may be adjusted for metadata generation.
  • a capture system is configured to analyze metadata parameters using one time-frequency resolution which can (in an optimal case) match the TF-resolution of the target format (for example 24 frequency bands and 4 subframes of 5 ms in the MASA format).
  • the underlying TF-resolution may be different than the native TF- resolution of the format to take advantage of the internal codec algorithms and force specific coding decisions.
  • adjustments of metadata TF-resolution can be implemented within the capture system for metadata generation.
  • With respect to the analysis and target TF-resolutions, it can be possible to analyze at a single TF-resolution that is overall judged to be “better” than the target TF-resolution, or that may be, e.g., the “best” possible target resolution.
  • An example of the first option would be, for example, a TF-resolution with 240 frequency bands and 4 subframes.
  • An example of the second option would be the MASA-format native resolution of 24 frequency bands and 4 subframes. Regardless of the option used, if the actual desired target is lower in terms of frequency resolution, then some form of metadata reduction via combination is performed.
  • this can be implemented by reducing MASA format related spatial metadata to fewer frequency bands and time frames by a combination of weighted direction vectors and weighted averages of metadata parameters over time and frequency to form a reduced (in terms of TF-tile resolution) set of metadata parameters.
  • This can be directly applicable if the reduced TF-resolution is a direct subset of the source TF-resolution. In other words where the band limits and subframe borders align. Otherwise, some minor or trivial adaptations are implemented.
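A minimal sketch of the weighted combination described above, reducing the subframes of one frequency band to a single tile; the use of energy-times-ratio weights for the direction vectors is an assumption of this sketch.

```python
import numpy as np

def reduce_subframes(az_deg, el_deg, ratio, energy):
    """Combine the subframes of one frequency band into a single tile.

    az_deg, el_deg, ratio and energy are 1-D arrays over subframes. Directions
    are combined as weighted unit vectors, the ratio as an energy-weighted mean.
    """
    az = np.radians(np.asarray(az_deg, dtype=float))
    el = np.radians(np.asarray(el_deg, dtype=float))
    w = np.asarray(energy, dtype=float) * np.asarray(ratio, dtype=float)
    vec = np.stack([np.cos(el) * np.cos(az),
                    np.cos(el) * np.sin(az),
                    np.sin(el)], axis=-1) * w[:, None]
    merged = vec.sum(axis=0)
    merged_az = float(np.degrees(np.arctan2(merged[1], merged[0])))
    merged_el = float(np.degrees(np.arctan2(merged[2], np.linalg.norm(merged[:2]) + 1e-12)))
    merged_ratio = float(np.average(ratio, weights=np.asarray(energy, dtype=float) + 1e-12))
    return merged_az, merged_el, merged_ratio
```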
  • If the analyzer TF-resolution is lower than the target TF-resolution, then more parameter values need to be generated.
  • An effective solution is to replicate the existing parameter values into the better TF-resolution.
  • If the lower resolution is a direct subset of the better resolution, then a simple assignment of values can be mapped from the lower-resolution TF-tiles to the corresponding better (higher) resolution TF-tiles. If no direct mapping is possible (the lower resolution is not a direct subset), then the frequency band and time subframe limits of the lower TF-resolution can be compared to the corresponding limits of the better TF-resolution and parameters are assigned based on the nearest corresponding TF-tile.
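A small sketch of the nearest-tile assignment described above for the frequency dimension (the time dimension is handled analogously); the band-edge inputs and helper name are illustrative assumptions.

```python
import numpy as np

def replicate_to_target(values, src_band_edges_hz, dst_band_edges_hz):
    """Map per-band parameter values from a coarse band grid to a finer one.

    For each target band, the source band whose centre frequency is nearest
    to the target band centre is used (nearest-tile assignment).
    """
    src = np.asarray(src_band_edges_hz, dtype=float)
    dst = np.asarray(dst_band_edges_hz, dtype=float)
    src_centres = 0.5 * (src[:-1] + src[1:])
    dst_centres = 0.5 * (dst[:-1] + dst[1:])
    idx = np.array([int(np.argmin(np.abs(src_centres - c))) for c in dst_centres])
    return np.asarray(values)[idx]
```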
  • the encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport audio signals 104 and generate a suitable encoding of these audio signals.
  • the encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the audio encoding may be implemented using any suitable scheme.
  • the encoder 107 may furthermore comprise a spatial metadata encoder/quantizer 111 which is configured to receive the spatial metadata and output an encoded or compressed form of the information.
  • the encoder 107 may further interleave, multiplex to a single data stream or embed the spatial metadata within the encoded downmix signals before transmission or storage, as shown in Figure 1 by the dashed line.
  • the multiplexing may be implemented using any suitable scheme.
  • the transport signal generator 103 and/or analysis processor 105 may be located on a separate device (or otherwise separate) from the encoder 107.
  • the spatial metadata (and associated non-spatial metadata) parameters associated with the audio signals may be provided to the encoder as a separate bit-stream.
  • the transport signal generator 103 and/or analysis processor 105 may be part of the encoder 107, i.e., located inside of the encoder and be on a same device.
  • the ‘synthesis’ part 131 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part.
  • the received or retrieved data may be received by a decoder/demultiplexer 133.
  • the decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport signal decoder 135 which is configured to decode the audio signals to obtain the transport audio signals.
  • the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded spatial metadata (for example a direction index representing a direction parameter value) and generate spatial metadata.
  • the decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
  • the decoded metadata and transport audio signals may be passed to a synthesis processor 139.
  • the system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the transport audio signal and the spatial metadata and re-create, in any suitable format, a synthesized spatial audio in the form of multichannel signals 140 (these may be a multichannel loudspeaker format or in some embodiments any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the transport signals and the spatial metadata.
  • the synthesis processor 139 thus creates the output audio signals, e.g., multichannel loudspeaker signals or binaural signals based on any suitable known method. This is not explained here in further detail.
  • the rendering can be performed for loudspeaker output, for example, according to the following steps (a sketch of these steps is given after this list).
  • the transport audio signals can be divided to direct and ambient streams based on the direct-to-total and diffuse-to-total energy ratios.
  • the direct stream can then be rendered based on the direction parameter(s) using amplitude panning.
  • the ambient stream can furthermore be rendered using decorrelation.
  • the direct and the ambient streams can then be combined.
  • the output signals can be reproduced using a multichannel loudspeaker setup or headphones which may be head-tracked.
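A toy sketch of the rendering steps listed above for one time-frequency tile of a mono transport signal and a two-loudspeaker output; the simple sine/cosine panning law and the sign-flip "decorrelation" are crude stand-ins for the proper techniques, and all names are illustrative.

```python
import numpy as np

def render_tile_stereo(transport: np.ndarray, azimuth_deg: float,
                       direct_to_total: float) -> tuple[np.ndarray, np.ndarray]:
    """Split a tile into direct/ambient parts, pan the direct part, add ambience."""
    direct = transport * np.sqrt(max(0.0, direct_to_total))
    ambient = transport * np.sqrt(max(0.0, 1.0 - direct_to_total))
    # Map azimuth in [-30, +30] degrees onto a 0..1 pan position (illustrative)
    pan = float(np.clip((azimuth_deg + 30.0) / 60.0, 0.0, 1.0))
    g_l, g_r = np.cos(pan * np.pi / 2.0), np.sin(pan * np.pi / 2.0)
    left = g_l * direct + ambient / np.sqrt(2.0)
    right = g_r * direct - ambient / np.sqrt(2.0)   # crude decorrelation by sign flip
    return left, right
```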
  • microphone signals from a mobile device are processed with a spatial audio capture system (containing the analysis processor and the transport signal generator), and the resulting spatial metadata and transport audio signals (e.g., in the form of a MASA stream) are forwarded to an encoder (e.g., an IVAS encoder), which contains the encoder.
  • In some embodiments, input signals (e.g., 5.1 channel audio signals) may be forwarded directly to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder.
  • there can be two (or more) input audio signals where the first audio signal is processed by the apparatus shown in Figure 1 (resulting in data as an input for the encoder) and the second audio signal is directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder.
  • the audio input signals may then be encoded in the encoder independently or they may, e.g., be combined in the parametric domain according to what may be called, e.g., MASA mixing.
  • The synthesis part can comprise separate decoder and synthesis processor entities or apparatus, or it can comprise a single entity which comprises both the decoder and the synthesis processor.
  • the decoder block may process in parallel more than one incoming data stream.
  • The synthesis processor may be interpreted as an internal or external renderer.
  • the system is configured to receive multi-channel audio signals. Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting some of the audio signal channels). The system is then configured to encode for storage/transmission the transport audio signal. After this the system may store/transmit the encoded transport audio signal and metadata. The system may retrieve/receive the encoded transport audio signal and metadata. Then the system is configured to extract the transport audio signal and metadata from encoded transport audio signal and metadata parameters, for example demultiplex and decode the encoded transport audio signal and metadata parameters.
  • the system (synthesis part) is configured to synthesize an output multichannel audio signal based on extracted transport audio signal and metadata.
  • the apparatus comprises a suitable audio signal capture input 201.
  • the audio signal input is configured to obtain the captured input audio signals 102 and pass these to the spatial analyser 215, the analysis configurator 213, the transport signal generation configurator 207 and the transport signal generator 203.
  • the input audio signals 102 comprise microphone audio signals; in other words, the audio signal capture input 201 comprises a suitable microphone, microphones or microphone array configuration.
  • Other suitable audio signal formats and configurations can be employed in some embodiments.
  • the apparatus comprises a capture device status information input 209.
  • the capture device status information input 209 is configured to obtain the capture device information 210 and pass this information to the transport signal generation configurator 207 and the analysis configurator 213.
  • Information 210 from the capture device status information can in some embodiments comprise information with respect to the microphone configuration or arrangement (where the audio signals are microphone audio signals).
  • the information 210 comprises raw and/or processed spatial analysis results from the audio signals (which can in some embodiments be the spatial analysis before conversion to encoder input format but can be after noise reduction or other processing operations).
  • the information comprises an estimate of the spatial analysis performance.
  • the information comprises multiple different spatial analysis results.
  • the information 210 in some embodiments comprises a signal enhancement algorithm (wind noise reduction, audio focus, etc.) status information.
  • the apparatus comprises an encoder 221.
  • the encoder 221 is configured in some embodiments to output encoder information 218.
  • the encoder information 218 in some embodiments comprises information defining the encoder standard specification.
  • the encoder specification can for example define the input format specification (such as defining the time-frequency resolution) or may define metadata encoding details (for example bitrate specific reductions in metadata and how the metadata is quantized).
  • the encoder information 218 furthermore in some embodiments comprises information about current target and/or previous realized bitrates for the encoder. This information can be different than the bitrate the device requests the encoder to use.
  • Other encoder information 218 can also describe mode configuration (DTX, etc.).
  • the apparatus comprises a transport signal generation configurator 207.
  • the transport signal generation configurator 207 in some embodiments is configured to receive the input audio signals 102, the capture device status information 210, the encoder information 218 (and in some embodiments furthermore information 212 from an analysis configurator 213) and determine a configuration for transport audio signal generation. The configuration for transport audio signal generation 208 can then be passed to the transport signal generator 203.
  • the apparatus comprises a transport signal generator 203.
  • the transport signal generator 203 is configured to obtain the configuration for transport audio signal generation 208 and the input audio signals 102 and generate the encoder optimised transport signal(s) 204 which can be passed to the encoder 221.
  • the apparatus in some embodiments comprises an analysis configurator 213.
  • the analysis configurator 213 is in some embodiments configured to receive the input audio signals 102, the capture device status information 210, the encoder information 218 (and in some embodiments furthermore information 212 from the transport signal generation configurator 207) and determine a configuration for the spatial analyser 215.
  • the configuration 223 for the spatial analyser 215 can then be passed to the spatial analyser 215.
  • the apparatus comprises a spatial analyser 215.
  • the spatial analyser 215 is configured to obtain the configuration 223 for spatial analysis and the input audio signals 102 and generate the encoder optimised metadata 216 which can be passed to the encoder 221.
  • Based on this information, the improved (optimal) configuration for spatial analysis can be deduced.
  • For metadata generation, this can mean in practice that the time-frequency resolution of the spatial analysis is selected in such a way that the encoder is directed into a preferred coding path based on the pre-selected / optimized characteristics of the input spatial metadata.
  • The analysis for different parameters (e.g., coherence parameters) can also be configured in this way.
  • For the transport audio signals, the practical control is the selection (or combination) of optimal microphones for the transport signals and also the selection of how many transport audio signals should be used. This information is then used to control these two parts of the format generation.
  • Figure 3 shows a flow diagram of the operations of the example apparatus shown in Figure 2 for implementing some embodiments.
  • the method comprises capturing the input audio signals as shown in Figure 3 by step 301.
  • the method comprises obtaining capture device information as shown in Figure 3 by step 303.
  • the method comprises obtaining encoder standard specification (or more generally encoder information) as shown in Figure 3 by step 305.
  • the method can furthermore optionally comprise obtaining encoder feedback information as shown in Figure 3 by step 307.
  • the available data can be analysed to determine an improved analysis configuration (or optimal analysis configuration) for the target codec as shown in Figure 3 by step 309.
  • the spatial analysis is controlled for an improved spatial metadata generation as shown in Figure 3 by step 313.
  • configuration information is configured to enable a control of the transport audio signal generation for an improved transport signal output as shown in Figure 3 by step 315.
  • the transport audio signal is then generated based on the control as shown in Figure 3 by step 317. Furthermore the input audio signals are analysed based on the improved spatial metadata generation control as shown in Figure 3 by step 315.
  • Wind noise reduction is a practical solution to a capture situation that can occur when the apparatus is outside with significant amount of wind present. Wind can cause noise in the microphone signals and can produce poor quality captured microphone audio signals that can be practically unusable. As there are usually multiple microphones present in a single capture device, the problem can be alleviated by temporarily removing noisy microphones and/or algorithmically suppressing the noise in the microphones. This information can be used to control the format generation for the encoder to provide an improved end-to-end audio signal quality.
  • The example flow diagram shown in Figure 4 starts with obtaining the input information, for example the captured input audio signals, the capture device status information (including the wind noise reduction status), and the encoder information.
  • the method in some embodiments is configured to process two independent method flows. Firstly, the status of wind noise reduction (WNR) is analyzed to determine whether wind noise is currently suppressed as shown in Figure 4 by step 409.
  • The status of wind-noise reduction or suppression is, in practice, a binary decision: either there is reduction or suppression or there is not. This process is shown, for example, in GB2109928.8.
  • a wind-noise reduction or suppression status can be indicated by a decision to implement noise reduction or suppression.
  • the microphone signals can be analyzed for low-frequency (below 150 Hz) level differences and, if there is a significant level difference (e.g., above 6 dB), then the probable cause is wind noise. Where wind noise is detected, the decision can be made to implement noise reduction or suppression processing or microphone selection.
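A minimal sketch of such a low-frequency level-difference test, assuming SciPy is available; the filter order and the small epsilon are illustrative choices, and the thresholds default to the values mentioned above.

```python
import numpy as np
from scipy.signal import butter, lfilter

def wind_noise_detected(mic_signals: np.ndarray, fs: float,
                        cutoff_hz: float = 150.0, threshold_db: float = 6.0) -> bool:
    """Flag probable wind noise from low-frequency level differences.

    mic_signals has shape (n_mics, n_samples). Each microphone signal is
    low-pass filtered below ~150 Hz and the per-microphone levels compared;
    a spread larger than ~6 dB suggests wind noise on at least one microphone.
    """
    b, a = butter(2, cutoff_hz / (fs / 2.0), btype="low")
    low = lfilter(b, a, mic_signals, axis=-1)
    levels_db = 10.0 * np.log10(np.mean(low ** 2, axis=-1) + 1e-12)
    return bool(levels_db.max() - levels_db.min() > threshold_db)
```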
  • Where wind noise is currently being suppressed, a long time window is selected for the TF-window for the analysis of the audio signals, which is then used as shown in Figure 4 by step 413.
  • Otherwise, a short time window can be selected for the TF-window for the analysis of the audio signals, which is then used as shown in Figure 4 by step 411.
  • this information is then combined with the encoder information (and optional encoder feedback information) in order to select an improved (optimal) time-frequency resolution for the spatial analysis as shown in Figure 4 by step 415.
  • For example, the short time window can be a 5 ms time resolution with 4 frequency bands, and the long time window a 20 ms time resolution with 16 frequency bands.
  • In another example, the short time window can be a 5 ms time resolution with 12 frequency bands, and the long time window a 20 ms time resolution with 24 frequency bands.
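A minimal sketch of this selection, assuming the second example pairing above and a simple boolean WNR status; in practice the encoder information and available bitrate would also be taken into account, and the function name is illustrative.

```python
def select_tf_resolution(wind_noise_suppressed: bool) -> tuple[float, int]:
    """Pick an analysis time window and band count from the WNR status.

    Returns (time_resolution_ms, n_frequency_bands), mirroring the example
    pairings given above.
    """
    if wind_noise_suppressed:
        return 20.0, 24   # long window, better frequency resolution
    return 5.0, 12        # short window, better time resolution
```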
  • The second, independent or parallel, decision with respect to wind noise is whether microphone selection is to be performed in order to implement wind noise reduction or suppression; in other words, a decision on which microphones are used in the spatial analysis or transport signal generation.
  • the wind noise reduction status input can be used to determine whether wind noise suppression or reduction using microphone selection is being implemented, as shown in Figure 4 by step 417. This can, for example, employ a similar detection decision as described above, where microphones whose audio signals indicate that wind noise is present (‘windy’ microphones) are indicated as not being selected (or, equivalently, microphones whose audio signals indicate that wind noise is not present are indicated as being selected).
  • an indicator, signal or other control is generated and passed to the spatial analyzer and transport signal generator in order that the usable (or non-‘windy’) microphones can be used as shown in Figure 4 by step 421.
  • an indicator, signal or other control is generated such that all of the microphones are selected and used (in the spatial analysis and/or generation of the transport audio signals) as shown in Figure 4 by step 419.
  • the (selected microphone) input signals are analysed to generate the spatial metadata as shown in Figure 4 by step 425 and furthermore transport audio signals can be generated as shown in Figure 4 by step 423.
  • the determination or selection of the time-frequency resolution of spatial analysis and MASA format generation is thus adapted based on the wind noise situation and is used to direct the codec onto a more optimal processing path.
  • This example can also implement an analysis/coding where a better/higher time resolution is preferred over a better/higher frequency resolution (when a better/higher time resolution is available).
  • where the spatial metadata does not have exactly (or almost exactly) the same parameter values for each time subframe (e.g., 5 ms) within one signal frame (e.g., 20 ms), then a better or higher time resolution can be implemented.
  • the adaptation algorithm can, for example, employ two distinct cases:
  • the capture system is set to provide a same value for each time subframe. In some embodiments this can be implemented by changing the spatial analyzer to directly use a 20 ms time resolution when producing spatial metadata parameters. Alternatively, if direct analysis with the desired resolution is not possible, then spatial metadata is averaged over a 20 ms time window and all subframes within each frequency band and spatial metadata parameters are set to the same value. This can be done, for example, based on the example methods such as shown in UKIPO patent applications 1919130.3 and 1919131.1. In some embodiments with the MASA format metadata manipulated in this way, the metadata codec is directed to a pathway that uses better frequency resolution.
  • spatial analysis is implemented using 5 ms time resolution, i.e., to provide a different value for each time subframe. This directs the codec to a pathway which prefers better time resolution.
  • additional reduction of frequency resolution may be implemented using the methods described above to ensure preference of time resolution.
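  • as a non-authoritative sketch of the averaging case described above, per-subframe metadata can be collapsed to a single frame value by combining ratio-weighted direction vectors and averaging the ratios. The function below assumes simple azimuth/elevation/ratio arrays and is not the method of the referenced applications:

```python
import numpy as np

def force_frame_time_resolution(azimuth_deg, elevation_deg, ratio):
    """Collapse per-subframe spatial metadata to one value per frame.

    Inputs are arrays of shape (num_subframes, num_bands). Directions are
    averaged as ratio-weighted unit vectors and the ratios are averaged
    directly, then every subframe is overwritten with the frame value so
    that the metadata codec is steered towards its frequency-resolution path.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    # Ratio-weighted direction vectors, summed over the subframes
    x = np.sum(ratio * np.cos(el) * np.cos(az), axis=0)
    y = np.sum(ratio * np.cos(el) * np.sin(az), axis=0)
    z = np.sum(ratio * np.sin(el), axis=0)
    az_frame = np.degrees(np.arctan2(y, x))
    el_frame = np.degrees(np.arctan2(z, np.sqrt(x**2 + y**2)))
    ratio_frame = np.mean(ratio, axis=0)
    n_sub = azimuth_deg.shape[0]
    return (np.tile(az_frame, (n_sub, 1)),
            np.tile(el_frame, (n_sub, 1)),
            np.tile(ratio_frame, (n_sub, 1)))
```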
  • the encoder configuration information (codec specification) and used bitrate (where available) can be analysed to reduce frequency resolution directly to try to match the expected end result frequency resolution. This allows more sophisticated metadata reduction or, preferably, analysis in the correct resolution to be employed. In such a manner, some embodiments can provide equal or better quality bitrate reduction algorithms.
  • wind noise can affect microphone signals in such a way that the quality of the coherence parameter analysis may decrease drastically with respect to increasing wind noise.
  • a decrease of analysed coherence parameter values or even disabling the use of coherence parameters by setting them to zero for the complete signal frame can be implemented based on a determination of wind noise.
  • a further example use case is shown with respect to spatial capture reliability determination.
  • This example use case is one where there is information about the reliability of the direction or overall metadata estimates.
  • practical capture devices e.g., mobile phones
  • the example flow diagram shown in Figure 5 starts with obtaining the input information. This includes:
  • a control (or determination) of available spatial analysis time resolution is shown in Figure 5 by step 511.
  • a short-time resolution candidate and long-time resolution candidate can be selected and the corresponding short-time and long-time resolution candidate values passed to the spatial analyzer.
  • the reliability of the metadata estimation is calculated for all possible directions based on the capture device microphone positions as shown in Figure 5 by step 509.
  • the reliability determination can be implemented using any suitable manner.
  • a reliability measure can be implemented for the three different cardinal directions (front/back, up/down, left/right) such that based on the analyzed estimated direction vector, the reliability of the metadata can be determined as a combination of these cardinal directions.
  • a reliability of spatial analysis can be determined based on the method shown in GB1619573.7, where direction estimation using very closely spaced microphones is known to be less accurate than estimation using suitably placed microphones, and thus the reliability of direction estimates close to the axis passing through the very closely spaced microphones is deemed to be lower than the reliability of directions close to the axis passing through the suitably placed microphones.
  • the reliability of the direction estimates is most important at frequencies that are important for speech signals, in other words for a frequency range of 100 Hz to 4 kHz.
  • the reliability of known typical direction estimating methods is best for microphones placed a few centimeters apart from each other. Thus, microphones that are less than a centimeter apart from each other provide less reliable estimates. At lower frequencies this reduced reliability can be due to microphone internal noise and at higher frequencies due to aliasing effects.
  • a spatial analysis reliability parameter is obtained per device as a measurement in a controlled environment. Specific test sounds are produced, captured, and analyzed, and a spatial analysis reliability parameter δ(θ, f) is produced. This parameter is dependent on the spatial direction θ and the frequency f. In practical use, this parameter can also be converted to discrete frequency with frequency band index k by, for example, averaging the value over the continuous frequency f within the low and high limits of the band, f_k,low and f_k,high respectively.
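  • the band conversion mentioned above can be illustrated with the following minimal sketch, assuming the measured reliability curve is available on a discrete frequency grid; the function and argument names are illustrative only:

```python
import numpy as np

def reliability_per_band(delta_theta_f, freqs_hz, band_limits_hz):
    """Average a measured reliability curve delta(theta, f) into bands.

    delta_theta_f: array of shape (num_freq_points,) giving the reliability
    for one direction theta over the continuous-frequency grid freqs_hz.
    band_limits_hz: list of (f_k_low, f_k_high) pairs, one per band k.
    Returns one reliability value per frequency band.
    """
    delta_k = np.empty(len(band_limits_hz))
    for k, (f_low, f_high) in enumerate(band_limits_hz):
        in_band = (freqs_hz >= f_low) & (freqs_hz < f_high)
        delta_k[k] = np.mean(delta_theta_f[in_band])
    return delta_k
```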
  • this can be used to control the spatial metadata formation.
  • the analysis is implemented on a band-by-band basis but the same methods can be adapted to broadband employment.
  • the metadata comprising the direction parameter, is thus determined for both ‘long-time resolution’ and ‘short-time resolution’.
  • a reliability measure can then be found for the ‘long-time resolution’ direction estimate. If the reliability measure for the estimated direction is high, then reliability measures can then be found for the ‘short-time resolution’ direction estimates. If most of the analyzed directions have a high reliability, then the method can be configured to select short-time resolution metadata for use for the current frequency band. Otherwise, long-time resolution metadata is selected as shown in Figure 5 by step 515.
  • this can be formulated with equations as follows, where θ_s(n,k) and θ_l(k) are the respective direction estimates for short and long time resolution.
  • short time resolution direction contains a subframe index n representing time.
  • long time resolution may contain a similar time index.
  • N is the total number of subframes in one frame, e.g., four 5 ms subframes in one 20 ms frame.
  • the short-time resolution measure is formed with a simple mean of values but any other method may be suitable (maximum, minimum, average weighted with energy/energy ratio, etc.).
  • the metadata generation can be set accordingly.
  • the reliability decision can be performed for the whole metadata.
  • the following comparison can be implemented where β is a tuning parameter with usually a value of 1 or larger. If this comparison is true, then the short time resolution is used for the whole metadata and otherwise the method uses long time resolution for the whole metadata. In practice, a specific level of overall reliability can be used for the short time resolution data to be considered. Other relations or even constant comparison (similarly to above) can also be used.
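  • one possible, purely illustrative formulation of the above decisions is sketched below; the reliability arrays, the 0.7 threshold and the exact form of the β comparison are assumptions, since the precise equations may differ in practice:

```python
import numpy as np

def choose_time_resolution(rel_short, rel_long, beta=1.0, high_threshold=0.7):
    """Illustrative per-band and whole-frame resolution decisions.

    rel_short: array of shape (num_subframes, num_bands) with reliability
    values for the short-time direction estimates theta_s(n, k).
    rel_long: array of shape (num_bands,) for the long-time estimates
    theta_l(k). beta is the tuning parameter mentioned above.
    """
    # Per-band decision: short resolution only if the long-time estimate is
    # reliable and most of the short-time estimates are reliable too.
    short_measure = np.mean(rel_short, axis=0)  # simple mean over subframes
    mostly_reliable = np.mean(rel_short > high_threshold, axis=0) > 0.5
    use_short_per_band = (rel_long > high_threshold) & mostly_reliable

    # Whole-metadata decision: one comparison covering all bands, requiring a
    # specific level of overall reliability before short resolution is chosen.
    use_short_everywhere = np.mean(short_measure) >= beta * np.mean(rel_long)
    return use_short_per_band, use_short_everywhere
```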
  • the transport audio signals can be generated as shown in Figure 5 by step 517.
  • the example parameters and the details of the codec can change and, for example, the example time resolution values may change or there may be multiple time resolutions of which to select. Furthermore in some embodiments other parameters can be employed and the proposed methodology trivially adapted to these other parameter values.
  • the decisions or selections, for example for time resolution, are shown to happen instantly and always in relation to the input data.
  • the decisions or selections may have, for example, a hysteresis effect present in such a way that when one mode is selected, it is implemented even if the input data would signal otherwise. This effect may also be implemented in a conservative way. In other words, a more reliable and stable mode (in the sense of perceived audio artifacts, in practice, longer time resolution) can be implemented unless there are multiple consecutive frames signifying that the other mode (in practice, shorter time resolution) is to be selected.
  • the device may be any suitable electronics device or apparatus.
  • the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1400 comprises at least one processor or central processing unit 1407.
  • the processor 1407 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1400 comprises a memory 1411.
  • the at least one processor 1407 is coupled to the memory 1411.
  • the memory 1411 can be any suitable storage means.
  • the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407.
  • the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
  • the device 1400 comprises a user interface 1405.
  • the user interface 1405 can be coupled in some embodiments to the processor 1407.
  • the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405.
  • the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad.
  • the user interface 1405 can enable the user to obtain information from the device 1400.
  • the user interface 1405 may comprise a display configured to display information from the device 1400 to the user.
  • the user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400.
  • the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
  • the device 1400 comprises an input/output port 1409.
  • the input/output port 1409 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
  • the device 1400 may be employed as at least part of the synthesis device.
  • the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code.
  • the input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
  • the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • circuitry may refer to one or more or all of the following:
  • (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
  • circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
  • circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
  • the embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • Computer software or program, also called a program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and comprises program instructions to perform particular tasks.
  • a computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments.
  • the one or more computer-executable components may be at least one software code or portions of it.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
  • the physical media is a non-transitory media.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), FPGA, gate level circuits and processors based on multi core processor architecture, as non-limiting examples.
  • Embodiments of the disclosure may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.

Abstract

An apparatus for generating spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.

Description

METADATA GENERATION WITHIN SPATIAL AUDIO
Field
The present application relates to apparatus and methods for capture side metadata generation, but not exclusively for capture side metadata generation of metadata-assisted spatial audio.
Background
The immersive voice and audio services (IVAS) codec is an extension of the 3GPP EVS (enhanced voice services) codec and intended for new immersive voice and audio services over 4G/5G. Such immersive services include, e.g., immersive voice and audio for virtual reality (VR). The multi-purpose audio codec is expected to handle the encoding, decoding and rendering of speech, music and generic audio. It is expected to support a variety of input formats, such as channel-based and scene-based inputs. It is also expected to operate with low latency to enable conversational services as well as support high error robustness under various transmission conditions.
Metadata-assisted spatial audio (MASA) is one input format proposed for IVAS. It uses audio signal(s) together with corresponding spatial metadata. The spatial metadata comprises parameters which define the spatial aspects of the audio signals and which may contain for example, directions and direct-to-total energy ratios in frequency bands. The MASA stream can, for example, be obtained by capturing spatial audio with microphones of a suitable capture device. For example a mobile device comprising multiple microphones may be configured to capture microphone signals where the set of spatial metadata can be estimated based on the captured microphone signals. The MASA stream can be obtained also from other sources, such as specific spatial audio microphones (such as Ambisonics), studio mixes (for example, a 5.1 audio channel mix) or other content by means of a suitable format conversion.
One such conversion method is disclosed in Tdoc S4-191167 (Nokia Corporation: Description of the IVAS MASA C Reference Software; 3GPP TSG-SA4#106 meeting; 21-25 October, 2019, Busan, Republic of Korea).
Summary
There is provided according to a first aspect an apparatus for generating spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
The metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
The means may be configured to encode the metadata parameters.
The means may be further configured to: generate at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encode the at least one transport audio signal.
The means may be further configured to: combine the encoded at least one transport audio signals and the encoded metadata parameters; and transmit to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
The apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
The status of a signal enhancement processing applied to the at least one audio signal may be a status of a wind noise suppression or reduction processing, wherein the means configured to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may be configured to: determine a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determine a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
The means configured to generate metadata control information based on the apparatus information may be configured to generate a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
The means configured to generate metadata control information based on the apparatus information may be configured to determine an audio signal selection control, wherein the audio selection control may be configured to select at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
The metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
The means configured to generate metadata parameters based on the metadata control information may be configured to generate the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
The means configured to generate metadata control information based on the apparatus information may be configured to determine a direction dependent reliability control based on microphone location configuration on the apparatus.
The means configured to generate metadata parameters based on the metadata control information may be configured to generate at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time resolution window being longer than the first time-resolution window.
The means configured to generate metadata parameters based on the metadata control information may be configured to select one of the at least two sets of metadata parameters based on the direction dependent reliability control.
According to a second aspect there is provided a method for an apparatus for generating spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining apparatus information; generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generating metadata parameters based on the metadata control information.
The metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
The method may further comprise encoding the metadata parameters.
The method may further comprise: generating at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encoding the at least one transport audio signal.
The method may further comprise: combining the encoded at least one transport audio signals and the encoded metadata parameters; and transmitting to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
The apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration. The status of a signal enhancement processing applied to the at least one audio signal may be a status of a wind noise suppression or reduction processing, wherein generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may comprise: determining a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determining a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
Generating metadata control information based on the apparatus information may comprise generating a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
Generating metadata control information based on the apparatus information may comprise determining an audio signal selection control, wherein the audio selection control may comprise selecting at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
The metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
Generating metadata parameters based on the metadata control information may comprise generating the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
Generating metadata control information based on the apparatus information may comprise determining a direction dependent reliability control based on microphone location configuration on the apparatus. Generating metadata parameters based on the metadata control information may comprise generating at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time resolution window being longer than the first time-resolution window.
Generating metadata parameters based on the metadata control information may comprise selecting one of the at least two sets of metadata parameters based on the direction dependent reliability control.
According to a third aspect there is provided an apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
The metadata control information may control at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
The apparatus may be caused to encode the metadata parameters.
The apparatus may be further caused to: generate at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encode the at least one transport audio signal.
The apparatus may be further caused to: combine the encoded at least one transport audio signals and the encoded metadata parameters; and transmit to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
The apparatus information may comprise at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
The status of a signal enhancement processing applied to the at least one audio signal may be a status of a wind noise suppression or reduction processing, wherein the apparatus caused to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs may be caused to: determine a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determine a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
The apparatus caused to generate metadata control information based on the apparatus information may be caused to generate a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
The apparatus caused to generate metadata control information based on the apparatus information may be caused to determine an audio signal selection control, wherein the audio selection control may be configured to select at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
The metadata codec information with respect to a range of possible metadata codecs may comprise at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; and a coding strategy for the encoding of the metadata parameters.
The apparatus caused to generate metadata parameters based on the metadata control information may be caused to generate the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
The apparatus caused to generate metadata control information based on the apparatus information may be caused to determine a direction dependent reliability control based on microphone location configuration on the apparatus.
The apparatus caused to generate metadata parameters based on the metadata control information may be caused to generate at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time resolution window being longer than the first time-resolution window.
The apparatus caused to generate metadata parameters based on the metadata control information may be caused to select one of the at least two sets of metadata parameters based on the direction dependent reliability control.
According to a fourth aspect there is provided an apparatus comprising: means for obtaining apparatus information; means for generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and means for generating metadata parameters based on the metadata control information.
According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
According to a seventh aspect there is provided an apparatus comprising: obtaining circuitry configured to obtain apparatus information; generating circuitry configured to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generating circuitry configured to generate metadata parameters based on the metadata control information.
According to an eighth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus to perform at least the following: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein. Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically the metadata generator and encoder according to some embodiments;
Figure 3 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 according to some embodiments;
Figure 4 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 with respect to wind noise reduction embodiments;
Figure 5 shows a flow diagram of the operation of the example metadata generator and encoder as shown in Figure 2 with respect to spatial capture embodiments; and
Figure 6 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the encoding of parametric spatial audio streams with transport audio signals and spatial metadata.
As discussed above Metadata-Assisted Spatial Audio (MASA) is an example of a parametric spatial audio format and representation suitable as an input format for IVAS.
It can be considered an audio representation consisting of ‘N channels + spatial metadata’. It is a scene-based audio format particularly suited for spatial audio capture on practical devices, such as smartphones. The idea is to describe the sound scene in terms of time- and frequency-varying sound source directions. Sound energy that is not defined (described) by the directions is described as diffuse (coming from all directions).
As discussed above, spatial metadata associated with the audio signals may comprise multiple parameters (such as multiple directions and, associated with each direction, a direct-to-total ratio, spread coherence, distance, etc.) per time-frequency tile. The spatial metadata may also comprise other parameters or may be associated with other parameters which are considered to be non-directional (such as surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio) but when combined with the directional parameters are able to be used to define the characteristics of the audio scene. For example, a reasonable design choice which is able to produce a good quality output is one where one or more directions (and, associated with each direction, direct-to-total ratios, spread coherence, distance values, etc.) are determined for each time-frequency subframe. However, as also discussed above, bandwidth and/or storage limitations may require a codec not to send spatial metadata parameter values for each frequency band and temporal sub-frame.
As described above, parametric spatial metadata representation can use multiple concurrent spatial directions. With MASA, the proposed maximum number of concurrent directions is two. For each concurrent direction, there may be associated parameters such as: Direction index; Direct-to-total ratio; Spread coherence; and Distance. In some embodiments other parameters such as Diffuse-to-total energy ratio; Surround coherence; and Remainder-to-total energy ratio are defined.
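By way of illustration only, the per-tile parameter set described above could be held in a simple data structure such as the following Python sketch; the class and field names are illustrative and not part of the MASA specification:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MasaDirection:
    """Parameters associated with one concurrent direction in a TF-tile."""
    direction_index: int          # quantized spherical direction
    direct_to_total_ratio: float  # 0..1
    spread_coherence: float       # 0..1
    distance: float               # distance value (or a quantized index)

@dataclass
class MasaTile:
    """One time-frequency tile of spatial metadata (illustrative field set)."""
    directions: List[MasaDirection] = field(default_factory=list)  # at most two
    diffuse_to_total_ratio: float = 0.0
    surround_coherence: float = 0.0
    remainder_to_total_ratio: float = 0.0
```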
However the MASA format may further comprise other parameters.
The IVAS codec is configured to operate with a frame size of 20 ms. Similarly, MASA metadata update may be configured to synchronize with this frame size (20 ms) although the typical MASA metadata update rate may be lower, e.g., 5 ms. This is achieved by combining multiple subframes into one synchronized frame. MASA format is a generic versatile format that allows high-resolution format inputs (24 frequency bands and 4 time subframes with 5 ms time resolution). MASA-format compression methods can analyze the metadata and, based on the metadata resolution of the input, select a coding mode. Specifically, the compression method can implement a selection of producing an encoding with a better time resolution or a better frequency resolution. In practice, a better time resolution is preferred when a better time resolution is available, which effectively reduces the frequency resolution with all but the highest coding bitrates. This means that time resolution is typically deemed to be more important by the codec.
In general, this performs well from the IVAS point-of-view where the provided information is only the device-agnostic metadata and the transport signals. However, the MASA format is generated often by capture systems (e.g., a mobile device) which have multiple microphones (often more than the number of transport signals) and complex algorithms to process the audio signals and produce the metadata. This also means that the capture system can have more information on what parts of the captured audio scene are important for the listener than the encoder will have. For example, even though the encoder by default prefers a better time resolution, the capture system may know that a better frequency resolution provides better quality and is therefore more important.
The encoder is strictly limited by the standard specification and remains generic to support a multitude of different sources for the MASA format. Thus, it is not possible to implement complex capture-device specific adaptations into the encoder. However, generic adaptations are present as mentioned above. The standard specification including these generic adaptations is known for the capture device.
However the capture device cannot directly control the encoding process even though it would be beneficial, as the process is completely autonomous based on the input audio signal and metadata. To obtain the best end-to-end quality, the capture device should be configured to modify the generated MASA format in such ways that the encoder is forced to the optimal encoding path. The desire for this kind of system is further emphasized when the IVAS standard is finalized, as all further development should be implemented on the capture device in order to prevent the whole codec from having to be re-specified.
The concept as discussed herein in further detail with respect to the following embodiments is related to coding and generation of parametric spatial audio metadata, for example, MASA spatial audio metadata for use by 3GPP IVAS. For example in some embodiments the quality of the coded MASA metadata is improved by having an automatic capture-side metadata generation optimization. Furthermore in some embodiments the automatic capture-side metadata generation is configured to obtain information about the expected bitrate, coding strategy of the codec, and all the other information available on the capture-side. Additionally in some embodiments the capture-side metadata generation is configured to select preferences, for example, for: time and/or frequency resolution, presence of coherence parameters, and/or number of concurrent directions. Furthermore in some embodiments the selected preference is based on the additional information available on the capture side in addition to the codec coding strategy. Also in some embodiments information available on the capture side can comprise: raw and/or processed captured audio signals, raw and/or processed spatial analysis results, estimates of spatial analysis performance, multiple different spatial analysis results, and/or signal enhancement algorithm (wind noise reduction, audio focus, etc.) status information.
In some embodiments the metadata generation is based on the capture side information and known codec specification to control an autonomously selected coding strategy within a codec such that the metadata generated is therefore more optimal for the current capture situation.
This approach is beneficial at least in two known use cases:
For example in presence of wind noise (and attempted reduction of wind noise), time-frequency resolution should be optimized; and in mobile device audio capture, a reliability of the spatial analysis depends on the identified directions of the sound sources and capture configuration. In such situations an improved or optimized time-frequency resolution should be based on the reliability of the capture.
With respect to Figure 1 an example apparatus and system for implementing embodiments of the application are shown. The system 100 is shown with an ‘analysis’ part 121 and a ‘synthesis’ part 131. The ‘analysis’ part 121 is the part from receiving the multi-channel signals up to an encoding of the spatial metadata and transport signal and the ‘synthesis’ part 131 is the part from a decoding of the encoded spatial metadata and transport signal to the presentation of the regenerated signal (for example in multi-channel loudspeaker form).
In the following description the ‘analysis’ part 121 is described as a series of parts however in some embodiments the part may be implemented as functions within the same functional apparatus or part. In other words in some embodiments the ‘analysis’ part 121 is an encoder comprising at least one of the transport signal generator or analysis processor as described hereafter.
The input to the system 100 and the ‘analysis’ part 121 is the multi-channel signals 102. The ‘analysis’ part 121 may comprise a transport signal generator 103, analysis processor 105, and encoder 107. In the following examples a microphone channel signal input is described, which can be two or more microphones integrated or connected onto a mobile device (e.g., a smartphone). However any suitable input (or synthetic multi-channel) format may be implemented in other embodiments. For example other suitable audio signals format inputs could be microphone arrays, e.g., B-format microphone, planar microphone array or Eigenmike, Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA), loudspeaker surround mix and/or objects, artificially created spatial mix, e.g., from audio or VR teleconference bridge, or combinations of the above.
The multi-channel signals are passed to a transport signal generator 103 and to an analysis processor 105.
In some embodiments the transport signal generator 103 is configured to receive the multi-channel signals and generate a suitable audio signal format for encoding. The transport signal generator 103 can for example generate a stereo or mono audio signal. The transport audio signals generated by the transport signal generator can be any known format. For example, when the input audio signals are mobile phone microphone array audio signals, the transport signal generator 103 can be configured to select a left-right microphone pair, and apply any suitable processing to the audio signal pair, such as automatic gain control, microphone noise removal, wind noise removal, and equalization. In some embodiments when the input is a first order Ambisonic/higher order Ambisonic (FOA/HOA) signal, the transport signal generator can be configured to formulate directional beam signals towards left and right directions, such as two opposing cardioid signals. Additionally in some embodiments when the input is a loudspeaker surround mix and/or objects, then the transport signal generator 103 can be configured to generate a downmix signal that combines left side channels to a left downmix channel, combines right side channels to a right downmix channel and adds centre channels to both transport channels with a suitable gain.
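As a purely illustrative sketch of the loudspeaker-input case described above, a stereo transport pair could be formed as follows; the channel naming, the layout mapping and the centre gain of approximately -3 dB are assumptions for the example rather than specified values:

```python
import numpy as np

def downmix_surround_to_stereo(channels, layout, centre_gain=0.7071):
    """Form a stereo transport pair from a loudspeaker surround mix.

    channels: dict mapping channel names (e.g. 'L', 'R', 'C', 'Ls', 'Rs')
    to sample arrays; layout: dict mapping each name to 'left', 'right'
    or 'centre'.
    """
    length = len(next(iter(channels.values())))
    left = np.zeros(length)
    right = np.zeros(length)
    for name, samples in channels.items():
        side = layout[name]
        if side == 'left':
            left += samples
        elif side == 'right':
            right += samples
        else:  # centre channels go to both transport channels with a gain
            left += centre_gain * samples
            right += centre_gain * samples
    return left, right
```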
In some embodiments the transport signal generator is bypassed (or in other words is optional). For example, in some situations where the analysis and synthesis occur at the same device at a single processing step, without intermediate processing there is no transport signal generation and the input audio signals are passed unprocessed. The number of transport channels generated can be any suitable number.
The output of the transport signal generator 103 can be passed to an encoder 107.
In some embodiments the analysis processor 105 is also configured to receive the multi-channel signals and analyse the signals to produce the spatial metadata 106 associated with the multi-channel signals and thus associated with the transport signals 104. In some embodiments the spatial metadata associated with the audio signals may be provided to the encoder as a separate bit-stream. In some embodiments the multichannel signals 102 input comprises spatial metadata and this is passed directly to the encoder 107.
The analysis processor 105 may be configured to generate the spatial metadata parameters which may comprise, for each time-frequency analysis interval, at least one direction parameter 108 and at least one energy ratio parameter 110 (and in some embodiments other parameters such as described earlier and of which a non-exhaustive list includes number of directions, surround coherence, diffuse-to-total energy ratio, remainder-to-total energy ratio, a spread coherence parameter, and distance parameter). The direction parameter may be represented in any suitable manner, for example as spherical co-ordinates denoted as azimuth φ(k,n) and elevation θ(k,n).
In some embodiments the number of the spatial metadata parameters may differ from time-frequency tile to time-frequency tile. Thus for example in band X all of the spatial metadata parameters are obtained (generated) and transmitted, whereas in band Y only one of the spatial metadata parameters is obtained and transmitted, and furthermore in band Z no parameters are obtained or transmitted. A practical example of this may be that for some time-frequency tiles corresponding to the highest frequency band some of the spatial metadata parameters are not required for perceptual reasons. The spatial metadata 106 may be passed to an encoder 107.
In some embodiments the analysis processor 105 is configured to apply a time-frequency transform for the input signals. Then, for example, in time-frequency tiles when the input is a mobile phone microphone array, the analysis processor could be configured to estimate delay-values between microphone pairs that maximize the inter-microphone correlation. Then based on these delay values the analysis processor may be configured to formulate a corresponding direction value for the spatial metadata. Furthermore the analysis processor may be configured to formulate a direct-to-total ratio parameter based on the correlation value.
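The delay-and-correlation analysis outlined above can be illustrated, in heavily simplified form, by the following sketch; it operates on one broadband frame of a single microphone pair, whereas a practical analyser works per time-frequency tile and across several pairs, and the mapping from correlation to ratio is only an assumption:

```python
import numpy as np

def analyse_mic_pair(sig_a, sig_b, fs, mic_distance_m, speed_of_sound=343.0):
    """Very simplified delay/correlation analysis for one microphone pair.

    Finds the inter-microphone lag with the highest normalized correlation,
    maps it to an arrival angle relative to the microphone axis, and uses
    the correlation value itself as a rough direct-to-total ratio.
    """
    max_lag = int(np.ceil(mic_distance_m / speed_of_sound * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    norm = np.sqrt(np.sum(sig_a**2) * np.sum(sig_b**2)) + 1e-12
    corr = np.array([np.sum(sig_a * np.roll(sig_b, lag)) for lag in lags]) / norm
    best = int(np.argmax(corr))
    delay_s = lags[best] / fs
    # Delay -> angle relative to the microphone axis (clipped to a valid range)
    sin_angle = np.clip(delay_s * speed_of_sound / mic_distance_m, -1.0, 1.0)
    azimuth_deg = float(np.degrees(np.arcsin(sin_angle)))
    direct_to_total = float(np.clip(corr[best], 0.0, 1.0))
    return azimuth_deg, direct_to_total
```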
In some embodiments, for example where the input is a FOA signal, the analysis processor 105 can be configured to determine an intensity vector. The analysis processor may then be configured to determine a direction parameter value for the spatial metadata based on the intensity vector. A diffuse-to-total ratio can then be determined, from which a direct-to-total ratio parameter value for the spatial metadata can be determined. This analysis method is known in the literature as Directional Audio Coding (DirAC).
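A minimal sketch of the intensity-vector analysis described above is given below; the Ambisonic scaling convention, the lack of temporal smoothing and the exact diffuseness estimator are simplifying assumptions rather than a reference DirAC implementation:

```python
import numpy as np

def dirac_analysis(W, X, Y, Z):
    """Simplified DirAC-style analysis of one TF-tile of FOA signals.

    W, X, Y, Z are arrays of complex TF-domain samples for the tile.
    Returns (azimuth_deg, elevation_deg, direct_to_total_ratio).
    """
    # Active intensity vector, averaged over the tile
    intensity = np.array([
        np.mean(np.real(np.conj(W) * X)),
        np.mean(np.real(np.conj(W) * Y)),
        np.mean(np.real(np.conj(W) * Z)),
    ])
    azimuth = float(np.degrees(np.arctan2(intensity[1], intensity[0])))
    elevation = float(np.degrees(np.arctan2(intensity[2],
                                            np.linalg.norm(intensity[:2]))))
    # Diffuseness compares the averaged intensity with the total energy
    energy = np.mean(np.abs(W)**2 + np.abs(X)**2
                     + np.abs(Y)**2 + np.abs(Z)**2) / 2.0
    diffuseness = 1.0 - np.linalg.norm(intensity) / (energy + 1e-12)
    diffuseness = float(np.clip(diffuseness, 0.0, 1.0))
    return azimuth, elevation, 1.0 - diffuseness
```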
In some examples, for example where the input is a HOA signal, the analysis processor 105 can be configured to divide the HOA signal into multiple sectors, in each of which the method above is utilized. This sector-based method is known in the literature as higher order DirAC (HO-DirAC). In these examples, there is more than one simultaneous direction parameter value per time-frequency tile corresponding to the multiple sectors.
Additionally in some embodiments where the input is a loudspeaker surround mix and/or audio object(s) based signal, the analysis processor can be configured to convert the signal into a FOA/HOA signal(s) format and to obtain direction and direct-to-total ratio parameter values as above.
The analysis processor 105 can as described above be configured to generate metadata parameters for the MASA format stream. The metadata parameters are typically generated in the time-frequency (TF) domain and produce parameters for each time-frequency tile. For the following examples and embodiments, it can be beneficial to understand how the number of TF-tiles, i.e., the TF-resolution, may be adjusted for metadata generation. In practical use, a capture system is configured to analyze metadata parameters using one time-frequency resolution which can (in an optimal case) match the TF-resolution of the target format (for example 24 frequency bands and four 5 ms subframes in the MASA format). However, this may not necessarily be the case and adjustments may be needed to match the target TF-resolution. In addition, the underlying TF-resolution may be different than the native TF-resolution of the format to take advantage of the internal codec algorithms and force specific coding decisions. Thus, adjustments of metadata TF-resolution can be implemented within the capture system for metadata generation.
There can be variations in adjusting the metadata TF-resolution. Firstly, and usually quality-wise producing the most optimal solution, is to perform analysis directly at the target TF-resolution. Thus, if the capture system allows it, it is best to change the analysis TF-resolution as necessary.
However where a match between the analysis and target TF-resolution is not possible, then it can be possible to analyze at a single TF-resolution that is overall judged to be “better” than the target TF-resolution or may be, e.g., the “best” possible target resolution. An example of the first option would be, for example, a TF-resolution with 240 frequency bands and 4 subframes. An example of the second option would be the MASA-format native resolution of 24 frequency bands and 4 subframes. Regardless of the option used, if the actual desired target is lower in terms of frequency resolution, then some form of metadata reduction via combination is performed. For example this can be implemented by reducing MASA format related spatial metadata to fewer frequency bands and time frames by a combination of weighted direction vectors and weighted averages of metadata parameters over time and frequency to form a reduced (in terms of TF-tile resolution) set of metadata parameters. This can be directly applicable if the reduced TF-resolution is a direct subset of the source TF-resolution. In other words where the band limits and subframe borders align. Otherwise, some minor or trivial adaptations are implemented. Alternatively, if the method for reducing MASA format related spatial metadata to fewer frequency bands and time frames by a combination of weighted direction vectors and weighted averages of metadata parameters over time and frequency is not suitable (for some reason), in some embodiments a simple parameter smoothing and resampling over time and frequency can be implemented.
Furthermore, if the analyzer TF-resolution is lower than the target TF-resolution, then more parameter values need to be generated. In practice, an effective solution is to replicate the existing parameter values into the better TF-resolution. Again, if the lower resolution is a direct subset of the better resolution, then a simple assignment of values can be mapped from the lower resolution TF-tiles to corresponding better (higher) resolution TF-tiles. If there is no direct mapping possible (the lower resolution is not a direct subset), then the frequency band and time subframe limits of the lower TF-resolution can be compared to the corresponding limits of the better TF-resolution and parameters are assigned based on the nearest corresponding TF-tile.
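A minimal sketch of the nearest-tile assignment, assuming per-band parameter values in a NumPy array and band limits given in Hz (the names are illustrative and the same idea applies to time subframes), could be:

```python
import numpy as np

def expand_metadata(values, src_band_limits, dst_band_limits):
    """Map per-band parameter values from a lower frequency resolution to a
    higher one by replicating the value of the nearest source band."""
    src_centres = np.array([(lo + hi) / 2.0 for lo, hi in src_band_limits])
    out = np.empty(len(dst_band_limits), dtype=values.dtype)
    for k, (lo, hi) in enumerate(dst_band_limits):
        centre = (lo + hi) / 2.0
        # replicate the value whose source band centre is closest
        out[k] = values[np.argmin(np.abs(src_centres - centre))]
    return out
```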
The encoder 107 may comprise an audio encoder core 109 which is configured to receive the transport audio signals 104 and generate a suitable encoding of these audio signals. The encoder 107 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs. The audio encoding may be implemented using any suitable scheme.
The encoder 107 may furthermore comprise a spatial metadata encoder/quantizer 111 which is configured to receive the spatial metadata and output an encoded or compressed form of the information. In some embodiments the encoder 107 may further interleave, multiplex to a single data stream or embed the spatial metadata within the encoded downmix signals before transmission or storage, as shown in Figure 1 by the dashed line. The multiplexing may be implemented using any suitable scheme.
In some embodiments the transport signal generator 103 and/or analysis processor 105 may be located on a separate device (or otherwise separate) from the encoder 107. For example, in such embodiments the spatial metadata (and associated non-spatial metadata) parameters associated with the audio signals may be provided to the encoder as a separate bit-stream.
In some embodiments the transport signal generator 103 and/or analysis processor 105 may be part of the encoder 107, i.e., located inside the encoder and on the same device. In the following description the ‘synthesis’ part 131 is described as a series of parts; however, in some embodiments these parts may be implemented as functions within the same functional apparatus or part.
In the decoder side, the received or retrieved data (stream) may be received by a decoder/demultiplexer 133. The decoder/demultiplexer 133 may demultiplex the encoded streams and pass the audio encoded stream to a transport signal decoder 135 which is configured to decode the audio signals to obtain the transport audio signals. Similarly the decoder/demultiplexer 133 may comprise a metadata extractor 137 which is configured to receive the encoded spatial metadata (for example a direction index representing a direction parameter value) and generate spatial metadata.
The decoder/demultiplexer 133 can in some embodiments be a computer (running suitable software stored on memory and on at least one processor), or alternatively a specific device utilizing, for example, FPGAs or ASICs.
The decoded metadata and transport audio signals may be passed to a synthesis processor 139.
The system 100 ‘synthesis’ part 131 further shows a synthesis processor 139 configured to receive the transport audio signal and the spatial metadata and to re-create, in any suitable format, a synthesized spatial audio signal in the form of multichannel signals 140 (these may be in a multichannel loudspeaker format or, in some embodiments, any suitable output format such as binaural or Ambisonics signals, depending on the use case) based on the transport signals and the spatial metadata.
The synthesis processor 139 thus creates the output audio signals, e.g., multichannel loudspeaker signals or binaural signals based on any suitable known method. This is not explained here in further detail. However, as a simplified example, the rendering can be performed for loudspeaker output according to any of the following methods. For example the transport audio signals can be divided to direct and ambient streams based on the direct-to-total and diffuse-to-total energy ratios. The direct stream can then be rendered based on the direction parameter(s) using amplitude panning. The ambient stream can furthermore be rendered using decorrelation. The direct and the ambient streams can then be combined. The output signals can be reproduced using a multichannel loudspeaker setup or headphones which may be head-tracked.
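As a simplified, purely illustrative sketch of the per-tile loudspeaker rendering just described, assuming one (mono) transport channel per time-frequency tile and a horizontal loudspeaker layout (the function and parameter names are assumptions, and a practical renderer would additionally decorrelate the ambient part per channel rather than spreading it equally):

```python
import numpy as np

def render_tile(transport_tf, azimuth_deg, direct_ratio, speaker_azis_deg):
    """Render one TF tile: pan the direct part to the two nearest
    loudspeakers, spread the ambient part over all loudspeakers."""
    n_spk = len(speaker_azis_deg)
    gains = np.zeros(n_spk)

    # distance-based amplitude panning between the two nearest loudspeakers
    diffs = [abs((a - azimuth_deg + 180) % 360 - 180) for a in speaker_azis_deg]
    order = np.argsort(diffs)
    i, j = order[0], order[1]
    d_i, d_j = diffs[i], diffs[j]
    g_j = d_i / max(d_i + d_j, 1e-9)          # closer loudspeaker gets more gain
    g_i = 1.0 - g_j
    norm = np.sqrt(g_i ** 2 + g_j ** 2)       # energy normalisation
    gains[i], gains[j] = g_i / norm, g_j / norm

    direct = np.sqrt(direct_ratio) * transport_tf * gains
    ambient = np.sqrt((1.0 - direct_ratio) / n_spk) * transport_tf * np.ones(n_spk)
    return direct + ambient
```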
It should be noted that the processing blocks of Figure 1 can be located in same or different processing entities. For example, in some embodiments, microphone signals from a mobile device are processed with a spatial audio capture system (containing the analysis processor and the transport signal generator), and the resulting spatial metadata and transport audio signals (e.g., in the form of a MASA stream) are forwarded to an encoder (e.g., an IVAS encoder), which contains the encoder. In other embodiments, input signals (e.g., 5.1 channel audio signals) are directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder.
In some embodiments there can be two (or more) input audio signals, where the first audio signal is processed by the apparatus shown in Figure 1 (resulting in data as an input for the encoder) and the second audio signal is directly forwarded to an encoder (e.g., an IVAS encoder), which contains the analysis processor, the transport signal generator, and the encoder. The audio input signals may then be encoded in the encoder independently or they may, e.g., be combined in the parametric domain according to what may be called, e.g., MASA mixing.
In some embodiments there may be a synthesis part which comprises separate decoder and synthesis processor entities or apparatus, or the synthesis part can comprise a single entity which comprises both the decoder and the synthesis processor. In some embodiments, the decoder block may process in parallel more than one incoming data stream. In the application the term synthesis processor may be interpreted as an internal or external renderer.
Therefore in summary first the system (analysis part) is configured to receive multi-channel audio signals. Then the system (analysis part) is configured to generate a suitable transport audio signal (for example by selecting some of the audio signal channels). The system is then configured to encode for storage/transmission the transport audio signal. After this the system may store/transmit the encoded transport audio signal and metadata. The system may retrieve/receive the encoded transport audio signal and metadata. Then the system is configured to extract the transport audio signal and metadata from encoded transport audio signal and metadata parameters, for example demultiplex and decode the encoded transport audio signal and metadata parameters.
The system (synthesis part) is configured to synthesize an output multichannel audio signal based on extracted transport audio signal and metadata.
With respect to Figure 2, there is shown in further detail apparatus suitable for implementing some embodiments.
In some embodiments the apparatus comprises a suitable audio signal capture input 201. The audio signal capture input 201 is configured to obtain the captured input audio signals 102 and pass these to the spatial analyser 215, the analysis configurator 213, the transport signal generation configurator 207 and the transport signal generator 203. As described earlier, the input audio signals 102 comprise microphone audio signals, in other words the audio signal capture input 201 comprises a suitable microphone, microphones or microphone array configuration. However, other suitable audio signal formats and configurations can be employed in some embodiments.
Furthermore the apparatus comprises a capture device status information input 209. The capture device status information input 209 is configured to obtain the capture device information 210 and pass this information to the transport signal generation configurator 207 and the analysis configurator 213. Information 210 from the capture device status information can in some embodiments comprise information with respect to the microphone configuration or arrangement (where the audio signals are microphone audio signals). In some embodiments the information 210 comprises raw and/or processed spatial analysis results from the audio signals (which can in some embodiments be the spatial analysis before conversion to encoder input format but can be after noise reduction or other processing operations). In some embodiments the information comprises an estimate of the spatial analysis performance. In still further embodiments the information comprises multiple different spatial analysis results. The information 210 in some embodiments comprises a signal enhancement algorithm (wind noise reduction, audio focus, etc.) status information.
Additionally the apparatus comprises an encoder 221. The encoder 221 is configured in some embodiments to output encoder information 218. The encoder information 218 in some embodiments comprises information defining the encoder standard specification. The encoder specification can for example define the input format specification (such as defining the time-frequency resolution) or may define metadata encoding details (for example bitrate specific reductions in metadata and how the metadata is quantized). The encoder information 218 furthermore in some embodiments comprises information about current target and/or previous realized bitrates for the encoder. This information can be different than the bitrate the device requests the encoder to use. Other encoder information 218 can also describe mode configuration (DTX, etc.).
In some embodiments the apparatus comprises a transport signal generation configurator 207. The transport signal generation configurator 207 in some embodiments is configured to receive the input audio signals 102, the capture device status information 210, the encoder information 218 (and in some embodiments furthermore information 212 from an analysis configurator 213) and determine a configuration for transport audio signal generation. The configuration for transport audio signal generation 208 can then be passed to the transport signal generator 203.
In some embodiments the apparatus comprises a transport signal generator 203. The transport signal generator 203 is configured to obtain the configuration for transport audio signal generation 208 and the input audio signals 102 and generate the encoder optimised transport signal(s) 204 which can be passed to the encoder 221.
The apparatus in some embodiments comprises an analysis configurator 213. The analysis configurator 213 is in some embodiments configured to receive the input audio signals 102, the capture device status information 210, the encoder information 218 (and in some embodiments furthermore information 212 from the transport signal generation configurator 207) and determine a configuration for the spatial analyser 215. The configuration 223 for the spatial analyser 215 can then be passed to the spatial analyser 215.
In some embodiments the apparatus comprises a spatial analyser 215. The spatial analyser 215 is configured to obtain the configuration 223 for spatial analysis and the input audio signals 102 and generate the encoder optimised metadata 216 which can be passed to the encoder 221. Thus, using the obtained information from different sources, the improved (optimal) configuration for spatial analysis can be deduced. For metadata generation, this can mean in practice that the time-frequency resolution of the spatial analysis is selected in such a way that the encoder is directed into a preferred coding path based on the pre-selected / optimized characteristics of the input spatial metadata. In addition, analysis for different parameters (e.g., coherence parameters) can be controlled. For transport audio signal generation, the practical control is in the selection (or combination) of optimal microphones for the transport signals and also the selection of how many transport audio signals should be used. This information is then used to control these two parts of the format generation.
Furthermore Figure 3 shows a flow diagram of the operations of the example apparatus shown in Figure 2 for implementing some embodiments. Thus the method comprises capturing the input audio signals as shown in Figure 3 by step 301.
Additionally the method comprises obtaining capture device information as shown in Figure 3 by step 303.
Furthermore the method comprises obtaining encoder standard specification (or more generally encoder information) as shown in Figure 3 by step 305.
The method can furthermore optionally comprise obtaining encoder feedback information as shown in Figure 3 by step 307.
Having obtained the input audio signals, capture device information, encoder information and optionally the encoder feedback information the available data can be analysed to determine an improved analysis configuration (or optimal analysis configuration) for the target codec as shown in Figure 3 by step 309.
Having determined the improved analysis configuration for the target codec then the spatial analysis is controlled for an improved spatial metadata generation as shown in Figure 3 by step 313.
Furthermore the configuration information is configured to enable a control of the transport audio signal generation for an improved transport signal output as shown in Figure 3 by step 315.
The transport audio signal is then generated based on the control as shown in Figure 3 by step 317. Furthermore the input audio signals are analysed based on the improved spatial metadata generation control as shown in Figure 3 by step 315.
With respect to Figure 4, there is shown an example implementation with respect to wind noise reduction (WNR). Wind noise reduction is a practical solution to a capture situation that can occur when the apparatus is outside with a significant amount of wind present. Wind can cause noise in the microphone signals and can produce poor quality captured microphone audio signals that can be practically unusable. As there are usually multiple microphones present in a single capture device, the problem can be alleviated by temporarily removing noisy microphones and/or algorithmically suppressing the noise in the microphones. This information can be used to control the format generation for the encoder to provide an improved end-to-end audio signal quality.
The example flow diagram shown in Figure 4 starts with obtaining the input information. This includes:
Obtaining captured microphone audio signals as the input audio signals as shown in Figure 4 by step 401;
Obtaining information about the wind noise reduction algorithm status as shown in Figure 4 by step 403;
Obtaining the encoder configuration information (standard specification) as shown in Figure 4 by step 405;
Obtaining the encoder feedback information, for example, in the form of the current bitrate as shown in Figure 4 by step 407.
Having obtained the input information the method in some embodiments is configured to process two independent method flows. Firstly, the status of wind noise reduction (WNR) is analyzed to determine whether wind noise is currently suppressed as shown in Figure 4 by step 409. The status of wind-noise reduction or suppression is, in practice, a binary decision. In other words either there is reduction or suppression or there is not. This process is shown, for example in GB2109928.8.
A wind-noise reduction or suppression status can be indicated by a decision to implement noise reduction or suppression. The microphone signals can be analyzed for low frequency (below 150 Hz) level differences, and if there is a significant level difference (e.g., above 6 dB), then the probable cause for this is wind noise. Where wind noise is detected, the decision can be made to implement noise reduction or suppression processing or microphone selection.
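An illustrative sketch of this detection decision, using the 150 Hz band limit and 6 dB level-difference values mentioned above (the function name, filter design and signal layout are assumptions, not part of this description):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def wind_noise_detected(mic_signals, fs, cutoff_hz=150.0, threshold_db=6.0):
    """Detect probable wind noise from inter-microphone low-frequency level
    differences: a large spread in low-band level across microphones is
    taken as an indication of wind noise."""
    sos = butter(4, cutoff_hz, btype='low', fs=fs, output='sos')
    levels_db = []
    for sig in mic_signals:                     # one 1-D array per microphone
        low = sosfilt(sos, np.asarray(sig, dtype=float))
        levels_db.append(10.0 * np.log10(np.mean(low ** 2) + 1e-12))
    return (max(levels_db) - min(levels_db)) > threshold_db
```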
If wind noise reduction or suppression is currently being implemented (as indicated by the status information), then a long time window is selected for the TF-window for the analysis of the audio signals which is then used as shown in Figure 4 by step 413.
If wind noise reduction or suppression is not currently being implemented, it means that there is less inherent wind noise present, and the captured audio signal quality is better. In such a scenario, a short time window can be selected for the TF-window for the analysis of the audio signals which is then used as shown in Figure 4 by step 411.
Having determined the time window for the analysis, this information is then combined with the encoder information (and optional encoder feedback information) in order to select an improved (optimal) time-frequency resolution for the spatial analysis as shown in Figure 4 by step 415. For example, with respect to a first bitrate, bitrate 1, the short time window can be a 5 ms time resolution with 4 frequency bands, and the long time window can be a 20 ms time resolution with 16 frequency bands. For a second bitrate, bitrate 2, the short time window can be a 5 ms time resolution with 12 frequency bands, and the long time window can be a 20 ms time resolution with 24 frequency bands.
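The example values above could, for instance, be captured in a simple lookup keyed by bitrate tier and window type (the tier names and structure below are illustrative only, not from this description):

```python
# (time_resolution_ms, n_frequency_bands) per (bitrate_tier, window)
TF_RESOLUTIONS = {
    ("bitrate1", "short"): (5, 4),
    ("bitrate1", "long"): (20, 16),
    ("bitrate2", "short"): (5, 12),
    ("bitrate2", "long"): (20, 24),
}

def select_tf_resolution(bitrate_tier, wnr_active):
    """Pick the analysis TF-resolution: long window when wind noise
    reduction is active, short window otherwise."""
    window = "long" if wnr_active else "short"
    return TF_RESOLUTIONS[(bitrate_tier, window)]
```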
As indicated above the second independent or parallel decision with respect to wind noise is based on a decision of whether microphone selection is to be performed in order to implement wind noise reduction or suppression. Thus there is a decision to be implemented with respect to the selection of which microphones are used in the spatial analysis or transport signal generation.
In other words, there is a controlled determination of whether all microphones are selected or only some of the microphones are selected, and this information is then used so that only the selected microphones are used in the spatial analysis and transport audio signal generation. This is shown in Figure 4 by the following operation steps.
The wind noise reduction status input can be used to determine whether wind noise suppression or reduction using microphone selection is being implemented as shown in Figure 4 by step 417. This can, for example, employ a similar noise reduction or suppression decision as described above, where microphones whose audio signals indicate that wind noise is present (‘windy’ microphones) are indicated as not being selected (or otherwise where microphones whose audio signals indicate that wind noise is not present are indicated as being selected).
Where ‘windy’ microphones are not to be selected then an indicator, signal or other control is generated and passed to the spatial analyzer and transport signal generator in order that the usable (or non-‘windy’) microphones can be used as shown in Figure 4 by step 421. Where a wind based selection is not implemented or there are no ‘windy’ microphones then an indicator, signal or other control is generated such that all of the microphones are selected and used (in the spatial analysis and/or generation of the transport audio signals) as shown in Figure 4 by step 419.
Having applied the time-frequency resolution control and the microphone selection control then the (selected microphone) input signals are analysed to generate the spatial metadata as shown in Figure 4 by step 425 and furthermore transport audio signals can be generated as shown in Figure 4 by step 423.
In some embodiments the determination or selection of the time-frequency resolution of spatial analysis and MASA format generation is thus adapted based on the wind noise situation and how it is used to direct the codec onto a more optimal processing path. This example can also implement an analysis/coding where a better/higher time resolution is preferred over a better/higher frequency resolution (when a better/higher time resolution is available). In some embodiments if the spatial metadata does not have exactly (or almost) the same parameter values for each time subframe (e.g., 5 ms) within one signal frame (e.g., 20 ms), then a better or higher time resolution can be implemented. The adaptation algorithm can, for example, employ two distinct cases:
• If a longer time window should be used (i.e., better frequency resolution and worse time resolution), then the capture system is set to provide the same value for each time subframe. In some embodiments this can be implemented by changing the spatial analyzer to directly use a 20 ms time resolution when producing spatial metadata parameters. Alternatively, if direct analysis with the desired resolution is not possible, then the spatial metadata is averaged over a 20 ms time window and all subframes within each frequency band and spatial metadata parameters are set to the same value. This can be done, for example, based on the example methods such as shown in UKIPO patent applications 1919130.3 and 1919131.1. In some embodiments, with the MASA format metadata manipulated in this way, the metadata codec is directed to a pathway that uses better frequency resolution.
• If a shorter time window should be used, spatial analysis is implemented using a 5 ms time resolution, i.e., to provide a different value for each time subframe. This directs the codec to a pathway which prefers better time resolution. In some embodiments, additional reduction of frequency resolution may be implemented using the methods described above to ensure the preference of time resolution. (A sketch covering both cases is given after this list.)
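A minimal sketch of the two cases above, assuming per-subframe metadata arrays of shape (n_subframes, n_bands) and an energy-weighted circular average for the direction parameter (the function name and the weighting choice are assumptions):

```python
import numpy as np

def apply_time_window_mode(azi, ratio, energy, use_long_window):
    """Long window: every subframe in the frame carries the same (averaged)
    value, steering the codec towards the frequency-resolution path.
    Short window: the per-subframe values are kept as analysed."""
    if not use_long_window:
        return azi, ratio

    w = energy * ratio                                   # weight by direct energy
    a = np.radians(azi)
    x = np.sum(w * np.cos(a), axis=0, keepdims=True)
    y = np.sum(w * np.sin(a), axis=0, keepdims=True)
    azi_avg = np.degrees(np.arctan2(y, x))               # circular mean per band
    ratio_avg = np.sum(energy * ratio, axis=0, keepdims=True) / \
        np.maximum(np.sum(energy, axis=0, keepdims=True), 1e-12)

    n_sub = azi.shape[0]                                  # replicate into every subframe
    return np.repeat(azi_avg, n_sub, axis=0), np.repeat(ratio_avg, n_sub, axis=0)
```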
In some embodiments, in addition to the main metadata generation manipulation, the encoder configuration information (codec specification) and the used bitrate (where available) can be analysed to reduce the frequency resolution directly, so as to try to match the expected end-result frequency resolution. This allows more sophisticated metadata reduction or, preferably, analysis in the correct resolution to be employed. In such a manner, some embodiments can provide equal or better quality bitrate reduction algorithms.
Furthermore wind noise can affect microphone signals in such a way that the quality of the coherence parameter analysis may decrease drastically with respect to increasing wind noise. In some embodiments a decrease of analysed coherence parameter values or even disabling the use of coherence parameters by setting them to zero for the complete signal frame can be implemented based on a determination of wind noise.
With respect to Figure 5 a further example use case is shown with respect to spatial capture reliability determination. This example use case is one where there is information about the reliability of the direction or overall metadata estimates. For example, practical capture devices (e.g., mobile phones) can have very different and even asymmetrical microphone locations. This usually leads to a situation where the reliability of the spatial analysis is different depending on the analyzed direction of arrival. For example, if microphones are very close to each other (less than 1 cm), then direction estimates are less accurate when compared to microphones placed further away from each other. This information can be used to control spatial metadata analysis.
The example flow diagram shown in Figure 5 starts with obtaining the input information. This includes:
Obtaining capture device microphone audio signals as input audio signals as shown in Figure 5 by step 501 ;
Obtaining capture device microphone locations as shown in Figure 5 by step 503;
Obtaining the encoder configuration information (standard specification) as shown in Figure 5 by step 505;
Obtaining the encoder feedback information, for example, in the form of the current bitrate as shown in Figure 5 by step 507.
In this example a control (or determination) of available spatial analysis time resolution is shown in Figure 5 by step 511. In some embodiments, based on the encoder configuration (standard specification) and/or the encoder feedback information, a short-time resolution candidate and a long-time resolution candidate can be selected and the corresponding short-time and long-time resolution candidate values passed to the spatial analyzer.
Furthermore, in some embodiments the reliability of the metadata estimation is calculated for all possible directions based on the capture device microphone positions as shown in Figure 5 by step 509. The reliability determination can be implemented in any suitable manner. For example, in some embodiments, a reliability measure can be implemented for the three different cardinal directions (front/back, up/down, left/right) such that, based on the analyzed estimated direction vector, the reliability of the metadata can be determined as a combination of these cardinal directions. A reliability of spatial analysis can be determined based on the method shown in GB1619573.7, where the reliability of very closely spaced microphones is known to be less accurate than that of suitably placed microphones, and thus the reliability of direction estimates close to the axis passing through the very closely spaced microphones is deemed to be lower than the reliability of directions close to the axis passing through the suitably placed microphones.
In some embodiments the reliability of the direction estimates is most important in frequencies that are important for speech signals, in other words a frequency range of 100 Hz to 4 kHz. The reliability of known typical direction estimating methods is best for microphones placed a few centimeters apart from each other. Thus, microphones that are less than a centimeter apart from each other provide less reliable estimates. At lower frequencies this estimate reliability can be due to microphone internal noise and at higher frequencies due to aliasing effects.
In some embodiments a spatial analysis reliability parameter is obtained per device as a measurement in a controlled environment. Specific test sounds are produced, captured, and analyzed, and a spatial analysis reliability parameter δ(θ, f) is produced. This parameter is dependent on the spatial direction θ and the frequency f. In practical use, this parameter can also be converted to discrete frequency with frequency band index k by, for example, averaging the value over the continuous frequency f within the low and high limits of the band, f_k,low and f_k,high respectively.
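A short sketch of this band conversion, assuming the reliability curve for one analysed direction is sampled on a frequency grid and the band limits are given in Hz (the names are illustrative only):

```python
import numpy as np

def band_reliability(delta_f, freqs_hz, band_limits_hz):
    """Average a reliability curve delta(theta, f), evaluated for one
    direction on the grid freqs_hz, within each band [f_k,low, f_k,high)."""
    out = np.zeros(len(band_limits_hz))
    for k, (f_low, f_high) in enumerate(band_limits_hz):
        mask = (freqs_hz >= f_low) & (freqs_hz < f_high)
        out[k] = np.mean(delta_f[mask]) if np.any(mask) else 0.0
    return out
```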
Having determined the reliability, this can be used to control the spatial metadata formation. In the following example the analysis is implemented on a band-by-band basis but the same methods can be adapted to broadband employment.
First, it is possible to analyse the input signals with the short-time resolution candidate spatial time resolution to generate ‘short-time resolution’ spatial metadata values as shown in Figure 5 by step 513.
Then analyse the input signals with the long-time resolution candidate spatial time resolution to generate ‘long-time resolution’ spatial metadata values as shown in Figure 5 by step 514.
The metadata, comprising the direction parameter, is thus determined for both ‘long-time resolution’ and ‘short-time resolution’.
A reliability measure can then be found for the ‘long-time resolution’ direction estimate. If the reliability measure for the estimated direction is high, then reliability measures can then be found for the ‘short-time resolution’ direction estimates. If most of the analyzed directions have a high reliability, then the method can be configured to select short-time resolution metadata for use for the current frequency band. Otherwise, long-time resolution metadata is selected as shown in Figure 5 by step 515.
In some embodiments this can be formulated with equations as follows when θ_S(n, k) and θ_L(k) are the respective direction estimates for short and long time resolution. In this example case, the short time resolution direction contains a subframe index n representing time. In some embodiments, the long time resolution may contain a similar time index.
For each frequency band, two reliability measures are formed as follows:
δ_L(k) = δ(θ_L(k), k)
δ_S(k) = (1/N) Σ_{n=1}^{N} δ(θ_S(n, k), k)
where N is the total number of subframes in one frame, e.g., four 5 ms subframes in one 20 ms frame. Thus, this forms comparable measures for the long-time and short-time resolutions. In this case, the short-time resolution measure is formed with a simple mean of values, but any other method may be suitable (maximum, minimum, average weighted with energy/energy ratio, etc.).
For each band, a decision is then made as to which resolution to use. This is done with the following process (a sketch of the decision is given after the list):
1. Check if δ_L(k) > δ_thr1, where δ_thr1 is a tuned threshold value.
2. If true, check if δ_S(k) > δ_thr2, where δ_thr2 is another tuned threshold value.
3. If true, use the short time resolution. In all other cases, use the long time resolution.
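A sketch of this per-band decision, assuming the per-band reliability measures δ_L(k) and δ_S(k) have already been formed as above (the function name and the representation of the result are illustrative, and the thresholds are tuning parameters):

```python
import numpy as np

def choose_time_resolution(delta_long, delta_short, thr1, thr2):
    """Per-band choice between short- and long-time resolution metadata,
    following steps 1-3 above."""
    delta_long = np.asarray(delta_long)
    delta_short = np.asarray(delta_short)
    use_short = (delta_long > thr1) & (delta_short > thr2)
    return np.where(use_short, "short", "long")   # one label per band
```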
Once the decisions are known band-per-band, the metadata generation can be set accordingly.
This process produces metadata that results in a different bit-use distribution in the metadata codec, as more reliable bands get more bits to use when there is a shortage. This is an inherent property of the metadata codec when compressing data based on similarity. Furthermore, any unused bits can be given to the audio codec to improve the transport signal quality. As an additional step, the reliability decision can be performed for the whole metadata. In these embodiments the following comparison can be implemented:
[Equation not reproduced in the text: a comparison of the short-time and long-time reliability measures over all bands involving a tuning parameter β.]
where β is a tuning parameter, usually with a value of 1 or larger. If this comparison is true, then the short time resolution is used for the whole metadata; otherwise the method uses the long time resolution for the whole metadata. In practice, a specific level of overall reliability can be required for the short time resolution data to be considered. Other relations or even a constant comparison (similarly to above) can also be used. Once the decision is implemented, the metadata generation is tuned correspondingly and the codec is then directed to use the overall optimal coding mode for this metadata resolution.
What this process achieves is that when the data is reliable, the method also produces an accurate time fluctuation of directions and a better representation of the time structure. On the other hand, when the data is less reliable, more averaged data is produced, which reduces fluctuation artifacts caused by inaccurate direction data. This result is even more beneficial as the best direction reliability is usually obtained in the left/right cardinal direction, which means that sources in the front view are accurate.
Furthermore having determined the metadata the transport audio signals can be generated as shown in Figure 5 by step 517.
In the above examples, as shown in Figures 4 and 5, two separate use case examples are described. In some embodiments similar systems can be defined for other use cases, and embodiments can be combined into a single system. For example, the reduction of capture microphones in the example use case shown in Figure 4 can clearly lead to a change in spatial capture reliability, which is shown in the example use case of Figure 5.
The example parameters and the details of the codec can change and, for example, the example time resolution values may change or there may be multiple time resolutions from which to select. Furthermore, in some embodiments other parameters can be employed and the proposed methodology trivially adapted to these other parameter values. In the examples described herein, the decisions or selections (for example for time resolution) are shown to happen instantly and always in relation to the input data. However, in some embodiments the decisions or selections may have, for example, a hysteresis effect present in such a way that when one mode is selected, it is implemented even if the input data would signal otherwise. This effect may also be implemented in a conservative way. In other words, a more reliable and stable mode (in the sense of perceived audio artifacts, in practice, the longer time resolution) can be implemented unless there are multiple consecutive frames signifying that the other mode (in practice, the shorter time resolution) is to be selected.
With respect to Figure 6 an example electronic device which may be used as the analysis or synthesis device is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1400 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
In some embodiments the device 1400 comprises at least one processor or central processing unit 1407. The processor 1407 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1400 comprises a memory 1411. In some embodiments the at least one processor 1407 is coupled to the memory 1411. The memory 1411 can be any suitable storage means. In some embodiments the memory 1411 comprises a program code section for storing program codes implementable upon the processor 1407. Furthermore in some embodiments the memory 1411 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1407 whenever needed via the memory-processor coupling.
In some embodiments the device 1400 comprises a user interface 1405. The user interface 1405 can be coupled in some embodiments to the processor 1407. In some embodiments the processor 1407 can control the operation of the user interface 1405 and receive inputs from the user interface 1405. In some embodiments the user interface 1405 can enable a user to input commands to the device 1400, for example via a keypad. In some embodiments the user interface 1405 can enable the user to obtain information from the device 1400. For example the user interface 1405 may comprise a display configured to display information from the device 1400 to the user. The user interface 1405 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1400 and further displaying information to the user of the device 1400. In some embodiments the user interface 1405 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1400 comprises an input/output port 1409. The input/output port 1409 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1407 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
The transceiver input/output port 1409 may be configured to receive the signals and in some embodiments determine the parameters as described herein by using the processor 1407 executing suitable code. Furthermore the device may generate a suitable downmix signal and parameter output to be transmitted to the synthesis device.
In some embodiments the device 1400 may be employed as at least part of the synthesis device. As such the input/output port 1409 may be configured to receive the downmix signals and in some embodiments the parameters determined at the capture device or processing device as described herein, and generate a suitable audio signal format output by using the processor 1407 executing suitable code. The input/output port 1409 may be coupled to any suitable audio output for example to a multichannel speaker system and/or headphones or similar.
It is also noted herein that while the above describes example embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
In general, the various embodiments may be implemented in hardware or special purpose circuitry, software, logic or any combination thereof. Some aspects of the disclosure may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
As used in this application, the term “circuitry” may refer to one or more or all of the following:
(a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
(i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation. This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware.
The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in a server, a cellular network device, or other computing or network device.
The embodiments of this disclosure may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Computer software or program, also called a program product, including software routines, applets and/or macros, may be stored in any apparatus-readable data storage medium and they comprise program instructions to perform particular tasks. A computer program product may comprise one or more computer-executable components which, when the program is run, are configured to carry out embodiments. The one or more computer-executable components may be at least one software code or portions of it.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD. The physical media are non-transitory media. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may comprise one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), FPGAs, gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the disclosure may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
The scope of protection sought for various embodiments of the disclosure is set out by the independent claims. The embodiments and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the disclosure.
The foregoing description has provided by way of non-limiting examples a full and informative description of the exemplary embodiments of this disclosure. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. All such and similar modifications of the teachings of this disclosure will still fall within the scope of this invention as defined in the appended claims. Indeed, there is a further embodiment comprising a combination of one or more embodiments with any of the other embodiments previously discussed.

CLAIMS:
1. An apparatus for generating spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the apparatus comprising means configured to: obtain apparatus information; generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generate metadata parameters based on the metadata control information.
2. The apparatus as claimed in claim 1, wherein the metadata control information controls at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
3. The apparatus as claimed in any of claims 1 or 2, wherein the means is configured to encode the metadata parameters.
4. The apparatus as claimed in claim 3, wherein the means is further configured to: generate at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encode the at least one transport audio signal.
5. The apparatus as claimed in claim 4, wherein the means is further configured to: combine the encoded at least one transport audio signals and the encoded metadata parameters; and transmit to a further device and/or store the combined encoded at least one transport audio signals and metadata parameters.
6. The apparatus as claimed in any of claims 1 to 5, wherein the apparatus information comprises at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
7. The apparatus as claimed in claim 6, wherein the status of a signal enhancement processing applied to the at least one audio signal is a status of a wind noise suppression or reduction processing, wherein the means configured to generate metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs is configured to: determine a first time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has not had wind noise suppression or reduction processing; and determine a second time resolution window control when the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing, the second time resolution window being longer than the first time resolution window.
8. The apparatus as claimed in claim 7, wherein the means configured to generate metadata control information based on the apparatus information is configured to generate a time-resolution and a frequency-resolution window control based on the determined first or second time resolution window, the available bitrate for the encoded metadata parameters and an encoder specification detailing an encoder target time-resolution and frequency-resolution.
9. The apparatus as claimed in claim 8, wherein the means configured to generate metadata control information based on the apparatus information is configured to determine an audio signal selection control, wherein the audio selection control is configured to select at least one of the at least one audio signal based on whether the status of the wind noise suppression or reduction processing indicates that the at least one audio signal has had wind noise suppression or reduction processing.
10. The apparatus as claimed in any of claims 1 to 9, wherein the metadata codec information with respect to a range of possible metadata codecs comprises at least one of: an expected bit-rate or bandwidth for the encoded metadata parameters; a coding strategy for the encoding of the metadata parameters.
11. The apparatus as claimed in claim 10, wherein the means configured to generate metadata parameters based on the metadata control information is configured to generate the metadata parameters from an analysis of the at least one audio signal controlled by the metadata control information.
12. The apparatus as claimed in any of claims 1 to 11, wherein the means configured to generate metadata control information based on the apparatus information is configured to determine a direction dependent reliability control based on microphone location configuration on the apparatus.
13. The apparatus as claimed in claim 12, wherein the means configured to generate metadata parameters based on the metadata control information is configured to generate at least two sets of metadata parameters from analysis of the at least one audio signal, with a first set of metadata parameters based on an analysis with a first time-resolution window and a second set of metadata parameters based on an analysis with a second time-resolution window, the second time resolution window being longer than the first time-resolution window.
14. The apparatus as claimed in claim 13, wherein the means configured to generate metadata parameters based on the metadata control information is configured to select one of the at least two sets of metadata parameters based on the direction dependent reliability control.
15. A method for an apparatus for generating spatial audio signal parameters, the spatial audio signal parameters associated with at least one audio signal, the method comprising: obtaining apparatus information; generating metadata control information based on the apparatus information and metadata codec information with respect to a range of possible metadata codecs, such that the metadata control information is configured to adaptively control the generation of metadata based on the metadata codec information and the apparatus information; and generating metadata parameters based on the metadata control information.
16. The method as claimed in claim 15, wherein the metadata control information controls at least one of: a presence of coherence parameters; a number of concurrent direction parameters; a time-resolution of the metadata parameters; and a frequency-resolution of the metadata parameters.
17. The method as claimed in any of claims 15 or 16, further comprising encoding the metadata parameters.
18. The method as claimed in claim 17, further comprising: generating at least one transport audio signal from the at least one audio signal and based on the metadata control information; and encoding the at least one transport audio signal.
19. The method as claimed in claim 18, further comprising: combining the encoded at least one transport audio signals and the encoded metadata parameters; and transmitting to a further device and/or storing the combined encoded at least one transport audio signals and metadata parameters.
20. The method as claimed in any of claims 15 to 19, wherein the apparatus information comprises at least one of: the at least one audio signal; a spatial analysis of the at least one audio signal; an estimate of the performance of a spatial analysis of the at least one audio signal; a status of a signal enhancement processing applied to the at least one audio signal; and an apparatus microphone configuration.
PCT/EP2021/078853 2021-10-18 2021-10-18 Metadata generation within spatial audio WO2023066456A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/078853 WO2023066456A1 (en) 2021-10-18 2021-10-18 Metadata generation within spatial audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/078853 WO2023066456A1 (en) 2021-10-18 2021-10-18 Metadata generation within spatial audio

Publications (1)

Publication Number Publication Date
WO2023066456A1 true WO2023066456A1 (en) 2023-04-27

Family

ID=78269644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/078853 WO2023066456A1 (en) 2021-10-18 2021-10-18 Metadata generation within spatial audio

Country Status (1)

Country Link
WO (1) WO2023066456A1 (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019105575A1 (en) * 2017-12-01 2019-06-06 Nokia Technologies Oy Determination of spatial audio parameter encoding and associated decoding
WO2020178475A1 (en) * 2019-03-01 2020-09-10 Nokia Technologies Oy Wind noise reduction in parametric audio

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NOKIA CORPORATION: "Description of the IVAS MASA C Reference Software", 3GPP TSG-SA4#106 MEETING, 21 October 2019 (2019-10-21)

Similar Documents

Publication Publication Date Title
US20210210104A1 (en) Spatial Audio Parameter Merging
US20210319799A1 (en) Spatial parameter signalling
US20230402053A1 (en) Combining of spatial audio parameters
WO2021130404A1 (en) The merging of spatial audio parameters
US20230047237A1 (en) Spatial audio parameter encoding and associated decoding
WO2023066456A1 (en) Metadata generation within spatial audio
US20240029745A1 (en) Spatial audio parameter encoding and associated decoding
US20230197087A1 (en) Spatial audio parameter encoding and associated decoding
JP7223872B2 (en) Determining the Importance of Spatial Audio Parameters and Associated Coding
US20230410823A1 (en) Spatial audio parameter encoding and associated decoding
US20240046939A1 (en) Quantizing spatial audio parameters
WO2023156176A1 (en) Parametric spatial audio rendering
EP4320876A1 (en) Separating spatial audio objects
WO2023031498A1 (en) Silence descriptor using spatial parameters
WO2023088560A1 (en) Metadata processing for first order ambisonics
WO2023179846A1 (en) Parametric spatial audio encoding
WO2022223133A1 (en) Spatial audio parameter encoding and associated decoding

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21794379

Country of ref document: EP

Kind code of ref document: A1