WO2024012805A1 - Transporting audio signals inside spatial audio signal - Google Patents

Transporting audio signals inside spatial audio signal

Info

Publication number
WO2024012805A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
focussed
audio
focus
generate
Prior art date
Application number
PCT/EP2023/066359
Other languages
French (fr)
Inventor
Miikka Tapani Vilermo
Lasse Juhani Laaksonen
Arto Juhani Lehtiniemi
Mikko Tapio Tammi
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2024012805A1

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • the present application relates to apparatus and methods for transporting audio signals inside a spatial audio signal and specifically but not exclusively apparatus and methods for transporting a focused pair of audio signals within a spatial audio signal using direction metadata.
  • Parametric spatial audio systems can be configured to store and transmit an audio signal with associated metadata.
  • the metadata describes spatial (and non-spatial) characteristics of the audio signal.
  • the audio signals and metadata together can be used to render a spatial audio signal, typically for many different playback devices, e.g. headphones, stereo speakers, 5.1 speakers, homepods.
  • the metadata typically comprises direction parameters (azimuth, elevation) and ratio parameters (direct-to-ambience ratio i.e. D/A ratio).
  • Direction parameters describe sound source directions typically in time-frequency tiles.
  • Ratio parameters describe the diffuseness of the audio signal, i.e. the ratio of direct energy to diffuse energy, also in time-frequency tiles. These parameters are psychoacoustically the most important in creating spatially correct-sounding audio for a human listener.
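  • As an illustrative formula (an assumption for this text, not quoted from the application), the direct-to-total energy ratio of a tile can be written as

        r(k, n) = E_direct(k, n) / (E_direct(k, n) + E_diffuse(k, n)),    0 <= r(k, n) <= 1,

    where k is the frequency band index, n is the temporal frame index, r = 1 indicates a fully directional tile and r = 0 a fully diffuse one.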
  • audio focus is an audio processing method where sound sources in a direction are amplified with respect to sound sources in other directions.
  • known methods such as beamforming or spatial filtering are employed. Beamforming and spatial filtering approaches both require knowledge about sound directions. These can typically only be estimated if the original microphone signals from known locations are present.
  • an apparatus for generating spatial audio signals, the apparatus comprising means configured to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • the first focussed audio signal direction may be fixed relative to a direction of the apparatus.
  • the first focussed audio signal direction may be the at least one metadata directional parameter value.
  • the means configured to generate the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction, may be configured to at least one of: select one of the at least two audio signals to generate the first focussed audio signal; and mix at least two of the at least two audio signals to generate the first focussed audio signal.
  • the means configured to generate the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction, may be configured to at least one of: select one of the at least two audio signals to generate the second focussed audio signal; and mix at least two of the at least two audio signals to generate the second focussed audio signal.
  • the means configured to generate the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may be configured to generate the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
  • the means configured to generate the second output audio signal may be configured to generate the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
  • the means may be further configured to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
  • the means configured to obtain at least one metadata directional parameter associated with the at least two audio signals may be configured to analyse the at least two audio signals to generate the at least one metadata directional parameter.
  • the means configured to obtain at least one metadata directional parameter associated with the at least two audio signals may be configured to receive the at least one metadata directional parameter, and the means configured to obtain at least two audio signals may be configured to receive the at least two audio signals.
  • an apparatus for processing spatial audio signals, the apparatus comprising means configured to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
  • the first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
  • the means may be configured to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the means configured to generate the focus audio signal may be configured to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
  • the means configured to generate at least one output audio signal based on the focus audio signal may be configured to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal.
  • a method for an apparatus for generating spatial audio signals comprising: obtaining at least two audio signals; obtaining at least one metadata directional parameter associated with the at least two audio signals; generating a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generating a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generating a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • the first focussed audio signal direction may be fixed relative to a direction of the apparatus.
  • the first focussed audio signal direction may be the at least one metadata directional parameter value.
  • Generating the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction, may comprise at least one of: selecting one of the at least two audio signals to generate the first focussed audio signal; and mixing at least two of the at least two audio signals to generate the first focussed audio signal.
  • Generating the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction, may comprise at least one of: selecting one of the at least two audio signals to generate the second focussed audio signal; and mixing at least two of the at least two audio signals to generate the second focussed audio signal.
  • Generating the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may comprise generating the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
  • Generating the second output audio signal may comprise generating the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
  • the method may further comprise generating a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
  • Obtaining at least one metadata directional parameter associated with the at least two audio signals may comprise analysing the at least two audio signals to generate the at least one metadata directional parameter.
  • Obtaining at least one metadata directional parameter associated with the at least two audio signals may comprise receiving the at least one metadata directional parameter, and obtaining at least two audio signals may comprise receiving the at least two audio signals.
  • a method for an apparatus for processing spatial audio signals comprising: obtaining a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtaining a desired focus directional parameter; generating a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generating at least one output audio signal based on the focus audio signal.
  • the first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
  • the method may further comprise, prior to generating the focus audio signal: de-panning the first audio signal to generate the first focussed audio signal; and de-panning the second audio signal to generate the second focussed audio signal, wherein generating the focus audio signal may comprise generating the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
  • Generating at least one output audio signal based on the focus audio signal may comprise: generating a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generating a second output audio signal based on a combination of the focus audio signal and the second audio signal.
  • an apparatus for generating spatial audio signals comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • the first focussed audio signal direction may be fixed relative to a direction of the apparatus.
  • the first focussed audio signal direction may be the at least one metadata directional parameter value.
  • the apparatus caused to generate the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction, may be caused to at least one of: select one of the at least two audio signals to generate the first focussed audio signal; and mix at least two of the at least two audio signals to generate the first focussed audio signal.
  • the apparatus caused to generate the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction, may be caused to at least one of: select one of the at least two audio signals to generate the second focussed audio signal; and mix at least two of the at least two audio signals to generate the second focussed audio signal.
  • the apparatus caused to generate the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may be caused to generate the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
  • the apparatus caused to generate the second output audio signal may be caused to generate the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
  • the apparatus may be further caused to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
  • the apparatus caused to obtain at least one metadata directional parameter associated with the at least two audio signals may be caused to analyse the at least two audio signals to generate the at least one metadata directional parameter.
  • the apparatus caused to obtain at least one metadata directional parameter associated with the at least two audio signals may be caused to receive the at least one metadata directional parameter, and the apparatus caused to obtain at least two audio signals may be caused to receive the at least two audio signals.
  • an apparatus for processing spatial audio signals, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
  • the first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
  • the apparatus may be caused to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the apparatus caused to generate the focus audio signal may be caused to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
  • the apparatus caused to generate at least one output audio signal based on the focus audio signal may be caused to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal.
  • an apparatus for generating spatial audio signals comprising: obtaining circuitry configured to obtain at least two audio signals; obtaining circuitry configured to obtain at least one metadata directional parameter associated with the at least two audio signals; generating circuitry configured to generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generating circuitry configured to generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generating circuitry configured to generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • an apparatus for processing spatial audio signals comprising: obtaining circuitry configured to obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtaining circuitry configured to obtain a desired focus directional parameter; generating circuitry configured to generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generating circuitry configured to generate at least one output audio signal based on the focus audio signal.
  • a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • a non-transitory computer readable medium comprising program instructions for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
  • an apparatus for generating spatial audio signals comprising: means for obtaining at least two audio signals; means for obtaining at least one metadata directional parameter associated with the at least two audio signals; means for generating a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; means for generating a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and means for generating a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • an apparatus for processing spatial audio signals comprising: means for obtaining a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; means for obtaining a desired focus directional parameter; means for generating a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and means for generating at least one output audio signal based on the focus audio signal.
  • a computer readable medium comprising program instructions for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
  • a computer readable medium comprising program instructions for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
  • An apparatus comprising means for performing the actions of the method as described above.
  • An apparatus configured to perform the actions of the method as described above.
  • a computer program comprising program instructions for causing a computer to perform the method as described above.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
  • Figure 2 shows schematically an example encoder within the system of apparatus shown in Figure 1 according to some embodiments;
  • Figure 3 shows a flow diagram of the operation of the example encoder shown in Figure 2 according to some embodiments;
  • Figure 4 shows schematically an example decoder within the system of apparatus shown in Figure 1 according to some embodiments;
  • Figure 5 shows a flow diagram of the operation of the example decoder shown in Figure 4 according to some embodiments;
  • Figure 6 shows an example microphone selection for a sound object;
  • Figure 7 shows an example of focus and anti-focus areas as employed in some embodiments;
  • Figure 8 shows an example gain function for modifying focus and anti-focus signals according to some embodiments;
  • Figure 9 shows schematically a further example encoder within the system of apparatus shown in Figure 1 according to some embodiments;
  • Figure 10 shows a flow diagram of the operation of the example encoder shown in Figure 9 according to some embodiments;
  • Figure 11 shows schematically a further example decoder within the system of apparatus shown in Figure 1 according to some embodiments;
  • Figure 12 shows a flow diagram of the operation of the example decoder shown in Figure 11 according to some embodiments;
  • Figures 13 to 16 show further example microphone selections for a sound object;
  • Figures 17 and 18 show further examples of focus and anti-focus areas as employed in some embodiments;
  • Figure 19 shows a further example gain function for modifying focus and anti-focus signals according to some embodiments;
  • Figure 20 shows an example device suitable for implementing the apparatus shown in the previous figures.
  • parametric spatial audio systems can be configured to store and transmit an audio signal together with metadata.
  • audio focus or audio focussing is an audio processing method where sound sources in a direction (or within a defined range) are amplified with respect to sound sources in other directions. Although an audio focus or focussing approach is discussed herein, it would be appreciated that an audio de-focus or defocussing approach, where sound sources in a direction (or within a defined range) are diminished or reduced with respect to sound sources in other directions, could be exploited in a similar manner to that described in the following.
  • Typical uses for audio focus are: telecommunications, where a user's voice is amplified compared to background sounds; speech recognition, where voice is amplified to minimize the word error rate; sound source amplification in the direction of a camera that records video with audio; an off-camera focus, where a listener watching a video wants to focus on some direction other than the camera direction (for example, the person who recorded the video wants the audio to focus on their own child, whereas the person watching the video might want to focus the audio on the listener's own child, away from the camera axis); a focus-switch, where a listener may want to focus on different audio objects at different times while watching a video; and a teleconference or live meeting application, where different listeners of a meeting may want to focus on different speakers.
  • the following concept which is described with respect to the following embodiments is one where focused signals are hidden or encapsulated inside a (backwards) compatible spatial audio signal that can be received and processed on a conventional player but can be focused using a player such as described herein.
  • a fixed direction focused audio is encapsulated or embedded in a stereo signal.
  • apparatus and methods for creating a suitable backwards compatible spatial audio signal where a dominant sound source is in a ‘correct’ direction and where it is possible for a listener to focus audio towards any direction with the help of directional metadata.
  • the dominant sound source direction is assumed to be fixed.
  • the apparatus and methods are configured to generate or create audio signals (which can be designated A and B), where a first audio signal, A, emphasizes a fixed direction and a second audio signal, B, de-emphasizes the fixed direction.
  • the first audio signal, A, and the second audio signal, B, can be losslessly mixed with the help of the direction metadata to create a spatial audio signal where the perceived dominant audio direction is correct and the same as the metadata direction.
  • This ‘mixed’ spatial audio signal can be considered to be the ‘backwards’ compatible audio signal.
  • as the mixing is lossless it can be reversed (with the help of the metadata direction) and the first audio signal, A, and the second audio signal, B, used to create a focused audio signal based on a user desired direction.
  • the focused audio is ‘disguised’ in a stereo signal.
  • the apparatus and methods can be configured to create a backwards compatible spatial audio signal where the dominant sound source is in the ‘correct’ direction and where it is possible for a user to focus audio towards any direction with the help of directional metadata.
  • the dominant sound source direction is estimated for each (time-frequency) tile and transmitted as metadata.
  • the first audio signal, A, and the second audio signal, B, can then be generated or created, where the first audio signal, A, emphasizes the metadata direction in each tile and the second audio signal, B, de-emphasizes the metadata direction in each tile.
  • the first audio signal, A, and the second audio signal, B, can furthermore be losslessly mixed with the help of the estimated dominant direction in the metadata to create a spatial audio signal where the perceived dominant audio direction is correct and the same as the metadata direction.
  • This mixed audio signal forms the backwards compatible audio signal.
  • the mix can be reversed (with the help of the metadata direction) and the first audio signal, A, and the second audio signal, B, can be used to create a focused audio signal based on a user desired direction.
  • Embodiments will be described with respect to an example capture (or encoder/analyser) and playback (or decoder/synthesizer) apparatus or system 100 as shown in Figure 1.
  • the audio signal input is one from a microphone array; however, it would be appreciated that the audio input can be any suitable audio input format, and the description hereafter details where differences in the processing occur when a differing input format is employed.
  • the system 100 is shown with a capture (encoder/analyser) part and a playback (decoder/synthesizer) part.
  • the capture part in some embodiments comprises a microphone array audio signals input 102.
  • the input audio signals can be from any suitable source, for example: two or more microphones mounted on a mobile phone, or other microphone arrays, e.g., a B-format microphone or an Eigenmike.
  • the input can be any suitable audio signal input such as Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA) or Loudspeaker surround mix and/or objects.
  • the microphone array audio signals input 102 may be provided to a microphone array front end 103.
  • the microphone array front end in some embodiments is configured to implement an analysis processor functionality configured to generate or determine suitable (spatial) metadata associated with the audio signals and implement a suitable transport signal generator functionality to generate transport audio signals.
  • the analysis processor functionality is thus configured to perform spatial analysis on the input audio signals yielding suitable spatial metadata 106 in frequency bands.
  • suitable spatial metadata, for example directions and direct-to-total energy ratios (or similar parameters such as diffuseness, i.e., ambient-to-total ratios) in frequency bands.
  • some examples may comprise performing a suitable time-frequency transform on the input signals, and then, in frequency bands when the input is a mobile phone microphone array, estimating delay values between microphone pairs that maximize the inter-microphone correlation, formulating the corresponding direction value for that delay (as described in GB Patent Application Number 1619573.7 and PCT Patent Application Number PCT/FI2017/050778), and formulating a ratio parameter based on the correlation value.
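  • A minimal Python sketch of such a delay-based estimate for one microphone pair (NumPy arrays) is given below; the function name, the time-domain correlation search and the far-field delay-to-angle conversion are illustrative assumptions, since the cited applications perform the analysis in frequency bands:

        import numpy as np

        def estimate_direction(x1, x2, fs, mic_distance_m, c=343.0):
            # Search the inter-microphone delay (in samples) that maximises
            # the correlation between the two microphone signals of one frame.
            max_lag = max(1, int(np.ceil(mic_distance_m / c * fs)))
            seg = slice(max_lag, len(x1) - max_lag)
            lags = np.arange(-max_lag, max_lag + 1)
            corr = [float(np.dot(x1[seg], np.roll(x2, lag)[seg])) for lag in lags]
            best_lag = int(lags[int(np.argmax(corr))])
            # Far-field approximation: delay -> angle relative to the mic axis.
            sin_a = np.clip(best_lag / fs * c / mic_distance_m, -1.0, 1.0)
            return float(np.degrees(np.arcsin(sin_a)))

    A normalised version of the maximum correlation value could similarly serve as the basis of the ratio parameter mentioned above.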
  • the metadata can be of various forms and in some embodiments comprise spatial metadata and other metadata.
  • a typical parameterization for the spatial metadata is one direction parameter in each frequency band, characterized as an azimuth value φ(k, n) and an elevation value θ(k, n), and an associated direct-to-total energy ratio in each frequency band r(k, n), where k is the frequency band index and n is the temporal frame index.
  • the parameters generated may differ from frequency band to frequency band.
  • in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted.
  • a practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons.
  • the output of the analysis processor functionality is (spatial) metadata 106 determined in time-frequency tiles.
  • the (spatial) metadata 106 may involve directions and energy ratios in frequency bands but may also have any of the metadata types listed previously.
  • the (spatial) metadata 106 can vary over time and over frequency.
  • in some embodiments the analysis functionality is implemented externally to the system 100.
  • the spatial metadata associated with the input audio signals may be provided to an encoder 107 as a separate bit-stream.
  • the spatial metadata may be provided as a set of spatial (direction) index values.
  • the microphone array front end 103 is further configured to implement transport signal generator functionality, in order to generate suitable transport audio signals 104.
  • the transport signal generator functionality is configured to receive the input audio signals, which may for example be the microphone array audio signals 102 and generate the transport audio signals 104.
  • the transport audio signals may be a multi-channel, stereo, binaural or mono audio signal.
  • the generation of transport audio signals 104 can be implemented using any suitable method.
  • the transport signals 104 are the input audio signals, for example the microphone array audio signals.
  • the number of transport channels can also be any suitable number (rather than one or two channels as discussed in the examples).
  • the capture part may comprise an encoder 107.
  • the encoder 107 can be configured to receive the transport audio signals 104 and the spatial metadata 106.
  • the encoder 107 may furthermore be configured to generate a bitstream 108 comprising an encoded or compressed form of the metadata information and transport audio signals.
  • the encoder 107 could be implemented as an IVAS encoder, or any other suitable encoder.
  • the encoder 107 in such embodiments is configured to encode the audio signals and the metadata and form an IVAS bit stream.
  • This bitstream 108 may then be transmitted/stored as shown by the dashed line.
  • the system 100 furthermore may comprise a player or decoder 109 part.
  • the player or decoder 109 is configured to receive, retrieve or otherwise obtain the bitstream 108 and from the bitstream generate suitable spatial audio signals 110 to be presented to the listener/listener playback apparatus.
  • the decoder 109 is therefore configured to receive the bitstream 108 and demultiplex the encoded streams and then decode the audio signals to obtain the transport signals and metadata.
  • the decoder 109 furthermore can be configured to, from the transport audio signals and the spatial metadata, produce the spatial audio signals output 110 for example a binaural audio signal that can be reproduced over headphones.
  • two focused audio signals are hidden or encapsulated inside a ‘backwards’ compatible spatial audio signal in a manner that can be reversed and the two focused audio signals can be used to focus audio.
  • the two focused audio signals can be generated using any suitable focussing method.
  • the focus directions are fixed.
  • a first focus direction can be the ‘camera’ direction or front direction (or mouth reference point direction).
  • the second focus direction can then be defined as the opposite of the first direction (or more generally a direction which is substantially different to the first direction).
  • the term opposite can be generalised as a direction which is substantially different to the first direction.
  • the audio signals can be defined as Xfixfoc and Xfixanti.
  • a series of microphones as part of the microphone array: a first microphone, mic 1, 290, a second microphone, mic 2, 292, and a third microphone, mic 3, 294, which are configured to generate the audio input 102 which is passed to a direction estimator 201.
  • the direction estimator 201 can be considered to be part of the metadata generation operations as described above.
  • the direction estimator 201 thus can be configured to output the microphone audio signals in the form of the audio input 102 and the direction values 208.
  • the direction estimate is an estimate of the dominant sound source direction.
  • the direction estimation as indicated above is implemented in small time-frequency tiles by framing the microphone signals into typically 20 ms frames, transforming the frames into the frequency domain (using a DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) or filter banks such as QMF (Quadrature Mirror Filter)), splitting the frequency domain signal into frequency bands and analysing the direction in the bands.
  • These types of framed bands of audio are referred to as time-frequency tiles.
  • the tiles are typically narrower at low frequencies and wider at higher frequencies and may follow, for example, third-octave bands, Bark bands or ERB (Equivalent Rectangular Bandwidth) bands. Other methods such as filterbanks exist for creating similar tiles.
  • At least one dominant sound source direction a is estimated for each tile using any suitable method such as described above.
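  • A possible Python sketch of this tiling is shown below; the 20 ms frame length follows the text, while the band edges and function name are illustrative assumptions (the text suggests, for example, third-octave, Bark or ERB bands):

        import numpy as np

        def to_time_frequency_tiles(x, fs, frame_s=0.02,
                                    band_edges_hz=(0, 400, 800, 1600, 3200, 8000)):
            # Frame the signal into (typically) 20 ms frames, transform each
            # frame to the frequency domain and group the DFT bins into
            # frequency bands: each (frame, band) pair is one tile.
            n = int(frame_s * fs)
            freqs = np.fft.rfftfreq(n, 1.0 / fs)
            tiles = []
            for start in range(0, len(x) - n + 1, n):
                spectrum = np.fft.rfft(x[start:start + n])
                tiles.append([spectrum[(freqs >= lo) & (freqs < hi)]
                              for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:])])
            return tiles  # tiles[frame][band] -> complex DFT bins of one tile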
  • the encoder part comprises a microphone selector/focus processor 203 which is configured to obtain the audio input 102 and from these audio signals generate a (fixed direction) focus signal 204 Xfixfoc and an anti-focus signal 206 Xfixanti.
  • a simple method for generating the focus 204 and anti-focus 206 audio signals is to select a front microphone to supply the focus audio signal 204 Xfixfoc and a back microphone to supply the anti-focus audio signal 206 Xfixanti.
  • FIG. 6 shows an example apparatus, a phone with 2 microphones 600.
  • the phone 600 has a defined front direction 603 and a front microphone 607 (a microphone located on the front face of the apparatus) and a back microphone 609 (a second microphone located on the back or rear face of the apparatus).
  • for a sound object 601 which has a direction a 605 relative to the front axis 603, when the direction is less than 90 degrees the front microphone is the ‘near microphone’ and the back microphone is the ‘far microphone’, with the front microphone audio signal supplying the focus audio signal and the back microphone audio signal supplying the anti-focus audio signal.
  • when the direction is greater than 90 degrees the front microphone is the ‘far microphone’ and the back microphone is the ‘near microphone’, with the back microphone audio signal supplying the focus audio signal and the front microphone audio signal supplying the anti-focus audio signal.
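  • A minimal sketch of this selection rule (the function and signal names are hypothetical) could be:

        def select_focus_signals(front_sig, back_sig, direction_deg):
            # The microphone facing the sound object supplies the focus signal,
            # the opposite microphone supplies the anti-focus signal.
            if abs(direction_deg) < 90.0:   # object in the front hemisphere
                return front_sig, back_sig  # (Xfixfoc, Xfixanti)
            return back_sig, front_sig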
  • the microphone selector/focus processor 203 is configured to generate the focus and antifocus audio signals by applying a beamforming operation on the microphone audio signals.
  • MVDR (Minimum Variance Distortionless Response) beamforming can be used, using any subset or all of the device microphones, or spatial filtering or microphone selection or any combination thereof can be employed.
  • other audio focus methods can be used such as spatial filtering.
  • FIG. 7 shows an example apparatus, a phone with 3 microphones 700.
  • the phone 700 has a defined front direction 603 and a first front microphone 607 (a microphone located on the front face of the apparatus), a second front microphone 707 (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone 609 (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone).
  • the focus audio signal can be a beamformed or spatially processed audio signal with a focus 701 towards the front direction, using any subset or all of the microphones, when the direction is less than 90 degrees, and the anti-focus audio signal can be a beamformed or spatially processed audio signal with a focus 703 towards the back direction, using any subset or all of the microphones.
  • the microphone selector/focus processor 203 is configured to create two signals, the near mic/focus signal Xfixfoc 204 and the far mic/anti-focus signal Xfixanti 206, where Xfixfoc is focused towards an important direction (for example the forwards or camera direction) and Xfixanti is focused towards the opposite direction (or more generally a direction which is substantially different to the first direction). They can be generated using any combination of microphone selection, beamforming, spatial filtering or other audio focus methods.
  • a panner 205 can furthermore be configured to obtain the focus audio signal 204, the anti-focus audio signal 206 and the direction values 208.
  • the near mic/focus signal Xfixfoc 204 and the far mic/anti-focus signal Xfixanti 206 are modified by an invertible panning process that turns them into backwards compatible spatial audio signals, while the panning process can be removed when necessary.
  • the panned left channel audio signal 224 and the panned right channel audio signal 226 can then be output.
  • the near mic/focus signal Xfixfoc 204 and the far mic/anti-focus signal Xfixanti 206 are converted into a spatial audio (stereo) signal, the panned left channel audio signal 224 and the panned right channel audio signal 226, which is not strongly focused in any direction and where sound sources are approximately in the correct places.
  • the spatial audio signals are generated such that at least the dominant sound source is in the correct perceived direction. This can be done with the help of the direction a 208, which is the direction of a dominant sound source. As described above, a can be estimated for each time-frequency tile.
  • the Xfixfoc audio signal is panned to the dominant sound source direction and the Xfixanti audio signal is treated as an ambient signal and added to and subtracted from the channels of the backwards compatible spatial audio signal, as illustrated by the sketch below.
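  • The application's exact mixing equations are not reproduced here; a minimal sketch under the assumption of a simple sine/cosine amplitude-panning law (the gains, angle mapping and function name are illustrative) would be:

        import numpy as np

        def pan_to_stereo(x_fixfoc, x_fixanti, a_deg):
            # Map the dominant direction a (limited here to -90..90 degrees)
            # to left/right panning gains, pan the focus signal to that
            # direction, and add/subtract the anti-focus signal per channel.
            theta = np.radians((np.clip(a_deg, -90.0, 90.0) + 90.0) / 2.0)
            g_left, g_right = np.cos(theta), np.sin(theta)
            left = g_left * x_fixfoc + x_fixanti
            right = g_right * x_fixfoc - x_fixanti
            return left, right

  • With this assumed panning law g_left + g_right never reaches zero, so the mix can be exactly inverted given the direction a; this is what makes the embedding lossless.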
  • the Xfixanti signal could alternatively be decorrelated using known invertible decorrelators.
  • the focus signals can be binauralized using Inter-aural Level and Time Differences (ILDs and ITDs).
  • if ILDs and ITDs are applied in tiles by modifying the level and phase differences to match the desired differences of human hearing (which depend on the direction a), then the process is also reversible and can be used in this invention.
  • the Xfixfoc would be panned to the direction a using ILDs and ITDs, whereas Xfixanti would be used as a background ambient signal and simply summed with and subtracted from the panned left channel audio signal 224 and the panned right channel audio signal 226, as was implemented in the stereo case described above.
  • the panner 205 can then output the direction values 208, the panned left channel audio signal L 224 and the panned right channel audio signal R 226.
  • the encoder further comprises a suitable low bit-rate encoder 207. This is optionally configured to encode the metadata and the panned left and right channel audio signals.
  • the encoder comprises a suitable storage/transmission part 209 configured to store and/or transmit the metadata and audio signals (which as shown herein can be encoded).
  • the output of the encoder is thus configured to produce (encoded) L and R signals that can be used in a backwards compatible way. They form a stereo spatial audio signal that can be used like any other stereo signal.
  • the operations comprise obtaining/capturing audio signals from the microphones as shown in Figure 3 by step 301.
  • the following operation is one of estimating directions from the microphone audio signals as shown in Figure 3 by step 303.
  • the following operation is one of microphone selecting/focussing as shown in Figure 3 by step 305.
  • the decoder part for example can in some embodiments comprise a retriever/receiver 401 configured to retrieve or receive the ‘stereo’ audio signals and the metadata including the direction values from the storage or from the network.
  • the retriever/receiver is thus configured to be the reciprocal of the storage/transmission 209 as shown in Figure 2.
  • the decoder part comprises a decoder 403, which is optional, which is configured to apply a suitable inverse operation to the encoder 207.
  • the direction values 400 and the panned left channel audio signal L 402 and the panned right channel audio signal R 404 can then be passed to the reverse panner 405 (or directly to the audio focusser 407).
  • the decoder part comprises an optional reverse panner 405.
  • the reverse panner 405 is configured to receive the direction values 400 and the panned left channel audio signal L 402 and the panned right channel audio signal R 404 and regenerate the focus audio signal (which can be the near microphone audio signal) 406 and the anti-focus audio signal (which can be the far microphone audio signal) 408, and pass these together with the direction values 400 to the audio focusser 407.
  • the reverse panner 405 is configured to reverse the panning process (applied in the encoder part) and thus ‘access’ the original focused signals.
  • a diffuseness parameter can be further employed to assist in the panning process and as such can also be employed in the reverse panning process.
  • the reverse process can in some embodiments be implemented by solving the panning equations for Xfixfoc and Xfixanti, as illustrated by the sketch below.
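  • Under the same assumed sine/cosine panning law as the encoder sketch above, the reversal could look like:

        import numpy as np

        def reverse_pan(left, right, a_deg):
            # Invert the sketch panning: the anti-focus parts cancel in the
            # sum of the channels, after which Xfixanti follows by subtraction.
            theta = np.radians((np.clip(a_deg, -90.0, 90.0) + 90.0) / 2.0)
            g_left, g_right = np.cos(theta), np.sin(theta)
            x_fixfoc = (left + right) / (g_left + g_right)  # denominator >= 1
            x_fixanti = left - g_left * x_fixfoc
            return x_fixfoc, x_fixanti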
  • the decoder part further can comprise in some embodiments an audio focusser 407 configured to obtain the regenerated focus audio signal (which can be the near microphone audio signal) 406, the anti-focus audio signal (which can be the far microphone audio signal) 408 and the direction values 400. Additionally the audio focusser is configured to receive the listener or device desired focus direction β 410. The audio focusser 407 is thus configured (with the reverse panner 405) to focus the L and R spatial audio signals towards a direction β by reversing the panning process (and regenerating the focus and anti-focus audio signals) and then generating the focussed audio signal 412 and the direction value 400.
  • Audio focus can be achieved using the Xfixfoc and Xfixanti signals.
  • the audio focusser 407 is configured to create an audio focused signal towards the user input direction β by summing the Xfixfoc and Xfixanti signals with suitable gains.
  • the audio focusser as such is configured to use mostly Xfixfoc when the user desired direction is the front direction and to use mostly Xfixanti when the user desired direction is opposite to the front direction. For other directions, Xfixfoc and Xfixanti are mixed more evenly, as illustrated by the sketch below.
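  • One hypothetical choice of such gains (an assumption, not the application's formula) is a cosine cross-fade over the user desired direction β:

        import numpy as np

        def focus_towards(x_fixfoc, x_fixanti, beta_deg):
            # Weight 1 when beta points to the front (focus) direction,
            # 0 when it points to the opposite (anti-focus) direction,
            # and an even mix for directions in between.
            w = (1.0 + np.cos(np.radians(beta_deg))) / 2.0
            return w * x_fixfoc + (1.0 - w) * x_fixanti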
  • the focussed audio signal 412 Xfocus can be used as such if a mono focused signal is enough.
  • the decoder optionally comprises a focussed signal panner 409.
  • the focussed signal panner 409 can be configured to obtain or receive the focussed audio signal 412 and the direction 400 and be configured to generate left and right channel audio signal outputs (or in some embodiments any suitable number of multichannel outputs).
  • the left channel audio signal output 414 and the right channel audio signal output 416 are generated from the focussed audio signal based on the direction a, and further mixed with the received L channel audio signal 402 and R channel audio signal 404 at different levels where different levels of audio focus (e.g. a little focus, medium focus, strong focus or full focus) are desired.
  • the focussed audio signal 412 Xfocus can also be spatialized by panning to direction a.
  • the following mixing has gzoom as a gain between 0 and 1, where 1 indicates fully focused and 0 indicates no focus at all.
  • in order to achieve better quality spatial audio the zoom can be limited (for example to a maximum zoom gain of 0.5). This maintains better audio signal spatial characteristics.
  • the focussed signal panner 409 thus in some embodiments can be configured to generate the left channel audio signal output Lout 414 and the right channel audio signal output Rout 416 as illustrated by the sketch below.
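  • A sketch of this output mixing, reusing the assumed panning gains from the encoder sketch (gzoom and the cross-fade form are illustrative, not the application's equations), could be:

        import numpy as np

        def mix_outputs(x_focus, left_in, right_in, a_deg, g_zoom=0.5):
            # Pan the focused signal to the metadata direction a and
            # cross-fade it with the received backwards compatible stereo
            # signal; g_zoom in [0, 1] sets the focus strength (the text
            # suggests capping it, for example at 0.5).
            theta = np.radians((np.clip(a_deg, -90.0, 90.0) + 90.0) / 2.0)
            g_left, g_right = np.cos(theta), np.sin(theta)
            l_out = g_zoom * g_left * x_focus + (1.0 - g_zoom) * left_in
            r_out = g_zoom * g_right * x_focus + (1.0 - g_zoom) * right_in
            return l_out, r_out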
  • the processing described in the decoder part is implemented in time-frequency tiles and all the parameters may be different in different tiles.
  • the left channel audio signal output Lout 414 and the right channel audio signal output Rout 416 are converted back to the time domain and played/stored.
  • the initial operation is one of retrieve/receive (encoded) audio signals as shown in Figure 5 by step 501.
  • the audio signals can then be low bit rate decoded as shown in Figure 5 by step 503.
  • the channel or reverse-panned audio signals are then audio focussed based on the listener or device direction as shown in Figure 5 by step 507.
  • the focus signal is then optionally panned as shown in Figure 5 by step 509. Then the output audio signals are output as shown in Figure 5 by step 511.
  • the two focused audio signals are hidden or encapsulated inside a spatial audio signal in such a manner that the operation can be reversed and the two focused audio signals can be used to focus audio, where the focus directions follow the direction of a dominant sound source in each time-frequency tile.
  • a first focus direction can be chosen or selected as the dominant sound source direction.
  • the second focus direction is typically opposite of the first direction.
  • a series of microphones as part of the microphone array: a first microphone, mic 1, 990, a second microphone, mic 2, 992, and a third microphone, mic 3, 994, which are configured to generate the audio input 102 which is passed to a direction estimator 901.
  • the direction estimator 901 can be considered to be part of the metadata generation operations as described above.
  • the direction estimator 901 thus can be configured to output the microphone audio signals in the form of the audio input 102 and the direction values 908.
  • the direction estimate is an estimate of the dominant sound source direction.
  • the direction estimation as indicated above is implemented in small time-frequency tiles by framing the microphone signals in typically 20 ms frames, transforming the frames into the frequency domain (using a DFT (Discrete Fourier Transform), a DCT (Discrete Cosine Transform) or filter banks such as QMF (Quadrature Mirror Filter)), splitting the frequency domain signal into frequency bands and analysing the direction in the bands.
  • These types of framed bands of audio are referred to as time-frequency tiles.
  • the tiles are typically narrower in low frequencies and wider in higher frequencies and may follow for example third-octave bands, Bark bands or ERB (Equivalent Rectangular Bandwidth) bands. Other methods such as filter banks exist for creating similar tiles.
  • At least one dominant sound source direction α is estimated for each tile using any suitable method such as described above.
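To make the tiling concrete, here is a minimal sketch of the framing, DFT transformation and band splitting, together with a toy two-microphone phase-delay direction estimate. The frame length, band layout, microphone spacing d and the function names are illustrative assumptions; the patent permits any suitable estimator:

```python
import numpy as np

def stft_tiles(x, frame_len=960, n_bands=24):
    # Frame into ~20 ms frames (960 samples at 48 kHz) and DFT each frame.
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    spec = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    # Band edges that widen with frequency, roughly like Bark/ERB bands.
    edges = np.unique(np.geomspace(1, spec.shape[1] - 1, n_bands + 1).astype(int))
    return spec, edges

def dominant_direction(spec1, spec2, edges, fs, frame_len, d, c=343.0):
    # Toy per-tile direction estimate from the phase difference between two
    # microphones spaced d metres apart (a stand-in for estimator 901).
    f = np.fft.rfftfreq(frame_len, 1.0 / fs)
    alphas = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        phase = np.angle(np.sum(spec1[:, lo:hi] * np.conj(spec2[:, lo:hi]), axis=1))
        f_mid = 0.5 * (f[lo] + f[hi - 1])
        tau = phase / (2.0 * np.pi * f_mid)          # band-average time delay
        alphas.append(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))
    return np.stack(alphas, axis=1)                  # one angle per (frame, band) tile
```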
  • the apparatus is a phone 1300 with 3 microphones.
  • the focus audio signal can be based on a selection of a first microphone pair 1611 (microphone 1 1307 and microphone 3 1311) aligned with the sound object direction for the low frequency sound object and a second microphone pair 1613 (microphone 1 1307 and microphone 2 1309) aligned with the sound object direction for the high frequency sound object.
  • the encoder part comprises a microphone selector/focus processor 903 which is configured to obtain the audio input 102 and direction 908 values and from these audio signals generate a direction dependent focus signal 904 x_foc and an antifocus signal 906 x_anti.
  • a simple method for generating the focus 904 and anti-focus 906 audio signals is to select the nearest microphone(s) relative to the determined sound source direction to supply the focus audio signal 904 x_foc and select the furthest microphone(s) relative to the determined sound source direction to supply the anti-focus audio signal 906 x_anti.
  • a different microphone pair is selected for low frequencies compared to high frequencies. This microphone selection means that one of the two mics has the dominant sound source amplified with respect to sound sources in other directions.
  • These near and far microphones are the two microphones of the pair that, of all the possible microphone pairs in the device, is most closely aligned with the determined sound object direction α, i.e. the pair for which a line drawn through the two microphones points closest to the direction α.
  • the near mic is the mic of the pair that is closer to the dominant sound source direction and the far mic is farther away. Thus if the sound source is behind the device, the near mic too is typically behind the device.
  • the near mic is used as the focus signal x_foc and the far mic is used as the antifocus signal x_anti that points in the opposite direction to the focus signal.
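A minimal sketch of this selection, assuming simplified 2-D microphone coordinates in the device plane; the geometry handling and the helper name pick_near_far are illustrative assumptions:

```python
import numpy as np

def pick_near_far(mic_positions, alpha):
    # Pick the microphone pair whose axis is most closely aligned with the
    # estimated source direction alpha, then label its members near/far.
    # mic_positions: dict of name -> 2-D coordinates in the device plane.
    target = np.array([np.cos(alpha), np.sin(alpha)])
    names = list(mic_positions)
    best, best_score = None, -1.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            axis = mic_positions[b] - mic_positions[a]
            axis = axis / np.linalg.norm(axis)
            score = abs(axis @ target)          # |cos| of axis/source angle
            if score > best_score:
                best, best_score = (a, b), score
    a, b = best
    # The near mic is the pair member displaced towards the source.
    if (mic_positions[a] - mic_positions[b]) @ target > 0:
        return a, b                              # (near, far)
    return b, a
```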
  • FIG. 13 shows an example apparatus, a phone with 3 microphones 1300.
  • the phone 1300 has a defined front direction 1303 and a first front microphone 1307 (a microphone located on the front face of the apparatus), a second front microphone 1311 (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone 1309 (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone).
  • for a sound object 1301 which has a direction α 1305 relative to the front axis 1303: when the direction is less than a defined angle (the angle defined by the physical dimensions of the apparatus and the relative microphone pair virtual angles), the front microphone 1307 is the ‘near microphone’ and the back microphone 1309 is the ‘far microphone’ with reference to the supplying of the focus audio signal (the front microphone audio signal) and the anti-focus audio signal (the back microphone audio signal).
  • when the direction is more than the defined angle, such as shown in the example in Figure 14 where the sound object 1401 has an object direction 1405 greater than the defined angle, the front microphone, microphone 1, 1307 is the ‘near microphone’ and the other front microphone, microphone 3, 1311 is the ‘far microphone’, as the angle formed by the pair of microphones 1 1307 and 3 1311 is closer to the determined sound object direction than the angle formed by the pair of microphones 1 1307 and 2 1309.
  • the focus audio signal is the microphone 1 audio signal and the anti-focus audio signal is the microphone 3 audio signal.
  • in a further example, where the sound object is located behind the device, the front microphone, microphone 1 1307, can be the ‘far microphone’ and the back microphone, microphone 2 1309, the ‘near microphone’, as this microphone pair is most closely aligned with the sound object direction but the back microphone, microphone 2 1309, is closer to the object.
  • the microphone selector/focus processor 903 is configured to generate the focus and antifocus audio signals by applying a beamforming operation (for example MVDR, Minimum Variance Distortionless Response) on the microphone audio signals.
  • beamforming can be applied using any subset or all of the device microphones, or spatial filtering or microphone selection or any combination thereof can be used.
  • other audio focus methods can be used such as spatial filtering.
  • FIG 17 shows an example apparatus, a phone with 3 microphones 1700.
  • the phone 1700 has a defined front direction 1703 and a first front microphone (a microphone located on the front face of the apparatus), a second front microphone (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone).
  • the focus audio signal can be a beamformed or spatially processed audio signal with a focus 1711 towards the object direction 1705 using any subset or all of the microphones, and the anti-focus audio signal can be a beamformed or spatially processed audio signal with a focus 1713 towards the object direction plus 180 degrees using any subset or all of the microphones.
  • the microphone selector/focus processor 903 is configured to create two signals, the near mic/focus signal x_foc 904 and the far mic/antifocus signal x_anti 906, where x_foc is focused towards the determined object direction and x_anti is focused in the opposite direction to the determined object direction. They can be generated using any combination of microphone selection, beamforming, spatial filtering or other audio focus methods.
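For illustration, a minimal frequency-domain delay-and-sum beam (one of many possible beamformers; the plane-wave steering convention, function name and geometry are assumptions of this sketch) could produce the two signals:

```python
import numpy as np

def delay_and_sum(specs, mic_pos, look_dir, f, c=343.0):
    # Steer a beam towards look_dir (radians) by compensating the plane-wave
    # delay at each microphone; specs: per-mic rfft frames, mic_pos: 2-D
    # positions in metres, f: rfft bin frequencies in Hz.
    u = np.array([np.cos(look_dir), np.sin(look_dir)])
    out = 0.0
    for X, p in zip(specs, mic_pos):
        tau = (p @ u) / c                        # arrival-time advance at this mic
        out = out + X * np.exp(-2j * np.pi * f * tau)
    return out / len(specs)

# x_foc beamed towards alpha and x_anti towards the opposite direction:
# X_foc  = delay_and_sum(specs, mic_pos, alpha, f)
# X_anti = delay_and_sum(specs, mic_pos, alpha + np.pi, f)
```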
  • a panner 905 can furthermore be configured to obtain the focus audio signal 904, the anti-focus audio signal 906 and the direction values 908.
  • the near mic/focus signal x_foc 904 and the far mic/antifocus signal x_anti 906 are modified by an invertible panning process that makes them backwards compatible spatial audio signals, while the panning process can be removed when necessary.
  • the panned left channel audio signal 924 and the panned right channel audio signal 926 can then be output.
  • the near mic/focus signal x_foc 904 and the far mic/antifocus signal x_anti 906 are converted into a spatial audio (stereo) signal, the panned left channel audio signal 924 and the panned right channel audio signal 926, which is not so clearly focused to any direction and where sound sources are approximately in the correct places.
  • the spatial audio signals are generated such that at least the dominant sound source is in the correct perceived direction. This can be done with the help of the direction α 908, which is the direction of a dominant sound source. As described above α can be estimated for each time-frequency tile.
  • the x_foc audio signal is panned to the dominant sound source direction and the x_anti audio signal is treated as an ambient signal and added and subtracted to the channels of the backwards compatible spatial audio signal, for example (with panning gains g_L(α) and g_R(α)):
L = g_L(α)·x_foc + x_anti
R = g_R(α)·x_foc − x_anti
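As a per-tile sketch of this invertible panning, reusing the illustrative pan_gains helper (an assumed pan law) from the earlier zoom-mixing sketch:

```python
def pan_to_stereo(x_foc, x_anti, alpha):
    # Invertible per-tile panning: pan x_foc to the dominant source direction
    # alpha and add/subtract x_anti as an ambient component.
    g_l, g_r = pan_gains(alpha)     # pan law from the earlier sketch (assumed)
    L = g_l * x_foc + x_anti
    R = g_r * x_foc - x_anti
    return L, R
```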
  • the x_anti signal could alternatively be decorrelated using known invertible decorrelators.
  • the focus signals can be binauralized using Inter-aural Level and Time Differences (ILDs and ITDs).
  • if the ILDs and ITDs are applied in tiles by modifying the level and phase differences to match the desired differences of human hearing (which depend on the direction α), then the process is also reversible and can be used in this invention.
  • the x_foc signal would be panned to the direction α using ILDs and ITDs whereas x_anti would be used as a background ambient signal and simply summed and subtracted from the panned left channel audio signal 924 and the panned right channel audio signal 926 as was implemented in the stereo case described above.
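A per-tile sketch of such a reversible binaural panning; the symmetric ILD split, the phase-based ITD approximation and the lookup of ild_db/itd_s from the direction α are all assumptions of this sketch:

```python
import numpy as np

def binaural_pan_tile(X_foc, X_anti, ild_db, itd_s, f_centre):
    # Impose a level difference (ILD) and a phase shift approximating a time
    # difference (ITD) on X_foc, then add/subtract X_anti as ambience.
    # ild_db and itd_s would be looked up from the direction alpha.
    g = 10.0 ** (ild_db / 40.0)                       # half the ILD per channel
    rot = np.exp(-1j * np.pi * f_centre * itd_s)      # half the ITD per channel
    L = (g * np.conj(rot)) * X_foc + X_anti
    R = (rot / g) * X_foc - X_anti
    return L, R
```

Because the gains and phase terms are known from the transmitted direction value, this mapping can be inverted in the decoder in the same way as the stereo panning.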
  • the panner 905 can then output the direction values 908, the panned left channel audio signal L 924 and the panned right channel audio signal R 926.
  • the encoder further comprises a suitable low bit-rate encoder 907. This is optionally configured to encode the metadata and the panned left and right channel audio signals.
  • the encoder comprises a suitable storage/transmission part 909 configured to store and/or transmit the metadata and audio signals (which as shown herein can be encoded).
  • the output of the encoder is thus configured to produce (encoded) L and R signals that can be used in a backwards compatible way. They form a stereo spatial audio signal that can be used like any other stereo signal.
  • the operations comprise obtaining/capturing audio signals from microphones as shown in Figure 10 by step 1001.
  • the following operation is one of microphone selecting/focussing (based on the dominant sound source direction) as shown in Figure 10 by step 1005.
  • there can furthermore be an optional operation of low bit rate encoding as shown in Figure 10 by step 1009.
  • With respect to Figure 11 is shown the example decoder part shown in Figure 1 in further detail according to some embodiments, where microphone selection is determined based on the dominant sound source direction.
  • the decoder part for example can in some embodiments comprise a retriever/receiver 1101 configured to retrieve or receive the ‘stereo’ audio signals and the metadata including the direction values from the storage or from the network.
  • the retriever/receiver is thus configured to be the reciprocal of the storage/transmission 909 as shown in Figure 9.
  • the decoder part comprises a decoder 1103, which is optional, which is configured to apply a suitable inverse operation to the encoder 907.
  • the direction 1100 values and the panned left channel audio signal L 1102 and the panned right channel audio signal R 1104 can then be passed to the reverse panner 1105 (or directly to the audio focusser 1107).
  • the decoder part comprises an optional reverse panner 1105.
  • the reverse panner 1105 is configured to receive the direction values 1100 and the panned left channel audio signal L 1102 and the panned right channel audio signal R 1104 and regenerate the focus audio signal (which can be the near microphone audio signal) 1106, the antifocus audio signal (which can be the far microphone audio signal) 1108 and the direction 1100 values and pass these to the audio focusser 1107.
  • the reverse panner 1105 is configured to reverse the panning process (applied in the encoder part) and thus ‘access’ the original focused signals:
  • the reverse process can in some embodiments be implemented as the following (for example, where panning gains g_L(α) and g_R(α) were applied in the encoder):
x_foc = (L + R) / (g_L(α) + g_R(α))
x_anti = (g_R(α)·L − g_L(α)·R) / (g_L(α) + g_R(α))
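A sketch of this reversal as the exact inverse of the earlier pan_to_stereo sketch, under the same assumed pan law:

```python
def unpan_from_stereo(L, R, alpha):
    # Exact inverse of pan_to_stereo for the same direction value alpha.
    g_l, g_r = pan_gains(alpha)     # pan law from the earlier sketch (assumed)
    x_foc = (L + R) / (g_l + g_r)
    x_anti = (g_r * L - g_l * R) / (g_l + g_r)
    return x_foc, x_anti
```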
  • the decoder part further can comprise in some embodiments an audio focusser 1107 configured to obtain the regenerated focus audio signal (which can be the near microphone audio signal) 1106, the antifocus audio signal (which can be the far microphone audio signal) 1108 and the direction 1100 values. Additionally the audio focusser 1107 is configured to receive the listener or device desired focus direction β 1110. The audio focusser 1107 is thus configured (with the reverse panner 1105) to focus the L and R spatial audio signals towards a direction β by reversing the panning process (and generating the focus and antifocus audio signals) and then generating the focussed audio signal 1112 and the direction value 1100.
  • Audio focus can be achieved using the x_foc and x_anti signals.
  • the x_foc signal emphasizes the dominant sound source direction α and x_anti emphasizes the opposite direction. If a listener or device wants to focus towards the direction of the dominant sound source (i.e. β = α) then the x_foc signal is amplified with respect to the x_anti signal in the output. If the listener or device wants to focus towards the opposite direction then the x_anti signal is amplified with respect to the x_foc signal in the output.
  • the audio focusser 1107 is configured to create an audio focused signal towards the user input direction β. This can be implemented by summing the x_foc and x_anti signals with suitable gains. The gains depend on the difference of the directions α and β. An example function is given with respect to Figure 19.
  • the audio focusser as such is configured to use mostly x_foc when the listener desired direction is the sound object direction and to use mostly x_anti when the listener desired direction is opposite to the sound object direction. For other directions, x_foc and x_anti are mixed more evenly, for example as in the following sketch.
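A minimal per-tile sketch of this mixing, assuming a cosine-shaped crossfade (Figure 19 gives the patent's example curve; the exact shape used here is an assumption):

```python
import numpy as np

def focus_mix(x_foc, x_anti, alpha, beta):
    # Blend x_foc and x_anti according to how close the desired direction
    # beta is to the dominant source direction alpha.
    d = np.angle(np.exp(1j * (beta - alpha)))   # wrapped difference in (-pi, pi]
    g_foc = 0.5 * (1.0 + np.cos(d))             # 1 when beta == alpha
    g_anti = 1.0 - g_foc                        # 1 when beta is opposite alpha
    return g_foc * x_foc + g_anti * x_anti
```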
  • the focussed audio signal 1112 x_focus can be used as such if a mono focused signal is sufficient.
  • the decoder optionally comprises a focussed signal panner 1109.
  • the focussed signal panner 1109 can be configured to obtain or receive the focussed audio signal 1112 and the direction 1100 and be configured to generate left and right channel audio signal outputs (or in some embodiments any suitable number of multichannel outputs).
  • the left channel audio signal output 1114 and the right channel audio signal output 1116 are generated from the focussed audio signal based on the direction α and further mixed with the received L channel audio signal 1102 and R channel audio signal 1104 at different levels where different levels of audio focus (e.g. a little focus, medium focus, strong focus or full focus) are desired.
  • the focussed audio signal 1112 x_focus can also be spatialized by panning to direction α.
  • the following equation has g_zoom as a gain between 0 and 1, where 1 indicates fully focused and 0 indicates no focus at all.
  • in order to achieve better quality spatial audio the zoom can be limited (for example to a maximum zoom gain of 0.5). This would better maintain the spatial characteristics of the audio signal.
  • the focussed signal panner 1109 thus in some embodiments can be configured to apply the following to generate the left channel audio signal output Lout 1114 and the right channel audio signal output Rout 1116 (for example, as in the first decoder described above):
Lout = (1 − g_zoom)·L + g_zoom·g_L(α)·x_focus
Rout = (1 − g_zoom)·R + g_zoom·g_R(α)·x_focus
  • the processing described in the decoder part is implemented in time-frequency tiles and all the parameters may be different in different tiles.
  • the left channel audio signal output Lout 1114 and the right channel audio signal output Rout 1116 are converted back to the time domain and played/stored.
  • the initial operation is one of retrieve/receive (encoded) audio signals as shown in Figure 12 by step 1201.
  • the audio signals can then be low bit rate decoded as shown in Figure 12 by step 1203.
  • the channel or reverse-panned audio signals are then audio focussed based on the listener or device direction as shown in Figure 12 by step 1207.
  • the focus signal is then optionally panned as shown in Figure 12 by step 1209.
  • these focused signals may be hidden or encapsulated inside other transported audio signals when there are more than two transported signals.
  • where the transported signal is, for example, a 5.1 channel format with left, right, center, subwoofer, rear left, and rear right channel signals, and there are four focused signals, two of the focused signals could be hidden or encapsulated inside the left and right signals and the other two focussed signals could be hidden or encapsulated inside the rear left and rear right signals.
  • the device may be any suitable electronics device or apparatus.
  • the device 2000 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device may for example be configured to implement the encoder/analyser part and/or the decoder part as shown in Figure 1 or any functional block as described above.
  • the device 2000 comprises at least one processor or central processing unit 2007.
  • the processor 2007 can be configured to execute various program codes such as the methods such as described herein.
  • the device 2000 comprises at least one memory 2011.
  • the at least one processor 2007 is coupled to the memory 2011.
  • the memory 2011 can be any suitable storage means.
  • the memory 2011 comprises a program code section for storing program codes implementable upon the processor 2007.
  • the memory 2011 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein.
  • the implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 2007 whenever needed via the memory-processor coupling.
  • the device 2000 comprises a user interface 2005.
  • the user interface 2005 can be coupled in some embodiments to the processor 2007.
  • the processor 2007 can control the operation of the user interface 2005 and receive inputs from the user interface 2005.
  • the user interface 2005 can enable a user to input commands to the device 2000, for example via a keypad.
  • the user interface 2005 can enable the user to obtain information from the device 2000.
  • the user interface 2005 may comprise a display configured to display information from the device 2000 to the user.
  • the user interface 2005 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 2000 and further displaying information to the user of the device 2000.
  • the user interface 2005 may be the user interface for communicating.
  • the device 2000 comprises an input/output port 2009.
  • the input/output port 2009 in some embodiments comprises a transceiver.
  • the transceiver in such embodiments can be coupled to the processor 2007 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof.
  • the transceiver input/output port 2009 may be configured to receive the signals.
  • the device 2000 may be employed as at least part of the synthesis device.
  • the input/output port 2009 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar and loudspeakers.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.

Abstract

An apparatus, for generating spatial audio signals, the apparatus comprising means configured to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.

Description

TRANSPORTING AUDIO SIGNALS INSIDE SPATIAL AUDIO SIGNAL
Field
The present application relates to apparatus and methods for transporting audio signals inside a spatial audio signal and specifically but not exclusively apparatus and methods for transporting a focused pair of audio signals within a spatial audio signal using direction metadata.
Background
Parametric spatial audio systems can be configured to store and transmit audio signal with associated metadata. The metadata describes spatial (and non- spatial) characteristics of the audio signal. The audio signals and metadata together can be used to render a spatial audio signal, typically for many different playback devices e.g. headphones, stereo speakers, 5.1 speakers, homepods.
The metadata typically comprises direction parameters (azimuth, elevation) and ratio parameters (direct-to-ambience ratio i.e. D/A ratio). Direction parameters describe sound source directions typically in time-frequency tiles. Ratio parameters describe the diffuseness of the audio signal i.e. the ratio of direct energy to diffuse energy also in time-frequency tiles. These parameters are psychoacoustically the most important in creating a spatially correct sounding audio to a human listener.
There may be one, two or more audio signals transmitted. A single audio signal with metadata is enough for many use cases, however, the nature of diffuseness and other fine details are only preserved if a stereo signal is transmitted. The difference between the left and right signals contains information about the details of the acoustic space. The more coarse spatial characteristics that are already described in the metadata (direction, D/A ratio) do not necessarily need to be correct in the transmitted audio signals, because the metadata is used to render these characteristics correctly in the decoder regardless of what they are in the audio signals. For backwards compatibility, all spatial characteristics should be correct also for the transmitted audio signals because legacy decoders ignore the metadata and only play the audio signals. Furthermore audio focus is an audio processing method where sound sources in a direction are amplified with respect to sound sources in other directions. Typically, known methods such as beamforming or spatial filtering are employed. Beamforming and spatial filtering approaches both require knowledge about sound directions. These can typically be only estimated if the original microphone signals from known locations are present.
Summary
There is provided according to a first aspect an apparatus, for generating spatial audio signals, the apparatus comprising means configured to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
The first focussed audio signal direction may be fixed relative to a direction of the apparatus.
The first focussed audio signal direction may be the at least one metadata directional parameter value.
The means configured to generate the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction may be configured to at least one of: select one of the at least two audio signals to generate the first focussed audio signal; and mix at least two of the at least two audio signals to generate the first focussed audio signal.
The means configured to generate the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction may be configured to at least one of: select one of the at least two audio signals to generate the second focussed audio signal; and mix at least two of the at least two audio signals to generate the second focussed audio signal.
The means configured to generate the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may be configured to generate the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
The means configured to generate the second output audio signal may be configured to generate the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
The means may be further configured to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
The means configured to obtain at least one metadata directional parameter associated with the at least two audio signals may be configured to analyse the at least two audio signals to generate the at least one metadata directional parameter.
The means configured to obtain at least one metadata directional parameter associated with the at least two audio signals may be configured to receive the at least one metadata directional parameter, and the means configured to obtain at least two audio signals may be configured to receive the at least two audio signals.
According to a second aspect there is provided an apparatus, for processing spatial audio signals, the apparatus comprising means configured to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
The first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
The means may be configured to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the means configured to generate the focus audio signal may be configured to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
The means configured to generate at least one output audio signal based on the focus audio signal may be configured to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal.
According to a third aspect there is provided a method for an apparatus for generating spatial audio signals, the method comprising: obtaining at least two audio signals; obtaining at least one metadata directional parameter associated with the at least two audio signals; generating a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generating a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generating a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
The first focussed audio signal direction may be fixed relative to a direction of the apparatus.
The first focussed audio signal direction may be the at least one metadata directional parameter value. Generating the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction may comprise at least one of: selecting one of the at least two audio signals to generate the first focussed audio signal; and mixing at least two of the at least two audio signals to generate the first focussed audio signal.
Generating the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction may comprise at least one of: selecting one of the at least two audio signals to generate the second focussed audio signal; and mixing at least two of the at least two audio signals to generate the second focussed audio signal.
Generating the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may comprise generating the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
Generating the second output audio signal may comprise generating the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
The method may further comprise generating a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
Obtaining at least one metadata directional parameter associated with the at least two audio signals may comprise analysing the at least two audio signals to generate the at least one metadata directional parameter.
Obtaining at least one metadata directional parameter associated with the at least two audio signals may comprise receiving the at least one metadata directional parameter, and obtaining at least two audio signals may comprise receiving the at least two audio signals. According to a fourth aspect there is provided a method for an apparatus for processing spatial audio signals, the method comprising: obtaining a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtaining a desired focus directional parameter; generating a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generating at least one output audio signal based on the focus audio signal.
The first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
The method may further comprise, prior to generating the focus audio signal: de-panning the first audio signal to generate the first focussed audio signal; and de-panning the second audio signal to generate the second focussed audio signal, wherein generating the focus audio signal may comprise generating the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
Generating at least one output audio signal based on the focus audio signal may comprise: generating a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generating a second output audio signal based on a combination of the focus audio signal and the second audio signal.
According to a fifth aspect there is provided an apparatus for generating spatial audio signals, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
The first focussed audio signal direction may be fixed relative to a direction of the apparatus.
The first focussed audio signal direction may be the at least one metadata directional parameter value.
The apparatus caused to generate the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction may be caused to at least one of: select one of the at least two audio signals to generate the first focussed audio signal; and mix at least two of the at least two audio signals to generate the first focussed audio signal.
The apparatus caused to generate the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction may be caused to at least one of: select one of the at least two audio signals to generate the second focussed audio signal; and mix at least two of the at least two audio signals to generate the second focussed audio signal.
The apparatus caused to generate the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter may be caused to generate the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
The apparatus caused to generate the second output audio signal may be caused to generate the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
The apparatus may be further caused to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
The apparatus caused to obtain at least one metadata directional parameter associated with the at least two audio signals may be caused to analyse the at least two audio signals to generate the at least one metadata directional parameter.
The apparatus caused to obtain at least one metadata directional parameter associated with the at least two audio signals may be caused to receive the at least one metadata directional parameter, and the apparatus caused to obtain at least two audio signals may be caused to receive the at least two audio signals.
According to a sixth aspect there is provided an apparatus, for processing spatial audio signals, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
The first focussed audio signal direction may be one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
The apparatus may be caused to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the apparatus caused to generate the focus audio signal may be caused to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
The apparatus caused to generate at least one output audio signal based on the focus audio signal may be caused to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal. According to a seventh aspect there is provided an apparatus for generating spatial audio signals, the apparatus comprising: obtaining circuitry configured to obtain at least two audio signals; obtaining circuitry configured to obtain at least one metadata directional parameter associated with the at least two audio signals; generating circuitry configured to generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generating circuitry configured to generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generating circuitry configured to generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
According to an eighth aspect there is provided an apparatus for processing spatial audio signals, the apparatus comprising: obtaining circuitry configured to obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtaining circuitry configured to obtain a desired focus directional parameter; generating circuitry configured to generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generating circuitry configured to generate at least one output audio signal based on the focus audio signal.
According to a ninth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
According to a tenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising program instructions] for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
According to an eleventh aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
According to a twelfth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
According to a thirteenth aspect there is provided an apparatus for generating spatial audio signals, the apparatus comprising: means for obtaining at least two audio signals; means for obtaining at least one metadata directional parameter associated with the at least two audio signals; means for generating a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; means for generating a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and means for generating a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
According to a fourteenth aspect there is provided an apparatus for processing spatial audio signals, the apparatus comprising: means for obtaining a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal and the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; means for obtaining a desired focus directional parameter; means for generating a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and means for generating at least one output audio signal based on the focus audio signal.
According to a fifteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for generating spatial audio signals to perform at least the following: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to a first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
According to a sixteenth aspect there is provided a computer readable medium comprising program instructions for causing an apparatus for processing spatial audio signals to perform at least the following: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus suitable for implementing some embodiments;
Figure 2 shows schematically an example encoder as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 3 shows a flow diagram of the operation of the example encoder shown in Figure 2 according to some embodiments;
Figure 4 shows schematically an example decoder as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 5 shows a flow diagram of the operation of the example decoder shown in Figure 4 according to some embodiments;
Figure 6 shows an example microphone selection for a sound object;
Figure 7 shows an example of focus and anti-focus areas as employed in some embodiments;
Figure 8 shows an example gain function for modifying focus and anti-focus signals according to some embodiments;
Figure 9 shows schematically a further example encoder as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 10 shows a flow diagram of the operation of the example encoder shown in Figure 9 according to some embodiments;
Figure 11 shows schematically a further example decoder as shown in the system of apparatus as shown in Figure 1 according to some embodiments;
Figure 12 shows a flow diagram of the operation of the example decoder shown in Figure 11 according to some embodiments;
Figures 13 to 16 show further example microphone selections for a sound object;
Figures 17 and 18 show further examples of focus and anti-focus areas as employed in some embodiments;
Figure 19 shows a further example gain function for modifying focus and anti-focus signals according to some embodiments; and
Figure 20 shows an example device suitable for implementing the apparatus shown in previous figures.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for transporting focused audio signals inside a suitable backwards compatible spatial audio signal with the help of direction metadata.
As described above parametric spatial audio systems can be configured to store and transmit audio signals together with metadata. Additionally audio focus or audio focussing is an audio processing method where sound sources in a direction (or within a defined range) are amplified with respect to sound sources in other directions. Although an audio focus or focussing approach is discussed herein, it would be appreciated that an audio de-focus or defocussing approach, where sound sources in a direction (or within a defined range) are diminished or reduced with respect to sound sources in other directions, could be exploited in a similar manner to that described in the following.
Typical uses for audio focus are: telecommunications where a user voice is amplified compared to background sounds; speech recognition, where voice is amplified to minimize word error rate; sound source amplification in the direction of a camera that records video with audio; an off-camera focus where a listener is watching a video and wants to focus to some other direction than the camera direction. For example the person who recorded the video wants the audio to focus to their child whereas the person watching the video might want to focus the audio to the listener's own child away from the camera axis; a focus-switch where a listener may want to focus to different audio objects at different times while watching a video; and a teleconference or live meeting application where different listeners of a meeting may want to focus to different speakers.
Currently audio representations, where a listener is able to freely choose where to focus, have required either all of the captured microphone audio signals to be passed to the player/decoder or a large number of pre-focused audio signals to be passed to the player/decoder.
In both situations, in order to be able to play a meaningful spatial audio signal, the transmission channel, the decoder, and the player have to be configured to understand these representations instead of just playing a (typically randomly selected) focused signal towards a direction.
The following concept which is described with respect to the following embodiments is one where focused signals are hidden or encapsulated inside a (backwards) compatible spatial audio signal that can be received and processed on a conventional player but can be focused using a player such as described herein.
In some embodiments a fixed direction focused audio is encapsulated or embedded in a stereo signal. In such embodiments there is described apparatus and methods for creating a suitable backwards compatible spatial audio signal where a dominant sound source is in a ‘correct’ direction and where it is possible for a listener to focus audio towards any direction with help of directional metadata.
In such embodiments the dominant sound source direction is assumed to be fixed. The apparatus and methods are configured to generate or create audio signals (which can be designated A and B), where a first audio signal, A, emphasizes a fixed direction and a second audio signal, B, de-emphasizes the fixed direction. In such embodiments the first audio signal, A, and the second audio signal, B, can be losslessly mixed with help of direction in metadata to create a spatial audio signal where perceived dominant audio direction is correct and the same as the metadata direction. This ‘mixed’ spatial audio signal can be considered to be the ‘backwards’ compatible audio signal. Furthermore since the mixing is lossless it can be reversed (with help of metadata direction) and the first audio signal, A, and the second audio signal, B, are used to create a focused audio signal based on user desired direction.
In some other embodiments the focused audio is ‘disguised’ in a stereo signal. In such embodiments the apparatus and methods can be configured to create a backwards compatible spatial audio signal where the dominant sound source is in the ‘correct’ direction and where it is possible for a user to focus audio towards any direction with help of directional metadata. In such embodiments the dominant sound source direction is estimated for each (time-frequency) tile and transmitted as metadata. The first audio signal, A, and the second audio signal, B, can then be generated or created where the first audio signal, A, emphasizes metadata direction in each tile and the second audio signal, B, de-emphasizes metadata direction in each tile. The first audio signal, A, and the second audio signal, B, can furthermore be losslessly mixed with help of estimated dominant direction in metadata to create a spatial audio signal where perceived dominant audio direction is correct and the same as the metadata direction. This mixed audio signal forms the backwards compatible audio signal. Furthermore as discussed above as the mixing is lossless the mix can be reversed (with help of metadata direction) and the first audio signal, A, and the second audio signal, B, can be used to create a focused audio signal based on user desired direction.
Embodiments will be described with respect to an example capture (or encoder/analyser) and playback (or decoder/synthesizer) apparatus or system 100 as shown in Figure 1. In the following example the audio signal input is one from a microphone array, however it would be appreciated that the audio input can be any suitable audio input format and the description hereafter details where differences in the processing occur when a differing input format is employed.
The system 100 is shown with a capture (encoder/analyser) part and a playback (decoder/synthesizer) part.
The capture part in some embodiments comprises a microphone array audio signals input 102. The input audio signals can be from any suitable source, for example: two or more microphones mounted on a mobile phone, other microphone arrays, e.g., B-format microphone or Eigenmike. In some embodiments, as mentioned above, the input can be any suitable audio signal input such as Ambisonic signals, e.g., first-order Ambisonics (FOA), higher-order Ambisonics (HOA) or Loudspeaker surround mix and/or objects.
The microphone array audio signals input 102 may be provided to a microphone array front end 103. The microphone array front end in some embodiments is configured to implement an analysis processor functionality configured to generate or determine suitable (spatial) metadata associated with the audio signals and implement a suitable transport signal generator functionality to generate transport audio signals.
The analysis processor functionality is thus configured to perform spatial analysis on the input audio signals yielding suitable spatial metadata 106 in frequency bands. For all of the aforementioned input types, there exists known methods to generate suitable spatial metadata, for example directions and direct-to-total energy ratios (or similar parameters such as diffuseness, i.e., ambient-to-total ratios) in frequency bands. These methods are not detailed herein, however, some examples may comprise the performing of a suitable time-frequency transform for the input signals, and then in frequency bands when the input is a mobile phone microphone array, estimating delay-values between microphone pairs that maximize the inter-microphone correlation, and formulating the corresponding direction value to that delay (as described in GB Patent Application Number 1619573.7 and PCT Patent Application Number PCT/FI2017/050778), and formulating a ratio parameter based on the correlation value.
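As a purely illustrative and non-limiting sketch of such a delay-based analysis, the following Python function estimates a direction and a ratio-like parameter for one frame of band-filtered signals from a microphone pair; the function name, the sign convention of the returned azimuth and the use of the normalised correlation peak as a ratio proxy are assumptions for illustration rather than the claimed method:

    import numpy as np

    def estimate_direction(x_a, x_b, mic_distance_m, fs_hz, c=343.0):
        # Search only the physically possible inter-microphone delays.
        max_lag = max(1, int(round(mic_distance_m / c * fs_hz)))
        full = np.correlate(x_a, x_b, mode="full")
        mid = len(x_b) - 1                         # index of zero lag
        window = full[mid - max_lag: mid + max_lag + 1]
        lag = int(np.argmax(window)) - max_lag     # delay (samples) maximising correlation
        # Map the delay to an arrival angle relative to the microphone axis.
        sin_angle = np.clip(lag * c / (fs_hz * mic_distance_m), -1.0, 1.0)
        azimuth_deg = float(np.degrees(np.arcsin(sin_angle)))
        # Normalised peak correlation as a crude direct-to-total ratio proxy.
        denom = np.linalg.norm(x_a) * np.linalg.norm(x_b) + 1e-12
        ratio = float(np.clip(window.max() / denom, 0.0, 1.0))
        return azimuth_deg, ratio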
The metadata can be of various forms and in some embodiments comprise spatial metadata and other metadata. A typical parameterization for the spatial metadata is one direction parameter in each frequency band, characterized as an azimuth value φ(k, n) and an elevation value θ(k, n), and an associated direct-to-total energy ratio r(k, n) in each frequency band, where k is the frequency band index and n is the temporal frame index.
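By way of a non-limiting illustration, such a parameterization could be carried in a simple container of the following form; the field and type names are illustrative assumptions rather than any standardized metadata layout:

    from dataclasses import dataclass

    @dataclass
    class SpatialMetadataTile:
        azimuth_deg: float      # azimuth phi(k, n)
        elevation_deg: float    # elevation theta(k, n)
        direct_to_total: float  # energy ratio r(k, n), in [0, 1]

    # Indexed by temporal frame n and frequency band k.
    metadata = [[SpatialMetadataTile(0.0, 0.0, 1.0) for _ in range(24)]
                for _ in range(50)]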
In some embodiments the parameters generated may differ from frequency band to frequency band. Thus, for example in band X all of the parameters are generated and transmitted, whereas in band Y only one of the parameters is generated and transmitted, and furthermore in band Z no parameters are generated or transmitted. A practical example of this may be that for some frequency bands such as the highest band some of the parameters are not required for perceptual reasons.
As such the output of the analysis processor functionality is (spatial) metadata 106 determined in time-frequency tiles. The (spatial) metadata 106 may involve directions and energy ratios in frequency bands but may also have any of the metadata types listed previously. The (spatial) metadata 106 can vary over time and over frequency.
In some embodiments the analysis functionality is implemented external to the system 100. For example, in some embodiments the spatial metadata associated with the input audio signals may be provided to an encoder 107 as a separate bit-stream. In some embodiments the spatial metadata may be provided as a set of spatial (direction) index values.
The microphone array front end 103, as described above is further configured to implement transport signal generator functionality, in order to generate suitable transport audio signals 104. The transport signal generator functionality is configured to receive the input audio signals, which may for example be the microphone array audio signals 102 and generate the transport audio signals 104. The transport audio signals may be a multi-channel, stereo, binaural or mono audio signal. The generation of transport audio signals 104 can be implemented using any suitable method.
In some embodiments the transport signals 104 are the input audio signals, for example the microphone array audio signals. The number of transport channels can also be any suitable number (rather than one or two channels as discussed in the examples).

In some embodiments the capture part may comprise an encoder 107. The encoder 107 can be configured to receive the transport audio signals 104 and the spatial metadata 106. The encoder 107 may furthermore be configured to generate a bitstream 108 comprising an encoded or compressed form of the metadata information and transport audio signals.
The encoder 107, for example, could be implemented as an IVAS encoder, or any other suitable encoder. The encoder 107, in such embodiments is configured to encode the audio signals and the metadata and form an IVAS bit stream.
This bitstream 108 may then be transmitted/stored as shown by the dashed line.
The system 100 furthermore may comprise a player or decoder 109 part. The player or decoder 109 is configured to receive, retrieve or otherwise obtain the bitstream 108 and from the bitstream generate suitable spatial audio signals 110 to be presented to the listener/listener playback apparatus.
The decoder 109 is therefore configured to receive the bitstream 108 and demultiplex the encoded streams and then decode the audio signals to obtain the transport signals and metadata.
The decoder 109 furthermore can be configured to, from the transport audio signals and the spatial metadata, produce the spatial audio signals output 110 for example a binaural audio signal that can be reproduced over headphones.
With respect to Figure 2, there is shown the encoder side in further detail according to some embodiments. In these embodiments two focused audio signals are hidden or encapsulated inside a ‘backwards’ compatible spatial audio signal in a manner that can be reversed and the two focused audio signals can be used to focus audio.
There are many use cases where transporting focused audio signals inside a ‘backwards’ compatible spatial audio signal is particularly advantageous. For example, video calls, where a receiver can focus on any sound source on the capture/encoder side, and video recordings where it is possible to edit the audio to emphasize an important sound source afterwards.
The two focused audio signals can be generated using any suitable focussing method. In the example shown in Figure 2 the focus directions are fixed. For example in some embodiments a first focus direction can be the ‘camera’ direction or front direction (or mouth reference point direction). The second focus direction can then be defined as the opposite of the first direction (or more generally a direction which is substantially different to the first direction). In the following it would be understood that the term opposite can be generalised as a direction which is substantially different to the first direction.
In some embodiments the audio signals can be defined as Xfixfoc and Xfixanti.
In some embodiments, as shown in Figure 2, there is shown a series of microphones as part of the microphone array: a first microphone, mic 1, 290, a second microphone, mic 2, 292, and a third microphone, mic 3, 294, which are configured to generate the audio input 102 which is passed to a direction estimator 201.
The direction estimator 201 can be considered to be part of the metadata generation operations as described above. The direction estimator 201 thus can be configured to output the microphone audio signals in the form of the audio input 102 and the direction values 208.
The direction estimate is an estimate of the dominant sound source direction. The direction estimation as indicated above is implemented in small time-frequency tiles by framing the microphone signals in typically 20ms frames, transforming the frames into frequency domain (using DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) or filter banks like QMF (Quadrature Mirror Filter)), splitting the frequency domain signal into frequency bands and analysing the direction in the bands. These types of framed bands of audio are referred to as time-frequency tiles. The tiles are typically narrower in low frequencies and wider in higher frequencies and may follow for example third-octave bands or Bark bands or ERB bands (Equivalent Rectangular Bandwidth). Other methods such as filterbanks exist for creating similar tiles.
In some embodiments at least one dominant sound source direction α is estimated for each tile using any suitable method such as described above.
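A non-limiting Python sketch of such a tiling is shown below; the frame length, the window and the logarithmic band edges standing in for Bark/ERB banding are illustrative assumptions:

    import numpy as np

    def to_time_frequency_tiles(x, fs_hz, frame_s=0.02, n_bands=24):
        # Frame the signal (20 ms frames), window and transform each frame.
        hop = int(frame_s * fs_hz)
        window = np.hanning(hop)
        frames = [x[i:i + hop] * window for i in range(0, len(x) - hop + 1, hop)]
        spectra = [np.fft.rfft(f) for f in frames]
        # Approximately logarithmic band edges: narrow bands at low frequencies,
        # wide bands at high frequencies (a stand-in for Bark/ERB banding).
        n_bins = hop // 2 + 1
        edges = np.unique(np.geomspace(1, n_bins, n_bands + 1).astype(int))
        return [[s[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]
                for s in spectra]   # tiles[n][k]: frame n, band k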
In the embodiments described herein processing can be (and typically is) implemented in time-frequency tiles. However, for the sake of clarity the following methods are described with respect to one range of frequencies and one time instant. For example typically there would be 20-50 tiles per time instant (=frame) and the number of time instants depends on the frame length and processed audio length.

In some embodiments the encoder part comprises a microphone selector/focus processor 203 which is configured to obtain the audio input 102 and from these audio signals generate a (fixed direction) focus signal 204 Xfixfoc and an antifocus signal 206 Xfixanti.
In some embodiments a simple method for generating the focus 204 and anti-focus 206 audio signal is to use a front microphone to supply the focus audio signal 204 Xfixfoc and select the back microphone to supply the anti-focus audio signal 206 Xfixanti.
An example of this is shown in Figure 6, which shows an example apparatus, a phone with 2 microphones 600. The phone 600 has a defined front direction 603 and a front microphone 607 (a microphone located on the front face of the apparatus) and a back microphone 609 (a second microphone located on the back or rear face of the apparatus). Where there is a sound object 601 which has a direction α 605 relative to the front axis 603, then when the direction is less than 90 degrees the front microphone is the ‘near microphone’ and the back microphone is the ‘far microphone’ with reference to the supplying of the focus audio signal (the front microphone audio signal) and the anti-focus audio signal (the back microphone audio signal). It would be understood that when the direction is more than 90 degrees the front microphone is the ‘far microphone’ and the back microphone is the ‘near microphone’ with reference to the supplying of the focus audio signal (the back microphone audio signal) and the anti-focus audio signal (the front microphone audio signal).
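A minimal, non-limiting Python sketch of this selection rule, assuming the direction convention of Figure 6, could be:

    def select_fixed_focus_feeds(front_mic, back_mic, alpha_deg):
        # Sources within 90 degrees of the front axis: front mic is 'near'.
        if abs(alpha_deg) < 90.0:
            return front_mic, back_mic   # (Xfixfoc, Xfixanti)
        return back_mic, front_mic       # source behind the device: roles swap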
In some embodiments the microphone selector/focus processor 203 is configured to generate the focus and antifocus audio signals by applying a beamforming operation on the microphone audio signals. For example, MVDR (Minimum Variance Distortionless Response) beamforming can be used, using any subset or all device microphones or using spatial filtering or using microphone selection or any combination thereof. In some embodiments other audio focus methods can be used such as spatial filtering.
An example of this is shown in Figure 7 which shows an example apparatus, a phone with 3 microphones 700. The phone 700 has a defined front direction 603 and a first front microphone 607 (a microphone located on the front face of the apparatus), a second front microphone 707 (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone 609 (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone). Where there is a sound object 601 which has a direction α 605 relative to the front axis 603, then the focus audio signal can be a beamformed or spatially processed audio signal with a focus 701 towards the front direction using any subset or all of the microphones when the direction is less than 90 degrees and the anti-focus audio signal can be a beamformed or spatially processed audio signal with a focus 703 towards the back direction using any subset or all of the microphones.
It would be understood that when the direction is more than 90 degrees then the focus and anti-focus switch front and back directions.
In other words the microphone selector/focus processor 203 is configured to create two signals, Near mic/focus signal Xfixfoc 204 and Far mic/antifocus signal Xfixanti 206 where the Xfixfoc is focused towards an important direction (for example the forwards or camera direction) and Xfixanti is focused towards the opposite direction (or more generally a direction which is substantially different to the first direction). They can be generated using any combination of microphone selection, beamforming, spatial filtering or other audio focus methods.
A panner 205 can furthermore be configured to obtain the focus audio signal 204, the anti-focus audio signal 206 and the direction values 208. Near mic/focus signal Xfixfoc 204 and Far mic/antifocus signal Xfixanti 206 are modified by an invertible panning process that makes the Near mic/focus signal Xfixfoc 204 and Far mic/antifocus signal Xfixanti 206 backwards compatible spatial audio signals while the panning process can be removed when necessary. The panned left channel audio signal 224 and the panned right channel audio signal 226 can then be output.
In some embodiments the Near mic/focus signal Xfixfoc 204 and Far mic/antifocus signal Xfixanti 206 are converted into a spatial audio (stereo) signal, the panned left channel audio signal 224 and the panned right channel audio signal 226, which is not so clearly focused to any direction and where sound sources are approximately in correct places. The spatial audio signals are generated such that at least the dominant sound source is in the correct perceived direction. This can be done with help of the direction α 208, which is the direction of a dominant sound source. As described above α can be estimated for each time-frequency tile. In some embodiments the Xfixfoc audio signal is panned to the dominant sound source direction and the Xfixanti audio signal is treated as an ambient signal and added to and subtracted from the channels of the backwards compatible spatial audio signal:
L = g_L · Xfixfoc + Xfixanti
R = g_R · Xfixfoc − Xfixanti
Where a common sine panning law is employed:
g_L = sin((α + 90°)/2), g_R = cos((α + 90°)/2), so that g_L² + g_R² = 1
The Xfixanti signal could alternatively be decorrelated using known invertible decorrelators.
In some embodiments diffuseness is estimated in any suitable manner and can be expressed as a D/A ratio (Direct-to-Ambient ratio). If the diffuseness is low (D/A ratio = 1), then Xfixfoc is panned as in the equation above. If the diffuseness is high (D/A ratio = 0), then the Xfixfoc typically contains a lot of other sound sources as well as the dominant sound source or there is no clear dominant sound source. In this case the focus signal can be panned to all directions equally. This can be achieved with the following:
L = (r · g_L + (1 − r)/√2) · Xfixfoc + Xfixanti
R = (r · g_R + (1 − r)/√2) · Xfixfoc − Xfixanti

where r is the D/A ratio, so that r = 0 pans Xfixfoc to both channels equally.
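By way of a non-limiting illustration, the panning described above could be sketched in Python as follows, assuming the reconstructed sine panning law and the D/A-ratio interpolation given above:

    import numpy as np

    def pan_invertibly(x_foc, x_anti, alpha_deg, da_ratio=1.0):
        theta = np.radians((alpha_deg + 90.0) / 2.0)   # sine panning law
        g_l = da_ratio * np.sin(theta) + (1.0 - da_ratio) / np.sqrt(2.0)
        g_r = da_ratio * np.cos(theta) + (1.0 - da_ratio) / np.sqrt(2.0)
        left = g_l * x_foc + x_anti    # ambient part added to the left channel
        right = g_r * x_foc - x_anti   # and subtracted from the right channel
        return left, right, (g_l, g_r)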
In some embodiments other panning functions such as Vector Base Amplitude Panning (VBAP) can also be employed. Alternatively in some embodiments the focus signals can be binauralized using Inter-aural Level and Time Differences (ILDs and ITDs). When ILDs and ITDs are applied in tiles by modifying the level and phase differences to match the desired differences of human hearing (that depend on the direction α), then the process is also reversible and can be used in this invention. For binauralization, the Xfixfoc would be panned to the direction α using ILDs and ITDs whereas Xfixanti would be used as a background ambient signal and simply summed and subtracted from the panned left channel audio signal 224 and the panned right channel audio signal 226 signals as was implemented in the stereo case described above.
The panner 205 can then output the direction values 208, the panned left channel audio signal L 224 and the panned right channel audio signal R 226.
In some embodiments the encoder further comprises a suitable low bit-rate encoder 207. This optionally is configured to encode the metadata and the panned left and right channel audio signals.
Furthermore in some embodiments the encoder comprises a suitable storage/transmission part 209 configured to store and/or transmit the metadata and audio signals (which as shown herein can be encoded).
The output of the encoder is thus configured to produce (encoded) L and R signals that can be used in a backwards compatible way. They form a stereo spatial audio signal that can be used like any other stereo signal.
Thus with respect to Figure 3 is shown a flow diagram of the operations which are implemented by the encoder part as shown in Figure 2.
For example the operations comprise obtaining/capturing audio signals from microphones as shown in Figure 3 by step 301.
Then the following operation is one of direction estimating from audio signals from microphones as shown in Figure 3 by step 303.
The following operation is one of microphone selecting/focussing as shown in Figure 3 by step 305.
Then there is audio panning applied to the audio signals output by the microphone selecting/focussing as shown in Figure 3 by step 307.
There can furthermore be an optional operation of low bit rate encoding as shown in Figure 3 by step 309.
Finally with respect to the encoder side there is shown an operation of storing/transmitting (encoded) audio signals as shown in Figure 3 by step 311.
With respect to Figure 4 is shown the example decoder part as shown in Figure 1 in further detail according to some embodiments.
The decoder part for example can in some embodiments comprise a retriever/receiver 401 configured to retrieve or receive the ‘stereo’ audio signals and the metadata including the direction values from the storage or from the network. The retriever/receiver is thus configured to be the reciprocal to the storage/transmission 209 as shown in Figure 2.
Furthermore in some embodiments the decoder part comprises a decoder 403, which is optional, which is configured to apply the inverse of the operation applied by the encoder 207.
The direction 400 values and the panned left channel audio signal L 402 and the panned right channel audio signal R 404 can then be passed to the reverse panner 405 (or directly to the audio focusser 407).
In some embodiments the decoder part comprises an optional reverse panner 405. The reverse panner 405 is configured to receive the direction values 400 and the panned left channel audio signal L 402 and the panned right channel audio signal R 404 and regenerate the focus audio signal (which can be the near microphone audio signal) 406, the antifocus audio signal (which can be the far microphone audio signal) 408 and the direction 400 values and pass these to the audio focusser 407.
With help of the direction metadata the reverse panner 405 is configured to reverse the panning process (applied in the encoder part) and thus ‘access’ the original focused signals:
Xfixfoc = (L + R)/(g_L + g_R)
Xfixanti = L − g_L · Xfixfoc
As described above the diffuseness parameter can be further employed to assist in the panning process and as such can also be employed in the reverse panning process. The diffuseness parameter can be expressed as a D/A ratio (Direct-to-Ambient ratio). If the diffuseness is low, for example D/A ratio = 1, then the Xfixfoc is panned using the simple panning equation and can be reverse panned in the manner shown by the equation above. If the diffuseness is high, for example D/A ratio = 0, then the focus signal has been panned to all directions equally. The reverse process can in some embodiments be implemented as the following:

Xfixfoc = (L + R)/√2
Xfixanti = L − Xfixfoc/√2
For other reversible panning functions the inverse can be found using similar methods.
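A non-limiting Python sketch of the reversal, assuming the forward panning sketched earlier and that the gains g_L and g_R can be recomputed at the decoder from the transmitted direction (and ratio) metadata, could be:

    def reverse_pan(left, right, gains):
        g_l, g_r = gains                        # same gains as used when panning
        x_foc = (left + right) / (g_l + g_r)    # the anti-focus parts cancel out
        x_anti = left - g_l * x_foc
        return x_foc, x_anti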
The decoder part further can comprise in some embodiments an audio focusser 407 configured to obtain the regenerated focus audio signal (which can be the near microphone audio signal) 406, the antifocus audio signal (which can be the far microphone audio signal) 408 and the direction 400 values. Additionally the audio focusser is configured to receive the listener or device desired focus direction β 410. The audio focusser 407 is thus configured (with the reverse panner 405) to focus the L and R spatial audio signals towards a direction β by reversing the panning process (and generating the focus and antifocus audio signals) and then generating the focussed audio signal 412 and the direction value 400.
Audio focus can be achieved using the Xfixfoc and Xfixanti signals. The Xfixfoc signal emphasizes the front direction and Xfixanti emphasizes the opposite direction (or more generally a direction which is substantially different to the first direction). If the user wants to focus towards the front direction (i.e. β = 0) then the Xfixfoc signal is amplified with respect to the Xfixanti signal in the output. If the user wants to focus towards the ‘back’ or ‘rear’ direction then in some embodiments the Xfixanti signal is amplified with respect to the Xfixfoc signal in the output. The same is typically done if the user wants to focus near the front direction or near the opposite direction, because focusing is typically not very accurate; as a coarse example for one focusing method, beamforming might amplify sound sources in a 40° wide sector with a 3 microphone device instead of just amplifying sound sources in an exact direction. If the user wants to focus clearly towards other directions, neither signal is amplified in the output or the opposite direction is amplified somewhat more than the front direction.
In some embodiments the audio focusser 407 is configured to create an audio focused signal towards the user input direction β by summing the Xfixfoc and Xfixanti signals with suitable gains. The gains depend on the difference between the direction β and the front direction (0), for example by applying a gain function such as that shown in Figure 8:

Xfocus = g_foc · Xfixfoc + g_anti · Xfixanti
The audio focusser as such is configured to use mostly Xfixfoc when the user desired direction is the front direction and to use mostly Xfixanti when the user desired direction is opposite to the front direction. For other directions, the Xfixfoc and Xfixanti are mixed more evenly.
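As a non-limiting illustration, the mixing could be sketched in Python as follows; the raised-cosine gain law is an assumption standing in for the actual curve of Figure 8:

    import numpy as np

    def focus_towards(x_foc, x_anti, beta_deg, focus_dir_deg=0.0):
        # Raised-cosine stand-in for the Figure 8 gain curve: beta equal to the
        # focus direction selects mostly x_foc, the opposite direction mostly x_anti.
        diff = np.radians(beta_deg - focus_dir_deg)
        g_foc = 0.5 * (1.0 + np.cos(diff))
        g_anti = 1.0 - g_foc
        return g_foc * x_foc + g_anti * x_anti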
The focussed audio signal 412 Xfocus can be used as such if a mono focused signal is enough.
In some embodiments the decoder optionally comprises a focussed signal panner 409. The focussed signal panner 409 can be configured to obtain or receive the focussed audio signal 412 and the direction 400 and be configured to generate left and right channel audio signal outputs (or in some embodiments any suitable number of multichannel outputs).
In some embodiments the left channel audio signal output 414 and the right channel audio signal output 416 are generated from the focussed audio signal based on the direction α and further mixed with the received L channel audio signal 402 and R channel audio signal 404 at different levels where different levels of audio focus (e.g. a little focus, medium focus, strong focus or full focus) are desired.
Furthermore in some embodiments the focussed audio signal 412 Xfocus can also be spatialized by panning to direction α. The following equation has gzoom as a gain between 0 and 1 where 1 indicates fully focused and 0 indicates no focus at all. In some embodiments, in order to achieve a better quality spatial audio the zoom can be limited (for example to a maximum zoom gain of 0.5). This would maintain better audio signal spatial characteristics. The focussed signal panner 409 thus in some embodiments can be configured to apply the following to generate the left channel audio signal output Lout 414 and the right channel audio signal output Rout 416:
Lout = gzoom · g_L · Xfocus + (1 − gzoom) · L
Rout = gzoom · g_R · Xfocus + (1 − gzoom) · R
In some embodiments the focussed signal panner 409 is configured to implement a more complex panning taking diffuseness into account. If diffuseness is low (D/A ratio = 1), then Xfocus is panned as in the equation above. If diffuseness is high (D/A ratio = 0), then the Xfocus should be panned to all directions equally. This can be achieved with the following:
Lout = gzoom · (r · g_L + (1 − r)/√2) · Xfocus + (1 − gzoom) · L
Rout = gzoom · (r · g_R + (1 − r)/√2) · Xfocus + (1 − gzoom) · R
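By way of a non-limiting illustration, this output mixing could be sketched in Python as follows, using the same reconstructed panning gains and the gzoom cross-fade:

    import numpy as np

    def pan_focused_output(x_focus, left_in, right_in, alpha_deg,
                           g_zoom=0.5, da_ratio=1.0):
        theta = np.radians((alpha_deg + 90.0) / 2.0)
        g_l = da_ratio * np.sin(theta) + (1.0 - da_ratio) / np.sqrt(2.0)
        g_r = da_ratio * np.cos(theta) + (1.0 - da_ratio) / np.sqrt(2.0)
        # Cross-fade between the focused signal and the transported stereo pair.
        l_out = g_zoom * g_l * x_focus + (1.0 - g_zoom) * left_in
        r_out = g_zoom * g_r * x_focus + (1.0 - g_zoom) * right_in
        return l_out, r_out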
In some embodiments the processing described in the decoder part is implemented in time-frequency tiles and all the parameters may be different in different tiles.
In some embodiments the left channel audio signal output Lout 414 and the right channel audio signal output Rout 416 are converted back to time domain and played/stored.
With respect to Figure 5 is shown an example flow diagram of the operations implemented by the embodiments shown with respect to Figure 4.
Thus the initial operation is one of retrieve/receive (encoded) audio signals as shown in Figure 5 by step 501.
Optionally the audio signals can then be low bit rate decoded as shown in Figure 5 by step 503.
Additionally in some embodiments there is the further optional operation of reverse-panning the audio signals as shown in Figure 5 by step 505.
The channel or reverse-panned audio signals are then audio focussed based on the listener or device direction as shown in Figure 5 by step 507.
The focus signal is then optionally panned as shown in Figure 5 by step 509.
Then the output audio signals are output as shown in Figure 5 by step 511.
In some embodiments the two focused audio signals are hidden or encapsulated inside a spatial audio signal in such a manner that the operation can be reversed and the two focused audio signals can be used to focus audio, where the focus directions follow the direction of a dominant sound source in each time-frequency tile.
In other words in some embodiments a first focus direction can be chosen or selected as the dominant sound source direction. The second focus direction is typically opposite of the first direction.
With respect to Figure 9, there is shown the encoder side in further detail according to some embodiments where the focus directions follow the direction of a dominant sound source.
In some embodiments, as shown in Figure 9, there is shown a series of microphones as part of the microphone array: a first microphone, mic 1, 990, a second microphone, mic 2, 992, and a third microphone, mic 3, 994, which are configured to generate the audio input 102 which is passed to a direction estimator 901.
The direction estimator 901 can be considered to be part of the metadata generation operations as described above. The direction estimator 901 thus can be configured to output the microphone audio signals in the form of the audio input 102 and the direction values 908.
The direction estimate is an estimate of the dominant sound source direction. The direction estimation as indicated above is implemented in small time-frequency tiles by framing the microphone signals in typically 20ms frames, transforming the frames into frequency domain (using DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform) or filter banks like QMF (Quadrature Mirror Filter)), splitting the frequency domain signal into frequency bands and analysing the direction in the bands. These types of framed bands of audio are referred to as time-frequency tiles. The tiles are typically narrower in low frequencies and wider in higher frequencies and may follow for example third-octave bands or Bark bands or ERB bands (Equivalent Rectangular Bandwidth). Other methods such as filterbanks exist for creating similar tiles.
In some embodiments at least one dominant sound source direction α is estimated for each tile using any suitable method such as described above. In the embodiments described herein processing can be (and typically is) implemented in time-frequency tiles. However, for the sake of clarity the following methods are described with respect to one range of frequencies and one time instant. For example typically there would be 20-50 tiles per time instant (=frame) and the number of time instants depends on the frame length and processed audio length.
In some embodiments there can be determined two or more different dominant sound sources. For example as shown in Figure 16 there is shown an example system with two dominant sound sources. The apparatus is a phone 1300 with 3 microphones. The phone 1300 has a defined front direction 1303 and a first front microphone 1307 (a microphone located on the front face of the apparatus), a second front microphone 1311 (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone 1309 (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone). There is also shown a high frequency dominant sound object 1603 which has a direction αH 1607 relative to the front axis 1303, and a low frequency dominant sound object 1601 which has a direction αL 1605 relative to the front axis 1303. In this example the focus audio signal can be a selection of a first microphone pair 1611 (microphone 1 1307 and microphone 3 1311) aligned with the sound object direction for the low frequency sound object and a second microphone pair 1613 (microphone 1 1307 and microphone 2 1309) aligned with the sound object direction for the high frequency sound object.
In some embodiments the encoder part comprises a microphone selector/focus processor 903 which is configured to obtain the audio input 102 and direction 908 values and from these audio signals generate a direction dependent focus signal 904 Xfoc and an antifocus signal 906 Xanti.
In some embodiments a simple method for generating the focus 904 and anti-focus 906 audio signal is to select the nearest microphone(s) relative to the determined sound source direction to supply the focus audio signal 904 Xfoc and select the furthest microphone(s) relative to the determined sound source direction to supply the anti-focus audio signal 906 Xanti. Thus with respect to the example shown in Figure 16 a different microphone pair is selected for low frequencies compared to high frequencies. This microphone selection means that one of the two mics has the dominant sound source amplified with respect to sound sources in other directions. This is because the first microphone is selected from the same side (as much as possible) as the dominant sound source direction and the second microphone is from the opposite side of the device (as much as possible) and the device body physically attenuates sounds that come to the first microphone from other sides than the one where the dominant sound source is. The direction estimation result may change continuously as the dominant sound source may move continuously e.g. when there are multiple speakers around the device and the person talking (=dominant sound source) changes continuously or when the dominant sound source moves or the device moves. Also, the direction estimation may be different in different frequencies. Therefore the direction from which one channel amplifies sound sources changes continuously, the direction being the same as the estimated direction in the metadata.
This can in some embodiments be implemented by firstly determining a near microphone and a far microphone. These near and far microphones are the two microphones in the pair that, of all the possible microphone pairs in the device, is closest aligned such that a line drawn through the microphones in the pair points in the determined sound object direction α. Thus the near and far mic determination can be tile dependent too. The near mic is the mic of the pair that is closer to the dominant sound source direction and the far mic is farther away. Thus if the sound source is behind the device, typically so is the near mic.
The near mic is used as the focus signal Xfoc and the far mic is used as the antifocus signal Xanti that points to the opposite direction of the focus signal.
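A non-limiting Python sketch of this pair selection, assuming 2D microphone coordinates in the device plane, could be:

    import numpy as np

    def select_mic_pair(mic_positions, alpha_deg):
        # Unit vector pointing towards the estimated sound object direction.
        target = np.array([np.cos(np.radians(alpha_deg)),
                           np.sin(np.radians(alpha_deg))])
        best_pair, best_alignment = None, -1.0
        for i in range(len(mic_positions)):
            for j in range(i + 1, len(mic_positions)):
                axis = mic_positions[j] - mic_positions[i]
                axis = axis / (np.linalg.norm(axis) + 1e-12)
                alignment = abs(float(np.dot(axis, target)))
                if alignment > best_alignment:
                    best_pair, best_alignment = (i, j), alignment
        i, j = best_pair
        # The near (focus) mic is the pair member on the source side.
        if float(np.dot(mic_positions[j] - mic_positions[i], target)) > 0.0:
            return j, i   # (near mic index, far mic index)
        return i, j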
An example of this is also shown in Figure 13, which shows an example apparatus, a phone with 3 microphones 1300. The phone 1300 has a defined front direction 1303 and a first front microphone 1307 (a microphone located on the front face of the apparatus), a second front microphone 1311 (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone 1309 (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone). Additionally there is a sound object 1301 which has a direction α 1305 relative to the front axis 1303. When the direction is less than a defined angle (the angle defined by the physical dimensions of the apparatus and the relative microphone pair virtual angles) the front microphone 1307 is the ‘near microphone’ and the back microphone 1309 is the ‘far microphone’ with reference to the supplying of the focus audio signal (the front microphone audio signal) and the anti-focus audio signal (the back microphone audio signal).
It would be understood that when the direction is more than the defined angle, such as shown in the example in Figure 14, where the sound object 1401 has an object direction 1405 greater than the defined angle, then the front microphone, microphone 1, 1307 is the ‘near microphone’ and the other front microphone, microphone 3, 1311 is the ‘far microphone’ as the angle formed by the pair of microphones, microphone 1 1307 and microphone 3 1311, is closer to the determined sound object direction than the angle formed by the pair microphone 1 1307 and microphone 2 1309. In this example the focus audio signal is the microphone 1 audio signal and the anti-focus audio signal is the microphone 3 audio signal.
Furthermore, as shown in the example in Figure 15, where there is a sound object 1501 which has a direction 1505 closer to the angle defined by the pair microphone 1 1307 and microphone 2 1309, the back microphone, microphone 2 1309, is the ‘near microphone’ and the front microphone, microphone 1 1307, is the ‘far microphone’, as this microphone pair is more aligned with the sound object direction and the back microphone, microphone 2 1309, is closer to the object.
In some embodiments the microphone selector/focus processor 903 is configured to generate the focus and antifocus audio signals by applying a beamforming operation on the microphone audio signals. For example, MVDR (Minimum Variance Distortionless Response) beamforming can be used, using any subset or all device microphones or using spatial filtering or using microphone selection or any combination thereof. In some embodiments other audio focus methods can be used such as spatial filtering.
An example of this is shown in Figure 17 which shows an example apparatus, a phone with 3 microphones 1700. The phone 1700 has a defined front direction 1703 and a first front microphone (a microphone located on the front face of the apparatus), a second front microphone (another microphone located on the front face of the apparatus but near to the opposite end of the phone with respect to the first front microphone) and a back microphone (a further microphone located on the back or rear face of the apparatus and shown in this example opposite the first front microphone). Where there is a sound object 1701 which has a direction 1705 relative to the front axis 1703, then the focus audio signal can be a beamformed or spatially processed audio signal with a focus 1711 towards the object direction 1705 using any subset or all of the microphones and the anti-focus audio signal can be a beamformed or spatially processed audio signal with a focus 1713 towards the direction plus 180 degrees using any subset or all of the microphones.
A similar example is shown in Figure 18 where the sound object 1801 has a direction 1805 greater than the defined angle but the focus 1811 is still in the object direction and the antifocus direction 1813 is 180 degrees different from the direction.
In other words the microphone selector/focus processor 903 is configured to create two signals, Near mic/focus signal Xfoc 904 and Far mic/antifocus signal Xanti 906, where the Xfoc is focused towards the determined object direction and Xanti is focused in the opposite direction to the determined object direction. They can be generated using any combination of microphone selection, beamforming, spatial filtering or other audio focus methods.
A panner 905 can furthermore be configured to obtain the focus audio signal 904, the anti-focus audio signal 906 and the direction values 908. Near mic/focus signal Xfoc 904 and Far mic/antifocus signal Xanti 906 are modified by an invertible panning process that makes the Near mic/focus signal Xfoc 904 and Far mic/antifocus signal Xanti 906 backwards compatible spatial audio signals while the panning process can be removed when necessary. The panned left channel audio signal 924 and the panned right channel audio signal 926 can then be output.
In some embodiments the Near mic/focus signal Xfoc 904 and Far mic/antifocus signal Xanti 906 are converted into a spatial audio (stereo) signal, the panned left channel audio signal 924 and the panned right channel audio signal 926, which is not so clearly focused to any direction and where sound sources are approximately in correct places. The spatial audio signals are generated such that at least the dominant sound source is in the correct perceived direction. This can be done with help of the direction α 908, which is the direction of a dominant sound source. As described above α can be estimated for each time-frequency tile. In some embodiments the Xfoc audio signal is panned to the dominant sound source direction and the Xanti audio signal is treated as an ambient signal and added to and subtracted from the channels of the backwards compatible spatial audio signal:
L = g_L · Xfoc + Xanti
R = g_R · Xfoc − Xanti
Where a common sine panning law is employed:
g_L = sin((α + 90°)/2), g_R = cos((α + 90°)/2), so that g_L² + g_R² = 1
The Xanti signal could alternatively be decorrelated using known invertible decorrelators.
In some embodiments diffuseness is estimated in any suitable manner and can be expressed as a D/A ratio (Direct-to-Ambient ratio). If the diffuseness is low (D/A ratio = 1), then Xfoc is panned as in the equation above. If the diffuseness is high (D/A ratio = 0), then the Xfoc typically contains a lot of other sound sources as well as the dominant sound source or there is no clear dominant sound source. In this case the focus signal can be panned to all directions equally. This can be achieved with the following:
L = (r · g_L + (1 − r)/√2) · Xfoc + Xanti
R = (r · g_R + (1 − r)/√2) · Xfoc − Xanti

where r is the D/A ratio.
In some embodiments other panning functions such as Vector Base Amplitude Panning (VBAP) can also be employed. Alternatively in some embodiments the focus signals can be binauralized using Inter-aural Level and Time Differences (ILDs and ITDs). When ILDs and ITDs are applied in tiles by modifying the level and phase differences to match the desired differences of human hearing (that depend on the direction α), then the process is also reversible and can be used in this invention. For binauralization, the Xfoc would be panned to the direction α using ILDs and ITDs whereas Xanti would be used as a background ambient signal and simply summed and subtracted from the panned left channel audio signal 924 and the panned right channel audio signal 926 signals as was implemented in the stereo case described above.
The panner 905 can then output the direction values 908, the panned left channel audio signal L 924 and the panned right channel audio signal R 926.
In some embodiments the encoder further comprises a suitable low bit-rate encoder 907. This optionally is configured to encode the metadata and the panned left and right channel audio signals.
Furthermore in some embodiments the encoder comprises a suitable storage/transmission part 909 configured to store and/or transmit the metadata and audio signals (which as shown herein can be encoded).
The output of the encoder is thus configured to produce (encoded) L and R signals that can be used in a backwards compatible way. They form a stereo spatial audio signal that can be used like any other stereo signal.
Thus with respect to Figure 10 is shown a flow diagram of the operations which are implemented by the encoder part as shown in Figure 9.
For example the operations comprise obtaining/capturing audio signals from microphones as shown in Figure 10 by step 1001.
Then the following operation is one of direction estimating from audio signals from microphones as shown in Figure 10 by step 1003.
The following operation is one of microphone selecting/focussing (based on the dominant sound source direction) as shown in Figure 10 by step 1005.
Then there is audio panning applied to the audio signals output by the microphone selecting/focussing as shown in Figure 10 by step 1007.
There can furthermore be an optional operation of low bit rate encoding as shown in Figure 10 by step 1009.
Finally with respect to the encoder side there is shown an operation of storing/transmitting (encoded) audio signals as shown in Figure 10 by step 1011.
With respect to Figure 11 is shown the example decoder part as shown in Figure 1 in further detail according to some embodiments where microphone selection is determined based on the dominant sound source direction.
The decoder part for example can in some embodiments comprise a retriever/receiver 1101 configured to retrieve or receive the ‘stereo’ audio signals and the metadata including the direction values from the storage or from the network. The retriever/receiver is thus configured to be the reciprocal to the storage/transmission 909 as shown in Figure 9.
Furthermore in some embodiments the decoder part comprises a decoder 1103, which is optional, which is configured to apply the inverse of the operation applied by the encoder 907.
The direction 1100 values and the panned left channel audio signal L 1102 and the panned right channel audio signal R 1104 can then be passed to the reverse panner 1105 (or directly to the audio focusser 1107).
In some embodiments the decoder part comprises an optional reverse panner 1105. The reverse panner 1105 is configured to receive the direction values 1100 and the panned left channel audio signal L 1102 and the panned right channel audio signal R 1104 and regenerate the focus audio signal (which can be the near microphone audio signal) 1106, the antifocus audio signal (which can be the far microphone audio signal) 1108 and the direction 1100 values and pass these to the audio focusser 1107.
With help of the direction metadata the reverse panner 1105 is configured to reverse the panning process (applied in the encoder part) and thus ‘access’ the original focused signals:
Xfoc = (L + R)/(g_L + g_R)
Xanti = L − g_L · Xfoc
As described above, in some embodiments diffuseness can be further employed to assist in the panning process and as such can also be employed in the reverse panning process. The diffuseness can be estimated in any suitable manner and expressed as a D/A ratio (Direct-to-Ambient ratio). If the diffuseness is low (D/A ratio = 1), then the Xfoc is panned using the simple panning equation and can be reverse panned in the manner shown by the equation above. If the diffuseness is high (D/A ratio = 0), then the focus signal has been panned to all directions equally. The reverse process can in some embodiments be implemented as the following:
Xfoc = (L + R)/√2
Xanti = L − Xfoc/√2
For other reversible panning functions the inverse can be found using similar methods.
The decoder part further can comprise in some embodiments an audio focusser 1107 configured to obtain the regenerated focus audio signal (which can be the near microphone audio signal) 1106, the antifocus audio signal (which can be the far microphone audio signal) 1108 and the direction 1100 values. Additionally the audio focusser is configured to receive the listener or device desired focus direction β 1110. The audio focusser 1107 is thus configured (with the reverse panner 1105) to focus the L and R spatial audio signals towards a direction β by reversing the panning process (and generating the focus and antifocus audio signals) and then generating the focussed audio signal 1112 and the direction value 1100.
Audio focus can be achieved using the Xfoc and Xanti signals. The Xfoc signal emphasizes the estimated direction α and Xanti emphasizes the opposite direction. If a listener or device wants to focus towards the direction of the dominant sound source (i.e. β = α) then the Xfoc signal is amplified with respect to the Xanti signal in the output. If the listener or device wants to focus towards the opposite direction then the Xanti signal is amplified with respect to the Xfoc signal in the output. The same is typically done if the listener or device wants to focus near the front direction or near the opposite direction, because focusing is typically not very accurate; as a coarse example for one focusing method, beamforming might amplify sound sources in a 40° wide sector with a 3 microphone device instead of just amplifying sound sources in an exact direction. If the device or listener wants to focus clearly towards other directions, neither signal is amplified in the output or the opposite direction is amplified somewhat more than the front direction.
Based on user input direction, the audio focusser 1107 is configured to create an audio focused signal towards the user input direction β. This can be implemented by summing the Xfoc and Xanti signals with suitable gains. The gains depend on the difference of the directions α and β. An example function is given with respect to Figure 19:
Xfocus = g_foc · Xfoc + g_anti · Xanti
The audio focusser as such is configured to use mostly Xfoc when the listener desired direction is the sound object direction and to use mostly Xanti when the listener desired direction is opposite to the sound object direction. For other directions, the Xfoc and Xanti are mixed more evenly.
The focussed audio signal 1112 Xfocus can be used as such if a mono focused signal is enough.
In some embodiments the decoder optionally comprises a focussed signal panner 1109. The focussed signal panner 1109 can be configured to obtain or receive the focussed audio signal 1112 and the direction 1100 and be configured to generate left and right channel audio signal outputs (or in some embodiments any suitable number of multichannel outputs).
In some embodiments the left channel audio signal output 1114 and the right channel audio signal output 1116 are generated from the focussed audio signal based on the direction α and further mixed with the received L channel audio signal 1102 and R channel audio signal 1104 at different levels where different levels of audio focus (e.g. a little focus, medium focus, strong focus or full focus) are desired.
Furthermore in some embodiments the focussed audio signal 1112 Xfocus can also be spatialized by panning to direction α. The following equation has gzoom as a gain between 0 and 1 where 1 indicates fully focused and 0 indicates no focus at all. In some embodiments, in order to achieve a better quality spatial audio the zoom can be limited (for example to a maximum zoom gain of 0.5). This would maintain better audio signal spatial characteristics.
The focussed signal panner 1109 thus in some embodiments can be configured to apply the following to generate the left channel audio signal output L_out 1114 and the right channel audio signal output R_out 1116:
L_out = (1 − g_zoom) L + g_zoom g_L(α) x_focus
R_out = (1 − g_zoom) R + g_zoom g_R(α) x_focus

where g_L(α) and g_R(α) are left and right panning gains for the direction α.
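A sketch of this panning step, with the same assumed constant-power panning gains as in the de-panning sketch above (the published equation may differ):

import numpy as np

def pan_focused(L, R, x_focus, alpha_deg, g_zoom=0.5):
    # Mix the focused signal, panned towards alpha, with the transported
    # L/R channels; g_zoom = 0 leaves the transport untouched, g_zoom = 1
    # keeps only the panned focus signal. The default cap of 0.5 follows
    # the suggestion in the text to better preserve spatial character.
    pan = np.clip((alpha_deg + 90.0) / 180.0, 0.0, 1.0) * (np.pi / 2.0)
    g_l, g_r = np.sin(pan), np.cos(pan)  # assumed panning gains
    L_out = (1.0 - g_zoom) * L + g_zoom * g_l * x_focus
    R_out = (1.0 - g_zoom) * R + g_zoom * g_r * x_focus
    return L_out, R_out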
In some embodiments the focussed signal panner 1109 is configured to implement a more complex panning taking diffuseness into account. If diffuseness is low (D/A ratio = 1), then x_focus is panned as in the equation above. If diffuseness is high (D/A ratio = 0), then x_focus should be panned to all directions equally. This can be achieved with the following:
L_out = (1 − g_zoom) L + g_zoom (r g_L(α) + (1 − r)/√2) x_focus
R_out = (1 − g_zoom) R + g_zoom (r g_R(α) + (1 − r)/√2) x_focus

where r denotes the D/A ratio.
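Continuing the same sketch, the D/A ratio can interpolate the directional gains towards an equal split (the 1/√2 equal-power spread is an assumption):

import numpy as np

def pan_focused_diffuse(L, R, x_focus, alpha_deg, da_ratio, g_zoom=0.5):
    # As pan_focused above, but the direct-to-ambience ratio (1 = fully
    # direct, 0 = fully diffuse) blends between directional panning and
    # an equal spread of x_focus over both channels.
    pan = np.clip((alpha_deg + 90.0) / 180.0, 0.0, 1.0) * (np.pi / 2.0)
    g_l = da_ratio * np.sin(pan) + (1.0 - da_ratio) / np.sqrt(2.0)
    g_r = da_ratio * np.cos(pan) + (1.0 - da_ratio) / np.sqrt(2.0)
    L_out = (1.0 - g_zoom) * L + g_zoom * g_l * x_focus
    R_out = (1.0 - g_zoom) * R + g_zoom * g_r * x_focus
    return L_out, R_out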
In some embodiments the processing described in the decoder part is implemented in time-frequency tiles and all the parameters may be different in different tiles.
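A compact sketch of such per-tile processing (the frame length, hop size, band split, and the tile_gains helper are illustrative assumptions):

import numpy as np

def process_tiles(L, R, tile_gains, frame=1024, hop=512, n_bands=8):
    # Apply per-tile gains to stereo STFT tiles and overlap-add back.
    # tile_gains(t, b) -> (g_l, g_r) is a hypothetical callable returning
    # the gains derived from that tile's direction and D/A ratio values.
    win = np.sqrt(np.hanning(frame))  # sqrt-Hann in/out: ~unity overlap-add at 50% hop
    bands = np.array_split(np.arange(frame // 2 + 1), n_bands)
    out_L, out_R = np.zeros(len(L)), np.zeros(len(R))
    for t, s in enumerate(range(0, len(L) - frame + 1, hop)):
        FL = np.fft.rfft(win * L[s:s + frame])
        FR = np.fft.rfft(win * R[s:s + frame])
        for b, idx in enumerate(bands):
            g_l, g_r = tile_gains(t, b)  # parameters may differ per tile
            FL[idx] *= g_l
            FR[idx] *= g_r
        out_L[s:s + frame] += win * np.fft.irfft(FL, frame)
        out_R[s:s + frame] += win * np.fft.irfft(FR, frame)
    return out_L, out_R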
In some embodiments the left channel audio signal output L_out 1114 and the right channel audio signal output R_out 1116 are converted back to the time domain and played/stored.
With respect to Figure 12 is shown an example flow diagram of the operations implemented by the embodiments shown with respect to Figure 11.
Thus the initial operation is one of retrieve/receive (encoded) audio signals as shown in Figure 12 by step 1201.
Optionally the audio signals can then be low bit rate decoded as shown in Figure 12 by step 1203.
Additionally in some embodiments there is the further optional operation of reverse-panning the audio signals as shown in Figure 12 by step 1205.
The channel or reverse-panned audio signals are then audio focussed based on the listener or device direction as shown in Figure 12 by step 1207.
The focus signal is then optionally panned as shown in Figure 12 by step 1209.
Then the output audio signals are output as shown in Figure 12 by step 1211.
In some embodiments there may be more than two focused signals. For example there may be four focused signals: two as described above, where a first focussed signal is directed towards the dominant object direction and a second focussed signal is directed away from the dominant object direction, and two additional focussed signals focused in directions 90 degrees away from the first two. In some embodiments there may be further focussed signals, these further focussed signals directed towards ‘up’ and/or ‘down’.
In some embodiments these focused signals may be hidden or encapsulated inside other transported audio signals when there are more than two transported signals. Typically, there could be at most as many focused signals as there are transported signals. For example, if the transported signal is a 5.1 channel format with left, right, center, subwoofer, rear left, and rear right channel signals, then two of the four focused signals could be hidden or encapsulated inside the left and right signals and the other two focussed signals could be hidden or encapsulated inside the rear left and rear right signals.
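As a rough sketch of such encapsulation, reusing the additive/subtractive combination described for the stereo transport; the channel assignment, panning gains and function names below are illustrative assumptions, not the claimed method:

import numpy as np

def encapsulate_pair(x_foc, x_anti, alpha_deg):
    # Hide a focus/antifocus pair in two transport channels: one additive
    # and one subtractive combination of the pair (cf. claims 6 and 7).
    # Azimuths outside -90..+90 degrees are clamped in this simple mapping.
    pan = np.clip((alpha_deg + 90.0) / 180.0, 0.0, 1.0) * (np.pi / 2.0)
    ch_a = np.sin(pan) * x_foc + x_anti
    ch_b = np.cos(pan) * x_foc - x_anti
    return ch_a, ch_b

def encapsulate_in_5_1(front_foc, front_anti, side_foc, side_anti,
                       alpha_deg, center, lfe):
    # Front pair rides in L/R; the pair focused 90 degrees away rides in
    # rear left/right; center and subwoofer pass through untouched.
    L, R = encapsulate_pair(front_foc, front_anti, alpha_deg)
    Ls, Rs = encapsulate_pair(side_foc, side_anti, alpha_deg + 90.0)
    return L, R, center, lfe, Ls, Rs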
With respect to Figure 20 an example electronic device is shown which may be used as any of the apparatus parts of the system as described above. The device may be any suitable electronics device or apparatus. For example, in some embodiments the device 2000 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. The device may for example be configured to implement the encoder/analyser part and/or the decoder part as shown in Figure 1 or any functional block as described above.
In some embodiments the device 2000 comprises at least one processor or central processing unit 2007. The processor 2007 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 2000 comprises at least one memory 2011. In some embodiments the at least one processor 2007 is coupled to the memory 2011. The memory 2011 can be any suitable storage means. In some embodiments the memory 2011 comprises a program code section for storing program codes implementable upon the processor 2007. Furthermore, in some embodiments the memory 2011 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 2007 whenever needed via the memory-processor coupling.
In some embodiments the device 2000 comprises a user interface 2005. The user interface 2005 can be coupled in some embodiments to the processor 2007. In some embodiments the processor 2007 can control the operation of the user interface 2005 and receive inputs from the user interface 2005. In some embodiments the user interface 2005 can enable a user to input commands to the device 2000, for example via a keypad. In some embodiments the user interface 2005 can enable the user to obtain information from the device 2000. For example the user interface 2005 may comprise a display configured to display information from the device 2000 to the user. The user interface 2005 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 2000 and further displaying information to the user of the device 2000. In some embodiments the user interface 2005 may be the user interface for communicating.
In some embodiments the device 2000 comprises an input/output port 2009. The input/output port 2009 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 2007 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable radio access architecture based on long term evolution advanced (LTE Advanced, LTE-A) or new radio (NR) (or can be referred to as 5G), universal mobile telecommunications system (UMTS) radio access network (UTRAN or E-UTRAN), long term evolution (LTE, the same as E-UTRA), 2G networks (legacy network technology), wireless local area network (WLAN or Wi-Fi), worldwide interoperability for microwave access (WiMAX), Bluetooth®, personal communications services (PCS), ZigBee®, wideband code division multiple access (WCDMA), systems using ultra-wideband (UWB) technology, sensor networks, mobile ad-hoc networks (MANETs), cellular internet of things (IoT) RAN and Internet Protocol multimedia subsystems (IMS), any other suitable option and/or any combination thereof. The transceiver input/output port 2009 may be configured to receive the signals.
In some embodiments the device 2000 may be employed as at least part of the synthesis device. The input/output port 2009 may be coupled to headphones (which may be headtracked or non-tracked headphones) or similar, and to loudspeakers.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general-purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or “fab” for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. All such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

CLAIMS:
1. An apparatus for generating spatial audio signals, the apparatus comprising means configured to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
2. The apparatus as claimed in claim 1, wherein the first focussed audio signal direction is fixed relative to a direction of the apparatus.
3. The apparatus as claimed in claim 1, wherein the first focussed audio signal direction is the at least one metadata directional parameter value.
4. The apparatus as claimed in any of claims 1 to 3, wherein the means configured to generate the first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in the first focussed audio signal direction, is configured to at least one of: select one of the at least two audio signals to generate the first focussed audio signal; and mix at least two of the at least two audio signals to generate the first focussed audio signal.
5. The apparatus as claimed in any of claims 1 to 4, wherein the means configured to generate the second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal being focussed in the second focussed audio signal direction, is configured to at least one of: select one of the at least two audio signals to generate the second focussed audio signal; and mix at least two of the at least two audio signals to generate the second focussed audio signal.
6. The apparatus as claimed in any of claims 1 to 5, wherein the means configured to generate the first output audio signal based on a panning of the first focussed audio signal and the at least one metadata directional parameter is configured to generate the first output audio signal as an additive combination of the second focussed audio signal and a panning of the first focussed audio signal to a left channel direction based on the at least one metadata directional parameter.
7. The apparatus as claimed in claim 6, wherein the means configured to generate the second output audio signal is configured to generate the second output audio signal as a subtractive combination of the second focussed audio signal and a panning of the first focussed audio signal to a right channel direction based on the at least one metadata directional parameter.
8. The apparatus as claimed in any of claims 1 to 7, wherein the means is further configured to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
9. The apparatus as claimed in any of claims 1 to 8, wherein the means configured to obtain at least one metadata directional parameter associated with the at least two audio signals is configured to analyse the at least two audio signals to generate the at least one metadata directional parameter.
10. The apparatus as claimed in any of claims 1 to 9, wherein the means configured to obtain at least one metadata directional parameter associated with the at least two audio signals is configured to receive the at least one metadata directional parameter, and the means configured to obtain at least two audio signals is configured to receive the at least two audio signals.
11. An apparatus for processing spatial audio signals, the apparatus comprising means configured to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
12. The apparatus as claimed in claim 11, wherein the first focussed audio signal direction is one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
13. The apparatus as claimed in any of claims 11 to 12, wherein the means is configured to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the means configured to generate the focus audio signal is configured to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
14. The apparatus as claimed in any of claims 11 to 13, wherein the means configured to generate at least one output audio signal based on the focus audio signal is configured to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal.
15. A method for an apparatus for generating spatial audio signals, the method comprising: obtaining at least two audio signals; obtaining at least one metadata directional parameter associated with the at least two audio signals; generating a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generating a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generating a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
16. A method for an apparatus for processing spatial audio signals, the method comprising: obtaining a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtaining a desired focus directional parameter; generating a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generating at least one output audio signal based on the focus audio signal.
17. An apparatus for generating spatial audio signals, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain at least two audio signals; obtain at least one metadata directional parameter associated with the at least two audio signals; generate a first focussed audio signal based on at least one of the at least two audio signals, the first focussed audio signal being focussed in a first focussed audio signal direction; generate a second focussed audio signal based on at least one of the at least two audio signals, the second focussed audio signal focussed in a second focussed audio signal direction which is substantially different to the first focussed audio signal direction; and generate a first and a second output audio signal based on the second focussed audio signal and a panning of the first focussed audio signal, the panning controlled by the at least one metadata directional parameter.
18. The apparatus as claimed in claim 17, wherein the apparatus is further caused to generate a further focussed audio signal based on at least one of the at least two audio signals, the further focussed audio signal focussed in a further focussed audio signal direction which is perpendicular to the first focussed audio signal direction.
19. The apparatus as claimed in claim 17 or 18, wherein the apparatus caused to obtain the at least one metadata directional parameter associated with the at least two audio signals is caused to analyse the at least two audio signals to generate the at least one metadata directional parameter.
20. The apparatus as claimed in any of claims 17 to 19, wherein the apparatus caused to obtain the at least one metadata directional parameter associated with the at least two audio signals is caused to receive the at least one metadata directional parameter, and the apparatus caused to obtain the at least two audio signals is caused to receive the at least two audio signals.
21. An apparatus for processing spatial audio signals, the apparatus comprising at least one processor and at least one memory including a computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtain a first audio signal and a second audio signal, the first audio signal and the second audio signal based respectively on a panning of a first focussed audio signal and a second focussed audio signal, wherein the second focussed audio signal is focussed in a second direction which is substantially different to a first direction of the first focussed audio signal, and at least one metadata directional parameter, wherein the panning of the first focussed audio signal is associated with the at least one metadata directional parameter; obtain a desired focus directional parameter; generate a focus audio signal towards the desired focus directional parameter value, the focus audio signal based on the desired focus directional parameter, the at least one metadata directional parameter, the first audio signal and the second audio signal; and generate at least one output audio signal based on the focus audio signal.
22. The apparatus as claimed in claim 21, wherein the first focussed audio signal direction is one of: a fixed direction relative to a direction of a capture apparatus; and the at least one metadata directional parameter value.
23. The apparatus as claimed in claim 21 or 22, wherein the apparatus is caused to, prior to generating the focus audio signal: de-pan the first audio signal to generate the first focussed audio signal; and de-pan the second audio signal to generate the second focussed audio signal, wherein the apparatus caused to generate the focus audio signal is caused to generate the focus audio signal based on a combination of the first focussed audio signal and the second focussed audio signal.
24. The apparatus as claimed in any of claims 21 to 23, wherein the apparatus caused to generate the at least one output audio signal based on the focus audio signal is caused to: generate a first output audio signal based on a combination of the focus audio signal and the first audio signal; and generate a second output audio signal based on a combination of the focus audio signal and the second audio signal.
PCT/EP2023/066359 2022-07-12 2023-06-19 Transporting audio signals inside spatial audio signal WO2024012805A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2210234.7 2022-07-12
GB2210234.7A GB2620593A (en) 2022-07-12 2022-07-12 Transporting audio signals inside spatial audio signal

Publications (1)

Publication Number Publication Date
WO2024012805A1

Family

ID=84539913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/066359 WO2024012805A1 (en) 2022-07-12 2023-06-19 Transporting audio signals inside spatial audio signal

Country Status (2)

Country Link
GB (1) GB2620593A (en)
WO (1) WO2024012805A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190394606A1 (en) * 2017-02-17 2019-12-26 Nokia Technologies Oy Two stage audio focus for spatial audio processing
US20210337338A1 (en) * 2018-08-24 2021-10-28 Nokia Technologies Oy Spatial Audio Processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8509454B2 (en) * 2007-11-01 2013-08-13 Nokia Corporation Focusing on a portion of an audio scene for an audio signal

Also Published As

Publication number Publication date
GB2620593A (en) 2024-01-17
GB202210234D0 (en) 2022-08-24

Similar Documents

Publication Publication Date Title
CN107533843B (en) System and method for capturing, encoding, distributing and decoding immersive audio
US20240007814A1 (en) Determination Of Targeted Spatial Audio Parameters And Associated Spatial Audio Playback
KR20190125987A (en) Two-stage audio focus for spatial audio processing
CN113597776B (en) Wind noise reduction in parametric audio
WO2019193248A1 (en) Spatial audio parameters and associated spatial audio playback
US20210250717A1 (en) Spatial audio Capture, Transmission and Reproduction
US20220328056A1 (en) Sound Field Related Rendering
US20220303710A1 (en) Sound Field Related Rendering
EP3824464A1 (en) Controlling audio focus for spatial audio processing
US20230199417A1 (en) Spatial Audio Representation and Rendering
US11483669B2 (en) Spatial audio parameters
WO2024012805A1 (en) Transporting audio signals inside spatial audio signal
EP4312439A1 (en) Pair direction selection based on dominant audio direction
US20240137728A1 (en) Generating Parametric Spatial Audio Representations
US20230188924A1 (en) Spatial Audio Object Positional Distribution within Spatial Audio Communication Systems
US20240137723A1 (en) Generating Parametric Spatial Audio Representations
EP4358081A2 (en) Generating parametric spatial audio representations
EP4358545A1 (en) Generating parametric spatial audio representations
WO2022258876A1 (en) Parametric spatial audio rendering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23734508

Country of ref document: EP

Kind code of ref document: A1