WO2021209683A1 - Audio processing - Google Patents

Audio processing

Info

Publication number
WO2021209683A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
signal
audio
directions
spatial
Prior art date
Application number
PCT/FI2021/050234
Other languages
French (fr)
Inventor
Miikka Vilermo
Toni Mäkinen
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2021209683A1 publication Critical patent/WO2021209683A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0091 Means for obtaining special acoustic effects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/056 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/155 Musical effects
    • G10H 2210/265 Acoustic effect simulation, i.e. volume, spatial, resonance or reverberation effects added to a musical sound, usually by appropriate filtering or delays
    • G10H 2210/295 Spatial effects, musical uses of multiple audio channels, e.g. stereo
    • G10H 2210/305 Source positioning in a soundscape, e.g. instrument positioning on a virtual soundstage, stereo panning or related delay or reverberation changes; Changing the stereo width of a musical source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the example and non-limiting embodiments of the present invention relate to processing of audio signals.
  • various example embodiments of the present invention relate to audio processing that involves deriving a processed audio signal where characteristics of respective sounds in one or more sound directions of a spatial audio image represented by an input audio signal are modified.
  • the process of capturing such an audio signal using the mobile device comprises operating a microphone array of the mobile device to capture a plurality of microphone signals and processing the captured microphone signals into a digital audio signal for playback in the mobile device or for further processing in the mobile device, for storage in the mobile device and/or for transmission to one or more other devices to enable subsequent playback in the mobile device or in another device.
  • the digital audio signal may be one that conveys a spatial audio image that represents the audio scene around the mobile device, either as such or together with spatial metadata that defines at least some characteristics of the spatial audio scene.
  • the recorded audio signal may comprise a multi-channel signal where each channel is (substantially directly) based on the respective microphone signal or the recorded audio signal may comprise a spatial audio signal derived based on the microphone signals.
  • the digital audio is captured together with the associated video. Capturing a digital audio signal that represents an audio scene around the mobile device provides interesting possibilities for processing the spatial audio image conveyed by the digital audio signal during the capture and/or after the capture.
  • a user may wish to modify characteristics of one or more sound sources in the spatial audio image, for example, to improve perceptual quality or clarity of the one or more sound sources or for entertainment purposes.
  • a straightforward approach for implementing such a procedure includes extracting, from the digital audio signal, a sound signal representing a sound source of interest, modifying the sound signal in a desired manner and inserting the modified sound signal back to the digital audio signal.
  • Extraction of the sound signal may be carried out by applying an audio focusing technique known in the art to the digital audio signal, where the audio focusing aims at representing sounds in a desired sound direction within a spatial audio image represented by the digital audio signal while excluding sounds in other sound directions.
  • a typical solution for audio focusing involves audio beamforming, which is a technique well known in the art.
  • an audio beamforming procedure aims at extracting a beamformed audio signal that represents sounds in a sound direction of interest while suppressing sound in other sound directions.
  • the sound direction of interest may be referred to as a beam direction.
  • the term audio focusing is applied to refer to an audio processing technique that involves emphasizing sounds in certain sound directions of a spatial audio image in relation to sounds in other sound directions of the spatial audio image.
  • the aim of the beamforming is to extract or derive a beamformed audio signal that represents sounds in the beam direction without representing sounds in other sound directions.
  • the beamformed audio signal is typically an audio signal where sounds in the beam direction are emphasized in relation to sounds in other sound directions. Consequently, even if an audio beamforming procedure aims at a beamformed audio signal that only represents sounds in the beam direction, the resulting beamformed audio signal is one where sounds in the beam direction and sounds in a sub-range of directions around the beam direction are emphasized in relation to sounds in other directions in accordance with characteristics of a beam applied for the audio beamforming.
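The emphasis of sounds around a beam direction can be illustrated with a minimal delay-and-sum beamformer sketched below. This is not the implementation of this publication; the linear microphone geometry, the assumed speed of sound and the rounding of delays to whole samples are simplifications for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed ambient value for this illustration


def delay_and_sum(mic_signals, mic_positions, beam_direction_deg, fs):
    """Steer a delay-and-sum beam of a linear microphone array.

    mic_signals: equal-length lists of samples, one list per microphone
    mic_positions: microphone x-coordinates in metres
    beam_direction_deg: beam direction, 0 = array axis, 90 = broadside
    fs: sampling frequency in Hz

    Delays are rounded to whole samples for simplicity; a practical
    implementation would use fractional-delay filtering.
    """
    theta = math.radians(beam_direction_deg)
    # Far-field arrival-time advance of each microphone for direction theta.
    advances = [p * math.cos(theta) / SPEED_OF_SOUND for p in mic_positions]
    # Delay each microphone so that sounds from theta line up in time.
    base = min(advances)
    shifts = [round((a - base) * fs) for a in advances]
    n = len(mic_signals[0])
    out = [0.0] * n
    for sig, shift in zip(mic_signals, shifts):
        for i in range(n):
            j = i - shift  # apply the compensating delay
            if 0 <= j < n:
                out[i] += sig[j] / len(mic_signals)
    return out
```

Steering towards the actual source direction sums the microphone copies of the source in phase, while sounds from other directions sum incoherently, which realizes the relative emphasis described above.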
  • the width of a beam applied in the audio beamforming procedure may be considered: the width of the beam may be indicated by a solid angle (typically in the horizontal direction only), which defines a sub-range of sound directions around the beam direction that are considered to fall within the beam.
  • the solid angle may define a sub-range of sound directions around the beam direction such that sounds in sound directions outside the solid angle are attenuated at least a predefined amount in relation to a sound direction of maximum amplification (or minimum attenuation) within the solid angle.
  • the predefined amount may be defined, for example, as 6 dB or 3 dB.
  • definition of the beam width as a solid angle is a simplified model for indicating the width of the beam, and hence the sub-range of sound directions encompassed by the beam when targeted to the beam direction. In a real-life implementation the beam does not strictly cover a well-defined range of sound directions around the sound direction of interest; rather, its width may vary with audio signal characteristics, with the position of the beam direction within the spatial audio image under consideration and/or with audio frequency.
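The beam-width definition above (the angular region within a predefined attenuation, e.g. 6 dB or 3 dB, of the maximum) can be evaluated numerically for a given beam pattern. The uniform-line-array response below is a textbook model with illustrative parameter values, not a pattern taken from this publication.

```python
import math


def ula_gain_db(theta_deg, n_mics=4, spacing=0.05, freq=2000.0, c=343.0):
    """Textbook far-field response (in dB) of a uniform line array steered
    to broadside (90 degrees); all parameter values are illustrative."""
    psi = 2.0 * math.pi * spacing * freq / c * math.cos(math.radians(theta_deg))
    if abs(math.sin(psi / 2.0)) < 1e-12:
        mag = 1.0  # limit of sin(N*psi/2) / (N*sin(psi/2)) as psi -> 0
    else:
        mag = abs(math.sin(n_mics * psi / 2.0) / (n_mics * math.sin(psi / 2.0)))
    return 20.0 * math.log10(max(mag, 1e-6))


def beam_width_deg(gain_db, threshold_db=6.0, step=0.1):
    """Width in degrees of the region that stays within threshold_db of the
    pattern maximum, scanning 0..180 degrees (assumes a single main lobe
    above the threshold)."""
    angles = [i * step for i in range(int(180.0 / step) + 1)]
    gains = [gain_db(a) for a in angles]
    peak = max(gains)
    inside = [a for a, g in zip(angles, gains) if g >= peak - threshold_db]
    return max(inside) - min(inside)
```

As the text notes, a 6 dB threshold yields a wider beam than a 3 dB threshold for the same pattern.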
  • the beamformed audio signal may also include sound components originating from other sound sources in sound directions close to the beam direction and/or ambient sound components (e.g. sounds that have no well-defined sound direction).
  • when the beamformed audio signal is applied as the sound signal that represents the sound source of interest in the context of the above-described procedure for modifying characteristics of a sound source in the spatial audio image represented by a digital audio signal, modifying the beamformed audio signal may unintentionally also modify sound components from other sound sources and/or ambient sound components. This in turn may result in an audio effect different from the one intended and/or may distort the modified digital audio signal that results from introducing the modified beamformed audio signal back into the digital audio signal. While the above description applies audio beamforming as an example of audio focusing techniques in general, similar challenges are involved in audio focusing techniques of other kinds as well.
  • a method for audio processing comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing configured to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing comprises at least one processor; and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • a computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing is provided.
  • the computer readable medium may comprise a volatile or a non-volatile computer-readable record medium, thereby providing a computer program product comprising at least one computer-readable non-transitory medium having stored thereon program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing.
  • Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system according to an example
  • Figures 2A and 2B illustrate respective block diagrams of some components and/or entities of an audio processing sub-system according to an example
  • Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion according to a non-limiting example
  • Figure 4 illustrates a flowchart depicting a method according to an example
  • Figure 5 illustrates a flowchart depicting a method according to an example
  • Figure 6 illustrates a block diagram of some elements of an apparatus according to an example.
  • FIG. 1 illustrates a block diagram of some components and/or entities of an audio processing system 100 according to a non-limiting example.
  • the audio processing system 100 comprises an audio capturing portion 110 and an audio processing portion 120.
  • the audio capturing portion 110 is coupled to a microphone array 112 and it is arranged to receive respective microphone signals from a plurality of microphones 112-1, 112-2, ..., 112-K and to record a captured multi-channel audio signal 115 based on the received microphone signals.
  • the microphones 112-1, 112-2, ..., 112-K represent a plurality of (i.e. two or more) microphones, where an individual one of the microphones may be referred to as a microphone 112-k or as a microphone 112-j.
  • microphone array 112 is to be construed broadly, encompassing any arrangement of two or more microphones 112-k arranged in or coupled to a device implementing the audio processing system 100.
  • the audio processing portion 120 is arranged to receive the multi-channel audio signal 115 from the audio capturing portion 110 and to derive a spatial audio signal 125 based on the multi-channel audio signal 115.
  • Each of the microphone signals represents the same captured sound, while providing a respective different representation of it that depends on the positions of the microphones 112-k with respect to each other. For a sound source in a certain spatial position with respect to the microphone array 112, this results in a different representation of sounds originating from that sound source in each of the microphone signals: a microphone 112-k that is closer to the sound source captures the sound originating therefrom at a higher amplitude and earlier than a microphone 112-j that is further away from it.
  • Such differences in amplitude and/or time delay enable derivation of a spatial audio signal that represents the audio scene at the time (and place) of capturing the microphone signals and/or applying spatial audio processing based on the multi-channel audio signal 115.
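The time-delay differences described above can be illustrated with a brute-force cross-correlation delay estimator. This is only a sketch; spatial analysis in practice uses more robust estimators (e.g. generalized cross-correlation) and sub-sample interpolation.

```python
def estimate_delay(sig_a, sig_b, max_lag):
    """Estimate, by brute-force cross-correlation, the lag (in samples) at
    which sig_b best aligns with sig_a. A positive result means sig_b is a
    delayed copy of sig_a, i.e. its microphone is further from the source."""
    best_lag, best_score = 0, float("-inf")
    n = min(len(sig_a), len(sig_b))
    for lag in range(-max_lag, max_lag + 1):
        # Correlate sig_a against sig_b shifted by the candidate lag.
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Together with the microphone spacing, the sampling frequency and the speed of sound, such a lag estimate can be mapped to a sound direction-of-arrival estimate of the kind used in spatial audio processing.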
  • the representation of the spatial audio scene captured in the microphone signals and, consequently, in the multi-channel audio signal 115 may be referred to as a spatial audio image.
  • the audio capturing portion 110 may be arranged to record a respective digital audio signal based on each of the microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112 at a predefined sampling frequency using a predefined bit depth and to provide the recorded digital audio signals as the multi-channel audio signal 115 to the audio processing portion 120 for further audio processing therein.
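Recording at a predefined bit depth amounts to mapping each sample to a signed integer code of that width. A minimal sketch of such quantization (a hypothetical helper, not part of this publication) could look as follows:

```python
def quantize(sample, bit_depth):
    """Quantize one sample in [-1.0, 1.0) to a signed integer code of the
    given bit depth, as done when recording a digital audio signal."""
    levels = 1 << (bit_depth - 1)  # e.g. 32768 for 16-bit audio
    code = int(round(sample * levels))
    # Clamp to the representable range of the signed integer format.
    return max(-levels, min(levels - 1, code))
```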
  • each digital audio signal recorded at the audio capturing portion 110 may be considered as a respective channel of the multi-channel audio signal 115.
  • the multi-channel audio signal 115 may comprise or may be accompanied with audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115, e.g. the applied sampling frequency and/or bit depth.
  • FIGS. 2A and 2B illustrate respective block diagrams of some components of respective audio processing sub-systems 100a and 100b according to a non-limiting example.
  • the audio processing sub-system 100a comprises the microphone array 112 and the audio capturing portion 110 described in the foregoing together with a memory 102.
  • a difference to operation of the audio processing system 100 is that instead of providing the multi-channel audio signal 115 to the audio processing portion 120, the audio capturing portion 110 may be arranged to store the multi-channel audio signal 115 in the memory 102 for subsequent access by the audio processing portion 120.
  • the multi-channel audio signal 115 may be stored in the memory 102 together with the audio metadata described in the foregoing.
  • the audio processing sub-system 100b comprises the memory 102 and the audio processing portion 120 described in the foregoing. Hence, a difference in operation of the audio processing sub-system 100b in comparison to a corresponding aspect of operation of the audio processing system 100 is that instead of (directly) obtaining the multi-channel audio signal 115 from the audio capturing portion 110, the audio processing portion 120 reads the multi-channel audio signal 115, possibly together with the audio metadata, from the memory 102.
  • the memory 102 read by the audio processing portion 120 is the same one to which the audio capturing portion 110 stores the multi-channel audio signal 115 recorded or derived based on the respective microphone signals obtained from the microphones 112-k of the microphone array 112.
  • such an arrangement may be provided by providing the audio processing sub-systems 100a and 100b in a single device that also includes (or otherwise has access to) the memory 102.
  • the audio processing sub-systems 100a and 100b may be provided in a single device or in two separate devices and the memory 102 may comprise a memory provided in a removable memory device such as a memory card or a memory stick that enables subsequent access to the multi-channel audio signal 115 (possibly together with the metadata) in the memory 102 by the same device that stored this information therein or by another device.
  • the memory 102 may be provided in a further device, e.g. in a server device, that is communicatively coupled, by a communication network, to a device providing both audio processing sub-systems 100a, 100b or to respective devices providing the audio processing sub-system 100a and the audio processing sub-system 100b.
  • the memory 102 may be replaced by a transmission channel or by a communication network that enables transferring the multi-channel audio signal 115 (possibly together with the audio metadata) from a first device providing the audio processing sub-system 100a to a second device providing the audio processing sub-system 100b.
  • the transfer of the multi-channel audio signal 115 may comprise transmitting/receiving the multi-channel audio signal 115 as an audio packet stream, whereby the audio capturing portion 110 may further operate to encode the multi-channel audio signal 115 into encoded audio data suitable for transmission in the audio packet stream and the audio processing portion 120 may further operate to decode the encoded audio data received in the audio packet stream back into the multi-channel audio signal 115 (or into a reconstructed version thereof) for the audio processing therein.
  • the audio processing portion 120 may be arranged to carry out a spatial audio processing procedure that results in modifying characteristics of respective sounds of one or more sound sources included in the multi-channel audio signal 115 to provide the spatial audio signal 125. Without losing generality, modification of a sound may be referred to as application of an audio effect to the respective sound.
  • while the audio processing system 100 and operation thereof are throughout this disclosure predominantly described with reference to the audio processing portion 120, 220 obtaining an input audio signal as the multi-channel audio signal 115 derived on the basis of respective microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112, in other examples the input audio signal to the audio processing portion 120, 220 may comprise an audio signal of another type that represents a spatial audio image and that enables audio focusing, thereby enabling derivation of the spatial audio signal 125 as the output of the audio processing portion 120, 220.
  • Non-limiting examples of such input audio signals of other types that represent a spatial audio image include an Ambisonic (spherical harmonic) audio format and various multi-loudspeaker audio formats (such as 5.1-channel or 7.1 surround sound) known in the art.
  • audio metadata includes information that defines various aspects related to characteristics of the input audio signal, including spatial characteristics such as parametric data describing the spatial audio field, e.g. respective sound direction-of-arrival estimates, respective ratios between direct and ambient sound energy components, etc. for one or more frequency bands.
  • Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion 220 according to a non-limiting example.
  • the audio processing portion 220 is arranged to obtain the multi-channel audio signal 115 and respective indications of a sound direction within a spatial audio image represented by the multi-channel audio signal 115 and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions.
  • the audio processing portion 220 comprises an audio decomposition portion 222 for deriving, based on the multi-channel audio signal 115, at least one sound signal 221 that represents sounds in the one or more sound directions within the spatial audio image and a background signal 223 that represents sounds in other directions of the spatial audio image.
  • the audio processing portion further comprises an audio effect portion 224 for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225.
  • the audio processing portion further comprises an audio combiner 226 for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal 223.
  • the audio processing portion 220 may include further entities in addition to those illustrated in Figure 3 and/or some of the entities depicted in Figure 3 may be combined with other entities while providing the same or corresponding functionality.
  • the entities illustrated in Figure 3 serve to represent logical components of the audio processing portion 220 that are arranged to perform a respective function but that do not impose structural limitations concerning implementation of the respective entity.
  • respective hardware means, respective software means or a respective combination of hardware means and software means may be applied to implement any of the entities illustrated in Figure 3 separately from the other entities, to implement any sub-combination of two or more entities illustrated in Figure 3, or to implement all entities illustrated in Figure 3 in combination.
  • the audio processing portion 220 may be provided as one comprising means for obtaining the multi-channel audio signal 115 that represents a spatial audio image, means for obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, means for deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, means for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, and means for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223.
  • the spatial audio processing procedure may be carried out, for example, in accordance with a method 200 illustrated in a flowchart of Figure 4.
  • the method 200 commences from obtaining the multi-channel audio signal 115 that represents a spatial audio image, as indicated in block 202, and obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, as indicated in block 204.
  • the method 200 further comprises deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, as indicated in block 206.
  • the method 200 further comprises deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, as indicated in block 208. Moreover, the method 200 further comprises deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223, as indicated in block 210.
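The data flow of blocks 202 to 210 can be summarized as a function skeleton. The `decompose` and `combine` callables stand in for the audio decomposition portion 222 and the audio combiner 226 of Figure 3; their internals are not specified by this sketch, so only the overall flow is fixed, and all names are illustrative.

```python
def process_spatial_audio(input_frame, directed_effects, decompose, combine):
    """Sketch of the spatial audio processing procedure of blocks 202-210.

    input_frame: one frame of the multi-channel input audio signal
    directed_effects: list of (sound_direction, effect_fn) pairs
    decompose: callable returning (per-direction sound signals, background)
    combine: callable merging modified sound signals with the background
    """
    directions = [d for d, _ in directed_effects]
    # Block 206: derive sound signals for the indicated directions and a
    # background signal representing sounds in other directions.
    sound_signals, background = decompose(input_frame, directions)
    # Block 208: apply the respective audio effect to each sound signal.
    modified = [effect(sig)
                for sig, (_, effect) in zip(sound_signals, directed_effects)]
    # Block 210: derive the output spatial audio signal.
    return combine(modified, background)
```

A caller supplies concrete decomposition and combination implementations (e.g. beamforming-based extraction and remixing), so the skeleton itself stays agnostic of the audio focusing technique.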
  • the audio decomposition portion 222 may be arranged to carry out operations, procedures and/or functions pertaining to block 206
  • the audio effect portion 224 may be arranged to carry out operations, procedures and/or functions pertaining to block 208
  • the audio combiner 226 may be arranged to carry out operations, procedures and/or functions pertaining to block 210.
  • the operation of the audio processing portion 220 is predominantly described with references to the method 200, while the corresponding examples readily pertain to respective entities of the audio processing portion 220 arranged to carry out the respective aspect of the method 200.
  • the audio processing portion 220 and the method 200 may be arranged to process the multi-channel audio signal 115 arranged in a sequence of input frames, each input frame including a respective time segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency (which may be defined, for example, in the audio metadata provided for the multi-channel audio signal 115).
  • the audio processing portion 220 employs a fixed predefined frame length.
  • the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths.
  • a frame length may be defined as the number of (input) samples L included in the frame for each channel of the multi-channel audio signal 115, which at the predefined sampling frequency maps to a corresponding duration in time.
  • the frames may be non-overlapping or they may be partially overlapping. Any applied frame lengths and sampling frequencies serve as non-limiting examples, however, and different values may be employed instead, depending e.g. on the desired audio bandwidth, on the desired framing delay and/or on the available processing capacity.
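The framing arrangement described above can be sketched as follows; the 48 kHz sampling frequency, 20 ms frame length and 50 % overlap used here are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def split_into_frames(x, frame_len, hop_len):
    """Split a multi-channel signal of shape (channels, samples) into
    frames of shape (n_frames, channels, frame_len).

    Frames overlap when hop_len < frame_len and are non-overlapping
    when hop_len == frame_len. The tail that does not fill a complete
    frame is discarded in this sketch.
    """
    channels, n_samples = x.shape
    n_frames = 1 + (n_samples - frame_len) // hop_len
    return np.stack([x[:, i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: a 2-channel signal at 48 kHz with 20 ms frames, 50 % overlap.
fs = 48000
frame_len = int(0.02 * fs)          # 960 samples per frame
hop_len = frame_len // 2            # 480-sample hop -> 50 % overlap
rng = np.random.default_rng(0)
x = rng.standard_normal((2, fs))    # one second of 2-channel noise
frames = split_into_frames(x, frame_len, hop_len)
```

With a 50 % overlap, the second half of each frame reappears as the first half of the next one, which the stacking above preserves.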
  • At least part of the processing carried out by the audio processing portion 220 and/or the method 200 may be carried out separately for a plurality of frequency bands of the multi-channel audio signal 115. Consequently, e.g. a respective entity of the audio processing portion 220 and/or a respective step of the method 200 may involve (at least conceptually) dividing or decomposing each channel of the multi-channel audio signal 115 into a respective plurality of frequency bands, thereby providing a respective time-frequency representation for each channel of the multi-channel audio signal 115.
  • division into the frequency bands may comprise transforming each channel of the multi-channel audio signal 115 from the time domain into a respective frequency-domain audio signal and arranging the resulting frequency-domain samples (also referred to as frequency bins) into a respective plurality of frequency bands in each of the channels.
  • time-to-frequency-domain transforms include the short-time discrete Fourier transform (STFT) and a complex-modulated quadrature-mirror filter (QMF) bank.
  • multi-channel audio signal 115 is transformed into frequency domain for carrying out an aspect of audio processing by the audio processing portion 220 and/or the method 200 (separately) for a plurality of frequency bands
  • respective inverse transform from the frequency domain back to the time domain may be applied before providing the spatial audio signal 125 as the output of the audio processing portion 220 and/or the method 200.
  • the number of frequency bands and respective bandwidths of the frequency bands may be selected e.g. in accordance with the desired frequency resolution and/or available computing power.
  • the frequency band structure involves 24 frequency sub-bands according to the Bark scale, the equivalent rectangular bandwidth (ERB) scale or the 3rd-octave band scale known in the art.
  • different number of frequency sub-bands that have the same or different bandwidths may be employed.
  • a specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a single frequency sub-band that covers a continuous subset of the input spectrum.
  • a frequency-domain sample that represents frequency bin b in time frame n of channel i of the frequency-domain audio signal may be denoted as x(i, b, n).
  • Division into frequency bands may be provided by arranging or grouping one or more consecutive frequency bins obtained for a given channel in the given frame into a respective frequency band, thereby providing a plurality of frequency bands k = 0, ..., K−1, where the frequency-domain audio signal in frequency band k in time frame n of channel i may be denoted as x(i, k, n) and where the frequency-domain audio signal in frequency band k in time frame n across all channels may be denoted as x(k, n).
  • the latter may be referred to as a respective time-frequency tile.
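The time-frequency decomposition and band grouping described above may be sketched as follows; the Hann window, frame length and band-edge choices are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

def stft_bands(x, frame_len, band_edges):
    """Transform each channel of one time frame into the frequency
    domain and group the frequency bins into bands.

    x          : array of shape (channels, frame_len), one time frame.
    band_edges : bin indices delimiting the bands; band k covers bins
                 band_edges[k] .. band_edges[k+1]-1.

    Returns a list of K arrays; element k holds x(i, k, n) for all
    channels i, i.e. the time-frequency tile of band k in this frame.
    """
    window = np.hanning(frame_len)
    spectrum = np.fft.rfft(x * window, axis=-1)   # bins x(i, b, n)
    return [spectrum[:, band_edges[k]:band_edges[k + 1]]
            for k in range(len(band_edges) - 1)]

# Example: group a 2-channel, 64-sample frame into 3 uneven bands,
# loosely mimicking a perceptual (e.g. Bark-like) band structure.
frame = np.random.default_rng(0).standard_normal((2, 64))
tiles = stft_bands(frame, 64, band_edges=[0, 4, 12, 33])  # rfft yields 33 bins
```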
  • obtaining the multi-channel audio signal 115 may comprise receiving the multi-channel audio signal 115 from the audio capturing portion 110 or over a communication network or reading the multi-channel audio signal from the memory 102, as described in the foregoing. Further along the lines described in the foregoing, the multi-channel audio signal 115 may comprise or it may be otherwise received with the audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115.
  • the multi-channel audio signal 115 together with the audio metadata, enables applying the spatial audio processing procedure to create the spatial audio signal 125 (directly) based on the multi-channel audio signal 115 or via creating an intermediate spatial audio signal based on the multi-channel audio signal 115 and carrying out the method 200 based on the intermediate spatial audio signal.
  • an audio effect to be applied to sounds in a given one of the indicated sound directions may be referred to as the audio effect associated with the given one of the indicated sound directions.
  • the one or more sound directions may be also referred to as respective sound directions of interest.
  • obtaining the indications of the one or more sound directions within the spatial audio image and the respective indications of the one or more audio effects associated therewith may comprise receiving or deriving the respective indications based on user input.
  • the audio processing portion 120 may be arranged to receive the respective indications of the sound directions of interest and the audio effects associated therewith as user input provided via a user interface (UI) of a device implementing the audio processing portion 120, or to derive these indications based on the user input received via the UI of the device implementing the audio processing portion 120.
  • the multi-channel audio signal 115 may be accompanied by a video stream (being) captured together with the multi-channel audio signal 115 that depicts at least some of the sound sources represented in the multi-channel audio signal 115 and that is displayed to the user via the UI.
  • the UI may provide the user with a possibility to indicate one or more sound sources of interest by pointing at their respective illustrations in the displayed video stream, which indicated one or more sound sources may be converted into respective one or more sound directions of interest based on their positions in the displayed images of the video stream.
  • the UI may provide the user with a possibility to select, for each of the one or more sound sources of interest, a respective audio effect to be applied to sounds originating from the respective one of the one or more sound sources of interest.
  • selection of the audio effect to be applied may be made from a set of one or more predefined audio effects.
  • the one or more sound directions of interest may be defined as respective horizontal directions within the spatial audio image represented by the multi-channel audio signal 115 and/or as respective vertical directions within the spatial audio image.
  • references to sound directions (at least implicitly) assume horizontal directions within the spatial audio image, whereas the examples readily generalize into applying vertical directions in addition to or instead of horizontal directions, mutatis mutandis.
  • a sound direction in horizontal direction of the spatial audio image may be defined as an (azimuth) angle with respect to a reference direction.
  • the reference direction is typically, but not necessarily, a direction directly in front of the assumed listening point.
  • the reference direction may be defined as 0° (i.e. zero degrees), whereas a sound direction that is to the left of the reference direction may be indicated by a respective angle in the range 0° < φ ≤ 180° and a sound direction that is to the right of the reference direction may be indicated by a respective angle in the range −180° ≤ φ < 0°, with directions at 180° and −180° indicating a sound direction opposite to the reference direction.
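A minimal helper implementing the azimuth convention above (0° at the reference direction, positive angles to the left, negative to the right, ±180° opposite), assuming angles are handled in degrees:

```python
def wrap_azimuth(angle_deg):
    """Wrap an azimuth angle to the interval (-180, 180] degrees,
    matching the convention: 0 is the reference direction, positive
    angles are to the left, negative angles to the right."""
    wrapped = angle_deg % 360.0
    if wrapped > 180.0:
        wrapped -= 360.0
    return wrapped
```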
  • the respective audio effect to be applied to sounds in one of the one or more sound directions of interest may involve any audio processing that modifies at least one audio characteristic of the respective sounds.
  • applicable audio effects include the following: audio equalization according to a predefined or user-selectable profile, pitch shifting in a predefined or user-selectable manner, vibrato at a predefined or user-selectable rate and at a predefined or user-selectable extent, tremolo in a predefined or user-selectable manner, a vocoder effect in a predefined or user-selectable manner, etc.
  • the audio effect to be applied may be the same for each of the one or more sound directions of interest or the audio effect to be applied may be different across the one or more sound directions of interest.
  • a first audio effect may be selected for a first sound direction and a second audio effect may be selected for a second sound direction (which is a sound direction different from the first sound direction), where the first and second audio effects may be the same (or similar) or the first and second audio effects may be different from each other.
  • derivation of the at least one sound signal 221 that represents sounds in the one or more sound directions of interest and of the background signal 223 that represents sounds in other directions of the spatial audio image based on the multi-channel audio signal 115 may comprise selectively applying an audio focusing technique to derive the at least one sound signal 221 and the background signal 223 in view of one or more of the following aspects:
  • Audio focusing aims at extracting a focused audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions. In a practical implementation, the focused audio signal may encompass sounds in sound directions falling within a focus pattern (substantially) centered at the focus direction while suppressing sounds in other sound directions, where the focus pattern directed to the focus direction encompasses a predefined sub-range of sound directions (substantially) centered at the focus direction, which sub-range may be also referred to as a focus width.
  • An example of audio focusing that may be especially suited for audio processing that relies on the multi-channel audio signal 115 obtained from the microphone array 112 comprises audio beamforming outlined in the foregoing.
  • Audio beamforming aims at extracting a beamformed audio signal that represents sounds in a beam direction while suppressing sounds in other sound directions, whereas in a practical implementation the beamformed audio signal may encompass sounds in sound directions falling within a beam pattern (substantially) centered at the beam direction while suppressing sounds in other sound directions.
  • a beam pattern directed to the beam direction encompasses a predefined sub-range of sound directions (substantially) centered at the beam direction, which sub-range may be also referred to as a beam width.
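As a minimal sketch of audio beamforming, a frequency-domain delay-and-sum beamformer for a far-field source and a linear array might look as follows; practical systems typically use more elaborate beam designs, and the array geometry, speed of sound and sign convention here are assumptions for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, an assumed value at room temperature

def delay_and_sum(mic_signals, mic_positions, beam_direction_deg, fs):
    """Frequency-domain delay-and-sum beamformer for a linear array.

    mic_signals   : array (n_mics, n_samples), simultaneous captures.
    mic_positions : array (n_mics,), positions along the array axis in m.
    Returns a monaural beamformed signal emphasizing the beam direction.
    """
    n_mics, n_samples = mic_signals.shape
    spectra = np.fft.rfft(mic_signals, axis=-1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Far-field plane-wave steering delays toward the beam direction.
    delays = (mic_positions * np.sin(np.deg2rad(beam_direction_deg))
              / SPEED_OF_SOUND)
    phases = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * phases
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)

# Sanity check: identical signals with a broadside (0 degree) beam pass
# through unchanged, since all steering delays are zero.
rng = np.random.default_rng(1)
s = rng.standard_normal(256)
out = delay_and_sum(np.stack([s, s, s]),
                    np.array([-0.05, 0.0, 0.05]), 0.0, 48000)
```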
  • monaural audio focusing for deriving, based on the multi-channel audio signal 115, a monaural sound signal that represents sounds in a focus direction while suppressing sounds in other sound directions;
  • spatial audio focusing for deriving, based on the multi-channel audio signal 115, a spatial sound signal that represents sounds in a desired range of sound directions such that they are retained in their respective original sound directions within the spatial audio image represented by the multi-channel audio signal 115 while suppressing sounds in other sound directions.
  • the monaural audio focusing may be provided by directly applying a focus pattern of a desired focus width directed at a sound direction of interest to derive a respective monaural sound signal.
  • Monaural audio focusing may be accompanied or followed by determination of one or more spatial characteristics of the spatial audio image represented by the multi-channel audio signal 115, such as sound direction(s) and/or diffuseness.
  • the monaural focused audio signal resulting from monaural audio focusing may be used as basis for creating a respective spatial sound signal, where the monaural focused audio signal is panned to its original sound direction within the spatial audio image and possibly also modified to exhibit the determined diffuseness, thereby providing a spatial sound signal that represents sounds in the sound direction of interest while suppressing sounds in other sound directions.
  • the spatial audio focusing may be applied for a single sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the single sound direction to a first direction (e.g. to the left) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the single sound direction to a second direction that is opposite to the first direction (e.g. to the right) to derive a second channel of the spatial sound signal.
  • the spatial audio focusing may be applied for a range of sound directions from a first sound direction to a second sound direction, e.g. such that the spatial sound signal represents sounds in the sub-range of sound directions between the first and second sound directions, panned in their original spatial positions in the spatial audio image, while sounds in other sound directions of the spatial audio image are substantially suppressed.
  • the offsets applied in the spatial audio focusing procedure may be predefined ones or they may be selected in view of characteristics of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, e.g. in view of the positions of the microphones 112-k with respect to each other.
  • a spatial focused audio signal that represents sounds in a sound direction of interest may be derived via deriving a left channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the left of the sound direction of interest and deriving a right channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the right of the sound direction of interest.
  • for derivation of the left channel, the phase center may be at the location of a microphone on the left side of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, and for derivation of the right channel the phase center may be at the location of a microphone on the right side of said microphone array 112.
  • a phase center is at such a location that a microphone signal from the phase center is not delayed with respect to microphone signals from other microphones during the audio focusing procedure (e.g. in the context of audio beamforming).
  • the spatial audio beamforming may be carried out on the basis of a spatial audio signal that is derived based on the multi-channel audio signal 115 (or that is received at the audio processing portion 220 instead of the multi-channel audio signal 115).
  • sound directions represented by the spatial audio signal are derived via analysis of the spatial audio signal and the spatial audio signal is amplified in those time-frequency tiles that are found, via the analysis of the spatial audio signal, to represent a direction of interest and/or the spatial audio signal is attenuated in those time-frequency tiles that are found not to represent a direction of interest.
  • Spatial audio focusing provides the benefit of maintaining directional sounds of the spatial audio image in their correct sound directions, whereas in monaural audio focusing the sound direction is ‘lost’ and needs to be recreated in order to maintain similar spatial characteristics.
  • spatial audio focusing enables a more pleasant and more natural-sounding audio image that also ensures retaining the spatial cues that facilitate spatial perception by the human hearing system.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction φ1 to derive a first sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith, where the first sound signal comprises a monaural audio signal.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range of sound directions outside the first focus pattern applied for derivation of the first sound signal.
  • the range of sound directions outside the first focus pattern may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions (that each cover a respective sub-range of sound directions outside the first focus pattern) the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115.
  • the subtraction may involve creating a respective monaural focused (e.g. beamformed) signal for each channel of the multi-channel audio signal 115 and subtracting it from the respective channel.
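The subtraction-based derivation of the background signal can be illustrated with an idealized sketch in which the focused signal exactly captures the directional component; in practice the focusing only approximates it, so residual leakage remains in the background (which the later attenuation step addresses).

```python
import numpy as np

def background_by_subtraction(multi_channel, focused_per_channel):
    """Sketch: the background signal is what remains of each channel
    after removing the focused (e.g. beamformed) rendition of the
    sounds in the direction of interest for that channel.

    multi_channel       : array (channels, samples)
    focused_per_channel : array (channels, samples)
    """
    return multi_channel - focused_per_channel

# Synthetic demo: two channels containing a 'directional' component d
# plus distinct ambience; subtracting d recovers the ambience exactly.
rng = np.random.default_rng(2)
d = rng.standard_normal(128)              # directional sound
ambience = rng.standard_normal((2, 128))  # per-channel background
multi = ambience + d                      # d broadcast to both channels
focused = np.stack([d, d])                # ideal per-channel focus
background = background_by_subtraction(multi, focused)
```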
  • In a second example, there are two sound directions of interest, denoted as a first sound direction φ1 and a second sound direction φ2, where a first audio effect is to be applied to sounds in the first sound direction φ1 and a second audio effect is to be applied to sounds in the second sound direction φ2, where the first audio effect is different from the second audio effect.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction φ1 to derive a first sound signal and applying monaural audio focusing using a second focus pattern of a predefined focus width directed at the second sound direction φ2 to derive a second sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second sound direction φ2 and that has the second audio effect associated therewith, where both the first and second sound signals comprise respective monaural audio signals.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range(s) of sound directions outside the first and second focus patterns applied for derivation of the first and second sound signals.
  • the range(s) of sound directions outside the first and second focus patterns may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first and second sound signals from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers a first range of sound directions from the first sound direction φ1 to the second sound direction φ2 to derive the first sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first and second sound directions φ1, φ2 and that has the first audio effect associated therewith, where the first sound signal comprises a spatial audio signal.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the (complementary) range of sound directions outside the first range of sound directions applied for derivation of the first sound signal.
  • the range of sound directions outside the first range of sound directions may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
  • the distance threshold Δthr may be set in dependence of characteristics of the multi-channel audio signal 115, e.g. in view of the spatial arrangement of the microphones of the microphone array 112 applied in capturing the microphone signals that serve as basis for the multi-channel audio signal 115.
  • the distance threshold Δthr may be set such that the respective focus patterns directed to the first and second sound directions φ1, φ2 do not overlap when the distance Δdist therebetween exceeds the distance threshold Δthr.
  • the selection of the type of audio focusing and/or the focus width may be carried out in dependence of the presence of further (directional) sound sources in sound directions between the first and second sound directions φ1, φ2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to not detecting presence of any directional sound sources in sound directions between the first and second sound directions φ1, φ2, whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to detecting presence of one or more directional sound sources in sound directions between the first and second sound directions φ1, φ2.
  • Presence of one or more directional sound sources (or lack thereof) between the first and second sound directions φ1, φ2 may be evaluated via analysis of the multi-channel audio signal 115, for example by directing one or more auxiliary focus patterns between the first and second sound directions φ1, φ2 to derive respective auxiliary sound signals and determining presence of one or more directional sound sources in sound directions between the first and second sound directions φ1, φ2 in response to any of the auxiliary sound signals exhibiting a signal level (e.g. signal energy) that is close to that of the sound signals derived from the first sound direction φ1 and/or from the second sound direction φ2.
  • an auxiliary sound signal may be considered to represent a further sound source in case its signal level is above a predefined percentage of that of the sound signal representing sounds in the first sound direction φ1 and/or that of the sound signal representing sounds in the second sound direction φ2.
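A sketch of this level-based detection of further sound sources; the 50 % energy ratio stands in for the "predefined percentage" and is an illustrative assumption, not a value given in the text.

```python
import numpy as np

def directional_source_between(aux_signals, first_signal, second_signal,
                               level_ratio=0.5):
    """Decide whether any auxiliary focus direction between the two
    directions of interest contains a directional sound source, by
    comparing signal energies against the louder of the two focused
    signals. level_ratio is an assumed detection threshold."""
    ref = max(np.sum(first_signal ** 2), np.sum(second_signal ** 2))
    return any(np.sum(a ** 2) >= level_ratio * ref for a in aux_signals)

# Example: a quiet auxiliary direction vs. a loud one.
first = np.ones(100)             # focused signal at the first direction
second = 0.5 * np.ones(100)      # focused signal at the second direction
quiet_aux = [0.01 * np.ones(100)]
loud_aux = [0.9 * np.ones(100)]
```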
  • In a fourth example, there are three sound directions of interest, denoted as a first sound direction φ1, a second sound direction φ2 and a third sound direction φ3, where a first audio effect is to be applied to sounds in the first sound direction φ1 and where a second audio effect that is different from the first audio effect is to be applied to sounds in the second and third sound directions φ2, φ3.
  • derivation of the at least one sound signal 221 with respect to the first sound direction φ1 may be carried out as described in the foregoing for the first example
  • derivation of the at least one sound signal 221 with respect to the second and third sound directions φ2, φ3 may be carried out as described in the foregoing for the third example.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second and third sound directions φ2, φ3 and that has the second audio effect associated therewith, where the first sound signal comprises a monaural audio signal and where the second sound signal comprises a spatial audio signal.
  • An advantage arising from deriving the at least one sound signal 221 that each represent one or more sound directions within the spatial audio image represented by the multi-channel audio signal 115, e.g. according to the first to fourth examples described in the foregoing, is that the respective audio effect can be applied to a certain one of the at least one sound signal 221 such that it does not interfere with the respective audio effects applied to other ones of the at least one sound signal 221 and/or with the sounds represented by the background signal 223, thereby ensuring introduction of the audio effects as intended and, consequently, avoiding audible distortions that may arise from the audio effects interfering with each other and/or with the sounds in the background.
  • derivation of the at least one modified sound signal 225 based on the at least one sound signal 221 via application of the respective audio effect may comprise deriving the at least one modified sound signal 225 in dependence of the type and the number of sound signals provided as the at least one sound signal 221.
  • the at least one sound signal 221 may comprise one or more of the following:
  • the at least one sound signal 221 comprises one or more monaural sound signals that each represent sounds in a respective one of the one or more sound directions of interest and/or one or more spatial sound signals that each represent sounds in respective at least two of the one or more sound directions of interest, where each sound signal has a respective audio effect associated therewith.
  • the aspect of deriving the respective at least one modified sound signal 225 based on the at least one sound signal 221 comprises separately applying, for each of the sound signals, respective audio processing that implements the audio effect associated therewith, in dependence of the type of the respective sound signal:
  • derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of interest that the respective sound signal serves to represent;
  • derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal, whereas the sound content of the resulting spatial sound signal is readily arranged in the respective sound directions of interest that the respective sound signal serves to represent.
  • derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise deriving the spatial audio signal as the sum (or as another linear combination) of the at least one modified sound signal 225 and the background signal.
  • derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise combining the at least one modified sound signal 225 and the background signal in view of the energy level of the at least one modified sound signal 225 in relation to the energy level of the at least one sound signal 221.
  • a single modified sound signal may comprise a monaural audio signal that represents sounds in a single sound direction (e.g. in the first sound direction φ1) or a spatial audio signal that represents sounds in one or more sound directions.
  • the method 300 proceeds by obtaining a sound signal, a modified sound signal derived based on the sound signal, and the background signal 223, as indicated in block 302.
  • the sound signal may comprise one of the at least one sound signal 221 and the modified sound signal may comprise the respective one of the at least one modified sound signal 225.
  • the sound signal (in frequency domain) is denoted as S
  • the modified sound signal (in frequency domain) is denoted as S'
  • the background signal (in frequency domain) may be (also) denoted as B.
  • the method 300 further comprises computing respective energy of one or more frequency bands of the sound signal S, as indicated in block 304, and computing respective energy in one or more frequency bands of the modified sound signal S', as indicated in block 306.
  • the method may optionally further comprise computing respective energy in one or more frequency bands of the background signal B, as indicated in block 308.
  • the energy of the sound signal S in frequency band k (e.g. in the time-frequency tile S(k,n)) may be denoted as Es(k,n), and the energy of the modified sound signal S' in frequency band k may be denoted as Es'(k,n).
  • the method 300 further comprises attenuating the background signal 223 in those frequency bands k where the energy of the sound signal S is higher than the respective energy of the modified sound signal S', thereby deriving a modified background signal B', as indicated in block 310.
  • the method 300 proceeds into deriving the spatial audio signal 125 as a combination of the modified sound signal S' and the modified background signal B', as indicated in block 312.
  • the spatial audio signal 125 may be derived e.g. as a sum, as an average or as another linear combination of the modified sound signal S' and the modified background signal B'.
  • Attenuation of the background signal 223 enables avoiding audible disturbances e.g. in a scenario where implementation of the respective audio effect on one or more of the at least one sound signal 221 results in energy reduction in one or more frequency bands and where, due to limitations of real-life implementations of the audio focusing procedures, some audio components intended for inclusion in the at least one sound signal 221 only remain in the background signal 223.
  • such audio components that may remain in the background signal could interfere with the audio effect applied for the corresponding sound signal via introduction of the respective modified sound signal, whereas attenuation of some frequency bands of the background signal 223 serves to reduce or even completely avoid such interference.
  • the energy of the sound signal S may be considered to be higher than the respective energy of the modified sound signal S' in frequency band k in response to the difference in respective energies in the frequency band k exceeding an energy difference threshold Ethr(k) assigned for the frequency band k.
  • the energy difference threshold Ethr(k) may be the same across frequency bands or the energy difference threshold Ethr(k) may be varied across frequency bands, for example such that the energy difference threshold Ethr(k) substantially matches a masking threshold for the respective frequency band.
  • a masking threshold for a frequency band k represents the energy level required for an additional sound in order for it to be audible in the presence of another sound in the frequency band k.
  • the respective energy difference threshold E thr (k ) may be set to zero for one or more predefined frequency band or for all frequency bands in order to reduce computational complexity.
  • the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be independent of the extent of difference in energy (e.g. the energy difference ⁇ E(k)).
  • the amount of attenuation to be applied to the time-frequency tile B(k,n) of the background signal B to derive the corresponding time-frequency tile B'(k,n) of the modified background signal B' may be the same across frequency bands, e.g. the scaling factor g_k may be the same across frequency bands.
  • a gain factor g_k that has either a too small or a too large value may result in audible distortions in the spatial audio signal 125 and thus the gain factor g_k may be limited to a predefined range, e.g. to one that provides attenuation in a range from 2 to 4 dB.
  • the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be dependent on the extent of difference in energy (e.g. the energy difference ΔE(k)), for example such that the amount of attenuation to be applied in a frequency band k is directly proportional to the difference in respective energies of the sound signal S and the modified sound signal S' in the frequency band k.
  • This may be provided, for example, by setting the value of the scaling factor g_k for deriving the time-frequency tile B'(k,n) of the modified background signal B' such that it is directly proportional to the energy difference ΔE(k) in the frequency band k.
  • Figure 6 illustrates a block diagram of some components of an exemplifying apparatus 400.
  • the apparatus 400 may comprise further components, elements or portions that are not depicted in Figure 6.
  • the apparatus 400 may be employed e.g. in implementing one or more components described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 comprises a processor 416 and a memory 415 for storing data and computer program code 417.
  • the memory 415 and a portion of the computer program code 417 stored therein may be further arranged, with the processor 416, to implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 comprises a communication portion 412 for communication with other devices.
  • the communication portion 412 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses.
  • a communication apparatus of the communication portion 412 may also be referred to as a respective communication means.
  • the apparatus 400 may further comprise user I/O (input/output) components 418 that may be arranged, possibly together with the processor 416 and a portion of the computer program code 417, to provide a user interface for receiving input from a user of the apparatus 400 and/or providing output to the user of the apparatus 400 to control at least some aspects of operation of audio processing portion 220 that are implemented by the apparatus 400.
  • the user I/O components 418 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc.
  • the user I/O components 418 may be also referred to as peripherals.
  • the processor 416 may be arranged to control operation of the apparatus 400 e.g. in accordance with a portion of the computer program code 417 and possibly further in accordance with the user input received via the user I/O components 418 and/or in accordance with information received via the communication portion 412.
  • although the processor 416 is depicted as a single component, it may be implemented as one or more separate processing components.
  • although the memory 415 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • the computer program code 417 stored in the memory 415 may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 400 when loaded into the processor 416.
  • the computer-executable instructions may be provided as one or more sequences of one or more instructions.
  • the processor 416 is able to load and execute the computer program code 417 by reading the one or more sequences of one or more instructions included therein from the memory 415.
  • the one or more sequences of one or more instructions may be configured to, when executed by the processor 416, cause the apparatus 400 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 may comprise at least one processor 416 and at least one memory 415 including the computer program code 417 for one or more programs, the at least one memory 415 and the computer program code 417 configured to, with the at least one processor 416, cause the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the computer programs stored in the memory 415 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 417 stored thereon, which computer program code, when executed by the apparatus 400, causes the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program.
  • the computer program may be provided as a signal configured to reliably transfer the computer program.
  • reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc.
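The per-band background attenuation described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `attenuate_background` and the parameter names are hypothetical, per-band energies are assumed to be available already, and the sketch combines the proportional-attenuation variant with the 2–4 dB limiting mentioned above.

```python
def attenuate_background(e_s, e_s_mod, b_tiles, e_thr=0.0,
                         min_att_db=2.0, max_att_db=4.0):
    """Derive modified background tiles B'(k) from B(k).

    e_s, e_s_mod: per-band energies of the sound signal S and the
    modified sound signal S'. b_tiles: per-band background values B(k).
    Bands where the audio effect reduced energy by more than e_thr are
    attenuated; the attenuation grows with the relative energy drop but
    is kept within [min_att_db, max_att_db], per the range above.
    """
    out = []
    for e1, e2, b in zip(e_s, e_s_mod, b_tiles):
        delta = e1 - e2  # energy reduction caused by the audio effect
        if delta > e_thr:
            frac = min(1.0, delta / max(e1, 1e-12))  # relative drop in [0, 1]
            att_db = min_att_db + (max_att_db - min_att_db) * frac
            g = 10.0 ** (-att_db / 20.0)  # dB attenuation -> linear gain
            out.append(g * b)
        else:
            out.append(b)  # energy not reduced: leave the band untouched
    return out
```

For example, with `e_s=[1.0, 0.5]` and `e_s_mod=[0.2, 0.5]`, only the first band is attenuated (here by 3.6 dB); the second band passes through unchanged.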

Abstract

According to an example embodiment, an apparatus for audio processing is provided, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.

Description

Audio processing

TECHNICAL FIELD
The example and non-limiting embodiments of the present invention relate to processing of audio signals. In particular, various example embodiments of the present invention relate to audio processing that involves deriving a processed audio signal where characteristics of respective sounds in one or more sound directions of a spatial audio image represented by an input audio signal are modified.
BACKGROUND

With the development of microphone technologies and with increase in processing power and storage capacity available in mobile devices, many mobile devices, such as mobile phones, tablet computers, laptop computers, digital cameras, etc. are provided with microphone arrangements that enable capturing audio signals that represent an audio scene around the mobile device. Typically, the process of capturing such an audio signal using the mobile device comprises operating a microphone array of the mobile device to capture a plurality of microphone signals and processing the captured microphone signals into a digital audio signal for playback in the mobile device or for further processing in the mobile device, for storage in the mobile device and/or for transmission to one or more other devices to enable subsequent playback in the mobile device or in another device.
The digital audio signal may be one that conveys a spatial audio image that represents the audio scene around the mobile device, either as such or together with spatial metadata that defines at least some characteristics of the spatial audio scene. As examples in this regard, the recorded audio signal may comprise a multi-channel signal where each channel is (substantially directly) based on the respective microphone signal, or the recorded audio signal may comprise a spatial audio signal derived based on the microphone signals. Typically, although not necessarily, the digital audio is captured together with the associated video.

Capturing a digital audio signal that represents an audio scene around the mobile device provides interesting possibilities for processing the spatial audio image conveyed by the digital audio signal during the capture and/or after the capture. As an example in this regard, upon or after capturing the digital audio signal that conveys the spatial audio image that represents the audio scene around the mobile device, a user may wish to modify characteristics of one or more sound sources in the spatial audio image, for example, to improve perceptual quality or clarity of the one or more sound sources or for entertainment purposes. A straightforward approach for implementing such a procedure includes extracting, from the digital audio signal, a sound signal representing a sound source of interest, modifying the sound signal in a desired manner and inserting the modified sound signal back to the digital audio signal.
Extraction of the sound signal may be carried out by applying an audio focusing technique known in the art to the digital audio signal, where the audio focusing aims at representing sounds in a desired sound direction within a spatial audio image represented by the digital audio signal while excluding sounds in other sound directions. A typical solution for audio focusing involves audio beamforming, which is a technique well known in the art. Hence, an audio beamforming procedure aims at extracting a beamformed audio signal that represents sounds in a sound direction of interest while suppressing sound in other sound directions. In context of the audio beamforming, the sound direction of interest may be referred to as a beam direction. Other techniques for accomplishing audio focusing on a sound direction of interest include, for example, the one described in [1]. In context of the present disclosure, the term audio focusing is applied to refer to an audio processing technique that involves emphasizing sounds in certain sound directions of a spatial audio image in relation to sounds in other sound directions of the spatial audio image.
While in principle the aim of the beamforming is to extract or derive a beamformed audio signal that represents sounds in the beam direction without representing sounds in other sound directions, in a practical implementation isolation of sounds in a certain sound direction while completely excluding sounds in other directions is typically not possible. Instead, in practice the beamformed audio signal is typically an audio signal where sounds in the beam direction are emphasized in relation to sounds in other sound directions. Consequently, even if an audio beamforming procedure aims at a beamformed audio signal that only represents sounds in the beam direction, the resulting beamformed audio signal is one where sounds in the beam direction and sounds in a sub-range of directions around the beam direction are emphasized in relation to sounds in other directions in accordance with characteristics of a beam applied for the audio beamforming.
In this regard, the width of a beam applied in the audio beamforming procedure may be considered: the width of the beam may be indicated by a solid angle (typically in the horizontal direction only), which defines a sub-range of sound directions around the beam direction that are considered to fall within the beam. As an example in this regard, the solid angle may define a sub-range of sound directions around the beam direction such that sounds in sound directions outside the solid angle are attenuated at least a predefined amount in relation to a sound direction of maximum amplification (or minimum attenuation) within the solid angle. The predefined amount may be defined, for example, as 6 dB or 3 dB. However, definition of the beam width as the solid angle is a simplified model for indicating the width of the beam and hence the sub-range of sound directions encompassed by the beam when targeted to the beam direction. In a real-life implementation the beam does not strictly cover a well-defined range of sound directions around the sound direction of interest; rather, the beam has a width that may vary with audio signal characteristics, with the position of the beam direction within the spatial audio image under consideration and/or with audio frequency.
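The beam-width definition above can be illustrated numerically: given a beam pattern, the beam may be taken as the sub-range of directions whose gain stays within a predefined drop (e.g. 6 dB) of the maximum. The sketch below uses an idealized cardioid pattern as an assumed example; the pattern and the function names are illustrative, not part of the disclosure.

```python
import math

def cardioid_gain(theta_deg, beam_deg):
    """Amplitude gain of an idealized cardioid beam steered to beam_deg."""
    return 0.5 * (1.0 + math.cos(math.radians(theta_deg - beam_deg)))

def beam_width_deg(gain_fn, beam_deg, drop_db=6.0):
    """Width (in degrees, 1-degree sampling) of the sub-range of sound
    directions whose gain is within drop_db of the maximum gain, per
    the beam-width definition above."""
    peak = gain_fn(beam_deg, beam_deg)
    threshold = peak * 10.0 ** (-drop_db / 20.0)
    inside = [t for t in range(-180, 180) if gain_fn(t, beam_deg) >= threshold]
    return len(inside)
```

For the cardioid pattern the 6 dB beam width comes out at roughly 180 degrees, illustrating that even a well-behaved beam passes a wide sub-range of sound directions around the beam direction rather than isolating a single direction.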
Hence, in addition to sounds that represent a sound source of interest in the beam direction, the beamformed audio signal may also include sound components originating from other sound sources in sound directions close to the beam direction and/or ambient sound components (e.g. sounds that have no well-defined sound direction). Consequently, when the beamformed audio signal is applied as the sound signal that represents the sound source of interest in context of the above-described procedure for modifying characteristics of a sound source in the spatial audio image represented by a digital audio signal, modification of characteristics of the beamformed audio signal in order to modify characteristics of the sound source it serves to represent may unintentionally also modify sound components from other sound sources and/or ambient sound components. This in turn may result in an audio effect different from the one intended and/or in distortion of the modified digital audio signal that results from introduction of the modified beamformed audio signal back to the digital audio signal. While the above description applies audio beamforming as an example of audio focusing techniques in general, similar challenges are involved in audio focusing techniques of other kinds as well.
References:
[1] WO 2014/162171 A1

SUMMARY
According to an example embodiment, a method for audio processing is provided, the method comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal. 
According to another example embodiment, an apparatus for audio processing is provided, the apparatus configured to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, an apparatus for audio processing is provided, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, an apparatus for audio processing is provided, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, a computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing is provided.
The computer readable medium may comprise a volatile or a non-volatile computer-readable record medium, thereby providing a computer program product comprising at least one computer readable non-transitory medium having program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing stored thereon.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF FIGURES
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system according to an example;
Figures 2A and 2B illustrate respective block diagrams of some components and/or entities of an audio processing sub-system according to an example; Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion according to a non-limiting example;
Figure 4 illustrates a flowchart depicting a method according to an example;
Figure 5 illustrates a flowchart depicting a method according to an example; and
Figure 6 illustrates a block diagram of some elements of an apparatus according to an example.
DESCRIPTION OF SOME EMBODIMENTS
Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system 100 according to a non-limiting example. The audio processing system 100 comprises an audio capturing portion 110 and an audio processing portion 120. The audio capturing portion 110 is coupled to a microphone array 112 and it is arranged to receive respective microphone signals from a plurality of microphones 112-1, 112-2, ..., 112-K and to record a captured multi-channel audio signal 115 based on the received microphone signals. The microphones 112-1, 112-2, ..., 112-K represent a plurality of (i.e. two or more) microphones, where an individual one of the microphones may be referred to as a microphone 112-k or as a microphone 112-j. Herein, the concept of the microphone array 112 is to be construed broadly, encompassing any arrangement of two or more microphones 112-k arranged in or coupled to a device implementing the audio processing system 100. The audio processing portion 120 is arranged to receive the multi-channel audio signal 115 from the audio capturing portion 110 and to derive a spatial audio signal 125 based on the multi-channel audio signal 115.
Each of the microphone signals represents the same captured sound while they provide a respective different representation of the captured sound, which difference depends on the positions of the microphones 112-k with respect to each other. For a sound source in a certain spatial position with respect to the microphone array 112, this results in a different representation of sounds originating from the certain sound source in each of the microphone signals: a microphone 112-k that is closer to the certain sound source captures the sound originating therefrom at a higher amplitude and earlier than a microphone 112-j that is further away from the certain sound source. Together with the knowledge regarding the positions of the microphones 112-k with respect to each other, such differences in amplitude and/or time delay enable derivation of a spatial audio signal that represents the audio scene at the time (and place) of capturing the microphone signals and/or applying spatial audio processing based on the multi-channel audio signal 115. The representation of the spatial audio scene captured in the microphone signals and, consequently, in the multi-channel audio signal 115 may be referred to as a spatial audio image.
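The amplitude and time-delay relationship described above can be made concrete for the simplest case of two microphones under a far-field assumption: a plane wave arriving from direction θ reaches the second microphone delayed by d·sin(θ)/c, where d is the microphone spacing and c the speed of sound. The helpers below are a simplified sketch of that geometric model with hypothetical names, not the disclosed method.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def delay_from_direction(theta_deg, spacing_m=0.1, c=SPEED_OF_SOUND):
    """Inter-microphone time delay (s) for a far-field source at theta_deg."""
    return spacing_m * math.sin(math.radians(theta_deg)) / c

def direction_from_delay(tau_s, spacing_m=0.1, c=SPEED_OF_SOUND):
    """Invert the far-field model; the asin argument is clamped to [-1, 1]
    to stay robust against delay estimates slightly outside the physical range."""
    x = max(-1.0, min(1.0, c * tau_s / spacing_m))
    return math.degrees(math.asin(x))
```

In a real system the delay would be estimated from the microphone signals themselves (e.g. via cross-correlation) and then mapped back to a sound direction with a model of this kind, given the known microphone positions.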
The audio capturing portion 110 may be arranged to record a respective digital audio signal based on each of the microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112 at a predefined sampling frequency using a predefined bit depth and to provide the recorded digital audio signals as the multi-channel audio signal 115 to the audio processing portion 120 for further audio processing therein. In this regard, each digital audio signal recorded at the audio capturing portion 110 may be considered as a respective channel of the multi-channel audio signal 115. The multi-channel audio signal 115 may comprise or may be accompanied with audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115, e.g. one or more of the following: the sampling rate (or sampling frequency) applied in the multi-channel audio signal 115, the bit depth applied in the multi-channel audio signal 115, channel configuration information that serves to define the relationship between the channels of the multi-channel audio signal 115, e.g. the respective positions and/or orientations of the microphones 112-k of the microphone array 112 (with respect to a reference position/orientation and/or with respect to other microphones 112-k of the microphone array 112) applied to capture the microphone signals serving as basis for the multi-channel audio signal 115, etc.

Figures 2A and 2B illustrate respective block diagrams of some components of respective audio processing sub-systems 100a and 100b according to a non-limiting example. The audio processing sub-system 100a comprises the microphone array 112 and the audio capturing portion 110 described in the foregoing together with a memory 102.
A difference to operation of the audio processing system 100 is that instead of providing the multi-channel audio signal 115 to the audio processing portion 120, the audio capturing portion 110 may be arranged to store the multi-channel audio signal 115 in the memory 102 for subsequent access by the audio processing portion 120. In this regard, the multi-channel audio signal 115 may be stored in the memory 102 together with the audio metadata described in the foregoing.
The audio processing sub-system 100b comprises the memory 102 and the audio processing portion 120 described in the foregoing. Hence, a difference in operation of the audio processing sub-system 100b in comparison to a corresponding aspect of operation of the audio processing system 100 is that instead of (directly) obtaining the multi-channel audio signal 115 from the audio capturing portion 110, the audio processing portion 120 reads the multi-channel audio signal 115, possibly together with the audio metadata, from the memory 102.
In the example provided via respective illustrations of Figures 2A and 2B, the memory 102 read by the audio processing portion 120 is the same one to which the audio capturing portion 110 stores the multi-channel audio signal 115 recorded or derived based on the respective microphone signals obtained from the microphones 112-k of the microphone array 112. As an example, such an arrangement may be provided by providing the audio processing sub-systems 100a and 100b in a single device that also includes (or otherwise has access to) the memory 102. In another example in this regard, the audio processing sub-systems 100a and 100b may be provided in a single device or in two separate devices and the memory 102 may comprise a memory provided in a removable memory device such as a memory card or a memory stick that enables subsequent access to the multi-channel audio signal 115 (possibly together with the metadata) in the memory 102 by the same device that stored this information therein or by another device. In a further variation of the example provided via respective illustrations of Figures 2A and 2B, the memory 102 may be provided in a further device, e.g. in a server device, that is communicatively coupled to a device providing both audio processing sub-systems 100a, 100b or to respective devices providing the audio processing sub-system 100a and the audio processing sub-system 100b by a communication network.
In a further variation of the example provided via respective illustrations of Figures 2A and 2B, the memory 102 may be replaced by a transmission channel or by a communication network that enables transferring the multi-channel audio signal 115 (possibly together with the audio metadata) from a first device providing the audio processing sub-system 100a to a second device providing the audio processing sub-system 100b. In this regard, the transfer of the multi-channel audio signal 115 may comprise transmitting/receiving the multi-channel audio signal 115 as an audio packet stream, whereas the audio capturing portion 110 may further operate to encode the multi-channel audio signal 115 into encoded audio data suitable for transmission in the audio packet stream and the audio processing portion 120 may further operate to decode the encoded audio data received in the audio packet stream into the multi-channel audio signal 115 (or into a reconstructed version thereof) for the audio processing therein.

The audio processing portion 120 may be arranged to carry out a spatial audio processing procedure that results in modifying audio characteristics of respective sounds of one or more sound sources included in the multi-channel audio signal 115 to provide the spatial audio signal 125. Without losing generality, modification of a sound may be referred to as an application of an audio effect to the respective sounds.
While the audio processing system 100 and operation thereof are throughout this disclosure predominantly described with references to the audio processing portion 120, 220 obtaining an input audio signal as the multi-channel audio signal 115 derived on basis of respective microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112, in other examples the audio input signal to the audio processing portion 120, 220 may comprise an audio signal of other type that represents a spatial audio image and that enables audio focusing, thereby enabling derivation of the spatial audio signal 125 as the output of the audio processing portion 120, 220. Non-limiting examples of such input audio signals of other type that represent a spatial audio image include an Ambisonic (spherical harmonic) audio format and various multi-loudspeaker audio formats (such as 5.1-channel or 7.1-channel surround sound) known in the art. Depending on the type of the applied input audio signal, it may be accompanied by (audio) metadata that includes information that defines various aspects related to characteristics of the input audio signal, including spatial characteristics such as parametric data describing the spatial audio field, e.g. respective sound direction-of-arrival estimates, respective ratios between direct and ambient sound energy components, etc. for one or more frequency bands.
Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion 220 according to a non-limiting example. The audio processing portion 220 is arranged to obtain the multi-channel audio signal 115 and respective indications of a sound direction within a spatial audio image represented by the multi-channel audio signal 115 and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions. The audio processing portion 220 comprises an audio decomposition portion 222 for deriving, based on the multi-channel audio signal 115, at least one sound signal 221 that represents sounds in the one or more sound directions within the spatial audio image and a background signal 223 that represents sounds in other directions of the spatial audio image. The audio processing portion 220 further comprises an audio effect portion 224 for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225. The audio processing portion 220 further comprises an audio combiner 226 for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal 223.
In other examples, the audio processing portion 220 may include further entities in addition to those illustrated in Figure 3 and/or some of the entities depicted in Figure 3 may be combined with other entities while providing the same or corresponding functionality. In particular, the entities illustrated in Figure 3 serve to represent logical components of the audio processing portion 220 that are arranged to perform a respective function but that do not impose structural limitations concerning implementation of the respective entity. Hence, for example, respective hardware means, respective software means or a respective combination of hardware means and software means may be applied to implement any of the entities illustrated in Figure 3 separately from the other entities, to implement any sub-combination of two or more entities illustrated in Figure 3, or to implement all entities illustrated in Figure 3 in combination. As an example in this regard, the audio processing portion 220 may be provided as one comprising means for obtaining the multi-channel audio signal 115 that represents a spatial audio image, means for obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, means for deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, means for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, and means for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223.
The spatial audio processing procedure may be carried out, for example, in accordance with a method 200 illustrated in a flowchart of Figure 4. The method 200 commences from obtaining the multi-channel audio signal 115 that represents a spatial audio image, as indicated in block 202, and obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, as indicated in block 204. The method 200 further comprises deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, as indicated in block 206. The method 200 further comprises deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, as indicated in block 208. Moreover, the method 200 further comprises deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223, as indicated in block 210.
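The flow of blocks 202 to 210 of the method 200 may be sketched as follows. This is a minimal illustration only: the `decompose`, per-direction effect and `combine` callables are hypothetical placeholders standing in for the audio decomposition portion, the audio effect portion and the audio combiner, not the claimed implementation.

```python
def process(multi_channel_audio, directions_and_effects, decompose, combine):
    """Sketch of method 200.

    directions_and_effects: list of (sound_direction, effect) pairs (block 204).
    decompose(audio, directions) -> (sound_signals, background)  (block 206).
    combine(modified_signals, background) -> spatial audio signal (block 210).
    """
    directions = [d for d, _ in directions_and_effects]
    # Block 206: split into per-direction sound signals and a background signal.
    sound_signals, background = decompose(multi_channel_audio, directions)
    # Block 208: apply the audio effect associated with each sound direction.
    modified = [effect(signal)
                for signal, (_, effect) in zip(sound_signals,
                                               directions_and_effects)]
    # Block 210: recombine modified sound signals with the background.
    return combine(modified, background)
```

With toy callables (e.g. a decomposition that picks the first channel as the focused signal and an effect that doubles amplitudes), the function wires the four blocks together in the order given in Figure 4.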
The operations described with references to blocks 202 to 210 of the method 200 may be varied or complemented in a number of ways without departing from the scope of the spatial audio processing procedure described in the present disclosure, for example in accordance with the examples described in the foregoing and in the following.
In a non-limiting example of applying the audio processing portion 220 to carry out the method 200, the audio decomposition portion 222 may be arranged to carry out operations, procedures and/or functions pertaining to block 206, the audio effect portion 224 may be arranged to carry out operations, procedures and/or functions pertaining to block 208, whereas the audio combiner 226 may be arranged to carry out operations, procedures and/or functions pertaining to block 210. In this regard, in the following the operation of the audio processing portion 220 is predominantly described with references to the method 200, while the corresponding examples readily pertain to respective entities of the audio processing portion 220 arranged to carry out the respective aspect of the method 200.
The audio processing portion 220 and the method 200 may be arranged to process the multi-channel audio signal 115 arranged in a sequence of input frames, each input frame including a respective time segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency (which may be defined, for example, in the audio metadata provided for the multi-channel audio signal 115). In a typical example, the audio processing portion 220 employs a fixed predefined frame length. In other examples, the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths. A frame length may be defined as the number of (input) samples L included in the frame for each channel of the multi-channel audio signal 115, which at the predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the audio processing portion 220 may employ a fixed frame length of 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 and L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping. These values, however, serve as non-limiting examples and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on desired framing delay and/or on available processing capacity.
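The frame-length arithmetic above may be illustrated as follows; this is a simple sketch and the function name is an assumption, not part of the disclosure.

```python
def samples_per_frame(frame_ms, sampling_hz):
    """Number of samples L per channel in a frame of the given duration.

    L = frame duration (ms) * sampling frequency (Hz) / 1000.
    Raises if the frame does not contain a whole number of samples.
    """
    n, remainder = divmod(frame_ms * sampling_hz, 1000)
    if remainder:
        raise ValueError("frame length is not an integer number of samples")
    return n
```

For the 20 ms frame of the example, this yields 160, 320, 640 and 960 samples per channel at 8, 16, 32 and 48 kHz respectively.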
At least part of the processing carried out by the audio processing portion 220 and/or the method 200 may be carried out separately for a plurality of frequency bands of the multi-channel audio signal 115. Consequently, e.g. a respective entity of the audio processing portion 220 and/or a respective step of the method 200 may involve (at least conceptually) dividing or decomposing each channel of the multi-channel audio signal 115 into a respective plurality of frequency bands, thereby providing a respective time-frequency representation for each channel of the multi-channel audio signal 115. According to a non-limiting example, division into the frequency bands may comprise transforming each channel of the multi-channel audio signal 115 from time domain into a respective frequency-domain audio signal and arranging the resulting frequency-domain samples (also referred to as frequency bins) into respective plurality of frequency bands in each of the channels. Non-limiting examples of applicable time-to-frequency-domain transforms include short-time discrete Fourier transform (STFT) and a complex-modulated quadrature-mirror filter (QMF) bank. In case the multi-channel audio signal 115 is transformed into frequency domain for carrying out an aspect of audio processing by the audio processing portion 220 and/or the method 200 (separately) for a plurality of frequency bands, respective inverse transform from the frequency domain back to the time domain may be applied before providing the spatial audio signal 125 as the output of the audio processing portion 220 and/or the method 200.
In case the (conceptual) division into the plurality of frequency bands is applied, the number of frequency bands and respective bandwidths of the frequency bands may be selected e.g. in accordance with the desired frequency resolution and/or available computing power. In an example, the frequency band structure involves 24 frequency sub-bands according to the Bark scale, an equivalent rectangular band (ERB) scale or 3rd octave band scale known in the art. In other examples, a different number of frequency sub-bands that have the same or different bandwidths may be employed. A specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a single frequency sub-band that covers a continuous subset of the input spectrum. A frequency-domain sample that represents frequency bin b in time frame n of channel i of the frequency-domain audio signal may be denoted as x(i, b, n). Division into frequency bands may be provided by arranging or grouping one or more of consecutive frequency bins obtained for a given channel in the given frame into a respective frequency band, thereby providing a plurality of frequency bands k = 0, ..., K-1, where the frequency-domain audio signal in frequency band k in time frame n of channel i may be denoted as x(i, k, n) and where the frequency-domain audio signal in frequency band k in time frame n across all channels may be denoted as x(k, n). The latter may be referred to as a respective time-frequency tile.
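The grouping of consecutive frequency bins into frequency bands described above may be sketched as follows for one channel in one frame. The band-edge representation is an assumption chosen for illustration.

```python
def group_bins_into_bands(bins, band_edges):
    """Group consecutive frequency bins into K frequency bands.

    bins: per-bin frequency-domain samples x(i, b, n) for one channel/frame.
    band_edges: ascending bin indices [e0, e1, ..., eK] delimiting K bands;
    band k holds bins e_k .. e_{k+1}-1, giving the band signals x(i, k, n).
    """
    return [bins[band_edges[k]:band_edges[k + 1]]
            for k in range(len(band_edges) - 1)]
```

For example, eight bins split at edges [0, 2, 5, 8] yield three bands of two, three and three bins; a single band covering the whole spectrum corresponds to edges [0, len(bins)].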
Referring back to operations pertaining to block 202, obtaining the multi-channel audio signal 115 may comprise receiving the multi-channel audio signal 115 from the audio capturing portion 110 or over a communication network or reading the multi-channel audio signal from the memory 102, as described in the foregoing. Further along the lines described in the foregoing, the multi-channel audio signal 115 may comprise or it may be otherwise received with the audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115. The multi-channel audio signal 115, together with the audio metadata, enables applying the spatial audio processing procedure to create the spatial audio signal 125 (directly) based on the multi-channel audio signal 115 or via creating an intermediate spatial audio signal based on the multi-channel audio signal 115 and carrying out the method 200 based on the intermediate spatial audio signal.

Referring back to operations pertaining to block 204, an audio effect to be applied to sounds in a given one of the indicated sound directions may be referred to as the audio effect associated with the given one of the indicated sound directions. Herein, the one or more sound directions may be also referred to as respective sound directions of interest. In this regard, obtaining the indications of the one or more sound directions within the spatial audio image and the respective indications of the one or more audio effects associated therewith may comprise receiving or deriving the respective indications based on user input.
As an example, the audio processing portion 120 may be arranged to receive the respective indications of the sound directions of interest and the audio effects associated therewith as user input provided via a user interface (UI) of a device implementing the audio processing portion 120 or derive these indications based on the user input received via the UI of the device implementing the audio processing portion 120. In this regard, the multi-channel audio signal 115 may be accompanied by a video stream (being) captured together with the multi-channel audio signal 115 that depicts at least some of the sound sources represented in the multi-channel audio signal 115 and that is displayed to the user via the UI. The UI may provide the user with the possibility to indicate one or more sound sources of interest by pointing at their respective illustrations in the displayed video stream, which indicated one or more sound sources may be converted into respective one or more sound directions of interest based on their positions in the displayed images of the video stream. Moreover, the UI may provide the user with the possibility to select, for each of the one or more sound sources of interest, a respective audio effect to be applied to sounds originating from the respective one of the one or more sound sources of interest. Herein, selection of the audio effect to be applied may be made from a set of one or more predefined audio effects. The one or more sound directions of interest may be defined as respective horizontal directions within the spatial audio image represented by the multi-channel audio signal 115 and/or as respective vertical directions within the spatial audio image.
In the following examples, for clarity and brevity of description, references to sound directions (at least implicitly) assume horizontal directions within the spatial audio image, whereas the examples readily generalize into applying vertical directions in addition to or instead of horizontal directions, mutatis mutandis. A sound direction in horizontal direction of the spatial audio image may be defined as an (azimuth) angle with respect to a reference direction. The reference direction is typically, but not necessarily, a direction directly in front of the assumed listening point. The reference direction may be defined as 0° (i.e. zero degrees), whereas a sound direction that is to the left of the reference direction may be indicated by a respective angle in the range 0° < α < 180° and a sound direction that is to the right of the reference direction may be indicated by a respective angle in the range -180° < α < 0°, with directions at 180° and -180° indicating a sound direction opposite to the reference direction.
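A helper that maps an arbitrary angle onto the azimuth convention described above (0° at the reference direction, positive angles to the left, negative angles to the right, 180° directly behind) might look as follows. This is an illustrative sketch, not part of the disclosed method.

```python
def wrap_azimuth(angle_deg):
    """Wrap any angle in degrees into the interval (-180, 180].

    0 = reference direction (front of the assumed listening point),
    positive = left of the reference, negative = right, 180 = behind.
    """
    a = angle_deg % 360.0          # first map into [0, 360)
    return a if a <= 180.0 else a - 360.0
```

For example, an angle of 270° (three quarter-turns to the left) wraps to -90°, i.e. 90° to the right of the reference direction.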
The respective audio effect to be applied to sounds in one of the one or more sound directions of interest may involve any audio processing that modifies at least one audio characteristic of the respective sounds. Non-limiting examples of applicable audio effects include the following: audio equalization according to a predefined or user-selectable profile, pitch shifting in a predefined or user-selectable manner, vibrato at a predefined or user-selectable rate and at a predefined or user-selectable extent, tremolo in a predefined or user-selectable manner, a vocoder effect in a predefined or user-selectable manner, etc. The audio effect to be applied may be the same for each of the one or more sound directions of interest or the audio effect to be applied may be different across the one or more sound directions of interest. To put it in other words, a first audio effect may be selected for a first sound direction and a second audio effect may be selected for a second sound direction (which is a sound direction different from the first sound direction), where the first and second audio effects may be the same (or similar) or the first and second audio effects may be different from each other.
Referring back to operations pertaining to block 206, derivation of the at least one sound signal 221 that represents sounds in the one or more sound directions of interest and the background signal 223 that represents sounds in other directions of the spatial audio image based on the multi-channel audio signal 115 may comprise selectively applying an audio focusing technique to derive the at least one sound signal 221 and the background signal 223 in view of one or more of the following aspects:
- the number of sound directions of interest,
- the spatial distance between sound directions of interest (if more than one),
- the respective audio effects selected for sounds in the sound directions of interest.
Audio focusing aims at extracting a focused audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions. In a practical implementation, the focused audio signal may encompass sounds in sound directions falling within a focus pattern (substantially) centered at the focus direction while suppressing sounds in other sound directions, where the focus pattern directed to the focus direction encompasses a predefined sub-range of sound directions (substantially) centered at the focus direction, which sub-range may also be referred to as a focus width. An example of audio focusing that may be especially suited for audio processing that relies on the multi-channel audio signal 115 obtained from the microphone array 112 comprises the audio beamforming outlined in the foregoing. Audio beamforming aims at extracting a beamformed audio signal that represents sounds in a beam direction while suppressing sounds in other sound directions; in a practical implementation the beamformed audio signal may encompass sounds in sound directions falling within a beam pattern (substantially) centered at the beam direction while suppressing sounds in other sound directions. Hence, a beam pattern directed to the beam direction encompasses a predefined sub-range of sound directions (substantially) centered at the beam direction, which sub-range may also be referred to as a beam width.
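As a toy illustration of delay-and-sum beamforming of the kind outlined above, the following sketch aligns and averages microphone channels for a far-field plane wave arriving from a given azimuth. The linear array geometry, integer-sample delays and the speed-of-sound constant are simplifying assumptions for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def delay_and_sum(channels, mic_positions_m, focus_deg, fs):
    """Toy delay-and-sum beamformer.

    channels: equal-length sample lists, one per microphone.
    mic_positions_m: microphone positions along the array axis (meters).
    focus_deg: look direction (0 = broadside); fs: sampling frequency (Hz).
    Delays each channel so that a plane wave from focus_deg is aligned,
    then averages, reinforcing sound from the focus direction.
    """
    # Per-microphone arrival-time offsets for the look direction.
    delays_s = [p * math.sin(math.radians(focus_deg)) / SPEED_OF_SOUND
                for p in mic_positions_m]
    # Convert to non-negative integer-sample steering shifts.
    shifts = [round((max(delays_s) - d) * fs) for d in delays_s]
    n = len(channels[0])
    out = [0.0] * n
    for ch, s in zip(channels, shifts):
        for i in range(n - s):
            out[i + s] += ch[i] / len(channels)
    return out
```

Real implementations would use fractional delays (or per-bin phase shifts in the frequency domain) and a 2-D/3-D array geometry; the sketch only conveys the align-and-average principle behind a beam pattern.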
For the purposes of the present disclosure, the following applications of audio focusing are considered:
- monaural audio focusing for deriving, based on the multi-channel audio signal 115, a monaural sound signal that represents sounds in a focus direction while suppressing sounds in other sound directions;
- spatial audio focusing for deriving, based on the multi-channel audio signal 115, a spatial sound signal that represents sounds in a desired range of sound directions such that they are retained in their respective original sound directions within the spatial audio image represented by the multi-channel audio signal 115 while suppressing sounds in other sound directions.

In this regard, the monaural audio focusing may be provided by directly applying a focus pattern of a desired focus width directed at a sound direction of interest to derive a respective monaural sound signal. Monaural audio focusing may be accompanied or followed by determination of one or more spatial characteristics of the spatial audio image represented by the multi-channel audio signal 115, such as sound direction(s) and/or diffuseness. The monaural focused audio signal resulting from monaural audio focusing may be used as basis for creating a respective spatial sound signal where the monaural focused audio signal is panned to its original sound direction within the spatial audio image and possibly also modified to exhibit the determined diffuseness, thereby providing a spatial sound signal that represents sounds in the sound direction of interest while suppressing sounds in other sound directions.
According to an example, the spatial audio focusing may be applied for a single sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the single sound direction to a first direction (e.g. to the left) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the single sound direction to a second direction that is opposite to the first direction (e.g. to the right) to derive a second channel of the spatial sound signal. In another example, the spatial audio focusing may be applied for a range of sound directions from a first sound direction to a second sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the first sound direction to a first direction (e.g. further away from the second direction) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the second sound direction to a second direction that is opposite to the first direction to derive a second channel of the spatial sound signal. Consequently, the spatial sound signal represents sounds in a sub-range of sound directions between the first and second sound directions such that they are panned in their original spatial positions in the spatial audio image while sounds in other sound directions of the spatial audio image are substantially suppressed. The offsets applied in the spatial audio focusing procedure may be predefined ones or they may be selected in view of characteristics of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, e.g. in view of the positions of the microphones 112-k with respect to each other.
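The two-offset-beam construction described above may be sketched as follows, where `focus_fn` stands for any monaural focusing operation (e.g. a beamformer) and the offset value is an assumed parameter rather than one given by the disclosure.

```python
def spatial_focus(channels, focus_fn, direction_deg, offset_deg=15.0):
    """Derive a stereo (left, right) spatially focused pair for a single
    sound direction of interest by steering the same monaural focus
    operation to directions offset to either side of it.

    Positive azimuth = left of the reference direction, so the left
    channel uses direction + offset and the right uses direction - offset.
    """
    left = focus_fn(channels, direction_deg + offset_deg)
    right = focus_fn(channels, direction_deg - offset_deg)
    return left, right
```

For a range of directions from α1 to α2, the same idea applies with the two patterns offset outward from α1 and α2 respectively, so that sounds between them remain panned at their original positions.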
According to an example, along the lines described in the foregoing, a spatial focused audio signal that represents sound in a sound direction of interest (while suppressing sounds in other sound directions) may be derived via deriving a left channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the left of the sound direction of interest and deriving a right channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the right of the sound direction of interest. As an example, for derivation of the left channel the phase center may be at a location of a microphone in a left side of a microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115 and for derivation of the right channel the phase center may be at the location of a microphone in the right side of said microphone array 112. Herein, a phase center is in such a location that a microphone signal from the phase center is not delayed with respect to microphone signals from other microphones during the audio focusing procedure (e.g. in context of audio beamforming). According to another example, the spatial audio beamforming may be carried out on the basis of a spatial audio signal that is derived based on the multi-channel audio signal 115 (or that is received at the audio processing portion 220 instead of the multi-channel audio signal 115). In such a scenario, sound directions represented by the spatial audio signal are derived via analysis of the spatial audio signal and the spatial audio signal is amplified in those time-frequency tiles that are found, via the analysis of the spatial audio signal, to represent a direction of interest and/or the spatial audio signal is attenuated in those time-frequency tiles that are found not to represent a direction of interest.
Spatial audio focusing provides the benefit of maintaining directional sounds of the spatial audio image in their correct sound directions, whereas in monaural audio focusing the sound direction is ‘lost’ and needs to be recreated in order to maintain similar spatial characteristics. In this regard, spatial audio focusing enables a more pleasant and more natural-sounding audio image that also ensures retaining spatial cues that facilitate spatial perception by the human hearing system.
According to a first example, there is only a single sound direction of interest, denoted as a first sound direction α1 and, consequently, there is only a single audio effect to be considered, referred to as a first audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction α1 to derive a first sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith, where the first sound signal comprises a monaural audio signal.
Still referring to the first example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range of sound directions outside the first focus pattern applied for derivation of the first sound signal. In this regard, the range of sound directions outside the first focus pattern may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions (that covers the range of sound directions outside the first focus pattern), the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions (that each cover a respective sub-range of sound directions outside the first focus pattern) the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the first example, the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115. In this regard, the subtraction may involve creating a respective monaural focused (e.g. beamformed) signal for each channel of the multi-channel audio signal 115 such that its phase center is at the location of the microphone applied for capturing the microphone signal that serves as basis for the respective channel of the multi-channel audio signal 115, deriving a respective channel of a multi-channel difference signal by subtracting the respective monaural focused signal from the respective channel of the multi-channel audio signal 115, and deriving the background signal 223 as a spatial audio signal based on the difference signal.
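The per-channel subtraction described above may be sketched as follows for time-domain sample lists; this is a simplified illustration that omits the focusing step producing the per-channel focused signals.

```python
def background_by_subtraction(multi_channel, focused_per_channel):
    """Derive a multi-channel difference signal as the background.

    multi_channel: input channels as lists of samples.
    focused_per_channel: for each channel, a monaural focused signal whose
    phase center matches that channel's microphone, so sample-wise
    subtraction removes the focused (foreground) sound from that channel.
    """
    return [[x - f for x, f in zip(channel, focused)]
            for channel, focused in zip(multi_channel, focused_per_channel)]
```

The resulting difference signal would then serve as basis for deriving the background signal 223 as a spatial audio signal.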
According to a second example, there are two sound directions of interest, denoted as a first sound direction α1 and a second sound direction α2, where a first audio effect is to be applied to sounds in the first sound direction α1 and a second audio effect is to be applied to sounds in the second sound direction α2, where the first audio effect is different from the second audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction α1 to derive a first sound signal and applying monaural audio focusing using a second focus pattern of a predefined focus width directed at the second sound direction α2 to derive a second sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second sound direction α2 and that has the second audio effect associated therewith, where both the first and second sound signals comprise respective monaural audio signals.
Still referring to the second example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range(s) of sound directions outside the first and second focus patterns applied for derivation of the first and second sound signals. In this regard, the range(s) of sound directions outside the first and second focus patterns may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions, the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the second example, the background signal 223 may be created, for example, by subtracting the first and second sound signals from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
Even though described in the foregoing with references to two sound directions of interest, the second example readily generalizes into a scenario where there are three or more sound directions of interest that have different audio effects indicated therefor, mutatis mutandis.
According to a third example, there are two sound directions of interest, denoted as a first sound direction α1 and a second sound direction α2, where the same audio effect is to be applied to sounds in the first sound direction α1 and to sounds in the second sound direction α2, which audio effect may be referred to as the first audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers a first range of sound directions from the first sound direction α1 to the second sound direction α2 to derive the first sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first and second sound directions α1, α2 and that has the first audio effect associated therewith, where the first sound signal comprises a spatial audio signal.
Still referring to the third example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the (complementary) range of sound directions outside the first range of sound directions applied for derivation of the first sound signal. In this regard, the range of sound directions outside the first range of sound directions may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions, the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the third example, the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
In another variation of the third example, the selection of the type of audio focusing and/or the focus width may be carried out in dependence of the distance between the first and second sound directions α1, α2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to the distance αdist between the first and second sound directions α1, α2 (e.g. αdist = |α1 - α2|) being smaller than a predefined distance threshold αthr (e.g. αdist < αthr), whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to the distance αdist between the first and second sound directions α1, α2 being larger than or equal to the predefined distance threshold αthr (e.g. αdist ≥ αthr). The distance threshold αthr may be set in dependence of characteristics of the multi-channel audio signal 115, e.g. in view of the spatial arrangement of microphones of the microphone array 112 applied in capturing the microphone signals that serve as basis for the multi-channel audio signal 115. In an example, the distance threshold αthr may be set such that respective focus patterns directed to the first and second sound directions α1, α2 do not overlap when the distance αdist therebetween exceeds the distance threshold αthr.
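The selection logic of this variation may be sketched as follows, with the threshold supplied by the caller as described above; the mode labels are illustrative names only.

```python
def focusing_mode(alpha1_deg, alpha2_deg, threshold_deg):
    """Choose the focusing strategy for two directions sharing one effect.

    Returns "spatial" (one spatial focus spanning both directions, as in
    the third example) when the angular distance alpha_dist = |alpha1 -
    alpha2| is below the threshold, otherwise "monaural" (two separate
    monaural focuses, as in the second example).
    """
    alpha_dist = abs(alpha1_deg - alpha2_deg)
    return "spatial" if alpha_dist < threshold_deg else "monaural"
```

In practice the threshold would be tied to the array geometry, e.g. chosen so that the two focus patterns no longer overlap once the distance reaches it.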
In a further variation of the third example, the selection of the type of audio focusing and/or the focus width may be carried out in dependence of presence of further (directional) sound sources in sound directions between the first and second sound directions α1, α2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to not detecting presence of any directional sound sources in sound directions between the first and second sound directions α1, α2, whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to detecting presence of one or more directional sound sources in sound directions between the first and second sound directions α1, α2. Presence of one or more directional sound sources (or lack thereof) between the first and second sound directions α1, α2 may be evaluated via analysis of the multi-channel audio signal 115, for example by directing one or more auxiliary focus patterns between the first and second sound directions α1, α2 to derive respective auxiliary sound signals and determining presence of one or more directional sound sources in sound directions between the first and second sound directions α1, α2 in response to any of the auxiliary sound signals exhibiting a signal level (e.g. signal energy) that is close to that of sound signals derived from the first sound direction α1 and/or from the second sound direction α2. As an example in this regard, an auxiliary sound signal may be considered to represent a further sound source in case its signal level is above a predefined percentage of that of the sound signal representing sounds in the first sound direction α1 and/or that of the sound signal representing sounds in the second sound direction α2.
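The auxiliary-beam test of this variation may be sketched as follows, where the signal levels are assumed to be precomputed (e.g. as signal energies) and the fraction is an assumed tuning parameter standing in for the "predefined percentage".

```python
def has_intermediate_source(aux_levels, level1, level2, fraction=0.5):
    """Detect a further directional source between two directions of interest.

    aux_levels: levels of auxiliary focused signals steered between the
    first and second sound directions; level1/level2: levels of the sound
    signals in those two directions. An auxiliary beam is taken to reveal
    a further source if its level exceeds the given fraction of the level
    in either direction of interest.
    """
    threshold = fraction * min(level1, level2)
    return any(level > threshold for level in aux_levels)
```

When this returns True, the two directions would be handled with separate monaural focuses (second example) so the intermediate source is not swept into the focused signal.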
Even though described in the foregoing with reference to two sound directions of interest, the third example and any variations thereof readily generalize into a scenario where there are three or more sound directions of interest that have the same audio effect indicated therefor, mutatis mutandis.
According to a fourth example, there are three sound directions of interest, denoted as a first sound direction α1, a second sound direction α2 and a third sound direction α3, where a first audio effect is to be applied to sounds in the first sound direction α1 and where a second audio effect that is different from the first audio effect is to be applied to sounds in the second and third sound directions α2, α3. In this example, derivation of the at least one sound signal 221 with respect to the first sound direction α1 may be carried out as described in the foregoing for the first example, whereas derivation of the at least one sound signal 221 with respect to the second and third sound directions α2, α3 may be carried out as described in the foregoing for the third example. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second and third sound directions α2, α3 and that has the second audio effect associated therewith, where the first sound signal comprises a monaural audio signal and where the second sound signal comprises a spatial audio signal.
An advantage arising from deriving the at least one sound signal 221, each of which represents one or more sound directions within the spatial audio image represented by the multi-channel audio signal 115, e.g. according to the first to fourth examples described in the foregoing, is that the respective audio effect may be applied to a certain one of the at least one sound signal 221 such that it does not interfere with respective audio effects applied to other ones of the at least one sound signal 221 and/or with the sound represented by the background signal 223, thereby ensuring introduction of the audio effects as intended and, consequently, avoiding audible distortions that may arise from the audio effects interfering with each other and/or with the sounds in the background.
Referring back to operations pertaining to block 208, derivation of the at least one modified sound signal 225 based on the at least one sound signal 221 via application of the respective audio effect may comprise deriving the at least one modified sound signal 225 in dependence of the type and the number of sound signals provided as the at least one sound signal 221. In view of the first to fourth examples described in the foregoing, the at least one sound signal 221 may comprise one or more of the following:
- a first monaural sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith,
- a first spatial sound signal that represents sounds in the first sound direction α1 and sounds in the second sound direction α2 and that has the first audio effect associated therewith,
- a first monaural sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and a second monaural sound signal that represents sounds in the second sound direction α2 and that has the second audio effect associated therewith.
Moreover, in general the at least one sound signal 221 comprises one or more monaural sound signals that each represent sounds in a respective one of the one or more sound directions of interest and/or one or more spatial sound signals that each represent sounds in respective at least two of the one or more sound directions of interest, where each sound signal has a respective audio effect associated therewith.
Consequently, the aspect of deriving the respective at least one modified sound signal 225 based on the at least one sound signal 221 (cf. block 208) comprises separately applying for each of the sound signals respective audio processing that implements the audio effect associated therewith in dependence of the type of the respective sound signal:
- In case the respective sound signal comprises a monaural audio signal, derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of interest the respective sound signal serves to represent;
- In case the respective sound signal comprises a spatial audio signal, derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal, whereas the sound content of the resulting spatial sound signal is readily arranged in the respective sound directions of interest the respective sound signal serves to represent.
Referring back to operations pertaining to block 210, according to an example, derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise deriving the spatial audio signal as the sum (or as another linear combination) of the at least one modified sound signal 225 and the background signal.
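The type-dependent processing of a sound signal described above may be sketched as follows (illustrative only; the simple gain "effect" and the constant-power panning law are assumptions of this sketch, as the text specifies neither the audio effect nor the panning method):

```python
import math

def apply_effect(signal, gain=0.8):
    # Placeholder "audio effect": per-sample gain scaling (stands in for e.g.
    # equalization, pitch shifting, vibrato, tremolo or a vocoder effect).
    return [gain * x for x in signal]

def pan_to_direction(mono, angle_deg):
    # Constant-power amplitude panning of a mono signal onto a stereo pair;
    # angle_deg in [-45, 45], where 0 is the centre between the two channels.
    theta = math.radians(angle_deg + 45.0)
    g_left, g_right = math.cos(theta), math.sin(theta)
    return [(g_left * x, g_right * x) for x in mono]

def derive_modified_signal(signal, direction=None, is_monaural=True):
    modified = apply_effect(signal)
    if is_monaural:
        # Monaural case: pan the modified content to its direction of interest.
        return pan_to_direction(modified, direction)
    # Spatial case: the sound directions are already encoded in the signal.
    return modified
```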
In another example, derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise combining the at least one modified sound signal 225 and the background signal in view of the energy level of the at least one modified sound signal 225 in relation to the energy level of the at least one sound signal 221. For clarity and brevity of description, in the following such energy-level-dependent combination is described with reference to a single modified sound signal that may comprise a monaural audio signal that represents sounds in a single sound direction (e.g. in the first sound direction α1) or a spatial audio signal that represents sounds in one or more sound directions (e.g. in the first sound direction α1 or in the first and second sound directions α1 and α2), whereas the example readily generalizes into scenarios that involve one or more sound signals that each represent sounds in a respective one or more sound directions, mutatis mutandis.
As an example, the energy-level-dependent combination of the at least one modified sound signal 225 and the background signal 223 may be carried out according to a method 300 illustrated in a flowchart of Figure 5. The method 300 proceeds from obtaining a sound signal, a modified sound signal derived based on the sound signal, and the background signal 223, as indicated in block 302. Herein, the sound signal may comprise one of the at least one sound signal 221 and the modified sound signal may comprise the respective one of the at least one modified sound signal 225. In the following, for notational convenience, the sound signal (in frequency domain) is denoted as S, the modified sound signal (in frequency domain) is denoted as S', and the background signal (in frequency domain) is denoted as B.
The method 300 further comprises computing respective energy of one or more frequency bands of the sound signal S, as indicated in block 304, and computing respective energy in one or more frequency bands of the modified sound signal S', as indicated in block 306. The method may optionally further comprise computing respective energy in one or more frequency bands of the background signal B, as indicated in block 308. Herein, the energy of the sound signal S in the frequency band k (e.g. in the time-frequency tile S(k,n)) may be denoted as Es(k,n), the energy of the modified sound signal S' in the frequency band k (e.g. in the time-frequency tile S'(k,n)) may be denoted as E's(k,n), and the energy of the background signal B in the frequency band k (e.g. in the time-frequency tile B(k,n)) may be denoted as EB(k,n). The method 300 further comprises attenuating the background signal 223 in those frequency bands k where the energy of the sound signal S is higher than the respective energy of the modified sound signal S', thereby deriving a modified background signal B', as indicated in block 310. As an example in this regard, the modified background signal B' may be derived by identifying those frequency bands where the energy Es(k,n) of the time-frequency tile S(k,n) in the sound signal S is higher than the energy E's(k,n) of the time-frequency tile S'(k,n) in the modified sound signal S' and deriving the corresponding time-frequency tile B'(k,n) of the modified background signal B' as B'(k,n) = gkB(k,n), where gk denotes a scaling factor with gk < 1, thereby providing an attenuated version of the corresponding time-frequency tile B(k,n) of the background signal 223. Moreover, the remaining time-frequency tiles of the modified background signal B' (i.e. the ones for which Es(k,n) ≤ E's(k,n)) may be obtained as copies of the respective time-frequency tiles of the background signal B, e.g. as B'(k,n) = B(k,n).
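A minimal sketch of blocks 304 to 310 described above, assuming a fixed scaling factor gk for all attenuated bands (the representation of a single frame as one per-band value per list entry is an assumption of this sketch):

```python
def attenuate_background(S, S_mod, B, g_k=0.7):
    """Per-band attenuation of the background signal for one time frame n.

    S, S_mod, B: per-frequency-band values of the sound signal, the modified
    sound signal and the background signal. Returns the modified background B'.
    """
    B_mod = []
    for s, s_mod, b in zip(S, S_mod, B):
        E_s, E_s_mod = abs(s) ** 2, abs(s_mod) ** 2   # band energies
        if E_s > E_s_mod:
            # The effect reduced energy in this band: attenuate the background
            # so residual leakage does not interfere with the effect.
            B_mod.append(g_k * b)
        else:
            B_mod.append(b)   # otherwise keep the background tile as-is
    return B_mod
```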
Finally, the method 300 proceeds into deriving the spatial audio signal 125 as a combination of the modified sound signal S' and the modified background signal B', as indicated in block 312. In this regard, the spatial audio signal 125 may be derived e.g. as a sum, as an average or as another linear combination of the modified sound signal S' and the modified background signal B'.
Attenuation of the background signal 223 enables avoiding audible disturbances e.g. in a scenario where application of the respective audio effect to one or more of the at least one sound signal 221 results in energy reduction in one or more frequency bands and where, due to limitations of real-life implementations of the audio focusing procedures, some audio components intended for inclusion in the at least one sound signal 221 only remain in the background signal 223. In such a scenario, without attenuation of the background signal, such audio components that may remain in the background signal could interfere with the audio effect applied for the corresponding sound signal via introduction of the respective modified sound signal, whereas attenuation of some frequency bands of the background signal 223 serves to reduce or even completely avoid such interference.
Referring back to operations pertaining to block 310, according to an example, the energy of the sound signal S may be considered to be higher than the respective energy of the modified sound signal S' in frequency band k in response to the difference in respective energies in the frequency band k exceeding an energy difference threshold Ethr(k) assigned for the frequency band k. As an example in this regard, the energy of the sound signal S in the frequency band k may be considered to be higher than the respective energy of the modified sound signal S' in response to the energy Es(k,n) of the time-frequency tile S(k,n) in the sound signal S exceeding the energy E's(k,n) of the time-frequency tile S'(k,n) in the modified sound signal S' by more than the energy difference threshold Ethr(k), e.g. in response to the energy difference ΔE(k) = Es(k,n) − E's(k,n) exceeding the energy difference threshold Ethr(k), i.e. when ΔE(k) > Ethr(k). Herein, the energy difference threshold Ethr(k) may be the same across frequency bands or may be varied across frequency bands, for example such that the energy difference threshold Ethr(k) substantially matches a masking threshold for the respective frequency band. Herein, a masking threshold for a frequency band k represents the energy level required for an additional sound in order to make it audible in presence of another sound in the frequency band k. In an example, the respective energy difference threshold Ethr(k) may be set to zero for one or more predefined frequency bands or for all frequency bands in order to reduce computational complexity.
Still referring back to operations pertaining to block 310, according to an example, the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be independent of the extent of the difference in energy (e.g. the energy difference ΔE(k)). In such a scenario, in one example the amount of attenuation to be applied to the time-frequency tile B(k,n) of the background signal B to derive the corresponding time-frequency tile B'(k,n) of the modified background signal B' may be the same across frequency bands, e.g. the scaling factor gk may be set to the same value across frequency bands, e.g. to a value that provides attenuation in a range from 2 to 4 dB. In another example, the scaling factor gk may be varied across frequency bands, for example such that the reduction in the time-frequency tile B'(k,n) of the modified background signal B' in comparison to the time-frequency tile B(k,n) of the background signal B via application of the gain factor gk matches or substantially matches the difference in energy between the respective time-frequency tile S'(k,n) of the modified sound signal S' and the time-frequency tile S(k,n) of the sound signal S, e.g. by setting the gain factor gk as gk = E's(k,n)/Es(k,n). Applying a gain factor gk that has either too small or too large a value may result in audible distortions in the spatial audio signal 125 and thus the gain factor gk may be limited into a predefined range, e.g. into one that provides attenuation in a range from 2 to 4 dB.
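The energy-matched gain gk = E's(k,n)/Es(k,n) with the limiting to a predefined range described above may be sketched as follows (illustrative only; whether the 2 to 4 dB range applies to the gain as an amplitude or an energy quantity is not specified in the text, and this sketch treats gk as an amplitude gain):

```python
def band_gain(E_s, E_s_mod, min_atten_db=2.0, max_atten_db=4.0):
    """Scaling factor g_k for one frequency band, clamped to 2..4 dB attenuation."""
    g_max = 10 ** (-min_atten_db / 20.0)   # least attenuation allowed (2 dB)
    g_min = 10 ** (-max_atten_db / 20.0)   # most attenuation allowed (4 dB)
    # Energy-matched gain g_k = E'_s(k,n) / E_s(k,n); fall back to unity
    # (then clamped) if the band carries no energy in the sound signal.
    g = E_s_mod / E_s if E_s > 0 else 1.0
    return min(max(g, g_min), g_max)
```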
In another example, the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be dependent on the extent of the difference in energy (e.g. the energy difference ΔE(k)), for example such that the amount of attenuation to be applied in a frequency band k is directly proportional to the difference in respective energies of the sound signal S and the modified sound signal S' in the frequency band k. This may be provided, for example, by setting the value of the scaling factor gk for deriving the time-frequency tile B'(k,n) of the modified background signal B' such that it is directly proportional to the energy difference ΔE(k) in the frequency band k.
Figure 6 illustrates a block diagram of some components of an exemplifying apparatus 400. The apparatus 400 may comprise further components, elements or portions that are not depicted in Figure 6. The apparatus 400 may be employed e.g. in implementing one or more components described in the foregoing in context of the audio processing portion 220.
The apparatus 400 comprises a processor 416 and a memory 415 for storing data and computer program code 417. The memory 415 and a portion of the computer program code 417 stored therein may be further arranged to, with the processor 416, implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
The apparatus 400 comprises a communication portion 412 for communication with other devices. The communication portion 412 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 412 may also be referred to as a respective communication means.
The apparatus 400 may further comprise user I/O (input/output) components 418 that may be arranged, possibly together with the processor 416 and a portion of the computer program code 417, to provide a user interface for receiving input from a user of the apparatus 400 and/or providing output to the user of the apparatus 400 to control at least some aspects of operation of audio processing portion 220 that are implemented by the apparatus 400. The user I/O components 418 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 418 may be also referred to as peripherals. The processor 416 may be arranged to control operation of the apparatus 400 e.g. in accordance with a portion of the computer program code 417 and possibly further in accordance with the user input received via the user I/O components 418 and/or in accordance with information received via the communication portion 412.
Although the processor 416 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 415 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent / semi-permanent/ dynamic/cached storage.
The computer program code 417 stored in the memory 415, may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 400 when loaded into the processor 416. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 416 is able to load and execute the computer program code 417 by reading the one or more sequences of one or more instructions included therein from the memory 415. The one or more sequences of one or more instructions may be configured to, when executed by the processor 416, cause the apparatus 400 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
Hence, the apparatus 400 may comprise at least one processor 416 and at least one memory 415 including the computer program code 417 for one or more programs, the at least one memory 415 and the computer program code 417 configured to, with the at least one processor 416, cause the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
The computer programs stored in the memory 415 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 417 stored thereon, the computer program code, when executed by the apparatus 400, causes the apparatus 400 at least to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.

Claims

1. An apparatus for audio processing, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
2. The apparatus according to claim 1, wherein the means for deriving the spatial audio signal comprises: means for computing respective energies in one or more frequency bands of the at least one sound signal; means for computing respective energies in said one or more frequency bands of the at least one modified sound signal; means for attenuating the background signal in those ones of the one or more frequency bands where the energy of the at least one sound signal is higher than the respective energy of the at least one modified sound signal to derive a modified background signal; and means for deriving the spatial audio signal as a combination of the at least one modified sound signal and the modified background signal.
3. The apparatus according to claim 2, wherein the means for attenuating the background signal is arranged to consider the energy of the at least one sound signal to be higher than the respective energy of the at least one modified sound signal in a frequency band in response to the energy of the at least one sound signal exceeding the respective energy of the at least one modified sound signal in said frequency band by at least an energy difference threshold assigned for said frequency band.
4. The apparatus according to claim 2 or 3, wherein the means for attenuating the background signal is arranged to attenuate the background signal in a frequency band by an amount that depends on the difference in respective energies of the at least one sound signal and the at least one modified sound signal in said frequency band.
5. The apparatus according to any of claims 2 to 4, wherein the means for deriving the spatial audio signal comprises means for deriving a sum of the at least one modified sound signal and the background signal.
6. The apparatus according to claim 1, wherein the means for deriving the spatial audio signal comprises means for deriving a sum of the at least one modified sound signal and the background signal.
7. The apparatus according to any of claims 1 to 6, wherein said sound directions comprise a first sound direction and a second sound direction, wherein said audio effects comprise a first audio effect to be applied to sounds in the first sound direction and a second audio effect that is different from the first audio effect to be applied to sounds in the second sound direction, wherein the means for deriving the at least one sound signal comprises: means for applying, on the input audio signal, monaural audio focusing directed to the first sound direction to derive a first sound signal as a monaural audio signal that represents sounds in the first sound direction and that has the first audio effect associated therewith; and means for applying, on the input audio signal, monaural audio focusing directed to the second sound direction to derive a second sound signal as a monaural audio signal that represents sounds in the second sound direction and that has the second audio effect associated therewith.
8. The apparatus according to claim 7, wherein the means for applying monaural audio focusing is arranged to derive a monaural audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions.
9. The apparatus according to any of claims 1 to 8, wherein said sound directions comprise a third sound direction and a fourth sound direction, wherein said audio effects comprise a third audio effect to be applied to sounds in the third and fourth sound directions, wherein the means for deriving the at least one sound signal comprises: means for applying, on the input audio signal, spatial audio focusing directed to a first range of sound directions from the third sound direction to the fourth sound direction to derive a third sound signal as a spatial audio signal that represents sounds in the third and fourth sound directions and that has the third audio effect associated therewith.
10. The apparatus according to any of claims 1 to 9, wherein the means for deriving the background signal comprises means for applying, on the input audio signal, spatial audio focusing directed to sound directions other than said one or more sound directions to derive the background signal as a spatial audio signal that represents sounds in said other sound directions.
11. The apparatus according to claim 10, wherein the means for deriving the background signal comprises one of the following: means for applying spatial audio focusing directed to a range of sound directions that covers sound directions other than said one or more sound directions to derive the background signal, means for dividing said range of sound directions that covers sound directions other than said one or more sound directions into two or more sub-ranges of sound directions, means for separately applying, for each of said sub-ranges, spatial audio focusing directed to a respective sub-range of sound directions to derive a respective background signal component, and means for combining the background signal components to derive the background signal.
12. The apparatus according to any of claims 9 to 11, wherein the means for applying spatial audio focusing, when directed to a range of sound directions from a first given sound direction to a second given sound direction, is arranged to provide a focused audio signal as a spatial audio signal where the sound directions within said range are retained in their respective spatial positions of the spatial audio image while suppressing sounds in sound directions outside said range.
13. The apparatus according to claim 12, wherein the means for applying spatial audio focusing is arranged to: apply, on the input audio signal, monaural audio focusing directed to a sound direction that is offset from the first given sound direction to a first direction to derive a first channel of the focused audio signal; and apply, on the input audio signal, monaural audio focusing directed to a sound direction that is offset from the second given sound direction to a second direction that is opposite to the first direction to derive a second channel of the focused audio signal.
14. The apparatus according to any of claims 1 to 9, wherein the means for deriving the background signal comprises means for subtracting the at least one sound signal from the input audio signal.
15. The apparatus according to any of claims 1 to 14, wherein the means for deriving the respective at least one modified sound signal comprises means for applying, to a respective sound signal, audio processing that implements the audio effect associated with the respective sound signal to derive the respective modified sound signal.
16. The apparatus according to claim 15, wherein the means for deriving the respective at least one modified sound signal is arranged to: in case the respective sound signal comprises a monaural audio signal, derive the respective modified sound signal via applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of the spatial audio image the respective sound signal serves to represent; in case the respective sound signal comprises a spatial audio signal, derive the respective modified sound signal via applying the audio processing that implements the audio effect associated with the respective sound signal.
17. The apparatus according to any of claims 1 to 16, wherein said audio effect comprises one of the following: audio equalization, pitch shifting, vibrato, tremolo, a vocoder effect.
18. The apparatus according to any of claims 1 to 17, wherein the input audio signal comprises a multi-channel audio signal derived on basis of respective microphone signals obtained from respective one or more microphones of a microphone array.
19. An apparatus for audio processing, the apparatus comprising at least one processor and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
20. A method for audio processing, the method comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
21. A computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to claim 20.
PCT/FI2021/050234 2020-04-17 2021-03-31 Audio processing WO2021209683A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205391 2020-04-17
FI20205391 2020-04-17

Publications (1)

Publication Number Publication Date
WO2021209683A1 true WO2021209683A1 (en) 2021-10-21

Family

ID=78084727


Country Status (1)

Country Link
WO (1) WO2021209683A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014091375A1 (en) * 2012-12-14 2014-06-19 Koninklijke Philips N.V. Reverberation processing in an audio signal
WO2016034454A1 (en) * 2014-09-05 2016-03-10 Thomson Licensing Method and apparatus for enhancing sound sources
GB2562518A (en) * 2017-05-18 2018-11-21 Nokia Technologies Oy Spatial audio processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
POLITIS, ARCHONTIS ET AL.: "Parametric spatial audio effects", Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK, 17 September 2012, XP055527425, Retrieved from the Internet <URL:https://www.dafx12.york.ac.uk/papers/dafx12_submission_22.pdf> [retrieved on 2018-08-08] *

Similar Documents

Publication Publication Date Title
US10785589B2 (en) Two stage audio focus for spatial audio processing
CN109313907B (en) Combining audio signals and spatial metadata
US9282419B2 (en) Audio processing method and audio processing apparatus
US11457310B2 (en) Apparatus, method and computer program for audio signal processing
CN113597776B (en) Wind noise reduction in parametric audio
US9743215B2 (en) Apparatus and method for center signal scaling and stereophonic enhancement based on a signal-to-downmix ratio
US20220060824A1 (en) An Audio Capturing Arrangement
JP2017530396A (en) Method and apparatus for enhancing a sound source
US20220014866A1 (en) Audio processing
KR101520618B1 (en) Method and apparatus for focusing the sound through the array speaker
US20220260664A1 (en) Audio processing
US11962992B2 (en) Spatial audio processing
WO2021209683A1 (en) Audio processing
US20220150624A1 (en) Method, Apparatus and Computer Program for Processing Audio Signals
CN112788515B (en) Audio processing
KR20200017969A (en) Audio apparatus and method of controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21788162

Country of ref document: EP

Kind code of ref document: A1