WO2021209683A1 - Audio processing - Google Patents

Audio processing

Info

Publication number
WO2021209683A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
signal
audio
directions
spatial
Prior art date
Application number
PCT/FI2021/050234
Other languages
French (fr)
Inventor
Miikka Vilermo
Toni Mäkinen
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of WO2021209683A1 publication Critical patent/WO2021209683A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0091 Means for obtaining special acoustic effects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S 3/008 Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H 2210/056 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; Identification or separation of instrumental parts by their characteristic voices or timbres
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/155 Musical effects
    • G10H 2210/265 Acoustic effect simulation, i.e. volume, spatial, resonance or reverberation effects added to a musical sound, usually by appropriate filtering or delays
    • G10H 2210/295 Spatial effects, musical uses of multiple audio channels, e.g. stereo
    • G10H 2210/305 Source positioning in a soundscape, e.g. instrument positioning on a virtual soundstage, stereo panning or related delay or reverberation changes; Changing the stereo width of a musical source
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction

Definitions

  • the example and non-limiting embodiments of the present invention relate to processing of audio signals.
  • various example embodiments of the present invention relate to audio processing that involves deriving a processed audio signal where characteristics of respective sounds in one or more sound directions of a spatial audio image represented by an input audio signal are modified.
  • the process of capturing such an audio signal using the mobile device comprises operating a microphone array of the mobile device to capture a plurality of microphone signals and processing the captured microphone signals into a digital audio signal for playback in the mobile device or for further processing in the mobile device, for storage in the mobile device and/or for transmission to one or more other devices to enable subsequent playback in the mobile device or in another device.
  • the digital audio signal may be one that conveys a spatial audio image that represents the audio scene around the mobile device, either as such or together with spatial metadata that defines at least some characteristics of the spatial audio scene.
  • the recorded audio signal may comprise a multi-channel signal where each channel is (substantially directly) based on the respective microphone signal or the recorded audio signal may comprise a spatial audio signal derived based on the microphone signals.
  • the digital audio is captured together with the associated video. Capturing a digital audio signal that represents an audio scene around the mobile device provides interesting possibilities for processing the spatial audio image conveyed by the digital audio signal during the capture and/or after the capture.
  • a user may wish to modify characteristics of one or more sound sources in the spatial audio image, for example, to improve perceptual quality or clarity of the one or more sound sources or for entertainment purposes.
  • a straightforward approach for implementing such a procedure includes extracting, from the digital audio signal, a sound signal representing a sound source of interest, modifying the sound signal in a desired manner and inserting the modified sound signal back to the digital audio signal.
  • Extraction of the sound signal may be carried out by applying an audio focusing technique known in the art to the digital audio signal, where the audio focusing aims at representing sounds in a desired sound direction within a spatial audio image represented by the digital audio signal while excluding sounds in other sound directions.
  • a typical solution for audio focusing involves audio beamforming, which is a technique well known in the art.
  • an audio beamforming procedure aims at extracting a beamformed audio signal that represents sounds in a sound direction of interest while suppressing sound in other sound directions.
  • the sound direction of interest may be referred to as a beam direction.
  • the term audio focusing is applied to refer to an audio processing technique that involves emphasizing sounds in certain sound directions of a spatial audio image in relation to sounds in other sound directions of the spatial audio image.
  • the aim of the beamforming is to extract or derive a beamformed audio signal that represents sounds in the beam direction without representing sounds in other sound directions.
  • the beamformed audio signal is typically an audio signal where sounds in the beam direction are emphasized in relation to sounds in other sound directions. Consequently, even if an audio beamforming procedure aims at a beamformed audio signal that only represents sounds in the beam direction, the resulting beamformed audio signal is one where sounds in the beam direction and sounds in a sub-range of directions around the beam direction are emphasized in relation to sounds in other directions in accordance with characteristics of a beam applied for the audio beamforming.
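The emphasis of sounds around a beam direction can be illustrated with a minimal delay-and-sum beamformer sketched below. This is not the implementation of this publication; the linear microphone geometry, the assumed speed of sound and the rounding of delays to whole samples are simplifications for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed ambient value for this illustration


def delay_and_sum(mic_signals, mic_positions, beam_direction_deg, fs):
    """Steer a delay-and-sum beam of a linear microphone array.

    mic_signals: equal-length lists of samples, one list per microphone
    mic_positions: microphone x-coordinates in metres
    beam_direction_deg: beam direction, 0 = array axis, 90 = broadside
    fs: sampling frequency in Hz

    Delays are rounded to whole samples for simplicity; a practical
    implementation would use fractional-delay filtering.
    """
    theta = math.radians(beam_direction_deg)
    # Far-field arrival-time advance of each microphone for direction theta.
    advances = [p * math.cos(theta) / SPEED_OF_SOUND for p in mic_positions]
    # Delay each microphone so that sounds from theta line up in time.
    base = min(advances)
    shifts = [round((a - base) * fs) for a in advances]
    n = len(mic_signals[0])
    out = [0.0] * n
    for sig, shift in zip(mic_signals, shifts):
        for i in range(n):
            j = i - shift  # apply the compensating delay
            if 0 <= j < n:
                out[i] += sig[j] / len(mic_signals)
    return out
```

Steering towards the actual source direction sums the microphone copies of the source in phase, while sounds from other directions sum incoherently, which realizes the relative emphasis described above.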
  • the width of a beam applied in the audio beamforming procedure may be considered: the width of the beam may be indicated by a solid angle (typically in the horizontal direction only), which defines a sub-range of sound directions around the beam direction that are considered to fall within the beam.
  • the solid angle may define a sub-range of sound directions around the beam direction such that sounds in sound directions outside the solid angle are attenuated at least a predefined amount in relation to a sound direction of maximum amplification (or minimum attenuation) within the solid angle.
  • the predefined amount may be defined, for example, as 6 dB or 3 dB.
  • definition of the beam width as a solid angle is a simplified model for indicating the width of the beam, and hence the sub-range of sound directions encompassed by the beam when targeted to the beam direction. In a real-life implementation the beam does not strictly cover a well-defined range of sound directions around the sound direction of interest; rather, its width may vary with audio signal characteristics, with the position of the beam direction within the spatial audio image under consideration and/or with audio frequency.
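The beam-width definition above (the angular region within a predefined attenuation, e.g. 6 dB or 3 dB, of the maximum) can be evaluated numerically for a given beam pattern. The uniform-line-array response below is a textbook model with illustrative parameter values, not a pattern taken from this publication.

```python
import math


def ula_gain_db(theta_deg, n_mics=4, spacing=0.05, freq=2000.0, c=343.0):
    """Textbook far-field response (in dB) of a uniform line array steered
    to broadside (90 degrees); all parameter values are illustrative."""
    psi = 2.0 * math.pi * spacing * freq / c * math.cos(math.radians(theta_deg))
    if abs(math.sin(psi / 2.0)) < 1e-12:
        mag = 1.0  # limit of sin(N*psi/2) / (N*sin(psi/2)) as psi -> 0
    else:
        mag = abs(math.sin(n_mics * psi / 2.0) / (n_mics * math.sin(psi / 2.0)))
    return 20.0 * math.log10(max(mag, 1e-6))


def beam_width_deg(gain_db, threshold_db=6.0, step=0.1):
    """Width in degrees of the region that stays within threshold_db of the
    pattern maximum, scanning 0..180 degrees (assumes a single main lobe
    above the threshold)."""
    angles = [i * step for i in range(int(180.0 / step) + 1)]
    gains = [gain_db(a) for a in angles]
    peak = max(gains)
    inside = [a for a, g in zip(angles, gains) if g >= peak - threshold_db]
    return max(inside) - min(inside)
```

As the text notes, a 6 dB threshold yields a wider beam than a 3 dB threshold for the same pattern.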
  • the beamformed audio signal may also include sound components originating from other sound sources in sound directions close to the beam direction and/or ambient sound components (e.g. sounds that have no well-defined sound direction).
  • when the beamformed audio signal is applied as the sound signal that represents the sound source of interest in the context of the above-described procedure for modifying characteristics of a sound source in the spatial audio image represented by a digital audio signal, modifying the beamformed audio signal may unintentionally also modify sound components from other sound sources and/or ambient sound components. This in turn may result in an audio effect different from the one intended and/or may distort the modified digital audio signal that results from introducing the modified beamformed audio signal back into the digital audio signal. While the above description applies audio beamforming as an example of audio focusing techniques in general, similar challenges are involved in audio focusing techniques of other kinds as well.
  • a method for audio processing comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing configured to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • an apparatus for audio processing comprises at least one processor; and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
  • a computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing is provided.
  • the computer readable medium may comprise a volatile or a non-volatile computer-readable record medium, thereby providing a computer program product comprising at least one computer-readable non-transitory medium having stored thereon program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing.
  • Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system according to an example
  • Figures 2A and 2B illustrate respective block diagrams of some components and/or entities of an audio processing sub-system according to an example
  • Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion according to a non-limiting example
  • Figure 4 illustrates a flowchart depicting a method according to an example
  • Figure 5 illustrates a flowchart depicting a method according to an example
  • Figure 6 illustrates a block diagram of some elements of an apparatus according to an example.
  • FIG. 1 illustrates a block diagram of some components and/or entities of an audio processing system 100 according to a non-limiting example.
  • the audio processing system 100 comprises an audio capturing portion 110 and an audio processing portion 120.
  • the audio capturing portion 110 is coupled to a microphone array 112 and it is arranged to receive respective microphone signals from a plurality of microphones 112-1, 112-2, ..., 112-K and to record a captured multi-channel audio signal 115 based on the received microphone signals.
  • the microphones 112-1, 112-2, ..., 112-K represent a plurality of (i.e. two or more) microphones, where an individual one of the microphones may be referred to as a microphone 112-k or as a microphone 112-j.
  • microphone array 112 is to be construed broadly, encompassing any arrangement of two or more microphones 112-k arranged in or coupled to a device implementing the audio processing system 100.
  • the audio processing portion 120 is arranged to receive the multi-channel audio signal 115 from the audio capturing portion 110 and to derive a spatial audio signal 125 based on the multi-channel audio signal 115.
  • Each of the microphone signals represents the same captured sound, while providing a respective different representation of it that depends on the positions of the microphones 112-k with respect to each other. For a sound source in a certain spatial position with respect to the microphone array 112, this results in a different representation of sounds originating from that sound source in each of the microphone signals: a microphone 112-k that is closer to the sound source captures the sound originating therefrom at a higher amplitude and earlier than a microphone 112-j that is further away from it.
  • Such differences in amplitude and/or time delay enable derivation of a spatial audio signal that represents the audio scene at the time (and place) of capturing the microphone signals and/or applying spatial audio processing based on the multi-channel audio signal 115.
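The time-delay differences described above can be illustrated with a brute-force cross-correlation delay estimator. This is only a sketch; spatial analysis in practice uses more robust estimators (e.g. generalized cross-correlation) and sub-sample interpolation.

```python
def estimate_delay(sig_a, sig_b, max_lag):
    """Estimate, by brute-force cross-correlation, the lag (in samples) at
    which sig_b best aligns with sig_a. A positive result means sig_b is a
    delayed copy of sig_a, i.e. its microphone is further from the source."""
    best_lag, best_score = 0, float("-inf")
    n = min(len(sig_a), len(sig_b))
    for lag in range(-max_lag, max_lag + 1):
        # Correlate sig_a against sig_b shifted by the candidate lag.
        score = sum(sig_a[i] * sig_b[i + lag]
                    for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Together with the microphone spacing, the sampling frequency and the speed of sound, such a lag estimate can be mapped to a sound direction-of-arrival estimate of the kind used in spatial audio processing.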
  • the representation of the spatial audio scene captured in the microphone signals and, consequently, in the multi-channel audio signal 115 may be referred to as a spatial audio image.
  • the audio capturing portion 110 may be arranged to record a respective digital audio signal based on each of the microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112 at a predefined sampling frequency using a predefined bit depth and to provide the recorded digital audio signals as the multi-channel audio signal 115 to the audio processing portion 120 for further audio processing therein.
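Recording at a predefined bit depth amounts to mapping each sample to a signed integer code of that width. A minimal sketch of such quantization (a hypothetical helper, not part of this publication) could look as follows:

```python
def quantize(sample, bit_depth):
    """Quantize one sample in [-1.0, 1.0) to a signed integer code of the
    given bit depth, as done when recording a digital audio signal."""
    levels = 1 << (bit_depth - 1)  # e.g. 32768 for 16-bit audio
    code = int(round(sample * levels))
    # Clamp to the representable range of the signed integer format.
    return max(-levels, min(levels - 1, code))
```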
  • each digital audio signal recorded at the audio capturing portion 110 may be considered as a respective channel of the multi-channel audio signal 115.
  • the multi-channel audio signal 115 may comprise or may be accompanied with audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115, e.g. the applied sampling frequency and/or bit depth.
  • FIGS. 2A and 2B illustrate respective block diagrams of some components of respective audio processing sub-systems 100a and 100b according to a non-limiting example.
  • the audio processing sub-system 100a comprises the microphone array 112 and the audio capturing portion 110 described in the foregoing together with a memory 102.
  • a difference to operation of the audio processing system 100 is that instead of providing the multi-channel audio signal 115 to the audio processing portion 120, the audio capturing portion 110 may be arranged to store the multi-channel audio signal 115 in the memory 102 for subsequent access by the audio processing portion 120.
  • the multi-channel audio signal 115 may be stored in the memory 102 together with the audio metadata described in the foregoing.
  • the audio processing sub-system 100b comprises the memory 102 and the audio processing portion 120 described in the foregoing. Hence, a difference in operation of the audio processing sub-system 100b in comparison to a corresponding aspect of operation of the audio processing system 100 is that instead of (directly) obtaining the multi-channel audio signal 115 from the audio capturing portion 110, the audio processing portion 120 reads the multi-channel audio signal 115, possibly together with the audio metadata, from the memory 102.
  • the memory 102 read by the audio processing portion 120 is the same one to which the audio capturing portion 110 stores the multi-channel audio signal 115 recorded or derived based on the respective microphone signals obtained from the microphones 112-k of the microphone array 112.
  • such an arrangement may be provided by providing the audio processing sub-systems 100a and 100b in a single device that also includes (or otherwise has access to) the memory 102.
  • the audio processing sub-systems 100a and 100b may be provided in a single device or in two separate devices and the memory 102 may comprise a memory provided in a removable memory device such as a memory card or a memory stick that enables subsequent access to the multi-channel audio signal 115 (possibly together with the metadata) in the memory 102 by the same device that stored this information therein or by another device.
  • the memory 102 may be provided in a further device, e.g. in a server device, that is communicatively coupled, by a communication network, to a device providing both audio processing sub-systems 100a, 100b or to respective devices providing the audio processing sub-system 100a and the audio processing sub-system 100b.
  • the memory 102 may be replaced by a transmission channel or by a communication network that enables transferring the multi-channel audio signal 115 (possibly together with the audio metadata) from a first device providing the audio processing sub-system 100a to a second device providing the audio processing sub-system 100b.
  • the transfer of the multi-channel audio signal 115 may comprise transmitting/receiving the multi-channel audio signal 115 as an audio packet stream, whereby the audio capturing portion 110 may further operate to encode the multi-channel audio signal 115 into encoded audio data suitable for transmission in the audio packet stream and the audio processing portion 120 may further operate to decode the encoded audio data received in the audio packet stream back into the multi-channel audio signal 115 (or into a reconstructed version thereof) for the audio processing therein.
  • the audio processing portion 120 may be arranged to carry out a spatial audio processing procedure that results in modifying characteristics of respective sounds of one or more sound sources included in the multi-channel audio signal 115 to provide the spatial audio signal 125. Without losing generality, modification of a sound may be referred to as application of an audio effect to the respective sound.
  • while the audio processing system 100 and operation thereof are throughout this disclosure predominantly described with reference to the audio processing portion 120, 220 obtaining an input audio signal as the multi-channel audio signal 115 derived on the basis of respective microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112, in other examples the input audio signal to the audio processing portion 120, 220 may comprise an audio signal of another type that represents a spatial audio image and that enables audio focusing, thereby enabling derivation of the spatial audio signal 125 as the output of the audio processing portion 120, 220.
  • Non-limiting examples of such input audio signals of other types that represent a spatial audio image include an Ambisonic (spherical harmonic) audio format and various multi-loudspeaker audio formats (such as 5.1-channel or 7.1 surround sound) known in the art.
  • audio metadata includes information that defines various aspects related to characteristics of the input audio signal, including spatial characteristics such as parametric data describing the spatial audio field, e.g. respective sound direction-of-arrival estimates, respective ratios between direct and ambient sound energy components, etc. for one or more frequency bands.
  • Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion 220 according to a non-limiting example.
  • the audio processing portion 220 is arranged to obtain the multi-channel audio signal 115 and respective indications of a sound direction within a spatial audio image represented by the multi-channel audio signal 115 and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions.
  • the audio processing portion 220 comprises an audio decomposition portion 222 for deriving, based on the multi-channel audio signal 115, at least one sound signal 221 that represents sounds in the one or more sound directions within the spatial audio image and a background signal 223 that represents sounds in other directions of the spatial audio image.
  • the audio processing portion further comprises an audio effect portion 224 for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225.
  • the audio processing portion further comprises an audio combiner 226 for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal 223.
  • the audio processing portion 220 may include further entities in addition to those illustrated in Figure 3 and/or some of the entities depicted in Figure 3 may be combined with other entities while providing the same or corresponding functionality.
  • the entities illustrated in Figure 3 serve to represent logical components of the audio processing portion 220 that are arranged to perform a respective function but that do not impose structural limitations concerning implementation of the respective entity.
  • respective hardware means, respective software means or a respective combination of hardware means and software means may be applied to implement any of the entities illustrated in Figure 3 separately from the other entities, to implement any sub-combination of two or more entities illustrated in Figure 3, or to implement all entities illustrated in Figure 3 in combination.
  • the audio processing portion 220 may be provided as one comprising means for obtaining the multi-channel audio signal 115 that represents a spatial audio image, means for obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, means for deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, means for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, and means for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223.
  • the spatial audio processing procedure may be carried out, for example, in accordance with a method 200 illustrated in a flowchart of Figure 4.
  • the method 200 commences from obtaining the multi-channel audio signal 115 that represents a spatial audio image, as indicated in block 202, and obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, as indicated in block 204.
  • the method 200 further comprises deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, as indicated in block 206.
  • the method 200 further comprises deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, as indicated in block 208. Moreover, the method 200 further comprises deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223, as indicated in block 210.
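The data flow of blocks 202 to 210 can be summarized as a function skeleton. The `decompose` and `combine` callables stand in for the audio decomposition portion 222 and the audio combiner 226 of Figure 3; their internals are not specified by this sketch, so only the overall flow is fixed, and all names are illustrative.

```python
def process_spatial_audio(input_frame, directed_effects, decompose, combine):
    """Sketch of the spatial audio processing procedure of blocks 202-210.

    input_frame: one frame of the multi-channel input audio signal
    directed_effects: list of (sound_direction, effect_fn) pairs
    decompose: callable returning (per-direction sound signals, background)
    combine: callable merging modified sound signals with the background
    """
    directions = [d for d, _ in directed_effects]
    # Block 206: derive sound signals for the indicated directions and a
    # background signal representing sounds in other directions.
    sound_signals, background = decompose(input_frame, directions)
    # Block 208: apply the respective audio effect to each sound signal.
    modified = [effect(sig)
                for sig, (_, effect) in zip(sound_signals, directed_effects)]
    # Block 210: derive the output spatial audio signal.
    return combine(modified, background)
```

A caller supplies concrete decomposition and combination implementations (e.g. beamforming-based extraction and remixing), so the skeleton itself stays agnostic of the audio focusing technique.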
  • the audio decomposition portion 222 may be arranged to carry out operations, procedures and/or functions pertaining to block 206
  • the audio effect portion 224 may be arranged to carry out operations, procedures and/or functions pertaining to block 208
  • the audio combiner 226 may be arranged to carry out operations, procedures and/or functions pertaining to block 210.
  • the operation of the audio processing portion 220 is predominantly described with references to the method 200, while the corresponding examples readily pertain to respective entities of the audio processing portion 220 arranged to carry out the respective aspect of the method 200.
  • the audio processing portion 220 and the method 200 may be arranged to process the multi-channel audio signal 115 arranged in a sequence of input frames, each input frame including a respective time segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency (which may be defined, for example, in the audio metadata provided for the multi-channel audio signal 115).
  • the audio processing portion 220 employs a fixed predefined frame length.
  • the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths.
  • a frame length may be defined as the number of (input) samples L included in the frame for each channel of the multi-channel audio signal 115, which at the predefined sampling frequency maps to a corresponding duration in time.
  • the frames may be non-overlapping or they may be partially overlapping. Any applied frame lengths and sampling frequencies serve as non-limiting examples, however, and different values may be employed instead, depending e.g. on the desired audio bandwidth, on the desired framing delay and/or on the available processing capacity.
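The framing arrangement described above can be sketched as follows; the 48 kHz sampling frequency, 20 ms frame length and 50 % overlap used here are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def split_into_frames(x, frame_len, hop_len):
    """Split a multi-channel signal of shape (channels, samples) into
    frames of shape (n_frames, channels, frame_len).

    Frames overlap when hop_len < frame_len and are non-overlapping
    when hop_len == frame_len. The tail that does not fill a complete
    frame is discarded in this sketch.
    """
    channels, n_samples = x.shape
    n_frames = 1 + (n_samples - frame_len) // hop_len
    return np.stack([x[:, i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: a 2-channel signal at 48 kHz with 20 ms frames, 50 % overlap.
fs = 48000
frame_len = int(0.02 * fs)          # 960 samples per frame
hop_len = frame_len // 2            # 480-sample hop -> 50 % overlap
rng = np.random.default_rng(0)
x = rng.standard_normal((2, fs))    # one second of 2-channel noise
frames = split_into_frames(x, frame_len, hop_len)
```

With a 50 % overlap, the second half of each frame reappears as the first half of the next one, which the stacking above preserves.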
  • At least part of the processing carried out by the audio processing portion 220 and/or the method 200 may be carried out separately for a plurality of frequency bands of the multi-channel audio signal 115. Consequently, e.g. a respective entity of the audio processing portion 220 and/or a respective step of the method 200 may involve (at least conceptually) dividing or decomposing each channel of the multi-channel audio signal 115 into a respective plurality of frequency bands, thereby providing a respective time-frequency representation for each channel of the multi-channel audio signal 115.
  • division into the frequency bands may comprise transforming each channel of the multi-channel audio signal 115 from the time domain into a respective frequency-domain audio signal and arranging the resulting frequency-domain samples (also referred to as frequency bins) into a respective plurality of frequency bands in each of the channels.
  • time-to-frequency-domain transforms include the short-time discrete Fourier transform (STFT) and a complex-modulated quadrature-mirror filter (QMF) bank.
  • multi-channel audio signal 115 is transformed into frequency domain for carrying out an aspect of audio processing by the audio processing portion 220 and/or the method 200 (separately) for a plurality of frequency bands
  • respective inverse transform from the frequency domain back to the time domain may be applied before providing the spatial audio signal 125 as the output of the audio processing portion 220 and/or the method 200.
  • the number of frequency bands and respective bandwidths of the frequency bands may be selected e.g. in accordance with the desired frequency resolution and/or available computing power.
  • the frequency band structure involves 24 frequency sub-bands according to the Bark scale, the equivalent rectangular bandwidth (ERB) scale or the 3rd-octave band scale known in the art.
  • different number of frequency sub-bands that have the same or different bandwidths may be employed.
  • a specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a single frequency sub-band that covers a continuous subset of the input spectrum.
  • a frequency-domain sample that represents frequency bin b in time frame n of channel i of the frequency-domain audio signal may be denoted as x(i, b, n).
  • Division into frequency bands may be provided by arranging or grouping one or more consecutive frequency bins obtained for a given channel in the given frame into a respective frequency band, thereby providing a plurality of frequency bands k = 0, ..., K−1, where the frequency-domain audio signal in frequency band k in time frame n of channel i may be denoted as x(i, k, n) and where the frequency-domain audio signal in frequency band k in time frame n across all channels may be denoted as x(k, n).
  • the latter may be referred to as a respective time-frequency tile.
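The time-frequency decomposition and band grouping described above may be sketched as follows; the Hann window, frame length and band-edge choices are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

def stft_bands(x, frame_len, band_edges):
    """Transform each channel of one time frame into the frequency
    domain and group the frequency bins into bands.

    x          : array of shape (channels, frame_len), one time frame.
    band_edges : bin indices delimiting the bands; band k covers bins
                 band_edges[k] .. band_edges[k+1]-1.

    Returns a list of K arrays; element k holds x(i, k, n) for all
    channels i, i.e. the time-frequency tile of band k in this frame.
    """
    window = np.hanning(frame_len)
    spectrum = np.fft.rfft(x * window, axis=-1)   # bins x(i, b, n)
    return [spectrum[:, band_edges[k]:band_edges[k + 1]]
            for k in range(len(band_edges) - 1)]

# Example: group a 2-channel, 64-sample frame into 3 uneven bands,
# loosely mimicking a perceptual (e.g. Bark-like) band structure.
frame = np.random.default_rng(0).standard_normal((2, 64))
tiles = stft_bands(frame, 64, band_edges=[0, 4, 12, 33])  # rfft yields 33 bins
```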
  • obtaining the multi-channel audio signal 115 may comprise receiving the multi-channel audio signal 115 from the audio capturing portion 110 or over a communication network or reading the multi-channel audio signal from the memory 102, as described in the foregoing. Further along the lines described in the foregoing, the multi-channel audio signal 115 may comprise or it may be otherwise received with the audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115.
  • the multi-channel audio signal 115 together with the audio metadata, enables applying the spatial audio processing procedure to create the spatial audio signal 125 (directly) based on the multi-channel audio signal 115 or via creating an intermediate spatial audio signal based on the multi-channel audio signal 115 and carrying out the method 200 based on the intermediate spatial audio signal.
  • an audio effect to be applied to sounds in a given one of the indicated sound directions may be referred to as the audio effect associated with the given one of the indicated sound directions.
  • the one or more sound directions may be also referred to as respective sound directions of interest.
  • obtaining the indications of the one or more sound directions within the spatial audio image and the respective indications of the one or more audio effects associated therewith may comprise receiving or deriving the respective indications based on user input.
  • the audio processing portion 120 may be arranged to receive the respective indications of the sound directions of interest and the audio effects associated therewith as user input provided via a user interface (UI) of a device implementing the audio processing portion 120, or to derive these indications based on the user input received via the UI of the device implementing the audio processing portion 120.
  • the multi-channel audio signal 115 may be accompanied by a video stream (being) captured together with the multi-channel audio signal 115 that depicts at least some of the sound sources represented in the multi-channel audio signal 115 and that is displayed to the user via the UI.
  • the UI may provide the user with a possibility to indicate one or more sound sources of interest by pointing at their respective illustrations in the displayed video stream, which indicated one or more sound sources may be converted into respective one or more sound directions of interest based on their positions in the displayed images of the video stream.
  • the UI may provide the user with a possibility to select, for each of the one or more sound sources of interest, a respective audio effect to be applied to sounds originating from the respective one of the one or more sound sources of interest.
  • selection of the audio effect to be applied may be made from a set of one or more predefined audio effects.
  • the one or more sound directions of interest may be defined as respective horizontal directions within the spatial audio image represented by the multi-channel audio signal 115 and/or as respective vertical directions within the spatial audio image.
  • references to sound directions (at least implicitly) assume horizontal directions within the spatial audio image, whereas the examples readily generalize into applying vertical directions in addition to or instead of horizontal directions, mutatis mutandis.
  • a sound direction in horizontal direction of the spatial audio image may be defined as an (azimuth) angle with respect to a reference direction.
  • the reference direction is typically, but not necessarily, a direction directly in front of the assumed listening point.
  • the reference direction may be defined as 0° (i.e. zero degrees), whereas a sound direction that is to the left of the reference direction may be indicated by a respective angle in the range 0° < φ ≤ 180° and a sound direction that is to the right of the reference direction may be indicated by a respective angle in the range −180° ≤ φ < 0°, with directions at 180° and −180° indicating a sound direction opposite to the reference direction.
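A minimal helper implementing the azimuth convention above (0° at the reference direction, positive angles to the left, negative to the right, ±180° opposite), assuming angles are handled in degrees:

```python
def wrap_azimuth(angle_deg):
    """Wrap an azimuth angle to the interval (-180, 180] degrees,
    matching the convention: 0 is the reference direction, positive
    angles are to the left, negative angles to the right."""
    wrapped = angle_deg % 360.0
    if wrapped > 180.0:
        wrapped -= 360.0
    return wrapped
```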
  • the respective audio effect to be applied to sounds in one of the one or more sound directions of interest may involve any audio processing that modifies at least one audio characteristic of the respective sounds.
  • applicable audio effects include the following: audio equalization according to a predefined or user-selectable profile, pitch shifting in a predefined or user-selectable manner, vibrato at a predefined or user-selectable rate and at a predefined or user-selectable extent, tremolo in a predefined or user-selectable manner, a vocoder effect in a predefined or user-selectable manner, etc.
  • the audio effect to be applied may be the same for each of the one or more sound directions of interest or the audio effect to be applied may be different across the one or more sound directions of interest.
  • a first audio effect may be selected for a first sound direction and a second audio effect may be selected for a second sound direction (which is a sound direction different from the first sound direction), where the first and second audio effects may be the same (or similar) or the first and second audio effects may be different from each other.
  • derivation of the at least one sound signal 221 that represents sounds in the one or more sound directions of interest and of the background signal 223 that represents sounds in other directions of the spatial audio image based on the multi-channel audio signal 115 may comprise selectively applying an audio focusing technique to derive the at least one sound signal 221 and the background signal 223 in view of one or more of the following aspects:
  • Audio focusing aims at extracting a focused audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions. In a practical implementation, the focused audio signal may encompass sounds in sound directions falling within a focus pattern (substantially) centered at the focus direction while suppressing sounds in other sound directions, where the focus pattern directed to the focus direction encompasses a predefined sub-range of sound directions (substantially) centered at the focus direction, which sub-range may be also referred to as a focus width.
  • An example of audio focusing that may be especially suited for audio processing that relies on the multi-channel audio signal 115 obtained from the microphone array 112 comprises audio beamforming outlined in the foregoing.
  • Audio beamforming aims at extracting a beamformed audio signal that represents sounds in a beam direction while suppressing sounds in other sound directions, whereas in a practical implementation the beamformed audio signal may encompass sounds in sound directions falling within a beam pattern (substantially) centered at the beam direction while suppressing sounds in other sound directions.
  • a beam pattern directed to the beam direction encompasses a predefined sub-range of sound directions (substantially) centered at the beam direction, which sub-range may be also referred to as a beam width.
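As a minimal sketch of audio beamforming, a frequency-domain delay-and-sum beamformer for a far-field source and a linear array might look as follows; practical systems typically use more elaborate beam designs, and the array geometry, speed of sound and sign convention here are assumptions for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, an assumed value at room temperature

def delay_and_sum(mic_signals, mic_positions, beam_direction_deg, fs):
    """Frequency-domain delay-and-sum beamformer for a linear array.

    mic_signals   : array (n_mics, n_samples), simultaneous captures.
    mic_positions : array (n_mics,), positions along the array axis in m.
    Returns a monaural beamformed signal emphasizing the beam direction.
    """
    n_mics, n_samples = mic_signals.shape
    spectra = np.fft.rfft(mic_signals, axis=-1)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    # Far-field plane-wave steering delays toward the beam direction.
    delays = (mic_positions * np.sin(np.deg2rad(beam_direction_deg))
              / SPEED_OF_SOUND)
    phases = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * phases
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)

# Sanity check: identical signals with a broadside (0 degree) beam pass
# through unchanged, since all steering delays are zero.
rng = np.random.default_rng(1)
s = rng.standard_normal(256)
out = delay_and_sum(np.stack([s, s, s]),
                    np.array([-0.05, 0.0, 0.05]), 0.0, 48000)
```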
  • monaural audio focusing for deriving, based on the multi-channel audio signal 115, a monaural sound signal that represents sounds in a focus direction while suppressing sounds in other sound directions;
  • spatial audio focusing for deriving, based on the multi-channel audio signal 115, a spatial sound signal that represents sounds in a desired range of sound directions such that they are retained in their respective original sound directions within the spatial audio image represented by the multi-channel audio signal 115 while suppressing sounds in other sound directions.
  • the monaural audio focusing may be provided by directly applying a focus pattern of a desired focus width directed at a sound direction of interest to derive a respective monaural sound signal.
  • Monaural audio focusing may be accompanied or followed by determination of one or more spatial characteristics of the spatial audio image represented by the multi-channel audio signal 115, such as sound direction(s) and/or diffuseness.
  • the monaural focused audio signal resulting from monaural audio focusing may be used as basis for creating a respective spatial sound signal, where the monaural focused audio signal is panned to its original sound direction within the spatial audio image and possibly also modified to exhibit the determined diffuseness, thereby providing a spatial sound signal that represents sounds in the sound direction of interest while suppressing sounds in other sound directions.
  • the spatial audio focusing may be applied for a single sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the single sound direction to a first direction (e.g. to the left) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the single sound direction to a second direction that is opposite to the first direction (e.g. to the right) to derive a second channel of the spatial sound signal.
  • the spatial audio focusing may be applied for a range of sound directions from a first sound direction to a second sound direction, e.g. such that the spatial sound signal represents sounds in the sub-range of sound directions between the first and second sound directions, panned in their original spatial positions in the spatial audio image, while sounds in other sound directions of the spatial audio image are substantially suppressed.
  • the offsets applied in the spatial audio focusing procedure may be predefined ones or they may be selected in view of characteristics of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, e.g. in view of the positions of the microphones 112-k with respect to each other.
  • a spatial focused audio signal that represents sounds in a sound direction of interest may be derived via deriving a left channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the left of the sound direction of interest and deriving a right channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the right of the sound direction of interest.
  • for derivation of the left channel, the phase center may be at the location of a microphone on the left side of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, and for derivation of the right channel the phase center may be at the location of a microphone on the right side of said microphone array 112.
  • a phase center is at such a location that a microphone signal from the phase center is not delayed with respect to microphone signals from other microphones during the audio focusing procedure (e.g. in the context of audio beamforming).
  • the spatial audio beamforming may be carried out on the basis of a spatial audio signal that is derived based on the multi-channel audio signal 115 (or that is received at the audio processing portion 220 instead of the multi-channel audio signal 115).
  • sound directions represented by the spatial audio signal are derived via analysis of the spatial audio signal and the spatial audio signal is amplified in those time-frequency tiles that are found, via the analysis of the spatial audio signal, to represent a direction of interest and/or the spatial audio signal is attenuated in those time-frequency tiles that are found not to represent a direction of interest.
  • Spatial audio focusing provides the benefit of maintaining directional sounds of the spatial audio image in their correct sound directions, whereas in monaural audio focusing the sound direction is ‘lost’ and needs to be recreated in order to maintain similar spatial characteristics.
  • spatial audio focusing enables a more pleasant and more natural-sounding audio image that also ensures retaining the spatial cues that facilitate spatial perception by the human hearing system.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction φ1 to derive a first sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith, where the first sound signal comprises a monaural audio signal.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range of sound directions outside the first focus pattern applied for derivation of the first sound signal.
  • the range of sound directions outside the first focus pattern may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions (that each cover a respective sub-range of sound directions outside the first focus pattern) the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115.
  • the subtraction may involve creating a respective monaural focused (e.g. beamformed) signal for each channel of the multi-channel audio signal 115 and subtracting it from the respective channel.
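The subtraction-based derivation of the background signal can be illustrated with an idealized sketch in which the focused signal exactly captures the directional component; in practice the focusing only approximates it, so residual leakage remains in the background (which the later attenuation step addresses).

```python
import numpy as np

def background_by_subtraction(multi_channel, focused_per_channel):
    """Sketch: the background signal is what remains of each channel
    after removing the focused (e.g. beamformed) rendition of the
    sounds in the direction of interest for that channel.

    multi_channel       : array (channels, samples)
    focused_per_channel : array (channels, samples)
    """
    return multi_channel - focused_per_channel

# Synthetic demo: two channels containing a 'directional' component d
# plus distinct ambience; subtracting d recovers the ambience exactly.
rng = np.random.default_rng(2)
d = rng.standard_normal(128)              # directional sound
ambience = rng.standard_normal((2, 128))  # per-channel background
multi = ambience + d                      # d broadcast to both channels
focused = np.stack([d, d])                # ideal per-channel focus
background = background_by_subtraction(multi, focused)
```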
  • In a second example, there are two sound directions of interest, denoted as a first sound direction φ1 and a second sound direction φ2, where a first audio effect is to be applied to sounds in the first sound direction φ1 and a second audio effect is to be applied to sounds in the second sound direction φ2, where the first audio effect is different from the second audio effect.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction φ1 to derive a first sound signal and applying monaural audio focusing using a second focus pattern of a predefined focus width directed at the second sound direction φ2 to derive a second sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second sound direction φ2 and that has the second audio effect associated therewith, where both the first and second sound signals comprise respective monaural audio signals.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range(s) of sound directions outside the first and second focus patterns applied for derivation of the first and second sound signals.
  • the range(s) of sound directions outside the first and second focus patterns may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first and second sound signals from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
  • derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers a first range of sound directions from the first sound direction φ1 to the second sound direction φ2 to derive the first sound signal.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first and second sound directions φ1, φ2 and that has the first audio effect associated therewith, where the first sound signal comprises a spatial audio signal.
  • derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the (complementary) range of sound directions outside the first range of sound directions applied for derivation of the first sound signal.
  • the range of sound directions outside the first range of sound directions may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges.
  • the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
  • the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
  • the distance threshold Δthr may be set in dependence of characteristics of the multi-channel audio signal 115, e.g. in view of the spatial arrangement of the microphones of the microphone array 112 applied in capturing the microphone signals that serve as basis for the multi-channel audio signal 115.
  • the distance threshold Δthr may be set such that the respective focus patterns directed to the first and second sound directions φ1, φ2 do not overlap when the distance Δdist therebetween exceeds the distance threshold Δthr.
  • the selection of the type of audio focusing and/or the focus width may be carried out in dependence of the presence of further (directional) sound sources in sound directions between the first and second sound directions φ1, φ2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to not detecting presence of any directional sound sources in sound directions between the first and second sound directions φ1, φ2, whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to detecting presence of one or more directional sound sources in sound directions between the first and second sound directions φ1, φ2.
  • Presence of one or more directional sound sources (or lack thereof) between the first and second sound directions φ1, φ2 may be evaluated via analysis of the multi-channel audio signal 115, for example by directing one or more auxiliary focus patterns between the first and second sound directions φ1, φ2 to derive respective auxiliary sound signals and determining presence of one or more directional sound sources in sound directions between the first and second sound directions φ1, φ2 in response to any of the auxiliary sound signals exhibiting a signal level (e.g. signal energy) that is close to that of the sound signals derived from the first sound direction φ1 and/or from the second sound direction φ2.
  • an auxiliary sound signal may be considered to represent a further sound source in case its signal level is above a predefined percentage of that of the sound signal representing sounds in the first sound direction φ1 and/or that of the sound signal representing sounds in the second sound direction φ2.
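A sketch of this level-based detection of further sound sources; the 50 % energy ratio stands in for the "predefined percentage" and is an illustrative assumption, not a value given in the text.

```python
import numpy as np

def directional_source_between(aux_signals, first_signal, second_signal,
                               level_ratio=0.5):
    """Decide whether any auxiliary focus direction between the two
    directions of interest contains a directional sound source, by
    comparing signal energies against the louder of the two focused
    signals. level_ratio is an assumed detection threshold."""
    ref = max(np.sum(first_signal ** 2), np.sum(second_signal ** 2))
    return any(np.sum(a ** 2) >= level_ratio * ref for a in aux_signals)

# Example: a quiet auxiliary direction vs. a loud one.
first = np.ones(100)             # focused signal at the first direction
second = 0.5 * np.ones(100)      # focused signal at the second direction
quiet_aux = [0.01 * np.ones(100)]
loud_aux = [0.9 * np.ones(100)]
```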
  • In a fourth example, there are three sound directions of interest, denoted as a first sound direction φ1, a second sound direction φ2 and a third sound direction φ3, where a first audio effect is to be applied to sounds in the first sound direction φ1 and where a second audio effect that is different from the first audio effect is to be applied to sounds in the second and third sound directions φ2, φ3.
  • derivation of the at least one sound signal 221 with respect to the first sound direction φ1 may be carried out as described in the foregoing for the first example
  • derivation of the at least one sound signal 221 with respect to the second and third sound directions φ2, φ3 may be carried out as described in the foregoing for the third example.
  • the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction φ1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second and third sound directions φ2, φ3 and that has the second audio effect associated therewith, where the first sound signal comprises a monaural audio signal and where the second sound signal comprises a spatial audio signal.
  • An advantage arising from deriving the at least one sound signal 221 that each represent one or more sound directions within the spatial audio image represented by the multi-channel audio signal 115, e.g. according to the first to fourth examples described in the foregoing, is that the respective audio effect can be applied to a certain one of the at least one sound signal 221 such that it does not interfere with the respective audio effects applied to other ones of the at least one sound signal 221 and/or with the sounds represented by the background signal 223, thereby ensuring introduction of the audio effects as intended and, consequently, avoiding audible distortions that may arise from the audio effects interfering with each other and/or with the sounds in the background.
  • derivation of the at least one modified sound signal 225 based on the at least one sound signal 221 via application of the respective audio effect may comprise deriving the at least one modified sound signal 225 in dependence of the type and the number of sound signals provided as the at least one sound signal 221.
  • the at least one sound signal 221 may comprise one or more of the following:
  • the at least one sound signal 221 comprises one or more monaural sound signals that each represent sounds in a respective one of the one or more sound directions of interest and/or one or more spatial sound signals that each represent sounds in respective at least two of the one or more sound directions of interest, where each sound signal has a respective audio effect associated therewith.
  • the aspect of deriving the respective at least one modified sound signal 225 based on the at least one sound signal 221 comprises separately applying, for each of the sound signals, respective audio processing that implements the audio effect associated therewith, in dependence of the type of the respective sound signal:
  • derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of interest that the respective sound signal serves to represent;
  • derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal, whereas the sound content of the resulting spatial sound signal is readily arranged in the respective sound directions of interest that the respective sound signal serves to represent.
  • derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise deriving the spatial audio signal as the sum (or as another linear combination) of the at least one modified sound signal 225 and the background signal.
  • derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise combining the at least one modified sound signal 225 and the background signal in view of the energy level of the at least one modified sound signal 225 in relation to the energy level of the at least one sound signal 221.
  • a single modified sound signal may comprise a monaural audio signal that represents sounds in a single sound direction (e.g. in the first sound direction φ1) or a spatial audio signal that represents sounds in one or more sound directions.
  • the method 300 proceeds by obtaining a sound signal, a modified sound signal derived based on the sound signal, and the background signal 223, as indicated in block 302.
  • the sound signal may comprise one of the at least one sound signal 221 and the modified sound signal may comprise the respective one of the at least one modified sound signal 225.
  • the sound signal (in frequency domain) is denoted as S
  • the modified sound signal (in frequency domain) is denoted as S'
  • the background signal (in frequency domain) may be (also) denoted as B.
  • the method 300 further comprises computing respective energy of one or more frequency bands of the sound signal S, as indicated in block 304, and computing respective energy in one or more frequency bands of the modified sound signal S', as indicated in block 306.
  • the method may optionally further comprise computing respective energy in one or more frequency bands of the background signal B, as indicated in block 308.
  • the energy of the sound signal S in frequency band k (e.g. in the time-frequency tile S(k,n)) may be denoted as Es(k,n), and the energy of the modified sound signal S' in frequency band k may be denoted as Es'(k,n).
  • the method 300 further comprises attenuating the background signal 223 in those frequency bands k where the energy of the sound signal S is higher than the respective energy of the modified sound signal S', thereby deriving a modified background signal B', as indicated in block 310.
  • the method 300 proceeds into deriving the spatial audio signal 125 as a combination of the modified sound signal S' and the modified background signal B', as indicated in block 312.
  • the spatial audio signal 125 may be derived e.g. as a sum, as an average or as another linear combination of the modified sound signal S' and the modified background signal B'.
  • Attenuation of the background signal 223 enables avoiding audible disturbances e.g. in a scenario where implementation of the respective audio effect on one or more of the at least one sound signal 221 results in energy reduction in one or more frequency bands and where, due to limitations of real-life implementations of the audio focusing procedures, some audio components intended for inclusion in the at least one sound signal 221 only remain in the background signal 223.
  • such audio components that may remain in the background signal could interfere with the audio effect applied for the corresponding sound signal via introduction of the respective modified sound signal, whereas attenuation of some frequency bands of the background signal 223 serves to reduce or even completely avoid such interference.
  • the energy of the sound signal S may be considered to be higher than the respective energy of the modified sound signal S' in frequency band k in response to the difference in respective energies in the frequency band k exceeding an energy difference threshold Ethr(k) assigned for the frequency band k.
  • the energy difference threshold Ethr(k) may be the same across frequency bands or the energy difference threshold Ethr(k) may be varied across frequency bands, for example such that the energy difference threshold Ethr(k) substantially matches a masking threshold for the respective frequency band.
  • a masking threshold for a frequency band k represents the energy level required for an additional sound in order for it to be audible in the presence of another sound in the frequency band k.
  • the respective energy difference threshold E thr (k ) may be set to zero for one or more predefined frequency band or for all frequency bands in order to reduce computational complexity.
  • the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be independent of the extent of difference in energy (e.g. the energy difference ⁇ E(k)).
  • the amount of attenuation to be applied to the time-frequency tile B(k,n) of the background signal B to derive the corresponding time-frequency tile B'(k,n) of the modified background signal B' may be the same across frequency bands, e.g. the scaling factor g_k may be the same across frequency bands.
  • a gain factor g_k that has either a too small or a too large value may result in audible distortions in the spatial audio signal 125 and thus the gain factor g_k may be limited to a predefined range, e.g. to one that provides attenuation in a range from 2 to 4 dB.
  • the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be dependent on the extent of difference in energy (e.g. the energy difference ΔE(k)), for example such that the amount of attenuation to be applied in a frequency band k is directly proportional to the difference in respective energies of the sound signal S and the modified sound signal S' in the frequency band k.
  • This may be provided, for example, by setting the value of the scaling factor g_k for deriving the time-frequency tile B'(k,n) of the modified background signal B' such that it is directly proportional to the energy difference ΔE(k) in the frequency band k.
  • Figure 6 illustrates a block diagram of some components of an exemplifying apparatus 400.
  • the apparatus 400 may comprise further components, elements or portions that are not depicted in Figure 6.
  • the apparatus 400 may be employed e.g. in implementing one or more components described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 comprises a processor 416 and a memory 415 for storing data and computer program code 417.
  • the memory 415 and a portion of the computer program code 417 stored therein may be further arranged, with the processor 416, to implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 comprises a communication portion 412 for communication with other devices.
  • the communication portion 412 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses.
  • a communication apparatus of the communication portion 412 may also be referred to as a respective communication means.
  • the apparatus 400 may further comprise user I/O (input/output) components 418 that may be arranged, possibly together with the processor 416 and a portion of the computer program code 417, to provide a user interface for receiving input from a user of the apparatus 400 and/or providing output to the user of the apparatus 400 to control at least some aspects of operation of audio processing portion 220 that are implemented by the apparatus 400.
  • the user I/O components 418 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc.
  • the user I/O components 418 may be also referred to as peripherals.
  • the processor 416 may be arranged to control operation of the apparatus 400 e.g. in accordance with a portion of the computer program code 417 and possibly further in accordance with the user input received via the user I/O components 418 and/or in accordance with information received via the communication portion 412.
  • although the processor 416 is depicted as a single component, it may be implemented as one or more separate processing components.
  • although the memory 415 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent/semi-permanent/dynamic/cached storage.
  • the computer program code 417 stored in the memory 415 may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 400 when loaded into the processor 416.
  • the computer-executable instructions may be provided as one or more sequences of one or more instructions.
  • the processor 416 is able to load and execute the computer program code 417 by reading the one or more sequences of one or more instructions included therein from the memory 415.
  • the one or more sequences of one or more instructions may be configured to, when executed by the processor 416, cause the apparatus 400 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the apparatus 400 may comprise at least one processor 416 and at least one memory 415 including the computer program code 417 for one or more programs, the at least one memory 415 and the computer program code 417 configured to, with the at least one processor 416, cause the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the computer programs stored in the memory 415 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 417 stored thereon, which computer program code, when executed by the apparatus 400, causes the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
  • the computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program.
  • the computer program may be provided as a signal configured to reliably transfer the computer program.
  • reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processors, etc.
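The per-band background attenuation described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `attenuate_background` and the parameter names are hypothetical, per-band energies are assumed to be available already, and the sketch combines the proportional-attenuation variant with the 2–4 dB limiting mentioned above.

```python
def attenuate_background(e_s, e_s_mod, b_tiles, e_thr=0.0,
                         min_att_db=2.0, max_att_db=4.0):
    """Derive modified background tiles B'(k) from B(k).

    e_s, e_s_mod: per-band energies of the sound signal S and the
    modified sound signal S'. b_tiles: per-band background values B(k).
    Bands where the audio effect reduced energy by more than e_thr are
    attenuated; the attenuation grows with the relative energy drop but
    is kept within [min_att_db, max_att_db], per the range above.
    """
    out = []
    for e1, e2, b in zip(e_s, e_s_mod, b_tiles):
        delta = e1 - e2  # energy reduction caused by the audio effect
        if delta > e_thr:
            frac = min(1.0, delta / max(e1, 1e-12))  # relative drop in [0, 1]
            att_db = min_att_db + (max_att_db - min_att_db) * frac
            g = 10.0 ** (-att_db / 20.0)  # dB attenuation -> linear gain
            out.append(g * b)
        else:
            out.append(b)  # energy not reduced: leave the band untouched
    return out
```

For example, with `e_s=[1.0, 0.5]` and `e_s_mod=[0.2, 0.5]`, only the first band is attenuated (here by 3.6 dB); the second band passes through unchanged.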

Abstract

According to an example embodiment, an apparatus for audio processing is provided, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.

Description

Audio processing

TECHNICAL FIELD
The example and non-limiting embodiments of the present invention relate to processing of audio signals. In particular, various example embodiments of the present invention relate to audio processing that involves deriving a processed audio signal where characteristics of respective sounds in one or more sound directions of a spatial audio image represented by an input audio signal are modified.
BACKGROUND

With the development of microphone technologies and with increase in processing power and storage capacity available in mobile devices, many mobile devices, such as mobile phones, tablet computers, laptop computers, digital cameras, etc. are provided with microphone arrangements that enable capturing audio signals that represent an audio scene around the mobile device. Typically, the process of capturing such an audio signal using the mobile device comprises operating a microphone array of the mobile device to capture a plurality of microphone signals and processing the captured microphone signals into a digital audio signal for playback in the mobile device or for further processing in the mobile device, for storage in the mobile device and/or for transmission to one or more other devices to enable subsequent playback in the mobile device or in another device.
The digital audio signal may be one that conveys a spatial audio image that represents the audio scene around the mobile device, either as such or together with spatial metadata that defines at least some characteristics of the spatial audio scene. As examples in this regard, the recorded audio signal may comprise a multi-channel signal where each channel is (substantially directly) based on the respective microphone signal, or the recorded audio signal may comprise a spatial audio signal derived based on the microphone signals. Typically, although not necessarily, the digital audio is captured together with the associated video.

Capturing a digital audio signal that represents an audio scene around the mobile device provides interesting possibilities for processing the spatial audio image conveyed by the digital audio signal during the capture and/or after the capture. As an example in this regard, upon or after capturing the digital audio signal that conveys the spatial audio image that represents the audio scene around the mobile device, a user may wish to modify characteristics of one or more sound sources in the spatial audio image, for example, to improve perceptual quality or clarity of the one or more sound sources or for entertainment purposes. A straightforward approach for implementing such a procedure includes extracting, from the digital audio signal, a sound signal representing a sound source of interest, modifying the sound signal in a desired manner and inserting the modified sound signal back to the digital audio signal.
Extraction of the sound signal may be carried out by applying an audio focusing technique known in the art to the digital audio signal, where the audio focusing aims at representing sounds in a desired sound direction within a spatial audio image represented by the digital audio signal while excluding sounds in other sound directions. A typical solution for audio focusing involves audio beamforming, which is a technique well known in the art. Hence, an audio beamforming procedure aims at extracting a beamformed audio signal that represents sounds in a sound direction of interest while suppressing sound in other sound directions. In context of the audio beamforming, the sound direction of interest may be referred to as a beam direction. Other techniques for accomplishing audio focusing on a sound direction of interest include, for example, the one described in [1]. In context of the present disclosure, the term audio focusing is applied to refer to an audio processing technique that involves emphasizing sounds in certain sound directions of a spatial audio image in relation to sounds in other sound directions of the spatial audio image.
While in principle the aim of the beamforming is to extract or derive a beamformed audio signal that represents sounds in the beam direction without representing sounds in other sound directions, in a practical implementation isolation of sounds in a certain sound direction while completely excluding sounds in other directions is typically not possible. Instead, in practice the beamformed audio signal is typically an audio signal where sounds in the beam direction are emphasized in relation to sounds in other sound directions. Consequently, even if an audio beamforming procedure aims at a beamformed audio signal that only represents sounds in the beam direction, the resulting beamformed audio signal is one where sounds in the beam direction and sounds in a sub-range of directions around the beam direction are emphasized in relation to sounds in other directions in accordance with characteristics of a beam applied for the audio beamforming.
In this regard, the width of a beam applied in the audio beamforming procedure may be considered: the width of the beam may be indicated by a solid angle (typically in the horizontal direction only), which defines a sub-range of sound directions around the beam direction that are considered to fall within the beam. As an example in this regard, the solid angle may define a sub-range of sound directions around the beam direction such that sounds in sound directions outside the solid angle are attenuated at least a predefined amount in relation to a sound direction of maximum amplification (or minimum attenuation) within the solid angle. The predefined amount may be defined, for example, as 6 dB or 3 dB. However, definition of the beam width as the solid angle is a simplified model for indicating the width of the beam and hence the sub-range of sound directions encompassed by the beam when targeted to the beam direction. In a real-life implementation the beam does not strictly cover a well-defined range of sound directions around the sound direction of interest; rather, the beam has a width that may vary with audio signal characteristics, with the position of the beam direction within the spatial audio image under consideration and/or with audio frequency.
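The beam-width definition above can be illustrated numerically: given a beam pattern, the beam may be taken as the sub-range of directions whose gain stays within a predefined drop (e.g. 6 dB) of the maximum. The sketch below uses an idealized cardioid pattern as an assumed example; the pattern and the function names are illustrative, not part of the disclosure.

```python
import math

def cardioid_gain(theta_deg, beam_deg):
    """Amplitude gain of an idealized cardioid beam steered to beam_deg."""
    return 0.5 * (1.0 + math.cos(math.radians(theta_deg - beam_deg)))

def beam_width_deg(gain_fn, beam_deg, drop_db=6.0):
    """Width (in degrees, 1-degree sampling) of the sub-range of sound
    directions whose gain is within drop_db of the maximum gain, per
    the beam-width definition above."""
    peak = gain_fn(beam_deg, beam_deg)
    threshold = peak * 10.0 ** (-drop_db / 20.0)
    inside = [t for t in range(-180, 180) if gain_fn(t, beam_deg) >= threshold]
    return len(inside)
```

For the cardioid pattern the 6 dB beam width comes out at roughly 180 degrees, illustrating that even a well-behaved beam passes a wide sub-range of sound directions around the beam direction rather than isolating a single direction.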
Hence, in addition to sounds that represent a sound source of interest in the beam direction, the beamformed audio signal may also include sound components originating from other sound sources in sound directions close to the beam direction and/or ambient sound components (e.g. sounds that have no well-defined sound direction). Consequently, when the beamformed audio signal is applied as the sound signal that represents the sound source of interest in context of the above-described procedure for modifying characteristics of a sound source in the spatial audio image represented by a digital audio signal, modification of characteristics of the beamformed audio signal in order to modify characteristics of the sound source it serves to represent may unintentionally also modify sound components from other sound sources and/or ambient sound components. This in turn may result in an audio effect different from the one intended and/or in distortion of the modified digital audio signal that results from introduction of the modified beamformed audio signal back to the digital audio signal. While the above description applies audio beamforming as an example of audio focusing techniques in general, similar challenges are involved in audio focusing techniques of other kinds as well.
References:
[1] WO 2014/162171 A1

SUMMARY
According to an example embodiment, a method for audio processing is provided, the method comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal. 
According to another example embodiment, an apparatus for audio processing is provided, the apparatus configured to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, an apparatus for audio processing is provided, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, an apparatus for audio processing is provided, wherein the apparatus comprises at least one processor; and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
According to another example embodiment, a computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing is provided.
The computer readable medium may comprise a volatile or a non-volatile computer-readable record medium, thereby providing a computer program product comprising at least one computer readable non-transitory medium having program instructions for causing an apparatus to perform at least the method according to the example embodiment described in the foregoing stored thereon.
The exemplifying embodiments of the invention presented in this patent application are not to be interpreted to pose limitations to the applicability of the appended claims. The verb "to comprise" and its derivatives are used in this patent application as an open limitation that does not exclude the existence of also unrecited features. The features described hereinafter are mutually freely combinable unless explicitly stated otherwise.
Some features of the invention are set forth in the appended claims. Aspects of the invention, however, both as to its construction and its method of operation, together with additional objects and advantages thereof, will be best understood from the following description of some example embodiments when read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF FIGURES
The embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, where Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system according to an example;
Figures 2A and 2B illustrate respective block diagrams of some components and/or entities of an audio processing sub-system according to an example; Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion according to a non-limiting example;
Figure 4 illustrates a flowchart depicting a method according to an example;
Figure 5 illustrates a flowchart depicting a method according to an example; and
Figure 6 illustrates a block diagram of some elements of an apparatus according to an example.
DESCRIPTION OF SOME EMBODIMENTS
Figure 1 illustrates a block diagram of some components and/or entities of an audio processing system 100 according to a non-limiting example. The audio processing system 100 comprises an audio capturing portion 110 and an audio processing portion 120. The audio capturing portion 110 is coupled to a microphone array 112 and it is arranged to receive respective microphone signals from a plurality of microphones 112-1, 112-2, ..., 112-K and to record a captured multi-channel audio signal 115 based on the received microphone signals. The microphones 112-1, 112-2, ..., 112-K represent a plurality of (i.e. two or more) microphones, where an individual one of the microphones may be referred to as a microphone 112-k or as a microphone 112-j. Herein, the concept of the microphone array 112 is to be construed broadly, encompassing any arrangement of two or more microphones 112-k arranged in or coupled to a device implementing the audio processing system 100. The audio processing portion 120 is arranged to receive the multi-channel audio signal 115 from the audio capturing portion 110 and to derive a spatial audio signal 125 based on the multi-channel audio signal 115.
Each of the microphone signals represents the same captured sound while they provide a respective different representation of the captured sound, which difference depends on the positions of the microphones 112-k with respect to each other. For a sound source in a certain spatial position with respect to the microphone array 112, this results in a different representation of sounds originating from the certain sound source in each of the microphone signals: a microphone 112-k that is closer to the certain sound source captures the sound originating therefrom at a higher amplitude and earlier than a microphone 112-j that is further away from the certain sound source. Together with the knowledge regarding the positions of the microphones 112-k with respect to each other, such differences in amplitude and/or time delay enable derivation of a spatial audio signal that represents the audio scene at the time (and place) of capturing the microphone signals and/or applying spatial audio processing based on the multi-channel audio signal 115. The representation of the spatial audio scene captured in the microphone signals and, consequently, in the multi-channel audio signal 115 may be referred to as a spatial audio image.
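The amplitude and time-delay relationship described above can be made concrete for the simplest case of two microphones under a far-field assumption: a plane wave arriving from direction θ reaches the second microphone delayed by d·sin(θ)/c, where d is the microphone spacing and c the speed of sound. The helpers below are a simplified sketch of that geometric model with hypothetical names, not the disclosed method.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def delay_from_direction(theta_deg, spacing_m=0.1, c=SPEED_OF_SOUND):
    """Inter-microphone time delay (s) for a far-field source at theta_deg."""
    return spacing_m * math.sin(math.radians(theta_deg)) / c

def direction_from_delay(tau_s, spacing_m=0.1, c=SPEED_OF_SOUND):
    """Invert the far-field model; the asin argument is clamped to [-1, 1]
    to stay robust against delay estimates slightly outside the physical range."""
    x = max(-1.0, min(1.0, c * tau_s / spacing_m))
    return math.degrees(math.asin(x))
```

In a real system the delay would be estimated from the microphone signals themselves (e.g. via cross-correlation) and then mapped back to a sound direction with a model of this kind, given the known microphone positions.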
The audio capturing portion 110 may be arranged to record a respective digital audio signal based on each of the microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112 at a predefined sampling frequency using a predefined bit depth and to provide the recorded digital audio signals as the multi-channel audio signal 115 to the audio processing portion 120 for further audio processing therein. In this regard, each digital audio signal recorded at the audio capturing portion 110 may be considered as a respective channel of the multi-channel audio signal 115. The multi-channel audio signal 115 may comprise or may be accompanied with audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115, e.g. one or more of the following: the sampling rate (or sampling frequency) applied in the multi-channel audio signal 115, the bit depth applied in the multi-channel audio signal 115, channel configuration information that serves to define the relationship between the channels of the multi-channel audio signal 115, e.g. the respective positions and/or orientations of the microphones 112-k of the microphone array 112 (with respect to a reference position/orientation and/or with respect to other microphones 112-k of the microphone array 112) applied to capture the microphone signals serving as basis for the multi-channel audio signal 115, etc.

Figures 2A and 2B illustrate respective block diagrams of some components of respective audio processing sub-systems 100a and 100b according to a non-limiting example. The audio processing sub-system 100a comprises the microphone array 112 and the audio capturing portion 110 described in the foregoing together with a memory 102.
A difference to operation of the audio processing system 100 is that instead of providing the multi-channel audio signal 115 to the audio processing portion 120, the audio capturing portion 110 may be arranged to store the multi-channel audio signal 115 in the memory 102 for subsequent access by the audio processing portion 120. In this regard, the multi-channel audio signal 115 may be stored in the memory 102 together with the audio metadata described in the foregoing.
The audio processing sub-system 100b comprises the memory 102 and the audio processing portion 120 described in the foregoing. Hence, a difference in operation of the audio processing sub-system 100b in comparison to a corresponding aspect of operation of the audio processing system 100 is that instead of (directly) obtaining the multi-channel audio signal 115 from the audio capturing portion 110, the audio processing portion 120 reads the multi-channel audio signal 115, possibly together with the audio metadata, from the memory 102.
In the example provided via respective illustrations of Figures 2A and 2B, the memory 102 read by the audio processing portion 120 is the same one to which the audio capturing portion 110 stores the multi-channel audio signal 115 recorded or derived based on the respective microphone signals obtained from the microphones 112-k of the microphone array 112. As an example, such an arrangement may be provided by providing the audio processing sub-systems 100a and 100b in a single device that also includes (or otherwise has access to) the memory 102. In another example in this regard, the audio processing sub-systems 100a and 100b may be provided in a single device or in two separate devices and the memory 102 may comprise a memory provided in a removable memory device such as a memory card or a memory stick that enables subsequent access to the multi-channel audio signal 115 (possibly together with the metadata) in the memory 102 by the same device that stored this information therein or by another device. In a further variation of the example provided via respective illustrations of Figures 2A and 2B, the memory 102 may be provided in a further device, e.g. in a server device, that is communicatively coupled to a device providing both audio processing sub-systems 100a, 100b or to respective devices providing the audio processing sub-system 100a and the audio processing sub-system 100b by a communication network.
In a further variation of the example provided via respective illustrations of Figures 2A and 2B, the memory 102 may be replaced by a transmission channel or by a communication network that enables transferring the multi-channel audio signal 115 (possibly together with the audio metadata) from a first device providing the audio processing sub-system 100a to a second device providing the audio processing sub-system 100b. In this regard, the transfer of the multi-channel audio signal 115 may comprise transmitting/receiving the multi-channel audio signal 115 as an audio packet stream, whereas the audio capturing portion 110 may further operate to encode the multi-channel audio signal 115 into encoded audio data suitable for transmission in the audio packet stream and the audio processing portion 120 may further operate to decode the encoded audio data received in the audio packet stream into the multi-channel audio signal 115 (or into a reconstructed version thereof) for the audio processing therein.

The audio processing portion 120 may be arranged to carry out a spatial audio processing procedure that results in modifying audio characteristics of respective sounds of one or more sound sources included in the multi-channel audio signal 115 to provide the spatial audio signal 125. Without losing generality, modification of a sound may be referred to as an application of an audio effect to the respective sounds.
While the audio processing system 100 and operation thereof are throughout this disclosure predominantly described with references to the audio processing portion 120, 220 obtaining an input audio signal as the multi-channel audio signal 115 derived on basis of respective microphone signals received from the microphones 112-1, 112-2, ..., 112-K of the microphone array 112, in other examples the audio input signal to the audio processing portion 120, 220 may comprise an audio signal of other type that represents a spatial audio image and that enables audio focusing, thereby enabling derivation of the spatial audio signal 125 as the output of the audio processing portion 120, 220. Non-limiting examples of such input audio signals of other type that represent a spatial audio image include an Ambisonic (spherical harmonic) audio format and various multi-loudspeaker audio formats (such as 5.1-channel or 7.1-channel surround sound) known in the art. Depending on the type of the applied input audio signal, it may be accompanied by (audio) metadata that includes information that defines various aspects related to characteristics of the input audio signal, including spatial characteristics such as parametric data describing the spatial audio field, e.g. respective sound direction-of-arrival estimates, respective ratios between direct and ambient sound energy components, etc. for one or more frequency bands.
Figure 3 illustrates a block diagram of some components and/or entities of an audio processing portion 220 according to a non-limiting example. The audio processing portion 220 is arranged to obtain the multi-channel audio signal 115 and respective indications of a sound direction within a spatial audio image represented by the multi-channel audio signal 115 and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions. The audio processing portion 220 comprises an audio decomposition portion 222 for deriving, based on the multi-channel audio signal 115, at least one sound signal 221 that represents sounds in the one or more sound directions within the spatial audio image and a background signal 223 that represents sounds in other directions of the spatial audio image. The audio processing portion 220 further comprises an audio effect portion 224 for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225. The audio processing portion 220 further comprises an audio combiner 226 for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal 223.
In other examples, the audio processing portion 220 may include further entities in addition to those illustrated in Figure 3 and/or some of the entities depicted in Figure 3 may be combined with other entities while providing the same or corresponding functionality. In particular, the entities illustrated in Figure 3 serve to represent logical components of the audio processing portion 220 that are arranged to perform a respective function but that do not impose structural limitations concerning implementation of the respective entity. Hence, for example, respective hardware means, respective software means or a respective combination of hardware means and software means may be applied to implement any of the entities illustrated in Figure 3 separately from the other entities, to implement any sub-combination of two or more entities illustrated in Figure 3, or to implement all entities illustrated in Figure 3 in combination. As an example in this regard, the audio processing portion 220 may be provided as one comprising means for obtaining the multi-channel audio signal 115 that represents a spatial audio image, means for obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, means for deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, means for deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, and means for deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223.
The spatial audio processing procedure may be carried out, for example, in accordance with a method 200 illustrated in a flowchart of Figure 4. The method 200 commences from obtaining the multi-channel audio signal 115 that represents a spatial audio image, as indicated in block 202, and obtaining respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions, as indicated in block 204. The method 200 further comprises deriving, based on the multi-channel audio signal 115 in dependence of said one or more sound directions and the respective audio effects, the at least one sound signal 221 that represents sounds in the one or more sound directions and the background signal 223 that represents sounds in other directions of the spatial audio image, as indicated in block 206. The method 200 further comprises deriving, based on the at least one sound signal 221, via application of the respective audio effect, respective at least one modified sound signal 225, as indicated in block 208. Moreover, the method 200 further comprises deriving the spatial audio signal 125 based on the at least one modified sound signal 225 and on the background signal 223, as indicated in block 210.
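The flow of blocks 202 to 210 of the method 200 may be sketched as follows. This is a minimal illustration only: the `decompose`, per-direction effect and `combine` callables are hypothetical placeholders standing in for the audio decomposition portion, the audio effect portion and the audio combiner, not the claimed implementation.

```python
def process(multi_channel_audio, directions_and_effects, decompose, combine):
    """Sketch of method 200.

    directions_and_effects: list of (sound_direction, effect) pairs (block 204).
    decompose(audio, directions) -> (sound_signals, background)  (block 206).
    combine(modified_signals, background) -> spatial audio signal (block 210).
    """
    directions = [d for d, _ in directions_and_effects]
    # Block 206: split into per-direction sound signals and a background signal.
    sound_signals, background = decompose(multi_channel_audio, directions)
    # Block 208: apply the audio effect associated with each sound direction.
    modified = [effect(signal)
                for signal, (_, effect) in zip(sound_signals,
                                               directions_and_effects)]
    # Block 210: recombine modified sound signals with the background.
    return combine(modified, background)
```

With toy callables (e.g. a decomposition that picks the first channel as the focused signal and an effect that doubles amplitudes), the function wires the four blocks together in the order given in Figure 4.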
The operations described with references to blocks 202 to 210 of the method 200 may be varied or complemented in a number of ways without departing from the scope of the spatial audio processing procedure described in the present disclosure, for example in accordance with the examples described in the foregoing and in the following.
In a non-limiting example of applying the audio processing portion 220 to carry out the method 200, the audio decomposition portion 222 may be arranged to carry out operations, procedures and/or functions pertaining to block 206, the audio effect portion 224 may be arranged to carry out operations, procedures and/or functions pertaining to block 208, whereas the audio combiner 226 may be arranged to carry out operations, procedures and/or functions pertaining to block 210. In this regard, in the following the operation of the audio processing portion 220 is predominantly described with references to the method 200, while the corresponding examples readily pertain to respective entities of the audio processing portion 220 arranged to carry out the respective aspect of the method 200.
The audio processing portion 220 and the method 200 may be arranged to process the multi-channel audio signal 115 arranged in a sequence of input frames, each input frame including a respective time segment of digital audio signal for each of the channels, provided as a respective time series of input samples at a predefined sampling frequency (which may be defined, for example, in the audio metadata provided for the multi-channel audio signal 115). In a typical example, the audio processing portion 220 employs a fixed predefined frame length. In other examples, the frame length may be a selectable frame length that may be selected from a plurality of predefined frame lengths, or the frame length may be an adjustable frame length that may be selected from a predefined range of frame lengths. A frame length may be defined as the number of (input) samples L included in the frame for each channel of the multi-channel audio signal 115, which at the predefined sampling frequency maps to a corresponding duration in time. As an example in this regard, the audio processing portion 220 may employ a fixed frame length of 20 milliseconds (ms), which at a sampling frequency of 8, 16, 32 or 48 kHz results in a frame of L=160, L=320, L=640 and L=960 samples per channel, respectively. The frames may be non-overlapping or they may be partially overlapping. These values, however, serve as non-limiting examples and frame lengths and/or sampling frequencies different from these examples may be employed instead, depending e.g. on the desired audio bandwidth, on desired framing delay and/or on available processing capacity.
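The frame-length arithmetic above may be illustrated as follows; this is a simple sketch and the function name is an assumption, not part of the disclosure.

```python
def samples_per_frame(frame_ms, sampling_hz):
    """Number of samples L per channel in a frame of the given duration.

    L = frame duration (ms) * sampling frequency (Hz) / 1000.
    Raises if the frame does not contain a whole number of samples.
    """
    n, remainder = divmod(frame_ms * sampling_hz, 1000)
    if remainder:
        raise ValueError("frame length is not an integer number of samples")
    return n
```

For the 20 ms frame of the example, this yields 160, 320, 640 and 960 samples per channel at 8, 16, 32 and 48 kHz respectively.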
At least part of the processing carried out by the audio processing portion 220 and/or the method 200 may be carried out separately for a plurality of frequency bands of the multi-channel audio signal 115. Consequently, e.g. a respective entity of the audio processing portion 220 and/or a respective step of the method 200 may involve (at least conceptually) dividing or decomposing each channel of the multi-channel audio signal 115 into a respective plurality of frequency bands, thereby providing a respective time-frequency representation for each channel of the multi-channel audio signal 115. According to a non-limiting example, division into the frequency bands may comprise transforming each channel of the multi-channel audio signal 115 from time domain into a respective frequency-domain audio signal and arranging the resulting frequency-domain samples (also referred to as frequency bins) into respective plurality of frequency bands in each of the channels. Non-limiting examples of applicable time-to-frequency-domain transforms include short-time discrete Fourier transform (STFT) and a complex-modulated quadrature-mirror filter (QMF) bank. In case the multi-channel audio signal 115 is transformed into frequency domain for carrying out an aspect of audio processing by the audio processing portion 220 and/or the method 200 (separately) for a plurality of frequency bands, respective inverse transform from the frequency domain back to the time domain may be applied before providing the spatial audio signal 125 as the output of the audio processing portion 220 and/or the method 200.
In case the (conceptual) division into the plurality of frequency bands is applied, the number of frequency bands and respective bandwidths of the frequency bands may be selected e.g. in accordance with the desired frequency resolution and/or available computing power. In an example, the frequency band structure involves 24 frequency sub-bands according to the Bark scale, an equivalent rectangular band (ERB) scale or 3rd octave band scale known in the art. In other examples, a different number of frequency sub-bands that have the same or different bandwidths may be employed. A specific example in this regard is a single frequency sub-band that covers the input spectrum in its entirety or a single frequency sub-band that covers a continuous subset of the input spectrum. A frequency-domain sample that represents frequency bin b in time frame n of channel i of the frequency-domain audio signal may be denoted as x(i, b, n). Division into frequency bands may be provided by arranging or grouping one or more of consecutive frequency bins obtained for a given channel in the given frame into a respective frequency band, thereby providing a plurality of frequency bands k = 0, ..., K-1, where the frequency-domain audio signal in frequency band k in time frame n of channel i may be denoted as x(i, k, n) and where the frequency-domain audio signal in frequency band k in time frame n across all channels may be denoted as x(k, n). The latter may be referred to as a respective time-frequency tile.
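The grouping of consecutive frequency bins into frequency bands described above may be sketched as follows for one channel in one frame. The band-edge representation is an assumption chosen for illustration.

```python
def group_bins_into_bands(bins, band_edges):
    """Group consecutive frequency bins into K frequency bands.

    bins: per-bin frequency-domain samples x(i, b, n) for one channel/frame.
    band_edges: ascending bin indices [e0, e1, ..., eK] delimiting K bands;
    band k holds bins e_k .. e_{k+1}-1, giving the band signals x(i, k, n).
    """
    return [bins[band_edges[k]:band_edges[k + 1]]
            for k in range(len(band_edges) - 1)]
```

For example, eight bins split at edges [0, 2, 5, 8] yield three bands of two, three and three bins; a single band covering the whole spectrum corresponds to edges [0, len(bins)].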
Referring back to operations pertaining to block 202, obtaining the multi-channel audio signal 115 may comprise receiving the multi-channel audio signal 115 from the audio capturing portion 110 or over a communication network or reading the multi-channel audio signal from the memory 102, as described in the foregoing. Further along the lines described in the foregoing, the multi-channel audio signal 115 may comprise or it may be otherwise received with the audio metadata that includes information characterizing at least some aspects of the multi-channel audio signal 115. The multi-channel audio signal 115, together with the audio metadata, enables applying the spatial audio processing procedure to create the spatial audio signal 125 (directly) based on the multi-channel audio signal 115 or via creating an intermediate spatial audio signal based on the multi-channel audio signal 115 and carrying out the method 200 based on the intermediate spatial audio signal.

Referring back to operations pertaining to block 204, an audio effect to be applied to sounds in a given one of the indicated sound directions may be referred to as the audio effect associated with the given one of the indicated sound directions. Herein, the one or more sound directions may be also referred to as respective sound directions of interest. In this regard, obtaining the indications of the one or more sound directions within the spatial audio image and the respective indications of the one or more audio effects associated therewith may comprise receiving or deriving the respective indications based on user input.
As an example, the audio processing portion 120 may be arranged to receive the respective indications of the sound directions of interest and the audio effects associated therewith as user input provided via a user interface (UI) of a device implementing the audio processing portion 120 or derive these indications based on the user input received via the UI of the device implementing the audio processing portion 120. In this regard, the multi-channel audio signal 115 may be accompanied by a video stream (being) captured together with the multi-channel audio signal 115 that depicts at least some of the sound sources represented in the multi-channel audio signal 115 and that is displayed to the user via the UI. The UI may provide the user with the possibility to indicate one or more sound sources of interest by pointing at their respective illustrations in the displayed video stream, which indicated one or more sound sources may be converted into respective one or more sound directions of interest based on their positions in the displayed images of the video stream. Moreover, the UI may provide the user with the possibility to select, for each of the one or more sound sources of interest, a respective audio effect to be applied to sounds originating from the respective one of the one or more sound sources of interest. Herein, selection of the audio effect to be applied may be made from a set of one or more predefined audio effects. The one or more sound directions of interest may be defined as respective horizontal directions within the spatial audio image represented by the multi-channel audio signal 115 and/or as respective vertical directions within the spatial audio image.
In the following examples, for clarity and brevity of description, references to sound directions (at least implicitly) assume horizontal directions within the spatial audio image, whereas the examples readily generalize into applying vertical directions in addition to or instead of horizontal directions, mutatis mutandis. A sound direction in horizontal direction of the spatial audio image may be defined as an (azimuth) angle with respect to a reference direction. The reference direction is typically, but not necessarily, a direction directly in front of the assumed listening point. The reference direction may be defined as 0° (i.e. zero degrees), whereas a sound direction that is to the left of the reference direction may be indicated by a respective angle in the range 0° < α < 180° and a sound direction that is to the right of the reference direction may be indicated by a respective angle in the range -180° < α < 0°, with directions at 180° and -180° indicating a sound direction opposite to the reference direction.
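A helper that maps an arbitrary angle onto the azimuth convention described above (0° at the reference direction, positive angles to the left, negative angles to the right, 180° directly behind) might look as follows. This is an illustrative sketch, not part of the disclosed method.

```python
def wrap_azimuth(angle_deg):
    """Wrap any angle in degrees into the interval (-180, 180].

    0 = reference direction (front of the assumed listening point),
    positive = left of the reference, negative = right, 180 = behind.
    """
    a = angle_deg % 360.0          # first map into [0, 360)
    return a if a <= 180.0 else a - 360.0
```

For example, an angle of 270° (three quarter-turns to the left) wraps to -90°, i.e. 90° to the right of the reference direction.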
The respective audio effect to be applied to sounds in one of the one or more sound directions of interest may involve any audio processing that modifies at least one audio characteristic of the respective sounds. Non-limiting examples of applicable audio effects include the following: audio equalization according to a predefined or user-selectable profile, pitch shifting in a predefined or user-selectable manner, vibrato at a predefined or user-selectable rate and at a predefined or user-selectable extent, tremolo in a predefined or user-selectable manner, a vocoder effect in a predefined or user-selectable manner, etc. The audio effect to be applied may be the same for each of the one or more sound directions of interest or the audio effect to be applied may be different across the one or more sound directions of interest. To put it in other words, a first audio effect may be selected for a first sound direction and a second audio effect may be selected for a second sound direction (which is a sound direction different from the first sound direction), where the first and second audio effects may be the same (or similar) or the first and second audio effects may be different from each other.
Referring back to operations pertaining to block 206, derivation of the at least one sound signal 221 that represents sounds in the one or more sound directions of interest and the background signal 223 that represents sounds in other directions of the spatial audio image based on the multi-channel audio signal 115 may comprise selectively applying an audio focusing technique to derive the at least one sound signal 221 and the background signal 223 in view of one or more of the following aspects:
- the number of sound directions of interest,
- the spatial distance between sound directions of interest (if more than one),
- the respective audio effects selected for sounds in the sound directions of interest.
Audio focusing aims at extracting a focused audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions. In a practical implementation, the focused audio signal may encompass sounds in sound directions falling within a focus pattern (substantially) centered at the focus direction while suppressing sounds in other sound directions, where the focus pattern directed to the focus direction encompasses a predefined sub-range of sound directions (substantially) centered at the focus direction, which sub-range may also be referred to as a focus width. An example of audio focusing that may be especially suited for audio processing that relies on the multi-channel audio signal 115 obtained from the microphone array 112 comprises the audio beamforming outlined in the foregoing. Audio beamforming aims at extracting a beamformed audio signal that represents sounds in a beam direction while suppressing sounds in other sound directions; in a practical implementation the beamformed audio signal may encompass sounds in sound directions falling within a beam pattern (substantially) centered at the beam direction while suppressing sounds in other sound directions. Hence, a beam pattern directed to the beam direction encompasses a predefined sub-range of sound directions (substantially) centered at the beam direction, which sub-range may also be referred to as a beam width.
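As a toy illustration of delay-and-sum beamforming of the kind outlined above, the following sketch aligns and averages microphone channels for a far-field plane wave arriving from a given azimuth. The linear array geometry, integer-sample delays and the speed-of-sound constant are simplifying assumptions for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def delay_and_sum(channels, mic_positions_m, focus_deg, fs):
    """Toy delay-and-sum beamformer.

    channels: equal-length sample lists, one per microphone.
    mic_positions_m: microphone positions along the array axis (meters).
    focus_deg: look direction (0 = broadside); fs: sampling frequency (Hz).
    Delays each channel so that a plane wave from focus_deg is aligned,
    then averages, reinforcing sound from the focus direction.
    """
    # Per-microphone arrival-time offsets for the look direction.
    delays_s = [p * math.sin(math.radians(focus_deg)) / SPEED_OF_SOUND
                for p in mic_positions_m]
    # Convert to non-negative integer-sample steering shifts.
    shifts = [round((max(delays_s) - d) * fs) for d in delays_s]
    n = len(channels[0])
    out = [0.0] * n
    for ch, s in zip(channels, shifts):
        for i in range(n - s):
            out[i + s] += ch[i] / len(channels)
    return out
```

Real implementations would use fractional delays (or per-bin phase shifts in the frequency domain) and a 2-D/3-D array geometry; the sketch only conveys the align-and-average principle behind a beam pattern.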
For the purposes of the present disclosure, the following applications of audio focusing are considered:
- monaural audio focusing for deriving, based on the multi-channel audio signal 115, a monaural sound signal that represents sounds in a focus direction while suppressing sounds in other sound directions;
- spatial audio focusing for deriving, based on the multi-channel audio signal 115, a spatial sound signal that represents sounds in a desired range of sound directions such that they are retained in their respective original sound directions within the spatial audio image represented by the multi-channel audio signal 115 while suppressing sounds in other sound directions.

In this regard, the monaural audio focusing may be provided by directly applying a focus pattern of a desired focus width directed at a sound direction of interest to derive a respective monaural sound signal. Monaural audio focusing may be accompanied or followed by determination of one or more spatial characteristics of the spatial audio image represented by the multi-channel audio signal 115, such as sound direction(s) and/or diffuseness. The monaural focused audio signal resulting from monaural audio focusing may be used as basis for creating a respective spatial sound signal where the monaural focused audio signal is panned to its original sound direction within the spatial audio image and possibly also modified to exhibit the determined diffuseness, thereby providing a spatial sound signal that represents sounds in the sound direction of interest while suppressing sounds in other sound directions.
According to an example, the spatial audio focusing may be applied for a single sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the single sound direction to a first direction (e.g. to the left) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the single sound direction to a second direction that is opposite to the first direction (e.g. to the right) to derive a second channel of the spatial sound signal. In another example, the spatial audio focusing may be applied for a range of sound directions from a first sound direction to a second sound direction, e.g. by using a first focus pattern of a desired focus width that is directed at a sound direction that is offset from the first sound direction to a first direction (e.g. further away from the second direction) to derive a first channel of a spatial sound signal and using a second focus pattern of the desired focus width that is directed at a sound direction that is offset from the second sound direction to a second direction that is opposite to the first direction to derive a second channel of the spatial sound signal. Consequently, the spatial sound signal represents sounds in a sub-range of sound directions between the first and second sound directions such that they are panned in their original spatial positions in the spatial audio image while sounds in other sound directions of the spatial audio image are substantially suppressed. The offsets applied in the spatial audio focusing procedure may be predefined ones or they may be selected in view of characteristics of the microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115, e.g. in view of the positions of the microphones 112-k with respect to each other.
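The two-offset-beam construction described above may be sketched as follows, where `focus_fn` stands for any monaural focusing operation (e.g. a beamformer) and the offset value is an assumed parameter rather than one given by the disclosure.

```python
def spatial_focus(channels, focus_fn, direction_deg, offset_deg=15.0):
    """Derive a stereo (left, right) spatially focused pair for a single
    sound direction of interest by steering the same monaural focus
    operation to directions offset to either side of it.

    Positive azimuth = left of the reference direction, so the left
    channel uses direction + offset and the right uses direction - offset.
    """
    left = focus_fn(channels, direction_deg + offset_deg)
    right = focus_fn(channels, direction_deg - offset_deg)
    return left, right
```

For a range of directions from α1 to α2, the same idea applies with the two patterns offset outward from α1 and α2 respectively, so that sounds between them remain panned at their original positions.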
According to an example, along the lines described in the foregoing, a spatial focused audio signal that represents sound in a sound direction of interest (while suppressing sounds in other sound directions) may be derived via deriving a left channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the left of the sound direction of interest and deriving a right channel of the focused audio signal as a monaural focused audio signal (e.g. via audio beamforming) whose phase center is to the right of the sound direction of interest. As an example, for derivation of the left channel the phase center may be at a location of a microphone in a left side of a microphone array 112 applied for capturing the microphone signals that serve as basis for respective channels of the multi-channel audio signal 115 and for derivation of the right channel the phase center may be at the location of a microphone in the right side of said microphone array 112. Herein, a phase center is in such a location that a microphone signal from the phase center is not delayed with respect to microphone signals from other microphones during the audio focusing procedure (e.g. in context of audio beamforming). According to another example, the spatial audio beamforming may be carried out on the basis of a spatial audio signal that is derived based on the multi-channel audio signal 115 (or that is received at the audio processing portion 220 instead of the multi-channel audio signal 115). In such a scenario, sound directions represented by the spatial audio signal are derived via analysis of the spatial audio signal and the spatial audio signal is amplified in those time-frequency tiles that are found, via the analysis of the spatial audio signal, to represent a direction of interest and/or the spatial audio signal is attenuated in those time-frequency tiles that are found not to represent a direction of interest.
Spatial audio focusing provides the benefit of maintaining directional sounds of the spatial audio image in their correct sound directions, whereas in monaural audio focusing the sound direction is ‘lost’ and needs to be recreated in order to maintain similar spatial characteristics. In this regard, spatial audio focusing enables a more pleasant and more natural-sounding audio image that also ensures retaining spatial cues that facilitate spatial perception by the human hearing system.
According to a first example, there is only a single sound direction of interest, denoted as a first sound direction α1 and, consequently, there is only a single audio effect to be considered, referred to as a first audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction α1 to derive a first sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith, where the first sound signal comprises a monaural audio signal.
Still referring to the first example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range of sound directions outside the first focus pattern applied for derivation of the first sound signal. In this regard, the range of sound directions outside the first focus pattern may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions (that covers the range of sound directions outside the first focus pattern), the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions (that each cover a respective sub-range of sound directions outside the first focus pattern) the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the first example, the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115. In this regard, the subtraction may involve creating a respective monaural focused (e.g. beamformed) signal for each channel of the multi-channel audio signal 115 such that its phase center is at the location of the microphone applied for capturing the microphone signal that serves as basis for the respective channel of the multi-channel audio signal 115, deriving a respective channel of a multi-channel difference signal by subtracting the respective monaural focused signal from the respective channel of the multi-channel audio signal 115, and deriving the background signal 223 as a spatial audio signal based on the difference signal.
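The per-channel subtraction described above may be sketched as follows for time-domain sample lists; this is a simplified illustration that omits the focusing step producing the per-channel focused signals.

```python
def background_by_subtraction(multi_channel, focused_per_channel):
    """Derive a multi-channel difference signal as the background.

    multi_channel: input channels as lists of samples.
    focused_per_channel: for each channel, a monaural focused signal whose
    phase center matches that channel's microphone, so sample-wise
    subtraction removes the focused (foreground) sound from that channel.
    """
    return [[x - f for x, f in zip(channel, focused)]
            for channel, focused in zip(multi_channel, focused_per_channel)]
```

The resulting difference signal would then serve as basis for deriving the background signal 223 as a spatial audio signal.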
According to a second example, there are two sound directions of interest, denoted as a first sound direction α1 and a second sound direction α2, where a first audio effect is to be applied to sounds in the first sound direction α1 and a second audio effect is to be applied to sounds in the second sound direction α2, where the first audio effect is different from the second audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, monaural audio focusing using a first focus pattern of a predefined focus width directed at the first sound direction α1 to derive a first sound signal and applying monaural audio focusing using a second focus pattern of a predefined focus width directed at the second sound direction α2 to derive a second sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second sound direction α2 and that has the second audio effect associated therewith, where both the first and second sound signals comprise respective monaural audio signals.
Still referring to the second example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the range(s) of sound directions outside the first and second focus patterns applied for derivation of the first and second sound signals. In this regard, the range(s) of sound directions outside the first and second focus patterns may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions, the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the second example, the background signal 223 may be created, for example, by subtracting the first and second sound signals from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
Even though described in the foregoing with references to two sound directions of interest, the second example readily generalizes into a scenario where there are three or more sound directions of interest that have different audio effects indicated therefor, mutatis mutandis.
According to a third example, there are two sound directions of interest, denoted as a first sound direction α1 and a second sound direction α2, where the same audio effect is to be applied to sounds in the first sound direction α1 and to sounds in the second sound direction α2, which audio effect may be referred to as the first audio effect. In this example, derivation of the at least one sound signal 221 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers a first range of sound directions from the first sound direction α1 to the second sound direction α2 to derive the first sound signal. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first and second sound directions α1, α2 and that has the first audio effect associated therewith, where the first sound signal comprises a spatial audio signal.
Still referring to the third example, derivation of the background signal 223 may comprise applying, on the multi-channel audio signal 115, spatial audio focusing that covers the (complementary) range of sound directions outside the first range of sound directions applied for derivation of the first sound signal. In this regard, the range of sound directions outside the first range of sound directions may be divided into one or more sub-ranges of sound directions and the spatial audio focusing may be applied separately for each of the sub-ranges. In case there is only a single sub-range of sound directions, the resulting focused audio signal serves as the background signal 223, whereas in case of two or more sub-ranges of sound directions the background signal 223 may be derived as a combination (e.g. as a sum or as an average) of respective focused audio signals obtained for the two or more sub-ranges of sound directions.
In a variation of the third example, the background signal 223 may be created, for example, by subtracting the first sound signal from the multi-channel audio signal 115, wherein the subtraction may be carried out along the lines described in the foregoing in context of the first example, mutatis mutandis.
In another variation of the third example, the selection of the type of audio focusing and/or the focus width may be carried out in dependence of the distance between the first and second sound directions α1, α2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to the distance αdist between the first and second sound directions α1, α2 (e.g. αdist = |α1 - α2|) being smaller than a predefined distance threshold αthr (e.g. αdist < αthr), whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to the distance αdist between the first and second sound directions α1, α2 being larger than or equal to the predefined distance threshold αthr (e.g. αdist ≥ αthr). The distance threshold αthr may be set in dependence of characteristics of the multi-channel audio signal 115, e.g. in view of the spatial arrangement of microphones of the microphone array 112 applied in capturing the microphone signals that serve as basis for the multi-channel audio signal 115. In an example, the distance threshold αthr may be set such that respective focus patterns directed to the first and second sound directions α1, α2 do not overlap when the distance αdist therebetween exceeds the distance threshold αthr.
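The selection logic of this variation may be sketched as follows, with the threshold supplied by the caller as described above; the mode labels are illustrative names only.

```python
def focusing_mode(alpha1_deg, alpha2_deg, threshold_deg):
    """Choose the focusing strategy for two directions sharing one effect.

    Returns "spatial" (one spatial focus spanning both directions, as in
    the third example) when the angular distance alpha_dist = |alpha1 -
    alpha2| is below the threshold, otherwise "monaural" (two separate
    monaural focuses, as in the second example).
    """
    alpha_dist = abs(alpha1_deg - alpha2_deg)
    return "spatial" if alpha_dist < threshold_deg else "monaural"
```

In practice the threshold would be tied to the array geometry, e.g. chosen so that the two focus patterns no longer overlap once the distance reaches it.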
In a further variation of the third example, the selection of the type of audio focusing and/or the focus width may be carried out in dependence of presence of further (directional) sound sources in sound directions between the first and second sound directions α1, α2, e.g. such that derivation of the at least one sound signal 221 is carried out as described above for the third example in response to not detecting presence of any directional sound sources in sound directions between the first and second sound directions α1, α2, whereas derivation of the at least one sound signal 221 is carried out as described above for the second example in response to detecting presence of one or more directional sound sources in sound directions between the first and second sound directions α1, α2. Presence of one or more directional sound sources (or lack thereof) between the first and second sound directions α1, α2 may be evaluated via analysis of the multi-channel audio signal 115, for example by directing one or more auxiliary focus patterns between the first and second sound directions α1, α2 to derive respective auxiliary sound signals and determining presence of one or more directional sound sources in sound directions between the first and second sound directions α1, α2 in response to any of the auxiliary sound signals exhibiting a signal level (e.g. signal energy) that is close to that of sound signals derived from the first sound direction α1 and/or from the second sound direction α2. As an example in this regard, an auxiliary sound signal may be considered to represent a further sound source in case its signal level is above a predefined percentage of that of the sound signal representing sounds in the first sound direction α1 and/or that of the sound signal representing sounds in the second sound direction α2.
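The auxiliary-beam test of this variation may be sketched as follows, where the signal levels are assumed to be precomputed (e.g. as signal energies) and the fraction is an assumed tuning parameter standing in for the "predefined percentage".

```python
def has_intermediate_source(aux_levels, level1, level2, fraction=0.5):
    """Detect a further directional source between two directions of interest.

    aux_levels: levels of auxiliary focused signals steered between the
    first and second sound directions; level1/level2: levels of the sound
    signals in those two directions. An auxiliary beam is taken to reveal
    a further source if its level exceeds the given fraction of the level
    in either direction of interest.
    """
    threshold = fraction * min(level1, level2)
    return any(level > threshold for level in aux_levels)
```

When this returns True, the two directions would be handled with separate monaural focuses (second example) so the intermediate source is not swept into the focused signal.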
Even though described in the foregoing with reference to two sound directions of interest, the third example and any variations thereof readily generalize into a scenario where there are three or more sound directions of interest that have the same audio effect indicated therefor, mutatis mutandis.
According to a fourth example, there are three sound directions of interest, denoted as a first sound direction α1, a second sound direction α2 and a third sound direction α3, where a first audio effect is to be applied to sounds in the first sound direction α1 and where a second audio effect that is different from the first audio effect is to be applied to sounds in the second and third sound directions α2, α3. In this example, derivation of the at least one sound signal 221 with respect to the first sound direction α1 may be carried out as described in the foregoing for the first example, whereas derivation of the at least one sound signal 221 with respect to the second and third sound directions α2, α3 may be carried out as described in the foregoing for the third example. Hence, in this example the at least one sound signal 221 comprises the first sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and the second sound signal that represents sounds in the second and third sound directions α2, α3 and that has the second audio effect associated therewith, where the first sound signal comprises a monaural audio signal and where the second sound signal comprises a spatial audio signal.
An advantage arising from deriving the at least one sound signal 221, each of which represents one or more sound directions within the spatial audio image represented by the multi-channel audio signal 115, e.g. according to the first to fourth examples described in the foregoing, is that the respective audio effect may be applied to a certain one of the at least one sound signal 221 such that it does not interfere with respective audio effects applied to other ones of the at least one sound signal 221 and/or with the sound represented by the background signal 223, thereby ensuring introduction of the audio effects as intended and, consequently, avoiding audible distortions that may arise from the audio effects interfering with each other and/or with the sounds in the background.
Referring back to operations pertaining to block 208, derivation of the at least one modified sound signal 225 based on the at least one sound signal 221 via application of the respective audio effect may comprise deriving the at least one modified sound signal 225 in dependence of the type and the number of sound signals provided as the at least one sound signal 221. In view of the first to fourth examples described in the foregoing, the at least one sound signal 221 may comprise one or more of the following:
- a first monaural sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith,
- a first spatial sound signal that represents sounds in the first sound direction α1 and sounds in the second sound direction α2 and that has the first audio effect associated therewith,
- a first monaural sound signal that represents sounds in the first sound direction α1 and that has the first audio effect associated therewith and a second monaural sound signal that represents sounds in the second sound direction α2 and that has the second audio effect associated therewith.
Moreover, in general the at least one sound signal 221 comprises one or more monaural sound signals that each represent sounds in a respective one of the one or more sound directions of interest and/or one or more spatial sound signals that each represent sounds in respective at least two of the one or more sound directions of interest, where each sound signal has a respective audio effect associated therewith.
Consequently, the aspect of deriving the respective at least one modified sound signal 225 based on the at least one sound signal 221 (cf. block 208) comprises separately applying for each of the sound signals respective audio processing that implements the audio effect associated therewith in dependence of the type of the respective sound signal:
- In case the respective sound signal comprises a monaural audio signal, derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of interest the respective sound signal serves to represent;
- In case the respective sound signal comprises a spatial audio signal, derivation of the respective modified sound signal comprises applying the audio processing that implements the audio effect associated with the respective sound signal, whereas the sound content of the resulting spatial sound signal is readily arranged in the respective sound directions of interest the respective sound signal serves to represent.
Referring back to operations pertaining to block 210, according to an example, derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise deriving the spatial audio signal as the sum (or as another linear combination) of the at least one modified sound signal 225 and the background signal.
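The type-dependent processing of a sound signal described above may be sketched as follows (illustrative only; the simple gain "effect" and the constant-power panning law are assumptions of this sketch, as the text specifies neither the audio effect nor the panning method):

```python
import math

def apply_effect(signal, gain=0.8):
    # Placeholder "audio effect": per-sample gain scaling (stands in for e.g.
    # equalization, pitch shifting, vibrato, tremolo or a vocoder effect).
    return [gain * x for x in signal]

def pan_to_direction(mono, angle_deg):
    # Constant-power amplitude panning of a mono signal onto a stereo pair;
    # angle_deg in [-45, 45], where 0 is the centre between the two channels.
    theta = math.radians(angle_deg + 45.0)
    g_left, g_right = math.cos(theta), math.sin(theta)
    return [(g_left * x, g_right * x) for x in mono]

def derive_modified_signal(signal, direction=None, is_monaural=True):
    modified = apply_effect(signal)
    if is_monaural:
        # Monaural case: pan the modified content to its direction of interest.
        return pan_to_direction(modified, direction)
    # Spatial case: the sound directions are already encoded in the signal.
    return modified
```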
In another example, derivation of the spatial audio signal 125 based on the at least one modified sound signal 225 and the background signal may comprise combining the at least one modified sound signal 225 and the background signal in view of the energy level of the at least one modified sound signal 225 in relation to the energy level of the at least one sound signal 221. For clarity and brevity of description, in the following such energy-level-dependent combination is described with reference to a single modified sound signal that may comprise a monaural audio signal that represents sounds in a single sound direction (e.g. in the first sound direction α1) or a spatial audio signal that represents sounds in one or more sound directions (e.g. in the first sound direction α1 or in the first and second sound directions α1 and α2), whereas the example readily generalizes into scenarios that involve one or more sound signals that each represent sounds in a respective one or more sound directions, mutatis mutandis.
As an example, the energy-level-dependent combination of the at least one modified sound signal 225 and the background signal 223 may be carried out according to a method 300 illustrated in a flowchart of Figure 5. The method 300 proceeds from obtaining a sound signal, a modified sound signal derived based on the sound signal, and the background signal 223, as indicated in block 302. Herein, the sound signal may comprise one of the at least one sound signal 221 and the modified sound signal may comprise the respective one of the at least one modified sound signal 225. In the following, for notational convenience, the sound signal (in frequency domain) is denoted as S, the modified sound signal (in frequency domain) is denoted as S', and the background signal (in frequency domain) is denoted as B.
The method 300 further comprises computing respective energy of one or more frequency bands of the sound signal S, as indicated in block 304, and computing respective energy in one or more frequency bands of the modified sound signal S', as indicated in block 306. The method may optionally further comprise computing respective energy in one or more frequency bands of the background signal B, as indicated in block 308. Herein, the energy of the sound signal S in the frequency band k (e.g. in the time-frequency tile S(k,n)) may be denoted as Es(k,n), the energy of the modified sound signal S' in the frequency band k (e.g. in the time-frequency tile S'(k,n)) may be denoted as E's(k,n), and the energy of the background signal B in the frequency band k (e.g. in the time-frequency tile B(k,n)) may be denoted as EB(k,n). The method 300 further comprises attenuating the background signal 223 in those frequency bands k where the energy of the sound signal S is higher than the respective energy of the modified sound signal S', thereby deriving a modified background signal B', as indicated in block 310. As an example in this regard, the modified background signal B' may be derived by identifying those frequency bands where the energy Es(k,n) of the time-frequency tile S(k,n) in the sound signal S is higher than the energy E's(k,n) of the time-frequency tile S'(k,n) in the modified sound signal S' and deriving the corresponding time-frequency tile B'(k,n) of the modified background signal B' as B'(k,n) = gkB(k,n), where gk denotes a scaling factor with gk < 1, thereby providing an attenuated version of the corresponding time-frequency tile B(k,n) of the background signal 223. Moreover, the remaining time-frequency tiles of the modified background signal B' (i.e. the ones for which Es(k,n) ≤ E's(k,n)) may be obtained as copies of the respective time-frequency tiles of the background signal B, e.g. as B'(k,n) = B(k,n).
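A minimal sketch of blocks 304 to 310 described above, assuming a fixed scaling factor gk for all attenuated bands (the representation of a single frame as one per-band value per list entry is an assumption of this sketch):

```python
def attenuate_background(S, S_mod, B, g_k=0.7):
    """Per-band attenuation of the background signal for one time frame n.

    S, S_mod, B: per-frequency-band values of the sound signal, the modified
    sound signal and the background signal. Returns the modified background B'.
    """
    B_mod = []
    for s, s_mod, b in zip(S, S_mod, B):
        E_s, E_s_mod = abs(s) ** 2, abs(s_mod) ** 2   # band energies
        if E_s > E_s_mod:
            # The effect reduced energy in this band: attenuate the background
            # so residual leakage does not interfere with the effect.
            B_mod.append(g_k * b)
        else:
            B_mod.append(b)   # otherwise keep the background tile as-is
    return B_mod
```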
Finally, the method 300 proceeds into deriving the spatial audio signal 125 as a combination of the modified sound signal S' and the modified background signal B', as indicated in block 312. In this regard, the spatial audio signal 125 may be derived e.g. as a sum, as an average or as another linear combination of the modified sound signal S' and the modified background signal B'.
Attenuation of the background signal 223 enables avoiding audible disturbances e.g. in a scenario where application of the respective audio effect to one or more of the at least one sound signal 221 results in energy reduction in one or more frequency bands and where, due to limitations of real-life implementations of the audio focusing procedures, some audio components intended for inclusion in the at least one sound signal 221 only remain in the background signal 223. In such a scenario, without attenuation of the background signal, such audio components that may remain in the background signal could interfere with the audio effect applied for the corresponding sound signal via introduction of the respective modified sound signal, whereas attenuation of some frequency bands of the background signal 223 serves to reduce or even completely avoid such interference.
Referring back to operations pertaining to block 310, according to an example, the energy of the sound signal S may be considered to be higher than the respective energy of the modified sound signal S' in frequency band k in response to the difference in respective energies in the frequency band k exceeding an energy difference threshold Ethr(k) assigned for the frequency band k. As an example in this regard, the energy of the sound signal S in the frequency band k may be considered to be higher than the respective energy of the modified sound signal S' in response to the energy Es(k,n) of the time-frequency tile S(k,n) in the sound signal S exceeding the energy E's(k,n) of the time-frequency tile S'(k,n) in the modified sound signal S' by more than the energy difference threshold Ethr(k), e.g. in response to the energy difference ΔE(k) = Es(k,n) − E's(k,n) exceeding the energy difference threshold Ethr(k), i.e. when ΔE(k) > Ethr(k). Herein, the energy difference threshold Ethr(k) may be the same across frequency bands or may be varied across frequency bands, for example such that the energy difference threshold Ethr(k) substantially matches a masking threshold for the respective frequency band. Herein, a masking threshold for a frequency band k represents the energy level required for an additional sound in order to make it audible in presence of another sound in the frequency band k. In an example, the respective energy difference threshold Ethr(k) may be set to zero for one or more predefined frequency bands or for all frequency bands in order to reduce computational complexity.
Still referring back to operations pertaining to block 310, according to an example, the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be independent of the extent of the difference in energy (e.g. the energy difference ΔE(k)). In such a scenario, in one example the amount of attenuation to be applied to the time-frequency tile B(k,n) of the background signal B to derive the corresponding time-frequency tile B'(k,n) of the modified background signal B' may be the same across frequency bands, e.g. the scaling factor gk may be set to the same value across frequency bands, e.g. to a value that provides attenuation in a range from 2 to 4 dB. In another example, the scaling factor gk may be varied across frequency bands, for example such that the reduction in the time-frequency tile B'(k,n) of the modified background signal B' in comparison to the time-frequency tile B(k,n) of the background signal B via application of the gain factor gk matches or substantially matches the difference in energy between the respective time-frequency tile S'(k,n) of the modified sound signal S' and the time-frequency tile S(k,n) of the sound signal S, e.g. by setting the gain factor gk as gk = E's(k,n)/Es(k,n). Applying a gain factor gk that has either too small or too large a value may result in audible distortions in the spatial audio signal 125 and thus the gain factor gk may be limited into a predefined range, e.g. into one that provides attenuation in a range from 2 to 4 dB.
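The energy-matched gain gk = E's(k,n)/Es(k,n) with the limiting to a predefined range described above may be sketched as follows (illustrative only; whether the 2 to 4 dB range applies to the gain as an amplitude or an energy quantity is not specified in the text, and this sketch treats gk as an amplitude gain):

```python
def band_gain(E_s, E_s_mod, min_atten_db=2.0, max_atten_db=4.0):
    """Scaling factor g_k for one frequency band, clamped to 2..4 dB attenuation."""
    g_max = 10 ** (-min_atten_db / 20.0)   # least attenuation allowed (2 dB)
    g_min = 10 ** (-max_atten_db / 20.0)   # most attenuation allowed (4 dB)
    # Energy-matched gain g_k = E'_s(k,n) / E_s(k,n); fall back to unity
    # (then clamped) if the band carries no energy in the sound signal.
    g = E_s_mod / E_s if E_s > 0 else 1.0
    return min(max(g, g_min), g_max)
```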
In another example, the amount of attenuation to be applied to signals in those frequency bands of the background signal B for which the energy of the sound signal S is higher than the respective energy of the modified sound signal S' may be dependent on the extent of the difference in energy (e.g. the energy difference ΔE(k)), for example such that the amount of attenuation to be applied in a frequency band k is directly proportional to the difference in respective energies of the sound signal S and the modified sound signal S' in the frequency band k. This may be provided, for example, by setting the value of the scaling factor gk for deriving the time-frequency tile B'(k,n) of the modified background signal B' such that it is directly proportional to the energy difference ΔE(k) in the frequency band k.
Figure 6 illustrates a block diagram of some components of an exemplifying apparatus 400. The apparatus 400 may comprise further components, elements or portions that are not depicted in Figure 6. The apparatus 400 may be employed e.g. in implementing one or more components described in the foregoing in context of the audio processing portion 220.
The apparatus 400 comprises a processor 416 and a memory 415 for storing data and computer program code 417. The memory 415 and a portion of the computer program code 417 stored therein may be further arranged to, with the processor 416, implement at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
The apparatus 400 comprises a communication portion 412 for communication with other devices. The communication portion 412 comprises at least one communication apparatus that enables wired or wireless communication with other apparatuses. A communication apparatus of the communication portion 412 may also be referred to as a respective communication means.
The apparatus 400 may further comprise user I/O (input/output) components 418 that may be arranged, possibly together with the processor 416 and a portion of the computer program code 417, to provide a user interface for receiving input from a user of the apparatus 400 and/or providing output to the user of the apparatus 400 to control at least some aspects of operation of audio processing portion 220 that are implemented by the apparatus 400. The user I/O components 418 may comprise hardware components such as a display, a touchscreen, a touchpad, a mouse, a keyboard, and/or an arrangement of one or more keys or buttons, etc. The user I/O components 418 may be also referred to as peripherals. The processor 416 may be arranged to control operation of the apparatus 400 e.g. in accordance with a portion of the computer program code 417 and possibly further in accordance with the user input received via the user I/O components 418 and/or in accordance with information received via the communication portion 412.
Although the processor 416 is depicted as a single component, it may be implemented as one or more separate processing components. Similarly, although the memory 415 is depicted as a single component, it may be implemented as one or more separate components, some or all of which may be integrated/removable and/or may provide permanent / semi-permanent/ dynamic/cached storage.
The computer program code 417 stored in the memory 415, may comprise computer-executable instructions that control one or more aspects of operation of the apparatus 400 when loaded into the processor 416. As an example, the computer-executable instructions may be provided as one or more sequences of one or more instructions. The processor 416 is able to load and execute the computer program code 417 by reading the one or more sequences of one or more instructions included therein from the memory 415. The one or more sequences of one or more instructions may be configured to, when executed by the processor 416, cause the apparatus 400 to carry out at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
Hence, the apparatus 400 may comprise at least one processor 416 and at least one memory 415 including the computer program code 417 for one or more programs, the at least one memory 415 and the computer program code 417 configured to, with the at least one processor 416, cause the apparatus 400 to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220.
The computer programs stored in the memory 415 may be provided e.g. as a respective computer program product comprising at least one computer-readable non-transitory medium having the computer program code 417 stored thereon, the computer program code, when executed by the apparatus 400, causes the apparatus 400 at least to perform at least some of the operations, procedures and/or functions described in the foregoing in context of the audio processing portion 220. The computer-readable non-transitory medium may comprise a memory device or a record medium such as a CD-ROM, a DVD, a Blu-ray disc or another article of manufacture that tangibly embodies the computer program. As another example, the computer program may be provided as a signal configured to reliably transfer the computer program.
Reference(s) to a processor should not be understood to encompass only programmable processors, but also dedicated circuits such as field-programmable gate arrays (FPGA), application specific circuits (ASIC), signal processors, etc. Features described in the preceding description may be used in combinations other than the combinations explicitly described.
Although functions have been described with reference to certain features, those functions may be performable by other features whether described or not. Although features have been described with reference to certain embodiments, those features may also be present in other embodiments whether described or not.

Claims

1. An apparatus for audio processing, the apparatus comprising: means for receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; means for deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; means for deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and means for deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
2. The apparatus according to claim 1, wherein the means for deriving the spatial audio signal comprises: means for computing respective energies in one or more frequency bands of the at least one sound signal; means for computing respective energies in said one or more frequency bands of the at least one modified sound signal; means for attenuating the background signal in those ones of the one or more frequency bands where the energy of the at least one sound signal is higher than the respective energy of the at least one modified sound signal to derive a modified background signal; and means for deriving the spatial audio signal as a combination of the at least one modified sound signal and the modified background signal.
3. The apparatus according to claim 2, wherein the means for attenuating the background signal is arranged to consider the energy of the at least one sound signal to be higher than the respective energy of the at least one modified sound signal in a frequency band in response to the energy of the at least one sound signal exceeding the respective energy of the at least one modified sound signal in said frequency band by at least an energy difference threshold assigned for said frequency band.
4. The apparatus according to claim 2 or 3, wherein the means for attenuating the background signal is arranged to attenuate the background signal in a frequency band by an amount that depends on the difference in respective energies of the at least one sound signal and the at least one modified sound signal in said frequency band.
5. The apparatus according to any of claims 2 to 4, wherein the means for deriving the spatial audio signal comprises means for deriving a sum of the at least one modified sound signal and the background signal.
6. The apparatus according to claim 1, wherein the means for deriving the spatial audio signal comprises means for deriving a sum of the at least one modified sound signal and the background signal.
7. The apparatus according to any of claims 1 to 6, wherein said sound directions comprise a first sound direction and a second sound direction, wherein said audio effects comprise a first audio effect to be applied to sounds in the first sound direction and a second audio effect that is different from the first audio effect to be applied to sounds in the second sound direction, wherein the means for deriving the at least one sound signal comprises: means for applying, on the input audio signal, monaural audio focusing directed to the first sound direction to derive a first sound signal as a monaural audio signal that represents sounds in the first sound direction and that has the first audio effect associated therewith; and means for applying, on the input audio signal, monaural audio focusing directed to the second sound direction to derive a second sound signal as a monaural audio signal that represents sounds in the second sound direction and that has the second audio effect associated therewith.
8. The apparatus according to claim 7, wherein the means for applying monaural audio focusing is arranged to derive a monaural audio signal that represents sounds in a focus direction while suppressing sounds in other sound directions.
9. The apparatus according to any of claims 1 to 8, wherein said sound directions comprise a third sound direction and a fourth sound direction, wherein said audio effects comprise a third audio effect to be applied to sounds in the third and fourth sound directions, wherein the means for deriving the at least one sound signal comprises: means for applying, on the input audio signal, spatial audio focusing directed to a first range of sound directions from the third sound direction to the fourth sound direction to derive a third sound signal as a spatial audio signal that represents sounds in the third and fourth sound directions and that has the third audio effect associated therewith.
10. The apparatus according to any of claims 1 to 9, wherein the means for deriving the background signal comprises means for applying, on the input audio signal, spatial audio focusing directed to sound directions other than said one or more sound directions to derive the background signal as a spatial audio signal that represents sounds in said other sound directions.
11. The apparatus according to claim 10, wherein the means for deriving the background signal comprises one of the following: means for applying spatial audio focusing directed to a range of sound directions that covers sound directions other than said one or more sound directions to derive the background signal, means for dividing said range of sound directions that covers sound directions other than said one or more sound directions into two or more sub-ranges of sound directions, means for separately applying, for each of said sub-ranges, spatial audio focusing directed to a respective sub-range of sound directions to derive a respective background signal component, and means for combining the background signal components to derive the background signal.
12. The apparatus according to any of claims 9 to 11, wherein the means for applying spatial audio focusing, when directed to a range of sound directions from a first given sound direction to a second given sound direction, is arranged to provide a focused audio signal as a spatial audio signal where the sound directions within said range are retained in their respective spatial positions of the spatial audio image while suppressing sounds in sound directions outside said range.
13. The apparatus according to claim 12, wherein the means for applying spatial audio focusing is arranged to: apply, on the input audio signal, monaural audio focusing directed to a sound direction that is offset from the first given sound direction to a first direction to derive a first channel of the focused audio signal; and apply, on the input audio signal, monaural audio focusing directed to a sound direction that is offset from the second given sound direction to a second direction that is opposite to the first direction to derive a second channel of the focused audio signal.
14. The apparatus according to any of claims 1 to 9, wherein the means for deriving the background signal comprises means for subtracting the at least one sound signal from the input audio signal.
15. The apparatus according to any of claims 1 to 14, wherein the means for deriving the respective at least one modified sound signal comprises means for applying, to a respective sound signal, audio processing that implements the audio effect associated with the respective sound signal to derive the respective modified sound signal.
16. The apparatus according to claim 15, wherein the means for deriving the respective at least one modified sound signal is arranged to: in case the respective sound signal comprises a monaural audio signal, derive the respective modified sound signal via applying the audio processing that implements the audio effect associated with the respective sound signal and applying audio panning to arrange the resulting modified sound content in the respective sound direction of the spatial audio image the respective sound signal serves to represent; in case the respective sound signal comprises a spatial audio signal, derive the respective modified sound signal via applying the audio processing that implements the audio effect associated with the respective sound signal.
17. The apparatus according to any of claims 1 to 16, wherein said audio effect comprises one of the following: audio equalization, pitch shifting, vibrato, tremolo, a vocoder effect.
18. The apparatus according to any of claims 1 to 17, wherein the input audio signal comprises a multi-channel audio signal derived on basis of respective microphone signals obtained from respective one or more microphones of a microphone array.
19. An apparatus for audio processing, the apparatus comprising at least one processor and at least one memory including computer program code, which, when executed by the at least one processor, causes the apparatus to: receive an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; derive, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; derive, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and derive the spatial audio signal based on the at least one modified sound signal and on the background signal.
20. A method for audio processing, the method comprising: receiving an input audio signal that represents a spatial audio image and respective indications of a sound direction within the spatial audio image and an audio effect to be applied to sounds in the respective sound direction for one or more sound directions; deriving, based on the input audio signal in dependence of said one or more sound directions and the respective audio effects, at least one sound signal that represents sounds in said one or more sound directions and a background signal that represents sounds in other sound directions of the spatial audio image, wherein a sound signal represents sounds in at least one of the one or more sound directions and has one of the audio effects associated therewith; deriving, based on said at least one sound signal, via application of the respective audio effect, respective at least one modified sound signal; and deriving the spatial audio signal based on the at least one modified sound signal and on the background signal.
21. A computer readable medium comprising program instructions for causing an apparatus to perform at least the method according to claim 20.
PCT/FI2021/050234 2020-04-17 2021-03-31 Audio processing WO2021209683A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20205391 2020-04-17
FI20205391 2020-04-17

Publications (1)

Publication Number Publication Date
WO2021209683A1 true WO2021209683A1 (en) 2021-10-21

Family

ID=78084727


Country Status (1)

Country Link
WO (1) WO2021209683A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014091375A1 (en) * 2012-12-14 2014-06-19 Koninklijke Philips N.V. Reverberation processing in an audio signal
WO2016034454A1 (en) * 2014-09-05 2016-03-10 Thomson Licensing Method and apparatus for enhancing sound sources
GB2562518A (en) * 2017-05-18 2018-11-21 Nokia Technologies Oy Spatial audio processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
POLITIS, ARCHONTIS ET AL.: "Parametric spatial audio effects", Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK, 17 September 2012, XP055527425, Retrieved from the Internet <URL:https://www.dafx12.york.ac.uk/papers/dafx12_submission_22.pdf> [retrieved on 2018-08-08] *

Similar Documents

Publication Publication Date Title
US10785589B2 (en) Two stage audio focus for spatial audio processing
CN109313907B (en) Combining audio signals and spatial metadata
US9282419B2 (en) Audio processing method and audio processing apparatus
US11457310B2 (en) Apparatus, method and computer program for audio signal processing
CN113597776B (en) Wind noise reduction in parametric audio
US9743215B2 (en) Apparatus and method for center signal scaling and stereophonic enhancement based on a signal-to-downmix ratio
US20220060824A1 (en) An Audio Capturing Arrangement
JP2017530396A (en) Method and apparatus for enhancing a sound source
US20220014866A1 (en) Audio processing
KR101520618B1 (en) Method and apparatus for focusing the sound through the array speaker
US20220260664A1 (en) Audio processing
US11962992B2 (en) Spatial audio processing
WO2021209683A1 (en) Audio processing
US20220150624A1 (en) Method, Apparatus and Computer Program for Processing Audio Signals
CN112788515B (en) Audio processing
KR20200017969A (en) Audio apparatus and method of controlling the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788162

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21788162

Country of ref document: EP

Kind code of ref document: A1