WO2018234619A2 - Processing audio signals - Google Patents

Processing audio signals

Info

Publication number
WO2018234619A2
Authority
WO
WIPO (PCT)
Prior art keywords
far-field audio signal
microphone
Application number
PCT/FI2018/050397
Other languages
French (fr)
Other versions
WO2018234619A3 (en)
Inventor
Miikka Vilermo
Anssi RÄMÖ
Tuomas Virtanen
Joonas Nikunen
Original Assignee
Nokia Technologies Oy
Application filed by Nokia Technologies Oy
Publication of WO2018234619A2
Publication of WO2018234619A3

Classifications

    • H04R 3/005: Circuits for transducers, loudspeakers or microphones, for combining the signals of two or more microphones
    • H04R 1/326: Arrangements for obtaining a desired directional characteristic only, for microphones
    • H04R 2201/401: 2D or 3D arrays of transducers
    • H04R 2420/07: Applications of wireless loudspeakers or wireless microphones
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 7/302: Electronic adaptation of stereophonic sound system to listener position or orientation

Definitions

  • This specification relates to processing audio signals and, more specifically, to processing audio signals for mixing audio signals.
  • a stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output, such as a multi-channel loudspeaker arrangement or, with virtual surround processing, a pair of stereo headphones or a headset.
  • this specification describes a method comprising: receiving, via a first track, a near-field audio signal from a near-field microphone; receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters relates to the near-field microphone and a respective one of the channels of the microphone array; for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
  • the method may further comprise: receiving audio signals from a plurality of near-field microphones; detecting a user selection of one of the near-field microphones corresponding to the near-field audio signal; and automatically determining the set of time dependent room impulse response filters in response to the user selection of the near-field microphone corresponding to the near-field audio signal.
  • the method may further comprise detecting a user selection of a signal multiplier to apply to the track relating to the near-field microphone.
  • the method may further comprise: assigning audio signals from a particular near-field microphone with either a positive weighting or a negative weighting; and augmenting the far-field audio signal by adding the filtered near-field audio signal to the far-field audio signal if the weighting is positive or subtracting the filtered near-field audio signal from the far-field audio signal if the weighting is negative.
  • this specification describes a method comprising: receiving a plurality of near-field audio signals from respective near-field microphones; receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
  • the set of room impulse response filters may be determined using a blockwise linear least squares algorithm.
  • the set of room impulse response filters may be determined using a recursive least squares algorithm.
  • this specification describes a method comprising: receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identifying timeframes that contain near-field signal energy above a signal activity threshold; determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
  • this specification describes a method comprising: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
  • this specification describes a method comprising: receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and removing the residual signal components from the audio mix.
  • this specification describes apparatus configured to perform a method according to any preceding aspect.
  • this specification describes computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method according to any of the preceding first to fifth aspects.
  • this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, via a first track, a near-field audio signal from a near-field microphone; receive, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determine, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters relates to the near-field microphone and a respective one of the channels of the microphone array; for one or more channels of the microphone array, filter the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augment the far-field audio signal by applying the filtered near-field audio signal thereto.
  • this specification describes apparatus comprising at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive a plurality of near-field audio signals from respective near-field microphones; receive a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determine, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
  • this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identify timeframes that contain near-field signal energy above a signal activity threshold; determine a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and use the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
  • this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receive, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determine location information relating to the mobile source; transform the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and use the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
  • this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; apply a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtract the projected near-field audio signal from an audio mix of a far-field microphone array; determine a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; use the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and remove the residual signal components from the audio mix.
  • this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving, via a first track, a near-field audio signal from a near-field microphone; receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters relates to the near-field microphone and a respective one of the channels of the microphone array; for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
  • this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving a plurality of near-field audio signals from respective near-field microphones; receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
  • this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identifying timeframes that contain near-field signal energy above a signal activity threshold; determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
  • this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
  • this specification describes a computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least: receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and removing the residual signal components from the audio mix.
  • this specification describes apparatus comprising: means for receiving, via a first track, a near-field audio signal from a near-field microphone; means for receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; means for determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters relates to the near-field microphone and a respective one of the channels of the microphone array; means for, for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and means for augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
  • this specification describes apparatus comprising: means for receiving a plurality of near-field audio signals from respective near-field microphones; means for receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and means for determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
  • this specification describes apparatus comprising: means for receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for identifying timeframes that contain near-field signal energy above a signal activity threshold; means for determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the identified timeframes; and means for using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
  • this specification describes apparatus comprising: means for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; means for receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; means for determining location information relating to the mobile source; means for transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and means for using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
  • this specification describes apparatus comprising: means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; means for subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; means for determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; means for using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and means for removing the residual signal components from the audio mix.
  • Figure 1 is a schematic diagram of an audio mixing system and a recording space;
  • Figure 2 is a schematic block diagram of elements of certain embodiments;
  • Figure 3 is a flow chart illustrating operations carried out in certain embodiments;
  • Figure 4 is an illustration of a recording space;
  • Figure 5 is a schematic diagram of an audio mixing system and a recording space;
  • Figure 6 is a schematic diagram of an audio mixing system and a recording space as a target source is replaced with a replacement source;
  • Figure 7 is a schematic diagram of an audio mixing system and a recording space as a new source is introduced to an audio mixture;
  • Figure 8 is a flow chart illustrating operations carried out in certain embodiments; and
  • Figure 9 illustrates a user interface according to certain embodiments.
  • Embodiments of the present invention relate to mixing audio signals received from both a near-field microphone and from a far-field microphone.
  • Example near-field microphones include Lavalier microphones, which may be worn by a user to allow hands-free operation, and handheld microphones.
  • the near-field microphone may be location tagged.
  • the near-field signals obtained from near-field microphones may be termed "dry signals", in that they have little influence from the recording space and have relatively high signal-to-noise ratio (SNR).
  • Far-field microphones are microphones that are located relatively far away from a sound source.
  • an array of far-field microphones may be provided, for example in a mobile phone or in a Nokia Ozo (RTM) or similar audio recording apparatus.
  • Devices having multiple microphones may be termed multichannel devices and can detect an audio mixture comprising audio components received from the respective channels.
  • the microphone signals from far-field microphones may be termed “wet signals”, in that they have significant influence from the recording space (for example from ambience, reflections, echoes, reverberation, and other sound sources). Wet signals tend to have relatively low SNR. In essence, the near-field and far-field signals are in different “spaces”, near-field signals in a “dry space” and far-field signals in a “wet space”.
  • when the originally "dry" audio content from the sound sources reaches the far-field microphone array, the audio signals have changed because of the effect of the recording space. That is to say, the signal becomes "wet" and has a relatively low SNR.
  • the near-field microphones are much closer to the sound sources than the far-field microphone array. This means that the audio signals received at the near-field microphones are much less affected by the recording space.
  • the dry signal has a much higher signal-to-noise ratio and lower crosstalk with respect to other sound sources. Therefore, the near-field and far-field signals are very different, and mixing the two ("dry" and "wet") results in audible artefacts or non-natural sounding audio content.
  • a signal outside the system needs to be inserted into the audio mixture.
  • an audio stream from an external player such as a professional audio recorder may be mixed with audio content recorded in a particular recording space.
  • These signals need to be mixed together because only the microphone array can provide spatial audio content, for example for a virtual reality (VR) or augmented reality (AR) audio delivery system.
  • abbreviations used herein: VR (virtual reality), AR (augmented reality), 6DoF (six degrees of freedom), RIR (room impulse response).
  • Current audio mixing systems often rely on the expert audio mixer's personal abilities, and spatial information may be added to the "dry" near-field signal with signal processors that create artificial spatial information. Examples include reverb processors that generate spatial information with an algorithm for different-sounding and tunable spaces, or that rely on real impulse responses (convolution processors) with some amount of manual modification to parameters such as panning, volume, equalization, pre-echo, decay time and residual noise floor adjustments. More information may be found at http://www.nongnu.org/freeverb3/.
  • Embodiments of this invention provide a database where estimated RIR values are collected around the place of performance based on the captured "dry” and “wet” signals as well as available position data of the near-field microphones (which correspond to the position of the sound source).
  • the RIR data are estimated based on the dry to wet signal transfer function at every relevant position within the recording space.
  • the RIR database may be collected during an initial calibration phase where a sound source (for example, white noise, talking human, acoustic instrument, a flying drone with speaker, etc) is moving or is moved around the recording space either manually or automatically.
  • the RIR database can be used during the performance to insert additional sound sources to the audio mix in real-time.
  • the recording space might have higher SNR available in some circumstances, for example when a studio audience is missing; special signals such as white noise can also be used, which will provide more accurate room impulse responses for the whole frequency range.
  • continuous collection of new RIR data is performed during the recording itself, the new RIR data being inserted into the database as the actual performance occurs. Additional RIR data that is inserted into a pre-existing RIR database can also be collected during the actual performance.
  • Collection of RIR data during a performance can be performed in order to add more data points and make the database denser.
  • the position grid can be made denser.
  • data may be acquired for a 10 centimetre (cm) grid instead of an originally calibrated 20 cm grid so that more data points can be gathered.
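  • As an illustrative sketch only, the following shows one way such a position-keyed RIR store with a configurable grid step might be organised; the RIRDatabase class, grid quantisation and nearest-neighbour fallback are assumptions for the example, not details from this specification.

```python
import numpy as np

class RIRDatabase:
    """Illustrative position-keyed store of RIR filters on a calibration grid."""

    def __init__(self, grid_step_m=0.2):
        self.grid_step = grid_step_m
        self.filters = {}  # (ix, iy) grid cell -> RIR array of shape (channels, taps)

    def _cell(self, position):
        # Quantise a continuous (x, y) position to the calibration grid.
        return tuple(np.round(np.asarray(position) / self.grid_step).astype(int))

    def insert(self, position, rir):
        self.filters[self._cell(position)] = np.asarray(rir)

    def lookup(self, position):
        # Exact cell hit, otherwise nearest stored cell; a denser grid reduces error.
        cell = self._cell(position)
        if cell in self.filters:
            return self.filters[cell]
        cells = np.array(list(self.filters.keys()))
        nearest = cells[np.argmin(np.sum((cells - np.array(cell)) ** 2, axis=1))]
        return self.filters[tuple(nearest)]

# Example: a 10 cm grid; store a dummy 2-channel, 1024-tap RIR and query nearby.
db = RIRDatabase(grid_step_m=0.1)
db.insert((1.0, 2.0), np.random.randn(2, 1024))
h = db.lookup((1.04, 1.97))
```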
  • FIG. 1 shows an audio mixing system 100 which comprises a far-field audio recording device 101, such as a video/audio capture device, and one or more near-field audio recording devices 102, such as Lavalier microphones.
  • the far-field audio recording device 101 comprises an array of far-field microphones and may be a mobile phone, a stereoscopic video/audio capture device or similar recording apparatus such as the Nokia Ozo (RTM).
  • the near-field audio recording devices 102 may be worn by a user, for example a singer or actor.
  • the far-field audio recording device 101 and the near- field audio recording devices 102 are located within a recording space 103.
  • the far-field audio recording device 101 is in communication with an RIR processing apparatus 104 either via a wired or wireless connection.
  • the RIR processing apparatus 104 may be located within the recording space 103 or outside the recording space 103.
  • the RIR processing apparatus 104 has access to an RIR database 105 containing RIR data relating to the recording space 103.
  • the RIR database 105 may be physically incorporated with the RIR processing apparatus 104. Alternatively, the RIR database 105 may be maintained remotely with respect to the RIR processing apparatus 104.
  • FIG. 2 is a schematic block diagram of the RIR processing apparatus 104.
  • the RIR processing apparatus 104 may be incorporated within a general purpose computer. Alternatively, the RIR processing apparatus 104 may be a standalone apparatus.
  • the RIR processing apparatus 104 may comprise a short-time Fourier transform (STFT) module 201 for determining short-time Fourier transforms of received audio signals.
  • the RIR processing apparatus 104 comprises an RIR estimator 202 and a projection module 203.
  • the RIR processing apparatus 104 comprises a processor 204 which controls the STFT module 201, the RIR estimator 202 and the projection module 203.
  • the RIR processing apparatus 104 comprises a memory 205.
  • the memory comprises a volatile memory 206 such as random access memory (RAM).
  • the memory also comprises non-volatile memory 207, such as read-only memory (ROM).
  • the RIR processing apparatus 104 further comprises input/output 208 to enable communication with the far-field audio recording device 101 and with the RIR database 105 as well as any other remote entities.
  • the input/output 208 comprises hardware, software and/or firmware that allows the RIR processing apparatus 104 to communicate with the far-field audio recording device 101, the RIR database 105 and any other remote entities.
  • the RIR processing apparatus 104 comprises a processor 204 communicatively coupled with memory 205.
  • the memory 205 has computer readable instructions stored thereon, which when executed by the processor 204 causes the processor 204 to cause performance of various ones of the operations described with reference to Figure 3.
  • the RIR processing apparatus 104 may in some instances be referred to, in general terms, as "apparatus".
  • the RIR processing apparatus 104 may be of any suitable composition.
  • the processor 204 may be a programmable processor that interprets computer program instructions and processes data.
  • the processor 204 may include plural programmable processors.
  • the processor 204 may be, for example, programmable hardware with embedded firmware.
  • the processor 204 may be termed processing means.
  • the processor 204 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processor 204 may be referred to as computing apparatus.
  • the processor 204 is coupled to the memory (or one or more storage devices) 205 and is operable to read/write data to/from the memory 205.
  • the memory 205 may comprise a single memory unit or a plurality of memory units, upon which the computer readable instructions (or code) is stored.
  • the memory 205 may comprise both volatile memory and non-volatile memory.
  • the computer readable instructions/program code may be stored in the non-volatile memory and may be executed by the processor 204 using the volatile memory for temporary storage of data or of data and instructions. Examples of volatile memory include RAM, DRAM and SDRAM. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage and magnetic storage.
  • the memories in general may be referred to as non-transitory computer readable memory media.
  • the term 'memory' in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.
  • the computer readable instructions/program code may be pre-programmed into the RIR processing apparatus 104.
  • the computer readable instructions may arrive at the RIR processing apparatus 104 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD.
  • the computer readable instructions may provide the logic and routines that enable the devices/apparatuses to perform the functionality described above.
  • the combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.
  • Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic.
  • the software, application logic and/or hardware may reside on memory, or any computer media.
  • the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media.
  • a "memory" or “computer-readable medium” maybe any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
  • references to a processor or circuitry should be understood to encompass specialised circuits such as field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to express software for a programmable processor, or firmware such as the programmable content of a hardware device, whether as instructions for a processor or as configured or configuration settings for a fixed-function device, gate array or programmable logic device.
  • the far-field audio recording device 101 comprises a microphone array composed of far-field microphones.
  • the sound sources may be moving and have time-varying mixing properties, denoted by a room impulse response (RIR) $h_{cn}^{(p)}(\tau)$ for each channel $c$ at each time index $n$.
  • some of the sound sources (e.g. a speaker, a car, a piano or any other sound source) may each be provided with a near-field microphone.
  • the resulting mixture signal can be given as:
    $$y_c(n) = \sum_{p=1}^{P} \sum_{\tau} h_{cn}^{(p)}(\tau)\, x^{(p)}(n - \tau) + n_c(n) \qquad \text{(Equation 1)}$$
    wherein: $y_c(n)$ is the audio mixture in the time domain for each channel index $c$ of the far-field audio recording device 101, i.e. the signal received at each far-field microphone; $x^{(p)}(n)$ is the $p$-th near-field source signal in the time domain (source index $p$); $h_{cn}^{(p)}(\tau)$ is the partial impulse response in the time domain (sample delay index $\tau$), i.e. the room impulse response; and $n_c(n)$ is the noise signal in the time domain.
  • applying the STFT to the array signal allows expressing the capture in the time-frequency domain as:
    $$y_{ft} = \sum_{p=1}^{P} \sum_{d=0}^{D-1} h_{fd}^{(p)}\, x_{f,t-d}^{(p)} + n_{ft} \qquad \text{(Equation 2)}$$
    wherein: $y_{ft}$ is the STFT of the array mixture (frequency and frame indices $f$, $t$); $h_{fd}^{(p)}$ is the convolutive RIR in the time-frequency domain; $x_{ft}^{(p)}$ is the STFT of the $p$-th near-field source signal; and $n_{ft}$ is the STFT of the noise signal.
  • the length of the convolutive frequency domain RIR is D timeframes which can vary from a few timeframes to several tens of frames depending on the STFT window length and maximum effective amount of reverberation components in the recording environment.
  • This model differs from the usual assumption of instantaneous mixing in frequency domain with mixing consisting of complex valued weights only for the current timeframe.
  • the reverberated source signals are denoted by $\tilde{x}_{ft}$.
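  • A minimal numeric sketch of this convolutive time-frequency model (Equation 2) is given below; the toy dimensions and random data are assumptions for illustration only.

```python
import numpy as np

# Toy dimensions for the sketch: P sources, F frequency bins, T frames,
# D convolutive RIR frames (the model reduces to instantaneous mixing when D = 1).
P, F, T, D = 2, 257, 100, 8
rng = np.random.default_rng(0)

x = rng.standard_normal((P, F, T)) + 1j * rng.standard_normal((P, F, T))  # near-field STFTs
h = rng.standard_normal((P, F, D)) + 1j * rng.standard_normal((P, F, D))  # convolutive RIRs
n = 0.01 * (rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T)))  # noise STFT

y = n.copy()
for p in range(P):
    for d in range(D):
        # Each RIR frame d mixes a d-frames-delayed copy of the near-field STFT.
        y[:, d:] += h[p, :, d, None] * x[p, :, : T - d]
```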
  • an audio signal $y_c(n)$ is received from the far-field audio recording device 101.
  • an audio signal $x^{(p)}(n)$ is received from the near-field audio recording device 102 for those sound sources provided with a near-field audio recording device 102.
  • the location of the mobile source is determined. The location can be determined using information received from a tag with which the mobile source is provided. Alternatively, the location may be calculated using multilateration techniques described below.
  • a short-time Fourier transform (STFT) is applied to both far-field and near- field audio signals.
  • Alternative transforms may be applied to the audio signals as described below.
  • time differences between the near-field and far-field audio signals can be taken into account. However, if the time differences are large (several hundreds of milliseconds or more) a rough alignment may be done prior to the process commencing. For example, if a wireless connection between a near-field microphone and RIR processor causes a delay, the delay may be manually fixed by delaying the other signals in the RIR processor or by an external delay processor which may be implemented as hardware or software.
  • a signal activity detection (SAD) may be estimated from the near-field signal in order to determine when the RIR estimate is to be updated. For example, if a source does not emit any signal over a time period, its RIR value does not need to be estimated.
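  • A minimal energy-threshold sketch of such signal activity detection is shown below; the frame length and the threshold relative to the peak frame energy are illustrative assumptions, not values from this specification.

```python
import numpy as np

def signal_activity(x_nf, frame_len=1024, threshold_db=-40.0):
    """Flag frames of the near-field signal whose energy exceeds a threshold.

    x_nf: 1-D time-domain near-field signal. Returns a boolean array, one
    entry per frame; frames more than |threshold_db| below the loudest
    frame are treated as inactive, so no RIR update is made for them.
    """
    n_frames = len(x_nf) // frame_len
    frames = x_nf[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > energy_db.max() + threshold_db
```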
  • the STFT values $y_{ft}$ and $x_{ft}^{(p)}$ are input to the RIR estimator 202 at the RIR estimation step.
  • the RIR estimation may be performed using a block-wise linear least squares (LS) projection in offline operation mode, that is where the RIR estimation is performed as part of a calibration operation.
  • alternatively, a recursive least squares (RLS) algorithm may be used in online operation mode.
  • the RLS algorithm may also be used in offline operation instead of the block-wise linear LS algorithm. In any case, as a result, a set of RIR filters in the time-frequency domain is obtained. The process ends at step 3.7.
  • the RIR $h_{fd}$ can be thought of as a projection operator from the near-field signal space to the far-field signal space.
  • the projection is time, frequency and channel dependent.
  • the parameters of the RIR $h_{fd}$ can be estimated using linear least squares (LS) regression, which is equivalent to finding the projection between the near-field and far-field signal spaces.
  • LS regression for estimating RIR values may be applied for moving sound sources by processing the input signal in blocks of approximately 500ms and the RIR values may be assumed to be stationary within each block. Block-wise processing with moving sources assumes that the difference between RIR values associated with adjacent frames is relatively small and remains stable within the analysed block. This is valid for sound sources that move at low speeds in an acoustic environment where small changes in source position with respect to the receiver do not cause substantial change in the RIR value.
  • the method of LS regression is applied individually for each source signal in each channel of the array. Additionally, the RIR values are frequency dependent and each frequency bin of the STFT is processed individually. Thus, in the following discussion it should be understood that the processing is repeated for all channels and all frequencies.
  • the block-wise LS estimation can be arranged as the linear system $\bar{\mathbf{y}} = X\mathbf{h}$, wherein: $\bar{\mathbf{y}}$ is a vector of the far-field STFT coefficients of the block; $X$ is a matrix containing the near-field STFT coefficients starting from frame $t-0$ and the delayed versions starting from $t-1, \ldots, t-(D-1)$; and $\mathbf{h}$ is the RIR to be estimated.
  • the length of the RIR filter to be estimated is $D$ STFT frames.
  • the block length is $T+1$ frames, and $T+1 > D$ in order to avoid overfitting due to an overdetermined model.
  • the projected source signal for a single block can be trivially obtained as $\tilde{\mathbf{x}} = X\hat{\mathbf{h}}$, where $\hat{\mathbf{h}}$ is the LS estimate of the RIR.
  • Equation 9 demonstrates the removal of a particular source signal from the audio mixture. As well as removing a source from the audio mixture, it is also possible to add the effect of a source to the audio mix. This may be done by using addition instead of subtraction with a user specified gain.
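  • The following sketch illustrates the block-wise LS projection and the removal/addition of Equation 9 for a single channel and frequency bin, using numpy's least-squares solver; the toy signal, the 4-tap RIR and the gain value are assumptions for the example.

```python
import numpy as np

def blockwise_ls_rir(x_nf, y_ff, D):
    """Estimate a length-D convolutive RIR for one frequency bin and channel.

    x_nf, y_ff: complex STFT coefficient sequences of the near-field and
    far-field signals over one analysis block. Solves y = X h in the
    least-squares sense, where X holds x and its D-1 delayed copies.
    """
    T1 = len(x_nf)
    X = np.zeros((T1, D), dtype=complex)
    for d in range(D):
        X[d:, d] = x_nf[: T1 - d]  # delayed near-field coefficients
    h, *_ = np.linalg.lstsq(X, y_ff, rcond=None)
    return h, X

# Toy data for one bin: a known 4-tap RIR plus a little noise.
rng = np.random.default_rng(1)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
h_true = np.array([1.0, 0.5j, 0.2, -0.1])
y = np.convolve(x, h_true)[:64] + 0.01 * rng.standard_normal(64)

h_est, X = blockwise_ls_rir(x, y, D=4)
x_projected = X @ h_est          # near-field signal projected into the far-field space
y_removed = y - x_projected      # remove the source from the mixture (Equation 9)
y_added = y + 0.8 * x_projected  # or add it with a user-specified gain
```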
  • the RIR estimation presented in embodiments of the present invention allows removal of a target source from the audio mixture or addition of a source to the audio mixture of the far-field audio recording device 101.
  • the signal emitted by the source can be replaced by augmenting separate content to the array mixture of the far-field audio recording device 101.
  • the problem of augmenting separate signals using the RIR values estimated from the target source in prior approaches lies in the fact that the source signal is not broadband and estimates of RIR values from frequencies with no signal energy emitted are unreliable. Having different spectral content (source signal frequency occupancy in each frame) leads to poor subjective quality of the synthesized augmented source since accurate RIR data for all frequencies are not available.
  • embodiments herein described provide a calibration method with a constant broadband signal which is used to estimate and store RIR values from substantially all possible locations of the recording space.
  • the purpose of the calibration stage is that reliable broadband RIR data from all positions of the recording space are captured before the actual operation (i.e. before an audio recording or broadcast).
  • the location data may be either relative or absolute such as GPS coordinates.
  • the target source is removed from the mixture using the block-wise LS or RLS method described above.
  • the direction of arrival (DOA) is estimated either acoustically or using other localization techniques.
  • the estimated RIR value in the time domain relating to each channel of the array of the far-field audio device 101 is analysed.
  • the first received RIR sample that is above a threshold gives an estimate of the delay at which the sound arrives at the nearest microphone of the far-field audio device 101. Comparing the delays from all microphones of the far-field audio device 101 provides the time differences of arrival (TDOA) between microphones in the array of the far-field audio device 101. From these values the direction can be calculated using multilateration methods that are known in the art.
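  • An illustrative sketch of deriving TDOAs from the time-domain RIR onsets, as described above, might look as follows; the onset threshold (a fraction of each channel's peak) is an assumed choice, and the resulting delays would feed a standard multilateration solver.

```python
import numpy as np

def tdoas_from_rirs(rirs_time, fs, threshold_ratio=0.1):
    """Estimate per-microphone arrival delays from time-domain RIR estimates.

    rirs_time: array of shape (n_mics, n_taps), one estimated RIR per array
    channel. The first tap exceeding a fraction of that channel's peak is
    taken as the direct-path arrival; delays are returned in seconds,
    relative to the earliest channel.
    """
    onsets = []
    for h in np.abs(rirs_time):
        onsets.append(np.argmax(h > threshold_ratio * h.max()))
    delays = np.asarray(onsets, dtype=float) / fs
    return delays - delays.min()  # TDOAs for a standard multilateration solver
```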
  • the augmented source is synthesized using the target source DOA estimates for retrieving the RIR corresponding to each DOA from the database generated in the calibration stage.
  • the length of the calibration stage depends on the size of the recording space and the required density of the database. The length of the calibration stage may vary from around 10 seconds to several minutes.
  • Figure 4 is a plan view of a recording space 103 in accordance with an embodiment whereby audio data are recorded as part of a calibration stage.
  • a speaker 400 is provided with a near-field microphone 102 such as a Lavalier microphone or a handheld microphone.
  • the speaker 400 may also be provided with a location tag 401.
  • a far-field audio recording device 101 is provided towards the centre of the recording space 103.
  • the speaker 400 walks around the recording space 103 along a trajectory T.
  • the speaker 400 speaks so that audio data is recorded by both the far-field audio recording device 101 and the near-field microphone 102.
  • the speaker 400 may also be playing an instrument or carrying a sound-producing loudspeaker.
  • the room impulse response (RIR) data are collected around the place of performance based on the captured "dry” and “wet” signals as well as available position data from the location tag 401.
  • the RIR data are estimated based on the dry to wet signal transfer function at every relevant position with a processing unit using one of the algorithms described above.
  • Figure 5 is a plan view of a recording space 103 in accordance with another embodiment, in which one or more drones 500 are used during the calibration stage.
  • each drone 500 is provided with a near-field microphone 102.
  • Each of the drones 500 emits a noise, either through a loudspeaker or merely from the drone rotors.
  • Two or more far-field audio recording devices 101 are also provided.
  • the RIR database 105 may be collected during an initial calibration phase where an audio source of wideband noise, for example white noise, MLSA sequence, pseudo random noise, or a talking human, an acoustic instrument, a flying drone with speaker or a ground based robot, is moving or is moved around the recording space 103 either manually or automatically.
  • the RIR data are more accurate over the whole spectrum.
  • the recording space will also have higher SNR available, for example when the audience is missing from the recording space 103. This may provide more accurate and/or faster RIR measurements.
  • RIR data may be collected during the performance itself. This may be instead of the calibration phase described above or in addition to the calibration phase. In the latter scenario, the reliability of the RIR data captured during the calibration process described above using the block-wise linear least squares projection may be improved by capturing further RIR data during the performance itself.
  • RIR data estimated are generally valid only for the frequency indices at which the source produced meaningful acoustic output. Usually the RIR data are applied to the same close-field signal from which they were estimated, and no mismatch between time-frequency content and RIR data occurs. However, for example in the case of augmenting a completely new near-field signal which is very different from the signal used to estimate the RIR data, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmented signal has significant energy.
  • RIR data estimated at each position of the recording space 103 are used to gradually build a database of broadband RIR data by combining estimates at different times from the same location within the recording space 103.
  • the recent magnitude spectrum of the near-field signal can be used as an indicator of reliability of the RIR data and only frequency indices with substantial signal energy are updated in the database.
  • the database update can vary from a simple weighted average to more advanced methods.
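  • A sketch of a simple weighted-average update of one database entry, gated by the recent near-field magnitude spectrum as the reliability indicator, is given below; the energy floor and averaging weight are illustrative assumptions.

```python
import numpy as np

def update_rir_entry(h_db, h_new, nf_magnitude, energy_floor=1e-3, alpha=0.5):
    """Merge a new RIR estimate into a stored broadband entry, per frequency bin.

    h_db, h_new: (n_freq, D) complex RIR filters; nf_magnitude: (n_freq,)
    recent magnitude spectrum of the near-field signal, used as the
    reliability indicator. Only bins with substantial near-field energy are
    updated, here via a simple weighted average.
    """
    reliable = nf_magnitude > energy_floor * nf_magnitude.max()
    merged = h_db.copy()
    merged[reliable] = (1 - alpha) * h_db[reliable] + alpha * h_new[reliable]
    return merged
```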
  • real-time RIR estimation may be performed by truncating the analysis block of the block-wise least squares process outlined above to the current frame and estimating new filter weights for each frame.
  • the block-wise strategy in real-time operation requires constraining the rate of change in RIR filter parameter between adjacent frames to avoid rapid changes in the projected signals.
  • the truncated block-wise least squares process requires inverting the autocorrelation matrix for each new frame of data.
  • real-time RIR estimation may be performed by using a recursive least squares (RLS) algorithm.
  • the signal model, consisting of convolutive mixing in the time-frequency domain, may be defined as:
    $$y_t = \sum_{d=0}^{D-1} h_{t,d}\, x_{t-d} + n_t$$
  • the filter weights vary for each time frame $t$ and, again dropping the frequency index $f$ and the channel dimension, the filtering equation for a single source at time frame $t$ may be specified as:
    $$\hat{y}_t = \mathbf{h}_t^{H} \mathbf{x}_t, \qquad \mathbf{x}_t = \left[x_t, x_{t-1}, \ldots, x_{t-(D-1)}\right]^{T}$$
  • Efficient real-time operation can be achieved with recursive estimation of the RIR filter weights h using the recursive least squares (RLS) algorithm.
  • the cost function to be minimized with respect to the filter weights may be expressed as:
    $$J_t(\mathbf{h}) = \sum_{i=1}^{t} \lambda^{t-i}\left| y_i - \mathbf{h}^{H}\mathbf{x}_i \right|^{2} \qquad \text{(Equation 12)}$$
    which accumulates the estimation error from past frames with exponential weight $\lambda^{t-i}$.
  • the weight of the cost function can be thought of as a forgetting factor which determines how much past frames contribute to the estimation of the RIR filter weights at the current frame.
  • RLS algorithms where $\lambda < 1$ may be referred to in the art as exponentially weighted RLS, and $\lambda = 1$ may be referred to as growing window RLS.
  • the RLS algorithm minimizing Equation 12 is based on recursive estimation of the inverse correlation matrix $P_t$ of the close-field signal and the optimal filter weights $\mathbf{h}_t$, and can be summarized as:
    $$\mathbf{k}_t = \frac{P_{t-1}\mathbf{x}_t}{\lambda + \mathbf{x}_t^{H} P_{t-1}\mathbf{x}_t}, \qquad \mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{k}_t\left(y_t - \mathbf{h}_{t-1}^{H}\mathbf{x}_t\right)^{*}, \qquad P_t = \lambda^{-1}\left(P_{t-1} - \mathbf{k}_t\mathbf{x}_t^{H} P_{t-1}\right) \qquad \text{(Equation 13)}$$
  • the initial regularization of the inverse autocorrelation matrix is achieved by defining $P_0 = \delta^{-1} I$ using a small positive constant $\delta$, typically from $10^{-2}$ to $10^{1}$.
  • a small $\delta$ value causes faster convergence, whereas a larger $\delta$ value constrains the initial convergence to happen over a longer time period (for example, over a few seconds).
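  • A minimal per-frame sketch of the exponentially weighted RLS update of Equation 13, for a single frequency bin and channel, is shown below; the forgetting factor and the constant delta are illustrative values.

```python
import numpy as np

def rls_step(h, P, x_vec, y, lam=0.98):
    """One exponentially weighted RLS update for a single bin and channel.

    h: (D,) current RIR filter weights; P: (D, D) inverse autocorrelation
    matrix; x_vec: (D,) current plus D-1 delayed near-field STFT
    coefficients; y: the far-field STFT coefficient; lam: forgetting factor
    (lam < 1 gives exponentially weighted RLS, lam = 1 a growing window).
    """
    Px = P @ x_vec
    k = Px / (lam + np.vdot(x_vec, Px))        # gain vector
    e = y - np.vdot(h, x_vec)                  # a priori estimation error
    h = h + k * np.conj(e)                     # filter weight update
    P = (P - np.outer(k, np.conj(x_vec)) @ P) / lam
    return h, P

# Initialisation: P = (1/delta) * I; a small delta gives fast initial convergence.
D, delta = 8, 1e-2
h = np.zeros(D, dtype=complex)
P = np.eye(D, dtype=complex) / delta
```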
  • the contribution of past frames to the RIR filter estimate at current frame t may be varied over frequency.
  • the forgetting factor $\lambda$ acts in a similar way to the analysis window shape in the truncated block-wise least squares algorithm.
  • small changes in source position can cause substantial changes in the RIR filter values at high frequencies due to a highly reflected and more diffuse sound propagation path. Therefore, the contribution of past frames at high frequencies needs to be lower than at low frequencies. It is assumed that the RIR parameters change slowly at lower frequencies and source evidence can be integrated over longer periods, meaning that the exponential weight $\lambda^{t-i}$ can have substantial values for frames up to 1.5 seconds in the past.
  • a similar regularization as described above with reference to block-wise LS may also be adopted for the RLS algorithm.
  • the regularization is done to achieve a similar effect as in block-wise LS to improve robustness towards low-frequency crosstalk between near- field signals and avoid excessively large RIR weights.
  • the near-field microphones are generally not directive at low frequencies and can pick up a fair amount of low-frequency signal content generated by noise sources, for example traffic, loudspeakers etc.
  • the RLS algorithm is given in a direct form.
  • the formulation can be found, for example, in T. van Waterschoot, G. Rombouts, and M. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation," in Proceedings of the Second Annual IEEE BENELUX/DSP Valley Signal Processing Symposium (SPS-DARTS 2006), Antwerp, Belgium, 2006, pp. 28-29.
  • This algorithm would give the same result as the RLS algorithm discussed above but requires an operation for calculating the inverse of the autocorrelation matrix, and is thus computationally more expensive; however, it does allow regularization of that matrix.
  • the regularized autocorrelation update may be given as:
    $$R_t = \lambda R_{t-1} + \mathbf{x}_t^{*}\mathbf{x}_t^{T} + (1 - \lambda)\,\beta_{LMR} I \qquad \text{(Equation 15)}$$
    where $\beta_{LMR}$ is obtained from the regularization kernel $k_f$, which increases towards low frequencies, weighted by the inverse average log-spectrum of the close-field signal, as discussed above with respect to the block-wise LS algorithm.
  • in addition to the regularization weight being adjusted based on the average log-spectrum, it can also be varied based on the RMS level difference between the near-field and far-field signals.
  • the RMS levels of these signals might not be calibrated in real-time operation and thus an additional regularization weight strategy is required.
  • a trivial low-pass filter applied to the RMS of each individual STFT frame can be used to track the varying RMS levels of the close-field and far-field signals.
  • the estimated RMS level is used to adjust the regularization weights $\beta_{LMR}$ or $\beta_{TR}$ (Tikhonov regularization) in order to achieve a similar regularization impact as with the RMS-calibrated signals assumed in the earlier equations. Additional RIR data to be inserted into the RIR database 105 may be collected during the actual performance.
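  • As a sketch, such RMS tracking could be a one-pole smoother over per-frame RMS values, with the tracked levels then scaling the regularization weights; the smoothing coefficient is an assumed value.

```python
import numpy as np

def track_rms(frames_stft, alpha=0.1, state=None):
    """Low-pass tracked RMS level over successive STFT frames.

    frames_stft: iterable of complex (n_freq,) frames. A one-pole smoother
    tracks the per-frame RMS; the tracked near-field and far-field levels
    can then scale the regularization weights when the signals are not
    RMS-calibrated.
    """
    levels = []
    for frame in frames_stft:
        rms = np.sqrt(np.mean(np.abs(frame) ** 2))
        state = rms if state is None else (1 - alpha) * state + alpha * rms
        levels.append(state)
    return np.asarray(levels), state
```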
  • the time varying responses may also be useful in post-production if some original performances are edited and later added back to the original recording space 103.
  • Figure 6 illustrates a recording environment whereby a target source 601 is removed from the audio mixture and replaced with a replacement source 602 at the same position. Based on target source DOA trajectory or location estimates obtained from a location tag of the target source 601, the signal emitted by the target source 601 can be replaced by augmenting separate content to the array mixture.
  • An example scenario of this simple method to replace a speaker inside a room with another person is shown in Figure 6.
  • the replacement of a target source may be done using real-time RIR estimation, where no RIR database 105 need be used.
  • a calibration phase may be performed with respect to the recording space 103, as described above.
  • a drawback of augmenting separate signals using the RIR data estimated from the target source 601 in real time lies in the fact that the target source signal may not be broadband and estimates of RIR data from frequencies with no signal energy emitted may be unreliable. Where the target source 601 and the replacement source 602 have different spectral content (i.e. source signal frequency occupancy in each frame) poor subjective quality of the synthesized augmented source may result since accurate RIR data for all frequencies may not be available.
  • a calibration phase is used to build up a RIR database 105, as described above.
  • the RIR data in the RIR database 105 that are collected with wideband noise are accurate and reliable over the whole frequency spectrum. Using this pre-collected RIR data enables higher-quality replacement of the audio source.
  • a selection of a position within the recording space is received. This may be the position of the target source 601 received from any location determination method described above.
  • a near-field audio signal is received from the target source 601.
  • a RIR filter related to the position of the target source is identified.
  • the identified room impulse response filter is then applied to the near-field audio signal of the target source to project the near-field audio signal of the target source into a far-field space.
  • this RIR filter may be calculated in real-time.
  • the projected near-field audio signal may then be removed from the audio mixture, as shown in Equation 9 above.
  • a near-field audio signal from the replacement source 602 is received.
  • a room impulse response filter relating to the position within the recording space is identified. This may be the same room impulse response filter used to remove the target source. Alternatively, the room impulse response filter applied to the near-field audio signal of the replacement source 602 may be retrieved from a room impulse response filter database collected during a calibration phase.
  • the selected room impulse response filter is then applied to the near-field audio signal of the replacement source 602 to obtain a projected near-field audio signal of the replacement source 602.
  • the audio mixture of the far-field microphone device may then be augmented by adding the projected near-field audio signal of the replacement source 602 to the audio mixture.
  • the target source 601 is removed and replaced with the replacement source 602.
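  • Putting these steps together, a sketch of the replacement operation for one far-field channel might look as follows; the function and variable names are illustrative assumptions, and the projection applies the convolutive filter of Equation 2 to each near-field STFT.

```python
import numpy as np

def replace_source(mix_stft, x_target, x_replacement, h_rir):
    """Swap a target source for a replacement source at the same position.

    mix_stft: (n_freq, n_frames) STFT of one far-field channel; x_target and
    x_replacement: (n_freq, n_frames) near-field STFTs of the two sources;
    h_rir: (n_freq, D) RIR filter for the position, e.g. retrieved from the
    calibration database. Each near-field signal is projected into the
    far-field space with the convolutive filter before subtraction/addition.
    """
    def project(x):
        y = np.zeros_like(mix_stft)
        for d in range(h_rir.shape[1]):
            y[:, d:] += h_rir[:, d, None] * x[:, : x.shape[1] - d]
        return y

    return mix_stft - project(x_target) + project(x_replacement)
```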
  • Figure 7 illustrates a recording environment whereby a completely new near-field signal recorded from a new source 701 located outside the recording space 103 is inserted into the audio mix of the far-field audio recording device 101.
  • the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmenting signal has significant energy.
  • a user may wish for the new source 701 to be added to the recording space 103 at a particular virtual location within the recording space 103. Based on this specified virtual location, the new signal can be used to augment the content to the audio mixture recorded by the far-field microphone array of the far-field audio recording device 101.
  • a virtual person can be visually rendered to an AR view and at the same time the audio can be rendered in such a way that it sounds as though the new source 701 is standing at the location at which the source appears visually in AR.
  • An example scenario of this method to add a virtual speaker to a room is shown in Figure 7.
  • rendering a virtual speaker to a room using advanced AR, VR or 6DoF rendering may require a large amount of RIR data. For example, there may be more than one far-field audio recording device 101-1 and 101-2, and several new sound sources 701 to be rendered at the same time.
  • the 6DoF usage scenario requires that rendering from a first position to a second position is possible (in 6DoF the listener, for whom the audio is being rendered in playback, can move freely anywhere in the virtual environment).
  • Embodiments of the invention use the RIR data from the RIR database 105 to render the audio objects with natural-sounding presence in any location within the scene.
  • time varying RIR responses may be useful in post-production if some original performances are edited and later added back to the original recording space 103. In practice this requires that the most recent time stamped RIR data is obtained from the RIR database 105 in addition to the selected position.
  • a near-field audio signal from the new source 701 is received to be added to a far-field audio mixture at a selected position of a recording space 103.
  • a room impulse response filter relating to the selected position within the recording space is identified.
  • the room impulse response filter applied to the near-field audio signal of the new source 701 may be retrieved from a room impulse response filter database collected during a calibration phase.
  • the selected room impulse response filter is then applied to the near-field audio signal of the new source 701 to obtain a projected near-field audio signal of the new source 701.
  • the audio mixture of the far-field microphone devices 101 may then be augmented by adding the projected near-field audio signal of the new source 701 to the audio mixture; a sketch of a position-indexed RIR lookup follows.
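A minimal sketch of how a position-indexed RIR store such as database 105 might be queried is given below; the class and its nearest-neighbour lookup are illustrative assumptions, not the patent's prescribed data structure:

```python
import numpy as np

class RIRDatabase:
    """Position-indexed RIR store (a sketch of database 105).
    Each entry maps an (x, y) position to an RIR array, e.g. (F, D)."""
    def __init__(self):
        self.positions = []   # list of (x, y) calibration positions
        self.filters = []     # RIR measured at the matching position

    def add(self, position, rir):
        self.positions.append(tuple(position))
        self.filters.append(rir)

    def nearest(self, position):
        """Return the RIR measured closest to the requested virtual
        position; assumes at least one calibrated entry exists."""
        grid = np.asarray(self.positions, dtype=float)
        query = np.asarray(position, dtype=float)
        distances = np.linalg.norm(grid - query, axis=1)
        return self.filters[int(np.argmin(distances))]
```

The retrieved filter would then be applied to the new source's near-field signal (as in the projection sketch earlier) before the result is added to the far-field mixture.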
  • RIR values are calculated using multiple sound sources simultaneously. That is, audio signals from separate sound sources are obtained by near-field and far-field microphones and used simultaneously to determine RIR values for the recording space 103. Estimating the RIR values for most or all of the separate sound sources within the recording space 103 improves the quality of the individual RIR values.
  • a far-field audio recording device 101 comprising a microphone array and two or more separate near-field microphones 102 are provided. Each of the near-field microphones 102 is located near to a sound source. RIR values from the sound sources to the far-field audio recording device 101 for each source and for each microphone are estimated simultaneously.
  • the parameters of the RIR filters for multiple sources can be solved simultaneously by arranging the coefficients of the above model to:
  • the matrices may be defined as:
$$\mathbf{Y} = \begin{bmatrix} y_{t,1} & y_{t,2} & \cdots & y_{t,C} \\ \vdots & \vdots & & \vdots \\ y_{t+T,1} & y_{t+T,2} & \cdots & y_{t+T,C} \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-(D-1)} \\ \vdots & \vdots & & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-(D-1)} \end{bmatrix}, \quad \mathbf{H} = \begin{bmatrix} h_{0,1} & \cdots & h_{0,C} \\ \vdots & & \vdots \\ h_{D-1,1} & \cdots & h_{D-1,C} \end{bmatrix}$$
so that $\mathbf{Y} = \mathbf{X}\mathbf{H}$, with the delayed near-field frames of the sources stacked column-wise in $\mathbf{X}$.
  • the RIR filters associated with each source may be estimated simultaneously as explained above.
  • a recursive least squares model (RLS) may be used.
  • the RLS signal model for multiple sources may be expressed as $y_t = \tilde{\mathbf{x}}_t^T \mathbf{h}_t + n_t$, with the variables stacked over sources as specified below.
  • the vector variables $\tilde{\mathbf{x}}_t$ and $\mathbf{h}_t$ contain the source signals and filter coefficients of all $P$ sources as stacked, and can be specified as:
$$\tilde{\mathbf{x}}_t = \left[x_t^{(1)}, \ldots, x_{t-D+1}^{(1)}, \ldots, x_t^{(P)}, \ldots, x_{t-D+1}^{(P)}\right]^T, \qquad \mathbf{h}_t = \left[h_{t,0}^{(1)}, \ldots, h_{t,D-1}^{(1)}, \ldots, h_{t,0}^{(P)}, \ldots, h_{t,D-1}^{(P)}\right]^T$$
  • the autocorrelation matrix $\mathbf{R}_t$, or its recursively estimated inverse $\mathbf{P}_t$, scales accordingly to a size of $\mathbf{P}_t \in \mathbb{C}^{DP \times DP}$.
  • the standard RLS algorithm as specified in Equation 13 or Equation 14 can be used to jointly estimate all near-field signal RIR values simultaneously, which improves the estimation accuracy (a joint estimation sketch follows).
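The following sketch illustrates the joint (stacked) estimation idea for one frequency bin and one array channel, using an ordinary least squares solve in place of the patent's RLS recursion; the names and the exact stacking order are assumptions:

```python
import numpy as np

def estimate_rirs_jointly(y, sources, D):
    """Jointly estimate length-D RIR filters of P sources for one
    frequency bin and one array channel by solving a single stacked
    least squares problem.
    y       : mixture STFT frames, complex array of shape (T,)
    sources : list of P near-field STFT frame arrays, each shape (T,)
    Returns an array of shape (P, D) with one filter per source."""
    T = len(y)
    P = len(sources)
    X = np.zeros((T, P * D), dtype=complex)
    for p, x in enumerate(sources):
        for d in range(D):
            # delayed frames of source p; frames before the block are zero
            X[d:, p * D + d] = x[:T - d]
    h, *_ = np.linalg.lstsq(X, y, rcond=None)
    return h.reshape(P, D)
```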
  • the regularization matrix used earlier in Equation 16 or in Equation 17 is changed to correspond to the dimension and internal structure of the stacked variables $\mathbf{h}_t$ and $\tilde{\mathbf{x}}_t$ in Equations 24 and 25, while taking into account the different average spectra of each near-field source.
  • $\beta^{(p)}$ may be used to denote the regularization weight of each source $p = 1, \ldots, P$, which is again derived from the regularization kernel $k_f$ (at each frequency index $f$) and the source-dependent inverse log average spectrum $1 - e^{(p)}$.
  • the regularization matrix corresponding to the RLS formulation (29), with simultaneous estimation of the RIR filters of multiple sources, is given accordingly; a sketch of one possible construction follows.
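Since the patent's exact expression is not reproduced here, the sketch below shows one plausible block-diagonal construction consistent with the description above (a per-source weight scaled by a frequency-dependent kernel); it is an assumption, not the patent's formula:

```python
import numpy as np

def stacked_regularizer(kernel_f, source_weights, D):
    """One plausible block-diagonal regularization matrix for joint RLS.
    kernel_f       : regularization kernel value at this frequency bin
    source_weights : length-P weights, one per near-field source
    Returns a (P*D, P*D) diagonal matrix matching the stacked layout
    of the filter vector h_t."""
    blocks = [kernel_f * w * np.ones(D) for w in source_weights]
    return np.diag(np.concatenate(blocks))
```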
  • SAD: signal activity detection
  • VAD: voice activity detection
  • the role of the SAD or VAD is to determine when the target source is emitting sound and which frames of the current block contain information (signal energy) that can be used to estimate the RIR using LS projection.
  • the first approach comprises zeroing out the mixture signal STFT coefficients in the frames where SAD indicates no signal energy is being emitted, which is repeated for all frequencies in the respective frame.
  • the near-field signal is not zeroed out, since it is considered to already contain a negligible amount of energy, and hard zeros would tend the RIR estimation towards unreasonable and unbounded solutions, since a projection between two zero signals would effectively be estimated.
  • practical experiments verify that blocks with only a few active frames lead to an excess amount of reverberation in the projected signal if both signals are zeroed out before RIR estimation by Equation 7, whereas this effect is avoided by zeroing out only the audio mixture signal, as sketched below.
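A sketch of the first approach, assuming STFT matrices of shape frequencies-by-frames and a boolean SAD track (hypothetical names):

```python
import numpy as np

def zero_inactive_mixture_frames(mixture, sad_active):
    """First approach: zero the mixture STFT in frames the SAD marks as
    inactive; the near-field signal is deliberately left untouched so the
    projection is not estimated between two all-zero signals.
    mixture    : complex STFT array of shape (F, T)
    sad_active : boolean array of length T, True where the source emits."""
    out = mixture.copy()
    out[:, ~np.asarray(sad_active, dtype=bool)] = 0.0
    return out
```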
  • the second approach comprises using weighted least squares (WLS) for the RIR estimation and giving higher importance to frames that have more near-field signal energy present.
  • the WLS minimization criterion may be expressed as:
$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} \sum_t w_t \left\| \mathbf{y}_t - \sum_{d=0}^{D-1} x_{t-d} \mathbf{h}_d \right\|^2$$
where $w_t$ are the weights of each STFT frame $t$ within the current block (a code sketch follows).
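A corresponding weighted least squares sketch for one frequency bin and one channel; the square-root weighting trick is a standard WLS reformulation, and the names are illustrative:

```python
import numpy as np

def estimate_rir_wls(y, x, D, weights):
    """Weighted least squares RIR estimate for one frequency bin and one
    channel; scaling the rows of the system by sqrt(w_t) reproduces the
    weighted criterion above.
    y, x    : mixture / near-field STFT frames, complex arrays of shape (T,)
    weights : non-negative frame weights, shape (T,)"""
    T = len(y)
    X = np.zeros((T, D), dtype=complex)
    for d in range(D):
        X[d:, d] = x[:T - d]
    w = np.sqrt(np.asarray(weights, dtype=float))
    h, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return h
```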
  • the practical considerations of the SAD implementation include that the SAD output needs to remain active for some time after the near-field signal energy has dropped below the detection threshold. This is because the reverberated signal components in the far-field signal can be delayed by a significant number of timeframes, and their contribution to the RIR estimation needs to be accounted for.
  • RIR filtered (i.e. projected) signals are used as a basis for generating Time/Frequency (T/F) masks.
  • Using projected signals improves the quality of the suppression. This is because the projection (i.e. filtering with the RIR) converts the "dry" near-field source signal into a "wet" signal, and thus the created mask is a better match to the "wet" far-field microphone captured signals.
  • after estimation of the RIR filters for one of the target sources $p$, the projection may be obtained as:
$$\hat{\mathbf{x}}_t^{(p)} = \sum_{d=0}^{D-1} x_{t-d}^{(p)} \hat{\mathbf{h}}_d^{(p)}$$
  • the LS projection matches the overall signal energies between the mixture and the projected near-field signal, and thus a magnitude ratio mask for masking out the time-frequency content of $\hat{\mathbf{x}}_t^{(p)}$ from the resulting mixture $\tilde{\mathbf{y}}_t$ can be easily defined.
  • the time-frequency masking with a magnitude ratio mask, defined using the projected signal and the original array mixture, may be expressed as in Equations 30 and 31.
  • the magnitude ratio mask can be calculated using perceptually motivated frequency resolution, for example Mel-scale, by integrating over several STFT frequency bins.
  • the above equations may be used if the T/F masking is used to improve the quality of removing a source from the far-field microphone signals. If instead the source is to be added, i.e. enhanced, in the far-field microphone signals, then that enhancement can be improved by using the same equations but changing the subtraction to addition in Equations 30 and 31; a hedged sketch follows.
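The sketch below illustrates a generic magnitude ratio mask and its use for suppression or enhancement; the exact forms of Equations 30 and 31 are not reproduced above, so the mask definition here is an assumption:

```python
import numpy as np

def magnitude_ratio_mask(projected, mixture, eps=1e-12):
    """Generic magnitude ratio mask from the projected ("wet") near-field
    signal and the far-field mixture; both are complex STFTs of shape (F, T).
    The exact mask of Equations 30-31 may differ from this assumed form."""
    return np.clip(np.abs(projected) / (np.abs(mixture) + eps), 0.0, 1.0)

def apply_mask(mixture, mask, remove=True):
    """Suppress (remove=True) or enhance the masked time-frequency content,
    mirroring the subtraction-to-addition change described above."""
    return mixture * (1.0 - mask) if remove else mixture * (1.0 + mask)
```

As noted above, the mask could also be computed on a perceptually motivated frequency scale (e.g. Mel) by integrating the magnitudes over several STFT bins before taking the ratio.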
  • Figure 8 shows the operation of a user interface (UI) 900, shown in Figure 9, and a software block for audio editor software, e.g. a Digital Audio Workstation (DAW).
  • the DAW may have several input tracks, whereby each track may include one or more channels.
  • audio signals are received from a near-field microphone 102 and from a far-field audio recording device 101 respectively.
  • a first track may be a single channel track from a near-field microphone 102, for example a Lavalier microphone.
  • a second track may be a multiple channel track from a far-field audio recording device 101.
  • a user input to mix audio signals may be detected at step 8.3.
  • the DAW firstly automatically calculates the time dependent RIR filter between the single near-field microphone track channel and each far-field audio recording device track channel, at step 8.4.
  • the near-field microphone channel is then filtered, at step 8.5, with the channel dependent RIR filters and added to each far-field audio recording device track channel, at step 8.6, as part of the audio mixture.
  • a multiplier of -1 for the track to be mixed may denote that the instrument in the near-field microphone track is to be removed from the far-field audio recording device track.
  • a multiplier of +1 means that the near-field microphone channel track is to be added to the far-field audio recording device track.
  • Many other multipliers may be used and the multiplier may be expressed in decibels or in other formats.
  • the near-field microphone track may be multiplied by the multiplier prior to adding it to the far-field audio recording device track channels, as in the sketch below.
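As a small illustrative sketch (names assumed), a signed multiplier with an optional decibel gain could be applied to the RIR-filtered near-field track before mixing:

```python
def apply_track_multiplier(track, sign, gain_db=0.0):
    """Scale an RIR-filtered near-field track before mixing it into the
    far-field channels: sign=+1 adds (enhances) the source, sign=-1
    removes it, and gain_db expresses the multiplier magnitude in decibels."""
    linear_gain = 10.0 ** (gain_db / 20.0)
    return sign * linear_gain * track
```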
  • the UI 900 can be used in different kinds of software.
  • it may be a plugin for a DAW or an independent software component.
  • the software may show many kinds of audio tracks, each track with one or more channels. Each channel may represent a microphone.
  • the software calculates the RIR filters between the channels in the first and second tracks.
  • the channels in the first track may then be added or subtracted with user selected gains to the channels in the second track.
  • the user may select how RIR filters are used or not used in the addition/subtraction.
  • controlling the UI 900 is simplest if the near-field microphones are specialized to certain functions. For example, some microphones may be marked with a "minus" sign, and they automatically control the UI 900 so that the sound source recorded with that microphone is subtracted from the far-field microphones. This functionality allows for the simple user interaction of placing such microphones near noise sources such as ventilation, machinery, etc. Other microphones may be marked with a "plus" sign, and the sound sources recorded by these microphones are enhanced in the far-field microphones.
  • the different functions discussed herein may be performed in a different order and/or concurrently with each other.
  • one or more of the above-described functions may be optional or may be combined.
  • the flow diagram of Figure 3 is an example only, and various operations depicted therein may be omitted, reordered and/or combined.


Abstract

A method, computer-readable medium and apparatus are disclosed for: receiving, via a first track, a near-field audio signal from a near-field microphone; receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array; for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.

Description

Processing Audio Signals
Field
This specification relates to processing audio signals and, more specifically, to processing audio signals for mixing audio signals.
Background
Spatial audio signals are being used more often to produce a more immersive audio experience. A stereo or multi-channel recording can be passed from the recording or capture apparatus to a listening apparatus and replayed using a suitable multi-channel output such as a multi-channel loudspeaker arrangement and, with virtual surround processing, a pair of stereo headphones or headset.
As the possibilities for using such immersive audio functionality become more widespread, there is a need to ensure that audio signals are mixed in such a way so as to complement the virtual reality environment of the user. For example, if a user is in a virtual reality environment, there is a requirement that audio content from a particular source sounds as though it is coming from a location corresponding to the location of that source in virtual reality.
Summary
In a first aspect, this specification describes a method comprising: receiving, via a first track, a near-field audio signal from a near-field microphone; receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array; for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augmenting the far-field audio signal by applying the filtered near-field audio signal thereto. The method may further comprise: receiving audio signals from a plurality of near-field microphones; detecting a user selection of one of the near-field microphones corresponding to the near-field audio signal; and automatically determining the set of time dependent room impulse response filters in response to the user selection of the near-field microphone corresponding to the near-field audio signal.
The method may further comprise detecting a user selection of a signal multiplier to apply to the track relating to the near-field microphone. The method may further comprise: assigning audio signals from a particular near-field microphone with either a positive weighting or a negative weighting; and augmenting the far-field audio signal by adding the filtered near-field audio signal to the far-field audio signal if the weighting is positive or subtracting the filtered near-field audio signal from the far-field audio signal if the weighting is negative.
In a second aspect, this specification describes a method comprising: receiving a plurality of near-field audio signals from respective near-field microphones; receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
The set of room impulse response filters may be determined using a blockwise linear least squares algorithm.
The set of room impulse response filters may be determined using a recursive least squares algorithm. In a third aspect, this specification describes a method comprising: receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receiving, from a near-field
microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identifying timeframes that contain near- field signal energy above a signal activity threshold; determining a linear
transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
In a fourth aspect, this specification describes a method comprising: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near- field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
In a fifth aspect, this specification describes a method comprising: receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal;
subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and removing the residual signal components from the audio mix. In a sixth aspect, this specification describes apparatus configured to perform a method according to any preceding aspect.
In a seventh aspect, this specification describes computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method according to any of the preceding first to fifth aspects. In an eighth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, via a first track, a near-field audio signal from a near-field microphone; receive, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determine, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field
microphone and respective the or each of the channels of the microphone array; for one or more channels of the microphone array, filter the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augment the far-field audio signal by applying the filtered near-field audio signal thereto.
In a ninth aspect, this specification describes apparatus comprising at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive a plurality of near-field audio signals from respective near-field microphones; receive a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determine, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
In a tenth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identify timeframes that contain near-field signal energy above a signal activity threshold; determine a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and use the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space. In an eleventh aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receive, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determine location information relating to the mobile source; transform the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and use the
transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
In a twelfth aspect, this specification describes apparatus comprising: at least one processor; and at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; apply a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtract the projected near-field audio signal from an audio mix of a far-field microphone array; determine a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; use the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and remove the residual signal components from the audio mix.
In a thirteenth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of at least: receiving, via a first track, a near-field audio signal from a near-field microphone; receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array; for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
In a fourteenth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of at least: receiving a plurality of near-field audio signals from respective near-field microphones; receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
In a fifteenth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of at least: receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; identifying timeframes that contain near-field signal energy above a signal activity threshold; determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space. In a sixteenth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of at least: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; determining location information relating to the mobile source; transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
In a seventeenth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causing performance of at least: receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and removing the residual signal components from the audio mix. In an eighteenth aspect, this specification describes apparatus comprising: means for receiving, via a first track, a near-field audio signal from a near-field microphone; means for receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones; means for determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array; means for, for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and means for augmenting the far-field audio signal by applying the filtered near-field audio signal thereto. In a nineteenth aspect, this specification describes apparatus comprising: means for receiving a plurality of near-field audio signals from respective near-field microphones; means for receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and means for determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
In a twentieth aspect, this specification describes apparatus comprising:
means for receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for identifying timeframes that contain near-field signal energy above a signal activity threshold; means for determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and means for using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
In a twenty first aspect, this specification describes apparatus comprising: means for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device; means for receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source; means for determining location information relating to the mobile source; means for transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and means for using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm. In a twenty second aspect, this specification describes apparatus comprising: means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; means for subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; means for determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; means for using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and means for removing the residual signal components from the audio mix.
Brief description of the drawings
So that the invention may be fully understood, embodiments thereof will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic diagram of an audio mixing system and a recording space; Figure 2 is a schematic block diagram of elements of certain embodiments;
Figure 3 is a flow chart illustrating operations carried out in certain embodiments; Figure 4 is an illustration of a recording space;
Figure 5 is a schematic diagram of an audio mixing system and a recording space;
Figure 6 is a schematic diagram of an audio mixing system and a recording space as a target source is replaced with a replacement source;
Figure 7 is a schematic diagram of an audio mixing system and a recording space as a new source is introduced to an audio mixture;
Figure 8 is a flow chart illustrating operations carried out in certain embodiments; and Figure 9 illustrates a user interface according to certain embodiments.
Detailed description
In the description and drawings, like reference numerals refer to like elements throughout. Embodiments of the present invention relate to mixing audio signals received from both a near-field microphone and from a far-field microphone. Example near-field microphones include Lavalier microphones which may be worn by a user to allow hands-free operation or a handheld microphone. In some embodiments, the near-field microphone may be location tagged. The near-field signals obtained from near-field microphones may be termed "dry signals", in that they have little influence from the recording space and have relatively high signal-to-noise ratio (SNR).
Far-field microphones are microphones that are located relatively far away from a sound source. In some embodiments, an array of far-field microphones may be provided, for example in a mobile phone or in a Nokia Ozo (RTM) or similar audio recording apparatus. Devices having multiple microphones may be termed multichannel devices and can detect an audio mixture comprising audio components received from the respective channels.
The microphone signals from far-field microphones may be termed "wet signals", in that they have significant influence from the recording space (for example from ambience, reflections, echoes, reverberation, and other sound sources). Wet signals tend to have relatively low SNR. In essence, the near-field and far-field signals are in different "spaces", near-field signals in a "dry space" and far-field signals in a "wet space".
When the originally "dry" audio content from the sound sources reaches the far-field microphone array, the audio signals have changed because of the effect of the recording space. That is to say, the signal becomes "wet" and has a relatively low SNR. The near-field microphones are much closer to the sound sources than the far-field microphone array. This means that the audio signals received at the near-field microphones are much less affected by the recording space. The dry signal has much higher signal to noise ratio and lower cross talk with respect to other sound sources. Therefore, the near-field and far-field signals are very different and mixing the two ("dry" and "wet") results in audible artefacts or non-natural sounding audio content.
Further problems arise if a signal outside the system needs to be inserted into the audio mixture. For example, an audio stream from an external player such as a professional audio recorder may be mixed with audio content recorded in a particular recording space. These signals need to be mixed together because only the microphone array can provide spatial audio content, for example for a virtual reality (VR) or augmented reality (AR) audio delivery system. However, with simply mixed sound sources this cannot be done, due to artefacts or at least due to the virtual presence aspect being lost in listening. Furthermore, future six degrees of freedom (6DoF) audio production systems require ways to estimate room impulse responses.
Additionally, mixing or editing of the multi-channel array signal is not straightforward due to low SNR, cross-talk and spatial artefacts that editing might cause. Editing of the near-field microphone and pre-recorded signal is relatively straightforward due to high SNR and isolation between individual channels. However, near-field signals only provide audio content without spatial information. The resulting mix quality is up to personal preferences and use case demands; however, some amount of spatial information insertion capability is often needed. A new problem arises when a totally new "dry" signal is introduced into the audio mixture, for example from a sound source located externally with respect to the recording space. Since the new audio signal has no room impulse response (RIR) data available for the current room and environment, realistic sounding mixing is not possible without a database of RIR values from all around the space used for the original audio capture. Current audio mixing systems often rely on the expert audio mixer's personal abilities, and spatial information may be added to the "dry" near-field signal with signal processors that create artificial spatial information. Examples include reverb processors that generate spatial information with an algorithm for different sounding and tunable spaces, or that rely on real impulse responses (convolution processor) with some amount of manual modification to parameters such as panning, volume, equalization, pre-echo, decay time and residual noise floor adjustments. More information may be found at http://www.nongnu.org/freeverb3/.
Hitherto, there are no known methods available that use a collected RIR database together with the position data and/or models of the recording space to render realistic sounding VR, AR or 6DoF audio playback.
Embodiments of this invention provide a database where estimated RIR values are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data of the near-field microphones (which correspond to the position of the sound source). The RIR data are estimated based on the dry to wet signal transfer function at every relevant position within the recording space. There may be one or more "wet" multi-channel arrays as well as one or more "dry" sound sources collected at the RIR database at the same time. In some embodiments, the RIR database may be collected during an initial calibration phase where a sound source (for example, white noise, a talking human, an acoustic instrument, a flying drone with a speaker, etc.) is moving or is moved around the recording space either manually or automatically. The benefit of having calibration recordings and database collection prior to the actual performance is that the RIR database can be used during the performance to insert additional sound sources into the audio mix in real-time. Also, the recording space might have a higher SNR available in some circumstances, for example when a studio audience is absent; in addition, special signals such as white noise can be used to provide more accurate room impulse responses for the whole frequency range.
In other embodiments, continuous collection of new RIR data is performed during the recording itself, the new RIR data being inserted into the database as the actual performance occurs. Additional RIR data that is inserted into a pre-existing RIR database can also be collected during the actual performance.
Collection of RIR data during a performance can be made in order to add more data points to make the database denser. There are multiple dimensions that can be enhanced in the database. For example, the position grid can be made denser. For instance, data may be acquired for a 10 centimetre (cm) grid instead of an originally calibrated 20 cm grid so that more data points can be gathered.
Spectral points can also be added: if calibration was initially performed quickly by walking around the vicinity of the far-field microphone array, all further captured signals will decrease the spectral sparseness of the RIR database.
Since the acoustic environment may change during the performance, the RIR database can contain time varying RIR values. To capture time varying responses, RIR measurements need to be captured over an extended period of time for optimal quality. For example, when more people enter the recording space, a damping of the recording space occurs which affects the acoustic properties of that recording space.

Figure 1 shows an audio mixing system 100 which comprises a far-field audio recording device 101, such as a video/audio capture device, and one or more near-field audio recording devices 102, such as Lavalier microphones. The far-field audio recording device 101 comprises an array of far-field microphones and may be a mobile phone, a stereoscopic video/audio capture device or similar recording apparatus such as the Nokia Ozo (RTM). The near-field audio recording devices 102 may be worn by a user, for example a singer or actor. The far-field audio recording device 101 and the near-field audio recording devices 102 are located within a recording space 103. The far-field audio recording device 101 is in communication with an RIR processing apparatus 104 either via a wired or wireless connection. The RIR processing apparatus
104 may be located within the recording space 103 or outside the recording space 103. The RIR processing apparatus 104 has access to an RIR database 105 containing RIR data relating to the recording space 103. The RIR database 105 may be physically incorporated with the RIR processing apparatus 104. Alternatively, the RIR database
105 may be maintained remotely with respect to the RIR processing apparatus 104.
Figure 2 is a schematic block diagram of the RIR processing apparatus 104. The RIR processing apparatus 104 may be incorporated within a general purpose computer. Alternatively, the RIR processing apparatus 104 may be a standalone apparatus.
The RIR processing apparatus 104 may comprise a short-time Fourier transform (STFT) module 201 for determining short-time Fourier transforms of received audio signals. The RIR processing apparatus 104 comprises an RIR estimator 202 and a projection module 203. The RIR processing apparatus 104 comprises a processor 204 which controls the STFT module 201, the RIR estimator 202 and the projection module 203. The RIR processing apparatus 104 comprises a memory 205. The memory comprises a volatile memory 206 such as random access memory (RAM). The memory also comprises non-volatile memory 207, such as read-only memory (ROM).
The RIR processing apparatus 104 further comprises input/output 208 to enable communication with the far-field audio recording device 101 and with the RIR database 105 as well as any other remote entities. The input/output 208 comprises hardware, software and/or firmware that allows the RIR processing apparatus 104 to
communicate with the far-field audio recording device 101 and with other remote entities via wired or wireless connection using communication protocols known in the art.
Some further details of components and features of the above-described RIR processing apparatus 104 and alternatives will now be described.
The RIR processing apparatus 104 comprises a processor 204 communicatively coupled with memory 205. The memory 205 has computer readable instructions stored thereon, which when executed by the processor 204 causes the processor 204 to cause performance of various ones of the operations described with reference to Figure 3. The RIR processing apparatus 104 may in some instances be referred to, in general terms, as "apparatus".
The RIR processing apparatus 104 may be of any suitable composition. For example, the processor 204 may be a programmable processor that interprets computer program instructions and processes data. The processor 204 may include plural programmable processors. Alternatively, the processor 204 may be, for example, programmable hardware with embedded firmware. The processor 204 may be termed processing means. The processor 204 may alternatively or additionally include one or more Application Specific Integrated Circuits (ASICs). In some instances, processor 204 may be referred to as computing apparatus.
The processor 204 is coupled to the memory (or one or more storage devices) 205 and is operable to read/write data to/from the memory 205. The memory 205 may comprise a single memory unit or a plurality of memory units, upon which the computer readable instructions (or code) is stored. For example, the memory 205 may comprise both volatile memory and non-volatile memory. For example, the computer readable instructions/program code may be stored in the non-volatile memory and may be executed by the processor 204 using the volatile memory for temporary storage of data or data and instructions. Examples of volatile memory include RAM, DRAM, and SDRAM etc. Examples of non-volatile memory include ROM, PROM, EEPROM, flash memory, optical storage, magnetic storage, etc. The memories in general may be referred to as non-transitory computer readable memory media. The term 'memory', in addition to covering memory comprising both non-volatile memory and volatile memory, may also cover one or more volatile memories only, one or more non-volatile memories only, or one or more volatile memories and one or more non-volatile memories.
The computer readable instructions/program code may be pre-programmed into the RIR processing apparatus 104. Alternatively, the computer readable instructions may arrive at the RIR processing apparatus 104 via an electromagnetic carrier signal or may be copied from a physical entity such as a computer program product, a memory device or a record medium such as a CD-ROM or DVD. The computer readable instructions may provide the logic and routines that enable the devices/apparatuses to perform the functionality described above. The combination of computer-readable instructions stored on memory (of any of the types described above) may be referred to as a computer program product.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing apparatus" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), signal processing devices and other devices. References to computer program, instructions, code etc. should be understood to express software for a programmable processor, or firmware such as the programmable content of a hardware device, whether as instructions for a processor or as configured or configuration settings for a fixed function device, gate array, programmable logic device, etc.
Overall algorithm description
The following is a description of one way in which far-field audio signals may be processed to obtain a short-time Fourier transform (STFT). The far-field audio recording device 101, comprising a microphone array composed of far-field microphones with indexes $c = 1, \ldots, C$, captures a mixture of source signals with indexes $p = 1, \ldots, P$ and their signals $x^{(p)}(n)$ sampled at discrete time instances indexed by $n$. The sound sources may be moving and have time-varying mixing properties, denoted by the room impulse response (RIR) $h_{cn}^{(p)}(\tau)$ for each channel $c$ at each time index $n$. Some of the sound sources (e.g. speaker, car, piano or any sound source) have Lavalier microphones 102 close to them. The resulting mixture signal can be given as:
$$y_c(n) = \sum_{p=1}^{P} \sum_{\tau} x^{(p)}(n - \tau)\, h_{cn}^{(p)}(\tau) + n_c(n) \qquad \text{(Equation 1)}$$
wherein:
$y_c(n)$ is the audio mixture in time domain for each channel index $c$ of the far-field audio recording device 101, i.e. the signal received at each far-field microphone;
$x^{(p)}(n)$ is the $p$th near-field source signal in time domain (source index $p$);
$h_{cn}^{(p)}(\tau)$ is the partial impulse response in time domain (sample delay index $\tau$), i.e. the room impulse response;
$n_c(n)$ is the noise signal in time domain.
Applying the short time Fourier transform (STFT) to the time-domain array signal allows expressing the capture in time-frequency domain as:
$$\mathbf{y}_{ft} = \sum_{p=1}^{P} \sum_{d=0}^{D-1} \mathbf{h}_{ftd}^{(p)} x_{f,t-d}^{(p)} + \mathbf{n}_{ft} = \sum_{p=1}^{P} \hat{\mathbf{x}}_{ft}^{(p)} + \mathbf{n}_{ft} \qquad \text{(Equation 2)}$$
wherein:
$\mathbf{y}_{ft}$ is the STFT of the array mixture (frequency and frame indices $f$, $t$);
$x_{ft}^{(p)}$ is the STFT of the $p$th near-field source signal (source index $p$);
$\mathbf{h}_{ftd}$ is the room impulse response (RIR) in STFT domain (frame delay index $d$);
$\hat{\mathbf{x}}_{ft}^{(p)}$ is the STFT of the $p$th reverberated (filtered/projected) source signal;
$\mathbf{n}_{ft}$ is the STFT of the noise signal.
The STFT of the array signal is denoted by $\mathbf{y}_{ft} = [y_{ft1}, \ldots, y_{ftC}]^T$, where $f$ and $t$ are the frequency and time frame index, respectively. The source signal as captured by the microphone array of the far-field audio recording device 101 is modelled by convolution between the near-field source STFT $x_{ft}^{(p)}$ and the RIR $\mathbf{h}_{ftd} = [h_{ftd1}, \ldots, h_{ftdC}]^T$.
The length of the convolutive frequency domain RIR is $D$ timeframes, which can vary from a few timeframes to several tens of frames depending on the STFT window length and the maximum effective amount of reverberation components in the recording environment. This model differs from the usual assumption of instantaneous mixing in frequency domain, with mixing consisting of complex valued weights only for the current timeframe. The additive uncorrelated noise is denoted by $\mathbf{n}_{ft} = [n_{ft1}, \ldots, n_{ftC}]^T$. The reverberated source signals are denoted by $\hat{\mathbf{x}}_{ft}$.
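To make the convolutive STFT-domain model of Equation 2 concrete, the following sketch simulates one mixture channel from near-field source STFTs and their frequency-domain RIRs; the names and array shapes are illustrative assumptions:

```python
import numpy as np

def simulate_mixture_channel(sources, rirs, noise=None):
    """Forward model of Equation 2 for a single channel: each near-field
    source STFT (F x T) is convolved across frames with its RIR (F x D),
    and the reverberated sources are summed with additive noise.
    sources : list of P complex arrays, each of shape (F, T)
    rirs    : list of P complex arrays, each of shape (F, D)"""
    F, T = sources[0].shape
    y = np.zeros((F, T), dtype=complex) if noise is None else np.array(noise, dtype=complex)
    for x, h in zip(sources, rirs):
        for d in range(h.shape[1]):
            # h_{ftd} * x_{f,t-d}: delayed frames weighted by the RIR tap
            y[:, d:] += h[:, d:d + 1] * x[:, :T - d]
    return y
```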
The way in which RIR measurements are obtained in accordance with various embodiments will now be explained, with reference to Figure 3 which is a flow chart illustrating various steps taken in embodiments of the invention. The process starts at step 3.1.
At step 3.2 an audio signal $y_c(n)$ is received from the far-field audio recording device 101. At step 3.3 an audio signal $x^{(p)}(n)$ is received from the near-field audio recording device 102 for those sound sources provided with a near-field audio recording device 102. At step 3.4, the location of the mobile source is determined. The location can be determined using information received from a tag with which the mobile source is provided. Alternatively, the location may be calculated using multilateration techniques described below.
At step 3.5, a short-time Fourier transform (STFT) is applied to both far-field and near-field audio signals. Alternative transforms may be applied to the audio signals as described below.
In some embodiments, time differences between the near-field and far-field audio signals can be taken into account. However, if the time differences are large (several hundreds of milliseconds or more) a rough alignment may be done prior to the process commencing. For example, if a wireless connection between a near-field microphone and RIR processor causes a delay, the delay may be manually fixed by delaying the other signals in the RIR processor or by an external delay processor which may be implemented as hardware or software. A signal activity detection (SAD) may be estimated from the near-field signal in order to determine when the RIR estimate is to be updated. For example, if a source does not emit any signal over a time period, its RIR value does not need to be estimated.
The STFT values $\mathbf{y}_{ft}$ and $x_{ft}^{(p)}$ are input to the RIR estimator 202 at RIR estimation step 3.6. The RIR estimation may be performed using a block-wise linear least squares (LS) projection in offline operation mode, that is, where the RIR estimation is performed as part of a calibration operation. Alternatively, a recursive least squares (RLS) algorithm may be used for real time operation mode, that is, where the RIR estimation occurs during a performance itself. In other embodiments, the RLS algorithm may be used in offline operation instead of the block-wise linear LS algorithm. In either case, as a result, a set of RIR filters in the time-frequency domain is obtained. The process ends at step 3.7.
Block-wise linear least squares projection
The RIR $\mathbf{h}_{ftd}$ can be thought of as a projection operator from the near-field signal space (i.e. "dry" signals) to the far-field signal space (array capture in case of multiple channels, i.e. "wet" signals).
The projection is time, frequency and channel dependent. The parameters of the RIR $\mathbf{h}_{ftd}$ can be estimated using linear least squares (LS) regression, which is equivalent to finding the projection between the near-field and far-field signal spaces. The method of LS regression for estimating RIR values may be applied for moving sound sources by processing the input signal in blocks of approximately 500 ms, and the RIR values may be assumed to be stationary within each block. Block-wise processing with moving sources assumes that the difference between RIR values associated with adjacent frames is relatively small and remains stable within the analysed block. This is valid for sound sources that move at low speeds in an acoustic environment where small changes in source position with respect to the receiver do not cause a substantial change in the RIR value.
The method of LS regression is applied individually for each source signal in each channel of the array. Additionally, the RIR values are frequency dependent and each frequency bin of the STFT is processed individually. Thus, in the following discussion it should be understood that the processing is repeated for all channels and all frequencies.
Assuming a block of STFT frames with indices $t, \ldots, t+T$ where the RIR is assumed stationary inside the block, the mixture signal STFT with the convolutive frequency domain mixing can be given as:
$$\mathbf{y}_t = \sum_{d=0}^{D-1} x_{t-d} \mathbf{h}_d \;\;\Leftrightarrow\;\; \mathbf{y} = \mathbf{X}\mathbf{h} \qquad \text{(Equation 3)}$$
wherein $\mathbf{y}$ is a vector of far-field STFT coefficients obtained from the far-field audio recording device 101 from frame $t$ to $t+T$;
$\mathbf{X}$ is a matrix containing the near-field STFT coefficients starting from frame $t-0$ and the delayed versions starting from $t-1, \ldots, t-(D-1)$; and
$\mathbf{h}$ is the RIR to be estimated.
The length of the RIR filter to be estimated is $D$ STFT frames. The block length is $T+1$ frames, and $T+1 > D$ so that the model is overdetermined, in order to avoid overfitting.
The above equation (3) can be expressed as:
$$\begin{bmatrix} \mathbf{y}_t \\ \mathbf{y}_{t+1} \\ \vdots \\ \mathbf{y}_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-(D-1)} \\ x_{t+1} & x_t & \cdots & x_{t-(D-2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-(D-1)} \end{bmatrix} \begin{bmatrix} \mathbf{h}_0 \\ \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_{D-1} \end{bmatrix} \qquad \text{(Equation 4)}$$
and assuming that data before the first frame index $t$ is not available, the model becomes:
$$\begin{bmatrix} \mathbf{y}_t \\ \mathbf{y}_{t+1} \\ \vdots \\ \mathbf{y}_{t+T} \end{bmatrix} = \begin{bmatrix} x_t & 0 & \cdots & 0 \\ x_{t+1} & x_t & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-(D-1)} \end{bmatrix} \begin{bmatrix} \mathbf{h}_0 \\ \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_{D-1} \end{bmatrix} \qquad \text{(Equation 5)}$$
The linear LS minimization criterion is:
$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}} \left\| \mathbf{y} - \mathbf{X}\mathbf{h} \right\|^2 \qquad \text{(Equation 6)}$$
and its solution is achieved as:
$$\hat{\mathbf{h}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \qquad \text{(Equation 7)}$$
The projected source signal for a single block can be trivially obtained as:
$$\hat{\mathbf{x}}_t = \sum_{d=0}^{D-1} x_{t-d} \hat{\mathbf{h}}_d \qquad \text{(Equation 8)}$$
A subsequent removal of a particular source signal from the audio mixture is then a simple subtraction:

$$\hat{y}_t = y_t - \hat{x}_t \qquad \text{(Equation 9)}$$

As well as removing a source from the audio mixture, it is also possible to add the effect of a source to the audio mix. This may be done by using addition instead of subtraction, together with a user-specified gain.
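The block-wise estimation and subtraction described above can be illustrated with a short numpy sketch for a single frequency bin and a single array channel (a sketch only; variable names are illustrative, and numpy's least squares routine is used in place of the explicit normal-equation inverse of Equation 7):

```python
import numpy as np

def blockwise_ls_rir(x, y, D):
    """Estimate a D-tap time-frequency domain RIR for one frequency bin and
    one channel. x, y: length-(T+1) complex STFT coefficient sequences of
    the near-field ("dry") and far-field ("wet") signals within one block."""
    T1 = len(x)
    # Convolution matrix of Equation 5 (zeros before the first block frame).
    X = np.zeros((T1, D), dtype=complex)
    for d in range(D):
        X[d:, d] = x[:T1 - d]
    # Least squares solution (Equations 6-7); lstsq handles the complex
    # conjugate transpose and is numerically safer than inverting X^T X.
    h, *_ = np.linalg.lstsq(X, y, rcond=None)
    x_proj = X @ h           # projected "wet" source signal (Equation 8)
    y_removed = y - x_proj   # source removed from the mixture (Equation 9)
    return h, x_proj, y_removed
```

In practice such a routine would be run independently for every frequency bin and every array channel, as noted above.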
System calibration and RIR database collection
The RIR estimation presented in embodiments of the present invention allows removal of a target source from the audio mixture or addition of a source to the audio mixture of the far-field audio recording device 101. Based on the target source direction of arrival (DOA) trajectory or location estimates of the target source, the signal emitted by the source can be replaced by augmenting separate content to the array mixture of the far-field audio recording device 101.
The problem of augmenting separate signals using the RIR values estimated from the target source in prior approaches lies in the fact that the source signal is not broadband, and estimates of RIR values at frequencies with no emitted signal energy are unreliable. If the original and augmented sources have different spectral content (source signal frequency occupancy in each frame), this leads to poor subjective quality of the synthesized augmented source, since accurate RIR data for all frequencies are not available.
To overcome this problem, embodiments described herein provide a calibration method with a constant broadband signal, which is used to estimate and store RIR values from substantially all possible locations of the recording space. The purpose of the calibration stage is that reliable broadband RIR data from all positions of the recording space are captured before the actual operation (i.e. before an audio recording or broadcast). The location data may be either relative or absolute, such as GPS coordinates.
During the operation stage itself (i.e. during a recording or broadcast), the target source is removed from the mixture using the block-wise LS or RLS method described above. The direction of arrival (DOA) is estimated either acoustically or using other localization techniques.
There is a variety of ways in which the DOA may be estimated. In some embodiments, the estimated RIR value in the time domain relating to each channel of the array of the far-field audio device 101, is analysed. The first received RIR sample that is above a threshold gives an estimate of the delay at which the sound arrives at the nearest microphone of the far-field audio device 101. Comparing the delays from all microphones of the far-field audio device 101 provides the time differences of arrival (TDOA) between microphones in the array of the far-field audio device 101. From these values the direction can be calculated using multilateration methods that are known in the art. The augmented source is synthesized using the target source DOA estimates for retrieving the RIR corresponding to each DOA from the database generated in the calibration stage. The length of the calibration stage depends on the size of the recording space and the required density of the database. The length of the calibration stage may vary from around 10 seconds to several minutes.
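A minimal sketch of this delay-comparison step is given below (illustrative only; the threshold choice and the multilateration step that follows are assumptions outside this fragment):

```python
import numpy as np

def tdoas_from_rirs(rirs_time, fs, threshold_ratio=0.5):
    """rirs_time: (num_mics, rir_length) time-domain RIR estimates, one per
    microphone of the far-field array; fs: sample rate in Hz. Returns the
    time differences of arrival (seconds) relative to the first microphone."""
    delays = []
    for rir in rirs_time:
        mag = np.abs(rir)
        # Index of the first RIR sample above the threshold.
        first_tap = np.argmax(mag > threshold_ratio * mag.max())
        delays.append(first_tap / fs)
    delays = np.asarray(delays)
    return delays - delays[0]   # feed these TDOAs into multilateration
```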
Figure 4 is a plan view of a recording space 103 in accordance with an embodiment whereby audio data is recorded as part of a calibration stage. A speaker 400 is provided with a near-field microphone 102, such as a Lavalier microphone or a handheld microphone. The speaker 400 may also be provided with a location tag 401. A far-field audio recording device 101 is provided towards the centre of the recording space 103. During the calibration stage, the speaker 400 walks around the recording space 103 along a trajectory T. The speaker 400 speaks so that audio data is recorded by both the far-field audio recording device 101 and the near-field microphone 102. The speaker 400 may also play an instrument or carry a sound-producing loudspeaker.
The room impulse response (RIR) data are collected around the place of performance based on the captured "dry" and "wet" signals as well as available position data from the location tag 401. The RIR data are estimated based on the dry to wet signal transfer function at every relevant position with a processing unit using one of the algorithms described above.
Figure 5 is a plan view of a recording space 103 in accordance with another embodiment whereby audio data is recorded as part of a calibration stage. In this embodiment, two drones 500 are provided. Each drone 500 is provided with a near-field microphone 102. Each of the drones 500 emits a noise, either through a loudspeaker or merely from the drone rotors. Two or more far-field audio recording devices 101 are also provided.
The RIR database 105 may be collected during an initial calibration phase where an audio source of wideband noise, for example white noise, a maximum length sequence (MLS), pseudo-random noise, a talking human, an acoustic instrument, a flying drone with a loudspeaker or a ground-based robot, is moving or is moved around the recording space 103, either manually or automatically. The benefit of making calibration recordings and collecting the database prior to an actual performance is that the pre-existing RIR database 105 can be used during the performance to insert additional sound sources into the audio mix in real time.
Additionally, when wideband noise is used for calibration, the RIR data are more accurate over the whole spectrum. The calibration recordings may also benefit from a higher SNR, for example when the audience is absent from the recording space 103. This may provide more accurate and/or faster RIR measurements.
In other embodiments, RIR data may be collected during the performance itself. This may be instead of the calibration phase described above or in addition to it. In the latter scenario, the reliability of the RIR data captured during the calibration process using the block-wise linear least squares projection may be improved by capturing further RIR data during the performance itself. As mentioned above, estimated RIR data are generally valid only for the frequency indices at which the source produced meaningful acoustic output. Usually the RIR data are applied to the same close-field signal, and no mismatch between the time-frequency content and the RIR data occurs. However, for example when augmenting a completely new near-field signal whose spectral content is very different from that used to estimate the available RIR data, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmented signal has significant energy.
In order to avoid active calibration with a known broadband signal, a method for passive online RIR database collection is provided in some embodiments. RIR data estimated at each position of the recording space 103 are used to gradually build a database of broadband RIR data by combining estimates made at different times from the same location within the recording space 103. The recent magnitude spectrum of the near-field signal can be used as an indicator of the reliability of the RIR data, and only frequency indices with substantial signal energy are updated in the database. The database update can vary from a simple weighted average to more advanced combinations based on probabilistic modelling and machine learning in general.
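A simple weighted-average variant of such a database update might be sketched as follows (illustrative names; the reliability weighting by near-field magnitude is one possible choice, not prescribed by the description):

```python
import numpy as np

def update_rir_entry(db_rir, db_weight, new_rir, near_mag, min_mag=1e-4):
    """Per-frequency update of one stored RIR database entry.
    db_rir, new_rir: (freq_bins, taps) complex RIR estimates;
    db_weight: (freq_bins,) accumulated reliability per frequency;
    near_mag: (freq_bins,) recent near-field magnitude spectrum."""
    w_new = np.where(near_mag > min_mag, near_mag, 0.0)  # reliable bins only
    w_tot = db_weight + w_new
    mix = np.where(w_tot > 0.0, w_new / np.maximum(w_tot, 1e-12), 0.0)
    # Weighted average: unreliable bins (mix == 0) keep their old value.
    db_rir = db_rir * (1.0 - mix[:, None]) + new_rir * mix[:, None]
    return db_rir, w_tot
```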
In some embodiments, real-time RIR estimation may be performed by truncating the analysis block of the block-wise least squares process outlined above to the current frame and estimating new filter weights for each frame. Additionally, the block-wise strategy in real-time operation requires constraining the rate of change of the RIR filter parameters between adjacent frames to avoid rapid changes in the projected signals. Furthermore, the truncated block-wise least squares process requires inverting the autocorrelation matrix for each new frame of data.
In some embodiments, real-time RIR estimation may be performed by using a recursive least squares (RLS) algorithm. The signal model, consisting of convolutive mixing in the time-frequency domain, may be defined as:

$$y_t = \sum_{p=1}^{P} \sum_{d=0}^{D-1} x^{(p)}_{t-d}\, h^{(p)}_{d} = \sum_{p=1}^{P} \hat{x}^{(p)}_t$$
In real-time operation the filter weights vary for each time frame t and, again dropping the frequency index f and the channel dimension, the filtering equation for a single source at time frame t may be specified as:

$$\hat{x}_t = \sum_{d=0}^{D-1} x_{t-d}\, h_{t,d} = \mathbf{x}^{\mathsf{T}}_t \mathbf{h}_t \qquad \text{(Equation 10)}$$

where $\mathbf{x}_t = [x_t, x_{t-1}, \ldots, x_{t-D+1}]^{\mathsf{T}}$ and $\mathbf{h}_t = [h_{t,0}, h_{t,1}, \ldots, h_{t,D-1}]^{\mathsf{T}}$.
Efficient real-time operation can be achieved with recursive estimation of the RIR filter weights $\mathbf{h}_t$ using the recursive least squares (RLS) algorithm. The modelling error for time frame t may be specified as:

$$e_t = y_t - \hat{x}_t \qquad \text{(Equation 11)}$$

where $y_t$ is the observed/desired mixture signal. The cost function to be minimized with respect to the filter weights may be expressed as:

$$C(\mathbf{h}_t) = \sum_{\tau=1}^{t} \lambda^{t-\tau} \left\lvert e_\tau \right\rvert^2 \qquad \text{(Equation 12)}$$

which accumulates the estimation error from past frames with exponential weight $\lambda^{t-\tau}$. The weight of the cost function can be thought of as a forgetting factor which determines how much past frames contribute to the estimation of the RIR filter weights at the current frame. RLS algorithms where λ < 1 may be referred to in the art as exponentially weighted RLS, and λ = 1 may be referred to as growing-window RLS.
The RLS algorithm minimizing Equation 12 is based on recursive estimation of the inverse correlation matrix $\mathbf{P}_t$ of the close-field signal and the optimal filter weights $\mathbf{h}_t$, and can be summarized as follows. Initialization:

$$\mathbf{h}_0 = \mathbf{0}, \qquad \mathbf{P}_0 = \delta^{-1}\mathbf{I}$$

Repeat for t = 1, 2, ...:

$$\mathbf{k}_t = \frac{\mathbf{P}_{t-1}\mathbf{x}^{*}_t}{\lambda + \mathbf{x}^{\mathsf{T}}_t\mathbf{P}_{t-1}\mathbf{x}^{*}_t}, \qquad e_t = y_t - \mathbf{x}^{\mathsf{T}}_t\mathbf{h}_{t-1}, \qquad \mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{k}_t e_t, \qquad \mathbf{P}_t = \lambda^{-1}\left(\mathbf{P}_{t-1} - \mathbf{k}_t\mathbf{x}^{\mathsf{T}}_t\mathbf{P}_{t-1}\right) \qquad \text{(Equation 13)}$$
The initial regularization of the inverse autocorrelation matrix is achieved by defining δ using a small positive constant, typically from $10^{-2}$ to $10^{1}$. A small δ value causes faster convergence, whereas a larger δ value constrains the initial convergence to happen over a longer time period (for example, over a few seconds).
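The recursion of Equation 13 for one frequency bin and one channel can be sketched as follows (a sketch under the stated initialization; names are illustrative):

```python
import numpy as np

def rls_rir(x, y, D, lam=0.995, delta=0.1):
    """Exponentially weighted RLS tracking of a D-tap RIR for one frequency
    bin and one channel. x, y: complex near-field and far-field STFT
    coefficient sequences. Returns the filter trajectory, one row per frame."""
    h = np.zeros(D, dtype=complex)
    P = np.eye(D, dtype=complex) / delta        # P_0 = delta^-1 I
    xbuf = np.zeros(D, dtype=complex)           # [x_t, ..., x_{t-D+1}]
    h_track = np.zeros((len(x), D), dtype=complex)
    for t in range(len(x)):
        xbuf = np.roll(xbuf, 1)
        xbuf[0] = x[t]
        Px = P @ np.conj(xbuf)
        k = Px / (lam + xbuf @ Px)              # gain vector
        e = y[t] - xbuf @ h                     # a priori error (Equation 11)
        h = h + k * e                           # filter weight update
        P = (P - np.outer(k, xbuf @ P)) / lam   # inverse correlation update
        h_track[t] = h
    return h_track
```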
The contribution of past frames to the RIR filter estimate at the current frame t may be varied over frequency. Generally, the forgetting factor λ acts in a similar way to the analysis window shape in the truncated block-wise least squares algorithm. However, small changes in source position can cause substantial changes in the RIR filter values at high frequencies, due to the highly reflected and more diffuse sound propagation path. Therefore, the contribution of past frames at high frequencies needs to be lower than at low frequencies. It is assumed that the RIR parameters change slowly at lower frequencies, and source evidence can be integrated over longer periods, meaning that the exponential weight $\lambda^{t-\tau}$ can have substantial values for frames up to 1.5 seconds in the past.
A similar regularization as described above with reference to the block-wise LS may also be adopted for the RLS algorithm. The regularization is done to achieve a similar effect as in the block-wise LS: to improve robustness towards low-frequency crosstalk between near-field signals and to avoid excessively large RIR weights. The near-field microphones are generally not directive at low frequencies and can pick up a fair amount of low-frequency signal content generated by noise sources, for example traffic, loudspeakers etc.
In order to specify regularization of the RIR filter estimates, the RLS algorithm is given in a direct form. In other words, the RLS algorithm is given without using the matrix inversion lemma to derive updates directly to the inverse autocorrelation matrix $\mathbf{P}_t$, but instead for the autocorrelation matrix $\mathbf{R}_t$ (where $\mathbf{R}_t^{-1} = \mathbf{P}_t$). The formulation can be found, for example, in T. van Waterschoot, G. Rombouts, and M. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation," in Proceedings of the Second Annual IEEE BENELUX/DSP Valley Signal Processing Symposium (SPS-DARTS 2006), Antwerp, Belgium, 2005, pp. 28-29.
The direct-form RLS algorithm updates are specified as follows. Initialization:

$$\mathbf{h}_0 = \mathbf{0}, \qquad \mathbf{R}_0 = \delta\mathbf{I}$$

Repeat for t = 1, 2, ...:

$$\mathbf{R}_t = \lambda\mathbf{R}_{t-1} + \mathbf{x}^{*}_t\mathbf{x}^{\mathsf{T}}_t, \qquad e_t = y_t - \mathbf{x}^{\mathsf{T}}_t\mathbf{h}_{t-1}, \qquad \mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}^{-1}_t\mathbf{x}^{*}_t e_t \qquad \text{(Equation 14)}$$
This algorithm gives the same result as the RLS algorithm discussed above but requires an explicit inversion of the autocorrelation matrix, and is thus computationally more expensive; however, it allows the autocorrelation matrix to be regularized. The autocorrelation matrix update with Levenberg-Marquardt regularization (LMR), according to T. van Waterschoot, G. Rombouts, and M. Moonen, "Optimally regularized recursive least squares for acoustic echo cancellation," in Proceedings of the Second Annual IEEE BENELUX/DSP Valley Signal Processing Symposium (SPS-DARTS 2006), Antwerp, Belgium, 2005, pp. 28-29, is:

$$\mathbf{R}_t = \lambda\mathbf{R}_{t-1} + \mathbf{x}^{*}_t\mathbf{x}^{\mathsf{T}}_t + (1-\lambda)\beta_{\mathrm{LMR}}\mathbf{I} \qquad \text{(Equation 15)}$$

where $\beta_{\mathrm{LMR}}$ is obtained from the regularization kernel $k_f$, which increases towards low frequencies, weighted by the inverse average log-spectrum of the close-field signal, as discussed above with respect to the block-wise LS algorithm.
Another type of regularization is Tikhonov regularization (TR), as also introduced in the case of the block-wise LS, which can be defined for the RLS algorithm as:

$$\mathbf{R}_t = \lambda\mathbf{R}_{t-1} + \mathbf{x}^{*}_t\mathbf{x}^{\mathsf{T}}_t + (1-\lambda)\beta_{\mathrm{TR}}\mathbf{I} \qquad \text{(Equation 16)}$$

$$\mathbf{h}_t = \mathbf{h}_{t-1} + \mathbf{R}^{-1}_t\left(\mathbf{x}^{*}_t e_t - (1-\lambda)\beta_{\mathrm{TR}}\mathbf{h}_{t-1}\right) \qquad \text{(Equation 17)}$$

Similarly as before, $\beta_{\mathrm{TR}}$ is based on the regularization kernel and the inverse average log-spectrum of the close-field signal. It should be noted that the kernel $k_f$ needs to be modified to account for the differences between the block-wise LS and RLS algorithms, and can depend on the level difference between the close-field signal and the far-field mixtures.
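One direct-form update step with this kind of regularization might be sketched as follows (the sign of the penalty term in the weight update follows the reconstruction of Equation 17 above and should be treated as an assumption):

```python
import numpy as np

def regularized_rls_step(R, h, xbuf, y_t, lam, beta):
    """One direct-form RLS update with a Tikhonov-style penalty, following
    the structure of Equations 16 and 17. R: (D, D) autocorrelation matrix,
    h: (D,) filter weights, xbuf: (D,) stacked near-field coefficients,
    y_t: far-field mixture coefficient, beta: regularization weight."""
    D = len(h)
    R = lam * R + np.outer(np.conj(xbuf), xbuf) + (1.0 - lam) * beta * np.eye(D)
    e = y_t - xbuf @ h
    # The penalty term shrinks the filter weights towards zero.
    h = h + np.linalg.solve(R, np.conj(xbuf) * e - (1.0 - lam) * beta * h)
    return R, h
```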
In addition to adjusting the regularization weight based on the average log-spectrum, it can also be varied based on the RMS level difference between the near-field and far-field signals. The RMS levels of these signals might not be calibrated in real-time operation, so an additional regularization weight strategy is required. A simple low-pass filter applied to the RMS of each individual STFT frame can be used to track the varying RMS levels of the close-field and far-field signals. The estimated RMS level is used to adjust the regularization weights $\beta_{\mathrm{LMR}}$ or $\beta_{\mathrm{TR}}$ in order to achieve a similar regularization impact as with the RMS-calibrated signals assumed in the earlier equations.

Additional RIR data to be inserted into the RIR database 105 may be collected during the actual performance. This can be done in order to add more data points, for example to make the RIR position database grid denser, or to track time-varying responses, for example when a crowd enters the room and dampens it acoustically. The time-varying responses may also be useful in post-production if some original performances are edited and later added back to the original recording space 103.
Figure 6 illustrates a recording environment whereby a target source 601 is removed from the audio mixture and replaced with a replacement source 602 at the same position. Based on the target source DOA trajectory or location estimates obtained from a location tag of the target source 601, the signal emitted by the target source 601 can be replaced by augmenting separate content to the array mixture. An example scenario of this simple method, replacing a speaker inside a room with another person, is shown in Figure 6. The replacement of a target source may be done using real-time RIR estimation, in which case no RIR database 105 need be used. Alternatively, a calibration phase may be performed with respect to the recording space 103, as described above.
A drawback of augmenting separate signals using the RIR data estimated from the target source 601 in real time lies in the fact that the target source signal may not be broadband, and estimates of RIR data at frequencies with no emitted signal energy may be unreliable. Where the target source 601 and the replacement source 602 have different spectral content (i.e. source signal frequency occupancy in each frame), poor subjective quality of the synthesized augmented source may result, since accurate RIR data for all frequencies may not be available.
In other embodiments, a calibration phase is used to build up a RIR database 105, as described above. The RIR data in the RIR database 105 that is collected with wideband noise is accurate and reliable over the whole frequency spectrum. Using this pre-collected RIR data enables higher-quality replacement of the audio source.
A selection of a position within the recording space is received. This may be the position of the target source 601 received from any location determination method described above. A near-field audio signal is received from the target source 601. A RIR filter related to the position of the target source is identified.
The identified room impulse response filter is then applied to the near-field audio signal of the target source to project the near-field audio signal of the target source into a far-field space. As explained above, this RIR filter may be calculated in real time.
The projected near-field audio signal may then be removed from the audio mixture, as shown in Equation 9 above. A near-field audio signal from the replacement source 602 is received.
A room impulse response filter relating to the position within the recording space is identified. This may be the same room impulse response filter used to remove the target source. Alternatively, the room impulse response filter applied to the near-field audio signal of the replacement source 602 may be retrieved from a room impulse response filter database collected during a calibration phase.
The selected room impulse response filter is then applied to the near-field audio signal of the replacement source 602 to obtain a projected near-field audio signal of the replacement source 602.
The audio mixture of the far-field microphone device may then be augmented by adding the projected near-field audio signal of the replacement source 602 to the audio mixture. As such, the target source 601 is removed and replaced with the replacement source 602.
Figure 7 illustrates a recording environment whereby a completely new near-field signal recorded from a new source 701 located outside the recording space 103 is inserted into the audio mix of the far-field audio recording device 101. In this case of adding a completely new near-field signal to the augmented mix, the RIR data need to be broadband and valid at least for the STFT frequency indices where the augmenting signal has significant energy. A user may wish for the new source 701 to be added to the recording space 103 at a particular virtual location within the recording space 103. Based on this specified virtual location, the new signal can be used to augment the content to the audio mixture recorded by the far-field microphone array of the far-field audio recording device 101. For example, a virtual person can be visually rendered to an AR view and at the same time the audio can be rendered in such a way that it sounds as though the new source 701 is standing at the location at which the source appears visually in AR. An example scenario of this method to add a virtual speaker to a room is shown in Figure 7.
Rendering a virtual speaker to a room using advanced AR, VR or 6DoF rendering may require a large amount of RIR data. For example, there may be more than one far-field audio recording device 101-1 and 101-2, and several new sound sources 701 to be rendered at the same time. The 6DoF usage scenario requires that rendering from a first position to a second position is possible (in 6DoF, the listener for whom the audio is being rendered in playback can move freely anywhere in the virtual environment).
Embodiments of the invention use the RIR data from the RIR database 105 to render the audio objects with a natural-sounding presence in any location within the scene.

Having time-varying RIR responses may be useful in post-production if some original performances are edited and later added back to the original recording space 103. In practice this requires that the most recent time-stamped RIR data are obtained from the RIR database 105 in addition to the selected position.
A near-field audio signal from the new source 701 is received, to be added to a far-field audio mixture at a selected position of a recording space 103. A room impulse response filter relating to the selected position within the recording space is identified. The room impulse response filter applied to the near-field audio signal of the new source 701 may be retrieved from a room impulse response filter database collected during a calibration phase. The selected room impulse response filter is then applied to the near-field audio signal of the new source 701 to obtain a projected near-field audio signal of the new source 701.
The audio mixture of the far-field microphone devices 101 may then be augmented by adding the projected near-field audio signal of the new source 701 to the audio mixture.
In some embodiments, RIR values are calculated using multiple sound sources simultaneously. That is, audio signals from separate sound sources are obtained by near-field and far-field microphones and used simultaneously to determine RIR values for the recording space 103. Estimating the RIR values for most or all of the separate sound sources within the recording space 103 improves the quality of the individual RIR values.
In order to mix "dry" close-field signals into "wet" microphone array signals with high quality, the RIR values between the near-field and far-field signals of as many sound sources as possible need to be estimated simultaneously. In these embodiments, a far-field audio recording device 101 comprising a microphone array and two or more separate near-field microphones 102 are provided. Each of the near-field microphones 102 is located near to a sound source. RIR values from the sound sources to the far-field audio recording device 101 are estimated simultaneously for each source and for each microphone.
In these embodiments, the mixture STFT with multiple near-field source signals p = 1, ..., P can be written as:

$$y_t = \sum_{p=1}^{P} \sum_{d=0}^{D-1} x^{(p)}_{t-d}\, h^{(p)}_{d} \qquad \text{(Equation 18)}$$

The parameters of the RIR filters for multiple sources can be solved simultaneously by arranging the coefficients of the above model as:

$$\mathbf{y} = \left[\mathbf{X}^{(1)}\; \mathbf{X}^{(2)}\; \cdots\; \mathbf{X}^{(P)}\right] \begin{bmatrix} \mathbf{h}^{(1)} \\ \mathbf{h}^{(2)} \\ \vdots \\ \mathbf{h}^{(P)} \end{bmatrix} \qquad \text{(Equation 19)}$$

The simultaneous estimates of $\mathbf{h}^{(p)}$ avoid crosstalk issues, since regularization tends towards solutions where the reverberated mixture signal is modelled using the smallest weights. This means that signal components in the audio mixture become modelled by projection with the near-field signal having the most energy in the respective time-frequency regions. This can be thought of as the most probable solution, or the minimum RIR boost solution. The estimates of source RIR values become more accurate as more information about the mixing is available. For computationally efficient processing, the block-wise LS solution given in Equation 7 above can be formulated for multiple channels using matrices. The model with multiple channels becomes:
$$y_{t,c} = \sum_{d=0}^{D-1} x_{t-d}\, h_{d,c} \quad\Longleftrightarrow\quad \mathbf{Y} = \mathbf{X}\mathbf{H} \qquad \text{(Equation 20)}$$

The matrices may be defined as:

$$\begin{bmatrix} y_{t,1} & y_{t,2} & \cdots & y_{t,C} \\ y_{t+1,1} & y_{t+1,2} & \cdots & y_{t+1,C} \\ \vdots & & & \vdots \\ y_{t+T,1} & y_{t+T,2} & \cdots & y_{t+T,C} \end{bmatrix} = \begin{bmatrix} x_t & x_{t-1} & \cdots & x_{t-(D-1)} \\ x_{t+1} & x_t & \cdots & x_{t+1-(D-1)} \\ \vdots & & & \vdots \\ x_{t+T} & x_{t+T-1} & \cdots & x_{t+T-(D-1)} \end{bmatrix} \begin{bmatrix} h_{0,1} & h_{0,2} & \cdots & h_{0,C} \\ h_{1,1} & h_{1,2} & \cdots & h_{1,C} \\ \vdots & & & \vdots \\ h_{D-1,1} & h_{D-1,2} & \cdots & h_{D-1,C} \end{bmatrix} \qquad \text{(Equation 21)}$$
For a block length of T + 1, with C input channels and an RIR filter length of D frames, the matrix equation dimensions are:

$$\mathbf{Y}_{[(T+1)\times C]} = \mathbf{X}_{[(T+1)\times D]}\,\mathbf{H}_{[D\times C]} \qquad \text{(Equation 22)}$$

This needs to be repeated for all frequencies f = 1, ..., F.
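For one frequency bin, the multi-channel solution of Equation 22 can be sketched as follows (illustrative; for multiple sources the per-source convolution matrices would additionally be stacked side by side, as in Equation 19):

```python
import numpy as np

def blockwise_ls_multichannel(x, Y, D):
    """Solve Equation 22 for one frequency bin. x: length-(T+1) near-field
    STFT sequence; Y: (T+1, C) far-field STFT coefficients, one column per
    array channel. Returns H of shape (D, C), one RIR column per channel."""
    T1, C = Y.shape
    X = np.zeros((T1, D), dtype=complex)
    for d in range(D):
        X[d:, d] = x[:T1 - d]
    # lstsq solves all C channels in a single call.
    H, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return H
```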
In embodiments where multiple near-field signals are available for RIR calculation, the RIR filters associated with each source may be estimated simultaneously, as explained above. In some embodiments, a recursive least squares (RLS) model may be used. The RLS signal model for multiple sources may be expressed as:

$$\hat{y}_t = \sum_{p=1}^{P} \sum_{d=0}^{D-1} x^{(p)}_{t-d}\, h^{(p)}_{t,d} = \mathbf{x}^{\mathsf{T}}_t \mathbf{h}_t \qquad \text{(Equation 23)}$$

where $h^{(p)}_{t,d}$ is the RIR value. The vector variables $\mathbf{x}_t$ and $\mathbf{h}_t$ contain the source signals and filter coefficients as stacked, and can be specified for the source signals as:

$$\mathbf{x}_t = \left[x^{(1)}_t, \ldots, x^{(1)}_{t-D+1},\; x^{(2)}_t, \ldots, x^{(2)}_{t-D+1},\; \ldots,\; x^{(P)}_t, \ldots, x^{(P)}_{t-D+1}\right]^{\mathsf{T}} \qquad \text{(Equation 24)}$$

and for the filter coefficients as:

$$\mathbf{h}_t = \left[h^{(1)}_{t,0}, \ldots, h^{(1)}_{t,D-1},\; \ldots,\; h^{(P)}_{t,0}, \ldots, h^{(P)}_{t,D-1}\right]^{\mathsf{T}} \qquad \text{(Equation 25)}$$
In the RLS algorithm, the autocorrelation matrix $\mathbf{R}_t$ or its recursively estimated inverse $\mathbf{P}_t$ scales accordingly, to a size of $\mathbf{P}_t \in \mathbb{C}^{DP \times DP}$. With the above definitions, the standard RLS algorithm as specified in Equation 13 or Equation 14 can be used to jointly estimate all near-field signal RIR values simultaneously, which improves the estimation accuracy.
If regularization is used in the case of simultaneously estimated RIR filters, the regularization matrix $\beta_{\mathrm{LMR}}\mathbf{I}$ used earlier in Equation 15, or $\beta_{\mathrm{TR}}\mathbf{I}$ in Equations 16 and 17, is changed to correspond to the dimension and internal structure of the stacked variables $\mathbf{h}_t$ and $\mathbf{x}_t$ in Equations 24 and 25, while taking into account the different average spectra of each near-field source. $\beta^{(p)}_{\mathrm{LMR}}$ may be used to denote the regularization weight of each source p = 1, ..., P, which is again derived from the regularization kernel $k_f$ (at each frequency index f) and the source-dependent inverse average log-spectrum. The regularization matrix corresponding to the RLS formulation above, with simultaneous estimation of the RIR filters of multiple sources, is given as the block-diagonal matrix:

$$\boldsymbol{\beta} = \begin{bmatrix} \beta^{(1)}\mathbf{I} & & \\ & \ddots & \\ & & \beta^{(P)}\mathbf{I} \end{bmatrix} \qquad \text{(Equation 26)}$$

where each identity block $\mathbf{I}$ is of size $D \times D$.
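Building such a block-diagonal regularization matrix is straightforward; a sketch (illustrative names, real-valued weights assumed) is:

```python
import numpy as np

def stacked_regularization(betas, D):
    """Build the block-diagonal regularization matrix of Equation 26: one
    D x D identity block per source, scaled by that source's weight.
    betas: sequence of P per-source regularization weights."""
    P = len(betas)
    B = np.zeros((P * D, P * D))
    for p, beta_p in enumerate(betas):
        B[p * D:(p + 1) * D, p * D:(p + 1) * D] = beta_p * np.eye(D)
    return B
```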
In some embodiments, signal activity detection (SAD) or voice activity detection (VAD) may be used. The role of the SAD or VAD is to determine when the target source is emitting sound and which frames of the current block contain information (signal energy) that can be used to estimate the RIR using the LS projection. Two approaches for utilizing the SAD information in the block-wise LS method for moving-source RIR estimation are described below.
The first approach comprises zeroing out the mixture signal STFT coefficients in the frames where the SAD indicates that no signal energy is being emitted; this is repeated for all frequencies in the respective frame. The near-field signal is not zeroed out, since it is considered to already contain a negligible amount of energy, and hard zeros would tend the RIR estimation towards unreasonable and unbounded solutions, since a projection between two zero signals would effectively be estimated. Practical experiments verify that blocks with only a few active frames lead to an excess amount of reverberation in the projected signal if both signals are zeroed out before RIR estimation by Equation 7, whereas this effect is avoided by zeroing out only the audio mixture signal.
The second approach comprises using weighted least squares (WLS) for the RIR estimation, giving higher importance to frames that have more near-field signal energy present. The WLS minimization criterion may be expressed as:

$$\min_{\mathbf{h}} \sum_{\tau=t}^{t+T} w_\tau \left\lvert y_\tau - \mathbf{x}^{\mathsf{T}}_\tau\mathbf{h} \right\rvert^2 \qquad \text{(Equation 27)}$$

where $w_t$ are the weights of each STFT frame t within the current block. The WLS solution is achieved as:

$$\mathbf{h} = \left(\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{W}\mathbf{y} \qquad \text{(Equation 28)}$$

where $\mathbf{W}$ is a diagonal matrix containing the frame weights $w_t$.
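The WLS solution of Equation 28 can be computed by scaling the rows of the LS system by the square roots of the weights, as in the following sketch (illustrative; the frame weights could, for example, be the near-field frame energies):

```python
import numpy as np

def weighted_ls_rir(X, y, w):
    """Weighted least squares solution of Equation 28 for one block.
    X: (T+1, D) convolution matrix, y: (T+1,) far-field mixture
    coefficients, w: (T+1,) non-negative per-frame weights."""
    sqrt_w = np.sqrt(w)
    # Row scaling by sqrt(w) converts WLS into an ordinary LS problem.
    h, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return h
```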
In both approaches, the practical considerations of the SAD implementation include that the SAD output needs to remain active for some time after the near-field signal energy has dropped below the detection threshold. This is because the reverberated signal components in the far-field signal can be delayed by a significant number of time frames, and their contribution to the RIR estimation needs to be accounted for.
In some implementations, if the sound sources are mostly speech, a voice activity detection (VAD) algorithm of a type known in the art may be used in place of the proposed SAD.
In some embodiments RIR filtered (i.e. projected) signals are used as a basis for generating Time/Frequency (T/F) masks. Using projected signals improves the quality of the suppression. This is because the projection (i.e. filtering with the RIR) converts the "dry" near-field source signal into a "wet" signal and thus the created mask is a better match to the "wet" far-field microphone captured signals.
After estimation of the RIR filters for one of the target sources p, the projection may be obtained as:

$$\hat{x}^{(p)}_t = \sum_{d=0}^{D-1} x^{(p)}_{t-d}\, h^{(p)}_{d} \qquad \text{(Equation 29)}$$
Further, the mixture without the pth source may be denoted as:

$$\hat{y}^{(p)}_t = y_t - \hat{x}^{(p)}_t = y_t - \sum_{d=0}^{D-1} x^{(p)}_{t-d}\, h^{(p)}_{d} \qquad \text{(Equation 30)}$$
In case of inaccuracies in the estimation of the RIR filters, some faint time-frequency details of the pth source remain in $\hat{y}^{(p)}_t$. The LS projection matches the overall signal energies between the mixture and the projected near-field signal, and thus a magnitude ratio mask for masking out the time-frequency content of $\hat{x}^{(p)}_t$ from the resulting mixture $\hat{y}^{(p)}_t$ can be easily defined. The time-frequency masking with a magnitude ratio mask defined using the projected signal and the original array mixture may be expressed as:

$$\tilde{y}^{(p)}_t = \hat{y}^{(p)}_t \left(1 - \frac{\lvert \hat{x}^{(p)}_t \rvert}{\lvert y_t \rvert}\right) \qquad \text{(Equation 31)}$$
The magnitude ratio mask can be calculated using perceptually motivated frequency resolution, for example Mel-scale, by integrating over several STFT frequency bins.
The above equations may be used if the T/F masking is used to improve the quality of removing a source from the far-field microphone signals. If instead the source is to be added, i.e. enhanced, in the far-field microphone signals, that enhancement can be improved by using the same equations as above but changing the subtraction to addition in Equations 30 and 31.
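One possible implementation of this masking step, under the reconstruction of Equation 31 given above, is sketched below (illustrative; the clipping of the mask to [0, 1] is an added safeguard, not taken from the description):

```python
import numpy as np

def ratio_mask_removal(Y, X_proj, eps=1e-12):
    """Suppress residual content of a projected source from the mixture with
    a magnitude ratio mask. Y, X_proj: (freq_bins, frames) STFTs of the
    array mixture and of the projected ("wet") near-field source."""
    mask = np.minimum(1.0, np.abs(X_proj) / (np.abs(Y) + eps))
    # Use (1.0 + mask) instead to enhance rather than remove the source.
    return Y * (1.0 - mask)
```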
Figure 8 shows operation of a user interface (UI) 900, shown in Figure 9, and a software block for audio editor software, e.g. a Digital Audio Workstation (DAW). The DAW may have several input tracks, whereby each track may include one or more channels. At steps 8.1 and 8.2, audio signals are received from a near-field microphone 102 and from a far-field audio recording device 101 respectively.
A first track may be a single-channel track from a near-field microphone 102, for example a Lavalier microphone. A second track may be a multiple-channel track from a far-field audio recording device 101. A user input to mix audio signals may be detected at step 8.3. When a user wants to mix the near-field microphone track (one channel) with a far-field audio recording device track (eight channels), the DAW first automatically calculates the time-dependent RIR filter between the single near-field microphone track channel and each far-field audio recording device track channel, at step 8.4. The near-field microphone channel is then filtered, at step 8.5, with the channel-dependent RIR filters and added to each far-field audio recording device track channel, at step 8.6, as part of the audio mixture.
Typically, in DAWs, when channels are mixed the user may choose a multiplier for the track to be mixed. A multiplier of -1 may denote that the instrument in the near-field microphone track is to be removed from the far-field audio recording device track, while a multiplier of +1 means that the near-field microphone track is to be added to it. Many other multipliers may be used, and the multiplier may be expressed in decibels or in other formats. The near-field microphone track may be multiplied by the multiplier prior to adding it to the far-field audio recording device track channels.
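The mixing step with a user-chosen multiplier might be sketched as follows (a sketch only; shapes and names are illustrative, and the per-bin convolution along frames implements the convolutive transfer function model used throughout):

```python
import numpy as np

def mix_tracks(near_stft, far_stft, rir_filters, multiplier=-1.0):
    """Filter a near-field track with per-channel RIRs and mix it into each
    far-field channel. near_stft: (freq_bins, frames); far_stft:
    (freq_bins, frames, C); rir_filters: (freq_bins, D, C) RIR taps.
    multiplier = -1 removes the source, +1 adds it; other gains are allowed."""
    F, T, C = far_stft.shape
    out = far_stft.copy()
    for c in range(C):
        for f in range(F):
            # Convolve along frames with this channel's RIR (Equation 8).
            proj = np.convolve(near_stft[f], rir_filters[f, :, c])[:T]
            out[f, :, c] += multiplier * proj
    return out
```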
The UI 900 can be used in different kinds of software. For example, it may be a plugin for a DAW or an independent software component. The software may show many kinds of audio tracks, each track with one or more channels. Each channel may represent a microphone. When a first track is added (or subtracted) to a second track the software calculates the RIR filters between the channels in the first and second tracks. The channels in the first track may then be added or subtracted with user selected gains to the channels in the second track. The user may select how RIR filters are used or not used in the addition/subtraction.
In many cases, controlling the UI 900 is simplest if the near-field microphones are specialized to certain functions. For example, some microphones may be marked with a "minus" sign and automatically control the UI 900 so that the sound source recorded with that microphone is subtracted from the far-field microphones. This functionality allows for the simple user interaction of placing such microphones near noise sources such as ventilation, machinery etc. Other microphones may be marked with a "plus" sign, and the sound sources recorded by these microphones are enhanced in the far-field microphones. If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above-described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagram of Figure 3 is an example only and that various operations depicted therein may be omitted, reordered and/or combined.
Whilst the above embodiments have been described with reference to short-time Fourier transforms, it should be appreciated that several different linear transforms from the time to the frequency domain can be used instead. Examples include discrete cosine transforms, wavelet transforms, or Mel- or Bark-scale modified implementations. Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims. It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims
1. A method comprising:
receiving, via a first track, a near-field audio signal from a near-field microphone;
receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones;
determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array;
for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and
augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
2. A method according to claim l, further comprising:
receiving audio signals from a plurality of near-field microphones;
detecting a user selection of one of the near-field microphones corresponding to the near-field audio signal; and
automatically determining the set of time dependent room impulse response filters in response to the user selection of the near-field microphone corresponding to the near-field audio signal.
3. A method according to claim 1 or 2, further comprising detecting a user selection of a signal multiplier to apply to the track relating to the near-field microphone.
4. A method according to claim 1, further comprising:
assigning audio signals from a particular near-field microphone with either a positive weighting or a negative weighting; and augmenting the far-field audio signal by adding the filtered near-field audio signal to the far-field audio signal if the weighting is positive or subtracting the filtered near-field audio signal from the far-field audio signal if the weighting is negative.
5. A method comprising:
receiving a plurality of near-field audio signals from respective near-field microphones;
receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and
determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
6. A method according to claim 5, wherein the set of room impulse response filters is determined using a blockwise linear least squares algorithm.
7. A method according to claim 5, wherein the set of room impulse response filters is determined using a recursive least squares algorithm.
8. A method comprising:
receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space;
receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
identifying timeframes that contain near-field signal energy above a signal activity threshold;
determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and
using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
9. A method comprising: receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;
receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;
determining location information relating to the mobile source;
transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and
using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
10. A method comprising:
receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal;
subtracting the projected near-field audio signal from an audio mix of a far-field microphone array;
determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array;
using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and
removing the residual signal components from the audio mix.
11. Apparatus configured to perform a method according to any preceding claim.
12. Computer-readable instructions which when executed by computing apparatus cause the computing apparatus to perform a method as claimed in any of claims 1 to 10.
13. An apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to: receive, via a first track, a near-field audio signal from a near-field microphone; receive, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones;
determine, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array;
for one or more channels of the microphone array, filter the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and
augment the far-field audio signal by applying the filtered near-field audio signal thereto.
14. An apparatus according to claim 13, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
receive audio signals from a plurality of near-field microphones;
detect a user selection of one of the near-field microphones corresponding to the near-field audio signal; and
automatically determine the set of time dependent room impulse response filters in response to the user selection of the near-field microphone corresponding to the near-field audio signal.
15. An apparatus according to claim 13 or claim 14, wherein the computer program code, when executed by the at least one processor, causes the apparatus to detect a user selection of a signal multiplier to apply to the track relating to the near-field microphone.
16. An apparatus according to claim 13, wherein the computer program code, when executed by the at least one processor, causes the apparatus to:
assign audio signals from a particular near-field microphone with either a positive weighting or a negative weighting; and
augment the far-field audio signal by adding the filtered near-field audio signal to the far-field audio signal if the weighting is positive or subtracting the filtered near- field audio signal from the far-field audio signal if the weighting is negative.
17. An apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
receive a plurality of near-field audio signals from respective near-field microphones;
receive a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and
determine, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
18. An apparatus according to claim 17, wherein the set of room impulse response filters is determined using a blockwise linear least squares algorithm.
19. An apparatus according to claim 17, wherein the set of room impulse response filters is determined using a recursive least squares algorithm.
20. An apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
receive, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space;
receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
identify timeframes that contain near-field signal energy above a signal activity threshold;
determine a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and
use the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
21. Apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
receive, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;
receive, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;
determine location information relating to the mobile source;
transform the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and
use the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
22. Apparatus comprising:
at least one processor; and
at least one memory including computer program code, which when executed by the at least one processor, causes the apparatus to:
receive, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
apply a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal;
subtract the projected near-field audio signal from an audio mix of a far-field microphone array;
determine a time-frequency magnitude ratio mask based on the projected near- field audio signal and the audio mix of the far-field microphone array;
use the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and
remove the residual signal components from the audio mix.
23. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
receiving, via a first track, a near-field audio signal from a near-field microphone;
receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones;
determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array;
for one or more channels of the microphone array, filtering the near-field audio signal using one or more room impulse response filters of the respective one or more channels; and
augmenting the far-field audio signal by applying the filtered near-field audio signal thereto.
24. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
receiving a plurality of near-field audio signals from respective near-field microphones;
receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and
determining, using each of the near-field audio signals and the far-field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
25. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space; receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
identifying timeframes that contain near-field signal energy above a signal activity threshold;
determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and
using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
26. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;
receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;
determining location information relating to the mobile source;
transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and
using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
27. A computer-readable medium having computer-readable code stored thereon, the computer-readable code, when executed by at least one processor, causing performance of at least:
receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space;
applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal; subtracting the projected near-field audio signal from an audio mix of a far-field microphone array;
determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array;
using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and
removing the residual signal components from the audio mix.
28. Apparatus comprising:
means for receiving, via a first track, a near-field audio signal from a near-field microphone;
means for receiving, via a second track, a far-field audio signal from an array comprising one or more far-field microphones, wherein the far-field audio signal comprises audio signal components across one or more channels corresponding respectively to the or each of the far-field microphones;
means for determining, using the near-field audio signal and the or each component of the far-field audio signal, a set of time dependent room impulse response filters, wherein each of the time dependent room impulse response filters is in relation to the near-field microphone and respective the or each of the channels of the microphone array;
means for, for one or more channels of the microphone array, filtering the near- field audio signal using one or more room impulse response filters of the respective one or more channels; and
means for augmenting the far-field audio signal by applying the filtered near- field audio signal thereto.
29. Apparatus comprising:
means for receiving a plurality of near-field audio signals from respective near- field microphones;
means for receiving a far-field audio signal from a microphone array comprising a plurality of far-field microphones; and
means for determining, using each of the near-field audio signals and the far- field audio signal simultaneously, a set of room impulse response filters, wherein each of the room impulse response filters is in relation to the plurality of near-field microphones and the multi-channel microphone array.
30. Apparatus comprising:
means for receiving, from a far-field microphone, a far-field audio signal in a time domain corresponding to a mobile source located within a recording space;
means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for identifying timeframes that contain near-field signal energy above a signal activity threshold;
means for determining a linear transformation of the near-field audio signal and a linear transformation of the far-field audio signal with respect to the timeframes identified as containing near-field signal energy above the signal activity threshold; and means for using the linear transformation of the far-field audio signal and the linear transformation of the near-field audio signal to determine a set of room impulse response filters of the recording space.
31. Apparatus comprising:
means for receiving, from a far-field microphone device, at least one far-field audio signal in a time domain corresponding to a mobile source located within a recording space, the or each far-field audio signal corresponding to a respective channel of an audio mixture of the far-field microphone device;
means for receiving, from a near-field microphone, a first near-field audio signal in a time domain corresponding to the mobile source;
means for determining location information relating to the mobile source;
means for transforming the or each far-field audio signal and the first near-field audio signal from the time domain to a time-frequency domain; and
means for using the transformations of the far-field audio signal and the first near-field audio signal and the location information of the mobile source to determine a set of room impulse response filters of the recording space using a recursive least squares algorithm.
32. Apparatus comprising:
means for receiving, from a near-field microphone, a near-field audio signal in a time domain corresponding to the mobile source located within the recording space; means for applying a room impulse response filter to the near-field audio signal to obtain a projected near-field audio signal;
means for subtracting the projected near-field audio signal from an audio mix of a far-field microphone array; means for determining a time-frequency magnitude ratio mask based on the projected near-field audio signal and the audio mix of the far-field microphone array; means for using the time-frequency magnitude ratio mask to identify residual components of the projected near-field audio signal; and
means for removing the residual signal components from the audio mix.
PCT/FI2018/050397 2017-06-20 2018-05-25 Processing audio signals WO2018234619A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1709851.8 2017-06-20
GBGB1709851.8A GB201709851D0 (en) 2017-06-20 2017-06-20 Processing audio signals

Publications (2)

Publication Number Publication Date
WO2018234619A2 true WO2018234619A2 (en) 2018-12-27
WO2018234619A3 WO2018234619A3 (en) 2019-02-28

Family

ID=59462270

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050397 WO2018234619A2 (en) 2017-06-20 2018-05-25 Processing audio signals

Country Status (2)

Country Link
GB (1) GB201709851D0 (en)
WO (1) WO2018234619A2 (en)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7627129B2 (en) * 2002-11-21 2009-12-01 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for suppressing feedback
US20110064233A1 (en) * 2003-10-09 2011-03-17 James Edwin Van Buskirk Method, apparatus and system for synthesizing an audio performance using Convolution at Multiple Sample Rates
GB0523946D0 (en) * 2005-11-24 2006-01-04 King S College London Audio signal processing method and system
WO2014178479A1 (en) * 2013-04-30 2014-11-06 인텔렉추얼디스커버리 주식회사 Head mounted display and method for providing audio content by using same

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11483651B2 (en) 2018-10-10 2022-10-25 Nokia Technologies Oy Processing audio signals
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device
CN111341303B (en) * 2018-12-19 2023-10-31 北京猎户星空科技有限公司 Training method and device of acoustic model, and voice recognition method and device
CN111785282A (en) * 2019-04-03 2020-10-16 阿里巴巴集团控股有限公司 Voice recognition method and device and intelligent sound box
CN113949976A (en) * 2020-07-17 2022-01-18 通用微(深圳)科技有限公司 Sound collection device, sound processing device and method, device and storage medium
WO2022012043A1 (en) * 2020-07-17 2022-01-20 通用微(深圳)科技有限公司 Audio capturing device, audio processing device, method, device, and storage medium
CN113949976B (en) * 2020-07-17 2022-11-15 通用微(深圳)科技有限公司 Sound collection device, sound processing device and method, device and storage medium
CN112995838A (en) * 2021-03-01 2021-06-18 支付宝(杭州)信息技术有限公司 Sound pickup apparatus, sound pickup system, and audio processing method
CN113253244A (en) * 2021-04-07 2021-08-13 深圳市豪恩声学股份有限公司 TWS earphone distance sensor calibration method, equipment and storage medium

Also Published As

Publication number Publication date
GB201709851D0 (en) 2017-08-02
WO2018234619A3 (en) 2019-02-28


Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18819807

Country of ref document: EP

Kind code of ref document: A2