GB2609667A - Audio rendering - Google Patents

Audio rendering

Info

Publication number
GB2609667A
Authority
GB
United Kingdom
Prior art keywords
gain
path
delay
virtual
ear
Prior art date
Legal status
Pending
Application number
GB2111674.4A
Other versions
GB202111674D0 (en)
Inventor
Nixon Thomas
Pike Chris
Current Assignee
British Broadcasting Corp
Original Assignee
British Broadcasting Corp
Priority date
Filing date
Publication date
Application filed by British Broadcasting Corp filed Critical British Broadcasting Corp
Priority to GB2111674.4A priority Critical patent/GB2609667A/en
Publication of GB202111674D0 publication Critical patent/GB202111674D0/en
Publication of GB2609667A publication Critical patent/GB2609667A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • H04S7/302Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303Tracking of listener position or orientation
    • H04S7/304For headphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10KSOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K15/00Acoustics not otherwise provided for
    • G10K15/08Arrangements for producing a reverberation or echo sound
    • G10K15/12Arrangements for producing a reverberation or echo sound using electronic time-delay networks
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03GCONTROL OF AMPLIFICATION
    • H03G3/00Gain control in amplifiers or frequency changers
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03GCONTROL OF AMPLIFICATION
    • H03G3/00Gain control in amplifiers or frequency changers
    • H03G3/20Automatic control
    • H03G3/30Automatic control in amplifiers having semiconductor devices
    • H03G3/3005Automatic control in amplifiers having semiconductor devices in amplifiers suitable for low-frequencies, e.g. audio amplifiers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04RLOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R5/00Stereophonic arrangements
    • H04R5/033Headphones for stereophonic communication
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S1/00Two-channel systems
    • H04S1/002Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S1/005For headphones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S7/00Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30Control circuits for electronic adaptation of the sound field
    • H04S7/305Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H04S7/306For headphones
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2400/00Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04SSTEREOPHONIC SYSTEMS 
    • H04S2420/00Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11Application of ambisonics in stereophonic audio systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

A first method comprises: receiving audio data for object sources including an audio signal and metadata parameters, processing the audio signal using an object source processing path comprising separate direct and diffuse paths, processing the output of the object source processing path by applying binaural filters, wherein a delay function is applied to the audio signal per object per ear in the direct path. A second method alternatively comprises: applying a binaural filter per virtual loudspeaker, deriving expected and desired gains for each object and a scaling factor that is the desired gain divided by the expected gain and applying the scaling factor to the signal path, wherein the expected gain is the gain derived taking into account interaction between the binaural filters. The metadata may comprise ADM parameters or input parameters related to the orientation or size of the user’s head.

Description

Audio Rendering
BACKGROUND OF THE INVENTION
This invention relates to audio rendering and in particular rendering of binaural audio.
The concept of binaural audio is well known in the field and involves providing two-channel audio signals to be replayed at the ears of a listener. This can be by producing two signals, one for each ear of a listener, to be reproduced through earphones, headphones, a VR headset or other device for producing sound signals directly at each ear of the listener, or can be over loudspeakers using "cross-talk cancellation" techniques that essentially try to create virtual headphones, where the two signals at the ears are independent, with no cross-talk.
The particular challenge for binaural audio is introducing a realistic spatial sound experience for a user, representing audio from different physical or virtual locations appropriately in each ear. Some known approaches to processing signals so as to provide appropriate output signals for each ear include using a binaural impulse response filter, BIR filter, such as a binaural room impulse response filter, BRIR filter, which takes as an input anechoic signals (signals having no effects of reflection or frequency change due to a room environment and no spatial auditory cues) and, by convolving with BRIR filters, can produce output signals to be presented through headphones which provide to the listener an experience similar to listening in a room. The characteristics of the room are built into the BRIR filter, including delays due to reflections from walls, attenuation at different frequencies due to materials in the room and so on. In addition, the characteristics of a listener's head when listening in a room will also alter the way in which signals are received at each ear. The position and orientation of the user's head will cause variation in how signals from a given source in a room are received at each ear. Such characteristics may be defined by a head-related impulse response filter, HRIR filter, or a head-related transfer function, HRTF. When measuring signals to produce BRIR filters, the head-related transfer function of the dummy head used to measure sound within the room is effectively built into the BRIR filter.
The concept of applying BRIR filters to produce binaural signals for augmented reality, virtual reality, extended reality or synthesised reality systems as described above is known in the art.
Increasingly, object-based audio is becoming popular, in which individual objects (audio sources) are represented in an object-oriented way with an audio stream and meta data for that audio stream defining certain characteristics, such as a position in space, size of an object and other characteristics that affect audio presentation. As an example, a human speaker within a room may be represented as one object. Other individual sources may be represented as other objects.
General background noise may be represented as an object in its own right. Indeed, any component of an audio scene may be represented as an object by providing a separate audio stream for that object and accompanying meta data.
Processing arrangements exist for taking object-based audio signals and providing binaural signals as an output. Such arrangements involve convolving input audio streams with BRIR filters to produce binaural outputs and have two sources of input to this process: content and reproduction metadata. The object-based audio content metadata may be represented in a standard form such as ADM, the Audio Definition Model, as defined by Recommendation ITU-R BS.2076. The other source is reproduction metadata, which can include real-time information about the listener's position and orientation. This could also include the set of filters, which can be exchanged/modified to account for the individual listener and for the listening environment.
SUMMARY OF THE INVENTION
We have appreciated problems related to the production of binaural signals in the context of object-based audio. An object-based audio stream may include objects of different types. We have further appreciated that existing systems either do not support the full set of object parameters, such as ADM parameters, or introduce excessive coloration.
The invention is defined in the claims, to which reference is directed, with preferred features set out in the dependent claims.
In broad terms, the invention resides in a new signal path arrangement for processing object-based audio to produce binaural output signals. This arrangement combines dynamic delays in a direct path, with static delays in a diffuse path in order to support all parameters while reducing coloration. The parameters may be ADM parameters, or use another metadata structure, and include parameters such as size, shape and diffuseness of objects.
The new signal path arrangement involves introducing delay functions for each respective output signal for each ear, appropriate to the nature of the object being processed and applied prior to convolving with BRIR filters. The use of such delays prior to convolving with BRIR filters avoids problems with colouring of the sound. The signal path arrangement may further provide a path for each virtual loudspeaker of a virtual loudspeaker system as a mechanism to simplify compatibility with real loudspeaker systems, to provide a similar experience to the user as if they were listening to that actual loudspeaker system, and to reduce the complexity of introducing more objects into a scene.
The invention also resides in gain control provided in each signal processing path.
The new gain control method avoids problems with gain variation that can be introduced when attempting to represent objects as varying in position in space, size, shape, diffuseness or other changes either by movement of the object or by movement of the listener's head.
BRIEF DESCRIPTION OF THE DRAWINGS
An embodiment of the invention will now be described in more detail by way of example with reference to the drawings, in which:
Figure 1 is a diagram showing the main functional components of signal processing paths to produce a binaural output of an arrangement embodying the invention;
Figure 2 is a diagram of the control architecture for a system embodying the invention;
Figure 3 shows the logical structure of the gain compensation calculator arrangement embodying the invention;
Figure 4 shows the effect of mixing signals for a pair of virtual loudspeakers;
Figure 5 shows the undesirable effect of correlated signals arriving at different time instants and the resultant comb filter frequency response;
Figure 6 is a schematic diagram showing the concept of desired gain and expected gain and how compensation may be applied in a renderer embodying the invention;
Figure 7 shows the arrangement of figure 6 with the addition of parameter interpretation for gains and delays;
Figure 8 shows a simplified DSP structure of a binaural renderer;
Figure 9 shows an example DSP structure showing a simplified implementation; and
Figure 10 is a diagram of a general purpose device embodying the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The invention may be embodied in a method, system or device for processing object-based audio signals to produce binaural signal outputs and in a computer program arranged to provide such processing.
The background terminology relating to prior arrangements will first be described before presenting an arrangement of signal paths embodying the inventive concepts. The general field of the invention relates to binaural signal processing.
In this context, binaural audio is an audio signal with a left and right component for presenting to a user via headphones or other device with an output respectively for each ear, but including content that provides an experience to the listener of spatial audio. The spatial nature of the signal is achieved by processing the signal to introduce various effects such as inter-aural time and level difference, reflections and filtering as experienced by a listener in the real world due to acoustic effects of their ears, head and body and the surrounding environment to different components of the audio signal.
Object-based audio involves providing audio signals as separate streams, with the stream for each object comprising the audio data itself and accompanying metadata for that object. The object may represent an individual sound source such as a narrator, a more general sound source such as background sound, or another component of audio that may be represented as an object.
A particular way of representing audio in an object-oriented way uses a standard audio definition model, ADM, for representing audio and meta data. ADM is the preferred object-oriented standard for the arrangements described herein. The Audio Definition Model is an ITU specification of metadata that can be used to describe object-based audio, scene-based audio and channel-based audio. It can be included in BWF files or used as a streaming format in production environments.
ADM is known to the skilled person, but in brief the ADM is a general description model based on XML (but could be extended to other languages). One of its first applications is an extension to the Broadcast Wave Format (BWF) file which includes an 'AXML' chunk to allow XML metadata to be carried. So the ADM is an XML schema, which means the audio is described in XML which can be attached to the BWF file. As the ADM was originally an EBU development it has been added to the EBUCore metadata schema also.
The ADM is designed to describe the audio format thoroughly; it is not intended to give instructions on how the audio is rendered. For example, a stereo file contains two channels for speakers positioned at -30 and +30 degrees azimuth. The ADM will describe this clearly; what it will not do is tell you what to do if you have a standard stereo speaker arrangement, a wavefield synthesis speaker array or binaural headphones. It is down to the renderer to decide what is best given the description of the audio it receives and the target playout format. However, the metadata should provide enough information for any sort of renderer to achieve its requirements.
The ADM consists of the following elements:
audioTrackFormat - The format of the single track of data in the file
audioStreamFormat - The format of a combination of tracks that need to be combined to decode an audio signal
audioChannelFormat - The format of a single channel of audio
audioBlockFormat - A subdivision in time of audioChannelFormat, allowing dynamic properties
audioPackFormat - A group of channels that belong together (e.g. stereo pair)
audioObject - A group of actual tracks with a given format
audioContent - Information about the audio within an object
audioProgramme - Information about all the content that forms a common programme
audioTrackUID - Identification of individual tracks in an essence
The elements with the Format suffix describe the format of the tracks, streams, channels, blocks and packs in general, but don't describe the audio signal itself.
Therefore these definitions can be reused if necessary. For example, a file containing 5 stereo pairs (i.e. 10 tracks) will only need two audioChannelFormats ('Left' and 'Right') and one audioPackFormat ('Stereo') to be defined. The other elements cover the description of the actual audio content, so there would be 5 audioObjects and 5 audioContent elements. So to summarise, the ADM is designed to allow any format of audio to be fully specified so it can be processed or rendered correctly.
A renderer is the functional device that takes audio signals and associated metadata and provides an appropriate output for assertion to a listener via an output device. In the present context, a renderer is used for generating a headphone signal based on audio and ADM metadata input.
The arrangements described herein provide a renderer for rendering binaural spatial audio for headphones from ADM defined content input. The arrangements provide rendering that can support the ADM standard and also provides an improved approach to object-oriented binaural rendering that reduces problems of colouring the audio signals.
The arrangement embodying the invention comprises two main parts, a signal processing part and a control part which calculates parameters for the signal processing given the metadata inputs. These are implemented as sets of components with multiple audio and parameter inputs and outputs, connected together in a directed acyclic graph structure. All components process data once per frame, which consists of a fixed number of audio samples defined at initialisation time. Components run in order, so the inputs for each component are the outputs of other components within the same frame. This structure therefore adds no latency to the system beyond the latencies added by individual components, and the inherent latency of processing audio in blocks of more than one sample.
The system embodying the invention is described in relation to an arrangement using binaural room impulse response (BRIR) filters which include room acoustic effects but more generally, may use a more generalised binaural impulse response (BIR) filter (not specific to a room). Head-related impulse responses (HRIRs) (a.k.a. HRTFs in the frequency-domain), may also be used and are considered as a form of BIR filter. The term BIR filter will therefore be used as a general term covering all such filters.
Figure 1 is a schematic view of the signal processing paths of an arrangement embodying the invention. The signal processing paths may be implemented as a digital signal processor (DSP) as shown in figure 2. The arrangement comprises logical units arranged as signal processing paths with each signal processing path being repeated to provide multiple instances of each path each instance being for a given purpose. The logical units may be provided by instances of processes within a renderer implemented in software or alternatively could be provided as dedicated hardware or firmware.
The number of instances of each logical unit shown in figure 1 may be understood by considering the groupings of these units shown as "for each object source", "for each ear", "for each direct speaker source", and "for each HOA source channel". Where an input is provided to a logical unit that enters one of these groupings, the signal is provided in as many instances as there are values for that grouping.
Accordingly, a logical unit that is present in the "for each object source" grouping and also present in the "for each ear" grouping is provided in a number of instances equal to the number of objects multiplied by the number of ears (the number of ears is 2). As an example, a delay logical unit 12 in the object path 1 that is also in the ear path 7 has an instance for each object and for each ear.
We introduce here the concept of a virtual loudspeaker as used in this implementation. A virtual loudspeaker path is provided for each of multiple virtual loudspeakers. The outputs of these paths are convolved with BRIRs and then summed to provide the output signals. By taking this approach, existing techniques for rendering to real loudspeakers may be used and then supplemented with the final BRIR filter. This assists in providing a similar experience whether the user is listening through physical loudspeakers or the signal is processed to provide a binaural signal output. As with other logical units, those that are within each virtual loudspeaker path are present in as many instances as there are virtual loudspeakers. As an example, the gain logical unit 14 is present in the object grouping, ear grouping and virtual loudspeaker grouping and so exists in a number of instances equal to the number of objects multiplied by the number of ears multiplied by the number of virtual loudspeakers.
Within the arrangement of figure 1, higher-order ambisonics represent a sound field, delays are given with a subsample accuracy and delays and gains may be time varying for example to provide panning effects.
The collection of logical units for each object source 1 comprises a gain function 10 arranged to provide gain in a diffuse path of the object source path 1. The object source path also comprises a direct path with a delay function 12 and a gain function 14. The gain functions will be described later in relation to the gain improvement disclosed herein. The delay function 12 provides an appropriate delay for each object source relative to each ear output and so there are two instances of the delay function (one for each ear) for each object source.
The direct speaker source path 3 has the same arrangement as the direct path within the object source path 1 and is shown separated as meta data may identify an object as being a direct speaker source and accordingly the processing to account for direct path and diffuse path objects can be bypassed and the direct speaker source treated directly as a direct path object.
The direct speaker source path and higher order ambisonics path are optional in the preferred implementation. The main logical units in this embodiment are those within the object source path and in particular the use of an additional delay 12 in a direct path, in contrast to no such pre-processing delay in the diffuse path, as will now be described.
The collection of logical units for each direct speaker source comprises a delay function 16 and gain function 18. As with the object source path, there is an instance of each function for each direct speaker source and for each ear and so two instances of each function for each direct speaker source.
It is noted for future reference that both the object source path 1 and direct speaker source path 3 include a delay function.
The higher order ambisonics path 5 comprises a gain function 20 and a decoding filter bank 22. It is noted that there is not a delay function for each ear in the higher order ambisonics path. As previously noted, the separate higher order ambisonics path is optional and higher order ambisonics alternatively may be treated with a matrix of gains and then fed through the normal channel-based path.
The above arrangement allows each individual type of object, whether an object source, direct speaker source or higher order ambisonics source, to be treated appropriately, in particular to provide appropriate delay and gain before combining. The output from the direct path of the object source path 1 is combined with the output of the diffuse path of the object source path 1 via a decorrelation filter bank and delay 26 in the diffuse path, then combined with the output of the direct speaker source path 3 and convolved with BRIRs in 28.
The convolve BRIR function 28 convolves the outputs with BRIRs and provides the chosen effects to provide the acoustic response to the user as if the sound were created in a room.
The difference in processing in the diffuse path and direct path will now be explained in more detail. The diffuse path has a fixed delay 26 at an output of that path for each ear. The delay is the same for all objects - the inter-aural time difference (ITD) for each virtual loudspeaker is the same for all objects in the diffuse path. In contrast, the delay 12 in the direct path is a variable that is determined from object parameters so as to provide the appropriate separation in time. We can thus see that the dynamic delay of the direct path is a delay to provide time separation between the ears, whereas the diffuse path has an output with delays 26 that are the same for all objects (but not the same for both ears).
The perception in space depends upon the time difference between signals presented at the ears which is why it is important for the time delay to be correct.
In the absence of the additional pre-processing delay 12 as in prior arrangements, both direct and diffuse sources would be treated with similar delays in the BRIR filter. Further, different virtual loudspeakers would have different ITDs in the direct path. We have appreciated that these issues can cause a comb like filter and provide coloration of sounds.
The delays and gains described could go in either order. However, it is more efficient to add delays before gains, because of the duplication that occurs (either 2 delays followed by 2*2*24 gains, or 2*24 gains followed by 2*2*24 delays, and the delays are more expensive). It would even be possible (and reasonable) to do these in one step.
To consider why a per-object pre-processing delay is introduced in the direct path, consider the difference between the delay 26 and the delay 12. The delay 12 is for each object for each ear and the same delay is provided before objects are then split into virtual loudspeaker paths, and so the same delay is provided for each virtual loudspeaker. The delay 26, in contrast, is outside the object group, meaning that it is applied after object signals are mixed together and is provided per loudspeaker per ear. In other words, the output from the gain function 10 in the diffuse path leaves the object source grouping and so combines all of the objects together for each virtual loudspeaker prior to delays being provided for each ear.
The separation of the diffuse path and direct path within the object source path 1 allows a simple implementation by using the ADM diffuseness value (0 to 1) to define the weighting between these two subparts of the object source path.
The diffuse path allows a sound to be treated as one large source. A large source will map to more virtual loudspeakers, but if there is no variation in the sound then it can sound odd to the listener. As an example, wind in trees is a sound that would surround the listener, but some randomness is needed in each of the virtual loudspeaker channels. The decorrelator 30 is provided as a separate decorrelator instance for each virtual loudspeaker to provide this slight randomness.
HOA rendering has a path 5 distinguished from other paths and is performed with a decoder matrix 22 separate from the Objects and DirectSpeakers paths. The input audio first passes through a gain matrix 20 which routes the input channels to the corresponding bus channel and performs rotation for head-tracking. This passes through the decode matrix 22, is delayed to match the Objects and DirectSpeakers paths by delay 24, and is then mixed into the output.
The reason for providing different delays for the diffuse path and direct path will now be further described in relation to figures 4 and 5.
Fig 4 shows what happens when you have two virtual loudspeakers, and an object which is panned between them. When panning between loudspeakers, the amplitude of the sound of that object from each loudspeaker needs to be varied.
The amplitude of the signal for the loudspeaker towards which the sound should be moving should be increased and the amplitude from the other loudspeaker decreased.
The graphs are time vs. amplitude, and just show the position and amplitude of the direct part of the impulse response, i.e. the sound that has travelled directly from the loudspeaker to the ear.
The left-hand side of figure 4 shows the amplitude and time delay of desired signals for left and right ear of two virtual loudspeakers. Virtual loudspeaker 1 (L1) is in front of the listener, so both the left and right ears have the same delay. Virtual loudspeaker 2 (L2) is to the left of the listener, so the sound arrives at the left ear first, and is louder at the left ear. This can be seen by the lower amplitude of the signal arriving at the right ear and the delay of the signal arriving at the right ear.
The right-hand side of figure 4 shows three possible results of panning the sound to be presented between virtual loudspeaker 1 and virtual loudspeaker 2 for binaural rendering, having left and right signals depending upon the technique used.
Figure 4 example A shows the ideal result of mixing signals to produce binaural outputs when rendering a point source. The delay and gain are somewhere between those of L1 and L2, as if the sound had come from a loudspeaker between L1 and L2, where we want the sound to be. We can see that the relative delay between left and right signals is smaller than that for virtual loudspeaker 2. We can also see that the amplitude of the signal for the right ear is slightly higher than that of the corresponding signal for virtual loudspeaker 2.
Figure 4 example B shows what happens in virtual loudspeaker rendering to a binaural output if the delay is not extracted from the impulse responses. This shows the same point source processed by a BRIR without prior separation of the delay. The impulse responses from the two virtual loudspeakers are just mixed together, and this causes comb filtering (resulting in coloration), and inaccurate/blurry positioning, because the inter-aural time difference (ITD) is ambiguous.
Figure 4 example C shows the effect of adding a per-virtual-loudspeaker decorrelation filter. This changes the sound through each virtual loudspeaker slightly, in a way which gives the impression of the source arriving from multiple directions, rather than just having an ambiguous direction. The decorrelation filters are per virtual loudspeaker per ear, but the same for each ear. As shown in the figures above, the decorrelation filters are applied before the binaural filter. Although hard to see on this scale, first small delays are applied in the same way to the left and right signals derived from virtual loudspeaker 1 and second small delays are applied in the same way to the left and right signals derived from virtual loudspeaker 2.
Relating this to the renderer, example A is what we get in the direct path according to an embodiment of the invention, because the delay is removed from the BRIRs and added back on a per-ear basis by delay 12. Example C is what we get in the diffuse path, according to an embodiment of the invention, because the delays which were removed from the BRIRs are added back in delay 26. We therefore have appropriate behaviour for both direct and diffuse object sources.
We can thus see two differences in comparison to arrangements that do not properly take account of delays. First, a delay is removed from the BRIR filter and instead provided as a pre-processing step by delay function 12 per object for each ear. The processing path for diffuse objects does not include such a pre-processing delay function that varies by object. Instead, a delay is introduced that is for each ear and for each virtual loudspeaker, but not for each object. In this way, the delays are applied differently for direct objects and diffuse objects, with the result as shown in relation to figure 4 described above.
The undesired effect of comb filtering will now be described in relation to figure 5.
The time difference shown is of the order of magnitude found in inter-aural time differences. The ripple (or comb) effect in the frequency magnitude response is shown on the right-hand side. This example uses pure delays, rather than HRIRs/BRIRs, as it is easier to visualise the effect. The reason for the comb effect is constructive and destructive interference between signals that are correlated. Delays can introduce this artefact.
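As a rough numerical illustration of this comb effect (not taken from the patent figures), the following sketch mixes a unit impulse with a copy delayed by an ITD-scale amount and prints the magnitude response at a few frequencies; the 0.6 ms delay and 48 kHz sample rate are assumptions chosen purely for illustration.

```python
import numpy as np

# Sketch of the comb filtering described above: a signal mixed with a
# correlated copy delayed by an ITD-scale amount.  Values are illustrative.
fs = 48000
delay_samples = int(round(0.0006 * fs))   # ~0.6 ms, inter-aural scale

# Impulse response of "direct + delayed copy"
h = np.zeros(256)
h[0] = 1.0
h[delay_samples] = 1.0

# Magnitude response; notches appear at odd multiples of fs / (2 * delay)
freqs = np.fft.rfftfreq(4096, d=1.0 / fs)
mag = np.abs(np.fft.rfft(h, n=4096))

first_notch = fs / (2.0 * delay_samples)
for f in (0.0, first_notch, 2 * first_notch, 3 * first_notch):
    idx = np.argmin(np.abs(freqs - f))
    print(f"{freqs[idx]:8.0f} Hz -> {mag[idx]:.3f}")   # alternates near 2 and near 0
```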
Figure 2 shows the control arrangement which will now be described prior to describing gain compensation. The control architecture receives ADM metadata and listener updates, and produces the various gain and delay values which control the behaviour of the signal path described above. The BRIR selection module receives listener related data including a listener orientation. The listener orientation is used to select the closest set of BRIRs available. This results in a BRIR index, and a 'residual' orientation, the orientation relative to the orientation of the head in the chosen BRIR set.
A Get Static Delays module receives the BRIR index. The delays used by the Objects diffuse path and the DirectSpeakers path are looked up based on the chosen BRIR set.
An Objects Gain Calc module receives ADM Objects metadata, which is modified to account for the listener orientation, and libear is used to calculate the corresponding direct and diffuse gains. libear is an open-source software library (https://github.com/ebu/libear) and is an implementation of Recommendation ITU-R BS.2127; below, this is simply referred to as the EAR.
A Calc Direct Delays module calculates delays. The direct and diffuse gains are used to calculate the per-ear delays in the direct path, based on a weighted average of the delays corresponding to the virtual loudspeakers activated by the gains.
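A minimal sketch of this weighted-average delay calculation, broadly consistent with the approach set out in claims 17 to 19 below, but with illustrative names, an assumed threshold and made-up delay values:

```python
import numpy as np

def per_ear_direct_delay(direct_gains, diffuse_gains, speaker_delays_for_ear,
                         frontal_index=0, threshold=1e-6):
    """Weighted average of per-virtual-loudspeaker delays for one ear.

    direct_gains, diffuse_gains: per-virtual-loudspeaker gains for the object.
    speaker_delays_for_ear: delay (in samples) associated with each virtual
        loudspeaker's BRIR at this ear, i.e. the delays removed from the filters.
    The direct and diffuse gains are summed, the delays are weighted by those
    gains, and the frontal loudspeaker's delay is used as a fallback when the
    gains are too small to form a meaningful average (assumed behaviour).
    """
    gains = np.asarray(direct_gains) + np.asarray(diffuse_gains)
    total = gains.sum()
    if total < threshold:
        return float(speaker_delays_for_ear[frontal_index])
    return float(np.dot(gains, speaker_delays_for_ear) / total)

# Example: three virtual loudspeakers, object panned between the first two.
delays_left_ear = np.array([10.0, 14.0, 12.0])     # illustrative values
print(per_ear_direct_delay([0.7, 0.3, 0.0], [0.0, 0.0, 0.0], delays_left_ear))
```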
A Gain Compensation module compensates gains. The direct and diffuse gains, per-ear delay and chosen BRIR set are used to modify the direct and diffuse gains in order to compensate for changes in overall gain caused by varying correlation between different pairs of BRIRs.
A DirectSpeakers Gain Calc module modifies ADM metadata according to the listener orientation, then uses libear to calculate loudspeaker gains.
A Calc Delays module performs the same function as the Objects Calc Direct Delays component, but with only direct gains.
A Compensation module performs the same function as the Objects Gain Compensation component, but with only direct gains and no per-ear delay.
Overall, the control architecture shown in figure 2 provides functional units as described above that receives meta data, preferably in ADM format, and provides control signals as shown to a digital signal processor, DSP, which also receives the audio streams for objects, direct speaker and higher order ambisonics. The DSP receives the control signals and audio signals and produces the audio output to be asserted to the user via an output device such as headphones, earbuds and so on.
Figure 3 shows the logical structure of a gain calculator. OTM refers to ObjectsTypeMetadata, which primarily contains data from a single audioBlockFormat.
The EAR Objects panner is used to calculate gains for the virtual loudspeakers, using the loudspeaker layout described below. Gains are calculated once per sample period; these are interpolated by the gain matrices over the length of the sample period, so that the calculated gains are used for the last sample in the period. The input to the gain calculator is a sequence of ObjectsTypeMetadata objects. Given the time of the last sample in the period, two items in this sequence (shown as OTM A and B) and a mixing factor p are identified, such that the output gains will match the behaviour specified in section 7.2 of BS.2127. For example, if the input sequence contains 3 ObjectsTypeMetadata objects [b1, b2, b3], and the end of the sample period is half way through b2, then A = b2, B = b3 and p = 0.5.
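The selection and mixing step can be pictured with a short sketch. This is illustrative only: the block timing model, the ObjectsTypeMetadata fields and the edge-case handling are assumptions rather than the renderer's actual interface, and the (1 - p) : p mixing ratio it applies is the one described in the next paragraph.

```python
from dataclasses import dataclass

@dataclass
class ObjectsTypeMetadata:
    start: float      # block start time in seconds (assumed field)
    duration: float   # block duration in seconds (assumed field)
    gains: list       # per-virtual-loudspeaker gains calculated for this block

def select_and_mix(blocks, t_last_sample):
    """Pick OTM A and B and mixing factor p for the last sample in the period,
    then mix their gains with the ratio (1 - p) : p, mirroring the example above."""
    for i, block in enumerate(blocks):
        if block.start <= t_last_sample < block.start + block.duration:
            a = block
            # Assumption: the final block is simply held if there is no next one.
            b = blocks[i + 1] if i + 1 < len(blocks) else block
            p = (t_last_sample - a.start) / a.duration
            break
    else:  # before the first block or after the last: hold the nearest block
        a = b = blocks[0] if t_last_sample < blocks[0].start else blocks[-1]
        p = 0.0
    return [(1.0 - p) * ga + p * gb for ga, gb in zip(a.gains, b.gains)]

# Example matching the text: half way through b2 gives an equal mix of b2 and b3.
b1 = ObjectsTypeMetadata(0.0, 1.0, [1.0, 0.0])
b2 = ObjectsTypeMetadata(1.0, 1.0, [0.5, 0.5])
b3 = ObjectsTypeMetadata(2.0, 1.0, [0.0, 1.0])
print(select_and_mix([b1, b2, b3], 1.5))   # -> [0.25, 0.75]
```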
These two ObjectsTypeMetadata objects are modified to account for the listener's head orientation and position, and gains are calculated for each. Finally the two sets of gains are mixed together with the ratio (1 - p) : p.
The gain compensation concept addresses issues with prior arrangements in trying to ensure that a constant loudness is presented as an object moves in space relative to the listener (either by the listener rotating their head or by the object moving in the virtual sound space). With a physical loudspeaker system, control is achieved by summing speaker energies. However, with headphone arrangements and binaural audio, complex filters are used as described for acoustic transfer and involve delays and summing. As noted above, some problems are avoided by separating the delays before the filters as described. However, we have appreciated that there are still resulting problems with a drop in sound energy as objects move in space. The present solution provides an efficient compensation for such changes in energy. The compensation addresses problems caused by changes in size and location of a sound field introduced by filters and delays.
The gain compensation concept may be applied in the gain functions 10 and 14 previously described and will now be described in relation to figures 6 to 9.
As the object parameters vary, the rendering may exhibit undesired loudness changes. These loudness changes occur because of complex interactions between signal processing blocks of the renderer, i.e., the different filters in the system (the BRIRs for each virtual loudspeaker, the decorrelation filters, and the variable fractional delay filters) and the gains that mix them in differing ratios. The interactions between the filters lead to frequency-dependent gain changes, which may change the perceived loudness, as well as timbral colouration. The present gain compensation approach aims to efficiently correct for changes in perceived loudness.
A simple way to determine the expected gain is to find the impulse response of the DSP chain with a given set of gains and delays, and take the ℓ2 norm. Unfortunately the computational complexity of this solution is high, especially for long BRIRs and lots of virtual loudspeakers; the present arrangement improves upon this.
In broad terms, the present arrangement computes a desired gain and an expected gain. The expected gain is derived taking into account interaction between filters. The desired gain is derived not taking into account interaction between filters. Accordingly, a ratio between the two may be used as a scaling factor so as to compensate for the interaction between filters to remove the unwanted gain effects caused by the interactions.
Figure 6 shows a structure which could be used to compensate for gains introduced by a renderer. The core renderer component applies some parameter-dependent processing to the input audio, and its gain is corrected by calculating the desired and expected gain for the current output, and using those to control a gain at the output of the renderer. The desired gain is the gain which we would like the renderer to have, while the expected gain is the gain that the renderer would have if gain compensation was not implemented.
The technique will be described in relation to binaural rendering, though it would be possible to adapt it to other renderers which use virtual-loudspeaker rendering, in which input audio signals are rendered by first rendering them to a virtual loudspeaker layout, and then applying BRIRs (Binaural Room Impulse Responses) to turn those into binaural signals.
Internally, the renderer can be separated into two components: calculating some internal parameters (per-virtual-loudspeaker gains and per-ear delays) given the external parameters, and actually applying those parameters to some audio using some DSP; we can therefore rearrange this to look like Figure 7 in order to have the compensation components operate on the simpler internal parameters.
The DSP architecture of the renderer is shown in Figure 8. It consists primarily of convolution with BRIRs for virtual loudspeakers, which are fed through two paths.
In the direct path, the input audio passes through a per-ear delay, then a per-ear and per-virtual-loudspeaker gain. In the diffuse path, the audio passes through a per-virtual-loudspeaker gain and a bank of decorrelation filters.
This architecture causes loudness changes primarily because of complex interactions between the different impulse responses in the system which are mixed together by the gains (i.e. the responses for each loudspeaker, the decorrelation filters, and the variable delays); the correlation coefficient between pairs of these impulse responses can vary significantly, so simple normalisation of the gains used is not enough to ensure constant loudness. The same processes cause other problematic effects too (comb filtering, bass boost), but fixing the overall gain is nevertheless helpful.
To illustrate the idea, we'll consider a simple process shown in Figure 9, which consists of two variable gains followed by two fixed gains. This is analogous to that of Figure 8, but eliminates the diffuse path and the delays, has only two virtual loudspeakers, and BRIRs with only a single sample; we'll deal with those complexities in the next section. The gain of this process is trivially $g_e = ax + by$, but once we consider impulse responses with more than one sample we'll want to take the ℓ2 norm of the samples in the overall impulse response, so let's start with:

$$g_e = \sqrt{(ax + by)^2} \qquad \text{(Equation 1)}$$

this can be expanded and rearranged to:

$$g_e^2 = a^2 x^2 + 2abxy + b^2 y^2 \qquad \text{(Equation 2)}$$

This can be written in matrix notation; if we have a column vector of variable gains g and a matrix of factors F like:

$$g = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad F = \begin{bmatrix} x^2 & xy \\ xy & y^2 \end{bmatrix} \qquad \text{(Equation 3)}$$

then

$$g_e^2 = g^T F g \qquad \text{(Equation 4)}$$

This can be expanded from 2 to n gains, usually corresponding to n virtual loudspeakers. If the fixed gains are in the column vector h then

$$F_{i,j} = h_i h_j \qquad \text{(Equation 5)}$$

where i and j are indexes into the gain vector. With this technique, if we have a pre-computed factor matrix F then we can compute the overall gain with $n^2 + n$ multiplications. Next, we'll explain how to expand this to work with the real DSP chain.
Equation 4 above is the expected gain, which takes into account the interaction between filters.
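A minimal numerical check of this single-sample case, using illustrative gain values; it simply confirms that the g^T F g form gives the same value as the direct calculation ax + by:

```python
import numpy as np

# Two variable gains (a, b) mixed through two fixed gains (x, y), single sample.
a, b = 0.6, 0.8          # variable (per-virtual-loudspeaker) gains, illustrative
x, y = 0.9, 0.5          # fixed gains standing in for single-sample BRIRs

g = np.array([a, b])
h = np.array([x, y])

# Factor matrix F[i, j] = h[i] * h[j]  (Equation 5 as reconstructed above)
F = np.outer(h, h)

expected_direct = abs(a * x + b * y)     # straightforward calculation
expected_matrix = np.sqrt(g @ F @ g)     # g^T F g form (Equation 4)
print(expected_direct, expected_matrix)  # both ~0.94
```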
We will now describe a more complete implementation. The above was a simplified explanation for a single sample. Real BRIRs have more than a single sample; if we replace each fixed gain $h_i$ with an FIR filter whose sample s is $H_{i,s}$ then:

$$g_e^2 = \sum_s \left( \sum_i g_i H_{i,s} \right)^2 \qquad \text{(Equation 6)}$$

We can adapt the above technique by just modifying the factor matrix:

$$F_{i,j} = \sum_s H_{i,s} H_{j,s} \qquad \text{(Equation 9)}$$

Multiple output channels (2 in the case of binaural) can be handled identically; if the impulse response for channel c and virtual loudspeaker i is stored in $H_{i,c}$, then:

$$F_{i,j} = \sum_c \sum_s H_{i,c,s} H_{j,c,s} \qquad \text{(Equation 10)}$$

This is where the efficiency gain comes from: for 2 output channels and l samples the unoptimised technique would result in 4ln multiplies, so this optimisation is beneficial if 4l > n, which is typical.
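A sketch of the multi-sample, two-channel version, with random filters standing in for BRIRs and illustrative sizes; it checks that the factor-matrix form g^T F g matches the brute-force ℓ2 norm of the mixed impulse response:

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_channels, n_samples = 4, 2, 64                # illustrative sizes
H = rng.normal(size=(n_speakers, n_channels, n_samples))    # stand-in "BRIRs"
g = rng.uniform(size=n_speakers)                            # per-speaker gains

# Factor matrix: F[i, j] = sum over channels and samples of H_i * H_j
F = np.einsum('ics,jcs->ij', H, H)

# Expected gain via the pre-computed factor matrix (roughly n^2 + n multiplies)
g_matrix = np.sqrt(g @ F @ g)

# Brute force: build the mixed impulse response per channel, then take its l2 norm
mixed = np.einsum('i,ics->cs', g, H)
g_brute = np.sqrt(np.sum(mixed ** 2))

print(g_matrix, g_brute)    # the two values agree
```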
Figure 8 shows the output of the decorrelation filter bank (diffuse path) being mixed with the direct path before being convolved with the BRIRs. This is equivalent to passing the direct and diffuse path outputs through separate convolution processes and summing the outputs, with the direct path being convolved with the unmodified BRIRs, and the diffuse path being convolved with the convolution of the decorrelation filters and BRIRs.
To apply this technique we can use the same equations, but with the gains g replaced by the concatenation of the direct and diffuse gains, and the BRIRs $H_i$ replaced by the concatenation of the unmodified BRIRs and the convolution of the unmodified BRIRs with the decorrelation filters.
The delays in the direct path act to delay the signal for one ear relative to the other; this means that although there are two delays, there is only really a single parameter: the inter-aural time difference (ITD). To use this technique we can precalculate multiple factor matrices for various ITDs, and either select the factor matrix with the closest ITD to the real ITD, or interpolate between the sample points to ensure smooth gain changes.
In practice the addition of these delays adds a factor of 2 or 3 to the number of operations required to compute the expected gain, and multiplies the size of the factors matrix by the number of ITD samples. Note that in the real design there is a set of static delays between the decorrelation filters and the BRIRs in the diffuse path; these delays can be handled by merging them into the decorrelation filters before computing the factor matrix.
To implement head-tracking, the BRIR set may be switched at run-time. As with the delays, a separate factor matrix can be computed for each BRIR set used. We will now describe the way to compute the desired gain. Although we could compensate the renderer to achieve unity gain, it may be desirable to match the gain of the renderer to the gain of the BRIRs, by calculating a desired gain value and compensating to match that. This means that the gain compensation will only correct for errors introduced by the un-desirable interactions between BRIRs rather than measured gains in the BRIRs themselves, making the gain metric used less critical.
This leads to two requirements for the desired gain:
* When rendering a source at a position that corresponds with a virtual loudspeaker, the desired gain should match that of the BRIRs for that loudspeaker.
* The desired gain should vary smoothly with respect to the virtual loudspeaker gains (and thus the source position).
A value meeting these requirements could be computed using a model based on the source position, but we can use the factor matrix and gains for this too.
The desired gain calculation is a slight modification of the expected gain, where the contributions of the individual impulse responses are summed up separately, without considering their interactions:

$$g_d^2 = \sum_i g_i^2 \sum_c \sum_s H_{i,c,s}^2 \qquad \text{(Equation 11)}$$

Given the definition of F, this is equivalent to:

$$g_d^2 = \sum_i g_i^2 F_{i,i} \qquad \text{(Equation 12)}$$

Equation 12 above describes the desired gain, which does not contain any interaction between filters. In essence, it uses only the diagonal of the factor matrix, and the diagonal does not contain the interaction between the filters.
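A small sketch showing, with the same kind of stand-in filters as before, that summing each filter's contribution separately (Equation 11) gives the same value as using only the diagonal of the factor matrix (Equation 12):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 2, 64))                 # stand-in BRIRs, illustrative
g = rng.uniform(size=4)
F = np.einsum('ics,jcs->ij', H, H)

# Equation 11 as reconstructed: sum each filter's contribution separately ...
desired_direct = np.sqrt(np.sum(g ** 2 * np.sum(H ** 2, axis=(1, 2))))
# ... which is the same as using only the diagonal of F (Equation 12).
desired_diag = np.sqrt(np.sum(g ** 2 * np.diag(F)))
print(desired_direct, desired_diag)              # identical
```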
The above tricks for handling decorrelation filters and multiple BRIR sets apply here too, though delays do not need to be considered, because the ITD does not affect the diagonal of the factor matrix; we can use any of the ITD samples, and do not need to interpolate.
We will now describe some further enhancements that may be optionally used to obtain better gain metrics. The ℓ2 norm used to assess the gain is simple, though effective. It may be improved by applying a filter to the generated impulse response in order to weight the effect at some frequencies more highly than others. This can be achieved by applying the desired filter to the impulse responses before computing the factor matrices.
For efficiency this technique requires the factor matrices (F) to be pre-computed. These can grow quite large, which may make them inconvenient to store (both on disk and in memory, depending on the application), and make calculations using them slower than they could be due to limited cache sizes. Apart from reducing the various dimensions (for example by reducing the number of virtual loudspeakers, BRIR sets or ITD samples), there are ways to compress the factor matrices without changing the functionality:
* Individual factor matrices are symmetric, so could be stored as triangular matrices, approximately halving the size.
* The ITD only affects the relationship between the direct and diffuse paths, so changing the ITD only changes the upper-right and lower-left quarters of the factor matrices; this means that only these parts need to be stored for each ITD sample, and the rest of the matrix can be stored once. This again approximately halves the size.
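A small sketch of the first compression idea, packing a symmetric factor matrix into its upper triangle and recovering it; this is illustrative only, and the quarter-wise storage for different ITDs would be handled analogously:

```python
import numpy as np

def pack_symmetric(F):
    """Keep only the upper triangle (including the diagonal) of a symmetric matrix."""
    return F[np.triu_indices(F.shape[0])]

def unpack_symmetric(packed, n):
    """Rebuild the full symmetric matrix from its packed upper triangle."""
    F = np.zeros((n, n))
    F[np.triu_indices(n)] = packed
    return F + F.T - np.diag(np.diag(F))

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 2, 32))
F = np.einsum('ics,jcs->ij', H, H)                    # symmetric by construction
packed = pack_symmetric(F)
print(packed.size, F.size)                            # 15 vs 25 entries
print(np.allclose(unpack_symmetric(packed, 5), F))    # True
```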
Instead of measuring the overall gain, it is possible to use this technique to assess the frequency response of the renderer with given parameters by running it multiple times with different factor matrices, where each factor matrix has been computed with BRIRs that have been convolved with a different filter, selecting the different frequency bands to measure the response in. This could be particularly useful to counteract the low-frequency boost which occurs when there are many non-zero gains in the diffuse path.
In summary, the technique described above involves the following features:
* Off-line, we can calculate a 'factor matrix' F for a BRIR set, where F_(i,j) is the contribution of the interaction between BRIR filters i and j.
* To calculate the expected gain of a gain vector g, we use Equation 4 above.
* To calculate the desired gain, we can use the diagonal of the factor matrix, as in Equation 12 above.
This only deals with gains applied to BRIRs. To handle other parameters:
* Delays are handled by calculating different factor matrices for a range of delays (actually the difference in delays, since increasing the delay in one ear always corresponds to decreasing the delay in the other), and then interpolating the results. So for example if we have a delay of 10.25 samples, we use the factor matrices calculated for 10 samples to calculate g_a and 11 samples to calculate g_b, then use g = 0.75 x g_a + 0.25 x g_b.
* The diffuse path is handled by concatenating the gains for the direct and diffuse paths, and convolving the BRIRs with the decorrelation filters before calculating the factor matrix.
* Multiple BRIR sets are handled by just calculating different sets of factor matrices for each BRIR set.
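A sketch of the delay handling just summarised: the factor matrices below are arbitrary placeholders indexed by integer ITD, and the gains computed from the two nearest matrices are linearly interpolated exactly as in the 10.25-sample example above:

```python
import numpy as np

def expected_gain_with_itd(g, factor_matrices, itd_samples):
    """factor_matrices: dict mapping an integer ITD (in samples) to a
    pre-computed factor matrix.  A fractional ITD interpolates between the
    gains computed from the two neighbouring matrices."""
    lo = int(np.floor(itd_samples))
    hi = lo + 1
    frac = itd_samples - lo
    g_a = np.sqrt(g @ factor_matrices[lo] @ g)
    g_b = np.sqrt(g @ factor_matrices[hi] @ g)
    return (1.0 - frac) * g_a + frac * g_b

# Placeholder factor matrices for integer delays 10 and 11 (illustrative values).
rng = np.random.default_rng(2)
F10 = rng.normal(size=(3, 3)); F10 = F10 @ F10.T    # symmetric, positive semi-definite
F11 = rng.normal(size=(3, 3)); F11 = F11 @ F11.T
g = np.array([0.5, 0.3, 0.2])

# For an ITD of 10.25 samples this uses 0.75 * g_a + 0.25 * g_b, as in the text.
print(expected_gain_with_itd(g, {10: F10, 11: F11}, 10.25))
```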
A more specific implementation will now be described. The implementation may be summarised as a gain compensation factor. Given the direct and diffuse gains Gd(l) and Gf(l), the BRIR set v and the direct delays D(e), the sets of gains Gd(l)' and Gf(l)' are calculated, based on a compensation gain gc derived from the desired gain gd and expected gain ge, described below.
Equation 13

where

Equation 14

Equation 13 defines the factor that may be used to provide gain compensation prior to the BRIR filters. In these descriptions, Gc(l_i) and Gc(l_j) refer to the concatenated direct and diffuse gains:

Equation 15

The expected gain is a sum over both ears:

Equation 16

The per-ear contribution is:

Equation 17

where p, d_a and d_b are chosen so that d_a and d_b are valid indices into the factor matrix P. P(d, v, l_i, l_j, e) is the gain normalisation factor matrix entry for integer sample delay d-1, view v, loudspeaker indices l_i and l_j, and ear e.
Equation 18

where d is the delay in samples relative to the minimum possible delay:

Equation 19

The desired gain is similar to the expected gain, but does not take the interactions between BRIRs into account:

Equation 20

However implemented, the embodiment relating to gain compensation provides the following key features. Given a set of loudspeaker gains (which may be virtual) for rendering an object, an adjustment gain is calculated which compensates for some loudness changes caused by the process of applying those gains (i.e. using virtual loudspeaker rendering). The adjustment gain is calculated from a desired and an expected gain, both based on the parameters. The desired gain is the gain we would expect to get if the application process were ideal, and the expected gain is what we expect to get given interaction between filters.
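Tying the earlier sketches together, the adjustment gain is simply the ratio of the desired and expected gains, applied here to the virtual-loudspeaker gains; the shapes and values are illustrative assumptions, not the renderer's actual data:

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.normal(size=(4, 2, 64))                    # stand-in BRIR set
F = np.einsum('ics,jcs->ij', H, H)                 # pre-computed factor matrix
g = rng.uniform(size=4)                            # virtual loudspeaker gains

g_expected = np.sqrt(g @ F @ g)                    # with filter interactions
g_desired = np.sqrt(np.sum(g ** 2 * np.diag(F)))   # ignoring interactions
g_comp = g_desired / g_expected                    # adjustment (scaling) factor

compensated_gains = g_comp * g                     # applied to the gain vector
# After compensation the expected gain matches the desired gain:
print(np.sqrt(compensated_gains @ F @ compensated_gains), g_desired)
```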
The preferred steps for implementing this concept may include one or more of applying the adjustment gain to the loudspeaker gains before applying them, applying the adjustment gain to the input or output audio signal, or applying the adjustment gains in some other way (e.g. generating an appropriate filter).
Further preferred steps include calculating more than one adjustment gain for e.g. different frequency bands, different ears, or different BRIR sets, calculating the adjustment gain from loudspeaker gains as well as other parameters (like delays and BRIR set in our case) and using filter samples (BRIR set, HRIR set, decorrelation filters) to calculate the expected and desired gains. Furthermore, calculating the desired and expected gains in a manner equivalent to calculating the effective impulse response of the application process with the given parameters, and measuring the gain (or gains) of the effective impulse response, or calculating the expected gains in a manner equivalent to calculating multiple effective impulse responses (one for each virtual loudspeaker, for example), and summing the gains, assuming there is no interaction.
Some general advantages of the arrangements disclosed will now be described. The arrangement allows objects presented using a metadata model such as ADM to be converted to binaural signals with appropriate room response filters. The arrangement could have a separate object for each component of audio, or treat each sound source within the sound space as a separate object, or group some together. For example, a narrator may be one object and background sounds may be another object.
By defining a standard way of interpreting the Audio Definition Model, ADM, with a defined arrangement, broadcasters can provide audio data using the ADM language and know how the output will be rendered. A common set of algorithms defined as functional units as described allows virtual loudspeakers to be defined, the outputs of which are convolved to provide an output for each ear of a headphone. The advantage of such standardisation is that existing agreed arrangements using ADM may be adopted and, using appropriate binaural room impulse response filtering, binaural output signals may be provided to headphones.
The arrangement may be considered as a hybrid arrangement in which rendering is performed to virtual loudspeakers in an array of loudspeakers prior to rendering to binaural signals. In this way, existing standardised definitions may be used and a consistent experience provided whether the output is provided to separate physical loudspeakers or to a user via headphones.
The operation of the arrangement for an example in which there is an array of 24 virtual loudspeakers is therefore as follows. For each loudspeaker, for each ear, there will be a BRIR function 28 (so 2x24). For a given object to be placed at a certain point in space and with a certain size, the gains of each of the 24 virtual loudspeakers will be varied. The gain function 14 is therefore applied for each object source for each virtual loudspeaker for each ear. In order to avoid defects due to time of arrival differences, time of arrival functionality is separated from the BRIR and provided as a pre-processing delay by delay function 12.
Figure 10 shows a general purpose device that may implement the invention. This may be implemented in a mobile phone, headset, headphones, earbuds, hi-fi system or, more generally, any device comprising a power source, processor, control and digital signal processor, DSP, to provide a pair of outputs, one for each ear of a user. The device may receive a stream of audio and metadata, such as the ADM described, or may store this in a memory within the device, and provide a pair of outputs, one for each ear.

Claims (47)

1. A method of processing audio data to produce a binaural signal output, comprising: -receiving audio data for each of multiple object sources to be provided as a pair of output signals, one for each ear of a listener, the audio data including an input audio signal for each object and metadata parameters; -processing the input audio signal for each of the object sources with an object source processing path that includes a direct path and a diffuse path and wherein a delay function is applied to the input audio signal per object for each ear within the direct path and the diffuse path does not include a delay function per object for each ear; and -processing outputs of the object source processing path by applying binaural filters.
2. The method of claim 1, wherein the delay function operates to provide inter-aural delay between respective output signals for each ear per object.
3. The method of claim 1 or 2, where the delay function is calculated from input metadata parameters.
4. The method of claim 3, where the input metadata parameters are object parameters including one or more of source position, size, shape and diffuseness.
5. The method of claim 4, where the input metadata parameters are ADM parameters.
6. The method of any preceding claim, where the delay function is calculated from input parameters related to a user including one or more of head orientation, head size or other user related parameter.
7. The method of any preceding claim, further comprising a delay at an output of the diffuse path that does not change with object metadata parameters.
8. The method of claim 7, wherein the delay is per-virtual loudspeaker per-ear.
9. The method of any preceding claim, wherein at least part of the object direct and diffuse paths are provided for each of multiple virtual loudspeakers.
10. The method of any preceding claim, wherein the signal for each object has a gain applied per virtual loudspeaker.
11. The method of any preceding claim, wherein the gain is independent in the direct and diffuse path and applied per ear.
12. The method of any preceding claim, wherein a binaural filter is provided per virtual loudspeaker.
13. The method of claim 9, further comprising a decorrelation filter in the diffuse path for each of the multiple virtual loudspeakers.
14. The method of claim 13, wherein the delay function that is applied to the input audio signal per object for each ear within the direct path includes an additional delay to compensate for the delay through the decorrelation filters.
15. The method of any preceding claim, wherein the binaural filters comprise measured impulse response filters with delays removed and the delays of the delay functions are derived from the removed delays.
16. The method of any preceding claim, wherein the delays of the delay functions are calculated as a function of virtual loudspeaker gains.
17. The method of claim 14, wherein the delays of the delay functions are calculated by taking the sum of each virtual loudspeaker gain multiplied by the corresponding delay appropriate for that virtual loudspeaker and ear, and dividing by the sum of the virtual loudspeaker gains.
18. The method of claim 17, wherein the delays of the delay functions are calculated by summing the virtual loudspeaker gains in the direct and diffuse paths.
19. The method of claim 17 or 18, wherein the delays for a frontal virtual loudspeaker are used if the sum of the gains is less than a threshold.
20. The method of claim 14 or 15, wherein the virtual loudspeaker gains are loudspeaker gains from a loudspeaker renderer.
21. The method of any preceding claim, further comprising a gain function for each of the diffuse path and direct path, wherein the gain function is applied after the delay function in the direct path.
22. The method of any preceding claim, wherein the binaural filters are impulse response filters, preferably BRIR or HRTF.
23. The method of any preceding claim, wherein applying the binaural filters comprises convolving the filters with the outputs of the object source processing path.
24. The method of any preceding claim, wherein each binaural filter comprises a filter pair.
25. The method according to any preceding claim, wherein at least a portion of the direct path and diffuse path comprises functions for each of multiple virtual loudspeakers and a virtual loudspeaker corresponds to a point in space.
26. The method according to any preceding claim, wherein each binaural filter defines a spatial configuration between a virtual loudspeaker and listener.
27. A method of processing audio data to produce a binaural signal output, comprising: -receiving audio data for each of multiple object sources to be provided as a pair of output signals, one for each ear of a listener, the audio data including an input audio signal for each object and metadata parameters; -processing the audio data for each of the object sources with gains per virtual loudspeaker path wherein a binaural filter is applied per virtual loudspeaker; -deriving, at least for each object, a desired gain and an expected gain and a scaling factor comprising the desired gain divided by the expected gain; -applying the scaling factor to the signal path such that the output is adjusted by the scaling factor; and -wherein the expected gain is the gain derived taking into account interaction between the binaural filters.
28. The method of claim 27, wherein applying the scaling factor comprises adjusting the virtual loudspeaker gains.
29. The method of claim 27, wherein applying the scaling factor comprises adjusting parameters of filters.
30. The method of claim 27, wherein the desired gain is derived as if there were no interaction between different binaural filters of virtual loudspeakers.
31. The method of any of claims 27 to 30, wherein desired and expected gains are frequency dependent for each object.
32. The method according to claim 31, wherein the gain is measured at different frequencies and applied at each frequency.
33. The method of any of claims 27 to 32, wherein the outputs of the binaural filters are mixed and provide gain as a result of the interaction between filters.
34. The method of any of claims 27 to 33, wherein the scaling factor is applied per virtual loudspeaker.
35. The method of any of claims 27 to 34, wherein the scaling factor is applied per virtual ear.
36. The method of any of claims 27 to 35, wherein the scaling factor is applied per virtual object.
37. The method of any of claims 27 to 36, wherein the expected gain is calculated by: -calculating a factor matrix F, in which F(i, j) encodes the interaction between the paths of virtual loudspeakers i and j; -calculating the sum over all pairs of virtual loudspeakers with indices i and j, of factor matrix entry F(i, j) multiplied with virtual loudspeaker gains i and j, and taking the square root; and optionally the desired gain is calculated by: -taking the sum over each virtual loudspeaker i of factor matrix entry F(i, i) multiplied by the virtual loudspeaker gain i squared, and taking the square root.
38. The method of any of claims 27 to 37, wherein the gains for a direct and diffuse path are treated as separate virtual loudspeakers.
39. The method of any of claims 27 to 38, wherein a different factor matrix is provided for each of several BRIR sets.
40. The method of any of claims 27 to 39, wherein a different factor matrix is provided for a range of possible delay values in the direct path.
41. The method of any of claims 27 to 40, wherein the expected gain is calculated using the factor matrix for two different delays, and interpolated to give an appropriate expected gain for the actual delay used.
42. The method of any of claims 27 to 41, wherein the expected gain is calculated according to equation 4 herein.
43. The method of any of claims 27 to 42, wherein the desired gain is calculated according to equation 12 herein.
44. A device for processing audio data to produce a binaural signal output, comprising: -means for receiving audio data for each of multiple object sources to be provided as a pair of output signals, one for each ear of a listener, the audio data including an input audio signal for each object and metadata parameters; -means for processing the input audio signal for each of the object sources with an object source processing path that includes a direct path and a diffuse path and wherein a delay function is applied to the input audio signal per object for each ear within the direct path and the diffuse path does not include a delay function per object for each ear; and -means for processing outputs of the object source processing path by applying binaural filters.
45. A device for processing audio data to produce a binaural signal output, comprising: -means for receiving audio data for each of multiple object sources to be provided as a pair of output signals, one for each ear of a listener, the audio data including an input audio signal for each object and metadata parameters; -means for processing the audio data for each of the object sources with gains per virtual loudspeaker path wherein a binaural filter is applied per virtual loudspeaker; -means for deriving, at least for each object, a desired gain and an expected gain and a scaling factor comprising the desired gain divided by the expected gain; -means for applying the scaling factor to the signal path such that the output is adjusted by the scaling factor; and -wherein the expected gain is the gain derived taking into account interaction between the binaural filters.
46. A device for processing audio data according to claim 44 or 45, further comprising means for carrying out the method of any of claims 1 to 43.
47. A device according to any of claims 44, 45 or 46, wherein the device is one of a headset, headphones, mobile device or virtual reality headset.
GB2111674.4A 2021-08-13 2021-08-13 Audio rendering Pending GB2609667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2111674.4A GB2609667A (en) 2021-08-13 2021-08-13 Audio rendering


Publications (2)

Publication Number Publication Date
GB202111674D0 GB202111674D0 (en) 2021-09-29
GB2609667A true GB2609667A (en) 2023-02-15

Family

ID=77859940

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2111674.4A Pending GB2609667A (en) 2021-08-13 2021-08-13 Audio rendering

Country Status (1)

Country Link
GB (1) GB2609667A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6078669A (en) * 1997-07-14 2000-06-20 Euphonics, Incorporated Audio spatial localization apparatus and methods
US20080037796A1 (en) * 2006-08-08 2008-02-14 Creative Technology Ltd 3d audio renderer
US20090185693A1 (en) * 2008-01-18 2009-07-23 Microsoft Corporation Multichannel sound rendering via virtualization in a stereo loudspeaker system
EP2216776A2 (en) * 2006-06-02 2010-08-11 Dolby International AB Binaural multi-channel decoder in the context of non-energy-conserving upmix rules
US20110211702A1 (en) * 2008-07-31 2011-09-01 Mundt Harald Signal Generation for Binaural Signals
WO2015066062A1 (en) * 2013-10-31 2015-05-07 Dolby Laboratories Licensing Corporation Binaural rendering for headphones using metadata processing
US20190116451A1 (en) * 2017-10-18 2019-04-18 Dts, Inc. System and method for preconditioning audio signal for 3d audio virtualization using loudspeakers
GB2574946A (en) * 2015-10-08 2019-12-25 Facebook Inc Binaural synthesis
WO2021069794A1 (en) * 2019-10-11 2021-04-15 Nokia Technologies Oy Spatial audio representation and rendering


Also Published As

Publication number Publication date
GB202111674D0 (en) 2021-09-29

Similar Documents

Publication Publication Date Title
US9918179B2 (en) Methods and devices for reproducing surround audio signals
AU2020203222B2 (en) Generating binaural audio in response to multi-channel audio using at least one feedback delay network
US10299056B2 (en) Spatial audio enhancement processing method and apparatus
US8605909B2 (en) Method and device for efficient binaural sound spatialization in the transformed domain
KR101313516B1 (en) Signal generation for binaural signals
US8335331B2 (en) Multichannel sound rendering via virtualization in a stereo loudspeaker system
EP3569000B1 (en) Dynamic equalization for cross-talk cancellation
JP2009532985A (en) Audio signal processing
KR20070100838A (en) Device and method for generating an encoded stereo signal of an audio piece or audio data stream
KR20210027343A (en) Binaural rendering method and apparatus for decoding multi channel audio
WO2000019415A2 (en) Method and apparatus for three-dimensional audio display
Jot et al. Binaural simulation of complex acoustic scenes for interactive audio
US9510124B2 (en) Parametric binaural headphone rendering
NL1032538C2 (en) Apparatus and method for reproducing virtual sound from two channels.
JP5038145B2 (en) Localization control apparatus, localization control method, localization control program, and computer-readable recording medium
Liitola Headphone sound externalization
GB2609667A (en) Audio rendering
Hold et al. Parametric binaural reproduction of higher-order spatial impulse responses
Franck et al. Optimization-based reproduction of diffuse audio objects
WO2024081957A1 (en) Binaural externalization processing
CN116686306A (en) High channel upmixing for audio systems
JP2023070650A (en) Spatial audio reproduction by positioning at least part of a sound field
WO2022126271A1 (en) Stereo headphone psychoacoustic sound localization system and method for reconstructing stereo psychoacoustic sound signals using same
KR19990069336A (en) 3D sound reproducing apparatus and method