WO2023025376A1

WO2023025376A1 - Apparatus and method for ambisonic binaural audio rendering

Info

Publication number: WO2023025376A1
Application number: PCT/EP2021/073440
Authority: WO
Inventors: Liyun PANG; Martin POLLOW; Lauren WARD; Gavin Kearney; Calum Armstrong; Thomas Mckenzie
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2021-08-25
Filing date: 2021-08-25
Publication date: 2023-03-02

Abstract

An apparatus (100) for Ambisonic binaural rendering of an input signal. The apparatus (100) comprises a left ear transducer (101a) configured to generate a left ear audio signal based on a left ear transducer driver signal, a right ear transducer (101b) configured to generate a right ear audio signal based on a right ear transducer driver signal and a processing circuitry (110) configured to generate the left ear transducer driver signal and the right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers. Each virtual loudspeaker is associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF. The processing circuitry (110) is further configured to adjust the left ear and the right ear Ambisonic HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF for each virtual loudspeaker based on a comparison of the Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF. Moreover, the processing circuitry (110) is configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers using Ambisonic binaural rendering. Thus, an easy and runtime-efficient calibration routine for a user (103) of the apparatus (100) is provided.

Description

APPARATUS AND METHOD FOR AMBISONIC BINAURAL AUDIO RENDERING

TECHNICAL FIELD

The present disclosure relates to audio processing and audio rendering in general. More specifically, the disclosure relates to an apparatus and method for Ambisonic binaural audio rendering.

BACKGROUND

Binaural rendering may be used for rendering 3D audio over headphones based on spatial filters known as head-related transfer functions (HRTFs). These filters describe how a sound source at any given angle with respect to the head of a listener results in time, level and spectral differences of the received signals at the ear canals of the listener. However, these spatial filters are unique to the individual listener, since they depend on the anatomic details of the head and the ears of the listener. Generic HRTFs based on averaged head and ear shapes are typically used but have drawbacks in terms of incorrect perception of location of rendered sound sources as well as tonality.

Personalized HRTFs, i.e. HRTFs adapted to the individual listener, provide an improved audio experience, but are more difficult to obtain. They typically require an individual listener to sit still in an anechoic chamber with microphones in the ears of the listener, while loudspeakers at predetermined locations play measurement stimuli. Signal processing is then applied to generate the personalized HRTFs from the measured stimuli.

In Ambisonic binaural rendering the HRTFs are not directly convolved with sound sources, but instead are encoded into the spherical harmonic domain, where the level of spatial detail retained is dependent on the order of the Ambisonic encoding. For example, First Order Ambisonics (FOA) only requires 4 convolutions per ear, whereas 3rd order requires 16 convolutions per ear. These convolutions are not directly with the sources, as is the case with Direct HRTF convolution. Instead, the Ambisonically encoded HRTFs are convolved with a mixture of Ambisonic encoded sound sources, i.e. sound sources which have also been encoded into the spherical harmonic domain.

However, direct convolution of multiple sources with HRTFs can be computationally expensive, particularly if head movements are to be considered. Consequently, Ambisonic rendering is often used as a comparatively efficient method for implementing the binaural rendering as it easily facilitates sound field rotation when head-tracking is used. High orders of Ambisonic rendering are required to maintain spatial accuracy and correct binaural cues (such as interaural level difference and interaural time difference) in the signal, meaning that accurate spatial audio reproduction at low computational cost remains a problem. Furthermore, normally generic HRTFs are often used. This results in errors as the reproduced ITD often does not match the individual’s ITD, which lead to poor localization, front-back confusion and internalisation of rendered sound sources.

SUMMARY

It is an objective to provide an improved apparatus and method for Ambisonic binaural audio rendering.

The foregoing and other objectives are achieved by the subject matter of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

Generally, embodiments disclosed herein address the dual challenge of personalizing the interaural time difference from generic HRTFs whilst also improving their accuracy at low orders of Ambisonic rendering to improve spatial and timbral accuracy of the binaural audio.

More specifically, according to a first aspect an apparatus for Ambisonic binaural rendering of an input signal is provided. The apparatus comprises: a left ear transducer configured to generate a left ear audio signal based on a left ear transducer driver signal and a right ear transducer configured to generate a right ear audio signal based on a right ear transducer driver signal. The apparatus further comprises a processing circuitry configured to generate the left ear transducer driver signal and the right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers with a virtual loudspeaker configuration defining the number and positions of the virtual loudspeakers. Each virtual loudspeaker is associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF, i.e. an Ambisonic rendered, i.e. representation of the HRTF. In an embodiment, each virtual loudspeaker is additionally associated with a virtual loudspeaker direction. The processing circuitry is further configured to adjust the left ear and the right ear Ambisonic HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF for each virtual loudspeaker based on a comparison of the resulting Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF. The processing circuitry is further configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers using Ambisonic binaural rendering.

Thus, the apparatus according to the first aspect improves low order ITD rendering whilst remaining computationally efficient. The HRTF preprocessing may be done offline, to prevent added computational complexity at runtime. Once adjusting, i.e. calibration is applied and the optimization routine has generated new HRTFs, the localization accuracy of sound source rendering may be improved significantly. Moreover, the apparatus may be used to improve Ambisonic rendering of measured HRTFs for an individual.

Furthermore, independently of the number of sources, i.e. whether 1 source or 1000 sources are rendered, the number of convolutions for a given order remains the same. Moreover, simple rotation matrices can be employed to counter the Ambisonic mix for head-rotations of the listener when head-tracking is employed. The perceived result is a stable soundfield in 3D space. Ambisonics sound can be decoded to binaural audio by using virtual loudspeaker rendering.

In a further possible implementation form of the first aspect, the plurality of left ear and right ear reference HRTFs are a plurality of personalized left ear and right ear reference HRTFs personalized for a user of the apparatus.

Thus, since generic HRTFs can be modified for personalization, individualized HRTFs do not need to be measured.

In a further possible implementation form of the first aspect, the plurality of personalized left ear and right ear reference HRTFs are personalized for the user of the apparatus based on a head size of the user. Thus, this calibration routine is easy and far less cumbersome for an individual than getting their HRTFs measured. The calibration routine can be employed with consumer grade electronics rather than requiring specialized HRTF measurement equipment.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust for each virtual loudspeaker the ITD of the left ear and the right ear Ambisonic HRTF based on a comparison with the reference ITD of the left ear and the right ear reference HRTF using an iterative loop.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust for each virtual loudspeaker the ITD of the left ear and the right ear Ambisonic HRTF based on a comparison with the reference ITD of the left ear and the right ear reference HRTF using the iterative loop by: determining for a plurality of target directions the respective reference ITD for each left ear and right ear reference HRTF and the respective ITD for each left ear and right ear Ambisonic HRTF; and iteratively adjusting the respective ITD for each left ear and right ear Ambisonic HRTF based on the comparison between the respective reference ITD and the respective ITD from Ambisonic binaural decoding.

In a further possible implementation form of the first aspect, the processing circuitry is configured for each virtual loudspeaker to iteratively adjust the respective ITD for each left ear and right ear Ambisonic HRTF based on the comparison between the respective reference ITD and the respective Ambisonic ITD, until a difference between the respective reference ITD and the respective ITD from Ambisonic binaural decoding is smaller than a threshold value or until a predefined number of iterations has been reached.

In a further possible implementation form of the first aspect, the processing circuitry is configured to adjust for each virtual loudspeaker the ITD for the left ear and the right ear Ambisonic HRTF by applying, i.e. adding a respective time delay to the contralateral side of the respective virtual loudspeaker HRTF.

In a further possible implementation form of the first aspect, the respective time delay is based on a difference between the respective reference ITD and the respective ITD from Ambisonic binaural decoding. In a further possible implementation form of the first aspect, the processing circuitry is configured for each virtual loudspeaker to iteratively adjust for each iteration the respective ITD for each left ear and right ear HRTF using a respective incremental time delay and to apply a respective cumulative time delay to the contralateral side of the respective HRTF, after the iterative loop has been processed, wherein the respective cumulative time delay is a sum of the respective incremental time delays of each iteration.

In a further possible implementation form of the first aspect, the processing circuitry is further configured to combine the adjusted left ear and right ear Ambisonic HRTFs with the original Ambisonic HRTFs using a linear-phase crossover network, wherein the linear- phase crossover network is configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs for frequencies below a crossover frequency and to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of initial left ear and right ear Ambisonic HRTFs for frequencies above the crossover frequency.

In a further possible implementation form of the first aspect, the processing circuitry is configured to determine the left ear and right ear Ambisonic HRTF for a respective virtual loudspeaker based on a delta function input signal for the virtual loudspeaker direction of the respective virtual loudspeaker.

In a further possible implementation form of the first aspect, the processing circuitry is configured to generate the left ear transducer driver signal and the right ear transducer driver signal using first order, second order or third order Ambisonic binaural rendering of the input signal.

According to a second aspect headphones are provided, comprising an apparatus according to the first aspect.

According to a third aspect a method for Ambisonic binaural rendering of an input signal is provided. The method comprises a step of generating a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers with a virtual loudspeaker configuration defining the number and positions of the virtual loudspeakers, each virtual loudspeaker being associated a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF, i.e. an Ambisonic rendered, i.e. representation of the HRTF. In an embodiment each virtual loudspeaker is additionally associated with a virtual loudspeaker direction.

The method comprises a further step of adjusting for each virtual loudspeaker the left ear and the right ear HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF based on a comparison of the ITD of the resulting Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF. The method comprises a further step of generating, based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers, a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering for driving a left ear transducer configured to generate a left ear audio signal and a right ear transducer configured to generate a right ear audio signal.

The method according to the third aspect can be performed by the apparatus according to the first aspect. Thus, further features of the method according to the third aspect result directly from the functionality of the apparatus according to the first aspect as well as its different implementation forms and embodiments described above and below.

According to a fourth aspect a computer program product is provided, comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method according to the third aspect, when the program code is executed by the computer or the processor.

Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:

Fig. 1 is a schematic diagram illustrating an apparatus for Ambisonic binaural rendering of an input signal according to an embodiment; Fig. 2 is a block diagram illustrating processing steps implemented by an apparatus for Ambisonic binaural rendering of an input signal according to an embodiment for generating reference head-related transfer functions that are optimized with respect to the Ambisonic Interaural Time Difference;

Fig. 3 is a block diagram illustrating processing steps implemented by an apparatus for Ambisonic binaural rendering of an input signal according to an embodiment using preprocessed reference head-related transfer functions that are optimized with respect to the Ambisonic Interaural Time Difference;

Fig. 4 is a block diagram illustrating processing steps implemented by an apparatus for Ambisonic binaural rendering of an input signal according to an embodiment; and

Fig. 5 is a flow diagram illustrating a method for Ambisonic binaural rendering of an input signal according to an embodiment.

In the following, identical reference signs refer to identical or at least functionally equivalent features.

DETAILED DESCRIPTION OF THE EMBODIMENTS

In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.

Figure 1 is a schematic diagram illustrating an apparatus 100 for Ambisonic binaural rendering of an input signal. As illustrated in figure 1 , the apparatus 100 comprises a left ear transducer 101a, e.g. loudspeaker 101a configured to generate a left ear audio signal based on a left ear transducer driver signal and a right ear transducer 101 b, e.g. loudspeaker 101 b configured to generate a right ear audio signal based on a right ear transducer driver signal for a user 103. In an embodiment, the apparatus 100 may be implemented in the form of headphones 100.

For controlling the left ear transducer 101a and the right ear transducer 101 b the apparatus 100 further comprises a processing circuitry 110. The processing circuitry 110 may be implemented in hardware and/or software and may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The apparatus 100 may further comprise a memory 105 configured to store executable program code which, when executed by the processing circuitry 110, causes the apparatus 100 to perform the functions and methods described herein.

As will be described in more detail below, the processing circuitry 110 of the binaural audio rendering apparatus 100 is configured to generate the left ear transducer driver signal and the right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers with a virtual loudspeaker configuration defining the number and positions of the virtual loudspeakers. Each virtual loudspeaker is associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF, which can be interpreted as a rendered Ambisonic, i.e. representation of the HRTF. Each virtual loudspeaker may be additionally associated with a virtual loudspeaker direction. Further details about Ambisonic rendering using HRTFs and Ambisonics in general may be found in the book “Ambisonics, A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality” by Franz Zotter and Matthias Frank, Springer, 2019, which is fully incorporated by reference herein.

As will be described in more detail below, the processing circuitry 110 of the apparatus 100 illustrated in figure 1 is further configured to adjust the left ear and the right ear Ambisonic HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF for each virtual loudspeaker based on a comparison of the resulting Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF. The processing circuitry 110 is further configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers using Ambisonic binaural rendering.

Fig. 2 is a block diagram illustrating processing steps implemented by the processing circuitry 110 of the apparatus 100 in order to generate Ambisonic Interaural Time Difference Optimised (AITDO) HRTFs according to an embodiment. The processing circuitry 110 of the apparatus 100 may generate personalized reference HRTFs 119 which are Ambisonic Interaural Time Difference (ITD) calibrated 117 based on phase removal and alignment 113 of HRTFs from a reference HRTF Database 111. The calibration 117 may be based on using a head-size measurement 115. The calibration 117 may alternatively be based on an ITD slider method.

The processing circuitry 110 of the apparatus 100 may choose a virtual loudspeaker configuration for an appropriate Ambisonic order, which may be for example an octahedron for a 1st order. The virtual loudspeaker configuration may be chosen by the processing circuitry 110 based on the reference HRTF dataset 111. For each loudspeaker direction in the virtual speaker configuration, an Ambisonic domain HRTF is generated. In an embodiment the generation may be based on using a delta function. An Ambisonic ITD may then be estimated, i.e. calculated 121 b by the processing circuitry 110 for the Ambisonic HRTF and a reference ITD may be estimated, i.e. calculated 121 b by the processing circuitry 110 for the original virtual loudspeaker HRTF, i.e. the personalized HRTF 119 or reference HRTF, and subsequentially the difference in ITD between the two estimations is calculated 123. The process may be repeated for all HRTF directions, creating an array of ITD difference values. The iteration may continue until a difference between the respective reference ITD and the respective Ambisonic ITD is smaller than a threshold value or until a predefined number of iterations has been reached.

Moreover, the processing circuitry 110 may be configured to augment the virtual loudspeaker HRTF signals used in a binaural Ambisonic decoder 133 implemented by the processing circuitry 110 in the mid-bands, i.e. above the spatial aliasing frequency for the Ambisonic order, but below the usable ITD range, e.g. at 1 ,5kHz. If the HRTF is not located on a median plane, where the ITD may be 0, then the HRTF is a candidate for augmentation. The contralateral side of the HRTF may be delayed 125 based on the computed ITD difference between the Ambisonic HRTF and original virtual loudspeaker HRTF. The augmented 127 virtual loudspeaker HRTFs with AITDO may then combined with the original virtual loudspeaker HRTFs using a linear-phase crossover network and subsequentially normalized 129. The pre-processed HRTFs 131 are then switched into the binaural decoder 133, combined 137 with the Ambisonic HRTF generated based on the input signal and the process is repeated iteratively. An array of delay values may be augmented at each iteration which keeps track of the cumulative delay for each HRTF. At each iteration, this array of delays may be used on the original HRTF set, ensuring that the final AITDO pre-processed HRTF dataset will be subject to the crossover filter only once, regardless of the number of iterations.

Fig. 3 is a block diagram illustrating processing steps implemented by the apparatus 100 for Ambisonic binaural rendering of the input signal according to a further embodiment using Ambisonic Interaural Time Difference Optimised preprocessed reference head- related transfer functions. As an illustrative example, the apparatus 100 may be used for supporting spatial audio playback for virtual or augmented reality games on a mobile phone. The AITDO algorithm implemented by the processing circuitry 110 of the apparatus 100 and further detailed above and below may optimize the HRTFs used for binaural based Ambisonic rendering of game objects. In this case, the user 103 may be asked to undertake ITD calibration 117 prior to the game start. The calibration routine may be as simple as getting the subject to measure their head size and then to extract ITDs based on a spherical head model. Alternatively, a full test procedure may be employed where the user 103 is presented sound sources with different cross-head delays and asked to judge when the sound source is perceived to move laterally.

From the calibration routine, a generic set of HRTFs may have their ITDs replaced with the new ITDs from the calibration stage. These reference HRTFs may then be used for the AITDO routine. The routine may then run and a new set of Ambisonic optimized HRTFs 301 is generated. All these steps may be achieved prior to game runtime.

A bitstream may come from a game engine and be transcoded 303 to Ambisonic. At runtime the AITDO HRTFs 301 are loaded and a binaural based Ambisonic Tenderer 305 produces the final headphone mix. Thus, no additional computation complexity is introduced at runtime.

Fig. 4 is a block diagram illustrating processing steps implemented by the processing circuitry 110 of the apparatus 100 for Ambisonic binaural rendering of the input signal according to a further embodiment.

In a first step 401 , for each virtual loudspeaker in a configuration, an Ambisonics signal may be created that corresponds to an incoming plane wave from the direction of the virtual loudspeaker.

In a further step 402, an initial Ambisonic Binaural decoder implemented by the processing circuitry 110 may be used and the Ambisonics signal may then be binaurally decoded by convolution with the Ambisonics representation of the used Ambisonics HRTF, which may result in a 2 channel impulse response for each loudspeaker.

In a further step 403, the ITD may be estimated 403 for the binaural impulse response obtained in the previous step 402 for each speaker direction. The ITD may be estimated in the same way for the original HRIR (head-related impulse response) corresponding to the loudspeaker directions. Then the difference in ITD between these two estimations may be calculated for all directions of the virtual loudspeaker array.

In a further step 404, the ITD difference values of the current iteration, which may include the single value of ITD for each virtual loudspeaker, may be stored in an array of the memory 105 which may contain all of the previous measured ITD differences.

In a further step 405, the processing circuitry 110 may check whether the ITD difference is small enough or the maximum number of iterations has been achieved. If true the process stops in a further step 408, otherwise continues with a further step 406. In the step 406, the processing circuitry 110 may augment the virtual loudspeaker HRTF, i.e. Ambisonic HRTF signals in the mid-bands above the spatial aliasing frequency for the Ambisonics order, but below the usable ITD range, which may be at 1.5kHz, as follows: (i) If the HRTFs are located on the median plane, where ITD may be 0, no augmentation is performed, (ii) If the ITD is too small in the Ambisonics processing, a time-delay of the contralateral HRIR, which may be an inverse Fourier Transform of the HRTF, is applied using the value of the computed ITD offset, (iii) The augmented virtual loudspeaker HRTFs with AITDO are combined with the original virtual loudspeaker HRTFs using a linear-phase crossover network and the result is normalized.

In a further step 407, the augmented virtual loudspeaker HRTFs are used to re-compute the binaural decoder and the process is repeated iteratively, thus going back to step 402.

In the step 408, when the loop stops, the total delay of the HRTFs for each virtual loudspeaker may be calculated 409 by the addition of all partial delays of all iterations, i.e. ITD offsets, and may be used by the processing circuitry 110 for computing 410 the final Ambisonics Binaural decoder. The calculation in step 409 may comprise an augmentation, combination with the original virtual loudspeaker HRTFs using a linear-phase crossover network and a normalization as described above.

Figure 5 is a flow diagram illustrating a method 500 for Ambisonic binaural rendering of the input signal. The method 500 comprises a first step of generating 501 a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers, each virtual loudspeaker being associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF.

Moreover, the method 500 comprises a step of adjusting 503 for each virtual loudspeaker the left ear and the right ear HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF based on a comparison of the ITD of the resulting Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF.

The method 500 further comprises a step of generating 505, based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers, a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering for driving a left ear transducer 101a configured to generate a left ear audio signal and a right ear transducer 101 b configured to generate a right ear audio signal.

The method 500 can be performed by the apparatus 100 according to an embodiment. Thus, further features of the method 500 result directly from the functionality of the apparatus 100 as well as its different embodiments described above and below.

The person skilled in the art will understand that the "blocks" ("units") of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual "units" in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit = step).

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described embodiment of an apparatus is merely exemplary. For example, the unit division is merely logical function division and may be another division in an actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, functional units in the embodiments of the invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims

1 . An apparatus (100) for Ambisonic binaural rendering of an input signal, wherein the apparatus (100) comprises: a left ear transducer (101a) configured to generate a left ear audio signal based on a left ear transducer driver signal and a right ear transducer (101 b) configured to generate a right ear audio signal based on a right ear transducer driver signal; and a processing circuitry (110) configured to generate the left ear transducer driver signal and the right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers, each virtual loudspeaker being associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF, wherein the processing circuitry (110) is further configured to adjust the left ear and the right ear Ambisonic HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF for each virtual loudspeaker based on a comparison of the Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF, and wherein the processing circuitry (110) is further configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers using Ambisonic binaural rendering.

2. The apparatus (100) of claim 1 , wherein the plurality of left ear and right ear reference HRTFs are a plurality of personalized left ear and right ear reference HRTFs personalized for a user (103) of the apparatus (100).

3. The apparatus (100) of claim 2, wherein the plurality of personalized left ear and right ear reference HRTFs are personalized for the user (103) of the apparatus (100) based on a head size of the user (103).

4. The apparatus (100) of any one of the preceding claims, wherein the processing circuitry (110) is configured to adjust for each virtual loudspeaker the ITD of the left ear and the right ear Ambisonic HRTF based on a comparison with the reference ITD of the left ear and the right ear reference HRTF using an iterative loop.

5. The apparatus (100) of claim 4, wherein the processing circuitry (110) is configured to adjust for each virtual loudspeaker the ITD of the left ear and the right ear Ambisonic HRTF based on a comparison with the reference ITD of the left ear and the right ear reference HRTF using the iterative loop by: determining for a plurality of target directions the reference ITD for each left ear and right ear reference HRTF and the ITD for each left ear and right ear Ambisonic HRTF; and iteratively adjusting the ITD for each left ear and right ear Ambisonic HRTF based on the comparison between the reference ITD and the ITD from Ambisonic binaural decoding.

6. The apparatus (100) of claim 5, wherein the processing circuitry (110) is configured for each virtual loudspeaker to iteratively adjust the ITD for each left ear and right ear Ambisonic HRTF based on the comparison between the reference ITD and the Ambisonic ITD, until a difference between the reference ITD and the ITD from Ambisonic binaural decoding is smaller than a threshold value or until a predefined number of iterations has been reached.

7. The apparatus (100) of claim 5 or 6, wherein the processing circuitry (110) is configured to adjust for each virtual loudspeaker the ITD for the left ear and the right ear Ambisonic HRTF by applying a time delay to the contralateral side of the virtual loudspeaker HRTF.

8. The apparatus (100) of claim 7, wherein the time delay is based on a difference between the reference ITD and the ITD from Ambisonic binaural decoding.

9. The apparatus (100) of any one of claims 4 to 8, wherein the processing circuitry (110) is configured for each virtual loudspeaker to iteratively adjust for each iteration the ITD for each left ear and right ear HRTF using an incremental time delay and to apply a cumulative time delay to the contralateral side of the HRTF, after the iterative loop has been processed, wherein the cumulative time delay is a sum of the incremental time delays of each iteration.

10. The apparatus (100) of any one of claims 4 to 9, wherein the processing circuitry

(110) is further configured to combine the adjusted left ear and right ear Ambisonic HRTFs with the Ambisonic HRTFs using a linear-phase crossover network, wherein the linear- phase crossover network is configured to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs for frequencies below a crossover frequency and to generate the left ear transducer driver signal and the right ear transducer driver signal based on the input signal and the plurality of initial left ear and right ear Ambisonic HRTFs for frequencies above the crossover frequency.

11 . The apparatus (100) of any one of the preceding claims, wherein the processing circuitry (110) is configured to determine the left ear and right ear Ambisonic HRTF for a respective virtual loudspeaker based on a delta function input signal for the virtual loudspeaker direction of the respective virtual loudspeaker.

12. The apparatus (100) of any one of the preceding claims, wherein the processing circuitry (110) is configured to generate the left ear transducer driver signal and the right ear transducer driver signal using first order, second order or third order Ambisonic binaural rendering of the input signal.

13. Headphones comprising an apparatus (100) according to any one of the preceding claims.

14. A method (500) for Ambisonic binaural rendering of an input signal, wherein the method (500) comprises: generating (501) a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering of the input signal based on a plurality of virtual loudspeakers, each virtual loudspeaker being associated with a left ear and a right ear reference head-related transfer function, HRTF, and a left ear and a right ear Ambisonic HRTF;

16 adjusting (503) for each virtual loudspeaker the left ear and the right ear HRTF by adjusting an interaural time difference, ITD, of the left ear and the right ear HRTF based on a comparison of the ITD of the resulting Ambisonic HRTF with a reference ITD of the left ear and the right ear reference HRTF; and generating (505), based on the input signal and the plurality of adjusted left ear and right ear Ambisonic HRTFs of the plurality of virtual loudspeakers, a left ear transducer driver signal and a right ear transducer driver signal using Ambisonic binaural rendering for driving a left ear transducer (101a) configured to generate a left ear audio signal and a right ear transducer (101b) configured to generate a right ear audio signal.

15. A computer program product comprising a computer-readable storage medium for storing program code which causes a computer or a processor to perform the method of claim 14 when the program code is executed by the computer or the processor.

17