WO2014194005A1 - Filtering with binaural room impulse responses with content analysis and weighting - Google Patents

Info

Publication number
WO2014194005A1
Authority
WO
WIPO (PCT)
Prior art keywords
impulse response
room impulse
binaural
audio signal
binaural room
Prior art date
Application number
PCT/US2014/039864
Other languages
French (fr)
Inventor
Pei Xiang
Dipanjan Sen
Original Assignee
Qualcomm Incorporated
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qualcomm Incorporated
Priority to JP2016516799A (patent JP6100441B2)
Priority to EP14733457.7A (patent EP3005734B1)
Priority to KR1020157036270A (patent KR101719094B1)
Priority to CN201480042431.2A (patent CN105432097B)
Publication of WO2014194005A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S5/00 Pseudo-stereo systems, e.g. in which additional channel signals are derived from monophonic signals by means of phase shifting, time delay or reverberation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/307 Frequency adjustment, e.g. tone control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K15/00 Acoustics not otherwise provided for
    • G10K15/08 Arrangements for producing a reverberation or echo sound
    • G10K15/12 Arrangements for producing a reverberation or echo sound using electronic time-delay networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S1/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S1/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S1/005 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01 Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/07 Synergistic effects of band splitting and sub-band processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 Application of ambisonics in stereophonic audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/002 Non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution
    • H04S3/004 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H04S7/306 For headphones

Definitions

  • This disclosure relates to audio rendering and, more specifically, binaural rendering of audio data.
  • a method of binauralizing an audio signal comprises applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
  • a device comprises one or more processors configured to apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
  • an apparatus comprises means for applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; means for combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and means for applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
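  • The three steps shared by the method, device, apparatus, and storage-medium statements above (weight the channels adaptively, combine them, apply a BRIR filter) can be sketched as follows. This is a minimal illustration only: the energy-based weighting rule, the function name `binauralize`, and the array shapes are assumptions made for the example and are not taken from the disclosure, which does not fix how the weights are determined in this passage.

```python
import numpy as np

def binauralize(channels, brir_left, brir_right, eps=1e-12):
    """Weight channels adaptively, combine them, then apply one BRIR filter pair.

    channels: array of shape (num_channels, num_samples).
    brir_left, brir_right: 1-D impulse responses for the two ears.
    """
    channels = np.asarray(channels, dtype=float)
    # Hypothetical content analysis: each channel's weight is its share of
    # the total signal energy (an assumption made for this sketch).
    energy = np.sum(channels ** 2, axis=1)
    weights = energy / (np.sum(energy) + eps)
    # Apply the weights, then combine the weighted channels into one signal.
    combined = (weights[:, None] * channels).sum(axis=0)
    # Apply the BRIR filter pair to the combined signal.
    left = np.convolve(combined, brir_left)
    right = np.convolve(combined, brir_right)
    return weights, np.stack([left, right])

rng = np.random.default_rng(0)
channels = rng.standard_normal((5, 256))
decay = np.exp(-np.arange(64) / 16.0)
brir_left = rng.standard_normal(64) * decay
brir_right = rng.standard_normal(64) * decay
weights, binaural = binauralize(channels, brir_left, brir_right)
```

  • Combining before filtering means one convolution per ear regardless of the number of input channels, which is where the computational saving in this sketch comes from.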
  • FIGS. 1 and 2 are diagrams illustrating spherical harmonic basis functions of various orders and sub-orders.
  • FIG. 3 is a diagram illustrating a system that may perform techniques described in this disclosure to more efficiently render audio signal information.
  • FIG. 4 is a block diagram illustrating an example binaural room impulse response (BRIR).
  • FIG. 5 is a block diagram illustrating an example systems model for producing a BRIR in a room.
  • FIG. 6 is a block diagram illustrating a more in-depth systems model for producing a BRIR in a room.
  • FIG. 7 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
  • FIG. 8 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
  • FIG. 9 is a flow diagram illustrating an example mode of operation for a binaural rendering device to render spherical harmonic coefficients according to various aspects of the techniques described in this disclosure.
  • FIGS. 10A and 10B depict flow diagrams illustrating alternative modes of operation that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure.
  • FIG. 11 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
  • FIG. 12 is a flow diagram illustrating a process that may be performed by the audio playback device of FIG. 11 in accordance with various aspects of the techniques described in this disclosure.
  • FIG. 13 is a diagram of an example binaural room impulse response filter.
  • FIG. 14 is a block diagram illustrating a system for a standard computation of a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal.
  • FIG. 15 is a block diagram illustrating functional components of a system for computing a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal according to techniques described herein.
  • FIG. 16 is an example plot showing hierarchical cluster analysis on a reflection segment of the multiple binaural room impulse response filters.
  • FIG. 17 is a flowchart illustrating an example mode of operation of an audio playback device according to techniques described in this disclosure.
  • surround sound formats include the popular 5.1 format (which includes the following six channels: front left (FL), front right (FR), center or front center, back left or surround left, back right or surround right, and low frequency effects (LFE)), the growing 7.1 format, and the upcoming 22.2 format (e.g., for use with the Ultra High Definition Television standard).
  • the input to a future standardized audio-encoder could optionally be one of three possible formats: (i) traditional channel-based audio, which is meant to be played through loudspeakers at pre-specified positions; (ii) object-based audio, which involves discrete pulse-code-modulation (PCM) data for single audio objects with associated metadata containing their location coordinates (amongst other information); and (iii) scene-based audio, which involves representing the sound field using spherical harmonic coefficients (SHC) - where the coefficients represent 'weights' of a linear summation of spherical harmonic basis functions.
  • SHC, in this context, may include Higher Order Ambisonics (HoA) signals according to an HoA model.
  • Spherical harmonic coefficients may alternatively or additionally include planar models and spherical models.
  • a hierarchical set of elements may be used to represent a sound field.
  • the hierarchical set of elements may refer to a set of elements in which the elements are ordered such that a basic set of lower-ordered elements provides a full representation of the modeled sound field. As the set is extended to include higher-order elements, the representation becomes more detailed.
  • $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point)
  • $j_n(\cdot)$ is the spherical Bessel function of order n
  • $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions of order n and suborder m.
  • the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$) which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform.
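  • For reference, the symbol definitions above fit the standard spherical harmonic expansion of a sound field. The following is a reconstruction under that assumption, not text from this page: it takes $k = \omega/c$ with $c$ the speed of sound, writes the pressure at the observation point as $p$, and uses $i$ for the imaginary unit.

$$p(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty} \left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{i\omega t}$$

  • In this form the bracketed term is the frequency-domain quantity $S(\omega, r_r, \theta_r, \varphi_r)$, and the coefficients $A_n^m(k)$ are the SHC.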
  • Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
  • the spherical harmonic basis functions are shown in three-dimensional coordinate space with both the order and the suborder shown.
  • the SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the sound field.
  • the SHC represents scene-based audio.
  • $i$ is $\sqrt{-1}$
  • $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order n
  • $\{r_s, \theta_s, \varphi_s\}$ is the location of the object.
  • PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects).
  • these coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field, in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
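  • A commonly used form of the object-to-SHC transformation that these symbols describe is sketched below. It assumes that $g(\omega)$ denotes the source energy of the object as a function of frequency and that $k = \omega/c$; neither symbol is defined in the text above, so both are labeled here as assumptions.

$$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s)$$

  • Because the relation is linear in $g(\omega)$, the coefficients of several objects add, which is consistent with representing a set of PCM objects as a sum of per-object coefficient vectors.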
  • the SHCs may also be derived from a microphone-array recording as follows:
  • $a_n^m(t)$ are the time-domain equivalent of $A_n^m(k)$ (the SHC)
  • the * represents a convolution operation
  • the $\langle \cdot, \cdot \rangle$ represents an inner product
  • $b_n(r_i, t)$ represents a time-domain filter function dependent on $r_i$
  • $m_i(t)$ are the $i$-th microphone signal, where the $i$-th microphone transducer is located at radius $r_i$, elevation angle $\theta_i$ and azimuth angle $\varphi_i$.
  • the matrix in the above equation may be more generally referred to as $E_s(\theta, \varphi)$, where the subscript s may indicate that the matrix is for a certain transducer geometry-set, s.
  • the convolution in the above equation (indicated by the *), is on a row-by-row basis, such that, for example, the output $a_0^0(t)$ is the result of the convolution between $b_0(a, t)$ and the time series that results from the vector multiplication of the first row of the $E_s(\theta, \varphi)$ matrix, and the column of microphone signals (which varies as a function of time - accounting for the fact that the result of the vector multiplication is a time series).
  • the computation may be most accurate when the transducer positions of the microphone array are in the so-called T-design geometries (which is very close to the Eigenmike transducer geometry).
  • One advantage of the T-design geometry may be that the $E_s(\theta, \varphi)$ matrix that results from the geometry has a very well behaved inverse (or pseudo inverse) and further that the inverse may often be very well approximated by the transpose of the matrix, $E_s(\theta, \varphi)$.
  • FIG. 3 is a diagram illustrating a system 20 that may perform techniques described in this disclosure to more efficiently render audio signal information.
  • the system 20 includes a content creator 22 and a content consumer 24. While described in the context of the content creator 22 and the content consumer 24, the techniques may be implemented in any context that makes use of SHCs or any other hierarchical elements that define a hierarchical representation of a sound field.
  • the content creator 22 may represent a movie studio or other entity that may generate multi-channel audio content for consumption by content consumers, such as the content consumer 24. Often, this content creator generates audio content in conjunction with video content.
  • the content consumer 24 may represent an individual that owns or has access to an audio playback system, which may refer to any form of audio playback system capable of playing back multi-channel audio content. In the example of FIG. 3, the content consumer 24 owns or has access to audio playback system 32 for rendering hierarchical elements that define a hierarchical representation of a sound field.
  • the content creator 22 includes an audio renderer 28 and an audio editing system 30.
  • the audio renderer 28 may represent an audio processing unit that renders or otherwise generates speaker feeds (which may also be referred to as “loudspeaker feeds,” “speaker signals,” or “loudspeaker signals”).
  • Each speaker feed may correspond to a speaker feed that reproduces sound for a particular channel of a multi-channel audio system or to a virtual loudspeaker feed that is intended for convolution with head-related transfer function (HRTF) filters matching the speaker position.
  • Each speaker feed may correspond to a channel of spherical harmonic coefficients (where a channel may be denoted by an order and/or suborder of associated spherical basis functions to which the spherical harmonic coefficients correspond), which uses multiple channels of SHCs to represent a directional sound field.
  • the audio renderer 28 may render speaker feeds for conventional 5.1, 7.1 or 22.2 surround sound formats, generating a speaker feed for each of the 5, 7 or 22 speakers in the 5.1, 7.1 or 22.2 surround sound speaker systems.
  • the audio renderer 28 may be configured to render speaker feeds from source spherical harmonic coefficients for any speaker configuration having any number of speakers, given the properties of source spherical harmonic coefficients discussed above.
  • the audio renderer 28 may, in this manner, generate a number of speaker feeds, which are denoted in FIG. 3 as speaker feeds 29.
  • the content creator may, during the editing process, render spherical harmonic coefficients 27 ("SHCs 27"), listening to the rendered speaker feeds in an attempt to identify aspects of the sound field that do not have high fidelity or that do not provide a convincing surround sound experience.
  • the content creator 22 may then edit source spherical harmonic coefficients (often indirectly through manipulation of different objects from which the source spherical harmonic coefficients may be derived in the manner described above).
  • the content creator 22 may employ the audio editing system
  • the audio editing system 30 represents any system capable of editing audio data and outputting this audio data as one or more source spherical harmonic coefficients.
  • the content creator 22 may generate bitstream 31 based on the spherical harmonic coefficients 27. That is, the content creator 22 includes a bitstream generation device 36, which may represent any device capable of generating the bitstream 31. In some instances, the bitstream generation device 36 may represent an encoder that bandwidth compresses (through, as one example, entropy encoding) the spherical harmonic coefficients 27 and that arranges the entropy encoded version of the spherical harmonic coefficients 27 in an accepted format to form the bitstream 31.
  • the bitstream generation device 36 may represent an audio encoder (possibly, one that complies with a known audio coding standard, such as MPEG surround, or a derivative thereof) that encodes the multichannel audio content 29 using, as one example, processes similar to those of conventional audio surround sound encoding processes to compress the multi-channel audio content or derivatives thereof.
  • the compressed multi-channel audio content 29 may then be entropy encoded or coded in some other way to bandwidth compress the content 29 and arranged in accordance with an agreed upon format to form the bitstream 31.
  • the content creator 22 may transmit the bitstream
  • the content creator 22 may output the bitstream 31 to an intermediate device positioned between the content creator 22 and the content consumer 24.
  • This intermediate device may store the bitstream 31 for later delivery to the content consumer 24, which may request this bitstream.
  • the intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 31 for later retrieval by an audio decoder.
  • This intermediate device may reside in a content delivery network capable of streaming the bitstream 31 (and possibly in conjunction with transmitting a corresponding video data bitstream) to subscribers, such as the content consumer 24, requesting the bitstream 31.
  • the content creator 22 may store the bitstream 31 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media.
  • the transmission channel may refer to those channels by which content stored to these media is transmitted (and may include retail stores and other store-based delivery mechanisms).
  • the techniques of this disclosure should not therefore be limited in this respect to the example of FIG. 3.
  • the audio playback system 32 may represent any audio playback system capable of playing back multi-channel audio data.
  • the audio playback system 32 includes a binaural audio renderer 34 that renders SHCs 27' for output as binaural speaker feeds 35A-35B (collectively, "speaker feeds 35").
  • Binaural audio renderer 34 may provide for different forms of rendering, such as one or more of the various ways of performing vector-base amplitude panning (VBAP), and/or one or more of the various ways of performing sound field synthesis.
  • A "and/or" B may refer to A, B, or a combination of A and B.
  • the audio playback system 32 may further include an extraction device 38.
  • the extraction device 38 may represent any device capable of extracting spherical harmonic coefficients 27' ("SHCs 27'," which may represent a modified form of or a duplicate of spherical harmonic coefficients 27) through a process that may generally be reciprocal to that of the bitstream generation device 36.
  • the audio playback system 32 may receive the spherical harmonic coefficients 27' and use binaural audio renderer 34 to render spherical harmonic coefficients 27' and thereby generate speaker feeds 35 (corresponding to the number of loudspeakers electrically or possibly wirelessly coupled to the audio playback system 32, which are not shown in the example of FIG. 3 for ease of illustration purposes).
  • the number of speaker feeds 35 may be two, and the audio playback system 32 may wirelessly couple to a pair of headphones that includes the two corresponding loudspeakers.
  • binaural audio renderer 34 may output more or fewer speaker feeds than is illustrated and primarily described with respect to FIG. 3.
  • Binaural room impulse response (BRIR) filters 37 of audio playback system 32 each represent a response at a location to an impulse generated at an impulse location.
  • BRIR filters 37 are "binaural" in that they are each generated to be representative of the impulse response as would be experienced by a human ear at the location. Accordingly, BRIR filters for an impulse are often generated and used for sound rendering in pairs, with one element of the pair for the left ear and another for the right ear.
  • binaural audio renderer 34 uses left BRIR filters 33A and right BRIR filters 33B to render respective binaural audio outputs 35A and 35B.
  • BRIR filters 37 may be generated by convolving a sound source signal with head-related transfer functions (HRTFs) measured as impulse responses (IRs). The impulse location corresponding to each of the BRIR filters 37 may represent a position of a virtual loudspeaker in a virtual space.
  • binaural audio renderer 34 convolves SHCs 27' with BRIR filters 37 corresponding to the virtual loudspeakers, then accumulates (i.e., sums) the resulting convolutions to render the sound field defined by SHCs 27' for output as speaker feeds 35.
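  • The convolve-then-accumulate step described above can be sketched as follows. The sketch assumes the sound field has already been rendered to one feed per virtual loudspeaker; the function name `render_binaural` and the array layout are illustrative choices, not names from the disclosure.

```python
import numpy as np

def render_binaural(speaker_feeds, brirs):
    """Convolve each virtual loudspeaker feed with its BRIR pair and sum.

    speaker_feeds: shape (L, T), one row per virtual loudspeaker.
    brirs: shape (L, taps, 2), last axis ordered (left ear, right ear).
    Returns an array of shape (2, T + taps - 1).
    """
    speaker_feeds = np.asarray(speaker_feeds, dtype=float)
    brirs = np.asarray(brirs, dtype=float)
    num_speakers, num_samples = speaker_feeds.shape
    taps = brirs.shape[1]
    out = np.zeros((2, num_samples + taps - 1))
    for spk in range(num_speakers):
        for ear in range(2):
            # Accumulate the contribution of this loudspeaker at this ear.
            out[ear] += np.convolve(speaker_feeds[spk], brirs[spk, :, ear])
    return out

# Usage: a unit impulse on loudspeaker 1 alone reproduces that loudspeaker's BRIR pair.
feeds = np.zeros((3, 4))
feeds[1, 0] = 1.0
brirs = np.arange(3 * 5 * 2, dtype=float).reshape(3, 5, 2)
out = render_binaural(feeds, brirs)
```

  • The cost grows with the number of loudspeakers times the BRIR length, which motivates the segment-based reductions discussed in the following passages.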
  • binaural audio renderer 34 may apply techniques for reducing rendering computation by manipulating BRIR filters 37 while rendering SHCs 27' as speaker feeds 35.
  • the techniques include segmenting BRIR filters 37 into a number of segments that represent different stages of an impulse response at a location within a room. These segments correspond to different physical phenomena that generate the pressure (or lack thereof) at any point on the sound field. For example, because each of BRIR filters 37 is timed coincident with the impulse, the first or "initial" segment may represent a time until the pressure wave from the impulse location reaches the location at which the impulse response is measured. With the exception of the timing information, BRIR filters 37 values for respective initial segments may be insignificant and may be excluded from a convolution with the hierarchical elements that describe the sound field.
  • each of BRIR filters 37 may include a last or "tail" segment that includes impulse response signals attenuated to below the dynamic range of human hearing or attenuated to below a designated threshold, for instance.
  • BRIR filters 37 values for respective tail segments may also be insignificant and may be excluded from a convolution with the hierarchical elements that describe the sound field.
  • the techniques may include determining a tail segment by performing a Schroeder backward integration with a designated threshold and discarding elements from the tail segment where backward integration exceeds the designated threshold.
  • the designated threshold is -60 dB for reverberation time $RT_{60}$.
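  • A minimal sketch of tail determination by Schroeder backward integration follows. It computes the energy decay curve of an impulse response and returns the first sample at which the curve falls below the threshold; everything from that sample on would be treated as the discardable tail. The function name `schroeder_tail_start` and the synthetic test response are assumptions made for the example.

```python
import numpy as np

def schroeder_tail_start(impulse_response, threshold_db=-60.0):
    """Index of the first sample whose energy decay curve is below threshold_db.

    Assumes a nonzero impulse response. Returns len(impulse_response) if the
    decay curve never drops below the threshold.
    """
    h = np.asarray(impulse_response, dtype=float)
    energy = h ** 2
    # Backward integration: energy remaining from each sample to the end.
    edc = np.cumsum(energy[::-1])[::-1]
    # Normalize to the total energy and convert to decibels.
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-300)
    below = np.nonzero(edc_db < threshold_db)[0]
    return int(below[0]) if below.size else len(h)

# Synthetic response that decays by 0.1 dB per sample, so -60 dB is reached near sample 600.
n = np.arange(2000)
h = 10.0 ** (-0.1 * n / 20.0)
tail_start = schroeder_tail_start(h)
truncated = h[:tail_start]
```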
  • An additional segment of each of BRIR filters 37 may represent the impulse response caused by the impulse-generated pressure wave without the inclusion of echo effects from the room. These segments may be represented and described as head-related transfer functions (HRTFs) for BRIR filters 37, where HRTFs capture the impulse response due to the diffraction and reflection of pressure waves about the head, shoulders/torso, and outer ear as the pressure wave travels toward the ear drum. HRTF impulse responses are the result of a linear and time-invariant (LTI) system and may be modeled as minimum-phase filters.
  • the techniques to reduce HRTF segment computation during rendering may, in some examples, include minimum-phase reconstruction and using infinite impulse response (IIR) filters to reduce an order of the original finite impulse response (FIR) filter (e.g., the HRTF filter segment).
  • Minimum-phase filters implemented as IIR filters may be used to approximate the HRTF filters for BRIR filters 37 with a reduced filter order. Reducing the order leads to a concomitant reduction in the number of calculations for a time-step in the frequency domain.
  • the residual/excess filter resulting from the construction of minimum-phase filters may be used to estimate the interaural time difference (ITD) that represents the time or phase distance caused by the distance a sound pressure wave travels from a source to each ear.
  • the ITD can then be used to model sound localization for one or both ears after computing a convolution of one or more BRIR filters 37 with the hierarchical elements that describe the sound field (i.e., determine binauralization).
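  • The minimum-phase reconstruction step can be illustrated with the real-cepstrum (homomorphic) method, shown below as one possible approach rather than the method of the disclosure. The delay removed by the reconstruction is then estimated per ear by cross-correlation, which is a simple stand-in for analysing the residual/excess filter; the IIR order reduction is not covered. The names `minimum_phase` and `onset_delay` and the toy impulse responses are assumptions.

```python
import numpy as np

def minimum_phase(h, n_fft=1024):
    """Minimum-phase FIR with the same magnitude response as h (real-cepstrum method)."""
    h = np.asarray(h, dtype=float)
    spectrum = np.fft.fft(h, n_fft)
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Fold the anti-causal half of the cepstrum onto the causal half.
    window = np.zeros(n_fft)
    window[0] = 1.0
    window[1:n_fft // 2] = 2.0
    window[n_fft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(window * cepstrum))).real
    return h_min[:len(h)]

def onset_delay(h, h_min):
    """Lag in samples of h relative to h_min, taken at the cross-correlation peak."""
    corr = np.correlate(h, h_min, mode="full")
    return int(np.argmax(corr)) - (len(h_min) - 1)

# Toy HRTF segments: the same two-tap response, arriving 2 samples later at the right ear.
left = np.array([0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0])
right = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 0.0])
left_min = minimum_phase(left)
right_min = minimum_phase(right)
itd_samples = onset_delay(right, right_min) - onset_delay(left, left_min)
```

  • In this toy case both ears share the same minimum-phase filter, so only one short filter plus a per-ear delay needs to be applied.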
  • a still further segment of each of BRIR filters 37 is subsequent to the HRTF segment and may account for effects of the room on the impulse response.
  • This room segment may be further decomposed into an early echoes (or "early reflection") segment and a late reverberation segment (that is, early echoes and late reverberation may each be represented by separate segments of each of BRIR filters 37).
  • onset of the early echo segment may be identified by deconvoluting the BRIR filters 37 with the HRTF to identify the HRTF segment.
  • Subsequent to the HRTF segment is the early echo segment.
  • the HRTF and early echo segments are direction-dependent in that location of the corresponding virtual speaker determines the signal in a significant respect.
  • binaural audio renderer 34 uses BRIR filters 37 prepared for the spherical harmonics domain $(\theta, \varphi)$ or other domain for the hierarchical elements that describe the sound field. That is, BRIR filters 37 may be defined in the spherical harmonics domain (SHD) as transformed BRIR filters 37 to allow binaural audio renderer 34 to perform fast convolution while taking advantage of certain properties of the data set, including the symmetry of BRIR filters 37 (e.g. left/right) and of SHCs 27'. In such examples, transformed BRIR filters 37 may be generated by multiplying (or convolving in the time-domain) the SHC rendering matrix and the original BRIR filters. Mathematically, this can be expressed according to the following equations (1)-(5):
  • $$\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{left}} = \mathrm{SHC}_{(N+1)^2,L} * \mathrm{BRIR}_{L,\mathrm{left}} \qquad (1)$$
  • $$\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}} = \mathrm{SHC}_{(N+1)^2,L} * \mathrm{BRIR}_{L,\mathrm{right}} \qquad (2)$$
  • Equation (3) depicts either (1) or (2) in matrix form for fourth-order spherical harmonic coefficients (which may be an alternative way to refer to those of the spherical harmonic coefficients associated with spherical basis functions of the fourth-order or less). Equation (3) may of course be modified for higher- or lower-order spherical harmonic coefficients. Equations (4)-(5) depict the summation of the transformed left and right BRIR filters 37 over the loudspeaker dimension, L, to generate summed SHC-binaural rendering matrices (BRIR'').
  • the summed SHC-binaural rendering matrices have dimensionality $[(N+1)^2, \mathit{Length}, 2]$, where Length is a length of the impulse response vectors to which any combination of equations (1)-(5) may be applied.
  • the SHC rendering matrix presented in the above equations (1)-(3), SHC, includes elements for each order/sub-order combination of SHCs 27', which effectively define a separate SHC channel, where the element values are set for a position for the speaker, L, in the spherical harmonic domain.
  • $\mathrm{BRIR}_{L,\mathrm{left}}$ represents the BRIR response at the left ear or position for an impulse produced at the location for the speaker, L, and is depicted in (3) using impulse response vectors $B_i$ for $\{i \mid i \in [0, L]\}$.
  • $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{left}}$ represents one half of a "SHC-binaural rendering matrix," i.e., the SHC-binaural rendering matrix at the left ear or position for an impulse produced at the location for speakers, L, transformed to the spherical harmonics domain.
  • $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}}$ represents the other half of the SHC-binaural rendering matrix.
  • the techniques may include applying the SHC rendering matrix only to the HRTF and early reflection segments of respective original BRIR filters 37 to generate transformed BRIR filters 37 and an SHC-binaural rendering matrix. This may reduce a length of convolutions with SHCs 27'.
  • the SHC-binaural rendering matrices having dimensionality that incorporates the various loudspeakers in the spherical harmonics domain may be summed to generate a (N+1)^2 × Length × 2 filter matrix that combines SHC rendering and BRIR rendering/mixing. That is, SHC-binaural rendering matrices for each of the L loudspeakers may be combined by, e.g., summing the coefficients over the L dimension. For SHC-binaural rendering matrices of length Length, this produces a (N+1)^2 × Length × 2 summed SHC-binaural rendering matrix.
  • Length may be a length of a segment of the BRIR filters segmented in accordance with techniques described herein.
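  • The transformation and summation described above can be sketched in NumPy as follows. This is a minimal sketch, assuming the SHC rendering matrix holds one scalar gain per SHC channel and loudspeaker, so that applying it to a BRIR reduces to a scaling. The array names, sizes, and random data are illustrative assumptions and are not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                 # SHC order (assumed)
L = 22                # number of loudspeakers (assumed)
length = 256          # BRIR length in samples (assumed)
K = (N + 1) ** 2      # number of SHC channels

shc_render = rng.standard_normal((K, L))     # hypothetical SHC rendering matrix
brir = rng.standard_normal((L, length, 2))   # per-speaker BRIRs, last axis = left/right

# Equations (1)-(2): per-speaker transformed filters, shape [K, L, length, 2].
brir_prime = shc_render[:, :, None, None] * brir[None, :, :, :]

# Equations (4)-(5): sum over the loudspeaker dimension, shape [K, length, 2].
brir_summed = brir_prime.sum(axis=1)
```

  • The result has the [(N+1)^2, Length, 2] layout stated for the summed SHC-binaural rendering matrices; the intermediate per-speaker array is kept only to mirror the two steps and could be skipped with a single matrix product.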
  • Techniques for model reduction may also be applied to the altered rendering filters, which allows SHCs 27' (e.g., the SHC contents) to be directly filtered with the new filter matrix (a summed SHC-binaural rendering matrix).
  • Binaural audio renderer 34 may then convert to binaural audio by summing the filtered arrays to obtain the binaural output signals 35A, 35B.
  • BRIR filters 37 of audio playback system 32 represent transformed BRIR filters in the spherical harmonics domain previously computed according to any one or more of the above-described techniques.
  • transformation of original BRIR filters 37 may be performed at run-time.
  • the techniques may promote further reduction of the computation of binaural outputs 35A, 35B by using only the SHC-binaural rendering matrix for either the left or right ear.
  • binaural audio renderer 34 may make conditional decisions for either output signal 35A or 35B as a second channel when rendering the final output.
  • reference to processing content or to modifying rendering matrices described with respect to either the left or right ear should be understood to be similarly applicable to the other ear.
  • binaural audio renderer 34 may provide efficient rendering of binaural output signals 35A, 35B from SHCs 27'.
  • FIG. 4 is a block diagram illustrating an example binaural room impulse response (BRIR).
  • BRIR 40 illustrates five segments 42A-42E.
  • the initial segment 42A and tail segment 42E both include quiet samples that may be insignificant and excluded from rendering computation.
  • Head-related transfer function (HRTF) segment 42B includes the impulse response due to head-related transfer and may be identified using techniques described herein.
  • Early echoes (alternatively, "early reflections") segment 42C and late room reverb segment 42D combine the HRTF with room effects, i.e., the impulse response of early echoes segment 42C matches that of the HRTF for BRIR 40 filtered by early echoes and late reverberation of the room.
  • Early echoes segment 42C may include more discrete echoes in comparison to late room reverb segment 42D, however.
  • the mixing time is the time between early echoes segment 42C and late room reverb segment 42D and indicates the time at which early echoes become dense reverb.
  • the mixing time is illustrated as occurring at approximately 1.5 × 10^4 samples into the HRTF, or approximately 7.0 × 10^4 samples from the onset of HRTF segment 42B.
  • the techniques include computing the mixing time using statistical data and estimation from the room volume.
  • the perceptual mixing time with 50% confidence interval, t_mp50, is approximately 36 milliseconds (ms) and with 95% confidence interval, t_mp95, is approximately 80 ms.
  • FIG. 5 is a block diagram illustrating an example systems model 50 for producing a BRIR, such as BRIR 40 of FIG. 4, in a room.
  • the model includes cascaded systems, here room 52A and HRTF 52B. After HRTF 52B is applied to an impulse, the impulse response matches that of the HRTF filtered by early echoes of the room 52A.
  • FIG. 6 is a block diagram illustrating a more in-depth systems model 60 for producing a BRIR, such as BRIR 40 of FIG. 4, in a room.
  • This model 60 also includes cascaded systems, here HRTF 62A, early echoes 62B, and residual room 62C (which combines HRTF and room echoes).
  • Model 60 depicts the decomposition of room 52A into early echoes 62B and residual room 62C and treats each system 62A, 62B, 62C as linear time-invariant.
  • Early echoes 62B includes more discrete echoes than residual room 62C. Accordingly, early echoes 62B may vary per virtual speaker channel, while residual room 62C having a longer tail may be synthesized as a single stereo copy.
  • HRTF data may be available as measured in an anechoic chamber.
  • Early echoes 62B may be determined by deconvolving the BRIR and the HRTF data to identify the location of early echoes (which may be referred to as "reflections"). In some examples, HRTF data is not readily available and the techniques for identifying early echoes 62B include blind estimation.
  • a straightforward approach may include regarding the first few milliseconds (e.g., the first 5, 10, 15, or 20 ms) as direct impulse filtered by the HRTF.
  • the techniques may include computing the mixing time using statistical data and estimation from the room volume.
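  • The straightforward segmentation approach can be sketched as follows. This is a rough illustration only: the function name `segment_brir`, the 48 kHz sampling rate, the -60 dB quiet threshold, the 5 ms HRTF window, and the 80 ms mixing time are assumptions chosen for the example, and the boundaries are measured from the first non-quiet sample.

```python
import numpy as np

def segment_brir(brir, fs=48000, hrtf_ms=5.0, mixing_ms=80.0, quiet_db=-60.0):
    """Split a 1-D BRIR into HRTF, early-echo, and late-reverb segments.

    All defaults are illustrative assumptions, not values fixed by the text.
    """
    # Drop leading/trailing samples that fall below the quiet threshold.
    threshold = np.max(np.abs(brir)) * 10.0 ** (quiet_db / 20.0)
    active = np.flatnonzero(np.abs(brir) >= threshold)
    trimmed = brir[active[0]:active[-1] + 1]

    # Convert the segment boundaries from milliseconds to samples.
    hrtf_end = int(round(hrtf_ms * 1e-3 * fs))
    mix = int(round(mixing_ms * 1e-3 * fs))
    return trimmed[:hrtf_end], trimmed[hrtf_end:mix], trimmed[mix:]

# Synthetic example: silence, an exponentially decaying noise burst, silence.
rng = np.random.default_rng(0)
decay = rng.standard_normal(24000) * np.exp(-np.arange(24000) / 4000.0)
brir = np.concatenate([np.zeros(100), decay, np.zeros(100)])
hrtf, early, late = segment_brir(brir)
```

  • With these assumed settings the HRTF segment is 240 samples and the early-echo segment runs up to sample 3840 of the trimmed response; a mixing time estimated from room volume would simply replace `mixing_ms`.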
  • the techniques may include synthesizing one or more BRIR filters for residual room 62C.
  • BRIR reverb tails (represented as system residual room 62C in FIG. 6) can be interchanged in some instances without perceptual penalty.
  • the BRIR reverb tails can be synthesized with Gaussian white noise that matches the Energy Decay Relief (EDR) and Frequency-Dependent Interaural Coherence (FDIC).
  • a common synthetic BRIR reverb tail may be generated for BRIR filters.
  • the common EDR may be an average of the EDRs of all speakers or may be the front zero degree EDR with energy matching to the average energy.
  • the FDIC may be an average FDIC across all speakers or may be the minimum value across all speakers for a maximally decorrelated measure for spaciousness.
  • reverb tails can also be simulated with artificial reverb with Feedback Delay Networks (FDN).
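  • A heavily simplified sketch of a common synthetic tail is shown below. It shapes Gaussian white noise with a broadband energy envelope averaged over all speakers and matches the average total energy. This is an assumption-laden stand-in: a true EDR is a time-frequency quantity, and FDIC matching between the ears is not attempted here. The function name `synth_common_tail` and all sizes are hypothetical.

```python
import numpy as np

def synth_common_tail(tails, smooth=64, seed=0):
    """Synthesize one tail from per-speaker tails of shape [b, L] (one ear)."""
    rng = np.random.default_rng(seed)
    # Average the instantaneous energy over the L speakers, then smooth in time.
    energy = np.mean(tails ** 2, axis=1)
    kernel = np.ones(smooth) / smooth
    envelope = np.sqrt(np.convolve(energy, kernel, mode="same"))
    # Shape Gaussian white noise with the averaged decay envelope.
    tail = rng.standard_normal(tails.shape[0]) * envelope
    # Match the total energy to the average energy of the original tails.
    return tail * np.sqrt(np.sum(energy) / np.sum(tail ** 2))

# Illustrative input: 22 decaying-noise tails of 4096 samples each.
rng = np.random.default_rng(1)
tails = rng.standard_normal((4096, 22)) * np.exp(-np.arange(4096) / 800.0)[:, None]
common_tail = synth_common_tail(tails)
```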
  • FIG. 7 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure. While illustrated as a single device, i.e., audio playback device 100 in the example of FIG. 7, the techniques may be performed by one or more devices. Accordingly, the techniques should not be limited in this respect.
  • audio playback device 100 may include an extraction unit 104 and a binaural rendering unit 102.
  • the extraction unit 104 may represent a unit configured to extract encoded audio data from bitstream 120.
  • the extraction unit 104 may forward the extracted encoded audio data in the form of spherical harmonic coefficients (SHCs) 122 (which may also be referred to as higher order ambisonics (HOA) in that the SHCs 122 may include at least one coefficient associated with an order greater than one) to the binaural rendering unit 102.
  • audio playback device 100 includes an audio decoding unit configured to decode the encoded audio data so as to generate the SHCs 122.
  • the audio decoding unit may perform an audio decoding process that is in some aspects reciprocal to the audio encoding process used to encode SHCs 122.
  • the audio decoding unit may include a time-frequency analysis unit configured to transform SHCs of encoded audio data from the time domain to the frequency domain, thereby generating the SHCs 122.
  • the audio decoding unit may invoke the time-frequency analysis unit to convert the SHCs from the time domain to the frequency domain so as to generate SHCs 122 (specified in the frequency domain).
  • the time-frequency analysis unit may apply any form of Fourier-based transform, including a fast Fourier transform (FFT), a discrete cosine transform (DCT), a modified discrete cosine transform (MDCT), and a discrete sine transform (DST) to provide a few examples, to transform the SHCs from the time domain to SHCs 122 in the frequency domain.
  • SHCs 122 may already be specified in the frequency domain in bitstream 120.
  • the time-frequency analysis unit may pass SHCs 122 to the binaural rendering unit 102 without applying a transform or otherwise transforming the received SHCs 122. While described with respect to SHCs 122 specified in the frequency domain, the techniques may be performed with respect to SHCs 122 specified in the time domain.
  • Binaural rendering unit 102 represents a unit configured to binauralize SHCs 122.
  • Binaural rendering unit 102 may, in other words, represent a unit configured to render the SHCs 122 to a left and right channel, which may feature spatialization to model how the left and right channel would be heard by a listener in a room in which the SHCs 122 were recorded.
  • the binaural rendering unit 102 may render SHCs 122 to generate a left channel 136A and a right channel 136B (which may collectively be referred to as "channels 136") suitable for playback via a headset, such as headphones. As shown in the example of FIG.
  • the binaural rendering unit 102 includes BRIR filters 108, a BRIR conditioning unit 106, a residual room response unit 110, a BRIR SHC-domain conversion unit 112, a convolution unit 114, and a combination unit 116.
  • BRIR filters 108 include one or more BRIR filters and may represent an example of BRIR filters 37 of FIG. 3.
  • BRIR filters 108 may include separate BRIR filters 126A, 126B representing the effect of the left and right HRTF on the respective BRIRs.
  • BRIR conditioning unit 106 receives L instances of BRIR filters 126A, 126B, one for each virtual loudspeaker L and with each BRIR filter having length N. BRIR filters 126A, 126B may already be conditioned to remove quiet samples. BRIR conditioning unit 106 may apply techniques described above to segment BRIR filters 126A, 126B to identify respective HRTF, early reflection, and residual room segments.
  • BRIR conditioning unit 106 provides the HRTF and early reflection segments to BRIR SHC-domain conversion unit 112 as matrices 129A, 129B representing left and right matrices of size [a, L], where a is a length of the concatenation of the HRTF and early reflection segments and L is a number of loudspeakers (virtual or real).
  • BRIR conditioning unit 106 provides the residual room segments of BRIR filters 126A, 126B to residual room response unit 110 as left and right residual room matrices 128A, 128B of size [b, L], where b is a length of the residual room segments and L is a number of loudspeakers (virtual or real).
  • Residual room response unit 110 may apply techniques described above to compute or otherwise determine left and right common residual room response segments for convolution with at least some portion of the hierarchical elements (e.g., spherical harmonic coefficients) describing the sound field, as represented in FIG. 7 by SHCs 122. That is, residual room response unit 110 may receive left and right residual room matrices 128A, 128B and combine respective left and right residual room matrices 128A, 128B over L to generate left and right common residual room response segments. Residual room response unit 110 may perform the combination by, in some instances, averaging the left and right residual room matrices 128A, 128B over L.
  • Residual room response unit 110 may then compute a fast convolution of the left and right common residual room response segments with at least one channel of SHCs 122, illustrated in FIG. 7 as channel(s) 124B.
  • channel(s) 124B is the W channel (i.e., 0th order) of the SHCs 122 channels, which encodes the non-directional portion of a sound field.
  • fast convolution by residual room response unit 110 with left and right common residual room response segments produces left and right output signals 134A, 134B of length Length.
  • "fast convolution" and "convolution" may refer to a convolution operation in the time domain as well as to a point-wise multiplication operation in the frequency domain.
  • convolution in the time domain is equivalent to point-wise multiplication in the frequency domain, where the time and frequency domains are transforms of one another.
  • the output transform is the point-wise product of the input transform with the transfer function.
  • convolution and point-wise multiplication can refer to conceptually similar operations made with respect to the respective domains (time and frequency, herein).
  • Convolution units 114, 214, 230; residual room response units 210, 354; and filters 384 and reverb 386 may alternatively apply multiplication in the frequency domain, where the inputs to these components are provided in the frequency domain rather than the time domain.
  • Other operations described herein as "fast convolution" or "convolution" may, similarly, also refer to multiplication in the frequency domain, where the inputs to these operations are provided in the frequency domain rather than the time domain.
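  • The equivalence between time-domain convolution and point-wise multiplication in the frequency domain can be checked numerically. The sketch below uses NumPy's real FFT and zero-pads both inputs to the full linear-convolution length; the signal lengths and random data are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)   # input signal (e.g., one SHC channel)
h = rng.standard_normal(128)    # impulse response (e.g., one BRIR segment)

# Time domain: direct linear convolution.
y_time = np.convolve(x, h)

# Frequency domain: zero-pad both inputs to the full output length,
# multiply the spectra point-wise, and transform back.
n = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
```

  • The zero-padding matters: without it, the frequency-domain product corresponds to circular rather than linear convolution.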
  • residual room response unit 110 may receive, from BRIR conditioning unit 106, a value for an onset time of the common residual room response segments. Residual room response unit 110 may zero-pad or otherwise delay the output signals 134A, 134B in anticipation of combination with earlier segments for the BRIR filters 108.
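  • The residual room path (combine over L, convolve with the W channel, delay by the onset) might look like the following sketch. The sizes, the onset value, and the helper name `fast_convolve` are assumptions for the example, and the sketch keeps the full convolution length rather than truncating to Length.

```python
import numpy as np

def fast_convolve(x, h):
    """Linear convolution via point-wise multiplication of zero-padded spectra."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

rng = np.random.default_rng(0)
L, b, onset, length = 22, 512, 300, 2048    # assumed sizes; onset in samples
residual = rng.standard_normal((b, L, 2))   # residual room segments, last axis = left/right
w_channel = rng.standard_normal(length)     # 0th-order (W) SHC channel

# Combine over the loudspeaker dimension by averaging -> [b, 2].
common = residual.mean(axis=1)

# One fast convolution per ear instead of one per loudspeaker.
out = np.stack([fast_convolve(w_channel, common[:, ear]) for ear in range(2)], axis=1)

# Delay by the onset so the tail lines up after the earlier BRIR segments.
delayed = np.concatenate([np.zeros((onset, 2)), out], axis=0)
```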
  • BRIR SHC-domain conversion unit 112 (hereinafter "domain conversion unit 112") applies an SHC rendering matrix to BRIR matrices to potentially convert the left and right BRIR filters 126A, 126B to the spherical harmonic domain and then to potentially sum the filters over L.
  • Domain conversion unit 112 outputs the conversion result as left and right SHC-binaural rendering matrices 130A, 130B, respectively.
  • matrices 129A, 129B are of size [a, L]
  • each of SHC-binaural rendering matrices 130A, 130B is of size [(N+1)^2, a] after summing the filters over L (see equations (4)-(5) for example).
  • SHC-binaural rendering matrices 130A, 130B are configured in audio playback device 100 rather than being computed at run-time or a setup-time.
  • multiple instances of SHC-binaural rendering matrices 130A, 130B are configured in audio playback device 100, and audio playback device 100 selects a left/right pair of the multiple instances to apply to SHCs 124A.
  • Convolution unit 114 convolves left and right binaural rendering matrices 130A, 130B with SHCs 124A, which may in some examples be reduced in order from the order of SHCs 122.
  • For SHCs 124A in the frequency (e.g., SHC) domain, convolution unit 114 may compute respective point-wise multiplications of SHCs 124A with left and right binaural rendering matrices 130A, 130B.
  • For an SHC signal of length Length, the convolution results in left and right filtered SHC channels 132A, 132B of size [Length, (N+1)^2], there typically being a row for each output signals matrix for each order/sub-order combination of the spherical harmonics domain.
  • Combination unit 116 may combine left and right filtered SHC channels 132A, 132B with output signals 134A, 134B to produce binaural output signals 136A, 136B. Combination unit 116 may then separately sum each left and right filtered SHC channels 132A, 132B over L to produce left and right binaural output signals for the HRTF and early echoes (reflection) segments prior to combining the left and right binaural output signals with left and right output signals 134A, 134B to produce binaural output signals 136A, 136B.
  • FIG. 8 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
  • Audio playback device 200 may represent an example instance of audio playback device 100 of FIG. 7 in further detail.
  • Audio playback device 200 may include an optional SHCs order reduction unit 204 that processes inbound SHCs 242 from bitstream 240 to reduce an order of the SHCs 242.
  • Optional SHCs order reduction unit 204 provides the highest-order (e.g., 0th order) channel 262 of SHCs 242 (e.g., the W channel) to residual room response unit 210, and provides reduced-order SHCs 242 to convolution unit 230.
  • convolution unit 230 receives SHCs 272 that are identical to SHCs 242. In either case, SHCs 272 have dimensions [Length, (N+1)^2], where N is the order of SHCs 272.
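  • Order reduction by truncation can be sketched as below. The sketch assumes the SHC channels are stored in order-major (ACN-style) ordering, so that the channels of order n or less occupy the first (n+1)^2 columns and the 0th-order (W) channel is column 0; this ordering and the function name `reduce_order` are assumptions, not something stated in the text.

```python
import numpy as np

def reduce_order(shc, n_out):
    """Keep the SHC channels of order <= n_out.

    shc has shape [Length, (N_in + 1) ** 2]; order-major channel
    ordering is an assumption of this sketch.
    """
    n_in = int(round(np.sqrt(shc.shape[1]))) - 1
    if n_out > n_in:
        raise ValueError("cannot increase the order by truncation")
    return shc[:, : (n_out + 1) ** 2]

shc_in = np.arange(1024 * 25, dtype=float).reshape(1024, 25)  # order-4 input
shc_reduced = reduce_order(shc_in, 2)   # order 2 -> 9 channels
w_channel = shc_in[:, 0]                # 0th-order channel for the residual room path
```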
  • BRIR conditioning unit 206 and BRIR filters 208 may represent example instances of BRIR conditioning unit 106 and BRIR filters 108 of FIG. 7.
  • Convolution unit 214 of residual room response unit 210 receives common left and right residual room segments 244A, 244B conditioned by BRIR conditioning unit 206 using techniques described above, and convolution unit 214 convolves the common left and right residual room segments 244A, 244B with highest-order channel 262 to produce left and right residual room signals 262A, 262B.
  • Delay unit 216 may zero-pad the left and right residual room signals 262A, 262B with the onset number of samples to the common left and right residual room segments 244A, 244B to produce left and right residual room output signals 268A, 268B.
  • BRIR SHC-domain conversion unit 220 may represent an example instance of domain conversion unit 112 of FIG. 7.
  • transform unit 222 applies an SHC rendering matrix 224 of (N+1)^2 dimensionality to matrices 248A, 248B representing left and right matrices of size [a, L], where a is a length of the concatenation of the HRTF and early reflection segments and L is a number of loudspeakers (e.g., virtual loudspeakers).
  • Transform unit 222 outputs left and right matrices 252A, 252B in the SHC-domain having dimensions [(N+1)^2, a, L].
  • Summation unit 226 may sum each of left and right matrices 252A, 252B over L to produce left and right intermediate SHC-rendering matrices 254A, 254B having dimensions [(N+1)^2, a].
  • Reduction unit 228 may apply techniques described above to further reduce computation complexity of applying SHC-rendering matrices to SHCs 272, such as minimum-phase reduction and using Balanced Model Truncation methods to design IIR filters to approximate the frequency response of the respective minimum phase portions of intermediate SHC-rendering matrices 254A, 254B that have had minimum-phase reduction applied.
  • Reduction unit 228 outputs left and right SHC-rendering matrices 256A, 256B.
  • Convolution unit 230 filters the SHC contents in the form of SHCs 272 to produce intermediate signals 258A, 258B, which summation unit 232 sums to produce left and right signals 260A, 260B.
  • Combination unit 234 combines left and right residual room output signals 268A, 268B and left and right signals 260A, 260B to produce left and right binaural output signals 270A, 270B.
  • binaural rendering unit 202 may implement further reductions to computation by using only one of the SHC-binaural rendering matrices 252A, 252B generated by transform unit 222.
  • convolution unit 230 may operate on just one of the left or right signals, reducing convolution operations by half.
  • Summation unit 232 makes conditional decisions for the second channel when rendering the outputs 260A, 260B.
  • FIG. 9 is a flowchart illustrating an example mode of operation for a binaural rendering device to render spherical harmonic coefficients according to techniques described in this disclosure.
  • the example mode of operation is described with respect to audio playback device 200 of FIG. 8.
  • Binaural room impulse response (BRIR) conditioning unit 206 conditions left and right BRIR filters 246A, 246B, respectively, by extracting direction-dependent components/segments from the BRIR filters 246A, 246B, specifically the head-related transfer function and early echoes segments (300).
  • Each of left and right BRIR filters 246A, 246B may include BRIR filters for one or more corresponding loudspeakers.
  • BRIR conditioning unit 206 provides a concatenation of the extracted head-related transfer function and early echoes segments to BRIR SHC-domain conversion unit 220 as left and right matrices 248A, 248B.
  • BRIR SHC-domain conversion unit 220 applies an HOA rendering matrix 224 to transform left and right filter matrices 248A, 248B including the extracted head-related transfer function and early echoes segments to generate left and right filter matrices 252A, 252B in the spherical harmonic (e.g., HOA) domain (302).
  • audio playback device 200 may be configured with left and right filter matrices 252A, 252B.
  • audio playback device 200 receives BRIR filters 208 in an out-of-band or in-band signal of bitstream 240, in which case audio playback device 200 generates left and right filter matrices 252A, 252B.
  • Summation unit 226 sums the respective left and right filter matrices 252A, 252B over the loudspeaker dimension to generate a binaural rendering matrix in the SHC domain that includes left and right intermediate SHC-rendering matrices 254A, 254B (304).
  • a reduction unit 228 may further reduce the intermediate SHC-rendering matrices 254A, 254B to generate left and right SHC-rendering matrices 256A, 256B.
  • a convolution unit 230 of binaural rendering unit 202 applies the left and right intermediate SHC-rendering matrices 256A, 256B to SHC content (such as spherical harmonic coefficients 272) to produce left and right filtered SHC (e.g., HOA) channels 258A, 258B (306).
  • Summation unit 232 sums each of the left and right filtered SHC channels 258A, 258B over the SHC dimension, (N+1)^2, to produce left and right signals 260A, 260B for the direction-dependent segments (308).
  • Combination unit 234 may then combine the left and right signals 260A, 260B with left and right residual room output signals 268A, 268B to generate a binaural output signal including left and right binaural output signals 270A, 270B.
  • FIG. 10A is a diagram illustrating an example mode of operation 310 that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure. Mode of operation 310 is described hereinafter with respect to audio playback device 200 of FIG. 8.
  • Binaural rendering unit 202 of audio playback device 200 may be configured with BRIR data 312, which may be an example instance of BRIR filters 208, and HOA rendering matrix 314, which may be an example instance of HOA rendering matrix 224.
  • Audio playback device 200 may receive BRIR data 312 and HOA rendering matrix 314 in an in-band or out-of-band signaling channel vis-a-vis the bitstream 240.
  • BRIR data 312 in this example has L filters representing, for instance, L real or virtual loudspeakers, each of the L filters being length K.
  • Each of the L filters may include left and right components ("x 2").
  • each of the L filters may include a single component for left or right, which is symmetrical to its counterpart: right or left. This may reduce a cost of fast convolution.
  • BRIR conditioning unit 206 of audio playback device 200 may condition the BRIR data 312 by applying segmentation and combination operations. Specifically, in the example mode of operation 310, BRIR conditioning unit 206 segments each of the L filters according to techniques described herein into HRTF plus early echo segments of combined length a to produce matrix 315 (dimensionality [a, 2, L]) and into residual room response segments to produce residual matrix 339 (dimensionality [b, 2, L]) (324).
  • the length K of the L filters of BRIR data 312 is approximately the sum of a and b.
  • Transform unit 222 may apply HOA/SHC rendering matrix 314 of (N+1)^2 dimensionality to the L filters of matrix 315 to produce matrix 317 (which may be an example instance of a combination of left and right matrices 252A, 252B) of dimensionality [(N+1)^2, a, 2, L].
  • Summation unit 226 may sum each of left and right matrices 252A, 252B over L to produce intermediate SHC-rendering matrix 335 having dimensionality [(N+1)^2, a, 2] (the third dimension having value 2 representing left and right components; intermediate SHC-rendering matrix 335 may represent an example instance of both left and right intermediate SHC-rendering matrices 254A, 254B) (326).
  • audio playback device 200 may be configured with intermediate SHC-rendering matrix 335 for application to the HOA content 316 (or reduced version thereof, e.g., HOA content 321).
  • reduction unit 228 may apply further reductions to computation by using only one of the left or right components of matrix 317 (328).
  • Audio playback device 200 receives HOA content 316 of order N_I and length Length and, in some aspects, applies an order reduction operation to reduce the order of the spherical harmonic coefficients (SHCs) therein to N (330).
  • N_I indicates the order of the (I)nput HOA content 321.
  • the HOA content 321 of order reduction operation (330) is, like HOA content 316, in the SHC domain.
  • the optional order reduction operation also generates and provides the highest-order (e.g., the 0th order) signal 319 to residual response unit 210 for a fast convolution operation (338).
  • In instances in which HOA order reduction unit 204 does not reduce an order of HOA content 316, the apply fast convolution operation (332) operates on input that does not have a reduced order.
  • HOA content 321 input to the fast convolution operation (332) has dimensions [Length, (N+1)^2], where N is the order.
  • Audio playback device 200 may apply fast convolution of HOA content 321 with matrix 335 to produce HOA signal 323 having left and right components thus dimensions [Length, (N+1)^2, 2] (332). Again, fast convolution may refer to point-wise multiplication of the HOA content 321 and matrix 335 in the frequency domain or convolution in the time domain. Audio playback device 200 may further sum HOA signal 323 over (N+1)^2 to produce a summed signal 325 having dimensions [Length, 2] (334).
  • audio playback device 200 may combine the L residual room response segments, in accordance with techniques herein described, to generate a common residual room response matrix 327 having dimensions [b, 2] (336). Audio playback device 200 may apply fast convolution of the 0th order HOA signal 319 with the common residual room response matrix 327 to produce room response signal 329 having dimensions [Length, 2] (338).
  • Because, to generate the L residual room response segments of residual matrix 339, audio playback device 200 obtained the residual room response segments starting at the (a+1)th samples of the L filters of BRIR data 312, audio playback device 200 accounts for the initial a samples by delaying (e.g., padding) a samples to generate room response signal 311 having dimensions [Length, 2] (340).
  • Audio playback device 200 combines summed signal 325 with room response signal 311 by adding the elements to produce output signal 318 having dimensions [Length, 2] (342). In this way, audio playback device may avoid applying fast convolution for each of the L residual room response segments. For a 22 channel input for conversion to binaural audio output signal, this may reduce the number of fast convolutions for generating the residual room response from 22 to 2.
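  • The steps of mode of operation 310 can be traced end to end with small arrays to check the stated dimensions. This is a sketch under several assumptions: the rendering matrix is a plain [(N+1)^2, L] gain matrix with random values, the residual segments are combined by averaging, outputs are truncated to Length, and all sizes and the helper name `fast_convolve` are made up for the example.

```python
import numpy as np

def fast_convolve(x, h, out_len):
    """FFT-based linear convolution, truncated to out_len samples."""
    n = len(x) + len(h) - 1
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:out_len]

rng = np.random.default_rng(0)
N, L, a, b, length = 2, 22, 64, 192, 1024   # assumed sizes
K = (N + 1) ** 2
brir = rng.standard_normal((a + b, 2, L))   # BRIR data 312: [a + b, 2, L]
render = rng.standard_normal((K, L))        # rendering matrix 314 (hypothetical values)
hoa = rng.standard_normal((length, K))      # HOA content 321: [Length, (N+1)^2]

m315, m339 = brir[:a], brir[a:]                   # (324) segment: [a, 2, L] and [b, 2, L]
m335 = np.einsum("kl,ael->kae", render, m315)     # (326) transform and sum over L: [K, a, 2]

summed = np.zeros((length, 2))                    # (332) + (334)
for k in range(K):
    for ear in range(2):
        summed[:, ear] += fast_convolve(hoa[:, k], m335[k, :, ear], length)

common = m339.mean(axis=2)                        # (336) common residual: [b, 2]
room = np.stack(
    [fast_convolve(hoa[:, 0], common[:, ear], length) for ear in range(2)], axis=1
)                                                 # (338) two convolutions, not 2 * L
room = np.concatenate([np.zeros((a, 2)), room])[:length]   # (340) delay by a samples
output = summed + room                            # (342) output: [Length, 2]
```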
  • FIG. 10B is a diagram illustrating an example mode of operation 350 that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure.
  • Mode of operation 350 is described hereinafter with respect to audio playback device 200 of FIG. 8 and is similar to mode of operation 310.
  • mode of operation 350 includes first rendering the HOA content into multichannel speaker signals in the time domain for L real or virtual loudspeakers, and then applying efficient BRIR filtering on each of the speaker feeds, in accordance with techniques described herein.
  • audio playback device 200 transforms HOA content 321 to multichannel audio signal 333 having dimensions [Length, L] (344).
  • audio playback device does not transform BRIR data 312 to the SHC domain. Accordingly, applying reduction by audio playback device 200 to signal 314 generates matrix 337 having dimensions [a, 2, L] (328).
  • Audio playback device 200 then applies fast convolution 332 of multichannel audio signal 333 with matrix 337 to produce multichannel audio signal 341 having dimensions [Length, L, 2] (with left and right components) (348). Audio playback device 200 may then sum the multichannel audio signal 341 by the L channels/speakers to produce signal 325 having dimensions [Length, 2] (346).
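  • Because every step involved is linear, rendering to loudspeaker feeds first and then filtering (mode of operation 350) should give the same direction-dependent output as moving the rendering matrix into the filters (mode of operation 310), provided the same rendering matrix is used in both. The sketch below checks this numerically; the matrix values and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, a, length = 2, 5, 32, 256             # assumed sizes
K = (N + 1) ** 2
render = rng.standard_normal((K, L))        # hypothetical HOA rendering matrix
brir = rng.standard_normal((a, 2, L))       # matrix 337: HRTF + early echo segments
hoa = rng.standard_normal((length, K))      # HOA content 321

# (344) Render the HOA content to L loudspeaker feeds: [Length, L].
feeds = hoa @ render

# (348) Filter each feed with its own BRIR, then (346) sum over the L feeds.
out_350 = np.zeros((length + a - 1, 2))
for spk in range(L):
    for ear in range(2):
        out_350[:, ear] += np.convolve(feeds[:, spk], brir[:, ear, spk])

# Mode 310 for comparison: fold the rendering matrix into the filters first.
shc_filters = np.einsum("kl,ael->kae", render, brir)   # [K, a, 2]
out_310 = np.zeros_like(out_350)
for k in range(K):
    for ear in range(2):
        out_310[:, ear] += np.convolve(hoa[:, k], shc_filters[k, :, ear])
```

  • In this sketch the speaker-domain path uses 2L convolutions and the SHC-domain path uses 2(N+1)^2, so which one is cheaper depends on how L compares with (N+1)^2.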
  • FIG. 11 is a block diagram illustrating an example of an audio playback device 350 that may perform various aspects of the binaural audio rendering techniques described in this disclosure. While illustrated as a single device, i.e., audio playback device 350 in the example of FIG. 11, the techniques may be performed by one or more devices. Accordingly, the techniques should not be limited in this respect.
  • Moreover, while generally described above with respect to the examples of FIGS. 1-10B as being applied in the spherical harmonics domain, the techniques may also be implemented with respect to any form of audio signals, including channel-based signals that conform to the above noted surround sound formats, such as the 5.1 surround sound format, the 7.1 surround sound format, and/or the 22.2 surround sound format. The techniques should therefore also not be limited to audio signals specified in the spherical harmonic domain, but may be applied with respect to any form of audio signal.
  • the audio playback device 350 may be similar to the audio playback device 100 shown in the example of FIG. 7. However, the audio playback device 350 may operate or otherwise perform the techniques with respect to general channel-based audio signals that, as one example, conform to the 22.2 surround sound format.
  • the extraction unit 104 may extract audio channels 352, where audio channels 352 may generally include "n" channels, and is assumed to include, in this example, 22 channels that conform to the 22.2 surround sound format. These channels 352 are provided to both residual room response unit 354 and per-channel truncated filter unit 356 of the binaural rendering unit 351.
  • the BRIR filters 108 include one or more BRIR filters and may represent an example of the BRIR filters 37 of FIG. 3.
  • the BRIR filters 108 may include the separate BRIR filters 126A, 126B representing the effect of the left and right HRTF on the respective BRIRs.
  • the BRIR conditioning unit 106 receives n instances of the BRIR filters 126A, 126B, one for each channel n and with each BRIR filter having length N.
  • the BRIR filters 126A, 126B may already be conditioned to remove quiet samples.
  • the BRIR conditioning unit 106 may apply techniques described above to segment the BRIR filters 126A, 126B to identify respective HRTF, early reflection, and residual room segments.
  • the BRIR conditioning unit 106 provides the HRTF and early reflection segments to the per-channel truncated filter unit 356 as matrices 129A, 129B representing left and right matrices of size [a, n], where a is a length of the concatenation of the HRTF and early reflection segments and n is a number of loudspeakers (virtual or real).
  • the BRIR conditioning unit 106 provides the residual room segments of BRIR filters 126A, 126B to residual room response unit 354 as left and right residual room matrices 128A, 128B of size [b, n], where b is a length of the residual room segments and n is a number of loudspeakers (virtual or real).
  • the residual room response unit 354 may apply techniques described above to compute or otherwise determine left and right common residual room response segments for convolution with the audio channels 352. That is, the residual room response unit 354 may receive the left and right residual room matrices 128A, 128B and combine the respective left and right residual room matrices 128A, 128B over n to generate left and right common residual room response segments. The residual room response unit 354 may perform the combination by, in some instances, averaging the left and right residual room matrices 128A, 128B over n.
  • the residual room response unit 354 may then compute a fast convolution of the left and right common residual room response segments with at least one of the audio channels 352.
  • the residual room response unit 354 may receive, from the BRIR conditioning unit 106, a value for an onset time of the common residual room response segments.
  • Residual room response unit 354 may zero-pad or otherwise delay the output signals 134A, 134B in anticipation of combination with earlier segments for the BRIR filters 108.
  • the output signals 134A may represent left audio signals while the output signals 134B may represent right audio signals.
  • the per-channel truncated filter unit 356 may apply the HRTF and early reflection segments of the BRIR filters to the channels 352. More specifically, the per-channel truncated filter unit 356 may apply the matrixes 129A and 129B representative of the HRTF and early reflection segments of the BRIR filters to each one of the channels 352. In some instances, the matrixes 129A and 129B may be combined to form a single matrix 129. Moreover, typically, there is a left one of each of the HRTF and early reflection matrices 129A and 129B and a right one of each of the HRTF and early reflection matrices 129A and 129B.
  • the per-channel truncated filter unit 356 may apply each of the left and right matrixes 129A, 129B to output left and right filtered channels 358A and 358B.
  • the combination unit 116 may combine (or, in other words, mix) the left filtered channels 358A with the output signals 134A, while combining (or, in other words, mixing) the right filtered channels 358B with the output signals 134B to produce binaural output signals 136A, 136B.
  • the binaural output signal 136A may correspond to a left audio channel.
  • the binaural output signal 136B may correspond to a right audio channel.
  • the binaural rendering unit 351 may invoke the residual room response unit 354 and the per-channel truncated filter unit 356 concurrently with one another such that the residual room response unit 354 operates concurrently with the per-channel truncated filter unit 356. That is, in some examples, the residual room response unit 354 may operate in parallel (but often not simultaneously) with the per-channel truncated filter unit 356, often to improve the speed with which the binaural output signals 136A, 136B may be generated. While shown in various FIGS. above as potentially operating in a cascaded fashion, the techniques may provide for concurrent or parallel operation of any of the units or modules described in this disclosure, unless specifically indicated otherwise.
  • FIG. 12 is a diagram illustrating a process 380 that may be performed by the audio playback device 350 of FIG. 11 in accordance with various aspects of the techniques described in this disclosure.
  • Process 380 achieves a decomposition of each BRIR into two parts: (a) smaller components which incorporate the effects of HRTF and early reflections, represented by left filters 384A_L-384N_L and by right filters 384A_R-384N_R (collectively, "filters 384"), and (b) a common 'reverb tail' that is generated from properties of all the tails of the original BRIRs and represented by left reverb filter 386L and right reverb filter 386R (collectively, "common filters 386").
  • the per-channel filters 384 shown in the process 380 may represent part (a) noted above, while the common filters 386 shown in the process 380 may represent part (b) noted above.
  • the process 380 performs this decomposition by analyzing the BRIRs to eliminate inaudible components and determine components which comprise the HRTF/early reflections and components due to late reflections/diffusion. This results in an FIR filter of length, as one example, 2704 taps, for part (a) and an FIR filter of length, as another example, 15232 taps for part (b).
  • the audio playback device 350 may apply only the shorter FIR filters to each of the individual n channels, which is assumed to be 22 for purposes of illustration, in operation 396.
  • the complexity of this operation may be represented in the first part of computation (using a 4096 point FFT) in Equation (8) reproduced below.
  • the audio playback device 350 may apply the common 'reverb tail' not to each of the 22 channels but rather to an additive mix of them all in operation 398.
  • This complexity is represented in the second half of the complexity calculation in Equation (8), again which is shown in the attached Appendix.
  • the process 380 may represent a method of binaural audio rendering that generates a composite audio signal, based on mixing audio content from a plurality of N channels.
  • process 380 may further align the composite audio signal, by a delay, with the output of N channel filters, wherein each channel filter includes a truncated BRIR filter.
  • the audio playback device 350 may then filter the aligned composite audio signal with a common synthetic residual room impulse response in operation 398 and mix the output of each channel filter with the filtered aligned composite audio signal in operations 390L and 390R for the left and right components of binaural audio output 388L, 388R.
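  • The structure of process 380 (per-channel truncated filters plus one common tail applied to a delayed additive mix) can be illustrated with a minimal sketch for a single ear. The function names `convolve`, `add_into` and `render_one_ear`, the toy filter taps, and the use of direct time-domain convolution are assumptions made for illustration; the description itself contemplates FFT-based filtering.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add_into(acc, sig, offset=0):
    """Add sig into acc starting at offset, growing acc when needed."""
    need = offset + len(sig)
    if need > len(acc):
        acc.extend([0.0] * (need - len(acc)))
    for k, v in enumerate(sig):
        acc[offset + k] += v
    return acc


def render_one_ear(channels, truncated_filters, common_tail, delay):
    """Sum of per-channel truncated-BRIR outputs plus one shared reverb tail.

    channels: equal-length sample lists, one per input channel.
    truncated_filters: one short FIR (HRTF + early reflections) per channel.
    common_tail: a single FIR applied once to the mix of all channels.
    delay: onset of the tail, in samples, used to align it with the
           per-channel filter outputs.
    """
    out = []
    # Apply the short filter to each individual channel.
    for x, bt in zip(channels, truncated_filters):
        add_into(out, convolve(x, bt))
    # Form the additive mix (composite signal) of all channels.
    mix = [sum(samples) for samples in zip(*channels)]
    # Filter the mix once with the common tail and align it by the delay.
    add_into(out, convolve(mix, common_tail), offset=delay)
    return out


# Toy data (hypothetical values, two channels).
channels = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
truncated = [[1.0, 0.5], [0.25, 0.0]]
tail = [0.1, 0.05]
decomposed = render_one_ear(channels, truncated, tail, delay=2)
```

  • When every channel shares exactly the same tail, this output equals the one obtained by filtering each channel with its full-length response, while the long filter is applied once instead of once per channel.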
  • the truncated BRIR filter and the common synthetic residual impulse response are pre-loaded in a memory.
  • the filtering of the aligned composite audio signal is performed in a temporal frequency domain.
  • the filtering of the aligned composite audio signal is performed in a time domain through a convolution.
  • the truncated BRIR filter and common synthetic residual impulse response is based on a decomposition analysis.
  • the decomposition analysis is performed on each of N room impulse responses, and results in N truncated room impulse responses and N residual impulse responses (where N may be denoted as n above).
  • the truncated impulse response represents less than forty percent of the total length of each room impulse response.
  • the truncated impulse response includes a tap range between 111 and 17,830.
  • each of the N residual impulse responses is combined into a common synthetic residual room response that reduces complexity.
  • mixing the output of each channel filter with the filtered aligned composite audio signal includes a first set of mixing for a left speaker output, and a second set of mixing for a right speaker output.
  • the method of the various examples of process 380 described above or any combination thereof may be performed by a device comprising a memory and one or more processors, an apparatus comprising means for performing each step of the method, and one or more processors that perform each step of the method by executing instructions stored on a non-transitory computer-readable storage medium.
  • any of the specific features set forth in any of the examples described above may be combined into a beneficial example of the described techniques. That is, any of the specific features are generally applicable to all examples of the techniques. Various examples of the techniques have been described.
  • the techniques described in this disclosure may in some cases identify only samples 111 to 17830 across the BRIR set that are audible. Calculating a mixing time T_mp95 from the volume of an example room, the techniques may then let all BRIRs share a common reverb tail after 53.6 ms, resulting in a 15232-sample-long common reverb tail and remaining 2704-sample HRTF + reflection impulses, with a 3 ms crossfade between them. In terms of a computational cost breakdown, the following may be arrived at:
  • $C_{mod} = \max\left(100 \cdot (C_{conv} - C)/C_{conv},\ 0\right)$, (6) where $C_{conv}$ is an estimate of an unoptimized implementation:
  • C, in some aspects, may be determined by two additive factors:
  • a BRIR filter denoted as $B_n(z)$ may be decomposed into two functions $BT_n(z)$ and $BR_n(z)$, which denote the truncated BRIR filter and the reverb BRIR filter, respectively. Part (a) noted above may refer to this truncated BRIR filter, while part (b) above may refer to the reverb BRIR filter. $B_n(z)$ may then equal $BT_n(z) + z^{-m} \cdot BR_n(z)$, where m denotes the delay.
  • the output signal Y(z) may therefore be computed as:
  • the process 380 may analyze the $BR_n(z)$ to derive a common synthetic reverb tail segment, where this common BR(z) may be applied instead of the channel-specific $BR_n(z)$.
  • When this common (or channel-general) synthetic BR(z) is used, Y(z) may be computed as:
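  • The two expressions for Y(z) are not reproduced in this text. A plausible reconstruction from the stated decomposition of each BRIR into a truncated part and a reverb part delayed by m is given below. It assumes that $X_n(z)$ denotes the n-th input channel (a symbol introduced here for illustration) and that the sum runs over the N channels. The first line uses the channel-specific $BR_n(z)$, and the second uses the common BR(z), which can then be factored out of the sum.

```latex
\begin{align}
Y(z) &= \sum_{n=0}^{N-1} X_n(z)\,\bigl[\,BT_n(z) + z^{-m}\,BR_n(z)\,\bigr] \\
Y(z) &= \sum_{n=0}^{N-1} X_n(z)\,BT_n(z) \;+\; z^{-m}\,BR(z)\sum_{n=0}^{N-1} X_n(z)
\end{align}
```

  • The second form shows why the long reverb filter needs to be applied only once: it acts on the additive mix of the channels rather than on each channel.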
  • FIG. 13 is a diagram of an example binaural room impulse response filter (BRIR) 400.
  • BRIR 400 illustrates three segments 402A-402C.
  • The head-related transfer function (HRTF) segment 402A includes the impulse response due to head-related transfer and may be identified using techniques described herein.
  • the HRTF is equivalent to measuring the impulse response in an anechoic chamber. Since the first reflections of a room usually have a longer delay than HRTF, it is assumed that the first portion of the BRIR is an HRTF impulse response.
  • the reflections segment 402B combines the HRTF with room effects, i.e., the impulse response of the reflections segment 402B matches that of the HRTF segment 402A for the BRIR 400 filtered by early discrete echoes in comparison to the reverberation segment 402C.
  • the mixing time is the time between the reflections segment 402B and the reverberation segment 402C and indicates the time at which early echoes become dense reverb.
  • Reverberation segment 402C behaves like Gaussian noise and discrete echoes can no longer be separated.
  • multichannel audio with high resolution and high channel count are considered.
  • headphone representation is needed. This involves virtualizing all speaker feeds / channels into a stereo headset.
  • a set of one or more pairs of impulse responses may be applied to the multichannel audio.
  • the BRIR 400 may represent one pair of such impulse responses.
  • Applying the BRIR 400 filter using a standard block Fast-Fourier Transform (FFT) to a channel of the multichannel audio may be computationally intensive. Applying an entire set of pairs of impulse responses to corresponding channels of the multichannel audio is even more so.
  • FIG. 14 is a block diagram illustrating a system 410 for a computation of a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal.
  • Each of inputs 412A-412N represents a single channel of an overall multichannel audio signal.
  • Each of BRIRs 414A-414N represents a pair of binaural room impulse response filters having left and right components.
  • the computation procedure applies, to each of the inputs 412A-412N, a corresponding BRIR of BRIRs 414A-414N to generate a binaural audio signal for the single-channel (mono) input as rendered at the location represented by the applied BRIR.
  • the N binaural audio signals are then accumulated by accumulator 416 to produce the stereo headphone signal or overall binaural audio signal, which is output by the system 410 as output 418.
  • FIG. 15 is a block diagram illustrating components of an audio playback device 500 for computing a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal according to techniques described herein.
  • the audio playback device 500 includes multiple components for implementing various computation reduction methods of the present disclosure in combination. Some aspects of the audio playback device 500 may include any combination in any number of the various computation reduction methods. Audio playback device 500 may represent an example of any of audio playback system 32, audio playback device 100, audio playback device 200, and audio playback device 350, and include components similar to any of the above-listed devices for implementing the various computation reduction methods of the present disclosure.
  • the computation reduction methods may include any combination of the following:
  • Part a (corresponding to HRTF segment 402A and HRTF unit 504): usually a few milliseconds, for localization and can be computationally reduced by converting into inter-aural delays (ITDs) and minimum phase filters, which can be further reduced using IIR filters, as one example.
  • Part b (corresponding to reflections segment 402B and reflection unit 502):
  • the length may vary by room and will typically be tens of milliseconds. Although computationally intensive if done for each channel separately, the techniques described herein may apply respective common filters generated for sub-groups of these channels.
  • Part c (corresponding to reverberation segment 402C and reverberation unit 506): A common filter is calculated for all channels (e.g., 22 channels for a 22.2 format). Instead of resynthesizing a new reverb tail based on direct average over the frequency domain Energy Decay Relief (EDR) curve, the reverberation unit 506 applies a different weighting scheme to the average that is optionally enhanced by a correcting weight that changes with input signal content.
  • the audio playback device 500 receives N single channel inputs 412A-412N (collectively, "inputs 412") of a multichannel audio signal and applies segments of binaural room impulse response (BRIR) filters to generate and output a stereo headphone signal or overall binaural audio signal.
  • reflection unit combines the discrete inputs 412 into different groups using weighted sums (weighted using, e.g., adaptive weighting factors 520A_{1-K}-520M_{1-J}, 522A-522N).
  • For the common reverb illustrated, e.g., by reverberation segment 402C, reverberation unit 506 combines inputs 412 together with respective adaptive weighting factors (522A-522N, e.g., stereo, different weights for left/right per input) and then processes the combined inputs using a common reverb filter 524 (a stereo impulse response filter) applied using FFT filtering (after applying a delay 526).
  • Reflection unit 502 applies average reflection filters 512A-512M, similar to common reverb filter 524, to different sub-groups of the inputs 412 combined together into the sub-groups with adaptive weighting factors (520A_{1-K}-520M_{1-J}).
  • HRTF unit 504 applies the head-related transfer function (HRTF) filters 414A-414N (collectively, "HRTF filters 414") that have, in this example device, been converted to interaural time delays (ITDs) 530A-530N and minimum phase filters (these may be further approximated with multi-state infinite impulse response (IIR) filters).
  • The term "adaptive" refers to adjustment of the weighting factors according to qualities of the input signal to which the adaptive weighting factor is applied. In some aspects, the various adaptive weighting factors may not be adaptive.
  • an Echo Density Profile measures the fraction of impulse response taps outside of a window standard deviation, over a 1024-sample sliding window. When the value reaches 1 for the first time, this indicates that the impulse response starts to resemble Gaussian noise and marks the beginning of reverb.
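  • A minimal sketch of such an echo density profile follows. It assumes the common normalization in which the fraction of taps lying more than one standard deviation from the window mean is divided by the fraction expected for Gaussian noise, erfc(1/sqrt(2)), about 0.317, so that the profile approaches 1 for noise-like tails. The function names, the `hop` parameter and the rectangular window are assumptions for illustration and are not taken from the description.

```python
import math
import random


def echo_density_profile(h, window=1024, hop=1):
    """Normalized echo density of impulse response h for each window start."""
    # Fraction of samples of a Gaussian process beyond one standard deviation.
    expected = math.erfc(1.0 / math.sqrt(2.0))
    profile = []
    for start in range(0, len(h) - window + 1, hop):
        seg = h[start:start + window]
        mean = sum(seg) / window
        std = math.sqrt(sum((v - mean) ** 2 for v in seg) / window)
        outside = sum(1 for v in seg if abs(v - mean) > std)
        profile.append((outside / window) / expected)
    return profile


def first_noise_like_window(profile, hop=1, threshold=1.0):
    """Start sample of the first window whose density reaches the threshold."""
    for idx, value in enumerate(profile):
        if value >= threshold:
            return idx * hop
    return None


# Toy usage: a sparse response (isolated echo) versus Gaussian noise.
random.seed(0)
sparse = [0.0] * 1024
sparse[10] = 1.0
noise = [random.gauss(0.0, 1.0) for _ in range(2048)]
sparse_profile = echo_density_profile(sparse, window=1024)
noise_profile = echo_density_profile(noise, window=1024, hop=256)
```

  • For the sparse response the profile stays far below 1, whereas for the noise it hovers around 1, which is the behavior used to mark the start of the reverberation segment.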
  • the final values in milliseconds
  • T_mp50 = 36.1 (50 meaning the average perceptual mixing time based on regression analysis)
  • HRTF unit 504 applies the head-related transfer function (HRTF) filters 414 that have been converted to interaural time delays (ITDs) 530A-530N and minimum phase filters.
  • the minimum phase filter may be obtained by windowing the cepstrum of the original filter; the delay may be estimated by linear regression on the 500 to 4000 Hz frequency region of the phase; for IIR approximation, a Balanced Model Truncation (BMT) method may be used to extract the most important components of the amplitude response on a frequency-warped filter.
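  • The delay estimate can be sketched as a least-squares line fit to the unwrapped phase over the 500 to 4000 Hz band, where the slope of phase versus angular frequency is the negative of the delay. The function name `estimate_delay_samples`, the naive DFT, and the test signal are assumptions for illustration only; a practical implementation would likely use an FFT and operate on measured HRTFs.

```python
import cmath
import math


def estimate_delay_samples(h, fs, f_lo=500.0, f_hi=4000.0):
    """Estimate the delay of h (in samples) from the slope of its phase."""
    n = len(h)
    omegas, phases = [], []
    prev, offset = None, 0.0
    for k in range(1, n // 2):
        f = k * fs / n
        if f > f_hi:
            break
        # Naive DFT bin k of the impulse response.
        H = sum(h[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        ph = cmath.phase(H)
        # Unwrap: remove jumps larger than pi between adjacent bins.
        if prev is not None:
            if ph - prev > math.pi:
                offset -= 2.0 * math.pi
            elif ph - prev < -math.pi:
                offset += 2.0 * math.pi
        prev = ph
        if f >= f_lo:
            omegas.append(2.0 * math.pi * f)
            phases.append(ph + offset)
    # Ordinary least-squares slope of phase against angular frequency.
    mw = sum(omegas) / len(omegas)
    mp = sum(phases) / len(phases)
    num = sum((w - mw) * (p - mp) for w, p in zip(omegas, phases))
    den = sum((w - mw) ** 2 for w in omegas)
    slope = num / den  # seconds, negated
    return -slope * fs


# Toy usage: a pure delay of 20 samples at a 48 kHz sampling rate.
impulse = [0.0] * 256
impulse[20] = 1.0
delay = estimate_delay_samples(impulse, fs=48000.0)
```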
  • Reverberation unit 506: After the mixing time, the impulse response tails (e.g., reverberation segment 402C) are theoretically interchangeable without much perceptual difference. Reverberation unit 506 therefore applies a common reverberation filter 524 to substitute for each response tail of the respective BRIRs corresponding to inputs 412. There are several example ways to obtain the common reverberation filter 524 for application in reverberation unit 506 of the audio playback device 500:
  • the first method (1) takes the characteristics/shape of each original filter equally. Some filters may have very low energy (e.g. the top center channel in 22.2 setup) and yet have equal "votes" in the common filter 524.
  • the second method (2) naturally weights each filter according to its energy level, so a more energetic or "louder" filter gets more votes in the common filter 524. This direct average may also assume that there is not much correlation between filters, which may be true at least for individually obtained BRIRs in a good listening room.
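  • The difference between the first two methods can be illustrated with a small sketch. It assumes that method (1) scales each tail to unit energy before averaging, so that every filter has an equal vote, and that method (2) averages the tails directly without normalization. That reading, the function names and the toy taps are assumptions, since the enumerated list of methods is not reproduced in this text.

```python
import math


def energy(h):
    """Sum of squared taps of an impulse response."""
    return sum(v * v for v in h)


def normalized_average(filters):
    """Average after scaling each filter to unit energy (equal votes)."""
    scaled = []
    for h in filters:
        g = math.sqrt(energy(h))
        scaled.append([v / g for v in h] if g > 0.0 else list(h))
    return [sum(taps) / len(scaled) for taps in zip(*scaled)]


def direct_average(filters):
    """Tap-by-tap average without normalization (louder filters dominate)."""
    return [sum(taps) / len(filters) for taps in zip(*filters)]


# Toy usage: one loud filter and one quiet filter (hypothetical values).
tails = [[1.0, 0.0], [0.0, 0.1]]
equal_votes = normalized_average(tails)
energy_votes = direct_average(tails)
```

  • In the toy data the quiet second filter contributes as much as the loud one under normalization, but only a tenth as much under the direct average.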
  • the third method (3) is based on techniques whereby frequency dependent interaural coherence (FDIC) is used to resynthesize reverb tails of a BRIR.
  • Each BRIR first goes through short-term Fourier transform (STFT), and its FDIC is calculated as:
  • $H_L$ and $H_R$ are the short-time Fourier transforms (STFT) of the left and right impulse responses.
  • $\tilde{H}_L$ and $\tilde{H}_R$ are the synthesized STFT of the filter
  • $N_1$ and $N_2$ are the STFT of independently generated Gaussian noise
  • c and d are the EDRs indexed by frequency and time
  • Ps are the time-smoothed short-time power spectrum estimates of the noise signal.
  • the techniques may include:
  • each filter has a "vote" in the common FDIC commensurate with its energy. Louder filters therefore get more of their FDIC images in the common filter 524.
  • the common reverberation filter 524 generation techniques may tradeoff the accuracy for the top channel when synthesizing the common filter 524, while the main front center, left and right channels may get a lot of emphasis.
  • the common or average FDIC calculated with multiple weights is calculated as: $FDIC_{average} = \frac{\sum_i \left(\prod_j w_{ji}\right) FDIC_i}{\sum_i \prod_j w_{ji}}$
  • $FDIC_i$ is the FDIC of the i-th BRIR channel
  • $w_{ji}$ is the weight factor of criterion j for BRIR channel i.
  • One j-th criterion mentioned here may be BRIR energy, while another may be signal content energy.
  • the denominator sum normalizes such that the combined weights eventually add up to 1. When the weights are all equal to 1, the equation reduces to a simple average.
  • a common EDR (c and d in previous equations) can be calculated as: $EDR_{average} = \frac{\sum_i \left(\prod_j w_{ji}\right) EDR_i}{\sum_i \prod_j w_{ji}}$
  • weights here may not necessarily be the same as the weights of the FDIC.
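  • The weighted averaging described above, in which each channel's FDIC or EDR is weighted by the product of its per-criterion weights and the result is normalized by the sum of those products, can be sketched as follows. The function name `weighted_average`, the data layout (one list of per-bin values per channel, one list of per-channel weights per criterion) and the numbers are assumptions for illustration.

```python
import math


def weighted_average(values, weights):
    """Combine per-channel curves using products of per-criterion weights.

    values[i][k]: quantity (e.g., FDIC or EDR) for channel i at bin k.
    weights[j][i]: weight of criterion j for channel i.
    """
    n_channels = len(values)
    # Combined weight per channel: product over all criteria j.
    combined = [math.prod(w[i] for w in weights) for i in range(n_channels)]
    total = sum(combined)
    n_bins = len(values[0])
    return [
        sum(combined[i] * values[i][k] for i in range(n_channels)) / total
        for k in range(n_bins)
    ]


# Toy usage: two channels, two bins (hypothetical values).
fdic = [[0.2, 0.4], [0.6, 0.8]]
simple = weighted_average(fdic, [[1.0, 1.0]])            # all weights equal 1
by_energy = weighted_average(fdic, [[3.0, 1.0]])         # one criterion
two_criteria = weighted_average(fdic, [[3.0, 1.0], [1.0, 3.0]])
```

  • With all weights equal to 1 the result is the simple average; with the two criteria shown, the per-channel products are equal, so the result is again the simple average.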
  • any of the above methods described with respect to generating common reverberation filter 524 may also be used to synthesize reflection filters 512A-512M. That is, a sub-group of channels' reflections can be similarly synthesized, although the error will typically be larger because signals produced by reflections are less noise-like. However, all the center channel reflections will share similar coherence evaluation and energy decay; all left-side channels reflections can be combined with proper weighting; alternatively, left front channels may form one group, left back and height channels may form another group, and so forth, in accordance with the channel format (e.g., 22.2).
  • Reflection channels may be grouped in any combination. By examining the correlation between the reflection segments of the impulse responses, relatively highly-correlated channels can be grouped together for a subgroup common reflection filter 512 synthesis.
  • reflection unit 502 groups at least input 412A and input 412N in a subgroup. Reflection filter 512A represents a common filter generated for this subgroup, and reflection unit 502 applies the reflection filter 512A to a combination of the inputs of the subgroup which, again, include at least input 412A and input 412N in the illustrated example.
  • the correlation matrix for the respective reflection portions of a set of BRIR filters is examined.
  • the set of BRIR filters may represent a current set of BRIR filters.
  • the correlation matrix is adjusted by (1 - corr)/2 to obtain a dissimilarity matrix, which is used to conduct a complete linkage for cluster analysis.
  • a hierarchical cluster analysis may be run on the reflection portions of a 22.2 channel BRIR set according to a correlation on their time envelopes. As can be seen, by setting a cutoff score of 0.6, the left channels can be grouped into 4 sub-groups and the right channels can be grouped into 3 sub-groups with convincing similarities. By examining the speaker locations in the 22.2 setup, the cluster analysis results coincide with common sense functionalities and geometry of the 22.2 channel setup.
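  • The grouping step can be sketched as agglomerative clustering with complete linkage on the (1 - corr)/2 dissimilarity. The sketch assumes that the cutoff is applied as a maximum allowed dissimilarity when merging clusters, which may differ from how the cutoff score of 0.6 is defined in the analysis above. The function names, the 4-channel correlation matrix and the cutoff of 0.2 are hypothetical.

```python
def dissimilarity(corr):
    """Map correlations in [-1, 1] to dissimilarities in [0, 1]."""
    return [[(1.0 - c) / 2.0 for c in row] for row in corr]


def complete_linkage(dist, cutoff):
    """Merge clusters until the closest pair is farther apart than cutoff.

    The distance between two clusters is the largest pairwise
    dissimilarity between their members (complete linkage).
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy correlation matrix between reflection-segment envelopes of 4 channels.
corr = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
groups = complete_linkage(dissimilarity(corr), cutoff=0.2)
```

  • Each resulting group would then share one common reflection filter.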
  • the impulse response for any of the common filters may be a two-column vector:
  • the reflection unit 502 and/or reverberation unit 506 first mixes the inputs 412 into a specific group for the filter and then applies the common filter. For example, reverberation unit 506 may mix all inputs 412 together and then apply common reverberation filter 524. Since the original filters before common filter synthesis have varying energies, equally-mixed inputs 412 may not match the original condition. If the energy of a filter impulse response h is calculated as:
  • an initial weight for the input signal can be calculated as: where $h_i$ is the original filter for channel i before common filter synthesis.
  • Adaptive weight factors 520A_{1-K}-520M_{1-J} for reflection unit 502 and adaptive weight factors 522A-522N for reverberation unit 506 may represent any of weights $w_i$.
  • n is the discrete time index
  • $w_{norm}$ is determined according to the energy ratio between the summed energy of the weighted signals and the energy of the weighted summed signal:
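  • The energy-ratio normalization can be illustrated with a hedged sketch. The exact expressions for the energy and for the normalization weight are not reproduced in this text, so the sketch assumes that energy is the sum of squared samples over the discrete time index n and that the normalization weight is the square root of the ratio between the summed energy of the weighted signals and the energy of their weighted sum. Both the square root and the function names are assumptions.

```python
import math


def energy(x):
    """Sum of squared samples over the discrete time index."""
    return sum(v * v for v in x)


def normalization_weight(signals, weights):
    """Gain that makes the mix energy match the summed per-signal energy.

    signals: equal-length sample lists, one per channel.
    weights: one (possibly adaptive) weight per channel.
    """
    weighted = [[w * v for v in x] for x, w in zip(signals, weights)]
    summed_energy = sum(energy(x) for x in weighted)
    mix = [sum(samples) for samples in zip(*weighted)]
    mix_energy = energy(mix)
    if mix_energy == 0.0:
        return 1.0  # Nothing to normalize for a silent mix.
    return math.sqrt(summed_energy / mix_energy)


# Toy usage: fully correlated inputs add coherently and need attenuation,
# whereas non-overlapping inputs need none.
correlated = normalization_weight([[1.0, 2.0], [1.0, 2.0]], [1.0, 1.0])
disjoint = normalization_weight([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```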
  • Combination step 510 combines all of the filtered signals generated by reflection unit 502, HRTF unit 504, and reverberation unit 506. In some examples, at least one of reflection unit 502 and reverberation unit 506 do not include applying adaptive weight factors.
  • HRTF unit 504 applies both the HRTF portion and the reflection portion of the BRIR filters for the inputs 412, i.e., audio playback device 500 in such examples does not group inputs 412 into M subgroups to which common reflection filters 512A-512M are applied.
  • FIG. 17 is a flowchart illustrating an example mode of operation of an audio playback device according to techniques described in this disclosure. The example mode of operation is described with respect to audio playback device 500 of FIG. 15.
  • the audio playback device 500 receives single input channels and applies adaptively determined weights to the channels (600). The audio playback device 500 combines these adaptively weighted channels to generate a combined audio signal (602). The audio playback device 500 further applies a binaural room impulse response filter to the combined audio signal to generate a binaural audio signal (604).
  • the binaural room impulse response filter may be, e.g., a combined reflection or a reverberation filter generated according to any of the techniques described above.
  • the audio playback device 500 outputs an output/overall audio signal that is generated, at least in part, from the binaural audio signal generated at step 604 (606).
  • the overall audio signal may be a combination of multiple binaural audio signals for one or more reflection sub-groups combined and filtered, a reverberation group combined and filtered, and respective HRTF signals filtered for each of the channels of the audio signal.
  • the audio playback device 500 applies a delay, as needed, to the filtered signals to align the signals for combination to produce the overall output binaural audio signal.
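  • Steps (600) through (604) of the mode of operation can be sketched for one group of channels as follows. The sketch assumes fixed example weights standing in for adaptively determined ones, a common filter stored as a list of (left, right) tap pairs, and direct time-domain convolution. The function names and values are hypothetical.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binauralize_group(channels, weights, common_filter):
    """Weight and combine channels, then apply one stereo common filter.

    channels: equal-length sample lists, one per input channel.
    weights: one weight per channel (adaptive in the description,
             fixed here for illustration).
    common_filter: list of (left_tap, right_tap) pairs.
    """
    length = len(channels[0])
    # (600) apply the weights and (602) combine into a single signal.
    combined = [
        sum(w * x[t] for w, x in zip(weights, channels))
        for t in range(length)
    ]
    # (604) apply the left and right columns of the common filter.
    left = convolve(combined, [l for l, _ in common_filter])
    right = convolve(combined, [r for _, r in common_filter])
    return left, right


# Toy usage with two channels and a two-tap stereo filter.
left_out, right_out = binauralize_group(
    channels=[[1.0, 0.0], [0.0, 1.0]],
    weights=[0.5, 2.0],
    common_filter=[(1.0, 0.0), (0.0, 1.0)],
)
```

  • The output of step (606) would add these signals, after any alignment delay, to the outputs of the other groups and of the per-channel HRTF filters.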
  • One example is directed to a method of binauralizing an audio signal comprising obtaining a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and applying the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
  • the summary audio signal comprises a combination of a subgroup of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
  • the method further comprises applying respective head-related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
  • obtaining the common filter comprises computing an average of the sub-group of the plurality of binaural room impulse response filters as the common filter.
  • the method further comprises combining a sub-group of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
  • the common filter is a first common filter
  • the sub-group is a first sub-group
  • the summary audio signal is a first summary audio signal
  • the transformed summary audio signal is a first transformed summary audio signal
  • the method further comprises generating a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; combining a second sub-group of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and applying the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal comprises combining the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
  • obtaining the common filter comprises computing a weighted average of the sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • obtaining the common filter comprises computing the average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • obtaining the common filter comprises computing a direct average of the sub-group of the plurality of binaural room impulse response filters.
  • obtaining the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • obtaining the common filter comprises computing respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the subgroup of the plurality of binaural room impulse response filters.
  • computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing: $FDIC_{average} = \frac{\sum_i \left(\prod_j w_{ji}\right) FDIC_i}{\sum_i \prod_j w_{ji}}$
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing: $EDR_{average} = \frac{\sum_i \left(\prod_j w_{ji}\right) EDR_i}{\sum_i \prod_j w_{ji}}$
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • a method comprises generating a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
  • generating the common filter comprises computing a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • generating the common filter comprises computing the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
  • generating the common filter comprises computing a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
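One way to read the direct-average variant is that, because the filters are not normalized first, a sample-wise mean lets higher-energy filters contribute more to the common filter. The sketch below illustrates that reading only; `direct_average`, `energy`, and the sample segments are hypothetical names and data.

```python
def energy(segment):
    """Sum of squared samples of a filter segment."""
    return sum(x * x for x in segment)


def direct_average(segments):
    """Sample-wise mean of equal-length segments, with no normalization."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("segments must have equal length")
    count = len(segments)
    return [sum(s[n] for s in segments) / count for n in range(length)]


# Two hypothetical reverberation segments: one loud, one quiet.
loud = [0.8, -0.4, 0.2]
quiet = [0.1, 0.05, -0.05]
common = direct_average([loud, quiet])
```

Here the averaged filter stays close to half of the loud segment, which is the implicit energy weighting the text refers to.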
  • generating the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • generating the common filter comprises: computing respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
  • computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
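The direct-average and minimum alternatives for combining per-filter coherence values can be compared with a short sketch. It assumes each filter's coherence is given as a list with one value per frequency bin; `combine_fdic` and the sample values are hypothetical.

```python
def combine_fdic(fdic_per_filter, mode="direct"):
    """Combine per-filter coherence curves bin by bin.

    fdic_per_filter: list of equal-length lists, one value per bin.
    mode: "direct" for the plain mean, "minimum" for the per-bin minimum.
    """
    length = len(fdic_per_filter[0])
    if any(len(curve) != length for curve in fdic_per_filter):
        raise ValueError("all curves must cover the same bins")
    columns = list(zip(*fdic_per_filter))
    if mode == "direct":
        return [sum(col) / len(col) for col in columns]
    if mode == "minimum":
        return [min(col) for col in columns]
    raise ValueError("mode must be 'direct' or 'minimum'")


# Hypothetical coherence curves for two filters over three bins.
curves = [[0.9, 0.5, 0.1], [0.7, 0.3, 0.3]]
direct = combine_fdic(curves, "direct")
minimum = combine_fdic(curves, "minimum")
```

The minimum never exceeds the direct average in any bin, so it yields the least coherent (most diffuse) target of the two.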
  • computing the average frequency-dependent inter-aural coherence value comprises computing:
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing: $EDR_{average} = \frac{\sum_i w_{ji}\, EDR_i}{\sum_i w_{ji}}$
  • $EDR_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the audio signal, wherein $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • a method comprises generating a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
  • generating the common filter comprises computing a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
  • generating the common filter comprises computing the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • generating the common filter comprises computing a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • generating the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
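Resynthesis from white noise with an energy envelope and coherence control can be illustrated with a deliberately simplified sketch. It assumes a single broadband coherence value and an exponential decay envelope, whereas a frequency-dependent version would repeat this per band. The mixing rule used here is a standard way to obtain two noise signals with a chosen correlation and is not claimed to be the source's method; `synthesize_tail`, `correlation`, and all parameters are hypothetical.

```python
import math
import random


def synthesize_tail(length, coherence, decay, seed=0):
    """Return (left, right) noise tails with a target inter-aural correlation.

    coherence: target correlation in [0, 1] (assumed broadband).
    decay: exponential decay rate per sample for the energy envelope.
    """
    if not 0.0 <= coherence <= 1.0:
        raise ValueError("coherence must lie in [0, 1]")
    rng = random.Random(seed)
    mix = math.sqrt(1.0 - coherence ** 2)
    left, right = [], []
    for n in range(length):
        a = rng.gauss(0.0, 1.0)  # noise shared by both ears
        b = rng.gauss(0.0, 1.0)  # independent noise for the right ear
        envelope = math.exp(-decay * n)
        left.append(envelope * a)
        right.append(envelope * (coherence * a + mix * b))
    return left, right


def correlation(x, y):
    """Normalized zero-lag cross-correlation of two zero-mean signals."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


left, right = synthesize_tail(20000, 0.6, 0.0, seed=1)
measured = correlation(left, right)
```

With no decay and a long tail, the measured correlation should land close to the requested value; a non-zero `decay` shapes the energy envelope without changing the mixing rule.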
  • generating the common filter comprises: computing respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • computing the average frequency-dependent inter-aural coherence value comprises computing:
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing: $EDR_{average} = \frac{\sum_i w_{ji}\, EDR_i}{\sum_i w_{ji}}$.
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • a method of binauralizing an audio signal comprises applying adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and applying the one or more segments to the plurality of binaural room impulse response filters.
  • the initial adaptively determined weights for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
  • the method further comprises obtaining a common filter for a plurality of binaural room impulse response filters, wherein the i-th initial adaptively determined weight $w_i$ for the i-th channel is computed according to:
  • the method further comprises applying the common filter to the summary audio signal to generate a transformed summary audio signal by computing $\left(\sum_i w_i\, in_i\right) \circledast h$, wherein $\circledast$ denotes a convolution operation and $in_i$ denotes the i-th channel of the audio signal.
  • combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels comprises computing:
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
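The energy-based channel weighting and the resulting summary signal can be sketched as below. The source states only that each initial weight follows the energy of the corresponding binaural room impulse response filter; the specific rule used here, the square root of the filter energy divided by the common filter energy, is an assumption made for illustration. `initial_weights`, `summary_signal`, and the sample data are hypothetical.

```python
import math


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def initial_weights(brirs, common_filter):
    """Assumed rule: w_i = sqrt(E(h_i) / E(h_common))."""
    e_common = energy(common_filter)
    if e_common <= 0:
        raise ValueError("common filter must have non-zero energy")
    return [math.sqrt(energy(h) / e_common) for h in brirs]


def summary_signal(channels, weights):
    """Weighted sample-wise sum of equal-length channels."""
    length = len(channels[0])
    return [
        sum(w * ch[n] for w, ch in zip(weights, channels))
        for n in range(length)
    ]


# Hypothetical filters and channels.
brirs = [[2.0, 0.0], [0.0, 1.0]]
common_filter = [1.0, 0.0]
channels = [[1.0, 0.0], [0.0, 1.0]]

weights = initial_weights(brirs, common_filter)
mix = summary_signal(channels, weights)
```

Under this assumed rule, a channel whose own filter carries more energy than the common filter is boosted before mixing, which compensates for the information lost when a single filter stands in for all of them.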
  • a method comprises applying respective head-related transfer function segments of a plurality of binaural room impulse response filters to corresponding channels of an audio signal to generate a plurality of transformed channels of the audio signal; generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters; combining the channels of the audio signal to generate a summary audio signal; applying the common filter to the summary audio signal to generate a transformed summary audio signal; combining the transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal.
  • generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing an average of the plurality of binaural room impulse response filters without normalizing any of the plurality of binaural room impulse response filters.
  • generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing a direct average of the plurality of binaural room impulse response filters.
  • generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters comprises computing a direct average frequency-dependent inter-aural coherence value.
  • computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
  • computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises computing:
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing:
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
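The overall method (per-channel head-related transfer function segments, one common filter applied to a summary signal, then a final combination) can be sketched for a single ear as follows. This is a simplified illustration: it uses an unweighted channel sum as the summary signal and ignores the time offset that a real reverberation segment would have relative to the head-related segment. All function names and sample data are hypothetical.

```python
def convolve(signal, kernel):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(kernel):
            out[i + k] += s * h
    return out


def add_padded(a, b):
    """Sample-wise sum, zero-padding the shorter sequence."""
    n = max(len(a), len(b))
    return [
        (a[i] if i < len(a) else 0.0) + (b[i] if i < len(b) else 0.0)
        for i in range(n)
    ]


def binauralize_one_ear(channels, hrtf_segments, common_filter):
    """Combine per-channel HRTF filtering with one shared tail filter."""
    # Step 1: filter each channel with its own HRTF segment and sum.
    direct = [0.0]
    for ch, hrtf in zip(channels, hrtf_segments):
        direct = add_padded(direct, convolve(ch, hrtf))
    # Step 2: build the summary signal (unweighted sum, an assumption).
    length = len(channels[0])
    summary = [sum(ch[n] for ch in channels) for n in range(length)]
    # Step 3: apply the common filter once to the summary signal.
    tail = convolve(summary, common_filter)
    # Step 4: combine both paths into the output for this ear.
    return add_padded(direct, tail)


channels = [[1.0, 0.0], [0.0, 1.0]]
hrtf_segments = [[1.0], [0.5]]
common_filter = [0.0, 0.25]
output = binauralize_one_ear(channels, hrtf_segments, common_filter)
```

The saving comes from step 3: the long tail filter is convolved once per summary signal instead of once per channel, while the short head-related segments remain channel-specific.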
  • a method comprises applying respective head-related transfer function segments of a plurality of binaural room impulse response filters to corresponding channels of an audio signal to generate a plurality of transformed channels of the audio signal; generating a common filter by computing an average of the plurality of binaural room impulse response filters; combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels; applying the common filter to the summary audio signal to generate a transformed summary audio signal; and combining the transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal.
  • the initial adaptive weight factors for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
  • the i-th initial adaptive weight factor $w_i$ for the i-th channel is computed according to
  • applying the common filter to the summary audio signal to generate a transformed summary audio signal comprises computing:
  • $\circledast$ denotes a convolution operation and $in_i$ denotes the i-th channel of the audio signal.
  • combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels comprises computing:
  • $in_{mix}(n)$ denotes the summary audio signal, wherein n is a sample index
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
  • a device comprises a memory configured to store a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and a processor configured to apply the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
  • the summary audio signal comprises a combination of a subgroup of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to apply respective head-related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and combine the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
  • the common filter comprises an average of the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to combine a sub-group of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
  • the common filter is a first common filter, wherein the subgroup is a first sub-group, wherein the summary audio signal is a first summary audio signal, and wherein the transformed summary audio signal is a first transformed summary audio signal
  • the processor is further configured to generate a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; combine a second sub-group of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and apply the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein, to combine the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal, the processor is further configured to combine the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
  • the common filter comprises a weighted average of the subgroup of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • the common filter comprises an average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • the common filter comprises a direct average of the subgroup of the plurality of binaural room impulse response filters.
  • the common filter comprises a resynthesized common filter generated using white noise controlled by energy envelope and coherence control.
  • the processor is further configured to: compute respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute:
  • $FDIC_{average} = \frac{\sum_i w_{ji}\, FDIC_i}{\sum_i w_{ji}}$
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the processor is further configured to compute:
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
  • a device comprises a processor configured to generate a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
  • the processor is further configured to compute a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • the processor is further configured to compute the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
  • the processor is further configured to compute a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
  • the processor is further configured to resynthesize the common filter using white noise controlled by energy envelope and coherence control.
  • the processor is further configured to compute respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
  • the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute:
  • $FDIC_{average} = \frac{\sum_i w_{ji}\, FDIC_i}{\sum_i w_{ji}}$
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • the processor is further configured to compute:
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • a device comprises a processor configured to generate a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
  • the processor is further configured to compute a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
  • the processor is further configured to compute the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to compute a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to resynthesize the common filter using white noise controlled by energy envelope and coherence control.
  • the processor is further configured to compute respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the processor is further configured to compute:
  • $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the processor is further configured to compute:
  • $EDR_{average}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • a device comprises a processor configured to apply adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and apply the one or more segments to the plurality of binaural room impulse response filters.
  • the processor computes the initial adaptively determined weights for the channels of the audio signal according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
  • the processor is further configured to: apply the common filter to the summary audio signal to generate a transformed summary audio signal by computing:
  • $\circledast$ denotes a convolution operation and $in_i$ denotes the i-th channel of the audio signal.
  • the processor is further configured to: combine the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels by computing:
  • a normalization factor $w_{norm}(n)$ that depends on the energy $E\left(\sum_i w_i\, in_i\right)$, wherein $in_i$ denotes the i-th channel of the audio signal.
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
  • a device comprises means for obtaining a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and means for applying the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
  • the summary audio signal comprises a combination of a subgroup of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
  • the device further comprises means for applying respective head-related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and means for combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
  • the means for obtaining the common filter comprises means for computing an average of the sub-group of the plurality of binaural room impulse response filters as the common filter.
  • the device further comprises means for combining a subgroup of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
  • the common filter is a first common filter, wherein the subgroup is a first sub-group, wherein the summary audio signal is a first summary audio signal, and wherein the transformed summary audio signal is a first transformed summary audio signal
  • the device further comprises means for generating a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; means for combining a second subgroup of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and means for applying the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein the means for combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal comprises means for combining the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
  • the means for obtaining the common filter comprises means for computing a weighted average of the sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • the means for obtaining the common filter comprises means for computing the average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • the means for obtaining the common filter comprises means for computing a direct average of the sub-group of the plurality of binaural room impulse response filters.
  • the means for obtaining the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • the means for obtaining the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{FDIC}_{\mathrm{average}}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{EDR}_{\mathrm{average}}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the subgroup of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
  • a device comprises means for generating a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
  • the means for generating the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • the means for generating the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{FDIC}_{\mathrm{average}}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters
  • $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of channels of the audio signal.
  • the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{EDR}_{\mathrm{average}}$ is an average Energy Decay Relief value
  • i denotes a channel of the audio signal
  • $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
  • a device comprises means for generating a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
  • the means for generating the common filter comprises means for computing a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • the means for generating the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
  • the means for generating the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
  • the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{FDIC}_{\mathrm{average}}$ is the average frequency-dependent inter-aural coherence value
  • i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters
  • $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter
  • $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
  • $\mathrm{EDR}_{\mathrm{average}}$ is an average Energy Decay Relief value
  • i denotes a channel of the sub-group of channels of the audio signal
  • $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal
  • $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the subgroup of channels of the audio signal.
  • the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
  • a device comprises means for applying adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and means for applying the one or more segments to the plurality of binaural room impulse response filters.
  • the initial adaptively determined weights for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
  • the device further comprises means for obtaining a common filter for a plurality of binaural room impulse response filters, wherein the i-th initial adaptively determined weight $w_i$ for the i-th channel is computed according to
  • the device further comprises means for applying the common filter to the summary audio signal to generate a transformed summary audio signal by computing:
  • ⊗ denotes a convolution operation and $\mathrm{in}_i$ denotes the i-th channel of the audio signal.
  • the device further comprises means for combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels by computing:
  • $\mathrm{in}_{\mathrm{mix}}(n)$ denotes the summary audio signal, wherein n is a sample index
  • the channels of the audio signal comprise a plurality of hierarchical elements.
  • the plurality of hierarchical elements comprise spherical harmonic coefficients.
  • the plurality of hierarchical elements comprise higher order ambisonics.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to obtain a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and apply the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to generate a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to generate a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to apply adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and apply the one or more segments to the plurality of binaural room impulse response filters.
  • a device comprises a processor configured to perform any combination of the methods of the examples described above.
  • a device comprises means for performing each step of the method of any combination of the examples described above.
  • a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to perform the method of any combination of the examples described above.
  • Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
  • computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
  • a computer program product may include a computer-readable medium.
  • Such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.
  • coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
  • Disk and disc includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • processors such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry.
  • processors may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein.
  • the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
  • the techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set).
  • Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Stereophonic System (AREA)

Abstract

A device comprising one or more processors is configured to apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal. The processors are further configured to combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal. The processors are further configured to apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.

Description

FILTERING WITH BINAURAL ROOM IMPULSE RESPONSES WITH CONTENT ANALYSIS AND WEIGHTING
PRIORITY CLAIM
[0001] This application claims the benefit of U.S. Provisional Application No. 61/828,620, filed May 29, 2013, U.S. Provisional Patent Application No. 61/847,543, filed July 17, 2013, U.S. Provisional Application No. 61/886,593, filed October 3, 2013, and U.S. Provisional Application No. 61/886,620, filed October 3, 2013.
TECHNICAL FIELD
[0002] This disclosure relates to audio rendering and, more specifically, binaural rendering of audio data.
SUMMARY
[0003] In general, techniques are described for binaural audio rendering through application of binaural room impulse response (BRIR) filters to source audio streams.
[0004] As one example, a method of binauralizing an audio signal comprises applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
[0005] As another example, a device comprises one or more processors configured to apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
[0006] As another example, an apparatus comprises means for applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; means for combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and means for applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
[0007] As another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal; combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
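The three steps shared by these examples (weight the channels, combine them, filter the combined signal) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names `convolve` and `binauralize` are hypothetical, the weights are passed in rather than adaptively determined, all channels are combined rather than a selected subset, and the binaural room impulse response filter is assumed to be a left/right pair of finite impulse responses.

```python
def convolve(x, h):
    """Full linear convolution of two sample sequences (direct form)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binauralize(channels, weights, brir_left, brir_right):
    """Weight each channel, sum the weighted channels, then filter once per ear.

    channels:   list of equal-length sample lists, one per input channel
    weights:    one weight per channel (assumed to be computed elsewhere)
    brir_left / brir_right: impulse responses for the two ears (an assumption)
    """
    # Step 1: apply the per-channel weights.
    weighted = [[w * s for s in ch] for ch, w in zip(channels, weights)]
    # Step 2: combine the weighted channels sample by sample.
    combined = [sum(samples) for samples in zip(*weighted)]
    # Step 3: apply the BRIR filter to the single combined signal.
    return convolve(combined, brir_left), convolve(combined, brir_right)
```

Because the filter is applied after the channels are combined, the sketch performs one convolution per ear regardless of how many channels were combined.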
[0008] The details of one or more aspects of the techniques are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of these techniques will be apparent from the description and drawings, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] FIGS. 1 and 2 are diagrams illustrating spherical harmonic basis functions of various orders and sub-orders.
[0010] FIG. 3 is a diagram illustrating a system that may perform techniques described in this disclosure to more efficiently render audio signal information.
[0011] FIG. 4 is a block diagram illustrating an example binaural room impulse response (BRIR).
[0012] FIG. 5 is a block diagram illustrating an example systems model for producing a BRIR in a room.
[0013] FIG. 6 is a block diagram illustrating a more in-depth systems model for producing a BRIR in a room.
[0014] FIG. 7 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
[0015] FIG. 8 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
[0016] FIG. 9 is a flow diagram illustrating an example mode of operation for a binaural rendering device to render spherical harmonic coefficients according to various aspects of the techniques described in this disclosure.
[0017] FIGS. 10A, 10B depict flow diagrams illustrating alternative modes of operation that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure.
[0018] FIG. 11 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure.
[0019] FIG. 12 is a flow diagram illustrating a process that may be performed by the audio playback device of FIG. 11 in accordance with various aspects of the techniques described in this disclosure.
[0020] FIG. 13 is a diagram of an example binaural room impulse response filter.
[0021] FIG. 14 is a block diagram illustrating a system for a standard computation of a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal.
[0022] FIG. 15 is a block diagram illustrating functional components of a system for computing a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal according to techniques described herein.
[0023] FIG. 16 is an example plot showing hierarchical cluster analysis on a reflection segment of the multiple binaural room impulse response filters.
[0024] FIG. 17 is a flowchart illustrating an example mode of operation of an audio playback device according to techniques described in this disclosure.
[0025] Like reference characters denote like elements throughout the figures and text.
DETAILED DESCRIPTION
[0026] The evolution of surround sound has made many output formats available for entertainment. Examples of such surround sound formats include the popular 5.1 format (which includes the following six channels: front left (FL), front right (FR), center or front center, back left or surround left, back right or surround right, and low frequency effects (LFE)), the growing 7.1 format, and the upcoming 22.2 format (e.g., for use with the Ultra High Definition Television standard). Another example of a spatial audio format is the set of spherical harmonic coefficients (also known as Higher Order Ambisonics).
[0027] The input to a future standardized audio-encoder (a device which converts PCM audio representations to a bitstream - conserving the number of bits required per time sample) could optionally be one of three possible formats: (i) traditional channel-based audio, which is meant to be played through loudspeakers at pre-specified positions; (ii) object-based audio, which involves discrete pulse-code-modulation (PCM) data for single audio objects with associated metadata containing their location coordinates (amongst other information); and (iii) scene-based audio, which involves representing the sound field using spherical harmonic coefficients (SHC) - where the coefficients represent 'weights' of a linear summation of spherical harmonic basis functions. The SHC, in this context, may include Higher Order Ambisonics (HoA) signals according to an HoA model. Spherical harmonic coefficients may alternatively or additionally include planar models and spherical models.
[0028] There are various 'surround-sound' formats in the market. They range, for example, from the 5.1 home theatre system (which has been the most successful in terms of making inroads into living rooms beyond stereo) to the 22.2 system developed by NHK (Nippon Hoso Kyokai or Japan Broadcasting Corporation). Content creators (e.g., Hollywood studios) would like to produce the soundtrack for a movie once, and not spend the efforts to remix it for each speaker configuration. Recently, standard committees have been considering ways in which to provide an encoding into a standardized bitstream and a subsequent decoding that is adaptable and agnostic to the speaker geometry and acoustic conditions at the location of the renderer.
[0029] To provide such flexibility for content creators, a hierarchical set of elements may be used to represent a sound field. The hierarchical set of elements may refer to a set of elements in which the elements are ordered such that a basic set of lower-ordered elements provides a full representation of the modeled sound field. As the set is extended to include higher-order elements, the representation becomes more detailed.
[0030] One example of a hierarchical set of elements is a set of spherical harmonic coefficients (SHC). The following expression demonstrates a description or representation of a sound field using SHC:
$$p_i(t, r_r, \theta_r, \varphi_r) = \sum_{\omega=0}^{\infty} \left[ 4\pi \sum_{n=0}^{\infty} j_n(k r_r) \sum_{m=-n}^{n} A_n^m(k)\, Y_n^m(\theta_r, \varphi_r) \right] e^{j\omega t}$$
This expression shows that the pressure $p_i$ at any point $\{r_r, \theta_r, \varphi_r\}$ (which are expressed in spherical coordinates relative to the microphone capturing the sound field in this example) of the sound field can be represented uniquely by the SHC $A_n^m(k)$. Here, $k = \omega/c$, $c$ is the speed of sound (~343 m/s), $\{r_r, \theta_r, \varphi_r\}$ is a point of reference (or observation point), $j_n(\cdot)$ is the spherical Bessel function of order n, and $Y_n^m(\theta_r, \varphi_r)$ are the spherical harmonic basis functions of order n and suborder m. It can be recognized that the term in square brackets is a frequency-domain representation of the signal (i.e., $S(\omega, r_r, \theta_r, \varphi_r)$) which can be approximated by various time-frequency transformations, such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT), or a wavelet transform. Other examples of hierarchical sets include sets of wavelet transform coefficients and other sets of coefficients of multiresolution basis functions.
[0031] FIG. 1 is a diagram illustrating spherical harmonic basis functions from the zero order (n = 0) to the fourth order (n = 4). As can be seen, for each order, there is an expansion of suborders m which are shown but not explicitly noted in the example of FIG. 1 for ease of illustration purposes.
[0032] FIG. 2 is another diagram illustrating spherical harmonic basis functions from the zero order (n = 0) to the fourth order (n = 4). In FIG. 2, the spherical harmonic basis functions are shown in three-dimensional coordinate space with both the order and the suborder shown.
[0033] In any event, the SHC $A_n^m(k)$ can either be physically acquired (e.g., recorded) by various microphone array configurations or, alternatively, they can be derived from channel-based or object-based descriptions of the sound field. The SHC represents scene-based audio. For example, a fourth-order SHC representation involves $(1+4)^2 = 25$ coefficients per time sample.
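As a small illustration of the coefficient count above, the following sketch enumerates the order/suborder pairs of a representation up to a given order. The function names `shc_count` and `shc_indices` are hypothetical and chosen only for this example, and the ordering of the pairs is an assumption rather than something the disclosure specifies.

```python
def shc_count(order):
    """Number of spherical harmonic coefficients up to and including `order`."""
    return (order + 1) ** 2


def shc_indices(order):
    """List every (n, m) pair with 0 <= n <= order and -n <= m <= n."""
    return [(n, m) for n in range(order + 1) for m in range(-n, n + 1)]


# A fourth-order representation: one coefficient per (n, m) pair per time sample.
fourth_order = shc_indices(4)
```

Each order n contributes 2n + 1 suborders, which is why the total up to order N is (N+1)^2.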
[0034] To illustrate how these SHCs may be derived from an object-based description, consider the following equation. The coefficients $A_n^m(k)$ for the sound field corresponding to an individual audio object may be expressed as:
$$A_n^m(k) = g(\omega)\,(-4\pi i k)\, h_n^{(2)}(k r_s)\, Y_n^{m*}(\theta_s, \varphi_s),$$
where $i$ is $\sqrt{-1}$, $h_n^{(2)}(\cdot)$ is the spherical Hankel function (of the second kind) of order n, and $\{r_s, \theta_s, \varphi_s\}$ is the location of the object. Knowing the source energy $g(\omega)$ as a function of frequency (e.g., using time-frequency analysis techniques, such as performing a fast Fourier transform on the PCM stream) allows us to convert each PCM object and its location into the SHC $A_n^m(k)$. Further, it can be shown (since the above is a linear and orthogonal decomposition) that the $A_n^m(k)$ coefficients for each object are additive. In this manner, a multitude of PCM objects can be represented by the $A_n^m(k)$ coefficients (e.g., as a sum of the coefficient vectors for the individual objects). Essentially, these coefficients contain information about the sound field (the pressure as a function of 3D coordinates), and the above represents the transformation from individual objects to a representation of the overall sound field, in the vicinity of the observation point $\{r_r, \theta_r, \varphi_r\}$.
[0035] The SHCs may also be derived from a microphone-array recording as follows:
$$a_n^m(t) = b_n(r_i, t) * \langle Y_n^m(\theta_i, \varphi_i),\, m_i(t) \rangle$$
where $a_n^m(t)$ are the time-domain equivalent of $A_n^m(k)$ (the SHC), the * represents a convolution operation, the $\langle , \rangle$ represents an inner product, $b_n(r_i, t)$ represents a time-domain filter function dependent on $r_i$, and $m_i(t)$ is the i-th microphone signal, where the i-th microphone transducer is located at radius $r_i$, elevation angle $\theta_i$ and azimuth angle $\varphi_i$. Thus, if there are 32 transducers in the microphone array and each microphone is positioned on a sphere such that $r_i = a$ is a constant (such as those on an Eigenmike EM32 device from mhAcoustics), the 25 SHCs may be derived using a matrix operation as follows:
$$\begin{bmatrix} a_0^0(t) \\ a_1^{-1}(t) \\ \vdots \\ a_4^4(t) \end{bmatrix} = \begin{bmatrix} b_0(a, t) \\ b_1(a, t) \\ \vdots \\ b_4(a, t) \end{bmatrix} * \begin{bmatrix} Y_0^0(\theta_1, \varphi_1) & Y_0^0(\theta_2, \varphi_2) & \cdots & Y_0^0(\theta_{32}, \varphi_{32}) \\ Y_1^{-1}(\theta_1, \varphi_1) & Y_1^{-1}(\theta_2, \varphi_2) & \cdots & Y_1^{-1}(\theta_{32}, \varphi_{32}) \\ \vdots & \vdots & \ddots & \vdots \\ Y_4^4(\theta_1, \varphi_1) & Y_4^4(\theta_2, \varphi_2) & \cdots & Y_4^4(\theta_{32}, \varphi_{32}) \end{bmatrix} \begin{bmatrix} m_1(a, t) \\ m_2(a, t) \\ \vdots \\ m_{32}(a, t) \end{bmatrix}$$
The matrix in the above equation may be more generally referred to as $E_s(\theta, \varphi)$, where the subscript s may indicate that the matrix is for a certain transducer geometry-set, s. The convolution in the above equation (indicated by the *) is on a row-by-row basis, such that, for example, the output $a_0^0(t)$ is the result of the convolution between $b_0(a, t)$ and the time series that results from the vector multiplication of the first row of the $E_s(\theta, \varphi)$ matrix and the column of microphone signals (which varies as a function of time, accounting for the fact that the result of the vector multiplication is a time series). The computation may be most accurate when the transducer positions of the microphone array are in the so-called T-design geometries (which are very close to the Eigenmike transducer geometry). One characteristic of the T-design geometry may be that the $E_s(\theta, \varphi)$ matrix that results from the geometry has a very well-behaved inverse (or pseudo-inverse) and further that the inverse may often be very well approximated by the transpose of the matrix $E_s(\theta, \varphi)$. If the filtering operation with $b_n(a, t)$ were to be ignored, this property would allow the recovery of the microphone signals from the SHC (i.e., $[m_i(t)] = [E_s(\theta, \varphi)]^{-1}[\mathrm{SHC}]$ in this example). The remaining figures are described below in the context of object-based and SHC-based audio-coding.
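The transpose property described above can be illustrated on a much smaller case than the 32-transducer array. The sketch below is an assumption-laden toy: it uses four directions at the vertices of a regular tetrahedron, real-valued orthonormal spherical harmonics up to first order (the disclosure does not say whether real or complex harmonics are used), and ignores the radial filters $b_n(a, t)$. The helper `real_sh_first_order` and the random stand-in microphone signals are hypothetical.

```python
import numpy as np

# Four directions at the vertices of a regular tetrahedron (unit vectors).
dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3.0)


def real_sh_first_order(xyz):
    """Orthonormal real spherical harmonics up to order 1, shape (4, num_dirs)."""
    x, y, z = xyz.T
    y00 = np.full(len(xyz), 1.0 / np.sqrt(4.0 * np.pi))
    scale = np.sqrt(3.0 / (4.0 * np.pi))
    return np.vstack([y00, scale * y, scale * z, scale * x])


E = real_sh_first_order(dirs)                           # (4 coefficients, 4 transducers)
E_inv_from_transpose = (4.0 * np.pi / len(dirs)) * E.T  # transpose with uniform weights

rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 256))     # stand-in microphone signals
shc = E @ mics                           # encode (radial filters ignored)
recovered = E_inv_from_transpose @ shc   # decode using only the scaled transpose
```

For this particular geometry the scaled transpose equals the inverse exactly; for larger arrays and higher orders the match is generally only approximate, which is consistent with the hedged wording of the paragraph.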
[0036] FIG. 3 is a diagram illustrating a system 20 that may perform techniques described in this disclosure to more efficiently render audio signal information. As shown in the example of FIG. 3, the system 20 includes a content creator 22 and a content consumer 24. While described in the context of the content creator 22 and the content consumer 24, the techniques may be implemented in any context that makes use of SHCs or any other hierarchical elements that define a hierarchical representation of a sound field.
[0037] The content creator 22 may represent a movie studio or other entity that may generate multi-channel audio content for consumption by content consumers, such as the content consumer 24. Often, this content creator generates audio content in conjunction with video content. The content consumer 24 may represent an individual that owns or has access to an audio playback system, which may refer to any form of audio playback system capable of playing back multi-channel audio content. In the example of FIG. 3, the content consumer 24 owns or has access to audio playback system 32 for rendering hierarchical elements that define a hierarchical representation of a sound field.
[0038] The content creator 22 includes an audio renderer 28 and an audio editing system 30. The audio renderer 28 may represent an audio processing unit that renders or otherwise generates speaker feeds (which may also be referred to as "loudspeaker feeds," "speaker signals," or "loudspeaker signals"). Each speaker feed may correspond to a speaker feed that reproduces sound for a particular channel of a multi-channel audio system or to a virtual loudspeaker feed that is intended for convolution with head-related transfer function (HRTF) filters matching the speaker position. Each speaker feed may correspond to a channel of spherical harmonic coefficients (where a channel may be denoted by an order and/or suborder of associated spherical basis functions to which the spherical harmonic coefficients correspond), which uses multiple channels of SHCs to represent a directional sound field.
[0039] In the example of FIG. 3, the audio renderer 28 may render speaker feeds for conventional 5.1, 7.1 or 22.2 surround sound formats, generating a speaker feed for each of the 5, 7 or 22 speakers in the 5.1, 7.1 or 22.2 surround sound speaker systems. Alternatively, the audio renderer 28 may be configured to render speaker feeds from source spherical harmonic coefficients for any speaker configuration having any number of speakers, given the properties of source spherical harmonic coefficients discussed above. The audio renderer 28 may, in this manner, generate a number of speaker feeds, which are denoted in FIG. 3 as speaker feeds 29.
[0040] The content creator may, during the editing process, render spherical harmonic coefficients 27 ("SHCs 27"), listening to the rendered speaker feeds in an attempt to identify aspects of the sound field that do not have high fidelity or that do not provide a convincing surround sound experience. The content creator 22 may then edit source spherical harmonic coefficients (often indirectly through manipulation of different objects from which the source spherical harmonic coefficients may be derived in the manner described above). The content creator 22 may employ the audio editing system 30 to edit the spherical harmonic coefficients 27. The audio editing system 30 represents any system capable of editing audio data and outputting this audio data as one or more source spherical harmonic coefficients.
[0041] When the editing process is complete, the content creator 22 may generate bitstream 31 based on the spherical harmonic coefficients 27. That is, the content creator 22 includes a bitstream generation device 36, which may represent any device capable of generating the bitstream 31. In some instances, the bitstream generation device 36 may represent an encoder that bandwidth compresses (through, as one example, entropy encoding) the spherical harmonic coefficients 27 and that arranges the entropy encoded version of the spherical harmonic coefficients 27 in an accepted format to form the bitstream 31. In other instances, the bitstream generation device 36 may represent an audio encoder (possibly, one that complies with a known audio coding standard, such as MPEG surround, or a derivative thereof) that encodes the multi-channel audio content 29 using, as one example, processes similar to those of conventional audio surround sound encoding processes to compress the multi-channel audio content or derivatives thereof. The compressed multi-channel audio content 29 may then be entropy encoded or coded in some other way to bandwidth compress the content 29 and arranged in accordance with an agreed upon format to form the bitstream 31. Whether directly compressed to form the bitstream 31 or rendered and then compressed to form the bitstream 31, the content creator 22 may transmit the bitstream 31 to the content consumer 24.
[0042] While shown in FIG. 3 as being directly transmitted to the content consumer 24, the content creator 22 may output the bitstream 31 to an intermediate device positioned between the content creator 22 and the content consumer 24. This intermediate device may store the bitstream 31 for later delivery to the content consumer 24, which may request this bitstream. The intermediate device may comprise a file server, a web server, a desktop computer, a laptop computer, a tablet computer, a mobile phone, a smart phone, or any other device capable of storing the bitstream 31 for later retrieval by an audio decoder. This intermediate device may reside in a content delivery network capable of streaming the bitstream 31 (and possibly in conjunction with transmitting a corresponding video data bitstream) to subscribers, such as the content consumer 24, requesting the bitstream 31. Alternatively, the content creator 22 may store the bitstream 31 to a storage medium, such as a compact disc, a digital video disc, a high definition video disc or other storage media, most of which are capable of being read by a computer and therefore may be referred to as computer-readable storage media or non-transitory computer-readable storage media. In this context, the transmission channel may refer to those channels by which content stored to these media is transmitted (and may include retail stores and other store-based delivery mechanisms). In any event, the techniques of this disclosure should not therefore be limited in this respect to the example of FIG. 3.
[0043] As further shown in the example of FIG. 3, the content consumer 24 owns or otherwise has access to the audio playback system 32. The audio playback system 32 may represent any audio playback system capable of playing back multi-channel audio data. The audio playback system 32 includes a binaural audio renderer 34 that renders SHCs 27' for output as binaural speaker feeds 35A-35B (collectively, "speaker feeds 35"). Binaural audio renderer 34 may provide for different forms of rendering, such as one or more of the various ways of performing vector-base amplitude panning (VBAP), and/or one or more of the various ways of performing sound field synthesis. As used herein, A "and/or" B may refer to A, B, or a combination of A and B.
[0044] The audio playback system 32 may further include an extraction device 38. The extraction device 38 may represent any device capable of extracting spherical harmonic coefficients 27' ("SHCs 27'," which may represent a modified form of or a duplicate of spherical harmonic coefficients 27) through a process that may generally be reciprocal to that of the bitstream generation device 36. In any event, the audio playback system 32 may receive the spherical harmonic coefficients 27' and use binaural audio renderer 34 to render spherical harmonic coefficients 27' and thereby generate speaker feeds 35 (corresponding to the number of loudspeakers electrically or possibly wirelessly coupled to the audio playback system 32, which are not shown in the example of FIG. 3 for ease of illustration purposes). The number of speaker feeds 35 may be two, and the audio playback system 32 may wirelessly couple to a pair of headphones that includes the two corresponding loudspeakers. However, in various instances binaural audio renderer 34 may output more or fewer speaker feeds than is illustrated and primarily described with respect to FIG. 3.
[0045] Binaural room impulse response (BRIR) filters 37 of audio playback system 32 each represent a response at a location to an impulse generated at an impulse location. BRIR filters 37 are "binaural" in that they are each generated to be representative of the impulse response as would be experienced by a human ear at the location. Accordingly, BRIR filters for an impulse are often generated and used for sound rendering in pairs, with one element of the pair for the left ear and another for the right ear. In the illustrated example, binaural audio renderer 34 uses left BRIR filters 33A and right BRIR filters 33B to render respective binaural audio outputs 35A and 35B.
[0046] For example, BRIR filters 37 may be generated by convolving a sound source signal with head-related transfer functions (HRTFs) measured as impulse responses (IRs). The impulse location corresponding to each of the BRIR filters 37 may represent a position of a virtual loudspeaker in a virtual space. In some examples, binaural audio renderer 34 convolves SHCs 27' with BRIR filters 37 corresponding to the virtual loudspeakers, then accumulates (i.e., sums) the resulting convolutions to render the sound field defined by SHCs 27' for output as speaker feeds 35. As described herein, binaural audio renderer 34 may apply techniques for reducing rendering computation by manipulating BRIR filters 37 while rendering SHCs 27' as speaker feeds 35.
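The convolve-then-accumulate step can be sketched in a few lines. This is a minimal illustration, not the renderer of the disclosure: it assumes the sound field has already been rendered to one feed per virtual loudspeaker, and the function name `binauralize`, the array shapes, and the random stand-in data are all hypothetical.

```python
import numpy as np


def binauralize(feeds, brir_left, brir_right):
    """Convolve each virtual-loudspeaker feed with its BRIR pair and sum.

    feeds: (L, T) array, one row per virtual loudspeaker.
    brir_left, brir_right: (L, K) arrays of per-loudspeaker impulse responses.
    Returns a (2, T + K - 1) array holding the left and right output signals.
    """
    num_speakers, num_samples = feeds.shape
    out = np.zeros((2, num_samples + brir_left.shape[1] - 1))
    for k in range(num_speakers):
        out[0] += np.convolve(feeds[k], brir_left[k])
        out[1] += np.convolve(feeds[k], brir_right[k])
    return out


rng = np.random.default_rng(1)
feeds = rng.standard_normal((5, 64))     # stand-in virtual loudspeaker feeds
brir_l = rng.standard_normal((5, 16))    # stand-in left-ear BRIRs
brir_r = rng.standard_normal((5, 16))    # stand-in right-ear BRIRs
binaural = binauralize(feeds, brir_l, brir_r)
```

The cost of this direct form grows with both the number of loudspeakers and the BRIR length, which is the motivation for the filter-shortening techniques the paragraph goes on to describe.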
[0047] In some instances, the techniques include segmenting BRIR filters 37 into a number of segments that represent different stages of an impulse response at a location within a room. These segments correspond to different physical phenomena that generate the pressure (or lack thereof) at any point on the sound field. For example, because each of BRIR filters 37 is timed coincident with the impulse, the first or "initial" segment may represent a time until the pressure wave from the impulse location reaches the location at which the impulse response is measured. With the exception of the timing information, BRIR filters 37 values for respective initial segments may be insignificant and may be excluded from a convolution with the hierarchical elements that describe the sound field. Similarly, each of BRIR filters 37 may include a last or "tail" segment that includes impulse response signals attenuated to below the dynamic range of human hearing or attenuated to below a designated threshold, for instance. BRIR filters 37 values for respective tail segments may also be insignificant and may be excluded from a convolution with the hierarchical elements that describe the sound field. In some examples, the techniques may include determining a tail segment by performing a Schroeder backward integration with a designated threshold and discarding elements from the tail segment where backward integration exceeds the designated threshold. In some examples, the designated threshold is -60 dB for reverberation time RT60.
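One way to read the tail-detection step is sketched below, under stated assumptions: the Schroeder curve is computed as the backward-integrated energy normalized to 0 dB at the first sample, and the tail is taken to start at the first sample where that curve drops below the threshold. The function names, the synthetic impulse response, and this particular interpretation of the threshold test are illustrative and not taken from the disclosure.

```python
import numpy as np


def schroeder_decay_db(h):
    """Backward-integrated energy decay curve of h in dB (0 dB at the first sample)."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(np.maximum(energy / energy[0], 1e-30))


def truncate_tail(h, threshold_db=-60.0):
    """Drop the samples after the decay curve first falls below threshold_db."""
    edc = schroeder_decay_db(h)
    below = np.nonzero(edc < threshold_db)[0]
    end = below[0] if below.size else len(h)
    return h[:end]


# Synthetic response: Gaussian noise whose amplitude decays by 60 dB in rt60 seconds.
fs = 48000
rt60 = 0.3
t = np.arange(fs) / fs
rng = np.random.default_rng(4)
h = rng.standard_normal(fs) * 10.0 ** (-3.0 * t / rt60)
trimmed = truncate_tail(h)
```

With the synthetic response above, the retained portion should last roughly `rt60` seconds, so most of the one-second response is excluded from later convolutions.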
[0048] An additional segment of each of BRIR filters 37 may represent the impulse response caused by the impulse-generated pressure wave without the inclusion of echo effects from the room. These segments may be represented and described as head-related transfer functions (HRTFs) for BRIR filters 37, where HRTFs capture the impulse response due to the diffraction and reflection of pressure waves about the head, shoulders/torso, and outer ear as the pressure wave travels toward the ear drum. HRTF impulse responses are the result of a linear and time-invariant system (LTI) and may be modeled as minimum-phase filters. The techniques to reduce HRTF segment computation during rendering may, in some examples, include minimum-phase reconstruction and using infinite impulse response (IIR) filters to reduce an order of the original finite impulse response (FIR) filter (e.g., the HRTF filter segment).
[0049] Minimum-phase filters implemented as IIR filters may be used to approximate the HRTF filters for BRIR filters 37 with a reduced filter order. Reducing the order leads to a concomitant reduction in the number of calculations for a time-step in the frequency domain. In addition, the residual/excess filter resulting from the construction of minimum-phase filters may be used to estimate the interaural time difference (ITD) that represents the time or phase distance caused by the distance a sound pressure wave travels from a source to each ear. The ITD can then be used to model sound localization for one or both ears after computing a convolution of one or more BRIR filters 37 with the hierarchical elements that describe the sound field (i.e., determine binauralization).
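The sketch below illustrates the two ideas in this paragraph on a toy pair of impulse responses. It assumes a homomorphic (folded real cepstrum) method for the minimum-phase reconstruction and a cross-correlation peak as a stand-in for analysing the residual/excess filter; neither choice is stated in the disclosure, the IIR order-reduction step is not shown, and the names `minimum_phase` and `excess_delay` are hypothetical.

```python
import numpy as np


def minimum_phase(h, n_fft=4096):
    """Minimum-phase FIR with approximately the same magnitude response as h."""
    spectrum = np.fft.fft(h, n_fft)
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Fold the real cepstrum onto non-negative quefrencies.
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(fold * cepstrum))).real
    return h_min[:len(h)]


def excess_delay(h, h_min):
    """Lag in samples that best aligns h_min with h (cross-correlation peak)."""
    xcorr = np.correlate(h, h_min, mode="full")
    return int(np.argmax(xcorr)) - (len(h_min) - 1)


# Toy left/right responses: the same two-tap shape, arriving 3 and 5 samples late.
left = np.array([0.0, 0.0, 0.0, 1.0, 0.5, 0.0, 0.0, 0.0])
right = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.4, 0.0])
left_min = minimum_phase(left)
right_min = minimum_phase(right)
itd_samples = excess_delay(right, right_min) - excess_delay(left, left_min)
```

In this toy case the minimum-phase versions move the energy to the start of each response, and the difference between the two removed delays plays the role of the ITD.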
[0050] A still further segment of each of BRIR filters 37 is subsequent to the HRTF segment and may account for effects of the room on the impulse response. This room segment may be further decomposed into an early echoes (or "early reflection") segment and a late reverberation segment (that is, early echoes and late reverberation may each be represented by separate segments of each of BRIR filters 37). Where HRTF data is available for BRIR filters 37, onset of the early echo segment may be identified by deconvoluting the BRIR filters 37 with the HRTF to identify the HRTF segment. Subsequent to the HRTF segment is the early echo segment. Unlike the residual room response, the HRTF and early echo segments are direction-dependent in that the location of the corresponding virtual speaker determines the signal in a significant respect.
[0051] In some examples, binaural audio renderer 34 uses BRIR filters 37 prepared for the spherical harmonics domain $(\theta, \varphi)$ or other domain for the hierarchical elements that describe the sound field. That is, BRIR filters 37 may be defined in the spherical harmonics domain (SHD) as transformed BRIR filters 37 to allow binaural audio renderer 34 to perform fast convolution while taking advantage of certain properties of the data set, including the symmetry of BRIR filters 37 (e.g., left/right) and of SHCs 27'. In such examples, transformed BRIR filters 37 may be generated by multiplying (or convolving in the time-domain) the SHC rendering matrix and the original BRIR filters. Mathematically, this can be expressed according to the following equations (1)-(5):
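$$\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{left}} = \mathrm{SHC}_{(N+1)^2,L} * \mathrm{BRIR}_{L,\mathrm{left}} \qquad (1)$$
$$\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}} = \mathrm{SHC}_{(N+1)^2,L} * \mathrm{BRIR}_{L,\mathrm{right}} \qquad (2)$$
or
$$\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}} = \begin{bmatrix} Y_0^0(\theta_1, \varphi_1) & Y_0^0(\theta_2, \varphi_2) & \cdots & Y_0^0(\theta_L, \varphi_L) \\ Y_1^{-1}(\theta_1, \varphi_1) & Y_1^{-1}(\theta_2, \varphi_2) & \cdots & Y_1^{-1}(\theta_L, \varphi_L) \\ \vdots & \vdots & \ddots & \vdots \\ Y_4^4(\theta_1, \varphi_1) & Y_4^4(\theta_2, \varphi_2) & \cdots & Y_4^4(\theta_L, \varphi_L) \end{bmatrix} \begin{bmatrix} B_0 & B_1 & \cdots & B_L \end{bmatrix} \qquad (3)$$
$$\mathrm{BRIR}''_{(N+1)^2,\mathrm{left}} = \sum_{k=0}^{L} \left[ \mathrm{BRIR}'_{(N+1)^2,k,\mathrm{left}} \right] \qquad (4)$$
$$\mathrm{BRIR}''_{(N+1)^2,\mathrm{right}} = \sum_{k=0}^{L} \left[ \mathrm{BRIR}'_{(N+1)^2,k,\mathrm{right}} \right] \qquad (5)$$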
[0052] Here, (3) depicts either (1) or (2) in matrix form for fourth-order spherical harmonic coefficients (which may be an alternative way to refer to those of the spherical harmonic coefficients associated with spherical basis functions of the fourth-order or less). Equation (3) may of course be modified for higher- or lower-order spherical harmonic coefficients. Equations (4)-(5) depict the summation of the transformed left and right BRIR filters 37 over the loudspeaker dimension, L, to generate summed SHC-binaural rendering matrices (BRIR''). In combination, the summed SHC-binaural rendering matrices have dimensionality $[(N+1)^2, \mathit{Length}, 2]$, where Length is a length of the impulse response vectors to which any combination of equations (1)-(5) may be applied. In some instances of equations (1) and (2), the rendering matrix SHC may be binauralized such that equation (1) may be modified to $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{left}} = \mathrm{SHC}_{(N+1)^2,L,\mathrm{left}} * \mathrm{BRIR}_{L,\mathrm{left}}$ and equation (2) may be modified to $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}} = \mathrm{SHC}_{(N+1)^2,L,\mathrm{right}} * \mathrm{BRIR}_{L,\mathrm{right}}$.
[0053] The SHC rendering matrix presented in the above equations (1)-(3), SHC, includes elements for each order/sub-order combination of SHCs 27', which effectively define a separate SHC channel, where the element values are set for a position for the speaker, L, in the spherical harmonic domain. $\mathrm{BRIR}_{L,\mathrm{left}}$ represents the BRIR response at the left ear or position for an impulse produced at the location for the speaker, L, and is depicted in (3) using impulse response vectors $B_i$ for $\{i \mid i \in [0, L]\}$. $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{left}}$ represents one half of a "SHC-binaural rendering matrix," i.e., the SHC-binaural rendering matrix at the left ear or position for an impulse produced at the location for speakers, L, transformed to the spherical harmonics domain. $\mathrm{BRIR}'_{(N+1)^2,L,\mathrm{right}}$ represents the other half of the SHC-binaural rendering matrix.
[0054] In some examples, the techniques may include applying the SHC rendering matrix only to the HRTF and early reflection segments of respective original BRIR filters 37 to generate transformed BRIR filters 37 and an SHC-binaural rendering matrix. This may reduce a length of convolutions with SHCs 27'.
[0055] In some examples, as depicted in equations (4)-(5), the SHC-binaural rendering matrices having dimensionality that incorporates the various loudspeakers in the spherical harmonics domain may be summed to generate a $(N+1)^2 \times \mathit{Length} \times 2$ filter matrix that combines SHC rendering and BRIR rendering/mixing. That is, SHC-binaural rendering matrices for each of the L loudspeakers may be combined by, e.g., summing the coefficients over the L dimension. For SHC-binaural rendering matrices of length Length, this produces a $(N+1)^2 \times \mathit{Length} \times 2$ summed SHC-binaural rendering matrix that may be applied to an audio signal of spherical harmonics coefficients to binauralize the signal. Length may be a length of a segment of the BRIR filters segmented in accordance with techniques described herein.
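The effect of summing over the loudspeaker dimension can be checked numerically. The sketch below is illustrative only: the rendering matrix, the BRIRs, and the SHC signal are random stand-ins, only the left ear is shown, and rendering the loudspeaker feeds with the transpose of the same matrix is an assumption made so that the two paths can be compared.

```python
import numpy as np

rng = np.random.default_rng(2)
order, num_speakers, length, num_samples = 1, 6, 32, 128
num_coeffs = (order + 1) ** 2

render = rng.standard_normal((num_coeffs, num_speakers))  # stand-in SHC rendering matrix
brir_left = rng.standard_normal((num_speakers, length))   # stand-in left-ear BRIRs
shc = rng.standard_normal((num_coeffs, num_samples))      # stand-in SHC signal

# Per-loudspeaker transformed filters, shape [(N+1)^2, L, Length].
brir_prime = render[:, :, None] * brir_left[None, :, :]
# Summed over the loudspeaker dimension, shape [(N+1)^2, Length].
brir_summed = brir_prime.sum(axis=1)

# Combined path: filter each SHC channel with its row, then sum over channels.
left_fast = sum(np.convolve(shc[c], brir_summed[c]) for c in range(num_coeffs))

# Reference path: render loudspeaker feeds, convolve each with its BRIR, accumulate.
feeds = render.T @ shc
left_ref = sum(np.convolve(feeds[k], brir_left[k]) for k in range(num_speakers))
```

Because every step is linear, the two paths agree, but the combined path needs $(N+1)^2$ convolutions per ear regardless of how many virtual loudspeakers were folded into the filter matrix.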
[0056] Techniques for model reduction may also be applied to the altered rendering filters, which allows SHCs 27' (e.g., the SHC contents) to be directly filtered with the new filter matrix (a summed SHC-binaural rendering matrix). Binaural audio renderer 34 may then convert to binaural audio by summing the filtered arrays to obtain the binaural output signals 35A, 35B.
[0057] In some examples, BRIR filters 37 of audio playback system 32 represent transformed BRIR filters in the spherical harmonics domain previously computed according to any one or more of the above-described techniques. In some examples, transformation of original BRIR filters 37 may be performed at run-time.
[0058] In some examples, because the BRIR filters 37 are typically symmetric, the techniques may promote further reduction of the computation of binaural outputs 35A, 35B by using only the SHC-binaural rendering matrix for either the left or right ear. When summing SHCs 27' filtered by a filter matrix, binaural audio renderer 34 may make conditional decisions for either output signal 35A or 35B as a second channel when rendering the final output. As described herein, reference to processing content or to modifying rendering matrices described with respect to either the left or right ear should be understood to be similarly applicable to the other ear.
[0059] In this way, the techniques may provide multiple approaches to reduce a length of BRIR filters 37 in order to potentially avoid direct convolution of the excluded BRIR filter samples with multiple channels. As a result, binaural audio renderer 34 may provide efficient rendering of binaural output signals 35A, 35B from SHCs 27'.
[0060] FIG. 4 is a block diagram illustrating an example binaural room impulse response (BRIR). BRIR 40 illustrates five segments 42A-42E. The initial segment 42A and tail segment 42E both include quiet samples that may be insignificant and excluded from rendering computation. Head-related transfer function (HRTF) segment 42B includes the impulse response due to head-related transfer and may be identified using techniques described herein. Early echoes (alternatively, "early reflections") segment 42C and late room reverb segment 42D combine the HRTF with room effects, i.e., the impulse response of early echoes segment 42C matches that of the HRTF for BRIR 40 filtered by early echoes and late reverberation of the room. Early echoes segment 42C may include more discrete echoes in comparison to late room reverb segment 42D, however. The mixing time is the time between early echoes segment 42C and late room reverb segment 42D and indicates the time at which early echoes become dense reverb. The mixing time is illustrated as occurring at approximately $1.5 \times 10^4$ samples into the HRTF, or approximately $7.0 \times 10^4$ samples from the onset of HRTF segment 42B. In some examples, the techniques include computing the mixing time using statistical data and estimation from the room volume. In some examples, the perceptual mixing time with 50% confidence interval, $t_{mp50}$, is approximately 36 milliseconds (ms) and with 95% confidence interval, $t_{mp95}$, is approximately 80 ms. In some examples, late room reverb segment 42D of a filter corresponding to BRIR 40 may be synthesized using coherence-matched noise tails.
[0061] FIG. 5 is a block diagram illustrating an example systems model 50 for producing a BRIR, such as BRIR 40 of FIG. 4, in a room. The model includes cascaded systems, here room 52A and HRTF 52B. After HRTF 52B is applied to an impulse, the impulse response matches that of the HRTF filtered by early echoes of the room 52A.
[0062] FIG. 6 is a block diagram illustrating a more in-depth systems model 60 for producing a BRIR, such as BRIR 40 of FIG. 4, in a room. This model 60 also includes cascaded systems, here HRTF 62A, early echoes 62B, and residual room 62C (which combines HRTF and room echoes). Model 60 depicts the decomposition of room 52A into early echoes 62B and residual room 62C and treats each system 62A, 62B, 62C as linear time-invariant.
[0063] Early echoes 62B includes more discrete echoes than residual room 62C. Accordingly, early echoes 62B may vary per virtual speaker channel, while residual room 62C having a longer tail may be synthesized as a single stereo copy. For some measurement mannequins used to obtain a BRIR, HRTF data may be available as measured in an anechoic chamber. Early echoes 62B may be determined by deconvoluting the BRIR and the HRTF data to identify the location of early echoes (which may be referred to as "reflections"). In some examples, HRTF data is not readily available and the techniques for identifying early echoes 62B include blind estimation. However, a straightforward approach may include regarding the first few milliseconds (e.g., the first 5, 10, 15, or 20 ms) as direct impulse filtered by the HRTF. As noted above, the techniques may include computing the mixing time using statistical data and estimation from the room volume.
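The straightforward time-based approach mentioned above can be sketched as follows. The sketch assumes that the onset is the first sample whose magnitude reaches a small fraction of the peak, that the direct (HRTF) part lasts a fixed number of milliseconds after the onset, and that the mixing time is measured from the onset; these choices, the default values, and the function name `segment_brir` are illustrative assumptions. Trimming of the quiet tail is not shown here.

```python
import numpy as np


def segment_brir(brir, fs, onset_threshold=1e-3, direct_ms=5.0, mixing_ms=80.0):
    """Split a BRIR into (initial, direct, early, late) segments by time.

    onset_threshold is relative to the peak magnitude; direct_ms and mixing_ms
    are both measured from the detected onset.
    """
    peak = np.max(np.abs(brir))
    onset = int(np.argmax(np.abs(brir) >= onset_threshold * peak))
    direct_end = min(len(brir), onset + int(round(direct_ms * 1e-3 * fs)))
    mixing = min(len(brir), onset + int(round(mixing_ms * 1e-3 * fs)))
    mixing = max(mixing, direct_end)
    return brir[:onset], brir[onset:direct_end], brir[direct_end:mixing], brir[mixing:]


# Toy response at 48 kHz: direct sound at sample 100, two later reflections.
fs = 48000
brir = np.zeros(fs)
brir[100] = 1.0
brir[2000] = 0.3
brir[10000] = 0.01
initial, direct, early, late = segment_brir(brir, fs)
```

With these defaults the direct segment spans 5 ms and the early segment ends 80 ms after the onset, matching the example figures quoted in the text for the direct window and the 95% perceptual mixing time.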
[0064] In some examples, the techniques may include synthesizing one or more BRIR filters for residual room 62C. After the mixing time, BRIR reverb tails (represented as system residual room 62C in FIG. 6) can be interchanged in some instances without perceptual penalty. Further, the BRIR reverb tails can be synthesized with Gaussian white noise that matches the Energy Decay Relief (EDR) and Frequency-Dependent Interaural Coherence (FDIC). In some examples, a common synthetic BRIR reverb tail may be generated for BRIR filters. In some examples, the common EDR may be an average of the EDRs of all speakers or may be the front zero degree EDR with energy matching to the average energy. In some examples, the FDIC may be an average FDIC across all speakers or may be the minimum value across all speakers for a maximally decorrelated measure for spaciousness. In some examples, reverb tails can also be simulated with artificial reverb with Feedback Delay Networks (FDN).
[0065] With a common reverb tail, the later portion of a corresponding BRIR filter may be excluded from separate convolution with each speaker feed, but instead may be applied once onto the mix of all speaker feeds. As described above, and in further detail below, the mixing of all speaker feeds can be further simplified with spherical harmonic coefficients signal rendering.
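The saving from a common reverb tail follows from the linearity of convolution, which the sketch below checks numerically. The tail here is simply exponentially decaying Gaussian noise used as a stand-in; it is not matched to any EDR or FDIC, and the sizes and decay constant are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
num_speakers, num_samples, tail_len = 5, 256, 64
feeds = rng.standard_normal((num_speakers, num_samples))  # stand-in speaker feeds

# Stand-in common tail: exponentially decaying Gaussian noise (no EDR/FDIC matching).
decay = np.exp(-np.arange(tail_len) / 16.0)
common_tail = decay * rng.standard_normal(tail_len)

# Convolving every feed separately with the same tail ...
per_feed = sum(np.convolve(feeds[k], common_tail) for k in range(num_speakers))
# ... gives the same result as convolving the mix of all feeds once.
mix_once = np.convolve(feeds.sum(axis=0), common_tail)
```

One convolution replaces `num_speakers` convolutions per ear, and because the tail is usually the longest part of a BRIR, this is typically where most of the computation would otherwise go.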
[0066] FIG. 7 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure. While illustrated as a single device, i.e., audio playback device 100 in the example of FIG. 7, the techniques may be performed by one or more devices. Accordingly, the techniques should not be limited in this respect.
[0067] As shown in the example of FIG. 7, audio playback device 100 may include an extraction unit 104 and a binaural rendering unit 102. The extraction unit 104 may represent a unit configured to extract encoded audio data from bitstream 120. The extraction unit 104 may forward the extracted encoded audio data in the form of spherical harmonic coefficients (SHCs) 122 (which may also be referred to as higher-order ambisonics (HOA) in that the SHCs 122 may include at least one coefficient associated with an order greater than one) to the binaural rendering unit 102.
[0068] In some examples, audio playback device 100 includes an audio decoding unit configured to decode the encoded audio data so as to generate the SHCs 122. The audio decoding unit may perform an audio decoding process that is in some aspects reciprocal to the audio encoding process used to encode SHCs 122. The audio decoding unit may include a time-frequency analysis unit configured to transform SHCs of encoded audio data from the time domain to the frequency domain, thereby generating the SHCs 122. That is, when the encoded audio data represents a compressed form of the SHC 122 that is not converted from the time domain to the frequency domain, the audio decoding unit may invoke the time-frequency analysis unit to convert the SHCs from the time domain to the frequency domain so as to generate SHCs 122 (specified in the frequency domain). The time-frequency analysis unit may apply any form of Fourier-based transform, including a fast Fourier transform (FFT), a discrete cosine transform (DCT), a modified discrete cosine transform (MDCT), and a discrete sine transform (DST) to provide a few examples, to transform the SHCs from the time domain to SHCs 122 in the frequency domain. In some instances, SHCs 122 may already be specified in the frequency domain in bitstream 120. In these instances, the time-frequency analysis unit may pass SHCs 122 to the binaural rendering unit 102 without applying a transform or otherwise transforming the received SHCs 122. While described with respect to SHCs 122 specified in the frequency domain, the techniques may be performed with respect to SHCs 122 specified in the time domain.
[0069] Binaural rendering unit 102 represents a unit configured to binauralize SHCs 122. Binaural rendering unit 102 may, in other words, represent a unit configured to render the SHCs 122 to a left and right channel, which may feature spatialization to model how the left and right channel would be heard by a listener in a room in which the SHCs 122 were recorded. The binaural rendering unit 102 may render SHCs 122 to generate a left channel 136A and a right channel 136B (which may collectively be referred to as "channels 136") suitable for playback via a headset, such as headphones. As shown in the example of FIG. 7, the binaural rendering unit 102 includes BRIR filters 108, a BRIR conditioning unit 106, a residual room response unit 110, a BRIR SHC-domain conversion unit 112, a convolution unit 114, and a combination unit 116.
[0070] BRIR filters 108 include one or more BRIR filters and may represent an example of BRIR filters 37 of FIG. 3. BRIR filters 108 may include separate BRIR filters 126A, 126B representing the effect of the left and right HRTF on the respective BRIRs.
[0071] BRIR conditioning unit 106 receives L instances of BRIR filters 126A, 126B, one for each virtual loudspeaker L and with each BRIR filter having length N. BRIR filters 126A, 126B may already be conditioned to remove quiet samples. BRIR conditioning unit 106 may apply techniques described above to segment BRIR filters 126A, 126B to identify respective HRTF, early reflection, and residual room segments. BRIR conditioning unit 106 provides the HRTF and early reflection segments to BRIR SHC-domain conversion unit 112 as matrices 129A, 129B representing left and right matrices of size [a, L], where a is a length of the concatenation of the HRTF and early reflection segments and L is a number of loudspeakers (virtual or real). BRIR conditioning unit 106 provides the residual room segments of BRIR filters 126A, 126B to residual room response unit 110 as left and right residual room matrices 128A, 128B of size [b, L], where b is a length of the residual room segments and L is a number of loudspeakers (virtual or real).
[0072] Residual room response unit 110 may apply techniques described above to compute or otherwise determine left and right common residual room response segments for convolution with at least some portion of the hierarchical elements (e.g., spherical harmonic coefficients) describing the sound field, as represented in FIG. 7 by SHCs 122. That is, residual room response unit 110 may receive left and right residual room matrices 128A, 128B and combine respective left and right residual room matrices 128A, 128B over L to generate left and right common residual room response segments. Residual room response unit 110 may perform the combination by, in some instances, averaging the left and right residual room matrices 128A, 128B over L.
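The segmentation and averaging described above may be sketched as follows. This is a minimal illustration only: it assumes the split point a is already known (the segmentation techniques themselves are not reproduced), and the function names, array sizes, and random stand-in data are hypothetical rather than taken from this disclosure.

```python
import numpy as np

def split_brirs(brirs, a):
    """Split BRIRs of shape [K, L] into a direction-dependent part [a, L]
    and a residual room part [K - a, L]. The split point a is assumed known."""
    return brirs[:a, :], brirs[a:, :]

def common_residual(residual):
    """Average a residual room matrix of shape [b, L] over the L loudspeakers."""
    return residual.mean(axis=1)

rng = np.random.default_rng(0)
K, L, a = 64, 5, 24                       # illustrative sizes only
brirs_left = rng.standard_normal((K, L))  # stand-in for the left BRIR filters
head_left, residual_left = split_brirs(brirs_left, a)
common_left = common_residual(residual_left)   # shape [b], with b = K - a
```

The same two calls would be repeated for the right-ear filters to obtain the right common residual room response segment.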
[0073] Residual room response unit 110 may then compute a fast convolution of the left and right common residual room response segments with at least one channel of SHCs 122, illustrated in FIG. 7 as channel(s) 124B. In some examples, because left and right common residual room response segments represent ambient, non-directional sound, channel(s) 124B is the W channel (i.e., 0th order) of the SHCs 122 channels, which encodes the non-directional portion of a sound field. In such examples, for a W channel sample of length Length, fast convolution by residual room response unit 110 with left and right common residual room response segments produces left and right output signals 134A, 134B of length Length.
[0074] As used herein, the terms "fast convolution" and "convolution" may refer to a convolution operation in the time domain as well as to a point-wise multiplication operation in the frequency domain. In other words, and as is well known to those skilled in the art of signal processing, convolution in the time domain is equivalent to point-wise multiplication in the frequency domain, where the time and frequency domains are transforms of one another. The output transform is the point-wise product of the input transform with the transfer function. Accordingly, convolution and point-wise multiplication (or simply "multiplication") can refer to conceptually similar operations made with respect to the respective domains (time and frequency, herein). Convolution units 114, 214, 230; residual room response units 210, 354; filters 384 and reverb 386; may alternatively apply multiplication in the frequency domain, where the inputs to these components are provided in the frequency domain rather than the time domain. Other operations described herein as "fast convolution" or "convolution" may, similarly, also refer to multiplication in the frequency domain, where the inputs to these operations are provided in the frequency domain rather than the time domain.
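The equivalence between time-domain convolution and point-wise multiplication in the frequency domain may be checked numerically. The sketch below uses NumPy and random stand-in signals; the variable names are illustrative. Both inputs are zero-padded to the full output length so that the frequency-domain product corresponds to a linear (not circular) convolution.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(32)      # input signal
h = rng.standard_normal(8)       # filter impulse response

# Direct (time-domain) linear convolution.
y_time = np.convolve(x, h)

# Frequency-domain route: zero-pad both inputs to the full output length,
# multiply the transforms point-wise, and transform back.
n = len(x) + len(h) - 1
y_freq = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
```

The two results agree to within floating-point error.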
[0075] In some examples, residual room response unit 110 may receive, from BRIR conditioning unit 106, a value for an onset time of the common residual room response segments. Residual room response unit 110 may zero-pad or otherwise delay the output signals 134A, 134B in anticipation of combination with earlier segments for the BRIR filters 108.
[0076] BRIR SHC-domain conversion unit 112 (hereinafter "domain conversion unit 112") applies an SHC rendering matrix to BRIR matrices to potentially convert the left and right BRIR filters 126A, 126B to the spherical harmonic domain and then to potentially sum the filters over L. Domain conversion unit 112 outputs the conversion result as left and right SHC-binaural rendering matrices 130A, 130B, respectively. Where matrices 129A, 129B are of size [a, L], each of SHC-binaural rendering matrices 130A, 130B is of size [(N+1)^2, a] after summing the filters over L (see equations (4)-(5) for example). In some examples, SHC-binaural rendering matrices 130A, 130B are configured in audio playback device 100 rather than being computed at run-time or a setup-time. In some examples, multiple instances of SHC-binaural rendering matrices 130A, 130B are configured in audio playback device 100, and audio playback device 100 selects a left/right pair of the multiple instances to apply to SHCs 124A.
[0077] Convolution unit 114 convolves left and right binaural rendering matrices 130A, 130B with SHCs 124A, which may in some examples be reduced in order from the order of SHCs 122. For SHCs 124A in the frequency (e.g., SHC) domain, convolution unit 114 may compute respective point-wise multiplications of SHCs 124A with left and right binaural rendering matrices 130A, 130B. For an SHC signal of length Length, the convolution results in left and right filtered SHC channels 132A, 132B of size [Length, (N+1)^2], there typically being a row for each output signals matrix for each order/sub-order combination of the spherical harmonics domain.
[0078] Combination unit 116 may combine left and right filtered SHC channels 132A, 132B with output signals 134A, 134B to produce binaural output signals 136A, 136B. Combination unit 116 may then separately sum each left and right filtered SHC channels 132A, 132B over L to produce left and right binaural output signals for the HRTF and early echoes (reflection) segments prior to combining the left and right binaural output signals with left and right output signals 134A, 134B to produce binaural output signals 136A, 136B.
[0079] FIG. 8 is a block diagram illustrating an example of an audio playback device that may perform various aspects of the binaural audio rendering techniques described in this disclosure. Audio playback device 200 may represent an example instance of audio playback device 100 of FIG. 7 in further detail.
[0080] Audio playback device 200 may include an optional SHCs order reduction unit 204 that processes inbound SHCs 242 from bitstream 240 to reduce an order of the SHCs 242. SHCs order reduction unit 204 provides the highest-order (e.g., 0th order) channel 262 of SHCs 242 (e.g., the W channel) to residual room response unit 210, and provides reduced-order SHCs 242 to convolution unit 230. In instances in which SHCs order reduction unit 204 does not reduce an order of SHCs 242, convolution unit 230 receives SHCs 272 that are identical to SHCs 242. In either case, SHCs 272 have dimensions [Length, (N+1)^2], where N is the order of SHCs 272.
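One simple way to picture the order reduction is truncation of the channel dimension. The sketch below is an assumption-laden illustration: it supposes the SHC channels are stored in ascending order with the 0th-order (W) channel first, which this disclosure does not specify, and the function name reduce_order is hypothetical.

```python
import numpy as np

def reduce_order(shcs, n_out):
    """Keep the first (n_out + 1)^2 channels of SHCs shaped [Length, (N_in + 1)^2].

    Assumes channels are sorted by ascending order with W at index 0.
    Returns the reduced SHCs and the 0th-order (W) channel.
    """
    keep = (n_out + 1) ** 2
    if keep > shcs.shape[1]:
        raise ValueError("n_out exceeds the order of the input SHCs")
    return shcs[:, :keep], shcs[:, 0]

shcs_in = np.arange(8 * 16, dtype=float).reshape(8, 16)   # Length 8, order 3
shcs_out, w = reduce_order(shcs_in, 1)                    # order 1 -> 4 channels
```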
[0081] BRIR conditioning unit 206 and BRIR filters 208 may represent example instances of BRIR conditioning unit 106 and BRIR filters 108 of FIG. 7. Convolution unit 214 of residual room response unit 210 receives common left and right residual room segments 244A, 244B conditioned by BRIR conditioning unit 206 using techniques described above, and convolution unit 214 convolves the common left and right residual room segments 244A, 244B with highest-order channel 262 to produce left and right residual room signals 262A, 262B. Delay unit 216 may zero-pad the left and right residual room signals 262A, 262B with the onset number of samples to the common left and right residual room segments 244A, 244B to produce left and right residual room output signals 268A, 268B.
[0082] BRIR SHC-domain conversion unit 220 (hereinafter, domain conversion unit 220) may represent an example instance of domain conversion unit 112 of FIG. 7. In the illustrated example, transform unit 222 applies an SHC rendering matrix 224 of (N+1)^2 dimensionality to matrices 248A, 248B representing left and right matrices of size [a, L], where a is a length of the concatenation of the HRTF and early reflection segments and L is a number of loudspeakers (e.g., virtual loudspeakers). Transform unit 222 outputs left and right matrices 252A, 252B in the SHC-domain having dimensions [(N+1)^2, a, L]. Summation unit 226 may sum each of left and right matrices 252A, 252B over L to produce left and right intermediate SHC-rendering matrices 254A, 254B having dimensions [(N+1)^2, a]. Reduction unit 228 may apply techniques described above to further reduce computation complexity of applying SHC-rendering matrices to SHCs 272, such as minimum-phase reduction and using Balanced Model Truncation methods to design IIR filters to approximate the frequency response of the respective minimum phase portions of intermediate SHC-rendering matrices 254A, 254B that have had minimum-phase reduction applied. Reduction unit 228 outputs left and right SHC-rendering matrices 256A, 256B.
[0083] Convolution unit 230 filters the SHC contents in the form of SHCs 272 to produce intermediate signals 258A, 258B, which summation unit 232 sums to produce left and right signals 260A, 260B. Combination unit 234 combines left and right residual room output signals 268A, 268B and left and right signals 260A, 260B to produce left and right binaural output signals 270A, 270B.
[0084] In some examples, binaural rendering unit 202 may implement further reductions to computation by using only one of the SHC-binaural rendering matrices 252A, 252B generated by transform unit 222. As a result, convolution unit 230 may operate on just one of the left or right signals, reducing convolution operations by half. Summation unit 232, in such examples, makes conditional decisions for the second channel when rendering the outputs 260A, 260B.
[0085] FIG. 9 is a flowchart illustrating an example mode of operation for a binaural rendering device to render spherical harmonic coefficients according to techniques described in this disclosure. For illustration purposes, the example mode of operation is described with respect to audio playback device 200 of FIG. 8. Binaural room impulse response (BRIR) conditioning unit 206 conditions left and right BRIR filters 246A, 246B, respectively, by extracting direction-dependent components/segments from the BRIR filters 246A, 246B, specifically the head-related transfer function and early echoes segments (300). Each of left and right BRIR filters 246A, 246B may include BRIR filters for one or more corresponding loudspeakers. BRIR conditioning unit 206 provides a concatenation of the extracted head-related transfer function and early echoes segments to BRIR SHC-domain conversion unit 220 as left and right matrices 248A, 248B.
[0086] BRIR SHC-domain conversion unit 220 applies an HOA rendering matrix 224 to transform left and right filter matrices 248A, 248B including the extracted head-related transfer function and early echoes segments to generate left and right filter matrices 252A, 252B in the spherical harmonic (e.g., HOA) domain (302). In some examples, audio playback device 200 may be configured with left and right filter matrices 252A, 252B. In some examples, audio playback device 200 receives BRIR filters 208 in an out-of-band or in-band signal of bitstream 240, in which case audio playback device 200 generates left and right filter matrices 252A, 252B. Summation unit 226 sums the respective left and right filter matrices 252A, 252B over the loudspeaker dimension to generate a binaural rendering matrix in the SHC domain that includes left and right intermediate SHC-rendering matrices 254A, 254B (304). A reduction unit 228 may further reduce the intermediate SHC-rendering matrices 254A, 254B to generate left and right SHC-rendering matrices 256A, 256B.
[0087] A convolution unit 230 of binaural rendering unit 202 applies the left and right SHC-rendering matrices 256A, 256B to SHC content (such as spherical harmonic coefficients 272) to produce left and right filtered SHC (e.g., HOA) channels 258A, 258B (306).
[0088] Summation unit 232 sums each of the left and right filtered SHC channels 258A, 258B over the SHC dimension, (N+1)^2, to produce left and right signals 260A, 260B for the direction-dependent segments (308). Combination unit 234 may then combine the left and right signals 260A, 260B with left and right residual room output signals 268A, 268B to generate a binaural output signal including left and right binaural output signals 270A, 270B.
[0089] FIG. 10A is a diagram illustrating an example mode of operation 310 that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure. Mode of operation 310 is described hereinafter with respect to audio playback device 200 of FIG. 8. Binaural rendering unit 202 of audio playback device 200 may be configured with BRIR data 312, which may be an example instance of BRIR filters 208, and HOA rendering matrix 314, which may be an example instance of HOA rendering matrix 224. Audio playback device 200 may receive BRIR data 312 and HOA rendering matrix 314 in an in-band or out-of-band signaling channel vis-a-vis the bitstream 240. BRIR data 312 in this example has L filters representing, for instance, L real or virtual loudspeakers, each of the L filters being length K. Each of the L filters may include left and right components ("x 2"). In some cases, each of the L filters may include a single component for left or right, which is symmetrical to its counterpart: right or left. This may reduce a cost of fast convolution.
[0090] BRIR conditioning unit 206 of audio playback device 200 may condition the BRIR data 312 by applying segmentation and combination operations. Specifically, in the example mode of operation 310, BRIR conditioning unit 206 segments each of the L filters according to techniques described herein into HRTF plus early echo segments of combined length a to produce matrix 315 (dimensionality [a, 2, L]) and into residual room response segments to produce residual matrix 339 (dimensionality [b, 2, L]) (324). The length K of the L filters of BRIR data 312 is approximately the sum of a and b. Transform unit 222 may apply HOA/SHC rendering matrix 314 of (N+1)^2 dimensionality to the L filters of matrix 315 to produce matrix 317 (which may be an example instance of a combination of left and right matrices 252A, 252B) of dimensionality [(N+1)^2, a, 2, L]. Summation unit 226 may sum each of left and right matrices 252A, 252B over L to produce intermediate SHC-rendering matrix 335 having dimensionality [(N+1)^2, a, 2] (the third dimension having value 2 representing left and right components; intermediate SHC-rendering matrix 335 may represent an example instance of both left and right intermediate SHC-rendering matrices 254A, 254B) (326). In some examples, audio playback device 200 may be configured with intermediate SHC-rendering matrix 335 for application to the HOA content 316 (or reduced version thereof, e.g., HOA content 321). In some examples, reduction unit 228 may apply further reductions to computation by using only one of the left or right components of matrix 317 (328).
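The dimensionalities in operations (324) and (326) may be traced with a short NumPy sketch. It assumes, for illustration only, that the rendering matrix has shape [(N+1)^2, L] and is applied as a per-loudspeaker weighting of each filter; the exact form of the transform is given by equations (4)-(5), which are not reproduced here. The names render, m315, m317, and m335 simply mirror the reference numerals above, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(2)
N, a, L = 2, 12, 6                      # illustrative order and sizes
K = (N + 1) ** 2                        # number of SHC channels
render = rng.standard_normal((K, L))    # hypothetical HOA rendering matrix
m315 = rng.standard_normal((a, 2, L))   # HRTF + early echo segments, [a, 2, L]

# Weight every loudspeaker filter by each SHC rendering coefficient.
m317 = np.einsum('kl,tel->ktel', render, m315)   # [(N+1)^2, a, 2, L]

# Sum over the loudspeaker dimension.
m335 = m317.sum(axis=-1)                          # [(N+1)^2, a, 2]
```

In practice the intermediate four-dimensional array need not be materialized; a single contraction over L yields the same matrix 335.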
[0091] Audio playback device 200 receives HOA content 316 of order N_I and length Length and, in some aspects, applies an order reduction operation to reduce the order of the spherical harmonic coefficients (SHCs) therein to N (330). N_I indicates the order of the (I)nput HOA content 321. The HOA content 321 of order reduction operation (330) is, like HOA content 316, in the SHC domain. The optional order reduction operation also generates and provides the highest-order (e.g., the 0th order) signal 319 to residual response unit 210 for a fast convolution operation (338). In instances in which HOA order reduction unit 204 does not reduce an order of HOA content 316, the apply fast convolution operation (332) operates on input that does not have a reduced order. In either case, HOA content 321 input to the fast convolution operation (332) has dimensions [Length, (N+1)^2], where N is the order.
[0092] Audio playback device 200 may apply fast convolution of HOA content 321 with matrix 335 to produce HOA signal 323 having left and right components thus dimensions [Length, (N+1)^2, 2] (332). Again, fast convolution may refer to point-wise multiplication of the HOA content 321 and matrix 335 in the frequency domain or convolution in the time domain. Audio playback device 200 may further sum HOA signal 323 over (N+1)^2 to produce a summed signal 325 having dimensions [Length, 2] (334).
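Operations (332) and (334) may be sketched as an FFT-based convolution followed by a sum over the SHC dimension. The example is a sketch under assumptions: the data is random, the variable names mirror the reference numerals, and the output is truncated to Length samples to match the stated dimensions (how the trailing samples are handled in a block-based implementation is not addressed).

```python
import numpy as np

rng = np.random.default_rng(3)
length, a, K = 40, 12, 9                    # K stands for (N+1)^2
hoa = rng.standard_normal((length, K))      # stand-in for HOA content 321
m335 = rng.standard_normal((K, a, 2))       # stand-in for matrix 335

n_fft = length + a - 1                      # long enough for linear convolution
hoa_f = np.fft.rfft(hoa, n_fft, axis=0)     # [F, K]
flt_f = np.fft.rfft(m335, n_fft, axis=1)    # [K, F, 2]

# Point-wise multiplication per SHC channel and per ear.
prod = hoa_f[:, :, None] * flt_f.transpose(1, 0, 2)        # [F, K, 2]
hoa_signal = np.fft.irfft(prod, n_fft, axis=0)[:length]    # [Length, K, 2]

# Sum over the SHC dimension.
summed = hoa_signal.sum(axis=1)                             # [Length, 2]
```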
[0093] Returning now to residual matrix 339, audio playback device 200 may combine the L residual room response segments, in accordance with techniques herein described, to generate a common residual room response matrix 327 having dimensions [b, 2] (336). Audio playback device 200 may apply fast convolution of the 0th order HOA signal 319 with the common residual room response matrix 327 to produce room response signal 329 having dimensions [Length, 2] (338). Because, to generate the L residual room response segments of residual matrix 339, audio playback device 200 obtained the residual room response segments starting at the (a+1)th samples of the L filters of BRIR data 312, audio playback device 200 accounts for the initial a samples by delaying (e.g., padding) a samples to generate room response signal 311 having dimensions [Length, 2] (340).
[0094] Audio playback device 200 combines summed signal 325 with room response signal 311 by adding the elements to produce output signal 318 having dimensions [Length, 2] (342). In this way, audio playback device 200 may avoid applying fast convolution for each of the L residual room response segments. For a 22-channel input converted to a binaural audio output signal, this may reduce the number of fast convolutions for generating the residual room response from 22 to 2.
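The residual path of operations (338) and (340) may be sketched as two convolutions (one per ear) followed by a delay of a samples. The function name delayed_room_response and the impulse test signal are hypothetical, and the output is truncated to Length samples to match the stated dimensions.

```python
import numpy as np

def delayed_room_response(w, common_tail, a):
    """Convolve the 0th-order channel w with a common tail of shape [b, 2],
    then delay the result by a samples. Returns an array of shape [Length, 2]."""
    length = len(w)
    out = np.zeros((length, 2))
    if a >= length:
        return out                              # delay exceeds the frame
    for ear in range(2):                        # two convolutions in total
        y = np.convolve(w, common_tail[:, ear])[:length]
        out[a:, ear] = y[:length - a]           # prepend a zeros, keep Length samples
    return out

w = np.zeros(16)
w[0] = 1.0                                      # unit impulse as a test input
common_tail = np.arange(1.0, 9.0).reshape(4, 2) # b = 4, columns are left/right
room = delayed_room_response(w, common_tail, a=3)
```

With an impulse input, each column of the result is the corresponding tail shifted by a samples, which makes the delay easy to inspect. Adding this array element-wise to the summed direction-dependent signal gives the combined output of operation (342).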
[0095] FIG. 10B is a diagram illustrating an example mode of operation 350 that may be performed by the audio playback devices of FIGS. 7 and 8 in accordance with various aspects of the techniques described in this disclosure. Mode of operation 350 is described hereinafter with respect to audio playback device 200 of FIG. 8 and is similar to mode of operation 310. However, mode of operation 350 includes first rendering the HOA content into multichannel speaker signals in the time domain for L real or virtual loudspeakers, and then applying efficient BRIR filtering on each of the speaker feeds, in accordance with techniques described herein. To that end, audio playback device 200 transforms HOA content 321 to multichannel audio signal 333 having dimensions [Length, L] (344). In addition, audio playback device 200 does not transform BRIR data 312 to the SHC domain. Accordingly, applying reduction by audio playback device 200 to signal 314 generates matrix 337 having dimensions [a, 2, L] (328).
[0096] Audio playback device 200 then applies fast convolution 332 of multichannel audio signal 333 with matrix 337 to produce multichannel audio signal 341 having dimensions [Length, L, 2] (with left and right components) (348). Audio playback device 200 may then sum the multichannel audio signal 341 by the L channels/speakers to produce signal 325 having dimensions [Length, 2] (346).
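Mode of operation 350 may be sketched as a rendering step followed by per-loudspeaker filtering and a sum over L. The sketch assumes a linear HOA-to-loudspeaker rendering matrix of shape [L, (N+1)^2]; that shape, the variable names, and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
length, K, L, a = 30, 4, 5, 8
hoa = rng.standard_normal((length, K))     # HOA content, [Length, (N+1)^2]
render = rng.standard_normal((L, K))       # hypothetical HOA-to-loudspeaker matrix
m337 = rng.standard_normal((a, 2, L))      # truncated BRIRs, [a, 2, L]

# Render HOA content to loudspeaker feeds (344).
feeds = hoa @ render.T                     # [Length, L]

# Filter each feed with its left and right truncated BRIR (348).
filtered = np.zeros((length, L, 2))
for spk in range(L):
    for ear in range(2):
        y = np.convolve(feeds[:, spk], m337[:, ear, spk])
        filtered[:, spk, ear] = y[:length]

# Sum over the L loudspeakers (346).
binaural = filtered.sum(axis=1)            # [Length, 2]
```

Because every step here is linear, this sketch produces the same direction-dependent output as first folding the rendering matrix into the filters and then filtering the HOA channels directly; the two orderings differ in the number of convolutions, which depends on how L compares with (N+1)^2.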
[0097] FIG. 11 is a block diagram illustrating an example of an audio playback device 350 that may perform various aspects of the binaural audio rendering techniques described in this disclosure. While illustrated as a single device, i.e., audio playback device 350 in the example of FIG. 11, the techniques may be performed by one or more devices. Accordingly, the techniques should not be limited in this respect.
[0098] Moreover, while generally described above with respect to the examples of FIGS. 1-10B as being applied in the spherical harmonics domain, the techniques may also be implemented with respect to any form of audio signals, including channel-based signals that conform to the above noted surround sound formats, such as the 5.1 surround sound format, the 7.1 surround sound format, and/or the 22.2 surround sound format. The techniques should therefore also not be limited to audio signals specified in the spherical harmonic domain, but may be applied with respect to any form of audio signal.
[0099] As shown in the example of FIG. 11, the audio playback device 350 may be similar to the audio playback device 100 shown in the example of FIG. 7. However, the audio playback device 350 may operate or otherwise perform the techniques with respect to general channel-based audio signals that, as one example, conform to the 22.2 surround sound format. The extraction unit 104 may extract audio channels 352, where audio channels 352 may generally include "n" channels, and are assumed to include, in this example, 22 channels that conform to the 22.2 surround sound format. These channels 352 are provided to both residual room response unit 354 and per-channel truncated filter unit 356 of the binaural rendering unit 351.
[0100] As described above, the BRIR filters 108 include one or more BRIR filters and may represent an example of the BRIR filters 37 of FIG. 3. The BRIR filters 108 may include the separate BRIR filters 126A, 126B representing the effect of the left and right HRTF on the respective BRIRs.
[0101] The BRIR conditioning unit 106 receives n instances of the BRIR filters 126A, 126B, one for each channel n and with each BRIR filter having length N. The BRIR filters 126A, 126B may already be conditioned to remove quiet samples. The BRIR conditioning unit 106 may apply techniques described above to segment the BRIR filters 126A, 126B to identify respective HRTF, early reflection, and residual room segments. The BRIR conditioning unit 106 provides the HRTF and early reflection segments to the per-channel truncated filter unit 356 as matrices 129A, 129B representing left and right matrices of size [a, n], where a is a length of the concatenation of the HRTF and early reflection segments and n is a number of loudspeakers (virtual or real). The BRIR conditioning unit 106 provides the residual room segments of BRIR filters 126A, 126B to residual room response unit 354 as left and right residual room matrices 128A, 128B of size [b, n], where b is a length of the residual room segments and n is a number of loudspeakers (virtual or real).
[0102] The residual room response unit 354 may apply techniques described above to compute or otherwise determine left and right common residual room response segments for convolution with the audio channels 352. That is, residual room response unit 354 may receive the left and right residual room matrices 128A, 128B and combine the respective left and right residual room matrices 128A, 128B over n to generate left and right common residual room response segments. The residual room response unit 354 may perform the combination by, in some instances, averaging the left and right residual room matrices 128A, 128B over n.
[0103] The residual room response unit 354 may then compute a fast convolution of the left and right common residual room response segments with at least one of audio channels 352. In some examples, the residual room response unit 354 may receive, from the BRIR conditioning unit 106, a value for an onset time of the common residual room response segments. Residual room response unit 354 may zero-pad or otherwise delay the output signals 134A, 134B in anticipation of combination with earlier segments for the BRIR filters 108. The output signals 134A may represent left audio signals while the output signals 134B may represent right audio signals.
[0104] The per-channel truncated filter unit 356 (hereinafter "truncated filter unit 356") may apply the HRTF and early reflection segments of the BRIR filters to the channels 352. More specifically, the per-channel truncated filter unit 356 may apply the matrices 129A and 129B representative of the HRTF and early reflection segments of the BRIR filters to each one of the channels 352. In some instances, the matrices 129A and 129B may be combined to form a single matrix 129. Moreover, typically, there is a left one of each of the HRTF and early reflection matrices 129A and 129B and a right one of each of the HRTF and early reflection matrices 129A and 129B. That is, there is typically an HRTF and early reflection matrix for the left ear and the right ear. The per-channel truncated filter unit 356 may apply each of the left and right matrices 129A, 129B to output left and right filtered channels 358A and 358B. The combination unit 116 may combine (or, in other words, mix) the left filtered channels 358A with the output signals 134A, while combining (or, in other words, mixing) the right filtered channels 358B with the output signals 134B to produce binaural output signals 136A, 136B. The binaural output signal 136A may correspond to a left audio channel, and the binaural output signal 136B may correspond to a right audio channel.
[0105] In some examples, the binaural rendering unit 351 may invoke the residual room response unit 354 and the per-channel truncated filter unit 356 concurrent to one another such that the residual room response unit 354 operates concurrent to the operation of the per-channel truncated filter unit 356. That is, in some examples, the residual room response unit 354 may operate in parallel (but often not simultaneously) with the per-channel truncated filter unit 356, often to improve the speed with which the binaural output signals 136A, 136B may be generated. While shown in various FIGS. above as potentially operating in a cascaded fashion, the techniques may provide for concurrent or parallel operation of any of the units or modules described in this disclosure, unless specifically indicated otherwise.
[0106] FIG. 12 is a diagram illustrating a process 380 that may be performed by the audio playback device 350 of FIG. 11 in accordance with various aspects of the techniques described in this disclosure. Process 380 achieves a decomposition of each BRIR into two parts: (a) smaller components which incorporate the effects of HRTF and early reflections represented by left filters 384AL-384NL and by right filters 384AR-384NR (collectively, "filters 384") and (b) a common 'reverb tail' that is generated from properties of all the tails of the original BRIRs and represented by left reverb filter 386L and right reverb filter 386R (collectively, "common filters 386"). The per-channel filters 384 shown in the process 380 may represent part (a) noted above, while the common filters 386 shown in the process 380 may represent part (b) noted above.
[0107] The process 380 performs this decomposition by analyzing the BRIRs to eliminate inaudible components and determine components which comprise the HRTF/early reflections and components due to late reflections/diffusion. This results in an FIR filter of length, as one example, 2704 taps, for part (a) and an FIR filter of length, as another example, 15232 taps for part (b). According to the process 380, the audio playback device 350 may apply only the shorter FIR filters to each of the individual n channels, which is assumed to be 22 for purposes of illustration, in operation 396. The complexity of this operation may be represented in the first part of computation (using a 4096 point FFT) in Equation (8) reproduced below. In the process 380, the audio playback device 350 may apply the common 'reverb tail' not to each of the 22 channels but rather to an additive mix of them all in operation 398. This complexity is represented in the second half of the complexity calculation in Equation (8), again which is shown in the attached Appendix.
[0108] In this respect, the process 380 may represent a method of binaural audio rendering that generates a composite audio signal, based on mixing audio content from a plurality of N channels. In addition, process 380 may further align the composite audio signal, by a delay, with the output of N channel filters, wherein each channel filter includes a truncated BRIR filter. Moreover, in process 380, the audio playback device 350 may then filter the aligned composite audio signal with a common synthetic residual room impulse response in operation 398 and mix the output of each channel filter with the filtered aligned composite audio signal in operations 390L and 390R for the left and right components of binaural audio output 388L, 388R.
[0109] In some examples, the truncated BRIR filter and the common synthetic residual impulse response are pre-loaded in a memory.
[0110] In some examples, the filtering of the aligned composite audio signal is performed in a temporal frequency domain.
[0111] In some examples, the filtering of the aligned composite audio signal is performed in a time domain through a convolution.
[0112] In some examples, the truncated BRIR filter and common synthetic residual impulse response are based on a decomposition analysis.
[0113] In some examples, the decomposition analysis is performed on each of N room impulse responses, and results in N truncated room impulse responses and N residual impulse responses (where N may be denoted as n above).
[0114] In some examples, the truncated impulse response represents less than forty percent of the total length of each room impulse response.
[0115] In some examples, the truncated impulse response includes a tap range between 111 and 17,830.
[0116] In some examples, each of the N residual impulse responses is combined into a common synthetic residual room response that reduces complexity.
[0117] In some examples, mixing the output of each channel filter with the filtered aligned composite audio signal includes a first set of mixing for a left speaker output, and a second set of mixing for a right speaker output.
[0118] In various examples, the method of the various examples of process 380 described above or any combination thereof may be performed by a device comprising a memory and one or more processors, an apparatus comprising means for performing each step of the method, and one or more processors that perform each step of the method by executing instructions stored on a non-transitory computer-readable storage medium.
[0119] Moreover, any of the specific features set forth in any of the examples described above may be combined into a beneficial example of the described techniques. That is, any of the specific features are generally applicable to all examples of the techniques. Various examples of the techniques have been described.
[0120] The techniques described in this disclosure may in some cases identify only samples 111 to 17830 across the BRIR set that are audible. Calculating a mixing time Tmp95 from the volume of an example room, the techniques may then let all BRIRs share a common reverb tail after 53.6 ms, resulting in a 15232 sample long common reverb tail and remaining 2704 sample HRTF + reflection impulses, with a 3 ms crossfade between them. In terms of a computational cost breakdown, the following may be arrived at:
(a) Common reverb tail: 10*6*log2(2*15232/10).
(b) Remaining impulses: 22*6*log2(2*4096), using 4096 FFT to do it in one frame.
(c) Additional 22 additions.
[0121] As a result, a final figure of merit may therefore approximately equal Cmod = max(100*(Cconv - C) / Cconv, 0) = 88.0, where:
$$C_{mod} = \max\left(100 \cdot (C_{conv} - C)/C_{conv},\ 0\right), \qquad (6)$$
where Cconv is an estimate of an unoptimized implementation:
$$C_{conv} = (22 + 2) \cdot 10 \cdot \left(6 \cdot \log_2(2 \cdot 48000/10)\right), \qquad (7)$$
and C, in some aspects, may be determined by two additive factors:
$$C = 22 \cdot 6 \cdot \log_2(2 \cdot 4096) + 10 \cdot 6 \cdot \log_2(2 \cdot 15232/10). \qquad (8)$$
[0122] Thus, in some aspects, the figure of merit, Cmod = 87.35.
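As a quick arithmetic check, the cost terms of items (a) and (b) and equations (6) through (8) can be evaluated directly. The following is a minimal sketch; the variable names are illustrative and do not appear in the disclosure, and the 22 additions of item (c) are left out because equation (8) does not include them.

```python
import math

# Unoptimized estimate, equation (7).
c_conv = (22 + 2) * 10 * (6 * math.log2(2 * 48000 / 10))

# Item (a): common reverb tail; item (b): remaining impulses with a 4096 FFT.
c_tail = 10 * 6 * math.log2(2 * 15232 / 10)
c_early = 22 * 6 * math.log2(2 * 4096)

# Equation (8): the two additive factors.
c_total = c_early + c_tail

# Equation (6): figure of merit, clamped at zero.
c_mod = max(100 * (c_conv - c_total) / c_conv, 0)
print(round(c_mod, 2))
```

Under these assumptions the result agrees with the figure of merit of about 87.35 stated in paragraph [0122].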
[0123] A BRIR filter denoted as Bn(z) may be decomposed into two functions BTn(z) and BRn(z), which denote the truncated BRIR filter and the reverb BRIR filter, respectively. Part (a) noted above may refer to this truncated BRIR filter, while part (b) above may refer to the reverb BRIR filter. Bn(z) may then equal BTn(z) + (z^(-m) * BRn(z)), where m denotes the delay. The output signal Y(z) may therefore be computed as:
$$Y(z) = \sum_{n=0}^{N-1} \left[ X_n(z)\, BT_n(z) + z^{-m}\, X_n(z)\, BR_n(z) \right] \quad (9)$$
[0124] The process 380 may analyze the BRn(z) to derive a common synthetic reverb tail segment, where this common BR(z) may be applied instead of the channel specific BRn(z). When this common (or channel general) synthetic BR(z) is used, Y(z) may be computed as:
$$Y(z) = \sum_{n=0}^{N-1} \left[ X_n(z)\, BT_n(z) \right] + z^{-m}\, BR(z) \sum_{n=0}^{N-1} X_n(z) \quad (10)$$
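The saving from a common tail follows from the linearity of convolution: filtering every channel with the same tail and summing gives the same result as summing the channels first and filtering once. The following is a minimal sketch of that equivalence; the channel count, signal lengths, and random test data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ch, sig_len, tail_len = 4, 256, 64
x = rng.standard_normal((n_ch, sig_len))   # hypothetical channel signals X_n
br = rng.standard_normal(tail_len)         # hypothetical common reverb tail BR

# N convolutions: one per channel, then accumulate.
per_channel = sum(np.convolve(x[n], br) for n in range(n_ch))

# One convolution: accumulate the channels first, then filter once.
shared = np.convolve(x.sum(axis=0), br)

print(np.allclose(per_channel, shared))
```

Only the tail term benefits in this way; the truncated filters BTn(z) remain channel specific in this formulation.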
[0125] FIG. 13 is a diagram of an example binaural room impulse response filter (BRIR) 400. BRIR 400 illustrates three segments 402A-402C. Head-related transfer function (HRTF) segment 402A includes the impulse response due to head-related transfer and may be identified using techniques described herein. The HRTF is equivalent to measuring the impulse response in an anechoic chamber. Since the first reflections of a room usually have a longer delay than the HRTF, it is assumed that the first portion of the BRIR is an HRTF impulse response. The reflections segment 402B combines the HRTF with room effects, i.e., the impulse response of the reflections segment 402B matches that of the HRTF segment 402A for the BRIR 400 filtered by early discrete echoes in comparison to the reverberation segment 402C. The mixing time is the time between the reflections segment 402B and the reverberation segment 402C and indicates the time at which early echoes become dense reverb. Reverberation segment 402C behaves like Gaussian noise, and discrete echoes can no longer be separated.
[0126] In the upcoming MPEG-H standardization, multichannel audio with high resolution and high channel count is considered. To make the rendering portable, a headphone representation is needed. This involves virtualizing all speaker feeds / channels into a stereo headset. To render to a headphone representation, a set of one or more pairs of impulse responses may be applied to the multichannel audio. The BRIR 400 may represent one pair of such impulse responses. Applying the BRIR 400 filter using a standard block Fast Fourier Transform (FFT) to a channel of the multichannel audio may be computationally intensive. Applying an entire set of pairs of impulse responses to corresponding channels of the multichannel audio is even more so. The techniques described hereinafter provide efficient binaural filtering without significantly sacrificing the quality of the result of standard filtering (e.g., block FFT).
[0127] FIG. 14 is a block diagram illustrating a system 410 for a computation of a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal. Each of inputs 412A-412N represents a single channel of an overall multichannel audio signal. Each of BRIRs 414A-414N represents a pair of binaural room impulse response filters having left and right components. In operation, the computation procedure applies, to each of the single-channel (mono) inputs 412A-412N, a corresponding BRIR of BRIRs 414A-414N to generate a binaural audio signal for the single-channel input as rendered at the location represented by the applied BRIR. The N binaural audio signals are then accumulated by accumulator 416 to produce the stereo headphone signal or overall binaural audio signal, which is output by the system 410 as output 418.
[0128] FIG. 15 is a block diagram illustrating components of an audio playback device 500 for computing a binaural output signal generated by applying binaural room impulse responses to a multichannel audio signal according to techniques described herein. The audio playback device 500 includes multiple components for implementing various computation reduction methods of the present disclosure in combination. Some aspects of the audio playback device 500 may include any combination in any number of the various computation reduction methods. Audio playback device 500 may represent an example of any of audio playback system 32, audio playback device 100, audio playback device 200, and audio playback device 350, and include components similar to any of the above-listed devices for implementing the various computation reduction methods of the present disclosure.
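The unoptimized procedure of system 410 can be sketched as one convolution per channel and per ear followed by accumulation. This is an illustrative sketch only; the function name `render_binaural` and the array layouts are assumptions, and a practical renderer would use block FFT filtering rather than direct convolution.

```python
import numpy as np

def render_binaural(inputs, brirs):
    """Apply one BRIR pair per channel and accumulate (sketch of system 410).

    inputs: array of shape (N, T), one mono signal per channel.
    brirs:  array of shape (N, 2, L), left/right impulse responses per channel.
    Returns an array of shape (2, T + L - 1) holding the left and right outputs.
    """
    inputs = np.asarray(inputs, dtype=float)
    brirs = np.asarray(brirs, dtype=float)
    n_ch, n_samp = inputs.shape
    out = np.zeros((2, n_samp + brirs.shape[2] - 1))
    for n in range(n_ch):
        for ear in range(2):
            # Each channel contributes its own filtered signal to each ear.
            out[ear] += np.convolve(inputs[n], brirs[n, ear])
    return out
```

The cost grows with both the channel count N and the BRIR length L, which is what the segment-wise reductions described for audio playback device 500 aim to lower.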
[0129] The computation reduction methods may include any combination of the following:
[0130] Part a (corresponding to HRTF segment 402A and HRTF unit 504): usually a few milliseconds long and used for localization; it can be computationally reduced by converting into interaural time delays (ITDs) and minimum phase filters, which can be further reduced using IIR filters, as one example.
[0131] Part b (corresponding to reflections segment 402B and reflection unit 502): the length may vary by room and typically lasts tens of milliseconds. Although computationally intensive if done for each channel separately, the techniques described herein may apply respective common filters generated for sub-groups of these channels.
[0132] Part c (corresponding to reverberation segment 402C and reverberation unit 506): A common filter is calculated for all channels (e.g., 22 channels for a 22.2 format). Instead of resynthesizing a new reverb tail based on a direct average over the frequency domain Energy Decay Relief (EDR) curve, the reverberation unit 506 applies a different weighting scheme to the average that is optionally enhanced by a correcting weight that changes with input signal content.
[0133] In a manner similar to system 410 of FIG. 14, the audio playback device 500 receives N single channel inputs 412A-412N (collectively, "inputs 412") of a multichannel audio signal and applies segments of binaural room impulse response (BRIR) filters to generate and output a stereo headphone signal or overall binaural audio signal. As illustrated in FIG. 15, reflection unit 502 combines the discrete inputs 412 into different groups using weighted sums (weighted using, e.g., adaptive weighting factors 520Ai_K-520Mi_j, 522A-522N). For the common reverb (illustrated, e.g., by reverberation segment 402C of FIG. 13), reverberation unit 506 combines inputs 412 together with respective adaptive weighting factors (522A-522N, e.g., stereo, with different weights for left/right per input) and then processes the combined inputs using a common reverb filter 524 (a stereo impulse response filter) applied using FFT filtering (after applying a delay 526).
[0134] Reflection unit 502 applies average reflection filters 512A-512M, similar to common reverb filter 524, to different sub-groups of the inputs 412 that are combined together into the sub-groups with adaptive weighting factors (520Ai_K-520Mi_j). HRTF unit 504 applies the head-related transfer function (HRTF) filters 414A-414N (collectively, "HRTF filters 414") that have, in this example device, been converted to interaural time delays (ITDs) 530A-530N and minimum phase filters (these may be further approximated with multi-state infinite impulse response (IIR) filters). As used herein, "adaptive" refers to adjustment of the weighting factors according to qualities of the input signal to which the adaptive weighting factor is applied. In some aspects, the various adaptive weighting factors may not be adaptive.
[0135] To compute the mixing time for the BRIRs for each of the inputs 412, an Echo Density Profile is calculated, which measures the fraction of impulse response taps outside of a window standard deviation over a 1024-sample sliding window. When the value reaches 1 for the first time, this indicates that the impulse response starts to resemble Gaussian noise and marks the beginning of reverb. For each of the individual HRTF filters 414, there may be different calculations, with the final values (in milliseconds) determined by measurement and by averaging across the N channels:
• Tmp50 = 36.1 (50 meaning average perceptual mixing time on regression analysis)
• Tmp95 = 80.7 (95 meaning transparent on 95% expert listeners, more strict).
[0136] There are also theoretical formulae for mixing time calculation based on room volume. For a room with a volume of 300 cubic meters, for example, the formulae based on volume give:
• Tv50 = 31.2
• Tv95 = 53.6
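The Echo Density Profile of paragraph [0135] can be sketched as follows. This is a minimal sketch under stated assumptions: the fraction of taps outside one standard deviation is divided by its expected value for Gaussian noise, erfc(1/sqrt(2)), so that the profile approaches 1 for noise-like tails; the hop size, the function names, and the choice of the window center as the reported mixing time are illustrative and not taken from the disclosure.

```python
import math
import numpy as np

def echo_density_profile(h, win=1024, hop=64):
    """Return (window start indices, normalized echo density per window)."""
    h = np.asarray(h, dtype=float)
    # Assumed normalization: expected fraction of Gaussian samples beyond one sigma.
    expected = math.erfc(1.0 / math.sqrt(2.0))
    starts = range(0, len(h) - win + 1, hop)
    profile = np.empty(len(starts))
    for j, s in enumerate(starts):
        seg = h[s:s + win]
        sigma = seg.std()
        # Fraction of taps whose magnitude exceeds the window standard deviation.
        profile[j] = np.mean(np.abs(seg) > sigma) / expected
    return np.array(list(starts)), profile

def mixing_time_samples(h, win=1024, hop=64):
    """Center of the first window whose profile reaches 1, or None if never reached."""
    starts, profile = echo_density_profile(h, win, hop)
    hits = np.nonzero(profile >= 1.0)[0]
    return None if hits.size == 0 else int(starts[hits[0]] + win // 2)
```

Dividing the returned sample index by the sampling rate gives a mixing time comparable to the millisecond values listed above; averaging across channels is left to the caller.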
[0137] As noted above, HRTF unit 504 applies the head-related transfer function (HRTF) filters 414 that have been converted to interaural time delays (ITDs) 530A-530N and minimum phase filters. The minimum phase filter may be obtained by windowing the cepstrum of the original filter; the delay may be estimated by linear regression on the 500-4000 Hz frequency region of the phase; for IIR approximation, a Balanced Model Truncation (BMT) method may be used to extract the most important components of the amplitude response on a frequency warped filter.
[0138] With respect to reverberation unit 506, after the mixing time the impulse response tails (e.g., reverberation segment 402C) are theoretically interchangeable without much perceptual difference. Reverberation unit 506 therefore applies a common reverberation filter 524 to substitute each response tail of the respective BRIRs corresponding to inputs 412. Example ways to obtain the common reverberation filter 524 for application in reverberation unit 506 of the audio playback device 500 include:
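Obtaining a minimum phase filter by windowing the cepstrum can be sketched with the standard homomorphic (folded real cepstrum) construction. This is a sketch under assumptions: the function name and the FFT size are illustrative, `n_fft` is assumed to be even and much longer than the filter, and the ITD regression and BMT steps are not shown.

```python
import numpy as np

def minimum_phase(h, n_fft=1024):
    """Return a minimum-phase filter with approximately the magnitude response of h."""
    h = np.asarray(h, dtype=float)
    spectrum = np.fft.fft(h, n_fft)
    # Real cepstrum of the magnitude response (floored to avoid log(0)).
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Cepstral window: keep the origin, double the causal part, drop the anticausal part.
    window = np.zeros(n_fft)
    window[0] = 1.0
    window[1:n_fft // 2] = 2.0
    window[n_fft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(cepstrum * window))).real
    return h_min[:len(h)]
```

For example, the two-tap filter [0.5, 1.0] has its energy concentrated in the second tap; its minimum-phase counterpart keeps the same magnitude response but moves the energy to the front.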
(1) Normalize each filter by its energy (e.g., the sum of the square values of all samples in the impulse response) and then average across all the normalized filters.
(2) Directly average all filters, e.g., compute the simple mean.
(3) Resynthesize an average filter with white noise controlled by energy envelope and coherence control.
[0139] The first method (1) takes the characteristics/shape of each original filter equally. Some filters may have very low energy (e.g. the top center channel in 22.2 setup) and yet have equal "votes" in the common filter 524.
[0140] The second method (2) naturally weights each filter according to its energy level, so a more energetic or "louder" filter gets more votes in the common filter 524. This direct average may also assume that there is not much correlation between filters, which may be true at least for individually obtained BRIRs in a good listening room.
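Methods (1) and (2) differ only in whether each tail is scaled before averaging. The following is a minimal sketch; it assumes that "normalizing by energy" means scaling each tail to unit energy, and the function names are hypothetical. A tail with zero energy would need special handling that is not shown.

```python
import numpy as np

def common_filter_normalized(tails):
    """Method (1): scale each tail to unit energy, then average (equal votes)."""
    tails = np.asarray(tails, dtype=float)            # shape (N, L)
    energy = np.sum(tails ** 2, axis=1, keepdims=True)
    return np.mean(tails / np.sqrt(energy), axis=0)

def common_filter_direct(tails):
    """Method (2): simple mean, so louder tails contribute more."""
    return np.mean(np.asarray(tails, dtype=float), axis=0)
```

With one loud and one quiet tail, method (1) gives both shapes the same influence, while method (2) is dominated by the loud tail.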
[0141] The third method (3) is based on techniques whereby frequency dependent interaural coherence (FDIC) is used to resynthesize reverb tails of a BRIR. Each BRIR first goes through a short-time Fourier transform (STFT), and its FDIC is calculated as:
Figure imgf000036_0001
where i is the frequency index and k is the time index. ℜ(·) denotes the real part. H_L and H_R are the short-time Fourier transforms (STFTs) of the left and right impulse responses.
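A per-frequency coherence of this kind can be sketched as below. The exact expression in the disclosure is contained in the figure above and is not reproduced here; this sketch assumes the commonly used definition, namely the real part of the cross-spectrum summed over time frames and normalized by the left and right energies. The STFT helper, the Hann window, the frame sizes, and the function names are likewise assumptions.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Simple Hann-windowed STFT; returns an array of shape (frequency i, time k)."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.array([x[s:s + n_fft] * win for s in starts])
    return np.fft.rfft(frames, axis=1).T

def fdic(h_left, h_right, n_fft=256, hop=128):
    """Assumed FDIC: normalized real cross-spectrum, one value per frequency bin."""
    HL = stft(h_left, n_fft, hop)
    HR = stft(h_right, n_fft, hop)
    num = np.real(np.sum(np.conj(HL) * HR, axis=1))
    den = np.sqrt(np.sum(np.abs(HL) ** 2, axis=1) * np.sum(np.abs(HR) ** 2, axis=1))
    return num / np.maximum(den, 1e-12)
```

Under this definition, identical left and right responses give a coherence of 1 in every bin, and independent noise gives values near 0.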
[0142] With certain FDIC and EDR, an impulse response can be synthesized using Gaussian noise as
$$\tilde{H}_L(i,k) = c(i,k) \left( a(i,k)\, N_1(i,k) + b(i,k)\, N_2(i,k) \right)$$
$$\tilde{H}_R(i,k) = d(i,k) \left( a(i,k)\, N_1(i,k) - b(i,k)\, N_2(i,k) \right),$$
where
Figure imgf000036_0002
[0143] Here H~L and H~R are the synthesized STFTs of the filter; N1 and N2 are the STFTs of independently generated Gaussian noise; c and d are the EDRs indexed by frequency and time; and the P terms are the time-smoothed short-time power spectrum estimates of the noise signals.
[0144] To obtain average FDIC, the techniques may include:
• Use one of the FDIC of the original filter, e.g. front center channel
• Direct average over all FDICs
• Use minimum of all FDICs: this will generate a maximally spacious average filter but is not necessarily close to the original filter mixture.
• Weight FDIC with their relative energy of EDR and then sum together.
With the latter method (weighted FDIC), each filter has a "vote" in the common FDIC commensurate with its energy. Louder filters therefore get more of their FDIC images in the common filter 524.
[0145] Furthermore, by examining a repertoire of input signals, additional patterns may be discovered, leading to additional weights from the content energy distribution. For example, the top channel in a 22.2 setup typically has a low-energy BRIR, and content producers may seldom author content in that position (e.g., the occasional airplane fly-by). Thus the common reverberation filter 524 generation techniques may trade off the accuracy for the top channel when synthesizing the common filter 524, while the main front center, left and right channels may get a lot of emphasis. Expressed in a general equation, the common or average FDIC with multiple weights is calculated as:
$$FDIC_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) FDIC_i}{\sum_i \prod_j w_{ji}}$$
where FDIC_i is the FDIC of the i-th BRIR channel, and w_ji (> 0) is the weight factor of criterion j for BRIR channel i. One j-th criterion mentioned here may be BRIR energy, while another may be signal content energy. The denominator sum normalizes such that the combined weights eventually add up to 1. When the weights are all equal to 1, the equation reduces to a simple average. Similarly, a common EDR (c and d in the previous equations) can be calculated as:
$$EDR_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) EDR_i}{\sum_i \prod_j w_{ji}}$$
and the weights here may not necessarily be the same as the weights of the FDIC.
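The multi-criterion weighted average described above can be sketched as follows: the weights of all criteria are multiplied per channel, normalized to sum to 1, and used to average the per-channel curves. The function name and the array layouts are assumptions made for illustration; the same helper applies to FDIC or EDR curves.

```python
import numpy as np

def weighted_average(curves, weights):
    """Average per-channel curves with combined, normalized weights.

    curves:  array of shape (channels, ...), e.g. one FDIC or EDR curve per channel.
    weights: array of shape (criteria, channels) with positive entries w[j, i].
    """
    curves = np.asarray(curves, dtype=float)
    weights = np.asarray(weights, dtype=float)
    combined = np.prod(weights, axis=0)       # product over criteria j, per channel i
    combined = combined / combined.sum()      # combined weights add up to 1
    return np.tensordot(combined, curves, axes=(0, 0))
```

With all weights equal to 1 the helper returns the simple mean, consistent with the description above; one row of `weights` could hold BRIR energies and another content energies.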
[0146] Any of the above methods described with respect to generating common reverberation filter 524 may also be used to synthesize reflection filters 512A-512M. That is, a sub-group of channels' reflections can be similarly synthesized, although the error will typically be larger because signals produced by reflections are less noise-like. However, all the center channel reflections will share similar coherence evaluation and energy decay; all left-side channels' reflections can be combined with proper weighting; alternatively, left front channels may form one group, left back and height channels may form another group, and so forth, in accordance with the channel format (e.g., 22.2). This may reduce the N channels each having reflection segments (e.g., reflection segment 402B) into M (e.g., 3-5) sub-groups to reduce computation. Similar content-based weighting can be applied to the reflection-combined filters 512A-512M as well, as described above with respect to synthesizing reverberation filter 524. Reflection channels may be grouped in any combination. By examining the correlation between the reflection segments of the impulse responses, relatively highly-correlated channels can be grouped together for a subgroup common reflection filter 512 synthesis.
[0147] In the illustrated example, reflection unit 502 groups at least input 412A and input 412N in a subgroup. Reflection filter 512A represents a common filter generated for this subgroup, and reflection unit 502 applies the reflection filter 512A to a combination of the inputs of the subgroup which, again, include at least input 412A and input 412N in the illustrated example.
[0148] As one example, the correlation matrix for the respective reflection portions of a set of BRIR filters is examined. The set of BRIR filters may represent a current set of BRIR filters. The correlation matrix is adjusted by (1 - corr)/2 to obtain a dissimilarity matrix, which is used to conduct a complete linkage for cluster analysis.
[0149] As shown in FIG. 16, a hierarchical cluster analysis may be run on the reflection portions of a 22.2 channel BRIR set according to a correlation on their time envelopes. As can be seen, by setting a cutoff score of 0.6, the left channels can be grouped into 4 sub-groups and the right channels can be grouped into 3 sub-groups with convincing similarities. By examining the speaker locations in the 22.2 setup, the cluster analysis results coincide with common sense functionalities and geometry of the 22.2 channel setup.
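The grouping step can be sketched as a naive complete-linkage agglomeration on the dissimilarity matrix (1 - corr)/2. This is a sketch under assumptions: the function name is hypothetical, the cutoff is interpreted as a maximum complete-linkage distance, and the cubic-time search is only suitable for small channel counts such as 22.

```python
import numpy as np

def complete_linkage_groups(corr, cutoff):
    """Group channels whose complete-linkage dissimilarity stays at or below cutoff."""
    corr = np.asarray(corr, dtype=float)
    dis = (1.0 - corr) / 2.0                      # dissimilarity in [0, 1]
    clusters = [[i] for i in range(len(dis))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance of the two farthest members.
                d = max(dis[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > cutoff:
            break                                 # remaining clusters are too dissimilar
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Each returned group lists the channel indices that would share one common reflection filter; the correlation matrix itself (for example, of time envelopes) is computed beforehand.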
[0150] Returning now to FIG. 15, the impulse response for any of the common filters (e.g., the reflection filters 512A-512M and the common reverberation filter 524) may be a two-column vector:
$$h = [\, h_L \;\; h_R \,] = [\, \mathrm{IFFT}(H_L(i,k)) \;\; \mathrm{IFFT}(H_R(i,k)) \,]$$
Figure imgf000038_0001
[0151] Once the common filter is calculated, at online processing, the reflection unit 502 and/or reverberation unit 506 first mixes the inputs 412 into a specific group for the filter and then applies the common filter. For example, reverberation unit 506 may mix all of the inputs 412 into a single mix and then apply common reverberation filter 524. Since the original filters before common filter synthesis have varying energies, equally-mixed inputs 412 may not match the original condition. If the energy of a filter impulse response h is calculated as:
Figure imgf000038_0002
(where n is the sample index and each h[n] is a stereo sample for the left/right impulse responses), then an initial weight for the input signal can be calculated as:
Figure imgf000038_0003
where hi is the original filter for channel i before common filter synthesis.
[0152] By using the common filter, the original filtering process of $\sum_i in_i \circledast h_i$ becomes $\left( \sum_i w_i \, in_i \right) \circledast h$, where in_i is an input sample for the input signal. Here, $\circledast$ denotes convolution, and each h filter is a stereo impulse response; thus the left and right channels carry out these processes individually. For slightly more efficient processing, any of the stereo weights w_i can be converted to a single value weight by averaging the left/right weights, and then the stereo input mix upon application of the common filter becomes a mono mix instead. Adaptive weight factors 520Ai_K-520Mi_j for reflection unit 502 and adaptive weight factors 522A-522N for reverberation unit 506 may represent any of the weights w_i.
[0153] In using the weights w_i on the input signals, the underlying assumption is that the input channels are not correlated, so that each input goes through the filter with the same energy as before, and the summed signal's energy is approximately the same as the sum of all weighted signals' energies. In practice, a more "reverberant" sound is often perceived, and a much higher energy level of the resynthesized version is observed. This is due to the fact that the input channels are often correlated. For example, for a multichannel mix generated by panning mono sources and moving them around, the panning algorithm usually generates highly correlated components across different channels. And for correlated channels, the energy will be higher using the initial weights w_i.
[0154] Thus, instead of calculating the mixed input signal as $in_{mix} = \sum_i w_i \, in_i$, a time-varying energy normalization weight may be applied and the new input signal mix should therefore be calculated as:
$$in_{mix}(n) = w_{norm}(n) \sum_i w_i \, in_i(n),$$
where n is the discrete time index, and the normalization wnorm is according to the energy ratio between summed energy of weighted signals and energy of the weighted summed signal:
Figure imgf000039_0001
over a segment of signal frames. In the equation, the signal index is not written on the right side. This average energy estimation on the right side can be achieved in the time domain with a first-order smoothing filter on the summed energy and on the energy of the summed signal. Thus a smooth energy curve may be obtained for division. Alternatively, since the audio playback device 500 may already apply FFT overlap-add for the filtering, audio playback device 500 can estimate one normalization weight for each FFT frame, and the overlap-add scheme will take care of the smoothing effect over time.
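A per-frame version of this normalization can be sketched as follows. The exact expression for the normalization weight is in the figure above and is not reproduced here; this sketch assumes that the weight is the square root of the ratio between the summed energy of the weighted signals and the energy of the weighted summed signal, so that the normalized mix has the energy that uncorrelated inputs would have produced. The function name and the `eps` guard are illustrative.

```python
import numpy as np

def normalized_mix(frames, weights, eps=1e-12):
    """Mix one frame per channel and rescale it by an energy normalization weight.

    frames:  array of shape (N, T), one frame of samples per input channel.
    weights: array of shape (N,), the initial per-channel weights w_i.
    Returns (normalized mix of shape (T,), w_norm).
    """
    frames = np.asarray(frames, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weighted = weights[:, None] * frames
    mix = weighted.sum(axis=0)
    summed_energy = np.sum(weighted ** 2)     # sum of the weighted signals' energies
    energy_of_sum = np.sum(mix ** 2)          # energy of the weighted summed signal
    # Assumed form: amplitude weight is the square root of the energy ratio.
    w_norm = np.sqrt(summed_energy / max(energy_of_sum, eps))
    return w_norm * mix, w_norm
```

For two fully correlated (identical) inputs the weight evaluates to sqrt(0.5), which removes the extra energy that coherent summation would otherwise add; for uncorrelated inputs it stays near 1.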
[0155] Between HRTF, reflection and reverb tails (or reverberation) segments, a cosine curve crossfade is applied (with duration of, e.g., 0.2 ms or 10 samples) to smoothly transition between them. For example, if HRTFs are 256 samples long, reflections are 2048 samples long, and reverb is 4096 samples long, the total equivalent filter length of the renderer would be 256 + 2048 + 4096 - 2 * 10 = 6380 samples.
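The length arithmetic and the cosine crossfade can be sketched by joining segments that overlap by the fade length. This is an illustrative sketch; the function name and the exact raised-cosine shape are assumptions, and in the device the segments are applied by separate units rather than concatenated into one filter.

```python
import numpy as np

def crossfade_concat(segments, fade_len=10):
    """Join segments, overlapping neighbors by fade_len samples with a cosine fade."""
    t = (np.arange(fade_len) + 0.5) / fade_len
    fade_in = 0.5 - 0.5 * np.cos(np.pi * t)   # rises from near 0 to near 1
    fade_out = 1.0 - fade_in                  # complementary, so the gains sum to 1
    out = np.asarray(segments[0], dtype=float).copy()
    for seg in segments[1:]:
        seg = np.asarray(seg, dtype=float)
        out[-fade_len:] = out[-fade_len:] * fade_out + seg[:fade_len] * fade_in
        out = np.concatenate([out, seg[fade_len:]])
    return out
```

Joining segments of 256, 2048, and 4096 samples with two 10-sample overlaps yields 256 + 2048 + 4096 - 2 * 10 = 6380 samples, matching the example above.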
[0156] Combination step 510 combines all of the filtered signals generated by reflection unit 502, HRTF unit 504, and reverberation unit 506. In some examples, at least one of reflection unit 502 and reverberation unit 506 does not apply adaptive weight factors. In some examples of audio playback device 500, HRTF unit 504 applies both the HRTF portion and the reflection portion of the BRIR filters for the inputs 412, i.e., audio playback device 500 in such examples does not group the inputs 412 into M subgroups to which common reflection filters 512A-512M are applied.
[0157] FIG. 17 is a flowchart illustrating an example mode of operation of an audio playback device according to techniques described in this disclosure. The example mode of operation is described with respect to audio playback device 500 of FIG. 15.
[0158] The audio playback device 500 receives single input channels and applies adaptively determined weights to the channels (600). The audio playback device 500 combines these adaptively weighted channels to generate a combined audio signal (602). The audio playback device 500 further applies a binaural room impulse response filter to the combined audio signal to generate a binaural audio signal (604). The binaural room impulse response filter may be, e.g., a combined reflection or a reverberation filter generated according to any of the techniques described above. The audio playback device 500 outputs an output/overall audio signal that is generated, at least in part, from the binaural audio signal generated at step 604 (606). The overall audio signal may be a combination of multiple binaural audio signals for one or more reflection sub-groups combined and filtered, a reverberation group combined and filtered, and respective HRTF signals filtered for each of the channels of the audio signal. The audio playback device 500 applies a delay, as needed, to the filtered signals to align the signals for combination to produce the overall output binaural audio signal.
[0159] In addition to or as an alternative to the above, the following examples are described. The features described in any of the following examples may be utilized with any of the other examples described herein.
[0160] One example is directed to a method of binauralizing an audio signal comprising obtaining a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and applying the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
[0161] In some examples, the summary audio signal comprises a combination of a subgroup of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
[0162] In some examples, the method further comprises applying respective head- related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
[0163] In some examples, obtaining the common filter comprises computing an average of the sub-group of the plurality of binaural room impulse response filters as the common filter.
[0164] In some examples, the method further comprises combining a sub-group of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
[0165] In some examples, the common filter is a first common filter, the sub-group is a first sub-group, the summary audio signal is a first summary audio signal, and wherein the transformed summary audio signal is a first transformed summary audio signal, and the method further comprises generating a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; combining a second sub-group of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and applying the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal comprises combining the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
[0166] In some examples, obtaining the common filter comprises computing a weighted average of the sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
[0167] In some examples, obtaining the common filter comprises computing the average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0168] In some examples, obtaining the common filter comprises computing a direct average of the sub-group of the plurality of binaural room impulse response filters.
[0169] In some examples, obtaining the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0170] In some examples, obtaining the common filter comprises computing respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0171] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
[0172] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the subgroup of the plurality of binaural room impulse response filters.
[0173] In some examples, computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter- aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0174] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing:
$$FDIC_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) FDIC_i}{\sum_i \prod_j w_{ji}},$$
wherein FDIC_average is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein FDIC_i denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein w_ji denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0175] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0176] In some examples, synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing:
$$EDR_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) EDR_i}{\sum_i \prod_j w_{ji}},$$
wherein EDR_average is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein EDR_i denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein w_ji denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0177] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0178] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0179] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0180] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0181] In another example, a method comprises generating a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
[0182] In some examples, generating the common filter comprises computing a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
[0183] In some examples, generating the common filter comprises computing the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
[0184] In some examples, generating the common filter comprises computing a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
[0185] In some examples, generating the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0186] In some examples, generating the common filter comprises: computing respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0187] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
[0188] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
[0189] In some examples, computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0190] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0191] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the channels of the audio signal.
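A weighted average over several criteria can be sketched as below. The sketch assumes that the weights of the individual criteria j are multiplied together for each filter i before the weighted mean is taken; that reading, the function name `weighted_average`, and the list-of-lists layout of the weights are assumptions made for illustration.

```python
from math import prod


def weighted_average(values, weights_per_item):
    """Weighted mean of values[i] using the product of each item's criterion weights.

    values[i]: quantity for filter i (for example FDIC_i at one frequency bin).
    weights_per_item[i]: list of weights w_ji, one per criterion j (assumed layout).
    """
    combined = [prod(ws) for ws in weights_per_item]  # one combined weight per filter
    denom = sum(combined)
    if denom == 0:
        raise ValueError("combined weights sum to zero")
    return sum(c * v for c, v in zip(combined, values)) / denom
```

With a single criterion the function reduces to an ordinary weighted mean, so a filter-energy weight or a signal-content-energy weight can each be used alone.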
[0192] In some examples, synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
[0193] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
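As a small worked illustration of the Energy Decay Relief averaging, consider a single criterion and two channels. The numbers are invented for illustration only, and the expression assumes that the per-criterion weights combine multiplicatively, so that with one criterion the average is an ordinary weighted mean.

```latex
% Single criterion (j = 1), two channels (i = 1, 2); illustrative values only.
\[
\mathrm{EDR}_{average}
  = \frac{w_{11}\,\mathrm{EDR}_1 + w_{12}\,\mathrm{EDR}_2}{w_{11} + w_{12}}
  = \frac{3 \cdot 0.8 + 1 \cdot 0.4}{3 + 1}
  = 0.7
\]
```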
[0194] In another example, a method comprises generating a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
[0195] In some examples, generating the common filter comprises computing a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
[0196] In some examples, generating the common filter comprises computing the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0197] In some examples, generating the common filter comprises computing a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
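A direct, non-normalized average of filter segments can be sketched as a sample-wise mean. The helper names `direct_average` and `segment_energy`, and the representation of each segment as an equal-length list of samples for one ear, are assumptions for illustration.

```python
def direct_average(segments):
    """Sample-wise mean of equal-length filter segments, with no normalization."""
    n = len(segments[0])
    if any(len(s) != n for s in segments):
        raise ValueError("segments must have equal length")
    return [sum(s[k] for s in segments) / len(segments) for k in range(n)]


def segment_energy(segment):
    """Sum of squared samples of one segment."""
    return sum(x * x for x in segment)
```

Because the segments are not rescaled to a common energy before averaging, a segment with larger amplitude contributes proportionally more to the result than a quiet one.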
[0198] In some examples, generating the common filter comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0199] In some examples, generating the common filter comprises: computing respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0200] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing a direct average frequency-dependent inter-aural coherence value.
[0201] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
[0202] In some examples, computing the average frequency-dependent inter-aural coherence value comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0203] In some examples, computing the average frequency-dependent inter-aural coherence value comprises computing:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0204] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0205] In some examples, synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0206] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0207] In another example, a method of binauralizing an audio signal comprises applying adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and applying the one or more segments to the plurality of binaural room impulse response filters.
[0208] In some examples, the initial adaptively determined weights for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
[0209] In some examples, the method further comprises obtaining a common filter for a plurality of binaural room impulse response filters, wherein the i-th initial adaptively determined weight $w_i$ for the i-th channel is computed according to:
[Equation image imgf000048_0001]
wherein $h_i$ is the i-th binaural room impulse response filter, wherein $h$ is the common filter, and wherein $E(h) = \sum_{n=0} h[n]^2$, wherein n is a sample index and each $h[n]$ is a stereo sample at n.
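The energy term E(h) used for the weights can be sketched as a sum of squared samples. Because each h[n] is a stereo sample, this sketch assumes that the square of a stereo sample is the sum of the squares of its left and right components; that interpretation and the function name `filter_energy` are assumptions, not statements from the disclosure.

```python
def filter_energy(h):
    """E(h): sum over the sample index n of h[n]^2.

    h is a sequence of (left, right) pairs; squaring a stereo sample is taken
    here to mean left^2 + right^2 (an assumption made for this sketch).
    """
    return sum(left * left + right * right for left, right in h)
```

The same function can be evaluated on an individual filter and on the common filter, which are the two energies the weight depends on.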
[0210] In some examples, the method further comprises applying the common filter to the summary audio signal to generate a transformed summary audio signal by computing $\sum w_i\, \mathrm{in}_i \circledast h$, wherein $\circledast$ denotes a convolution operation and $\mathrm{in}_i$ denotes the i-th channel of the audio signal.
[0211] In some examples, combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels comprises computing:
$$\mathrm{in}_{mix}(n) = w_{norm}(n) \sum w_i\, \mathrm{in}_i(n)$$
wherein $\mathrm{in}_{mix}(n)$ denotes the summary audio signal, wherein n is a sample index, and wherein
$$w_{norm}(n) = \sqrt{\frac{\sum E(w_i\, \mathrm{in}_i)}{E\left(\sum w_i\, \mathrm{in}_i\right)}}$$
and wherein $\mathrm{in}_i$ denotes the i-th channel of the audio signal.
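The weighted mix-down with energy normalization can be sketched as below. The sketch assumes the normalization factor is the square root of the ratio between the summed energies of the weighted channels and the energy of their sum, and that it is computed once per block of samples rather than per sample; both points, and the names `mix_channels` and `energy`, are assumptions made for illustration.

```python
from math import sqrt


def energy(x):
    """Sum of squared samples of a mono block."""
    return sum(s * s for s in x)


def mix_channels(channels, weights):
    """Weighted sum of equal-length channels, rescaled to preserve total energy."""
    weighted = [[w * s for s in ch] for ch, w in zip(channels, weights)]
    summed = [sum(samples) for samples in zip(*weighted)]
    e_sum = energy(summed)
    if e_sum == 0.0:
        return summed  # silent or fully cancelling block: nothing to rescale
    # Assumed normalization: sqrt(sum of channel energies / energy of the sum).
    w_norm = sqrt(sum(energy(ch) for ch in weighted) / e_sum)
    return [w_norm * s for s in summed]
```

With this factor, the energy of the mixed block equals the sum of the energies of the weighted channels, so correlated channels that add constructively do not inflate the level of the signal fed to the common filter.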
[0212] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0213] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0214] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0215] In another example, a method comprises applying respective head-related transfer function segments of a plurality of binaural room impulse response filters to corresponding channels of an audio signal to generate a plurality of transformed channels of the audio signal; generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters; combining the channels of the audio signal to generate a summary audio signal; applying the common filter to the summary audio signal to generate a transformed summary audio signal; combining the transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal.
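The overall flow of this example (per-channel head-related transfer function segments, one common filter applied to a combined signal, then a final sum) can be sketched for a single ear as follows. Everything named here is hypothetical: `binauralize_one_ear`, the plain unweighted channel sum used as the summary signal, and the `tail_offset` delay that places the common-filter output after the direct segment are assumptions chosen to keep the sketch short.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y


def add_padded(a, b):
    """Add two sample lists, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [p + q for p, q in zip(a, b)]


def binauralize_one_ear(channels, hrtf_segments, common_filter, tail_offset=0):
    """Render one ear: per-channel HRTF segments plus one shared later segment."""
    out = []
    for ch, hrtf in zip(channels, hrtf_segments):
        out = add_padded(out, convolve(ch, hrtf))  # transformed channels
    summary = [sum(s) for s in zip(*channels)]  # summary audio signal
    # The common filter is applied once to the summary signal, not per channel.
    tail = [0.0] * tail_offset + convolve(summary, common_filter)
    return add_padded(out, tail)
```

The point of the structure is cost: the long later segment is convolved once with the summary signal instead of once per channel, while the short direction-dependent segments are still applied individually.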
[0216] In some examples, generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing an average of the plurality of binaural room impulse response filters without normalizing any of the plurality of binaural room impulse response filters.
[0217] In some examples, generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing a direct average of the plurality of binaural room impulse response filters.
[0218] In some examples, generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0219] In some examples, generating a common filter by computing a weighted average of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the plurality of binaural room impulse response filters comprises computing respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters; computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters; and synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0220] In some examples, computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the plurality of binaural room impulse response filters comprises computing a direct average frequency-dependent inter-aural coherence value.
[0221] In some examples, computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises computing the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
[0222] In some examples, computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises weighting each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0223] In some examples, computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters comprises computing:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0224] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
[0225] In some examples, synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises computing:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
[0226] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
[0227] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0228] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0229] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0230] In another example, a method comprises applying respective head-related transfer function segments of a plurality of binaural room impulse response filters to corresponding channels of an audio signal to generate a plurality of transformed channels of the audio signal; generating a common filter by computing an average of the plurality of binaural room impulse response filters; combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels; applying the common filter to the summary audio signal to generate a transformed summary audio signal; and combining the transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal.
[0231] In some examples, the initial adaptive weight factors for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
[0232] In some examples, the i-th initial adaptive weight factor $w_i$ for the i-th channel is computed according to
[Equation image imgf000051_0001]
wherein $h_i$ is the i-th binaural room impulse response filter, wherein $h$ is the common filter, and wherein $E(h) = \sum_{n=0} h[n]^2$, wherein n is a sample index and each $h[n]$ is a stereo sample at n.
[0233] In some examples, applying the common filter to the summary audio signal to generate a transformed summary audio signal comprises computing:
$$\sum w_i\, \mathrm{in}_i \circledast h$$
wherein $\circledast$ denotes a convolution operation and $\mathrm{in}_i$ denotes the i-th channel of the audio signal.
[0234] In some examples, combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels comprises computing:
$$\mathrm{in}_{mix}(n) = w_{norm}(n) \sum w_i\, \mathrm{in}_i(n)$$
wherein $\mathrm{in}_{mix}(n)$ denotes the summary audio signal, wherein n is a sample index, and wherein
$$w_{norm}(n) = \sqrt{\frac{\sum E(w_i\, \mathrm{in}_i)}{E\left(\sum w_i\, \mathrm{in}_i\right)}}$$
wherein $\mathrm{in}_i$ denotes the i-th channel of the audio signal.
[0235] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0236] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0237] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0238] In some examples, a device comprises a memory configured to store a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and a processor configured to apply the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
[0239] In some examples, the summary audio signal comprises a combination of a sub-group of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
[0240] In some examples, the processor is further configured to apply respective head-related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and combine the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
[0241] In some examples, the common filter comprises an average of the sub-group of the plurality of binaural room impulse response filters.
[0242] In some examples, the processor is further configured to combine a sub-group of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
[0243] In some examples, the common filter is a first common filter, wherein the sub-group is a first sub-group, wherein the summary audio signal is a first summary audio signal, and wherein the transformed summary audio signal is a first transformed summary audio signal, wherein the processor is further configured to generate a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; combine a second sub-group of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and apply the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein, to combine the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal, the processor is further configured to combine the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
[0244] In some examples, the common filter comprises a weighted average of the sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
[0245] In some examples, the common filter comprises an average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0246] In some examples, the common filter comprises a direct average of the sub-group of the plurality of binaural room impulse response filters.
[0247] In some examples, the common filter comprises a resynthesized common filter generated using white noise controlled by energy envelope and coherence control.
[0248] In some examples, the processor is further configured to: compute respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
[0249] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
[0250] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
[0251] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0252] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0253] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0254] In some examples, to synthesize the common filter using the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0255] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0256] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0257] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0258] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0259] In another example, a device comprises a processor configured to generate a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
[0260] In some examples, to generate the common filter the processor is further configured to compute a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
[0261] In some examples, to generate the common filter the processor is further configured to compute the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
[0262] In some examples, to generate the common filter the processor is further configured to compute a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
[0263] In some examples, to generate the common filter the processor is further configured to resynthesize the common filter using white noise controlled by energy envelope and coherence control.
[0264] In some examples, to generate the common filter the processor is further configured to compute respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
[0265] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
[0266] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
[0267] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0268] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0269] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the channels of the audio signal.
[0270] In some examples, to synthesize the common filter using the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
[0271] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
[0272] In another example, a device comprises a processor configured to generate a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
[0273] In some examples, to generate the common filter the processor is further configured to compute a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
[0274] In some examples, to generate the common filter the processor is further configured to compute the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0275] In some examples, to generate the common filter the processor is further configured to compute a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
[0276] In some examples, to generate the common filter the processor is further configured to resynthesize the common filter using white noise controlled by energy envelope and coherence control.
[0277] In some examples, to generate the common filter the processor is further configured to compute respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; compute an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and synthesize the common filter using the average frequency-dependent inter-aural coherence value.
[0278] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute a direct average frequency-dependent inter-aural coherence value.
[0279] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute the average frequency-dependent inter-aural coherence value as the minimum frequency-dependent inter-aural coherence values of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
[0280] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to weight each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and accumulate the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0281] In some examples, to compute the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{FDIC}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{FDIC}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{FDIC}_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein $\mathrm{FDIC}_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0282] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0283] In some examples, to synthesize the common filter using the average frequency-dependent inter-aural coherence value the processor is further configured to compute:
$$\mathrm{EDR}_{average} = \frac{\sum_i \left( \prod_j w_{ji}\, \mathrm{EDR}_i \right)}{\sum_i \left( \prod_j w_{ji} \right)}$$
wherein $\mathrm{EDR}_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein $\mathrm{EDR}_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0284] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0285] In some examples, a device comprises a processor configured to apply adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and apply the one or more segments to the plurality of binaural room impulse response filters.
[0286] In some examples, the processor computes the initial adaptively determined weights for the channels of the audio signal according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
[0287] In some examples, the processor is further configured to obtain a common filter for a plurality of binaural room impulse response filters, wherein the i-th initial adaptively determined weight $w_i$ for the i-th channel is computed according to
[equation image imgf000059_0002, not reproduced]
wherein $h_i$ is the i-th binaural room impulse response filter, wherein $\bar{h}$ is the common filter, and wherein $E(\bar{h}) = \sum_{n=0}^{N} \bar{h}[n]^2$, wherein n is a sample index and each $\bar{h}[n]$ is a stereo sample at n.
[0288] In some examples, the processor is further configured to: apply the common filter to the summary audio signal to generate a transformed summary audio signal by computing:
[equation image imgf000060_0001, not reproduced]
wherein $\otimes$ denotes a convolution operation and $in_i$ denotes the i-th channel of the audio signal.
[0289] In some examples, the processor is further configured to: combine the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels by computing:
$in_{mix}(n) = w_{norm}(n) \sum_i w_i \, in_i(n)$,
wherein $in_{mix}(n)$ denotes the summary audio signal, wherein n is a sample index, and wherein
$w_{norm}(n) = \sqrt{\frac{\sum_i E(w_i \, in_i)}{E\left(\sum_i w_i \, in_i\right)}}$,
wherein $in_i$ denotes the i-th channel of the audio signal.
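A minimal Python sketch of this combination follows. It assumes that the normalization factor is the square root of the summed energies of the weighted channels divided by the energy of their sum, and that it is evaluated once over a whole block of samples rather than per sample or per frame. The names `energy` and `summary_signal` are hypothetical, and channels are treated as mono sample lists for brevity.

```python
import math


def energy(x):
    """Sum of squared samples, E(x)."""
    return sum(v * v for v in x)


def summary_signal(channels, weights):
    """Weighted downmix whose energy matches the summed channel energies.

    channels: list of equal-length sample lists, one per channel
    weights : one adaptive weight per channel (assumed already computed)
    """
    weighted = [[w * v for v in ch] for ch, w in zip(channels, weights)]
    mixed = [sum(samples) for samples in zip(*weighted)]
    mixed_energy = energy(mixed)
    if mixed_energy == 0.0:
        # Nothing to normalize (silence or complete cancellation).
        return mixed
    w_norm = math.sqrt(sum(energy(ch) for ch in weighted) / mixed_energy)
    return [w_norm * v for v in mixed]
```

Under these assumptions the normalization compensates for the energy gained or lost when correlated channels are summed, so the summary signal carries the same total energy as the weighted channels taken separately.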
[0290] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0291] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0292] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0293] In another example, a device comprises means for obtaining a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and means for applying the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
[0294] In some examples, the summary audio signal comprises a combination of a subgroup of the plurality of channels of the audio signal corresponding to the sub-group of the plurality of binaural room impulse response filters.
[0295] In some examples, the device further comprises means for applying respective head-related transfer function segments of the plurality of binaural room impulse response filters to corresponding ones of the plurality of channels of the audio signal to generate a plurality of transformed channels of the audio signal; and means for combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output binaural audio signal.
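The combination described in the preceding examples can be sketched as follows: each channel is convolved with its own head-related transfer function segment, the summary signal is convolved once with the common filter, and the results are summed per ear. This is a simplified sketch, not the disclosed implementation. The names `convolve` and `binauralize`, the unweighted summary signal, and the `tail_offset` parameter (the sample at which the common filter segment is assumed to begin) are all assumptions of the example.

```python
def convolve(x, h):
    """Direct-form linear convolution of two sample lists."""
    y = [0.0] * max(len(x) + len(h) - 1, 0)
    for i, xv in enumerate(x):
        for k, hv in enumerate(h):
            y[i + k] += xv * hv
    return y


def binauralize(channels, hrtf_segments, common_filter, tail_offset):
    """Return [left, right] output sample lists.

    channels      : list of equal-length mono sample lists
    hrtf_segments : per channel, a (left_ir, right_ir) pair
    common_filter : one (left_ir, right_ir) pair shared by all channels
    tail_offset   : delay, in samples, of the common filter segment
    """
    # Unweighted sum used as the summary signal in this sketch.
    summary = [sum(samples) for samples in zip(*channels)]
    output = []
    for ear in (0, 1):
        tail = convolve(summary, common_filter[ear])
        direct_len = len(channels[0]) + max(len(h[ear]) for h in hrtf_segments) - 1
        acc = [0.0] * max(direct_len, tail_offset + len(tail))
        # Per-channel filtering with the direction-specific segment.
        for ch, h in zip(channels, hrtf_segments):
            for n, v in enumerate(convolve(ch, h[ear])):
                acc[n] += v
        # Single filtering of the summary signal with the common filter.
        for n, v in enumerate(tail):
            acc[tail_offset + n] += v
        output.append(acc)
    return output
```

The point of the structure is that the long common filter is convolved once per ear instead of once per channel, while only the short per-channel segments scale with the channel count.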
[0296] In some examples, the means for obtaining the common filter comprises means for computing an average of the sub-group of the plurality of binaural room impulse response filters as the common filter.
[0297] In some examples, the device further comprises means for combining a subgroup of channels of the audio signal that correspond to the sub-group of the plurality of binaural room impulse response filters to generate the summary audio signal.
[0298] In some examples, the common filter is a first common filter, wherein the subgroup is a first sub-group, wherein the summary audio signal is a first summary audio signal, and wherein the transformed summary audio signal is a first transformed summary audio signal, and the device further comprises means for generating a second common filter for a second, different sub-group of the plurality of binaural room impulse response filters by computing an average of the second sub-group of the plurality of binaural room impulse response filters; means for combining a second subgroup of channels of the audio signal that correspond to the second sub-group of the plurality of binaural room impulse response filters to generate a second summary audio signal; and means for applying the second common filter to the second summary audio signal to generate a second transformed summary audio signal, wherein the means for combining the first transformed summary audio signal and the transformed channels of the audio signal to generate an output audio signal comprises means for combining the first transformed summary audio signal, the second transformed summary audio signal, and the transformed channels of the audio signal to generate the output audio signal.
[0299] In some examples, the means for obtaining the common filter comprises means for computing a weighted average of the sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
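One possible reading of an energy-weighted average is sketched below in Python. It assumes each filter is a list of stereo samples given as (left, right) pairs, that the energy of a stereo sample is the sum of the squares of both ear values, and that each filter's weight is proportional to its own energy. These choices, and the names `stereo_energy` and `energy_weighted_common_filter`, are assumptions of this sketch.

```python
def stereo_energy(h):
    """Energy of a stereo filter: sum over n of left^2 + right^2."""
    return sum(left * left + right * right for left, right in h)


def energy_weighted_common_filter(filters):
    """Average equal-length stereo filters, weighting each by its energy."""
    energies = [stereo_energy(h) for h in filters]
    total = sum(energies)
    if total == 0.0:
        raise ValueError("all filters have zero energy")
    length = len(filters[0])
    if any(len(h) != length for h in filters):
        raise ValueError("all filters must have the same length")
    common = []
    for n in range(length):
        left = sum(e * h[n][0] for e, h in zip(energies, filters)) / total
        right = sum(e * h[n][1] for e, h in zip(energies, filters)) / total
        common.append((left, right))
    return common
```

With this weighting, higher-energy filters dominate the common filter, which differs from a direct average in which every filter contributes equally.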
[0300] In some examples, the means for obtaining the common filter comprises means for computing the average of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0301] In some examples, the means for obtaining the common filter comprises means for computing a direct average of the sub-group of the plurality of binaural room impulse response filters.
[0302] In some examples, the means for obtaining the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
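A heavily simplified Python sketch of such a resynthesis is given below. It uses a single broadband coherence value instead of a frequency-dependent one and a per-sample amplitude envelope supplied by the caller, so it illustrates the idea rather than the disclosed procedure. The name `synthesize_tail`, its parameters, and the mixing rule are assumptions of this example.

```python
import math
import random


def synthesize_tail(envelope, coherence, seed=0):
    """Generate a stereo noise tail with a target inter-channel correlation.

    envelope : per-sample amplitude gains (e.g. derived from an energy decay)
    coherence: desired correlation between left and right, in [-1, 1]
    seed     : seed for the pseudo-random generator, for repeatability
    """
    if not -1.0 <= coherence <= 1.0:
        raise ValueError("coherence must lie in [-1, 1]")
    rng = random.Random(seed)
    # Share of an independent noise source needed to reach the target.
    mix = math.sqrt(1.0 - coherence * coherence)
    left, right = [], []
    for gain in envelope:
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        left.append(gain * a)
        # right correlates with left by `coherence` in expectation.
        right.append(gain * (coherence * a + mix * b))
    return left, right
```

A frequency-dependent version would apply the same mixing rule separately in each frequency band, using the averaged coherence value for that band.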
[0303] In some examples, the means for obtaining the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0304] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
[0305] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters.
[0306] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0307] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
$FDIC_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) FDIC_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0308] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0309] In some examples, the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
$EDR_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) EDR_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $EDR_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0310] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0311] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0312] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0313] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0314] In another example, a device comprises means for generating a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
[0315] In some examples, the means for generating the common filter comprises means for computing a weighted average of the reverberation segments of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the binaural room impulse response filters.
[0316] In some examples, the means for generating the common filter comprises means for computing the average of the reverberation segments of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the plurality of binaural room impulse response filters.
[0317] In some examples, the means for generating the common filter comprises means for computing a direct average of the reverberation segments of the plurality of binaural room impulse response filters.
[0318] In some examples, the means for generating the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0319] In some examples, the means for generating the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0320] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
[0321] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters.
[0322] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the reverberation segments of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0323] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
$FDIC_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) FDIC_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the plurality of binaural room impulse response filters, wherein $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0324] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
[0325] In some examples, the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
$EDR_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) EDR_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $EDR_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the audio signal, wherein $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the audio signal.
[0326] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the audio signal.
[0327] In another example, a device comprises means for generating a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
[0328] In some examples, the means for generating the common filter comprises means for computing a weighted average of the reflection segments of a sub-group of the plurality of binaural room impulse response filters that is weighted according to the respective energies of the sub-group of the binaural room impulse response filters.
[0329] In some examples, the means for generating the common filter comprises means for computing the average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters without normalizing the binaural room impulse response filters of the sub-group of the plurality of binaural room impulse response filters.
[0330] In some examples, the means for generating the common filter comprises means for computing a direct average of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
[0331] In some examples, the means for generating the common filter comprises means for resynthesizing the common filter using white noise controlled by energy envelope and coherence control.
[0332] In some examples, the means for generating the common filter comprises: means for computing respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; means for computing an average frequency-dependent inter-aural coherence value using the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters; and means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
[0333] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing a direct average frequency-dependent inter-aural coherence value.
[0334] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing the average frequency-dependent inter-aural coherence value as the minimum of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters.
[0335] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for weighting each of the respective frequency-dependent inter-aural coherence values for each of the reflection segments of the sub-group of the plurality of binaural room impulse response filters by the respective, relative energy of Energy Decay Relief and means for accumulating the weighted frequency-dependent inter-aural coherence values to generate the average frequency-dependent inter-aural coherence value.
[0336] In some examples, the means for computing the average frequency-dependent inter-aural coherence value comprises means for computing:
$FDIC_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) FDIC_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $FDIC_{average}$ is the average frequency-dependent inter-aural coherence value, wherein i denotes a binaural room impulse response filter of the sub-group of the plurality of binaural room impulse response filters, wherein $FDIC_i$ denotes a frequency-dependent inter-aural coherence value for the i-th binaural room impulse response filter, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th binaural room impulse response filter.
[0337] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0338] In some examples, the means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value comprises means for computing:
$EDR_{average} = \frac{\sum_i \left( \prod_j w_{ji} \right) EDR_i}{\sum_i \left( \prod_j w_{ji} \right)}$,
wherein $EDR_{average}$ is an average Energy Decay Relief value, wherein i denotes a channel of the sub-group of channels of the audio signal, wherein $EDR_i$ denotes an Energy Decay Relief value for the i-th channel of the sub-group of channels of the audio signal, and wherein $w_{ji}$ denotes a weight of a criterion j for the i-th channel of the sub-group of channels of the audio signal.
[0339] In some examples, the criterion j is one of an energy for the i-th binaural room impulse response filter or a signal content energy for the i-th channel of the sub-group of channels of the audio signal.
[0340] In another example, a device comprises means for applying adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and means for applying the one or more segments to the plurality of binaural room impulse response filters.
[0341] In some examples, the initial adaptively determined weights for the channels of the audio signal are computed according to an energy of a corresponding binaural room impulse response filter of the plurality of binaural room impulse response filters.
[0342] In some examples, the device further comprises means for obtaining a common filter for a plurality of binaural room impulse response filters, wherein the i-th initial adaptively determined weight $w_i$ for the i-th channel is computed according to
[equation image imgf000068_0001, not reproduced]
wherein $h_i$ is the i-th binaural room impulse response filter, wherein $\bar{h}$ is the common filter, and wherein $E(\bar{h}) = \sum_{n=0}^{N} \bar{h}[n]^2$, wherein n is a sample index and each $\bar{h}[n]$ is a stereo sample at n.
[0343] In some examples, the device further comprises means for applying the common filter to the summary audio signal to generate a transformed summary audio signal by computing:
[equation image imgf000068_0002, not reproduced]
wherein $\otimes$ denotes a convolution operation and $in_i$ denotes the i-th channel of the audio signal.
[0344] In some examples, the device further comprises means for combining the channels of the audio signal to generate a summary audio signal by applying respective adaptive weight factors to the channels, wherein the means for combining comprises means for computing:
$in_{mix}(n) = w_{norm}(n) \sum_i w_i \, in_i(n)$,
wherein $in_{mix}(n)$ denotes the summary audio signal, wherein n is a sample index, and wherein
$w_{norm}(n) = \sqrt{\frac{\sum_i E(w_i \, in_i)}{E\left(\sum_i w_i \, in_i\right)}}$,
wherein $in_i$ denotes the i-th channel of the audio signal.
[0345] In some examples, the channels of the audio signal comprise a plurality of hierarchical elements.
[0346] In some examples, the plurality of hierarchical elements comprise spherical harmonic coefficients.
[0347] In some examples, the plurality of hierarchical elements comprise higher order ambisonics.
[0348] In another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to obtain a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters; and apply the common filter to a summary audio signal determined from a plurality of channels of the audio signal to generate a transformed summary audio signal.
[0349] In another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to generate a common filter for reverberation segments of a plurality of binaural room impulse response filters that are weighted according to the respective energies of the binaural room impulse response filters.
[0350] In another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to generate a common filter for reflection segments of a sub-group of a plurality of binaural room impulse response filters.
[0351] In another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to apply adaptively determined weights to a plurality of channels of the audio signal prior to applying one or more segments of a plurality of binaural room impulse response filters; and apply the one or more segments to the plurality of binaural room impulse response filters.
[0352] In another example, a device comprises a processor configured to perform the method of any combination of the examples described above.
[0353] In another example, a device comprises means for performing each step of the method of any combination of the examples described above.
[0354] In another example, a non-transitory computer-readable storage medium has stored thereon instructions that, when executed, cause one or more processors to perform the method of any combination of the examples described above.
[0355] It should be understood that, depending on the example, certain acts or events of any of the methods described herein can be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary for the practice of the method). Moreover, in certain examples, acts or events may be performed concurrently, e.g., through multi-threaded processing, interrupt processing, or multiple processors, rather than sequentially. In addition, while certain aspects of this disclosure are described as being performed by a single device, module or unit for purposes of clarity, it should be understood that the techniques of this disclosure may be performed by a combination of devices, units or modules.
[0356] In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol.
[0357] In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media which is non-transitory or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.
[0358] By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage, or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
[0359] It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media. Disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
[0360] Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor," as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated in a combined codec. Also, the techniques could be fully implemented in one or more circuits or logic elements.
[0361] The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including a wireless handset, an integrated circuit (IC) or a set of ICs (e.g., a chip set). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require realization by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a collection of interoperative hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
[0362] Various embodiments of the techniques have been described. These and other embodiments are within the scope of the following claims.

Claims

What is claimed is:
1. A method of binauralizing an audio signal, the method comprising:
applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal;
combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and
applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
2. The method of claim 1, wherein the binaural room impulse response filter comprises a common filter for reverberation segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
3. The method of claim 2, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
4. The method of claim 1, wherein the binaural room impulse response filter comprises a common filter for reflection segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
5. The method of claim 4, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
6. The method of claim 1,
wherein the at least two of the plurality of adaptively weighted channels of the audio signal comprise a first sub-group,
wherein the combined signal comprises a first combined signal,
wherein the binaural room impulse response filter comprises a first binaural room impulse response filter, and
wherein the binaural audio signal comprises a first binaural audio signal, the method further comprising:
combining a second sub-group to generate a second combined signal, the second sub-group comprising at least two of the plurality of adaptively weighted channels of the audio signal;
applying a second binaural room impulse response filter to the second combined signal to generate a second binaural audio signal; and
combining the first binaural audio signal and the second binaural audio signal to generate a third binaural audio signal.
7. The method of claim 1, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the method further comprising:
computing an average of the at least two binaural room impulse response filters without normalizing the at least two binaural room impulse response filters to generate the common filter.
8. The method of claim 1, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the method further comprising:
computing respective frequency-dependent inter-aural coherence values for each of the at least two binaural room impulse response filters;
computing an average frequency-dependent inter-aural coherence value of the respective frequency-dependent inter-aural coherence values for the at least two binaural room impulse response filters; and
synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
9. The method of claim 1, wherein the initial adaptively determined weights for the plurality of channels of the audio signal are determined according to respective energies of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
10. The method of claim 1, wherein the plurality of channels of the audio signal each comprises spherical harmonic coefficients.
11. A device comprising one or more processors configured to:
apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal;
combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and
apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
12. The device of claim 11, wherein the binaural room impulse response filter comprises a common filter for reverberation segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
13. The device of claim 12, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
14. The device of claim 11, wherein the binaural room impulse response filter comprises a common filter for reflection segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
15. The device of claim 14, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
16. The device of claim 11,
wherein the at least two of the plurality of adaptively weighted channels of the audio signal comprise a first sub-group,
wherein the combined signal comprises a first combined signal,
wherein the binaural room impulse response filter comprises a first binaural room impulse response filter, and
wherein the binaural audio signal comprises a first binaural audio signal, the one or more processors further configured to:
combine a second sub-group to generate a second combined signal, the second sub-group comprising at least two of the plurality of adaptively weighted channels of the audio signal;
apply a second binaural room impulse response filter to the second combined signal to generate a second binaural audio signal; and
combine the first binaural audio signal and the second binaural audio signal to generate a third binaural audio signal.
17. The device of claim 11, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the one or more processors further configured to:
compute an average of the at least two binaural room impulse response filters without normalizing the at least two binaural room impulse response filters to generate the common filter.
18. The device of claim 11, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the one or more processors further configured to:
compute respective frequency-dependent inter-aural coherence values for each of the at least two binaural room impulse response filters;
compute an average frequency-dependent inter-aural coherence value of the respective frequency-dependent inter-aural coherence values for the at least two binaural room impulse response filters; and
synthesize the common filter using the average frequency-dependent inter-aural coherence value.
19. The device of claim 11, wherein the initial adaptively determined weights for plurality of channels of the audio signal are determined according to respective energies of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
20. The device of claim 11, wherein the plurality of channels of the audio signal each comprises spherical harmonic coefficients.
21. An apparatus comprising:
means for applying adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal;
means for combining at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and
means for applying a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
22. The apparatus of claim 21, wherein the binaural room impulse response filter comprises a common filter for reverberation segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
23. The apparatus of claim 22, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
24. The apparatus of claim 21, wherein the binaural room impulse response filter comprises a common filter for reflection segments of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
25. The apparatus of claim 24, wherein the reflection segments of the at least two binaural room impulse response filters are weighted according to the respective energies of at least a portion of the at least two binaural room impulse response filters.
26. The apparatus of claim 21,
wherein the at least two of the plurality of adaptively weighted channels of the audio signal comprise a first sub-group,
wherein the combined signal comprises a first combined signal,
wherein the binaural room impulse response filter comprises a first binaural room impulse response filter, and
wherein the binaural audio signal comprises a first binaural audio signal, the apparatus further comprising:
means for combining a second sub-group to generate a second combined signal, the second sub-group comprising at least two of the plurality of adaptively weighted channels of the audio signal;
means for applying a second binaural room impulse response filter to the second combined signal to generate a second binaural audio signal; and
means for combining the first binaural audio signal and the second binaural audio signal to generate a third binaural audio signal.
27. The apparatus of claim 21, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the apparatus further comprising:
means for computing an average of the at least two binaural room impulse response filters without normalizing the at least two binaural room impulse response filters to generate the common filter.
28. The apparatus of claim 21, wherein the binaural room impulse response filter comprises a common filter for at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels, the apparatus further comprising:
means for computing respective frequency-dependent inter-aural coherence values for each of the at least two binaural room impulse response filters;
means for computing an average frequency-dependent inter-aural coherence value of the respective frequency-dependent inter-aural coherence values for the at least two binaural room impulse response filters; and
means for synthesizing the common filter using the average frequency-dependent inter-aural coherence value.
29. The apparatus of claim 21, wherein the initial adaptively determined weights for plurality of channels of the audio signal are determined according to respective energies of at least two binaural room impulse response filters corresponding, respectively, to the at least two of the plurality of adaptively weighted channels.
30. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed, cause one or more processors to:
apply adaptively determined weights to a plurality of channels of the audio signal to generate a plurality of adaptively weighted channels of the audio signal;
combine at least two of the plurality of adaptively weighted channels of the audio signal to generate a combined signal; and
apply a binaural room impulse response filter to the combined signal to generate a binaural audio signal.
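By way of a non-limiting illustration, the operations recited in claim 30 (and mirrored in claims 11 and 21) may be sketched in Python as follows. The sketch also forms a common filter as an un-normalized sample-wise average in the manner of claim 17 and derives weights from filter energies in the spirit of claim 19. All function names, and in particular the square-root energy-ratio weighting rule, are assumptions made for this illustration and are not taken from the claims; direct time-domain convolution is used only for brevity.

```python
import math
from typing import List, Sequence, Tuple

Signal = List[float]
Brir = Tuple[Sequence[float], Sequence[float]]  # (left-ear, right-ear) responses


def convolve(x: Sequence[float], h: Sequence[float]) -> Signal:
    """Direct linear convolution; output length is len(x) + len(h) - 1."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def energy(h: Sequence[float]) -> float:
    """Sum of squared samples."""
    return sum(v * v for v in h)


def average_filters(filters: Sequence[Sequence[float]]) -> Signal:
    """Sample-wise mean of equal-length filters, with no prior normalization."""
    return [sum(column) / len(filters) for column in zip(*filters)]


def energy_weights(brirs: Sequence[Brir], common: Brir) -> List[float]:
    """Hypothetical rule (an assumption, not from the claims):
    weight_i = sqrt(energy of BRIR i / energy of the common filter)."""
    e_common = energy(common[0]) + energy(common[1])
    return [math.sqrt((energy(l) + energy(r)) / e_common) for l, r in brirs]


def render(channels: Sequence[Sequence[float]],
           weights: Sequence[float],
           common: Brir) -> Tuple[Signal, Signal]:
    """Weight the channels, sum them into one combined signal, then apply
    the common left and right filters to that combined signal."""
    n = len(channels[0])
    combined = [sum(w * ch[t] for w, ch in zip(weights, channels))
                for t in range(n)]
    return convolve(combined, common[0]), convolve(combined, common[1])


# Toy data: two channels whose BRIRs differ only by a gain factor of 3.
brirs = [([1.0, 0.0], [0.0, 1.0]), ([3.0, 0.0], [0.0, 3.0])]
channels = [[1.0, 0.0], [0.0, 1.0]]

common = (average_filters([b[0] for b in brirs]),
          average_filters([b[1] for b in brirs]))
weights = energy_weights(brirs, common)
left, right = render(channels, weights, common)
print(weights, left, right)  # [0.5, 1.5] [1.0, 3.0, 0.0] [0.0, 1.0, 3.0]
```

In this toy case the two filters differ only by a gain, so weighting and combining the channels before a single pass through the common filter gives the same output as filtering each channel with its own filter and summing. For filters that differ in shape, the combined path would generally be an approximation that trades accuracy for fewer convolutions.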
PCT/US2014/039864 2013-05-29 2014-05-28 Filtering with binaural room impulse responses with content analysis and weighting WO2014194005A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2016516799A JP6100441B2 (en) 2013-05-29 2014-05-28 Binaural room impulse response filtering using content analysis and weighting
EP14733457.7A EP3005734B1 (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses with content analysis and weighting
KR1020157036270A KR101719094B1 (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses with content analysis and weighting
CN201480042431.2A CN105432097B (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses with content analysis and weighting

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
US201361828620P 2013-05-29 2013-05-29
US61/828,620 2013-05-29
US201361847543P 2013-07-17 2013-07-17
US61/847,543 2013-07-17
US201361886593P 2013-10-03 2013-10-03
US201361886620P 2013-10-03 2013-10-03
US61/886,620 2013-10-03
US61/886,593 2013-10-03
US14/288,277 US9369818B2 (en) 2013-05-29 2014-05-27 Filtering with binaural room impulse responses with content analysis and weighting
US14/288,277 2014-05-27

Publications (1)

Publication Number Publication Date
WO2014194005A1 (en) 2014-12-04

Family

ID=51985133

Family Applications (3)

Application Number Title Priority Date Filing Date
PCT/US2014/039864 WO2014194005A1 (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses with content analysis and weighting
PCT/US2014/039848 WO2014193993A1 (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses
PCT/US2014/039863 WO2014194004A1 (en) 2013-05-29 2014-05-28 Binaural rendering of spherical harmonic coefficients

Family Applications After (2)

Application Number Title Priority Date Filing Date
PCT/US2014/039848 WO2014193993A1 (en) 2013-05-29 2014-05-28 Filtering with binaural room impulse responses
PCT/US2014/039863 WO2014194004A1 (en) 2013-05-29 2014-05-28 Binaural rendering of spherical harmonic coefficients

Country Status (7)

Country Link
US (3) US9369818B2 (en)
EP (3) EP3005733B1 (en)
JP (3) JP6227764B2 (en)
KR (3) KR101719094B1 (en)
CN (3) CN105432097B (en)
TW (1) TWI615042B (en)
WO (3) WO2014194005A1 (en)

Families Citing this family (130)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788080B1 (en) 2006-09-12 2014-07-22 Sonos, Inc. Multi-channel pairing in a media system
US9202509B2 (en) 2006-09-12 2015-12-01 Sonos, Inc. Controlling and grouping in a multi-zone media system
US8483853B1 (en) 2006-09-12 2013-07-09 Sonos, Inc. Controlling and manipulating groupings in a multi-zone media system
US8923997B2 (en) 2010-10-13 2014-12-30 Sonos, Inc Method and apparatus for adjusting a speaker system
US11429343B2 (en) 2011-01-25 2022-08-30 Sonos, Inc. Stereo playback configuration and control
US11265652B2 (en) 2011-01-25 2022-03-01 Sonos, Inc. Playback device pairing
US8938312B2 (en) 2011-04-18 2015-01-20 Sonos, Inc. Smart line-in processing
US9042556B2 (en) 2011-07-19 2015-05-26 Sonos, Inc Shaping sound responsive to speaker orientation
US8811630B2 (en) 2011-12-21 2014-08-19 Sonos, Inc. Systems, methods, and apparatus to filter audio
US9084058B2 (en) 2011-12-29 2015-07-14 Sonos, Inc. Sound field calibration using listener localization
US9131305B2 (en) * 2012-01-17 2015-09-08 LI Creative Technologies, Inc. Configurable three-dimensional sound system
US9729115B2 (en) 2012-04-27 2017-08-08 Sonos, Inc. Intelligently increasing the sound level of player
US9524098B2 (en) 2012-05-08 2016-12-20 Sonos, Inc. Methods and systems for subwoofer calibration
USD721352S1 (en) 2012-06-19 2015-01-20 Sonos, Inc. Playback device
US9219460B2 (en) 2014-03-17 2015-12-22 Sonos, Inc. Audio settings based on environment
US9668049B2 (en) 2012-06-28 2017-05-30 Sonos, Inc. Playback device calibration user interfaces
US9690271B2 (en) 2012-06-28 2017-06-27 Sonos, Inc. Speaker calibration
US9690539B2 (en) 2012-06-28 2017-06-27 Sonos, Inc. Speaker calibration user interface
US9106192B2 (en) 2012-06-28 2015-08-11 Sonos, Inc. System and method for device playback calibration
US9706323B2 (en) 2014-09-09 2017-07-11 Sonos, Inc. Playback device calibration
US8930005B2 (en) 2012-08-07 2015-01-06 Sonos, Inc. Acoustic signatures in a playback system
US8965033B2 (en) 2012-08-31 2015-02-24 Sonos, Inc. Acoustic optimization
US9008330B2 (en) 2012-09-28 2015-04-14 Sonos, Inc. Crossover frequency adjustments for audio speakers
USD721061S1 (en) 2013-02-25 2015-01-13 Sonos, Inc. Playback device
CN108806704B (en) 2013-04-19 2023-06-06 韩国电子通信研究院 Multi-channel audio signal processing device and method
CN104982042B (en) 2013-04-19 2018-06-08 韩国电子通信研究院 Multi channel audio signal processing unit and method
US9369818B2 (en) 2013-05-29 2016-06-14 Qualcomm Incorporated Filtering with binaural room impulse responses with content analysis and weighting
US9384741B2 (en) * 2013-05-29 2016-07-05 Qualcomm Incorporated Binauralization of rotated higher order ambisonics
EP2840811A1 (en) * 2013-07-22 2015-02-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method for processing an audio signal; signal processing unit, binaural renderer, audio encoder and audio decoder
EP2830043A3 (en) * 2013-07-22 2015-02-18 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method for Processing an Audio Signal in accordance with a Room Impulse Response, Signal Processing Unit, Audio Encoder, Audio Decoder, and Binaural Renderer
US9319819B2 (en) 2013-07-25 2016-04-19 Etri Binaural rendering method and apparatus for decoding multi channel audio
EP3767970B1 (en) 2013-09-17 2022-09-28 Wilus Institute of Standards and Technology Inc. Method and apparatus for processing multimedia signals
WO2015060654A1 (en) * 2013-10-22 2015-04-30 한국전자통신연구원 Method for generating filter for audio signal and parameterizing device therefor
DE102013223201B3 (en) * 2013-11-14 2015-05-13 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Method and device for compressing and decompressing sound field data of a region
WO2015099429A1 (en) 2013-12-23 2015-07-02 주식회사 윌러스표준기술연구소 Audio signal processing method, parameterization device for same, and audio signal processing device
US10382880B2 (en) * 2014-01-03 2019-08-13 Dolby Laboratories Licensing Corporation Methods and systems for designing and applying numerically optimized binaural room impulse responses
US9226073B2 (en) 2014-02-06 2015-12-29 Sonos, Inc. Audio output balancing during synchronized playback
US9226087B2 (en) 2014-02-06 2015-12-29 Sonos, Inc. Audio output balancing during synchronized playback
US9264839B2 (en) 2014-03-17 2016-02-16 Sonos, Inc. Playback device configuration based on proximity detection
EP3122073B1 (en) 2014-03-19 2023-12-20 Wilus Institute of Standards and Technology Inc. Audio signal processing method and apparatus
JP6442037B2 (en) * 2014-03-21 2018-12-19 華為技術有限公司Huawei Technologies Co.,Ltd. Apparatus and method for estimating total mixing time based on at least a first pair of room impulse responses and corresponding computer program
KR101856540B1 (en) 2014-04-02 2018-05-11 주식회사 윌러스표준기술연구소 Audio signal processing method and device
US9367283B2 (en) 2014-07-22 2016-06-14 Sonos, Inc. Audio settings
USD883956S1 (en) 2014-08-13 2020-05-12 Sonos, Inc. Playback device
US9910634B2 (en) 2014-09-09 2018-03-06 Sonos, Inc. Microphone calibration
US10127006B2 (en) 2014-09-09 2018-11-13 Sonos, Inc. Facilitating calibration of an audio playback device
US9952825B2 (en) 2014-09-09 2018-04-24 Sonos, Inc. Audio processing algorithms
US9891881B2 (en) 2014-09-09 2018-02-13 Sonos, Inc. Audio processing algorithm database
US9774974B2 (en) * 2014-09-24 2017-09-26 Electronics And Telecommunications Research Institute Audio metadata providing apparatus and method, and multichannel audio data playback apparatus and method to support dynamic format conversion
US9560464B2 (en) * 2014-11-25 2017-01-31 The Trustees Of Princeton University System and method for producing head-externalized 3D audio through headphones
US9973851B2 (en) 2014-12-01 2018-05-15 Sonos, Inc. Multi-channel playback of audio content
WO2016130834A1 (en) 2015-02-12 2016-08-18 Dolby Laboratories Licensing Corporation Reverberation generation for headphone virtualization
US10664224B2 (en) 2015-04-24 2020-05-26 Sonos, Inc. Speaker calibration user interface
WO2016172593A1 (en) 2015-04-24 2016-10-27 Sonos, Inc. Playback device calibration user interfaces
USD886765S1 (en) 2017-03-13 2020-06-09 Sonos, Inc. Media playback device
US20170085972A1 (en) 2015-09-17 2017-03-23 Sonos, Inc. Media Player and Media Player Design
USD768602S1 (en) 2015-04-25 2016-10-11 Sonos, Inc. Playback device
USD920278S1 (en) 2017-03-13 2021-05-25 Sonos, Inc. Media playback device with lights
USD906278S1 (en) 2015-04-25 2020-12-29 Sonos, Inc. Media player device
US10248376B2 (en) 2015-06-11 2019-04-02 Sonos, Inc. Multiple groupings in a playback system
US9729118B2 (en) 2015-07-24 2017-08-08 Sonos, Inc. Loudness matching
US9538305B2 (en) 2015-07-28 2017-01-03 Sonos, Inc. Calibration error conditions
US10932078B2 (en) 2015-07-29 2021-02-23 Dolby Laboratories Licensing Corporation System and method for spatial processing of soundfield signals
US9736610B2 (en) 2015-08-21 2017-08-15 Sonos, Inc. Manipulation of playback device response using signal processing
US9712912B2 (en) 2015-08-21 2017-07-18 Sonos, Inc. Manipulation of playback device response using an acoustic filter
US10978079B2 (en) * 2015-08-25 2021-04-13 Dolby Laboratories Licensing Corporation Audio encoding and decoding using presentation transform parameters
AU2016312404B2 (en) 2015-08-25 2020-11-26 Dolby International Ab Audio decoder and decoding method
US10262677B2 (en) * 2015-09-02 2019-04-16 The University Of Rochester Systems and methods for removing reverberation from audio signals
USD1043613S1 (en) 2015-09-17 2024-09-24 Sonos, Inc. Media player
US9693165B2 (en) 2015-09-17 2017-06-27 Sonos, Inc. Validation of audio calibration using multi-dimensional motion check
JP6437695B2 (en) 2015-09-17 2018-12-12 ソノズ インコーポレイテッド How to facilitate calibration of audio playback devices
EP3402221B1 (en) * 2016-01-08 2020-04-08 Sony Corporation Audio processing device and method, and program
US9743207B1 (en) 2016-01-18 2017-08-22 Sonos, Inc. Calibration using multiple recording devices
US10003899B2 (en) 2016-01-25 2018-06-19 Sonos, Inc. Calibration with particular locations
US11106423B2 (en) 2016-01-25 2021-08-31 Sonos, Inc. Evaluating calibration of a playback device
US9886234B2 (en) 2016-01-28 2018-02-06 Sonos, Inc. Systems and methods of distributing audio to one or more playback devices
US10142755B2 (en) * 2016-02-18 2018-11-27 Google Llc Signal processing methods and systems for rendering audio on virtual loudspeaker arrays
US9591427B1 (en) * 2016-02-20 2017-03-07 Philip Scott Lyren Capturing audio impulse responses of a person with a smartphone
US9881619B2 (en) 2016-03-25 2018-01-30 Qualcomm Incorporated Audio processing for an acoustical environment
WO2017165968A1 (en) * 2016-03-29 2017-10-05 Rising Sun Productions Limited A system and method for creating three-dimensional binaural audio from stereo, mono and multichannel sound sources
US9864574B2 (en) 2016-04-01 2018-01-09 Sonos, Inc. Playback device calibration based on representation spectral characteristics
US9860662B2 (en) 2016-04-01 2018-01-02 Sonos, Inc. Updating playback device configuration information based on calibration data
US9763018B1 (en) 2016-04-12 2017-09-12 Sonos, Inc. Calibration of audio playback devices
JP6821699B2 (en) * 2016-04-20 2021-01-27 ジェネレック・オーワイGenelec Oy How to regularize active monitoring headphones and their inversion
CN105792090B (en) * 2016-04-27 2018-06-26 华为技术有限公司 A kind of method and apparatus for increasing reverberation
EP3472832A4 (en) * 2016-06-17 2020-03-11 DTS, Inc. Distance panning using near / far-field rendering
US9794710B1 (en) 2016-07-15 2017-10-17 Sonos, Inc. Spatial audio correction
US9860670B1 (en) 2016-07-15 2018-01-02 Sonos, Inc. Spectral correction using spatial calibration
US10372406B2 (en) 2016-07-22 2019-08-06 Sonos, Inc. Calibration interface
US10459684B2 (en) 2016-08-05 2019-10-29 Sonos, Inc. Calibration of a playback device based on an estimated frequency response
CN106412793B (en) * 2016-09-05 2018-06-12 中国科学院自动化研究所 Sparse modeling method and system for head-related transfer functions based on spherical harmonic functions
EP3293987B1 (en) 2016-09-13 2020-10-21 Nokia Technologies Oy Audio processing
US10412473B2 (en) 2016-09-30 2019-09-10 Sonos, Inc. Speaker grill with graduated hole sizing over a transition area for a media device
USD827671S1 (en) 2016-09-30 2018-09-04 Sonos, Inc. Media playback device
USD851057S1 (en) 2016-09-30 2019-06-11 Sonos, Inc. Speaker grill with graduated hole sizing over a transition area for a media device
US10492018B1 (en) 2016-10-11 2019-11-26 Google Llc Symmetric binaural rendering for high-order ambisonics
US10712997B2 (en) 2016-10-17 2020-07-14 Sonos, Inc. Room association based on name
EP3312833A1 (en) * 2016-10-19 2018-04-25 Holosbase GmbH Decoding and encoding apparatus and corresponding methods
WO2018079254A1 (en) * 2016-10-28 2018-05-03 Panasonic Intellectual Property Corporation Of America Binaural rendering apparatus and method for playing back of multiple audio sources
US9992602B1 (en) 2017-01-12 2018-06-05 Google Llc Decoupled binaural rendering
US10009704B1 (en) 2017-01-30 2018-06-26 Google Llc Symmetric spherical harmonic HRTF rendering
US10158963B2 (en) * 2017-01-30 2018-12-18 Google Llc Ambisonic audio with non-head tracked stereo based on head position and time
JP7038725B2 (en) * 2017-02-10 2022-03-18 ガウディオ・ラボ・インコーポレイテッド Audio signal processing method and equipment
DE102017102988B4 (en) 2017-02-15 2018-12-20 Sennheiser Electronic Gmbh & Co. Kg Method and device for processing a digital audio signal for binaural reproduction
US11200906B2 (en) * 2017-09-15 2021-12-14 Lg Electronics, Inc. Audio encoding method, to which BRIR/RIR parameterization is applied, and method and device for reproducing audio by using parameterized BRIR/RIR information
US10388268B2 (en) * 2017-12-08 2019-08-20 Nokia Technologies Oy Apparatus and method for processing volumetric audio
US10523171B2 (en) 2018-02-06 2019-12-31 Sony Interactive Entertainment Inc. Method for dynamic sound equalization
US10652686B2 (en) 2018-02-06 2020-05-12 Sony Interactive Entertainment Inc. Method of improving localization of surround sound
US11929091B2 (en) 2018-04-27 2024-03-12 Dolby Laboratories Licensing Corporation Blind detection of binauralized stereo content
EP4093057A1 (en) 2018-04-27 2022-11-23 Dolby Laboratories Licensing Corp. Blind detection of binauralized stereo content
US10872602B2 (en) 2018-05-24 2020-12-22 Dolby Laboratories Licensing Corporation Training of acoustic models for far-field vocalization processing systems
WO2020014506A1 (en) * 2018-07-12 2020-01-16 Sony Interactive Entertainment Inc. Method for acoustically rendering the size of a sound source
US10299061B1 (en) 2018-08-28 2019-05-21 Sonos, Inc. Playback device calibration
US11206484B2 (en) 2018-08-28 2021-12-21 Sonos, Inc. Passive speaker authentication
US11272310B2 (en) * 2018-08-29 2022-03-08 Dolby Laboratories Licensing Corporation Scalable binaural audio stream generation
US11503423B2 (en) * 2018-10-25 2022-11-15 Creative Technology Ltd Systems and methods for modifying room characteristics for spatial audio rendering over headphones
US11304021B2 (en) 2018-11-29 2022-04-12 Sony Interactive Entertainment Inc. Deferred audio rendering
CN109801643B (en) * 2019-01-30 2020-12-04 龙马智芯(珠海横琴)科技有限公司 Processing method and device for reverberation suppression
US11076257B1 (en) * 2019-06-14 2021-07-27 EmbodyVR, Inc. Converting ambisonic audio to binaural audio
US11341952B2 (en) * 2019-08-06 2022-05-24 Insoundz, Ltd. System and method for generating audio featuring spatial representations of sound sources
US10734965B1 (en) 2019-08-12 2020-08-04 Sonos, Inc. Audio calibration of a portable playback device
CN112578434A (en) * 2019-09-27 2021-03-30 中国石油化工股份有限公司 Minimum phase infinite impulse response filtering method and filtering system
US11967329B2 (en) * 2020-02-20 2024-04-23 Qualcomm Incorporated Signaling for rendering tools
JP7147804B2 (en) * 2020-03-25 2022-10-05 カシオ計算機株式会社 Effect imparting device, method and program
FR3113993B1 (en) * 2020-09-09 2023-02-24 Arkamys Sound spatialization process
WO2022108494A1 (en) * 2020-11-17 2022-05-27 Dirac Research Ab Improved modeling and/or determination of binaural room impulse responses for audio applications
WO2023085186A1 (en) * 2021-11-09 2023-05-19 ソニーグループ株式会社 Information processing device, information processing method, and information processing program
CN116189698A (en) * 2021-11-25 2023-05-30 广州视源电子科技股份有限公司 Training method and device for voice enhancement model, storage medium and equipment
WO2024089034A2 (en) * 2022-10-24 2024-05-02 Brandenburg Labs Gmbh Audio signal processor and related method and computer program for generating a two-channel audio signal using a specific separation and combination processing
WO2024163721A1 (en) * 2023-02-01 2024-08-08 Qualcomm Incorporated Artificial reverberation in spatial audio

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080273708A1 (en) * 2007-05-03 2008-11-06 Telefonaktiebolaget L M Ericsson (Publ) Early Reflection Method for Enhanced Externalization

Family Cites Families (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5371799A (en) * 1993-06-01 1994-12-06 Qsound Labs, Inc. Stereo headphone sound source localization system
DE4328620C1 (en) 1993-08-26 1995-01-19 Akg Akustische Kino Geraete Process for simulating a room and / or sound impression
US5955992A (en) * 1998-02-12 1999-09-21 Shattil; Steve J. Frequency-shifted feedback cavity used as a phased array antenna controller and carrier interference multiple access spread-spectrum transmitter
ATE501606T1 (en) 1998-03-25 2011-03-15 Dolby Lab Licensing Corp METHOD AND DEVICE FOR PROCESSING AUDIO SIGNALS
FR2836571B1 (en) * 2002-02-28 2004-07-09 Remy Henri Denis Bruno METHOD AND DEVICE FOR DRIVING AN ACOUSTIC FIELD RESTITUTION ASSEMBLY
FR2847376B1 (en) 2002-11-19 2005-02-04 France Telecom METHOD FOR PROCESSING SOUND DATA AND SOUND ACQUISITION DEVICE USING THE SAME
FI118247B (en) * 2003-02-26 2007-08-31 Fraunhofer Ges Forschung Method for creating a natural or modified space impression in multi-channel listening
US8027479B2 (en) 2006-06-02 2011-09-27 Coding Technologies Ab Binaural multi-channel decoder in the context of non-energy conserving upmix rules
FR2903562A1 (en) 2006-07-07 2008-01-11 France Telecom BINARY SPATIALIZATION OF SOUND DATA ENCODED IN COMPRESSION.
JP5254983B2 (en) 2007-02-14 2013-08-07 エルジー エレクトロニクス インコーポレイティド Method and apparatus for encoding and decoding object-based audio signal
CN103716748A (en) * 2007-03-01 2014-04-09 杰里·马哈布比 Audio spatialization and environment simulation
WO2009046223A2 (en) 2007-10-03 2009-04-09 Creative Technology Ltd Spatial audio analysis and synthesis for binaural reproduction and format conversion
WO2010070016A1 (en) 2008-12-19 2010-06-24 Dolby Sweden Ab Method and apparatus for applying reverb to a multi-channel audio signal using spatial cue parameters
GB2478834B (en) * 2009-02-04 2012-03-07 Richard Furse Sound system
JP2011066868A (en) 2009-08-18 2011-03-31 Victor Co Of Japan Ltd Audio signal encoding method, encoding device, decoding method, and decoding device
NZ587483A (en) 2010-08-20 2012-12-21 Ind Res Ltd Holophonic speaker system with filters that are pre-configured based on acoustic transfer functions
EP2423702A1 (en) 2010-08-27 2012-02-29 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for resolving ambiguity from a direction of arrival estimate
US9641951B2 (en) 2011-08-10 2017-05-02 The Johns Hopkins University System and method for fast binaural rendering of complex acoustic scenes
US9369818B2 (en) 2013-05-29 2016-06-14 Qualcomm Incorporated Filtering with binaural room impulse responses with content analysis and weighting
JP6458738B2 (en) 2013-11-19 2019-01-30 ソニー株式会社 Sound field reproduction apparatus and method, and program
DE112014005332T5 (en) 2013-11-22 2016-08-04 Jtekt Corporation Tapered roller bearing and power transmission device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MENZER FRITZ ET AL: "Investigations on an Early-Reflection-Free Model for BRIRs", JAES, AES, 60 EAST 42ND STREET, ROOM 2520 NEW YORK 10165-2520, USA, vol. 58, no. 9, 1 September 2010 (2010-09-01), pages 709 - 723, XP040567065 *
RAFAELY BOAZ ET AL: "Interaural cross correlation in a sound field represented by spherical harmonics", THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, AMERICAN INSTITUTE OF PHYSICS FOR THE ACOUSTICAL SOCIETY OF AMERICA, NEW YORK, NY, US, vol. 127, no. 2, 1 February 2010 (2010-02-01), pages 823 - 828, XP012135229, ISSN: 0001-4966, DOI: 10.1121/1.3278605 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10349197B2 (en) 2014-08-13 2019-07-09 Samsung Electronics Co., Ltd. Method and device for generating and playing back audio signal
US10820135B2 (en) 2016-10-19 2020-10-27 Audible Reality Inc. System for and method of generating an audio image
US11516616B2 (en) 2016-10-19 2022-11-29 Audible Reality Inc. System for and method of generating an audio image
US11606663B2 (en) 2018-08-29 2023-03-14 Audible Reality Inc. System for and method of controlling a three-dimensional audio engine
RU2807215C2 (en) * 2019-04-03 2023-11-13 Долби Лабораторис Лайсэнзин Корпорейшн Media server with scalable stage for voice signals

Also Published As

Publication number Publication date
KR20160015269A (en) 2016-02-12
US9420393B2 (en) 2016-08-16
TWI615042B (en) 2018-02-11
CN105340298A (en) 2016-02-17
US20140355796A1 (en) 2014-12-04
KR101728274B1 (en) 2017-04-18
TW201509201A (en) 2015-03-01
EP3005733A1 (en) 2016-04-13
JP6227764B2 (en) 2017-11-08
JP6100441B2 (en) 2017-03-22
WO2014194004A1 (en) 2014-12-04
KR20160015268A (en) 2016-02-12
CN105340298B (en) 2017-05-31
US20140355795A1 (en) 2014-12-04
JP2016523464A (en) 2016-08-08
CN105325013B (en) 2017-11-21
US9369818B2 (en) 2016-06-14
CN105432097A (en) 2016-03-23
KR101788954B1 (en) 2017-10-20
EP3005733B1 (en) 2021-02-24
US20140355794A1 (en) 2014-12-04
EP3005734A1 (en) 2016-04-13
KR20160015265A (en) 2016-02-12
CN105325013A (en) 2016-02-10
CN105432097B (en) 2017-04-26
JP2016523466A (en) 2016-08-08
JP2016523465A (en) 2016-08-08
KR101719094B1 (en) 2017-03-22
JP6067934B2 (en) 2017-01-25
EP3005734B1 (en) 2019-06-19
EP3005735A1 (en) 2016-04-13
US9674632B2 (en) 2017-06-06
WO2014193993A1 (en) 2014-12-04
EP3005735B1 (en) 2021-02-24

Similar Documents

Publication Publication Date Title
EP3005734B1 (en) Filtering with binaural room impulse responses with content analysis and weighting
US11096000B2 (en) Method and apparatus for processing multimedia signals
US9384741B2 (en) Binauralization of rotated higher order ambisonics
EP2962298B1 (en) Specifying spherical harmonic and/or higher order ambisonics coefficients in bitstreams
JP2016531484A (en) Method for processing an audio signal, signal processing unit, binaural renderer, audio encoder and audio decoder
BR112016014892B1 (en) Method and apparatus for audio signal processing
AU2015330759A1 (en) Signaling channels for scalable coding of higher order ambisonic audio data

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase. Ref document number: 201480042431.2; Country of ref document: CN
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14733457; Country of ref document: EP; Kind code of ref document: A1
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (PCT application filed from 20040101)
WWE WIPO information: entry into national phase. Ref document number: 2014733457; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2016516799; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 20157036270; Country of ref document: KR; Kind code of ref document: A