WO2024059505A1 - Head-tracked split rendering and head-related transfer function personalization - Google Patents

Head-tracked split rendering and head-related transfer function personalization

Info

Publication number
WO2024059505A1
WO2024059505A1 (PCT/US2023/073857)
Authority
WO
WIPO (PCT)
Prior art keywords
rendering
hrtf
head pose
binaural
post
Prior art date
Application number
PCT/US2023/073857
Other languages
English (en)
Inventor
Stefan Bruhn
Rishabh Tyagi
Original Assignee
Dolby Laboratories Licensing Corporation
Dolby International Ab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation and Dolby International AB
Publication of WO2024059505A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H04S 7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]

Definitions

  • immersive audio is an essential media component of XR services. These services may typically support adjusting the presented immersive audio/visual scene in response to 3DoF or 6DoF user (head) movements.
  • Carrying out the corresponding immersive audio renditions at high quality typically requires high numerical complexity.
  • One potential solution to address this problem is to carry out the rendering not on the device itself but rather on some entity of the mobile/wireless network to which the end-device is connected or on a powerful mobile user equipment (UE) to which the end-device is tethered. In that case, the end-device would for example only receive the already binaurally rendered audio.
  • the 3DoF/6DoF head pose information (head-tracking metadata) would need to be transmitted to the rendering entity (network entity/UE).
  • a problem with this is the latency for transmissions between the end-device and the network entity/UE, which can be on the order of 100 ms or more. Doing the rendering on the network entity/UE would thus mean that it must rely on outdated head-tracking metadata and that the binauralized audio played out by the end-rendering device does not match the actual head pose of the head/end-device. This latency is referred to as motion-to-sound latency. If it is too large, the end user will perceive it as quality degradation. For the video component of the immersive media rendering, this problem is being addressed by split rendering approaches, where an approximative part of the video scene is rendered by the network entity/UE and final video scene adjustments are done on the end-device. For audio, the field is currently less explored.
  • immersive voice and audio services (IVAS)
  • HRTFs: head-related transfer functions
  • decoding and head-tracked binaural rendering may be computationally complex operations.
  • scene-based audio (e.g., higher-order Ambisonics), channel-based audio (e.g., with a 7.1.4 channel layout), or object-based audio with many objects may each rely on a large multitude of constituent audio components, which, due to this multitude, are computationally complex to decode and render.
  • decoding a bitstream and binaural rendering in response to the user’s head movement requires a large amount of computational processing.
  • the computational complexity requires power and produces heat that may be problematic for small portable devices like AR glasses. It is an object of the present invention to overcome the problems described herein, and to provide a split rendering, where head pose specific processing may be performed at a second device. According to some implementations, this and other objects are achieved by a method according to claim 1 or claim 14.
  • According to another implementation, this and other objects are achieved by a user-held device according to claim 22.
  • Techniques for direction of arrival (DOA) based head-tracked split rendering and head-related transfer function (HRTF) personalization are described.
  • Head-tracked audio decoding and binaural rendering may be split between two or more devices.
  • a first device may coordinate split decoding and rendering operations with a second device.
  • the first device, e.g., a smartphone, receives a main bitstream representation of encoded audio.
  • the first device decodes and renders the main bitstream into pre-rendered binaural signals using a main decoder and binaural renderer, and encodes the pre-rendered binaural signals and post-render metadata, including information about the HRTF associated with the binaural rendering.
  • the first device provides the pre-rendered binaural signals and post-renderer metadata to the second device as a multiplexed intermediate bitstream.
  • the second device, e.g., a headphone, AR glasses, or an earbud, tracks current head pose information.
  • the second device decodes the pre-rendered binaural signals and post-renderer metadata from the intermediate bitstream, and provides the decoded pre-rendered binaural signals and post-renderer metadata to a lightweight renderer.
  • the lightweight renderer renders the pre-rendered binaural signals into binaural audio based on the post-renderer metadata, the current head pose information, generic HRTF, and optionally personalized HRTF.
  • the post-rendering metadata includes at least an indication of the pre-rendering HRTF that has been used in the binaural pre-rendering.
  • the pre-rendering HRTF is associated with a direction of arrival (DOA) of a dominant directional component of the audio content (typically two angles) in relation to an assumed head pose.
  • the indication of the pre-rendering HRTF may be the DOA, or some sort of index, allowing the user-held device to identify the correct HRTF.
  • the indication of the pre-rendering HRTF also includes one or several parameters that may be personalized.
  • the rendering may involve calculating a compensated stereo audio signal by applying an HRTF compensation operation, configured to compensate an effect of a pre-rendering HRTF, to the binaural audio signal, and calculating a binaural output signal by applying a post-rendering HRTF to the compensated stereo signal. These steps may be performed in one single operation.
  • the HRTF compensation operation may involve an inverse of the pre-rendering HRTF.
  • BRIRs: binaural room impulse responses
  • all HRTF processing needs to be performed for each time frame and for each frequency band, often expressed as time/frequency-tiles.
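  • As an illustration of such per-tile processing, the following minimal Python sketch applies a pre-rendering HRTF, selected per time/frequency tile from the dominant DOA, to a mono directional component; all names, shapes and the single-tap complex-gain HRTF model are illustrative assumptions rather than details taken from this disclosure.

    import numpy as np

    def prerender_tiles(directional, doa_index, hrtf_left, hrtf_right):
        """Apply one HRTF per time/frequency tile to a mono directional component.

        directional      : complex array (frames, bands), dominant directional component
        doa_index        : int array (frames, bands), HRTF/DOA index chosen per tile
        hrtf_left/right  : complex arrays (num_doas, bands), one gain per DOA and band
        Returns the pre-rendered binaural tiles (L1, R1).
        """
        left = np.empty_like(directional)
        right = np.empty_like(directional)
        frames, bands = directional.shape
        for t in range(frames):
            for b in range(bands):
                d = doa_index[t, b]
                left[t, b] = hrtf_left[d, b] * directional[t, b]
                right[t, b] = hrtf_right[d, b] * directional[t, b]
        return left, right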
  • the assumed head pose is included in the metadata.
  • the user-held device is configured to send the current head pose to the main device.
  • the second device encodes at least a portion of the head pose information into a head pose bitstream and provides the bitstream to the first device.
  • the first device decodes the head pose bitstream to obtain the head pose information, and then applies the head pose information to the main decoder and pre-renderer.
  • the main decoder/pre-renderer decodes and pre-renders the main bitstream based on the received head pose information (also referred to as assumed head pose) and generic HRTF.
  • the user-held device can estimate the assumed head pose based on an expected delay of transmission.
  • Information of the assumed head pose is transmitted together with other information to the second device unless that device derives the assumed head pose from a priori knowledge, which may be based on (head pose) information previously transmitted to the first device or an assumed head pose that is pre-agreed between both devices.
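  • As a sketch of how the user-held device could derive the assumed head pose P' without it being transmitted back, the following Python fragment keeps a history of the poses it has reported and looks up the one sent roughly one transmission delay ago; the yaw-only pose and all names are simplifying assumptions.

    from collections import deque

    class PoseHistory:
        """Recently reported (timestamp, yaw) pairs, used to reconstruct which
        pose the pre-renderer is presumed to be working with."""

        def __init__(self, maxlen=500):
            self.items = deque(maxlen=maxlen)

        def report(self, timestamp_s, yaw_deg):
            self.items.append((timestamp_s, yaw_deg))

        def assumed_pose(self, now_s, expected_delay_s):
            """Return the reported yaw closest to (now - expected delay)."""
            target = now_s - expected_delay_s
            return min(self.items, key=lambda item: abs(item[0] - target))[1]

    history = PoseHistory()
    for i in range(10):
        history.report(timestamp_s=i * 0.02, yaw_deg=3.0 * i)          # pose sent every 20 ms
    p_assumed = history.assumed_pose(now_s=0.2, expected_delay_s=0.1)  # pose sent ~100 ms ago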
  • the present disclosure relates to a further inventive concept, involving techniques for DOA based head-tracked split rendering with a suitable prototype signal and an optional diffused signal.
  • the first device decodes the main bitstream using a main decoder, and renders the decoded bitstream as a dominant directional component, referred to as a prototype signal, and zero or more diffused signals, and post-render metadata.
  • the first device then encodes the prototype signal and zero or more diffused signals (or parameters representing them) and post- renderer metadata and provides it to the second device as a multiplexed intermediate bitstream.
  • the second device decodes the prototype signal and zero or more diffused signals and post- renderer metadata from the intermediate bitstream, and provides the decoded prototype signal and zero or more diffused signals and post-renderer metadata to a lightweight renderer.
  • the lightweight renderer renders the prototype signal and zero or more diffused signals into binaural audio based on the post-renderer metadata, information relating to the head pose, generic HRTF, and optionally personalized HRTF.
  • the wearable device performs lightweight rendering based on the current head pose of the user, without having to rely only on a binaural rendition by a heavy-duty rendering device that may only have access to delayed/outdated head pose information, thereby reducing the motion-to-sound latency caused by the use of outdated head pose information during rendering.
  • the allocation of processing between the first and second devices can be flexible, e.g., by tuning the amount of head pose information transmitted from the second device to the first device, from none to all of it, thereby allowing various wearable devices with different processing power to be matched.
  • FIG. 1 is a block diagram of an example system implementing head-tracked split rendering.
  • FIG. 2 is a flow chart illustrating processing in a first or main device.
  • FIG. 3 is a flow chart illustrating processing in a second or user-held device.
  • FIG. 4 illustrates example techniques of DOA-based split rendering with pre-rendered binaural signal.
  • FIG. 5 illustrates example techniques of HRTF personalization.
  • FIG. 6 illustrates example techniques of DOA-based split rendering with prototype signal.
  • the computer hardware may for example be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, switch or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that computer hardware.
  • the present disclosure shall relate to any collection of computer hardware that individually or jointly execute instructions to perform any one or more of the concepts discussed herein.
  • Certain or all components may be implemented by one or more processors that accept computer- readable (also called machine-readable) code containing a set of instructions that when executed by one or more of the processors carry out at least one of the methods described herein.
  • processors capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken are included.
  • a typical processing system (e.g., computer hardware) may include one or more processors, which may include one or more of a CPU, a graphics processing unit, and a programmable DSP unit.
  • the processing system further may include a memory subsystem including a hard drive, SSD, RAM and/or ROM.
  • a bus subsystem may be included for communicating between the components.
  • the software may reside in the memory subsystem and/or within the processor during execution thereof by the computer system.
  • the one or more processors may operate as a standalone device or may be connected, e.g., networked to other processor(s).
  • Such a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
  • the software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
  • computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media includes, but is not limited to, physical (non-transitory) storage media in various forms, such as ROM, PROM, EPROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • the disclosure assumes that there is an immersive audio codec, such as an IVAS codec, being used in some extended reality, XR, application.
  • the main decoding and pre-rendering may be done by a first device (user equipment, UE) or the Edge or other network node of an assumed 5G system.
  • the second device contains a post decoder and a (lightweight) post renderer.
  • the first device (main device) may be a mobile device like a laptop, or tablet, or smartphone, or a stationary device such as a workstation or a server.
  • the first device may also be a combination of several processing devices.
  • the second device may be a user-held (e.g., worn) device, such as a pair of augmented reality, AR, glasses.
  • One basic assumption and inventive insight applied in the field of split rendering is that, per time-frequency tile, the audio is composed of one dominant directional component and a diffuse (omni-directional) component.
  • the directional component is assumed to be a prototype signal S arriving from a certain direction of arrival (DOA), while the diffuse component is a decorrelated version of that prototype signal.
  • This concept was proven to be very powerful in spatial audio coding approaches like DirAC or metadata-assisted spatial audio (MASA) coding. Based on at least these assumptions, various example implementations may comprise the following steps:
  • the pre-renderer binauralizes the decoded immersive audio using a set of generic HRTFs (or BRIRs), given the head pose P' that has either been transmitted from the lightweight device equipped with a head-tracker or that may just be a pre-set value that does not necessarily correspond to any actual head pose of the user, but rather could be a reasonable default, like a straight, forward-looking head pose.
  • the application of the HRTFs during the binaural pre- rendering operation may be done with a specifically selected HRTF for each time/frequency tile.
  • the HRTFs are selected based on a direction of arrival (DOA) of the dominant component of the immersive audio content, in relation to the assumed head pose.
  • the first or main device encodes and transmits the binauralized audio channels and an indication of the used HRTFs and/or the DOA angles, and the assumed head pose P'.
  • the post-renderer of the second device aims at adjusting the received left and right binaural signals with respect to the current head pose P (if it deviates from P').
  • left and right HRTF-compensated signals are calculated by the second device essentially by inverse HRTF filtering of the left and right audio channels and optionally combining them linearly.
  • the HRTF-compensated signals are then filtered by the second device with the correct HRTFs corresponding to the correct head pose.
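  • The compensation described in these steps can be pictured, per time/frequency tile, as dividing out the pre-rendering HRTF and re-applying the HRTF for the current head pose; the Python sketch below assumes single-tap complex HRTF gains per band and ignores the diffuse part, so it is only a simplified illustration of the principle.

    import numpy as np

    def postrender_tile(l1, r1, h_pre_l, h_pre_r, h_post_l, h_post_r, eps=1e-9):
        """Compensate the pre-rendering HRTF and apply the post-rendering HRTF
        for one time/frequency tile (inputs are complex per-band values)."""
        # inverse filtering of the received binaural tile (regularised division)
        comp_l = l1 * np.conj(h_pre_l) / (np.abs(h_pre_l) ** 2 + eps)
        comp_r = r1 * np.conj(h_pre_r) / (np.abs(h_pre_r) ** 2 + eps)
        # optional linear combination into one compensated, prototype-like signal
        proto = 0.5 * comp_l + 0.5 * comp_r
        # re-binauralize with the HRTF matching the current head pose
        return h_post_l * proto, h_post_r * proto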
  • FIG. 1 is a block diagram of an example system implementing head-tracked split rendering arranged in accordance with various aspects of the present invention.
  • the example system includes a first device 10, and a second device 20.
  • the first device 10 may also be referred to as a main device, while the second device 20 may also be referred to as a mobile or user-held device.
  • the first or main device 10 includes a decoder/renderer 11, an optional head pose decoder 12, encoders 13 and 14, and a multiplexer 15.
  • the decoder/renderer 11, e.g., an IVAS decoder, receives (step S1) the main bitstream b1 including encoded immersive audio content, decodes (step S2) the immersive audio content, and performs binaural rendering (step S3) of the decoded audio content using HRTFs associated with a direction of arrival, DOA, relative to an assumed head pose P' of the user.
  • the used HRTFs are typically taken out of a set Hg of generic HRTFs (for various directions of arrival, DOA).
  • the assumed head pose P' may be an appropriate default head pose or may be an actual user head pose received from the user-held device, which may optionally be decoded by the head pose decoder 12 (step S21). Such a decoded user head pose may represent a recent, but not quite current, head pose of the user.
  • the renderer 11 outputs the binaural signal L1, R1 as well as post-rendering metadata M.
  • the post-rendering metadata M includes an indication of the used HRTFs, expressed e.g., as a direction of arrival, DOA, of the dominant directional component of the immersive audio content, expressed in relation to the assumed head pose P' or an index of the used HRTFs.
  • the post-rendering metadata M may also include an indication of the head pose P' associated with the binaural rendering.
  • the encoders 13, 14 are arranged to encode (step S4) the binaural signal L1, R1 and the post-rendering metadata M, into encoded signals b11 and b12, respectively.
  • the multiplexer 15 is arranged to multiplex or combine (step S5) the encoded binaural signal b11, and the encoded metadata b12, into an intermediate bitstream b2, which is transmitted (step S6) to the second device 20.
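  • For concreteness, the intermediate bitstream b2 can be thought of as carrying, per frame, the coded binaural audio together with the post-rendering metadata; the container below is only an illustrative sketch of such a payload, not a normative bitstream syntax.

    from dataclasses import dataclass
    from typing import Optional, Sequence

    @dataclass
    class IntermediateFrame:
        """Illustrative per-frame payload of the intermediate bitstream b2."""
        coded_binaural: bytes                        # encoded L1/R1 (b11)
        doa_azimuth_deg: Sequence[float]             # per-band DOA of the dominant component
        doa_elevation_deg: Sequence[float]
        hrtf_index: Optional[Sequence[int]] = None   # alternative indication of the used HRTFs
        assumed_yaw_deg: Optional[float] = None      # assumed head pose P', if transmitted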
  • the second device 20, which may be a user-held device 20, includes a demuxer 21, decoders 22 and 23, an encoder 25, a renderer 26, and a head-tracker 24.
  • the user-held device 20 receives (step S11) the intermediate bitstream b2, and demuxer 21 separates the intermediate bitstream b2 into encoded signals b21 and b22, which are received by the corresponding decoders 22 and 23.
  • Decoders 22 and 23 responsively decode (step S12) the encoded signals b21 and b22 to obtain a decoded binaural signal L2, R2 and decoded metadata M'.
  • the metadata M' includes an indication of the HRTF used, e.g. indicated by an index or a direction of arrival with respect to a forward-looking head pose.
  • the head-tracker 24, which may be included in the user-held device 20 or be connected thereto, detects (step S13) a current head pose P of the user’s head.
  • the encoder 25 may optionally be used to encode (step S131) the detected head pose P as b P and transmit the encoded detected head pose bP to the main device 10.
  • the metadata M' may also include the assumed head pose P' used in renderer 11.
  • the user-held device can estimate the assumed head-pose based on an expected transmission delay.
  • the assumed head pose can be assumed to be the head pose detected at a point in time corresponding to the expected transmission delay.
  • the renderer 26 receives the decoded binaural audio signal L2, R2, the DOA or used HRTFs, the assumed head pose P’ and the current head pose P, and calculates an output binaural signal Lout, Rout.
  • This processing involves identifying (step S14) a post-rendering HRTF corresponding to the detected, current head pose P, calculating a compensated stereo audio signal (step S15) by applying an HRTF compensation operation, configured to compensate an effect of the pre-rendering HRTF, to the binaural audio signal, and finally applying (step S16) the identified post-rendering HRTF.
  • the renderer 26 is provided with HRTF data, typically a set Hg of generic HRTFs (for various directions of arrival, DOA).
  • the renderer 26 may also be provided with a set Hp of personalized HRTFs.
  • FIG. 2 is a flow chart illustrating processing in a first device (or main device), which includes the above-described steps S1 – S6, and optional step S21.
  • the process includes, at step S1, “receive bitstream”, receiving a bitstream, and at step S2, “decode”, decoding the bitstream by a decoder to obtain decoded immersive audio content.
  • At step S21, “decode pose”, the process may include the optional step of receiving and decoding an indication of a current user head pose from a second, user-held device, and determining the assumed user head pose based on the current user head pose.
  • At step S3, “pre-render”, the process involves binauralizing the immersive audio content by a pre-renderer to generate a pre-rendered binaural signal, the binauralizing using a pre-rendering HRTF out of a set of HRTFs and an assumed head pose of a user.
  • At step S4, “encode”, the process involves encoding the pre-rendered binaural signal, and encoding post-rendering metadata, the metadata indicating the pre-rendering HRTF.
  • At step S5, “combine”, the process involves combining, in a multiplexer, the encoded binaural audio signal and the encoded post-rendering metadata, to form a bitstream including a binaural audio representation.
  • FIG. 3 is a flow chart illustrating processing in a second device (or user-held device), which includes above-described steps S11 - S16, and optional step S131.
  • the process includes, at step S11, “receive bitstream”, receiving, from a first device (or main device), a bitstream including a representation of a binaural pre-rendering of an immersive audio content. The binaural pre-rendering has been obtained with respect to an assumed head pose P'.
  • At step S12, “decode”, the process involves decoding the bitstream to obtain a binaural audio signal and associated post-rendering metadata.
  • the metadata indicates a pre-rendering HRTF used in the binaural pre-rendering, where the pre-rendering HRTF is associated with the assumed head pose P’.
  • At step S13, “detect current pose”, the process involves obtaining user head pose information indicating a current head pose P.
  • At step S131, “encode pose”, the process may include the optional step of encoding the detected head pose P with an encoder, and transmitting an indication of the current head pose P to the main device.
  • the main device can then use the current pose received from the second, user-held device as the assumed pose. In this case, the second, user-held device can estimate the assumed head pose P' based on an expected transmission delay (and a previously transmitted current pose).
  • At step S14, “identify post-rendering HRTF”, the process involves identifying a post-rendering HRTF based on the metadata, the assumed head pose P' and the current head pose P.
  • At step S15, “calculate compensated audio”, the process involves calculating a compensated stereo audio signal by applying an HRTF compensation operation, configured to compensate an effect of the pre-rendering HRTF, to the binaural audio signal. Described herein are various example HRTF-compensation operations that are suitable for calculating the compensated stereo audio signal. These operations may involve any number of methods that suitably counter and adjust for various effects resulting from the pre-rendering HRTF operations. In some examples, the HRTF compensation may involve an inverse mathematical operation of the pre-rendering HRTF.
  • the HRTF compensation may be implemented with a look-up table type of operation, where – to reduce memory needs – the lightweight device may rely on a less dense set of HRTFs than used in the pre-rendering device.
  • the set of inverse HRTFs available to the lightweight device may comprise suitable approximations of the inverses of the HRTFs available in the HRTF set at the pre-rendering device.
  • the HRTF compensation may be implemented with a numerical approximation method that may include interpolation, either linear or non-linear or combinations thereof.
  • the HRTF compensation may be implemented with a best fit type of approximation. Combinations of various methods are equally applicable and considered within the scope of the present disclosure.
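  • As an illustration of the look-up-table option mentioned above, the lightweight device could keep a sparser grid of approximate inverse-HRTF gains and pick the nearest entry to the signalled DOA; the grid spacing, azimuth-only indexing and names below are assumptions made for brevity.

    import numpy as np

    class InverseHrtfTable:
        """Sparse table of approximate inverse-HRTF gains, indexed by azimuth."""

        def __init__(self, azimuths_deg, inverse_gains):
            self.azimuths = np.asarray(azimuths_deg, dtype=float)   # e.g. every 15 degrees
            self.inverse_gains = np.asarray(inverse_gains)          # (num_azimuths, bands), complex

        def lookup(self, azimuth_deg):
            # nearest neighbour on the circle; interpolation would be a refinement
            diff = np.angle(np.exp(1j * np.deg2rad(self.azimuths - azimuth_deg)))
            return self.inverse_gains[int(np.argmin(np.abs(diff)))]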
  • At step S16, “apply post-rendering HRTF”, the process involves calculating a binaural output signal by applying the post-rendering HRTF to the compensated stereo signal.
  • Steps S15 and S16 may be performed as a single operation.
  • the processes illustrated by figures 2 and 3 include a collection of blocks, which represent a sequence of operations or steps that can be implemented in hardware, software, or a combination thereof.
  • the blocks may represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
  • computer-executable instructions may include routines, programs, objects, components, data structures, and the like that perform or implement functions.
  • a basic assumption is that, per time-frequency tile, the audio is composed of one dominant directional component and a diffuse (omni-directional) component.
  • the directional component is assumed to be a prototype signal S arriving from a certain DOA having azimuth and elevation angles θ, φ expressed in some room coordinate system.
  • the diffuse component is a decorrelated version of the prototype signal S.
  • θ', φ' are the azimuth and elevation angles of the directional component relative to the head pose P' assumed at the pre-renderer 11.
  • Approach at the post-renderer: the post-renderer aims at adjusting the received left and right binaural signals L1 and R1 with respect to the current head pose P, if it deviates from P'.
  • the HRTF-compensated left and right channel signals are, respectively, L1 · hL^(-1)(θ', φ') and R1 · hR^(-1)(θ', φ'), where hL and hR denote the pre-rendering HRTFs; these compensated signals may be combined linearly using suitable weighting factors or operators wL and wR.
  • This approach leads to correct directional components in the output signals with regard to the present head pose.
  • wL, wR can be linear or non-linear operators, like (frequency-selective) filter operators or gain limiters, to avoid the output samples exceeding a predetermined number range.
  • FIG. 4 illustrates example techniques of DOA-based split rendering with pre-rendered binaural signals.
  • the figure further shows the assumed situation at the post-renderer, which has knowledge of the actual head pose P with angle θ. It is shown how the assumed wavefront arrives from a different DOA with respect to the actual head pose P (angle θ instead of θ'), which in turn results in a different ITD τ(θ). Notably, there are also different ILDs and spectra corresponding to the actual head pose.
  • One main concept of the disclosure visualized in FIG. 4 is thus to change the ITD from τ(θ') to τ(θ) by first compensating τ(θ') and then applying τ(θ).
  • ILD and spectra are modified by compensating those corresponding to the head pose P' and then applying the ILD and spectra of the HRTFs corresponding to the actual head pose P.
  • the above description leads to the following simplified approach (also making reference to FIG. 4). Assumptions: the back-front axis A of the listener 30 defines the x-axis of a right-handed coordinate system. Furthermore, in many relevant cases, a user mostly makes head movements around the yaw axis (z-axis), and most immersive audio content has sound sources that are close to the horizontal plane. Thus, the elevation angle of the DOA is relatively close to zero degrees (e.g., bounded within the interval [-20, 20] degrees).
  • a head pose P' of the listener is assumed while pre-rendering.
  • the pre-renderer renders under an azimuthal component of the DOA θ', which deviates from the true azimuth θ at playback time.
  • the post-renderer renders under the azimuthal component of the DOA θ corresponding to the listener head pose P at playback time.
  • the post-renderer should adjust the ITD of the directional component in a given time-frequency tile from τ(θ') to τ(θ).
  • the post-renderer also adjusts inter-aural level differences and spectral shape given the true head pose compared to the head pose assumed by the pre-renderer. It is notable that similar formulations are possible even without the assumption of a bounded elevation component of the DOA. Even in that case, the post-renderer operations can be decomposed into ITD adjustments, inter-aural level difference adjustments and spectral shape adjustment. In that case, the amount of required ITD adjustment will, however, depend on the azimuth and elevation angles of the DOA assumed while pre-rendering and effective during post-rendering.
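  • Under the yaw-only, near-horizontal assumption, the ITD re-targeting from τ(θ') to τ(θ) can be sketched with a simple spherical-head (Woodworth-style) approximation; the head radius, the speed of sound and the choice of this particular model are assumptions made purely for illustration.

    import numpy as np

    def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
        """Approximate ITD of a source at the given azimuth (spherical head model)."""
        theta = np.deg2rad(azimuth_deg)
        return (head_radius_m / speed_of_sound) * (theta + np.sin(theta))

    def itd_correction(azimuth_prerender_deg, azimuth_current_deg):
        """Delay change the post-renderer applies: remove tau(theta') and impose tau(theta)."""
        return itd_seconds(azimuth_current_deg) - itd_seconds(azimuth_prerender_deg)

    delta_itd = itd_correction(azimuth_prerender_deg=30.0, azimuth_current_deg=10.0)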
  • FIG. 5 illustrates example techniques of HRTF personalization. It is shown how an assumed wavefront arriving at the head 30 of a listener from DOA angle θ results in different ITDs depending on the size of the listener's head. Pre-rendering with generic HRTFs may assume a listener head dimension with a generic inter-aural distance Dg. This will result in generic ITDs τg(θ) and corresponding ILDs and spectral shapes of the left and right audio signals. Personalized HRTFs would be based on (more) correct listener head dimensions. Accordingly, this would result in more correct, personalized ITDs τp(θ) and more correct corresponding ILDs and spectral shapes of the left and right audio signals.
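  • A crude first-order sketch of the personalization idea is to rescale the generic ITD by the ratio of the personalized to the generic inter-aural distance; real HRTF personalization additionally adapts ILDs and spectral shapes, so the fragment below is only an assumed simplification.

    def personalize_itd(itd_generic_s, interaural_generic_m=0.175, interaural_personal_m=0.16):
        """Scale a generic ITD to an individual head width (illustrative first-order model)."""
        return itd_generic_s * (interaural_personal_m / interaural_generic_m)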
  • the general idea of the HRTF personalization is that the post-renderer will compensate for the generic HRTFs and impose the effect of the personalized HRTFs.
  • the overall concept is very similar to the above-described head pose correction at the post-renderer. Accordingly, both concepts are compatible with each other and can be combined easily.
  • the same description as above for the main device 10 with pre-renderer 11 applies.
  • the post-renderer 26 in the user-held device 20 makes adjustments using a set of HRTFs Hp, whereby the post-renderer is aware of the generic HRTFs that were used by the pre-renderer.
  • the post-renderer 26 aims at adjusting the received left and right binaural signals L1 and R1 with respect to the current head pose P, if it deviates from P’, and with respect to the set of personalized HRTFs Hp.
  • at the post-renderer, the prototype signal S and the decorrelated signals are unavailable. Instead, the required signal components will be approximated in a parametric approach using the available signals L1 and R1.
  • the personalized left output may be approximated as Lout,p ≈ [ wL · hg,L^(-1)(θ', φ') · L1 + (1 − wL) · hg,R^(-1)(θ', φ') · R1 ] · hp,L(θ, φ), with the analogous expression for the right output channel.
  • while the involved delay change of the decorrelated diffuse component may not matter perceptually, the gain/shape change may lead to timbral deviations or coloration effects.
  • the post-renderer receives direction of arrival (DOA) information.
  • DOA information may be represented as azimuth and elevation angles (DOA angles) θ', φ' of the dominant directional component of the immersive audio content in relation to the assumed head pose P'. It should be noted that the DOA is determined per time-frequency tile. Indexes of the used HRTFs are another way to provide the DOA information to the post-renderer.
  • the post-renderer must be aware of the head pose P' assumed at the pre-renderer. Corresponding information may be transmitted to the post-renderer (i.e., in the metadata). It is also possible to rely on the fact that P' corresponds to the true head pose at an earlier time instant, which has been transmitted from the post-renderer to the pre-renderer. Assuming the transmission delay from the post-renderer to the pre-renderer is a priori known or can be estimated, this would make the transmission of P' to the post-renderer unnecessary.
  • One way to estimate the transmission delay from post-renderer to pre-renderer is to base it on round-trip delay measurements from post-renderer to pre-renderer and back to post-renderer, e.g., using time stamps.
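  • A minimal sketch of such a time-stamp based estimate: the post-renderer stamps its outgoing head-pose messages, the pre-renderer echoes the stamp, and the one-way delay is taken as half of a smoothed round-trip time; the symmetric-link assumption and all names are illustrative.

    import time

    class DelayEstimator:
        """Estimates the post-renderer -> pre-renderer delay from echoed time stamps."""

        def __init__(self, smoothing=0.1):
            self.smoothing = smoothing
            self.one_way_s = None

        def on_echo(self, sent_timestamp_s):
            round_trip = time.monotonic() - sent_timestamp_s
            sample = 0.5 * round_trip                 # assume an approximately symmetric link
            if self.one_way_s is None:
                self.one_way_s = sample
            else:
                self.one_way_s += self.smoothing * (sample - self.one_way_s)
            return self.one_way_s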
  • the involved parameters and the weights wL, wR are mathematically inter-connected. There is thus the possibility to exploit this inter-dependency, which may for example help in finding suitable choices of wL, wR with which it may be possible to avoid using a decorrelator in the post-renderer.
  • the benefit of such an approach is the avoidance of post-renderer complexity.
  • pre-rendered binaural channel signals L1, R1 are transmitted in the complex-valued quadrature mirror filterbank (CQMF)/frequency domain, which avoids doing a forward time-to-CQMF/frequency-domain operation in the post-renderer and is advantageous in terms of complexity and delay.
  • a notable difference between the present approach and conventional techniques is that the present approach relies on compensation of the HRTF filter operations of the pre-renderer and applying the HRTFs that would ideally have been used.
  • alternative techniques rely on transforming the binaural output channels using a linear transform whose coefficients are obtained following an LMS approach and interpolation.
  • An example implementation for DOA-based split rendering with prototype signal comprises the following steps:
  • the pre-renderer or decoder generates a prototype signal (S).
  • Some example approaches to generate S are as follows: a. Get the Ambisonics W or omnidirectional channel representation from the decoder output with any of the known techniques and use that as S. b. Get a representation of the dominant eigen-signal from the decoder output and use that as S. c. Pre-render the decoded immersive audio using a set of generic HRTFs (or BRIRs) and generate S = aL + bR, wherein L and R are the left and right channels of the pre-rendered binaural signal, and a and b are complex or real-valued gain factors per time-frequency tile that can be either dynamically computed or statically predetermined values.
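  • Option c can be sketched per time/frequency tile as a gained sum of the pre-rendered channels; the fixed gains in this fragment are only an example of statically predetermined values.

    import numpy as np

    def prototype_from_binaural(l1, r1, a=0.5 + 0j, b=0.5 + 0j):
        """Prototype signal S = a*L + b*R per time/frequency tile (complex arrays or scalars)."""
        return a * np.asarray(l1) + b * np.asarray(r1)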
  • the main device transmits the coded prototype signal S, the assumed head pose P' and/or the assumed DOA angles (or equivalent information), and diffuseness parameters.
  • the post-renderer decodes the prototype signal bits and generates S' (which should be the same as S if the codec used to code S has zero delay and is lossless).
  • the post-renderer aims at generating left and right binaural signals with respect to the current head pose P (if it deviates from P').
  • the post-renderer adjusts the DOA angles sent by the main device based on the difference between P and P'.
  • Together with S', the HRTFs at the post-renderer, and the adjusted DOA angles, the post-renderer generates the directional components of the post-rendered binaural signal. Diffuseness parameters are used with a decorrelated S' to fill in the diffuse energy in the post-rendered binaural signal.
  • Approach at the post-renderer: the post-renderer aims at adjusting the received DOA with respect to the current head pose P, if it deviates from P', and at then rendering the binaural output together with the prototype signal S' and a set of HRTFs Hp, which may be personalized or generic.
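  • A compact sketch of this prototype-based post-rendering; the yaw-only pose difference, single-tap HRTF gains, the sign convention of the DOA adjustment and the trivial decorrelator are all assumptions made for brevity.

    import numpy as np

    def postrender_prototype(s, doa_azimuth_deg, assumed_yaw_deg, current_yaw_deg,
                             hrtf_lookup, diffuseness):
        """Render one tile from the decoded prototype signal S'.

        hrtf_lookup(azimuth_deg) -> (h_left, h_right) complex gains for that DOA.
        diffuseness in [0, 1] splits energy between directional and diffuse parts.
        """
        # re-express the transmitted DOA relative to the current head pose
        adjusted_azimuth = doa_azimuth_deg + (assumed_yaw_deg - current_yaw_deg)
        h_left, h_right = hrtf_lookup(adjusted_azimuth)
        direct = np.sqrt(1.0 - diffuseness) * s
        diffuse = np.sqrt(diffuseness) * s * 1j      # placeholder decorrelation (90 degree shift)
        left = h_left * direct + diffuse
        right = h_right * direct - diffuse
        return left, right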
  • FIG. 6 illustrates example techniques of DOA-based split rendering with prototype signal.
  • the main device 110 includes a decoder/renderer 111, a head pose decoder 112, encoders 113, 114 and a multiplexer 115.
  • the decoder/renderer 111 receives the main bitstream b1 and performs rendering synthesis of a prototype signal S, having a direction of arrival, DOA, in relation to an assumed head pose P' of the user.
  • the assumed head pose P' may be an appropriate default head pose or may be an actual user head pose received from the user-held device, which may optionally be decoded by the head pose decoder 112.
  • Such a decoded user head pose will represent a recent, but not quite current, head pose of the user.
  • the renderer 111 here outputs a prototype signal S and metadata M including at least the direction of arrival, DOA, of the prototype signal.
  • the encoders 113, 114 encode the prototype signal S and the metadata M, and the multiplexer 115 multiplexes the encoded prototype signal b11 and encoded metadata b12 into one intermediate bitstream b2.
  • the user-held device 120 includes a demuxer 121, decoders 122, 123, a head-tracker 124, an encoder 125, and a post-renderer 126.
  • the demuxer 121 receives the intermediate bitstream and separates it into two encoded signals b21 and b22, and the two decoders 122, 123 responsively decode these signals to obtain a decoded prototype signal S' and decoded metadata M', e.g., the DOA and (optionally) assumed head pose P' used in renderer 111.
  • the head-tracker 124 which may be included in the user-held device 120 or be connected thereto, detects a current head pose P of the user’s head.
  • the encoder 125 encodes the detected head pose P and transmits it to the main device 110.
  • the post-renderer 126 receives the decoded prototype signal S', the DOA, the assumed head pose P' and the current head pose P, and calculates an output binaural signal L out , R out .
  • the post-renderer 126 is provided with HRTF data, typically a set Hg of generic HRTFs (for various directions of arrival, DOA).
  • the post-renderer 126 may also be provided with a set Hp of personalized HRTFs.
  • 1. AR/MR involving audio. Audio zoom/magnifier: like magnifying glasses, but for sound; the user may zoom in on sounds of interest. Overlay of real-world objects with sounds: real-world objects/items will be associated with sounds; useful for, but not limited to, assistance systems for sight-impaired persons. Dialog enhancement/smart ambient noise reduction: help for people with the cocktail-party problem, lifting the active voices over the ambient noise. Mood sound ambiance: like mood light; sound will be associated with the real-world environment, items and personal preference.
  • 2. Use case characteristics. These will typically rely on audio/visual capture, some scene analysis and generation of the augmented sound signal. In some scenarios, the result may also be overlaid with immersive sound from some network node or from the far end in a communication. The use cases will typically rely on head-tracked audio/visual rendering.
  • 3. Further non-AR/MR use cases. Immersive voice communication (2-party, conferencing) and immersive content streaming with AR glasses as end device are likely IVAS use cases. Some of them may rely on head-tracked audio rendering, some may not. Some use cases may involve one-to-many immersive distribution of head-tracked audio.
  • Aspects of the systems described herein may be implemented in an appropriate computer-based sound processing network environment for processing digital or digitized audio files.
  • Portions of the adaptive audio system may include one or more networks that comprise any desired number of individual machines, including one or more routers (not shown) that serve to buffer and route the data transmitted among the computers.
  • a network may be built on various different network protocols, and may be the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), or any combination thereof.
  • WAN Wide Area Network
  • LAN Local Area Network
  • One or more of the components, blocks, processes or other functional components may be implemented through a computer program that controls execution of a processor-based computing device of the system.
  • the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics.
  • Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, physical (non-transitory), non-volatile storage media in various forms, such as optical, magnetic or semiconductor storage media. While one or more implementations have been described by way of example and in terms of specific embodiments, it is to be understood that one or more implementations are not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements. Further details and embodiments of the present invention may be understood from the following list of enumerated exemplary embodiments (EEEs):
  • EEE1. A method of processing audio comprising: receiving, by a first device, a main bitstream representation of encoded audio; obtaining, by a second device, user head pose information; determining, by the first device from the main bitstream, downmixed signals comprising at least one channel and metadata; providing, by the first device to the second device, the downmixed signals and metadata; and rendering, by a lightweight renderer of the second device, the downmixed signals into output binaural audio based on the metadata and the user head pose information.
  • EEE2 The method of EEE1, wherein the downmixed signals comprise pre-rendered binaural signals.
  • determining the pre-rendered binaural signals and rendering metadata comprises: decoding the main bitstream representation by a main renderer of the first device to generate decoded audio; binauralizing the decoded audio by a pre-renderer of the first device to generate the pre-rendered binaural signals and rendering metadata, wherein the pre-renderer performs the binauralizing using at least one of: a generic head-related transfer function (HRTF) or binaural room impulse response (BRIR), or the user head pose information, the user head pose information being obtained from at least one of: a head tracker of the second device, a storage device storing a pre-set value, or an assumed direction of arrival (DOA) angle.
  • The method of EEE4, wherein rendering the pre-rendered binaural signals into output binaural audio comprises adjusting, by the lightweight renderer, left and right channels of the pre-rendered binaural signals with respect to a current user head pose obtained through the head tracker, rather than the assumed user head pose used by the pre-renderer.
  • rendering the pre-rendered binaural signals comprises: inverse HRTF filtering left and right channels of the pre-rendered binaural signals according to the HRTF or assumed DOA angle used by the pre-renderer; and linearly combining the inverse HRTF filtered signals.
  • EEE8. The method of EEE6 or 7, wherein linearly combining the inverse HRTF filtered signals includes mitigating an error in a diffuse component by selecting a weight of the linear combining.
  • EEE9. The method of any of EEEs 2-8, comprising applying HRTF personalization, wherein the pre-renderer applies a generic HRTF, and the lightweight renderer compensates for the generic HRTF and subsequently applies a personalized HRTF.
  • EEE10 The method of EEE1, wherein the downmixed signals comprise a prototype signal.
  • EEE11 The method of EEE10, wherein the prototype signal comprises a single channel.
  • EEE12. The method of EEE10 or 11, wherein computing the prototype signal comprises: decoding the main bitstream representation by a main decoder of the first device to generate decoded audio; and applying gains to the decoded audio and adding the decoded audio with applied gains to the decoded audio.
  • EEE13. The method of any of EEEs 2-8, comprising applying HRTF personalization, wherein the pre-renderer applies a generic HRTF, and the lightweight renderer compensates for the generic HRTF and subsequently applies a personalized HRTF.
  • EEE14. The method of any of EEE10-12, comprising: computing the prototype signal and DOA angles based on an assumed head pose P' and diffuseness parameters; sending the assumed head pose P', the DOA angles for the assumed head pose P', the diffuseness parameters and the prototype signal to a post-renderer device; adjusting, at the post-renderer device, the DOA angles based on an actual head pose P; computing the directional components using the prototype signal, a set of HRTFs and the adjusted DOA angles; computing diffused components using the diffuseness parameters and a decorrelated version of the prototype signal; and adding the directional and diffused components to generate a post-rendered binaural output.
  • EEE15 A system including one or more processors configured to perform operations of any one of EEE1-14.
  • EEE16 A computer program product configured to cause one or more processors to perform operations of any one of EEE1-14.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

Systems, methods and computer program products for direction-of-arrival (DOA) based split rendering and head-related transfer function (HRTF) personalization are described. Head-tracked audio rendering is split between two devices. A first device receives a main bitstream representation of encoded audio. A second device tracks head pose information. The first device renders the main bitstream using a main decoder and encodes the decoded bitstream into pre-rendered binaural signals and post-render metadata. The second device decodes the pre-rendered binaural signals and post-render metadata from the intermediate bitstream and provides the decoded pre-rendered binaural signals and post-render metadata to a lightweight renderer. The lightweight renderer renders the pre-rendered binaural signals into binaural audio based on the post-render metadata, head pose information, a generic HRTF and a personalized HRTF.
PCT/US2023/073857 2022-09-12 2023-09-11 Head-tracked split rendering and head-related transfer function personalization WO2024059505A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263405538P 2022-09-12 2022-09-12
US63/405,538 2022-09-12
US202263422331P 2022-11-03 2022-11-03
US63/422,331 2022-11-03

Publications (1)

Publication Number Publication Date
WO2024059505A1 (fr)

Family

ID=88236833

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/073857 WO2024059505A1 (fr) Head-tracked split rendering and head-related transfer function personalization

Country Status (1)

Country Link
WO (1) WO2024059505A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140222439A1 (en) * 2006-02-07 2014-08-07 Lg Electronics Inc. Apparatus and Method for Encoding/Decoding Signal

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140222439A1 (en) * 2006-02-07 2014-08-07 Lg Electronics Inc. Apparatus and Method for Encoding/Decoding Signal

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Spatial Audio Processing: MPEG Surround and Other Applications", 1 January 2007, JOHN WILEY & SONS, article J BREEBAART ET AL: "Binaural Cues for Multiple Sound Sources", pages: 139 - 154, XP055325102, DOI: 10.1002/9780470723494.ch8 *
BREEBAART J ET AL: "Multi-channel goes mobile: MPEG surround binaural rendering", AES INTERNATIONAL CONFERENCE. AUDIO FOR MOBILE AND HANDHELD DEVICES, XX, XX, 2 September 2006 (2006-09-02), pages 1 - 13, XP007902577 *
MINNAAR PAULI ET AL: "The importance of head movements for binaural room synthesis - a pilot experiment", 1 January 2000 (2000-01-01), XP093108229, Retrieved from the Internet <URL:https://www.researchgate.net/profile/Henrik-Moller-3/publication/247904840_THE_IMPORTANCE_OF_HEAD_MOVEMENTS_FOR_BINAURAL_ROOM_SYNTHESIS/links/56c7460708aee3cee539402f/THE-IMPORTANCE-OF-HEAD-MOVEMENTS-FOR-BINAURAL-ROOM-SYNTHESIS.pdf> [retrieved on 20231204] *

Similar Documents

Publication Publication Date Title
CN107533843B (zh) System and method for capturing, encoding, distributing and decoding immersive audio
AU2014295309B2 (en) Apparatus, method, and computer program for mapping first and second input channels to at least one output channel
JP4944902B2 (ja) Decoding control of binaural audio signals
RU2643630C1 (ru) Method and device for rendering an acoustic signal, and machine-readable recording medium
WO2019086757A1 (fr) Determination of targeted spatial audio parameters and associated spatial audio playback
US20150350801A1 (en) Binaural audio processing
KR101540911B1 (ko) 헤드폰 재생 방법, 헤드폰 재생 시스템, 컴퓨터 프로그램 제품
CN112567763B (zh) 用于音频信号处理的装置和方法
US11765536B2 (en) Representing spatial audio by means of an audio signal and associated metadata
US20230254659A1 (en) Recording and rendering audio signals
AU2019394097B2 (en) Apparatus, method and computer program for encoding, decoding, scene processing and other procedures related to DirAC based spatial audio coding using diffuse compensation
WO2021130405A1 (fr) Combining of spatial audio parameters
CN115190414A (zh) 用于音频处理的设备和方法
EP3216234B1 (fr) Audio signal processing apparatus and method for modifying a stereo image of a stereo signal
EP3808106A1 (fr) Spatial audio capture, transmission and reproduction
EP4128824A1 (fr) Spatial audio representation and rendering
WO2024059505A1 (fr) Head-tracked split rendering and head-related transfer function personalization
US20220366919A1 (en) Audio encoding/decoding with transform parameters
CN115462097A (zh) 用于使能渲染空间音频信号的装置、方法和计算机程序
WO2024123936A2 (fr) Binaural rendering
US20230274747A1 (en) Stereo-based immersive coding
WO2022123108A1 (fr) Apparatus, methods and computer programs for providing spatial audio content
WO2024115045A1 (fr) Binaural audio rendering of spatial audio
GB2607934A (en) Apparatus, methods and computer programs for obtaining spatial metadata
WO2023187208A1 (fr) Methods and systems for immersive 3DoF/6DoF audio rendering

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23782717

Country of ref document: EP

Kind code of ref document: A1